Downloading & Describing Data

Assumptions & Ground Rules

The purpose of this assignment is to learn how to download and provide basic descriptions of data and specific variables analyzed in a published study. Up to this point, we have used built-in R data (e.g., R Assignment 2), provided you with the data you were working with (e.g., subsets of the NYS data in R Assignment 3), or you have downloaded the data manually from ICPSR and placed it within your reproducible file structure (e.g., R Assignment 4). The approach we used in the last R Assignment would work fine when you are using your own data and/or data that you have permission to share. However, this is not generally the case with data on ICPSR. According to ICPSR’s bylaws, you are not technically allowed to share ICPSR data in your own online repository (e.g. OSF or GitHub). For this assignment, we will show you how to download SPSS data directly within R and begin looking at the data via basic descriptive statistics.

Specifically, for this assignment, we will:

Create part of file structure within R
Download NYS data directly from ICPSR into file structure
Subset data and combine multiple data sets into one
Identify, rename, and provide basic descriptions of specific variables/items used in the study.
Provide basic introduction to the ifelse function and logic in R.

I assume that you are now familiar with installing and loading packages in R. Thus, when you see a package being used, I expect that you know it needs to be installed and that it needs to be loaded within your own R session in order to use it.

At this point, I also assume you are familiar with RStudio and with creating R Markdown (RMD) files. If not, please review R Assignments 1 & 2.

Note: For this assignment, have RMarkdown knit to an html file.

As with previous assignments, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste from the instructions except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).

Packages:

library(tidyverse)
library(here)
library(haven)
library(icpsrdata)
library(gt)
library(sjmisc)

Part 1 (Assignment 5.1): Create file structure within R

In the last R Assignment (“Reproducible File structure”) we created a basic reproducible file structure and shared it using your computer’s operating system. Here, we are going to create most of the folders we need using R code. This is useful for our specific purpose (downloading data directly from ICPSR) because we do not have to rely on someone placing their data in the correct folder; we can simply share with them the code to create the folder in their own root directory.

1.1: Create root folder for R Assignment #5

Before we start creating folders and downloading data within R, we need to create a root folder, save our RMD file inside it, and close and open the assignment directly from that root folder (so the “here” package will start in the correct folder on our computer):

Go to your “LastName_CRM495_work” folder on OneDrive
Create a new root folder titled “LastName_CRM495_RAssignment5” inside it.
Save your RMD file as LastName_CRM495_RAssign5_YEAR_MO_DY
Close RStudio, go to root folder and open RMD file.

1.2: Create “NYS_data” folder within R

Create a subfolder called “NYS_data” within your “LastName_CRM495_RAssignment5” root folder. Technically, you could do this yourself by navigating to the folder on your computer and creating a new “NYS_data” folder manually. But we can also do it in R with the following code. Again, doing it in the R environment helps ensure that anyone else (including our future selves) can easily reproduce our work with minimal effort.

# check if "NYS_data" folder exists (TRUE if it does) & create if it does not exist. 
ifelse(dir.exists(here("NYS_data")), TRUE, dir.create(here("NYS_data")))

## [1] TRUE

Let us try to explain the above code to you. The “ifelse” command is a logical function within base R. To get more details about it, type ?ifelse into the console window. Here is the description of that function:

ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.

It takes the following form: ifelse(test, yes, no). This means you give R a logical test (a question that can be answered yes or no), and it then returns a value or runs another function depending on the result of that test (i.e., the answer to that question).

In the above code, we are asking if the “NYS_data” folder exists within our root folder (i.e., your “LastName_CRM495_RAssignment5” folder) with the dir.exists function. If the answer is yes, it simply returns the logical value I told it to - in this case TRUE. If the answer is no, you instruct R to create that “NYS_data” folder with the dir.create function. Again, type ?dir.exists or ?dir.create for more information.
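To make the ifelse(test, yes, no) logic concrete, here is a minimal sketch you can run without touching your assignment folders. The object names (ages, age_group, demo_dir, first_run, second_run) are ours and purely illustrative, and the folder test uses a throwaway temporary folder rather than “NYS_data.”

```r
# A test applied to several values at once: ifelse() answers element by element
ages <- c(11, 13, 15, 17)
age_group <- ifelse(ages >= 15, "15 or older", "under 15")
age_group

# The folder pattern from above, tried in a temporary folder
demo_dir <- file.path(tempdir(), "ifelse_demo")
unlink(demo_dir, recursive = TRUE)  # start from a clean slate

# First run: the folder does not exist, so dir.create() runs
first_run <- ifelse(dir.exists(demo_dir), "already there", dir.create(demo_dir))

# Second run: the folder now exists, so the "yes" value is returned instead
second_run <- ifelse(dir.exists(demo_dir), "already there", dir.create(demo_dir))

first_run
second_run
```

Running the same line twice is the point of the pattern: the second run finds the folder and skips dir.create().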

If you want to have some fun, you can actually have R return a text string instead of the logical value. For example:

ifelse(dir.exists(here("NYS_data")), "You already created that folder, dummy!", dir.create(here("NYS_data")))

## [1] "You already created that folder, dummy!"

Generally, it is probably not a great idea to have R call the user (yourself in this case) a “dummy” with code you plan to eventually share publicly. Yet, it is also OK to have some fun when doing science. I (Jake) think that having a computer program call me a “dummy” is fun - perhaps you do not.

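Note: one programming rationale for returning a logical value rather than a string is consistency: dir.create() itself returns a logical value (TRUE if the folder was created), so with TRUE both branches of the ifelse return the same type.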
Note: tidyverse syntax has a stricter if_else function. According to the documentation, what makes it more strict is that “It checks that true and false are the same type.” I’ll be honest, I’m not sure exactly when this strictness is useful (tidyverse says it can allow for more predictable use and is somewhat faster). For most of what we will be using it for, either the base ifelse function or the tidyverse if_else function will likely work just fine. I am going to use the more general ifelse function from base R.
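If you are curious what that strictness looks like in practice, here is a small sketch. The vector x and the object names (loose, strict, ok) are made up for illustration, and the error message text is our own wording, not dplyr’s.

```r
library(dplyr)

x <- c(1, 5, NA)

# Base ifelse(): mixing a string and a number is allowed; the number
# is silently converted to text.
loose <- ifelse(x > 2, "high", 0)
loose

# dplyr's if_else(): the same mix is rejected with an error, which we
# catch here so the script keeps running.
strict <- tryCatch(
  if_else(x > 2, "high", 0),
  error = function(e) "if_else() stopped: true and false have different types"
)
strict

# With matching types, if_else() works and keeps missing values missing.
ok <- if_else(x > 2, "high", "low")
ok
```

The silent conversion in the first call is the kind of surprise the stricter function is meant to catch early.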

You should now have a file structure for your “LastName_CRM495_RAssignment5” folder that looks like this:

Part 2 (Assignment 5.2): Download ICPSR data directly from within R

Now that you have the basic file structure for this assignment and specifically the “NYS_data” folder, it’s time to download the first five waves of NYS data. Recall these are the waves of data that Warr (1993) used in his study on the delinquent peer influences and the age-distribution of crime.

As we mentioned previously, it is technically against ICPSR’s bylaws to share data housed on ICPSR “without the written agreement of ICPSR.” This means that if you included ICPSR data and/or documentation directly within a reproducible file structure that you shared with someone else, you would technically be violating the bylaws. Fortunately, there is a package called “icpsrdata” that allows you to download data housed on ICPSR directly from within R. This means you simply need to provide the code for downloading and wrangling the data and you are 1) not violating ICPSR’s bylaws and 2) adhering to open and reproducible research practices. Let’s show you how to do that now.

2.1: Identify ICPSR numbers for data you want to download

We need to know the ICPSR numbers for the first five waves of the NYS. Go to the NYS Series page on ICPSR and make note of the ICPSR numbers for the first five waves of data.

Here is a table to remind us of the ICPSR numbers.

Warr (1993) NYS Items
Wave	ICPSR
Wave 1	8375
Wave 2	8424
Wave 3	8506
Wave 4	8917
Wave 5	9112

2.2: Use the “icpsrdata” package to download NYS data

In order to use the “icpsrdata” package, you need to install it and load it into your current R environment.

Recall that it is good practice to install packages in the “Console” and only load them within your RMD file (see R Assignments #1 and #2 for details on how to install and load packages).

library(icpsrdata)

To actually download the data, we will use the icpsr_download function that is part of the “icpsrdata” package. The core arguments of the function are file_id (i.e., the ICPSR numbers) and download_dir (the folder on your computer where you want the data files to be downloaded). Let’s just show you the code and then explain it.

Note: To prevent R from repeatedly trying to download data we have already downloaded, and to prevent issues when you knit your Rmd file, in the code below we wrapped the icpsr_download function in an ifelse function. It is the same logic as the ifelse function above when we created the “NYS_data” folder. It first checks whether the “ICPSR_09112” folder (wave 5, the last wave we are telling R to download) exists in the “NYS_data” folder. If the folder exists, it returns the logical value TRUE; if it does not exist, it runs the icpsr_download command to download the first five waves from ICPSR.

# download waves 1-5 only if the wave 5 folder ("ICPSR_09112") is not already there
ifelse(dir.exists(here("NYS_data", "ICPSR_09112")), TRUE,
       icpsr_download(file_id = c(8375, 8424, 8506, 8917, 9112),
                      download_dir = here("NYS_data")))

Note: when you first run this chunk during an R session you will be asked to enter your ICPSR account information into the R console. R should remember it for the rest of the session after you enter it. So you will likely need to do this once per R session before trying to knit your Rmarkdown document.
- Again, to be clear: Check the R Console after trying to run the icpsr_download command to see and respond to the ICPSR username/password prompts. If you have not yet created a free account on ICPSR (you should have already for an earlier project assignment), then you will need to do this on the ICPSR website first. Then, after each prompt in the console, you would put your ICPSR username instead of “your_icpsr_username” (I added that as a placeholder) and your ICPSR password instead of “your_icpsr_password.”
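The check above looks only at the wave 5 folder. As a hedged alternative, here is a minimal sketch that checks every wave’s folder and reports which ICPSR numbers are still missing. The helper name missing_icpsr_ids is hypothetical (it is not part of the “icpsrdata” package), and it assumes the downloaded folders follow the “ICPSR_0XXXX” naming seen above.

```r
# Hypothetical helper (not part of icpsrdata): which study folders are absent?
missing_icpsr_ids <- function(file_ids, data_dir) {
  # Build folder names such as "ICPSR_08375" from the ICPSR numbers
  folders <- file.path(data_dir, sprintf("ICPSR_%05d", file_ids))
  # Keep only the numbers whose folder does not exist yet
  file_ids[!dir.exists(folders)]
}

# Usage in the assignment (left commented so the sketch runs without an ICPSR login):
# to_get <- missing_icpsr_ids(c(8375, 8424, 8506, 8917, 9112), here("NYS_data"))
# if (length(to_get) > 0) {
#   icpsr_download(file_id = to_get, download_dir = here("NYS_data"))
# }
```

This version uses a plain if statement rather than ifelse, because here we only want to decide whether to run a command, not return one value per element.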

2.3: Read the data into R.

You already did this with the wave 1 data that you downloaded directly from ICPSR in “R Assignment 4.” Now you just need to do it for each of the first five waves of NYS data you just downloaded by telling R where the specific data file is within your file structure. You want to make note of the specific files that were downloaded to the “NYS_data” folder.

Recall from earlier that the actual data are within a folder called “DS0001” within each of the ICPSR folders. You simply want to use the “here” package to tell the read_spss function from the “haven” package where to find the SPSS data for each wave of the data. Make sure you pay close attention to which study numbers are associated with each specific wave of data!

nys_w1 <- read_spss(here("NYS_data", "ICPSR_08375", "DS0001", "08375-0001-Data.sav"))
nys_w2 <- read_spss(here("NYS_data", "ICPSR_08424", "DS0001", "08424-0001-Data.sav"))
nys_w3 <- read_spss(here("NYS_data", "ICPSR_08506", "DS0001", "08506-0001-Data.sav"))
nys_w4 <- read_spss(here("NYS_data", "ICPSR_08917", "DS0001", "08917-0001-Data.sav"))
nys_w5 <- read_spss(here("NYS_data", "ICPSR_09112", "DS0001", "09112-0001-Data.sav"))

If you have done everything correctly up to this point, you should have five data sets in your RStudio Environment representing each of the waves 1 through 5 data that we downloaded with the icpsr_download function above (named “nys_w1,” “nys_w2,” “nys_w3,” “nys_w4,” and “nys_w5”).
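If one of the read_spss lines fails, the usual culprit is a mistyped study number in the path. Here is a minimal sketch for building and checking all five paths at once. It assumes the folder and file naming shown above, uses a path relative to your root folder (you could swap in here("NYS_data") for "NYS_data"), and the object names are ours.

```r
icpsr_ids <- c(8375, 8424, 8506, 8917, 9112)

# Pad each ICPSR number to five digits, e.g. 8375 becomes "08375"
study <- sprintf("%05d", icpsr_ids)

# Assemble the expected path to each wave's SPSS file
sav_paths <- file.path("NYS_data",
                       paste0("ICPSR_", study),
                       "DS0001",
                       paste0(study, "-0001-Data.sav"))
names(sav_paths) <- paste0("nys_w", 1:5)
sav_paths

# TRUE for each file that is actually where we expect it to be
file.exists(sav_paths)
```

Any FALSE in the last result tells you which wave to look at before calling read_spss.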

Part 3 (Assignment 5.3): Trim, Rename, and Pool Data used by Warr (1993)

In “R Assignment 3” we reproduced Figure 1 from Warr’s (1993) article “Age, Peers, and Delinquency.” Feel free to go back to assignment 3 for a refresher on the article, including a description of the specific variables that Warr (1993) constructed and analyzed. For Figure 1, Warr (1993) plotted the age distribution for the percentage of respondents who reported having no friends who engaged in eight delinquent behaviors in the previous year. Over the next couple of R Assignments, we are going to focus on reproducing and extending Figures 2, 3, and 4 from Warr (1993):

These figures plot the age distribution of 1) Percentage of respondents reporting they average three or more nights per week socializing (i.e. “going on dates, to parties, or other social activities”); 2) Percentage of respondents reporting it was “Very important” or “Pretty Important” to socialize; and 3) Percentage of respondents who reported they “would lie to protect their friends if they got into trouble with the police.”

In order to reproduce these figures we need to:

Identify specific items from each wave of the NYS from which these variables were constructed.
Rename items so they have informative names.
Trim each wave of data so that they only include variables needed to reproduce the figures.
Produce basic descriptive statistics and frequency tables for our key variables.
Recode the specific items to align with Warr’s (1993) coding decisions.
Wrangle the data so that it is in a format we can plot.
Reproduce the plots.

In what follows, we will walk through the first four of these steps and save the last three for R Assignment 6.

3.1: Identify survey items for key variables from Warr (1993)

In order to reproduce Figures 2, 3, and 4 from Warr (1993) we need to start by identifying the specific survey items that Warr used to construct those figures. Let’s start with his description of the items in the article. On pages 24-25 he describes the survey questions that are supposed to measure these “other elements of peer relations” by listing the questions respondents were asked:

“How many evenings in the average week, including weekends, have you gone on dates, to parties, or to other social activities?” (“Less than once a week” to 7).
“How important has it been to you to have dates and go to parties and other social activities?” (1 = not important at all, 2 = not too important, 3 = somewhat important, 4 = pretty important, 5 = very important)
“If your friends got into trouble with the police, would you be willing to lie to protect them?” (1 = no; 2 = maybe; 3 = yes).

Warr (1993) provided us with the specific question wording, which is helpful and not always the case in the published literature. Of course, if you open any of the data sets you downloaded above, you won’t see the specific wording of each survey question. Here is what the first six rows of the wave 1 data look like:

head(nys_w1) %>%
  gt()

CASEID	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16	V17	V18	V19	V20	V21	V22	V23	V24	V25	V26	V27	V28	V29	V30	V31	V32	V33	V34	V35	V36	V37	V38	V39	V40	V41	V42	V43	V44	V45	V46	V47	V48	V49	V50	V51	V52	V53	V54	V55	V56	V57	V58	V59	V60	V61	V62	V63	V64	V65	V66	V67	V68	V69	V70	V71	V72	V73	V74	V75	V76	V77	V78	V79	V80	V81	V82	V83	V84	V85	V86	V87	V88	V89	V90	V91	V92	V93	V94	V95	V96	V97	V98	V99	V100	V101	V102	V103	V104	V105	V106	V107	V108	V109	V110	V111	V112	V113	V114	V115	V116	V117	V118	V119	V120	V121	V122	V123	V124	V125	V126	V127	V128	V129	V130	V131	V132	V133	V134	V135	V136	V137	V138	V139	V140	V141	V142	V143	V144	V145	V146	V147	V148	V149	V150	V151	V152	V153	V154	V155	V156	V157	V158	V159	V160	V161	V162	V163	V164	V165	V166	V167	V168	V169	V170	V171	V172	V173	V174	V175	V176	V177	V178	V179	V180	V181	V182	V183	V184	V185	V186	V187	V188	V189	V190	V191	V192	V193	V194	V195	V196	V197	V198	V199	V200	V201	V202	V203	V204	V205	V206	V207	V208	V209	V210	V211	V212	V213	V214	V215	V216	V217	V218	V219	V220	V221	V222	V223	V224	V225	V226	V227	V228	V229	V230	V231	V232	V233	V234	V235	V236	V237	V238	V239	V240	V241	V242	V243	V244	V245	V246	V247	V248	V249	V250	V251	V252	V253	V254	V255	V256	V257	V258	V259	V260	V261	V262	V263	V264	V265	V266	V267	V268	V269	V270	V271	V272	V273	V274	V275	V276	V277	V278	V279	V280	V281	V282	V283	V284	V285	V286	V287	V288	V289	V290	V291	V292	V293	V294	V295	V296	V297	V298	V299	V300	V301	V302	V303	V304	V305	V306	V307	V308	V309	V310	V311	V312	V313	V314	V315	V316	V317	V318	V319	V320	V321	V322	V323	V324	V325	V326	V327	V328	V329	V330	V331	V332	V333	V334	V335	V336	V337	V338	V339	V340	V341	V342	V343	V344	V345	V346	V347	V348	V349	V350	V351	V352	V353	V354	V355	V356	V357	V358	V359	V360	V361	V362	V363	V364	V365	V366	V367	V368	V369	V370	V371	V372	V373	V374	V375	V376	V377	V378	V379	V380	V381	V382	V383	V385	V386	V387	V388	V389	V391	V392	V393	V395	V396	V397	V399	V400	V401	V403	V405	V406	V407	V408	V409	V410	V411	V412	V413	V414	V415	V416	V417	V418	V419	V421	V423	V424	V425	V426	V427	V429	V431	V432	V433	V434	V435	V436	V437	V439	V441	V442	V443	V445	V446	V447	V448	V449	V450	V451	V453	V454	V455	V456	V457	V458	V459	V460	V461	V462	V463	V464	V465	V466	V467	V468	V470	V471	V472	V473	V475	V476	V477	V478	V479	V480	V481	V482	V483	V484	V485	V486	V487	V488	V489	V490	V491	V492	V493	V494	V495	V496	V497	V498	V499	V500	V501	V502	V503	V504	V505	V506	V507	V508	V509	V510	V511	V512	V513	V514	V515	V516	V517	V518	V519	V520	V521	V522
1	1	2	2	4	4	4	5	6	1	1	5	59	1	4	4	1	2	0	2	1	2	1	1	1	1	1	1	1	1	2	1	1	NA	1	NA	1	1	1	1	1	1	1	4	2	1	1	1	2	1	1	2	2	4	1	2	5	4	2	1	4	4442	2	3	5	1	5	3	3	1	3	3	3	3	4	4	1	4	4	2	3	4	4	4	2	2	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	4	4	3	2	3	3	3	4	3	4	3	3	3	3	3	4	4	3	3	3	1	1	1	4	2	5	4	2	4	4	2	4	5	5	1	1	2	2	2	2	2	2	2	NA	NA	NA	NA	NA	NA	NA	1	NA	NA	NA	NA	1	1	3	NA	NA	1	NA	NA	NA	1	1	1	1	13	8	3	2	NA	1	2	3	3	5	3	3	2	3	2	1	5	2	NA	4	0	4	3	1	2	NA	NA	NA	5	1	3	NA	NA	NA	NA	2	NA	1	1	3	2	3	33	3	4	2	2	1	3	3	5	5	3	3	5	3	5	3	5	5	5	3	5	5	5	3	5	5	5	5	5	5	5	3	5	5	5	5	5	3	3	2	1	1	1	1	1	3	5	4	1	1	3	1	1	5	1	1	4	1	3	3	1	3	3	1	5	3	3	4	3	2	4	1	1	3	4	5	5	2	1	1	4	4	3	1	2	3	2	1	2	1	4	1	4	2	1	2	4	5	1	2	1	2	5	1	1	5	5	3	5	1	5	5	4	1	5	2	5	5	2	5	5	3	4	3	3	5	3	3	3	3	4	4	4	4	3	3	3	3	4	4	3	2	2	2	1	4	3	4	3	2	2	3	4	4	3	3	1	2	3	3	2	1	1	1	1	3	1	0	1	0	1	0	1	1	0	1	0	1	1	0	1	1	1	2	1	0	1	1	1	0	1	0	1	0	1	0	1	0	1	2	2	0	1	1	1	1	2	0	1	1	1	0	1	0	1	0	1	1	1	3	2	1	0	1	0	1	0	1	1	0	1	0	1	0	1	5	3	2	2	4	3	0	1	2	1	0	1	1	1	1	3	2	1	NA	1	NA	1	NA	NA	1	NA	NA	1	NA	NA	1	NA	NA	1	1	1	NA	NA	NA	1	1	NA	NA	NA	NA	1	2	1	2	NA	NA	NA	NA	NA	NA	1	1	1	3	12	1
2	1	2	2	4	4	4	5	6	1	1	7	73	1	3	3	1	2	NA	2	1	2	1	1	1	1	1	1	1	1	1	1	1	NA	1	NA	2	1	1	1	1	1	1	4	3	3	2	1	1	1	1	1	5	5	1	1	4	5	1	1	4	4469	2	5	5	5	3	3	3	5	3	3	1	5	4	4	2	4	4	2	4	4	1	4	2	2	2	4	4	4	1	5	1	5	5	1	2	4	4	3	4	4	4	4	4	3	4	4	4	4	4	4	4	4	4	4	4	3	2	4	4	2	4	4	2	4	5	1	4	4	4	1	1	2	2	1	1	1	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2	1	15	10	4	2	NA	5	5	4	3	4	4	3	2	3	3	3	5	1	3	NA	NA	NA	NA	1	3	NA	NA	NA	5	1	2	NA	NA	NA	NA	2	NA	1	1	5	5	4	55	5	4	3	1	1	5	3	3	3	3	3	5	3	5	3	3	1	5	5	5	3	3	3	5	3	5	3	5	3	5	5	5	5	5	3	3	3	3	3	4	2	2	2	2	2	5	4	2	2	4	2	3	4	3	1	4	2	4	2	2	3	2	2	4	2	2	2	2	2	4	2	1	2	4	4	1	2	2	2	4	4	1	1	2	4	1	2	2	2	4	1	4	1	2	2	4	4	2	1	2	2	4	1	2	4	5	2	5	2	5	5	5	2	5	1	5	5	2	4	4	2	5	2	5	4	4	2	4	2	5	5	3	3	3	3	3	3	3	3	3	3	3	3	2	3	4	4	4	4	4	4	4	4	3	1	1	1	2	2	1	1	1	1	1	3	3	0	1	0	1	0	1	1	0	1	0	1	1	0	1	1	0	1	1	0	1	1	1	0	1	0	1	0	1	0	1	0	1	0	1	0	1	1	1	0	1	0	1	1	1	0	1	0	1	0	1	1	1	0	1	1	0	1	0	1	0	1	1	0	1	0	1	0	1	0	1	0	1	0	1	0	1	0	0	0	1	0	0	0	2	1	1	NA	1	NA	1	NA	NA	1	NA	NA	1	NA	NA	1	NA	NA	1	NA	NA	NA	NA	NA	1	NA	NA	1	NA	NA	1	2	1	3	NA	NA	NA	NA	NA	NA	1	1	1	3	11	2
3	1	2	2	4	4	4	5	6	1	1	7	73	1	3	3	1	2	NA	2	1	2	1	1	1	1	1	1	1	1	1	1	1	NA	1	NA	2	1	1	1	1	1	1	4	3	3	2	1	1	1	1	1	5	5	1	1	4	5	1	1	4	4469	3	3	5	1	5	3	3	3	1	3	3	3	5	2	1	5	2	2	5	1	1	5	1	1	2	5	5	4	1	3	1	4	4	1	2	4	4	3	4	4	4	4	4	3	4	4	4	4	4	4	4	4	4	4	4	2	1	3	4	3	5	5	1	5	5	2	5	5	5	2	2	2	2	2	2	3	3	2	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	2	1	3	NA	NA	NA	NA	NA	NA	1	1	2	1	11	6	5	2	NA	4	0	4	4	5	1	5	2	4	3	2	5	1	3	NA	NA	NA	NA	1	5	NA	NA	NA	4	1	5	NA	NA	NA	NA	1	3	NA	1	3	5	5	54	4	3	4	2	1	5	5	3	3	1	NA	5	5	5	3	5	5	5	5	5	5	3	3	5	5	5	5	5	3	5	5	5	5	5	5	3	5	3	3	5	4	5	2	2	2	5	4	2	2	4	2	2	4	2	2	4	2	2	2	2	2	2	2	4	2	2	4	2	2	5	4	2	4	4	4	3	2	4	2	4	4	2	2	2	4	2	1	2	3	2	1	4	2	2	4	5	4	4	2	3	2	4	1	2	2	4	5	4	4	4	4	4	2	5	4	4	4	4	2	4	2	5	4	4	4	4	2	4	3	4	4	4	4	4	4	4	3	4	4	2	2	3	2	3	4	3	4	4	4	4	4	4	4	1	1	1	1	1	1	1	1	1	1	1	3	1	0	1	0	1	0	1	1	0	1	0	1	1	0	1	1	0	1	1	0	1	1	1	0	1	0	1	0	1	0	1	0	1	12	4	0	1	1	1	0	1	0	1	1	1	0	1	0	1	0	1	1	1	0	1	1	0	1	0	1	0	1	1	0	1	0	1	0	1	0	1	0	1	0	1	0	1	2	0	0	0	0	0	0	1	NA	1	NA	1	NA	1	NA	NA	1	NA	NA	1	NA	NA	1	NA	NA	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	1	1	4	NA	NA	NA	NA	NA	NA	1	1	1	3	11	2
4	3	2	2	4	4	5	6	6	1	1	6	66	5	NA	4	1	1	0	2	1	2	1	2	1	1	1	1	1	1	1	1	1	NA	1	NA	1	1	1	1	1	1	1	3	1	1	1	1	1	1	1	1	2	2	2	2	4	4	2	4	2	4415	2	3	5	3	3	5	3	5	1	5	5	3	4	2	2	4	3	2	4	2	2	4	2	2	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	1	1	1	3	2	5	5	4	5	5	2	5	5	5	1	1	2	1	2	2	2	1	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	2	1	2	1	1	NA	NA	NA	NA	1	1	1	2	16	11	3	2	NA	5	4	5	3	5	2	3	2	3	2	2	3	2	NA	5	2	3	3	1	4	NA	NA	NA	3	1	3	NA	NA	NA	NA	2	NA	1	1	3	2	3	44	4	2	2	1	1	3	3	3	3	3	3	3	3	3	3	5	3	5	5	3	5	3	1	3	1	3	3	5	5	3	3	5	3	5	3	3	3	2	3	2	2	2	1	1	1	5	5	4	2	4	2	4	4	1	1	4	3	4	2	1	2	2	2	5	4	4	4	4	3	4	3	2	3	3	4	2	3	3	2	3	4	2	3	2	4	3	4	2	2	3	4	4	4	3	2	3	4	3	4	2	4	3	3	2	4	4	2	5	3	4	5	4	2	4	2	5	5	2	3	3	2	3	2	2	3	3	3	2	2	2	3	3	2	4	3	1	4	2	2	3	4	3	4	2	2	3	1	2	2	1	3	4	3	4	3	4	3	3	4	2	2	2	3	1	3	2	0	1	2	2	2	2	1	1	2	1	2	1	0	1	1	2	2	1	0	1	1	1	1	2	0	1	2	2	3	2	4	3	1	2	0	1	1	1	3	2	2	2	1	1	1	2	0	1	0	1	1	1	0	1	1	0	1	6	3	1	2	1	1	2	2	2	3	2	0	1	1	2	0	1	0	1	0	2	2	0	1	0	0	5	3	7	3	1	NA	1	NA	NA	1	NA	NA	1	NA	NA	1	NA	NA	1	NA	NA	NA	NA	NA	NA	1	NA	NA	NA	NA	1	1	1	3	NA	NA	NA	NA	NA	NA	1	1	1	3	12	1
5	1	2	3	4	3	3	6	5	4	1	6	66	NA	NA	1	2	3	0	2	1	1	1	1	1	2	1	1	1	1	1	1	1	NA	2	NA	2	2	1	1	1	1	3	1	2	2	2	1	1	1	1	1	4	3	2	2	4	4	4	2	4	4409	2	5	5	1	3	5	1	1	1	1	3	1	4	2	2	4	2	2	4	2	2	4	2	2	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	3	3	2	2	3	3	3	3	1	3	3	3	3	3	3	3	4	3	4	4	4	2	3	4	2	4	4	2	5	4	2	4	4	4	2	NA	1	NA	2	NA	2	NA	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	1	1	3	NA	NA	1	NA	NA	NA	1	2	2	1	14	9	4	1	4	NA	NA	NA	NA	NA	0	4	2	1	5	2	4	1	2	NA	NA	NA	NA	2	4	1	0	1	3	1	2	NA	NA	NA	NA	2	NA	1	1	0	3	4	43	3	2	2	2	1	3	3	1	NA	3	3	5	3	5	3	5	5	5	3	5	5	1	NA	5	3	3	3	5	5	5	3	5	3	5	3	1	1	3	2	2	2	2	2	2	2	4	4	2	2	4	2	NA	4	2	2	4	3	4	2	2	3	2	3	4	2	4	2	2	2	5	2	2	2	4	4	2	2	2	3	4	5	1	2	2	4	2	3	2	3	4	3	4	2	2	2	4	4	2	2	2	2	4	2	2	4	4	2	5	2	4	4	4	2	3	2	4	5	2	3	3	2	4	3	4	4	4	2	2	2	4	4	4	4	4	3	3	4	4	4	4	1	1	1	2	3	4	3	3	3	3	4	4	4	NA	NA	NA	NA	NA	NA	NA	NA	NA	9	2	NA	NA	0	1	0	1	0	1	1	0	1	2	2	1	0	1	1	0	1	1	5	3	1	1	0	1	0	1	0	1	2	2	2	2	0	1	2	2	1	1	0	1	10	4	1	1	0	1	3	2	0	1	1	1	2	2	1	3	2	0	1	0	1	1	0	1	0	1	1	2	1	2	3	2	0	1	5	3	0	0	0	0	0	0	0	4	1	1	NA	1	NA	1	NA	NA	1	NA	NA	1	NA	NA	1	NA	NA	1	NA	NA	NA	NA	NA	1	NA	NA	NA	NA	NA	1	1	1	4	NA	NA	1	NA	NA	NA	1	2	0	4	10	2
6	NA	NA	NA	NA	NA	NA	NA	NA	9	NA	NA	66	NA	NA	NA	NA	9	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	4409	3	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	3	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	1	11	6	3	2	NA	5	3	3	3	4	2	3	2	4	2	2	4	1	5	NA	NA	NA	NA	1	4	NA	NA	NA	4	2	NA	5	0	5	5	1	3	NA	1	3	2	3	44	4	2	4	2	1	3	5	5	3	1	NA	5	3	5	3	5	1	5	3	1	NA	3	1	5	3	5	3	3	3	5	3	5	5	5	3	1	3	2	2	2	NA	2	2	1	NA	4	4	2	4	4	1	2	5	NA	2	4	4	5	2	2	4	2	2	4	1	2	4	2	2	4	2	1	2	4	4	2	2	2	2	4	4	2	1	2	4	2	2	1	2	4	2	4	2	2	2	4	5	2	2	1	2	2	1	2	4	1	2	5	1	5	5	4	2	5	2	4	1	2	4	2	2	4	2	4	5	4	2	4	2	5	4	3	3	2	3	3	2	2	3	2	3	4	2	2	4	3	4	2	3	4	3	4	4	4	2	2	3	1	1	2	1	1	2	1	3	3	2	2	0	1	2	2	1	0	1	0	1	1	2	2	1	0	1	1	1	2	1	1	0	1	2	2	0	1	2	2	0	1	0	1	2	2	1	1	0	1	0	1	1	1	0	1	2	2	2	2	1	1	2	2	1	0	1	0	1	0	1	1	0	1	0	1	2	2	1	2	0	1	0	1	0	1	0	1	0	0	1	0	1	2	3	1	NA	1	NA	1	NA	NA	1	NA	NA	1	NA	NA	1	NA	NA	1	NA	NA	NA	NA	NA	1	1	NA	NA	1	NA	1	2	1	4	NA	NA	NA	NA	NA	NA	3	2	0	4	10	2

Instead of descriptive variable names we get a bunch of columns with variable numbers in the format “V###” (note: if you open up the actual dataset in R you will also see short descriptive variable labels). In order to find the specific survey items Warr (1993) is referring to, we need to go to the codebook and identify the variable numbers.

Note: Because variable numbers are not the same across each wave, this requires going to each of the codebooks and looking them up.
Note: In the above code we use the gt() function to simply have the data print out in a nice html formatted table. We will show you how to harness the power of the “gt” package to create publishable-ready tables in a later R Assignment.
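Data read in with read_spss usually carry the short variable labels mentioned above as a “label” attribute on each column. Here is a minimal sketch of how to pull those labels out, using a toy data frame as a stand-in for the real NYS file. The toy values and label text are made up, and the object names are ours.

```r
library(haven)

# Toy stand-in for an imported SPSS file: the label text here is made up
toy <- data.frame(CASEID = 1:3)
toy$V169 <- labelled(c(13, 15, 11), label = "Age (toy label)")
toy$V179 <- labelled(c(3, 4, 1), label = "Evenings socializing (toy label)")

# Look up the label for a single column
attr(toy$V169, "label")

# Collect the label for every column; columns without a label get NA
var_labels <- vapply(toy, function(col) {
  lab <- attr(col, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else lab
}, character(1))
var_labels
```

With the real data you would replace toy with nys_w1; the labels are a quick first check, but the codebook remains the authoritative source for question wording.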

Fortunately, the NYS codebooks make this relatively easy as they include bookmarks to different sections of the survey. Here is how it looks in the Wave 1 codebook:

Notice on the left that wave 1 included both a “parent interview” and a “youth interview.” If you go to wave 1 and look through the bookmarks, you’ll notice multiple sections of peer measures, including an “exposure to delinquent peers” section and a “commitment to delinquent peers” section. However, the particular questions for Warr’s (1993) Figures 2 and 3 are in the “Social Integration” section, while the dependent variable for Figure 4 is in the “commitment to delinquent peers” section.

Note: We actually found them relatively quickly by simply searching for the verbiage in each of the questions above (e.g. “How many evenings”, “have dates and go to parties”, “friends got into trouble with the police”). Figures 2-4 also include age on the x-axis which is in the “Respondent Characteristics” section of the youth interview.

For wave 1, the key variables we need are:

V169 - Age
V179 - Evenings in average week spent in social activities
V180 - Importance of engaging in social activities
V377 - Lie to protect friends from trouble with police

Since Warr (1993) used the first five waves, to figure out each specific variable used to construct Figures 2-4, you would simply need to go to each codebook and find each of these items. Fortunately for you, we already did this and made this handy table. You’re welcome!

Warr (1993) Figures 2-4 NYS Items
Item	Wave 1	Wave 2	Wave 3	Wave 4	Wave 5
ICPSR number¹	8375	8424	8506	8917	9112
Age	V169	V7	V10	V6	V6
Evenings spent socializing	V179	V17	V81	V24	V37
Importance of socializing	V180	V18	V82	V25	V38
Lie to police	V377	V223	V321	V301	V328
¹ Note: indicates the ICPSR number for the data set and not a survey item

3.2: Trim data

When working on a specific analysis or set of analyses from a large data set, I generally think it’s good practice to create a more manageable data set with just the items you need. This ensures that your raw data is kept intact and that you do not unintentionally make changes to it. In this particular case, it will also allow me to look at the data and see if the changes I am making to it are working (this is not always possible with analyses that utilize many variables).

Let’s start selecting the specific variables we need to reproduce Warr’s (1993) Figures 2-4. We are going to use the select() function in the “dplyr” package (one of the core packages within the tidyverse suite) to select those specific items in each of the five waves of data. That means, for the wave 1 data, we need to select “V169,” “V179,” “V180,” and “V377.” (“V7,” “V17,” “V18,” and “V223” for wave 2, and so on for waves 3 through 5).

nys_w1_trim <- nys_w1 %>%
  dplyr::select(V169, V179, V180, V377)

In the code above, we are telling R to select “V169,” “V179,” “V180,” and “V377” from the “nys_w1” data object and create a new data set object called “nys_w1_trim”. Our new object has the same number of observations, but only 4 variables. You can check this by looking in your RStudio “Environment.”

NOTE: The code above also introduces you to a new way to call a command directly from a specific package. Recall that R is an open-source program within which anyone could conceivably create their own useful packages. While this is one of the program’s greatest strengths, it also poses some challenges. One such challenge is the lack of strict curation across its countless and ever-growing list of packages to ensure that programmers do not incorporate conflicting package commands. From our experience, select() is one of those popular commands that frequently poses conflicts when you have several packages loaded at once. So, in this code chunk, we ensured that the select() command was invoked using the “dplyr” package by appending the package name followed by two colons directly in front of the command (i.e., dplyr::select()).
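To see why the package::function form is a useful habit, here is a small sketch that fakes a conflict using R’s built-in mtcars data. The impostor select function is ours and exists only to simulate another package masking the name.

```r
library(dplyr)

# Simulate a conflict: a different function that happens to be named select()
select <- function(...) stop("this is not dplyr's select()")

# The bare name now finds the impostor first and fails...
bare_call <- tryCatch(select(mtcars, mpg, cyl),
                      error = function(e) conditionMessage(e))
bare_call

# ...but the package::function form still reaches dplyr's version
trimmed <- dplyr::select(mtcars, mpg, cyl)
names(trimmed)

# Remove the impostor so later code behaves normally
rm(select)
```

In real work the conflict comes from loading a second package rather than from defining your own function, but the fix is the same.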

Let’s take a look at the first six observations of the data with the head() function and see what the trimmed data look like.

head(nys_w1_trim) %>%
  gt()

V169	V179	V180	V377
13	3	3	1
15	4	3	3
11	1	5	1
16	2	3	2
14	0	4	NA
11	2	3	3

There are a few problems with the wave 1 trimmed data in its current form:

First, the variable names are not informative. They are simply the names in the original data file. Like with naming your computer files, it is usually a good practice to give informative names to your variables (and other R objects; see part 1 of Navarro’s series on “dplyr”). Using meaningful and systematic naming conventions will also be useful when we combine the data sets, since we can assign the same name across each wave before merging or combining them.
Second, the trimmed data we created has no information about which wave these data come from (except in the object name) nor does it include the unique identifier for individuals. If we were just working with the wave 1 data, this would not be a huge problem; also, since Warr (1993) simply pooled all five waves of data, the individual identifiers are less important. Nonetheless, it is generally good practice to preserve such important information.

3.3. Rename existing variables and create a new variable indicating NYS wave.

In order to rename the four variables in which we are currently interested (i.e., age, evenings spent socializing, importance of socializing, and lie to police) we will use the rename() function. To create a new variable indicating the wave of the data, we will use the mutate() function. Both of these functions are part of the “dplyr” package. The rename() function does exactly what it says - it renames existing items (or columns) in a data set, whereas the mutate() function allows us to create new variables (or columns) and to manipulate existing items in the data.

Let’s do this with the wave 1 data again.

Note: Below, we are writing over the nys_w1_trim data that we created earlier. Usually, we avoid writing over objects, as doing so can cause confusion and errors as we try to keep track of what exactly is in the object. Nonetheless, as you will see, we are going to repeat the select() function that we used before, though, this time, that command will be followed by a pipe and additional commands using the rename and mutate functions. One of the nice things about the “dplyr” package and the “pipe” (%>%) you learned about in a previous R Assignment is that you can sequentially invoke various commands all at once within the same code chunk.

nys_w1_trim <- nys_w1 %>%
  dplyr::select(CASEID, V169, V179, V180, V377) %>%
  rename(age = V169, 
         evsoc = V179,
         socimp = V180,
         liepolice = V377) %>%
  mutate(wave = 1)

A couple things to note about the above code:

First, like before, we are telling R to select “V169,” “V179,” “V180,” and “V377” from “nys_w1.” However, one key difference this time is that we immediately followed this with a pipe to a sequential rename command, which tells R to subsequently rename specific items after selecting them. Within the rename command, we specifically tell R what to rename each variable by invoking a new name as equal to an old name (i.e. new name = old name).
Second, the rename command is then followed by another pipe to a sequential mutate command that tells R to create or modify a variable after completing the rename command. In this case, our mutate command tells R to create a new variable named “wave” that equals “1” to indicate the wave of the data we are working with (i.e. variable_name = value). Since this command is not conditional (e.g., there is no ifelse operator), all 1,725 rows or observations (i.e., “cases” or “respondents”) from NYS wave 1 will include a variable column named “wave” with a value that equals “1” in each row. And, of course, the first line of code tells R to assign all of these operations into a new object called “nys_w1_trim.”
- Note: We also included the “CASEID” item in the select command above; ICPSR and the NYS made it easy on us by consistently naming the identifier “CASEID” in each wave of data.
- Note: The mutate() function can do a lot more than just assign a value to a new variable. We’ll discuss this more in the next R Assignment.

Here is what the data look like:

head(nys_w1_trim) %>%
  gt()

CASEID	age	evsoc	socimp	liepolice	wave
1	13	3	3	1	1
2	15	4	3	3	1
3	11	1	5	1	1
4	16	2	3	2	1
5	14	0	4	NA	1
6	11	2	3	3	1
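
To see the select, rename, and mutate steps in isolation, here is a minimal sketch that runs the same chain on a small made-up data frame. The object toy_w1, the extra column V999, and all values are invented for illustration only; they are not NYS data.

```r
library(dplyr)

# Hypothetical toy data standing in for nys_w1 (values are made up)
toy_w1 <- data.frame(
  CASEID = 1:3,
  V169   = c(13, 15, 11),
  V179   = c(3, 4, 1),
  V999   = c(0, 0, 0)   # an extra column that select() will drop
)

toy_w1_trim <- toy_w1 %>%
  dplyr::select(CASEID, V169, V179) %>%  # keep only the listed columns
  rename(age = V169,                     # new name = old name
         evsoc = V179) %>%
  mutate(wave = 1)                       # same value in every row

names(toy_w1_trim)
```

The column V999 disappears after select(), the two remaining items carry their new names, and every row has wave equal to 1.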

Now, let’s create the trimmed data for each of the first five waves of the NYS. Again, we will write over the nys_w1_trim data we created above, as we would usually do all of these commands in the same code chunk. Here is the table of the items as a reminder:

Warr (1993) Figures 2-4 NYS Items
Item	Variable name	Wave 1	Wave 2	Wave 3	Wave 4	Wave 5
ICPSR number¹		8375	8424	8506	8917	9112
Age	age	V169	V7	V10	V6	V6
Evenings spent socializing	evsoc	V179	V17	V81	V24	V37
Importance of socializing	socimp	V180	V18	V82	V25	V38
Lie to police	liepolice	V377	V223	V321	V301	V328
¹ Note: indicates the ICPSR number for the data set and not a survey item

Here is the code to create five trimmed data objects corresponding to each of the first five waves of NYS data.

#Wave 1:
nys_w1_trim <- nys_w1 %>%
  dplyr::select(CASEID, V169, V179, V180, V377) %>%
  rename(age = V169, 
         evsoc = V179,
         socimp = V180,
         liepolice = V377) %>%
  mutate(wave = 1)

head(nys_w1_trim) %>%
  gt()

#Wave 2:
nys_w2_trim <- nys_w2 %>%
  dplyr::select(CASEID, V7, V17, V18, V223) %>%
  rename(age = V7, 
         evsoc = V17,
         socimp = V18,
         liepolice = V223) %>%
  mutate(wave = 2)

head(nys_w2_trim) %>%
  gt()

#Wave 3:
nys_w3_trim <- nys_w3 %>%
  dplyr::select(CASEID, V10, V81, V82, V321) %>%
  rename(age = V10, 
         evsoc = V81,
         socimp = V82,
         liepolice = V321) %>%
  mutate(wave = 3)

head(nys_w3_trim) %>%
  gt()

#Wave 4:
nys_w4_trim <- nys_w4 %>%
  dplyr::select(CASEID, V6, V24, V25, V301) %>%
  rename(age = V6, 
         evsoc = V24,
         socimp = V25,
         liepolice = V301) %>%
  mutate(wave = 4)

head(nys_w4_trim) %>%
  gt()

#Wave 5:
nys_w5_trim <- nys_w5 %>%
  dplyr::select(CASEID, V6, V37, V38, V328) %>%
  rename(age = V6, 
         evsoc = V37,
         socimp = V38,
         liepolice = V328) %>%
  mutate(wave = 5)

head(nys_w5_trim) %>%
  gt()

In the code above, we simply kept the same basic code that we used for wave 1 earlier, but replaced the item information with the appropriate items that we identified previously for the other four waves of NYS data. Now, you should have five separate “trimmed” data objects, each with the same six variables in them (“CASEID,” “age,” “evsoc,” “socimp,” “liepolice,” and “wave”).

Note: In the above code we repeat the same code structure but change certain details. For now, it may be efficient for you to copy and paste from your own code in these situations. However, it is important to recognize that copying and pasting repeatedly is generally considered an ill-advised and error-prone coding/programming practice (see R for Data Science for an introduction to functions). Eventually, you want to be able to use existing functions and write your own simple functions for accomplishing these repetitive tasks.
- We will not introduce writing functions in this class and, to be honest, we are still working on consistently using functions in our own work. However, one of the key advantages of using R is that it is a functional programming language. So, unlocking the ability to use functions really does supercharge your R abilities (R for Data Science and Advanced R provide book introductions to functional programming in R and Danielle Navarro has a video series about functional programming in R).
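
For the curious, here is a rough sketch of what a small helper function for this repetitive task might look like. It is not required for the assignment. The function name trim_wave, its arguments, and the toy data are all hypothetical, and the sketch assumes a recent version of “dplyr” in which rename() accepts a named character vector wrapped in all_of().

```r
library(dplyr)

# Hypothetical helper (not part of the assignment): trims one wave of data.
# `lookup` is a named character vector of the form c(new_name = "old_name").
trim_wave <- function(data, lookup, wave_num) {
  data %>%
    dplyr::select(CASEID, all_of(unname(lookup))) %>%  # keep ID plus listed items
    rename(all_of(lookup)) %>%                          # new name = old name
    mutate(wave = wave_num)                             # tag rows with the wave
}

# Made-up stand-ins for two waves of data (values are invented)
toy_w1 <- data.frame(CASEID = 1:2, V169 = c(13, 15), V179 = c(3, 4))
toy_w2 <- data.frame(CASEID = 1:2, V7 = c(14, 16), V17 = c(2, 5))

toy_w1_trim <- trim_wave(toy_w1, c(age = "V169", evsoc = "V179"), wave_num = 1)
toy_w2_trim <- trim_wave(toy_w2, c(age = "V7", evsoc = "V17"), wave_num = 2)

names(toy_w2_trim)
```

The point of the sketch is that the only things that change from wave to wave (the data object, the item names, and the wave number) become arguments, so the select, rename, and mutate logic is written only once.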

3.4. Pool data from all five waves

Now that we have all five data sets with the same variables and variable names, pooling them together is relatively easy. But first, we want you to try to build some intuition regarding what Warr (1993) did here when he says that “…all five years of the NYS data were pooled, producing a composite sample of 8,625 persons aged 11-21 (pg. 20).”

Actually, this statement is somewhat misleading as, to the untrained reader, it seems to imply that there are 8,625 different persons in the pooled data. However, this is not the case. Remember, the NYS is a longitudinal panel study. This means the researchers set out to study the same people over an extended period of time, in this case once per year over the first five years of the study. So, the 8,625 persons are really 1,725 unique individuals, each of whom was surveyed (or, more accurately, whom the researchers attempted to survey) five times in the first five years of the study (1,725 persons x 5 waves = 8,625 person-waves).

Because we created five trimmed data sets with the same variables in section 3.3 above, “pooling” the data in this case really just means stacking the waves of data on top of each other. In other words, I want to put Wave 1 data on top, Wave 2 data next, then Wave 3 data, then Wave 4 data, until the Wave 5 observations are at the bottom of the data set. Fortunately, the “dplyr” package makes this relatively easy with the bind_rows() function. Essentially, when you use the bind_rows() function, you are simply telling R which data sets to stack on top of each other by the order in which you list them.

Note: There is a companion function, bind_cols(), that tells R to place (columns of) data sets next to each other.

nys_fwtrim <- bind_rows(nys_w1_trim, nys_w2_trim, nys_w3_trim, nys_w4_trim, nys_w5_trim)

head(nys_fwtrim) %>%
  gt()

In the code above, we told R to stack waves 1 through 5 on top of each other in chronological order and assign it to the object “nys_fwtrim” (we used fw to indicate we were creating a data set that included all “five waves” of data).
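
To build intuition for the difference between persons and person-waves, the following minimal sketch stacks two made-up waves for the same three respondents. The toy objects and their values are invented for illustration; only the 1,725 and 5 in the last line come from the text above.

```r
library(dplyr)

# Made-up trimmed data for two waves of the same three respondents
toy_w1_trim <- data.frame(CASEID = 1:3, age = c(11, 12, 13), wave = 1)
toy_w2_trim <- data.frame(CASEID = 1:3, age = c(12, 13, 14), wave = 2)

# Stack wave 1 on top of wave 2
toy_pooled <- bind_rows(toy_w1_trim, toy_w2_trim)

nrow(toy_pooled)               # person-waves: 3 persons x 2 waves
n_distinct(toy_pooled$CASEID)  # unique persons

# The same logic applied to the counts reported in the text
1725 * 5
```

Running the same two checks, nrow() and n_distinct(), on your own pooled object is a quick way to confirm that the stacking worked as intended.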

Part 4 (Assignment 5.4): Describe items in pooled data

You should now have a pooled data set called “nys_fwtrim” that has 8,625 observations and six variables that have informative names. An important first step in analyzing data is looking at basic descriptives for your key variables. This is often the first step in identifying the basic distribution of variables, identifying outliers, and identifying potential problems in the data (e.g., missing data).

4.1: View frequency tables for key variables

If you look again at Figures 2-4, you will notice that Warr (1993) reports, by age, the “Percentage of Respondents…” who:

“Averaged three or more nights per week” engaging in social activities (e.g., dates and parties).
responded that it “was ‘very important’ or ‘pretty important’” to engage in social activities.
reported that “they would lie to protect their friends if they got in trouble with the police.”

Each of these variables was dichotomized from a specific survey question with more than two response categories. One reason for this may be that the data are fairly skewed, or concentrated in certain answer categories. For example, it is likely pretty rare for teenage respondents to report averaging seven nights a week socializing via things like dates and parties. Of course, we’ll be able to confirm this below.

We can check the frequency distributions and modal categories for each of the three peer items by creating a frequency table for each of them. There are lots of ways to create frequency tables and to calculate and produce tables of descriptive statistics (see here for review), including the base R command table(); the more tidyverse-oriented command tabyl(), which is part of the “janitor” package; and the frq() and flat_table() commands in the “sjmisc” package.

Note: If you use the base R command, you’ll need to use the $ operator to tell R which variable to use from the data set, e.g., table(nys_fwtrim$evsoc). For now, we’ll use the “sjmisc” package.
Note: We also include the frequency table for age.
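
As a point of comparison, here is a minimal sketch of the base R approach mentioned in the note above, using the $ operator with table(). The data frame toy and its values are made up for illustration.

```r
# Made-up responses; NA marks a missing answer
toy <- data.frame(evsoc = c(0, 1, 1, 2, 2, 2, NA))

# Counts per response category; useNA = "ifany" also shows missing values
evsoc_counts <- table(toy$evsoc, useNA = "ifany")
evsoc_counts

# Percentages among valid (non-missing) responses
evsoc_valid_pct <- round(100 * prop.table(table(toy$evsoc)), 2)
evsoc_valid_pct
```

Note that table() drops missing values unless you ask for them with the useNA argument, which is one reason a command that reports missing values by default can be convenient for a first look at the data.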

library(sjmisc)

nys_fwtrim %>%
  frq(age, evsoc, socimp, liepolice)

## 
## age <numeric>
## # total N=8625  valid N=8625  mean=15.87  sd=2.40
## 
## Value |    N | Raw % | Valid % | Cum. %
## ---------------------------------------
##    11 |  252 |  2.92 |    2.92 |   2.92
##    12 |  509 |  5.90 |    5.90 |   8.82
##    13 |  778 |  9.02 |    9.02 |  17.84
##    14 | 1036 | 12.01 |   12.01 |  29.86
##    15 | 1289 | 14.94 |   14.94 |  44.80
##    16 | 1276 | 14.79 |   14.79 |  59.59
##    17 | 1216 | 14.10 |   14.10 |  73.69
##    18 |  947 | 10.98 |   10.98 |  84.67
##    19 |  689 |  7.99 |    7.99 |  92.66
##    20 |  436 |  5.06 |    5.06 |  97.72
##    21 |  197 |  2.28 |    2.28 | 100.00
##  <NA> |    0 |  0.00 |    <NA> |   <NA>
## 
## 
## Y5-37:EVENINGS/WK SPENT DATING/SOCIAL (evsoc) <numeric>
## # total N=8625  valid N=8029  mean=2.08  sd=1.57
## 
## Value |               Label |    N | Raw % | Valid % | Cum. %
## -------------------------------------------------------------
##     0 | Less than once a wk | 1288 | 14.93 |   16.04 |  16.04
##     1 |                   1 | 1899 | 22.02 |   23.65 |  39.69
##     2 |                   2 | 2068 | 23.98 |   25.76 |  65.45
##     3 |                   3 | 1463 | 16.96 |   18.22 |  83.67
##     4 |                   4 |  677 |  7.85 |    8.43 |  92.10
##     5 |                   5 |  386 |  4.48 |    4.81 |  96.91
##     6 |                   6 |  111 |  1.29 |    1.38 |  98.29
##     7 |                   7 |  137 |  1.59 |    1.71 | 100.00
##  <NA> |                <NA> |  596 |  6.91 |    <NA> |   <NA>
## 
## 
## Y1-15: HOW IMPORTANT SOCIAL (socimp) <numeric>
## # total N=8625  valid N=8028  mean=3.31  sd=1.13
## 
## Value |              Label |    N | Raw % | Valid % | Cum. %
## ------------------------------------------------------------
##     1 |      Not important |  474 |  5.50 |    5.90 |   5.90
##     2 |  Not too important | 1460 | 16.93 |   18.19 |  24.09
##     3 | Somewhat important | 2543 | 29.48 |   31.68 |  55.77
##     4 |   Pretty important | 2176 | 25.23 |   27.11 |  82.87
##     5 |     Very important | 1375 | 15.94 |   17.13 | 100.00
##  <NA> |               <NA> |  597 |  6.92 |    <NA> |   <NA>
## 
## 
## Y1-191: WILLING TO LIE (liepolice) <numeric>
## # total N=8625  valid N=7638  mean=1.59  sd=0.77
## 
## Value |      Label |    N | Raw % | Valid % | Cum. %
## ----------------------------------------------------
##     1 |         No | 4469 | 51.81 |   58.51 |  58.51
##     2 | Don't know | 1830 | 21.22 |   23.96 |  82.47
##     3 |        Yes | 1338 | 15.51 |   17.52 |  99.99
##     4 |          4 |    1 |  0.01 |    0.01 | 100.00
##  <NA> |       <NA> |  987 | 11.44 |    <NA> |   <NA>

As you can see above, the frq() command in the “sjmisc” package prints out some pretty bare bones tables of the frequency distributions for our key variables. One thing we particularly like about this command is it allows us to include all of the variables for which we want frequency tables in the same command. Thus, it is a good command for getting a quick look at the key variables in your data.

We also like that the frq() command defaults to printing out the frequency of missing observations for each variable (the “<NA>” entry). When you are first looking at a given set of data, you want to know for which variables missing observations are particularly pronounced. In the case of the NYS, these missing observations likely result from item-level non-response (e.g., respondents not answering specific questions) as well as attrition (e.g., respondents taking the first survey but not responding to subsequent surveys).

Note: One thing we do not like about the frq() command in the “sjmisc” package is that taking the outputted tables above and converting them to a tidy data frame (i.e., tibble) for using with tidyverse packages like “ggplot2” is not as easy as with other commands.
Assignment: Take some time to look over the distributions for our key variables above. What do you notice about each variable’s distribution (e.g., where do most respondents fall in the distribution, what is the most common answer across the five waves, which items have more missing data, etc.)? From the above tables, do you notice any potential problems with the items (e.g., values that don’t make sense based on description of variable in the codebook)? Create a sub-header in your RMD file and write out your responses to these questions.
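
Regarding the note above about tidy output: one possible alternative, sketched here under the assumption that you want a plain data frame you can pass to other tidyverse functions, is the count() function in “dplyr.” This is not the approach used in this assignment, and the data frame toy and its values are made up.

```r
library(dplyr)

# Made-up responses; NA marks a missing answer
toy <- data.frame(liepolice = c(1, 1, 1, 2, 3, NA))

liepolice_freq <- toy %>%
  count(liepolice) %>%                          # one row per value, including NA
  mutate(raw_pct = round(100 * n / sum(n), 2))  # percentage of all rows

liepolice_freq
```

The result has one row per response value (with missing values in their own row) and ordinary columns for the counts and percentages.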

4.2: View Cross-tabulations for Age by Peer Relations Variables

Warr (1993) was fundamentally interested in the age distribution of these “Other elements of peer relations.” Essentially, Figures 2 through 4 simply present cross-tabulations of age by dichotomized versions of the variables for which we just looked at frequency tables. Although we’ll save dichotomizing the variables for the next assignment, we can produce basic cross-tabulations for age and the non-dichotomized versions of the variables relatively easily using the flat_table() function in the “sjmisc” package.

nys_fwtrim %>%
  flat_table(evsoc, age, margin = "col") #note: margin = "col" tells it to give me column percentages

##                     age    11    12    13    14    15    16    17    18    19    20    21
## evsoc                                                                                    
## Less than once a wk     40.80 41.25 31.36 23.09 15.36 10.52  7.09  4.88  6.47  6.32 10.30
## 1                       28.80 31.79 30.70 29.72 25.10 22.45 17.80 15.81 19.24 19.74 25.45
## 2                       18.00 14.89 21.74 22.19 26.67 27.96 30.82 27.11 26.87 30.00 32.12
## 3                        8.40  7.24  8.83 14.36 18.66 19.20 22.76 25.68 23.05 24.47 21.21
## 4                        2.40  3.22  3.69  5.82  8.01 10.60 10.72 11.89 12.77 10.26  4.85
## 5                        0.80  1.41  2.50  3.31  3.22  6.43  6.38  8.44  6.30  5.79  3.64
## 6                        0.00  0.00  0.79  0.60  0.99  1.00  2.13  3.09  2.82  1.84  0.61
## 7                        0.80  0.20  0.40  0.90  1.98  1.84  2.30  3.09  2.49  1.58  1.82

nys_fwtrim %>%
  flat_table(socimp, age, margin = "col")

##                    age    11    12    13    14    15    16    17    18    19    20    21
## socimp                                                                                  
## Not important          12.40 13.28  9.59  7.43  4.71  4.59  2.92  3.69  3.98  5.00  6.67
## Not too important      24.00 26.76 24.05 19.98 17.84 17.46 15.32 12.75 15.75 13.68 20.00
## Somewhat important     25.60 25.55 30.22 31.73 32.62 31.41 28.52 34.56 37.48 36.32 35.76
## Pretty important       25.20 22.33 22.08 25.20 27.09 29.16 31.53 29.32 26.04 28.16 24.24
## Very important         12.80 12.07 14.06 15.66 17.75 17.38 21.70 19.67 16.75 16.84 13.33

nys_fwtrim %>%
  flat_table(liepolice, age, margin = "col")

##            age    11    12    13    14    15    16    17    18    19    20    21
## liepolice                                                                       
## No             74.77 69.80 63.95 61.06 56.93 55.58 53.12 56.85 53.18 58.31 63.80
## Don't know     15.89 20.81 24.32 22.87 24.49 23.79 26.28 22.74 26.76 25.33 22.70
## Yes             9.35  9.40 11.59 16.06 18.58 20.63 20.60 20.42 20.07 16.36 13.50
## 4               0.00  0.00  0.14  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

Above we produced three basic tables with our three “other elements of peer relations” items representing the rows and age representing the columns. We used the margin = "col" argument in the flat_table() function to get the percentage of each response category for each age group across all five waves of data. Again, this is fundamentally what Figures 2-4 in Warr’s (1993) paper are doing, only here we’re showing this for the raw/untransformed items rather than the dichotomized versions that Warr (1993) created.
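
If it helps to see what “column percentages” means mechanically, here is a minimal base R sketch that should be analogous to the column-percentage tables above. The data frame toy and its values are made up; in prop.table(), margin = 2 refers to columns.

```r
# Made-up data: six respondents at two ages
toy <- data.frame(
  age   = c(11, 11, 11, 11, 12, 12),
  evsoc = c(0, 0, 1, 3, 1, 3)
)

# Rows = evsoc, columns = age; margin = 2 means percentages within columns
col_pct <- round(100 * prop.table(table(toy$evsoc, toy$age), margin = 2), 2)
col_pct

colSums(col_pct)  # each age column sums to 100
```

Because the percentages are computed within each age group, each column (not each row) adds up to 100.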

You can get a sense of whether we are getting the same results as Warr from these tables by eyeballing the total percentage in each age category that falls into the particular category group that Warr (1993) is plotting. For example, in Figure 2, Warr (1993) plots the “Percentage of respondents reporting that they averaged three or more nights per week going ‘on dates, to parties, or to other social activities,’ by age.” This means he is plotting the total percentage of the bottom five rows of our first table above (rows representing 3 to 7 nights a week on average). So, for example, the total percentage of respondents aged 11 who report socializing 3 to 7 nights a week on average is 8.4 + 2.4 + 0.8 + 0 + 0.8 = 12.4. This appears to correspond to the value at age 11 in Warr’s (1993) Figure 2 plot.
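
The arithmetic for age 11 can also be checked in R. The sketch below simply copies the age-11 column of the evsoc cross-tabulation printed above into a named vector (the object names are arbitrary) and sums the rows for three or more nights per week.

```r
# Age-11 column percentages for evsoc, copied from the table above
age11_pct <- c(`Less than once a wk` = 40.80, `1` = 28.80, `2` = 18.00,
               `3` = 8.40, `4` = 2.40, `5` = 0.80, `6` = 0.00, `7` = 0.80)

# Percentage averaging three or more nights per week
three_plus <- sum(age11_pct[c("3", "4", "5", "6", "7")])
three_plus

# Sanity check: the full column should add up to 100
sum(age11_pct)
```

You could repeat the same sum for any other age column to compare against the corresponding point in Figure 2.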

Part 5 (Assignment 5.5): Draw the Owl

In order for you to demonstrate that you can apply the basic data wrangling and descriptive analysis skills that you learned above on your own, in the last part of the assignment, you will consider alternative operationalizations of one of the “other elements of peer relations” that Warr (1993) was examining in Figures 2-4. In doing so, you will provide a type of robustness check on one of Warr’s (1993) methodological decisions.

Specifically, Warr (1993) used the question about being willing to “lie to protect their friends if they got in trouble with the police” as an indicator of respondents’ “commitment or loyalty to their own particular set of friends (pg. 19).” However, in the section on “Commitment to Delinquent Peers” in the codebooks for the first five waves of NYS data, there are two other questions that are meant to measure “commitment” to peers who are engaging in delinquency:

“If you found that your group of friends was leading you into trouble, would you still run around with them?” (1 = No, 2 = Maybe, 3 = Yes)
“If you found that your group of friends was leading you into trouble, would you try to stop these activities?” ( 1 = No, 2 = Maybe, 3 = Yes)

These were in addition to the question Warr (1993) examined:

“If your friends got into trouble with the police, would you be willing to lie to protect them?” (1 = No, 2 = Maybe, 3 = Yes)

Here is a table, similar to what I provided above, that shows you where each item is located in each of the first five waves of NYS data:

Warr (1993) Figures 2-4 NYS Items
Item	Wave 1	Wave 2	Wave 3	Wave 4	Wave 5
ICPSR number¹	8375	8424	8506	8917	9112
Age	V169	V7	V10	V6	V6
Still run around with friends	V375	V221	V319	V299	V326
Try to stop activities	V376	V222	V320	V300	V327
Lie to police	V377	V223	V321	V301	V328
¹ Note: indicates the ICPSR number for the data set and not a survey item

5.1: Extend Warr’s Analysis of Commitment to Delinquent Peers

In order to complete the assignment, here is what you need to do:

Before looking at the data, write a brief statement or commentary about whether you think the other two “commitment to delinquent peers” items will have a similar age distribution to the “lie to police” item for which you already produced the descriptive table.
Trim, rename, and pool waves 1-5 data so that you have all three “commitment to delinquent peers” items in the same pooled data set.
- Note: You are welcome to modify the code you wrote in Part 3 to create a pooled data set.
Produce frequency tables for each of the “commitment to delinquent peers” items as well as cross-tabulations for these items by age.
- Note: see code in Part 4 for example.
Write a brief statement or commentary about the similarities and differences between each of the “commitment to delinquent peers” items in terms of their raw frequency distribution and their age distribution.
Write a “Conclusion” section where you write about what you learned in this assignment and any problems or issues you had in completing it.

Part 6 (Assignment 5.6)

Submit your assignment

“Knit” your final RMD file to html format and save it using an informative file name (e.g., “LastName_CRM495_RAssign5_YEAR_MO_DY”) within a file structure you create for this assignment (e.g., “LastName_CRM495_RAssign5”).
Submit your knitted html file on Canvas.
Place a copy of your root folder in your LastName_495_commit folder on OneDrive.
- Note: The root folder should contain your reproducible file structure for this assignment. This means it should include anything necessary to reproduce your knitted html document with “one click.”