Data Wrangling

Assumptions & Ground Rules

Packages:

Part 1 (Assignment 6.1): Create file structure within R

1.1: Create folder for R Script files
1.2: Source R Script file

Part 2 (Assignment 6.2): Recode the Pooled Data

2.1: Review Warr’s (1993) Coding Decisions for “Peer Relations” Variables
2.2: Dichotomize “Peer Relations” Variables
2.3: Checking That Our Code Worked as Intended
2.4: Some More Thoughts on the Mutate Function

Part 3 (Assignment 6.3): Recreate Warr’s (1993) Figures 2 - 4

3.1: Wrangling Summary Data
3.2: Plot the data

Part 4 (Assignment 6.4): Draw the Owl

4.1: Extend Warr’s Analysis of Commitment to Delinquent Peers

Part 5 (Assignment 6.5)

Submit your assignment

Assumptions & Ground Rules

The purpose of this assignment is to learn how to wrangle data in order to reproduce results from a published study (specifically Figures 2-4 in Warr, 1993). As such, this assignment builds directly upon “R Assignment 5: Downloading & Describing Data.” In that assignment we learned how to download the data directly from ICPSR, trim the data to include just the items in which we are interested, rename columns/variables, pool different data sets together, and then look at basic descriptive statistics for the variables in their raw form. However, we rarely analyze data in their raw form. Instead, most data analysis involves a large amount of wrangling or processing to get the data ready to analyze and describe. This is completely normal. When we are working with data, it is common for the bulk of that work to be taken up with these data management and data wrangling tasks (see this blog for a review).

Specifically, for this assignment, we will:

Source code from an R script file.
Wrangle and recode the pooled NYS data to align with Warr’s (1993) coding decisions.
Learn how to create dummy variables (e.g., 0/1 variables) by utilizing the ifelse() function and logic.
Reproduce Warr’s (1993) Figures 2 - 4 using the “ggplot2” package.

We assume that you are now familiar with installing and loading packages in R. Thus, when you see a package being used, I expect that you know it needs to be installed and that it needs to be loaded within your own R session in order to use it.

At this point, I also assume you are familiar with RStudio and with creating R Markdown (RMD) files. If not, please review R Assignments 1 & 2.

Note: For this assignment, have RMarkdown knit to an html file.

As with previous assignments, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste from the instructions except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).

Packages:

library(tidyverse)
library(here)
library(haven)
library(icpsrdata)
library(gt)
library(sjmisc)
library(janitor)
library(patchwork)

Part 1 (Assignment 6.1): Create file structure within R

Every time we start a project or an assignment, we want to think about and create a “reproducible file structure.” In this case, since we are primarily continuing the work from the last R Assignment (“R Assignment 5: Downloading & Describing Data”), you are welcome to copy your file structure from “R Assignment 5” and simply rename it to refer to “R Assignment 6” (e.g., “Day_CRM495_RAssignment6”). Of course, you could also recreate the file structure from scratch by following the directions in “R Assignment 5.”

1.1: Create folder for R Script files

The file structure for “R Assignment 5” included just your “NYS_data” folder and your RMD file. We will need one additional folder for this assignment. Specifically, we will create a folder to house the R script file I provided on Canvas with the assignment. You can create that folder within the file explorer on your computer or, like with “R Assignment 5,” you can create it within R using the following code (see “R Assignment 5” instructions/walkthrough for an explanation of the logic of the code chunk):

ifelse(dir.exists(here("R_scripts")), TRUE, dir.create(here("R_scripts")))

## [1] TRUE

Note: Be sure you are working from the root folder in which you want to create the folder to house the R script file (e.g., “Day_CRM495_RAssignment6”).
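
The same create-if-missing logic can also be written with an ordinary if statement, which some readers find easier to follow than ifelse(). What follows is only a minimal sketch: the helper name ensure_dir() is hypothetical, and a temporary directory stands in for the project root (in the assignment you would pass here("R_scripts") instead).

```r
# Hypothetical helper (not part of the assignment): create a folder only if it
# does not exist yet, and report whether it exists afterwards.
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  dir.exists(path)
}

# tempdir() stands in for the project root so the sketch runs anywhere.
demo_root <- file.path(tempdir(), "Day_CRM495_RAssignment6_demo")
scripts_dir <- file.path(demo_root, "R_scripts")

ensure_dir(scripts_dir)  # creates the folder on the first call
ensure_dir(scripts_dir)  # safe to run again; nothing is overwritten
```

Either version is safe to leave in an RMD file that gets knit repeatedly, because the folder is only created when it is missing.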

1.2: Source R Script file

Download the “Warr_1993_nysfwtrim_create.R” script file from Canvas and place it into the “R_scripts” folder you just created within your file structure. This will allow us to call it from within our RMD file. Specifically, we’ll use the source() function to simply run the code housed in the script file. Here is what the code to do that looks like:

source(here("R_scripts", "Warr_1993_nysfwtrim_create.R"))

Note: the code above simply tells R to look in the “R_scripts” folder, open up the “Warr_1993_nysfwtrim_create.R” script file, and run the code. This should create multiple objects in your “Environment” within RStudio similar to the various NYS data objects you created in “R Assignment 5.” Here is what your Environment should look like:
Note: The “Warr_1993_nysfwtrim_create.R” script contains the code for downloading the data directly from ICPSR. So if you did not copy over the “NYS_data” folder from the “R Assignment 5” file structure, it will try to download the data again, which means you will need to enter your ICPSR email address and password in the Console window the first time you run the above code chunk.

So far, we have primarily written all of our code within RMD files. This works well for our purposes because it allows you to interactively write, explain, and share your R work. When creating tutorials or assignments, this is especially useful. However, for larger and more complex research projects (e.g., that include multiple data sets from different sources, dozens to hundreds of variables, analyses that build from simple description to complex modeling, etc.), researchers often split these tasks up across multiple script files and only use RMD files to present the results (e.g., in a paper or presentation). This kind of workflow fits with Scott Long’s ideas of having a “dual workflow” where data management and analysis are kept separate (see this video for an overview of his ideas regarding computational workflow).

Note: Regardless of the specific workflow you ultimately choose, you want to be purposive in making sure it is reproducible by future you as well as others (see Long’s video presentation linked above for other criteria to consider).
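
To see what source() does in miniature, the sketch below writes a tiny script to a temporary file and then runs it. The file name and the objects demo_data and demo_n are made up for illustration; the real script is the one provided on Canvas.

```r
# Write a small, hypothetical "data management" script to a temporary file.
script_path <- file.path(tempdir(), "demo_create_data.R")
writeLines(
  c("demo_data <- data.frame(id = 1:3, age = c(11, 15, 21))",
    "demo_n <- nrow(demo_data)"),
  con = script_path
)

# source() runs every line of the script. By default the objects it creates
# land in your workspace (the global environment), as if typed in by hand.
source(script_path)

demo_n
```

Keeping the data management steps in a script like this means the RMD file only needs one source() line to rebuild every object it depends on.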

Part 2 (Assignment 6.2): Recode the Pooled Data

Recall from last assignment that we are focusing on reproducing and extending Figures 2, 3, and 4 from Warr (1993):

Also, recall that in order to produce these plots we needed to accomplish seven general steps, the first four of which we already accomplished in “R Assignment 5”:

Identify specific items from each wave of the NYS from which these variables were constructed.
Rename items so they have informative names.
Trim each wave of data so that they only include variables needed to reproduce the figures.
Produce basic descriptive statistics and frequency tables for our key variables.
Recode the specific items to align with Warr’s (1993) coding decisions.
Wrangle the data so that it is in a format we can plot.
Reproduce the plots.

In what follows, we will walk through the last three steps.

2.1: Review Warr’s (1993) Coding Decisions for “Peer Relations” Variables

The three “other aspects of peer relations” variables that Warr (1993) plotted in Figures 2-4 were originally measured with different answer categories: a self-reported number of evenings spent socializing, a Likert-style five-point “importance” scale, and a “no/maybe/yes” item. As a reminder, here are the three specific questions along with their answer categories:

“How many evenings in the average week, including weekends, have you gone on dates, to parties, or to other social activities?” (“Less than once a week” to 7).
“How important has it been to you to have dates and go to parties and other social activities?” (1 = not important at all, 2 = not too important, 3 = somewhat important, 4 = pretty important, 5 = very important)
“If your friends got into trouble with the police, would you be willing to lie to protect them?” (1 = no; 2 = maybe; 3 = yes).

In the last assignment, we renamed these items, along with the survey question asking for respondents’ age (age, evsoc, socimp, liepolice), and examined their descriptive statistics and distributions by age. Running the following code chunk will produce these descriptive statistics:

library(sjmisc)

nys_fwtrim %>%
  frq(age, evsoc, socimp, liepolice)

nys_fwtrim %>%
  descr(age, evsoc, socimp, liepolice)

nys_fwtrim %>%
  flat_table(evsoc, age, margin = "col") #note: margin = "col" tells it to give me column percentages

nys_fwtrim %>%
  flat_table(socimp, age, margin = "col")

nys_fwtrim %>%
  flat_table(liepolice, age, margin = "col")

2.2: Dichotomize “Peer Relations” Variables

Warr (1993) did not analyze these data in their raw form. He recoded these specific items into dichotomous variables. In other words, he chose to categorize each of the specific “aspects of peer relations” variables into two categories:

Spending three or more nights per week socializing vs. two or fewer nights per week socializing (Figure 2).
Reporting that it is “Very important” or “Pretty important” to socialize via dates, parties, etc. vs. those reporting it is “not important at all,” “not too important,” or “somewhat important”(Figure 3).
Reporting “yes” they would be willing to lie to protect friends who got in trouble with police vs. “no” or “maybe” (Figure 4).
- Note: There are a lot of reasons why Warr (1993) may have recoded these variables into dichotomies. One reason might be that binary variables (i.e., those with two response values) are somewhat easier to visualize across age categories. Another reason may be that these specific cutoffs reflect theoretically important distinctions. For example, spending three or more evenings per week socializing suggests the respondents are socializing on more than just the primary weekend nights (Friday and Saturday). Thus, this may reflect a particularly high level of socializing relative to some norm. Of course, this particular question does not ask specifically about which nights they were socializing (the NYS does ask specifically about how much time they generally spend with friends on the weekends - V179 in Wave 1). Ultimately, Warr (1993) didn’t fully justify his decision to dichotomize these items. Of course, given the data are available, we could always check to see if his results are sensitive to these coding decisions.
  - Note: An interesting discrepancy between Warr’s (1993) reporting of these items and the actual data concerns the question about being willing to lie to police. In Warr’s (1993) article, he describes the answer categories as being either “No” or “Yes” (see pg. 25). However, in looking at the data, we know one of the answer options was “maybe.” It is likely he collapsed these answers into the “No” category for the analyses. By removing the distinction between answering “no” and answering “maybe”, Warr (1993) is implicitly suggesting that this distinction is not theoretically meaningful.
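
As a rough illustration of what such a sensitivity check could look like, the sketch below compares the share of respondents at or above several cutoffs. The evsoc values are made up and the alternative cutoffs (2 and 4) are hypothetical; only the cutoff of 3 comes from Warr (1993).

```r
# Made-up evenings-socializing values for illustration only.
evsoc <- c(0, 1, 2, 2, 3, 3, 4, 5, 7, NA)

# Warr's (1993) cutoff of 3, plus two hypothetical alternatives.
cutoffs <- c(2, 3, 4)

# For each cutoff, the percentage of valid (non-missing) responses
# that fall at or above it.
perc_at_or_above <- sapply(cutoffs, function(k) {
  mean(evsoc >= k, na.rm = TRUE) * 100
})
names(perc_at_or_above) <- paste0("cutoff_", cutoffs)

round(perc_at_or_above, 1)
```

With the real data, the same comparison could be repeated within each age group to see whether the shape of the age curve depends on where the line is drawn.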

Let’s go ahead and create dichotomous variables for each of the three “other aspects of peer relations” variables that Warr (1993) plotted in Figures 2 - 4. We’ll explain the code below.

nys_fwtrim_dic <- nys_fwtrim %>%
  mutate(evsoc_dic = ifelse(evsoc >= 3, 1, 0),
         socimp_dic = ifelse((socimp == 4 | socimp == 5), 1, 0),
         liepolice_dic = ifelse(liepolice == 3, 1, 0),
         liepolice_dic = ifelse(liepolice == 4, NA, liepolice_dic))

In the above code, we created a new data set object called “nys_fwtrim_dic”. We could have also just overwritten the “nys_fwtrim” data set but, as we mentioned before, we generally try to avoid overwriting objects so we can keep clear exactly what is in the objects we create. We assume there are different perspectives on this practice, as our approach can lead to the creation of a lot of superfluous objects in our R environment (Of course, if you followed our recommended RStudio “Global options” settings, then you will be starting with a clean environment for every R session, which makes this issue a bit more tolerable).

Note: As data sets get larger and memory becomes an issue, creating multiple objects of essentially the same data becomes less feasible.

Within this new data set, we created three new dichotomous variables with the mutate() function from the “dplyr” package. Recall from last assignment that the mutate() function creates new variables (i.e. new columns) in the data set. Here we named those new variables by appending "_dic" to the non-dichotomized names to tell our future selves and others that these are dichotomized versions of the raw items.

In the previous assignment, we simply used the mutate() function to create a new variable with a specific value (e.g. mutate(wave = 1)). Above, we used the mutate() function along with the ifelse() function to create new variables. Essentially, each of the new variables is created through a logical test that 1) asks whether the raw variable meets a condition (e.g., is equal to certain values), 2) assigns a numerical value of one if it does, and 3) assigns a numerical value of zero if it does not. In dichotomizing the variables in this way, we created what are often called “dummy” variables. In this case, our dummy variables have the value of 1 if the respondent reported “Three or more evenings socializing” (evsoc_dic), answered that it is “Very important” or “Pretty important” to socialize in these ways (socimp_dic), or answered “Yes” that they would lie to the police to protect friends in trouble (liepolice_dic); they have the value of 0 if they answered otherwise.

Let’s walk through the logic of exactly what the code above is doing for each of these three new variables:

evsoc_dic: create a new variable called evsoc_dic based on the following logic - if evsoc is greater than or equal to 3 (evsoc >= 3), assign evsoc_dic the value of 1, otherwise assign evsoc_dic the value of 0.
socimp_dic: create a new variable called socimp_dic based on the following logic - if socimp is equal to 4 OR (the “|” symbol indicates “OR”) is equal to 5 ((socimp == 4 | socimp == 5)), assign socimp_dic the value of 1, otherwise assign socimp_dic the value of 0.
liepolice_dic: create a new variable called liepolice_dic based on the following logic - if liepolice is equal to 3 (liepolice == 3), assign liepolice_dic the value of 1, otherwise assign liepolice_dic the value of 0.
- Note: the second line regarding the liepolice_dic variable simply takes the nonsensical value of 4 and makes it missing. Specifically, it writes over the liepolice_dic variable we just created with the following logic - if liepolice is equal to 4 (liepolice == 4), assign liepolice_dic the value of NA, otherwise assign it the value from the liepolice_dic variable we just created (i.e. 1 or 0).
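
The two-step logic for liepolice_dic can be traced on a toy vector. This is a minimal sketch with made-up values, run outside of mutate() so each step is visible on its own:

```r
# Toy values standing in for the raw liepolice item
# (1 = no, 2 = maybe, 3 = yes, 4 = stray out-of-range code, NA = missing).
liepolice <- c(1, 2, 3, 4, NA)

# Step 1: 1 if "yes", 0 otherwise. A missing input stays missing.
liepolice_dic <- ifelse(liepolice == 3, 1, 0)
liepolice_dic   # 0 0 1 0 NA

# Step 2: overwrite the result with NA wherever the raw value is 4;
# everywhere else, keep the value from step 1.
liepolice_dic <- ifelse(liepolice == 4, NA, liepolice_dic)
liepolice_dic   # 0 0 1 NA NA
```

Without the second step, the stray 4 would have been counted as a 0 (i.e., as “not yes”) rather than as missing.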

Let’s go ahead and look at the data and make sure the variables we told R to create were actually created.

head(nys_fwtrim_dic) %>%
  gt()

CASEID	age	evsoc	socimp	liepolice	wave	evsoc_dic	socimp_dic	liepolice_dic
1	13	3	3	1	1	1	0	0
2	15	4	3	3	1	1	0	1
3	11	1	5	1	1	0	1	0
4	16	2	3	2	1	0	0	0
5	14	0	4	NA	1	0	1	NA
6	11	2	3	3	1	0	0	1

Great! It looks like R created the variables we wanted inside the new data object named “nys_fwtrim_dic.” However, just looking at the first six rows isn’t enough to ensure that the code above worked as expected. In the next section we’ll walk through some basic strategies for checking this.

2.3: Checking That Our Code Worked as Intended

Anytime you recode and/or manipulate data, you want to check that R did what you wanted it to do. The thing about programming languages like R is that they will do exactly what you tell them to do (or won’t do something because you didn’t speak to them correctly). But what you tell them to do is not always what you expect them to do. So it is crucial to check your data after you have made changes.

Let’s check to make sure our new “dummy” variables are categorizing the data as we expect them to. To do this we could simply use the flat_table() function from the “sjmisc” package that you learned about in the previous assignment. However, the flat_table() function doesn’t include the frequencies for missing data (at least not by default) which we want when checking that our code worked. So below, I’m going to use the tabyl() function from the “janitor” package. The tabyl() function was designed largely to replicate and expand on the functionality of the base R table() function within a tidyverse framework.

Note: remember, in order to use the tabyl() function, you need to install and load the “janitor” package.

library(janitor)

nys_fwtrim_dic %>%
  tabyl(evsoc, evsoc_dic) %>%
  gt()

evsoc	0	1	NA_
0	1288	0	0
1	1899	0	0
2	2068	0	0
3	0	1463	0
4	0	677	0
5	0	386	0
6	0	111	0
7	0	137	0
NA	0	0	596

nys_fwtrim_dic %>%
  tabyl(socimp, socimp_dic) %>%
  gt()

socimp	0	1	NA_
1	474	0	0
2	1460	0	0
3	2543	0	0
4	0	2176	0
5	0	1375	0
NA	0	0	597

nys_fwtrim_dic %>%
  tabyl(liepolice, liepolice_dic) %>%
  gt()

liepolice	0	1	NA_
1	4469	0	0
2	1830	0	0
3	0	1338	0
4	0	0	1
NA	0	0	987

As you can see from each of these tables above, we simply created a crosstab with the original variable in the rows and our “dummy” variables in the columns. This allows you to see that the values in the original variable you intended to count as 1 or 0 in the dummy variable are indeed counting as those values.

Note: Notice that missing values carried over to the new variables without any extra code. This is not something mutate() does on its own; it comes from how R evaluates the logical test inside ifelse(). A comparison involving NA (e.g., NA >= 3) returns NA, and ifelse() returns NA wherever its test is NA. So, if an observation was coded as NA in the original variable, it is assigned NA on the new variable constructed from it. There are limits to this behavior (for example, NA | TRUE evaluates to TRUE), but it works in most cases we would need it for. Regardless, you should always check to make sure your data wrangling and recoding worked as you anticipated, including the treatment of missing values.
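
The treatment of missing values can be checked on a toy vector. The sketch below uses made-up data and base R’s table() with useNA = "ifany" as a stand-in for the tabyl() crosstabs above:

```r
# Toy raw item with one missing value (made-up data).
evsoc <- c(0, 2, 3, 7, NA)

# A comparison with NA returns NA, and ifelse() passes that NA through.
evsoc >= 3                      # FALSE FALSE TRUE TRUE NA
evsoc_dic <- ifelse(evsoc >= 3, 1, 0)

# Cross-tabulate raw vs. recoded values, keeping NA as its own category.
check <- table(raw = evsoc, recoded = evsoc_dic, useNA = "ifany")
check

# One caveat: with OR, a TRUE on either side wins even if the other is NA.
NA | TRUE                       # TRUE
NA | FALSE                      # NA
```

Every observation should land in exactly one cell of the crosstab, with missing raw values appearing only in the NA column.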

2.4: Some More Thoughts on the Mutate Function

The mutate() function can do a lot more than just assign a value to a new variable or create dummy variables like what we’ve primarily used it for to this point. For instance, it can also create a new variable based on a mathematical formula or based on some function of existing variables. Let’s show you some relatively simple examples:

nys_fwtrim_dictemp <- nys_fwtrim_dic %>%
  mutate(peerrel_index = evsoc_dic + socimp_dic + liepolice_dic,
         age_squared = age * age,
         age_squaredB = age^2,
         evsoc_mean = mean(evsoc, na.rm = TRUE),
         evsoc_sd = sd(evsoc, na.rm = TRUE),
         evsoc_z = (evsoc - evsoc_mean) / evsoc_sd,
         evsoc_zB = (evsoc - mean(evsoc, na.rm = TRUE)) / sd(evsoc, na.rm = TRUE))

In the above code, we show how you can create new variables by performing different mathematical operations on existing variables. We also show that R has some built-in functions for calculating common statistics like the mean and standard deviation, which themselves can be used to create new variables. Let us briefly walk you through what we did for each new variable created above:

peerrel_index: Created an index of our “other aspects of peer relations” dummy variables so that, for each individual, the value indicates the number of those variables on which the respondent was categorized as 1 using Warr’s (1993) cutoffs. Because adding NA to a number returns NA, the index is missing for anyone who is missing on any of the three items. You’ll often see this type of “variety index” applied to multiple types of criminal behavior (see Sweeten (2012) for a review).
age_squared and age_squaredB: Shows two different ways of creating a polynomial variable (in this case the squared value of age). These types of variables are often used to look at non-linear relationships within regression models.
evsoc_mean: Assigns everyone the mean value of “evsoc” after removing values with NA.
evsoc_sd: Calculates and assigns everyone the standard deviation for “evsoc” after removing values with NA.
evsoc_z and evsoc_zB: Calculates a standardized variable of evsoc or what is often referred to as a z-score. These variables essentially rescale the variable to be on the standard deviation scale. They indicate how many standard deviations each subject is from the mean of the sample. In this case “evsoc_z” uses existing variables in the equation (i.e. “evsoc_mean” and “evsoc_sd”) whereas “evsoc_zB” simply uses the mean() and sd() functions within the mutate equation.
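
The z-score calculation can be verified on a handful of made-up values. This is a minimal sketch in base R; the comparison with scale() is added here only as a cross-check and is not part of the assignment code.

```r
# Made-up values standing in for the evsoc item.
evsoc <- c(3, 4, 1, 2, 0, 2, NA)

# Standardize by hand: subtract the mean, divide by the standard deviation.
evsoc_z <- (evsoc - mean(evsoc, na.rm = TRUE)) / sd(evsoc, na.rm = TRUE)

# By construction the result has mean 0 and standard deviation 1.
round(mean(evsoc_z, na.rm = TRUE), 10)
sd(evsoc_z, na.rm = TRUE)

# Base R's scale() performs the same calculation (it returns a matrix,
# so as.numeric() converts it back to a plain vector).
all.equal(evsoc_z, as.numeric(scale(evsoc)))
```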

We are just scratching the surface of what the mutate() function can do. Be sure to take a minute to examine the first six rows of data with the head() function so you can see that these commands worked as intended. You’ll notice, for example, that the first observation has values of 1 only for the “evsoc_dic” variable and values of 0 for “socimp_dic” and “liepolice_dic” variables. Thus, that respondent has a value of 1 for the “peerrel_index” variable (1 + 0 + 0 = 1).

head(nys_fwtrim_dictemp) %>%
  gt()

CASEID	age	evsoc	socimp	liepolice	wave	evsoc_dic	socimp_dic	liepolice_dic	peerrel_index	age_squared	age_squaredB	evsoc_mean	evsoc_sd	evsoc_z	evsoc_zB
1	13	3	3	1	1	1	0	0	1	169	169	2.078341	1.572359	0.58616326	0.58616326
2	15	4	3	3	1	1	0	1	2	225	225	2.078341	1.572359	1.22215039	1.22215039
3	11	1	5	1	1	0	1	0	1	121	121	2.078341	1.572359	-0.68581101	-0.68581101
4	16	2	3	2	1	0	0	0	0	256	256	2.078341	1.572359	-0.04982388	-0.04982388
5	14	0	4	NA	1	0	1	NA	NA	196	196	2.078341	1.572359	-1.32179815	-1.32179815
6	11	2	3	3	1	0	0	1	1	121	121	2.078341	1.572359	-0.04982388	-0.04982388

Finally, it is also worth examining the descriptive statistics for these variables, especially for the standardized variables, so you can see what this looks like in terms of the mean and standard deviation (when you construct z-scores the variable should have a mean of 0 and a standard deviation of 1):

nys_fwtrim_dictemp %>%
  descr(peerrel_index:evsoc_zB)

## 
## ## Basic descriptive statistics
## 
##            var    type         label    n NA.prc   mean    sd   se     md
##  peerrel_index numeric peerrel_index 7624  11.61   0.98  0.90 0.01   1.00
##    age_squared numeric   age_squared 8625   0.00 257.77 76.82 0.83 256.00
##   age_squaredB numeric  age_squaredB 8625   0.00 257.77 76.82 0.83 256.00
##     evsoc_mean numeric    evsoc_mean 8625   0.00   2.08  0.00 0.00   2.08
##       evsoc_sd numeric      evsoc_sd 8625   0.00   1.57  0.00 0.00   1.57
##        evsoc_z numeric       evsoc_z 8029   6.91   0.00  1.00 0.01  -0.05
##       evsoc_zB numeric      evsoc_zB 8029   6.91   0.00  1.00 0.01  -0.05
##  trimmed             range        iqr skew
##     0.91           3 (0-3)   2.000000 0.47
##   254.81     320 (121-441) 128.000000 0.33
##   254.81     320 (121-441) 128.000000 0.33
##     2.08     0 (2.08-2.08)   0.000000  NaN
##     1.57     0 (1.57-1.57)   0.000000  NaN
##    -0.09 4.45 (-1.32-3.13)   1.271974 0.79
##    -0.09 4.45 (-1.32-3.13)   1.271974 0.79

Part 3 (Assignment 6.3): Recreate Warr’s (1993) Figures 2 - 4

Now that we have recoded the data, we are almost ready to reproduce Warr’s (1993) plots. But first, we need to wrangle the data some more to get it in the form we want to plot. Specifically, for each age group, we need to calculate the percentage of respondents that have a value of 1 on each of our “other aspects of peer relations” dummy variables. You already did this with the raw versions of the variables last assignment with the flat_table() function from the “sjmisc” package. We can see the basic form in which we want to get the data by doing that again with our dummy variables.

nys_fwtrim_dic %>%
  flat_table(evsoc_dic, age, margin = "col")

##           age    11    12    13    14    15    16    17    18    19    20    21
## evsoc_dic                                                                      
## 0             87.60 87.93 83.79 75.00 67.13 60.93 55.71 47.80 52.57 56.05 67.88
## 1             12.40 12.07 16.21 25.00 32.87 39.07 44.29 52.20 47.43 43.95 32.12

nys_fwtrim_dic %>%
  flat_table(socimp_dic, age, margin = "col")

##            age    11    12    13    14    15    16    17    18    19    20    21
## socimp_dic                                                                      
## 0              62.00 65.59 63.86 59.14 55.16 53.47 46.77 51.01 57.21 55.00 62.42
## 1              38.00 34.41 36.14 40.86 44.84 46.53 53.23 48.99 42.79 45.00 37.58

nys_fwtrim_dic %>%
  flat_table(liepolice_dic, age, margin = "col")

##               age    11    12    13    14    15    16    17    18    19    20    21
## liepolice_dic                                                                      
## 0                 90.65 90.60 88.40 83.94 81.42 79.37 79.40 79.58 79.93 83.64 86.50
## 1                  9.35  9.40 11.60 16.06 18.58 20.63 20.60 20.42 20.07 16.36 13.50

If you compare the above tables to Figures 2 - 4 in Warr (1993), you’ll notice that the percentages in the 1 category align with the values for each age group in the figures. To get this into a format that can be plotted with the “ggplot2” package, we need to create a summary data set that has a variable for age and variables that reflect the values in the bottom row of each of the tables above.

3.1: Wrangling Summary Data

In order to get the data ready to plot, we need to use these dichotomous variables to create a summary of the data that reflects the “percentage of respondents” classified as 1 in each of our “other aspects of peer relations” variables for each age group. To do that we will go back to “dplyr” and use the group_by() and summarize() functions. Here is what the code looks like (explained below):

nys_fwsum <- nys_fwtrim_dic %>%
  group_by(age) %>%
  summarize(perc_evsoc = mean(evsoc_dic, na.rm = TRUE) * 100,
            perc_socimp = mean(socimp_dic, na.rm = TRUE) * 100,
            perc_liepolice = mean(liepolice_dic, na.rm = TRUE) * 100)

nys_fwsum %>%
  gt()

age	perc_evsoc	perc_socimp	perc_liepolice
11	12.40000	38.00000	9.345794
12	12.07243	34.40644	9.395973
13	16.20553	36.13666	11.604585
14	25.00000	40.86345	16.063830
15	32.86540	44.83898	18.578767
16	39.06511	46.53300	20.632133
17	44.28698	53.23295	20.596459
18	52.19976	48.98689	20.415648
19	47.42952	42.78607	20.066890
20	43.94737	45.00000	16.358839
21	32.12121	37.57576	13.496933

In the above code, we created a summary data set that includes variables for the age grouping (“age”) and for the percentage of respondents classified as 1 in each of our three “other aspects of peer relations” dummy variables. Let us briefly explain the logic:

First, we told R to group the data set by the variable “age.” This basically tells R to perform whatever functions come next within the values of the grouping variable - in this case age (e.g., within age == 11; then within age == 12; etc.).
Second, we used the summarize() function to create a new summary data set with the variables calculated as instructed in the parentheses.
- Each variable uses the mean() function to indicate the proportion of respondents in each age group who were classified as 1 in each of our dummy variables. This works because when you calculate the mean of a dummy variable, the mean represents the proportion of cases that have the category coded as 1.
- Then we simply multiplied each proportion by 100 to calculate the percentage as reported by Warr (1993).
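
To make the “mean of a dummy variable equals a proportion” point concrete, here is a minimal sketch on made-up data with two age groups. The object names toy and toy_sum and the helper columns n_valid and n_ones are hypothetical and exist only to show the arithmetic.

```r
library(dplyr)

# Made-up toy data: two age groups and a 0/1 dummy with one missing value.
toy <- tibble(
  age       = c(11, 11, 11, 11, 12, 12, 12, 12),
  evsoc_dic = c(0, 0, 0, 1, 1, 1, 0, NA)
)

toy_sum <- toy %>%
  group_by(age) %>%
  summarize(
    n_valid    = sum(!is.na(evsoc_dic)),            # non-missing responses
    n_ones     = sum(evsoc_dic, na.rm = TRUE),      # respondents coded 1
    perc_evsoc = mean(evsoc_dic, na.rm = TRUE) * 100
  )

toy_sum
# Age 11: 1 of 4 valid responses is a 1, so the mean is 0.25 (25%).
# Age 12: 2 of 3 valid responses are 1s, so the mean is about 0.667 (66.7%).
```

Note that na.rm = TRUE drops the missing response from both the numerator and the denominator, so the percentage is out of valid responses rather than out of all respondents.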

3.2: Plot the data

Now that we have these summary data, plotting them is relatively easy. We just need to tell ggplot what data and variables to use and the specific geom we want to represent the data as. There are some other details, but we already created the basic template for the plots in “R Assignment 3.” So we can just re-use that here and update the information.

Note: Go back to “R Assignment 3” to refresh your memory on the basic logic of the “ggplot2” package and the following code. Like with every plot created with ggplot(), at minimum we need to 1) tell it what data to use, 2) map variables in the data to visual properties of the plot and geoms (e.g., x-axis, y-axis, etc.), and 3) specify the geometric features for visualizing the data (i.e. specify a geom). Everything else is essentially fine-tuning the details to get the plots to look exactly how we want them to.

theme_set(theme_classic())

#Evenings Socializing Plot:
evsoc_plot <- ggplot(data = nys_fwsum, aes(x = age, y = perc_evsoc)) + 
  geom_line() + 
  geom_point(shape = "square") +
  scale_x_continuous(limits = c(11, 21), breaks = 11:21) + 
  scale_y_continuous(limits = c(0, 60), breaks = seq(0, 60, 10)) + 
  labs (title = "Figure 2: Percentage of Respondents Reporting That They Averaged Three or More 
Nights Per Week Going “On Dates, To Parties, or to Other Social Activities,” by Age", 
        x = "Age", 
        y = "Percent") + 
  theme(plot.title = element_text(size = 11),
        axis.title = element_text(size = 10))
evsoc_plot

#Importance of Socializing Plot:
socimp_plot <- ggplot(data = nys_fwsum, aes(x = age, y = perc_socimp)) + 
  geom_line() + 
  geom_point(shape = "square") +
  scale_x_continuous(limits = c(11, 21), breaks = 11:21) + 
  scale_y_continuous(limits = c(30, 60), breaks = seq(30, 60, 10)) + 
  labs (title = "Figure 3: Percentage of Respondents Who Said That It Is “Very Important” or “Pretty Important” 
to “Have Dates and Go to Parties and Other Social Activities,” by Age", 
        x = "Age", 
        y = "Percent") + 
  theme(plot.title = element_text(size = 11),
        axis.title = element_text(size = 10))
socimp_plot

#Lie to Police Plot:
liepolice_plot <- ggplot(data = nys_fwsum, aes(x = age, y = perc_liepolice)) + 
  geom_line() + 
  geom_point(shape = "square") +
  scale_x_continuous(limits = c(11, 21), breaks = 11:21) + 
  scale_y_continuous(limits = c(0, 25), breaks = seq(0, 25, 5)) + 
  labs (title = "Figure 4: Percentage of Respondents Who Said That They Would Lie 
to Protect Their Friends if They Got into Trouble with the Police, by Age", 
        x = "Age", 
        y = "Percent") + 
  theme(plot.title = element_text(size = 11),
        axis.title = element_text(size = 10))
liepolice_plot

Recall that we could also put each of these plots together using the “patchwork” package:

library(patchwork)
fig234_peerrel <- evsoc_plot + socimp_plot + liepolice_plot + 
  plot_layout(ncol = 1) + 
  plot_annotation(
    title = "Age Distribution of Other Peer Relations Variables from Warr (1993)")
fig234_peerrel

In the above plot, I stacked the plots in a single column so the titles of the individual plots would not be cut off. Of course, with the “patchwork” package, you can adjust the layout in a multitude of ways.

Note: I controlled the size of the combined plots by placing the following in the code chunk options: {r, fig.width=4, fig.height=9}

Part 4 (Assignment 6.4): Draw the Owl

Like with the last assignment, in order for you to demonstrate that you can apply the basic data wrangling and recoding skills that you learned above on your own, in the last part of the assignment, you will consider alternative operationalizations of one of the “other elements of peer relations” that Warr (1993) was examining in Figures 2-4.

Recall that Warr (1993) used the question about being willing to “lie to protect their friends if they got in trouble with the police” as an indicator of respondents’ “commitment or loyalty to their own particular set of friends” (pg. 19). However, in the section on “Commitment to Delinquent Peers” in the codebooks for the first five waves of NYS data, there are two other questions that are meant to measure “commitment” to peers who are engaging in delinquency:

“If you found that your group of friends was leading you into trouble, would you still run around with them?” (1 = No, 2 = Maybe, 3 = Yes)
“If you found that your group of friends was leading you into trouble, would you try to stop these activities?” (1 = No, 2 = Maybe, 3 = Yes)

These were in addition to the question Warr (1993) examined:

“If your friends got into trouble with the police, would you be willing to lie to protect them?” (1 = No, 2 = Maybe, 3 = Yes)

Here is the table that shows you where each item is located in each of the first five waves of NYS data:

Warr (1993) Figures 2-4 NYS Items
Item	Wave 1	Wave 2	Wave 3	Wave 4	Wave 5
ICPSR number¹	8375	8424	8506	8917	9112
Age	V169	V7	V10	V6	V6
Still run around with friends	V375	V221	V319	V299	V326
Try to stop activities	V376	V222	V320	V300	V327
Lie to police	V377	V223	V321	V301	V328
¹ Note: indicates the ICPSR number for the data set and not a survey item

You should have created the pooled data with each of these items in “R Assignment 5.” Now you simply need to recode the items into dummy variables and plot them like we did above.

4.1: Extend Warr’s Analysis of Commitment to Delinquent Peers

In order to complete the assignment, here is what you need to do:

Before looking at the data, write a brief statement or commentary about whether you think the other two “commitment to delinquent peers” items will have a similar age distribution to the “lie to police” item for which you already produced the descriptive plot.
Trim, rename, and pool waves 1-5 data so that you have all three “commitment to delinquent peers” items in the same pooled data set.
- Note: You are welcome to copy the code you wrote in “R Assignment 5” to do this. Of course, you could also copy your code from “R Assignment 5” into an R script file and source the code like you did at the beginning of this assignment. It’s up to you.
Recode the three “commitment to delinquent peers” items into dummy variables and check to see that your code worked by examining cross-tabulations between the dummy variables and the original variables.
- Note: see code in Part 2 above for an example.
Create a summary data set with the percentage of respondents for each variable who report being “committed to delinquent peers” for each age group.
- Note: see code in Part 3 above for an example.
Write a brief statement or commentary about the similarities and differences between each of the “commitment to delinquent peers” items in terms of their raw frequency distribution and their age distribution.
Write a “Conclusion” section where you write about what you learned in this assignment and any problems or issues you had in completing it.

Part 5 (Assignment 6.5)

Submit your assignment

“Knit” your final RMD file to html format and save it using an informative file name (e.g., “LastName_CRM495_RAssign6_YEAR_MO_DY”) within a file structure you create for this assignment (e.g., “LastName_CRM495_RAssign6”).
Submit your knitted html file on Canvas.
Place a copy of your root folder in your LastName_495_commit folder on OneDrive.
- Note: The root folder should contain your reproducible file structure for this assignment. This means it should include anything necessary to reproduce your knitted html document with “one click.”