R Assignment 3: Downloading Data & Reproducing Figure

Assumptions & Ground Rules

The Study

The Data: National Youth Survey

Part 1 (Assignment 3.1)

Goal: Download Wave 1 of the NYS Data from ICPSR Website

1. Download the SPSS data
2. Move the data to your “Datasets” subfolder in your root “work” folder and load it into R using the “haven” package like you have done before. Assign it to an object named “nys_w1_icpsr.”

Part 2 (Assignment 3.2)

Goal: Download First Five Waves of the NYS Data within R

1. Create “NYS” Subfolder
2. Install, load, and use the icpsrdata package
3. Read the data into R.

Part 3 (Assignment 3.3)

Goal: Identify Variables used by Warr (1993)

1. Identify Variables used by Warr in first part of Figure 1.
2. Identify the items in each data set that correspond to the variables above

Part 4 (Assignment 3.4)

Goal: Select Items and Pool Data

1. Select items to use in analyses and give them informative names
2. Rename existing variables and create a new variable indicating NYS wave.
3. Pool data from all five waves
4. Filter the pooled data

Part 5 (Assignment 3.5)

Goal: Recode the Pooled Data

1. View frequency tables for key variables
2. View cross-tabulations of peer delinquency variables by age.
3. Dichotomize peer delinquency variables
4. Dummy code the “cheating” and “vandalism” variables and check that it worked.

Part 6 (Assignment 3.6)

Goal: Recreate the first part of Warr’s (1993) Figure 1

1. Understanding the basic logic of “ggplot2”.
2. Wrangle data and plot part of Figure 1
3. Plot the Data
4. Customize plot details
5. Combine separate plots into one.
6. “Draw the Owl”

Assumptions & Ground Rules

The purpose of this assignment is to reproduce findings from a published study in R, particularly one where the data are housed on ICPSR. In order to accomplish this we will need to do and learn the following:

Identify the data being used in the now classic study by Mark Warr published in Criminology titled: “Age, Peers, and Delinquency” (Warr, 1993)
Download it from ICPSR
Identify and wrangle the specific variables/items used in the study using the dplyr package that is a part of the Tidyverse.
Combine multiple data sets into one.
Reproduce the first part of Figure 1 from Warr’s (1993) study (as displayed on pg. 22 in his article) by introducing you to the powerful data visualization package ggplot2, which is also a part of the Tidyverse.

We assume that you are now familiar with installing and loading packages in R. Thus, when you see a package being used, we expect that you know it needs to be installed and that it needs to be loaded within your own R session in order to use it.

At this point, we also assume you are familiar with RStudio and with creating R Markdown (RMD) files. If not, please review R Assignments 1 & 2.

Note: For this assignment, have RMarkdown knit to an html file.

As with previous assignments, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste from the instructions except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).

For this assignment, there are points where you will repeat code structures but change certain details. For now, it may be efficient for you to copy and paste from your own code in these situations. However, it is important to recognize that copying and pasting repeatedly is generally considered an ill-advised and error-prone coding/programming practice (see R for Data Science for an introduction to functions). Eventually, you want to be able to use existing functions and write your own simple functions for accomplishing these repetitive tasks.
- We will not introduce writing functions in this class and, to be honest, we are still working on consistently using functions in our own work. However, one of the key advantages of using R is that it is a functional programming language. So, unlocking the ability to use functions really does supercharge your R abilities (R for Data Science and Advanced R provide book introductions to functional programming in R and Danielle Navarro has a video series about functional programming in R).

The Study

I recommend doing a quick AIC reading of Warr’s study. To get you started, here is the abstract:

Hirschi and Gottfredson (1983; Gottfredson and Hirschi, 1990) have argued that the age distribution of crime cannot be explained by any known variables, and they point specifically to the failure of sociological theories to explain this phenomenon. This paper examines a quintessentially sociological theory of crime-differential association-and evaluates its ability to explain the age distribution of crime. Analysis of data from the National Youth Survey on persons aged ll-21 reveals that peer relations (exposure to delinquent peers, time spent with peers, loyalty to peers) change dramatically over this age span, following much the same pattern as crime itself. When measures of peer influence are controlled, the effects of age on self-reported delinquency are largely rendered insignificant. Additional analyses show that delinquent friends tend to be ’’sticky” friends (once acquired, they are not quickly lost) and that Sutherland’s arguments concerning the duration and priority of delinquent associations are only partially correct.

The paper is fundamentally about investigating the age-crime relationship and specifically examining whether delinquent peer influence explains the age-distribution of crime. To investigate this, Warr (1993) uses the first five waves of the National Youth Survey (NYS). Here is the first section of the “Data and Methods” section where he outlines the specific data he is analyzing (p. 19):

The data for this study come from the National Youth Survey (NYS). The NYS is a longitudinal study of a national probability sample of 1,726 persons aged 11-17 in 1976 (see Elliott et al., 1985). The sample was obtained through a multistage, cluster sampling of households in the continental United States in 1976. Five consecutive annual waves of the survey were conducted from 1976 through 1980, and these five are used in this analysis.

The Data: National Youth Survey

Fortunately, the data for Warr’s study are available on ICPSR. So, for this assignment, we are going to work with the National Youth Survey (NYS) Series on ICPSR. It includes seven waves of data spanning 1976 to 1987. Overseen by Delbert Elliott, it was one of the first national-level longitudinal studies specifically designed to measure self-reported crime and has been a popular data source for many high profile criminological studies, including Warr’s (1993). Here is the general description of the series from ICPSR:

For this series, parents and youth were interviewed about events and behavior of the preceding year to gain a better understanding of both conventional and deviant types of behavior by youths. Data were collected on demographic and socioeconomic status of respondents, disruptive events in the home, neighborhood problems, parental aspirations for youth, labeling, integration of family and peer contexts, attitudes toward deviance in adults and juveniles, parental discipline, community involvement, drug and alcohol use, victimization, pregnancy, depression, use of outpatient services, spouse violence by respondent and partner, and sexual activity. Demographic variables include sex, ethnicity, birth date, age, marital status, and employment of the youths, and information on the marital status and employment of the parents.

In what follows, we will walk through two different ways to download data from ICPSR. First, we will download data directly from the ICPSR website. Second, we will download data from ICPSR entirely within the R environment.

Part 1 (Assignment 3.1)

Goal: Download Wave 1 of the NYS Data from ICPSR Website

Start by navigating to the ICPSR landing page for Wave 1 of the NYS (ICPSR 8375). You’ll notice that the NYS has lots of options in terms of file formats for which you can download the data. Unfortunately, R is not one of them. However, this is a case where you could download either the Stata, SPSS, or SAS data files and import them into R using the haven package. Indeed, if you only needed one data file, you could simply download the SPSS file directly from ICPSR, place it in your “Datasets” subfolder within your root “work” folder, and then import it into R using haven’s read_spss command (see previous R Assignments for examples). Let’s go ahead and do that with Wave 1 of the NYS data from 1976.

1. Download the SPSS data

Downloading SPSS Data

It is instructive to take a look at what is actually downloaded from ICPSR. First, notice that what ICPSR does is download a zip file with the title “ICPSR_08375-V2.zip” to your computer.

Opening this zip file reveals a folder titled “ICPSR_08375” inside.

When you open this folder, notice it contains several files that tell you about the data, but not the actual data. There is also another folder titled “DS0001” inside.

The data file itself is in this “DS0001” folder. Sometimes data you download from ICPSR will have multiple data sets associated with them, and each data file will be contained its own folder. So, for example, if there were a second data set associated with this particular wave of the NYS, there would likely be another folder named “DS0002.”
Inside the “DS0001” folder is a file named “08375-0001-Data.sav” - this is the actual data file. The folder also contains the codebook for the data.

2. Move the data to your “Datasets” subfolder in your root “work” folder and load it into R using the “haven” package like you have done before. Assign it to an object named “nys_w1_icpsr.”

library(haven)
library(here)

nys_w1_icpsr <- read_spss(here("Datasets", "08375-0001-Data.sav"))

Part 2 (Assignment 3.2)

Goal: Download First Five Waves of the NYS Data within R

If we just wanted to download one data set, the above method would work fine. Technically, you could also repeat what we just did for Waves 2 through 5. However, this is not a very efficient approach. It also poses some complications in terms of reproducibility because, according to ICPSR’s bylaws, technically you are not allowed to share ICPSR data in your own online repository (e.g. OSF or GitHub). To address these inefficiencies and potential threats to reproducibility, in this part of the assignment we will download the first five waves of the NYS data entirely within the R environment instead using the icpsrdata package (see here and here for more information).

1. Create “NYS” Subfolder

Create a subfolder within your “Datasets” subfolder called “NYS.” Technically, you could do this yourself by navigating to the folder on your computer and creating a new “NYS” folder manually. But we can also do it in R with the following code. Again, doing it in the R environment helps ensure that anyone else (including our future selves) can easily reproduce our work with minimal effort.

# check if "NYS" folder exists (TRUE if it does) & create if it does not exist. 
ifelse(dir.exists(here("Datasets/NYS")), TRUE, dir.create(here("Datasets/NYS")))

## [1] TRUE

Let us try to explain the above code to you. First, the “ifelse” command is a logical function within base R. To get more details about it, type ?ifelse into the console window. Here is the description of that function:

ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.

It takes the form of the following: ifelse(test, yes, no). This means, you give R a logical test (or a logical question) that can be answered yes or no and then it gives you a value or performs another function based on the solution of that test (i.e., based upon the answer to that question).
In the above code, we are asking if the “NYS” folder exists within the “Datasets” folder in the working directory (i.e., within your root “work” folder) with the dir.exists function. If the answer is yes, it simply returns the logical value I told it to - in this case TRUE. If the answer is no, you instruct R to create that “NYS” folder with the dir.create function. Again, type ?dir.exists or ?dir.create for more information.
If you want to have some fun, you can actually have R return a text string instead of the logical value. For example:

ifelse(dir.exists(here("Datasets/NYS")), "You already created that folder, dummy!", dir.create(here("Datasets/NYS")))

## [1] "You already created that folder, dummy!"

Generally, it is probably not a great idea to have R call the user (yourself in this case) a “dummy” with code you plan to eventually share publicly. Yet, it is also OK to have some fun when doing science. I (Jake) think that having a computer program call me a “dummy” is fun - perhaps you do not.
- Note: there is probably a programming rationale for using the logical value rather than a string of which I am unaware.

2. Install, load, and use the `icpsrdata` package

Go to the NYS Series page on ICPSR and make note of the ICPSR numbers for the first five waves of data.

install and load the “icpsrdata” package and use the icpsr_download function to download files for first five waves of the NYS based on their ICPSR numbers. Note that you need to specify the where to find the “NYS” folder you created earlier with the download_dir option and the here package.

library(icpsrdata)
icpsr_download(file_id=c(8375, 8424, 8506, 8917, 9112),
               download_dir = here("Datasets/NYS"))

Note: when you first run this chunk during an R session you will be asked to enter your ICPSR account information into the R console. R should remember this once you enter it once. So you will likely need to do this once per R session before trying to knit your Rmarkdown document.
Again, to be clear: Check the R Console after trying to run the icpsr_download command to see and respond to the ICPSR username/password prompts. If you have not yet created a free account on ICPSR (you should have already for an earlier project assignment), then you will need to do this on the ICPSR website first. Then, after each prompt in the console, you would put your ICPSR username instead of “your_icpsr_username” (I added that as a placeholder) and your ICPSR password instead of “your_icpsr_password.”

3. Read the data into R.

You already did this with the wave 1 data you downloaded directly from ICPSR above. Now you just need to do it for each of the first five waves of NYS data you just downloaded by telling R where the specific data file is within your file structure. You want to make note of the specific files that were downloaded to the “NYS” folder.

Recall from earlier that the actual data are within a folder called “DS0001” within each of the ICPSR folders. You simply want to use the here package to tell the read_spss function where to find the SPSS data for each wave of the data. Make sure you pay close attention to which study numbers are associated with each specific wave of data!
- Note: In the code chunk below, we use commas to indicate that we are moving further into subfolders (e.g., “Datasets”, “NYS”) within the root directory specified by the here package (which should have been set to your root “work” folder if you opened your RMD file directly from that folder). You can also use more traditional path structures as well (e.g., “Datasets/NYS”). For more information on using the here package, see here (pun intended).

nys_w1 <- read_spss(here("Datasets", "NYS", "ICPSR_08375", "DS0001", "08375-0001-Data.sav"))
nys_w2 <- read_spss(here("Datasets", "NYS", "ICPSR_08424", "DS0001", "08424-0001-Data.sav"))
nys_w3 <- read_spss(here("Datasets", "NYS", "ICPSR_08506", "DS0001", "08506-0001-Data.sav"))
nys_w4 <- read_spss(here("Datasets", "NYS", "ICPSR_08917", "DS0001", "08917-0001-Data.sav"))
nys_w5 <- read_spss(here("Datasets", "NYS", "ICPSR_09112", "DS0001", "09112-0001-Data.sav"))

If you have done everything correctly up to this point, you should have six data sets in your RStudio Environment: the wave 1 data we downloaded directly from ICPSR (named “nys_w1_icpsr”) and the waves 1 through 5 data that we downloaded with the icpsrdownload package (named “nys_w1” to “nys_w5”).

Note: The “nys_w1” and “nys_w1_icpsr” objects in your environment are the exact same (1,725 observations and 519 variables), as they should be - you just downloaded the same data set in two different ways.

Part 3 (Assignment 3.3)

Goal: Identify Variables used by Warr (1993)

Ok, so we have the data in R; now we need to get them in a usable form. In this case, that means 1) identifying the specific items Warr used from each wave of the NYS to construct the first part of his Figure 1, and 2) combining each of the five waves of data into a single datafile.

1. Identify Variables used by Warr in first part of Figure 1.

Here is the first page of Warr’s Figure 1 (1993, p.22).

As implied by the title of the article, the first key variable across each of the figures that we are going to try to reproduce is “Age.”
The second set of items is “exposure to delinquent peers,” which Warr (1993) describes on pg. 20:

Consider, first, exposure to delinquent peers, that is, the number of delinquents within the respondent’s immediate circle of friends. In the NYS, respondents were asked this question: “Think of the people you listed as close friends. During the last year how many of them have [act]?” ( 1 = none of them, 2 = very few of them, 3 = some of them, 4 = most of them, 5 = all of them). Following the question was a series of acts, including vandalism (“purposely damaged or destroyed property that did not belong to them”), cheating (“cheated on school tests”), marijuana use (“used marijuana or hashish”), petty theft (“stolen something worth less than $5”), alcohol use (“used alcohol”), burglary (“broken into a vehicle or building to steal something”), selling hard drugs (“sold hard drugs such as heroin, cocaine, and LSD”), and grand theft (“stolen something worth more than $50”).
At this point, we will just focus on the first four behaviors: Marijuana, Alcohol, Cheating, and Vandalism.

2. Identify the items in each data set that correspond to the variables above

Now that we know what variables we are looking for, we need to identify the specific items corresponding to these variables in each wave of data. To do this, you need to look at the codebook for each wave. You can find the codebook in the “DS0001” file for each wave of the data that you downloaded in the previous section of the assignment.

We are going to assume that you are able to find the proper variables within each respective codebook. Beware, however, that data for wave 1 of the NYS includes data from both a parent survey and a child survey, whereas the subsequent waves are just surveys of the children. Warr (1993) was focused on just the child data, so we want to be sure to select from those items only.
- After looking through the codebook for each wave of data, we find that the following specific items in each data file corresponds to each measure from the first part of Figure 1 in Warr (1993):

	Wave 1	Wave 2	Wave 3	Wave 4	Wave 5
icpsr	8375	8424	8506	8917	9112
age	V169	V7	V10	V6	V6
marijuana	V367	V210	V308	V288	V315
alcohol	V370	V213	V311	V291	V318
cheating	V365	V208	V306	V286	V313
vandalism	V366	V209	V307	V287	V314

Part 4 (Assignment 3.4)

Goal: Select Items and Pool Data

Now that we have the data and know the items we need, the next step is to select the appropriate items in each data set and then pool the data from each separate wave together into one datafile (or one R object). To do this, we will need to learn some new skills and, in particular, a new package that is part of the “Tidyverse”— dplyr. The “dplyr” package is used for data manipulation. Danielle Navarro has a really good video series that introduces you to the details of the various “dplyr” functions. We will only cover the operations we need for this assignment but, if you want to know more, Navarro’s videos are a great place to start.

1. Select items to use in analyses and give them informative names

Let’s start with the first four delinquent peer behaviors of marijuana use, alcohol use, cheating, and vandalism. We are going to use the select() function in “dplyr” to select those specific items in each of the five waves of data. That means, for the wave 1 data, we need to select “V336,” “V370,” “V365,” and “V366” (“V210,” “V213,” “V208,” and “V209” for wave 2, and so on for waves 3 through 5). We also need to select the “Age” item in each wave as that is the variable on the x-axis in Figure 1 (“V169” in Wave 1).

nys_w1_trim <- nys_w1 %>%
  dplyr::select(V169, V367, V370, V365, V366)
nys_w1_trim

In the code above, we are telling R to select “V169,” “V367,” “V370,” “V365,” and “V366” from “nys_w1” and create a new data set object called “nys_w1_trim”. Our new object has the same number of observations, but only 5 variables. You can check this by looking in your RStudio “Environment.”
- NOTE: The code above also introduces you to a new way to call a command directly from a specific package. Recall that R is an open-source program within which anyone could conceivably create their own useful packages. While this is one of the program’s greatest strengths, it also poses some challenges. One such challenge is the lack of strict curation across its countless and ever-growing list of packages to ensure that programmers do not incorporate conflicting package commands. From our experience, select() is one of those popular commands that frequently poses conflicts when you have several packages loaded at once. So, in this code chunk, we ensured that the select() command was invoked using the “dplyr” package by appending the package name followed by two colons directly in front of the command (i.e., dplyr::select()).
There are a few problems with the wave 1 trimmed data in its current form:
- First, the variable names are not informative. They are simply the names in the original data file. Like with naming your computer files, it is usually a good practice to give informative names to your variables (and other R objects; see part 1 of Navarro’s series on “dplyr”). Using meaningful and systematic naming conventions will also be useful when we combine the data sets, since we can assign the same name across each wave before merging or combining them.
- Second, the trimmed data we created has no information about which wave these data are from (except in the object name) nor does it include the unique identifier for individuals. If we were just working with the wave 1 data, this would not be a huge problem; also, since Warr (1993) simply pooled all five waves of data, the individual identifiers are less important. Nonetheless, it is generally good practice to preserve such important information.

2. Rename existing variables and create a new variable indicating NYS wave.

In order to rename the five variables we are currently interested in (i.e., age, marijuana use, alcohol use, cheating, and vandalism) we will use the rename() function. To create a new variable indicating the wave of the data, we will use the mutate() function. Both of these functions are part of the “dplyr” package. The rename() function does exactly what it says - it renames the items, whereas the mutate() function allows us to create new variables and to manipulate existing items in the data.
Let’s do this with the wave 1 data again.
- Note: Below, we are writing over the nys_w1_trim data that we created earlier. Usually, we avoid writing over objects, as doing so can cause confusion and errors as we try to keep track of what exactly is in the object. Nonetheless, as you will see, we are going to repeat the select() function that we used before - though, this time, that command will be followed by a pipe and additional commands using rename and mutate functions. One of the nice things about the “dplyr” package and the “pipe” (%>%) you learned about in the last R Assignment is that you sequentially invoke various commands all at once within the same code chunk.

nys_w1_trim <- nys_w1 %>%
  dplyr::select(CASEID, V169, V367, V370, V365, V366) %>%
  rename(age = V169, 
         marijuana = V367,
         alcohol = V370,
         cheating = V365,
         vandalism = V366) %>%
  mutate(wave = 1)
nys_w1_trim

A couple things to note about the above code:
- First, like before, we are telling R to select “V169”, “V367”, “V370,” “V365,” and “V366” from “nys_w1.” However, one key difference this time is that we immediately followed this with a pipe to a sequential rename command, which tells R to subsequently rename specific items after selecting them. Within the rename command, we specifically tell R what to rename each variable by invoking a new name as equal to an old name (i.e. new name = old name).
- Second, the rename command is then followed by another pipe to a sequential mutate command that tells R to create or modify a variable after completing the rename command. In this case, our mutate command tells R to create a new variable named “wave” that equals “1” to indicate the wave of the data we are working with (i.e. variable_name = value). Since this command is not conditional (e.g., there is no ifelse operator), all 1,725 rows or observations (i.e., “cases” or “respondents”) from NYS wave 1 will include a variable column named “wave” with a value that equals “1” in each row. And, of course, the first line of code tells R to assign all of these operations into a new object called “nys_w1_trim,” while the last line of code is equivalent to running the view() command so we can see our new data object.
  - Note: We also included the “CASEID” item in the select command above; ICPSR and the NYS made it easy on us by consistently naming the identifier “CASEID” in each wave of data.
  - Note: The mutate() function can do a lot more than just assign a value to a new variable. For instance, it can also create a new variable based on a mathematical formula or based on some function of existing variables. Let’s show you some simple examples:

nys_w1_trimX <- nys_w1 %>%
  dplyr::select(CASEID, V169, V367, V370, V365, V366) %>%
  rename(age = V169, 
         marijuana = V367,
         alcohol = V370,
         cheating = V365,
         vandalism = V366) %>%
  mutate(wave = 1,
         waveB = 24 / 24,
         maralc = marijuana + alcohol,
         age_squared = age * age,
         age_squaredB = age^2)
nys_w1_trimX

We are really just scratching the surface of what the mutate() function can do. In the code above, I show you how you can perform a mathematical function like division (e.g., 24 / 24), add the values of two variables together for each respondent (e.g., marijuana + alcohol), and multiply the values of variables together - in this cases by illustrating two different ways of squaring the age variable (e.g., age^2 or age * age). Again, this is just the beginning of what you can do with this supremely useful function. Also, take a minute to briefly examine the first 10 rows of data; doing so will reveal that these commands worked as intended (e.g. for the first observation the “maralc” value of 4 is the value for the “marijuana” variable for that individual–1, plus the value for the “alcohol” variable for that individual–3). You should develop the essential habit of always checking to ensure your commands worked as intended after running them.
Now, let’s create the trimmed data for each of the first five waves of the NYS. Again, we will write over the nys_w1_trim data we created above, as we would usually do all of these commands in the same code chunk. Here is the table of the items as a reminder:

	Wave 1	Wave 2	Wave 3	Wave 4	Wave 5
icpsr	8375	8424	8506	8917	9112
age	V169	V7	V10	V6	V6
marijuana	V367	V210	V308	V288	V315
alcohol	V370	V213	V311	V291	V318
cheating	V365	V208	V306	V286	V313
vandalism	V366	V209	V307	V287	V314

And here is our code to create five trimmed data objects corresponding to each of the first five waves of NYs data.

#Wave 1:
nys_w1_trim <- nys_w1 %>%
  dplyr::select(CASEID, V169, V367, V370, V365, V366) %>%
  rename(age = V169, 
         marijuana = V367,
         alcohol = V370,
         cheating = V365,
         vandalism = V366) %>%
  mutate(wave = 1)
nys_w1_trim

#Wave 2:
nys_w2_trim <- nys_w2 %>%
  dplyr::select(CASEID, V7, V210, V213, V208, V209) %>%
  rename(age = V7, 
         marijuana = V210,
         alcohol = V213,
         cheating = V208,
         vandalism = V209) %>%
  mutate(wave = 2)
nys_w2_trim

#Wave 3:
nys_w3_trim <- nys_w3 %>%
  dplyr::select(CASEID, V10, V308, V311, V306, V307) %>%
  rename(age = V10, 
         marijuana = V308,
         alcohol = V311,
         cheating = V306,
         vandalism = V307) %>%
  mutate(wave = 3)
nys_w3_trim

#Wave 4:
nys_w4_trim <- nys_w4 %>%
  dplyr::select(CASEID, V6, V288, V291, V286, V287) %>%
  rename(age = V6, 
         marijuana = V288,
         alcohol = V291,
         cheating = V286,
         vandalism = V287) %>%
  mutate(wave = 4)
nys_w4_trim

#Wave 5:
nys_w5_trim <- nys_w5 %>%
  dplyr::select(CASEID, V6, V315, V318, V313, V314) %>%
  rename(age = V6, 
         marijuana = V315,
         alcohol = V318,
         cheating = V313,
         vandalism = V314) %>%
  mutate(wave = 5)
nys_w5_trim

In the code above, we simply kept the same basic code that we used for wave 1 earlier, but replaced the item information with the appropriate items for each wave that we identified previously for the other four waves of NYS data. Now, you should have five separate “trimmed” data objects each with the same seven variables in them (“CASEID,” “age,” “marijuana,” “alcohol,” “cheating,” “vandalism,” and “wave”).

3. Pool data from all five waves

Now that we have all five data sets with the same variables and variable names, pooling them together should be relatively easy. But first, we want you to try to build some intuition regarding what Warr (1993) did here when he says that “…all five years of the NYS data were pooled, producing a composite sample of 8,625 persons aged 11-21 (pg. 20).”
- Actually, this statement is somewhat misleading as, to the untrained reader, it seems to imply that their are 8,625 different persons in the pooled data. However, this is not the case. Remember, the NYS is a longitudinal panel study. This means the researchers set out to study the same people over an extended period of time - in this case once per year over the first five years of the study. So, the 8,625 persons are really 1,725 unique individuals - each of whom were surveyed (or, more accurately, whom the researchers attempted to survey) five times in the first five years of the study (1,725 persons x 5 waves = 8,625 person-waves).
Because we created five trimmed data sets with the same variables in step 3 above, “pooling” the data in this case really just means stacking the waves of data on top of each other. In other words, I want to put Wave 1 data on top, Wave 2 data next, then Wave 3 data, then Wave 4 data, until the Wave 5 observations are at the bottom of the data set. Fortunately, the “dplyr” package makes this relatively easy with the bind_rows() function. Essentially, when you use the bind_rows() function, you are simply telling R which data sets to stack on top of each other by the order in which you list them.
- Note: there is a corollary function bind_cols() that tells R to place data sets next to each other.

nys_fwtrim <- bind_rows(nys_w1_trim, nys_w2_trim, nys_w3_trim, nys_w4_trim, nys_w5_trim)
nys_fwtrim

In the code above, we told R to stack waves 1 through 5 on top of each other in chronological order and assign it to the object “nys_fwtrim” (we used fw to indicate we were creating a data set that included all “five waves” of data).
Note: we could still pull out the individual waves of data from this combined data set using the filter() function. Of course, the filter() function would allow us to subset the data by other variables as well (e.g. select certain age groups). Here is a quick example:

nys_w3_filter <- nys_fwtrim %>%
  filter(wave == 3)
nys_w3_filter

nys_age_filter <- nys_fwtrim %>%
  filter(age == 11 | age == 12 | age == 13,
         wave == 3)
nys_age_filter

In the code above, the first object we created (“nys_w3_filter”) simply selects all observations from Wave 3 of the data (i.e. recreates the “nys_w3_trim” data set). In the second object we created (“nys_age_filter”), we select subjects who are age 11, 12, or 13 from wave 3. Moving forward, we are just going to work with the pooled data that we created above. But imagine a situation in which you received the pooled data but only wanted to analyze wave 1. We hope this brief diversion gives you a sense of how you could easily go from a larger data set to a smaller data set.
It is also important to point out a couple of logical operators we used in the code above:
- First, when we want to select a specific value of a variable (e.g., “wave”), we use the == operator. The double equals signs (==) are used when you are performing a logical test or operation. So, in the above code, you are performing a logical test to see if the “wave” variable equals three (3) and, if it does, to return those observations while filtering out those observations where it does not. A single equals sign is used when you are assigning a value to an object or variable. Earlier, when using the mutate() function, we used single equals signs (=) because we were assigning values to the “wave” variable (e.g., mutate(wave = 3).
- Second, the “|” is another logical operator that means “or.” So, in the second filter command above, we tell R to give us observations that are age 11 or 12 or 13 and from wave 3 (signified by comma as another condition).
- Note: If this all sounds confusing, it is only because you are paying attention! To find out more about logical operators, watch Navarro’s dplyr videos and read this section of R for Data Science.

4. Filter the pooled data

Filter the pooled data so that you identify all the individuals in wave 5 who report having no friends who use Marijuana, use Alcohol, Cheat, or have engaged in Vandalism (i.e. have values of “1” on all of the peer delinquency items).
- Show your code to create your filtered data set.
- How many observations are in your filtered data set?

Part 5 (Assignment 3.5)

Goal: Recode the Pooled Data

Now that we have the pooled data, we’re almost ready to start recreating the first part of Figure 1 from Warr (1993). However, we still need to do a little more data cleaning and wrangling before we are ready to plot.

1. View frequency tables for key variables

If you look again at Figure 1, you will notice that Warr (1993) reports the “Percentage of Respondents with No Delinquent Friends, by Age and Offense.” Recall the questions about delinquent peers specifically asked the respondents:

Think of the people you listed as close friends. During the last year how many of them have [act]?” with the answer categories: “1 = none of them, 2 = very few of them, 3 = some of them, 4 = most of them, 5 = all of them.”
Warr (1993) chose to focus on those who reported none of their friends engaging in any of those acts. One reason for this may be because the data are fairly skewed. In other words, the modal category for many of the peer delinquency variables is likely “1 = none of them”.
We can check the frequency distributions and modal categories for each of the four peer delinquency items by creating a frequency table for them. There are lots of ways to create frequency tables, including the base R command table()
- Note: If you use the base R command, you’ll need to use the $ operator to tell R which variable to use from the data set—table(nys_fwtrim$marijuana). For now, we’ll use the “sjmisc” package that is a part of the same suite of packages as the “sjPlot” package you used in a previous assignment
- Note: we also include the frequency table for age as well.

library(sjmisc)

nys_fwtrim %>%
  frq(marijuana, alcohol, cheating, vandalism, age)

## Y1-181: USED MARJUANA (marijuana) <numeric> 
## # total N=8625 valid N=7612 mean=2.17 sd=1.35
## 
## Value |            Label |    N | Raw % | Valid % | Cum. %
## ----------------------------------------------------------
##     1 |     None of them | 3591 | 41.63 |   47.18 |  47.18
##     2 | Very few of them | 1281 | 14.85 |   16.83 |  64.00
##     3 |     Some of them | 1231 | 14.27 |   16.17 |  80.18
##     4 |     Most of them |  860 |  9.97 |   11.30 |  91.47
##     5 |      All of them |  649 |  7.52 |    8.53 | 100.00
##  <NA> |             <NA> | 1013 | 11.74 |    <NA> |   <NA>
## 
## Y1-184: USED ALCOHOL (alcohol) <numeric> 
## # total N=8625 valid N=7621 mean=2.82 sd=1.47
## 
## Value |            Label |    N | Raw % | Valid % | Cum. %
## ----------------------------------------------------------
##     1 |     None of them | 2081 | 24.13 |   27.31 |  27.31
##     2 | Very few of them | 1336 | 15.49 |   17.53 |  44.84
##     3 |     Some of them | 1466 | 17.00 |   19.24 |  64.07
##     4 |     Most of them | 1314 | 15.23 |   17.24 |  81.31
##     5 |      All of them | 1424 | 16.51 |   18.69 | 100.00
##  <NA> |             <NA> | 1004 | 11.64 |    <NA> |   <NA>
## 
## Y1-179: CHEATED ON SCHOOL TESTS (cheating) <numeric> 
## # total N=8625 valid N=7528 mean=2.51 sd=1.14
## 
## Value |            Label |    N | Raw % | Valid % | Cum. %
## ----------------------------------------------------------
##     1 |     None of them | 1624 | 18.83 |   21.57 |  21.57
##     2 | Very few of them | 2315 | 26.84 |   30.75 |  52.32
##     3 |     Some of them | 2197 | 25.47 |   29.18 |  81.51
##     4 |     Most of them |  915 | 10.61 |   12.15 |  93.66
##     5 |      All of them |  477 |  5.53 |    6.34 | 100.00
##  <NA> |             <NA> | 1097 | 12.72 |    <NA> |   <NA>
## 
## Y1-180: DESTROYED PROPERTY (vandalism) <numeric> 
## # total N=8625 valid N=7610 mean=1.53 sd=0.79
## 
## Value |            Label |    N | Raw % | Valid % | Cum. %
## ----------------------------------------------------------
##     1 |     None of them | 4718 | 54.70 |   62.00 |  62.00
##     2 | Very few of them | 1992 | 23.10 |   26.18 |  88.17
##     3 |     Some of them |  728 |  8.44 |    9.57 |  97.74
##     4 |     Most of them |  120 |  1.39 |    1.58 |  99.32
##     5 |      All of them |   52 |  0.60 |    0.68 | 100.00
##  <NA> |             <NA> | 1015 | 11.77 |    <NA> |   <NA>
## 
## age <numeric> 
## # total N=8625 valid N=8625 mean=15.87 sd=2.40
## 
## Value |    N | Raw % | Valid % | Cum. %
## ---------------------------------------
##    11 |  252 |  2.92 |    2.92 |   2.92
##    12 |  509 |  5.90 |    5.90 |   8.82
##    13 |  778 |  9.02 |    9.02 |  17.84
##    14 | 1036 | 12.01 |   12.01 |  29.86
##    15 | 1289 | 14.94 |   14.94 |  44.80
##    16 | 1276 | 14.79 |   14.79 |  59.59
##    17 | 1216 | 14.10 |   14.10 |  73.69
##    18 |  947 | 10.98 |   10.98 |  84.67
##    19 |  689 |  7.99 |    7.99 |  92.66
##    20 |  436 |  5.06 |    5.06 |  97.72
##    21 |  197 |  2.28 |    2.28 | 100.00
##  <NA> |    0 |  0.00 |    <NA> |   <NA>

As you can see from the tables that outputted above, the “None of them” answer category is indeed the modal answer category for three out of four of these peer variables, though the degree of skewness varies across items.
- Note: There are a lot of reasons why Warr (1993) may have recoded the peer variables the way he did for Figure 1. For instance, one reason might be because binary variables (i.e., those with two response values) are somewhat easier to visualize across age categories. Moreover, since those reporting that “none” of their friends engage in delinquency theoretically are exposed to “no dosage” of pro-delinquency peer influence processes whereas those with “any” delinquent friends theoretically are exposed to “at least some dosage” of pro-delinquency peer influence processes, his particular binary cutoff might have made the most sense theoretically. Finally, yet another reason Warr (1993) may have used this particular cutoff might be due to concerns regarding measurement validity. The distinction between “none of them” and “very few of them” or more is perhaps the clearest distinction in the answer categories. This happens often with ordinal variables where you have a clear ordering but no precise distance between the categories.

2. View cross-tabulations of peer delinquency variables by age.

You can also get a quick descriptive look at these differences by age using the flat_table() function in the “sjmisc” package.

nys_fwtrim %>%
  flat_table(marijuana, age, margin = "col") #note: margin = "col" tells it to give me column percentages

##                  age    11    12    13    14    15    16    17    18    19    20    21
## marijuana                                                                             
## None of them         94.86 87.33 75.97 63.68 50.82 39.74 32.87 24.94 26.59 23.87 18.63
## Very few of them      3.27  8.60 11.51 15.28 17.00 19.74 19.89 20.66 16.72 19.63 21.74
## Some of them          0.93  3.39  6.91 10.79 13.65 17.62 21.57 24.82 24.75 21.49 26.71
## Most of them          0.93  0.23  4.32  5.77  9.79 12.86 13.26 16.75 19.73 23.08 18.01
## All of them           0.00  0.45  1.29  4.49  8.76 10.04 12.42 12.84 12.21 11.94 14.91

nys_fwtrim %>%
  flat_table(alcohol, age, margin = "col")

##                  age    11    12    13    14    15    16    17    18    19    20    21
## alcohol                                                                               
## None of them         86.92 78.60 57.90 40.02 24.34 18.05 12.14  8.20  7.02  6.88  8.59
## Very few of them      7.94 12.39 20.69 22.95 25.19 20.16 16.34 13.22 10.37  7.14  6.13
## Some of them          2.34  5.86 12.36 20.38 22.71 22.27 23.90 20.56 18.90 18.78 19.63
## Most of them          1.40  1.80  5.60  9.61 14.65 20.25 22.50 27.54 25.75 27.78 29.45
## All of them           1.40  1.35  3.45  7.04 13.11 19.28 25.12 30.48 37.96 39.42 36.20

nys_fwtrim %>%
  flat_table(cheating, age, margin = "col")

##                  age    11    12    13    14    15    16    17    18    19    20    21
## cheating                                                                              
## None of them         38.68 29.28 22.33 17.08 15.85 12.00 13.88 22.98 32.76 47.95 52.17
## Very few of them     29.72 39.86 36.74 31.46 28.68 27.29 28.52 30.93 31.54 30.96 26.71
## Some of them         22.17 19.14 24.50 31.14 31.96 36.53 33.99 29.07 26.17 15.07 15.53
## Most of them          5.66  7.88 10.23 13.84 15.59 15.02 14.54 12.17  7.45  4.66  4.35
## All of them           3.77  3.83  6.20  6.49  7.92  9.16  9.07  4.84  2.08  1.37  1.24

nys_fwtrim %>%
  flat_table(vandalism, age, margin = "col")

##                  age    11    12    13    14    15    16    17    18    19    20    21
## vandalism                                                                             
## None of them         64.32 65.92 59.86 60.53 57.56 60.85 62.71 61.52 64.99 68.70 76.69
## Very few of them     24.88 24.89 28.06 27.17 28.18 26.72 25.05 25.49 26.80 23.61 14.11
## Some of them          7.04  8.07  9.50  9.41 11.00 10.05 10.09 11.40  7.04  6.63  7.98
## Most of them          2.35  0.90  1.73  2.25  2.66  1.59  1.12  0.98  0.84  0.80  0.61
## All of them           1.41  0.22  0.86  0.64  0.60  0.79  1.03  0.61  0.34  0.27  0.61

Compare the percentages reported in the “none of them” row across each age group to the percentages displayed in Warr’s (1993) Figure 1. This is essentially the data that Warr (1993) plots in Figure 1.
At this point, many researchers might be tempted to simply “copy and paste” or otherwise transfer these tables into Excel and then plot from there. I (Jon) did exactly that for a long time. Do not do that. For one thing, Excel has pretty infamous reproducibility issues (see here). Also, if you stop here and move to Excel, you will never learn about the ggplot2 package, which is an immensely powerful data visualization package that allows you to create and customize virtually any plot that you can imagine. But first, we still have a little more data wrangling to do.

3. Dichotomize peer delinquency variables

We need to recode each of the peer delinquency variables so that they have two categories: 1) “No Delinquent Friends” and 2) “At least very few delinquent friends.” To do that, we need to bring back our friend the mutate() function along with the ifelse logical operator. Below, I will do it for the first two variables in the figure: 1) Marijuana and 2) Alcohol.

nys_fwtrim_dic <- nys_fwtrim %>%
  mutate(mardic = ifelse(marijuana == 1, 1, 0),
         alcdic = ifelse(alcohol == 1, 1, 0))

nys_fwtrim_dic %>%
  flat_table(marijuana, mardic)

##                  mardic    0    1
## marijuana                        
## None of them               0 3591
## Very few of them        1281    0
## Some of them            1231    0
## Most of them             860    0
## All of them              649    0

nys_fwtrim_dic %>%
  flat_table(alcohol, alcdic)

##                  alcdic    0    1
## alcohol                          
## None of them               0 2081
## Very few of them        1336    0
## Some of them            1466    0
## Most of them            1314    0
## All of them             1424    0

In the above code, I (Jake) created a new data set object called nys_fwtrim_dic. I could have also just overwritten the “nys_fwtrim” data set but, as I mentioned before, I generally try to avoid overwriting objects so I can keep clear exactly what is in the objects I create. I assume there are different perspectives on this practice, as my approach can lead to the creation of a lot of superfluous objects in my R environment. (Of course, if you followed our recommended RStudio “Global options” settings, then you will be starting with a clean environment for every R session, which makes this issue a bit more tolerable.)
In dichotomizing the variables, we created what are typically called “dummy” variables. In this case, our dummy variables have the value of 1 if the respondent reports “none of them” for the marijuana and alcohol peer delinquency variables, and they have the value of 0 otherwise. There are other ways to treat these types of variables, but this “dummy coding” method will come in handy later.
- Note: I also checked that the dichotomization worked as I expected it to by using the flat_table() command after my recoding commands. Recall, this is an important step that a lot of people who are just learning to recode and wrangle data often are tempted to skip. At risk of repeating the obvious at this point, you should always build in checks to your code to make sure that your code is doing exactly what you want it to do - no more and no less.

4. Dummy code the “cheating” and “vandalism” variables and check that it worked.

Show your code to create your dummy variables and check that it worked.
- Note: We recommend re-creating the “marijuana” and “alcohol” dummy variables as well and writing over the nys_fwtrim_dic data object with a data object that includes all four dummy variables.

Part 6 (Assignment 3.6)

Goal: Recreate the first part of Warr’s (1993) Figure 1

We are ready to start plotting!

To do that, we will use the “ggplot2” package, which is one of the core packages in the “Tidyverse.” For more about the logic behind this package and far more details on how it works than we will provide below, check out: Kieran Healy’s book Data Visualization: A Practical Introduction; Ch. 3: Data Visualization from R for Data Science; and Danielle Navarro’s video series on ggplot2.

For now, just know that “ggplot2” is a powerful data visualization package that is almost limitless in its plotting capabilities. People even create amazing art using the package (see: Danielle Navarro’s website; the #rtistry hashtag on Twitter; and learn how to create art yourself with Navarro’s video series about creating art with ggplot2).

1. Understanding the basic logic of “ggplot2”.

The “ggplot2” package is based on a famous data visualization book by Leland Wilkinson called The Grammar of Graphics. Hadley Wickham built ggplot2 to provide a layered approach to this model of data visualization. The key to ggplot2 is to understand that you can build a plot in layers after telling ggplot() what data set you are using and mapping specific aesthetics to your plot (e.g. what variables to put on the x-axis and y-axis).
The main function is ggplot(). If we simply write ggplot(), we will get a blank canvas. And maybe this is a useful way to think about a chart. You are starting with a blank canvas, and the goal is to display information in a clear and insightful manner. Here is how Hadley Wickham says it in the “Data Visualization” chapter of his book:

With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to.
So, let’s try to build your intuition about what “ggplot2” is doing by building a simple plot in layers. First, let’s start by simply telling R to start a plot.

library(tidyverse)

nys_fwtrim_dic %>%
  ggplot()

Starting a ggplot() simply creates a blank canvas. Notice how we first specified our data object, and then we used a pipe (%>%) to start the canvas with the ggplot() command? We could have also specified the data within the ggplot() command, like this: ggplot(nys_fwtrim_dic). Once you get into more complicated plots that include multiple layers and, potentially, multiple data sources from different objects or data sets, specifying the data within the ggplot() command may be preferred. For now, we will keep with the approach above that starts with the data and then pipes to the ggplot() command.
Now that we have our blank canvas, we can start adding features or mapping aesthetics to it. Let’s start by adding the age variable on the x-axis as in Warr’s (1993) Figure 1.

nys_fwtrim_dic %>%
  ggplot(aes(x = age))

Now our blank canvas has the age variable mapped to the x-axis. The plot itself is still blank because we have not told R how to visualize the age variable. In ggplot(), you do this with a “geom.” Here is what Wickham and Grolemund said about geoms in R for Data Science:

A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.
There are lots of different geoms built into the ggplot2 package, and a lot of other people create various geoms for specialized plots (see, for example, Matthew Kay’s work here and here).

We will start by creating a simple bar chart of the age distribution in our data.

nys_fwtrim_dic %>%
  ggplot(aes(x = age)) + 
  geom_bar(fill = "lightblue")

Note: We added the “geom” layer to the plot by using the + symbol and then telling ggplot() what I wanted to add (in this case a “geom_bar” ). Once you start a ggplot(), this is generally how you add layers and/or features to the chart - you simply follow the logic of starting with a basic plot and then sequentially adding (+) layers to it.
Note: In this plot, we told R to use the geom_bar() geom to represent the age data as a bar chart. Also, note that we told geom_bar() to make the bars light blue with fill = "lightblue". There are numerous built in colors available in R (see here) and you can enter hexadecimal color values by placing a “#” before the appropriate value (e.g., fill = #add8e6). In general, we recommend using colorblind-friendly palettes whenever possible.
Note: the fill controls the color of the fill for a particular geom. If you want to change the color of the lines around the individual bars, you would add a color = command to the geom. Here I’ll add color = "black" in order to make the bars stand out a little more.

nys_fwtrim_dic %>%
  ggplot(aes(x = age)) + 
  geom_bar(fill = "lightblue", color = "black")

We could spend a lot of time going through these kinds of basic ggplot skills. For now, we want just want you to understand that any ggplot is essentially built as a bunch of layers with specific aesthetics mapped to different variables and visual features as well as different geoms used to represent the data in specific ways. Let’s move on the plots that we are actually trying to re-create from Warr’s (1993) Figure 1.

2. Wrangle data and plot part of Figure 1

Sometimes, as above, we can just take the raw data and plot it. But this is actually pretty rare. Usually we need to wrangle and recode data to get it in the form we want to plot. In this case, you already started this process earlier when you created dichotomous variables for the peer delinquency variables (e.g., “mardic” and “alcdic”). Now, we need to use these dichotomous variables to create a summary of the data that reflects the “percentage of respondents with no delinquent friends” for each age group. To do that we will go back to “dplyr” and use the group_by() and summarize() functions. Essentially, we want to create a data frame that looks like the top row in from the flat_table() command that we used above to summarize the data.

nys_fwmarsum = nys_fwtrim_dic %>%
  drop_na(marijuana) %>%
  group_by(age) %>%
  summarize(mean_mar = mean(mardic),
            perc_mar = 100 * mean_mar)

nys_fwalcsum = nys_fwtrim_dic %>%
  drop_na(alcohol) %>%
  group_by(age) %>%
  summarize(mean_alc = mean(alcdic),
            perc_alc = 100 * mean_alc)

nys_fwmarsum

## # A tibble: 11 × 3
##      age mean_mar perc_mar
##    <dbl>    <dbl>    <dbl>
##  1    11    0.949     94.9
##  2    12    0.873     87.3
##  3    13    0.760     76.0
##  4    14    0.637     63.7
##  5    15    0.508     50.8
##  6    16    0.397     39.7
##  7    17    0.329     32.9
##  8    18    0.249     24.9
##  9    19    0.266     26.6
## 10    20    0.239     23.9
## 11    21    0.186     18.6

nys_fwalcsum

## # A tibble: 11 × 3
##      age mean_alc perc_alc
##    <dbl>    <dbl>    <dbl>
##  1    11   0.869     86.9 
##  2    12   0.786     78.6 
##  3    13   0.579     57.9 
##  4    14   0.400     40.0 
##  5    15   0.243     24.3 
##  6    16   0.180     18.0 
##  7    17   0.121     12.1 
##  8    18   0.0820     8.20
##  9    19   0.0702     7.02
## 10    20   0.0688     6.88
## 11    21   0.0859     8.59

In the above code, we created two separate summary data sets for each of the first two peer delinquency variables:
- First, we told R to drop the missing values for each peer delinquency variable with the drop_na() function.
- Second, we told R to group the data set by the variable “age.” This basically tells R to perform whatever functions come next within the values of the grouping variable - in this case age (e.g., within age==11; then within age==12; etc.).
- Third, we used the summarize function to create a new summary data set with the variables calculated as instructed in the parentheses.
  - The first variable we created using the mean function provides the proportion of respondents in each age group who reported having no delinquent friends for the specific type of delinquency. In this case, the mean function returns a proportion only because we had created the “mardic” and “alcdic” variables as “dummy variables” earlier. When you calculate the mean of a dummy variable, the mean represents the proportion of cases that have the category coded as 1.
  - Then, we simply multiplied that proportion by 100 to get a percentage (“perc_mar” and “perc_alc”) rather than a proportion.

3. Plot the Data

Now that we have these summary data, plotting them is relatively easy. We just need to tell ggplot what data to use and the specific geom we want to represent the data as. There are some other details, but I think we can get something that looks like Warr’s (1993) Figure 1 relatively quickly.

theme_set(theme_classic())

marplot <- nys_fwmarsum %>%
  ggplot(aes(x = age, y = perc_mar)) + 
  geom_line() + 
  geom_point(shape = "square")

alcplot <- nys_fwalcsum %>%
  ggplot(aes(x = age, y = perc_alc)) + 
  geom_line() + 
  geom_point(shape = "square")

marplot

alcplot

There are a few of things to note in the above code:
- First, notice that we constructed the plots with two geoms (geom_point and geom_line). This is ggplot2’s layering at work. In order to create the line plot with the points, we needed to first draw the line, then add the points (we could have drawn the points and then the lines, but then the line would technically be in front of the points).
- Second, notice how we specified the specific shape we wanted inside the geom_point() command with shape = "square". We could have also specified the number code for a solid square with shape = 15 to get the same thing. There are lots of different built-in shapes you can use for points (see here).
- Third, notice that above the code for the two plots, we included the line theme_set(theme_classic()). ggplot2 has multiple built-in themes and theme_classic() is the most similar to the look or aesthetics used in Warr’s (1993) Figure 1 plots. Of course, these themes are all customizable, and there are multiple packages you can download with additional themes (e.g., “ggthemes”).

4. Customize plot details

In the above plots, the actual data displayed appears to match up with that of Warr (1993). However, the details of the plot (e.g. axis labels, scales, etc.) are not exactly how we want them. The good news is that virtually everything within ggplot2 is customizable. So, we should be able to make these changes relatively easily.

marplot <- nys_fwmarsum %>%
  ggplot(aes(x = age, y = perc_mar)) + 
  geom_line() + 
  geom_point(shape = "square") + 
  scale_x_continuous(limits = c(11, 21), breaks = 11:21) + 
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) + 
  labs (title = "Marijuana", x = "Age", y = "Percent") + 
  theme(plot.title = element_text(hjust = 0.5, size = 10),
        axis.title = element_text(size = 10))

alcplot <- nys_fwalcsum %>%
  ggplot(aes(x = age, y = perc_alc)) + 
  geom_line() + 
  geom_point(shape = "square") + 
  scale_x_continuous(limits = c(11, 21), breaks = 11:21) + 
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) + 
  labs (title = "Alcohol", x = "Age", y = "Percent") + 
  theme(plot.title = element_text(hjust = 0.5, size = 10),
        axis.title = element_text(size = 10))
        
marplot

alcplot

We essentially added four things to the above plots.
- First, we added both scale_x_continuous and scale_y_continuous functions. In each of these, we set the limits of the scale on the x-axis and y-axis, respectively, and we set the number of breaks. For the x-axis, we simply instructed ggplot2 to place a break at each whole number from 11 to 21 (breaks = 11:21). For the y-axis, we told ggplot2 to place a break at every 10 units between zero and one hundred (breaks = seq(0, 100, 10)).
- Second, we added the labs command, which allowed us to specify the labels we wanted for the “title” of the plot and the x-axis and y-axis respectively (see here for details).
- Finally, we made some small adjustments to the underlying theme we set. If you want to see what is different, you can run the code above without the theme arguments.

nys_fwmarsum %>%
  ggplot(aes(x = age, y = perc_mar)) + 
  geom_line() + 
  geom_point(shape = "square") + 
  scale_x_continuous(limits = c(11, 21), breaks = 11:21) + 
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) + 
  labs (title = "Marijuana", x = "Age", y = "Percent")

As you can see, the title of the plot is now left-justified and the font is larger. We simply modified this in the previous plots by specifying a particular font size in the theme() command. The hjust = 0.5 simply tells R to place the title of the plot in the center of ggplot’s coordinate system (find out more here). The size = 10 simply tells ggplot2 to make the plot title and axis titles size 10 point font. There are almost endless ways to modify a ggplot2 theme (see here for details).

5. Combine separate plots into one.

The last thing we are going to do is show you how to combine separate plots into one, using the “patchwork” package. There are lots of ways to produce multiple plots on the same overall canvas. The “patchwork” package seems, to us, to be the easiest way to combine plots.

library(patchwork)
fig1_maralc = (marplot | alcplot) +
  plot_annotation(
    title = "Figure 1. Percentage of Respondents with No Delinquent Friends, by Age and Offense",
    theme = theme(plot.title = element_text(hjust = 0.5))) 
fig1_maralc

Note: Look at the options for the R chunk (the text following the {r) - here we included knitted dimensions for the figure by writing [r, fig.width = 9, fig.height = 6]. This simply tells RMarkdown to produce a figure of those dimensions (essentially the size of regular letter-sized paper with one inch margins) - this will be particularly useful when knitting the final document.
Note: In the above plot, we could have placed the two plots on top of each other instead by separating them with a / rather than a |, like so:

fig1_maralc_vert <- (marplot / alcplot) +
  plot_annotation(
    title = "Figure 1. Percentage of Respondents with No Delinquent Friends, by Age and Offense",
    theme = theme(plot.title = element_text(hjust = 0.5))) 
fig1_maralc_vert

Note: If I had four separate plots and I wanted to arrange them all in a grid, I would just include the plots I wanted next to each other in parentheses like so:

fig1_maralc_grid <- (marplot | alcplot) / (marplot | alcplot) + 
  plot_annotation(
    title = "Figure 1. Percentage of Respondents with No Delinquent Friends, by Age and Offense",
    theme = theme(plot.title = element_text(hjust = 0.5))) 
fig1_maralc_grid

In the above plot I just plotted the same marijuana and alcohol plots on the bottom. But, if you had two different plots, say one of peer cheating and peer vandalism, you could just replace the second set of objects with those and it would produce the top part of Warr’s (1993) Figure 1 displayed on pg. 22.

6. “Draw the Owl”

You should now have everything that you need to recreate the first part of Figure 1 from Warr (1993) on pg. 22. If you have followed along up to this point, you should have the pooled data with dummy coded peer-deliquency variables needed to create summary data frames as done in Assignment section 3.6 above. You should also have the first two plots for Marijuana and Alcohol.

All that is left to do is:

Create summary data for the “Cheating” and “Vandalism” peer delinquency measures
Plot the individual plots for the “Cheating” and “Vandalism” peer delinquency measures; and
Use “patcwhork” to combine the four plots so they look like the first part of Warr’s plot on pg. 22.

Note: Pay attention to the y-axis scales in the plots for “Cheating” and “Marijuana” as they are not the same as in “Marijuana” and “Alcohol.”

Remember:

Keep the file clean and easy to follow by using RMD level headings (e.g., denoted with ## or ###) separating R code chunks, organized by assignment parts or questions.
Write plain text after headings and before or after code chunks to explain what you are doing. This is not just for my sake - such text will serve as useful reminders to you when working on later assignments!
Upon completing the assignment, “knit” your final RMD file again and save the final knitted html document to your “Assignments” folder in your LastName_P680_work folder as: LastName_P680_Assign3_YEAR_MO_DY.
Inside the “LastName_P680_commit” folder in our shared folder, create another folder named: Assignment 3.
To submit your assignment for grading, save copies of both your (1) final knitted “Assign3” html file and (2) your “Assign3_RMD” file into the “LastName_P680_commit / Assignment 3” folder.

Remember, be sure to save copies of both files - do not just drag the files over from your “work” folder, or you may lose those original copies from your “work” folder.

Finally, just to give you a sense that you are on the right track, your final plot should look something like this:

R Assignment 3: Downloading Data & Reproducing Figure

Jake Day and Jon Brauer

9/9/2021

Assumptions & Ground Rules

The Study

The Data: National Youth Survey

Part 1 (Assignment 3.1)

Goal: Download Wave 1 of the NYS Data from ICPSR Website

1. Download the SPSS data

2. Move the data to your “Datasets” subfolder in your root “work” folder and load it into R using the “haven” package like you have done before. Assign it to an object named “nys_w1_icpsr.”

Part 2 (Assignment 3.2)

Goal: Download First Five Waves of the NYS Data within R

1. Create “NYS” Subfolder

2. Install, load, and use the `icpsrdata` package

3. Read the data into R.

Part 3 (Assignment 3.3)

Goal: Identify Variables used by Warr (1993)

1. Identify Variables used by Warr in first part of Figure 1.

2. Identify the items in each data set that correspond to the variables above

Part 4 (Assignment 3.4)

Goal: Select Items and Pool Data

1. Select items to use in analyses and give them informative names

2. Rename existing variables and create a new variable indicating NYS wave.

3. Pool data from all five waves

4. Filter the pooled data

Part 5 (Assignment 3.5)

Goal: Recode the Pooled Data

1. View frequency tables for key variables

2. View cross-tabulations of peer delinquency variables by age.

3. Dichotomize peer delinquency variables

4. Dummy code the “cheating” and “vandalism” variables and check that it worked.

Part 6 (Assignment 3.6)

Goal: Recreate the first part of Warr’s (1993) Figure 1

1. Understanding the basic logic of “ggplot2”.

2. Wrangle data and plot part of Figure 1

3. Plot the Data

4. Customize plot details

5. Combine separate plots into one.

6. “Draw the Owl”

R Assignment 3: Downloading Data & Reproducing Figure

Jake Day and Jon Brauer

9/9/2021

Assumptions & Ground Rules

The Study

The Data: National Youth Survey

Part 1 (Assignment 3.1)

Goal: Download Wave 1 of the NYS Data from ICPSR Website

1. Download the SPSS data

2. Move the data to your “Datasets” subfolder in your root “work” folder and load it into R using the “haven” package like you have done before. Assign it to an object named “nys_w1_icpsr.”

Part 2 (Assignment 3.2)

Goal: Download First Five Waves of the NYS Data within R

1. Create “NYS” Subfolder

2. Install, load, and use the icpsrdata package

3. Read the data into R.

Part 3 (Assignment 3.3)

Goal: Identify Variables used by Warr (1993)

1. Identify Variables used by Warr in first part of Figure 1.

2. Identify the items in each data set that correspond to the variables above

Part 4 (Assignment 3.4)

Goal: Select Items and Pool Data

1. Select items to use in analyses and give them informative names

2. Rename existing variables and create a new variable indicating NYS wave.

3. Pool data from all five waves

4. Filter the pooled data

Part 5 (Assignment 3.5)

Goal: Recode the Pooled Data

1. View frequency tables for key variables

2. View cross-tabulations of peer delinquency variables by age.

3. Dichotomize peer delinquency variables

4. Dummy code the “cheating” and “vandalism” variables and check that it worked.

Part 6 (Assignment 3.6)

Goal: Recreate the first part of Warr’s (1993) Figure 1

1. Understanding the basic logic of “ggplot2”.

2. Wrangle data and plot part of Figure 1

3. Plot the Data

4. Customize plot details

5. Combine separate plots into one.

6. “Draw the Owl”

2. Install, load, and use the `icpsrdata` package