Assumptions & Ground Rules

The purpose of this assignment is to reproduce findings from a published study in R, particularly one where the data are housed on ICPSR. To accomplish this, we will need to learn how to do the following:

  1. Identify the data used in the now-classic study by Mark Warr, published in Criminology and titled “Age, Peers, and Delinquency” (Warr, 1993).
  2. Download the data from ICPSR.
  3. Identify and wrangle the specific variables/items used in the study using the dplyr package, which is part of the Tidyverse.
  4. Combine multiple data sets into one.
  5. Reproduce the first part of Figure 1 from Warr’s (1993) study (as displayed on pg. 22 in his article) using ggplot2, a powerful data visualization package that is also part of the Tidyverse.

We assume that you are now familiar with installing and loading packages in R. Thus, when you see a package being used, we expect that you know it needs to be installed and that it needs to be loaded within your own R session in order to use it.
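
As a quick refresher, here is a minimal sketch of checking for and loading the packages named in this assignment (haven, dplyr, and ggplot2). It only reports missing packages rather than installing them for you, and the object names `needed` and `installed` are ours and purely illustrative.

```r
# Packages named in this assignment.
needed <- c("haven", "dplyr", "ggplot2")

# requireNamespace() returns TRUE/FALSE without attaching the package.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Report anything that is missing instead of installing it automatically.
if (any(!installed)) {
  message(
    "Install first: install.packages(c(",
    paste0('"', needed[!installed], '"', collapse = ", "),
    "))"
  )
}

# Attach only the packages that are available in this R session.
for (pkg in needed[installed]) {
  library(pkg, character.only = TRUE)
}
```

If a package shows up in the message, install it once and rerun the chunk.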

At this point, we also assume you are familiar with RStudio and with creating R Markdown (RMD) files. If not, please review R Assignments 1 & 2.

  • Note: For this assignment, have R Markdown knit to an HTML file.

As with previous assignments, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste from the instructions except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).

  • For this assignment, there are points where you will repeat code structures but change certain details. For now, it may be efficient for you to copy and paste from your own code in these situations. However, it is important to recognize that copying and pasting repeatedly is generally considered an ill-advised and error-prone coding/programming practice (see R for Data Science for an introduction to functions). Eventually, you want to be able to use existing functions and write your own simple functions for accomplishing these repetitive tasks.

    • We will not introduce writing functions in this class and, to be honest, we are still working on consistently using functions in our own work. However, one of the key advantages of using R is that it is a functional programming language. So, unlocking the ability to use functions really does supercharge your R abilities (R for Data Science and Advanced R provide book introductions to functional programming in R and Danielle Navarro has a video series about functional programming in R).
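
To make the idea concrete, here is a minimal sketch of replacing copy-and-paste with a small function. The helper `prop_yes()` and the toy data frame are hypothetical; they are not part of the assignment or the NYS data.

```r
# Hypothetical helper: proportion of non-missing values equal to 1
# in a 0/1 coded item.
prop_yes <- function(x) {
  mean(x == 1, na.rm = TRUE)
}

# Toy data standing in for survey items (not NYS data).
toy <- data.frame(
  item_a = c(1, 0, 1, NA, 1),
  item_b = c(0, 0, 1, 1, NA)
)

# Instead of copying the same line once per column,
# apply the helper to every column in one call.
props <- vapply(toy, prop_yes, numeric(1))
props
```

If the calculation ever needs to change, you change it once inside `prop_yes()` rather than in every pasted copy.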

The Study

I recommend doing a quick AIC reading of Warr’s study. To get you started, here is the abstract:

Hirschi and Gottfredson (1983; Gottfredson and Hirschi, 1990) have argued that the age distribution of crime cannot be explained by any known variables, and they point specifically to the failure of sociological theories to explain this phenomenon. This paper examines a quintessentially sociological theory of crime, differential association, and evaluates its ability to explain the age distribution of crime. Analysis of data from the National Youth Survey on persons aged 11-21 reveals that peer relations (exposure to delinquent peers, time spent with peers, loyalty to peers) change dramatically over this age span, following much the same pattern as crime itself. When measures of peer influence are controlled, the effects of age on self-reported delinquency are largely rendered insignificant. Additional analyses show that delinquent friends tend to be “sticky” friends (once acquired, they are not quickly lost) and that Sutherland’s arguments concerning the duration and priority of delinquent associations are only partially correct.

The paper is fundamentally about investigating the age-crime relationship, and specifically about examining whether delinquent peer influence explains the age distribution of crime. To investigate this, Warr (1993) uses the first five waves of the National Youth Survey (NYS). Here is the opening of the “Data and Methods” section, where he outlines the specific data he is analyzing (p. 19):

The data for this study come from the National Youth Survey (NYS). The NYS is a longitudinal study of a national probability sample of 1,726 persons aged 11-17 in 1976 (see Elliott et al., 1985). The sample was obtained through a multistage, cluster sampling of households in the continental United States in 1976. Five consecutive annual waves of the survey were conducted from 1976 through 1980, and these five are used in this analysis.

The Data: National Youth Survey

Fortunately, the data for Warr’s study are available on ICPSR. So, for this assignment, we are going to work with the National Youth Survey (NYS) Series on ICPSR. It includes seven waves of data spanning 1976 to 1987. Overseen by Delbert Elliott, it was one of the first national-level longitudinal studies specifically designed to measure self-reported crime, and it has been a popular data source for many high-profile criminological studies, including Warr’s (1993). Here is the general description of the series from ICPSR:

For this series, parents and youth were interviewed about events and behavior of the preceding year to gain a better understanding of both conventional and deviant types of behavior by youths. Data were collected on demographic and socioeconomic status of respondents, disruptive events in the home, neighborhood problems, parental aspirations for youth, labeling, integration of family and peer contexts, attitudes toward deviance in adults and juveniles, parental discipline, community involvement, drug and alcohol use, victimization, pregnancy, depression, use of outpatient services, spouse violence by respondent and partner, and sexual activity. Demographic variables include sex, ethnicity, birth date, age, marital status, and employment of the youths, and information on the marital status and employment of the parents.

In what follows, we will walk through two different ways to download data from ICPSR. First, we will download data directly from the ICPSR website. Second, we will download data from ICPSR entirely within the R environment.

Part 1 (Assignment 3.1)

Goal: Download Wave 1 of the NYS Data from ICPSR Website

Start by navigating to the ICPSR landing page for Wave 1 of the NYS (ICPSR 8375). You’ll notice that ICPSR offers the NYS data for download in many file formats. Unfortunately, R is not one of them. However, this is a case where you could download the Stata, SPSS, or SAS data files and import them into R using the haven package. Indeed, if you only needed one data file, you could simply download the SPSS file directly from ICPSR, place it in your “Datasets” subfolder within your root “work” folder, and then import it into R using haven’s read_spss() function (see previous R Assignments for examples). Let’s go ahead and do that with Wave 1 of the NYS data from 1976.
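
As a preview of the import step, here is a minimal sketch. We assume your working directory is the root “work” folder and that the file is saved under the hypothetical name `nys_wave1.sav`; the file you actually download from ICPSR will have a different name, so adjust the path accordingly.

```r
# Assumed location and file name; replace with what you actually downloaded.
nys_w1_path <- file.path("Datasets", "nys_wave1.sav")  # hypothetical file name

# Only attempt the import if the file exists and haven is installed.
if (file.exists(nys_w1_path) && requireNamespace("haven", quietly = TRUE)) {
  nys_w1 <- haven::read_spss(nys_w1_path)
  dim(nys_w1)  # number of rows (respondents) and columns (variables)
} else {
  message("Check that haven is installed and that a file exists at: ", nys_w1_path)
}
```

Building the path with `file.path()` keeps the code portable across operating systems.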

1. Download the SPSS data

Downloading SPSS Data

  • It is instructive to look at what is actually downloaded from ICPSR. First, notice that ICPSR downloads a zip file titled “” to your computer.

  • Opening this zip file reveals a folder titled “ICPSR_08375” inside.

  • When you open this folder, notice it contains several files that tell you about the data, but not the actual data. There is also another folder titled “DS0001” inside.