The purpose of this assignment is to reproduce findings from a published study in R, particularly one where the data are housed on ICPSR. In order to accomplish this we will need to do and learn the following:
We assume that you are now familiar with installing and loading packages in R. Thus, when you see a package being used, we expect that you know it needs to be installed and that it needs to be loaded within your own R session in order to use it.
At this point, we also assume you are familiar with RStudio and with creating R Markdown (RMD) files. If not, please review R Assignments 1 & 2.
As with previous assignments, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste from the instructions except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).
For this assignment, there are points where you will repeat code structures but change certain details. For now, it may be efficient for you to copy and paste from your own code in these situations. However, it is important to recognize that copying and pasting repeatedly is generally considered an ill-advised and error-prone coding/programming practice (see R for Data Science for an introduction to functions). Eventually, you want to be able to use existing functions and write your own simple functions for accomplishing these repetitive tasks.
I recommend doing a quick AIC reading of Warr’s study. To get you started, here is the abstract:
Hirschi and Gottfredson (1983; Gottfredson and Hirschi, 1990) have argued that the age distribution of crime cannot be explained by any known variables, and they point specifically to the failure of sociological theories to explain this phenomenon. This paper examines a quintessentially sociological theory of crime-differential association-and evaluates its ability to explain the age distribution of crime. Analysis of data from the National Youth Survey on persons aged ll-21 reveals that peer relations (exposure to delinquent peers, time spent with peers, loyalty to peers) change dramatically over this age span, following much the same pattern as crime itself. When measures of peer influence are controlled, the effects of age on self-reported delinquency are largely rendered insignificant. Additional analyses show that delinquent friends tend to be ’’sticky” friends (once acquired, they are not quickly lost) and that Sutherland’s arguments concerning the duration and priority of delinquent associations are only partially correct.
The paper is fundamentally about investigating the age-crime relationship and specifically examining whether delinquent peer influence explains the age-distribution of crime. To investigate this, Warr (1993) uses the first five waves of the National Youth Survey (NYS). Here is the first section of the “Data and Methods” section where he outlines the specific data he is analyzing (p. 19):
The data for this study come from the National Youth Survey (NYS). The NYS is a longitudinal study of a national probability sample of 1,726 persons aged 11-17 in 1976 (see Elliott et al., 1985). The sample was obtained through a multistage, cluster sampling of households in the continental United States in 1976. Five consecutive annual waves of the survey were conducted from 1976 through 1980, and these five are used in this analysis.
Fortunately, the data for Warr’s study are available on ICPSR. So, for this assignment, we are going to work with the National Youth Survey (NYS) Series on ICPSR. It includes seven waves of data spanning 1976 to 1987. Overseen by Delbert Elliott, it was one of the first national-level longitudinal studies specifically designed to measure self-reported crime and has been a popular data source for many high profile criminological studies, including Warr’s (1993). Here is the general description of the series from ICPSR:
For this series, parents and youth were interviewed about events and behavior of the preceding year to gain a better understanding of both conventional and deviant types of behavior by youths. Data were collected on demographic and socioeconomic status of respondents, disruptive events in the home, neighborhood problems, parental aspirations for youth, labeling, integration of family and peer contexts, attitudes toward deviance in adults and juveniles, parental discipline, community involvement, drug and alcohol use, victimization, pregnancy, depression, use of outpatient services, spouse violence by respondent and partner, and sexual activity. Demographic variables include sex, ethnicity, birth date, age, marital status, and employment of the youths, and information on the marital status and employment of the parents.
In what follows, we will walk through two different ways to download data from ICPSR. First, we will download data directly from the ICPSR website. Second, we will download data from ICPSR entirely within the R environment.
Start by navigating to the ICPSR landing page for Wave 1 of the NYS
(ICPSR
8375). You’ll notice that the NYS has lots of options in terms of
file formats for which you can download the data. Unfortunately, R is
not one of them. However, this is a case where you could download either
the Stata, SPSS, or SAS data files and import them into R using the haven package. Indeed, if you
only needed one data file, you could simply download the SPSS file
directly from ICPSR, place it in your “Datasets” subfolder within your
root “work” folder, and then import it into R using haven’s
read_spss
command (see previous R Assignments for
examples). Let’s go ahead and do that with Wave 1 of the NYS data from
1976.
The data file itself is in this “DS0001” folder. Sometimes data you download from ICPSR will have multiple data sets associated with them, and each data file will be contained its own folder. So, for example, if there were a second data set associated with this particular wave of the NYS, there would likely be another folder named “DS0002.”
Inside the “DS0001” folder is a file named “08375-0001-Data.sav” - this is the actual data file. The folder also contains the codebook for the data.
library(haven)
library(here)
nys_w1_icpsr <- read_spss(here("Datasets", "08375-0001-Data.sav"))
If we just wanted to download one data set, the above method would
work fine. Technically, you could also repeat what we just did for Waves
2 through 5. However, this is not a very efficient approach. It also
poses some complications in terms of reproducibility because, according
to ICPSR’s bylaws,
technically you are not allowed to share ICPSR data in your own online
repository (e.g. OSF or GitHub). To address these inefficiencies
and potential threats to reproducibility, in this part of the assignment
we will download the first five waves of the NYS data entirely within
the R environment instead using the icpsrdata
package
(see here
and here for more
information).
# check if "NYS" folder exists (TRUE if it does) & create if it does not exist.
ifelse(dir.exists(here("Datasets/NYS")), TRUE, dir.create(here("Datasets/NYS")))
## [1] TRUE
?ifelse
into the console window. Here is the
description of that function:ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.
It takes the form of the following:
ifelse(test, yes, no)
. This means, you give R a logical
test (or a logical question) that can be answered
yes or no and then
it gives you a value or performs another function based on the solution
of that test (i.e., based upon the answer to that question).
In the above code, we are asking if the “NYS” folder exists
within the “Datasets” folder in the working directory (i.e., within your
root “work” folder) with the dir.exists
function. If the
answer is yes, it simply returns the logical
value I told it to - in this case TRUE
. If the answer is
no, you instruct R to create that “NYS” folder
with the dir.create
function. Again, type
?dir.exists
or ?dir.create
for more
information.
If you want to have some fun, you can actually have R return a text string instead of the logical value. For example:
ifelse(dir.exists(here("Datasets/NYS")), "You already created that folder, dummy!", dir.create(here("Datasets/NYS")))
## [1] "You already created that folder, dummy!"
Generally, it is probably not a great idea to have R call the user (yourself in this case) a “dummy” with code you plan to eventually share publicly. Yet, it is also OK to have some fun when doing science. I (Jake) think that having a computer program call me a “dummy” is fun - perhaps you do not.
icpsrdata
packageicpsr_download
function to download files for first
five waves of the NYS based on their ICPSR numbers. Note that you need
to specify the where to find the “NYS” folder you created earlier with
the download_dir
option and the here
package.library(icpsrdata)
icpsr_download(file_id=c(8375, 8424, 8506, 8917, 9112),
download_dir = here("Datasets/NYS"))
Note: when you first run this chunk during an R session you will be asked to enter your ICPSR account information into the R console. R should remember this once you enter it once. So you will likely need to do this once per R session before trying to knit your Rmarkdown document.
Again, to be clear: Check the R Console
after trying to run the icpsr_download
command to see and
respond to the ICPSR username/password prompts. If you have not yet
created a free account on ICPSR (you should have already for an earlier
project assignment), then you will need to do this on the ICPSR website
first. Then, after each prompt in the console, you would put your ICPSR
username instead of “your_icpsr_username” (I added that as a
placeholder) and your ICPSR password instead of
“your_icpsr_password.”
Recall from earlier that the actual data are within a folder
called “DS0001” within each of the ICPSR folders. You simply want to use
the here
package to tell the read_spss
function where to find the SPSS data for each wave of the data. Make
sure you pay close attention to which study numbers are associated with
each specific wave of data!
here
package (which should have been set to your root
“work” folder if you opened your RMD file directly from that folder).
You can also use more traditional path structures as well (e.g.,
“Datasets/NYS”). For more information on using the here
package, see here
(pun intended).nys_w1 <- read_spss(here("Datasets", "NYS", "ICPSR_08375", "DS0001", "08375-0001-Data.sav"))
nys_w2 <- read_spss(here("Datasets", "NYS", "ICPSR_08424", "DS0001", "08424-0001-Data.sav"))
nys_w3 <- read_spss(here("Datasets", "NYS", "ICPSR_08506", "DS0001", "08506-0001-Data.sav"))
nys_w4 <- read_spss(here("Datasets", "NYS", "ICPSR_08917", "DS0001", "08917-0001-Data.sav"))
nys_w5 <- read_spss(here("Datasets", "NYS", "ICPSR_09112", "DS0001", "09112-0001-Data.sav"))
icpsrdownload
package (named “nys_w1” to “nys_w5”).Ok, so we have the data in R; now we need to get them in a usable form. In this case, that means 1) identifying the specific items Warr used from each wave of the NYS to construct the first part of his Figure 1, and 2) combining each of the five waves of data into a single datafile.
As implied by the title of the article, the first key variable across each of the figures that we are going to try to reproduce is “Age.”
The second set of items is “exposure to delinquent peers,” which Warr (1993) describes on pg. 20:
Consider, first, exposure to delinquent peers, that is, the number of delinquents within the respondent’s immediate circle of friends. In the NYS, respondents were asked this question: “Think of the people you listed as close friends. During the last year how many of them have [act]?” ( 1 = none of them, 2 = very few of them, 3 = some of them, 4 = most of them, 5 = all of them). Following the question was a series of acts, including vandalism (“purposely damaged or destroyed property that did not belong to them”), cheating (“cheated on school tests”), marijuana use (“used marijuana or hashish”), petty theft (“stolen something worth less than $5”), alcohol use (“used alcohol”), burglary (“broken into a vehicle or building to steal something”), selling hard drugs (“sold hard drugs such as heroin, cocaine, and LSD”), and grand theft (“stolen something worth more than $50”).
At this point, we will just focus on the first four behaviors: Marijuana, Alcohol, Cheating, and Vandalism.
We are going to assume that you are able to find the proper variables within each respective codebook. Beware, however, that data for wave 1 of the NYS includes data from both a parent survey and a child survey, whereas the subsequent waves are just surveys of the children. Warr (1993) was focused on just the child data, so we want to be sure to select from those items only.
Wave 1 | Wave 2 | Wave 3 | Wave 4 | Wave 5 | |
---|---|---|---|---|---|
icpsr | 8375 | 8424 | 8506 | 8917 | 9112 |
age | V169 | V7 | V10 | V6 | V6 |
marijuana | V367 | V210 | V308 | V288 | V315 |
alcohol | V370 | V213 | V311 | V291 | V318 |
cheating | V365 | V208 | V306 | V286 | V313 |
vandalism | V366 | V209 | V307 | V287 | V314 |
Now that we have the data and know the items we need, the next step is to select the appropriate items in each data set and then pool the data from each separate wave together into one datafile (or one R object). To do this, we will need to learn some new skills and, in particular, a new package that is part of the “Tidyverse”— dplyr. The “dplyr” package is used for data manipulation. Danielle Navarro has a really good video series that introduces you to the details of the various “dplyr” functions. We will only cover the operations we need for this assignment but, if you want to know more, Navarro’s videos are a great place to start.
select()
function in “dplyr” to select those specific items in each of the five
waves of data. That means, for the wave 1 data, we need to select
“V336,” “V370,” “V365,” and “V366” (“V210,” “V213,” “V208,” and “V209”
for wave 2, and so on for waves 3 through 5). We also need to select the
“Age” item in each wave as that is the variable on the x-axis in Figure
1 (“V169” in Wave 1).nys_w1_trim <- nys_w1 %>%
dplyr::select(V169, V367, V370, V365, V366)
nys_w1_trim
In the code above, we are telling R to select “V169,” “V367,” “V370,” “V365,” and “V366” from “nys_w1” and create a new data set object called “nys_w1_trim”. Our new object has the same number of observations, but only 5 variables. You can check this by looking in your RStudio “Environment.”
select()
is one of
those popular commands that frequently poses conflicts when you have
several packages loaded at once. So, in this code chunk, we ensured that
the select()
command was invoked using the
“dplyr” package by appending the package name followed
by two colons directly in front of the command (i.e.,
dplyr::select()
).There are a few problems with the wave 1 trimmed data in its current form:
First, the variable names are not informative. They are simply the names in the original data file. Like with naming your computer files, it is usually a good practice to give informative names to your variables (and other R objects; see part 1 of Navarro’s series on “dplyr”). Using meaningful and systematic naming conventions will also be useful when we combine the data sets, since we can assign the same name across each wave before merging or combining them.
Second, the trimmed data we created has no information about which wave these data are from (except in the object name) nor does it include the unique identifier for individuals. If we were just working with the wave 1 data, this would not be a huge problem; also, since Warr (1993) simply pooled all five waves of data, the individual identifiers are less important. Nonetheless, it is generally good practice to preserve such important information.
In order to rename the five variables we are currently interested
in (i.e., age, marijuana use, alcohol use,
cheating, and vandalism) we will use the
rename()
function. To create a new variable indicating the
wave of the data, we will use the mutate()
function. Both
of these functions are part of the “dplyr” package. The
rename()
function does exactly what it says - it renames
the items, whereas the mutate()
function allows us to
create new variables and to manipulate existing items in the
data.
Let’s do this with the wave 1 data again.
nys_w1_trim
data that we created earlier. Usually, we avoid
writing over objects, as doing so can cause confusion and errors as we
try to keep track of what exactly is in the object. Nonetheless, as you
will see, we are going to repeat the select()
function that
we used before - though, this time, that command will be followed by a
pipe and additional commands using rename
and
mutate
functions. One of the nice things about the “dplyr”
package and the “pipe” (%>%
) you learned about in the
last R Assignment is that you sequentially invoke various commands all
at once within the same code chunk.nys_w1_trim <- nys_w1 %>%
dplyr::select(CASEID, V169, V367, V370, V365, V366) %>%
rename(age = V169,
marijuana = V367,
alcohol = V370,
cheating = V365,
vandalism = V366) %>%
mutate(wave = 1)
nys_w1_trim
A couple things to note about the above code:
First, like before, we are telling R to select “V169”, “V367”,
“V370,” “V365,” and “V366” from “nys_w1.” However, one key difference
this time is that we immediately followed this with a pipe to a
sequential rename
command, which tells R to subsequently
rename specific items after selecting them. Within the
rename
command, we specifically tell R what to rename each
variable by invoking a new name as equal to an old name (i.e. new name =
old name).
Second, the rename
command is then followed by
another pipe to a sequential mutate
command that tells R to
create or modify a variable after completing the rename
command. In this case, our mutate
command tells R to create
a new variable named “wave” that equals “1” to indicate the wave of the
data we are working with (i.e. variable_name = value). Since this
command is not conditional (e.g., there is no ifelse
operator), all 1,725 rows or observations (i.e., “cases” or
“respondents”) from NYS wave 1 will include a variable column named
“wave” with a value that equals “1” in each row. And, of course, the
first line of code tells R to assign all of these operations into a new
object called “nys_w1_trim,” while the last line of code is equivalent
to running the view()
command so we can see our new data
object.
Note: We also included the “CASEID”
item in the select
command above; ICPSR and the NYS made it
easy on us by consistently naming the identifier “CASEID” in each wave
of data.
Note: The mutate()
function can do a lot more than just assign a value to a new variable.
For instance, it can also create a new variable based on a mathematical
formula or based on some function of existing variables. Let’s show you
some simple examples:
nys_w1_trimX <- nys_w1 %>%
dplyr::select(CASEID, V169, V367, V370, V365, V366) %>%
rename(age = V169,
marijuana = V367,
alcohol = V370,
cheating = V365,
vandalism = V366) %>%
mutate(wave = 1,
waveB = 24 / 24,
maralc = marijuana + alcohol,
age_squared = age * age,
age_squaredB = age^2)
nys_w1_trimX
We are really just scratching the surface of what the
mutate()
function can do. In the code above, I show you how
you can perform a mathematical function like division (e.g.,
24 / 24
), add the values of two variables together for each
respondent (e.g., marijuana + alcohol
), and multiply the
values of variables together - in this cases by illustrating two
different ways of squaring the age variable (e.g., age^2
or
age * age
). Again, this is just the beginning of what you
can do with this supremely useful function. Also, take a minute to
briefly examine the first 10 rows of data; doing so will reveal that
these commands worked as intended (e.g. for the first observation the
“maralc” value of 4 is the value for the “marijuana” variable for that
individual–1, plus the value for the “alcohol” variable for that
individual–3). You should develop the essential habit of
always checking to ensure your commands worked
as intended after running them.
Now, let’s create the trimmed data for each of the first five
waves of the NYS. Again, we will write over the nys_w1_trim
data we created above, as we would usually do all of these commands in
the same code chunk. Here is the table of the items as a
reminder:
Wave 1 | Wave 2 | Wave 3 | Wave 4 | Wave 5 | |
---|---|---|---|---|---|
icpsr | 8375 | 8424 | 8506 | 8917 | 9112 |
age | V169 | V7 | V10 | V6 | V6 |
marijuana | V367 | V210 | V308 | V288 | V315 |
alcohol | V370 | V213 | V311 | V291 | V318 |
cheating | V365 | V208 | V306 | V286 | V313 |
vandalism | V366 | V209 | V307 | V287 | V314 |
And here is our code to create five trimmed data objects corresponding to each of the first five waves of NYs data.
#Wave 1:
nys_w1_trim <- nys_w1 %>%
dplyr::select(CASEID, V169, V367, V370, V365, V366) %>%
rename(age = V169,
marijuana = V367,
alcohol = V370,
cheating = V365,
vandalism = V366) %>%
mutate(wave = 1)
nys_w1_trim
#Wave 2:
nys_w2_trim <- nys_w2 %>%
dplyr::select(CASEID, V7, V210, V213, V208, V209) %>%
rename(age = V7,
marijuana = V210,
alcohol = V213,
cheating = V208,
vandalism = V209) %>%
mutate(wave = 2)
nys_w2_trim
#Wave 3:
nys_w3_trim <- nys_w3 %>%
dplyr::select(CASEID, V10, V308, V311, V306, V307) %>%
rename(age = V10,
marijuana = V308,
alcohol = V311,
cheating = V306,
vandalism = V307) %>%
mutate(wave = 3)
nys_w3_trim
#Wave 4:
nys_w4_trim <- nys_w4 %>%
dplyr::select(CASEID, V6, V288, V291, V286, V287) %>%
rename(age = V6,
marijuana = V288,
alcohol = V291,
cheating = V286,
vandalism = V287) %>%
mutate(wave = 4)
nys_w4_trim
#Wave 5:
nys_w5_trim <- nys_w5 %>%
dplyr::select(CASEID, V6, V315, V318, V313, V314) %>%
rename(age = V6,
marijuana = V315,
alcohol = V318,
cheating = V313,
vandalism = V314) %>%
mutate(wave = 5)
nys_w5_trim
Now that we have all five data sets with the same variables and variable names, pooling them together should be relatively easy. But first, we want you to try to build some intuition regarding what Warr (1993) did here when he says that “…all five years of the NYS data were pooled, producing a composite sample of 8,625 persons aged 11-21 (pg. 20).”
Because we created five trimmed data sets with the same variables
in step 3 above, “pooling” the data in this case really just means
stacking the waves of data on top of each other. In other words, I want
to put Wave 1 data on top, Wave 2 data next, then Wave 3 data, then Wave
4 data, until the Wave 5 observations are at the bottom of the data set.
Fortunately, the “dplyr” package makes this relatively easy with the bind_rows()
function. Essentially, when you use the bind_rows()
function, you are simply telling R which data sets to stack on top of
each other by the order in which you list them.
bind_cols()
that tells R to place data sets next to each
other.nys_fwtrim <- bind_rows(nys_w1_trim, nys_w2_trim, nys_w3_trim, nys_w4_trim, nys_w5_trim)
nys_fwtrim
In the code above, we told R to stack waves 1 through 5 on top of each other in chronological order and assign it to the object “nys_fwtrim” (we used fw to indicate we were creating a data set that included all “five waves” of data).
Note: we could still pull out the
individual waves of data from this combined data set using the filter()
function. Of course, the filter()
function would allow us
to subset the data by other variables as well (e.g. select certain age
groups). Here is a quick example:
nys_w3_filter <- nys_fwtrim %>%
filter(wave == 3)
nys_w3_filter
nys_age_filter <- nys_fwtrim %>%
filter(age == 11 | age == 12 | age == 13,
wave == 3)
nys_age_filter
In the code above, the first object we created (“nys_w3_filter”) simply selects all observations from Wave 3 of the data (i.e. recreates the “nys_w3_trim” data set). In the second object we created (“nys_age_filter”), we select subjects who are age 11, 12, or 13 from wave 3. Moving forward, we are just going to work with the pooled data that we created above. But imagine a situation in which you received the pooled data but only wanted to analyze wave 1. We hope this brief diversion gives you a sense of how you could easily go from a larger data set to a smaller data set.
It is also important to point out a couple of logical operators we used in the code above:
==
operator. The double equals signs
(==
) are used when you are performing a logical test or
operation. So, in the above code, you are performing a logical test to
see if the “wave” variable equals three (3
) and, if it
does, to return those observations while filtering out those
observations where it does not. A single equals sign is used when you
are assigning a value to an object or variable. Earlier, when using the
mutate()
function, we used single equals signs
(=
) because we were assigning values to the “wave” variable
(e.g., mutate(wave = 3)
.Now that we have the pooled data, we’re almost ready to start recreating the first part of Figure 1 from Warr (1993). However, we still need to do a little more data cleaning and wrangling before we are ready to plot.
If you look again at Figure 1, you will notice that Warr (1993) reports the “Percentage of Respondents with No Delinquent Friends, by Age and Offense.” Recall the questions about delinquent peers specifically asked the respondents:
Think of the people you listed as close friends. During the last year how many of them have [act]?” with the answer categories: “1 = none of them, 2 = very few of them, 3 = some of them, 4 = most of them, 5 = all of them.”
Warr (1993) chose to focus on those who reported none of their friends engaging in any of those acts. One reason for this may be because the data are fairly skewed. In other words, the modal category for many of the peer delinquency variables is likely “1 = none of them”.
We can check the frequency distributions and modal categories for
each of the four peer delinquency items by creating a frequency table
for them. There are lots of ways to create frequency tables, including
the base R command table()
$
operator to tell R which variable to use from the data
set—table(nys_fwtrim$marijuana)
. For now, we’ll use the
“sjmisc” package that is a part of the same suite of packages as the
“sjPlot” package you used in a previous assignmentlibrary(sjmisc)
nys_fwtrim %>%
frq(marijuana, alcohol, cheating, vandalism, age)
## Y1-181: USED MARJUANA (marijuana) <numeric>
## # total N=8625 valid N=7612 mean=2.17 sd=1.35
##
## Value | Label | N | Raw % | Valid % | Cum. %
## ----------------------------------------------------------
## 1 | None of them | 3591 | 41.63 | 47.18 | 47.18
## 2 | Very few of them | 1281 | 14.85 | 16.83 | 64.00
## 3 | Some of them | 1231 | 14.27 | 16.17 | 80.18
## 4 | Most of them | 860 | 9.97 | 11.30 | 91.47
## 5 | All of them | 649 | 7.52 | 8.53 | 100.00
## <NA> | <NA> | 1013 | 11.74 | <NA> | <NA>
##
## Y1-184: USED ALCOHOL (alcohol) <numeric>
## # total N=8625 valid N=7621 mean=2.82 sd=1.47
##
## Value | Label | N | Raw % | Valid % | Cum. %
## ----------------------------------------------------------
## 1 | None of them | 2081 | 24.13 | 27.31 | 27.31
## 2 | Very few of them | 1336 | 15.49 | 17.53 | 44.84
## 3 | Some of them | 1466 | 17.00 | 19.24 | 64.07
## 4 | Most of them | 1314 | 15.23 | 17.24 | 81.31
## 5 | All of them | 1424 | 16.51 | 18.69 | 100.00
## <NA> | <NA> | 1004 | 11.64 | <NA> | <NA>
##
## Y1-179: CHEATED ON SCHOOL TESTS (cheating) <numeric>
## # total N=8625 valid N=7528 mean=2.51 sd=1.14
##
## Value | Label | N | Raw % | Valid % | Cum. %
## ----------------------------------------------------------
## 1 | None of them | 1624 | 18.83 | 21.57 | 21.57
## 2 | Very few of them | 2315 | 26.84 | 30.75 | 52.32
## 3 | Some of them | 2197 | 25.47 | 29.18 | 81.51
## 4 | Most of them | 915 | 10.61 | 12.15 | 93.66
## 5 | All of them | 477 | 5.53 | 6.34 | 100.00
## <NA> | <NA> | 1097 | 12.72 | <NA> | <NA>
##
## Y1-180: DESTROYED PROPERTY (vandalism) <numeric>
## # total N=8625 valid N=7610 mean=1.53 sd=0.79
##
## Value | Label | N | Raw % | Valid % | Cum. %
## ----------------------------------------------------------
## 1 | None of them | 4718 | 54.70 | 62.00 | 62.00
## 2 | Very few of them | 1992 | 23.10 | 26.18 | 88.17
## 3 | Some of them | 728 | 8.44 | 9.57 | 97.74
## 4 | Most of them | 120 | 1.39 | 1.58 | 99.32
## 5 | All of them | 52 | 0.60 | 0.68 | 100.00
## <NA> | <NA> | 1015 | 11.77 | <NA> | <NA>
##
## age <numeric>
## # total N=8625 valid N=8625 mean=15.87 sd=2.40
##
## Value | N | Raw % | Valid % | Cum. %
## ---------------------------------------
## 11 | 252 | 2.92 | 2.92 | 2.92
## 12 | 509 | 5.90 | 5.90 | 8.82
## 13 | 778 | 9.02 | 9.02 | 17.84
## 14 | 1036 | 12.01 | 12.01 | 29.86
## 15 | 1289 | 14.94 | 14.94 | 44.80
## 16 | 1276 | 14.79 | 14.79 | 59.59
## 17 | 1216 | 14.10 | 14.10 | 73.69
## 18 | 947 | 10.98 | 10.98 | 84.67
## 19 | 689 | 7.99 | 7.99 | 92.66
## 20 | 436 | 5.06 | 5.06 | 97.72
## 21 | 197 | 2.28 | 2.28 | 100.00
## <NA> | 0 | 0.00 | <NA> | <NA>
As you can see from the tables that outputted above, the “None of them” answer category is indeed the modal answer category for three out of four of these peer variables, though the degree of skewness varies across items.
flat_table()
function in the “sjmisc” package.nys_fwtrim %>%
flat_table(marijuana, age, margin = "col") #note: margin = "col" tells it to give me column percentages
## age 11 12 13 14 15 16 17 18 19 20 21
## marijuana
## None of them 94.86 87.33 75.97 63.68 50.82 39.74 32.87 24.94 26.59 23.87 18.63
## Very few of them 3.27 8.60 11.51 15.28 17.00 19.74 19.89 20.66 16.72 19.63 21.74
## Some of them 0.93 3.39 6.91 10.79 13.65 17.62 21.57 24.82 24.75 21.49 26.71
## Most of them 0.93 0.23 4.32 5.77 9.79 12.86 13.26 16.75 19.73 23.08 18.01
## All of them 0.00 0.45 1.29 4.49 8.76 10.04 12.42 12.84 12.21 11.94 14.91
nys_fwtrim %>%
flat_table(alcohol, age, margin = "col")
## age 11 12 13 14 15 16 17 18 19 20 21
## alcohol
## None of them 86.92 78.60 57.90 40.02 24.34 18.05 12.14 8.20 7.02 6.88 8.59
## Very few of them 7.94 12.39 20.69 22.95 25.19 20.16 16.34 13.22 10.37 7.14 6.13
## Some of them 2.34 5.86 12.36 20.38 22.71 22.27 23.90 20.56 18.90 18.78 19.63
## Most of them 1.40 1.80 5.60 9.61 14.65 20.25 22.50 27.54 25.75 27.78 29.45
## All of them 1.40 1.35 3.45 7.04 13.11 19.28 25.12 30.48 37.96 39.42 36.20
nys_fwtrim %>%
flat_table(cheating, age, margin = "col")
## age 11 12 13 14 15 16 17 18 19 20 21
## cheating
## None of them 38.68 29.28 22.33 17.08 15.85 12.00 13.88 22.98 32.76 47.95 52.17
## Very few of them 29.72 39.86 36.74 31.46 28.68 27.29 28.52 30.93 31.54 30.96 26.71
## Some of them 22.17 19.14 24.50 31.14 31.96 36.53 33.99 29.07 26.17 15.07 15.53
## Most of them 5.66 7.88 10.23 13.84 15.59 15.02 14.54 12.17 7.45 4.66 4.35
## All of them 3.77 3.83 6.20 6.49 7.92 9.16 9.07 4.84 2.08 1.37 1.24
nys_fwtrim %>%
flat_table(vandalism, age, margin = "col")
## age 11 12 13 14 15 16 17 18 19 20 21
## vandalism
## None of them 64.32 65.92 59.86 60.53 57.56 60.85 62.71 61.52 64.99 68.70 76.69
## Very few of them 24.88 24.89 28.06 27.17 28.18 26.72 25.05 25.49 26.80 23.61 14.11
## Some of them 7.04 8.07 9.50 9.41 11.00 10.05 10.09 11.40 7.04 6.63 7.98
## Most of them 2.35 0.90 1.73 2.25 2.66 1.59 1.12 0.98 0.84 0.80 0.61
## All of them 1.41 0.22 0.86 0.64 0.60 0.79 1.03 0.61 0.34 0.27 0.61
ggplot2
package, which is an immensely powerful data
visualization package that allows you to create and customize virtually
any plot that you can imagine. But first, we still have a little more
data wrangling to do.mutate()
function along with the
ifelse
logical operator. Below, I will do it for the first
two variables in the figure: 1) Marijuana and 2) Alcohol.nys_fwtrim_dic <- nys_fwtrim %>%
mutate(mardic = ifelse(marijuana == 1, 1, 0),
alcdic = ifelse(alcohol == 1, 1, 0))
nys_fwtrim_dic %>%
flat_table(marijuana, mardic)
## mardic 0 1
## marijuana
## None of them 0 3591
## Very few of them 1281 0
## Some of them 1231 0
## Most of them 860 0
## All of them 649 0
nys_fwtrim_dic %>%
flat_table(alcohol, alcdic)
## alcdic 0 1
## alcohol
## None of them 0 2081
## Very few of them 1336 0
## Some of them 1466 0
## Most of them 1314 0
## All of them 1424 0
In the above code, I (Jake) created a new data set object called nys_fwtrim_dic. I could have also just overwritten the “nys_fwtrim” data set but, as I mentioned before, I generally try to avoid overwriting objects so I can keep clear exactly what is in the objects I create. I assume there are different perspectives on this practice, as my approach can lead to the creation of a lot of superfluous objects in my R environment. (Of course, if you followed our recommended RStudio “Global options” settings, then you will be starting with a clean environment for every R session, which makes this issue a bit more tolerable.)
In dichotomizing the variables, we created what are typically
called “dummy”
variables. In this case, our dummy variables have the value of
1
if the respondent reports “none of them” for the
marijuana and alcohol peer delinquency variables, and they have the
value of 0
otherwise. There are other ways to treat these
types of variables, but this “dummy coding” method will come in handy
later.
flat_table()
command after my recoding commands. Recall,
this is an important step that a lot of people who are just learning to
recode and wrangle data often are tempted to skip. At risk of repeating
the obvious at this point, you should always build in
checks to your code to make sure that your code is doing
exactly what you want it to do - no more and no less.nys_fwtrim_dic
data object with a data object that includes
all four dummy variables.We are ready to start plotting!
To do that, we will use the “ggplot2” package, which is one of the core packages in the “Tidyverse.” For more about the logic behind this package and far more details on how it works than we will provide below, check out: Kieran Healy’s book Data Visualization: A Practical Introduction; Ch. 3: Data Visualization from R for Data Science; and Danielle Navarro’s video series on ggplot2.
For now, just know that “ggplot2” is a powerful data visualization package that is almost limitless in its plotting capabilities. People even create amazing art using the package (see: Danielle Navarro’s website; the #rtistry hashtag on Twitter; and learn how to create art yourself with Navarro’s video series about creating art with ggplot2).
The “ggplot2” package is based on a famous data visualization
book by Leland Wilkinson called The Grammar of Graphics. Hadley
Wickham built ggplot2 to provide a layered
approach to this model of data visualization. The key to ggplot2 is
to understand that you can build a plot in layers after telling
ggplot()
what data set you are using and mapping specific
aesthetics to your plot (e.g. what variables to put on the x-axis and
y-axis).
The main function is ggplot()
. If we simply write
ggplot()
, we will get a blank canvas. And maybe this is a
useful way to think about a chart. You are starting with a blank canvas,
and the goal is to display information in a clear and insightful manner.
Here is how Hadley Wickham says it in the “Data
Visualization” chapter of his book:
With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to.
So, let’s try to build your intuition about what “ggplot2” is doing by building a simple plot in layers. First, let’s start by simply telling R to start a plot.
library(tidyverse)
nys_fwtrim_dic %>%
ggplot()
Starting a ggplot()
simply creates a blank canvas.
Notice how we first specified our data object, and then we used a pipe
(%>%
) to start the canvas with the ggplot()
command? We could have also specified the data within the
ggplot()
command, like this:
ggplot(nys_fwtrim_dic)
. Once you get into more complicated
plots that include multiple layers and, potentially, multiple data
sources from different objects or data sets, specifying the data within
the ggplot()
command may be preferred. For now, we will
keep with the approach above that starts with the data and then pipes to
the ggplot()
command.
Now that we have our blank canvas, we can start adding features or mapping aesthetics to it. Let’s start by adding the age variable on the x-axis as in Warr’s (1993) Figure 1.
nys_fwtrim_dic %>%
ggplot(aes(x = age))
Now our blank canvas has the age
variable mapped to
the x-axis. The plot itself is still blank because we have not told R
how to visualize the age variable. In ggplot()
, you do this
with a “geom.” Here is what Wickham and Grolemund said about geoms in R
for Data Science:
A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.
There are lots of different geoms built into the ggplot2 package, and a lot of other people create various geoms for specialized plots (see, for example, Matthew Kay’s work here and here).
We will start by creating a simple bar chart of the age distribution in our data.
nys_fwtrim_dic %>%
ggplot(aes(x = age)) +
geom_bar(fill = "lightblue")
Note: We added the “geom” layer to the
plot by using the +
symbol and then telling
ggplot()
what I wanted to add (in this case a “geom_bar” ).
Once you start a ggplot()
, this is generally how you add
layers and/or features to the chart - you simply follow the logic of
starting with a basic plot and then sequentially adding (+
)
layers to it.
Note: In this plot, we told R to use
the geom_bar()
geom to represent the age data as a bar
chart. Also, note that we told geom_bar() to make the bars light blue
with fill = "lightblue"
. There are numerous built in colors
available in R (see here)
and you can enter hexadecimal color
values by placing a “#” before the appropriate value (e.g.,
fill = #add8e6
). In general, we recommend using colorblind-friendly
palettes whenever possible.
Note: the fill
controls
the color of the fill for a particular geom. If you want to change the
color of the lines around the individual bars, you would add a
color =
command to the geom. Here I’ll add
color = "black"
in order to make the bars stand out a
little more.
nys_fwtrim_dic %>%
ggplot(aes(x = age)) +
geom_bar(fill = "lightblue", color = "black")
ggplot
skills. For now, we want just want you to understand
that any ggplot is essentially built as a bunch of layers with specific
aesthetics mapped to different variables and visual features as well as
different geoms used to represent the data in specific ways. Let’s move
on the plots that we are actually trying to re-create from Warr’s (1993)
Figure 1.group_by()
and summarize()
functions.
Essentially, we want to create a data frame that looks like the top row
in from the flat_table()
command that we used above to
summarize the data.nys_fwmarsum = nys_fwtrim_dic %>%
drop_na(marijuana) %>%
group_by(age) %>%
summarize(mean_mar = mean(mardic),
perc_mar = 100 * mean_mar)
nys_fwalcsum = nys_fwtrim_dic %>%
drop_na(alcohol) %>%
group_by(age) %>%
summarize(mean_alc = mean(alcdic),
perc_alc = 100 * mean_alc)
nys_fwmarsum
## # A tibble: 11 × 3
## age mean_mar perc_mar
## <dbl> <dbl> <dbl>
## 1 11 0.949 94.9
## 2 12 0.873 87.3
## 3 13 0.760 76.0
## 4 14 0.637 63.7
## 5 15 0.508 50.8
## 6 16 0.397 39.7
## 7 17 0.329 32.9
## 8 18 0.249 24.9
## 9 19 0.266 26.6
## 10 20 0.239 23.9
## 11 21 0.186 18.6
nys_fwalcsum
## # A tibble: 11 × 3
## age mean_alc perc_alc
## <dbl> <dbl> <dbl>
## 1 11 0.869 86.9
## 2 12 0.786 78.6
## 3 13 0.579 57.9
## 4 14 0.400 40.0
## 5 15 0.243 24.3
## 6 16 0.180 18.0
## 7 17 0.121 12.1
## 8 18 0.0820 8.20
## 9 19 0.0702 7.02
## 10 20 0.0688 6.88
## 11 21 0.0859 8.59
In the above code, we created two separate summary data sets for each of the first two peer delinquency variables:
First, we told R to drop the missing values for each peer
delinquency variable with the drop_na()
function.
Second, we told R to group the data set by the variable “age.” This basically tells R to perform whatever functions come next within the values of the grouping variable - in this case age (e.g., within age==11; then within age==12; etc.).
Third, we used the summarize
function to create a
new summary data set with the variables calculated as instructed in the
parentheses.
The first variable we created using the mean
function provides the proportion of respondents in each age group who
reported having no delinquent friends for the specific type of
delinquency. In this case, the mean
function returns a
proportion only because we had created the “mardic” and “alcdic”
variables as “dummy variables” earlier. When you calculate the mean of a
dummy variable, the mean represents the proportion of cases that have
the category coded as 1
.
Then, we simply multiplied that proportion by 100
to
get a percentage (“perc_mar” and “perc_alc”) rather than a
proportion.
theme_set(theme_classic())
marplot <- nys_fwmarsum %>%
ggplot(aes(x = age, y = perc_mar)) +
geom_line() +
geom_point(shape = "square")
alcplot <- nys_fwalcsum %>%
ggplot(aes(x = age, y = perc_alc)) +
geom_line() +
geom_point(shape = "square")
marplot
alcplot
There are a few of things to note in the above code:
First, notice that we constructed the plots with two geoms (geom_point and geom_line). This is ggplot2’s layering at work. In order to create the line plot with the points, we needed to first draw the line, then add the points (we could have drawn the points and then the lines, but then the line would technically be in front of the points).
Second, notice how we specified the specific shape we wanted
inside the geom_point()
command with
shape = "square"
. We could have also specified the number
code for a solid square with shape = 15
to get the same
thing. There are lots of different built-in shapes you can use for
points (see here).
Third, notice that above the code for the two plots, we included
the line theme_set(theme_classic())
. ggplot2 has multiple
built-in
themes and theme_classic()
is the most similar to the
look or aesthetics used in Warr’s (1993) Figure 1 plots. Of course,
these themes are all customizable, and there are multiple packages you
can download with additional themes (e.g., “ggthemes”).
marplot <- nys_fwmarsum %>%
ggplot(aes(x = age, y = perc_mar)) +
geom_line() +
geom_point(shape = "square") +
scale_x_continuous(limits = c(11, 21), breaks = 11:21) +
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) +
labs (title = "Marijuana", x = "Age", y = "Percent") +
theme(plot.title = element_text(hjust = 0.5, size = 10),
axis.title = element_text(size = 10))
alcplot <- nys_fwalcsum %>%
ggplot(aes(x = age, y = perc_alc)) +
geom_line() +
geom_point(shape = "square") +
scale_x_continuous(limits = c(11, 21), breaks = 11:21) +
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) +
labs (title = "Alcohol", x = "Age", y = "Percent") +
theme(plot.title = element_text(hjust = 0.5, size = 10),
axis.title = element_text(size = 10))
marplot
alcplot
We essentially added four things to the above plots.
First, we added both scale_x_continuous
and
scale_y_continuous
functions. In each of these, we set the
limits of the scale on the x-axis and y-axis, respectively, and we set
the number of breaks. For the x-axis, we simply instructed ggplot2 to
place a break at each whole number from 11 to 21
(breaks = 11:21
). For the y-axis, we told ggplot2 to place
a break at every 10 units between zero and one hundred
(breaks = seq(0, 100, 10)
).
Second, we added the labs
command, which allowed us
to specify the labels we wanted for the “title” of the plot and the
x-axis and y-axis respectively (see here for
details).
Finally, we made some small adjustments to the underlying theme we set. If you want to see what is different, you can run the code above without the theme arguments.
nys_fwmarsum %>%
ggplot(aes(x = age, y = perc_mar)) +
geom_line() +
geom_point(shape = "square") +
scale_x_continuous(limits = c(11, 21), breaks = 11:21) +
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) +
labs (title = "Marijuana", x = "Age", y = "Percent")
theme()
command.
The hjust = 0.5
simply tells R to place the title of the
plot in the center of ggplot’s coordinate system (find out more here).
The size = 10
simply tells ggplot2 to make the plot title
and axis titles size 10 point font. There are almost endless ways to
modify a ggplot2 theme (see here for
details).library(patchwork)
fig1_maralc = (marplot | alcplot) +
plot_annotation(
title = "Figure 1. Percentage of Respondents with No Delinquent Friends, by Age and Offense",
theme = theme(plot.title = element_text(hjust = 0.5)))
fig1_maralc
Note: Look at the options for the R
chunk (the text following the {r
) - here we included
knitted dimensions for the figure by writing
[r, fig.width = 9, fig.height = 6]
. This simply tells
RMarkdown to produce a figure of those dimensions (essentially the size
of regular letter-sized paper with one inch margins) - this will be
particularly useful when knitting the final document.
Note: In the above plot, we could have
placed the two plots on top of each other instead by separating them
with a /
rather than a |
, like so:
fig1_maralc_vert <- (marplot / alcplot) +
plot_annotation(
title = "Figure 1. Percentage of Respondents with No Delinquent Friends, by Age and Offense",
theme = theme(plot.title = element_text(hjust = 0.5)))
fig1_maralc_vert
fig1_maralc_grid <- (marplot | alcplot) / (marplot | alcplot) +
plot_annotation(
title = "Figure 1. Percentage of Respondents with No Delinquent Friends, by Age and Offense",
theme = theme(plot.title = element_text(hjust = 0.5)))
fig1_maralc_grid
You should now have everything that you need to recreate the first part of Figure 1 from Warr (1993) on pg. 22. If you have followed along up to this point, you should have the pooled data with dummy coded peer-deliquency variables needed to create summary data frames as done in Assignment section 3.6 above. You should also have the first two plots for Marijuana and Alcohol.
All that is left to do is:
Remember:
Keep the file clean and easy to follow by using RMD level headings (e.g., denoted with ## or ###) separating R code chunks, organized by assignment parts or questions.
Write plain text after headings and before or after code chunks to explain what you are doing. This is not just for my sake - such text will serve as useful reminders to you when working on later assignments!
Upon completing the assignment, “knit” your final RMD file again and save the final knitted html document to your “Assignments” folder in your LastName_P680_work folder as: LastName_P680_Assign3_YEAR_MO_DY.
Inside the “LastName_P680_commit” folder in our shared folder, create another folder named: Assignment 3.
To submit your assignment for grading, save copies of both your (1) final knitted “Assign3” html file and (2) your “Assign3_RMD” file into the “LastName_P680_commit / Assignment 3” folder.
Finally, just to give you a sense that you are on the right track, your final plot should look something like this: