Assumptions & Ground Rules

The purpose of this assignment is to introduce you to the powerful data visualization functionality in R, particularly the capabilities of the “ggplot2” package. Specifically, we will:

  1. Motivate data visualization as an important part of any data analysis workflow.
  2. Provide a basic overview of the data analyzed by Warr (1993).
  3. Introduce some basic functionality of the “ggplot2” package that is part of the tidyverse suite of packages.
  4. Reproduce Figure 1 from Warr’s (1993) study (as displayed on pg. 22-23 in his article) using the first five waves of the National Youth Survey (NYS).

I assume that you are now familiar with installing and loading packages in R. Thus, when you see a package being used, I expect that you know it needs to be installed and that it needs to be loaded within your own R session in order to use it.

At this point, I also assume you are familiar with RStudio and with creating R Markdown (RMD) files. If not, please review R Assignments 1 & 2.

  • Note: For this assignment, have RMarkdown knit to an html file.

As with previous assignments, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste from the instructions except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).

Packages:

library(tidyverse)
library(patchwork)
library(here)
library(gt)
library(datasauRus)

Part 1 (Assignment 3.1): Why Visualize Data?

When I was in school, both undergraduate and graduate school, data visualization was not emphasized in my methods and data analysis training. When it was introduced, it was introduced as more of an afterthought than a central part of the research and data analysis workflow. Indeed, if you look through the criminological literature, you’ll often see minimal and/or relatively poor data visualizations (although see more recent work like Pickett et al., 2022 for some cool exceptions). What you will see, however, are tables filled with descriptive statistics (e.g. means, standard deviations, and bivariate correlations) and multivariate statistical summaries (e.g. linear regression results). In this section, I hope to show you why data visualization is an important and necessary part of the research process.

1.1: Anscombe’s Quartet

The classic example of why data visualization is important is “Anscombe’s Quartet,” published by Anscombe (1973) (you can find it via UNCW’s library here). It is such a classic example that the data for Anscombe’s quartet are built into R (anscombe). Let’s take a look at it:

aq <- relocate(anscombe, x1, y1, x2, y2, x3, y3, x4, y4)
gt(aq)
x1 y1 x2 y2 x3 y3 x4 y4
10 8.04 10 9.14 10 7.46 8 6.58
8 6.95 8 8.14 8 6.77 8 5.76
13 7.58 13 8.74 13 12.74 8 7.71
9 8.81 9 8.77 9 7.11 8 8.84
11 8.33 11 9.26 11 7.81 8 8.47
14 9.96 14 8.10 14 8.84 8 7.04
6 7.24 6 6.13 6 6.08 8 5.25
4 4.26 4 3.10 4 5.39 19 12.50
12 10.84 12 9.13 12 8.15 8 5.56
7 4.82 7 7.26 7 6.42 8 7.91
5 5.68 5 4.74 5 5.73 8 6.89

What you have here are four different “toy” datasets that each contain two variables (“x” and “y”), with numbers indicating the data set they belong to. Here, I have ordered the data so that the x and y variables for a particular dataset are next to each other. Let’s take a look at some traditional summary statistics for each of these four datasets.

Anscombe's Quartet: Descriptive Statistics
Variable Statistic Quartet Data
Data #1 Data #2 Data #3 Data #4
x Mean (SD) 9.00 (3.32) 9.00 (3.32) 9.00 (3.32) 9.00 (3.32)
y Mean (SD) 7.50 (2.03) 7.50 (2.03) 7.50 (2.03) 7.50 (2.03)
x-y correlation Pearson's r 0.82 0.82 0.82 0.82
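
The code that produced this table is not shown, so here is one minimal sketch of how the same statistics could be computed. It uses only base R and the built-in anscombe data; the object name quartet_stats is my own choice and not part of the original assignment.

```r
# Compute the mean, SD, and x-y correlation for each of the four datasets.
# `anscombe` ships with base R, so no extra packages are needed.
quartet_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    dataset = i,
    x_mean  = mean(x),
    x_sd    = sd(x),
    y_mean  = mean(y),
    y_sd    = sd(y),
    r       = cor(x, y)
  )
}))

# Round to two decimals to match the table.
round(quartet_stats, 2)
```

Rounding matters here: the underlying values differ slightly across datasets past the second decimal.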

As demonstrated above, each data set has nearly identical means and standard deviations and the same correlation coefficient (at least to the second decimal). That looks like the same data, right? Wrong! Here is what it looks like when you plot it:

Notice how data #1 demonstrates a linear relationship between x and y that is somewhat noisy; data #2 displays a curvilinear relationship; data #3 is almost perfectly linear except for one large outlier; and the values in data #4 are stacked up on the same x-value except for one outlier in the upper right corner of the graph. Looking at the data clearly changes your impression of what these different datasets are telling you about the relationship between x and y.
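
If you want to draw a four-panel figure like this yourself, the following is one possible sketch. It is not necessarily the code used for the figure above. The object names aq_long and aq_plot are hypothetical, and I assume a scatterplot with a fitted straight line in each panel.

```r
library(ggplot2)

# Stack the four x/y pairs into one "long" data frame with a dataset label.
aq_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    dataset = paste0("Data #", i),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))

# One panel per dataset: points plus a linear fit without a confidence band.
aq_plot <- ggplot(data = aq_long, mapping = aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~dataset, ncol = 2)
aq_plot
```

The fitted lines look almost the same in every panel, which is exactly why the summary statistics alone are misleading.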

1.2: Real-World Example of Importance of Data Visualization

This is clearly a toy example meant to demonstrate the importance of looking at/visualizing your data. But these issues do come up in real-world research as evidenced in Chapter 1 (section 1.1) of Kieran Healy’s book Data Visualization: A Practical Introduction.

Data visualization matters in the real world

In the above chart from Jackman (1980), which was responding to a paper by Hewitt (1977), removing the outlier (South Africa) changes the relationship between voter turnout and income inequality. As Healy states:

The original paper had argued for a significant association between voter turnout and income inequality based on a quantitative analysis of eighteen countries. When this relationship was graphed as a scatterplot, however, it immediately became clear that the quantitative association depended entirely on the inclusion of South Africa in the sample.

  • Note: I highly recommend Healy’s book for learning the basic principles of effective data visualizations as well as specifics about how to use the “ggplot2” package in R to produce effective data visualizations yourself.

1.3: DatasauRus

For a perhaps more fun toy example, check out the “datasauRus” package. As with “Anscombe’s Quartet,” the datasets in the following plots have nearly identical means, standard deviations, and x-y correlations.

# Plot every Datasaurus dataset, one panel per dataset
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(~dataset, ncol = 3) +
  ggtitle("DatasauRus Visualization") +
  theme_minimal() +
  # linewidth replaces the deprecated size argument (ggplot2 >= 3.4.0)
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1),
        strip.text.x = element_text(face = "bold.italic"),
        plot.title = element_text(hjust = 0.5),
        legend.position = "none")
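
You do not have to take the claim about matching summary statistics on faith. Here is a short sketch that checks it, assuming the “dplyr” and “datasauRus” packages are installed; the object name ds_stats is my own.

```r
library(dplyr)
library(datasauRus)

# Mean, SD, and x-y correlation for each dataset in datasaurus_dozen.
ds_stats <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(
    x_mean = mean(x),
    y_mean = mean(y),
    x_sd   = sd(x),
    y_sd   = sd(y),
    r      = cor(x, y)
  )
ds_stats
```

Compare the rows of the resulting table to one another, then compare the panels of the plot.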

Part 2 (Assignment 3.2): Introduction to Warr’s (1993) Study

Before we can start plotting, we need to have some data. For this exercise, we are going to use data from the National Youth Survey (NYS). Specifically, we’ll use data from Mark Warr’s (1993) now classic study in Criminology titled “Age, Peers, and Delinquency.”

2.1: Study Overview

I recommend doing a quick AIC reading of Warr’s study. To get you started, here is the abstract:

Hirschi and Gottfredson (1983; Gottfredson and Hirschi, 1990) have argued that the age distribution of crime cannot be explained by any known variables, and they point specifically to the failure of sociological theories to explain this phenomenon. This paper examines a quintessentially sociological theory of crime - differential association - and evaluates its ability to explain the age distribution of crime. Analysis of data from the National Youth Survey on persons aged 11-21 reveals that peer relations (exposure to delinquent peers, time spent with peers, loyalty to peers) change dramatically over this age span, following much the same pattern as crime itself. When measures of peer influence are controlled, the effects of age on self-reported delinquency are largely rendered insignificant. Additional analyses show that delinquent friends tend to be “sticky” friends (once acquired, they are not quickly lost) and that Sutherland’s arguments concerning the duration and priority of delinquent associations are only partially correct.

The paper is fundamentally about investigating the age-crime relationship and specifically examining whether delinquent peer influence explains the age distribution of crime. To investigate this, Warr (1993) uses the first five waves of the National Youth Survey (NYS). Here is the first paragraph of the “Data and Methods” section, where he outlines the specific data he is analyzing (p. 19):

The data for this study come from the National Youth Survey (NYS). The NYS is a longitudinal study of a national probability sample of 1,726 persons aged 11-17 in 1976 (see Elliott et al., 1985). The sample was obtained through a multistage, cluster sampling of households in the continental United States in 1976. Five consecutive annual waves of the survey were conducted from 1976 through 1980, and these five are used in this analysis.

2.2: Get the Data

As you learned in your first project assignment, the National Youth Survey (NYS) data used in Warr’s study are available on ICPSR. Eventually I will show you how to download the data directly from ICPSR entirely within R. But for the current assignment, you will simply be working with the R dataset that I provided you on Canvas. In order to get the data loaded into R, follow these steps:

  1. Start by creating a folder titled “Datasets” in your “CRM495_work” folder.
  2. Next, download the “nys_warr1993.rda” file from Canvas and place it in the “Datasets” folder you just created.
  3. Install and load the “here” package. The here package will help you start a reproducible project-oriented workflow from the beginning. Here is how it will work once installed:
    • You save your primary RMarkdown file in your top-level directory folder. For us, that means saving your RMarkdown file in your LastName_CRM495_work folder.

    • Next, after closing RStudio, you will simply click directly on the RMarkdown file in your LastName_CRM495_work folder to automatically open it with RStudio. When you do this, your “working directory,” i.e., the place R looks for files by default, will automatically be set to your CRM495_work folder.

    • The here package will then make it easy to find and call objects (e.g., datasets) in subfolders of your working directory (e.g., in “Datasets” folder).

    • Note: The code below shows you how to use the “here” package once it is installed, assuming you followed the above directions.

library(here)
load(file = here("Datasets", "nys_warr1993.rda"))
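
One detail about load() that often trips people up: it restores objects under the names they were saved with, so you do not assign its result to a new object. The sketch below illustrates this with a small made-up data frame (toy_data, which is hypothetical and unrelated to the NYS) saved to a temporary folder.

```r
# Save a toy data frame to a temporary .rda file, then remove it from memory.
toy_data <- data.frame(id = 1:3, score = c(10, 20, 30))  # hypothetical data
rda_path <- file.path(tempdir(), "toy_data.rda")
save(toy_data, file = rda_path)
rm(toy_data)

# load() puts the object back under its original name and
# (invisibly) returns the names of everything it restored.
loaded_names <- load(file = rda_path)
loaded_names
exists("toy_data")
```

This is why the data appear in your environment as nys_warr1993 after running the load() command above, even though that name is never assigned explicitly.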

2.3: Look at the data

Now that you have the data in your environment, you can take a look at it. If you look at the global environment where the data object is stored in your RStudio session, you’ll notice that the data have 8,625 observations and 20 variables. It has so many observations because it pools the first five waves of the NYS, so it includes a row for each individual for each wave of data. If you look at the data, you’ll notice that it is sorted by individual.

head(nys_warr1993, 10) %>%
  gt()
CASEID wave age peer_mar_dic peer_alc_dic peer_cheat_dic peer_vandal_dic peer_burg_dic peer_selldrugs_dic peer_theftlt5_dic peer_theftgt50_dic evsoc_dic socimp_dic peertroub_dic resp_mar resp_cheat_trunc resp_burg_trunc resp_selldrugs_trunc resp_theftlt5_trunc resp_theftgt50_trunc
1 1 13 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0
1 2 14 0 0 0 0 1 1 0 1 1 0 0 1 1 NA 0 2 0
1 3 15 0 0 0 0 1 1 0 1 1 0 0 1 2 0 0 1 0
1 4 16 0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 0
1 5 17 0 0 0 0 0 1 0 0 1 0 1 9 0 0 0 3 0
2 1 15 1 0 0 1 1 1 1 1 1 0 1 1 0 0 0 0 0
2 2 16 0 0 0 0 1 1 1 1 1 1 0 1 NA NA 0 0 0
2 3 17 0 0 0 1 1 1 1 1 1 1 0 1 0 0 0 0 0
2 4 18 0 0 0 1 1 1 1 1 1 1 0 1 0 0 0 0 0
2 5 19 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Basically, it has each individual’s five waves of data stacked on top of each other. It would also make sense to sort it by wave, so that each wave of data was stacked on top of each other.

nys_warr1993 %>%
  arrange(wave) %>%
  head(10) %>%
  gt()
CASEID wave age peer_mar_dic peer_alc_dic peer_cheat_dic peer_vandal_dic peer_burg_dic peer_selldrugs_dic peer_theftlt5_dic peer_theftgt50_dic evsoc_dic socimp_dic peertroub_dic resp_mar resp_cheat_trunc resp_burg_trunc resp_selldrugs_trunc resp_theftlt5_trunc resp_theftgt50_trunc
1 1 13 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0
2 1 15 1 0 0 1 1 1 1 1 1 0 1 1 0 0 0 0 0
3 1 11 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0 0 0
4 1 16 0 0 0 0 0 0 0 0 0 0 0 7 3 1 0 0 1
5 1 14 NA NA NA NA NA NA NA NA 0 1 NA 1 2 0 0 5 0
6 1 11 0 1 0 0 0 1 0 1 0 0 1 1 2 0 0 1 0
7 1 14 0 0 0 0 1 1 0 1 1 1 1 7 5 0 0 0 0
8 1 11 1 0 0 0 1 1 1 1 0 0 0 1 1 0 0 0 0
9 1 11 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0
10 1 12 0 0 0 0 1 1 0 1 0 1 1 1 2 0 0 5 0

When it comes to working with the data in R, it really doesn’t matter how you sort it because we included both the CASEID and wave variables in the data set. This allows us to sort, group, and analyze the data by either or both of these variables.
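
To make this concrete, here is a small sketch using a made-up data frame (toy_long, with hypothetical values) laid out the same way as nys_warr1993, with one row per person per wave. It assumes the “dplyr” package, which is loaded with the tidyverse.

```r
library(dplyr)

# Hypothetical person-wave data: three people observed in two waves.
toy_long <- data.frame(
  CASEID = c(1, 1, 2, 2, 3, 3),
  wave   = c(1, 2, 1, 2, 1, 2),
  age    = c(13, 14, 15, 16, 11, 12)
)

# Sort by wave first, then by person within each wave.
toy_long %>%
  arrange(wave, CASEID)

# Group by wave to count observations and compute the mean age per wave.
wave_summary <- toy_long %>%
  group_by(wave) %>%
  summarise(n_obs = n(), mean_age = mean(age))
wave_summary
```

The same group_by() logic works with CASEID if you want a per-person summary instead.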

2.4: Describe variables

Ok, so what do all of these variables actually measure?

  • CASEID: is the unique identifier for each individual in the sample. It allows you to track individuals over multiple waves.
  • wave: indicates which of the first five waves of the NYS the observations are from.
  • age: indicates respondents’ self-reported age at the time of the interview/survey.
  • peer_mar_dic - peer_theftgt50_dic: These are the peer delinquency variables that Warr (1993) analyzed. Specifically, they indicate whether the respondent reported that none of their “close friends” engaged in the specific behavior in the past year. The original answer categories were a five-point scale ranging from 1 = “none of them” to 5 = “all of them” (see Warr, 1993: pg. 20 for description). Here, 1 = “none of them” and 0 = “very few of them” to “all of them.” Here is the key for the behavior abbreviations used in the variable names:
    • mar = used marijuana
    • alc = used alcohol
    • cheat = cheated on school tests
    • vandal = vandalized (i.e. purposely damaged or destroyed property)
    • burg = burglary (i.e. broken into a vehicle or building to steal something)
    • selldrugs = sold hard drugs (e.g., heroin, cocaine, and LSD)
    • theftlt5 = minor theft (i.e. stole something worth less than $5)
    • theftgt50 = major theft (i.e. stole something worth more than $50)
  • evsoc_dic: Another variable that Warr (1993) dichotomized, it is a measure of time spent with friends “in an average week.” Specifically, it asks about the number of nights in an average week respondents spent doing social activities. Thus, the original answer categories ranged from 1 to 7, and Warr dichotomized them so that 1 = “3 or more” and 0 = “less than 3” (see Warr, 1993: pg. 24 for description).
  • socimp_dic: Warr’s measure of the importance respondents place on socializing with their peers. The original answer categories were a five-point scale ranging from 1 = “Not important” to 5 = “Very important” (see Warr, 1993: pg. 24 for description). Warr dichotomized it so that 1 = “Very important” or “Pretty important” and 0 = “Not important” to “Somewhat important.”
  • resp_mar - resp_theftgt50_trunc: Warr’s six variables of self-reported delinquency that align with six of the eight peer delinquency measures. Except for self-reported marijuana use (measured by asking about rate of use, e.g., “Never” to “2 to 3 times/day”), they are measured as respondents’ self-reported frequency of engaging in the behavior (i.e., number of times) in the last year. Warr truncated the five frequency measures so that any value of 5 or greater was coded as a 5 (see Warr, 1993: pg. 25 for description).
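
To give you a feel for what “dichotomized” and “truncated” mean in code, here is a minimal base R sketch. The input values are made up for illustration and are not actual NYS responses, and this is not necessarily how the provided data file was built.

```r
# Hypothetical raw responses (NOT actual NYS values).
peer_mar_raw   <- c(1, 3, 5, NA, 1, 2)   # 1 = "none of them" ... 5 = "all of them"
resp_theft_raw <- c(0, 2, 7, 12, NA, 5)  # number of times in the past year

# Dichotomize: 1 if "none of them", 0 otherwise; missing values stay missing.
peer_mar_dic <- as.integer(peer_mar_raw == 1)
peer_mar_dic

# Truncate: any count of 5 or more is recoded to 5; missing values stay missing.
resp_theft_trunc <- pmin(resp_theft_raw, 5)
resp_theft_trunc
```

Note that both recodes preserve NA, which is why you still see NA values in the recoded variables above.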

If you are interested in the specific question wording in the NYS for each of these items, feel free to take a look at the codebooks. Here is a basic key that I created:

Warr (1993) NYS Items
variable1 wave1 wave2 wave3 wave4 wave5
icpsr 8375 8424 8506 8917 9112
age V169 V7 V10 V6 V6
peer_mar_dic V367 V210 V308 V288 V315
peer_alc_dic V370 V213 V311 V291 V318
peer_cheat_dic V365 V208 V306 V286 V313
peer_vandal_dic V366 V209 V307 V287 V314
peer_burg_dic V371 V214 V312 V292 V319
peer_selldrugs_dic V372 V215 V313 V293 V320
peer_theftlt5_dic V368 V211 V309 V289 V316
peer_theftgt50_dic V373 V216 V314 V294 V321
evsoc_dic V179 V17 V81 V24 V37
socimp_dic V180 V18 V82 V25 V38
peertroub_police V377 V223 V321 V301 V328
resp_mar V479 V566 V531 V597 V572
resp_cheat V412 V293 V400 V438 V478
resp_burg V454 V355 V444 V551 V524
resp_selldrugs V428 V309 V418 V481 V496
resp_theftlt5 V400 V281 V388 V395 V464
resp_theftgt50 V386 V267 V374 V365 V448

1 Note: 'icpsr' indicates the icpsr number for the data set and not a variable

In a future assignment, you’ll learn how to take the raw data from ICPSR and wrangle and recode it to align with the coding decisions of a specific study like I did for you in this case. But for now, our main goal is to learn some of the basics of the “ggplot2” package. So let’s get on to actually creating some plots!

  • Note: If you are interested, on Canvas, I have provided the code for taking the raw data and wrangling/recoding it to align with Warr’s (1993) decisions (see “nys_warr1993.rda”).

Part 3 (Assignment 3.3): Introduction to the “ggplot2” package

3.1: Understanding the basic logic of “ggplot2”

The “ggplot2” package is based on a famous data visualization book by Leland Wilkinson called The Grammar of Graphics. Hadley Wickham built ggplot2 to provide a layered approach to this model of data visualization. The key to ggplot2 is to understand that you can build a plot in layers after telling ggplot() what data set you are using and mapping specific aesthetics to your plot (e.g. what variables to put on the x-axis and y-axis).

3.2: Building a basic plot with ggplot

The main function is ggplot(). If we simply write ggplot(), we will get a blank canvas. And maybe this is a useful way to think about a chart. You are starting with a blank canvas, and the goal is to display information in a clear and insightful manner. Here is how Hadley Wickham says it in the “Data Visualization” chapter of his book:

With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to.

So, let’s try to build your intuition about what “ggplot2” is doing by building a simple plot in layers. First, let’s start by simply telling R to start a plot.

library(tidyverse)

ggplot(data = nys_warr1993) 

Starting a ggplot() simply creates a blank canvas. Notice how we specified the data within the ggplot() command with data = nys_warr1993? We could have simply typed ggplot(nys_warr1993) and it would have done the same thing. Navarro covers this in her video series when she discusses named and unnamed arguments. At this point in your R journey, I concur with Navarro’s advice to err on the side of being more rather than less explicit with the arguments you provide to a function. So I will follow the more verbose version of ggplot code in this assignment.

  • Note: I think more verbose code can also be useful for replication and reproducibility. Someone doesn’t have to be an expert in R or the ggplot2 package to know that when I write data = nys_warr1993, it is very likely that nys_warr1993 refers to a dataset.

Now that we have our blank canvas, we can start adding features or, in the language of ggplot, mapping aesthetics to it. Let’s start by adding the age variable on the x-axis as in Warr’s (1993) Figure 1.

ggplot(data = nys_warr1993,
       mapping = aes(x = age)) 

Now our blank canvas has the age variable mapped to the x-axis. The plot itself is still blank because we have not told R how to visualize the age variable. In ggplot(), you do this with a “geom.” Here is what Wickham and Grolemund said about geoms in R for Data Science:

A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.

There are lots of different geoms built into the ggplot2 package, and a lot of other people create various geoms for specialized plots (see, for example, Matthew Kay’s work here and here).

We will start by creating a simple bar chart of the age distribution in our data.

ggplot(data = nys_warr1993, mapping = aes(x = age)) + 
  geom_bar(fill = "lightblue")

  • Note: We added the “geom” layer to the plot by using the + symbol and then telling ggplot() what we wanted to add (in this case a “geom_bar”). Once you start a ggplot(), this is generally how you add layers and/or features to the chart: you simply follow the logic of starting with a basic plot and then sequentially adding (+) layers to it.

  • Note: In this plot, we told R to use the geom_bar() geom to represent the age data as a bar chart. Also, note that we told geom_bar() to make the bars light blue with fill = "lightblue". There are numerous built-in colors available in R (see here), and you can enter hexadecimal color values by placing a “#” before the appropriate value (e.g., fill = "#add8e6"). The hex value must be in quotes; otherwise, R treats everything after the # as a comment. In general, I recommend using colorblind-friendly palettes whenever possible.

  • Note: The fill argument controls the color of the fill for a particular geom. If you want to change the color of the lines around the individual bars, you would add a color = argument to the geom. Here I’ll add color = "black" in order to make the bars stand out a little more.

ggplot(data = nys_warr1993, mapping = aes(x = age)) + 
  geom_bar(fill = "lightblue", color = "black")
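
As a side check on the hexadecimal color mentioned in the notes above, base R can convert a named color to its hex code. This small sketch uses col2rgb() and rgb() from base R; the object names are my own.

```r
# Look up the red, green, and blue components of the named color.
lightblue_rgb <- col2rgb("lightblue")   # 3 x 1 matrix: red, green, blue

# Convert those components to a hexadecimal color string.
lightblue_hex <- rgb(t(lightblue_rgb), maxColorValue = 255)
lightblue_hex

# Hex codes are not case sensitive, so compare in upper case.
toupper("#add8e6") == lightblue_hex
```

In other words, fill = "lightblue" and fill = "#add8e6" should give you the same bars.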

The chart above looks fine and provides us with the basic age distribution across the first five waves of the NYS data. This is often useful when getting a sense of how variables in your data are distributed.

  • Note: This particular chart is somewhat misleading in that each respondent provides data for that chart up to five times.
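
One way to avoid counting the same person more than once is to plot a single wave at a time. Here is a sketch using a made-up person-wave data frame (toy_long, hypothetical values); with the real data you would apply the same filter() step to nys_warr1993.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data: four people, each observed in two waves.
toy_long <- data.frame(
  CASEID = rep(1:4, each = 2),
  wave   = rep(1:2, times = 4),
  age    = c(11, 12, 13, 14, 13, 14, 15, 16)
)

# Keep only wave 1 so that each person appears exactly once.
wave1_only <- toy_long %>%
  filter(wave == 1)

# Bar chart of the age distribution at wave 1.
ggplot(data = wave1_only, mapping = aes(x = age)) +
  geom_bar(fill = "lightblue", color = "black")
```

Whether you want the pooled or the single-wave distribution depends on the question you are asking.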

We could spend a lot of time going through these kinds of basic ggplot skills. For now, I just want you to understand that any ggplot is essentially built as a bunch of layers with specific aesthetics mapped to different variables and visual features as well as different “geoms” used to represent the data in specific ways.

Danielle Navarro’s video series on ggplot2 covers a lot of these basics. For now, let’s move on to the plots that we are actually trying to re-create from Warr’s (1993) Figure 1. I think you will learn some more basic features of “ggplot2” and some additional add-on packages through working on this specific problem.

Part 4 (Assignment 3.4): Recreating Warr’s (1993) Figure 1

4.1: Wrangle the Data

Here is the first page of Warr’s Figure 1 (1993, p.22):

As implied by the title of the article (and the figure itself), the first key variable across each of the figures that we are going to try to reproduce is “Age.” The second set of items is “exposure to delinquent peers,” which I described above in section 2.4 (and as described by Warr, 1993 on pg. 20).

Sometimes, as in section 3.2, we can just take the raw data and plot it. But this is actually pretty rare. We usually need to wrangle and recode data to get it in the form we want to plot. In this case, I have provided you with an R data file named “nys_warr1993_peersum.rda”. Go ahead and load that into your R session as described earlier (see section 2.2). It simply provides summary data for each age group by each of the eight peer delinquency variables in Warr’s (1993) Figure 1. Specifically, it includes 11 observations (one for each age group represented in the first five waves of the NYS) and 9 variables:

  • age: Indicates the specific age group
  • perc_peer_mar - perc_peer_theftgt50: These represent the percentage of respondents in each age group who reported having no “close friends” who engaged in the specific behavior during the last year (see Warr, 1993: pg. 20 for description). Here is what the behavior abbreviations mean:
    • mar = used marijuana
    • alc = used alcohol
    • cheat = cheated on school tests
    • vandal = vandalized (i.e. purposely damaged or destroyed property)
    • burg = burglary (i.e. broken into a vehicle or building to steal something)
    • selldrugs = sold hard drugs (e.g., heroin, cocaine, and LSD)
    • theftlt5 = minor theft (i.e. stole something worth less than $5)
    • theftgt50 = major theft (i.e. stole something worth more than $50)
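
If you are curious how a summary file like this can be built from person-wave data, here is a rough sketch with made-up values. I am assuming the percentages are computed among respondents with non-missing answers; the file on Canvas may have been built differently. The names toy_long and toy_peersum are hypothetical.

```r
library(dplyr)

# Hypothetical person-wave data with a dichotomized peer item
# (1 = no close friends used marijuana, 0 = at least some did).
toy_long <- data.frame(
  age          = c(11, 11, 11, 11, 12, 12, 12, 12),
  peer_mar_dic = c(1, 1, 1, 0, 1, 0, NA, 0)
)

# The mean of a 0/1 variable is a proportion; multiply by 100 for a percentage.
toy_peersum <- toy_long %>%
  group_by(age) %>%
  summarise(perc_peer_mar = 100 * mean(peer_mar_dic, na.rm = TRUE))
toy_peersum
```

The result has one row per age group, which is the same shape as nys_warr1993_peersum.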

4.2: Plot the Data

Now that we have this summary data set, plotting it is relatively easy. We just need to tell ggplot what specific variables to use and the specific geom we want to represent the data as. There are some other details, but I think we can get something that looks like Warr’s (1993) Figure 1 relatively quickly.

At this point, we will just focus on the first four behaviors: Marijuana, Alcohol, Cheating, and Vandalism. You’ll finish the assignment by creating the plot for the second set of four behaviors.

I’m going to start by plotting each delinquent peer exposure measure separately and then combine them later.

theme_set(theme_classic())

peer_mar_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_mar)) + 
  geom_line() + 
  geom_point(shape = "square")

peer_alc_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_alc)) + 
  geom_line() + 
  geom_point(shape = "square")

peer_cheat_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_cheat)) + 
  geom_line() + 
  geom_point(shape = "square")

peer_vandal_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_vandal)) + 
  geom_line() + 
  geom_point(shape = "square")


peer_mar_plot

peer_alc_plot

peer_cheat_plot

peer_vandal_plot

There are a few things to note in the above code:

  1. Notice that we constructed the plots with two geoms (geom_point and geom_line). This is ggplot2’s layering at work. In order to create the line plot with the points, we needed to first draw the line, then add the points (we could have drawn the points and then the lines, but then the line would technically be in front of the points).

  2. Notice how we specified the specific shape we wanted inside the geom_point() command with shape = "square". We could have also specified the number code for a solid square with shape = 15 to get the same thing. There are lots of different built-in shapes you can use for points (see here).

  3. Notice that above the code for the four plots, we included the line theme_set(theme_classic()). ggplot2 has multiple built-in themes, and theme_classic() is the most similar to the look or aesthetics used in Warr’s (1993) Figure 1 plots. Of course, these themes are all customizable, and there are multiple packages you can download with additional themes (e.g., “ggthemes”).

  4. Notice that I assigned each individual ggplot to its own object. Remember, pretty much anything in R can be assigned to an object. By doing that here, we will be able to combine them later.

4.3: Customize plot details

In the above plots, the actual data displayed appears to match up with Warr’s (1993) figure. However, the details of the plot (e.g. axis labels, scales, etc.) are not exactly how we want them. The good news is that virtually everything within ggplot2 is customizable. So, we should be able to make these changes relatively easily.

peer_mar_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_mar)) + 
  geom_line() + 
  geom_point(shape = "square") +
  scale_x_continuous(limits = c(11, 21), breaks = 11:21) + 
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) + 
  labs(title = "Marijuana", x = "Age", y = "Percent") + 
  theme(plot.title = element_text(hjust = 0.5, size = 10),
        axis.title = element_text(size = 10))

peer_alc_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_alc)) + 
  geom_line() + 
  geom_point(shape = "square") + 
  scale_x_continuous(limits = c(11, 21), breaks = 11:21) + 
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) + 
  labs(title = "Alcohol", x = "Age", y = "Percent") + 
  theme(plot.title = element_text(hjust = 0.5, size = 10),
        axis.title = element_text(size = 10))

peer_cheat_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_cheat)) + 
  geom_line() + 
  geom_point(shape = "square") +
  scale_x_continuous(limits = c(11, 21), breaks = 11:21) + 
  scale_y_continuous(limits = c(0, 70), breaks = seq(0, 70, 10)) + 
  labs(title = "Cheating", x = "Age", y = "Percent") + 
  theme(plot.title = element_text(hjust = 0.5, size = 10),
        axis.title = element_text(size = 10))

peer_vandal_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_vandal)) + 
  geom_line() + 
  geom_point(shape = "square") +
  scale_x_continuous(limits = c(11, 21), breaks = 11:21) + 
  scale_y_continuous(limits = c(50, 80), breaks = seq(50, 80, 10)) + 
  labs(title = "Vandalism", x = "Age", y = "Percent") + 
  theme(plot.title = element_text(hjust = 0.5, size = 10),
        axis.title = element_text(size = 10))


peer_mar_plot

peer_alc_plot

peer_cheat_plot

peer_vandal_plot

I essentially added three things to the above plots.

  1. I added both scale_x_continuous and scale_y_continuous functions. In each of these, we set the limits of the scale on the x-axis and y-axis, respectively, and we set the location of the breaks. For the x-axis, we simply instructed ggplot2 to place a break at each whole number from 11 to 21 (breaks = 11:21). For the y-axis of the marijuana and alcohol plots, we told ggplot2 to place a break at every 10 units between zero and one hundred (breaks = seq(0, 100, 10)); the cheating and vandalism plots use narrower ranges.

  2. I added the labs command, which allowed me to specify the labels we wanted for the “title” of the plot and the x-axis and y-axis respectively (see here for details).

  3. I made some small adjustments to the underlying theme we set. If you want to see what is different, you can run the code above without the theme arguments. And if you are a real glutton for punishment, check out the ggplot2 website: “Modify components of a theme.”

  • Note: In the above code I wrote over the objects we had created previously. If I had assigned the objects different names than we had used previously, I would have created four new ggplot objects instead of replacing the old ones like I did here.
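
If the breaks arguments look cryptic, it can help to run the expressions on their own in the console and see the numbers they produce. The object names below are my own.

```r
# The colon operator gives every whole number from 11 to 21.
x_breaks <- 11:21
x_breaks

# seq(from, to, by) counts from 0 to 100 in steps of 10.
y_breaks <- seq(0, 100, 10)
y_breaks

# The vandalism plot uses a narrower range: 50, 60, 70, 80.
y_breaks_vandal <- seq(50, 80, 10)
y_breaks_vandal
```

ggplot2 simply places an axis tick and label at each of these values.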

4.4: Combine separate plots into one.

The last thing we are going to do is show you how to combine separate plots into one using the “patchwork” package (be sure to install it in the console). There are lots of ways to produce multiple plots on the same overall canvas. The “patchwork” package seems to me to be the easiest way to combine plots.

Let’s start by simply combining two plots (marijuana use and alcohol use) onto one plot so I can show you some intricacies of the “patchwork” package.

library(patchwork)
fig1_maralc <- (peer_mar_plot | peer_alc_plot) +
  plot_annotation(
    title = "Figure 1. Percentage of Respondents with No Delinquent Friends, by Age and Offense",
    theme = theme(plot.title = element_text(hjust = 0.5))) 
fig1_maralc

  • Note: In the options for the R chunk (the text following {r), I included knitted dimensions for the figure by writing {r, fig.width = 9, fig.height = 6}. This simply tells RMarkdown to produce a figure of those dimensions (essentially the size of regular letter-sized paper with one inch margins), which will be particularly useful when knitting the final document.

  • Note: In the above plot, we could have placed the two plots on top of each other instead by separating them with a / rather than a |, like so:

#library(patchwork)
fig1_maralc <- (peer_mar_plot / peer_alc_plot) +
  plot_annotation(
    title = "Figure 1. Percentage of Respondents with No Delinquent Friends, by Age and Offense",
    theme = theme(plot.title = element_text(hjust = 0.5))) 
fig1_maralc

In order to get all four plots on the same canvas, I can simply include the plots next to each other in parentheses like so:

fig1_pg22 <- (peer_mar_plot | peer_alc_plot) / (peer_cheat_plot | peer_vandal_plot) + 
  plot_annotation(
    title = "Figure 1. Percentage of Respondents with No Delinquent Friends, by Age and Offense",
    theme = theme(plot.title = element_text(hjust = 0.5))) 
fig1_pg22

If you followed along to this point, you’ll notice the plot above looks very similar to pg. 22 in Warr’s (1993) article. The next step is to complete the second half of Figure 1 with the remaining four peer delinquency variables. I’m going to let you do this on your own.
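
As an aside, the “patchwork” package also offers wrap_plots(), which arranges a set of plots into a grid without the | and / operators. The sketch below uses four made-up plots (built from hypothetical values, not the NYS data) purely to show the layout; it is an alternative to the approach above, not a replacement for it.

```r
library(ggplot2)
library(patchwork)

# Hypothetical summary data for illustration only.
toy <- data.frame(
  age = 11:15,
  a = c(90, 80, 60, 40, 30),
  b = c(95, 85, 70, 50, 35)
)

# Four simple line-and-point plots to stand in for the real ones.
p1 <- ggplot(data = toy, mapping = aes(x = age, y = a)) + geom_line() + geom_point()
p2 <- ggplot(data = toy, mapping = aes(x = age, y = b)) + geom_line() + geom_point()
p3 <- p1 + labs(title = "Plot 3")
p4 <- p2 + labs(title = "Plot 4")

# Arrange the four plots in a grid with two columns, then add an overall title.
combined <- wrap_plots(p1, p2, p3, p4, ncol = 2) +
  plot_annotation(
    title = "Four toy plots in a 2 x 2 grid",
    theme = theme(plot.title = element_text(hjust = 0.5)))
combined
```

The operator approach gives you finer control over uneven layouts, while wrap_plots() is convenient when you just want a regular grid.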

Part 5 (Assignment 3.5): “Draw the Owl”

You should now have everything that you need to recreate the second part of Figure 1 from Warr (1993) on pg. 23. If you have followed along up to this point, you should have the summary data (nys_warr1993_peersum) as described above in section 4.1.

All that is left to do is:

  1. Plot the individual plots for the “Burglary”, “Selling Hard Drugs”, “Theft < $5”, and “Theft > $50” peer delinquency measures; and

  2. Use “patchwork” to combine the four plots so they are laid out like the first part of Warr’s plot on pg. 22.

    • Note: Pay attention to the y-axis scales in the plots on pg. 23 as they are not the same as in the plots on pg. 22.

  3. Provide your substantive interpretation of what the plots suggest about the relationship between age and peer delinquency.

  4. Create a “Conclusion” section where you write about what you learned in this assignment and any problems or issues you had in completing it.

Part 6 (Assignment 3.6)

Submit your assignment

  1. Upon completing the tasks in the previous four sections, “knit” your final RMD file to html format and save the knitted html document to your “Assignments” folder in your LastName_CRM495_work folder as: LastName_CRM495_RAssign3_YEAR_MO_DY.

  2. Inside the “LastName_CRM495_commit” folder in our shared folder, create another folder named: Assignment 3.

  3. To submit your assignment for grading, save copies of both (1) your “RAssign3” html file and (2) your “RAssign3” RMD file into the LastName_CRM495_commit > Assignment 3 folder. Remember, be sure to save copies of both files. Do not just drag the files over from your “work” folder, or you may lose the original copies in your “work” folder.

  4. Finally, submit your knitted html document on Canvas in the “R Assignment 3” submission portal. This will allow me to have a time-stamped version of your assignment for grading purposes.