The purpose of this assignment is to introduce you to the powerful data visualization functionality in R, and in particular to the capabilities of the “ggplot2” package. Specifically, we will:
I assume that you are now familiar with installing and loading packages in R. Thus, when you see a package being used, I expect that you know it needs to be installed and that it needs to be loaded within your own R session in order to use it.
At this point, I also assume you are familiar with RStudio and with creating R Markdown (RMD) files. If not, please review R Assignments 1 & 2.
As with previous assignments, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste from the instructions except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).
library(tidyverse)
library(patchwork)
library(here)
library(gt)
library(datasauRus)
When I was in school, both undergraduate and graduate school, data visualization was not emphasized in my methods and data analysis training. When it was introduced, it was introduced as more of an afterthought than a central part of the research and data analysis workflow. Indeed, if you look through the criminological literature, you’ll often see minimal and/or relatively poor data visualizations (although see more recent work like Pickett et al., 2022 for some cool exceptions). What you will see, however, are tables filled with descriptive statistics (e.g. means, standard deviations, and bivariate correlations) and multivariate statistical summaries (e.g. linear regression results). In this section, I hope to show you why data visualization is an important and necessary part of the research process.
The classic example of why data visualization is important is “Anscombe’s Quartet,” published by Anscombe (1973) (you can find it via UNCW’s library here). It’s such a classic example that the data for Anscombe’s quartet are built into R (anscombe). Let’s take a look at it:
aq <- relocate(anscombe, x1, y1, x2, y2, x3, y3, x4, y4)
gt(aq)
x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 |
---|---|---|---|---|---|---|---|
10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
14 | 9.96 | 14 | 8.10 | 14 | 8.84 | 8 | 7.04 |
6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
4 | 4.26 | 4 | 3.10 | 4 | 5.39 | 19 | 12.50 |
12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 8 | 5.56 |
7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 |
5 | 5.68 | 5 | 4.74 | 5 | 5.73 | 8 | 6.89 |
What you have here are four different “toy” datasets that each contain two variables (“x” and “y”), with numbers indicating the dataset they belong to. Here, I have ordered the columns so that the x and y variables for each dataset are next to each other. Let’s take a look at some traditional summary statistics for each of these four datasets.
Variable | Statistic | Quartet Data #1 | Quartet Data #2 | Quartet Data #3 | Quartet Data #4 |
---|---|---|---|---|---|
x | Mean (SD) | 9.00 (3.32) | 9.00 (3.32) | 9.00 (3.32) | 9.00 (3.32) |
y | Mean (SD) | 7.50 (2.03) | 7.50 (2.03) | 7.50 (2.03) | 7.50 (2.03) |
x-y correlation | Pearson's r | 0.82 | 0.82 | 0.82 | 0.82 |
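The figures in this table can be checked with a few lines of base R. The following is only a minimal sketch, not necessarily how the table above was generated; the helper summarize_pair() and the object quartet_summary are names made up for this illustration.

```r
# Summary statistics for each x-y pair in the built-in `anscombe` data.
# `summarize_pair()` is a hypothetical helper written for this sketch.
summarize_pair <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    dataset = i,
    mean_x  = mean(x),
    sd_x    = sd(x),
    mean_y  = mean(y),
    sd_y    = sd(y),
    r_xy    = cor(x, y)  # Pearson's r is the default method in cor()
  )
}

# Stack the four one-row summaries and round to two decimals for display
quartet_summary <- do.call(rbind, lapply(1:4, summarize_pair))
round(quartet_summary, 2)
```

Rounding only at the display step keeps the full-precision values available in quartet_summary if you want to see where the four datasets start to differ.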
As demonstrated above, each data set has similar means and standard deviations and identical correlation coefficients (at least to the second decimal). That looks like the same data, right? Wrong! Here is what it looks like when you plot it:
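If you want to draw a plot like this yourself, one possible approach is sketched here. It assumes the tidyr and ggplot2 packages (both part of the tidyverse); the object names anscombe_long and quartet_plot are made up for this example, and the styling will not necessarily match the figure in this assignment.

```r
library(tidyr)
library(ggplot2)

# Reshape the built-in `anscombe` data from wide (x1, y1, ..., x4, y4) to long:
# one row per point, with columns `set`, `x`, and `y`.
anscombe_long <- pivot_longer(
  anscombe,
  cols = everything(),
  names_to = c(".value", "set"),
  names_pattern = "(.)(.)"  # first character -> x or y, second -> dataset number
)

# One scatterplot per dataset, each with the same kind of fitted straight line
quartet_plot <- ggplot(data = anscombe_long, mapping = aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~set, ncol = 2) +
  theme_minimal()

quartet_plot
```

Reshaping to long format first is a design choice: it lets a single facet_wrap() call produce all four panels instead of building four separate plots.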
Notice how data #1 demonstrates a linear relationship between x and y that is somewhat noisy; data #2 displays a curvilinear relationship; data #3 is almost perfectly linear except for one large outlier; and the values in data #4 are stacked up on the same x-value except for one outlier in the upper right corner of the graph. Looking at the data clearly changes your impression of what these different datasets are telling you about the relationship between x and y.
This is clearly a toy example meant to demonstrate the importance of looking at/visualizing your data. But these issues do come up in real-world research as evidenced in Chapter 1 (section 1.1) of Kieran Healy’s book Data Visualization: A Practical Introduction.
In the above chart from Jackman (1980), which was responding to a paper by Hewitt (1977), removing the outlier (South Africa) changes the relationship between voter turnout and income inequality. As Healy states:
The original paper had argued for a significant association between voter turnout and income inequality based on a quantitative analysis of eighteen countries. When this relationship was graphed as a scatterplot, however, it immediately became clear that the quantitative association depended entirely on the inclusion of South Africa in the sample.
For a perhaps more fun toy example, check out the “Datasaurus” package. Like “Anscombe’s Quartet,” all of the underlying data in the following plots have the same means, standard deviations, and x-y correlations.
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(~dataset, ncol = 3) +
  ggtitle("DatasauRus Visualization") +
  theme_minimal() +
  # `linewidth` replaces the deprecated `size` argument for borders (ggplot2 >= 3.4.0)
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1),
        strip.text.x = element_text(face = "bold.italic"),
        plot.title = element_text(hjust = 0.5),
        legend.position = "none")
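To check the claim that these datasets share the same summary statistics, you could compute them by group. This is a minimal sketch assuming the dplyr and datasauRus packages are installed; datasaurus_summary is a name made up for this example.

```r
library(dplyr)
library(datasauRus)

# Means, standard deviations, and x-y correlation for each dataset
datasaurus_summary <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(
    mean_x = mean(x),
    sd_x   = sd(x),
    mean_y = mean(y),
    sd_y   = sd(y),
    r_xy   = cor(x, y)
  )

datasaurus_summary
```

Each row of the result corresponds to one panel in the plot, so you can compare the numbers directly against how different the panels look.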
Before we can start plotting, we need to have some data. For this exercise, we are going to use data from the National Youth Survey (NYS). Specifically, we’ll use data from Mark Warr’s (1993) now-classic study in Criminology titled “Age, Peers, and Delinquency.”
I recommend doing a quick AIC reading of Warr’s study. To get you started, here is the abstract:
Hirschi and Gottfredson (1983; Gottfredson and Hirschi, 1990) have argued that the age distribution of crime cannot be explained by any known variables, and they point specifically to the failure of sociological theories to explain this phenomenon. This paper examines a quintessentially sociological theory of crime (differential association) and evaluates its ability to explain the age distribution of crime. Analysis of data from the National Youth Survey on persons aged 11-21 reveals that peer relations (exposure to delinquent peers, time spent with peers, loyalty to peers) change dramatically over this age span, following much the same pattern as crime itself. When measures of peer influence are controlled, the effects of age on self-reported delinquency are largely rendered insignificant. Additional analyses show that delinquent friends tend to be “sticky” friends (once acquired, they are not quickly lost) and that Sutherland’s arguments concerning the duration and priority of delinquent associations are only partially correct.
The paper is fundamentally about investigating the age-crime relationship and specifically examining whether delinquent peer influence explains the age-distribution of crime. To investigate this, Warr (1993) uses the first five waves of the National Youth Survey (NYS). Here is the first section of the “Data and Methods” section where he outlines the specific data he is analyzing (p. 19):
The data for this study come from the National Youth Survey (NYS). The NYS is a longitudinal study of a national probability sample of 1,726 persons aged 11-17 in 1976 (see Elliott et al., 1985). The sample was obtained through a multistage, cluster sampling of households in the continental United States in 1976. Five consecutive annual waves of the survey were conducted from 1976 through 1980, and these five are used in this analysis.
As you learned in your first project assignment, the National Youth Survey (NYS) data used in Warr’s study are available on ICPSR. Eventually I will show you how to download the data directly from ICPSR entirely within R. But for the current assignment, you will simply be working with the R dataset that I provided you on Canvas. In order to get the data loaded into R, follow these steps:
Save your primary RMarkdown file in your top-level directory folder. For us, that means saving your RMarkdown file in your LastName_CRM495_work folder.
Next, after closing RStudio, click directly on the RMarkdown file in your LastName_CRM495_work folder to automatically open it with RStudio. When you do this, your “working directory,” i.e., the place R looks for files by default, will automatically be set to your LastName_CRM495_work folder.
The here package will then make it easy to find and call objects (e.g., datasets) in subfolders of your working directory (e.g., in “Datasets” folder).
Note: The code below shows you how to use the “here” package once it’s installed, assuming you followed the above directions.
library(here)
load(file = here("Datasets", "nys_warr1993.rda"))
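If load() fails with a “cannot open” error, the usual cause is that the file is not where here() is looking. The following is a small troubleshooting sketch under that assumption; data_path is a name made up for this example, and the folder layout is the one described in the steps above.

```r
library(here)

# Build the full path to the data file, relative to the root that here() detects
data_path <- here("Datasets", "nys_warr1993.rda")

# Only try to load the file if it is actually there; otherwise report where R looked
if (file.exists(data_path)) {
  load(file = data_path)
} else {
  message("File not found: ", data_path,
          "\nProject root detected by here(): ", here())
}
```

Printing the detected root is often enough to spot the problem, for example when RStudio was opened from a different folder than the one holding your RMarkdown file.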
Now that you have the data in your environment, you can take a look at it. If you look at the global environment where the data object is stored in your RStudio session, you’ll notice that the data have 8,625 observations and 20 variables. There are so many observations because the file pools the first five waves of the NYS, so it includes one row for each individual in each wave. If you look at the data, you’ll notice that it is sorted by individual.
head(nys_warr1993, 10) %>%
gt()
CASEID | wave | age | peer_mar_dic | peer_alc_dic | peer_cheat_dic | peer_vandal_dic | peer_burg_dic | peer_selldrugs_dic | peer_theftlt5_dic | peer_theftgt50_dic | evsoc_dic | socimp_dic | peertroub_dic | resp_mar | resp_cheat_trunc | resp_burg_trunc | resp_selldrugs_trunc | resp_theftlt5_trunc | resp_theftgt50_trunc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 13 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 2 | 14 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | NA | 0 | 2 | 0 |
1 | 3 | 15 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 |
1 | 4 | 16 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 5 | 17 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 9 | 0 | 0 | 0 | 3 | 0 |
2 | 1 | 15 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 2 | 16 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | NA | NA | 0 | 0 | 0 |
2 | 3 | 17 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 4 | 18 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 5 | 19 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Basically, it has each individual’s five waves of data stacked on top of each other. It would also make sense to sort it by wave, so that each wave of data was stacked on top of each other.
nys_warr1993 %>%
arrange(wave) %>%
head(10) %>%
gt()
CASEID | wave | age | peer_mar_dic | peer_alc_dic | peer_cheat_dic | peer_vandal_dic | peer_burg_dic | peer_selldrugs_dic | peer_theftlt5_dic | peer_theftgt50_dic | evsoc_dic | socimp_dic | peertroub_dic | resp_mar | resp_cheat_trunc | resp_burg_trunc | resp_selldrugs_trunc | resp_theftlt5_trunc | resp_theftgt50_trunc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 13 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 15 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 1 | 11 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 1 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 3 | 1 | 0 | 0 | 1 |
5 | 1 | 14 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 1 | NA | 1 | 2 | 0 | 0 | 5 | 0 |
6 | 1 | 11 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 1 | 0 |
7 | 1 | 14 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 7 | 5 | 0 | 0 | 0 | 0 |
8 | 1 | 11 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
9 | 1 | 11 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
10 | 1 | 12 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 2 | 0 | 0 | 5 | 0 |
When it comes to working with the data in R, it really doesn’t matter how you sort it because we included both the CASEID and wave variables in the data set. This allows us to sort, group, and analyze the data by either or both of these variables.
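To make this concrete, here is a minimal sketch using a tiny made-up stand-in for the real data (three hypothetical people observed in two waves). The object nys_toy and its values are invented for illustration; the same dplyr calls should work on nys_warr1993 because it has the same CASEID and wave columns.

```r
library(dplyr)

# Toy stand-in for nys_warr1993: 3 hypothetical people x 2 waves (made-up values)
nys_toy <- tibble(
  CASEID = c(1, 1, 2, 2, 3, 3),
  wave   = c(1, 2, 1, 2, 1, 2),
  age    = c(13, 14, 15, 16, 11, 12)
)

# Sort by wave first, then by person within each wave
nys_toy %>% arrange(wave, CASEID)

# Count the rows contributed by each wave
rows_per_wave <- nys_toy %>% count(wave)
rows_per_wave

# Count the number of distinct people in the data
n_people <- n_distinct(nys_toy$CASEID)
n_people
```

Running count(wave) and n_distinct(CASEID) on the real data is a quick way to confirm the pooled structure described above.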
Ok, so what do all of these variables actually measure?
If you are interested in the specific question wording in the NYS for each of these items, feel free to take a look at the codebooks. Here is a basic key that I created:
Warr (1993) NYS Items
variable [1] | wave1 | wave2 | wave3 | wave4 | wave5 |
---|---|---|---|---|---|
icpsr | 8375 | 8424 | 8506 | 8917 | 9112 |
age | V169 | V7 | V10 | V6 | V6 |
peer_mar_dic | V367 | V210 | V308 | V288 | V315 |
peer_alc_dic | V370 | V213 | V311 | V291 | V318 |
peer_cheat_dic | V365 | V208 | V306 | V286 | V313 |
peer_vandal_dic | V366 | V209 | V307 | V287 | V314 |
peer_burg_dic | V371 | V214 | V312 | V292 | V319 |
peer_selldrugs_dic | V372 | V215 | V313 | V293 | V320 |
peer_theftlt5_dic | V368 | V211 | V309 | V289 | V316 |
peer_theftgt50_dic | V373 | V216 | V314 | V294 | V321 |
evsoc_dic | V179 | V17 | V81 | V24 | V37 |
socimp_dic | V180 | V18 | V82 | V25 | V38 |
peertroub_police | V377 | V223 | V321 | V301 | V328 |
resp_mar | V479 | V566 | V531 | V597 | V572 |
resp_cheat | V412 | V293 | V400 | V438 | 478 |
resp_burg | V454 | V355 | V444 | V551 | V524 |
resp_selldrugs | V428 | V309 | V418 | V481 | V496 |
resp_theftlt5 | V400 | V281 | V388 | V395 | V464 |
resp_theftgt50 | V386 | V267 | V374 | V365 | V448 |
[1] Note: 'icpsr' indicates the ICPSR study number for the data set, not a variable.
In a future assignment, you’ll learn how to take the raw data from ICPSR and wrangle and recode it to align with the coding decisions of a specific study like I did for you in this case. But for now, our main goal is to learn some of the basics of the “ggplot2” package. So let’s get on to actually creating some plots!
The “ggplot2” package is based on a famous data visualization book by Leland Wilkinson called The Grammar of Graphics. Hadley Wickham built ggplot2 to provide a layered approach to this model of data visualization. The key to ggplot2 is to understand that you can build a plot in layers after telling ggplot() what data set you are using and mapping specific aesthetics to your plot (e.g., what variables to put on the x-axis and y-axis).
The main function is ggplot(). If we simply write ggplot(), we will get a blank canvas. And maybe this is a useful way to think about a chart. You are starting with a blank canvas, and the goal is to display information in a clear and insightful manner. Here is how Hadley Wickham says it in the “Data Visualization” chapter of his book:
With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to.
So, let’s try to build your intuition about what “ggplot2” is doing by building a simple plot in layers. First, let’s start by simply telling R to start a plot.
library(tidyverse)
ggplot(data = nys_warr1993)
Starting a ggplot() simply creates a blank canvas.
Notice how we specified the data within the ggplot() command with data = nys_warr1993? We could have simply typed ggplot(nys_warr1993) and it would have done the same thing. Navarro discusses this in her video series when she discusses named and unnamed arguments. At this point in your R journey, I concur with Navarro’s advice to err on the side of being more rather than less explicit with the arguments you provide to a function. So I will follow the more verbose version of ggplot code in this assignment. For example, anyone reading data = nys_warr1993 can tell that it’s very likely nys_warr1993 refers to a dataset.
Now that we have our blank canvas, we can start adding features or, in the language of ggplot, mapping aesthetics to it. Let’s start by adding the age variable on the x-axis as in Warr’s (1993) Figure 1.
ggplot(data = nys_warr1993,
mapping = aes(x = age))
Now our blank canvas has the age variable mapped to the x-axis. The plot itself is still blank because we have not told R how to visualize the age variable. In ggplot(), you do this with a “geom.” Here is what Wickham and Grolemund said about geoms in R for Data Science:
A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.
There are lots of different geoms built into the ggplot2 package, and a lot of other people create various geoms for specialized plots (see, for example, Matthew Kay’s work here and here).
We will start by creating a simple bar chart of the age distribution in our data.
ggplot(data = nys_warr1993, mapping = aes(x = age)) +
geom_bar(fill = "lightblue")
Note: We added the “geom” layer to the plot by using the + symbol and then telling ggplot() what we wanted to add (in this case a “geom_bar”). Once you start a ggplot(), this is generally how you add layers and/or features to the chart: you simply follow the logic of starting with a basic plot and then sequentially adding (+) layers to it.
Note: In this plot, we told R to use the geom_bar() geom to represent the age data as a bar chart. Also, note that we told geom_bar() to make the bars light blue with fill = "lightblue". There are numerous built-in colors available in R (see here), and you can enter hexadecimal color values as quoted strings that begin with “#” (e.g., fill = "#add8e6"). In general, we recommend using colorblind-friendly palettes whenever possible.
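As a quick illustration of both options, here is a minimal sketch using a handful of made-up ages in place of the real data. It assumes R 4.0.0 or later, where palette.colors() provides the colorblind-friendly Okabe-Ito palette; the object names are invented for this example.

```r
library(ggplot2)

# Toy ages standing in for the age variable (made-up values, for illustration)
toy_ages <- data.frame(age = c(11, 12, 12, 13, 13, 13, 14, 14, 15))

# Okabe-Ito: a colorblind-friendly palette that ships with base R (>= 4.0.0)
okabe_ito <- palette.colors(palette = "Okabe-Ito")

# A hex code is a character string, so it must be quoted
bar_hex <- ggplot(data = toy_ages, mapping = aes(x = age)) +
  geom_bar(fill = "#add8e6")

# The same chart using one named entry from the Okabe-Ito palette
bar_cb <- ggplot(data = toy_ages, mapping = aes(x = age)) +
  geom_bar(fill = okabe_ito[["skyblue"]])

bar_hex
bar_cb
```

Leaving the quotes off a hex code makes R treat everything after the # as a comment, which typically produces a confusing error about an incomplete or unexpected expression.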
Note: fill controls the color of the fill for a particular geom. If you want to change the color of the lines around the individual bars, you would add a color = argument to the geom. Here I’ll add color = "black" in order to make the bars stand out a little more:
ggplot(data = nys_warr1993, mapping = aes(x = age)) +
geom_bar(fill = "lightblue", color = "black")
The chart above looks fine and provides us with the basic age distribution across the first five waves of the NYS data. This is often useful when getting a sense of how variables in your data are distributed.
We could spend a lot of time going through these kinds of basic ggplot skills. For now, I just want you to understand that any ggplot is essentially built as a bunch of layers with specific aesthetics mapped to different variables and visual features as well as different “geoms” used to represent the data in specific ways.
Danielle Navarro’s video series on ggplot2 covers a lot of these basics. For now, let’s move on to the plots that we are actually trying to re-create from Warr’s (1993) Figure 1. I think you will learn some more basic features of “ggplot2” and some additional add-on packages through working on this specific problem.
Here is the first page of Warr’s Figure 1 (1993, p.22):
As implied by the title of the article (and the figure itself), the first key variable across each of the figures that we are going to try to reproduce is “Age.” The second set of items is “exposure to delinquent peers,” which I described above in section 2.3 (and as described by Warr, 1993 on pg. 20).
Sometimes, as in section 3.3, we can just take the raw data and plot it. But this is actually pretty rare. We usually need to wrangle and recode data to get it in the form we want to plot. In this case, I have provided you with an R data file named “nys_warr1993_peersum.rda”. Go ahead and load that into your R session as described earlier (see section 2.2). It simply provides summary data for each age group by each of the eight peer delinquency variables in Warr’s (1993) Figure 1. Specifically, it includes 11 observations (one for each age group represented in the first five waves of the NYS) and 9 variables:
Now that we have this summary data set, plotting it is relatively easy. We just need to tell ggplot what specific variables to use and the specific geom we want to represent the data as. There are some other details, but I think we can get something that looks like Warr’s (1993) Figure 1 relatively quickly.
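The summary file was prepared for you, but it may help to see roughly how per-age percentages like these can be computed from person-wave data. The sketch below is an illustration only: toy_long, peer_x_dic, and perc_peer_x are made-up names and values, and it assumes a 0/1 indicator where 1 marks the category being counted. It is not the code used to create nys_warr1993_peersum, and the coding of the real indicators is not documented here.

```r
library(dplyr)

# Toy person-wave data (made-up values). `peer_x_dic` is a hypothetical
# 0/1 indicator; it is not one of the NYS variables.
toy_long <- tibble(
  age        = c(11, 11, 11, 11, 12, 12, 12, 12),
  peer_x_dic = c(1, 1, 1, 0, 1, 0, NA, 0)
)

# One row per age: percentage of non-missing responses coded 1
toy_peersum <- toy_long %>%
  group_by(age) %>%
  summarise(perc_peer_x = 100 * mean(peer_x_dic, na.rm = TRUE))

toy_peersum
```

Because the indicator is coded 0/1, its mean is the proportion coded 1, so multiplying by 100 gives a percentage. Using na.rm = TRUE drops missing responses from both the numerator and the denominator.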
At this point, we will just focus on the first four behaviors: Marijuana, Alcohol, Cheating, and Vandalism. You’ll finish the assignment by creating the plot for the second set of four behaviors.
I’m going to start by plotting each delinquent peer exposure measure separately and then combine them later.
theme_set(theme_classic())
peer_mar_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_mar)) +
  geom_line() +
  geom_point(shape = "square")
peer_alc_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_alc)) +
  geom_line() +
  geom_point(shape = "square")
peer_cheat_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_cheat)) +
  geom_line() +
  geom_point(shape = "square")
peer_vandal_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_vandal)) +
  geom_line() +
  geom_point(shape = "square")
peer_mar_plot
peer_alc_plot
peer_cheat_plot
peer_vandal_plot
There are a few things to note in the above code:
Notice that we constructed the plots with two geoms (geom_point and geom_line). This is ggplot2’s layering at work. In order to create the line plot with the points, we needed to first draw the line, then add the points (we could have drawn the points and then the lines, but then the line would technically be in front of the points).
Notice how we specified the shape we wanted inside the geom_point() command with shape = "square". We could have also specified the number code for a solid square with shape = 15 to get the same thing. There are lots of different built-in shapes you can use for points (see here).
Notice that above the code for the four plots, we included the line theme_set(theme_classic()). ggplot2 has multiple built-in themes, and theme_classic() is the most similar to the look or aesthetics used in Warr’s (1993) Figure 1 plots. Of course, these themes are all customizable, and there are multiple packages you can download with additional themes (e.g., “ggthemes”).
Notice that I assigned each individual ggplot to its own object. Remember, pretty much anything in R can be assigned to an object. By doing that here, we will be able to combine them later.
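The points about layer order and shape codes can be checked directly on a saved plot object. This is a minimal sketch with made-up summary values; toy_sum, line_then_points, and layer_geoms are names invented for this example.

```r
library(ggplot2)

# Toy summary data (made-up percentages, for illustration only)
toy_sum <- data.frame(age = 11:15, perc = c(80, 70, 55, 45, 40))

# Line first, then points: the points are drawn on top of the line
line_then_points <- ggplot(data = toy_sum, mapping = aes(x = age, y = perc)) +
  geom_line() +
  geom_point(shape = 15, size = 3)  # 15 is the numeric code for a solid square

# Each `+ geom_*()` call appends one layer, stored in drawing order
layer_geoms <- sapply(line_then_points$layers, function(l) class(l$geom)[1])
layer_geoms

line_then_points
```

Swapping the two geom lines reverses the order of the stored layers, which is why the line would then be drawn over the points.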
In the above plots, the actual data displayed appears to match up with Warr’s (1993) figure. However, the details of the plot (e.g. axis labels, scales, etc.) are not exactly how we want them. The good news is that virtually everything within ggplot2 is customizable. So, we should be able to make these changes relatively easily.
peer_mar_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_mar)) +
  geom_line() +
  geom_point(shape = "square") +
  scale_x_continuous(limits = c(11, 21), breaks = 11:21) +
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) +
  labs(title = "Marijuana", x = "Age", y = "Percent") +
  theme(plot.title = element_text(hjust = 0.5, size = 10),
        axis.title = element_text(size = 10))
peer_alc_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_alc)) +
  geom_line() +
  geom_point(shape = "square") +
  scale_x_continuous(limits = c(11, 21), breaks = 11:21) +
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) +
  labs(title = "Alcohol", x = "Age", y = "Percent") +
  theme(plot.title = element_text(hjust = 0.5, size = 10),
        axis.title = element_text(size = 10))
peer_cheat_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_cheat)) +
  geom_line() +
  geom_point(shape = "square") +
  scale_x_continuous(limits = c(11, 21), breaks = 11:21) +
  scale_y_continuous(limits = c(0, 70), breaks = seq(0, 70, 10)) +
  labs(title = "Cheating", x = "Age", y = "Percent") +
  theme(plot.title = element_text(hjust = 0.5, size = 10),
        axis.title = element_text(size = 10))
peer_vandal_plot <- ggplot(data = nys_warr1993_peersum, aes(x = age, y = perc_peer_vandal)) +
  geom_line() +
  geom_point(shape = "square") +
  scale_x_continuous(limits = c(11, 21), breaks = 11:21) +
  scale_y_continuous(limits = c(50, 80), breaks = seq(50, 80, 10)) +
  labs(title = "Vandalism", x = "Age", y = "Percent") +
  theme(plot.title = element_text(hjust = 0.5, size = 10),
        axis.title = element_text(size = 10))
peer_mar_plot
peer_alc_plot
peer_cheat_plot
peer_vandal_plot
I essentially added three things to the above plots.
I added both scale_x_continuous and scale_y_continuous functions. In each of these, we set the limits of the scale on the x-axis and y-axis, respectively, and we set the location of the breaks. For the x-axis, we simply instructed ggplot2 to place a break at each whole number from 11 to 21 (breaks = 11:21). For the y-axis of the marijuana and alcohol plots, we told ggplot2 to place a break at every 10 units between zero and one hundred (breaks = seq(0, 100, 10)); the cheating and vandalism plots use narrower ranges (0 to 70 and 50 to 80, respectively).
I added the labs command, which allowed me to specify the labels we wanted for the “title” of the plot and for the x-axis and y-axis, respectively (see here for details).
I made some small adjustments to the underlying theme we set. If you want to see what is different, you can run the code above without the theme arguments. And if you are a real glutton for punishment, check out the ggplot2 website: “Modify components of a theme.”
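Because the four plots differ only in the y variable, the title, and the y-axis limits, the repeated code could be wrapped in a small function. The sketch below is one possible way to do that, not the approach used in this assignment: make_peer_plot() is a hypothetical helper, and toy_peersum contains made-up values standing in for the real summary data.

```r
library(ggplot2)

# Hypothetical helper: builds one panel for the column named in `yvar`
make_peer_plot <- function(data, yvar, title, ylimits) {
  ggplot(data = data, mapping = aes(x = age, y = .data[[yvar]])) +
    geom_line() +
    geom_point(shape = "square") +
    scale_x_continuous(limits = c(11, 21), breaks = 11:21) +
    scale_y_continuous(limits = ylimits,
                       breaks = seq(ylimits[1], ylimits[2], 10)) +
    labs(title = title, x = "Age", y = "Percent") +
    theme(plot.title = element_text(hjust = 0.5, size = 10),
          axis.title = element_text(size = 10))
}

# Toy summary data (made-up percentages) with one row per age from 11 to 21
toy_peersum <- data.frame(age = 11:21, perc_peer_toy = seq(90, 40, by = -5))

toy_plot <- make_peer_plot(toy_peersum, "perc_peer_toy", "Toy behavior", c(0, 100))
toy_plot
```

With the real data, a call such as make_peer_plot(nys_warr1993_peersum, "perc_peer_mar", "Marijuana", c(0, 100)) should produce a panel equivalent to the one built by hand above. Since you are asked to type commands by hand in this course, treat this as something to explore after you have written the long version yourself.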
The last thing we are going to do is show you how to combine separate plots into one using the “patchwork” package (be sure to install it in the console). There are lots of ways to produce multiple plots on the same overall canvas. The “patchwork” package seems to me to be the easiest way to combine plots.
Let’s start by simply combining two plots (marijuana use and alcohol use) onto one plot so I can show you some intricacies of the “patchwork” package.
library(patchwork)
fig1_maralc <- (peer_mar_plot | peer_alc_plot) +
  plot_annotation(
    title = "Figure 1. Percentage of Respondents with No Delinquent Friends, by Age and Offense",
    theme = theme(plot.title = element_text(hjust = 0.5)))
fig1_maralc
Note: In the options for the R chunk (the text following the {r), I included knitted dimensions for the figure by writing {r, fig.width = 9, fig.height = 6}. This simply tells RMarkdown to produce a figure of those dimensions (essentially the size of regular letter-sized paper with one-inch margins), which will be particularly useful when knitting the final document.
Note: In the above plot, we could have placed the two plots on top of each other instead by separating them with a / rather than a |, like so:
#library(patchwork)
fig1_maralc <- (peer_mar_plot / peer_alc_plot) +
  plot_annotation(
    title = "Figure 1. Percentage of Respondents with No Delinquent Friends, by Age and Offense",
    theme = theme(plot.title = element_text(hjust = 0.5)))
fig1_maralc
In order to get all four plots on the same canvas, I can simply include the plots next to each other in parentheses like so:
fig1_pg22 <- (peer_mar_plot | peer_alc_plot) / (peer_cheat_plot | peer_vandal_plot) +
  plot_annotation(
    title = "Figure 1. Percentage of Respondents with No Delinquent Friends, by Age and Offense",
    theme = theme(plot.title = element_text(hjust = 0.5)))
fig1_pg22
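The | and / operators are not the only way to arrange plots with “patchwork.” As an alternative sketch, wrap_plots() takes a list of plots and lays them out on a grid. The example below uses four made-up toy panels rather than the peer plots; toy, make_toy(), toy_plots, and toy_grid are names invented for this illustration.

```r
library(ggplot2)
library(patchwork)

# Four toy panels (made-up data) standing in for the four peer plots
toy <- data.frame(age = 11:15, perc = c(80, 70, 55, 45, 40))

# Hypothetical helper that returns one small line plot with the given title
make_toy <- function(title) {
  ggplot(data = toy, mapping = aes(x = age, y = perc)) +
    geom_line() +
    labs(title = title)
}
toy_plots <- lapply(c("A", "B", "C", "D"), make_toy)

# wrap_plots() arranges a list of plots on a grid; ncol = 2 gives a 2 x 2 layout
toy_grid <- wrap_plots(toy_plots, ncol = 2) +
  plot_annotation(title = "Four toy panels")

toy_grid
```

The operator form is easier to read for a handful of plots, while wrap_plots() tends to be more convenient when the plots are already stored in a list.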
If you followed along to this point, you’ll notice the plot above looks very similar to pg. 22 in Warr’s (1993) article. The next step is to complete the second half of Figure 1 with the remaining four peer delinquency variables. I’m going to let you do this on your own.
You should now have everything that you need to recreate the second part of Figure 1 from Warr (1993) on pg. 23. If you have followed along up to this point, you should have the summary data (nys_warr1993_peersum) as described above in section 4.1.
All that is left to do is:
Plot the individual plots for the “Burglary”, “Selling Hard Drugs”, “Theft < $5”, and “Theft > $50” peer delinquency measures; and
Use “patchwork” to combine the four plots so they look like the first part of Warr’s plot on pg. 22.
Provide your substantive interpretation of what the plots suggest about the relationship between age and peer delinquency.
Create a “Conclusion” section where you write about what you learned in this assignment and any problems or issues you had in completing it.
Upon completing the tasks in the previous four sections, “knit” your final RMD file to html format and save the knitted html document to your “Assignments” folder in your LastName_CRM495_work folder as: LastName_CRM495_RAssign3_YEAR_MO_DY.
Inside the “LastName_CRM495_commit” folder in our shared folder, create another folder named: Assignment 3.
To submit your assignment for grading, save copies of both (1) your “RAssign3” html file and (2) your “RAssign3” RMD file into the LastName_CRM495_commit > Assignment 3 folder. Remember, be sure to save copies of both files. Do not just drag the files over from your “work” folder, or you may lose the original copies in your “work” folder.
Finally, submit your knitted html document on Canvas in the “R Assignment 3” submission portal. This will allow me to have a time-stamped version of your assignment for grading purposes.