Assumptions & Ground Rules

The purpose of this assignment is to introduce you to the powerful data visualization functionality in R, particularly the capabilities of the “ggplot2” package. Specifically, we will:

  1. Motivate data visualization as an important part of any data analysis workflow.
  2. Provide a basic overview of the data analyzed by Warr (1993).
  3. Introduce some basic functionality of the “ggplot2” package that is part of the tidyverse suite of packages.
  4. Reproduce Figure 1 from Warr’s (1993) study (as displayed on pp. 22-23 of his article) using the first five waves of the National Youth Survey (NYS).

I assume that you are now familiar with installing and loading packages in R. Thus, when you see a package being used, I expect that you know it needs to be installed and that it needs to be loaded within your own R session in order to use it.

At this point, I also assume you are familiar with RStudio and with creating R Markdown (RMD) files. If not, please review R Assignments 1 & 2.

  • Note: For this assignment, have R Markdown knit to an HTML file.

As with previous assignments, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste from the instructions except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).



Part 1 (Assignment 3.1): Why Visualize Data?

When I was in school, both undergraduate and graduate school, data visualization was not emphasized in my methods and data analysis training. When it was introduced, it was introduced as more of an afterthought than a central part of the research and data analysis workflow. Indeed, if you look through the criminological literature, you’ll often see minimal and/or relatively poor data visualizations (although see more recent work like Pickett et al., 2022 for some cool exceptions). What you will see, however, are tables filled with descriptive statistics (e.g. means, standard deviations, and bivariate correlations) and multivariate statistical summaries (e.g. linear regression results). In this section, I hope to show you why data visualization is an important and necessary part of the research process.

1.1: Anscombe’s Quartet

The classic example of why data visualization is important is “Anscombe’s Quartet,” published by Anscombe (1973) (you can find it via UNCW’s library). It is such a classic example that the data for Anscombe’s quartet are built into R (anscombe). Let’s take a look at it:

library(dplyr)  # relocate() comes from the dplyr package
aq <- relocate(anscombe, x1, y1, x2, y2, x3, y3, x4, y4)
x1   y1      x2   y2     x3   y3      x4   y4
10    8.04   10   9.14   10    7.46    8    6.58
 8    6.95    8   8.14    8    6.77    8    5.76
13    7.58   13   8.74   13   12.74    8    7.71
 9    8.81    9   8.77    9    7.11    8    8.84
11    8.33   11   9.26   11    7.81    8    8.47
14    9.96   14   8.10   14    8.84    8    7.04
 6    7.24    6   6.13    6    6.08    8    5.25
 4    4.26    4   3.10    4    5.39   19   12.50
12   10.84   12   9.13   12    8.15    8    5.56
 7    4.82    7   7.26    7    6.42    8    7.91
 5    5.68    5   4.74    5    5.73    8    6.89


What you have here are four different “toy” datasets that each contain two variables (“x” and “y”), with the number in each variable name indicating the dataset it belongs to. Here, I have ordered the columns so that the x and y variables for each dataset are next to each other. Let’s take a look at some traditional summary statistics for each of these four datasets.

Anscombe's Quartet: Descriptive Statistics

                               Quartet Data
Variable         Statistic     Data #1      Data #2      Data #3      Data #4
x                Mean (SD)     9.00 (3.32)  9.00 (3.32)  9.00 (3.32)  9.00 (3.32)
y                Mean (SD)     7.50 (2.03)  7.50 (2.03)  7.50 (2.03)  7.50 (2.03)
x-y correlation  Pearson's r   0.82         0.82         0.82         0.82
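
If you want to check these numbers yourself, here is a minimal sketch of one way to compute them using only base R and the built-in anscombe data. This is not necessarily how the table above was produced; the object name stats and the row labels (x_mean, x_sd, and so on) are my own choices for illustration.

```r
# For each of the four datasets, pull out its x and y columns
# and compute the same summaries reported in the table.
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(x_mean = mean(x), x_sd = sd(x),
    y_mean = mean(y), y_sd = sd(y),
    r = cor(x, y))  # Pearson's r is the default method for cor()
})

# Label the columns to match the table (illustrative labels)
colnames(stats) <- paste0("Data #", 1:4)

# Round to two decimals for display
round(stats, 2)
```

Looping over the dataset number with sapply() avoids typing the same calculation four times, and it returns a matrix with one column per dataset.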


As demonstrated above, the four datasets have nearly identical means and standard deviations and the same correlation coefficient (at least to the second decimal place). That looks like the same data, right? Wrong! Here is what it looks like when you plot it:

Notice how data #1 demonstrates a linear relationship between x and y that is somewhat noisy; data #2 displays a curvilinear relationship; data #3 is almost perfectly linear except for one large outlier; and the values in data #4 are stacked up on the same x-value except for one outlier in the upper right corner of the graph. Looking at the data clearly changes your impression of what these different datasets are telling you about the relationship between x and y.
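
A four-panel plot like the one described above could be drawn roughly as follows. Treat this as a sketch under my own assumptions rather than the exact code behind the original figure: the object names aq_long and p, the "Data #" panel labels, and the choice to overlay a straight-line fit in each panel are all illustrative.

```r
library(ggplot2)

# Stack the four x/y pairs into one "long" data frame with a
# column identifying which dataset each row belongs to.
aq_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    dataset = paste0("Data #", i),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))

# One scatterplot per dataset, each with a linear fit line
p <- ggplot(aq_long, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ dataset)

p
```

Because facet_wrap() uses the same axes in every panel by default, the differences in the shape of each relationship are easy to compare side by side, even though the fitted lines look nearly the same.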

1.2: Real-World Example of Importance of Data Visualization

This is clearly a toy example meant to demonstrate the importance of looking at/visualizing your data. But these issues do come up in real-world research as evidenced in Chapter 1 (section 1.1) of Kieran Healy’s book Data Visualization: A Practical Introduction.