Assignment 6 Objectives

The purpose of this sixth assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 5 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

This chapter covered measures of dispersion, including variation ratio, range, interquartile range, variance, and standard deviation. We use measures of dispersion to summarize the “spread” (rather than central tendency) of a data distribution. Likewise, in this assignment, you will learn how to use R to calculate measures of dispersion and create boxplots that help us standardize and efficiently describe the spread of a data distribution. You will also get additional practice with creating frequency tables and simple graphs in R, and you will learn how to modify some elements (e.g., color) of a ggplot object. As with previous assignments, you will be using R Markdown (with R and RStudio) to complete and submit your work.

By the end of Assignment 6, you should…

  • be able to calculate measures of dispersion by hand from frequency tables you generate in R
  • be able to generate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc::frq() or summarytools::descr())
  • be able to generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution
  • know how to change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “turquoise”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
  • know how to add or change a preset theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
  • be able to add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title"))

Assumptions & Ground Rules

We are building on objectives from Assignments 1-5. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • knit your RMD document into an HTML file that you can then save and submit for course credit

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method

Data viewing & wrangling

  • use the base R head() function to quickly view a snapshot of your data
  • use the glimpse() function to quickly view all columns (variables) in your data
  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions

Descriptive data analysis

  • use summarytools::dfSummary() to quickly describe one or more variables in a data file
  • create frequency tables with sjmisc::frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()
  • calculate measures of central tendency for a variable distribution using base R functions mean() and median() (e.g., mean(data$variable))
  • calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() functions

Data visualization & aesthetics

  • improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
  • create basic graphs using ggplot2’s ggplot() function

If you do not recall how to do these things, review Assignments 1-5.

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

  • measures of dispersion, such as variation ratio, range, interquartile range (IQR), variance, and standard deviation
  • the difference between range and IQR
  • the relationship between variance and standard deviation
  • how to calculate range, variation ratio, and IQR
  • how to calculate variance of a population, a sample, and a sample with grouped data
  • how to calculate standard deviation of a population, a sample, and a sample with grouped data
  • how to calculate sample variance and standard deviation with ungrouped and grouped data using computational formulas
  • boxplots, including steps for boxplot construction, elements of a boxplot, and how to read a boxplot to summarize the central tendency and dispersion of a data distribution

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).


Part 1 (Assignment 6.1)

Goal: Read in Youth Data and Determine Measures of Dispersion

(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR-MO-DY for the actual date. E.g., 2023-02-02_Ducate_CRIM5305_Assign06)

In the last assignment, you learned how to identify or calculate measures of central tendency from frequency tables to summarize the most common or “expected” value of a data distribution. In doing so, you learned how to decide which measures of central tendency are most appropriate or useful for summarizing specific variables. In this assignment, you will use frequency tables and boxplots to calculate and visualize measures of dispersion for several variables.

  1. Go to your CRIM5305_L folder, which should contain the R Markdown file you created for Assignment 5 (named YEAR-MO-DY_LastName_CRIM5305_Assign05). Click to open the R Markdown file.
    1. Remember, we open RStudio in this way so the here package will automatically set our CRIM5305_L folder as the top-level directory.
    2. In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.
    3. The dialogue box asks for a Title, an Author, and a Default Output Format for your new R Markdown file.
    4. In the Title box, enter CRIM5305 Assignment 6.
    5. In the Author box, enter your First and Last Name (e.g., Caitlin Ducate).
    6. In the Default Output Format box, select “Word document”.
  2. Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete this text before you begin working.
    1. Create a second-level header titled: “Part 1 (Assignment 6.1).”
    2. This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text: This R Markdown document contains my work for Assignment 6. It is my work and only my work.
  3. Create a third-level header in R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
    1. Insert an R chunk.
    2. Inside the new R code chunk, load the following packages: tidyverse, haven, here, sjmisc, sjPlot, and summarytools.
      • Recall, you only need to install packages one time. However, you must load them each time you start a new R session.

  4. After your first code chunk, create another third-level header in RMD titled: “Read Data into R”
    1. Insert another R code chunk.
    2. In the new R code chunk, read and assign the “Youth_0.sav” SPSS datafile into an R data object named YouthData.
      • Forget how to do this? Refer to Assignment 1.
    3. In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data object: YouthData. This will call the object and provide a brief view of the data. (Note: You can get a similar but more visually appealing view by simply clicking on the object in the “Environment” window.) Your RStudio session should now look a lot like this:
YouthData <- read_spss(here("Datasets", "Youth_0.sav"))
[Figure: View Youth Data]

As in the image, you should see 1,272 rows (or observations) and 23 columns (or variables.)
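If you would rather confirm the row and column counts in code than read them off the printed output, base R's dim(), nrow(), and ncol() functions report them directly. The sketch below uses a made-up stand-in object named toy_youth (a hypothetical name, not part of the assignment) that merely has the same shape as the data described above; in your own RMD file you would call these functions on YouthData.

```r
# Stand-in object with the same shape as the data described above
# (hypothetical; in your RMD file, YouthData comes from read_spss())
toy_youth <- as.data.frame(matrix(0, nrow = 1272, ncol = 23))

# dim() returns the number of rows, then the number of columns
toy_dims <- dim(toy_youth)
toy_dims

# nrow() and ncol() return each count separately
nrow(toy_youth)
ncol(toy_youth)
```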

  1. Now, insert an R chunk, type YouthData %>% view_df(), and hit RUN. Check your Viewer tab to get a better look at the variable names, labels, and values.
    • Forget how to do this? Refer to Assignment 2.

  2. Create a third-level header titled: “Frequency Table for ‘v77’ Variable”
    • Create a new R code chunk and type YouthData %>% frq(v77) to generate a frequency table for the variable that measures the ‘parental supervision scale.’
    • Note: R is case sensitive! Be sure you are typing “v77” with a lower-case, not upper-case, ‘v’.
    • Your frequency table should look like this:
[Figure: Frequency Table for v77]

  1. Using this frequency table, calculate the variation ratio of the variable and answer question 7 of Assignment 6.
    • REMEMBER: You can use R as a calculator. In fact, you can write a line of code that will calculate the value for you. If you go this route, remember to follow the order of operations (e.g., use parentheses in the right places). See my walk-through video if you are curious how to do this.
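As a rough sketch of the "R as a calculator" idea, the lines below compute a variation ratio as 1 minus the proportion of cases that fall in the modal category. The counts and object names (modal_freq, total_n) are made-up placeholders, not the actual v77 frequencies; substitute the values from your own frequency table.

```r
# Hypothetical counts for illustration only (NOT the real v77 frequencies)
modal_freq <- 40   # number of cases in the modal (most common) category
total_n    <- 100  # total number of valid cases

# Variation ratio = 1 - (modal frequency / total cases)
# The parentheses make the order of operations explicit
variation_ratio <- 1 - (modal_freq / total_n)
variation_ratio
```

A value closer to 0 means more cases are concentrated in the modal category; a larger value means the cases are more spread out across categories.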

Part 2 (Assignment 6.2)

Goal: Determine Measures of Dispersion for fropinon Variable

Now, we are going to generate frequency tables for three variables, use these tables to determine measures of dispersion, and then answer Question 5 on page 145 of your book (i.e., standard deviation, variance, range, minimum value, and maximum value.) These measurements of dispersion will help us to infer meaningful information about spread of these distributions in this sample.

You should have read about how to calculate measures of dispersion by hand in the book chapter; you can also calculate these directly in R. For instance, you may have noticed that the frequency table you generated earlier using sjmisc::frq() included the standard deviation (“sd=”) in the output. You may also recall that the descriptive statistics table you generated in Assignment 4 using summarytools::descr() included the standard deviation, along with the minimum value, maximum value, IQR, and other information. However, for this part of this assignment, you should be able to generate the frequency tables in R and then calculate all dispersion measures by hand. This will help you better understand what the programs are reporting and how they generated these measures. If you want to read more about measures of dispersion and how to calculate them in R, you might want to check out here and here.
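To see how a hand calculation lines up with what R reports, here is a minimal sketch that applies the computational formula for a sample variance with grouped data to a small frequency table, then cross-checks the result against base R's var() and sd(). The values and frequencies are invented for illustration and are not taken from the Youth data; the object names are also just placeholders.

```r
# Hypothetical frequency table for a five-category variable (made-up counts)
values <- c(1, 2, 3, 4, 5)
freqs  <- c(10, 6, 2, 1, 1)

n       <- sum(freqs)             # total number of cases
sum_fx  <- sum(freqs * values)    # sum of f * x
sum_fx2 <- sum(freqs * values^2)  # sum of f * x^2

# Computational formula for the sample variance with grouped data:
# (sum of f*x^2 minus (sum of f*x)^2 / n), all divided by (n - 1)
variance_by_hand <- (sum_fx2 - (sum_fx^2 / n)) / (n - 1)
sd_by_hand       <- sqrt(variance_by_hand)

# Cross-check: expand the table back into raw scores and use base R
raw_scores <- rep(values, times = freqs)
variance_by_hand
var(raw_scores)
sd_by_hand
sd(raw_scores)
```

Both approaches divide by n - 1, so the hand-calculated and built-in results should agree apart from rounding.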

  1. Create a second-level header titled: “Part 2 (Assignment 6.2).” Then, create a third-level header titled: “Calculate Measures of Dispersion for fropinon, delinquency, and certain”
    • Note: The fropinon variable is a five-category ordinal measure asking respondents how wrong they think their friends think it is to steal. Responses range from 1 (always wrong) to 5 (never wrong). However, the variable is misspelled: instead of “fropinion” with two i’s, the variable is fropinon with one ‘i’. Be sure to spell the variable as it is found in the dataset when referencing it in R code chunks. Also, remember that R is case sensitive. So, if you type “fropinion” or “Fropinon” instead, R will not be able to find the variable!
    • Also note that this is why learning to use RStudio’s ability to autocomplete is valuable. As long as I type the first few letters of a variable, it will complete the name for me, ensuring my variable is typed correctly. Remember that you can use TAB to autocomplete.
  2. Create a new R code chunk and type YouthData %>% frq(fropinon)
    1. Repeat the above step for the other 2 variables, delinquency and certain. Before each new R chunk, create a third-level header titled: “Frequency Table of [Variable Name]”. For example, when you create the frequency table for the delinquency variable, create a third-level header above it titled “Frequency Table of ‘delinquency’”.
    2. NOTE: If you just want to calculate the standard deviation, you can type sd(data$varname), where you substitute the name of the data set for data and the name of the variable for varname.
    3. Then, answer questions 9-12
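The sd(data$varname) pattern mentioned above can be sketched with a small made-up data frame. One thing to watch for: base R functions like sd() return NA if the variable contains any missing values, unless you add na.rm = TRUE. The toy_data object and its values are hypothetical; whether the Youth variables contain missing values is something to check in your own frequency tables.

```r
# Hypothetical stand-in data; in the assignment you would use YouthData
toy_data <- data.frame(varname = c(1, 2, 2, 3, 5, NA))

# Without na.rm = TRUE, a single NA makes the result NA
sd_with_na <- sd(toy_data$varname)
sd_with_na

# na.rm = TRUE drops missing (NA) values before calculating
toy_sd  <- sd(toy_data$varname, na.rm = TRUE)
toy_var <- var(toy_data$varname, na.rm = TRUE)
toy_min <- min(toy_data$varname, na.rm = TRUE)
toy_max <- max(toy_data$varname, na.rm = TRUE)

# Range as a single number: maximum minus minimum
toy_range <- toy_max - toy_min

toy_sd
toy_var
toy_range
```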

Graphical representations can be helpful, especially for determining distribution (or skew.) They can also help to determine measures of dispersion, such as range and interquartile range. In the next section, you will create a boxplot for fropinon.
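The range and interquartile range that a boxplot displays can also be checked numerically. The sketch below uses base R's range(), quantile(), and IQR() on a short made-up vector (toy_scores is a placeholder name). Note that R's default quantile method interpolates between scores, so its quartiles may differ slightly from the hand method described in the book.

```r
# Hypothetical scores for illustration (not from the Youth data)
toy_scores <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 5)

# Minimum and maximum values
toy_limits <- range(toy_scores)
toy_limits

# First and third quartiles (25th and 75th percentiles)
toy_quartiles <- quantile(toy_scores, probs = c(0.25, 0.75))
toy_quartiles

# Interquartile range = third quartile minus first quartile
toy_iqr <- IQR(toy_scores)
toy_iqr
```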


  1. Create a third-level header titled: “Basic Boxplot of fropinon”
    1. Insert an R chunk. You can create a simple boxplot using base R by typing boxplot(YouthData$fropinon). Recall that the $ is a base R operator used to reference an element (variable) within an object (dataset).
    2. Your RStudio should look like this:
[Figure: Boxplot using base R]

  1. The base R boxplot() function we used above creates a boxplot of any variable. However, plots made with the base R plotting functions can be difficult to customize and save. We recommend using the ggplot() function (from the ggplot2 package) to generate plots instead. Below, we will show you how to create a boxplot using ggplot(); you can then customize properties such as its colors, titles, and layout orientation.

  2. Create a third-level header titled: “Boxplot of fropinon using ggplot()”

  3. Insert another R chunk and type YouthData %>% ggplot(aes(fropinon)) + geom_boxplot().
    1. Recall, ggplot() is a function from the ggplot2 package (loaded as part of the tidyverse) that allows us to create graphs and plots.
    2. The aes() function maps variables in your data to aesthetic properties of the graph or plot, such as position on the x- or y-axis. For example, the boxplot will be oriented along the x-axis if you type ggplot(aes(fropinon)), because the first argument to aes() is x. If you type ggplot(aes(y=fropinon)), the plot will be flipped to the y-axis like the base R boxplot above.
    3. The geom_boxplot() function works like the geom_histogram() function you used in earlier assignments. Be sure to include the + sign before geom_boxplot() since you are “adding” this geometric object layer to the initial XY coordinate plot.
      • Note: If you break your code into multiple lines (as pictured below), be sure that the + sign is at the end of the line containing the ggplot() function. Otherwise, R will assume you’re done with the ggplot() function, and it will not understand that you want to add a boxplot to it.
    4. Your RStudio should look like this:

[Figure: Boxplot using ggplot]
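To make the orientation point concrete, here is a short sketch that builds the same boxplot twice, once along the x-axis and once along the y-axis. It uses a small made-up tibble named toy_data in place of YouthData (the values are invented), and it assumes the tidyverse package is installed.

```r
library(tidyverse)  # loads ggplot2 and the %>% pipe

# Hypothetical toy data standing in for YouthData (made-up values)
toy_data <- tibble(fropinon = c(1, 1, 1, 2, 2, 3, 4, 5))

# The first argument to aes() is x, so the box runs along the x-axis
plot_x <- toy_data %>%
  ggplot(aes(fropinon)) +
  geom_boxplot()

# Naming y explicitly draws the box along the y-axis instead
plot_y <- toy_data %>%
  ggplot(aes(y = fropinon)) +
  geom_boxplot()

# Typing an object's name prints the plot
plot_x
plot_y
```

Notice that each + sits at the end of a line, so R knows another layer is coming on the next line.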

  1. Next, we can add some color.
    1. Create a third-level header titled: “Add Color to fropinon boxplot”. Then, create a new R chunk and type YouthData %>% ggplot(aes(fropinon)) + geom_boxplot().
    2. Inside the parentheses after geom_boxplot, type fill = "turquoise", color = "black".
      • fill = dictates the inner color of the boxplot. color = dictates the color of the outline and lines comprising the boxplot. Be sure to include the quotation marks (" ") around each color name.
YouthData %>%
  ggplot(aes(fropinon)) +
  geom_boxplot(fill = "turquoise", color = "black")