Assignment 5 Objectives

The purpose of this fifth assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 5 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

This chapter covered measures of dispersion, including variation ratio, range, interquartile range, variance, and standard deviation. We use measures of dispersion to summarize the “spread” (rather than central tendency) of a data distribution. Likewise, in this assignment, you will learn how to use R to calculate measures of dispersion and create boxplots that help us standardize and efficiently describe the spread of a data distribution. You will also get additional practice with creating frequency tables and simple graphs in R, and you will learn how to modify some elements (e.g., color) of a ggplot object. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.

By the end of assignment #5, you should…

  • be able to calculate measures of dispersion by hand from frequency tables you generate in R
  • be able to generate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc::frq() or summarytools::descr())
  • be able to generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution
  • know how to change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
  • know how to add or change a preset theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
  • understand how to select colors from a colorblind accessible palette (e.g., using viridisLite::viridis()) and specify them for the outline and fill colors in a ggplot geometric object (e.g., geom_boxplot())
  • be able to add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title"))

Assumptions & Ground Rules

We are building on objectives from Assignments 1-4. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • knit your RMD document into an HTML file that you can then save and submit for course credit

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method
  • use groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages

Data viewing & wrangling

  • use the base R head() function to quickly view a snapshot of your data
  • use the glimpse() function to quickly view all columns (variables) in your data
  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions

Descriptive data analysis

  • use summarytools::dfSummary() to quickly describe one or more variables in a data file
  • create frequency tables with sjmisc::frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()
  • calculate measures of central tendency for a variable distribution using base R functions mean() and median() (e.g., mean(data$variable))
  • calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() and psych::describe() functions

Data visualization & aesthetics

  • improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
  • create basic graphs using ggplot2’s ggplot() function


If you do not recall how to do these things, review Assignments 1-4.

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

  • measures of dispersion, such as variation ratio, range, interquartile range (IQR), variance, and standard deviation
  • the difference between range and IQR
  • the relationship between variance and standard deviation
  • how to calculate range, variation ratio, and IQR
  • how to calculate variance of a population, a sample, and a sample with grouped data
  • how to calculate standard deviation of a population, a sample, and a sample with grouped data
  • how to calculate sample variance and standard deviation with ungrouped and grouped data using computational formulas
  • boxplots, including steps for boxplot construction, elements of a boxplot, and how to read a boxplot to summarize the central tendency and dispersion of a data distribution

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).


Part 1 (Assignment 5.1)

Goal: Read in Youth Data and Determine Measures of Dispersion

(Note: Remember that, when following instructions, you should always replace “LastName” with your own last name and replace YEAR_MO_DY with the actual date. E.g., 2022_06_08_Fordham_K300Assign5)

In the last assignment, you learned how to identify or calculate measures of central tendency from frequency tables to summarize the most common or “expected” value of a data distribution. In doing so, you learned how to decide which measures of central tendency are most appropriate or useful for summarizing specific variables. In this assignment, you will use frequency tables to calculate measures of dispersion and boxplots to visualize dispersion for several variables.

  1. Go to your K300_L folder, which should contain the R Markdown file you created for Assignment 4 (named YEAR_MO_DY_LastName_K300Assign4). Click to open the R Markdown file.
    1. Remember, we open RStudio in this way so the here package will automatically set our K300_L folder as the top-level directory.
    2. In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.
    3. The dialogue box asks for a Title, an Author, and a Default Output Format for your new R Markdown file.
    4. In the Title box, enter K300 Assignment 5.
    5. In the Author box, enter your First and Last Name (e.g., Tyeisha Fordham).
    6. In the Default Output Format box, select “HTML document” (HTML is usually the default selection).

  2. Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete this text before you begin working.
    1. Create a second-level header titled: “Part 1 (Assignment 5.1).” Then, create a third-level header titled: “Read in Youth Data and Determine Measures of Dispersion”
    2. This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text: This R Markdown document contains my work for Assignment 5. It is my work and only my work.
    3. Now, you need to get data into RStudio. You already know how to do this, but please refer to Assignment 1 if you cannot recall.

  3. Create a third-level header in your R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
    1. Insert an R chunk.
    2. Inside the new R code chunk, load the following packages: tidyverse, haven, here, sjmisc, sjPlot, and summarytools. In addition, install and load the viridisLite package.
      • Recall, you only need to install packages one time. However, you must load them each time you start a new R session. Also, remember that you can optionally use (and we recommend) groundhog.library() to improve the reproducibility of your script.
      • In this assignment, you will learn to customize your ggplot graphs, including changing the default color scheme to any colors you want. As we will explain later, the viridisLite package is helpful for identifying colors that are colorblind accessible.
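As a rough sketch, the code chunk for the “Load Libraries” step might look like the following. It assumes every package except viridisLite was already installed in earlier assignments, and it uses plain library() calls rather than groundhog.library().

```r
# One-time installation; kept commented out so it does not rerun on every knit.
# Remove the leading "#" the first time only.
# install.packages("viridisLite")

# Load the packages named in this step (repeat in every new R session)
library(tidyverse)
library(haven)
library(here)
library(sjmisc)
library(sjPlot)
library(summarytools)
library(viridisLite)
```

If a library() call reports that a package is not found, that package still needs to be installed once with install.packages().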

  4. After your first code chunk, create another third-level header in RMD titled: “Read Data into R”
    1. Insert another R code chunk.
    2. In the new R code chunk, read and assign the “Youth_0.sav” SPSS datafile into an R data object named YouthData.
      • Forget how to do this? Refer to Assignment 1.
    3. In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data object: YouthData. This will call the object and provide a brief view of the data. (Note: You can get a similar but more visually appealing view by simply clicking on the object in the “Environment” window.) Your RStudio session should now look a lot like this:
YouthData <- read_spss(here("Datasets", "Youth_0.sav"))
View Youth Data

As in the image, you should see 1,272 rows (or observations) and 23 columns (or variables).

  1. Now, insert an R chunk, type YouthData %>% view_df(), and hit RUN. Check your Viewer tab to get a better look at the variable names, labels, and values.
    • Forget how to do this? Refer to Assignment 2.

  2. Create a third-level header titled: “Frequency Table for ‘v77’ Variable”
    • Create a new R code chunk and type YouthData %>% frq(v77) to generate a frequency table for the variable that measures the ‘parental supervision scale.’ Using the frequency table, answer questions 10 and 11 in Quiz 5.
    • Note: R is case sensitive! Be sure you are typing “v77” with a lower-case, not upper-case, ‘v’.
    • Your frequency table should look like this:
Frequency Table for v77

Part 2 (Assignment 5.2)

Goal: Determine Measures of Dispersion for “fropinon” Variable (Question 5, Ch. 5, p. 145)

Now, we are going to generate frequency tables for three variables, use these tables to determine measures of dispersion (i.e., standard deviation, variance, range, minimum value, and maximum value), and then answer Question 5 on page 145 of your book. These measures of dispersion will help us infer meaningful information about the spread of these distributions in this sample.

You should have read about how to calculate measures of dispersion by hand in the book chapter; you can also calculate these directly in R. For instance, you may have noticed that the frequency table you generated earlier using sjmisc::frq() included the standard deviation (“sd=”) in the output. You may also recall that the descriptive statistics table you generated in Assignment 4 using summarytools::descr() included the standard deviation, along with the minimum value, maximum value, IQR, and other information. However, for this part of this assignment, you should generate the frequency tables in R and then calculate all dispersion measures by hand. This will help you better understand what the programs are reporting and how they generated these measures. If you want to read more about measures of dispersion and how to calculate them in R, you might want to check out here and here.
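For checking your hand calculations afterward, here is a minimal sketch of the base R functions that return these measures directly. The vector x holds made-up responses on a 1-5 scale, not values from YouthData, and the object names are only illustrative. Note that R's IQR() interpolates between observations when locating quartiles, so its result may differ slightly from a hand calculation that follows the textbook's steps.

```r
# Hypothetical responses on a 1-5 scale (NOT the real YouthData values)
x <- c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5)

x_min   <- min(x)          # minimum value
x_max   <- max(x)          # maximum value
x_range <- x_max - x_min   # range = maximum - minimum
x_iqr   <- IQR(x)          # interquartile range (R's default quantile method)
x_var   <- var(x)          # sample variance (divides by n - 1)
x_sd    <- sd(x)           # sample standard deviation = sqrt(variance)

# Variation ratio: 1 minus the proportion of cases in the modal category
freq <- table(x)
x_vr <- 1 - max(freq) / sum(freq)

# Print everything together
c(min = x_min, max = x_max, range = x_range, IQR = x_iqr,
  variance = x_var, sd = x_sd, variation_ratio = x_vr)
```

With real data that contains missing values, these functions return NA unless you add na.rm = TRUE (e.g., sd(YouthData$fropinon, na.rm = TRUE)).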

  1. Create a second-level header titled: “Part 2 (Assignment 5.2).” Then, create a third-level header titled: “Calculate Measures of Dispersion for fropinon, delinquency, and certain”
    • Note: The “fropinon” variable is a five-category ordinal measure asking respondents how wrong they think their friends think it is to steal. Responses range from 1 (always wrong) to 5 (never wrong). However, the variable is misspelled – instead of “fropinion” with two i’s, the variable is “fropinon” with one ‘i’. Be sure to spell the variable as it is found in the dataset when referencing it in R code chunks. Also, remember that R is case sensitive. So, if you type “fropinion” or “Fropinon” instead, R will not be able to find the variable!

  2. Create a new R code chunk and type YouthData %>% frq(fropinon)
    1. Repeat the above step for the other 2 variables, “delinquency” and “certain”. Before each new R chunk, create a third-level header titled: “Frequency Table of [Variable Name]”. For example, when you create the frequency table for the “delinquency” variable, create a third-level header above it titled “Frequency Table of ‘delinquency’”. Then, answer question 5 on page 145 (Questions 12 and 13 in Quiz 5 on Canvas).
    2. Graphical representations can be helpful, especially for determining the shape (or skew) of a distribution. They can also help to determine measures of dispersion, such as range and interquartile range. In the next section, you will create a boxplot for each of these variables to answer questions 6 and 7 on page 145 of B&P (Questions 14-16 in Quiz 5).
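The hand calculation from a frequency table can be mirrored in R as a self-check. The sketch below builds a small frequency table from made-up values (not YouthData) and applies the computational formula for sample variance with grouped data, s² = [Σfx² − (Σfx)²/n] / (n − 1). The names freq_tab, sum_fx, and sum_fx2 are hypothetical.

```r
# Hypothetical responses on a 1-5 scale (NOT the real YouthData values)
x <- c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5)

# Build a frequency table: one row per distinct value, with its frequency f
freq_tab <- as.data.frame(table(x), stringsAsFactors = FALSE)
names(freq_tab) <- c("value", "f")
freq_tab$value <- as.numeric(freq_tab$value)

# Pieces of the computational formula
n       <- sum(freq_tab$f)                      # number of cases
sum_fx  <- sum(freq_tab$f * freq_tab$value)     # sum of f * x
sum_fx2 <- sum(freq_tab$f * freq_tab$value^2)   # sum of f * x^2

# Sample variance and standard deviation "by hand"
var_by_hand <- (sum_fx2 - sum_fx^2 / n) / (n - 1)
sd_by_hand  <- sqrt(var_by_hand)

# Compare against R's built-in functions; both should be TRUE
all.equal(var_by_hand, var(x))
all.equal(sd_by_hand, sd(x))
```

If the two comparisons are not TRUE for your own numbers, the most common culprits are squaring after multiplying by f (instead of before) or dividing by n rather than n − 1.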

  3. Create a third-level header titled: “Basic Boxplot of fropinon”
    1. Insert an R chunk. You can create a simple boxplot using base R by typing boxplot(YouthData$fropinon). Recall that the $ is a base R operator used to reference an element (variable) within an object (dataset).
    2. Your RStudio should look like this:
      Boxplot of ‘fropinon’ variable using base R
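To read exact numbers off a base R boxplot rather than estimating them from the picture, one option is to ask boxplot() for its statistics without drawing anything. This is only a sketch on made-up values (not YouthData). Base R boxplots use Tukey's “hinges” for the box edges, which can differ slightly from the quartiles reported by IQR() or quantile().

```r
# Hypothetical responses on a 1-5 scale (NOT the real YouthData values)
x <- c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5)

# Tukey's five-number summary:
# minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
five

# plot = FALSE returns the numbers boxplot() would draw, without plotting
bp <- boxplot(x, plot = FALSE)
bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # any points that would be drawn as outliers (none here)

# Width of the box (upper hinge minus lower hinge)
box_width <- bp$stats[4, 1] - bp$stats[2, 1]
box_width
```

In the assignment, the same idea would apply to boxplot(YouthData$fropinon, plot = FALSE).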

  4. The base R boxplot() function we used above creates a boxplot of any variable. However, with the base R plotting functions, it is difficult to manipulate and save the boxplot if desired. Instead, we recommend using the ggplot() function (from the ggplot2 package) to generate plots. Below, we will show you how to create a boxplot using ggplot(), whose properties (including its colors, titles, and layout orientation) you can then customize.

  5. Now, let’s jazz up this boxplot a bit by recreating and then modifying it using ggplot(). By adding some color and a title, we can make the boxplot easier and more appealing to read.

  6. Create a third-level header titled: “Boxplot of fropinon using ggplot()”

  7. Insert another R chunk and type YouthData %>% ggplot(aes(fropinon)) + geom_boxplot().
    1. Recall, ggplot() is a function from the ggplot2 package (loaded as part of the tidyverse) that allows us to create graphs and plots.
    2. The aes() function maps variables in your data to aesthetic properties of the plot, such as position on the x- or y-axis. For example, the boxplot will be oriented along the x-axis by default if you type ggplot(aes(fropinon)). If you type ggplot(aes(y=fropinon)), the boxplot will be flipped to the y-axis like the base R boxplot above.
    3. The geom_boxplot() function works like the geom_histogram() function you used in earlier assignments. Be sure to include the + sign before geom_boxplot() since you are “adding” this geometric object layer to the initial XY coordinate plot.
      • Note: If you break your code into multiple lines (as pictured below), be sure that the + sign is at the end of the line containing the ggplot() function, not at the start of the next line. Otherwise, R will assume you’re done with the ggplot() function, and it will not understand that you want to add a boxplot to it.
    4. Your RStudio should look like this:

      Boxplot of ‘fropinon’ using ggplot
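The ggplot() steps above can be sketched as follows. Because the real data file is not bundled here, the example uses a small made-up tibble named toy (hypothetical) in place of YouthData; in your own chunk you would pipe YouthData instead.

```r
library(tidyverse)

# Hypothetical stand-in for YouthData (made-up values on the 1-5 scale)
toy <- tibble(fropinon = c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5))

# Each + sits at the END of a line, so R knows more layers are coming
p_horizontal <- toy %>%
  ggplot(aes(fropinon)) +      # first argument of aes() is x
  geom_boxplot()

# Mapping the variable to y instead stands the box upright,
# like the base R boxplot
p_vertical <- toy %>%
  ggplot(aes(y = fropinon)) +
  geom_boxplot()

# Typing an object's name prints (draws) the plot
p_horizontal
p_vertical
```

Saving each plot to an object is optional. It simply makes it easy to add layers later (e.g., p_horizontal + theme_minimal()).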

  8. Next, we can add some color.
    1. Create a third-level header titled: “Add Color to fropinon boxplot”. Then, create a new R chunk and type YouthData %>% ggplot(aes(fropinon)) + geom_boxplot().
    2. Inside the parentheses after geom_boxplot, type fill = "orange", color = "black".
      • fill = dictates the inner color of the boxplot. color = dictates the color of the outline and lines comprising the boxplot. Be sure to include the quotation marks (" ") around each color name.
    3. Your code should look like this:
How to add color to a ggplot boxplot

Boxplot of ‘fropinon’ in black and orange
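A minimal version of the colored boxplot code is sketched below, again using a made-up tibble named toy (hypothetical) in place of YouthData.

```r
library(tidyverse)

# Hypothetical stand-in for YouthData (made-up values on the 1-5 scale)
toy <- tibble(fropinon = c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5))

# fill = inside of the box; color = outline, whiskers, and median line
p_orange <- toy %>%
  ggplot(aes(fropinon)) +
  geom_boxplot(fill = "orange", color = "black")

p_orange
```

Because fill and color are set to fixed values rather than mapped to a variable, they go inside geom_boxplot() and not inside aes().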

  1. Following the procedures above, you can change the outline or fill to nearly any color you want using R’s built-in color names, such as yellow, turquoise, or magenta! I often use hexadecimal codes instead of color names to precisely select specific colors.
    1. For example, we can use the hex values for IU’s colors, which are “#990000” for crimson and “#EDEBEB” for cream.
    2. To improve the “cream” contrast on our plot, we will also specify a minimal theme with a white background by adding + theme_minimal() to our ggplot object.
      • Note: As the link above explains, on the web, IU substitutes gray for the cream in their primary colors because cream does not reproduce well in online environments.
        Boxplot of ‘fropinon’ in crimson and cream
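One way the crimson-and-cream version could be written is sketched here. Which of the two colors goes to the fill and which to the outline is an assumption on our part (cream fill, crimson outline), and toy is a made-up stand-in for YouthData.

```r
library(tidyverse)

# Hypothetical stand-in for YouthData (made-up values on the 1-5 scale)
toy <- tibble(fropinon = c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5))

# Hex codes given in the text above
crimson <- "#990000"
cream   <- "#EDEBEB"

# Assumption: cream inside the box, crimson for the outline
p_iu <- toy %>%
  ggplot(aes(fropinon)) +
  geom_boxplot(fill = cream, color = crimson) +
  theme_minimal()   # white background so the cream fill stands out

p_iu
```

Swapping the two objects in fill = and color = gives the reverse scheme. Try both and keep whichever reads more clearly.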

  2. The options are nearly limitless! However, when customizing your graphs, we recommend using a package like “viridis” or “viridisLite” to help you choose colors that are accessible for individuals with all forms of colorblindness. You can learn more about the viridis package and its color palettes here and here.
    1. As explained in the links above, we can use the “viridisLite” package to automatically apply the viridis color scale to certain ggplot graphs. However, we can also choose to select our own colors manually. We show you one way to do this in the boxplot below, where we manually specified two colors from the viridis color palette. Remember, you can click the “Code” button to see how we did it.
    2. First, we used the viridisLite::viridis() function to request two contrasting colors from the palette (hence the 2 in the parentheses). We assigned the resulting two hexadecimal character codes into an object that we named “cols”, then we assigned the last color in this vector of two colors to “col1” and the first in this vector to “col2.”
    3. From here, we used our ggplot code from earlier to regenerate the boxplot, but we substituted our “col1” and “col2” hexadecimal character objects in for our boxplot’s fill (fill=col1) and outline (color=col2) color values.
      Boxplot of ‘fropinon’ with viridis color scale
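The steps just described might look roughly like this in code. The object names cols, col1, and col2 come from the description above, while toy is a made-up stand-in for YouthData. Treat this as one possible reconstruction rather than the exact code behind the figure.

```r
library(tidyverse)
library(viridisLite)

# Hypothetical stand-in for YouthData (made-up values on the 1-5 scale)
toy <- tibble(fropinon = c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5))

# Request two colors from the viridis palette (returned as hex codes)
cols <- viridis(2)
cols

col1 <- cols[2]   # last color in the vector  -> used for the fill
col2 <- cols[1]   # first color in the vector -> used for the outline

# No quotation marks here: col1 and col2 are objects that already hold text
p_viridis <- toy %>%
  ggplot(aes(fropinon)) +
  geom_boxplot(fill = col1, color = col2) +
  theme_minimal()

p_viridis
```

Requesting more colors (e.g., viridis(5)) returns evenly spaced steps along the same palette, which is handy when a plot needs more than two colors.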

  3. Lastly (for now, at least), we can add titles and labels to our boxplot. This makes it easier for you and any other reader to know what you have plotted. For example, we know the ‘fropinon’ variable contains survey responses to a question asking participants how wrong they think their friends think it is to steal, with response values ranging from 1 (always wrong) to 5 (never wrong). So, we will title the boxplot “Boxplot of Friends’ Opinions on Stealing”.
    1. Create a third-level header called “Adding Boxplot Title”. Then, insert an R chunk.
    2. Type YouthData %>% ggplot(aes(fropinon)) + geom_boxplot(fill = "orange", color = "black").
    3. Then, add a plot title by typing + labs(title = "Boxplot of Friends' Opinions on Stealing") after geom_boxplot(fill = "orange", color = "black"). If you break across lines, remember to include the + at the end of the previous line and not at the beginning of the new line.
      • labs() is a function that allows you to change labels.
      • title = designates that you’re working with the boxplot title and not the caption (caption =) or a subtitle (subtitle =).
    4. Your RStudio should look something like this:
How to add title to a ggplot boxplot
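Put together, the titled boxplot could be written as in the sketch below. The title matches the one given above; the subtitle and caption text are placeholders we made up to show where those arguments go, and toy is a made-up stand-in for YouthData.

```r
library(tidyverse)

# Hypothetical stand-in for YouthData (made-up values on the 1-5 scale)
toy <- tibble(fropinon = c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5))

p_title <- toy %>%
  ggplot(aes(fropinon)) +
  geom_boxplot(fill = "orange", color = "black") +
  labs(
    title    = "Boxplot of Friends' Opinions on Stealing",
    subtitle = "Placeholder subtitle (optional)",   # made-up text
    caption  = "Placeholder caption (optional)"     # made-up text
  )

p_title
```

The subtitle = and caption = arguments can simply be deleted if you only want a title.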

Congratulations! You just learned how to create and modify a (jazzed up) boxplot in R!

Part 3 (Assignment 5.3)

Goal: Determine Measures of Dispersion for “delinquency” and “certain” Variables (Question 5, Ch. 5, p. 145)

Now that you can create a boxplot in R, you will create boxplots for the “delinquency” and “certain” variables as well. You will do this using the method from above.

  1. Create a second-level header titled: “Part 3 (Assignment 5.3)”. Create a third-level header titled: “Calculate Measures of Dispersion for delinquency and certain (Question 5, Ch. 5, p. 145)”. Then, create a fourth-level header (type ####) titled: “Boxplot for delinquency”

  2. Insert an R chunk and create a boxplot without color or a title for the “delinquency” variable. Your RStudio should look like this:
Boxplot of ‘delinquency’ Variable

  1. Now, add colors to the boxplot by typing fill = "blue", color = "black" in the parentheses of geom_boxplot(). Then, add a title that says “Boxplot of Number of Delinquent Acts”. To do this, type + labs(title = "Boxplot of Number of Delinquent Acts") after geom_boxplot(fill = "blue", color = "black").
    1. You can use any colors you want for your boxplot! Try switching the colors to red, yellow, purple, and so on to see what looks best to you. Remember, you can also use colors from the viridis package’s color palette or various others (e.g., check out the scico palettes here and here) to ensure that your plot is accessible for individuals with all forms of colorblindness.
    2. Your RStudio should look something like this:
      Boxplot of ‘delinquency’ variable colors from viridis scale

  2. Lastly, repeat this process with the “certain” variable.
    1. Create a fourth-level header titled: “Boxplot for certain”
    2. Insert an R chunk. We want the boxplot to have customized colors with the title “Boxplot for Certainty of Being Punished.” Remember, you can use any colors you want!

      Creating a custom boxplot for ‘certain’ variable

  3. You should now have everything that you need to complete the questions in Assignment 5 that parallel those from B&P’s SPSS Exercises for Chapter 5! Complete the remainder of the questions in Assignment 5 in your RMD file.
    1. Keep the file clean and easy to follow by using RMD level headings (e.g., denoted with ## or ###) separating R code chunks, organized by assignment questions.
    2. Write plain text after headings and before or after code chunks to explain what you are doing - such text will serve as useful reminders to you when working on later assignments!
    3. Upon completing the assignment, “knit” your final RMD file again and save the final knitted HTML file to your “Assignments” folder as: YEAR_MO_DY_LastName_K300Assign5. Submit via Canvas in the relevant section (i.e., the last question) for Assignment 5.5.

Assignment 5 Objective Checks

After completing assignment #5…

  • are you able to calculate measures of dispersion by hand from frequency tables you generate in R?
  • are you able to generate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc::frq() or summarytools::descr())?
  • are you able to generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution?
  • do you know how to change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)?
  • do you know how to add or change a preset theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)?
  • do you understand how to select colors from a colorblind accessible palette (e.g., using viridisLite::viridis()) and specify them for the outline and fill colors in a ggplot geometric object (e.g., geom_boxplot())?
  • are you able to add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title"))?