Assignment 5 Objectives

The purpose of this fifth assignment is to help you use R to complete some of the Assignment 5 exercises adapted from the SPSS Exercises at the end of Chapter 4 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

This chapter focused on measures of central tendency (e.g., mean, median, and mode) and their advantages and disadvantages as single statistical descriptions of a data distribution. Likewise, in this assignment, you will learn how to use R to calculate measures of central tendency and other statistics (e.g., skewness; kurtosis) that help us standardize and efficiently describe the shape of a data distribution. You will also get additional practice with creating frequency tables and simple graphs in R. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.

By the end of Assignment 5, you should…

  • know how to use the base R head() function to quickly view a snapshot of your data
  • know how to use the glimpse() function to quickly view all columns (variables) in your data
  • be able to improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
  • be able to calculate measures of central tendency for a variable distribution using base R functions mean() and median()
  • know how to calculate central tendency and other basic descriptive statistics for specific variables in a dataset using the summarytools::descr() function
  • have more practice with using the $ operator to reference or call a named element from a list or data object, such as a specific variable in a data file (e.g., mean(data$variable))
  • have more practice with creating frequency tables using sjmisc::frq() and summarytools::freq()
  • have more practice with generating simple histograms using ggplot()

Assumptions & Ground Rules

We are building on objectives from Assignments 1-4. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • knit your RMD document into a Word file that you can then save and submit for course credit

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method

Data viewing & wrangling

  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions

Descriptive data analysis

  • use summarytools::dfSummary() to quickly describe one or more variables in a data file
  • create frequency tables with sjmisc::frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()

Data visualization & aesthetics

  • create basic graphs using ggplot2’s ggplot() function

If you do not recall how to do these things, first review Assignments 1-4.

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

  • measures of central tendency: mean, median, and mode
  • how to recognize a unimodal or a bimodal distribution and to identify the mode of a distribution
  • how to calculate the median position from raw data or from grouped data
  • how to calculate the arithmetic mean from raw data, grouped data, or a frequency distribution
  • how skewness affects measures of central tendency
  • comparative advantages and disadvantages of the mean and the median

Part 1 (Assignment 5.1)

Goal: Create a new RMD file for Assignment 5

(Note: Remember that, when following instructions, always substitute your own last name for “LastName” and the actual date for YEAR-MO-DY. E.g., 2023-01-23_Ducate_CRIM5305_Assign05)

In the last assignment, you learned how to use sjmisc::frq() and summarytools::freq() functions to generate frequency tables for variables. You also learned about the summarytools::dfSummary() function for quickly summarizing all or a subset of the variables in a data object. Lastly, you learned how to select and recode variables using dplyr’s mutate() and if_else() functions as well as how to display data in graphs using ggplot(). In this assignment, you will decide which measure of central tendency is most appropriate for a given variable, then use frequency tables and R functions to calculate measures of central tendency and other univariate descriptive statistics.

  1. Go to your CRIM5305_L folder, which should contain the R Markdown file you created for Assignment 4 (named YEAR-MO-DY_LastName_CRIM5305_Assign04). Click to open the R Markdown file.
    • Remember, we open RStudio in this way so the here package will automatically set our CRIM5305_L folder as the top-level directory.

  2. In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.

  3. The dialogue box asks for a Title, an Author, and a Default Output Format for your new R Markdown file.
    1. In the Title box, enter CRIM5305 Assignment 5.
    2. In the Author box, enter your First and Last Name (e.g., Caitlin Ducate).
    3. In the Default Output Format box, ensure “Word” is selected.
  4. Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete this text before you begin working.

  5. Create a second-level heading titled: “Part 1 (Assignment 5.1)”
    • Remember, a second-level heading starts with two hashtags followed by a space and the heading title, like this: ## Heading Title
  6. This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text:
    • This R Markdown document contains my work for Assignment 5. It is my work and only my work.

Part 2 (Assignment 5.2)

Goal: Reading in Data and Creating Frequency Table from Youth Data

We will begin by reading in the Youth dataset and creating a frequency table of the “parnt2” variable. The frequency table will allow us to answer the first question on pages 100-101.

  1. Create a second-level header titled: “Part 2 (Assignment 5.2).”

  2. Create a third-level header in R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
    1. Insert an R chunk.
    2. Inside the new R code chunk, load the following packages: tidyverse, haven, here, sjmisc, sjPlot, summarytools, and gt.
      • Note: A new package, “gt”, is listed above. Before loading it, remember that you must first install the package. Also, recall that you only need to install packages one time, but you must load them each time you start a new R session.

  3. After your first code chunk, create another third-level header in RMD titled: “Read Data into R”
    1. Insert another R code chunk.

    2. In the new R code chunk, read and assign the “Youth_0.sav” SPSS datafile into an R data object named YouthData.

      • Forget how to do this? Refer to instructions in Assignment 1.
    3. In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data object: YouthData.

      • Remember, this will call the object and provide a brief view of the data.
      • Also, remember that you can get a similar but more visually appealing view by simply clicking on the object in the “Environment” window.
      • Your RStudio session should now look a lot like this:
      YouthData <- read_spss(here("Datasets", "Youth_0.sav"))
      YouthData
View Youth Data


  1. After typing and running the line where you called to view the YouthData object, you should see a table with the first 10 out of 1,272 total rows (containing observations) and 23 columns (or variables).
    • Once you have looked at the YouthData object, you may comment out this line (i.e., the line that JUST reads YouthData. Do not comment out the read_spss() line!)
    • At this point, your R Markdown file should look a lot like this:
Current RMD File



  1. Before we move on, let’s learn about a new function for quickly viewing a snapshot of our data, then check out that new “gt” package you just installed.
    1. First, insert a new third-level header called “View Data”
    2. Insert a new R chunk, type head(YouthData), and hit RUN.
      • head() is a built-in R function that returns a snapshot of a dataframe or object. In this case, it gives us a snapshot of the variables and first several rows of observations in the YouthData object. This can be an especially useful function for quickly glimpsing large datasets.
      • Speaking of glimpsing, the glimpse() function is similarly useful for quickly glimpsing all the columns (variables) in a dataframe - feel free to check that one out as well!
    3. In the same code chunk, below your last command, type head(YouthData) %>% gt(), and hit RUN again.
      • Congrats - you just piped your output from the head(data) function to the gt() function, which reformats our data table using the gt package’s default table style settings!
      • Notice any differences in the two tables? Often, the key differences are especially noticeable in our final knitted document.
    4. Try knitting your RMD document now to see what the two tables will look like in your final Word file.
head(YouthData)
# glimpse(YouthData)
head(YouthData) %>% gt()
Gender v2 v21 v22 v63 v77 v79 v109 v119 parnt2 fropinon frbehave certain moral delinquency d1 hoursstudy filter_$ heavytvwatcher studyhard supervision drinkingnotbad Lowcertain_bin
     1 15  36  15   3   1   3    5    1      5       18        9       9    19           8  1         15        1              1         1           0              0              1
     0 15   3   5   4   1   2    4    1      8       11        6      11    19          10  1          5        0              0         0           1              0              0
     1 15  20   6   4   1   1    4    2      8       12        5      11    20           1  0          6        0              1         0           1              0              0
     0 15   2   4   2   1   2    5    3      4        9        5      13    19           1  0          4        1              0         0           0              0              0
     0 14  12   2   4   3   5    5    3      7       27        9       4    18         104  1          2        1              0         0           1              1              1
     1 15   1   3   4   1   1    5    3      7        8        9      10    20           0  0          3        0              0         0           1              0              0


Note that, when knitting to Word, the table still isn’t very pretty. However, it is now formatted as an actual TABLE, which means you can modify and style it, whereas the output of head(YouthData) on its own is rendered as plain code output, which you cannot style. There are also ways to use templates to make the output nicer by default, if you ever decide to go that route.
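
If you want to experiment with head() before applying it to the Youth data, the sketch below may help. It uses a small made-up data frame (toy_data and its column names are hypothetical and are not part of the Youth dataset) to show that head() returns six rows by default and that its n argument changes this. It also shows str(), a base R function that gives a compact column-by-column overview roughly similar in spirit to glimpse().

```r
# Hypothetical toy data; the column names are invented for illustration only
toy_data <- data.frame(
  id     = 1:20,
  age    = rep(c(14, 15, 16, 17), times = 5),
  female = rep(c(0, 1), times = 10)
)

first_six   <- head(toy_data)         # default: the first 6 rows
first_three <- head(toy_data, n = 3)  # n controls how many rows are returned

print(first_three)

# str() lists every column with its type and its first few values
str(toy_data)
```

Once this makes sense, swapping toy_data for YouthData should behave the same way, just with more columns.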

  1. Recall, in an R chunk, you can also type YouthData %>% view_df() and hit RUN, then check your Viewer tab to get a better look at the variable names, labels, and values.
    • Forget how to do this? Refer to Assignment 2.

  2. Create a third-level header titled: Frequency Table for ‘parnt2’ Variable
  3. Create a new R code chunk and type YouthData %>% freq(parnt2) to generate a frequency table for the “parental supervision scale” variable.
    • Note: This variable’s name is “parnt2”, but its label is “parental supervision scale”. This is why using view_df() is helpful: it allows you to see the variable names, labels, and value labels. For example, if you type “parental supervision scale” into RStudio, nothing will happen, because it is a label, not a variable (or object) name. However, if you type parnt2 into R, it returns the variable that measures the parental supervision scale.
#default
YouthData %>% freq(parnt2) 
## Frequencies  
## YouthData$parnt2  
## Label: parental supervision scale  
## Type: Numeric  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           2      6      0.47           0.47      0.47           0.47
##           3     16      1.26           1.73      1.26           1.73
##           4    163     12.81          14.54     12.81          14.54
##           5    139     10.93          25.47     10.93          25.47
##           6    440     34.59          60.06     34.59          60.06
##           7    189     14.86          74.92     14.86          74.92
##           8    319     25.08         100.00     25.08         100.00
##        <NA>      0                               0.00         100.00
##       Total   1272    100.00         100.00    100.00         100.00
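
To see where the percentage columns in this table come from, you can rebuild them by hand. The sketch below is only an illustration in base R, not how summarytools computes its output internally. It copies the counts from the table above, reconstructs the raw values with rep(), and then uses table(), prop.table(), and cumsum() to reproduce the “% Valid” and “% Valid Cum.” columns. The object names (scores, counts, result, and so on) are made up for this example.

```r
# Counts copied from the freq(parnt2) output above
scores <- 2:8
counts <- c(6, 16, 163, 139, 440, 189, 319)

# Rebuild the raw values, then tabulate them with base R
parnt2_values <- rep(scores, times = counts)
freq_table    <- table(parnt2_values)

pct     <- 100 * prop.table(freq_table)  # share of valid cases, in percent
cum_pct <- cumsum(pct)                   # running total of those percentages

result <- data.frame(
  value   = scores,
  freq    = as.vector(freq_table),
  pct     = round(as.vector(pct), 2),
  cum_pct = round(as.vector(cum_pct), 2)
)
print(result)
```

Because parnt2 has no missing values, the “% Valid” and “% Total” columns are identical here; with missing data they would differ.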


  1. Now let us look at this distribution graphically. Plotting our variables is important because it helps us determine whether or not our variable is skewed and, if so, by how much. Because our parnt2 variable is discrete, let’s plot it as a bar graph. Recall that we can plot bar graphs using the generic command data %>% ggplot(aes(variable)) + geom_bar().
YouthData %>% ggplot(aes(parnt2)) + geom_bar()

  1. You should now be able to answer questions 5-8 of Assignment 5.

Part 3 (Assignment 5.3)

Goal: Create frequency tables for the “v77”, “v79”, “certain”, and “Gender” variables

Now, we are going to generate frequency tables for four variables. For each variable, we will determine which measure of central tendency is most appropriate and whether its distribution is skewed and, if so, the direction of the skew (i.e., negative or positive). Recall that these decisions matter because they ensure we report meaningful summary statistics when describing a variable. Generating a graph (e.g., bar chart or histogram) might also help you determine the most appropriate measure of central tendency and identify the direction of a skewed distribution.

  1. Create a second-level header titled: “Part 3 (Assignment 5.3).” Then, create a third-level header titled: “Measures of Central Tendency for Youth Data Variables”

  2. Create a fourth-level header titled: “v77: How wrong do friends think it is to steal”. Below it, create a new R code chunk and type YouthData %>% frq(v77)
    1. Repeat the above step for the other 3 variables, “v79”, “certain”, and “Gender”. Create a fourth-level header above each variable titled “[Variable Name]: [Label]”. For example, type “v79: How wrong do friends think it is to drink” before your next R chunk. Then, type YouthData %>% sjmisc::frq(v79).
      • Earlier, you used the freq() function from the summarytools package to generate a frequency distribution; this time, notice that you were instructed to use the frq() function from the sjmisc package instead. One benefit of the frq() function is that it automatically calculates the mean and standard deviation (sd) for a variable. (Note, though, that it may calculate a mean value even when the mean is not the most appropriate measure of central tendency!)
    2. Now visualize your data based on its level of measurement to determine its distribution and possible skew.
      • If you do not recall how to create bar graphs and histograms, refer to Assignment 4. As a quick reminder, the code to do this takes the form of: data %>% ggplot(aes(variable)) + geom_bar() for bar graphs, which are the most useful in this case.
      • Note: R is case sensitive! Be sure you are typing “v77” with a lower-case, not upper-case, ‘v’.
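
The caution above about frq() reporting a mean even when the mean is not appropriate can be illustrated with a small base R sketch. The values below are made up (they are not the Youth data), and the object names are hypothetical. For a variable coded 0/1, the mean is simply the proportion of 1s. For a nominal variable with arbitrary numeric codes, R will still return a mean, but that number does not describe a typical category.

```r
# Hypothetical 0/1 indicator (not the real Gender variable)
indicator <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 1)

mean_value    <- mean(indicator)                          # mean of the codes
share_of_ones <- sum(indicator == 1) / length(indicator)  # proportion coded 1

# Hypothetical nominal variable with arbitrary numeric codes (1, 2, 3)
region_code <- c(1, 3, 3, 2, 1, 3, 2, 3)

mean(region_code)   # a number comes back, but it is not a meaningful summary
table(region_code)  # the frequency table (and thus the mode) is what to report
```

In other words, always check a variable’s level of measurement before interpreting the mean that a function hands you.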

Your RStudio session should look like Figures 3 and 4.

Frequency Tables of Variables


With the output of the code looking something like this:

Frequency Tables of Variables


  1. You could also simply calculate the mean and median of each variable within R. There are various ways to do this. For example, you can use the base R mean() function to calculate the mean of a particular variable.
    1. Create a third-level header titled “Means, medians, and modes of Variables”
    2. Insert an R chunk and type mean(YouthData$certain). This will generate the mean of the certain variable in the YouthData dataset. You can repeat this step for the other variables, v77, v79, and Gender.
      • Note: The $ operator is used to call, access, or reference a named element from a list or data object in R. Here, we are essentially telling R to access and calculate the mean of the certain column from the YouthData object. The $ operator, which you may recall that we first introduced in Assignment 2, is used frequently with base R functions. By this point, you have also used the tidyverse %>% operator many times. Often, tasks can be accomplished either with $ operators or with %>% operators, though some base R functions do not work with tidyverse-style pipes. This is one such case where we cannot use the %>% operator and must use $ instead.
  2. You can use the same coding to determine the median of each remaining variable by simply replacing the word ‘mean’ with ‘median’. For example, to determine the median of the ‘certain’ variable, you would type median(YouthData$certain). You can use this approach to calculate the median of the remaining variables.
    • NOTE: There is a mode() function in R, but it DOES NOT calculate the mode of a distribution. Instead, it reports the storage mode (the internal data type) of an object. You can see this for yourself by running mode(YouthData$Gender). To find the mode, rely on frequency tables instead.

  3. Alternatively, the descr() function in the summarytools package will generate various descriptive statistics, including the mean, median, and values indicating the level of skewness and kurtosis. Using the “certain” variable as an example, type: YouthData %>% descr(certain)
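
The three steps above can be sketched together on a small made-up example. Everything here is hypothetical: the toy data frame and its score column are invented, and stat_mode() is a helper written only for this illustration (it is not a base R or package function). The sketch also uses na.rm = TRUE, because mean() and median() return NA when a variable contains missing values unless you tell them to drop the NAs.

```r
# Hypothetical scores with one missing value; not taken from the Youth data
toy <- data.frame(score = c(4, 6, 6, 7, 8, 6, 5, NA, 8, 4))

# The $ operator pulls the score column out of the toy data frame
score_mean   <- mean(toy$score, na.rm = TRUE)
score_median <- median(toy$score, na.rm = TRUE)

# stat_mode() is a hypothetical helper: it returns the most frequent value(s)
stat_mode <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  as.numeric(names(counts)[counts == max(counts)])
}
score_mode <- stat_mode(toy$score)

# Base R mode() reports the storage mode, not the statistical mode
mode(toy$score)

c(mean = score_mean, median = score_median, mode = score_mode)
```

If two or more values tie for the highest frequency, stat_mode() returns all of them, which is one simple way to spot a bimodal distribution.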

You should now be able to answer questions 9-10 of Assignment 5.
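
If you are curious how a skewness value relates to the mean and median, the following base R sketch may help. It uses made-up, right-skewed values and one common moment-based definition of skewness (the average cubed deviation from the mean divided by the cubed population standard deviation). This formula is an assumption chosen for illustration; descr() may use a slightly different estimator, so its values need not match this calculation exactly.

```r
# Hypothetical right-skewed counts: most values are small, one is very large
counts <- c(0, 0, 1, 1, 1, 2, 2, 3, 4, 5, 8, 25)

m   <- mean(counts)
med <- median(counts)

# Moment-based skewness (assumed definition, for illustration only)
deviations <- counts - m
skew <- mean(deviations^3) / mean(deviations^2)^1.5

c(mean = m, median = med, skewness = skew)
```

Here the single large value pulls the mean above the median and produces a positive skewness value, which is the pattern to look for when deciding whether the median is the better summary.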

Part 4 (Assignment 5.4)

Goal: Create a histogram for “delinquency” Variable

Next, you will create a histogram for the delinquency variable to determine its most appropriate measure of central tendency. This will allow you to answer question 11 of Assignment 5.
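
As a reminder of the general pattern, here is a minimal histogram sketch. It assumes the ggplot2 package is installed, and it uses simulated data rather than the Youth data: the toy data frame, its offenses column, and the binwidth of 2 are all made up for illustration. You will need to adapt the data object, variable name, and bin width to the delinquency variable yourself.

```r
library(ggplot2)

# Simulated, right-skewed counts standing in for a delinquency-type scale;
# these values are made up and are not the Youth data
set.seed(123)
toy <- data.frame(
  offenses = rpois(500, lambda = 2) + rbinom(500, 1, 0.05) * 40
)

hist_plot <- ggplot(toy, aes(x = offenses)) +
  geom_histogram(binwidth = 2, color = "white") +
  labs(x = "Number of offenses (simulated)", y = "Count")

hist_plot

# With a long right tail, the mean is pulled above the median
c(mean = mean(toy$offenses), median = median(toy$offenses))
```

Comparing the plot with the mean and median printed below it is a quick way to judge which measure of central tendency better represents a skewed variable.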

You should now have everything that you need to complete the questions in Assignment 5 that parallel those from B&P’s SPSS Exercises for Chapter 4!

  1. Complete the remainder of the questions in Assignment 5 in your RMD file.
    1. Keep the file clean and easy to follow by using RMD level headings (e.g., denoted with ## or ###) separating R code chunks, organized by assignment questions.
    2. Write plain text after headings and before or after code chunks to explain what you are doing - such text will serve as useful reminders to you when working on later assignments!
    3. Upon completing the assignment, “knit” your final RMD file again and save the final knitted Word document as: YEAR-MO-DY_LastName_CRIM5305_Assign05. Submit via Blackboard in the assignment called Assignment 5: Word Document.

Assignment 5 Objective Checks

After completing Assignment 5…

  • do you know how to use the base R head() function to quickly view a snapshot of your data?
  • can you use the glimpse() function to quickly view all columns (variables) in your data?
  • can you improve a knitted table by piping a function’s results to gt() (e.g., head(data) %>% gt())?
  • do you know how to calculate measures of central tendency for a variable distribution using base R functions mean() and median()?
  • can you generate central tendency and other descriptive statistics using summarytools::descr() function?
  • did you get more practice using the $ operator to reference or call a named element from a list or data object, such as a specific variable in a data file (e.g., mean(data$variable))?
  • did you get more practice creating frequency tables using sjmisc::frq() and summarytools::freq()?
  • did you get more practice generating simple histograms with ggplot()?