Assignment 4 Objectives

The purpose of this fourth assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 4 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

This chapter focused on measures of central tendency (e.g., mean, median, and mode) and their advantages and disadvantages as single statistical descriptions of a data distribution. Likewise, in this assignment, you will learn how to use R to calculate measures of central tendency and other statistics (e.g., skewness; kurtosis) that help us standardize and efficiently describe the shape of a data distribution. You will also get additional practice with creating frequency tables and simple graphs in R. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.

By the end of assignment #4, you should…

  • know how to use the base R head() function to quickly view a snapshot of your data
  • know how to use the glimpse() function to quickly view all columns (variables) in your data
  • be able to improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
  • be able to calculate measures of central tendency for a variable distribution using base R functions mean() and median()
  • know how to calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() and psych::describe() functions
  • have more practice with using the $ operator to reference or call a named element from a list or data object, such as a specific variable in a data file (e.g., mean(data$variable))
  • have more practice with creating frequency tables using sjmisc::frq() and summarytools::freq()
  • have more practice with generating simple histograms using ggplot()

Assumptions & Ground Rules

We are building on objectives from Assignments 1-3. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • knit your RMD document into an HTML file that you can then save and submit for course credit

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method
  • use groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages

Data viewing & wrangling

  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions

Descriptive data analysis

  • use summarytools::dfSummary() to quickly describe one or more variables in a data file
  • create frequency tables with sjmisc::frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()

Data visualization & aesthetics

  • create basic graphs using ggplot2’s ggplot() function


If you do not recall how to do these things, first review Assignments 1-3.

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

  • measures of central tendency: mean, median, and mode
  • how to recognize a unimodal or a bimodal distribution and to identify the mode of a distribution
  • how to calculate the median position from raw data or from grouped data
  • how to calculate the arithmetic mean from raw data, grouped data, or a frequency distribution
  • how skewness affects measures of central tendency
  • comparative advantages and disadvantages of the mean and the median
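
As a quick refresher on the last few items, the sketch below computes a mean and a median from a small frequency distribution in base R. The values and counts are made up for illustration and do not come from the course data.

```r
# Hypothetical frequency distribution: four scores and how often each occurs
scores <- c(1, 2, 3, 4)
counts <- c(2, 5, 2, 1)

# Mean from a frequency distribution: sum(score * count) / sum(count)
mean_from_freq <- sum(scores * counts) / sum(counts)

# weighted.mean() does the same calculation in one step
mean_weighted <- weighted.mean(scores, counts)

# Expanding the table back into raw data lets median() find the middle case
raw_scores <- rep(scores, times = counts)
median_raw <- median(raw_scores)

mean_from_freq   # 2.2
median_raw       # 2
```

Here the ten expanded cases are 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, so the two middle cases are both 2.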

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).

  • Early on, you may have a lot of trouble getting your code to run due to minor typos. This is normal.
  • Remember, you are learning to read and write a new (coding) language. As with learning any new language, we learn from practice - and from correcting our mistakes.

Part 1 (Assignment 4.1)

Goal: Create a new RMD file for Assignment 4

(Note: When following instructions, always replace “LastName” with your own last name and YEAR_MO_DY with the actual date, e.g., 2022_06_01_Fordham_K300Assign4.)

In the last assignment, you learned how to use sjmisc::frq() and summarytools::freq() functions to generate frequency tables for variables. You also learned about the summarytools::dfSummary() function for quickly summarizing all or a subset of the variables in a data object. Lastly, you learned how to select and recode variables using dplyr’s select(), mutate(), and if_else() functions as well as how to display data in graphs using ggplot(). In this assignment, you will decide which measure of central tendency is most appropriate for a given variable, then use frequency tables and R functions to calculate measures of central tendency and other univariate descriptive statistics.

  1. Go to your K300_L folder, which should contain the R Markdown file you created for Assignment 3 (named YEAR_MO_DY_LastName_K300Assign3). Click to open the R Markdown file.
    • Remember, we open RStudio in this way so the here package will automatically set our K300_L folder as the top-level directory.

  2. In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.

  3. The dialogue box asks for a Title, an Author, and a Default Output Format for your new R Markdown file.
    1. In the Title box, enter K300 Assignment 4.
    2. In the Author box, enter your First and Last Name (e.g., Tyeisha Fordham).
    3. In the Default Output Format box, ensure “HTML” is selected (HTML is usually the default selection).

  4. Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete this text before you begin working.

  5. This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text:
    • This R Markdown document contains my work for Assignment 4. It is my work and only my work.

Part 2 (Assignment 4.2)

Goal: Reading in Data and Creating Frequency Table from Youth Data

We will begin by reading in the Youth dataset and creating a frequency table of the “parnt2” variable. The frequency table will allow us to answer the first question on pages 100-101.

  1. Create a second-level header titled: “Part 1 (Assignment 4.1).”

  2. Create a third-level header in your R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
    1. Insert an R chunk.
    2. Inside the new R code chunk, load the following packages: tidyverse, haven, here, sjmisc, sjPlot, summarytools, and gt.
      • Note: A new package, “gt”, is listed above. Before loading it, remember that you must first install the package. Also, recall that you only need to install packages one time, but you must load them each time you start a new R session. Finally, remember that you can optionally use (and we recommend) groundhog.library() to improve the reproducibility of your script.
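
The install-once, load-every-session distinction can be sketched as follows. This is only an illustration: the `needed` vector simply repeats the package names listed in this step, and checking with requireNamespace() is one optional way to see what is already available, not a required part of the assignment.

```r
# Packages named in this step
needed <- c("tidyverse", "haven", "here", "sjmisc", "sjPlot", "summarytools", "gt")

# requireNamespace() returns TRUE if a package is installed and can be loaded,
# and FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need install.packages()
needed[!is_installed]

# Installing happens once per computer (usually typed in the Console):
# install.packages("gt")

# Loading happens in every new R session, typically in the first code chunk:
# library(gt)
```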

  3. After your first code chunk, create another third-level header in RMD titled: “Read Data into R”
    1. Insert another R code chunk.
    2. In the new R code chunk, read and assign the “Youth_0.sav” SPSS datafile into an R data object named YouthData.
      • Forget how to do this? Refer to instructions in Assignment 1.
    3. In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data object: YouthData.
      • Remember, this will call the object and provide a brief view of the data.
      • Also, remember that you can get a similar but more visually appealing view by simply clicking on the object in the “Environment” window.
      • Your RStudio session should now look a lot like this:
YouthData <- read_spss(here("Datasets", "Youth_0.sav"))
YouthData
## # A tibble: 1,272 × 23
##    Gender        v2   v21   v22 v63       v77     v79     v109    v119    parnt2
##    <dbl+lbl>  <dbl> <dbl> <dbl> <dbl+lbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l>  <dbl>
##  1 1 [male]      15    36    15 3 [usual… 1 [alw… 3 [som… 5 [a v… 1 [hur…      5
##  2 0 [female]    15     3     5 4 [alway… 1 [alw… 2 [usu… 4 [a b… 1 [hur…      8
##  3 1 [male]      15    20     6 4 [alway… 1 [alw… 1 [alw… 4 [a b… 2 [hur…      8
##  4 0 [female]    15     2     4 2 [somet… 1 [alw… 2 [usu… 5 [a v… 3 [hur…      4
##  5 0 [female]    14    12     2 4 [alway… 3 [som… 5 [nev… 5 [a v… 3 [hur…      7
##  6 1 [male]      15     1     3 4 [alway… 1 [alw… 1 [alw… 5 [a v… 3 [hur…      7
##  7 0 [female]    16    10     3 4 [alway… 3 [som… 2 [usu… 4 [a b… 1 [hur…      8
##  8 0 [female]    15    25    10 2 [somet… 3 [som… 5 [nev… 4 [a b… 3 [hur…      5
##  9 0 [female]    15     6    10 4 [alway… 2 [usu… 3 [som… 5 [a v… 1 [hur…      8
## 10 1 [male]      15    15     8 4 [alway… 3 [som… 3 [som… 4 [a b… 2 [hur…      7
## # … with 1,262 more rows, and 13 more variables: fropinon <dbl>,
## #   frbehave <dbl>, certain <dbl>, moral <dbl>, delinquency <dbl>,
## #   d1 <dbl+lbl>, hoursstudy <dbl>, `filter_$` <dbl>, heavytvwatcher <dbl+lbl>,
## #   studyhard <dbl+lbl>, supervision <dbl+lbl>, drinkingnotbad <dbl+lbl>,
## #   Lowcertain_bin <dbl>

View Youth Data

  1. After typing and running the line where you called to view the YouthData object, you should see a table with the first 10 out of 1,272 total rows (containing observations) and 23 columns (or variables).

  2. Before we move on, let’s learn about a new function for quickly viewing a snapshot of our data, then check out that new “gt” package you just installed.
    1. Insert a new R chunk, type head(YouthData), and hit RUN.
      • head() is a built-in R function that returns a snapshot of a dataframe or object. In this case, it gives us a snapshot of the variables and first several rows of observations in the YouthData object. This can be an especially useful function for quickly glimpsing large datasets.
      • Speaking of glimpsing, the glimpse() function is similarly useful for quickly glimpsing all the columns (variables) in a dataframe - feel free to check that one out as well!
    2. In the same code chunk, below your last command, type head(YouthData) %>% gt(), and hit RUN again.
      • Congrats - you just piped your output from the head(data) function to the gt() function, which tells the “gt” package to reformat our data table using its default table style settings!
      • Notice any differences in the two tables? Often, the key differences are especially noticeable in our final knitted document.
    3. Try knitting your RMD document now to see what the two tables will look like in your final HTML file.
head(YouthData)
## # A tibble: 6 × 23
##   Gender        v2   v21   v22 v63        v77     v79     v109    v119    parnt2
##   <dbl+lbl>  <dbl> <dbl> <dbl> <dbl+lbl>  <dbl+l> <dbl+l> <dbl+l> <dbl+l>  <dbl>
## 1 1 [male]      15    36    15 3 [usuall… 1 [alw… 3 [som… 5 [a v… 1 [hur…      5
## 2 0 [female]    15     3     5 4 [always] 1 [alw… 2 [usu… 4 [a b… 1 [hur…      8
## 3 1 [male]      15    20     6 4 [always] 1 [alw… 1 [alw… 4 [a b… 2 [hur…      8
## 4 0 [female]    15     2     4 2 [someti… 1 [alw… 2 [usu… 5 [a v… 3 [hur…      4
## 5 0 [female]    14    12     2 4 [always] 3 [som… 5 [nev… 5 [a v… 3 [hur…      7
## 6 1 [male]      15     1     3 4 [always] 1 [alw… 1 [alw… 5 [a v… 3 [hur…      7
## # … with 13 more variables: fropinon <dbl>, frbehave <dbl>, certain <dbl>,
## #   moral <dbl>, delinquency <dbl>, d1 <dbl+lbl>, hoursstudy <dbl>,
## #   `filter_$` <dbl>, heavytvwatcher <dbl+lbl>, studyhard <dbl+lbl>,
## #   supervision <dbl+lbl>, drinkingnotbad <dbl+lbl>, Lowcertain_bin <dbl>
# glimpse(YouthData)
head(YouthData) %>% gt()
Gender v2 v21 v22 v63 v77 v79 v109 v119 parnt2 fropinon frbehave certain moral delinquency d1 hoursstudy filter_$ heavytvwatcher studyhard supervision drinkingnotbad Lowcertain_bin
1 15 36 15 3 1 3 5 1 5 18 9 9 19 8 1 15 1 1 1 0 0 1
0 15 3 5 4 1 2 4 1 8 11 6 11 19 10 1 5 0 0 0 1 0 0
1 15 20 6 4 1 1 4 2 8 12 5 11 20 1 0 6 0 1 0 1 0 0
0 15 2 4 2 1 2 5 3 4 9 5 13 19 1 0 4 1 0 0 0 0 0
0 14 12 2 4 3 5 5 3 7 27 9 4 18 104 1 2 1 0 0 1 1 1
1 15 1 3 4 1 1 5 3 7 8 9 10 20 0 0 3 0 0 0 1 0 0


Which table do you prefer in the knitted document? I typically prefer gt-style tables myself. Output from some of the other packages you will use to create tables in this course can also be piped to the gt() function to quickly and easily improve the default table output (and to customize it if desired).
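
If you want to control how many rows head() returns, it accepts an n argument. The short sketch below uses mtcars, a small example dataset that ships with R, rather than the course data.

```r
# mtcars is a built-in example data frame with 32 rows and 11 columns
first_default <- head(mtcars)          # first 6 rows by default
first_three   <- head(mtcars, n = 3)   # first 3 rows only

nrow(first_default)
nrow(first_three)

# dim() reports the number of rows and columns in the full data frame
dim(mtcars)
```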

  1. Recall, in an R chunk, you can also type YouthData %>% view_df() and hit RUN, then check your Viewer tab to get a better look at the variable names, labels, and values.
    • Forget how to do this? Refer to Assignment 2.

  2. Create a third-level header titled: “Frequency Table for ‘parnt2’ Variable”

  3. Create a new R code chunk and type YouthData %>% freq(parnt2) to generate a frequency table for the “parental supervision scale” variable.
    • Note: This variable’s name is “parnt2”, but its label is “parental supervision scale”. This is why using view_df() is helpful – it allows you to see the variable names, labels, and value labels. For example, if you type “parental supervision scale” into RStudio, nothing will happen, because it is a label, not a variable (or object) name. However, if you type parnt2 into R, it returns the variable that measures the parental supervision scale.
#default
YouthData %>% freq(parnt2) 
## Frequencies  
## YouthData$parnt2  
## Label: parental supervision scale  
## Type: Numeric  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           2      6      0.47           0.47      0.47           0.47
##           3     16      1.26           1.73      1.26           1.73
##           4    163     12.81          14.54     12.81          14.54
##           5    139     10.93          25.47     10.93          25.47
##           6    440     34.59          60.06     34.59          60.06
##           7    189     14.86          74.92     14.86          74.92
##           8    319     25.08         100.00     25.08         100.00
##        <NA>      0                               0.00         100.00
##       Total   1272    100.00         100.00    100.00         100.00
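
To see the difference between a variable's name and its label, and to see what a frequency table is counting, here is a minimal base R sketch. The data frame `toy` and its `sup` column are hypothetical, and table() and prop.table() are base R stand-ins used only for illustration, not a replacement for freq().

```r
# Hypothetical data: a short supervision-style scale
toy <- data.frame(sup = c(4, 6, 6, 8, 6, 7, 8, 4))

# Attach a descriptive label; the column is still referenced by its name, sup
attr(toy$sup, "label") <- "toy supervision scale"
attr(toy$sup, "label")

# Counts for each value
counts <- table(toy$sup)

# Percentages and cumulative percentages
pct     <- 100 * prop.table(counts)
cum_pct <- cumsum(pct)

counts
round(cum_pct, 2)
```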


  1. Refer back to B&P’s SPSS Exercise at the end of Chapter 4 (pages 100-101) and answer question 1.

Part 3 (Assignment 4.3)

Goal: Create frequency tables for the “v77”, “v79”, “certain”, and “Gender” variables (Question 2, Ch.4 (pp.100-101))

Now, we are going to generate frequency tables for four variables. For each one, you will determine which measure of central tendency is most appropriate and whether the variable’s distribution is skewed and, if so, the direction of its skew (i.e., negative or positive). Making these determinations helps ensure that we report meaningful summary statistics when describing a variable. Generating a graph (e.g., bar chart or histogram) might also help you to determine the most appropriate measure of central tendency as well as to identify the direction of a skewed distribution.
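
The effect of skew on the mean and median can be seen with a few made-up numbers. In this sketch, `offenses` is a hypothetical vector in which one unusually large value pulls the mean upward while the median stays put.

```r
# Hypothetical counts for seven people; one value is far larger than the rest
offenses <- c(0, 0, 1, 1, 2, 3, 40)

mean(offenses)     # pulled toward the extreme value
median(offenses)   # the middle case does not depend on how large the maximum is

# A mean well above the median suggests a positive (right) skew
mean(offenses) > median(offenses)

# Dropping the extreme value moves the mean much closer to the median
mean(offenses[offenses < 40])
```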

  1. Create a second-level header titled: “Part 2 (Assignment 4.2).” Then, create a third-level header titled: “Frequency Tables and Measures of Central Tendency for Youth Data Variables”

  2. Create a new R code chunk and type YouthData %>% frq(v77)
    1. Repeat the above step for the other 3 variables, “v79”, “certain”, and “Gender”, and answer question 2 on page 101. Create a third-level header above each variable titled “Frequency Table of [Variable Name]”. For example, type “Frequency Table of v79” before your next R chunk. Then, type YouthData %>% frq(v79).
      • Earlier, you used the freq() function from the summarytools package to generate a frequency distribution; this time, notice that you were instructed to use the frq() function from the sjmisc package instead. One benefit of the frq() function is that it automatically calculates the mean and standard deviation (sd) for a variable. (Note, though, that it may calculate a mean value even when the mean is not the most appropriate measure of central tendency!)
    2. Graphical representations can be helpful, especially for determining the shape or skewness of a distribution. If you would like, you can create a bar graph or histogram for each variable. Doing so will help you to visualize the data and also gives you a chance to practice determining which graphs (i.e., bar graph, histogram, line graph) are most appropriate for particular variables.
      • If you do not recall how to create bar graphs and histograms, refer to Assignment 3. As a quick reminder, the code to do this takes the form of: data %>% ggplot(aes(variable)) + geom_bar()
      • Note: R is case sensitive! Be sure you are typing “v77” with a lower-case, not upper-case, ‘v’. Your RStudio should look like this:

        Frequency Tables of Variables

  3. You could also simply calculate the mean and median of each variable within R. There are various ways to do this. For example, you can use the base R mean() function to calculate the mean of a particular variable.
    1. Create a third-level header titled “Means, medians, and modes of Variables”
    2. Insert an R chunk and type mean(YouthData$certain). This will generate the mean of the ‘certain’ variable in the YouthData dataset. You can repeat this step for the other variables, v77, v79, and Gender.
      • Note: The $ operator is used to call, access, or reference a named element from a list or data object in R. Here, we are essentially telling R to access and calculate the mean of the certain column from the YouthData object. The $ operator, which you may recall that we first introduced in Assignment 2, is used frequently with base R functions. By this point, you have also used the tidyverse %>% operator many times. Often, tasks can be accomplished either with $ operators or with %>% operators, though some base R functions do not work with tidyverse-style pipes. I typically prefer tidyverse solutions and, when using tidyverse functions (e.g., “dplyr” and “ggplot2” packages), I usually recommend using the pipe operator %>% in lieu of the $ to initiate a sequence of actions. Doing so improves code readability and efficiency (e.g., by avoiding repetition of the data object name), and it reduces the need to assign extra objects or to write deeply nested function calls. It is worth noting that, as of version 4.1, base R also has its own native pipe operator (|>).
    3. Your RStudio should look like this:

      Means of Variables
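
To compare the styles mentioned in the note above, the sketch below calculates the same mean with the $ operator and with the base R native pipe on a small made-up data frame. The object `toy` and its column `score` are hypothetical. The native pipe |> requires R 4.1 or later, and na.rm = TRUE is included because mean() returns NA when a variable contains missing values.

```r
# Hypothetical data with one missing value
toy <- data.frame(score = c(10, 12, NA, 14))

# The $ operator pulls the column out of the data frame
mean(toy$score)                  # NA, because of the missing value
mean(toy$score, na.rm = TRUE)    # drops the NA before averaging

# The base R native pipe passes the column to mean() as its first argument
toy$score |> mean(na.rm = TRUE)
```

Both non-missing versions average 10, 12, and 14, so they return the same value.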

  4. You can use the same approach to determine the median of each variable by replacing the word ‘mean’ with ‘median’. For example, to determine the median of the ‘certain’ variable, you would type median(YouthData$certain). Repeat this for the remaining variables.
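
As a reminder of what median() is doing, the sketch below finds the median position by hand for a made-up vector with an even number of cases and checks the result against median(). The vector `x` is hypothetical.

```r
# Hypothetical scores, deliberately unsorted
x <- c(2, 8, 4, 10, 6, 12)

sorted <- sort(x)
n <- length(sorted)

# Median position: (n + 1) / 2, which is 3.5 here
position <- (n + 1) / 2

# With an even n, average the two cases on either side of that position
by_hand <- mean(sorted[c(floor(position), ceiling(position))])

by_hand
median(x)
```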


  5. Unfortunately, calculating the mode is not as straightforward, as there is not a similar built-in way of doing so in R. You may be tempted to use mode() but DO NOT do it - this actually does something else entirely! Instead, if you want to calculate the mode of a vector of values in R, one way to do it is to write a custom function. In the code below, we show how you can create a function called getMode(), which you can then use to calculate the mode of a variable.
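# Custom function: return the most frequently occurring value in x
getMode <- function(x) {
  uniqx <- unique(x)                            # distinct values in x
  uniqx[which.max(tabulate(match(x, uniqx)))]   # value with the highest count
}
getMode(YouthData$certain)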
## [1] 12
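
To see why mode() is the wrong tool, and what happens with a bimodal variable, here is a small sketch on made-up data. The helper `all_modes()` is hypothetical and not part of any package; it is one possible table-based alternative that returns every value tied for the highest frequency.

```r
# Hypothetical data with two modes (2 and 3 each occur twice)
x <- c(1, 2, 2, 3, 3, 4)

# mode() reports how R stores the object, not the most common value
mode(x)

# Hypothetical helper: return all values tied for the highest count
# (as.numeric() assumes the values are numbers)
all_modes <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

all_modes(x)
```

A function that returns a single value, like getMode() above, would report only the first of the tied values it finds (here, 2), so a frequency table is still worth checking for bimodal distributions.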


6. Alternatively, the descr() function in the summarytools package will generate various descriptive statistics, including the mean, median, and values indicating the level of skewness and kurtosis. Using the “certain” variable as an example, type: YouthData %>% descr(certain)

Using descr() Function to Summarize Variables

  • Repeat the steps above, using mean() and median() functions from base R and/or the descr() function from the summarytools package, to finish answering question 2 on page 101.
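
For intuition about the skewness and kurtosis values that descr() reports, the sketch below computes simple moment-based versions by hand. The helper names `moment_skew()` and `moment_kurt()` are hypothetical, and packages apply different small-sample adjustments, so these numbers may not match descr() or describe() exactly.

```r
# Hypothetical helpers using plain moment-based formulas
moment_skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

moment_kurt <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3   # excess kurtosis: 0 for a normal distribution
}

symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(0, 0, 1, 1, 2, 3, 40)

moment_skew(symmetric)      # 0: the two tails balance
moment_skew(right_skewed)   # positive: long tail to the right
moment_kurt(symmetric)      # negative: lighter tails than a normal curve
```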

Part 4 (Assignment 4.4)

Goal: Create a histogram for “delinquency” Variable (Question 3, Ch. 4 (pp. 100-101))

Next, you will create a histogram for the delinquency variable to determine its most appropriate measure of central tendency. This will allow you to answer question 3 on pages 100-101.

  • Create a second-level header titled: “Part 3 (Assignment 4.3)”
    • Create a third-level header titled: “Histogram of ‘delinquency’ Data”
    • Create an R chunk and type YouthData %>% ggplot(aes(delinquency)) + geom_histogram(). Answer question 3.
YouthData %>%
  ggplot(aes(delinquency)) +
  geom_histogram()
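
When geom_histogram() is run without arguments, ggplot2 uses 30 bins and prints a message suggesting you pick a better value. The sketch below shows one way to set the bin width yourself. The data frame `toy` and the binwidth of 5 are made up for illustration, and the example assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Hypothetical right-skewed counts
toy <- data.frame(count = c(0, 0, 1, 1, 2, 3, 4, 6, 9, 15, 22, 40))

# binwidth = 5 groups values into bins that are 5 units wide
p <- ggplot(toy, aes(count)) +
  geom_histogram(binwidth = 5)

p
```

Trying a few different binwidth values is a reasonable way to check whether the apparent shape of a distribution depends on how the bins are drawn.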

Part 5 (Assignment 4.5)

Goal: Create a Frequency Table for “Gender” Variable

For the last part of the assignment, you will create a frequency table of the “Gender” variable to determine its mean. The table will also allow you to check out the distribution of the variable.

  1. Create a frequency table of the “Gender” variable. Look at its mean and answer question 4 on pages 100-101.
    • While this table is helpful, it is not as detailed or visually pleasing as we would like. So, as in Assignment 3, we will “clean up” the frequency table. Create a new R chunk and type YouthData %>% freq(Gender, plain.ascii = FALSE, style = "rmarkdown"). Remember to type {r, results = 'asis'} as the chunk options on the top line of the code chunk, above the R code, so that RStudio shows a clean table upon knitting. If you do not recall why we use 'asis', plain.ascii, or style, refer to Assignment 3.
YouthData %>%
  frq(Gender)
## Gender of respondent (Gender) <numeric> 
## # total N=1272 valid N=1272 mean=0.47 sd=0.50
## 
## Value |  Label |   N | Raw % | Valid % | Cum. %
## -----------------------------------------------
##     0 | female | 680 | 53.46 |   53.46 |  53.46
##     1 |   male | 592 | 46.54 |   46.54 | 100.00
##  <NA> |   <NA> |   0 |  0.00 |    <NA> |   <NA>
YouthData %>%
  freq(Gender, plain.ascii = FALSE, style = "rmarkdown")

Frequencies

YouthData$Gender

Label: Gender of respondent
Type: Numeric

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    680     53.46          53.46     53.46          53.46
          1    592     46.54         100.00     46.54         100.00
       <NA>      0                               0.00         100.00
      Total   1272    100.00         100.00    100.00         100.00
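
It may help to know that, for any variable coded 0 and 1, the mean equals the proportion of cases coded 1. The sketch below shows this with a made-up vector; `employed` is hypothetical and is not a variable in the course data.

```r
# Hypothetical 0/1 variable: 1 = yes, 0 = no
employed <- c(1, 0, 0, 1, 1, 0, 1, 1)

# Proportion of cases coded 1
prop_yes <- sum(employed == 1) / length(employed)

# The mean of a 0/1 variable is the same number
mean(employed)
prop_yes
```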
  1. We can also use the psych package to view our measures of central tendency.
    1. Create a third-level header titled: “Using ‘psych’ Package for Measures of Central Tendency”
    2. Install the psych package and load it into R.
      • Remember, you can check to make sure a package is loaded by clicking Packages and looking for a check mark.
    3. Create a new R chunk and type describe(YouthData$Gender)
      • The describe() function in the psych package allows you to look at the mean, median, minimum, maximum, and standard deviation of a dataset or variable. Remember, the $ calls a variable in a specific dataset. So, if we typed describe(YouthData), we would get these values for all variables in the data. But, by using the $ operator, we can specify that we want these values for the “Gender” variable. If you want more of a refresher, refer to Assignment 2. Your RStudio should look like this:
describe(YouthData$Gender)
##    vars    n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 1272 0.47 0.5      0    0.47   0   0   1     1 0.14    -1.98 0.01

Describe ‘Gender’ Variable
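
Several of the columns in the describe() output can be reproduced with base R functions, which can help you read the table. The vector `x` below is made up, and the standard error line assumes the usual formula of the standard deviation divided by the square root of n.

```r
# Hypothetical scores
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n      <- length(x)
avg    <- mean(x)
med    <- median(x)
stdev  <- sd(x)              # sample standard deviation (divides by n - 1)
spread <- max(x) - min(x)    # the "range" column
se     <- stdev / sqrt(n)    # standard error of the mean

c(n = n, mean = avg, median = med, sd = stdev,
  min = min(x), max = max(x), range = spread, se = se)
```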

You should now have everything that you need to complete the questions in Assignment 4 that parallel those from B&P’s SPSS Exercises for Chapter 4!

  1. Complete the remainder of the questions in Assignment 4 in your RMD file.
    1. Keep the file clean and easy to follow by using RMD level headings (e.g., denoted with ## or ###) separating R code chunks, organized by assignment questions.
    2. Write plain text after headings and before or after code chunks to explain what you are doing - such text will serve as useful reminders to you when working on later assignments!
    3. Upon completing the assignment, “knit” your final RMD file again and save the final knitted HTML file to your “Assignments” folder as: YEAR_MO_DY_LastName_K300Assign4. Submit via Canvas in the relevant section (i.e., the last question) for Assignment 4.

Assignment 4 Objective Checks

After completing assignment #4…

  • do you know how to use the base R head() function to quickly view a snapshot of your data?
  • can you use the glimpse() function to quickly view all columns (variables) in your data?
  • can you improve a knitted table by piping a function’s results to gt() (e.g., head(data) %>% gt())?
  • do you know how to calculate measures of central tendency for a variable distribution using base R functions: mean() and median()?
  • can you generate central tendency and other descriptive statistics using summarytools::descr() and psych::describe() functions?
  • did you get more practice using the $ operator to reference or call a named element from a list or data object, such as a specific variable in a data file (e.g., mean(data$variable))?
  • did you get more practice creating frequency tables using sjmisc::frq() and summarytools::freq()?
  • did you get more practice generating simple histograms with ggplot()?