Assignment 3: Describing Data Distributions

Assignment 3 Objectives

The purpose of this third assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 2 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

This chapter focused on describing data distributions. As with the two previous assignments, you will be using R Markdown (with R & RStudio) to complete and present your work. In this assignment, you will learn how to recode variables, generate frequency tables, and create simple graphs in R.

By the end of assignment #3, you should be able to…

create simple frequency tables using sjmisc::frq() and summarytools::freq()
identify strengths and limitations of frq() and freq() for creating frequency tables
sort a frequency table by frequency, from highest to lowest and from lowest to highest frequencies
recognize summarytools::dfSummary() as another way to quickly describe one or more variables in a data file

Assumptions & Ground Rules

We are building on objectives from Assignments 1 & 2. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
recognize when a function is being called from a specific package using a double colon with the package::function() format
read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
use a tidyverse %>% pipe operator to perform a sequence of actions
knit your RMD document into a Word file that you can then save and submit for course credit

Reproducibility

use here() for a simple and reproducible self-referential file directory method

Data viewing & wrangling

use sjPlot::view_df() to quickly browse variables in a data file
use attr() to identify variable and attribute value labels

If you do not recall how to do these things, first review Assignments 1 & 2.
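As a quick refresher on these prerequisites, here is a minimal sketch. It uses a small made-up data frame named toy (a hypothetical practice object, not the assignment data) that borrows three cigtax values from the 2012 States Data preview shown later in this assignment. It assumes the dplyr package (installed with tidyverse) is available.

```r
# Load dplyr, which supplies the %>% pipe and filter()
library(dplyr)

# Assign a small practice data frame to an object with <-
# ("toy" is a hypothetical name used only for illustration)
toy <- data.frame(
  State  = c("Alabama", "Alaska", "Arizona"),
  cigtax = c(0.425, 2, 2)
)

# Use $ to call one variable from the data object
toy$cigtax

# Use package::function() to call a function from a specific package,
# and %>% to chain actions: keep states with cigtax above 1, then count rows
n_high_tax <- toy %>%
  dplyr::filter(cigtax > 1) %>%
  nrow()
n_high_tax

# A line starting with # is a comment, so the next line does not run
# toy$State
```

If any line here is unfamiliar, that is a good sign to revisit Assignments 1 & 2 before continuing.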

Additionally, you should have read the assigned book chapters and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

sampling
units of analysis
variable levels of measurement

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).

Early on, you may have a lot of trouble getting your code to run due to minor typos. This is normal.
Remember, you are learning to read and write a new (coding) language. As with learning any new language, we learn from practice - and from correcting our mistakes.

Part 1 (Assignment 3.1)

Goal: Create a new RMD file for Assignment 3

(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR-MO-DY for the actual date. E.g., 2023-01-25_Ducate_CRIM5305_Assign03)

In the second assignment, you learned how to read in and assign a dataset to an R object. You also learned how to use the view_df() function from the sjPlot package and the base R attr() function to display your dataframe and identify variable attributes. In this third assignment, you will use the “sjmisc” and “summarytools” packages to display your descriptive data in frequency tables. You will also learn about the dfSummary() function from the “summarytools” package, which is an alternative to sjPlot::view_df() for creating a useful summary of all or a subset of the variables in a dataset.
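To give a rough sense of what the summarytools summary function does, here is a minimal sketch on a made-up three-row data frame (toy is a hypothetical object; its values are borrowed from the 2012 States Data preview later in this assignment). Note that the function name is case sensitive and is spelled dfSummary(). The graph.col = FALSE argument is used here only to keep the printed output compact.

```r
library(summarytools)

# Hypothetical practice data, not the assignment data file
toy <- data.frame(
  State  = c("Alabama", "Alaska", "Arizona"),
  cigtax = c(0.425, 2, 2)
)

# dfSummary() returns one row of summary information per variable;
# graph.col = FALSE leaves out the small text graphs
toy_summary <- dfSummary(toy, graph.col = FALSE)
toy_summary
```

With the real data, the same call on StatesData2012 should describe all 30 variables at once.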

Go to your CRIM5305_L folder, which should contain the R Markdown file you created for Assignment 2 (named YEAR-MO-DY_LastName_CRIM5305Assign2). Click to open the R Markdown file.
- Remember, we open RStudio in this way so the here package will automatically set our CRIM5305_L folder as the top-level directory.
In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.
The dialogue box asks for a Title, an Author, and a Default Output Format for your new R Markdown file.
1. In the Title box, enter CRIM5305 Assignment 3.
2. In the Author box, enter your First and Last Name (e.g., Caitlin Ducate).
3. In the Default Output Format box, be sure “Word” is selected.
Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete all the text after the YAML header before you begin working.
Create a second-level heading titled: “Part 1 (Assignment 3.1)”
1. Remember, a second-level heading starts with two hashtags followed by a space and the heading title, like this: ## Heading Title
2. A third-level heading starts with three hashtags: ### Heading Title
3. A fourth-level heading starts with four hashtags: #### Heading Title
This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text:
- This R Markdown document contains my work for Assignment 3. It is my work and only my work.

Part 2 (Assignment 3.2)

Goal: Read in 2012 States Data and view variable information

Create a second-level heading titled: “Part 2 (Assignment 3.2): Reading in and viewing 2012 States Data”
1. Remember, a second-level heading starts with two hashtags followed by a space and the heading title, like this: ## Heading Title
2. A third-level heading starts with three hashtags: ### Heading Title
3. A fourth-level heading starts with four hashtags: #### Heading Title
Now, you need to get data into RStudio. You already know how to do this, but please refer to Assignment 1 if you have questions.
1. Create a third-level header in R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
2. Insert an R code chunk
3. Inside the new R code chunk, load the following six packages: tidyverse, haven, here, sjmisc, sjPlot, and summarytools.
4. Some of these packages will need to be installed. Remember, you only need to install a package once, but you must load a package each time you start a new R session and need to use the package.
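For reference, a “Load Libraries” chunk along these lines might look like the sketch below. It assumes all six packages are already installed; the commented-out install.packages() line shows one way to install them the first time.

```r
# Run once if needed, then leave commented out so it does not rerun each time
# install.packages(c("tidyverse", "haven", "here", "sjmisc", "sjPlot", "summarytools"))

# Load the six packages used in this assignment (needed in every new R session)
library(tidyverse)
library(haven)
library(here)
library(sjmisc)
library(sjPlot)
library(summarytools)
```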
After your first code chunk, create another third-level header in RMD titled: “Read Data into R”
1. Insert another R code chunk.
2. In the new R code chunk, read and assign the “2012 states data.sav” SPSS datafile into an R data object named StatesData2012.
  - Forget how to do this? Refer to instructions in Assignment 1.
3. In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data object: StatesData2012.
  - This will call the object and provide a brief view of the data. (Note: You can also simply click on the data object in the “Environment” window.)
  - Your RStudio session should now look a lot like this:

View Your 2012 States Data

StatesData2012 <- read_spss(here("Datasets", "2012StatesData.sav"))

StatesData2012

## # A tibble: 50 × 30
##    State  Numbe…¹ Numbe…² South   Region  Permo…³ cigtax smoke…⁴ tobac…⁵ Persm…⁶
##    <chr>    <dbl>   <dbl> <dbl+l> <dbl+l>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
##  1 Alaba…     335    1981 1 [Sou… 1 [Sou…    15.1  0.425       0    318.      22
##  2 Alaska      54     216 0 [Non… 2 [Wes…    21.5  2           0    270.      22
##  3 Arizo…     443     607 0 [Non… 2 [Wes…    19.7  2           1    247.      16
##  4 Arkan…     200    1269 1 [Sou… 1 [Sou…    17.6  1.15        1    324.      22
##  5 Calif…    2766    1882 0 [Non… 2 [Wes…    15.6  0.87        0    235       14
##  6 Color…     349     668 0 [Non… 2 [Wes…    18.2  0.84        1    238.      18
##  7 Conne…     228     418 0 [Non… 4 [Nor…    11    3           0    238.      16
##  8 Delaw…      63     156 1 [Sou… 1 [Sou…    14.1  1.6         1    281.      18
##  9 Flori…    1229    1712 1 [Sou… 1 [Sou…    15.9  1.34        1    259.      17
## 10 Georg…     680    2322 1 [Sou… 1 [Sou…    16.5  0.37        0    299.      19
## # … with 40 more rows, 20 more variables: totalpop <dbl>, DivorceRt <dbl>,
## #   perfampoverty <dbl>, perindpoverty <dbl>, MedianIncome <dbl>,
## #   Pernoinsurance <dbl>, MurderRt <dbl>, RobberyRt <dbl>, AssaultRt <dbl>,
## #   BurglaryRt <dbl>, MVTheftRT <dbl>, InfantMort <dbl>, HeartDeathRt <dbl>,
## #   CancerDeathRt <dbl>, PerBachelorD <dbl>, PercentRural <dbl>,
## #   Percent18to24 <dbl>, ID <dbl>, Assault_bin <dbl+lbl>, Murdercat <dbl+lbl>,
## #   and abbreviated variable names ¹Number1824, ²NumberRural, ³Permoved, …

As in the image, you should see 50 rows and 30 columns, which correspond to 50 individual observations and 30 variables (e.g., region; cigarette tax; and murder rate).
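If you want to practice the read-and-check step without the course data file, here is a self-contained sketch. It writes a made-up three-row data frame to a temporary .sav file (toy, tmp_path, and toy_back are hypothetical names), reads it back with read_spss(), and confirms the number of rows and columns with dim(). The temporary file only stands in for the real dataset.

```r
library(haven)

# Hypothetical practice data
toy <- data.frame(
  State  = c("Alabama", "Alaska", "Arizona"),
  cigtax = c(0.425, 2, 2)
)

# Write the practice data to a temporary SPSS file, then read it back in
tmp_path <- tempfile(fileext = ".sav")
write_sav(toy, tmp_path)
toy_back <- read_spss(tmp_path)

# Call the object for a brief view, then check rows and columns
toy_back
dim(toy_back)
```

Running dim(StatesData2012) on the real data should report the 50 rows and 30 columns described above.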

Now, let’s view the variables in the data. In the SPSS program referenced in the book, one would click on the “variable view” tab. Recall, one way to see the variables in your data is to simply click the data object in your R environment (“StatesData2012”). This will open another window in which you can see your variables and every row of observations (akin to “data view” in SPSS). Recall, however, for a “variable view” equivalent in R, you can use the sjPlot::view_df() function:
1. Insert an R chunk
2. Type StatesData2012 %>% view_df() and hit RUN or press CMD/CTRL + Enter
3. In your Viewer tab, the variable names, labels, and values should look like this:
View Your 2012 States Data
Now, refer back to B&P’s SPSS Exercise at the end of Chapter 2 (pages 41-42) and answer the questions, which ask about the unit of analysis and the following variables:
1. State
2. Murdercat
3. BurglaryRt
4. MedianIncome
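When deciding on a variable's level of measurement, it can help to look at its labels with attr(), as you did in Assignment 2. The sketch below builds a short made-up labelled vector (murdercat_toy is a hypothetical object with invented values) using the variable label and four of the value labels that appear later in this assignment, then pulls out those attributes.

```r
library(haven)

# Hypothetical practice vector: codes 0 to 3 with value labels attached
murdercat_toy <- labelled(
  c(1, 0, 2, 1, 3),
  labels = c(
    "0 to 3 murders per 100k"    = 0,
    "3.1 to 6 murders per 100k"  = 1,
    "6.1 to 9 murders per 100k"  = 2,
    "9.1 to 12 murders per 100k" = 3
  ),
  label = "Murder rate categorical"
)

# The variable label describes the variable as a whole
var_label <- attr(murdercat_toy, "label")
var_label

# The value labels describe what each numeric code stands for
value_labels <- attr(murdercat_toy, "labels")
value_labels
```

With the real data, the equivalent calls would be attr(StatesData2012$Murdercat, "label") and attr(StatesData2012$Murdercat, "labels").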

Part 3 (Assignment 3.3)

Goal: Use R to create frequency tables for the Murdercat variable (Questions 11-13, Ch. 2, pp. 41-42).

In Chapter 2, you learned about levels of measurement and about how frequency tables are used in descriptive research. While there are many different ways to describe variables, frequency tables are one of the most basic and efficient ways to do so. Frequency tables describe the number of occurrences in our data for each variable attribute or for grouped variable attributes. There are many ways to generate frequency tables, and we will only cover a couple of them here.

Suppose we want to generate a frequency table for the “Murdercat” (or murder rate categorical) variable. Here is a simple way to get that using the frq() function from the “sjmisc” package:

Create a second-level header titled: “Part 3 (Assignment 3.3).” Then, create a third-level header titled: “Frequency Table for ‘Murdercat’ Variable in ‘StatesData2012’ using frq()”
Create a new R code chunk and type StatesData2012 %>% frq(Murdercat)
1. One way to generate a simple frequency table for an individual variable is with the frq() command from the “sjmisc” package. frq() displays a basic frequency table of the designated variable(s). The table should show the value labels (e.g., “0 to 3 murders per 100k”; “3.1 to 6 murders per 100k”; “6.1 to 9 murders per 100k”; etc.) and the N, or the total number of units (states) in the dataset. It also shows the percentage of states within each attribute value, including the cumulative percentage.
2. Your RStudio window should look like this:

Frequency Table Created with sjmisc

In the image above, you can see that the N is 50 for all 50 states. You can also see that 17 states had “0 to 3 murders per 100k” in 2012, while 21 states had “3.1 to 6 murders per 100k” in 2012. Only 1 state, or 2% of states, had “9.1 to 12 murders per 100k” in 2012.
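To see where these numbers come from, here is a minimal sketch that rebuilds a stand-in for the Murdercat variable from the frequencies reported in this assignment (17, 21, 10, 1, and 1 states for codes 0 through 4). The objects codes and toy are hypothetical. Attaching “9.1 to 12 murders per 100k” to code 3 is an assumption, and code 4 is left unlabelled because its label is not given in the text.

```r
library(dplyr)
library(haven)
library(sjmisc)

# Rebuild 50 category codes from the reported frequencies
codes <- as.numeric(rep(0:4, times = c(17, 21, 10, 1, 1)))

# Hypothetical stand-in for StatesData2012 with a labelled Murdercat variable
toy <- tibble(
  Murdercat = labelled(
    codes,
    labels = c(
      "0 to 3 murders per 100k"    = 0,
      "3.1 to 6 murders per 100k"  = 1,
      "6.1 to 9 murders per 100k"  = 2,
      "9.1 to 12 murders per 100k" = 3  # assumed to match code 3
    ),
    label = "Murder rate categorical"
  )
)

# Frequency table with sjmisc::frq()
toy_frq <- toy %>% frq(Murdercat)
toy_frq

# Check the percentages by hand: frequency / N * 100
counts <- table(codes)
pct <- 100 * counts / sum(counts)
pct
```

For example, 17 of 50 states is 17 / 50 * 100 = 34%, and 1 of 50 states is 2%.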

As noted, this is one way to create a frequency table, but there are lots of other packages that we could use instead. Each package and function has its various strengths. For examples, see here.

While the above frequency table is easy to generate and has the descriptive information we need, it is not easy to sort the table output. So, we will also introduce you to the freq() function from the “summarytools” package. This package is described in the link above and in more detail here.

For a basic frequency table using “summarytools” package, type StatesData2012 %>% freq(Murdercat).
1. Note that this function uses freq(), not frq(). Both functions (with or without the ‘e’) will work as long as their respective packages are loaded: “summarytools” for freq() and “sjmisc” for frq().
2. Also, note that these functions display tables with the same frequency values but somewhat different formatting. Try using data %>% freq(variable) and data %>% frq(variable) code (with and without an ‘e’) to see for yourself. One difference is that, unlike the sjmisc::frq() output, the summarytools::freq() output does not include a variable’s attribute value labels (e.g., “0 to 3 murders per 100k”).
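If you would like value labels to appear in a freq() table, one possible workaround (not required for this assignment) is to convert the labelled variable to a factor with haven::as_factor() first. The sketch below uses a hypothetical stand-in variable rebuilt from the frequencies reported in this assignment. Matching “9.1 to 12 murders per 100k” to code 3 is an assumption, and code 4 is left unlabelled because its label is not given in the text, so it should simply display as 4.

```r
library(haven)
library(summarytools)

# Hypothetical stand-in for StatesData2012$Murdercat
codes <- as.numeric(rep(0:4, times = c(17, 21, 10, 1, 1)))
murdercat_toy <- labelled(
  codes,
  labels = c(
    "0 to 3 murders per 100k"    = 0,
    "3.1 to 6 murders per 100k"  = 1,
    "6.1 to 9 murders per 100k"  = 2,
    "9.1 to 12 murders per 100k" = 3  # assumed to match code 3
  ),
  label = "Murder rate categorical"
)

# Convert codes to a factor whose levels are the value labels
murdercat_f <- as_factor(murdercat_toy)

# freq() now lists the label text in place of the numeric codes
tab <- freq(murdercat_f)
tab
```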
As mentioned, a key benefit of the summarytools::freq() function is that it allows us to easily sort a frequency table, such as from highest to lowest frequencies or vice versa. For example, if we want to sort from highest to lowest, with the most frequent value on top, we can use freq(variable, order = "freq").
1. Create a new R chunk and type StatesData2012 %>% freq(Murdercat, order = "freq"). This will sort the frequencies of Murdercat from highest to lowest.
2. Your RStudio should look like this:

Frequency Table of Murdercat Ascending

We can even reverse the order by adding “-” (a minus sign) before freq inside the quotation marks: data %>% freq(variable, order = "-freq"). This means the frequencies will be sorted from lowest to highest. Sort the Murdercat variable from lowest to highest frequencies and complete the rest of the SPSS exercises on page 42.

#sorted by frequency, high to low
StatesData2012 %>%
  freq(Murdercat, order = "freq")

## Frequencies  
## StatesData2012$Murdercat  
## Label: Murder rate categorical  
## Type: Numeric  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1     21     42.00          42.00     42.00          42.00
##           0     17     34.00          76.00     34.00          76.00
##           2     10     20.00          96.00     20.00          96.00
##           3      1      2.00          98.00      2.00          98.00
##           4      1      2.00         100.00      2.00         100.00
##        <NA>      0                               0.00         100.00
##       Total     50    100.00         100.00    100.00         100.00

#sorted by frequency, low to high
StatesData2012 %>%
  freq(Murdercat, order = "-freq")

## Frequencies  
## StatesData2012$Murdercat  
## Label: Murder rate categorical  
## Type: Numeric  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           3      1      2.00           2.00      2.00           2.00
##           4      1      2.00           4.00      2.00           4.00
##           2     10     20.00          24.00     20.00          24.00
##           0     17     34.00          58.00     34.00          58.00
##           1     21     42.00         100.00     42.00         100.00
##        <NA>      0                               0.00         100.00
##       Total     50    100.00         100.00    100.00         100.00
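Notice that the cumulative percentages differ between the two tables even though the frequencies are the same. A short base R sketch (using a hypothetical codes vector rebuilt from the frequencies above) reproduces both sets of numbers by hand: sort the counts, add them up row by row, and divide by N.

```r
# Rebuild 50 category codes from the frequencies shown above
codes <- rep(0:4, times = c(17, 21, 10, 1, 1))
counts <- table(codes)

# Sort the counts from high to low and from low to high
high_to_low <- sort(counts, decreasing = TRUE)
low_to_high <- sort(counts)

# Cumulative percent = running total of frequencies / N * 100
cum_pct_high <- 100 * cumsum(high_to_low) / sum(counts)
cum_pct_low  <- 100 * cumsum(low_to_high) / sum(counts)

cum_pct_high
cum_pct_low
```

Because cumulative percentages are running totals, they depend on the row order. For an ordered variable like Murdercat, they are usually easiest to interpret when rows follow the category order rather than the frequency order.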

We can also “clean up” this frequency table to make it easier to read once we knit our final document. This will require modifying: (1) our R code and (2) our R chunk options.

First, try adding plain.ascii = FALSE, style = 'rmarkdown' after order = "freq" in your freq() code. This should generate a table with RMD text formatting (e.g., with asterisks and vertical lines throughout) rather than the default plain text output:

#clean it up for knitted RMD 
StatesData2012 %>%
  freq(Murdercat, order = "freq", plain.ascii = FALSE, style = 'rmarkdown')

## ### Frequencies  
## #### StatesData2012$Murdercat  
## **Label:** Murder rate categorical  
## **Type:** Numeric  
## 
## |     &nbsp; | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
## |-----------:|-----:|--------:|-------------:|--------:|-------------:|
## |      **1** |   21 |   42.00 |        42.00 |   42.00 |        42.00 |
## |      **0** |   17 |   34.00 |        76.00 |   34.00 |        76.00 |
## |      **2** |   10 |   20.00 |        96.00 |   20.00 |        96.00 |
## |      **3** |    1 |    2.00 |        98.00 |    2.00 |        98.00 |
## |      **4** |    1 |    2.00 |       100.00 |    2.00 |       100.00 |
## | **\<NA\>** |    0 |         |              |    0.00 |       100.00 |
## |  **Total** |   50 |  100.00 |       100.00 |  100.00 |       100.00 |

Second, in the R chunk options (i.e., the very top line in your R chunk), type the following: “{r, results = 'asis'}” to pass the results through as RMD formatted text in the knitting process.
For more information, refer to the summarytools forum above or try knitting your document without these additions and viewing the changes before and after these additions.
Your final “clean” table should look like this when knitted:

Clean Up Final Frequency Table
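If you end up making several tables, an optional alternative is to set these two formatting arguments once with summarytools::st_options() so that later freq() calls pick them up by default. This is a sketch only, assuming your installed version of summarytools accepts these settings; the codes vector is a hypothetical stand-in for Murdercat, and each chunk that prints a table would still need results = 'asis' in its chunk options.

```r
library(summarytools)

# Set RMD-style output once for the rest of the session
st_options(plain.ascii = FALSE, style = "rmarkdown")

# Hypothetical stand-in for Murdercat, rebuilt from the frequencies above
codes <- rep(0:4, times = c(17, 21, 10, 1, 1))

# No need to repeat plain.ascii and style inside freq() now
tab <- freq(codes, order = "freq")
tab
```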

You should now have everything that you need to complete the rest of the questions in Assignment 3 that parallel those from B&P’s SPSS Exercises for Chapter 2!

Complete the remainder of the questions in Assignment 3 in your RMD file.
1. Keep the file clean and easy to follow by using RMD level headings (e.g., denoted with ##, ###, or ####) separating R code chunks, organized by assignment questions.
2. Write plain text after headings and before or after code chunks to explain what you are doing - such text will demonstrate your understanding of the course materials and will serve as useful reminders to you when working on later assignments!
3. Upon completing the assignment, “knit” your final RMD file again and save the final knitted Word document to your “Assignments” folder as: YEAR-MO-DY_LastName_CRIM5305_Assign03. Submit via Blackboard in the relevant section (i.e., the last question) for Assignment 3. NOTE: If you absolutely cannot get your file to knit, upload your RMD file instead.

Assignment 3 Objective Checks

After completing assignment #3, can you…

create simple frequency tables using sjmisc::frq() and summarytools::freq()?
identify strengths and limitations of frq() and freq() for creating frequency tables?
sort a frequency table by frequency, from highest to lowest and from lowest to highest frequencies?
recognize summarytools::dfSummary() as another way to quickly describe one or more variables in a data file?