Assignment 3: Describing Data Distributions

Assignment 3 Objectives

The purpose of this third assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapters 2 and 3 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

These chapters focused on data distributions and displaying data with tabular or graphical representations. As with the two previous assignments, you will be using R Markdown (with R & RStudio) to complete and present your work. In this assignment, you will learn how to recode variables, generate frequency tables, and create simple graphs in R.

By the end of assignment #3, you should be able to…

create simple frequency tables using sjmisc::frq() and summarytools::freq()
identify strengths and limitations of frq() and freq() for creating frequency tables
sort a frequency table by frequency, from highest to lowest and from lowest to highest frequencies
recognize summarytools::dfSummary() as another way to quickly describe one or more variables in a data file
use the ggplot() function from the “ggplot2” package to generate basic bar charts and histograms
select specific variables using dplyr::select()
recode variables using mutate() and if_else() functions from the “dplyr” package
understand how the if_else() function works and why we use it instead of ifelse()

Assumptions & Ground Rules

We are building on objectives from Assignments 1 & 2. By the start of this assignment, you should already know how to:

create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
recognize when a function is being called from a specific package using a double colon with the package::function() format
read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
use a tidyverse %>% pipe operator to perform a sequence of actions
knit your RMD document into an HTML file that you can then save and submit for course credit

use here() for a simple and reproducible self-referential file directory method
use groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages

use sjPlot::view_df() to quickly browse variables in a data file
use attr() to identify variable and attribute value labels

If you do not recall how to do these things, first review Assignments 1 & 2.
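As a quick refresher on two of the skills listed above, here is a minimal sketch of the $ symbol and the %>% pipe. The toy data frame and its values are hypothetical and are not part of the course data.

```r
library(dplyr)  # provides the %>% pipe

# Toy data frame (hypothetical values, not from the course data)
toy <- data.frame(
  unit = c("A", "B", "C"),
  rate = c(2.5, 4.0, 6.5)
)

# $ pulls one column out of the data object as a vector
toy$rate

# The pipe passes the object on its left as the first argument
# of the function on its right, so these two lines do the same thing
summary(toy$rate)
toy$rate %>% summary()

# Piped results can be assigned to an object like any other result
mean_rate <- toy$rate %>% mean()
mean_rate
```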

Additionally, you should have read the assigned book chapters and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

units of analysis
variable levels of measurement
skewness
rates, percents, proportions, intervals, and interval widths
appropriate graphs for different types of variables (e.g., at different levels of measurement)
- difference between histograms and bar charts and when each is appropriate
- difference between histograms and line graphs and when each is appropriate

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).

Early on, you may have a lot of trouble getting your code to run due to minor typos. This is normal.
Remember, you are learning to read and write a new (coding) language. As with learning any new language, we learn from practice - and from correcting our mistakes.

Part 1 (Assignment 3.1)

Goal: Create a new RMD file for Assignment 3

(Note: Remember that, when following instructions, always substitute your own last name for “LastName” and the actual date for YEAR_MO_DY. E.g., 2022_09_01_Fordham_K300Assign3)

In the second assignment, you learned how to read in and assign a dataset to an R object. You also learned how to use the view_df() function from the sjPlot package and the base R attr() function to display your dataframe and identify variable attributes. In this third assignment, you will use the “sjmisc” and “summarytools” packages to display your descriptive data in frequency tables. You will also learn about the dfSummary() function from the “summarytools” package, which is an alternative to sjPlot::view_df() for creating a useful summary of all or a subset of the variables in a dataset. Additionally, you will learn how to select and recode variables using the select(), mutate(), and if_else() functions from the “dplyr” package, and how to display your data in basic bar charts or histograms using the ggplot() function from the “ggplot2” package.

Go to your K300_L folder, which should contain the R Markdown file you created for Assignment 2 (named YEAR_MO_DY_LastName_K300Assign2). Click to open the R Markdown file.
- Remember, we open RStudio in this way so the here package will automatically set our K300_L folder as the top-level directory.
In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.
The dialogue box asks for a Title, an Author, and a Default Output Format for your new R Markdown file.
1. In the Title box, enter K300 Assignment 3.
2. In the Author box, enter your First and Last Name (e.g., Tyeisha Fordham).
3. In the Default Output Format box, be sure “HTML” is selected (HTML is usually the default selection).
Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete all the text after the YAML header before you begin working.
This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text:
- This R Markdown document contains my work for Assignment 3. It is my work and only my work.

Part 2 (Assignment 3.2)

Goal: Read in 2012 States Data and view variable information

Create a second-level heading titled: “Part 1 (Assignment 3.1): Reading in and viewing 2012 States Data”
1. Remember, a second-level heading starts with two hashtags followed by a space and the heading title, like this: ## Heading Title
2. A third-level heading starts with three hashtags: ### Heading Title
3. A fourth-level heading starts with four hashtags: #### Heading Title
Now, you need to get data into RStudio. You already know how to do this, but please refer to Assignment 1 if you have questions.
1. Create a third-level header in your R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
2. Insert an R code chunk
3. Inside the new R code chunk, load the following six packages: tidyverse, haven, here, sjmisc, sjPlot, and summarytools.
4. Some of these packages will need to be installed. Remember, you only need to install a package once, but you must load a package each time you start a new R session and need to use the package.
After your first code chunk, create another third-level header in RMD titled: “Read Data into R”
1. Insert another R code chunk.
2. In the new R code chunk, read and assign the “2012 states data.sav” SPSS datafile into an R data object named StatesData2012.
  - Forget how to do this? Refer to instructions in Assignment 1.
3. In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data object: StatesData2012.
  - This will call the object and provide a brief view of the data. (Note: You can also simply click on the data object in the “Environment” window.)
  - Your RStudio session should now look a lot like this:

View Your 2012 States Data

StatesData2012 <- read_spss(here("Datasets", "2012StatesData.sav"))

StatesData2012

## # A tibble: 50 × 30
##    State  Numbe…¹ Numbe…² South   Region  Permo…³ cigtax smoke…⁴ tobac…⁵ Persm…⁶
##    <chr>    <dbl>   <dbl> <dbl+l> <dbl+l>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
##  1 Alaba…     335    1981 1 [Sou… 1 [Sou…    15.1  0.425       0    318.      22
##  2 Alaska      54     216 0 [Non… 2 [Wes…    21.5  2           0    270.      22
##  3 Arizo…     443     607 0 [Non… 2 [Wes…    19.7  2           1    247.      16
##  4 Arkan…     200    1269 1 [Sou… 1 [Sou…    17.6  1.15        1    324.      22
##  5 Calif…    2766    1882 0 [Non… 2 [Wes…    15.6  0.87        0    235       14
##  6 Color…     349     668 0 [Non… 2 [Wes…    18.2  0.84        1    238.      18
##  7 Conne…     228     418 0 [Non… 4 [Nor…    11    3           0    238.      16
##  8 Delaw…      63     156 1 [Sou… 1 [Sou…    14.1  1.6         1    281.      18
##  9 Flori…    1229    1712 1 [Sou… 1 [Sou…    15.9  1.34        1    259.      17
## 10 Georg…     680    2322 1 [Sou… 1 [Sou…    16.5  0.37        0    299.      19
## # … with 40 more rows, 20 more variables: totalpop <dbl>, DivorceRt <dbl>,
## #   perfampoverty <dbl>, perindpoverty <dbl>, MedianIncome <dbl>,
## #   Pernoinsurance <dbl>, MurderRt <dbl>, RobberyRt <dbl>, AssaultRt <dbl>,
## #   BurglaryRt <dbl>, MVTheftRT <dbl>, InfantMort <dbl>, HeartDeathRt <dbl>,
## #   CancerDeathRt <dbl>, PerBachelorD <dbl>, PercentRural <dbl>,
## #   Percent18to24 <dbl>, ID <dbl>, Assault_bin <dbl+lbl>, Murdercat <dbl+lbl>,
## #   and abbreviated variable names ¹Number1824, ²NumberRural, ³Permoved, …

As in the image, you should see 50 rows and 30 columns, which correspond to 50 individual observations and 30 variables (e.g., region; cigarette tax; and murder rate).
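If you want to confirm the number of rows and columns without reading the printed tibble, base R's dim(), nrow(), and ncol() report them directly. The sketch below is self-contained: instead of the course file, it writes a small hypothetical data frame to a temporary .sav file, reads it back with haven::read_spss(), and checks its dimensions. The toy values and the temporary file path are assumptions for illustration only.

```r
library(haven)

# Toy data standing in for an SPSS file (hypothetical values)
toy_states <- data.frame(
  State    = c("StateA", "StateB", "StateC"),
  MurderRt = c(2.5, 4.0, 6.5)
)

# Write the toy data to a temporary .sav file, then read it back in
sav_path <- tempfile(fileext = ".sav")
write_sav(toy_states, sav_path)
toy_read <- read_spss(sav_path)

# Rows are observations; columns are variables
dim(toy_read)   # rows, then columns
nrow(toy_read)
ncol(toy_read)
```

With the real data, the same three calls on StatesData2012 should agree with the dimensions shown at the top of the tibble printout.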

Now, let’s view the variables in the data. In the SPSS program referenced in the book, one would click on the “variable view” tab. Recall, one way to see the variables in your data is to simply click the data object in your R environment (“StatesData2012”). This will open another window in which you can see your variables and every row of observations (akin to “data view” in SPSS). Recall, for a “variable view” equivalent in R, you can use the sjPlot::view_df() function:
1. Insert an R chunk
2. Type StatesData2012 %>% view_df() and hit RUN
3. In your Viewer tab, the variable names, labels, and values should look like this:
  
  View Your 2012 States Data
Now, refer back to B&P’s SPSS Exercise at the end of Chapter 2 (pages 41-42) and answer the questions, which ask about the unit of analysis and the following variables:
1. State
2. Murdercat
3. BurglaryRt
4. MedianIncome
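When answering questions about individual variables, the attr() approach from Assignment 2 can complement view_df(). Here is a minimal sketch using a toy labelled vector built with haven::labelled(); the variable, its label, and its value labels are hypothetical stand-ins for what an SPSS file would carry.

```r
library(haven)

# Toy labelled variable (hypothetical labels, not from the course data)
urban <- labelled(
  c(1, 0, 0, 1),
  labels = c("Rural" = 0, "Urban" = 1),
  label  = "Urban area indicator (toy)"
)

# The variable label describes the whole variable
attr(urban, "label")

# The value labels describe what each code means
attr(urban, "labels")
```

Applied to the real data, the same pattern would be attr(StatesData2012$Murdercat, "labels"), assuming the variable was read in with its SPSS labels intact.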

Part 3 (Assignment 3.3)

Goal: Use R to create frequency tables for the Murdercat variable (Questions 11-13, Ch. 2, pp. 41-42).

In Chapter 2, you learned about levels of measurement and about how frequency tables are used in descriptive research. While there are many different ways to describe variables, frequency tables are one of the most basic and efficient ways to do so. Frequency tables describe the number of occurrences in our data for each variable attribute or for grouped variable attributes. There are many ways to generate frequency tables, and we will only cover a couple of them here.

Suppose we want to generate a frequency table for the “Murdercat” (or murder rate categorical) variable. Here is a simple way to get that using the frq() function from the “sjmisc” package:

Create a second-level header titled: “Part 2 (Assignment 3.2).” Then, create a third-level header titled: “Frequency Table for ‘Murdercat’ Variable in ‘StatesData2012’ using frq()”
Create a new R code chunk and type StatesData2012 %>% frq(Murdercat)
1. One way to generate a simple frequency table for an individual variable is with the frq() command from the “sjmisc” package. frq() displays a basic frequency table of the designated variable(s). The table should show the value labels (e.g., “0 to 3 murders per 100k”; “3.1 to 6 murders per 100k”; “6.1 to 9 murders per 100k”; etc.) and the N, or the total number of units (states) in the dataset. It also shows the percentage of states within each attribute value, including the cumulative percentage.
2. Your RStudio window should look like this:

Frequency Table Created with sjmisc

In the image above, you can see that the N is 50 for all 50 states. You can also see that 17 states had “0 to 3 murders per 100k” in 2012, while 21 states had “3.1 to 6 murders per 100k” in 2012. Only 1 state, or 2% of all states, had “9.1 to 12 murders per 100k” in 2012.
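To see where the percentage and cumulative percentage columns in a frequency table come from, it can help to compute them by hand once. The sketch below does not use frq(); it rebuilds the Murdercat codes from the counts reported in the frequency output in this part of the assignment (17, 21, 10, 1, and 1 states) and uses dplyr to count and convert them. The object names are hypothetical.

```r
library(dplyr)

# Rebuild the Murdercat codes from the counts reported in this assignment
murder_toy <- data.frame(
  Murdercat = rep(c(0, 1, 2, 3, 4), times = c(17, 21, 10, 1, 1))
)

freq_tab <- murder_toy %>%
  count(Murdercat, name = "n") %>%   # frequency of each category
  mutate(
    pct     = n / sum(n) * 100,      # percent of all 50 states
    cum_pct = cumsum(pct)            # running total of the percents
  )

freq_tab
```

The pct column should reproduce the percentages discussed above (e.g., 34% for category 0 and 42% for category 1), and cum_pct should end at 100.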

As noted, this is one way to create a frequency table, but there are lots of other packages that we could use instead. Each package and function has its various strengths. For examples, see here.

While the above frequency table is easy to generate and has the descriptive information we need, it is not easy to sort the table output. So, we will also introduce you to the freq() function from the “summarytools” package. This package is described in the link above and in more detail here.

For a basic frequency table using “summarytools” package, type StatesData2012 %>% freq(Murdercat).
1. Note that this function uses freq(), not frq(). Both functions (with or without the ‘e’) will work as long as their respective packages are loaded: “summarytools” for freq() and “sjmisc” for frq().
2. Also, note that these functions display tables with the same frequency values but somewhat different formatting. Try using data %>% freq(variable) and data %>% frq(variable) code (with and without an ‘e’) to see for yourself. One difference is that, unlike the sjmisc::frq() output, the summarytools::freq() output does not include a variable’s attribute value labels (e.g., “0 to 3 murders per 100k”).
As mentioned, a key benefit of the summarytools::freq() function is that it allows us to easily sort a frequency table, such as by highest to lowest frequencies or vice versa. For example, if we want to sort the table from highest to lowest frequencies, with the most frequent values on top, we can use freq(variable, order = "freq").
1. Create a new R chunk and type StatesData2012 %>% freq(Murdercat, order = "freq"). This will sort the frequencies of Murdercat from highest to lowest.
2. Your RStudio should look like this:

Frequency Table of Murdercat Descending

We can even reverse the order by adding “-” (a minus sign) before freq in the order argument: data %>% freq(variable, order = "-freq"). This means the frequencies will be sorted from lowest to highest. Sort the Murdercat variable from lowest to highest frequencies and complete the rest of the SPSS exercises on page 42.

#sorted by frequency, high to low
StatesData2012 %>%
  freq(Murdercat, order = "freq")

## Frequencies  
## StatesData2012$Murdercat  
## Label: Murder rate categorical  
## Type: Numeric  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1     21     42.00          42.00     42.00          42.00
##           0     17     34.00          76.00     34.00          76.00
##           2     10     20.00          96.00     20.00          96.00
##           3      1      2.00          98.00      2.00          98.00
##           4      1      2.00         100.00      2.00         100.00
##        <NA>      0                               0.00         100.00
##       Total     50    100.00         100.00    100.00         100.00

#sorted by frequency, low to high
StatesData2012 %>%
  freq(Murdercat, order = "-freq")

## Frequencies  
## StatesData2012$Murdercat  
## Label: Murder rate categorical  
## Type: Numeric  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           3      1      2.00           2.00      2.00           2.00
##           4      1      2.00           4.00      2.00           4.00
##           2     10     20.00          24.00     20.00          24.00
##           0     17     34.00          58.00     34.00          58.00
##           1     21     42.00         100.00     42.00         100.00
##        <NA>      0                               0.00         100.00
##       Total     50    100.00         100.00    100.00         100.00
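Notice in the two outputs above that the cumulative percentages change when the sort order changes, because they are running totals computed down the rows as displayed. A minimal sketch of the same idea with dplyr's arrange(), using the counts from the output above (this is not how summarytools works internally, just an illustration; the object names are hypothetical):

```r
library(dplyr)

# Counts taken from the freq() output above
counts <- data.frame(
  Murdercat = c(0, 1, 2, 3, 4),
  n         = c(17, 21, 10, 1, 1)
)

# High to low: most frequent category first
high_to_low <- counts %>%
  arrange(desc(n)) %>%
  mutate(pct = n / sum(n) * 100, cum_pct = cumsum(pct))

# Low to high: least frequent category first
low_to_high <- counts %>%
  arrange(n) %>%
  mutate(pct = n / sum(n) * 100, cum_pct = cumsum(pct))

high_to_low
low_to_high
```

The row order and cumulative percentages in these two tables should match the order = "freq" and order = "-freq" outputs, respectively.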

We can also “clean up” this frequency table to make it easier to read once we knit our final document. This will require modifying: (1) our R code and (2) our R chunk options.

First, try adding plain.ascii = FALSE, style = 'rmarkdown' after order = "freq" in your freq() code. This should generate a table with RMD text formatting (e.g., with asterisks and vertical lines throughout) rather than the default plain text output:

#clean it up for knitted RMD 
StatesData2012 %>%
 freq(Murdercat, order = "freq", plain.ascii = FALSE, style = 'rmarkdown')

## ### Frequencies  
## #### StatesData2012$Murdercat  
## **Label:** Murder rate categorical  
## **Type:** Numeric  
## 
## |     &nbsp; | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
## |-----------:|-----:|--------:|-------------:|--------:|-------------:|
## |      **1** |   21 |   42.00 |        42.00 |   42.00 |        42.00 |
## |      **0** |   17 |   34.00 |        76.00 |   34.00 |        76.00 |
## |      **2** |   10 |   20.00 |        96.00 |   20.00 |        96.00 |
## |      **3** |    1 |    2.00 |        98.00 |    2.00 |        98.00 |
## |      **4** |    1 |    2.00 |       100.00 |    2.00 |       100.00 |
## | **\<NA\>** |    0 |         |              |    0.00 |       100.00 |
## |  **Total** |   50 |  100.00 |       100.00 |  100.00 |       100.00 |

Second, in the R chunk options (i.e., the very top line in your R chunk), type the following: {r, results = 'asis'}. This passes the results through as RMD-formatted text in the knitting process.
For more information, refer to the summarytools forum above or try knitting your document without these additions and viewing the changes before and after these additions.
Your final “clean” table should look like this when knitted:

Clean Up Final Frequency Table

Finish answering the exercises on pages 41 and 42.

BONUS: Before moving on to the SPSS exercises in Chapter 3, note that the dfSummary() command from the summarytools package is particularly useful for generating a quick summary of all or a subset of variables in a dataset.
1. Here, we show the output of this command for a subset of three variables from these data: “South”, “MurderRt”, and “Murdercat” variables.
2. NOTE: This part is NOT required to do for Assignment 3. We are simply showing you how this would be done if you wanted to display a summary of all or some variables from a dataframe. However, feel free to try it yourself! The code we used to create this summary is shown below.

library(summarytools)

StatesData2012trim <- StatesData2012 %>%
  dplyr::select(South, MurderRt, Murdercat)

print(dfSummary(StatesData2012trim, style = 'grid', graph.magnif = 0.85), method = "render", omit.headings = TRUE)

Data Frame Summary

StatesData2012trim

Dimensions: 50 x 3
Duplicates: 9

| Variable | Label | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|----------|-------|----------------|--------------------|-------|---------|
| South [haven_labelled, vctrs_vctr, double] | State in South | Min : 0; Mean : 0.3; Max : 1 | 0 : 34 (68.0%); 1 : 16 (32.0%) | 50 (100.0%) | 0 (0.0%) |
| MurderRt [numeric] | Murder Rate per 100K | Mean (sd) : 4.5 (2.4); min ≤ med ≤ max: 0.9 ≤ 4.7 ≤ 12.3; IQR (CV) : 3.3 (0.5) | 37 distinct values | 50 (100.0%) | 0 (0.0%) |
| Murdercat [haven_labelled, vctrs_vctr, double] | Murder rate categorical | Mean (sd) : 1 (0.9); min ≤ med ≤ max: 0 ≤ 1 ≤ 4; IQR (CV) : 1 (0.9) | 0 : 17 (34.0%); 1 : 21 (42.0%); 2 : 10 (20.0%); 3 : 1 (2.0%); 4 : 1 (2.0%) | 50 (100.0%) | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.2.0)
2022-10-18
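The dplyr::select() step used above to build StatesData2012trim simply keeps the named columns and drops the rest. A minimal sketch on a toy data frame (hypothetical values and column names) shows the effect:

```r
library(dplyr)

# Toy data frame with four columns (hypothetical values)
toy <- data.frame(
  ID       = 1:3,
  South    = c(1, 0, 1),
  MurderRt = c(2.5, 4.0, 6.5),
  Other    = c("x", "y", "z")
)

# Keep only the two named columns; assign to a new object
# so the original data frame is left untouched
toy_trim <- toy %>%
  dplyr::select(South, MurderRt)

names(toy)       # all four original columns are still here
names(toy_trim)  # only the selected columns remain
```

Writing dplyr::select() with the package prefix, as in the code above, avoids conflicts if another loaded package also defines a function named select().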

Part 4 (Assignment 3.4)

Goal: Read in and Identify Characteristics of Lone Offender Assault NCVS Data

Now, we will begin working with the 1992 to 2013 NCVS Lone Assault data, which details individual experiences with criminal victimization. You’ll begin by reading this dataset in and displaying the variable view using sjPlot::view_df(). Then, you will need to answer the questions regarding levels of measurement and graphs on pages 71-72 of B&P (SPSS Exercises). To answer these questions, you will need to view the “injured”, “maleoff”, “age_r”, and “V2129” variables.

Create a second-level header titled: “Part 3 (Assignment 3.3).” Then, create a third-level header titled: “Reading in and viewing Lone Offender Assaults NCVS Data”
Now, you need to get this dataset loaded into RStudio. You already know how to do this, but please refer to the instructions above if you have questions.
1. Insert an R chunk. Read in the 1992-2013 NCVS data and place it into a data object called NCVS1992to2013.
2. View the data in RStudio by calling the object, NCVS1992to2013. View the variable summary in the “Viewer” tab using data %>% view_df(). Your “Viewer” tab, or RStudio, should look like this:

NCVS1992to2013 <- read_spss(here("Datasets", "NCVSLoneOffenderAssaults1992to2013.sav"))

View NCVS Data and Variables in Viewer Tab

Now that you can view the dataset, determine which type of graph (histogram, pie chart, bar graph) is most appropriate for each variable given its characteristics (e.g., level of measurement).
1. Create a third-level header titled: “Describing ‘injured’, ‘maleoff’, ‘age_r’, and ‘V2129’ variables”
2. When deciding which graphs are most appropriate, you should first create a frequency table (as we did above) for each variable to determine whether a variable is numeric or alphanumeric, binary, rank-ordered, etc.
  - For now, create separate R code chunks for each frequency table, and include headers (e.g., fourth level header: “Frequency table for ‘injured’ variable”) above each table so we can easily tell what the table is.
3. After each frequency table, you should then determine and describe in R Markdown text the variable’s level of measurement and what type of graph is appropriate. Do this after each of the “injured”, “maleoff”, “age_r”, and “V2129” variables. Be sure to refer to your book if you are unsure.
Once you have determined which graph is most appropriate for each variable, you are ready to create these graphs in R and then finish answering questions 1 and 2 on pages 71-72!
1. Create a third-level header titled: “Graphs of ‘injured’, ‘maleoff’, ‘age_r’, and ‘V2129’ variables”
2. Like with the frequency tables above, for this assignment, create separate code chunks for each graph.
  - That is, first create an R chunk and display the first graph for the “injured” variable. Then, create a second R chunk for the graph of the “maleoff” variable, followed by a third R chunk for the graph of the “age_r” variable and so on. Use headers (e.g., fourth-level) to separate and organize each graph.
3. For each graph, in the R Markdown text, describe: (1) what you are graphing; (2) what type of graph you chose; and (3) why you chose that graph. These notes will demonstrate your understanding of the course material and help you remember how to correctly graph variables in the future.
4. Note: It is up to you to determine which graph is best for each variable.
  - If you decide a bar graph is most appropriate for the “injured” variable, type NCVS1992to2013 %>% ggplot(aes(injured)) + geom_bar().
  - If you decide a histogram is most appropriate, type NCVS1992to2013 %>% ggplot(aes(injured)) + geom_histogram().
  - A bar graph of the “injured” variable would look like this:

NCVS1992to2013 %>%
  ggplot(aes(injured)) +
  geom_bar()

Note: ggplot() is a function in the ggplot2 package (which, like haven and dplyr, is part of the tidyverse) that allows us to create graphs and plots. We will cover some basic options for editing elements of a ggplot object in later assignments. For now, here are a few things to note:
- The aes() function controls the aesthetics of the graph or plot, such as its orientation. In essence, this is the part of the code that sets up the XY background for your plot. For example, plots will orient to the x-axis by default if you type ggplot(aes(variable)) as we did above. Alternatively, if you type ggplot(aes(y=variable)), the plot aesthetic will change by flipping its orientation to the y-axis.
- After setting up the XY aesthetic the way we want, we then plot the data. Often, this will involve adding one or more “geometric objects” such as a bar chart or histogram with a “geom” function like geom_bar() or geom_histogram() to the object. To do this, we literally “add” the geometric object layer to the XY coordinate plot by including a + sign before it.
  - The basic format for a simple univariate plot is: data %>% ggplot(aes(variable)) + geom_type
  - Note: If you break your code into multiple lines, be sure that each + sign sits at the end of a line (e.g., on the same line as the ggplot() function) rather than at the start of the next line. Otherwise, R will assume you’re done with the ggplot() function, and it will not understand that you want to add a geometric object to it.
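The notes above can be sketched with a small toy data frame. The values below are hypothetical and are not NCVS data; the binwidth of 10 is also just an illustrative choice. Each plot is stored in an object, and typing the object's name displays it.

```r
library(dplyr)
library(ggplot2)

# Toy data (hypothetical values, not from the NCVS file)
toy <- data.frame(
  injured = c(0, 1, 0, 0, 1, 0),
  age     = c(15, 22, 37, 41, 19, 64)
)

# Bar chart oriented to the x-axis (the default)
bar_x <- toy %>% ggplot(aes(injured)) + geom_bar()

# Same bar chart flipped to the y-axis
bar_y <- toy %>% ggplot(aes(y = injured)) + geom_bar()

# Histogram of a numeric variable; note that each + ends a line
hist_age <- toy %>%
  ggplot(aes(age)) +
  geom_histogram(binwidth = 10)

# Type an object's name (e.g., bar_x) to display the plot
```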

Part 5 (Assignment 3.5)

Goal: Recode and Create Frequency Table for “Vic18andoverbin” Variable

In the remainder of the exercise, we are interested in the “age_r” variable and determining the proportion of victims who experienced assaults before they were 18. You can do this by recoding the variable and then creating a frequency table, which will display proportions or percentages along with frequencies for your recoded variable.

Create a second-level header titled: “Part 4 (Assignment 3.4)”. Then, create a third-level header titled: “Recoding and Describing ‘Vic18andoverbin’ Variable”.
Insert an R chunk and create a frequency table for the “age_r” variable. If you forget how to do this, refer to Parts 1 and 2 of the assignment.
1. You can use the frequency table to calculate an answer for question 3 on pages 71-72, but you cannot quickly and easily get the information you want (i.e., proportion of victims under 18) or generate a graph to display this data as desired. For this reason, we will recode the “age_r” variable into a new binary variable to make it easier to determine and visualize the proportion of victims under 18.
Now, when recoding and renaming variables, you always need a plan. In this case, we want a new variable with two categories: “younger than 18” and “18 or older.” We will create a “dummy” or “indicator” variable that equals “0” if the person is younger than 18 and equals “1” if the person is 18 or older. We will rename and recode the variable from “age_r” into a new variable for two reasons:
1. First, we do not want to overwrite the original “age_r” variable. That would erase the original observations and replace them with our new binary variable (Vic18andoverbin), requiring us to reload in the dataset if we need the old “age_r” variable (or if we make a mistake).
2. Second, renaming the variable makes it easier to quickly recognize and recall the contents and coding of the new variable column when working with the data.
3. For these reasons, we will name our new variable “Vic18andoverbin” because it lets us find the proportion of victims who were 18 or older when assaulted (i.e., victims 18 and up - binary variable).
  - Note: This proportion will be different than the proportion of victims under 18 when assaulted. Be sure to look at the correct row or column. Details like this are important when analyzing and describing data.
To recode a variable, we will use the mutate() function in the dplyr package. We will also use the if_else() function, which represents a ‘yes or no’ test within R.

Note: There are two common condition tests in R: base R’s ifelse() and dplyr’s if_else(). Both tell R that, if a condition is met (or true), the value should be ‘x’ while, if the condition is not met (or false), the value should be ‘y’. For example, if a victim is under 18, R would code that value to “0”, or ‘x’. If the victim is 18 or older, R would code that value to “1”, or ‘y’. Both functions return NA when the condition itself is missing (e.g., when a victim’s age is NA), so a simple test like age_r < 18 will not silently recode missing ages as “1”. In these data, we have age values for all victims, so we do not need to worry about missing data on that variable. Still, it is good practice to use if_else() because it is stricter and more explicit: it requires the ‘true’ and ‘false’ values to be of compatible types, and it has a missing argument that lets you state exactly how missing values should be coded. By default, missing = NULL, which leaves missing values as NA. To learn more about these differences, visit https://dplyr.tidyverse.org/reference/if_else.html.
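To see concretely how the two functions behave when a value is missing, here is a minimal sketch on a toy age vector (hypothetical values). The value 99 used for the missing argument is an arbitrary illustration, not a coding used in this assignment.

```r
library(dplyr)

# Toy ages, including one missing value (hypothetical)
age <- c(15, 40, NA)

# Base R: a missing condition stays NA
base_result <- ifelse(age < 18, 0, 1)

# dplyr: a missing condition also stays NA by default (missing = NULL)
dplyr_result <- if_else(age < 18, 0, 1)

# dplyr lets you state explicitly what missing values should become
explicit_na <- if_else(age < 18, 0, 1, missing = 99)

# dplyr refuses to mix incompatible types (text and numbers) and
# throws an error, which we catch here so the script keeps running
mixed <- tryCatch(
  if_else(age < 18, "minor", 1),
  error = function(e) "error"
)

base_result
dplyr_result
explicit_na
mixed
```

By contrast, ifelse(age < 18, "minor", 1) would run without complaint and quietly convert the number to text, which is the kind of silent change that if_else() is designed to catch.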

Insert an R chunk and type NCVS1992to2013_trim <- NCVS1992to2013 %>% mutate(Vic18andoverbin = if_else(age_r < 18, 0, 1)).
1. The mutate() function recodes the “age_r” variable according to our if_else() (i.e., true, false) logic statement. If “age_r” (the original variable) is less than 18, then R assigns a value of “0” to our new variable. If “age_r” is not less than 18 (i.e., if the value of the variable, or the victim’s age, is 18 or older and is not missing), then R assigns a value of “1” to the new variable. The new binary indicator variable is named “Vic18andoverbin”.
2. Now, view the data by clicking the dataset in the environment. You can also use view_df() to see that the range of the variable, Vic18andoverbin, is 0-1.
3. Your new data column and variable view should look similar to this:

View New Vic18andoverbin Variable

NCVS1992to2013_trim <- NCVS1992to2013 %>%
  mutate(
    Vic18andoverbin = if_else(age_r < 18, 0, 1) # if_else() codes victims under 18 as 0 and victims 18 or older as 1
  )
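A quick way to convince yourself that a recode did what you planned is to compare the old and new variables side by side. The sketch below applies the same mutate() and if_else() logic to a toy data frame (hypothetical ages chosen to sit on both sides of the cutoff); the object names are assumptions for illustration.

```r
library(dplyr)

# Toy ages around the cutoff (hypothetical values)
toy_ncvs <- data.frame(age_r = c(12, 17, 18, 25, 40))

toy_recoded <- toy_ncvs %>%
  mutate(Vic18andoverbin = if_else(age_r < 18, 0, 1))

# Cross-check: each age should pair with the code we intended
toy_recoded %>%
  count(age_r, Vic18andoverbin)

# Range of ages within each code: the 0 group should top out
# below 18 and the 1 group should start at 18
check <- toy_recoded %>%
  group_by(Vic18andoverbin) %>%
  summarise(min_age = min(age_r), max_age = max(age_r))

check
```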

Congratulations! You just recoded your first variable into a meaningful and informative binary variable.
1. Now, instead of generating frequency tables and graphs with the full range of possible age values, we will be able to easily summarize and visualize age trends using a variable that collapsed all 23,969 observations into one of the two categories that were identified as relevant by our research question.
2. Create a frequency table with your new variable. Make sure that the proportion of individuals who were victimized when they were younger than age 18 is the same as your answer for question 3 in B&P. If it is not, double check your code because you’ve likely made a mistake somewhere!
3. Your new frequency table should look something like this:

Frequencies

NCVS1992to2013_trim$Vic18andoverbin

Type: Numeric

	Freq	% Valid	% Valid Cum.	% Total	% Total Cum.
0	4411	18.40	18.40	18.40	18.40
1	19558	81.60	100.00	81.60	100.00
<NA>	0			0.00	100.00
Total	23969	100.00	100.00	100.00	100.00

Frequency Table of Vic18andover Binary Variable
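The percentages in this table are just each frequency divided by the total. As a worked check using the counts reported above (the object names are hypothetical):

```r
# Counts taken from the frequency table above
under18   <- 4411
age18plus <- 19558
total     <- under18 + age18plus

# Proportion and percent of victims younger than 18
prop_under18 <- under18 / total
pct_under18  <- round(prop_under18 * 100, 2)

# Because the variable is coded 0/1, its mean equals the
# proportion of cases coded 1 (victims 18 or older)
bin <- rep(c(0, 1), times = c(under18, age18plus))
pct_18plus <- round(mean(bin) * 100, 2)

total
pct_under18
pct_18plus
```

The two percentages should match the “% Valid” column for rows 0 and 1, and together they should sum to 100.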

For the last part of the assignment, help us visualize these patterns by creating a bar chart of the “Vic18andoverbin” variable.
1. To create a bar chart, insert an R code chunk and use the code format you learned above: data %>% ggplot(aes(variable)) + geom_bar(). Remember to call the correct data object and variable (i.e., you assigned your new variable into a new data object)!
2. Your RStudio should look like this:

You should now have everything that you need to complete the questions in Assignment 3 that parallel those from B&P’s SPSS Exercises for Chapter 3!

Complete the remainder of the questions in Assignment 3 in your RMD file.
1. Keep the file clean and easy to follow by using RMD level headings (e.g., denoted with ##, ###, or ####) separating R code chunks, organized by assignment questions.
2. Write plain text after headings and before or after code chunks to explain what you are doing - such text will demonstrate your understanding of the course materials and will serve as useful reminders to you when working on later assignments!
3. Upon completing the assignment, “knit” your final RMD file again and save the final knitted HTML document to your “Assignments” folder as: YEAR_MO_DY_LastName_K300Assign3. Submit via Canvas in the relevant section (i.e., the last question) for Assignment 3.

Assignment 3 Objective Checks

After completing assignment #3, can you…

create simple frequency tables using sjmisc::frq() and summarytools::freq()?
identify strengths and limitations of frq() and freq() for creating frequency tables?
sort a frequency table by frequency, from highest to lowest and from lowest to highest frequencies?
recognize summarytools::dfSummary() as another way to quickly describe one or more variables in a data file?
use the ggplot() function from the “ggplot2” package to generate basic bar charts and histograms?
select specific variables using dplyr::select()?
recode variables using mutate() and if_else() functions from the “dplyr” package?
explain how the if_else() function works and why we use it instead of ifelse()?

Jon Brauer & Tyeisha Fordham

10/01/2022
