Assignment 4: Data Visualization Techniques

Assignment 4 Objectives

The purpose of this fourth assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 3 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

This chapter focused on data distributions and displaying data with tabular or graphical representations. As with the previous assignments, you will be using R Markdown (with R & RStudio) to complete and present your work. In this assignment, you will learn how to recode variables, generate frequency tables, and create simple graphs in R.

By the end of assignment #4, you should be able to…

use the ggplot() function from the ggplot2 package to generate basic bar charts and histograms
recode variables using mutate() and if_else() functions from the dplyr package
understand how the if_else() function works
save a plot using ggsave()

Assumptions & Ground Rules

We are building on objectives from Assignments 1-3. By the start of this assignment, you should already know how to:

create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
recognize when a function is being called from a specific package using a double colon with the package::function() format
read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
use a tidyverse %>% pipe operator to perform a sequence of actions
knit your RMD document into an HTML file that you can then save and submit for course credit
use here() for a simple and reproducible self-referential file directory method
use sjPlot::view_df() to quickly browse variables in a data file
use attr() to identify variable and attribute value labels

If you do not recall how to do these things, first review Assignments 1, 2, & 3.

Additionally, you should have read the assigned book chapters and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

skewness
rates, percents, proportions, intervals, and interval widths
appropriate graphs for different types of variables (e.g., at different levels of measurement)
- difference between histograms and bar charts and when each is appropriate
- difference between histograms and line graphs and when each is appropriate

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).

Early on, you may have a lot of trouble getting your code to run due to minor typos. This is normal.
Remember, you are learning to read and write a new (coding) language. As with learning any new language, we learn from practice and from correcting our mistakes.

Part 1 (Assignment 4.1)

Goal: Create a new RMD file for Assignment 4

(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR-MO-DY for the actual date. E.g., 2022-09-01_Ducate_CRIM5305_Assign04)

In the second assignment, you learned how to read in and assign a dataset to an R object. You also learned how to use the view_df() function from the sjPlot package and the base R attr() function to display your dataframe and identify variable attributes. In the third assignment, you learned to use the sjmisc and summarytools packages to display your descriptive data in frequency tables. You also learned about the dfSummary() function from the summarytools package, which is an alternative to sjPlot::view_df() for creating a useful summary of all or a subset of the variables in a dataset.

In this fourth assignment, you will be reminded how to display your descriptive data in frequency tables. Additionally, you will learn how to select and recode variables using the select(), mutate(), and if_else() functions from the “dplyr” package, and how to display your data in basic bar charts or histograms using the ggplot() function from the “ggplot2” package.

Go to your CRIM5305_L folder, which should contain the R Markdown file you created for Assignment 2 (named something like YEAR-MO-DY_LastName_CRIM5305_Assign02). Click to open the R Markdown file.
- Remember, we open RStudio in this way so the here package will automatically set our CRIM5305_L folder as the top-level directory.
In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.
The dialogue box asks for a Title, an Author, and a Default Output Format for your new R Markdown file.
1. In the Title box, enter CRIM5305 Assignment 4.
2. In the Author box, enter your First and Last Name (e.g., Caitlin Ducate).
3. In the Default Output Format box, be sure “Word” is selected.
Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete all the text after the YAML header before you begin working.
Create a second-level heading titled: “Part 1 (Assignment 4.1)”
1. Remember, a second-level heading starts with two hashtags followed by a space and the heading title, like this: ## Heading Title
2. A third-level heading starts with three hashtags: ### Heading Title
3. A fourth-level heading starts with four hashtags: #### Heading Title
This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text:
- This R Markdown document contains my work for Assignment 4. It is my work and only my work.

Part 2 (Assignment 4.2)

Goal: Read in and Identify Characteristics of Lone Offender Assault NCVS Data

We will be working with the 1992 to 2013 NCVS Lone Assault data, which details individual experiences with criminal victimization. You’ll begin by reading this dataset in and displaying the variable view using sjPlot::view_df().

Then, you will need to answer the questions regarding levels of measurement and graphs on Assignment 4. To answer these questions, you will need to view the “injured”, “maleoff”, “age_r”, and “V2129” variables. That is what we will do next.

Create a second-level header titled: “Part 2 (Assignment 4.2)”
1. Remember, a second-level heading starts with two hashtags followed by a space and the heading title, like this: ## Heading Title
2. A third-level heading starts with three hashtags: ### Heading Title
3. A fourth-level heading starts with four hashtags: #### Heading Title
Now, you need to get data into RStudio. You already know how to do this, but please refer to Assignment 1 if you have questions.
First, we need to load in our libraries.
1. Create a third-level header in R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
2. Insert an R code chunk
3. Inside the new R code chunk, load the following six packages: tidyverse, haven, here, sjmisc, sjPlot, and summarytools.
4. You should have all of these packages installed, but if you don’t, please install them using the install.packages() command. Remember, you only need to install a package once, but you must load a package each time you start a new R session and need to use the package.
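
The library-loading chunk described above might look like the following sketch. It assumes all six packages are already installed; the comments naming which functions come from which package are added here for orientation only.

```r
# Load the six packages used in this assignment.
# Assumes each one was installed earlier with install.packages().
library(tidyverse)    # attaches dplyr, ggplot2, and other core packages
library(haven)        # read_spss()
library(here)         # here()
library(sjmisc)       # data utilities used in Assignment 3
library(sjPlot)       # view_df()
library(summarytools) # freq(), dfSummary()
```

Loading everything in one chunk near the top of the document makes it easy to see at a glance what the rest of the file depends on.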
After your first code chunk, create another third-level header in RMD titled: “Read Data into R”
1. Insert another R code chunk.
2. In the new R code chunk, read and assign the “NCVS lone offender assaults 1992 to 2013.sav” SPSS datafile into an R data object named NCVS1992to2013.
  - Remember, we can do this with the following code: NCVS1992to2013 <- read_spss(here("Datasets", "NCVS lone offender assaults 1992 to 2013.sav"))
  - Getting an error? Make sure you have no typos, that the name of your file EXACTLY matches the name of the file in your Datasets folder, and that you have your Datasets folder in your Assignments folder.
3. In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data object: NCVS1992to2013.
  - This will call the object and provide a brief view of the data. (Note: You can also simply click on the data object in the “Environment” window.)
  - Once you have confirmed that you read in the data correctly, you may comment out this line.
Now create a third-level header titled: Describing “injured”, “maleoff”, “age_r”, and “V2129” variables
1. First, view the data in RStudio by calling the object, NCVS1992to2013. Then view the variable summary in the “Viewer” tab using data %>% view_df() (here, NCVS1992to2013 %>% view_df()). Your “Viewer” tab in RStudio should look like this:

View NCVS Data and Variables in Viewer Tab

2. You should now create a frequency table for each variable to determine the level of measurement: whether a variable is numeric or alphanumeric, binary, rank-ordered, etc.
  - Create separate R code chunks for each frequency table, and include headers (e.g., fourth-level header: “Frequency table for ‘injured’ variable”) above each table so we can easily tell what the table is.
  - Don’t remember how to make frequency tables? Try NCVS1992to2013 %>% freq(VARIABLE), where VARIABLE is replaced by the name of the variable (e.g., NCVS1992to2013 %>% freq(injured)).

Now go back to Assignment 4 and indicate the level of measurement for the “injured”, “maleoff”, “age_r”, and “V2129” variables in Question 3. You should also be able to determine the percentage of victims below age 18 (hint: look at the cumulative percentage in the table produced by NCVS1992to2013 %>% freq(age_r)).
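
If the cumulative percentage column is unfamiliar, the sketch below builds one by hand so you can see where the number comes from. It uses a small made-up set of ages (the toy object and its values are hypothetical, not the NCVS data) and dplyr functions rather than freq().

```r
library(dplyr)

# Hypothetical toy ages, NOT the NCVS data
toy <- tibble(age = c(12, 15, 17, 17, 18, 22, 30, 30, 45, 60))

toy_freq <- toy %>%
  count(age) %>%                  # frequency of each age, sorted by age
  mutate(
    pct     = 100 * n / sum(n),   # percent of all cases at this age
    cum_pct = cumsum(pct)         # running (cumulative) percent
  )

toy_freq

# The cumulative percent on the last row below 18 is the
# percent of cases younger than 18
pct_under_18 <- toy_freq %>%
  filter(age < 18) %>%
  slice_tail(n = 1) %>%
  pull(cum_pct)

pct_under_18
```

In a freq() table you read the same value directly from the cumulative percent column on the row for age 17.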

Part 3 (Assignment 4.3)

Now that you know the level of measurement for each variable, you need to determine which type of graph (histogram or bar graph) is most appropriate for each variable given its characteristics (e.g., level of measurement) and graph it.
1. Create a second-level header titled: “Part 3 (Assignment 4.3).” Then create a third-level header titled: Graphing “injured”, “maleoff”, “age_r”, and “V2129” variables
2. Like with the frequency tables above, for this assignment, create separate code chunks for each graph.
  - That is, first create an R chunk and display the first graph for the “injured” variable. Then, create a second R chunk for the graph of the “maleoff” variable, followed by a third R chunk for the graph of the “age_r” variable and so on. Use headers (e.g., fourth-level) to separate and organize each graph.
3. Note: It is up to you to determine which graph is best for each variable.
  - If you decide a bar graph is most appropriate for the “injured” variable, type NCVS1992to2013 %>% ggplot(aes(injured)) + geom_bar().
  - If you decide a histogram is most appropriate, type NCVS1992to2013 %>% ggplot(aes(injured)) + geom_histogram().
  - A histogram of the “age_r” variable would look like this:

NCVS1992to2013 %>%
  ggplot(aes(age_r)) +
  geom_histogram()
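
By default, geom_histogram() groups values into 30 bins and prints a message suggesting that you pick a better value. The sketch below shows one way to set the interval width yourself with the binwidth argument. It uses made-up ages rather than the NCVS data, and the width of 5 years is only an illustrative assumption.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy ages, NOT the NCVS data
toy <- tibble(age = c(12, 15, 17, 17, 18, 22, 30, 30, 45, 60))

p_hist <- toy %>%
  ggplot(aes(age)) +
  geom_histogram(binwidth = 5)   # each bar spans an interval 5 years wide

p_hist
```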

Once you have graphed each of your variables, you should be able to answer Question 5, which asks you to identify the appropriate graph for each variable.

Note: ggplot() is a function in the ggplot2 package (which, like haven and dplyr, is part of the tidyverse) that allows us to create graphs and plots. We will cover some basic options for editing elements of a ggplot object in later assignments. For now, here are a few things to note:
- The aes() function sets the plot’s aesthetic mappings, that is, which variable is drawn on which axis. In essence, this is the part of the code that sets up the XY background for your plot. For example, plots will orient to the x-axis by default if you type ggplot(aes(variable)) as we did above. Alternatively, if you type ggplot(aes(y = variable)), the plot will flip its orientation to the y-axis.
- After setting up the XY aesthetic the way we want, we then plot the data. Often, this will involve adding one or more “geometric objects” such as a bar chart or histogram with a “geom” function like geom_bar() or geom_histogram() to the object. To do this, we literally “add” the geometric object layer to the XY coordinate plot by including a + sign before it.
  - The basic format for a simple univariate plot is: data %>% ggplot(aes(variable)) + geom_type()
  - Note: If you break your code into multiple lines, be sure that each + sign sits at the end of a line (e.g., on the same line as the ggplot() function), not at the start of the next line. Otherwise, R will assume you’re done with the ggplot() function, and it will not understand that you want to add a geometric object to it.
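
The notes above can be sketched with a small made-up dataset (the toy object is hypothetical, not the NCVS data). The first two plots show the x-axis and y-axis orientations; the commented-out lines show the line-break mistake to avoid.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, NOT the NCVS data
toy <- tibble(injured = c(0, 0, 1, 0, 1, 1, 0))

# Correct: every line that continues the plot ends with +
p_x <- toy %>%
  ggplot(aes(injured)) +
  geom_bar()

# Mapping the variable to y instead makes the bars run horizontally
p_y <- toy %>%
  ggplot(aes(y = injured)) +
  geom_bar()

p_x
p_y

# Incorrect (left commented out): the + starts a new line, so R treats
# the first two lines as a finished plot and then fails on "+ geom_bar()"
# toy %>%
#   ggplot(aes(injured))
#   + geom_bar()
```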

Part 4 (Assignment 4.4)

Goal: Recode and Create Frequency Table for “Vic18andoverbin” Variable

In the remainder of the exercise, we are interested in the “age_r” variable and determining the proportion of victims who experienced assaults before they were 18. You can do this by recoding the variable and then creating a frequency table, which will display proportions or percentages along with frequencies for your recoded variable.

Create a second-level header titled: “Part 4 (Assignment 4.4)”. Then, create a third-level header titled: “Recoding and Describing ‘Vic18andoverbin’ Variable”.
Now, when recoding and renaming variables, you always need a plan. In this case, we want a new variable with two categories: “younger than 18” and “18 or older.” We will create a binary “dummy” or “indicator” variable that equals “0” if the person is younger than 18 and equals “1” if the person is 18 or older. We will rename and recode the variable from “age_r” into a new variable for two reasons:
1. First, we do not want to overwrite the original “age_r” variable. That would erase the original observations and replace them with our new binary variable (Vic18andoverbin), requiring us to reload in the dataset if we need the old “age_r” variable (or if we make a mistake).
2. Second, renaming the variable makes it easier to quickly recognize and recall the contents and coding of the new variable column when working with the data.
3. For these reasons, we will name our new variable “Vic18andoverbin” to indicate whether each victim was 18 or older when assaulted (i.e., victims 18 and up - binary variable).
To recode a variable, we will use the mutate() function in the dplyr package. We will also use the if_else() function, which represents a ‘yes or no’ test within R.
- Note: There are two similar conditional functions in R: ifelse() from base R and if_else() from dplyr. Be sure to use if_else(), the one WITH the underscore ( _ ).

Insert an R chunk and type NCVS1992to2013 <- NCVS1992to2013 %>% mutate(Vic18andoverbin = if_else(age_r < 18, 0, 1)).
1. The mutate() function recodes the “age_r” variable according to our if_else() (i.e., true, false) logic statement. If “age_r” (the original variable) is less than 18, then R assigns a value of “0” to our new variable. If “age_r” is not less than 18 (i.e., if the value of the variable, or the victim’s age, is 18 or older and is not missing), then R assigns a value of “1” to the new variable. The new binary indicator variable is named Vic18andoverbin.
2. Now, view the data by clicking the dataset in the environment. You can also use NCVS1992to2013 %>% view_df() to see that the range of the variable, Vic18andoverbin, is 0-1.

View New Vic18andoverbin Variable

NCVS1992to2013 <- NCVS1992to2013 %>%
  mutate(
    # if_else() codes values as 0 if victims were under 18 and 1 if
    # they were 18 or older; missing ages stay missing (NA)
    Vic18andoverbin = if_else(age_r < 18, 0, 1)
  )
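
To see how if_else() treats each kind of value, it can help to run it on a short made-up vector first. In this sketch, toy_age is a hypothetical example, not a variable in the NCVS data.

```r
library(dplyr)

# Hypothetical ages: two under 18, one exactly 18, one adult, one missing
toy_age <- c(12, 17, 18, 45, NA)

# Same test as in the recode: TRUE -> 0, FALSE -> 1, missing -> NA
toy_bin <- if_else(toy_age < 18, 0, 1)

toy_bin
# 0 0 1 1 NA
```

Note that the victim who is exactly 18 is coded 1, and the missing age is left as NA rather than being forced into either category.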

Congratulations! You just recoded your first variable into a meaningful and informative binary variable.
For the last part of the assignment, let’s visualize this variable by creating a bar chart.
1. To create a bar chart, insert an R code chunk and use the code format you learned above: data %>% ggplot(aes(variable)) + geom_bar(). Remember to call the correct data object and variable (i.e., you saved your new variable back into the NCVS1992to2013 data object)! In this case, your code should be NCVS1992to2013 %>% ggplot(aes(Vic18andoverbin)) + geom_bar()
2. Your RStudio should look like this:

Finally, we are going to save this plot as a PNG image file so that we can submit it for Question 7. To save a plot, we repeat the above code with two changes: we will save the plot to an object, and we will call the function ggsave(). Use the code below and name your file LASTNAME_Vic18andoverbin.png, replacing LASTNAME with your own last name.

Vic18andoverplot <- NCVS1992to2013 %>%
  ggplot(aes(Vic18andoverbin)) + 
  geom_bar()

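# Pass the saved plot object explicitly; replace Ducate with your own last name
ggsave(here("Ducate_Vic18andoverbin.png"), plot = Vic18andoverplot)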
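
If you want to practice saving a plot before working with the real data, the following sketch does so with a made-up variable. The toy data, the file name, the use of tempdir(), and the width and height values are all illustrative assumptions; in your assignment, save into your project folder with here() as shown above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, NOT the NCVS data
toy <- tibble(Vic18andoverbin = c(0, 1, 1, 1, 0, 1))

# Step 1: store the plot in an object instead of only printing it
toy_plot <- toy %>%
  ggplot(aes(Vic18andoverbin)) +
  geom_bar()

# Step 2: write that object to a PNG file
# tempdir() is used only so this sketch does not write into your project
out_file <- file.path(tempdir(), "toy_Vic18andoverbin.png")
ggsave(out_file, plot = toy_plot, width = 6, height = 4)

# Confirm the file was created
file.exists(out_file)
```

Naming the plot object in ggsave() makes it clear which plot is written to disk, even if you have created other plots in between.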

You should now have everything that you need to complete the questions in Assignment 4 that parallel those from B&P’s SPSS Exercises for Chapter 3!

If you haven’t done so already, complete the remainder of the questions in Assignment 4 in your RMD file.
1. Upon completing the assignment, “knit” your final RMD file again and save the final knitted Word document to your “Assignments” folder as: YEAR-MO-DY_LastName_CRIM5305_Assign04. Submit via Blackboard in the assignment called Assignment 4: Word Document.

Assignment 4 Objective Checks

After completing assignment #4, can you…

use the ggplot() function from the “ggplot2” package to generate basic bar charts and histograms?
recode variables using mutate() and if_else() functions from the “dplyr” package?
save a plot using ggsave()?