The purpose of this third assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapters 2 and 3 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.
These chapters focused on data distributions and displaying data with tabular or graphical representations. As with the two previous assignments, you will be using R Markdown (with R & RStudio) to complete and present your work. In this assignment, you will learn how to recode variables, generate frequency tables, and create simple graphs in R.
- sjmisc::frq() and summarytools::freq()
- frq() and freq() for creating frequency tables
- summarytools::dfSummary() as another way to quickly describe one or more variables in a data file
- the ggplot() function from the “ggplot2” package to generate basic bar charts and histograms
- dplyr::select()
- the mutate() and if_else() functions from the “dplyr” package
- how the if_else() function works and why we use it instead of ifelse()

We are building on objectives from Assignments 1 & 2. By the start of this assignment, you should already know how to:
- use the package::function() format
- read in an SPSS data file with haven::read_spss() and assign it to an R object using an assignment (<-) operator
- use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
- use the %>% pipe operator to perform a sequence of actions
- use here() for a simple and reproducible self-referential file directory method
- use groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages
- use sjPlot::view_df() to quickly browse variables in a data file
- use attr() to identify variable and attribute value labels

If you do not recall how to do these things, first review Assignments 1 & 2.
Additionally, you should have read the assigned book chapters and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:
As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).
Goal: Create a new RMD file for Assignment 3
(Note: Remember that, when following instructions, always replace “LastName” with your own last name and YEAR_MO_DY with the actual date. E.g., 2022_09_01_Fordham_K300_Assign3)
In the second assignment, you learned how to read in and assign a dataset to an R object. You also learned how to use the view_df() function from the sjPlot package and the base R attr() function to display your dataframe and identify variable attributes. In this third assignment, you will use the “sjmisc” and “summarytools” packages to display your descriptive data in frequency tables. You will also learn about the dfSummary() function from the “summarytools” package, which is an alternative to sjPlot::view_df() for creating a useful summary of all or a subset of the variables in a dataset. Additionally, you will learn how to select and recode variables using the select(), mutate(), and if_else() functions from the “dplyr” package, and how to display your data in basic bar charts or histograms using the ggplot() function from the “ggplot2” package.
The here package will automatically set our K300_L folder as the top-level directory.

Goal: Read in 2012 States Data and view variable information
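- Load the following packages: tidyverse, haven, here, sjmisc, sjPlot, and summarytools.
- Read in the 2012 States Data and assign it to an object named StatesData2012.
- Type StatesData2012 to view the data object.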
StatesData2012 <- read_spss(here("Datasets", "2012StatesData.sav"))
StatesData2012
## # A tibble: 50 × 30
## State Numbe…¹ Numbe…² South Region Permo…³ cigtax smoke…⁴ tobac…⁵ Persm…⁶
## <chr> <dbl> <dbl> <dbl+l> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alaba… 335 1981 1 [Sou… 1 [Sou… 15.1 0.425 0 318. 22
## 2 Alaska 54 216 0 [Non… 2 [Wes… 21.5 2 0 270. 22
## 3 Arizo… 443 607 0 [Non… 2 [Wes… 19.7 2 1 247. 16
## 4 Arkan… 200 1269 1 [Sou… 1 [Sou… 17.6 1.15 1 324. 22
## 5 Calif… 2766 1882 0 [Non… 2 [Wes… 15.6 0.87 0 235 14
## 6 Color… 349 668 0 [Non… 2 [Wes… 18.2 0.84 1 238. 18
## 7 Conne… 228 418 0 [Non… 4 [Nor… 11 3 0 238. 16
## 8 Delaw… 63 156 1 [Sou… 1 [Sou… 14.1 1.6 1 281. 18
## 9 Flori… 1229 1712 1 [Sou… 1 [Sou… 15.9 1.34 1 259. 17
## 10 Georg… 680 2322 1 [Sou… 1 [Sou… 16.5 0.37 0 299. 19
## # … with 40 more rows, 20 more variables: totalpop <dbl>, DivorceRt <dbl>,
## # perfampoverty <dbl>, perindpoverty <dbl>, MedianIncome <dbl>,
## # Pernoinsurance <dbl>, MurderRt <dbl>, RobberyRt <dbl>, AssaultRt <dbl>,
## # BurglaryRt <dbl>, MVTheftRT <dbl>, InfantMort <dbl>, HeartDeathRt <dbl>,
## # CancerDeathRt <dbl>, PerBachelorD <dbl>, PercentRural <dbl>,
## # Percent18to24 <dbl>, ID <dbl>, Assault_bin <dbl+lbl>, Murdercat <dbl+lbl>,
## # and abbreviated variable names ¹Number1824, ²NumberRural, ³Permoved, …
View the variable information using the sjPlot::view_df() function: type StatesData2012 %>% view_df() and hit RUN.

Goal: Use R to create frequency tables for the Murdercat variable (Questions 11-13, Ch. 2, pp. 41-42)
In Chapter 2, you learned about levels of measurement and about how frequency tables are used in descriptive research. While there are many different ways to describe variables, frequency tables are one of the most basic and efficient ways to do so. Frequency tables describe the number of occurrences in our data for each variable attribute or for grouped variable attributes. There are many ways to generate frequency tables, and we will only cover a couple of them here.
Suppose we want to generate a frequency table for the “Murdercat” (or murder rate categorical) variable. Here is a simple way to get that using the frq() function from the “sjmisc” package:
StatesData2012 %>% frq(Murdercat)
This code uses the frq() command from the “sjmisc” package. frq() displays a basic frequency table of the designated variable(s). The table should show the value labels (e.g., “0 to 3 murders per 100k”; “3.1 to 6 murders per 100k”; “6.1 to 9 murders per 100k”; etc.) and the N, or the total number of units (states) in the dataset. It also shows the percentage of states within each attribute value, including the cumulative percentage.
As noted, this is one way to create a frequency table, but there are lots of other packages that we could use instead. Each package and function has its own strengths. For examples, see here.
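To make the columns of a frequency table concrete, here is a minimal sketch that builds one by hand with dplyr. The data frame toy and the variable murdercat_toy are made up for illustration (they are not part of the States data), and this is not how frq() works internally; it only reproduces the same three quantities: counts, percentages, and cumulative percentages.

```r
library(dplyr)

# Made-up data: a categorical variable coded 0-2 for ten hypothetical units
toy <- tibble(murdercat_toy = c(0, 1, 1, 2, 0, 1, 1, 0, 2, 1))

freq_table <- toy %>%
  count(murdercat_toy, name = "freq") %>%   # number of rows in each category
  mutate(
    pct     = 100 * freq / sum(freq),       # percentage in each category
    cum_pct = cumsum(pct)                   # running (cumulative) percentage
  )

freq_table
```

The cumulative percentage for the last category should always reach 100, which is a quick way to check a table built like this.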
While the above frequency table is easy to generate and has the descriptive information we need, it is not easy to sort the table output. So, we will also introduce you to the freq() function from the “summarytools” package. This package is described in the link above and in more detail here.
Type StatesData2012 %>% freq(Murdercat).
- Note that the function is freq(), not frq(). Both functions (with or without the ‘e’) will work as long as their respective packages are loaded: “summarytools” for freq() and “sjmisc” for frq().
- Try both the data %>% freq(variable) and data %>% frq(variable) code (with and without an ‘e’) to see for yourself. One difference is that, unlike the sjmisc::frq() output, the summarytools::freq() output does not include a variable’s attribute value labels (e.g., “0 to 3 murders per 100k”).
- One advantage of the summarytools::freq() function is that it allows us to easily sort a frequency table, such as by highest to lowest frequencies or vice versa. For example, if we want to sort from highest to lowest frequencies, with the most common values on top, we can use freq(variable, order = "freq").
- Type StatesData2012 %>% freq(Murdercat, order = "freq"). This will sort the frequencies of Murdercat from highest to lowest.
- To reverse the order, add a minus sign before freq (data %>% freq(variable, order = "-freq")). This means the frequencies will be sorted from lowest to highest. Sort the Murdercat variable from lowest to highest frequencies and complete the rest of the SPSS exercises on page 42.

#sorted by frequency, high to low
StatesData2012 %>%
freq(Murdercat, order = "freq")
## Frequencies
## StatesData2012$Murdercat
## Label: Murder rate categorical
## Type: Numeric
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 1 21 42.00 42.00 42.00 42.00
## 0 17 34.00 76.00 34.00 76.00
## 2 10 20.00 96.00 20.00 96.00
## 3 1 2.00 98.00 2.00 98.00
## 4 1 2.00 100.00 2.00 100.00
## <NA> 0 0.00 100.00
## Total 50 100.00 100.00 100.00 100.00
#sorted by frequency, low to high
StatesData2012 %>%
freq(Murdercat, order = "-freq")
## Frequencies
## StatesData2012$Murdercat
## Label: Murder rate categorical
## Type: Numeric
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 3 1 2.00 2.00 2.00 2.00
## 4 1 2.00 4.00 2.00 4.00
## 2 10 20.00 24.00 20.00 24.00
## 0 17 34.00 58.00 34.00 58.00
## 1 21 42.00 100.00 42.00 100.00
## <NA> 0 0.00 100.00
## Total 50 100.00 100.00 100.00 100.00
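The two sort directions can also be seen on a small scale with base R. This is only an illustrative sketch on made-up values (the vector x is hypothetical), using table() and sort() rather than summarytools, to show what "high to low" and "low to high" ordering by frequency means.

```r
# Made-up category codes for ten hypothetical units
x <- c(0, 1, 1, 2, 0, 1, 1, 0, 2, 1)

counts <- table(x)                              # unsorted: ordered by category value
high_to_low <- sort(counts, decreasing = TRUE)  # comparable to order = "freq"
low_to_high <- sort(counts)                     # comparable to order = "-freq"

high_to_low
low_to_high
```

In both cases the category labels travel with their counts, so the rows are reordered but no count is changed.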
To format the table for a knitted document, add plain.ascii = FALSE, style = 'rmarkdown' after order = "freq" in your freq() code. This should generate a table with RMD text formatting (e.g., with asterisks and vertical lines throughout) rather than the default plain text output:
#clean it up for knitted RMD
StatesData2012 %>%
freq(Murdercat, order = "freq", plain.ascii = FALSE, style = 'rmarkdown')
## ### Frequencies
## #### StatesData2012$Murdercat
## **Label:** Murder rate categorical
## **Type:** Numeric
##
## | | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
## |-----------:|-----:|--------:|-------------:|--------:|-------------:|
## | **1** | 21 | 42.00 | 42.00 | 42.00 | 42.00 |
## | **0** | 17 | 34.00 | 76.00 | 34.00 | 76.00 |
## | **2** | 10 | 20.00 | 96.00 | 20.00 | 96.00 |
## | **3** | 1 | 2.00 | 98.00 | 2.00 | 98.00 |
## | **4** | 1 | 2.00 | 100.00 | 2.00 | 100.00 |
## | **\<NA\>** | 0 | | | 0.00 | 100.00 |
## | **Total** | 50 | 100.00 | 100.00 | 100.00 | 100.00 |
For more on these options, see the summarytools forum above, or try knitting your document with and without these additions and comparing the output.
The dfSummary() command from the summarytools package is particularly useful for generating a quick summary of all or a subset of variables in a dataset.
library(summarytools)
StatesData2012trim <- StatesData2012 %>%
dplyr::select(South, MurderRt, Murdercat)
print(dfSummary(StatesData2012trim, style = 'grid', graph.magnif = 0.85), method = "render", omit.headings = TRUE)
No | Variable | Label | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing
---|---|---|---|---|---|---|---
1 | South [haven_labelled, vctrs_vctr, double] | State in South | | | | 50 (100.0%) | 0 (0.0%)
2 | MurderRt [numeric] | Murder Rate per 100K | | 37 distinct values | | 50 (100.0%) | 0 (0.0%)
3 | Murdercat [haven_labelled, vctrs_vctr, double] | Murder rate categorical | | | | 50 (100.0%) | 0 (0.0%)
Generated by summarytools 1.0.1 (R version 4.2.0)
2022-10-18
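The trimming step before dfSummary() relies on dplyr::select(), which keeps only the columns you name. Here is a minimal sketch on a made-up data frame; toy_states and its values are invented for illustration, and only the column names echo the States data.

```r
library(dplyr)

# Made-up data frame; values are invented for illustration only
toy_states <- tibble(
  State    = c("A", "B", "C"),
  South    = c(1, 0, 1),
  MurderRt = c(7.1, 2.4, 5.0),
  totalpop = c(100, 200, 300)
)

# Keep only the named columns, in the order listed
toy_trim <- toy_states %>%
  dplyr::select(South, MurderRt)

names(toy_trim)
```

Writing dplyr::select() with the package prefix makes explicit which package's select() is meant; the number of rows is unchanged, and only the set of columns shrinks.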
Goal: Read in and Identify Characteristics of Lone Offender Assault NCVS Data
Now, we will begin working with the 1992 to 2013 NCVS Lone Assault data, which details individual experiences with criminal victimization. You’ll begin by reading this dataset in and displaying the variable view using sjPlot::view_df(). Then, you will need to answer the questions regarding levels of measurement and graphs on pages 71-72 of B&P (SPSS Exercises). To answer these questions, you will need to view the “injured”, “maleoff”, “age_r”, and “V2129” variables.
Read in the NCVS data and assign it to an object named NCVS1992to2013. View the variable summary in the “Viewer” tab using data %>% view_df(). Your “Viewer” tab, or RStudio, should look like this:
NCVS1992to2013 <- read_spss(here("Datasets", "NCVSLoneOffenderAssaults1992to2013.sav"))
For a bar chart of the “injured” variable, type NCVS1992to2013 %>% ggplot(aes(injured)) + geom_bar().
For a histogram, type NCVS1992to2013 %>% ggplot(aes(injured)) + geom_histogram().
NCVS1992to2013 %>%
ggplot(aes(injured)) +
geom_bar()
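If you want to experiment with this pattern without the NCVS file, the following is a minimal sketch on made-up data. The data frame toy_ncvs and its values are hypothetical; the point is only the shape of the code: a ggplot() call with aes(), then a + at the end of the line, then a geometric layer.

```r
library(ggplot2)

# Made-up data: 1 = injured, 0 = not injured, for eight hypothetical victims
toy_ncvs <- data.frame(injured = c(0, 1, 0, 0, 1, 0, 1, 0))

# The + ends the first line, so R knows another layer follows
bar_plot <- ggplot(toy_ncvs, aes(injured)) +
  geom_bar()

# Mapping the variable to y instead makes the bars run horizontally
bar_plot_flipped <- ggplot(toy_ncvs, aes(y = injured)) +
  geom_bar()

bar_plot
```

Assigning the plot to an object and then typing its name, as done here, is optional; typing the ggplot() code on its own prints the plot directly.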
ggplot() is a function in the ggplot2 package (which, like haven and dplyr, is part of the tidyverse) that allows us to create graphs and plots. We will cover some basic options for editing elements of a ggplot object in later assignments. For now, here are a few things to note:
- The aes() function sets the aesthetics of the graph or plot, such as the orientation. In essence, this is the part of the code that sets up the XY background for your plot. For example, plots will orient to the x-axis by default if you type ggplot(aes(variable)) as we did above. Alternatively, if you type ggplot(aes(y=variable)), the plot aesthetic will change by flipping its orientation to the y-axis.
- Next, we add a geometric object such as geom_bar() or geom_histogram() to the object. To do this, we literally “add” the geometric object layer to the XY coordinate plot by including a + sign before it. The general format is data %>% ggplot(aes(variable)) + geom_type
- Make sure the + sign is on the same line as the ggplot() function. Otherwise, R will assume you’re done with the ggplot() function, and it will not understand that you want to add a geometric object to it.

Goal: Recode and Create Frequency Table for “Vic18andoverbin” Variable
In the remainder of the exercise, we are interested in the “age_r” variable and determining the proportion of victims who experienced assaults before they were 18. You can do this by recoding the variable and then creating a frequency table, which will display proportions or percentages along with frequencies for your recoded variable.
To recode the variable, we will use the mutate() function in the dplyr package. We will also use the if_else() function, which represents a ‘yes or no’ test within R.
There are two versions of this test: ifelse() and if_else(). ifelse() simply tells R that if a condition is met (or true), the value should be ‘x’ while, if the condition is not met (or false), the value should be ‘y’. For example, if a victim is under 18, R would code that value to “0”, or ‘x’. If the victim is 18 or older, R would code that value to “1”, or ‘y’. We have age values for all victims in the data, so we do not need to worry about missing data on that variable. If a value were missing, the logical test itself would be NA, and both functions would return NA for that case rather than ‘x’ or ‘y’. It is nonetheless good practice to use if_else() instead: it is stricter, requiring the true and false values to be of compatible types, and it has a missing argument that lets you state explicitly what value cases with a missing condition should receive. The default, missing = NULL, leaves those cases as NA without the need to specify this action. To learn more about these differences, visit https://dplyr.tidyverse.org/reference/if_else.html.
Type NCVS1992to2013_trim <- NCVS1992to2013 %>% mutate(Vic18andoverbin = if_else(age_r < 18, 0, 1)).
The mutate() function creates a new variable from “age_r” according to our if_else() (i.e., condition, true, false) logic statement. If “age_r” (the original variable) is less than 18, then R assigns a value of “0” to our new variable. If “age_r” is not less than 18 (i.e., if the value of the variable, or the victim’s age, is 18 or older and is not missing), then R assigns a value of “1” to the new variable. The new binary indicator variable is named “Vic18andoverbin”.
Use view_df() to see that the range of the variable, Vic18andoverbin, is 0-1.
NCVS1992to2013_trim <- NCVS1992to2013 %>%
mutate(
Vic18andoverbin = if_else(age_r < 18, 0, 1) # if_else() codes values as 0 if victims were under 18 and 1 if they were 18 or older.
)
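To see how this recode treats missing values, here is a small sketch on a made-up age vector; age_toy is hypothetical, and the code 99 is an arbitrary value chosen only for illustration. It runs the same age < 18 test through if_else() and base R ifelse(), then uses the missing argument of if_else() to give NA cases an explicit value.

```r
library(dplyr)

# Made-up ages, including one missing value
age_toy <- c(12, 17, 18, 45, NA)

# Same test as the recode: 0 if under 18, 1 otherwise; the NA age stays NA
recoded <- if_else(age_toy < 18, 0, 1)

# Base R ifelse() also returns NA when the test itself is NA
recoded_base <- ifelse(age_toy < 18, 0, 1)

# The missing argument assigns an explicit value to NA cases (99 is arbitrary)
recoded_flagged <- if_else(age_toy < 18, 0, 1, missing = 99)

recoded
recoded_base
recoded_flagged
```

Note that an age of exactly 18 is coded 1, because the test is "less than 18" rather than "less than or equal to 18".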
Type: Numeric
Value | Freq | % Valid | % Valid Cum. | % Total | % Total Cum.
---|---|---|---|---|---
0 | 4411 | 18.40 | 18.40 | 18.40 | 18.40
1 | 19558 | 81.60 | 100.00 | 81.60 | 100.00
<NA> | 0 | | | 0.00 | 100.00
Total | 23969 | 100.00 | 100.00 | 100.00 | 100.00
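As a quick arithmetic check, the percentages in this table follow directly from the two counts it reports. The sketch below uses only those counts (4411 and 19558); the object names are made up for illustration.

```r
# Counts taken from the frequency table above
counts <- c(under18 = 4411, age18plus = 19558)

total <- sum(counts)                     # total number of victims in the table
pct <- round(100 * counts / total, 2)    # percentage in each category

total
pct
```

The first percentage is the share of victims who were under 18 at the time of the assault, which is the quantity the exercise asks about.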
Create a bar chart of your new variable using data %>% ggplot(aes(variable)) + geom_bar(). Remember to call the correct data object and variable (i.e., you assigned your new variable into a new data object)!
You should now have everything that you need to complete the questions in Assignment 3 that parallel those from B&P’s SPSS Exercises for Chapter 3!
- sjmisc::frq() and summarytools::freq()?
- frq() and freq() for creating frequency tables?
- summarytools::dfSummary() as another way to quickly describe one or more variables in a data file?
- the ggplot() function from the “ggplot2” package to generate basic bar charts and histograms?
- dplyr::select()?
- the mutate() and if_else() functions from the “dplyr” package?
- how the if_else() function works and why we use it instead of ifelse()?