The purpose of this fifth assignment is to help you use R to complete some of the Assignment 5 exercises adapted from the SPSS Exercises at the end of Chapter 4 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.
This chapter focused on measures of central tendency (e.g., mean, median, and mode) and their advantages and disadvantages as single statistical descriptions of a data distribution. Likewise, in this assignment, you will learn how to use R to calculate measures of central tendency and other statistics (e.g., skewness; kurtosis) that help us standardize and efficiently describe the shape of a data distribution. You will also get additional practice with creating frequency tables and simple graphs in R. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.
By the end of this assignment, you should know how to use:
- head() function to quickly view a snapshot of your data
- glimpse() function to quickly view all columns (variables) in your data
- gt() (e.g., head(data) %>% gt())
- mean() and median()
- summarytools::descr() function
- $ operator to reference or call a named element from a list or data object, such as a specific variable in a data file (e.g., mean(data$variable))
- sjmisc::frq() and summarytools::freq()
- ggplot()
We are building on objectives from Assignments 1-4. By the start of this assignment, you should already know how to use:
- package::function() format
- haven::read_spss() and assign it to an R object using an assignment (<-) operator
- $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
- %>% pipe operator to perform a sequence of actions
- here() for a simple and reproducible self-referential file directory method
- sjPlot::view_df() to quickly browse variables in a data file
- attr() to identify variable and attribute value labels
- NA for variables in your data file
- select(), mutate(), and if_else() functions
- summarytools::dfSummary() to quickly describe one or more variables in a data file
- sjmisc::frq() and summarytools::freq() functions
- summarytools::freq()
- ggplot() function
If you do not recall how to do these things, first review Assignments 1-4.
Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:
Goal: Create a new RMD file for Assignment 5
(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR-MO-DY for the actual date. E.g., 2023-01-23_Ducate_CRIM5305_Assign05)
In the last assignment, you learned how to use sjmisc::frq() and summarytools::freq() functions to generate frequency tables for variables. You also learned about the summarytools::dfSummary() function for quickly summarizing all or a subset of the variables in a data object. Lastly, you learned how to select and recode variables using dplyr’s mutate() and if_else() functions as well as how to display data in graphs using ggplot(). In this assignment, you will decide which measure of central tendency is most appropriate for a given variable, then use frequency tables and R functions to calculate measures of central tendency and other univariate descriptive statistics.
The here package will automatically set our CRIM5305_L folder as the top-level directory.
Goal: Reading in Data and Creating Frequency Table from Youth Data
We will begin by reading in the Youth dataset and creating a frequency table of the “parnt2” variable. The frequency table will allow us to answer the first question on pages 100-101.
Load the following packages: tidyverse, haven, here, sjmisc, sjPlot, summarytools, and gt.
Insert another R code chunk.
In the new R code chunk, read and assign the “Youth_0.sav” SPSS datafile into an R data object named YouthData.
In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data object: YouthData.
YouthData <- read_spss(here("Datasets", "Youth_0.sav"))
YouthData
When you print the YouthData object, you should see a table with the first 10 out of 1,272 total rows (containing observations) and 23 columns (or variables).
After viewing the YouthData object, you may comment out this line (i.e., the line that JUST reads YouthData; do not comment out the read_spss() line!).
Type head(YouthData), and hit RUN.
head() is a built-in R function that returns a snapshot of a dataframe or object (the first six rows by default). In this case, it gives us a snapshot of the variables and first several rows of observations in the YouthData object. This can be an especially useful function for quickly glimpsing large datasets.
The glimpse() function is similarly useful for quickly glimpsing all the columns (variables) in a dataframe. Feel free to check that one out as well!
Type head(YouthData) %>% gt(), and hit RUN again.
Here, we piped the output of the head(data) function to the gt() function, which instructs the gt package to reformat our data table using the package’s default table style settings!
head(YouthData)
# glimpse(YouthData)
head(YouthData) %>% gt()
Gender | v2 | v21 | v22 | v63 | v77 | v79 | v109 | v119 | parnt2 | fropinon | frbehave | certain | moral | delinquency | d1 | hoursstudy | filter_$ | heavytvwatcher | studyhard | supervision | drinkingnotbad | Lowcertain_bin |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 15 | 36 | 15 | 3 | 1 | 3 | 5 | 1 | 5 | 18 | 9 | 9 | 19 | 8 | 1 | 15 | 1 | 1 | 1 | 0 | 0 | 1 |
0 | 15 | 3 | 5 | 4 | 1 | 2 | 4 | 1 | 8 | 11 | 6 | 11 | 19 | 10 | 1 | 5 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 15 | 20 | 6 | 4 | 1 | 1 | 4 | 2 | 8 | 12 | 5 | 11 | 20 | 1 | 0 | 6 | 0 | 1 | 0 | 1 | 0 | 0 |
0 | 15 | 2 | 4 | 2 | 1 | 2 | 5 | 3 | 4 | 9 | 5 | 13 | 19 | 1 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 14 | 12 | 2 | 4 | 3 | 5 | 5 | 3 | 7 | 27 | 9 | 4 | 18 | 104 | 1 | 2 | 1 | 0 | 0 | 1 | 1 | 1 |
1 | 15 | 1 | 3 | 4 | 1 | 1 | 5 | 3 | 7 | 8 | 9 | 10 | 20 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 |
Note that, when knitting to Word, the table still isn’t very pretty. However, it is now formatted as an actual TABLE, which means you can modify and style it, whereas the output of head(YouthData) on its own is just code, which you cannot style. There are ways to use templates to make the output even nicer by default, if you ever decide to go that route.
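To see what head() returns without loading the full file, here is a minimal base R sketch. The toy_youth data frame is a hypothetical stand-in built from four of the columns shown in the table above; it is not the real YouthData object.

```r
# Hypothetical stand-in for YouthData, using four columns from the
# six rows displayed above (the real file has 1,272 rows and 23 columns).
toy_youth <- data.frame(
  Gender      = c(1, 0, 1, 0, 0, 1),
  v2          = c(15, 15, 15, 15, 14, 15),
  parnt2      = c(5, 8, 8, 4, 7, 7),
  delinquency = c(8, 10, 1, 1, 104, 0)
)

# head() returns the first six rows by default; the n argument changes that.
first_three <- head(toy_youth, n = 3)
print(first_three)

# str() is a base R way to list every column with its type,
# similar in spirit to glimpse().
str(toy_youth)
```

The same n argument works on the real object, e.g., head(YouthData, n = 3).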
Type YouthData %>% view_df() and hit RUN, then check your Viewer tab to get a better look at the variable names, labels, and values.
Type YouthData %>% freq(parnt2) to generate a frequency table for the “parental supervision scale” variable.
view_df() is helpful: it allows you to see the variable names, labels, and value labels. For example, if you type “parental supervision scale” into RStudio, nothing useful will happen, because it is a label, not a variable (or object) name. However, if you reference parnt2 within the data object (e.g., YouthData$parnt2), R returns the variable that measures the parental supervision scale.
#default
YouthData %>% freq(parnt2)
## Frequencies
## YouthData$parnt2
## Label: parental supervision scale
## Type: Numeric
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 2 6 0.47 0.47 0.47 0.47
## 3 16 1.26 1.73 1.26 1.73
## 4 163 12.81 14.54 12.81 14.54
## 5 139 10.93 25.47 10.93 25.47
## 6 440 34.59 60.06 34.59 60.06
## 7 189 14.86 74.92 14.86 74.92
## 8 319 25.08 100.00 25.08 100.00
## <NA> 0 0.00 100.00
## Total 1272 100.00 100.00 100.00 100.00
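The frequency table already contains everything needed to find the mode and median by hand. As a rough sketch, the counts printed above can be re-entered into base R; the object names (values, counts, parnt2_rebuilt) are made up for this illustration.

```r
# Values and frequencies copied from the freq() output above.
values <- 2:8
counts <- c(6, 16, 163, 139, 440, 189, 319)

# Mode: the value with the largest frequency.
parnt2_mode <- values[which.max(counts)]

# Median: the first value whose cumulative percent reaches 50%.
cum_pct <- 100 * cumsum(counts) / sum(counts)
parnt2_median <- values[which(cum_pct >= 50)[1]]

# Rebuilding the full vector lets us check against median() and mean().
parnt2_rebuilt <- rep(values, times = counts)

parnt2_mode
parnt2_median
median(parnt2_rebuilt)
round(mean(parnt2_rebuilt), 2)
```

With these counts, the mode and the median both come out to 6, and the mean is about 6.23. A mean slightly above the median and mode is one clue to weigh when you judge the shape of the distribution.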
Because the parnt2 variable is discrete, let’s plot it as a bar graph. Recall that we can plot bar graphs using the generic command data %>% ggplot(aes(variable)) + geom_bar().
YouthData %>% ggplot(aes(parnt2)) + geom_bar()
Goal: Create frequency tables for the “v77”, “v79”, “certain”, and “Gender” variables
Now, we are going to generate frequency tables for four variables, determine which measure of central tendency is most appropriate for each, and determine whether each variable’s distribution is skewed and, if so, the direction of its skew (i.e., negative or positive). Recall that identifying skew and choosing the most appropriate measure of central tendency ensures that we report meaningful summary statistics when describing a variable. Generating a graph (e.g., bar chart or histogram) might also help you to determine the most appropriate measure of central tendency as well as to identify the direction of a skewed distribution.
YouthData %>% frq(v77)
YouthData %>% sjmisc::frq(v79)
Previously, you used the freq() function from the summarytools package to generate a frequency distribution; this time, notice that you were instructed to use the frq() function from the sjmisc package instead. One benefit of the frq() function is that it automatically calculates the mean and standard deviation (sd) for a variable. (Note, though, that it may calculate a mean value even when the mean is not the most appropriate measure of central tendency!)
Use data %>% ggplot(aes(variable)) + geom_bar() for bar graphs, which are the most useful in this case.
Your RStudio should look like Figures 3 and 4 (see next page).
Use the mean() function to calculate the mean of a particular variable.
Type mean(YouthData$certain). This will generate the mean of the certain variable in the YouthData dataset. You can repeat this step for the other variables, v77, v79, and Gender.
The $ operator is used to call, access, or reference a named element from a list or data object in R. Here, we are essentially telling R to access the certain column from the YouthData object and calculate its mean. The $ operator, which you may recall that we first introduced in Assignment 2, is used frequently with base R functions. By this point, you have also used the tidyverse %>% operator many times. Often, tasks can be accomplished either with $ operators or with %>% operators, though some base R functions do not work with tidyverse-style pipes. This is one such case: piping the whole data object straight into mean() with %>% does not work, so we use $ to hand it a single column instead.
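A small base R sketch may make the $ operator more concrete. It assumes a hypothetical toy_youth data frame whose certain and Gender values are the six shown in the head(YouthData) table earlier; it is not the full dataset.

```r
# Hypothetical mini data frame; the values are the six shown
# in the head(YouthData) table earlier in this assignment.
toy_youth <- data.frame(
  certain = c(9, 11, 11, 13, 4, 10),
  Gender  = c(1, 0, 1, 0, 0, 1)
)

# $ pulls one named column out of the data frame as a plain vector.
certain_values <- toy_youth$certain
mean_certain <- mean(certain_values)

# [[ ]] with the column name in quotes retrieves the same vector.
same_vector <- identical(toy_youth[["certain"]], certain_values)

# Handing mean() the whole data frame does not work: it returns NA
# (with a warning), because mean() expects a numeric vector.
mean_of_frame <- suppressWarnings(mean(toy_youth))

mean_certain
same_vector
mean_of_frame
```

The last result is NA, which is why mean() needs a single column rather than the entire data object.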
Type median(YouthData$certain). You can use this approach to calculate the median of the remaining variables.
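To illustrate why the mean and median can disagree, here is a short sketch using only the six delinquency values visible in the head(YouthData) table. Six cases cannot stand in for the full variable, so treat the numbers as a demonstration of the functions, not as answers to the assignment questions.

```r
# delinquency values from the six rows shown by head(YouthData).
delinquency_sample <- c(8, 10, 1, 1, 104, 0)

sample_mean   <- mean(delinquency_sample)    # pulled upward by the 104
sample_median <- median(delinquency_sample)  # middle of the sorted values

# With missing data, both functions return NA unless na.rm = TRUE.
with_missing <- c(delinquency_sample, NA)
mean_ignoring_na <- mean(with_missing, na.rm = TRUE)

sample_mean
sample_median
mean_ignoring_na
```

Here the single extreme value drags the mean (about 20.67) far above the median (4.5), which is the pattern to look for when a distribution is positively skewed.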
There is a mode() function in R, but it DOES NOT calculate the mode of a distribution. Instead, it returns the storage mode of an object. You can see this for yourself by running mode(YouthData$Gender). To find the mode, rely on frequency tables instead.
The descr() function in the summarytools package will generate various descriptive statistics, including the mean, median, and values indicating the level of skewness and kurtosis. Using the “certain” variable as an example, type: YouthData %>% descr(certain)
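For intuition about the skewness and kurtosis values that descr() reports, the sketch below computes the common moment-based versions by hand in base R. The function names moment_skewness and excess_kurtosis are hypothetical helpers written for this example, and descr() may apply a different small-sample adjustment, so its values need not match these exactly.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5. NAs are dropped.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over the squared second
# central moment, minus 3, so a normal distribution scores about 0.
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric  <- c(1, 2, 3, 4, 5)
right_tail <- c(8, 10, 1, 1, 104, 0)  # delinquency values shown earlier

moment_skewness(symmetric)   # 0 for a perfectly symmetric set
moment_skewness(right_tail)  # positive: long tail to the right
excess_kurtosis(symmetric)
```

A positive skewness value signals a tail stretching toward larger values, and a negative one signals a tail toward smaller values.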
mean(), median(), and mode() functions from base R and/or the descr() function from the summarytools package.
You should now be able to answer questions 9-10 of Assignment 5.
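Since base R's mode() reports how an object is stored, one way to find the statistical mode in code is to tabulate the values yourself. The sketch below assumes a hypothetical helper named stat_mode and uses the six Gender and parnt2 values shown in the head(YouthData) table, not the full data.

```r
# Values from the six rows shown by head(YouthData).
gender_sample <- c(1, 0, 1, 0, 0, 1)
parnt2_sample <- c(5, 8, 8, 4, 7, 7)

# Base R mode() describes how the object is stored, not its most common value.
storage_type <- mode(gender_sample)

# Hypothetical helper: tabulate the values and keep every value
# tied for the highest count (a distribution can have several modes).
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

storage_type
stat_mode(parnt2_sample)
stat_mode(c(3, 3, 3, 5, 6))
```

In this tiny sample, 7 and 8 each appear twice, so the helper returns both; reading the largest count off a frequency table gives the same answer.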
Goal: Create a histogram for “delinquency” Variable
Next, you will create a histogram for the delinquency variable to determine its most appropriate measure of central tendency. This will allow you to answer question 11 of Assignment 5.
Create a second-level header titled: “Part 4 (Assignment 5.4)”
Type YouthData %>% ggplot(aes(delinquency)) + geom_histogram().
YouthData %>%
  ggplot(aes(delinquency)) +
  geom_histogram()
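A histogram is built by sorting values into bins and counting each bin. As a rough illustration of that step, this base R sketch bins the six delinquency values shown earlier; the bin width of 10 is an arbitrary choice for the example.

```r
# delinquency values from the six rows shown by head(YouthData).
delinquency_sample <- c(8, 10, 1, 1, 104, 0)

# Bins of width 10 from 0 to 110; plot = FALSE returns the counts
# without drawing anything.
bins <- hist(delinquency_sample,
             breaks = seq(0, 110, by = 10),
             plot = FALSE)

bins$breaks
bins$counts
```

Most of the sample falls in the first bin while one case sits alone in the last, which is what a long right tail looks like in counts. In ggplot, the bin width can be set in a similar way with the binwidth argument to geom_histogram().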
You should now have everything that you need to complete the questions in Assignment 5 that parallel those from B&P’s SPSS Exercises for Chapter 4!
After completing this assignment, do you know how to use:
- head() function to quickly view a snapshot of your data?
- glimpse() function to quickly view all columns (variables) in your data?
- gt() (e.g., head(data) %>% gt())?
- mean(), median(), and mode()?
- summarytools::descr() function?
- $ operator to reference or call a named element from a list or data object, such as a specific variable in a data file (e.g., mean(data$variable))?
- sjmisc::frq() and summarytools::freq()?
- ggplot()?