The purpose of this fourth assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 4 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.
This chapter focused on measures of central tendency (e.g., mean, median, and mode) and their advantages and disadvantages as single statistical descriptions of a data distribution. Likewise, in this assignment, you will learn how to use R to calculate measures of central tendency and other statistics (e.g., skewness; kurtosis) that help us standardize and efficiently describe the shape of a data distribution. You will also get additional practice with creating frequency tables and simple graphs in R. As with previous assignments, you will be using R Markdown (with R & R Studio) to complete and submit your work.
- head() function to quickly view a snapshot of your data
- glimpse() function to quickly view all columns (variables) in your data
- gt() (e.g., head(data) %>% gt())
- mean(), median(), and mode()
- summarytools::descr() and psych::describe() functions
- $ operator to reference or call a named element from a list or data object, such as a specific variable in a data file (e.g., mean(data$variable))
- sjmisc::frq() and summarytools::freq()
- ggplot()
We are building on objectives from Assignments 1-3. By the start of this assignment, you should already know how to:
- package::function() format
- haven::read_spss() and assign it to an R object using an assignment (<-) operator
- $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
- %>% pipe operator to perform a sequence of actions
- here() for a simple and reproducible self-referential file directory method
- groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages
- sjPlot::view_df() to quickly browse variables in a data file
- attr() to identify variable and attribute value labels
- NA for variables in your data file
- select(), mutate(), and if_else() functions
- summarytools::dfSummary() to quickly describe one or more variables in a data file
- sjmisc::frq() and summarytools::freq() functions
- summarytools::freq()
- ggplot() function

If you do not recall how to do these things, first review Assignments 1-3.
Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:
As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).
Goal: Create a new RMD file for Assignment 4
(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR_MO_DY for the actual date. E.g., 2022_06_01_Fordham_K300Assign4)
In the last assignment, you learned how to use the sjmisc::frq() and summarytools::freq() functions to generate frequency tables for variables. You also learned about the summarytools::dfSummary() function for quickly summarizing all or a subset of the variables in a data object. Lastly, you learned how to select and recode variables using dplyr’s select(), mutate(), and if_else() functions as well as how to display data in graphs using ggplot(). In this assignment, you will decide which measure of central tendency is most appropriate for a given variable, then use frequency tables and R functions to calculate measures of central tendency and other univariate descriptive statistics.
The here package will automatically set our K300_L folder as the top-level directory.
Goal: Reading in Data and Creating Frequency Table from Youth Data
We will begin by reading in the Youth dataset and creating a frequency table of the “parnt2” variable. The frequency table will allow us to answer the first question on pages 100-101.
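- Load the following packages: tidyverse, haven, here, sjmisc, sjPlot, summarytools, and gt.
- Optionally, use groundhog.library() to improve the reproducibility of your script.
- Read in the data file and assign it to an object named YouthData.
- View the data by typing the object name, YouthData.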
YouthData <- read_spss(here("Datasets", "Youth_0.sav"))
YouthData
## # A tibble: 1,272 × 23
## Gender v2 v21 v22 v63 v77 v79 v109 v119 parnt2
## <dbl+lbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl>
## 1 1 [male] 15 36 15 3 [usual… 1 [alw… 3 [som… 5 [a v… 1 [hur… 5
## 2 0 [female] 15 3 5 4 [alway… 1 [alw… 2 [usu… 4 [a b… 1 [hur… 8
## 3 1 [male] 15 20 6 4 [alway… 1 [alw… 1 [alw… 4 [a b… 2 [hur… 8
## 4 0 [female] 15 2 4 2 [somet… 1 [alw… 2 [usu… 5 [a v… 3 [hur… 4
## 5 0 [female] 14 12 2 4 [alway… 3 [som… 5 [nev… 5 [a v… 3 [hur… 7
## 6 1 [male] 15 1 3 4 [alway… 1 [alw… 1 [alw… 5 [a v… 3 [hur… 7
## 7 0 [female] 16 10 3 4 [alway… 3 [som… 2 [usu… 4 [a b… 1 [hur… 8
## 8 0 [female] 15 25 10 2 [somet… 3 [som… 5 [nev… 4 [a b… 3 [hur… 5
## 9 0 [female] 15 6 10 4 [alway… 2 [usu… 3 [som… 5 [a v… 1 [hur… 8
## 10 1 [male] 15 15 8 4 [alway… 3 [som… 3 [som… 4 [a b… 2 [hur… 7
## # … with 1,262 more rows, and 13 more variables: fropinon <dbl>,
## # frbehave <dbl>, certain <dbl>, moral <dbl>, delinquency <dbl>,
## # d1 <dbl+lbl>, hoursstudy <dbl>, `filter_$` <dbl>, heavytvwatcher <dbl+lbl>,
## # studyhard <dbl+lbl>, supervision <dbl+lbl>, drinkingnotbad <dbl+lbl>,
## # Lowcertain_bin <dbl>
- After viewing the YouthData object, you should see a table with the first 10 out of 1,272 total rows (containing observations) and 23 columns (or variables).
- Type head(YouthData), and hit RUN.
- head() is a built-in R function that returns the first few rows (six by default) of a dataframe or object. In this case, it gives us a snapshot of the variables and first several rows of observations in the YouthData object. This can be an especially useful function for quickly glimpsing large datasets.
- The glimpse() function is similarly useful for quickly glimpsing all the columns (variables) in a dataframe. Feel free to check that one out as well!
- Type head(YouthData) %>% gt(), and hit RUN again.
- Here, we pipe the output of the head(data) function to the gt() function, which instructs the “gt” package to reformat our data table using the package’s default table style settings!
head(YouthData)
## # A tibble: 6 × 23
## Gender v2 v21 v22 v63 v77 v79 v109 v119 parnt2
## <dbl+lbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl>
## 1 1 [male] 15 36 15 3 [usuall… 1 [alw… 3 [som… 5 [a v… 1 [hur… 5
## 2 0 [female] 15 3 5 4 [always] 1 [alw… 2 [usu… 4 [a b… 1 [hur… 8
## 3 1 [male] 15 20 6 4 [always] 1 [alw… 1 [alw… 4 [a b… 2 [hur… 8
## 4 0 [female] 15 2 4 2 [someti… 1 [alw… 2 [usu… 5 [a v… 3 [hur… 4
## 5 0 [female] 14 12 2 4 [always] 3 [som… 5 [nev… 5 [a v… 3 [hur… 7
## 6 1 [male] 15 1 3 4 [always] 1 [alw… 1 [alw… 5 [a v… 3 [hur… 7
## # … with 13 more variables: fropinon <dbl>, frbehave <dbl>, certain <dbl>,
## # moral <dbl>, delinquency <dbl>, d1 <dbl+lbl>, hoursstudy <dbl>,
## # `filter_$` <dbl>, heavytvwatcher <dbl+lbl>, studyhard <dbl+lbl>,
## # supervision <dbl+lbl>, drinkingnotbad <dbl+lbl>, Lowcertain_bin <dbl>
# glimpse(YouthData)
head(YouthData) %>% gt()
Gender | v2 | v21 | v22 | v63 | v77 | v79 | v109 | v119 | parnt2 | fropinon | frbehave | certain | moral | delinquency | d1 | hoursstudy | filter_$ | heavytvwatcher | studyhard | supervision | drinkingnotbad | Lowcertain_bin |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 15 | 36 | 15 | 3 | 1 | 3 | 5 | 1 | 5 | 18 | 9 | 9 | 19 | 8 | 1 | 15 | 1 | 1 | 1 | 0 | 0 | 1 |
0 | 15 | 3 | 5 | 4 | 1 | 2 | 4 | 1 | 8 | 11 | 6 | 11 | 19 | 10 | 1 | 5 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 15 | 20 | 6 | 4 | 1 | 1 | 4 | 2 | 8 | 12 | 5 | 11 | 20 | 1 | 0 | 6 | 0 | 1 | 0 | 1 | 0 | 0 |
0 | 15 | 2 | 4 | 2 | 1 | 2 | 5 | 3 | 4 | 9 | 5 | 13 | 19 | 1 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 14 | 12 | 2 | 4 | 3 | 5 | 5 | 3 | 7 | 27 | 9 | 4 | 18 | 104 | 1 | 2 | 1 | 0 | 0 | 1 | 1 | 1 |
1 | 15 | 1 | 3 | 4 | 1 | 1 | 5 | 3 | 7 | 8 | 9 | 10 | 20 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 |
Which table do you prefer in the knitted document? I typically prefer gt-style tables myself. Some other packages you will use to create tables for this course are also compatible with and can be piped to the gt() function to quickly and easily improve the default table output (and to customize it if desired).
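As a minimal sketch of the snapshot functions described above, the example below uses a small made-up data frame (the toy object and its columns are hypothetical, not part of the Youth data). It relies only on base R; str() is shown as a rough base R analogue of glimpse(), which is a suggestion rather than part of the assignment.

```r
# Hypothetical toy data: 20 rows, 3 columns (not the Youth dataset)
toy <- data.frame(
  id    = 1:20,
  age   = rep(c(14, 15, 16, 15), times = 5),
  score = seq(2, 40, by = 2)
)

# head() returns the first six rows by default
first_rows <- head(toy)
print(first_rows)

# The n argument changes how many rows are returned
first_three <- head(toy, n = 3)
print(first_three)

# str() compactly lists every column with its type and first values
str(toy)
```

If you only need to confirm that a file was read in correctly, a quick head() call is usually enough; str() or glimpse() is more helpful when there are too many columns to fit on screen.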
- Type YouthData %>% view_df() and hit RUN, then check your Viewer tab to get a better look at the variable names, labels, and values.
- Type YouthData %>% freq(parnt2) to generate a frequency table for the “parental supervision scale” variable.
- view_df() is helpful here because it allows you to see the variable names, labels, and value labels. For example, if you type “parental supervision scale” into R Studio, R will not find the variable, because that is a label, not a variable (or object) name. However, if you refer to parnt2 within the data (e.g., YouthData$parnt2), R returns the variable that measures the parental supervision scale.
#default
YouthData %>% freq(parnt2)
## Frequencies
## YouthData$parnt2
## Label: parental supervision scale
## Type: Numeric
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 2 6 0.47 0.47 0.47 0.47
## 3 16 1.26 1.73 1.26 1.73
## 4 163 12.81 14.54 12.81 14.54
## 5 139 10.93 25.47 10.93 25.47
## 6 440 34.59 60.06 34.59 60.06
## 7 189 14.86 74.92 14.86 74.92
## 8 319 25.08 100.00 25.08 100.00
## <NA> 0 0.00 100.00
## Total 1272 100.00 100.00 100.00 100.00
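To see where the columns in this table come from, here is a minimal base R sketch. It rebuilds a vector with the same counts shown in the output above (the parnt2_rebuilt object is a stand-in reconstructed from those counts, not the original data file), then computes counts, percentages, and cumulative percentages with table(), prop.table(), and cumsum().

```r
# Rebuild a stand-in vector from the counts printed in the frequency table
values <- 2:8
counts <- c(6, 16, 163, 139, 440, 189, 319)
parnt2_rebuilt <- rep(values, times = counts)

freq_tab <- table(parnt2_rebuilt)        # counts per value
pct      <- 100 * prop.table(freq_tab)   # percent of all cases
cum_pct  <- cumsum(pct)                  # running total of the percents

freq_df <- data.frame(
  value   = names(freq_tab),
  freq    = as.vector(freq_tab),
  pct     = round(as.vector(pct), 2),
  cum_pct = round(as.vector(cum_pct), 2)
)
print(freq_df)
```

Because this variable has no missing values, the "valid" and "total" percentages in the summarytools output are identical, so one percent column is enough in this sketch.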
Goal: Create frequency tables for the “v77”, “v79”, “certain”, and “Gender” variables (Question 2, Ch.4 (pp.100-101))
Now, we are going to generate frequency tables for four variables. For each variable, we will determine which measure of central tendency is most appropriate, whether the variable’s distribution is skewed and, if so, the direction of its skew (i.e., negative or positive). Making these determinations is important because it ensures that we report meaningful summary statistics when describing a variable. Generating a graph (e.g., bar chart or histogram) might also help you to determine the most appropriate measure of central tendency as well as to identify the direction of a skewed distribution.
- Type YouthData %>% frq(v77) and hit RUN, then do the same for YouthData %>% frq(v79).
- Earlier, you used the freq() function from the summarytools package to generate a frequency distribution; this time, notice that you were instructed to use the frq() function from the sjmisc package instead. One benefit of the frq() function is that it automatically calculates the mean and standard deviation (sd) for a variable. (Note, though, that it may calculate a mean value even when the mean is not the most appropriate measure of central tendency!)
- To create a bar chart, use the format data %>% ggplot(aes(variable)) + geom_bar()
- Use the mean() function to calculate the mean of a particular variable.
- Type mean(YouthData$certain). This will generate the mean of the ‘certain’ variable in the YouthData dataset. You can repeat this step for the other variables, v77, v79, and Gender.
- The $ operator is used to call, access, or reference a named element from a list or data object in R. Here, we are essentially telling R to access and calculate the mean of the certain column from the YouthData object. The $ operator, which you may recall that we first introduced in Assignment 2, is used frequently with base R functions. By this point, you have also used the tidyverse %>% operator many times. Often, tasks can be accomplished either with $ operators or with %>% operators, though some base R functions do not work with tidyverse-style pipes. I typically prefer tidyverse solutions and, when using tidyverse functions (e.g., “dplyr” and “ggplot2” packages), I usually recommend using the pipe operator %>% in lieu of the $ to initiate a sequence of actions. Doing so improves code readability and efficiency (e.g., by avoiding repetition of the data object name) and reduces the number of assigned objects and the complexity of nested function calls. It is worth noting that, as of version 4.1, base R also has its own native pipe operator (|>).
- Type median(YouthData$certain). You can use this approach to calculate the median of the remaining variables, using the same code as above but replacing the function name and variable name as necessary.
- You might expect to calculate the mode with mode(), but DO NOT do it; in base R, mode() returns the storage mode (type) of an object, which is something else entirely! Instead, if you want to calculate the mode of a vector of values in R, one way to do it is to write a custom function. In the code below, we show how you can create a function called getMode(), which you can then use to calculate the mode of a variable.
getMode <- function(x) {
  # list each distinct value once
  uniqx <- unique(x)
  # count how often each distinct value occurs and return the most frequent one
  uniqx[which.max(tabulate(match(x, uniqx)))]
}
getMode(YouthData$certain)
## [1] 12
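The following sketch pulls these pieces together on a short made-up vector (the toy data frame and its certain_toy column are hypothetical). It shows $ access, the base R mean() and median() functions, what mode() actually returns, and the custom getMode() function defined above.

```r
# Hypothetical toy data (not the Youth dataset)
toy <- data.frame(certain_toy = c(2, 4, 4, 5, 7, 9, 4))

# Same custom function as above: most frequent value (first one if tied)
getMode <- function(x) {
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}

toy_mean   <- mean(toy$certain_toy)     # arithmetic average
toy_median <- median(toy$certain_toy)   # middle value after sorting
toy_mode   <- getMode(toy$certain_toy)  # most frequent value
storage    <- mode(toy$certain_toy)     # storage mode, not the statistical mode

print(c(mean = toy_mean, median = toy_median, mode = toy_mode))
print(storage)
```

If a variable contains missing values, mean() and median() return NA unless you add na.rm = TRUE (e.g., mean(x, na.rm = TRUE)).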
- The descr() function in the summarytools package will generate various descriptive statistics, including the mean, median, and values indicating the level of skewness and kurtosis. Using the “certain” variable as an example, type: YouthData %>% descr(certain)
- Use the mean() and median() functions from base R and/or the descr() function from the summarytools package to finish answering question 2 on page 101.
Goal: Create a histogram for “delinquency” Variable (Question 3, Ch. 4 (pp. 100-101))
Next, you will create a histogram for the delinquency variable to determine its most appropriate measure of central tendency. This will allow you to answer question 3 on pages 100-101.
- Type YouthData %>% ggplot(aes(delinquency)) + geom_histogram(). Answer question 3.
YouthData %>%
  ggplot(aes(delinquency)) +
  geom_histogram()
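One way to double-check what you see in a histogram is to compare the mean with the median and to compute a skewness value. The sketch below uses made-up vectors and a hypothetical helper, skew_moment(), based on one common moment-based definition of skewness. Packages such as summarytools and psych may use slightly different formulas, so treat the exact values as illustrative.

```r
# Hypothetical helper: moment-based skewness (third central moment
# divided by the second central moment raised to the power 1.5)
skew_moment <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^1.5
}

# Made-up data: most values are small, a few are very large
right_skewed <- c(0, 0, 1, 1, 2, 3, 8, 10, 104)
symmetric    <- c(1, 2, 3, 4, 5)

print(mean(right_skewed))         # pulled upward by the large values
print(median(right_skewed))       # stays near the bulk of the data
print(skew_moment(right_skewed))  # positive for a long right tail
print(skew_moment(symmetric))     # zero for a perfectly symmetric vector
```

A mean well above the median together with a positive skewness value suggests a positively (right) skewed distribution, in which case the median is often the more representative summary.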
Goal: Create a Frequency Table for “Gender” Variable
For the last part of the assignment, you will create a frequency table of the “Gender” variable to determine its mean. The table will also allow you to check out the distribution of the variable.
- Type YouthData %>% freq(Gender, plain.ascii = FALSE, style = "rmarkdown"). Remember to type {r, results = 'asis'} in the chunk options on the top line of the code chunk, before the R code, so that R Studio shows a clean table upon knitting. If you do not recall why we use 'asis', plain.ascii, or style, refer to Assignment 3.
YouthData %>%
  frq(Gender)
## Gender of respondent (Gender) <numeric>
## # total N=1272 valid N=1272 mean=0.47 sd=0.50
##
## Value | Label | N | Raw % | Valid % | Cum. %
## -----------------------------------------------
## 0 | female | 680 | 53.46 | 53.46 | 53.46
## 1 | male | 592 | 46.54 | 46.54 | 100.00
## <NA> | <NA> | 0 | 0.00 | <NA> | <NA>
YouthData %>%
freq(Gender, plain.ascii = FALSE, style = "rmarkdown")
Label: Gender of respondent
Type: Numeric
| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|---|---|---|---|---|---|
| 0 | 680 | 53.46 | 53.46 | 53.46 | 53.46 |
| 1 | 592 | 46.54 | 100.00 | 46.54 | 100.00 |
| <NA> | 0 | | | 0.00 | 100.00 |
| Total | 1272 | 100.00 | 100.00 | 100.00 | 100.00 |
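- Next, we will use the psych package to view our measures of central tendency.
- Install the psych package and load it into R.
- You can confirm that the package is loaded by clicking on Packages and looking for a check mark.
- Type describe(YouthData$Gender)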
- The describe() function in the psych package allows you to look at the mean, median, minimum, maximum, and standard deviation of a dataset or variable. Remember, the $ calls a variable in a specific dataset. So, if we typed describe(YouthData), we would get these values for all variables in the data. But, by using the $ operator, we can specify that we want these values for the “Gender” variable. If you want more of a refresher, refer to Assignment 2. Your R Studio should look like this:
describe(YouthData$Gender)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1272 0.47 0.5 0 0.47 0 0 1 1 0.14 -1.98 0.01
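For a variable coded 0/1, the mean is simply the proportion of cases coded 1. The sketch below rebuilds a stand-in vector from the Gender counts in the frequency table above (680 coded 0 and 592 coded 1; gender_rebuilt is not the original data) and reproduces several of the values reported by describe() using base R. The standard error line assumes the usual sd / sqrt(n) formula.

```r
# Stand-in vector rebuilt from the Gender counts: 680 zeros, 592 ones
gender_rebuilt <- rep(c(0, 1), times = c(680, 592))

n        <- length(gender_rebuilt)
g_mean   <- mean(gender_rebuilt)     # equals the proportion coded 1
g_median <- median(gender_rebuilt)
g_sd     <- sd(gender_rebuilt)
g_se     <- g_sd / sqrt(n)           # assumed formula: sd / sqrt(n)

print(round(c(n = n, mean = g_mean, median = g_median,
              sd = g_sd, se = g_se), 2))
```

Here the mean of 0.47 matches the 46.54% of respondents coded as male, which is why a mean can be interpreted for a 0/1 variable even though Gender is a nominal measure.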
You should now have everything that you need to complete the questions in Assignment 4 that parallel those from B&P’s SPSS Exercises for Chapter 4!
- head() function to quickly view a snapshot of your data?
- glimpse() function to quickly view all columns (variables) in your data?
- gt() (e.g., head(data) %>% gt())?
- mean() and median()?
- summarytools::descr() and psych::describe() functions?
- $ operator to reference or call a named element from a list or data object, such as a specific variable in a data file (e.g., mean(data$variable))?
- sjmisc::frq() and summarytools::freq()?
- ggplot()?