The purpose of this fifth assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 5 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.
This chapter covered measures of dispersion, including variation ratio, range, interquartile range, variance, and standard deviation. We use measures of dispersion to summarize the “spread” (rather than central tendency) of a data distribution. Likewise, in this assignment, you will learn how to use R to calculate measures of dispersion and create boxplots that help us standardize and efficiently describe the spread of a data distribution. You will also get additional practice with creating frequency tables and simple graphs in R, and you will learn how to modify some elements (e.g., color) of a ggplot object. As with previous assignments, you will be using R Markdown (with R & R Studio) to complete and submit your work.
By the end of this assignment, you should be able to:
- calculate measures of dispersion in R (e.g., using `sjmisc::frq()` or `summarytools::descr()`)
- use `boxplot()` and `ggplot()` to visualize dispersion in a data distribution
- modify elements of a ggplot geometric object (e.g., `geom_boxplot()`) by adding `fill=` and `color=` followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
- add a theme (e.g., `+ theme_minimal()`) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
- select colors from a colorblind-accessible palette (e.g., using `viridisLite::viridis()`) and specify them for the outline and fill colors in a ggplot geometric object (e.g., `geom_boxplot()`)
- add a title to a ggplot object using the `labs()` function (e.g., `+ labs(title = "My Title")`)
We are building on objectives from Assignments 1-4. By the start of this assignment, you should already know how to:
- call functions using the `package::function()` format
- read in an SPSS data file using `haven::read_spss()` and assign it to an R object using an assignment (`<-`) operator
- use the `$` symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format `dataobject$varname`
- use the `%>%` pipe operator to perform a sequence of actions
- use `here()` for a simple and reproducible self-referential file directory method
- use `groundhog.library()` as an optional but recommended reproducible alternative to `library()` for loading packages
- use the `head()` function to quickly view a snapshot of your data
- use the `glimpse()` function to quickly view all columns (variables) in your data
- use `sjPlot::view_df()` to quickly browse variables in a data file
- use `attr()` to identify variable and attribute value labels
- recognize values coded as `NA` for variables in your data file
- use the `select()`, `mutate()`, and `if_else()` functions
- use `summarytools::dfSummary()` to quickly describe one or more variables in a data file
- create frequency tables with the `sjmisc::frq()` and `summarytools::freq()` functions
- calculate measures of central tendency using `mean()` and `median()` (e.g., `mean(data$variable)`)
- generate descriptive statistics using the `summarytools::descr()` and `psych::describe()` functions
- format tables by piping them to `gt()` (e.g., `head(data) %>% gt()`)
- create basic graphs using the `ggplot()` function
If you do not recall how to do these things, review Assignments 1-4.
Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:
As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).
Goal: Read in Youth Data and Determine Measures of Dispersion
(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR_MO_DY for the actual date. E.g., 2022_06_08_Fordham_K300Assign5)
In the last assignment, you learned how to identify or calculate measures of central tendency from frequency tables to summarize the most common or “expected” value of a data distribution. In doing so, you learned how to decide which measures of central tendency are most appropriate or useful for summarizing specific variables. In this assignment, you will use frequency tables to calculate measures of dispersion, and boxplots to visualize dispersion, for several variables.
- The `here` package will automatically set our K300_L folder as the top-level directory.
- Load the following packages: `tidyverse`, `haven`, `here`, `sjmisc`, `sjPlot`, and `summarytools`. In addition, install and load the `viridisLite` package.
  - Recall that you can use `groundhog.library()` to improve the reproducibility of your script.
  - The `viridisLite` package is helpful for identifying colors that are colorblind accessible.
- Read in the data by typing `YouthData <- read_spss(here("Datasets", "Youth_0.sav"))`, which assigns the data to an R object named `YouthData`.
- Type `YouthData`. This will call the object and provide a brief view of the data. (Note: You can get a similar but more visually appealing view by simply clicking on the object in the “Environment” window.) Your RStudio session should now look a lot like this:
[Figure: View Youth Data]
As in the image, you should see 1,272 rows (or observations) and 23 columns (or variables).
- Type `YouthData %>% view_df()` and hit RUN. Check your Viewer tab to get a better look at the variable names, labels, and values.
- Type `YouthData %>% frq(v77)` to generate a frequency table for the variable that measures the ‘parental supervision scale.’ Using the frequency table, answer questions 10 and 11 in Quiz 5.
[Figure: Frequency Table for v77]
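To see roughly what a frequency table like this contains, here is a minimal base R sketch. It uses a small invented vector (`scores`) rather than the Youth data, and the column names (`n`, `pct`, `cum_pct`) are our own labels, not the exact output of `frq()`.

```r
# Hypothetical scores standing in for a scale variable such as v77
# (invented values, not the Youth data)
scores <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, NA)

counts  <- table(scores)                       # frequency of each observed value (NA is dropped)
valid_n <- sum(counts)                         # number of non-missing cases
pct     <- round(100 * counts / valid_n, 1)    # valid percent for each value
cum_pct <- round(cumsum(100 * counts / valid_n), 1)  # cumulative valid percent

freq_table <- data.frame(
  value   = as.numeric(names(counts)),
  n       = as.vector(counts),
  pct     = as.vector(pct),
  cum_pct = as.vector(cum_pct)
)
print(freq_table)
```

The cumulative percent column is useful when you need to locate quartiles or the median directly from the table.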
Goal: Determine Measures of Dispersion for “fropinon” Variable (Question 5, Ch. 5, p. 145)
Now, we are going to generate frequency tables for three variables, use these tables to determine measures of dispersion (i.e., standard deviation, variance, range, minimum value, and maximum value), and then answer Question 5 on page 145 of your book. These measures of dispersion will help us infer meaningful information about the spread of these distributions in this sample.
You should have read about how to calculate measures of dispersion by hand in the book chapter; you can also calculate these directly in R. For instance, you may have noticed that the frequency table you generated earlier using `sjmisc::frq()` included the standard deviation (“sd=”) in the output. You may also recall that the descriptive statistics table you generated in Assignment 4 using `summarytools::descr()` included the standard deviation, along with the minimum value, maximum value, IQR, and other information. However, for this part of this assignment, you should generate the frequency tables in R and then calculate all dispersion measures by hand. This will help you better understand what the programs are reporting and how they generated these measures. If you want to read more about measures of dispersion and how to calculate them in R, you might want to check out here and here.
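If you want to check your hand calculations, the following sketch walks through the same steps in base R and compares them with R's built-in functions. The vector `x` is invented for illustration and is not taken from the Youth data. Note that `var()` and `sd()` use the n - 1 denominator, and that `IQR()` uses R's default quantile method, which may differ slightly from the by-hand method in your book.

```r
# Hypothetical values (invented for illustration; not the Youth data)
x <- c(1, 2, 2, 3, 3, 3, 4, 5)

n      <- length(x)
x_bar  <- sum(x) / n                  # mean
dev_sq <- (x - x_bar)^2               # squared deviations from the mean

var_by_hand   <- sum(dev_sq) / (n - 1)   # sample variance (n - 1 denominator)
sd_by_hand    <- sqrt(var_by_hand)       # sample standard deviation
range_by_hand <- max(x) - min(x)         # range expressed as a single number

# Variation ratio: proportion of cases NOT in the modal category
counts          <- table(x)
variation_ratio <- 1 - max(counts) / n

# Built-in equivalents for comparison
var(x)
sd(x)
diff(range(x))
IQR(x)
```

If your hand-calculated variance does not match `var()`, the first thing to check is whether you divided by n or by n - 1.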
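- Create a frequency table for the “fropinon” variable by typing `YouthData %>% frq(fropinon)`.
- Create a boxplot by typing `boxplot(YouthData$fropinon)`. Recall that the `$` is a base R operator used to reference an element (variable) within an object (dataset).
[Figure: Boxplot of ‘fropinon’ variable using base R]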
- The `boxplot()` function we used above creates a boxplot of any variable. However, with the base R plotting functions, it is difficult to manipulate and save the boxplot if desired. Instead, we recommend using the `ggplot()` function (from the `ggplot2` package) to generate plots. Below, we will show you how to create a boxplot using `ggplot()`, which you can then customize by changing properties such as its colors, titles, and layout orientation.
- Create a boxplot using `ggplot()`. By adding some color and a title, we can make the boxplot easier and more appealing to read. To start, type `YouthData %>% ggplot(aes(fropinon)) + geom_boxplot()`.
  - `ggplot()` is a function from the `ggplot2` package (loaded as part of the `tidyverse`) that allows us to create graphs and plots.
  - The `aes()` function maps variables to the aesthetics of the graph or plot, such as the axis a variable is drawn on. For example, the boxplot will be oriented along the x-axis if you type `ggplot(aes(fropinon))`, because the first argument to `aes()` is treated as `x`. If you type `ggplot(aes(y=fropinon))`, the plot will be flipped to the y-axis like the base R boxplot above.
  - The `geom_boxplot()` function works like the `geom_histogram()` function you used in earlier assignments. Be sure to include the `+` sign before `geom_boxplot()` since you are “adding” this geometric object layer to the initial XY coordinate plot.
    - Make sure the `+` sign is on the same line as the `ggplot()` function. Otherwise, R will assume you’re done with the `ggplot()` function, and it will not understand that you want to add a boxplot to it.
[Figure: Boxplot of ‘fropinon’ using ggplot]
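The difference between mapping a variable to the x-axis and to the y-axis can be sketched as follows. This example uses a small invented data frame (`toy`, with a made-up `score` column) in place of `YouthData`, so the plots will not match the ones in this assignment.

```r
library(ggplot2)

# Hypothetical data frame standing in for YouthData (values are invented)
toy <- data.frame(score = c(1, 2, 2, 3, 3, 3, 4, 5, 9))

# Horizontal boxplot: the first argument to aes() is mapped to x
p_x <- ggplot(toy, aes(score)) +
  geom_boxplot()

# Vertical boxplot: map the variable to y instead
p_y <- ggplot(toy, aes(y = score)) +
  geom_boxplot()

p_x
p_y
```

Notice that each `+` sits at the end of a line, so R knows another layer is coming.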
- Start with the command `YouthData %>% ggplot(aes(fropinon)) + geom_boxplot()`.
- Inside the parentheses of `geom_boxplot()`, type `fill = "orange", color = "black"`.
  - `fill =` dictates the inner color of the boxplot.
  - `color =` dictates the color of the outline and lines comprising the boxplot. Be sure to include the quotation marks (`""`).
[Figure: How to add color to a ggplot boxplot]
[Figure: Boxplot of ‘fropinon’ in black and orange]
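- To change the overall look of the plot (e.g., a white background), add `+ theme_minimal()` to our ggplot object. Colors can also be specified with hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream) instead of color names.
[Figure: Boxplot of ‘fropinon’ in crimson and cream]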
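Here is a minimal sketch of the color and theme options described above. It again uses an invented data frame (`toy`) rather than `YouthData`. Which hexadecimal code goes to `fill` and which to `color` in the second plot is our assumption, so swap them if you prefer the reverse.

```r
library(ggplot2)

# Hypothetical data frame standing in for YouthData (values are invented)
toy <- data.frame(score = c(1, 2, 2, 3, 3, 3, 4, 5, 9))

# Named colors: orange fill with a black outline
p_named <- ggplot(toy, aes(score)) +
  geom_boxplot(fill = "orange", color = "black")

# Hexadecimal codes plus a minimal theme
# (assumed pairing: cream fill, crimson outline)
p_hex <- ggplot(toy, aes(score)) +
  geom_boxplot(fill = "#EDEBEB", color = "#990000") +
  theme_minimal()

p_named
p_hex
```

Because `fill` and `color` are set inside `geom_boxplot()` rather than inside `aes()`, they apply one fixed color to the whole boxplot.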
- You can read more about the `viridis` package and its color palettes here and here.
- We used the `viridisLite::viridis()` function to request two contrasting colors from the palette (hence the `2` in the parentheses). We assigned the resulting two hexadecimal character codes to an object that we named “cols”, then we assigned the last color in this vector of two colors to “col1” and the first in this vector to “col2.”
- We then used these objects as the fill (`fill=col1`) and outline (`color=col2`) color values.
[Figure: Boxplot of ‘fropinon’ with viridis color scale]
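The viridis steps described above might look something like the sketch below. The object names `cols`, `col1`, and `col2` follow the description, while the data frame `toy` is invented and stands in for `YouthData`.

```r
library(ggplot2)

# Hypothetical data frame standing in for YouthData (values are invented)
toy <- data.frame(score = c(1, 2, 2, 3, 3, 3, 4, 5, 9))

# Request two colors from the viridis palette
cols <- viridisLite::viridis(2)   # character vector of two hexadecimal codes
col1 <- cols[2]                   # last color in the vector (used for the fill)
col2 <- cols[1]                   # first color in the vector (used for the outline)

p_viridis <- ggplot(toy, aes(score)) +
  geom_boxplot(fill = col1, color = col2) +
  theme_minimal()

p_viridis
```

Printing `cols` in the console shows the hexadecimal codes, which you could also paste directly into `fill =` and `color =`.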
- Start with the command `YouthData %>% ggplot(aes(fropinon)) + geom_boxplot(fill = "orange", color = "black")`.
- Type `+ labs(title = "Boxplot of Friends' Opinions on Stealing")` after `geom_boxplot(fill = "orange", color = "black")`. If you break across lines, remember to include the `+` at the end of the previous line and not at the beginning of the new line.
  - `labs()` is a function that allows you to change labels.
  - `title =` designates that you’re working with the boxplot title and not the caption (`caption =`) or a subtitle (`subtitle =`).
[Figure: How to add title to a ggplot boxplot]
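As a rough illustration of `labs()`, the sketch below adds a title along with a subtitle and caption. The data frame `toy` is invented, and the subtitle and caption text are placeholders we made up to show where each label appears.

```r
library(ggplot2)

# Hypothetical data frame standing in for YouthData (values are invented)
toy <- data.frame(score = c(1, 2, 2, 3, 3, 3, 4, 5, 9))

p_titled <- ggplot(toy, aes(score)) +
  geom_boxplot(fill = "orange", color = "black") +
  labs(
    title    = "Boxplot of Friends' Opinions on Stealing",
    subtitle = "Placeholder subtitle (hypothetical)",   # appears under the title
    caption  = "Placeholder caption (hypothetical)"     # appears at the bottom of the plot
  )

p_titled
```

You only need to supply the labels you want; leaving out `subtitle =` or `caption =` simply omits that label.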
Congratulations! You just learned how to create and modify a (jazzed up) boxplot in R!
Goal: Determine Measures of Dispersion for “delinquency” and “certain” Variables (Question 5, Ch. 5, p. 145)
Now that you can create a boxplot in R, you will create boxplots for the “delinquency” and “certain” variables as well. You will do this using the method from above.
- Create a header (`####`) titled: “Boxplot for delinquency”
- Create a boxplot of the “delinquency” variable using `ggplot()`, following the method from above.
[Figure: Boxplot of ‘delinquency’ Variable]
- Add color by typing `fill = "blue", color = "black"` in the parentheses of `geom_boxplot()`. Then, add a title that says “Boxplot of Number of Delinquent Acts”. To do this, type `+ labs(title = "Boxplot of Number of Delinquent Acts")` after `geom_boxplot(fill = "blue", color = "black")`.
  - Alternatively, consider using the `viridis` package’s color palette or various others (e.g., check out the `scico` palettes here and here) to ensure that your plot is accessible for individuals with all forms of colorblindness.
[Figure: Boxplot of ‘delinquency’ variable with colors from viridis scale]
[Figure: Creating a custom boxplot for ‘certain’ variable]
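After completing this assignment, do you know how to:
- calculate measures of dispersion in R (e.g., using `sjmisc::frq()` or `summarytools::descr()`)?
- use `boxplot()` and `ggplot()` to visualize dispersion in a data distribution?
- modify elements of a ggplot geometric object (e.g., `geom_boxplot()`) by adding `fill=` and `color=` followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)?
- add a theme (e.g., `+ theme_minimal()`) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)?
- select colors from a colorblind-accessible palette (e.g., using `viridisLite::viridis()`) and specify them for the outline and fill colors in a ggplot geometric object (e.g., `geom_boxplot()`)?
- add a title to a ggplot object using the `labs()` function (e.g., `+ labs(title = "My Title")`)?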