Assignment 5 Objectives
The purpose of this fifth assignment is to help you use R to complete some
of the SPSS Exercises from the end of Chapter 5 in Bachman, Paternoster,
& Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.
This chapter covered measures of dispersion, including variation
ratio, range, interquartile range, variance, and standard deviation. We
use measures of dispersion to summarize the “spread” (rather than
central tendency) of a data distribution. Likewise, in this assignment,
you will learn how to use R to calculate measures of dispersion and
create boxplots that help us standardize and efficiently describe the
spread of a data distribution. You will also get additional practice
with creating frequency tables and simple graphs in R, and you will
learn how to modify some elements (e.g., color) of a ggplot object. As
with previous assignments, you will be using R Markdown (with R &
RStudio) to complete and submit your work.
By the end of assignment #5, you should…
- be able to calculate measures of dispersion by hand from frequency
tables you generate in R
- be able to generate some measures of dispersion (e.g., standard
deviation) directly in R (e.g., with
sjmisc::frq()
or
summarytools::descr()
)
- be able to generate boxplots using base R
boxplot()
and
ggplot()
to visualize dispersion in a data
distribution
- know how to change outline and fill colors in a ggplot geometric
object (e.g.,
geom_boxplot()
) by adding fill=
and color=
followed by specific color names (e.g.,
“orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB”
for cream)
- know how to add or change a preset theme (e.g.,
+ theme_minimal()
) to a ggplot object to conveniently
modify certain plot elements (e.g., white background color)
- understand how to select colors from a colorblind accessible palette
(e.g., using
viridisLite::viridis()
) and specify them for
the outline and fill colors in a ggplot geometric object (e.g.,
geom_boxplot()
)
- be able to add a title (and subtitle or caption) to a ggplot object
by adding a label with the
labs()
function (e.g.,
+ labs(title = "My Title")
)
Assumptions & Ground Rules
We are building on objectives from Assignments 1-4. By the start of this
assignment, you should already know how to:
Basic R/RStudio skills
- create an R Markdown (RMD) file and add/modify text, level headers,
and R code chunks within it
- install/load R packages and use hashtags (“#”) to comment out
sections of R code so it does not run
- recognize when a function is being called from a specific package
using a double colon with the
package::function()
format
- read in an SPSS data file in an R code chunk using
haven::read_spss()
and assign it to an R object using an
assignment (<-
) operator
- use the
$
symbol to call a specific element (e.g., a
variable, row, or column) within an object (e.g., dataframe or tibble),
such as with the format dataobject$varname
- use a tidyverse
%>%
pipe operator to perform a
sequence of actions
- knit your RMD document into an HTML file that you can then save and
submit for course credit
Reproducibility
- use
here()
for a simple and reproducible
self-referential file directory method
- use
groundhog.library()
as an optional but recommended
reproducible alternative to library()
for loading
packages
Data viewing & wrangling
- use the base R
head()
function to quickly view a
snapshot of your data
- use the
glimpse()
function to quickly view all columns
(variables) in your data
- use
sjPlot::view_df()
to quickly browse variables in a
data file
- use
attr()
to identify variable and attribute value
labels
- recognize when missing values are coded as
NA
for
variables in your data file
- select and recode variables using dplyr’s
select()
,
mutate()
, and if_else()
functions
Descriptive data analysis
- use
summarytools::dfSummary()
to quickly describe one
or more variables in a data file
- create frequency tables with
sjmisc::frq()
and
summarytools::freq()
functions
- sort frequency distributions (lowest to highest/highest to lowest)
with
summarytools::freq()
- calculate measures of central tendency for a variable distribution
using base R functions
mean()
and median()
(e.g., mean(data$variable
))
- calculate central tendency and other basic descriptive statistics
for specific variables in a dataset using
summarytools::descr()
and psych::describe()
functions
Data visualization & aesthetics
- improve some knitted tables by piping a function’s results to
gt()
(e.g., head(data) %>% gt()
)
- create basic graphs using ggplot2’s
ggplot()
function
If you do not recall how to do these things, review Assignments
1-4.
Additionally, you should have read the assigned book chapter and
reviewed the SPSS questions that correspond to this assignment, and you
should have completed any other course materials (e.g., videos;
readings) assigned for this week before attempting this R assignment. In
particular, for this week, I assume you understand:
- measures of dispersion, such as variation ratio, range,
interquartile range (IQR), variance, and standard deviation
- the difference between range and IQR
- the relationship between variance and standard deviation
- how to calculate range, variation ratio, and IQR
- how to calculate variance of a population, a sample, and a sample
with grouped data
- how to calculate standard deviation of a population, a sample, and a
sample with grouped data
- how to calculate sample variance and standard deviation with
ungrouped and grouped data using computational formulas
- boxplots, including steps for boxplot construction, elements of a
boxplot, and how to read a boxplot to summarize the central tendency and
dispersion of a data distribution
As noted previously, for this and all future assignments, you MUST
type all commands in by hand. Do not copy & paste except for
troubleshooting purposes (i.e., if you cannot figure out what you
mistyped).
Part 1 (Assignment 5.1)
Goal:
Read in Youth Data and Determine Measures of Dispersion
(Note: Remember that, when
following instructions, always substitute “LastName” for your own last
name and substitute YEAR_MO_DY for the actual date. E.g.,
2022_06_08_Fordham_K300Assign5)
In the last assignment, you learned how to identify or calculate
measures of central tendency from frequency tables to summarize the most
common or “expected” value of a data distribution. In doing so, you
learned how to decide which measures of central tendency are most
appropriate or useful for summarizing specific variables. In this
assignment, you will use frequency tables and boxplots to calculate
measures of and visualize dispersion for several variables.
- Go to your K300_L folder, which should contain the R Markdown file
you created for Assignment 4 (named
YEAR_MO_DY_LastName_K300Assign4). Click to open the R
Markdown file.
- Remember, we open RStudio in this way so the
here
package will automatically set our K300_L folder as the top-level
directory.
- In RStudio, open a new R Markdown document. If you do not recall how
to do this, refer to Assignment 1.
- The dialogue box asks for a Title, an
Author, and a Default Output Format
for your new R Markdown file.
- In the Title box, enter K300 Assignment
5.
- In the Author box, enter your First and Last Name
(e.g., Tyeisha Fordham).
- In the Default Output Format box, select “HTML
document” (HTML is usually the default selection).
- Remember that the new R Markdown file contains a simple
pre-populated template to show users how to do basic tasks like add
settings, create text headings and text, insert R code chunks, and
create plots. Be sure to delete this text before you begin working.
- Create a second-level header titled: “Part 1 (Assignment 5.1).”
Then, create a third-level header titled: “Read in Youth Data and
Determine Measures of Dispersion”
- This assignment must be completed by the student and the student
alone. To confirm that this is your work, please begin all assignments
with this text: This R Markdown document contains my work for
Assignment 5. It is my work and
only my work.
- Now, you need to get data into RStudio. You already know how to do
this, but please refer to Assignment 1 if you cannot recall.
- Create a third-level header in your R Markdown (hereafter, “RMD”) file
titled: “Load Libraries”
- Insert an R chunk.
- Inside the new R code chunk, load the following packages:
tidyverse
, haven
, here
,
sjmisc
, sjPlot
, and summarytools
.
In addition, install and load the viridisLite
package.
- Recall, you only need to install packages one time. However, you
must load them each time you start a new R session. Also, remember that
you can optionally use (and we recommend)
groundhog.library()
to improve the reproducibility of your
script.
- In this assignment, you will learn to customize your ggplot graphs,
including changing the default color scheme to any colors you want. As
we will explain later, the
viridisLite
package is helpful
for identifying colors that are colorblind accessible.
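A minimal sketch of what this code chunk might contain is shown below. It assumes every listed package is already installed on your machine; the install.packages() line is commented out with a hashtag so it does not rerun each time you knit.

```r
# Install viridisLite one time only, then comment the line out again
# install.packages("viridisLite")

# Load the packages needed for this assignment (required every new R session)
library(tidyverse)     # includes ggplot2, dplyr, and the %>% pipe
library(haven)         # read_spss()
library(here)          # here()
library(sjmisc)        # frq()
library(sjPlot)        # view_df()
library(summarytools)  # freq(), descr()
library(viridisLite)   # viridis() colorblind-accessible colors
```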
- After your first code chunk, create another third-level header in
RMD titled: “Read Data into R”
- Insert another R code chunk.
- In the new R code chunk, read in the “Youth_0.sav” SPSS data file and
assign it to an R data object named
YouthData
.
- Forget how to do this? Refer to Assignment 1.
- In the same code chunk, on a new line below your read data/assign
object command, type the name of your new R data object:
YouthData
. This will call the object and provide a brief
view of the data. (Note: You can
get a similar but more visually appealing view by simply clicking on the
object in the “Environment” window.) Your RStudio session should
now look a lot like this:
YouthData <- read_spss(here("Datasets", "Youth_0.sav"))
As in the image, you should see 1,272 rows (or observations) and 23
columns (or variables).
- Now, insert an R chunk, type
YouthData %>% view_df()
, and hit RUN. Check your Viewer
tab to get a better look at the variable names, labels, and values.
- Forget how to do this? Refer to Assignment 2.
- Create a third-level header titled: “Frequency Table for ‘v77’
Variable”
- Create a new R code chunk and type
YouthData %>% frq(v77)
to generate a frequency table for
the variable that measures the ‘parental supervision scale.’ Using the
frequency table, answer questions 10 and 11 in Quiz 5.
- Note: R is case sensitive! Be sure you are
typing “v77” with a lower-case, not upper-case, ‘v’.
- Your frequency table should look like this:
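To illustrate how a frequency table feeds the hand calculations, here is a minimal base R sketch. The scores vector is made up for illustration and is not the v77 data, and base R's table() stands in for frq() so the example runs without extra packages.

```r
# Hypothetical scores (NOT the v77 data), used only to show the logic
scores <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5)

# Frequency table: the count of cases at each value
freq_table <- table(scores)
freq_table

# Variation ratio = 1 - (proportion of cases in the modal category)
variation_ratio <- 1 - max(freq_table) / sum(freq_table)

# Range = maximum value - minimum value
data_range <- max(scores) - min(scores)

variation_ratio  # 0.7 (the mode, 3, holds 3 of the 10 cases)
data_range       # 4
```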
Part 2 (Assignment 5.2)
Goal:
Determine Measures of Dispersion for “fropinon” Variable (Question 5,
Ch.5 (pp.145))
Now, we are going to generate frequency tables for three variables,
use these tables to determine measures of dispersion, and then answer
Question 5 on page 145 of your book (i.e., standard deviation, variance,
range, minimum value, and maximum value). These measures of
dispersion will help us infer meaningful information about the spread of
these distributions in this sample.
You should have read about how to calculate measures of dispersion by
hand in the book chapter; you can also calculate these directly in R.
For instance, you may have noticed that the frequency table you
generated earlier using sjmisc::frq()
included the standard
deviation (“sd=”) in the output. You may also recall that the
descriptive statistics table you generated in Assignment 4 using
summarytools::descr()
included the standard deviation,
along with the minimum value, maximum value, IQR, and other information.
However, for this part of this assignment, you should generate
the frequency tables in R and then calculate all dispersion measures by
hand. This will help you better understand what the programs
are reporting and how they generated these measures. If you want to read
more about measures of dispersion and how to calculate them in R, you
might want to check out here and here.
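As a small worked illustration of what these programs report, the sketch below computes a sample variance and standard deviation from a frequency table and then checks the results against base R's var() and sd(). It uses the definitional (deviation-score) formula rather than the book's computational formula, and the x values are invented for this example rather than taken from the Youth data.

```r
# Hypothetical data (not from YouthData)
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5)

# Build the frequency table: each distinct value and its count (f)
freq   <- table(x)
values <- as.numeric(names(freq))
counts <- as.numeric(freq)
n      <- sum(counts)

# Mean from grouped data: sum(f * x) / n
x_bar <- sum(counts * values) / n

# Sample variance from grouped data: sum(f * (x - mean)^2) / (n - 1)
s2 <- sum(counts * (values - x_bar)^2) / (n - 1)

# Standard deviation is the square root of the variance
s <- sqrt(s2)

s2  # about 1.73
s   # about 1.32

# Both should match what R computes directly
all.equal(s2, var(x))
all.equal(s, sd(x))
```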
- Create a second-level header titled: “Part 2 (Assignment 5.2).”
Then, create a third-level header titled: “Calculate Measures of
Dispersion for fropinon, delinquency, and
certain”
- Note: The “fropinon” variable is a five-category
ordinal measure asking respondents how wrong they think their friends
think it is to steal. Responses range from 1 (always wrong) to 5 (never
wrong). However, the variable is misspelled – instead of “fropinion”
with two i’s, the variable is “fropinon” with one ‘i’. Be sure to spell
the variable as it is found in the dataset when referencing it in R code
chunks. Also, remember that R is case sensitive. So, if you type
“fropinion” or “Fropinon” instead, R will not be able to find the
variable!
- Create a new R code chunk and type
YouthData %>% frq(fropinon)
- Repeat the above step for the other 2 variables, “delinquency” and
“certain”. Before each new R chunk, create a third-level header titled:
“Frequency Table of [Variable Name]”. For example, when you create the
frequency table for the “delinquency” variable, create a third-level
header above it titled “Frequency Table of ‘delinquency’”. Then, answer
question 5 on page 145 (Questions 12 and 13 in Quiz 5 on Canvas).
- Graphical representations can be helpful, especially for determining
the shape (or skew) of a distribution. They can also help to determine
measures of dispersion, such as range and interquartile range. In the
next section, you will create a boxplot for each of these variables to
answer questions 6 and 7 on page 145 of B&P (Questions 14-16 in Quiz
5).
- Create a third-level header titled: “Basic Boxplot of
fropinon”
- Insert an R chunk. You can create a simple boxplot using base R by
typing
boxplot(YouthData$fropinon)
. Recall that the
$
is a base R operator used to reference an element
(variable) within an object (dataset).
- Your RStudio should look like this:
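As a rough sketch of the same call, the code below draws a base R boxplot for a made-up vector (standing in for YouthData$fropinon) and then uses fivenum() to print the five values the boxplot is built from. The responses vector is hypothetical, and hinge-based quartiles can differ slightly from quartiles calculated by other methods.

```r
# Hypothetical 1-5 responses standing in for YouthData$fropinon
responses <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5)

# Draw the boxplot with base R
boxplot(responses)

# Tukey five-number summary:
# minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(responses)
five  # 1 1 2 3 5

# Dispersion measures read off the five-number summary
five[5] - five[1]  # range: 4
five[4] - five[2]  # spread between the hinges (the box length): 2
```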
- The base R boxplot() function we used above creates a
boxplot of a variable. However, with the base R plotting functions, it
is more difficult to modify and save the boxplot. We therefore
recommend using the ggplot() function (from the
ggplot2 package) to generate plots instead. Below, we will
show you how to create a boxplot using ggplot(), whose
properties (including its colors, titles, and
layout orientation) you can then customize.
- Now, let’s jazz up this boxplot a bit by recreating and then
modifying it using
ggplot()
. By adding some color and a
title, we can make the boxplot easier and more appealing to read.
- Create a third-level header titled: “Boxplot of fropinon
using ggplot()”
- Insert another R chunk and type
YouthData %>% ggplot(aes(fropinon)) + geom_boxplot()
.
- Recall, ggplot() is a function from the ggplot2 package, which is
loaded as part of the tidyverse, that allows us to create graphs and
plots.
- The aes() function maps variables to aesthetic properties of the
graph or plot, such as its axes, which determines the plot's
orientation. For example, the boxplot will be drawn along the x-axis
if you type ggplot(aes(fropinon)), because the first argument of
aes() is x. If you type ggplot(aes(y=fropinon)), the plot will be
flipped to the y-axis like the base R boxplot above.
- The
geom_boxplot()
function works like the
geom_histogram()
function you used in earlier assignments.
Be sure to include the +
sign before
geom_boxplot()
since you are “adding” this geometric object
layer to the initial XY coordinate plot.
- Note: If you break your code into multiple lines
(as pictured below), be sure that the + sign is at the end of the
line containing the ggplot() function, not at the start of the next
line. Otherwise, R will assume you’re done with the ggplot()
function, and it will not understand that you want to add a boxplot
to it.
- Your RStudio should look like this:
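A minimal sketch of this step, broken across lines, might look like the following. The toy_data tibble is a hypothetical stand-in for YouthData so the example runs on its own; with the real data you would start the pipe from YouthData instead.

```r
library(tidyverse)

# Hypothetical stand-in for YouthData (made-up responses)
toy_data <- tibble(fropinon = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5))

# Each line that continues ends with %>% or +,
# so R knows the command is not finished yet
p_basic <- toy_data %>%
  ggplot(aes(fropinon)) +
  geom_boxplot()

# Print the plot
p_basic
```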
- Next, we can add some color.
- Create a third-level header titled: “Add Color to fropinon
boxplot”. Then, create a new R chunk and type
YouthData %>% ggplot(aes(fropinon)) + geom_boxplot()
.
- Inside the parentheses after geom_boxplot, type
fill = "orange", color = "black".
fill = dictates the inner color of the boxplot.
color = dictates the color of the outline and lines
comprising the boxplot. Be sure to include the quotation marks
("").
- Your code should look like this:
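As a sketch of this step, the code below adds the fill and outline colors inside geom_boxplot(). The toy_data tibble is a hypothetical stand-in for YouthData; only the data source differs from the code described above.

```r
library(tidyverse)

# Hypothetical stand-in for YouthData (made-up responses)
toy_data <- tibble(fropinon = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5))

# fill = inner color of the box; color = outline, whiskers, and median line
p_color <- toy_data %>%
  ggplot(aes(fropinon)) +
  geom_boxplot(fill = "orange", color = "black")

p_color
```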
- Following the procedures above, we can use built-in color palettes
to change the outline or fill to nearly any colors you want, such as
yellow, turquoise, or magenta! I often use unique hexadecimal codes
instead of color names to precisely select specific colors.
- For example, we can use the hex values for
IU’s colors, which are “#990000” for crimson and “#EDEBEB” for
cream.
- To improve the “cream” contrast on our plot, we will also specify a
minimal theme with a white background by adding
+ theme_minimal()
to our ggplot object.
- Note: As the link above explains, on the web, IU
substitutes gray for the cream in their primary colors because cream
does not reproduce well in online environments.
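One way this could look in code is sketched below. Which hex color goes to the fill and which to the outline is an assumption made for this example (cream fill, crimson outline), and toy_data is a hypothetical stand-in for YouthData.

```r
library(tidyverse)

# Hypothetical stand-in for YouthData (made-up responses)
toy_data <- tibble(fropinon = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5))

# Hex codes are typed in quotation marks, just like color names.
# Assumption for this sketch: cream fill with a crimson outline.
p_hex <- toy_data %>%
  ggplot(aes(fropinon)) +
  geom_boxplot(fill = "#EDEBEB", color = "#990000") +
  theme_minimal()  # white background so the cream box stands out

p_hex
```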
- The options are nearly limitless! However, when customizing your
graphs, we recommend using a package like “viridis” or “viridisLite” to
help you choose colors that are accessible for individuals with all
forms of colorblindness. You can learn more about the
viridis
package and its color palettes here
and here.
- As explained in the links above, we can use the “viridisLite”
package to automatically apply the viridis color scale to certain ggplot
graphs. However, we can also choose to select our own colors manually.
We show you one way to do this in the boxplot below, where we manually
specified two colors from the viridis color palette. Remember, you can
click the “Code” button to see how we did it.
- First, we use the
viridisLite::viridis()
function to
request two contrasting colors from the palette (hence the
2
in the parentheses). We assigned the resulting two
hexadecimal character codes to an object that we named “cols”, then we
assigned the last color in this vector of two colors to “col1” and the
first in this vector to “col2.”
- From here, we used our ggplot code from earlier to regenerate the
boxplot, but we substituted our “col1” and “col2” hexadecimal character
objects in for our boxplot’s fill (
fill=col1
) and outline
(color=col2
) color values.
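The two steps just described can be sketched as follows. The object names cols, col1, and col2 come from the description above; toy_data is a hypothetical stand-in for YouthData, and the exact plot code in the original may differ.

```r
library(tidyverse)
library(viridisLite)

# Hypothetical stand-in for YouthData (made-up responses)
toy_data <- tibble(fropinon = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5))

# Request two contrasting colors from the viridis palette;
# the result is a character vector of two hex codes
cols <- viridis(2)
cols

# Last color in the vector becomes the fill, first becomes the outline
col1 <- cols[2]
col2 <- cols[1]

# No quotation marks here: col1 and col2 are objects, not color names
p_viridis <- toy_data %>%
  ggplot(aes(fropinon)) +
  geom_boxplot(fill = col1, color = col2)

p_viridis
```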
- Lastly (for now, at least), we can add titles and labels to our
boxplot. This makes it easier for you and any other reader to know
what you have plotted. For example, we know the ‘fropinon’ variable
contains survey responses to a question asking participants how wrong
they think their friends think it is to steal, with response values
ranging from 1 (always wrong) to 5 (never wrong). So, we will title the
boxplot “Boxplot of Friends’ Opinions on Stealing”.
- Create a third-level header called “Adding Boxplot Title”. Then,
insert an R chunk.
- Type
YouthData %>% ggplot(aes(fropinon)) + geom_boxplot(fill = "orange", color = "black")
.
- Then, add a plot title by typing
+ labs(title = "Boxplot of Friends' Opinions on Stealing")
after geom_boxplot(fill = "orange", color = "black")
. If
you break across lines, remember to include the +
at the
end of the previous line and not at the beginning of the new line.
labs()
is a function that allows you to change
labels.
title =
designates that you’re working with the boxplot
title and not the caption (caption =
) or a subtitle
(subtitle =
).
- Your R Studio should look something like this:
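A sketch of the titled boxplot, broken across lines, is shown below. The toy_data tibble is a hypothetical stand-in for YouthData, and the subtitle and caption text are invented here only to show where those optional labels go.

```r
library(tidyverse)

# Hypothetical stand-in for YouthData (made-up responses)
toy_data <- tibble(fropinon = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5))

# Each + sits at the END of a line, never at the start of the next one
p_title <- toy_data %>%
  ggplot(aes(fropinon)) +
  geom_boxplot(fill = "orange", color = "black") +
  labs(
    title = "Boxplot of Friends' Opinions on Stealing",
    subtitle = "1 = always wrong, 5 = never wrong",  # optional, made up
    caption = "Toy data for illustration only"       # optional, made up
  )

p_title
```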
Congratulations! You just learned how to create and modify a (jazzed
up) boxplot in R!
Part 3 (Assignment 5.3)
Goal:
Determine Measures of Dispersion for “delinquency” and “certain”
Variable (Question 5, Ch.5 (pp.145))
Now that you can create a boxplot in R, you will create boxplots for
the “delinquency” and “certain” variables as well. You will do this
using the method from above.
- Create a second-level header titled: Part 3 (Assignment 5.3). Create
a third-level header titled: “Calculate Measures of Dispersion for
delinquency and certain (Question 5, Ch.5 (pp.145))”.
Then, create a fourth-level header (type
####
) titled:
“Boxplot for delinquency”
- Insert an R chunk and create a boxplot without color or a title for
the “delinquency” variable. Your RStudio should look like this:
- Now, add colors to the boxplot by typing
fill = "blue", color = "black"
in the parentheses of
geom_boxplot()
. Then, add a title that says “Boxplot of
Number of Delinquent Acts”. To do this, type
+ labs(title = "Boxplot of Number of Delinquent Acts")
after geom_boxplot(fill = "blue", color = "black")
.
- You can use any colors you want for your boxplot! Try switching the
colors to red, yellow, purple, and so on to see what looks best to you.
Remember, you can also use colors from the
viridis
package’s color palette or various others (e.g., check out the
scico
palettes here
and here) to ensure
that your plot is accessible for individuals with all forms of
colorblindness.
- Your R Studio should look something like this:
- Lastly, repeat this process with the “certain” variable.
- Create a fourth-level header titled: “Boxplot for
certain”
- Insert an R chunk. We want the boxplot to have customized colors with
the title “Boxplot for Certainty of Being Punished.” Remember, you can
use any colors you want!
- You should now have everything that you need to complete the
questions in Assignment 5 that parallel those from B&P’s SPSS
Exercises for Chapter 5! Complete the remainder of the questions in
Assignment 5 in your RMD file.
- Keep the file clean and easy to follow by using RMD level headings
(e.g., denoted with ## or ###) separating R code chunks, organized by
assignment questions.
- Write plain text after headings and before or after code chunks to
explain what you are doing - such text will serve as useful reminders to
you when working on later assignments!
- Upon completing the assignment, “knit” your final RMD file again and
save the final knitted HTML file to your “Assignments” folder as:
YEAR_MO_DY_LastName_K300Assign5. Submit via Canvas in
the relevant section (i.e., the last question) for Assignment 5.5.
Assignment 5 Objective Checks
After completing assignment #5…
- are you able to calculate measures of dispersion by hand from
frequency tables you generate in R?
- are you able to generate some measures of dispersion (e.g., standard
deviation) directly in R (e.g., with
sjmisc::frq()
or
summarytools::descr()
)?
- are you able to generate boxplots using base R
boxplot()
and ggplot()
to visualize dispersion
in a data distribution?
- do you know how to change outline and fill colors in a ggplot
geometric object (e.g.,
geom_boxplot()
) by adding
fill=
and color=
followed by specific color
names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for
crimson; “#EDEBEB” for cream)?
- do you know how to add or change a preset theme (e.g.,
+ theme_minimal()
) to a ggplot object to conveniently
modify certain plot elements (e.g., white background color)?
- do you understand how to select colors from a colorblind accessible
palette (e.g., using
viridisLite::viridis()
) and specify
them for the outline and fill colors in a ggplot geometric object (e.g.,
geom_boxplot()
)?
- are you able to add a title (and subtitle or caption) to a ggplot
object by adding a label with the
labs()
function (e.g.,
+ labs(title = "My Title")
)?