Assignment 9 Objectives
The purpose of this ninth assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 9 in Bachman, Paternoster, & Wilson's Statistics for Criminology & Criminal Justice, 5th Ed.
In the last assignment, you learned how to conduct a one-sample
z or t hypothesis test of the difference between a
sample and population mean and then, given the test results and the null
hypothesis, to make an appropriate inference about the population mean
by either rejecting or failing to reject the null hypothesis. In this
assignment, you will learn how to make population inferences about the
relationship between two categorical variables by conducting a
chi-squared test of independence on a sample contingency table
(crosstab).
By the end of assignment #9, you should…
- recognize you can manually build a simple tibble row-by-row using tidyverse's tibble::tribble()
- recognize you can use round() to specify the number of decimals on numeric values
- recognize you can modify a gt() table to add or remove decimals in specific columns or rows with fmt_number()
- be able to conduct a chi-squared test of independence using sjPlot::sjtab() or chisq.test() and interpret the results
- know how to specify different measures of association with statistics= using sjPlot::sjtab() and interpret them appropriately
- know how to generate observed and expected frequencies by assigning the results of chisq.test() to an object (e.g., chisq) and then calling elements from that object (e.g., chisq$observed or chisq$expected), as previewed in the sketch after this list
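To preview several of these objectives, here is a minimal sketch using a small hypothetical 2x2 table; the variable names and counts are invented for illustration and are not from the course data.

```r
library(tidyverse)
library(gt)

# Build a small tibble row-by-row with tribble()
tab <- tribble(
  ~group,    ~yes, ~no,
  "Treated",  30,   20,
  "Control",  15,   35
)

# round() controls decimals on a numeric value; fmt_number() does the
# same for specific columns inside a gt() table
tab %>%
  mutate(prop_yes = round(yes / (yes + no), 2)) %>%
  gt() %>%
  fmt_number(columns = prop_yes, decimals = 2)

# Chi-squared test of independence on the observed counts
chisq <- chisq.test(as.matrix(tab[, c("yes", "no")]))
chisq            # test statistic, degrees of freedom, p-value
chisq$observed   # observed cell frequencies
chisq$expected   # expected cell frequencies under the null of independence
```

sjPlot::sjtab() offers a pipe-friendly alternative for crosstabs; as noted above, its statistics= argument requests measures of association.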
Assumptions & Ground Rules
We are building on objectives from Assignments 1-8. By the start of this assignment, you should already know how to:
Basic R/RStudio skills
- create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
- knit your RMD document into an HTML file that you can then save and submit for course credit
- install/load R packages and use hashtags ("#") to comment out sections of R code so they do not run
- recognize when a function is being called from a specific package using a double colon with the package::function() format
- read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
- use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., a dataframe or tibble), such as with the format dataobject$varname
- use a tidyverse %>% pipe operator to perform a sequence of actions
- recognize the R operator != as "not equal to"
- turn off or change scientific notation in R, such as with options(scipen=999, digits = 3)
- create a list or vector and assign it to an object, such as listname <- c(item1, item2)
- recognize that you can use lapply() to create a list of objects, which can help you avoid cluttering the R Environment with objects
- recognize that you can create your own R functions (e.g., our funxtoz() function) - and that doing so is recommended for repeated tasks to avoid copy-and-paste errors (see the short refresher sketch after this list)
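As a quick refresher on several of these basics, here is a short sketch; the file name and variable names are hypothetical placeholders.

```r
library(here)
library(haven)
library(dplyr)

options(scipen = 999, digits = 3)   # turn off scientific notation

# Read an SPSS file and assign it to an object (file and variable names are hypothetical)
dat <- read_spss(here("Datasets", "mydata.sav"))

mean(dat$age, na.rm = TRUE)   # $ calls one variable from the data object

myvals <- c(1, 2, 3)          # create a vector and assign it to an object

dat %>%                       # pipe a sequence of actions
  filter(age != 99) %>%       # != means "not equal to"
  head()
```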
Reproducibility
- use here() for a simple and reproducible self-referential file directory method
- use groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages
- improve reproducibility of randomization tasks in R by setting the random number generator seed using set.seed()
- know that you can share examples or troubleshoot code in a reproducible way by using built-in datasets like mtcars that are universally available to R users (a brief sketch follows this list)
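A compact reminder of these reproducibility tools; the seed value, date, and file path below are placeholders.

```r
library(here)

set.seed(1138)                    # fix the random number generator seed

here("Datasets", "mydata.sav")    # build a path relative to the project root

# groundhog.library() is an optional, version-stable alternative to library():
# groundhog::groundhog.library("dplyr", "2024-01-01")

# Built-in data such as mtcars make examples reproducible for any R user
head(mtcars)
```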
Data viewing & wrangling
- use the base R head() function to quickly view the first few rows of data
- use the base R tail() function to quickly view the last few rows of data
- use the glimpse() function to quickly view all columns (variables) in your data
- use sjPlot::view_df() to quickly browse variables in a data file
- use attr() to identify variable and attribute value labels
- recognize when missing values are coded as NA for variables in your data file
- remove missing observations from a variable in R when appropriate using filter(!is.na(var))
- change a numeric variable to a factor (e.g., nominal or ordinal) variable with haven::as_factor()
- drop an unused factor level (e.g., missing "Don't know" label) on a variable using data %>% droplevels(data$variable)
- select and recode variables using dplyr's select(), mutate(), and if_else() functions
- convert raw column (variable) values into standardized z-score values using mutate()
- select a random sample from data without or with replacement using dplyr::sample_n()
- select data with conditions using dplyr::filter() and the %in% operator
- simulate data from normal, truncated normal, or uniform probability distributions using rnorm(), truncnorm::rtruncnorm(), or runif()
- draw random samples from data in R
- draw one random sample with dplyr::slice_sample()
- draw multiple ("replicate") random samples with infer::rep_slice_sample() (a short wrangling sketch follows this list)
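The sketch below revisits a few of these wrangling steps using the built-in mtcars data; its variables obviously differ from the course data file.

```r
library(dplyr)

glimpse(mtcars)   # view all columns at once

mtcars %>%
  filter(!is.na(mpg), cyl %in% c(4, 6)) %>%         # keep non-missing, selected values
  mutate(cyl = as.factor(cyl),                      # numeric -> factor
         z_mpg = (mpg - mean(mpg)) / sd(mpg)) %>%   # standardized z-scores
  select(mpg, z_mpg, cyl) %>%
  slice_sample(n = 5)                               # one random sample of 5 rows
```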
Descriptive data analysis
- use summarytools::dfSummary() to quickly describe one or more variables in a data file
- create frequency tables with sjmisc::frq() and summarytools::freq() functions
- sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()
- calculate measures of central tendency for a frequency distribution
- calculate central tendency using base R functions mean() and median() (e.g., mean(data$variable))
- calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() and psych::describe() functions
- calculate measures of dispersion for a variable distribution
- calculate dispersion measures by hand from frequency tables you generate in R
- calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc::frq() or summarytools::descr())
- recognize and read the basic elements of a contingency table (aka crosstab)
- place the IV in columns and the DV in rows of a crosstab
- recognize column/row marginals and their overlap with univariate frequency distributions
- calculate marginal, conditional, and joint (frequentist) probabilities
- compare column percentages (when an IV is in the columns of a crosstab)
- generate and modify a contingency table (crosstab) in R with dplyr::select() & sjPlot::sjtab(depvar, indepvar) or with crosstable(depvar, by=indepvar) (a brief descriptive refresher follows this list)
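Here is a brief descriptive-statistics refresher, again leaning on mtcars as a stand-in for the course data; treating am as the IV and cyl as the DV is purely illustrative.

```r
library(sjmisc)

frq(mtcars$cyl)     # frequency table

mean(mtcars$mpg)    # central tendency
median(mtcars$mpg)
sd(mtcars$mpg)      # dispersion

# Simple crosstab: IV (am) in columns, DV (cyl) in rows
table(cyl = mtcars$cyl, am = mtcars$am)
prop.table(table(mtcars$cyl, mtcars$am), margin = 2)   # column proportions
```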
Data visualization & aesthetics
- improve some knitted tables by piping a function's results to gt() (e.g., head(data) %>% gt())
- modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
- create basic graphs using ggplot2's ggplot() function
- generate simple bar charts and histograms to visualize the shape and central tendency of a frequency distribution
- generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution
- modify elements of a ggplot object
- change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., "orange") or hexadecimal codes (e.g., "#990000" for crimson; "#EDEBEB" for cream)
- add or change a preset theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
- select colors from a colorblind-accessible palette (e.g., using viridisLite::viridis()) and specify them for the outline and fill colors in a ggplot geometric object (e.g., geom_boxplot())
- add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title"))
- be able to combine multiple ggplot() plots into a single figure using the "patchwork" package
- know how to customize the plot layout and add a title to a patchwork figure
- recognize that one can write a custom function to repeatedly generate similar plots before combining them with patchwork
- recognize you can use patchwork::wrap_plots() to quickly combine ggplot objects contained in a list
- recognize that one can plot means and confidence intervals using ggplot() + geom_point() + geom_errorbar()
- recognize that one can add elements like vertical lines (+ geom_vline()), arrows (+ geom_segment()), or text annotations (+ annotate()) to a ggplot() object (a short plotting sketch follows this list)
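A small ggplot2/patchwork refresher; the colors, titles, and layout below are illustrative choices only.

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "#EDEBEB", color = "#990000") +   # fill & outline colors
  theme_minimal() +
  labs(title = "MPG by cylinders")

p2 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "orange", color = "black") +
  theme_minimal() +
  labs(title = "MPG distribution")

(p1 + p2) + plot_annotation(title = "Two plots combined with patchwork")
```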
Hypothesis testing & statistical inference
- conduct and interpret a null hypothesis significance test
- specify a null (test) hypothesis & identify the contrasting alternative hypothesis (or hypotheses)
- set an alpha or significance level (e.g., as a risk tolerance or false positive error control rate)
- calculate a test statistic and corresponding p-value
- compare a test p-value to the alpha level and then determine whether the evidence is sufficient to reject the null hypothesis or should result in a failure to reject the null hypothesis
- conduct a binomial hypothesis test in R with rstatix::binom_test()
- generate and interpret confidence intervals to quantify uncertainty caused by sampling variability
- identify t or z critical values associated with a two-tailed confidence level using qt() or qnorm()
- estimate the standard error of a sample mean or proportion in R
- estimate a two-tailed confidence interval around a sample mean or proportion in R
- properly interpret and avoid common misinterpretations of confidence intervals
- be able to conduct a one-sample z or t hypothesis test of the difference between a sample and assumed population mean in R and interpret results
- be able to conduct a one-sample test using the base R t.test() function
- be able to manually calculate a z or t test statistic by typing the formula in R
- be able to conduct a one-sample test using infer::t_test()
- know how to use infer::visualize() to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis (a brief testing sketch follows this list)
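And a short hypothesis-testing refresher with mtcars; the null value of mu = 20 below is arbitrary and chosen only to illustrate the mechanics.

```r
# One-sample t test with base R
t.test(mtcars$mpg, mu = 20)

# Critical values for a two-tailed 95% confidence level
qt(0.975, df = length(mtcars$mpg) - 1)   # t
qnorm(0.975)                             # z

# Standard error and 95% CI around the sample mean, by hand
se <- sd(mtcars$mpg) / sqrt(length(mtcars$mpg))
mean(mtcars$mpg) + c(-1, 1) * qt(0.975, df = length(mtcars$mpg) - 1) * se
```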
If you do not recall how to do these things, review
Assignments 1-8.
Additionally, you should have read the assigned book chapter and
reviewed the SPSS questions that correspond to this assignment, and you
should have completed any other course materials (e.g., videos;
readings) assigned for this week before attempting this R assignment. In
particular, for this week, I assume you understand:
- contingency tables or crosstabs
- joint frequency distribution
- column marginal or column frequency
- row marginal or row frequency
- how to compare percentage differences (across IV and within DV
categories)
- chi-squared test of independence
- observed frequency
- expected frequency
- how to calculate with the definitional and computational
formulas
- measures of association
- positive and negative relationships
- how to calculate and when to use phi-coefficient, contingency
coefficient, Cramer’s V (e.g., table size, levels of
measurement)
- how to calculate and when to use proportionate reduction in error
(PRE) measures of association including lambda, Goodman & Kruskal’s
gamma, or Yule’s Q (e.g., table size, levels of measurement)
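To connect the observed frequency, expected frequency, and chi-squared formula ideas above, here is a small worked example on a hypothetical 2x2 table of counts (the numbers are invented):

```r
obs <- matrix(c(30, 20,
                15, 35), nrow = 2, byrow = TRUE)

row_tot <- rowSums(obs)
col_tot <- colSums(obs)
n       <- sum(obs)

# Expected frequency for each cell: (row total * column total) / n
exp_freq <- outer(row_tot, col_tot) / n

# Definitional formula: sum of (observed - expected)^2 / expected
chi_sq <- sum((obs - exp_freq)^2 / exp_freq)
chi_sq

# For a 2x2 table, the phi coefficient is sqrt(chi-squared / n)
sqrt(chi_sq / n)

# Same statistic from chisq.test() (without the continuity correction)
chisq.test(obs, correct = FALSE)$statistic
```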
As noted previously, for this and all future assignments, you MUST
type all commands in by hand. Do not copy & paste except for
troubleshooting purposes (i.e., if you cannot figure out what you
mistyped).
Understanding "statistical independence"