## The purpose of this seventh assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 7 in Bachman, Paternoster, & Wilson’s *Statistics for Criminology & Criminal Justice*, 5th Ed.

In the last assignment, you were introduced to frequentist empirical
probabilities and probability distributions, including the standard
normal (z) distribution. You also learned how these are essential to
**null hypothesis significance testing**, which is the
process by which most social scientists make inferences from sample
descriptive statistics to population parameters. Finally, you practiced
various steps in this process, such as calculating frequentist
probabilities from crosstabs, calculating z-scores from raw observation
values, and even conducting a basic binomial hypothesis test.

In Assignment 7, we will dig deeper into the process of making
statistical inferences about population parameters from sample
statistics. For instance, you will learn to think about sample
descriptive statistics (e.g., a sample mean or correlation coefficient)
as *point estimates* of population parameters. Moreover, while
point estimates may represent our best or most likely guess about the
value of a population parameter, a point estimate usually is just one
among many potential estimates of a population parameter that are
plausible under our data and modeling assumptions. In fact, we can use
our knowledge of probability to calculate an *interval estimate*,
such as a frequentist 95% **confidence interval**, around a
point estimate.
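To make this distinction concrete, here is a minimal sketch in R of a point estimate and a 95% interval estimate around it. The data are simulated with assumed values (mean = 10, sd = 2), not drawn from the assignment's data file:

```r
# Illustrative only: simulated sample, not the assignment's data
set.seed(1234)                          # for reproducibility
x <- rnorm(n = 100, mean = 10, sd = 2)  # draw a sample of 100 observations

xbar  <- mean(x)                        # point estimate of the population mean
se    <- sd(x) / sqrt(length(x))        # estimated standard error of the mean
tcrit <- qt(p = 0.975, df = length(x) - 1)  # two-tailed t critical value for 95%

ci <- c(xbar - tcrit * se, xbar + tcrit * se)  # 95% interval estimate
ci
```

Note that `qt(0.975, df)` returns the *t* critical value that leaves 2.5% in each tail, which is what a two-tailed 95% interval requires.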

Pragmatically, a confidence interval is a range of plausible
estimates that quantifies the degree of uncertainty we have in our
estimate of a population parameter. Learning to think about
sample-derived estimates of population parameters as (uncertain)
*interval estimates* rather than as (fixed and certain) *point
estimates* is essential to truly understanding the process of
statistical inference. After all, we rarely know the true population
parameters and, thus, the seductive tendency to view sample statistics
as “real” and known calculations of population values is misguided and
prone to inference errors.

Unfortunately, frequentist confidence intervals (like their
*p*-value cousins) are notoriously misunderstood and
misinterpreted. For instance, it is quite common to read or hear people
explain a 95% confidence interval like this: “We can be 95% certain that
the true population parameter we seek falls within the 95% confidence
interval we constructed around our sample point estimate.” Sadly, this
is an inaccurate interpretation. When we calculate a frequentist 95%
confidence interval, we are not 95% confident in the calculated interval
itself. Rather, we are confident in the *method* that we used to
generate confidence intervals.

What does this mean? Well, do you recall that long run or “sampling”
notion of probability we discussed in the last assignment? We need to
keep that in mind when interpreting a frequentist confidence interval
around a point estimate. Now, imagine we calculated a sample statistic
and used it as a point estimate to infer a population parameter value,
and we also calculated a 95% confidence interval to quantify uncertainty
around our estimate. Next, let’s hypothetically imagine we were to
repeatedly follow the same sampling procedures a large number of times,
calculating a point estimate and 95% confidence interval the same way
each time. In what are we 95% confident? Assuming our data and modeling
assumptions are appropriate, **we are confident that the interval
estimates we calculate in a large number of repeated trials would
capture the true population parameter 95% of the time.**

Of course, this means that we also expect our 95% confidence intervals would fail to capture the population parameter 5% of the time on repeated trials. Moreover, when we only calculate a confidence interval from a single sample, we cannot know whether we have one of the 95% of parameter-catching interval nets that effectively captured the true population value or, instead, whether we happened to get one of the 5% of atypical interval nets that failed to capture the true population parameter.
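This long-run interpretation can be checked by simulation. The sketch below assumes a hypothetical "population" whose true mean we know (10), repeats the sampling procedure many times, and counts how often the 95% interval net captures that true mean:

```r
# Simulation: how often does a 95% CI capture the true population mean?
# Assumes a known population (mean = 10, sd = 2) purely for illustration.
set.seed(8675309)
trials <- 10000
n      <- 50
tcrit  <- qt(0.975, df = n - 1)

caught <- replicate(trials, {
  x  <- rnorm(n, mean = 10, sd = 2)   # draw a fresh sample each trial
  se <- sd(x) / sqrt(n)
  ci_low  <- mean(x) - tcrit * se
  ci_high <- mean(x) + tcrit * se
  ci_low <= 10 & 10 <= ci_high        # TRUE if this interval caught the mean
})

mean(caught)   # proportion of intervals capturing the parameter; close to 0.95
```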

Finally, want to catch more parameters? Like capturing Pokémon, you just need to get a bigger or better net! For instance, we can improve our expected parameter-capturing rate (e.g., to 99%) by casting a wider interval net (i.e., by calculating wider confidence intervals). By widening our interval net, we exchange greater imprecision/uncertainty for the prospect of making fewer inference errors. We can avoid this trade-off by getting a better net - that is, by improving features of our research design (e.g., more precise measurement; larger sample sizes).
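Both options can be sketched in a couple of lines of R (the standard deviation value below is an assumed illustration, not from the assignment's data). A wider net means a larger critical value; a better net means a smaller standard error:

```r
# Widening the net: a higher confidence level uses a larger critical value
qnorm(0.975)   # z critical value for a 95% interval (about 1.96)
qnorm(0.995)   # z critical value for a 99% interval (about 2.58)

# Getting a better net: a larger sample shrinks the standard error,
# narrowing the interval without sacrificing the capture rate
sd_assumed <- 2              # hypothetical population sd
sd_assumed / sqrt(50)        # SE of the mean with n = 50
sd_assumed / sqrt(500)       # SE with n = 500 (smaller by a factor of sqrt(10))
```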

In the current assignment, you will learn how to calculate confidence intervals around a point estimate in R and to interpret them appropriately. Additionally, you will learn how to simulate data from a probability distribution, which should help you better understand sampling variability and the need for interval estimates. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.

## By the end of this assignment, you should:

- be able to select a random sample from data with or without replacement using `dplyr::sample_n()`
- know how to improve reproducibility of randomization tasks in R by setting the random number generator seed using `set.seed()`
- be able to select data with conditions using `dplyr::filter()` and the `%in%` operator
- know how to create a list or vector and assign it to an object, such as `listname <- c(item1, item2)`
- recognize how to simulate data from normal, truncated normal, or uniform probability distributions using `rnorm()`, `truncnorm::rtruncnorm()`, or `runif()`
- recognize how one might write a custom function to repeatedly generate similar plots
- be able to combine multiple `ggplot()` plots into a single figure using the "patchwork" package
- know how to customize the plot layout and add a title to a patchwork figure

- understand how confidence intervals are used to quantify uncertainty caused by sampling variability
- be able to estimate a two-tailed confidence interval around a sample mean or proportion in R and interpret it appropriately
- be able to identify *t* or *z* critical values associated with a two-tailed confidence level using `qt()` or `qnorm()`
- know how to estimate the standard error of a sample mean or proportion in R
- understand how to properly interpret and avoid common misinterpretations of confidence intervals
- recognize how one might plot means and confidence intervals using `ggplot() + geom_point() + geom_errorbar()`
- recognize how one might add elements like vertical lines (`+ geom_vline()`), arrows (`+ geom_segment()`), or text (`+ annotate()`) to a `ggplot()` object
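As a preview of the simulation objectives above, here is a minimal sketch of drawing reproducible random data from different probability distributions. The `truncnorm` line is commented out because it requires installing a third-party package first:

```r
# Preview: simulating data from probability distributions with a fixed seed
set.seed(1138)                                # makes the draws reproducible
norm_draws <- rnorm(1000, mean = 0, sd = 1)   # standard normal distribution
unif_draws <- runif(1000, min = 0, max = 10)  # uniform distribution on [0, 10]

# Truncated normal (requires install.packages("truncnorm") first):
# trunc_draws <- truncnorm::rtruncnorm(1000, a = 0, mean = 2, sd = 1)

summary(norm_draws)
summary(unif_draws)
```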

## We are building on objectives from Assignments 1-6. By the start of this assignment, you should already know how to:

- create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
- knit your RMD document into an HTML file that you can then save and submit for course credit
- install/load R packages and use hashtags (“#”) to comment out sections of R code so they do not run
- recognize when a function is being called from a specific package using a double colon with the `package::function()` format
- read in an SPSS data file in an R code chunk using `haven::read_spss()` and assign it to an R object using the assignment (`<-`) operator
- use the `$` symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., a dataframe or tibble), such as with the format `dataobject$varname`
- use a tidyverse `%>%` pipe operator to perform a sequence of actions
- recognize the R operator `!=` as “not equal to”
- turn off or change scientific notation in R, such as `options(scipen = 999, digits = 3)`
- recognize that you can create your own R functions (e.g., our `funxtoz()` function), and that doing so is recommended for duplicate tasks to avoid copy-and-paste errors

- use `here()` for a simple and reproducible self-referential file directory method
- use `groundhog.library()` as an optional but recommended reproducible alternative to `library()` for loading packages
- use the base R `head()` function to quickly view a snapshot of your data
- use the `glimpse()` function to quickly view all columns (variables) in your data
- use `sjPlot::view_df()` to quickly browse variables in a data file
- use `attr()` to identify variable and attribute value labels

- recognize when missing values are coded as `NA` for variables in your data file
- remove missing observations from a variable in R when appropriate using `filter(!is.na(var))`
- change a numeric variable to a factor (e.g., nominal or ordinal) variable with `haven::as_factor()`
- drop an unused factor level (e.g., a missing “Don’t know” label) on a variable using `data %>% droplevels(data$variable)`
- select and recode variables using dplyr’s `select()`, `mutate()`, and `if_else()` functions
- convert raw column (variable) values into standardized z-score values using `mutate()`
- use `summarytools::dfSummary()` to quickly describe one or more variables in a data file
- create frequency tables with the `sjmisc::frq()` and `summarytools::freq()` functions
- sort frequency distributions (lowest to highest/highest to lowest) with `summarytools::freq()`
- calculate measures of central tendency for a frequency distribution
  - calculate central tendency using the base R functions `mean()` and `median()` (e.g., `mean(data$variable)`)
  - calculate central tendency and other basic descriptive statistics for specific variables in a dataset using the `summarytools::descr()` and `psych::describe()` functions

- calculate measures of dispersion for a variable distribution
  - calculate dispersion measures by hand from frequency tables you generate in R
  - calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with `sjmisc::frq()` or `summarytools::descr()`)

- recognize and read the basic elements of a contingency table (aka crosstab)
  - place the IV in the columns and the DV in the rows of a crosstab
  - recognize column/row marginals and their overlap with univariate frequency distributions
  - calculate marginal, conditional, and joint (frequentist) probabilities
  - compare column percentages (when the IV is in the columns of a crosstab)
- generate and modify a contingency table (crosstab) in R with `dplyr::select()` & `sjPlot::sjtab(depvar, indepvar)` or with `crosstable(depvar, by = indepvar)`
- conduct a binomial hypothesis test in R with `rstatix::binom_test()`
- improve some knitted tables by piping a function’s results to `gt()` (e.g., `head(data) %>% gt()`)
  - modify elements of a `gt()` table, such as adding titles/subtitles with Markdown-formatted (e.g., `**bold**` or `*italicized*`) fonts

- create basic graphs using ggplot2’s `ggplot()` function
  - generate simple bar charts and histograms to visualize the shape and central tendency of a frequency distribution
  - generate boxplots using base R `boxplot()` and `ggplot()` to visualize dispersion in a data distribution
- modify elements of a ggplot object
  - change outline and fill colors in a ggplot geometric object (e.g., `geom_boxplot()`) by adding `fill=` and `color=` followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
  - add or change a preset theme (e.g., `+ theme_minimal()`) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
  - select colors from a colorblind-accessible palette (e.g., using `viridisLite::viridis()`) and specify them for the outline and fill colors in a ggplot geometric object (e.g., `geom_boxplot()`)
  - add a title (and subtitle or caption) to a ggplot object by adding a label with the `labs()` function (e.g., `+ labs(title = "My Title")`)

- conduct and interpret a null hypothesis significance test
- spe