The purpose of this seventh assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 7 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.
In the last assignment, you were introduced to frequentist empirical probabilities and probability distributions, including the standard normal (z) distribution. You also learned how these are essential to null hypothesis significance testing, which is the process by which most social scientists make inferences from sample descriptive statistics to population parameters. Finally, you practiced various steps in this process, such as calculating frequentist probabilities from crosstabs, calculating z-scores from raw observation values, and even conducting a basic binomial hypothesis test.
In Assignment 7, we will dig deeper into the process of making statistical inferences about population parameters from sample statistics. For instance, you will learn to think about sample descriptive statistics (e.g., a sample mean or correlation coefficient) as point estimates of population parameters. Moreover, while point estimates may represent our best or most likely guess about the value of a population parameter, a point estimate usually is just one among many potential estimates of a population parameter that are plausible under our data and modeling assumptions. In fact, we can use our knowledge of probability to calculate an interval estimate, such as a frequentist 95% confidence interval, around a point estimate.
Pragmatically, a confidence interval is a range of plausible estimates that quantifies the degree of uncertainty we have in our estimate of a population parameter. Learning to think about sample-derived estimates of population parameters as (uncertain) interval estimates rather than as (fixed and certain) point estimates is essential to truly understanding the process of statistical inference. After all, we rarely know the true population parameters and, thus, the seductive tendency to view sample statistics as “real” and known calculations of population values is misguided and prone to inference errors.
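To make this concrete, here is a minimal sketch (not part of the graded exercises) of a sample mean as a point estimate with a hand-calculated frequentist 95% confidence interval around it. The simulated data and the values used (n = 100, mean 50, sd 10) are arbitrary assumptions for illustration:

```r
# Minimal sketch: a sample mean as a point estimate, plus a
# hand-calculated frequentist 95% confidence interval around it.
# The simulated values below are arbitrary assumptions.
set.seed(1138)
x <- rnorm(100, mean = 50, sd = 10)     # hypothetical sample, n = 100

xbar  <- mean(x)                        # point estimate of the mean
se    <- sd(x) / sqrt(length(x))        # estimated standard error
tcrit <- qt(0.975, df = length(x) - 1)  # critical t for a 95% interval

c(lower = xbar - tcrit * se,
  point = xbar,
  upper = xbar + tcrit * se)
```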
Unfortunately, frequentist confidence intervals (like their p-value cousins) are notoriously misunderstood and misinterpreted. For instance, it is quite common to read or hear people explain a 95% confidence interval like this: “We can be 95% certain that the true population parameter we seek falls within the 95% confidence interval we constructed around our sample point estimate.” Sadly, this is an inaccurate interpretation. When we calculate a frequentist 95% confidence interval, we are not 95% confident in the calculated interval itself. Rather, we are confident in the method that we used to generate confidence intervals.
What does this mean? Well, do you recall that long run or “sampling” notion of probability we discussed in the last assignment? We need to keep that in mind when interpreting a frequentist confidence interval around a point estimate. Now, imagine we calculated a sample statistic and used it as a point estimate to infer a population parameter value, and we also calculated a 95% confidence interval to quantify uncertainty around our estimate. Next, let’s hypothetically imagine we were to repeatedly follow the same sampling procedures a large number of times, calculating a point estimate and 95% confidence interval the same way each time. In what are we 95% confident? Assuming our data and modeling assumptions are appropriate, we are confident that the interval estimates we calculate in a large number of repeated trials would capture the true population parameter 95% of the time.
Of course, this means that we also expect our 95% confidence intervals would fail to capture the population parameter 5% of the time on repeated trials. Moreover, when we only calculate a confidence interval from a single sample, we cannot know whether we have one of the 95% of parameter-catching interval nets that effectively captured the true population value or, instead, whether we happened to get one of the 5% of atypical interval nets that failed to capture the true population parameter.
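You can check this repeated-sampling logic yourself by simulation. Below is a hedged sketch, assuming an invented population (mean 100, sd 15) and samples of n = 30, that draws many samples, builds a 95% confidence interval from each, and counts how often the intervals capture the true mean:

```r
# Simulate the long-run behavior of 95% confidence intervals:
# draw many samples from a known population and count how often
# the intervals "catch" the true mean. Population values are invented.
set.seed(42)
true_mean <- 100
n         <- 30
trials    <- 10000

captured <- replicate(trials, {
  samp  <- rnorm(n, mean = true_mean, sd = 15)
  se    <- sd(samp) / sqrt(n)
  tcrit <- qt(0.975, df = n - 1)
  (mean(samp) - tcrit * se) <= true_mean &&
    (mean(samp) + tcrit * se) >= true_mean
})

mean(captured)  # proportion of parameter-catching nets; should be ~0.95
```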
Finally, want to catch more parameters? Like capturing Pokemon, you just need to get a bigger or better net! For instance, we can improve our expected parameter-capturing rate (e.g., to 99%) by casting a wider interval net (i.e., by calculating wider confidence intervals). By widening our interval net, we exchange greater imprecision/uncertainty for the prospect of making fewer inference errors. We can avoid this trade-off by getting a better net - that is, by improving features of our research design (e.g., more precise measurement; larger sample sizes).
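The cost of a wider net is visible in the critical values themselves. In this small sketch (with an assumed standard deviation of 15), a 99% interval requires a larger multiplier than a 95% interval, while a larger sample shrinks the standard error:

```r
# Critical z values: a 99% net is wider than a 95% net
qnorm(0.975)   # ~1.96 multiplier for a 95% interval
qnorm(0.995)   # ~2.58 multiplier for a 99% interval

# A "better net": with an assumed sd of 15, quadrupling n halves
# the standard error and thus halves the margin of error
qnorm(0.975) * 15 / sqrt(30)    # margin of error, n = 30
qnorm(0.975) * 15 / sqrt(120)   # margin of error, n = 120 (half as wide)
```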
In the current assignment, you will learn how to calculate confidence intervals around a point estimate in R and to interpret them appropriately. Additionally, you will learn how to simulate data from a probability distribution, which should help you better understand sampling variability and the need for interval estimates. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.
In this assignment, you will learn how to (a brief sketch combining several of these skills appears after the list):

- draw a random sample of cases from a dataset using `dplyr::sample_n()`
- make randomization tasks reproducible by setting the random number generator seed with `set.seed()`
- select cases using `dplyr::filter()` and the `%in%` operator, and create a list object with the format `listname <- c(item1, item2)`
- simulate data from a probability distribution using `rnorm()`, `truncnorm::rtruncnorm()`, or `runif()`
- combine multiple `ggplot()` plots into a single figure using the "patchwork" package
- calculate critical values with `qt()` or `qnorm()`
- plot point estimates and confidence intervals with `ggplot() + geom_point() + geom_errorbar()`
- add vertical line (`+ geom_vline()`), arrow (`+ geom_segment()`), or text (`+ annotate()`) elements to a `ggplot()` object
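Here is a brief, hedged preview combining several of these new skills (seeding, simulation, random sampling, critical values, and an error-bar plot). The data and column name (`score`) are invented for illustration and are not the assignment's data:

```r
# Preview sketch: simulate data, sample from it, and plot the sample
# mean with a 95% confidence interval. All values here are invented.
library(dplyr)
library(ggplot2)

set.seed(1234)                                        # reproducible randomization
sim <- tibble(score = rnorm(500, mean = 10, sd = 2))  # simulated "population"

samp <- sample_n(sim, 50)                             # random sample of 50 cases

est <- samp %>%
  summarize(m  = mean(score),
            se = sd(score) / sqrt(n()))

# Point estimate with a 95% confidence interval, plus the true mean
ggplot(est, aes(x = "sample", y = m)) +
  geom_point() +
  geom_errorbar(aes(ymin = m - qt(0.975, df = 49) * se,
                    ymax = m + qt(0.975, df = 49) * se),
                width = 0.1) +
  geom_hline(yintercept = 10, linetype = "dashed")    # true population mean
```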
We are building on objectives from Assignments 1-6. By the start of this assignment, you should already know how to:

- use the `package::function()` format to call functions from specific packages
- read an SPSS data file with `haven::read_spss()` and assign it to an R object using an assignment (`<-`) operator
- use the `$` symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format `dataobject$varname`
- use the `%>%` pipe operator to perform a sequence of actions
- recognize `!=` as "not equal to"
- adjust significant digits and turn off scientific notation with `options(scipen=999, digits = 3)`
- create your own R functions (e.g., the `funxtoz()` function), and understand that doing so is recommended for duplicate tasks to avoid copy-and-paste errors
- use `here()` for a simple and reproducible self-referential file directory method
- use `groundhog.library()` as an optional but recommended reproducible alternative to `library()` for loading packages
- use the `head()` function to quickly view a snapshot of your data
- use the `glimpse()` function to quickly view all columns (variables) in your data
- use `sjPlot::view_df()` to quickly browse variables in a data file
- use `attr()` to identify variable and attribute value labels
- identify missing (`NA`) values for variables in your data file and drop them with `filter(!is.na(var))`
- convert labeled variables to factors with `haven::as_factor()` and drop unused factor levels with `data %>% droplevels(data$variable)`
- select and recode variables with the `select()`, `mutate()`, and `if_else()` functions (e.g., recoding values with `mutate()`)
- use `summarytools::dfSummary()` to quickly describe one or more variables in a data file
- generate frequency tables with the `sjmisc::frq()` and `summarytools::freq()` functions
- calculate measures of central tendency with `mean()` and `median()` (e.g., `mean(data$variable)`)
- calculate summary statistics with the `summarytools::descr()` and `psych::describe()` functions, or pull them from frequency and descriptives output (e.g., `sjmisc::frq()` or `summarytools::descr()`)
- generate crosstabs with `dplyr::select()` & `sjPlot::sjtab(depvar, indepvar)` or with `crosstable(depvar, by=indepvar)`
- conduct a binomial hypothesis test with `rstatix::binom_test()`
- format tables with `gt()` (e.g., `head(data) %>% gt()`)
- customize a `gt()` table, such as adding titles/subtitles with Markdown-formatted (e.g., `**bold**` or `*italicized*`) fonts
- create plots with the `ggplot()` function
- use `boxplot()` and `ggplot()` to visualize dispersion in a data distribution
- modify the fill and outline colors of a ggplot geometric object (e.g., `geom_boxplot()`) by adding `fill=` and `color=` followed by specific color names (e.g., "orange") or hexadecimal codes (e.g., "#990000" for crimson; "#EDEBEB" for cream)
- add a theme (e.g., `+ theme_minimal()`) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
- generate color palettes (e.g., with `viridisLite::viridis()`) and specify them for the outline and fill colors in a ggplot geometric object (e.g., `geom_boxplot()`)
- add or modify plot labels with the `labs()` function (e.g., `+ labs(title = "My Title")`)
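If any of these recap skills feel rusty, here is a short refresher sketch. The file name ("mydata.sav") and variable name ("age") are hypothetical placeholders, not course files:

```r
# Refresher sketch combining a few recap skills; "mydata.sav" and the
# variable "age" are hypothetical placeholders for your own data.
library(dplyr)

dat <- haven::read_spss(here::here("data", "mydata.sav"))  # read SPSS data

dat_clean <- dat %>%
  filter(!is.na(age))                 # drop cases missing on age

mean(dat_clean$age)                   # central tendency
summarytools::descr(dat_clean$age)    # fuller summary statistics
```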