Assignment 7 Objectives

The purpose of this seventh assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 7 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

In the last assignment, you were introduced to frequentist empirical probabilities and probability distributions, including the standard normal (z) distribution. You also learned how these are essential to null hypothesis significance testing, which is the process by which most social scientists make inferences from sample descriptive statistics to population parameters. Finally, you practiced various steps in this process, such as calculating frequentist probabilities from crosstabs, calculating z-scores from raw observation values, and even conducting a basic binomial hypothesis test.

In Assignment 7, we will dig deeper into the process of making statistical inferences about population parameters from sample statistics. For instance, you will learn to think about sample descriptive statistics (e.g., a sample mean or correlation coefficient) as point estimates of population parameters. Moreover, while point estimates may represent our best or most likely guess about the value of a population parameter, a point estimate usually is just one among many potential estimates of a population parameter that are plausible under our data and modeling assumptions. In fact, we can use our knowledge of probability to calculate an interval estimate, such as a frequentist 95% confidence interval, around a point estimate.

Pragmatically, a confidence interval is a range of plausible estimates that quantifies the degree of uncertainty we have in our estimate of a population parameter. Learning to think about sample-derived estimates of population parameters as (uncertain) interval estimates rather than as (fixed and certain) point estimates is essential to truly understanding the process of statistical inference. After all, we rarely know the true population parameters and, thus, the seductive tendency to view sample statistics as “real” and known calculations of population values is misguided and prone to inference errors.
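To make this concrete, here is a minimal sketch of how a two-tailed 95% confidence interval around a sample mean can be built from the point estimate, the estimated standard error, and a t critical value. The numeric vector x and its values are purely illustrative assumptions, not assignment data.

```r
# Minimal sketch: 95% CI around a sample mean (hypothetical data, not assignment data)
x <- c(2, 4, 4, 5, 7, 9, 10, 12)          # illustrative values

xbar  <- mean(x)                          # point estimate (sample mean)
se    <- sd(x) / sqrt(length(x))          # estimated standard error of the mean
tcrit <- qt(0.975, df = length(x) - 1)    # two-tailed t critical value for 95%

c(lower = xbar - tcrit * se,
  upper = xbar + tcrit * se)              # interval estimate around the point estimate
```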

Unfortunately, frequentist confidence intervals (like their p-value cousins) are notoriously misunderstood and misinterpreted. For instance, it is quite common to read or hear people explain a 95% confidence interval like this: “We can be 95% certain that the true population parameter we seek falls within the 95% confidence interval we constructed around our sample point estimate.” Sadly, this is an inaccurate interpretation. When we calculate a frequentist 95% confidence interval, we are not 95% confident in the calculated interval itself. Rather, we are confident in the method that we used to generate confidence intervals.

What does this mean? Well, do you recall that long run or “sampling” notion of probability we discussed in the last assignment? We need to keep that in mind when interpreting a frequentist confidence interval around a point estimate. Now, imagine we calculated a sample statistic and used it as a point estimate to infer a population parameter value, and we also calculated a 95% confidence interval to quantify uncertainty around our estimate. Next, let’s hypothetically imagine we were to repeatedly follow the same sampling procedures a large number of times, calculating a point estimate and 95% confidence interval the same way each time. In what are we 95% confident? Assuming our data and modeling assumptions are appropriate, we are confident that the interval estimates we calculate in a large number of repeated trials would capture the true population parameter 95% of the time.

Of course, this means that we also expect our 95% confidence intervals would fail to capture the population parameter 5% of the time on repeated trials. Moreover, when we only calculate a confidence interval from a single sample, we cannot know whether we have one of the 95% of parameter-catching interval nets that effectively captured the true population value or, instead, whether we happened to get one of the 5% of atypical interval nets that failed to capture the true population parameter.
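To see this "in the long run" logic concretely, the following is a hedged simulation sketch: it repeatedly draws samples from a normal population with a known mean, computes a 95% confidence interval each time, and reports the share of intervals that capture the true mean. The population mean, standard deviation, sample size, number of trials, and seed are arbitrary choices for illustration only.

```r
# Sketch: coverage of 95% CIs across many repeated samples (illustrative settings)
set.seed(1138)                      # make the simulation reproducible

true_mean <- 100                    # "population" parameter we pretend to know
n_trials  <- 10000                  # number of hypothetical repeated samples
n         <- 30                     # sample size per trial

captured <- replicate(n_trials, {
  samp  <- rnorm(n, mean = true_mean, sd = 15)   # draw one sample
  se    <- sd(samp) / sqrt(n)                    # estimated standard error
  tcrit <- qt(0.975, df = n - 1)                 # 95% two-tailed critical value
  lower <- mean(samp) - tcrit * se
  upper <- mean(samp) + tcrit * se
  lower <= true_mean & true_mean <= upper        # did this interval catch the mean?
})

mean(captured)                      # should be close to 0.95 in the long run
```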

Finally, want to catch more parameters? Like capturing Pokémon, you just need a bigger or better net! For instance, we can improve our expected parameter-capturing rate (e.g., to 99%) by casting a wider interval net (i.e., by calculating wider confidence intervals). By widening our interval net, we accept greater imprecision/uncertainty in exchange for the prospect of making fewer inference errors. We can avoid this trade-off by getting a better net - that is, by improving features of our research design (e.g., more precise measurement; larger sample sizes).
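A quick sketch of both options, with assumed numbers used only for demonstration: the critical value grows as the confidence level increases (a wider net), while the standard error shrinks as the sample size grows (a better net).

```r
# Wider net: a larger critical value produces a wider interval
qnorm(0.975)     # ~1.96 for a 95% interval
qnorm(0.995)     # ~2.58 for a 99% interval

# Better net: a larger n shrinks the standard error at the same confidence level
sd_x <- 15                # assumed sample standard deviation
sd_x / sqrt(30)           # standard error with n = 30
sd_x / sqrt(300)          # standard error with n = 300
```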

In the current assignment, you will learn how to calculate confidence intervals around a point estimate in R and to interpret them appropriately. Additionally, you will learn how to simulate data from a probability distribution, which should help you better understand sampling variability and the need for interval estimates. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.

By the end of assignment #7, you should…

  • be able to select a random sample from data with or without replacement using dplyr::sample_n()
  • know how to improve reproducibility of randomization tasks in R by setting the random number generator seed using set.seed()
  • be able to select data with conditions using dplyr::filter() and the %in% operator
  • know how to create a list or vector and assign it to an object, such as listname <- c(item1, item2)
  • recognize how to simulate data from normal, truncated normal, or uniform probability distributions using rnorm(), truncnorm::rtruncnorm(), or runif() (see the sketch after this list)
  • recognize how one might write a custom function to repeatedly generate similar plots
  • be able to combine multiple ggplot() plots into a single figure using the “patchwork” package
    • know how to customize plot layout and add a title to a patchwork figure
  • understand how confidence intervals are used to quantify uncertainty caused by sampling variability
    • be able to estimate a two-tailed confidence interval around a sample mean or proportion in R and interpret it appropriately
    • be able to identify t or z critical values associated with a two-tailed confidence level using qt() or qnorm()
    • know how to estimate the standard error of a sample mean or proportion in R
    • understand how to properly interpret and avoid common misinterpretations of confidence intervals
  • recognize how one might plot means and confidence intervals using ggplot() + geom_point() + geom_errorbar()
  • recognize how one might add elements like vertical lines (+ geom_vline()), arrows (+ geom_segment()), or text (+ annotate()) to a ggplot() object
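As a preview of the simulation and plotting objectives above, here is a hedged sketch that draws values from normal, truncated normal, and uniform distributions, wraps the plotting code in a small custom function to avoid copy-and-paste errors, and combines the histograms with the patchwork package. The distribution parameters, seed, colors, and labels are arbitrary illustrations, not assignment instructions.

```r
# Sketch: simulate from three distributions and combine plots with patchwork
library(ggplot2)
library(patchwork)

set.seed(1138)
sims <- data.frame(
  normal  = rnorm(1000, mean = 0, sd = 1),                                # standard normal
  truncd  = truncnorm::rtruncnorm(1000, a = 0, b = Inf, mean = 0, sd = 1),# truncated at zero
  uniform = runif(1000, min = -3, max = 3)                                # uniform
)

# a small helper function so we do not copy-and-paste the same plot code three times
simhist <- function(data, var, title) {
  ggplot(data, aes(x = .data[[var]])) +
    geom_histogram(bins = 30, fill = "#990000", color = "#EDEBEB") +
    labs(title = title) +
    theme_minimal()
}

p1 <- simhist(sims, "normal",  "Normal")
p2 <- simhist(sims, "truncd",  "Truncated normal")
p3 <- simhist(sims, "uniform", "Uniform")

(p1 + p2 + p3) + plot_annotation(title = "Simulated distributions")
```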

Assumptions & Ground Rules

We are building on objectives from Assignments 1-6. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • knit your RMD document into an HTML file that you can then save and submit for course credit
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • recognize the R operator != as “not equal to”
  • turn off or change scientific notation in R, such as options(scipen=999, digits = 3)
  • recognize that you can create your own R functions (e.g., our funxtoz() function) - and that doing so is recommended for repetitive tasks to avoid copy-and-paste errors

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method
  • use groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages

Data viewing & wrangling

  • use the base R head() function to quickly view a snapshot of your data
  • use the glimpse() function to quickly view all columns (variables) in your data
  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • remove missing observations from a variable in R when appropriate using filter(!is.na(var))
  • change a numeric variable to a factor (e.g., nominal or ordinal) variable with haven::as_factor()
  • drop an unused factor level (e.g., missing “Don’t know” label) on a variable using data %>% droplevels(data$variable)
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions
  • convert raw column (variable) values into standardized z-score values using mutate()

Descriptive data analysis

  • use summarytools::dfsummary() to quickly describe one or more variables in a data file
  • create frequency tables with the sjmisc::frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()
  • calculate measures of central tendency for a frequency distribution
    • calculate central tendency using base R functions mean() and median() (e.g., mean(data$variable))
    • calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() and psych::describe() functions
  • calculate measures of dispersion for a variable distribution
    • calculate dispersion measures by hand from frequency tables you generate in R
    • calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc::frq() or summarytools::descr())
  • recognize and read the basic elements of a contingency table (aka crosstab)
    • place IV in columns and DV in rows of a crosstab
    • recognize column/row marginals and their overlap with univariate frequency distributions
    • calculate marginal, conditional, and joint (frequentist) probabilities
    • compare column percentages (when an IV is in columns of a crosstab)
  • generate and modify a contingency table (crosstab) in R with dplyr::select() & sjPlot::sjtab(depvar, indepvar) or with crosstable(depvar, by=indepvar)
  • conduct a binomial hypothesis test in R with rstatix::binom_test()

Data visualization & aesthetics

  • improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
    • modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
  • create basic graphs using ggplot2’s ggplot() function
    • generate simple bar charts and histograms to visualize shape and central tendency of a frequency distribution
    • generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution
  • modify elements of a ggplot object
    • change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
    • add or change a preset theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
    • select colors from a colorblind accessible palette (e.g., using viridisLite::viridis()) and specify them for the outline and fill colors in a ggplot geometric object (e.g., geom_boxplot())
    • add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title"))

Hypothesis testing & statistical inference

  • conduct and interpret a null hypothesis significance test
    • spe