Assignment 10 Objectives

In the last assignment, you learned how to estimate and interpret confidence intervals around a point estimate (e.g., sample mean or proportion). You also learned how to simulate data from basic probability distributions to help you better understand sampling variability and the need for interval estimates. In this assignment, you will learn how to conduct two-tailed z and t tests and then, given the test results and the null hypothesis, make an appropriate inference about the population parameter by either rejecting or failing to reject the null hypothesis.

Before you conduct your own hypothesis tests, we will first simulate population data from a normal probability distribution, then take random samples from our simulated population data and plot features of these samples. Our aim will be to help you visualize the sampling distribution of a sample mean, which should lead to a better understanding of the underlying mechanisms that allow us to make valid population inferences from samples with null hypothesis significance testing. While we will not expect you to do all of these tasks yourself, by providing our code along the way, we hope these examples will help you gain a better sense of how one might conduct simulations, create handy user-written functions, and generate more complex visualizations in R.
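
The simulation workflow described above can be sketched in a few lines of base R. This is only a minimal illustration, not the code used later in the assignment: the population mean (50), standard deviation (10), population size, sample sizes, seed, and the helper name draw_sample_means() are all assumptions chosen for the example, and base R's sample() and replicate() stand in for the dplyr and infer functions named in the objectives.

```r
# Hypothetical illustration: every number below (seed, mean, SD, sizes) is an assumption
set.seed(1138)

# Simulate a "population" of 10,000 values from a normal distribution
population <- rnorm(n = 10000, mean = 50, sd = 10)

# Hypothetical helper: draw `reps` random samples of size `n` from the
# population and return the mean of each sample
draw_sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(population, size = n)))
}

# lapply() stores the results for every sample size in a single list,
# rather than creating one object per sample size in the R Environment
sample_sizes <- c(10, 50, 250)
means_by_n <- lapply(sample_sizes, draw_sample_means)
names(means_by_n) <- paste0("n_", sample_sizes)

# The standard deviation of the sample means shrinks as sample size grows
sd_of_means <- sapply(means_by_n, sd)
sd_of_means

# Compare with the theoretical standard error, sigma / sqrt(n)
10 / sqrt(sample_sizes)
```

Passing any element of the list to hist() (e.g., hist(means_by_n$n_10)) would show an approximate sampling distribution of the sample mean for that sample size.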

By the end of Assignment 10, you should…

  • know how to draw random samples from data in R
    • know how to draw one random sample with dplyr::slice_sample()
    • know how to draw multiple (“replicate”) random samples with infer::rep_slice_sample()
  • recognize that you can use lapply() to create a list of objects, which can help you avoid cluttering the R Environment with objects
  • recognize you can use patchwork::wrap_plots() to quickly combine ggplot objects contained in a list
  • be able to use the base R tail() function to quickly view the last few rows of data
  • know that you can share examples or troubleshoot code in a reproducible way by using built-in datasets like mtcars that are universally available to R users
  • be able to conduct a one-sample z or t hypothesis test in R and interpret results
    • be able to conduct a one-sample test using the base R t.test() function
    • be able to manually calculate a z or t test statistic by typing the formula in R
    • be able to conduct a one-sample test using infer::t_test()
    • know how to use infer::visualize() to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis
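
As a preview of the testing objectives above, here is a rough sketch of a two-tailed one-sample t-test run two ways: with base R's t.test() and by typing the formula by hand. It uses the built-in mtcars data so it runs anywhere. The null value of 20 mpg is an arbitrary assumption made purely for illustration, not a value from the course materials.

```r
# mtcars ships with R, so no data file needs to be read in
tail(mtcars$mpg)

# Assumed null hypothesis value (illustrative only): population mean of 20 mpg
mu_null <- 20

# Built-in two-tailed one-sample t-test
t_res <- t.test(mtcars$mpg, mu = mu_null, alternative = "two.sided")

# The same test statistic typed out by hand: (sample mean - null value) / standard error
x_bar <- mean(mtcars$mpg)
s <- sd(mtcars$mpg)
n <- length(mtcars$mpg)
t_manual <- (x_bar - mu_null) / (s / sqrt(n))

# Two-tailed p-value from the t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = n - 1, lower.tail = FALSE)

# The built-in and manual results should agree
c(t_builtin = unname(t_res$statistic), t_manual = t_manual)
c(p_builtin = t_res$p.value, p_manual = p_manual)
```

With this assumed null value, the p-value is well above an alpha level of 0.05, so we would fail to reject the null hypothesis.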

Assumptions & Ground Rules

We are building on objectives from Assignments 1-9. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • knit your RMD document into an HTML file that you can then save and submit for course credit
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • recognize the R operator != as “not equal to”
  • turn off or change scientific notation in R, such as options(scipen=999, digits = 3)
  • create a list or vector and assign it to an object, such as listname <- c(item1, item2)
  • recognize that you can create your own R functions (e.g., our funxtoz() function) - and that doing so is recommended for duplicate tasks to avoid copy-and-paste errors

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method
  • improve reproducibility of randomization tasks in R by setting the random number generator seed using set.seed()

Data viewing & wrangling

  • use the base R head() function to quickly view a snapshot of your data
  • use the glimpse() function to quickly view all columns (variables) in your data
  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • remove missing observations from a variable in R when appropriate using filter(!is.na(var))
  • change a numeric variable to a factor (e.g., nominal or ordinal) variable with haven::as_factor()
  • drop an unused factor level (e.g., missing “Don’t know” label) on a variable using droplevels(data$variable)
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions
  • convert raw column (variable) values into standardized z-score values using mutate()
  • select a random sample from data with or without replacement using dplyr::sample_n()
  • select data with conditions using dplyr::filter() and %in% operator
  • simulate data from normal, truncated normal, or uniform probability distributions using rnorm(), truncnorm::rtruncnorm(), or runif()

Descriptive data analysis

  • use summarytools::dfSummary() to quickly describe one or more variables in a data file
  • create frequency tables with sjmisc::frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()
  • calculate measures of central tendency for a frequency distribution
    • calculate central tendency using base R functions mean() and median() (e.g., mean(data$variable))
    • calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() and psych::describe() functions
  • calculate measures of dispersion for a variable distribution
    • calculate dispersion measures by hand from frequency tables you generate in R
    • calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc::frq() or summarytools::descr())
  • recognize and read the basic elements of a contingency table (aka crosstab)
    • place IV in columns and DV in rows of a crosstab
    • recognize column/row marginals and their overlap with univariate frequency distributions
    • calculate marginal, conditional, and joint (frequentist) probabilities
    • compare column percentages (when an IV is in columns of a crosstab)
  • generate and modify a contingency table (crosstab) in R with dplyr::select() & sjPlot::sjtab(depvar, indepvar)

Data visualization & aesthetics

  • improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
    • modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
  • create basic graphs using ggplot2’s ggplot() function
    • generate simple bar charts and histograms to visualize shape and central tendency of a frequency distribution
    • generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution
  • modify elements of a ggplot object
    • change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
    • add or change a preset theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
    • select colors from a colorblind accessible palette (e.g., using viridisLite::viridis()) and specify them for the outline and fill colors in a ggplot geometric object (e.g., geom_boxplot())
    • add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title"))
  • be able to combine multiple ggplot() plots into a single figure using the “patchwork” package
    • know how to customize plot layout and add title to patchwork figure
    • recognize that one can write a custom function to repeatedly generate similar plots before combining them with patchwork
  • recognize that one can plot means and confidence intervals using ggplot() + geom_point() + geom_errorbar()
  • recognize that one can add elements like vertical lines (+ geom_vline()), arrows (+ geom_segment()), or text (+ annotate()) to a ggplot() object

Hypothesis testing & statistical inference

  • conduct and interpret a null hypothesis significance test
    • specify null (test) hypothesis & identify contrasting alternative hypothesis (or hypotheses)
    • set an alpha or significance level (e.g., as risk tolerance or false positive error control rate)
    • calculate a test statistic and corresponding p-value
    • compare a test p-value to alpha level and then determine whether the evidence is sufficient to reject the null hypothesis or should result in a failure to reject the null hypothesis
  • conduct a binomial hypothesis test in R with rstatix::binom_test()
  • generate and interpret confidence intervals to quantify uncertainty caused by sampling variability
    • identify t or z critical values associated with a two-tailed confidence level using qt() or qnorm()
    • estimate the standard error of a sample mean or proportion in R
    • estimate a two-tailed confidence interval around a sample mean or proportion in R
    • properly interpret and avoid common misinterpretations of confidence intervals
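
As a quick refresher on the confidence interval steps listed above, the following sketch builds a 95% two-tailed interval around a sample mean by hand and checks it against t.test(). The choice of mtcars$mpg and the 95% confidence level are assumptions made for illustration.

```r
# Built-in example data; any numeric variable would work the same way
x <- mtcars$mpg
n <- length(x)

# Estimated standard error of the sample mean
se <- sd(x) / sqrt(n)

# t critical value for a two-tailed 95% confidence level:
# alpha = 0.05 split across both tails leaves 0.975 below the upper cutoff
t_crit <- qt(0.975, df = n - 1)

# Interval = sample mean +/- (critical value * standard error)
ci_manual <- mean(x) + c(-1, 1) * t_crit * se
ci_manual

# t.test() reports the same interval
ci_builtin <- t.test(x, conf.level = 0.95)$conf.int
ci_builtin
```

For a large sample, qnorm(0.975) would supply the corresponding z critical value in place of qt().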

If you do not recall how to do these things, review Assignments 1-9.

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

  • Sampling variation (aka, sampling variability)
  • Hypothesis testing
    • Null hypothesis
    • Nondirectional & two-tailed hypothesis tests
    • Directional & one-tailed hypothesis tests
  • One-sample hypothesis tests for population means and proportions and how to calculate by hand:
    • a one-sample z test for the difference between an observed sample mean and a given population mean for large samples
    • a one-sample t test for the difference between an observed sample mean and a given population mean for small samples
    • a one-sample z test for the difference between an observed sample proportion and a given population proportion for large samples
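
For reference, the standard textbook forms of these three test statistics are sketched below. The notation is an assumption and may differ from your book's: \bar{x} is the sample mean, \mu_0 the population mean under the null hypothesis, \sigma the population standard deviation (often replaced by the sample standard deviation s in large samples when \sigma is unknown), s the sample standard deviation, n the sample size, \hat{p} the sample proportion, and p_0 the population proportion under the null hypothesis.

```latex
% One-sample z test for a mean (large samples)
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}

% One-sample t test for a mean (small samples), with n - 1 degrees of freedom
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad df = n - 1

% One-sample z test for a proportion (large samples)
z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}
```

In each case the numerator is the difference between the sample statistic and the null value, and the denominator is the standard error of that statistic.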

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).


Visualizing the Sampling Distribution of a Sample Mean

Goal: Understand the sampling distribution of a sample mean as well as the relationship between sample size and sampling variability

After learning about how confidence intervals help us quantify and communicate uncertainty in statistical estimates of population parameters, this week’s course materials focused on making inferences about population parameters by conducting null hypothesis tests. Briefly, this process involves:

  • Making assumptions that our sample statistics can be used as meaningful estimates of population parameters that we are interested in.
  • Making a baseline assumption about the population, such as by specifying a “null hypothesis” (e.g., of no difference in means, or of a specific population value).
  • Selecting an appropriate test distribution (e.g., z or t distribution)
  • Specifying an alpha level (i.e., a false positive risk tolerance level or error control rate) and rejection region in the test (e.g., z or t) distribution
  • Converting our sample statistic into a standardized test statistic using the test