Assignment 11 Objectives

In the last assignment, you learned how to conduct a one-sample z or t hypothesis test of the difference between a sample and population mean and then, given the test results and the null hypothesis, to make an appropriate inference about the population mean by either rejecting or failing to reject the null hypothesis. In this assignment, you will learn how to make population inferences about the relationship between two categorical variables by conducting a chi-squared test of independence on a sample contingency table (crosstab).

By the end of Assignment 11, you should…

  • recognize you can manually build a simple tibble row-by-row using tidyverse’s tibble::tribble()
  • recognize you can use round() to specify number of decimals on numeric values
  • recognize you can modify a gt() table to add or remove the decimals in specific columns or rows with fmt_number()
  • be able to conduct a chi-squared test of independence using sjPlot::sjtab() or chisq.test() and interpret results
    • know how to specify different measures of association with statistics= using sjPlot::sjtab() and interpret them appropriately.
    • know how to generate observed and expected frequencies by assigning results of chisq.test() to an object (e.g., chisq) and then calling elements from object (e.g., chisq$observed or chisq$expected)
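
As a preview of the last two objectives, here is a minimal base R sketch of the chisq.test() workflow. The counts and the labels (dv, iv, group_a, group_b) are made up for illustration and are not course data; the sketch also assumes you want the uncorrected test, so it switches off the continuity correction that chisq.test() applies to 2 x 2 tables by default.

```r
# Hypothetical 2 x 2 crosstab of counts (made-up numbers, not course data).
# Columns hold the IV and rows hold the DV, following the course convention.
crosstab <- matrix(
  c(30, 20,
    10, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    dv = c("yes", "no"),
    iv = c("group_a", "group_b")
  )
)

# correct = FALSE turns off the Yates continuity correction that
# chisq.test() applies to 2 x 2 tables by default.
chisq <- chisq.test(crosstab, correct = FALSE)

chisq$observed             # observed frequencies (the table we supplied)
round(chisq$expected, 1)   # expected frequencies if the variables were independent
round(chisq$statistic, 2)  # chi-squared test statistic, rounded to 2 decimals
chisq$parameter            # degrees of freedom
chisq$p.value              # p-value to compare against your alpha level
```

Assigning the result to an object (here, chisq) is what lets you pull out individual elements with the $ symbol instead of reading them off the printed output.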

Assumptions & Ground Rules

We are building on objectives from Assignments 1-10. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • knit your RMD document into an HTML file that you can then save and submit for course credit
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • recognize the R operator != as “not equal to”
  • turn off or change scientific notation in R, such as options(scipen=999, digits = 3)
  • create a list or vector and assign to object, such as listname <- c(item1, item2)
  • recognize that you can use lapply() to create a list of objects, which can help you avoid cluttering the R Environment with objects
  • recognize that you can create your own R functions (e.g., our funxtoz() function) - and that doing so is recommended for duplicate tasks to avoid copy-and-paste errors

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method
  • improve reproducibility of randomization tasks in R by setting the random number generator seed using set.seed()
  • know that you can share examples or troubleshoot code in a reproducible way by using built-in datasets like mtcars that are universally available to R users

Data viewing & wrangling

  • use the base R head() function to quickly view the first few rows of data
  • use the base R tail() function to quickly view the last few rows of data
  • use the glimpse() function to quickly view all columns (variables) in your data
  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • remove missing observations from a variable in R when appropriate using filter(!is.na(var))
  • change a numeric variable to a factor (e.g., nominal or ordinal) variable with haven::as_factor()
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions
  • convert raw column (variable) values into standardized z-score values using mutate()
  • select a random sample from data, without or with replacement, using dplyr::sample_n()
  • select data with conditions using dplyr::filter() and %in% operator
  • simulate data from normal, truncated normal, or uniform probability distributions using rnorm(), truncnorm::rtruncnorm(), or runif()
  • draw random samples from data in R
    • draw one random sample with dplyr::slice_sample()
    • draw multiple (“replicate”) random samples with infer::rep_slice_sample()

Descriptive data analysis

  • use summarytools::dfSummary() to quickly describe one or more variables in a data file
  • create frequency tables with sjmisc::frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()
  • calculate measures of central tendency for a frequency distribution
    • calculate central tendency using base R functions mean() and median() (e.g., mean(data$variable))
    • calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() and psych::describe() functions
  • calculate measures of dispersion for a variable distribution
    • calculate dispersion measures by hand from frequency tables you generate in R
    • calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc::frq() or summarytools::descr())
  • recognize and read the basic elements of a contingency table (aka crosstab)
    • place IV in columns and DV in rows of a crosstab
    • recognize column/row marginals and their overlap with univariate frequency distributions
    • calculate marginal, conditional, and joint (frequentist) probabilities
    • compare column percentages (when an IV is in columns of a crosstab)
  • generate and modify a contingency table (crosstab) in R with dplyr::select() & sjPlot::sjtab(depvar, indepvar)

Data visualization & aesthetics

  • improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
    • modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
  • create basic graphs using ggplot2’s ggplot() function
    • generate simple bar charts and histograms to visualize shape and central tendency of a frequency distribution
    • generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution
  • modify elements of a ggplot object
    • change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
    • add or change a preset theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
    • add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title"))
  • be able to combine multiple ggplot() plots into a single figure using the “patchwork” package
    • know how to customize plot layout and add title to patchwork figure
    • recognize that one can write a custom function to repeatedly generate similar plots before combining them with patchwork
    • recognize you can use patchwork::wrap_plots() to quickly combine ggplot objects contained in a list
  • recognize that one can plot means and confidence intervals using ggplot() + geom_point() + geom_errorbar()
  • recognize that one can add elements like vertical lines (+ geom_vline()), arrows (+ geom_segment()), or text (+ annotate()) to a ggplot() object

Hypothesis testing & statistical inference

  • conduct and interpret a null hypothesis significance test
    • specify null (test) hypothesis & identify contrasting alternative hypothesis (or hypotheses)
    • set an alpha or significance level (e.g., as risk tolerance or false positive error control rate)
    • calculate a test statistic and corresponding p-value
    • compare a test p-value to alpha level and then determine whether the evidence is sufficient to reject the null hypothesis or should result in a failure to reject the null hypothesis
  • conduct a binomial hypothesis test in R with rstatix::binom_test()
  • generate and interpret confidence intervals to quantify uncertainty caused by sampling variability
    • identify t or z critical values associated with a two-tailed confidence level using qt() or qnorm()
    • estimate the standard error of a sample mean or proportion in R
    • estimate a two-tailed confidence interval around a sample mean or proportion in R
    • properly interpret and avoid common misinterpretations of confidence intervals
  • be able to conduct a one-sample z or t hypothesis test of the difference between a sample and assumed population mean in R and interpret results
    • be able to conduct a one-sample test using the base R t.test() function
    • be able to manually calculate a z or t test statistic by typing formula in R
    • be able to conduct a one-sample test using infer::t_test()
    • know how to use infer::visualize() to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis

If you do not recall how to do these things, review Assignments 1-10.

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

  • contingency tables or crosstabs
    • joint frequency distribution
    • column marginal or column frequency
    • row marginal or row frequency
    • how to compare percentage differences (across IV and within DV categories)
  • chi-squared test of independence
    • observed frequency
    • expected frequency
    • how to calculate with the definitional and computational formulas
  • measures of association
    • positive and negative relationships
    • how to calculate and when to use phi-coefficient, contingency coefficient, Cramer’s V (e.g., table size, levels of measurement)
    • how to calculate and when to use proportionate reduction in error (PRE) measures of association including lambda, Goodman & Kruskal’s gamma, or Yule’s Q (e.g., table size, levels of measurement)
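
As a quick refresher on the chi-squared items above, the standard formulas can be written as follows. The notation here is an assumption and may differ from your book's: \(f_o\) is an observed cell frequency, \(f_e\) is an expected cell frequency, \(RM\) and \(CM\) are the row and column marginals for a cell, \(n\) is the total sample size, and \(r\) and \(c\) are the number of rows and columns.

\[
f_e = \frac{RM \times CM}{n}
\]

\[
\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} \quad \text{(definitional formula)}
\]

\[
\chi^2 = \sum \frac{f_o^2}{f_e} - n \quad \text{(computational formula)}
\]

\[
df = (r - 1)(c - 1)
\]

Both versions sum over every cell in the crosstab and produce the same value; the computational formula simply skips the subtraction step within each cell.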

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).


Understanding “statistical independence”

Goal: Understand the null hypothesis of statistical independence and visualize chi-squared test of independence

A few assignments back (Assignment 7), you learned how to describe the association between two categorical variables by creating and interpreting a contingency table or crosstab. In this assignment, you will learn how to make an inference about the relationship between two variables in a population by conducting a chi-squared (\(\chi^2\)) test of independence on a sample crosstab. Additionally, we will briefly introduce you to the phi-coefficient and Cramer’s V, two measures of association that can be interpreted to describe the strength of an association between variables in a crosstab.

Note: You might notice that your book uses “chi-square” yet we use “chi-squared” instead (with a “d” at the end). Which term is correct? It does not really matter as long as you realize we are referring to the same statistical quantity.
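
To make the two measures of association concrete, here is a minimal base R sketch that computes Cramer’s V by hand from a chi-squared statistic. The 2 x 3 table of counts is made up for illustration, and the object names (crosstab, chi2, k, cramers_v) are our own hypothetical choices rather than anything required by the assignment.

```r
# Hypothetical 2 x 3 crosstab of counts (made-up numbers for illustration only).
crosstab <- matrix(
  c(20, 30, 50,
    40, 30, 30),
  nrow = 2, byrow = TRUE
)

# Chi-squared test of independence. The table is larger than 2 x 2,
# so chisq.test() does not apply a continuity correction here.
chisq <- chisq.test(crosstab)

chi2 <- unname(chisq$statistic)  # chi-squared test statistic
n    <- sum(crosstab)            # total sample size
k    <- min(dim(crosstab))       # smaller of the number of rows and columns

# Cramer's V = sqrt(chi2 / (n * (k - 1)))
cramers_v <- sqrt(chi2 / (n * (k - 1)))
round(cramers_v, 3)

# In a 2 x 2 table, k - 1 equals 1, so Cramer's V reduces to the
# phi-coefficient: phi = sqrt(chi2 / n).
```

Because both measures rescale the chi-squared statistic by the sample size, they describe the strength of an association in a way that the test statistic alone does not.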

In Assignment 7, we explained how to set up a crosstab with the independent variable (IV) in the columns and dependent variable (DV) in the rows and then how to describe the association - or lack of association - between the IV and DV by