Assignment 6 Objectives

The purpose of this sixth assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 6 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

This chapter provided an introduction to probability, including foundational rules of probability and probability distributions. It is likely you have heard the term “probability” before and have some intuitions about what it means. You might be surprised to learn that there are different philosophical views about what probability is and is not, and our position on probability will have important implications for the way we approach statistical description and inference!

Our book, like most undergraduate statistics books, largely presents a frequentist view of probability. It starts by presenting us with a basic frequentist mathematical definition of probability as “the number of times that a specific event can occur relative to the total number of times that any event can occur” (p.152). Keen readers will note that this definition of probability sounds uncannily similar to a relative frequency - that’s because it is! In frequentist statistics, empirical probabilities are calculated as observed relative frequencies.
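To make this concrete, here is a minimal sketch in R (the vector of outcomes is made up for illustration):

```r
# Empirical probability as an observed relative frequency.
# 'arrested' is a hypothetical vector of outcomes for 10 cases.
arrested <- c("yes", "no", "no", "yes", "no", "no", "no", "yes", "no", "no")

table(arrested)             # observed frequencies
prop.table(table(arrested)) # relative frequencies = empirical probabilities
# P(arrested = "yes") = 3/10 = 0.3
```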

However, observed (known) relative frequencies - aka empirical probabilities - often are used to do more than simply describe a sample; often, they are used to make inferences about unknown (theoretical) population parameters. Your book describes this long run inferential view of empirical probabilities, or what the authors call the second “sampling” notion of probability, as “the chance of an event occurring over the long run with an infinite number of trials” (p.153). Of course, we cannot actually conduct an infinite number of trials, so we use our known relative frequencies from a sample - aka our known empirical probabilities - to infer what we think would likely happen were we to conduct a very large number of trials. After presenting these frequentist notions of probability, the chapter moves on to explain how we could imagine a theoretical “probability distribution” of outcomes that would emerge from repeated trials of an event, then it describes various types of probability distributions, including binomial, normal, and standard normal distributions.
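As a quick illustration, base R includes functions for working with these theoretical distributions (the numbers below are purely illustrative):

```r
# Binomial: probability of exactly 3 successes in 10 trials with p = 0.5
dbinom(3, size = 10, prob = 0.5)   # ~0.117

# Standard normal: probability that Z falls at or below 1.96
pnorm(1.96)                        # ~0.975
```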

Recall that descriptive statistics involve describing characteristics of a dataset (e.g., a sample), whereas inferential statistics involve making inferences about a population from a subset of sample data drawn from that population. In addition to probability, this chapter also introduces the basics of null hypothesis significance testing, which is the most common procedure by which social scientists use frequentist empirical probabilities and probability distributions to make inferences about populations from descriptions of sample data. Hence, the materials introduced in this chapter and this assignment, including probability, probability rules, probability distributions, standard normal distributions, and standard scores (z-scores), are essential to understanding future assignments that will focus heavily on conducting and interpreting null hypothesis significance tests.

In the current assignment, you will gain a better understanding of frequentist probability by learning to create cross-tabulations (aka joint frequency contingency tables) and to calculate z-scores. As with previous assignments, you will use R Markdown (with R & RStudio) to complete and submit your work.

By the end of assignment #6, you should…

  • understand basic elements of a contingency table (aka crosstab)
    • understand why it can be helpful to place IV in columns and DV in rows
    • recognize column/row marginals and their overlap with univariate frequency distributions
    • know how to calculate marginal, conditional, and joint (frequentist) probabilities
    • know how to compare column percentages (when an IV is in columns of a crosstab)
  • recognize the R operator != as “not equal to”
  • know how to turn off or change scientific notation in R, such as options(scipen=999, digits = 3)
  • be able to remove missing observations from a variable in R using filter(!is.na(var))
  • know how to change a numeric variable to a factor (e.g., nominal or ordinal) variable with haven::as_factor()
    • recognize how to drop an unused factor level (e.g., a leftover “Don’t know” label) on a variable using droplevels() (e.g., data$variable <- droplevels(data$variable))
  • be able to generate a crosstab in R using dplyr::select() & sjPlot::sjtab(depvar, indepvar) (see the first sketch after this list)
    • know how to add a title and column percentages to a sjtab() table and switch output from the RStudio viewer to an HTML browser
  • be able to generate a crosstab in R using crosstable(depvar, by=indepvar)
    • know how to include missing data and add column/row marginal frequency totals to a crosstable() table
    • know how to modify decimal digits & output format in a crosstable() table and render it as an aesthetically pleasing HTML table
  • understand the binomial hypothesis test and how to conduct one in R (see the second sketch after this list)
    • understand the basic inputs of a binomial hypothesis test (p, x, α, & n)
  • be able to modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
  • have a basic understanding of the logic behind null hypothesis significance testing (we will continue to revisit this topic in more detail in remaining weeks)
    • understand the difference between a null (test) hypothesis and contrasting alternative hypotheses
    • understand the alpha or significance level (e.g., as risk tolerance or false positive error control rate)
    • recognize the need to calculate a test statistic and corresponding p-value
    • compare a test p-value to the alpha level and then determine whether the evidence is sufficient to reject the null hypothesis or instead results in a failure to reject the null hypothesis
  • be able to convert raw column (variable) values into standardized z-score values using mutate() (see the final sketch after this list)
  • know that you can create your own R functions (e.g., our funxtoz() function) - and that doing so is recommended for repetitive tasks to avoid copy-and-paste errors
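For a preview, here is a hedged sketch of the crosstab workflow described above; mydata, depvar, and indepvar are hypothetical stand-ins for the assignment’s actual data and variables:

```r
library(dplyr)
library(sjPlot)
library(crosstable)

options(scipen = 999, digits = 3)   # turn off scientific notation

mydata <- mydata %>%
  filter(!is.na(depvar)) %>%                          # drop missing observations
  mutate(indepvar = haven::as_factor(indepvar)) %>%   # numeric -> factor
  droplevels()                                        # drop unused factor levels

# Crosstab with sjPlot (DV in rows, IV in columns, column percentages):
mydata %>%
  select(depvar, indepvar) %>%
  sjtab(title = "DV by IV", show.col.prc = TRUE, use.viewer = FALSE)

# Crosstab with crosstable (missing data, marginal totals, rounded digits):
crosstable(mydata, depvar, by = indepvar,
           showNA = "always", total = "both", percent_digits = 1) %>%
  as_flextable()
```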
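Similarly, here is a minimal sketch of a binomial hypothesis test with made-up numbers (63 successes in 100 trials, null hypothesis p = 0.5, alpha = .05):

```r
# Exact binomial test in base R
binom.test(x = 63, n = 100, p = 0.5)
# If the reported p-value < .05, reject the null; otherwise, fail to reject.

# Optional: tidy the results and display them in a gt() table
# with a Markdown-formatted title (broom and gt assumed installed)
library(broom)
library(gt)

tidy(binom.test(x = 63, n = 100, p = 0.5)) %>%
  gt() %>%
  tab_header(title = md("**Binomial test** of *p* = 0.5"))
```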
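And finally, a sketch of converting a raw column into z-scores with mutate() and of wrapping that logic in a reusable function (the funxtoz() name mirrors the assignment’s function; mydata and var are hypothetical):

```r
library(dplyr)

# z = (x - mean) / sd, written once as a function to avoid copy-paste errors
funxtoz <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

mydata <- mydata %>%
  mutate(
    var_z  = (var - mean(var, na.rm = TRUE)) / sd(var, na.rm = TRUE), # by hand
    var_z2 = funxtoz(var)                                            # via function
  )
```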

Assumptions & Ground Rules

We are building on objectives from Assignments 1-5. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator (see the short reminder sketch after this list)
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • knit your RMD document into an HTML file that you can then save and submit for course credit
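As a quick refresher, a minimal sketch of several of these steps (the file and variable names are hypothetical):

```r
library(haven)
library(dplyr)

# read an SPSS file and assign it to an object with <-
dat <- haven::read_spss("mydata.sav")

head(dat$varname)  # '$' extracts one element (here, a column) from an object
dat %>% head()     # the same snapshot, written with a pipe
```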

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method
  • use groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages (see the sketch after this list)
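For example, a hedged sketch of this reproducible setup (the packages, date, and file path are illustrative):

```r
library(here)
library(groundhog)

# load packages as they existed on a fixed date (version-controlled loading)
groundhog.library(c("dplyr", "sjPlot"), date = "2023-09-01")

# build a project-relative file path instead of a hard-coded one
dat <- haven::read_spss(here("data", "mydata.sav"))
```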

Data viewing & wrangling

  • use the base R head() function to quickly view a snapshot of your data
  • use the glimpse() function to quickly view all columns (variables) in your data
  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions (see the sketch after this list)
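For instance, a brief reminder sketch (dat and its columns are hypothetical):

```r
library(dplyr)

dat_small <- dat %>%
  select(age, sex) %>%                              # keep two columns
  mutate(adult = if_else(age >= 18, "yes", "no"))   # recode with if_else()

attr(dat$age, "label")  # view a haven-imported variable's label
```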

Descriptive data analysis

  • use summarytools::dfSummary() to quickly describe one or more variables in a data file
  • create frequency tables with sjmisc::frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()
  • calculate measures of central tendency for a frequency distribution
    • calculate central tendency using base R functions mean() and median() (e.g., mean(data$variable))
    • calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() and psych::describe() functions
  • calculate measures of dispersion for a variable distribution
    • calculate dispersion measures by hand from frequency tables you generate in R
    • calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc::frq() or summarytools::descr()) (see the sketch after this list)
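A compact reminder of these descriptive tools (dat$age and dat$sex are hypothetical):

```r
summarytools::freq(dat$sex, order = "freq")  # frequency table, sorted by count
sjmisc::frq(dat$age)                         # frequencies plus mean & sd
summarytools::descr(dat$age)                 # central tendency and dispersion
mean(dat$age, na.rm = TRUE)                  # base R central tendency
median(dat$age, na.rm = TRUE)
```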

Data visualization & aesthetics

  • improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
  • create basic graphs using ggplot2’s ggplot() function
    • generate simple bar charts and histograms to visualize shape and central tendency of a frequency distribution
    • generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution
  • modify elements of a ggplot object
    • change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
    • add or change a preset theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
    • select colors from a colorblind accessible palette (e.g., using viridisLite::viridis()) and specify them for the outline and fill colors in a ggplot geometric object (e.g., geom_boxplot())
    • add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title")) (see the sketch after this list)
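For example, a short sketch pulling several of these pieces together (dat and its columns are hypothetical):

```r
library(ggplot2)

ggplot(dat, aes(x = sex, y = age)) +
  geom_boxplot(color = "#990000",                  # outline color (crimson)
               fill = viridisLite::viridis(1)) +   # colorblind-accessible fill
  theme_minimal() +                                # preset theme
  labs(title = "Age by Sex")                       # plot title
```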


If you do not recall how to do these things, review Assignments 1-5.

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

  • difference between descriptive and inferential statistics
  • z-scores
    • the formula for converting a raw score into a z-score (shown below for reference)
    • how to use a z-score table
  • how to calculate conditional probabilities from a frequency table or cross-tabulation
  • rules of probability
    • bounding rule (rule #1; probabilities range from 0 to 1)
    • restricted and general addition rules of probability (rules #2a & #2b; unions)
    • restricted and general multiplication rules of probability (rules #3a & #3b; intersections)
  • probability of an event or the complement of an event
    • independent and mutually exclusive events
    • probability of success - binomial theorem
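For reference, the z-score formula mentioned above converts a raw score x into a standard score:

$$z = \frac{x - \bar{x}}{s}$$

where x̄ is the sample mean and s is the sample standard deviation (or μ and σ when working with population parameters).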