Assignment 7 Objectives

The purpose of this seventh assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 6 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

This chapter provided an introduction to probability, including foundational rules of probability and probability distributions. It is likely you have heard the term “probability” before and have some intuitions about what it means. You might be surprised to learn that there are different philosophical views about what probability is and is not, and our position on probability will have important implications for the way we approach statistical description and inference!

Our book, like most undergraduate statistics books, largely presents a frequentist view of probability. It starts by presenting us with a basic frequentist mathematical definition of probability as “the number of times that a specific event can occur relative to the total number of times that any event can occur” (p.152). Keen readers will note that this definition of probability sounds uncannily similar to a relative frequency - that’s because it is! In frequentist statistics, empirical probabilities are calculated as observed relative frequencies.

However, observed (known) relative frequencies - aka empirical probabilities - often are used to do more than simply describe a sample; often, they are used to make inferences about unknown (theoretical) population parameters. Your book describes this long run inferential view of empirical probabilities, or what the authors call the second “sampling” notion of probability, as “the chance of an event occurring over the long run with an infinite number of trials” (p.153). Of course, we cannot actually conduct an infinite number of trials, so we use our known relative frequencies from a sample - aka our known empirical probabilities - to infer what we think would likely happen were we to conduct a very large number of trials. After presenting these frequentist notions of probability, the chapter moves on to explain how we could imagine a theoretical “probability distribution” of outcomes that would emerge from repeated trials of an event, then it describes various types of probability distributions, including binomial, normal, and standard normal distributions.
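If it helps to make these distributions more concrete, base R has built-in functions for working with them. The short illustration below is not part of the assignment, and the specific numbers are arbitrary examples:

# Binomial distribution: probability of exactly 3 successes in 10 trials
# when the probability of success on each trial is 0.5
dbinom(3, size = 10, prob = 0.5)

# Cumulative binomial probability: 3 or fewer successes in 10 trials
pbinom(3, size = 10, prob = 0.5)

# Standard normal distribution: proportion of the curve falling below
# a z-score of 1.96 (compare this to a z-table in the book)
pnorm(1.96, mean = 0, sd = 1)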

Recall, descriptive statistics involve describing characteristics of a dataset (e.g., a sample), whereas inferential statistics involve making inferences about a population from a subset of sample data drawn from that population. In addition to probability, this chapter also introduces the basics of null hypothesis significance testing, which is the most common procedure by which social scientists use frequentist empirical probabilities and probability distributions to make inferences about populations from descriptions of sample data. Hence, the materials introduced in this chapter and this assignment, including probability, probability rules, probability distributions, standard normal distributions, and standard scores (z-scores), are essential to understanding future assignments that will focus heavily on conducting and interpreting null hypothesis significance tests.

In the current assignment, you will gain a better understanding of frequentist probability by learning to create cross-tabulations (joint frequency contingency tables) and to calculate z-scores. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.

By the end of Assignment 7, you should…

  • understand basic elements of a contingency table (aka crosstab)
    • understand why it can be helpful to place IV in columns and DV in rows
    • recognize column/row marginals and their overlap with univariate frequency distributions
    • know how to calculate marginal, conditional, and joint (frequentist) probabilities
    • know how to compare column percentages (when an IV is in columns of a crosstab)
  • recognize the R operator != as “not equal to”
  • know how to turn off or change scientific notation in R, such as options(scipen=999, digits = 3)
  • be able to remove missing observations from a variable in R using filter(!is.na(var))
  • be able to generate a crosstab in R using dplyr::select() & sjPlot::sjtab(depvar, indepvar)
    • know how to add a title and column percents to a sjtab() table and switch output from the viewer to an html browser
  • understand the binomial hypothesis test and how to conduct one in R
    • understand the basic inputs of a binomial hypothesis test (p, x, \(\alpha\), & n)
  • be able to modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
  • have a basic understanding of the logic behind null hypothesis significance testing (we will continue to revisit this topic in more detail in remaining weeks)
    • understand difference between a null (test) hypothesis and contrasting alternative hypotheses
    • understand alpha or significance level (e.g., as risk tolerance or false positive error control rate)
    • recognize need to calculate a test statistic and corresponding p-value
    • compare a test p-value to alpha level and then determine whether the evidence is sufficient to reject the null hypothesis or instead should result in a failure to reject the null hypothesis
  • be able to convert raw column (variable) values into standardized z-score values using mutate()

Assumptions & Ground Rules

We are building on objectives from Assignments 1-6. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • knit your RMD document into a Word file that you can then save and submit for course credit

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method

Data viewing & wrangling

  • use the base R head() function to quickly view a snapshot of your data
  • use the glimpse() function to quickly view all columns (variables) in your data
  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions

Descriptive data analysis

  • use summarytools::dfsummary() to quickly describe one or more variables in a data file
  • create frequency tables with sjmisc::frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()
  • calculate measures of central tendency for a frequency distribution
    • calculate central tendency using base R functions mean() and median() (e.g., mean(data$variable))
    • calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() and psych::describe() functions
  • calculate measures of dispersion for a variable distribution
    • calculate dispersion measures by hand from frequency tables you generate in R
    • calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc::frq() or summarytools::descr())

Data visualization & aesthetics

  • improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
  • create basic graphs using ggplot2’s ggplot() function
    • generate simple bar charts and histograms to visualize shape and central tendency of a frequency distribution
    • generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution
  • modify elements of a ggplot object
    • change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “turquoise”) or hexadecimal codes (e.g., “#990000” for crimson)
    • add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title"))

If you do not recall how to do these things, review Assignments 1-6.

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

  • difference between descriptive and inferential statistics
  • z-scores
    • formula for converting a raw score into a z-score (a quick formula reference appears after this list)
    • how to use a z-score table
  • how to calculate conditional probabilities from a frequency table or cross-tabulation
  • rules of probabilities
    • bounding rule (rule #1; 0 to 1)
    • restricted and general addition rules of probabilities (rule #2a & #2b; unions)
    • restricted and general multiplication rules of probability (rule #3a & #3b; intersections)
  • probability of an event or the complement of an event
    • independent and mutually exclusive events
    • probability of success - binomial theorem
  • probability distributions
    • binomial, normal, and standard normal distributions
    • formula for normal distribution
    • sampling distributions
  • logic underlying null hypothesis testing
    • null hypothesis
    • alpha level (level of significance)
    • z-scores and critical regions
    • reject or fail to reject the null hypothesis
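As a quick reference (using generic notation, which may differ slightly from the book’s symbols): the z-score formula converts a raw score \(x\) into standard deviation units, \(z = \frac{x - \bar{X}}{s}\) for sample data (or \(z = \frac{x - \mu}{\sigma}\) with population parameters), and the binomial formula gives the probability of exactly \(r\) successes in \(n\) independent trials with success probability \(p\): \(P(r) = \binom{n}{r} p^r (1-p)^{n-r}\).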

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).


Part 1 (Assignment 7.1)

Goal: Read in NCVS and 2012 States Data

(Note: Remember that, when following instructions, always substitute your own last name for “LastName” and the actual date for YEAR-MO-DY. E.g., 2023-02-02_Ducate_CRIM5305_Assign06)

In the last assignment, you learned how to use frequency tables to calculate measures of dispersion and boxplots to visualize dispersion in a frequency distribution. In this assignment, you will learn the basics of probability theory and probability distributions, including binomial and normal distributions. You will also learn how to convert a raw score into a standard score or z-score using the standard normal distribution, as well as how such standard scores are used to test null hypotheses with the goal of making population inferences from sample data.

For many, probability is one of the most difficult things to understand in statistics. Some probability calculations are quite intuitive, while others seem to defy our intuitions. As you read Chapter 6 and watched this week’s videos, hopefully you began to understand how probability theory underlies inferential statistics and allows us to make claims about larger populations from sample data. We will continue to work toward understanding inferential statistics over the next few assignments. In this assignment, we will practice calculating (frequentist) probabilities and using z-scores.

  1. Go to your CRIM5305_L folder, which should contain the R Markdown file you created for Assignment 5 (named YEAR-MO-DY_LastName_CRIM5305_Assign05). Click to open the R Markdown file.
    1. Remember, we open RStudio in this way so the here package will automatically set our CRIM5305_L folder as the top-level directory.
    2. In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.
    3. The dialogue box asks for a Title, an Author, and a Default Output Format for your new R Markdown file.
    4. In the Title box, enter CRIM5305 Assignment 7.
    5. In the Author box, enter your First and Last Name (e.g., Caitlin Ducate).
    6. Under the Default Output Format box, select “Word document”

  2. Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete this text before you begin working.
    1. Create a second-level header titled: “Part 1 (Assignment 7.1).” Then, create a third-level header titled: “Read in NCVS and 2012 States Data”
    2. This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text: This R Markdown document contains my work for Assignment 7. It is my work and only my work.
    3. Now, you need to get data into RStudio. You already know how to do this, but please refer to Assignment 1 if you have questions.

  3. Create a third-level header in your R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
    1. Insert an R chunk.
    2. Inside the new R code chunk, load the following packages: tidyverse, haven, here, sjmisc, sjPlot, summarytools, rstatix, and gt. (A minimal example chunk is sketched after these notes.)
      • This is our first time using the rstatix package, which means you’ll need to download it first using install.packages() or the “Install Packages” button under the “Tools” tab. Then, you can use library() to load it.
      • Recall, you only need to install packages one time; after that, you can comment out that line. However, you must load the packages each time you start a new R session.
      • The rstatix package provides a simple and intuitive framework for performing basic statistical tests, and it is compatible with tidyverse and tidyverse pipe functions. We will use rstatix::binom_test() to conduct a binomial hypothesis test later in the assignment. We will also pipe the tabled output to the gt() function and then modify the table; if you recall, gt() can improve table output aesthetics and permits table customization.
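For reference, a minimal load-libraries chunk along these lines should work; the install.packages() call is shown commented out because installation only needs to happen once:

# install.packages("rstatix")   # run once if rstatix is not yet installed

library(tidyverse)     # data wrangling & pipes
library(haven)         # read SPSS (.sav) files
library(here)          # reproducible file paths
library(sjmisc)        # frq() frequency tables
library(sjPlot)        # view_df() and sjtab() crosstabs
library(summarytools)  # freq() and descr()
library(rstatix)       # binom_test() for binomial hypothesis tests
library(gt)            # polished table output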
  4. After your first code chunk, create another third-level header in RMD titled: “Read Data into R”
    1. Insert another R code chunk.
    2. In the new R code chunk, read and assign the “NCVS lone offender assaults 1992 to 2013.sav” and “2012 states data.sav” SPSS datafiles into R objects. Put the “NCVS lone offender assaults 1992 to 2013.sav” datafile into an object named NCVSData and the “2012 states data.sav” datafile into an object named StatesData.
      • Forget how to do this? Refer to any of the previous assignments.
    3. In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data objects: NCVSData and StatesData.
      • This will call the objects and provide a brief view of the data. Once you have done that, comment out those lines. (Note: You can get a similar but more visually appealing view by simply clicking on an object in the “Environment” window. Also, be sure to put each object name on a separate line or the code will not run properly.)
      • Your RStudio session should now look a lot like this:
# The file names below assume the .sav files are saved in a "Datasets" folder inside
# your project directory; adjust the folder and file names to match your own copies.
StatesData <- read_spss(here("Datasets", "2012 states data.sav"))
NCVSData <- read_spss(here("Datasets", "NCVS lone offender assaults 1992 to 2013.sav"))

[Screenshot: View NCVS and States Data]

Explaining Cross-tabulations

Goal: Understand basic elements of a contingency table

In the next section, we will generate contingency tables - otherwise known as “cross-tabulations” or crosstabs - and use them to calculate (frequentist) marginal, conditional, and joint probabilities.

Contingency tables or cross-tabulations are useful for understanding the association (or lack thereof) between two or more variables. Specifically, we will cross-tabulate two variables from the National Crime Victimization Survey data (NCVSData): the ordinal variable relationship, capturing a victim’s relationship to their assailant (0=“total stranger” to 3=“well known”), and the binary variable maleoff, representing the sex of the offender (0=female; 1=male). Before we do this together, let’s cover the basic elements of a contingency table.

  1. Until now, we have focused primarily on univariate descriptive statistics that summarize characteristics of a frequency distribution for a single variable in a sample, such as the relative frequency or probability of a particular value, or the distribution’s central tendency, shape, or dispersion. Let’s pick a couple of variables from the NCVS data - privatelocation and reportedtopolice - and start by generating their univariate frequency distributions.
    1. Recall, these NCVS data contain survey responses about individuals’ experiences with criminal victimization collected from nationally representative samples of people in U.S. households. This particular subset contains data from the 1992 to 2013 NCVS studies and only includes data from respondents who reported either violent or nonviolent assault by a single offender. For more information, see p.191 in Bachman, Paternoster, & Wilson’s book.
    2. For this example, we selected these two variables to illustrate how a cross-tabulation might help us answer the following research question: Are criminal victimizations that occur in private locations more or less likely to be reported to police than victimizations that do not occur in private locations?
      • privatelocation is a dummy variable (i.e., 0 or 1 values) indicating whether the reported victimization occurred in a private location (0=Not a private location; 1=Private location).
      • reportedtopolice is a dummy variable indicating whether the victimization incident was reported to the police (0=Unreported; 1=Reported).
      • Remember, you can check these details using sjPlot::view_df() (e.g., by typing NCVSData %>% view_df()). Just be sure to comment out this line before you knit.
    3. Below are the univariate frequency distributions for the privatelocation and reportedtopolice variables in our NCVS data subset. Look at the code used to generate the tables–some of it should look familiar!
NCVSData %>% 
  filter(!is.na(reportedtopolice)) %>% 
  freq(privatelocation, report.nas = FALSE) %>% 
  tb() %>% 
  gt()

| privatelocation | freq  | pct     | pct_cum  |
|-----------------|-------|---------|----------|
| 0               | 17618 | 76.2354 | 76.2354  |
| 1               | 5492  | 23.7646 | 100.0000 |

NCVSData %>% 
  filter(!is.na(reportedtopolice)) %>% 
  freq(reportedtopolice, report.nas = FALSE) %>% 
  tb() %>% 
  gt()

| reportedtopolice | freq  | pct      | pct_cum   |
|------------------|-------|----------|-----------|
| 0                | 12639 | 54.69061 | 54.69061  |
| 1                | 10471 | 45.30939 | 100.00000 |

#The following code more efficiently accomplishes the same goal
# NCVSData %>% freq(reportedtopolice, report.nas = FALSE) %>% tb() %>% gt()
  1. You already know how to create these frequency tables with summarytools::freq().
    1. The major addition to this code is the filter() command, which tells R to keep only the rows where respondents reported a “0” or “1” to the reportedtopolice item and to remove the n=859 rows with missing data on this item. It does so by filtering to keep values that are not NA on the reportedtopolice variable (i.e., filter(!is.na(reportedtopolice))). We do this for comparison purposes, since our basic contingency tables will drop those missing (NA) cases by default.
      • Note: With summarytools::freq(), we could have simply added the report.nas = FALSE option to more efficiently accomplish the same goal in our frequency table. However, we will use this NA filter later in this assignment, so we wanted to introduce you to it here as well.
      • You may notice our freq() code also includes the report.nas = FALSE option, which is not entirely redundant - here, it removes the empty NA row from our table.
    2. The other key difference is we transformed the output to a “tibble” (a simple tidy dataframe) by piping it to tb() so that we could then pipe the output to our preferred gt() table formatting package.
    3. A quick look at these univariate frequency distributions shows us that victimizations in private locations are relatively rare, with only about 24% occurring in private locations. Also, a little less than half (45%) of victimizations are reported to the police. However, we cannot tell from these tables whether there is an association between the two variables. Enter contingency tables…

  2. A contingency table (aka, cross-tabulation or crosstab) is a multivariate or “joint” frequency distribution table that simultaneously presents the overlapping frequency distributions for two (or more) variables. Below is a contingency table containing joint frequencies for the privatelocation and reportedtopolice variables.
[Image: Contingency table w/frequencies]
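For a rough preview of how a table like this is generated (you will build it step by step in the next part of the assignment), the basic pattern is to select the DV and IV and then pipe them to sjPlot::sjtab(). The argument names below (fun, title, show.col.prc) are taken from the sjPlot documentation; treat this as a sketch rather than the exact code you will submit:

# DV selected first so it appears in the rows; IV second so it appears in the columns.
# Adding show.col.prc = TRUE would also print column percentages under each frequency.
NCVSData %>%
  select(reportedtopolice, privatelocation) %>%
  sjtab(fun = "xtab", title = "Police Reporting by Victimization Location")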

  1. Recall, our research question: Are criminal victimizations that occur in private locations more or less likely to be reported to police than victimizations that do not occur in private locations? This question implies that privatelocation is the independent variable (IV) and reportedtopolice is the dependent variable (DV). In other words, we hypothesize that whether or not a victimization is reported to police depends at least in part on the location of the victimization.
    1. Logically, it does not make sense to posit that police reporting causes victimization location.
    2. Meanwhile, even if the two variables are associated, victimization location may not have a direct causal relationship with police reporting, as such an association could reflect unmeasured confounding or mediating mechanisms (e.g., systematic differences in victim/offender relationships or offense types across locations might actually cause differences in police reporting). Hence, association != causation.
      • Note the != is an operator in R that means “not equal to”. Recall, we also used ! earlier prior to is.na when filtering missing data. While is.na means “is missing”, !is.na means “not is missing” or “not NA”. So in R, if you want to see if two things are the same, you will use ==, which means “equal to”, while if you want to see if two things are not the same, you will use !=, which means “not equal to.” The use of the exclamation point ! to mean “not” is very common in programming (see the short example after this list).
    3. We might even expect and test more specific directional expectations for the posited association, such as that victimizations occurring in private locations are less likely to be reported to police than victimizations not occurring in private locations. For now, we will stick with a basic non-directional hypothesized association.
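Here is a tiny console illustration of these comparison operators (the values are arbitrary):

2 == 2    # TRUE:  "equal to"
2 != 3    # TRUE:  "not equal to"
!TRUE     # FALSE: the ! operator negates a logical value
# The same idea underlies the earlier filter: keep rows where is.na() is NOT TRUE
# NCVSData %>% filter(!is.na(reportedtopolice))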

  2. In any case, we prefer to generate contingency tables that place the IV on top with its frequencies in the columns and the DV on the side with its frequencies in the rows. While this is a preference, following a consistent practice like this will make it much easier to read contingency tables and result in fewer errors.
    1. Since privatelocation is the independent variable (IV) here, you will note that it is placed in the columns of the contingency table.
    2. Meanwhile, reportedtopolice is the dependent variable, so it is placed in the rows of the contingency table.
  3. The frequency values at the bottom of the IV’s columns and the end of the DV’s rows, respectively, are known as column and row “marginal frequencies” or marginals for short. If you compare these to our univariate frequency distribution tables above, you will notice the marginal frequencies match the univariate frequency distribution values (after using filter to remove NA values).
  4. So, we can identify univariate frequency distributions from the marginal frequencies (row or column totals) in a contingency table. In addition, a cross-tabulation presents joint frequencies.
  5. We can extract a lot of useful information from joint frequencies in a contingency table. For example, we know that 5,492 (column marginal) of the 23,110 (total) victimizations in the data reportedly occurred in a private location. Of those 5,492 private victimizations, 3,075 were reported to police and 2,417 were unreported.
    1. From these joint frequencies, we can easily calculate marginal, joint, or conditional relative frequencies (aka, frequentist probabilities): divide a marginal or joint frequency by the grand total to get a marginal or joint probability, or divide a joint frequency by the appropriate row or column marginal to get a conditional probability. (A short R sketch after this list reproduces the calculations below.)
      • For example, the marginal probability of a crime being reported is p(reported)=0.45 (i.e., 10,471/23,110). Note that this marginal probability (aka, marginal relative frequency) is independent of the values of the other (privatelocation) variable.
      • The conditional probability of an assault being reported given it occurred in a private location is p(reported|private)=0.56 (i.e., 3,075/5,492). Note that this conditional probability depends on - it is conditional on - the value of the privatelocation variable (i.e., privatelocation = 1).
      • The conditional probability of an assault not being reported given it occurred in a private location is p(unreported|private) = 0.44 (i.e., 2,417/5,492). Note that you could also calculate this probability by subtracting its complement, which we just calculated above (i.e., 1-0.56).
      • The joint probability of an assault occurring in a private location and being unreported is p(private unreported)=0.10 (i.e., 2,417/23,110).
    2. You will practice calculating probabilities like these later in the assignment when answering question 2 from the Chapter 6 SPSS exercises. When you do, remember that it is very easy to read and calculate these relative frequencies or probabilities incorrectly, such as by selecting and dividing by the wrong frequency value (e.g., row instead of column, or total instead of row or column). To avoid such mistakes, take your time, be sure you know what it is you want to describe in the data, and be consistent in constructing tables (e.g., always placing your IV in the columns will help a lot!).
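To check hand calculations like those above, you can reproduce them with simple arithmetic in R. The frequencies below are taken directly from the contingency table:

total              <- 23110  # all victimizations (grand total)
reported_total     <- 10471  # row marginal: reported to police
private_total      <-  5492  # column marginal: private location
private_reported   <-  3075  # joint frequency: private AND reported
private_unreported <-  2417  # joint frequency: private AND unreported

reported_total / total                # marginal p(reported), about 0.45
private_reported / private_total      # conditional p(reported | private), about 0.56
private_unreported / private_total    # conditional p(unreported | private), about 0.44
private_unreported / total            # joint p(private AND unreported), about 0.10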
  1. In addition to joint frequency distributions, it is also common to see percentage values reported in contingency tables. Once again, it is important to be sure the calculated or reported percentages are the percentages that you want when reading and interpreting the table. If we stick with the recommended practice of placing our IV in the columns, then we often want column percentages (as opposed to row percentages). The image above