Assignment 9 Objectives

The purpose of this ninth assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 9 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

In the last assignment, you learned how to conduct a one-sample z or t hypothesis test of the difference between a sample and population mean and then, given the test results and the null hypothesis, to make an appropriate inference about the population mean by either rejecting or failing to reject the null hypothesis. In this assignment, you will learn how to make population inferences about the relationship between two categorical variables by conducting a chi-squared test of independence on a sample contingency table (crosstab).

By the end of assignment #9, you should…

  • recognize you can manually build a simple tibble row-by-row using tidyverse’s tibble::tribble()
  • recognize you can use round() to specify number of decimals on numeric values
  • recognize you can modify a gt() table to add or remove the decimals in specific columns or rows with fmt_number()
  • be able to conduct a chi-squared test of independence using sjPlot::sjtab() or chisq.test() and interpret results
    • know how to specify different measures of association with statistics= using sjPlot::sjtab() and interpret them appropriately.
    • know how to generate observed and expected frequencies by assigning results of chisq.test() to an object (e.g., chisq) and then calling elements from object (e.g., chisq$observed or chisq$expected)

Assumptions & Ground Rules

We are building on objectives from Assignments 1-8. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • knit your RMD document into an HTML file that you can then save and submit for course credit
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • recognize the R operator != as “not equal to”
  • turn off or change scientific notation in R, such as options(scipen=999, digits = 3)
  • create a list or vector and assign to object, such as listname <- c(item1, item2)
  • recognize that you can use lapply() to create a list of objects, which can help you avoid cluttering the R Environment with objects
  • recognize that you can create your own R functions (e.g., our funxtoz() function) - and that doing so is recommended for duplicate tasks to avoid copy-and-paste errors

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method
  • use groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages
  • improve reproducibility of randomization tasks in R by setting the random number generator seed using set.seed()
  • know that you can share examples or troubleshoot code in a reproducible way by using built-in datasets like mtcars that are universally available to R users

Data viewing & wrangling

  • use the base R head() function to quickly view the first few rows of data
  • use the base R tail() function to quickly view the last few rows of data
  • use the glimpse() function to quickly view all columns (variables) in your data
  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • remove missing observations from a variable in R when appropriate using filter(!is.na(var))
  • change a numeric variable to a factor (e.g., nominal or ordinal) variable with haven::as_factor()
  • drop an unused factor level (e.g., missing “Don’t know” label) on a variable using data %>% droplevels(data$variable)
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions
  • convert raw column (variable) values into standardized z-score values using mutate()
  • select random sample from data without or with replacement using dplyr::sample_n()
  • select data with conditions using dplyr::filter() and %in% operator
  • simulate data from normal, truncated normal, or uniform probability distributions using rnorm(), truncnorm::rtruncnorm(), or runif()
  • draw random samples from data in R
    • draw one random sample with dplyr::slice_sample()
    • draw multiple (“replicate”) random samples with infer::rep_slice_sample()

Descriptive data analysis

  • use summarytools::dfSummary() to quickly describe one or more variables in a data file
  • create frequency tables with sjmisc::frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()
  • calculate measures of central tendency for a frequency distribution
    • calculate central tendency using base R functions mean() and median() (e.g., mean(data$variable))
    • calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() and psych::describe() functions
  • calculate measures of dispersion for a variable distribution
    • calculate dispersion measures by hand from frequency tables you generate in R
    • calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc::frq() or summarytools::descr())
  • recognize and read the basic elements of a contingency table (aka crosstab)
    • place IV in columns and DV in rows of a crosstab
    • recognize column/row marginals and their overlap with univariate frequency distributions
    • calculate marginal, conditional, and joint (frequentist) probabilities
    • compare column percentages (when an IV is in columns of a crosstab)
  • generate and modify a contingency table (crosstab) in R with dplyr::select() & sjPlot::sjtab(depvar, indepvar) or with crosstable(depvar, by=indepvar)

Data visualization & aesthetics

  • improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
    • modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
  • create basic graphs using ggplot2’s ggplot() function
    • generate simple bar charts and histograms to visualize shape and central tendency of a frequency distribution
    • generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution
  • modify elements of a ggplot object
    • change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
    • add or change a preset theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
    • select colors from a colorblind accessible palette (e.g., using viridisLite::viridis()) and specify them for the outline and fill colors in a ggplot geometric object (e.g., geom_boxplot())
    • add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title"))
  • be able to combine multiple ggplot() plots into a single figure using “patchwork” package
    • know how to customize plot layout and add title to patchwork figure
    • recognize that one can write a custom function to repeatedly generate similar plots before combining them with patchwork
    • recognize you can use patchwork::wrap_plots() to quickly combine ggplot objects contained in a list
  • recognize that one can plot means and confidence intervals using ggplot() + geom_point() + geom_errorbar()
  • recognize that one can add elements like vertical lines (+ geom_vline()), arrows (+ geom_segment()), or text (+ annotate()) to a ggplot() object

Hypothesis testing & statistical inference

  • conduct and interpret a null hypothesis significance test
    • specify null (test) hypothesis & identify contrasting alternative hypothesis (or hypotheses)
    • set an alpha or significance level (e.g., as risk tolerance or false positive error control rate)
    • calculate a test statistic and corresponding p-value
    • compare a test p-value to alpha level and then determine whether the evidence is sufficient to reject the null hypothesis or should result in a failure to reject the null hypothesis
  • conduct a binomial hypothesis test in R with rstatix::binom_test()
  • generate and interpret confidence intervals to quantify uncertainty caused by sampling variability
    • identify t or z critical values associated with a two-tailed confidence level using qt() or qnorm()
    • estimate the standard error of a sample mean or proportion in R
    • estimate a two-tailed confidence interval around a sample mean or proportion in R
    • properly interpret and avoid common misinterpretations of confidence intervals
  • be able to conduct a one-sample z or t hypothesis test of the difference between a sample and assumed population mean in R and interpret results
    • be able to conduct a one-sample test using the base R t.test() function
    • be able to manually calculate a z or t test statistic by typing formula in R
    • be able to conduct a one-sample test using infer::t_test()
    • know how to use infer::visualize() to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis


If you do not recall how to do these things, review Assignments 1-8.

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

  • contingency tables or crosstabs
    • joint frequency distribution
    • column marginal or column frequency
    • row marginal or row frequency
    • how to compare percentage differences (across IV and within DV categories)
  • chi-squared test of independence
    • observed frequency
    • expected frequency
    • how to calculate with the definitional and computational formulas
  • measures of association
    • positive and negative relationships
    • how to calculate and when to use phi-coefficient, contingency coefficient, Cramer’s V (e.g., table size, levels of measurement)
    • how to calculate and when to use proportionate reduction in error (PRE) measures of association including lambda, Goodman & Kruskal’s gamma, or Yule’s Q (e.g., table size, levels of measurement)

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).


Understanding “statistical independence”

Goal: Understand the null hypothesis of statistical independence and visualize chi-squared test of independence

A few assignments back (#6), you learned how to describe the association between two categorical variables by creating and interpreting a contingency table or crosstab. In this assignment, you will learn how to make an inference about the relationship between two variables in a population by conducting a chi-squared (\(\chi^2\)) test of independence on a sample crosstab. Additionally, we will briefly introduce you to the phi-coefficient and Cramer’s V, two measures of association that can be interpreted to describe the strength of an association between variables in a crosstab.

Note: You might notice that your book uses “chi-square” yet we use “chi-squared” instead (with a “d” at the end). Which term is correct? It does not really matter as long as you realize we are referring to the same statistical quantity.

In Assignment 6, we explained how to set up a crosstab with the independent variable (IV) in the columns and dependent variable (DV) in the rows and then how to describe the association - or lack of association - between the IV and DV by comparing column percentages within rows and across columns. For example, recall this crosstab from the earlier assignment:

[Figure: Comparing column percentage differences]


The crosstab above allows us to assess whether there is an association between two variables - “private location of victimization” and “victimization reported to police.” We can describe the association by comparing column percentages within a single row. For instance, in Assignment 6, we compared the percentage of victimization incidents in private locations that were reported to police with the percentage of victimization incidents in non-private locations that were reported to police (purple highlighted rows). Specifically, we stated that “there is an association between these two variables: victimizations occurring in private locations are less likely to go unreported than are victimizations occurring in other places (44% versus 58%, respectively).” Alternatively, about 56% of victimizations in private locations were reported to police, while only about 42% of victimizations in non-private places were reported to police.

So, if these column percentages differ, which they did (58% - 44% = 14 percentage point difference), then we can conclude there is an association between these variables in our sample. But how strong is that association? Or, if we have a probability sample, can we confidently conclude that there is an association in the population, or is the observed difference likely due to sampling variation instead?

In Assignment 6, we did not discuss in detail the note line below the table; let’s identify what is reported there. First, the note line reports results of a chi-squared test of independence, including the chi-squared test statistic (\(\chi^2=331.108, df=1\)) and later the p-value (\(p=0.000\)) associated with this test. This test assesses how likely it would be to observe differences in percentages within DV values (rows) and across IV values (columns) at least as large as those observed if the two variables were truly “independent” or unrelated in the population. Additionally, the note line reports the phi-coefficient (\(\phi=0.120\)). As explained in this week’s book chapter, the phi-coefficient is a measure of association that helps us summarize the strength of relationship between two variables.

In this assignment, you will learn how to calculate and interpret a chi-squared test of independence in R. First, though, we want to walk you through the step-by-step (definitional) calculation of the chi-squared statistic. Our goal is to help you understand the logic underlying the chi-squared test of independence and, specifically, the null hypothesis of statistical independence between variables.

Illustrating statistical independence

As noted above, the chi-squared test of independence estimates the probability of observing column percentage differences as large or larger than those observed under the assumption - that is, given the null hypothesis - that the two variables are independent in the population.

What does “independent” mean in this context? Well, it means that we are comparing the observed cell frequencies in our crosstab with the expected cell frequencies under the null hypothesis, or with the cell frequencies that we would expect to see if the column and row marginal totals (i.e., univariate frequency distributions) remained the same yet there was no association (\(\phi=0\)) between the two variables.

As such, if there is no discrepancy between observed cell frequencies and expected (null) cell frequencies, the chi-squared value will equal “0”. In contrast, large discrepancies between observed and expected cell frequencies (and/or large sample sizes) result in large chi-squared test statistics, smaller p-values, and more confidence in the inference that the observed differences would be highly improbable if the null hypothesis of statistical independence were true in the population.
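To make this concrete, here is a small optional illustration using base R’s chisq.test() function (which we formally introduce later in this assignment) on a hypothetical 2x2 table whose columns have identical column percentages. Because the observed frequencies exactly equal the expected frequencies under independence, the chi-squared statistic equals 0:

# Hypothetical table: both columns split 40%/60%, so observed = expected
indep_tab <- matrix(c(20, 30, 40, 60), nrow = 2)
indep_tab
chisq.test(indep_tab)  # X-squared = 0, p = 1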

To help you better understand the null hypothesis of independence, we will calculate the expected frequencies under the null for each cell in the example table above. As your book explains, the expected frequencies are the joint cell frequencies we would expect to see if the variables were independent or completely unrelated.

Before we calculate these expected frequencies, let’s start by reproducing the example table with only the marginal totals included (i.e., univariate frequency distributions); in the cells where we need to calculate expected frequencies, we will start by listing them as missing or “NA” values. We will manually build this table by inputting each frequency value row-by-row into a tibble (i.e., tidy data object) using the tidyverse tibble::tribble() function.

expectedfreqs <- tribble(
  ~DV, ~non_private, ~private, ~Total,
  "unreported", NA, NA, 12639,  
  "reported", NA, NA, 10471,
  "Total", 17618, 5492, 23110
  ) 
expectedfreqs %>% 
  gt() %>%
   tab_header(title = md("Marginal Frequencies Only"))
Marginal Frequencies Only
DV non_private private Total
unreported NA NA 12639
reported NA NA 10471
Total 17618 5492 23110


Now, let’s fill in the expected frequency values. Remember, these are the cell frequencies that we would expect to find if the variables were independent - that is, if there were absolutely no association between the two variables. To calculate each expected frequency, we multiply the cell’s row frequency total by the cell’s column frequency total and then divide by the total sample size (N). For example, the expected frequency for the top left cell is calculated as: \(E_{[1,1]}=\frac{12,639 \times 17,618}{23,110}\approx9,635.39\).

Note: We formatted the expected frequencies to display only two decimals by using the round() function. However, when we added these values to our tibble, the tribble() function automatically made our formatting consistent by adding two decimals to the column marginals as well. So, we also applied a simple fix by modifying our gt() table with fmt_number() to remove the decimals in the column marginal frequencies (columns 2 & 3, row 3).

EC1R1 <- round((12639*17618)/23110, digits=2)
EC1R2 <- round((10471*17618)/23110, digits=2)
EC2R1 <- round((12639*5492)/23110, digits=2)
EC2R2 <- round((10471*5492)/23110, digits=2)

expectedfreqs <- tribble(
  ~DV,          ~non_private, ~private, ~Total,
  "unreported", EC1R1,        EC2R1,    12639,  
  "reported",   EC1R2,        EC2R2,    10471,
  "Total",      17618,        5492,     23110
  ) 
expectedfreqs %>%  
  gt() %>%
   tab_header(title = md("Expected Frequencies")) %>%
   fmt_number(
    columns = 2:3,
    rows = 3,
    decimals = 0
    )
Expected Frequencies
DV non_private private Total
unreported 9635.39 3003.61 12639
reported 7982.61 2488.39 10471
Total 17,618 5,492 23110


Before calculating the chi-squared statistic, let’s first transform these expected frequencies into expected column percentages to help us interpret our (relative) frequency expectations under the null hypothesis of independence.

EC1R1p <- round((((12639*17618)/23110)/17618), digits=4)
EC1R2p <- round((((10471*17618)/23110)/17618), digits=4)
EC2R1p <- round((((12639*5492)/23110)/5492), digits=4)
EC2R2p <- round((((10471*5492)/23110)/5492), digits=4)
expectedfreqs <- tribble(
  ~DV,          ~non_private, ~private, ~Total,
  "unreported", EC1R1p,        EC2R1p,    12639,  
  "reported",   EC1R2p,        EC2R2p,    10471,
  "Total",      17618,        5492,     23110
  ) 
expectedfreqs %>%  
  gt() %>%
   tab_header(title=md("Expected Column Percents")) %>%
   fmt_number(
    columns = 2:3,
    rows = 3,
    decimals = 0
    ) %>%
   fmt_percent(
    columns = 2:3,
    rows = 1:2,
    decimals = 1
    )
Expected Column Percents
DV non_private private Total
unreported 54.7% 54.7% 12639
reported 45.3% 45.3% 10471
Total 17,618 5,492 23110


So, if the null hypothesis of independence (i.e., no association) were true, then we would expect about 45% of victimization incidents to be reported to police irrespective of whether the incident occurred in a private or a non-private location.

In contrast to these expectations under independence, we observed that about 56% of victimizations occurring in private locations were reported to police, whereas about 42% of victimizations occurring in non-private places were reported to police. Therefore, our observed frequencies differ from the expected frequencies under the null.

To test the null hypothesis of independence, we calculate a test statistic that essentially equals the sum of standardized differences between observed and expected cell frequencies in a crosstab. Specifically, the definitional formula for calculating the chi-squared test statistic is: \(\chi^2_{obt}=\Sigma{\frac{(Obs-Exp)^2}{Exp}}\), or the sum of squared differences between observed and expected cell frequencies, each divided by the expected frequency. Below, we calculate the standardized difference for each cell and place these values in our four table cells instead.

# Observed frequencies (10222, 7396, 2417, 3075) come from the example crosstab above
stdiffC1R1 <- round(((10222 - EC1R1)^2)/EC1R1, digits=2)
stdiffC1R2 <- round(((7396 - EC1R2)^2)/EC1R2, digits=2)
stdiffC2R1 <- round(((2417 - EC2R1)^2)/EC2R1, digits=2)
stdiffC2R2 <- round(((3075 - EC2R2)^2)/EC2R2, digits=2)

expectedfreqs <- tribble(
  ~DV,          ~non_private, ~private, ~Total,
  "unreported", stdiffC1R1,        stdiffC2R1,    12639,  
  "reported",   stdiffC1R2,        stdiffC2R2,    10471,
  "Total",      17618,        5492,     23110
  ) 
expectedfreqs %>%  
  gt() %>%
   tab_header(title=md("Standardized differences in cell frequencies")) %>%
   fmt_number(
    columns = 2:3,
    rows = 3,
    decimals = 0
    ) 
Standardized differences in cell frequencies
DV non_private private Total
unreported 35.71 114.57 12639
reported 43.11 138.29 10471
Total 17,618 5,492 23110


Our chi-squared test statistic, then, equals the sum of these four cell values:

chisqtest <- stdiffC1R1 + stdiffC1R2 + stdiffC2R1 + stdiffC2R2
chisqtest
## [1] 331.68

Our manually calculated chi-squared test statistic of 331.68 is nearly the same as the test statistic reported with the original table (331.108)!

You might wonder why the values are not exactly the same. It is not due to rounding. In fact, it turns out that R applies what is known as the Yates correction when calculating this statistic. Essentially, this correction involves subtracting 1/2 from the absolute value of each difference in observed and expected values in a 2x2 table before squaring each difference, dividing by the expected frequency, and then summing each standardized difference. Below, we show how to manually apply this correction (using the abs() function to generate the absolute difference); doing so results in the exact same value as the one reported in the original table.

stdiffC1R1 <- round(((abs(10222 - EC1R1) -.5)^2)/EC1R1, digits=3)
stdiffC1R2 <- round(((abs(7396 - EC1R2) -.5)^2)/EC1R2, digits=3)
stdiffC2R1 <- round(((abs(2417 - EC2R1) -.5)^2)/EC2R1, digits=3)
stdiffC2R2 <- round(((abs(3075 - EC2R2) -.5)^2)/EC2R2, digits=3)

chisqtest_adj <- stdiffC1R1 + stdiffC1R2 + stdiffC2R1 + stdiffC2R2
chisqtest_adj
## [1] 331.108

Now that you know how to calculate the chi-squared test statistic, what do you do with it? The same thing we have done with other test statistics (e.g., t and z) - you compare it to a pre-specified critical value associated with an alpha level and a probability distribution of standardized values!

In this case, our test statistic (chi-squared or \(\chi^2\)) follows a chi-squared probability distribution. Like the t distribution, there are different chi-squared distributions that vary depending on the degrees of freedom associated with the test. In a chi-squared test of independence applied to a crosstab, we calculate degrees of freedom as: (the number of rows minus 1) times (the number of columns minus 1), or \(df=(Rows-1)(Cols-1)\).
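Incidentally, you do not need a printed table to find chi-squared critical values: base R’s qchisq() function returns them directly. Here is a brief optional sketch; note that the df=1 value matches the 3.841 critical value we reference below.

# Chi-squared critical values at alpha = .05 (95th percentile of each distribution)
qchisq(p = 0.95, df = 1)          # 2x2 table: df=1 -> 3.841
qchisq(p = 0.95, df = c(2, 3, 4)) # larger tables: 5.99, 7.81, 9.49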

Below, we use ggplot to visualize the chi-squared probability distribution for a 2x2 table with df=(2-1)(2-1)=1.

# Simulate 100,000 draws from a chi-squared distribution with df = 1
x <- tibble(value = rchisq(100000, df = 1)) %>%
  mutate(
    critvals = if_else(value > 3.841, "1", "0")
  )

x %>% ggplot() +
  geom_histogram(aes(x=value, y=..density..),
                 color="#1fa187", fill="#4ac16d", alpha=.7,
                 position="identity", breaks = seq(0, 10, by = .05)) + 
  stat_function(fun = dchisq, args = list(df = 1),
                                color='#365c8d', size=1.1) +
  coord_cartesian(expand = FALSE, xlim=c(0,8), ylim=c(0,1)) +
  theme_minimal() + 
  theme(axis.text.y=element_blank(),  
        axis.ticks.y=element_blank()
        ) +
  labs(x="chi-squared value, df=1", y=NULL)

Now, let’s modify this graph a bit. First, we will overlay a vertical line at 3.841, which is the critical value for a chi-squared test with df=1 and alpha=0.05 (\(\chi^2_{crit}=3.841\)). Second, we will shade the distribution beyond our critical value with a different color and fill combination.

x %>% ggplot(aes(x=value)) +
  geom_histogram(aes(y=..count.., color=critvals, fill=critvals), 
                 alpha=.7,  
                 position="identity", breaks = seq(0, 10, by = .05)) + 
  coord_cartesian(expand = FALSE, xlim=c(0,8), ylim=c(0,4000)) +
  geom_vline(xintercept = 3.841, color = "#d44842", size=1.2, alpha=0.7) +
  annotate("text", x=4, y=1000, angle=0, color="#d44842", hjust=0, 
           label="critical value=3.841\ndf=1, alpha=.05") +
  scale_color_manual(values=c("#1fa187","#bc3754")) +
  scale_fill_manual(values=c("#4ac16d","#e45a31")) +
  theme_minimal() + 
  theme(axis.text.y=element_blank(),  
        axis.ticks.y=element_blank(),
        legend.position="none"
        ) +
  labs(x="chi-squared value, df=1", y=NULL)

As with other null hypothesis tests you have learned about thus far, if our chi-squared test statistic is greater than the pre-specified critical value, we reject the null hypothesis of statistical independence. In contrast, if our test statistic is smaller than the critical value, we would fail to reject the null hypothesis of statistical independence.

So, where does our test statistic fall in the chi-squared distribution and in relation to the critical value? Let’s add a dashed line to our plot at the unadjusted value of \(\chi^2_{obt}=331.68\) and see where it falls in the distribution. To see it, we will need to zoom out the plot’s x-axis quite a bit (from xlim(0,8) to xlim(0,332)). We will leave the y-axis limit the same as above to get a better sense of how far away our test statistic is from the critical value.

x %>% ggplot(aes(x=value)) +
  geom_histogram(aes(y=..count.., color=critvals, fill=critvals), 
                 alpha=.7,  
                 position="identity", breaks = seq(0, 10, by = .05)) + 
  coord_cartesian(expand = FALSE, xlim=c(0,332), ylim=c(0,4000)) +
  geom_vline(xintercept = 3.841, color = "#d44842", size=1.2, alpha=0.7) +
  annotate("text", x=10, y=1000, angle=0, color="#d44842", hjust=0, 
           label="critical value=3.841\ndf=1, alpha=.05") + 
  geom_vline(xintercept = 331.68, color = "#365c8d", linetype="dashed", 
             size=1, alpha=0.7) +
  annotate("text", x=325, y=1000, angle=0, color="#365c8d", hjust=1, 
           label="test statistic=331.68") + 
  scale_color_manual(values=c("#1fa187","#bc3754")) +
  scale_fill_manual(values=c("#4ac16d","#e45a31")) +
  theme_minimal() + 
  theme(axis.text.y=element_blank(),  
        axis.ticks.y=element_blank(),
        legend.position="none"
        ) +
  labs(x="chi-squared value, df=1", y=NULL)

As you can see in the plot above, the chi-squared test statistic is greater than the critical value (331.68 > 3.841), so there is sufficient evidence to reject the null hypothesis of statistical independence. This is also implied by the p-value, which is less than the alpha level (0.000 < 0.05). If the null hypothesis indeed were true in the population, then it would be highly improbable to obtain a sample with differences at least as large as those observed in the original table due to sampling variability alone. Therefore, based on these sample statistics, we can infer that the null hypothesis of independence is unlikely to be true in the population.
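If you are curious where that tiny p-value comes from, base R’s pchisq() function returns the exact upper-tail probability for any chi-squared test statistic. A quick optional sketch:

# Probability of a chi-squared value at least this large if the null were true
pchisq(331.108, df = 1, lower.tail = FALSE)
# returns a vanishingly small number (far below alpha = .05),
# which crosstab programs typically round and report as p=0.000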


Part 1 (Assignment 9.1)

Goal: Filter data for violent incidents (injured==1) and examine univariate frequency distributions

For this assignment, you will learn to conduct chi-squared tests of independence in R by working through Question #5 from the Chapter 9 SPSS Exercises in the book. That question asks you to investigate whether women are more likely than men to experience violent victimization by strangers. To do so, you first need to filter the data to include only victimizations that resulted in injury to the victim (with the ‘injured’ variable), then conduct a chi-squared test of independence on the variables ‘V3018’ (sex of victim) and ‘relationship’ (relationship between victim and offender) to test the hypothesis that the two variables are statistically independent.

Below, we will demonstrate how to complete the assignment steps by investigating a different question using the same data: Are violent victimizations that occur in private locations more or less likely to be reported to police than violent victimizations that do not occur in private locations? This is nearly the same question that we investigated with sample descriptive statistics (crosstabs) in Assignment 6 and portrayed in the example above. One important difference, however, is that we will be filtering our data to keep only those victimizations where an injury occurred so that we can examine violent victimizations rather than all (violent and nonviolent) criminal victimizations. As before, we will use the ‘privatelocation’ and ‘reportedtopolice’ variables from the NCVS data, then we will generate a crosstab and conduct a hypothesis test to make a population inference about the (non)independence of these two variables.

  • Recall, these NCVS data contain survey responses about individuals’ experiences with criminal victimization collected from nationally representative samples of people in U.S. households. This particular subset contains data from the 1992 to 2013 NCVS studies and only includes data from respondents who reported either violent or nonviolent assault by a single offender. For more information, see Bachman, Paternoster, & Wilson’s book.
    • ‘privatelocation’ is a dummy variable (i.e., 0 or 1 values) indicating whether the reported victimization occurred in a private location (0=Not a private location; 1=Private location).
    • ‘reportedtopolice’ is a dummy variable indicating whether the victimization incident was reported to the police (0=Unreported; 1=Reported).
    • Remember, too, that you can check these details using sjPlot::view_df() (e.g., by typing NCVSData %>% view_df()).

By illustrating the process with different variables, you will get the experience of conducting these tests in R yourself using the same data. Then, after working through the process with our example, you will repeat the same steps with the ‘V3018’ (sex of victim) and ‘relationship’ variables using the same filtered data to complete the assignment. Be sure to repeat these steps with the correct variables and filtered data listed in the assignment to ensure you receive full credit.

(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR_MO_DY for the actual date. E.g., 2022_07_06_Fordham_K300Assign9)

  1. Go to your K300_L folder, which should contain the R Markdown file you created for Assignment 8 (named YEAR_MO_DY_LastName_K300Assign8). Click to open the R Markdown file.
    1. Remember, we open RStudio in this way so the here package will automatically set our K300_L folder as the top-level directory.
    2. In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.
    3. The dialogue box asks for a Title, an Author, and a Default Output Format for your new R Markdown file.
    4. In the Title box, enter K300 Assignment 9.
    5. In the Author box, enter your First and Last Name (e.g., Tyeisha Fordham).
    6. Under Default Output Format box, select “HTML document” (HTML is usually the default selection)

  2. Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete this text before you begin working.
    1. Create a second-level header titled: “Part 1 (Assignment 9.1).” Then, create a third-level header titled: “Read in and Filter NCVS Data”
    2. This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text: This R Markdown document contains my work for Assignment 9. It is my work and only my work.
    3. Now, you need to get data into RStudio. You already know how to do this, but please refer to Assignment 1 if you have questions.

  3. Create a third-level header in R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
    1. Insert an R chunk.
    2. Inside the new R code chunk, load the following packages: tidyverse, haven, here, sjmisc, sjPlot, and gt. Recall, you only need to install packages one time. However, you must load them each time you start a new R session.

  4. After your first code chunk, create another third-level header in RMD titled: “Read Data into R”
    1. Insert another R code chunk.
    2. In the new R code chunk, read and assign the “NCVSLoneOffenderAssaults1992to2013.sav” SPSS datafile into R. Assign the data to an object named NCVSData.
      • Forget how to do this? Refer to any of the previous assignments.
    3. In the same code chunk, on a new line below your read data/save object command, type the name of your new R data object: NCVSData.
      • This will call the object and provide a brief view of the data. Remember, you can get a similar but more visually appealing view by simply clicking on the object in the “Environment” window. Also, be sure to put these functions on separate lines or they will not run properly.
      • Here is a glimpse of what the first few rows of your data should look like:
NCVSData <- read_spss(here("Datasets", "NCVSLoneOffenderAssaults1992to2013.sav"))

head(NCVSData) %>% gt()
YEAR V2119 V2129 V3014 V3016 V3018 V3021 V3023 V3023A V3024 V2026 V4049 V4234 V4235 V4236 V4237 V4237A V4238 V4239 V4240 V4241 V4242 V4243 V4244 V4245 V4246 V4246A V4246B V4246C V4246E V4246F V4246G V4247 V4528 injured privatelocation reportedtopolice weaponpresent medicalcarereceived filter_$ relationship Policereported victimreported thirdpartyreport maleoff age_r vic18andover
2000 2 2 34 6 1 1 1 NA 2 12 1 1 NA 2 6 NA 2 3 NA 1 NA 3 NA 1 1 NA NA NA NA NA NA 2 11 1 1 0 1 1 1 3 0 0 0 0 34 1
2000 2 1 51 3 1 2 2 NA 2 10 1 1 NA 1 6 NA 2 1 1 1 NA 3 NA NA 2 NA NA NA NA NA NA 1 13 0 1 0 1 NA 0 3 0 0 0 1 51 1
2000 2 2 20 5 1 1 1 NA 2 14 2 1 NA 1 5 NA 2 1 1 1 NA 3 NA NA 3 NA NA NA NA NA NA 1 20 0 0 0 0 NA 0 3 0 0 0 1 20 1
2000 2 2 38 3 2 1 1 NA 2 10 2 1 NA 1 6 NA 2 2 NA 1 NA 3 NA 2 1 NA NA NA NA NA NA 1 11 1 1 1 0 1 1 3 1 1 0 1 38 1
2000 2 1 71 1 2 1 1 NA 2 10 2 1 NA 2 6 NA 2 2 NA 1 NA 2 NA NA 2 NA NA NA NA NA NA 1 20 0 1 0 0 NA 0 2 0 0 0 0 71 1
2000 2 2 12 5 1 1 1 NA 2 10 2 1 NA 1 2 NA 3 3 NA 1 NA 2 NA NA 1 NA NA NA NA NA NA 1 14 1 0 1 0 0 1 2 1 0 1 1 12 0


  5. Next, the assignment question asks you to select only those cases where the victim was injured by the offender. Recall, we can use dplyr::filter() to select certain rows of data based on column values.
    1. Create a third-level header titled: “Selecting Cases Where Victim was Injured”. Then, insert an R chunk.
    2. Type NCVSData_inj <- NCVSData %>%. Hit enter and type dplyr::filter(injured == 1).
    3. Next, hit enter and type NCVSData_inj %>% frq(injured) to view the frequency distribution for the “injured” variable and ensure our new data object contains only those cases where the victim sustained injuries during victimization (injured == 1). In the frequency table for the “injured” variable, the row labeled “uninjured” (value = “0”) should have an N=0.

#Selecting only those observations (rows) where the victim was injured. 'injured' = 1 if the victim was injured and 'injured' = 0 if the victim was uninjured. If 'injured' = 0, the case is excluded.
NCVSData_inj <- NCVSData %>% 
  dplyr::filter(injured == 1)
NCVSData_inj %>%
  frq(injured)
## Victim Sustained Injuries During Vicitmization (injured) <numeric> 
## # total N=7809 valid N=7809 mean=1.00 sd=0.00
## 
## Value |     Label |    N | Raw % | Valid % | Cum. %
## ---------------------------------------------------
##     0 | uninjured |    0 |     0 |       0 |      0
##     1 |   injured | 7809 |   100 |     100 |    100
##  <NA> |      <NA> |    0 |     0 |    <NA> |   <NA>
  6. Now let’s create frequency tables for the ‘privatelocation’ and ‘reportedtopolice’ variables from the new filtered data object. This will allow us to examine how often violent (injury-involved) victimizations occurred in private locations and how often they were reported to police.
    1. Create a third-level header titled: “Frequency distributions”
    2. Insert an R chunk. Type NCVSData_inj %>% frq(privatelocation). Then, hit enter and on the next line type NCVSData_inj %>% frq(reportedtopolice) to create the frequency tables.
    3. Note: You will follow this same step to answer questions in Quiz 9 about frequency distributions for ‘relationship’ and ‘V3018’ (sex of victim).
NCVSData_inj %>%
  frq(privatelocation)
## Victimization Occurred in a Private Location (privatelocation) <numeric> 
## # total N=7809 valid N=7809 mean=0.39 sd=0.49
## 
## Value |    N | Raw % | Valid % | Cum. %
## ---------------------------------------
##     0 | 4741 | 60.71 |   60.71 |  60.71
##     1 | 3068 | 39.29 |   39.29 | 100.00
##  <NA> |    0 |  0.00 |    <NA> |   <NA>
NCVSData_inj %>%
  frq(reportedtopolice)
## Incident Reported To Police (reportedtopolice) <numeric> 
## # total N=7809 valid N=7590 mean=0.56 sd=0.50
## 
## Value |      Label |    N | Raw % | Valid % | Cum. %
## ----------------------------------------------------
##     0 | unreported | 3308 | 42.36 |   43.58 |  43.58
##     1 |   reported | 4282 | 54.83 |   56.42 | 100.00
##  <NA> |       <NA> |  219 |  2.80 |    <NA> |   <NA>

Part 2 (Assignment 9.2)

Goal: Generate crosstab and conduct chi-squared test of independence

Now we will generate a crosstab and conduct a chi-squared test of independence using R. Additionally, we will calculate a Cramer’s V coefficient to help describe the strength of the relationship between victimization location and reporting to police. Recall, you will then repeat these steps to generate a crosstab of the variables ‘relationship’ and ‘V3018’ (victim sex) to answer the book’s question #5 and related questions in Quiz 9.

  1. Create a second-level header titled: “Part 2 (Assignment 9.2)”. Then, create a third-level header titled: “Generate Crosstab and Conduct Chi-Squared Test of Independence”.

  2. Insert an R chunk.

  3. Type NCVSData_inj %>% and hit enter. Type select(reportedtopolice, privatelocation) %>%. Recall that you should list rows (dependent variables) first then columns (independent variables) last. Also, if the research question warranted it, you could list multiple independent variables. In this case, our dependent variable is ‘reportedtopolice’ because we are investigating whether victim reporting depends on the location of the victimization incident.

  4. Hit enter and type sjtab(title = "Crosstab of Incident Location and Victim Reporting", show.col.prc = TRUE, statistics = "cramer", use.viewer = TRUE)
    1. You may recall the sjtab() function from the sjPlot package from an earlier assignment. It has a pipe-friendly argument structure, where our first argument identifies the data, followed by variables that should be displayed in a contingency table. So, in our case, the function essentially transforms the two variables we input from our data object into a crosstab.
    2. title = allows us to add a title to our crosstab.
    3. The show.col.prc = TRUE argument tells R we want it to automatically generate and display the column percentages.
    4. use.viewer = TRUE simply tells R to output the crosstab in our Viewer pane. If we type use.viewer = FALSE, R will open the crosstab in another (e.g., HTML browser) window.
    5. For our 2x2 table, the phi-coefficient is the default measure of association selected by the program; if you recall, we saw this option reported in the example table above. Here, we show you how to specify Cramer’s V instead with statistics = "cramer". Instead of “cramer”, we could have specified “phi”, “spearman”, “kendall”, “pearson”, “fisher”, or “auto” (for auto-selection); see the short example after the code output below.
    6. Note: You can also add a file = here("Images", "mytable.html") argument in sjtab() to save the table as an HTML file. From there, as shown in the commented-out code below, you could use the “webshot” package to save the HTML table as an image file (.png).
NCVSData_inj %>% 
  select(reportedtopolice, privatelocation) %>% #list row (DV) first then col (IV) last - can have multiple IVs 
  sjtab(title = "Crosstab of Incident Location and Victim Reporting", 
        show.col.prc = TRUE,
        statistics = "cramer", 
        use.viewer = TRUE)
Crosstab of Incident Location and Victim Reporting

Incident Reported      Victimization Occurred in a Private Location
To Police                    0                1             Total
unreported             2169 (47.5 %)    1139 (37.7 %)    3308 (43.6 %)
reported               2399 (52.5 %)    1883 (62.3 %)    4282 (56.4 %)
Total                  4568 (100 %)     3022 (100 %)     7590 (100 %)

χ2=70.529 · df=1 · Cramer’s V=0.097 · p=0.000
# Note: with use.viewer = TRUE, look for the table in the Viewer pane (as you did
# for sjPlot::view_df()); the tidy output should still show in your knitted file!

# Want to save html table as image instead? 
# Add `file = here("Images", "mytable.html")` to sjtab() code above
# library(webshot)
# webshot::install_phantomjs() # run if first time loading webshot package
# webshot(here("Images", "mytable.html"), here("Images", "mytable.png"))
# knitr::include_graphics(here("Images", "mytable.png"))
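As promised above, here is a minimal sketch of the same crosstab code with a different statistics= option; swapping in statistics = "phi" reports the phi-coefficient in the note line instead. (For a 2x2 table, phi and Cramer’s V take the same value, so the note line should look familiar.)

NCVSData_inj %>% 
  select(reportedtopolice, privatelocation) %>% 
  sjtab(title = "Crosstab of Incident Location and Victim Reporting", 
        show.col.prc = TRUE,
        statistics = "phi", # request phi-coefficient instead of Cramer's V
        use.viewer = TRUE)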
  5. As in our earlier example, note that this crosstab automatically reports the chi-squared test statistic at the bottom, which in this table is \(\chi^2=70.529\). Additionally, the table reports Cramer’s V, which is \(V=0.097\), as well as the p-value associated with the chi-squared test, which is \(p=0.000\). How do we interpret these results?
    1. First, based on the results of the chi-squared test, we can reject the null hypothesis of independence. It would be highly unlikely to obtain a sum of standardized differences between observed frequencies and expected frequencies under the null at least as large as the one we obtained due to sampling variability alone if the null hypothesis were true and these two variables were really independent in the population.
    2. Second, although we have sufficient evidence to reject the null hypothesis, the Cramer’s V measure of association indicates that the relationship between these two variables is quite weak.
    3. How do we reconcile these seemingly divergent findings?
      • Well, recall the central limit theorem and how sample size affects sampling variability? With larger sample sizes, we will often reject a null hypothesis of strict independence (or no difference, or no relationship), as large samples will generate very precise population estimates irrespective of the strength of the association.
      • So, assuming our measurement and modeling are appropriate, large sample sizes will allow us to be very confident that two variables are not strictly independent in the population even when percentage differences are very small and variable dependencies are very weak (see the brief demonstration after this list).
    4. For more information on interpreting measures of association, see Chapter 9 (p.281) in the book.
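To see this sample-size effect for yourself, here is a brief optional demonstration using the observed counts from the crosstab above. We shrink the table to one-tenth its size (rounding keeps the column percentages nearly identical) and compare the chi-squared statistics; we also exploit the fact that, for a 2x2 table, Cramer’s V equals \(\sqrt{\chi^2/N}\):

# Observed counts from the crosstab above (rows: unreported, reported)
big <- matrix(c(2169, 2399, 1139, 1883), nrow = 2)
small <- round(big / 10)  # same percentages, one-tenth the sample size

chisq.test(big)$statistic    # ~70.5: far beyond the 3.841 critical value
chisq.test(small)$statistic  # ~6.6: much smaller at the smaller N

sqrt(chisq.test(big)$statistic / sum(big))     # Cramer's V ~0.096
sqrt(chisq.test(small)$statistic / sum(small)) # still ~0.09: strength unchanged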

  6. What if we wanted to obtain the chi-squared test statistic without creating a crosstab using sjtab()? There are many ways to do this in R. We recommend checking out this particularly informative example of conducting a chi-squared test using the “infer” package in R (we also sketch a minimal infer version after the base R example below). Here, we will instead show you a simple alternative using base R to conduct a chi-squared test of independence.
    1. Insert an R chunk.
    2. Type chisq <- chisq.test(NCVSData_inj$reportedtopolice, NCVSData_inj$privatelocation)
      • Here, we are using the base R chisq.test() function to conduct a chi-squared test of independence for the ‘reportedtopolice’ and ‘privatelocation’ variables in our filtered NCVS data comprised of victimizations where an injury occurred.
      • Upon conducting the chi-squared test, we assign the results to an object called chisq.
      • Notice that this base R function is not (tidyverse magrittr style) pipe-friendly, meaning we must utilize the $ operator to call the variables from the NCVSData_inj data object.
    3. Hit enter and type chisq to view your chi-squared test results.
# Run a chi-squared test using base R
chisq <- chisq.test(NCVSData_inj$reportedtopolice, NCVSData_inj$privatelocation)
chisq
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  NCVSData_inj$reportedtopolice and NCVSData_inj$privatelocation
## X-squared = 70.529, df = 1, p-value < 2.2e-16
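As mentioned above, the “infer” package offers a tidy, pipe-friendly alternative. Here is a minimal sketch, assuming infer is installed; note that infer::chisq_test() expects categorical (factor) variables, so we first convert the two dummy variables with as.factor() and drop the missing values on ‘reportedtopolice’:

# A tidy alternative with infer (sketch; assumes the package is installed)
library(infer)
NCVSData_inj %>%
  mutate(reportedtopolice = as.factor(reportedtopolice),
         privatelocation = as.factor(privatelocation)) %>%
  filter(!is.na(reportedtopolice)) %>%
  chisq_test(reportedtopolice ~ privatelocation)
# should match chisq.test() above (X-squared = 70.529, df = 1)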
  7. We can also explore our results from the base R chisq.test() to identify the observed and expected frequency values that were used to calculate the chi-squared test statistic. From these values, you could easily calculate the chi-squared test statistic by hand (see the sketch after the expected frequencies below). To do this, insert a new R chunk and simply call the new object, chisq, and type $observed or $expected immediately after it.
  • Here are the observed frequencies, which we generated with chisq$observed:
#Determine observed frequencies
chisq$observed 
##                              NCVSData_inj$privatelocation
## NCVSData_inj$reportedtopolice    0    1
##                             0 2169 1139
##                             1 2399 1883
  • In comparison, here are the expected frequencies, which we generated with chisq$expected:
#Determine expected frequencies
chisq$expected
##                              NCVSData_inj$privatelocation
## NCVSData_inj$reportedtopolice        0        1
##                             0 1990.902 1317.098
##                             1 2577.098 1704.902
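Consistent with the definitional formula from earlier, you can use these stored elements to recompute the test statistic yourself. A quick optional sketch; note that the uncorrected sum below is slightly larger than the reported 70.529 because, as we saw earlier, chisq.test() applies the Yates correction to 2x2 tables:

# Definitional formula applied to the stored observed/expected frequencies
sum((chisq$observed - chisq$expected)^2 / chisq$expected)
# ~70.9, versus the Yates-corrected 70.529 reported by chisq.test()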
  8. Congratulations! You should now have everything that you need to complete the selected questions in Assignment 9 that parallel those from Bachman, Paternoster, and Wilson’s SPSS Exercises for Chapter 9. Complete the remainder of the assigned questions in Assignment 9 in your RMD file.
    1. Keep the file clean and easy to follow by using RMD level headings (e.g., denoted with ## or ###) separating R code chunks, organized by assignment questions.
    2. Write plain text after headings and before or after code chunks to explain what you are doing - such text will serve as useful reminders to you when working on later assignments!
    3. Upon completing the assignment, “knit” your final RMD file again and save the final knitted HTML document to your “Assignments” folder as: YEAR_MO_DY_LastName_K300Assign9. Submit via Canvas in the relevant section (i.e., the last question) for Assignment 9.

Assignment 9 Objective Checks

After completing assignment #9…

  • do you recognize you can manually build a simple tibble row-by-row using tidyverse’s tibble::tribble()?
  • do you recognize you can use round() to specify number of decimals on numeric values?
  • do you recognize you can modify a gt() table to add or remove the decimals in specific columns or rows with fmt_number()?
  • are you able to conduct a chi-squared test of independence using sjPlot::sjtab() or chisq.test() and interpret results?
    • do you know how to specify different measures of association with statistics= using sjPlot::sjtab() and interpret them appropriately?
    • do you know how to generate observed and expected frequencies by assigning results of chisq.test() to an object (e.g., chisq) and then calling elements from object (e.g., chisq$observed or chisq$expected)?