Assignment 7 Objectives

The purpose of this seventh assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 6 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

This chapter provided an introduction to probability, including foundational rules of probability and probability distributions. It is likely you have heard the term “probability” before and have some intuitions about what it means. You might be surprised to learn that there are different philosophical views about what probability is and is not, and our position on probability will have important implications for the way we approach statistical description and inference!

Our book, like most undergraduate statistics books, largely presents a frequentist view of probability. It starts by presenting us with a basic frequentist mathematical definition of probability as “the number of times that a specific event can occur relative to the total number of times that any event can occur” (p.152). Keen readers will note that this definition of probability sounds uncannily similar to a relative frequency - that’s because it is! In frequentist statistics, empirical probabilities are calculated as observed relative frequencies.

However, observed (known) relative frequencies - aka empirical probabilities - often are used to do more than simply describe a sample; often, they are used to make inferences about unknown (theoretical) population parameters. Your book describes this long run inferential view of empirical probabilities, or what the authors call the second “sampling” notion of probability, as “the chance of an event occurring over the long run with an infinite number of trials” (p.153). Of course, we cannot actually conduct an infinite number of trials, so we use our known relative frequencies from a sample - aka our known empirical probabilities - to infer what we think would likely happen were we to conduct a very large number of trials. After presenting these frequentist notions of probability, the chapter moves on to explain how we could imagine a theoretical “probability distribution” of outcomes that would emerge from repeated trials of an event, then it describes various types of probability distributions, including binomial, normal, and standard normal distributions.

Recall, descriptive statistics involve describing characteristics of a dataset (e.g., a sample), whereas inferential statistics involve making inferences about a population from a subset of sample data drawn from that population. In addition to probability, this chapter also introduces the basics of null hypothesis significance testing, which is the most common procedure by which social scientists use frequentist empirical probabilities and probability distributions to make inferences about populations from descriptions of sample data. Hence, the materials introduced in this chapter and this assignment, including probability, probability rules, probability distributions, standard normal distributions, and standard scores (z-scores), are essential to understanding future assignments that will focus heavily on conducting and interpreting null hypothesis significance tests.

In the current assignment, you will gain a better understanding of frequentist probability by learning to create cross-tabulations, or joint frequency contingency tables, and to calculate z-scores. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.

By the end of Assignment 7, you should…

  • understand basic elements of a contingency table (aka crosstab)
    • understand why it can be helpful to place IV in columns and DV in rows
    • recognize column/row marginals and their overlap with univariate frequency distributions
    • know how to calculate marginal, conditional, and joint (frequentist) probabilities
    • know how to compare column percentages (when an IV is in columns of a crosstab)
  • recognize the R operator != as “not equal to”
  • know how to turn off or change scientific notation in R, such as options(scipen=999, digits = 3)
  • be able to remove missing observations from a variable in R using filter(!is.na(var))
  • be able to generate a crosstab in R using dplyr::select() & sjPlot::sjtab(depvar, indepvar)
    • know how to add a title and column percents to a sjtab() table and switch output from the viewer to an html browser
  • understand the binomial hypothesis test and how to conduct one in R
    • understand the basic inputs of a binomial hypothesis test (p, x, \(\alpha\), & n)
  • be able to modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
  • have a basic understanding of the logic behind null hypothesis significance testing (we will continue to revisit this topic in more detail in remaining weeks)
    • understand difference between a null (test) hypothesis and contrasting alternative hypotheses
    • understand alpha or significance level (e.g., as risk tolerance or false positive error control rate)
    • recognize need to calculate a test statistic and corresponding p-value
    • compare a test p-value to alpha level and then determine whether the evidence is sufficient to reject the null hypothesis or instead should result in a failure to reject the null hypothesis
  • be able to convert raw column (variable) values into standardized z-score values using mutate()

Assumptions & Ground Rules

We are building on objectives from Assignments 1-6. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • knit your RMD document into a Word file that you can then save and submit for course credit

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method

Data viewing & wrangling

  • use the base R head() function to quickly view a snapshot of your data
  • use the glimpse() function to quickly view all columns (variables) in your data
  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions

Descriptive data analysis

  • use summarytools::dfSummary() to quickly describe one or more variables in a data file
  • create frequency tables with sjmisc::frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()
  • calculate measures of central tendency for a frequency distribution
    • calculate central tendency using base R functions mean() and median() (e.g., mean(data$variable))
    • calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() and psych::describe() functions
  • calculate measures of dispersion for a variable distribution
    • calculate dispersion measures by hand from frequency tables you generate in R
    • calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc::frq() or summarytools::descr())

Data visualization & aesthetics

  • improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
  • create basic graphs using ggplot2’s ggplot() function
    • generate simple bar charts and histograms to visualize shape and central tendency of a frequency distribution
    • generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution
  • modify elements of a ggplot object
    • change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “turquoise”) or hexadecimal codes (e.g., “#990000” for crimson)
    • add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title"))

If you do not recall how to do these things, review Assignments 1-6.

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

  • difference between descriptive and inferential statistics
  • z-scores
    • formula for converting a raw score into a z-score
    • how to use a z-score table
  • how to calculate conditional probabilities from a frequency table or cross-tabulation
  • rules of probabilities
    • bounding rule (rule #1; 0 to 1)
    • restricted and general addition rules of probabilities (rule #2a & #2b; unions)
    • restricted and general multiplication rules of probability (rule #3a & #3b; intersections)
  • probability of an event or the complement of an event
    • independent and mutually exclusive events
    • probability of success - binomial theorem
  • probability distributions
    • binomial, normal, and standard normal distributions
    • formula for normal distribution
    • sampling distributions
  • logic underlying null hypothesis testing
    • null hypothesis
    • alpha level (level of significance)
    • z-scores and critical regions
    • reject or fail to reject the null

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).


Part 1 (Assignment 7.1)

Goal: Read in NCVS and 2012 States Data

(Note: Remember that, when following instructions, you should always replace “LastName” with your own last name and replace YEAR-MO-DY with the actual date. E.g., 2023-02-02_Ducate_CRIM5305_Assign07)

In the last assignment, you learned how to use frequency tables to calculate measures of dispersion and boxplots to visualize dispersion in a frequency distribution. In this assignment, you will learn the basics of probability theory and probability distributions, including binomial and normal distributions. You will also learn how to convert a raw score into a standard score or z-score using the standard normal distribution, as well as how such standard scores are used to test null hypotheses with the goal of making population inferences from sample data.

For many, probability is one of the most difficult things to understand in statistics. Some probability calculations are quite intuitive, while others seem to defy our intuitions. As you read Chapter 6 and watched this week’s videos, hopefully you began to understand how probability theory underlies inferential statistics and allows us to make claims about larger populations from sample data. We will continue to work toward understanding inferential statistics over the next few assignments. In this assignment, we will practice calculating (frequentist) probabilities and using z-scores.

  1. Go to your CRIM5305_L folder, which should contain the R Markdown file you created for Assignment 5 (named YEAR-MO-DY_LastName_CRIM5305_Assign05). Click to open the R Markdown file.
    1. Remember, we open RStudio in this way so the here package will automatically set our CRIM5305_L folder as the top-level directory.
    2. In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.
    3. The dialogue box asks for a Title, an Author, and a Default Output Format for your new R Markdown file.
    4. In the Title box, enter CRIM5305 Assignment 7.
    5. In the Author box, enter your First and Last Name (e.g., Caitlin Ducate).
    6. Under Default Output Format box, select “Word document”

  2. Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete this text before you begin working.
    1. Create a second-level header titled: “Part 1 (Assignment 7.1).” Then, create a third-level header titled: “Read in NCVS and 2012 States Data”
    2. This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text: This R Markdown document contains my work for Assignment 7. It is my work and only my work.
    3. Now, you need to get data into RStudio. You already know how to do this, but please refer to Assignment 1 if you have questions.

  3. Create a third-level header in R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
    1. Insert an R chunk.
    2. Inside the new R code chunk, load the following packages: tidyverse, haven, here, sjmisc, sjPlot, summarytools, rstatix, and gt.
      • This is our first time using the rstatix package, which means you’ll need to install it first using install.packages() or the “Install Packages” button under the “Tools” tab. Then, you can use library() to load it in.
      • Recall, you only need to install packages one time; after that, you can comment out that line. However, you must load the packages each time you start a new R session.
      • The rstatix package provides a simple and intuitive framework for performing basic statistical tests, and it is compatible with tidyverse and tidyverse pipe functions. We will use rstatix::binom_test() to conduct a binomial hypothesis test later in the assignment. We will also pipe the tabled output to the gt() function and then modify the table; if you recall, gt() can improve table output aesthetics and permits table customization.
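For reference, a sketch of what this loading chunk might contain is shown below (the install line is commented out because packages only need to be installed once; uncomment it if you have not yet installed rstatix):

# install.packages("rstatix")   # run once if rstatix is not yet installed, then comment out
library(tidyverse)
library(haven)
library(here)
library(sjmisc)
library(sjPlot)
library(summarytools)
library(rstatix)
library(gt)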
  4. After your first code chunk, create another third-level header in RMD titled: “Read Data into R”
    1. Insert another R code chunk.
    2. In the new R code chunk, read and assign the “NCVS lone offender assaults 1992 to 2013.sav” and “2012 states data.sav” SPSS datafiles into R objects. Put the “NCVS lone offender assaults 1992 to 2013.sav” datafile into an object named NCVSData and the “2012 states data.sav” datafile into an object named StatesData.
      • Forget how to do this? Refer to any of the previous assignments.
    3. In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data objects: NCVSData and StatesData.
      • This will call the object and provide a brief view of the data. Once you have done that, comment out the line. (Note: You can get a similar but more visually appealing view by simply clicking on the object in the “Environment” window. Also, be sure to put these functions on separate lines or they will not run properly.)
      • Your RStudio session should now look a lot like this:
StatesData <- read_spss(here("Datasets", "2012statesdata.sav"))
NCVSData <- read_spss(here("Datasets", "NCVSLoneOffenderAssaults1992to2013.sav"))
View NCVS and States Data

Explaining Cross-tabulations

Goal: Understand basic elements of a contingency table

In the next section, we will generate contingency tables - otherwise known as a “cross-tabulations” or crosstabs - and use them to calculate (frequentist) marginal, conditional, and joint probabilities.

Contingency tables or cross-tabulations are useful for understanding the association (or lack thereof) between two or more variables. Specifically, we will cross-tabulate two variables from the National Crime Victimization Survey data (NCVSData): the ordinal variable relationship, capturing a victim’s relationship to their assailant (0=“total stranger” to 3=“well known”), and the binary variable maleoff, representing the sex of the offender (0=female; 1=male). Before we do this together, let’s cover the basic elements of a contingency table.

  1. Until now, we have focused primarily on univariate descriptive statistics that summarize characteristics of a frequency distribution for a single variable in a sample, such as the relative frequency or probability of a particular value, or the distribution’s central tendency, shape, or dispersion. Let’s pick a couple of variables from the NCVS data - privatelocation and reportedtopolice - and start by generating their univariate frequency distributions.
    1. Recall, these NCVS data contain survey responses about individuals’ experiences with criminal victimization collected from nationally representative samples of people in U.S. households. This particular subset contains data from the 1992 to 2013 NCVS studies and only includes data from respondents who reported either violent or nonviolent assault by a single offender. For more information, see p.191 in Bachman, Paternoster, & Wilson’s book.
    2. For this example, we selected these two variables to illustrate how a cross-tabulation might help us answer the following research question: Are criminal victimizations that occur in private locations more or less likely to be reported to police than victimizations that do not occur in private locations?
      • privatelocation is a dummy variable (i.e., 0 or 1 values) indicating whether the reported victimization occurred in a private location (0=Not a private location; 1=Private location).
      • reportedtopolice is a dummy variable indicating whether the victimization incident was reported to the police (0=Unreported; 1=Reported).
      • Remember, you can check these details using sjPlot::view_df() (e.g., by typing NCVSData %>% view_df()). Just be sure to comment out this line before you knit.
    3. Below are the univariate frequency distributions for the privatelocation and reportedtopolice variables in our NCVS data subset. Look at the code used to generate the tables–some of it should look familiar!
NCVSData %>% 
  filter(!is.na(reportedtopolice)) %>% 
  freq(privatelocation, report.nas = FALSE) %>% 
  tb() %>% 
  gt()
privatelocation freq pct pct_cum
0 17618 76.2354 76.2354
1 5492 23.7646 100.0000
NCVSData %>% 
  filter(!is.na(reportedtopolice)) %>% 
  freq(reportedtopolice, report.nas = FALSE) %>% 
  tb() %>% 
  gt()
reportedtopolice freq pct pct_cum
0 12639 54.69061 54.69061
1 10471 45.30939 100.00000
#The following code more efficiently accomplishes the same goal
# NCVSData %>% freq(reportedtopolice, report.nas = FALSE) %>% tb() %>% gt()
  1. You already know how to create these frequency tables with summarytools::freq().
    1. The major addition to this code is the filter() command, which tells R to keep only the rows where respondents reported a “0” or “1” to the reportedtopolice item and to remove the n=859 rows with missing data on this item. It does so by filtering to keep values that are not NA on the reportedtopolice variable (i.e., filter(!is.na(reportedtopolice))). We do this for comparison purposes, since our basic contingency tables will drop those missing (NA) cases by default.
      • Note: With summarytools::freq(), we could have simply added the report.nas = FALSE option to more efficiently accomplish the same goal in our frequency table. However, we will use this NA filter later in this assignment, so we wanted to introduce you to it here as well.
      • You may notice our freq() code also includes the report.nas = FALSE option, which is not entirely redundant - here, it removes the empty NA row from our table.
    2. The other key difference is we transformed the output to a “tibble” (a simple tidy dataframe) by piping it to tb() so that we could then pipe the output to our preferred gt() table formatting package.
    3. A quick look at these univariate frequency distributions shows us that victimizations in private locations are relatively rare, with only about 24% occurring in private locations. Also, a little less than half (45%) of victimizations are reported to the police. However, we cannot tell from these tables whether there is an association between the two variables. Enter contingency tables…

  2. A contingency table (aka, cross-tabulation or crosstab) is a multivariate or “joint” frequency distribution table that simultaneously presents the overlapping frequency distributions for two (or more) variables. Below is a contingency table containing joint frequencies for the privatelocation and reportedtopolice variables.
Contingency table w/frequencies

  1. Recall, our research question: Are criminal victimizations that occur in private locations more or less likely to be reported to police than victimizations that do not occur in private locations? This question implies that privatelocation is the independent variable (IV) and reportedtopolice is the dependent variable (DV). In other words, we hypothesize that whether or not a victimization is reported to police depends at least in part on the location of the victimization.
    1. Logically, it does not make sense to posit that police reporting causes victimization location.
    2. Meanwhile, even if the two are associated, victimization location may not have a direct causal relationship with police reporting, as such an association could reflect unmeasured confounding or mediating mechanisms (e.g., systematic differences in victim/offender relationships or offense types across locations might actually cause differences in police reporting). Hence, association != causation.
      • Note: != is an operator in R that means “not equal to”. Recall, we also used ! earlier, prior to is.na, when filtering missing data. While is.na means “is missing”, !is.na means “not missing” (i.e., “not NA”). So in R, if you want to check whether two things are the same, you use ==, which means “equal to”, while if you want to check whether two things are not the same, you use !=, which means “not equal to.” The use of the exclamation point ! to mean “not” is very common in programming.
    3. We might even expect and test more specific directional expectations for the posited association, such as that victimizations occurring in private locations are less likely to be reported to police than victimizations not occurring in private locations. For now, we will stick with a basic non-directional hypothesized association.

  2. In any case, we prefer to generate contingency tables that place the IV on top with its frequencies in the columns and the DV on the side with its frequencies in the rows. While this is a preference, following a consistent practice like this will make it much easier to read contingency tables and result in fewer errors.
    1. Since privatelocation is the independent variable (IV) here, you will note that it is placed in the columns of the contingency table.
    2. Meanwhile, reportedtopolice is the dependent variable, so it is placed in the rows of the contingency table.
  3. The frequency values at the bottom of the IV’s columns and the end of the DV’s rows, respectively, are known as column and row “marginal frequencies” or marginals for short. If you compare these to our univariate frequency distribution tables above, you will notice the marginal frequencies match the univariate frequency distribution values (after using filter to remove NA values).
  4. So, we can identify univariate frequency distributions from the marginal frequencies (row or column totals) in a contingency table. In addition, a cross-tabulation presents joint frequencies.
  5. We can extract a lot of useful information from joint frequencies in a contingency table. For example, we know that 5,492 (column marginal) of the 23,110 (total) victimizations in the data reportedly occurred in a private location. Of those 5,492 private victimizations, 3,075 were reported to police and 2,417 were unreported.
    1. From these joint frequencies, we can easily calculate marginal, joint, or conditional relative frequencies (aka, frequentist probabilities) - we just divide the desired marginal or joint frequency by the total (for marginal or joint) or row/column (for conditional) frequency, respectively (a short R sketch of these calculations appears at the end of this section).
      • For example, the marginal probability of a crime being reported is p(reported)=0.45 (i.e., 10,471/23,110). Note that this marginal probability (aka, marginal relative frequency) is independent of the values of the other (privatelocation) variable.
      • The conditional probability of an assault being reported given it occurred in a private location is p(reported|private)=0.56 (i.e., 3,075/5,492). Note that this conditional probability depends on - it is conditional on - the value of the privatelocation variable (i.e., privatelocation = 1).
      • The conditional probability of an assault not being reported given it occurred in a private location is p(unreported|private) = 0.44 (i.e., 2,417/5,492). Note that you could also calculate this probability by subtracting its complement, which we just calculated above (i.e., 1-0.56).
      • The joint probability of an assault occurring in a private location and being unreported is p(private unreported)=0.10 (i.e., 2,417/23,110).
    2. You will practice calculating probabilities like these later in the assignment when answering question 2 from the Chapter 6 SPSS exercises. When you do, remember that it is very easy to read and calculate these relative frequencies or probabilities incorrectly, such as by selecting and dividing by the wrong frequency value (e.g., row instead of column, or total instead of row or column). To avoid such mistakes, take your time, be sure you know what it is you want to describe in the data, and be consistent in constructing tables (e.g., always placing your IV in the columns will help a lot!).
  1. In addition to joint frequency distributions, it is also common to see percentage values reported in contingency tables. Once again, it is important to be sure the calculated or reported percentages are the percentages that you want when reading and interpreting the table. If we stick with the recommended practice of placing our IV in the columns, then we often want column percentages (as opposed to row percentages). The image above includes column percentages directly beneath the frequencies. With this table setup and with column percentages, we will usually total down and compare across when assessing a relationship between the IV and the DV.

    1. If set up and calculated correctly, column percentages should total down within a column to 100%.
    • This means that when column percentages are reported, the column percents should add to 100% and the column marginals at the bottom of the table should show (or calculate to) 100% as well.
    • However, these percentages usually will NOT add up to 100% within a row; you can see this in the image above, where the percentages reported in the row marginals do not equal 100%.
    2. If set up and calculated correctly, column percentages are usually compared across columns of the IV and within a row of the DV.
    • For example, the table above shows 56% of the victimizations that occurred in private locations were reported to the police.
    • In comparison (i.e., compare within same row of the DV), 42% of victimizations that did not occur in private places were reported to police.
    3. Recall from the univariate frequency distributions and our column marginal frequencies that victimizations in private locations are relatively rare, with only about 24% occurring in private locations (i.e., 5,492/23,110).
    • Now, from a comparison of joint frequencies and column percentages in our contingency table, we can conclude that victimizations occurring in private locations are somewhat more likely to be reported to police compared to those not occurring in private locations (56% versus 42% respectively).
      • Put differently, there is an association between these two variables: victimizations occurring in private locations are less likely to go unreported than are victimizations occurring in other places (44% versus 58%, respectively).
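
To make these probability calculations concrete, here is a small R sketch (as referenced above) that computes the marginal, conditional, and joint probabilities discussed in this section, with the frequencies typed in by hand from the contingency table:

# marginal probability of a victimization being reported to police
10471 / 23110   # ~0.45

# conditional probability of being reported, given a private location
3075 / 5492     # ~0.56

# conditional probability of being unreported, given a private location
2417 / 5492     # ~0.44 (equivalently, 1 - 0.56)

# joint probability of occurring in a private location AND being unreported
2417 / 23110    # ~0.10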

Part 2 (Assignment 7.2)

Goal: Create Cross-Tabulations of sex (V3018) and relationship variables

You will learn to generate contingency tables like those presented above. Though there are many options, the sjtab() function from the “sjPlot” package, which was used to create the tables above, is a tidyverse-friendly option for creating useful and aesthetically pleasing contingency tables (for more details, see here, here, and here). With the sjtab() approach, we will also rely on the select() function, which we have used before, to select the specific variables we want to include in our contingency table.

Let’s start by using the sjtab() function to generate a table that will help us answer question 2 (part a [i-iii] & part b [i-ii]) from the SPSS exercises at the end of Chapter 6 (p.192) of the book. This question requires us to generate a contingency table of relationship (victim’s relationship to assailant) by V3018 (victim’s sex) to assess whether there are gender differences in victims’ relationships to their assailants. For instance, a contingency table might help us determine whether the probability of being assaulted by a stranger or by a well-known acquaintance is equivalent or differs for men and women.

  1. Create a second-level header titled “Part 2 (Assignment 7.2)”
    1. Create a third-level header titled “Cross-Tabulation of gender and relationship”
    2. Insert an R chunk and type NCVSData %>%. Hit enter and type select(relationship, V3018) %>%.
    3. Hit enter again and, on the next line, type sjtab(title = "Cross-Tabulation of Victim Gender and Relationship to Assailant", show.col.prc = TRUE, use.viewer = FALSE).
      • In the past, you may have typed most of your code on one line. While that method often works, for coding purposes it makes it really hard to see what you are doing and keep track of your functions. Here, I am showing you how to break your code into lines, so that you can easily find a function without scrolling through lots of text.

  2. Recall, the dplyr::select() function allows us to select the variables we want to use in a dataset. When working with just one or two variables from a large dataset, this can be helpful. When working with an entire dataset, we do not need to use select(). For this assignment, we will use it to select the V3018 and relationship variables from our NCVS dataset.
    1. When using the select() function in combination with sjtab(), the dependent variable should be listed first and the independent variable listed second. This will ensure that the DV is in the rows and the IV is in the columns of the table.
      • You can have as many independent variables as you want, but we recommend you always place your independent variables in the columns (listed last) and your dependent variable in the rows (listed first).
    2. As we explained above, we wish to assess the relative frequency of specific categories of relationship to assailant (e.g., strangers) by sex of the victim. Setting our table up with our IV in the columns and DV in the rows allows us easily to compare across values of the IV and within rows of the DV to assess the association of interest.
      • Also, as noted above and explained in your book, technically it does not matter which variables are in the rows or columns as long as we request the appropriate relative frequencies (e.g., row or column percentages) and make the correct comparisons.
      • However, if we always put our IV in the columns (on the top of the table), then it is easier to remember that we also want “column percentages” and we can follow the same procedures each time to compare across columns and within rows.

  3. The sjtab() function allows us to make a contingency table in R that is customizable using tidyverse design philosophy.
    1. The title = argument allows us to specify a title for the crosstab.
    2. The show.col.prc = TRUE argument requests the column percentages - i.e., the percentages of male victims (or of female victims) that fall into each of the relationship categories (e.g., % of male victims assaulted by a stranger; % of male victims assaulted by a casual acquaintance; etc.).
    3. Lastly, I prefer to specify the use.viewer = FALSE argument, which instructs R to output the resulting table in a pop-up HTML document rather than the built-in viewer pane (i.e., the viewer pane is where you find the view_df() output). Feel free to change that to TRUE if you prefer. Either way, the cross-tabulation table should populate in your final knitted document.

  4. If you choose to output to the viewer (use.viewer = TRUE), then your RStudio session should look something like this:
Crosstab of Victim Gender and Relationship to Assailant
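
For reference, the assembled code chunk from the steps above should look something like this:

NCVSData %>%
  select(relationship, V3018) %>%
  sjtab(title = "Cross-Tabulation of Victim Gender and Relationship to Assailant",
        show.col.prc = TRUE,
        use.viewer = FALSE)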

  1. You generated a tidy crosstab with column percentages (plus a Chi-squared test & Cramer’s V)!
    1. Depending on the options you selected, it should have populated in your viewer pane or in a pop-up browser window. In either case, it should knit to your final Word document. Try knitting now just to make sure.
    2. Now, go to the Chapter 6 SPSS exercises on p.192 of the book, then use your table to answer question 2, part a (i-iii) and part b (i-ii only).
      • Note: To answer these questions, you should use the frequency values in your table to calculate the requested probabilities.
  2. Now you know how to generate crosstabs, and you learned a little more about data wrangling along the way! Let’s move on to learn how to conduct a binomial hypothesis test.

Part 3 (Assignment 7.3)

Goal: Conduct a binomial hypothesis test

In this next section, you will learn how to answer question 3 on page 192 of B&P’s Chapter 6. While the last question focused on the victim’s sex and relationship to assailant, this question focuses instead on the offender’s sex. Specifically, the question asks you to conduct a binomial hypothesis test to infer whether women are more likely than men to assault someone else in the U.S. population.

To test this hypothesis, we will use the maleoff variable, which records victims’ reports of the sex of the offender (0=female offender; 1=male offender), to generate relative frequency/proportion values for male and female offenders. We will then use the binomial distribution to test how probable the observed proportions would be under a null hypothesis of no sex differences in offending (i.e., p=0.50).

You can find a detailed description and example application of the binomial hypothesis test on pages 165-172 of the book. Check out this helpful resource for another description of the binomial test and an alternative way of conducting it in R. Put briefly, the essential process is as follows:

  • specify the null hypothesis (‘p’)
  • establish an alpha level (‘\(\alpha\)’)
  • determine ‘x’ (the observed count of the outcome of interest) and ‘n’ (the total number of observations)
  • conduct the binomial test and obtain its p-value
  • compare the p-value to the alpha level and decide whether to reject or fail to reject the null hypothesis

Now that we know what we need, let’s get to work.

  1. Create a second-level header titled “Part 3 (Assignment 7.3)”
    1. Create a third-level header titled “Conduct a binomial hypothesis test”

  2. Specify the null hypothesis (‘p’)
    1. Following question 3.a in the book, we will specify a null hypothesized probability of success (‘p’) equal to 0.50. Be sure to record your null hypothesis in your RMD text.
    2. A null hypothesis value of p=0.50 indicates we expect the probability of assault to be equal for men and women; that is, we expect no sex differences in assaults.
      • The logic underlying null hypothesis testing can be quite confusing for non-statisticians (and even for some statisticians!). Remember, we want to know if women are more likely than men to assault someone else in the U.S. population. So, we start with the opposite assumption that there are no sex differences in assault (i.e., p[assault by woman]=0.50), which effectively stacks the test in favor of the opponent or contrasting hypothesis. Then, our test essentially asks the question: If we live in a world in which the null hypothesis is true, then how improbable would it be to observe data at least as extreme as those I observed? Or, using our assignment example: If we live in a world in which there are no sex differences in assault, then how improbable would it be to observe a proportion of assaults by women that is at least as large as what I observed in these data?

  3. Establish an alpha level (‘\(\alpha\)’).
    1. Question 3.c asks us to select an alpha level. Be sure to read more about alpha or significance levels on p.169-70. Following convention, we will use an alpha level of 0.05 (or 5%) for this assignment. Be sure to record this in your RMD text as well.
    2. Put simply, the alpha level is our risk tolerance or Type I error control rate. Setting ‘\(\alpha\)’ at 0.05 essentially means we will tolerate rejecting a true null hypothesis - that is, making a false positive inference error - up to 5% of the time.

  4. Determine ‘x’ and ‘n’
    1. Recall, ‘x’ is the observed number of “successes” or outcomes of interest among our observations; in this case, it refers to the number of assaults committed by female offenders.
    2. Meanwhile, ‘n’ is the total number of observations; here, this refers to the total number of assaults committed (i.e., by women and men).
    3. You already know how to get this information - generate a frequency distribution table!

    4. Insert an R chunk and type NCVSData %>% frq(maleoff) or NCVSData %>% freq(maleoff) to create a frequency table using either the sjmisc or summarytools package.
      • Remember, you can type NCVSData %>% on one line, hit enter, and type frq(maleoff) on the next. This will make reading your code much easier in the long run - particularly as you string more commands together and your lines get longer.
      • From your table, you should be able to see that ‘x’ is equal to 4,798, because there are 4,798 female offenders reported in the dataset.
      • You should also be able to determine that ‘n’ is equal to 23,969, because there are 23,969 total offenders reported by victims in the dataset. You can get this number from the marginal total row in the freq() output or by adding 4,798 (the number of female offenders) to 19,171 (the number of male offenders) from either table. frq() will also provide the total N at the top of the output.
# NCVSData %>% frq(maleoff) 
NCVSData %>% freq(maleoff) 
## Frequencies  
## NCVSData$maleoff  
## Label: Male offender =1, 0=female offender  
## Type: Numeric  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##           0    4798     20.02          20.02     20.02          20.02
##           1   19171     79.98         100.00     79.98         100.00
##        <NA>       0                               0.00         100.00
##       Total   23969    100.00         100.00    100.00         100.00
  1. Now that we have all the inputs we need, we can use the binomial test function from the rstatix package we installed to conduct the hypothesis test.
    1. Insert an R chunk and type binom_test(x = 4798, n = 23969, p = .5) on one line.
      • binom_test() is the function that runs binomial hypothesis tests in the rstatix package. Within this function, we are specifying x = 4798, n = 23969, and p = 0.5.
      • If you want, you can assign the binomial test to an object (e.g., use <- and perhaps call it binomtestfemoff). Then, you can click the object in your environment and see it in a new tab.
    2. Your RStudio should look something like this:
Binomial Test of Assailant Gender

  1. We are not quite done! Let’s make this binomial test output a bit more visually appealing and easier to read by piping it to the gt() function.
    1. Remember, gt() allows us to create beautiful tables that are more accessible and intuitive to read.
    2. To do this, you simply type %>% (a pipe) after binom_test(x = 4798, n = 23969, p = .5). Then, hit enter and type gt().
    3. If you run your chunk now, your RStudio session should look something like this:
Binomial Test with 'gt' Package

  1. Now is a good time to show you how to customize tables using gt(). You can do all sorts of custom modifications; some examples include: adding titles, subtitles, and note lines; changing column or value labels; and making fonts bold, italicized, or underlined. To give you a sense of how this works, we will add a title and subtitle to our table.
    1. Add another pipe (%>%) after gt(), then hit enter
    2. On the next line, type tab_header( then hit enter
    3. On the next line, type title = md("**Binomial test of sex differences in assaults using NCVS subset**"), then hit enter
    4. On the next line, type subtitle = md("*H0:* p=0.50, x(female assaults)=4,798, n=23,969"))
      • tab_header() adds/modifies a table header.
      • md() specifies our text input as Markdown-formatted text, which allows us to bold the **title** using double asterisks and italicize *H0:* with single asterisks.
    5. Run your code chunk, then use your new gt-modified table to interpret results from your binomial hypothesis test.
# options(scipen=999, digits = 3 )
# options(scipen=10, digits = 3 )

binom_test(x = 4798, n = 23969, p = .5) %>% 
  gt() %>%
    tab_header(
      title = md("**Binomial test of sex differences in assaults using NCVS subset**"),
      subtitle = md("*H0:* p=0.50; x(female assaults)=4,798; n=23,969"))
Binomial test of sex differences in assaults using NCVS subset
H0: p=0.50; x(female assaults)=4,798; n=23,969
n estimate conf.low conf.high p p.signif
23969 0.2001752 0.1951254 0.2052978 4.940656e-324 ****
  1. Compare the test p-value to our alpha level

    1. When looking at the results, the first thing you might notice is that the p-value is in scientific notation (i.e., p=4.94e-324). This happens when numbers are extremely large or, in this case, extremely small.

      • For those who do not frequently use scientific notation, it may be confusing when the number is so small that it is output in this format. For a brief refresher on scientific notation, check out this website; to learn more about scientific notation and changing its output in R, see here.

      • Recall, a number followed by “e” (shorthand for 10 raised to a power) and a negative exponent is a very small number. Here, 4.94 followed by “e-324” indicates the probability is extremely close to zero (specifically, a decimal point followed by 323 zeroes and then 494).

      • If you want to see just how many zeros, you can turn off scientific notation by typing the following command right above your binomial test and then running your code chunk again: options(scipen=999, digits = 3). This changes the global options of our R workspace.

      • You can turn scientific notation back on and specify when it should be used by changing the number following scipen=. For instance, typing options(scipen=10, digits = 3) and then running your code again will result in scientific notation in your table once more.

      • If you wish to restore the R default, type and run: options(scipen=0, digits=7)
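
Collected in one place, the three options() calls described above are:

options(scipen = 999, digits = 3)  # print very small/large numbers in full (turn off scientific notation)
options(scipen = 10, digits = 3)   # allow scientific notation again for extreme values
options(scipen = 0, digits = 7)    # restore the R defaults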

    2. So, our test p-value is close to zero and is definitely smaller than our pre-specified alpha level of 0.05 (i.e., 4.94e-324 < 0.05).

  2. Use results of statistical test to draw inferences about population

    1. Since our p-value is less than our alpha level (i.e., p-value < \(\alpha\)), we would conclude that it would be highly unlikely (p-value = 4.94e-324) to observe a relative frequency/probability of assault by women as or more extreme than the one we observed (i.e., an observed proportion of about 0.20) if the null hypothesis of no sex differences in assault (i.e., p=0.50) were true in the population.

    2. Therefore, we would conclude that there is enough evidence to reject the null hypothesis of no sex differences in assault.

  3. Finish answering question 3 (p.192) of the book and any remaining quiz questions related to it.

Part 4 (Assignment 7.4)

Goal: Calculating Z-score and Creating Histogram for MurderRt Variable

From this week’s book chapter, you also learned how to use the z-score formula to convert raw values by hand to z-score values that are standardized to the normal curve. For the last part of this assignment, you will learn how to convert a variable’s raw values into z-scores in R.

We will work with the MurderRt variable in the 2012 States Data. Essentially, we want to visualize the distribution of homicide rates across states in the US, then make informative standardized comparisons across states’ homicide rates. To do this, we will first create a histogram using ggplot(). Next, we will examine raw murder rates in individual states and compare these values to the average murder rate in the US from these data. Finally, we will convert states’ raw homicide rates into z-scores, then examine how Alabama’s and Alaska’s standardized murder rates compare with the sample-average US homicide rate.

  1. Create a second-level header titled: “Part 4 (Assignment 7.4)”. Then, create a third-level header titled: “Creating Histogram for MurderRt Variable”.
    1. Insert an R chunk and create a histogram of the MurderRt variable.
    2. Type StatesData %>%. Hit enter and type ggplot() +. Then, hit enter one last time and type geom_histogram(aes(x=MurderRt)). Answer question 5 on page 192 of B&P.
      • Note: It is not necessary to put the x= before MurderRt in aes(), as the default setting is to plot the histogram on the x-axis. However, it is good practice to write out default settings anyway; in this case, it helps you build an intuition about what ggplot() is doing, and you can more easily modify it (e.g., by switching to y-axis with y=) as desired.
#start by showing histogram for students (doing part of q5 for them)
StatesData %>%
  ggplot() +
  geom_histogram(aes(x=MurderRt))

  1. Next, we are going to create a new column (variable) in which we convert MurderRt variable values into z-score standardized values. Recall from this week’s readings and other course materials that the z-score allows us to make standardized comparisons about how close or far away a particular value (e.g., a state’s homicide rate) is from the mean of the data series (e.g., sample mean homicide rate across states). For example, we can see how close the homicide rates of states like Texas or California are to the sample-average US homicide rate.
    1. Create a third-level header titled: “Converting MurderRt Values to Z-scores”
    2. Insert an R chunk, then select only the columns we need - State & MurderRt - and assign only these two variables (columns) into a new data object.
      • You know how to do this already - just follow the format: datasubset <- data %>% dplyr::select(var1, var2). Try doing it yourself; if you get tripped up, that is fine. We include step-by-step instructions below as usual.
      • Note: While this step is not strictly necessary, it is often easier to work with just the subset of data that we need rather than the entire data object.
    3. Type StatesDataSub <- StatesData %>%.
      • We start by assigning our States data into a States data subset. At this point, StatesDataSub and StatesData are identical.
    4. Hit enter and type select(State, MurderRt).
      • Here, we are selecting just the “State” and MurderRt variables. After this line, the StatesDataSub data object will contain only these two variables.
    5. To view your new data object with just the “State” and MurderRt variables, type StatesDataSub in the R chunk. Your Rstudio session should look something like this (but with header lines and descriptive text):
Selecting State and MurderRt Variables

  1. Now, we want to use the z-score formula to convert the raw MurderRt values to z-scores. Doing so will allow you to answer question 7 on page 192 of the book and complete the assignment. Recall from this week’s readings/materials that the formula for z-scores is z = (x - mean(x))/sd(x), where x represents a vector of raw scores for a continuous variable (e.g., MurderRt values), mean(x) equals the sample mean of x, and sd(x) equals the sample standard deviation of x.
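    • For example, using the sample mean (4.51) and standard deviation (2.41) of MurderRt that we calculate below, Alabama’s murder rate of 7.1 converts to z = (7.1 - 4.51)/2.41 ≈ 1.07; in other words, Alabama’s murder rate sits roughly one standard deviation above the sample mean.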

  2. As with most tasks in R, there are various ways to use the z-score formula with mutate() to convert the raw MurderRt values into a new standardized variable. We could calculate the mean and standard deviation, then plug those values into the formula. We could simply write out the formula as above and plug MurderRt in for ‘x’. Or, we could use base R’s built-in scale() function for converting to z-scores. We could even create our own function to standardize variables. Creating functions is especially useful if we think we will need to standardize variables again in the future; it saves us from duplicating our efforts and from potentially making copy-and-paste errors in the process. For examples and more details on the scale() method or creating a function to convert to z-scores, see here and here. We will walk you through a few different ways below.
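As one illustration of that last option, here is a minimal sketch of a user-defined standardizing function (the name z_score is our own, hypothetical choice) that could then be reused inside mutate():

# hypothetical helper: convert a numeric vector to z-scores
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# example use (same idea as the mutate() approaches shown below):
# StatesDataSub %>% mutate(ZMurderRt = z_score(MurderRt))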

  3. First, we will find the mean and standard deviation of MurderRt, then use these values with mutate() to create a new variable containing the converted z-scores. This is NOT a method that we recommend - typing in values to create variables is a potentially error-prone process. However, this simple approach is useful for seeing how z-scores are calculated using other methods.
    1. Let’s start by finding the mean and standard deviation of MurderRt. Insert an R chunk and generate descriptive statistics for the MurderRt variable with summarytools::descr() using the code: StatesDataSub %>% descr(MurderRt).
    2. Type StatesDataSub <- StatesDataSub %>%, then hit enter and type mutate(.
      • Notice that we are assigning the StatesDataSub data object into the same object again. We do not want to overwrite the original StatesData dataset, but we can always recreate this subset with our code above if we mess up. Assigning into our previously created object ensures the original dataset remains clean while minimizing unnecessary clutter in our R Environment.
    3. Hit enter again and type ZMurderRt = (MurderRt - mean(x))/sd(x)), replacing mean(x) with the sample mean and sd(x) with the sample standard deviation you found above (4.51 and 2.41, respectively). Hit enter once more and type StatesDataSub to view your data object. Note: In the code below, we added head() to display only the first six rows, which allows you to compare your results without adding long tables to our assignment file. If you want to see the entire output, remove the head() function: StatesDataSub %>% gt()
#select only the variable(s) we need & assign to new df 
StatesDataSub <- StatesData %>% 
  select(State, MurderRt)

StatesDataSub %>%
  descr(MurderRt) 
## Descriptive Statistics  
## StatesDataSub$MurderRt  
## Label: Murder Rate per 100K  
## N: 50  
## 
##                     MurderRt
## ----------------- ----------
##              Mean       4.51
##           Std.Dev       2.41
##               Min       0.90
##                Q1       2.60
##            Median       4.65
##                Q3       6.00
##               Max      12.30
##               MAD       2.59
##               IQR       3.35
##                CV       0.53
##          Skewness       0.73
##       SE.Skewness       0.34
##          Kurtosis       0.59
##           N.Valid      50.00
##         Pct.Valid     100.00
  #note mean = 4.51, sd=2.41

#manually convert to z-scores
StatesDataSub <- StatesDataSub %>% 
  mutate(
    ZMurderRt = (MurderRt - 4.51)/2.41)
StatesDataSub %>% head() %>% gt()
State MurderRt ZMurderRt
Alabama 7.1 1.0746888
Alaska 3.2 -0.5435685
Arizona 5.5 0.4107884
Arkansas 6.3 0.7427386
California 5.4 0.3692946
Colorado 3.2 -0.5435685
  1. Recall, you can also determine the mean and standard deviation with base R. Simply create a new R chunk and type mean(StatesData$MurderRt) and sd(StatesData$MurderRt), as shown below. These functions will calculate the mean and standard deviation without all the other descriptive statistics.
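For reference, that chunk might simply contain:

mean(StatesData$MurderRt)  # about 4.51 (matches the descr() output above)
sd(StatesData$MurderRt)    # about 2.41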

  2. We can also convert to z-scores without the separate steps of calculating the mean and standard deviation and then manually inputting them into the formula. Instead of plugging in those numbers, simply replace x with MurderRt in your mutate() formula. We strongly recommend this method over the first approach.
    1. Create a new R chunk.
    2. Type StatesDataSub <- StatesDataSub %>% mutate(ZMurderRt = (x - mean(x))/sd(x)) like before. Replace x with MurderRt.
    3. Note: The resulting z-score values will differ slightly from those generated above - particularly after the second decimal place. These differences are due to rounding. Recall, in the first approach, we rounded the mean and standard deviation values to two decimal places; in this second approach, R used far more precise values for the mean and standard deviation. Also, we added head() again to display only the first six rows, which allows you to compare your results without adding long tables to our assignment file.
#find z-score for each data value 
StatesDataSub <- StatesDataSub %>% 
  mutate(
    ZMurderRt = (MurderRt - mean(MurderRt))/sd(MurderRt)
    )
StatesDataSub %>% head() %>% gt()
State MurderRt ZMurderRt
Alabama 7.1 1.0727213
Alaska 3.2 -0.5450719
Arizona 5.5 0.4090113
Arkansas 6.3 0.7408663
California 5.4 0.3675294
Colorado 3.2 -0.5450719
  1. Finally, we can use the scale() function to calculate the z-scores for us. We can use similar code as above but without writing out the z-score formula ourselves. Note that it produces results identical to the table above.
#find z-score for each data value 
StatesDataSub <- StatesDataSub %>% 
  mutate(
    ZMurderRt = scale(MurderRt)
    )
StatesDataSub %>% head() %>% gt()
State MurderRt ZMurderRt
Alabama 7.1 1.0727213
Alaska 3.2 -0.5450719
Arizona 5.5 0.4090113
Arkansas 6.3 0.7408663
California 5.4 0.3675294
Colorado 3.2 -0.5450719
  1. Congratulations! You should now have everything that you need to complete the questions in Assignment 7 that parallel those from B&P’s SPSS Exercises for Chapter 6! Complete the remainder of the questions in Assignment 7 in your RMD file.
    1. Keep the file clean and easy to follow by using RMD-level headings (e.g., denoted with ## or ###) to separate R code chunks, organized by assignment question.
    2. Write plain text after headings and before or after code chunks to explain what you are doing - such text will serve as useful reminders to you when working on later assignments!
    3. Upon completing the assignment, “knit” your final RMD file again and save the final knitted Word document as: YEAR_MO_DY_LastName_CRIM5305_Assign07. Submit via Blackboard in the relevant section for Assignment 7.

Assignment 7 Objective Checks

After completing Assignment 7…

  • do you understand basic elements of a contingency table (aka crosstab)?
    • do you understand why it can be helpful to place IV in columns and DV in rows?
    • do you recognize column/row marginals and their overlap with univariate frequency distributions?
    • do you know how to calculate marginal, conditional, and joint (frequentist) probabilities?
    • do you know how to compare column percentages (when an IV is in columns of a crosstab)?
  • do you recognize the R operator != as “not equal to”?
  • do you know how to turn off or change scientific notation in R, such as options(scipen=999, digits = 3)?
  • are you able to remove missing observations from a variable in R using filter(!is.na(var))?
  • are you able to generate a crosstab in R using dplyr::select() & sjPlot::sjtab(depvar, indepvar)?
    • do you know how to add a title and column percents to a sjtab() table and switch output from the viewer to an html browser?
  • do you understand the binomial hypothesis test and how to conduct one in R?
    • do you understand the basic inputs of a binomial hypothesis test (p, x, \(\alpha\), & n)?
  • are you able to modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts?
  • do you have a basic understanding of the logic behind null hypothesis significance testing?
    • do you understand alpha or significance level (e.g., as risk tolerance or false positive error control rate)?
    • do you know how to compare a test p-value to alpha level and appropriately decide whether to reject/fail to reject the null hypothesis?
  • are you able to convert raw column (variable) values into standardized z-score values using mutate()?