The purpose of this seventh assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 6 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

This chapter provided an introduction to probability, including foundational rules of probability and probability distributions. It is likely you have heard the term “probability” before and have some intuitions about what it means. You might be surprised to learn that there are different philosophical views about what probability is and is not, and our position on probability will have important implications for the way we approach statistical description and inference!

Our book, like most undergraduate statistics books, largely presents
a frequentist
view of probability. It starts by presenting us with a basic frequentist
mathematical definition of probability as “the number of times that a
specific event can occur relative to the total number of times that any
event can occur” (p.152). Keen readers will note that this definition of
probability sounds uncannily similar to a *relative frequency* -
that’s because it is! In frequentist statistics, empirical
probabilities are calculated as observed relative frequencies.

However, observed (known) relative frequencies - aka empirical
probabilities - often are used to do more than simply describe a sample;
often, they are used to *make inferences* about unknown
(theoretical) population parameters. Your book describes this long run
inferential view of empirical probabilities, or what the authors call
the second “sampling” notion of probability, as “the chance of an event
occurring over the long run with an infinite number of trials” (p.153).
Of course, we cannot actually conduct an infinite number of trials, so
we use our known relative frequencies from a sample - aka our known
empirical probabilities - to *infer* what we think would likely
happen were we to conduct a very large number of trials. After
presenting these frequentist notions of probability, the chapter moves
on to explain how we could imagine a theoretical “probability
distribution” of outcomes that would emerge from repeated trials of an
event, then it describes various types of probability distributions,
including binomial, normal, and standard normal distributions.
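Although the book works these distributions out with formulas and tables, base R also ships with functions for each of them. This quick sketch (not part of the book's exercises) shows two of those built-in functions:

```r
# Binomial distribution: P(exactly 3 successes in 10 trials with p = 0.5)
dbinom(3, size = 10, prob = 0.5)   # 0.1171875

# Standard normal distribution: P(Z < 1.96)
pnorm(1.96)                        # ~0.975
```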

Recall, **descriptive statistics** involve describing
characteristics of a dataset (e.g., a sample), whereas
**inferential statistics** involve making inferences about
a population from a subset of sample data drawn from that population. In
addition to probability, this chapter also introduces the basics of
**null hypothesis significance testing**, which is the most
common procedure by which social scientists use frequentist empirical
probabilities and probability distributions to make inferences about
populations from descriptions of sample data. Hence, the materials
introduced in this chapter and this assignment, including probability,
probability rules, probability distributions, standard normal
distributions, and standard scores (z-scores), are essential to
understanding future assignments that will focus heavily on conducting
and interpreting null hypothesis significance tests.

In the current assignment, you will gain a better understanding of frequentist probability by learning to create cross-tabulations (or joint frequency contingency tables) and to calculate z-scores. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.

- understand basic elements of a contingency table (aka crosstab)
- understand why it can be helpful to place IV in columns and DV in rows
- recognize column/row marginals and their overlap with univariate frequency distributions
- know how to calculate marginal, conditional, and joint (frequentist) probabilities
- know how to compare column percentages (when an IV is in columns of a crosstab)

- recognize the R operator `!=` as “not equal to”
- know how to turn off or change scientific notation in R, such as `options(scipen=999, digits = 3)`
- be able to remove missing observations from a variable in R using `filter(!is.na(var))`
- be able to generate a crosstab in R using `dplyr::select()` & `sjPlot::sjtab(depvar, indepvar)`
- know how to add a title and column percents to a `sjtab()` table and switch output from viewer to html browser
- understand the binomial hypothesis test and how to conduct one in R
- understand the basic inputs of a binomial hypothesis test (p, x, \(\alpha\), & n)
- be able to modify elements of a `gt()` table, such as adding titles/subtitles with Markdown-formatted (e.g., `**bold**` or `*italicized*`) fonts
- have a basic understanding of the logic behind null hypothesis significance testing (we will continue to revisit this topic in more detail in remaining weeks)
- understand difference between a null (test) hypothesis and contrasting alternative hypotheses
- understand alpha or significance level (e.g., as risk tolerance or false positive error control rate)
- recognize need to calculate a test statistic and corresponding *p*-value
- compare a test *p*-value to alpha level and then determine whether the evidence is sufficient to reject the null hypothesis or instead should result in a failure to reject the null hypothesis
- be able to convert raw column (variable) values into standardized z-score values using `mutate()`
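As a preview of that last objective, here is a minimal sketch (with a made-up variable) of how `mutate()` can convert raw values into z-scores:

```r
library(dplyr)

# Hypothetical data; the column name "score" is invented for illustration
dat <- tibble(score = c(2, 4, 4, 4, 5, 5, 7, 9))

# z-score = (raw score - mean) / standard deviation
dat <- dat %>%
  mutate(z_score = (score - mean(score)) / sd(score))
```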

We are building on objectives from Assignments 1-6. By the start of this assignment, you should already know how to:

- create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
- install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
- recognize when a function is being called from a specific package using a double colon with the `package::function()` format
- read in an SPSS data file in an R code chunk using `haven::read_spss()` and assign it to an R object using an assignment (`<-`) operator
- use the `$` symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format `dataobject$varname`
- use a tidyverse `%>%` pipe operator to perform a sequence of actions
- knit your RMD document into a Word file that you can then save and submit for course credit
- use `here()` for a simple and reproducible self-referential file directory method
- use the base R `head()` function to quickly view a snapshot of your data
- use the `glimpse()` function to quickly view all columns (variables) in your data
- use `sjPlot::view_df()` to quickly browse variables in a data file
- use `attr()` to identify variable and attribute value labels
- recognize when missing values are coded as `NA` for variables in your data file
- select and recode variables using dplyr’s `select()`, `mutate()`, and `if_else()` functions
- use `summarytools::dfSummary()` to quickly describe one or more variables in a data file
- create frequency tables with `sjmisc::frq()` and `summarytools::freq()` functions
- sort frequency distributions (lowest to highest/highest to lowest) with `summarytools::freq()`
- calculate measures of central tendency for a frequency distribution
  - calculate central tendency using base R functions `mean()` and `median()` (e.g., `mean(data$variable)`)
  - calculate central tendency and other basic descriptive statistics for specific variables in a dataset using `summarytools::descr()` and `psych::describe()` functions
- calculate measures of dispersion for a variable distribution
  - calculate dispersion measures by hand from frequency tables you generate in R
  - calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with `sjmisc::frq()` or `summarytools::descr()`)
- improve some knitted tables by piping a function’s results to `gt()` (e.g., `head(data) %>% gt()`)
- create basic graphs using ggplot2’s `ggplot()` function
  - generate simple bar charts and histograms to visualize shape and central tendency of a frequency distribution
  - generate boxplots using base R `boxplot()` and `ggplot()` to visualize dispersion in a data distribution
- modify elements of a ggplot object
  - change outline and fill colors in a ggplot geometric object (e.g., `geom_boxplot()`) by adding `fill=` and `color=` followed by specific color names (e.g., “turquoise”) or hexadecimal codes (e.g., “#990000” for crimson)
  - add a title (and subtitle or caption) to a ggplot object by adding a label with the `labs()` function (e.g., `+ labs(title = "My Title")`)

*If you do not recall how to do these things, review Assignments
1-6.*

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

- difference between descriptive and inferential statistics
- z-scores
  - formula for converting a raw score into a z-score
  - how to use a z-score table
- how to calculate conditional probabilities from a frequency table or cross-tabulation
- rules of probabilities
  - bounding rule (rule #1; 0 to 1)
  - restricted and general addition rules of probabilities (rule #2a & #2b; unions)
  - restricted and general multiplication rules of probability (rule #3a & #3b; intersections)
  - probability of an event or the complement of an event
  - independent and mutually exclusive events
  - probability of success - binomial theorem
- probability distributions
  - binomial, normal, and standard normal distributions
  - formula for normal distribution
  - sampling distributions
- logic underlying null hypothesis testing
  - null hypothesis
  - alpha level (level of significance)
  - z-scores and critical regions
  - reject or failure to reject null

As noted previously, for this and all future assignments, you MUST
type all commands in by hand. *Do not copy & paste except for
troubleshooting purposes (i.e., if you cannot figure out what you
mistyped).*

Goal: Read in NCVS and 2012 States Data

*Note:* In the last assignment, you learned how to use frequency tables to calculate measures of dispersion and boxplots to visualize dispersion in a frequency distribution. In this assignment, you will learn the basics of probability theory and probability distributions, including binomial and normal distributions. You will also learn how to convert a raw score into a standard score or z-score using the standard normal distribution, as well as how such standard scores are used to test null hypotheses with the goal of making population inferences from sample data.

For many, probability is one of the most difficult things to understand in statistics. Some probability calculations are quite intuitive, while others seem to defy our intuitions. As you read Chapter 6 and watched this week’s videos, hopefully you began to understand how probability theory underlies inferential statistics and allows us to make claims about larger populations from sample data. We will continue to work toward understanding inferential statistics over the next few assignments. In this assignment, we will practice calculating (frequentist) probabilities and using z-scores.
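Before we begin, recall the chapter's formula for converting a raw score \(x\) into a z-score: subtract the mean and divide by the standard deviation,

\[
z = \frac{x - \bar{x}}{s}
\]

where \(\bar{x}\) is the sample mean and \(s\) is the sample standard deviation (substitute \(\mu\) and \(\sigma\) when population values are known). We will apply this formula in R later in the assignment.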

- Go to your CRIM5305_L folder, which should contain the R Markdown file you created for Assignment 5 (named **YEAR-MO-DY_LastName_CRIM5305_Assign05**). Click to open the R Markdown file.
  - Remember, we open RStudio in this way so the `here` package will automatically set our CRIM5305_L folder as the top-level directory.
- In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.
- The dialogue box asks for a **Title**, an **Author**, and a **Default Output Format** for your new R Markdown file.
  - In the **Title** box, enter *CRIM5305 Assignment 7*.
  - In the **Author** box, enter your First and Last Name (e.g., *Caitlin Ducate*).
  - Under the **Default Output Format** box, select “Word document”.
- Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete this text before you begin working.
- Create a second-level header titled: “Part 1 (Assignment 7.1).” Then, create a third-level header titled: “Read in NCVS and 2012 States Data”
- This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text: This R Markdown document contains *my work* for Assignment 7. It is **my work** and *only* my work.
- Now, you need to get data into RStudio. You already know how to do this, but please refer to Assignment 1 if you have questions.

- Create a third-level header in your R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
- Insert an R chunk.
- Inside the new R code chunk, load the following packages: `tidyverse`, `haven`, `here`, `sjmisc`, `sjPlot`, `summarytools`, `rstatix`, and `gt`.
  - This is our first time using the `rstatix` package, which means you’ll need to download it first using `install.packages()` or the “Install Packages” button under the “Tools” tab. Then, you can use `library()` to load it in.
  - Recall, you only need to install packages one time; after that, you can comment out that line. However, you must load the packages each time you start a new R session.
  - The `rstatix` package provides a simple and intuitive framework for performing basic statistical tests, and it is compatible with tidyverse and tidyverse pipe functions. We will use `rstatix::binom_test()` to conduct a binomial hypothesis test later in the assignment. We will also pipe the tabled output to the `gt()` function and then modify the table; if you recall, `gt()` can improve table output aesthetics and permits table customization.

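As a preview of the binomial test we will run later, a call to `rstatix::binom_test()` generally looks like the sketch below. The numbers here are invented for illustration; the assignment will supply the actual values of x, n, and p:

```r
library(rstatix)
library(gt)

# Hypothetical inputs: x = observed number of "successes",
# n = number of trials, p = null-hypothesized probability of success
binom_test(x = 12, n = 20, p = 0.5) %>%
  gt()
```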
- After your first code chunk, create another third-level header in your RMD file titled: “Read Data into R”
- Insert another R code chunk.
- In the new R code chunk, read and assign the “NCVS lone offender assaults 1992 to 2013.sav” and “2012 states data.sav” SPSS datafiles into R objects. Put the “NCVS lone offender assaults 1992 to 2013.sav” datafile into an object named `NCVSData` and the “2012 states data.sav” datafile into an object named `StatesData`.
  - Forget how to do this? Refer to any of the previous assignments.
- In the same code chunk, on a new line below your read data/assign object command, type the names of your new R data objects: `NCVSData` and `StatesData`.
  - This will call the objects and provide a brief view of the data. Once you have done that, comment out those lines. (**Note:** You can get a similar but more visually appealing view by simply clicking on an object in the “Environment” window. Also, be sure to put these calls on separate lines or they will not run properly.)
  - Your RStudio session should now look a lot like this:

```
StatesData <- read_spss(here("Datasets", "2012statesdata.sav"))
NCVSData <- read_spss(here("Datasets", "NCVSLoneOffenderAssaults1992to2013.sav"))
```

Goal: Understand basic elements of a contingency table

In the next section, we will generate contingency tables - otherwise known as “cross-tabulations” or crosstabs - and use them to calculate (frequentist) marginal, conditional, and joint probabilities.
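As a rough preview (a sketch only; we will walk through the exact options step by step), the crosstab code we will build follows this pattern, with the DV selected first so it lands in the rows and the IV second so it lands in the columns:

```r
# Sketch: pipe the selected DV and IV into sjPlot::sjtab() for a crosstab
# (show.col.prc, assumed here from sjPlot's cross-tab options, adds column percentages)
NCVSData %>%
  dplyr::select(reportedtopolice, privatelocation) %>%
  sjPlot::sjtab(fun = "xtab", show.col.prc = TRUE)
```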

Contingency tables or cross-tabulations are useful for understanding the association (or lack thereof) between two or more variables. Specifically, we will cross-tabulate two variables from the National Crime Victimization Survey data (`NCVSData`): the ordinal variable `relationship`, capturing a victim’s relationship to their assailant (0=“total stranger” to 3=“well known”), and the binary variable `maleoff`, representing the sex of the offender (0=female; 1=male). Before we do this together, let’s cover the basic elements of a contingency table.

- Until now, we have focused primarily on univariate descriptive statistics that summarize characteristics of a frequency distribution for a single variable in a sample, such as the relative frequency or probability of a particular value, or the distribution’s central tendency, shape, or dispersion. Let’s pick a couple variables from the NCVS data - `privatelocation` and `reportedtopolice` - and start by generating their univariate frequency distributions.
  - Recall, these NCVS data contain survey responses about individuals’ experiences with criminal victimization collected from nationally representative samples of people in U.S. households. This particular subset contains data from the 1992 to 2013 NCVS studies and only includes data from respondents who reported either violent or nonviolent assault by a single offender. For more information, see p.191 in Bachman, Paternoster, & Wilson’s book.
  - For this example, we selected these two variables to illustrate how a cross-tabulation might help us answer the following research question: *Are criminal victimizations that occur in private locations more or less likely to be reported to police than victimizations that do not occur in private locations?* `privatelocation` is a dummy variable (i.e., 0 or 1 values) indicating whether the reported victimization occurred in a private location (0=*Not a private location*; 1=*Private location*). `reportedtopolice` is a dummy variable indicating whether the victimization incident was reported to the police (0=*Unreported*; 1=*Reported*).
  - Remember, you can check these details using `sjPlot::view_df()` (e.g., by typing `NCVSData %>% view_df()`). Just be sure to comment out this line before you knit.
- Below are the univariate frequency distributions for the `privatelocation` and `reportedtopolice` variables in our NCVS data subset. Look at the code used to generate the tables - some of it should look familiar!

```
NCVSData %>%
  filter(!is.na(reportedtopolice)) %>%
  freq(privatelocation, report.nas = FALSE) %>%
  tb() %>%
  gt()
```

| privatelocation | freq | pct | pct_cum |
|---|---|---|---|
| 0 | 17618 | 76.2354 | 76.2354 |
| 1 | 5492 | 23.7646 | 100.0000 |

```
NCVSData %>%
  filter(!is.na(reportedtopolice)) %>%
  freq(reportedtopolice, report.nas = FALSE) %>%
  tb() %>%
  gt()
```

| reportedtopolice | freq | pct | pct_cum |
|---|---|---|---|
| 0 | 12639 | 54.69061 | 54.69061 |
| 1 | 10471 | 45.30939 | 100.00000 |

```
#The following code more efficiently accomplishes the same goal
# NCVSData %>% freq(reportedtopolice, report.nas = FALSE) %>% tb() %>% gt()
```

- You already know how to create these frequency tables with `summarytools::freq()`.
  - The major addition to this code is the `filter()` command, which tells R to keep only the rows where respondents reported a “0” or “1” to the `reportedtopolice` item and to remove the n=859 rows with missing data on this item. It does so by filtering to keep values that are not `NA` on the `reportedtopolice` variable (i.e., `filter(!is.na(reportedtopolice))`). We do this for comparison purposes, since our basic contingency tables will drop those missing (`NA`) cases by default. *Note:* With `summarytools::freq()`, we could have simply added the `report.nas = FALSE` option to more efficiently accomplish the same goal in our frequency table. However, we will use this `NA` filter later in this assignment, so we wanted to introduce you to it here as well.
  - You may notice our `freq()` code also includes the `report.nas = FALSE` option, which is not entirely redundant - here, it removes the empty `NA` row from our table.
  - The other key difference is we transformed the output to a “tibble” (a simple tidy dataframe) by piping it to `tb()` so that we could then pipe the output to our preferred `gt()` table formatting function.
- A quick look at these univariate frequency distributions shows us that victimizations in private locations are relatively rare, with only about 24% occurring in private locations. Also, a little less than half (45%) of victimizations are reported to the police. However, we cannot tell from these tables whether there is an association between the two variables. Enter contingency tables…
- A contingency table (aka, cross-tabulation or crosstab) is a multivariate or “joint” frequency distribution table that simultaneously presents the overlapping frequency distributions for two (or more) variables. Below is a contingency table containing joint frequencies for the `privatelocation` and `reportedtopolice` variables.
- Recall our research question: *Are criminal victimizations that occur in private locations more or less likely to be reported to police than victimizations that do not occur in private locations?* This question implies that `privatelocation` is the **independent variable (IV)** and `reportedtopolice` is the **dependent variable (DV)**. In other words, we hypothesize that whether or not a victimization is reported to police depends at least in part on the location of the victimization.
  - Logically, it does not make sense to posit that police reporting causes victimization location.
  - Meanwhile, even if it is associated, victimization location may not have a direct causal relationship with police reporting, as such an association could reflect unmeasured confounding or mediating mechanisms (e.g., systematic differences in victim/offender relationships or offense types across locations might actually cause differences in police reporting). Hence, association `!=` causation.
  - Note that `!=` is an operator in R that means “not equal to”. Recall, we also used `!` earlier, prior to `is.na`, when filtering missing data. While `is.na` means “is missing”, `!is.na` means “not is missing” (or “not NA”). So in R, if you want to see whether two things are the same, you will use `==`, which means “equal to”, while if you want to see whether two things are not the same, you will use `!=`, which means “not equal to.” The use of the exclamation point `!` to mean “not” is very common in programming.
- We might even expect and test more specific directional expectations for the posited association, such as that victimizations occurring in private locations are *less likely* to be reported to police than victimizations not occurring in private locations. For now, we will stick with a basic non-directional hypothesized association.
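These comparison operators are easy to try directly in the R console. A few quick, hypothetical examples:

```r
# == tests whether two values are equal; != tests whether they are not equal
3 == 3        # TRUE
3 != 3        # FALSE
"a" != "b"    # TRUE

# ! negates a logical value, which is why !is.na() means "is not missing"
is.na(NA)     # TRUE
!is.na(NA)    # FALSE
```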

- In any case, we prefer to **generate contingency tables that place the IV on top with its frequencies in the columns and the DV on the side with its frequencies in the rows**. While this is a preference, following a consistent practice like this will make it much easier to read contingency tables and result in fewer errors.
  - Since `privatelocation` is the independent variable (IV) here, you will note that it is placed in the columns of the contingency table.
  - Meanwhile, `reportedtopolice` is the dependent variable, so it is placed in the rows of the contingency table.
- The frequency values at the bottom of the IV’s columns and at the end of the DV’s rows, respectively, are known as column and row “marginal frequencies,” or marginals for short. If you compare these to our univariate frequency distribution tables above, you will notice the marginal frequencies match the univariate frequency distribution values (after using filter to remove `NA` values).
- So, we can identify univariate frequency distributions from the marginal frequencies (row or column totals) in a contingency table. In addition, a cross-tabulation presents joint frequencies.
- We can extract a lot of useful information from joint frequencies in a contingency table. For example, we know that 5,492 (column marginal) of the 23,110 (total) victimizations in the data reportedly occurred in a private location. Of those 5,492 private victimizations, 3,075 were reported to police and 2,417 were unreported.
- From these joint frequencies, we can easily calculate marginal, joint, or conditional relative frequencies (aka, frequentist probabilities) - we just divide the desired marginal or joint frequency by the total (for marginal or joint) or row/column (for conditional) frequency, respectively.
  - For example, the marginal probability of a crime being reported is p(reported)=0.45 (i.e., 10,471/23,110). Note that this marginal probability (aka, marginal relative frequency) is independent of the values of the other (`privatelocation`) variable.
  - The conditional probability of an assault being reported given it occurred in a private location is p(reported|private)=0.56 (i.e., 3,075/5,492). Note that this conditional probability depends on - it is conditional on - the value of the `privatelocation` variable (i.e., `privatelocation` = 1).
  - The conditional probability of an assault *not* being reported given it occurred in a private location is p(unreported|private)=0.44 (i.e., 2,417/5,492). Note that you could also calculate this probability by subtracting its *complement*, which we just calculated above (i.e., 1-0.56).
  - The joint probability of an assault occurring in a private location and being unreported is p(private & unreported)=0.10 (i.e., 2,417/23,110).
- You will practice calculating probabilities like these later in the assignment when answering question 2 from the Chapter 6 SPSS exercises. When you do, remember that it is very easy to read and calculate these relative frequencies or probabilities incorrectly, such as by selecting and dividing by the wrong frequency value (e.g., row instead of column, or total instead of row or column). To avoid such mistakes, take your time, be sure you know what it is you want to describe in the data, and be consistent in constructing tables (e.g., always placing your IV in the columns will help a lot!).
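If you want to check these calculations, you can re-do the division directly in the R console; the frequencies below come from the crosstab described above:

```r
# Marginal, conditional, and joint probabilities from the crosstab frequencies
10471 / 23110   # p(reported), ~0.45
3075 / 5492     # p(reported | private), ~0.56
2417 / 5492     # p(unreported | private), ~0.44 (also 1 - 3075/5492)
2417 / 23110    # p(private & unreported), ~0.10
```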

In addition to joint frequency distributions, it is also common to see percentage values reported in contingency tables. Once again, it is important to be sure the calculated or reported percentages *are the percentages that you want* when reading and interpreting the table. If we stick with the recommended practice of placing our IV in the columns, then we often want **column percentages** (as opposed to row percentages). The image above