## The purpose of this sixth assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 6 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

This chapter provided an introduction to probability, including foundational rules of probability and probability distributions. It is likely you have heard the term “probability” before and have some intuitions about what it means. You might be surprised to learn that there are different philosophical views about what probability is and is not, and our position on probability will have important implications for the way we approach statistical description and inference!

Our book, like most undergraduate statistics books, largely presents
a frequentist
view of probability. It starts by presenting us with a basic frequentist
mathematical definition of probability as “the number of times that a
specific event can occur relative to the total number of times that any
event can occur” (p.152). Keen readers will note that this definition of
probability sounds uncannily similar to a *relative frequency* -
that’s because it is! In frequentist statistics, empirical
probabilities are calculated as observed relative frequencies.

However, observed (known) relative frequencies - aka empirical
probabilities - often are used to do more than simply describe a sample;
often, they are used to *make inferences* about unknown
(theoretical) population parameters. Your book describes this long run
inferential view of empirical probabilities, or what the authors call
the second “sampling” notion of probability, as “the chance of an event
occurring over the long run with an infinite number of trials” (p.153).
Of course, we cannot actually conduct an infinite number of trials, so
we use our known relative frequencies from a sample - aka our known
empirical probabilities - to *infer* what we think would likely
happen were we to conduct a very large number of trials. After
presenting these frequentist notions of probability, the chapter moves
on to explain how we could imagine a theoretical “probability
distribution” of outcomes that would emerge from repeated trials of an
event, then it describes various types of probability distributions,
including binomial, normal, and standard normal distributions.

Recall, **descriptive statistics** involve describing
characteristics of a dataset (e.g., a sample), whereas
**inferential statistics** involve making inferences about
a population from a subset of sample data drawn from that population. In
addition to probability, this chapter also introduces the basics of
**null hypothesis significance testing**, which is the most
common procedure by which social scientists use frequentist empirical
probabilities and probability distributions to make inferences about
populations from descriptions of sample data. Hence, the materials
introduced in this chapter and this assignment, including probability,
probability rules, probability distributions, standard normal
distributions, and standard scores (z-scores), are essential to
understanding future assignments that will focus heavily on conducting
and interpreting null hypothesis significance tests.

In the current assignment, you will gain a better understanding of frequentist probability by learning to create cross-tabulations (i.e., joint frequency contingency tables) and to calculate z-scores. As with previous assignments, you will use R Markdown (with R & RStudio) to complete and submit your work. By the end of this assignment, you should:

- understand basic elements of a contingency table (aka crosstab)
- understand why it can be helpful to place IV in columns and DV in rows
- recognize column/row marginals and their overlap with univariate frequency distributions
- know how to calculate marginal, conditional, and joint (frequentist) probabilities
- know how to compare column percentages (when an IV is in columns of a crosstab)
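
For a concrete illustration of marginal, joint, and conditional probabilities in a crosstab, base R's `table()` and `prop.table()` work before any packages are loaded. This is a hedged sketch with made-up data, not the course data file:

```r
# Made-up data: IV (sex) in the columns, DV (arrest) in the rows
sex    <- c("M", "M", "M", "F", "F", "M", "F", "M", "F", "M")
arrest <- c("Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes")

tab <- table(arrest, sex)        # DV in rows, IV in columns
addmargins(tab)                  # adds row/column marginal frequency totals

prop.table(tab)                  # joint probabilities, e.g., P(Yes & M)
margin.table(tab, 1) / sum(tab)  # marginal probabilities of the DV
prop.table(tab, margin = 2)      # column %: conditional P(arrest | sex)
```

Comparing the column percentages from the last line across IV categories is exactly the "compare column percentages" step listed above.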

- recognize the R operator `!=` as "not equal to"
- know how to turn off or change scientific notation in R, such as `options(scipen=999, digits = 3)`
- be able to remove missing observations from a variable in R using `filter(!is.na(var))`
- know how to change a numeric variable to a factor (e.g., nominal or ordinal) variable with `haven::as_factor()`
- recognize how to drop an unused factor level (e.g., a missing "Don't know" label) on a variable using `data %>% droplevels(data$variable)`
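
For example, dropping an unused factor level can be sketched entirely in base R. Here `factor()` stands in for `haven::as_factor()`, since no SPSS file is loaded, and the variable is made up:

```r
# Made-up numeric survey item where 9 is labelled "Don't know"
x <- c(1, 2, 2, 9, 1, 2)
f <- factor(x, levels = c(1, 2, 9),
            labels = c("No", "Yes", "Don't know"))

f[f == "Don't know"] <- NA   # recode "Don't know" responses to missing
f <- droplevels(f)           # drop the now-empty "Don't know" level
levels(f)                    # "No" "Yes"
```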

- be able to generate a crosstab in R using `dplyr::select()` & `sjPlot::sjtab(depvar, indepvar)`
  - know how to add a title and column percents to a `sjtab()` table and switch output from the viewer to an html browser
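
A hedged sketch of that workflow, using the built-in `mtcars` data in place of the course data file; the `show.col.prc` and `use.viewer` argument names are assumptions based on sjPlot's `sjt.xtab()` conventions, so check `?sjtab` in your installed version:

```r
library(dplyr)
library(sjPlot)

mtcars %>%
  mutate(vs = factor(vs, labels = c("V-shaped", "Straight")),    # stand-in DV
         am = factor(am, labels = c("Automatic", "Manual"))) %>% # stand-in IV
  select(vs, am) %>%               # DV first, so it lands in the rows
  sjtab(fun = "xtab",
        title = "Engine shape by transmission",
        show.col.prc = TRUE,       # add column percentages
        use.viewer = FALSE)        # open html output in the browser
```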

- be able to generate a crosstab in R using `crosstable(depvar, by=indepvar)`
  - know how to include missing data and add column/row marginal frequency totals to a `crosstable()` table
  - know how to modify decimal digits & table output format in a `crosstable()` table, and how to output it to an aesthetically pleasing html table
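
A hedged sketch of those `crosstable()` options, again using `mtcars` instead of the course data; the `showNA`, `total`, and `percent_digits` arguments follow my reading of the crosstable package documentation and may differ across versions:

```r
library(crosstable)

crosstable(mtcars, cyl, by = am,
           showNA = "always",      # include missing data (if any)
           total = "both",         # add row & column marginal totals
           percent_digits = 1) |>  # control decimal digits shown
  as_flextable()                   # render as a polished html-ready table
```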

- understand the binomial hypothesis test and how to conduct one in R
  - understand the basic inputs of a binomial hypothesis test (p, x, *α*, & n)
- be able to modify elements of a `gt()` table, such as adding titles/subtitles with Markdown-formatted (e.g., `**bold**` or `*italicized*`) fonts
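
Those inputs map directly onto base R's `binom.test()`. The numbers below are made up for illustration:

```r
# Hedged example: H0: p = .50, observed x = 38 successes in n = 100
# trials, with alpha set at .05
res <- binom.test(x = 38, n = 100, p = 0.50)

res$p.value          # exact two-sided p-value (about .02 here)
res$p.value < 0.05   # TRUE -> reject H0 at the .05 alpha level
```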

- have a basic understanding of the logic behind null hypothesis significance testing (we will continue to revisit this topic in more detail in remaining weeks)
  - understand the difference between a null (test) hypothesis and contrasting alternative hypotheses
  - understand alpha or significance level (e.g., as a risk tolerance or false positive error control rate)
  - recognize the need to calculate a test statistic and corresponding *p*-value
  - compare a test *p*-value to the alpha level and then determine whether the evidence is sufficient to reject the null hypothesis or instead should result in a failure to reject the null hypothesis

- be able to convert raw column (variable) values into standardized z-score values using `mutate()`
- know that you can create your own R functions (e.g., our `funxtoz()` function), and that doing so is recommended for repetitive tasks to avoid copy-and-paste errors
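
For example, the z-score formula z = (x − mean) / sd can be wrapped in a small reusable function. The body below is a plausible reconstruction of what a function like `funxtoz()` does, not the course's exact code:

```r
# Convert raw scores to z-scores: z = (x - mean(x)) / sd(x)
funxtoz <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

z <- funxtoz(mtcars$mpg)   # with dplyr: mtcars %>% mutate(zmpg = funxtoz(mpg))
round(mean(z), 10)         # 0: standardized scores are centered at zero
round(sd(z), 10)           # 1: ...with a standard deviation of one
```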

## We are building on objectives from Assignments 1-5. By the start of this assignment, you should already know how to:

- create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
- install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
- recognize when a function is being called from a specific package using a double colon with the `package::function()` format
- read in an SPSS data file in an R code chunk using `haven::read_spss()` and assign it to an R object using an assignment (`<-`) operator

- use the `$` symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format `dataobject$varname`
- use a tidyverse `%>%` pipe operator to perform a sequence of actions
- knit your RMD document into an HTML file that you can then save and submit for course credit

- use `here()` for a simple and reproducible self-referential file directory method
- use `groundhog.library()` as an optional but recommended reproducible alternative to `library()` for loading packages

- use the base R `head()` function to quickly view a snapshot of your data
- use the `glimpse()` function to quickly view all columns (variables) in your data
- use `sjPlot::view_df()` to quickly browse variables in a data file
- use `attr()` to identify variable and attribute value labels

- recognize when missing values are coded as `NA` for variables in your data file
- select and recode variables using dplyr's `select()`, `mutate()`, and `if_else()` functions

- use `summarytools::dfSummary()` to quickly describe one or more variables in a data file
- create frequency tables with the `sjmisc::frq()` and `summarytools::freq()` functions
- sort frequency distributions (lowest to highest/highest to lowest) with `summarytools::freq()`

- calculate measures of central tendency for a frequency distribution
  - calculate central tendency using base R functions `mean()` and `median()` (e.g., `mean(data$variable)`)
  - calculate central tendency and other basic descriptive statistics for specific variables in a dataset using the `summarytools::descr()` and `psych::describe()` functions
- calculate measures of dispersion for a variable distribution
  - calculate dispersion measures by hand from frequency tables you generate in R
  - calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with `sjmisc::frq()` or `summarytools::descr()`)

- improve some knitted tables by piping a function's results to `gt()` (e.g., `head(data) %>% gt()`)

- create basic graphs using ggplot2's `ggplot()` function
  - generate simple bar charts and histograms to visualize the shape and central tendency of a frequency distribution
  - generate boxplots using base R `boxplot()` and `ggplot()` to visualize dispersion in a data distribution

- modify elements of a ggplot object
  - change outline and fill colors in a ggplot geometric object (e.g., `geom_boxplot()`) by adding `fill=` and `color=` followed by specific color names (e.g., "orange") or hexadecimal codes (e.g., "#990000" for crimson; "#EDEBEB" for cream)
  - add or change a preset theme (e.g., `+ theme_minimal()`) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
  - select colors from a colorblind accessible palette (e.g., using `viridisLite::viridis()`) and specify them for the outline and fill colors in a ggplot geometric object (e.g., `geom_boxplot()`)
  - add a title (and subtitle or caption) to a ggplot object by adding a label with the `labs()` function (e.g., `+ labs(title = "My Title")`)
*If you do not recall how to do these things, review Assignments
1-5.*

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

- difference between descriptive and inferential statistics
- z-scores
  - formula for converting a raw score into a z-score
  - how to use a z-score table
- how to calculate conditional probabilities from a frequency table or cross-tabulation
- rules of probabilities
  - bounding rule (rule #1; 0 to 1)
  - restricted and general addition rules of probabilities (rule #2a & #2b; unions)
  - restricted and general multiplication rules of probability (rule #3a & #3b; intersections)
- probability of an event or the complement of an event
- independent and mutually exclusive events
- probability of success
- binomial theorem
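
As a quick refresher, the addition and multiplication rules can be checked with simple arithmetic. This worked die-roll example is illustrative and not from the book:

```r
# General addition rule (union): P(A or B) = P(A) + P(B) - P(A and B)
p_even <- 3/6                        # P(even) on a fair six-sided die
p_gt4  <- 2/6                        # P(5 or 6)
p_both <- 1/6                        # "6" is both even and greater than 4
p_union <- p_even + p_gt4 - p_both   # = 4/6 (not 5/6: events overlap)

# Restricted multiplication rule (independent events): P(A and B) = P(A) * P(B)
p_two_sixes <- (1/6) * (1/6)         # = 1/36 across two independent rolls
```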