In the last assignment, you learned how to conduct a one-sample z or t hypothesis test of the difference between a sample and population mean and then, given the test results and the null hypothesis, to make an appropriate inference about the population mean by either rejecting or failing to reject the null hypothesis. In this assignment, you will learn how to make population inferences about the relationship between two categorical variables by conducting a chi-squared test of independence on a sample contingency table (crosstab).
By the end of the assignment, you should be able to:

- Manually build a small table of values row-by-row using tibble::tribble()
- Use round() to specify the number of decimals on numeric values
- Modify a gt() table to add or remove the decimals in specific columns or rows with fmt_number()
- Conduct a chi-squared test of independence using sjPlot::sjtab() or chisq.test() and interpret the results
- Specify different measures of association with statistics= using sjPlot::sjtab() and interpret them appropriately
- Identify observed and expected frequencies by assigning the results of chisq.test() to an object (e.g., chisq) and then calling elements from that object (e.g., chisq$observed or chisq$expected)

We are building on objectives from Assignments 1-8. By the start of this assignment, you should already know how to:
- Install and load R packages using the package::function() format
- Import an SPSS data file with haven::read_spss() and assign it to an R object using an assignment (<-) operator
- Use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
- Use the %>% pipe operator to perform a sequence of actions
- Recognize != as "not equal to"
- Turn off scientific notation and limit printed digits with options(scipen=999, digits = 3)
- Create a list of objects with listname <- c(item1, item2)
- Use lapply() to create a list of objects, which can help you avoid cluttering the R Environment with objects
- Create your own function (e.g., the funxtoz() function) - and that doing so is recommended for duplicate tasks to avoid copy-and-paste errors
- Use here() for a simple and reproducible self-referential file directory method
- Use set.seed() to make procedures involving random number generation reproducible
- Recognize that R comes with built-in datasets (e.g., mtcars) that are universally available to R users
- Use the head() function to quickly view the first few rows of data
- Use the tail() function to quickly view the last few rows of data
- Use the glimpse() function to quickly view all columns (variables) in your data
- Use sjPlot::view_df() to quickly browse variables in a data file
- Use attr() to identify variable and attribute value labels
- Recognize missing values coded as NA for variables in your data file
- Drop missing observations with filter(!is.na(var))
- Convert labeled SPSS variables to factors with haven::as_factor()
- Create new variables with the select(), mutate(), and if_else() functions
- Recode variables with mutate()
- Draw a random sample of cases with dplyr::sample_n()
- Select specific rows with dplyr::filter() and the %in% operator
- Simulate data from probability distributions with rnorm(), truncnorm::rtruncnorm(), or runif()
- Draw random samples with dplyr::slice_sample()
- Draw repeated random samples with infer::rep_slice_sample()
- Use summarytools::dfsummary() to quickly describe one or more variables in a data file
- Generate frequency tables with the sjmisc::frq() and summarytools::freq() functions
- Calculate measures of central tendency such as mean() and median() (e.g., mean(data$variable))
- Generate summary statistics with the summarytools::descr() and psych::describe() functions
- Generate measures of dispersion (e.g., with sjmisc::frq() or summarytools::descr())
- Create a crosstab with dplyr::select() & sjPlot::sjtab(depvar, indepvar)

#### Data visualization & aesthetics

- Create a gt() table (e.g., head(data) %>% gt())
- Modify a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
- Create plots with the ggplot() function
- Use boxplot() and ggplot() to visualize dispersion in a data distribution
- Change colors in a boxplot (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., "orange") or hexadecimal codes (e.g., "#990000" for crimson; "#EDEBEB" for cream)
- Add a theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
- Add plot labels with the labs() function (e.g., + labs(title = "My Title"))
- Combine multiple ggplot() plots into a single figure using the "patchwork" package
- Use patchwork::wrap_plots() to quickly combine ggplot objects contained in a list
- Plot point estimates with error bars using ggplot() + geom_point() + geom_errorbar()
- Add lines (+ geom_vline()), arrows (+ geom_segment()), or text (+ annotate()) elements to a ggplot() object

#### Hypothesis testing

- Conduct a binomial hypothesis test with rstatix::binom_test()
- Find critical values with qt() or qnorm()
- Conduct a one-sample t test with the t.test() function
- Conduct a t test with infer::t_test()
- Use infer::visualize() to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis

If you do not recall how to do these things, review Assignments 1-10.
Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand the basic logic of the chi-squared test of independence as introduced in those materials.
As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).
Goal: Understand the null hypothesis of statistical independence and visualize the chi-squared test of independence
A few assignments back (Assignment 7), you learned how to describe the association between two categorical variables by creating and interpreting a contingency table or crosstab. In this assignment, you will learn how to make an inference about the relationship between two variables in a population by conducting a chi-squared (\(\chi^2\)) test of independence on a sample crosstab. Additionally, we will briefly introduce you to the phi-coefficient and Cramer’s V, two measures of association that can be interpreted to describe the strength of an association between variables in a crosstab.
Note: You might notice that your book uses “chi-square” yet we use “chi-squared” instead (with a “d” at the end). Which term is correct? It does not really matter as long as you realize we are referring to the same statistical quantity.
In Assignment 7, we explained how to set up a crosstab with the independent variable (IV) in the columns and dependent variable (DV) in the rows and then how to describe the association - or lack of association - between the IV and DV by comparing column percentages within rows and across columns. For example, recall this crosstab from the earlier assignment:
The crosstab above allows us to assess whether there is an association between two variables - “private location of victimization” and “victimization reported to police.” We can describe the association by comparing column percentages within a single row. For instance, in Assignment 7, we compared the percentage of victimization incidents in private locations that were reported to police with the percentage of victimization incidents in non-private locations that were reported to police (purple highlighted rows). Specifically, we stated that “there is an association between these two variables: victimizations occurring in private locations are less likely to go unreported than are victimizations occurring in other places (44% versus 58%, respectively).” Alternatively, about 56% of victimizations in private locations were reported to police, while only about 42% of victimizations in non-private places were reported to police.
So, if these column percentages differ, which they did (58% - 44% = 14 percentage point difference), then we can conclude there is an association between these variables in our sample. But how strong is that association? Or, if we have a probability sample, can we confidently conclude that there is an association in the population, or is the observed difference likely due to sampling variation instead?
In Assignment 7, we did not discuss in detail the note line below the table; let’s identify what is reported there. First, the note line reports results of a chi-squared test of independence, including the chi-squared test statistic (\(\chi^2=331.106, df=1\)) and the associated p-value (\(p=0.000\)). This test assesses how likely it would be to observe differences in percentages within DV values (rows) and across IV values (columns) that are at least as large as those observed if the two variables were truly “independent” or unrelated in the population. Additionally, the note line reports the phi-coefficient (\(\phi=0.120\)). As explained in this week’s book chapter, the phi-coefficient is a measure of association that helps us summarize the strength of the relationship between two variables.
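As a preview of how these quantities fit together, the phi-coefficient for a 2x2 table can be recovered directly from the chi-squared statistic and the sample size: \(\phi=\sqrt{\chi^2/N}\). Here is a quick sketch using the values reported in the note line:

sqrt(331.106 / 23110) #phi = square root of (chi-squared / N); approx. 0.120, matching the phi reported in the note line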
In this assignment, you will learn how to calculate and interpret a chi-squared test of independence in R. First, though, we want to walk you through the step-by-step (definitional) calculation of the chi-squared statistic. Our goal is to help you understand the logic underlying the chi-squared test of independence and, specifically, the null hypothesis of statistical independence between variables.
As noted above, the chi-squared test of independence estimates the probability of observing column percentage differences as large or larger than those observed under the assumption - that is, given the null hypothesis - that the two variables are independent in the population.
What does “independent” mean in this context? It means the two variables are unrelated in the population - that knowing the value of one variable tells us nothing about the likely value of the other. To assess this, we compare the observed cell frequencies in our crosstab with the expected cell frequencies under the null hypothesis, or with the cell frequencies that we would expect to see if the column and row marginal totals (i.e., univariate frequency distributions) remained the same yet there was no association (\(\phi=0\)) between the two variables.
As such, if there is no discrepancy between observed cell frequencies and expected (null) cell frequencies, the chi-squared statistic will equal “0”. In contrast, large discrepancies between observed and expected cell frequencies (and/or large sample sizes) result in large chi-squared test statistics, smaller p-values, and more confidence in the inference that the observed differences would be highly improbable if the null hypothesis of statistical independence were true in the population.
To help you better understand the null hypothesis of independence, we will calculate the expected frequencies under the null for each cell in the example table above. As your book explains, the expected frequencies are the joint cell frequencies we would expect to see if the variables were independent or completely unrelated.
Before we calculate these expected frequencies, let’s start by
reproducing the example table with only the marginal totals included
(i.e., univariate frequency distributions); in the cells where we need
to calculate expected frequencies, we will start by listing them as
missing or “NA” values. We will manually build this table by inputting
each frequency value row-by-row into a tibble (i.e., tidy data object)
using the tidyverse tibble::tribble()
function.
expectedfreqs <- tribble(
~DV, ~non_private, ~private, ~Total,
"unreported", NA, NA, 12639,
"reported", NA, NA, 10471,
"Total", 17618, 5492, 23110
)
expectedfreqs %>%
gt() %>%
tab_header(title = md("Marginal Frequencies Only"))
Marginal Frequencies Only

| DV | non_private | private | Total |
| --- | --- | --- | --- |
| unreported | NA | NA | 12639 |
| reported | NA | NA | 10471 |
| Total | 17618 | 5492 | 23110 |
Now, let’s fill in the expected frequency values. Remember, these are the cell frequencies that we would expect to find if the variables were independent - that is, if there were absolutely no association between the two variables. To calculate each expected frequency, we multiply the cell’s row frequency total by the cell’s column frequency total and then divide by the total sample size (N). For example, the expected frequency for the top left cell is calculated as: \(E_{[1,1]}=\frac{12,639 \times 17,618}{23,110}=9,635.39\).
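Incidentally, rather than computing each expected frequency one cell at a time, you can generate the entire matrix of expected frequencies in one step with base R’s outer() function. A minimal sketch using the marginal totals above:

#Expected frequencies for all four cells at once: (row total * column total) / N
row_totals <- c(12639, 10471) #unreported, reported
col_totals <- c(17618, 5492) #non-private, private
outer(row_totals, col_totals) / 23110 #2x2 matrix: approx. 9635.39, 3003.61 (top row); 7982.61, 2488.39 (bottom row)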
Note: We formatted the expected frequencies to display only
two decimals by using the round()
function. However, when
we added these values to our tibble, the tribble()
function
automatically made our formatting consistent by adding two decimals to
the column marginals as well. So, we also applied a simple fix by
modifying our gt()
table with fmt_number()
to
remove the decimals in the column marginal frequencies (columns 2 &
3, row 3).
EC1R1 <- round((12639*17618)/23110, digits=2)
EC1R2 <- round((10471*17618)/23110, digits=2)
EC2R1 <- round((12639*5492)/23110, digits=2)
EC2R2 <- round((10471*5492)/23110, digits=2)
expectedfreqs <- tribble(
~DV, ~non_private, ~private, ~Total,
"unreported", EC1R1, EC2R1, 12639,
"reported", EC1R2, EC2R2, 10471,
"Total", 17618, 5492, 23110
)
expectedfreqs %>%
gt() %>%
tab_header(title = md("Expected Frequencies")) %>%
fmt_number(
columns = 2:3,
rows = 3,
decimals = 0
)
Expected Frequencies

| DV | non_private | private | Total |
| --- | --- | --- | --- |
| unreported | 9635.39 | 3003.61 | 12639 |
| reported | 7982.61 | 2488.39 | 10471 |
| Total | 17,618 | 5,492 | 23110 |
Before calculating chi-square, let’s first transform these
expected frequencies into expected column percentages to help us
interpret our (relative) frequency expectations under the null
hypothesis of independence.
EC1R1p <- round((((12639*17618)/23110)/17618), digits=4)
EC1R2p <- round((((10471*17618)/23110)/17618), digits=4)
EC2R1p <- round((((12639*5492)/23110)/5492), digits=4)
EC2R2p <- round((((10471*5492)/23110)/5492), digits=4)
expectedfreqs <- tribble(
~DV, ~non_private, ~private, ~Total,
"unreported", EC1R1p, EC2R1p, 12639,
"reported", EC1R2p, EC2R2p, 10471,
"Total", 17618, 5492, 23110
)
expectedfreqs %>%
gt() %>%
tab_header(title=md("Expected Column Percents")) %>%
fmt_number(
columns = 2:3,
rows = 3,
decimals = 0
) %>%
fmt_percent(
columns = 2:3,
rows = 1:2,
decimals = 1
)
Expected Column Percents

| DV | non_private | private | Total |
| --- | --- | --- | --- |
| unreported | 54.7% | 54.7% | 12639 |
| reported | 45.3% | 45.3% | 10471 |
| Total | 17,618 | 5,492 | 23110 |
So, if the null hypothesis of independence (i.e., no
association) were true, then we would expect about 45% of victimization
incidents to be reported to police irrespective of whether the
incident occurred in a private or a non-private location.
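In fact, under independence, each column’s expected percentages simply equal the DV’s marginal (row) percentages, which you can verify directly:

#Under independence, expected column percentages equal the DV's marginal percentages
12639 / 23110 #proportion of all incidents unreported: approx. 0.547, or 54.7%
10471 / 23110 #proportion of all incidents reported: approx. 0.453, or 45.3%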
In contrast to these expectations under independence, we observed that about 56% of victimizations occurring in private locations were reported to police, whereas about 42% of victimizations occurring in non-private places were reported to police. Therefore, our observed frequencies differ from the expected frequencies under the null.
To test the null hypothesis of independence, we calculate a test statistic that essentially equals the sum of standardized differences between observed and expected cell frequencies in a crosstab. Specifically, the definitional formula for calculating the chi-squared test statistic is: \(\chi^2_{obt}=\Sigma{\frac{(Obs-Exp)^2}{Exp}}\), or the sum of squared differences between observed and expected cell frequencies divided by expected frequencies. Below, we calculate the standardized difference for each cell and place these four values in the corresponding table cells.
stdiffC1R1 <- round(((10222 - EC1R1)^2)/EC1R1, digits=2)
stdiffC1R2 <- round(((7396 - EC1R2)^2)/EC1R2, digits=2)
stdiffC2R1 <- round(((2417 - EC2R1)^2)/EC2R1, digits=2)
stdiffC2R2 <- round(((3075 - EC2R2)^2)/EC2R2, digits=2)
expectedfreqs <- tribble(
~DV, ~non_private, ~private, ~Total,
"unreported", stdiffC1R1, stdiffC2R1, 12639,
"reported", stdiffC1R2, stdiffC2R2, 10471,
"Total", 17618, 5492, 23110
)
expectedfreqs %>%
gt() %>%
tab_header(title=md("Standardized differences in cell frequencies")) %>%
fmt_number(
columns = 2:3,
rows = 3,
decimals = 0
)
Standardized differences in cell frequencies

| DV | non_private | private | Total |
| --- | --- | --- | --- |
| unreported | 35.71 | 114.57 | 12639 |
| reported | 43.11 | 138.29 | 10471 |
| Total | 17,618 | 5,492 | 23110 |
Our chi-squared test statistic, then, equals the sum of these
four cell values:
chisqtest <- stdiffC1R1 + stdiffC1R2 + stdiffC2R1 + stdiffC2R2
chisqtest
## [1] 331.68
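For reference, the same definitional calculation can be collapsed into a few lines of base R. A sketch assuming the observed cell frequencies from the original table:

#Vectorized version of the definitional chi-squared calculation
obs <- matrix(c(10222, 2417,
                7396, 3075), nrow = 2, byrow = TRUE) #observed cell frequencies
expfreq <- outer(rowSums(obs), colSums(obs)) / sum(obs) #expected frequencies under independence
sum((obs - expfreq)^2 / expfreq) #approx. 331.67; our rounded-by-hand sum above gives 331.68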
Our manually calculated chi-squared test statistic of 331.68 is nearly the same as the test statistic reported with the original table (331.106)!
You might wonder why the two values are not exactly the same. The difference is not primarily due to rounding. It turns out that R applies what is known as the Yates’ continuity correction when calculating this statistic for a 2x2 table. Essentially, this correction involves subtracting 1/2 from the absolute value of each difference between observed and expected values in a 2x2 table before squaring each difference, dividing by the expected frequency, and then summing the standardized differences. Below, we show how to manually apply this correction (using the abs() function to generate the absolute difference); doing so results in essentially the same value as the one reported in the original table (any remaining discrepancy in the final decimal place reflects our rounding of the expected frequencies to two decimals).
stdiffC1R1 <- round(((abs(10222 - EC1R1) -.5)^2)/EC1R1, digits=3)
stdiffC1R2 <- round(((abs(7396 - EC1R2) -.5)^2)/EC1R2, digits=3)
stdiffC2R1 <- round(((abs(2417 - EC2R1) -.5)^2)/EC2R1, digits=3)
stdiffC2R2 <- round(((abs(3075 - EC2R2) -.5)^2)/EC2R2, digits=3)
chisqtest_adj <- stdiffC1R1 + stdiffC1R2 + stdiffC2R1 + stdiffC2R2
chisqtest_adj
## [1] 331.108
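You can confirm both versions with R’s built-in chisq.test(), which applies the Yates’ correction to 2x2 tables by default and lets you turn it off with the correct argument. A quick sketch, again assuming the observed frequencies from the original table:

obs <- matrix(c(10222, 2417,
                7396, 3075), nrow = 2, byrow = TRUE) #observed cell frequencies
chisq.test(obs) #Yates-corrected (the default for 2x2 tables); X-squared approx. 331.11
chisq.test(obs, correct = FALSE) #uncorrected (definitional) version; X-squared approx. 331.67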
Now that you know how to calculate the chi-squared test statistic, what do you do with it? The same thing we have done with other test statistics (e.g., t and z): compare it to a pre-specified critical value associated with an alpha level and a probability distribution of standardized values!
In this case, our test statistic (chi-squared or \(\chi^2\)) follows a chi-squared probability distribution. Like the t distribution, there are different chi-squared distributions that vary depending on the degrees of freedom associated with the test. In a chi-squared test of independence applied to a crosstab, we calculate degrees of freedom as equal to: (the number of rows minus 1) times (the number of columns minus one), or \(df=(Rows-1)(Cols-1)\).
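Rather than looking up critical values in a printed table, you can also pull them from the chi-squared distribution with base R’s qchisq() function. For example:

qchisq(p = .95, df = 1) #critical value for alpha = .05, df = 1: approx. 3.841
qchisq(p = .95, df = 4) #for comparison, a larger table with df = 4: approx. 9.488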
Below, we use ggplot to visualize the chi-squared probability distribution for a 2X2 table with df=(2-1)(2-1)=1.
x <- tibble(value = rchisq(100000, df = 1)) %>% #simulate 100,000 draws from a chi-squared distribution (df=1)
  mutate(
    critvals = if_else(value > 3.841, "1", "0") #flag values beyond the alpha=.05 critical value
  )
x %>% ggplot() +
geom_histogram(aes(x=value, y=..density..),
color="#1fa187", fill="#4ac16d", alpha=.7,
position="identity", breaks = seq(0, 10, by = .05)) +
stat_function(fun = dchisq, args = list(df = 1),
color='#365c8d', size=1.1) +
coord_cartesian(expand = FALSE, xlim=c(0,8), ylim=c(0,1)) +
theme_minimal() +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank()
) +
labs(x="chi-squared value, df=1", y=NULL)
Now, let’s modify this graph a bit. First, we will overlay a vertical line at 3.841, which is the critical value for a chi-squared test with df=1 and alpha=0.05 (\(\chi^2_{crit}=3.841\)). Second, we will shade the distribution beyond our critical value with a different color and fill combination.
x %>% ggplot(aes(x=value)) +
geom_histogram(aes(y=..count.., color=critvals, fill=critvals),
alpha=.7,
position="identity", breaks = seq(0, 10, by = .05)) +
coord_cartesian(expand = FALSE, xlim=c(0,8), ylim=c(0,4000)) +
geom_vline(xintercept = 3.841, color = "#d44842", size=1.2, alpha=0.7) +
annotate("text", x=4, y=1000, angle=0, color="#d44842", hjust=0,
label="critical value=3.841\ndf=1, alpha=.05") +
scale_color_manual(values=c("#1fa187","#bc3754")) +
scale_fill_manual(values=c("#4ac16d","#e45a31")) +
theme_minimal() +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
legend.position="none"
) +
labs(x="chi-squared value, df=1", y=NULL)
As with other null hypothesis tests you have learned about thus far, if our chi-squared test statistic is greater than the pre-specified critical value, we reject the null hypothesis of statistical independence. In contrast, if our test statistic is smaller than the critical value, we would fail to reject the null hypothesis of statistical independence.
So, where does our test statistic fall in the chi-squared
distribution and in relation to the critical value? Let’s add a dashed
line to our plot at the unadjusted value of \(\chi^2_{obt}=331.68\) and see where it
falls in the distribution. To see it, we will need to zoom out the
plot’s x-axis quite a bit (from xlim(0,8)
to
xlim(0,332)
). We will leave the y-axis limit the same as
above to get a better sense of how far away our test statistic is from
the critical value.
x %>% ggplot(aes(x=value)) +
geom_histogram(aes(y=..count.., color=critvals, fill=critvals),
alpha=.7,
position="identity", breaks = seq(0, 10, by = .05)) +
coord_cartesian(expand = FALSE, xlim=c(0,332), ylim=c(0,4000)) +
geom_vline(xintercept = 3.841, color = "#d44842", size=1.2, alpha=0.7) +
annotate("text", x=10, y=1000, angle=0, color="#d44842", hjust=0,
label="critical value=3.841\ndf=1, alpha=.05") +
geom_vline(xintercept = 331.68, color = "#365c8d", linetype="dashed",
size=1, alpha=0.7) +
annotate("text", x=325, y=1000, angle=0, color="#365c8d", hjust=1,
label="test statistic=331.68") +
scale_color_manual(values=c("#1fa187","#bc3754")) +
scale_fill_manual(values=c("#4ac16d","#e45a31")) +
theme_minimal() +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
legend.position="none"
) +
labs(x="chi-squared value, df=1", y=NULL)
As you can see in the plot above, the chi-squared test statistic is greater than the critical value (331.68 > 3.841), so there is sufficient evidence to reject the null hypothesis of statistical independence. This is also implied by the p-value, which is less than the alpha level (0.000 < 0.05). If the null hypothesis were indeed true in the population, then it would be highly improbable to obtain a sample with differences at least as large as those observed in the original table due to sampling variability alone. Therefore, based on these sample statistics, we can infer that the null hypothesis of independence is unlikely to be true in the population.
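If you want the p-value itself rather than a critical-value comparison, you can compute the upper-tail area beyond the test statistic with base R’s pchisq(). A quick sketch using our unadjusted statistic:

pchisq(331.68, df = 1, lower.tail = FALSE) #area beyond the test statistic; an extremely small p-value, far below .05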
Goal: Filter data for violent incidents (injured==1) and examine univariate frequency distributions
For this assignment, you will learn to conduct chi-squared tests of independence in R. Questions 20-24 ask you to investigate whether women are more likely than men to experience violent victimization by strangers. To do so, you first need to filter the data to include only victimizations that resulted in injury to the victim (with the injured variable), then conduct a chi-squared test of independence on the variables V3018 (sex of victim) and relationship (relationship between victim and offender) to test the hypothesis that the two variables are statistically independent.
Below, we will demonstrate how to complete the assignment steps by
investigating a different question using the same data: Are violent
victimizations that occur in private locations more or less likely to be
reported to police than violent victimizations that do not occur in
private locations? This is nearly the same question that we
investigated with sample descriptive statistics (crosstabs) in
Assignment 7 and portrayed in the example above. One important
difference, however, is that we will be filtering our data to keep only
those victimizations where an injury occurred so that we can examine
violent victimizations rather than all (violent and nonviolent) criminal
victimizations. As before, we will use the privatelocation
and reportedtopolice
variables from the NCVS data, then we
will generate a crosstab and conduct a hypothesis test to make a
population inference about the (non)independence of these two
variables.
- privatelocation is a dummy variable (i.e., 0 or 1 values) indicating whether the reported victimization occurred in a private location (0=Not a private location; 1=Private location).
- reportedtopolice is a dummy variable indicating whether the victimization incident was reported to the police (0=Unreported; 1=Reported).
- You can verify these variable codings yourself with sjPlot::view_df() (e.g., by typing NCVSData %>% view_df()).

By illustrating the process with different variables, you will get the experience of conducting these tests in R yourself using the same data. Then, after working through the process with our example, you will repeat the same steps with the V3018 (sex of victim) and relationship variables using the same filtered data to complete the assignment. Be sure to repeat these steps with the correct variables and filtered data listed in the assignment to ensure you receive full credit.
(Note: Remember that, when following instructions, you should always replace “LastName” with your own last name and replace “YEAR_MO_DY” with the actual date.)
1. Open your course project file; recall that the here package will automatically set our CRIM5305_L folder as the top-level directory.
2. Load the packages we will use in this assignment: tidyverse, haven, here, sjmisc, sjPlot, and gt. Recall, you only need to install packages one time. However, you must load them each time you start a new R session.
3. Import the NCVS data file and assign it to an object named NCVSData.
4. On a new line, type the name of the data object, NCVSData. This will call the object and provide a brief view of the data. Remember, you can get a similar but more visually appealing view by simply clicking on the object in the “Environment” window. Also, be sure to put these functions on separate lines or they will not run properly.
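For instance, assuming each package is already installed, your package-loading chunk might look like this:

#Load packages used in this assignment (install each once with install.packages() if needed)
library(tidyverse)
library(haven)
library(here)
library(sjmisc)
library(sjPlot)
library(gt)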
Here is a glimpse of what the first few rows of your data should look like:
NCVSData <- read_spss(here("Datasets", "NCVSLoneOffenderAssaults1992to2013.sav"))
head(NCVSData) %>% gt()
YEAR | V2119 | V2129 | V3014 | V3016 | V3018 | V3021 | V3023 | V3023A | V3024 | V2026 | V4049 | V4234 | V4235 | V4236 | V4237 | V4237A | V4238 | V4239 | V4240 | V4241 | V4242 | V4243 | V4244 | V4245 | V4246 | V4246A | V4246B | V4246C | V4246E | V4246F | V4246G | V4247 | V4528 | injured | privatelocation | reportedtopolice | weaponpresent | medicalcarereceived | filter_$ | relationship | Policereported | victimreported | thirdpartyreport | maleoff | age_r | vic18andover |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2000 | 2 | 2 | 34 | 6 | 1 | 1 | 1 | NA | 2 | 12 | 1 | 1 | NA | 2 | 6 | NA | 2 | 3 | NA | 1 | NA | 3 | NA | 1 | 1 | NA | NA | NA | NA | NA | NA | 2 | 11 | 1 | 1 | 0 | 1 | 1 | 1 | 3 | 0 | 0 | 0 | 0 | 34 | 1 |
2000 | 2 | 1 | 51 | 3 | 1 | 2 | 2 | NA | 2 | 10 | 1 | 1 | NA | 1 | 6 | NA | 2 | 1 | 1 | 1 | NA | 3 | NA | NA | 2 | NA | NA | NA | NA | NA | NA | 1 | 13 | 0 | 1 | 0 | 1 | NA | 0 | 3 | 0 | 0 | 0 | 1 | 51 | 1 |
2000 | 2 | 2 | 20 | 5 | 1 | 1 | 1 | NA | 2 | 14 | 2 | 1 | NA | 1 | 5 | NA | 2 | 1 | 1 | 1 | NA | 3 | NA | NA | 3 | NA | NA | NA | NA | NA | NA | 1 | 20 | 0 | 0 | 0 | 0 | NA | 0 | 3 | 0 | 0 | 0 | 1 | 20 | 1 |
2000 | 2 | 2 | 38 | 3 | 2 | 1 | 1 | NA | 2 | 10 | 2 | 1 | NA | 1 | 6 | NA | 2 | 2 | NA | 1 | NA | 3 | NA | 2 | 1 | NA | NA | NA | NA | NA | NA | 1 | 11 | 1 | 1 | 1 | 0 | 1 | 1 | 3 | 1 | 1 | 0 | 1 | 38 | 1 |
2000 | 2 | 1 | 71 | 1 | 2 | 1 | 1 | NA | 2 | 10 | 2 | 1 | NA | 2 | 6 | NA | 2 | 2 | NA | 1 | NA | 2 | NA | NA | 2 | NA | NA | NA | NA | NA | NA | 1 | 20 | 0 | 1 | 0 | 0 | NA | 0 | 2 | 0 | 0 | 0 | 0 | 71 | 1 |
2000 | 2 | 2 | 12 | 5 | 1 | 1 | 1 | NA | 2 | 10 | 2 | 1 | NA | 1 | 2 | NA | 3 | 3 | NA | 1 | NA | 2 | NA | NA | 1 | NA | NA | NA | NA | NA | NA | 1 | 14 | 1 | 0 | 1 | 0 | 0 | 1 | 2 | 1 | 0 | 1 | 1 | 12 | 0 |
1. Next, we will use dplyr::filter() to select certain rows of data based on column values - here, only the rows where the victim was injured.
2. Insert an R chunk and type NCVSData_inj <- NCVSData %>%. Hit enter and type dplyr::filter(injured == 1). This assigns the filtered data to a new object named NCVSData_inj.
3. On a new line, type NCVSData_inj %>% frq(injured) to view the frequency distribution for the “injured” variable and ensure our new data object contains only those cases where the victim sustained injuries during victimization (injured == 1). In the frequency table for the “injured” variable, the row labeled “uninjured” (value = “0”) should have an N=0.

#Selecting only those observations (rows) where the victim was injured. 'injured' = 1 if the victim sustained injuries and 'injured' = 0 if not. If 'injured' = 0, the row is excluded.
NCVSData_inj <- NCVSData %>%
dplyr::filter(injured == 1)
NCVSData_inj %>%
frq(injured)
## Victim Sustained Injuries During Vicitmization (injured) <numeric>
## # total N=7809 valid N=7809 mean=1.00 sd=0.00
##
## Value | Label | N | Raw % | Valid % | Cum. %
## ---------------------------------------------------
## 0 | uninjured | 0 | 0 | 0 | 0
## 1 | injured | 7809 | 100 | 100 | 100
## <NA> | <NA> | 0 | 0 | <NA> | <NA>
1. Next, insert an R chunk and examine frequency distributions for the privatelocation and reportedtopolice variables from the new filtered data object.
2. Type NCVSData_inj %>% frq(privatelocation). Then, hit enter and on the next line type NCVSData_inj %>% frq(reportedtopolice) to create a frequency table for each variable.
3. Remember, the Raw % column calculates percentages including NAs, while the Valid % column excludes NAs before calculation.
4. Later, you will repeat these steps for the assignment with the relationship and V3018 (sex of victim) variables, which will allow you to examine how often respondents report being injured by strangers versus slightly known, casual acquaintance, or well known offenders.
(sex of victim).NCVSData_inj %>%
frq(privatelocation)
## Victimization Occurred in a Private Location (privatelocation) <numeric>
## # total N=7809 valid N=7809 mean=0.39 sd=0.49
##
## Value | N | Raw % | Valid % | Cum. %
## ---------------------------------------
## 0 | 4741 | 60.71 | 60.71 | 60.71
## 1 | 3068 | 39.29 | 39.29 | 100.00
## <NA> | 0 | 0.00 | <NA> | <NA>
NCVSData_inj %>%
frq(reportedtopolice)
## Incident Reported To Police (reportedtopolice) <numeric>
## # total N=7809 valid N=7590 mean=0.56 sd=0.50
##
## Value | Label | N | Raw % | Valid % | Cum. %
## ----------------------------------------------------
## 0 | unreported | 3308 | 42.36 | 43.58 | 43.58
## 1 | reported | 4282 | 54.83 | 56.42 | 100.00
## <NA> | <NA> | 219 | 2.80 | <NA> | <NA>
Goal: Generate a crosstab and conduct a chi-squared test of independence
Now we will generate a crosstab and conduct a chi-squared test of independence using R. Additionally, we will calculate a Cramer’s V coefficient to help describe the strength of the relationship between victimization location and reporting to police. Recall, you will then repeat these steps to generate a crosstab of the variables relationship and V3018 (victim sex) to answer the book’s question #5 and related questions in Quiz 9.
1. Insert an R chunk, type NCVSData_inj %>%, and hit enter. Type select(reportedtopolice, privatelocation) %>%. Recall that you should list rows (dependent variables) first, then columns (independent variables) last. Also, if the research question warranted it, you could list multiple independent variables. In this case, our dependent variable is reportedtopolice because we are investigating whether victim reporting depends on the location of the victimization incident.
2. Hit enter and type sjtab(title = "Crosstab of Incident Location and Victim Reporting", show.col.prc = TRUE, statistics = "cramer", use.viewer = TRUE).
- You might recall the sjtab() function from the sjPlot package from an earlier assignment. It has a pipe-friendly argument structure, where our first argument identifies the data, followed by variables that should be displayed in a contingency table. So, in our case, the function essentially transforms the two variables we input from our data object into a crosstab.
- title = allows us to add a title to our crosstab.
- The show.col.prc = TRUE argument tells R we want it to automatically generate and display the column percentages.
- use.viewer = TRUE simply tells R to output the crosstab in our Viewer pane. If we type use.viewer = FALSE, R will open the crosstab in another (e.g., HTML browser) window.
- We request Cramer’s V with statistics = "cramer". Instead of “cramer”, we could have specified “phi”, “spearman”, “kendall”, “pearson”, “fisher”, or “auto” (for auto-selection).
- We also added the file = here("Images", "mytable.html") command in sjtab() to save the table as an HTML file. From there, we also included code (commented out) showing how you could use the “webshot” package to save the HTML table as an image file (.png).
to save the HTML table as an image file (.png).NCVSData_inj %>%
select(reportedtopolice, privatelocation) %>% #list row (DV) first then col (IV) last - can have multiple IVs
sjtab(title = "Crosstab of Incident Location and Victim Reporting",
show.col.prc = TRUE,
statistics = "cramer",
use.viewer = TRUE)
Crosstab of Incident Location and Victim Reporting

(Columns: Victimization Occurred in a Private Location; 0 = not private, 1 = private)

| Incident Reported To Police | 0 | 1 | Total |
| --- | --- | --- | --- |
| unreported | 2169 (47.5 %) | 1139 (37.7 %) | 3308 (43.6 %) |
| reported | 2399 (52.5 %) | 1883 (62.3 %) | 4282 (56.4 %) |
| Total | 4568 (100 %) | 3022 (100 %) | 7590 (100 %) |

χ2=70.529 · df=1 · Cramer’s V=0.097 · p=0.000
#Note: look for the table in the Viewer pane (as you did with sjPlot's view_df() function); the tidy output should also show in the knitted file!
# Want to save html table as image instead?
# Add `file = here("Images", "mytable.html")` to sjtab() code above
# library(webshot)
# webshot::install_phantomjs() # run if first time loading webshot package
# webshot(here("Images", "mytable.html"), here("Images", "mytable.png"))
# knitr::include_graphics(here("Images", "mytable.png"))
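As an aside, Cramer’s V can be recovered by hand from the chi-squared statistic: \(V=\sqrt{\frac{\chi^2}{N \times min(r-1, c-1)}}\). A sketch using the values in the table above:

#Cramer's V from the reported chi-squared statistic
chisq_stat <- 70.529
N <- 7590 #valid (non-missing) cases in the crosstab
sqrt(chisq_stat / (N * min(2 - 1, 2 - 1))) #min(rows-1, cols-1) = 1 for a 2x2 table; approx. 0.096

The result is close to the 0.097 reported by sjtab(); the tiny gap likely reflects whether the Yates-corrected or uncorrected chi-squared value is used. Also, note that for a 2x2 table Cramer’s V equals the phi-coefficient.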
What if you prefer not to use sjtab()? There are many ways to conduct a chi-squared test in R. We recommend checking out this particularly informative example of conducting a chi-squared test using the “infer” package in R. Here, we will instead show you a simple alternative using base R to conduct a chi-squared test of independence.
1. Insert an R chunk and use the chisq.test() function to conduct a chi-squared test of independence for the reportedtopolice and privatelocation variables in our filtered NCVS data comprised of victimizations where an injury occurred.
2. Type chisq <- chisq.test(NCVSData_inj$reportedtopolice, NCVSData_inj$privatelocation) to assign the test results to an object named chisq.
3. Note that chisq.test() is not (magrittr style) pipe-friendly, meaning we must utilize the $ operator to call the variables from the NCVSData_inj data object.
4. On a new line, type chisq to view your chi-squared test results.

# Run a chi-square test using base R
chisq <- chisq.test(NCVSData_inj$reportedtopolice, NCVSData_inj$privatelocation)
chisq
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: NCVSData_inj$reportedtopolice and NCVSData_inj$privatelocation
## X-squared = 70.529, df = 1, p-value < 2.2e-16
You can also use the object created with chisq.test() to identify the observed and expected frequency values that were used to calculate the chi-squared test statistic. From these values, you could easily calculate the chi-squared test statistic by hand. To do this, insert a new R chunk and simply call the new data object, chisq, and type $observed or $expected immediately after it.

Here is chisq$observed:

#Determine observed frequencies
chisq$observed
## NCVSData_inj$privatelocation
## NCVSData_inj$reportedtopolice 0 1
## 0 2169 1139
## 1 2399 1883
And here is chisq$expected:
:#Determine expected frequencies
chisq$expected
## NCVSData_inj$privatelocation
## NCVSData_inj$reportedtopolice 0 1
## 0 1990.902 1317.098
## 1 2577.098 1704.902
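Beyond $observed and $expected, the object returned by chisq.test() stores several other standard elements you can call the same way:

chisq$statistic #the X-squared test statistic
chisq$p.value #the p-value for the test
chisq$residuals #Pearson residuals: (observed - expected) / sqrt(expected)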
Do you recall how to do the following?

- Manually build a tibble row-by-row using tibble::tribble()?
- Use round() to specify the number of decimals on numeric values?
- Modify a gt() table to add or remove the decimals in specific columns or rows with fmt_number()?
- Conduct a chi-squared test of independence using sjPlot::sjtab() or chisq.test() and interpret the results?
- Specify different measures of association with statistics= using sjPlot::sjtab() and interpret them appropriately?
- Identify observed and expected frequencies by assigning the results of chisq.test() to an object (e.g., chisq) and then calling elements from that object (e.g., chisq$observed or chisq$expected)?