The purpose of this sixth assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 6 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.
This chapter provided an introduction to probability, including foundational rules of probability and probability distributions. It is likely you have heard the term “probability” before and have some intuitions about what it means. You might be surprised to learn that there are different philosophical views about what probability is and is not, and our position on probability will have important implications for the way we approach statistical description and inference!
Our book, like most undergraduate statistics books, largely presents a frequentist view of probability. It starts by presenting us with a basic frequentist mathematical definition of probability as “the number of times that a specific event can occur relative to the total number of times that any event can occur” (p.152). Keen readers will note that this definition of probability sounds uncannily similar to a relative frequency - that’s because it is! In frequentist statistics, empirical probabilities are calculated as observed relative frequencies.
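To make that connection concrete, here is a minimal R sketch (using made-up coin-flip data, not our assignment data) showing that an empirical probability is simply an observed relative frequency:

flips <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 1)  # hypothetical data: 1 = heads, 0 = tails
sum(flips == 1) / length(flips)           # empirical probability of heads: 6/10 = 0.6
prop.table(table(flips))                  # same idea: relative frequencies of each outcome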
However, observed (known) relative frequencies - aka empirical probabilities - often are used to do more than simply describe a sample; often, they are used to make inferences about unknown (theoretical) population parameters. Your book describes this long run inferential view of empirical probabilities, or what the authors call the second “sampling” notion of probability, as “the chance of an event occurring over the long run with an infinite number of trials” (p.153). Of course, we cannot actually conduct an infinite number of trials, so we use our known relative frequencies from a sample - aka our known empirical probabilities - to infer what we think would likely happen were we to conduct a very large number of trials. After presenting these frequentist notions of probability, the chapter moves on to explain how we could imagine a theoretical “probability distribution” of outcomes that would emerge from repeated trials of an event, then it describes various types of probability distributions, including binomial, normal, and standard normal distributions.
Recall, descriptive statistics involve describing characteristics of a dataset (e.g., a sample), whereas inferential statistics involves making inferences about a population from a subset of sample data drawn from that population. In addition to probability, this chapter also introduces the basics of null hypothesis significance testing, which is the most common procedure by which social scientists use frequentist empirical probabilities and probability distributions to make inferences about populations from descriptions of sample data. Hence, the materials introduced in this chapter and this assignment, including probability, probability rules, probability distributions, standard normal distributions, and standard scores (z-scores), are essential to understanding future assignments that will focus heavily on conducting and interpreting null hypothesis significance tests.
In the current assignment, you will gain a better understanding of frequentist probability by learning to create cross-tabulations (joint frequency contingency tables) and to calculate z-scores. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.
In this assignment, you will learn the following:

- How to read != as "not equal to"
- How to turn scientific notation off or on using options(scipen=999, digits = 3)
- How to drop missing observations on a variable using filter(!is.na(var))
- How to convert a labelled numeric variable into a factor using haven::as_factor()
- How to drop unused factor levels using data %>% droplevels(data$variable)
- How to create a cross-tabulation using dplyr::select() & sjPlot::sjtab(depvar, indepvar)
- How to add a title to a sjtab() table and switch output from viewer to html browser
- How to create a cross-tabulation using crosstable(depvar, by=indepvar)
- How to drop NA values from a crosstable() table
- How to customize a crosstable() table, and how to output it to an aesthetically pleasing html table
- How to customize a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
- How to convert raw scores to z-scores using mutate()
- How to write your own function (e.g., our funxtoz() function) - and that doing so is recommended for duplicate tasks to avoid copy-and-paste errors

We are building on objectives from Assignments 1-5. By the start of this assignment, you should already know how to:
- Call functions from specific packages using the package::function() format
- Read an SPSS (.sav) datafile using haven::read_spss() and assign it to an R object using an assignment (<-) operator
- Use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
- Use the %>% pipe operator to perform a sequence of actions
- Use here() for a simple and reproducible self-referential file directory method
- Use groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages
- Use the head() function to quickly view a snapshot of your data
- Use the glimpse() function to quickly view all columns (variables) in your data
- Use sjPlot::view_df() to quickly browse variables in a data file
- Use attr() to identify variable and attribute value labels
- Recode values to NA for variables in your data file using the select(), mutate(), and if_else() functions
- Use summarytools::dfSummary() to quickly describe one or more variables in a data file
- Create frequency tables with the sjmisc::frq() and summarytools::freq() functions
- Identify the mode from a summarytools::freq() frequency table
- Calculate the mean and median with mean() and median() (e.g., mean(data$variable))
- Generate common descriptive statistics with the summarytools::descr() and psych::describe() functions
- Calculate measures of dispersion (e.g., with sjmisc::frq() or summarytools::descr())
- Output tables with gt() (e.g., head(data) %>% gt())
- Create plots with the ggplot() function
- Use boxplot() and ggplot() to visualize dispersion in a data distribution
- Modify colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., "orange") or hexadecimal codes (e.g., "#990000" for crimson; "#EDEBEB" for cream)
- Add a theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
- Select color palettes (e.g., viridisLite::viridis()) and specify them for the outline and fill colors in a ggplot geometric object (e.g., geom_boxplot())
- Add a title to a plot with the labs() function (e.g., + labs(title = "My Title"))

If you do not recall how to do these things, review Assignments 1-5.
Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand the key concepts from Chapter 6, including probability rules, probability distributions (e.g., binomial; normal; standard normal), and standard scores (z-scores).
As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).
Goal: Read in NCVS and 2012 States Data
(Note: Remember that, when following instructions, always substitute your own last name for "LastName" and the actual date for YEAR_MO_DY. E.g., 2022_06_15_Fordham_K300Assign6)
In the last assignment, you learned how to use frequency tables to calculate measures of dispersion and boxplots to visualize dispersion in a frequency distribution. In this assignment, you will learn the basics of probability theory and probability distributions, including binomial and normal distributions. You will also learn how to convert a raw score into a standard score or z-score using the standard normal distribution, as well as how such standard scores are used to test null hypotheses with the goal of making population inferences from sample data.
For many, probability is one of the most difficult things to understand in statistics. Some probability calculations are quite intuitive, while others seem to defy our intuitions. As you read Chapter 6 and watched this week’s videos, hopefully you began to understand how probability theory underlies inferential statistics and allows us to make claims about larger populations from sample data. We will continue to work toward understanding inferential statistics over the next few assignments. In this assignment, we will practice calculating (frequentist) probabilities and using z-scores.
- Remember, the here package will automatically set our K300_L folder as the top-level directory.
- In this assignment, we will use the following packages: tidyverse, haven, here, sjmisc, sjPlot, crosstable, rstatix, and gt.
- This is likely your first time using the crosstable and rstatix packages, which means you'll need to download them first using install.packages() or the "Install Packages" button under the "Tools" tab. Then, you can use library() to load them in. Recall, you can instead use groundhog.library() to improve the reproducibility of your script.
- In addition to sjPlot::sjtab(), we will also briefly introduce you to crosstable::crosstable() as an alternative package for generating various univariate and bivariate statistics, including crosstabs.
- The rstatix package provides a simple and intuitive framework for performing basic statistical tests, and it is compatible with tidyverse and tidyverse pipe functions. We will use rstatix::binom_test() to conduct a binomial hypothesis test later in the assignment. We will also pipe the tabled output to the gt() function and then modify the table; if you recall, gt() can improve table output aesthetics and permits table customization.
- Read the "NCVSLoneOffenderAssaults1992to2013.sav" datafile into an object named NCVSData and the "2012StatesData.sav" datafile into an object named StatesData.
- Your environment should now contain two data objects: NCVSData and StatesData.
StatesData <- read_spss(here("Datasets", "2012StatesData.sav"))
NCVSData <- read_spss(here("Datasets", "NCVSLoneOffenderAssaults1992to2013.sav"))
Goal: Understand basic elements of a contingency table
In the next section, we will generate contingency tables - otherwise known as "cross-tabulations" or crosstabs - and use them to calculate (frequentist) marginal, conditional, and joint probabilities.
Contingency tables or cross-tabulations are useful for understanding
the association (or lack thereof) between two or more variables.
Specifically, we will cross-tabulate two variables from the National
Crime Victimization Survey data (NCVSData
): the ordinal
variable ‘relationship’, capturing a victim’s relationship to their
assailant (0=“total stranger” to 3=“well known”), and the binary
variable ‘maleoff’, representing the sex of the offender (0=female;
1=male). Before we do this together, let’s cover the basic elements of a
contingency table.
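Before working with the real data, it may help to see that logic on a toy example. The sketch below uses hypothetical counts and base R functions (the variable names A and B are made up) to show how joint, marginal, and conditional probabilities all come from the same contingency table:

# hypothetical 2x2 table of counts for two binary variables, A and B
tab <- matrix(c(20, 30, 10, 40), nrow = 2,
              dimnames = list(A = c("no", "yes"), B = c("no", "yes")))

prop.table(tab)                  # joint probabilities: each cell count / total N
margin.table(tab, 1) / sum(tab)  # marginal probabilities of A
prop.table(tab, margin = 1)      # conditional probabilities of B given A (row percentages)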
- Recall, you can browse the variables and their labels with sjPlot::view_df() (e.g., by typing NCVSData %>% view_df()).

NCVSData %>%
filter(!is.na(reportedtopolice)) %>%
freq(privatelocation, report.nas = TRUE) %>%
tb() %>%
gt()
privatelocation | freq | pct_valid | pct_valid_cum | pct_tot | pct_tot_cum |
---|---|---|---|---|---|
0 | 17618 | 76.2354 | 76.2354 | 76.2354 | 76.2354 |
1 | 5492 | 23.7646 | 100.0000 | 23.7646 | 100.0000 |
<NA> | 0 | NA | NA | 0.0000 | 100.0000 |
NCVSData %>%
filter(!is.na(reportedtopolice)) %>%
freq(reportedtopolice, report.nas = TRUE) %>%
tb() %>%
gt()
reportedtopolice | freq | pct_valid | pct_valid_cum | pct_tot | pct_tot_cum |
---|---|---|---|---|---|
0 | 12639 | 54.69061 | 54.69061 | 54.69061 | 54.69061 |
1 | 10471 | 45.30939 | 100.00000 | 45.30939 | 100.00000 |
<NA> | 0 | NA | NA | 0.00000 | 100.00000 |
#The following code more efficiently accomplishes the same goal
# NCVSData %>% freq(reportedtopolice, report.nas = TRUE) %>% tb() %>% gt()
- The frequency tables above were generated with summarytools::freq().
- Notice the filter() command, which tells R to keep only the rows where respondents reported a "0" or "1" to the 'reportedtopolice' item and to remove the n=859 rows with missing data on this item. It does so by filtering to keep values that are not NA on the 'reportedtopolice' variable (i.e., filter(!is.na(reportedtopolice))). We do this for comparison purposes, since our basic contingency tables will drop those missing (NA) cases by default.
- With summarytools::freq(), we could have simply added the report.nas = FALSE option to more efficiently accomplish the same goal in our frequency table. However, we will use this NA filter later in this assignment, so we wanted to introduce you to it here as well.
- The freq() code also includes the report.nas option; setting report.nas = FALSE is not entirely redundant with the filter - here, it removes the empty NA row from our table.
- We piped the results to tb() so that we could then pipe the output to our preferred gt() table formatting package.
- Remember: association != causation. != is an operator in R that means "not equal to". Recall, we also used ! earlier prior to is.na when filtering missing data. While is.na means "is missing", !is.na means "not is missing" or "not NA" (i.e., it keeps the rows without NA values).

Goal: Create Cross-Tabulations of sex ("V3018") and "relationship" variables
You will learn to use two functions to generate contingency tables
like those presented above. First, the sjtab()
function from the
“sjPlot” package, which was used to create the tables above, is a
tidyverse-friendly option for creating useful and aesthetically pleasing
contingency tables (for more details, see here, here, and here).
With the sjtab()
approach, we will also rely on the
select()
function, which we have used before, to select the
specific variables we want to include in our contingency table.
Second, the crosstable()
function from the
“crosstable” package is another tidyverse-friendly way to generate
useful and aesthetically pleasing contingency tables. In addition to generating
contingency tables with numerous customizing options, you can use
the package to generate univariate and grouped descriptive statistics,
bivariate correlations, bivariate effect sizes, and more! Moreover, this
package has the advantage of printing beautiful HTML tables directly in
your RMD document rather than outputting them to the viewer pane. We
will learn to use this package in the next section.
Let's start by using the sjtab() function to generate a
table that will help us answer question 2 (part a [i-iii] & part b
[i-ii]) from the SPSS exercises at the end of Chapter 6 (p.192) of the
book. This question requires us to generate a contingency table of
‘relationship’ (victim’s relationship to assailant) by ‘V3018’ (victim’s
sex) to assess whether there are gender differences in victims’
relationships to their assailants. For instance, a contingency table
might help us determine whether the probability of being assaulted by a
stranger or by a well-known acquaintance is equivalent or differs for
men and women.
- Type NCVSData %>%. Hit enter and type select(relationship, V3018) %>%. Hit enter again and type sjtab(title = "Cross-Tabulation of Victim Gender and Relationship to Assailant", show.col.prc = TRUE, use.viewer = FALSE). The assembled chunk is shown below.
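Putting those steps together, your code chunk should look something like this (the same command from the steps above, shown in one piece):

NCVSData %>%
  select(relationship, V3018) %>%
  sjtab(title = "Cross-Tabulation of Victim Gender and Relationship to Assailant",
        show.col.prc = TRUE, use.viewer = FALSE)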
- The dplyr::select() function allows us to select the variables we want to use in a dataset. When working with just one or two variables from a large dataset, this can be helpful. When working with an entire dataset, we do not need to use select(). For this assignment, we will use it to select the "V3018" and "relationship" variables from our NCVS dataset.
- When using the select() function in combination with sjtab(), the dependent variable should be listed first and the independent variable listed second. This will ensure that the DV is in the rows and the IV is in the columns of the table.
- The sjtab() function allows us to make a contingency table in R that is customizable using tidyverse design philosophy.
- The title = argument allows us to specify a title for the crosstab.
- The show.col.prc = TRUE argument requests the column percentages - i.e., the percentages of male victims (or the percentages of female victims) that were reported to be in each of the relationship categories (e.g., % of male victims assaulted by strangers; % of male victims assaulted by casual acquaintances; etc.).
- We also included the use.viewer = FALSE argument, which instructs R to output the resulting table in a pop-up HTML document rather than the built-in viewer pane (i.e., the viewer pane is where you find the view_df() output). Feel free to change that to TRUE if you prefer. Either way, the cross-tabulation table should populate in the final knitted HTML document.
- If you instead keep the default (use.viewer = TRUE), then your RStudio session should look something like this:

Goal: Learn alternative method to create cross-tab of sex ("V3018") and "relationship"
Before moving on, we will show you how to generate a similar
contingency table using the crosstable()
function from the
“crosstable” package. This part of the assignment is optional, though
feel free to follow along in your own RMD file.
- To use crosstable(), the first thing we need to do is modify our 'relationship' variable.
- Why? The crosstable() function views variables with three or more numbered values - like our relationship variable, which has four values ranging from "0" to "3" - as a numeric variable. With numeric variables, the function's default action is to output descriptive statistics rather than a contingency table. So, to generate a cross-tab, we will first need to transform 'relationship' into a factor variable.
- Additionally, the 'relationship' variable contains NA values. We could simply generate the table and include a row for "Don't know." However, we will add a couple more steps to drop these NA values so our table exactly matches the one we generated using sjtab() above.

NCVSData2 <- NCVSData %>%
filter(!is.na(relationship)) %>%
mutate(relationshipfac = haven::as_factor(relationship))
NCVSData2 <- NCVSData2 %>% droplevels(NCVSData2$relationshipfac)
- In the first line, we piped our original data into filter() and assigned the results to a new data object (i.e., NCVSData2 <- NCVSData %>% ...), which preserves the original NCVSData.
- The first line also includes the filter() command. You should recognize this - we used it earlier in the assignment in the section explaining crosstabs. Here, we filtered the data to drop missing values on the 'relationship' variable (filter(!is.na(relationship))). This matches the sjtab() table generated above, which removed missing data by default to perform basic statistical procedures (e.g., Chi-squared test of independence; Cramer's V calculation).
- The second line includes the mutate() command. In this line, we transformed our old 'relationship' variable into a factor using the haven::as_factor() function and assigned it to a new variable called 'relationshipfac' (mutate(relationshipfac = haven::as_factor(relationship))).
- Note: We specified the package before the function (haven::as_factor). This is because there are other popular packages (e.g., "sjlabelled"; "forcats") that also use the same function name. Recall, we will sometimes do this with dplyr::select() for similar reasons.
- The last line includes the droplevels() command. If we omit this line, our crosstab will display an empty row for the "Don't know" (NA) response option. Remember, we filtered out the "Don't know" values. This statement now tells R to drop the unused (empty) factor level.

Now we are ready to use the crosstable() function! Click the "code" button to see how the sausage is made.

NCVSData2 %>%
crosstable(relationshipfac, by=V3018, showNA="ifany", total="both",
percent_digits=1, percent_pattern="{n} ({p_col})") %>%
as_flextable(keep_id=FALSE)
label | variable | SEX (ALLOCATED) = 1 | SEX (ALLOCATED) = 2 | Total |
---|---|---|---|---|
relationshipfac | stranger | 4614 (38.0%) | 1933 (17.3%) | 6547 (28.1%) |
 | slightly known | 1760 (14.5%) | 1190 (10.7%) | 2950 (12.7%) |
 | casual acquiant | 2609 (21.5%) | 1967 (17.6%) | 4576 (19.6%) |
 | well known | 3157 (26.0%) | 6070 (54.4%) | 9227 (39.6%) |
Total | | 12140 (52.1%) | 11160 (47.9%) | 23300 (100.0%) |
- As with sjtab(), we will specify our DV ('relationship') first, then list our IV second (i.e., by=V3018).
- showNA="ifany" would include our NA ("Don't know") values had we not filtered them out and dropped the factor level. While this is a useful toggling option for showing or omitting NA values, selecting "no" does not change the total N, meaning our percentages would not have matched the table we created earlier. You can see what I mean by simply changing the code to instead specify our original data (i.e., to NCVSData) and our original dependent variable (relationship), then running the crosstable() code again.
- total="both" adds row and column marginal totals. You can select "both", "row", "column", or "none".
- percent_digits=1 outputs percentage values to one decimal place. You can choose your level of precision or select no decimals (0).
- percent_pattern="{n} ({p_col})" controls what values are included in the table and how those values are formatted. In this case, we requested the frequency values ({n}) followed by the column percentages in parentheses (({p_col})).
- %>% as_flextable(keep_id=FALSE) converts the default table into a more aesthetically pleasing HTML table. The keep_id=FALSE drops the first column with the variable name; in this case, the id column and the variable label column were redundant. Note: If you want, you could replace this line with a pipe to gt() instead.

Goal: Conduct a binomial hypothesis test
In this next section, you will learn how to answer question 3 on page 192 of B&P’s Chapter 6. While the last question focused on the victim’s sex and relationship to assailant, this question focuses instead on the offender’s sex. Specifically, the question asks you to conduct a binomial hypothesis test to infer whether women are more likely than men to assault someone else in the U.S. population.
To test this hypothesis, we will use the ‘maleoff’ variable, which records victims’ reports of the sex of the offender (0=female offender; 1=male offender), to generate relative frequency/proportion values for male and female offenders. We will then use the binomial distribution to test how probable the observed proportions would be under a null hypothesis of no sex differences in offending (i.e., p=0.50).
You can find a detailed description and example application of the binomial hypothesis test on pages 165-172 of the book. Check out this helpful resource for another description of the binomial test and an alternative way of conducting it in R. Put briefly, here is the essential process:
If we have data on the number of “successes” (‘x’ or the number of assaults by women in this case) in a fixed number of trials (‘n’, total sample size, or the total number of assaults in this case), then we can use the binomial distribution to calculate the joint probability (summarized by a p-value) that the binary event (success or fail; 0 or 1) would occur as or more frequently than we observed given a pre-determined null hypothesis. We then compare our p-value from the binomial test to our pre-specified alpha level (α).
If our test p-value > α then we would conclude there is not enough evidence to reject the null hypothesis; that is, we would fail to reject the null hypothesis.
Alternatively, if our test p-value < α, then we would conclude that it would be highly unlikely to observe a relative frequency/probability (of assault by women) as or more extreme than the one we observed if the null hypothesis were true in the population. Likewise, we would conclude that there is enough evidence to reject the null hypothesis.
So, to conduct a binomial hypothesis test, we need:

- the observed number of "successes," x (here, the number of assaults committed by women);
- the number of trials, n (here, the total number of assaults in the sample);
- a null hypothesis probability of success, p (here, p=0.50, reflecting no sex differences in assault); and
- a pre-specified alpha level (e.g., α=0.05).
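For intuition, here is a minimal sketch of the same logic using base R's binom.test() with hypothetical numbers (the assignment itself uses rstatix::binom_test(), introduced below):

# hypothetical: 40 "successes" in 100 trials, tested against H0: p = 0.50
binom.test(x = 40, n = 100, p = 0.5)

# for a symmetric null (p = 0.5), the exact two-sided p-value equals
# twice the lower-tail binomial probability:
2 * pbinom(40, size = 100, prob = 0.5)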
Now that we know what we need, let’s get to work.
- Type NCVSData %>% frq(maleoff) or NCVSData %>% freq(maleoff) to create a frequency table using either the "sjmisc" or "summarytools" package.
- Note: Get in the habit of typing NCVSData %>% on one line, hitting enter, and typing frq(maleoff) on the next. This will make reading your code much easier in the long run - particularly as you string more commands together and your lines get longer.
- You can find the total number of trials in the freq() output or by adding 4,798 (the number of female offenders) to 19,171 (the number of male offenders) from either table.

# NCVSData %>% frq(maleoff)
NCVSData %>% freq(maleoff)
## Frequencies
## NCVSData$maleoff
## Label: Male offender =1, 0=female offender
## Type: Numeric
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
## 0 4798 20.02 20.02 20.02 20.02
## 1 19171 79.98 100.00 79.98 100.00
## <NA> 0 0.00 100.00
## Total 23969 100.00 100.00 100.00 100.00
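Optionally, you can let R do that addition for you to confirm the total number of trials:

4798 + 19171   # n = 23969 total assaults (female + male offenders)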
- Next, type binom_test(x = 4798, n = 23969, p = .5) on one line.
- Note: binom_test() is the function that runs binomial hypothesis tests in the "rstatix" package. Within this function, we are specifying x = 4798, n = 23969, and p = .5.
- Note: You could also assign the test results to an object using <- (and perhaps call it binomtestfemoff). Then, you can click the object in your environment and see it in a new tab.
- Now, pipe the results to the gt() function.
- Recall, gt() allows us to create beautiful tables that are more accessible and intuitive to read.
- To do this, type %>% (a pipe) after binom_test(x = 4798, n = 23969, p = .5). Then, hit enter and type gt().
- We can also customize tables generated with gt(). You can do all sorts of custom modifications; some examples include: adding titles, subtitles, and note lines; changing column or value labels; and making fonts bold, italicized, or underlined. To give you a sense of how this works, we will add a title and subtitle to our table.
  - Type %>% after gt(), then hit enter
  - Type tab_header( then hit enter
  - Type title = md("**Binomial test of sex differences in assaults using NCVS subset**"), then hit enter
  - Type subtitle = md("*H0:* p=0.50, x(female assaults)=4,798, n=23,969"))
- tab_header() adds/modifies a table header.
- md() specifies our text input as Markdown-formatted text, which allows us to bold the **title** using double asterisks and italicize *H0:* with single asterisks.

# options(scipen=999, digits = 3 )
# options(scipen=10, digits = 3 )
binom_test(x = 4798, n = 23969, p = .5) %>%
gt() %>%
tab_header(
title = md("**Binomial test of sex differences in assaults using NCVS subset**"),
subtitle = md("*H0:* p=0.50; x(female assaults)=4,798; n=23,969"))
**Binomial test of sex differences in assaults using NCVS subset**
*H0:* p=0.50; x(female assaults)=4,798; n=23,969

n | estimate | conf.low | conf.high | p | p.signif |
---|---|---|---|---|---|
23969 | 0.2001752 | 0.1951254 | 0.2052978 | 4.940656e-324 | **** |
8. Compare the test p-value to our alpha level.
   a. When looking at the results, the first thing you might notice is that the p-value is in scientific notation (i.e., p=4.94e-324). This happens when numbers are extremely large or, in this case, extremely small.
      - For those who do not frequently use scientific notation, it may be confusing when the number is so small that it is output in this format. For a brief refresher on scientific notation, check out this website; to learn more about scientific notation and changing its output in R, see here.
      - Recall, a number followed by "10" or "e" to the power of a negative number implies a small number. Here, the number 4.94 followed by 'e to the -324' indicates the probability is very close to zero (specifically, 323 zeroes followed by 494 after the decimal).
      - If you want to see just how many zeros, you can turn off scientific notation by typing the following command right above your binomial test and then running your code chunk again: options(scipen=999, digits = 3). This changes the global options of our R workspace.
      - You can turn scientific notation back on and specify when it should be used by changing the number following scipen=. For instance, typing options(scipen=10, digits = 3) and then running your code again will result in scientific notation in your table once more.
      - If you wish to restore the R default, type and run: options(scipen=0, digits=7).
   b. So, our test p-value is close to zero and is definitely smaller than our pre-specified alpha level of 0.05 (i.e., 4.94e-324 < 0.05).
9. Use the results of the statistical test to draw inferences about the population.
   a. Since our p-value is less than our alpha level (i.e., p-value < α), we would conclude that it would be highly unlikely (p-value = 4.94e-324) to observe a relative frequency/probability of assault by women as or more extreme than the one we observed (i.e., x <= 0.20) if the null hypothesis of no sex differences in assault (i.e., if p=0.50) were true in the population.
   b. Therefore, we would conclude that there is enough evidence to reject the null hypothesis of no sex differences in assault.
10. Finish answering question 3 (p.192) of the book and any remaining quiz questions related to it.
Goal: Calculating Z-score and Creating Histogram for “MurderRt” Variable
From this week’s book chapter, you also learned how to use the z-score formula to convert raw values by hand to z-score values that are standardized to the normal curve. For the last part of this assignment, you will learn how to convert a variable’s raw values into z-scores in R.
We will work with the "MurderRt" variable in the 2012 States Data.
Essentially, we want to visualize the distribution of homicide rates
across states in the US, then make informative standardized comparisons
across states’ homicide rates. To do this, we will first create a
histogram using ggplot()
. Next, we will examine raw murder
rates in individual states and compare these values to the average
murder rate in the US from these data. Finally, we will convert states’
raw homicide rates into z-scores, then examine how Alabama's and Alaska's standardized murder rates compare with the sample average US homicide rate.
- Type StatesData %>%. Hit enter and type ggplot() +. Then, hit enter one last time and type geom_histogram(aes(x=MurderRt)). Answer question 5 on page 192 of B&P.
- Note: You technically do not need to include x= before MurderRt in aes(), as the default setting is to plot the histogram on the x-axis. However, it is good practice to write out default settings anyway; in this case, it helps you build an intuition about what ggplot() is doing, and you can more easily modify it (e.g., by switching to y-axis with y=) as desired.

#start by showing histogram for students (doing part of q5 for them)
StatesData %>%
ggplot() +
geom_histogram(aes(x=MurderRt))
- Next, create a subset of the data containing only the "State" and "MurderRt" variables, using the format datasubset <- data %>% dplyr::select(var1, var2). Try doing it yourself; if you get tripped up, that is fine. We include step-by-step instructions below as usual.
- Type StatesDataSub <- StatesData %>%. At this point, the new StatesDataSub and the original StatesData are identical.
- Hit enter and type select(State, MurderRt). Now, the StatesDataSub data object will contain only these two variables.
- Type StatesDataSub in the R chunk to view your new object. Your RStudio session should look something like this (but with header lines and descriptive text):
- Recall, the z-score formula is z = (x - mean(x))/sd(x), where x represents a vector of raw scores for a continuous variable (e.g., "MurderRt" values), mean(x) equals the sample mean of x, and sd(x) equals the sample standard deviation of x.
- Within mutate(), there are various ways to use the z-score formula to convert our old variable into a new standardized variable. We could calculate the mean and sd, then plug in the formula. We could simply write out the formula as above and plug "MurderRt" in for 'x'. Or, we could use base R's built-in scale() function for converting to z-scores (a short sketch appears at the end of this section). We could even create our own function to standardize variables. Creating functions is especially useful if we think we will need to standardize variables again in the future; it saves us from duplicating our efforts and from potentially making copy-and-paste errors in the process. For examples and more details on the scale() method or creating a function to convert to z-scores, see here and here. We will walk you through a few different ways below.
- First, we will manually type values into mutate() to create a new variable containing the converted z-scores. This is NOT a method that we recommend - typing in values to create variables is a potentially error-prone process. However, this simple approach is useful for seeing how z-scores are calculated using other methods.
- Start by finding the sample mean and standard deviation with summarytools::descr() using the code: StatesDataSub %>% descr(MurderRt).
- Type StatesDataSub <- StatesDataSub %>%, then hit enter and type mutate(.
- Note: Here, we are assigning the StatesDataSub data object into the same object again. We do not want to overwrite the original StatesData dataset, but we can always recreate this subset with our code above if we mess up. Assigning into our previously created object ensures the original dataset remains clean while minimizing unnecessary clutter in our R Environment.
- Hit enter again and type ZMurderRt = (x - mean(x))/sd(x)). Replace x with MurderRt, replace mean(x) with the sample mean, and replace sd(x) with the sample standard deviation. Hit enter once more and type StatesDataSub to view your data object. Note: We added head() to display only the first six rows, which allows you to compare your results without adding long tables to our assignment file.

#select only the variable(s) we need & assign to new df
StatesDataSub <- StatesData %>%
select(State, MurderRt)
StatesDataSub %>%
descr(MurderRt)
## Descriptive Statistics
## StatesDataSub$MurderRt
## Label: Murder Rate per 100K
## N: 50
##
## MurderRt
## ----------------- ----------
## Mean 4.51
## Std.Dev 2.41
## Min 0.90
## Q1 2.60
## Median 4.65
## Q3 6.00
## Max 12.30
## MAD 2.59
## IQR 3.35
## CV 0.53
## Skewness 0.73
## SE.Skewness 0.34
## Kurtosis 0.59
## N.Valid 50.00
## Pct.Valid 100.00
#note mean = 4.51, sd=2.41
#manually convert z-scores
StatesDataSub <- StatesDataSub %>%
mutate(
ZMurderRt = (MurderRt - 4.51)/2.41)
StatesDataSub %>% head() %>% gt()
State | MurderRt | ZMurderRt |
---|---|---|
Alabama | 7.1 | 1.0746888 |
Alaska | 3.2 | -0.5435685 |
Arizona | 5.5 | 0.4107884 |
Arkansas | 6.3 | 0.7427386 |
California | 5.4 | 0.3692946 |
Colorado | 3.2 | -0.5435685 |
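As a quick sanity check on these values (using the rounded mean of 4.51 and SD of 2.41 reported above), you can verify a couple of states by hand:

(7.1 - 4.51) / 2.41   # Alabama: about 1.07 SDs above the mean murder rate
(3.2 - 4.51) / 2.41   # Alaska: about 0.54 SDs below the mean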
- Note: Rather than running a full descriptives function, you could calculate these values directly with mean(StatesData$MurderRt) and sd(StatesData$MurderRt). These functions will calculate the mean and standard deviation without all the other descriptive statistics.
- Second, instead of typing in the values, you can simply replace x with MurderRt in your mutate() formula. We strongly recommend this method over the first approach.
- Type StatesDataSub <- StatesDataSub %>% mutate(ZMurderRt = (x - mean(x))/sd(x)) like before. Replace x with MurderRt.
- Note: We used head() again to display only the first six rows, which allows you to compare your results without adding long tables to our assignment file.

#find z-score for each data value
StatesDataSub <- StatesDataSub %>%
mutate(
ZMurderRt = (MurderRt - mean(MurderRt))/sd(MurderRt)
)
StatesDataSub %>% head() %>% gt()
State | MurderRt | ZMurderRt |
---|---|---|
Alabama | 7.1 | 1.0727213 |
Alaska | 3.2 | -0.5450719 |
Arizona | 5.5 | 0.4090113 |
Arkansas | 6.3 | 0.7408663 |
California | 5.4 | 0.3675294 |
Colorado | 3.2 | -0.5450719 |
# Let's try a function!
funxtoz = function(x){(x-mean(x))/sd(x)}
# Check to make sure it works. Do the values match those from previous method?
funxtoz(StatesDataSub$MurderRt)
## [1] 1.07272133 -0.54507186 0.40901130 0.74086632 0.36752943 -0.54507186
## [7] -0.62803561 0.03567441 0.40901130 0.61642069 -1.12581813 -1.25026376
## [13] 1.61198573 0.32604755 -1.33322751 0.07715629 -0.08877122 3.22977891
## [19] -1.04285438 1.32161259 -0.75248124 0.74086632 -1.25026376 0.98975758
## [25] 0.86531195 -0.54507186 -0.83544500 0.57493881 -1.49915502 -0.33766248
## [31] 2.27569575 -0.21321685 0.36752943 -1.04285438 0.20160192 0.82383007
## [37] -0.91840875 0.36752943 -0.62803561 0.90679382 -0.37914435 1.19716696
## [43] 0.36752943 -1.29174564 -1.33322751 0.07715629 -0.71099937 0.16012004
## [49] -0.79396312 -1.04285438
## attr(,"label")
## [1] "Murder Rate per 100K"
## attr(,"format.spss")
## [1] "F8.2"
# Incorporate function into our mutate() code
StatesDataSub <- StatesDataSub %>%
mutate(
ZMurderRt = funxtoz(StatesDataSub$MurderRt)
)
StatesDataSub %>% head() %>% gt()
State | MurderRt | ZMurderRt |
---|---|---|
Alabama | 7.1 | 1.0727213 |
Alaska | 3.2 | -0.5450719 |
Arizona | 5.5 | 0.4090113 |
Arkansas | 6.3 | 0.7408663 |
California | 5.4 | 0.3675294 |
Colorado | 3.2 | -0.5450719 |
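Finally, as mentioned earlier, base R's scale() offers yet another route to the same z-scores. Here is a minimal sketch (note that scale() returns a matrix, so we wrap it in as.numeric() to keep a plain numeric column); the values should match those from the previous methods:

# use base R's scale() inside mutate() to standardize MurderRt
StatesDataSub <- StatesDataSub %>%
  mutate(
    ZMurderRt = as.numeric(scale(MurderRt))
  )
StatesDataSub %>% head() %>% gt()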
Before moving on, check that you know how to do the following:

- Do you know how to read != as "not equal to"?
- Do you know how to turn scientific notation off or on using options(scipen=999, digits = 3)?
- Do you know how to drop missing observations on a variable using filter(!is.na(var))?
- Do you know how to convert a labelled numeric variable into a factor using haven::as_factor()?
- Do you know how to drop unused factor levels using data %>% droplevels(data$variable)?
- Do you know how to create a cross-tabulation using dplyr::select() & sjPlot::sjtab(depvar, indepvar)?
- Do you know how to add a title to a sjtab() table and switch output from viewer to html browser?
- Do you know how to create a cross-tabulation using crosstable(depvar, by=indepvar)?
- Do you know how to drop NA values from a crosstable() table?
- Do you know how to customize a crosstable() table, and how to output it to an aesthetically pleasing html table?
- Do you know how to customize a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts?
- Do you know how to convert raw scores to z-scores using mutate()?
- Do you know how to write your own function (e.g., our funxtoz() function) - and that doing so is recommended for duplicate tasks to avoid copy-and-paste errors?

If you had trouble with any of these, review the relevant sections of this assignment before moving on.