The purpose of this seventh assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 6 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.
This chapter provided an introduction to probability, including foundational rules of probability and probability distributions. It is likely you have heard the term “probability” before and have some intuitions about what it means. You might be surprised to learn that there are different philosophical views about what probability is and is not, and our position on probability will have important implications for the way we approach statistical description and inference!
Our book, like most undergraduate statistics books, largely presents a frequentist view of probability. It starts by presenting us with a basic frequentist mathematical definition of probability as “the number of times that a specific event can occur relative to the total number of times that any event can occur” (p.152). Keen readers will note that this definition of probability sounds uncannily similar to a relative frequency - that’s because it is! In frequentist statistics, empirical probabilities are calculated as observed relative frequencies.
However, observed (known) relative frequencies - aka empirical probabilities - often are used to do more than simply describe a sample; often, they are used to make inferences about unknown (theoretical) population parameters. Your book describes this long run inferential view of empirical probabilities, or what the authors call the second “sampling” notion of probability, as “the chance of an event occurring over the long run with an infinite number of trials” (p.153). Of course, we cannot actually conduct an infinite number of trials, so we use our known relative frequencies from a sample - aka our known empirical probabilities - to infer what we think would likely happen were we to conduct a very large number of trials. After presenting these frequentist notions of probability, the chapter moves on to explain how we could imagine a theoretical “probability distribution” of outcomes that would emerge from repeated trials of an event, then it describes various types of probability distributions, including binomial, normal, and standard normal distributions.
Recall, descriptive statistics involve describing characteristics of a dataset (e.g., a sample), whereas inferential statistics involves making inferences about a population from a subset of sample data drawn from that population. In addition to probability, this chapter also introduces the basics of null hypothesis significance testing, which is the most common procedure by which social scientists use frequentist empirical probabilities and probability distributions to make inferences about populations from descriptions of sample data. Hence, the materials introduced in this chapter and this assignment, including probability, probability rules, probability distributions, standard normal distributions, and standard scores (z-scores), are essential to understanding future assignments that will focus heavily on conducting and interpreting null hypothesis significance tests.
In the current assignment, you will gain a better understanding of frequentist probability by learning to create cross-tabulations (also called joint frequency contingency tables) and to calculate z-scores. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.
By the end of this assignment, you should know how to:
- recognize != as "not equal to"
- turn scientific notation off or on with options(scipen=999, digits = 3)
- drop missing observations with filter(!is.na(var))
- create a contingency table (cross-tabulation) with dplyr::select() & sjPlot::sjtab(depvar, indepvar)
- add a title to an sjtab() table and switch output from the viewer to an html browser
- modify a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
- convert raw scores into z-scores with mutate()
We are building on objectives from Assignments 1-6. By the start of this assignment, you should already know how to:
- call functions from a package using the package::function() format
- read an SPSS datafile with haven::read_spss() and assign it to an R object using an assignment (<-) operator
- use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
- use the %>% pipe operator to perform a sequence of actions
- use here() for a simple and reproducible self-referential file directory method
- use the head() function to quickly view a snapshot of your data
- use the glimpse() function to quickly view all columns (variables) in your data
- use sjPlot::view_df() to quickly browse variables in a data file
- use attr() to identify variable and attribute value labels
- recognize missing (NA) values for variables in your data file
- select and recode variables with the select(), mutate(), and if_else() functions
- use summarytools::dfSummary() to quickly describe one or more variables in a data file
- create frequency tables with the sjmisc::frq() and summarytools::freq() functions
- calculate the mean and median with mean() and median() (e.g., mean(data$variable))
- generate more detailed descriptive statistics with the summarytools::descr() and psych::describe() functions
- calculate measures of dispersion (e.g., with sjmisc::frq() or summarytools::descr())
- improve the appearance of output tables with gt() (e.g., head(data) %>% gt())
- create graphs with the ggplot() function
- use boxplot() and ggplot() to visualize dispersion in a data distribution
- modify a boxplot (e.g., with geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., "turquoise") or hexadecimal codes (e.g., "#990000" for crimson)
- add a title to a plot with the labs() function (e.g., + labs(title = "My Title"))

If you do not recall how to do these things, review Assignments 1-6.
Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand the basic probability rules and probability distributions introduced in Chapter 6.
As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).
Goal: Read in NCVS and 2012 States Data
(Note: Remember that, when following instructions, you should always substitute your own last name for "LastName" and the actual date for YEAR-MO-DY. E.g., 2023-02-02_Ducate_CRIM5305_Assign06)
In the last assignment, you learned how to use frequency tables to calculate measures of dispersion and boxplots to visualize dispersion in a frequency distribution. In this assignment, you will learn the basics of probability theory and probability distributions, including binomial and normal distributions. You will also learn how to convert a raw score into a standard score or z-score using the standard normal distribution, as well as how such standard scores are used to test null hypotheses with the goal of making population inferences from sample data.
For many, probability is one of the most difficult things to understand in statistics. Some probability calculations are quite intuitive, while others seem to defy our intuitions. As you read Chapter 6 and watched this week’s videos, hopefully you began to understand how probability theory underlies inferential statistics and allows us to make claims about larger populations from sample data. We will continue to work toward understanding inferential statistics over the next few assignments. In this assignment, we will practice calculating (frequentist) probabilities and using z-scores.
- Remember: the here package will automatically set our CRIM5305_L folder as the top-level directory.
- In your setup code chunk, load the packages we will use in this assignment: tidyverse, haven, here, sjmisc, sjPlot, summarytools, rstatix, and gt.
- This may be your first time using one or more of these packages (e.g., rstatix), which means you'll need to download them first using install.packages() or the "Install Packages" button under the "Tools" tab. Then, you can use library() to load them in.
- The rstatix package provides a simple and intuitive framework for performing basic statistical tests, and it is compatible with tidyverse and tidyverse pipe functions. We will use rstatix::binom_test() to conduct a binomial hypothesis test later in the assignment. We will also pipe the tabled output to the gt() function and then modify the table; if you recall, gt() can improve table output aesthetics and permits table customization.
- Read the NCVS lone offender assault datafile ("NCVSLoneOffenderAssaults1992to2013.sav") into an object named NCVSData and the "2012 states data.sav" datafile into an object named StatesData.
- After reading in the data, you should see both new objects in your RStudio Environment: NCVSData and StatesData.
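For reference, a minimal package-loading chunk might look something like this (a sketch; it assumes all of these packages are already installed):

library(tidyverse)
library(haven)
library(here)
library(sjmisc)
library(sjPlot)
library(summarytools)
library(rstatix)
library(gt)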
StatesData <- read_spss(here("Datasets", "2012statesdata.sav"))
NCVSData <- read_spss(here("Datasets", "NCVSLoneOffenderAssaults1992to2013.sav"))
Goal: Understand basic elements of a contingency table
In the next section, we will generate contingency tables - otherwise known as "cross-tabulations" or crosstabs - and use them to calculate (frequentist) marginal, conditional, and joint probabilities.
Contingency tables or cross-tabulations are useful for understanding
the association (or lack thereof) between two or more variables.
Specifically, we will cross-tabulate two variables from the National
Crime Victimization Survey data (NCVSData
): the ordinal
variable relationship
, capturing a victim’s relationship to
their assailant (0=“total stranger” to 3=“well known”), and the binary
variable maleoff
, representing the sex of the offender
(0=female; 1=male). Before we do this together, let’s cover the basic
elements of a contingency table.
To illustrate, we will use two other variables from the NCVS data, privatelocation and reportedtopolice, and start by generating their univariate frequency distributions.
- privatelocation is a dummy variable (i.e., 0 or 1 values) indicating whether the reported victimization occurred in a private location (0=Not a private location; 1=Private location).
- reportedtopolice is a dummy variable indicating whether the victimization incident was reported to the police (0=Unreported; 1=Reported).
- Remember, you can always view the variable and value labels yourself with sjPlot::view_df() (e.g., by typing NCVSData %>% view_df()). Just be sure to comment out this line before you knit.

Below are univariate frequency tables for the privatelocation and reportedtopolice variables in our NCVS data subset. Look at the code used to generate the tables - some of it should look familiar!

NCVSData %>%
filter(!is.na(reportedtopolice)) %>%
freq(privatelocation, report.nas = FALSE) %>%
tb() %>%
gt()
privatelocation | freq | pct | pct_cum |
---|---|---|---|
0 | 17618 | 76.2354 | 76.2354 |
1 | 5492 | 23.7646 | 100.0000 |
NCVSData %>%
filter(!is.na(reportedtopolice)) %>%
freq(reportedtopolice, report.nas = FALSE) %>%
tb() %>%
gt()
reportedtopolice | freq | pct | pct_cum |
---|---|---|---|
0 | 12639 | 54.69061 | 54.69061 |
1 | 10471 | 45.30939 | 100.00000 |
#The following code more efficiently accomplishes the same goal
# NCVSData %>% freq(reportedtopolice, report.nas = FALSE) %>% tb() %>% gt()
- Both frequency tables were generated with summarytools::freq().
- Notice that each code chunk begins with a filter() command, which tells R to keep only the rows where respondents reported a "0" or "1" to the reportedtopolice item and to remove the n=859 rows with missing data on this item. It does so by filtering to keep values that are not NA on the reportedtopolice variable (i.e., filter(!is.na(reportedtopolice))). We do this for comparison purposes, since our basic contingency tables will drop those missing (NA) cases by default.
- Since we are using summarytools::freq(), we could have simply added the report.nas = FALSE option to more efficiently accomplish the same goal in our frequency table. However, we will use this NA filter later in this assignment, so we wanted to introduce you to it here as well.
- Our freq() code also includes the report.nas = FALSE option, which is not entirely redundant - here, it removes the empty NA row from our table.
- Finally, we converted the output with tb() so that we could then pipe the output to our preferred gt() table formatting package.

Next, consider a contingency table of the privatelocation and reportedtopolice variables.
- In this example, privatelocation is the independent variable (IV) and reportedtopolice is the dependent variable (DV). In other words, we hypothesize that whether or not a victimization is reported to police depends at least in part on the location of the victimization.
- Remember: association != causation.
- != is an operator in R that means "not equal to". Recall, we also used ! earlier, prior to is.na, when filtering missing data. While is.na means "is missing", !is.na means "not is missing" (or "not NA"). So in R, if you want to see if two things are the same, you will use ==, which means "equal to", while if you want to see if two things are not the same, you will use !=, which means "not equal to." The use of the exclamation point ! to mean "not" is very common in programming.
- Since privatelocation is the independent variable (IV) here, you will note that it is placed in the columns of the contingency table.
- reportedtopolice is the dependent variable (DV), so it is placed in the rows of the contingency table.
- The cells of the table contain the joint frequencies, while the row and column totals contain the marginal frequencies, which match the univariate frequency distributions above (excluding the missing NA values).
- The column marginal frequencies correspond to the categories of the independent (privatelocation) variable.
- For example, one column total captures all victimizations that occurred in a private location according to the privatelocation variable (i.e., privatelocation = 1).

In addition to joint frequency distributions, it is also common to see percentage values reported in contingency tables. Once again, it is important to be sure the calculated or reported percentages are the percentages that you want when reading and interpreting the table. If we stick with the recommended practice of placing our IV in the columns, then we often want column percentages (as opposed to row percentages). The image above includes column percentages directly beneath the frequencies. With this table setup and with column percentages, we will usually total down and compare across when assessing a relationship between the IV and the DV.
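To make the == and != operators described above more concrete, here is a quick illustrative sketch (the filter() examples reuse the NCVS variables introduced earlier):

# == asks "are these equal?"; != asks "are these NOT equal?"
3 == 3   # TRUE
3 != 3   # FALSE
# keep only rows that are NOT missing on reportedtopolice (as in the tables above)
NCVSData %>% filter(!is.na(reportedtopolice))
# keep only rows where the offender was NOT male (i.e., female offenders)
NCVSData %>% filter(maleoff != 1)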
Goal: Create Cross-Tabulations of sex (V3018) and relationship variables
You will learn to generate contingency tables like those presented above. Though there are many options, the sjtab() function from the "sjPlot" package, which was used to create the tables above, is a tidyverse-friendly option for creating useful and aesthetically pleasing contingency tables (for more details, see here, here, and here).
With the sjtab()
approach, we will also rely on the
select()
function, which we have used before, to select the
specific variables we want to include in our contingency table.
Let's start by using the sjtab() function to generate a
table that will help us answer question 2 (part a [i-iii] & part b
[i-ii]) from the SPSS exercises at the end of Chapter 6 (p.192) of the
book. This question requires us to generate a contingency table of
relationship
(victim’s relationship to assailant) by
V3018
(victim’s sex) to assess whether there are gender
differences in victims’ relationships to their assailants. For instance,
a contingency table might help us determine whether the probability of
being assaulted by a stranger or by a well-known acquaintance is
equivalent or differs for men and women.
Insert an R chunk and type NCVSData %>%. Hit enter and type select(relationship, V3018) %>%. Hit enter again and type sjtab(title = "Cross-Tabulation of Victim Gender and Relationship to Assailant", show.col.prc = TRUE, use.viewer = FALSE).
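If you follow those steps, your completed code chunk should look something like the sketch below (assembled from the pieces above):

NCVSData %>%
  select(relationship, V3018) %>%
  sjtab(title = "Cross-Tabulation of Victim Gender and Relationship to Assailant",
        show.col.prc = TRUE,
        use.viewer = FALSE)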
- The dplyr::select() function allows us to select the variables we want to use from a dataset. When working with just one or two variables from a large dataset, this can be helpful. When working with an entire dataset, we do not need to use select(). For this assignment, we will use it to select the V3018 and relationship variables from our NCVS dataset.
- When using the select() function in combination with sjtab(), the dependent variable should be listed first and the independent variable listed second. This will ensure that the DV is in the rows and the IV is in the columns of the table.
- The sjtab() function allows us to make a contingency table in R that is customizable using tidyverse design philosophy.
- The title = argument allows us to specify a title for the crosstab.
- The show.col.prc = TRUE argument requests the column percentages - i.e., the percentages of male victims (or of female victims) that fall into each of the relationship categories (e.g., % of male victims assaulted by strangers; % of male victims assaulted by casual acquaintances; etc.).
- We also added the use.viewer = FALSE argument, which instructs R to output the resulting table in a pop-up HTML document rather than the built-in viewer pane (i.e., the viewer pane is where you find the view_df() output). Feel free to change that to TRUE if you prefer. Either way, the cross-tabulation table should populate in the final knitted HTML document.
- If you output the table to the viewer pane instead (i.e., use.viewer = TRUE), then your RStudio session should look something like this:

Goal: Conduct a binomial hypothesis test
In this next section, you will learn how to answer question 3 on page 192 of B&P’s Chapter 6. While the last question focused on the victim’s sex and relationship to assailant, this question focuses instead on the offender’s sex. Specifically, the question asks you to conduct a binomial hypothesis test to infer whether women are more likely than men to assault someone else in the U.S. population.
To test this hypothesis, we will use the maleoff
variable, which records victims’ reports of the sex of the offender
(0=female offender; 1=male offender), to generate relative
frequency/proportion values for male and female offenders. We will then
use the binomial distribution to test how probable the observed
proportions would be under a null hypothesis of no sex differences in
offending (i.e., p=0.50).
You can find a detailed description and example application of the binomial hypothesis test on pages 165-172 of the book. Check out this helpful resource for another description of the binomial test and an alternative way of conducting it in R. Put briefly, here is the essential process:
If we have data on the number of “successes” (‘x’ or the number of assaults by women in this case) in a fixed number of trials (‘n’, total sample size, or the total number of assaults in this case), then we can use the binomial distribution to calculate the joint probability (summarized by a p-value) that the binary event (success or fail; 0 or 1) would occur as or more frequently than we observed given a pre-determined null hypothesis. We then compare our p-value from the binomial test to our pre-specified alpha level (\(\alpha\)).
If our test p-value > \(\alpha\) then we would conclude there is not enough evidence to reject the null hypothesis; that is, we would fail to reject the null hypothesis.
Alternatively, if our test p-value < \(\alpha\), then we would conclude that it would be highly unlikely to observe a relative frequency/probability (of assault by women) as or more extreme than the one we observed if the null hypothesis were true in the population. Likewise, we would conclude that there is enough evidence to reject the null hypothesis.
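Before moving on, here is a small optional sketch of that logic using base R's binom.test() (which rstatix::binom_test() wraps) and made-up numbers - say, x = 40 "successes" in n = 100 trials under a null of p = 0.5 - purely for intuition:

# hypothetical numbers for illustration only: 40 successes in 100 trials
x <- 40
n <- 100
# exact two-sided binomial test against the null hypothesis p = 0.5
binom.test(x, n, p = 0.5)$p.value
# because p = 0.5 makes the distribution symmetric, this p-value equals the
# probability of a result as or more extreme in either tail:
pbinom(40, size = n, prob = 0.5) + (1 - pbinom(59, size = n, prob = 0.5))

Both lines should return roughly 0.057, which exceeds a conventional alpha of 0.05, so in that hypothetical case we would fail to reject the null hypothesis.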
So, to conduct a binomial hypothesis test, we need: the observed number of "successes" (x; here, the number of assaults committed by women), the total number of trials (n; here, the total number of assaults), a null hypothesis probability of success (here, p = 0.50), and a pre-specified alpha level (\(\alpha\); e.g., 0.05).
Now that we know what we need, let’s get to work.
- Insert an R chunk and type NCVSData %>% frq(maleoff) or NCVSData %>% freq(maleoff) to create a frequency table using either the sjmisc or summarytools package.
- Note: Recommended practice is to type NCVSData %>% on one line, hit enter, and type frq(maleoff) on the next. This will make reading your code much easier in the long run - particularly as you string more commands together and your lines get longer.
- For the binomial test, you will need the total number of assault incidents (n). You can find it at the bottom of the freq() output or by adding 4,798 (the number of incidents with a female offender) to 19,171 (the number with a male offender) from either table. frq() will also provide the total N at the top of the output.

# NCVSData %>% frq(maleoff)
NCVSData %>% freq(maleoff)
## Frequencies
## NCVSData$maleoff
## Label: Male offender =1, 0=female offender
## Type: Numeric
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
## 0 4798 20.02 20.02 20.02 20.02
## 1 19171 79.98 100.00 79.98 100.00
## <NA> 0 0.00 100.00
## Total 23969 100.00 100.00 100.00 100.00
- Next, we will use the rstatix package we installed to conduct the hypothesis test. Insert an R chunk and type binom_test(x = 4798, n = 23969, p = .5) on one line.
- Note: binom_test() is the function that runs binomial hypothesis tests in the rstatix package. Within this function, we are specifying x = 4798, n = 23969, and p = 0.5.
- If you wish, you can also assign the test output to an object with <- (and perhaps call it binomtestfemoff). Then, you can click the object in your environment and see it in a new tab.
- Next, pipe the results to the gt() function. Recall, gt() allows us to create beautiful tables that are more accessible and intuitive to read. To do this, add %>% (a pipe) after binom_test(x = 4798, n = 23969, p = .5). Then, hit enter and type gt().
- We can also modify the table after piping it to gt(). You can do all sorts of custom modifications; some examples include: adding titles, subtitles, and note lines; changing column or value labels; and making fonts bold, italicized, or underlined. To give you a sense of how this works, we will add a title and subtitle to our table.
    - Type %>% after gt(), then hit enter.
    - Type tab_header( then hit enter.
    - Type title = md("**Binomial test of sex differences in assaults using NCVS subset**"), then hit enter.
    - Type subtitle = md("*H0:* p=0.50, x(female assaults)=4,798, n=23,969"))
- Note: tab_header() adds or modifies a table header, and md() specifies our text input as Markdown-formatted text, which allows us to bold the **title** using double asterisks and italicize *H0:* with single asterisks.

# options(scipen=999, digits = 3 )
# options(scipen=10, digits = 3 )
binom_test(x = 4798, n = 23969, p = .5) %>%
gt() %>%
tab_header(
title = md("**Binomial test of sex differences in assaults using NCVS subset**"),
subtitle = md("*H0:* p=0.50; x(female assaults)=4,798; n=23,969"))
Binomial test of sex differences in assaults using NCVS subset
H0: p=0.50; x(female assaults)=4,798; n=23,969
n | estimate | conf.low | conf.high | p | p.signif |
---|---|---|---|---|---|
23969 | 0.2001752 | 0.1951254 | 0.2052978 | 4.940656e-324 | **** |
Compare the test p-value to our alpha level
When looking at the results, the first thing you might notice is that the p-value is in scientific notation (i.e., p=4.94e-324). This happens when numbers are extremely large or, in this case, extremely small.
For those who do not frequently use scientific notation, it may be confusing when the number is so small that it is output in this format. For a brief refresher on scientific notation, check out this website; to learn more about scientific notation and changing its output in R, see here.
Recall, a number multiplied by 10 (or "e") raised to a negative power is a very small number. Here, 4.94 followed by "e-324" indicates the probability is very close to zero (specifically, a decimal point followed by 323 zeroes and then 494).
If you want to see just how many zeros, you can turn off
scientific notation by typing the following command right above your
binomial test and then running your code chunk again:
options(scipen=999, digits = 3)
. This changes the global
options of our R workspace.
You can turn scientific notation back on and specify when it
should be used by changing the number following scipen=
.
For instance, typing options(scipen=10, digits = 3)
and
then running your code again will result in scientific notation in your
table once more.
If you wish to restore the R default, type and run:
options(scipen=0, digits=7)
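For a quick illustration of how these options change what R prints (using an arbitrary small number, not one from our data):

# with the default settings, very small numbers print in scientific notation
options(scipen = 0, digits = 7)
0.0000001234   # prints as 1.234e-07
# penalizing scientific notation forces fixed (decimal) notation instead
options(scipen = 999, digits = 3)
0.0000001234   # prints as 0.000000123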
So, our test p-value is close to zero and is definitely smaller than our pre-specified alpha level of 0.05 (i.e., 4.94e-324 < 0.05).
Use results of statistical test to draw inferences about population
Since our p-value is less than our alpha level (i.e., p-value < \(\alpha\)), then we would conclude that it would be highly unlikely (p-value = 4.94e-324) to observe a relative frequency/probability of assault by women as or more extreme than the one we observed (i.e., x <= 0.20) if the null hypothesis of no sex differences in assault (i.e., if p=0.50) were true in the population.
Therefore, we would conclude that there is enough evidence to reject the null hypothesis of no sex differences in assault.
Finish answering question 3 (p.192) of the book and any remaining quiz questions related to it.
Goal: Calculating Z-score and Creating Histogram for MurderRt Variable
From this week’s book chapter, you also learned how to use the z-score formula to convert raw values by hand to z-score values that are standardized to the normal curve. For the last part of this assignment, you will learn how to convert a variable’s raw values into z-scores in R.
We will work with the MurderRt variable in the 2012 States Data. Essentially, we want to visualize the distribution of homicide rates across states in the US, then make informative standardized comparisons across states' homicide rates. To do this, we will first create a histogram using ggplot(). Next, we will examine raw murder rates in individual states and compare these values to the average murder rate in the US from these data. Finally, we will convert states' raw homicide rates into z-scores, then examine how Alabama's and Alaska's standardized murder rates compare with the sample-average US homicide rate.
- Create a new header for this part of the assignment (e.g., "Histogram for MurderRt Variable").
- Next, create a histogram of the MurderRt variable. Insert an R chunk and type StatesData %>%. Hit enter and type ggplot() +. Then, hit enter one last time and type geom_histogram(aes(x=MurderRt)). Answer question 5 on page 192 of B&P.
- Note: Technically, you do not need to include x= before MurderRt in aes(), as the default setting is to plot the histogram on the x-axis. However, it is good practice to write out default settings anyway; in this case, it helps you build an intuition about what ggplot() is doing, and you can more easily modify it (e.g., by switching to the y-axis with y=) as desired.

#start by showing histogram for students (doing part of q5 for them)
StatesData %>%
ggplot() +
geom_histogram(aes(x=MurderRt))
Next, you will convert the raw MurderRt variable values into z-score standardized values. Recall from this week's readings and other course materials that the z-score allows us to make standardized comparisons about how close or far away a particular value (e.g., a state's homicide rate) is from the mean of the data series (e.g., the sample mean homicide rate across states). For example, we can see how close the homicide rates of states like Texas or California are to the sample-average US homicide rate.
- Create a new header for this part of the assignment (e.g., "Convert MurderRt Values to Z-scores").
- First, select only the two variables we need - State & MurderRt - and assign only these two variables (columns) into a new data object. You have done this before, using code in the format datasubset <- data %>% dplyr::select(var1, var2). Try doing it yourself; if you get tripped up, that is fine. We include step-by-step instructions below as usual.
- Insert an R chunk and type StatesDataSub <- StatesData %>%. (Note: Until we add the select() line below, StatesDataSub and StatesData are identical.)
- Hit enter and type select(State, MurderRt). This line keeps only the State and MurderRt variables; after this line, the StatesDataSub data object will contain only these two variables.
- To confirm that your new data object contains only the State and MurderRt variables, type StatesDataSub in the R chunk. Your RStudio session should look something like this (but with header lines and descriptive text):

Now you are ready to convert the MurderRt values to z-scores. Doing so will allow you to answer question 7 on page 192 of the book and complete the assignment. Recall from this week's readings/materials that the formula for z-scores is z = (x - mean(x))/sd(x), where x represents a vector of raw scores for a continuous variable (e.g., MurderRt values), mean(x) equals the sample mean of x, and sd(x) equals the sample standard deviation of x.

Using mutate(), there are various ways to apply the z-score formula to convert our old variable into a new standardized variable. We could calculate the mean and sd, then plug them into the formula. We could simply write out the formula as above and plug MurderRt in for 'x'. Or, we could use base R's built-in scale() function for converting z-scores. We could even create our own function to standardize variables. Creating functions is especially useful if we think we will need to standardize variables again in the future; it saves us from duplicating our efforts and from potentially making copy-and-paste errors in the process. For examples and more details on the scale() method or creating a function to convert to z-scores, see here and here.
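For instance, if we expected to standardize variables repeatedly, we could write a small helper function like the sketch below (the name make_z is ours and purely illustrative, not from the book). As a quick arithmetic check of the formula itself: using the sample statistics reported later in this assignment (mean ≈ 4.51, sd ≈ 2.41), Alabama's murder rate of 7.1 converts to z ≈ (7.1 - 4.51)/2.41 ≈ 1.07.

# hypothetical helper: convert any numeric vector to z-scores
make_z <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
# it could then be reused inside mutate(), e.g.:
# StatesDataSub %>% mutate(ZMurderRt = make_z(MurderRt))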
We will walk you through a few different ways below.
- First, we will manually calculate the z-scores: we will find the mean and standard deviation of MurderRt, then use these values with mutate() to create a new variable containing the converted z-scores. This is NOT a method that we recommend - typing in values to create variables is a potentially error-prone process. However, this simple approach is useful for seeing how the z-scores are calculated by the other methods.
- To do this, we first need the mean and standard deviation of MurderRt. Insert an R chunk and generate descriptive statistics for the MurderRt variable with summarytools::descr() using the code: StatesDataSub %>% descr(MurderRt).
- Next, insert another R chunk and type StatesDataSub <- StatesDataSub %>%, then hit enter and type mutate(.
- Note: Here we are assigning the modified StatesDataSub data object into the same object again. We do not want to overwrite the original StatesData dataset, but we can always recreate this subset with our code above if we mess up. Assigning into our previously created object ensures the original dataset remains clean while minimizing unnecessary clutter in our R Environment.
- Hit enter again and type ZMurderRt = (x - mean(x))/sd(x)). Replace x with MurderRt, replace mean(x) with the sample mean, and replace sd(x) with the sample standard deviation. Hit enter once more and type StatesDataSub to view your data object.
- Note: We added head() to display only the first six rows, which allows you to compare your results without adding long tables to our assignment file. If you want to see the entire output, remove the head() function: StatesDataSub %>% gt()
#select only the variable(s) we need & assign to new df
StatesDataSub <- StatesData %>%
select(State, MurderRt)
StatesDataSub %>%
descr(MurderRt)
## Descriptive Statistics
## StatesDataSub$MurderRt
## Label: Murder Rate per 100K
## N: 50
##
## MurderRt
## ----------------- ----------
## Mean 4.51
## Std.Dev 2.41
## Min 0.90
## Q1 2.60
## Median 4.65
## Q3 6.00
## Max 12.30
## MAD 2.59
## IQR 3.35
## CV 0.53
## Skewness 0.73
## SE.Skewness 0.34
## Kurtosis 0.59
## N.Valid 50.00
## Pct.Valid 100.00
#note mean = 4.51, sd=2.41
#manually covert z-scores
StatesDataSub <- StatesDataSub %>%
mutate(
ZMurderRt = (MurderRt - 4.51)/2.41)
StatesDataSub %>% head() %>% gt()
State | MurderRt | ZMurderRt |
---|---|---|
Alabama | 7.1 | 1.0746888 |
Alaska | 3.2 | -0.5435685 |
Arizona | 5.5 | 0.4107884 |
Arkansas | 6.3 | 0.7427386 |
California | 5.4 | 0.3692946 |
Colorado | 3.2 | -0.5435685 |
- Note: Instead of descr(), you could also obtain these values by typing mean(StatesData$MurderRt) and sd(StatesData$MurderRt). These functions will calculate the mean and standard deviation without all the other descriptive statistics.
- Also note that the hand-typed z-scores above differ slightly from those produced by the methods below, simply because we plugged rounded values (mean = 4.51; sd = 2.41) into the formula.
- The second approach is to write out the z-score formula inside mutate(), replacing x with MurderRt in your mutate() formula. We strongly recommend this method over the first approach.
- Insert an R chunk and type StatesDataSub <- StatesDataSub %>% mutate(ZMurderRt = (x - mean(x))/sd(x)) like before. Replace x with MurderRt.
- Note: We used head() again to display only the first six rows, which allows you to compare your results without adding long tables to our assignment file.

#find z-score for each data value
StatesDataSub <- StatesDataSub %>%
mutate(
ZMurderRt = (MurderRt - mean(MurderRt))/sd(MurderRt)
)
StatesDataSub %>% head() %>% gt()
State | MurderRt | ZMurderRt |
---|---|---|
Alabama | 7.1 | 1.0727213 |
Alaska | 3.2 | -0.5450719 |
Arizona | 5.5 | 0.4090113 |
Arkansas | 6.3 | 0.7408663 |
California | 5.4 | 0.3675294 |
Colorado | 3.2 | -0.5450719 |
- The third approach is to have base R's built-in scale() function calculate the z-scores for us. We can use similar code as above but without writing out the z-score formula ourselves. Note that it produces identical results to our table above.

#find z-score for each data value
StatesDataSub <- StatesDataSub %>%
mutate(
ZMurderRt = scale(MurderRt)
)
StatesDataSub %>% head() %>% gt()
State | MurderRt | ZMurderRt |
---|---|---|
Alabama | 7.1 | 1.0727213 |
Alaska | 3.2 | -0.5450719 |
Arizona | 5.5 | 0.4090113 |
Arkansas | 6.3 | 0.7408663 |
California | 5.4 | 0.3675294 |
Colorado | 3.2 | -0.5450719 |
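One small caveat we will add (it is not mentioned in the book): scale() technically returns a one-column matrix rather than a plain numeric vector, so the new ZMurderRt column is stored as a matrix column. The values are the same, but if that ever causes trouble downstream, you can wrap the call in as.numeric(), as in this optional sketch:

StatesDataSub <- StatesDataSub %>%
  mutate(
    ZMurderRt = as.numeric(scale(MurderRt))
  )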
Finally, check that you know how to do the following before moving on:
- Do you recognize != as "not equal to"?
- Do you know how to turn scientific notation off (or back on) with options(scipen=999, digits = 3)?
- Do you know how to drop missing observations with filter(!is.na(var))?
- Do you know how to create a contingency table with dplyr::select() & sjPlot::sjtab(depvar, indepvar)?
- Do you know how to add a title to an sjtab() table and switch output from the viewer to an html browser?
- Do you know how to modify a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts?
- Do you know how to convert raw scores into z-scores with mutate()?