The purpose of this tenth assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 10 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.
In the last assignment, you learned how to make statistical inferences from a contingency table by conducting and appropriately interpreting chi-squared tests of independence. You also learned how to estimate some measures of the magnitude of an association in a crosstab, such as Cramer’s V or the phi-coefficient. In this assignment, you will learn how to find the difference between two independent sample means in R and conduct an independent sample t-test of equality in population means.
Specifically, in this assignment, you will learn:

- How to create inline R code using `` `r ` ``, which can improve reproducibility and accuracy of reporting results by helping avoid typos or copy-and-paste errors and by auto-updating results
- How to add LaTeX-style equations to R Markdown text, using a single `$` for inline text equations or two `$$` to offset equations on a new line
- How to filter data using `datawizard::data_filter()`
- How to recode a numeric variable into a factor using `mutate(newvar = as.factor(oldvar))`
- How to conduct an independent samples t-test using `t.test()` and interpret the results
- How to conduct Levene's test for equality of variances using `car::leveneTest()`
- How to use `geom_violindot()` to visualize group distributions and sample sizes simultaneously
- How to use `aes(fill=groupvar)` to fill plots with different colors for each category of a grouping variable
- How to manually specify fill colors using `scale_fill_manual()`
- How to add horizontal or vertical lines to a plot using `geom_hline()` or `geom_vline()`
We are building on objectives from Assignments 1-9. By the start of this assignment, you should already know how to:

- Call functions from specific packages using the `package::function()` format
- Read SPSS datafiles with `haven::read_spss()` and assign them to an R object using an assignment (`<-`) operator
- Use the `$` symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format `dataobject$varname`
- Use the `%>%` pipe operator to perform a sequence of actions
- Recognize `!=` as "not equal to"
- Turn off scientific notation and limit reported digits with `options(scipen=999, digits = 3)`
- Create a list of objects with `listname <- c(item1, item2)`
- Use `lapply()` to create a list of objects, which can help you avoid cluttering the R Environment with objects
- Create your own function (e.g., a `funxtoz()` function) - and recognize that doing so is recommended for duplicate tasks to avoid copy-and-paste errors
- Recognize that `round()` can be used to specify the number of decimals on numeric values
- Use `here()` for a simple and reproducible self-referential file directory method
- Use `groundhog.library()` as an optional but recommended reproducible alternative to `library()` for loading packages
- Use `set.seed()` to make results that rely on random number generation reproducible
- Recognize that R contains built-in example datasets, such as `mtcars`, that are universally available to R users
- Use the `head()` function to quickly view the first few rows of data
- Use the `tail()` function to quickly view the last few rows of data
- Use the `glimpse()` function to quickly view all columns (variables) in your data
- Use `sjPlot::view_df()` to quickly browse variables in a data file
- Use `attr()` to identify variable and attribute value labels
- Recognize missing (`NA`) values for variables in your data file and drop them with `filter(!is.na(var))`
- Convert labelled variables into factors with `haven::as_factor()`
- Drop an unused factor level with `data %>% droplevels(data$variable)`
- Select and recode variables using the `select()`, `mutate()`, and `if_else()` functions
- Create new variables with `mutate()`
- Draw random samples from data with `dplyr::sample_n()`
- Filter data with `dplyr::filter()` and the `%in%` operator
- Simulate data from probability distributions with `rnorm()`, `truncnorm::rtruncnorm()`, or `runif()`
- Draw random samples from data with `dplyr::slice_sample()` or repeated samples with `infer::rep_slice_sample()`
- Create a small dataframe row-by-row with `tibble::tribble()`
- Use `summarytools::dfsummary()` to quickly describe one or more variables in a data file
- Generate frequency tables with the `sjmisc::frq()` and `summarytools::freq()` functions
- Calculate measures of central tendency with `mean()` and `median()` (e.g., `mean(data$variable)`)
- Generate descriptive statistics with the `summarytools::descr()` and `psych::describe()` functions
- Calculate other descriptive statistics as needed (e.g., with `sjmisc::frq()` or `summarytools::descr()`)
- Create cross-tabulations with `dplyr::select()` & `sjPlot::sjtab(depvar, indepvar)` or with `crosstable(depvar, by=indepvar)`
- Convert data objects into formatted tables with `gt()` (e.g., `head(data) %>% gt()`)
- Modify a `gt()` table, such as by adding titles/subtitles with Markdown-formatted (e.g., `**bold**` or `*italicized*`) fonts
- Recognize that a `gt()` table can be modified to add or remove the decimals in specific columns or rows with `fmt_number()`
- Create plots with the `ggplot()` function
- Use `boxplot()` and `ggplot()` to visualize dispersion in a data distribution
- Modify colors in a ggplot geometric object (e.g., `geom_boxplot()`) by adding `fill=` and `color=` followed by specific color names (e.g., "orange") or hexadecimal codes (e.g., "#990000" for crimson; "#EDEBEB" for cream)
- Add a preset theme (e.g., `+ theme_minimal()`) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
- Generate color palettes (e.g., with `viridisLite::viridis()`) and specify them for the outline and fill colors in a ggplot geometric object (e.g., `geom_boxplot()`)
- Add plot labels with the `labs()` function (e.g., `+ labs(title = "My Title")`)
- Combine multiple `ggplot()` plots into a single figure using the "patchwork" package, including `patchwork::wrap_plots()` to quickly combine ggplot objects contained in a list
- Plot means and intervals with `ggplot() + geom_point() + geom_errorbar()`
- Add lines (e.g., `+ geom_vline()`), arrows (`+ geom_segment()`), or text (`+ annotate()`) elements to a `ggplot()` object
- Conduct a binomial hypothesis test with `rstatix::binom_test()`
- Calculate critical t or z values with `qt()` or `qnorm()`
- Conduct a one-sample t-test with the `t.test()` function or with `infer::t_test()`
- Use `infer::visualize()` to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis
- Conduct a chi-squared test of independence with `sjPlot::sjtab()` or `chisq.test()` and interpret the results
- Generate measures of association in a crosstab (e.g., with `statistics=` using `sjPlot::sjtab()`) and interpret them appropriately
- Assign the output of `chisq.test()` to an object (e.g., `chisq`) and then call specific elements from that object (e.g., `chisq$observed` or `chisq$expected`)

If you do not recall how to do these things, review Assignments 1-9.
Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand the chapter material on testing for differences between two group means or proportions (e.g., independent versus dependent samples; pooled versus separate variance estimates).
As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).
Goal: Visualize sampling distribution of sample mean differences, sampling variability, and standard errors; understand statistical power of a two-sample t-test
This week’s book chapter focused on null hypothesis tests to make inferences about the equality of two group means or proportions using sample data. The chapter discussed in detail tests involving differences between independent and dependent, or matched, samples as well as the appropriate time to use pooled or separate variance estimates.
In this assignment, you will learn how to make an inference about the (in)equality of two population group means by conducting an independent sample t-test and then determining whether to reject or fail to reject the null hypothesis of no difference in population group means. We will also briefly introduce Levene’s test as a method for determining whether one should assume equal or unequal group variances in the population, and we will introduce the half-violin/half-dotplot as a way to visualize variable distributions and potential outliers for two (or more) groups.
Before jumping into conducting independent samples t-tests and plotting in R, we want to be sure you understand the mechanisms that allow us to apply these statistical tests to samples and then make valid inferences about population group mean differences. So, we will first use simulation and plotting to help you visualize the sampling distribution of sample mean differences. While your book discusses this in detail, we are reviewing the topic here as well because it is essential to understanding the inferential logic behind the independent samples t-test of equality in means.
Let’s start by using rnorm()
to draw two random samples
from independent “population groups”, each represented by a distinct
normal probability distribution. We will imagine that the first
“population group” of scores is normally distributed with a mean=20 and
sd=5. In comparison, we will imagine the second “population group” of
scores is also normally distributed, with a different mean=24 but equal
variance (sd=5).
It may help to think back to our “car miles per gallon (mpg)” example
from Assignment 8. In that example, we first simulated a population of
cars with normally distributed mpg values (mean=22, sd=6), then
repeatedly drew random samples from that simulated population, plotted
the sample mpg distributions, then illustrated the sampling distribution
of the sample mean (for different sample sizes) by plotting all the
sample means. We then conducted a one-sample t-test that
compared the sample mean mpg from the mtcars
data with our
(null) hypothesized population mean mpg (\(\mu=22\)).
Likewise, in our current example, it may be helpful to imagine that our two simulated populations are independent groups of car mpg values - say, two population groups of cars manufactured in different years. Since we are simulating the populations, we know that the two population group means differ by an average of 4 units (\(\mu_2-\mu_1 = 24-20 = 4\)), or by an average of 4 mpg.
However, when we draw random samples from independent population groups, we rarely know the true population parameters, such as the population group means (\(\mu_1\) or \(\mu_2\)) or the population group mean difference (\(\mu_2-\mu_1\)). Hence, we may use descriptive statistics from our samples as estimates of population parameters, then rely on inferential statistics to draw conclusions about the populations from which the samples were drawn. Here, we will illustrate this process by drawing samples from two independent (known) population groups and then conducting independent samples t-tests to make inferences about the populations from which our samples were drawn.
We will begin with relatively small samples of n=100 random draws from each simulated population group. Below we plot histograms and overlay normal curves for our two independent random samples of n=100 values each.
# https://stackoverflow.com/questions/21008884/simlulating-a-t-test-in-r
# https://www.guru99.com/r-apply-sapply-tapply.html
# https://stackoverflow.com/questions/6967664/ggplot2-histogram-with-normal-curve
#Illustrate sample mean differences for two independent population groups
nvals <- 100
set.seed(1138)
#sim two columns (groups) of rnorm values with "nval" size
grpA <- rnorm(n=nvals, mean=20, sd=5) %>% as_tibble_col(column_name = "A")
grpB <- rnorm(n=nvals, mean=24, sd=5) %>% as_tibble_col(column_name = "B")
#column bind & transform to long form for graphing
data <- cbind(grpA, grpB) %>%
pivot_longer(
cols = c(`A`, `B`),
names_to = "Group",
values_to = "Simval"
)
#Create dnorm x/sd values in data to draw w/geom_point()
# adds gganimate transitions on curves, which do not work w/stat_function
xA <- as_tibble_col(seq(min(grpA$A), max(grpA$A), length.out = nvals), column_name = "A")
xB <- as_tibble_col(seq(min(grpB$B), max(grpB$B), length.out = nvals), column_name = "B")
dnormx <- cbind(xA, xB) %>%
pivot_longer(
cols = c(`A`, `B`),
names_to = "GroupX",
values_to = "xval"
)
sdA <- as_tibble_col(dnorm(seq(min(grpA$A), max(grpA$A), length.out = nvals),
mean(grpA$A), sd(grpA$A)), column_name = "A")
sdB <- as_tibble_col(dnorm(seq(min(grpB$B), max(grpB$B), length.out = nvals),
mean(grpB$B), sd(grpB$B)), column_name = "B")
dnormsd <- cbind(sdA, sdB) %>%
pivot_longer(
cols = c(`A`, `B`),
names_to = "GroupSD",
values_to = "sdval"
)
#merge (column bind) data & drop redundant Group cols
data <- cbind(data, dnormx, dnormsd) %>% select(c(-GroupX, -GroupSD))
#plot base figures
twosampfig <- data %>% ggplot() +
geom_histogram(aes(x = Simval, y =..density.., fill=Group), alpha=0.3,
binwidth=2, position="identity", breaks = seq(0, 45, by = 2)) +
scale_fill_manual(values=c('#365c8d','#4ac16d')) +
geom_point(aes(x = xval, y = sdval, color=Group), size = 2, alpha = 0.5) +
scale_color_manual(values=c('#365c8d','#4ac16d')) +
labs(x= "Two Samples (n=100 each) from Independent Population Groups") +
theme_minimal() +
coord_cartesian(expand = FALSE)
# NOTE: this code will not work properly with groundhog - need to raise issue w/devs
# To create and save the animated gif file rather than simply load image (next chunk):
# 1. load gganimate & transformer library in library chunk above
# (remove comment # *before* groundhog code)
# 2. change chunk options here from eval=FALSE to eval=TRUE
#add plot animations
twosampgif <- twosampfig +
transition_states(Group,
transition_length = 2,
state_length = 1) +
ease_aes() +
enter_fade() +
exit_fade()
anim_save(here("Images","twosampgif.gif"), twosampgif)
The figure below highlights the observed sample difference in group means, which is represented by the red double-sided arrow in the space between the two vertical lines drawn at each of the two sample group mean values. The sample difference in group means from these two independent random samples is 3.63.
Note: It may not be obvious to you from our knitted RMD
assignment file, but we are actually using inline R code to call desired
output values in the text, such as the sample difference in means of
3.63 that we reported above. To do this, we initiate an inline R code
chunk by typing `r `
followed by our code in the text to
call the result that we want. For instance, to report the sample mean
difference above, after initiating the R code, we typed
round(mean(grpB$B) - mean(grpA$A), digits=2)
.
Using inline R code to report results is a preferred practice over simply typing or copy-and-pasting output values for reproducibility purposes. First, it helps us avoid all-too-common typos or copy-and-paste errors. Second, if we change our code at all in a way that affects our results (e.g., different data; variable selections; variable recoding; modeling decisions), then we may forget to retype or repaste the new results, but our inline R code will automatically update with the new values!
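For example, the sentence reporting the mean difference above looks something like this in our R Markdown source (a sketch, assuming the `grpA` and `grpB` objects created in the simulation chunk above):

The sample difference in group means is `r round(mean(grpB$B) - mean(grpA$A), digits=2)`.

When knitted, the inline chunk is replaced by the computed value (here, 3.63).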
twosampfig +
geom_vline(xintercept = mean(grpA$A),
color = "#365c8d", size=1.2) +
geom_vline(xintercept = mean(grpB$B),
color = "#4ac16d", size=1.2) +
geom_segment(x = mean(grpA$A), y = .13, xend = mean(grpB$B), yend = .13,
color="#bc3754", size=1.5,
arrow = arrow(length = unit(0.025, "npc"), ends = "both")) +
annotate("text", x=34, y=.125, label="Difference in group means = 3.63\n(sample #1)",
angle=0, color="#bc3754")
Remember, social scientists often are less interested in observed sample group differences and more interested in making inferences about the underlying population group differences that generated those sample differences we observe. In this case, we know that the true population difference in group means is 24 - 20 = 4. We also know that our observed sample group difference of 3.63 is slightly off the mark, specifically by an error (observed - expected) of 3.63 - 4 = -0.37. Since we randomly selected data from each population and we did not add any other source of bias in our simulation, we also know that this error of -0.37 between the observed difference in sample means and the expected or true difference in population means is due to sampling variability - random “chance” variations in statistical estimates that vary from sample to sample.
By now, we also know that we can reduce sampling variability by drawing larger random samples from each population. We will return to this idea later. Moreover, based on our understanding of the central limit theorem, we expect that repeated sample draws from our two populations will result in an array of observed differences in sample means that are normally distributed and that cluster around the true difference in population means (=4). We illustrate this in the figure below.
#Illustrate sampling distribution of differences in sample means, two indep pop grps
nsamps <- 1000
nvals <- 100
set.seed(1138)
#sim 1000 columns (samples) of rnorm values with "nval" size for each group
grpC <- replicate(nsamps, rnorm(n=nvals, mean=20, sd=5), simplify=TRUE)
grpD <- replicate(nsamps, rnorm(n=nvals, mean=24, sd=5), simplify=TRUE)
#calculate difference in sample means for all 1000 samples
mudiffDC <- as_tibble(colMeans(grpD)-colMeans(grpC))
#plot difference in sample means for all 1000 samples
sampdiffs <- mudiffDC %>% ggplot(aes(x=value)) +
geom_dotplot(method="histodot", color="#21918c", fill="#5ec962", binwidth=.105,
dotsize=0.4) +
labs(x= "Observed Sample Differences in Group Means\nN=1,000 Samples (Group n=100)") +
stat_function(fun = dnorm, args = list(mean = mean(mudiffDC$value), sd = sd(mudiffDC$value)),
color='#443983', size=1.1, alpha=0.7) +
geom_point(aes(x=3.57, y=0.005), color="#bc3754", fill="#bc3754", size=2.5) +
geom_segment(x = 2.8, y = .40, xend = 3.53, yend = .025,
color="#bc3754", size=1.5,
arrow = arrow(length = unit(0.025, "npc"))) +
annotate("text", x=2.5, y=.45, label="Difference in group means\nobserved in our first sample\n(sample #1 diff. = 3.63)",
angle=0, color="#bc3754") +
geom_vline(xintercept = mean(mudiffDC$value), linetype="dashed", color = "#443983",
size=1.2, alpha=0.7) +
theme_minimal() +
coord_cartesian(expand = FALSE, ylim=c(0,.6)) +
scale_y_continuous(NULL, breaks = NULL) #y-axis values not meaningful so hide
sampdiffs
To create this plot, we drew N=1,000 random samples like the Group A
sample above, each with n=100 values, as well as N=1,000 samples like
the Group B sample above, each with n=100 values as well. Next, for each
pair of Group A and Group B samples of n=100 each, we calculated the
difference in group means (mean[GroupB] - mean[GroupA]). Finally, we
plotted these N=1,000 differences in observed sample means between
Groups A and B using ggplot’s geom_dotplot()
. The plot,
then, helps us visualize the sampling distribution of sample
mean differences between Groups A and B (see pp.305-7 in your
book).
As you can see, the differences in independent sample means are approximately normally distributed and cluster around the true difference in population means (with rounding, the average difference = 4.01).
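You can verify this directly from the simulated values (a sketch, assuming the `mudiffDC` object created in the chunk above):

#mean of the N=1,000 simulated differences in sample means
round(mean(mudiffDC$value), digits=2)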
This is an important point to understand about sampling distributions - even with relatively small samples (n=100 each), the mean of the sampling distribution will cluster around the true population mean. However, the sampling distribution will be more variable (i.e., more spread out) with smaller samples, and it will be less variable (i.e., narrower) with larger samples. One implication is that, when we draw small samples (e.g., n=100), there is a relatively high chance of estimating a group mean difference that is quite a ways off (e.g., <3 or >5) from the true population difference (=4). Put differently, our average error in estimating the true population difference will be larger when we draw small samples, and it will be smaller when we draw large samples.
Instead of plotting observed differences in sample means, since we know the true difference in population means, we could illustrate this point by instead plotting our error in estimation due to sampling variability. To do this, we would simply subtract the true population mean difference (=4) from each sample difference in means that we calculated.
Like the sampling distribution of sample mean differences, we expect the distribution of sampling errors (like the one we observed: 3.63 - 4 = -0.37) will also be normally distributed. However, this distribution will consist of positive sampling errors (overestimates) and negative sampling errors (underestimates) that cluster around a mean error of 0.
#Illustrate sampling distribution of errors in estimating diffs in sample means
set.seed(1138)
#calculate sampling error for all 1000 samples
mudiffDC <- mudiffDC %>% mutate(
error = value - 4 # observed diff in sample means (value) minus true population diff (4)
)
#plot sampling errors for all 1000 samples
samperror <- mudiffDC %>% ggplot(aes(x=error)) +
geom_dotplot(method="histodot", color="#ed6925", fill="#fbb61a", binwidth=.105,
dotsize=0.4) +
labs(x= "Population - Sample Group Mean Diff. (Sampling Errors)\nN=1,000 Samples (Group n=100)") +
stat_function(fun = dnorm, args = list(mean = mean(mudiffDC$error), sd = sd(mudiffDC$error)),
color='#57106e', size=1.1, alpha=0.7) +
geom_point(aes(x=-0.42, y=0.005), color="#bc3754", fill="#bc3754", size=2.5) +
geom_segment(x = -1.5, y = .3, xend = -0.5, yend = .025,
color="#bc3754", size=1.5,
arrow = arrow(length = unit(0.025, "npc"))) +
annotate("text", x=-1.7, y=.36, label="Sampling error from\nour first sample\n(sample #1 error. = -0.37)",
angle=0, color="#bc3754") +
geom_vline(xintercept = mean(mudiffDC$error), linetype="dashed", color = "#57106e",
size=1.2, alpha=0.7) +
theme_minimal() +
coord_cartesian(expand = FALSE, xlim=c(-2.6,2.6), ylim=c(0,.6)) +
scale_y_continuous(NULL, breaks = NULL)
samperror
As expected, the distribution of sample estimation errors, or the sampling error distribution, is normally distributed with a mean of 0. However, also note that many of our samples generated estimates of the population group mean difference that were substantially off from the true difference. In fact, 15% (N=150) of our 1,000 samples overestimated or underestimated the true population group mean difference by more than 1 unit (i.e., estimated diff. <3 or >5).
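Here is one way to count those samples (a sketch, again using the `mudiffDC` object from above):

#count samples whose estimated difference missed the true value (4) by more than 1 unit
mudiffDC %>% count(abs(value - 4) > 1)
#or express it as a proportion of all N=1,000 samples
mean(abs(mudiffDC$value - 4) > 1)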
Given this, what do you think would happen to the sampling error distribution if we increase our initial sample group sizes from n=100 per group to n=1,000 per group? If we raise the group sample size to n=1,000 in each group, how many of our N=1000 simulated sample group pairs will generate estimated group mean differences with errors of more than 1 unit (i.e., overestimated or underestimated mean differences <3 or >5)?
#Illustrate sampling distribution of differences in sample means, two indep pop grps
nvals <- 1000
nsamps <- 1000
set.seed(1138)
#sim 1000 columns (samples) of rnorm values with "nval" size for each group
grpE <- replicate(nsamps, rnorm(n=nvals, mean=20, sd=5), simplify=TRUE)
grpF <- replicate(nsamps, rnorm(n=nvals, mean=24, sd=5), simplify=TRUE)
#calculate difference in sample means for all 1000 samples
mudiffFE <- as_tibble(colMeans(grpF)-colMeans(grpE))
#calculate sampling error for all 1000 samples
mudiffFE <- mudiffFE %>% mutate(
error = value - 4 # observed diff in sample means (value) minus true population diff (4)
)
#plot sampling errors for all 1000 samples
samperrorn1000 <- mudiffFE %>% ggplot(aes(x=error)) +
geom_dotplot(method="histodot", color="#ed6925", fill="#fbb61a", binwidth=.085,
dotsize=0.4) +
labs(x= "Sampling Errors\nN=1,000 Samples\nGroup n=1000") +
stat_function(fun = dnorm, args = list(mean = mean(mudiffFE$error), sd = sd(mudiffFE$error)),
color='#57106e', size=1.1, alpha=0.7) +
theme_minimal() +
coord_cartesian(expand = FALSE, xlim=c(-2.5,2.5), ylim=c(0,2)) +
scale_y_continuous(NULL, breaks = NULL)
samperrorn100 <- mudiffDC %>% ggplot(aes(x=error)) +
geom_dotplot(method="histodot", color="#ed6925", fill="#fbb61a", binwidth=.085,
dotsize=0.4) +
labs(x= "Sampling Errors\nN=1,000 Samples\nGroup n=100") +
stat_function(fun = dnorm, args = list(mean = mean(mudiffDC$error), sd = sd(mudiffDC$error)),
color='#57106e', size=1.1, alpha=0.7) +
theme_minimal() +
coord_cartesian(expand = FALSE, xlim=c(-2.5,2.5), ylim=c(0,2)) +
scale_y_continuous(NULL, breaks = NULL)
samperrorn100 + samperrorn1000
# mudiffDC %>% count(error >= 1)
# mudiffDC %>% count(error <= -1)
# https://rpruim.github.io/s341/S19/from-class/MathinRmd.html
The figure on the left (above) is the same one we showed before - it displays the sampling error distribution for our N=1,000 simulated samples with group sizes of n=100 per group. The figure on the right (above) displays the sampling error distribution for our N=1,000 simulated samples with group sizes of n=1,000 per group. As you can see, the distribution of sample estimation errors, or the sampling error distribution, is normally distributed with a mean of 0 in both figures.
However, note that the spread of the sampling distribution is much wider in the figure on the left (small groups of n=100 each) and much narrower in the figure on the right (large groups of n=1,000 each). Moreover, while 15% of samples on the left underestimated or overestimated the true group difference by at least 1 unit (i.e., <3 or >5; error <-1 or >1), none of the samples on the right generated group mean difference estimates that were in error by more than +/-1 unit.
This illustrates another essential point about sampling distributions: the spread - e.g., the variance or standard deviation - of the sampling distribution of a statistical estimate decreases as sample sizes increase! In fact, as you learned in this chapter (p.306), the formula for the standard deviation of the sampling distribution of the difference in means explicitly adjusts for the sample group sizes (n_1 & n_2):
\[ \sigma_{\overline{X}_1-\overline{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}} \]
Note: You can add LaTeX style mathematical
equations to your R Markdown text just like we did above! To offset
the equation on its own line, type two dollar signs ($$
),
input the equation, then close the equation editor by typing two more
dollar signs. To add equations as inline text, type one dollar sign
($
), input the equation, then close the equation editor
with a second dollar sign.
For instance, to display the equation above, we typed the following:
$$
\sigma_{\overline{X}_1-\overline{X}_2} =
\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}
$$
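As an aside, this formula translates directly into a small R helper function (a sketch; the chunks below simply compute the same quantity inline):

#standard error of the difference in means from group sds and group sizes
se_diff <- function(sd1, sd2, n1, n2) {
  sqrt((sd1^2 / n1) + (sd2^2 / n2))
}
se_diff(5, 5, 100, 100) #same value as popsediffn100 below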
Recall, we know the population standard deviation for both groups because we set them each to equal \(\sigma=5\) in our simulations. Therefore, we do not need to use the formula for calculating the pooled variance estimate of the standard error of the difference (see p.309) to estimate the standard deviation of the sampling distribution from the observed sd of our first sample. Instead, we can simply plug the population values into the formula above to calculate the standard error of the difference (aka, the standard deviation of the sampling distribution of sample mean differences) for the small (n=100) and large (n=1,000) sample groups.
Population standard error of the difference, group
n=100
popsediffn100 <- sqrt((5^2/100) + (5^2/100))
popsediffn100
## [1] 0.7071068
Population standard error of the difference, group
n=1,000
popsediffn1000 <- sqrt((5^2/1000) + (5^2/1000))
popsediffn1000
## [1] 0.2236068
By now, it should not be surprising to see that the standard error of
the difference is smaller for larger groups (n=1,000) and larger for the
smaller groups (n=100). We can also check the standard deviation of our
N=1,000 simulated sample differences to see if they are close to these
known population values.
Sample estimated standard error of difference, group
n=100
simsediffn100 <- sd(mudiffDC$value)
simsediffn100
## [1] 0.7414356
Sample estimated standard error of difference, group
n=1,000
simsediffn1000 <- sd(mudiffFE$value)
simsediffn1000
## [1] 0.2224673
Note that both estimates from our N=1,000 simulated samples of small and large groups are close to the true population standard errors of the difference values we calculated above - so our simulation of the sampling distribution worked as expected. However, the small group simulated estimate (0.74) was further away from our population value (0.71). This too should be unsurprising, as you know by now to expect more sampling variability in statistical estimates with smaller samples.
Bottom line: If researchers collect small samples, they take a substantial risk of making estimation and inference errors!
Now, let’s return to our first sample of n=100 in Group A and n=100 in Group B. We will use estimates of the population group means from these two independent samples and a two-sample t-test with equal variance assumed to test the null hypothesis of no difference in population group means. Let’s assume a two-tailed test with a conventional alpha threshold of 0.05.
#conduct t-test
t_testAB <- t.test(grpA, grpB, paired=FALSE, var.equal=TRUE)
t_testAB
##
## Two Sample t-test
##
## data: grpA and grpB
## t = -5.5416, df = 198, p-value = 9.479e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.926117 -2.340310
## sample estimates:
## mean of x mean of y
## 20.33748 23.97070
We conducted a t-test! Now, what does it mean?
Well, as your book explains, we start with the observed sample mean difference of 20.34 (\(\overline{X}_A\)) minus 23.97 (\(\overline{X}_B\)), which equals -3.63 (\(\overline{X}_A - \overline{X}_B\)). We then convert this sample mean difference into a t-standardized score by dividing the mean difference by the standard error of the difference using the pooled variance t-test formula (see p.309).
Although we know the actual population standard error of the
difference, here we are strictly estimating from our samples to
illustrate the sample inference process. Thus, R will divide the mean
difference by the sample pooled variance estimate of the
standard error as described on p.309 of your book. We specified
a pooled variance estimate, which assumes equal variance between groups,
by including var.equal=TRUE
.
We can check the model’s standard error estimate by clicking on the
“t_testAB” object in our environment and navigating to the “stderr”
value, or we can call it in code with t_testAB$stderr
; that
estimate is 0.6556248.
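We can also reproduce that pooled standard error by hand from the two sample standard deviations (a sketch of the pooled variance formula from p.309, using the `grpA` and `grpB` objects from above):

#pooled standard deviation across the two groups (n=100 each)
n1 <- length(grpA$A); n2 <- length(grpB$B)
sd_pool <- sqrt(((n1 - 1)*sd(grpA$A)^2 + (n2 - 1)*sd(grpB$B)^2) / (n1 + n2 - 2))
#pooled standard error of the difference; should match t_testAB$stderr
sd_pool * sqrt(1/n1 + 1/n2)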
Likewise, we can use these values to see how the
t-standardized mean difference was calculated in R:
Mean difference, Group A & B (Sample #1)
diffAB <- mean(grpA$A) - mean(grpB$B)
diffAB
## [1] -3.633213
Standard error of difference, Mean(A) - Mean(B)
stderrAB <- t_testAB$stderr
stderrAB
## [1] 0.6556248
t-standardized mean difference, Group A & B
tvalAB <- diffAB / stderrAB
tvalAB
## [1] -5.541605
Note the t value here is the same as the one found in our t-test output. That output also includes a 95% confidence interval for the estimated mean difference. Additionally, the p-value is near zero (\(9.4793139 \times 10^{-8}\)), which is well below our pre-specified alpha level of 0.05. Given this, with our first simulated sample, we would reject the null hypothesis of no difference in means; it would be highly improbable to observe a sample group mean difference as large or larger than the one we observed (i.e., observed diff >= +/-3.63) if the true population mean difference between groups A and B were equal to "0".
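Also, since `t.test()` returns a list (an object of class "htest"), each reported quantity can be extracted directly in code - handy for inline R reporting (a sketch, using the `t_testAB` object from above):

#pull individual elements from the t-test output
t_testAB$statistic #t value
t_testAB$p.value #p-value
t_testAB$conf.int #95% confidence interval for the mean difference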
Of course, we know this is a valid inference because we set the true population difference to equal "4", so we would hope such a test would reject the null hypothesis of diff=0. Yet, do you know how many of our N=1,000 simulated samples with group size n=100 would have reached the same valid inference? There are various ways to approach this question, which essentially asks about the statistical power of our test - or, conversely, about the probability of making a Type II (false negative) error.
One way we could approach answering this question is by using
replicate()
to perform t-tests on all N=1,000
samples and then counting the number of tests that failed to reject the
null hypothesis. Let’s try it by generating 1,000 new samples of group
size n=100 each and then counting the number of Type II (false negative)
errors.
How many of our N=1,000 samples (group n=100) led to a false
negative test? (i.e., ‘TRUE’)
set.seed(1138)
simtrep <- replicate(1000,t.test(rnorm(n=100, mean=20, sd=5),rnorm(n=100, mean=24, sd=5)), simplify=FALSE)
pvals <- as_tibble(sapply(simtrep, '[[', 'p.value'))
pvals %>% count(value>=0.05)
## # A tibble: 1 × 2
## `value >= 0.05` n
## <lgl> <int>
## 1 FALSE 1000
You might find it interesting to see that none of our N=1,000 simulated samples with relatively small group sizes of n=100 each would have generated a false negative inference from the t-test results. That is, each sample test generated a p-value that was less than our alpha threshold of 0.05 (i.e., n[TRUE]=0; n[FALSE]=1000). This is despite the fact that many of these samples would have led to non-negligible overestimates or underestimates of the true population mean difference!
This helps us highlight one of the key pitfalls of interpreting results of null hypothesis significance testing: A significant p-value does not mean we have a good sample estimate of the population. Rather, it merely means that we had sufficient statistical power to detect a difference of the observed magnitude (or larger). A statistically significant estimate could be way off the magnitude or even the direction of the true population difference - especially if we are estimating from relatively small samples with a high degree of sampling variability.
Additionally, though it was not an issue above, a statistically non-significant result could be a false negative - especially with small samples and/or when the true population magnitude is small (relative to our sample size). In the example above, we selected a magnitude for the population mean difference (24-20=4) that was large enough to reliably detect with group sizes of n=100. In other words, our test was designed to have sufficient statistical power to detect an effect of that magnitude given a null hypothesis of no mean difference, as about 99.99% of tests (like we observed) would be expected to detect a true population difference. But what if we had set a different null hypothesis of, say, a magnitude of diff=2 rather than diff=0? Or, what if the true population difference was a smaller magnitude of, say, 2 units instead? Or, what if the two populations were more heterogeneous and, hence, had higher standard deviations?
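Incidentally, you do not have to simulate to estimate power: base R's `power.t.test()` computes the power of a two-sample t-test analytically. A quick sketch for the design above (n=100 per group, true difference = 4, sd = 5):

#analytic power of a two-sample t-test: n=100 per group, delta=4, sd=5
power.t.test(n = 100, delta = 4, sd = 5, sig.level = 0.05)
#power is roughly 0.9998, so nearly every sample should reject the null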
Let’s explore what might happen by running the simulation again, but this time we will specify a true population mean of \(\mu_2=22\) for the second group (instead of “24”) and a population standard deviation of \(\sigma=6\) for both groups (instead of “5”).
If population diff=2 & sigma=6, how many samples (group n=100) would lead to a false negative test? (i.e., ‘TRUE’)
set.seed(1138)
simtrep <- replicate(1000,t.test(rnorm(n=100, mean=20, sd=6),rnorm(n=100, mean=22, sd=6)), simplify=FALSE)
pvals <- as_tibble(sapply(simtrep, '[[', 'p.value'))
pvals %>% count(value>=0.05)
## # A tibble: 2 × 2
## `value >= 0.05` n
## <lgl> <int>
## 1 FALSE 658
## 2 TRUE 342
In this example, with a true mean difference of “2” units (22 - 20) and a true population standard deviation of “6” for both groups, 65.8% of our samples generated a p-value that is less than our alpha level of 0.05. In other words, this test had an approximate power of 0.66, meaning that if the true population characteristics match what we specified in our simulation and if our sample sizes were the same each time (n=100 per group), then we can expect a priori around 66% of sample t-test replications to accurately detect the true population difference.
In contrast, 34% of our simulated samples generated a p-value that is as large or larger than our alpha level of 0.05, meaning that about one-third of these samples would have generated false negative t-test results (Type II inference errors) in this situation.
So, what lessons can we take from this example? All else equal, when
the true population difference is smaller and/or the population groups
are more heterogeneous (i.e., more variable), then random samples drawn
from the population also will be more variable, and sample inferential
tests will become less reliable. Once again, one way to improve the
reliability (power) of those inferential tests - and improve estimation
- is increasing our sample size. Let’s show results of one final
p-value simulation example, keeping our population mean
difference at “2” units and our population standard deviation at “6” as
in the last example, but increasing our group sample size to n=300 per
group.
If diff=2 & sigma=6, how many samples (group n=300) would
lead to a false negative test? (i.e., ‘TRUE’)
set.seed(1138)
simtrep <- replicate(1000,t.test(rnorm(n=300, mean=20, sd=6),rnorm(n=300, mean=22, sd=6)), simplify=FALSE)
pvals <- as_tibble(sapply(simtrep, '[[', 'p.value'))
pvals %>% count(value>=0.05)
## # A tibble: 2 × 2
## `value >= 0.05` n
## <lgl> <int>
## 1 FALSE 980
## 2 TRUE 20
By increasing our group size to n=300 each (n=600 total), we drastically increase the statistical power of our test (to about 0.98), while reducing the number of expected false negative results (Type II inference errors) to about 2% of trials!
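Once again, `power.t.test()` can confirm these simulation-based estimates analytically (a sketch for the two designs just compared):

#power with n=100 per group, true diff=2, sd=6 (roughly 0.65)
power.t.test(n = 100, delta = 2, sd = 6, sig.level = 0.05)$power
#power with n=300 per group (roughly 0.98)
power.t.test(n = 300, delta = 2, sd = 6, sig.level = 0.05)$power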
Okay, that is enough about sampling variability, sampling error, and statistical power for now. Let’s move on to learning how to conduct your own two-sample t-tests so you can complete the SPSS Exercises for Chapter 10.
Goal: Read in Youth Dataset and Find Mean Differences
In this assignment, you will work through Question #2 from the SPSS Exercises at the end of Chapter 10. This question asks you to investigate whether students who spend lots of time studying also view delinquency as more wrong compared to those who spend less time studying. Essentially, you will be testing the null hypothesis of no difference in means on the ‘moral’ variable between two independent groups of youth - those who study 8 or fewer hours (<9) per week (‘studyhard’=0) and those who study more than 8 hours (9+) per week (‘studyhard’=1).
As we have done in the past couple of assignments, we will first illustrate the steps you will take to complete the assignment using a different research question and different variables. Specifically, we will instead examine whether students who spend lots of time studying are less likely to engage in delinquency compared to those who study less frequently. As such, ‘studyhard’ will still be our focal independent variable that defines our two independent groups. However, we will substitute ‘delinquency’ for our dependent variable (instead of ‘moral’).
Once again, we want to ensure you learn how to conduct an independent sample t-test of mean differences in R using real data and that you can apply these steps on your own to answer the assignment questions. Simply replace ‘delinquency’ with ‘moral’ throughout the assignment to obtain the correct information. Be sure to repeat these steps with the correct variables listed in the assignment to ensure you receive full credit.
(Note: Remember that, when following instructions, always substitute your own last name for “LastName” and the actual date for YEAR_MO_DY. E.g., 2022_07_06_Fordham_K300Assign10)
- Go to your K300_L folder and create a new R Markdown file for this assignment (e.g., YEAR_MO_DY_LastName_K300Assign10); the `here` package will automatically set our K300_L folder as the top-level directory.
- In your library chunk, load `tidyverse`, `haven`, `here`, `sjmisc`, `see`, `datawizard`, `car`, and `sjPlot`.
  - This is our first time using the `see`, `car`, and `datawizard` packages. `see` is part of the `easystats` suite of packages. `easystats` is a collection of R packages which aims to provide a unifying, consistent, and easy-to-use framework for analyzing and visualizing data in R. The package we will use from this suite, `see`, is a plotting companion that allows us to make aesthetically pleasing visualizations of our data relatively easily. To learn more about `see` and other `easystats` packages, visit GitHub. We will briefly discuss the `datawizard` and `car` packages later in the assignment.
- Next, insert an R chunk, read in the ‘Youth_0.sav’ SPSS datafile with `read_spss()` and `here()`, and assign it to an R object named `YouthData`. Then use `head()` and `gt()` to view the first few rows of `YouthData`.
YouthData <- read_spss(here("Datasets", "Youth_0.sav"))
head(YouthData) %>% gt()
Gender | v2 | v21 | v22 | v63 | v77 | v79 | v109 | v119 | parnt2 | fropinon | frbehave | certain | moral | delinquency | d1 | hoursstudy | filter_$ | heavytvwatcher | studyhard | supervision | drinkingnotbad | Lowcertain_bin |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 15 | 36 | 15 | 3 | 1 | 3 | 5 | 1 | 5 | 18 | 9 | 9 | 19 | 8 | 1 | 15 | 1 | 1 | 1 | 0 | 0 | 1 |
0 | 15 | 3 | 5 | 4 | 1 | 2 | 4 | 1 | 8 | 11 | 6 | 11 | 19 | 10 | 1 | 5 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 15 | 20 | 6 | 4 | 1 | 1 | 4 | 2 | 8 | 12 | 5 | 11 | 20 | 1 | 0 | 6 | 0 | 1 | 0 | 1 | 0 | 0 |
0 | 15 | 2 | 4 | 2 | 1 | 2 | 5 | 3 | 4 | 9 | 5 | 13 | 19 | 1 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 14 | 12 | 2 | 4 | 3 | 5 | 5 | 3 | 7 | 27 | 9 | 4 | 18 | 104 | 1 | 2 | 1 | 0 | 0 | 1 | 1 | 1 |
1 | 15 | 1 | 3 | 4 | 1 | 1 | 5 | 3 | 7 | 8 | 9 | 10 | 20 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 |
- Insert an R chunk and use `sjPlot::view_df()` to get a snapshot of your data and variable information.
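For example, you might type the following (a sketch; `view_df()` renders an interactive variable summary in the RStudio Viewer rather than printing to the console):

YouthData %>% view_df()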
- Next, insert another R chunk and check the frequency distributions of our key variables. Type `YouthData %>% frq(studyhard)`. Hit enter and type `YouthData %>% frq(delinquency)`.
# show distribution & central tendency of variables
YouthData %>%
frq(studyhard)
## studies 9+ hours a week (studyhard) <numeric>
## # total N=1272 valid N=1214 mean=0.27 sd=0.45
##
## Value | Label | N | Raw % | Valid % | Cum. %
## ---------------------------------------------------------
## 0 | 8 or fewer hours | 881 | 69.26 | 72.57 | 72.57
## 1 | 9+ hours | 333 | 26.18 | 27.43 | 100.00
## <NA> | <NA> | 58 | 4.56 | <NA> | <NA>
YouthData %>%
frq(delinquency)
## time 1 delinquency scale (delinquency) <numeric>
## # total N=1272 valid N=1272 mean=26.40 sd=39.27
##
## Value | N | Raw % | Valid % | Cum. %
## --------------------------------------
## 0 | 288 | 22.64 | 22.64 | 22.64
## 1 | 91 | 7.15 | 7.15 | 29.80
## 2 | 71 | 5.58 | 5.58 | 35.38
## 3 | 56 | 4.40 | 4.40 | 39.78
## 4 | 37 | 2.91 | 2.91 | 42.69
## 5 | 62 | 4.87 | 4.87 | 47.56
## 6 | 38 | 2.99 | 2.99 | 50.55
## 7 | 29 | 2.28 | 2.28 | 52.83
## 8 | 22 | 1.73 | 1.73 | 54.56
## 9 | 15 | 1.18 | 1.18 | 55.74
## 10 | 30 | 2.36 | 2.36 | 58.10
## 11 | 21 | 1.65 | 1.65 | 59.75
## 12 | 23 | 1.81 | 1.81 | 61.56
## 13 | 15 | 1.18 | 1.18 | 62.74
## 14 | 13 | 1.02 | 1.02 | 63.76
## 15 | 22 | 1.73 | 1.73 | 65.49
## 16 | 11 | 0.86 | 0.86 | 66.35
## 17 | 9 | 0.71 | 0.71 | 67.06
## 18 | 8 | 0.63 | 0.63 | 67.69
## 19 | 3 | 0.24 | 0.24 | 67.92
## 20 | 20 | 1.57 | 1.57 | 69.50
## 21 | 7 | 0.55 | 0.55 | 70.05
## 22 | 3 | 0.24 | 0.24 | 70.28
## 23 | 5 | 0.39 | 0.39 | 70.68
## 24 | 5 | 0.39 | 0.39 | 71.07
## 25 | 13 | 1.02 | 1.02 | 72.09
## 26 | 6 | 0.47 | 0.47 | 72.56
## 27 | 6 | 0.47 | 0.47 | 73.03
## 28 | 8 | 0.63 | 0.63 | 73.66
## 29 | 2 | 0.16 | 0.16 | 73.82
## 30 | 5 | 0.39 | 0.39 | 74.21
## 31 | 11 | 0.86 | 0.86 | 75.08
## 32 | 6 | 0.47 | 0.47 | 75.55
## 33 | 6 | 0.47 | 0.47 | 76.02
## 34 | 2 | 0.16 | 0.16 | 76.18
## 35 | 11 | 0.86 | 0.86 | 77.04
## 36 | 6 | 0.47 | 0.47 | 77.52
## 37 | 2 | 0.16 | 0.16 | 77.67
## 38 | 3 | 0.24 | 0.24 | 77.91
## 39 | 3 | 0.24 | 0.24 | 78.14
## 40 | 4 | 0.31 | 0.31 | 78.46
## 41 | 2 | 0.16 | 0.16 | 78.62
## 42 | 2 | 0.16 | 0.16 | 78.77
## 43 | 4 | 0.31 | 0.31 | 79.09
## 44 | 2 | 0.16 | 0.16 | 79.25
## 45 | 6 | 0.47 | 0.47 | 79.72
## 46 | 4 | 0.31 | 0.31 | 80.03
## 48 | 2 | 0.16 | 0.16 | 80.19
## 50 | 8 | 0.63 | 0.63 | 80.82
## 51 | 3 | 0.24 | 0.24 | 81.05
## 52 | 8 | 0.63 | 0.63 | 81.68
## 53 | 5 | 0.39 | 0.39 | 82.08
## 55 | 1 | 0.08 | 0.08 | 82.15
## 56 | 1 | 0.08 | 0.08 | 82.23
## 58 | 1 | 0.08 | 0.08 | 82.31
## 59 | 1 | 0.08 | 0.08 | 82.39
## 60 | 2 | 0.16 | 0.16 | 82.55
## 61 | 1 | 0.08 | 0.08 | 82.63
## 62 | 2 | 0.16 | 0.16 | 82.78
## 63 | 2 | 0.16 | 0.16 | 82.94
## 64 | 2 | 0.16 | 0.16 | 83.10
## 65 | 2 | 0.16 | 0.16 | 83.25
## 66 | 1 | 0.08 | 0.08 | 83.33
## 68 | 1 | 0.08 | 0.08 | 83.41
## 69 | 1 | 0.08 | 0.08 | 83.49
## 70 | 3 | 0.24 | 0.24 | 83.73
## 71 | 1 | 0.08 | 0.08 | 83.81
## 72 | 2 | 0.16 | 0.16 | 83.96
## 74 | 1 | 0.08 | 0.08 | 84.04
## 75 | 2 | 0.16 | 0.16 | 84.20
## 77 | 2 | 0.16 | 0.16 | 84.36
## 78 | 2 | 0.16 | 0.16 | 84.51
## 80 | 2 | 0.16 | 0.16 | 84.67
## 81 | 4 | 0.31 | 0.31 | 84.98
## 82 | 2 | 0.16 | 0.16 | 85.14
## 84 | 1 | 0.08 | 0.08 | 85.22
## 85 | 4 | 0.31 | 0.31 | 85.53
## 86 | 3 | 0.24 | 0.24 | 85.77
## 90 | 1 | 0.08 | 0.08 | 85.85
## 91 | 1 | 0.08 | 0.08 | 85.93
## 92 | 1 | 0.08 | 0.08 | 86.01
## 93 | 1 | 0.08 | 0.08 | 86.08
## 94 | 3 | 0.24 | 0.24 | 86.32
## 95 | 1 | 0.08 | 0.08 | 86.40
## 97 | 2 | 0.16 | 0.16 | 86.56
## 99 | 2 | 0.16 | 0.16 | 86.71
## 100 | 10 | 0.79 | 0.79 | 87.50
## 101 | 5 | 0.39 | 0.39 | 87.89
## 102 | 4 | 0.31 | 0.31 | 88.21
## 103 | 5 | 0.39 | 0.39 | 88.60
## 104 | 2 | 0.16 | 0.16 | 88.76
## 105 | 1 | 0.08 | 0.08 | 88.84
## 106 | 1 | 0.08 | 0.08 | 88.92
## 107 | 2 | 0.16 | 0.16 | 89.07
## 109 | 1 | 0.08 | 0.08 | 89.15
## 110 | 5 | 0.39 | 0.39 | 89.54
## 111 | 1 | 0.08 | 0.08 | 89.62
## 114 | 2 | 0.16 | 0.16 | 89.78
## 115 | 1 | 0.08 | 0.08 | 89.86
## 116 | 1 | 0.08 | 0.08 | 89.94
## 118 | 128 | 10.06 | 10.06 | 100.00
## <NA> | 0 | 0.00 | <NA> | <NA>
Next, we will use the `datawizard` package, which is included with the `see` package, to filter and manipulate our data. I know - packages within packages? As complex as that may seem, `datawizard` is one of several packages that are automatically installed with `see`. This package allows us to tidily filter our data into objects. Visit this website to learn more about all that the `see` package can do!
- Insert an R chunk and type `studyhi <- data_filter(YouthData, studyhard == 1)`. Then, hit enter and type `mean(studyhi$delinquency)` to see the ‘delinquency’ mean in our new filtered data object.
  - Here, we use the `datawizard` package to filter our data to keep only those cases where the youth studied 9 or more hours a week (`studyhard == 1`), then assign the filtered data to a new data object named `studyhi`.
  - Note: We could have accomplished the same thing with `dplyr::filter()`. We wanted to introduce you to the “easystats” suite of packages and, in the process, to show you that you can use its selection of self-contained and inter-compatible packages to accomplish many of the data wrangling, analysis, and visualization tasks you have learned about thus far.
- Repeat this process to filter the low/no studying group (`studyhard == 0`), assign it to a new data object named `studylow`, and check the ‘delinquency’ mean in this group.

#Use datawizard package for tidy filtering & data manip
#https://easystats.github.io/datawizard/
#mean high studying group
studyhi <- data_filter(YouthData, studyhard == 1)
mean(studyhi$delinquency)
## [1] 19.81381
#mean low/no studying group
studylow <- data_filter(YouthData, studyhard == 0)
mean(studylow$delinquency)
## [1] 28.98751
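As an aside, here is the equivalent `dplyr` approach mentioned above (a sketch; the object names `studyhi2` and `studylow2` are just illustrative stand-ins):

#same filtering with dplyr::filter()
studyhi2 <- YouthData %>% filter(studyhard == 1)
studylow2 <- YouthData %>% filter(studyhard == 0)
mean(studyhi2$delinquency) #matches mean(studyhi$delinquency)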
- Now, calculate the difference between the two group means. Insert an R chunk and type `meandiff <- mean(studyhi$delinquency) - mean(studylow$delinquency)`. Then, hit enter and type `meandiff` to view the resulting value.

#calc mean diff btw groups
meandiff <- mean(studyhi$delinquency) - mean(studylow$delinquency)
meandiff
## [1] -9.1737
As we anticipated, the negative mean difference in delinquency values between these two independent samples indicates that the high studying group engages in less delinquency on average compared to the low studying group - about 9 fewer acts on average in the past year (-9.17).
This descriptive statistic informs us about the observed mean difference in delinquency between two independent groups in a sample of youth. What if we wanted to make an inference about group mean differences in the population from which this sample was drawn?
To do this, we could conduct an independent samples t-test. Most commonly, this is done as a null hypothesis test in which we assume the two group means are equal in the population (i.e., a null hypothesis of equality in means) and then estimate the probability of obtaining a mean difference at least as large as that observed due to sampling variability alone under the null hypothesis of no difference in means.
As we illustrated in the simulations above, this process involves
using the t distribution to transform our raw observed mean
difference into a t-standardized mean difference score (\(t_{obt}\)). From there, we can directly
compare our t-obtained test statistic to a t critical
value that is determined by our pre-specified alpha level and the
degrees of freedom for the test. We can also transform our
t-standardized mean difference score (i.e., our t-test
statistic) into a p-value that represents the estimated
probability of observing a mean difference at least as large as that
which we observed under the null hypothesis (i.e., if the means were
truly equal in the population). There are various ways to conduct
independent samples t-tests in R. For some additional examples,
you might wish to check out here,
here,
and here.
We will describe how to do so using the base R t.test()
function.
Goal: Conduct independent samples t-test
- Insert an R chunk and type `myttest_eq <- t.test(delinquency ~ studyhard, var.equal=TRUE, data = YouthData)`. Then, hit enter and type `print(myttest_eq)` to view our t statistic, p-value, and confidence intervals (at 95% confidence level).
  - The `t.test()` function conducts an independent samples t-test.
  - `delinquency ~ studyhard` is our statistical model, which specifies that ‘delinquency’ is distributed as a random normal variable with different mean values across each of the two categories of the ‘studyhard’ variable.
  - `var.equal = TRUE` specifies that the data have met the assumption of homogeneity (i.e., equality) of variances. In other words, this tells R to conduct a pooled variance t-test.
- Repeat the process, this time replacing `myttest_eq` with `myttest_uneq` (for unequal variances) and `var.equal = TRUE` with `var.equal = FALSE`. Run your code and print the results.

Independent samples t-test, pooled variance
#t-test will convert raw mean diff to stdz t value, then calc p-value of that stdz diff
#indep samp t-test w/equal var assumed;
myttest_eq <- t.test(delinquency ~ studyhard, var.equal=TRUE, data = YouthData) # assumes that the data has met the assumption of homogeneity of variances
print(myttest_eq)
##
## Two Sample t-test
##
## data: delinquency by studyhard
## t = 3.6444, df = 1212, p-value = 0.0002794
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 4.235134 14.112266
## sample estimates:
## mean in group 0 mean in group 1
## 28.98751 19.81381
Independent samples t-test, separate variance
#indep samp t-test w/unequal var assumed;
myttest_uneq <- t.test(delinquency ~ studyhard, var.equal=FALSE, data = YouthData) # does not assume that the data has met the assumption of homogeneity of variances
print(myttest_uneq)
##
## Welch Two Sample t-test
##
## data: delinquency by studyhard
## t = 3.9005, df = 690.37, p-value = 0.0001054
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 4.555967 13.791434
## sample estimates:
## mean in group 0 mean in group 1
## 28.98751 19.81381
While we demonstrated how to conduct both a pooled and separate
variance independent samples t-test above, which of those two
tests should we have conducted? According to your book, it is typically
safest to assume unequal variances since we are making inferences from a
sample and do not know whether the variances indeed are equal in the
population. However, Question #2.e in the SPSS Exercises introduces the
Levene’s test for equality of variances as another method for making
this decision. This test uses our sample group variances as estimates of
the population group variances and then conducts an F test to
help us make an inference about the homogeneity of variances in the
population by either rejecting or failing to reject the null hypothesis
of equal variances. We will show you how to conduct this test and answer
the associated assignment questions using R.
Create a third-level header titled: “Levene’s Test for Equality
of Variances” and then insert an R chunk.
- First, we need to recode the ‘studyhard’ variable into a factor rather than a numeric variable. Then, we will assign our new dataset containing the modified ‘studyhard’ factor variable to a new data object named `YouthData2`. Type `YouthData2 <- YouthData %>% mutate(studyhard = as.factor(studyhard))`. From here on, be sure to use our new data object, `YouthData2`, and not the original!
  - The `mutate()` function is one you have used before. Recall that it allows us to recode and manipulate our data.
  - `studyhard = as.factor(studyhard)` mutates the variable ‘studyhard’ from numeric into a factor variable.
  - We assigned the recoded data to a new object named `YouthData2`.

#First, need to recode studyhard as factor variable; we will save into new dataset YouthData2 just to be safe
YouthData2 <- YouthData %>% mutate(studyhard = as.factor(studyhard))
Now, we are ready to conduct Levene’s test for homogeneity of variance. As noted above, Question #2.e in the SPSS Exercise for Chapter 10 in your book mentions this test but does not provide a detailed explanation. As we explained, this procedure uses observed sample group variances to test the null hypothesis that the variances (standard deviations) are equal across both groups in the population. If the sample group variances differ substantially, we should reject the null hypothesis of equal variances and choose to conduct a separate (unequal) variance t-test of equality of means across groups. (Note: To be safe, it is fine to choose the unequal variance option even when Levene’s test results in a failure to reject the null hypothesis.)
- Levene’s test is available in the `car` package, which is an acronym for “companion to applied regression.” For examples of conducting this test in R, see here, here, and here.
- The basic syntax is `leveneTest(yvar ~ factorvar, data)`. Of course, you will need to input your own y-variable, factor or group variable, and dataset!

#Levene's test found in car package (companion to applied regression)
leveneTest(delinquency ~ studyhard, YouthData2)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 11.94 0.0005685 ***
## 1212
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, Levene’s F statistic is statistically significant (p < .001), so we reject the null hypothesis of equal population group variances; this suggests we should rely on the separate (unequal) variance t-test results above.
Goal: Visualize sample means, distributions, and outlier observations
For the last part of the assignment, we will visualize our variables
using the see
package and geom_violindot()
.
Recall that the see
package allows us to visualize our data
from spread to outliers. We have used the ggplot()
and
aes()
functions before to plot data. Now, we will use these
functions with the geom_violindot()
function to graph our
variables side-by-side for comparison.
- First, we need to remove missing (`NA`) observations from our data. For example, if a participant did not give a response on their time studying, we will need to drop that case from our data. One way to do this is to insert an R chunk and type: `data <- droplevels(data_filter(YouthData2, studyhard != "NA"))`
  - Here, we use the `data_filter()` function again to tidily filter out any `NA` observations on our ‘studyhard’ variable from our new `YouthData2` data object.
  - Recall that we previously filtered to keep cases equal to (`==`) a particular value (e.g., `studyhard == 1`). Here, we use the `!=` operator to filter, which tells R that we do not want cases equal to this value on the variable in our data. So, `studyhard != "NA"` tells R not to include cases where there is no information on hours spent studying.
  - We also wrapped the filter in the `droplevels()` command, which removes the now empty `NA` grouping level for the ‘studyhard’ variable altogether so it will not show up in tables or figures.
  - We assigned the filtered data to a new object named `data`.

# Filter out NA values on studyhard & drop NA factor level
data <- droplevels(data_filter(YouthData2, studyhard != "NA"))
- Next, we start with `ggplot()` to let R know we are creating a graph. Insert an R chunk and type `ggplot(data, aes(x = studyhard, y = delinquency, fill = studyhard)) +`
  - Recall, `aes()` allows us to determine our axes. Here, we want our independent variable, ‘studyhard’, to be on the x-axis and our dependent variable, ‘delinquency’, to be on the y-axis.
  - `fill = studyhard` tells ggplot to fill the violin plots by ‘studyhard’ group. In other words, it will fill the delinquency distributions with different colors for each study group.
  - Remember to put the `+` sign on the same line as `ggplot()` to add layers (i.e., at the end of each line rather than the beginning of a new line) so R knows you want more than a blank graph.
- On the next line, type `geom_violindot(color_dots = "#d44842", size_dots = 4) +`.
  - `geom_violindot()` generates a half-violin and half-dotplot, which is useful for visualizing the distribution and the group sample sizes simultaneously. Violin plots (right side of each image) are mirrored density plots that are displayed in a similar way to boxplots, while the dotplot (left side of each image) allows more precise visualization of each observation in the group. To learn more about `geom_violindot()`, visit this website.
  - The `color_dots =` and `size_dots =` arguments adjust the color and size of our frequency dotplot (left side). We can use any color we’d like and make the dots as large or small as we prefer. Here, we chose a color from the inferno palette.
- On the next line, type `scale_fill_manual(values=c("#4ac16d","#e45a31")) +` to manually specify the fill colors for each group.
- Finish by typing `theme_modern()`. This is a default function that tells R what theme to use for our plot.

# Visualize distributions and observations
myplot <- ggplot(data, aes(x = studyhard, y = delinquency, fill = studyhard)) +
geom_violindot(color_dots = "#d44842", size_dots = 4) +
scale_fill_manual(values=c("#365c8d", "#4ac16d")) +
theme_modern()
myplot
- Lastly, add (`+`) two horizontal lines at the delinquency mean values for each group.
  - You can use `geom_hline()` to add a horizontal line to any ggplot object. If you want to add a vertical line, use `geom_vline()` instead.
  - `yintercept =` specifies where each horizontal line should cross the y-axis. We added a line at the delinquency mean value for the ‘studylow’ group (`mean(studylow$delinquency)`) and another at the ‘studyhi’ delinquency mean (`mean(studyhi$delinquency)`).
  - We used `color =` to specify the same group colors that we used to fill our violin plot in the ggplot code chunk above.

myplot +
geom_hline(yintercept = mean(studylow$delinquency), color = "#365c8d") +
geom_hline(yintercept = mean(studyhi$delinquency), color = "#4ac16d")
Do you know how to…

- Create inline R code using `` `r ` ``, which can improve reproducibility and accuracy of reporting results by helping avoid typos or copy-and-paste errors?
- Add LaTeX-style equations to R Markdown text, using a single `$` for inline text equations or two `$$` to offset equations on a new line?
- Filter data using `datawizard::data_filter()`?
- Recode a numeric variable into a factor using `mutate(newvar = as.factor(oldvar))`?
- Conduct an independent samples t-test using `t.test()` and interpret the results?
- Conduct Levene’s test for equality of variances using `car::leveneTest()`?
- Use `geom_violindot()` to visualize group distributions and sample sizes simultaneously?
- Use `aes(fill=groupvar)` to fill plots with different colors for each category of a grouping variable?
- Manually specify fill colors using `scale_fill_manual()`?
- Add horizontal or vertical lines to a plot using `geom_hline()` or `geom_vline()`?