In the last assignment, you learned how to estimate and interpret confidence intervals around a point estimate (e.g., sample mean or proportion). You also learned how to simulate data from basic probability distributions to help you better understand sampling variability and the need for interval estimates. In this assignment, you will learn how to conduct a two-tailed z-test and t-test and then, given the test results and the null hypothesis, to make an appropriate inference about the population parameter by either rejecting or failing to reject the null hypothesis.
Before you conduct your own hypothesis tests, we will first simulate population data from a normal probability distribution, then take random samples from our simulated population data and plot features of these samples. Our aim will be to help you visualize the sampling distribution of a sample mean, which should lead to a better understanding of the underlying mechanisms that allow us to make valid population inferences from samples with null hypothesis significance testing. While we will not expect you to do all of these tasks yourself, by providing our code along the way, we hope these examples will help you gain a better sense of how one might conduct simulations, create handy user-written functions, and generate more complex visualizations in R.
In this assignment, you will learn about:

- `dplyr::slice_sample()` and `infer::rep_slice_sample()` for drawing (repeated) random samples from data
- `lapply()` to create a list of objects, which can help you avoid cluttering the R Environment with objects
- `patchwork::wrap_plots()` to quickly combine ggplot objects contained in a list
- the `tail()` function to quickly view the last few rows of data
- built-in datasets, such as `mtcars`, that are universally available to R users
- the base R `t.test()` function and the tidy `infer::t_test()` alternative
- `infer::visualize()` to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis

We are building on objectives from Assignments 1-9. By the start of this assignment, you should already know how to:
- Call functions using the `package::function()` format
- Read in an SPSS data file with `haven::read_spss()` and assign it to an R object using an assignment (`<-`) operator
- Use the `$` symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format `dataobject$varname`
- Use the `%>%` pipe operator to perform a sequence of actions
- Recognize `!=` as "not equal to"
- Turn off scientific notation and set displayed digits with `options(scipen=999, digits = 3)`
- Create a list of items using the format `listname <- c(item1, item2)`
- Create your own function (e.g., our `funxtoz()` function) - and understand that doing so is recommended for duplicate tasks to avoid copy-and-paste errors
- Use `here()` for a simple and reproducible self-referential file directory method
- Use `set.seed()` to make analyses involving random processes reproducible
- Use the `head()` function to quickly view a snapshot of your data
- Use the `glimpse()` function to quickly view all columns (variables) in your data
- Use `sjPlot::view_df()` to quickly browse variables in a data file
- Use `attr()` to identify variable and attribute value labels
- Recognize how missing values are coded as `NA` for variables in your data file
- Remove missing observations with `filter(!is.na(var))`
- Convert labelled data to factors with `haven::as_factor()`
- Drop unused factor levels with `data %>% droplevels(data$variable)`
- Recode variables using the `select()`, `mutate()`, and `if_else()` functions
- Create new variables with `mutate()`
- Draw random samples with `dplyr::sample_n()`
- Subset data with `dplyr::filter()` and the `%in%` operator
- Simulate data from normal, truncated normal, or uniform probability distributions with `rnorm()`, `truncnorm::rtruncnorm()`, or `runif()`
- Use `summarytools::dfsummary()` to quickly describe one or more variables in a data file
- Create frequency tables with the `sjmisc::frq()` and `summarytools::freq()` functions
- Calculate measures of central tendency, such as `mean()` and `median()` (e.g., `mean(data$variable)`)
- Generate descriptive statistics with the `summarytools::descr()` and `psych::describe()` functions
- Assess the shape of a data distribution (e.g., with `sjmisc::frq()` or `summarytools::descr()`)
- Create cross-tabulations with `dplyr::select()` & `sjPlot::sjtab(depvar, indepvar)`
- Create tables with `gt()` (e.g., `head(data) %>% gt()`)
- Modify elements of a `gt()` table, such as adding titles/subtitles with Markdown-formatted (e.g., `**bold**` or `*italicized*`) fonts
- Create plots with the `ggplot()` function
- Use `boxplot()` and `ggplot()` to visualize dispersion in a data distribution
- Modify the colors of a ggplot geometric object (e.g., `geom_boxplot()`) by adding `fill=` and `color=` followed by specific color names (e.g., "orange") or hexadecimal codes (e.g., "#990000" for crimson; "#EDEBEB" for cream)
- Add a preset theme (e.g., `+ theme_minimal()`) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
- Select colors from a color palette (e.g., `viridisLite::viridis()`) and specify them for the outline and fill colors in a ggplot geometric object (e.g., `geom_boxplot()`)
- Add plot titles and axis labels with the `labs()` function (e.g., `+ labs(title = "My Title")`)
- Combine multiple `ggplot()` plots into a single figure using the "patchwork" package
- Plot point estimates and error bars using `ggplot() + geom_point() + geom_errorbar()`
- Add lines (`+ geom_vline()`), arrows (`+ geom_segment()`), or text (`+ annotate()`) elements to a `ggplot()` object
- Conduct a binomial hypothesis test with `rstatix::binom_test()`
- Calculate critical values with `qt()` or `qnorm()`
If you do not recall how to do these things, review Assignments 1-9.
Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand the central limit theorem, the theoretical concept of a sampling distribution of a sample statistic, and the basic logic of null hypothesis significance testing.
As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).
Goal: Understand the sampling distribution of a sample mean as well as the relationship between sample size and sampling variability
After learning about how confidence intervals help us quantify and communicate uncertainty in statistical estimates of population parameters, this week's course materials focused on making inferences about population parameters by conducting null hypothesis tests. Briefly, this process involves specifying null and alternative hypotheses, selecting an alpha level, calculating a test statistic and its associated p-value from sample data, and then either rejecting or failing to reject the null hypothesis by comparing the p-value to the alpha level.
While anyone can learn how to apply these steps to generate p-values and then make claims about null and research hypotheses, a true understanding of the process of statistical inference requires one to understand the central limit theorem and the theoretical concept of a sampling distribution of a sample statistic. You were first introduced to these concepts in Chapter 6 of your book. Since the current chapter is about transitioning from estimation to statistical tests and population inferences, we think it is a good time to revisit the concept of a sampling distribution because it is essential to understanding statistical inferences via null hypothesis testing. Our hope is that we can help you visualize the sampling distribution of a sample mean so that you can better understand what we are doing - and what we are not doing - when we conduct a hypothesis test.
Typically, researchers start with a sample and then work backwards to make inferences about the population from which a sample was drawn. Here, we will start by simulating a known population from a probability distribution and then draw random samples from that population to illustrate how and why we can reliably make inferences about the population as a result of our knowledge of sampling distributions.
To give our example a sense of realism, let’s use the average miles per gallon, or “mpg”, of the entire fleet of light-duty automobiles (cars & light trucks) in the US in 2015 as our population.
With our assumptions above, we now have what we need to simulate a
population of cars with varying average combined mpg values. Below, we
simulate a normally distributed population of N=240,000 vehicle mpg
values with a mpg mean=22 and sd=6. Of course, we set the seed (random
starting value) as well to ensure that we can reproduce the same
simulated data each time. We assigned the simulated population data to an object
(mpg2015pop
) and then displayed the first 10 rows (out of
240,000 total) by piping the data object to head(10)
and
gt()
.
set.seed(1138)
mpgmeanpop <- 22
mpgsdpop <- 6
Ncars <- 240000
mpg2015pop <- rnorm(n=Ncars, mean=mpgmeanpop, sd=mpgsdpop) %>% as_tibble_col(column_name = "mpg2015")
mpg2015pop %>% head(10) %>% gt()
| mpg2015 |
|---|
| 18.38730 |
| 18.35317 |
| 24.62183 |
| 22.52139 |
| 18.49235 |
| 18.32817 |
| 33.74955 |
| 17.42020 |
| 23.74810 |
| 25.15090 |
Let’s check the descriptive statistics for our simulated population data.
mpg2015pop %>% descr(show=c("n","mean","sd","se","md","range")) %>% gt()
| var | n | mean | sd | se | md | range |
|---|---|---|---|---|---|---|
| mpg2015 | 240000 | 21.99317 | 5.98781 | 0.01222257 | 21.9952 | 57.43 (-7.64-49.8) |
As expected, these descriptive statistics closely match the
population values that we specified (mean=22; sd=6). Next, we plot our
simulated population of mpg values using
ggplot() + geom_histogram()
.
#plot base figures
mpg2015fig <- mpg2015pop %>% ggplot() +
geom_histogram(aes(x = mpg2015, y = after_stat(density)), color="#4ac16d", fill="#365c8d",
position="identity", breaks = seq(0, 45, by = .1)) +
labs(x= "Simulated population of auto mpgs in 2015 (N=240k)") +
theme_minimal() +
coord_cartesian(expand = FALSE)
mpg2015fig
Now, imagine we wanted to design a study focused on average auto mpg
in 2015. It would be cost-prohibitive and likely impossible to gather
data from the entire population. So, instead, say we used state vehicle
databases to collect a random sample of n=100 vehicles from this
population, with the goal of drawing inferences about the
population. We can illustrate this process by randomly drawing a sample
of n=100 values from our simulated population data using the tidy
dplyr::slice_sample()
function. We assign this random
sample to an object named mpg2015s100
.
set.seed(1138)
mpg2015s100 <- mpg2015pop %>% slice_sample(n = 100)
mpg2015s100 %>% head() %>% gt()
| mpg2015 |
|---|
| 36.89898 |
| 21.55515 |
| 26.54984 |
| 15.61276 |
| 25.23936 |
| 21.99027 |
Let’s check the descriptive statistics for our sample.
mpg2015s100 %>% descr(show=c("n","mean","sd","se","md","range")) %>% gt()
| var | n | mean | sd | se | md | range |
|---|---|---|---|---|---|---|
| mpg2015 | 100 | 22.05171 | 6.508541 | 0.6508541 | 22.17755 | 31.12 (5.78-36.9) |
Now let’s plot our sample in a histogram. We will also overlay the normal distribution representing our known population parameters (mean=22, sd=6) so we can visualize how close our sample distribution matches the population from which it was drawn.
#plot base figures
mpg2015s100fig <- mpg2015s100 %>% ggplot() +
geom_histogram(aes(x = mpg2015, y = after_stat(density)), color="#1fa187", fill="#4ac16d",
position="identity", breaks = seq(0, 45, by = 2)) +
labs(x= "Random sample (n=100) of simulated auto mpgs in 2015") +
stat_function(fun = dnorm, alpha=.7,
args = list(mean = 22, sd = 6),
color='#365c8d', size=1.1) +
theme_minimal() +
coord_cartesian(expand = FALSE, xlim=c(0,45), ylim=c(0,.1))
mpg2015s100fig
Looking at the descriptive statistics and the histogram, this random
sample varies somewhat yet nonetheless closely resembles the simulated
population from which it was drawn. But would any random sample of the
same size (n=100) be similarly close to the population distribution?
Well, let’s see! We will start by drawing 25 samples of n=100 using the
infer package’s rep_slice_sample()
function, then plot all
25 samples.
As usual, we show the first six observations, which are the first
rows of data in the first replicated sample of n=100, using the
head()
function. However, this time we also use the
tail()
function to show the last six observations,
which correspond to the last six rows of data in the 25th sample of
n=100.
#select 25 samples of n=100 each from simulated population data
set.seed(1138)
mpg2015s100rep <- mpg2015pop %>%
rep_slice_sample(n = 100, reps = 25)
mpg2015s100rep %>% head()
## # A tibble: 6 × 2
## # Groups: replicate [1]
## replicate mpg2015
## <int> <dbl>
## 1 1 36.9
## 2 1 21.6
## 3 1 26.5
## 4 1 15.6
## 5 1 25.2
## 6 1 22.0
mpg2015s100rep %>% tail()
## # A tibble: 6 × 2
## # Groups: replicate [1]
## replicate mpg2015
## <int> <dbl>
## 1 25 32.6
## 2 25 11.7
## 3 25 30.3
## 4 25 17.8
## 5 25 25.4
## 6 25 31.5
Now we want to plot all 25 samples and combine the plots using
patchwork. To do this efficiently, we need to write a function to
generate the ggplot objects. We then present commented-out code that one
could use to explicitly create and combine 25 plot objects, followed by
the code we actually used to more efficiently create and assign these
plots to a list before combining with the
patchwork::wrap_plots()
function.
set.seed(1138)
dat <- mpg2015s100rep
# create function to quickly generate histograms
funhist = function(dat, repnum){
dat %>% filter(replicate==repnum) %>% ggplot() +
geom_histogram(aes(x = mpg2015, y = after_stat(density)), color="#1fa187", fill="#4ac16d",
position="identity", breaks = seq(0, 45, by = 2)) +
stat_function(fun = dnorm, alpha=.7,
args = list(mean = 22, sd = 6),
color='#365c8d', size=1.1) +
theme_minimal() +
coord_cartesian(expand = FALSE, xlim=c(0,45), ylim=c(0,.1)) +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank()
) +
labs(x=NULL, y=NULL)
}
# Code below displays logic of generating & plotting 25 ggplot objects
# however, we will show & use more efficient methods w/for loops or lists
# here is how you could use our function to plot a histogram for all 25 samples
# plots100r1 <- funhist(dat, 1)
# plots100r2 <- funhist(dat, 2)
# plots100r3 <- funhist(dat, 3)
# plots100r4 <- funhist(dat, 4)
# plots100r5 <- funhist(dat, 5)
# plots100r6 <- funhist(dat, 6)
# plots100r7 <- funhist(dat, 7)
# plots100r8 <- funhist(dat, 8)
# plots100r9 <- funhist(dat, 9)
# plots100r10 <- funhist(dat, 10)
# plots100r11 <- funhist(dat, 11)
# plots100r12 <- funhist(dat, 12)
# plots100r13 <- funhist(dat, 13)
# plots100r14 <- funhist(dat, 14)
# plots100r15 <- funhist(dat, 15)
# plots100r16 <- funhist(dat, 16)
# plots100r17 <- funhist(dat, 17)
# plots100r18 <- funhist(dat, 18)
# plots100r19 <- funhist(dat, 19)
# plots100r20 <- funhist(dat, 20)
# plots100r21 <- funhist(dat, 21)
# plots100r22 <- funhist(dat, 22)
# plots100r23 <- funhist(dat, 23)
# plots100r24 <- funhist(dat, 24)
# plots100r25 <- funhist(dat, 25)
# with this explicit approach, you could then combine plots with patchwork
# plots100r1 + plots100r2 + plots100r3 + plots100r4 + plots100r5 +
# plots100r6 + plots100r7 + plots100r8 + plots100r9 + plots100r10 +
# plots100r11 + plots100r12 + plots100r13 + plots100r14 + plots100r15 +
# plots100r16 + plots100r17 + plots100r18 + plots100r19 + plots100r20 +
# plots100r21 + plots100r22 + plots100r23 + plots100r24 + plots100r25 +
# plot_layout(ncol = 5)
# a more efficient way to do above is to make 25 plots with for loop & our function
# generates all 25 plots & assigns to objects p1, p2, ... p25
# for(i in 1:25){
# assign(paste0('p',i), funhist(dat, i))
# }
# instead, we generate 25 plots & put in one list
# this avoids cluttering our R Environment with a bunch of plot objects
plotlist <- lapply(
seq(1,25),
function(i) funhist(dat, i)
)
wrap_plots(plotlist)
# use patchwork::wrap_plots() to display all plots in a list
# https://patchwork.data-imaginist.com/articles/guides/assembly.html
For the most part, these sample distributions look like they resemble the population from which they were drawn. However, these are only 25 of the many, many unique samples that we could have drawn of the same sample size - that is, these are only 25 out of all the potential samples that make up the sampling distribution of size n=100 from this population. If you can imagine how many unique samples of n=100 could have possibly been drawn from this population of N=240k, then you can imagine a sampling distribution of sample size n=100 values!
Now, what if we had drawn smaller samples of, say, n=30 “cars” (i.e., average mpg values) instead? Let’s create a comparable plot of 25 samples each comprised of n=30 average mpg values drawn from our simulated population data of 240k cars. We hope this will help you visualize how the sampling distribution for n=30 cars differs from the sampling distribution for n=100 cars.
set.seed(1138)
#simulate 25 samples of size n=30
mpg2015s30rep <- mpg2015pop %>%
rep_slice_sample(n = 30, reps = 25)
#assign new sim data to dat object to call in function
dat <- mpg2015s30rep
#generate 25 plots in a list
plotlist <- lapply(
seq(1,25),
function(i) funhist(dat, i)
)
#combine 25 plots of sample size n=30
wrap_plots(plotlist)
The first thing you should notice is how different each of these samples is from one another and from the underlying normal probability distribution of the population from which they were drawn. These plots are also much more variable than the plots above for the 25 samples of n=100 cars, despite each sample being drawn from the same population of 240k simulated cars.
This should help you visualize the concept of sampling variability and the central limit theorem. If you draw samples of size “n” from a population, those samples will vary (be different) from each other and from the population. Smaller sample sizes (n) and more heterogeneous populations result in more sampling variability, or greater variation among the samples in a sampling distribution. Conversely, increasing the sample size (n) reduces sampling variability and, instead, results in a sampling distribution comprised of potential samples that are more representative of the underlying population.
To illustrate, let’s create this plot once again for 25 samples of size n=1,000 “cars” (average mpg values) drawn from our simulated population of 240k cars.
set.seed(1138)
#simulate 25 samples of size n=1000
mpg2015s1000rep <- mpg2015pop %>%
rep_slice_sample(n = 1000, reps = 25)
#assign new sim data to dat object to call in function
dat <- mpg2015s1000rep
#generate 25 plots in a list
plotlist <- lapply(
seq(1,25),
function(i) funhist(dat, i)
)
#combine 25 plots of sample size n=1000
wrap_plots(plotlist)
Notice how each of these 25 samples of n=1,000 simulated cars has an average mpg distribution that quite closely resembles the population distribution that generated these samples? Moreover, while each sample is different, they are far more similar to one another compared to the distributions found in the plot of n=30 samples above.
This is a primary reason that sample size matters: Larger random samples are more representative of the populations from which they are drawn than are smaller samples. Likewise, sample statistics are more likely to generate valid estimates of population parameters and inferences about populations are more likely to be valid when we use large samples rather than small samples.
To further illustrate this point, let’s visualize the
sampling distribution of the mean for our n=30 samples.
To do this, we will start by calculating the mean of each of the 25
samples of n=30 mpg values drawn from our simulated car mpg data;
we will display the first few rows of sample mean mpg values using
head()
. Then, we will plot each of the 25 sample means as a
dot in a dot histogram using geom_dotplot()
.
# simulate 25 samples of size n=30
# mpg2015s30rep <- mpg2015pop %>%
# rep_slice_sample(n = 30, reps = 25)
mus30rep25 <- mpg2015s30rep %>%
group_by(replicate) %>%
summarize(mean_mpg = mean(mpg2015))
head(mus30rep25) %>% gt() %>%
tab_header(title = md("First few rows of N=25 sample mean mpg"),
subtitle = md("(Sample size n=30 per sample)")
)
**First few rows of N=25 sample mean mpg**
*(Sample size n=30 per sample)*

| replicate | mean_mpg |
|---|---|
| 1 | 22.29934 |
| 2 | 22.46044 |
| 3 | 21.08830 |
| 4 | 21.47506 |
| 5 | 22.56471 |
| 6 | 22.76905 |
mus30rep25plot <- mus30rep25 %>% ggplot(aes(x=mean_mpg)) +
geom_dotplot(method="histodot", color="#ed6925", fill="#fbb61a", binwidth=.25,
dotsize=0.7) +
labs(x= "N=25 samples") +
theme_minimal() +
coord_cartesian(expand = FALSE, ylim=c(0,1)) +
scale_x_continuous(limits=c(17, 27), breaks=seq(18,26,2)) +
scale_y_continuous(NULL, breaks = NULL, name="Sample n=30")
mus30rep25plot + plot_annotation(
title = 'Visualizing Sampling Distribution of Mean for Sample Size n=30')
If we kept drawing more samples of n=30 “cars” each from our simulated population, how variable would the sample means from each of the different samples be? The plot above gives us a sense of the spread or variability of the sampling distribution of sample means for n=30 simulated cars. We can get a better sense by drawing more samples of the same size (n=30 each) and plotting those means as well.
We will do this below by drawing N=250 and N=2,500 samples of n=30 cars each from our simulated population of 240k cars; then we will calculate the mean for each of the 250 and 2,500 samples, respectively; and, finally, we will plot these sample mean mpg values and combine the plots using patchwork.
set.seed(1138)
# write function to draw N=Nrep simulated samples of size n=nsamp from sim pop
# then calculate sample mean from each Nrep sample
funmeans = function(nsamp, Nrep){
mpg2015pop %>%
rep_slice_sample(n = nsamp, reps = Nrep) %>%
group_by(replicate) %>%
summarize(mean_mpg = mean(mpg2015))
}
mus30rep250 <- funmeans(30,250)
head(mus30rep250) %>% gt() %>%
tab_header(title = md("First few rows of N=250 sample mean mpg**"),
subtitle = md("(Sample size n=30 per sample)")
)
**First few rows of N=250 sample mean mpg**
*(Sample size n=30 per sample)*

| replicate | mean_mpg |
|---|---|
| 1 | 22.29934 |
| 2 | 22.46044 |
| 3 | 21.08830 |
| 4 | 21.47506 |
| 5 | 22.56471 |
| 6 | 22.76905 |
mus30rep2500 <- funmeans(30,2500)
head(mus30rep2500) %>% gt() %>%
tab_header(title = md("First few rows of N=2,500 sample mean mpg**"),
subtitle = md("(Sample size n=30 per sample)")
)
**First few rows of N=2,500 sample mean mpg**
*(Sample size n=30 per sample)*

| replicate | mean_mpg |
|---|---|
| 1 | 22.50078 |
| 2 | 19.90967 |
| 3 | 22.39007 |
| 4 | 21.76854 |
| 5 | 24.25293 |
| 6 | 21.12095 |
# create function to quickly generate dot histograms
fundots = function(dat, binw, dsiz, xlabel){
dat %>% ggplot(aes(x=mean_mpg)) +
geom_dotplot(method="histodot", color="#ed6925", fill="#fbb61a", binwidth=binw,
dotsize=dsiz) +
labs(x= xlabel) +
theme_minimal() +
coord_cartesian(expand = FALSE, ylim=c(0,1)) +
scale_x_continuous(limits=c(17, 27), breaks=seq(18,26,2)) +
scale_y_continuous(NULL, breaks = NULL)
}
#generate plots
mus30rep250plot <- fundots(mus30rep250, .15, .5, "N=250 samples")
mus30rep2500plot <- fundots(mus30rep2500, .05, .4, "N=2,500 samples")
#combine plots
mus30rep25plot + mus30rep250plot + mus30rep2500plot +
plot_layout(ncol = 3) +
plot_annotation(
title = 'Visualizing Sampling Distribution of Mean for Sample Size n=30')
As you can see, the sampling distribution of sample means for samples of size n=30 clusters around the true population mean (=22), but the individual sample means range from around mean mpg=18 to mean mpg=25.
Let’s repeat this process for sample sizes n=100 and n=1,000.
set.seed(1138)
mus100rep25 <- funmeans(100,25)
mus100rep250 <- funmeans(100,250)
mus100rep2500 <- funmeans(100,2500)
mus1000rep25 <- funmeans(1000,25)
mus1000rep250 <- funmeans(1000,250)
mus1000rep2500 <- funmeans(1000,2500)
#generate plots
mus30rep25plot <- fundots(mus30rep25, .25, .7, NULL) +
scale_y_continuous(name="Sample n=30") +
theme(axis.text.y = element_blank(), axis.ticks = element_blank())
mus30rep250plot <- fundots(mus30rep250, .15, .5, NULL)
mus30rep2500plot <- fundots(mus30rep2500, .05, .4, NULL)
mus100rep25plot <- fundots(mus100rep25, .25, .7, NULL)+
scale_y_continuous(name="Sample n=100") +
theme(axis.text.y = element_blank(), axis.ticks = element_blank())
mus100rep250plot <- fundots(mus100rep250, .15, .5, NULL)
mus100rep2500plot <- fundots(mus100rep2500, .05, .4, NULL)
mus1000rep25plot <- fundots(mus1000rep25, .25, .7, "N=25 samples") +
scale_y_continuous(name="Sample n=1,000") +
theme(axis.text.y = element_blank(), axis.ticks = element_blank())
mus1000rep250plot <- fundots(mus1000rep250, .15, .5, "N=250 samples")
mus1000rep2500plot <- fundots(mus1000rep2500, .05, .4, "N=2,500 samples")
#combine plots
mus30rep25plot + mus30rep250plot + mus30rep2500plot +
mus100rep25plot + mus100rep250plot + mus100rep2500plot +
mus1000rep25plot + mus1000rep250plot + mus1000rep2500plot +
plot_layout(ncol = 3) +
plot_annotation(
title = 'Visualizing Sampling Distributions of the Mean, Sample Sizes n=30, n=100, & n=1,000')
Once again, the sampling distributions of sample means for samples of size n=30, n=100, and n=1,000 all cluster around the true population mean (=22). However, the spread of the sampling distribution of sample means diminishes as the sample size increases. For instance, while the individual sample means from the n=30 samples have a relatively wide range from around mean mpg=18 to mean mpg=25, the individual sample means cluster more tightly around the population mean value (=22) and the range of sample mean values decreases substantially as the sample size increases (e.g., to n=100 or n=1,000).
So, what does this mean for a researcher who draws a random sample from a population with the goal of estimating and making inferences about population parameters? It means that research designs relying on small samples, which are more variable, are more likely to generate invalid population estimates and inferences than are research designs relying on large samples.
The `mtcars` data

We hope these plots have helped you better understand sampling distributions of the sample mean. Now, let’s move on to examples that help illustrate the logic of a one-sample hypothesis test. For now, we will build on the mpg example by bringing in the “mtcars” dataset that comes built into R. First, let’s view those data.
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# next assignment, use starwars data!
# data(starwars)
# https://cran.r-project.org/web/packages/janitor/vignettes/tabyls.html
Built-in datasets are extremely useful for creating examples - and especially for sharing or troubleshooting code in a reproducible way by using data that are universally available to the community of R users.
To learn more about the “mtcars” data, type ?mtcars
in
your console (or a code chunk) and hit enter (or run). Doing so will
bring up a help page, where you will learn that the data were “extracted
from the 1974 Motor Trend US magazine, and comprises fuel consumption
and 10 aspects of automobile design and performance for 32 automobiles
(1973–74 models).”
So, these are not a random sample of cars - but that will not stop us from imagining that they are a random sample! Specifically, let’s pretend the mtcars data were randomly selected from the population of cars in 1974. Then, imagine that we want to know if the true population mean miles per gallon (mpg) in 1974 was equivalent to that in 2015, and we want to use the “random” (just pretend with us here) mtcars sample to make an inference about the average mpg of the population of cars in 1974.
With null hypothesis testing, we cannot answer our question directly using a one-sample t-test and the data available to us. However, here is what we can do: (1) assume the true 1974 population mean mpg equals the 2015 value of 22 (our null hypothesis); (2) use the sample data to calculate how improbable our observed sample mean would be if that null hypothesis were true; and (3) either reject or fail to reject the null hypothesis by comparing that probability (the p-value) to a pre-specified alpha level.
Let’s try it. First, we start by examining the sample distribution of “mpg” in the mtcars data in a frequency table and a histogram.
mtcars %>% frq(mpg)
## mpg <numeric>
## # total N=32 valid N=32 mean=20.09 sd=6.03
##
## Value | N | Raw % | Valid % | Cum. %
## ------------------------------------
## 10.40 | 2 | 6.25 | 6.25 | 6.25
## 13.30 | 1 | 3.12 | 3.12 | 9.38
## 14.30 | 1 | 3.12 | 3.12 | 12.50
## 14.70 | 1 | 3.12 | 3.12 | 15.62
## 15.00 | 1 | 3.12 | 3.12 | 18.75
## 15.20 | 2 | 6.25 | 6.25 | 25.00
## 15.50 | 1 | 3.12 | 3.12 | 28.12
## 15.80 | 1 | 3.12 | 3.12 | 31.25
## 16.40 | 1 | 3.12 | 3.12 | 34.38
## 17.30 | 1 | 3.12 | 3.12 | 37.50
## 17.80 | 1 | 3.12 | 3.12 | 40.62
## 18.10 | 1 | 3.12 | 3.12 | 43.75
## 18.70 | 1 | 3.12 | 3.12 | 46.88
## 19.20 | 2 | 6.25 | 6.25 | 53.12
## 19.70 | 1 | 3.12 | 3.12 | 56.25
## 21.00 | 2 | 6.25 | 6.25 | 62.50
## 21.40 | 2 | 6.25 | 6.25 | 68.75
## 21.50 | 1 | 3.12 | 3.12 | 71.88
## 22.80 | 2 | 6.25 | 6.25 | 78.12
## 24.40 | 1 | 3.12 | 3.12 | 81.25
## 26.00 | 1 | 3.12 | 3.12 | 84.38
## 27.30 | 1 | 3.12 | 3.12 | 87.50
## 30.40 | 2 | 6.25 | 6.25 | 93.75
## 32.40 | 1 | 3.12 | 3.12 | 96.88
## 33.90 | 1 | 3.12 | 3.12 | 100.00
## <NA> | 0 | 0.00 | <NA> | <NA>
mtmpgplot <- mtcars %>% ggplot() +
geom_histogram(aes(x = mpg, y = after_stat(density)), color="#1fa187", fill="#4ac16d",
position="identity", breaks = seq(0, 45, by = 2)) +
labs(x= "Sample distribution of mpg values in mtcars data") +
theme_minimal() +
coord_cartesian(expand = FALSE, xlim=c(0,45), ylim=c(0,.1))
mtmpgplot
Now, let’s overlay a line representing the distribution of our null
hypothesis, which here is a normal curve with a mean=22 and a standard
deviation that we will estimate using the sample sd value
(sd(mtcars$mpg)
). We will also draw a vertical solid line
at the mean of this null distribution (mean=22). Finally, we will add a
second dashed vertical line at the mean value of the mtcars sample
distribution for mpg.
These additions will hopefully help you better visualize the comparison of the mtcars sample estimate with our assumed population parameter value. Recall, the aim of our test is to determine how improbable it would be to get a “random” sample of this size and variation that generated a sample mean as far or further from mean=22 than the one we observed.
mtmpgplot <- mtcars %>% ggplot() +
geom_histogram(aes(x = mpg, y = after_stat(density)), color="#1fa187", fill="#4ac16d",
position="identity", breaks = seq(0, 45, by = 2)) +
labs(x= "Sample distribution of mpg values in mtcars data") +
stat_function(fun = dnorm, alpha=.7,
args = list(mean = 22, sd = sd(mtcars$mpg)),
color='#d44842', size=1.1) +
geom_vline(xintercept = 22, color = "#d44842", size=1.2, alpha=0.7) +
geom_vline(xintercept = mean(mtcars$mpg), linetype="dashed",
color = "#365c8d", size=1.2, alpha=0.7) +
theme_minimal() +
coord_cartesian(expand = FALSE, xlim=c(0,45), ylim=c(0,.125))
mtmpgplot
Now, we conduct our t-test. You read about how to calculate z- and t-test statistics in Chapter 8. We will also spend more time on how this t-standardized test statistic is calculated in the next section.
t.test(mtcars$mpg, mu = 22, alternative = "two.sided")
##
## One Sample t-test
##
## data: mtcars$mpg
## t = -1.7921, df = 31, p-value = 0.08288
## alternative hypothesis: true mean is not equal to 22
## 95 percent confidence interval:
## 17.91768 22.26357
## sample estimates:
## mean of x
## 20.09062
Our t-standardized test statistic is t=-1.79. With this t-value and 31 degrees of freedom (df=n-1), the associated p-value is 0.083. Assuming data, measurement, and modeling assumptions are valid (e.g., assuming the mtcars data were actually a random sample), we can interpret this p-value as the probability of observing a sample mean at least as far away from mean=22 as the observed sample mean (20.09) if the null hypothesis that the 1974 population mpg mean=22 were true.
Since the p-value is greater than our pre-specified alpha level (p>0.05), the appropriate inference would be to fail to reject the null hypothesis. In this case, there is insufficient evidence to reject the possibility that the true 1974 population mean mpg was equal to 22. In essence, if the true 1974 population mean mpg really was equal to 22, then it would not be all that improbable to draw a random sample of n=32 cars with a mean mpg of 20.09 (or more extreme) simply due to sampling variability.
To further illustrate this point, let’s return to our simulated population data to examine just how rare it would be to draw samples of n=32 cars with means that are at least this extreme. We will draw N=10,000 random samples of n=32 “cars” each from our simulated population of 240k mpg values, which we know were simulated from a normal distribution with a mean of 22 and a standard deviation of 6. (Note: The standard deviation of the mpg values in the mtcars data equals 6.03).
# If we randomly select N=10,000 samples of n=32 cars each from our simulated population, what proportion of sample means would be at least as extreme as the mtcars data?
set.seed(1138)
funmeans(32,10000) %>%
filter(mean_mpg<=20.09062 | mean_mpg>=23.90938) %>%
summarise(n_extreme = n()) %>%
mutate(
prop_extreme = n_extreme / 10000,
pct_extreme = (n_extreme / 10000)*100) %>%
gt() %>%
tab_header(title = md("Percent of sample means as/more extreme than mtcars data"),
subtitle = md("(N=10,000 samples of size n=32 from simulated cars population)")
)
**Percent of sample means as/more extreme than mtcars data**
*(N=10,000 samples of size n=32 from simulated cars population)*

| n_extreme | prop_extreme | pct_extreme |
|---|---|---|
| 747 | 0.0747 | 7.47 |
#alt method with count
# set.seed(1138)
# funmeans(32,10000) %>%
# count(mean_mpg<=20.09062 | mean_mpg>=23.90938)
So, after drawing N=10,000 random samples from our simulated population that we know has a mean=22 and sd=6, we found about 7.5% of those samples had a sample mean mpg that was at least as extreme as the mean mpg value observed in the mtcars data. This is not exactly the same as the p-value we observed (0.083), but it is pretty close! (The small gap reflects simulation error as well as the fact that the t-test references a t distribution rather than repeated draws from a known normal population.)
Now, let’s move on to one-sample hypothesis test examples that are more similar to the SPSS exercises found at the end of Chapter 8 in your book.
Goal: Read in Youth data and manually conduct a one-sample hypothesis test
For this assignment, you will be working through Question #3 from
Chapter 8 SPSS Exercises in the book. That question asks you to conduct
a one-sample hypothesis test using the variable ‘moral’ from the Youth
data. Below, we will demonstrate how to complete the assignment steps
using an alternative variable - parnt2 - from the Youth data. This
ensures that you get the experience of calculating the answers and
working with R yourself using the same data. After working through the
process with us using the parnt2 variable, you will then simply repeat
the same steps with the ‘moral
’ variable afterwards.
Be sure to repeat these steps with the correct variable from the
assignment to ensure you receive full credit.
(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR_MO_DY for the actual date. E.g., 2023_03_21_Ducate_CRIM5305_Assign10)
- Create (or open) your R project and a new R Markdown file for this assignment; because we are working within an R project, the `here` package will automatically set our CRIM5305_L folder as the top-level directory.
- Load all of the packages we will use in this assignment: `tidyverse`, `haven`, `here`, `sjmisc`, `sjPlot`, `gt`, `patchwork`, and `infer`. (A minimal setup sketch follows this list.)
  - This should be the first time you are using the `infer` package, which means you’ll need to download it first using `install.packages()` or the “Install Packages” button under the “Tools” tab. Then, you can use `library()` to load it in.
  - While we will briefly introduce you to the popular and easy-to-use `t.test()` function from the base R stats package, we will also show you how to conduct t-tests using the `infer` package. This package is a tidy-style, pipe-friendly alternative for conducting and visualizing common inferential tests.
- Insert another R code chunk.
- In the new R code chunk, read in the “Youth_0.sav” SPSS datafile and assign it to an R object named `YouthData`.
- In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data object: `YouthData`.
  - Use the `head()` function (piped to `gt()`) to print the first six rows so you can compare your output.
output.YouthData <- read_spss(here("Datasets", "Youth_0.sav"))
YouthData %>% head() %>% gt()
| Gender | v2 | v21 | v22 | v63 | v77 | v79 | v109 | v119 | parnt2 | fropinon | frbehave | certain | moral | delinquency | d1 | hoursstudy | filter_$ | heavytvwatcher | studyhard | supervision | drinkingnotbad | Lowcertain_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 15 | 36 | 15 | 3 | 1 | 3 | 5 | 1 | 5 | 18 | 9 | 9 | 19 | 8 | 1 | 15 | 1 | 1 | 1 | 0 | 0 | 1 |
| 0 | 15 | 3 | 5 | 4 | 1 | 2 | 4 | 1 | 8 | 11 | 6 | 11 | 19 | 10 | 1 | 5 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 15 | 20 | 6 | 4 | 1 | 1 | 4 | 2 | 8 | 12 | 5 | 11 | 20 | 1 | 0 | 6 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0 | 15 | 2 | 4 | 2 | 1 | 2 | 5 | 3 | 4 | 9 | 5 | 13 | 19 | 1 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 |
| 0 | 14 | 12 | 2 | 4 | 3 | 5 | 5 | 3 | 7 | 27 | 9 | 4 | 18 | 104 | 1 | 2 | 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 15 | 1 | 3 | 4 | 1 | 1 | 5 | 3 | 7 | 8 | 9 | 10 | 20 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 |
- Create a frequency table of the `parnt2` variable: insert a new code chunk, add a descriptive comment (e.g., “# frequency table for ‘parnt2’”), and then type `YouthData %>% frq(parnt2)`.

YouthData %>%
  frq(parnt2)
## parental supervision scale (parnt2) <numeric>
## # total N=1272 valid N=1272 mean=6.23 sd=1.38
##
## Value | N | Raw % | Valid % | Cum. %
## --------------------------------------
## 2 | 6 | 0.47 | 0.47 | 0.47
## 3 | 16 | 1.26 | 1.26 | 1.73
## 4 | 163 | 12.81 | 12.81 | 14.54
## 5 | 139 | 10.93 | 10.93 | 25.47
## 6 | 440 | 34.59 | 34.59 | 60.06
## 7 | 189 | 14.86 | 14.86 | 74.92
## 8 | 319 | 25.08 | 25.08 | 100.00
## <NA> | 0 | 0.00 | <NA> | <NA>
Next, we will manually calculate a one-sample z-statistic; afterwards, we will conduct the same test with the `t_test` function from the `infer` package.

- Our null hypothesis is that the population mean of the `parnt2` variable equals 6.0. Our alternative hypothesis is that the population mean of the `parnt2` variable in our data is different from 6.0 (i.e., a two-sided hypothesis test).
- In a new code chunk, type `n <- length(YouthData$parnt2)`. Then, hit enter, type `n`, and your RStudio should give you the value 1,272. The `length()` function tells R to output the length of the dataset, which is equivalent to the number of observations or participants in this particular dataset.
- Next, type `z_stat <- (mean(YouthData$parnt2) - 6) / (sd(YouthData$parnt2) / sqrt(n))`. Then, hit enter and type `z_stat` to view your z-statistic. This line calculates \(z = (\bar{x} - \mu) / (s / \sqrt{n})\), where `sd` is the standard deviation; `sqrt(n)` is the square root of n; and our (null) hypothesized population mean \(\mu\) = 6.

#alpha = .05
#z-crit = 1.96
#t-crit = 1.96 (df > 120)
#get n
n <- length(YouthData$parnt2)
n
## [1] 1272
# calculate the z-statistic
z_stat <- (mean(YouthData$parnt2) - 6) / (sd(YouthData$parnt2) / sqrt(n))
z_stat
## [1] 5.880774
Goal: Conduct a one-sample hypothesis test with
infer::t_test()
In the last part of the assignment, we will illustrate how to conduct a one-sample t-test using the `t_test()` function from the `infer` package. Recall, we illustrated the base R `t.test()` function in our `mtcars` example earlier. For another example of conducting a one-sample t-test by hand and with the `t.test()` function in base R, see here.
In comparison, the infer
alternative allows us to run basic
statistical tests such as a one-sample t-test within a
tidyverse framework that permits sequencing functions together using
tidyverse-style pipes. We will soon illustrate why such piping can be
quite helpful.
- Insert a new code chunk and add a descriptive comment (e.g., “# one-sample t-test of the ‘parnt2’ variable”).
- To conduct the test with `infer::t_test()`, type `YouthData %>% t_test(parnt2 ~ 1, mu = 6, conf_level = 0.90)` and then click run to generate results of your t-test.
and then click run to generate results of your t-test.#infer
t_stat <- YouthData %>% infer::t_test(response = parnt2, mu = 6, conf_level = 0.90)
t_stat
## # A tibble: 1 × 7
## statistic t_df p_value alternative estimate lower_ci upper_ci
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 5.88 1271 0.00000000521 two.sided 6.23 6.16 6.29
#NOTE: if you load both rstatix & infer, there is conflict over t_test function call
# alternative: calculating t-score using rstatix
#https://www.datanovia.com/en/lessons/t-test-in-r/
# YouthData %>% rstatix::t_test(parnt2 ~ 1, mu = 6)
The `infer::t_test` function also outputs this p-value automatically (p = 5.2137228 × 10^-9). In this case, the p-value is much smaller than our pre-specified alpha threshold of 0.10, so again we would reject the null hypothesis of \(\mu\)=6.

The `infer` package also employs simulation methods similar to those we introduced in the beginning of this assignment to make it relatively quick and easy to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis. The tidy-style piping comes in handy for this part, which involves executing a three-step process:

a. Calculate the observed sample statistic, here the mean of the `parnt2` variable in `YouthData`. The code below specifies our `YouthData` data object, specifies `parnt2` as the response variable, then calculates the mean and assigns it to the object we named `observed_statistic`.
b. Generate the null distribution. We again specify the response variable (`parnt2`), hypothesize the (null) point estimate for the population mean as mu = 6, then generate 10,000 simulated samples (using bootstrapping methods to estimate sampling variation) and calculate the mean of each sample before assigning these simulated sample means to an object we named `null_dist_1_sample`.
c. With both the observed sample mean (`observed_statistic`) and the simulated mean values from the null distribution (`null_dist_1_sample`), we can use `infer::visualize()` to plot these together.

If you wish to change the color of the shaded p-value region, add `color =` to the `shade_p_value` function and specify your new color just as you did for the boxplots in a previous assignment.
assignment.# a. calculate the observed statistic
observed_statistic <- YouthData %>%
specify(response = parnt2) %>%
calculate(stat = "mean")
# b. generate the null distribution
null_dist_1_sample <- YouthData %>%
specify(response = parnt2) %>%
hypothesize(null = "point", mu = 6) %>%
generate(reps = 10000, type = "bootstrap") %>%
calculate(stat = "mean")
# c. visualize the null distribution and test statistic!
null_dist_1_sample %>%
visualize() +
shade_p_value(observed_statistic,
direction = "two-sided",
color = "turquoise")
The observed mean of our `parnt2` variable falls well outside this range, consistent with our decision to reject the null hypothesis.

For one more example, let’s repeat the test with a different null hypothesis (mu = 6.2) using the `infer` package!

# conduct t-test
YouthData %>% infer::t_test(response = parnt2, mu = 6.2, conf_level = 0.90)
## # A tibble: 1 × 7
## statistic t_df p_value alternative estimate lower_ci upper_ci
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 0.722 1271 0.470 two.sided 6.23 6.16 6.29
# a. calculate the observed statistic
observed_statistic <- YouthData %>%
specify(response = parnt2) %>%
calculate(stat = "mean")
# b. generate the null distribution
null_dist_1_sample <- YouthData %>%
specify(response = parnt2) %>%
hypothesize(null = "point", mu = 6.2) %>%
generate(reps = 10000, type = "bootstrap") %>%
calculate(stat = "mean")
# c. visualize the null distribution and test statistic!
null_dist_1_sample %>%
visualize() +
shade_p_value(observed_statistic,
direction = "two-sided")
Now, repeat all of these hypothesis-testing steps on your own with the `moral` variable in the `YouthData` object.
Do you recall how to:

- draw a random sample with `dplyr::slice_sample()`?
- draw repeated random samples with `infer::rep_slice_sample()`?
- use `lapply()` to create a list of objects, which can help you avoid cluttering the R Environment with objects?
- use `patchwork::wrap_plots()` to quickly combine ggplot objects contained in a list?
- use the `tail()` function to quickly view the last few rows of data?
- use built-in datasets, like `mtcars`, that are universally available to R users?
- conduct a one-sample t-test with the `t.test()` function?
- conduct a one-sample t-test with `infer::t_test()`?
- use `infer::visualize()` to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis?