Assignment 10 Objectives

In the last assignment, you learned how to estimate and interpret confidence intervals around a point estimate (e.g., sample mean or proportion). You also learned how to simulate data from basic probability distributions to help you better understand sampling variability and the need for interval estimates. In this assignment, you will learn how to conduct a two-tailed z-test and t-test and then, given the test results and the null hypothesis, to make an appropriate inference about the population parameter by either rejecting or failing to reject the null hypothesis.

Before you conduct your own hypothesis tests, we will first simulate population data from a normal probability distribution, then take random samples from our simulated population data and plot features of these samples. Our aim will be to help you visualize the sampling distribution of a sample mean, which should lead to a better understanding of the underlying mechanisms that allow us to make valid population inferences from samples with null hypothesis significance testing. While we will not expect you to do all of these tasks yourself, by providing our code along the way, we hope these examples will help you gain a better sense of how one might conduct simulations, create handy user-written functions, and generate more complex visualizations in R.

By the end of Assignment 10, you should…

  • know how to draw random samples from data in R
    • know how to draw one random sample with dplyr::slice_sample()
    • know how to draw multiple (“replicate”) random samples with infer::rep_slice_sample()
  • recognize that you can use lapply() to create a list of objects, which can help you avoid cluttering the R Environment with objects
  • recognize you can use patchwork::wrap_plots() to quickly combine ggplot objects contained in a list
  • be able to use the base R tail() function to quickly view the last few rows of data
  • know that you can share examples or troubleshoot code in a reproducible way by using built-in datasets like mtcars that are universally available to R users
  • be able to conduct a one-sample z or t hypothesis test in R and interpret results
    • be able to conduct a one-sample test using the base R t.test() function
    • be able to manually calculate a z or t test statistic by typing the formula in R
    • be able to conduct a one-sample test using infer::t_test()
    • know how to use infer::visualize() to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis

Assumptions & Ground Rules

We are building on objectives from Assignments 1-9. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • knit your RMD document into an HTML file that you can then save and submit for course credit
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • recognize the R operator != as “not equal to”
  • turn off or change scientific notation in R, such as options(scipen=999, digits = 3)
  • create a list or vector and assign to object, such as listname <- c(item1, item2)
  • recognize that you can create your own R functions (e.g., our funxtoz() function) - and that doing so is recommended for duplicate tasks to avoid copy-and-paste errors

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method
  • improve reproducibility of randomization tasks in R by setting the random number generator seed using set.seed()

Data viewing & wrangling

  • use the base R head() function to quickly view a snapshot of your data
  • use the glimpse() function to quickly view all columns (variables) in your data
  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • remove missing observations from a variable in R when appropriate using filter(!is.na(var))
  • change a numeric variable to a factor (e.g., nominal or ordinal) variable with haven::as_factor()
  • drop an unused factor level (e.g., missing “Don’t know” label) on a variable using data %>% droplevels(data$variable)
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions
  • convert raw column (variable) values into standardized z-score values using mutate()
  • select a random sample from data with or without replacement using dplyr::sample_n()
  • select data with conditions using dplyr::filter() and %in% operator
  • simulate data from normal, truncated normal, or uniform probability distributions using rnorm(), truncnorm::rtruncnorm(), or runif()

Descriptive data analysis

  • use summarytools::dfSummary() to quickly describe one or more variables in a data file
  • create frequency tables with sjmisc::frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()
  • calculate measures of central tendency for a frequency distribution
    • calculate central tendency using base R functions mean() and median() (e.g., mean(data$variable))
    • calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() and psych::describe() functions
  • calculate measures of dispersion for a variable distribution
    • calculate dispersion measures by hand from frequency tables you generate in R
    • calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc::frq() or summarytools::descr())
  • recognize and read the basic elements of a contingency table (aka crosstab)
    • place IV in columns and DV in rows of a crosstab
    • recognize column/row marginals and their overlap with univariate frequency distributions
    • calculate marginal, conditional, and joint (frequentist) probabilities
    • compare column percentages (when an IV is in columns of a crosstab)
  • generate and modify a contingency table (crosstab) in R with dplyr::select() & sjPlot::sjtab(depvar, indepvar)

Data visualization & aesthetics

  • improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
    • modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
  • create basic graphs using ggplot2’s ggplot() function
    • generate simple bar charts and histograms to visualize shape and central tendency of a frequency distribution
    • generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution
  • modify elements of a ggplot object
    • change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
    • add or change a preset theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
    • select colors from a colorblind accessible palette (e.g., using viridisLite::viridis()) and specify them for the outline and fill colors in a ggplot geometric object (e.g., geom_boxplot())
    • add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title"))
  • be able to combine multiple ggplot() plots into a single figure using “patchwork” package
    • know how to customize plot layout and add title to patchwork figure
    • recognize that one can write a custom function to repeatedly generate similar plots before combining them with patchwork
  • recognize that one can plot means and confidence intervals using ggplot() + geom_point() + geom_errorbar()
  • recognize that one can add elements like vertical lines (+ geom_vline()), arrows (+ geom_segment()), or text (+ annotate()) to a ggplot() object

Hypothesis testing & statistical inference

  • conduct and interpret a null hypothesis significance test
    • specify null (test) hypothesis & identify contrasting alternative hypothesis (or hypotheses)
    • set an alpha or significance level (e.g., as risk tolerance or false positive error control rate)
    • calculate a test statistic and corresponding p-value
    • compare a test p-value to alpha level and then determine whether the evidence is sufficient to reject the null hypothesis or should result in a failure to reject the null hypothesis
  • conduct a binomial hypothesis test in R with rstatix::binom_test()
  • generate and interpret confidence intervals to quantify uncertainty caused by sampling variability
    • identify t or z critical values associated with a two-tailed confidence level using qt() or qnorm()
    • estimate the standard error of a sample mean or proportion in R
    • estimate a two-tailed confidence interval around a sample mean or proportion in R
    • properly interpret and avoid common misinterpretations of confidence intervals
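As a quick refresher on identifying those critical values, the conventional two-tailed 95% cutoffs come straight from the quantile functions (the df = 29 value is just an illustrative choice for a sample of n=30):

```r
# two-tailed critical values at the 95% confidence level (alpha = .05)
qnorm(0.975)        # z critical value, approximately 1.96
qt(0.975, df = 29)  # t critical value for an n=30 sample (df = n-1), approximately 2.05
```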

If you do not recall how to do these things, review Assignments 1-9.

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

  • Sampling variation (aka, sampling variability)
  • Hypothesis testing
    • Null hypothesis
    • Nondirectional & two-tailed hypothesis tests
    • Directional & one-tailed hypothesis tests
  • One-sample hypothesis tests for population means and proportions and how to calculate by hand:
    • a one-sample z test for the difference between an observed sample mean and a given population mean for large samples
    • a one-sample t test for the difference between an observed sample mean and a given population mean for small samples
    • a one-sample z test for the difference between an observed sample proportion and a given population proportion for large samples

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).


Visualizing the Sampling Distribution of a Sample Mean

Goal: Understand the sampling distribution of a sample mean as well as the relationship between sample size and sampling variability

After learning about how confidence intervals help us quantify and communicate uncertainty in statistical estimates of population parameters, this week’s course materials focused on making inferences about population parameters by conducting null hypothesis tests. Briefly, this process involves:

  • Making assumptions that our sample statistics can be used as meaningful estimates of population parameters that we are interested in.
  • Making a baseline assumption about the population, such as by specifying a “null hypothesis” (e.g., of no difference in means, or of a specific population value).
  • Selecting an appropriate test distribution (e.g., z or t distribution)
  • Specifying an alpha level (i.e., a false positive risk tolerance level or error control rate) and a rejection region in the test (e.g., z or t) distribution
  • Converting our sample statistic into a standardized test statistic using the test (e.g., z or t) distribution and our initial population assumption (e.g., null hypothesis).
  • Comparing our sample’s standardized (z or t) test statistic, or its associated probability (aka p-value), to the critical value (or probability) set by our predetermined alpha level.
    • Making an appropriate inference about the population parameter based on the sample statistic and null hypothesis by rejecting or failing to reject the null hypothesis.
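The steps above can be sketched in a few lines of base R. This is a hypothetical illustration: xbar, mu0, sigma, and n below are made-up values, not from any dataset in this assignment.

```r
# illustrative one-sample, two-tailed z test (all values assumed for the example)
xbar  <- 23.1   # observed sample mean
mu0   <- 22     # population mean under the null hypothesis
sigma <- 6      # known population standard deviation
n     <- 100    # sample size
alpha <- 0.05   # predetermined alpha level

z <- (xbar - mu0) / (sigma / sqrt(n))  # standardized test statistic, about 1.83
p <- 2 * pnorm(-abs(z))                # two-tailed p-value, about 0.067
p < alpha                              # FALSE here, so we fail to reject H0
```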

While anyone can learn how to apply these steps to generate p-values and then make claims about null and research hypotheses, a true understanding of the process of statistical inference requires one to understand the central limit theorem and the theoretical concept of a sampling distribution of a sample statistic. You were first introduced to these concepts in Chapter 6 of your book. Since the current chapter is about transitioning from estimation to statistical tests and population inferences, we think it is a good time to revisit the concept of a sampling distribution because it is essential to understanding statistical inferences via null hypothesis testing. Our hope is that we can help you visualize the sampling distribution of a sample mean so that you can better understand what we are doing - and what we are not doing - when we conduct a hypothesis test.

Typically, researchers start with a sample and then work backwards to make inferences about the population from which a sample was drawn. Here, we will start by simulating a known population from a probability distribution and then draw random samples from that population to illustrate how and why we can reliably make inferences about the population as a result of our knowledge of sampling distributions.

Visualizing random samples from a simulated population of car mpg values

To give our example a sense of realism, let’s use the average miles driven per gallon, or “mpg”, of the entire fleet of light-duty automobiles (cars & light trucks) in the US in 2015 as our population.

  • In 2015, US vehicles averaged about 22 mpg; we will use this as our baseline population mean mpg for all vehicles.
  • Let’s also assume that vehicle mpg is normally distributed in the population, with a standard deviation of 6. This means that about 68% of all vehicles on the road in 2015 would have had an average combined mpg between 16 (i.e., 22-1sd, or 22-6) and 28 (22+1sd, or 22+6), while about 95% of all vehicles on the road would have had an average combined mpg between 10 (22-12) and 34 (22+12).
  • Finally, while there were approximately 240,000,000 light-duty vehicles on the road in 2015, we will instead assume the population of US vehicles was 240,000 cars. Although the key lessons would remain the same with an even larger population, we would need far greater computing power and processing time to analyze hundreds of millions of cases instead of hundreds of thousands.
    • Note: Want to see for yourself? Just change the simulation below by adding three zeros to the number assigned to the Ncars object. Warning - be sure to save everything first and have plenty of RAM and processing power available!
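Given our distributional assumptions (mean=22, sd=6), you can verify the 68% and 95% figures above directly with pnorm():

```r
# probability a vehicle's mpg falls within 1 or 2 SDs of the mean, given N(22, 6)
pnorm(28, mean = 22, sd = 6) - pnorm(16, mean = 22, sd = 6)  # about 0.68
pnorm(34, mean = 22, sd = 6) - pnorm(10, mean = 22, sd = 6)  # about 0.95
```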

With our assumptions above, we now have what we need to simulate a population of cars with varying average combined mpg values. Below, we simulate a normally distributed population of N=240,000 vehicle mpg values with a mpg mean=22 and sd=6. Of course, we set the seed (random starting value) as well to ensure that we can reproduce the same sample each time. We assigned the simulated population data to an object (mpg2015pop) and then displayed the first 10 rows (out of 240,000 total) by piping the data object to head(10) and gt().

set.seed(1138)
mpgmeanpop <- 22
mpgsdpop <- 6
Ncars <- 240000 

mpg2015pop <- rnorm(n=Ncars, mean=mpgmeanpop, sd=mpgsdpop) %>% as_tibble_col(column_name = "mpg2015")
mpg2015pop %>% head(10) %>% gt()
mpg2015
18.38730
18.35317
24.62183
22.52139
18.49235
18.32817
33.74955
17.42020
23.74810
25.15090

Let’s check the descriptive statistics for our simulated population data.

mpg2015pop %>% descr(show=c("n","mean","sd","se","md","range")) %>% gt()
var n mean sd se md range
mpg2015 240000 21.99317 5.98781 0.01222257 21.9952 57.43 (-7.64-49.8)

As expected, these descriptive statistics closely match the population values that we specified (mean=22; sd=6). Next, we plot our simulated population of mpg values using ggplot() + geom_histogram().

#plot base figures
mpg2015fig <- mpg2015pop %>% ggplot() +
  geom_histogram(aes(x = mpg2015, y = after_stat(density)), color="#4ac16d", fill="#365c8d", 
                 position="identity", breaks = seq(0, 45, by = .1)) +
  labs(x= "Simulated population of auto mpgs in 2015 (N=240k)") + 
  theme_minimal() + 
  coord_cartesian(expand = FALSE)
mpg2015fig

Now, imagine we wanted to design a study focused on average auto mpg in 2015. It would be cost-prohibitive and likely impossible to gather data from the entire population. So, say we instead used state vehicle databases to collect a random sample of n=100 vehicles from this population, with the goal of drawing inferences about the population. We can illustrate this process by randomly drawing a sample of n=100 values from our simulated population data using the tidy dplyr::slice_sample() function. We assign this random sample to an object named mpg2015s100.

set.seed(1138)
mpg2015s100 <- mpg2015pop %>% slice_sample(n = 100)
mpg2015s100 %>% head() %>% gt()
mpg2015
36.89898
21.55515
26.54984
15.61276
25.23936
21.99027

Let’s check the descriptive statistics for our sample.

mpg2015s100 %>% descr(show=c("n","mean","sd","se","md","range")) %>% gt()
var n mean sd se md range
mpg2015 100 22.05171 6.508541 0.6508541 22.17755 31.12 (5.78-36.9)

Now let’s plot our sample in a histogram. We will also overlay the normal distribution representing our known population parameters (mean=22, sd=6) so we can visualize how close our sample distribution matches the population from which it was drawn.

#plot base figures
mpg2015s100fig <- mpg2015s100 %>% ggplot() +
  geom_histogram(aes(x = mpg2015, y = after_stat(density)), color="#1fa187", fill="#4ac16d", 
                 position="identity", breaks = seq(0, 45, by = 2)) +
  labs(x= "Random sample (n=100) of simulated auto mpgs in 2015") + 
  stat_function(fun = dnorm, alpha=.7,
                args = list(mean = 22, sd = 6), 
                                color='#365c8d', size=1.1) + 
  theme_minimal() + 
  coord_cartesian(expand = FALSE, xlim=c(0,45), ylim=c(0,.1))
mpg2015s100fig

Looking at the descriptive statistics and the histogram, this random sample varies somewhat yet nonetheless closely resembles the simulated population from which it was drawn. But would any random sample of the same size (n=100) be similarly close to the population distribution? Well, let’s see! We will start by drawing 25 samples of n=100 using the infer package’s rep_slice_sample() function, then plot all 25 samples.

As usual, we show the first six observations, which are the first rows of data in the first replicated sample of n=100, using the head() function. However, this time we also use the tail() function to show the last six observations, which correspond to the last six rows of data in the 25th sample of n=100.

#select 25 samples of n=100 each from simulated population data
set.seed(1138)
mpg2015s100rep <- mpg2015pop %>%
  rep_slice_sample(n = 100, reps = 25)
mpg2015s100rep %>% head() 
## # A tibble: 6 × 2
## # Groups:   replicate [1]
##   replicate mpg2015
##       <int>   <dbl>
## 1         1    36.9
## 2         1    21.6
## 3         1    26.5
## 4         1    15.6
## 5         1    25.2
## 6         1    22.0
mpg2015s100rep %>% tail() 
## # A tibble: 6 × 2
## # Groups:   replicate [1]
##   replicate mpg2015
##       <int>   <dbl>
## 1        25    32.6
## 2        25    11.7
## 3        25    30.3
## 4        25    17.8
## 5        25    25.4
## 6        25    31.5

Now we want to plot all 25 samples and combine the plots using patchwork. To do this efficiently, we need to write a function to generate the ggplot objects. We then present commented-out code that one could use to explicitly create and combine 25 plot objects, followed by the code we actually used to more efficiently create and assign these plots to a list before combining with the patchwork::wrap_plots() function.

set.seed(1138)
dat<-mpg2015s100rep

# create function to quickly generate histograms
funhist = function(dat, repnum){
  dat %>% filter(replicate==repnum) %>% ggplot() +
  geom_histogram(aes(x = mpg2015, y = after_stat(density)), color="#1fa187", fill="#4ac16d", 
                 position="identity", breaks = seq(0, 45, by = 2)) +
  stat_function(fun = dnorm,  alpha=.7,
                args = list(mean = 22, sd = 6), 
                                color='#365c8d', size=1.1) + 
  theme_minimal() + 
  coord_cartesian(expand = FALSE, xlim=c(0,45), ylim=c(0,.1)) +
  theme(axis.text.x=element_blank(), 
        axis.ticks.x=element_blank(), 
        axis.text.y=element_blank(),  
        axis.ticks.y=element_blank()
        ) +
  labs(x=NULL, y=NULL)
}

# Code below displays logic of generating & plotting 25 ggplot objects
  # however, we will show & use more efficient methods w/for loops or lists

# here is how you could use our function to plot a histogram for all 25 samples 
# plots100r1 <- funhist(dat, 1)
# plots100r2 <- funhist(dat, 2)
# plots100r3 <- funhist(dat, 3)
# plots100r4 <- funhist(dat, 4)
# plots100r5 <- funhist(dat, 5)
# plots100r6 <- funhist(dat, 6)
# plots100r7 <- funhist(dat, 7)
# plots100r8 <- funhist(dat, 8)
# plots100r9 <- funhist(dat, 9)
# plots100r10 <- funhist(dat, 10)
# plots100r11 <- funhist(dat, 11)
# plots100r12 <- funhist(dat, 12)
# plots100r13 <- funhist(dat, 13)
# plots100r14 <- funhist(dat, 14)
# plots100r15 <- funhist(dat, 15)
# plots100r16 <- funhist(dat, 16)
# plots100r17 <- funhist(dat, 17)
# plots100r18 <- funhist(dat, 18)
# plots100r19 <- funhist(dat, 19)
# plots100r20 <- funhist(dat, 20)
# plots100r21 <- funhist(dat, 21)
# plots100r22 <- funhist(dat, 22)
# plots100r23 <- funhist(dat, 23)
# plots100r24 <- funhist(dat, 24)
# plots100r25 <- funhist(dat, 25)

# with this explicit approach, you could then combine plots with patchwork
# plots100r1 + plots100r2 + plots100r3 + plots100r4 + plots100r5 + 
#   plots100r6 + plots100r7 + plots100r8 + plots100r9 + plots100r10 +
#   plots100r11 + plots100r12 + plots100r13 + plots100r14 + plots100r15 +
#   plots100r16 + plots100r17 + plots100r18 + plots100r19 + plots100r20 +
#   plots100r21 + plots100r22 + plots100r23 + plots100r24 + plots100r25 + 
# plot_layout(ncol = 5) 

# a more efficient way to do above is to make 25 plots with for loop & our function
  # generates all 25 plots & assigns to objects p1, p2, ... p25
# for(i in 1:25){
# assign(paste0('p',i), funhist(dat, i))  
# }

# instead, we generate 25 plots & put in one list 
# this avoids cluttering our R Environment with a bunch of plot objects
plotlist <-lapply(
  seq(1,25), 
  function(i) funhist(dat, i)
)

wrap_plots(plotlist)

  # use patchwork::wrap_plots() to display all plots in a list
  # https://patchwork.data-imaginist.com/articles/guides/assembly.html

For the most part, these sample distributions look like they resemble the population from which they were drawn. However, these are only 25 of the many, many unique samples that we could have drawn of the same sample size - that is, these are only 25 out of all the potential samples that make up the sampling distribution of size n=100 from this population. If you can imagine how many unique samples of n=100 could possibly have been drawn from this population of N=240k, then you can imagine the sampling distribution for samples of size n=100!

  • Better yet, instead of imagining, let’s do math. Remember combinatorics? Calculating the number of unique samples (i.e., without repetition but in any order) of size n=100 that could be drawn from a population of N=240,000 is an n-choose-k combination problem. It turns out that there are (240,000 choose 100) options, which equates to \(\frac{240000!}{100!(240000-100)!}\approx1.101×10^{380}\) possible samples of n=100 that could be drawn!
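You can check that figure in R. Note that choose(240000, 100) overflows double precision to Inf, so we work on the log scale with lchoose():

```r
# number of unique samples of n=100 from N=240,000, on the log10 scale
choose(240000, 100)                        # Inf - too large for double precision
log10_count <- lchoose(240000, 100) / log(10)
log10_count                                # about 380.04, i.e., roughly 1.1 x 10^380
```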

Now, what if we had drawn smaller samples of, say, n=30 “cars” (i.e., average mpg values) instead? Let’s create a comparable plot of 25 samples each comprised of n=30 average mpg values drawn from our simulated population data of 240k cars. We hope this will help you visualize how the sampling distribution for n=30 cars differs from the sampling distribution for n=100 cars.

set.seed(1138)
#simulate 25 samples of size n=30
mpg2015s30rep <- mpg2015pop %>%
  rep_slice_sample(n = 30, reps = 25)

#assign new sim data to dat object to call in function
dat <- mpg2015s30rep

#generate 25 plots in a list 
plotlist <-lapply(
  seq(1,25), 
  function(i) funhist(dat, i)
)

#combine 25 plots of sample size n=30
wrap_plots(plotlist)

The first thing you should notice is how different each of these samples is from one another and from the underlying normal probability distribution in the population from which they were drawn. These plots also are much more variable compared to the plots above for the 25 samples of n=100 cars, despite each sample being drawn from the same population of 240k simulated cars.

This should help you visualize the concept of sampling variability and the central limit theorem. If you draw samples of size “n” from a population, those samples will vary (be different) from each other and from the population. Smaller sample sizes (n) and more heterogeneous populations result in more sampling variability, or greater variation among the samples in a sampling distribution. Conversely, increasing the sample size (n) reduces sampling variability and, instead, results in a sampling distribution comprised of potential samples that are more representative of the underlying population.
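The central limit theorem makes this precise: the standard deviation of the sampling distribution of the sample mean (i.e., the standard error) equals σ/√n. For our simulated population (σ=6), a quick calculation shows how the spread shrinks as n grows:

```r
# theoretical standard errors (sigma / sqrt(n)) for our simulated population
sigma <- 6
sigma / sqrt(30)    # about 1.10 mpg
sigma / sqrt(100)   # 0.60 mpg
sigma / sqrt(1000)  # about 0.19 mpg
```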

To illustrate, let’s create this plot once again for 25 samples of size n=1,000 “cars” (average mpg values) drawn from our simulated population of 240k cars.

set.seed(1138)
#simulate 25 samples of size n=1000
mpg2015s1000rep <- mpg2015pop %>%
  rep_slice_sample(n = 1000, reps = 25)

#assign new sim data to dat object to call in function
dat <- mpg2015s1000rep

#generate 25 plots in a list 
plotlist <-lapply(
  seq(1,25), 
  function(i) funhist(dat, i)
)

#combine 25 plots of sample size n=1000
wrap_plots(plotlist)

Notice how each of these 25 samples of n=1,000 simulated cars has an average mpg distribution that quite closely resembles the population distribution that generated these samples? Moreover, while each sample is different, they are far more similar to one another compared to the distributions found in the plot of n=30 samples above.

This is a primary reason that sample size matters: Larger random samples are more representative of the populations from which they are drawn than are smaller samples. Likewise, sample statistics are more likely to generate valid estimates of population parameters and inferences about populations are more likely to be valid when we use large samples rather than small samples.

Visualize sampling distribution of mean with simulated sample mean mpgs

To further illustrate this point, let’s visualize the sampling distribution of the mean for our n=30 samples. To do this, we will start by calculating the mean of each of the 25 samples of n=30 mpg values drawn from our simulated car mpg data; we will display the first few rows of sample mean mpg values using head(). Then, we will plot each of the 25 sample means as a dot in a dot histogram using geom_dotplot().

# simulate 10 samples of size n=30
# mpg2015s30rep <- mpg2015pop %>%
#   rep_slice_sample(n = 30, reps = 25)

mus30rep25 <- mpg2015s30rep %>%
  group_by(replicate) %>%
  summarize(mean_mpg = mean(mpg2015))
head(mus30rep25) %>% gt() %>%
  tab_header(title = md("First few rows of N=25 sample mean mpg"),
    subtitle = md("(Sample size n=30 per sample)")
  )
First few rows of N=25 sample mean mpg
(Sample size n=30 per sample)
replicate mean_mpg
1 22.29934
2 22.46044
3 21.08830
4 21.47506
5 22.56471
6 22.76905
mus30rep25plot <- mus30rep25 %>% ggplot(aes(x=mean_mpg)) +
  geom_dotplot(method="histodot", color="#ed6925", fill="#fbb61a", binwidth=.25,
               dotsize=0.7) +
  labs(x= "N=25 samples") +
  theme_minimal() +
  coord_cartesian(expand = FALSE, ylim=c(0,1)) +
  scale_x_continuous(limits=c(17, 27), breaks=seq(18,26,2)) +
  scale_y_continuous(name="Sample n=30", breaks = NULL) 
mus30rep25plot + plot_annotation(
    title = 'Visualizing Sampling Distribution of Mean for Sample Size n=30')

If we kept drawing more samples of n=30 “cars” each from our simulated population, how variable would the sample means from each of the different samples be? The plot above gives us a sense of the spread or variability of the sampling distribution of sample means for n=30 simulated cars. We can get a better sense by drawing more samples of the same size (n=30 each) and plotting those means as well.

We will do this below by drawing N=250 and N=2,500 samples of n=30 cars each from our simulated population of 240k cars, then we will calculate the mean for each of the 250 and 2,500 samples respectively and, finally, we will plot these sample mean mpg values and combine the plots using patchwork.

set.seed(1138)
# write function to draw N=Nrep simulated samples of size n=nsamp from sim pop 
  # then calculate sample mean from each Nrep sample 
funmeans = function(nsamp, Nrep){
    mpg2015pop %>% 
    rep_slice_sample(n = nsamp, reps = Nrep) %>%  
    group_by(replicate) %>%
    summarize(mean_mpg = mean(mpg2015))
}

mus30rep250 <- funmeans(30,250)

head(mus30rep250) %>% gt() %>%
  tab_header(title = md("First few rows of N=250 sample mean mpg"),
    subtitle = md("(Sample size n=30 per sample)")
  )
First few rows of N=250 sample mean mpg
(Sample size n=30 per sample)
replicate mean_mpg
1 22.29934
2 22.46044
3 21.08830
4 21.47506
5 22.56471
6 22.76905
mus30rep2500 <- funmeans(30,2500)

head(mus30rep2500) %>% gt() %>%
  tab_header(title = md("First few rows of N=2,500 sample mean mpg"),
    subtitle = md("(Sample size n=30 per sample)")
  )
First few rows of N=2,500 sample mean mpg
(Sample size n=30 per sample)
replicate mean_mpg
1 22.50078
2 19.90967
3 22.39007
4 21.76854
5 24.25293
6 21.12095
# create function to quickly generate dot histograms
fundots = function(dat, binw, dsiz, xlabel){
  dat %>% ggplot(aes(x=mean_mpg)) +
  geom_dotplot(method="histodot", color="#ed6925", fill="#fbb61a", binwidth=binw,
               dotsize=dsiz) +
  labs(x= xlabel) +
  theme_minimal() +
  coord_cartesian(expand = FALSE, ylim=c(0,1)) +
  scale_x_continuous(limits=c(17, 27), breaks=seq(18,26,2)) +
  scale_y_continuous(NULL, breaks = NULL)
}

#generate plots
mus30rep250plot <- fundots(mus30rep250, .15, .5, "N=250 samples")
mus30rep2500plot <- fundots(mus30rep2500, .05, .4, "N=2,500 samples")  
  
#combine plots
mus30rep25plot + mus30rep250plot + mus30rep2500plot +
  plot_layout(ncol = 3) + 
  plot_annotation(
    title = 'Visualizing Sampling Distribution of Mean for Sample Size n=30')

As you can see, the sampling distribution of sample means for samples of size n=30 clusters around the true population mean (=22), but the individual sample means range from around mean mpg=18 to mean mpg=25.

Let’s repeat this process for sample sizes n=100 and n=1,000.

set.seed(1138)

mus100rep25 <- funmeans(100,25)
mus100rep250 <- funmeans(100,250)
mus100rep2500 <- funmeans(100,2500)
mus1000rep25 <- funmeans(1000,25)
mus1000rep250 <- funmeans(1000,250)
mus1000rep2500 <- funmeans(1000,2500)

#generate plots
mus30rep25plot <- fundots(mus30rep25, .25, .7, NULL) + 
  scale_y_continuous(name="Sample n=30") + 
  theme(axis.text.y = element_blank(), axis.ticks = element_blank())
mus30rep250plot <- fundots(mus30rep250, .15, .5, NULL)
mus30rep2500plot <- fundots(mus30rep2500, .05, .4, NULL)  
mus100rep25plot <- fundots(mus100rep25, .25, .7, NULL)+ 
  scale_y_continuous(name="Sample n=100") + 
  theme(axis.text.y = element_blank(), axis.ticks = element_blank())
mus100rep250plot <- fundots(mus100rep250, .15, .5, NULL)
mus100rep2500plot <- fundots(mus100rep2500, .05, .4, NULL) 
mus1000rep25plot <- fundots(mus1000rep25, .25, .7, "N=25 samples") + 
  scale_y_continuous(name="Sample n=1,000") + 
  theme(axis.text.y = element_blank(), axis.ticks = element_blank())
mus1000rep250plot <- fundots(mus1000rep250, .15, .5, "N=250 samples")
mus1000rep2500plot <- fundots(mus1000rep2500, .05, .4, "N=2,500 samples") 
  
#combine plots
mus30rep25plot + mus30rep250plot + mus30rep2500plot +
  mus100rep25plot + mus100rep250plot + mus100rep2500plot +
  mus1000rep25plot + mus1000rep250plot + mus1000rep2500plot +
  plot_layout(ncol = 3) + 
  plot_annotation(
    title = 'Visualizing Sampling Distributions of the Mean, Sample Sizes n=30, n=100, & n=1,000')

Once again, the sampling distributions of sample means for samples of size n=30, n=100, and n=1,000 all cluster around the true population mean (=22). However, the spread of the sampling distribution of sample means diminishes as the sample size increases. For instance, while the individual sample means from the n=30 samples have a relatively wide range from around mean mpg=18 to mean mpg=25, the individual sample means cluster more tightly around the population mean value (=22) and the range of sample mean values decreases substantially as the sample size increases (e.g., to n=100 or n=1,000).

So, what does this mean for a researcher who draws a random sample from a population with the goal of estimating and making inferences about population parameters? It means that research designs relying on small samples, which are more variable, are more likely to generate invalid population estimates and inferences than are research designs relying on large samples.
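If you prefer numbers to dot plots, the shrinking spread can also be checked with a short, self-contained simulation. This is only a sketch (it re-creates a Normal(mean = 22, sd = 6) population like the one simulated earlier, rather than reusing the `funmeans()` objects above):

```r
# Sketch: the standard deviation of sample means (the standard error)
# shrinks as the per-sample n grows; population assumed Normal(22, 6)
set.seed(1138)
pop <- rnorm(240000, mean = 22, sd = 6)

se_hat <- function(n, reps = 2500) {
  sd(replicate(reps, mean(sample(pop, n))))
}

sapply(c(30, 100, 1000), se_hat)
```

The three estimates should land close to 6/sqrt(n), i.e., roughly 1.10, 0.60, and 0.19.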

Illustrating a one-sample t-test with mtcars data

We hope these plots have helped you better understand sampling distributions of the sample mean. Now, let’s move on to examples that help illustrate the logic of a one-sample hypothesis test. For now, we will build on the mpg example by bringing in the “mtcars” dataset that comes built into R. First, let’s view those data.

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# next assignment, use starwars data!
# data(starwars)
# https://cran.r-project.org/web/packages/janitor/vignettes/tabyls.html

Built-in datasets are extremely useful for creating examples - and especially for sharing or troubleshooting code in a reproducible way by using data that are universally available to the community of R users.

To learn more about the “mtcars” data, type ?mtcars in your console (or a code chunk) and hit enter (or run). Doing so will bring up a help page, where you will learn that the data were “extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).”

So, these are not a random sample of cars - but that will not stop us from imagining that they are a random sample! Specifically, let’s pretend the mtcars data were randomly selected from the population of cars in 1974. Then, imagine that we want to know if the true population mean miles per gallon (mpg) in 1974 was equivalent to that in 2015, and we want to use the “random” (just pretend with us here) mtcars sample to make an inference about the average mpg of the population of cars in 1974.

Strictly speaking, null hypothesis testing with a one-sample t-test cannot answer our question directly given the data available to us. However, here is what we can do:

  • We can start with the assumption that the population mean mpg in 1974 was 22. In this case, this is our null hypothesis, even though we are specifying a non-zero point value rather than a true “null” or “0” value here.
  • Next, we can estimate the mean mpg in our (imaginary) “random” sample of cars from the “mtcars” data.
  • Since “mtcars” is a small sample, we can use the t-distribution to convert our sample mean mpg into a t-value, which becomes our test statistic.
  • We will also specify a conventional alpha level of 0.05.
  • We do not have any reason to suspect that the “mtcars” data will have a higher or lower sample value than mpg=22, so we will conduct a two-tailed test.
  • From here, we will convert our t-test statistic to a probability or p-value that represents the probability of observing a sample mean mpg as far or farther from the assumed population mean of 22 if we lived in a world in which the true population mean mpg in 1974 was actually equal to 22.
    • This is confusing, we know. However, this is the type of answer a null hypothesis significance test can provide.
    • Essentially, we start by assuming the null hypothesis is true, then we ask how (im)probable our data are under the assumption that the null is true.
  • Our final step, then, is to compare our test statistic or associated p-value with our prespecified alpha level and make an inference about the population by rejecting or failing to reject the null hypothesis.
    • Again, the point of the null hypothesis test is to determine just how improbable our data would be under the null hypothesis; that is, we are estimating how atypical our “random” sample would be if the true population mean mpg were actually equal to 22.
    • Would our data and sample statistics be highly improbable if the null hypothesis were true? If so, we reject the null hypothesis. If not, we lack sufficient evidence to reject the null.

Let’s try it. First, we start by examining the sample distribution of “mpg” in the mtcars data in a frequency table and a histogram.

mtcars %>% frq(mpg)
## mpg <numeric> 
## # total N=32 valid N=32 mean=20.09 sd=6.03
## 
## Value | N | Raw % | Valid % | Cum. %
## ------------------------------------
## 10.40 | 2 |  6.25 |    6.25 |   6.25
## 13.30 | 1 |  3.12 |    3.12 |   9.38
## 14.30 | 1 |  3.12 |    3.12 |  12.50
## 14.70 | 1 |  3.12 |    3.12 |  15.62
## 15.00 | 1 |  3.12 |    3.12 |  18.75
## 15.20 | 2 |  6.25 |    6.25 |  25.00
## 15.50 | 1 |  3.12 |    3.12 |  28.12
## 15.80 | 1 |  3.12 |    3.12 |  31.25
## 16.40 | 1 |  3.12 |    3.12 |  34.38
## 17.30 | 1 |  3.12 |    3.12 |  37.50
## 17.80 | 1 |  3.12 |    3.12 |  40.62
## 18.10 | 1 |  3.12 |    3.12 |  43.75
## 18.70 | 1 |  3.12 |    3.12 |  46.88
## 19.20 | 2 |  6.25 |    6.25 |  53.12
## 19.70 | 1 |  3.12 |    3.12 |  56.25
## 21.00 | 2 |  6.25 |    6.25 |  62.50
## 21.40 | 2 |  6.25 |    6.25 |  68.75
## 21.50 | 1 |  3.12 |    3.12 |  71.88
## 22.80 | 2 |  6.25 |    6.25 |  78.12
## 24.40 | 1 |  3.12 |    3.12 |  81.25
## 26.00 | 1 |  3.12 |    3.12 |  84.38
## 27.30 | 1 |  3.12 |    3.12 |  87.50
## 30.40 | 2 |  6.25 |    6.25 |  93.75
## 32.40 | 1 |  3.12 |    3.12 |  96.88
## 33.90 | 1 |  3.12 |    3.12 | 100.00
##  <NA> | 0 |  0.00 |    <NA> |   <NA>
mtmpgplot <- mtcars %>% ggplot() +
  geom_histogram(aes(x = mpg, y = after_stat(density)), color="#1fa187", fill="#4ac16d", 
                 position="identity", breaks = seq(0, 45, by = 2)) +
  labs(x= "Sample distribution of mpg values in mtcars data") + 
  theme_minimal() + 
  coord_cartesian(expand = FALSE, xlim=c(0,45), ylim=c(0,.1))
mtmpgplot

Now, let’s overlay a line representing the distribution of our null hypothesis, which here is a normal curve with a mean=22 and a standard deviation that we will estimate using the sample sd value (sd(mtcars$mpg)). We will also draw a vertical solid line at the mean of this null distribution (mean=22). Finally, we will add a second dashed vertical line at the mean value of the mtcars sample distribution for mpg.

These additions will hopefully help you better visualize the comparison of the mtcars sample estimate with our assumed population parameter value. Recall, the aim of our test is to determine how improbable it would be for a “random” sample of this size and variation to generate a sample mean at least as far from mean=22 as the one we observed.

mtmpgplot <- mtcars %>% ggplot() +
  geom_histogram(aes(x = mpg, y = after_stat(density)), color="#1fa187", fill="#4ac16d", 
                 position="identity", breaks = seq(0, 45, by = 2)) +
  labs(x= "Sample distribution of mpg values in mtcars data") + 
  stat_function(fun = dnorm, alpha=.7,
                args = list(mean = 22, sd = sd(mtcars$mpg)), 
                                color='#d44842', size=1.1) + 
  geom_vline(xintercept = 22, color = "#d44842", size=1.2, alpha=0.7) + 
  geom_vline(xintercept = mean(mtcars$mpg), linetype="dashed", 
             color = "#365c8d", size=1.2, alpha=0.7) + 
  theme_minimal() + 
  coord_cartesian(expand = FALSE, xlim=c(0,45), ylim=c(0,.125))
mtmpgplot

Now, we conduct our t-test. You read about how to calculate z- and t-test statistics in Chapter 8. We will also spend more time on how this t-standardized test statistic is calculated in the next section.

t.test(mtcars$mpg, mu = 22, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = -1.7921, df = 31, p-value = 0.08288
## alternative hypothesis: true mean is not equal to 22
## 95 percent confidence interval:
##  17.91768 22.26357
## sample estimates:
## mean of x 
##  20.09062

Our t-standardized test statistic is t=-1.79. With this t-value and 31 degrees of freedom (df=n-1), the associated p-value is 0.083. Assuming data, measurement, and modeling assumptions are valid (e.g., assuming the mtcars data were actually a random sample), we can interpret this p-value as the probability of observing a sample mean at least as far away from mean=22 as the observed sample mean (20.09) if the null hypothesis that the 1974 population mpg mean=22 were true.
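If you want to double-check R’s arithmetic, the t statistic can be reproduced by hand from the formula in Chapter 8. A quick sketch using the built-in mtcars data:

```r
# Sketch: t(obt) = (xbar - mu) / (s / sqrt(n)), computed manually
n <- nrow(mtcars)  # 32 cars
(mean(mtcars$mpg) - 22) / (sd(mtcars$mpg) / sqrt(n))  # -1.7921, as t.test() reported
```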

Since the p-value is greater than our pre-specified alpha level (p>0.05), the appropriate inference would be to fail to reject the null hypothesis. In this case, there is insufficient evidence to reject the possibility that the true 1974 population mean mpg was equal to 22. In essence, if the true 1974 population mean mpg really was equal to 22, then it would not be all that improbable to draw a random sample of n=32 cars with a mean mpg of 20.09 (or more extreme) simply due to sampling variability.

To further illustrate this point, let’s return to our simulated population data to examine just how rare it would be to draw samples of n=32 cars with means that are at least this extreme. We will draw N=10,000 random samples of n=32 “cars” each from our simulated population of 240k mpg values, which we know were simulated from a normal distribution with a mean of 22 and a standard deviation of 6. (Note: The standard deviation of the mpg values in the mtcars data equals 6.03).

# If we randomly select N=10,000 samples of n=32 cars each from our simulated population, what proportion of sample means would be at least as extreme as the mtcars data?
set.seed(1138) 
funmeans(32,10000) %>%
  filter(mean_mpg<=20.09062 | mean_mpg>=23.90938) %>%
  summarise(n_extreme = n()) %>%
  mutate(
    prop_extreme = n_extreme / 10000, 
    pct_extreme = (n_extreme / 10000)*100) %>% 
  gt() %>%
    tab_header(title = md("Percent of sample means as/more extreme than mtcars data"),
      subtitle = md("(N=10,000 samples of size n=32 from simulated cars population)")
    )
Percent of sample means as/more extreme than mtcars data
(N=10,000 samples of size n=32 from simulated cars population)
n_extreme prop_extreme pct_extreme
747 0.0747 7.47
#alt method with count
# set.seed(1138) 
# funmeans(32,10000) %>% 
#   count(mean_mpg<=20.09062 | mean_mpg>=23.90938) 

So, after drawing N=10,000 random samples from our simulated population that we know has a mean=22 and sd=6, we found about 7.5% of those samples had a sample mean mpg that was at least as extreme as the mean mpg value observed in the mtcars data. This is not exactly the same as the p-value we observed (0.083), but it is pretty close!
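The small gap between 7.5% and 8.3% is expected, by the way: the simulation draws from a population whose standard deviation we know to be 6 (a z-style calculation), whereas t.test() estimates the standard deviation from the sample and uses the heavier-tailed t distribution. A quick sketch of the exact tail probability under the simulation’s own model:

```r
# Sketch: exact two-tailed probability of a sample mean at least as extreme
# as 20.09062 under a known-sigma normal model (mean = 22, sd = 6, n = 32)
se <- 6 / sqrt(32)
2 * pnorm(20.09062, mean = 22, sd = se)  # about 0.072, close to the 7.47% above
```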

Now, let’s move on to one-sample hypothesis test examples that are more similar to the SPSS exercises found at the end of Chapter 8 in your book.


Part 1 (Assignment 10.1)

Goal: Read in Youth data and manually conduct a one-sample hypothesis test

For this assignment, you will be working through Question #3 from Chapter 8 SPSS Exercises in the book. That question asks you to conduct a one-sample hypothesis test using the variable ‘moral’ from the Youth data. Below, we will demonstrate how to complete the assignment steps using an alternative variable - parnt2 - from the Youth data. This ensures that you get the experience of calculating the answers and working with R yourself using the same data. After working through the process with us using the parnt2 variable, you will then simply repeat the same steps with the ‘moral’ variable afterwards. Be sure to repeat these steps with the correct variable from the assignment to ensure you receive full credit.

(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR_MO_DY for the actual date. E.g., 2023_03_21_Ducate_CRIM5305_Assign10)

  1. Go to your CRIM5305_L folder, which should contain the R Markdown file you created for Assignment 9. Click to open the R Markdown file.
    1. Remember, we open RStudio in this way so the here package will automatically set our CRIM5305_L folder as the top-level directory.
  2. In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.
  3. Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete this text before you begin working.
    1. Create a second-level header titled: “Part 1 (Assignment 10.1).” Then, create a third-level header titled: “Read in Youth data and manually conduct one-sample z-test”
    2. This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text: This R Markdown document contains my work for Assignment 10. It is my work and only my work.
  4. Create a third-level header in R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
    1. Insert an R chunk.
    2. Inside the new R code chunk, load the following packages: tidyverse, haven, here, sjmisc, sjPlot, gt, patchwork, and infer.
      • This is our first time using the infer package, which means you’ll need to install it first using install.packages() or the “Install Packages” button under the “Tools” tab. Then, you can use library() to load it. While we will briefly introduce you to the popular and easy-to-use t.test() function from the base R stats package, we will also show you how to conduct t-tests using the infer package. This package is a tidy-style, pipe-friendly alternative for conducting and visualizing common inferential tests.
      • Recall, you only need to install packages one time. However, you must load them each time you start a new R session.
  5. After your first code chunk, create another third-level header in RMD titled: “Read Data into R”
    1. Insert another R code chunk.

    2. In the new R code chunk, read in the “Youth_0.sav” SPSS datafile and assign it to an R object named YouthData.

      • Forget how to do this? Refer to any of the previous assignments.
    3. In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data object: YouthData.

      • This will call the object and provide a brief view of the data. Recall that you can get a similar but more visually appealing view by simply clicking on the object in the “Environment” window. Also, be sure to put these functions on separate lines or they will not run properly.
      • Note: We used the head() function (piped to gt()) to print the first six rows so you can compare your output.
YouthData <- read_spss(here("Datasets", "Youth_0.sav"))
YouthData %>% head() %>% gt()
Gender v2 v21 v22 v63 v77 v79 v109 v119 parnt2 fropinon frbehave certain moral delinquency d1 hoursstudy filter_$ heavytvwatcher studyhard supervision drinkingnotbad Lowcertain_bin
1 15 36 15 3 1 3 5 1 5 18 9 9 19 8 1 15 1 1 1 0 0 1
0 15 3 5 4 1 2 4 1 8 11 6 11 19 10 1 5 0 0 0 1 0 0
1 15 20 6 4 1 1 4 2 8 12 5 11 20 1 0 6 0 1 0 1 0 0
0 15 2 4 2 1 2 5 3 4 9 5 13 19 1 0 4 1 0 0 0 0 0
0 14 12 2 4 3 5 5 3 7 27 9 4 18 104 1 2 1 0 0 1 1 1
1 15 1 3 4 1 1 5 3 7 8 9 10 20 0 0 3 0 0 0 1 0 0
  1. Create a third-level header titled: “Manually conduct z-Test for ‘parnt2’”.
    1. Insert an R chunk.
    2. First, we will create a frequency table, so that we can examine the mean, standard deviation, and distribution of the variable. Type YouthData %>% frq(parnt2).
YouthData %>%
  frq(parnt2)
## parental supervision scale (parnt2) <numeric> 
## # total N=1272 valid N=1272 mean=6.23 sd=1.38
## 
## Value |   N | Raw % | Valid % | Cum. %
## --------------------------------------
##     2 |   6 |  0.47 |    0.47 |   0.47
##     3 |  16 |  1.26 |    1.26 |   1.73
##     4 | 163 | 12.81 |   12.81 |  14.54
##     5 | 139 | 10.93 |   10.93 |  25.47
##     6 | 440 | 34.59 |   34.59 |  60.06
##     7 | 189 | 14.86 |   14.86 |  74.92
##     8 | 319 | 25.08 |   25.08 | 100.00
##  <NA> |   0 |  0.00 |    <NA> |   <NA>
  • Note that the sample size is greater than 120 (n = 1,272), so we can use either a z- or a t-test.
    • Recall, the reason we can use either test is that t distributions converge to the z distribution in large samples, meaning we would select the same critical (z) value either way with a large sample.
    • For instance, the z-critical value for an alpha of .05 is 1.96, whereas the t-critical value for an alpha of .05 and degrees of freedom of 1,271 (i.e., df = n-1) is also 1.96.
    • We will demonstrate how to perform both tests with two different alpha levels - first with an alpha level of 0.05 and then with an alpha level of 0.10.
  • We will start by calculating a z statistic “by hand” using the book’s Equation 8.3 for calculating z-obtained or z(obt). Then, we will show that the t(obt) value is identical using the t_test function from the infer package.
    • In our example, we will illustrate how to do this using a comparison mean of \(\mu\) = 6.0 for our null hypothesis using the parnt2 variable.
    • In other words, we will assume that the true population mean for the “parental supervision scale” is equal to 6.0, and we will test the hypothesis that the sample mean for the parnt2 variable in our data is different from 6.0 (i.e., a two-sided hypothesis test).
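Although your book’s Appendix tables are the intended reference here, these critical values can also be verified with base R’s quantile functions. This is shown only as a sketch; a later assignment covers finding critical values in R properly:

```r
# Sketch: two-tailed critical values from base R quantile functions
alpha <- 0.05
qnorm(1 - alpha / 2)          # z(crit), about 1.96
qt(1 - alpha / 2, df = 1271)  # t(crit), also about 1.96 when df = n - 1 = 1271
qnorm(1 - 0.10 / 2)           # about 1.64, for the alpha = 0.10 test used later
```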
  1. In our first example, we will set an alpha level of 0.05. This converts to a z-critical value or z(crit) of 1.96 (see Appendix A). For our sample size, this also equates to a t-critical value or t(crit) of 1.96 (df = 1272 - 1 = 1271).
    1. Insert an R chunk. Now, we already know that n = 1,272. However, if we did not know, we could ask R to tell us. Simply type n <- length(YouthData$parnt2). Then, hit enter, type n, and R should return the value 1272.
      • The length() function returns the number of elements in the parnt2 vector, which here equals the number of observations in the dataset because there are no missing values.
    2. Now, we will calculate the z-statistic using the formula from our book, which is: \[ z_{obt} = \frac{\overline{X} - \mu}{\frac{s}{\sqrt{n}}} \]
    3. Recall that the numerator, or the difference between our sample mean and the assumed population mean, is divided by the denominator \(\frac{s}{\sqrt{n}}\), which is the standard deviation of the sampling distribution of the sample mean, or otherwise known as the standard error of the mean.
      • Remember those sampling distributions of sample means we plotted above? Due to the sampling variability associated with small samples, we need to adjust test statistics more (i.e., make them smaller and less likely to surpass a critical value) when we are estimating population parameters with sample statistics from small samples.
      • Thus, this denominator essentially generates a standardized adjustment to our test value that varies by sample size. Smaller sample sizes result in larger denominator values and, hence, smaller test statistics, whereas larger sample sizes result in smaller denominator values and, hence, larger test statistics.
      • Note: We could have used this same formula to calculate the standard error of the mean in the simulated examples above, then compared the estimates to the observed standard error of means from, say, N=10,000 samples drawn from our population. Feel free to go back and try it yourself!
    4. Hit enter in your R chunk and type z_stat <- (mean(YouthData$parnt2) - 6) / (sd(YouthData$parnt2) / sqrt(n)). Then, hit enter and type z_stat to view your z-statistic.
      • Note: This code is our book’s formula typed out. sd is the standard deviation; sqrt(n) is the square root of n; and our (null) hypothesized population mean \(\mu\) = 6.
#alpha = .05
#z-crit = 1.96
#t-crit = 1.96 (df > 120)

#get n 
n <- length(YouthData$parnt2)
n
## [1] 1272
# calculate the z-statistic
z_stat <- (mean(YouthData$parnt2) - 6) / (sd(YouthData$parnt2) / sqrt(n))
z_stat
## [1] 5.880774
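Following up on the note above about the standard error of the mean: here is a self-contained sketch (assuming the same simulated Normal(22, 6) population used earlier) that compares the analytic SE, s/sqrt(n), against the spread of simulated sample means:

```r
# Sketch: analytic SE versus the SD of simulated sample means (n = 32);
# population assumed Normal(mean = 22, sd = 6), as in the earlier simulation
set.seed(1138)
pop <- rnorm(240000, mean = 22, sd = 6)
6 / sqrt(32)                                 # analytic SE, about 1.06
sd(replicate(10000, mean(sample(pop, 32))))  # simulated SE, also about 1.06
```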
  1. Our z-obtained or z test statistic is 5.88. How does this compare to the z(crit) of 1.96?
    1. Our test statistic z-obtained is greater than the z-critical value (5.88 > 1.96), which means it falls in the pre-specified critical region or “rejection region” of the z-standardized (or t-standardized) distribution. In essence, this value indicates that our observed sample mean is about 5.88 (n-adjusted) standardized units away from the null hypothesized mean of \(\mu\)=6.
    2. Thus, we would interpret this as sufficient evidence to reject the null hypothesis that the mean population value for the parental supervision scale is \(\mu\)=6.
    3. Put differently, if the true population mean of the parental supervision scale were actually equal to ‘6’ (and assuming our modeling assumptions are valid), then it would be highly improbable to draw a random sample with a mean value as extreme as the one we observed (6.23) due to sampling variability alone. Of course, with such a large sample, the standard error of the mean is quite small, so most samples would be expected to generate means that are very close to a value of ‘6’ if the null hypothesis were true.
  2. We know that a t-test with an alpha of .05 would yield virtually the same result, since the critical values are the same for large sample sizes. But what would happen if we change the alpha level of our two-sided test from 0.05 to 0.10, or equivalent to a 90% confidence interval? How would that change our z-critical or t-critical value? Would it change our z-obtained or t-obtained test statistic? Would it change our conclusions? Let’s find out, but this time we will use R functions to conduct the test rather than calculating the test statistic manually with a formula.

Part 2 (Assignment 10.2)

Goal: Conduct a one-sample hypothesis test with infer::t_test()

In the last part of the assignment, we will illustrate how to conduct a one-sample t-test using the t_test() function from the infer package. Recall, we previously illustrated the base R t.test() function in our mtcars example earlier. For another example of conducting a one-sample t-test by hand and with the t.test() function in base R, see here. In comparison, the infer alternative allows us to run basic statistical tests such as a one-sample t-test within a tidyverse framework that permits sequencing functions together using tidyverse-style pipes. We will soon illustrate why such piping can be quite helpful.

  1. First, create a second-level header titled: “Part 2 (Assignment 10.2)”. Then, create a third-level header titled: “Conduct t-test with alpha of 0.10 for parnt2 variable”.
    1. Insert an R chunk.
    2. To conduct a basic one-sample, two-tailed t-test with an alpha level of 0.10 using infer::t_test(), type YouthData %>% t_test(response = parnt2, mu = 6, conf_level = 0.90) and then click run to generate results of your t-test.
#infer
t_stat <- YouthData %>% infer::t_test(response = parnt2, mu = 6, conf_level = 0.90)
t_stat
## # A tibble: 1 × 7
##   statistic  t_df       p_value alternative estimate lower_ci upper_ci
##       <dbl> <dbl>         <dbl> <chr>          <dbl>    <dbl>    <dbl>
## 1      5.88  1271 0.00000000521 two.sided       6.23     6.16     6.29
#NOTE: if you load both rstatix & infer, there is conflict over t_test function call
# alternative: calculating t-score using rstatix
#https://www.datanovia.com/en/lessons/t-test-in-r/
# YouthData %>% rstatix::t_test(parnt2 ~ 1, mu = 6)
  1. Note that the test statistic (5.88) is the same value that we calculated manually above.
    1. With an alpha level of 0.10 for a two-sided test, the z and t (large sample) critical value would change to 1.64.
      • Remember, you can find the z or t critical value in the back of your book in Appendix tables.
      • Instead of using these tables, in a later assignment we will learn how to find these critical values with R.
    2. Since our test statistic does not change and it is still greater than the critical value, our test statistic again falls in the pre-specified critical region or “rejection region” of the z- or t-standardized distributions. As such, there is still sufficient evidence to reject the null hypothesis that the true population \(\mu\)=6.
    3. Note: In a later assignment, rather than comparing a standardized test statistic like z(obt) or t(obt) directly to a critical value, we will take the additional step of calculating a p-value. A p-value is interpreted as the estimated probability of observing a test statistic that is at least as far as or farther from the mean (in either the positive or negative direction for a two-directional test), assuming the null hypothesis is true (and given all other modeling assumptions).
      • As you can see above, the infer::t_test function also outputs this p-value automatically (p = 5.21e-09). In this case, the p-value is much smaller than our pre-specified alpha threshold of 0.10, so again we would reject the null hypothesis of \(\mu\)=6.
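For reference, that p-value can be reproduced directly from the t distribution with a one-liner (a sketch; it should match the t_test() output above):

```r
# Sketch: two-sided p-value for t = 5.880774 with df = 1271
2 * pt(-abs(5.880774), df = 1271)  # about 5.2e-09
```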
  2. In addition to conducting a t-test, the infer package also employs simulation methods similar to those we introduced in the beginning of this assignment to make it relatively quick and easy to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis. The tidy-style piping comes in handy for this part, which involves executing a three-step process:
    1. Calculate the observed sample statistic and assign it to an object
      • In this case, our observed statistic is the sample mean of the parnt2 variable in YouthData. Here, the code below specifies our YouthData data object, specifies parnt2 as the response variable, then calculates the mean and assigns it to the object we named observed_statistic.
    2. Simulate the null hypothesis population distribution and assign it to an object
      • Here, we specify the response variable (parnt2), hypothesize the (null) point estimate for the population mean as mu = 6, then generate 10,000 simulated samples (using bootstrapping methods to estimate sampling variation) and calculate the mean of each sample before assigning these simulated sample means to an object we named null_dist_1_sample.
    3. Visualize the null distribution (histogram) and the test statistic (vertical line)
      • Once we have our observed sample mean (observed_statistic) and the simulated mean values from the null distribution (null_dist_1_sample), we can use infer::visualize() to plot these together.
      • We also specified a two-sided hypothesis and shaded the p-value region where applicable. (In this case, our observed sample statistic fell well outside the estimated distribution for the standard error of the hypothesized population mean, so there were no simulated values exceeding our observed statistic to shade.)
      • Note: The default color of the test statistic line is red. However, you can add the argument color = to the shade_p_value function and specify your new color just as you did for the boxplots in a previous assignment.
# a. calculate the observed statistic
observed_statistic <- YouthData %>%
  specify(response = parnt2) %>%
  calculate(stat = "mean")

# b. generate the null distribution
null_dist_1_sample <- YouthData %>%
  specify(response = parnt2) %>%
  hypothesize(null = "point", mu = 6) %>%
  generate(reps = 10000, type = "bootstrap") %>%
  calculate(stat = "mean")

# c. visualize the null distribution and test statistic!
null_dist_1_sample %>%
  visualize() + 
  shade_p_value(observed_statistic,
                direction = "two-sided",
                color = "turquoise")

  • As you can see, the estimated (simulated) null hypothesis distribution is clustered around \(\mu\)=6 with nearly all values falling within the range of 5.9 to 6.1, while the vertical line shows that the observed sample mean of 6.23 for the parnt2 variable falls well outside this range.
    • Considering the variable’s range, the observed sample mean of 6.23 is not far in absolute units from our hypothesized population \(\mu\)=6. However, because we have such a large sample (n=1,272 participants), we can expect relatively little sampling variability, meaning the sampling distribution of the sample mean would be characterized by sample means that are tightly clustered around the hypothesized population mean of 6.

  • Hence, this plot confirms what we concluded earlier: Assuming the null hypothesis is true that the population mean of the parent supervision scale is indeed \(\mu\)=6 (and given our modeling assumptions are appropriate), it would be extremely improbable to draw a random sample of this size (n=1,272) and then calculate a sample mean statistic that is this extreme relative to the null. But what if our null hypothesis had been \(\mu\)=6.2 instead of \(\mu\)=6 like we previously assumed? Let’s generate that plot and conduct the t-test using the infer package!
# conduct t-test 
YouthData %>% infer::t_test(response = parnt2, mu = 6.2, conf_level = 0.90)
## # A tibble: 1 × 7
##   statistic  t_df p_value alternative estimate lower_ci upper_ci
##       <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>    <dbl>
## 1     0.722  1271   0.470 two.sided       6.23     6.16     6.29
# a. calculate the observed statistic
observed_statistic <- YouthData %>%
  specify(response = parnt2) %>%
  calculate(stat = "mean")

# b. generate the null distribution
null_dist_1_sample <- YouthData %>%
  specify(response = parnt2) %>%
  hypothesize(null = "point", mu = 6.2) %>%
  generate(reps = 10000, type = "bootstrap") %>%
  calculate(stat = "mean")

# c. visualize the null distribution and test statistic!
null_dist_1_sample %>%
  visualize() + 
  shade_p_value(observed_statistic,
                direction = "two-sided")

  • Now our test statistic is smaller than the z- or t-critical value, even at alpha = 0.10, meaning the sample mean does not fall within the rejection region. Hence, if we had instead hypothesized that the true population mean is \(\mu\)=6.2, then we would fail to reject the null hypothesis. There is insufficient evidence to reject the null hypothesis that this sample was drawn from a population with a true mean value of \(\mu\)=6.2.
  1. Congratulations! You should now have everything that you need to complete the questions in Assignment 10!
  2. Follow the same steps as illustrated above to complete Questions 12-17 in the Assignment 10 Blackboard assignment, using the moral variable in the YouthData object.
    • As usual, complete the remainder of the questions in your RMD file. Keep the file clean and easy to follow by using RMD level headings (e.g., denoted with ## or ###) separating R code chunks, organized by assignment questions.
    • Write plain text after headings and before or after code chunks to explain what you are doing; such text will serve as useful reminders to you when working on later assignments!
    • Upon completing the assignment, “knit” your final RMD file again and save the final knitted Word document to your “Assignments” folder as: YEAR_MO_DY_LastName_CRIM5305_Assign10. Submit via Blackboard in the relevant section for Assignment 10.

Assignment 10 Objective Checks

After completing Assignment 10…

  • do you know how to draw random samples from data in R?
    • do you know how to draw one random sample with dplyr::slice_sample()?
    • do you know how to draw multiple (“replicate”) random samples with infer::rep_slice_sample()?
  • do you recognize that you can use lapply() to create a list of objects, which can help you avoid cluttering the R Environment with objects?
  • do you recognize you can use patchwork::wrap_plots() to quickly combine ggplot objects contained in a list?
  • are you able to use the base R tail() function to quickly view the last few rows of data?
  • do you know that you can share examples or troubleshoot code in a reproducible way by using built-in datasets like mtcars that are universally available to R users?
  • are you able to conduct a one-sample z or t hypothesis test in R and interpret results?
    • are you able to conduct a one-sample test using the base R t.test() function?
    • are you able to manually calculate a z or t test statistic by typing the formula in R?
    • are you able to conduct a one-sample test using infer::t_test()?
    • do you know how to use infer::visualize() to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis?