Assignment 10 Objectives

The purpose of this tenth assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 10 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

In the last assignment, you learned how to make statistical inferences from a contingency table by conducting and appropriately interpreting chi-squared tests of independence. You also learned how to estimate some measures of the magnitude of an association in a crosstab, such as Cramer’s V or the phi-coefficient. In this assignment, you will learn how to find the difference between two independent sample means in R and conduct an independent sample t-test of equality in population means.

By the end of assignment #10, you should…

  • recognize that you can add inline R code to RMD text with `r `, which can improve reproducibility and accuracy of reporting results by helping avoid typos or copy-and-paste errors and by auto-updating results
  • recognize that you can type LaTeX-style mathematical equations in your R Markdown text by using a single $ for inline text equations or two $$ to offset equations on a new line.
  • be able to filter data using datawizard::data_filter()
  • be able to change a numeric variable to a factor variable with mutate(newvar = as.factor(oldvar))
  • be able to conduct an independent samples pooled or separate variance t-test using t.test() and interpret the results
    • know how to conduct a Levene’s test for equality of variances using car::leveneTest()
  • be able to generate a half-violin/half-dotplot with geom_violindot() to visualize group distributions and sample sizes simultaneously
  • know how to use aes(fill=groupvar) to fill plots with different colors for each category of a grouping variable
  • know how to manually change the fill colors in a ggplot object using scale_fill_manual()
  • know how to add a horizontal or vertical line to a ggplot object with geom_hline() or geom_vline()

Assumptions & Ground Rules

We are building on objectives from Assignments 1-9. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • knit your RMD document into an HTML file that you can then save and submit for course credit
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • recognize the R operator != as “not equal to”
  • turn off or change scientific notation in R, such as options(scipen=999, digits = 3)
  • create a list or vector and assign to object, such as listname <- c(item1, item2)
  • recognize that use lapply() to create a list of objects, which can help you avoid cluttering the R Environment with objects
  • recognize that you can create your own R functions (e.g., our funxtoz() function) - and that doing so is recommended for duplicate tasks to avoid copy-and-paste errors
  • recognize that round() can be used to specify number of decimals on numeric values

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method
  • Use groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages
  • improve reproducibility of randomization tasks in R by setting the random number generator seed using set.seed()
  • know that you can share examples or troubleshoot code in a reproducible way by using built-in datasets like mtcars that are universally available to R users

Data viewing & wrangling

  • use the base R head() function to quickly view the first few rows of data
  • use the base R tail() function to quickly view the last few rows of data
  • use the glimpse() function to quickly view all columns (variables) in your data
  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • remove missing observations from a variable in R when appropriate using filter(!is.na(var))
  • change a numeric variable to a factor (e.g., nominal or ordinal) variable with haven::as_factor()
  • drop an unused factor level (e.g., missing “Don’t know” label) on a variable using data %>% droplevels(data$variable)
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions
  • convert raw column (variable) values into standardized z-score values using mutate()
  • select random sample from data without or with replacement using dplyr::sample_n()
  • select data with conditions using dplyr::filter() and %in% operator
  • simulate data from normal, truncated normal, or uniform probability distributions using rnorm(), truncnorm::rtruncnorm(), or runif()
  • draw random samples from data in R
    • draw one random sample with dplyr::slice_sample()
    • draw multiple (“replicate”) random samples with infer::rep_slice_sample()
  • recognize that one can manually build a simple tibble row-by-row using tidyverse’s tibble::tribble()

Descriptive data analysis

  • use summarytools::dfsummary() to quickly describe one or more variables in a data file
  • create frequency tables with sjmisc:frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()
  • calculate measures of central tendency for a frequency distribution
    • calculate central tendency using base R functions mean() and median() (e.g., mean(data$variable))
    • calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() and psych::describe() functions
  • calculate measures of dispersion for a variable distribution
    • calculate dispersion measures by hand from frequency tables you generate in R
    • calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc:frq() or summarytools::descr())
  • recognize and read the basic elements of a contingency table (aka crosstab)
    • place IV in columns and DV in rows of a crosstab
    • recognize column/row marginals and their overlap with univariate frequency distributions
    • calculate marginal, conditional, and joint (frequentist) probabilities
    • compare column percentages (when an IV is in columns of a crosstab)
  • generate and modify a contingency table (crosstab) in R with dplyr::select() & sjPlot::sjtab(depvar, indepvar) or with crosstable(depvar, by=indepvar)

Data visualization & aesthetics

  • improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
    • modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
    • recognize that a gt() table can be modified to add or remove the decimals in specific columns or rows with fmt_number()
  • create basic graphs using ggplot2’s ggplot() function
    • generate simple bar charts and histograms to visualize shape and central tendency of a frequency distribution
    • generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution
  • modify elements of a ggplot object
    • change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “orange”) or hexidecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
    • add or change a preset theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
    • select colors from a colorblind accessible palette (e.g., using viridisLite::viridis()) and specify them for the outline and fill colors in a ggplot geometric object (e.g., geom_boxplot())
    • add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title"))
  • be able to combine multiple ggplot() plots into a single figure using “patchwork” package
    • know how to customize plot layout and add title to patchwork figure
    • recognize that one can write a custom function to repeatedly generate similar plots before combining them with patchwork
    • recognize you can use patchwork::wrap_plots() to quickly combine ggplot objects contained in a list
  • recognize that one can plot means and confidence intervals using ggplot() + geom_point() + geom_errorbars
  • recognize that one can add elements like vertical lines (+ geom_vline()), arrows (+ geom_segment), or text (+ annotate()) elements to a ggplot() object

Hypothesis testing & statistical inference

  • conduct and interpret a null hypothesis significance test
    • specify null (test) hypothesis & identify contrasting alternative hypothesis (or hypotheses)
    • set an alpha or significance level (e.g., as risk tolerance or false positive error control rate)
    • calculate a test statistic and corresponding p-value
    • compare a test p-value to alpha level and then determine whether the evidence is sufficient to reject the null hypothesis or should result in a failure to reject the null hypothesis
  • conduct a bimomial hypothesis test in R with rstatix::binom_test()
  • generate and interpret confidence intervals to quantify uncertainty caused by sampling variability
    • identify t or z critical values associated with a two-tailed confidence level using qt() or qnorm()
    • estimate the standard error of a sample mean or proportion in R
    • estimate a two-tailed confidence interval around a sample mean or proportion in R
    • properly interpret and avoid common misinterpretations of confidence intervals
  • be able to conduct a one-sample z or t hypothesis test of the difference between a sample and assumed population mean in R and interpret results
    • be able to conduct a one-sample test using the base R t.test() function
    • be able to manually calculate a z or t test statistic by typing formula in R
    • be able to conduct a one-sample test using infer::t_test()
    • know how to use infer::visualize() to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis
  • conduct a chi-squared test of independence using sjPlot::sjtab() or chisq.test() and interpret results
    • specify different measures of association with statistics= using sjPlot::sjtab() and interpret them appropriately.
    • generate observed and expected frequencies by assigning results of chisq.test() to an object (e.g., chisq) and then calling elements from object (e.g., chisq$observed or chisq$expected)

If you do not recall how to do these things, review Assignments 1-9.

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

  • independent random samples
  • matched or dependent samples
  • sampling distribution of sample mean differences
  • how to calculate test statistic for a pooled variance t-test
  • how to calculate test statistic & degrees of freedom for a separate variance t-test
  • how to calculate a dependent-samples t-test
  • how to calculate a difference between proportions z-test

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).


Understanding Sampling Distribution of Sample Mean Differences

Goal: Visualize sampling distribution of sample mean differences, sampling variability, and standard errors; understand statistical power of a two-sample t-test

This week’s book chapter focused on null hypothesis tests to make inferences about the equality of two group means or proportions using sample data. The chapter discussed in detail tests involving differences between independent and dependent, or matched, samples as well as the appropriate time to use pooled or separate variance estimates.

In this assignment, you will learn how to make an inference about the (in)equality of two population group means by conducting an independent sample t-test and then determining whether to reject or fail to reject the null hypothesis of no difference in population group means. We will also briefly introduce Levene’s test as a method for determining whether one should assume equal or unequal group variances in the population, and we will introduce the half-violin/half-dotplot as a way to visualize variable distributions and potential outliers for two (or more) groups.

Before jumping into conducting independent samples t-tests and plotting in R, we want to be sure you understand the mechanisms that allow us to apply these statistical tests to samples and then make valid inferences about population group mean differences. So, we will first use simulation and plotting to help you visualize the sampling distribution of sample mean differences. While your book discusses this in detail, we are reviewing the topic here as well because it is essential to understanding the inferential logic behind the independent samples t-test of equality in means.

Simulating the sampling distribution of sample mean differences

Let’s start by using rnorm() to draw two random samples from independent “population groups”, each represented by a distinct normal probability distribution. We will imagine that the first “population group” of scores is normally distributed with a mean=20 and sd=5. In comparison, we will imagine the second “population group” of scores is also normally distributed, with a different mean=24 but equal variance (sd=5).

It may help to think back to our “car miles per gallon (mpg)” example from Assignment 8. In that example, we first simulated a population of cars with normally distributed mpg values (mean=22, sd=6), then repeatedly drew random samples from that simulated population, plotted the sample mpg distributions, then illustrated the sampling distribution of the sample mean (for different sample sizes) by plotting all the sample means. We then conducted a one-sample t-test that compared the sample mean mpg from the mtcars data with our (null) hypothesized population mean mpg (\(\mu=22\)).

Likewise, in our current example, it may be helpful to imagine that our two simulated populations are independent groups of car mpg values - say, two population groups of cars manufactured in different years. Since we are simulating the populations, we know that the two population group means differ by an average of 4 units (\(\mu_2-\mu_1 = 24-20 = 4\)), or by an average of 4mpg.

However, when we draw random samples from independent population groups, we rarely know the true population parameters, such as the population group means (\(\mu_1\) or \(\mu_2\)) or the population group mean difference (\(\mu_2-\mu_1\)). Hence, we may use descriptive statistics from our samples as estimates of population parameters, then rely on inferential statistics to draw conclusions about the populations from which the samples were drawn. Here, we will illustrate this process by drawing samples from two independent (known) population groups and then conducting independent samples t-tests to make inferences about the populations from which our samples were drawn.

We will begin with relatively small samples of n=100 random draws from each simulated population group. Below we plot histograms and overlay normal curves for our two independent random samples of n=100 values each.

# https://stackoverflow.com/questions/21008884/simlulating-a-t-test-in-r
# https://www.guru99.com/r-apply-sapply-tapply.html
# https://stackoverflow.com/questions/6967664/ggplot2-histogram-with-normal-curve

#Illustrate sample mean differences for two independent population groups 
nvals <- 100
set.seed(1138) 

#sim two columns (groups) of rnorm values with "nval" size 
grpA <- rnorm(n=nvals, mean=20, sd=5) %>% as_tibble_col(column_name = "A")
grpB <- rnorm(n=nvals, mean=24, sd=5) %>% as_tibble_col(column_name = "B")

#column bind & transform to long form for graphing
data <- cbind(grpA, grpB) %>% 
  pivot_longer(
    cols = c(`A`, `B`), 
    names_to = "Group", 
    values_to = "Simval"
  )

#Create dnorm x/sd values in data to draw w/geom_point() 
  # adds gganimate transitions on curves, which do not work w/stat_function
xA <- as_tibble_col(seq(min(grpA$A), max(grpA$A), length.out = nvals), column_name = "A")
xB <- as_tibble_col(seq(min(grpB$B), max(grpB$B), length.out = nvals), column_name = "B")
dnormx <- cbind(xA, xB) %>% 
  pivot_longer(
    cols = c(`A`, `B`), 
    names_to = "GroupX", 
    values_to = "xval" 
  ) 

sdA <- as_tibble_col(dnorm(seq(min(grpA$A), max(grpA$A), length.out = nvals), 
                           mean(grpA$A), sd(grpA$A)), column_name = "A")
sdB <- as_tibble_col(dnorm(seq(min(grpB$B), max(grpB$B), length.out = nvals), 
                           mean(grpB$B), sd(grpB$B)), column_name = "B")
dnormsd <- cbind(sdA, sdB) %>% 
  pivot_longer(
    cols = c(`A`, `B`), 
    names_to = "GroupSD", 
    values_to = "sdval" 
  ) 

#merge (column bind) data & drop redundant Group cols
data <- cbind(data, dnormx, dnormsd) %>% select(c(-GroupX, -GroupSD))

#plot base figures
twosampfig <- data %>% ggplot() +
  geom_histogram(aes(x = Simval, y =..density.., fill=Group), alpha=0.3, 
                 binwidth=2, position="identity", breaks = seq(0, 45, by = 2)) +
  scale_fill_manual(values=c('#365c8d','#4ac16d')) +
  geom_point(aes(x = xval, y = sdval, color=Group), size = 2, alpha = 0.5) +
  scale_color_manual(values=c('#365c8d','#4ac16d')) + 
  labs(x= "Two Samples (n=100 each) from Independent Population Groups") + 
  theme_minimal() + 
  coord_cartesian(expand = FALSE)
# NOTE: this code will not work properly with groundhog - need to raise issue w/devs
# To create and save the animated gif file rather than simply load image (next chunk):
# 1. load gganimate & transformer library in library chunk above 
  # (remove comment # *before* groundhog code) 
# 2. change chunk options here from eval=FALSE to eval=TRUE

#add plot animations
twosampgif <- twosampfig + 
  transition_states(Group,
                    transition_length = 2,
                    state_length = 1) +
  ease_aes() + 
  enter_fade() + 
  exit_fade()  

anim_save(here("Images","twosampgif.gif"), twosampgif)
Two independent samples, n=100 each

Two independent samples, n=100 each

The figure below highlights the observed sample difference in group means, which is represented by the red double-sided arrow in the space between the two vertical lines drawn at each of the two sample group mean values. The sample difference in group means from these two independent random samples is 3.63.

Note: It may not be obvious to you from our knitted RMD assignment file, but we are actually using inline R code to call desired output values in the text, such as the sample difference in means of 3.63 that we reported above. To do this, we initiate an inline R code chunk by typing `r ` followed by our code in the text to call the result that we want. For instance, to report the sample mean difference above, after initiating the R code, we typed round(mean(grpB$B) - mean(grpA$A), digits=2).

Using inline R code to report results is a preferred practice over simply typing or copy-and-pasting output values for reproducibility purposes. First, it helps us avoid all-to-common typos or copy-and-paste errors. Second, if we change our code at all in a way that affects our results (e.g., different data; variable selections; variable recoding; modeling decisions), then we may forget to retype or repaste the new results, but our inline R code will automatically update with the new values!


twosampfig +
  geom_vline(xintercept = mean(grpA$A), 
                color = "#365c8d", size=1.2) + 
  geom_vline(xintercept = mean(grpB$B), 
                color = "#4ac16d", size=1.2) + 
  geom_segment(x = mean(grpA$A), y = .13, xend = mean(grpB$B), yend = .13, 
               color="#bc3754", size=1.5,
               arrow = arrow(length = unit(0.025, "npc"), ends = "both")) + 
    annotate("text", x=34, y=.125, label="Difference in group means = 3.63\n(sample #1)", 
             angle=0, color="#bc3754") 

Remember, social scientists often are less interested in observed sample group differences and more interested in making inferences about the underlying population group differences that generated those sample differences we observe. In this case, we know that the true population difference in group means is 24 - 20 = 4. We also know that our observed sample group difference of 3.63 is slightly off the mark, specifically by an error (observed - expected) of 3.63 - 4 = -0.37. Since we randomly selected data from each population and we did not add any other source of bias in our simulation, we also know that this error of -0.37 between the observed difference in sample means and the expected or true difference in population means is due to sampling variability - random “chance” variations in statistical estimates that vary from sample to sample.

By now, we also know that we can reduce sampling variability by drawing larger random samples from each population. We will return to this idea later. Moreover, based on our understanding of the central limit theorem, we expect that repeated sample draws from our two populations will result in an array of observed differences in sample means that are normally distributed and that cluster around the true difference in population means (=4). We illustrate this in the figure below.

#Illustrate sampling distribution of differences in sample means, two indep pop grps 
nsamps <- 1000
nvals <- 100
set.seed(1138) 

#sim 1000 columns (samples) of rnorm values with "nval" size for each group 
grpC <- replicate(nsamps, rnorm(n=nvals, mean=20, sd=5), simplify=TRUE) 
grpD <- replicate(nsamps, rnorm(n=nvals, mean=24, sd=5), simplify=TRUE) 

#calculate difference in sample means for all 1000 samples
mudiffDC <- as_tibble(colMeans(grpD)-colMeans(grpC))

#plot difference in sample means for all 1000 samples
sampdiffs <- mudiffDC %>% ggplot(aes(x=value)) +
  geom_dotplot(method="histodot", color="#21918c", fill="#5ec962", binwidth=.105, 
               dotsize=0.4) +
  labs(x= "Observed Sample Differences in Group Means\nN=1,000 Samples (Group n=100)") + 
  stat_function(fun = dnorm, args = list(mean = mean(mudiffDC$value), sd = sd(mudiffDC$value)), 
                color='#443983', size=1.1, alpha=0.7) + 
  geom_point(aes(x=3.57, y=0.005), color="#bc3754", fill="#bc3754", size=2.5) +
  geom_segment(x = 2.8, y = .40, xend = 3.53, yend = .025, 
               color="#bc3754", size=1.5,
               arrow = arrow(length = unit(0.025, "npc"))) + 
  annotate("text", x=2.5, y=.45, label="Difference in group means\nobserved in our first sample\n(sample #1 diff. = 3.63)",
           angle=0, color="#bc3754") + 
  geom_vline(xintercept = mean(mudiffDC$value), linetype="dashed", color = "#443983", 
             size=1.2, alpha=0.7) + 
  theme_minimal() + 
  coord_cartesian(expand = FALSE, ylim=c(0,.6)) + 
  scale_y_continuous(NULL, breaks = NULL) #y-axis values not meaningful so hide

sampdiffs

To create this plot, we drew N=1,000 random samples like the Group A sample above, each with n=100 values, as well as N=1,000 samples like the Group B sample above, each with n=100 values as well. Next, for each pair of Group A and Group B samples of n=100 each, we calculated the difference in group means (mean[GroupB] - mean[GroupA]). Finally, we plotted these N=1,000 differences in observed sample means between Groups A and B using ggplot’s geom_dotplot(). The plot, then, helps us visualize the sampling distribution of sample mean differences between Groups A and B (see pp.305-7 in your book).

As you can see, the differences in independent sample means are approximately normally distributed and cluster around the true population mean (with rounding: average difference=4.01).

This is an important point to understand about sampling distributions - even with relatively small samples (n=100 each), the mean of the sampling distribution will cluster around the true population mean. However, the sampling distribution will be more variable (i.e., more spread out) with smaller samples, and it will be less variable (i.e., narrower) with larger samples. One implication is that, when we draw small samples (e.g., n=100), there is a relatively high chance of estimating a group mean difference that is quite a ways off (e.g., <3 or >5) from the true population difference (=4). Put differently, our average error in estimating the true population difference will be larger when we draw small samples, and it will be smaller when we draw large samples.

Instead of plotting observed differences in sample means, since we know the true difference in population means, we could illustrate this point by instead plotting our error in estimation due to sampling variability. To do this, we would simply subtract the true population mean difference (=4) from each sample difference in means that we calculated.

  • Samples with estimated group differences >4 will generate positive error values (e.g., 5 - 4 = 1). These samples resulted in overestimation errors of the true difference in population group means.
  • Samples with estimated group differences <4 will generate negative error values (e.g., 3 - 4 = -1). These samples resulted in underestimation errors of the true difference in population group means. Recall, the first estimated sample difference in means that we observed generated a negative error value (3.63 - 4 = -0.37), meaning our first sample resulted in underestimation of the true difference in population group means (by 0.37 units).
  • Samples with estimated group differences =4 will generate a value of “0” (i.e., 4 - 4 = 0). These samples generated accurate estimates of the true difference in population group means.

Like the sampling distribution of sample mean differences, we expect the distribution of sampling errors (like the one we observed: 3.63 - 4 = -0.37) will also be normally distributed. However, this distribution will be comprised of positive sampling errors (overestimates) and negative sampling errors (underestimates) that all cluster around a mean error=0.

#Illustrate sampling distribution of errors in estimating diffs in sample means 

set.seed(1138) 

#calculate difference in sample means for all 1000 samples
mudiffDC <- mudiffDC %>% mutate(
  error = value - 4 # group diff in population mean (4) - observed diff in samp means (value) 
)

#plot difference in sample means for all 1000 samples
samperror <- mudiffDC %>% ggplot(aes(x=error)) +
  geom_dotplot(method="histodot", color="#ed6925", fill="#fbb61a", binwidth=.105, 
               dotsize=0.4) +
  labs(x= "Population - Sample Group Mean Diff. (Sampling Errors)\nN=1,000 Samples (Group n=100)") + 
  stat_function(fun = dnorm, args = list(mean = mean(mudiffDC$error), sd = sd(mudiffDC$error)), 
                color='#57106e', size=1.1, alpha=0.7) + 
  geom_point(aes(x=-0.42, y=0.005), color="#bc3754", fill="#bc3754", size=2.5) +
  geom_segment(x = -1.5, y = .3, xend = -0.5, yend = .025, 
               color="#bc3754", size=1.5,
               arrow = arrow(length = unit(0.025, "npc"))) + 
  annotate("text", x=-1.7, y=.36, label="Sampling error from\nour first sample\n(sample #1 error. = -0.37)",
           angle=0, color="#bc3754") + 
  geom_vline(xintercept = mean(mudiffDC$error), linetype="dashed", color = "#57106e", 
             size=1.2, alpha=0.7) + 
  theme_minimal() + 
  coord_cartesian(expand = FALSE, xlim=c(-2.6,2.6), ylim=c(0,.6)) + 
  scale_y_continuous(NULL, breaks = NULL)

samperror

As expected, the distribution of sample estimation errors, or the sampling error distribution is normally distributed with a mean =“0”. However, also note that a lot of our samples generated estimates of the population group mean difference that were substantially off the true difference. In fact, 15% (N=150) of our 1,000 samples overestimated or underestimated the true population group mean difference by more than 1 unit (i.e., estimated diff. <3 or >5).

Given this, what do you think would happen to the sampling error distribution if we increase our initial sample group sizes from n=100 per group to n=1,000 per group? If we raise the group sample size to n=1,000 in each group, how many of our N=1000 simulated sample group pairs will generate estimated group mean differences with errors of more than 1 unit (i.e., overestimated or underestimated mean differences <3 or >5)?

#Illustrate sampling distribution of differences in sample means, two indep pop grps 
nvals <-1000 
nsamps <- 1000
set.seed(1138) 

#sim 1000 columns (samples) of rnorm values with "nval" size for each group 
grpE <- replicate(nsamps, rnorm(n=nvals, mean=20, sd=5), simplify=TRUE) 
grpF <- replicate(nsamps, rnorm(n=nvals, mean=24, sd=5), simplify=TRUE) 

#calculate difference in sample means for all 1000 samples
mudiffFE <- as_tibble(colMeans(grpF)-colMeans(grpE))

#calculate difference in sample means for all 1000 samples
mudiffFE <- mudiffFE %>% mutate(
  error = value - 4 # group diff in population mean (4) - observed diff in samp means (value) 
)

#plot difference in sample means for all 1000 samples
samperrorn1000 <- mudiffFE %>% ggplot(aes(x=error)) +
  geom_dotplot(method="histodot", color="#ed6925", fill="#fbb61a", binwidth=.085, 
               dotsize=0.4) +
  labs(x= "Sampling Errors\nN=1,000 Samples\nGroup n=1000") + 
  stat_function(fun = dnorm, args = list(mean = mean(mudiffFE$error), sd = sd(mudiffFE$error)), 
                color='#57106e', size=1.1, alpha=0.7) + 
  theme_minimal() + 
  coord_cartesian(expand = FALSE, xlim=c(-2.5,2.5), ylim=c(0,2)) + 
  scale_y_continuous(NULL, breaks = NULL)

samperrorn100 <- mudiffDC %>% ggplot(aes(x=error)) +
  geom_dotplot(method="histodot", color="#ed6925", fill="#fbb61a", binwidth=.085, 
               dotsize=0.4) +
  labs(x= "Sampling Errors\nN=1,000 Samples\nGroup n=100") + 
  stat_function(fun = dnorm, args = list(mean = mean(mudiffDC$error), sd = sd(mudiffDC$error)), 
                color='#57106e', size=1.1, alpha=0.7) + 
  theme_minimal() + 
  coord_cartesian(expand = FALSE, xlim=c(-2.5,2.5), ylim=c(0,2)) +
  scale_y_continuous(NULL, breaks = NULL)

samperrorn100 + samperrorn1000

# mudiffDC %>% count(error >=1)
# mudiffDC %>% count(error <=1)
# https://rpruim.github.io/s341/S19/from-class/MathinRmd.html

The figure on the left (above) is the same one we showed before - it displays the sampling error distribution for our N=1,000 simulated samples with group sizes of n=100 per group. The figure on the right (above) displays the sampling error distribution for our N=1,000 simulated samples with group sizes of n=1,000 per group. As you can see, the distribution of sample estimation errors, or the sampling error distribution, is normally distributed with a mean =“0” in both figures.

However, note that the spread of the sampling distribution is much wider in the figure on the left (small groups of n=100 each) and much narrower in the figure on the right (large groups of n=1,000 each). Moreover, while 15% of samples on the left underestimated or overestimated the true group difference by at least 1 unit (i.e., <3 or >5; error <-1 or >1), none of the samples on the right generated group mean difference estimates that were in error by more than +/-1 unit.

This illustrates another essential point about sampling distributions: the spread - e.g., the variance or standard deviation - of the sampling distribution of a statistical estimate decreases as sample sizes increase! In fact, as you learned in this chapter (p.306), the formula for the standard deviation of the sampling distribution of the difference in means explicitly adjusts for the sample group sizes (n_1 & n_2):

\[ \sigma_{\overline{X}_1-\overline{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}} \]

Note: You can add LaTeX style mathematical equations to your R Markdown text just like we did above! To offset the equation on its own line, type two dollar signs ($$), input the equation, then close the equation editor by typing two more dollar signs. To add equations as inline text, type one dollar sign ($), input the equation, then close the equation editor with a second dollar sign.

For instance, to display the equation above, we typed the following:

$$

\sigma_{\overline{X}_1-\overline{X}_2} =

\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}

$$


Recall, we know the population standard deviation for both groups because we set them each to equal \(\sigma=5\) in our simulations. Therefore, we do not need to use the formula for calculating the pooled variance estimate of the standard error of the difference (see p.309) to estimate the standard deviation of the sampling distribution using the observed sd of our first sample. Instead, we can simply plug in the population values into the formula above to calculate the standard error of the difference (aka, the standard deviation of the sampling distribution of sample means) for the small (n=100) and large (n=1,000) sample groups.


Population standard error of the difference, group n=100

popsediffn100 <- sqrt((5^2/100) + (5^2/100))
popsediffn100
## [1] 0.7071068


Population standard error of the difference, group n=1,000

popsediffn1000 <- sqrt((5^2/1000) + (5^2/1000))
popsediffn1000
## [1] 0.2236068

By now, it should not be surprising to see that the standard error of the difference is smaller for larger groups (n=1,000) and larger for the smaller groups (n=100). We can also check the standard deviation of our N=1,000 simulated sample differences to see if they are close to these known population values.

Sample estimated standard error of difference, group n=100

simsediffn100 <- sd(mudiffDC$value)
simsediffn100
## [1] 0.7414356


Sample estimated standard error of difference, group n=1,000”

simsediffn1000 <- sd(mudiffFE$value)
simsediffn1000
## [1] 0.2224673

Note that both estimates from our N=1,000 simulated samples of small and large groups are close to the true population standard errors of the difference values we calculated above - so our simulation of the sampling distribution worked as expected. However, the small group simulated estimate (0.74) was further away from our population value (0.71). This too should be unsurprising, as you know by now to expect more sampling variability in statistical estimates with smaller samples.

Bottom line: If researchers collect small samples, they take a substantial risk of making estimation and inference errors!

Simulated sample t-tests of mean differences & statistical power

Now, let’s return to our first sample of n=100 in Group A and n=100 in Group B. We will use estimates of the population group means from these two independent samples and a two-sample t-test with equal variance assumed to test the null hypothesis of no difference in population group means. Let’s assume a two-tailed test with a conventional alpha threshold of 0.05.

#conduct t-test
t_testAB <- t.test(grpA, grpB, paired=FALSE, var.equal=TRUE)
t_testAB
## 
##  Two Sample t-test
## 
## data:  grpA and grpB
## t = -5.5416, df = 198, p-value = 9.479e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.926117 -2.340310
## sample estimates:
## mean of x mean of y 
##  20.33748  23.97070

We conducted a t-test! Now, what does it mean?

Well, as your book explains, we start with the observed sample mean difference of 20.34 (\(\overline{X}_A\)) minus 23.97 (\(\overline{X}_B\)), which equals -3.63 (\(\overline{X}_A - \overline{X}_B\)). We then convert this sample mean difference into a t-standardized score by dividing the mean difference by the standard error of the difference using the pooled variance t-test formula (see p.309).

Although we know the actual population standard error of the difference, here we are strictly estimating from our samples to illustrate the sample inference process. Thus, R will divide the mean difference by the sample pooled variance estimate of the standard error as described on p.309 of your book. We specified a pooled variance estimate, which assumes equal variance between groups, by including var.equal=TRUE.

  • Note: In this case, we know the true population variance is equal across groups because we set it to be \(\sigma=5\) for both groups, so this assumption is appropriate. However, since we rarely know the true population parameters, it is typically good practice not to make an assumption of equal variance. Rather, you can conduct a Levene’s test (described later), or you can make the conservative assumption that the group variances are not equal. If the sample variances are equal, then the t-test results will converge.

We can check the model’s standard error estimate by clicking on the “t_testAB” object in our environment and navigating to the “stderr” value, or we can call it in code with t_testAB$stderr; that estimate is 0.6556248.

Likewise, we can use these values to see how the t-standardized mean difference was calculated in R:

Mean difference, Group A & B (Sample #1)

diffAB <- mean(grpA$A) - mean(grpB$B)
diffAB
## [1] -3.633213


Standard error of difference, Mean(A) - Mean(B)

stderrAB <- t_testAB$stderr
stderrAB
## [1] 0.6556248


t-standardized mean difference, Group A & B

tvalAB <- diffAB / stderrAB
tvalAB
## [1] -5.541605

Note the t value here is the same as the one found in our t-test output. That output also includes a 95% confidence interval for the estimated t-standardized mean difference. Additionally, the p-value is near zero (9.4793139^{-8}), which is well below our pre-specified alpha level of 0.05. Given this, with our first simulated sample, we would reject the null hypothesis of no difference in means; it would be highly improbable to observe a sample group mean difference as large or larger than the one we observed (i.e., observed diff >= +/-3.63) if the true population mean difference between groups A and B were equal to “0”.

Of course, we know this is a valid inference because we set the true population difference to be equal to “4”, so we would hope such a test would reject the null hypothesis of diff=0. Yet, do you know how many of our N=1,000 simulated samples with group size n=100 would have reach the same valid inference? There are various ways to approach this question, which essentially asks about the (inverse of the) statistical power of our test, or the probability of making a Type II (false negative) error.

One way we could approach answering this question is by using replicate() to perform t-tests on all N=1,000 samples and then counting the number of tests that failed to reject the null hypothesis. Let’s try it by generating 1,000 new samples of group size n=100 each and then counting the number of Type II (false negative) errors.

How many of our N=1,000 samples (group n=100) led to false negative test? (i.e., ‘TRUE’)

set.seed(1138) 
simtrep <- replicate(1000,t.test(rnorm(n=100, mean=20, sd=5),rnorm(n=100, mean=24, sd=5)), simplify=FALSE)
pvals <- as_tibble(sapply(simtrep, '[[', 'p.value'))
pvals %>% count(value>=0.05)
## # A tibble: 1 × 2
##   `value >= 0.05`     n
##   <lgl>           <int>
## 1 FALSE            1000

You might find it interesting to see that none of our N=1,000 simulated samples with relatively small group sizes of n=100 each would have generated a false negative inference from the t-test results. That is, each sample test generated a p-value that was less than our alpha threshold of 0.05 (i.e., n[TRUE]=0; n[FALSE]=1000). This is despite the fact that many of these samples would have led to non-negligible overestimates or underestimates of the true population mean difference!

This helps us highlight one of the key pitfalls of interpreting results of null hypothesis significance testing: A significant p-value does not mean we have a good sample estimate of the population. Rather, it merely means that we had sufficient statistical power to detect a difference of the observed magnitude (or larger). A statistically significant estimate could be way off the magnitude or even the direction of the true population difference - especially if we are estimating from relatively small samples with a high degree of sampling variability.

  • Note: Recall in our last assignment (#9) when our chi-squared test resulted in us rejecting the null hypothesis of independence yet our measure of association indicated a very weak relationship between variables? That earlier example illustrated that rejection of the null hypothesis does not necessarily indicate the relationship or difference we are estimating is strong or even of a meaningful magnitude. Our current example adds to this insight by illustrating that a valid inference from a null hypothesis test does not necessarily indicate that your sample descriptive statistic is an accurate estimate of the true population parameter!

Additionally, though it was not an issue above, a statistically non-significant estimate could be a false positive - especially with small samples and/or when the true population magnitude is small (relative to our sample size). In the example above, we selected a magnitude for the population mean difference (24-20=4) that was large enough to reliability detect with group sizes of n=100. In other words, our test was designed to have sufficient statistical power to detect an effect of that magnitude given a null hypothesis of no mean difference, as about 99.99% of tests (like we observed) would be expected to detect a true population difference. But what if we had set a different null hypothesis of, say, a magnitude of diff=2 rather than diff=0? Or, what if the true population difference was a smaller magnitude of, say, 2 units instead? Or, what if the two populations were more heterogeneous and, hence, had higher standard deviations?

Let’s explore what might happen by running the simulation again, but this time we will specify a true population mean of \(\overline{X}_A=22\) for the second group (instead of “24”) and a population standard deviation of \(\sigma=6\) for both groups (instead of “5”).

  • Note: This means we will be attempting to make inferences about the null hypothesis of no difference with samples that are drawn from population groups that are more similar to each other (i.e., smaller mean difference) and that have more within-group variation (i.e., larger standard deviations). If we do not change our research design (e.g., sample sizes or measurement), then this situation makes it more difficult - we have less statistical power - to detect true mean differences via rejection of the null hypothesis.

If population diff=2 & sigma=6, how many samples (group n=100) would lead to false negative test? (i.e., ‘TRUE’)

set.seed(1138) 
simtrep <- replicate(1000,t.test(rnorm(n=100, mean=20, sd=6),rnorm(n=100, mean=22, sd=6)), simplify=FALSE)
pvals <- as_tibble(sapply(simtrep, '[[', 'p.value'))
pvals %>% count(value>=0.05)
## # A tibble: 2 × 2
##   `value >= 0.05`     n
##   <lgl>           <int>
## 1 FALSE             658
## 2 TRUE              342

In this example, with a true mean difference of “2” units (22 - 20) and a true population standard deviation of “6” for both groups, 65.8%% of our samples generated a p-value that is less than our alpha level of 0.05. In other words, this test had an approximate power of around 0.66 (or 0.65), meaning that if the true population characteristics match what we specified in our simulation and if our sample sizes were the same each time n=100 per group), then we can a priori expect around 66% of sample t-test replications to accurately detect the true population difference.

In contrast, 34% of our simulated samples generated a p-value that is as large or larger than our alpha level of 0.05, meaning that about one-third of these samples would have generated false negative t-test results (Type II inference errors) in this situation.

So, what lessons can we take from this example? All else equal, when the true population difference is smaller and/or the population groups are more heterogeneous (i.e., more variable), then random samples drawn from the population also will be more variable, and sample inferential tests will become less reliable. Once again, one way to improve the reliability (power) of those inferential tests - and improve estimation - is increasing our sample size. Let’s show results of one final p-value simulation example, keeping our population mean difference at “2” units and our population standard deviation at “6” as in the last example, but increasing our group sample size to n=300 per group.

If diff=2 & sigma=6, how many samples (group n=1000) would lead to false negative test? (i.e., ‘TRUE’)

set.seed(1138) 
simtrep <- replicate(1000,t.test(rnorm(n=300, mean=20, sd=6),rnorm(n=300, mean=22, sd=6)), simplify=FALSE)
pvals <- as_tibble(sapply(simtrep, '[[', 'p.value'))
pvals %>% count(value>=0.05)
## # A tibble: 2 × 2
##   `value >= 0.05`     n
##   <lgl>           <int>
## 1 FALSE             980
## 2 TRUE               20

By increasing our group size to n=300 each (n=600 total), we drastically increase the statistical power of our test (to about 0.98), while reducing the number of expected false negative results (Type II inference errors) to about 2% of trials!

Okay, that is enough about sampling variability, sampling error, and statistical power for now. Let’s move on to learning how to conduct your own two-sample t-tests so you can complete the SPSS Exercises for Chapter 10.

Part 1 (Assignment 10.1)

Goal: Read in Youth Dataset and Find Mean Differences

In this assignment, you will work through Question #2 from the SPSS Exercises at the end of Chapter 10. This question ask you to investigate whether students who spend lots of time studying also view delinquency as more wrong compared to those who spend less time studying. Essentially, you will be testing the null hypothesis of no difference in means on the ‘moral’ variable between two independent groups of youth - those who study 8 or fewer hours (<9) per week (‘studyhard’=0) and those who study more than 8 hours (9+) per week (‘studyhard’=1).

As we have done in the past couple assignments, we will first illustrate the steps you will take to complete the assignment using a different research question and different variables. Specifically, we will instead examine whether those students who spend lots of time studying are less likely to engage in delinquency compared to those who study less frequently. As such, ‘studyhard’ will still be our focal independent variable that defines our two independent groups. However, we will substitute ‘delinquency’ for our dependent variable (instead of ‘moral’).

Once again, we want to ensure you learn how to conduct an independent sample t-test of mean differences in R using real data and that you can apply these steps on your own to answer the assignment questions. Simply replace ‘delinquency’ with ‘moral’ throughout the assignment to obtain the correct information. Be sure to repeat these steps with the correct variables listed in the assignment to ensure you receive full credit.

(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR_MO_DY for the actual date. E.g., 2022_07_06_Fordham_K300Assign10)

  1. Go to your K300_L folder, which should contain the R Markdown file you created for Assignment 9 (named YEAR_MO_DY_LastName_K300Assign9). Click to open the R Markdown file.
    1. Remember, we open RStudio in this way so the here package will automatically set our K300_L folder as the top-level directory.
    2. In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.
    3. The dialogue box asks for a Title, an Author, and a Default Output Format for your new R Markdown file.
    4. In the Title box, enter K300 Assignment 10.
    5. In the Author box, enter your First and Last Name (e.g., Tyeisha Fordham).
    6. Under Default Output Format box, select “HTML document” (HTML is usually the default selection)

  2. Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete this text before you begin working.
    1. Create a second-level header titled: “Part 1 (Assignment 10.1).” Then, create a third-level header titled: “Read in Youth Dataset and Find Mean Differences”
    2. This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text: This R Markdown document contains my work for Assignment 10. It is my work and only my work.
    3. Now, you need to get data into RStudio. You already know how to do this, but please refer to Assignment 1 if you have questions.

  3. Create a third-level header in R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
    1. Insert an R chunk.
    2. Inside the new R code chunk, load the following packages: tidyverse, haven, here, sjmisc, see, datawizard, car, and sjPlot.
      • This is our first time using the see, car, and datawizard packages. see is part of the easystats suite of packages. easystats is a collection of R packages which aims to provide a unifying, consistent, and easy to use framework for analyzing and visualizing data in R. The package we will use from this suite, see, is a plotting companion that allows us to make aesthetically pleasing visualizations of our data relatively easily. To learn more about see and other easystats packages, visit GitHub. We will briefly discuss the datawizard and car packages later in the assignment.
      • Recall, you only need to install packages one time. However, you must load them each time you start a new R session.

  4. After your first code chunk, create another third-level header in RMD titled: “Read Data into R”
    1. Insert another R code chunk.
    2. In the new R code chunk, read the “Youth_0.sav” SPSS datafile into R and assign it to an object named YouthData.
      • Forget how to do this? Refer to any of the previous assignments.
    3. In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data object: YouthData.
      • This will call the object and provide a brief view of the data. Remember, you can get a similar but more visually appealing view by simply clicking on the object in the “Environment” window. Also, be sure to put these functions on separate lines or they will not run properly.
      • Here is a glimpse of what the first few rows of your data should look like:
YouthData <- read_spss(here("Datasets", "Youth_0.sav"))
head(YouthData) %>% gt()
Gender v2 v21 v22 v63 v77 v79 v109 v119 parnt2 fropinon frbehave certain moral delinquency d1 hoursstudy filter_$ heavytvwatcher studyhard supervision drinkingnotbad Lowcertain_bin
1 15 36 15 3 1 3 5 1 5 18 9 9 19 8 1 15 1 1 1 0 0 1
0 15 3 5 4 1 2 4 1 8 11 6 11 19 10 1 5 0 0 0 1 0 0
1 15 20 6 4 1 1 4 2 8 12 5 11 20 1 0 6 0 1 0 1 0 0
0 15 2 4 2 1 2 5 3 4 9 5 13 19 1 0 4 1 0 0 0 0 0
0 14 12 2 4 3 5 5 3 7 27 9 4 18 104 1 2 1 0 0 1 1 1
1 15 1 3 4 1 1 5 3 7 8 9 10 20 0 0 3 0 0 0 1 0 0


  1. Next, we will examine frequency distributions and central tendencies for our ‘delinquency’ and ‘studyhard’ variables. Remember, the SPSS Exercises for Chapter 10 in your book includes a description of the data and the variables we will be using. You can also use sjPlot::view_df() to get a snapshot of your data and variable information.
    1. Create a third-level header titled: “Create Frequency Tables”
    2. Insert a R chunk.
    3. Type YouthData %>% frq(studyhard). Hit enter and type YouthData %>% frq(delinquency).
      • After running, your R studio should show two frequency tables, one for the variable ‘studyhard’ (with a mean of .27) and one for the variable ‘delinquency’ (with a mean of 26.4).
      • Note: When you repeat these steps later to complete the assignment, remember to replace the ‘delinquency’ variable with the ‘moral’ variable!

        # show distribution & central tendency of variables 

        YouthData %>%
          frq(studyhard)
## studies 9+ hours a week (studyhard) <numeric> 
## # total N=1272 valid N=1214 mean=0.27 sd=0.45
## 
## Value |            Label |   N | Raw % | Valid % | Cum. %
## ---------------------------------------------------------
##     0 | 8 or fewer hours | 881 | 69.26 |   72.57 |  72.57
##     1 |         9+ hours | 333 | 26.18 |   27.43 | 100.00
##  <NA> |             <NA> |  58 |  4.56 |    <NA> |   <NA>
        YouthData %>%
          frq(delinquency)
## time 1 delinquency scale (delinquency) <numeric> 
## # total N=1272 valid N=1272 mean=26.40 sd=39.27
## 
## Value |   N | Raw % | Valid % | Cum. %
## --------------------------------------
##     0 | 288 | 22.64 |   22.64 |  22.64
##     1 |  91 |  7.15 |    7.15 |  29.80
##     2 |  71 |  5.58 |    5.58 |  35.38
##     3 |  56 |  4.40 |    4.40 |  39.78
##     4 |  37 |  2.91 |    2.91 |  42.69
##     5 |  62 |  4.87 |    4.87 |  47.56
##     6 |  38 |  2.99 |    2.99 |  50.55
##     7 |  29 |  2.28 |    2.28 |  52.83
##     8 |  22 |  1.73 |    1.73 |  54.56
##     9 |  15 |  1.18 |    1.18 |  55.74
##    10 |  30 |  2.36 |    2.36 |  58.10
##    11 |  21 |  1.65 |    1.65 |  59.75
##    12 |  23 |  1.81 |    1.81 |  61.56
##    13 |  15 |  1.18 |    1.18 |  62.74
##    14 |  13 |  1.02 |    1.02 |  63.76
##    15 |  22 |  1.73 |    1.73 |  65.49
##    16 |  11 |  0.86 |    0.86 |  66.35
##    17 |   9 |  0.71 |    0.71 |  67.06
##    18 |   8 |  0.63 |    0.63 |  67.69
##    19 |   3 |  0.24 |    0.24 |  67.92
##    20 |  20 |  1.57 |    1.57 |  69.50
##    21 |   7 |  0.55 |    0.55 |  70.05
##    22 |   3 |  0.24 |    0.24 |  70.28
##    23 |   5 |  0.39 |    0.39 |  70.68
##    24 |   5 |  0.39 |    0.39 |  71.07
##    25 |  13 |  1.02 |    1.02 |  72.09
##    26 |   6 |  0.47 |    0.47 |  72.56
##    27 |   6 |  0.47 |    0.47 |  73.03
##    28 |   8 |  0.63 |    0.63 |  73.66
##    29 |   2 |  0.16 |    0.16 |  73.82
##    30 |   5 |  0.39 |    0.39 |  74.21
##    31 |  11 |  0.86 |    0.86 |  75.08
##    32 |   6 |  0.47 |    0.47 |  75.55
##    33 |   6 |  0.47 |    0.47 |  76.02
##    34 |   2 |  0.16 |    0.16 |  76.18
##    35 |  11 |  0.86 |    0.86 |  77.04
##    36 |   6 |  0.47 |    0.47 |  77.52
##    37 |   2 |  0.16 |    0.16 |  77.67
##    38 |   3 |  0.24 |    0.24 |  77.91
##    39 |   3 |  0.24 |    0.24 |  78.14
##    40 |   4 |  0.31 |    0.31 |  78.46
##    41 |   2 |  0.16 |    0.16 |  78.62
##    42 |   2 |  0.16 |    0.16 |  78.77
##    43 |   4 |  0.31 |    0.31 |  79.09
##    44 |   2 |  0.16 |    0.16 |  79.25
##    45 |   6 |  0.47 |    0.47 |  79.72
##    46 |   4 |  0.31 |    0.31 |  80.03
##    48 |   2 |  0.16 |    0.16 |  80.19
##    50 |   8 |  0.63 |    0.63 |  80.82
##    51 |   3 |  0.24 |    0.24 |  81.05
##    52 |   8 |  0.63 |    0.63 |  81.68
##    53 |   5 |  0.39 |    0.39 |  82.08
##    55 |   1 |  0.08 |    0.08 |  82.15
##    56 |   1 |  0.08 |    0.08 |  82.23
##    58 |   1 |  0.08 |    0.08 |  82.31
##    59 |   1 |  0.08 |    0.08 |  82.39
##    60 |   2 |  0.16 |    0.16 |  82.55
##    61 |   1 |  0.08 |    0.08 |  82.63
##    62 |   2 |  0.16 |    0.16 |  82.78
##    63 |   2 |  0.16 |    0.16 |  82.94
##    64 |   2 |  0.16 |    0.16 |  83.10
##    65 |   2 |  0.16 |    0.16 |  83.25
##    66 |   1 |  0.08 |    0.08 |  83.33
##    68 |   1 |  0.08 |    0.08 |  83.41
##    69 |   1 |  0.08 |    0.08 |  83.49
##    70 |   3 |  0.24 |    0.24 |  83.73
##    71 |   1 |  0.08 |    0.08 |  83.81
##    72 |   2 |  0.16 |    0.16 |  83.96
##    74 |   1 |  0.08 |    0.08 |  84.04
##    75 |   2 |  0.16 |    0.16 |  84.20
##    77 |   2 |  0.16 |    0.16 |  84.36
##    78 |   2 |  0.16 |    0.16 |  84.51
##    80 |   2 |  0.16 |    0.16 |  84.67
##    81 |   4 |  0.31 |    0.31 |  84.98
##    82 |   2 |  0.16 |    0.16 |  85.14
##    84 |   1 |  0.08 |    0.08 |  85.22
##    85 |   4 |  0.31 |    0.31 |  85.53
##    86 |   3 |  0.24 |    0.24 |  85.77
##    90 |   1 |  0.08 |    0.08 |  85.85
##    91 |   1 |  0.08 |    0.08 |  85.93
##    92 |   1 |  0.08 |    0.08 |  86.01
##    93 |   1 |  0.08 |    0.08 |  86.08
##    94 |   3 |  0.24 |    0.24 |  86.32
##    95 |   1 |  0.08 |    0.08 |  86.40
##    97 |   2 |  0.16 |    0.16 |  86.56
##    99 |   2 |  0.16 |    0.16 |  86.71
##   100 |  10 |  0.79 |    0.79 |  87.50
##   101 |   5 |  0.39 |    0.39 |  87.89
##   102 |   4 |  0.31 |    0.31 |  88.21
##   103 |   5 |  0.39 |    0.39 |  88.60
##   104 |   2 |  0.16 |    0.16 |  88.76
##   105 |   1 |  0.08 |    0.08 |  88.84
##   106 |   1 |  0.08 |    0.08 |  88.92
##   107 |   2 |  0.16 |    0.16 |  89.07
##   109 |   1 |  0.08 |    0.08 |  89.15
##   110 |   5 |  0.39 |    0.39 |  89.54
##   111 |   1 |  0.08 |    0.08 |  89.62
##   114 |   2 |  0.16 |    0.16 |  89.78
##   115 |   1 |  0.08 |    0.08 |  89.86
##   116 |   1 |  0.08 |    0.08 |  89.94
##   118 | 128 | 10.06 |   10.06 | 100.00
##  <NA> |   0 |  0.00 |    <NA> |   <NA>
  1. Now we will use the datawizard package, which is included with the see package, to filter and manipulate our data. I know – packages within packages? As complex as that may seem, datawizard is one of several packages that automatically are installed with see. This package allows us to tidily filter our data into objects. Visit this website to learn more about all that the see package can do!
    1. Create a third-level header titled: “Finding Mean Differences”
    2. Insert an R chunk.
    3. Type studyhi <- data_filter(YouthData, studyhard == 1). Then, hit enter and type mean(studyhi$delinquency) to see the ‘delinquency’ mean in our new filtered data object.
      • The ‘studyhard’ variable is a binary variable with categories of “0”, indicating the youth reports studying 8 or fewer hours per week, and “1”, indicating the youth reports 9+ study hours per week. So, we are using the datawizard package to filter our data to keep only those cases where the youth studied 9 or more hours a week (studyhard == 1), then assign the filtered data to a new data object named studyhi.
      • As you may recall, you already learned how to filter data like this using dplyr::filter(). We wanted to introduce you to the “easystats” suite of packages and, in the process, to show you that you can use its selection of self-contained and inter-compatible packages to accomplish many of the data wrangling, analysis, and visualization tasks you have learned about thus far.
    4. Use the same process to filter the data for youth who reported studying 8 hours or less per week (studyhard == 0), assign to a new data object named studylow, and check the ‘delinquency’ mean in this group.
#Use datawizard package for tidy filtering & data manip 
#https://easystats.github.io/datawizard/

#mean high studying group
studyhi <- data_filter(YouthData, studyhard == 1) 
mean(studyhi$delinquency)
## [1] 19.81381
#mean low/no studying group
studylow <- data_filter(YouthData, studyhard == 0) 
mean(studylow$delinquency)
## [1] 28.98751
  1. Now that we have the means for our two groups, we can calculate the difference in means for ‘delinquency’ between the ‘studyhi’ and ‘studylow’ groups.
    1. Recall, our example research question asks whether students who spend lots of time studying are less likely to engage in delinquency compared to those who study less frequently. So, we will subtract the ‘delinquency’ mean for the low studying group from the mean for the high studying group (mean[studyhi] - mean[studylow]).
      • A positive mean difference would indicate that the high studying group engages in more delinquency on average compared to the low studying group.
      • A mean difference of “0” would indicate that the high studying group has the same mean level of delinquency as the low studying group.
      • A negative mean difference would indicate that the high studying group engages in less delinquency on average compared to the low studying group (as we hypothesized).
    2. Create a new R chunk and type meandiff <- mean(studyhi$delinquency) - mean(studylow$delinquency). Then, hit enter and type meandiff to view the resulting value.
#calc mean diff btw groups 
meandiff <- mean(studyhi$delinquency) - mean(studylow$delinquency)
meandiff
## [1] -9.1737

As we anticipated, the negative mean difference in delinquency values between these two independent samples indicates that the high studying group engages in less delinquency on average compared to the low studying group - about 9 fewer acts on average in the past year (-9.17).

This descriptive statistic informs us about the observed mean difference in delinquency between two independent groups in a sample of youth. What if we wanted to make an inference about group mean differences in the population from which this sample was drawn?

To do this, we could conduct an independent samples t-test. Most commonly, this is done as a null hypothesis test in which we assume the two group means are equal in the population (i.e., a null hypothesis of equality in means) and then estimate the probability of obtaining a mean difference at least as large as that observed due to sampling variability alone under the null hypothesis of no difference in means.

As we illustrated in the simulations above, this process involves using the t distribution to transform our raw observed mean difference into a t-standardized mean difference score (\(t_{obt}\)). From there, we can directly compare our t-obtained test statistic to a t critical value that is determined by our pre-specified alpha level and the degrees of freedom for the test. We can also transform our t-standardized mean difference score (i.e., our t-test statistic) into a p-value that represents the estimated probability of observing a mean difference at least as large as that which we observed under the null hypothesis (i.e., if the means were truly equal in the population). There are various ways to conduct independent samples t-tests in R. For some additional examples, you might wish to check out here, here, and here. We will describe how to do so using the base R t.test() function.

Part 2 (Assignment 10.2)

Goal: Conduct independent samples t-test

  1. Create a second-level header titled: “Part 2 (Assignment 10.2)” Then, create a third-level header titled: “Conduct Independent Samples t-test”

  2. Insert an R chunk. Type myttest_eq <- t.test (delinquency ~ studyhard, var.equal=TRUE, data = YouthData). Then, hit enter and type print(myttest_eq) to view our t statistic, p-value, and confidence intervals (at 95% confidence level).

    1. The t.test() function conducts an independent sample t-test.
    2. delinquency ~ studyhard is our statistical model, which specifies that ‘delinquency’ is distributed as a random normal variable with different mean values across each of the two categories of the ‘studyhard’ variable.
    3. var.equal = TRUE specifies that the data have met the assumption of homogeneity (i.e., equality) of variances. In other words, this tells R to conduct a pooled variance t-test.
      • Recall from your book that we often should not make this assumption, since we rarely know whether the true group variances are equal in the population. Therefore, we will also conduct a separate variances t-test, which alternatively specifies that the data have not met the assumption of homogeneity of variances.

  3. Insert another R chunk and retype the code, but replace myttest_eq with myttest_uneq (for unequal variances) and var.equal = TRUE with var.equal = FALSE. Run your code and print the results.

Independent samples t-test, pooled variance

#t-test will convert raw mean diff to stdz t value, then calc p-value of that stdz diff
#indep samp t-test w/equal var assumed; 
myttest_eq <- t.test(delinquency ~ studyhard, var.equal=TRUE, data = YouthData) # assumes that the data has met the assumption of homogeneity of variances
print(myttest_eq)
## 
##  Two Sample t-test
## 
## data:  delinquency by studyhard
## t = 3.6444, df = 1212, p-value = 0.0002794
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##   4.235134 14.112266
## sample estimates:
## mean in group 0 mean in group 1 
##        28.98751        19.81381


Independent samples t-test, separate variance

#indep samp t-test w/unequal var assumed; 
myttest_uneq <- t.test (delinquency ~ studyhard, var.equal=FALSE, data = YouthData) # does not  assume that the data has met the assumption of homogeneity of variances
print(myttest_uneq)
## 
##  Welch Two Sample t-test
## 
## data:  delinquency by studyhard
## t = 3.9005, df = 690.37, p-value = 0.0001054
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##   4.555967 13.791434
## sample estimates:
## mean in group 0 mean in group 1 
##        28.98751        19.81381
  • As noted above, the first t-test assumes equal variance while the second does not. Notice that the t-value, p-value, and confidence intervals are all somewhat different depending on whether we assume equal or unequal variance.

  • For a two-sided test with an alpha level of 0.05, the t critical value would be +/-1.96. In both tests, the t-test statistic falls in the critical region. Likewise, the p-value associated with each test is less than either the alpha levels of 0.05 or 0.01 mentioned in Question #2 of the SPSS exercise for Chapter 10.

  • Therefore, we would reject the null hypothesis of no difference in means. It would be highly unlikely to obtain a sample with a mean difference in delinquency acts between high and low study groups that was at least as large as the difference we observed (-9.17 acts) due to sampling variability alone if the two group means were actually equal in the population.

While we demonstrated how to conduct both a pooled and separate variance independent samples t-test above, which of those two tests should we have conducted? According to your book, it is typically safest to assume unequal variances since we are making inferences from a sample and do not know whether the variances indeed are equal in the population. However, Question #2.e in the SPSS Exercises introduces the Levene’s test for equality of variances as another method for making this decision. This test uses our sample group variances as estimates of the population group variances and then conducts an F test to help us make an inference about the homogeneity of variances in the population by either rejecting or failing to reject the null hypothesis of equal variances. We will show you how to conduct this test and answer the associated assignment questions using R.

  1. Create a third-level header titled: “Levene’s Test for Equality of Variances” and then insert an R chunk.

  2. First, we need to recode the ‘studyhard’ variable into a factor rather than numeric variable. Then, we will assign our new dataset containing the modified ‘studyhard’ factor variable into a new data object named YouthData2.

    1. Sometimes, statistical procedures in R will require us to change the format of a variable in order for the procedure to run correctly. In this case, our ‘studyhard’ variable is measured as numeric values of “0” (nominally meaning “8 or fewer study hours”) and “1” (nominally meaning “9+ study hours”). Our Levene’s test requires us to recode ‘studyhard’ as a factor variable and our ‘delinquency’ variable as a numeric variable. We can calculate mean values for numeric variables, and we can compare means on a numeric variable across different groups or levels of a “factor” variable.
    2. The “factor” format essentially tells R that this is a categorical or grouping variable and not a numeric one; that is, the “0” and “1” are not really numeric categories but instead categorical labels indicating membership in one group (e.g., low study hours) or the other group (high study hours).
    3. Also, we create a new data object instead of overwriting our original data object because this makes it easier to go back and correct any mistakes we might make when modifying our data. If we mess up, we can always go back to our original data object!

  3. Type YouthData2 <- YouthData %>% mutate(studyhard = as.factor(studyhard)) From here on, be sure to use our new data object, YouthData2, and not the original!

    1. The mutate() function is one you have used before. Recall that it allows us to recode and manipulate our data.
    2. studyhard = as.factor(studyhard) mutates the variable ‘studyhard’ from numeric into a factor variable.
    3. We then assign our new dataset containing the new (mutated) factor variable into a new data object named YouthData2.
    #First, need to recode studyhard as factor variable; we will save into new dataset YouthData2 just to be safe
    YouthData2 <- YouthData %>% mutate(studyhard = as.factor(studyhard)) 


  4. Now, we are ready to conduct Levene’s test for homogeneity of variance. As noted above, Question #2.e in the SPSS Exercise for Chapter 10 in your book mentions this test but does not provide a detailed explanation. As we explained, this procedure uses observed sample group variances to test the null hypothesis that the variances (standard deviations) are equal across both groups in the population. If the sample distributions differ substantially, we should reject the null hypothesis of equal variances and choose to conduct a separate (unequal) variance t-test of equality of means across groups. (Note: To be safe, it is fine to choose the unequal variance option even when Levene’s test results in a failure to reject the null hypothesis.)

    1. To conduct the Levene’s test, we will use the car package, which is an acronym for “companion to applied regression.” For examples of conducting this test in R, see here, here, and here.
    2. To conduct the test, use the following code format: leveneTest(yvar ~ factorvar, data). Of course, you will need to input your own y-variable, factor or group variable, and dataset!
#Levene's test found in car package (companion to applied regression)
leveneTest(delinquency ~ studyhard, YouthData2)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    1   11.94 0.0005685 ***
##       1212                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Recall, the null hypothesis for Levene’s test for equality of variances is that the two group variances are equal in the population. The test converted the observed differences in standard deviations between our two independent sample groups into a standardized F statistic, which was then converted into a p-value to aid interpretation.

  • Based on these results, we would reject the null hypothesis of equal variances in delinquency between the two population study groups. As such, this test is further evidence that we should choose to conduct and interpret the results of the separate variance t-test!

Part 3 (Assignment 10.3)

Goal: Visualize sample means, distributions, and outlier observations

For the last part of the assignment, we will visualize our variables using the see package and geom_violindot(). Recall that the see package allows us to visualize our data from spread to outliers. We have used the ggplot() and aes() functions before to plot data. Now, we will use these functions with the geom_violindot() function to graph our variables side-by-side for comparison.

  1. Create a second-level header titled: “Part 3 (Assignment 10.3)”. Then, create a third-level header titled: “Visualize Distributions and Observations”
    1. Insert an R chunk.
    2. To complete this next part, we must start by dropping any missing (NA) observations from our data. For example, if a participant did not give a response on their time studying, we will need to drop that case from our data. One way to do this is to type: data <- droplevels(data_filter(YouthData2, studyhard != "NA"))
      • Here, we use the data_filter() function again to tidily filter out any NA observations on our ‘studyhard’ variable from our new YouthData2 data object.
      • Earlier, we filtered data to keep those cases that were equal to (==) a particular value (e.g., studyhard == 1). Here, we use the != operator to filter, which tells R that we do not want cases equal to this value on the variable in our data. So, studyhard != "NA" tells R not to include cases where there is no information on hours spent studying.
      • This filter function is then wrapped into a droplevels() command, which removes the now empty NA grouping level for the ‘studyhard’ variable altogether so it will not show up in tables or figures.
      • All of this is saved into the new, generically named data object called data.
        # Filter out NA values on studyhard & drop NA factor level
        data <- droplevels(data_filter(YouthData2, studyhard != "NA"))
  1. Next, we can create our violin dot plot. You may have noticed from the frequency tables that the delinquency variable is not normally distributed in either group. The violin dot plot will also help you visualize this distributional non-normality.
    1. To do this, we start with ggplot() to let R know we are creating a graph. Insert an R chunk and type ggplot(data, aes(x = studyhard, y = delinquency, fill = studyhard)) +
      • Remember that aes() allows us to determine our axes. Here, we want our independent variable, ‘studyhard’, to be on the x-axis and our dependent variable, ‘delinquency’, to be on the y-axis.
      • fill = studyhard tells ggplot to fill the violin plots by ‘studyhard’ group. In other words, it will fill the delinquency distributions with different colors for each study group.
      • Also, do not forget to include the + sign on the same line as ggplot() to add layers (i.e., at the end of each line rather than the beginning of a new line) so R knows you want more than a blank graph.
    2. Hit enter and type geom_violindot(color_dots = "#d44842", size_dots = 4) +.
      • geom_violindot() generates a half-violin and half-dotplot, which is useful for visualizing the distribution and the group sample sizes simultaneously. Violin plots (right side of each image) are mirrored density plots that are displayed in a similar way to boxplots, while the dotplot (left side of each image) allows more precise visualization of each observation in the group. To learn more about geom_violindot, visit this website.
      • The color_dots = and size_dots = functions adjusts the color and size of the our frequency dotplot (left side). We can use any color we’d like and make the dots as large or small as we prefer. Here, we chose a color from the inferno palette.
    3. Hit enter and type scale_fill_manual(values=c("#4ac16d","#e45a31")) +
      • This line manually changes the fill colors in the violin plots for each of the two groups to whatever colors we list. Here, we selected two colors from the viridis palette.
    4. Lastly, hit enter and type theme_modern(). This is a default function that tell R what theme to use for our plot.
# Visualize distributions and observations
myplot <- ggplot(data, aes(x = studyhard, y = delinquency, fill = studyhard)) +
  geom_violindot(color_dots =  "#d44842", size_dots = 4) +
  scale_fill_manual(values=c("#365c8d", "#4ac16d")) +
  theme_modern()
myplot

  1. BONUS: We can also add horizontal lines representing the mean values for each group, which will help us better visualize the mean difference between these groups (i.e., the space between the mean lines). Let’s try it!
    1. In the chunk above, we saved our plot above as a ggplot object. That way, in the next plot, we can simply add (+) two horizontal lines at the delinquency mean values for each group.
    2. You can use geom_hline() to add a horizontal line to any ggplot object. If you want to add a vertical line, use geom_vline()
    3. yintercept = specifies where each horizontal line should cross the y-axis. We added a line at the delinquency mean value for the ‘studylow’ group (mean(studylow$delinquency)) and another at the ‘studyhi’ delinquency mean (mean(studyhi$delinquency)).
    4. We use color = to specify the same group colors that we used to fill our violin plot in the ggplot code chunk above.
myplot + 
  geom_hline(yintercept = mean(studylow$delinquency), color = "#365c8d") +
  geom_hline(yintercept = mean(studyhi$delinquency), color = "#4ac16d") 

  • Do you learn anything important from these plots?

  • Based on the delinquency distributions for each group, do you think that the independent samples t-test assumption of normality is appropriate?

  • Are there any outlier delinquency values (e.g., 120 delinquency acts)? How do you think those values affect the group mean estimates?
  1. Congratulations! You should now have everything that you need to complete the questions in Assignment 10 that parallel those from B&P’s SPSS Exercises for Chapter 10! Complete the remainder of the questions in Assignment 10 in your RMD file. Be sure to follow all steps above using the assigned variables (‘studyhard’ and ‘moral’), including the addition of a violin dotplot!
    1. Keep the file clean and easy to follow by using RMD level headings (e.g., denoted with ## or ###) separating R code chunks, organized by assignment questions.
    2. Write plain text after headings and before or after code chunks to explain what you are doing - such text will serve as useful reminders to you when working on later assignments!
    3. Upon completing the assignment, “knit” your final RMD file again and save the final knitted Word document to your “Assignments” folder as: YEAR_MO_DY_LastName_K300Assign10. Submit via Canvas in the relevant section (i.e., the last question) for Assignment 10.

Assignment 10 Objective Checks

After completing assignment #10…

  • do you recognize that you can add inline R code to RMD text with `r `, which can improve reproducibility and accuracy of reporting results by helping avoid typos or copy-and-paste errors?
  • do you recognize that you can type LaTeX-style mathematical equations in your R Markdown text by using a single $ for inline text equations or two $$ to offset equations on a new line.?
  • are you able to filter data using datawizard::data_filter()?
  • are you able to change a numeric variable to a factor variable with mutate(newvar = as.factor(oldvar))?
  • are you able to conduct an independent samples pooled or separate variance t-test using t.test() and interpret the results?
    • do you know how to conduct a Levene’s test for equality of variances using car::leveneTest()?
  • are you able to generate a half-violin/half-dotplot with geom_violindot() to visualize group distributions and sample sizes simultaneously?
  • do you know how to use aes(fill=groupvar) to fill plots with different colors for each category of a grouping variable?
  • do you know how to manually change the fill colors in a ggplot object using scale_fill_manual()?
  • do you know how to add a horizontal or vertical line to a ggplot object with geom_hline() or geom_vline()?