Assignment 11 Objectives

The purpose of this eleventh assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 12 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

In the last assignment, you learned how to use R to calculate the difference between two independent sample means and then conduct an independent samples t-test of equality in population means. You also learned how to conduct a Levene’s test for equal or unequal variances and how to generate a half-violin/half-dotplot to visualize group distributions and sample sizes simultaneously. In this assignment, you will learn how to estimate the strength of a linear association between two numeric variables by calculating Pearson’s correlation coefficient (r) and how to predict expected values of a dependent variable (Y) from the values of a linearly correlated predictor variable (X) using a linear regression model. Additionally, you will learn to visualize the association with a scatterplot, and you will be introduced to some tools for checking model assumptions and assessing the fit of your regression model to the data.

By the end of assignment #11, you should…

  • be able to calculate Pearson’s correlation coefficient (r) with correlation::cor_test() and plot the correlation with see::plot()
  • be able to fit a linear regression model with lm() and summarize results using summary()
  • be able to generate an interactive bivariate plot and regression line using ggPredict() function from ggiraph and ggiraphExtra packages
  • know how to use performance::check_model() to generate some checks for influential outliers, residual variance, and other regression assumptions
  • understand the importance of considering whether a linear correlation or linear regression model is appropriate for the bivariate relationship you are examining
  • understand how to read results of a linear regression model fit in R, including identifying a linear regression equation and predicting values from that equation
  • recognize how the correlation coefficient essentially expresses a bivariate relationship in terms of z scores, or standard deviation units

Assumptions & Ground Rules

We are building on objectives from Assignments 1-10. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • knit your RMD document into an HTML file that you can then save and submit for course credit
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • recognize the R operator != as “not equal to”
  • turn off or change scientific notation in R, such as options(scipen=999, digits = 3)
  • create a list or vector and assign to object, such as listname <- c(item1, item2)
  • recognize that you can use lapply() to create a list of objects, which can help you avoid cluttering the R Environment with objects
  • recognize that you can create your own R functions (e.g., our funxtoz() function) - and that doing so is recommended for duplicate tasks to avoid copy-and-paste errors
  • recognize that round() can be used to specify number of decimals on numeric values
  • recognize that you can type LaTeX-style mathematical equations in your R Markdown text by using a single $ for inline text equations or two $$ to offset equations on a new line.

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method
  • use groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages
  • improve reproducibility of randomization tasks in R by setting the random number generator seed using set.seed()
  • know that you can share examples or troubleshoot code in a reproducible way by using built-in datasets like mtcars that are universally available to R users
  • recognize that you can add inline R code to RMD text with `r `, which can improve reproducibility and accuracy of reporting results by helping avoid typos or copy-and-paste errors and by auto-updating results

Data viewing & wrangling

  • use the base R head() function to quickly view the first few rows of data
  • use the base R tail() function to quickly view the last few rows of data
  • use the glimpse() function to quickly view all columns (variables) in your data
  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • remove missing observations from a variable in R when appropriate using filter(!is.na(var))
  • change a numeric variable to a factor (e.g., nominal or ordinal) variable with haven::as_factor()
  • alternatively change a numeric variable to a factor variable with mutate(newvar = as.factor(oldvar))
  • drop an unused factor level (e.g., missing “Don’t know” label) on a variable using data %>% droplevels(data$variable)
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions
  • convert raw column (variable) values into standardized z-score values using mutate()
  • select a random sample from data with or without replacement using dplyr::sample_n()
  • select data with conditions using dplyr::filter() and %in% operator
  • alternatively filter data using datawizard::data_filter()
  • simulate data from normal, truncated normal, or uniform probability distributions using rnorm(), truncnorm::rtruncnorm(), or runif()
  • draw random samples from data in R
    • draw one random sample with dplyr::slice_sample()
    • draw multiple (“replicate”) random samples with infer::rep_slice_sample()
  • recognize that one can manually build a simple tibble row-by-row using tidyverse’s tibble::tribble()

Descriptive data analysis

  • use summarytools::dfsummary() to quickly describe one or more variables in a data file
  • create frequency tables with sjmisc::frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()
  • calculate measures of central tendency for a frequency distribution
    • calculate central tendency using base R functions mean() and median() (e.g., mean(data$variable))
    • calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() and psych::describe() functions
  • calculate measures of dispersion for a variable distribution
    • calculate dispersion measures by hand from frequency tables you generate in R
    • calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc::frq() or summarytools::descr())
  • recognize and read the basic elements of a contingency table (aka crosstab)
    • place IV in columns and DV in rows of a crosstab
    • recognize column/row marginals and their overlap with univariate frequency distributions
    • calculate marginal, conditional, and joint (frequentist) probabilities
    • compare column percentages (when an IV is in columns of a crosstab)
  • generate and modify a contingency table (crosstab) in R with dplyr::select() & sjPlot::sjtab(depvar, indepvar) or with crosstable(depvar, by=indepvar)

Data visualization & aesthetics

  • improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
    • modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
    • recognize that a gt() table can be modified to add or remove the decimals in specific columns or rows with fmt_number()
  • create basic graphs using ggplot2’s ggplot() function
    • generate simple bar charts and histograms to visualize shape and central tendency of a frequency distribution
    • generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution
    • generate a half-violin/half-dotplot with geom_violindot() to visualize group distributions and sample sizes simultaneously
  • modify elements of a ggplot object
    • change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
    • use aes(fill=groupvar) to fill plots with different colors for each category of a grouping variable
    • manually change the fill colors in a ggplot object using scale_fill_manual()
    • add or change a preset theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
    • select colors from a colorblind accessible palette (e.g., using viridisLite::viridis()) and specify them for the outline and fill colors in a ggplot geometric object (e.g., geom_boxplot())
    • add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title"))
  • combine multiple ggplot() plots into a single figure using “patchwork” package
    • know how to customize plot layout and add title to patchwork figure
    • recognize that one can write a custom function to repeatedly generate similar plots before combining them with patchwork
    • recognize you can use patchwork::wrap_plots() to quickly combine ggplot objects contained in a list
  • recognize that one can plot means and confidence intervals using ggplot() + geom_point() + geom_errorbar()
  • recognize that one can add elements like horizontal or vertical lines (with + geom_hline() or + geom_vline()), arrows (+ geom_segment()), or text (+ annotate()) to a ggplot() object

Hypothesis testing & statistical inference

  • conduct and interpret a null hypothesis significance test
    • specify null (test) hypothesis & identify contrasting alternative hypothesis (or hypotheses)
    • set an alpha or significance level (e.g., as risk tolerance or false positive error control rate)
    • calculate a test statistic and corresponding p-value
    • compare a test p-value to alpha level and then determine whether the evidence is sufficient to reject the null hypothesis or should result in a failure to reject the null hypothesis
  • conduct a binomial hypothesis test in R with rstatix::binom_test()
  • generate and interpret confidence intervals to quantify uncertainty caused by sampling variability
    • identify t or z critical values associated with a two-tailed confidence level using qt() or qnorm()
    • estimate the standard error of a sample mean or proportion in R
    • estimate a two-tailed confidence interval around a sample mean or proportion in R
    • properly interpret and avoid common misinterpretations of confidence intervals
  • be able to conduct a one-sample z or t hypothesis test of the difference between a sample and assumed population mean in R and interpret results
    • be able to conduct a one-sample test using the base R t.test() function
    • be able to manually calculate a z or t test statistic by typing formula in R
    • be able to conduct a one-sample test using infer::t_test()
    • know how to use infer::visualize() to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis
  • conduct a chi-squared test of independence using sjPlot::sjtab() or chisq.test() and interpret results
    • specify different measures of association with statistics= using sjPlot::sjtab() and interpret them appropriately.
    • generate observed and expected frequencies by assigning results of chisq.test() to an object (e.g., chisq) and then calling elements from object (e.g., chisq$observed or chisq$expected)
  • conduct an independent samples pooled or separate variance t-test using t.test() and interpret the results
    • know how to conduct a Levene’s test for equality of variances using car::leveneTest()

If you do not recall how to do these things, review Assignments 1-10.

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

  • the difference between a positive and negative relationship between two continuous numeric (interval/ratio level) variables
  • what a scatterplot is and how to interpret it as a visualization of the association between two continuous numeric variables
  • how to calculate Pearson’s r and coefficient of determination (\(r^2\))
  • how the ordinary least-squares (OLS) regression equation is different from the correlation coefficient and why both are useful
  • the formula for an ordinary least-squares linear regression line for a sample or population, including how to interpret the intercept and slope coefficient
  • how to calculate the linear regression slope coefficient (beta or b) and how to use the OLS regression equation to predict values of Y
  • how to calculate a t-test statistic for null hypotheses about b and r and interpret the results to make inferences about the null hypothesis

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).


Understanding Bivariate Correlation and Ordinary Least-squares Regression

Goal: Visualize a bivariate relationship and OLS regression line

This week’s book chapter focused on describing the association between two continuous numeric variables using a scatterplot, Pearson’s correlation coefficient, and an ordinary least-squares (OLS) linear regression model. Additionally, the chapter introduced t-value statistics that can be used to test the null hypothesis of no association (r=0 and b=0) in the population from which the sample was randomly drawn.

Later, you will practice these skills yourself with a criminology example using the States data. First, though, we want to try to help build upon your intuitions about what is happening under the hood when we estimate a correlation and fit a regression model. Remember the “mtcars” dataset that comes built into R, which we used in Assignment 8? Let’s use it again for our illustration.

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat   wt qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.21 19.4  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.46 20.2  1  0    3    1

Remember, you can learn more about the “mtcars” data by typing ?mtcars in your console (or a code chunk) and hitting enter (or run). Doing so will bring up a help page, where you will learn that the data were “extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).”

Recall, in assignment 8, we pretended the mtcars data were randomly selected from the population of cars in 1974 so we could illustrate the logic of a one-sample null hypothesis test. Here, we will not attempt to test a null hypothesis using these data. Recall that we cannot make valid statistical inferences (e.g., via a null hypothesis test) about the population of cars from these data because mtcars is a convenience sample and not a probability sample. That is, the mtcars sample was not drawn from a larger, well-defined population of cars in 1974 in such a way that all cars in that population had an equal (as in simple random sampling or SRS) or a known and nonzero probability (as in more complex forms of probability sampling) of selection into the mtcars sample. So, here, we will simply focus on describing a bivariate association in this nonrandom sample for now. You will practice testing and interpreting results of a null hypothesis later in the assignment.

For our illustration, let’s explore the association between miles per gallon (mpg) and horsepower (hp) among cars in these data. Our a priori expectation is that cars with greater horsepower will, on average, travel fewer miles per gallon - that is, they will have lower mpg values. The assumed mechanism driving this expected association is that cars with higher horsepower often (but not always) have larger engines (e.g., greater engine displacement in cubic inches or disp values), and larger engines on average burn more fuel than smaller engines in comparable driving conditions. So, put simply, we expect a strong, negative correlation between hp and mpg in the mtcars data.

Correlation coefficient and scatterplot of mpg & hp relationship

Let’s start by visualizing the univariate distributions of both variables using histograms, as well as the joint bivariate distribution using a scatterplot.

hist1 <- mtcars %>% ggplot() + 
  geom_histogram(aes(x=mpg, y=..density..), color="#ed6925", fill="#ed6925",
                 position="identity", breaks = seq(0, 45, by = 2)) + 
  theme_minimal() + 
  coord_cartesian(expand = FALSE, xlim=c(0,45), ylim=c(0,.12))

hist2 <- mtcars %>% ggplot() + 
  geom_histogram(aes(x=hp, y=..density..), color="#fbb61a", fill="#fbb61a",
                 position="identity", breaks = seq(0, 350, by = 20)) + 
  theme_minimal() + 
  coord_cartesian(expand = FALSE, xlim=c(0,300), ylim=c(0,.01))

scatter1 <- mtcars %>% ggplot(aes(x=hp, y=mpg)) + 
  geom_point(shape=21, size=4, 
             color="#ed6925", fill="#fbb61a") + 
  theme_minimal() 

(hist1 + hist2)/scatter1

As we expected, mpg and hp appear to be negatively correlated: cars with higher horsepower scores tend to have lower mpg scores and vice versa.

There are many ways to calculate a Pearson correlation coefficient to describe the strength of the linear association between these two variables. Later, we will introduce you to the cor_test() function from the correlation package. For now, let’s just use the base R cor() function:


Pearson’s correlation coefficient (r) for linear association between mpg & hp

cor(mtcars$mpg, mtcars$hp)
## [1] -0.776

Indeed, the Pearson’s correlation coefficient (r = -0.78) suggests there is a strong negative linear correlation between mpg and hp in the mtcars data.
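As a quick check on what cor() is doing under the hood, recall from the book that Pearson’s r is simply the covariance between the two variables rescaled by the product of their standard deviations. The base R sketch below (our addition) reproduces the same value:

#Pearson's r = covariance rescaled by both standard deviations
cov(mtcars$mpg, mtcars$hp) / (sd(mtcars$mpg) * sd(mtcars$hp))
## [1] -0.776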

Does r=0 mean that two variables are unrelated?

Before we move onto fitting an ordinary least-squares linear regression model to these data, let’s take a detour to understand why we emphasized the word linear in the text above.

Recall, Pearson’s correlation coefficient ranges from ‘r=-1’ (perfect negative linear correlation) to ‘r=+1’ (perfect positive linear correlation), with the midpoint ‘r=0’ indicating no linear correlation between two variables. Of course, as this week’s book chapter explained, two numeric variables may not be related in a linear way just because we chose to calculate a linear correlation or fit a linear regression model to them. For instance, the chapter discussed various problems that may arise when we fit a linear model to data with limited variation, data with nonlinear relationships, or data with influential outliers.

We will demonstrate one such issue by illustrating some situations in which the calculation of a linear correlation of ‘r=0’ might result in an incorrect inference that two variables are unrelated. Specifically, we will show some examples in which nonlinear relationships between two variables exist in data, yet the relationship essentially averages out to a linear correlation of ‘r=0’ across the observed XY values.

To visualize such a situation, let’s consider the four graphs below, each displaying a bivariate relationship between an X and a Y variable. In each figure, the Pearson’s correlation coefficient is approximately r=0:

set.seed(1138)
n <- 5000
x <- rnorm(n)
y1 <- 0 + 0*x
y2 <- 0 + 0*x + rnorm(n)
y3 <- 0 + sin(5*x)
y4 <- 0 + x^2

fig1 <- ggplot() + 
  geom_point(aes(x=x, y=y1), shape=21, size=2, 
             color="#1fa187", fill="#4ac16d") + 
  theme_minimal() 
fig2 <- ggplot() + 
  geom_point(aes(x=x, y=y2), shape=21, size=2, 
             color="#1fa187", fill="#4ac16d") + 
  theme_minimal() 
fig3 <- ggplot() + 
  geom_point(aes(x=x, y=y3), shape=21, size=2, 
             color="#1fa187", fill="#4ac16d") + 
  theme_minimal() 
fig4 <- ggplot() + 
  geom_point(aes(x=x, y=y4), shape=21, size=2, 
             color="#1fa187", fill="#4ac16d") + 
  theme_minimal() 


fig1 + fig2 + fig3 + fig4

In other words, there is essentially no linear bivariate correlation between any of the XY variable pairs in the four figures above. Yet, in two of the figures, there really is no correlation between X and Y; that is, the data were simulated so that r=0. Meanwhile, in the other two figures, there are strong - and near perfect - nonlinear relationships between X and Y.
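If you want to check this yourself, here is a quick sketch (our addition) computing the linear correlations directly from the simulated variables in the chunk above. The values for the noisy pairs will be near zero rather than exactly zero, and note that y1 is a special case:

#linear correlations are all approximately zero
cor(x, y2) #pure random noise: no relationship at all
cor(x, y3) #sine wave: strong nonlinear relationship, near-zero r
cor(x, y4) #parabola: strong nonlinear relationship, near-zero r

#note: cor(x, y1) returns NA with a warning because y1 is constant
#(zero variance), so Pearson's r is undefined for that pair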

Below, we plotted these relationships again in a GIF and included the Pearson’s r values (at the top) as well as a fitted least-squares regression line to help you better visualize the lack of linear correlation in these XY pairs. In each, you can see that Pearson’s r is approximately equal to zero. Also, in each, the linear regression line is approximately flat, meaning that knowing the value of X and using our linear regression equation will not help us predict the value of Y, even if X and Y are strongly related to each other in a nonlinear way.

# to create animated version below:

# exampledata <- tibble(x,y1,y2,y3,y4)
# corr1 <- cor_test(exampledata, y="y1", x="x")
# corr2 <- cor_test(exampledata, y="y2", x="x")
# corr3 <- cor_test(exampledata, y="y3", x="x")
# corr4 <- cor_test(exampledata, y="y4", x="x")
# fig1 <- plot(corr1)
# fig2 <- plot(corr2)
# fig3 <- plot(corr3)
# fig4 <- plot(corr4)
# 
# figs <- list(fig1,fig2,fig3,fig4)
# saveGIF({
#   for (i in figs){
#     print(i)}
# }, interval = 2, movie.name=here("Images", "nocorr.gif"))
Pearson’s r=0, but no correlation?

Still not convinced that we need to visualize our data and think carefully about the assumptions underlying our statistical descriptions and tests? Check out the animation below, which we borrowed from Matejka and Fitzmaurice’s paper entitled “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing.” Their animations are an extension of the Datasaurus data created by Alberto Cairo, which in turn extended the famous Anscombe’s quartet.

For each of the figures shown below, the distinct underlying datasets used to generate each graph all have the same descriptive statistics - that is, the same mean values for X, mean values for Y, standard deviations for X, standard deviations for Y, and the same Pearson’s r correlation coefficient. Moreover, as with our example figures above, the Pearson’s correlation coefficient is approximately zero (r=-0.06) for each of the XY pairs displayed in the figures in the GIF below. [Note: You can install the datasauRus package and plot these figures in R yourself!]

Plots from Same Stats, Different Graphs
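Here is a minimal sketch of how you might reproduce those checks yourself, assuming you have installed the datasauRus package (its built-in datasaurus_dozen data include a dataset column identifying each XY set):

# install.packages("datasauRus") # only needed once
library(datasauRus)

#identical descriptive statistics across very different-looking datasets
datasaurus_dozen %>% 
  group_by(dataset) %>% 
  summarize(mean_x = mean(x), mean_y = mean(y),
            sd_x = sd(x), sd_y = sd(y), r = cor(x, y))

#plot all of the XY sets at once
ggplot(datasaurus_dozen, aes(x = x, y = y)) +
  geom_point(size = 0.5) +
  facet_wrap(~dataset) +
  theme_minimal()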

Are a linear correlation and regression model always appropriate?

Ok, that’s enough of a detour; let’s get back on track by revisiting our mtcars example. Below, we add a linear regression line to our scatterplot of the relationship between mpg and hp.

linfig <- scatter1 + geom_smooth(method = lm)
linfig

Now that you know issues such as nonlinear relationships might affect our inferences from linear models, do you think this is an issue with these data? That is, do you think the relationship between mpg and horsepower really is best described as linear?

It appears from the scatterplot that a linear regression line might systematically under-predict Y values (i.e., the line is below the dots or observations) at low and high values of X, and it might systematically over-predict (i.e., the line is above the observations) at mid-range values of X. This type of systematic under- or over-predicting can be symptomatic of a poor-fitting linear function, and it may indicate that the relationship between X and Y is actually nonlinear.

Theoretically, a nonlinear functional relationship between mpg and hp also might make a lot more sense than a linear one. After all, the linear regression line that we fit to these data predicts that cars with really high horsepower - say, 500hp - will have negative mpg values! Yet, despite this model prediction, we might logically assume that there is a threshold at which mpg values cannot really get any lower. For instance, it is probably safe to assume that a functioning car on the consumer market should not get less than 0 mpg! Moreover, increasingly higher mpg gains might be expected among cars with low hp (or smaller engine sizes), but there might also be a threshold at which mpg gains level off at low hp values as well. Given these expectations, let’s fit a few different nonlinear functions to the data and visually compare them - say, a quadratic curve, a simple logarithmic decay curve, and a more complex logarithmic decay curve that allows for theoretically possible upper and lower mpg thresholds.

# quadratic - note we would not expect mpg to increase at high X (overfitting)
quadfig <- scatter1 + geom_smooth(method = lm, formula=y ~ x + I(x^2))

# logarithmic decay function 
# examples: https://saylordotorg.github.io/text_intermediate-algebra/s10-03-logarithmic-functions-and-thei.html
# thanks to chatGPT for helping with the log decay code! 

#simple logarithmic decay
log1fig <- scatter1 + geom_smooth(method = lm, formula = y ~ log(x))
  
# Complex logarithmic decay: This is the type of function I want to fit to the data:
# Define function
log_decay <- function(x, a, b, c, d) {
  a + b/(1 + exp(c*(x - d)))
}

# Here is a plot of the function:
# # Generate x values
# x_vals <- seq(0, 400, length.out = 1000)
# 
# # Plot function
# ggplot(data.frame(x = x_vals), aes(x)) +
#   stat_function(fun = log_decay, args = list(a = 35, b = -30, c = -0.02, d = 100), geom = "line") +
#   ylim(0, 30) +
#   xlim(0, 400) +
#   labs(x = "hp", y = "mpg")

# Fit log_decay function to mtcars data & plot
log2fig <- scatter1 +  geom_smooth(method = "nls", formula = y ~ log_decay(x, a, b, c, d),
              method.args = list(start = list(a = 35, b = -23, c = -0.02, d = 100)),
              se = FALSE, color = "blue")
  


linfig + quadfig + log1fig + log2fig

What do you think? Does the simple linear regression line fit better than the nonlinear models?

Let’s first consider the quadratic model (top right). Social science researchers frequently fit quadratic models to account for nonlinearity. Indeed, the quadratic line appears to fit the data better than the linear model. However, this model also appears to “overfit” the data by predicting an increase in mpg at the highest hp values. A couple of cars with really low mpg scores around 200hp seem to pull the critical point of the quadratic curve down, while the one car with the highest hp and a relatively higher mpg score seems to pull the quadratic line back up, creating the predicted parabolic increase in mpg at high hp scores.

Hence, as this week’s chapter describes, outlier observations can be quite influential when fitting a regression line. Why? Because the regression model uses all observed values to identify the best-fitting line. Remember weeks ago when you learned about the statistical properties of the mean and, specifically, about how the mean is the point of minimized variation? Recall you also learned how to calculate variation (variance) as the sum of squared deviations about the mean?

Remember the mean?

Well, essentially the same thing is happening here! In fitting an OLS linear regression model, we are identifying a best-fitting line that minimizes variation in Y at every value of X. That is, we are calculating the conditional mean of Y at every observed value of X, then drawing a line through those conditional means.

Regression as conditional means
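To make the idea of conditional means concrete, here is a rough sketch (our addition, not from the book) that bins hp and computes the mean mpg within each bin - binning is a stand-in for “every observed value of X,” since mtcars has too few cars at any single hp value - and then overlays those bin means on the earlier scatterplot:

#approximate conditional means of mpg across binned values of hp
mtcars %>% 
  mutate(hp_bin = cut(hp, breaks = 5)) %>% 
  group_by(hp_bin) %>% 
  summarize(n = n(), mean_mpg = mean(mpg))

#overlay the binned conditional means (teal points) on the scatterplot
scatter1 + 
  stat_summary_bin(fun = mean, bins = 5, geom = "point", 
                   color = "#1fa187", size = 4)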

So, what about those logarithmic decay curves? Given the data and assumed theoretical mechanisms, these might be more appropriate for describing the relationship between hp and mpg. With that said, assessing which of these regression lines is more theoretically plausible and which provides a better fit to the data is far beyond the scope of this assignment. However, it is important to point out that this is exactly the type of thing a careful researcher can and should be investigating when they build and fit (linear) regression models!

And that brings us to an important lesson to take away from all this line-fitting: data and statistical models are dumb - that is, they will do what you tell them to do, even if it is inappropriate to do so. This is why it is important to understand what is going on “under the hood” when working with and analyzing data, or when reading the statistical results from others’ data analysis research.

How do we use a (linear) regression model to make predictions?

Now, let’s get back to our simple OLS linear regression model anyway. We can easily generate a comparable scatterplot with a linear regression line, as well as the Pearson’s r correlation coefficient AND a null hypothesis t-test statistic by simply using the correlation and see packages from the easystats suite.

#example with correlation & see packages (easystats suite)

# https://easystats.github.io/see/articles/correlation.html

#correlation & scatterplot
carscorr <- cor_test(mtcars, y="mpg", x="hp")
plot(carscorr)

Recall, using a t-test to make statistical inferences about a population of cars is inappropriate for these data since the mtcars data were not collected as a probability sample. Yet, that fact does not stop our packages from generating such test statistics anyway. Hence, yet again, data and statistical models are dumb: they do what you tell them to do, even if you tell them to do something that they should not do.

What if we wanted to use our linear regression model to make a prediction about expected mpg scores given a specific car horsepower? Let’s use lm() to fit the linear regression model and then summarize results.

carslinreg = lm(mpg ~ hp, data=mtcars)
summary(carslinreg)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.712 -2.112 -0.885  1.582  8.236 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  30.0989     1.6339   18.42 < 0.0000000000000002 ***
## hp           -0.0682     0.0101   -6.74           0.00000018 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.86 on 30 degrees of freedom
## Multiple R-squared:  0.602,  Adjusted R-squared:  0.589 
## F-statistic: 45.5 on 1 and 30 DF,  p-value: 0.000000179


The intercept is 30.10, which means the predicted conditional average mpg for a car with hp=0 (an implausible car) is about 30mpg. In other words, the model predicts cars with lowest hp (and presumably smallest engines) should top out around 30mpg. Of course, we already know that our linear model was under-predicting mpg at these low values, which is why we considered possible nonlinear relationships. Still, let’s stick with the linear model results for now anyway.

The regression slope (beta) for hp is b=-0.068, which means that the conditional average mpg score is expected to decrease by 0.068 for every one-unit increase in horsepower. The linear regression equation, then, is: \(\hat{y}=30.10 + -0.068x\)

From this equation, we can calculate specific model predictions for mpg. For example, the model predicts that a 100hp car (i.e., an increase from 0hp to 100hp) is expected to get \(30.10 + 100*(-0.068) = 23.3mpg\), or about 23 miles per gallon. A 250hp car is expected to get \(30.10 + 250*(-0.068) = 13.1mpg\), or around 13mpg. A 400hp car is expected to get \(30.10 + 400*(-0.068) = 2.9mpg\). Yikes! What about a 500hp car? Here, the model predictions become especially problematic, as this powerful gas guzzler is predicted to get \(30.10 + 500*(-0.068) = -3.9mpg\). Uh oh.
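Rather than plugging values into the equation by hand, you can also ask the fitted model object for these predictions with base R’s predict(). Note that predict() uses the unrounded coefficients, so its results will differ slightly from our rounded hand calculations:

#model-based predictions at specific hp values
predict(carslinreg, newdata = data.frame(hp = c(100, 250, 400, 500)))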

What if we used the simple logarithmic decay function (bottom left plot above) instead to predict mpg? That model predicts a 100hp car is expected to get \(72.64 + log(100)*(-10.76) = 23.1mpg\), or again about 23 miles per gallon. A 250hp car is expected to get \(72.64 + log(250)*(-10.76) = 13.2mpg\), or once again about 13mpg. Those two predictions really did not change much from the linear to the simple logarithmic decay model. In contrast, a 400hp car is expected to get \(72.64 + log(400)*(-10.76) = 8.2mpg\), which is substantially higher than the 2.9mpg predicted by the linear model. Meanwhile, a 500hp car is predicted to get \(72.64 + log(500)*(-10.76) = 5.77mpg\), or about 6mpg. Recall, these data are from a convenience sample of cars in 1974, so these predictions might be plausible; they appear to be consistent with other reports of mpg ranges (6.8 to 29.1) for cars at the time.
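For the curious, here is a sketch of how one might fit and use that simple logarithmic decay model directly. It mirrors the y ~ log(x) formula we passed to geom_smooth() above, so the fitted coefficients should be close to the 72.64 and -10.76 used in the hand calculations:

#fit the simple logarithmic decay model & generate predictions
logdecayreg <- lm(mpg ~ log(hp), data = mtcars)
coef(logdecayreg)
predict(logdecayreg, newdata = data.frame(hp = c(100, 250, 400, 500)))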

Remember z scores? They’re back!

Now, before moving on to practice these skills yourself, let’s dig a little deeper under the correlation and regression hood.

Remember z scores, which you create by subtracting a variable’s mean from each observation and then dividing by the variable’s standard deviation (\(z=\frac{x-\overline{x}}{s}\))? It turns out that, if you transform your X and Y variables into z scores, then it is quite easy to calculate a correlation: the correlation is the average product of all pairs of z scores in your data! Moreover, if X and Y are perfectly positively correlated (r=1), then \(z_x\) and \(z_y\) values will be identical for every observation. In contrast, if X and Y are perfectly negatively correlated (r=-1), then \(z_x\) and \(z_y\) values will be identical in magnitude but opposite in sign for every observation.

Jaccard and Becker’s book, Statistics for the Behavioral Sciences, illustrates this point well:

“When a linear relationship is positive, the z scores on variable X will tend to be similar to the z scores on variable Y, and they will also tend to be alike in sign. In fact, when a positive linear relationship is perfect, the z scores on X and Y will be identical” (Jaccard & Becker, p.141). “When the scores for a given individual are multiplied by each other, the product (\(z_xz_y\)) will be positive,” unless both variables equal 0, in which case their product will equal zero (p.142). The sum of the products will be positive.

“When a linear relationship is negative, the z scores on variable X will also tend to be similar to the z scores on variable Y, but they will generally be opposite in sign.” “When the scores for a given individual are multiplied by each other, the product (\(z_xz_y\)) will be negative,” unless both variables equal 0, in which case their product will equal zero (p.142). The sum of the products will be negative.

“Finally, when there is no linear relationship, the z scores on variable X will bear no consistent relationship to the z scores on variable Y, in either size or sign” (p.144). The sum of the products will be 0.

“We can therefore use the sum of the products of z scores as an index of the relationship between two variables. There is, however, one complication: When the correlation between two variables is nonzero, the value of the sum of z score products is influenced not only by the size of the correlation but also by the sample size (N).”

So, we adjust for N and, with z scores, we can calculate the correlation coefficient as: \(r=\frac{\Sigma z_xz_y}{N}\)

Jaccard & Becker, p.143


Let’s try this with our mtcars data and see for ourselves by recoding mpg and hp into z scores. Given the strong, negative correlation between mpg and hp, we should expect that their z scores will tend to have similar magnitudes yet opposite signs.

mtcars2 <- mtcars %>% 
  mutate(
    zy = (mpg - mean(mpg))/sd(mpg), 
    zx = (hp - mean(hp))/sd(hp)
    ) %>% 
  dplyr::select(mpg,hp,zy,zx)
head(mtcars2)
##                    mpg  hp     zy     zx
## Mazda RX4         21.0 110  0.151 -0.535
## Mazda RX4 Wag     21.0 110  0.151 -0.535
## Datsun 710        22.8  93  0.450 -0.783
## Hornet 4 Drive    21.4 110  0.217 -0.535
## Hornet Sportabout 18.7 175 -0.231  0.413
## Valiant           18.1 105 -0.330 -0.608
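Before plotting, we can check the book’s z-score formula for r directly: summing the products of the paired z scores and adjusting for sample size recovers Pearson’s r. One wrinkle (our note): R’s sd() uses the sample formula (dividing by N-1), so with these z scores we divide the summed products by N-1 rather than N:

#sum of z-score products, adjusted for sample size, equals Pearson's r
mtcars2 %>% 
  summarize(r = sum(zx * zy) / (n() - 1))

#compare to the cor() result from earlier
cor(mtcars$mpg, mtcars$hp)
## [1] -0.776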

Now, let’s fit a linear model with lm() and then generate an interactive plot using ggPredict() from the ggiraphExtra package. This interactive plot will allow us to compare the specific \(z_x\) (hp) and \(z_y\) (mpg) scores for each observation or car in our scatterplot by simply hovering your mouse cursor over each point in the plot.

#linear model 
carslinreg = lm(zy ~ zx, data=mtcars2)
summary(carslinreg)
## 
## Call:
## lm(formula = zy ~ zx, data = mtcars2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.948 -0.350 -0.147  0.263  1.367 
## 
## Coefficients:
##                           Estimate             Std. Error t value   Pr(>|t|)
## (Intercept) -0.0000000000000000315  0.1133047256289521632    0.00          1
## zx          -0.7761683718265860454  0.1151177163567893152   -6.74 0.00000018
##                
## (Intercept)    
## zx          ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.641 on 30 degrees of freedom
## Multiple R-squared:  0.602,  Adjusted R-squared:  0.589 
## F-statistic: 45.5 on 1 and 30 DF,  p-value: 0.000000179
ggPredict(carslinreg,se=TRUE,interactive=TRUE)

As expected, when we hover over each observation, we notice that as \(z_x\) scores increase (i.e., larger positive hp z scores), \(z_y\) scores tend to decrease (larger negative mpg z scores).

Note, too, that hovering over the regression line tells us the regression equation that we fit with our model: \(\hat{y}=0 + -0.78x\). Note the intercept, or the average predicted value of Y when X=0, is equal to “0” in this equation. This is because we are using standardized z score variables; with a standardized z score variable for hp, cars with the average (mean) horsepower are recoded as \(z_x=0\). Likewise, cars with average (mean) mpg are recoded as \(z_y=0\). Moreover, since a regression line must always pass through the mean of X and Y, our linear regression line with standardized z scores must also pass through the XY coordinates of (0,0), as this is the location of the means of X and Y.

Additionally, note that the regression slope or beta coefficient is b=-0.78. That number should look familiar - it is the same as our Pearson’s r correlation coefficient! Again, this is because we are using standardized scores. With only one predictor in an OLS linear regression model, the standardized regression coefficient is equal to the correlation coefficient. We can interpret this coefficient as follows: A one-standard deviation unit increase in horsepower (hp) is predicted to decrease miles per gallon (mpg) by 0.78 standard deviation units.
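You can verify this equivalence for yourself directly from the fitted objects:

#the standardized slope equals Pearson's r (within rounding)
coef(carslinreg)["zx"]
cor(mtcars$mpg, mtcars$hp)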

There is so much more we could cover, but you must be getting antsy to try some of this yourself. So, without further ado, let’s load the States data and learn how to calculate bivariate correlations, fit a bivariate linear regression model, and plot a bivariate scatterplot!

Part 1 (Assignment 11.1)

Goal: Read in States Dataset, Create Scatterplot, and Calculate Correlation

In the rest of this assignment, you will work through most of the tasks described in the SPSS Exercises at the end of Chapter 12. Take some time to read through those exercises to help you understand what we are doing. Put briefly, this exercise asks you to investigate whether there is a relationship between unhealthy behaviors and crime rates in the States dataset, with unhealthy behaviors measured using states’ tobacco death rates and crime rates measured using states’ robbery rates. As in those exercises, you will be graphing the association, estimating the strength of the association, and fitting a regression model to the data.

(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR_MO_DY for the actual date. E.g., 2022_07_06_Fordham_K300Assign11)

  1. Go to your K300_L folder, which should contain the R Markdown file you created for Assignment 10 (named YEAR_MO_DY_LastName_K300Assign10). Click to open the R Markdown file.
    1. Remember, we open RStudio in this way so the here package will automatically set our K300_L folder as the top-level directory.
    2. In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.
    3. The dialogue box asks for a Title, an Author, and a Default Output Format for your new R Markdown file.
    4. In the Title box, enter K300 Assignment 11.
    5. In the Author box, enter your First and Last Name (e.g., Tyeisha Fordham).
    6. Under Default Output Format box, select “HTML document” (HTML is usually the default selection)

  2. Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete this text before you begin working.
    1. Create a second-level header titled: “Part 1 (Assignment 11.1).” Then, create a third-level header titled: “Read in States Dataset, Create Scatterplot, and Calculate Correlation”
      • Note: In this assignment, you are required to use the ‘tobaccodeathrt’ and ‘BurglaryRt’ variables. However, I will be demonstrating all steps using the ‘RobberyRt’ and ‘tobaccodeathrt’ variables. This is done to ensure you have the opportunity to practice with R and do not fall into the habit of copying my code. That is, for your assignment, you will examine how tobacco death rates are correlated with burglary rates in the States data, while I demonstrate how tobacco death rates are correlated with robbery rates. So, be sure to replace ‘RobberyRt’ with ‘BurglaryRt’ throughout the assignment to complete it correctly.
    2. This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text: This R Markdown document contains my work for Assignment 11. It is my work and only my work.
    3. Now, you need to get data into RStudio. You already know how to do this, but please refer to Assignment 1 if you have questions.

  3. Create a third-level header in R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
    1. Insert an R chunk.
    2. Inside the new R code chunk, load the following packages: tidyverse, haven, here, sjmisc, see, ggplot2, correlation, performance, ggiraphExtra, ggiraph, and sjPlot. Recall, you only need to install packages one time. However, you must load them each time you start a new R session.
    3. I know what you’re thinking – that’s a lot of new packages! Let’s go through what each of these new packages (i.e., ggplot2, correlation, performance, ggiraphExtra, and ggiraph) will do for us.
      • ggplot2 allows us to create graphs or charts to visually represent our data. Once we provide the data and tell ggplot2 how to map variables to aesthetics and what graphical primitives to use, it takes care of the details of our graph. You can learn more about ggplot2 here.
      • correlation is a package within the Easystats suite, which includes our see package as well. correlation is focused on correlation analysis and is easy to use. It allows for the computation of many different kinds of correlations, such as partial and multilevel correlations. To learn more about the correlation package, visit here.
      • As noted in your book, it is crucial to ensure your model is a good fit for your data. Though several functions to compute fit measures do exist, they are spread between R’s base and stats packages with no unified approach to checking model fit. performance attempts to fill this gap by allowing users to check the fit of their model to the data with measures like r-squared. For more information on the performance package, you can visit here.
      • ggiraph is an interactive and dynamic plotting package. It allows us to impose JavaScript functions onto our graphs, thus rendering them interactive. This means we can hover over and choose specific data points on the graph when exploring our data. ggiraphExtra is an enhancer package to ggiraph and ggplot2 that provides a function for exploratory graphs. To learn more about ggiraph and ggiraphExtra, visit here and here.

  4. After your first code chunk, create another third-level header in RMD titled: “Read Data into R”
    1. Insert another R code chunk.
    2. In the new R code chunk, read and save the “2012StatesData.sav” SPSS datafile into R. Put the “2012StatesData.sav” datafile into an object named StatesData.
      • Forget how to do this? Refer to any of the previous assignments.
    3. In the same code chunk, on a new line below your read data/save object command, type the name of your new R data object: StatesData. This will call the object and provide a brief view of the data. (Note: You can get a similar but more visually appealing view by simply clicking on the object in the “Environment” window. Also, be sure to put these functions on separate lines or they will not run properly.)

StatesData <- read_spss(here("Datasets", "2012StatesData.sav"))
StatesData
  1. Next, we will create a scatterplot using our independent variable, ‘tobaccodeathrt’, and our dependent variable, ‘RobberyRt’. Remember to replace ‘RobberyRt’ with ‘BurglaryRt’!
    1. Create a third-level header titled: “Run Correlation”
    2. Insert another R chunk and type mycorr <- cor_test(StatesData, y = "RobberyRt", x = "tobaccodeathrt")
      • Here, we are using the cor_test() function from the correlation package to calculate Pearson’s r, or the correlation, between the RobberyRt and tobaccodeathrt variables. We then place the calculated correlation into a data object called mycorr. Note that the dependent variable is listed first and the independent variable second. This order is important to ensure the correlation is run correctly (i.e., we want to know how tobacco death rates affect robbery rates, not how robbery rates affect tobacco death rates). Remember to replace RobberyRt with BurglaryRt!
    3. Hit enter and type plot(mycorr). The plot() function is within the see package and simply tells R to plot the correlation as a scatterplot. This function also allows us to see the confidence intervals and Pearson’s r coefficient. Your RStudio should look like this:
      Scatterplot with Correlation of ‘RobberyRt’ and ‘tobaccodeathrt’

    4. Here, our Pearson’s correlation coefficient (r) is equal to .28 and our p-value is .053. Graph the scatterplot with the correlation between your assignment variables and answer questions 26-28 of Quiz 11 on Canvas.
#example with correlation & see packages (easystats suite)

# https://easystats.github.io/see/articles/correlation.html

#correlation & scatterplot
mycorr <- cor_test(StatesData, y="RobberyRt", x="tobaccodeathrt")
plot(mycorr)

Part 2 (Assignment 11.2)

Goal: Calculate Regression Model

For the second part of the assignment, you will run a linear regression model on our variables and examine the intercept, the adjusted and unadjusted r-squared values, and the t- and p-values. The linear model allows us to determine the slope of the regression line and the r-squared value.

  1. Create a second-level header titled: “Part 2 (Assignment 11.2)”
    1. Create a third-level header titled: “Calculate Regression Model” and insert an R chunk.
    2. Type mylinreg = lm(RobberyRt ~ tobaccodeathrt, data=StatesData)
      • Here, we are creating a new data object called mylinreg. The lm() function tells R to run a regression model between the ‘RobberyRt’ and ‘tobaccodeathrt’ variables.
    3. Hit enter and type summary(mylinreg). The summary() function simply shows us a summary of our model object, including the r-squared and the slope of the regression line. Your RStudio should look like this:
      Regression Model of ‘RobberyRt’ and ‘tobaccodeathrt’

#linear model 
mylinreg = lm(RobberyRt ~ tobaccodeathrt, data=StatesData)

summary(mylinreg)
## 
## Call:
## lm(formula = RobberyRt ~ tobaccodeathrt, data = StatesData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -95.04 -34.96   4.93  28.25 158.79 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)       4.218     51.101    0.08    0.935  
## tobaccodeathrt    0.371      0.187    1.99    0.053 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.1 on 48 degrees of freedom
## Multiple R-squared:  0.0761, Adjusted R-squared:  0.0568 
## F-statistic: 3.95 on 1 and 48 DF,  p-value: 0.0525
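(Optional) If you would rather pull specific statistics out of the fitted model object than read them off the printed summary, you can index the summary object directly:

#extract specific statistics from the model summary
summary(mylinreg)$r.squared #unadjusted r-squared
summary(mylinreg)$adj.r.squared #adjusted r-squared
coef(summary(mylinreg)) #estimates, SEs, t values, & p values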
  1. Don’t forget to use the correct assignment variables! For the last part of the assignment, I am going to show you how to create an interactive scatterplot using the ggPredict() function from the ggiraph and ggiraphExtra packages. We will also use the check_model() function from the performance package to check the fit of our model.
    1. Create a third-level header titled: “Interactive Scatterplot”
    2. Insert an R chunk.
    3. Type ggPredict(mylinreg,se=TRUE,interactive=TRUE). Because our linear model is already calculated, all we need to do is use ggPredict() to make the plot interactive. The interactive = TRUE argument does just this, while the se = TRUE argument tells R to include the standard error. To learn more about the ggPredict() function, visit here. Your RStudio should look like this:
      Interactive Model of ‘RobberyRt’ and ‘tobaccodeathrt’

#create interactive plot to easily identify specific data points (need: ggiraph, ggiraphExtra packages)  
ggPredict(mylinreg,se=TRUE,interactive=TRUE)
  1. Notice how the top middle data point with the robbery rate at 260.1 and the tobacco death rate at 263.1 is highlighted in the screenshot. This is the function of an interactive scatterplot. We can now hover over these data points and see their exact values.

  2. Lastly, we will check the fit of our model using the performance package and the check_model() function. This allows us to ensure our model is the right fit for our data.
    1. Create a third-level header titled: “Checking Fit of Model”
    2. Insert an R chunk.
    3. Type check_model(mylinreg). This function allows us to check for influential outliers, residual variance, and other assumptions. It also utilizes Bayesian estimation to conduct some of these tests, although that is beyond the scope of this class. To learn more about the check_model() function, visit here and here. Your RStudio should look like this:
      Checking Model Fit of Linear Regression Model
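As with the previous steps, the corresponding code chunk is short:

#check model fit & regression assumptions (needs: performance & see packages)
check_model(mylinreg)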

  3. Congratulations! You should now have everything that you need to complete the questions in Assignment 11 that parallel those from B&P’s SPSS Exercises for Chapter 12! Complete the remainder of the questions in Assignment 11 in your RMD file.
    1. Keep the file clean and easy to follow by using RMD level headings (e.g., denoted with ## or ###) separating R code chunks, organized by assignment questions.
    2. Write plain text after headings and before or after code chunks to explain what you are doing - such text will serve as useful reminders to you when working on later assignments!
    3. Upon completing the assignment, “knit” your final RMD file again and save the final knitted HTML document to your “Assignments” folder as: YEAR_MO_DY_LastName_K300Assign11_11. Submit via Canvas in the relevant section (i.e., the last question) for Assignment 11.11.
    4. This is your last assignment! You made it! Hopefully, you feel comfortable performing basic statistical processes in R. R is not an easy program to use and it takes practice, but you have already learned more about it than even many researchers know. You should be proud!


Assignment 11 Objective Checks

After completing assignment #11…

  • are you able to calculate Pearson’s correlation coefficient (r) with correlation::cor_test() and plot the correlation with see::plot()?
  • are you able to fit a linear regression model with lm() and summarize results using summary()?
  • are you able to generate an interactive bivariate plot and regression line using ggPredict() function from ggiraph and ggiraphExtra packages?
  • do you know how to use performance::check_model() to generate some checks for influential outliers, residual variance, and other regression assumptions?
  • do you understand the importance of considering whether a linear correlation or linear regression model is appropriate for the bivariate relationship you are examining?
  • do you understand how to read results of a linear regression model fit in R, including identifying a linear regression equation and predicting values from that equation?
  • do you recognize how the correlation coefficient essentially expresses a bivariate relationship in terms of z scores, or standard deviation units?