The purpose of this eleventh assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 12 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.
In the last assignment, you learned how to use R to calculate the difference between two independent sample means and then conduct an independent sample t-test of equality in population means. You also learned how to conduct a Levene’s test for equal or unequal variances and how to generate a half-violin/half-dotplot to visualize group distributions and sample sizes simultaneously. In this assignment, you will learn how to estimate the strength of a linear association between two numeric variables by calculating Pearson’s correlation coefficient (r) and predict expected values of a dependent variable (Y) from the values of a linearly correlated predictor variable (X) using a linear regression model. Additionally, you will learn to visualize the association with a scatterplot and be introduced to some tools for checking model assumptions and assessing the fit of your regression model to the data.
Specifically, you will learn how to:
- Calculate Pearson’s r with correlation::cor_test() and plot the correlation with see::plot()
- Fit a linear regression model with lm() and summarize results using summary()
- Create an interactive plot with the ggPredict() function from the ggiraph and ggiraphExtra packages
- Use performance::check_model() to generate some checks for influential outliers, residual variance, and other regression assumptions

We are building on objectives from Assignments 1-10. By the start of this assignment, you should already know how to:
- Call functions from a specific package using the package::function() format
- Read in an SPSS data file with haven::read_spss() and assign it to an R object using an assignment (<-) operator
- Use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
- Use the %>% pipe operator to perform a sequence of actions
- Recognize the R operator != as “not equal to”
- Turn off scientific notation and set the number of displayed digits with options(scipen=999, digits = 3)
- Create a list of objects with listname <- c(item1, item2), or use lapply() to create a list of objects, which can help you avoid cluttering the R Environment with objects
- Create your own function (e.g., our funxtoz() function) - and that doing so is recommended for duplicate tasks to avoid copy-and-paste errors
- Know that round() can be used to specify the number of decimals on numeric values
- Write equations in an RMD file using one $ for inline text equations or two $$ to offset equations on a new line
- Use here() for a simple and reproducible self-referential file directory method
- Use groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages
- Set a random number seed with set.seed()
- Use built-in datasets like mtcars that are universally available to R users
- Report results with inline R code using `r `, which can improve reproducibility and accuracy of reporting results by helping avoid typos or copy-and-paste errors and by auto-updating results
- Use the head() function to quickly view the first few rows of data
- Use the tail() function to quickly view the last few rows of data
- Use the glimpse() function to quickly view all columns (variables) in your data
- Use sjPlot::view_df() to quickly browse variables in a data file
- Use attr() to identify variable and attribute value labels
- Identify missing (NA) values for variables in your data file and drop them with filter(!is.na(var))
- Convert labelled variables to factors with haven::as_factor() or mutate(newvar = as.factor(oldvar)), and drop unused factor levels with data %>% droplevels(data$variable)
- Create new variables or recode existing ones with the select(), mutate(), and if_else() functions
- Use mutate() to transform or recode variables in a data file
- Draw random samples from data with dplyr::sample_n(), or filter subsets of data with dplyr::filter() and the %in% operator or with datawizard::data_filter()
- Simulate data from probability distributions with rnorm(), truncnorm::rtruncnorm(), or runif()
- Draw random samples with dplyr::slice_sample() or repeated samples with infer::rep_slice_sample()
- Create a small dataset by hand with tibble::tribble()
- Use summarytools::dfsummary() to quickly describe one or more variables in a data file
- Generate frequency tables with the sjmisc::frq() and summarytools::freq() functions
- Calculate measures of central tendency with mean() and median() (e.g., mean(data$variable)) or with the summarytools::descr() and psych::describe() functions
- Generate measures of dispersion (e.g., with sjmisc::frq() or summarytools::descr())
- Generate a crosstab with dplyr::select() & sjPlot::sjtab(depvar, indepvar) or with crosstable(depvar, by=indepvar)
- Turn output into a nicely formatted table with gt() (e.g., head(data) %>% gt())
- Modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts; a gt() table can also be modified to add or remove the decimals in specific columns or rows with fmt_number()
- Create plots with the ggplot() function
- Use boxplot() and ggplot() to visualize dispersion in a data distribution
- Use geom_violindot() to visualize group distributions and sample sizes simultaneously
- Modify the colors of a plot (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
- Use aes(fill=groupvar) to fill plots with different colors for each category of a grouping variable, and specify those colors with scale_fill_manual()
- Add a theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
- Generate color palettes (e.g., with viridisLite::viridis()) and specify them for the outline and fill colors in a ggplot geometric object (e.g., geom_boxplot())
- Add or modify plot labels with the labs() function (e.g., + labs(title = "My Title"))
- Combine multiple ggplot() plots into a single figure using the “patchwork” package, or use patchwork::wrap_plots() to quickly combine ggplot objects contained in a list
- Plot points and error bars with ggplot() + geom_point() + geom_errorbar(), and add lines (+ geom_hline() or + geom_vline()), arrows (+ geom_segment()), or text (+ annotate()) to a ggplot() object
- Conduct a binomial hypothesis test with rstatix::binom_test()
- Find critical t or z values with qt() or qnorm()
- Conduct a one-sample t-test with the t.test() function or with infer::t_test()
- Use infer::visualize() to visualize where your sample statistic would fall in the sampling distribution associated with your null hypothesis
- Conduct a chi-squared test of independence with sjPlot::sjtab() or chisq.test() and interpret the results
- Generate measures of association by specifying statistics= using sjPlot::sjtab() and interpret them appropriately
- Extract observed and expected frequencies by assigning chisq.test() to an object (e.g., chisq) and then calling elements from the object (e.g., chisq$observed or chisq$expected)
- Conduct an independent sample t-test with t.test() and interpret the results
- Conduct a Levene’s test for equal or unequal variances with car::leveneTest()
If you do not recall how to do these things, review Assignments 1-10.
Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand the key concepts covered in the chapter: scatterplots, Pearson’s correlation coefficient, and bivariate (OLS) linear regression, including the null hypothesis tests for r and b.
As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).
Goal: Visualize a bivariate relationship and OLS regression line
This week’s book chapter focused on describing the association between two continuous numeric variables using a scatterplot, Pearson’s correlation coefficient, and an ordinary least squares (OLS) linear regression model. Additionally, the chapter introduced t-value statistics that can be used to test the null hypothesis of no association (r=0 and b=0) in the population from which the sample was randomly drawn.
Later, you will practice these skills yourself with a criminology example using the States data. First, though, we want to help build your intuitions about what is happening under the hood when we estimate a correlation and fit a regression model. Remember the “mtcars” dataset that comes built into R, which we used in Assignment 8? Let’s use those data again for our illustration.
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.21 19.4 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
Remember, you can learn more about the “mtcars” data by typing
?mtcars
in your console (or a code chunk) and hit enter (or
run). Doing so will bring up a help page, where you will learn that the
data were “extracted from the 1974 Motor Trend US magazine, and
comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973–74 models).”
Recall, in assignment 8, we pretended the mtcars
data
were randomly selected from the population of cars in 1974 so we could
illustrate the logic of a one-sample null hypothesis test. Here, we will
not attempt to test a null hypothesis using these data. Recall that we
cannot make valid statistical inferences (e.g., via a null hypothesis
test) about the population of cars from these data because
mtcars
is a convenience sample and not a probability
sample. That is, the mtcars
sample was not drawn from a
larger, well-defined population of cars in 1974 in such a way that all
cars in that population had an equal (as in simple random sampling or
SRS) or a known and nonzero probability (as in more complex forms of
probability sampling) of selection into the mtcars
sample.
So, here, we will simply focus on describing a bivariate
association in this nonrandom sample for now. You will practice testing
and interpreting results of a null hypothesis later in the
assignment.
For our illustration, let’s explore the association between miles per
gallon (mpg
) and horsepower (hp
) among cars in
these data. Our a priori expectation is that cars with greater
horsepower will, on average, travel fewer miles per gallon - that is,
they will have lower mpg values. The assumed mechanism driving this
expected association is that cars with higher
horsepower often (but not always) have larger engines (e.g., greater
engine displacement in cubic inches or disp
values), and larger engines on average burn more fuel
than smaller engines in comparable driving conditions. So, put simply,
we expect a strong, negative correlation between hp
and
mpg
in the mtcars
data.
Let’s start by visualizing the univariate distributions of both variables using histograms, as well as the joint bivariate distribution using a scatterplot.
hist1 <- mtcars %>% ggplot() +
geom_histogram(aes(x=mpg, y=..density..), color="#ed6925", fill="#ed6925",
position="identity", breaks = seq(0, 45, by = 2)) +
theme_minimal() +
coord_cartesian(expand = FALSE, xlim=c(0,45), ylim=c(0,.12))
hist2 <- mtcars %>% ggplot() +
geom_histogram(aes(x=hp, y=..density..), color="#fbb61a", fill="#fbb61a",
position="identity", breaks = seq(0, 350, by = 20)) +
theme_minimal() +
coord_cartesian(expand = FALSE, xlim=c(0,300), ylim=c(0,.01))
scatter1 <- mtcars %>% ggplot(aes(x=hp, y=mpg)) +
geom_point(shape=21, size=4,
color="#ed6925", fill="#fbb61a") +
theme_minimal()
(hist1 + hist2)/scatter1
As we expected, mpg and hp appear to be negatively correlated: cars with higher horsepower scores tend to have lower mpg scores and vice versa.
There are many ways to calculate a Pearson correlation coefficient to
describe the strength of the linear association between these two
variables. Later, we will introduce you to the cor_test()
function from the correlation
package. For now, let’s just
use the base R cor()
function:
# Pearson's correlation coefficient (r) for linear association between mpg & hp
cor(mtcars$mpg, mtcars$hp)
## [1] -0.776
Indeed, the Pearson’s correlation coefficient (r = -0.78)
suggests there is a strong negative linear correlation between
mpg and hp in the mtcars
data.
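If you want to connect that number back to the definition of r, you can reproduce it by hand from the covariance and the two standard deviations (an optional check, not something you need for the assignment):
# r is the covariance of mpg and hp divided by the product of their standard deviations
cov(mtcars$mpg, mtcars$hp) / (sd(mtcars$mpg) * sd(mtcars$hp))
This should return the same -0.776 value that cor() reported above.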
Ok, that’s enough of a detour; let’s get back on track by revisiting
our mtcars
example. Below, we add a linear regression line
to our scatterplot of the relationship between mpg and hp.
linfig <- scatter1 + geom_smooth(method = lm)
linfig
Now that you know issues such as nonlinear relationships might affect our inferences from linear models, do you think this is an issue with these data? That is, do you think the relationship between mpg and horsepower really is best described as linear?
It appears from the scatterplot that a linear regression line might systematically under-predict Y values (i.e., the line is below the dots or observations) at low and high values of X, and it might systematically over-predict (i.e., the line is above the observations) at mid-range values of X. This type of systematic under- or over-predicting can be symptomatic of a poor-fitting linear function, and it may indicate that the relationship between X and Y is actually nonlinear.
Theoretically, a nonlinear functional relationship between mpg and hp also might make a lot more sense than a linear one. After all, the linear regression line that we fit to these data predicts that cars with really high horsepower - say, 500hp - will have negative mpg values! Yet, despite this model prediction, we might logically assume that there is a threshold at which mpg values cannot really get any lower. For instance, it is probably safe to assume that a functioning car on the consumer market should not get less than 0 mpg! Moreover, increasingly higher mpg gains might be expected among cars with low hp (or smaller engine sizes), but there might also be a threshold at which mpg gains level off at low hp values as well. Given these expectations, let’s fit a few different nonlinear functions to the data and visually compare them - say, a quadratic curve, a simple logarithmic decay curve, and a more complex logarithmic decay curve that allows for theoretically possible upper and lower mpg thresholds.
# quadratic - note we would not expect mpg to increase at high X (overfitting)
quadfig <- scatter1 + geom_smooth(method = lm, formula=y ~ x + I(x^2))
# logarithmic decay function
# examples: https://saylordotorg.github.io/text_intermediate-algebra/s10-03-logarithmic-functions-and-thei.html
# thanks to chatGPT for helping with the log decay code!
#simple logarithmic decay
log1fig <- scatter1 + geom_smooth(method = lm, formula = y ~ log(x))
# Complex logarithmic decay: This is the type of function I want to fit to the data:
# Define function
log_decay <- function(x, a, b, c, d) {
a + b/(1 + exp(c*(x - d)))
}
# Here is a plot of the function:
# # Generate x values
# x_vals <- seq(0, 400, length.out = 1000)
#
# # Plot function
# ggplot(data.frame(x = x_vals), aes(x)) +
# stat_function(fun = log_decay, args = list(a = 35, b = -30, c = -0.02, d = 100), geom = "line") +
# ylim(0, 30) +
# xlim(0, 400) +
# labs(x = "hp", y = "mpg")
# Fit log_decay function to mtcars data & plot
log2fig <- scatter1 + geom_smooth(method = "nls", formula = y ~ log_decay(x, a, b, c, d),
method.args = list(start = list(a = 35, b = -23, c = -0.02, d = 100)),
se = FALSE, color = "blue")
linfig + quadfig + log1fig + log2fig
What do you think? Does the simple linear regression line fit better than the nonlinear models?
Let’s first consider the quadratic model (top right). Social science researchers frequently fit quadratic models to account for nonlinearity. Indeed, the quadratic line appears to fit the data better than the linear model. However, this model also appears to “overfit” the data by predicting an increase in mpg at the highest hp values. A couple cars with really low mpg scores around 200hp seem to pull the critical point of the quadratic curve down, while the one car with the highest hp and a relatively higher mpg score seems to pull the quadratic line back up to create the predicted parabolic increase in mpg at high hp scores.
Hence, as this week’s chapter describes, outlier observations can be quite influential when fitting a regression line. Why? Because the regression model uses all observed values to identify the best-fitting line. Remember weeks ago when you learned about the statistical properties of the mean and, specifically, about how the mean is the point of minimized variation? Recall you also learned how to calculate variation as the sum of squared deviations about the mean (and the variance as the average of those squared deviations)?
Well, essentially the same thing is happening here! In fitting an (OLS linear) regression model, we are identifying the best-fitting line that minimizes the squared deviations of the observed Y values around the line at every value of X. That is, we are calculating the conditional mean of Y at every observed value of X, then drawing a line through those conditional means.
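Here is a quick numerical sketch of that idea: the OLS line has a smaller sum of squared residuals than any other line we might draw through the data. The alternative intercept and slope below (31 and -0.08) are arbitrary values chosen just for comparison:
# OLS picks the intercept and slope that minimize the sum of squared residuals (SSR)
ols_fit <- lm(mpg ~ hp, data = mtcars)
sum(residuals(ols_fit)^2) # SSR for the OLS line
sum((mtcars$mpg - (31 - 0.08*mtcars$hp))^2) # SSR for an arbitrary alternative line - always larger
Try a few other intercept and slope combinations and you will see that none of them beats the OLS fit.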
So, what about those logarithmic decay curves? Given the data and assumed theoretical mechanisms, these might be more appropriate for describing the relationship between hp and mpg. With that said, assessing which of these regression lines is more theoretically plausible and which provides a better fit to the data is far beyond the scope of this assignment. However, it is important to point out that this is exactly the type of thing a careful researcher can and should be investigating when they build and fit (linear) regression models!
And that brings us to an important lesson to take away from all this line-fitting: data and statistical models are dumb - that is, they will do what you tell them to do, even if it is inappropriate to do so. This is why it is important to understand what is going on “under the hood” when working with and analyzing data, or when reading the statistical results from others’ data analysis research.
Now, let’s get back to our simple OLS linear regression model anyway.
We can easily generate a comparable scatterplot with a linear regression
line, as well as the Pearson’s r correlation coefficient AND a
null hypothesis t-test statistic by simply using the
correlation
and see
packages from the
easystats suite.
#example with correlation & see packages (easystats suite)
# https://easystats.github.io/see/articles/correlation.html
#correlation & scatterplot
carscorr <- cor_test(mtcars, y="mpg", x="hp")
plot(carscorr)
Recall, using a t-test to make statistical inferences about
a population of cars is inappropriate for these data since the
mtcars
data were not collected as a probability sample.
Yet, that fact does not stop our packages from generating such test
statistics anyway. Hence, yet again, data and statistical models are
dumb: they do what you tell them to do, even if you tell them to do
something that they should not do.
What if we wanted to use our linear regression model to make a
prediction about expected mpg scores given a specific car horsepower?
Let’s use lm()
to fit the linear regression model and then
summarize results.
carslinreg = lm(mpg ~ hp, data=mtcars)
summary(carslinreg)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.712 -2.112 -0.885 1.582 8.236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.0989 1.6339 18.42 < 0.0000000000000002 ***
## hp -0.0682 0.0101 -6.74 0.00000018 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.86 on 30 degrees of freedom
## Multiple R-squared: 0.602, Adjusted R-squared: 0.589
## F-statistic: 45.5 on 1 and 30 DF, p-value: 0.000000179
The intercept is 30.10, which means the predicted conditional
average mpg for a car with hp=0 (an implausible car) is about 30mpg. In
other words, the model predicts cars with lowest hp (and presumably
smallest engines) should top out around 30mpg. Of course, we already
know that our linear model was under-predicting mpg at these low values,
which is why we considered possible nonlinear relationships. Still,
let’s stick with the linear model results for now anyway.
The regression slope (beta) for hp is b=-0.068, which means that the conditional average mpg score is expected to decrease by 0.068 for every one-unit increase in horsepower. The linear regression equation, then, is: \(\hat{y}=30.10 + -0.068x\)
From this equation, we can calculate specific model predictions for mpg. For example, the model predicts that a 100hp car (i.e., an increase from 0hp to 100hp) is expected to get \(30.10 + 100*(-0.068) = 23.3mpg\), or about 23 miles per gallon. A 250hp car is expected to get \(30.10 + 250*(-0.068) = 13.1mpg\), or about 13 miles per gallon. A 400hp car is expected to get 30.10 + 400*(-0.068) = 2.9mpg. Yikes! What about a 500hp car? Here, the model predictions become especially problematic, as this powerful gas guzzler is predicted to get \(30.10 + 500*(-0.068) = -3.9mpg\). Uh oh.
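Rather than plugging values into the equation by hand, you can also ask R for these predictions directly with predict(). Because predict() uses the unrounded coefficients, its values will differ slightly from the hand calculations above; the hp values below are hypothetical cars, not observations in the data:
# predicted mpg for hypothetical cars with 100, 250, 400, and 500 horsepower
predict(carslinreg, newdata = data.frame(hp = c(100, 250, 400, 500)))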
What if we used the simple logarithmic decay function (bottom left plot above) instead to predict mpg? That model predicts a 100hp car is expected to get \(72.64 + log(100)*(-10.76) = 23.1mpg\), or again about 23 miles per gallon. A 250hp car is expected to get \(72.64 + log(250)*(-10.76) = 13.2mpg\), or once again about 13mpg. Those two predictions really did not change much from the linear to the simple logarithmic decay model. In contrast, a 400hp car is expected to get \(72.64 + log(400)*(-10.76) = 8.2mpg\), which is substantially higher than the 2.9mpg predicted by the linear model. Meanwhile, a 500hp car is predicted to get \(72.64 + log(500)*(-10.76) = 5.77mpg\), or about 6mpg. Recall, these data are from a convenience sample of cars in 1974, so these predictions might be plausible; they appear to be consistent with other reports of mpg ranges (6.8 to 29.1) for cars at the time.
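If you want to reproduce those logarithmic-decay predictions yourself, you can fit the same simple log model that geom_smooth() used in the bottom-left panel (the object name loglinreg is just for illustration); its coefficients should be close to the 72.64 and -10.76 values used above:
# simple logarithmic decay model: mpg as a linear function of log(hp)
loglinreg <- lm(mpg ~ log(hp), data = mtcars)
coef(loglinreg)
predict(loglinreg, newdata = data.frame(hp = c(100, 250, 400, 500)))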
Now, before moving on to practice these skills yourself, let’s dig a little deeper under the correlation and regression hood.
Remember z scores, which you create by subtracting a variable’s mean from each observation and then dividing by the variable’s standard deviation (\(z=\frac{x-\overline{x}}{s}\))? It turns out that, if you transform your X and Y variables into z scores, then it is quite easy to calculate a correlation: the correlation is simply the average of the products of each observation’s pair of z scores! Moreover, if X and Y are perfectly positively correlated (r=1), then \(z_x\) and \(z_y\) values will be identical for every observation. In contrast, if X and Y are perfectly negatively correlated (r=-1), then \(z_x\) and \(z_y\) values will be identical in magnitude but opposite in sign for every observation.
Jaccard and Becker’s book, Statistics for the Behavioral Sciences, illustrates this point well:
“When a linear relationship is positive, the z scores on variable X will tend to be similar to the z scores on variable Y, and they will also tend to be alike in sign. In fact, when a positive linear relationship is perfect, the z scores on X and Y will be identical” (Jaccard & Becker, p.141). “When the scores for a given individual are multiplied by each other, the product (\(z_xz_y\)) will be positive,” unless both variables equal 0, in which case their product will equal zero (p.142).
The sum of the products will be positive.
“When a linear relationship is negative, the z scores on variable X will also tend to be similar to the z scores on variable Y, but they will generally be opposite in sign.” “When the scores for a given individual are multiplied by each other, the product (\(z_xz_y\)) will be negative,” unless both variables equal 0, in which case their product will equal zero (p.142). The sum of the products will be negative.
“Finally, when there is no linear relationship, the z scores on variable X will bear no consistent relationship to the z scores on variable Y, in either size or sign” (p.144). The sum of the products will be 0.
“We can therefore use the sum of the products of z scores as an index of the relationship between two variables. There is, however, one complication: When the correlation between two variables is nonzero, the value of the sum of z score products is influenced not only by the size of the correlation but also by the sample size (N).”
So, we adjust for N and, with z scores, we can calculate the correlation coefficient as: \(r=\frac{\Sigma z_xz_y}{N}\)
Let’s try this with our mtcars
data and see for
ourselves by recoding mpg and hp into z scores. Given the
strong, negative correlation between mpg and hp, we should expect that
their z scores will tend to have similar magnitudes yet
opposite signs.
mtcars2 <- mtcars %>%
mutate(
zy = (mpg - mean(mpg))/sd(mpg),
zx = (hp - mean(hp))/sd(hp)
) %>%
dplyr::select(mpg,hp,zy,zx)
head(mtcars2)
## mpg hp zy zx
## Mazda RX4 21.0 110 0.151 -0.535
## Mazda RX4 Wag 21.0 110 0.151 -0.535
## Datsun 710 22.8 93 0.450 -0.783
## Hornet 4 Drive 21.4 110 0.217 -0.535
## Hornet Sportabout 18.7 175 -0.231 0.413
## Valiant 18.1 105 -0.330 -0.608
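Before fitting a model, we can check the z-score formula for r directly. One small wrinkle: because R’s sd() uses the sample (N - 1) denominator, we divide the sum of the z-score products by N - 1 rather than N to reproduce r exactly:
# sum of the products of each car's z scores, divided by N - 1, reproduces Pearson's r
sum(mtcars2$zx * mtcars2$zy) / (nrow(mtcars2) - 1)
cor(mtcars2$hp, mtcars2$mpg) # same value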
Now, let’s fit a linear model with lm()
and then
generate an interactive plot using ggPredict()
from the
ggiraphExtra
package. This interactive plot will allow us
to compare the specific \(z_x\) (hp)
and \(z_y\) (mpg) scores for each
observation or car in our scatterplot by simply hovering your mouse
cursor over each point in the plot.
#linear model
carslinreg = lm(zy ~ zx, data=mtcars2)
summary(carslinreg)
##
## Call:
## lm(formula = zy ~ zx, data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.948 -0.350 -0.147 0.263 1.367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0000000000000000315 0.1133047256289521632 0.00 1
## zx -0.7761683718265860454 0.1151177163567893152 -6.74 0.00000018
##
## (Intercept)
## zx ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.641 on 30 degrees of freedom
## Multiple R-squared: 0.602, Adjusted R-squared: 0.589
## F-statistic: 45.5 on 1 and 30 DF, p-value: 0.000000179
ggPredict(carslinreg,se=TRUE,interactive=TRUE)
As expected, when we hover over each observation, we notice that as \(z_x\) scores increase (i.e., larger positive hp z scores), \(z_y\) scores tend to decrease (larger negative mpg z scores).
Note, too, that hovering over the regression line tells us the
regression equation that we fit with our model: \(\hat{y}=0 + -0.78x\). Note the intercept,
or the average predicted value of Y when X=0, is equal to “0” in
this
equation. This is because we are using standardized z score
variables; with a standardized z score variable for hp, cars
with the average (mean) horsepower are recoded as \(z_x=0\). Likewise, cars with average (mean)
mpg are recoded as \(z_y=0\). Moreover,
since a regression line must always pass through the mean of X and Y,
our linear regression line with standardized z scores must also
pass through XY coordinates of (0,0), as this is the location of the
means of X and Y.
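A quick way to see this property yourself is to ask each fitted model for its prediction at the mean of X (a check for illustration, not part of the assignment):
# the OLS line passes through the means: at zx = 0 (mean hp), predicted zy is 0 (mean mpg)
predict(carslinreg, newdata = data.frame(zx = 0))
# the same property holds for the unstandardized model: at mean(hp), it predicts mean(mpg)
predict(lm(mpg ~ hp, data = mtcars), newdata = data.frame(hp = mean(mtcars$hp)))
mean(mtcars$mpg)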
Additionally, note that the regression slope or beta coefficient is b=-0.78. That number should look familiar - it is the same as our Pearson’s r correlation coefficient! Again, this is because we are using standardized scores. With only one predictor in an OLS linear regression model, the standardized regression coefficient is equal to the correlation coefficient. We can interpret this coefficient as follows: A one-standard deviation unit increase in horsepower (hp) is predicted to decrease miles per gallon (mpg) by 0.78 standard deviation units.
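As an aside (not required for the assignment), you can get the same standardized slope without recoding the variables by hand, by wrapping both variables in scale() inside lm():
# standardizing inside lm() returns a slope equal to Pearson's r
coef(lm(scale(mpg) ~ scale(hp), data = mtcars))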
There is so much more we could cover, but you must be getting antsy to try some of this yourself. So, without further ado, let’s load the States data and learn how to calculate bivariate correlations, fit a bivariate linear regression model, and plot a bivariate scatterplot!
Goal: Read in States Dataset, Create Scatterplot, and Calculate Correlation
In the rest of this assignment, you will work through most of the tasks described in the SPSS Exercises at the end of Chapter 12. Take some time to read through those exercises to help you understand what we are doing. Put briefly, this exercise asks you to investigate whether there is a relationship between unhealthy behaviors and crime rates in the States dataset, with unhealthy behaviors measured using states’ tobacco death rates and crime rates measured using states’ robbery rates. As in those exercises, you will be graphing the association, estimating the strength of the association, and fitting a regression model to the data.
(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR_MO_DY for the actual date. E.g., 2022_07_06_Fordham_K300Assign11)
To get started, remember that the here package will automatically set our K300_L folder as the top-level directory.
Next, load the following packages: tidyverse, haven, here, sjmisc, see, ggplot2, correlation, performance, ggiraphExtra, ggiraph, and sjPlot. Recall, you only need to install packages one time. However, you must load them each time you start a new R session.
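For example, a minimal code chunk to load these packages might look like the following (this assumes the packages are already installed; you could substitute groundhog.library() calls if you prefer that reproducible approach):
# load packages for Assignment 11
library(tidyverse)
library(haven)
library(here)
library(sjmisc)
library(see)
library(ggplot2)
library(correlation)
library(performance)
library(ggiraph)
library(ggiraphExtra)
library(sjPlot)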
Here is a brief description of what some of our new packages (ggplot2, correlation, performance, ggiraphExtra, and ggiraph) will do for us:
- ggplot2 allows us to create graphs or charts to visually represent our data. We provide the data, tell ggplot2 how to map variables to aesthetics and what graphical primitives to use, and it takes care of the details of our graph. You can learn more about ggplot2 here.
- correlation is a package within the easystats suite, which includes our see package as well. correlation is focused on correlation analysis and is easy to use. It allows for the computation of many different kinds of correlations, such as partial and multilevel correlations. To learn more about the correlation package, visit here.
- Tools for assessing model fit are spread across the base and stats packages with no unique approach toward checking model fitness. performance attempts to fill this gap by allowing users to check the fitness of their model to the data with measures like r-squared. For more information on the performance package, you can visit here.
- ggiraph is an interactive and dynamic plotting package. It allows us to impose JavaScript functions onto our graphs, thus rendering them interactive. This means we can hover over and choose specific data points on the graph when exploring our data. ggiraphExtra is an enhancer package to ggiraph and ggplot2 that provides a function for exploratory graphs. To learn more about ggiraph and ggiraphExtra, visit here and here.
Next, read the 2012StatesData.sav file into R with haven::read_spss() and here(), and assign it to an object named StatesData. Then, on a new line, type StatesData. This will call the object and provide a brief view of the data. (Note: You can get a similar but more visually appealing view by simply clicking on the object in the “Environment” window. Also, be sure to put these functions on separate lines or they will not run properly.)
StatesData <- read_spss(here("Datasets", "2012StatesData.sav"))
mycorr <- cor_test(StatesData, y = "RobberyRt", x = "tobaccodeathrt")
This code uses the cor_test function within the correlation package to calculate Pearson’s r, or the correlation, between the RobberyRt and tobaccodeathrt variables. We then place the calculated correlation into a data object called mycorr. Note that the independent variable is second and the dependent variable is first. This order is important to ensure the correlation is run correctly (i.e., we want to know how tobacco death rates affect robbery rates, not how robbery rates affect tobacco death rates). Remember to replace RobberyRt with BurglaryRt!
Next, plot the correlation by typing plot(mycorr). The plot() function is within the see package and simply tells R to plot the correlation as a scatterplot. This function also allows us to see the confidence intervals and Pearson’s r coefficient. Your R Studio should look like this:
#example with correlation & see packages (easystats suite)
# https://easystats.github.io/see/articles/correlation.html
#correlation & scatterplot
mycorr <- cor_test(StatesData, y="RobberyRt", x="tobaccodeathrt")
plot(mycorr)
For the second part of the assignment, you will run a linear regression model on our variables and calculate the intercepts, adjusted and unadjusted r-squareds, and the t- and p-values. The linear model allows us to determine the slope of the regression line and the r-squared value.
mylinreg = lm(RobberyRt ~ tobaccodeathrt, data=StatesData)
This code fits the regression model and assigns it to an object named mylinreg. The lm() function tells R to run a regression model between the ‘RobberyRt’ and ‘tobaccodeathrt’ variables.
Next, type summary(mylinreg). The summary() function simply shows us a summary of our model object, including the r-squared and slope of the regression line. Your R Studio should look like this:
#linear model
mylinreg = lm(RobberyRt ~ tobaccodeathrt, data=StatesData)
summary(mylinreg)
##
## Call:
## lm(formula = RobberyRt ~ tobaccodeathrt, data = StatesData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -95.04 -34.96 4.93 28.25 158.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.218 51.101 0.08 0.935
## tobaccodeathrt 0.371 0.187 1.99 0.053 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.1 on 48 degrees of freedom
## Multiple R-squared: 0.0761, Adjusted R-squared: 0.0568
## F-statistic: 3.95 on 1 and 48 DF, p-value: 0.0525
Next, you will make the plot interactive using ggPredict() from ggiraph and ggiraphExtra. We will also use the check_model() function from the performance package to check the fit of our model.
Type ggPredict(mylinreg, se=TRUE, interactive=TRUE). Because our linear model is already calculated, all we need to do is use ggPredict() to make the plot interactive. The interactive = TRUE argument does just this, while the se = TRUE argument tells R to include the ‘standard error’. To learn more about the ggPredict() function, visit here. Your R Studio should look like this:
#create interactive plot to easily identify specific data points (need: ggiraph, ggiraphExtra packages)
ggPredict(mylinreg,se=TRUE,interactive=TRUE)
Finally, you will use the performance package and the check_model() function. This allows us to ensure our model is the right fit for our data.
Type check_model(mylinreg). This function allows us to check for influential outliers, residual variance, and other assumptions. It also utilizes Bayesian estimation to conduct some of these tests, although that is beyond the scope of this class. To learn more about the check_model() function, visit here and here. Your R Studio should look like this:
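(The code for this step is just the single line below; the panel of diagnostic plots it produces is not reproduced here.)
check_model(mylinreg)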
3. Congratulations! You should now have everything that you need to complete the questions in Assignment 11 that parallel those from B&P’s SPSS Exercises for Chapter 12! Complete the remainder of the questions in Assignment 11 in your RMD file.
a. Keep the file clean and easy to follow by using RMD level headings (e.g., denoted with ## or ###) separating R code chunks, organized by assignment questions.
b. Write plain text after headings and before or after code chunks to explain what you are doing - such text will serve as useful reminders to you when working on later assignments!
c. Upon completing the assignment, “knit” your final RMD file again and save the final knitted Word document to your “Assignments” folder as: YEAR_MO_DY_LastName_K300Assign11_11. Submit via Canvas in the relevant section (i.e., the last question) for Assignment 11.11.
d. This is your last assignment! You made it! Hopefully, you feel comfortable performing basic statistical processes in R. R is not an easy program to use and it takes practice, but you have already learned more about it than even many researchers know. You should be proud!
Do you now know how to:
- Calculate Pearson’s r with correlation::cor_test() and plot the correlation with see::plot()?
- Fit a linear regression model with lm() and summarize results using summary()?
- Create an interactive plot with the ggPredict() function from the ggiraph and ggiraphExtra packages?
- Use performance::check_model() to generate some checks for influential outliers, residual variance, and other regression assumptions?