The purpose of this seventh assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 7 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.
In the last assignment, you were introduced to frequentist empirical probabilities and probability distributions, including the standard normal (z) distribution. You also learned how these are essential to null hypothesis significance testing, which is the process by which most social scientists make inferences from sample descriptive statistics to population parameters. Finally, you practiced various steps in this process, such as calculating frequentist probabilities from crosstabs, calculating z-scores from raw observation values, and even conducting a basic binomial hypothesis test.
In Assignment 7, we will dig deeper into the process of making statistical inferences about population parameters from sample statistics. For instance, you will learn to think about sample descriptive statistics (e.g., a sample mean or correlation coefficient) as point estimates of population parameters. Moreover, while point estimates may represent our best or most likely guess about the value of a population parameter, a point estimate usually is just one among many potential estimates of a population parameter that are plausible under our data and modeling assumptions. In fact, we can use our knowledge of probability to calculate an interval estimate, such as a frequentist 95% confidence interval, around a point estimate.
Pragmatically, a confidence interval is a range of plausible estimates that quantifies the degree of uncertainty we have in our estimate of a population parameter. Learning to think about sample-derived estimates of population parameters as (uncertain) interval estimates rather than as (fixed and certain) point estimates is essential to truly understanding the process of statistical inference. After all, we rarely know the true population parameters and, thus, the seductive tendency to view sample statistics as “real” and known calculations of population values is misguided and prone to inference errors.
Unfortunately, frequentist confidence intervals (like their p-value cousins) are notoriously misunderstood and misinterpreted. For instance, it is quite common to read or hear people explain a 95% confidence interval like this: “We can be 95% certain that the true population parameter we seek falls within the 95% confidence interval we constructed around our sample point estimate.” Sadly, this is an inaccurate interpretation. When we calculate a frequentist 95% confidence interval, we are not 95% confident in the calculated interval itself. Rather, we are confident in the method that we used to generate confidence intervals.
What does this mean? Well, do you recall that long run or “sampling” notion of probability we discussed in the last assignment? We need to keep that in mind when interpreting a frequentist confidence interval around a point estimate. Now, imagine we calculated a sample statistic and used it as a point estimate to infer a population parameter value, and we also calculated a 95% confidence interval to quantify uncertainty around our estimate. Next, let’s hypothetically imagine we were to repeatedly follow the same sampling procedures a large number of times, calculating a point estimate and 95% confidence interval the same way each time. In what are we 95% confident? Assuming our data and modeling assumptions are appropriate, we are confident that the interval estimates we calculate in a large number of repeated trials would capture the true population parameter 95% of the time.
Of course, this means that we also expect our 95% confidence intervals would fail to capture the population parameter 5% of the time on repeated trials. Moreover, when we only calculate a confidence interval from a single sample, we cannot know whether we have one of the 95% of parameter-catching interval nets that effectively captured the true population value or, instead, whether we happened to get one of the 5% of atypical interval nets that failed to capture the true population parameter.
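To make this long-run idea concrete, here is a minimal simulation sketch of our own (it is not part of the assignment, and names like n_trials are purely illustrative) in which we pretend to know the population, repeatedly draw samples, and count how often the t-based 95% intervals capture the true mean:
# Our own minimal sketch of long-run CI coverage (not assignment code)
# We pretend to know the population here only so we can check each interval.
set.seed(1138)
n_trials <- 10000                      # number of repeated samples (illustrative)
n <- 25                                # sample size per trial
true_mean <- 100                       # "known" population mean for this demo
covered <- replicate(n_trials, {
  samp <- rnorm(n, mean = true_mean, sd = 50)          # draw one sample
  moe  <- qt(0.975, df = n - 1) * sd(samp) / sqrt(n)   # 95% margin of error
  (mean(samp) - moe <= true_mean) & (true_mean <= mean(samp) + moe)
})
mean(covered)   # proportion of intervals capturing the true mean; should be near 0.95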
Finally, want to catch more parameters? Like capturing Pokemon, you just need to get a bigger or better net! For instance, we can improve our expected parameter-capturing rate (e.g., to 99%) by casting a wider interval net (i.e., by calculating wider confidence intervals). By widening our interval net, we exchange greater imprecision/uncertainty for the prospect of making fewer inference errors. We can avoid this trade-off by getting a better net - that is, by improving features of our research design (e.g., more precise measurement; larger sample sizes).
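For a rough numeric feel for this trade-off, the short sketch below (our own illustration, using an arbitrary standard deviation of 50) compares t-based margins of error for a 95% versus a 99% interval, and for n=25 versus n=100:
# Our own illustration of the width vs. error-rate trade-off (sd value is arbitrary)
s <- 50
qt(0.975, df = 24) * s / sqrt(25)    # 95% margin of error, n = 25  (~20.6)
qt(0.995, df = 24) * s / sqrt(25)    # 99% margin of error, n = 25  (a wider net, ~28.0)
qt(0.975, df = 99) * s / sqrt(100)   # 95% margin of error, n = 100 (a better net, ~9.9)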
In the current assignment, you will learn how to calculate confidence intervals around a point estimate in R and to interpret them appropriately. Additionally, you will learn how to simulate data from a probability distribution, which should help you better understand sampling variability and the need for interval estimates. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.
As part of this assignment, you will also learn how to:

- Draw a random sample of rows from a dataset with dplyr::sample_n()
- Improve the reproducibility of random sampling code with set.seed()
- Select specific rows using dplyr::filter() and the %in% operator
- Create a list or vector of values and assign it to an object (e.g., listname <- c(item1, item2))
- Simulate data from probability distributions with rnorm(), truncnorm::rtruncnorm(), or runif()
- Combine multiple ggplot() plots into a single figure using the “patchwork” package
- Identify t or z critical values with qt() or qnorm()
- Plot means and confidence intervals with ggplot() + geom_point() + geom_errorbar()
- Add vertical line (+ geom_vline()), arrows (+ geom_segment()), or text (+ annotate()) elements to a ggplot() object

We are building on objectives from Assignments 1-6. By the start of this assignment, you should already know how to:
- Call functions from packages using the package::function() format
- Read in an SPSS data file with haven::read_spss() and assign it to an R object using an assignment (<-) operator
- Use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
- Use the %>% pipe operator to perform a sequence of actions
- Recognize != as “not equal to”
- Turn off scientific notation and limit printed digits with options(scipen=999, digits = 3)
- Create your own function (e.g., our funxtoz() function) - and that doing so is recommended for duplicate tasks to avoid copy-and-paste errors
- Use here() for a simple and reproducible self-referential file directory method
- Use groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages
- Use the head() function to quickly view a snapshot of your data
- Use the glimpse() function to quickly view all columns (variables) in your data
- Use sjPlot::view_df() to quickly browse variables in a data file
- Use attr() to identify variable and attribute value labels
- Recognize missing (NA) values for variables in your data file and keep only valid cases with filter(!is.na(var))
- Convert labelled variables to factors with haven::as_factor() and drop unused levels with data %>% droplevels(data$variable)
- Subset and modify data with the select(), mutate(), and if_else() functions
- Create or recode variables with mutate()
- Use summarytools::dfSummary() to quickly describe one or more variables in a data file
- Create frequency tables with the sjmisc::frq() and summarytools::freq() functions
- Sort frequency distributions with summarytools::freq()
- Calculate measures of central tendency with mean() and median() (e.g., mean(data$variable))
- Generate descriptive statistics with the summarytools::descr() and psych::describe() functions
- Calculate measures of dispersion (e.g., with sjmisc::frq() or summarytools::descr())
- Select variables with dplyr::select() & create cross-tabulations with sjPlot::sjtab(depvar, indepvar) or with crosstable(depvar, by=indepvar)
- Conduct a binomial hypothesis test with rstatix::binom_test()
- Create tables with gt() (e.g., head(data) %>% gt())
- Modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
- Create plots with the ggplot() function
- Use boxplot() and ggplot() to visualize dispersion in a data distribution
- Modify a boxplot (geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
- Add a theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
- Select colors from a color palette (e.g., viridisLite::viridis()) and specify them for the outline and fill colors in a ggplot geometric object (e.g., geom_boxplot())
- Add or change axis labels and plot titles with the labs() function (e.g., + labs(title = "My Title"))

If you do not recall how to do these things, review Assignments 1-6.
Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:
As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).
Goal: Read in 2012 States data & calculate population mean for ‘RobberyRt’
(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR_MO_DY for the actual date. E.g., 2022_06_23_Fordham_K300Assign7)
In previous assignments, you learned how to calculate descriptive statistics like the sample mean of a data column in R. In the current assignment, you will use sample statistics (e.g., sample mean or proportion) as point estimates to infer population parameter values. Likewise, you will learn to calculate the standard error of the estimate and a confidence interval around the estimate using R, as well as how to interpret these interval estimates appropriately.
- After creating and saving your assignment file, remember that the here package will automatically set our K300_L folder as the top-level directory.
- In your setup code chunk, load the following packages: tidyverse, haven, here, sjmisc, sjPlot, gt, and truncnorm.
- This is likely the first time you have used the truncnorm package, which means you’ll need to download it first using install.packages() or the “Install Packages” button under the “Tools” tab. Then, you can use library() to load it in. We will use this package to simulate data from a truncated normal distribution.
- Recall that you can also use groundhog.library() to improve the reproducibility of your script.
- Read in the “2012StatesData.sav” SPSS datafile with haven::read_spss() and here(), and assign it to an R object named StatesData.
- Check your Environment to confirm the data were assigned to StatesData.
- Use the head() function (piped to gt()) to print the first five rows so you can compare your output.
StatesData <- read_spss(here("Datasets", "2012StatesData.sav"))
StatesData %>% head() %>% gt()
State | Number1824 | NumberRural | State in South | Four regional categories | Percent in different house in 2008 | Cigarette Tax in pennies | Smoke Free workplace laws Implemented | Tobacco related death rate per 100K | Percent of Adults who smoke | total population in Thousands | Divorces per 1K population | Percent of Families below poverty | Percent Individuals below poverty | Median Income for Households | Percent of Population without health insurance | Murder Rate per 100K | Robbery Rate per 100K | Assault Rate per 100K | Burglary Rate per 100K | Motor Vehicle Rate per 100K | Infant Mortality Rat per 1,000 Live Births | Cardiovascular Death Rate per 100K | Cancer Death Rate Per 100K | Percent of Pop With Bachelor's degree or more | PercentRural | Percent18to24 | ID | Binary for assault rate where 0= less than 200 assaults per 100k and 1=more than 200 assaults per 100k | Murder rate categorical |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Alabama | 335 | 1981 | 1 | 1 | 15.1 | 0.425 | 0 | 317.5 | 22 | 4780 | 4.4 | 13.4 | 17.5 | 40489 | 16.9 | 7.1 | 142.5 | 277.5 | 1058.9 | 244.8 | 9.9 | 235.5 | 197.3 | 22.0 | 41.443515 | 7.008368 | 1 | 1 | 2 |
Alaska | 54 | 216 | 0 | 2 | 21.5 | 2.000 | 0 | 270.4 | 22 | 710 | 4.4 | 6.2 | 9.0 | 66953 | 17.7 | 3.2 | 94.0 | 462.0 | 514.2 | 241.5 | 6.5 | 147.9 | 179.9 | 26.6 | 30.422535 | 7.605634 | 2 | 1 | 1 |
Arizona | 443 | 607 | 0 | 2 | 19.7 | 2.000 | 1 | 247.4 | 16 | 6392 | 3.5 | 11.6 | 16.5 | 48745 | 19.6 | 5.5 | 123.9 | 261.1 | 817.3 | 397.1 | 6.8 | 152.5 | 152.8 | 25.6 | 9.496245 | 6.930538 | 3 | 1 | 1 |
Arkansas | 200 | 1269 | 1 | 1 | 17.6 | 1.150 | 1 | 323.7 | 22 | 2916 | 5.7 | 14.8 | 18.8 | 37823 | 19.2 | 6.3 | 93.5 | 381.8 | 1224.1 | 215.6 | 7.7 | 221.8 | 200.4 | 18.9 | 43.518519 | 6.858711 | 4 | 1 | 2 |
California | 2766 | 1882 | 0 | 2 | 15.6 | 0.870 | 0 | 235.0 | 14 | 37254 | 0.0 | 10.6 | 14.2 | 58931 | 20.0 | 5.4 | 173.7 | 270.8 | 622.1 | 443.6 | 5.2 | 177.9 | 161.7 | 29.9 | 5.051807 | 7.424706 | 5 | 1 | 1 |
Colorado | 349 | 668 | 0 | 2 | 18.2 | 0.840 | 1 | 237.6 | 18 | 5029 | 4.2 | 8.9 | 12.9 | 55430 | 15.3 | 3.2 | 67.9 | 224.5 | 532.5 | 250.6 | 6.1 | 145.3 | 153.7 | 35.9 | 13.282959 | 6.939749 | 6 | 1 | 1 |
- Create a frequency table of the population distribution of “RobberyRt” with sjmisc::frq() by typing StatesData %>% frq(RobberyRt). This creates a frequency table that also provides the mean and standard deviation.
- Alternatively, you could calculate the population mean and standard deviation directly with mean(StatesData$RobberyRt) and sd(StatesData$RobberyRt).
.StatesData %>%
frq(RobberyRt)
## Robbery Rate per 100K (RobberyRt) <numeric>
## # total N=50 valid N=50 mean=104.58 sd=57.74
##
## Value | N | Raw % | Valid % | Cum. %
## -------------------------------------
## 14.30 | 1 | 2 | 2 | 2.00
## 14.90 | 1 | 2 | 2 | 4.00
## 16.50 | 1 | 2 | 2 | 6.00
## 17.20 | 1 | 2 | 2 | 8.00
## 18.00 | 1 | 2 | 2 | 10.00
## 22.90 | 1 | 2 | 2 | 12.00
## 30.30 | 1 | 2 | 2 | 14.00
## 37.20 | 1 | 2 | 2 | 16.00
## 42.20 | 1 | 2 | 2 | 18.00
## 47.30 | 1 | 2 | 2 | 20.00
## 56.20 | 1 | 2 | 2 | 22.00
## 65.30 | 1 | 2 | 2 | 24.00
## 66.70 | 1 | 2 | 2 | 26.00
## 67.90 | 1 | 2 | 2 | 28.00
## 70.40 | 1 | 2 | 2 | 30.00
## 74.50 | 1 | 2 | 2 | 32.00
## 74.70 | 1 | 2 | 2 | 34.00
## 79.50 | 1 | 2 | 2 | 36.00
## 80.20 | 1 | 2 | 2 | 38.00
## 86.80 | 1 | 2 | 2 | 40.00
## 87.70 | 1 | 2 | 2 | 42.00
## 92.90 | 1 | 2 | 2 | 44.00
## 93.50 | 1 | 2 | 2 | 46.00
## 94.00 | 1 | 2 | 2 | 48.00
## 98.70 | 1 | 2 | 2 | 50.00
## 103.40 | 1 | 2 | 2 | 52.00
## 113.60 | 1 | 2 | 2 | 54.00
## 114.10 | 1 | 2 | 2 | 56.00
## 117.30 | 1 | 2 | 2 | 58.00
## 123.90 | 1 | 2 | 2 | 60.00
## 126.00 | 1 | 2 | 2 | 62.00
## 126.50 | 1 | 2 | 2 | 64.00
## 127.10 | 1 | 2 | 2 | 66.00
## 129.40 | 1 | 2 | 2 | 68.00
## 131.60 | 1 | 2 | 2 | 70.00
## 133.70 | 1 | 2 | 2 | 72.00
## 142.30 | 1 | 2 | 2 | 74.00
## 142.40 | 1 | 2 | 2 | 76.00
## 142.50 | 1 | 2 | 2 | 78.00
## 144.50 | 1 | 2 | 2 | 80.00
## 153.30 | 1 | 2 | 2 | 82.00
## 153.60 | 1 | 2 | 2 | 84.00
## 157.00 | 1 | 2 | 2 | 86.00
## 166.80 | 1 | 2 | 2 | 88.00
## 167.60 | 1 | 2 | 2 | 90.00
## 173.70 | 1 | 2 | 2 | 92.00
## 189.70 | 1 | 2 | 2 | 94.00
## 210.70 | 1 | 2 | 2 | 96.00
## 228.00 | 1 | 2 | 2 | 98.00
## 260.70 | 1 | 2 | 2 | 100.00
## <NA> | 0 | 0 | <NA> | <NA>
Goal: Randomly Select 25 States from Dataset
As noted in the previous section, this is one of the relatively rare instances where we have data from an entire population - in this case, from all 50 states. Since the population parameters (e.g., mean and standard deviation) are known to us, we can illustrate the logic and process of statistical inference by randomly sampling from the population, generating sample point and interval estimates, and then comparing these sample estimates to the known population parameters. (Note: Typically, we cannot do this last step as we rarely know the population parameters - that is why we use samples to estimate them!)
So, you will start by randomly selecting 25 of the 50 states (rows) in the 2012 States dataset. In doing so, you will learn
about the set.seed()
function, which is essential for
reproducibility as it will allow all of us to select the same
random sample rather than generating a unique random sample every time
our code is run.
After drawing a random sample of 25 states, we will take a detour to introduce nonrandom selection of data subsets followed by data simulation. We will then revisit the random sample you generate in this section to complete the last part of the assignment.
- In a new R code chunk, type set.seed(1234) and hit enter. Then, type StatesDataRdm25 <- StatesData %>% sample_n(25, replace = FALSE) and run your code.
set.seed()
allows us to make our “random”
sample easier to reproduce. Without it, R will generate a new and unique
random sample each time you run your sampling code, which can cause
major reproducibility problems.
The set.seed() function specifies a fixed (rather than changing) starting point for the random number generator. This
way, any time we or a collaborator runs our code, we will get the exact
same “random” sample. Here, we are using a starting seed of
1234
, which will begin the random number generator at the
1,234th place. However, we can use any set of numbers when we set our
seed. Unique seeds will keep our work as random as possible yet still
reproducible. You could set your seed to 6789
– any value
will work, even 1
! I often use numbers like
8675309
or 1138
The dplyr::sample_n() function allows us to take a sample of size n from our dataset. The first value, 25, specifies that we want 25 rows, or states. The argument replace = FALSE simply tells R not to replace states after
choosing them. For example, if Missouri is chosen, R will not place it
back into the selection pool. This method ensures no state is chosen
more than once. If you wanted to sample with replacement,
change FALSE
to TRUE
.
- Note: Alternatively, you could replace sample_n() with sample_frac() and 25 with 0.5. Essentially, this approach will take a random sample of a fraction (in our case, 50%) of the rows in our data object.
- The StatesDataRdm25 <- StatesData %>% portion of the code assigns the result of our piped actions on the StatesData - that is, our random sample of 25 rows - into a new data object called “StatesDataRdm25”.
set.seed(1234)
StatesDataRdm25 <- StatesData %>% sample_n(25, replace = FALSE)
StatesDataRdm25 %>% head() %>% gt()
State | Number1824 | NumberRural | State in South | Four regional categories | Percent in different house in 2008 | Cigarette Tax in pennies | Smoke Free workplace laws Implemented | Tobacco related death rate per 100K | Percent of Adults who smoke | total population in Thousands | Divorces per 1K population | Percent of Families below poverty | Percent Individuals below poverty | Median Income for Households | Percent of Population without health insurance | Murder Rate per 100K | Robbery Rate per 100K | Assault Rate per 100K | Burglary Rate per 100K | Motor Vehicle Rate per 100K | Infant Mortality Rat per 1,000 Live Births | Cardiovascular Death Rate per 100K | Cancer Death Rate Per 100K | Percent of Pop With Bachelor's degree or more | PercentRural | Percent18to24 | ID | Binary for assault rate where 0= less than 200 assaults per 100k and 1=more than 200 assaults per 100k | Murder rate categorical |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Nevada | 178 | 170 | 0 | 2 | 20.6 | 0.800 | 1 | 343.7 | 22 | 2701 | 6.7 | 9.0 | 12.4 | 53341 | 20.8 | 5.9 | 228.0 | 432.1 | 835.7 | 468.6 | 6.4 | 200.0 | 180.2 | 21.8 | 6.293965 | 6.590152 | 29 | 1 | 1 |
Kansas | 204 | 768 | 0 | 3 | 17.6 | 0.790 | 0 | 262.7 | 18 | 2853 | 3.7 | 9.0 | 13.4 | 47817 | 13.3 | 4.7 | 66.7 | 297.9 | 690.0 | 218.2 | 7.9 | 178.7 | 180.0 | 29.5 | 26.919033 | 7.150368 | 17 | 1 | 1 |
Michigan | 669 | 2519 | 0 | 3 | 14.4 | 2.000 | 0 | 281.9 | 20 | 9884 | 3.3 | 11.6 | 16.2 | 45255 | 13.8 | 6.3 | 126.5 | 326.5 | 768.1 | 297.7 | 7.9 | 221.5 | 187.3 | 24.6 | 25.485633 | 6.768515 | 23 | 1 | 2 |
Oregon | 253 | 727 | 0 | 2 | 17.3 | 1.180 | 1 | 263.3 | 16 | 3831 | 3.9 | 9.8 | 14.3 | 48457 | 17.7 | 2.3 | 65.3 | 162.3 | 513.0 | 261.5 | 5.8 | 156.9 | 179.3 | 29.2 | 18.976768 | 6.604020 | 38 | 0 | 0 |
Utah | 227 | 263 | 0 | 2 | 16.4 | 0.695 | 1 | 138.3 | 9 | 2764 | 3.6 | 7.8 | 11.5 | 55117 | 14.8 | 1.4 | 47.3 | 133.8 | 548.7 | 251.1 | 5.1 | 152.1 | 128.8 | 28.5 | 9.515195 | 8.212735 | 45 | 0 | 0 |
Florida | 1229 | 1712 | 1 | 1 | 15.9 | 1.339 | 1 | 258.8 | 17 | 18801 | 4.2 | 10.7 | 14.9 | 44736 | 22.4 | 5.5 | 166.8 | 410.6 | 981.2 | 271.2 | 7.1 | 162.4 | 166.6 | 25.3 | 9.105899 | 6.536886 | 10 | 1 | 1 |
#Alternative way to sample half the rows (n=25)
# StatesDataRdm25 <- StatesData %>% sample_frac(0.5, replace = FALSE)
- You should now see the new StatesDataRdm25 data object in your Environment. It should contain 25 rows (states) with 30 columns (variables). The first state should be Nevada and the last should be Delaware.
- We used the head() function (piped to gt()) again to print the first five rows so you can compare your output.
- The original StatesData data object should also still be in the R Environment and should still contain all 50 states.

Goal: Nonrandomly Select Rows using Conditions
While a random (or probability) sample is essential for drawing unbiased statistical inferences from a sample to a population, there may be times that you want to non-randomly select certain rows with specific characteristics from a data object rather than selecting rows randomly. So, we will also show you how to select a specific subset of rows (states) in the data, such as the first 25 rows or all rows from states starting with a particular letter.
- For example, imagine we want to select only those states whose names begin with the letter “A” from the StatesData object. One way you could do this is to combine dplyr::filter() with the %in% operator. Let’s try it.
- In a new R chunk, type StateswA <- StatesData %>% and hit enter. Then, type dplyr::filter(State %in% c("Alabama", "Alaska", "Arizona", "Arkansas")) and hit run. A data object called StateswA should appear in your Environment.
- Be sure to create StateswA using our StatesData dataset. Note: Do not use the StatesDataRdm25 dataset, because this only includes 50% of the states!
- The dplyr::filter()
function generates a subset of a
dataframe by retaining only the rows that match a condition you specify
(e.g., matching column values). You might recall that we used
dplyr::filter()
in the previous assignment to select only
rows that contained valid non-missing response values on a variable
(with filter(!is.na(var))
).
- The %in%
operator tells R that we want to keep only
those rows in the StatesData
dataset where the response
values are “in” (or are equal to) the values we specified - in this
case, “Alabama”, “Alaska”, “Arizona”, and “Arkansas”. If we had
specified only one state, we could have used a double equal sign instead
(e.g., State == "Alabama"
).
- You might also notice that we combined our list of state names using c(). This is a very handy and widely used base R function that combines multiple elements into a value or a list.
- Essentially, the format newdata <- olddata %>% filter(var %in% c(value1, value2))
informs R to filter the old data object by the “var” column, keeping
only those rows where “var” is equal to “value1” or “value2”, then
assigning this subset to a new data object.
#preferred way to choose specific rows, or states
StateswA <- StatesData %>%
dplyr::filter(State %in% c("Alabama", "Alaska", "Arizona", "Arkansas")
)
StateswA %>% gt()
State | Number1824 | NumberRural | State in South | Four regional categories | Percent in different house in 2008 | Cigarette Tax in pennies | Smoke Free workplace laws Implemented | Tobacco related death rate per 100K | Percent of Adults who smoke | total population in Thousands | Divorces per 1K population | Percent of Families below poverty | Percent Individuals below poverty | Median Income for Households | Percent of Population without health insurance | Murder Rate per 100K | Robbery Rate per 100K | Assault Rate per 100K | Burglary Rate per 100K | Motor Vehicle Rate per 100K | Infant Mortality Rat per 1,000 Live Births | Cardiovascular Death Rate per 100K | Cancer Death Rate Per 100K | Percent of Pop With Bachelor's degree or more | PercentRural | Percent18to24 | ID | Binary for assault rate where 0= less than 200 assaults per 100k and 1=more than 200 assaults per 100k | Murder rate categorical |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Alabama | 335 | 1981 | 1 | 1 | 15.1 | 0.425 | 0 | 317.5 | 22 | 4780 | 4.4 | 13.4 | 17.5 | 40489 | 16.9 | 7.1 | 142.5 | 277.5 | 1058.9 | 244.8 | 9.9 | 235.5 | 197.3 | 22.0 | 41.443515 | 7.008368 | 1 | 1 | 2 |
Alaska | 54 | 216 | 0 | 2 | 21.5 | 2.000 | 0 | 270.4 | 22 | 710 | 4.4 | 6.2 | 9.0 | 66953 | 17.7 | 3.2 | 94.0 | 462.0 | 514.2 | 241.5 | 6.5 | 147.9 | 179.9 | 26.6 | 30.422535 | 7.605634 | 2 | 1 | 1 |
Arizona | 443 | 607 | 0 | 2 | 19.7 | 2.000 | 1 | 247.4 | 16 | 6392 | 3.5 | 11.6 | 16.5 | 48745 | 19.6 | 5.5 | 123.9 | 261.1 | 817.3 | 397.1 | 6.8 | 152.5 | 152.8 | 25.6 | 9.496245 | 6.930538 | 3 | 1 | 1 |
Arkansas | 200 | 1269 | 1 | 1 | 17.6 | 1.150 | 1 | 323.7 | 22 | 2916 | 5.7 | 14.8 | 18.8 | 37823 | 19.2 | 6.3 | 93.5 | 381.8 | 1224.1 | 215.6 | 7.7 | 221.8 | 200.4 | 18.9 | 43.518519 | 6.858711 | 4 | 1 | 2 |
- Now, let’s try a slightly different way to nonrandomly select rows using the %in% operator and the filter() and c() functions. We will start by assigning the column values (state names) to an object or, more precisely, to a vector of values.
- In a new R chunk, type Cstates <- c("California", "Colorado", "Connecticut").
- Notice that we assigned our list of state names to an object called Cstates. This basically creates a list or vector of values (not a data object) that will be assigned to an object. You can then find this object in your Environment tab under “Values” (below “Data” objects), and you can call this object for use in later code.
- If you type Cstates and click “Run” to view the object, RStudio should output the list you assigned:
Cstates <- c("California", "Colorado", "Connecticut")
Cstates
## [1] "California" "Colorado" "Connecticut"
- Next, we can call this Cstates object within our filter() code from above.
- In a new R chunk, type StateswC <- StatesData %>% filter(State %in% Cstates). Then view the new data object by typing StateswC.
#Filter data using object containing list/vector of row values (states)
StateswC <- StatesData %>%
filter(State %in% Cstates)
StateswC %>% gt()
State | Number1824 | NumberRural | State in South | Four regional categories | Percent in different house in 2008 | Cigarette Tax in pennies | Smoke Free workplace laws Implemented | Tobacco related death rate per 100K | Percent of Adults who smoke | total population in Thousands | Divorces per 1K population | Percent of Families below poverty | Percent Individuals below poverty | Median Income for Households | Percent of Population without health insurance | Murder Rate per 100K | Robbery Rate per 100K | Assault Rate per 100K | Burglary Rate per 100K | Motor Vehicle Rate per 100K | Infant Mortality Rat per 1,000 Live Births | Cardiovascular Death Rate per 100K | Cancer Death Rate Per 100K | Percent of Pop With Bachelor's degree or more | PercentRural | Percent18to24 | ID | Binary for assault rate where 0= less than 200 assaults per 100k and 1=more than 200 assaults per 100k | Murder rate categorical |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
California | 2766 | 1882 | 0 | 2 | 15.6 | 0.87 | 0 | 235.0 | 14 | 37254 | 0.0 | 10.6 | 14.2 | 58931 | 20.0 | 5.4 | 173.7 | 270.8 | 622.1 | 443.6 | 5.2 | 177.9 | 161.7 | 29.9 | 5.051807 | 7.424706 | 5 | 1 | 1 |
Colorado | 349 | 668 | 0 | 2 | 18.2 | 0.84 | 1 | 237.6 | 18 | 5029 | 4.2 | 8.9 | 12.9 | 55430 | 15.3 | 3.2 | 67.9 | 224.5 | 532.5 | 250.6 | 6.1 | 145.3 | 153.7 | 35.9 | 13.282959 | 6.939749 | 6 | 1 | 1 |
Connecticut | 228 | 418 | 0 | 4 | 11.0 | 3.00 | 0 | 238.3 | 16 | 3574 | 3.1 | 6.7 | 9.4 | 67034 | 12.0 | 3.0 | 113.6 | 165.2 | 431.1 | 212.0 | 6.6 | 171.0 | 170.7 | 35.6 | 11.695579 | 6.379407 | 7 | 0 | 0 |
Goal: Simulate Data
In addition to selecting subsets from real-world data, there may be times that you want to simulate fake data. Social scientists simulate data for many reasons, such as to check model assumptions, mimic real-world scenarios, or even generate more accurate interval estimates around sample statistics from real data (e.g., bootstrapped confidence intervals or Bayesian credible intervals). In fact, some even claim that to truly understand statistical modeling and inference, one must first know how to effectively simulate data.
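For example, here is a minimal sketch of a percentile bootstrap (our own illustration, not something you need for this assignment): we resample a variable with replacement many times and use the middle 95% of the resampled means as a rough interval estimate. The vector x below is a hypothetical stand-in for any numeric sample.
# Our own minimal percentile-bootstrap sketch (x is a hypothetical numeric sample)
set.seed(8675309)
x <- rnorm(25, mean = 100, sd = 50)             # stand-in sample data
boot_means <- replicate(5000, mean(sample(x, size = length(x), replace = TRUE)))
quantile(boot_means, probs = c(0.025, 0.975))   # rough 95% bootstrap interval for the mean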
One of the most challenging aspects of data simulation is choosing an appropriate probability distribution (or distributions) from which to randomly sample data. Here, we will introduce you to simple methods for generating simulated datasets from some common probability distributions, including standard normal, normal, truncated normal, and uniform distributions. We recommend checking out this site for additional reading and other examples of simulating from normal, uniform, binomial, and Poisson distributions in R.
- In a new R chunk, set your random seed to 1234 again by typing set.seed(1234) and hitting enter.
- On the next line, type rnorm(n = 50, mean=0, sd=1) and run your code.
- rnorm() draws “n” random numbers from a normal distribution with a specified mean and standard deviation.
- Here, we used the rnorm() function to draw 50 random numbers from a (standard) normal distribution that has a mean=0 and sd=1.
# simulation of n=50 values from a random standard normal dist
set.seed(1234)
rnorm(n = 50,
mean = 0,
sd = 1)
## [1] -1.20706575 0.27742924 1.08444118 -2.34569770 0.42912469 0.50605589
## [7] -0.57473996 -0.54663186 -0.56445200 -0.89003783 -0.47719270 -0.99838644
## [13] -0.77625389 0.06445882 0.95949406 -0.11028549 -0.51100951 -0.91119542
## [19] -0.83717168 2.41583518 0.13408822 -0.49068590 -0.44054787 0.45958944
## [25] -0.69372025 -1.44820491 0.57475572 -1.02365572 -0.01513830 -0.93594860
## [31] 1.10229755 -0.47559308 -0.70944004 -0.50125806 -1.62909347 -1.16761926
## [37] -2.18003965 -1.34099319 -0.29429386 -0.46589754 1.44949627 -1.06864272
## [43] -0.85536463 -0.28062300 -0.99434008 -0.96851432 -1.10731819 -1.25198589
## [49] -0.52382812 -0.49684996
# simulation of n=50 values from a random standard normal dist, assigned to an object
set.seed(1234)
simznorm <- rnorm(n = 50,
mean = 0,
sd = 1) %>% as_tibble()
Mean of z-normal simulated sample, n=50
mean(simznorm$value)
## [1] -0.453053
SD of z-normal simulated sample, n=50
sd(simznorm$value)
## [1] 0.8850435
simznorm %>% ggplot(aes(x=value)) + geom_histogram(fill="#21918c", bins=5)
Notice the data series are approximately normally distributed with a mean that is somewhat close to “0” (-0.45) and SD close to “1” (0.89). However, as a random draw from the standard normal distribution, they are not expected to exactly match these parameter values or perfect bell-curved shape. This is especially true when sampling a small number of values from a heterogeneous population.
We illustrate this sampling variability in the figure below, which was generated by drawing nine samples of n=50 values each from the standard normal distribution (with the same starting seed) and then compiling their histograms into a single figure using the “patchwork” package. For now, do not worry about the code - we will walk you through much of this process soon enough. Instead, just notice how different each of these histograms appear to be despite the fact that they each represent a random sample drawn from the same population (i.e., a standard normal probability distribution).
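If you would like a quick numeric (rather than graphical) sense of this variability, a short sketch like the one below (our own, not the code behind the figure) draws nine samples of n=50 from the standard normal distribution and prints each sample's mean and SD:
# Our own quick check of sampling variability (not the code used for the figure)
set.seed(1234)
sim9 <- replicate(9, rnorm(50))    # a 50 x 9 matrix: nine samples of n=50
round(apply(sim9, 2, mean), 2)     # nine sample means, near 0 but not equal to it
round(apply(sim9, 2, sd), 2)       # nine sample SDs, near 1 but not equal to it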
- Try changing n=50 in the rnorm() function to a larger sample size - say, to n=5000 instead. With a much larger sample, the simulated mean and SD should land much closer to the specified values of 0 and 1.
Mean of z-normal simulated sample V1, n=5,000
## [1] -0.006354696
SD of z-normal simulated sample V1, n=5,000
sd(simznorm$V1)
## [1] 0.991345
Sample mean for RobberyRt (n=25 states)
## [1] 99.424
Sample sd for RobberyRt (n=25 states)
#sample sd of RobberyRt
sd(StatesDataRdm25$RobberyRt)
## [1] 57.07451
funhist(StatesDataRdm25, RobberyRt, 5, 20, "Robbery Rate (Random Sample of States Data, n=25)")
- Now, let’s simulate data that look more like our sample “RobberyRt” values by drawing from a normal distribution with our sample mean and standard deviation.
- In a new R chunk, set your random seed to 1234 again.
- On the next line, type rnorm(n = 50, then hit enter.
- Type mean=mean(StatesDataRdm25$RobberyRt), then hit enter.
- Type sd=sd(StatesDataRdm25$RobberyRt)) and then run your code.
# simulation of n=50 RobberyRt-like values from a random normal dist w/sample mean & sd
set.seed(1234)
rnorm(n = 50,
mean = mean(StatesDataRdm25$RobberyRt),
sd = sd(StatesDataRdm25$RobberyRt))
## [1] 30.531317 115.258137 161.317946 -34.455541 123.916080 128.306891
## [7] 66.621000 68.225256 67.208180 48.625529 72.188462 42.441585
## [13] 55.119691 103.102955 154.186651 93.129510 70.258384 47.417970
## [19] 51.642839 237.306603 107.077019 71.418344 74.279947 125.654841
## [25] 59.830259 16.768418 132.227900 40.999354 98.559989 46.005195
## [31] 162.337090 72.279759 58.933059 70.814943 6.444293 32.782706
## [37] -25.000689 22.887474 82.627323 72.833127 182.153285 38.431743
## [43] 50.604485 83.407580 42.672530 44.146522 36.224360 27.967522
## [49] 69.526768 71.066533
The simulated data we generated using rnorm()
and the mean/sd of our random
sample (printed output above and plotted below) reveals a set of values
that are much more comparable to the RobberyRt values observed in our
StatesData, with one glaring exception: Our simulated data contain some
negative values, whereas the minimum value in our RobberyRt sample (and
population) data equals “0”. Of course, this is because a state cannot
validly report a negative robbery rate - that is, a state cannot have
less than 0 robberies in a year!

Ideally, we want to simulate data from a probability distribution
that closely matches the data-generating processes underlying RobberyRt
values. So, at the very least, we want to draw from a probability
distribution that does not include any negative values. Since the normal
distribution extends from negative infinity to infinity, there is always
a chance that we will draw one or more negative - and hence invalid (as
RobberyRt values cannot be negative) - simulated values using
rnorm()
. As such, this distribution is not a good fit to
the data generating process that we are trying to simulate.
Instead, we might consider randomly sampling from a truncated distribution, which is a probability distribution that is limited to or excludes certain values. For instance, we could draw from a truncated normal distribution that excludes (rejects) any values below a minimum of “0”.
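As a quick sanity check (our own, not an assignment step), you can verify that draws from a normal distribution truncated at a lower limit of a = 0 are never negative:
# Our own quick check: values drawn with a lower limit of a = 0 are never negative
library(truncnorm)   # already loaded in your setup chunk
min(rtruncnorm(100000, a = 0, mean = 100, sd = 60))   # minimum draw is >= 0 by construction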
- In a new R chunk, set your random seed to 1234 again.
- On the next line, type rtruncnorm(n = 50, then hit enter.
- Type a = 0, then hit enter.
- This a = 0 line is the key addition to our earlier rnorm() code. With rtruncnorm() we can specify lower (minimum; a=) and upper (maximum; b=) sampling limits to define the accept/reject sampling region. In our case, we typed a=0 to specify a minimum value of “0”, which means that any negative values (<0) will be rejected (skipped) when drawing random numbers from the normal distribution.
- Type mean=mean(StatesDataRdm25$RobberyRt), then hit enter.
- Type sd=sd(StatesDataRdm25$RobberyRt)) and then run your code.
set.seed(1234)
rtruncnorm(n = 50,
a = 0,
mean = mean(StatesDataRdm25$RobberyRt),
sd = sd(StatesDataRdm25$RobberyRt))
## [1] 30.531317 115.258137 161.317946 123.916080 128.306891 66.621000
## [7] 68.225256 67.208180 48.625529 72.188462 42.441585 55.119691
## [13] 103.102955 154.186651 93.129510 70.258384 47.417970 51.642839
## [19] 237.306603 107.077019 71.418344 74.279947 125.654841 59.830259
## [25] 16.768418 132.227900 40.999354 98.559989 46.005195 162.337090
## [31] 72.279759 58.933059 70.814943 6.444293 32.782706 22.887474
## [37] 82.627323 72.833127 182.153285 38.431743 50.604485 83.407580
## [43] 42.672530 44.146522 36.224360 27.967522 69.526768 71.066533
## [49] 66.202303 36.134671
If you look carefully, these values are nearly identical to those
generated using rnorm()
. This is because in both instances
we set the same random seed (set.seed(1234)
) before drawing
(pseudo)random values from a normal probability distribution with the
same mean and standard deviation. The key difference is that the 4th and 37th values were both negative (-34.455541 & -25.000689, respectively) in our simulated sample generated earlier using rnorm(). Each of these negative values was rejected by the random sampler and skipped in generating our new simulated sample using rtruncnorm(). Likewise, the very last two values (66.202303 & 36.134671) in the rtruncnorm() simulated sample are
simulated sample are
new, as these values essentially were next in line (from our seed using
the random number generator) to replace the two rejected negative
values.
Below, we present the plot of our real random sample data alongside the plot of our simulated sample data from a truncated normal distribution. Compared to our previous attempts, it is more plausible that these simulated values were drawn randomly from the same underlying generative distribution as our sample “RobberyRt” values.
Before you learn to replicate this simulation process multiple times and create your own patchwork plots, let’s consider one alternative probability distribution - the uniform or “flat” distribution. In a uniform distribution, the probability of selecting any one value between a specified minimum and maximum limit is equal to the probability of selecting any other value. Put differently, every value within the set limits of a uniform distribution is equally likely to be chosen at random.
Due to this equal probability feature, a uniform distribution generates random samples that, over the long run (i.e., with large samples), resemble uniform or “flat” distributions. Contrast this shape with the peaked (rather than flat) standard normal distribution, where the probability of selecting “0” (i.e., the mean) is higher than that of any other value in the distribution and where the probability of selecting any given value near “0” is much higher than the probability of selecting any given value that is multiple standard deviations away from the mean of “0”.
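You can see this “flat versus peaked” contrast directly from the density functions; here is a tiny sketch of our own (the specific limits are arbitrary):
# Equal densities everywhere inside the limits of a uniform(-5, 5) distribution
dunif(c(-4, 0, 4), min = -5, max = 5)   # all equal to 0.1
# Compare: the standard normal density is peaked at 0 and tiny at +/- 4
dnorm(c(-4, 0, 4))                      # ~0.0001, 0.399, 0.0001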
To illustrate these differences, below we have plotted large simulated samples (n=50,000 each) drawn from the standard or z-normal (mean=0,sd=1), truncated normal (mean=1, sd=1), and uniform (min=-5,max=5) distributions.
At first glance, the flat shape of a uniform distribution does not look like a good fit for sampling state-level robbery rates. However, as with sampling from the normal distribution, small samples (e.g., n=50) drawn from a uniform distribution will be subject to greater sampling variability, so their plots are likely to be “noisier” and not as flat as the one shown above.
Moreover, rather than stacking the mass of our generative
probability distribution around the observed mean from a small, noisy
sample like we do with rtruncnorm()
, we might instead think
that it is reasonable to expect a state to have any robbery
rate values between “0” and some plausible upper bound - say 300
robberies per 100k people; we might also think that states are equally
likely to have any value between “0” and our specified maximum of “300”
robberies per 100k. If so, then the uniform probability distribution
might be an appropriate alternative starting point for simulating
state-level robbery rate data.
- Let’s try simulating robbery rate data from a uniform distribution using runif().
- In a new R chunk, set your random seed to 1234 again.
- On the next line, type runif(n = 50, then hit enter.
- Type min = 0, then hit enter.
- Type max = 300) and then run your code.
- Other than the distribution function, this follows the same basic structure as our rnorm() and rtruncnorm() code above. Here, we specify the sampling region for our uniform distribution between limits of min=0 (0 robberies per 100k) and max=300 (300 robberies per 100k). In a uniform probability distribution, all values within those limits have an equal probability of selection.
set.seed(1234)
runif(n = 50,
min = 0,
max = 300)
## [1] 34.111023 186.689821 182.782420 187.013833 258.274615 192.093182
## [7] 2.848727 69.765152 199.825127 154.275342 208.077388 163.492451
## [13] 84.820075 277.030045 87.694752 251.188688 85.866985 80.046234
## [19] 56.016837 69.667773 94.983736 90.808011 47.713801 11.998775
## [25] 65.639862 243.179566 157.709264 274.397450 249.403514 13.731079
## [31] 136.827445 79.556002 91.401661 152.192061 54.328862 227.901191
## [37] 60.374411 77.642946 297.645125 242.205702 166.000077 193.921828
## [43] 93.547292 186.545759 98.931053 150.599242 203.128358 145.497372
## [49] 73.178648 229.637936
- Next, repeat your runif() code, but this time add simrunif <- before runif(n = 50, to assign your results to an object called “simrunif”.
- Add a pipe (%>%) immediately after max = 300) and hit enter.
- On the next line, type as_tibble().
- Finally, on a new line, type simrunif %>% gt() to view your simulated values in a gt() table. (Note: We also piped to head(); this keeps the output brief while allowing you to compare your output to our first five rows.)
simrunif <- runif(n = 50,
min = 0,
max = 300) %>%
as_tibble()
simrunif %>% head() %>% gt()
value |
---|
22.13396 |
92.90598 |
215.18152 |
151.36377 |
45.89969 |
151.18005 |
- Now, let’s plot our simulated data. In a new R chunk, type simrunifplot <- simrunif %>% then hit enter.
- Type ggplot(aes(x=value)) + then hit enter.
- Be sure to include the + at the end, as we need to add a geometric object to our blank ggplot coordinate canvas.
- Type coord_cartesian(expand = FALSE, ylim=c(0,30), xlim=c(0,300)) + then hit enter.
- The coord_cartesian() function allows us to modify the XY coordinates in our ggplot objects, such as zooming in or out of the plot, without changing the data.
- Type geom_histogram(fill="#21918c", bins=5) + then hit enter.
- Recall that bins= sets the number of bins into which the data are sorted. A larger number, like bins=30, would divide the values from “0” to “300” into 30 bins, into which our small sample of n=50 values would then be sorted and plotted. Selecting a large number of bins displays more fine-grained frequency distinctions (akin to a frequency table at the extreme), whereas selecting a smaller number of bins can show clustering in grouped frequency regions. Try re-running the plot a few times with different bin numbers to see what we mean.
- Finally, type xlab("Simulated data (runif[0,300], n=50)") and run your code.
- On the next line, type simrunifplot and click run to generate and then view your plot object.
simrunifplot <- simrunif %>%
ggplot(aes(x=value)) +
coord_cartesian(expand = FALSE, ylim=c(0,30), xlim=c(0,310)) +
geom_histogram(fill="#21918c", bins=5) +
xlab("Simulated data (runif[0,300], n=50)")
simrunifplot
- Next, create a similar histogram of the observed “RobberyRt” values from your random sample by copying and modifying your previous plot code, as follows:
- Change simrunifplot to sampleplot.
- Change simrunif to StatesDataRdm25.
- Change x=value to x=RobberyRt.
- Change ylim=c(0,30) to ylim=c(0,15).
- Because our random sample (n=25) is half the size of our simulated sample (n=50), we also cut the y-axis limit (ylim) in half (from 30 to 15).
- The geom_histogram() line can remain unchanged.
- Change the x-axis label from "Simulated data (runif[0,300], n=50)" to "RobberyRt (random sample, n=25)".
sampleplot <- StatesDataRdm25 %>%
ggplot(aes(x=RobberyRt)) +
coord_cartesian(expand = FALSE, ylim=c(0,15), xlim=c(0,310)) +
geom_histogram(fill="#21918c", bins=5) +
xlab("Sample RobberyRt (n=25)")
sampleplot
- Now, to combine both plots into a single figure with the patchwork package, simply type simrunifplot + sampleplot and hit run!
simrunifplot + sampleplot
- With patchwork, we can further customize the combined figure, such as by adding (+) a plot_spacer(), specifying the plot layout by adding a plot_layout(), or including a title by adding plot_annotation(title="My Title"). We can also use / to indicate which plot(s) we want on top and which we want on bottom. For example, click the code box below to see two different ways to generate the same combined plot using patchwork:
(simrunifplot + plot_spacer()) / (plot_spacer() + sampleplot) +
plot_annotation(
title = 'Histograms of RobberyRt (random sample, n=25) & Simulated Data (uniform(0,300), n=50)')
# Alternative way to generate same plot
# simrunifplot + plot_spacer() + plot_spacer() + sampleplot +
# plot_layout(ncol = 2) +
# plot_annotation(
# title = 'Histograms of RobberyRt (random sample, n=25) & Simulated Data (uniform(0,300), n=50)')
Finally, we could use the replicate()
function to
generate multiple simulated samples, plot them all, and combine all the
plots with patchwork. This is essentially what we did earlier to
generate the large figures with numerous plots. However, since we were
creating many similar plots with minor differences between them, we also
wrote a simple function (see code for funhist()
function
below) to generate custom histograms with
ggplot()
.
Below, we illustrate how to:
- use replicate() in conjunction with your earlier code to generate twelve simulated data sets from the uniform(0,300) distribution
- use a custom function (funhist()) to plot each simulated dataset in addition to plotting the real “RobberyRt” distributions from the population and our random sample
- combine all of the plots into a single figure with patchwork

Note: You do not need to do this step, but feel free to follow along if you want!
# Replicate 12 simulated samples from uniform(0,300) distribution
set.seed(1234)
simrunif12robrt <- replicate(n = 12,
expr = runif(n = 50,
min = 0,
max = 300),
simplify = TRUE) %>%
as_tibble()
# Alt code to simulate n=50 RobberyRt values from gamma(2,60) distribution
# set.seed(1234)
# simrgam12robrt <- replicate(n = 12,
# expr = rgamma(n = 50,
# shape = 2,
# scale = 60),
# simplify = TRUE) %>%
# as_tibble()
#create function to quickly plot custom histograms
#function inputs: data; variable to plot; number of bins for geom_histogram; y-axis max; x-axis label
funhist = function(dat,var,numbins,ymax,x_label){
dat %>%
ggplot(aes(x={{var}})) +
geom_histogram(fill="#21918c", bins=numbins) +
coord_cartesian(expand = FALSE, ylim=c(0,ymax), xlim=c(0,300)) +
xlab(x_label)
}
histpop <- funhist(StatesData, RobberyRt, 5, 30, "Pop RobRt (n=50)")
histsamp <- funhist(StatesDataRdm25, RobberyRt, 5, 15, "Samp RobRt (n=25)")
histv1runif <- funhist(simrunif12robrt, V1, 5, 30, "Sim1 (n=50)")
histv2runif <- funhist(simrunif12robrt, V2, 5, 30, "Sim2 (n=50)")
histv3runif <- funhist(simrunif12robrt, V3, 5, 30, "Sim3 (n=50)")
histv4runif <- funhist(simrunif12robrt, V4, 5, 30, "Sim4 (n=50)")
histv5runif <- funhist(simrunif12robrt, V5, 5, 30, "Sim5 (n=50)")
histv6runif <- funhist(simrunif12robrt, V6, 5, 30, "Sim6 (n=50)")
histv7runif <- funhist(simrunif12robrt, V7, 5, 30, "Sim7 (n=50)")
histv8runif <- funhist(simrunif12robrt, V8, 5, 30, "Sim8 (n=50)")
histv9runif <- funhist(simrunif12robrt, V9, 5, 30, "Sim9 (n=50)")
histv10runif <- funhist(simrunif12robrt, V10, 5, 30, "Sim10 (n=50)")
histv11runif <- funhist(simrunif12robrt, V11, 5, 30, "Sim11 (n=50)")
histv12runif <- funhist(simrunif12robrt, V12, 5, 30, "Sim12 (n=50)")
plot_spacer() + histpop + histsamp + plot_spacer() +
histv1runif + histv2runif + histv3runif +
histv4runif + histv5runif + histv6runif +
histv7runif + histv8runif + histv9runif +
histv10runif + histv11runif + histv12runif +
plot_layout(ncol = 4) +
plot_annotation(
title = 'Histograms of Robbery Rate: Population, Sample, & 12 Simulated Sets (Unif(0,300))' )
Goal: Estimate Confidence Intervals around ‘RobberyRt’ Sample Mean
After our detour into data simulation from probability distributions, we hope that “sampling variability” makes a little more sense to you now. Understanding sampling variability is essential to truly comprehending confidence intervals. Interval estimates are one way that we attempt to quantify and communicate the uncertainty caused by sampling variability in our statistical estimates of population parameters. After all, if our random samples from a population do not vary at all (imagine sampling from a population of identical units), then we would not have any uncertainty in our estimates and would not need to generate intervals to try to capture the true population parameters!
For the last two sections of the assignment, you will answer questions 3 and 5 from the SPSS exercises at the end of Chapter 7. In this section, we will show you how to estimate a 95% CI around the mean of “RobberyRt” in our randomly selected sample of 25 states. You will then follow this process on your own to estimate a 95% CI around the mean of “AssaultRt” in our random sample. Then, in the last section, we will show you how to estimate a 90% CI around a sample proportion calculated from the “AssaultRt” variable in our random sample. You will then follow these same procedures on your own to estimate 95% and 99% CIs around the same “AssaultRt” proportion.
- First, calculate the sample mean and standard deviation with mean(data$var) and sd(data$var) and assign these sample values for “RobberyRt” to objects. Remember, you will do this again for “AssaultRt” at the end of this section. We will also assign the sample size (n <- 25) to the object n; since our sample stays the same, we can use this object for this section and the next section.
- Remember, the StatesDataRdm25 data object contains only the 25 randomly selected states. We are practicing interval estimation of population parameters from a random sample, so do not use the full dataset (StatesData)!
Sample Mean of RobberyRt (n=25 states)
sampmeanRobRt <- mean(StatesDataRdm25$RobberyRt)
sampmeanRobRt
## [1] 99.424
Sample SD of RobberyRt (n=25 states)
sampsdRobRt <- sd(StatesDataRdm25$RobberyRt)
sampsdRobRt
## [1] 57.07451
Sample Size
n <- 25
n
## [1] 25
- Next, find the t critical value associated with a two-tailed 95% confidence interval and df = n-1 = 24. You can do this with the qt() function by typing: tcrit <- qt(p = .05/2, df = n-1, lower.tail = FALSE). Then, type tcrit to output the value and check to make sure it is the same (t=2.064)! (Note: We also rounded the output to show the book’s value and the R value is the same.)
- qt() generates specific “quantiles” from the t distribution. With a two-tailed 95% CI, we want to specify 0.05 (i.e., 5%) of samples to fall outside in the tails or “critical region” of the t distribution.
- Wondering why we divided p = .05 by 2 to get that same t critical value from the book? We know this can be confusing. Try to remember that, for a two-tailed confidence interval, you need to divide the alpha level by two.
t critical value for two-tailed test, alpha=.05, df=24
# Determine t critical value for alpha = .05 (to estimate 95% CI)
# For two-tailed test, divide alpha level by two (for both tails)!
tcrit <- qt(p = .05/2, df = n-1, lower.tail = FALSE)
tcrit
## [1] 2.063899
t critical value, rounded to 3 digits to compare with
book
round(tcrit, 3)
## [1] 2.064
- Now, estimate the standard error of the sample mean. Recall, the formula for the standard error of the mean is: se = sd/sqrt(n).
- In an R chunk, type sampseRobRt <- (sampsdRobRt/sqrt(n)). Recall that you can view this value after assigning it to an object by retyping it (sampseRobRt) afterwards on its own line of code.
- Here, n is equal to the number of observations, which is 25 (i.e., 25 randomly selected states).
- Likewise, sampsdRobRt is the standard deviation of the “RobberyRt” variable in our random sample.
- Next, calculate the 95% margin of error by typing marg95RobRt <- tcrit*sampseRobRt.
- Recall, tcrit is our t critical value and sampseRobRt is the estimated standard error of the mean of “RobberyRt” from our sample.
- Note: We could instead calculate the margin of error in a single step with the qt() function (see the commented-out alternative in the code below). In that approach, we request the 0.975 quantile, which represents the proportion of samples that we want to fall within our estimated interval. You might notice that this quantile value is equivalent to 1 - alpha/2.
Sample Std Error of Mean of RobberyRt & 95% margin of
error
sampseRobRt <- (sampsdRobRt/sqrt(n))
sampseRobRt
## [1] 11.4149
marg95RobRt <- tcrit*sampseRobRt
marg95RobRt
## [1] 23.5592
#Alternative method for calculating 95% margin of error in R using qt()
# qt(0.975,df=n-1)*sampsdRobRt/sqrt(n)
- Finally, calculate the lower and upper limits of the 95% confidence interval by subtracting the margin of error from, and adding it to, the sample mean. In an R chunk, type RobRt95CIlow <- sampmeanRobRt - marg95RobRt. Hit enter and type RobRt95CIhi <- sampmeanRobRt + marg95RobRt. Type each of the object names on subsequent lines to print their output values and run your code.
Lower bound of 95% confidence interval around sample mean
RobRt95CIlow <- sampmeanRobRt - marg95RobRt
RobRt95CIlow
## [1] 75.8648
Upper bound of 95% confidence interval around sample
mean
RobRt95CIhi <- sampmeanRobRt + marg95RobRt
RobRt95CIhi
## [1] 122.9832
95% CI Estimate around Sample Mean of RobberyRt
cat("[",RobRt95CIlow,",",RobRt95CIhi,"]")
## [ 75.8648 , 122.9832 ]
- Since we have data for the entire population, let’s also calculate the true population mean of “RobberyRt” from StatesData, then compare it to our 95% CI estimate around the sample mean.
Population Mean of RobberyRt
mean(StatesData$RobberyRt)
## [1] 104.584
Goal: Understand how to appropriately interpret a frequentist confidence interval
Now that you have estimated a 95% confidence interval around the sample mean of “RobberyRt”, we want to discuss how to appropriately interpret these intervals before we estimate any more of them.
You may recall us mentioning earlier that frequentist confidence intervals are routinely misinterpreted. So, how do we appropriately interpret them? Well, we are glad you asked! Here is an appropriate interpretation of the 95% CI that we just estimated:
CORRECT INTERPRETATION: We are 95% confident that the true population mean robbery rate is between 75.86 and 122.98. By 95% confident, we mean that over the long run (and assuming our research design and statistical model are sound), the same procedures would result in estimates and confidence intervals that capture the true population mean in 95% of trials.
In addition to knowing the correct interpretation, it is worth recognizing a few common misinterpretations. Though the distinctions at times appear to be quite subtle, each of the following is technically a misinterpretation:

- “There is a 95% probability that the true population mean falls within this particular interval.” (The population mean is a fixed value; it is the interval that varies from sample to sample.)
- “95% of all sample means will fall within this interval.”
- “95% of the robbery rates in the population fall within this interval.”
We illustrate sampling variability - and CI variability - in the figure below, in which we plotted the mean and 95% CI from our sample (sample #1 at the bottom) along with means and 95% CIs from eight other samples of n=25 states each randomly drawn from “StatesData” using a different random starting seed. As you can see, all nine (ours + 8 new samples) generated different point estimates (dots = sample means) and interval estimates (error bars = 95% CIs around sample means). In this particular case, all nine of the estimated 95% confidence intervals managed to capture the true population parameter (vertical line = population mean of “RobberyRt”).
#generate three more random samples of n=25 states from StatesData
set.seed(1)
samp2 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(21)
samp3 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(31)
samp4 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(41)
samp5 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(51)
samp6 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(61)
samp7 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(71)
samp8 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(81)
samp9 <- StatesData %>% sample_n(25, replace = FALSE)
#create function to calculate the 95% margin of error (t critical value * stnd error of mean)
fun95se = function(dat){
(qt(0.975,df=24)*sd({{dat}}$RobberyRt)/sqrt(25))
}
#create tibble with sample, mean(RobberyRt), and se(mean)
statesamps <- tribble(
~sample, ~mean, ~se,
"1", mean(StatesDataRdm25$RobberyRt), fun95se(StatesDataRdm25),
"2", mean(samp2$RobberyRt), fun95se(samp2),
"3", mean(samp3$RobberyRt), fun95se(samp3),
"4", mean(samp4$RobberyRt), fun95se(samp4),
"5", mean(samp5$RobberyRt), fun95se(samp5),
"6", mean(samp6$RobberyRt), fun95se(samp6),
"7", mean(samp7$RobberyRt), fun95se(samp7),
"8", mean(samp8$RobberyRt), fun95se(samp8),
"9", mean(samp9$RobberyRt), fun95se(samp9)
)
statesamps <- statesamps %>%
mutate(
CI95low = mean - se,
CI95hi = mean + se
)
ggplot(statesamps, aes(mean, sample)) +
geom_point() +
geom_errorbar(aes(xmin = CI95low, xmax = CI95hi, color=sample)) +
geom_vline(xintercept = mean(StatesData$RobberyRt, linetype="dashed"),
color = "#d44842", size=1) +
geom_segment(aes(x = 130, y = 1.3, xend = 124.5, yend = .9),
arrow = arrow(length = unit(0.5, "cm")), color="#365c8d") +
annotate("text", x=130, y=2.2, label="95% CI\nfor our\nfirst sample", angle=0, color="#365c8d") +
geom_segment(aes(x = 113, y = 9.5, xend = 106, yend = 9.4),
arrow = arrow(length = unit(0.5, "cm")), color="#d44842") +
annotate("text", x=120, y=9.5, label="Population mean", angle=0, color="#d44842") +
# scale_color_viridis_d() +
scale_colour_manual(values=c('#365c8d','#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d')) +
theme_classic() +
theme(legend.position = "none")
#https://statisticsglobe.com/draw-plot-with-confidence-intervals-in-r
Note: A more efficient approach would be to use replicate() to generate and assign numerous samples to a single combined data object or to a list of separate objects, then calculate our estimates and generate a plot from that single data object or list. (We will illustrate some more efficient approaches in the next assignment). Instead, we generated
eight more samples and assigned them to separate data objects. We then
created a single tibble containing the sample mean and standard error
from each of our nine samples, calculated 95% upper and lower CI limits
for each sample in our new tibble, and then plotted our means and 95%
CIs from this tibble. Our hope is that you can more easily follow our
logic with this explicit approach.

The code above also provides another example of plotting with ggplot() + geom_point() + geom_errorbar(), as well as illustrating how you can add other elements like vertical lines (+ geom_vline()), arrows (+ geom_segment()), or text (+ annotate()) to a ggplot() object. While we do not expect you to reproduce this plot, please do not let us thwart your ambitions!

Goal: Estimate Confidence Intervals around ‘AssaultRt’ Sample Proportion
In the last part of this assignment, you will use what you have learned to answer SPSS exercise question 5, which asks you to estimate 95% and 99% CIs around the proportion of states with an assault rate (“AssaultRt”) higher than 200 assaults per 100k in our random sample of 25 states. We will illustrate the process by estimating a 90% CI around this same proportion.
Remember, we previously used the mean and sd to first estimate the standard error of the mean, which we then used to calculate the lower and upper limits of the 90% CI. We follow a similar process for a proportion, though the formula and inputs are different.
On that topic, you may have noticed that your book only provides a formula for calculating a confidence interval around a sample proportion with large samples, and it does not have an equivalent formula for small samples. What gives? Well, it turns out that these types of interval estimates are pretty error-prone with small samples, and neither z nor t distributions are really very useful for generating valid intervals around proportions with small samples. Yet, the book asks you to do this anyway, so we will plug our nose and move forward using the formula for large samples, which relies on the z distribution.
- To estimate a confidence interval around a sample proportion, we first need the sample proportion itself and the sample size, n.
- You could count the number of sampled states with an assault rate greater than 200 by hand, or you could use the count() function to answer this question directly in R. We show you how to do this in the code below.
- We then assign the resulting sample proportion (16/25 = 0.64) to an object named pAstRtgt200, which stands for “proportion AssaultRt greater than 200”.
Sample Proportion of States w/Assault Rate >200
StatesDataRdm25 %>% count(AssaultRt > 200)
## # A tibble: 2 × 2
## `AssaultRt > 200` n
## <lgl> <int>
## 1 FALSE 9
## 2 TRUE 16
pAstRtgt200 <- 0.64
pAstRtgt200
## [1] 0.64
- This time, we will use qnorm() instead of qt(), which is a quantile function that we can use to identify critical values from the standard normal (z) rather than t distribution.
- We will illustrate with qnorm() to find the 90% CI (two-tailed), though you will be doing this again on your own to estimate 95% and 99% confidence intervals.
- As before, for a two-tailed interval we split alpha across both tails (.10/2 = .05), so we specify the 0.95 quantile (i.e., 1 - alpha/2) in our code instead.
- In an R chunk, use qnorm() to identify the appropriate z critical value for a 90% CI and assign it to an object named zcrit. That value is 1.64.
Critical z value for two-tailed 90% CI
zcrit <- qnorm(0.95)
zcrit
## [1] 1.644854
5. We have what we need to estimate a confidence interval around our sample proportion.
  a. Recall, the formula for the standard error of the proportion (aka, the standard error of the sampling distribution of the proportion) is: se = sqrt(p(1-p)/n), and we multiply this by the critical z value to get the margin of error.
  b. In an R chunk, input our assigned objects into the formula by typing marg90AstRtprop <- zcrit*sqrt(pAstRtgt200*(1-pAstRtgt200)/n).
  - This code estimates the margin of error for our 90% CI by multiplying our critical value by the square root of the proportion (.64) multiplied by its complement (or 1 - p) and divided by n.
  - The margin of error estimate is then assigned to a new value object called marg90AstRtprop (named differently so we do not confuse it with the sample standard error for the mean of ‘RobberyRt’; the name refers to the sample margin of error for the 90% CI around the Assault Rate proportion).
Margin of error for 90% CI around Assault Rate proportion
marg90AstRtprop <- zcrit*sqrt(pAstRtgt200*(1-pAstRtgt200)/n)
marg90AstRtprop
## [1] 0.1579059
6. With the margin of error estimated and assigned to an object, we can now calculate the 90% confidence interval around our sample proportion.
  a. As we did in the previous section, we do this by simply subtracting from and adding to the sample proportion to calculate the lower and upper limits, respectively, for our 90% confidence interval around the “AssaultRt” sample proportion.
  b. We can do this by simply typing AstProp90CIlow <- pAstRtgt200 - marg90AstRtprop and AstProp90CIhi <- pAstRtgt200 + marg90AstRtprop to determine the 90% CI of the assault rate for our 25 states.
90% CI lower limit for AssaultRt proportion
AstProp90CIlow <- pAstRtgt200 - marg90AstRtprop
AstProp90CIlow
## [1] 0.4820941
90% CI upper limit for AssaultRt proportion
AstProp90CIhi <- pAstRtgt200 + marg90AstRtprop
AstProp90CIhi
## [1] 0.7979059
90% CI Estimate around AssaultRt Sample Proportion
cat("[",AstProp90CIlow,",",AstProp90CIhi,"]")
## [ 0.4820941 , 0.7979059 ]
- Congratulations! You just calculated a 90% confidence interval estimate for the proportion of states with an assault rate greater than 200.
- Our lower bound estimate is 0.48 and our upper bound estimate is 0.8. That means our 90% CI around the sample proportion is [0.48, 0.8].
- We can interpret this interval as follows: We are 90% confident that the true population proportion of states with assault rates greater than 200 is between 0.48 and 0.8.
- By 90% confident, we mean that over the long run the same procedures would result in estimates and confidence intervals that capture the true population proportion in 90% of trials, assuming our research design and statistical model (including the large sample formula) are appropriate.
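As an optional sanity check of the hand-calculated interval above (our own suggestion, not the book's or the assignment's required method), base R's prop.test() computes an interval for a proportion directly. It uses a Wilson-type interval, so it will not exactly match the z-based formula, but it should be in the same ballpark:
# Optional cross-check with base R (Wilson-type interval, so it will differ slightly)
prop.test(x = 16, n = 25, conf.level = 0.90, correct = FALSE)$conf.int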
Population Proportion of States w/Assault Rate
>200
StatesData %>% count(AssaultRt > 200)
## # A tibble: 2 × 2
## `AssaultRt > 200` n
## <lgl> <int>
## 1 FALSE 21
## 2 TRUE 29
popAstRtprop <- 29/50
popAstRtprop
## [1] 0.58
- The population proportion is 0.58, which falls within the 90%
confidence interval we estimated around our sample proportion (p=0.64,
90% CI=[0.48, 0.8]). It seems our random sample was one of the 90% of
samples we might have drawn that would be expected to generate 90%
confidence intervals that capture the true population proportion. Again,
we only know this because we have the full population data - in most
research situations, we do not have this luxury.
- Now it is your turn: follow these same steps to estimate 95% and 99% confidence intervals around the same “AssaultRt” sample proportion. Be sure to report your sample proportion, your margin of error (like the marg90AstRtprop value), and your lower and upper confidence limits (i.e., AstProp90CIlow and AstProp90CIhi).

Before you finish, do you recall how to:
- draw a random sample of rows from a dataset with dplyr::sample_n()?
- improve the reproducibility of your random sampling code with set.seed()?
- select specific rows using dplyr::filter() and the %in% operator?
- create a list or vector of values and assign it to an object with listname <- c(item1, item2)?
- simulate data from a probability distribution with rnorm(), truncnorm::rtruncnorm(), or runif()?
- combine multiple ggplot() plots into a single figure using the “patchwork” package?
- identify t or z critical values with qt() or qnorm()?
- plot means and confidence intervals with ggplot() + geom_point() + geom_errorbar()?
- add vertical line (+ geom_vline()), arrows (+ geom_segment()), or text (+ annotate()) elements to a ggplot() object?