In the last assignment, you were introduced to frequentist empirical probabilities and probability distributions, including the standard normal (z) distribution. You also learned how these are essential to null hypothesis significance testing, which is the process by which most social scientists make inferences from sample descriptive statistics to population parameters. Finally, you practiced various steps in this process, such as calculating frequentist probabilities from crosstabs, calculating z-scores from raw observation values, and even conducting a basic binomial hypothesis test.
In Assignment 9, we will dig deeper into the process of making statistical inferences about population parameters from sample statistics. For instance, you will learn to think about sample descriptive statistics (e.g., a sample mean or correlation coefficient) as point estimates of population parameters. Moreover, while point estimates may represent our best or most likely guess about the value of a population parameter, a point estimate usually is just one among many potential estimates of a population parameter that are plausible under our data and modeling assumptions. In fact, we can use our knowledge of probability to calculate an interval estimate, such as a frequentist 95% confidence interval, around a point estimate.
Pragmatically, a confidence interval is a range of plausible estimates that quantifies the degree of uncertainty we have in our estimate of a population parameter. Learning to think about sample-derived estimates of population parameters as (uncertain) interval estimates rather than as (fixed and certain) point estimates is essential to truly understanding the process of statistical inference. After all, we rarely know the true population parameters and, thus, the seductive tendency to view sample statistics as “real” and known calculations of population values is misguided and prone to inference errors.
Unfortunately, frequentist confidence intervals (like their p-value cousins) are notoriously misunderstood and misinterpreted. For instance, it is quite common to read or hear people explain a 95% confidence interval like this: “We can be 95% certain that the true population parameter we seek falls within the 95% confidence interval we constructed around our sample point estimate.” Sadly, this is an inaccurate interpretation. When we calculate a frequentist 95% confidence interval, we are not 95% confident in the calculated interval itself. Rather, we are confident in the method that we used to generate confidence intervals.
What does this mean? Well, do you recall that long run or “sampling” notion of probability we discussed in the last assignment? We need to keep that in mind when interpreting a frequentist confidence interval around a point estimate. Now, imagine we calculated a sample statistic and used it as a point estimate to infer a population parameter value, and we also calculated a 95% confidence interval to quantify uncertainty around our estimate. Next, let’s hypothetically imagine we were to repeatedly follow the same sampling procedures a large number of times, calculating a point estimate and 95% confidence interval the same way each time. In what are we 95% confident? Assuming our data and modeling assumptions are appropriate, we are confident that the interval estimates we calculate in a large number of repeated trials would capture the true population parameter 95% of the time.
Of course, this means that we also expect our 95% confidence intervals would fail to capture the population parameter 5% of the time on repeated trials. Moreover, when we only calculate a confidence interval from a single sample, we cannot know whether we have one of the 95% of parameter-catching interval nets that effectively captured the true population value or, instead, whether we happened to get one of the 5% of atypical interval nets that failed to capture the true population parameter.
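This long-run logic is easy to check by simulation. Below is a minimal sketch, assuming a made-up population (mean = 100, sd = 50) and repeated samples of n = 25; none of these numbers come from the States data:

```r
# Repeated-sampling check of 95% CI coverage (toy values, not the States data)
set.seed(1138)
trials <- 10000
n <- 25
mu <- 100      # assumed "true" population mean
sigma <- 50    # assumed population sd
caught <- replicate(trials, {
  x <- rnorm(n, mu, sigma)              # one random sample
  se <- sd(x) / sqrt(n)                 # standard error of the mean
  moe <- qt(0.975, df = n - 1) * se     # t-based margin of error
  (mean(x) - moe) <= mu && mu <= (mean(x) + moe)
})
mean(caught)   # long-run capture rate; should sit very close to 0.95
```

Roughly 95% of the 10,000 interval nets capture the true mean; the other ~5% miss it, exactly as described above.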
Finally, want to catch more parameters? As with catching Pokémon, you just need a bigger or better net! For instance, we can improve our expected parameter-capturing rate (e.g., to 99%) by casting a wider interval net (i.e., by calculating wider confidence intervals). By widening our interval net, we exchange greater imprecision/uncertainty for the prospect of making fewer inference errors. We can minimize this trade-off by getting a better net - that is, by improving features of our research design (e.g., more precise measurement; larger sample sizes).
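One way to see the wider-net trade-off is in the critical values used to build these intervals; the sample size of 25 below is just an assumed example:

```r
n <- 25
qt(0.975, df = n - 1)   # multiplier for a 95% interval (about 2.06)
qt(0.995, df = n - 1)   # larger multiplier for a 99% interval (about 2.80)
qnorm(0.975)            # large-sample (z) multiplier for 95% (about 1.96)
```

Because the margin of error is the critical value times the standard error, the 99% interval from the same sample is necessarily wider than the 95% interval.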
In the current assignment, you will learn how to calculate confidence intervals around a point estimate in R and to interpret them appropriately. Additionally, you will learn how to simulate data from a probability distribution, which should help you better understand sampling variability and the need for interval estimates. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.
By the end of this assignment, you should be able to:

- Draw a random sample of rows from a dataset with `dplyr::sample_n()`
- Make your “random” sampling reproducible with `set.seed()`
- Select specific rows with `dplyr::filter()` and the `%in%` operator
- Combine values into a vector or list with the format `listname <- c(item1, item2)`
- Simulate data from probability distributions with `rnorm()`, `truncnorm::rtruncnorm()`, or `runif()`
- Combine `ggplot()` plots into a single figure using the “patchwork” package
- Calculate critical values with `qt()` or `qnorm()`
- Plot point and interval estimates with `ggplot() + geom_point() + geom_errorbar()`
- Add line (`+ geom_vline()`), arrow (`+ geom_segment()`), or text (`+ annotate()`) elements to a `ggplot()` object

We are building on objectives from Assignments 1-7. By the start of this assignment, you should already know how to:
- Use the `package::function()` format to call functions
- Read an SPSS datafile with `haven::read_spss()` and assign it to an R object using an assignment (`<-`) operator
- Use the `$` symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format `dataobject$varname`
- Use the `%>%` pipe operator to perform a sequence of actions
- Interpret `!=` as “not equal to”
- Turn off scientific notation and trim decimals with `options(scipen=999, digits = 3)`
- Use `here()` for a simple and reproducible self-referential file directory method
- Use `library()` for loading packages
- Use the `head()` function to quickly view a snapshot of your data
- Use the `glimpse()` function to quickly view all columns (variables) in your data
- Use `sjPlot::view_df()` to quickly browse variables in a data file
- Use `attr()` to identify variable and attribute value labels
- Recognize missing (`NA`) values for variables in your data file
- Drop missing observations with `filter(!is.na(var))` or unused factor levels with `data %>% droplevels(data$variable)`
- Modify data with the `select()`, `mutate()`, and `if_else()` functions
- Use `summarytools::dfSummary()` to quickly describe one or more variables in a data file
- Create frequency tables with the `sjmisc::frq()` and `summarytools::freq()` functions
- Calculate measures of central tendency with `mean()`, `median()`, and `mode()` (e.g., `mean(data$variable)`)
- Generate descriptive statistics with the `summarytools::descr()` and `psych::describe()` functions (or with `sjmisc::frq()` or `summarytools::descr()`)
- Create a crosstab with `dplyr::select()` & `sjPlot::sjtab(depvar, indepvar)`
- Conduct a binomial hypothesis test with `rstatix::binom_test()`
- Create a table with `gt()` (e.g., `head(data) %>% gt()`)
- Modify a `gt()` table, such as adding titles/subtitles with Markdown-formatted (e.g., `**bold**` or `*italicized*`) fonts
- Create plots with the `ggplot()` function
- Use `boxplot()` and `ggplot()` to visualize dispersion in a data distribution
- Modify plot fill and outline colors (e.g., in `geom_boxplot()`) by adding `fill=` and `color=` followed by specific color names (e.g., “orange”) or hexadecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
- Add a title with the `labs()` function (e.g., `+ labs(title = "My Title")`)

If you do not recall how to do these things, review Assignments 1-6.
Additionally, you should have read the assigned book chapter, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:
As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).
Goal: Read in 2012 States data & calculate population mean for
RobberyRt
(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR-MO-DY for the actual date. E.g., 2023-01-23_Ducate_CRIM5305_Assign09)
In previous assignments, you learned how to calculate descriptive statistics like the sample mean of a data column in R. In the current assignment, you will use sample statistics (e.g., sample mean or proportion) as point estimates to infer population parameter values. Likewise, you will learn to calculate the standard error of the estimate and a confidence interval around the estimate using R, as well as how to interpret these interval estimates appropriately.
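As a preview, the pieces fit together like this; the five values below are hypothetical stand-ins, not drawn from the States data:

```r
# Hypothetical mini-sample (made-up values) showing the parts of a t-based CI
x <- c(98, 110, 87, 120, 105)
n <- length(x)
est <- mean(x)                       # point estimate
se  <- sd(x) / sqrt(n)               # standard error of the estimate
moe <- qt(0.975, df = n - 1) * se    # margin of error for a 95% CI
c(lower = est - moe, estimate = est, upper = est + moe)
```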
(Recall: the `here` package will automatically set our CRIM5305_L folder as the top-level directory.) Then, create a third-level header titled: Read in 2012 States Data
This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text: This R Markdown document contains my work for Assignment 9. It is my work and only my work.
In a new R code chunk, load the `tidyverse`, `haven`, `here`, `sjmisc`, `sjPlot`, `gt`, and `truncnorm` packages. This may be your first time using the `truncnorm` and `patchwork` packages, which means you’ll need to download them first using `install.packages()` or the “Install Packages” button under the “Tools” tab. Then, use `library()` to load them in. Insert another R code chunk.
In the new R code chunk, read the `2012StatesData.sav` SPSS datafile into R and assign it to an R object named `StatesData`.
In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data object: `StatesData`.
Use the `head()` function (piped to `gt()`) to print the first six rows so you can compare your output.
output.StatesData <- read_spss(here("Datasets", "2012StatesData.sav"))
StatesData %>% head() %>% gt()
Next, create a frequency table for `RobberyRt`, then use it to determine the population mean and standard deviation of `RobberyRt` for all 50 states.
One option is `sjmisc::frq()`: typing `StatesData %>% frq(RobberyRt)` creates a frequency table that also provides the mean and standard deviation. Alternatively, you could use `mean(StatesData$RobberyRt)` and `sd(StatesData$RobberyRt)`.
StatesData %>%
  frq(RobberyRt)
## Robbery Rate per 100K (RobberyRt) <numeric>
## # total N=50 valid N=50 mean=104.58 sd=57.74
##
## Value | N | Raw % | Valid % | Cum. %
## -------------------------------------
## 14.30 | 1 | 2 | 2 | 2.00
## 14.90 | 1 | 2 | 2 | 4.00
## 16.50 | 1 | 2 | 2 | 6.00
## 17.20 | 1 | 2 | 2 | 8.00
## 18.00 | 1 | 2 | 2 | 10.00
## 22.90 | 1 | 2 | 2 | 12.00
## 30.30 | 1 | 2 | 2 | 14.00
## 37.20 | 1 | 2 | 2 | 16.00
## 42.20 | 1 | 2 | 2 | 18.00
## 47.30 | 1 | 2 | 2 | 20.00
## 56.20 | 1 | 2 | 2 | 22.00
## 65.30 | 1 | 2 | 2 | 24.00
## 66.70 | 1 | 2 | 2 | 26.00
## 67.90 | 1 | 2 | 2 | 28.00
## 70.40 | 1 | 2 | 2 | 30.00
## 74.50 | 1 | 2 | 2 | 32.00
## 74.70 | 1 | 2 | 2 | 34.00
## 79.50 | 1 | 2 | 2 | 36.00
## 80.20 | 1 | 2 | 2 | 38.00
## 86.80 | 1 | 2 | 2 | 40.00
## 87.70 | 1 | 2 | 2 | 42.00
## 92.90 | 1 | 2 | 2 | 44.00
## 93.50 | 1 | 2 | 2 | 46.00
## 94.00 | 1 | 2 | 2 | 48.00
## 98.70 | 1 | 2 | 2 | 50.00
## 103.40 | 1 | 2 | 2 | 52.00
## 113.60 | 1 | 2 | 2 | 54.00
## 114.10 | 1 | 2 | 2 | 56.00
## 117.30 | 1 | 2 | 2 | 58.00
## 123.90 | 1 | 2 | 2 | 60.00
## 126.00 | 1 | 2 | 2 | 62.00
## 126.50 | 1 | 2 | 2 | 64.00
## 127.10 | 1 | 2 | 2 | 66.00
## 129.40 | 1 | 2 | 2 | 68.00
## 131.60 | 1 | 2 | 2 | 70.00
## 133.70 | 1 | 2 | 2 | 72.00
## 142.30 | 1 | 2 | 2 | 74.00
## 142.40 | 1 | 2 | 2 | 76.00
## 142.50 | 1 | 2 | 2 | 78.00
## 144.50 | 1 | 2 | 2 | 80.00
## 153.30 | 1 | 2 | 2 | 82.00
## 153.60 | 1 | 2 | 2 | 84.00
## 157.00 | 1 | 2 | 2 | 86.00
## 166.80 | 1 | 2 | 2 | 88.00
## 167.60 | 1 | 2 | 2 | 90.00
## 173.70 | 1 | 2 | 2 | 92.00
## 189.70 | 1 | 2 | 2 | 94.00
## 210.70 | 1 | 2 | 2 | 96.00
## 228.00 | 1 | 2 | 2 | 98.00
## 260.70 | 1 | 2 | 2 | 100.00
## <NA> | 0 | 0 | <NA> | <NA>
Goal: Randomly Select 25 States from Dataset
As noted in the previous section, this is one of the relatively rare instances where we have data from an entire population - in this case, from all 50 states. Since the population parameters (e.g., mean and standard deviation) are known to us, we can illustrate the logic and process of statistical inference by randomly sampling from the population, generating sample point and interval estimates, and then comparing these sample estimates to the known population parameters. (Note: Typically, we cannot do this last step as we rarely know the population parameters - that is why we use samples to estimate them!)
So, you will start by randomly selecting 25 of the 50 states (rows) in the 2012 States dataset. In doing so, you will learn about the `set.seed()` function, which is essential for reproducibility, as it allows all of us to select the same random sample rather than generating a unique random sample every time our code is run.
After drawing a random sample of 25 states, we will take a detour to introduce nonrandom selection of data subsets followed by data simulation. We will then revisit the random sample you generate in this section to complete the last part of the assignment.
First, you will randomly select 25 states from the dataset. Later in the assignment, we will use this random sample to calculate confidence intervals around sample statistics.
In a new R code chunk, type `set.seed(1234)` and hit enter. Then, type `StatesDataRdm25 <- StatesData %>% sample_n(25, replace = FALSE)` and run your code.
`set.seed()` allows us to make our “random” sample easier to reproduce. It tells R WHERE in its list of random numbers it should begin drawing numbers. Without it, R will generate a new and unique random sample each time you run your sampling code, which can cause major reproducibility problems.
The `set.seed()` function specifies a fixed, rather than changing, starting point for the random number generator. This way, any time we or a collaborator runs our code, we will get the exact same “random” sample. Here, we are using a starting seed of `1234`, which will begin the random number generator at the 1,234th place. However, we can use any set of numbers when we set our seed. Unique seeds will keep our work as random as possible yet still reproducible. You could set your seed to `6789` – any value will work, even `1`! I often use numbers like `8675309` or `1138`.
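A quick sketch of this fixed-starting-point behavior (the small draws below are just an illustration):

```r
set.seed(1234)
draw1 <- sample(1:50, 5)   # first draw after setting the seed
draw2 <- sample(1:50, 5)   # differs: the generator has moved on past draw1
set.seed(1234)
draw3 <- sample(1:50, 5)   # resetting the seed reproduces draw1 exactly
identical(draw1, draw3)    # TRUE
```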
The `dplyr::sample_n()` function allows us to take a sample of size `n` from our dataset. The first value, 25, specifies that we want 25 rows, or states. The argument `replace = FALSE` simply tells R not to replace states after choosing them. For example, if Missouri is chosen, R will not place it back into the selection pool. This method ensures no state is chosen more than once. If you wanted to sample with replacement, change `FALSE` to `TRUE`.
Alternatively, you could replace `sample_n()` with `sample_frac()` and `25` with `0.5`. Essentially, this approach will take a random sample of a fraction (in our case, 50%) of the rows in our data object.
Note: If you re-run `sample_n()`, you will need to re-run `set.seed(1234)` first. This is because the seed will have moved on from generating the random sample the first time. Try it both ways to prove it to yourself!
The `StatesDataRdm25 <- StatesData %>%` portion of the code assigns the result of our piped actions on the `StatesData` - that is, our random sample of 25 rows - into a new data object called `StatesDataRdm25`.
set.seed(1234)
StatesDataRdm25 <- StatesData %>% sample_n(25, replace = FALSE)
StatesDataRdm25 %>% head() %>% gt()
#Alternative way to sample half the rows (n=25)
#set.seed(1234)
#StatesDataRdm25 <- StatesData %>% sample_frac(0.5, replace = FALSE)
You should now see the `StatesDataRdm25` data object in your Environment. It should contain 25 rows (states) with 30 columns (variables). The first state should be Nevada and the last should be Delaware. Use the `head()` function (piped to `gt()`) again to print the first six rows so you can compare your output. (Note: Your original `StatesData` data object should also still be in the R Environment and should still contain all 50 states.)
Goal: Nonrandomly Select Rows using Conditions
While a random (or probability) sample is essential for drawing unbiased statistical inferences from a sample to a population, there may be times that you want to nonrandomly select certain rows with specific characteristics from a data object rather than selecting rows randomly. So, we will also show you how to select a specific subset of rows (states) in the data, such as the first 25 rows or all rows from states starting with a particular letter.
Imagine you wanted to select only a specific row or rows (i.e.,
states) from the StatesData
object. One way you could do
this is to combine dplyr::filter()
with the
%in%
operator. Let’s try it.
StateswA <- StatesData %>%
and hit enter.
Then, type
dplyr::filter(State %in% c("Alabama", "Alaska", "Arizona", "Arkansas"))
and hit run. A data object called StateswA
should appear in
your Environment.
(Note: We are creating `StateswA` using our full `StatesData` dataset. Do not use the `StatesDataRdm25` dataset for this step, because it only includes 50% of the states!)
The `dplyr::filter()` function generates a subset of a dataframe by retaining only the rows that match a condition you specify (e.g., matching column values). You might recall that we used `dplyr::filter()` in the previous assignment to select only rows that contained valid non-missing response values on a variable (with `filter(!is.na(var))`).
The `%in%` operator tells R that we want to keep only those rows in `StatesData` where the response values are “in” (or are equal to) the values we specified - in this case, “Alabama”, “Alaska”, “Arizona”, and “Arkansas”. If we had specified only one state, we could have used a double equal sign instead (e.g., `State == "Alabama"`).
The state names are combined with `c()`. This is a very handy and widely used base R function that combines multiple elements into a vector or a list. In general, `newdata <- olddata %>% filter(var %in% c(value1, value2))` informs R to filter the old data object by the “var” column, keeping only those rows where “var” is equal to “value1” or “value2”, then assigning this subset to a new data object.
#preferred way to choose specific rows, or states
StateswA <- StatesData %>%
dplyr::filter(State %in% c("Alabama", "Alaska", "Arizona", "Arkansas")
)
StateswA %>% gt()
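Incidentally, `%in%` works on any vector, not just inside `filter()`; a small base R illustration (toy vectors, not the States data):

```r
states <- c("Alabama", "Alaska", "Arizona", "Missouri")
keep <- states %in% c("Alabama", "Arizona")  # element-by-element membership test
keep           # TRUE FALSE TRUE FALSE
states[keep]   # "Alabama" "Arizona"
```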
Now, let’s try another way to use the `%in%` operator and the `filter()` and `c()` functions. We will start by assigning the column values (state names) to an object or, more precisely, to a vector of values.
In a new R code chunk, type `Cstates <- c("California", "Colorado", "Connecticut")`. This creates a list or vector of values (not a data object) that will be assigned to an object named `Cstates`. You can then find this object in your Environment tab under “Values” (below “Data” objects), and you can call this object for use in later code. If you type `Cstates` and click “Run” to view the object, RStudio should output the list you assigned:
#assign list/vector of specific row values, or states, as object called "Cstates"
Cstates <- c("California", "Colorado", "Connecticut")
Cstates
## [1] "California" "Colorado" "Connecticut"
Next, modify your `filter()` code from above to use the new object: type `StateswC <- StatesData %>% filter(State %in% Cstates)`. Then, view the new data object by typing `StateswC`.
#Filter data using object containing list/vector of row values (states)
StateswC <- StatesData %>%
filter(State %in% Cstates)
StateswC %>% gt()
Goal: Simulate Data
In addition to selecting subsets from real-world data, there may be times that you want to simulate fake data. Social scientists simulate data for many reasons, such as to check model assumptions, mimic real-world scenarios, or even generate more accurate interval estimates around sample statistics from real data (e.g., bootstrapped confidence intervals or Bayesian credible intervals). In fact, some even claim that to truly understand statistical modeling and inference, one must first know how to effectively simulate data.
One of the most challenging aspects of data simulation is choosing an appropriate probability distribution(s) from which to randomly sample data. Here, we will introduce you to simple methods for generating simulated datasets from some common probability distributions, including standard normal, normal, truncated normal, and uniform distributions. We recommend checking out this site for additional reading and other examples of simulating from normal, uniform, binomial, and Poisson distributions in R.
Imagine that our end goal in this section is to generate a dataset containing several simulated columns that each contain n=50 values, where each row value represents a plausible (fake but realistic) RobberyRt value for one of 50 simulated states. Since simulation generally involves random draws from a probability distribution(s), we will start by first demonstrating how to simulate n=50 row values from a standard normal distribution.
In a new R code chunk, set your random seed to `1234`. Then, type `rnorm(n = 50, mean=0, sd=1)` and run your code. `rnorm()` draws “n” random numbers from a normal distribution with a specified mean and standard deviation. Here, we use the `rnorm()` function to draw 50 random numbers from a (standard) normal distribution that has a mean=0 and sd=1.
# simulation of n=50 RobberyRt values from random normal dist
set.seed(1234)
rnorm(n = 50,
mean = 0,
sd = 1)
## [1] -1.20706575 0.27742924 1.08444118 -2.34569770 0.42912469 0.50605589
## [7] -0.57473996 -0.54663186 -0.56445200 -0.89003783 -0.47719270 -0.99838644
## [13] -0.77625389 0.06445882 0.95949406 -0.11028549 -0.51100951 -0.91119542
## [19] -0.83717168 2.41583518 0.13408822 -0.49068590 -0.44054787 0.45958944
## [25] -0.69372025 -1.44820491 0.57475572 -1.02365572 -0.01513830 -0.93594860
## [31] 1.10229755 -0.47559308 -0.70944004 -0.50125806 -1.62909347 -1.16761926
## [37] -2.18003965 -1.34099319 -0.29429386 -0.46589754 1.44949627 -1.06864272
## [43] -0.85536463 -0.28062300 -0.99434008 -0.96851432 -1.10731819 -1.25198589
## [49] -0.52382812 -0.49684996
We will go a step further by assigning these numbers to a tibble, then printing the mean and sd of the simulated data series, and finally plotting them in a histogram. Though this step is optional, feel free to follow along in your own RMD file if you want. At minimum, read through each step and try to understand what is going on. We’ll walk through how to do this yourself later on.
# simulation of n=50 RobberyRt values from random normal dist
set.seed(1234)
simznorm <- rnorm(n = 50,
mean = 0,
sd = 1) %>%
as_tibble()
# Mean of z-normal simulated sample, n=50
mean(simznorm$value)
## [1] -0.453053
# SD of z-normal simulated sample, n=50
sd(simznorm$value)
## [1] 0.8850435
simznorm %>% ggplot(aes(x=value)) + geom_histogram(fill="#21918c", bins=5)
(Note: Plots like these can be combined into a single figure using the `patchwork` package - you will learn how later in this assignment.) Also, notice what happens if we change `n=50` in the `rnorm()` function to a larger sample size - say, to `n=5000` instead: the simulated mean and SD land much closer to the target values of 0 and 1.
# Mean of z-normal simulated sample V1, n=5,000
mean(simznorm$V1)
## [1] -0.006354696
# SD of z-normal simulated sample V1, n=5,000
sd(simznorm$V1)
## [1] 0.991345
# Sample mean for RobberyRt (n=25 states)
mean(StatesDataRdm25$RobberyRt)
## [1] 99.424
# Sample sd of RobberyRt (n=25 states)
sd(StatesDataRdm25$RobberyRt)
## [1] 57.07451
funhist(StatesDataRdm25, RobberyRt, 5, 20, "Robbery Rate (Random Sample of States Data, n=25)")
The `RobberyRt` distribution from our random sample seems like it could have been drawn from a normal probability distribution but, as we would expect, key summary statistics like mean, sd, min, and max values are quite a bit different from the standard or z-normal distribution. So, instead of sampling from the standard normal distribution, let’s try sampling values from a normal probability distribution that more closely matches the empirical distribution of `RobberyRt` values. To do this, we will use the `mean` and `sd` values estimated from our random sample instead of the standard normal values.
In a new R code chunk, set your seed to `1234`. Then, type `rnorm(n = 50,` and hit enter. Type `mean=mean(StatesDataRdm25$RobberyRt),` and hit enter. Finally, type `sd=sd(StatesDataRdm25$RobberyRt))` and then run your code.
# simulation of n=50 RobberyRt values from random normal dist
set.seed(1234)
rnorm(n = 50,
mean = mean(StatesDataRdm25$RobberyRt),
sd = sd(StatesDataRdm25$RobberyRt))
## [1] 30.531317 115.258137 161.317946 -34.455541 123.916080 128.306891
## [7] 66.621000 68.225256 67.208180 48.625529 72.188462 42.441585
## [13] 55.119691 103.102955 154.186651 93.129510 70.258384 47.417970
## [19] 51.642839 237.306603 107.077019 71.418344 74.279947 125.654841
## [25] 59.830259 16.768418 132.227900 40.999354 98.559989 46.005195
## [31] 162.337090 72.279759 58.933059 70.814943 6.444293 32.782706
## [37] -25.000689 22.887474 82.627323 72.833127 182.153285 38.431743
## [43] 50.604485 83.407580 42.672530 44.146522 36.224360 27.967522
## [49] 69.526768 71.066533
Using `rnorm()` with the mean/sd of our random sample (printed output above and plotted below) reveals a set of values that are much more comparable to the `RobberyRt` values observed in our `StatesData`, with one glaring exception: Our simulated data contain some negative values, whereas the minimum value in our `RobberyRt` sample (and population) data equals “0”. Of course, this is because a state cannot validly report a negative robbery rate - that is, a state cannot have less than 0 robberies in a year!
Ideally, we want to simulate data from a probability distribution that closely matches the data-generating processes underlying RobberyRt values. So, at the very least, we want to draw from a probability distribution that does not include any negative values. Since the normal distribution extends from negative infinity to infinity, there is always a chance that we will draw one or more negative - and hence invalid (as RobberyRt values cannot be negative) - simulated values using `rnorm()`. As such, this distribution is not a good fit to the data-generating process that we are trying to simulate.
Instead, we might consider randomly sampling from a truncated distribution, which is a probability distribution that is limited to or excludes certain values. For instance, we could draw from a truncated normal distribution that excludes (rejects) any values below a minimum of “0”.
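The reject-and-redraw idea can be sketched in a few lines of base R. This is our own illustration of rejection sampling with a lower bound of 0, not the `truncnorm` package's actual implementation:

```r
# Keep drawing from a normal distribution, rejecting values below `a`,
# until n accepted values are collected (illustrative sketch only)
rtrunc_lower <- function(n, a, mean, sd) {
  out <- numeric(0)
  while (length(out) < n) {
    draws <- rnorm(n, mean, sd)
    out <- c(out, draws[draws >= a])   # reject anything below the bound
  }
  out[seq_len(n)]
}
set.seed(1234)
sim <- rtrunc_lower(50, a = 0, mean = 100, sd = 57)
min(sim)   # never below the lower bound of 0
```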
In a new R code chunk, set your seed to `1234`. Then, type `rtruncnorm(n = 50,` and hit enter. Type `a = 0,` and hit enter.
Note how similar this code is to our earlier `rnorm()` code. With `rtruncnorm()` we can specify lower (minimum; `a=`) and upper (maximum; `b=`) sampling limits to define the accept/reject sampling region. In our case, we typed `a=0` to specify a minimum value of “0”, which means that any negative values (<0) will be rejected (skipped) when drawing random numbers from the normal distribution.
Next, type `mean=mean(StatesDataRdm25$RobberyRt),` and hit enter. Finally, type `sd=sd(StatesDataRdm25$RobberyRt))` and then run your code.
set.seed(1234)
rtruncnorm(n = 50,
a = 0,
mean = mean(StatesDataRdm25$RobberyRt),
sd = sd(StatesDataRdm25$RobberyRt))
## [1] 30.531317 115.258137 161.317946 123.916080 128.306891 66.621000
## [7] 68.225256 67.208180 48.625529 72.188462 42.441585 55.119691
## [13] 103.102955 154.186651 93.129510 70.258384 47.417970 51.642839
## [19] 237.306603 107.077019 71.418344 74.279947 125.654841 59.830259
## [25] 16.768418 132.227900 40.999354 98.559989 46.005195 162.337090
## [31] 72.279759 58.933059 70.814943 6.444293 32.782706 22.887474
## [37] 82.627323 72.833127 182.153285 38.431743 50.604485 83.407580
## [43] 42.672530 44.146522 36.224360 27.967522 69.526768 71.066533
## [49] 66.202303 36.134671
If you look carefully, these values are nearly identical to those generated using `rnorm()`. This is because in both instances we set the same random seed (`set.seed(1234)`) before drawing (pseudo)random values from a normal probability distribution with the same mean and standard deviation. The key difference is that the 4th and 37th values were negative (-34.455541 & -25.000689, respectively) in our simulated sample generated earlier using `rnorm()`. Each of these negative values was rejected by the random sampler and skipped in generating our new simulated sample using `rtruncnorm()`. Likewise, the very last two values (66.202303 & 36.134671) in the `rtruncnorm()` simulated sample are new, as these values essentially were next in line (from our seed using the random number generator) to replace the two rejected negative values.
Below, we present the plot of our real random sample data
alongside the plot of our simulated sample data from a truncated normal
distribution. Compared to our previous attempts, it is more plausible
that these simulated values were drawn randomly from the same underlying
generative distribution as our sample RobberyRt
values.
Before you learn to replicate this simulation process multiple
times and create your own patchwork
plots, let’s consider
one alternative probability distribution - the uniform or “flat”
distribution. In a uniform
distribution, the probability of selecting any one value between a
specified minimum and maximum limit is equal to the probability of
selecting any other value. Put differently, every value within the set
limits of a uniform distribution is equally likely to be chosen at
random.
Due to this equal probability feature, a uniform distribution generates random samples that, over the long run (i.e., with large samples), resemble uniform or “flat” distributions. Contrast this shape with the peaked (rather than flat) standard normal distribution, where the probability of selecting “0” (i.e., the mean) is higher than that of any other value in the distribution and where the probability of selecting any given value near “0” is much higher than the probability of selecting any given value that is multiple standard deviations away from the mean of “0”.
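You can check this peaked-versus-flat contrast numerically as well as visually; the sketch below compares the share of draws landing within half a unit of 0 under each distribution:

```r
set.seed(1138)
u <- runif(1e5, min = -5, max = 5)   # flat across [-5, 5]
z <- rnorm(1e5)                      # peaked at 0
mean(abs(u) < 0.5)   # about 0.10: a 1-unit slice holds ~10% of uniform draws
mean(abs(z) < 0.5)   # about 0.38: the same slice near 0 catches far more normal draws
```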
To illustrate these differences, below we have plotted large simulated samples (n=50,000 each) drawn from the standard or z-normal (mean=0,sd=1), truncated normal (mean=1, sd=1), and uniform (min=-5,max=5) distributions.
At first glance, the flat shape of a uniform distribution does not look like a good fit for sampling state-level robbery rates. However, as with sampling from the normal distribution, small samples (e.g., n=50) drawn from a uniform distribution will be subject to greater sampling variability, so their plots are likely to be “noisier” and not as flat as the one shown above.
Moreover, rather than stacking the mass of our generative
probability distribution around the observed mean from a small, noisy
sample like we do with truncnorm()
, we might instead think
that it is reasonable to expect a state to have any robbery
rate values between “0” and some plausible upper bound - say 300
robberies per 100k people; we might also think that states are equally
likely to have any value between “0” and our specified maximum of “300”
robberies per 100k. If so, then the uniform probability distribution
might be an appropriate alternative starting point for simulating
state-level robbery rate data.
To randomly sample from a uniform distribution, we use `runif()`.
In a new R code chunk, set your seed to `1234`. Then, type `runif(n = 50,` and hit enter. Type `min = 0,` and hit enter. Finally, type `max = 300)` and run your code.
Note how similar this code is to the `rnorm()` and `rtruncnorm()` code above. Here, we specify the sampling region for our uniform distribution between limits of `min=0` (0 robberies per 100k) and `max=300` (300 robberies per 100k). In a uniform probability distribution, all values within those limits have an equal probability of selection.
set.seed(1234)
runif(n = 50,
min = 0,
max = 300)
## [1] 34.111023 186.689821 182.782420 187.013833 258.274615 192.093182
## [7] 2.848727 69.765152 199.825127 154.275342 208.077388 163.492451
## [13] 84.820075 277.030045 87.694752 251.188688 85.866985 80.046234
## [19] 56.016837 69.667773 94.983736 90.808011 47.713801 11.998775
## [25] 65.639862 243.179566 157.709264 274.397450 249.403514 13.731079
## [31] 136.827445 79.556002 91.401661 152.192061 54.328862 227.901191
## [37] 60.374411 77.642946 297.645125 242.205702 166.000077 193.921828
## [43] 93.547292 186.545759 98.931053 150.599242 203.128358 145.497372
## [49] 73.178648 229.637936
Next, let’s assign our simulated values to an object named `simrunif`.
Copy your code from above (including `set.seed(1234)`), then add `simrunif <-` before `runif(n = 50,` to assign your results to an object called `simrunif`. Add a pipe (`%>%`) immediately after `max = 300)` and hit enter. On the next line, type `as_tibble()`. Finally, on a new line, type `simrunif %>% gt()` to view the simulated values in a `gt()` table. (Note: In our code below, we also piped to `head()`; this keeps the output brief while allowing you to compare your output to our first six rows.)
set.seed(1234)
simrunif <- runif(n = 50,
min = 0,
max = 300) %>%
as_tibble()
simrunif %>% head() %>% gt()
Our new data object is named `simrunif`, and the column containing our 50 simulated values is named `value`, which you can see in the column header of your tibble output above.
Now, let’s plot our simulated data.
In a new R code chunk, type `simrunifplot <- simrunif %>%` then hit enter. This assigns the result of our piped actions to a plot object named `simrunifplot`.
Type `ggplot(aes(x=value)) +` then hit enter. Using the `simrunif` data object, we are creating a ggplot aesthetic with the column `value` on its x-axis. Remember the `+` at the end, as we need to add a geometric object to our blank ggplot coordinate canvas.
Type `coord_cartesian(expand = FALSE, ylim=c(0,30), xlim=c(0,310)) +` and hit enter. The `coord_cartesian()` function allows us to modify the XY coordinates in our ggplot objects, such as zooming in or out of the plot, without changing the data.
Type `geom_histogram(fill="#21918c", bins=5) +` and hit enter. (Recall that you learned how to specify fill and other ggplot colors in previous assignments.) Changing the number of bins, e.g., `bins=30`, would divide the values from “0” to “300” into 30 bins, into which our small sample of n=50 values would then be sorted and plotted. Selecting a large number of bins displays more fine-grained frequency distinctions (akin to a frequency table at the extreme), whereas selecting a smaller number of bins can show clustering in grouped frequency regions. Try re-running the plot a few times with different bin numbers to see what we mean.
Finally, type `xlab("Simulated data (runif[0,300], n=50)")` to label the x-axis. Then, on a new line, type `simrunifplot` and click run to generate and then view your plot object.
simrunifplot <- simrunif %>%
ggplot(aes(x=value)) +
coord_cartesian(expand = FALSE, ylim=c(0,30), xlim=c(0,310)) +
geom_histogram(fill="#21918c", bins=5) +
xlab("Simulated data (runif[0,300], n=50)")
simrunifplot
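The suggestion to experiment with bin numbers is easy to try. Here is one sketch (our own code, not the assignment's) that uses patchwork to place a 5-bin and a 30-bin version of the same histogram side by side:

```r
library(ggplot2)
library(patchwork)
library(tibble)

# Recreate the simulated data (same seed as above)
set.seed(1234)
simdat <- tibble(value = runif(n = 50, min = 0, max = 300))

# Same data, two bin choices: coarse (5 bins) vs fine-grained (30 bins)
p5  <- ggplot(simdat, aes(x = value)) + geom_histogram(fill = "#21918c", bins = 5)
p30 <- ggplot(simdat, aes(x = value)) + geom_histogram(fill = "#21918c", bins = 30)

p5 + p30  # patchwork combines the two plots into one figure
```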
Next, let's compare our simulated data to the RobberyRt distribution from our random sample of 25 states. Recall, we assigned the random sample to the StatesDataRdm25 object.

Create a similar histogram of RobberyRt from the random sample assigned to your StatesDataRdm25 object. To do so, make the following changes to your code from step 6 above:

- Change the plot object name from simrunifplot to sampleplot.
- Change the data object from simrunif to StatesDataRdm25.
- Change the aesthetic from x=value to x=RobberyRt.
- Change ylim=c(0,30) to ylim=c(0,15). This cuts the y-axis limit (ylim) in half (from 30 to 15).
- The geom_histogram() line can remain unchanged.
- Change the x-axis label from "Simulated data (runif[0,300], n=50)" to "RobberyRt (random sample, n=25)".

sampleplot <- StatesDataRdm25 %>%
ggplot(aes(x=RobberyRt)) +
coord_cartesian(expand = FALSE, ylim=c(0,15), xlim=c(0,310)) +
geom_histogram(fill="#21918c", bins=5) +
xlab("Sample RobberyRt (n=25)")
sampleplot
Because we loaded the patchwork package earlier, it is trivially easy to combine these plots into a single plot - just add the two objects together and patchwork will do its magic automatically! Type simrunifplot + sampleplot and hit run!

simrunifplot + sampleplot

Patchwork offers additional options for customizing a combined plot, such as adding (+) a plot_spacer(), specifying the plot layout by adding a plot_layout(), or including a title by adding plot_annotation(title="My Title"). We can also use / to indicate which plot(s) we want on top and which we want on bottom. For example, click the code box below to see two different ways to generate the same combined plot using patchwork:

# Illustrate patchwork with plot spacers, formatting, and adding a title
(simrunifplot + plot_spacer()) / (plot_spacer() + sampleplot) +
plot_annotation(
title = 'Histograms of RobberyRt (random sample, n=25) & Simulated Data (uniform(0,300), n=50)')
# Alternative way to generate same plot
# simrunifplot + plot_spacer() + plot_spacer() + sampleplot +
# plot_layout(ncol = 2) +
# plot_annotation(
# title = 'Histograms of RobberyRt (random sample, n=25) & Simulated Data (uniform(0,300), n=50)')
Finally, we could use the replicate() function to generate multiple simulated samples, plot them all, and combine all the plots with patchwork. This is essentially what we did earlier to generate the large figures with numerous plots. However, since we were creating many similar plots with minor differences between them, we also wrote a simple function (see code for the funhist() function below) to generate custom histograms with ggplot().

Below, we illustrate how to:

- Use replicate() in conjunction with your earlier code to generate twelve simulated data sets from the uniform(0,300) distribution.
- Use our custom function (funhist()) to plot each simulated dataset, in addition to plotting the real RobberyRt distributions from the population and our random sample.

Note: You do not need to do this step, but feel free to follow along if you want!
# Replicate 12 simulated samples from uniform(0,300) distribution
set.seed(1234)
simrunif12robrt <- replicate(n = 12,
expr = runif(n = 50,
min = 0,
max = 300),
simplify = TRUE) %>%
as_tibble()
# Alt code to simulate n=50 RobberyRt values from gamma(2,60) distribution
# set.seed(1234)
# simrgam12robrt <- replicate(n = 12,
# expr = rgamma(n = 50,
# shape = 2,
# scale = 60),
# simplify = TRUE) %>%
# as_tibble()
#create function to quickly plot custom histograms
#function inputs: data; variable to plot; number of bins for geom_histogram; y-axis max; x-axis label
funhist = function(dat,var,numbins,ymax,x_label){
dat %>%
ggplot(aes(x={{var}})) +
geom_histogram(fill="#21918c", bins=numbins) +
coord_cartesian(expand = FALSE, ylim=c(0,ymax), xlim=c(0,300)) +
xlab(x_label)
}
histpop <- funhist(StatesData, RobberyRt, 5, 30, "Pop RobRt (n=50)")
histsamp <- funhist(StatesDataRdm25, RobberyRt, 5, 15, "Samp RobRt (n=25)")
histv1runif <- funhist(simrunif12robrt, V1, 5, 30, "Sim1 (n=50)")
histv2runif <- funhist(simrunif12robrt, V2, 5, 30, "Sim2 (n=50)")
histv3runif <- funhist(simrunif12robrt, V3, 5, 30, "Sim3 (n=50)")
histv4runif <- funhist(simrunif12robrt, V4, 5, 30, "Sim4 (n=50)")
histv5runif <- funhist(simrunif12robrt, V5, 5, 30, "Sim5 (n=50)")
histv6runif <- funhist(simrunif12robrt, V6, 5, 30, "Sim6 (n=50)")
histv7runif <- funhist(simrunif12robrt, V7, 5, 30, "Sim7 (n=50)")
histv8runif <- funhist(simrunif12robrt, V8, 5, 30, "Sim8 (n=50)")
histv9runif <- funhist(simrunif12robrt, V9, 5, 30, "Sim9 (n=50)")
histv10runif <- funhist(simrunif12robrt, V10, 5, 30, "Sim10 (n=50)")
histv11runif <- funhist(simrunif12robrt, V11, 5, 30, "Sim11 (n=50)")
histv12runif <- funhist(simrunif12robrt, V12, 5, 30, "Sim12 (n=50)")
plot_spacer() + histpop + histsamp + plot_spacer() +
histv1runif + histv2runif + histv3runif +
histv4runif + histv5runif + histv6runif +
histv7runif + histv8runif + histv9runif +
histv10runif + histv11runif + histv12runif +
plot_layout(ncol = 4) +
plot_annotation(
title = 'Histograms of Robbery Rate: Population, Sample, & 12 Simulated Sets (Unif(0,300))' )
Compare the simulated plots to the real data:

- The Sim4, Sim9, and Sim12 plots resemble the population distribution ("Pop RobRt" plot).
- The Sim2, Sim5, and Sim10 plots are quite similar to the sample distribution ("Samp RobRt" plot).

Notice, too, the differences between the distributions of RobberyRt for the population (n=all 50 states) and our random sample (n=25 states). Were we to draw 12 random samples of n=50 values from the population of 50 RobberyRt values, we would likely see quite a bit of variability across those samples as well. (Feeling ambitious? Feel free to try it out if you want!)

Goal: Estimate Confidence Intervals around RobberyRt Sample Mean
After our detour into data simulation from probability distributions, we hope that “sampling variability” makes a little more sense to you now. Understanding sampling variability is essential to truly comprehending confidence intervals. Interval estimates are one way that we attempt to quantify and communicate the uncertainty caused by sampling variability in our statistical estimates of population parameters. After all, if our random samples from a population do not vary at all (imagine sampling from a population of identical units), then we would not have any uncertainty in our estimates and would not need to generate intervals to try to capture the true population parameters!
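To make sampling variability concrete, here is a small optional simulation (our own sketch, not part of the assignment): draw many samples of n=25 from a uniform(0,300) population and watch the sample means vary around the true mean of 150. The spread of those means closely tracks the theoretical standard error, sigma/sqrt(n):

```r
set.seed(1234)

# Draw 1000 samples of n=25 from uniform(0,300) and record each sample mean
sample_means <- replicate(1000, mean(runif(n = 25, min = 0, max = 300)))

# The means vary from sample to sample around the true mean of 150
summary(sample_means)

# Their spread approximates the theoretical standard error:
# the sd of uniform(0,300) is 300/sqrt(12) ~= 86.6, so se ~= 86.6/sqrt(25) ~= 17.3
sd(sample_means)
```

If the samples did not vary (imagine a population of identical values), sd(sample_means) would be zero and no interval estimate would be needed.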
For the last two sections of the assignment, you will calculate
confidence intervals using R. In this section, we will show you how to
estimate a 95% CI around the mean of RobberyRt
in our
randomly selected sample of 25 states. You will then follow this process
on your own to estimate a 95% CI around the mean of
AssaultRt
in our random sample. Then, in the last section,
we will show you how to estimate a 90% CI around a sample proportion
calculated from the AssaultRt
variable in our random
sample. You will then follow these same procedures on your own to
estimate 95% and 99% CIs around the same AssaultRt
proportion.
Estimate the RobberyRt Sample Mean

First, calculate the sample mean and standard deviation using mean(data$var) and sd(data$var), and assign these sample values for RobberyRt to objects. Remember, you will do this again for AssaultRt at the end of this section. We will also assign the sample size (n <- 25) to the object n; since our sample stays the same, we can use this object for this section and the next section.

Remember, the StatesDataRdm25 data object contains only the 25 randomly selected states. We are practicing interval estimation of population parameters from a random sample, so do not use the full dataset (StatesData)!
Sample Mean of RobberyRt (n=25 states)
sampmeanRobRt <- mean(StatesDataRdm25$RobberyRt)
sampmeanRobRt
## [1] 99.424
Sample SD of RobberyRt (n=25 states)
sampsdRobRt <- sd(StatesDataRdm25$RobberyRt)
sampsdRobRt
## [1] 57.07451
Sample Size
n <- 25
n
## [1] 25
Now that we have assigned our mean, standard deviation, and sample size as value objects, we need to identify the proper critical value before we can estimate a confidence interval around our sample mean.
Find the t critical value with the qt() function by typing: tcrit <- qt(p = .05/2, df = n-1, lower.tail = FALSE). Then, type tcrit to output the value and check to make sure it matches the book's value (t=2.064)! (Note: We also rounded the output to show the book's value and the R value is the same.)

The qt() function generates specific "quantiles" from the t distribution. With a two-tailed 95% CI, we want to specify 0.05 (i.e., 5%) of samples to fall outside in the tails or "critical region" of the t distribution.

Why did we divide by 2 to get that same t critical value from the book? We know this can be confusing. Try to remember that, for a two-tailed confidence interval, you need to divide the alpha level by two.
t critical value for two-tailed test, alpha=.05,
df=24
# Determine t critical value for alpha = .05 (to estimate 95% CI)
# For two-tailed test, divide alpha level by two (for both tails)!
tcrit <- qt(p = .05/2, df = n-1, lower.tail = FALSE)
tcrit
## [1] 2.063899
t critical value, rounded to 3 digits to compare with
book
round(tcrit, 3)
## [1] 2.064
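The two ways of asking qt() for this critical value are equivalent, which you can verify directly (a quick check of our own):

```r
n <- 25

# Upper-tail area of alpha/2 ...
tcrit_upper <- qt(p = .05/2, df = n - 1, lower.tail = FALSE)

# ... gives the same value as the 0.975 quantile (1 - alpha/2) from the lower tail
tcrit_quantile <- qt(p = 0.975, df = n - 1)

all.equal(tcrit_upper, tcrit_quantile)  # TRUE
round(tcrit_upper, 3)                   # 2.064
```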
We now have all the necessary inputs to plug into our formula for
estimating a 95% confidence interval around a sample mean with a small
sample. Though there are ways to automatically generate the standard
error of the mean, we will mimic hand calculation of a 95% CI for the
RobberyRt
variable by typing the formula in R and plugging
our assigned values into the formula. We hope this will help you better
understand how these interval estimates are calculated.
Estimate the standard error of the mean of RobberyRt using the formula in your book, which is se = tcrit*(sd/sqrt(n)), or the critical value multiplied by the variable's standard deviation divided by the square root of n. (Strictly speaking, sd/sqrt(n) is the standard error and multiplying it by tcrit yields the margin of error, but we follow the book's notation here.) In an R chunk, type sampseRobRt <- tcrit*(sampsdRobRt/sqrt(n)).

Recall, n is equal to the number of observations, which is 25 (i.e., 25 randomly selected states). tcrit is our t critical value. sampsdRobRt is the standard deviation of the RobberyRt variable in our random sample.

Alternatively, we could calculate this in a single step with the qt() function, as shown in the commented code below. There, we request the 0.975 quantile, which represents the proportion of samples that we want to fall within our estimated interval. You might notice that this quantile value is equivalent to 1 - alpha/2.
Sample Std Error of Mean of RobberyRt
sampseRobRt <- tcrit*(sampsdRobRt/sqrt(n))
sampseRobRt
## [1] 23.5592
#Alternative method for calculating standard error of mean in R using qt()
# qt(0.975,df=n-1)*sampsdRobRt/sqrt(n)
Now, calculate the lower and upper limits of the confidence interval. Type RobRt95CIlow <- sampmeanRobRt - sampseRobRt. Hit enter and type RobRt95CIhi <- sampmeanRobRt + sampseRobRt. Type each of the object names on subsequent lines to print their output values and run your code.
Lower bound of 95% confidence interval around sample mean
RobRt95CIlow <- sampmeanRobRt - sampseRobRt
RobRt95CIlow
## [1] 75.8648
Upper bound of 95% confidence interval around sample
mean
RobRt95CIhi <- sampmeanRobRt + sampseRobRt
RobRt95CIhi
## [1] 122.9832
95% CI Estimate around Sample Mean of RobberyRt
cat("[",RobRt95CIlow,",",RobRt95CIhi,"]")
## [ 75.8648 , 122.9832 ]
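As a cross-check, you can reproduce this interval from the printed summary statistics alone; with the raw data loaded, R's built-in t.test() would return the same interval automatically. A sketch of the summary-statistics route:

```r
# Reproduce the 95% CI from the printed summary statistics alone
m <- 99.424     # sample mean of RobberyRt
s <- 57.07451   # sample sd of RobberyRt
n <- 25         # sample size

moe <- qt(0.975, df = n - 1) * s / sqrt(n)
c(lower = m - moe, upper = m + moe)  # matches [75.8648, 122.9832]

# With the raw data loaded, t.test(StatesDataRdm25$RobberyRt)$conf.int
# would produce the same 95% interval.
```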
Congratulations! You just estimated a 95% confidence interval around the sample mean of RobberyRt from our 25 randomly selected states!

Since we have the full population data, let's also calculate the population mean of RobberyRt in the StatesData, then compare it to our 95% CI estimate around the sample mean.

Population Mean of RobberyRt

mean(StatesData$RobberyRt)

## [1] 104.584

The population mean (104.58) falls within our 95% CI estimate of [75.86, 122.98]. Now, repeat this process on your own for AssaultRt and answer questions 16-19.
Goal: Understand how to appropriately interpret a frequentist confidence interval
Now that you have estimated a 95% confidence interval around the sample mean of RobberyRt, we want to discuss how to appropriately interpret these intervals before we estimate any more of them.
You may recall us mentioning earlier that frequentist confidence intervals are routinely misinterpreted. So, how do we appropriately interpret them? Well, we are glad you asked! Here is an appropriate interpretation of the 95% CI that we just estimated:
CORRECT INTERPRETATION: We are 95% confident that the true population mean robbery rate is between 75.86 and 122.98. By 95% confident, we mean that over the long run (and assuming our research design and statistical model are sound), the same procedures would result in estimates and confidence intervals that capture the true population mean in 95% of trials.
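The "confident in the method" idea can be demonstrated by simulation (an optional sketch of our own, using a hypothetical population): repeatedly draw samples from a population with a known mean, build a 95% CI from each, and count how often the intervals capture the truth. The long-run capture rate should be close to 95%:

```r
set.seed(1234)
true_mean <- 100  # a known mean for a hypothetical population

# For each of 2000 samples of n=25, build a 95% CI and check whether it captures true_mean
captured <- replicate(2000, {
  x   <- rnorm(n = 25, mean = true_mean, sd = 57)
  moe <- qt(0.975, df = 24) * sd(x) / sqrt(25)
  (mean(x) - moe) <= true_mean && true_mean <= (mean(x) + moe)
})

mean(captured)  # long-run capture rate; should be close to 0.95
```

Any single interval either captures the true mean or it does not; the 95% describes the procedure's long-run performance, not a probability attached to one interval.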
In addition to knowing the correct interpretation, it is worth recognizing a few common misinterpretations. Though the distinctions at times appear to be quite subtle, each of the following is technically a misinterpretation:

- "There is a 95% probability (or chance) that the true population mean falls within this particular interval."
- "95% of all possible sample means fall within our interval."
- "95% of the population's values fall within our interval."

Remember, different random samples would have generated different RobberyRt means, standard deviations, and 95% CIs!

We illustrate sampling variability - and CI variability - in the figure below, in which we plotted the mean and 95% CI from our sample (sample #1 at the bottom) along with means and 95% CIs from eight other samples of n=25 states each randomly drawn from StatesData using a different random starting seed. As you can see, all nine (ours + 8 new samples) generated different point estimates (dots = sample means) and interval estimates (error bars = 95% CIs around sample means). In this particular case, all nine of the estimated 95% confidence intervals managed to capture the true population parameter (vertical line = population mean of RobberyRt).
#generate eight more random samples of n=25 states from StatesData
set.seed(1)
samp2 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(21)
samp3 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(31)
samp4 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(41)
samp5 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(51)
samp6 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(61)
samp7 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(71)
samp8 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(81)
samp9 <- StatesData %>% sample_n(25, replace = FALSE)
#create function to calculate stnd error of mean for 95% CI
fun95se = function(dat){
(qt(0.975,df=24)*sd({{dat}}$RobberyRt)/sqrt(25))
}
#create tibble with sample, mean(RobberyRt), and se(mean)
statesamps <- tribble(
~sample, ~mean, ~se,
"1", mean(StatesDataRdm25$RobberyRt), fun95se(StatesDataRdm25),
"2", mean(samp2$RobberyRt), fun95se(samp2),
"3", mean(samp3$RobberyRt), fun95se(samp3),
"4", mean(samp4$RobberyRt), fun95se(samp4),
"5", mean(samp5$RobberyRt), fun95se(samp5),
"6", mean(samp6$RobberyRt), fun95se(samp6),
"7", mean(samp7$RobberyRt), fun95se(samp7),
"8", mean(samp8$RobberyRt), fun95se(samp8),
"9", mean(samp9$RobberyRt), fun95se(samp9)
)
statesamps <- statesamps %>%
mutate(
CI95low = mean - se,
CI95hi = mean + se
)
ggplot(statesamps, aes(mean, sample)) +
geom_point() +
geom_errorbar(aes(xmin = CI95low, xmax = CI95hi, color=sample)) +
geom_vline(xintercept = mean(StatesData$RobberyRt), linetype="dashed",
color = "#d44842", size=1) +
geom_segment(aes(x = 130, y = 1.3, xend = 124.5, yend = .9),
arrow = arrow(length = unit(0.5, "cm")), color="#365c8d") +
annotate("text", x=130, y=2.2, label="95% CI\nfor our\nfirst sample", angle=0, color="#365c8d") +
geom_segment(aes(x = 113, y = 9.5, xend = 106, yend = 9.4),
arrow = arrow(length = unit(0.5, "cm")), color="#d44842") +
annotate("text", x=120, y=9.5, label="Population mean", angle=0, color="#d44842") +
# scale_color_viridis_d() +
scale_colour_manual(values=c('#365c8d','#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d')) +
theme_classic() +
theme(legend.position = "none")
#https://statisticsglobe.com/draw-plot-with-confidence-intervals-in-r
A more efficient approach would have been to use replicate() to generate and assign numerous samples to a single combined data object or to a list of separate objects, then calculate our estimates and generate a plot from that single data object or list. (We will illustrate some more efficient approaches in the next assignment.) Instead, we generated eight more samples and assigned them to separate data objects. We then created a single tibble containing the sample mean and standard error from each of our nine samples, calculated 95% upper and lower CI limits for each sample in our new tibble, and then plotted our means and 95% CIs from this tibble. Our hope is that you can more easily follow our logic with this explicit approach.

The code also demonstrates how to plot means and error bars with ggplot() + geom_point() + geom_errorbar(), as well as how to add other elements like a vertical line (+ geom_vline()), an arrow (+ geom_segment()), or text (+ annotate()) to a ggplot() object. While we do not expect you to reproduce this plot, please do not let us thwart your ambitions!

Goal: Estimate Confidence Intervals around AssaultRt Sample Proportion
In the last part of this assignment, you will use what you have
learned to answer questions 20-22 of Assignment 9, which ask you to
estimate 95% and 99% CIs around the proportion of states with an assault
rate (AssaultRt
) higher than 200 assaults in 100k in our
random sample of 25 states. We will illustrate the process by estimating
a 90% CI around this same proportion.
Remember, we previously used the mean and sd to first estimate the standard error of the mean, which we then used to calculate the lower and upper limits of the 90% CI. We follow a similar process for a proportion, though the formula and inputs are different.
On that topic, you may have noticed that your book only provides a formula for calculating a confidence interval around a sample proportion with large samples, and it does not have an equivalent formula for small samples. What gives? Well, it turns out that these types of interval estimates are pretty error-prone with small samples, and neither z nor t distributions are really very useful for generating valid intervals around proportions with small samples. Yet, the book asks you to do this anyway, so we will plug our nose and move forward using the formula for large samples, which relies on the z distribution.
Estimate the AssaultRt Sample Proportion

First, we need to know the proportion of the n=25 sampled states with an assault rate greater than 200. We can use the count() function to answer this question directly in R. We show you how to do this in the code below. Then, assign the resulting proportion (16/25 = 0.64) to an object named pAstRtgt200, which stands for "proportion AssaultRt greater than 200".
Sample Proportion of States w/Assault Rate >200
StatesDataRdm25 %>% count(AssaultRt > 200)
## # A tibble: 2 × 2
## `AssaultRt > 200` n
## <lgl> <int>
## 1 FALSE 9
## 2 TRUE 16
pAstRtgt200 <- 0.64
pAstRtgt200
## [1] 0.64
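Rather than typing 0.64 by hand, you can let R compute the proportion directly: taking the mean() of a logical vector returns the proportion of TRUE values. A small self-contained sketch (with made-up rates for illustration; in the assignment you would use your StatesDataRdm25 column):

```r
# Hypothetical assault rates for illustration only
rates <- c(250, 150, 300, 100, 220, 180)

# mean() of a logical vector returns the proportion of TRUEs
prop_gt200 <- mean(rates > 200)
prop_gt200  # 3 of the 6 values exceed 200, so 0.5

# The equivalent for the assignment data would be:
# pAstRtgt200 <- mean(StatesDataRdm25$AssaultRt > 200)
```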
Next, identify the critical value. This time we will use qnorm() instead of qt(); it is a quantile function that we can use to identify critical values from the standard normal (z) distribution rather than the t distribution.

We will illustrate using qnorm() to find the critical value for a 90% CI (two-tailed), though you will be doing this again on your own to estimate 95% and 99% confidence intervals. For a two-tailed 90% CI, alpha = .10, and dividing alpha by two leaves 0.05 in each tail; thus, we specify the 0.95 quantile in our code instead.

Use qnorm() to identify the appropriate z critical value for a 90% CI and assign it to an object named zcrit. That value is 1.64.

When estimating the 95% and 99% CIs around the AssaultRt sample proportion, remember to identify the appropriate z critical values corresponding to the specified confidence level!
Critical z value for two-tailed 90% CI
zcrit <- qnorm(0.95)
zcrit
## [1] 1.644854
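The same qnorm() pattern generalizes to any confidence level: for a two-tailed interval, request the 1 - alpha/2 quantile. A short sketch of our own (the helper function name is ours, not the assignment's) shows the pattern so you can reuse it for the 95% and 99% exercises:

```r
# z critical value for a two-tailed CI at a given confidence level
zcrit_for <- function(conf_level) {
  alpha <- 1 - conf_level
  qnorm(1 - alpha / 2)
}

round(zcrit_for(0.90), 3)  # 1.645 (matches the value above)
```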
5. We now have what we need to estimate a confidence interval around our sample proportion.

a. Recall, the formula for the standard error of the proportion (aka, the standard error of the sampling distribution of the proportion) is: se = zcrit*sqrt(p*(1-p)/n)

b. In an R chunk, input our assigned objects into the formula by typing sampseAstRtprop <- zcrit*sqrt(pAstRtgt200*(1-pAstRtgt200)/n).

- This code estimates the standard error by multiplying our critical value by the square root of the proportion (.64) multiplied by its complement (or 1 - p) and divided by n.
- The standard error estimate is then assigned to a new value object called sampseAstRtprop (named differently so we do not confuse it with the sample standard error for the mean of RobberyRt; the name refers to the sample standard error of the Assault Rate proportion).
Standard error of Assault Rate proportion
sampseAstRtprop <- zcrit*sqrt(pAstRtgt200*(1-pAstRtgt200)/n)
sampseAstRtprop
## [1] 0.1579059
6. With the standard error of the proportion estimated and assigned to an object, we can now calculate the 90% confidence interval around our sample proportion.

a. As we did in the previous section, we do this by simply subtracting from and adding to the sample proportion to calculate the lower and upper limits, respectively, for our 90% confidence interval around the AssaultRt sample proportion.

b. Type AstProp90CIlow <- pAstRtgt200 - sampseAstRtprop and AstProp90CIhi <- pAstRtgt200 + sampseAstRtprop to determine the 90% CI of the assault rate proportion for our 25 states.
90% CI lower limit for AssaultRt proportion
AstProp90CIlow <- pAstRtgt200 - sampseAstRtprop
AstProp90CIlow
## [1] 0.4820941
90% CI upper limit for AssaultRt proportion
AstProp90CIhi <- pAstRtgt200 + sampseAstRtprop
AstProp90CIhi
## [1] 0.7979059
90% CI Estimate around AssaultRt Sample Proportion
cat(paste0("[",round(AstProp90CIlow, 3),", ", round(AstProp90CIhi, 3),"]"))
## [0.482, 0.798]
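As the text notes, the large-sample (Wald) formula used here is error-prone with small n. For comparison, base R's prop.test() computes a Wilson score interval, which generally behaves better in small samples. This optional aside (our own, not required by the assignment) shows the two side by side:

```r
# Wald 90% CI (the hand calculation above)
p <- 16 / 25
z <- qnorm(0.95)
wald <- c(p - z * sqrt(p * (1 - p) / 25), p + z * sqrt(p * (1 - p) / 25))
round(wald, 3)  # [0.482, 0.798]

# Wilson score 90% CI from base R (correct = FALSE turns off the continuity correction)
wilson <- prop.test(x = 16, n = 25, conf.level = 0.90, correct = FALSE)$conf.int
round(wilson, 3)  # slightly narrower and pulled toward 0.5
```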
- Congratulations! You just calculated a 90% confidence interval estimate for the proportion of states with an assault rate greater than 200.
- Our lower bound estimate is 0.48 and our upper bound estimate is 0.8. That means our 90% CI around the sample proportion is [0.48, 0.8].
- We can interpret this interval as follows: We are 90% confident that the true population proportion of states with assault rates greater than 200 is between 0.48 and 0.8.
- By 90% confident, we mean that over the long run the same procedures would result in estimates and confidence intervals that capture the true population proportion in 90% of trials, assuming our research design and statistical model (including the large sample formula) are appropriate.
Population Proportion of States w/Assault Rate
>200
StatesData %>% count(AssaultRt > 200)
## # A tibble: 2 × 2
## `AssaultRt > 200` n
## <lgl> <int>
## 1 FALSE 21
## 2 TRUE 29
popAstRtprop <- 29/50
popAstRtprop
## [1] 0.58
- The population proportion is 0.58, which falls within the 90% confidence interval we estimated around our sample proportion (p=0.64, 90% CI=[0.48, 0.8]). It seems our random sample was one of the 90% of samples we might have drawn that would be expected to generate 90% confidence intervals that capture the true population proportion. Again, we only know this because we have the full population data - in most research situations, we do not have this luxury.
Now, repeat these procedures on your own for the AssaultRt sample proportion and answer questions 20-22 of Assignment 9. Remember to use the appropriate z critical value for each confidence level when calculating your standard error (i.e., your sampseAstRtprop value) and your lower and upper confidence limits (i.e., AstProp90CIlow and AstProp90CIhi).

Before moving on, check your understanding. Do you recall how to:

- draw a random sample with dplyr::sample_n()?
- set a reproducible random number seed with set.seed()?
- subset data with dplyr::filter() and the %in% operator?
- create a list of values with listname <- c(item1, item2)?
- simulate data with rnorm(), truncnorm::rtruncnorm(), or runif()?
- combine multiple ggplot() plots into a single figure using the "patchwork" package?
- identify critical values with qt() or qnorm()?
- plot means and error bars with ggplot() + geom_point() + geom_errorbars?
- add vertical line (+ geom_vline()), arrow (+ geom_segment), or text (+ annotate()) elements to a ggplot() object?