Assignment 7 Objectives

The purpose of this seventh assignment is to help you use R to complete some of the SPSS Exercises from the end of Chapter 7 in Bachman, Paternoster, & Wilson’s Statistics for Criminology & Criminal Justice, 5th Ed.

In the last assignment, you were introduced to frequentist empirical probabilities and probability distributions, including the standard normal (z) distribution. You also learned how these are essential to null hypothesis significance testing, which is the process by which most social scientists make inferences from sample descriptive statistics to population parameters. Finally, you practiced various steps in this process, such as calculating frequentist probabilities from crosstabs, calculating z-scores from raw observation values, and even conducting a basic binomial hypothesis test.

In Assignment 7, we will dig deeper into the process of making statistical inferences about population parameters from sample statistics. For instance, you will learn to think about sample descriptive statistics (e.g., a sample mean or correlation coefficient) as point estimates of population parameters. Moreover, while point estimates may represent our best or most likely guess about the value of a population parameter, a point estimate usually is just one among many potential estimates of a population parameter that are plausible under our data and modeling assumptions. In fact, we can use our knowledge of probability to calculate an interval estimate, such as a frequentist 95% confidence interval, around a point estimate.

Pragmatically, a confidence interval is a range of plausible estimates that quantifies the degree of uncertainty we have in our estimate of a population parameter. Learning to think about sample-derived estimates of population parameters as (uncertain) interval estimates rather than as (fixed and certain) point estimates is essential to truly understanding the process of statistical inference. After all, we rarely know the true population parameters and, thus, the seductive tendency to view sample statistics as “real” and known calculations of population values is misguided and prone to inference errors.

Unfortunately, frequentist confidence intervals (like their p-value cousins) are notoriously misunderstood and misinterpreted. For instance, it is quite common to read or hear people explain a 95% confidence interval like this: “We can be 95% certain that the true population parameter we seek falls within the 95% confidence interval we constructed around our sample point estimate.” Sadly, this is an inaccurate interpretation. When we calculate a frequentist 95% confidence interval, we are not 95% confident in the calculated interval itself. Rather, we are confident in the method that we used to generate confidence intervals.

What does this mean? Well, do you recall that long run or “sampling” notion of probability we discussed in the last assignment? We need to keep that in mind when interpreting a frequentist confidence interval around a point estimate. Now, imagine we calculated a sample statistic and used it as a point estimate to infer a population parameter value, and we also calculated a 95% confidence interval to quantify uncertainty around our estimate. Next, let’s hypothetically imagine we were to repeatedly follow the same sampling procedures a large number of times, calculating a point estimate and 95% confidence interval the same way each time. In what are we 95% confident? Assuming our data and modeling assumptions are appropriate, we are confident that the interval estimates we calculate in a large number of repeated trials would capture the true population parameter 95% of the time.

Of course, this means that we also expect our 95% confidence intervals would fail to capture the population parameter 5% of the time on repeated trials. Moreover, when we only calculate a confidence interval from a single sample, we cannot know whether we have one of the 95% of parameter-catching interval nets that effectively captured the true population value or, instead, whether we happened to get one of the 5% of atypical interval nets that failed to capture the true population parameter.

Finally, want to catch more parameters? Like capturing Pokemon, you just need to get a bigger or better net! For instance, we can improve our expected parameter-capturing rate (e.g., to 99%) by casting a wider interval net (i.e., by calculating wider confidence intervals). By widening our interval net, we exchange greater imprecision/uncertainty for the prospect of making fewer inference errors. We can avoid this trade-off by getting a better net - that is, by improving features of our research design (e.g., more precise measurement; larger sample sizes).

In the current assignment, you will learn how to calculate confidence intervals around a point estimate in R and to interpret them appropriately. Additionally, you will learn how to simulate data from a probability distribution, which should help you better understand sampling variability and the need for interval estimates. As with previous assignments, you will be using R Markdown (with R & RStudio) to complete and submit your work.

By the end of assignment #7, you should…

  • be able to select random sample from data without or with replacement using dplyr::sample_n()
  • know how to improve reproducibility of randomization tasks in R by setting the random number generator seed using set.seed()
  • be able to select data with conditions using dplyr::filter() and %in% operator
  • know how to create a list or vector and assign to object, such as listname <- c(item1, item2)
  • recognize how to simulate data from normal, truncated normal, or uniform probability distributions using rnorm(), truncnorm::rtruncnorm(), or runif()
  • recognize how one might write a custom function to repeatedly generate similar plots
  • be able to combine multiple ggplot() plots into a single figure using “patchwork” package
    • know how to customize plot layout and add title to patchwork figure
  • understand how confidence intervals are used to quantify uncertainty caused by sampling variability
    • be able to estimate a two-tailed confidence interval around a sample mean or proportion in R and interpret it appropriately
    • be able to identify t or z critical values associated with a two-tailed confidence level using qt() or qnorm()
    • know how to estimate the standard error of a sample mean or proportion in R
    • understand how to properly interpret and avoid common misinterpretations of confidence intervals
  • recognize how one might plot means and confidence intervals using ggplot() + geom_point() + geom_errorbars
  • recognize how one might add elements like vertical lines (+ geom_vline()), arrows (+ geom_segment), or text (+ annotate()) elements to a ggplot() object

Assumptions & Ground Rules

We are building on objectives from Assignments 1-6. By the start of this assignment, you should already know how to:

Basic R/RStudio skills

  • create an R Markdown (RMD) file and add/modify text, level headers, and R code chunks within it
  • knit your RMD document into an HTML file that you can then save and submit for course credit
  • install/load R packages and use hashtags (“#”) to comment out sections of R code so it does not run
  • recognize when a function is being called from a specific package using a double colon with the package::function() format
  • read in an SPSS data file in an R code chunk using haven::read_spss() and assign it to an R object using an assignment (<-) operator
  • use the $ symbol to call a specific element (e.g., a variable, row, or column) within an object (e.g., dataframe or tibble), such as with the format dataobject$varname
  • use a tidyverse %>% pipe operator to perform a sequence of actions
  • recognize the R operator != as “not equal to”
  • turn off or change scientific notation in R, such as options(scipen=999, digits = 3)
  • recognize that you can create your own R functions (e.g., our funxtoz() function) - and that doing so is recommended for duplicate tasks to avoid copy-and-paste errors

Reproducibility

  • use here() for a simple and reproducible self-referential file directory method
  • Use groundhog.library() as an optional but recommended reproducible alternative to library() for loading packages

Data viewing & wrangling

  • use the base R head() function to quickly view a snapshot of your data
  • use the glimpse() function to quickly view all columns (variables) in your data
  • use sjPlot::view_df() to quickly browse variables in a data file
  • use attr() to identify variable and attribute value labels
  • recognize when missing values are coded as NA for variables in your data file
  • remove missing observations from a variable in R when appropriate using filter(!is.na(var))
  • change a numeric variable to a factor (e.g., nominal or ordinal) variable with haven::as_factor()
  • drop an unused factor level (e.g., missing “Don’t know” label) on a variable using data %>% droplevels(data$variable)
  • select and recode variables using dplyr’s select(), mutate(), and if_else() functions
  • convert raw column (variable) values into standardized z-score values using mutate()

Descriptive data analysis

  • use summarytools::dfsummary() to quickly describe one or more variables in a data file
  • create frequency tables with sjmisc:frq() and summarytools::freq() functions
  • sort frequency distributions (lowest to highest/highest to lowest) with summarytools::freq()
  • calculate measures of central tendency for a frequency distribution
    • calculate central tendency using base R functions mean() and median() (e.g., mean(data$variable))
    • calculate central tendency and other basic descriptive statistics for specific variables in a dataset using summarytools::descr() and psych::describe() functions
  • calculate measures of dispersion for a variable distribution
    • calculate dispersion measures by hand from frequency tables you generate in R
    • calculate some measures of dispersion (e.g., standard deviation) directly in R (e.g., with sjmisc:frq() or summarytools::descr())
  • recognize and read the basic elements of a contingency table (aka crosstab)
    • place IV in columns and DV in rows of a crosstab
    • recognize column/row marginals and their overlap with univariate frequency distributions
    • calculate marginal, conditional, and joint (frequentist) probabilities
    • compare column percentages (when an IV is in columns of a crosstab)
  • generate and modify a contingency table (crosstab) in R with dplyr::select() & sjPlot::sjtab(depvar, indepvar) or with crosstable(depvar, by=indepvar)
  • conduct a bimomial hypothesis test in R with rstatix::binom_test()

Data visualization & aesthetics

  • improve some knitted tables by piping a function’s results to gt() (e.g., head(data) %>% gt())
    • modify elements of a gt() table, such as adding titles/subtitles with Markdown-formatted (e.g., **bold** or *italicized*) fonts
  • create basic graphs using ggplot2’s ggplot() function
    • generate simple bar charts and histograms to visualize shape and central tendency of a frequency distribution
    • generate boxplots using base R boxplot() and ggplot() to visualize dispersion in a data distribution
  • modify elements of a ggplot object
    • change outline and fill colors in a ggplot geometric object (e.g., geom_boxplot()) by adding fill= and color= followed by specific color names (e.g., “orange”) or hexidecimal codes (e.g., “#990000” for crimson; “#EDEBEB” for cream)
    • add or change a preset theme (e.g., + theme_minimal()) to a ggplot object to conveniently modify certain plot elements (e.g., white background color)
    • select colors from a colorblind accessible palette (e.g., using viridisLite::viridis()) and specify them for the outline and fill colors in a ggplot geometric object (e.g., geom_boxplot())
    • add a title (and subtitle or caption) to a ggplot object by adding a label with the labs() function (e.g., + labs(title = "My Title"))

Hypothesis testing & statistical inference

  • conduct and interpret a null hypothesis significance test
    • specify null (test) hypothesis & identify contrasting alternative hypothesis (or hypotheses)
    • set an alpha or significance level (e.g., as risk tolerance or false positive error control rate)
    • calculate a test statistic and corresponding p-value
    • compare a test p-value to alpha level and then determine whether the evidence is sufficient to reject the null hypothesis or should result in a failure to reject the null hypothesis


If you do not recall how to do these things, review Assignments 1-6.

Additionally, you should have read the assigned book chapter and reviewed the SPSS questions that correspond to this assignment, and you should have completed any other course materials (e.g., videos; readings) assigned for this week before attempting this R assignment. In particular, for this week, I assume you understand:

  • population parameters and sample statistics
  • point estimates and confidence intervals
    • confidence level (e.g., 95% or 99%)
    • confidence limits (e.g., lower & upper)
  • z and Student’s t distributions
  • degrees of freedom
  • standard error of the mean & standard error of the proportion
  • how to calculate a confidence interval by hand around a sample mean with large or small samples
  • how to calculate a confidence interval by hand around a sample proportion with large samples

As noted previously, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).


Part 1 (Assignment 7.1)

Goal: Read in 2012 States data & calculate population mean for ‘RobberyRt’

(Note: Remember that, when following instructions, always substitute “LastName” for your own last name and substitute YEAR_MO_DY for the actual date. E.g., 2022_06_23_Fordham_K300Assign7)

In previous assignments, you learned how to calculate descriptive statistics like the sample mean of a data column in R. In the current assignment, you will use sample statistics (e.g., sample mean or proportion) as point estimates to infer population parameter values. Likewise, you will learn to calculate the standard error of the estimate and a confidence interval around the estimate using R, as well as how to interpret these interval estimates appropriately.

  1. Go to your K300_L folder, which should contain the R Markdown file you created for Assignment 6 (named YEAR_MO_DY_LastName_K300Assign6). Click to open the R Markdown file.
    1. Remember, we open RStudio in this way so the here package will automatically set our K300_L folder as the top-level directory.
    2. In RStudio, open a new R Markdown document. If you do not recall how to do this, refer to Assignment 1.
    3. The dialogue box asks for a Title, an Author, and a Default Output Format for your new R Markdown file.
    4. In the Title box, enter K300 Assignment 7.
    5. In the Author box, enter your First and Last Name (e.g., Tyeisha Fordham).
    6. Under Default Output Format box, select “HTML document” (HTML is usually the default selection)

  2. Remember that the new R Markdown file contains a simple pre-populated template to show users how to do basic tasks like add settings, create text headings and text, insert R code chunks, and create plots. Be sure to delete this text before you begin working.
    1. Create a second-level header titled: “Part 1 (Assignment 7.1).” Then, create a third-level header titled: “Read in 2012 States Data”
    2. This assignment must be completed by the student and the student alone. To confirm that this is your work, please begin all assignments with this text: This R Markdown document contains my work for Assignment 7. It is my work and only my work.
    3. Now, you need to get data into RStudio. You already know how to do this, but please refer to Assignment 1 if you have questions.

  3. Create a third-level header in R Markdown (hereafter, “RMD”) file titled: “Load Libraries”
    1. Insert an R chunk.
    2. Inside the new R code chunk, load the following packages: tidyverse, haven, here, sjmisc, sjPlot, gt, and truncnorm.
      • This is our first time using the truncnorm package, which means you’ll need to download it first using install.packages() or the “Install Packages” button under the “Tools” tab. Then, you can use library() to load it in. We will use this package to simulate data from a truncated normal distribution.
      • Recall, you only need to install packages one time. However, you must load them each time you start a new R session. Also, remember that you can optionally use (and we recommend) groundhog.library() to improve the reproducibility of your script.

  4. After your first code chunk, create another third-level header in RMD titled: “Read Data into R”
    1. Insert another R code chunk.
    2. In the new R code chunk, read the “2012StatesData.sav” SPSS datafile into R and assign it to an R object named StatesData.
      • Forget how to do this? Refer to any of the previous assignments.
    3. In the same code chunk, on a new line below your read data/assign object command, type the name of your new R data object: StatesData.
      • This will call the object and provide a brief view of the data. (Recall that you can get a similar but more visually appealing view by simply clicking on the object in the “Environment” window. Also, be sure to put these functions on separate lines or they will not run properly.)
      • Note: We used the head() function (piped to gt()) to print the first five rows so you can compare your output.
StatesData <- read_spss(here("Datasets", "2012StatesData.sav"))
StatesData %>% head() %>% gt()
State Number1824 NumberRural State in South Four regional categories Percent in different house in 2008 Cigarette Tax in pennies Smoke Free workplace laws Implemented Tobacco related death rate per 100K Percent of Adults who smoke total population in Thousands Divorces per 1K population Percent of Families below poverty Percent Individuals below poverty Median Income for Households Percent of Population without health insurance Murder Rate per 100K Robbery Rate per 100K Assault Rate per 100K Burglary Rate per 100K Motor Vehicle Rate per 100K Infant Mortality Rat per 1,000 Live Births Cardiovascular Death Rate per 100K Cancer Death Rate Per 100K Percent of Pop With Bachelor's degree or more PercentRural Percent18to24 ID Binary for assault rate where 0= less than 200 assaults per 100k and 1=more than 200 assaults per 100k Murder rate categorical
Alabama 335 1981 1 1 15.1 0.425 0 317.5 22 4780 4.4 13.4 17.5 40489 16.9 7.1 142.5 277.5 1058.9 244.8 9.9 235.5 197.3 22.0 41.443515 7.008368 1 1 2
Alaska 54 216 0 2 21.5 2.000 0 270.4 22 710 4.4 6.2 9.0 66953 17.7 3.2 94.0 462.0 514.2 241.5 6.5 147.9 179.9 26.6 30.422535 7.605634 2 1 1
Arizona 443 607 0 2 19.7 2.000 1 247.4 16 6392 3.5 11.6 16.5 48745 19.6 5.5 123.9 261.1 817.3 397.1 6.8 152.5 152.8 25.6 9.496245 6.930538 3 1 1
Arkansas 200 1269 1 1 17.6 1.150 1 323.7 22 2916 5.7 14.8 18.8 37823 19.2 6.3 93.5 381.8 1224.1 215.6 7.7 221.8 200.4 18.9 43.518519 6.858711 4 1 2
California 2766 1882 0 2 15.6 0.870 0 235.0 14 37254 0.0 10.6 14.2 58931 20.0 5.4 173.7 270.8 622.1 443.6 5.2 177.9 161.7 29.9 5.051807 7.424706 5 1 1
Colorado 349 668 0 2 18.2 0.840 1 237.6 18 5029 4.2 8.9 12.9 55430 15.3 3.2 67.9 224.5 532.5 250.6 6.1 145.3 153.7 35.9 13.282959 6.939749 6 1 1


  1. Now, generate a frequency table of the variable ‘RobberyRt’, then use it to determine the population mean and standard deviation of ‘RobberyRt’ for all 50 states.
    1. Create a third-level header titled: “Frequency Table of ‘RobberyRt’”
    2. Insert an R chunk
    3. As you know, there are various ways to generate a frequency table. For instance, you can use sjmisc::frq() by typing StatesData %>% frq(RobberyRt) to create frequency table that also provides the mean and standard deviation.
    4. By now, you should also know how to calculate the population mean by hand. Use the information you generated to answer question 14 on Quiz 7.
      • Remember, you can also quickly determine the mean and standard deviation of a variable using base R functions by typing mean(StatesData$RobberyRt) and sd(StatesData$RobberyRt).
      • Note: Since these data contain information from all 50 states, the descriptive statistics you generate actually represent population parameters and not sample statistics!
StatesData %>%
  frq(RobberyRt) 
## Robbery Rate per 100K (RobberyRt) <numeric> 
## # total N=50 valid N=50 mean=104.58 sd=57.74
## 
##  Value | N | Raw % | Valid % | Cum. %
## -------------------------------------
##  14.30 | 1 |     2 |       2 |   2.00
##  14.90 | 1 |     2 |       2 |   4.00
##  16.50 | 1 |     2 |       2 |   6.00
##  17.20 | 1 |     2 |       2 |   8.00
##  18.00 | 1 |     2 |       2 |  10.00
##  22.90 | 1 |     2 |       2 |  12.00
##  30.30 | 1 |     2 |       2 |  14.00
##  37.20 | 1 |     2 |       2 |  16.00
##  42.20 | 1 |     2 |       2 |  18.00
##  47.30 | 1 |     2 |       2 |  20.00
##  56.20 | 1 |     2 |       2 |  22.00
##  65.30 | 1 |     2 |       2 |  24.00
##  66.70 | 1 |     2 |       2 |  26.00
##  67.90 | 1 |     2 |       2 |  28.00
##  70.40 | 1 |     2 |       2 |  30.00
##  74.50 | 1 |     2 |       2 |  32.00
##  74.70 | 1 |     2 |       2 |  34.00
##  79.50 | 1 |     2 |       2 |  36.00
##  80.20 | 1 |     2 |       2 |  38.00
##  86.80 | 1 |     2 |       2 |  40.00
##  87.70 | 1 |     2 |       2 |  42.00
##  92.90 | 1 |     2 |       2 |  44.00
##  93.50 | 1 |     2 |       2 |  46.00
##  94.00 | 1 |     2 |       2 |  48.00
##  98.70 | 1 |     2 |       2 |  50.00
## 103.40 | 1 |     2 |       2 |  52.00
## 113.60 | 1 |     2 |       2 |  54.00
## 114.10 | 1 |     2 |       2 |  56.00
## 117.30 | 1 |     2 |       2 |  58.00
## 123.90 | 1 |     2 |       2 |  60.00
## 126.00 | 1 |     2 |       2 |  62.00
## 126.50 | 1 |     2 |       2 |  64.00
## 127.10 | 1 |     2 |       2 |  66.00
## 129.40 | 1 |     2 |       2 |  68.00
## 131.60 | 1 |     2 |       2 |  70.00
## 133.70 | 1 |     2 |       2 |  72.00
## 142.30 | 1 |     2 |       2 |  74.00
## 142.40 | 1 |     2 |       2 |  76.00
## 142.50 | 1 |     2 |       2 |  78.00
## 144.50 | 1 |     2 |       2 |  80.00
## 153.30 | 1 |     2 |       2 |  82.00
## 153.60 | 1 |     2 |       2 |  84.00
## 157.00 | 1 |     2 |       2 |  86.00
## 166.80 | 1 |     2 |       2 |  88.00
## 167.60 | 1 |     2 |       2 |  90.00
## 173.70 | 1 |     2 |       2 |  92.00
## 189.70 | 1 |     2 |       2 |  94.00
## 210.70 | 1 |     2 |       2 |  96.00
## 228.00 | 1 |     2 |       2 |  98.00
## 260.70 | 1 |     2 |       2 | 100.00
##   <NA> | 0 |     0 |    <NA> |   <NA>

Part 2 (Assignment 7.2)

Goal: Randomly Select 25 States from Dataset

As noted in the previous section, this is one of the relatively rare instances where we have data from an entire population - in this case, from all 50 states. Since the population parameters (e.g., mean and standard deviation) are known to us, we can illustrate the logic and process of statistical inference by randomly sampling from the population, generating sample point and interval estimates, and then comparing these sample estimates to the known population parameters. (Note: Typically, we cannot do this last step as we rarely know the population parameters - that is why we use samples to estimate them!)

So, you will start randomly select rows of data from 25 of the 50 states (rows) in the 2012 States dataset. In doing so, you will learn about the set.seed() function, which is essential for reproducibility as it will allow all of us to select the same random sample rather than generating a unique random sample every time our code is run.

After drawing a random sample of 25 states, we will take a detour to introduce nonrandom selection of data subsets followed by data simulation. We will then revisit the random sample you generate in this section to complete the last part of the assignment.

  1. First, you will randomly select 25 states from the dataset. Later in the assignment, we will use this random sample to calculate confidence intervals around sample statistics.
    1. Create a second-level header titled: “Part 2 (Assignment 7.2)” and then a third-level header titled: “Randomly Select 25 States from Dataset”
    2. Insert an R chunk.

  2. Type set.seed(1234) and hit enter. Then, type StatesDataRdm25 <- StatesData %>% sample_n(25, replace = FALSE)
    1. The function set.seed() allows us to make our “random” sample easier to reproduce. Without it, R will generate a new and unique random sample each time you run your sampling code, which can cause major reproducibility problems.
      • For instance, imagine a team of researchers are working together. The lead researcher sends an R document containing code to generate a random sample to a collaborator, followed by code to calculate sample statistics and text that interprets the results. However, when the collaborator runs the code, they appear to get a slightly different sample than the lead researcher - and their sample generates slightly different sample statistics that do not match the lead researcher’s interpretations! Moreover, the collaborator has no way of knowing exactly how to reproduce the same random sample as the lead researcher. What a mess!
      • This confusion can be avoided by setting the random seed before drawing a random sample. What is a random seed? Well, the R functions we use to select random samples (and for various other randomization processes) actually generate pseudorandomness rather than true randomness. Specifically, they start by using an input such as the R system clock, which is constantly changing, to specify a starting value or “seed” that is then fed into a deterministic (pseudo)random number generator. So, if you do not specify a seed, then R will begin randomizing your sample from a different place each time you click ‘run’ to produce (pseudo)random output. One by-product of the randomness is that it will be nearly impossible for any sample you generate on one run to be the same in any subsequent run, and this will result in different calculations each time as well.
      • Meanwhile, the set.seed() function specifies a fixed instead of changing starting point for the random number generator. This way, any time we or a collaborator runs our code, we will get the exact same “random” sample. Here, we are using a starting seed of 1234, which will begin the random number generator at the 1,234th place. However, we can use any set of numbers when we set our seed. Unique seeds will keep our work as random as possible yet still reproducible. You could set your seed to 6789 – any value will work, even 1! I often use numbers like 8675309 or 1138.
    2. The dplyr::sample_n() function allows us to take a sample of size n from our dataset. The first value, 25, specifies that we want 25 rows, or states. The function replace = FALSE simply tells R not to replace states after choosing them. For example, if Missouri is chosen, R will not place it back into the selection pool. This method ensures no state is chosen more than once. If you wanted to sample with replacement, change FALSE to TRUE.
      • Alternatively, we could randomly choose half our states (n=25 out of 50) using proportions instead of exact counts. To do this, we could simply replace sample_n() with sample_frac() and 25 with 0.5. Essentially, this approach will take a random sample of a fraction (in our case, 50%) of the rows in our data object.
    3. As you know, the StatesDataRdm25 <- StatesData %>% portion of the code assigns the result of our piped actions on the StatesData - that is, our random sample of 25 rows - into a new data object called “StatesDataRdm25”.
set.seed(1234)
StatesDataRdm25 <- StatesData %>% sample_n(25, replace = FALSE)
StatesDataRdm25 %>% head() %>% gt()
State Number1824 NumberRural State in South Four regional categories Percent in different house in 2008 Cigarette Tax in pennies Smoke Free workplace laws Implemented Tobacco related death rate per 100K Percent of Adults who smoke total population in Thousands Divorces per 1K population Percent of Families below poverty Percent Individuals below poverty Median Income for Households Percent of Population without health insurance Murder Rate per 100K Robbery Rate per 100K Assault Rate per 100K Burglary Rate per 100K Motor Vehicle Rate per 100K Infant Mortality Rat per 1,000 Live Births Cardiovascular Death Rate per 100K Cancer Death Rate Per 100K Percent of Pop With Bachelor's degree or more PercentRural Percent18to24 ID Binary for assault rate where 0= less than 200 assaults per 100k and 1=more than 200 assaults per 100k Murder rate categorical
Nevada 178 170 0 2 20.6 0.800 1 343.7 22 2701 6.7 9.0 12.4 53341 20.8 5.9 228.0 432.1 835.7 468.6 6.4 200.0 180.2 21.8 6.293965 6.590152 29 1 1
Kansas 204 768 0 3 17.6 0.790 0 262.7 18 2853 3.7 9.0 13.4 47817 13.3 4.7 66.7 297.9 690.0 218.2 7.9 178.7 180.0 29.5 26.919033 7.150368 17 1 1
Michigan 669 2519 0 3 14.4 2.000 0 281.9 20 9884 3.3 11.6 16.2 45255 13.8 6.3 126.5 326.5 768.1 297.7 7.9 221.5 187.3 24.6 25.485633 6.768515 23 1 2
Oregon 253 727 0 2 17.3 1.180 1 263.3 16 3831 3.9 9.8 14.3 48457 17.7 2.3 65.3 162.3 513.0 261.5 5.8 156.9 179.3 29.2 18.976768 6.604020 38 0 0
Utah 227 263 0 2 16.4 0.695 1 138.3 9 2764 3.6 7.8 11.5 55117 14.8 1.4 47.3 133.8 548.7 251.1 5.1 152.1 128.8 28.5 9.515195 8.212735 45 0 0
Florida 1229 1712 1 1 15.9 1.339 1 258.8 17 18801 4.2 10.7 14.9 44736 22.4 5.5 166.8 410.6 981.2 271.2 7.1 162.4 166.6 25.3 9.105899 6.536886 10 1 1
#Alternative way to sample half the rows (n=25)
# StatesDataRdm25 <- StatesData %>% sample_frac(0.5, replace = FALSE)
  1. You just randomly selected 25 states from your dataset!
    1. You should now be able to click the StatesDataRdm25 data object in your Environment. It should contain 25 rows (states) with 30 columns (variables). The first state should be Nevada and the last should be Delaware.
    2. Note: We used the head() function (piped to gt()) again to print the first five rows so you can compare your output.
    3. Your original StatesData data object should also still be in the R Environment and should still contain all 50 states.

  2. We will revisit this dataset later. For now, let’s explore a couple other ways of selecting rows from data and introduce some simple data simulation methods.

Part 3 (Assignment 7.3)

Goal: Nonrandomly Select Rows using Conditions

While a random (or probability) sample is essential for drawing unbiased statistical inferences from a sample to a population, there may be times that you want to non-randomly select certain rows with specific characteristics from a data object rather than selecting rows randomly. So, we we will also show you how to select a specific subset of rows (states) in the data, such as the first 25 rows or all rows from states starting with a particular letter.

  1. Imagine you wanted to select only a specific row or rows (i.e., states) from the StatesData object. One way you could do this is to combine dplyr::filter() with the %in% operator. Let’s try it.
    1. Insert an R chunk.
    2. Type StateswA <- StatesData %>% and hit enter. Then, type dplyr::filter(State %in% c("Alabama", "Alaska", "Arizona", "Arkansas")) and hit run. A data object called StateswA should appear in your Environment.
      • In this example, we want only states that begin with the letter ‘A’. We create a data object called StateswA using our StatesData dataset. Note: Do not use the StatesDataRdm25 dataset, because this only includes 50% of the states!
      • The dplyr::filter() function generates a subset of a dataframe by retaining only the rows that match a condition you specify (e.g., matching column values). You might recall that we used dplyr::filter() in the previous assignment to select only rows that contained valid non-missing response values on a variable (with filter(!is.na(var))).
      • The %in% operator tells R that we want to keep only those rows in the StatesData dataset where the response values are “in” (or are equal to) the values we specified - in this case, “Alabama”, “Alaska”, “Arizona”, and “Arkansas”. If we had specified only one state, we could have used a double equal sign instead (e.g., State == "Alabama").
      • Also, note that we listed the states inside a c(). This is a very handy and widely used base R function that combines multiple elements into a value or a list.
      • Overall, the code newdata <- olddata %>% filter(var %in% c(value1, value2)) informs R to filter the old data object by the “var” column, keeping only those rows where “var” is equal to “value1” or “value2”, then assigning this subset to a new data object.
#preferred way to choose specific rows, or states
StateswA <- StatesData %>% 
  dplyr::filter(State %in% c("Alabama", "Alaska", "Arizona", "Arkansas")
  )
StateswA %>%  gt()
State Number1824 NumberRural State in South Four regional categories Percent in different house in 2008 Cigarette Tax in pennies Smoke Free workplace laws Implemented Tobacco related death rate per 100K Percent of Adults who smoke total population in Thousands Divorces per 1K population Percent of Families below poverty Percent Individuals below poverty Median Income for Households Percent of Population without health insurance Murder Rate per 100K Robbery Rate per 100K Assault Rate per 100K Burglary Rate per 100K Motor Vehicle Rate per 100K Infant Mortality Rat per 1,000 Live Births Cardiovascular Death Rate per 100K Cancer Death Rate Per 100K Percent of Pop With Bachelor's degree or more PercentRural Percent18to24 ID Binary for assault rate where 0= less than 200 assaults per 100k and 1=more than 200 assaults per 100k Murder rate categorical
Alabama 335 1981 1 1 15.1 0.425 0 317.5 22 4780 4.4 13.4 17.5 40489 16.9 7.1 142.5 277.5 1058.9 244.8 9.9 235.5 197.3 22.0 41.443515 7.008368 1 1 2
Alaska 54 216 0 2 21.5 2.000 0 270.4 22 710 4.4 6.2 9.0 66953 17.7 3.2 94.0 462.0 514.2 241.5 6.5 147.9 179.9 26.6 30.422535 7.605634 2 1 1
Arizona 443 607 0 2 19.7 2.000 1 247.4 16 6392 3.5 11.6 16.5 48745 19.6 5.5 123.9 261.1 817.3 397.1 6.8 152.5 152.8 25.6 9.496245 6.930538 3 1 1
Arkansas 200 1269 1 1 17.6 1.150 1 323.7 22 2916 5.7 14.8 18.8 37823 19.2 6.3 93.5 381.8 1224.1 215.6 7.7 221.8 200.4 18.9 43.518519 6.858711 4 1 2


  1. With this knowledge, there are other ways you could filter for specific rows, or states. Let’s try another method that also uses %in% operator and the filter() and c() functions. We will start by assigning the column values (state names) to an object or, more precisely, to a vector of values.
    1. Insert an R chunk.
    2. Type Cstates <- c("California", "Colorado", "Connecticut").
      • This puts all states that start with C into a vector called Cstates. This basically creates a list or vector of values (not a data object) that will be assigned to an object. You can then find this object in your Environment tab under “Values” (below “Data” objects), and you can call this object for use in later code.
      • So, if you hit enter, then type Cstates and click “Run” to view the object, R studio should output the list you assigned:
#assign list/vector of specific row values, or states, as object called "Cstates"
Cstates <- c("California", "Colorado", "Connecticut")
Cstates
## [1] "California"  "Colorado"    "Connecticut"
  1. Now, use the assigned vector (“Cstates”) in the filter() code from above.
    1. On a new line in the same code chunk, type StateswC <- StatesData %>% filter(State %in% Cstates)
    2. This filters the data object to keep states that begin with ‘C’ and assigns to a new data object called StateswC.
    3. When you click the data object, you should have a dataset that contains California, Colorado, and Connecticut.
#Filter data using object containing list/vector of row values (states)
StateswC <- StatesData %>% 
  filter(State %in% Cstates)
StateswC %>% gt()
State Number1824 NumberRural State in South Four regional categories Percent in different house in 2008 Cigarette Tax in pennies Smoke Free workplace laws Implemented Tobacco related death rate per 100K Percent of Adults who smoke total population in Thousands Divorces per 1K population Percent of Families below poverty Percent Individuals below poverty Median Income for Households Percent of Population without health insurance Murder Rate per 100K Robbery Rate per 100K Assault Rate per 100K Burglary Rate per 100K Motor Vehicle Rate per 100K Infant Mortality Rat per 1,000 Live Births Cardiovascular Death Rate per 100K Cancer Death Rate Per 100K Percent of Pop With Bachelor's degree or more PercentRural Percent18to24 ID Binary for assault rate where 0= less than 200 assaults per 100k and 1=more than 200 assaults per 100k Murder rate categorical
California 2766 1882 0 2 15.6 0.87 0 235.0 14 37254 0.0 10.6 14.2 58931 20.0 5.4 173.7 270.8 622.1 443.6 5.2 177.9 161.7 29.9 5.051807 7.424706 5 1 1
Colorado 349 668 0 2 18.2 0.84 1 237.6 18 5029 4.2 8.9 12.9 55430 15.3 3.2 67.9 224.5 532.5 250.6 6.1 145.3 153.7 35.9 13.282959 6.939749 6 1 1
Connecticut 228 418 0 4 11.0 3.00 0 238.3 16 3574 3.1 6.7 9.4 67034 12.0 3.0 113.6 165.2 431.1 212.0 6.6 171.0 170.7 35.6 11.695579 6.379407 7 0 0

Part 4 (Assignment 7.4)

Goal: Simulate Data

In addition to selecting subsets from real-world data, there may be times that you want to simulate fake data. Social scientists simulate data for many reasons, such as to check model assumptions, mimic real-world scenarios, or even generate more accurate interval estimates around sample statistics from real data (e.g., bootstrapped confidence intervals or Bayesian credible intervals). In fact, some even claim that to truly understand statistical modeling and inference, one must first know how to effectively simulate data.

One of the most challenging aspects of data simulation is choosing an appropriate probability distribution(s) from which to randomly sample data. Here, we will introduce you to simple methods for generating simulated datasets from some common probability distributions, including standard normal, normal, truncated normal, and uniform distributions. We recommend checking out this site for additional reading and other examples of simulating from normal, uniform, binomial, and Poisson distributions in R.

  1. Imagine that our end goal in this section is to generate a dataset containing several simulated columns that each contain n=50 values, where each row value represents a plausible (fake but realistic) “RobberyRt” value for one of 50 simulated states. Since simulation generally involves random draws from a probability distribution(s), we will start by first demonstrating how to simulate n=50 row values from a standard normal distribution.
    1. Insert an R chunk.
    2. Since we will be invoking R’s random number generator, start by setting the random seed to 1234
    3. After setting the seed, type rnorm(n = 50, mean=0, sd=1)
      • rnorm() draws “n” random numbers from a normal distribution with a specified mean and standard deviation.
      • Recall, the standard normal distribution has a mean=0 and sd=1, so we specified the rnorm() function to draw 50 random numbers from a (standard) normal distribution that has a mean=0 and sd=1.
    4. After running the code, you should see a list of negative and positive numbers.
# simulation of n=25 RobberyRt values from random normal dist
set.seed(1234)
rnorm(n = 50, 
      mean = 0, 
      sd = 1)
##  [1] -1.20706575  0.27742924  1.08444118 -2.34569770  0.42912469  0.50605589
##  [7] -0.57473996 -0.54663186 -0.56445200 -0.89003783 -0.47719270 -0.99838644
## [13] -0.77625389  0.06445882  0.95949406 -0.11028549 -0.51100951 -0.91119542
## [19] -0.83717168  2.41583518  0.13408822 -0.49068590 -0.44054787  0.45958944
## [25] -0.69372025 -1.44820491  0.57475572 -1.02365572 -0.01513830 -0.93594860
## [31]  1.10229755 -0.47559308 -0.70944004 -0.50125806 -1.62909347 -1.16761926
## [37] -2.18003965 -1.34099319 -0.29429386 -0.46589754  1.44949627 -1.06864272
## [43] -0.85536463 -0.28062300 -0.99434008 -0.96851432 -1.10731819 -1.25198589
## [49] -0.52382812 -0.49684996
  • We will go a step further by assigning these numbers to a tibble, then printing the mean and sd of the simulated data series, and finally plotting them in a histogram. Though this step is optional, feel free to follow along in your own RMD file if you want.
# simulation of n=25 RobberyRt values from random normal dist

set.seed(1234)
simznorm <- rnorm(n = 50, 
      mean = 0, 
      sd = 1) %>% as_tibble()


Mean of z-normal simulated sample, n=50”

mean(simznorm$value)
## [1] -0.453053


SD of z-normal simulated sample, n=50”

sd(simznorm$value)
## [1] 0.8850435
simznorm %>% ggplot(aes(x=value)) + geom_histogram(fill="#21918c", bins=5)

  • Notice the data series are approximately normally distributed with a mean that is somewhat close to “0” (-0.45) and SD close to “1” (0.89). However, as a random draw from the standard normal distribution, they are not expected to exactly match these parameter values or perfect bell-curved shape. This is especially true when sampling a small number of values from a heterogeneous population.

  • We illustrate this sampling variability in the figure below, which was generated by drawing nine samples of n=50 values each from the standard normal distribution (with the same starting seed) and then compiling their histograms into a single figure using the “patchwork” package. For now, do not worry about the code - we will walk you through much of this process soon enough. Instead, just notice how different each of these histograms appear to be despite the fact that they each represent a random sample drawn from the same population (i.e., a standard normal probability distribution).

  • You may have learned that drawing larger random samples and/or selecting from more homogeneous populations will reduce sampling error and minimize bias in statistical estimation. Again, we illustrate this below, this time by repeating the above procedures but simply changing the n=50 in the rnorm() function to a larger sample size - say, to n=5000 instead.

  • Notice these samples still vary randomly, but each sample much more closely resembles the standard normal distribution with its iconic bell-shaped curve. In fact, the mean and sd of the first larger (n=5,000) simulated sample also is much closer to standard normal values (mean=0 and sd=1):


Mean of z-normal simulated sample V1, n=5,000”

## [1] -0.006354696


SD of z-normal simulated sample V1, n=5,000”

sd(simznorm$V1)
## [1] 0.991345


  • Now, let’s compare the plots above from our small (n=50) simulated samples drawn from the standard normal distribution to the plot below of the observed “RobberyRt” distribution from our random sample of states (n=25):


Sample mean for RobberyRt (n=25 states)

## [1] 99.424


Sample sd for RobberyRt (n=25 states)

#sample sd of RobberyRt
sd(StatesDataRdm25$RobberyRt)
## [1] 57.07451
funhist(StatesDataRdm25, RobberyRt, 5, 20, "Robbery Rate (Random Sample of States Data, n=25)")

  • The “RobberyRt” distribution from our random sample seems like it could have been drawn from a normal probability distribution but, as we would expect, key summary statistics like mean, sd, min, and max values are quite a bit different from the standard or z-normal distribution. So, instead of sampling from the standard normal distribution, let’s try sampling values from a normal probability distribution that more closely matches the empirical distribution of “RobberyRt” values.
  1. One way we could start would be to simulate a from a normal distribution, but with specified “mean” and “sd” values estimated from our random sample instead of the standard normal values.
    1. Insert an R chunk,
    2. Since we will be invoking R’s random number generator again, remember to start by setting the random seed to 1234
    3. After setting the seed, type: rnorm(n = 50, then hit enter.
    4. On the next line, type mean=mean(StatesDataRdm25$RobberyRt), then hit enter.
    5. On the next line, type sd=sd(StatesDataRdm25$RobberyRt)) and then run your code.
# simulation of n=25 RobberyRt values from random normal dist
set.seed(1234)
rnorm(n = 50, 
      mean = mean(StatesDataRdm25$RobberyRt), 
      sd = sd(StatesDataRdm25$RobberyRt))
##  [1]  30.531317 115.258137 161.317946 -34.455541 123.916080 128.306891
##  [7]  66.621000  68.225256  67.208180  48.625529  72.188462  42.441585
## [13]  55.119691 103.102955 154.186651  93.129510  70.258384  47.417970
## [19]  51.642839 237.306603 107.077019  71.418344  74.279947 125.654841
## [25]  59.830259  16.768418 132.227900  40.999354  98.559989  46.005195
## [31] 162.337090  72.279759  58.933059  70.814943   6.444293  32.782706
## [37] -25.000689  22.887474  82.627323  72.833127 182.153285  38.431743
## [43]  50.604485  83.407580  42.672530  44.146522  36.224360  27.967522
## [49]  69.526768  71.066533
  • Recall, when we simulated from the standard normal distribution, our simulated datasets contained negative and positive values that centered around “0” (i.e., because we simulated from a z-normal distribution with a mean=0 and sd=1). Careful examination of our new simulated data generated using rnorm() and the mean/sd of our random sample (printed output above and plotted below) reveals a set of values that are much more comparable to the RobberyRt values observed in our StatesData, with one glaring exception: Our simulated data contain some negative values, whereas the minimum value in our RobberyRt sample (and population) data equals “0”. Of course, this is because a state cannot validly report a negative robbery rate - that is, a state cannot have less than 0 robberies in a year!

  • Ideally, we want to simulate data from a probability distribution that closely matches the data-generating processes underlying RobberyRt values. So, at the very least, we want to draw from a probability distribution that does not include any negative values. Since the normal distribution extends from negative infinity to infinity, there is always a chance that we will draw one or more negative - and hence invalid (as RobberyRt values cannot be negative) - simulated values using rnorm(). As such, this distribution is not a good fit to the data generating process that we are trying to simulate.

  • Instead, we might consider randomly sampling from a truncated distribution, which is a probability distribution that is limited to or excludes certain values. For instance, we could draw from a truncated normal distribution that excludes (rejects) any values below a minimum of “0”.

  1. Let’s try simulating data from a truncated normal distribution using the “truncnorm” package that we installed earlier.
    1. Insert an R chunk.
    2. Once again, we will be invoking R’s random number generator again, so remember to set the random seed to 1234
    3. After setting the seed, type: rnorm(n = 50, then hit enter.
    4. On the next line, type a = 0, then hit enter.
      • This is different from the rnorm() code. With truncnorm() we can specify lower (minumum; a=) and upper (maximum; b=) sampling limits to define the accept/reject sampling region. In our case, we typed a=0 to specify a minimum value of “0”, which means that any negative values (<0) will be rejected (skipped) when drawing random numbers from the normal distribution.
    5. On the next line, type mean=mean(StatesDataRdm25$RobberyRt), then hit enter.
    6. On the next line, type sd=sd(StatesDataRdm25$RobberyRt)) and then run your code.
set.seed(1234)
rtruncnorm(n = 50, 
           a = 0,
           mean = mean(StatesDataRdm25$RobberyRt), 
           sd = sd(StatesDataRdm25$RobberyRt)) 
##  [1]  30.531317 115.258137 161.317946 123.916080 128.306891  66.621000
##  [7]  68.225256  67.208180  48.625529  72.188462  42.441585  55.119691
## [13] 103.102955 154.186651  93.129510  70.258384  47.417970  51.642839
## [19] 237.306603 107.077019  71.418344  74.279947 125.654841  59.830259
## [25]  16.768418 132.227900  40.999354  98.559989  46.005195 162.337090
## [31]  72.279759  58.933059  70.814943   6.444293  32.782706  22.887474
## [37]  82.627323  72.833127 182.153285  38.431743  50.604485  83.407580
## [43]  42.672530  44.146522  36.224360  27.967522  69.526768  71.066533
## [49]  66.202303  36.134671
  • If you look carefully, these values are nearly identical to those generated using rnorm(). This is because in both instances we set the same random seed (set.seed(1234)) before drawing (pseudo)random values from a normal probability distribution with the same mean and standard deviation. The key difference is that the 4th value and 38th value were both negative (-34.45554 & -25.000689, respectively) in our simulated sample generated earlier using rnorm(). Each of these negative values were rejected by the random sampler and skipped in generating our new simulated sample using truncnorm(). Likewise, the very last two values (66.202303 & 36.134671) in the truncnorm() simulated sample are new, as these values essentially were next in line (from our seed using the random number generator) to replace the two rejected negative values.

  • Below, we present the plot of our real random sample data alongside the plot of our simulated sample data from a truncated normal distribution. Compared to our previous attempts, it is more plausible that these simulated values were drawn randomly from the same underlying generative distribution as our sample “RobberyRt” values.

  • Before you learn to replicate this simulation process multiple times and create your own patchwork plots, let’s consider one alternative probability distribution - the uniform or “flat” distribution. In a uniform distribution, the probability of selecting any one value between a specified minimum and maximum limit is equal to the probability of selecting any other value. Put differently, every value within the set limits of a uniform distribution is equally likely to be chosen at random.

  • Due to this equal probability feature, a uniform distribution generates random samples that, over the long run (i.e., with large samples), resemble uniform or “flat” distributions. Contrast this shape with the peaked (rather than flat) standard normal distribution, where the probability of selecting “0” (i.e., the mean) is higher than that of any other value in the distribution and where the probability of selecting any given value near “0” is much higher than the probability of selecting any given value that is multiple standard deviations away from the mean of “0”.

To illustrate these differences, below we have plotted large simulated samples (n=50,000 each) drawn from the standard or z-normal (mean=0,sd=1), truncated normal (mean=1, sd=1), and uniform (min=-5,max=5) distributions.

  • At first glance, the flat shape of a uniform distribution does not look like a good fit for sampling state-level robbery rates. However, as with sampling from the normal distribution, small samples (e.g., n=50) drawn from a uniform distribution will be subject to greater sampling variability, so their plots are likely to be “noisier” and not as flat as the one shown above.

  • Moreover, rather than stacking the mass of our generative probability distribution around the observed mean from a small, noisy sample like we do with truncnorm(), we might instead think that it is reasonable to expect a state to have any robbery rate values between “0” and some plausible upper bound - say 300 robberies per 100k people; we might also think that states are equally likely to have any value between “0” and our specified maximum of “300” robberies per 100k. If so, then the uniform probability distribution might be an appropriate alternative starting point for simulating state-level robbery rate data.

    • (Note: In an actual simulation for research purposes, one would make additional assumptions to generate even more plausible state-level data. Still, this should serve the purpose of introducing the logic of data simulation while illustrating the idea of sampling variability along the way).
  1. Let’s try simulating data from a uniform distribution using runif().
    1. Insert an R chunk.
    2. Again, we will be invoking R’s random number generator again, so remember to set the random seed to 1234
    3. After setting the seed, type: runif(n = 50,` then hit enter.
    4. On the next line, type min = 0, then hit enter.
    5. On the next line, type max = 300 then hit enter.
      • Again, this is a little different from the rnorm() and truncnorm() code above. Here, we specify the sampling region for our uniform distribution between limits of min=0 (0 robberies per 100k) and max=300 (300 robberies per 100k). In a uniform probability distribution, all values within those limits have an equal probability of selection.
set.seed(1234)
runif(n = 50,
      min = 0,
      max = 300) 
##  [1]  34.111023 186.689821 182.782420 187.013833 258.274615 192.093182
##  [7]   2.848727  69.765152 199.825127 154.275342 208.077388 163.492451
## [13]  84.820075 277.030045  87.694752 251.188688  85.866985  80.046234
## [19]  56.016837  69.667773  94.983736  90.808011  47.713801  11.998775
## [25]  65.639862 243.179566 157.709264 274.397450 249.403514  13.731079
## [31] 136.827445  79.556002  91.401661 152.192061  54.328862 227.901191
## [37]  60.374411  77.642946 297.645125 242.205702 166.000077 193.921828
## [43]  93.547292 186.545759  98.931053 150.599242 203.128358 145.497372
## [49]  73.178648 229.637936
  1. Now, let’s transform these simulated values into a tibble and assign to a data object called “simrunif”.
    1. Insert an R chunk and repeat step 4 above in the new chunk to generate 50 simulated values from the uniform distribution. Remember to set your seed first!
    2. Type simrunif <- before runif(n = 50, to assign your results to an object called “simrunif”.
    3. Type %>% immediately after max = 300) and hit enter.
    4. On the next line, type as_tibble().
    5. On the next line, type simrunif %>% gt()
      • We just piped our results to a function that transforms the list of numbers into a tibble, or simple tidy data frame and assigned them to a data object. The last line should display your tibble in a gt() table. (Note: We also piped to head(); this keeps the output brief while allowing you to compare your output to our first five rows.)
simrunif <- runif(n = 50,
                  min = 0,
                  max = 300) %>%
  as_tibble()
simrunif %>% head() %>% gt()
value
22.13396
92.90598
215.18152
151.36377
45.89969
151.18005
  • Now let’s plot our simulated data from the uniform distribution alongside our sample data.
  1. We will start by plotting our simulated data and then assigning the plot to an object. Recall, our data object is named “simrunif” and the column containing our 50 simulated values is named “value”, which you can see in the column header of your tibble output above.
    1. Insert an R chunk.
    2. Type simrunifplot <- simrunif %>% then hit enter.
      • This calls our data object and assigns the subsequent sequenced actions (i.e., our histgram) into an object called “simrunifplot”.
    3. On the second line, type ggplot(aes(x=value)) + then hit enter.
      • Using the “simrunif” data object, we are creating a ggplot aesthetic with the column “value” on its x-axis. Remember the + at the end, as we need to add a geometric object to our blank ggplot coordinate canvas.
    4. On the third line, type coord_cartesian(expand = FALSE, ylim=c(0,30), xlim=c(0,300))
      • The coord_cartesian() function allows us to modify the XY coordinates in our ggplot objects, such as zooming in or out of the plot, without changing the data.
      • Here, we are specifying the y-axis to range exactly from 0 to 30 and the x-axis from 0 to 310, without expanding at all beyond those values.
      • Setting the x-axis and y-axis values is an important step when plotting results. For instance, it can improve readability of your plot and can ensure consistency and comparability when generating multiple similar plots.
    5. On the fourth line, type geom_histogram(fill="#21918c", bins=5) +.
      • Recall, this adds a geometric object (i.e., a histogram) to our ggplot.
      • We also specified a color from the colorblind-accessible viridis scale (see Assignment 5).
      • Additionally, we specified the number of bins in which to divide and count the units on the x-axis. The default, bins=30, would divide the values from “0” to “300” into 30 bins, into which our small sample of n=50 values would then be sorted and plotted. Selecting a large number of bins displays more fine-grained frequency distinctions (akin to a frequency table at the extreme), whereas selecting a smaller number of bins can show clustering in grouped frequency regions. Try re-running the plot a few times with different bin numbers to see what we mean.
    6. On the fifth line, type xlab("Simulated data (runif[0,300], n=50)")
      • This line changes the x-axis label on our ggplot object from “Value” (our column name) to a meaningful description
    7. On the last line, type simrunifplot and click run to generate and then view your plot object.
simrunifplot <- simrunif %>% 
  ggplot(aes(x=value)) + 
  coord_cartesian(expand = FALSE, ylim=c(0,30), xlim=c(0,310)) + 
  geom_histogram(fill="#21918c", bins=5) + 
  xlab("Simulated data (runif[0,300], n=50)")
  
simrunifplot 

  1. Now that we have a plot object containing our histogram of values simulated from the uniform distribution, let’s also plot the real “RobberyRt” distribution from our random sample of 25 states. Recall, we assigned the random sample to the “StatesDataRdm25” object.
    1. Insert an R chunk.
    2. Follow the instructions above from step 6 to plot a histogram for “RobberyRt” from the random sample assigned to your “StatesDataRdm25” object. To do so, make the following changes to your code from step 6 above:
    3. In the first and last lines, change your new assigned plot object name from simrunifplot to sampleplot.
    4. In the first line, change data call from simrunif to StatesDataRdm25.
    5. In the second line, change x=value to x=RobberyRt
    6. In the third line, change ylim=c(0,30) to ylim=c(0,15).
      • Our random sample has half the number of “RobberyRt” values (n=25) as our simulated data. So, to make these graphs more comparable, we are zooming our y-axis by dividing the y-axis upper limit (ylim) in half (from 30 to 15).
    7. The fourth line (geom_histogram() can remain unchanged.
    8. In the fifth line, change x-axis label from "Simulated data (runif[0,300], n=50)" to "RobberyRt (random sample, n=25)"
sampleplot <- StatesDataRdm25 %>% 
  ggplot(aes(x=RobberyRt)) + 
  coord_cartesian(expand = FALSE, ylim=c(0,15), xlim=c(0,310)) + 
  geom_histogram(fill="#21918c", bins=5) + 
  xlab("Sample RobberyRt (n=25)")
  
sampleplot 

  1. Assuming you loaded the “patchwork” package earlier, it is trivially easy to combine these plots into a single plot - just add the two objects together and patchwork will do its magic automatically!
    1. Insert an R chunk.
    2. Type simrunifplot + sampleplot and hit run!
simrunifplot + sampleplot

  • We can also customize a patchwork plot in various ways, such as placing an empty space between plot objects by adding (+) a plot_spacer(), specifying the plot layout by adding a plot_layout(), or including a title by adding plot_annotation(title="My Title"). We can also use / to indicate which plot(s) we want on top and which we want on bottom. For example, click the code box below to see two different ways to generate the same combined plot using patchwork:
# Illustrate patchwork with plot spacers, formatting, and adding a title
(simrunifplot + plot_spacer()) /  (plot_spacer() + sampleplot) + 
  plot_annotation(
    title = 'Histograms of RobberyRt (random sample, n=25) & Simulated Data (uniform(0,300), n=50)')

# Alternative way to generate same plot
# simrunifplot + plot_spacer() + plot_spacer() + sampleplot + 
#   plot_layout(ncol = 2) + 
#   plot_annotation(
#     title = 'Histograms of RobberyRt (random sample, n=25) & Simulated Data (uniform(0,300), n=50)')
  • Finally, we could use the replicate() function to generate multiple simulated samples, plot them all, and combine all the plots with patchwork. This is essentially what we did earlier to generate the large figures with numerous plots. However, since we were creating many similar plots with minor differences between them, we also wrote a simple function (see code for funhist() function below) to generate custom histograms with ggplot().

  • Below, we illustrate how to:

    • use replicate() in conjunction with your earlier code to generate twelve simulated data sets from the uniform(0,300) distribution
    • write alternate code that simulates from the gamma distribution, using a shape parameter of “3” and a scale parameter of “50”
    • use our own function (funhist()) to plot each simulated dataset in addition to plotting the real “RobberyRt” distributions from the population and our random sample
    • combine all plots into one figure with patchwork.

  • Note: You do not need to do this step, but feel free to follow along if you want!

# Replicate 12 simulated samples from uniform(0,300) distribution
set.seed(1234)
simrunif12robrt <- replicate(n = 12,
          expr = runif(n = 50,
                       min = 0,
                       max = 300),
          simplify = TRUE) %>%
  as_tibble()

# Alt code to simulate n=50 RobberyRt values from gamma(2,60) distribution
# set.seed(1234)
# simrgam12robrt <- replicate(n = 12, 
#           expr = rgamma(n = 50, 
#                        shape = 2, 
#                        scale = 60), 
#           simplify = TRUE) %>%
#   as_tibble()

#create function to quickly plot custom histograms
  #function inputs: data; variable to plot; number of bins for geom_histogram; y-axis max; x-axis label
funhist = function(dat,var,numbins,ymax,x_label){
  dat %>% 
    ggplot(aes(x={{var}})) + 
    geom_histogram(fill="#21918c", bins=numbins) +
    coord_cartesian(expand = FALSE, ylim=c(0,ymax), xlim=c(0,300)) +
    xlab(x_label)
  }

histpop <- funhist(StatesData, RobberyRt, 5, 30, "Pop RobRt (n=50)")
histsamp <- funhist(StatesDataRdm25, RobberyRt, 5, 15, "Samp RobRt (n=25)")
histv1runif <- funhist(simrunif12robrt, V1, 5, 30, "Sim1 (n=50)")
histv2runif <- funhist(simrunif12robrt, V2, 5, 30, "Sim2 (n=50)")
histv3runif <- funhist(simrunif12robrt, V3, 5, 30, "Sim3 (n=50)")
histv4runif <- funhist(simrunif12robrt, V4, 5, 30, "Sim4 (n=50)")
histv5runif <- funhist(simrunif12robrt, V5, 5, 30, "Sim5 (n=50)")
histv6runif <- funhist(simrunif12robrt, V6, 5, 30, "Sim6 (n=50)")
histv7runif <- funhist(simrunif12robrt, V7, 5, 30, "Sim7 (n=50)")
histv8runif <- funhist(simrunif12robrt, V8, 5, 30, "Sim8 (n=50)")
histv9runif <- funhist(simrunif12robrt, V9, 5, 30, "Sim9 (n=50)")
histv10runif <- funhist(simrunif12robrt, V10, 5, 30, "Sim10 (n=50)")
histv11runif <- funhist(simrunif12robrt, V11, 5, 30, "Sim11 (n=50)")
histv12runif <- funhist(simrunif12robrt, V12, 5, 30, "Sim12 (n=50)")

plot_spacer() + histpop + histsamp + plot_spacer() +
  histv1runif + histv2runif + histv3runif + 
  histv4runif + histv5runif + histv6runif + 
  histv7runif + histv8runif + histv9runif + 
  histv10runif + histv11runif + histv12runif + 
        plot_layout(ncol = 4) +
  plot_annotation(
    title = 'Histograms of Robbery Rate: Population, Sample, & 12 Simulated Sets (Unif(0,300))' )

  • The first thing you might be surprised to notice is that the population and sample distributions of our real data do not look that different than our 12 simulated distributions. If you did not know any better, you might have thought the real data and the simulated distributions were all generated via random selection from the same uniform probability distribution.
    • For instance, the Sim4, Sim9, and Sim12 plots resemble the population distribution (“Pop RobRt” plot).
    • Likewise, the Sim2, Sim5, and Sim10 plots are quite similar to the sample distribution (“Samp RobRt” plot).

  • Additionally, we hope you noticed the sampling variability across the plots in the figure above. As indicated earlier, we are working with relatively small samples, and robbery rates vary quite a bit across all 50 states. As such, we can expect a lot of variation in samples drawn from the population.
    • Indeed, look at the differences in distributional shapes between the “RobberyRt” for the population (n=all 50 states) and versus our random sample (n=25 states). Were we to draw 12 random samples of n=50 values from the population of 50 “RobberyRt” values, we would likely see quite a bit of variability across those samples as well. (Feeling ambitious? Feel free to try it out if you want!)

    • Also, note the variability across the 12 simulated samples. Remember, each of these were drawn from a uniform or “flat” distribution! Again, a key point here is that we can expect to see a great degree of sampling variability from an underlying generative distribution when: (a) we draw small samples from a generative probability distribution and (b) when there is a large degree of heterogeneity in potential values (e.g., when drawing 50 values at random from a wide range of continuous uniform values from 0 to 300).

Part 5 (Assignment 7.5)

Goal: Estimate Confidence Intervals around ‘RobberyRt’ Sample Mean

After our detour into data simulation from probability distributions, we hope that “sampling variability” makes a little more sense to you now. Understanding sampling variability is essential to truly comprehending confidence intervals. Interval estimates are one way that we attempt to quantify and communicate the uncertainty caused by sampling variability in our statistical estimates of population parameters. After all, if our random samples from a population do not vary at all (imagine sampling from a population of identical units), then we would not have any uncertainty in our estimates and would not need to generate intervals to try to capture the true population parameters!

For the last two sections of the assignment, you will answer questions 3 and 5 from the SPSS exercises at the end of Chapter 7. In this section, we will show you how to estimate a 95% CI around the mean of “RobberyRt” in our randomly selected sample of 25 states. You will then follow this process on your own to estimate a 95% CI around the mean of “AssaultRt” in our random sample. Then, in the last section, we will show you how to estimate a 90% CI around a sample proportion calculated from the “AssaultRt” variable in our random sample. You will then follow these same procedures on your own to estimate 90% and 99% CIs around the same “AssaultRt” proportion.

  1. Create second-level header titled: “Part 5 (Assignment 7.5)”. Then, create a third-level header titled: “95% Confidence Intervals for ‘RobberyRt’ Sample Mean

  2. Revisit the formulas for estimating confidence intervals presented in your book. In order to estimate a CI around a sample mean, we will need to know:
    1. the sample mean, of course
    2. the sample standard deviation
    3. the sample size
    4. the appropriate z (large sample) or t (small sample) critical value associated with our desired confidence level (e.g., 90%, 95%, 99%).

  3. At this point, you know various ways to determine the sample mean and sd in R. We will use base R functions mean(data$var) and sd(data$var) and assign these sample values for “RobberyRt” to objects. Remember, you will do this again for “AssaultRt” at the end of this section. We will also assign the sample size (n <- 25) to the object n; since our sample stays the same, we can use this object for this section and the next section.

  • Note: Remember, the StatesDataRdm25 data object contains only the 25 randomly selected states. We are practicing interval estimation of population parameters from a random sample, so do not use the full dataset (StatesData)!


Sample Mean of RobberyRt (n=25 states)

sampmeanRobRt <- mean(StatesDataRdm25$RobberyRt)
sampmeanRobRt
## [1] 99.424


Sample SD of RobberyRt (n=25 states)

sampsdRobRt <- sd(StatesDataRdm25$RobberyRt)
sampsdRobRt
## [1] 57.07451


Sample Size

n <- 25
n
## [1] 25


  1. Now that we have assigned our mean, standard deviation, and sample size as value objects, we need to identify the proper critical value before we can estimate a confidence interval around our sample mean.
    1. Recall, since we are estimating from a small sample, we use the formula that invokes the t distribution because it has fatter tails than the z distribution and, hence, generates somewhat more conservative estimates for small, noisy samples.
    2. You can find critical t values in Table B.3 at the back of your book. Or, we can use R to identify our critical t value. Either way, you will need to know the significance level (e.g., alpha = .05 or .01) and the degrees of freedom (for this, df = n-1 = 25-1 = 24).
      • Note: The book’s SPSS exercise 3.b asks you to calculate the standard error and 95% confidence interval for “AssaultRt” in our random sample, which you will do later. Since we are illustrating how to calculate a CI around the mean of “RobberyRt” at the same confidence level (95%) and with the same sample size (n=25), we will be using the same t critical value.
    3. For our interval estimate, we would need the t critical value for a two-tailed test with alpha=0.05 and df=24. According to the table in our book, that t value is 2.064.
    4. Instead of using the table in your book, you can generate this same t critical value in R using the qt() function by typing: tcrit <- qt(p = .05/2, df = n-1, lower.tail = FALSE). Then, type tcrit to output the value and check to make sure it is the same (t=2.064)! (Note: We also rounded the output to show the book’s value and the R value is the same.)
      • qt() generates specific “quantiles” from the t distribution. With a two-tailed 95% CI, we want to specify 0.05 (i.e., 5%) of samples to fall outside in the tails or “critical region” of the t distribution.
      • Since we are generating a two-tailed confidence interval, we split 0.05 in half - 0.025% is in the left tail and 0.025% is in the right tail.
    • Wait, why did we divide our alpha level by 2 to get that same t critical value from the book? We know this can be confusing. Try to remember that, for a two-tailed confidence interval, you need to divide the alpha level by two.
      • Recall, the alpha level essentially represents the proportion of the standard curve that is specified in the tails of the distribution as the critical region. Since we are generating a two-tailed 95% CI, we want the critical region to cover 5% in the tails of the distribution to ensure 95% CI coverage probability.
      • Essentially, for a 95% CI, we will add 2.5% margin of error below the sample mean and 2.5% margin of error above the sample mean, for 5% total margin of error. So, the 5% critical region is split across both tails of the t distribution, which translates to 2.5% or 0.025 (0.05/2 = 0.025) in each of the the left and right tails, respectively.


t critical value for two-tailed test, alpha=.05, df=24

# Determine t critical value for alpha = .05 (to estimate 95% CI)
  # For two-tailed test, divide alpha level by two (for both tails)!
tcrit <- qt(p = .05/2, df = n-1, lower.tail = FALSE)
tcrit
## [1] 2.063899


t critical value, rounded to 3 digits to compare with book

round(tcrit, 3)
## [1] 2.064


  1. We now have all the necessary inputs to plug into our formula for estimating a 95% confidence interval around a sample mean with a small sample. Though there are ways to automatically generate the standard error of the mean, we will mimic hand calculation of a 95% CI for the ‘RobberyRt’ variable by typing the formula in R and plugging our assigned values into the formula. We hope this will help you better understand how these interval estimates are calculated.
    1. First, in an R chunk, estimate the standard error of the mean for RobberyRt using the formula in your book, which isse = sd/sqrt(n).
      • Recall, the standard error of the mean is our estimate of the standard deviation of the sampling distribution of the mean - it is the estimate we use to quantify uncertainty caused by sampling variability in estimating the population mean from a sample.
    2. Plug in our assigned value objects into the equation by typing: sampseRobRt <- (sampsdRobRt/sqrt(n)). Recall that you can view this value after assigning it to an object by retyping it (sampseRobRt) afterwards on its own line of code.
      • In this code, n is equal to the number of observations, which is 25 (i.e., 25 randomly selected states). sampsdRobRt is the standard deviation of the “RobberyRt” variable in our random sample.
    3. Next, we multiply the standard error of the mean by the z or t critical value (t in this case) associated with the alpha level we selected and our degrees of freedom by typing: marg95RobRt <- tcrit*sampseRobRt
      • By multiplying the standard error of the mean by the t critical value, we are essentially identifying how many standard errors wide in t distribution units (with df=25) we need our confidence interval to be so that the area under the curve matches our desired confidence level, or our desired “margin of error.” In this code, tcrit is our t critical value and sampseRobRt is the estimated standard error of the mean of “RobberyRt” from our sample.
    • Note: In the code below, we also show an alternative way of generating these quantities a little more directly using qt() function.
      • In the alternative code, we specify the 0.975 quantile, which represents the proportion of samples that we want to fall within our estimated interval. You might notice that this quantile value is equivalent to 1 - alpha/2.
      • Again, since the function generates the critical value associated with a one-tailed test, we need to divide our alpha level by two to identify the appropriate quantile for a two-tailed confidence interval.
      • So, for a two-tailed 95% confidence interval, we identify the quantile by subtracting 0.025 from 1.00 (i.e., 1.00 - 0.05/2 = 1 - 0.025 = 0.975).


Sample Std Error of Mean of RobberyRt & 95% margin of error

sampseRobRt <- (sampsdRobRt/sqrt(n))
sampseRobRt
## [1] 11.4149
marg95RobRt <- tcrit*sampseRobRt
marg95RobRt
## [1] 23.5592
#Alternative method for calculating 95% margin of error in R using qt()
# qt(0.975,df=n-1)*sampsdRobRt/sqrt(n)


  1. With the 95% margin of error of the mean estimated and assigned to an object, we can simply subtract it from the sample mean to calculate the lower 95% CI limit, then add it to the sample mean to calculate the upper 95% CI limit. Together, these make up the 95% confidence interval estimate around the sample mean of “RobberyRt”.
    1. Insert an R chunk.
    2. Type RobRt95CIlow <- sampmeanRobRt - marg95RobRt. Hit enter and type RobRt95CIhi <- sampmeanRobRt + marg95RobRt. Type each of the object names on subsequent lines to print their output values and run your code.
    3. Note: To learn more about calculating CI using R, you might check out here and here.



Lower bound of 95% confidence interval around sample mean

RobRt95CIlow <- sampmeanRobRt - sampseRobRt
RobRt95CIlow
## [1] 88.0091


Upper bound of 95% confidence interval around sample mean

RobRt95CIhi <- sampmeanRobRt + sampseRobRt
RobRt95CIhi
## [1] 110.8389


95% CI Estimate around Sample Mean of RobberyRt

cat("[",RobRt95CIlow,",",RobRt95CIhi,"]")
## [ 88.0091 , 110.8389 ]


  • You just calculated 95% confidence interval estimates around the sample mean of RobberyRt from our 25 randomly selected states!
    • As you can see, our lower bound is 88.01 and our upper bound is 110.84.
    • This means our 95% CI around the sample mean of “RobberyRt” is [88.01, 110.84].
  1. Now, for the question you all have been waiting to ask: Did our sample-specific 95% parameter-catching interval estimate happen to catch the true population mean this time? Well, since we actually have the population data, we can check! Use R to check the population mean of “RobberyRt” in the StatesData, then compare it to our 95% CI estimate around the sample mean.


Population Mean of RobberyRt

mean(StatesData$RobberyRt)
## [1] 104.584


  • Yes, the population mean (104.58) was captured in our sample’s 95% confidence interval estimate [88.01, 110.84].
    • In other words, our random sample was one of the 95% of samples we might have drawn that would be expected to generate 95% confidence intervals that capture the true population mean.
    • Of course, we only know this because we have the population data. Remember, usually we would not know whether our interval estimates contain the true population parameter because we typically estimate population parameters from sample data without the luxury of having access to population data. In the typical case, we must settle for being confident that such interval estimates will capture the true population parameters most of the time when we follow appropriate measurement, sampling, and estimation methods.
    • Finally, it should not be surprising that our 95% CI captures the true population value, since it is intended to do just that in 95 out 100 repeated sampling trials. However, note that our 95% CI estimates are not very precise - they cover a pretty wide range of plausible state robbery rate values. With small samples and heterogeneous populations, the only way to ensure our CIs provide 95% frequency coverage probability is by generating wide, imprecise interval estimates!
  1. You should now know everything you need to generate a confidence interval around a sample mean on your own in R. Follow the procedures illustrated above to estimate a two-tailed 95% CI for the sample mean of “AssaultRt” and answer questions 3.b through 3.d from Chapter 7 SPSS Exercises.
    • Create new headings and subheadings as appropriate, and describe what you are doing throughout in the text portion of your RMD document.
    • Note: You might want to read through the next section of this assignment before attempting to interpret the confidence intervals that you estimate!

Interpreting Confidence Intervals

Goal: Understand how to appropriately interpret a frequentist confidence interval

Now that you have estimated a 95% confidence interval around the sample mean of “RobberyRt”, we want to discuss how to appropriately interpret these intervals before we estimating any more of them.

You may recall us mentioning earlier that frequentist confidence intervals are routinely misinterpreted. So, how do we appropriately interpret them? Well, we are glad you asked! Here is an appropriate interpretation of the 95% CI that we just estimated:


CORRECT INTERPRETATION: We are 95% confident that the true population mean robbery rate is between 88.01 and 110.84. By 95% confident, we mean that over the long run (and assuming our research design and statistical model are sound), the same procedures would result in estimates and confidence intervals that capture the true population mean in 95% of trials.


In addition to knowing the correct interpretation, it is worth recognizing a few common misinterpretations. Though the distinctions at times appear to be quite subtle, each of the following is technically a misinterpretation:



  • Inaccurate: We are 95% certain that the true population proportion falls between 88.01 and 110.84.
  • This sounds correct, so why is it considered a misinterpretation? Well, it is because “95% confidence” has a very specific meaning in frequentist statistics. Recall, we are not 95% confident in - or 95% certain of - the specific interval that we estimated. Rather, our level of confidence is in the method or procedures used to estimate the interval and not in the specific interval itself.

  • Inaccurate: There is a 95% chance that the true population proportion falls between 88.01 and 110.84.
  • Again, this sounds correct, and under a very specific (frequentist) interpretation of “chance” or “probability” it arguably might be considered correct. However, it is listed as a misinterpretation for the same reasons we explained in the example above.
  • In fact, the true population parameter either is or is not in the frequentist confidence interval we estimated, so it either has a 0% or 100% chance of falling in the specific interval we estimated. Of course, we usually do not know whether our interval caught the true population parameter because we rarely have data from the entire population. So, again, the 95% confidence level refers to our confidence in the 95% frequency coverage probability of interval estimates (hypothetically) generated over the long run across repeated trials and not to the chances that the specific interval we estimated captures the true population parameter.

  • Inaccurate: In 100 repeated trials, 95% of new samples will have a population mean that falls between 88.01 and 110.84.
  • This is just a slightly confused deviation from the truth. Here, the interpretation seems to get the long run frequency coverage notion of probability correct. However, this interpretation again is too focused on the specific interval (i.e., “between 75.86 and 122.98”) we estimated with our single sample rather than recognizing that different samples would generate different point and interval estimates.
  • Moreover, what does it mean to say that “95% of new samples will have a population mean…”? Remember, sample statistics like the mean, standard deviation, and confidence intervals will likely vary across samples due to sampling variability; however, the true population parameter (e.g., mean; sd) is the same regardless of the specific characteristics of any given sample.

  • Inaccurate: If we replicated our study, we would get the same results 95% of the time.
  • Remember that we are estimating intervals to quantify the uncertainty in estimation due to sampling variability and to minimize inference errors by estimating intervals that have a high probability of long run frequency coverage. Like we saw in our simulations above, random samples drawn from a population often differ from each other and from the population - if we understand that such sampling variability is typical, then we should also expect the point and interval estimates to vary across random samples as well.
  • Exceptions to this rule occur when we draw random samples from homogeneous populations (e.g., if all states had the same robbery rate) or when we select very large random samples - because we would expect little to no sampling variability in these situations. Still, even in these circumstances, we would not interpret CIs in this way.
  • So, in fact, most of the random samples of n=25 states that we could have drawn from the population of all 50 states would have generated different results - that is, different “RobberyRt” means, standard deviations, and 95% CIs!

We illustrate sampling variability - and CI variability - in the figure below, in which we plotted the mean and 95% CI from our sample (sample #1 at the bottom) along with means and 95% CIs from eight other samples of n=25 states each randomly drawn from “StatesData” using a different random starting seed. As you can see, all nine (ours + 8 new samples) generated different point estimates (dots = sample means) and interval estimates (error bars = 95% CIs around sample means). In this particular case, all nine of the estimated 95% confidence intervals managed to capture the true population parameter (vertical line = population mean of “RobberyRt”).

#generate three more random samples of n=25 states from StatesData
set.seed(1)
samp2 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(21)
samp3 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(31)
samp4 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(41)
samp5 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(51)
samp6 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(61)
samp7 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(71)
samp8 <- StatesData %>% sample_n(25, replace = FALSE)
set.seed(81)
samp9 <- StatesData %>% sample_n(25, replace = FALSE)

#create function to calculate stnd error of mean for 95% CI 
fun95se = function(dat){
  (qt(0.975,df=24)*sd({{dat}}$RobberyRt)/sqrt(25))
  }

#create tibble with sample, mean(RobberyRt), and se(mean)
statesamps <- tribble(
  ~sample, ~mean,  ~se,
  "1", mean(StatesDataRdm25$RobberyRt),  fun95se(StatesDataRdm25),
  "2", mean(samp2$RobberyRt),  fun95se(samp2),
  "3", mean(samp3$RobberyRt), fun95se(samp3),
  "4", mean(samp4$RobberyRt), fun95se(samp4),
  "5", mean(samp5$RobberyRt), fun95se(samp5),
  "6", mean(samp6$RobberyRt), fun95se(samp6),
  "7", mean(samp7$RobberyRt), fun95se(samp7),
  "8", mean(samp8$RobberyRt), fun95se(samp8),
  "9", mean(samp9$RobberyRt), fun95se(samp9)
)

statesamps <- statesamps %>% 
  mutate(
    CI95low = mean - se, 
    CI95hi = mean + se
  )

ggplot(statesamps, aes(mean, sample)) +      
  geom_point() +
  geom_errorbar(aes(xmin = CI95low, xmax = CI95hi, color=sample)) + 
  geom_vline(xintercept = mean(StatesData$RobberyRt, linetype="dashed"), 
                color = "#d44842", size=1) + 
  geom_segment(aes(x = 130, y = 1.3, xend = 124.5, yend = .9),
                  arrow = arrow(length = unit(0.5, "cm")), color="#365c8d") + 
  annotate("text", x=130, y=2.2, label="95% CI\nfor our\nfirst sample", angle=0, color="#365c8d") + 
  geom_segment(aes(x = 113, y = 9.5, xend = 106, yend = 9.4),
                  arrow = arrow(length = unit(0.5, "cm")), color="#d44842") + 
  annotate("text", x=120, y=9.5, label="Population mean", angle=0, color="#d44842") + 
  # scale_color_viridis_d() +
  scale_colour_manual(values=c('#365c8d','#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d', '#4ac16d')) + 
  theme_classic() + 
  theme(legend.position = "none") 

#https://statisticsglobe.com/draw-plot-with-confidence-intervals-in-r
  • Note: There are more efficient ways of generating a figure like this. For instance, we could have used replicate() to generate and assign numerous samples to a single combined data object or to a list of separate objects, then calculate our estimates and generate a plot from that single data object or list. (We will illustrate some more efficient approaches in the next assignment). Instead, we generated eight more samples and assigned them to separate data objects. We then created a single tibble containing the sample mean and standard error from each of our nine samples, calculated 95% upper and lower CI limits for each sample in our new tibble, and then plotted our means and 95% CIs from this tibble. Our hope is that you can more easily follow our logic with this explicit approach.

  • Also, we wanted to show you how to plot means and confidence intervals using ggplot() + geom_point() + geom_errorbars, as well as illustrating how you can add other elements like a vertical line (+ geom_vline()), an arrow (+ geom_segment), or text (+ annotate()) elements to a ggplot() object. While we do not expect you to reproduce this plot, please do not let us thwart your ambitions!

Part 6 (Assignment 7.6)

Goal: Estimate Confidence Intervals around ‘AssaultRt’ Sample Proportion

In the last part of this assignment, you will use what you have learned to answer SPSS exercise question 5, which asks you to estimate 95% and 99% CIs around the proportion of states with an assault rate (“AssaultRt”) higher than 200 assaults in 100k in our random sample of 25 states. We will illustrate the process by estimating a 90% CI around this same proportion.

Remember, we previously used the mean and sd to first estimate the standard error of the mean, which we then used to calculate the lower and upper limits of the 90% CI. We follow a similar process for a proportion, though the formula and inputs are different.

On that topic, you may have noticed that your book only provides a formula for calculating a confidence interval around a sample proportion with large samples, and it does not have an equivalent formula for small samples. What gives? Well, it turns out that these types of interval estimates are pretty error-prone with small samples, and neither z nor t distributions are really very useful for generating valid intervals around proportions with small samples. Yet, the book asks you to do this anyway, so we will plug our nose and move forward using the formula for large samples, which relies on the z distribution.

  1. Create second-level header titled: “Part 6 (Assignment 7.6)”. Then, create a third-level header titled: “90% Confidence Interval for ‘AssaultRt’ Sample Proportion”

  2. Referring to your book, you will find the formula for a confidence interval around a sample proportion requires the following inputs:
    1. the proportion (p) itself, of course
    2. the complement of the proportion (1-p)
    3. the sample size, which we already assigned to the object n
    4. the critical z value associated with your specified confidence level

  3. Our first step is to determine what proportion (p) of states in our random sample have more than 200 assaults per 100k.
    1. You can use a frequency table and calculate or check the cumulative frequencies for this value. Or, you can use the count() function to answer this question directly in R. We show you how to do this in the code below.
    2. However you do it, you should get 16 out of 25 states, or p=0.64.
    3. Assign this proportion value to an R object.


Sample Proportion of States w/Assault Rate >200”

StatesDataRdm25 %>% count(AssaultRt > 200)
## # A tibble: 2 × 2
##   `AssaultRt > 200`     n
##   <lgl>             <int>
## 1 FALSE                 9
## 2 TRUE                 16
pAstRtgt200 <- 0.64
pAstRtgt200
## [1] 0.64


  1. Next, identify the critical z value associated with a 90% confidence level and assign to an object.
    1. Since we are using the large sample formula (I know, I know…), we will invoke qnorm() instead of qt(), which is a quantile function that we can use to identify critical values from the standard normal (z) rather than t distribution.
    2. For our example, we will use qnorm() to find the 90% CI (two-tailed), though you will be doing this again on your own to estimate 95% and 99% confidence intervals.
      • Though we want a 90% CI, notice that we specify 0.95 in our code instead.
      • Remember, these quantile functions are designed to provide one-sided (left-tailed or right-tailed) critical values. Since we want to estimate a two-tailed 90% CI, we need to assign 10% of the total t distribution as the critical region to ensure 90% CI coverage probability. This translates to a critical region containing 5% (10/2) in each of the left and right tails, respectively. Essentially, we will be adding 5% margin of error above and below the sample proportion. Subtracting 0.05 from 1.00 results in a 0.95 quantile specification (1.00 - 0.10/2 = 1.00 - 0.05 = 0.95), which will generate the appropriate z critical value for a two-tailed 90% confidence interval.
    3. Below, we use qnorm() to identify the appropriate z critical value for a 90% CI and assign it to an object named zcrit. That value is 1.64.
      • You can also find this value in the book’s Appendix Table B.1. That table is also set up to find one-tailed z critical values. So, for 90% CI coverage probability, our interval should cover 45% (0.45) of the z curve on each side, with 5% (0.05) of the curve specified as the critical region in each of the left and right tails. Therefore, we look in the body of the table for the z critical value associated with an area equaling 0.45 (0.4495) under the normal curve; this too will lead us to the value of 1.64.
      • Note: When you follow these procedures later to estimate two-tailed 95% and 99% confidence intervals around the “AssaultRt” sample proportion for question 5 of Chapter 7 SPSS Exercises, remember to identify the appropriate z critical values corresponding to the specified confidence level!


Critical z value for two-tailed 90% CI

zcrit <- qnorm(0.95)
zcrit
## [1] 1.644854


5. We have what we need to estimate a confidence interval around our sample proportion. a. Recall, the formula for the standard error of the proportion (aka, the standard error of the sampling distribution of the proportion) is: se = sqrt(p(1-p)/n) and we multiply this by the critical z value to get the margin of error. b. In an R chunk, input our assigned objects into the formula by typing marg90AstRtprop <- zcrit*sqrt(pAstRtgt200*(1-pAstRtgt200)/n). - This code estimates the margin of error for our 90% CI by multiplying our critical value by the square root of the proportion (.64) multiplied by its complement (or 1 - p) and divided by n.  - The margin of error estimate is then assigned to a new value object called marg90AstRtprop (named differently so we do not confuse it with sample standard error for the mean of ‘RobberyRt’; the name refers to the sample margin of error for 90% CI around Assault Rate proportion).


Standard error of Assault Rate proportion

marg90AstRtprop <- zcrit*sqrt(pAstRtgt200*(1-pAstRtgt200)/n)
marg90AstRtprop
## [1] 0.1579059


6. With the margin of error estimated and assigned to an object, we can now calculate the 90% confidence interval around our sample proportion. a. As we did in the previous section, we do this by simply subtracting from and adding to the sample proportion to calculate the lower and upper limits, respectively, for our 90% confidence interval around the “AssaultRt” sample proportion. b. We can do this by simply typing AstProp90CIlow <- pAstRtgt200 - marg90AstRtprop and AstProp90CIhi <- pAstRtgt200 + marg90AstRtprop to determine the 90% CI of the assault rate for our 25 states.


90% CI lower limit for AssaultRt proportion

AstProp90CIlow <- pAstRtgt200 - marg90AstRtprop
AstProp90CIlow
## [1] 0.4820941


90% CI upper limit for AssaultRt proportion

AstProp90CIhi <- pAstRtgt200 + marg90AstRtprop
AstProp90CIhi
## [1] 0.7979059


90% CI Estimate around AssaultRt Sample Proportion

cat("[",AstProp90CIlow,",",AstProp90CIhi,"]")
## [ 0.4820941 , 0.7979059 ]


- Congratulations! You just calculated a 90% confidence interval estimate for the proportion of states with an assault rate greater than 200. - Our lower bound estimate is 0.48 and our upper bound estimate is 0.8. That means our 90% CI around the sample proportion is [0.48, 0.8]. - We can interpret this interval as follows: We are 90% confident that the true population proportion of states with assault rates greater than 200 is between 0.48 and 0.8. - By 90% confident, we mean that over the long run the same procedures would result in estimates and confidence intervals that capture the true population proportion in 90% of trials, assuming our research design and statistical model (including the large sample formula) are appropriate.

  • Did our 90% parameter-catching interval estimate happen to catch the true population proportion this time? Once again, since we actually have the population data, we can check.


Population Proportion of States w/Assault Rate >200

StatesData %>% count(AssaultRt > 200)
## # A tibble: 2 × 2
##   `AssaultRt > 200`     n
##   <lgl>             <int>
## 1 FALSE                21
## 2 TRUE                 29
popAstRtprop <- 29/50
popAstRtprop
## [1] 0.58


- The population proportion is 0.58, which falls within the 90% confidence interval we estimated around our sample proportion (p=0.64, 90% CI=[0.48, 0.8]). It seems our random sample was one of the 90% of samples we might have drawn that would be expected to generate 90% confidence intervals that capture the true population proportion. Again, we only know this because we have the full population data - in most research situations, we do not have this luxury.

  1. Follow the procedures illustrated above to estimate two-tailed 95% and 99% CIs around the “AssaultRt” sample proportion and answer question 5 from Chapter 7 SPSS Exercises.
    1. Create new headings and subheadings as appropriate, and describe what you are doing throughout in the text portion of your RMD document.
    2. Remember, with a different confidence level, your z critical value should change, which will then change your sample standard error, your margin of error (i.e., marg90AstRtprop value), and your lower and upper confidence limits (i.e., AstProp90CIlow and AstProp90CIhi).
    3. Also, recall that at higher confidence levels, we want to reduce inference errors by increasing the probability that our interval will contain the true population parameter. To do this, we are going to widen the interval, which will make us more confident that our parameter-catching net” will capture the true population parameter. This means that, if calculated correctly, your 95% and 99% CIs should be wider than the 90% CI we estimated above!

  2. You should now have everything that you need to complete the questions in Canvas Assignment 7 that parallel those from the SPSS Exercises for Chapter 7. Complete the remainder of the questions in Assignment 7 in your RMD file and on Canvas.
    1. Keep the RMD file clean and easy to follow by using RMD level headings (e.g., denoted with ## or ###) separating R code chunks, organized by assignment questions.
    2. Write plain text after headings and before or after code chunks to explain what you are doing - such text will serve as useful reminders to you when working on later assignments!
    3. Upon completing the assignment, “knit” your final RMD file again and save the final knitted Word document to your “Assignments” folder as: YEAR_MO_DY_LastName_K300Assign7. Submit via Canvas in the relevant section (i.e., the last question) for Assignment 7.

Assignment 7 Objective Checks

After completing assignment #7…

  • are you able to select random sample from data without or with replacement using dplyr::sample_n()?
  • do you know how to improve reproducibility of randomization tasks in R by setting the random number generator seed using set.seed()?
  • are you able to select data with conditions using dplyr::filter() and %in% operator?
  • do you know how to create a list or vector and assign to object, such as listname <- c(item1, item2)?
  • do you recognize how one can simulate data from normal, truncated normal, or uniform probability distributions using rnorm(), truncnorm::rtruncnorm(), or runif()?
  • recognize how one might write a custom function to repeatedly generate similar plots?
  • are you able to combine multiple ggplot() plots into a single figure using “patchwork” package?
    • do you know how to customize the plot layout and add a title to a patchwork figure?
  • do you understand how confidence intervals are used to quantify uncertainty caused by sampling variability?
    • are you able to estimate a two-tailed confidence interval around a sample mean or proportion in R and interpret it appropriately?
    • are you able to identify t or z critical values associated with a two-tailed confidence level using qt() or qnorm()?
    • do you know how to estimate the standard error of a sample mean or proportion in R?
    • do you understand how to properly interpret and avoid common misinterpretations of confidence intervals?
  • do you recognize how one might plot means and confidence intervals using ggplot() + geom_point() + geom_errorbars?
  • do you recognize how one might add elements like vertical lines (+ geom_vline()), arrows (+ geom_segment), or text (+ annotate()) elements to a ggplot() object?