Assumptions & Ground Rules

The purpose of this assignment is to learn how to download and provide basic descriptions of data and specific variables analyzed in a published study. Up to this point, we have used built-in R data (e.g., R Assignment 2), provided you with the data you were working with (e.g., subsets of the NYS data in R Assignment 3), or you have downloaded the data manually from ICPSR and placed it within your reproducible file structure (e.g., R Assignment 4). The approach we used in the last R Assignment would work fine when you are using your own data and/or data that you have permission to share. However, this is not generally the case with data on ICPSR. According to ICPSR’s bylaws, you are not technically allowed to share ICPSR data in your own online repository (e.g. OSF or GitHub). For this assignment, we will show you how to download SPSS data directly within R and begin looking at the data via basic descriptive statistics.

Specifically, for this assignment, we will:

  1. Create part of file structure within R
  2. Download NYS data directly from ICPSR into file structure
  3. Subset data and combine multiple data sets into one
  4. Identify, rename, and provide basic descriptions of specific variables/items used in the study.
  5. Provide basic introduction to the ifelse function and logic in R.

I assume that you are now familiar with installing and loading packages in R. Thus, when you see a package being used, I expect that you know it needs to be installed and that it needs to be loaded within your own R session in order to use it.

At this point, I also assume you are familiar with RStudio and with creating R Markdown (RMD) files. If not, please review R Assignments 1 & 2.

  • Note: For this assignment, have RMarkdown knit to an html file.

As with previous assignments, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste from the instructions except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).

Packages:

library(tidyverse)
library(here)
library(haven)
library(icpsrdata)
library(gt)
library(sjmisc)

Part 1 (Assignment 5.1): Create file structure within R

In the last R Assignment (“Reproducible File structure”) we created a basic reproducible file structure and shared it using your computer’s operating system. Here, we are going to create most of the folders we need using R code. This is useful for our specific purposes–downloading data directly from ICPSR–because we do not have to rely on someone placing their data in the correct folder; we can simply share with them the code to create the folder in their own root directory.

1.1: Create root folder for R Assignment #5

Before we start creating folders and downloading data within R, we need to create a root folder, save our RMD file inside it, and close and open the assignment directly from that root folder (so the “here” package will start in the correct folder on our computer):

  1. Go to your “LastName_CRM495_work” folder on OneDrive
  2. Create a new root folder titled “LastName_CRM495_RAssignment5” inside it.
  3. Save your RMD file as LastName_CRM495_RAssign5_YEAR_MO_DY
  4. Close RStudio, go to root folder and open RMD file.

1.2: Create “NYS_data” folder within R

Create a subfolder called “NYS_data” within your “LastName_CRM495_RAssignment5” root folder. Technically, you could do this yourself by navigating to the folder on your computer and creating a new “NYS_data” folder manually. But we can also do it in R with the following code. Again, doing it in the R environment helps ensure that anyone else (including our future selves) can easily reproduce our work with minimal effort.

# check if "NYS_data" folder exists (TRUE if it does) & create if it does not exist. 
ifelse(dir.exists(here("NYS_data")), TRUE, dir.create(here("NYS_data")))
## [1] TRUE

Let us try to explain the above code to you. The “ifelse” command is a logical function within base R. To get more details about it, type ?ifelse into the console window. Here is the description of that function:

ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.

It takes the form of the following: ifelse(test, yes, no). This means, you give R a logical test (or a logical question) that can be answered yes or no and then it gives you a value or performs another function based on the solution of that test (i.e., based upon the answer to that question).
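Before applying this to folders, it may help to see ifelse(test, yes, no) on an ordinary vector. The sketch below uses made-up values (the nights_out vector is purely illustrative and is not NYS data) to show that the result has the same shape as the test and that a missing value in the test stays missing.

```r
# Made-up values for illustration only (not NYS data)
nights_out <- c(0, 2, 5, NA, 3)

# test: nights_out >= 3; yes: "3 or more"; no: "fewer than 3"
social_cat <- ifelse(nights_out >= 3, "3 or more", "fewer than 3")

# One answer per element of the test; the NA stays NA
social_cat
```

Because ifelse works element by element, the same call would work on a whole column of a data set.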

In the above code, we are asking if the “NYS_data” folder exists within our root folder (i.e., your “LastName_CRM495_RAssignment5” folder) with the dir.exists function. If the answer is yes, it simply returns the logical value I told it to - in this case TRUE. If the answer is no, you instruct R to create that “NYS_data” folder with the dir.create function. Again, type ?dir.exists or ?dir.create for more information.
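If you would like to watch both branches of that logic run without touching your real assignment folder, here is a minimal sketch that does the same check-then-create step inside a throwaway temporary folder. It uses base R's tempfile() and file.path() in place of here(), and the object names (demo_root, target) are just placeholders we made up for the demonstration.

```r
# Throwaway root folder so the real "NYS_data" folder is left alone
demo_root <- tempfile("ifelse_demo_")
dir.create(demo_root)
target <- file.path(demo_root, "NYS_data")

# First run: the folder does not exist, so dir.create() runs and returns TRUE
first_run <- ifelse(dir.exists(target), "already there", dir.create(target))

# Second run: the folder now exists, so the "yes" value is returned instead
second_run <- ifelse(dir.exists(target), "already there", dir.create(target))

first_run
second_run
```

Running the same line twice and getting two different answers is exactly what makes this pattern safe to leave in a file you knit repeatedly.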

If you want to have some fun, you can actually have R return a text string instead of the logical value. For example:

ifelse(dir.exists(here("NYS_data")), "You already created that folder, dummy!", dir.create(here("NYS_data")))
## [1] "You already created that folder, dummy!"

Generally, it is probably not a great idea to have R call the user (yourself in this case) a “dummy” with code you plan to eventually share publicly. Yet, it is also OK to have some fun when doing science. I (Jake) think that having a computer program call me a “dummy” is fun - perhaps you do not.

  • Note: there is probably a programming rationale for using the logical value rather than a string of which I am unaware.

  • Note: tidyverse syntax has a stricter if_else function. According to the documentation, what makes it more strict is that “It checks that true and false are the same type.” I’ll be honest, I’m not sure exactly when this strictness is useful (tidyverse says it can allow for more predictable use and is somewhat faster). For most of what we will be using it for, either the base ifelse or the tidyverse if_else function will likely work just fine. I am going to use the more general ifelse function from base R.
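To make the “same type” check a little more concrete, here is a small sketch comparing the two functions when the yes and no values are deliberately mismatched (a character string and a number). The tryCatch() wrapper and the message text are our own additions so the example keeps running after the error; they are not part of either function.

```r
library(dplyr)

# Base R quietly converts the number 0 to the text "0"
base_result <- ifelse(c(TRUE, FALSE), "yes", 0)

# dplyr's if_else stops with an error when the two types differ;
# tryCatch() catches that error and returns our own message instead
strict_result <- tryCatch(
  if_else(c(TRUE, FALSE), "yes", 0),
  error = function(e) "if_else refused to mix types"
)

base_result
strict_result
```

Silent conversion like the one in base_result is one situation where the stricter function might save you from a hard-to-spot mistake.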

You should now have a file structure for your “LastName_CRM495_RAssignment5” folder that looks like this:

Part 2 (Assignment 5.2): Download ICPSR data directly from within R

Now that you have the basic file structure for this assignment and specifically the “NYS_data” folder, it’s time to download the first five waves of NYS data. Recall these are the waves of data that Warr (1993) used in his study on the delinquent peer influences and the age-distribution of crime.

As we mentioned previously, it is technically against ICPSR’s bylaws to share data housed on ICPSR “without the written agreement of ICPSR.” This means that if you included ICPSR data and/or documentation directly within a reproducible file structure that you shared with someone else, you would technically be violating the bylaws. Fortunately, there is a package called “icpsrdata” that allows you to download data housed on ICPSR directly from within R. This means you simply need to provide the code for downloading and wrangling the data and you are 1) not violating ICPSR’s bylaws and 2) adhering to open and reproducible research practices. Let’s show you how to do that now.

2.1: Identify ICPSR numbers for data you want to download

We need to know the ICPSR numbers for the first five waves of the NYS. Go to the NYS Series page on ICPSR and make note of the ICPSR numbers for the first five waves of data.

Here is a table to remind us of the ICPSR numbers.

Warr (1993) NYS Items
Wave     ICPSR
Wave 1   8375
Wave 2   8424
Wave 3   8506
Wave 4   8917
Wave 5   9112


2.2: Use the “icpsrdata” package to download NYS data

In order to use the “icpsrdata” package, you need to install it and load it into your current R environment.

  • Recall that it is good practice to install packages in the “Console” and only load them within your RMD file (see R Assignments #1 and #2 for details on how to install and load packages)
library(icpsrdata)

To actually download the data, we will use the icpsr_download function that is a part of the “icpsrdata” package. The core arguments of the function are file_id (i.e., the ICPSR numbers) and download_dir (the folder on your computer where you want the data files to be downloaded). Let’s just show you the code and then explain it.

  • Note: In order to prevent R from continually trying to download data we have already downloaded and to prevent issues when you knit your Rmd file, in the code below we added the icpsr_download function to an ifelse function. It is the same logic as the ifelse function above when we created the “NYS_data” folder. It first checks to see if the “ICPSR_09112” folder (wave 5, the last wave we are telling R to download) exists in the “NYS_data” folder. Then it returns the logical value “TRUE” if the folder does exist and, if it does not exist, runs the icpsr_download command to download the first five waves from ICPSR.
ifelse(dir.exists(here("NYS_data", "ICPSR_09112")), TRUE, 
icpsr_download(file_id=c(8375, 8424, 8506, 8917, 9112),
               download_dir = here("NYS_data")))
  • Note: The first time you run this chunk in an R session, you will be asked to enter your ICPSR account information in the R console. R should remember it for the rest of that session, so you will likely need to do this once per R session before trying to knit your R Markdown document.
    • Again, to be clear: Check the R Console after trying to run the icpsr_download command to see and respond to the ICPSR username/password prompts. If you have not yet created a free account on ICPSR (you should have already for an earlier project assignment), then you will need to do this on the ICPSR website first. Then, after each prompt in the console, you would put your ICPSR username instead of “your_icpsr_username” (I added that as a placeholder) and your ICPSR password instead of “your_icpsr_password.”
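For a single yes/no check like this one, many R users would reach for a plain if/else statement instead of ifelse. The sketch below shows that alternative under some loud assumptions: fake_download() is a hypothetical stand-in we wrote so the example runs without an ICPSR account (it only creates an empty folder), download_once() is our own helper name, and everything happens in a temporary folder. In the real assignment, the else branch is where icpsr_download() would go.

```r
# Hypothetical stand-in for icpsr_download(); it only creates an empty folder
fake_download <- function(download_dir) {
  dir.create(file.path(download_dir, "ICPSR_09112"), recursive = TRUE)
  "downloaded"
}

# Throwaway folder, not your real "NYS_data" folder
data_dir <- tempfile("NYS_data_")
dir.create(data_dir)

# Hypothetical helper: download only if the wave 5 folder is missing
download_once <- function(dir) {
  if (dir.exists(file.path(dir, "ICPSR_09112"))) {
    "skipped"            # wave 5 folder already present, nothing to do
  } else {
    fake_download(dir)   # the assignment would call icpsr_download() here
  }
}

first_try  <- download_once(data_dir)
second_try <- download_once(data_dir)

first_try
second_try
```

Either approach keeps knitting from triggering a fresh download each time; the ifelse version in the assignment is just more compact.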

2.3: Read the data into R.

You already did this with the wave 1 data that you downloaded directly from ICPSR in “R Assignment 4.” Now you just need to do it for each of the first five waves of NYS data you just downloaded by telling R where the specific data file is within your file structure. You want to make note of the specific files that were downloaded to the “NYS_data” folder.

Recall from earlier that the actual data are within a folder called “DS0001” within each of the ICPSR folders. You simply want to use the “here” package to tell the read_spss function from the “haven” package where to find the SPSS data for each wave of the data. Make sure you pay close attention to which study numbers are associated with each specific wave of data!

nys_w1 <- read_spss(here("NYS_data", "ICPSR_08375", "DS0001", "08375-0001-Data.sav"))
nys_w2 <- read_spss(here("NYS_data", "ICPSR_08424", "DS0001", "08424-0001-Data.sav"))
nys_w3 <- read_spss(here("NYS_data", "ICPSR_08506", "DS0001", "08506-0001-Data.sav"))
nys_w4 <- read_spss(here("NYS_data", "ICPSR_08917", "DS0001", "08917-0001-Data.sav"))
nys_w5 <- read_spss(here("NYS_data", "ICPSR_09112", "DS0001", "09112-0001-Data.sav"))

If you have done everything correctly up to this point, you should have five data sets in your RStudio Environment representing each of the waves 1 through 5 data that we downloaded with the icpsr_download function above (named “nys_w1,” “nys_w2,” “nys_w3,” “nys_w4,” and “nys_w5”).
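Notice that the five file paths above all follow one pattern: the ICPSR number padded to five digits appears in both the folder name and the file name. As a hedged sketch of that pattern (not a replacement for typing the read_spss lines yourself), the code below builds the same relative paths with base R. The object names are ours, and we use file.path() so the example runs without the “here” package; in the assignment you would hand the same pieces to here().

```r
# ICPSR numbers from the table in section 2.1
study_ids <- c(w1 = 8375, w2 = 8424, w3 = 8506, w4 = 8917, w5 = 9112)

# Pad each number to five digits, e.g. 8375 becomes "08375"
padded <- sprintf("%05d", study_ids)

# Assemble folder and file names following the pattern used above
sav_paths <- file.path("NYS_data",
                       paste0("ICPSR_", padded),
                       "DS0001",
                       paste0(padded, "-0001-Data.sav"))
names(sav_paths) <- names(study_ids)

sav_paths
```

Printing sav_paths is a quick way to double-check that each wave is paired with the right study number before reading anything in.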

Part 3 (Assignment 5.3): Trim, Rename, and Pool Data used by Warr (1993)

In “R Assignment 3” we reproduced Figure 1 from Warr’s (1993) article “Age, Peers, and Delinquency.” Feel free to go back to assignment 3 for a refresher on the article, including a description of the specific variables that Warr (1993) constructed and analyzed. For Figure 1, Warr (1993) plotted the age distribution for the percentage of respondents who reported having no friends who engaged in eight delinquent behaviors in the previous year. Over the next couple of R Assignments, we are going to focus on reproducing and extending Figures 2, 3, and 4 from Warr (1993):

These figures plot the age distribution of 1) Percentage of respondents reporting they average three or more nights per week socializing (i.e. “going on dates, to parties, or other social activities”); 2) Percentage of respondents reporting it was “Very important” or “Pretty Important” to socialize; and 3) Percentage of respondents who reported they “would lie to protect their friends if they got into trouble with the police.”

In order to reproduce these figures we need to:

  1. Identify specific items from each wave of the NYS from which these variables were constructed.
  2. Rename items so they have informative names.
  3. Trim each wave of data so that they only include variables needed to reproduce the figures.
  4. Produce basic descriptive statistics and frequency tables for our key variables.
  5. Recode the specific items to align with Warr’s (1993) coding decisions.
  6. Wrangle the data so that it is in a format we can plot.
  7. Reproduce the plots.

In what follows, we will walk through the first four of these steps and save the last three for R Assignment 6.

3.1: Identify survey items for key variables from Warr (1993)

In order to reproduce Figures 2, 3, and 4 from Warr (1993) we need to start by identifying the specific survey items that Warr used to construct those figures. Let’s start with his description of the items in the article. On pages 24-25, he describes the survey questions that are supposed to measure these “other elements of peer relations” by listing the questions respondents were asked:

  1. “How many evenings in the average week, including weekends, have you gone on dates, to parties, or to other social activities?” (“Less than once a week” to 7).
  2. “How important has it been to you to have dates and go to parties and other social activities?” (1 = not important at all, 2 = not too important, 3 = somewhat important, 4 = pretty important, 5 = very important)
  3. “If your friends got into trouble with the police, would you be willing to lie to protect them?” (1 = no; 2 = maybe; 3 = yes).

Warr (1993) provided us with the specific question wording, which is helpful and not always the case in the published literature. Of course, if you open any of the data sets you downloaded above, you won’t see the specific wording of each survey question. Here is what the first six rows of the wave 1 data look like:

head(nys_w1) %>%
  gt()
CASEID  V5  V6  V7  V8  V9  V10  ...
1       1   2   2   4   4   4    ...
2       1   2   2   4   4   4    ...
3       1   2   2   4   4   4    ...
4       3   2   2   4   4   5    ...
5       1   2   3   4   3   3    ...
6       NA  NA  NA  NA  NA  NA   ...

(Output truncated: the full table continues through column V522.)

Instead of descriptive variable names we get a bunch of columns with variable numbers in the format “V###” (note: if you open up the actual dataset in R you will also see short descriptive variable labels). In order to find the specific survey items Warr (1993) is referring to, we need to go to the codebook and identify the variable numbers.

  • Note: Because variable numbers are not the same across each wave, this requires going to each of the codebooks and looking them up.

  • Note: In the above code we use the gt() function to simply have the data print out in a nice html formatted table. We will show you how to harness the power of the “gt” package to create publication-ready tables in a later R Assignment.
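The short variable labels mentioned above can also be listed from inside R. As far as we know, read_spss() stores each SPSS variable label in an attribute called "label" on the column, so the sketch below imitates that with a tiny toy data frame. The values and label text are made up for illustration (they are not the real NYS labels), and the object names are ours.

```r
# Toy stand-in for an imported SPSS file; values and labels are made up
toy <- data.frame(V169 = c(13, 15), V179 = c(3, 4))
attr(toy$V169, "label") <- "Age (made-up label)"
attr(toy$V179, "label") <- "Evenings out (made-up label)"

# Collect every column's label into one named character vector
var_labels <- sapply(toy, attr, "label")

# Search the labels for a keyword, much like searching the codebook
hits <- var_labels[grepl("Evenings", var_labels)]

var_labels
hits
```

A keyword search like this is no substitute for reading the codebook, but it can be a quick first pass for locating an item.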

Fortunately, the NYS codebooks make this relatively easy as they include bookmarks to different sections of the survey. Here is how it looks in the Wave 1 codebook:

Notice on the left that wave 1 included both a “parent interview” and a “youth interview.” If you go to wave 1 and look through the bookmarks, you’ll notice multiple sections of peer measures, including an “exposure to delinquent peers” section and a “commitment to delinquent peers” section. However, the particular questions for Warr’s (1993) figures 2-3 are in the “Social Integration” section while the dependent variable for Figure 4 is in the “commitment to delinquent peers” section.

  • Note: We actually found them relatively quickly by simply searching for the verbiage in each of the questions above (e.g. “How many evenings”, “have dates and go to parties”, “friends got into trouble with the police”). Figures 2-4 also include age on the x-axis which is in the “Respondent Characteristics” section of the youth interview.

For wave 1, the key variables we need are:

  • V169 - Age
  • V179 - Evenings in average week spent in social activities
  • V180 - Importance of engaging in social activities
  • V377 - Lie to protect friends from trouble with police

Since Warr (1993) used the first five waves, to figure out each specific variable used to construct Figures 2-4, you would simply need to go to each codebook and find each of these items. Fortunately for you, we already did this and made this handy table. You’re welcome!

Warr (1993) Figures 2-4 NYS Items
Item                         Wave 1   Wave 2   Wave 3   Wave 4   Wave 5
ICPSR number [1]             8375     8424     8506     8917     9112
Age                          V169     V7       V10      V6       V6
Evenings spent socializing   V179     V17      V81      V24      V37
Importance of socializing    V180     V18      V82      V25      V38
Lie to police                V377     V223     V321     V301     V328

[1] Note: indicates the ICPSR number for the data set and not a survey item


3.2: Trim data

When working on a specific analysis or set of analyses from a large data set, I generally think it’s good practice to create a more manageable data set with just the items you need. This ensures that your raw data is kept intact and that you do not unintentionally make changes to it. In this particular case, it will also allow me to look at the data and see if the changes I am making to it are working (this is not always possible with analyses that utilize many variables).

Let’s start selecting the specific variables we need to reproduce Warr’s (1993) Figures 2-4. We are going to use the select() function in the “dplyr” package (one of the core packages within the tidyverse suite) to select those specific items in each of the five waves of data. That means, for the wave 1 data, we need to select “V169,” “V179,” “V180,” and “V377.” (“V7,” “V17,” “V18,” and “V223” for wave 2, and so on for waves 3 through 5).

nys_w1_trim <- nys_w1 %>%
  dplyr::select(V169, V179, V180, V377)

In the code above, we are telling R to select “V169,” “V179,” “V180,” and “V377” from the “nys_w1” data object and create a new data set object called “nys_w1_trim”. Our new object has the same number of observations, but only 4 variables. You can check this by looking in your RStudio “Environment.”

  • NOTE: The code above also introduces you to a new way to call a command directly from a specific package. Recall that R is an open-source program within which anyone could conceivably create their own useful packages. While this is one of the program’s greatest strengths, it also poses some challenges. One such challenge is the lack of strict curation across its countless and ever-growing list of packages to ensure that programmers do not incorporate conflicting package commands. From our experience, select() is one of those popular commands that frequently poses conflicts when you have several packages loaded at once. So, in this code chunk, we ensured that the select() command was invoked using the “dplyr” package by appending the package name followed by two colons directly in front of the command (i.e., dplyr::select()).
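Here is a small sketch of the kind of naming conflict described above, using only base R so it runs without any extra packages. We deliberately create our own (hypothetical, and useless) function named median, which masks the real one from the “stats” package, and then use the package::function form to reach the original. The same idea is what dplyr::select() protects against.

```r
# A made-up function that "masks" the real median() from the stats package
median <- function(x) "masked!"

# Without a prefix, R finds our made-up version first
masked_result <- median(c(1, 5, 9))

# With the package prefix, R uses the version from stats
stats_result <- stats::median(c(1, 5, 9))

# Clean up so median() behaves normally again
rm(median)

masked_result
stats_result
```

The prefix costs a few extra keystrokes, but it removes any doubt about which package's function is being called.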

Let’s take a look at the first six observations of the data with the head() function and see what the trimmed data look like.

head(nys_w1_trim) %>%
  gt()
V169 V179 V180 V377
13 3 3 1
15 4 3 3
11 1 5 1
16 2 3 2
14 0 4 NA
11 2 3 3

There are a few problems with the wave 1 trimmed data in its current form:

  • First, the variable names are not informative. They are simply the names in the original data file. Like with naming your computer files, it is usually a good practice to give informative names to your variables (and other R objects; see part 1 of Navarro’s series on “dplyr”). Using meaningful and systematic naming conventions will also be useful when we combine the data sets, since we can assign the same name across each wave before merging or combining them.

  • Second, the trimmed data we created has no information about which wave these data come from (except in the object name) nor does it include the unique identifier for individuals. If we were just working with the wave 1 data, this would not be a huge problem; also, since Warr (1993) simply pooled all five waves of data, the individual identifiers are less important. Nonetheless, it is generally good practice to preserve such important information.

3.3. Rename existing variables and create a new variable indicating NYS wave.

In order to rename the four variables in which we are currently interested (i.e., age, evenings spent socializing, importance of socializing, and lie to police) we will use the rename() function. To create a new variable indicating the wave of the data, we will use the mutate() function. Both of these functions are part of the “dplyr” package. The rename() function does exactly what it says - it renames existing items (or columns) in a data set, whereas the mutate() function allows us to create new variables (or columns) and to manipulate existing items in the data.

Let’s do this with the wave 1 data again.

  • Note: Below, we are writing over the nys_w1_trim data that we created earlier. Usually, we avoid writing over objects, as doing so can cause confusion and errors as we try to keep track of what exactly is in the object. Nonetheless, as you will see, we are going to repeat the select() function that we used before, though, this time, that command will be followed by a pipe and additional commands using rename and mutate functions. One of the nice things about the “dplyr” package and the “pipe” (%>%) you learned about in a previous R Assignment is that you can sequentially invoke various commands all at once within the same code chunk.
nys_w1_trim <- nys_w1 %>%
  dplyr::select(CASEID, V169, V179, V180, V377) %>%
  rename(age = V169, 
         evsoc = V179,
         socimp = V180,
         liepolice = V377) %>%
  mutate(wave = 1)

A couple things to note about the above code:

  • First, like before, we are telling R to select “V169,” “V179,” “V180,” and “V377” from “nys_w1.” However, one key difference this time is that we immediately followed this with a pipe to a sequential rename command, which tells R to subsequently rename specific items after selecting them. Within the rename command, we specifically tell R what to rename each variable by invoking a new name as equal to an old name (i.e. new name = old name).

  • Second, the rename command is then followed by another pipe to a sequential mutate command that tells R to create or modify a variable after completing the rename command. In this case, our mutate command tells R to create a new variable named “wave” that equals “1” to indicate the wave of the data we are working with (i.e. variable_name = value). Since this command is not conditional (e.g., there is no ifelse operator), all 1,725 rows or observations (i.e., “cases” or “respondents”) from NYS wave 1 will include a variable column named “wave” with a value that equals “1” in each row. And, of course, the first line of code tells R to assign all of these operations into a new object called “nys_w1_trim.”

    • Note: We also included the “CASEID” item in the select command above; ICPSR and the NYS made it easy on us by consistently naming the identifier “CASEID” in each wave of data.

    • Note: The mutate() function can do a lot more than just assign a value to a new variable. We’ll discuss this more in the next R Assignment.
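To make the difference between an unconditional and a conditional mutate() concrete, here is a minimal sketch on made-up data. The toy values, the column name “older,” and the cutoff of 15 are assumptions for illustration only; they are not part of the NYS data or of Warr's (1993) analysis.

```r
library(dplyr)

# Toy data (made-up values, not NYS data)
toy <- tibble(CASEID = 1:4, age = c(11, 14, 16, 19))

toy_flagged <- toy %>%
  mutate(
    wave  = 1,                        # unconditional: same value in every row
    older = ifelse(age >= 15, 1, 0)   # conditional: value depends on each row's age
  )

toy_flagged
```

Every row gets wave = 1, while “older” is 1 only in the rows where the condition inside ifelse() is true.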

Here is what the data look like:

head(nys_w1_trim) %>%
  gt()
CASEID age evsoc socimp liepolice wave
1 13 3 3 1 1
2 15 4 3 3 1
3 11 1 5 1 1
4 16 2 3 2 1
5 14 0 4 NA 1
6 11 2 3 3 1

Now, let’s create the trimmed data for each of the first five waves of the NYS. Again, we will write over the nys_w1_trim data we created above, as we would usually do all of these commands in the same code chunk. Here is the table of the items as a reminder:

Warr (1993) Figures 2-4 NYS Items

Item                       | Variable name | Wave 1 | Wave 2 | Wave 3 | Wave 4 | Wave 5
---------------------------------------------------------------------------------------
ICPSR number¹              |               |   8375 |   8424 |   8506 |   8917 |   9112
Age                        | age           |   V169 |     V7 |    V10 |     V6 |     V6
Evenings spent socializing | evsoc         |   V179 |    V17 |    V81 |    V24 |    V37
Importance of socializing  | socimp        |   V180 |    V18 |    V82 |    V25 |    V38
Lie to police              | liepolice     |   V377 |   V223 |   V321 |   V301 |   V328

¹ Note: Indicates the ICPSR number for the data set, not a survey item.


Here is the code to create five trimmed data objects corresponding to each of the first five waves of NYS data.

#Wave 1:
nys_w1_trim <- nys_w1 %>%
  dplyr::select(CASEID, V169, V179, V180, V377) %>%
  rename(age = V169, 
         evsoc = V179,
         socimp = V180,
         liepolice = V377) %>%
  mutate(wave = 1)

head(nys_w1_trim) %>%
  gt()

#Wave 2:
nys_w2_trim <- nys_w2 %>%
  dplyr::select(CASEID, V7, V17, V18, V223) %>%
  rename(age = V7, 
         evsoc = V17,
         socimp = V18,
         liepolice = V223) %>%
  mutate(wave = 2)

head(nys_w2_trim) %>%
  gt()

#Wave 3:
nys_w3_trim <- nys_w3 %>%
  dplyr::select(CASEID, V10, V81, V82, V321) %>%
  rename(age = V10, 
         evsoc = V81,
         socimp = V82,
         liepolice = V321) %>%
  mutate(wave = 3)

head(nys_w3_trim) %>%
  gt()

#Wave 4:
nys_w4_trim <- nys_w4 %>%
  dplyr::select(CASEID, V6, V24, V25, V301) %>%
  rename(age = V6, 
         evsoc = V24,
         socimp = V25,
         liepolice = V301) %>%
  mutate(wave = 4)

head(nys_w4_trim) %>%
  gt()

#Wave 5:
nys_w5_trim <- nys_w5 %>%
  dplyr::select(CASEID, V6, V37, V38, V328) %>%
  rename(age = V6, 
         evsoc = V37,
         socimp = V38,
         liepolice = V328) %>%
  mutate(wave = 5)

head(nys_w5_trim) %>%
  gt()

In the code above, we simply kept the same basic code that we used for wave 1 earlier, but replaced the item information with the appropriate items that we identified previously for the other four waves of NYS data. Now, you should have five separate “trimmed” data objects, each with the same six variables in them (“CASEID,” “age,” “evsoc,” “socimp,” “liepolice,” and “wave”).

  • Note: In the above code we repeat the same code structure but change certain details. For now, it may be efficient for you to copy and paste from your own code in these situations. However, it is important to recognize that copying and pasting repeatedly is generally considered an ill-advised and error-prone coding/programming practice (see R for Data Science for an introduction to functions). Eventually, you want to be able to use existing functions and write your own simple functions for accomplishing these repetitive tasks.
    • We will not introduce writing functions in this class and, to be honest, we are still working on consistently using functions in our own work. However, one of the key advantages of using R is that it is a functional programming language. So, unlocking the ability to use functions really does supercharge your R abilities (R for Data Science and Advanced R provide book introductions to functional programming in R and Danielle Navarro has a video series about functional programming in R).
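For the curious, here is one way the repeated select/rename/mutate steps could be wrapped in a function. This is only a sketch: the helper name trim_wave(), its arguments, the toy data, and the extra column V999 are all hypothetical, and you are not expected to use this approach in the assignment.

```r
library(dplyr)

# Hypothetical helper (not part of the assignment): keep the listed items,
# rename them, and tag every row with its wave number.
trim_wave <- function(data, wave_num, items) {
  data %>%
    dplyr::select(CASEID, all_of(unname(items))) %>%  # keep ID plus the old item names
    rename(all_of(items)) %>%                         # named vector: new name = "old name"
    mutate(wave = wave_num)
}

# Toy stand-in for a wave 1 file (made-up values; V999 is an invented extra column)
toy_w1 <- tibble(CASEID = 1:3, V169 = c(13, 15, 11), V179 = c(3, 4, 1),
                 V180 = c(3, 3, 5), V377 = c(1, 3, 1), V999 = 0)

# Wave 1 item names taken from the table of items
w1_items <- c(age = "V169", evsoc = "V179", socimp = "V180", liepolice = "V377")

toy_w1_trim <- trim_wave(toy_w1, wave_num = 1, items = w1_items)
names(toy_w1_trim)
```

With a helper like this, each wave would need only its own named vector of items rather than a full copy of the pipeline, which leaves fewer places for a mistyped variable name to hide.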

3.4. Pool data from all five waves

Now that we have all five data sets with the same variables and variable names, pooling them together is relatively easy. But first, we want you to try to build some intuition regarding what Warr (1993) did here when he says that “…all five years of the NYS data were pooled, producing a composite sample of 8,625 persons aged 11-21 (pg. 20).”

  • Actually, this statement is somewhat misleading as, to the untrained reader, it seems to imply that there are 8,625 different persons in the pooled data. However, this is not the case. Remember, the NYS is a longitudinal panel study. This means the researchers set out to study the same people over an extended period of time, in this case once per year over the first five years of the study. So, the 8,625 persons are really 1,725 unique individuals, each of whom was surveyed (or, more accurately, whom the researchers attempted to survey) five times in the first five years of the study (1,725 persons x 5 waves = 8,625 person-waves).

Because we created five trimmed data sets with the same variables in section 3.3 above, “pooling” the data in this case really just means stacking the waves of data on top of each other. In other words, I want to put Wave 1 data on top, Wave 2 data next, then Wave 3 data, then Wave 4 data, until the Wave 5 observations are at the bottom of the data set. Fortunately, the “dplyr” package makes this relatively easy with the bind_rows() function. Essentially, when you use the bind_rows() function, you are simply telling R which data sets to stack on top of each other by the order in which you list them.

  • Note: there is a corollary function bind_cols() that tells R to place (columns of) data sets next to each other.
nys_fwtrim <- bind_rows(nys_w1_trim, nys_w2_trim, nys_w3_trim, nys_w4_trim, nys_w5_trim)

head(nys_fwtrim) %>%
  gt()

In the code above, we told R to stack waves 1 through 5 on top of each other in chronological order and assign it to the object “nys_fwtrim” (we used fw to indicate we were creating a data set that included all “five waves” of data).
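If stacking is hard to picture, the following sketch shows bind_rows() on two tiny made-up “waves” and then distinguishes rows (person-waves) from unique people. The toy objects and values are assumptions for illustration, not NYS data.

```r
library(dplyr)

# Toy person-wave data: the same 3 people surveyed in 2 waves (made-up values)
toy_w1 <- tibble(CASEID = 1:3, age = c(11, 12, 13), wave = 1)
toy_w2 <- tibble(CASEID = 1:3, age = c(12, 13, 14), wave = 2)

toy_pooled <- bind_rows(toy_w1, toy_w2)  # wave 1 rows on top, wave 2 rows below

nrow(toy_pooled)                # number of rows = person-waves
n_distinct(toy_pooled$CASEID)   # number of unique people
count(toy_pooled, wave)         # rows contributed by each wave
```

Running the same three checks on nys_fwtrim is a quick way to confirm that the pooled object has the number of person-waves and unique respondents described above.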

Part 4 (Assignment 5.4): Describe items in pooled data

You should now have a pooled data set called “nys_fwtrim” that has 8,625 observations and six variables that have informative names. An important first step in analyzing data is looking at basic descriptives for your key variables. This is often the first step in identifying the basic distribution of variables, identifying outliers, and identifying potential problems in the data (e.g., missing data).

4.1: View frequency tables for key variables

If you look again at Figures 2-4, you will notice that Warr (1993) reports, by age, the “Percentage of Respondents…” who:

  1. “Averaged three or more nights per week” engaging in social activities (e.g., dates and parties).
  2. responded that it “was ‘very important’ or ‘pretty important’” to engage in social activities.
  3. reported that “they would lie to protect their friends if they got in trouble with the police.”

Each of these variables was dichotomized from specific survey questions with more than two response categories. One reason for this may be that the data are fairly skewed, or concentrated in certain answer categories. For example, it is likely pretty rare for teenage respondents to report averaging seven nights a week socializing via things like dates and parties. Of course, we’ll be able to confirm this below.

We can check the frequency distributions and modal categories for each of the three peer items by creating a frequency table for them. There are lots of ways to create frequency tables and calculate and produce tables of descriptive statistics (see here for review), including the base R command table(), the more tidyverse-oriented command, tabyl(), that is part of the “janitor” package, and the frq() and flat_table() commands in the “sjmisc” package.

  • Note: If you use the base R command, you’ll need to use the $ operator to tell R which variable to use from the data set, e.g., table(nys_fwtrim$evsoc). For now, we’ll use the “sjmisc” package.

  • Note: We also include the frequency table for age.
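As a small illustration of the base R route mentioned in the notes above, here is a sketch using a made-up data frame. The toy values are assumptions; the point is only to show the $ operator and how table() treats missing values.

```r
# Toy data frame (made-up values; NA marks a missing answer)
toy <- data.frame(evsoc = c(0, 1, 1, 2, NA, 3, 1))

table(toy$evsoc)                    # counts; missing values are dropped by default
table(toy$evsoc, useNA = "ifany")   # adds a count for NA when any are present
```

Note that table() silently leaves out missing values unless you ask for them with the useNA argument.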

library(sjmisc)

nys_fwtrim %>%
  frq(age, evsoc, socimp, liepolice)
## 
## age <numeric>
## # total N=8625  valid N=8625  mean=15.87  sd=2.40
## 
## Value |    N | Raw % | Valid % | Cum. %
## ---------------------------------------
##    11 |  252 |  2.92 |    2.92 |   2.92
##    12 |  509 |  5.90 |    5.90 |   8.82
##    13 |  778 |  9.02 |    9.02 |  17.84
##    14 | 1036 | 12.01 |   12.01 |  29.86
##    15 | 1289 | 14.94 |   14.94 |  44.80
##    16 | 1276 | 14.79 |   14.79 |  59.59
##    17 | 1216 | 14.10 |   14.10 |  73.69
##    18 |  947 | 10.98 |   10.98 |  84.67
##    19 |  689 |  7.99 |    7.99 |  92.66
##    20 |  436 |  5.06 |    5.06 |  97.72
##    21 |  197 |  2.28 |    2.28 | 100.00
##  <NA> |    0 |  0.00 |    <NA> |   <NA>
## 
## 
## Y5-37:EVENINGS/WK SPENT DATING/SOCIAL (evsoc) <numeric>
## # total N=8625  valid N=8029  mean=2.08  sd=1.57
## 
## Value |               Label |    N | Raw % | Valid % | Cum. %
## -------------------------------------------------------------
##     0 | Less than once a wk | 1288 | 14.93 |   16.04 |  16.04
##     1 |                   1 | 1899 | 22.02 |   23.65 |  39.69
##     2 |                   2 | 2068 | 23.98 |   25.76 |  65.45
##     3 |                   3 | 1463 | 16.96 |   18.22 |  83.67
##     4 |                   4 |  677 |  7.85 |    8.43 |  92.10
##     5 |                   5 |  386 |  4.48 |    4.81 |  96.91
##     6 |                   6 |  111 |  1.29 |    1.38 |  98.29
##     7 |                   7 |  137 |  1.59 |    1.71 | 100.00
##  <NA> |                <NA> |  596 |  6.91 |    <NA> |   <NA>
## 
## 
## Y1-15: HOW IMPORTANT SOCIAL (socimp) <numeric>
## # total N=8625  valid N=8028  mean=3.31  sd=1.13
## 
## Value |              Label |    N | Raw % | Valid % | Cum. %
## ------------------------------------------------------------
##     1 |      Not important |  474 |  5.50 |    5.90 |   5.90
##     2 |  Not too important | 1460 | 16.93 |   18.19 |  24.09
##     3 | Somewhat important | 2543 | 29.48 |   31.68 |  55.77
##     4 |   Pretty important | 2176 | 25.23 |   27.11 |  82.87
##     5 |     Very important | 1375 | 15.94 |   17.13 | 100.00
##  <NA> |               <NA> |  597 |  6.92 |    <NA> |   <NA>
## 
## 
## Y1-191: WILLING TO LIE (liepolice) <numeric>
## # total N=8625  valid N=7638  mean=1.59  sd=0.77
## 
## Value |      Label |    N | Raw % | Valid % | Cum. %
## ----------------------------------------------------
##     1 |         No | 4469 | 51.81 |   58.51 |  58.51
##     2 | Don't know | 1830 | 21.22 |   23.96 |  82.47
##     3 |        Yes | 1338 | 15.51 |   17.52 |  99.99
##     4 |          4 |    1 |  0.01 |    0.01 | 100.00
##  <NA> |       <NA> |  987 | 11.44 |    <NA> |   <NA>

As you can see above, the frq() command in the “sjmisc” package prints out some pretty bare bones tables of the frequency distributions for our key variables. One thing we particularly like about this command is it allows us to include all of the variables for which we want frequency tables in the same command. Thus, it is a good command for getting a quick look at the key variables in your data.

We also like that the frq() command defaults to printing out the frequency of missing observations for each variable (the “<NA>” entry). When you are first looking at a given set of data, you want to know for which variables missing observations are particularly pronounced. In the case of the NYS, these missing observations likely result from item-level non-response (e.g., respondents not answering specific questions) as well as attrition (e.g., respondents taking the first survey but not responding to subsequent surveys).

  • Note: One thing we do not like about the frq() command in the “sjmisc” package is that taking the outputted tables above and converting them to a tidy data frame (i.e., tibble) for using with tidyverse packages like “ggplot2” is not as easy as with other commands.

  • Assignment: Take some time to look over the distributions for our key variables above. What do you notice about each variable’s distribution (e.g., where do most respondents fall in the distribution, what is the most common answer across the five waves, which items have more missing data, etc.)? From the above tables, do you notice any potential problems with the items (e.g., values that don’t make sense based on description of variable in the codebook)? Create a sub-header in your RMD file and write out your responses to these questions.
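If you do want a frequency table as a tidy data frame, one possible approach is to build it with “dplyr” directly. The sketch below uses made-up data and hypothetical column names (raw_pct, valid_pct) to mimic the “Raw %” and “Valid %” columns that frq() prints; it is an illustration, not a replacement for the frq() output above.

```r
library(dplyr)

# Toy item with two missing answers (made-up values, not NYS data)
toy <- tibble(liepolice = c(1, 1, 2, 3, NA, 1, 2, NA))

toy_freq <- toy %>%
  count(liepolice) %>%                 # one row per value, with NA as its own row
  mutate(
    raw_pct   = 100 * n / sum(n),      # share of all rows, missing included
    valid_pct = ifelse(is.na(liepolice), NA_real_,
                       100 * n / sum(n[!is.na(liepolice)]))  # share of non-missing rows
  )

toy_freq
```

Because toy_freq is a tibble, it can be passed straight to other tidyverse functions. The gap between raw_pct and valid_pct grows as the share of missing answers grows.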

4.2: View Cross-tabulations for Age by Peer Relations Variables

Warr (1993) was fundamentally interested in the age distribution of these “Other elements of peer relations.” Essentially, Figures 2 through 4 are simply presenting cross-tabulations of Age by dichotomized versions of the variables for which we just looked at frequency tables. Although we’ll save dichotomizing the variables for the next assignment, we can produce basic cross-tabulations for age and the non-dichotomized versions of the variables relatively easily using the flat_table() function in the “sjmisc” package.

nys_fwtrim %>%
  flat_table(evsoc, age, margin = "col") #note: margin = "col" tells it to give me column percentages
##                     age    11    12    13    14    15    16    17    18    19    20    21
## evsoc                                                                                    
## Less than once a wk     40.80 41.25 31.36 23.09 15.36 10.52  7.09  4.88  6.47  6.32 10.30
## 1                       28.80 31.79 30.70 29.72 25.10 22.45 17.80 15.81 19.24 19.74 25.45
## 2                       18.00 14.89 21.74 22.19 26.67 27.96 30.82 27.11 26.87 30.00 32.12
## 3                        8.40  7.24  8.83 14.36 18.66 19.20 22.76 25.68 23.05 24.47 21.21
## 4                        2.40  3.22  3.69  5.82  8.01 10.60 10.72 11.89 12.77 10.26  4.85
## 5                        0.80  1.41  2.50  3.31  3.22  6.43  6.38  8.44  6.30  5.79  3.64
## 6                        0.00  0.00  0.79  0.60  0.99  1.00  2.13  3.09  2.82  1.84  0.61
## 7                        0.80  0.20  0.40  0.90  1.98  1.84  2.30  3.09  2.49  1.58  1.82
nys_fwtrim %>%
  flat_table(socimp, age, margin = "col")
##                    age    11    12    13    14    15    16    17    18    19    20    21
## socimp                                                                                  
## Not important          12.40 13.28  9.59  7.43  4.71  4.59  2.92  3.69  3.98  5.00  6.67
## Not too important      24.00 26.76 24.05 19.98 17.84 17.46 15.32 12.75 15.75 13.68 20.00
## Somewhat important     25.60 25.55 30.22 31.73 32.62 31.41 28.52 34.56 37.48 36.32 35.76
## Pretty important       25.20 22.33 22.08 25.20 27.09 29.16 31.53 29.32 26.04 28.16 24.24
## Very important         12.80 12.07 14.06 15.66 17.75 17.38 21.70 19.67 16.75 16.84 13.33
nys_fwtrim %>%
  flat_table(liepolice, age, margin = "col") 
##            age    11    12    13    14    15    16    17    18    19    20    21
## liepolice                                                                       
## No             74.77 69.80 63.95 61.06 56.93 55.58 53.12 56.85 53.18 58.31 63.80
## Don't know     15.89 20.81 24.32 22.87 24.49 23.79 26.28 22.74 26.76 25.33 22.70
## Yes             9.35  9.40 11.59 16.06 18.58 20.63 20.60 20.42 20.07 16.36 13.50
## 4               0.00  0.00  0.14  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

Above we produced three basic tables with our three “other elements of peer relations” items representing the rows and age representing the columns. We used the margin = "col" argument in the flat_table() function to get the percentage of each response category for each age group across all five waves of data. Again, this is fundamentally what Figures 2-4 in Warr’s (1993) paper are doing, only here we’re showing this for the raw/untransformed items rather than the dichotomized versions that Warr (1993) created.

  • You can get a sense of whether we are getting the same results as Warr from these tables by eyeballing the total percentage in each age category that falls into the particular category group that Warr (1993) is plotting. For example, in Figure 2, Warr (1993) plots the “Percentage of respondents reporting that they averaged three or more nights per week going ‘on dates, to parties, or to other social activities,’ by age.” This means he is plotting the total percentage of the bottom five rows of our first table above (rows representing 3 to 7 nights a week on average). So, for example, the total percentage of respondents aged 11 who report socializing 3 to 7 nights a week on average is 8.4 + 2.4 + 0.8 + 0 + 0.8 = 12.4. This appears to correspond to the value at age 11 in Warr’s (1993) Figure 2 plot.
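The same arithmetic can be done in R rather than by hand. The sketch below first shows, on made-up data, how column percentages can be computed in base R with prop.table(), and then sums the “3” through “7” rows for age 11 using the percentages printed in the evsoc table above. The toy data frame and object names are assumptions for illustration.

```r
# Part 1: column percentages on toy data (made-up values, not NYS data)
toy <- data.frame(evsoc = c(0, 0, 3, 4, 1, 3),
                  age   = c(11, 11, 11, 12, 12, 12))

col_pct <- 100 * prop.table(table(toy$evsoc, toy$age), margin = 2)  # margin = 2: by column
colSums(col_pct)   # each age column sums to 100

# Part 2: the age-11 column copied from the evsoc-by-age table above
age11 <- c("Less than once a wk" = 40.80, "1" = 28.80, "2" = 18.00, "3" = 8.40,
           "4" = 2.40, "5" = 0.80, "6" = 0.00, "7" = 0.80)

three_plus <- sum(age11[c("3", "4", "5", "6", "7")])  # 3 to 7 nights per week
three_plus
```

Swapping in another age's column gives the corresponding point to compare against Warr's (1993) Figure 2.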

Part 5 (Assignment 5.5): Draw the Owl

In order for you to demonstrate that you can apply the basic data wrangling and descriptive analysis skills that you learned above on your own, in the last part of the assignment, you will consider alternative operationalizations of one of the “other elements of peer relations” that Warr (1993) was examining in Figures 2-4. In doing so, you will provide a type of robustness check to one of Warr’s (1993) methodological decisions.

Specifically, Warr (1993) used the question about being willing to “lie to protect their friends if they got in trouble with the police” as an indicator of respondents’ “commitment or loyalty to their own particular set of friends (pg. 19).” However, in the section on “Commitment to Delinquent Peers” in the codebooks for the first five waves of NYS data, there are two other questions that are meant to measure “commitment” to peers who are engaging in delinquency:

  • “If you found that your group of friends was leading you into trouble, would you still run around with them?” (1 = No, 2 = Maybe, 3 = Yes)
  • “If you found that your group of friends was leading you into trouble, would you try to stop these activities?” (1 = No, 2 = Maybe, 3 = Yes)

These were in addition to the question Warr (1993) examined:

  • “If your friends got into trouble with the police, would you be willing to lie to protect them?” (1 = No, 2 = Maybe, 3 = Yes)

Here is a table, similar to what I provided above, that shows you where each item is located in each of the first five waves of NYS data:

Warr (1993) Figures 2-4 NYS Items

Item                          | Wave 1 | Wave 2 | Wave 3 | Wave 4 | Wave 5
--------------------------------------------------------------------------
ICPSR number¹                 |   8375 |   8424 |   8506 |   8917 |   9112
Age                           |   V169 |     V7 |    V10 |     V6 |     V6
Still run around with friends |   V375 |   V221 |   V319 |   V299 |   V326
Try to stop activities        |   V376 |   V222 |   V320 |   V300 |   V327
Lie to police                 |   V377 |   V223 |   V321 |   V301 |   V328

¹ Note: Indicates the ICPSR number for the data set, not a survey item.

5.1: Extend Warr’s Analysis of Commitment to Delinquent Peers

In order to complete the assignment, here is what you need to do:

  1. Before looking at the data, write a brief statement or commentary about whether you think the other two “commitment to delinquent peers” items will have a similar age distribution to the “lie to police” item for which you already produced the descriptive table.

  2. Trim, rename, and pool waves 1-5 data so that you have all three “commitment to delinquent peers” items in the same pooled data set.

    • Note: You are welcome to modify the code you wrote in Part 3 to create a pooled data set.
  3. Produce frequency tables for each of the “commitment to delinquent peers” items as well as cross-tabulations for these items by age.

    • Note: see code in Part 4 for example.
  4. Write a brief statement or commentary about the similarities and differences between each of the “commitment to delinquent peers” items in terms of their raw frequency distribution and their age distribution.

  5. Write a “Conclusion” section where you write about what you learned in this assignment and any problems or issues you had in completing it.

Part 6 (Assignment 5.6)

Submit your assignment

  1. “Knit” your final RMD file to html format and save it using an informative file name (e.g., “LastName_CRM495_RAssign5_YEAR_MO_DY”) within a file structure you create for this assignment (e.g., “LastName_CRM495_RAssign5”).

  2. Submit your knitted html file on Canvas.

  3. Place a copy of your root folder in your LastName_495_commit folder on OneDrive.

    • Note: The root folder should contain your reproducible file structure for this assignment. This means it should include anything necessary to reproduce your knitted html document with “one click.”