Assumptions & Ground Rules

The purpose of this assignment is to learn how to download and provide basic descriptions of data and specific variables analyzed in a published study. Up to this point, we have used built-in R data (e.g., R Assignment 2), provided you with the data you were working with (e.g., subsets of the NYS data in R Assignment 3), or had you download the data manually from ICPSR and place it within your reproducible file structure (e.g., R Assignment 4). The approach we used in the last R Assignment works fine when you are using your own data and/or data that you have permission to share. However, this is not generally the case with data on ICPSR. According to ICPSR’s bylaws, you are not technically allowed to share ICPSR data in your own online repository (e.g., OSF or GitHub). For this assignment, we will show you how to download SPSS data directly within R and begin looking at the data via basic descriptive statistics.

Specifically, for this assignment, we will:

  1. Create part of the file structure within R
  2. Download NYS data directly from ICPSR into the file structure
  3. Subset data and combine multiple data sets into one
  4. Identify, rename, and provide basic descriptions of specific variables/items used in the study
  5. Provide a basic introduction to the ifelse function and logic in R

I assume that you are now familiar with installing and loading packages in R. Thus, when you see a package being used, I expect that you know it needs to be installed and that it needs to be loaded within your own R session in order to use it.

At this point, I also assume you are familiar with RStudio and with creating R Markdown (RMD) files. If not, please review R Assignments 1 & 2.

  • Note: For this assignment, have R Markdown knit to an HTML file.

As with previous assignments, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste from the instructions except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).



Part 1 (Assignment 5.1): Create file structure within R

In the last R Assignment (“Reproducible File Structure”), we created a basic reproducible file structure and shared it using your computer’s operating system. Here, we are going to create most of the folders we need using R code. This is useful for our specific purpose, downloading data directly from ICPSR, because we do not have to rely on someone placing the data in the correct folder; we can simply share the code that creates the folder in their own root directory.

1.1: Create root folder for R Assignment #5

Before we start creating folders and downloading data within R, we need to create a root folder, save our RMD file inside it, and then close the file and reopen it directly from that root folder (so the “here” package will start in the correct folder on our computer):

  1. Go to your “LastName_CRM495_work” folder on OneDrive
  2. Create a new root folder titled “LastName_CRM495_RAssignment5” inside it.
  3. Save your RMD file in the new root folder as LastName_CRM495_RAssign5_YEAR_MO_DY
  4. Close RStudio, go to the root folder, and open the RMD file.
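
Once you have reopened the RMD file from the root folder, a quick check along the following lines should confirm where the “here” package is pointing. This is only a sketch: it assumes the “here” package is installed, and the path printed will depend on your own computer.

```r
# Load the "here" package (assumed to be installed already)
library(here)

# Full path that "here" treats as the root folder
root_path <- here()
print(root_path)

# Name of the root folder only; after step 4 this should be your
# "LastName_CRM495_RAssignment5" folder
print(basename(root_path))
```

If the folder name printed is not your root folder, “here” is probably starting from a different location; closing RStudio and reopening the RMD file from inside the root folder is the first thing to try.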

1.2: Create “NYS_data” folder within R

Create a subfolder called “NYS_data” within your “LastName_CRM495_RAssignment5” root folder. Technically, you could do this yourself by navigating to the folder on your computer and creating a new “NYS_data” folder manually. But we can also do it in R with the following code. Again, doing it in the R environment helps ensure that anyone else (including our future selves) can easily reproduce our work with minimal effort.

# load the "here" package so paths are built relative to the root folder
library(here)
# check if "NYS_data" folder exists (TRUE if it does) & create it if it does not exist
ifelse(dir.exists(here("NYS_data")), TRUE, dir.create(here("NYS_data")))
## [1] TRUE

Let us try to explain the above code to you. The “ifelse” command is a logical function within base R. To get more details about it, type ?ifelse into the console window. Here is the description of that function:

ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.

It takes the following form: ifelse(test, yes, no). This means you give R a logical test (a question that can be answered yes or no), and it returns a value or runs another function based on the result of that test (i.e., based upon the answer to that question).

In the above code, we use the dir.exists function to ask whether the “NYS_data” folder exists within our root folder (i.e., your “LastName_CRM495_RAssignment5” folder). If the answer is yes, R simply returns the logical value we told it to return, in this case TRUE. If the answer is no, R creates the “NYS_data” folder with the dir.create function. Again, type ?dir.exists or ?dir.create for more information.
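
To see the ifelse(test, yes, no) form in isolation, here is a minimal sketch that uses a small made-up vector (the name ages and its values are hypothetical, not NYS data). It illustrates that ifelse answers the test separately for each element it is given.

```r
# Hypothetical ages; these values are made up for illustration
ages <- c(12, 15, 18, 21)

# test: is each age under 18?   yes: "juvenile"   no: "adult"
age_group <- ifelse(ages < 18, "juvenile", "adult")
print(age_group)
## [1] "juvenile" "juvenile" "adult"    "adult"
```

In the folder example, the test produces a single TRUE or FALSE, so ifelse returns a single value.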

If you want to have some fun, you can actually have R return a text string instead of the logical value. For example:

ifelse(dir.exists(here("NYS_data")), "You already created that folder, dummy!", dir.create(here("NYS_data")))
## [1] "You already created that folder, dummy!"

Generally, it is probably not a great idea to have R call the user (yourself in this case) a “dummy” with code you plan to eventually share publicly. Yet, it is also OK to have some fun when doing science. I (Jake) think that having a computer program call me a “dummy” is fun - perhaps you do not.

  • Note: there is probably a programming rationale, of which I am unaware, for returning the logical value rather than a text string.

  • Note: the tidyverse (specifically, the dplyr package) has a stricter if_else function. According to the documentation, what makes it more strict is that “It checks that true and false are the same type.” I’ll be honest, I’m not sure exactly when this strictness is useful (tidyverse says it can allow for more predictable use and is somewhat faster). For most of what we will be using it for, either the base ifelse or the tidyverse if_else function will likely work just fine. I am going to use the more general ifelse function from base R.
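
The difference between the two functions can be sketched with a toy example. The vector x below is hypothetical, and the code assumes the dplyr package (which provides if_else) is installed. Base ifelse quietly converts mismatched yes and no values to a common type, whereas if_else stops with an error.

```r
library(dplyr)  # provides if_else; assumed to be installed

x <- c(1, 5, 10)  # made-up values for illustration

# Base R: "big" is text and 0 is a number, so the 0 is silently turned into "0"
base_result <- ifelse(x > 3, "big", 0)
print(base_result)
## [1] "0"   "big" "big"

# dplyr: the same mismatch is an error; tryCatch() captures it so the code keeps running
strict_result <- tryCatch(
  if_else(x > 3, "big", 0),
  error = function(e) "if_else refused to mix text and numbers"
)
print(strict_result)

# With matching types, if_else gives the same kind of answer as ifelse
matched_result <- if_else(x > 3, "big", "small")
print(matched_result)
## [1] "small" "big"   "big"
```

The silent conversion in the first call is the kind of surprise that the stricter function is meant to catch.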

You should now have a file structure in which your “LastName_CRM495_RAssignment5” root folder contains your RMD file and a “NYS_data” subfolder.
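
If you would like to confirm the structure from within R rather than in your computer’s file browser, something like the following should work. It is a sketch that assumes the “here” package is installed and that you opened the RMD file from your root folder.

```r
library(here)  # assumed to be installed

# List the files and folders directly inside the root folder
root_contents <- list.files(here())
print(root_contents)

# TRUE if the "NYS_data" folder exists in the root folder
has_data_folder <- dir.exists(here("NYS_data"))
print(has_data_folder)
```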

Part 2 (Assignment 5.2): Download ICPSR data directly from within R

Now that you have the basic file structure for this assignment, and specifically the “NYS_data” folder, it’s time to download the first five waves of NYS data. Recall that these are the waves of data that Warr (1993) used in his study of delinquent peer influences and the age distribution of crime.

As we mentioned previously, it is technically against ICPSR’s bylaws to share data housed on ICPSR “without the written agreement of ICPSR.” This means that if you included ICPSR data and/or documentation directly within a reproducible file structure that you shared with someone else, you would technically be violating the bylaws. Fortunately, there is a package called “icpsrdata” that allows you to download data housed on ICPSR directly from within R. This means you simply need to provide the code for downloading and wrangling the data, and you are 1) not violating ICPSR’s bylaws and 2) adhering to open and reproducible research practices. Let’s show you how to do that now.

2.1: Identify ICPSR numbers for data you want to download

We need to know the ICPSR numbers for the first five waves of the NYS. Go to the NYS Series page on ICPSR and make note of the ICPSR numbers for the first five waves of data.