Background, Motivation, & Assumptions

So far, throughout the various R Assignments, I have been subliminally training you to create a reproducible workflow and file structure. The purpose of this assignment is to make the reasons explicit for some of the things I’ve had you do, so far, with little to no explanation.

Creating a reproducible file structure is important for efficiently working with and sharing your research and data. When you have a reproducible file structure, it will be easier for you to come back to a project after weeks or months of not working on it and pick up where you left off. Also, it will be easier for others to work directly with you, reproduce your work, and/or expand upon your work.

Of course, in order to create a reproducible file structure, you need to understand how your computer stores and accesses the files it uses. As computers have become more user-friendly, this knowledge has become less common among younger generations of computer users. In fact, it’s something you can regularly find older faculty complaining about on Twitter.

Monica Chin has an interesting article about the issue. In that article, she interviews professors and students and concludes by suggesting that your generation will likely create your own tools that will be less dependent on traditional file structures. But guess what? We’re not there yet. So you should learn something about the basics of creating a reproducible file structure so you can do reproducible work. Along the way we will also learn some things about R and RStudio that can be helpful in producing reproducible research. Specifically, we will:

  1. Create a basic file structure template.
  2. Use the “here” package to access files and create a robust workflow.
  3. Download data from ICPSR and load it into R using the “haven” package.
  4. Create a README file so future you (and others) know the logic of your file structure.
  5. Create a zip file of the file structure to share for reproducibility.

Part 1 (Assignment 4.1): Create a Basic File Structure

Let’s start by creating a template of a simple file structure for this assignment. I have had you do this in the past by having you create a separate folder for each assignment where you save your RMD and knitted html files. As you have seen in class, this allows me to take your RMD file and, given I have all of the supporting files in my own OneDrive folder (e.g., datasets, images, etc.), run your RMD file on virtually any computer. Here, we’ll create a file structure for this assignment that is self-contained. That means you can share the file structure and I can copy it to any computer and, given that computer has R, RStudio, and the associated packages, reproduce your work with one click.

  • Note: If you haven’t already, you should download and install the OneDrive app to your computer and link it to your UNCW account. This will create a OneDrive folder on your computer where you can easily access and store files that automatically sync with the cloud, thus creating cloud-based backups of any file(s) you place in your OneDrive folder. Click on the following links for directions for installing OneDrive on Windows or Mac.

1.1: Find OneDrive File Location

In order to create a basic file structure, you first need to locate the OneDrive folder on your computer, and specifically your LastName_CRM495_work folder. Here is what it looks like on my home (Windows) computer:

I have blacked out my other files, but you can see where my “Day_CRM495_Work” folder is located along with the shared “Day-SP22_CRM495” folder where you turn in your work. First, notice in the left pane that “OneDrive - UNC-Wilmington” is highlighted. That simply means that OneDrive is installed on my computer, it’s linked to my UNCW account, and I’m currently looking inside my OneDrive folder.

You’ll also notice a couple of folders from my graduate course from last semester (“Day_SC500_work” and “Day_F21_SC500”). This actually points to a potential problem with how I’ve named my folders. Notice how the “Day_CRM495_Work” and “Day_SC500_work” folders are next to each other and not next to their corresponding shared folders? This could cause (and has caused) errors when I’m trying to move between my work folder and the shared folder where you commit your work. The folders end up in this order because they are sorted alphabetically. Given I want the corresponding course material next to each other, an easy fix would be to rename them so they each include the abbreviation for the semester (e.g., “F21” and “SP22”). Here is what that looks like:

Naming files seems like a mundane task, but being systematic about it can save a lot of time and headaches later on. I highly recommend watching Danielle Navarro’s videos on project structure. In the first three parts, she discusses how to create names that are readable by computers (e.g., for things like searching and sorting) and humans (e.g., provide basic information on what’s in the file or folder) alike.

1.2: Create Folder for R Assignment #4

In prior assignments, I have had you create an “Assignment #” folder in your commit folder. However, while this works relatively well for submitting assignments (e.g., sharing your RMD and knitted file), it is perhaps not the most robust method of sharing your work as, for example, “Assignment 4” is minimally informative. It does not include information about whose assignment it is, nor does it clearly distinguish the folder as an “R Assignment” or “Project Assignment.” Also, the space between “Assignment” and “4” is potentially vulnerable to idiosyncrasies in different operating systems and software. With an eye towards creating a reproducible file structure, let’s now create a new folder titled “LastName_CRM495_RAssignment4” inside your “LastName_CRM495_work” folder. Here is what it looks like inside my “Day_SP22_CRM495_Work” folder:

  • Note: My “work” folder likely looks different than yours in that I may not have the exact same files as you do. But, either way, you should now have a folder for R Assignment 4 that includes your last name and no white space.

1.3: Create Basic File Structure for R Assignment #4

For the remainder of this assignment, we are going to download some data, do some basic data cleaning/trimming, and take a screenshot of our file structure and insert it into an RMarkdown file. Anticipating this, we can create the sub-folders we are going to need right now. This is the equivalent of creating a template for your file structure. Something that I have only minimally done, but that Danielle Navarro advises, is to create these types of file structure templates so you can reuse them when starting similar types of projects (see Part 5 in her video series on “Project Structure”). So let’s go ahead and do that.

  • Note: Eventually, I will show you how to create folders and your basic file structure all within R. But for right now, we’re going to use your computer’s operating system to create the file structure we will use for the rest of the assignment.

  1. Open your “LastName_CRM495_RAssignment4” folder. It should have nothing in it at this point. This will be the root folder, where you will create other folders and also save your RMD document for this assignment.

  2. Create a new folder called “Data” inside the “LastName_CRM495_RAssignment4” folder.

    • There are a few ways to do this (on a Windows computer): 1) While in the folder press “ctrl + shift + n”; 2) Click on the “New Folder” button at the top of the folder screen; or 3) Right click within the empty folder and select “New” and then “Folder.”
  3. Create a new folder called “Figures” inside the “LastName_CRM495_RAssignment4” folder.

After you do that, this is what the folder should look like:

  4. Open a new RMD file.
    • Title it “R Assignment 4: File Structure”.
    • Have it knit to html.
    • Save the file as “LastName_CRM495_RAssign4_YEAR_MO_DY” in the root “LastName_CRM495_RAssignment4” folder. It should look like this:

  • Note: we use the date structure “YEAR_MO_DY,” which follows the year-month-day ordering of the ISO 8601 standard (the standard itself writes dates as YYYY-MM-DD), so that our files will be sortable by date. If we were to use the “MO_DY_YEAR” order, as is common in the U.S., files with names that include January as the month (01) would be sorted first regardless of the year.

  • Note: we save the RMD file in the root folder (“LastName_CRM495_RAssignment4”) because, as you’ll see in the next section, when you open the file from that directory, R will default to looking in that root folder. This will allow you to open files within all of the other folders using the “here” package within R.
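
The folder template and the date-stamped file name from this section can also be built from within R. The following is a minimal sketch, not the method used in this assignment: it works inside a temporary directory so nothing on your computer is overwritten, and the names that start with “Day_” are only placeholders for your own names.

```r
# Build the template inside a temporary directory (a stand-in for your
# "LastName_CRM495_work" folder) so the sketch is safe to run anywhere.
work_dir <- tempfile("work_")
root_dir <- file.path(work_dir, "Day_CRM495_RAssignment4")  # placeholder name

# recursive = TRUE also creates parent folders that do not exist yet.
dir.create(file.path(root_dir, "Data"), recursive = TRUE)
dir.create(file.path(root_dir, "Figures"))

# Date-stamped file name in YEAR_MO_DY order (placeholder prefix).
rmd_name <- paste0("Day_CRM495_RAssign4_", format(Sys.Date(), "%Y_%m_%d"), ".Rmd")

# Year-first names sort chronologically; month-first names do not.
year_first  <- c("2022_01_15", "2021_03_02", "2021_11_30")
month_first <- c("01_15_2022", "03_02_2021", "11_30_2021")
sort(year_first)   # oldest date first
sort(month_first)  # the January 2022 date lands ahead of both 2021 dates

list.files(root_dir)
```

Running the two sort() calls side by side shows why the year goes first in the file name.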

Part 2 (Assignment 4.2): Download data from ICPSR and load it into R.

Now that you have a basic template of your file structure, we can start filling those folders with files and load them from within R. Let’s start by downloading Wave 1 of the National Youth Survey (NYS) from ICPSR. To do this we’ll need two packages that you may not have used before: 1) the “here” package and 2) the “haven” package. The “haven” package is part of the tidyverse but not the core set of packages that load automatically with the library(tidyverse) command. Let’s go ahead and install (if necessary) and load those now, including the core tidyverse packages. I’ll explain more about each when we use them.

library(tidyverse)
library(haven)
library(here)
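
If you are not sure whether these packages are installed, one way to handle the “install (if necessary)” step is sketched below. It is an assumption about workflow rather than a required part of the assignment, and `needed`, `is_installed`, and `missing_pkgs` are names made up for this example.

```r
# Packages this assignment relies on.
needed <- c("tidyverse", "haven", "here")

# requireNamespace() returns FALSE (rather than stopping) when a package is absent.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

# Install only what is missing, and only in an interactive session,
# so that knitting a document never triggers a download.
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

missing_pkgs
```

An empty result means everything is already installed and the library() calls above should work.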

2.1: Download data from ICPSR

Start by navigating to the ICPSR landing page for Wave 1 of the NYS (ICPSR 8375). You’ll notice that ICPSR offers the NYS data in several file formats. Unfortunately, an R format is not one of them. However, you can download the Stata, SPSS, or SAS data file and import it into R using the haven package. For now let’s download the SPSS file.

Downloading SPSS Data

  • Note: To download the data, you will need to sign up for an ICPSR account. Go ahead and do that before moving on with the assignment.

  • It is instructive to take a look at what is actually downloaded from ICPSR. First, notice that what ICPSR does is download a zip file with the title “ICPSR_08375-V2.zip” to your computer.

  • Opening this zip file reveals a folder titled “ICPSR_08375” inside.

  • When you open this folder, notice it contains several files that tell you about the data, but not the actual data. There is also another folder titled “DS0001” inside.

  • The data file itself is in this “DS0001” folder. Sometimes a study you download from ICPSR will have multiple data sets associated with it, and each data file will be contained in its own folder. So, for example, if there were a second data set associated with this particular wave of the NYS, there would likely be another folder named “DS0002.”

  • Inside the “DS0001” folder is a file named “08375-0001-Data.sav” - this is the actual data file. The folder also contains the codebook for the data.

Go ahead and “unzip” and/or move the entire “ICPSR_08375” folder to the “Data” folder inside your “LastName_CRM495_RAssignment4” folder.

  • Here is what it looks like on my computer:

2.2: Read data and save as object in RStudio.

Now that we have the data in our “Data” folder inside our root directory for R Assignment #4 (“LastName_CRM495_RAssignment4”) we can tell R where the SPSS data file is and load it into R’s Global Environment as a data object.

  1. In your RMD file, create a second-level header and title it “Read SPSS Data into R.”

  2. Insert an R chunk and type the following code:

nys_w1 <- read_spss(here("Data", "ICPSR_08375", "DS0001", "08375-0001-Data.sav"))
  • The code above uses the read_spss function from the “haven” package. It converts data files stored in the format used by the popular statistical software SPSS into an R data object (there are also read_stata and read_sas functions for those common data analysis software file types).
  • The code also uses the here function from the “here” package. It tells R that, starting from the root folder (“LastName_CRM495_RAssignment4”), it should look in the “Data” folder, then the “ICPSR_08375” folder, then the “DS0001” folder, and load the “08375-0001-Data.sav” file.
    • Note: Instead of separating the different folders by commas, you could also pass the file path as a single string: here("Data/ICPSR_08375/DS0001/08375-0001-Data.sav"). I tend to use the comma-separated version and I assume that’s what makes the most sense to you, but both ways are perfectly fine.
    • Note: The here package will help you start a reproducible project-oriented workflow from the beginning. Once installed, it works as follows:
      • You save your primary RMarkdown file in your top-level directory folder. For us, that means saving your RMD file in your “LastName_CRM495_RAssignment4” folder.
  3. Next, save your RMD file and close RStudio. Then simply click directly on the RMarkdown file in your “LastName_CRM495_RAssignment4” folder (inside your LastName_CRM495_work folder) to automatically open it with RStudio.
  • When you do this, your “working directory,” i.e., the place R looks for files by default, will automatically be set to your “LastName_CRM495_RAssignment4” folder.

  • The here package will then make it easy to find and call objects (e.g., datasets) in subfolders of your working directory (e.g., in the “Data” folder).

  • If you want to see where the root directory is currently set for the “here” package within your R session, simply type here() into the console and it will tell you.

  • You should now have a data object in your R “Global Environment” that has 1,725 observations (rows) and 519 variables (columns). You can confirm these data dimensions by typing dim(nys_w1). You can also look at the first six rows of the data by typing head(nys_w1).
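
To see why the comma-separated and typed-out forms of a path are interchangeable, it can help to look at base R’s file.path(), which joins folder names in a similar way to here() but without anchoring them to the project root. This is only an illustration that uses base R as a stand-in; it does not read any data.

```r
# Comma-separated pieces are joined with "/" on every operating system.
from_pieces <- file.path("Data", "ICPSR_08375", "DS0001", "08375-0001-Data.sav")
typed_out   <- "Data/ICPSR_08375/DS0001/08375-0001-Data.sav"
identical(from_pieces, typed_out)

# The pieces of a path can be pulled apart again.
basename(from_pieces)  # the file name
dirname(from_pieces)   # the folders leading to it

# Checking that a file exists before reading it gives a clearer message
# than a failed import. Relative paths are resolved against the working
# directory, which is why the working directory matters so much.
if (!file.exists(from_pieces)) {
  message("Not found from ", getwd(), ": ", from_pieces)
}
```

If the message appears, the working directory printed in it is the first thing to check.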

Part 3 (Assignment 4.3): Create a Trimmed Data Set and Save It.

For this part of the assignment, I simply want you to do some basic data cleaning and create a trimmed data set that you can add to your “Data” folder.

3.1: Filtering data

Recall from R Assignment #3 that you reproduced the first figure in Warr’s (1993) classic article titled “Age, Peers, and Delinquency”. As you know, Warr (1993) pooled data from the first five waves of the NYS for his analyses. But, of course, he didn’t use all 519 variables in each wave. Let’s take our wave 1 data object (nys_w1) and trim it to just the items that Warr used in his analyses. As a refresher, here is the table of the different items used by Warr (1993) across all five waves.

Warr (1993) NYS Items

variable [1]        wave1  wave2  wave3  wave4  wave5
icpsr               8375   8424   8506   8917   9112
age                 V169   V7     V10    V6     V6
peer_mar_dic        V367   V210   V308   V288   V315
peer_alc_dic        V370   V213   V311   V291   V318
peer_cheat_dic      V365   V208   V306   V286   V313
peer_vandal_dic     V366   V209   V307   V287   V314
peer_burg_dic       V371   V214   V312   V292   V319
peer_selldrugs_dic  V372   V215   V313   V293   V320
peer_theftlt5_dic   V368   V211   V309   V289   V316
peer_theftgt50_dic  V373   V216   V314   V294   V321
evsoc_dic           V179   V17    V81    V24    V37
socimp_dic          V180   V18    V82    V25    V38
peertroub_police    V377   V223   V321   V301   V328
resp_mar            V479   V566   V531   V597   V572
resp_cheat          V412   V293   V400   V438   V478
resp_burg           V454   V355   V444   V551   V524
resp_selldrugs      V428   V309   V418   V481   V496
resp_theftlt5       V400   V281   V388   V395   V464
resp_theftgt50      V386   V267   V374   V365   V448

[1] Note: 'icpsr' indicates the ICPSR number for the data set and not a variable

The items in the “wave1” column (except for the ICPSR number) are the items from our “nys_w1” data we want to select. To do that, we’ll use the “select” function from the “dplyr” package that is part of the core suite of packages in the “tidyverse.” We’ll also want to provide informative variable names and we can do that with the “rename” function that is also a part of the “dplyr” package. Here is the code for doing that:

# Keep only the wave 1 items Warr (1993) used, then give them informative names.
nys_w1_trim <- nys_w1 %>%
  select(CASEID, V169,
         V367, V370, V365, V366, V371, V372, V368, V373,
         V179, V180, V377,
         V479, V412, V454, V428, V400, V386) %>%
  rename(  # rename() takes the form newname = oldname
    age = V169,
    peer_mar = V367,
    peer_alc = V370,
    peer_cheat = V365,
    peer_vandal = V366,
    peer_burg = V371,
    peer_selldrugs = V372,
    peer_theftlt5 = V368,
    peer_theftgt50 = V373,
    evsoc = V179,
    soc_imp = V180,
    peertroub_police = V377,
    resp_mar = V479,
    resp_cheat = V412,
    resp_burg = V454,
    resp_selldrugs = V428,
    resp_theftlt5 = V400,
    resp_theftgt50 = V386)

In the above code, I simply told R to create a new object called “nys_w1_trim” from the “nys_w1” object we created earlier. Then I used the pipe operator (%>%) to move on to the “select” function from the “dplyr” package. For the select function, I simply list the specific items or columns from the data that I want to “select” or keep in the new data object. I then piped to the “rename” function in order to give the items informative names. It takes the form of newname = oldname.

  • Note: The “rename” function writes over the existing variable names. In a future assignment, I will show you how to create new variables using the “mutate” function which is also a part of the “dplyr” package within the “tidyverse.”

When you run the above code, you’ll notice that the “nys_w1_trim” data still has 1725 observations like the original data, but only has 19 variables.

dim(nys_w1_trim)
## [1] 1725   19

Recall you can take a look at the first six rows of the data, as well as check that the rename function worked as expected (i.e. it renamed the columns) by typing head(nys_w1_trim). Here is what the first six rows look like:

CASEID age peer_mar peer_alc peer_cheat peer_vandal peer_burg peer_selldrugs peer_theftlt5 peer_theftgt50 evsoc soc_imp peertroub_police resp_mar resp_cheat resp_burg resp_selldrugs resp_theftlt5 resp_theftgt50
1 13 1 3 3 3 2 1 2 1 3 3 1 1 0 0 0 0 0
2 15 1 2 3 1 1 1 1 1 4 3 3 1 0 0 0 0 0
3 11 1 1 1 1 1 1 1 1 1 5 1 1 0 0 0 0 0
4 16 4 4 4 3 2 2 3 2 2 3 2 7 3 1 0 0 1
5 14 NA NA NA NA NA NA NA NA 0 4 NA 1 2 0 0 5 0
6 11 2 1 4 2 2 1 3 1 2 3 3 1 2 0 0 1 0

As you can see, the variable/column names are now more informative.
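
The select-then-rename pattern can be tried on a small made-up data frame before running it on the full NYS file. The values below are invented for illustration and are not NYS data; the sketch also shows that select() accepts the same newname = oldname form, so the two steps can be combined if you prefer.

```r
library(dplyr)

# Invented toy data that mimics the raw variable names.
toy <- data.frame(
  CASEID = 1:3,
  V169   = c(13, 15, 11),
  V367   = c(1, 1, 4),
  V999   = c(0, 0, 0)   # a made-up column we do not want to keep
)

# Two steps: keep columns, then rename them.
two_step <- toy %>%
  select(CASEID, V169, V367) %>%
  rename(age = V169, peer_mar = V367)

# One step: select() can rename while it selects.
one_step <- toy %>%
  select(CASEID, age = V169, peer_mar = V367)

names(two_step)
all.equal(two_step, one_step)
```

Keeping the two steps separate, as in the assignment code, makes it easier to see which columns were kept and which were renamed.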

3.2: Save the Data

Now that we have our trimmed data, let’s go ahead and save it in our “Data” folder within our R Assignment 4 file structure. To do that we’ll simply use the “save” function that is a part of base R and the “here” package to tell R where to save the file. Here is the code:

save(nys_w1_trim, file = here("Data", "nys_w1_trim.rda"))

The code is telling R to save the data object that we just created (nys_w1_trim) and then uses the “here” package to tell R to go to the “Data” folder within our root directory and save it as “nys_w1_trim.rda”.

  • Note: we could have named the file something different (e.g. “nys_w1_warr1993”) but when we loaded it the object would still be “nys_w1_trim” in the global environment.

The “.rda” extension is just short for the “.RData” file format, which allows you to save multiple objects in the same file. Here, we are having it save just the nys_w1_trim data. But we could have saved both the “nys_w1” and “nys_w1_trim” data objects together by including both separated by commas like so:

save(nys_w1, nys_w1_trim, file = here("Data", "nys_w1_raw-and-trim.rda"))

Again, when we load the “nys_w1_raw-and-trim.rda” file, it would load the “nys_w1” and “nys_w1_trim” objects into our global environment in R.
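
The point about file names versus object names can be checked with a quick round trip. This sketch uses a tiny stand-in data frame and a temporary file rather than the real data, and the file name “renamed_file.rda” is arbitrary.

```r
# A tiny stand-in for the trimmed data (invented values).
nys_w1_trim <- data.frame(CASEID = 1:3, age = c(13, 15, 11))

# Save it under a file name that differs from the object name.
rda_path <- file.path(tempdir(), "renamed_file.rda")
save(nys_w1_trim, file = rda_path)

# Remove the object, then load the file into a fresh environment so we
# can see exactly what comes back.
rm(nys_w1_trim)
restored <- new.env()
loaded_names <- load(rda_path, envir = restored)

loaded_names  # load() returns the names of the objects it restored
```

The restored object keeps the name it had when it was saved, whatever the file is called.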

3.3: Save RMD file, Quit R, and Reopen.

In order to drive home exactly what the “here” package is doing, complete the following steps:

  1. Save your RMD file and quit your RStudio Session.
  2. Go to the root folder–“LastName_CRM495_RAssignment4”–and reopen your RMD file.
  3. Load your libraries, including the “here” package.
  4. In Console, enter here() and check to see that the “here” package is currently “looking” in the correct root folder (“LastName_CRM495_RAssignment4”).

Part 4 (Assignment 4.4): Create a Basic Plot and Save It

The next thing we want to do is create a simple plot and save it. Saving your plots as separate files could be useful for future you and someone else. For example, if you wanted to use your plot in a presentation, rather than re-running all of your code or copying and pasting from your knitted file, you (or someone else) could simply insert the image file you created (publishers may also prefer the image files as well).

4.1: Create basic plot of age-distribution from NYS wave 1.

Let’s go ahead and reproduce the age-distribution chart we created in the last assignment, but instead of using the pooled data, let’s use the “nys_w1_trim” data we created above. We’ll also include some additional features like informative axis labels and a plot title.

nys_w1_ageplot <- ggplot(data = nys_w1_trim, mapping = aes(x = factor(age))) + 
  geom_bar(fill = "lightblue", color = "black") +
  scale_y_continuous(limits = c(0, 300), breaks = seq(0, 300, 50)) + 
  labs(title = "NYS Wave 1 - Age Distribution", x = "Age", y = "Count") + 
  # Apply the complete theme first; placed last, it would reset the theme() tweaks below.
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 10),
        axis.title = element_text(size = 10))
nys_w1_ageplot

Ok, so admittedly it’s not the most useful or visually appealing plot, but it will work for what we need to do for this assignment. Notice that we assigned the ggplot to an object named “nys_w1_ageplot.” This will allow us to use the plot in subsequent R functions moving forward, for example, to save it as a jpeg within our file structure.

4.2: Save the ggplot object

To save the plot we’ll use the function ggsave() that is part of the “ggplot2” package. It allows you to save the plot in multiple file formats and specify things like the dimensions of the plot (see here for details). For now, we’ll just save it as a jpeg that is 8 inches wide and 6 inches tall.

ggsave(filename = here("Figures", "nys_w1_ageplot.jpg"), 
       plot = nys_w1_ageplot, width = 8, height = 6, units = "in")

In the above code, I used the “here” package to tell R to save the file in the “Figures” folder within our file structure and named the file “nys_w1_ageplot.jpg”. I also specified the ggplot object I wanted to save (plot = nys_w1_ageplot), the width and height of the figure, and the units for that width and height, in this case inches. I could have also specified centimeters (“cm”), millimeters (“mm”), or pixels (“px”).
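
As a sketch of the other options ggsave() offers, the example below saves a toy plot as a PNG at a chosen resolution. The data are made up, the file is written to a temporary folder rather than to “Figures”, and the 300 dpi setting is just a common choice, not a requirement of this assignment.

```r
library(ggplot2)

# Invented ages, only to have something to plot.
toy_ages <- data.frame(age = c(11, 12, 12, 13, 13, 13))
toy_plot <- ggplot(toy_ages, aes(x = factor(age))) +
  geom_bar() +
  labs(x = "Age", y = "Count")

# ggsave() picks the file format from the extension of the file name.
png_path <- file.path(tempdir(), "toy_ageplot.png")
ggsave(filename = png_path, plot = toy_plot,
       width = 8, height = 6, units = "in", dpi = 300)

file.exists(png_path)
```

Swapping tempdir() for here("Figures") would place the file in the assignment’s file structure instead.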

Part 5 (Assignment 4.5): Create a README.txt file

I’ll be honest, this is something I have rarely done in my own work. But Danielle Navarro convinced me that this is something I should start doing regularly (see Part 5 of her video series on “Project Structure”). Basically, a README file is simply a plain text file that describes the logic of the file structure and what’s in each file. You can create it with any plain text editor, including within RStudio (simply click on “File” > “New File” > “Text File” in the menu).

You do not need to get fancy with a README.txt file. You simply need to provide the basic information necessary for future you or someone else to understand the logic of your file structure and the details regarding what each file and/or folder includes. Go ahead and open a plain text editor and create a README.txt document and save it in the root folder (“LastName_CRM495_RAssignment4”). Here is what my README.txt file looks like:
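
The screenshot of that README is not reproduced here. As a rough, hypothetical sketch of the kind of content a README.txt might hold, the following writes one from R; every line of the text is a placeholder to adapt, and the file is written to a temporary folder so it will not overwrite anything.

```r
# Placeholder README text; adapt each line to your own file structure.
readme_lines <- c(
  "README: LastName_CRM495_RAssignment4",
  "",
  "Purpose: R Assignment 4 (file structure).",
  "",
  "Folders and files:",
  "  Data/     raw ICPSR download (ICPSR_08375) and nys_w1_trim.rda",
  "  Figures/  nys_w1_ageplot.jpg",
  "  LastName_CRM495_RAssign4_YEAR_MO_DY.Rmd   main RMarkdown file",
  "",
  "To reproduce: open the Rmd file from this folder and knit."
)

# Written to a temporary folder here; in practice it belongs in the root folder.
readme_path <- file.path(tempdir(), "README.txt")
writeLines(readme_lines, readme_path)

readLines(readme_path)[1]
```

Typing the same text directly into a plain text editor works just as well; the point is the content, not how the file is created.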

Part 6 (Assignment 4.6): Knit, Create a .zip File, and Submit Assignment

  1. Upon completing the tasks in the previous five sections:
    • Create a “Conclusion” section where you write about what you learned in this assignment and any problems or issues you had in completing it.
    • “Knit” your final RMD file to html format and save the knitted html document to the root folder (“LastName_CRM495_RAssignment4”).
  2. Create a .zip file of your root folder.
    • To do this on Windows:
      1. Go to the folder where your root folder is stored (in this case the “LastName_CRM495_Work” or, in my case, the “Day_SP22_CRM495_Work” folder)
      2. Right click on the root folder (“LastName_CRM495_RAssignment4”)
      3. Select “Send to”.
      4. Select “Compressed (zipped) folder”
    • This will create a zip folder with the same name as your root folder. Zip files are compressed and thus take up less storage space than your typical (uncompressed) folders on your computer. Creating a zipped folder makes it easier to transfer the file between computers.
    • Note: see here for instructions on a Mac.
  3. To submit your assignment for grading:
    • place your unzipped root folder (“LastName_CRM495_RAssignment4”) in your LastName_CRM495_commit folder on OneDrive.
    • Then submit your zipped version of your root folder (“LastName_CRM495_RAssignment4.zip”) on Canvas in the “R Assignment 4” submission portal. This will allow me to have a time-stamped version of your assignment for grading purposes.
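
Step 2 can also be done from R with utils::zip(). This is a sketch under assumptions: it relies on a zip program being available on your system (which is not always the case on Windows), `zip_folder` is a helper invented for this example, and it is demonstrated on a throwaway folder in a temporary directory rather than on your real root folder.

```r
# Hypothetical helper: zip a folder into "<folder>.zip" stored next to it.
zip_folder <- function(folder) {
  folder  <- normalizePath(folder, mustWork = TRUE)
  zipfile <- paste0(folder, ".zip")

  # Work from the parent folder so the archive stores relative paths,
  # and restore the original working directory when the function exits.
  old_wd <- setwd(dirname(folder))
  on.exit(setwd(old_wd), add = TRUE)

  utils::zip(zipfile = zipfile, files = basename(folder), flags = "-r9Xq")
  zipfile
}

# Throwaway root folder with a sub-folder and one file inside.
demo_root <- file.path(tempdir(), "Demo_RAssignment4")
dir.create(file.path(demo_root, "Data"), recursive = TRUE, showWarnings = FALSE)
writeLines("placeholder", file.path(demo_root, "README.txt"))

# Only attempt the zip when a zip program can be found.
if (nzchar(Sys.which("zip"))) {
  zip_path <- zip_folder(demo_root)
  utils::unzip(zip_path, list = TRUE)$Name  # list the archive's contents
}
```

Listing the contents of the archive is a quick way to confirm that every folder and file made it into the zip before you submit it.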