So far, throughout the various R Assignments, I have been subliminally training you to create a reproducible workflow and file structure. The purpose of this assignment is to make the reasons explicit for some of the things I’ve had you do, so far, with little to no explanation.
Creating a reproducible file structure is important for efficiently working with and sharing your research and data. When you have a reproducible file structure, it will be easier for you to come back to a project after weeks or months of not working on it and pick up where you left off. Also, it will be easier for others to work directly with you, reproduce your work, and/or expand upon your work.
Of course, in order to create a reproducible file structure, you need to understand how your computer stores and accesses the files it uses. As computers have become more user-friendly, knowledge of this has increasingly been lost among younger generations of computer users. In fact, it’s something you can regularly find older faculty complaining about on Twitter.
Monica Chin has an interesting article about the issue. In that article, she interviews professors and students and concludes by suggesting that your generation will likely create your own tools that will be less dependent on traditional file structures. But guess what? We’re not there yet. So you should learn something about the basics of creating a reproducible file structure so you can do reproducible work. Along the way we will also learn some things about R and RStudio that can be helpful in producing reproducible research. Specifically, we will:
Let’s start by creating a template of a simple file structure for this assignment. I have had you do this in the past by having you create a separate folder for each assignment where you save your RMD and knitted html files. As you have seen in class, this allows me to take your RMD file and, given that I have all of the supporting files in my own OneDrive folder (e.g., datasets, images, etc.), run your RMD file on virtually any computer. Here, we’ll create a file structure for this assignment that is self-contained, meaning you can share the file structure and I can copy it to any computer and, given that the computer has R, RStudio, and the associated packages, reproduce your work with one click.
In order to create a basic file structure, you first need to locate the OneDrive folder on your computer, and specifically your LastName_CRM495_work folder. Here is what it looks like on my home (Windows) computer:
I have blacked out my other files, but you can see where my “Day_CRM495_Work” folder is located along with the shared “Day-SP22_CRM495” folder where you turn in your work. First, notice in the left pane that “OneDrive - UNC-Wilmington” is highlighted. That simply means that OneDrive is installed on my computer, it’s linked to my UNCW account, and I’m currently looking inside my OneDrive folder.
You’ll also notice a couple of folders from my graduate course from last semester (“Day_SC500_work” and “Day_F21_SC500”). This actually points to a potential problem with how I’ve named my folders. Notice how the “Day_CRM495_Work” and “Day_SC500_work” folders are next to each other and not next to their corresponding shared folders? This could cause (and has caused) errors when I’m trying to move between my work folder and the shared folder where you commit your work. The folders are arranged this way because they are sorted alphabetically. Given that I want the corresponding course material next to each other, an easy fix would be to rename them so they each include the abbreviation for the semester (e.g., “F21” and “SP22”). Here is what that looks like:
Naming files seems like a mundane task, but being systematic about it can save a lot of time and headaches later on. I highly recommend watching Danielle Navarro’s videos on project structure. In the first three parts, she discusses how to create names that are readable by computers (e.g., for things like searching and sorting) and humans (e.g., provide basic information on what’s in the file or folder) alike.
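To see how a semester prefix keeps related folders together when names are sorted, here is a minimal sketch in base R. The folder names are hypothetical stand-ins (the “_Shared” suffix in particular is an assumption made for illustration), and method = "radix" is used so the ordering does not depend on your computer’s language settings. Keep in mind that a file browser may apply slightly different sorting rules than R does.

```r
# Hypothetical folder names that all start with an instructor name,
# then a semester abbreviation, then a course number
new_names <- c("Day_SP22_CRM495_Work", "Day_F21_SC500_Shared",
               "Day_SP22_CRM495_Shared", "Day_F21_SC500_Work")

# Radix sorting compares names character by character,
# independent of the computer's locale
sorted_new <- sort(new_names, method = "radix")

# Folders from the same semester and course end up next to each other
print(sorted_new)
```

Because the semester comes before the course number in every name, the work and shared folders for a given course are listed side by side.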
In prior assignments, I have had you create an “Assignment #” folder in your commit folder. However, while this works relatively well for submitting assignments (e.g., sharing your RMD and knitted file), it is perhaps not the most robust method of sharing your work because, for example, “Assignment 4” is minimally informative. It does not include information about whose assignment it is, nor does it clearly distinguish the folder as an “R Assignment” or “Project Assignment.” Also, the space between “Assignment” and “4” is potentially vulnerable to idiosyncrasies in different operating systems and software. With an eye towards creating a reproducible file structure, let’s now create a new folder titled “LastName_CRM495_RAssignment4” inside your “LastName_CRM495_work” folder. Here is what it looks like inside my “Day_SP22_CRM495_Work” folder:
For the remainder of this assignment, we are going to download some data, do some basic data cleaning/trimming, and take a screenshot of our file structure and insert it into an RMarkdown file. Anticipating this, we can create the sub-folders we are going to need right now. This is the equivalent of creating a template for your file structure. Something that I have only minimally done, but that Danielle Navarro advises, is creating these types of file structure templates so you can use them again when starting similar types of projects (see Part 5 in her video series on “File Structure”). So let’s go ahead and do that.
Open your “LastName_CRM495_RAssignment4” folder. It should have nothing in it at this point. This will be the root folder, where you will create the other folders and also save your RMD document for this assignment.
Create a new folder called “Data” inside the “LastName_CRM495_RAssignment4” folder.
Create a new folder called “Figures” inside the “LastName_CRM495_RAssignment4” folder.
After you do that, this is what the folder should look like:
Note: we use the date structure “YEAR_MO_DY,” which follows the year-month-day ordering of the ISO 8601 standard, so that our files will be sortable by date. If we were to use the “MO_DY_YEAR” ordering, as is common in the U.S., files with names that include January as the month (01) would be sorted first regardless of the year.
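The difference between the two date orderings is easy to check in R. The sketch below uses hypothetical file names (they are not files from this assignment) and sorts each set with method = "radix" so the result does not depend on your computer’s language settings.

```r
# Hypothetical file names using the two date orderings
ymd_names <- c("2022_01_15_notes.Rmd", "2021_12_01_notes.Rmd", "2022_02_03_notes.Rmd")
mdy_names <- c("01_15_2022_notes.Rmd", "12_01_2021_notes.Rmd", "02_03_2022_notes.Rmd")

# Year-first names sort into chronological order
sorted_ymd <- sort(ymd_names, method = "radix")
print(sorted_ymd)

# Month-first names sort by month, so the December 2021 file lands last
sorted_mdy <- sort(mdy_names, method = "radix")
print(sorted_mdy)

# Build a YEAR_MO_DY prefix for today's date
today_prefix <- format(Sys.Date(), "%Y_%m_%d")
print(today_prefix)
```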
Note: we save the RMD file in the root folder (“LastName_CRM495_RAssignment4”) because, as you’ll see in the next section, when you open the file from that directory, R will default to looking in that root folder. This will allow you to open files within all of the other folders using the “here” package within R.
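The same folder template could also be built from within R rather than by clicking through your file browser. This is only a sketch: it creates the folders inside R’s temporary directory (tempdir()) as a stand-in, so in practice you would replace that location with the path to your own “LastName_CRM495_work” folder.

```r
# Stand-in location for the work folder (an assumption for illustration)
work_folder <- tempdir()

# Root folder for this assignment
root <- file.path(work_folder, "LastName_CRM495_RAssignment4")

# Create the root folder and the two sub-folders used in this assignment
for (sub in c("Data", "Figures")) {
  dir.create(file.path(root, sub), recursive = TRUE, showWarnings = FALSE)
}

# List the sub-folders that now exist inside the root folder
print(list.dirs(root, full.names = FALSE, recursive = FALSE))
```

Wrapping these lines in a small script is one way to reuse the same template when you start a similar project.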
Now that you have a basic template of your file structure, we can start filling those folders with files and load them from within R. Let’s start by downloading Wave 1 of the NYS from ICPSR. To do this we’ll need two packages that you may not have used before: 1) the “here” package and 2) the “haven” package. The “haven” package is part of the tidyverse but not the core set of packages that load automatically with the library(tidyverse) command.
Let’s go ahead and install (if necessary) and load those now, including the base tidyverse packages. I’ll explain more about each when we use them.
library(tidyverse)
library(haven)
library(here)
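If you are not sure whether these packages are already installed, one possible approach is to check first and install only what is missing. This is a sketch rather than a required step. The interactive() guard is my own choice here: it means the installation only runs when you execute the code yourself in the console, not while the document is knitting.

```r
# Packages used in this assignment
pkgs <- c("tidyverse", "haven", "here")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

# Install only the missing packages, and only in an interactive session
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}

# Show which packages (if any) still need to be installed
print(missing_pkgs)
```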
Start by navigating to the ICPSR landing page for Wave 1 of the NYS (ICPSR 8375). You’ll notice that the NYS has lots of options in terms of file formats for which you can download the data. Unfortunately, R is not one of them. However, this is a case where you could download either the Stata, SPSS, or SAS data files and import them into R using the haven package. For now let’s download the SPSS file.
Note: To download the data, you will need to sign up for an ICPSR account. Go ahead and do that before moving on with the assignment.
It is instructive to take a look at what is actually downloaded from ICPSR. First, notice that what ICPSR does is download a zip file with the title “ICPSR_08375-V2.zip” to your computer.
The data file itself is in this “DS0001” folder. Sometimes data you download from ICPSR will have multiple data sets associated with them, and each data file will be contained in its own folder. So, for example, if there were a second data set associated with this particular wave of the NYS, there would likely be another folder named “DS0002.”
Inside the “DS0001” folder is a file named “08375-0001-Data.sav” - this is the actual data file. The folder also contains the codebook for the data.
Go ahead and “unzip” and/or move the entire “ICPSR_08375” folder to the “Data” folder inside your “LastName_CRM495_RAssignment4” folder.
Now that we have the data in our “Data” folder inside our root directory for R Assignment #4 (“LastName_CRM495_RAssignment4”) we can tell R where the SPSS data file is and load it into R’s Global Environment as a data object.
In your RMD file, create a second-level header and title it “Read SPSS Data into R.”
Insert an R chunk and type the following code:
nys_w1 <- read_spss(here("Data", "ICPSR_08375", "DS0001", "08375-0001-Data.sav"))
In the above code, we use the read_spss command, which is a core function in the “haven” package. This function converts data files stored in the format of the popular statistical software SPSS into an R data object (there are also read_stata and read_sas commands for those common data analysis software file types).
We also use the here command from the “here” package. It basically tells R that, starting from the root folder (“LastName_CRM495_RAssignment4”), it should look in the “Data” folder, then the “ICPSR_08375” folder, then the “DS0001” folder, and grab/load the “08375-0001-Data.sav” file.
The path can also be written as a single string: "Data/ICPSR_08375/DS0001/08375-0001-Data.sav". I tend to use the comma-separated version and I assume that’s what makes the most sense to you, but both ways are perfectly fine.
When you open your RMD file from the root folder, your “working directory,” i.e., the place R looks for files by default, will automatically be set to your “LastName_CRM495_RAssignment4” folder.
The here package will then make it easy to find and call objects (e.g., datasets) in subfolders of your working directory (e.g., in the “Data” folder).
If you want to see where the root directory is currently set for the “here” package within your R session, simply type here() into the console and it will tell you.
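As a quick check of how the “here” package builds paths, the sketch below constructs the path to the SPSS file in two ways, once with comma-separated folder names and once as a single string, and compares the results. It only builds the paths; file.exists() then reports whether the data file is actually where the path says it should be on your computer.

```r
library(here)

# Comma-separated version: one argument per folder, then the file name
path_commas <- here("Data", "ICPSR_08375", "DS0001", "08375-0001-Data.sav")

# Single-string version of the same path
path_string <- here("Data/ICPSR_08375/DS0001/08375-0001-Data.sav")

# Both versions should point to the same location
print(identical(path_commas, path_string))

# TRUE only if the data file has been moved into the "Data" folder
print(file.exists(path_commas))
```

If file.exists() returns FALSE, the usual culprits are opening the RMD file from somewhere other than the root folder or unzipping the data into a different location.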
You should now have a data object in your R “Global Environment” that has 1,725 observations (rows) and 519 variables (columns). You can confirm these data dimensions by typing dim(nys_w1). You can also look at the first six rows of the data by typing head(nys_w1).
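If you would like to see read_spss work on something smaller than the NYS, here is a toy round trip. The data frame and its values are made up for illustration, and the file is written to a temporary location with haven’s write_sav function rather than to your “Data” folder.

```r
library(haven)

# A tiny made-up data set (hypothetical values, not NYS data)
toy <- data.frame(CASEID = c(1, 2, 3), V169 = c(13, 15, 11))

# Write it out as an SPSS (.sav) file in a temporary location
toy_path <- tempfile(fileext = ".sav")
write_sav(toy, toy_path)

# Read the SPSS file back into R, just as we did with the NYS data
toy_back <- read_spss(toy_path)

# Check the dimensions and look at the first rows
print(dim(toy_back))
print(head(toy_back))
```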
For this part of the assignment, I simply want you to do some basic data cleaning and create a trimmed data set that you can add to your “Data” folder.
Recall from R Assignment #3 that you reproduced the first figure in Warr’s (1993) classic article titled “Age, Peers, and Delinquency”. As you know, Warr (1993) pooled data from the first five waves of the NYS for his analyses. But, of course, he didn’t use all 519 variables in each wave. Let’s take our wave 1 data object (nys_w1) and trim it to just the items that Warr used in his analyses. As a refresher, here is the table of the different items used by Warr (1993) across all five waves.
Table: Warr (1993) NYS Items
variable | wave1 | wave2 | wave3 | wave4 | wave5 |
---|---|---|---|---|---|
icpsr | 8375 | 8424 | 8506 | 8917 | 9112 |
age | V169 | V7 | V10 | V6 | V6 |
peer_mar_dic | V367 | V210 | V308 | V288 | V315 |
peer_alc_dic | V370 | V213 | V311 | V291 | V318 |
peer_cheat_dic | V365 | V208 | V306 | V286 | V313 |
peer_vandal_dic | V366 | V209 | V307 | V287 | V314 |
peer_burg_dic | V371 | V214 | V312 | V292 | V319 |
peer_selldrugs_dic | V372 | V215 | V313 | V293 | V320 |
peer_theftlt5_dic | V368 | V211 | V309 | V289 | V316 |
peer_theftgt50_dic | V373 | V216 | V314 | V294 | V321 |
evsoc_dic | V179 | V17 | V81 | V24 | V37 |
socimp_dic | V180 | V18 | V82 | V25 | V38 |
peertroub_police | V377 | V223 | V321 | V301 | V328 |
resp_mar | V479 | V566 | V531 | V597 | V572 |
resp_cheat | V412 | V293 | V400 | V438 | V478 |
resp_burg | V454 | V355 | V444 | V551 | V524 |
resp_selldrugs | V428 | V309 | V418 | V481 | V496 |
resp_theftlt5 | V400 | V281 | V388 | V395 | V464 |
resp_theftgt50 | V386 | V267 | V374 | V365 | V448 |
Note: the 'icpsr' row gives the ICPSR study number for each wave's data set; it is not a variable.
The items in the “wave1” column (except for the ICPSR number) are the items from our “nys_w1” data we want to select. To do that, we’ll use the “select” function from the “dplyr” package that is part of the core suite of packages in the “tidyverse.” We’ll also want to provide informative variable names, and we can do that with the “rename” function that is also a part of the “dplyr” package. Here is the code for doing that:
nys_w1_trim <- nys_w1 %>%
  select(CASEID, V169,
         V367, V370, V365, V366, V371, V372, V368, V373,
         V179, V180, V377,
         V479, V412, V454, V428, V400, V386) %>%
  rename(
    age = V169,
    peer_mar = V367,
    peer_alc = V370,
    peer_cheat = V365,
    peer_vandal = V366,
    peer_burg = V371,
    peer_selldrugs = V372,
    peer_theftlt5 = V368,
    peer_theftgt50 = V373,
    evsoc = V179,
    soc_imp = V180,
    peertroub_police = V377,
    resp_mar = V479,
    resp_cheat = V412,
    resp_burg = V454,
    resp_selldrugs = V428,
    resp_theftlt5 = V400,
    resp_theftgt50 = V386)
In the above code, I simply told R to create a new object called “nys_w1_trim” from the “nys_w1” object we created earlier. Then I used the pipe operator (%>%) to move on to the “select” function from the “dplyr” package. For the select function, I simply list the specific items or columns from the data that I want to “select” or keep in the new data object. I then piped to the “rename” function in order to give the items informative names. It takes the form of newname = oldname.
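The select-then-rename pattern is easier to see on a small example. The sketch below uses a made-up data set (the values and the “V999” column are hypothetical) and also shows an alternative: “select” accepts the same newname = oldname form, so the two steps can be combined into one if you prefer.

```r
library(dplyr)

# A tiny made-up data set with one column we do not want to keep
toy <- tibble(CASEID = 1:3,
              V169 = c(13, 15, 11),
              V367 = c(1, 1, 4),
              V999 = c(0, 0, 1))

# Two steps: keep the columns we want, then give them informative names
toy_trim <- toy %>%
  select(CASEID, V169, V367) %>%
  rename(age = V169, peer_mar = V367)

# One step: select() can rename columns as it selects them
toy_trim2 <- toy %>%
  select(CASEID, age = V169, peer_mar = V367)

print(names(toy_trim))
print(names(toy_trim2))
```

I kept the two steps separate in the assignment code because it makes each step’s purpose easier to read, but either version should give you the same columns.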
When you run the above code, you’ll notice that the “nys_w1_trim” data still has 1725 observations like the original data, but only has 19 variables.
dim(nys_w1_trim)
## [1] 1725 19
Recall you can take a look at the first six rows of the data, as well as check that the rename function worked as expected (i.e., it renamed the columns), by typing head(nys_w1_trim). Here is what the first six rows look like:
CASEID | age | peer_mar | peer_alc | peer_cheat | peer_vandal | peer_burg | peer_selldrugs | peer_theftlt5 | peer_theftgt50 | evsoc | soc_imp | peertroub_police | resp_mar | resp_cheat | resp_burg | resp_selldrugs | resp_theftlt5 | resp_theftgt50 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 13 | 1 | 3 | 3 | 3 | 2 | 1 | 2 | 1 | 3 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 15 | 1 | 2 | 3 | 1 | 1 | 1 | 1 | 1 | 4 | 3 | 3 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 11 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 16 | 4 | 4 | 4 | 3 | 2 | 2 | 3 | 2 | 2 | 3 | 2 | 7 | 3 | 1 | 0 | 0 | 1 |
5 | 14 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 4 | NA | 1 | 2 | 0 | 0 | 5 | 0 |
6 | 11 | 2 | 1 | 4 | 2 | 2 | 1 | 3 | 1 | 2 | 3 | 3 | 1 | 2 | 0 | 0 | 1 | 0 |
As you can see, the variable/column names are now more informative.
Now that we have our trimmed data, let’s go ahead and save it in our “Data” folder within our R Assignment 4 file structure. To do that we’ll simply use the “save” function that is a part of base R and the “here” package to tell R where to save the file. Here is the code:
save(nys_w1_trim, file = here("Data", "nys_w1_trim.rda"))
The code is telling R to save the data object that we just created (nys_w1_trim) and then uses the “here” package to tell R to go to the “Data” folder within our root directory and save it as “nys_w1_trim.rda”.
The “.rda” extension is just short for the “.RData” file format, which allows you to save multiple R objects in a single file. Here, we are having it save just the nys_w1_trim data. But we could have saved both the “nys_w1” and “nys_w1_trim” data objects together by including both separated by commas like so:
save(nys_w1, nys_w1_trim, file = here("Data", "nys_w1_raw-and-trim.rda"))
When we later load the “nys_w1_raw-and-trim.rda” file with the load function, it will bring both the “nys_w1” and “nys_w1_trim” objects back into our global environment in R.
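Here is a small sketch of that save-and-load cycle using base R only. The two toy data frames and the temporary file location are stand-ins for illustration. The point is that load() restores every object stored in the file under its original name.

```r
# Two made-up objects standing in for the raw and trimmed data
raw_toy <- data.frame(id = 1:3, V169 = c(13, 15, 11))
trim_toy <- data.frame(id = 1:3, age = c(13, 15, 11))

# Save both objects into a single .rda file in a temporary location
rda_path <- file.path(tempdir(), "toy_raw-and-trim.rda")
save(raw_toy, trim_toy, file = rda_path)

# Remove the objects to mimic starting a fresh R session
rm(raw_toy, trim_toy)

# load() restores the objects and returns their names
loaded_names <- load(rda_path)
print(loaded_names)
```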
In order to drive home exactly what the “here” package is doing, complete the following steps:
Type here() into the console and check to see that the “here” package is currently “looking” in the correct root folder (“LastName_CRM495_RAssignment4”).
The next thing we want to do is create a simple plot and save it. Saving your plots as separate files can be useful for future you or for someone else. For example, if you wanted to use your plot in a presentation, rather than re-running all of your code or copying and pasting from your knitted file, you (or someone else) could simply insert the image file you created (publishers may also prefer image files).
Let’s go ahead and reproduce the age-distribution chart we created in the last assignment, but instead of using the pooled data, let’s use the “nys_w1_trim” data we created above. We’ll also include some additional features like informative axis labels and a plot title.
nys_w1_ageplot <- ggplot(data = nys_w1_trim, mapping = aes(x = factor(age))) +
  geom_bar(fill = "lightblue", color = "black") +
  scale_y_continuous(limits = c(0, 300), breaks = seq(0, 300, 50)) +
  labs(title = "NYS Wave 1 - Age Distribution", x = "Age", y = "Count") +
  # Apply the complete theme first; adding it last would reset the tweaks below
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 10),
        axis.title = element_text(size = 10))
nys_w1_ageplot
Ok, so admittedly it’s not the most useful or visually appealing plot, but it will work for what we need to do for this assignment. Notice that we assigned the ggplot to an object named “nys_w1_ageplot.” This will allow us to use the plot in subsequent R functions, for example, to save it as a jpeg within our file structure.
To save the plot we’ll use the function ggsave() that is part of the “ggplot2” package. It allows you to save the plot in multiple file formats and specify things like the dimensions of the plot (see the ggsave documentation for details). For now, we’ll just save it as a jpeg that is 8 inches wide and 6 inches tall, which is roughly the printable area of a letter-sized page in landscape orientation with one-inch margins.
ggsave(filename = here("Figures", "nys_w1_ageplot.jpg"),
plot = nys_w1_ageplot, width = 8, height = 6, units = "in")
In the above code, I used the “here” package to tell R to save the file in the “Figures” folder within our file structure and named the file “nys_w1_ageplot.jpg”. I also specified the ggplot object I wanted to save (plot = nys_w1_ageplot), the width and height of the figure, and the units for that width and height, in this case inches. I could have also specified centimeters (“cm”), millimeters (“mm”), or pixels (“px”).
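To illustrate the units argument, the sketch below saves a toy plot at the same physical size as above but specified in centimeters (8 inches is 20.32 cm and 6 inches is 15.24 cm). The toy data and the temporary file location are assumptions for illustration. In your own work you would point the file name at your “Figures” folder.

```r
library(ggplot2)

# A toy bar chart built from made-up ages
toy_plot <- ggplot(data.frame(age = c(11, 12, 12, 13)), aes(x = factor(age))) +
  geom_bar()

# Temporary stand-in for the "Figures" folder
fig_path <- file.path(tempdir(), "toy_ageplot.jpg")

# Same size as 8 x 6 inches, but expressed in centimeters
ggsave(filename = fig_path, plot = toy_plot,
       width = 8 * 2.54, height = 6 * 2.54, units = "cm")

# Confirm that the image file was written
print(file.exists(fig_path))
```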
I’ll be honest, creating a README file is something I have rarely done in my own work. But Danielle Navarro convinced me that this is something I should start doing regularly (see Part 5 of her video series on “Project Structure”). Basically, a README file is simply a plain text file that describes the logic of the file structure and what’s in each file. You can create it with any plain text editor, including within RStudio (simply click on “File” > “New File” > “Text File” in the menu).
You do not need to get fancy with a README.txt file. You simply need to provide the basic information necessary for future you or someone else to understand the logic of your file structure and what each file and/or folder includes. Go ahead and open a plain text editor and create a README.txt document and save it in the root folder (“LastName_CRM495_RAssignment4”). Here is what my README.txt file looks like:
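If you prefer to stay in R, a README can also be written with base R’s writeLines function. The wording below is a hypothetical illustration built from the folders and files created in this assignment, not a copy of my own README, and the file is written to a temporary location as a stand-in for your root folder.

```r
# Hypothetical README contents describing this assignment's file structure
readme_lines <- c(
  "README for LastName_CRM495_RAssignment4",
  "",
  "Root folder: RMD file and knitted html file for R Assignment 4",
  "Data/ICPSR_08375/: raw NYS Wave 1 download from ICPSR (SPSS file and codebook)",
  "Data/nys_w1_trim.rda: trimmed Wave 1 data with renamed variables",
  "Figures/nys_w1_ageplot.jpg: bar chart of the Wave 1 age distribution"
)

# Temporary stand-in for the root folder
readme_path <- file.path(tempdir(), "README.txt")
writeLines(readme_lines, readme_path)

# Read the file back to confirm what was written
print(readLines(readme_path))
```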