Project Assignment - Phase 4: Identify & Describe Key Variables

Assumptions & Ground Rules

Since you have already completed Project Assignments #1, #2, and #3, I assume you have found an article on a topic of interest to you and that you have committed to one of the following options using the NYS data:

Direct Reproduction - reproduce or verify an original study’s findings using the same data and methods (e.g., coding decisions). OR
Conceptual Replication - test the repeatability, robustness, or generalizability of a theoretical or observational claim from a previous study using new data and/or new measurement procedures that are conceptually similar but not identical to those used in the previous study.
- Note: Some of you may be doing a combination of these (e.g., reproducing and extending the results of a study that analyzed NYS data).

Moreover, since you have already completed R Assignments #1 through #5, we assume that you are familiar with:

RStudio.
Creating organized, descriptive, and reproducible R Markdown (RMD) files.
Creating and working within a reproducible file structure.
Downloading NYS data directly from within R.
Identifying measured items within the NYS codebooks.
Creating basic frequency tables within R using the sjmisc package.

If you are not familiar with these, please review R Assignments 1-5.

Note: For this assignment, have RMarkdown knit to an html file.

You have already drafted some of the following parts of this assignment in previous Project Assignments (e.g. Project Assignment #3). However, you should add and expand on those parts where appropriate (e.g., description of article, data source, and findings, and justification of replication or reproduction).

Remember, the ultimate goal for this project is to produce a clear and coherent direct reproduction or conceptual replication of a published article. The final version should read like a professional article or blog that justifies and describes your reproduction or replication.

Preface: Set Global Options for RMD File

I (Jake) wanted to make this explicit for this assignment as it will help clean up your knitted html files. I start all of my RMD files with the following code chunk:

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, fig.align = "center")

The above code chunk sets my frequently used settings for R code chunks as global options to be applied by default in all subsequent R code chunks.

echo = TRUE indicates that R code chunks should be shown by default in the final html document.
warning = FALSE and message = FALSE suppress R warnings and messages in the final html document.
fig.align = "center" centers all figures or images by default.

You can change these options in the code chunk options for individual code chunks, but this ensures that the default presentation of code chunks is the way we generally want them to be unless otherwise specified.

Part 1: Describe Article and Justify Reproduction or Replication

As with the first part of Project Assignment #3, this should include a description of the article, data source, and specific findings that will be reproduced, along with a justification for the reproduction. As with the rest of the document, this should be professionally written - think of it like the introductory section of a published replication/reproduction article that must describe the original research and justify the replication/reproduction research.

Part of the description or justification for the replication or reproduction should be a description of the specific waves of NYS data you will be using. For a direct reproduction, this is relatively easy: just specify which waves of NYS data are analyzed by the article you are reproducing. For a conceptual replication, this may require a little more thought and justification (e.g., selecting a wave of data or subsets of data across waves with demographics similar to those in the article you are replicating).
Note: Be sure to provide a link to the article and a full reference for the article (I’m fine if you include the reference at the end of the document, but you should at least cite and link to the article in this part of the assignment).

Part 2: Describe Table(s) and/or Figure(s) you Plan to Reproduce/Replicate

The table(s) and/or figure(s) found in the original study that you plan to reproduce should be included as an image(s). This should also include a description of the main findings and/or your interpretation of the results presented in the table(s) and/or figure(s) you are planning to replicate or reproduce.

Part 3: Describe Methods and Measures of Reproduction/Replication

Think of this section as the “Methods” section for your reproduction or replication. Specifically, you should describe the original article’s methods (e.g., data source, measurement strategies, etc.). Most importantly for your reproduction/replication, you need to identify and describe all of the variables (i.e., “key variables”) needed to reproduce the original article’s table(s) and/or figure(s). In other words, you need to describe the key theoretical constructs the article was examining and exactly how the authors measured them with NYS items (for direct reproductions) or how you plan to measure them with NYS items (for conceptual replications).

As an example, if I were proposing to reproduce Warr’s (1993) Figure 1 (see R Assignment #3), I might write something like the following:

In Figure 1 (pg. 22-23), Warr (1993) was fundamentally interested in how the prevalence of delinquent peers changed over adolescence into young adulthood. Specifically, he measured eight types of delinquent behavior (marijuana use, alcohol use, cheating on school tests, vandalism, burglary, selling hard drugs, theft of less than $5, and theft of more than $50) and examined how the proportion of respondents with no friends engaging in each of these forms of delinquency changed from age 11 to age 21 across the first five waves of the National Youth Survey (NYS).

The NYS is a nationally representative panel study of youth primarily meant to study the causes of criminal and delinquent behavior. The original five waves were collected each year from 1977 to 1981 where the majority of the questions asked about events and behaviors in the previous calendar year.

The peer delinquency variables were constructed from a series of questions that asked respondents: “Think of the people you listed as your close friends. During the last year how many of them have…” engaged in the specific behaviors:

Marijuana: “used marijuana or hashish?”

Alcohol: “used alcohol?”

Cheating: “cheated on school tests?”

Vandalism: “purposely damaged or destroyed property that did not belong to them?”

Burglary: “broken into a vehicle or building to steal something?”

Selling Hard Drugs: “sold hard drugs such as heroin, cocaine, and LSD?”

Theft < $5: “stolen something worth less than $5?”

Theft > $50: “stolen something worth more than $50?”

The original answer categories were on a five-point scale (1 = “None of them,” 2 = “Very few of them,” 3 = “Some of them,” 4 = “Most of them,” 5 = “All of them”). Warr dichotomized each of these items to indicate the “percentage of respondents with no delinquent friends.”

Finally, age was measured with a question asking respondents “How old are you?” and was recorded in whole years only (e.g., ages 11-17 in wave 1).
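
To make the dichotomization in the example above concrete, here is a minimal sketch on made-up data. It is only an illustration under stated assumptions: the data frame toy and its columns age and peer_mj are hypothetical names, not actual NYS variables, and you are not required to recode anything for this assignment.

```r
# Toy data: hypothetical column names, NOT actual NYS items
toy <- data.frame(
  age     = c(11, 11, 11, 15, 15, 15, 17, 17, 17, 17),
  peer_mj = c(1, 1, 2, 1, 3, 4, 2, 5, 1, 3)  # 1 = "None of them" ... 5 = "All of them"
)

# 1 if the respondent reports that none of their friends used marijuana, else 0
toy$no_mj_friends <- as.integer(toy$peer_mj == 1)

# Percentage of respondents at each age with no marijuana-using friends
pct_none_by_age <- tapply(toy$no_mj_friends, toy$age, mean) * 100
round(pct_none_by_age, 1)
```

With real data you would also need to decide how to handle missing values, since a missing answer stays missing after this recode.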

Note: For those of you who are reproducing/replicating descriptive statistics tables, you should only focus on conceptually describing the key variables for the article’s main research question(s) or hypotheses (first paragraph in my example above). But you should still try to identify and describe how each of the variables in the descriptive statistics table was measured (or can be measured) with the specific NYS items (third paragraph in my example above).

For those of you who are performing conceptual replications, you will need to identify specific items from the NYS that measure the key theoretical constructs as well as the extraneous/control variables used in the article. This may also include a discussion of how your measures are similar to or different from the ones used in the article. Here is an example from my description of a conceptual replication of Orcutt’s (1987) paper that we will perform in a future R Assignment:

Orcutt (1987) set out to test key aspects of Sutherland’s theory by examining how (a respondent’s perception of their) peers’ behavior is associated with a respondent’s own subjective attitudes toward that same behavior (i.e. their “definitions” of the behavior) and, ultimately, with the respondent’s own self-reported participation in that behavior. In this case, the focal behavior is marijuana use. Orcutt also distinguishes between “competent use” and “incompetent use” to incorporate Sutherland’s ideas about the necessity of learning the requisite skills to accomplish deviant behavior as well as to integrate some key insights from Howard Becker’s (1953) classic description of the process of learning to become a regular marijuana user (hence the “with a little help from Becker” part of the title).

Orcutt’s (1987) data came from two “in-class” surveys of undergraduate students at two universities—University of Minnesota and Florida State University—in 1972 and 1973, respectively. In the original data collection, students received a survey that asked either about alcohol use or marijuana use. The data analyzed in this study was restricted to the 444 Minnesota students and 543 Florida State students who completed the survey about marijuana use.

Orcutt’s key dependent variable of “personal use of marijuana to get high” was measured with the question: “Which of the following statements best described the approximate number of times you have gotten ‘high’ on marijuana during the past year?” The answer categories included: 1) “I did not use marijuana during the past year;” 2) “I used marijuana during the past year, but did not get ‘high’;” 3) “I got ‘high’ on marijuana during the past year; but only once or twice;” 4) “I got ‘high’ on marijuana at least 3 times during the past year, but not more than 12 times;” and 5) “I got ‘high’ on marijuana more than 12 times during the past year.”

The NYS includes multiple questions about marijuana use with two being key for our purposes. First is a question about use (V890 in NYS Wave 6): “How many times in the last year have you used marijuana or hashish? (GRASS, POT, HASH)” with the specific number of times reported coded as answers. Second is a question about getting high (V966 in NYS Wave 6): “How many times in the past year have you been high on marijuana?” with the specific number of times reported coded as answers. Ultimately, Orcutt makes this decision pretty easy on us. He was specifically interested in the distinction between “minimally competent use” and incompetent use. This means that the key distinction for Orcutt was between getting high and not getting high. The second question from the NYS better captures this than the first.

Orcutt measured peer marijuana use with the following question: “Of your four closest friends, how many would you say use marijuana at least once a month?” The number of friends the respondent reported as using marijuana at least once a month served as the answer categories, which ranged from 0 to 4 with a mean of 1.2 (SD = 1.4) at Minnesota and 1.8 (SD = 1.6) at Florida State.

The NYS includes measures of peer delinquency and specifically peer marijuana use (V371 in NYS Wave 6). However, it is measured with the question: “Think of your friends. During the last year how many of them have used marijuana or hashish?” The answers were from a five-level ordinal scale with specific answer categories including: “None of them” (=1), “Very few of them” (=2), “Some of them” (=3), “Most of them” (=4), and “All of them” (=5).

While both data sources include a measure of peer marijuana use, the concept is measured differently - and in a way that makes the measures and corresponding results not exactly comparable. To get a sense of how these differences might matter, just try to imagine how someone in Orcutt’s data who answered that “3” of their friends used “marijuana at least once a month” (above the average) would answer the question in the NYS: Would the person answer “very few of them” (=2) in the NYS? Perhaps so, if they have 20 friends. But what if the person has 4 friends - might they instead respond with “most of them” (=4)?

Orcutt measured “subjective definitions” regarding marijuana use with the question: “How would you generally characterize your opinions toward marijuana?” The answer categories across the Minnesota and FSU data were slightly different, although both used 5-point Likert scales (technically Likert-type scales), with the Minnesota survey response categories ranging from “highly negative” to “highly positive” and the Florida State response categories ranging from “negative” to “positive” (p. 346).

The NYS also includes items that measure one’s “subjective definition” of marijuana use, particularly its “wrongness” (V356 in NYS Wave 6). The specific question asks: “How wrong is it for someone your age to use marijuana or hashish?” with answer categories on a 4-point ordinal scale including: “Not wrong” (=1), “A little bit wrong” (=2), “Wrong” (=3), and “Very wrong” (=4).

Part 4: Download NYS Data

Here you should move into the reproduction itself:

The first step should be to download the data via the “icpsrdata” package (as explained in R Assignment #5).
- You should also add R code that automatically creates a unique data folder (e.g., “NYS_data”) for this project as a subfolder in a working directory created specifically for this project (e.g., “LastName_CRM495_RR-Project”) before downloading, then use the icpsrdata package to download the data to your new folder.
- Recall, this process ensures that anyone who runs your RMD file on their own computer will automatically create the appropriate subfolder and then download the data to the correct folder as well.
The second step would be to read the data into R and save it as an object.
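
The first step above might be sketched as follows. This is a hedged sketch rather than required code: the folder name comes from the example in the text, the ICPSR study number 8375 is assumed here to be NYS Wave I and should be verified on the ICPSR site, and the download call is left commented out because it needs your own ICPSR login and an internet connection.

```r
# Folder name from the example in the text; change it to match your project
data_dir <- file.path(getwd(), "NYS_data")

# Create the data subfolder only if it does not already exist
if (!dir.exists(data_dir)) {
  dir.create(data_dir, recursive = TRUE)
}

# Download step (not run here; requires ICPSR login and internet access).
# file_id = 8375 is an ASSUMED study number for NYS Wave I; verify before use.
# icpsrdata::icpsr_download(file_id = 8375, download_dir = data_dir)

# After downloading, list what arrived so you can locate the data file to read in
downloaded_files <- list.files(data_dir, recursive = TRUE)
downloaded_files
```

How you read the data into R in the second step depends on the file format that ICPSR provides, so follow the approach you used in R Assignment #5 and save the result as an object.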

Part 5: Summarize and describe the raw versions of the key variables or items.

You learned how to do this in “R Assignment #5.” Specifically, at this point, we recommend you use the “sjmisc” package. Using that package and tidyverse syntax, you can easily generate basic frequency and descriptive statistics tables (e.g., mydata %>% frq(myvar1) or mydata %>% descr(myvar1, myvar2)).

In “R Assignment #5” you only learned about the frq function in “sjmisc.” Frequency tables are usually more meaningful for describing categorical and ordinal variables, which are common in the NYS. For example, in our description of the peer delinquency items above, they had the following ordinal answer categories: 1 = “None of them,” 2 = “Very few of them,” 3 = “Some of them,” 4 = “Most of them,” 5 = “All of them.” The numerical values assigned to these represent an ordering but they have no numerical meaning beyond that (e.g., the difference between “None of them” and “Very few of them” is not necessarily the same as the difference between “Some of them” and “Very few of them”).
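
As a minimal sketch of a frequency table for an ordinal item, assuming the sjmisc and dplyr packages are installed: the data frame toy and the column peer_mj below are made-up stand-ins, not actual NYS data or variable names.

```r
library(dplyr)   # provides the %>% pipe
library(sjmisc)  # provides frq()

# Toy data: "peer_mj" is a hypothetical name, not an actual NYS variable
# Codes follow the 1 = "None of them" ... 5 = "All of them" scale described above
toy <- data.frame(peer_mj = c(1, 1, 2, 3, 1, 5, 4, 2, 1, NA))

# Frequency table: counts and percentages for each answer category
peer_mj_tab <- toy %>% frq(peer_mj)
peer_mj_tab
```

The printed table reports each category separately, which suits an ordinal item better than a single mean would.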

There are certain items in the NYS that are measured at the ratio level (meaningful intervals and a meaningful zero value) or the interval level (meaningful intervals but no meaningful zero value). In these situations, traditional descriptive statistics like the mean and standard deviation can provide meaningful descriptions of the variables. For example, age is technically a ratio-level variable and knowing the mean age in any NYS wave does tell you something about the age distribution in the data. If you have such a variable, you can use the descr() function in “sjmisc” to get these types of descriptive statistics.
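
A minimal sketch of descr() on a ratio-level variable, again assuming sjmisc and dplyr are installed and using made-up data (the column age here is a toy vector, not the NYS age item):

```r
library(dplyr)   # provides the %>% pipe
library(sjmisc)  # provides descr()

# Toy data: ten made-up ages, not actual NYS responses
toy <- data.frame(age = c(11, 12, 13, 13, 14, 15, 15, 16, 17, 17))

# Descriptive statistics (e.g., mean and standard deviation) for age
age_descr <- toy %>% descr(age)
age_descr

# Base R cross-check of the two statistics discussed in the text
mean(toy$age)
sd(toy$age)
```

Cross-checking against base R is optional, but it is a quick way to confirm you are describing the variable you think you are.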

Note: For this assignment, you are not required to recode the variables (e.g., using mutate) if that is needed for reproduction or to create polished versions of the descriptive statistics tables (e.g., with gt package) or figures (e.g., with ggplot2 package). You will be required to do those things eventually (e.g. Project Assignment #5) once you have had the opportunity to learn those skills via forthcoming R Assignments.
Note: Be sure to provide some basic description of the distributions of your items, especially the items measuring the key variables for the article’s main research question(s) or hypotheses. It is not necessary to do this for the items that measure extraneous/control variables at this point.

Part 6: Write Conclusion and Submit Knitted File and File Structure

Upon completing the tasks in the previous sections, your knitted document should be well-organized (e.g., using leveled subheadings - ## Top Heading, ### Subheading 1, etc.) and professionally written. Additionally, I recommend adding a table of contents (or perhaps a floating table of contents), which you can do by modifying the YAML header at the top of your R Markdown file (see R Assignment #2). Finally, you might wish to select your own theme to personalize the aesthetic look of your final knitted document, which you can also do by modifying the YAML header.

To finalize and submit your document:

Create a “Conclusion” section where you discuss the limitations of your proposed reproduction or replication and any issues or problems you anticipate in trying to replicate or reproduce the results in the table(s) and/or figure(s) you identified.
- Note: This can be a revised version of your conclusion for “Project Assignment #3.”
- Note: You may also want to include another paragraph or section where you write about what you learned in this assignment and any problems or issues you had in completing it.
“Knit” your final RMD file to html format and save it using an informative file name (e.g., “LastName_CRM495_RR-Project-Phase4_YEAR_MO_DY”) within a file structure you create for this assignment (e.g., “LastName_CRM495_RR-Phase4”).
- Note: See “R Assignment #3” for details on creating a reproducible file structure.
Submit your knitted html file on Canvas.
Place a copy of your root folder in your LastName_495_commit folder on OneDrive.
- Note: The root folder should contain your reproducible file structure for this assignment. This means it should include your image files and anything else necessary to reproduce your knitted html document with “one click.”