R Assignment 4: Conceptual Replication

Background and Rationale

Reproduction, Replication, and their Varieties
Assumptions & Ground Rules
The Study

Part 1 (Assignment 4.1)

Goal: Identify Wave of NYS Data that is most similar to Orcutt’s samples

Part 2 (Assignment 4.2)

Goal: Identify Specific Items from NYS Data Most Similar to Orcutt’s Measures

Part 3 (Assignment 4.3)

Goal: Select, Rename, and Recode Items for Analysis

Part 4 (Assignment 4.4)

Goal: Replicate Descriptive Statistics (Table 1) from Orcutt (1987)

Part 5 (Assignment 4.5)

Goal: Replicate Descriptive Statistics (Table 1) from Orcutt (1987) for Earlier Waves of NYS

Background and Rationale

The purpose of this assignment is to perform a conceptual replication of some observations in Orcutt’s (1987) paper in Criminology titled: “Differential Association and Marijuana Use: A Closer Look at Sutherland (with a Little Help from Becker).” Since Orcutt’s (1987) original data are unavailable, we will assess whether some of his findings can be repeated with, and generalize to, a similar sample in the NYS data. Recall that you used a pooled version of the first five waves of the NYS data to conduct reproduction research in the previous assignment.

Reproduction, Replication, and their Varieties

Notice that we are drawing a distinction between “reproducibility” and “replicability” and, likewise, between reproduction and replication research. Recall, in R Assignment 3, we conducted a reproduction of part of Figure 1 from Mark Warr’s (1993) classic paper on age and peers using the exact same data as Warr. In a reproduction, the goal is to verify or repeat exactly some or all of the findings reported in a previous study using identical data and methods as the original study. Unfortunately, the terminology surrounding reproduction and replication is inconsistent and confusing. For example, some use the term “pure replication” to refer to what we call reproduction research (e.g., see Freese and Peterson 2017, pp.152-3). We note that our distinctions are consistent with those used in Ritchie’s (2020) book (which you are currently reading) and with others’ recent attempts to clarify terminology in this space (e.g., Patil, Peng, & Leek 2019).

In addition to distinguishing between reproducibility and replicability, we might also draw distinctions between different types of replications. Perhaps the most common is the distinction between a direct replication and a conceptual replication (cf. Crandall and Sherman 2016; Pridemore, Makel, & Plucker 2018, p.21). In a direct replication, one assesses the same theoretical or observational claim of a study using new data and measures that are collected or designed in such a way as to match the prior study’s design as exactly as possible, though perhaps with some notable exceptions (e.g., a larger sample size to improve statistical inferences). In contrast, a conceptual replication assesses the same theoretical or observational claim as a previous study using new data and/or new measurement procedures that are conceptually similar but not identical to those used in the previous study. We recommend reading Crandall and Sherman’s (2016) detailed discussion of these distinctions and their case for the relative utility of conceptual replications in advancing scientific progress; see also Nosek and Errington’s (2020) critique of these distinctions.

Underlying many of these terminological distinctions are differences in research procedures and research aims. For instance, drawing on Freese and Peterson’s (2017) typology of the different aims involved in replication and reproducibility research, reproduction research often aims to assess verifiability by attempting to reproduce or verify an original study’s findings using the same data and methods (e.g., code). Direct replications typically assess repeatability by testing whether the same findings emerge or repeat when applying the same methods to a new sample. Conceptual replications often assess repeatability, robustness, and/or generalizability of a theoretical or observational claim by, for instance, testing the original claim’s robustness to different measurement specifications using the same data or testing the claim’s generality to new samples (e.g., different groups or contexts). We recommend reading Freese and Peterson’s in-depth discussion of these aims; for convenience, we include their definitions (see 2017, p.152) of these four aims below.

Tests of verifiability: “taking the results of an original study as the object of inquiry and asks limited questions regarding whether the same results are obtained by doing the same analyses on the same data.”
Tests of robustness: “conduct a reanalysis on the original data using alternative specifications to see if the target finding is merely the result of analytic decisions.”
Tests of repeatability: “collecting new data to determine whether key results of a study can be observed by using the original procedures.”
Tests of generalizability: “the original study provides a premise for research trying to evaluate whether similar findings may be observed consistently across different methods or settings.”

Assumptions & Ground Rules

To accomplish a conceptual replication, we will need to do and learn the following:

Identify some of the key variables analyzed by Orcutt’s (1987) paper in Criminology titled: “Differential Association and Marijuana Use: A Closer Look at Sutherland (with a Little Help from Becker).”
Identify the specific wave and the specific items in the NYS that best align with the sample and variables used in Orcutt’s study.
Replicate Table 1 from Orcutt’s paper (pg. 348) using NYS items.

We assume that you are now familiar with installing and loading packages in R. Thus, when you see a package being used, we expect that you know it needs to be installed and that it needs to be loaded within your own R session in order to use it.

At this point, we also assume you are familiar with RStudio, with creating R Markdown (RMD) files, and with the basics of using the here package and a self-referential file structure to create a reproducible R Markdown file.

If not, please review R Assignments 1 & 2.

The Study

We recommend doing a quick AIC reading of Orcutt’s (1987) study. To get you started, here is the abstract:

Based on Sutherland’s differential association theory and Becker’s early research on marijuana use, a contingency model estimating the exact probability of getting high on marijuana under various associational and motivational conditions is specified and tested. Data from surveys at two universities fit this model closely. Predicted first-order interactions and nonlinear effects of motivational balance and peer association are statistically significant and generate highly precise estimates of the probability of getting high. These results suggest that linear main-effects models employed in previous research on differential association processes do not adequately reflect the complex [causal] structure of Sutherland’s theory. In addition, this study raises serious questions about claims that differential association theory is untestable and has been made outdated by social learning theory.

While that all sounds pretty technical, fundamentally the paper is examining some core theoretical claims from Edwin Sutherland’s Differential Association Theory. Specifically, Orcutt sets out to test key aspects of Sutherland’s theory by examining how (a respondent’s perception of their) peers’ behavior is associated with a respondent’s own subjective attitudes toward that same behavior (i.e. their “definitions” of the behavior) and, ultimately, with the respondent’s own self-reported participation in that behavior. In this case, the focal behavior is marijuana use. Orcutt also distinguishes between “competent use” and “incompetent use” to incorporate Sutherland’s ideas about the necessity of learning the requisite skills to accomplish deviant behavior as well as to integrate some key insights from Howard Becker’s (1953) classic description of the process of learning to become a regular marijuana user (hence the “with a little help from Becker” part of the title).

Orcutt’s data came from two “in-class” surveys of undergraduate students at two universities—University of Minnesota and Florida State University—in 1972 and 1973, respectively. Here is the description Orcutt provides in the article:

Approximately half of the respondents in both surveys received a questionnaire focusing on alcohol use while the other half completed a parallel form dealing with marijuana use. This analysis will be restricted to students at each school who filled out the marijuana questionnaire–444 Minnesota undergraduates and 543 Florida State (FSU) undergraduates.

You can find more details on these data in two other articles that Orcutt published prior to this paper (see Orcutt, 1975 and Orcutt, 1978).

Part 1 (Assignment 4.1)

Goal: Identify Wave of NYS Data that is most similar to Orcutt’s samples

Recall, the purpose of a conceptual replication is to test the repeatability, robustness, or generalizability of a theoretical or observational claim from a previous study using new data collection and/or new measurement procedures that are conceptually similar but not identical to those used in the previous study. The NYS data provide an opportunity to do this, since they include variables that are similar to, although not exactly the same as, those used by Orcutt and, unlike Orcutt’s two university student samples, the NYS is a larger, nationally representative panel survey of youth followed for 10 years across seven waves of data.

Specifically, youth respondents in the NYS ranged from ages 11 to 17 in Wave 1 to ages 21 to 27 in Wave 7. Since Orcutt’s sample is a student-based sample from two specific universities, we will start by identifying the wave of the NYS in which the youth respondents are most similar to the population from which Orcutt was sampling in terms of age. This will allow us to assess the repeatability of his findings on a group of youth in a similar age range while the nationally representative sample will permit us to assess whether Orcutt’s findings will generalize beyond college students attending the two specific universities that he studied.

Note: Later in the assignment, you will have the opportunity to see if Orcutt’s findings also generalize to these same NYS respondents at different waves and, hence, at different age ranges.

Let’s look at the age range for each wave of NYS data:

wave	year	min	max
Wave 1	1976	11	17
Wave 2	1977	12	18
Wave 3	1978	13	19
Wave 4	1979	14	20
Wave 5	1980	15	21
Wave 6	1983	18	24
Wave 7	1987	21	27

The table above shows that the Wave 6 NYS data likely includes the panel members when they are most similar to Orcutt’s college-based samples in terms of age with an age range of 18-24. It is important to note that we do not actually know the specific age range or distribution of Orcutt’s sample because he did not appear to report it in the article. This is a situation where we are likely fairly safe in assuming Orcutt’s sample was largely drawn from 18-22 year-olds, but without that information provided in the article and without the data being shared publicly, we really cannot know for sure.

Note: I also checked the articles cited above that Orcutt (1987) cites as providing more details about these data (Orcutt, 1975, 1978) and did not find any information on the age distribution of the samples in those articles either.
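The age comparison above can also be sketched in code. This is a minimal sketch rather than part of the assignment's required steps: it re-enters the age ranges from the wave table and counts how many whole-year ages each wave shares with an assumed college-age window of 18 to 22. The window (target_min, target_max) and the column names are our assumptions, since Orcutt did not report the ages of his samples.

```r
# Age ranges copied from the wave table above
waves <- data.frame(
  wave    = paste("Wave", 1:7),
  year    = c(1976, 1977, 1978, 1979, 1980, 1983, 1987),
  min_age = c(11, 12, 13, 14, 15, 18, 21),
  max_age = c(17, 18, 19, 20, 21, 24, 27)
)

# ASSUMPTION: Orcutt's undergraduates were mostly 18 to 22 years old
target_min <- 18
target_max <- 22

# Count the whole-year ages each wave shares with the assumed window
shared_low  <- pmax(waves$min_age, target_min)
shared_high <- pmin(waves$max_age, target_max)
waves$overlap <- pmax(0, shared_high - shared_low + 1)

# Wave with the largest overlap
best_wave <- waves$wave[which.max(waves$overlap)]
print(waves[, c("wave", "min_age", "max_age", "overlap")])
print(best_wave)
```

Under this assumed window, Wave 6 is the only wave that covers all five ages from 18 to 22. A different window could change the ranking, so treat this as a sanity check on the choice of wave rather than as evidence about Orcutt's actual samples.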

Let’s download the data directly from ICPSR like we did in “R Assignment #3.”

Note: You should already have the “NYS” folder in your “Datasets” folder from the previous assignment. You can check using the function from the previous assignment.
Note: Recall that when you first run this chunk during an R session, you will be asked to enter your ICPSR account information into the R console. So check the R console after trying to run the icpsr_download command to see and respond to the ICPSR username/password prompts. Also, recall that this requires that you have an ICPSR account, which you should have created for the previous assignment. If you receive errors, go to the ICPSR website and be sure that you are able to log in using the username (email) and password that you are entering in the R console.

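library(icpsrdata)  # provides icpsr_download()
library(here)       # builds file paths relative to the project root

# Download NYS Wave 6 (ICPSR study 9948) only if its folder does not already exist
ifelse(dir.exists(here("Datasets/NYS/ICPSR_09948")), TRUE, 
       icpsr_download(file_id = c(9948), 
               download_dir = here("Datasets/NYS")))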

## [1] TRUE

Note: To prevent R from repeatedly trying to download data we have already downloaded, and to prevent issues when you knit your Rmd file, we wrapped the icpsr_download function in an ifelse statement. This is similar to what we did in “R Assignment 3” when we were creating the “NYS” sub-folder in our “Datasets” folder. See Part 2 of “R Assignment 3” for an explanation of the logic. The statement checks whether the “ICPSR_09948” folder (which holds NYS Wave 6) exists in the “NYS” folder. If it does, the statement returns the logical value “TRUE”; if it does not, it runs the icpsr_download command to download the NYS Wave 6 data from ICPSR.
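To make the "download only if missing" logic concrete, here is a minimal sketch written with an ordinary if statement. It does not contact ICPSR: download_if_missing and fake_download are hypothetical helpers invented for this illustration, and the folder lives in a temporary directory rather than in your “Datasets” folder.

```r
# Hypothetical helper: run download_fun only when target_dir is absent
download_if_missing <- function(target_dir, download_fun) {
  if (dir.exists(target_dir)) {
    return("already present")   # nothing to do, skip the download
  }
  download_fun(target_dir)      # folder is missing, so fetch the data
  "downloaded"
}

# Stand-in for icpsr_download(): it only creates the folder (ASSUMPTION)
fake_download <- function(dir) {
  dir.create(dir, recursive = TRUE)
}

# Work in a temporary folder so nothing in the project is touched
demo_dir <- file.path(tempdir(), "NYS_demo", "ICPSR_09948")
unlink(dirname(demo_dir), recursive = TRUE)  # start from a clean slate

first_run  <- download_if_missing(demo_dir, fake_download)
second_run <- download_if_missing(demo_dir, fake_download)
print(c(first_run, second_run))
```

The first call creates the folder and the second call skips the download, which is the behavior we want each time the Rmd file is knit.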

Load the data into R and assign it to an object with the read_spss command like you have done in the previous three assignments.

library(haven)
nys_w6 <- read_spss(here("Datasets", "NYS", "ICPSR_09948", "DS0001", "09948-0001-Data.sav"))

Part 2 (Assignment 4.2)

Goal: Identify Specific Items from NYS Data Most Similar to Orcutt’s Measures

The next step is to identify the specific items from the NYS that allow us to measure the same constructs as Orcutt (1987) did. Recall that Orcutt (1987) was attempting to examine key propositions from Sutherland’s Differential Association Theory. Specifically, Orcutt was interested in the relationship between 1) associations with criminal/deviant patterns of behavior (i.e., peers’ use of marijuana), 2) individuals’ subjective attitudes toward that behavior (i.e., definitions favorable to marijuana use), and 3) individuals’ criminal/deviant behavior (i.e., self-reported marijuana use). Below is a simplified diagram of the theorized causal structure, or directed acyclic graph (DAG), representing the hypothesized relationships between these three variables.

1. Associations with criminal/deviant patterns

Orcutt measured peer marijuana use with the following question: “Of your four closest friends, how many would you say use marijuana at least once a month?” The answer categories were the number of friends the respondent reported as using marijuana at least once a month. With four closest friends, that number can range from 0 to 4; the mean was 1.2 (SD = 1.4) at Minnesota and 1.8 (SD = 1.6) at Florida State.

As you know from “R Assignment 3,” the NYS includes measures of peer delinquency and specifically peer marijuana use (V371 in NYS Wave 6). However, it is measured with the question: “Think of your friends. During the last year how many of them have used marijuana or hashish?” The answers were from a five-level ordinal scale with specific answer categories including: “None of them” (=1), “Very few of them” (=2), “Some of them” (=3), “Most of them” (=4), and “All of them” (=5).
- You can see how this is a “conceptual replication” with our first measure. While both data sources include a measure of peer marijuana use, the concept is measured differently, and in a way that makes the measures and corresponding results not exactly comparable. To get a sense of how these differences might matter, just try to imagine how someone in Orcutt’s data who answered that “3” of their friends used “marijuana at least once a month” (above the average) would answer the question in the NYS: Would the person answer “very few of them” (=2) in the NYS? Perhaps so, if they have 20 friends. But what if the person has 4 friends? Might they respond with “most of them” (=4) instead?
- If you had substantive knowledge of Sutherland’s Differential Association Theory and the criminological literature on peer influence in general, you may also have an opinion regarding which measure of exposure to peer delinquency - Orcutt’s or the NYS version - best captures the theoretical concept of “exposure to delinquent patterns.” Going down this rabbit hole isn’t necessary for our particular assignment, but it is essential to think about and recognize: 1) that the measurement of abstract social constructs can be a tricky, variable, and error-prone endeavor; and 2) that such seemingly small measurement differences like the one here can have important implications for testing theories and for assessing the replicability of research findings.
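The thought experiment above can be sketched in a few lines. This is purely illustrative: nys_category is a hypothetical helper, and the cut points it uses to turn a share of friends into an NYS answer category are our assumptions, not anything documented by the NYS. Real respondents may map counts to vague quantifiers very differently.

```r
# Hypothetical mapping from "users among my friends" to an NYS answer label
# ASSUMPTION: the 0.25 and 0.50 cut points are invented for illustration
nys_category <- function(users, friends) {
  share <- users / friends
  if (share == 0) return("None of them")
  if (share == 1) return("All of them")
  if (share <= 0.25) return("Very few of them")
  if (share <= 0.50) return("Some of them")
  "Most of them"
}

# The same count of 3 marijuana-using friends in two different networks
large_network <- nys_category(users = 3, friends = 20)  # share = 0.15
small_network <- nys_category(users = 3, friends = 4)   # share = 0.75
print(c(large_network, small_network))
```

The same answer of “3” on Orcutt's item lands in different NYS categories depending on how many friends the respondent has in mind, which is why results from the two measures are not directly comparable.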
Look at the distribution and summary statistics for the “Peer marijuana use” variable using the frq and descr functions from the sjmisc package:

library(sjmisc)

nys_w6 %>%
  frq(V371)

## Y6-367:FRIENDS-USED MARIJUANA (V371) <numeric> 
## # total N=1725 valid N=1464 mean=2.46 sd=1.30
## 
## Value |            Label |   N | Raw % | Valid % | Cum. %
## ---------------------------------------------------------
##     1 |     None of them | 468 | 27.13 |   31.97 |  31.97
##     2 | Very few of them | 318 | 18.43 |   21.72 |  53.69
##     3 |     Some of them | 349 | 20.23 |   23.84 |  77.53
##     4 |     Most of them | 196 | 11.36 |   13.39 |  90.92
##     5 |      All of them | 133 |  7.71 |    9.08 | 100.00
##  <NA> |             <NA> | 261 | 15.13 |    <NA> |   <NA>

nys_w6 %>%
  descr(V371)

## 
## ## Basic descriptive statistics
## 
##   var    type                         label    n NA.prc mean  sd   se md
##  V371 numeric Y6-367:FRIENDS-USED MARIJUANA 1464  15.13 2.46 1.3 0.03  2
##  trimmed   range iqr skew
##     2.46 4 (1-5)   2 0.45

As you can see above, “None of them” is the modal category. Similar percentages of all respondents answered “Very few of them” (18.4%) and “Some of them” (20.2%), and each of these is roughly equal to the last two categories combined, “Most of them” (11.4%) and “All of them” (7.7%). This suggests the distribution is right skewed, or positively skewed. As the descriptive statistics show, and as is characteristic of a right skew, the mode (1) is less than the median (2), which is less than the mean (2.5).
- Also note that about 15% of the respondents (n = 261) are missing data for this question. This means that by using this item, our overall sample size will, at most, be n = 1,464 (1725 - 261).
For now, we will plan on keeping this variable “as-is” with five categories.
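If you want to check the summary statistics by hand, the following sketch rebuilds the variable from the frequency counts printed above and recomputes the key numbers in base R. It is a minimal sketch that does not need the NYS file; the object names (counts, v371, and so on) are ours.

```r
# Counts copied from the frq() output for V371 above (answer codes 1 to 5)
counts    <- c(468, 318, 349, 196, 133)
n_missing <- 261

# Rebuild the variable: one value per respondent, NA for missing answers
v371 <- c(rep(1:5, times = counts), rep(NA, n_missing))

n_total     <- length(v371)
n_valid     <- sum(!is.na(v371))
v371_mean   <- mean(v371, na.rm = TRUE)
v371_median <- median(v371, na.rm = TRUE)
v371_mode   <- which.max(counts)  # position equals the answer code here

# Percentages of all respondents ("Raw %") and of valid answers ("Valid %")
raw_pct   <- round(100 * counts / n_total, 2)
valid_pct <- round(100 * counts / n_valid, 2)

print(c(mode = v371_mode, median = v371_median, mean = round(v371_mean, 2)))
print(rbind(raw_pct, valid_pct))
```

The ordering of mode, median, and mean in the printed result is the pattern described above for a right-skewed distribution, and the two percentage rows show how much the choice of denominator matters when 15% of cases are missing.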

2. Subjective attitudes toward criminal/deviant behavior

Orcutt measured “subjective definitions” regarding marijuana use with the question: “How would you generally characterize your opinions toward marijuana?” The answer categories across the Minnesota and FSU data were slightly different, although both used 5-point Likert scales (technically a Likert-type scale) with the Minnesota survey response categories ranging from “highly negative” to “highly positive” and the Florida State response categories ranging from “negative” to “positive” (p. 346).

The NYS also includes items that measure one’s “subjective definition” of marijuana use, particularly its “wrongness” (V356 in NYS Wave 6). The specific question asks: “How wrong is it for someone your age to use marijuana or hashish?” with answer categories on a 4-point ordinal scale including: “Not wrong” (=1), “A little bit wrong” (=2), “Wrong” (=3), and “Very wrong” (=4).
- Like before with the peer marijuana use variable, it is instructive to work through the intellectual exercise of thinking how the responses on Orcutt’s surveys would map onto the NYS question and response categories. Also, if we were planning to embark upon a serious attempt to contribute to this literature, then it would be worth seriously considering which is a better approach to measuring the theoretical concept of “subjective definitions.” Of course, this would require in-depth substantive knowledge of Sutherland’s theory.
Let’s look at the distribution and summary statistics for this item.

nys_w6 %>%
  frq(V356)

## Y6-352:USE MARIJUANA (V356) <numeric> 
## # total N=1725 valid N=1496 mean=2.74 sd=1.02
## 
## Value |              Label |   N | Raw % | Valid % | Cum. %
## -----------------------------------------------------------
##     1 |          Not wrong | 215 | 12.46 |   14.37 |  14.37
##     2 | A little bit wrong | 381 | 22.09 |   25.47 |  39.84
##     3 |              Wrong | 473 | 27.42 |   31.62 |  71.46
##     4 |         Very wrong | 427 | 24.75 |   28.54 | 100.00
##  <NA> |               <NA> | 229 | 13.28 |    <NA> |   <NA>

nys_w6 %>%
  descr(V356)

## 
## ## Basic descriptive statistics
## 
##   var    type                label    n NA.prc mean   sd   se md trimmed
##  V356 numeric Y6-352:USE MARIJUANA 1496  13.28 2.74 1.02 0.03  3    2.74
##    range iqr  skew
##  3 (1-4)   2 -0.27

This item as originally coded is “left skewed” or “negative skewed,” with the “wrong” category as the modal category and the “very wrong” category as the second most common.
Orcutt took his five-category item and recoded it into three categories with “undecided” as a “neutral” definition and responses on the positive (e.g., “Highly positive” and “Positive”) or negative side (e.g., “Highly negative” and “Negative”) of this category coded as “positive” or “negative” respectively. This made sense given Orcutt used a “bipolar” rating scale for his answer categories in designing his survey. A “bipolar” rating scale simply refers to a set of answer categories that allow a respondent to answer in opposite directions, usually separated by a midpoint (in Orcutt’s case the “undecided” category).
- The NYS, however, uses a “unipolar” rating scale. A unipolar rating scale includes answer categories that only move in one direction (in the case of the NYS, from “Not wrong” to “Very wrong”). As a result, the “subjective definition” item in the NYS does not lend itself nicely to a “neutral” categorization, although perhaps one could argue that the “A little bit wrong” answer conceptually aligns with Orcutt’s “undecided” answer category. Arguably, the NYS approach is also a less desirable match to Sutherland’s concept of definitions, which presumably can range in content from favorable to unfavorable to crime. Below, when we recode the data, we will ultimately decide to collapse the “subjective definition” responses in our analysis into two dummy variables that indicate whether the respondent reportedly has internalized: (A) “negative” definitions unfavorable to marijuana use (i.e., 1=“At least a little bit wrong”) or (B) “positive” definitions favorable to marijuana use (1=“Not wrong”).
  - Note: Again, if we were embarking on a serious attempt at contributing to this literature, then we might also want to assess robustness of findings from our analysis to alternative coding (and analysis) decisions. For example, one alternative coding strategy might involve collapsing the “wrong” and “very wrong” categories into a dummy indicator of “negative definitions,” then using the “A little bit wrong” category to indicate “neutral definitions” and the “not wrong” category to indicate “positive definitions.” In situations where we lack a clear theoretical rationale for one measurement strategy over another, we generally advise starting without collapsing categories and assessing relative frequencies or associations across the full range of responses; likewise, in such situations, we recommend assessing robustness of results across various theoretically defensible measurement approaches.
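The two coding strategies discussed above can be compared with a short sketch. It rebuilds the item from the frequency counts printed earlier, so it runs without the NYS file. The object names (v356, positive_def, negative_def, def3) are ours, and this is only one way to write these recodes in base R, not necessarily the approach taken later in the assignment.

```r
# Counts copied from the frq() output for V356 above (answer codes 1 to 4)
counts    <- c(215, 381, 473, 427)
n_missing <- 229
v356 <- c(rep(1:4, times = counts), rep(NA, n_missing))

# Strategy 1: two dummy variables (NA stays NA)
positive_def <- as.integer(v356 == 1)  # 1 = "Not wrong"
negative_def <- as.integer(v356 >= 2)  # 1 = at least "A little bit wrong"

# Strategy 2: three categories, treating "A little bit wrong" as neutral
# cut() uses right-closed intervals: (0,1], (1,2], (2,4]
def3 <- cut(v356,
            breaks = c(0, 1, 2, 4),
            labels = c("positive", "neutral", "negative"))

print(table(positive_def, useNA = "ifany"))
print(table(def3, useNA = "ifany"))
```

Comparing the two tables shows how much the "negative" group shrinks once “A little bit wrong” is treated as neutral, which is the kind of coding sensitivity a robustness check would examine.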

3. Individual criminal/delinquent behavior

Orcutt’s key dependent variable of “personal use of marijuana to get high” was measured with the question: “Which of the following statements best described the approximate number of times you have gotten ‘high’ on marijuana during the past year?” The answer categories included: 1) “I did not use marijuana during the past year;” 2) “I used marijuana during the past year, but did not get ‘high’;” 3) “I got ‘high’ on marijuana during the past year; but only once or twice;” 4) “I got ‘high’ on marijuana at least 3 times during the past year, but not more than 12 times;” and 5) “I got ‘high’ on marijuana more than 12 times during the past year.”

The NYS includes multiple questions about marijuana use with two being key for our purposes. First is a question about use (V890 in NYS Wave 6): “How many times in the last year have you used marijuana or hashish? (GRASS, POT, HASH)” with the specific number of times reported coded as answers. Second is a question about getting high (V966 in NYS Wave 6): “How many times in the past year have you been high on marijuana?” with the specific number of times reported coded as answers.

Take a moment and think which of these items from the NYS best reflects what Orcutt was trying to measure.
Ultimately, Orcutt makes this decision pretty easy on us. He was specifically interested in the distinction between “minimally competent use” and incompetent use. Here is what he said on pg. 347:

An important feature of this item is that it measures a respondent’s self reported ability to get high which, for Becker (1953), is a defining characteristic of a marijuana user. That is, this measure distinguishes between those who are minimally competent users-who have acquired the physical and subjective techniques for getting high-and those who are not. Therefore, according to Becker’s conception, respondents who checked either of the first two statements should be classified as nonusers. Thus, the dependent variable in this analysis is a proportional measure of initiation into marijuana use- Ownuse-based on a binary scoring of nonusers (0 = statements 1 or 2) versus users (1 = statements 3, 4, or 5).

This means that the key distinction for Orcutt was between getting high and not getting high. The second question from the NYS better captures this than the first. But before we decide, let’s look at the distribution and summary statistics for these items.

nys_w6 %>%
  frq(V890, V966)

## Y6-886:MARIJUANA-FREQ (V890) <numeric> 
## # total N=1725 valid N=1496 mean=32.74 sd=102.52
## 
## Value |   N | Raw % | Valid % | Cum. %
## --------------------------------------
##     0 | 847 | 49.10 |   56.62 |  56.62
##     1 |  71 |  4.12 |    4.75 |  61.36
##     2 |  75 |  4.35 |    5.01 |  66.38
##     3 |  41 |  2.38 |    2.74 |  69.12
##     4 |  31 |  1.80 |    2.07 |  71.19
##     5 |  34 |  1.97 |    2.27 |  73.46
##     6 |  16 |  0.93 |    1.07 |  74.53
##     7 |   5 |  0.29 |    0.33 |  74.87
##     8 |   4 |  0.23 |    0.27 |  75.13
##     9 |   1 |  0.06 |    0.07 |  75.20
##    10 |  34 |  1.97 |    2.27 |  77.47
##    11 |   1 |  0.06 |    0.07 |  77.54
##    12 |  30 |  1.74 |    2.01 |  79.55
##    14 |   1 |  0.06 |    0.07 |  79.61
##    15 |  10 |  0.58 |    0.67 |  80.28
##    16 |   1 |  0.06 |    0.07 |  80.35
##    20 |  30 |  1.74 |    2.01 |  82.35
##    21 |   1 |  0.06 |    0.07 |  82.42
##    22 |   1 |  0.06 |    0.07 |  82.49
##    24 |   2 |  0.12 |    0.13 |  82.62
##    25 |  16 |  0.93 |    1.07 |  83.69
##    26 |   1 |  0.06 |    0.07 |  83.76
##    30 |  16 |  0.93 |    1.07 |  84.83
##    35 |   1 |  0.06 |    0.07 |  84.89
##    40 |   8 |  0.46 |    0.53 |  85.43
##    45 |   1 |  0.06 |    0.07 |  85.49
##    50 |  34 |  1.97 |    2.27 |  87.77
##    52 |  23 |  1.33 |    1.54 |  89.30
##    60 |   6 |  0.35 |    0.40 |  89.71
##    62 |   1 |  0.06 |    0.07 |  89.77
##    70 |   2 |  0.12 |    0.13 |  89.91
##    75 |   4 |  0.23 |    0.27 |  90.17
##    80 |   4 |  0.23 |    0.27 |  90.44
##    85 |   1 |  0.06 |    0.07 |  90.51
##   100 |  29 |  1.68 |    1.94 |  92.45
##   104 |   2 |  0.12 |    0.13 |  92.58
##   130 |   1 |  0.06 |    0.07 |  92.65
##   144 |   2 |  0.12 |    0.13 |  92.78
##   150 |  11 |  0.64 |    0.74 |  93.52
##   156 |   1 |  0.06 |    0.07 |  93.58
##   160 |   1 |  0.06 |    0.07 |  93.65
##   175 |   1 |  0.06 |    0.07 |  93.72
##   200 |  14 |  0.81 |    0.94 |  94.65
##   208 |   2 |  0.12 |    0.13 |  94.79
##   240 |   1 |  0.06 |    0.07 |  94.85
##   250 |   3 |  0.17 |    0.20 |  95.05
##   270 |   1 |  0.06 |    0.07 |  95.12
##   300 |  19 |  1.10 |    1.27 |  96.39
##   340 |   1 |  0.06 |    0.07 |  96.46
##   350 |   1 |  0.06 |    0.07 |  96.52
##   360 |   6 |  0.35 |    0.40 |  96.93
##   365 |  28 |  1.62 |    1.87 |  98.80
##   400 |   1 |  0.06 |    0.07 |  98.86
##   450 |   1 |  0.06 |    0.07 |  98.93
##   500 |   3 |  0.17 |    0.20 |  99.13
##   600 |   4 |  0.23 |    0.27 |  99.40
##   700 |   3 |  0.17 |    0.20 |  99.60
##   730 |   2 |  0.12 |    0.13 |  99.73
##   900 |   1 |  0.06 |    0.07 |  99.80
##   999 |   3 |  0.17 |    0.20 | 100.00
##  <NA> | 229 | 13.28 |    <NA> |   <NA>
## 
## Y6-962:HIGH ON MARIJUANA PAST YEAR (V966) <numeric> 
## # total N=1725 valid N=649 mean=61.87 sd=134.07
## 
## Value |    N | Raw % | Valid % | Cum. %
## ---------------------------------------
##     0 |   66 |  3.83 |   10.17 |  10.17
##     1 |   64 |  3.71 |    9.86 |  20.03
##     2 |   77 |  4.46 |   11.86 |  31.90
##     3 |   33 |  1.91 |    5.08 |  36.98
##     4 |   35 |  2.03 |    5.39 |  42.37
##     5 |   30 |  1.74 |    4.62 |  47.00
##     6 |   20 |  1.16 |    3.08 |  50.08
##     7 |    8 |  0.46 |    1.23 |  51.31
##     8 |    5 |  0.29 |    0.77 |  52.08
##     9 |    1 |  0.06 |    0.15 |  52.23
##    10 |   31 |  1.80 |    4.78 |  57.01
##    12 |   21 |  1.22 |    3.24 |  60.25
##    14 |    1 |  0.06 |    0.15 |  60.40
##    15 |   15 |  0.87 |    2.31 |  62.71
##    18 |    1 |  0.06 |    0.15 |  62.87
##    20 |   25 |  1.45 |    3.85 |  66.72
##    22 |    1 |  0.06 |    0.15 |  66.87
##    24 |    2 |  0.12 |    0.31 |  67.18
##    25 |    9 |  0.52 |    1.39 |  68.57
##    26 |    2 |  0.12 |    0.31 |  68.88
##    30 |   16 |  0.93 |    2.47 |  71.34
##    35 |    2 |  0.12 |    0.31 |  71.65
##    40 |    8 |  0.46 |    1.23 |  72.88
##    45 |    2 |  0.12 |    0.31 |  73.19
##    48 |    1 |  0.06 |    0.15 |  73.34
##    50 |   27 |  1.57 |    4.16 |  77.50
##    52 |   10 |  0.58 |    1.54 |  79.04
##    60 |    2 |  0.12 |    0.31 |  79.35
##    62 |    1 |  0.06 |    0.15 |  79.51
##    70 |    2 |  0.12 |    0.31 |  79.82
##    75 |    3 |  0.17 |    0.46 |  80.28
##    80 |    3 |  0.17 |    0.46 |  80.74
##    85 |    2 |  0.12 |    0.31 |  81.05
##   100 |   30 |  1.74 |    4.62 |  85.67
##   104 |    2 |  0.12 |    0.31 |  85.98
##   110 |    1 |  0.06 |    0.15 |  86.13
##   125 |    1 |  0.06 |    0.15 |  86.29
##   150 |    8 |  0.46 |    1.23 |  87.52
##   156 |    1 |  0.06 |    0.15 |  87.67
##   160 |    2 |  0.12 |    0.31 |  87.98
##   200 |   11 |  0.64 |    1.69 |  89.68
##   240 |    1 |  0.06 |    0.15 |  89.83
##   250 |    4 |  0.23 |    0.62 |  90.45
##   270 |    2 |  0.12 |    0.31 |  90.76
##   300 |   20 |  1.16 |    3.08 |  93.84
##   320 |    1 |  0.06 |    0.15 |  93.99
##   350 |    2 |  0.12 |    0.31 |  94.30
##   352 |    1 |  0.06 |    0.15 |  94.45
##   360 |    4 |  0.23 |    0.62 |  95.07
##   365 |   21 |  1.22 |    3.24 |  98.31
##   400 |    1 |  0.06 |    0.15 |  98.46
##   450 |    1 |  0.06 |    0.15 |  98.61
##   500 |    1 |  0.06 |    0.15 |  98.77
##   600 |    2 |  0.12 |    0.31 |  99.08
##   700 |    1 |  0.06 |    0.15 |  99.23
##   999 |    5 |  0.29 |    0.77 | 100.00
##  <NA> | 1076 | 62.38 |    <NA> |   <NA>

nys_w6 %>%
  descr(V890, V966)

## 
## ## Basic descriptive statistics
## 
##   var    type                              label    n NA.prc  mean     sd   se
##  V890 numeric              Y6-886:MARIJUANA-FREQ 1496  13.28 32.74 102.52 2.65
##  V966 numeric Y6-962:HIGH ON MARIJUANA PAST YEAR  649  62.38 61.87 134.07 5.26
##  md trimmed       range iqr skew
##   0    6.10 999 (0-999)   8 4.90
##   6   27.62 999 (0-999)  48 3.81

What jumps out at you from these distributions? As before, and as is common for many deviant behaviors, the data for the question about “using marijuana” are right-skewed, with zero being the modal answer. However, another thing that should jump out at you is the number of missing (“NA”) cases in each of these items. Specifically, the question about getting high (V966) is missing for 62% of the sample!
- When I (Jake) first saw this, it was not completely clear to me why so many cases were missing on this variable. My hunch was that it was a result of a skip pattern in the survey, specifically that they only asked the question about “getting high” to the respondents who reported that they had used marijuana in the past year (V890). However, when I looked at the codebook (both the ICPSR version and the original version included in the ICPSR documentation), this skip pattern was not completely clear. Ultimately, I had to go to the original survey instrument that is included in the codebook from ICPSR:

- Unfortunately, the original page numbers are not included in the instrument that is attached to the ICPSR codebook. However, with some simple math, we can be fairly confident that the large number of missing for the question about getting high resulted from respondents who reported no use not being asked that question.

  - Go back and look at the distributions for **V890** about use and **V966** about getting high. Notice that for **V890**, there are 1,496 "valid" responses (i.e. non-missing) and 847 respondents who reported zero marijuana use in the past year. If you subtract 847 from 1,496, you get 649---the number of valid cases in **V966**. 
  
  - Ultimately, this means we'll need both questions to construct a measure similar to Orcutt (1987). Essentially, we'll want to create a dichotomous variable that codes those who answered zero on either the question about "using marijuana" or the question about "getting high" from using marijuana as "non-users" (0 = Incompetent-/Non-User). Any respondent who answered one or above on the question about getting high will be coded as a (competent) "user" in Orcutt's terms (1 = Competent User). 
  
    - ***Note:*** If we were doing this conceptual replication in the wild, we would likely want to directly examine the distinction between competent and incompetent use by distinguishing between 1) non-users (i.e., answered zero on the question about use), 2) incompetent users (answered 1 or more on the question about use and answered zero on the question about getting high), and 3) competent users (answered above zero on the question about getting high).
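The skip-pattern arithmetic and the three-way distinction described in the note can be sketched in base R. This is only an illustration on made-up values: the toy vectors and the `user_type` name are assumptions for the example and are not part of the NYS data or of the coding used in this assignment.

```r
# Skip-pattern arithmetic, using the counts from the frequency tables above
n_valid_use <- 1496   # valid (non-missing) responses on V890 (use)
n_zero_use  <- 847    # respondents reporting zero use on V890
n_valid_use - n_zero_use   # should match the valid n on V966 (getting high)

# Toy data (hypothetical values, NOT actual NYS respondents)
maruse  <- c(0, 0, 3, 5, 12, NA, 2)      # times used in past year
marhigh <- c(NA, NA, 0, 5, 10, NA, NA)   # times high in past year

# Three-way classification: non-users first, then split users by whether
# they report getting high at least once. Missing values stay missing.
user_type <- ifelse(maruse == 0, "Non-user",
                    ifelse(marhigh == 0, "Incompetent user", "Competent user"))
user_type <- factor(user_type,
                    levels = c("Non-user", "Incompetent user", "Competent user"))

table(user_type, useNA = "ifany")
```

Note that a respondent who reported use but skipped the getting high question (the last toy case) stays missing here rather than being forced into a category.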

Part 3 (Assignment 4.3)

Goal: Select, Rename, and Recode Items for Analysis

Up to this point, the bulk of the work has been intellectual and theoretical. Indeed, that’s the “conceptual” part of a conceptual replication. But now that we know what items from the NYS align most closely with the items used in Orcutt’s (1987) study and have a good idea of how we want to use them, we need to wrangle and recode the data so that we can analyze it. Like with “R Assignment #3,” this will involve selecting the specific variables, giving them informative names, and recoding them to closely resemble Orcutt’s coding decisions. Like with the previous assignment, in addition to the items identified above that will be used for the conceptual replication, we will also select the “CASEID” variable in order to maintain the individual identifier and we’ll create the “wave” variable to indicate that the data we are working with comes from Wave 6.

Note: We would likely not artificially limit ourselves to studying marijuana if we were pursuing this conceptual replication as a research project. Sutherland’s theory is not limited to explaining substance use, and examining the robustness of its predictions across multiple types of criminal, delinquent, and deviant behavior would be a way to not simply replicate but extend Orcutt’s (1987) study.

1. Select, Rename, and Recode the Data

Here is the code for selecting items, renaming them, and recoding them. We’ll explain the logic below:

library(dplyr)
nys_w6_trim <- nys_w6 %>%
  dplyr::select(CASEID, V371, V356, V890, V966) %>%
  #Provide Informative Names:
  rename(marpeer = V371, 
         mardef = V356,
         maruse = V890,
         marhigh = V966) %>%
  #Recode key Variables
  mutate(marpeer_fct = as_factor(marpeer), #create factor variable from marpeer
         mardef_fct = as_factor(mardef), #create factor variable from mardef
         mardef_neg = ifelse(mardef %in% c(2, 3, 4), 1, 0), #create dummy variable indicating negative definition of marijuana ("A little bit wrong" to "Very Wrong")
         mardef_neg = ifelse(is.na(mardef), NA, mardef_neg),
         mardef_pos = ifelse(mardef == 1, 1, 0), #create dummy variable indicating positive definition of marijuana ("Not Wrong")
         mardef_neut = ifelse(mardef == 2, 1, 0), #create dummy variable indicating neutral definition of marijuana ("A little bit wrong")
         mardef_negneut = ifelse((mardef == 3 | mardef == 4), 1, 0), #create dummy variable indicating negative definition ("Wrong" and "Very Wrong") to align with neutral definition ("A little bit wrong")
         mardef_negneut = ifelse(is.na(mardef), NA, mardef_negneut),
         maruse_dic = ifelse(maruse >= 1, 1, 0), #create dummy variable for maruse
         marhigh_dic = ifelse(marhigh >= 1, 1, 0), #responses 1 or greater on marhigh = 1
         marhigh_dic = ifelse(maruse == 0, 0, marhigh_dic), #responses of 0 on maruse are coded as 0
         wave = 6) 
glimpse(nys_w6_trim)

## Rows: 1,725
## Columns: 14
## $ CASEID         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ marpeer        <dbl+lbl>  4, NA,  1,  4,  3,  3,  4,  3,  3,  1,  3,  2,  3,…
## $ mardef         <dbl+lbl>  2, NA,  4,  2,  2,  2,  2,  2,  2,  4,  2,  3,  1,…
## $ maruse         <dbl> 300, NA, 0, 300, 2, 100, 4, 4, 0, 0, 0, 0, 9, 25, 150, …
## $ marhigh        <dbl> 300, NA, NA, 300, 2, 100, 4, 1, NA, NA, NA, 1, 10, 25, …
## $ marpeer_fct    <fct> Most of them, NA, None of them, Most of them, Some of t…
## $ mardef_fct     <fct> A little bit wrong, NA, Very wrong, A little bit wrong,…
## $ mardef_neg     <dbl> 1, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, …
## $ mardef_pos     <dbl> 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, …
## $ mardef_neut    <dbl> 1, NA, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, …
## $ mardef_negneut <dbl> 0, NA, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, …
## $ maruse_dic     <dbl> 1, NA, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, …
## $ marhigh_dic    <dbl> 1, NA, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, …
## $ wave           <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…

There are a few things that we should explain about the above code.

In the raw nys_w6 data that we imported from ICPSR, all of the categorical variables are treated as numerical in the data frame (specifically “double” format with labels). You can see this by looking at the output from the glimpse(nys_w6_trim) function above (see the <dbl+lbl> next to the variable names). But the “peer influence” (marpeer) and “subjective definitions” (mardef) variables are technically not numerical; they are ordered categories with numbers assigned to each category.

- We want to make sure that R knows these variables are categorical and thus we need to convert them to factors, R’s method for working with categorical variables (see Ch. 15 of R for Data Science for more details). Like with most things, the tidyverse suite of packages has a built-in package for working with factors called “forcats”.

- Actually, the data frame already includes value labels for each variable. You can see this for any variable by using the function get_labels() that is part of the “sjlabelled” package (you can also see this in the glimpse() output above, as the “marpeer” and “mardef” items have <dbl+lbl> next to them).

library(sjlabelled)

get_labels(nys_w6$V371)

## [1] "None of them"     "Very few of them" "Some of them"     "Most of them"    
## [5] "All of them"

get_labels(nys_w6$V356)

## [1] "Not wrong"          "A little bit wrong" "Wrong"             
## [4] "Very wrong"

The as_factor function simply tells R to create a factor variable and, in this case, uses the built-in value labels of the variables we are turning into factors to indicate the factor level.

We used the ifelse command to create four dummy variables related to the “subjective definitions” item. Remember, that the ifelse command takes the form of a logical test: ifelse(test, yes, no). We first created two dummy variables distinguishing between “positive definitions” (mardef_pos) and “negative definitions” (mardef_neg).

For the “mardef_neg” variable, we simply told R to assign a value of 1 if the respondent answered “A little bit wrong” (2) through “Very wrong” (4) and zero otherwise. Recall, this question had unipolar answer categories ranging from “Not wrong” to “Very wrong” views of marijuana use. To do this, we used the logical operator %in% (see the Data Transformation chapter in R For Data Science for an explanation of the different logical operators) and listed the values we wanted coded as one in a vector using the c() function. A vector is simply a one-dimensional array or series of values (see here for brief explanation). This is basically a shortcut for writing ifelse((mardef == 2 | mardef == 3 | mardef == 4), 1, 0).
- For the “mardef_pos” variable, we simply told R to code values of “Not wrong” (1) as 1 and zero otherwise. These two dummy variables are likely as close as we can get to Orcutt’s (1987) three dummy variables for “Negative,” “Neutral,” and “Positive” subjective definitions of marijuana.
We also created two dummy variables, “mardef_neut” and “mardef_negneut”, that account for a “neutral” definition by treating the “A little bit wrong” answer category (2) as neutral and the “Wrong” (3) and “Very wrong” (4) answers as negative. Note we had to create an additional “negative” dummy variable because dummy variables for multiple categories of the same variable should be mutually exclusive. The “mardef_neg” variable included the “A little bit wrong” answer and thus would have overlapped with our “mardef_neut” variable. Of course, given the unipolar nature of the “subjective definitions” question in NYS, this specific coding strategy may not be conceptually justified.
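One detail of the recoding above is worth a small illustration: the variables built with %in% are followed by an extra ifelse(is.na(mardef), NA, ...) line, while those built with == are not. The sketch below, using a made-up toy vector rather than the NYS data, shows why that extra step matters.

```r
mardef <- c(1, 2, 3, 4, NA)   # toy values; NA = respondent did not answer

# `==` propagates missing values, so the dummy is already NA where mardef is NA
mardef_pos <- ifelse(mardef == 1, 1, 0)
mardef_pos

# `%in%` never returns NA: a missing value is simply "not in the set" (FALSE)
mardef %in% c(2, 3, 4)

# Without a fix, the missing respondent would be coded 0 instead of NA
mardef_neg <- ifelse(mardef %in% c(2, 3, 4), 1, 0)

# So missing values are restored explicitly in a second step
mardef_neg <- ifelse(is.na(mardef), NA, mardef_neg)
mardef_neg
```

Either operator works for building the dummies; the point is only that %in% needs the explicit missing-value step and == does not.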

We used the ifelse function to create two dummy variables for the using marijuana (“maruse_dic”) and getting high from marijuana (“marhigh_dic”) items.

For the “maruse_dic” variable, we simply created dummy variables similar to how we did in “R Assignment 3” by telling R that if the “maruse” variable is greater than or equal to one, make the “maruse_dic” variable equal to 1 and make it zero otherwise (i.e. if it equals zero).
For the “marhigh_dic” variable, we needed two ifelse commands that are essentially nested: they are run one after the other, and the second builds on the result of the first. This is because we needed the “marhigh_dic” variable to account for those who reported using marijuana but did not get high.
- First, we created the dummy variable like we did with “maruse_dic” (if the “marhigh” variable is greater than or equal to one, make the “marhigh_dic” variable equal to 1, and make it zero otherwise, i.e., if it equals zero). This assigns each of the 649 valid cases in the “marhigh” variable to the appropriate category (1 or 0), but it affects only those 649 valid cases.
- Second, in order to account for the fact that (almost all) respondents who reported not using marijuana in the past year were not asked if they got high from marijuana, we needed another ifelse command. Thus, the second ifelse command tells R that if “maruse” equals zero, make “marhigh_dic” equal to zero, and otherwise leave “marhigh_dic” at the value it already has. This is how we nested the two ifelse commands. If the second command does not apply (i.e. maruse does not equal zero), then the “marhigh_dic” variable keeps the value we told it to take in the first ifelse command.
- Note: If you want to get a sense of this, run the above code where we created the “nys_w6_trim” data but comment out the second ifelse command and create a frequency table for the “marhigh_dic” variable. Here is what it will look like:

## marhigh_dic <numeric> 
## # total N=1725 valid N=649 mean=0.90 sd=0.30
## 
## Value |    N | Raw % | Valid % | Cum. %
## ---------------------------------------
##     0 |   66 |  3.83 |   10.17 |  10.17
##     1 |  583 | 33.80 |   89.83 | 100.00
##  <NA> | 1076 | 62.38 |    <NA> |   <NA>

  - Notice how it has 1,076 missing values? This includes the respondents who were legitimately missing on both the "marijuana use" and "getting high" items but also those who answered zero to the "marijuana use" question and thus were not asked the "getting high" question. Compare this to the variable you actually created with the nested `ifelse` commands:

## marhigh_dic <numeric> 
## # total N=1725 valid N=1493 mean=0.39 sd=0.49
## 
## Value |   N | Raw % | Valid % | Cum. %
## --------------------------------------
##     0 | 911 | 52.81 |   61.02 |  61.02
##     1 | 582 | 33.74 |   38.98 | 100.00
##  <NA> | 232 | 13.45 |    <NA> |   <NA>
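The effect of the two ifelse steps can be traced on a handful of made-up cases. The sketch below is a toy illustration only (the values are hypothetical, not NYS respondents), but it follows the same two-step logic as the recode above.

```r
# Toy data (hypothetical): times used and times high in the past year
maruse  <- c(0, 0, 0, 4, 6, 2, NA)
marhigh <- c(NA, 0, 1, 4, 0, NA, NA)

# Step 1: dichotomize the getting high item; skipped cases remain NA
step1 <- ifelse(marhigh >= 1, 1, 0)
sum(is.na(step1))   # non-users who skipped the item are still missing here

# Step 2: anyone reporting zero use is set to 0; everyone else keeps
# the value from step 1
marhigh_dic <- ifelse(maruse == 0, 0, step1)
sum(is.na(marhigh_dic))

data.frame(maruse, marhigh, step1, marhigh_dic)
```

In the toy data, the third case (zero use but high once) moves from 1 to 0 in the second step, which is the same kind of change that takes the count of 1s from 583 to 582 in the two frequency tables above.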

2. Check that the data wrangling worked as expected

Remember, anytime you recode and/or manipulate data, you want to check that R did what you wanted it to do. The thing about programming languages like R is that they will do exactly what you tell them to do (or won’t do something because you didn’t speak to them correctly). But what you tell them to do is not always what you expect them to do. So it is crucial to check your data after you have made changes.

First, let’s see that our new factor variables worked how we expected them to. Like in “R Assignment 3,” we’ll use the flat_table() function from the “sjmisc” package:

library(sjmisc)

nys_w6_trim %>%
  flat_table(marpeer_fct, marpeer, show.values = TRUE)

##                  marpeer [1] None of them [2] Very few of them [3] Some of them [4] Most of them [5] All of them
## marpeer_fct                                                                                                     
## None of them                          468                    0                0                0               0
## Very few of them                        0                  318                0                0               0
## Some of them                            0                    0              349                0               0
## Most of them                            0                    0                0              196               0
## All of them                             0                    0                0                0             133

nys_w6_trim %>%
  flat_table(mardef_fct, mardef, show.values = TRUE)

##                    mardef [1] Not wrong [2] A little bit wrong [3] Wrong [4] Very wrong
## mardef_fct                                                                             
## Not wrong                           215                      0         0              0
## A little bit wrong                    0                    381         0              0
## Wrong                                 0                      0       473              0
## Very wrong                            0                      0         0            427

Notice that we included the show.values = TRUE option in the flat_table() function. This is because the flat_table() function automatically includes the value labels. By including the show.values = TRUE option, it also places the actual value in the data set in brackets next to the value labels. Notice how the new factor variable “mardef_fct” doesn’t have any bracketed values? That’s because by creating it as a factor variable, R now recognizes it as a true categorical variable without a real numerical value. Of course, R still has the levels ordered to represent the numerical values of the original variable. Thus, if you include the factor variables in a summary statistics table as before, it will give you the same values.

Also note that if you want to check the levels of a factor, you can use the base R command levels() and specify the variable from a specific data frame by using the $ operator.
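As a minimal sketch of what levels() returns, the toy factor below is built in base R from made-up codes using the “marpeer” value labels shown earlier. The object names are assumptions for illustration; in the assignment you would call levels(nys_w6_trim$marpeer_fct) instead.

```r
# Value labels as reported by get_labels() for the friends' use item
peer_labels <- c("None of them", "Very few of them", "Some of them",
                 "Most of them", "All of them")

# Toy numeric codes (hypothetical), converted to a factor with ordered levels
codes <- c(4, 1, 3, 3, 2, 5)
marpeer_fct <- factor(codes, levels = 1:5, labels = peer_labels)

levels(marpeer_fct)       # the categories, in their original order
as.integer(marpeer_fct)   # the underlying level positions match the codes
```

Because the levels keep the original order, the integer positions of the factor line up with the original numeric codes.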

nys_w6_trim %>%
  descr(marpeer, marpeer_fct, mardef, mardef_fct)

## 
## ## Basic descriptive statistics
## 
##          var        type                         label    n NA.prc mean   sd
##      marpeer     numeric Y6-367:FRIENDS-USED MARIJUANA 1464  15.13 2.46 1.30
##  marpeer_fct categorical Y6-367:FRIENDS-USED MARIJUANA 1464  15.13 2.46 1.30
##       mardef     numeric          Y6-352:USE MARIJUANA 1496  13.28 2.74 1.02
##   mardef_fct categorical          Y6-352:USE MARIJUANA 1496  13.28 2.74 1.02
##    se md trimmed   range iqr  skew
##  0.03  2    2.46 4 (1-5)   2  0.45
##  0.03  2    2.46 4 (1-5)   2  0.45
##  0.03  3    2.74 3 (1-4)   2 -0.27
##  0.03  3    2.74 3 (1-4)   2 -0.27

For our dummy variables, we simply need to check that they are coded as we expected (i.e. the answer categories we thought we assigned to 1 and 0 are actually assigned to those values) and that they are mutually exclusive (related dummy variables do not overlap). To do that, we can look at cross-tabulations (i.e. “crosstabs”) of each dummy variable with the variable from which it was created:

nys_w6_trim %>%
  flat_table(mardef, mardef_neg, show.values = TRUE)

##                        mardef_neg   0   1
## mardef                                   
## [1] Not wrong                     215   0
## [2] A little bit wrong              0 381
## [3] Wrong                           0 473
## [4] Very wrong                      0 427

nys_w6_trim %>%
  flat_table(mardef, mardef_pos, show.values = TRUE)

##                        mardef_pos   0   1
## mardef                                   
## [1] Not wrong                       0 215
## [2] A little bit wrong            381   0
## [3] Wrong                         473   0
## [4] Very wrong                    427   0

nys_w6_trim %>%
  flat_table(mardef, mardef_neut, show.values = TRUE)

##                        mardef_neut   0   1
## mardef                                    
## [1] Not wrong                      215   0
## [2] A little bit wrong               0 381
## [3] Wrong                          473   0
## [4] Very wrong                     427   0

nys_w6_trim %>%
  flat_table(mardef, mardef_negneut, show.values = TRUE)

##                        mardef_negneut   0   1
## mardef                                       
## [1] Not wrong                         215   0
## [2] A little bit wrong                381   0
## [3] Wrong                               0 473
## [4] Very wrong                          0 427

Our dummy coding seemed to work as expected and, as long as you are looking within a related set of dummy variables, the categories are mutually exclusive. That is, observations are mutually exclusive within the “mardef_pos” and “mardef_neg” set of dummy variables and within the “mardef_pos”, “mardef_neut”, and “mardef_negneut” set of dummy variables.

Now let’s check the marijuana use and getting high from marijuana items. We created a dummy variable (“marhigh_dic”) to be congruent with Orcutt’s (1987) coding decision. Specifically, if our recode logic worked, we should have everyone reporting getting “high from marijuana” one or more times in the past year coded as 1 and those who did not use marijuana or did not get high from marijuana in the past year coded as zero. Those who were missing on the use question should also be missing on our newly created variable.

Let’s start with simply looking at their frequencies:

nys_w6_trim %>%
  frq(maruse_dic, marhigh_dic)

## maruse_dic <numeric> 
## # total N=1725 valid N=1496 mean=0.43 sd=0.50
## 
## Value |   N | Raw % | Valid % | Cum. %
## --------------------------------------
##     0 | 847 | 49.10 |   56.62 |  56.62
##     1 | 649 | 37.62 |   43.38 | 100.00
##  <NA> | 229 | 13.28 |    <NA> |   <NA>
## 
## marhigh_dic <numeric> 
## # total N=1725 valid N=1493 mean=0.39 sd=0.49
## 
## Value |   N | Raw % | Valid % | Cum. %
## --------------------------------------
##     0 | 911 | 52.81 |   61.02 |  61.02
##     1 | 582 | 33.74 |   38.98 | 100.00
##  <NA> | 232 | 13.45 |    <NA> |   <NA>

The first thing that jumps out is that the “marhigh_dic” variable has three additional responses that are missing. This means that three people who answered the question about using marijuana in the past year did not answer the question about getting high in the past year. But this poses a puzzle. Recall that the number of valid cases on the getting high item (V966) was exactly the number of valid cases on the use item minus the number of respondents who answered they had used marijuana “zero” times in the past year. What gives?
- Let’s try to find out by looking at the crosstab between “maruse_dic” and “marhigh_dic” but with missing values included. To do this we’ll have to create the variables with explicit missing values.

nys_w6_trim %>%
  mutate(maruse_na = ifelse(is.na(maruse), -1, maruse_dic), 
         marhigh_na = ifelse(is.na(marhigh), -1, marhigh_dic)) %>% 
  flat_table(maruse_na, marhigh_na, exclude = NULL)

##           marhigh_na  -1   0   1
## maruse_na                       
## -1                   229   0   0
## 0                    844   3   0
## 1                      3  64 582

First, notice in the ifelse commands above, I told R to create two new variables that equal -1 if “maruse” (or “marhigh”) is missing (is.na checks whether the item is missing for a specific observation) and the value of the dichotomous variable we created otherwise.
Second, notice that the cell corresponding to missing on both “maruse_na” and “marhigh_na” (i.e. both are -1) has 229 observations in it. This makes sense as it suggests the same respondents who did not answer the question about marijuana use also didn’t answer the question about getting high (this is probably largely a result of attrition between survey waves - i.e. they were not able to follow-up with most of these respondents).
Third, notice the cell corresponding to 1 on the “maruse_na” variable and -1 on the “marhigh_na” variable has 3 observations. This is the source of the difference in missing cases we noticed earlier when looking at the frequencies of our “maruse_dic” and “marhigh_dic” variables. Three respondents who answered that they used marijuana one or more times in the past year did not answer the question about how many times they got high from using marijuana in the past year.
Finally, notice the cell corresponding to those with zero on “maruse_na” and zero on “marhigh_na.” This indicates that three people who answered they had not used marijuana in the past year were also asked about whether they got high, a violation of the skip pattern. This is why we could have 3 additional missing values on the getting high question but still have the math work that we discussed earlier (i.e., if you subtract the number answering zero on the use question, 847, from the total number of valid cases, 1,496, you get 649, the number of valid cases on the getting high question). Essentially, we lost three who didn’t answer the getting high question but should have and gained three who answered the getting high question but shouldn’t have.
- The crucial question for us, though, is whether we correctly classified those 3 subjects who answered “zero” to the number of times they used marijuana but also answered the getting high question. Based on our current dichotomous coding strategy, they are coded as “zero” on the “marhigh_dic” variable. This is because in our nested ifelse commands, we told R to code all subjects who answered “zero” on the marijuana use question to be zero on the “marhigh_dic” variable, and we did this after telling R to code those who responded 1 or more on the getting high question as 1. Essentially, if any of these three respondents answered “zero” to using marijuana but answered 1 or more to getting high, they are coded as zero in our “marhigh_dic” variable. This sounds nonsensical, but it is a possibility for these three subjects.
- Let’s look at a crosstab for our “maruse” and “marhigh” variables to see if this happened. Before we do this, we will create a truncated variable that gives everyone who answers 10 or above on these questions a value of 10. This will just make it easier to see the crosstab.

nys_w6_trim %>%
  mutate(maruse_trunc = ifelse(maruse >= 10, 10, maruse), 
         marhigh_trunc = ifelse(marhigh >= 10, 10, marhigh)) %>% 
  flat_table(maruse_trunc, marhigh_trunc, exclude = NULL)

##              marhigh_trunc   0   1   2   3   4   5   6   7   8   9  10
## maruse_trunc                                                          
## 0                            2   1   0   0   0   0   0   0   0   0   0
## 1                           29  39   3   0   0   0   0   0   0   0   0
## 2                           14   8  48   2   0   0   0   0   1   0   1
## 3                            8   8   6  17   1   1   0   0   0   0   0
## 4                            2   4   5   2  17   1   0   0   0   0   0
## 5                            3   0   3   2   4  16   5   1   0   0   0
## 6                            1   0   0   2   1   4   6   1   0   0   1
## 7                            0   0   0   1   0   1   1   2   0   0   0
## 8                            0   0   0   1   1   1   0   0   1   0   0
## 9                            0   0   0   0   0   0   0   0   0   0   1
## 10                           7   4  12   6  11   6   8   4   3   1 307

First, notice in the ifelse commands above, I told R to create two new variables that equal 10 if “maruse” (or “marhigh”) is greater than or equal to 10 and equal their existing value otherwise.
Second, in terms of the reason we wanted to look at this crosstab, notice that the three people who answered zero to the question about use are in the top row of the crosstab. Two of them answered “zero”, as expected, to the number of times they had gotten high on marijuana. However, one of them had reported zero use but also reported getting high on marijuana one time. Also notice that there are multiple people who report getting high on marijuana more than they report using marijuana (just look at the values above the diagonal in the crosstab, these are all respondents who report getting high on marijuana more times than they report using marijuana). What gives?
Some of this could simply be measurement error, perhaps resulting from respondents not remembering exactly how many times they have used and gotten high from marijuana. It could also reflect differences in how each question was specifically asked. Recall that for the marijuana use question respondents were asked specifically about using “marijuana and hashish” whereas with the getting high question they were only asked specifically about “marijuana.” But this type of difference in interpretation would seem to explain only the values below the diagonal in the crosstab.
- i.e., respondents who count both marijuana and hashish in answering about use but only count marijuana in answering about getting high would likely report getting high less than they report use.
Of course, these discrepancies between use and getting high could also reflect real differences in behavior and interpretation of the questions. For example, perhaps respondents think of “smoking marijuana” when they think of use, but think of multiple things like “edibles”, “contact high”, etc. when they think of how many times they have “been high on marijuana.”
You know how Ritchie (2020) keeps saying data in the real world are messy? This is what he is talking about! Ultimately, we don’t know why these discrepancies exist. But, for our conceptual replication purposes, Orcutt (1987) was primarily interested in competent use. Thus, if respondents don’t report using marijuana but report getting high (e.g. from edibles, contact highs, etc.) it would seem to not reflect competent use as Orcutt (1987) intended. Thus, we are comfortable with this one individual being coded as zero.
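The consistency checks in this part can also be sketched in base R. This is an alternative illustration on made-up values, not the approach used above: pmin() caps values at 10 in the same way as ifelse(x >= 10, 10, x), table() with useNA = "ifany" keeps missing values visible without a -1 placeholder, and which() lists the cases that report getting high more often than using.

```r
# Toy data (hypothetical values, not NYS respondents)
maruse  <- c(0, 1, 2, 15, 300, 0, NA)
marhigh <- c(1, 0, 2, 12, 300, NA, NA)

# Cap both items at 10 to keep the crosstab readable; NA stays NA
maruse_trunc  <- pmin(maruse, 10)
marhigh_trunc <- pmin(marhigh, 10)

# Crosstab that keeps missing values as their own row and column
table(maruse_trunc, marhigh_trunc, useNA = "ifany")

# Positions of respondents who report getting high more often than using
# (compared on the original, untruncated values)
more_high_than_use <- which(marhigh > maruse)
more_high_than_use
```

A list of flagged cases like this makes it easier to inspect each discrepancy individually before deciding how to code it.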

Part 4 (Assignment 4.4)

Goal: Replicate Descriptive Statistics (Table 1) from Orcutt (1987)

We are now ready to actually replicate Orcutt’s descriptive analyses, but we think it’s important to take a moment and recognize all the work we had to do to get to this point. In addition to the conceptual work, we also had to spend a fair amount of time wrangling the data and checking to make sure our operations worked how we intended. This is completely normal. When we are working with data, it is common for the bulk of that work to be taken up with these data management and data wrangling tasks (see this blog for a review).

The data wrangling process is also where you see the “Garden of Forking Paths” in a study begin to emerge. Think about all of the different decisions we had to make when wrangling the data above. Do you buy our justifications for these decisions? Do you think someone else, given the same task, could have reasonably made different decisions? This is why creating a reproducible workflow, sharing data, and generally adhering to open-science practices is so important. Other scholars can check and criticize our decisions, and the relevant community of scientists has the potential to come to some form of intersubjective consensus.

1. Summarize Data

Given we have our data in order, the next step is to summarize it similar to Orcutt.

First, let’s look at Table 1 (pg. 348):

Orcutt is simply presenting the means and standard deviations (in parentheses) for each of the key variables in his analyses for both samples separately and for the combined sample. In order to replicate this with wave 6 of the NYS, we first need to calculate these summary statistics for each of the key variables we wrangled above. Fortunately, the “sjmisc” package basically does this for you with the descr function. All you need to do is assign the output of the descr function to an object and it will create a dataframe of summary statistics.
- Let’s take a look at what this would look like. We’ll only do it for the key variables needed to replicate Orcutt’s table above. Specifically, “marhigh_dic,” “marpeer_fct,” and “mardef_pos.” To do this we’ll use the select command from “dplyr.” We’ll also use the drop_na command like we did in the last assignment to perform listwise deletion for missing values across these three variables (i.e., only include respondents who have complete data on all three variables).

tb1_sumstat <- nys_w6_trim %>%
  select(marhigh_dic, marpeer_fct, mardef_pos) %>%
  drop_na() %>%
  sjmisc::descr(marhigh_dic, marpeer_fct, mardef_pos)
tb1_sumstat

## 
## ## Basic descriptive statistics
## 
##          var        type                         label    n NA.prc mean   sd
##  marhigh_dic     numeric                   marhigh_dic 1461      0 0.39 0.49
##  marpeer_fct categorical Y6-367:FRIENDS-USED MARIJUANA 1461      0 2.46 1.30
##   mardef_pos     numeric                    mardef_pos 1461      0 0.14 0.35
##    se md trimmed   range iqr skew
##  0.01  0    0.37 1 (0-1)   1 0.43
##  0.03  2    2.33 4 (1-5)   2 0.46
##  0.01  0    0.05 1 (0-1)   0 2.04

First, notice that you now have a dataframe in your Global Environment called “tb1_sumstat” that includes three observations and 13 variables. This is a summary data set where each variable is an observation and each characteristic (e.g. type and label) and each summary statistic (e.g., mean and sd) are variables.
Second, notice that this data set includes the specific information we need to replicate Orcutt’s (1987) Table 1 (i.e. the mean and standard deviation for each variable). The problem, of course, is that this data set also includes a lot of other information that we don’t necessarily want to report, even though it may be informative.
- What is cool about the “sjmisc” package essentially creating dataframes as its output is that, by assigning it to an object, you can wrangle the information in it just like you did above with the full data set. For our purposes here, we really just need three of the “variables” in the data set: “var,” “mean,” and “sd.” Of course, we may want to keep other information stored as variables (e.g., labels) and perhaps add some information as well (e.g., an NYS indicator). Let’s go ahead and use the “dplyr” functions you are now familiar with to wrangle this summary data into the form we want to use to create a table like Orcutt’s.

tb1_sumstat <- tb1_sumstat %>%
  select(var, label, n, mean, sd) %>%
  mutate(sample = "NYS Wave 6",
         label = ifelse(var == "marhigh_dic", "Competent User (1 = User)", label),
         label = ifelse(var == "marpeer_fct", "Friends' Use (1 = None of them - 5 = All of them)", label),
         label = ifelse(var == "mardef_pos", "Subjective Definition (1 = Positive Definition)", label),
         label = as_factor(label))
tb1_sumstat

## 
## ## Basic descriptive statistics
## 
##          var                                             label    n mean   sd
##  marhigh_dic                         Competent User (1 = User) 1461 0.39 0.49
##  marpeer_fct Friends' Use (1 = None of them - 5 = All of them) 1461 2.46 1.30
##   mardef_pos   Subjective Definition (1 = Positive Definition) 1461 0.14 0.35
##      sample
##  NYS Wave 6
##  NYS Wave 6
##  NYS Wave 6

In the above code, we simply told R to overwrite the “tb1_sumstat” dataframe that we created above and select five specific variables (“var,” “label,” “n,” “mean,” and “sd”) using the select command. Then, using the mutate command, we told R to create a new variable called “sample” to indicate these data were from the “NYS Wave 6” data and recode the “label” variable to have more informative labels. We also told R to make the label variable a factor variable (this may come in handy later). R automatically assigned the labels to levels of the factor based on the order they appear in the data. If you want to see the levels of a factor variable, simply type levels(tb1_sumstat$label) into a code chunk or the console window.
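How the levels of a factor get ordered depends on the function used to create it. The sketch below uses base R only and made-up labels (deliberately not in alphabetical order) to show the two common behaviors. It assumes that as_factor, when given text values, orders levels by first appearance, as the “forcats” version does; base R’s factor() sorts them alphabetically instead. The three labels in our summary table happen to already be in alphabetical order, so both approaches would give the same result there.

```r
# Toy labels, deliberately NOT in alphabetical order (illustrative values only)
x <- c("Friends' Use", "Competent User", "Subjective Definition")

# Base R default: levels are sorted alphabetically
alphabetical <- factor(x)

# Levels kept in order of first appearance (assumed to mirror as_factor on text)
by_appearance <- factor(x, levels = unique(x))

levels(alphabetical)
levels(by_appearance)
```

The level order matters later because functions such as arrange() sort a factor by its levels rather than by the text of its labels.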

2. Replicate Orcutt’s Table 1 with NYS Wave 6 Data

We actually can produce a simple table that replicates the information in Orcutt’s with our current setup. We simply have to tell R which columns in our tb1_sumstat object to show and we can do that with the select command.

tb1_sumstat %>%
  select(label, mean, sd) %>%
  mutate(mean = round(mean, digits = 3),
         sd = round(sd, digits = 3)) %>%
  rename(Variable = label,
         Mean = mean,
         SD = sd)

## 
## ## Basic descriptive statistics
## 
##                                           Variable  Mean    SD
##                          Competent User (1 = User) 0.394 0.489
##  Friends' Use (1 = None of them - 5 = All of them) 2.457 1.304
##    Subjective Definition (1 = Positive Definition) 0.143 0.350

Note: the code above should be relatively self-explanatory at this point except what we put in the mutate function. The data included seven decimal places for the mean and sd. The mutate function in the code above simply tells R to write over these variables with the same variable rounded to three decimal places using the round() function.
At this point we have performed a conceptual replication of Orcutt’s (1987) Table 1. All the information is technically in the above table. However, it’s an ugly table that would not look great in a presentation or publication. Also, if we were presenting or publishing this, we may also want to place our results next to Orcutt’s in order to compare the results, especially amongst the variables that are measured most similarly.
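One detail about the rounding step is worth seeing in isolation: round() changes the stored numbers, whereas formatting functions only change how numbers are displayed. A minimal base R sketch, using the three standard deviations from our summary table (the object names sd_raw, sd_rounded, and sd_shown are made up for this illustration):

```r
# Standard deviations as stored in the summary data (seven decimal places)
sd_raw <- c(0.4888564, 1.3039692, 0.3502465)

# round() returns new numeric values; the extra decimals are gone for good
sd_rounded <- round(sd_raw, digits = 3)

# sprintf() returns text for display and leaves sd_raw untouched;
# it also keeps the trailing zero in "0.350"
sd_shown <- sprintf("%.3f", sd_raw)

sd_rounded
sd_shown
```

If you plan to do further calculations with the summary statistics, it is usually safer to keep the unrounded values and only format them when the table is displayed.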

3. Make the table more visually appealing

In order to make the table more presentable, we’re going to use the “gt” package.

The “gt” package was built by people at RStudio and the basic idea is it allows you to take anything formatted as a data table (e.g. a dataframe or a tidyverse tibble) and create a table using its built in table elements and formatting options. Ultimately, the table created with the “gt” package can be rendered in html within an RMarkdown document (see an introductory video here).
- Note: Being able to create publication-/presentation-ready tables is something that I (Jake) have never been able to figure out entirely within R before. This poses problems for reproducibility as my table workflow would include copying text from R to Excel, formatting the numbers in Excel, and then copying that text to a table in Microsoft Word. That is at least three steps where something could go wrong. So figuring out how to create tables entirely within RStudio has been a goal of mine for a while. Indeed, if you look at the bottom of the “gt” package website, it includes links to other packages that can be used to create tables. I counted 15 packages in the list. I’ve probably tried to use most of them at one point or another and have never been happy with the process, customization options, and/or the resulting table. This is my first time using the “gt” package and I’m hoping, given its association with RStudio, that it effectively balances ease of use with the ability to customize and create nice tables that effectively present the information we are trying to convey.
Given that the descr function in the “sjmisc” package already creates a data table with the information we wanted, we can easily create a simple gt table from the tb1_sumstat object we created above using the gt() function.

library(gt)
tb1_sumstat %>%
  gt()

var	label	n	mean	sd	sample
marhigh_dic	Competent User (1 = User)	1461	0.3942505	0.4888564	NYS Wave 6
marpeer_fct	Friends' Use (1 = None of them - 5 = All of them)	1461	2.4565366	1.3039692	NYS Wave 6
mardef_pos	Subjective Definition (1 = Positive Definition)	1461	0.1430527	0.3502465	NYS Wave 6

Note: We did not overwrite the object when we produced our basic table above, so the table produced by the gt() function has three additional columns (“var”, “n”, and “sample”) and includes the mean and sd measured to seven decimal places. The cool thing about the “gt” package is that we should be able to customize the appearance of this table entirely with functions available within the “gt” package.
The table already looks visually better than the one above, but we ultimately want to 1) hide the columns we don’t need, 2) format the specific style of text and numbers that are displayed (e.g., column headings, text alignment, and number of decimal places displayed), and 3) add a title and caption to the table. So let’s take these in turn and see what “gt” can really do!

Hide columns we don’t need

We don’t want the “var”, “n”, and “sample” columns presented in our table (although some of the information may be useful to include in a caption for the table). To do this, we simply use the cols_hide function that is built into the “gt” package.

library(gt)
tb1_sumstat %>%
  gt() %>%
  cols_hide(columns = c(var, n, sample))

label	mean	sd
Competent User (1 = User)	0.3942505	0.4888564
Friends' Use (1 = None of them - 5 = All of them)	2.4565366	1.3039692
Subjective Definition (1 = Positive Definition)	0.1430527	0.3502465

Note: While the table only includes three columns, we did not do anything to the underlying object, tb1_sumstat. It still has all six columns in it; we just used the cols_hide command to tell R not to display them.

Format text and number style

While the table is starting to look visually better than the plain text one we created earlier, there are still some things that don’t look correct. Specifically, the column labels are all lowercase and not the correct text (e.g. “label” vs. “Variable” in the first column), the column with the variable descriptions is center aligned, and it’s showing seven decimal places for the mean and standard deviation statistics we are displaying.

tb1_sumstat %>%
  gt() %>%
  cols_hide(columns = c(var, n, sample)) %>%
  cols_label(
    label = "Variable",
    mean = "Mean",
    sd = "SD") %>%
  cols_align(
    align = "left",
    columns = label) %>%
  fmt_number(
    columns = c(mean, sd),
    decimals = 3)

Variable	Mean	SD
Competent User (1 = User)	0.394	0.489
Friends' Use (1 = None of them - 5 = All of them)	2.457	1.304
Subjective Definition (1 = Positive Definition)	0.143	0.350

- Note: You can find information on all of the customization options at the “Reference” website for the “gt” package. It’s important to recognize, again, that in the above code we are simply manipulating the appearance of the columns and table elements; we are not changing the underlying data frame.

The last thing we want to do is add a title to the table and a note at the bottom indicating the sample size and anything else we may want to include.

tb1_gtsumstat <- tb1_sumstat %>%
  gt() %>%
  cols_hide(columns = c(var, n, sample)) %>%
  cols_label(
    label = "Variable",
    mean = "Mean",
    sd = "SD") %>%
  cols_align(
    align = "left",
    columns = label) %>%
  fmt_number(
    columns = c(mean, sd),
    decimals = 3) %>%
  tab_spanner(
    label = "NYS Wave 6",
    id = "nys",
    columns = c(mean, sd)) %>%
  tab_header(
    title = md("**Table 1: Variable Means and Standard Deviations**")) %>%
  tab_footnote(
    footnote = md("*n = 1,461*"),
    locations = cells_column_spanners(
      spanners = "nys"))
tb1_gtsumstat

Table 1: Variable Means and Standard Deviations
	NYS Wave 6¹
Variable	Mean	SD
Competent User (1 = User)	0.394	0.489
Friends' Use (1 = None of them - 5 = All of them)	2.457	1.304
Subjective Definition (1 = Positive Definition)	0.143	0.350
¹ n = 1,461

- Note: In the above table, we grouped the “Mean” and “SD” columns under “NYS Wave 6” using the tab_spanner() function, added a title to the table with the tab_header() function, and added a footnote referencing the sample size of NYS Wave 6 using the tab_footnote() function (if we simply wanted to add a note to the end of our table without reference to an object in the table, we would have used the tab_source_note() function).

- Note: Wrapping the text we add with a given function in the md() command allows markdown syntax to be used. That’s why the title shows up as bold when we added “**” to both sides of the title within the parentheses under the tab_header() function.

We could continue to modify the table above to get it exactly how we wanted (e.g., adjust font size, remove horizontal lines between variables, etc.), but this is good enough for our current purposes.

4. Compare the Results to Orcutt’s

Given we tried to perform a conceptual replication of Orcutt’s (1987) study, it would be good to see our results side-by-side with Orcutt’s. To do this, we created the table below. It required some more data wrangling and some data entry (i.e. we had to enter the values of Orcutt’s table directly).

Note: We are showing you the code in case you are curious about how the table was created. We DO NOT expect you to type all of this in yourself nor do we expect you to recreate this table.

#Enter data from Orcutt's Table 1:
cols <- c("Variable", "Minnesota", "FSU", "Combined")
label <- c("Competent User (1 = User)", "Friends' Use (0 - 4)", "Subjective Definition (1 = Neutral)", "Subjective Definition (1 = Positive)")
min_mean <- c(.345, 1.230, .155, .372)
min_sd <- c(.476, 1.423, .363, .484)
fsu_mean <- c(.475, 1.803, .121, .497)
#NOTE: the fsu_sd values below are identical to comb_mean, which looks like a
#copy-paste slip; check them against Orcutt's (1987) Table 1 before relying on them
fsu_sd <- c(.416, 1.545, .137, .441)
comb_mean <- c(.416, 1.545, .137, .441)
comb_sd <- c(.493, 1.530, .344, .497)

#Create data frame of Orcutt's Table 1
orc_tb1 <- as.data.frame(cbind(label, min_mean, min_sd, fsu_mean, fsu_sd, comb_mean, comb_sd))

#Merge Orcutt data with NYS summary data
nys_orc_tb1 <- tb1_sumstat %>%
  rename(nys_mean = mean,
         nys_sd = sd) %>%
  mutate(label = fct_recode(label, "Subjective Definition (1 = Positive)" = "Subjective Definition (1 = Positive Definition)")) %>%
  full_join(orc_tb1, by = "label") %>%
  mutate(label = as_factor(label),
         nys_mean = as_numeric(nys_mean),
         nys_sd = as_numeric(nys_sd),
         min_mean = as_numeric(min_mean),
         min_sd = as_numeric(min_sd),
         fsu_mean = as_numeric(fsu_mean),
         fsu_sd = as_numeric(fsu_sd),
         comb_mean = as_numeric(comb_mean),
         comb_sd = as_numeric(comb_sd)) %>%
  arrange(label)

#Create and Refine Table
tb1_nysorc_comb <- nys_orc_tb1 %>%
  gt() %>%
  cols_hide(columns = c(var, n, sample)) %>%
  cols_label(
    label = "Variable",
    nys_mean = "Mean",
    nys_sd = "SD",
    min_mean = "Mean",
    min_sd = "SD",
    fsu_mean = "Mean",
    fsu_sd = "SD",
    comb_mean = "Mean",
    comb_sd = "SD") %>%
  cols_align(
    align = "left",
    columns = label) %>%
  fmt_number(
    columns = c(nys_mean, nys_sd, min_mean, min_sd, fsu_mean, fsu_sd, comb_mean, comb_sd),
    decimals = 3) %>%
  tab_spanner(
    label = "NYS Wave 6",
    id = "nys",
    columns = c(nys_mean, nys_sd)) %>%
  tab_spanner(
    label = "Orcutt - Minn.",
    id = "minn",
    columns = c(min_mean, min_sd)) %>%
  tab_spanner(
    label = "Orcutt - FSU",
    id = "fsu",
    columns = c(fsu_mean, fsu_sd)) %>%
  tab_spanner(
    label = "Orcutt - Combined",
    id = "comb",
    columns = c(comb_mean, comb_sd)) %>%
  tab_header(
    title = md("**Table 1: Variable Means and Standard Deviations**")) %>%
  tab_footnote(
    footnote = md("*n = 1,461*"),
    locations = cells_column_spanners(
      spanners = "nys")) %>%
  tab_footnote(
    footnote = md("*n = 444*"),
    locations = cells_column_spanners(
      spanners = "minn")) %>%
  tab_footnote(
    footnote = md("*n = 543*"),
    locations = cells_column_spanners(
      spanners = "fsu")) %>%
  tab_footnote(
    footnote = md("*n = 987*"),
    locations = cells_column_spanners(
      spanners = "comb")) %>%
  fmt_missing(
    columns = everything(),
    missing_text = "---") 
  # tab_options(
  #   footnotes.sep = ", ") #This isn't working; should place footnotes on same line (see: https://github.com/rstudio/gt/issues/833)
tb1_nysorc_comb

Table 1: Variable Means and Standard Deviations
	NYS Wave 6¹		Orcutt - Minn.²		Orcutt - FSU³		Orcutt - Combined⁴
Variable	Mean	SD	Mean	SD	Mean	SD	Mean	SD
Competent User (1 = User)	0.394	0.489	0.345	0.476	0.475	0.416	0.416	0.493
Friends' Use (0 - 4)	—	—	1.230	1.423	1.803	1.545	1.545	1.530
Friends' Use (1 = None of them - 5 = All of them)	2.457	1.304	—	—	—	—	—	—
Subjective Definition (1 = Neutral)	—	—	0.155	0.363	0.121	0.137	0.137	0.344
Subjective Definition (1 = Positive)	0.143	0.350	0.372	0.484	0.497	0.441	0.441	0.497
¹ n = 1,461
² n = 444
³ n = 543
⁴ n = 987
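The dashes in the comparison table come from how the two sets of summary statistics were merged: a full join keeps every label from both tables and leaves the cells with no match missing. Here is a minimal base R sketch of that behavior, using merge(all = TRUE) as a stand-in for the full_join command from “dplyr” and a few values copied from the table above (the object names nys, orc, and joined are made up for this illustration):

```r
# NYS summary rows (values copied from the table above)
nys <- data.frame(
  label = c("Competent User (1 = User)",
            "Friends' Use (1 = None of them - 5 = All of them)"),
  nys_mean = c(0.394, 2.457),
  stringsAsFactors = FALSE
)

# Orcutt's combined-sample rows (values copied from the table above)
orc <- data.frame(
  label = c("Competent User (1 = User)", "Friends' Use (0 - 4)"),
  comb_mean = c(0.416, 1.545),
  stringsAsFactors = FALSE
)

# Full join on label: matching labels share a row, non-matching labels get
# their own row with NA in the other table's columns
joined <- merge(nys, orc, by = "label", all = TRUE)
joined
```

Because the two “Friends’ Use” labels differ, they land on separate rows with missing values on the other side, which is exactly what we want for measures that are not directly comparable.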

Looking across the values in the table above, you can see some similarities and points of departure between the results from the NYS data and Orcutt’s (1987) results.

First, the percentage of “competent” marijuana users in the past year is relatively similar across the samples. This is especially the case for the NYS data and the combined Minnesota and FSU sample from Orcutt’s study: Orcutt’s combined sample has about 2 percentage points more competent marijuana users than the NYS sample (42% vs. 39%). Note that this combined average reflects a combination of Minnesota’s relatively low marijuana use (35%) and FSU’s relatively high marijuana use (48%) when compared to the NYS (39%).
Second, the peer marijuana use variables between the NYS data and Orcutt’s (1987) samples are measured very differently and thus not directly comparable (and why we put them on separate rows in the above table). In order to compare them, you have to use your informed judgement about what the specific values mean. For example, in Orcutt’s study, respondents reported the number of their four closest friends who they thought used marijuana at least once a month. Compare this to the NYS which asked respondents how many of their friends had used marijuana in the past year with categorical answer categories ranging from “None of them” to “All of them.” For Orcutt, the average was about 1.5 close friends who respondents thought were using marijuana at least once a month (closer to 1 for Minnesota and closer to 2 for FSU). Compare this to the NYS data where the average of 2.5 falls between respondents reporting about their friends’ use that “very few of them” and “some of them” used marijuana in the past year. Leaving aside the different time frames and frequency implied by the questions, do you think respondents from Orcutt’s study that reported 1 to 2 close friends using marijuana at least once a month would have reported “very few of them” or “some of them” to the question in the NYS? Perhaps, but on some level, this question is objectively unknowable without a specific study built to test the overlap between these two questions.
Third, and perhaps the clearest difference between the NYS data and Orcutt’s results, is respondents’ “Subjective Definitions” of marijuana use. These differences are likely due, in large part, to the different ways the questions were asked and the coding decisions we made. Recall that the NYS question was a unipolar question about the “wrongness” of marijuana where respondents reported their views on a four-point scale ranging from “Not wrong” to “Very wrong.” Orcutt’s question was bipolar, asking respondents to report their “opinion” about marijuana with answers ranging from “Highly negative” (Minnesota) or “Negative” (FSU) to “Highly positive” (Minnesota) or “Positive” (FSU) with a neutral “no opinion” category in the middle. These questions are not only asking about qualitatively different things (e.g., wrongness vs. valence of opinion), they are also asking respondents to respond in very different ways (e.g., unipolar vs. bipolar).
- Interestingly, the percent reporting “positive” subjective definitions of marijuana in the NYS (i.e. “Not wrong”) is more similar to the percent reporting “neutral” subjective definitions in Orcutt’s study. This may be interesting because both are based on single answer categories in a multiple-answer scale (albeit a four-point vs. five-point scale respectively). Perhaps, for the sake of comparison, comparing the “Wrong” and “Very wrong” categories in the NYS to the “Negative” categories in Orcutt’s study makes the most sense. But this would require some more data wrangling that we will forgo for the moment.
- For now, just look at the frequency distribution for the NYS subjective definitions measure, paying close attention to the percentage of respondents in the “Wrong” and “Very wrong” categories.

nys_w6_trim %>%
  sjmisc::frq(mardef)

## Y6-352:USE MARIJUANA (mardef) <numeric> 
## # total N=1725 valid N=1496 mean=2.74 sd=1.02
## 
## Value |              Label |   N | Raw % | Valid % | Cum. %
## -----------------------------------------------------------
##     1 |          Not wrong | 215 | 12.46 |   14.37 |  14.37
##     2 | A little bit wrong | 381 | 22.09 |   25.47 |  39.84
##     3 |              Wrong | 473 | 27.42 |   31.62 |  71.46
##     4 |         Very wrong | 427 | 24.75 |   28.54 | 100.00
##  <NA> |               <NA> | 229 | 13.28 |    <NA> |   <NA>

The combined valid percentage of those answering that marijuana use is “Wrong” or “Very wrong” totals about 60%. Compare this to the percentage of respondents from Orcutt’s samples who were not in the “Positive” and “Neutral” subjective definition categories (i.e. 1 - (% Positive + % Neutral)). Because these are dummy variables, we can also calculate their standard deviation with the formula:

$$s = \sqrt{\frac{n}{n-1}\,p\,(1-p)}$$

where n = the sample size and p = the proportion of the sample represented by the dummy variable (i.e., its mean).
- Note: Again, we leave the code below in case you are interested in seeing how the sausage is made. We DO NOT expect you to recreate this code yourself.

sample <-  c("NYS", "Minnesota", "FSU", "Combined")
n <- c(1461, 444, 543, 987)
mean <-  c(.602, (1 - (.155 + .372)), (1 - (.121 + .497)), (1 - (.137 + .441)))

neg_def <- cbind.data.frame(sample, n, mean) %>%
  mutate(sd = sqrt((n/(n-1))*(mean*(1 - mean))))

tb1_nysorc_negdef <- gt::gt(neg_def) %>%
  cols_hide(columns = n) %>%
  fmt_number(
    columns = c(mean, sd),
    decimals = 3) %>%
  cols_label(
    sample = "Sample",
    mean = "Mean",
    sd = "SD") %>%
  tab_spanner(
    label = "Subjective Definition (1 = Negative)",
    columns = c(mean, sd))
tb1_nysorc_negdef

	Subjective Definition (1 = Negative)
Sample	Mean	SD
NYS	0.602	0.490
Minnesota	0.473	0.500
FSU	0.382	0.486
Combined	0.422	0.494

As evident in the above table, a larger proportion of respondents in the NYS sample report negative subjective definitions of marijuana than any of the samples in Orcutt’s (1987) study. The absolute difference ranges from 22 percentage points between the NYS and the FSU sample to 13 percentage points between the NYS and Minnesota sample. There is an 18 percentage point difference between the NYS sample and Orcutt’s combined sample. Given the differences in measurement, what do you think might account for these differences?
- Perhaps college students have generally more positive attitudes toward marijuana than a general sample of college-aged persons. Perhaps marijuana was generally viewed differently in 1983 when the NYS was collected than it was in 1972 and 1973 when Orcutt (1987) collected his data. Ultimately, we can’t be completely sure, but you can see how thinking about these things may lead to interesting theoretical and empirical questions.
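If you want to convince yourself that the shortcut for the standard deviation of a dummy variable is right, you can check it against R’s own sd() function. The sketch below uses base R only and the counts from the frequency table above (473 “Wrong” and 427 “Very wrong” out of 1,496 valid responses); the helper name dummy_sd and the other object names are made up for this illustration.

```r
# Counts taken from the frq() output for mardef
n_wrong      <- 473
n_very_wrong <- 427
n_valid      <- 1496

# Proportion with a "negative" definition (Wrong or Very wrong)
p_neg <- (n_wrong + n_very_wrong) / n_valid

# Sample SD of a 0/1 variable from its proportion p and sample size n
dummy_sd <- function(p, n) sqrt((n / (n - 1)) * p * (1 - p))

# Build the equivalent 0/1 vector and compare with sd() directly
x <- c(rep(1, n_wrong + n_very_wrong),
       rep(0, n_valid - n_wrong - n_very_wrong))

p_neg
dummy_sd(p_neg, n_valid)
all.equal(sd(x), dummy_sd(mean(x), length(x)))
```

The two approaches agree because sd() divides the sum of squared deviations by n - 1, and for a 0/1 variable that sum is simply n times p(1 - p).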

Part 5 (Assignment 4.5)

Goal: Replicate Descriptive Statistics (Table 1) from Orcutt (1987) for Earlier Waves of NYS

We have just walked you through a conceptual replication of Orcutt’s (1987) descriptive statistics using wave 6 of the NYS—the wave where NYS respondents were presumably most similar in age to Orcutt’s sample. Substantively, this was an attempt to see if Orcutt’s findings generalized to a nationally representative sample. But another substantive question to address related to Orcutt’s findings is how the results would generalize to different age-groups.

1. Use an earlier wave of NYS to perform your own conceptual replication of Orcutt (1987)

To examine the question of whether Orcutt’s findings generalize to different age groups, one basic thing we can do is perform a similar conceptual replication using earlier waves of the NYS (waves 1 - 5). This is now your task.

Using Jake’s cool code, I (Jon) have randomly assigned each of you to one of the first five waves of the NYS and provided you with a pooled data set called “nys_fwtrim_orc1987” (see assignment on Canvas to download directly and/or for the script used to create the data set). The data set includes the following variables:

variable	question	answers
CASEID	Unique Identifier	NA
wave	Wave of NYS data collection	NA
age	How old are you?	specific age
marpeer	Think of your friends. During the last year how many of them have used marijuana or hashish?	1 = None of them - 5 = All of them
mardef	How wrong is it for someone your age to use marijuana or hashish?	1 = Not wrong - 4 = Very wrong
maruse	In the last year, how often have you used marijuana - hashish ('grass', 'pot', 'hash')?	1 = Never - 9 = 2 to 3 times/day

Note: In the first five waves, the NYS only asked about marijuana use in the last year and not about whether respondents got high from marijuana in the past year.
Note: They also did not ask for a raw count in Waves 1 and 2 and instead asked respondents to answer on a 9-point scale that ranged from 1 = “Never” to 9 = “2 to 3 times per day”. Here is the full range of answers from the Wave 1 codebook:

Note: This question was also included in wave 6 (V891), but it was only asked of those who reported using marijuana 10 or more times in the last year.
Note: Given the first five waves of the NYS did not ask about getting high from marijuana this is another instance where the conceptualization of “marijuana use” in your replication with one of the first five waves of data will be different from Orcutt’s conceptualization. But note that with this measure you are still able to distinguish between those who have used marijuana at all in the past year and those who have not.
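To make the last point concrete, here is a minimal base R sketch of one way to turn the 9-point frequency scale into a used/did-not-use indicator. It assumes, per the variable table above, that 1 = “Never” and any higher value means some use in the past year. The toy values and the name maruse_dic are made up for illustration and are not part of the provided data set.

```r
# Toy responses on the 9-point scale (1 = Never ... 9 = 2 to 3 times/day)
maruse <- c(1, 1, 3, 9, NA, 2, 1)

# 1 = used at least once in the past year, 0 = never used;
# missing responses stay missing
maruse_dic <- ifelse(maruse > 1, 1, 0)

maruse_dic
mean(maruse_dic, na.rm = TRUE)  # proportion of users among valid responses
```

Note that this indicator captures any use, not “getting high,” so it is a looser measure than the one used for wave 6.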

2. NYS Wave Assignments

Here are your wave assignments:

NYS Wave	Students
Wave 1	Student 4	Student 10	Student 3
Wave 2	Student 1	Student 8	NA
Wave 3	Student 9	Student 6	NA
Wave 4	Student 7	Student 2	NA
Wave 5	Student 11	Student 5	NA

Note: Remember, you are only responsible for performing a conceptual replication of Orcutt (1987) using the specific wave to which you are assigned.

3. “Draw the Owl”

You should now have everything that you need to replicate Orcutt’s (1987) Table 1 with the NYS data to which you were assigned. The pooled data set is provided with the assignment on Canvas (along with the R script used to create it). If you have followed along up to this point, you should have all the requisite skills to recreate Table 1 above using the NYS wave to which you were assigned.

All that is left to do is:

Download the pooled datafile from canvas then load the pooled data into R.

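Note: The pooled datafile on Canvas is already saved as an R data object (.rds format). I recommend saving the pooled file in your “NYS” subfolder and then using the load() function as follows (if load() reports a “bad restore file magic number” error, the file was written with saveRDS() and needs to be read with readRDS() and assigned to an object instead):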
load(here('Datasets/NYS', 'nys_fwtrim_orc1987.rds'))

Filter the pooled data to only include the wave you were assigned (see “R Assignment 3” if you forgot how to do this).
Recode the data in your wave to match the logic of Orcutt and what we did with wave 6 above.
Create a table using the “gt” package that summarizes the data similar to Orcutt and what we did above with wave 6.
Explain how the results of your conceptual replication are similar to or different from Orcutt’s analysis and how they are similar or different to the conceptual replication presented above for wave 6 of the NYS.
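As a reminder of the mechanics of the first two steps, here is a minimal base R sketch on a tiny made-up data set. It is not the solution for your wave. It assumes the file was written with saveRDS(), in which case readRDS() is the matching reader (a file written with save() is read with load() instead). The objects toy_pooled, my_wave, and nys_mywave are hypothetical names.

```r
# Tiny stand-in for the pooled file (made-up values, same column names)
toy_pooled <- data.frame(
  CASEID = c(1, 2, 3, 1, 2, 3),
  wave   = c(1, 1, 1, 2, 2, 2),
  maruse = c(1, 2, 1, 1, 4, NA)
)

# Write and read back an .rds file; readRDS() returns the object,
# so it must be assigned to a name
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy_pooled, rds_path)
pooled <- readRDS(rds_path)

# Keep only the rows for one wave
# (dplyr equivalent: pooled %>% filter(wave == my_wave))
my_wave <- 2
nys_mywave <- pooled[pooled$wave == my_wave, ]

nrow(nys_mywave)
```

From here, the recoding and table steps follow the same logic we used for wave 6.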

Remember:

Keep the file clean and easy to follow by using RMD level headings (e.g., denoted with ## or ###) separating R code chunks, organized by assignment parts or questions.
Write plain text after headings and before or after code chunks to explain what you are doing. This is not just for my sake - such text will serve as useful reminders to you when working on later assignments!
Upon completing the assignment, “knit” your final RMD file again and save the final knitted html document to your “Assignments” folder in your LastName_P680_work folder as: LastName_P680_Assign4_YEAR_MO_DY.
Inside the “LastName_P680_commit” folder in our shared folder, create another folder named: Assignment 4.
To submit your assignment for grading, save copies of both your (1) final knitted “Assign4” html file and (2) your “Assign4_RMD” file into the “LastName_P680_commit / Assignment 4” folder.
- Remember, be sure to save copies of both files - do not just drag the files over from your “work” folder, or you may lose those original copies from your “work” folder.