Background and Rationale

The purpose of this assignment is to perform a conceptual replication of some observations in Orcutt’s (1987) paper in Criminology titled: “Differential Association and Marijuana Use: A Closer Look at Sutherland (with a Little Help from Becker).” Since Orcutt’s (1987) original data are unavailable, we will assess whether some of his findings can be repeated with, and generalize to, a similar sample in the NYS data. Recall that you used a pooled version of the first five waves of the NYS data to conduct reproduction research in the previous assignment.

Reproduction, Replication, and their Varieties

Notice that we are drawing a distinction between “reproducibility” and “replicability” and, likewise, between reproduction and replication research. Recall, in R Assignment 3, we conducted a reproduction of part of Figure 1 from Mark Warr’s (1993) classic paper on age and peers using the exact same data as Warr. In a reproduction, the goal is to verify or repeat exactly some or all of the findings reported in a previous study using identical data and methods as the original study. Unfortunately, the terminology surrounding reproduction and replication is inconsistent and confusing. For example, some use the term “pure replication” to refer to what we call reproduction research (e.g., see Freese and Peterson 2017, pp.152-3). We note that our distinctions are consistent with those used in Ritchie’s (2020) book (which you are currently reading) and with others’ recent attempts to clarify terminology in this space (e.g., Patil, Peng, & Leek 2019).

In addition to distinguishing between reproducibility and replicability, we might also draw distinctions between different types of replications. Perhaps the most common is the distinction between a direct replication and a conceptual replication (cf. Crandall and Sherman 2016; Pridemore, Makel, & Plucker 2018, p.21). In a direct replication, one assesses the same theoretical or observational claim of a study using new data and measures that are collected or designed in such a way as to match the prior study’s design as exactly as possible, though perhaps with some notable exceptions (e.g., a larger sample size to improve statistical inferences). In contrast, a conceptual replication assesses the same theoretical or observational claim as a previous study using new data and/or new measurement procedures that are conceptually similar but not identical to those used in the previous study. We recommend reading Crandall and Sherman’s (2016) detailed discussion of these distinctions and their case for the relative utility of conceptual replications in advancing scientific progress; see also Nosek and Errington’s (2020) critique of these distinctions.

Underlying many of these terminological distinctions are differences in research procedures and research aims. For instance, drawing on Freese and Peterson’s (2017) typology of the different aims involved in replication and reproducibility research, reproduction research often aims to assess verifiability by attempting to reproduce or verify an original study’s findings using the same data and methods (e.g., code). Direct replications typically assess repeatability by testing whether the same findings emerge or repeat when applying the same methods to a new sample. Conceptual replications often assess repeatability, robustness, and/or generalizability of a theoretical or observational claim by, for instance, testing the original claim’s robustness to different measurement specifications using the same data or testing the claim’s generality to new samples (e.g., different groups or contexts). We recommend reading Freese and Peterson’s in-depth discussion of these aims; for convenience, we include their definitions (see 2017, p.152) of these four aims below.

  • Tests of verifiability: “taking the results of an original study as the object of inquiry and asks limited questions regarding whether the same results are obtained by doing the same analyses on the same data.”
  • Tests of robustness: “conduct a reanalysis on the original data using alternative specifications to see if the target finding is merely the result of analytic decisions.”
  • Tests of repeatability: “collecting new data to determine whether key results of a study can be observed by using the original procedures.”
  • Tests of generalizability: “the original study provides a premise for research trying to evaluate whether similar findings may be observed consistently across different methods or settings.”

Assumptions & Ground Rules

To accomplish a conceptual replication, we will need to do and learn the following:

  1. Identify some of the key variables analyzed by Orcutt’s (1987) paper in Criminology titled: “Differential Association and Marijuana Use: A Closer Look at Sutherland (with a Little Help from Becker).”
  2. Identify specific wave and specific items in the NYS that best align with the sample and variables used in Orcutt’s study.
  3. Replicate Table 1 from Orcutt’s paper (pg. 348) using NYS items.

We assume that you are now familiar with installing and loading packages in R. Thus, when you see a package being used, we expect that you know it needs to be installed and that it needs to be loaded within your own R session in order to use it.

At this point, we also assume you are familiar with RStudio, with creating R Markdown (RMD) files, and with the basics of using the here package and a self-referential file structure to create a reproducible R Markdown file.

If not, please review R Assignments 1 & 2.

The Study

We recommend doing a quick AIC reading of Orcutt’s (1987) study. To get you started, here is the abstract:

Based on Sutherland’s differential association theory and Becker’s early research on marijuana use, a contingency model estimating the exact probability of getting high on marijuana under various associational and motivational conditions is specified and tested. Data from surveys at two universities fit this model closely. Predicted first-order interactions and nonlinear effects of motivational balance and peer association are statistically significant and generate highly precise estimates of the probability of getting high. These results suggest that linear main-effects models employed in previous research on differential association processes do not adequately reflect the complex [causal] structure of Sutherland’s theory. In addition, this study raises serious questions about claims that differential association theory is untestable and has been made outdated by social learning theory.

While that all sounds pretty technical, fundamentally the paper is examining some core theoretical claims from Edwin Sutherland’s Differential Association Theory. Specifically, Orcutt sets out to test key aspects of Sutherland’s theory by examining how (a respondent’s perception of their) peers’ behavior is associated with a respondent’s own subjective attitudes toward that same behavior (i.e. their “definitions” of the behavior) and, ultimately, with the respondent’s own self-reported participation in that behavior. In this case, the focal behavior is marijuana use. Orcutt also distinguishes between “competent use” and “incompetent use” to incorporate Sutherland’s ideas about the necessity of learning the requisite skills to accomplish deviant behavior as well as to integrate some key insights from Howard Becker’s (1953) classic description of the process of learning to become a regular marijuana user (hence the “with a little help from Becker” part of the title).

Orcutt’s data came from two “in-class” surveys of undergraduate students at two universities—University of Minnesota and Florida State University—in 1972 and 1973, respectively. Here is the description Orcutt provides in the article:

Approximately half of the respondents in both surveys received a questionnaire focusing on alcohol use while the other half completed a parallel form dealing with marijuana use. This analysis will be restricted to students at each school who filled out the marijuana questionnaire–444 Minnesota undergraduates and 543 Florida State (FSU) undergraduates.

You can find more details on these data in two other articles that Orcutt published prior to this paper (see Orcutt, 1975 and Orcutt, 1978).

Part 1 (Assignment 4.1)

Goal: Identify Wave of NYS Data that is most similar to Orcutt’s samples

Recall, the purpose of a conceptual replication is to test the repeatability, robustness, or generalizability of a theoretical or observational claim from a previous study using new data collection and/or new measurement procedures that are conceptually similar but not identical to those used in the previous study. The NYS data provide an opportunity to do this, since they include variables similar to those used by Orcutt - although not exactly the same - and, unlike Orcutt’s two university student samples, the NYS is a larger, nationally representative panel survey of youth followed for 10 years across seven waves of data.

Specifically, youth respondents in the NYS ranged from ages 11 to 17 in Wave 1 to ages 21 to 27 in wave 7. Since Orcutt’s sample is a student-based sample from two specific universities, we will start by identifying the wave of the NYS in which the youth respondents are most similar to the population from which Orcutt was sampling in terms of age. This will allow us to assess the repeatability of his findings on a group of youth in a similar age range while the nationally representative sample will permit us to assess whether Orcutt’s findings will generalize beyond college students attending the two specific universities that he studied.

  • Note: Later in the assignment, you will have the opportunity to see if Orcutt’s findings also generalize to these same NYS respondents at different waves and, hence at different age ranges.

Let’s look at the age range for each wave of NYS data:

Wave     Year   Min age   Max age
Wave 1   1976   11        17
Wave 2   1977   12        18
Wave 3   1978   13        19
Wave 4   1979   14        20
Wave 5   1980   15        21
Wave 6   1983   18        24
Wave 7   1987   21        27

The table above shows that the Wave 6 NYS data likely includes the panel members when they are most similar to Orcutt’s college-based samples in terms of age with an age range of 18-24. It is important to note that we do not actually know the specific age range or distribution of Orcutt’s sample because he did not appear to report it in the article. This is a situation where we are likely fairly safe in assuming Orcutt’s sample was largely drawn from 18-22 year-olds, but without that information provided in the article and without the data being shared publicly, we really cannot know for sure.
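To make the wave comparison concrete, here is a minimal sketch that counts how many ages in each wave's range fall inside an assumed college age range. The 18-22 range is our assumption (Orcutt did not report ages), the object and column names (nys_ages, overlap) are made up for illustration, and the comparison uses only the minimum and maximum ages from the table above, not the actual age distribution within each wave.

```r
# Age ranges by NYS wave, copied from the table above
nys_ages <- data.frame(
  wave    = paste("Wave", 1:7),
  year    = c(1976, 1977, 1978, 1979, 1980, 1983, 1987),
  min_age = c(11, 12, 13, 14, 15, 18, 21),
  max_age = c(17, 18, 19, 20, 21, 24, 27)
)

# ASSUMPTION: typical undergraduate ages of 18-22 (not reported by Orcutt)
college_min <- 18
college_max <- 22

# Count of whole-year ages shared by each wave's range and the assumed range
nys_ages$overlap <- pmax(
  0,
  pmin(nys_ages$max_age, college_max) - pmax(nys_ages$min_age, college_min) + 1
)

# Wave whose age range shares the most ages with the assumed college range
best_wave <- nys_ages$wave[which.max(nys_ages$overlap)]
nys_ages
best_wave
```

Under this assumption, Wave 6 covers all five assumed college ages, which is consistent with the choice made in the text. A different assumed range could change the answer, so treat this as a rough check rather than a substitute for knowing Orcutt's actual age distribution.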

  • Note: I also checked the articles cited above that Orcutt (1987) cites as providing more details about these data (Orcutt, 1975, 1978) and did not find any information on the age distribution of the samples in those articles either.
  1. Let’s download the data directly from ICPSR like we did in “R Assignment #3.”
  • Note: You should already have the “NYS” folder in your “Datasets” folder from the previous assignment. You can check using the function from the previous assignment.

  • Note: Recall that when you first run this chunk during an R session, you will be asked to enter your ICPSR account information into the R console. So, check the R Console after trying to run the icpsr_download command to see and respond to the ICPSR username/password prompts. Also, recall that this requires that you have an ICPSR account, which you should have created for the previous assignment. If you receive errors, go to the ICPSR website and be sure that you are able to log in using the username (email) and password that you are entering in the R console.

library(icpsrdata)
library(here)

ifelse(dir.exists(here("Datasets/NYS/ICPSR_09948")), TRUE, 
       icpsr_download(file_id = c(9948), 
               download_dir = here("Datasets/NYS")))
## [1] TRUE
  • Note: In order to prevent R from continually trying to download data we have already downloaded and to prevent issues when you knit your Rmd file, we added the icpsr_download function to an ifelse statement. This is similar to what we did in “R Assignment 3” when we were creating the “NYS” sub-folder in our “Datasets” folder. See Part 2 of “R Assignment 3” for an explanation of the logic. It essentially checks whether the “ICPSR_09948” folder (which holds NYS Wave 6) exists in the “NYS” folder. If it does, the statement returns the logical value “TRUE”; if it does not, it runs the icpsr_download command to download the NYS Wave 6 data from ICPSR.
  2. Load the data into R and assign it to an object with the read_spss command like you have done in the previous three assignments.
library(haven)
nys_w6 <- read_spss(here("Datasets", "NYS", "ICPSR_09948", "DS0001", "09948-0001-Data.sav"))
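
The "check for the folder, download only if it is missing" logic used with icpsr_download above can also be written with a plain if statement, which is the more common choice for a single TRUE/FALSE condition (ifelse is designed for vectors). The sketch below illustrates the pattern only: fetch_if_missing and fake_download are hypothetical helpers, fake_download is a stand-in that merely creates a folder in a temporary directory, and nothing here contacts ICPSR.

```r
# Hypothetical helper: run `fetch` only when `dir_path` does not exist yet
fetch_if_missing <- function(dir_path, fetch) {
  if (dir.exists(dir_path)) {
    return(TRUE)  # folder already present, so skip the download
  }
  fetch(dir_path)
}

# Stand-in for icpsr_download(): creates the folder and reports that it ran
fake_download <- function(dir_path) {
  dir.create(dir_path, recursive = TRUE)
  "downloaded"
}

# Use a temporary location so the demo never touches your project folders
demo_dir <- file.path(tempdir(), "NYS_demo", "ICPSR_09948")
unlink(dirname(demo_dir), recursive = TRUE)  # start from a clean slate

first_run  <- fetch_if_missing(demo_dir, fake_download)  # folder missing: fetch runs
second_run <- fetch_if_missing(demo_dir, fake_download)  # folder exists: fetch skipped
first_run
second_run
```

The first call triggers the stand-in download, while the second call finds the folder and returns TRUE without fetching again, which is the behavior we want when knitting the same file repeatedly.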

Part 2 (Assignment 4.2)

Goal: Identify Specific Items from NYS Data Most Similar to Orcutt’s Measures

The next step is to identify the specific items from the NYS that allow us to measure the same constructs as Orcutt (1987) did. Recall that Orcutt (1987) was attempting to examine key propositions from Sutherland’s Differential Association Theory. Specifically, Orcutt was interested in the relationship between 1) associations with criminal/deviant patterns of behavior (i.e. peer’s use of marijuana), 2) individuals’ subjective attitudes toward that behavior (i.e., definitions favorable to marijuana use), and 3) individuals’ criminal/deviant behavior (i.e. self-reported marijuana use). Below is a simplified diagram of the theorized causal structure, or directed acyclic graph (DAG), representing the hypothesized relationships between these three variables.

1. Associations with criminal/deviant patterns

Orcutt measured peer marijuana use with the following question: “Of your four closest friends, how many would you say use marijuana at least once a month?” The number of friends whom the respondent reported as using marijuana at least once a month served as the answer categories, which ranged from 0 to 4, with a mean of 1.2 (SD = 1.4) at Minnesota and 1.8 (SD = 1.6) at Florida State.

  • As you know from “R Assignment 3,” the NYS includes measures of peer delinquency and specifically peer marijuana use (V371 in NYS Wave 6). However, it is measured with the question: “Think of your friends. During the last year how many of them have used marijuana or hashish?” The answers were from a five-level ordinal scale with specific answer categories including: “None of them” (=1), “Very few of them” (=2), “Some of them” (=3), “Most of them” (=4), and “All of them” (=5).

    • You can see how this is a “conceptual replication” with our first measure. While both data sources include a measure of peer marijuana use, the concept is measured differently - and in a way that makes the measures and corresponding results not exactly comparable. To get a sense of how these differences might matter, just try to imagine how someone in Orcutt’s data who answered that “3” of their friends used “marijuana at least once a month” (above the average) would answer the question in the NYS: Would the person answer “very few of them” (=2) in the NYS? Perhaps so, if they have 20 friends. But what if the person has 4 friends - might they respond with “most of them” (=4) instead?

    • If you had substantive knowledge of Sutherland’s Differential Association Theory and the criminological literature on peer influence in general, you may also have an opinion regarding which measure of exposure to peer delinquency - Orcutt’s or the NYS version - best captures the theoretical concept of “exposure to delinquent patterns.” Going down this rabbit hole isn’t necessary for our particular assignment, but it is essential to think about and recognize: 1) that the measurement of abstract social constructs can be a tricky, variable, and error-prone endeavor; and 2) that such seemingly small measurement differences like the one here can have important implications for testing theories and for assessing the replicability of research findings.

  • Look at the distribution and summary statistics for the “Peer marijuana use” variable using frq and descr functions from the sjmisc package:

library(sjmisc)

nys_w6 %>%
  frq(V371)
## Y6-367:FRIENDS-USED MARIJUANA (V371) <numeric> 
## # total N=1725 valid N=1464 mean=2.46 sd=1.30
## 
## Value |            Label |   N | Raw % | Valid % | Cum. %
## ---------------------------------------------------------
##     1 |     None of them | 468 | 27.13 |   31.97 |  31.97
##     2 | Very few of them | 318 | 18.43 |   21.72 |  53.69
##     3 |     Some of them | 349 | 20.23 |   23.84 |  77.53
##     4 |     Most of them | 196 | 11.36 |   13.39 |  90.92
##     5 |      All of them | 133 |  7.71 |    9.08 | 100.00
##  <NA> |             <NA> | 261 | 15.13 |    <NA> |   <NA>
nys_w6 %>%
  descr(V371)
## 
## ## Basic descriptive statistics
## 
##   var    type                         label    n NA.prc mean  sd   se md
##  V371 numeric Y6-367:FRIENDS-USED MARIJUANA 1464  15.13 2.46 1.3 0.03  2
##  trimmed   range iqr skew
##     2.46 4 (1-5)   2 0.45
  • As you can see above, “None of them” is the modal category and a similar percentage of respondents answered “Very few of them” (18.4%) and “Some of them” (20.2%), each roughly equal to the last two categories combined - “Most of them” (11.4%) and “All of them” (7.7%). This suggests the distribution is right skewed or positive skewed. As you can see in the descriptive statistics, and characteristic of this “right” or “positive” skew, the mode (1) is less than the median (2), which is less than the mean (2.5).

    • Also note that about 15% of the respondents (n = 261) are missing data for this question. This means that by using this item, our overall sample size will, at most, be n = 1,464 (1725 - 261).
  • For now, we will plan on keeping this variable “as-is” with five categories.
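
If you want to verify the mode, median, and mean reported above without loading the data file, one option is to rebuild the valid responses from the counts printed in the frq output. This is only a sketch: peer_mj and mode_val are hypothetical names, and the vector is reconstructed from the printed frequencies rather than read from nys_w6.

```r
# Valid response counts for V371, copied from the frq() output above
counts <- c(468, 318, 349, 196, 133)  # categories 1 through 5

# Rebuild one value per valid respondent (missing cases are left out)
peer_mj <- rep(1:5, times = counts)

# Mode: the category with the largest count
mode_val <- as.numeric(names(which.max(table(peer_mj))))

length(peer_mj)           # number of valid cases
mode_val                  # modal category
median(peer_mj)           # middle value
round(mean(peer_mj), 2)   # average category score
```

Because mode < median < mean in this reconstruction, the ordering matches the pattern described for a right-skewed distribution.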

2. Subjective attitudes toward criminal/deviant behavior

Orcutt measured “subjective definitions” regarding marijuana use with the question: “How would you generally characterize your opinions toward marijuana?” The answer categories across the Minnesota and FSU data were slightly different, although both used 5-point Likert scales (technically a Likert-type scale) with the Minnesota survey response categories ranging from “highly negative” to “highly positive” and the Florida State response categories ranging from “negative” to “positive” (p. 346).

  • The NYS also includes items that measure one’s “subjective definition” of marijuana use, particularly its “wrongness” (V356 in NYS Wave 6). The specific question asks: “How wrong is it for someone your age to use marijuana or hashish?” with answer categories on a 4-point ordinal scale including: “Not wrong” (=1), “A little bit wrong” (=2), “Wrong” (=3), and “Very wrong” (=4).

    • Like before with the peer marijuana use variable, it is instructive to work through the intellectual exercise of thinking how the responses on Orcutt’s surveys would map onto the NYS question and response categories. Also, if we were planning to embark upon a serious attempt to contribute to this literature, then it would be worth seriously considering which is a better approach to measuring the theoretical concept of “subjective definitions.” Of course, this would require in-depth substantive knowledge of Sutherland’s theory.
  • Let’s look at the distribution and summary statistics for this item.

nys_w6 %>%
  frq(V356)
## Y6-352:USE MARIJUANA (V356) <numeric> 
## # total N=1725 valid N=1496 mean=2.74 sd=1.02
## 
## Value |              Label |   N | Raw % | Valid % | Cum. %
## -----------------------------------------------------------
##     1 |          Not wrong | 215 | 12.46 |   14.37 |  14.37
##     2 | A little bit wrong | 381 | 22.09 |   25.47 |  39.84
##     3 |              Wrong | 473 | 27.42 |   31.62 |  71.46
##     4 |         Very wrong | 427 | 24.75 |   28.54 | 100.00
##  <NA> |               <NA> | 229 | 13.28 |    <NA> |   <NA>
nys_w6 %>%
  descr(V356)
## 
## ## Basic descriptive statistics
## 
##   var    type                label    n NA.prc mean   sd   se md trimmed
##  V356 numeric Y6-352:USE MARIJUANA 1496  13.28 2.74 1.02 0.03  3    2.74
##    range iqr  skew
##  3 (1-4)   2 -0.27
  • This item as originally coded is “left skewed” or “negative skewed,” with the “wrong” category as the modal category and the “very wrong” category as the second most common.

  • Orcutt took his five-category item and recoded it into three categories with “undecided” as a “neutral” definition and responses on the positive (e.g., “Highly positive” and “Positive”) or negative side (e.g., “Highly negative” and “Negative”) of this category coded as “positive” or “negative” respectively. This made sense given Orcutt used a “bipolar” rating scale for his answer categories in designing his survey. A “bipolar” rating scale simply refers to a set of answer categories that allow a respondent to answer in opposite directions, usually separated by a midpoint (in Orcutt’s case the “undecided” category).

    • The NYS, however, uses a “unipolar” rating scale (see here for brief discussion of the distinction between bipolar and unipolar survey response scales). A unipolar rating scale includes answer categories that only move in one direction (in the case of the NYS, from “Not wrong” to “Very wrong”). As a result, the “subjective definition” item in the NYS does not lend itself nicely to a “neutral” categorization, although one could perhaps argue that the “A little bit wrong” answer conceptually aligns with Orcutt’s “undecided” answer category. Arguably, the NYS approach is also a less desirable match to Sutherland’s concept of definitions, which presumably can range in content from favorable to unfavorable to crime. Below, when we recode the data, we will ultimately decide to collapse the “subjective definition” responses in our analysis into two dummy variables that indicate whether the respondent reportedly has internalized: (A) “negative” definitions unfavorable to marijuana use (i.e., 1=“At least a little bit wrong”) or (B) “positive” definitions favorable to marijuana use (1=“Not wrong”).

      • Note: Again, if we were embarking on a serious attempt at contributing to this literature, then we might also want to assess robustness of findings from our analysis to alternative coding (and analysis) decisions. For example, one alternative coding strategy might involve collapsing the “wrong” and “very wrong” categories into a dummy indicator of “negative definitions,” then use the “A little bit wrong” category to indicate “neutral definitions” and the “not wrong” category to indicate “positive definitions.” In situations where we lack a clear theoretical rationale for one measurement strategy over another, we generally advise to start by not collapsing categories and assessing relative frequencies or associations across the full range of responses; likewise, in such situations, we recommend assessing robustness of results across various theoretically defensible measurement approaches.
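
To preview how the two coding strategies described above differ, here is a minimal base R sketch. It rebuilds the V356 responses from the counts in the frq output instead of using nys_w6, and the names wrong, pos_def, neg_def, and def3 are hypothetical. It is not necessarily the code we will use when recoding the data.

```r
# V356 responses rebuilt from the frq() counts above; 229 cases are missing
wrong <- c(rep(1:4, times = c(215, 381, 473, 427)), rep(NA, 229))

# Strategy used in this assignment: two dummy variables
pos_def <- as.integer(wrong == 1)  # 1 = "Not wrong"
neg_def <- as.integer(wrong >= 2)  # 1 = at least "A little bit wrong"

# Alternative strategy: three categories with a "neutral" middle
# Intervals are (0,1], (1,2], (2,4], so 1 -> positive, 2 -> neutral, 3-4 -> negative
def3 <- cut(wrong, breaks = c(0, 1, 2, 4),
            labels = c("positive", "neutral", "negative"))

sum(pos_def, na.rm = TRUE)  # respondents with positive definitions
sum(neg_def, na.rm = TRUE)  # respondents with negative definitions
table(def3)                 # counts under the three-category coding
```

Missing responses remain NA under both strategies. Comparing results across the two codings is one simple way to carry out the kind of robustness check mentioned in the note above.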
3. Individual criminal/delinquent behavior

Orcutt’s key dependent variable of “personal use of marijuana to get high” was measured with the question: “Which of the following statements best described the approximate number of times you have gotten ‘high’ on marijuana during the past year?” The answer categories included: 1) “I did not use marijuana during the past year;” 2) “I used marijuana during the past year, but did not get ‘high’;” 3) “I got ‘high’ on marijuana during the past year; but only once or twice;” 4) “I got ‘high’ on marijuana at least 3 times during the past year, but not more than 12 times;” and 5) “I got ‘high’ on marijuana more than 12 times during the past year.”

The NYS includes multiple questions about marijuana use with two being key for our purposes. First is a question about use (V890 in NYS Wave 6): “How many times in the last year have you used marijuana or hashish? (GRASS, POT, HASH)” with the specific number of times reported coded as answers. Second is a question about getting high (V966 in NYS Wave 6): “How many times in the past year have you been high on marijuana?” with the specific number of times reported coded as answers.

  • Take a moment and think which of these items from the NYS best reflects what Orcutt was trying to measure.

  • Ultimately, Orcutt makes this decision pretty easy on us. He was specifically interested in the distinction between “minimally competent use” and incompetent use. Here is what he said on pg. 347:

An important feature of this item is that it measures a respondent’s self reported ability to get high which, for Becker (1953), is a defining characteristic of a marijuana user. That is, this measure distinguishes between those who are minimally competent users-who have acquired the physical and subjective techniques for getting high-and those who are not. Therefore, according to Becker’s conception, respondents who checked either of the first two statements should be classified as nonusers. Thus, the dependent variable in this analysis is a proportional measure of initiation into marijuana use- Ownuse-based on a binary scoring of nonusers (0 = statements 1 or 2) versus users (1 = statements 3, 4, or 5).

  • This means that the key distinction for Orcutt was between getting high and not getting high. The second question from the NYS better captures this than the first. But before we decide, let’s look at the distribution and summary statistics for these items.
nys_w6 %>%
  frq(V890, V966)
## Y6-886:MARIJUANA-FREQ (V890) <numeric> 
## # total N=1725 valid N=1496 mean=32.74 sd=102.52
## 
## Value |   N | Raw % | Valid % | Cum. %
## --------------------------------------
##     0 | 847 | 49.10 |   56.62 |  56.62
##     1 |  71 |  4.12 |    4.75 |  61.36
##     2 |  75 |  4.35 |    5.01 |  66.38
##     3 |  41 |  2.38 |    2.74 |  69.12
##     4 |  31 |  1.80 |    2.07 |  71.19
##     5 |  34 |  1.97 |    2.27 |  73.46
##     6 |  16 |  0.93 |    1.07 |  74.53
##     7 |   5 |  0.29 |    0.33 |  74.87
##     8 |   4 |  0.23 |    0.27 |  75.13
##     9 |   1 |  0.06 |    0.07 |  75.20
##    10 |  34 |  1.97 |    2.27 |  77.47
##    11 |   1 |  0.06 |    0.07 |  77.54
##    12 |  30 |  1.74 |    2.01 |  79.55
##    14 |   1 |  0.06 |    0.07 |  79.61
##    15 |  10 |  0.58 |    0.67 |  80.28
##    16 |   1 |  0.06 |    0.07 |  80.35
##    20 |  30 |  1.74 |    2.01 |  82.35
##    21 |   1 |  0.06 |    0.07 |  82.42
##    22 |   1 |  0.06 |    0.07 |  82.49
##    24 |   2 |  0.12 |    0.13 |  82.62
##    25 |  16 |  0.93 |    1.07 |  83.69
##    26 |   1 |  0.06 |    0.07 |  83.76
##    30 |  16 |  0.93 |    1.07 |  84.83
##    35 |   1 |  0.06 |    0.07 |  84.89
##    40 |   8 |  0.46 |    0.53 |  85.43
##    45 |   1 |  0.06 |    0.07 |  85.49
##    50 |  34 |  1.97 |    2.27 |  87.77
##    52 |  23 |  1.33 |    1.54 |  89.30
##    60 |   6 |  0.35 |    0.40 |  89.71
##    62 |   1 |  0.06 |    0.07 |  89.77
##    70 |   2 |  0.12 |    0.13 |  89.91
##    75 |   4 |  0.23 |    0.27 |  90.17
##    80 |   4 |  0.23 |    0.27 |  90.44
##    85 |   1 |  0.06 |    0.07 |  90.51
##   100 |  29 |  1.68 |    1.94 |  92.45
##   104 |   2 |  0.12 |    0.13 |  92.58
##   130 |   1 |  0.06 |    0.07 |  92.65
##   144 |   2 |  0.12 |    0.13 |  92.78
##   150 |  11 |  0.64 |    0.74 |  93.52
##   156 |   1 |  0.06 |    0.07 |  93.58
##   160 |   1 |  0.06 |    0.07 |  93.65
##   175 |   1 |  0.06 |    0.07 |  93.72
##   200 |  14 |  0.81 |    0.94 |  94.65
##   208 |   2 |  0.12 |    0.13 |  94.79
##   240 |   1 |  0.06 |    0.07 |  94.85
##   250 |   3 |  0.17 |    0.20 |  95.05
##   270 |   1 |  0.06 |    0.07 |  95.12
##   300 |  19 |  1.10 |    1.27 |  96.39
##   340 |   1 |  0.06 |    0.07 |  96.46
##   350 |   1 |  0.06 |    0.07 |  96.52
##   360 |   6 |  0.35 |    0.40 |  96.93
##   365 |  28 |  1.62 |    1.87 |  98.80
##   400 |   1 |  0.06 |    0.07 |  98.86
##   450 |   1 |  0.06 |    0.07 |  98.93
##   500 |   3 |  0.17 |    0.20 |  99.13
##   600 |   4 |  0.23 |    0.27 |  99.40
##   700 |   3 |  0.17 |    0.20 |  99.60
##   730 |   2 |  0.12 |    0.13 |  99.73
##   900 |   1 |  0.06 |    0.07 |  99.80
##   999 |   3 |  0.17 |    0.20 | 100.00
##  <NA> | 229 | 13.28 |    <NA> |   <NA>
## 
## Y6-962:HIGH ON MARIJUANA PAST YEAR (V966) <numeric> 
## # total N=1725 valid N=649 mean=61.87 sd=134.07
## 
## Value |    N | Raw % | Valid % | Cum. %
## ---------------------------------------
##     0 |   66 |  3.83 |   10.17 |  10.17
##     1 |   64 |  3.71 |    9.86 |  20.03
##     2 |   77 |  4.46 |   11.86 |  31.90
##     3 |   33 |  1.91 |    5.08 |  36.98
##     4 |   35 |  2.03 |    5.39 |  42.37
##     5 |   30 |  1.74 |    4.62 |  47.00
##     6 |   20 |  1.16 |    3.08 |  50.08
##     7 |    8 |  0.46 |    1.23 |  51.31
##     8 |    5 |  0.29 |    0.77 |  52.08
##     9 |    1 |  0.06 |    0.15 |  52.23
##    10 |   31 |  1.80 |    4.78 |  57.01
##    12 |   21 |  1.22 |    3.24 |  60.25
##    14 |    1 |  0.06 |    0.15 |  60.40
##    15 |   15 |  0.87 |    2.31 |  62.71
##    18 |    1 |  0.06 |    0.15 |  62.87
##    20 |   25 |  1.45 |    3.85 |  66.72
##    22 |    1 |  0.06 |    0.15 |  66.87
##    24 |    2 |  0.12 |    0.31 |  67.18
##    25 |    9 |  0.52 |    1.39 |  68.57
##    26 |    2 |  0.12 |    0.31 |  68.88
##    30 |   16 |  0.93 |    2.47 |  71.34
##    35 |    2 |  0.12 |    0.31 |  71.65
##    40 |    8 |  0.46 |    1.23 |  72.88
##    45 |    2 |  0.12 |    0.31 |  73.19
##    48 |    1 |  0.06 |    0.15 |  73.34
##    50 |   27 |  1.57 |    4.16 |  77.50
##    52 |   10 |  0.58 |    1.54 |  79.04
##    60 |    2 |  0.12 |    0.31 |  79.35
##    62 |    1 |  0.06 |    0.15 |  79.51
##    70 |    2 |  0.12 |    0.31 |  79.82
##    75 |    3 |  0.17 |    0.46 |  80.28
##    80 |    3 |  0.17 |    0.46 |  80.74
##    85 |    2 |  0.12 |    0.31 |  81.05
##   100 |   30 |  1.74 |    4.62 |  85.67
##   104 |    2 |  0.12 |    0.31 |  85.98
##   110 |    1 |  0.06 |    0.15 |  86.13
##   125 |    1 |  0.06 |    0.15 |  86.29
##   150 |    8 |  0.46 |    1.23 |  87.52
##   156 |    1 |  0.06 |    0.15 |  87.67
##   160 |    2 |  0.12 |    0.31 |  87.98
##   200 |   11 |  0.64 |    1.69 |  89.68
##   240 |    1 |  0.06 |    0.15 |  89.83
##   250 |    4 |  0.23 |    0.62 |  90.45
##   270 |    2 |  0.12 |    0.31 |  90.76
##   300 |   20 |  1.16 |    3.08 |  93.84
##   320 |    1 |  0.06 |    0.15 |  93.99
##   350 |    2 |  0.12 |    0.31 |  94.30
##   352 |    1 |  0.06 |    0.15 |  94.45
##   360 |    4 |  0.23 |    0.62 |  95.07
##   365 |   21 |  1.22 |    3.24 |  98.31
##   400 |    1 |  0.06 |    0.15 |  98.46
##   450 |    1 |  0.06 |    0.15 |  98.61
##   500 |    1 |  0.06 |    0.15 |  98.77
##   600 |    2 |  0.12 |    0.31 |  99.08
##   700 |    1 |  0.06 |    0.15 |  99.23
##   999 |    5 |  0.29 |    0.77 | 100.00
##  <NA> | 1076 | 62.38 |    <NA> |   <NA>
nys_w6 %>%
  descr(V890, V966)
## 
## ## Basic descriptive statistics
## 
##   var    type                              label    n NA.prc  mean     sd   se
##  V890 numeric              Y6-886:MARIJUANA-FREQ 1496  13.28 32.74 102.52 2.65
##  V966 numeric Y6-962:HIGH ON MARIJUANA PAST YEAR  649  62.38 61.87 134.07 5.26
##  md trimmed       range iqr skew
##   0    6.10 999 (0-999)   8 4.90
##   6   27.62 999 (0-999)  48 3.81
  • What jumps out at you from these distributions? Like before, and something common for lots of deviant behaviors, is that the data are right skewed for the question about “using marijuana,” with zero being the modal answer. However, another thing that should jump out at you is the number of missing (“NA”) cases in each of these items. Specifically, the question about getting high (V966) is missing for 62% of the sample!

    • When I (Jake) first saw this, it was not completely clear to me why so many cases were missing on this variable. My hunch was that it was a result of a skip pattern in the survey, specifically that they only asked the question about “getting high” to the respondents who reported that they had used marijuana in the past year (V890). However, when I looked at the codebook (both the ICPSR version and the original version included in the ICPSR documentation), this skip pattern was not completely clear. Ultimately, I had to go to the original survey instrument that is included in the codebook from ICPSR.

- Unfortunately, the original page numbers are not included in the instrument that is attached to the ICPSR codebook. However, with some simple math, we can be fairly confident that the large number of missing values for the question about getting high resulted from respondents who reported no use not being asked that question.

  - Go back and look at the distributions for **V890** about use and **V966** about getting high. Notice that for **V890**, there are 1,496 "valid" responses (i.e. non-missing) and 847 respondents who reported zero marijuana use in the past year. If you subtract 847 from 1,496, you get 649---the number of valid cases in **V966**. 
  
  - Ultimately, this means we'll need both questions to construct a measure similar to Orcutt's (1987). Essentially, we'll want to create a dichotomous variable that codes as "non-users" those who answered zero on either the question about "using marijuana" or the question about "getting high" from using marijuana (0 = Incompetent-/Non-User). Any respondent who answered one or above on the question about getting high will be coded as a (competent) "user" in Orcutt's terms (1 = Competent User).
  
    - ***Note:*** If we were doing this conceptual replication in the wild, we would likely want to directly examine the distinction between competent and incompetent use by distinguishing between 1) non-users (i.e., answered zero on the question about use), 2) incompetent users (answered 1 or more on the question about use and answered zero on the question about getting high), and 3) competent users (answered above zero on the question about getting high).
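
As a rough illustration of the three-category scheme described in the note, here is a minimal sketch on made-up data. The data frame `toy` and the variable `mar_status` are hypothetical and not part of the NYS data; only the coding rules come from the text above. The sketch also assumes that a respondent must report at least one use to count as a competent user, and it leaves anyone with an unresolved missing value as `NA`.

```r
# Toy data (hypothetical values): past-year use and getting-high counts.
toy <- data.frame(
  maruse  = c(0, 0, 3, 5, 12, NA, 2),
  marhigh = c(NA, 0, 0, 4, 12, NA, NA)
)

# Start with everyone missing, then fill in one category at a time.
# which() drops NA comparisons, so unresolved cases simply stay NA.
toy$mar_status <- NA_character_
toy$mar_status[which(toy$maruse == 0)] <- "non-user"
toy$mar_status[which(toy$maruse >= 1 & toy$marhigh == 0)] <- "incompetent user"
toy$mar_status[which(toy$maruse >= 1 & toy$marhigh >= 1)] <- "competent user"

# Frequency table, including the missing category.
table(toy$mar_status, useNA = "ifany")
```

Filling the categories one at a time keeps each rule readable, and the order matters only for respondents who could satisfy more than one rule.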

Part 3 (Assignment 4.3)

Goal: Select, Rename, and Recode Items for Analysis

Up to this point, the bulk of the work has been intellectual and theoretical. Indeed, that’s the “conceptual” part of a conceptual replication. But now that we know what items from the NYS align most closely with the items used in Orcutt’s (1987) study and have a good idea of how we want to use them, we need to wrangle and recode the data so that we can analyze it. Like with “R Assignment #3,” this will involve selecting the specific variables, giving them informative names, and recoding them to closely resemble Orcutt’s coding decisions. Like with the previous assignment, in addition to the items identified above that will be used for the conceptual replication, we will also select the “CASEID” variable in order to maintain the individual identifier and we’ll create the “wave” variable to indicate that the data we are working with comes from Wave 6.

  • Note: We would likely not artificially limit ourselves to studying marijuana if we were pursuing this conceptual replication as a research project. Sutherland’s theory is not limited to explaining substance use, and examining the robustness of its predictions across multiple types of criminal, delinquent, and deviant behavior would be a way to not simply replicate but extend Orcutt’s (1987) study.
1. Select, Rename, and Recode the Data

Here is the code for selecting items, renaming them, and recoding them. We’ll explain the logic below:

library(dplyr)
nys_w6_trim <- nys_w6 %>%
  dplyr::select(CASEID, V371, V356, V890, V966) %>%
  #Provide Informative Names:
  rename(marpeer = V371, 
         mardef = V356,
         maruse = V890,
         marhigh = V966) %>%
  #Recode key Variables
  mutate(marpeer_fct = as_factor(marpeer), #create factor variable from marpeer
         mardef_fct = as_factor(mardef), #create factor variable from mardef
         mardef_neg = ifelse(mardef %in% c(2, 3, 4), 1, 0), #create dummy variable indicating negative definition of marijuana ("A little bit wrong" to "Very Wrong")
         mardef_neg = ifelse(is.na(mardef), NA, mardef_neg),
         mardef_pos = ifelse(mardef == 1, 1, 0), #create dummy variable indicating positive definition of marijuana ("Not Wrong")
         mardef_neut = ifelse(mardef == 2, 1, 0), #create dummy variable indicating neutral definition of marijuana ("A little bit wrong")
         mardef_negneut = ifelse((mardef == 3 | mardef == 4), 1, 0), #create dummy variable indicating negative definition ("Wrong" and "Very Wrong") to align with neutral definition ("A little bit wrong")
         mardef_negneut = ifelse(is.na(mardef), NA, mardef_negneut),
         maruse_dic = ifelse(maruse >= 1, 1, 0), #create dummy variable for maruse
         marhigh_dic = ifelse(marhigh >= 1, 1, 0), #responses 1 or greater on marhigh = 1
         marhigh_dic = ifelse(maruse == 0, 0, marhigh_dic), #responses of 0 on maruse are coded as 0
         wave = 6) 
glimpse(nys_w6_trim)
## Rows: 1,725
## Columns: 14
## $ CASEID         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ marpeer        <dbl+lbl>  4, NA,  1,  4,  3,  3,  4,  3,  3,  1,  3,  2,  3,…
## $ mardef         <dbl+lbl>  2, NA,  4,  2,  2,  2,  2,  2,  2,  4,  2,  3,  1,…
## $ maruse         <dbl> 300, NA, 0, 300, 2, 100, 4, 4, 0, 0, 0, 0, 9, 25, 150, …
## $ marhigh        <dbl> 300, NA, NA, 300, 2, 100, 4, 1, NA, NA, NA, 1, 10, 25, …
## $ marpeer_fct    <fct> Most of them, NA, None of them, Most of them, Some of t…
## $ mardef_fct     <fct> A little bit wrong, NA, Very wrong, A little bit wrong,…
## $ mardef_neg     <dbl> 1, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, …
## $ mardef_pos     <dbl> 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, …
## $ mardef_neut    <dbl> 1, NA, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, …
## $ mardef_negneut <dbl> 0, NA, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, …
## $ maruse_dic     <dbl> 1, NA, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, …
## $ marhigh_dic    <dbl> 1, NA, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, …
## $ wave           <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…

There are a few things that we should explain about the above code.

  1. In the raw nys_w6 data that we imported from ICPSR, all of the categorical variables are treated as numerical in the data frame (specifically “double” format with labels). You can see this by looking at the output from the glimpse(nys_w6_trim) function above (see the <dbl+lbl> next to the variable names). But the “peer influence” (marpeer) and “subjective definitions” (mardef) variables are technically not numerical; they are ordered categories with numbers assigned to each category.

-We want to make sure that R knows these variables are categorical and thus we need to convert them to factors, R’s method for working with categorical variables (see Ch. 15 of R for Data Science for more details). Like with most things, the tidyverse suite of packages has a built-in package for working with factors called “forcats”.

-Actually, the data frame already includes value labels for each variable. You can see this for any variable by using the function get_labels() that is part of the “sjlabelled” package (you can also see this in the glimpse() output above, as the “marpeer” and “mardef” items have <dbl+lbl> next to them).

library(sjlabelled)

get_labels(nys_w6$V371)
## [1] "None of them"     "Very few of them" "Some of them"     "Most of them"    
## [5] "All of them"
get_labels(nys_w6$V356)
## [1] "Not wrong"          "A little bit wrong" "Wrong"             
## [4] "Very wrong"
  • The as_factor function simply tells R to create a factor variable and, in this case, uses the built-in value labels of the variables we are turning into factors to indicate the factor level.
  2. We used the ifelse command to create four dummy variables related to the “subjective definitions” item. Remember that the ifelse command takes the form of a logical test: ifelse(test, yes, no). We first created two dummy variables distinguishing between “positive definitions” (mardef_pos) and “negative definitions” (mardef_neg).
  • For the “mardef_neg” variable, we simply told R to assign a value of 1 if the respondent answered “A little bit wrong” (2) through “Very wrong” (4) and zero otherwise. Recall, this question had unipolar answer categories ranging from “Not wrong” to “Very wrong” views of marijuana use. To do this, we used the logical operator %in% (see the Data Transformation chapter in R For Data Science for an explanation of the different logical operators) and listed the values we wanted coded as one in a vector using the c() function. A vector is simply a one-dimensional array or series of values (see here for brief explanation). This is basically a shortcut for writing ifelse((mardef == 2 | mardef == 3 | mardef == 4), 1, 0).
    • For the “mardef_pos” variable, we simply told R to code values of “Not wrong” (1) as 1 and zero otherwise. These two dummy variables are likely as close as we can get to Orcutt’s (1987) three dummy variables for “Negative,” “Neutral,” and “Positive” subjective definitions of marijuana.
  • We also created two dummy variables, “mardef_neut” and “mardef_negneut”, that account for a “neutral” definition by treating the “A little bit wrong” answer category (2) as neutral and the “Wrong” (3) and “Very wrong” (4) answers as negative. Note we had to create an additional “negative” dummy variable because dummy variables for multiple categories of the same variable should be mutually exclusive. The “mardef_neg” variable included the “A little bit wrong” answer and thus would have overlapped with our “mardef_neut” variable. Of course, given the unipolar nature of the “subjective definitions” question in NYS, this specific coding strategy may not be conceptually justified.
  3. We used the ifelse function to create two dummy variables for the using marijuana (“maruse_dic”) and getting high from marijuana (“marhigh_dic”) items.
  • For the “maruse_dic” variable, we simply created dummy variables similar to how we did in “R Assignment 3” by telling R that if the “maruse” variable is greater than or equal to one, make the “maruse_dic” variable equal to 1 and make it zero otherwise (i.e. if it equals zero).

  • For the “marhigh_dic” variable, we needed two ifelse commands that are essentially nested. This is because we needed the “marhigh_dic” variable to account for those who reported using marijuana but did not get high.

    • First, we created the dummy variable like we did with “maruse_dic” (if the “marhigh” variable is greater than or equal to one, make the “marhigh_dic” variable equal to 1 and make it zero otherwise, i.e. if it equals zero). This takes the 649 valid cases in the “marhigh” variable and assigns them to the appropriate category (1 or 0). But this only affects the 649 valid cases.
    • Second, in order to account for the fact that (almost all) respondents who reported not using marijuana in the past year were not asked if they got high from marijuana, we needed another ifelse command. Thus, the second ifelse command tells R that if “maruse” equals zero, make “marhigh_dic” equal to zero, and otherwise leave “marhigh_dic” at the value it already has. This is how we nested the two ifelse commands. If the second command does not apply (i.e. maruse does not equal zero), then the “marhigh_dic” variable keeps the value we told it to take in the first ifelse command.
    • Note: If you want to get a sense of this, run the above code where we created the “nys_w6_trim” data but comment out the second ifelse command and create a frequency table for the “marhigh_dic” variable. Here is what it will look like:
## marhigh_dic <numeric> 
## # total N=1725 valid N=649 mean=0.90 sd=0.30
## 
## Value |    N | Raw % | Valid % | Cum. %
## ---------------------------------------
##     0 |   66 |  3.83 |   10.17 |  10.17
##     1 |  583 | 33.80 |   89.83 | 100.00
##  <NA> | 1076 | 62.38 |    <NA> |   <NA>
  - Notice how it has 1,076 missing values? This includes the respondents who were legitimately missing on both the "marijuana use" and "getting high" items but also those who answered zero to the "marijuana use" question and thus were not asked the "getting high" question. Compare this to the variable you actually created with the nested `ifelse` commands:
## marhigh_dic <numeric> 
## # total N=1725 valid N=1493 mean=0.39 sd=0.49
## 
## Value |   N | Raw % | Valid % | Cum. %
## --------------------------------------
##     0 | 911 | 52.81 |   61.02 |  61.02
##     1 | 582 | 33.74 |   38.98 | 100.00
##  <NA> | 232 | 13.45 |    <NA> |   <NA>
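
The recoding logic described in points 2 and 3 above can be sketched on a small made-up data frame. The object `toy` and its values are hypothetical; the variable names simply mirror those in the tutorial. The sketch is meant to show two things: why the `%in%` version needs a second step to restore missing values while the `==` version does not, and how the second `ifelse()` overrides the first for respondents who report zero use.

```r
# Toy data (hypothetical values), mimicking the structure of the real items.
toy <- data.frame(
  mardef  = c(1, 2, 3, 4, NA),
  maruse  = c(0, 4, 2, NA, 0),
  marhigh = c(NA, 3, 0, NA, 1)
)

# %in% never returns NA, so a missing mardef would silently become 0 ...
neg_raw <- ifelse(toy$mardef %in% c(2, 3, 4), 1, 0)
# ... which is why a second step restores the missing values.
toy$mardef_neg <- ifelse(is.na(toy$mardef), NA, neg_raw)

# == propagates NA on its own, so no second step is needed here.
toy$mardef_pos <- ifelse(toy$mardef == 1, 1, 0)

# Step 1: dichotomize the getting-high item (NA stays NA).
high_step1 <- ifelse(toy$marhigh >= 1, 1, 0)
# Step 2: anyone reporting zero use is set to 0, overriding step 1.
toy$marhigh_dic <- ifelse(toy$maruse == 0, 0, high_step1)

print(neg_raw)
print(toy)
```

In the last row of the toy data, a respondent reports zero use but one occasion of getting high; step 2 codes that person as 0, which is the same decision the tutorial makes for the real data.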
2. Check that the data wrangling worked as expected

Remember, anytime you recode and/or manipulate data, you want to check that R did what you wanted it to do. The thing about programming languages like R is that they will do exactly what you tell them to do (or won’t do something because you didn’t speak to them correctly). But what you tell them to do is not always what you expect them to do. So it is crucial to check your data after you have made changes.

  1. First, let’s see that our new factor variables worked how we expected them to. As in “R Assignment 3,” we’ll use the flat_table() function from the “sjmisc” package.
library(sjmisc)

nys_w6_trim %>%
  flat_table(marpeer_fct, marpeer, show.values = TRUE)
##                  marpeer [1] None of them [2] Very few of them [3] Some of them [4] Most of them [5] All of them
## marpeer_fct                                                                                                     
## None of them                          468                    0                0                0               0
## Very few of them                        0                  318                0                0               0
## Some of them                            0                    0              349                0               0
## Most of them                            0                    0                0              196               0
## All of them                             0                    0                0                0             133
nys_w6_trim %>%
  flat_table(mardef_fct, mardef, show.values = TRUE)
##                    mardef [1] Not wrong [2] A little bit wrong [3] Wrong [4] Very wrong
## mardef_fct                                                                             
## Not wrong                           215                      0         0              0
## A little bit wrong                    0                    381         0              0
## Wrong                                 0                      0       473              0
## Very wrong                            0                      0         0            427

Notice that we included the show.values = TRUE option in the flat_table() function. This is because the flat_table() function automatically includes the value labels. By including the show.values = TRUE option, it also places the actual value in the data set in brackets next to the value labels. Notice how the new factor variable “mardef_fct” doesn’t have any bracketed values? That’s because by creating it as a factor variable, R now recognizes it as a true categorical variable without a real numerical value. Of course, R still has the levels ordered to represent the numerical values of the original variable. Thus, if you include the factor variables in a summary statistics table as before, it will give you the same values.

Also note that if you want to check the levels of a factor, you can use the base R command levels() and specify the variable from a specific data frame by using the $ operator.
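
As a small, self-contained illustration of levels(), the sketch below builds a hypothetical factor with the same four answer categories rather than using the NYS data. With the tutorial's data, the analogous call would be levels(nys_w6_trim$mardef_fct).

```r
# Hypothetical factor built from the same four answer categories.
mardef_fct <- factor(
  c("Wrong", "Not wrong", "Very wrong", "A little bit wrong", "Wrong"),
  levels = c("Not wrong", "A little bit wrong", "Wrong", "Very wrong")
)

levels(mardef_fct)      # the category labels, in their stored order
as.integer(mardef_fct)  # the underlying integer codes (position of each level)
```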

nys_w6_trim %>%
  descr(marpeer, marpeer_fct, mardef, mardef_fct)
## 
## ## Basic descriptive statistics
## 
##          var        type                         label    n NA.prc mean   sd
##      marpeer     numeric Y6-367:FRIENDS-USED MARIJUANA 1464  15.13 2.46 1.30
##  marpeer_fct categorical Y6-367:FRIENDS-USED MARIJUANA 1464  15.13 2.46 1.30
##       mardef     numeric          Y6-352:USE MARIJUANA 1496  13.28 2.74 1.02
##   mardef_fct categorical          Y6-352:USE MARIJUANA 1496  13.28 2.74 1.02
##    se md trimmed   range iqr  skew
##  0.03  2    2.46 4 (1-5)   2  0.45
##  0.03  2    2.46 4 (1-5)   2  0.45
##  0.03  3    2.74 3 (1-4)   2 -0.27
##  0.03  3    2.74 3 (1-4)   2 -0.27

For our dummy variables, we simply need to check that they are coded as we expected (i.e. the answer categories we thought we assigned to 1 and 0 are actually assigned to those values) and that they are mutually exclusive (related dummy variables do not overlap). To do that, we can look at cross-tabulations (i.e. “crosstabs”) of each dummy variable with the original variable from which it was created:

nys_w6_trim %>%
  flat_table(mardef, mardef_neg, show.values = TRUE)
##                        mardef_neg   0   1
## mardef                                   
## [1] Not wrong                     215   0
## [2] A little bit wrong              0 381
## [3] Wrong                           0 473
## [4] Very wrong                      0 427
nys_w6_trim %>%
  flat_table(mardef, mardef_pos, show.values = TRUE)
##                        mardef_pos   0   1
## mardef                                   
## [1] Not wrong                       0 215
## [2] A little bit wrong            381   0
## [3] Wrong                         473   0
## [4] Very wrong                    427   0
nys_w6_trim %>%
  flat_table(mardef, mardef_neut, show.values = TRUE)
##                        mardef_neut   0   1
## mardef                                    
## [1] Not wrong                      215   0
## [2] A little bit wrong               0 381
## [3] Wrong                          473   0
## [4] Very wrong                     427   0
nys_w6_trim %>%
  flat_table(mardef, mardef_negneut, show.values = TRUE)
##                        mardef_negneut   0   1
## mardef                                       
## [1] Not wrong                         215   0
## [2] A little bit wrong                381   0
## [3] Wrong                               0 473
## [4] Very wrong                          0 427

Our dummy coding seems to have worked as expected and, within each related set of dummy variables, the categories are mutually exclusive. That is, no observation is coded 1 on more than one variable within the “mardef_pos” and “mardef_neg” set, or within the “mardef_pos”, “mardef_neut”, and “mardef_negneut” set.
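
Mutual exclusivity can also be checked numerically rather than by eye. The following is a minimal sketch on made-up data (the data frame `toy` and the helper vector `set_total` are hypothetical): if a set of dummy variables is mutually exclusive and exhaustive, the dummies in each complete row should sum to exactly 1.

```r
# Toy data (hypothetical values) for the four-category definitions item.
toy <- data.frame(mardef = c(1, 2, 3, 4, 2, NA))

toy$mardef_pos     <- ifelse(toy$mardef == 1, 1, 0)
toy$mardef_neut    <- ifelse(toy$mardef == 2, 1, 0)
toy$mardef_negneut <- ifelse(toy$mardef %in% c(3, 4), 1, 0)
toy$mardef_negneut[is.na(toy$mardef)] <- NA  # restore missing values

# Within a mutually exclusive and exhaustive set, each complete row sums to 1.
set_total <- rowSums(toy[, c("mardef_pos", "mardef_neut", "mardef_negneut")])
table(set_total, useNA = "ifany")
```

Any value other than 1 (or NA for missing respondents) in this table would signal overlapping or incomplete dummy coding.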

  2. Now let’s check the marijuana use and getting high from marijuana items. We created a dummy variable (“marhigh_dic”) to be congruent with Orcutt’s (1987) coding decision. Specifically, if our recode logic worked, we should have everyone reporting getting “high from marijuana” one or more times in the past year coded as 1 and those who did not use marijuana or did not get high from marijuana in the past year coded as zero. Those who were missing on the use question should also be missing on our newly created variable.
  • Let’s start with simply looking at their frequencies:
nys_w6_trim %>%
  frq(maruse_dic, marhigh_dic)
## maruse_dic <numeric> 
## # total N=1725 valid N=1496 mean=0.43 sd=0.50
## 
## Value |   N | Raw % | Valid % | Cum. %
## --------------------------------------
##     0 | 847 | 49.10 |   56.62 |  56.62
##     1 | 649 | 37.62 |   43.38 | 100.00
##  <NA> | 229 | 13.28 |    <NA> |   <NA>
## 
## marhigh_dic <numeric> 
## # total N=1725 valid N=1493 mean=0.39 sd=0.49
## 
## Value |   N | Raw % | Valid % | Cum. %
## --------------------------------------
##     0 | 911 | 52.81 |   61.02 |  61.02
##     1 | 582 | 33.74 |   38.98 | 100.00
##  <NA> | 232 | 13.45 |    <NA> |   <NA>
  • The first thing that jumps out is that the “marhigh_dic” variable has three additional responses that are missing. This means that three people who answered the question about using marijuana in the past year did not answer the question about getting high in the past year. But this poses a puzzle. Recall that the number of valid cases on the getting high item (V966) was exactly the number of valid cases on the use item minus the number of respondents who answered they had used marijuana “zero” times in the past year. What gives?

    • Let’s try to find out by looking at the crosstab between “maruse_dic” and “marhigh_dic” but with missing values included. To do this we’ll have to create the variables with explicit missing values.
nys_w6_trim %>%
  mutate(maruse_na = ifelse(is.na(maruse), -1, maruse_dic), 
         marhigh_na = ifelse(is.na(marhigh), -1, marhigh_dic)) %>% 
  flat_table(maruse_na, marhigh_na, exclude = NULL)
##           marhigh_na  -1   0   1
## maruse_na                       
## -1                   229   0   0
## 0                    844   3   0
## 1                      3  64 582
  • First, notice in the ifelse commands above, I told R to create two new variables that equal -1 if “maruse” (or “marhigh”) is missing (is.na checks whether the item is missing for a specific observation) and the value of the dichotomous variable we created otherwise.

  • Second, notice that the cell corresponding to missing on both “maruse_na” and “marhigh_na” (i.e. both are -1) has 229 observations in it. This makes sense as it suggests the same respondents who did not answer the question about marijuana use also didn’t answer the questions about getting high (this is probably largely a result of attrition between survey waves - i.e. they were not able to follow-up with most of these respondents).

  • Third, notice the cell corresponding to 1 on the “maruse_na” variable and -1 on the “marhigh_na” variable has 3 observations. This is the source of the difference in missing cases we noticed earlier when looking at the frequencies of our “maruse_dic” and “marhigh_dic” variables. Three respondents who answered that they used marijuana one or more times in the past year did not answer the question about how many times they got high from using marijuana in the past year.

  • Finally, notice the cell corresponding to those with zero on “maruse_na” and zero on “marhigh_na.” This indicates that three people who answered they had not used marijuana in the past year were nonetheless asked whether they got high, a violation of the skip pattern. This is why we could have 3 additional missing values on the getting high question but still have the math work that we discussed earlier (i.e., if you subtract the number answering zero on the use question, 847, from the total number of valid cases, 1,496, you get 649, the number of valid cases on the getting high question). Essentially, we lost three who didn’t answer the getting high question but should have and gained three who answered the getting high question but shouldn’t have.

    • The crucial question for us, though, is whether we correctly classified those 3 subjects who answered “zero” to the number of times they used marijuana but also answered the getting high question. Based on our current dichotomous coding strategy, they are coded as “zero” on the “marhigh_dic” variable. This is because in our nested ifelse commands, we told R to code all subjects who answered “zero” on the marijuana use question to be zero on the “marhigh_dic” variable, and we did this after telling R to code those who responded 1 or more on the getting high question as 1. Essentially, if any of these three respondents answered “zero” to using marijuana but answered 1 or more to getting high, they are coded as zero in our “marhigh_dic” variable. This sounds nonsensical, but it is a possibility for these three subjects.

    -Let’s look at a crosstab for our “maruse” and “marhigh” variables to see if this happened. Before we do this, we will create a truncated variable that gives everyone who answers 10 or above on these questions a value of 10. This will just make it easier to see the crosstab.

nys_w6_trim %>%
  mutate(maruse_trunc = ifelse(maruse >= 10, 10, maruse), 
         marhigh_trunc = ifelse(marhigh >= 10, 10, marhigh)) %>% 
  flat_table(maruse_trunc, marhigh_trunc, exclude = NULL)
##              marhigh_trunc   0   1   2   3   4   5   6   7   8   9  10
## maruse_trunc                                                          
## 0                            2   1   0   0   0   0   0   0   0   0   0
## 1                           29  39   3   0   0   0   0   0   0   0   0
## 2                           14   8  48   2   0   0   0   0   1   0   1
## 3                            8   8   6  17   1   1   0   0   0   0   0
## 4                            2   4   5   2  17   1   0   0   0   0   0
## 5                            3   0   3   2   4  16   5   1   0   0   0
## 6                            1   0   0   2   1   4   6   1   0   0   1
## 7                            0   0   0   1   0   1   1   2   0   0   0
## 8                            0   0   0   1   1   1   0   0   1   0   0
## 9                            0   0   0   0   0   0   0   0   0   0   1
## 10                           7   4  12   6  11   6   8   4   3   1 307
  • First, notice in the ifelse commands above, I told R to create two new variables that equal 10 if “maruse” (or “marhigh”) is greater than or equal to 10 and equal their existing value otherwise.

  • Second, in terms of the reason we wanted to look at this crosstab, notice that the three people who answered zero to the question about use are in the top row of the crosstab. Two of them answered “zero”, as expected, to the number of times they had gotten high on marijuana. However, one of them had reported zero use but also reported getting high on marijuana one time. Also notice that there are multiple people who report getting high on marijuana more than they report using marijuana (just look at the values above the diagonal in the crosstab, these are all respondents who report getting high on marijuana more times than they report using marijuana). What gives?

  • Some of this could simply be measurement error, perhaps resulting from respondents not remembering exactly how many times they have used and gotten high from marijuana. It could also reflect differences in how each question was specifically asked. Recall that for the marijuana use question respondents were asked specifically about using “marijuana and hashish” whereas with the getting high question they were only asked specifically about “marijuana.” But this type of difference in interpretation would seem to explain only the values below the diagonal in the crosstab.

    • i.e., respondents who count both marijuana and hashish in answering about use but only count marijuana in answering about getting high would likely report getting high less than they report use.
  • Of course, these discrepancies between use and getting high could also reflect real differences in behavior and interpretation of the questions. For example, perhaps respondents think of “smoking marijuana” when they think of use, but think of multiple things like “edibles”, “contact high”, etc. when they think of how many times they have “been high on marijuana.”

  • You know how Ritchie (2020) keeps saying data in the real world are messy? This is what he is talking about! Ultimately, we don’t know why these discrepancies exist. But, for our conceptual replication purposes, Orcutt (1987) was primarily interested in competent use. Thus, if respondents don’t report using marijuana but report getting high (e.g. from edibles, contact highs, etc.) it would seem to not reflect competent use as Orcutt (1987) intended. Thus, we are comfortable with this one individual being coded as zero.
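
To put numbers on the discrepancies discussed above instead of reading them off the crosstab, one could compare the two counts row by row. This is a minimal sketch on made-up data; the data frame `toy` and the variable `compare` are hypothetical, and with the real data the same comparison would be run on “maruse” and “marhigh” in nys_w6_trim.

```r
# Toy data (hypothetical values): reported use and getting-high counts.
toy <- data.frame(
  maruse  = c(0, 1, 2, 5, 10, 3, NA),
  marhigh = c(1, 1, 8, 2, 10, 0, NA)
)

# Compare the two counts row by row; rows with missing values stay NA.
toy$compare <- ifelse(toy$marhigh > toy$maruse, "high > use",
                ifelse(toy$marhigh < toy$maruse, "high < use", "equal"))

# "high > use" corresponds to cells above the diagonal of the crosstab,
# "high < use" to cells below it.
table(toy$compare, useNA = "ifany")
```

Here the second ifelse() sits inside the first, so it is only consulted for rows where the first test is FALSE.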

Part 4 (Assignment 4.4)

Goal: Replicate Descriptive Statistics (Table 1) from Orcutt (1987)

We are now ready to actually replicate Orcutt’s descriptive analyses, but we think it’s important to take a moment and recognize all the work we had to do to get to this point. In addition to the conceptual work, we also had to spend a fair amount of time wrangling the data and checking to make sure our operations worked how we intended. This is completely normal. When we are working with data, it is common for the bulk of that work to be taken up with these data management and data wrangling tasks (see this blog for a review).

  • The data wrangling process is also where you see the “Garden of Forking Paths” in a study begin to emerge. Think about all of the different decisions we had to make when wrangling the data above. Do you buy our justifications for these decisions? Do you think someone else, given the same task, could have reasonably made different decisions? This is why creating a reproducible workflow, sharing data, and generally adhering to open-science practices is so important. Other scholars can check and criticize our decisions, and the relevant community of scientists has the potential to come to some form of intersubjective consensus.
1. Summarize Data

Given we have our data in order, the next step is to summarize it similar to Orcutt.

  • First, let’s look at Table 1 (pg. 348):