The purpose of this assignment is to perform a conceptual replication of some observations in Orcutt’s (1987) paper in Criminology titled: “Differential Association and Marijuana Use: A Closer Look at Sutherland (with a Little Help from Becker).” Since Orcutt’s (1987) original data are unavailable, we will assess whether some of his findings can be repeated with and generalize to a similar sample in the NYS data. Recall, you previously used a pooled version of the first five waves of the NYS data to conduct reproduction research in the previous assignment.
Notice that we are drawing a distinction between “reproducibility” and “replicability” and, likewise, between reproduction and replication research. Recall, in R Assignment 3, we conducted a reproduction of part of Figure 1 from Mark Warr’s (1993) classic paper on age and peers using the exact same data as Warr. In a reproduction, the goal is to verify or repeat exactly some or all of the findings reported in a previous study using identical data and methods as the original study. Unfortunately, the terminology surrounding reproduction and replication is inconsistent and confusing. For example, some use the term “pure replication” to refer to what we call reproduction research (e.g., see Freese and Peterson 2017, pp.152-3). We note that our distinctions are consistent with those used in Ritchie’s (2020) book (which you are currently reading) and with others’ recent attempts to clarify terminology in this space (e.g., Patil, Peng, & Leek 2019).
In addition to distinguishing between reproducibility and replicability, we might also draw distinctions between different types of replications. Perhaps the most common is the distinction between a direct replication and a conceptual replication (cf. Crandall and Sherman 2016; Pridemore, Makel, & Plucker 2018, p.21). In a direct replication, one assesses the same theoretical or observational claim of a study using new data and measures that are collected or designed in such a way as to match the prior study’s design as exactly as possible, though perhaps with some notable exceptions (e.g., a larger sample size to improve statistical inferences). In contrast, a conceptual replication assesses the same theoretical or observational claim as a previous study using new data and/or new measurement procedures that are conceptually similar but not identical to those used in the previous study. We recommend reading Crandall and Sherman’s (2016) detailed discussion of these distinctions and their case for the relative utility of conceptual replications in advancing scientific progress; see also Nosek and Errington’s (2020) critique of these distinctions.
Underlying many of these terminological distinctions are differences in research procedures and research aims. For instance, drawing on Freese and Peterson’s (2017) typology of the different aims involved in replication and reproducibility research, reproduction research often aims to assess verifiability by attempting to reproduce or verify an original study’s findings using the same data and methods (e.g., code). Direct replications typically assess repeatability by testing whether the same findings emerge or repeat when applying the same methods to a new sample. Conceptual replications often assess repeatability, robustness, and/or generalizability of a theoretical or observational claim by, for instance, testing the original claim’s robustness to different measurement specifications using the same data or testing the claim’s generality to new samples (e.g., different groups or contexts). We recommend reading Freese and Peterson’s in-depth discussion of these aims; for convenience, we include their definitions (see 2017, p.152) of these four aims below.
To accomplish a conceptual replication, we will need to do and learn the following:
We assume that you are now familiar with installing and loading packages in R. Thus, when you see a package being used, we expect that you know it needs to be installed and that it needs to be loaded within your own R session in order to use it.
At this point, we also assume you are familiar with RStudio, with creating R Markdown (RMD) files, and with the basics of using the `here` package and a self-referential file structure to create a reproducible R Markdown file. If not, please review R Assignments 1 & 2.
We recommend doing a quick AIC reading of Orcutt’s (1987) study. To get you started, here is the abstract:
Based on Sutherland’s differential association theory and Becker’s early research on marijuana use, a contingency model estimating the exact probability of getting high on marijuana under various associational and motivational conditions is specified and tested. Data from surveys at two universities fit this model closely. Predicted first-order interactions and nonlinear effects of motivational balance and peer association are statistically significant and generate highly precise estimates of the probability of getting high. These results suggest that linear main-effects models employed in previous research on differential association processes do not adequately reflect the complex [causal] structure of Sutherland’s theory. In addition, this study raises serious questions about claims that differential association theory is untestable and has been made outdated by social learning theory.
While that all sounds pretty technical, fundamentally the paper is examining some core theoretical claims from Edwin Sutherland’s Differential Association Theory. Specifically, Orcutt sets out to test key aspects of Sutherland’s theory by examining how (a respondent’s perception of their) peers’ behavior is associated with a respondent’s own subjective attitudes toward that same behavior (i.e. their “definitions” of the behavior) and, ultimately, with the respondent’s own self-reported participation in that behavior. In this case, the focal behavior is marijuana use. Orcutt also distinguishes between “competent use” and “incompetent use” to incorporate Sutherland’s ideas about the necessity of learning the requisite skills to accomplish deviant behavior as well as to integrate some key insights from Howard Becker’s (1953) classic description of the process of learning to become a regular marijuana user (hence the “with a little help from Becker” part of the title).
Orcutt’s data came from two “in-class” surveys of undergraduate students at two universities—University of Minnesota and Florida State University—in 1972 and 1973, respectively. Here is the description Orcutt provides in the article:
Approximately half of the respondents in both surveys received a questionnaire focusing on alcohol use while the other half completed a parallel form dealing with marijuana use. This analysis will be restricted to students at each school who filled out the marijuana questionnaire–444 Minnesota undergraduates and 543 Florida State (FSU) undergraduates.
You can find more details on these data in two other articles that Orcutt published prior to this paper (see Orcutt, 1975 and Orcutt, 1978).
Recall, the purpose of a conceptual replication is to test the repeatability, robustness, or generalizability of a theoretical or observational claim from a previous study using new data collection and/or new measurement procedures that are conceptually similar but not identical to those used in the previous study. The NYS data provide an opportunity to do this, since they include variables that are similar, although not identical, to those used by Orcutt and, unlike Orcutt’s two university student samples, the NYS is a larger, nationally representative panel survey of youth followed for 10 years across seven waves of data.
Specifically, youth respondents in the NYS ranged from ages 11 to 17 in Wave 1 to ages 21 to 27 in wave 7. Since Orcutt’s sample is a student-based sample from two specific universities, we will start by identifying the wave of the NYS in which the youth respondents are most similar to the population from which Orcutt was sampling in terms of age. This will allow us to assess the repeatability of his findings on a group of youth in a similar age range while the nationally representative sample will permit us to assess whether Orcutt’s findings will generalize beyond college students attending the two specific universities that he studied.
Let’s look at the age range for each wave of NYS data:
Wave | Year | Minimum age | Maximum age |
---|---|---|---|
Wave 1 | 1976 | 11 | 17 |
Wave 2 | 1977 | 12 | 18 |
Wave 3 | 1978 | 13 | 19 |
Wave 4 | 1979 | 14 | 20 |
Wave 5 | 1980 | 15 | 21 |
Wave 6 | 1983 | 18 | 24 |
Wave 7 | 1987 | 21 | 27 |
The table above shows that the Wave 6 NYS data likely includes the panel members when they are most similar to Orcutt’s college-based samples in terms of age with an age range of 18-24. It is important to note that we do not actually know the specific age range or distribution of Orcutt’s sample because he did not appear to report it in the article. This is a situation where we are likely fairly safe in assuming Orcutt’s sample was largely drawn from 18-22 year-olds, but without that information provided in the article and without the data being shared publicly, we really cannot know for sure.
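The wave selection above can be checked with a few lines of arithmetic. The sketch below is only an illustration and is not part of the original assignment code: it retypes the age table by hand and counts how many single years of age each wave shares with a target range of 18 to 22, which is our assumption about Orcutt's sample rather than something he reported. The names `nys_ages`, `target_min`, `target_max`, and `overlap_years` are ours.

```r
# Age ranges copied by hand from the table above
nys_ages <- data.frame(
  wave    = 1:7,
  year    = c(1976, 1977, 1978, 1979, 1980, 1983, 1987),
  min_age = c(11, 12, 13, 14, 15, 18, 21),
  max_age = c(17, 18, 19, 20, 21, 24, 27)
)

# ASSUMPTION: Orcutt's undergraduates were mostly 18 to 22 years old
target_min <- 18
target_max <- 22

# Number of single years of age shared by each wave and the target range
nys_ages$overlap_years <- pmax(
  0,
  pmin(nys_ages$max_age, target_max) - pmax(nys_ages$min_age, target_min) + 1
)

# Wave with the largest overlap
best_wave <- nys_ages$wave[which.max(nys_ages$overlap_years)]
print(nys_ages)
print(best_wave)
```

Under this assumption, Wave 6 is the only wave that covers the full 18 to 22 range, which matches the choice made in the text.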
Note: You should already have the “NYS” folder in your “Datasets” folder from the previous assignment. You can check using the function from the previous assignment.
Note: Recall that when you first run this chunk during an R session, you will be asked to enter your ICPSR account information into the R console. Check the R console after trying to run the `icpsr_download` command to see and respond to the ICPSR username/password prompts. Also, recall that this requires that you have an ICPSR account, which you should have created for the previous assignment. If you receive errors, go to the ICPSR website and be sure that you are able to log in using the username (email) and password that you are entering in the R console.
library(icpsrdata)
library(here)
ifelse(dir.exists(here("Datasets/NYS/ICPSR_09948")), TRUE,
icpsr_download(file_id = c(9948),
download_dir = here("Datasets/NYS")))
## [1] TRUE
Note that we added the `icpsr_download` function to an `ifelse` statement. This is similar to what we did in “R Assignment 3” when we were creating the “NYS” sub-folder in our “Datasets” folder. See Part 2 of “R Assignment 3” for an explanation of the logic. It essentially checks whether the “ICPSR_09948” folder (which holds NYS Wave 6) exists in the “NYS” folder, returns the logical value “TRUE” if it does and, if it does not, runs the `icpsr_download` command to download the NYS Wave 6 data from ICPSR.
Next, read the data into R with the `read_spss` command like you have done in the previous three assignments.
library(haven)
nys_w6 <- read_spss(here("Datasets", "NYS", "ICPSR_09948", "DS0001", "09948-0001-Data.sav"))
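The "download only if the folder is missing" check used earlier can also be written with a plain `if` statement. The sketch below is a hedged illustration of that logic, not the assignment's own code: `download_if_missing()` and `fake_download()` are hypothetical helpers of ours, and a temporary folder stands in for the real “Datasets/NYS/ICPSR_09948” folder so that the example runs without an ICPSR account.

```r
# Hypothetical helper: run download_fun() only when target_dir is absent
download_if_missing <- function(target_dir, download_fun) {
  if (dir.exists(target_dir)) {
    return(TRUE)          # folder already there, nothing to download
  }
  download_fun()          # in practice this would wrap the icpsr_download() call
  dir.exists(target_dir)  # report whether the folder now exists
}

# Stand-in folder and stand-in "download" so the sketch is self-contained
demo_dir <- file.path(tempdir(), "ICPSR_demo")
unlink(demo_dir, recursive = TRUE)
fake_download <- function() dir.create(demo_dir, recursive = TRUE)

first_call  <- download_if_missing(demo_dir, fake_download)
# The second call must not trigger the download function at all
second_call <- download_if_missing(demo_dir, function() stop("should not run"))
print(c(first_call, second_call))
```

One reason to consider this form is that `if` is designed for a single TRUE/FALSE decision, whereas `ifelse` is vectorised and tries to store the value returned by whichever branch it evaluates.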
The next step is to identify the specific items from the NYS that allow us to measure the same constructs as Orcutt (1987) did. Recall that Orcutt (1987) was attempting to examine key propositions from Sutherland’s Differential Association Theory. Specifically, Orcutt was interested in the relationship between 1) associations with criminal/deviant patterns of behavior (i.e. peer’s use of marijuana), 2) individuals’ subjective attitudes toward that behavior (i.e., definitions favorable to marijuana use), and 3) individuals’ criminal/deviant behavior (i.e. self-reported marijuana use). Below is a simplified diagram of the theorized causal structure, or directed acyclic graph (DAG), representing the hypothesized relationships between these three variables.
Orcutt measured peer marijuana use with the following question: “Of your four closest friends, how many would you say use marijuana at least once a month?” The number of friends whom the respondent reported as using marijuana at least once a month served as the answer categories, which ranged from 0 to 4, with a mean of 1.2 (SD = 1.4) at Minnesota and 1.8 (SD = 1.6) at Florida State.
As you know from “R Assignment 3,” the NYS includes measures of peer delinquency and specifically peer marijuana use (V371 in NYS Wave 6). However, it is measured with the question: “Think of your friends. During the last year how many of them have used marijuana or hashish?” The answers were from a five-level ordinal scale with specific answer categories including: “None of them” (=1), “Very few of them” (=2), “Some of them” (=3), “Most of them” (=4), and “All of them” (=5).
You can see how this is a “conceptual replication” with our first measure. While both data sources include a measure of peer marijuana use, the concept is measured differently, and in a way that makes the measures and corresponding results not exactly comparable. To get a sense of how these differences might matter, just try to imagine how someone in Orcutt’s data who answered that “3” of their friends used “marijuana at least once a month” (above the average) would answer the question in the NYS: Would the person answer “very few of them” (=2) in the NYS? Perhaps so, if they have 20 friends. But what if the person has 4 friends? Might they respond with “most of them” (=4) instead?
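To make the thought experiment above concrete, the sketch below converts a count of marijuana-using friends into a share of all friends and then into an NYS-style answer category. The cutpoints are invented purely for illustration; neither Orcutt nor the NYS defines such a mapping, and `nys_category()` is a hypothetical helper of ours.

```r
# ILLUSTRATIVE ASSUMPTION: arbitrary cutpoints linking a share of friends
# to the five NYS answer categories
nys_category <- function(users, n_friends) {
  share <- users / n_friends
  if (share == 0) {
    "None of them"
  } else if (share <= 0.25) {
    "Very few of them"
  } else if (share <= 0.50) {
    "Some of them"
  } else if (share < 1) {
    "Most of them"
  } else {
    "All of them"
  }
}

# The same Orcutt-style answer ("3 friends") under two friendship-group sizes
answer_small_group <- nys_category(users = 3, n_friends = 4)   # share = 0.75
answer_large_group <- nys_category(users = 3, n_friends = 20)  # share = 0.15
print(c(answer_small_group, answer_large_group))
```

With these made-up cutpoints, the same count of three friends lands in different NYS categories depending on how many friends the respondent has, which is the comparability problem described above.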
If you had substantive knowledge of Sutherland’s Differential Association Theory and the criminological literature on peer influence in general, you may also have an opinion regarding which measure of exposure to peer delinquency - Orcutt’s or the NYS version - best captures the theoretical concept of “exposure to delinquent patterns.” Going down this rabbit hole isn’t necessary for our particular assignment, but it is essential to think about and recognize: 1) that the measurement of abstract social constructs can be a tricky, variable, and error-prone endeavor; and 2) that such seemingly small measurement differences like the one here can have important implications for testing theories and for assessing the replicability of research findings.
Look at the distribution and summary statistics for the “Peer marijuana use” variable using the `frq` and `descr` functions from the `sjmisc` package:
library(sjmisc)
nys_w6 %>%
frq(V371)
## Y6-367:FRIENDS-USED MARIJUANA (V371) <numeric>
## # total N=1725 valid N=1464 mean=2.46 sd=1.30
##
## Value | Label | N | Raw % | Valid % | Cum. %
## ---------------------------------------------------------
## 1 | None of them | 468 | 27.13 | 31.97 | 31.97
## 2 | Very few of them | 318 | 18.43 | 21.72 | 53.69
## 3 | Some of them | 349 | 20.23 | 23.84 | 77.53
## 4 | Most of them | 196 | 11.36 | 13.39 | 90.92
## 5 | All of them | 133 | 7.71 | 9.08 | 100.00
## <NA> | <NA> | 261 | 15.13 | <NA> | <NA>
nys_w6 %>%
descr(V371)
##
## ## Basic descriptive statistics
##
## var type label n NA.prc mean sd se md
## V371 numeric Y6-367:FRIENDS-USED MARIJUANA 1464 15.13 2.46 1.3 0.03 2
## trimmed range iqr skew
## 2.46 4 (1-5) 2 0.45
As you can see above, “None of them” is the modal category. Similar percentages of respondents answered “Very few of them” (18.4%) and “Some of them” (20.2%), and each is roughly equal to the last two categories combined: “Most of them” (11.4%) and “All of them” (7.7%). (These are the raw percentages, which keep the missing cases in the denominator.) This suggests the distribution is right skewed, or positively skewed. As the descriptive statistics show, and as is characteristic of a right skew, the mode (1) is less than the median (2), which is less than the mean (2.5).
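The ordering of mode, median, and mean described above can be verified directly from the frequency table, without access to the NYS file. The sketch below rebuilds the valid responses from the printed counts; the object names (`counts`, `marpeer_toy`, and so on) are ours.

```r
# Valid counts for categories 1 to 5, copied from the frq() output above
counts <- c(468, 318, 349, 196, 133)

# Rebuild the 1,464 valid responses from the counts
marpeer_toy <- rep(1:5, times = counts)

peer_mode   <- as.integer(names(which.max(table(marpeer_toy))))  # most common value
peer_median <- median(marpeer_toy)
peer_mean   <- mean(marpeer_toy)

# Valid percentages, for comparison with the "Valid %" column
valid_pct <- round(100 * counts / sum(counts), 2)

print(c(mode = peer_mode, median = peer_median, mean = round(peer_mean, 2)))
print(valid_pct)
```

The mode (1) falls below the median (2), which falls below the mean (about 2.46), consistent with a right-skewed distribution.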
For now, we will plan on keeping this variable “as-is” with five categories.
Orcutt measured “subjective definitions” regarding marijuana use with the question: “How would you generally characterize your opinions toward marijuana?” The answer categories across the Minnesota and FSU data were slightly different, although both used 5-point Likert scales (technically a Likert-type scale) with the Minnesota survey response categories ranging from “highly negative” to “highly positive” and the Florida State response categories ranging from “negative” to “positive” (p. 346).
The NYS also includes items that measure one’s “subjective definition” of marijuana use, particularly its “wrongness” (V356 in NYS Wave 6). The specific question asks: “How wrong is it for someone your age to use marijuana or hashish?” with answer categories on a 4-point ordinal scale including: “Not wrong” (=1), “A little bit wrong” (=2), “Wrong” (=3), and “Very wrong” (=4).
Let’s look at the distribution and summary statistics for this item.
nys_w6 %>%
frq(V356)
## Y6-352:USE MARIJUANA (V356) <numeric>
## # total N=1725 valid N=1496 mean=2.74 sd=1.02
##
## Value | Label | N | Raw % | Valid % | Cum. %
## -----------------------------------------------------------
## 1 | Not wrong | 215 | 12.46 | 14.37 | 14.37
## 2 | A little bit wrong | 381 | 22.09 | 25.47 | 39.84
## 3 | Wrong | 473 | 27.42 | 31.62 | 71.46
## 4 | Very wrong | 427 | 24.75 | 28.54 | 100.00
## <NA> | <NA> | 229 | 13.28 | <NA> | <NA>
nys_w6 %>%
descr(V356)
##
## ## Basic descriptive statistics
##
## var type label n NA.prc mean sd se md trimmed
## V356 numeric Y6-352:USE MARIJUANA 1496 13.28 2.74 1.02 0.03 3 2.74
## range iqr skew
## 3 (1-4) 2 -0.27
This item as originally coded is “left skewed” or “negative skewed,” with the “wrong” category as the modal category and the “very wrong” category as the second most common.
Orcutt took his five-category item and recoded it into three categories with “undecided” as a “neutral” definition and responses on the positive (e.g., “Highly positive” and “Positive”) or negative side (e.g., “Highly negative” and “Negative”) of this category coded as “positive” or “negative” respectively. This made sense given Orcutt used a “bipolar” rating scale for his answer categories in designing his survey. A “bipolar” rating scale simply refers to a set of answer categories that allow a respondent to answer in opposite directions, usually separated by a midpoint (in Orcutt’s case the “undecided” category).
The NYS, however, uses a “unipolar” rating scale (see here for brief discussion of the distinction between bipolar and unipolar survey response scales). A unipolar rating scale includes answer categories that only move in one direction (in the case of the NYS, from “Not wrong” to “Very wrong”). As a result, the “subjective definition” item in the NYS does not lend itself nicely to a “neutral” categorization, although one could argue that the “A little bit wrong” answer conceptually aligns with Orcutt’s “undecided” answer category. Arguably, the NYS approach is also a less desirable match to Sutherland’s concept of definitions, which presumably can range in content from favorable to unfavorable to crime. Below, when we recode the data, we will ultimately decide to collapse the “subjective definition” responses in our analysis into two dummy variables that indicate whether the respondent reportedly has internalized: (A) “negative” definitions unfavorable to marijuana use (i.e., 1=“At least a little bit wrong”) or (B) “positive” definitions favorable to marijuana use (1=“Not wrong”).
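The contrast between the two collapsing strategies can be sketched on toy values. The numeric codes for Orcutt's item (1 = most negative, 3 = undecided, 5 = most positive) are our assumption, since the article's exact coding is not reproduced here; the NYS codes follow the 1 to 4 scale described above. All object names are ours.

```r
# ASSUMED codes for Orcutt's bipolar item: 1-2 negative, 3 undecided, 4-5 positive
orcutt_opinion <- c(1, 2, 3, 4, 5)
orcutt_collapsed <- cut(
  orcutt_opinion,
  breaks = c(0, 2, 3, 5),                       # intervals (0,2], (2,3], (3,5]
  labels = c("negative", "neutral", "positive")
)

# NYS unipolar item: 1 = "Not wrong" ... 4 = "Very wrong" (no natural midpoint)
nys_wrong    <- c(1, 2, 3, 4)
nys_positive <- ifelse(nys_wrong == 1, 1, 0)    # "Not wrong"
nys_negative <- ifelse(nys_wrong >= 2, 1, 0)    # at least "A little bit wrong"

print(data.frame(orcutt_opinion, orcutt_collapsed))
print(data.frame(nys_wrong, nys_positive, nys_negative))
```

The bipolar item yields three groups around a midpoint, while the unipolar NYS item splits cleanly into only two complementary dummies.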
Orcutt’s key dependent variable of “personal use of marijuana to get high” was measured with the question: “Which of the following statements best described the approximate number of times you have gotten ‘high’ on marijuana during the past year?” The answer categories included: 1) “I did not use marijuana during the past year;” 2) “I used marijuana during the past year, but did not get ‘high’;” 3) “I got ‘high’ on marijuana during the past year; but only once or twice;” 4) “I got ‘high’ on marijuana at least 3 times during the past year, but not more than 12 times;” and 5) “I got ‘high’ on marijuana more than 12 times during the past year.”
The NYS includes multiple questions about marijuana use with two being key for our purposes. First is a question about use (V890 in NYS Wave 6): “How many times in the last year have you used marijuana or hashish? (GRASS, POT, HASH)” with the specific number of times reported coded as answers. Second is a question about getting high (V966 in NYS Wave 6): “How many times in the past year have you been high on marijuana?” with the specific number of times reported coded as answers.
Take a moment and think which of these items from the NYS best reflects what Orcutt was trying to measure.
Ultimately, Orcutt makes this decision pretty easy on us. He was specifically interested in the distinction between “minimally competent use” and incompetent use. Here is what he said on pg. 347:
An important feature of this item is that it measures a respondent’s self reported ability to get high which, for Becker (1953), is a defining characteristic of a marijuana user. That is, this measure distinguishes between those who are minimally competent users-who have acquired the physical and subjective techniques for getting high-and those who are not. Therefore, according to Becker’s conception, respondents who checked either of the first two statements should be classified as nonusers. Thus, the dependent variable in this analysis is a proportional measure of initiation into marijuana use- Ownuse-based on a binary scoring of nonusers (0 = statements 1 or 2) versus users (1 = statements 3, 4, or 5).
nys_w6 %>%
frq(V890, V966)
## Y6-886:MARIJUANA-FREQ (V890) <numeric>
## # total N=1725 valid N=1496 mean=32.74 sd=102.52
##
## Value | N | Raw % | Valid % | Cum. %
## --------------------------------------
## 0 | 847 | 49.10 | 56.62 | 56.62
## 1 | 71 | 4.12 | 4.75 | 61.36
## 2 | 75 | 4.35 | 5.01 | 66.38
## 3 | 41 | 2.38 | 2.74 | 69.12
## 4 | 31 | 1.80 | 2.07 | 71.19
## 5 | 34 | 1.97 | 2.27 | 73.46
## 6 | 16 | 0.93 | 1.07 | 74.53
## 7 | 5 | 0.29 | 0.33 | 74.87
## 8 | 4 | 0.23 | 0.27 | 75.13
## 9 | 1 | 0.06 | 0.07 | 75.20
## 10 | 34 | 1.97 | 2.27 | 77.47
## 11 | 1 | 0.06 | 0.07 | 77.54
## 12 | 30 | 1.74 | 2.01 | 79.55
## 14 | 1 | 0.06 | 0.07 | 79.61
## 15 | 10 | 0.58 | 0.67 | 80.28
## 16 | 1 | 0.06 | 0.07 | 80.35
## 20 | 30 | 1.74 | 2.01 | 82.35
## 21 | 1 | 0.06 | 0.07 | 82.42
## 22 | 1 | 0.06 | 0.07 | 82.49
## 24 | 2 | 0.12 | 0.13 | 82.62
## 25 | 16 | 0.93 | 1.07 | 83.69
## 26 | 1 | 0.06 | 0.07 | 83.76
## 30 | 16 | 0.93 | 1.07 | 84.83
## 35 | 1 | 0.06 | 0.07 | 84.89
## 40 | 8 | 0.46 | 0.53 | 85.43
## 45 | 1 | 0.06 | 0.07 | 85.49
## 50 | 34 | 1.97 | 2.27 | 87.77
## 52 | 23 | 1.33 | 1.54 | 89.30
## 60 | 6 | 0.35 | 0.40 | 89.71
## 62 | 1 | 0.06 | 0.07 | 89.77
## 70 | 2 | 0.12 | 0.13 | 89.91
## 75 | 4 | 0.23 | 0.27 | 90.17
## 80 | 4 | 0.23 | 0.27 | 90.44
## 85 | 1 | 0.06 | 0.07 | 90.51
## 100 | 29 | 1.68 | 1.94 | 92.45
## 104 | 2 | 0.12 | 0.13 | 92.58
## 130 | 1 | 0.06 | 0.07 | 92.65
## 144 | 2 | 0.12 | 0.13 | 92.78
## 150 | 11 | 0.64 | 0.74 | 93.52
## 156 | 1 | 0.06 | 0.07 | 93.58
## 160 | 1 | 0.06 | 0.07 | 93.65
## 175 | 1 | 0.06 | 0.07 | 93.72
## 200 | 14 | 0.81 | 0.94 | 94.65
## 208 | 2 | 0.12 | 0.13 | 94.79
## 240 | 1 | 0.06 | 0.07 | 94.85
## 250 | 3 | 0.17 | 0.20 | 95.05
## 270 | 1 | 0.06 | 0.07 | 95.12
## 300 | 19 | 1.10 | 1.27 | 96.39
## 340 | 1 | 0.06 | 0.07 | 96.46
## 350 | 1 | 0.06 | 0.07 | 96.52
## 360 | 6 | 0.35 | 0.40 | 96.93
## 365 | 28 | 1.62 | 1.87 | 98.80
## 400 | 1 | 0.06 | 0.07 | 98.86
## 450 | 1 | 0.06 | 0.07 | 98.93
## 500 | 3 | 0.17 | 0.20 | 99.13
## 600 | 4 | 0.23 | 0.27 | 99.40
## 700 | 3 | 0.17 | 0.20 | 99.60
## 730 | 2 | 0.12 | 0.13 | 99.73
## 900 | 1 | 0.06 | 0.07 | 99.80
## 999 | 3 | 0.17 | 0.20 | 100.00
## <NA> | 229 | 13.28 | <NA> | <NA>
##
## Y6-962:HIGH ON MARIJUANA PAST YEAR (V966) <numeric>
## # total N=1725 valid N=649 mean=61.87 sd=134.07
##
## Value | N | Raw % | Valid % | Cum. %
## ---------------------------------------
## 0 | 66 | 3.83 | 10.17 | 10.17
## 1 | 64 | 3.71 | 9.86 | 20.03
## 2 | 77 | 4.46 | 11.86 | 31.90
## 3 | 33 | 1.91 | 5.08 | 36.98
## 4 | 35 | 2.03 | 5.39 | 42.37
## 5 | 30 | 1.74 | 4.62 | 47.00
## 6 | 20 | 1.16 | 3.08 | 50.08
## 7 | 8 | 0.46 | 1.23 | 51.31
## 8 | 5 | 0.29 | 0.77 | 52.08
## 9 | 1 | 0.06 | 0.15 | 52.23
## 10 | 31 | 1.80 | 4.78 | 57.01
## 12 | 21 | 1.22 | 3.24 | 60.25
## 14 | 1 | 0.06 | 0.15 | 60.40
## 15 | 15 | 0.87 | 2.31 | 62.71
## 18 | 1 | 0.06 | 0.15 | 62.87
## 20 | 25 | 1.45 | 3.85 | 66.72
## 22 | 1 | 0.06 | 0.15 | 66.87
## 24 | 2 | 0.12 | 0.31 | 67.18
## 25 | 9 | 0.52 | 1.39 | 68.57
## 26 | 2 | 0.12 | 0.31 | 68.88
## 30 | 16 | 0.93 | 2.47 | 71.34
## 35 | 2 | 0.12 | 0.31 | 71.65
## 40 | 8 | 0.46 | 1.23 | 72.88
## 45 | 2 | 0.12 | 0.31 | 73.19
## 48 | 1 | 0.06 | 0.15 | 73.34
## 50 | 27 | 1.57 | 4.16 | 77.50
## 52 | 10 | 0.58 | 1.54 | 79.04
## 60 | 2 | 0.12 | 0.31 | 79.35
## 62 | 1 | 0.06 | 0.15 | 79.51
## 70 | 2 | 0.12 | 0.31 | 79.82
## 75 | 3 | 0.17 | 0.46 | 80.28
## 80 | 3 | 0.17 | 0.46 | 80.74
## 85 | 2 | 0.12 | 0.31 | 81.05
## 100 | 30 | 1.74 | 4.62 | 85.67
## 104 | 2 | 0.12 | 0.31 | 85.98
## 110 | 1 | 0.06 | 0.15 | 86.13
## 125 | 1 | 0.06 | 0.15 | 86.29
## 150 | 8 | 0.46 | 1.23 | 87.52
## 156 | 1 | 0.06 | 0.15 | 87.67
## 160 | 2 | 0.12 | 0.31 | 87.98
## 200 | 11 | 0.64 | 1.69 | 89.68
## 240 | 1 | 0.06 | 0.15 | 89.83
## 250 | 4 | 0.23 | 0.62 | 90.45
## 270 | 2 | 0.12 | 0.31 | 90.76
## 300 | 20 | 1.16 | 3.08 | 93.84
## 320 | 1 | 0.06 | 0.15 | 93.99
## 350 | 2 | 0.12 | 0.31 | 94.30
## 352 | 1 | 0.06 | 0.15 | 94.45
## 360 | 4 | 0.23 | 0.62 | 95.07
## 365 | 21 | 1.22 | 3.24 | 98.31
## 400 | 1 | 0.06 | 0.15 | 98.46
## 450 | 1 | 0.06 | 0.15 | 98.61
## 500 | 1 | 0.06 | 0.15 | 98.77
## 600 | 2 | 0.12 | 0.31 | 99.08
## 700 | 1 | 0.06 | 0.15 | 99.23
## 999 | 5 | 0.29 | 0.77 | 100.00
## <NA> | 1076 | 62.38 | <NA> | <NA>
nys_w6 %>%
descr(V890, V966)
##
## ## Basic descriptive statistics
##
## var type label n NA.prc mean sd se
## V890 numeric Y6-886:MARIJUANA-FREQ 1496 13.28 32.74 102.52 2.65
## V966 numeric Y6-962:HIGH ON MARIJUANA PAST YEAR 649 62.38 61.87 134.07 5.26
## md trimmed range iqr skew
## 0 6.10 999 (0-999) 8 4.90
## 6 27.62 999 (0-999) 48 3.81
What jumps out at you from these distributions? Like before, and something common for lots of deviant behaviors, is that the data are right skewed for the question about “using marijuana,” with zero being the modal answer. However, another thing that should jump out at you is the number of missing (“NA”) cases in each of these items. Specifically, the question about getting high (V966) is missing for 62% of the sample!
- Unfortunately, the original page numbers are not included in the instrument that is attached to the ICPSR codebook. However, with some simple math, we can be fairly confident that the large number of missing for the question about getting high resulted from respondents who reported no use not being asked that question.
- Go back and look at the distributions for **V890** about use and **V966** about getting high. Notice that for **V890**, there are 1,496 "valid" responses (i.e. non-missing) and 847 respondents who reported zero marijuana use in the past year. If you subtract 847 from 1,496, you get 649---the number of valid cases in **V966**.
- Ultimately, this means we'll need both questions to construct a measure similar to Orcutt (1987). Essentially, we'll want to create a dichotomous variable that codes those who answered zero on either the question about "using marijuana" or the question about "getting high" from using marijuana as "non-users" (0 = Incompetent-/Non-User). Any respondent that answered one or above on the question about getting high will be coded as a (competent) "user" in Orcutt's terms (1 = Competent User).
- ***Note:*** If we were doing this conceptual replication in the wild, we would likely want to directly examine the distinction between competent and incompetent use by distinguishing between 1) non-users (i.e., answered zero on the question about use), 2) incompetent users (answered 1 or more on the question about use and answered zero on the question about getting high), and 3) competent users (answered above zero on the question about getting high).
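The skip-pattern arithmetic and the three-group distinction described in the list above can be sketched on toy data. This is a minimal illustration under our own assumptions, not the coding used later in the assignment: `classify_use()` is a hypothetical helper, and we assume that a report of zero use takes priority when the two answers disagree.

```r
# Skip-pattern check using the counts reported in the frq() output
valid_use <- 1496   # valid answers on the use question (V890)
zero_use  <- 847    # respondents reporting zero use
asked_high <- valid_use - zero_use

# Hypothetical helper: three-category use status
classify_use <- function(maruse, marhigh) {
  status <- rep(NA_character_, length(maruse))
  status[!is.na(marhigh) & marhigh >= 1] <- "competent user"
  status[!is.na(maruse) & maruse >= 1 &
           !is.na(marhigh) & marhigh == 0] <- "incompetent user"
  # ASSUMPTION: zero reported use overrides any answer about getting high
  status[!is.na(maruse) & maruse == 0] <- "non-user"
  factor(status, levels = c("non-user", "incompetent user", "competent user"))
}

# Toy respondents (made-up values, not NYS data)
toy <- data.frame(
  maruse  = c(0, 2, 5, NA, 0),
  marhigh = c(NA, 0, 5, NA, 1)
)
toy$use_status <- classify_use(toy$maruse, toy$marhigh)
print(asked_high)
print(toy)
```

Collapsing the first two levels of `use_status` into 0 and coding competent users as 1 gives a dichotomy of the kind Orcutt describes.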
Up to this point, the bulk of the work has been intellectual and theoretical. Indeed, that’s the “conceptual” part of a conceptual replication. But now that we know what items from the NYS align most closely with the items used in Orcutt’s (1987) study and have a good idea of how we want to use them, we need to wrangle and recode the data so that we can analyze it. Like with “R Assignment #3,” this will involve selecting the specific variables, giving them informative names, and recoding them to closely resemble Orcutt’s coding decisions. Like with the previous assignment, in addition to the items identified above that will be used for the conceptual replication, we will also select the “CASEID” variable in order to maintain the individual identifier and we’ll create the “wave” variable to indicate that the data we are working with comes from Wave 6.
Here is the code for selecting items, renaming them, and recoding them. We’ll explain the logic below:
library(dplyr)
nys_w6_trim <- nys_w6 %>%
  dplyr::select(CASEID, V371, V356, V890, V966) %>%
  # Provide informative names:
  rename(marpeer = V371,
         mardef = V356,
         maruse = V890,
         marhigh = V966) %>%
  # Recode key variables:
  mutate(
    # Factor versions of marpeer and mardef
    marpeer_fct = as_factor(marpeer),
    mardef_fct = as_factor(mardef),
    # Negative definition: "A little bit wrong" to "Very wrong" (missing stays NA)
    mardef_neg = ifelse(mardef %in% c(2, 3, 4), 1, 0),
    mardef_neg = ifelse(is.na(mardef), NA, mardef_neg),
    # Positive definition: "Not wrong"
    mardef_pos = ifelse(mardef == 1, 1, 0),
    # Neutral definition: "A little bit wrong"
    mardef_neut = ifelse(mardef == 2, 1, 0),
    # Negative definition ("Wrong" and "Very wrong") to pair with the neutral category
    mardef_negneut = ifelse((mardef == 3 | mardef == 4), 1, 0),
    mardef_negneut = ifelse(is.na(mardef), NA, mardef_negneut),
    # Dummy variable for any use
    maruse_dic = ifelse(maruse >= 1, 1, 0),
    # Responses of 1 or greater on marhigh = 1; responses of 0 on maruse are coded as 0
    marhigh_dic = ifelse(marhigh >= 1, 1, 0),
    marhigh_dic = ifelse(maruse == 0, 0, marhigh_dic),
    wave = 6)
glimpse(nys_w6_trim)
## Rows: 1,725
## Columns: 14
## $ CASEID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ marpeer <dbl+lbl> 4, NA, 1, 4, 3, 3, 4, 3, 3, 1, 3, 2, 3,…
## $ mardef <dbl+lbl> 2, NA, 4, 2, 2, 2, 2, 2, 2, 4, 2, 3, 1,…
## $ maruse <dbl> 300, NA, 0, 300, 2, 100, 4, 4, 0, 0, 0, 0, 9, 25, 150, …
## $ marhigh <dbl> 300, NA, NA, 300, 2, 100, 4, 1, NA, NA, NA, 1, 10, 25, …
## $ marpeer_fct <fct> Most of them, NA, None of them, Most of them, Some of t…
## $ mardef_fct <fct> A little bit wrong, NA, Very wrong, A little bit wrong,…
## $ mardef_neg <dbl> 1, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, …
## $ mardef_pos <dbl> 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, …
## $ mardef_neut <dbl> 1, NA, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, …
## $ mardef_negneut <dbl> 0, NA, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, …
## $ maruse_dic <dbl> 1, NA, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, …
## $ marhigh_dic <dbl> 1, NA, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, …
## $ wave <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
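The recoding step above depends on how `ifelse`, `==`, and `%in%` treat missing values. The sketch below replays the same logic on small made-up vectors so the intermediate results are visible; the vector names are ours and the values are not NYS data.

```r
# Toy "definitions" item with one missing response
mardef <- c(1, 2, 3, 4, NA)

# %in% never returns NA, so a missing response is first coded 0 ...
neg_raw <- ifelse(mardef %in% c(2, 3, 4), 1, 0)
# ... which is why a second step resets missing responses to NA
neg_fixed <- ifelse(is.na(mardef), NA, neg_raw)

# == propagates NA on its own, so no second step is needed here
pos <- ifelse(mardef == 1, 1, 0)

# Toy use and getting-high items; non-users were not asked about getting high
maruse  <- c(0, 3, 5, NA)
marhigh <- c(NA, 0, 5, NA)

high_step1 <- ifelse(marhigh >= 1, 1, 0)          # non-users are still NA here
high_step2 <- ifelse(maruse == 0, 0, high_step1)  # non-users become 0

print(rbind(neg_raw, neg_fixed, pos))
print(rbind(high_step1, high_step2))
```

After the second step, only respondents who are missing on the use question itself remain NA.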
There are a few things that we should explain about the above code.
- In the `nys_w6` data that we imported from ICPSR, all of the categorical variables are treated as numerical in the data frame (specifically “double” format with labels). You can see this by looking at the output from the `glimpse(nys_w6_trim)` function above (see the `<dbl+lbl>` type shown next to “marpeer” and “mardef”).
  - We want to make sure that R knows these variables are categorical and thus we need to convert them to factors, R’s method for working with categorical variables (see Ch. 15 of R for Data Science for more details). Like with most things, the tidyverse suite of packages includes a package for working with factors called “forcats”.
  - Actually, the data frame already includes value labels for each variable. You can see this for any variable by using the function `get_labels()` that is part of the “sjlabelled” package (you can also see this in the `glimpse()` output above, as the “marpeer” and “mardef” items have `<dbl+lbl>` next to them).
library(sjlabelled)
get_labels(nys_w6$V371)
## [1] "None of them" "Very few of them" "Some of them" "Most of them"
## [5] "All of them"
get_labels(nys_w6$V356)
## [1] "Not wrong" "A little bit wrong" "Wrong"
## [4] "Very wrong"
The as_factor function simply tells R to create a factor variable and, in this case, uses the built-in value labels of the variables we are turning into factors to indicate the factor levels.
We used the ifelse command to create four dummy variables related to the “subjective definitions” item. Remember that the ifelse command takes the form of a logical test: ifelse(test, yes, no). We first created two dummy variables distinguishing between “positive definitions” (mardef_pos) and “negative definitions” (mardef_neg).
For the negative definitions dummy, we used the %in% operator (see the Data Transformation chapter in R for Data Science for an explanation of the different logical operators) and listed the values we wanted coded as one in a vector using the c() function. A vector is simply a one-dimensional array or series of values (see here for brief explanation). This is basically a shortcut for writing ifelse((mardef == 2 | mardef == 3 | mardef == 4), 1, 0).
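To make the shortcut concrete, here is a minimal sketch in base R. The vector `mardef_toy` is made up for illustration and is not the NYS data. It shows that the `%in%` version and the chained `==` version give the same dummy variable when there are no missing values, and that the two forms treat missing values differently.

```r
# Toy "wrongness" codes (1 = Not wrong ... 4 = Very wrong); hypothetical values
mardef_toy <- c(1, 2, 3, 4, 2, 1)

# Dummy built with %in% and a vector of the values to code as 1
neg_in <- ifelse(mardef_toy %in% c(2, 3, 4), 1, 0)

# Same dummy built with chained == comparisons
neg_or <- ifelse(mardef_toy == 2 | mardef_toy == 3 | mardef_toy == 4, 1, 0)

# Both approaches agree for this toy vector
identical(neg_in, neg_or)

# Caution: the two forms differ for missing values
na_in <- NA %in% c(2, 3, 4)  # FALSE (a missing value would be coded 0)
na_eq <- NA == 2             # NA (a missing value would stay missing)
```

Because of that last difference, it is worth checking how missing values were coded whenever you build a dummy variable with `%in%`.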
We used the if_else function to create two dummy variables for the using marijuana (“maruse_dic”) and getting high from marijuana (“marhigh_dic”) items.
For the “maruse_dic” variable, we simply created a dummy variable similar to how we did in “R Assignment 3” by telling R that if the “maruse” variable is greater than or equal to one, make the “maruse_dic” variable equal to 1 and make it zero otherwise (i.e. if it equals zero).
For the “marhigh_dic” variable, we needed two if_else commands that are essentially nested. This is because we needed the “marhigh_dic” variable to account for those who reported using marijuana but did not get high. Respondents who reported zero use were not asked the getting high question, so they are still missing on “marhigh_dic” after the first if_else command. Thus, the second if_else command tells R that if “maruse” equals zero, make “marhigh_dic” equal to zero, and otherwise leave “marhigh_dic” at the value it already has. This is how we nested the two if_else commands. If the second command does not apply (i.e. maruse does not equal zero), then the “marhigh_dic” variable keeps the value we told it to take in the first if_else command.
-Note: If you want to get a sense of this, run the above code where we created the “nys_w6_trim” data but comment out the second if_else command and create a frequency table for the “marhigh_dic” variable. Here is what it will look like:
## marhigh_dic <numeric>
## # total N=1725 valid N=649 mean=0.90 sd=0.30
##
## Value | N | Raw % | Valid % | Cum. %
## ---------------------------------------
## 0 | 66 | 3.83 | 10.17 | 10.17
## 1 | 583 | 33.80 | 89.83 | 100.00
## <NA> | 1076 | 62.38 | <NA> | <NA>
- Notice how it has 1,076 missing values? This includes the respondents who were legitimately missing on both the "marijuana use" and "getting high" items but also those who answered zero to the "marijuana use" question and thus were not asked the "getting high" question. Compare this to the variable you actually created with the nested `if_else` commands:
## marhigh_dic <numeric>
## # total N=1725 valid N=1493 mean=0.39 sd=0.49
##
## Value | N | Raw % | Valid % | Cum. %
## --------------------------------------
## 0 | 911 | 52.81 | 61.02 | 61.02
## 1 | 582 | 33.74 | 38.98 | 100.00
## <NA> | 232 | 13.45 | <NA> | <NA>
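The two-step logic can be sketched on a tiny made-up data frame. Everything here is hypothetical (the `toy` data, `step1`, and `step2` are illustration names), and we use base R's `ifelse` rather than dplyr's `if_else` so the sketch runs without loading any packages.

```r
# Toy data (hypothetical): maruse = times used, marhigh = times high
# marhigh is NA for non-users because the skip pattern means they were not asked
toy <- data.frame(
  maruse  = c(0, 0, 3, 5, NA),
  marhigh = c(NA, NA, 0, 5, NA)
)

# Step 1: dichotomize the getting high item; missing values stay missing
step1 <- ifelse(toy$marhigh >= 1, 1, 0)

# Step 2: non-users are set to 0; everyone else keeps their step-1 value
step2 <- ifelse(toy$maruse == 0, 0, step1)

step1  # NA NA  0  1 NA
step2  #  0  0  0  1 NA
```

After step 1 alone, the two non-users look like missing cases. After step 2, only the respondent who is missing on both items remains missing, which mirrors the drop in missing values between the two frequency tables above.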
Remember, anytime you recode and/or manipulate data, you want to check that R did what you wanted it to do. The thing about programming languages like R is that they will do exactly what you tell them to do (or won’t do something because you didn’t speak to them correctly). But what you tell them to do is not always what you expect them to do. So it is crucial to check your data after you have made changes.
For the factor variables, we can do this with the flat_table() function from the “sjmisc” package.
library(sjmisc)
nys_w6_trim %>%
flat_table(marpeer_fct, marpeer, show.values = TRUE)
## marpeer [1] None of them [2] Very few of them [3] Some of them [4] Most of them [5] All of them
## marpeer_fct
## None of them 468 0 0 0 0
## Very few of them 0 318 0 0 0
## Some of them 0 0 349 0 0
## Most of them 0 0 0 196 0
## All of them 0 0 0 0 133
nys_w6_trim %>%
flat_table(mardef_fct, mardef, show.values = TRUE)
## mardef [1] Not wrong [2] A little bit wrong [3] Wrong [4] Very wrong
## mardef_fct
## Not wrong 215 0 0 0
## A little bit wrong 0 381 0 0
## Wrong 0 0 473 0
## Very wrong 0 0 0 427
Notice that we included the show.values = TRUE option in the flat_table() function. This is because the flat_table() function automatically includes the value labels. By including the show.values = TRUE option, it also places the actual value in the data set in brackets next to the value labels. Notice how the new factor variable “mardef_fct” doesn’t have any bracketed values? That’s because by creating it as a factor variable, R now recognizes it as a true categorical variable without a real numerical value. Of course, R still has the levels ordered to represent the numerical values of the original variable. Thus, if you include the factor variables in a summary statistics table as before, it will give you the same values.
Also note that if you want to check the levels of a factor, you can
use the base R command levels()
and specify the variable
from a specific data frame by using the $
operator.
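As a small illustration of `levels()` and of how a factor keeps the ordering of the original codes, consider the sketch below. It uses a made-up vector and base R's `factor()` instead of `as_factor()`, so the names (`toy`, `mardef_lbl`) and values are assumptions for illustration only.

```r
# Toy codes and labels mirroring the "wrongness" item (values made up)
mardef_lbl <- c("Not wrong", "A little bit wrong", "Wrong", "Very wrong")
toy <- data.frame(mardef = c(2, 4, 1, 3, 2))

# Build a factor whose levels follow the order of the numeric codes 1 to 4
toy$mardef_fct <- factor(toy$mardef, levels = 1:4, labels = mardef_lbl)

# Inspect the levels of a factor in a specific data frame with $
levels(toy$mardef_fct)

# The underlying integer codes still line up with the original values
as.integer(toy$mardef_fct)  # 2 4 1 3 2

# A quick cross-check: every case should fall on the diagonal
table(toy$mardef_fct, toy$mardef)
```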
nys_w6_trim %>%
descr(marpeer, marpeer_fct, mardef, mardef_fct)
##
## ## Basic descriptive statistics
##
## var type label n NA.prc mean sd
## marpeer numeric Y6-367:FRIENDS-USED MARIJUANA 1464 15.13 2.46 1.30
## marpeer_fct categorical Y6-367:FRIENDS-USED MARIJUANA 1464 15.13 2.46 1.30
## mardef numeric Y6-352:USE MARIJUANA 1496 13.28 2.74 1.02
## mardef_fct categorical Y6-352:USE MARIJUANA 1496 13.28 2.74 1.02
## se md trimmed range iqr skew
## 0.03 2 2.46 4 (1-5) 2 0.45
## 0.03 2 2.46 4 (1-5) 2 0.45
## 0.03 3 2.74 3 (1-4) 2 -0.27
## 0.03 3 2.74 3 (1-4) 2 -0.27
For our dummy variables, we simply need to check that they are coded as we expected (i.e. the answer categories we thought we assigned to 1 and 0 are actually assigned to those values) and that they are mutually exclusive (related dummy variables do not overlap). To do that, we can look at cross-tabulations (i.e. “crosstabs”) with them and the variables from which they were created:
nys_w6_trim %>%
flat_table(mardef, mardef_neg, show.values = TRUE)
## mardef_neg 0 1
## mardef
## [1] Not wrong 215 0
## [2] A little bit wrong 0 381
## [3] Wrong 0 473
## [4] Very wrong 0 427
nys_w6_trim %>%
flat_table(mardef, mardef_pos, show.values = TRUE)
## mardef_pos 0 1
## mardef
## [1] Not wrong 0 215
## [2] A little bit wrong 381 0
## [3] Wrong 473 0
## [4] Very wrong 427 0
nys_w6_trim %>%
flat_table(mardef, mardef_neut, show.values = TRUE)
## mardef_neut 0 1
## mardef
## [1] Not wrong 215 0
## [2] A little bit wrong 0 381
## [3] Wrong 473 0
## [4] Very wrong 427 0
nys_w6_trim %>%
flat_table(mardef, mardef_negneut, show.values = TRUE)
## mardef_negneut 0 1
## mardef
## [1] Not wrong 215 0
## [2] A little bit wrong 381 0
## [3] Wrong 0 473
## [4] Very wrong 0 427
Our dummy coding seems to have worked as expected and, within each related set of dummy variables, the categories are mutually exclusive. That is, no observation is coded 1 on more than one variable within the “mardef_pos” and “mardef_neg” set or within the “mardef_pos”, “mardef_neut”, and “mardef_negneut” set.
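If you prefer a single check over reading several crosstabs, one option is to add the related dummy variables together: if they are mutually exclusive and exhaustive, the sum should equal 1 for every non-missing case. The sketch below uses a made-up vector (`mardef_toy`) and base R, so it is an illustration of the idea rather than the code used above.

```r
# Toy "wrongness" codes with one missing value (hypothetical)
mardef_toy <- c(1, 2, 3, 4, NA, 2)

# Three-category dummy set: positive, neutral, negative
pos     <- ifelse(mardef_toy == 1, 1, 0)  # Not wrong
neut    <- ifelse(mardef_toy == 2, 1, 0)  # A little bit wrong
negneut <- ifelse(mardef_toy >= 3, 1, 0)  # Wrong or Very wrong

# Each non-missing case should be coded 1 on exactly one dummy
dummy_sum <- pos + neut + negneut
all_exclusive <- all(dummy_sum == 1, na.rm = TRUE)
all_exclusive

# Tabulate the sums, keeping missing values visible
table(dummy_sum, useNA = "ifany")
```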
nys_w6_trim %>%
frq(maruse_dic, marhigh_dic)
## maruse_dic <numeric>
## # total N=1725 valid N=1496 mean=0.43 sd=0.50
##
## Value | N | Raw % | Valid % | Cum. %
## --------------------------------------
## 0 | 847 | 49.10 | 56.62 | 56.62
## 1 | 649 | 37.62 | 43.38 | 100.00
## <NA> | 229 | 13.28 | <NA> | <NA>
##
## marhigh_dic <numeric>
## # total N=1725 valid N=1493 mean=0.39 sd=0.49
##
## Value | N | Raw % | Valid % | Cum. %
## --------------------------------------
## 0 | 911 | 52.81 | 61.02 | 61.02
## 1 | 582 | 33.74 | 38.98 | 100.00
## <NA> | 232 | 13.45 | <NA> | <NA>
The first thing that jumps out is that the “marhigh_dic” variable has three additional responses that are missing. This means that three people who answered the question about using marijuana in the past year did not answer the question about getting high in the past year. But this poses a puzzle. Recall that the number of valid cases on the getting high item (V966) was exactly the number of valid cases on the use item minus the number of respondents who answered they had used marijuana “zero” times in the past year. What gives?
nys_w6_trim %>%
mutate(maruse_na = ifelse(is.na(maruse), -1, maruse_dic),
marhigh_na = ifelse(is.na(marhigh), -1, marhigh_dic)) %>%
flat_table(maruse_na, marhigh_na, exclude = NULL)
## marhigh_na -1 0 1
## maruse_na
## -1 229 0 0
## 0 844 3 0
## 1 3 64 582
First, notice in the ifelse commands above, we told R to create two new variables that equal -1 if “maruse” (or “marhigh”) is missing (is.na checks whether the item is missing for a specific observation) and the value of the dichotomous variable we created otherwise.
Second, notice that the cell corresponding to missing on both “maruse_na” and “marhigh_na” (i.e. both are -1) has 229 observations in it. This makes sense as it suggests the same respondents who did not answer the question about marijuana use also didn’t answer the questions about getting high (this is probably largely a result of attrition between survey waves - i.e. they were not able to follow-up with most of these respondents).
Third, notice the cell corresponding to 1 on the “maruse_na” variable and -1 on the “marhigh_na” variable has 3 observations. This is the source of the difference in missing cases we noticed earlier when looking at the frequencies of our “maruse_dic” and “marhigh_dic” variables. Three respondents who answered that they used marijuana one or more times in the past year did not answer the question about how many times they got high from using marijuana in the past year.
Finally, notice the cell corresponding to those with zero on “maruse_na” and zero on “marhigh_na.” This indicates that three people who answered they had not used marijuana in the past year were also asked about whether they got high, a violation of the skip pattern. This is why we could have 3 additional missing values on the getting high question but still have the math work that we discussed earlier (i.e., if you subtract the number answering zero on the use question, 847, from the total number of valid cases, 1,496, you get 649, the number of valid cases on the getting high question). Essentially, we lost three who didn’t answer the getting high question but should have and gained three who answered the getting high question but shouldn’t have.
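The arithmetic in the last few paragraphs can be verified directly from the crosstab. The sketch below simply re-enters the nine cell counts shown above into a base R matrix (the object name `xt` and the derived names are ours) and recomputes the totals.

```r
# Cell counts copied from the maruse_na by marhigh_na crosstab above
# rows: maruse_na (-1, 0, 1); columns: marhigh_na (-1, 0, 1)
xt <- matrix(c(229,  0,   0,
               844,  3,   0,
                 3, 64, 582),
             nrow = 3, byrow = TRUE,
             dimnames = list(maruse_na  = c("-1", "0", "1"),
                             marhigh_na = c("-1", "0", "1")))

valid_use  <- sum(xt[c("0", "1"), ])  # 1496 valid cases on the use item
zero_use   <- sum(xt["0", ])          # 847 reported zero use
valid_high <- sum(xt[, c("0", "1")])  # 649 valid cases on the getting high item

# The subtraction still works out despite the two sets of three odd cases
valid_use - zero_use == valid_high

# Users who skipped the getting high item vs. non-users who answered it
lost_three   <- xt["1", "-1"]
gained_three <- sum(xt["0", c("0", "1")])
```

The three cases lost and the three cases gained offset each other, which is why the totals matched even though the skip pattern was not followed perfectly.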
Recall that in our if_else commands, we told R to code all subjects who answered “zero” on the marijuana use question as zero on the “marhigh_dic” variable, and we did this after telling R to code those who responded 1 or more on the getting high question as 1. Essentially, if any of these three respondents answered “zero” to using marijuana but answered 1 or more to getting high, they are coded as zero in our “marhigh_dic” variable. This sounds nonsensical, but it is a possibility for these three subjects.
-Let’s look at a crosstab for our “maruse” and “marhigh” variables to see if this happened. Before we do this, we will create truncated variables that give everyone who answers 10 or above on these questions a value of 10. This will just make it easier to see the crosstab.
nys_w6_trim %>%
mutate(maruse_trunc = ifelse(maruse >= 10, 10, maruse),
marhigh_trunc = ifelse(marhigh >= 10, 10, marhigh)) %>%
flat_table(maruse_trunc, marhigh_trunc, exclude = NULL)
## marhigh_trunc 0 1 2 3 4 5 6 7 8 9 10
## maruse_trunc
## 0 2 1 0 0 0 0 0 0 0 0 0
## 1 29 39 3 0 0 0 0 0 0 0 0
## 2 14 8 48 2 0 0 0 0 1 0 1
## 3 8 8 6 17 1 1 0 0 0 0 0
## 4 2 4 5 2 17 1 0 0 0 0 0
## 5 3 0 3 2 4 16 5 1 0 0 0
## 6 1 0 0 2 1 4 6 1 0 0 1
## 7 0 0 0 1 0 1 1 2 0 0 0
## 8 0 0 0 1 1 1 0 0 1 0 0
## 9 0 0 0 0 0 0 0 0 0 0 1
## 10 7 4 12 6 11 6 8 4 3 1 307
First, notice in the ifelse commands above, we told R to create two new variables that equal 10 if “maruse” (or “marhigh”) is greater than or equal to 10 and equal their existing value otherwise.
Second, in terms of the reason we wanted to look at this crosstab, notice that the three people who answered zero to the question about use are in the top row of the crosstab. Two of them answered “zero”, as expected, to the number of times they had gotten high on marijuana. However, one of them had reported zero use but also reported getting high on marijuana one time. Also notice that there are multiple people who report getting high on marijuana more than they report using marijuana (just look at the values above the diagonal in the crosstab, these are all respondents who report getting high on marijuana more times than they report using marijuana). What gives?
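Rather than eyeballing the cells above and below the diagonal, you could count those cases directly. Here is a minimal sketch with made-up data (the `toy` data frame and its values are hypothetical) that truncates both items at 10 as described and then counts respondents on either side of the diagonal.

```r
# Toy data (hypothetical): times used and times high in the past year
toy <- data.frame(
  maruse  = c(0, 1, 2, 5, 12, 3),
  marhigh = c(1, 1, 8, 2, 12, NA)
)

# Truncate both items at 10, as in the crosstab above
toy$maruse_trunc  <- ifelse(toy$maruse  >= 10, 10, toy$maruse)
toy$marhigh_trunc <- ifelse(toy$marhigh >= 10, 10, toy$marhigh)

# Above the diagonal: reported getting high more often than using
n_above <- sum(toy$marhigh > toy$maruse, na.rm = TRUE)

# Below the diagonal: reported using more often than getting high
n_below <- sum(toy$marhigh < toy$maruse, na.rm = TRUE)

c(above = n_above, below = n_below)  # above = 2, below = 1
```

Note that the comparison uses the untruncated items, so two respondents who both land in the "10 or more" cell could still differ on the raw counts.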
Some of this could simply be measurement error, perhaps resulting from respondents not remembering exactly how many times they have used and gotten high from marijuana. It could also reflect differences in how each question was specifically asked. Recall that for the marijuana use question respondents were asked specifically about using “marijuana and hashish” whereas with the getting high question they were only asked specifically about “marijuana.” But this type of difference in interpretation would seem to explain only the values below the diagonal in the crosstab (i.e., respondents who report more use than getting high).
Of course, these discrepancies between use and getting high could also reflect real differences in behavior and interpretation of the questions. For example, perhaps respondents think of “smoking marijuana” when they think of use, but think of multiple things like “edibles”, “contact high”, etc. when they think of how many times they have “been high on marijuana.”
You know how Ritchie (2020) keeps saying data in the real world are messy? This is what he is talking about! Ultimately, we don’t know why these discrepancies exist. But, for our conceptual replication purposes, Orcutt (1987) was primarily interested in competent use. Thus, if respondents don’t report using marijuana but report getting high (e.g. from edibles, contact highs, etc.) it would seem to not reflect competent use as Orcutt (1987) intended. Thus, we are comfortable with this one individual being coded as zero.
We are now ready to actually replicate Orcutt’s descriptive analyses, but we think it’s important to take a moment and recognize all the work we had to do to get to this point. In addition to the conceptual work, we also had to spend a fair amount of time wrangling the data and checking to make sure our operations worked how we intended. This is completely normal. When we are working with data, it is common for the bulk of that work to be taken up with these data management and data wrangling tasks (see this blog for a review).
Given we have our data in order, the next step is to summarize it similar to Orcutt.
Orcutt is simply presenting the means and standard deviations (in parentheses) for each of the key variables in his analyses for both samples separately and for the combined sample. In order to replicate this with wave 6 of the NYS, we first need to calculate these summary statistics for each of the key variables we wrangled above. Fortunately, the “sjmisc” package basically does this for you with the descr function. All you need to do is assign the output of the descr function to an object and it will create a dataframe of summary statistics.
Before calling descr, we keep only the three key variables using the select command from “dplyr.” We’ll also use the drop_na command like we did in the last assignment to perform listwise deletion for missing values across these three variables (i.e., only include respondents who have complete data on all three variables).
tb1_sumstat <- nys_w6_trim %>%
select(marhigh_dic, marpeer_fct, mardef_pos) %>%
drop_na() %>%
sjmisc::descr(marhigh_dic, marpeer_fct, mardef_pos)
tb1_sumstat
##
## ## Basic descriptive statistics
##
## var type label n NA.prc mean sd
## marhigh_dic numeric marhigh_dic 1461 0 0.39 0.49
## marpeer_fct categorical Y6-367:FRIENDS-USED MARIJUANA 1461 0 2.46 1.30
## mardef_pos numeric mardef_pos 1461 0 0.14 0.35
## se md trimmed range iqr skew
## 0.01 0 0.37 1 (0-1) 1 0.43
## 0.03 2 2.33 4 (1-5) 2 0.46
## 0.01 0 0.05 1 (0-1) 0 2.04
First, notice that you now have a dataframe in your Global Environment called “tb1_sumstat” that includes three observations and 13 variables. This is a summary data set where each variable is an observation and each characteristic (e.g. type and label) and each summary statistic (e.g., mean and sd) are variables.
Second, notice that this data set includes the specific information we need to replicate Orcutt’s (1987) Table 1 (i.e. the mean and standard deviation for each variable). The problem, of course, is that this data set also includes a lot of other information that we don’t necessarily want to report, even though it may be informative.
tb1_sumstat <- tb1_sumstat %>%
select(var, label, n, mean, sd) %>%
mutate(sample = "NYS Wave 6",
label = ifelse(var == "marhigh_dic", "Competent User (1 = User)", label),
label = ifelse(var == "marpeer_fct", "Friends' Use (1 = None of them - 5 = All of them)", label),
label = ifelse(var == "mardef_pos", "Subjective Definition (1 = Positive Definition)", label),
label = as_factor(label))
tb1_sumstat
##
## ## Basic descriptive statistics
##
## var label n mean sd
## marhigh_dic Competent User (1 = User) 1461 0.39 0.49
## marpeer_fct Friends' Use (1 = None of them - 5 = All of them) 1461 2.46 1.30
## mardef_pos Subjective Definition (1 = Positive Definition) 1461 0.14 0.35
## sample
## NYS Wave 6
## NYS Wave 6
## NYS Wave 6
In the code above, we kept only the columns we need using the select command. Then, using the mutate command, we told R to create a new variable called “sample” to indicate these data were from the “NYS Wave 6” data and recode the “label” variable to have more informative labels. We also told R to make the label variable a factor variable (this may come in handy later). R automatically assigned the labels to levels of the factor based on the order they appear in the data. If you want to see the levels of a factor variable, simply type levels(tb1_sumstat$label) into a code chunk or the console window.
We can actually produce a simple table that replicates the information in Orcutt’s with our current setup. We simply have to tell R which columns in our tb1_sumstat object to show, and we can do that with the select command.
tb1_sumstat %>%
select(label, mean, sd) %>%
mutate(mean = round(mean, digits = 3),
sd = round(sd, digits = 3)) %>%
rename(Variable = label,
Mean = mean,
SD = sd)
##
## ## Basic descriptive statistics
##
## Variable Mean SD
## Competent User (1 = User) 0.394 0.489
## Friends' Use (1 = None of them - 5 = All of them) 2.457 1.304
## Subjective Definition (1 = Positive Definition) 0.143 0.350
Note: the code above should be
relatively self-explanatory at this point except what we put in the
mutate
function. The data included seven decimal places for
the mean and sd. The mutate
function in the code above
simply tells R to write over these variables with the same variable
rounded to three decimal places using the round()
function.
At this point we have performed a conceptual replication of Orcutt’s (1987) Table 1. All the information is technically in the above table. However, it’s an ugly table that would not look great in a presentation or publication. Also, if we were presenting or publishing this, we may also want to place our results next to Orcutt’s in order to compare the results, especially amongst the variables that are measured most similarly.
In order to make the table more presentable, we’re going to use the “gt” package.
The “gt” package was built by people at RStudio and the basic idea is it allows you to take anything formatted as a data table (e.g. a dataframe or a tidyverse tibble) and create a table using its built in table elements and formatting options. Ultimately, the table created with the “gt” package can be rendered in html within an RMarkdown document (see an introductory video here).
Given that the descr function in the “sjmisc” package already created a data table with the information we wanted, we can easily create a simple gt table from the tb1_sumstat object we created above using the gt() function.
library(gt)
tb1_sumstat %>%
gt()
var | label | n | mean | sd | sample |
---|---|---|---|---|---|
marhigh_dic | Competent User (1 = User) | 1461 | 0.3942505 | 0.4888564 | NYS Wave 6 |
marpeer_fct | Friends' Use (1 = None of them - 5 = All of them) | 1461 | 2.4565366 | 1.3039692 | NYS Wave 6 |
mardef_pos | Subjective Definition (1 = Positive Definition) | 1461 | 0.1430527 | 0.3502465 | NYS Wave 6 |
Note: We did not overwrite the object
when we produced our basic table above, so the table produced by the
gt()
function has three additional columns (“var”, “n”, and
“sample”) and includes the mean and sd measured to seven decimal places.
The cool thing about the “gt” package, is that we should be able to
customize the appearance of this table entirely with functions available
within the “gt” package.
The table already looks visually better than the one above, but we ultimately want to 1) hide the columns we don’t need, 2) format the specific style of text and numbers that are displayed (e.g., column headings, text alignment, and number of decimal places displayed), and 3) add a title and caption to the table. So let’s take these in turn and see what “gt” can really do!
To hide the columns we don’t need, we can use the cols_hide function that is built into the “gt” package.
library(gt)
tb1_sumstat %>%
gt() %>%
cols_hide(columns = c(var, n, sample))
label | mean | sd |
---|---|---|
Competent User (1 = User) | 0.3942505 | 0.4888564 |
Friends' Use (1 = None of them - 5 = All of them) | 2.4565366 | 1.3039692 |
Subjective Definition (1 = Positive Definition) | 0.1430527 | 0.3502465 |
The hidden columns are still part of the underlying data; we simply used the cols_hide command to tell R not to display them.
tb1_sumstat %>%
gt() %>%
cols_hide(columns = c(var, n, sample)) %>%
cols_label(
label = "Variable",
mean = "Mean",
sd = "SD") %>%
cols_align(
align = "left",
columns = label) %>%
fmt_number(
columns = c(mean, sd),
decimals = 3)
Variable | Mean | SD |
---|---|---|
Competent User (1 = User) | 0.394 | 0.489 |
Friends' Use (1 = None of them - 5 = All of them) | 2.457 | 1.304 |
Subjective Definition (1 = Positive Definition) | 0.143 | 0.350 |
-Note: You can find information on all of the customization options at the “Reference” website for the “gt” package. It’s important to recognize, again, that in the above code we are simply manipulating the appearance of the columns and table elements, we are not changing the underlying data frame.
tb1_gtsumstat <- tb1_sumstat %>%
gt() %>%
cols_hide(columns = c(var, n, sample)) %>%
cols_label(
label = "Variable",
mean = "Mean",
sd = "SD") %>%
cols_align(
align = "left",
columns = label) %>%
fmt_number(
columns = c(mean, sd),
decimals = 3) %>%
tab_spanner(
label = "NYS Wave 6",
id = "nys",
columns = c(mean, sd)) %>%
tab_header(
title = md("**Table 1: Variable Means and Standard Deviations**")) %>%
tab_footnote(
footnote = md("*n = 1,461*"),
locations = cells_column_spanners(
spanners = "nys"))
tb1_gtsumstat
Table 1: Variable Means and Standard Deviations | ||
Variable | NYS Wave 61 | |
---|---|---|
Mean | SD | |
Competent User (1 = User) | 0.394 | 0.489 |
Friends' Use (1 = None of them - 5 = All of them) | 2.457 | 1.304 |
Subjective Definition (1 = Positive Definition) | 0.143 | 0.350 |
1 n = 1,461 |
-Note: In the above table, we grouped the “Mean” and “SD” columns under “NYS Wave 6” using the tab_spanner() function, added a title to the table with the tab_header() function, and added a footnote referencing the sample size of NYS Wave 6 using the tab_footnote() function (if we simply wanted to add a note to the end of our table without reference to an object in the table we would have used the tab_source_note() function).
-Note: The md() command before the text we were adding with a given function allows for markdown syntax to be used. That’s why the title shows up as bold when we added “**” to both sides of the title within the parentheses under the tab_header() function.
Given we tried to perform a conceptual replication of Orcutt’s (1987) study, it would be good to see our results side-by-side with Orcutt’s. To do this, we created the table below. It required some more data wrangling and some data entry (i.e. we had to enter the values of Orcutt’s table directly).
#Enter data from Orcutt's Table 1:
cols <- c("Variable", "Minnesota", "FSU", "Combined")
label <- c("Competent User (1 = User)", "Friends' Use (0 - 4)", "Subjective Definition (1 = Neutral)", "Subjective Definition (1 = Positive)")
min_mean <- c(.345, 1.230, .155, .372)
min_sd <- c(.476, 1.423, .363, .484)
fsu_mean <- c(.475, 1.803, .121, .497)
#NOTE: the fsu_sd values are identical to comb_mean, which suggests a copy-paste error; verify against Orcutt's (1987) Table 1
fsu_sd <- c(.416, 1.545, .137, .441)
comb_mean <- c(.416, 1.545, .137, .441)
comb_sd <- c(.493, 1.530, .344, .497)
#Create data frame of Orcutt's Table 1
orc_tb1 <- as.data.frame(cbind(label, min_mean, min_sd, fsu_mean, fsu_sd, comb_mean, comb_sd))
#Merge Orcutt data with NYS summary data
nys_orc_tb1 <- tb1_sumstat %>%
rename(nys_mean = mean,
nys_sd = sd) %>%
mutate(label = fct_recode(label, "Subjective Definition (1 = Positive)" = "Subjective Definition (1 = Positive Definition)")) %>%
full_join(orc_tb1) %>%
mutate(label = as_factor(label),
nys_mean = as_numeric(nys_mean),
nys_sd = as_numeric(nys_sd),
min_mean = as_numeric(min_mean),
min_sd = as_numeric(min_sd),
fsu_mean = as_numeric(fsu_mean),
fsu_sd = as_numeric(fsu_sd),
comb_mean = as_numeric(comb_mean),
comb_sd = as_numeric(comb_sd)) %>%
arrange(label)
#Create and Refine Table
tb1_nysorc_comb <- nys_orc_tb1 %>%
gt() %>%
cols_hide(columns = c(var, n, sample)) %>%
cols_label(
label = "Variable",
nys_mean = "Mean",
nys_sd = "SD",
min_mean = "Mean",
min_sd = "SD",
fsu_mean = "Mean",
fsu_sd = "SD",
comb_mean = "Mean",
comb_sd = "SD") %>%
cols_align(
align = "left",
columns = label) %>%
fmt_number(
columns = c(nys_mean, nys_sd, min_mean, min_sd, fsu_mean, fsu_sd, comb_mean, comb_sd),
decimals = 3) %>%
tab_spanner(
label = "NYS Wave 6",
id = "nys",
columns = c(nys_mean, nys_sd)) %>%
tab_spanner(
label = "Orcutt - Minn.",
id = "minn",
columns = c(min_mean, min_sd)) %>%
tab_spanner(
label = "Orcutt - FSU",
id = "fsu",
columns = c(fsu_mean, fsu_sd)) %>%
tab_spanner(
label = "Orcutt - Combined",
id = "comb",
columns = c(comb_mean, comb_sd)) %>%
tab_header(
title = md("**Table 1: Variable Means and Standard Deviations**")) %>%
tab_footnote(
footnote = md("*n = 1,461*"),
locations = cells_column_spanners(
spanners = "nys")) %>%
tab_footnote(
footnote = md("*n = 444*"),
locations = cells_column_spanners(
spanners = "minn")) %>%
tab_footnote(
footnote = md("*n = 543*"),
locations = cells_column_spanners(
spanners = "fsu")) %>%
tab_footnote(
footnote = md("*n = 987*"),
locations = cells_column_spanners(
spanners = "comb")) %>%
fmt_missing(
columns = everything(),
missing_text = "---")
# tab_options(
# footnotes.sep = ", ") #This isn't working; should place footnotes on same line (see: https://github.com/rstudio/gt/issues/833)
tb1_nysorc_comb
Table 1: Variable Means and Standard Deviations | ||||||||
Variable | NYS Wave 61 | Orcutt - Minn.2 | Orcutt - FSU3 | Orcutt - Combined4 | ||||
---|---|---|---|---|---|---|---|---|
Mean | SD | Mean | SD | Mean | SD | Mean | SD | |
Competent User (1 = User) | 0.394 | 0.489 | 0.345 | 0.476 | 0.475 | 0.416 | 0.416 | 0.493 |
Friends' Use (0 - 4) | — | — | 1.230 | 1.423 | 1.803 | 1.545 | 1.545 | 1.530 |
Friends' Use (1 = None of them - 5 = All of them) | 2.457 | 1.304 | — | — | — | — | — | — |
Subjective Definition (1 = Neutral) | — | — | 0.155 | 0.363 | 0.121 | 0.137 | 0.137 | 0.344 |
Subjective Definition (1 = Positive) | 0.143 | 0.350 | 0.372 | 0.484 | 0.497 | 0.441 | 0.441 | 0.497 |
1 n = 1,461 | ||||||||
2 n = 444 | ||||||||
3 n = 543 | ||||||||
4 n = 987 |
Looking across the values in the table above, you can see some similarities and points of departure between the results from the NYS data and Orcutt’s (1987) results.
First, the percentage of “competent” marijuana users in the past year is relatively similar across the samples. This is especially the case for the NYS data and the combined Minnesota and FSU sample from Orcutt’s study: Orcutt’s combined sample has about 2 percentage points more competent marijuana users than the NYS sample (42% vs. 39%). Note that this combined average reflects a combination of Minnesota’s relatively low marijuana use (35%) and FSU’s relatively high marijuana use (48%) when compared to the NYS (39%).
Second, the peer marijuana use variables between the NYS data and Orcutt’s (1987) samples are measured very differently and thus not directly comparable (and why we put them on separate rows in the above table). In order to compare them, you have to use your informed judgement about what the specific values mean. For example, in Orcutt’s study, respondents reported the number of their four closest friends who they thought used marijuana at least once a month. Compare this to the NYS which asked respondents how many of their friends had used marijuana in the past year with categorical answer categories ranging from “None of them” to “All of them.” For Orcutt, the average was about 1.5 close friends who respondents thought were using marijuana at least once a month (closer to 1 for Minnesota and closer to 2 for FSU). Compare this to the NYS data where the average of 2.5 falls between respondents reporting about their friends’ use that “very few of them” and “some of them” used marijuana in the past year. Leaving aside the different time frames and frequency implied by the questions, do you think respondents from Orcutt’s study that reported 1 to 2 close friends using marijuana at least once a month would have reported “very few of them” or “some of them” to the question in the NYS? Perhaps, but on some level, this question is objectively unknowable without a specific study built to test the overlap between these two questions.
Third, and perhaps the clearest difference between the NYS data and Orcutt’s results, is respondents’ “Subjective Definitions” of marijuana use. These differences are likely due, in large part, to the different ways the questions were asked and the coding decisions we made. Recall that the NYS question was a unipolar question about the “wrongness” of marijuana where respondents reported their views on a four-point scale ranging from “Not wrong” to “Very wrong.” Orcutt’s question was bipolar, asking respondents to report their “opinion” about marijuana with answers ranging from “Highly negative” (Minnesota) or “Negative” (FSU) to “Highly positive” (Minnesota) or “Positive” (FSU) with a neutral “no opinion” category in the middle. These questions are not only asking about qualitatively different things (e.g., wrongness vs. valence of opinion), they are also asking respondents to respond in very different ways (e.g., unipolar vs. bipolar).
nys_w6_trim %>%
sjmisc::frq(mardef)
## Y6-352:USE MARIJUANA (mardef) <numeric>
## # total N=1725 valid N=1496 mean=2.74 sd=1.02
##
## Value | Label | N | Raw % | Valid % | Cum. %
## -----------------------------------------------------------
## 1 | Not wrong | 215 | 12.46 | 14.37 | 14.37
## 2 | A little bit wrong | 381 | 22.09 | 25.47 | 39.84
## 3 | Wrong | 473 | 27.42 | 31.62 | 71.46
## 4 | Very wrong | 427 | 24.75 | 28.54 | 100.00
## <NA> | <NA> | 229 | 13.28 | <NA> | <NA>
sample <- c("NYS", "Minnesota", "FSU", "Combined")
n <- c(1461, 444, 543, 987)
mean <- c(.602, (1 - (.155 + .372)), (1 - (.121 + .497)), (1 - (.137 + .441)))
neg_def <- cbind.data.frame(sample, n, mean) %>%
mutate(sd = sqrt((n/(n-1))*(mean*(1 - mean))))
tb1_nysorc_negdef <- gt::gt(neg_def) %>%
cols_hide(columns = n) %>%
fmt_number(
columns = c(mean, sd),
decimals = 3) %>%
cols_label(
sample = "Sample",
mean = "Mean",
sd = "SD") %>%
tab_spanner(
label = "Subjective Definition (1 = Negative)",
columns = c(mean, sd))
tb1_nysorc_negdef
Sample | Subjective Definition (1 = Negative): Mean | Subjective Definition (1 = Negative): SD |
---|---|---|
NYS | 0.602 | 0.490 |
Minnesota | 0.473 | 0.500 |
FSU | 0.382 | 0.486 |
Combined | 0.422 | 0.494 |
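The SD column above is computed from the proportion and sample size alone, because Orcutt's raw data are unavailable. As a small sanity check, the sketch below shows that this formula gives the same answer as sd() applied to raw 0/1 data. The vector x is hypothetical (210 ones out of 444, about 47.3%) and is not Orcutt's data.

```r
# Hypothetical 0/1 vector: 210 ones out of 444 cases
x <- c(rep(1, 210), rep(0, 444 - 210))
n <- length(x)
p <- mean(x)

# SD recovered from the proportion, as in the table code
sd_formula <- sqrt((n / (n - 1)) * p * (1 - p))

# Sample SD computed directly from the raw 0/1 values
sd_direct <- sd(x)

all.equal(sd_formula, sd_direct)
```

The two agree because, for a binary variable, the sum of squared deviations from the mean equals n * p * (1 - p).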
As evident in the above table, a larger proportion of respondents in the NYS sample report negative subjective definitions of marijuana than in any of the samples in Orcutt’s (1987) study. The absolute difference ranges from 13 percentage points between the NYS and Minnesota samples to 22 percentage points between the NYS and FSU samples. There is an 18 percentage point difference between the NYS sample and Orcutt’s combined sample. Given the differences in measurement, what do you think might account for these differences?
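The gaps described above can be reproduced with a few lines of arithmetic. This is a minimal sketch using the same proportions entered in the table code; the names neg_prop and gap_pp are illustrative.

```r
# Proportions with a negative definition, as entered in the table code
neg_prop <- c(NYS       = 0.602,
              Minnesota = 1 - (0.155 + 0.372),
              FSU       = 1 - (0.121 + 0.497),
              Combined  = 1 - (0.137 + 0.441))

# Gap between NYS and each of Orcutt's samples, in percentage points
gap_pp <- 100 * (neg_prop["NYS"] - neg_prop[c("Minnesota", "FSU", "Combined")])
round(gap_pp, 1)
```

The gaps come out to about 12.9, 22.0, and 18.0 percentage points for Minnesota, FSU, and the combined sample, respectively.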
We have just walked you through a conceptual replication of Orcutt’s (1987) descriptive statistics using wave 6 of the NYS, the wave in which NYS respondents were presumably most similar in age to Orcutt’s sample. Substantively, this was an attempt to see if Orcutt’s findings generalized to a nationally representative sample. A related substantive question is whether Orcutt’s findings also generalize to different age groups.
To examine the question of whether Orcutt’s findings generalize to different age groups, one basic thing we can do is perform a similar conceptual replication using earlier waves of the NYS (waves 1 - 5). This is now your task.
variable | question | answers |
---|---|---|
CASEID | Unique Identifier | NA |
wave | Wave of NYS data collection | NA |
age | How old are you? | specific age |
marpeer | Think of your friends. During the last year how many of them have used marijuana or hashish? | 1 = None of them - 5 = All of them |
mardef | How wrong is it for someone your age to use marijuana or hashish? | 1 = Not wrong - 4 = Very wrong |
maruse | In the last year, how often have you used marijuana - hashish ('grass', 'pot', 'hash')? | 1 = Never - 9 = 2 to 3 times/day |
Note: In the first five waves, the NYS only asked about marijuana use in the last year and not about whether respondents got high from marijuana in the past year.
Note: The NYS did not ask for a raw count in Waves 1 and 2; instead, respondents answered on a 9-point scale that ranged from 1 = “Never” to 9 = “2 to 3 times per day”. Here is the full range of answers from the Wave 1 codebook:
Note: This question was also included in wave 6 (V891), but it was only asked of those who reported using marijuana 10 or more times in the last year.
Note: Given that the first five waves of the NYS did not ask about getting high from marijuana, this is another instance where the conceptualization of “marijuana use” in your replication with one of the first five waves of data will differ from Orcutt’s conceptualization. With this measure, however, you are still able to distinguish between those who have used marijuana at all in the past year and those who have not.
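One way to draw the used/never-used distinction described in the note above is to collapse the 9-point scale into a 0/1 indicator. The sketch below uses made-up responses; the cut point (anything above 1 = “Never” counts as use) and the name any_use are assumptions for illustration, not the coding used elsewhere in this assignment.

```r
# Toy responses on the 9-point scale (1 = Never ... 9 = 2 to 3 times/day); NA = missing
maruse <- c(1, 1, 3, 9, NA, 2, 1)

# Assumption: any answer above 1 ("Never") counts as past-year use
any_use <- as.integer(maruse > 1)
any_use

# Proportion reporting any past-year use among non-missing cases
mean(any_use, na.rm = TRUE)
```

Missing responses stay NA after the comparison, so they drop out of the proportion rather than being counted as non-users.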
Here are your wave assignments:
NYS Wave | Students | | |
---|---|---|---|
Wave 1 | Student 4 | Student 10 | Student 3 |
Wave 2 | Student 1 | Student 8 | NA |
Wave 3 | Student 9 | Student 6 | NA |
Wave 4 | Student 7 | Student 2 | NA |
Wave 5 | Student 11 | Student 5 | NA |
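Once you know your wave, a typical first step is to subset the pooled data to that wave and summarize the dichotomized measures. The sketch below uses a small made-up data frame (nys_toy) with column names taken from the variable table above. The cut points, my_wave, and the new variable names are assumptions for illustration, so adapt them to the coding decisions used earlier in the assignment.

```r
library(dplyr)

# Toy stand-in for the pooled file; values are invented
nys_toy <- data.frame(
  CASEID = c(1, 2, 3, 1, 2, 3),
  wave   = c(1, 1, 1, 2, 2, 2),
  mardef = c(4, 3, 4, 2, 4, 1),
  maruse = c(1, 1, 1, 2, 1, 6)
)

my_wave <- 2  # replace with your assigned wave

wave_summary <- nys_toy %>%
  filter(wave == my_wave) %>%            # keep only the assigned wave
  mutate(
    any_use = as.integer(maruse > 1),    # assumption: above "Never" = use
    neg_def = as.integer(mardef >= 3)    # assumption: Wrong/Very wrong = negative
  ) %>%
  summarise(across(c(any_use, neg_def),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd   = ~ sd(.x, na.rm = TRUE))))

wave_summary
```

The result is a one-row data frame with columns such as any_use_mean and neg_def_sd, which can be reshaped or passed to gt::gt() in the same way as the wave 6 table.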
You should now have everything that you need to replicate Orcutt’s (1987) Table 1 with the NYS data to which you were assigned. The pooled data set is provided with the assignment on Canvas (along with the R script used to create it). If you have followed along up to this point, you should have all the requisite skills to recreate Table 1 above using the NYS wave to which you were assigned.
All that is left to do is:
Load the pooled data file (provided in .rds format). I recommend saving the pooled file in your “NYS” subfolder and then using the load() function as follows: load(here('Datasets/NYS', 'nys_fwtrim_orc1987.rds'))
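A caution on file formats: in base R, load() restores files written with save(), while files written with saveRDS() are read with readRDS(), regardless of the file extension. If you are unsure how the pooled file was written, the sketch below tries readRDS() first and falls back to load(). The helper read_pooled() is hypothetical and not part of the assignment materials.

```r
# Hypothetical helper: return the stored object whichever way the file was written
read_pooled <- function(path) {
  tryCatch(
    readRDS(path),                          # works if written with saveRDS()
    error = function(e) {
      env <- new.env()
      obj_names <- load(path, envir = env)  # works if written with save()
      get(obj_names[1], envir = env)        # return the first restored object
    }
  )
}
```

For example, read_pooled(here('Datasets/NYS', 'nys_fwtrim_orc1987.rds')) would return the data either way, so you can assign it to a name of your choosing. Note that load() on its own restores objects under their original names and returns only those names invisibly.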
Remember:
Keep the file clean and easy to follow by using RMD level headings (e.g., denoted with ## or ###) separating R code chunks, organized by assignment parts or questions.
Write plain text after headings and before or after code chunks to explain what you are doing. This is not just for my sake - such text will serve as useful reminders to you when working on later assignments!
Upon completing the assignment, “knit” your final RMD file again and save the final knitted html document to your “Assignments” folder in your LastName_P680_work folder as: LastName_P680_Assign4_YEAR_MO_DY.
Inside the “LastName_P680_commit” folder in our shared folder, create another folder named: Assignment 4.
To submit your assignment for grading, save copies of both your (1) final knitted “Assign4” html file and (2) your “Assign4_RMD” file into the “LastName_P680_commit / Assignment 4” folder.