The main purpose of this assignment is to teach you how to create publication-ready tables entirely within the R ecosystem. Doing so is not only useful for general reproducibility purposes, but it also helps avoid errors that can occur when transferring information from one source (e.g., output in R) to another (e.g., Excel spreadsheets or tables in a word processor).
The secondary purpose of this assignment is to walk you through the basic steps necessary to perform a conceptual replication. Specifically, we will perform a conceptual replication of some observations in Orcutt’s (1987) paper in Criminology titled: “Differential Association and Marijuana Use: A Closer Look at Sutherland (with a Little Help from Becker).” Since Orcutt’s (1987) original data are unavailable, we will assess whether some of his findings can be repeated with, and generalize to, a similar sample in the NYS data. Walking through the conceptual replication process will hopefully demonstrate some of the important conceptual work such a replication requires and show the benefits and limitations of using different data and measurement strategies to examine the same phenomenon.
To accomplish these goals, we will need to do and learn the following:
We assume that you are now familiar with installing and loading packages in R. Thus, when you see a package being used, we expect that you know it needs to be installed and that it needs to be loaded within your own R session in order to use it.
At this point, we also assume you are familiar with RStudio and with creating R Markdown (RMD) files. If not, please review R Assignments 1 & 2.
As with previous assignments, for this and all future assignments, you MUST type all commands in by hand. Do not copy & paste from the instructions except for troubleshooting purposes (i.e., if you cannot figure out what you mistyped).
library(tidyverse)
library(here)
library(haven)
library(gt)
library(icpsrdata)
library(sjmisc)
library(janitor)
We recommend doing a quick AIC reading of Orcutt’s (1987) study. To get you started, here is the abstract:
Based on Sutherland’s differential association theory and Becker’s early research on marijuana use, a contingency model estimating the exact probability of getting high on marijuana under various associational and motivational conditions is specified and tested. Data from surveys at two universities fit this model closely. Predicted first-order interactions and nonlinear effects of motivational balance and peer association are statistically significant and generate highly precise estimates of the probability of getting high. These results suggest that linear main-effects models employed in previous research on differential association processes do not adequately reflect the complex [causal] structure of Sutherland’s theory. In addition, this study raises serious questions about claims that differential association theory is untestable and has been made outdated by social learning theory.
While that all sounds pretty technical, fundamentally the paper is examining some core theoretical claims from Edwin Sutherland’s Differential Association Theory. Specifically, Orcutt sets out to test key aspects of Sutherland’s theory by examining how (a respondent’s perception of their) peers’ behavior is associated with a respondent’s own subjective attitudes toward that same behavior (i.e. their “definitions” of the behavior) and, ultimately, with the respondent’s own self-reported participation in that behavior. In this case, the focal behavior is marijuana use. Orcutt also distinguishes between “competent use” and “incompetent use” to incorporate Sutherland’s ideas about the necessity of learning the requisite skills to accomplish deviant behavior as well as to integrate some key insights from Howard Becker’s (1953) classic description of the process of learning to become a regular marijuana user (hence the “with a little help from Becker” part of the title).
Orcutt’s data came from two “in-class” surveys of undergraduate students at two universities—University of Minnesota and Florida State University—in 1972 and 1973, respectively. Here is the description Orcutt provides in the article:
Approximately half of the respondents in both surveys received a questionnaire focusing on alcohol use while the other half completed a parallel form dealing with marijuana use. This analysis will be restricted to students at each school who filled out the marijuana questionnaire–444 Minnesota undergraduates and 543 Florida State (FSU) undergraduates.
You can find more details on these data in two other articles that Orcutt published prior to this paper (see Orcutt, 1975 and Orcutt, 1978).
Recall, the purpose of a conceptual replication is to test the repeatability, robustness, or generalizability of a theoretical or observational claim from a previous study using new data collection and/or new measurement procedures that are conceptually similar but not identical to those used in the previous study. The NYS data provide an opportunity to do this, since they include variables similar to (although not exactly the same as) those used by Orcutt and, unlike Orcutt’s two university student samples, the NYS is a larger, nationally representative panel survey of youth followed for 10 years across seven waves of data (in the publicly available NYS data).
Specifically, respondents in the NYS ranged from ages 11 to 17 in Wave 1 to ages 21 to 27 in Wave 7. Since Orcutt’s sample is a student-based sample from two specific universities, we will start by identifying the wave of the NYS in which the NYS respondents are most similar in age to the population from which Orcutt was sampling. This will allow us to assess the repeatability of his findings on a group of respondents in a similar age range, while the nationally representative sample will permit us to assess whether Orcutt’s findings generalize beyond college students attending the two specific universities that he studied.
Let’s look at the age range for each wave of NYS data:
NYS Age Range by Wave

| wave | year | min | max |
|---|---|---|---|
| Wave 1 | 1976 | 11 | 17 |
| Wave 2 | 1977 | 12 | 18 |
| Wave 3 | 1978 | 13 | 19 |
| Wave 4 | 1979 | 14 | 20 |
| Wave 5 | 1980 | 15 | 21 |
| Wave 6 | 1983 | 18 | 24 |
| Wave 7 | 1987 | 21 | 27 |
The table above shows that the Wave 6 NYS data likely includes the panel members when they are most similar to Orcutt’s college-based samples in terms of age with an age range of 18-24. It is important to note that we do not actually know the specific age range or distribution of Orcutt’s sample because he did not report it in the article. This is a situation where we are likely fairly safe in assuming Orcutt’s sample was largely drawn from 18-22 year-olds, but without that information provided in the article and without the data being shared publicly, we really cannot know for sure.
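The wave comparison above can be made explicit with a few lines of R. This is only a sketch: the ages are typed in by hand from the table rather than computed from the NYS files, and the 18 to 22 range for undergraduates is our assumption, since Orcutt did not report ages. The object names (`nys_ages`, `overlap`, `best_wave`) are ours.

```r
library(tibble)

# Age ranges copied by hand from the table above (not computed from the data)
nys_ages <- tribble(
  ~wave,    ~year, ~min, ~max,
  "Wave 1", 1976,  11,   17,
  "Wave 2", 1977,  12,   18,
  "Wave 3", 1978,  13,   19,
  "Wave 4", 1979,  14,   20,
  "Wave 5", 1980,  15,   21,
  "Wave 6", 1983,  18,   24,
  "Wave 7", 1987,  21,   27
)

# ASSUMPTION: typical undergraduates are 18 to 22 (not reported by Orcutt)
college_min <- 18
college_max <- 22

# Number of single-year ages in each wave that fall inside the assumed range
nys_ages$overlap <- pmax(
  0,
  pmin(nys_ages$max, college_max) - pmax(nys_ages$min, college_min) + 1
)

# Wave with the largest overlap
best_wave <- nys_ages$wave[which.max(nys_ages$overlap)]
best_wave

# To display the table itself, one option is: gt::gt(nys_ages)
```

Under this assumed age range, Wave 6 covers more of the college-age years than any other wave, which is consistent with the choice made above.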
Let’s download the data directly from ICPSR like we did in “R Assignment 5.”
Note: You will need to create your reproducible file structure for this assignment prior to trying to download the data. See “R Assignment 5” for details.
Note: Recall that when you first run this chunk during an R session, you will be asked to enter your ICPSR account information into the R console. So check the R Console after trying to run the `icpsr_download` command to see and respond to the ICPSR username/password prompts. Also, recall that this requires that you have an ICPSR account, which you should have created for the previous assignments. If you receive errors, go to the ICPSR website and be sure that you are able to log in using the username (email) and password that you are entering in the R console.
library(icpsrdata)
library(here)
ifelse(dir.exists(here("NYS_data", "ICPSR_09948")), TRUE,
icpsr_download(file_id = c(9948),
download_dir = here("NYS_data")))
## [1] TRUE
Note that we added the `icpsr_download` function to an `ifelse` statement. This is similar to what we did in “R Assignment 5” when we were creating the “NYS_data” folder. See Part 2.2 of “R Assignment 5” for an explanation of the logic. It essentially checks whether the “ICPSR_09948” folder (which holds NYS Wave 6) exists in the “NYS_data” folder, returns the logical value “TRUE” if it does and, if it does not, runs the `icpsr_download` command to download the NYS Wave 6 data from ICPSR.
Load the data into R and assign it to an object with the `read_spss` command like you have done in the previous three assignments.
library(haven)
nys_w6 <- read_spss(here("NYS_data", "ICPSR_09948", "DS0001", "09948-0001-Data.sav"))
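The download-only-if-missing check used before loading the data can also be written with a plain `if` statement, which some readers find easier to follow than `ifelse`. The sketch below illustrates the logic without touching ICPSR: `fake_download()` is a hypothetical stand-in for `icpsr_download()` that only creates a folder, and the folder lives in a temporary directory rather than in your project.

```r
# HYPOTHETICAL stand-in for icpsr_download(): it only creates the folder,
# so the example runs without an ICPSR account or network access
fake_download <- function(download_dir, folder) {
  dir.create(file.path(download_dir, folder), recursive = TRUE)
  invisible(TRUE)
}

# Use a temporary location so nothing in your project is modified
download_dir <- file.path(tempdir(), "NYS_data_demo")
unlink(download_dir, recursive = TRUE)  # start from a clean slate
target <- file.path(download_dir, "ICPSR_09948")

# First pass: the folder is missing, so the "download" runs
if (!dir.exists(target)) {
  fake_download(download_dir, "ICPSR_09948")
}
first_pass <- dir.exists(target)

# Second pass: the folder now exists, so nothing is downloaded again
downloaded_again <- FALSE
if (!dir.exists(target)) {
  fake_download(download_dir, "ICPSR_09948")
  downloaded_again <- TRUE
}

first_pass
downloaded_again
```

In the real assignment you would swap `fake_download()` for the `icpsr_download()` call shown earlier and point `target` at `here("NYS_data", "ICPSR_09948")`.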
The next step is to identify the specific items from the NYS that allow us to measure the same constructs as Orcutt (1987) did. Recall that Orcutt (1987) was attempting to examine key propositions from Sutherland’s Differential Association Theory. Specifically, Orcutt was interested in the relationship between 1) associations with criminal/deviant patterns of behavior (i.e., peers’ use of marijuana), 2) individuals’ subjective attitudes toward that behavior (i.e., definitions favorable to marijuana use), and 3) individuals’ criminal/deviant behavior (i.e., self-reported marijuana use). Below is a simplified diagram of the theorized causal structure, or directed acyclic graph (DAG), representing the hypothesized relationships between these three variables.
Orcutt measured peer marijuana use with the following question: “Of your four closest friends, how many would you say use marijuana at least once a month?” The number of friends whom the respondent reported as using marijuana at least once a month served as the answer categories, which ranged from 0 to 4 with a mean of 1.2 (SD = 1.4) at Minnesota and 1.8 (SD = 1.6) at Florida State.
As you know from previous R Assignments, the NYS includes measures of peer delinquency and specifically peer marijuana use (V371 in NYS Wave 6). However, it is measured with the question: “Think of your friends. During the last year how many of them have used marijuana or hashish?” The answers were from a five-level ordinal scale with specific answer categories including: “None of them” (=1), “Very few of them” (=2), “Some of them” (=3), “Most of them” (=4), and “All of them” (=5).
You can see how this is a “conceptual replication” with our first measure. While both data sources include a measure of peer marijuana use, the concept is measured differently and in a way that makes the measures and corresponding results not exactly comparable. To get a sense of how these differences might matter, just try to imagine how someone in Orcutt’s data who answered that “3” of their friends used “marijuana at least once a month” (above the average) would answer the question in the NYS: Would the person answer “very few of them” (=2) in the NYS? Perhaps so, if they have 20 friends. But what if the person has 4 friends? Might they respond with “most of them” (=4) instead?
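A quick calculation makes the comparison concrete. The friend counts (4 and 20) are the hypothetical ones used in the paragraph above; nothing here comes from either data set.

```r
# Hypothetical respondent who reports 3 friends using marijuana
friends_using <- 3

# Share of friends who use, under two hypothetical friendship network sizes
share_of_4  <- friends_using / 4    # 3 of 4 friends
share_of_20 <- friends_using / 20   # 3 of 20 friends

share_of_4
share_of_20
```

The same answer of “3” on Orcutt’s item corresponds to three quarters of a four-person friend group but well under a quarter of a twenty-person group, which is why the two measures cannot be mapped onto one another mechanically.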
If you had substantive knowledge of Sutherland’s Differential Association Theory and the criminological literature on peer influence in general, you may also have an opinion regarding which measure of exposure to peer delinquency - Orcutt’s or the NYS version - best captures the theoretical concept of “exposure to delinquent patterns.” Going down this rabbit hole is not necessary for our particular assignment, but it is essential to think about and recognize: 1) that the measurement of abstract social constructs can be a tricky, variable, and error-prone endeavor; and 2) that such seemingly small measurement differences like the one here can have important implications for testing theories and for assessing the replicability of research findings.
Given all of this, let’s go ahead and look at the summary statistics for the “Peer marijuana use” variable using the `frq` and `descr` functions from the `sjmisc` package:
library(sjmisc)
nys_w6 %>%
frq(V371)
## Y6-367:FRIENDS-USED MARIJUANA (V371) <numeric>
## # total N=1725 valid N=1464 mean=2.46 sd=1.30
##
## Value | Label | N | Raw % | Valid % | Cum. %
## ---------------------------------------------------------
## 1 | None of them | 468 | 27.13 | 31.97 | 31.97
## 2 | Very few of them | 318 | 18.43 | 21.72 | 53.69
## 3 | Some of them | 349 | 20.23 | 23.84 | 77.53
## 4 | Most of them | 196 | 11.36 | 13.39 | 90.92
## 5 | All of them | 133 | 7.71 | 9.08 | 100.00
## <NA> | <NA> | 261 | 15.13 | <NA> | <NA>
nys_w6 %>%
descr(V371)
##
## ## Basic descriptive statistics
##
## var type label n NA.prc mean sd se md
## V371 numeric Y6-367:FRIENDS-USED MARIJUANA 1464 15.13 2.46 1.3 0.03 2
## trimmed range iqr skew
## 2.46 4 (1-5) 2 0.45
As you can see above, “None of them” is the modal category, and a similar percentage of respondents answered “Very few of them” (18.4%) and “Some of them” (20.2%); each of these is roughly equal to the last two categories combined, “Most of them” (11.4%) and “All of them” (7.7%). This suggests the distribution is right-skewed, or positively skewed. As you can see in the descriptive statistics, and characteristic of this “right” or “positive” skew, the mode (1) is less than the median (2), which is less than the mean (2.5).
Also note that about 15% of the respondents (n = 261) are missing data for this question. This means that by using this item, our overall sample size will, at most, be n = 1,464 (1725 - 261).
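If you want to convince yourself where these numbers come from, the sketch below rebuilds them from the category counts printed in the `frq` output above. It uses only those counts (typed in by hand), not the data file itself; the object names are ours.

```r
# Category counts copied from the frq() output for V371
counts <- c("None of them" = 468, "Very few of them" = 318,
            "Some of them" = 349, "Most of them" = 196,
            "All of them" = 133)
n_missing <- 261

n_valid <- sum(counts)            # respondents with a valid answer
n_total <- n_valid + n_missing    # everyone in the Wave 6 file

# Raw % uses all respondents; Valid % uses only non-missing respondents
raw_pct   <- round(100 * counts / n_total, 2)
valid_pct <- round(100 * counts / n_valid, 2)

# Rebuild the individual responses (1-5) to get the mode, median, and mean
values <- rep(1:5, times = counts)
mode_value   <- as.integer(names(which.max(table(values))))
median_value <- median(values)
mean_value   <- mean(values)

c(mode = mode_value, median = median_value, mean = round(mean_value, 2))
```

The ordering mode < median < mean is the pattern described above for a right-skewed variable, and `n_total - n_missing` reproduces the maximum usable sample size.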
For now, we will plan on keeping this variable “as-is” with five categories.
Orcutt (1987) measured “subjective definitions” regarding marijuana use with the question: “How would you generally characterize your opinions toward marijuana?” The answer categories across the Minnesota and FSU data were slightly different, although both used 5-point Likert scales (technically, Likert-type scales), with the Minnesota survey response categories ranging from “highly negative” to “highly positive” and the Florida State response categories ranging from “negative” to “positive” (p. 346).
The NYS also includes items that measure one’s “subjective definition” of marijuana use, particularly its “wrongness” (V356 in NYS Wave 6). The specific question asks: “How wrong is it for someone your age to use marijuana or hashish?” with answer categories on a 4-point ordinal scale including: “Not wrong” (=1), “A little bit wrong” (=2), “Wrong” (=3), and “Very wrong” (=4).
Like before with the peer marijuana use variable, it is instructive to work through the intellectual exercise of thinking how the responses on Orcutt’s surveys would map onto the NYS question and response categories. Also, if we were planning to embark upon a serious attempt to contribute to this literature, then it would be worth seriously considering which is a better approach to measuring the theoretical concept of “subjective definitions.” Of course, this would require in-depth substantive knowledge of Sutherland’s theory.
Let’s look at the distribution and summary statistics for this item.
nys_w6 %>%
frq(V356)
## Y6-352:USE MARIJUANA (V356) <numeric>
## # total N=1725 valid N=1496 mean=2.74 sd=1.02
##
## Value | Label | N | Raw % | Valid % | Cum. %
## -----------------------------------------------------------
## 1 | Not wrong | 215 | 12.46 | 14.37 | 14.37
## 2 | A little bit wrong | 381 | 22.09 | 25.47 | 39.84
## 3 | Wrong | 473 | 27.42 | 31.62 | 71.46
## 4 | Very wrong | 427 | 24.75 | 28.54 | 100.00
## <NA> | <NA> | 229 | 13.28 | <NA> | <NA>
nys_w6 %>%
descr(V356)
##
## ## Basic descriptive statistics
##
## var type label n NA.prc mean sd se md trimmed
## V356 numeric Y6-352:USE MARIJUANA 1496 13.28 2.74 1.02 0.03 3 2.74
## range iqr skew
## 3 (1-4) 2 -0.27
This item as originally coded is “left skewed” or “negative skewed,” with the “wrong” category as the modal category and the “very wrong” category as the second most common.
Orcutt took his five-category item and recoded it into three categories with “undecided” as a “neutral” definition and responses on the positive (e.g., “Highly positive” and “Positive”) or negative side (e.g., “Highly negative” and “Negative”) of this category coded as “positive” or “negative” respectively. This made sense given Orcutt used a “bipolar” rating scale for his answer categories in designing his survey. A “bipolar” rating scale simply refers to a set of answer categories that allow a respondent to answer in opposite directions, usually separated by a midpoint (in Orcutt’s case the “undecided” category).
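Orcutt’s three-category collapse can be sketched on a toy vector. The five labels below follow the examples given above for a bipolar scale; they are illustrative and not Orcutt’s actual data or his exact wording.

```r
library(dplyr)

# TOY responses on a five-point bipolar scale (illustrative labels only)
opinion <- c("Highly negative", "Negative", "Undecided",
             "Positive", "Highly positive")

# Collapse to three categories around the "Undecided" midpoint
definition3 <- case_when(
  opinion %in% c("Highly negative", "Negative") ~ "negative",
  opinion == "Undecided"                        ~ "neutral",
  opinion %in% c("Positive", "Highly positive") ~ "positive"
)

data.frame(opinion, definition3)
```

The collapse works cleanly only because the scale has a midpoint with categories on either side of it.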
The NYS, however, uses a “unipolar” rating scale (see here for brief discussion of the distinction between bipolar and unipolar survey response scales). A unipolar rating scale includes answer categories that only move in one direction (in the case of the NYS, from “Not wrong” to “Very wrong”). As a result, the “subjective definition” item in the NYS does not lend itself nicely to a “neutral” categorization (perhaps one could argue the “A little bit wrong” answer conceptually aligns with Orcutt’s “undecided” answer category.) Arguably, the NYS approach also is a less desirable match to Sutherland’s concept of definitions, which presumably can range in content from favorable to unfavorable to crime. Below, when we recode the data, we will ultimately decide to collapse the “subjective definition” responses in our analysis into two dummy variables that indicate whether the respondent reportedly has internalized: (A) “negative” definitions unfavorable to marijuana use (i.e., 1=“At least a little bit wrong”) or (B) “positive” definitions favorable to marijuana use (1=“Not wrong”).
Orcutt’s key dependent variable of “personal use of marijuana to get high” was measured with the question: “Which of the following statements best described the approximate number of times you have gotten ‘high’ on marijuana during the past year?” The answer categories included: 1) “I did not use marijuana during the past year;” 2) “I used marijuana during the past year, but did not get ‘high’;” 3) “I got ‘high’ on marijuana during the past year; but only once or twice;” 4) “I got ‘high’ on marijuana at least 3 times during the past year, but not more than 12 times;” and 5) “I got ‘high’ on marijuana more than 12 times during the past year.”
The NYS includes multiple questions about marijuana use with two being key for our purposes. First is a question about use (V890 in NYS Wave 6): “How many times in the last year have you used marijuana or hashish? (GRASS, POT, HASH)” with the specific number of times reported coded as answers. Second is a question about getting high (V966 in NYS Wave 6): “How many times in the past year have you been high on marijuana?” with the specific number of times reported coded as answers.
Ultimately, Orcutt makes this decision pretty easy for us. He was specifically interested in the distinction between “minimally competent use” and incompetent use. Here is what he said on p. 347:
An important feature of this item is that it measures a respondent’s self reported ability to get high which, for Becker (1953), is a defining characteristic of a marijuana user. That is, this measure distinguishes between those who are minimally competent users-who have acquired the physical and subjective techniques for getting high-and those who are not. Therefore, according to Becker’s conception, respondents who checked either of the first two statements should be classified as nonusers. Thus, the dependent variable in this analysis is a proportional measure of initiation into marijuana use–Ownuse–based on a binary scoring of nonusers (0 = statements 1 or 2) versus users (1 = statements 3, 4, or 5).
This means that the key distinction for Orcutt was between getting high and not getting high. The second question from the NYS better captures this than the first. But before we decide, let’s look at the distribution and summary statistics for these items.
nys_w6 %>%
frq(V890, V966)
## Y6-886:MARIJUANA-FREQ (V890) <numeric>
## # total N=1725 valid N=1496 mean=32.74 sd=102.52
##
## Value | N | Raw % | Valid % | Cum. %
## --------------------------------------
## 0 | 847 | 49.10 | 56.62 | 56.62
## 1 | 71 | 4.12 | 4.75 | 61.36
## 2 | 75 | 4.35 | 5.01 | 66.38
## 3 | 41 | 2.38 | 2.74 | 69.12
## 4 | 31 | 1.80 | 2.07 | 71.19
## 5 | 34 | 1.97 | 2.27 | 73.46
## 6 | 16 | 0.93 | 1.07 | 74.53
## 7 | 5 | 0.29 | 0.33 | 74.87
## 8 | 4 | 0.23 | 0.27 | 75.13
## 9 | 1 | 0.06 | 0.07 | 75.20
## 10 | 34 | 1.97 | 2.27 | 77.47
## 11 | 1 | 0.06 | 0.07 | 77.54
## 12 | 30 | 1.74 | 2.01 | 79.55
## 14 | 1 | 0.06 | 0.07 | 79.61
## 15 | 10 | 0.58 | 0.67 | 80.28
## 16 | 1 | 0.06 | 0.07 | 80.35
## 20 | 30 | 1.74 | 2.01 | 82.35
## 21 | 1 | 0.06 | 0.07 | 82.42
## 22 | 1 | 0.06 | 0.07 | 82.49
## 24 | 2 | 0.12 | 0.13 | 82.62
## 25 | 16 | 0.93 | 1.07 | 83.69
## 26 | 1 | 0.06 | 0.07 | 83.76
## 30 | 16 | 0.93 | 1.07 | 84.83
## 35 | 1 | 0.06 | 0.07 | 84.89
## 40 | 8 | 0.46 | 0.53 | 85.43
## 45 | 1 | 0.06 | 0.07 | 85.49
## 50 | 34 | 1.97 | 2.27 | 87.77
## 52 | 23 | 1.33 | 1.54 | 89.30
## 60 | 6 | 0.35 | 0.40 | 89.71
## 62 | 1 | 0.06 | 0.07 | 89.77
## 70 | 2 | 0.12 | 0.13 | 89.91
## 75 | 4 | 0.23 | 0.27 | 90.17
## 80 | 4 | 0.23 | 0.27 | 90.44
## 85 | 1 | 0.06 | 0.07 | 90.51
## 100 | 29 | 1.68 | 1.94 | 92.45
## 104 | 2 | 0.12 | 0.13 | 92.58
## 130 | 1 | 0.06 | 0.07 | 92.65
## 144 | 2 | 0.12 | 0.13 | 92.78
## 150 | 11 | 0.64 | 0.74 | 93.52
## 156 | 1 | 0.06 | 0.07 | 93.58
## 160 | 1 | 0.06 | 0.07 | 93.65
## 175 | 1 | 0.06 | 0.07 | 93.72
## 200 | 14 | 0.81 | 0.94 | 94.65
## 208 | 2 | 0.12 | 0.13 | 94.79
## 240 | 1 | 0.06 | 0.07 | 94.85
## 250 | 3 | 0.17 | 0.20 | 95.05
## 270 | 1 | 0.06 | 0.07 | 95.12
## 300 | 19 | 1.10 | 1.27 | 96.39
## 340 | 1 | 0.06 | 0.07 | 96.46
## 350 | 1 | 0.06 | 0.07 | 96.52
## 360 | 6 | 0.35 | 0.40 | 96.93
## 365 | 28 | 1.62 | 1.87 | 98.80
## 400 | 1 | 0.06 | 0.07 | 98.86
## 450 | 1 | 0.06 | 0.07 | 98.93
## 500 | 3 | 0.17 | 0.20 | 99.13
## 600 | 4 | 0.23 | 0.27 | 99.40
## 700 | 3 | 0.17 | 0.20 | 99.60
## 730 | 2 | 0.12 | 0.13 | 99.73
## 900 | 1 | 0.06 | 0.07 | 99.80
## 999 | 3 | 0.17 | 0.20 | 100.00
## <NA> | 229 | 13.28 | <NA> | <NA>
##
## Y6-962:HIGH ON MARIJUANA PAST YEAR (V966) <numeric>
## # total N=1725 valid N=649 mean=61.87 sd=134.07
##
## Value | N | Raw % | Valid % | Cum. %
## ---------------------------------------
## 0 | 66 | 3.83 | 10.17 | 10.17
## 1 | 64 | 3.71 | 9.86 | 20.03
## 2 | 77 | 4.46 | 11.86 | 31.90
## 3 | 33 | 1.91 | 5.08 | 36.98
## 4 | 35 | 2.03 | 5.39 | 42.37
## 5 | 30 | 1.74 | 4.62 | 47.00
## 6 | 20 | 1.16 | 3.08 | 50.08
## 7 | 8 | 0.46 | 1.23 | 51.31
## 8 | 5 | 0.29 | 0.77 | 52.08
## 9 | 1 | 0.06 | 0.15 | 52.23
## 10 | 31 | 1.80 | 4.78 | 57.01
## 12 | 21 | 1.22 | 3.24 | 60.25
## 14 | 1 | 0.06 | 0.15 | 60.40
## 15 | 15 | 0.87 | 2.31 | 62.71
## 18 | 1 | 0.06 | 0.15 | 62.87
## 20 | 25 | 1.45 | 3.85 | 66.72
## 22 | 1 | 0.06 | 0.15 | 66.87
## 24 | 2 | 0.12 | 0.31 | 67.18
## 25 | 9 | 0.52 | 1.39 | 68.57
## 26 | 2 | 0.12 | 0.31 | 68.88
## 30 | 16 | 0.93 | 2.47 | 71.34
## 35 | 2 | 0.12 | 0.31 | 71.65
## 40 | 8 | 0.46 | 1.23 | 72.88
## 45 | 2 | 0.12 | 0.31 | 73.19
## 48 | 1 | 0.06 | 0.15 | 73.34
## 50 | 27 | 1.57 | 4.16 | 77.50
## 52 | 10 | 0.58 | 1.54 | 79.04
## 60 | 2 | 0.12 | 0.31 | 79.35
## 62 | 1 | 0.06 | 0.15 | 79.51
## 70 | 2 | 0.12 | 0.31 | 79.82
## 75 | 3 | 0.17 | 0.46 | 80.28
## 80 | 3 | 0.17 | 0.46 | 80.74
## 85 | 2 | 0.12 | 0.31 | 81.05
## 100 | 30 | 1.74 | 4.62 | 85.67
## 104 | 2 | 0.12 | 0.31 | 85.98
## 110 | 1 | 0.06 | 0.15 | 86.13
## 125 | 1 | 0.06 | 0.15 | 86.29
## 150 | 8 | 0.46 | 1.23 | 87.52
## 156 | 1 | 0.06 | 0.15 | 87.67
## 160 | 2 | 0.12 | 0.31 | 87.98
## 200 | 11 | 0.64 | 1.69 | 89.68
## 240 | 1 | 0.06 | 0.15 | 89.83
## 250 | 4 | 0.23 | 0.62 | 90.45
## 270 | 2 | 0.12 | 0.31 | 90.76
## 300 | 20 | 1.16 | 3.08 | 93.84
## 320 | 1 | 0.06 | 0.15 | 93.99
## 350 | 2 | 0.12 | 0.31 | 94.30
## 352 | 1 | 0.06 | 0.15 | 94.45
## 360 | 4 | 0.23 | 0.62 | 95.07
## 365 | 21 | 1.22 | 3.24 | 98.31
## 400 | 1 | 0.06 | 0.15 | 98.46
## 450 | 1 | 0.06 | 0.15 | 98.61
## 500 | 1 | 0.06 | 0.15 | 98.77
## 600 | 2 | 0.12 | 0.31 | 99.08
## 700 | 1 | 0.06 | 0.15 | 99.23
## 999 | 5 | 0.29 | 0.77 | 100.00
## <NA> | 1076 | 62.38 | <NA> | <NA>
nys_w6 %>%
descr(V890, V966)
##
## ## Basic descriptive statistics
##
## var type label n NA.prc mean sd se
## V890 numeric Y6-886:MARIJUANA-FREQ 1496 13.28 32.74 102.52 2.65
## V966 numeric Y6-962:HIGH ON MARIJUANA PAST YEAR 649 62.38 61.87 134.07 5.26
## md trimmed range iqr skew
## 0 6.10 999 (0-999) 8 4.90
## 6 27.62 999 (0-999) 48 3.81
What jumps out at you from these distributions? Like before, and something common for lots of deviant behaviors, is that the data are right skewed for the question about “using marijuana,” with zero being the modal answer. However, another thing that should jump out at you is the number of missing (“NA”) cases in each of these items. Specifically, the question about getting high (V966) is missing for 62% of the sample!
When I (Jake) first saw this, it was not completely clear to me why so many cases were missing on this variable. My hunch was that it was a result of a skip pattern in the survey, specifically that they only asked the question about “getting high” to the respondents who reported that they had used marijuana in the past year (V890). However, when I looked at the codebook (both the ICPSR version and the original version included in the ICPSR documentation), this skip pattern was not completely clear. Ultimately, I had to go to the original survey instrument that is included in the codebook from ICPSR:
Unfortunately, the original page numbers are not included in the instrument that is attached to the ICPSR codebook. However, with some simple math, we can be fairly confident that the large number of missing for the question about getting high resulted from respondents who reported no use not being asked that question.
Go back and look at the distributions for V890 about use and V966 about getting high. Notice that for V890, there are 1,496 “valid” responses (i.e. non-missing) and 847 respondents who reported zero marijuana use in the past year. If you subtract 847 from 1,496, you get 649—the number of valid cases in V966.
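The “simple math” can be checked in R, along with the kind of cross-check you could run on the real data. The counts come from the `frq` output above; the small data frame is a toy example we made up to show what a skip pattern looks like, not NYS data.

```r
# Counts copied from the frq() output above
valid_use  <- 1496   # valid responses on V890 (use)
zero_use   <- 847    # respondents reporting zero use
valid_high <- 649    # valid responses on V966 (getting high)

# Users (valid minus zero) should equal the valid N on the "high" item
valid_use - zero_use == valid_high

# TOY data illustrating a skip pattern: non-users were never asked about
# getting high, so their answer is missing (NA)
toy <- data.frame(
  maruse  = c(0, 0, 3, 10, NA),
  marhigh = c(NA, NA, 0, 10, NA)
)

nonusers <- toy[which(toy$maruse == 0), ]
users    <- toy[which(toy$maruse > 0), ]

all_nonusers_skipped <- all(is.na(nonusers$marhigh))
all_users_answered   <- all(!is.na(users$marhigh))

# A similar check on the real data might look like:
# nys_w6 %>% count(V890 == 0, is.na(V966))
```

If both toy checks are `TRUE`, the missing values on the second item line up exactly with non-use on the first, which is the pattern the counts suggest for the NYS.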
Ultimately, this means we’ll need both questions to construct a measure similar to Orcutt (1987). Essentially, we’ll want to create a dichotomous variable that codes those who answered zero on either the question about “using marijuana” or the question about “getting high” from using marijuana as “non-users” (0 = Incompetent-/Non-User). Any respondent who answered one or above on the question about getting high will be coded as a (competent) “user” in Orcutt’s terms (1 = Competent User).
Note: If we were doing this conceptual replication in the wild, we would likely want to directly examine the distinction between competent and incompetent use by distinguishing between 1) non-users (i.e., answered zero on the question about use), 2) incompetent users (answered 1 or more on the question about use and answered zero on the question about getting high), and 3) competent users (answered above zero on the question about getting high).
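For readers curious what that three-category version might look like, here is a minimal sketch using `case_when()` on made-up values. The variable names `maruse` and `marhigh` match the names used later in this assignment, but `user_type` and the toy values are ours, and this coding is not used in the rest of the assignment.

```r
library(dplyr)

# TOY data (not NYS data): times used and times high in the past year
toy <- data.frame(
  maruse  = c(0, 2, 5, NA, 0),
  marhigh = c(NA, 0, 3, NA, NA)
)

toy <- toy %>%
  mutate(
    user_type = case_when(
      maruse == 0                ~ "Non-user",          # never used
      maruse >= 1 & marhigh == 0 ~ "Incompetent user",  # used, never high
      marhigh >= 1               ~ "Competent user"     # got high at least once
      # anything unmatched (e.g., missing on both items) stays NA
    )
  )

toy
```

Note that the non-user condition is checked first, so respondents who skipped the “getting high” question because they reported no use are still classified.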
Up to this point, the bulk of the work has been intellectual and theoretical. Indeed, that’s the “conceptual” part of conducting a conceptual replication. But now that we know what items from the NYS align most closely with the items used in Orcutt’s (1987) study and have a good idea of how we want to use them, we need to wrangle and recode the data so that we can analyze it. Like with “R Assignment 5” and “R Assignment 6,” this will involve selecting the specific variables, giving them informative names, and recoding them to closely resemble Orcutt’s coding decisions. Like with the previous assignments, in addition to the items identified above that will be used for the conceptual replication, we will also select the “CASEID” variable in order to maintain the individual identifier and we’ll create the “wave” variable to indicate that the data we are working with comes from Wave 6.
Here is the code for selecting items, renaming them, and recoding them. We’ll explain the logic below:
library(dplyr)
nys_w6_trim <- nys_w6 %>%
  dplyr::select(CASEID, V371, V356, V890, V966) %>%
  #Provide Informative Names:
  rename(marpeer = V371,
         mardef = V356,
         maruse = V890,
         marhigh = V966) %>%
  #Recode key Variables
  mutate(marpeer_fct = as_factor(marpeer), #create factor variable from marpeer
         mardef_fct = as_factor(mardef), #create factor variable from mardef
         mardef_neg = ifelse((mardef == 2 | mardef == 3 | mardef == 4), 1, 0), #create dummy variable indicating negative definition of marijuana ("A little bit wrong" to "Very Wrong")
         mardef_pos = ifelse(mardef == 1, 1, 0), #create dummy variable indicating positive definition of marijuana ("Not Wrong")
         mardef_neut = ifelse(mardef == 2, 1, 0), #create dummy variable indicating neutral definition of marijuana ("A little bit wrong")
         mardef_negneut = ifelse((mardef == 3 | mardef == 4), 1, 0), #create dummy variable indicating negative definition ("Wrong" and "Very Wrong") to align with neutral definition ("A little bit wrong")
         mardef_negneut = ifelse(is.na(mardef), NA, mardef_negneut), #keep missing values on mardef as missing
         maruse_dic = ifelse(maruse >= 1, 1, 0), #create dummy variable for maruse
         marhigh_dic = ifelse(marhigh >= 1, 1, 0), #responses 1 or greater on marhigh = 1
         marhigh_dic = ifelse(maruse == 0, 0, marhigh_dic), #responses of 0 on maruse are coded as 0
         wave = 6)
glimpse(nys_w6_trim)
## Rows: 1,725
## Columns: 14
## $ CASEID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ~
## $ marpeer <dbl+lbl> 4, NA, 1, 4, 3, 3, 4, 3, 3, 1, 3, 2, 3,~
## $ mardef <dbl+lbl> 2, NA, 4, 2, 2, 2, 2, 2, 2, 4, 2, 3, 1,~
## $ maruse <dbl> 300, NA, 0, 300, 2, 100, 4, 4, 0, 0, 0, 0, 9, 25, 150, ~
## $ marhigh <dbl> 300, NA, NA, 300, 2, 100, 4, 1, NA, NA, NA, 1, 10, 25, ~
## $ marpeer_fct <fct> Most of them, NA, None of them, Most of them, Some of t~
## $ mardef_fct <fct> A little bit wrong, NA, Very wrong, A little bit wrong,~
## $ mardef_neg <dbl> 1, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, ~
## $ mardef_pos <dbl> 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, ~
## $ mardef_neut <dbl> 1, NA, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, ~
## $ mardef_negneut <dbl> 0, NA, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, ~
## $ maruse_dic <dbl> 1, NA, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, ~
## $ marhigh_dic <dbl> 1, NA, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, ~
## $ wave <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6~
There are a few things that we should explain about the above code.
In the raw `nys_w6` data that we imported from ICPSR, all of the categorical variables are treated as numerical in the data frame (specifically “double” format with labels). You can see this by looking at the output from the `glimpse(nys_w6_trim)` function above (see the `<dbl+lbl>` type shown next to “marpeer” and “mardef”).
We want to make sure that R knows these variables are categorical and thus we need to convert them to factors, R’s method for working with categorical variables (see Ch. 15 of R for Data Science for more details). Like with most things, the tidyverse suite of packages has a built-in package for working with factors called “forcats”.
Actually, the data frame already includes value labels for each categorical variable. You can see this for any variable by using the function `get_labels()` that is part of the “sjlabelled” package (you can also see this in the `glimpse()` output above, where the “marpeer” and “mardef” items have `<dbl+lbl>` next to them).
library(sjlabelled)
get_labels(nys_w6$V371)
## [1] "None of them" "Very few of them" "Some of them" "Most of them"
## [5] "All of them"
get_labels(nys_w6$V356)
## [1] "Not wrong" "A little bit wrong" "Wrong"
## [4] "Very wrong"
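To see how labelled values turn into factor levels, here is a small sketch using a toy labelled vector built with `haven::labelled()`. The labels match those printed above for V356, but the values in `mardef_toy` are made up.

```r
library(haven)

# TOY labelled vector mimicking how V356 is stored after read_spss()
mardef_toy <- labelled(
  c(2, 4, 1, NA, 3),
  labels = c("Not wrong" = 1, "A little bit wrong" = 2,
             "Wrong" = 3, "Very wrong" = 4)
)

# as_factor() replaces the numeric codes with their value labels
mardef_toy_fct <- as_factor(mardef_toy)

levels(mardef_toy_fct)        # the value labels, in order of their codes
as.character(mardef_toy_fct)  # each response shown as its label
```

Missing values stay missing after the conversion, which is what we want when tabulating the factor versions later.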
The `as_factor` function simply tells R to create a factor variable and, in this case, uses the built-in value labels of the variables we are turning into factors to indicate the factor levels.
We used the `ifelse` command to create four dummy variables related to the “subjective definitions” item. Remember that the `ifelse` command takes the form of a logical test: `ifelse(test, yes, no)`. We first created two dummy variables distinguishing between “positive definitions” (mardef_pos) and “negative definitions” (mardef_neg).
For the “mardef_neg” variable, we simply told R to assign a value
of 1 if the respondent answered “A little wrong” (2) or
“Wrong,” (3) or “Very wrong” (4) and zero otherwise. Recall,
this question had unipolar answer categories ranging from “Not wrong” to
“very Wrong” views of marijuana use.
For the “mardef_pos” variable, we simply told R to code values of “Not wrong” (1) as 1 and zero otherwise. These two dummy variables are likely as close as we can get to Orcutt’s (1987) three dummy variables for “Negative,” “Neutral,” and “Positive” subjective definitions of marijuana.
We also created two dummy variables, “mardef_neut” and “mardef_negneut”, that account for a “neutral” definition by treating the “A little bit wrong” answer category (2) as neutral and the “Wrong” (3) and “Very wrong” (4) answers as negative. Note we had to create an additional “negative” dummy variable because dummy variables for multiple categories of the same variable should be mutually exclusive. The “mardef_neg” variable included the “A little bit wrong” answer and thus would have overlapped with our “mardef_neut” variable. Of course, given the unipolar nature of the “subjective definitions” question in NYS, this specific coding strategy may not be conceptually justified.
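To make the coding rules above concrete, here is a minimal sketch on a made-up five-row data frame (the object name toy and its values are ours, not the NYS data). It assumes the item is stored as plain numbers from 1 to 4, and it uses comparisons such as >= so that a missing answer stays missing on every dummy.

```r
library(dplyr)

# Made-up stand-in for the four-point item (not the NYS data):
# 1 = Not wrong, 2 = A little bit wrong, 3 = Wrong, 4 = Very wrong
toy <- data.frame(mardef = c(1, 2, 3, 4, NA))

toy <- toy %>%
  mutate(
    mardef_pos     = ifelse(mardef == 1, 1, 0),  # "Not wrong" only
    mardef_neg     = ifelse(mardef >= 2, 1, 0),  # any degree of "wrong"
    mardef_neut    = ifelse(mardef == 2, 1, 0),  # "A little bit wrong" treated as neutral
    mardef_negneut = ifelse(mardef >= 3, 1, 0)   # "Wrong" or "Very wrong"
  )

toy
```

Because a comparison with NA returns NA, the respondent with a missing answer is NA on all four dummies rather than being silently coded as zero.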
We used the if_else
function to create two dummy
variables for the using marijuana (“maruse_dic”) and getting high from
marijuana (“marhigh_dic”) items.
For the “maruse_dic” variable, we simply created dummy variables similar to how we did in “R Assignment 3” by telling R that if the “maruse” variable is greater than or equal to one, make the “maruse_dic” variable equal to 1 and make it zero otherwise (i.e. if it equals zero).
For the “marhigh_dic” variable, we needed two
if_else
commands that are essentially nested. This is
because we needed the “marhigh_dic” variable to account for those who
reported using marijuana but did not get high.
First, we created the dummy variable like we did with “maruse_dic” (if the “marhigh” variable is greater than or equal to one, make the “marhigh_dic” variable equal to 1 and make it zero otherwise, i.e. if it equals zero). This takes the 649 valid cases in the “marhigh” variable and assigns them to the appropriate category (1 or 0). But this only affects the 649 valid cases.
Second, in order to account for the fact that (almost all)
respondents who reported not using marijuana in the past year
were not asked if they got high from marijuana, we needed another
if_else
command. Thus, the second if_else
command tells R that if “maruse” equals zero make “marhigh_dic” equal to
zero and otherwise leave it as the value “marhigh_dic” already is. This
is how we nested the two if_else
commands. If the second
command does not apply (i.e. maruse does not equal zero), then the
“marhigh_dic” variable remains the value we told it to take in the first
if_else
command.
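The two-step logic described above can be sketched on a small made-up data frame (the values below are invented to cover each case; they are not NYS responses). We assume both items are plain numeric counts.

```r
library(dplyr)

# Made-up cases: a skipped non-user, a non-user who still reported a high,
# two users with valid answers, a user who skipped the item, and a fully
# missing respondent
toy <- data.frame(
  maruse  = c(0, 0, 3, 5, 2, NA),
  marhigh = c(NA, 1, 3, 0, NA, NA)
)

toy <- toy %>%
  mutate(
    # Step 1: dichotomize the valid "getting high" answers
    marhigh_dic = if_else(marhigh >= 1, 1, 0),
    # Step 2: anyone reporting zero use is set to 0; everyone else keeps
    # the value assigned in step 1 (including NA)
    marhigh_dic = if_else(maruse == 0, 0, marhigh_dic)
  )

toy
```

In this sketch the second row (zero use, but one reported high) ends up coded 0, and the user who skipped the getting high item stays missing.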
To see this, suppose we run only the first if_else command and create a frequency table for the “marhigh_dic” variable. Here is what it will look like:
## marhigh_dic <numeric>
## # total N=1725 valid N=649 mean=0.90 sd=0.30
##
## Value | N | Raw % | Valid % | Cum. %
## ---------------------------------------
## 0 | 66 | 3.83 | 10.17 | 10.17
## 1 | 583 | 33.80 | 89.83 | 100.00
## <NA> | 1076 | 62.38 | <NA> | <NA>
And here is what it looks like after running both if_else commands:
## marhigh_dic <numeric>
## # total N=1725 valid N=1493 mean=0.39 sd=0.49
##
## Value | N | Raw % | Valid % | Cum. %
## --------------------------------------
## 0 | 911 | 52.81 | 61.02 | 61.02
## 1 | 582 | 33.74 | 38.98 | 100.00
## <NA> | 232 | 13.45 | <NA> | <NA>
Remember, anytime you recode and/or manipulate data, you want to check that R did what you wanted it to do.
One way to check is with the flat_table() function from the “sjmisc” package. We could also use the tabyl() function from the “janitor” package like in “R Assignment 6.” But for our purposes here, we like that the flat_table() function prints out the value labels (and numerical values with the show.values = TRUE option) instead of just the numerical values.
library(sjmisc)
nys_w6_trim %>%
flat_table(marpeer_fct, marpeer, show.values = TRUE)
## marpeer [1] None of them [2] Very few of them [3] Some of them [4] Most of them [5] All of them
## marpeer_fct
## None of them 468 0 0 0 0
## Very few of them 0 318 0 0 0
## Some of them 0 0 349 0 0
## Most of them 0 0 0 196 0
## All of them 0 0 0 0 133
nys_w6_trim %>%
flat_table(mardef_fct, mardef, show.values = TRUE)
## mardef [1] Not wrong [2] A little bit wrong [3] Wrong [4] Very wrong
## mardef_fct
## Not wrong 215 0 0 0
## A little bit wrong 0 381 0 0
## Wrong 0 0 473 0
## Very wrong 0 0 0 427
Notice that we included the show.values = TRUE
option in
the flat_table()
function. This is because the
flat_table()
function automatically includes the value
labels. By including the show.values = TRUE
option, it also
places the actual value in the data set in brackets next to the value
labels. Notice how the new factor variable “mardef_fct” doesn’t have any
bracketed values? That’s because by creating it as a factor variable, R
now recognizes it as a true categorical variable without a real
numerical value. Of course, R still has the levels ordered to represent
the numerical values of the original variable. Thus, if you include the
factor variables in a summary statistics table as before, it will give
you the same values.
Also note that if you want to check the levels of a factor, you can
use the base R command levels()
and specify the variable
from a specific data frame by using the $
operator (e.g.,
levels(nys_w6_trim$mardef_fct)
)
nys_w6_trim %>%
descr(marpeer, marpeer_fct, mardef, mardef_fct)
##
## ## Basic descriptive statistics
##
## var type label n NA.prc mean sd
## marpeer numeric Y6-367:FRIENDS-USED MARIJUANA 1464 15.13 2.46 1.30
## marpeer_fct categorical Y6-367:FRIENDS-USED MARIJUANA 1464 15.13 2.46 1.30
## mardef numeric Y6-352:USE MARIJUANA 1496 13.28 2.74 1.02
## mardef_fct categorical Y6-352:USE MARIJUANA 1496 13.28 2.74 1.02
## se md trimmed range iqr skew
## 0.03 2 2.46 4 (1-5) 2 0.45
## 0.03 2 2.46 4 (1-5) 2 0.45
## 0.03 3 2.74 3 (1-4) 2 -0.27
## 0.03 3 2.74 3 (1-4) 2 -0.27
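The identical means for the numeric and factor versions make sense once you remember that a factor stores integer codes that follow the order of its levels. Here is a minimal base R sketch with made-up answers (how descr() handles factors internally is our assumption; the output above is consistent with it using these codes).

```r
# Levels in the same order as the original 1-4 coding
wrong_levels <- c("Not wrong", "A little bit wrong", "Wrong", "Very wrong")

# Four made-up answers stored as a factor
toy_fct <- factor(c("Wrong", "Not wrong", "Very wrong", "Wrong"),
                  levels = wrong_levels)

levels(toy_fct)            # the labels, in level order
codes <- as.integer(toy_fct)
codes                      # position of each answer within the levels
mean(codes)                # mean of the underlying integer codes
```

If the levels were in a different order, the integer codes (and any mean computed from them) would change, which is another reason to check levels() after creating a factor.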
For our dummy variables, we simply need to check that they are coded as we expected (i.e. the answer categories we thought we assigned to 1 and 0 are actually assigned to those values) and that they are mutually exclusive (related dummy variables do not overlap). To do that, we can look at cross-tabulations (i.e. “crosstabs”) with them and the factor variables from which they were created:
nys_w6_trim %>%
flat_table(mardef, mardef_neg, show.values = TRUE)
## mardef_neg 0 1
## mardef
## [1] Not wrong 215 0
## [2] A little bit wrong 0 381
## [3] Wrong 0 473
## [4] Very wrong 0 427
nys_w6_trim %>%
flat_table(mardef, mardef_pos, show.values = TRUE)
## mardef_pos 0 1
## mardef
## [1] Not wrong 0 215
## [2] A little bit wrong 381 0
## [3] Wrong 473 0
## [4] Very wrong 427 0
nys_w6_trim %>%
flat_table(mardef, mardef_neut, show.values = TRUE)
## mardef_neut 0 1
## mardef
## [1] Not wrong 215 0
## [2] A little bit wrong 0 381
## [3] Wrong 473 0
## [4] Very wrong 427 0
nys_w6_trim %>%
flat_table(mardef, mardef_negneut, show.values = TRUE)
## mardef_negneut 0 1
## mardef
## [1] Not wrong 215 0
## [2] A little bit wrong 381 0
## [3] Wrong 0 473
## [4] Very wrong 0 427
Our dummy coding seems to have worked as expected and, within each related set of dummy variables, the categories are mutually exclusive. That is, observations are mutually exclusive within the “mardef_pos” and “mardef_neg” set of dummy variables and within the “mardef_pos”, “mardef_neut”, and “mardef_negneut” set of dummy variables.
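Reading crosstabs works well with a handful of dummies, but the same check can also be done programmatically. The sketch below uses a made-up data frame of already-coded dummies (not the NYS data) and simply asks whether each respondent scores 1 on exactly one dummy within a set.

```r
# Made-up dummy codes for five respondents (the last one is missing)
toy <- data.frame(
  mardef_pos     = c(1, 0, 0, 0, NA),
  mardef_neg     = c(0, 1, 1, 1, NA),
  mardef_neut    = c(0, 1, 0, 0, NA),
  mardef_negneut = c(0, 0, 1, 1, NA)
)

# Within a proper set, every non-missing row should sum to exactly 1
two_set   <- rowSums(toy[, c("mardef_pos", "mardef_neg")])
three_set <- rowSums(toy[, c("mardef_pos", "mardef_neut", "mardef_negneut")])
two_ok   <- all(two_set == 1, na.rm = TRUE)
three_ok <- all(three_set == 1, na.rm = TRUE)

# Mixing dummies from different sets produces overlap (row sums above 1)
mixed_overlap <- any(rowSums(toy[, c("mardef_neg", "mardef_neut")]) > 1, na.rm = TRUE)

c(two_ok = two_ok, three_ok = three_ok, mixed_overlap = mixed_overlap)
```

The last check shows why “mardef_neg” and “mardef_neut” should not be used together: the “A little bit wrong” respondent would be counted in both.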
Now let’s check the marijuana use and getting high from marijuana items. We created a dummy variable (“marhigh_dic”) to be congruent with Orcutt’s (1987) coding decision. Specifically, if our recode logic worked, we should have everyone reporting getting “high from marijuana” one or more times in the past year coded as 1 and those who did not use marijuana or did not get high from marijuana in the past year coded as zero. Those who were missing on the use question should also be missing on our newly created variable.
nys_w6_trim %>%
frq(maruse_dic, marhigh_dic)
## maruse_dic <numeric>
## # total N=1725 valid N=1496 mean=0.43 sd=0.50
##
## Value | N | Raw % | Valid % | Cum. %
## --------------------------------------
## 0 | 847 | 49.10 | 56.62 | 56.62
## 1 | 649 | 37.62 | 43.38 | 100.00
## <NA> | 229 | 13.28 | <NA> | <NA>
##
## marhigh_dic <numeric>
## # total N=1725 valid N=1493 mean=0.39 sd=0.49
##
## Value | N | Raw % | Valid % | Cum. %
## --------------------------------------
## 0 | 911 | 52.81 | 61.02 | 61.02
## 1 | 582 | 33.74 | 38.98 | 100.00
## <NA> | 232 | 13.45 | <NA> | <NA>
The first thing that jumps out is that the “marhigh_dic” variable has three additional responses that are missing. This means that three people who answered the question about using marijuana in the past year did not answer the question about getting high in the past year. But this poses a puzzle. Recall that the number of valid cases on the getting high item (V966) was exactly the number of valid cases on the use item minus the number of respondents who answered they had used marijuana “zero” times in the past year. What gives?
To figure this out, we can cross-tabulate the two variables using the tabyl() function, as it automatically prints out the missing category unlike the flat_table() function (note: I use the adorn_title() function simply to tell tabyl() to print the variable name for the columns).
nys_w6_trim %>%
tabyl(maruse_dic, marhigh_dic) %>%
adorn_title()
## marhigh_dic
## maruse_dic 0 1 NA_
## 0 847 0 0
## 1 64 582 3
## <NA> 0 0 229
First, notice that the cell corresponding to missing on both “maruse_dic” and “marhigh_dic” has 229 observations in it. This makes sense as it suggests the same respondents who did not answer the question about marijuana use also didn’t answer the questions about getting high (this is probably largely a result of attrition between survey waves - i.e. they were not able to follow-up with most of these respondents).
Second, notice the cell corresponding to 1 on the “maruse_dic” variable and NA on the “marhigh_dic” variable has 3 observations. This is the source of difference between the missing cases we noticed earlier when looking at the frequencies for these variables. Three respondents who answered that they used marijuana one or more times in the past year did not answer the question about how many times they got high from using marijuana in the past year.
Finally, notice the cell corresponding to those with zero on “maruse_dic” and zero on “marhigh_dic.” All 847 non-users are coded zero here because of our second if_else command, so this cell cannot show who actually answered the getting high question. The arithmetic can, though: only 646 of the 649 users (64 + 582) have valid values on the getting high item, yet that item has 649 valid cases overall. This indicates that three people who answered they had not used marijuana in the past year were also asked about whether they got high, a violation of the skip pattern. This is why we could have 3 additional missing values on the getting high question but still have the math work that we discussed earlier (i.e., “If you subtract the number answering zero on the use question, 847, from the total number of valid cases, 1,496, you get 649, the number of valid cases in the getting high question”). Essentially, we lost three who didn’t answer the getting high question but should have and gained three who answered the getting high question but shouldn’t have.
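The bookkeeping in the last three paragraphs can be written out as plain arithmetic. All of the counts below are copied from the frequency tables and the crosstab above; only the object names are ours.

```r
valid_use        <- 1496  # valid cases on the use item
zero_use         <- 847   # respondents reporting zero use
valid_high_raw   <- 649   # valid cases on the original getting high item
users_high_valid <- 64 + 582  # users with a valid marhigh_dic value (0 or 1)

users <- valid_use - zero_use                      # respondents reporting any use
users_missing_high <- users - users_high_valid     # users who skipped the item
nonusers_answered  <- valid_high_raw - users_high_valid  # non-users who answered it

c(users = users,
  users_missing_high = users_missing_high,
  nonusers_answered = nonusers_answered)

# Valid cases on marhigh_dic after both recodes: all non-users plus valid users
zero_use + users_high_valid
```

The two offsetting groups of three are why the raw valid N on the getting high item still equals the number of users.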
The crucial question for us though is did we correctly classify
those 3 subjects who answered “zero” to the number of times they used
marijuana but also answered the getting high question. Based on our
current dichotomous coding strategy, they are coded as “zero” on the
“marhigh_dic” variable. This is because in our nested
if_else
commands, we told R to code all subjects who
answered “zero” on the marijuana use question to be zero on the
“marhigh_dic” variable, and we did this after telling R to code those
who responded 1 or more on the getting high question as 1. Essentially,
if any of these three respondents answered “zero” to using marijuana but
answered 1 or more to getting high, they are coded as zero in our
“marhigh_dic” variable. This sounds nonsensical, but it is a possibility
for these three subjects in the data.
nys_w6_trim %>%
mutate(maruse_trunc = ifelse(maruse >= 10, 10, maruse),
marhigh_trunc = ifelse(marhigh >= 10, 10, marhigh)) %>%
flat_table(maruse_trunc, marhigh_trunc, exclude = NULL)
## marhigh_trunc 0 1 2 3 4 5 6 7 8 9 10
## maruse_trunc
## 0 2 1 0 0 0 0 0 0 0 0 0
## 1 29 39 3 0 0 0 0 0 0 0 0
## 2 14 8 48 2 0 0 0 0 1 0 1
## 3 8 8 6 17 1 1 0 0 0 0 0
## 4 2 4 5 2 17 1 0 0 0 0 0
## 5 3 0 3 2 4 16 5 1 0 0 0
## 6 1 0 0 2 1 4 6 1 0 0 1
## 7 0 0 0 1 0 1 1 2 0 0 0
## 8 0 0 0 1 1 1 0 0 1 0 0
## 9 0 0 0 0 0 0 0 0 0 0 1
## 10 7 4 12 6 11 6 8 4 3 1 307
First, notice in the ifelse commands above, I told R to create two new variables that equal 10 if “maruse” (or “marhigh”) is greater than or equal to 10 and equal their existing value otherwise.
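As an aside, this kind of top-coding can also be done with the base R function pmin(), which returns the smaller of each value and the cap. The sketch below uses a made-up numeric vector; we have not tested it on the labelled NYS variables, so treat it as an alternative to try rather than a drop-in replacement.

```r
# Made-up counts, including a missing value
x <- c(0, 3, 10, 25, NA)

trunc_ifelse <- ifelse(x >= 10, 10, x)  # approach used in the text
trunc_pmin   <- pmin(x, 10)             # base R alternative

# Both approaches cap values at 10 and leave NA as NA
all.equal(trunc_ifelse, trunc_pmin)
```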
Second, in terms of the reason we wanted to look at this crosstab, notice that the three people who answered zero to the question about use are in the top row of the crosstab. Two of them answered “zero”, as expected, to the number of times they had gotten high on marijuana. However, one of them had reported zero use but also reported getting high on marijuana one time. Also notice that there are multiple people who report getting high on marijuana more than they report using marijuana (just look at the values above the diagonal in the crosstab; these are all respondents who report getting high on marijuana more times than they report using marijuana). What gives?
Some of this could simply be measurement error, perhaps resulting from respondents not remembering exactly how many times they have used and gotten high from marijuana. It could also reflect differences in how each question was specifically asked. Recall that for the marijuana use question respondents were asked specifically about using “marijuana and hashish” whereas with the getting high question they were only asked specifically about “marijuana.” But this type of difference in interpretation would seem to explain only the values below the diagonal in the crosstab (i.e., respondents who count both marijuana and hashish in answering about use but only count marijuana in answering about getting high would likely report getting high less than they report use).
Of course, these discrepancies between use and getting high could also reflect real differences in behavior and interpretation of the questions. For example, perhaps respondents think of “smoking marijuana” when they think of use, but think of multiple things like “edibles”, “contact high”, etc. when they think of how many times they have “been high on marijuana.”
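If you want to put numbers on “above” and “below” the diagonal, one option is to re-enter the crosstab as a matrix and sum its triangles. The counts below are typed in from the output above. Because both variables were top-coded at 10, the bottom-right cell mixes exact matches with mismatches among heavy users, so the off-diagonal totals are best read as lower bounds.

```r
# Crosstab counts copied from the output above
# (rows = maruse_trunc 0-10, columns = marhigh_trunc 0-10)
tab <- matrix(c(
   2,  1,  0,  0,  0,  0,  0,  0,  0,  0,   0,
  29, 39,  3,  0,  0,  0,  0,  0,  0,  0,   0,
  14,  8, 48,  2,  0,  0,  0,  0,  1,  0,   1,
   8,  8,  6, 17,  1,  1,  0,  0,  0,  0,   0,
   2,  4,  5,  2, 17,  1,  0,  0,  0,  0,   0,
   3,  0,  3,  2,  4, 16,  5,  1,  0,  0,   0,
   1,  0,  0,  2,  1,  4,  6,  1,  0,  0,   1,
   0,  0,  0,  1,  0,  1,  1,  2,  0,  0,   0,
   0,  0,  0,  1,  1,  1,  0,  0,  1,  0,   0,
   0,  0,  0,  0,  0,  0,  0,  0,  0,  0,   1,
   7,  4, 12,  6, 11,  6,  8,  4,  3,  1, 307
), nrow = 11, byrow = TRUE,
dimnames = list(maruse = as.character(0:10), marhigh = as.character(0:10)))

above   <- sum(tab[upper.tri(tab)])  # reported more highs than uses
on_diag <- sum(diag(tab))            # same (top-coded) answer on both items
below   <- sum(tab[lower.tri(tab)])  # reported fewer highs than uses

c(above = above, on_diag = on_diag, below = below, total = sum(tab))
```

In this table, 20 respondents fall above the diagonal and 174 fall below it, so reporting fewer highs than uses is by far the more common kind of mismatch.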
You know how Ritchie (2020) keeps saying data in the real world are messy? This is what he is talking about! Ultimately, we don’t know why these discrepancies exist. But, for our conceptual replication purposes, Orcutt (1987) was primarily interested in competent use. Thus, if respondents don’t report using marijuana but report getting high (e.g. from edibles, contact highs, etc.) it would seem to not reflect competent use as Orcutt (1987) intended. Thus, we are comfortable with this one individual being coded as zero.
We are now ready to actually replicate Orcutt’s descriptive analyses, but we think it’s important to take a moment and recognize all the work we had to do to get to this point. In addition to the conceptual work, we also had to spend a fair amount of time wrangling the data and checking to make sure our operations worked how we intended. As you learned in “R Assignment 6,” it is completely normal for the bulk of data analysis work to be taken up with these data management and data wrangling tasks.
Given we have our data in order, the next step is to summarize it similar to Orcutt.
Orcutt is simply presenting the means and standard deviations (in
parentheses) for each of the key variables in his analyses for both
samples separately and for the combined sample. In order to reproduce
this from wave 6 of the NYS, we first need to calculate these summary
statistics for each of the key variables we wrangled above. Fortunately,
the “sjmisc” package basically does this for you with the
descr
function. All you need to do is assign that
descr
function to an object and it will create a dataframe
of summary statistics.
Let’s take a look at what this would look like. We’ll only do it for
the key variables needed to replicate Orcutt’s table above.
Specifically, “marhigh_dic,” “marpeer_fct,” and “mardef_pos.” To do this
we’ll use the select
command from “dplyr.” We’ll also use
the drop_na
command to perform listwise deletion for
missing values across these three variables (i.e., only include
respondents who have complete data on all three variables).
tb1_sumstat <- nys_w6_trim %>%
select(marhigh_dic, marpeer_fct, mardef_pos) %>%
drop_na() %>%
sjmisc::descr(marhigh_dic, marpeer_fct, mardef_pos)
tb1_sumstat
##
## ## Basic descriptive statistics
##
## var type label n NA.prc mean sd
## marhigh_dic numeric marhigh_dic 1461 0 0.39 0.49
## marpeer_fct categorical Y6-367:FRIENDS-USED MARIJUANA 1461 0 2.46 1.30
## mardef_pos numeric mardef_pos 1461 0 0.14 0.35
## se md trimmed range iqr skew
## 0.01 0 0.37 1 (0-1) 1 0.43
## 0.03 2 2.33 4 (1-5) 2 0.46
## 0.01 0 0.05 1 (0-1) 0 2.04
First, notice that you now have a dataframe in your Global Environment called “tb1_sumstat” that includes three observations and 13 variables. This is a summary data set where each variable is an observation and each characteristic (e.g. type and label) and each summary statistic (e.g., mean and sd) are variables.
Second, notice that this data set includes the specific information we need to replicate Orcutt’s (1987) Table 1 (i.e. the mean and standard deviation for each variable). The problem of course, is this data set also includes a lot of other information that we don’t necessarily want to report, even though it may be informative.
What is cool about the “sjmisc” package essentially creating dataframes as its output is that, by assigning it to an object, you can wrangle the information in it just like you did above with the full data set. For our purposes here, we really just need three of the “variables” in the data set: “var,” “mean,” and “sd.” Of course, we may want to keep other information stored as variables (e.g., labels) and perhaps add some information as well (e.g., an NYS indicator). Let’s go ahead and use the “dplyr” functions you are now familiar with to wrangle this summary data into the form we want to use to create a table like Orcutt’s.
tb1_sumstat <- tb1_sumstat %>%
select(var, label, n, mean, sd) %>%
mutate(sample = "NYS Wave 6",
label = ifelse(var == "marhigh_dic", "Competent User (1 = User)", label),
label = ifelse(var == "marpeer_fct", "Friends' Use (1 = None of them - 5 = All of them)", label),
label = ifelse(var == "mardef_pos", "Subjective Definition (1 = Positive Definition)", label),
label = as_factor(label))
tb1_sumstat
##
## ## Basic descriptive statistics
##
## var label n mean sd
## marhigh_dic Competent User (1 = User) 1461 0.39 0.49
## marpeer_fct Friends' Use (1 = None of them - 5 = All of them) 1461 2.46 1.30
## mardef_pos Subjective Definition (1 = Positive Definition) 1461 0.14 0.35
## sample
## NYS Wave 6
## NYS Wave 6
## NYS Wave 6
In the above code, we simply told R to overwrite the “tb1_sumstat”
dataframe that we created above and select five specific variables
(“var,” “label,” “n,” “mean,” and “sd”) using the select
command. Then, using the mutate
command, we told R to
create a new variable called “sample” to indicate these data were from
the “NYS Wave 6” data and recode the “label” variable to have more
informative labels. We also told R to make the label variable a factor
variable (this may come in handy later). R automatically assigned the
labels to levels of the factor based on the order they appear in the
data. If you want to see the levels of a factor variable, simply type
levels(tb1_sumstat$label)
into a code chunk or the console
window.
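Level order matters later when we sort rows by a factor, so it is worth seeing how it can differ. The base R sketch below uses made-up labels to contrast the default factor() behavior (sorted levels) with levels taken in order of first appearance. Per its documentation, forcats::as_factor() uses order of appearance for character input, but other packages export a function with the same name that may behave differently, so checking levels() is the safest way to confirm what you got.

```r
# Made-up labels whose order of appearance is not alphabetical
labels_chr <- c("Subjective Definition", "Competent User", "Friends' Use")

# Default factor(): levels are sorted
sorted_fct <- factor(labels_chr)

# Levels in order of first appearance
appearance_fct <- factor(labels_chr, levels = unique(labels_chr))

levels(sorted_fct)
levels(appearance_fct)
```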
We actually can produce a simple table that replicates the
information in Orcutt’s with our current setup. We simply have to tell R
which columns in our tb1_sumstat
object to show and we can
do that with the select
command.
tb1_sumstat %>%
select(label, mean, sd) %>%
mutate(mean = round(mean, digits = 3),
sd = round(sd, digits = 3)) %>%
rename(Variable = label,
Mean = mean,
SD = sd)
##
## ## Basic descriptive statistics
##
## Variable Mean SD
## Competent User (1 = User) 0.394 0.489
## Friends' Use (1 = None of them - 5 = All of them) 2.457 1.304
## Subjective Definition (1 = Positive Definition) 0.143 0.350
Notice that we also used the mutate function. The data included seven decimal places for the mean and sd. The mutate function in the code above simply tells R to write over these variables with the same variable rounded to three decimal places using the round() function.
At this point we have performed a conceptual replication of Orcutt’s (1987) Table 1. All the information is technically in the above table. However, it’s an ugly table that would not look great in a presentation or publication. Also, if we were presenting or publishing this, we may also want to place our results next to Orcutt’s in order to compare the results, especially amongst the variables that are measured most similarly.
In order to make the table more presentable, we’re going to use the “gt” package. The “gt” package was built by people at RStudio and the basic idea is it allows you to take anything formatted as a data table (e.g. a dataframe or a tidyverse tibble) and create a table using its built in table elements and formatting options. Ultimately, the table created with the “gt” package can be rendered in html within an RMarkdown document (see an introductory video here).
Given the descr
function in the “sjmisc” package already
creates a data table for the information we wanted, we can easily create
a simple gt table from the tb1_sumstat
object we created
above using the gt()
function (you will notice that I did
this a lot in previous assignments in order to make simple tables look
nicer within a knitted html document).
library(gt)
tb1_sumstat %>%
gt()
var | label | n | mean | sd | sample |
---|---|---|---|---|---|
marhigh_dic | Competent User (1 = User) | 1461 | 0.3942505 | 0.4888564 | NYS Wave 6 |
marpeer_fct | Friends' Use (1 = None of them - 5 = All of them) | 1461 | 2.4565366 | 1.3039692 | NYS Wave 6 |
mardef_pos | Subjective Definition (1 = Positive Definition) | 1461 | 0.1430527 | 0.3502465 | NYS Wave 6 |
Notice that the table produced by the gt() function has three additional columns (“var”, “n”, and “sample”) and includes the mean and sd measured to seven decimal places. The cool thing about the “gt” package is that we should be able to customize the appearance of this table entirely with functions available within the “gt” package.
The table already looks visually better than the one above, but we ultimately want to 1) hide the columns we don’t need, 2) format the specific style of text and numbers that are displayed (e.g., column headings, text alignment, and number of decimal places displayed), and 3) add a title and caption to the table. So let’s take these in turn and see what “gt” can really do!
We don’t want the “var”, “n”, and “sample” columns presented in our
table (although some of the information may be useful to include in a
caption for the table). To do this, we simply use the
cols_hide
function that is built into the “gt” package.
library(gt)
tb1_sumstat %>%
gt() %>%
cols_hide(columns = c(var, n, sample))
label | mean | sd |
---|---|---|
Competent User (1 = User) | 0.3942505 | 0.4888564 |
Friends' Use (1 = None of them - 5 = All of them) | 2.4565366 | 1.3039692 |
Subjective Definition (1 = Positive Definition) | 0.1430527 | 0.3502465 |
Note that this does not change the underlying data in tb1_sumstat. It still has all six columns in it; we just used the cols_hide command to tell R not to display them.
While the table is starting to look visually better than the plain text one we created earlier, there are still some things that don’t look correct. Specifically, the column labels are all lowercase and not the correct text (e.g. “label” vs. “Variable” in the first column), the column with the variable descriptions is center aligned, and it’s showing seven decimal places for the mean and standard deviation statistics we are displaying.
tb1_sumstat %>%
gt() %>%
cols_hide(columns = c(var, n, sample)) %>%
cols_label(
label = "Variable",
mean = "Mean",
sd = "SD") %>%
cols_align(
align = "left",
columns = label) %>%
fmt_number(
columns = c(mean, sd),
decimals = 3)
Variable | Mean | SD |
---|---|---|
Competent User (1 = User) | 0.394 | 0.489 |
Friends' Use (1 = None of them - 5 = All of them) | 2.457 | 1.304 |
Subjective Definition (1 = Positive Definition) | 0.143 | 0.350 |
The last thing we want to do is add a title to the table and a note at the bottom indicating the sample size and anything else we may want to include.
tb1_gtsumstat <- tb1_sumstat %>%
gt() %>%
cols_hide(columns = c(var, n, sample)) %>%
cols_label(
label = "Variable",
mean = "Mean",
sd = "SD") %>%
cols_align(
align = "left",
columns = label) %>%
fmt_number(
columns = c(mean, sd),
decimals = 3) %>%
tab_spanner(
label = "NYS Wave 6",
id = "nys",
columns = c(mean, sd)) %>%
tab_header(
title = md("**Table 1: Variable Means and Standard Deviations**")) %>%
tab_footnote(
footnote = md("*n = 1,461*"),
locations = cells_column_spanners(
spanners = "nys"))
tb1_gtsumstat
Table 1: Variable Means and Standard Deviations | ||
Variable | NYS Wave 61 | |
---|---|---|
Mean | SD | |
Competent User (1 = User) | 0.394 | 0.489 |
Friends' Use (1 = None of them - 5 = All of them) | 2.457 | 1.304 |
Subjective Definition (1 = Positive Definition) | 0.143 | 0.350 |
1 n = 1,461 |
Note: In the above table, we grouped
the “Mean” and “SD” columns under “NYS Wave 6” using the
tab_spanner()
function, added a title to the table with the
tab_header()
function, and added a footnote referencing the
sample size of NYS Wave 6 using the tab_footnote()
function
(if we simply wanted to add a note to the end of our table without
reference to an object in the table we would have used the
tab_source_note()
function).
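For comparison, here is a minimal sketch of the tab_source_note() alternative mentioned above. The data frame and the note text are made up for illustration; the point is only that a source note attaches to the bottom of the table without a numbered reference mark.

```r
library(gt)

# Made-up summary values for illustration only
toy_stats <- data.frame(
  label = c("Variable A", "Variable B"),
  mean  = c(0.4, 2.5),
  sd    = c(0.5, 1.3)
)

# Build the table, then add a general note at the bottom
toy_tbl <- gt(toy_stats)
toy_tbl <- tab_source_note(
  toy_tbl,
  source_note = md("*Note:* made-up values for illustration.")
)

toy_tbl
```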
Note: The md()
command
before the text we were adding with a given function allows for markdown
syntax to be used. That’s why the title shows up as bold when we added
“**” to both sides of the title within the parentheses under the
tab_header()
function.
We could continue to modify the table above to get it exactly how we wanted (e.g., adjust font size, remove horizontal lines between variables, etc.), but this is good enough for our current purposes.
Given we tried to perform a conceptual replication of Orcutt’s (1987) study, it would be good to see our results side-by-side with Orcutt’s. To do this, we created the table below. It required some more data wrangling and some data entry (i.e. we had to enter the values of Orcutt’s table directly).
# Enter data from Orcutt's (1987) Table 1
cols <- c("Variable", "Minnesota", "FSU", "Combined")
label <- c("Competent User (1 = User)", "Friends' Use (0 - 4)", "Subjective Definition (1 = Neutral)", "Subjective Definition (1 = Positive)")
min_mean <- c(.345, 1.230, .155, .372)
min_sd <- c(.476, 1.423, .363, .484)
fsu_mean <- c(.475, 1.803, .121, .497)
# CHECK: the fsu_sd values below are identical to comb_mean, which suggests a
# copy-and-paste slip; verify them against Orcutt's (1987) Table 1
fsu_sd <- c(.416, 1.545, .137, .441)
comb_mean <- c(.416, 1.545, .137, .441)
comb_sd <- c(.493, 1.530, .344, .497)
# Create data frame of Orcutt's Table 1 (cbind() stores every column as
# character; the columns are converted back to numeric after the join below)
orc_tb1 <- as.data.frame(cbind(label, min_mean, min_sd, fsu_mean, fsu_sd, comb_mean, comb_sd))
#Merge Orcutt data with NYS summary data
nys_orc_tb1 <- tb1_sumstat %>%
rename(nys_mean = mean,
nys_sd = sd) %>%
mutate(label = fct_recode(label, "Subjective Definition (1 = Positive)" = "Subjective Definition (1 = Positive Definition)")) %>%
full_join(orc_tb1, by = "label") %>%
mutate(label = as_factor(label),
nys_mean = as_numeric(nys_mean),
nys_sd = as_numeric(nys_sd),
min_mean = as_numeric(min_mean),
min_sd = as_numeric(min_sd),
fsu_mean = as_numeric(fsu_mean),
fsu_sd = as_numeric(fsu_sd),
comb_mean = as_numeric(comb_mean),
comb_sd = as_numeric(comb_sd)) %>%
arrange(label)
#Create and Refine Table
tb1_nysorc_comb <- nys_orc_tb1 %>%
gt() %>%
cols_hide(columns = c(var, n, sample)) %>%
cols_label(
label = "Variable",
nys_mean = "Mean",
nys_sd = "SD",
min_mean = "Mean",
min_sd = "SD",
fsu_mean = "Mean",
fsu_sd = "SD",
comb_mean = "Mean",
comb_sd = "SD") %>%
cols_align(
align = "left",
columns = label) %>%
fmt_number(
columns = c(nys_mean, nys_sd, min_mean, min_sd, fsu_mean, fsu_sd, comb_mean, comb_sd),
decimals = 3) %>%
tab_spanner(
label = "NYS Wave 6",
id = "nys",
columns = c(nys_mean, nys_sd)) %>%
tab_spanner(
label = "Orcutt - Minn.",
id = "minn",
columns = c(min_mean, min_sd)) %>%
tab_spanner(
label = "Orcutt - FSU",
id = "fsu",
columns = c(fsu_mean, fsu_sd)) %>%
tab_spanner(
label = "Orcutt - Combined",
id = "comb",
columns = c(comb_mean, comb_sd)) %>%
tab_header(
title = md("**Table 1: Variable Means and Standard Deviations**")) %>%
tab_footnote(
footnote = md("*n = 1,461*"),
locations = cells_column_spanners(
spanners = "nys")) %>%
tab_footnote(
footnote = md("*n = 444*"),
locations = cells_column_spanners(
spanners = "minn")) %>%
tab_footnote(
footnote = md("*n = 543*"),
locations = cells_column_spanners(
spanners = "fsu")) %>%
tab_footnote(
footnote = md("*n = 987*"),
locations = cells_column_spanners(
spanners = "comb")) %>%
fmt_missing(
columns = everything(),
missing_text = "---")
# tab_options(
# footnotes.sep = ", ") #This isn't working; should place footnotes on same line (see: https://github.com/rstudio/gt/issues/833)
tb1_nysorc_comb
Table 1: Variable Means and Standard Deviations | ||||||||
Variable | NYS Wave 61 | Orcutt - Minn.2 | Orcutt - FSU3 | Orcutt - Combined4 | ||||
---|---|---|---|---|---|---|---|---|
Mean | SD | Mean | SD | Mean | SD | Mean | SD | |
Competent User (1 = User) | 0.394 | 0.489 | 0.345 | 0.476 | 0.475 | 0.416 | 0.416 | 0.493 |
Friends' Use (0 - 4) | — | — | 1.230 | 1.423 | 1.803 | 1.545 | 1.545 | 1.530 |
Friends' Use (1 = None of them - 5 = All of them) | 2.457 | 1.304 | — | — | — | — | — | — |
Subjective Definition (1 = Neutral) | — | — | 0.155 | 0.363 | 0.121 | 0.137 | 0.137 | 0.344 |
Subjective Definition (1 = Positive) | 0.143 | 0.350 | 0.372 | 0.484 | 0.497 | 0.441 | 0.441 | 0.497 |
1 n = 1,461 | ||||||||
2 n = 444 | ||||||||
3 n = 543 | ||||||||
4 n = 987 |
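The dashes in the table above come from the full_join() step: rows whose label appears in only one of the two data sets are kept, and the columns from the other data set are filled with NA. Here is a minimal sketch with two made-up data frames (the labels and values are illustrative, not the ones used above).

```r
library(dplyr)

# Made-up summary rows for two sources that share only one label
nys_toy <- data.frame(
  label    = c("Competent User", "Friends' Use (1 - 5)"),
  nys_mean = c(0.39, 2.46)
)
orc_toy <- data.frame(
  label     = c("Competent User", "Friends' Use (0 - 4)"),
  comb_mean = c(0.42, 1.55)
)

# full_join() keeps every label from both sides; unmatched cells become NA
joined <- full_join(nys_toy, orc_toy, by = "label") %>%
  arrange(label)

joined
```

Naming the key with by = "label" makes the join explicit; without it, dplyr joins on all shared column names and prints a message saying which ones it used.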
Looking across the values in the table above, you can see some similarities and points of departure between the results from the NYS data and Orcutt’s (1987) results.
First, the percentage of “competent” marijuana users in the past year is relatively similar across all three samples. This is especially the case for the NYS data and the combined Minnesota and FSU sample from Orcutt’s study: Orcutt’s combined sample has about 2 percentage points more competent marijuana users than the NYS sample. Note that this combined average reflects a combination of Minnesota’s relatively low marijuana use (35%) and FSU’s relatively high marijuana use (48%) when compared to the NYS (39%).
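We can read the means of the dummy variables as percentages because the mean of a 0/1 variable is simply the proportion of ones, and its standard deviation is determined by that proportion. A minimal base R sketch with ten made-up responses:

```r
# Ten made-up responses: four "users" (1) and six "non-users" (0)
x <- c(rep(1, 4), rep(0, 6))

p <- mean(x)   # proportion of ones
n <- length(x)

sd_direct  <- sd(x)                               # sample SD from the data
sd_formula <- sqrt(p * (1 - p) * n / (n - 1))     # same quantity from p alone

c(p = p, sd_direct = sd_direct, sd_formula = sd_formula)
```

In large samples the n / (n - 1) term is close to 1, so the SD is roughly sqrt(p * (1 - p)); for the NYS mean of 0.394 that gives about 0.489, matching the table. By the same formula, a dummy with a mean of 0.475 should have an SD near 0.499, so the 0.416 shown for the FSU sample above is worth checking against Orcutt’s original table.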
Second, the peer marijuana use variables between the NYS data and Orcutt’s (1987) samples are measured very differently and thus not directly comparable (this is why we put them on separate rows in the above table). In order to compare them, you have to use your informed judgement about what the specific values mean. For example, in Orcutt’s study, respondents reported the number of their four closest friends who they thought used marijuana at least once a month. Compare this to the NYS which asked respondents how many of their friends had used marijuana in the past year with categorical answer categories ranging from “None of them” to “All of them.” For Orcutt, the average was about 1.5 close friends who respondents thought were using marijuana at least once a month (closer to 1 for Minnesota and closer to 2 for FSU). Compare this to the NYS data where the average of 2.5 falls between respondents reporting about their friends’ use that “very few of them” and “some of them” used marijuana in the past year. Leaving aside the different time frames and frequency implied by the questions, do you think respondents from Orcutt’s study that reported 1 to 2 close friends using marijuana at least once a month would have reported “very few of them” or “some of them” to the question in the NYS? Perhaps, but on some level, this question is objectively unknowable without a specific study built to test the overlap between these two questions.
Third, and perhaps the clearest difference between the NYS data and Orcutt’s results, is respondents’ “Subjective Definitions” of marijuana use. These differences are likely due, in large part, to the different ways the questions were asked and the coding decisions we made. Recall that the NYS question was a unipolar question about the “wrongness” of marijuana where respondents reported their views on a four-point scale ranging from “Not wrong” to “Very wrong.” Orcutt’s question was bipolar, asking respondents to report their “opinion” about marijuana with answers ranging from “Highly negative” (Minnesota) or “Negative” (FSU) to “Highly positive” (Minnesota) or “Positive” (FSU), with a neutral “no opinion” category in the middle. These questions are not only asking about qualitatively different things (e.g., wrongness vs. valence of opinion), they are also asking respondents to respond in very different ways (e.g., unipolar vs. bipolar).
Interestingly, the percent reporting “positive” subjective definitions of marijuana in the NYS (i.e., “Not wrong”) is more similar to the percent reporting “neutral” subjective definitions in Orcutt’s study. This is interesting because both are based on a single answer category in a multiple-answer scale (albeit a four-point vs. a five-point scale, respectively). Perhaps, for the sake of comparison, comparing the “Wrong” and “Very wrong” categories in the NYS to the “Negative” categories in Orcutt’s study makes the most sense. But this would require some more data wrangling that we will forgo for the moment.
For now, just look at the frequency distribution for the NYS subjective definitions measure, paying close attention to the percentage of respondents in the “Wrong” and “Very wrong” categories.
nys_w6_trim %>%
  sjmisc::frq(mardef)
## Y6-352:USE MARIJUANA (mardef) <numeric>
## # total N=1725 valid N=1496 mean=2.74 sd=1.02
##
## Value | Label | N | Raw % | Valid % | Cum. %
## -----------------------------------------------------------
## 1 | Not wrong | 215 | 12.46 | 14.37 | 14.37
## 2 | A little bit wrong | 381 | 22.09 | 25.47 | 39.84
## 3 | Wrong | 473 | 27.42 | 31.62 | 71.46
## 4 | Very wrong | 427 | 24.75 | 28.54 | 100.00
## <NA> | <NA> | 229 | 13.28 | <NA> | <NA>
The combined valid percentage of those answering that marijuana use is “Wrong” or “Very wrong” totals about 60%. Compare this to the percentage of respondents from Orcutt’s samples who were not in the “Positive” or “Neutral” subjective definition categories (i.e., 1 - (% Positive + % Neutral)). Because these are dummy variables, we can also calculate their standard deviation with the formula: \[sd=\sqrt{\left(\frac{n}{n-1}\right)p(1-p)}\] where n = the sample size and p = the proportion of the sample coded 1 on the dummy variable.
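# Sample sizes and proportions with negative definitions
# (for Orcutt's samples: 1 - (proportion positive + proportion neutral))
sample <- c("NYS", "Minnesota", "FSU", "Combined")
n <- c(1461, 444, 543, 987)
mean <- c(.602, (1 - (.155 + .372)), (1 - (.121 + .497)), (1 - (.137 + .441)))

# Combine into a data frame and add the SD of each dummy variable
neg_def <- cbind.data.frame(sample, n, mean) %>%
  mutate(sd = sqrt((n / (n - 1)) * (mean * (1 - mean))))

# Build the gt table: hide n, show 3 decimals, and label the columns
tb1_nysorc_negdef <- gt::gt(neg_def) %>%
  cols_hide(columns = n) %>%
  fmt_number(
    columns = c(mean, sd),
    decimals = 3) %>%
  cols_label(
    sample = "Sample",
    mean = "Mean",
    sd = "SD") %>%
  tab_spanner(
    label = "Subjective Definition (1 = Negative)",
    columns = c(mean, sd))

tb1_nysorc_negdef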
Subjective Definition (1 = Negative)
Sample | Mean | SD |
---|---|---|
NYS | 0.602 | 0.490 |
Minnesota | 0.473 | 0.500 |
FSU | 0.382 | 0.486 |
Combined | 0.422 | 0.494 |
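As a quick sanity check on the dummy-variable formula, the sketch below builds a hypothetical 0/1 vector and compares the shortcut with R's built-in sd(). The vector x and the counts used to build it are illustrative assumptions, not NYS data.

```r
# Hypothetical dummy variable: 602 ones and 398 zeros (illustrative, not NYS data)
x <- c(rep(1, 602), rep(0, 398))
n <- length(x)
p <- mean(x)

# Shortcut formula for the sample SD of a 0/1 variable
sd_formula <- sqrt((n / (n - 1)) * p * (1 - p))

# Compare the shortcut with the usual sample standard deviation
all.equal(sd_formula, sd(x))
```

If the two agree, the shortcut can be trusted for summary tables where only n and the proportion are available, as is the case for Orcutt's published results.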
As evident in the above table, a larger proportion of respondents in the NYS sample report negative subjective definitions of marijuana than in any of the samples in Orcutt’s (1987) study. The absolute difference ranges from 22 percentage points between the NYS and the FSU sample to 13 percentage points between the NYS and the Minnesota sample. There is an 18 percentage point difference between the NYS sample and Orcutt’s combined sample. Given the differences in measurement, what do you think might account for these differences?
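The gaps quoted above can be recomputed directly from the table. This is a minimal sketch using the rounded proportions shown there; the object names neg and gap_pts are made up for illustration.

```r
# Proportions with negative definitions, copied from the table above
neg <- c(NYS = 0.602, Minnesota = 0.473, FSU = 0.382, Combined = 0.422)

# Gap between the NYS and each of Orcutt's samples, in percentage points
gap_pts <- round((neg[["NYS"]] - neg[c("Minnesota", "FSU", "Combined")]) * 100, 1)
gap_pts
```

Working in percentage points (a simple subtraction) rather than relative percentages keeps the comparison unambiguous.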
Perhaps college students have generally more positive attitudes toward marijuana than a general sample of college-aged persons. Perhaps marijuana was generally viewed differently in 1983 when the NYS was collected than it was in 1972 and 1973 when Orcutt (1987) collected his data. Ultimately, we can’t be completely sure, but you can see how thinking about these things may lead to interesting theoretical and empirical questions.
We have just walked you through a conceptual replication of Orcutt’s (1987) descriptive statistics using wave 6 of the NYS, the wave where NYS respondents were presumably most similar in age to Orcutt’s sample. Substantively, this was an attempt to see if Orcutt’s findings generalized to a nationally representative sample. But another substantive question related to Orcutt’s findings is how the results would generalize to different age groups.
To examine the question of whether Orcutt’s findings generalize to different age groups, one basic thing we can do is perform a similar conceptual replication using earlier waves of the NYS (waves 1 - 5). This is now your task.
I (Jake) have randomly assigned each of you to one of the first five waves of the NYS and provided you with a pooled data set called “nys_fwtrim_orc1987” (see assignment on Canvas to download directly and/or for the script used to create the data set). The data set includes the following variables:
Variables in nys_fwtrim_orc1987 data | ||
variable | question | answers |
---|---|---|
CASEID | Unique Identifier | NA |
wave | Wave of NYS data collection | NA |
age | How old are you? | specific age |
marpeer | Think of your friends. During the last year how many of them have used marijuana or hashish? | 1 = None of them - 5 = All of them |
mardef | How wrong is it for someone your age to use marijuana or hashish? | 1 = Not wrong - 4 = Very wrong |
maruse | In the last year, how often have you used marijuana - hashish ('grass', 'pot', 'hash')? | 1 = Never - 9 = 2 to 3 times/day |
Note: In the first five waves, the NYS only asked about marijuana use in the last year and not about whether respondents got high from marijuana in the past year.
Note: The NYS did not ask for a raw count in Waves 1 and 2. Instead, respondents answered on a 9-point scale that ranged from 1 = “Never” to 9 = “2 to 3 times per day”. Here is the full range of answers from the Wave 1 codebook:
Note: This question was also included in wave 6 (V891), but it was only asked of those who reported using marijuana 10 or more times in the last year.
Note: Given that the first five waves of the NYS did not ask about getting high from marijuana, this is another instance where the conceptualization of “marijuana use” in your replication with one of the first five waves of data will be different from Orcutt’s conceptualization. But note that with this measure you are still able to distinguish between those who have used marijuana at all in the past year and those who have not.
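One way to build that used/never-used distinction is sketched below. It assumes maruse is stored as a plain numeric code where 1 = “Never” (as in the variable table above) and uses a small made-up data frame; the names toy and maruse_any are hypothetical, and labelled data from haven may need converting to numeric first.

```r
library(dplyr)

# Made-up responses on the 9-point scale (1 = Never); NA = missing
toy <- data.frame(maruse = c(1, 1, 3, 9, NA, 2))

toy <- toy %>%
  mutate(
    # 1 = any use in the past year, 0 = never; missing values stay NA
    maruse_any = if_else(maruse > 1, 1, 0)
  )

toy
```

Keeping missing values as NA, rather than folding them into 0, prevents nonrespondents from being counted as non-users.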
Here are your wave assignments:
NYS Wave | Students | ||
---|---|---|---|
Wave 1 | Student 6 | Student 12 | Student 1 |
Wave 2 | Student 4 | Student 2 | Student 5 |
Wave 3 | Student 15 | Student 13 | Student 9 |
Wave 4 | Student 10 | Student 11 | Student 7 |
Wave 5 | Student 8 | Student 14 | Student 3 |
You should now have everything that you need to replicate Orcutt’s (1987) Table 1 with the NYS data to which you were assigned. The pooled data set is provided with the assignment on Canvas (along with the R script used to create it). If you have followed along up to this point, you should have all the requisite skills to recreate Table 1 above using the NYS wave to which you were assigned.
In order to complete the assignment, here is what you need to do:
Before looking at the data, write a brief statement or commentary about whether and why you think your conceptual replication using the NYS wave to which you were assigned will produce similar or different results from Orcutt’s analysis and the results for wave 6 of the NYS presented above.
Download the pooled data file from Canvas, then load the pooled data into R. The file is in .rda format. I recommend saving the pooled file in a subfolder (e.g., “NYS_data”) within your root folder for this project and then using the load() function as follows:
load(here('NYS_data', 'nys_fwtrim_orc1987.rda'))
Filter the pooled data to only include the wave you were assigned.
load(here('NYS_data', 'nys_fwtrim_orc1987.rda'))

#Wave #1:
nys_w1_orc1987 <- nys_fwtrim_orc1987 %>%
  filter(wave == 1)

#Wave #2:
nys_w2_orc1987 <- nys_fwtrim_orc1987 %>%
  filter(wave == 2)

#Wave #3:
nys_w3_orc1987 <- nys_fwtrim_orc1987 %>%
  filter(wave == 3)

#Wave #4:
nys_w4_orc1987 <- nys_fwtrim_orc1987 %>%
  filter(wave == 4)

#Wave #5:
nys_w5_orc1987 <- nys_fwtrim_orc1987 %>%
  filter(wave == 5)
Recode the data in your wave to match the logic of Orcutt (1987) and what we did with wave 6 above.
Create a table using the “gt” package that summarizes the data similar to Orcutt (1987) and what we did above with wave 6 (see Part 5 above).
Explain how the results of your conceptual replication are similar to or different from Orcutt’s analysis, how they are similar to or different from the conceptual replication presented above for wave 6 of the NYS, and how they differ from your prior expectations as outlined in step 1.
Knit your final RMD file to html format and save it using an informative file name (e.g., “LastName_CRM495_RAssign7_YEAR_MO_DY”) within a file structure you create for this assignment (e.g., “LastName_CRM495_RAssign7”).
Submit your knitted html file on Canvas.
Place a copy of your root folder in your LastName_495_commit folder on OneDrive.
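For the filtering step above, you only need the one wave you were assigned, so a more compact variant is to store the wave number once and filter on it. This is a sketch with a made-up stand-in for the pooled data; my_wave, toy_pooled, and nys_my_wave are hypothetical names, and in your own file you would filter nys_fwtrim_orc1987 instead.

```r
library(dplyr)

# Made-up stand-in for the pooled file (the real one is nys_fwtrim_orc1987)
toy_pooled <- data.frame(
  CASEID = c(1, 2, 3, 1, 2, 3),
  wave   = c(1, 1, 1, 2, 2, 2)
)

my_wave <- 2  # replace with the wave you were assigned

nys_my_wave <- toy_pooled %>%
  filter(wave == my_wave)

# Confirm that only the assigned wave remains
table(nys_my_wave$wave)
```

Checking the result with table() right after filtering is a cheap way to catch a mistyped wave number before you start recoding.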