Back in 2016, Andrew Gelman blogged about a study by Kuriakose and Robbins (2016) that investigated “near duplication” of survey responses across 1,000+ public opinion surveys as a way to discover potential data fraud. Around that time, we (Jake, Jon, and Maja Kotlaja) were working with a professional international research organization to collect household interview data in Belgrade, Serbia. Upon reading the post and study, it immediately struck fear in us. How would we know if the data we collected were legitimate? Jon had contracted with professional research organizations to collect survey data in the past and, as he had done before, we did our due diligence in identifying potential contractors and generally making sure they were above board. For instance, we made sure to get multiple quotes, checked each organization’s references and body of work, and had detailed exchanges about sampling and interview methods before finally selecting and entering into a contractual agreement with an organization (MASMI-Belgrade in Serbia). Meanwhile, the survey organization itself had also instilled confidence in us by doing their own data quality checks, and they had even thrown out one interviewer’s responses due to data quality concerns.
At some point, we had to assume that we were doing all that one reasonably could be expected to do - that is, aside from traveling to Serbia to directly oversee the survey organization and each interviewer ourselves. Like so many other researchers in similar situations domestically and internationally, eventually we had to trust that we were hiring an experienced professional organization that knows what they are doing to carry out the work professionally. Yet, here’s the thing: I (Jake) generally am not a very trusting person. Maybe it is more accurate to say that I am a “trust, but verify” kind of person. As this was my first time collecting international survey data, I had also heard Jon’s horror stories of the extra (and perhaps ethnocentric) scrutiny that reviewers often applied to manuscripts containing analyses of international survey data. Also, though fraud can happen anywhere, the Kuriakose and Robbins (2016) study we mentioned above actually found near duplicate survey observations were more common in data from non-OECD country samples compared to the OECD samples they examined (Serbia is a non-OECD country). So, for our own peace of mind and to satisfy potentially skeptical reviewers, we knew that someday we should learn how to detect instances of near duplication in our data.
Since completing data collection in Serbia, a global pandemic occurred and, concurrently with the pandemic, we also contracted with another international organization to collect similar survey data in Bosnia and Herzegovina. So, this problem has been at the back of our minds for a while, and we are finally ready to buckle down and investigate it with our own international survey data. In this post, we test the near duplication waters by focusing first on our most recently collected dataset from Sarajevo, Bosnia and Herzegovina.1 Afterwards, we plan to conduct similar investigations of data we collected in other international locations (Dhaka, Bangladesh; Belgrade, Serbia). As you will see, this is a lot of work for ultimately what will be a footnote in future papers generated using these data. Still, it is important work, and it is exactly the type of thing that we envisioned sharing on our blog when we launched it. Who knows - maybe our departure into duplication will help some of you dig a little deeper into your data and construct your own esoteric footnotes someday. You’re welcome?
library(haven)
library(here)
library(janitor)
library(Statamarkdown)
library(easystats)
library(gt)
library(ggtext)
library(patchwork)
library(pals)
library(extRemes)
library(qqplotr)
library(tidyverse)
The first thing most survey organizations likely do (and most researchers should do) is check for exact duplicate entries in the raw data file. The janitor package allows one to quickly identify exact duplicate entries.
load(here("Data", "sarajevo_data.RData"))

#look for exact duplication after removing administrative variables:
sarajevo_data %>%
  get_dupes(-c(ID, starttime, Interviewer, Region, Municipality, Cluster, Settlement, weight))

No duplicate combinations found of: Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8a, Q8b, ... and 244 other variables
Using the janitor package, we found no exact duplicates across the whole range of survey questions (minus the administrative variables) in the Sarajevo, Bosnia and Herzegovina (hereafter BiH) data. This is unsurprising given the thoroughness and professionalism we saw from our contracted survey organization, Custom Concepts. So we’re off the hook, right? Footnote complete!? Maybe, but duplicating entries exactly would be a pretty brazen attempt at falsification, one that would likely be caught by a professional survey organization before the data even got to us.2
Kuriakose and Robbins (2016) (“KR” from here) developed a more nuanced method for detecting likely duplicates by focusing on “near duplicates.” The basic logic is that a dishonest interviewer or survey firm could save time and money by duplicating responses from valid interviews, and that a sophisticated interviewer or firm engaging in fraud is unlikely to simply duplicate the results exactly because exact duplication is relatively easy to check and identify (see above). Rather, duplicating a valid response and then changing the identifier as well as the values of one or two (or more) data columns would avoid the crude duplicate detection methods employed above.
KR’s process involves pairwise comparisons of every observation in the data set on the substantive variables (i.e., excluding administrative variables like the unique ID) to detect the “maximum percentage of variables that an observation shares with any other observation in the data” (p.284). Based on their simulations, KR (2016) determined that a maximum percent match of 85% or more is worthy of further scrutiny.
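To make the metric concrete, here is a naive, illustrative sketch of the computation. This is not the percentmatch() function we use later - it brute-forces every pairwise comparison, so it is only suitable for small toy data, and its handling of missing values is just one assumption among several possible.

#naive illustration of KR's maximum percent match metric (toy sketch only)
max_pct_match <- function(df) {
  m <- as.matrix(df)
  n <- nrow(m)
  sapply(seq_len(n), function(i) {
    max(sapply(seq_len(n)[-i], function(j) {
      both <- !is.na(m[i, ]) & !is.na(m[j, ])  #compare only mutually non-missing values
      if (!any(both)) return(NA_real_)
      mean(m[i, both] == m[j, both])           #share of matching values for this pair
    }), na.rm = TRUE)
  })
}

#toy example: row 3 duplicates row 1 except for one of four values, so both score 0.75
toy <- data.frame(a = c(1, 2, 1), b = c(3, 3, 3), c = c(5, 4, 5), d = c(2, 1, 1))
max_pct_match(toy)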
As we will discuss in more detail later, KR’s identification of an 85% maximum percent match threshold for detecting fraud is problematic because it rests on a lot of untenable assumptions about the underlying data generating processes, including strong assumptions about study characteristics (e.g., survey item response values, distributions, and inter-item correlations) and about the methods or decisions of fraudsters when duplicating data. Below, we will use 80% and 85% maximum percent match thresholds as heuristics to aid in decision-making about potential fraud concerns in our data. However, we feel compelled to strongly caution readers NOT to use these thresholds like “significance levels” to forensically test for the presence or absence of fraud via near duplication. Doing so inevitably will result in undesirably high false positive and false negative rates. With that said, we still think KR’s process of examining percent match distributions is useful as one among many forensic tools that researchers should consider using to better understand their data and to check for obvious or apparent anomalies that beg for additional exploration.
I (Jake) initially started by using KR’s function directly in Stata. This meant I had to fire up my Stata version 14 that I probably haven’t opened in at least a couple of years.3 To conduct these analyses in R, I iterated and revised an existing R function meant to mimic KR’s Stata function, which we use below. Given my lack of real programming skills, it was quite the challenge for me. I will try to document its development in a future companion blog to this one tentatively titled: “Stumbling in the Dark: Building/Iterating an R Function to Match Stata’s percentmatch.” In that companion blog, I will walk through the development of the function and also examine whether it produces results consistent with KR’s Stata function applied to the same data.4
Regardless of whether one opts for Stata or R, some basic data pre-processing steps are required before examining near duplication. As usual, we will happily do this within the friendly confines of R and the tidyverse.
The first thing we need to do is remove all of the administrative variables and demographic variables (e.g., sex, age, ethnicity, and religion) so our data include only the substantive variables.5 As before with the exact duplication check using janitor, we will use a copy of the raw BiH data collected in 2021. Our thinking is that this should be among the first checks in one’s workflow, before cleaning and analyzing data.
<- c("Q7", "Q8a", "Q8b")
vars_mis99
<- sarajevo_data %>%
sarajevo_data_noad select(-c(starttime, Interviewer, Region, Municipality, Cluster, Settlement, Q1, Q2, Q44, Q45, weight)) %>%
mutate(
#recode missing codes to NA:
across(-any_of(c(vars_mis99, "Q45_oth", "ID")), ~ na_if(.x, 9)),
across(all_of(vars_mis99) & -any_of(c("Q45_oth", "ID")), ~ na_if(.x, 99)))
We now have the raw BiH data with consistent NA values across all of the variables and none of the administrative variables that would artificially increase (or decrease) duplication. The next step is to remove cases where missing values exceed the thresholds set by KR. Specifically, they removed any variable with 10% or more missing values and any observation with 25% or more missing values.
sarajevo_colmiss <- map(sarajevo_data_noad, ~mean(is.na(.))) %>%
  as_tibble() %>%
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "pct_colmiss") %>%
  filter(pct_colmiss >= .07) %>%
  arrange(desc(pct_colmiss))
sarajevo_colmiss %>%
  gt() %>%
  cols_label(
    variable = "Variable",
    pct_colmiss = "% Missing") %>%
  fmt_percent(
    columns = pct_colmiss,
    decimals = 1) %>%
  tab_header(
    title = md("**Variables by % Missing**")) %>%
  tab_options(container.height = px(500))
**Variables by % Missing**

| Variable | % Missing |
|---|---|
| Q45_oth | 98.9% |
| Q8b | 45.2% |
| Q31_2 | 29.9% |
| Q31_8 | 29.8% |
| Q31_1 | 29.7% |
| Q31_3 | 28.7% |
| Q31_5 | 27.6% |
| Q31_4 | 26.9% |
| Q31_7 | 26.9% |
| Q31_6 | 26.4% |
| Q32_3 | 11.4% |
| Q14_18 | 10.3% |
| Q13_37 | 9.0% |
| Q11_8 | 8.5% |
| Q11_4 | 7.0% |
As you can see above, some variables have a substantial amount of missing data. The two variables with the most missing data, “Q45_oth” and “Q8b,” are artifacts of skip patterns (Q8b asks about the respondent’s number of children and was only asked of those who identified as a parent; Q45_oth contains the specific answers given by respondents who identified as “other” ethnicity). After those, the substantive questions with the most missing data are Q31_1 - Q31_8. These asked about respondents’ perceptions of crime and deviance in their neighborhood (“How many people in your neighborhood do these things at least once a year?”), with items such as “Take money or property that didn’t belong to you worth less than 5 BAM.”
There are two other variables above the 10% missing threshold: Q32_3 (11.4%) and Q14_18 (10.3%). Q32_3 was part of a set of questions that asked: “Imagine that you did something that most of your family, friends, and other people in your neighborhood would disapprove of. Generally, how likely is it that these people would forgive you?” The specific item Q32_3 asked about the perceived likelihood that “Most of the other people in your neighborhood would forgive you.” Given this item’s missingness is an outlier among the set and considering the missing patterns in the Q31 set, there may be something about the neighborhood questions that specifically caused missing data (e.g., respondents’ greater inability or unwillingness to accurately report on crime-related questions about neighbors). Meanwhile, Q14_18 was part of a set of questions that asked “How often did you do the following in the past two years?” with the specific item asking about how often the respondent “neglected an important duty.”
Given this pattern of missing data, we decided to stick with KR’s 10% criterion, though perhaps leaving in Q32_3 and Q14_18 is defensible as well since they have substantially fewer missing observations than the Q31 set of questions. On the other hand, Q45_oth and Q8b should clearly be removed, as they result from skip patterns, which KR specifically mention as a rationale for their 10% criterion (p.286).
The next step is to identify observations with more than 25% missing data. Before doing this, we removed Q45_oth and Q8b so as not to include them in the calculation. This turned out to be more difficult than expected, and we actually turned to ChatGPT for help.6 We ended up using a simple map_dbl() call from the “purrr” package in the tidyverse.
#Remove skip pattern questions:
sarajevo_data_noadskip <- sarajevo_data_noad %>%
  select(-Q45_oth, -Q8b)

sarajevo_rowmiss <- map_dbl(seq_len(nrow(sarajevo_data_noadskip)), ~mean(is.na(sarajevo_data_noadskip[.x,]))) %>%
  as_tibble() %>%
  mutate(ID = row_number()) %>% #This works here, but is not a robust solution
  rename(pct_rowmiss = value) %>%
  relocate(ID)
#Print Results
sarajevo_rowmiss %>%
  arrange(desc(pct_rowmiss)) %>%
  filter(pct_rowmiss >= .25) %>%
  gt() %>%
  cols_label(
    pct_rowmiss = "% Missing") %>%
  fmt_percent(
    columns = pct_rowmiss,
    decimals = 1) %>%
  tab_header(
    title = md("**Observations by % Missing**"))
**Observations by % Missing**

| ID | % Missing |
|---|---|
| 149 | 29.4% |
| 31 | 28.2% |
| 262 | 27.4% |
| 345 | 27.0% |
There are four cases missing more than 25% of their responses (ID = 149, 31, 262, 345). Following KR, we will remove these from the data before checking duplicates. To summarize, we will remove the following survey questions: Q45_oth, Q8b, Q31_2, Q31_8, Q31_1, Q31_3, Q31_5, Q31_4, Q31_7, Q31_6, Q32_3, Q14_18 and the following observations: ID = 149, 31, 262, 345.
We are not sure exactly how KR came up with the 10% and 25% criteria for variables and observations respectively. The rationale for these cut-offs is not fully described in their article.7 Generally, we are suspicious of rules of thumb like these and, when encountering arbitrary thresholds like we do here, we prefer to assess how sensitive results are to alternative decision criteria. For now, we will plug our noses and move forward with these cutoffs, as this blog entry already bites off more than most readers are probably willing to chew.
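To make that sensitivity idea concrete, here is a small sketch (using the missingness summaries computed above) that simply tabulates how many variables and observations alternative cutoffs would drop. A fuller sensitivity analysis would rerun the entire percent match pipeline on each trimmed data set.

#count variables/cases dropped under alternative missingness cutoffs (sketch)
col_miss_all <- map_dbl(sarajevo_data_noad, ~ mean(is.na(.x))) #unfiltered column missingness
expand_grid(col_cut = c(.05, .10, .15), row_cut = c(.20, .25, .30)) %>%
  mutate(vars_dropped = map_int(col_cut, ~ sum(col_miss_all >= .x)),
         cases_dropped = map_int(row_cut, ~ sum(sarajevo_rowmiss$pct_rowmiss >= .x)))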
#Note: colmiss and rowmiss were built off-screen; one plausible construction:
colmiss <- sarajevo_colmiss %>% filter(pct_colmiss >= .10) %>% pull(variable)
rowmiss <- c(149, 31, 262, 345)

sarajevo_data_noadtrim <- sarajevo_data_noad %>%
  select(-all_of(colmiss)) %>%
  filter(!ID %in% rowmiss)
With our pre-processing complete, we can now run our trimmed data through our function to produce the maximum percent (technically, maximum proportion) match metric for each observation (along with the ID of the observation with the closest match) and append it to our trimmed data. From there, we can identify those observations with greater than 85% maximum percent match, or what KR would consider a “near duplicate” of another observation and thus a potential falsification. In fact, they use the term “likely falsified” for rows exhibiting an 85% or greater maximum match, though, as we noted earlier and will discuss later, we strongly caution against using this (or any) simple threshold criterion to make forensic determinations about the likelihood of falsification or fraud.
source("Day_percentmatch-function.R")
<- percentmatch(sarajevo_data_noadtrim, idvar = "ID", include_na = TRUE, timing = FALSE) %>%
sarajevo_data_pctmatchR rename(ID = id,
pct_match = pm,
match_id = id_m) %>%
left_join(sarajevo_data_noadtrim) %>%
relocate(pct_match, match_id, .after = last_col())
<- sarajevo_data_pctmatchR %>%
sarajevo_data_pctmatchR85 zap_label() %>%
arrange(desc(pct_match)) %>%
filter(round(pct_match, digits = 2) >= .85) %>%
select(ID, pct_match, match_id)
#gt(sarajevo_data_pctmatchR85)
#80% cutoff
<- sarajevo_data_pctmatchR %>%
sarajevo_data_pctmatchR80 zap_label() %>%
arrange(desc(pct_match)) %>%
filter(round(pct_match, digits = 2) >= .80) %>%
select(ID, pct_match, match_id)
<- left_join(sarajevo_data_pctmatchR85, sarajevo_data_pctmatchR80)
sarajevo_data_pctmatch_cut
gt(sarajevo_data_pctmatch_cut) %>%
cols_label(
pct_match = "Prop. Match",
match_id = "Matching ID") %>%
fmt_number(
columns = pct_match,
decimals = 3) %>%
tab_header(
title = md('**Observations with "Near Duplicates"**')) %>%
tab_options(container.height = px(500))
**Observations with “Near Duplicates”**

| ID | Prop. Match | Matching ID |
|---|---|---|
| 336 | 0.882 | 681 |
| 681 | 0.882 | 336 |
| 29 | 0.878 | 552 |
| 169 | 0.878 | 464 |
| 464 | 0.878 | 169 |
| 552 | 0.878 | 29 |
| 141 | 0.869 | 421 |
| 421 | 0.869 | 141 |
| 87 | 0.861 | 105 |
| 105 | 0.861 | 87 |
| 51 | 0.852 | 752 |
| 137 | 0.852 | 326 |
| 182 | 0.852 | 800 |
| 326 | 0.852 | 137 |
| 353 | 0.852 | 466 |
| 406 | 0.852 | 552 |
| 466 | 0.852 | 353 |
| 477 | 0.852 | 495 |
| 495 | 0.852 | 477 |
| 752 | 0.852 | 51 |
| 760 | 0.852 | 477 |
| 800 | 0.852 | 182 |
| 83 | 0.848 | 154 |
| 154 | 0.848 | 83 |
| 493 | 0.848 | 552 |
As you can see above, when rounded to the second decimal, we have 25 observations that match another observation in the data on at least 85% of the substantive variables. According to their tutorial, KR would have interpreted this as indicating that 3.04% of the observations in our data are “likely falsified.” Again, we think this interpretation is premature (at best) and problematic, for reasons that should become clear later.
For now, note that this observed “near duplicate” percent is below the 5% and 10% benchmarks used by KR to indicate levels of near duplication that would bias results. If we relax the level of concern for near duplication to 80%, we would have 128 observations falling within that range (see below). So, while only about 3% meet KR criteria for near duplication, about 16% fall within a more relaxed 80% match criterion.
Although we have no principled reason to look at 80% as opposed to any other number, the 3% of our sample that falls clearly within KR’s criteria is at least worthy of further investigation.
In addition to looking at specific observations that have large maximum proportion match values, we can also look at the overall distribution of maximum proportion match values within the data.
pctmatch_tab <- sarajevo_data_pctmatchR %>%
  select(pct_match) %>%
  report_table()

gt(data = pctmatch_tab) %>%
  cols_hide(columns = c(n_Obs, MAD, percentage_Missing)) %>%
  fmt_number(
    columns = c(Mean, SD, Median, Min, Max, Skewness, Kurtosis),
    decimals = 2,
    use_seps = FALSE) %>%
  tab_header(
    title = md('**Descriptive Statistics for Maximum Proportion Match**'))
**Descriptive Statistics for Maximum Proportion Match**

| Variable | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|
| pct_match | 0.69 | 0.12 | 0.72 | 0.26 | 0.88 | −1.05 | 1.11 |
The table above shows basic descriptive statistics for the maximum proportion match variable. The mean of the sample is 0.69, which is close to the mean of .66 that KR found in their simulation of 100,000 data sets with 1,000 observations, 100 variables, and random assignment of two distinct values in each cell (see p.285).
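As a quick sanity check on that baseline, one could rerun a scaled-down version of KR’s simulation through the percentmatch() function we sourced above. This is a sketch with a single replication rather than their 100,000, so the result should only land in the neighborhood of their reported mean.

#one replication of KR's baseline: 1,000 obs x 100 binary variables
set.seed(1138)
sim_data <- as_tibble(matrix(sample(0:1, 1000 * 100, replace = TRUE), nrow = 1000),
                      .name_repair = "unique") %>%
  mutate(ID = row_number())
sim_pm <- percentmatch(sim_data, idvar = "ID", include_na = TRUE, timing = FALSE)
mean(sim_pm$pm)  #should fall somewhere near KR's simulated mean of ~.66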
To get a better sense of how these values are distributed, we can visualize the entire distribution using a histogram that is similar to those found throughout KR’s paper.
pctmatch_hist <- ggplot(data = sarajevo_data_pctmatchR, aes(x = pct_match)) +
  geom_histogram(aes(y = after_stat(density)),
                 color = "white", fill = "#440154FF") +
  scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) +
  labs(title = "Distribution of Maximum Proportion Match",
       y = "Density", x = "Maximum Proportion Match") +
  theme_minimal() +
  theme(panel.grid = element_blank())

pctmatch_hist
A couple of interesting things stand out with this distribution. First, it actually looks pretty similar to the distribution KR demonstrated in the appendix of their pre-print for a non-OECD country sample in which fraud was deemed unlikely to have occurred. There are a handful of observations above the 85% criterion, but there are no weird peaks in cases above 80% match like KR identified in the example of expected fraud they report in their published paper (see p.289).
Second, our distribution has a fatter tail (i.e., the range goes as low as 0.26) compared to many of those in KR’s study. KR’s examples, including the simulated examples, rarely had observations exhibiting less than a 40% maximum match. Based on their examples, KR claim that the distributions of non-fraudulent data generally should resemble a Gumbel distribution, which is a known type of extreme value distribution.
At first glance, this might seem to raise a red flag for our data, as our observed maximum proportion match distribution displayed above does not appear to approximate a Gumbel distribution. However, KR do not offer any principled reasons to expect all non-fraudulent maximum proportion match distributions to specifically mirror a Gumbel distribution as opposed to, say, any other known or generalized extreme value distribution (or, for that matter, to any other distributional form).
In fact, if you visit the journal issue containing KR’s published paper, you will also find a response paper from Simmons and colleagues, a team of scholars at Pew Research Center. (They had previously posted a version on Pew’s website as well). In their response, Simmons and colleagues specifically challenged KR’s claim that the maximum percent match statistic should follow a Gumbel distribution, and they describe “several reasons why we should not expect this to be the case in general” (see pp.328-9). For example, they note that the “belief that the statistic should follow any particular smooth distribution depends on the assumption that the values are independent and identically distributed” (p.329) and, as they subsequently explain, these are invalid assumptions for modeling survey response data.
Nonetheless, for the sake of thoroughness, we examined the “fit” of our distribution to the Gumbel and to other extreme value distributions. For those interested in such minutiae, we have included this detour in an expandable section below. Here is the TL;DR version: Although the maximum proportion match distribution from our BiH data does not follow a theoretical Gumbel distribution, it does fit reasonably well with a Weibull distribution, which is another known extreme value distribution. Substantively, this might mean that our data are not that surprising given the (debatable) assumption that the maximum proportion match metric in our data reflects a data generating process that is congruent with extreme value theory.
The Gumbel distribution is an extreme-value distribution often used for modeling the distribution of maximum values (e.g., maximum percent match in our case).8 KR base their argument that the distribution of extreme values should follow a Gumbel distribution on observations of distributions generated by their simulated maximum percent match values for both uncorrelated and correlated data. They also note that their analyses of real-world survey data from reputable survey organizations, including those of relatively homogeneous groups (e.g., a survey of a small ethnic minority in a single metropolitan area), “closely approximate a Gumbel distribution” (p. 289).
However, as noted above, Simmons and colleagues (2016) criticized KR’s lack of theoretical justification for expecting a Gumbel distribution. Much like the arbitrary 85% “likely fraud” threshold, Simmons and colleagues contend this distributional expectation stems from arbitrary design and parameter selections in KR’s simulations. They state: “As with the 85% cutoff, the probability of observing deviations from the expected distribution proposed by K&R will depend greatly on the characteristics of the specific survey and the target population” (p. 329).
While we agree with this critique, we also would not be surprised if maximum proportion match metrics tend to approximate generalized extreme value distributions in certain situations. Assuming extreme value theory is (sometimes) relevant to the proportion match metric, then one could assess the fit of an observed or empirical distribution to Gumbel and other extreme value distributions using standard statistical tools that permit more systematic assessment than that found in KR’s study. For instance, when someone wants to compare an empirical distribution to a theoretical distribution, they will often examine a quantile-quantile plot. Then, if they want to examine these differences more formally, they may utilize a goodness-of-fit test (although these become less useful in large samples). To do this, we use the “extRemes” package created by Eric Gilleland.9
First, we’ll fit a Gumbel distribution to the data using the fevd() function in the “extRemes” package. Then we’ll plot the theoretical distribution based on the estimated parameters against the empirical distribution using the ggplot extension package “qqplotr.”
fit_gumbel <- fevd(x = sarajevo_data_pctmatchR$pct_match, type = "Gumbel", time.units = NULL, period.basis = NULL)

summary(fit_gumbel)
fevd(x = sarajevo_data_pctmatchR$pct_match, type = "Gumbel",
time.units = NULL, period.basis = NULL)
[1] "Estimation Method used: MLE"
Negative Log-Likelihood Value: -421.9886
Estimated parameters:
location scale
0.6256592 0.1410741
Standard Error Estimates:
location scale
0.005241089 0.003251954
Estimated parameter covariance matrix.
location scale
location 2.746901e-05 5.869466e-06
scale 5.869466e-06 1.057521e-05
AIC = -839.9772
BIC = -830.5537
gumbel_param <- as_tibble_row(distill(fit_gumbel))

gumbel_qqplot <- ggplot(data = sarajevo_data_pctmatchR, aes(sample = pct_match)) +
  stat_qq_band(bandType = "boot", fill = "#218F8B", alpha = 0.25, identity = TRUE, B = 1000,
               distribution = "evd",
               dparams = list(loc = gumbel_param$location, scale = gumbel_param$scale)) +
  stat_qq_line(color = "#218F8B", identity = TRUE,
               distribution = "evd",
               dparams = list(loc = gumbel_param$location, scale = gumbel_param$scale)) +
  stat_qq_point(color = "#440154", alpha = 0.25,
                distribution = "evd",
                dparams = list(loc = gumbel_param$location, scale = gumbel_param$scale)) +
  labs(title = "Quantile-Quantile Plot of Gumbel Distribution",
       x = "Gumbel Distribution Quantiles",
       y = "Maximum Proportion Match Quantiles") +
  scale_y_continuous(breaks = seq(.2, 1, .1)) +
  scale_x_continuous(breaks = seq(.2, 1.8, .2)) +
  coord_cartesian(ylim = c(.2, 1)) +
  theme_light() +
  theme(panel.grid = element_blank(),
        panel.border = element_blank())

gumbel_qqplot
The above chart shows the theoretical distribution and 95% confidence interval (based on pointwise parametric bootstrap with 1,000 samples) represented as a line and confidence band with the empirical distribution represented as the individual data points.
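As mentioned earlier, a formal goodness-of-fit test can complement the eyeball test. Here is a minimal sketch using base R’s ks.test() with the pevd() cumulative distribution function from extRemes, with the usual caveats: the parameters were estimated from the same data, ties in pct_match will trigger a warning, and such tests are over-powered at our sample size.

#KS test of the observed values against the fitted Gumbel (sketch)
ks.test(sarajevo_data_pctmatchR$pct_match, pevd,
        loc = gumbel_param$location, scale = gumbel_param$scale, type = "Gumbel")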
Our empirical distribution clearly does not fit a theoretical Gumbel distribution. However, recall that the Gumbel distribution is only one type of extreme value distribution. Instead of assuming the restrictive Gumbel form, we can use the same procedures as above to fit a generalized extreme value distribution, which estimates a shape parameter in addition to the location and scale parameters. (The Gumbel distribution is the reduced-form generalized extreme value distribution in which the shape parameter is zero.)
fit_gev <- fevd(x = sarajevo_data_pctmatchR$pct_match, type = "GEV", time.units = NULL, period.basis = NULL)

summary(fit_gev)
fevd(x = sarajevo_data_pctmatchR$pct_match, type = "GEV", time.units = NULL,
period.basis = NULL)
[1] "Estimation Method used: MLE"
Negative Log-Likelihood Value: -693.2064
Estimated parameters:
location scale shape
0.6638157 0.1241461 -0.5632021
AIC = -1380.413
BIC = -1366.278
gev_param <- as_tibble_row(distill(fit_gev))
In the summary above, you’ll notice that the estimated shape parameter is a negative value: -0.56. This means that the distribution better fits with a Weibull or Type III generalized extreme value distribution (see the data analysis classroom lesson). The nice thing about the generalized extreme value distribution, according to NASA’s Global Modeling and Assimilation Office, is that it “…unites the Gumbel, Fréchet and Weibull distributions into a single family to allow a continuous range of possible shapes.”
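For readers who want the formula behind that “single family” description, the GEV cumulative distribution function (in the standard parameterization we believe fevd() uses, with location μ, scale σ, and shape ξ) is:

$$G(x) = \exp\left\{-\left[1 + \xi\left(\frac{x - \mu}{\sigma}\right)\right]^{-1/\xi}\right\}$$

As ξ → 0 this reduces to the Gumbel; ξ > 0 gives the heavy-tailed Fréchet; and ξ < 0 gives the reversed Weibull, which has a finite upper endpoint at μ − σ/ξ. Plugging in the estimates above gives roughly 0.664 + 0.124/0.563 ≈ 0.88, essentially the maximum observed value in our data, which helps explain why this fit succeeds where the unbounded Gumbel fails.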
Now, we can produce another quantile-quantile plot to compare the fit of this generalized extreme value distribution to our observed maximum proportion match distribution.
gev_qqplot <- ggplot(data = sarajevo_data_pctmatchR, aes(sample = pct_match)) +
  stat_qq_band(bandType = "boot", fill = "#218F8B", alpha = 0.25, identity = TRUE, B = 1000,
               distribution = "evd",
               dparams = list(loc = gev_param$location, scale = gev_param$scale, shape = gev_param$shape)) +
  stat_qq_line(color = "#218F8B", identity = TRUE,
               distribution = "evd",
               dparams = list(loc = gev_param$location, scale = gev_param$scale, shape = gev_param$shape)) +
  stat_qq_point(color = "#440154", alpha = 0.25,
                distribution = "evd",
                dparams = list(loc = gev_param$location, scale = gev_param$scale, shape = gev_param$shape)) +
  labs(title = "Quantile-Quantile Plot of Generalized Extreme Value Distribution",
       x = "Generalized Extreme Value Distribution Quantiles",
       y = "Maximum Proportion Match Quantiles") +
  scale_y_continuous(breaks = seq(.2, 1, .1)) +
  #scale_x_continuous(breaks = seq(.2, 1.8, .2)) +
  coord_cartesian(ylim = c(.2, 1)) +
  theme_light() +
  theme(panel.grid = element_blank(),
        panel.border = element_blank())

gev_qqplot
As you can see in the above plot, this generalized extreme value distribution fits our empirical distribution quite well and, unsurprisingly, it provides a much better fit to the data than a Gumbel distribution.
Another way to assess the fit of the theoretical distribution to the empirical distribution is to simply simulate data from those theoretical distributions and plot it next to our empirical data. You can see this below, where we simulate 10,000 draws from both the Gumbel and Weibull distributions fitted above and plot their densities next to the empirical distribution found in our data.
# Empirical Distribution Data
pctmatch <- sarajevo_data_pctmatchR %>%
  select(pct_match) %>%
  rename(pctmatch = pct_match) %>%
  mutate(dist = "Empirical")

# Gumbel Distribution Data
set.seed(7734)
gumbel <- revd(10000, loc = gumbel_param$location, scale = gumbel_param$scale) %>%
  as_tibble() %>%
  rename(pctmatch = value) %>%
  mutate(dist = "Gumbel")

# Weibull Distribution
set.seed(7734)
weibull <- revd(10000, loc = gev_param$location, scale = gev_param$scale, shape = gev_param$shape) %>%
  as_tibble() %>%
  rename(pctmatch = value) %>%
  mutate(dist = "Weibull")

distcomp <- bind_rows(pctmatch, gumbel, weibull)

#Compare theoretical distributions to empirical
distcomp_plot <- ggplot(data = distcomp, aes(color = dist)) +
  geom_density(aes(x = pctmatch), linewidth = 1) +
  # geom_vline(aes(xintercept = 0.70), color = "grey", linetype = 2, linewidth = .5) +
  # geom_vline(aes(xintercept = 0.80), color = "grey", linetype = 2, linewidth = .5) +
  scale_color_manual(values = c("#440154", "#FDE725FF", "#218F8B")) +
  labs(title = "Density Plot of Empirical Distribution and Theoretical Extreme Value Distributions",
       x = "Maximum Proportion Match",
       y = "Density",
       color = "Distribution") +
  scale_x_continuous(breaks = seq(.2, 1.8, .2)) +
  theme_light() +
  theme(panel.grid = element_blank(),
        panel.border = element_blank())

distcomp_plot
You can see from the plot above that our empirical distribution of maximum proportion match aligns well with the simulated data from the Weibull distribution. The main difference is that our actual data have more maximum proportion match values in the 70% - 80% range than would be expected if they followed the Weibull distribution more closely. You can also see from the plot that our empirical data do not align with the Gumbel distribution. Part of this is because, unlike the maximum proportion values we are calculating, the Gumbel distribution is not constrained between zero and one, which is another reason to be skeptical of KR’s claim that this metric should follow a Gumbel distribution. Indeed, one of the strengths of the Weibull distribution in relation to the maximum proportion match metric is that it has a “light tail with a finite upper bound” (see Wikipedia’s “Extreme value theory” page).
At the risk of unnecessarily repeating ourselves, it is not clear to us how exactly KR determined that their maximum percent match metric should follow a Gumbel distribution other than simply eyeballing the empirical distributions generated in their study. To examine this issue more formally, we would need to replicate their simulations. We are not going to do that here, but we definitely recommend Simmons and colleagues’ 2016 paper if you are interested in reading more about these issues. Rather, our main goal with this detour was to more formally examine the distributional properties of the maximum proportion match metric within the context of our sample. We ended up stumbling onto some extreme value theory and found that one type of extreme value distribution (Weibull) is a reasonable fit to our data.
In sum, if the maximum proportion match metric in our data stems from a data generating process congruent with extreme value theory, then our maximum proportion match calculations might not be that atypical or surprising. Again, though, Simmons and colleagues rightfully question the assumption that this metric should always match such distributions. Hence, we will end this detour by echoing Simmons and colleagues’ warning that a lack of fit between an observed percent match distribution in one’s own data and a Gumbel or other theoretical extreme value distribution will occur naturally - that is, for non-fraudulent reasons - in many social survey designs. For instance, Simmons and colleagues state:
Within any national population, we should expect to observe not a single distribution, but a mixture of distributions corresponding to the different subpopulations. Distinct subgroups within a surveyed population will share different numbers of characteristics or opinions that are measured in the survey and that essentially represent different data generating processes. Particularly homogenous subgroups within the population will appear as additional modes or bunching toward the tail of the distribution.
Therefore, an observed departure from theoretical extreme value distributions (e.g., clustering above 85% maximum percent match) should not be taken prima facie as evidence of fraud. Rather, we view such departures as reasons to further investigate one’s data, with the goal of trying to understand what processes (e.g., research design features and/or fraud) might have generated these distributional characteristics.
Another difference between our distribution and the distributions discussed by KR is that there appear to be more data points near the 80% match level. Indeed, the mode for our data is 0.74, which is larger than the modes of 60-66% reported by KR in the examples and simulations from the pre-print appendix. Interestingly, our mode is closest to the 70% found in the “Ethnic Minority Survey in Three US Counties” reported in Figure 8 of the pre-print appendix. That survey was meant to be an example of “a unique sub-population in a geographically concentrated area” (p. 23), which is perhaps more similar to our survey of a single city in Bosnia and Herzegovina. Of course, that particular survey did not have nearly the same concentration around the 80% level of maximum match. So there may be something else going on with these data that is producing these relatively large values.
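Incidentally, because maximum proportion match is continuous, a “mode” like 0.74 is best read as the peak of a density estimate. Here is a quick sketch of one way to locate it (the exact value will depend on the density bandwidth):

#locate the mode as the peak of a kernel density estimate (bandwidth-dependent)
dens <- density(sarajevo_data_pctmatchR$pct_match)
dens$x[which.max(dens$y)]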
A cynic might initially assume that something fishy is going on (e.g., interviewers or the survey organization falsifying data), but that the patterns were not quite detectable at KR’s 85% threshold. A less cynical alternative take is that there might be something about the nature of the research design and data themselves that generate relatively large values of maximum percent match for entirely non-fraudulent reasons.
Our initial hunch was that typical design and data features in criminological research would have a non-negligible impact on the percent match metric, thereby further reducing the utility of KR’s arbitrary threshold for detecting likely fraud in survey data. Specifically, based on our experiences with analyzing criminological survey data, we expected that a large number of respondents would match on identical (e.g., “zero”) response values across large item sets that ask about self-reported past criminal behavior and criminal intent, crime-related attitudes and beliefs, and other crime-relevant items (e.g., perceptions of family, peer, and neighborhood crime attitudes and behaviors). Thus, we expected the highly skewed and zero-inflated nature of these crime and crime-relevant survey response distributions might artificially inflate the central tendency of maximum percent match metrics.
To formally test this possibility, we could simulate data with known properties and vary things like zero-inflation or levels of inter-item agreement to assess the impact on simulated percent match distributions. However, Simmons and colleagues (2016) already varied some of the parameters from KR’s simulations to make similar points, so we will not reinvent that wheel here. Rather, we will first dig into the cynical explanation by examining the distributions of maximum percent match metrics for each individual interviewer in our BiH data. Then, following a detailed dig into interviewer-, municipality-, and cluster-specific proportion match distributions, we will briefly explore how features of criminological data might affect proportion match metrics. Overall, our goal in these investigations is to try to better understand our data and the processes that might have generated our observed maximum proportion match distributions, with the hope that doing so will help us more convincingly identify potential red flags (e.g., evidence of “likely fraud”) or, ideally, assuage our and others’ concerns about fraud.
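Though we will not reproduce those simulations, a hedged sketch of the basic idea is easy to run with the percentmatch() function sourced earlier: compare the mean maximum match for balanced binary items against zero-inflated items that mimic skewed self-report crime measures. The sample size, item count, and zero-inflation level below are arbitrary choices for illustration only.

#mean maximum match under balanced vs. zero-inflated binary items (sketch)
sim_pm_mean <- function(p_zero, n = 500, k = 50, seed = 7734) {
  set.seed(seed)
  d <- as_tibble(matrix(rbinom(n * k, 1, 1 - p_zero), nrow = n),
                 .name_repair = "unique") %>%
    mutate(ID = row_number())
  mean(percentmatch(d, idvar = "ID", include_na = TRUE, timing = FALSE)$pm)
}
sim_pm_mean(p_zero = 0.50)  #balanced items
sim_pm_mean(p_zero = 0.85)  #zero-inflated items: expect a higher mean max match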
When potentially problematic maximum proportion match values (KR’s “likely fraud” cases) are identified, we think this should be the start and not the endpoint of one’s investigation into near duplication patterns in their data. An obvious next step for us is to examine the distribution of maximum proportion match by interviewer. We happen to have interviewer ID variables in the raw data that make this possible. For one data set they found suspicious, KR produced a table showing, for every interviewer with at least one observation exceeding 80% match, the number of their interviews above and below that mark. In our data, we only have 14 interviewers, so we should be able to generate and visualize each interviewer’s distribution of maximum percent match values. Before we do, let’s first see how many interviews each interviewer performed in collecting our BiH data.
int_count <- sarajevo_data %>%
  group_by(Interviewer) %>%
  count(Interviewer)

sarajevo_data %>%
  tabyl(Interviewer) %>%
  adorn_pct_formatting() %>%
  gt() %>%
  cols_label(
    n = "freq.",
    percent = "%") %>%
  tab_header(
    title = md("**Interview Frequency**")) %>%
  tab_options(container.height = px(500))
**Interview Frequency**

| Interviewer | freq. | % |
|---|---|---|
| 45 | 40 | 4.8% |
| 46 | 102 | 12.3% |
| 47 | 130 | 15.7% |
| 48 | 40 | 4.8% |
| 49 | 52 | 6.3% |
| 50 | 50 | 6.1% |
| 62 | 30 | 3.6% |
| 63 | 82 | 9.9% |
| 64 | 40 | 4.8% |
| 65 | 49 | 5.9% |
| 66 | 80 | 9.7% |
| 67 | 70 | 8.5% |
| 68 | 51 | 6.2% |
| 69 | 10 | 1.2% |
The median number of interviews was 50.5 with a range of 10 - 130. If the near duplicates are spread out among all of the 14 interviewers, it may suggest that there was no systematic fraud present - at least, not at the level of the interviewer (e.g., a survey organization committing sophisticated fraud could near duplicate cases randomly across interviewers and design strata). Alternatively, if the near duplicates are concentrated among one or a few interviewers, then it might suggest the presence of strong “interviewer effects” on the maximum proportion match metric (e.g., interviewer fraud via near duplication). However, interviewer-level clustering might also occur for completely benign reasons as well.
Before going any further, this is a good place to pause and once again urge caution against jumping to any conclusions regarding fraud from near duplicate analyses - even after conducting more detailed interviewer-specific proportion match examinations. After all, interviewer-level clustering also might be expected to occur for entirely non-fraudulent reasons. For instance, interviewers tend to collect data around specific starting points and, as such, we might expect survey responses to be more homogeneous within versus across interviewers. Likewise, interviewers collecting data in areas with more homogeneous populations (e.g., areas with very low levels of self-reported crime) would be expected to have higher proportion match values, while interviewers collecting data from more heterogeneous populations and/or from multiple heterogeneous areas might be expected to have lower overall proportion match values.
Moreover, the belief that a high proportion match value is indicative of “likely fraud” itself is based on strong assumptions about how (interviewer, organization, or researcher) fraudsters engage in fraud. One might observe clustering of high proportion match values if a fraudster duplicated an interview or row of data and then lazily changed one, two, or a few variables. Yet, a more sophisticated fraudster that changed substantially more data - or an organization or researcher that used (quasi-)random or simulation techniques to generate partially duplicated or otherwise fraudulent data - would not necessarily generate clusters of high proportion match values. Rather, in these instances, we might expect to observe clusters of low proportion match values - or, in the simulation fraud example, no clusters at all! We will return to these points later. For now, we felt it was important to remind readers that these types of “forensic” data tools rarely, if ever, provide dispositive evidence of fraud. At best, they can provide us a deeper understanding of our data and potentially increase or reduce uncertainty about the probability of fraud.
Still, we think it is worthwhile to further investigate these near duplication patterns. So, to get a quick sense of interviewer-level variation in maximum proportion match distributions, we can plot a simple histogram for each interviewer.
sarajevo_data_int <- sarajevo_data %>%
  select(ID, starttime, Interviewer, Region, Municipality, Cluster, Settlement, weight) %>%
  left_join(sarajevo_data %>% group_by(Interviewer) %>% summarize(int_count = n())) %>% #joining grouped data
  mutate(int_label = paste0('Int. ', Interviewer, ' (n = ', int_count, ')'))

sarajevo_data_intmatch <- inner_join(x = sarajevo_data_pctmatchR, y = sarajevo_data_int, by = "ID") %>%
  mutate(pct_match_gte80 = factor(ifelse(round(pct_match, digits = 2) >= .8, 1, 0)),
         pct_match_lt80 = factor(ifelse(round(pct_match, digits = 2) < .8, 1, 0)),
         pct_match_gte85 = factor(ifelse(round(pct_match, digits = 2) >= .85, 1, 0)),
         pct_match_lt85 = factor(ifelse(round(pct_match, digits = 2) < .85, 1, 0)))
sarajevo_intmatch_plot <- ggplot(data = sarajevo_data_intmatch) +
  geom_histogram(aes(x = round(pct_match, digits = 2), y = after_stat(density)), bins = 20, color = "white", fill = "#440154FF") +
  geom_density(aes(x = round(pct_match, digits = 2)), color = "#21918c", linewidth = 1) +
  geom_vline(aes(xintercept = 0.80), color = "red", linetype = 2) +
  facet_wrap(~int_label, ncol = 3) +
  labs(title = "Distribution of Maximum Percent Match by Interviewer ID",
       y = "Density", x = "Maximum Percent Match",
       caption = "Note: The histogram is created with 20 bins. The smoothed density curve is overlaid to avoid idiosyncrasies \nwith binning in small samples. The red line is drawn at 80% maximum percent match.") +
  theme_minimal() +
  theme(plot.caption = element_text(hjust = 0))

print(sarajevo_intmatch_plot)
When examining the distributions of specific data sets, KR looked for peaks in the maximum percent match values above 80%. Not surprisingly, given the overall distribution, you see many interviewers with relatively large numbers of observations around the 80% match level. For a few of these interviewers (#46, #63, and #66), these seem to reflect distributions that are shifted towards the high end of the overall distribution in general. However, a few interviewers do appear to have somewhat atypical peaks after the 80% match level (#45, #47, and #64). We will investigate these (and all) interviewers further to see if we can determine just how atypical their distributional peaks are and whether these interviewers should be red-flagged as potentially invalid.
As noted, KR created a table of interviewers who had at least one interview with 80% match or higher to see what percentage of the interviews they did that were above 80% match with another observation in the data. Since we only have 14 interviewers in our data, we can simply create this table for all of our interviewers:
int_pctmatch <- sarajevo_data_intmatch %>%
  select(Interviewer, pct_match) %>%
  group_by(Interviewer) %>%
  report_table()

int_tab80 <- sarajevo_data_intmatch %>%
  tabyl(Interviewer, pct_match_gte80) %>%
  adorn_totals("col") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  adorn_ns(position = "front") %>%
  rename(match_lt80 = "0",
         match_gte80 = "1") %>%
  mutate(Total = parse_number(Total))

int_tab85 <- sarajevo_data_intmatch %>%
  tabyl(Interviewer, pct_match_gte85) %>%
  adorn_totals("col") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  adorn_ns(position = "front") %>%
  rename(match_lt85 = "0",
         match_gte85 = "1") %>%
  mutate(Total = parse_number(Total))

int_tab <- left_join(int_tab80, int_tab85) %>%
  select(-match_lt85) %>%
  relocate(Total, .after = match_gte85)

int_tabgt <- gt(int_tab) %>%
  cols_label(
    match_lt80 = "% Match < 80%",
    match_gte80 = "% Match \u2265 80%",
    match_gte85 = "% Match \u2265 85%",
    Total = "Total Interviews") %>%
  tab_header(
    title = md("**Number (and %) of Respondents by Interviewer & Maximum Percent Match**"))

int_tabgt %>%
  tab_options(container.height = px(500))
**Number (and %) of Respondents by Interviewer & Maximum Percent Match**

| Interviewer code | % Match < 80% | % Match ≥ 80% | % Match ≥ 85% | Total Interviews |
|---|---|---|---|---|
| 45 | 24 (60%) | 16 (40%) | 8 (20%) | 40 |
| 46 | 80 (78%) | 22 (22%) | 2 (2%) | 102 |
| 47 | 96 (74%) | 34 (26%) | 6 (5%) | 130 |
| 48 | 40 (100%) | 0 (0%) | 0 (0%) | 40 |
| 49 | 49 (94%) | 3 (6%) | 0 (0%) | 52 |
| 50 | 50 (100%) | 0 (0%) | 0 (0%) | 50 |
| 62 | 28 (93%) | 2 (7%) | 0 (0%) | 30 |
| 63 | 66 (80%) | 16 (20%) | 3 (4%) | 82 |
| 64 | 19 (48%) | 21 (53%) | 6 (15%) | 40 |
| 65 | 49 (100%) | 0 (0%) | 0 (0%) | 49 |
| 66 | 66 (85%) | 12 (15%) | 0 (0%) | 78 |
| 67 | 66 (97%) | 2 (3%) | 0 (0%) | 68 |
| 68 | 51 (100%) | 0 (0%) | 0 (0%) | 51 |
| 69 | 10 (100%) | 0 (0%) | 0 (0%) | 10 |
A couple of things jump out at us in this table. First, notice that the three interviewers who had potentially suspicious spikes (#45, #47, and #64) account for 20 of the 25 total observations with 85% match or higher. Also, note that two of the three have a substantially larger share of their interviews (40% for #45 and 52.5% for #64) at or above an 80% match with another observation. The next closest is Interviewer #47, who did the most interviews (130) and had 26.2% (34) of their interviews at or above 80%.
Soon, we will be diving deeper into this rabbit hole to examine interviewer-level maximum proportion match values across different research design strata, so it will help to leave the summary tables and histograms behind and, instead, transition to alternative ways to visualize each interviewer’s entire distribution of maximum proportion match values. Below, we produce a jitter plot of maximum proportion match for each interviewer. It is sorted in descending order from the interviewer who completed the most interviews (#47) to the fewest (#69). Note that the points are somewhat transparent, so darker colors indicate multiple points stacking upon those values. The points also include random variation along the vertical axis within each interviewer (i.e., points are randomly “jittered” or displaced up or down). This improves our ability to see how each interviewer’s observations are distributed in terms of maximum proportion match. We placed lines at the (arbitrary) 80% and 85% threshold levels so one can clearly see which interviewers have the most similar observations.
sarajevo_intmatch_jitter <- ggplot(data = sarajevo_data_intmatch) +
  geom_jitter(aes(x = pct_match, y = reorder(int_label, int_count)), color = "#440154", alpha = 0.25,
              position = position_jitter(width = 0, height = .25, seed = 7734)) +
  geom_hline(yintercept = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.5, 13.5), color = "grey", linewidth = 0.5) +
  geom_vline(aes(xintercept = 0.80), color = "grey", linetype = 2, linewidth = 1) +
  geom_vline(aes(xintercept = 0.85), color = "black", linetype = 2, linewidth = 1) +
  scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) +
  labs(title = "Maximum Percent Match by Interviewer ID",
       y = "Interviewer", x = "Maximum Percent Match",
       caption = ("***Note***: grey dashed line indicates 80% maximum percent match; black dashed line indicates 85% maximum percent match.")) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        plot.caption = element_markdown())

sarajevo_intmatch_jitter
It is interesting to see the distributional similarities between the two interviewers with the highest - and, at least according to KR, potentially most problematic - matches (#45 and #64). Substantively, the plot largely reproduces the histograms above. Where this jitter plot may be especially helpful is in examining how the maximum proportion match metric is distributed across interviewers while simultaneously considering other stratifying features of the data as well.
When administering the BiH survey, a two-stage “proportional stratified random sample” was collected by, first, stratifying by the eight different municipalities in Sarajevo, then proportionally and randomly selecting 83 neighborhood clusters across these municipalities and, finally, interviewing no more than 10 people from each cluster.10 Below, we will summarize how the maximum percent match metric is distributed across and within these stratifying features of the data. We will pay particular attention to distributions across the eight municipalities, as a number of them are large enough to allow us to compare the distribution of the maximum percent match metric across multiple interviewers and within the same municipality. First, let’s summarize and visualize maximum proportion match distributions by municipality:
#Municipality
muni_pctmatch <- sarajevo_data_intmatch %>%
  select(Municipality, pct_match) %>%
  group_by(Municipality) %>%
  report_table()

muni_count <- sarajevo_data %>%
  group_by(Municipality) %>%
  count(Municipality)

muni_tab80 <- sarajevo_data_intmatch %>%
  tabyl(Municipality, pct_match_gte80) %>%
  adorn_totals("col") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  adorn_ns(position = "front") %>%
  rename(match_lt80 = "0",
         match_gte80 = "1") %>%
  mutate(Total = parse_number(Total))

muni_tab85 <- sarajevo_data_intmatch %>%
  tabyl(Municipality, pct_match_gte85) %>%
  adorn_totals("col") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  adorn_ns(position = "front") %>%
  rename(match_lt85 = "0",
         match_gte85 = "1") %>%
  mutate(Total = parse_number(Total))

muni_tab <- left_join(muni_tab80, muni_tab85) %>%
  select(-match_lt85) %>%
  relocate(Total, .after = match_gte85)

# Neighborhood Cluster:
clust_pctmatch <- sarajevo_data_intmatch %>%
  select(Cluster, pct_match) %>%
  group_by(Cluster) %>%
  report_table()

clust_count <- sarajevo_data %>%
  group_by(Cluster) %>%
  count(Cluster)

clust_tab80 <- sarajevo_data_intmatch %>%
  tabyl(Cluster, pct_match_gte80) %>%
  adorn_totals("col") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  adorn_ns(position = "front") %>%
  rename(match_lt80 = "0",
         match_gte80 = "1") %>%
  mutate(Total = parse_number(Total))

clust_tab85 <- sarajevo_data_intmatch %>%
  tabyl(Cluster, pct_match_gte85) %>%
  adorn_totals("col") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  adorn_ns(position = "front") %>%
  rename(match_lt85 = "0",
         match_gte85 = "1") %>%
  mutate(Total = parse_number(Total))

clust_tab <- left_join(clust_tab80, clust_tab85) %>%
  select(-match_lt85) %>%
  relocate(Total, .after = match_gte85)

#Merge with pctmatch & Interviewer data
sarajevo_data_stratmatch <- sarajevo_data_intmatch %>%
  left_join(sarajevo_data %>% group_by(Municipality) %>% summarize(muni_count = n())) %>%
  left_join(sarajevo_data %>% group_by(Cluster) %>% summarize(clust_count = n())) %>%
  mutate(muni_label = paste0('Muni. ', Municipality, ' (n = ', muni_count, ')'),
         muni_longlab = paste0('Municipality ', Municipality, ' (n = ', muni_count, ')'),
         clust_label = paste0('Clust. ', Cluster, ' (n = ', clust_count, ')'),
         int_label = fct_reorder(as_factor(int_label), desc(int_count)),
         muni_label = fct_reorder2(as_factor(muni_label), desc(muni_count), Municipality, .desc = FALSE),
         muni_longlab = fct_reorder2(as_factor(muni_longlab), desc(muni_count), Municipality, .desc = FALSE),
         clust_label = fct_reorder(as_factor(clust_label), Cluster))

save(sarajevo_data_stratmatch, file = here("Data", "sarajevo_data_stratmatch.RData"))

# Combined Tables (GT Versions)
muni_tabgt <- muni_tab %>%
  arrange(desc(Total)) %>%
  gt() %>%
  cols_label(
    match_lt80 = "% Match < 80%",
    match_gte80 = "% Match \u2265 80%",
    match_gte85 = "% Match \u2265 85%",
    Total = "Total Observations") %>%
  tab_header(
    title = md("**Number (and %) of Respondents by Municipality & Maximum Percent Match**"))

muni_tabgt
**Number (and %) of Respondents by Municipality & Maximum Percent Match**

| Municipality | % Match < 80% | % Match ≥ 80% | % Match ≥ 85% | Total Observations |
|---|---|---|---|---|
| 1 | 218 (91%) | 22 (9%) | 2 (1%) | 240 |
| 2 | 103 (80%) | 25 (20%) | 3 (2%) | 128 |
| 3 | 104 (81%) | 24 (19%) | 6 (5%) | 128 |
| 4 | 70 (64%) | 40 (36%) | 14 (13%) | 110 |
| 5 | 78 (98%) | 2 (3%) | 0 (0%) | 80 |
| 6 | 50 (100%) | 0 (0%) | 0 (0%) | 50 |
| 7 | 50 (100%) | 0 (0%) | 0 (0%) | 50 |
| 8 | 21 (58%) | 15 (42%) | 0 (0%) | 36 |
sarajevo_munimatch_jitter <- ggplot(data = sarajevo_data_stratmatch) +
  geom_jitter(aes(x = pct_match, y = muni_label), color = "#35b779", alpha = 0.25,
              position = position_jitter(width = 0, height = .35, seed = 7734)) +
  geom_hline(yintercept = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5), color = "grey", linewidth = 0.5) +
  geom_vline(aes(xintercept = 0.80), color = "grey", linetype = 2, linewidth = 1) +
  geom_vline(aes(xintercept = 0.85), color = "black", linetype = 2, linewidth = 1) +
  scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) +
  scale_y_discrete(limits = rev) +
  labs(title = "Maximum Percent Match by Municipality",
       y = "Municipality", x = "Maximum Percent Match",
       caption = ("***Note***: grey dashed line indicates 80% maximum percent match; black dashed line indicates 85% maximum percent match.")) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        plot.caption = element_markdown())

sarajevo_munimatch_jitter
The above chart is sorted by the number of interviews completed in each municipality. You will notice that relatively large maximum percent match values are concentrated in the municipalities with larger numbers of completed interviews (which are presumably larger population centers as well). Of course, you will also notice more variation in general in these municipalities. Now, let’s distinguish between different interviewers within each municipality.
A central goal of ours here is to determine whether the distributions of those “potentially problematic” interviewers with high proportion match values that we identified earlier stand out as atypical relative to the distributions of other interviewers operating in the same municipality. An atypical within-municipality interviewer distribution might be indicative of fraud in this more fine-grained relative investigation. With that said, municipalities are quite large in a city like Sarajevo, so even an atypical within-municipality interviewer distribution might simply be due to an interviewer working from a starting point that was more homogeneous relative to other interviewers in the same larger population strata. Still, let’s see what we find.
colorpal <- glasbey(16)[-c(14, 15)] # creates a 14-color palette from the glasbey 16-color palette by removing the 14th and 15th colors
sarajevo_intmunimatch_jitter <- ggplot(data = sarajevo_data_stratmatch) +
  geom_jitter(aes(x = pct_match, y = muni_label,
                  color = reorder(int_label, int_count, decreasing = TRUE)), alpha = .5,
              position = position_jitter(width = 0, height = .35, seed = 7734)) +
  geom_hline(yintercept = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5), color = "grey", linewidth = 0.5) +
  geom_vline(aes(xintercept = 0.80), color = "grey", linetype = 2, linewidth = 1) +
  geom_vline(aes(xintercept = 0.85), color = "black", linetype = 2, linewidth = 1) +
  scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) +
  scale_y_discrete(limits = rev) +
  scale_color_manual(values = colorpal) +
  labs(title = "Maximum Percent Match by Municipality & Interviewer",
       y = "Municipality", x = "Maximum Percent Match",
       color = "Interviewer") +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        plot.caption = element_markdown())

sarajevo_intmunimatch_jitter
In the above chart, each observation is colored differently according to each interviewer. This gives us a rough sense of how maximum proportion match values are distributed across interviewers within each municipality. Admittedly, it is a little confusing given the number of different colors, but we use this chart to assess whether there is a good mixture of interviewers within each municipality and, if so, to assess whether an interviewer’s observations are (1) similarly or differently distributed across municipalities and (2) similarly or differently distributed relative to other interviewers within the same municipality.
On one hand, if certain interviewers tend to have larger maximum percentage match values within each municipality, then this might suggest “interviewer effects” - though, as noted earlier, such patterns might be generated via benign (e.g., neighborhood clustering) processes as well. On the other hand, if interviewer-specific observations are well mixed within each municipality - i.e., if it looks as if each interviewer’s distribution was randomly generated from the same underlying distribution - then this might minimize concerns about strong individual interviewer effects (though it does not rule out collective interviewer, organizational, or researcher fraud).
For example, take Municipality #4. While maximum proportion match values are generally high in this municipality, they are similarly high for each of the three interviewers who surveyed respondents in that municipality.
In contrast, look at Municipality #3. Here, there is generally a good mixture among 3 of the 4 interviewers who surveyed subjects within that municipality. However, the maximum proportion match observations from Interviewer #64 - one of the “potentially problematic” interviewers identified earlier - are concentrated at the high end of this municipality’s overall distribution. This could be indicative of “likely fraud” on this interviewer’s part - or, it might have occurred for benign reasons, such as a greater degree of respondent homogeneity in the neighborhood clusters they interviewed. Either way, this clear pattern is informative, as it tells us that Interviewer #64’s observations were more similar to one another than other interviewers’ observations within the same municipality. Though this information falls short of being dispositive evidence of fraud, it does not allay our fears of fraud either. As such, it points us to the need for additional investigation of the data collected by this interviewer to see if we can uncover how and why their maximum proportion match distribution is atypically high, and it signals to us that it might be a good idea to conduct sensitivity analyses (i.e., with and without this interviewer’s data).
What about the other two interviewers - #45 and #47 - that were flagged as “potentially problematic” by KR’s threshold? First, let’s look at Interviewer #47. This interviewer conducted the most interviews in the study, and their within-municipality distributions do not appear to be all that atypical relative to other interviewers in this figure. Their high level of maximum proportion match values could simply have been due to the fact that they collected more interviews than everyone else - so they have more observations across the range of values on this metric.
Meanwhile, all of Interviewer #45’s interviews occurred in Municipality #4. In this within-municipality context, their interviews actually look a little better as their maximum proportion match values appear to be more similar to those of other interviewers within the same municipality. Again, you might also expect a similar pattern if these interviews were in fact generated via partial duplication. So, none of this is dispositive evidence for or against fraud - we do not think one can justifiably identify cases of “likely fraud” using this metric as KR claim. Yet, the patterns do suggest that maximum proportion match values vary systematically across municipalities (just as Simmons and colleagues’ 2016 paper anticipated), and accounting for such data generating processes operating at or above the interviewer level when examining this metric might uncover more useful information and lead researchers to different conclusions than they might infer from simply running a naive percent match test and applying simple thresholds using the entire dataset. In this case, we do not see anything particularly alarming about either Interviewer #47 or #45, though we still have questions about Interviewer #64.
Another thing we think you would want to see, if interviewers were not falsifying data through partial duplication (particularly across population strata like municipalities), is their maximum proportion match distributions varying systematically in ways that correspond to other interviewer and municipality variations. Of course, this requires an interviewer to have performed interviews in multiple municipalities (about 5-6 interviewers completed interviews across multiple municipalities in the BiH data). In other words, you would want to see that an interviewer’s maximum proportion match distributions are more reflective of the aggregate than of the interviewer. You can get some sense of this in the chart above, but it is difficult to decipher with all of the different colors. For example, Interviewer #46 appears to have completed interviews across three different municipalities: #1, #2, and #8. If you look at Interviewer #46’s maximum proportion match values across those different municipalities, it does appear that they tend to reflect patterns within the municipality more so than the interviewer. To improve our ability to visualize this tendency, we created the slides below (a rough sketch of the underlying plotting code follows). The slides will automatically scroll through each individual interviewer’s figure, but you can also click through yourself at your own pace.
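The slide figures themselves are produced by code we do not reproduce in full here. As a minimal sketch (assuming the sarajevo_data_stratmatch object built above; the int_plots name and color choices are ours), one per-interviewer overlay plot might be built like this:

# Sketch: one plot per interviewer - everyone else's observations in grey,
# the focal interviewer's observations in green
int_plots <- sarajevo_data_stratmatch %>%
  distinct(Interviewer) %>%
  pull(Interviewer) %>%
  sort() %>%
  map(function(i) {
    ggplot() +
      geom_jitter(data = filter(sarajevo_data_stratmatch, Interviewer != i),
                  aes(x = pct_match, y = muni_label), color = "grey70", alpha = 0.25,
                  position = position_jitter(width = 0, height = .35, seed = 7734)) +
      geom_jitter(data = filter(sarajevo_data_stratmatch, Interviewer == i),
                  aes(x = pct_match, y = muni_label), color = "#35b779", alpha = 0.8,
                  position = position_jitter(width = 0, height = .35, seed = 7734)) +
      scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) +
      scale_y_discrete(limits = rev) +
      labs(title = paste0("Interviewer #", i),
           x = "Maximum Percent Match", y = "Municipality") +
      theme_minimal() +
      theme(panel.grid = element_blank())
  })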
We might usefully extend our approach above by analyzing other stratifying features of the data, like neighborhood clusters. Recall that interviewers were sent to random starting points in 83 neighborhood clusters across the eight municipalities. Below, we present a table that summarizes the frequency of near duplicates by each municipality and for each neighborhood cluster.
sarajevo_data_muni <- sarajevo_data %>%
  select(ID, Municipality) %>%
  left_join(sarajevo_data %>% group_by(Municipality) %>% summarize(muni_count = n())) %>% # joining grouped data
  mutate(muni_label = paste0('Muni. ', Municipality, ' (n = ', muni_count, ')'))

sarajevo_data_munimatch <- inner_join(x = sarajevo_data_intmatch, y = sarajevo_data_muni, by = c("ID", "Municipality"))

# Data with distinct Municipality information
muni_data <- sarajevo_data_munimatch %>%
  select(Municipality, muni_count, muni_label) %>%
  distinct() %>%
  arrange(Municipality)

# Data with distinct Municipality, Cluster, and Interviewer information
intmuni_data <- sarajevo_data_munimatch %>%
  select(Municipality, muni_count, muni_label, Cluster, Interviewer, int_count, int_label) %>%
  distinct()
municlust_tab_func <- function(data, row, col, gval) {
  tab <- data %>%
    filter(Municipality == gval) %>%
    tabyl({{row}}, {{col}}) %>%
    adorn_totals("col") %>%
    adorn_percentages("row") %>%
    adorn_pct_formatting(rounding = "half up", digits = 0) %>%
    adorn_ns(position = "front") %>%
    mutate(Total = parse_number(Total),
           Municipality = paste0(gval))
}

unique_muni <- pull(distinct(sarajevo_data_munimatch, Municipality)) %>%
  zap_labels() %>%
  zap_label() %>%
  zap_formats() %>%
  zap_widths() %>%
  sort()
municlust_tab80 <- map_df(.x = unique_muni, ~municlust_tab_func(data = sarajevo_data_munimatch,
                                                                row = Cluster,
                                                                col = pct_match_gte80,
                                                                gval = .x)) %>%
  rename(match_lt80 = "0",
         match_gte80 = "1") %>%
  mutate(Municipality = as.numeric(Municipality))

municlust_tab85 <- map_df(.x = unique_muni, ~municlust_tab_func(data = sarajevo_data_munimatch,
                                                                row = Cluster,
                                                                col = pct_match_gte85,
                                                                gval = .x)) %>%
  rename(match_lt85 = "0",
         match_gte85 = "1") %>%
  mutate(Municipality = as.numeric(Municipality))

municlust_tab <- left_join(municlust_tab80, municlust_tab85) %>%
  select(-match_lt85) %>%
  left_join(muni_data) %>%
  left_join(intmuni_data) %>%
  relocate(c(Total, Municipality, muni_label), .after = match_gte85)
# GT Version
municlust_tabgt <- municlust_tab %>%
  group_by(muni_label) %>%
  arrange(desc(muni_count)) %>%
  gt() %>%
  cols_hide(columns = c(Municipality, muni_count, Total, Interviewer, int_label, int_count)) %>%
  cols_label(
    match_lt80 = "% Match < 80%",
    match_gte80 = "% Match \u2265 80%",
    match_gte85 = "% Match \u2265 85%",
    muni_label = "Municipality") %>%
  tab_header(
    title = md("**Number (and %) of Respondents by Municipality, Neighborhood Cluster, & Maximum Percent Match**")) %>%
  tab_options(container.height = px(500))

municlust_tabgt
**Number (and %) of Respondents by Municipality, Neighborhood Cluster, & Maximum Percent Match**
Cluster | % Match < 80% | % Match ≥ 80% | % Match ≥ 85% |
---|---|---|---|
Muni. 1 (n = 240) | |||
11 | 10 (100%) | 0 (0%) | 0 (0%) |
12 | 10 (100%) | 0 (0%) | 0 (0%) |
13 | 9 (90%) | 1 (10%) | 0 (0%) |
14 | 10 (100%) | 0 (0%) | 0 (0%) |
15 | 10 (100%) | 0 (0%) | 0 (0%) |
16 | 9 (90%) | 1 (10%) | 0 (0%) |
17 | 10 (100%) | 0 (0%) | 0 (0%) |
18 | 8 (80%) | 2 (20%) | 0 (0%) |
19 | 7 (70%) | 3 (30%) | 0 (0%) |
110 | 3 (30%) | 7 (70%) | 2 (20%) |
111 | 10 (100%) | 0 (0%) | 0 (0%) |
112 | 8 (80%) | 2 (20%) | 0 (0%) |
113 | 10 (100%) | 0 (0%) | 0 (0%) |
114 | 10 (100%) | 0 (0%) | 0 (0%) |
115 | 10 (100%) | 0 (0%) | 0 (0%) |
116 | 10 (100%) | 0 (0%) | 0 (0%) |
117 | 10 (100%) | 0 (0%) | 0 (0%) |
118 | 10 (100%) | 0 (0%) | 0 (0%) |
119 | 8 (80%) | 2 (20%) | 0 (0%) |
120 | 10 (100%) | 0 (0%) | 0 (0%) |
121 | 10 (100%) | 0 (0%) | 0 (0%) |
122 | 10 (100%) | 0 (0%) | 0 (0%) |
123 | 7 (70%) | 3 (30%) | 0 (0%) |
124 | 9 (90%) | 1 (10%) | 0 (0%) |
Muni. 2 (n = 130) | |||
21 | 5 (56%) | 4 (44%) | 0 (0%) |
22 | 9 (100%) | 0 (0%) | 0 (0%) |
23 | 8 (80%) | 2 (20%) | 0 (0%) |
24 | 7 (70%) | 3 (30%) | 1 (10%) |
25 | 10 (100%) | 0 (0%) | 0 (0%) |
26 | 8 (80%) | 2 (20%) | 0 (0%) |
27 | 4 (40%) | 6 (60%) | 2 (20%) |
29 | 10 (100%) | 0 (0%) | 0 (0%) |
210 | 10 (100%) | 0 (0%) | 0 (0%) |
211 | 9 (90%) | 1 (10%) | 0 (0%) |
212 | 8 (80%) | 2 (20%) | 0 (0%) |
213 | 6 (60%) | 4 (40%) | 0 (0%) |
214 | 9 (90%) | 1 (10%) | 0 (0%) |
Muni. 3 (n = 130) | |||
31 | 7 (70%) | 3 (30%) | 0 (0%) |
32 | 5 (50%) | 5 (50%) | 3 (30%) |
33 | 4 (40%) | 6 (60%) | 2 (20%) |
34 | 3 (30%) | 7 (70%) | 1 (10%) |
35 | 9 (90%) | 1 (10%) | 0 (0%) |
36 | 8 (100%) | 0 (0%) | 0 (0%) |
37 | 10 (100%) | 0 (0%) | 0 (0%) |
38 | 10 (100%) | 0 (0%) | 0 (0%) |
39 | 8 (80%) | 2 (20%) | 0 (0%) |
310 | 10 (100%) | 0 (0%) | 0 (0%) |
311 | 10 (100%) | 0 (0%) | 0 (0%) |
312 | 10 (100%) | 0 (0%) | 0 (0%) |
313 | 10 (100%) | 0 (0%) | 0 (0%) |
Muni. 4 (n = 110) | |||
41 | 6 (60%) | 4 (40%) | 1 (10%) |
42 | 5 (50%) | 5 (50%) | 3 (30%) |
43 | 7 (70%) | 3 (30%) | 1 (10%) |
44 | 6 (60%) | 4 (40%) | 1 (10%) |
45 | 7 (70%) | 3 (30%) | 0 (0%) |
46 | 10 (100%) | 0 (0%) | 0 (0%) |
47 | 7 (70%) | 3 (30%) | 1 (10%) |
48 | 3 (30%) | 7 (70%) | 4 (40%) |
49 | 4 (40%) | 6 (60%) | 3 (30%) |
410 | 6 (60%) | 4 (40%) | 0 (0%) |
411 | 9 (90%) | 1 (10%) | 0 (0%) |
Muni. 5 (n = 80) | |||
51 | 10 (100%) | 0 (0%) | 0 (0%) |
52 | 10 (100%) | 0 (0%) | 0 (0%) |
53 | 10 (100%) | 0 (0%) | 0 (0%) |
54 | 10 (100%) | 0 (0%) | 0 (0%) |
55 | 8 (80%) | 2 (20%) | 0 (0%) |
56 | 10 (100%) | 0 (0%) | 0 (0%) |
57 | 10 (100%) | 0 (0%) | 0 (0%) |
58 | 10 (100%) | 0 (0%) | 0 (0%) |
Muni. 6 (n = 50) | |||
61 | 10 (100%) | 0 (0%) | 0 (0%) |
62 | 10 (100%) | 0 (0%) | 0 (0%) |
63 | 10 (100%) | 0 (0%) | 0 (0%) |
64 | 10 (100%) | 0 (0%) | 0 (0%) |
65 | 10 (100%) | 0 (0%) | 0 (0%) |
Muni. 7 (n = 50) | |||
71 | 10 (100%) | 0 (0%) | 0 (0%) |
72 | 10 (100%) | 0 (0%) | 0 (0%) |
73 | 10 (100%) | 0 (0%) | 0 (0%) |
74 | 10 (100%) | 0 (0%) | 0 (0%) |
75 | 10 (100%) | 0 (0%) | 0 (0%) |
Muni. 8 (n = 36) | |||
81 | 6 (60%) | 4 (40%) | 0 (0%) |
82 | 7 (70%) | 3 (30%) | 0 (0%) |
83 | 2 (20%) | 8 (80%) | 0 (0%) |
84 | 6 (100%) | 0 (0%) | 0 (0%) |
The above table, with 83 rows of data, is fairly cumbersome to interpret. Of course, if you scroll through it, you may notice some patterns. Not surprisingly, municipalities with relatively large maximum percent match values have more clusters and more interviews within clusters with potentially problematic matches. However, there also appears to be variation in this pattern, even within municipality. These patterns may be easier to identify if we separately visualize the distributions by cycling through distinct jitter plots as we did above.
First, we can show each municipality and its corresponding clusters separately (similar to how we showed each interviewer separately above); a rough faceted sketch of this idea follows.
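As with the interviewer slides, the actual slide code is not reproduced here, but a compact faceted version (our own sketch, reusing the clust_label and muni_longlab variables created earlier) could look something like this:

# Sketch: cluster-level distributions of maximum percent match, one panel per municipality
ggplot(data = sarajevo_data_stratmatch) +
  geom_jitter(aes(x = pct_match, y = clust_label), color = "#35b779", alpha = 0.25,
              position = position_jitter(width = 0, height = .35, seed = 7734)) +
  geom_vline(aes(xintercept = 0.80), color = "grey", linetype = 2, linewidth = 1) +
  geom_vline(aes(xintercept = 0.85), color = "black", linetype = 2, linewidth = 1) +
  facet_wrap(~muni_longlab, scales = "free_y") +
  labs(title = "Maximum Percent Match by Municipality & Neighborhood Cluster",
       x = "Maximum Percent Match", y = "Neighborhood Cluster") +
  theme_minimal() +
  theme(panel.grid = element_blank())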
These plots, particularly for the smaller municipalities, show that the distributions of maximum proportion match values across neighborhood clusters generally reflect patterns within the municipality. Of course, there is also more variation in the larger municipalities. An issue with looking at the neighborhood clusters is that only one interviewer was assigned to each neighborhood. Thus, unlike with some of the municipalities, we cannot look at how the maximum percent match metric differs by interviewer within the same cluster. However, we can look at (1) how different interviewers’ distributions vary within the same municipality across clusters and (2) how specific interviewers’ distributions vary across municipalities and clusters. We are not sure how helpful these detailed breakdowns are for detecting fraud, but we tend to think that having more information is better than having less. As with municipalities, we would typically expect non-fraudulent data to show interviewer-level distributions varying systematically across neighborhoods in ways that reflect the larger maximum proportion match patterns. That said, seeing such systematic variation is not necessarily strong evidence of the absence of fraud, and failing to see it is not necessarily indicative of the presence of fraud. In any case, here are the plots presented separately by municipality, again with interviewers in different colors and this time also separately depicted across each of the different neighborhood clusters.
Again, one difficulty with these plots is that when you see an interviewer who has relatively large maximum proportion match values, you cannot be sure whether this is due to something peculiar about the interviewer (e.g., fraud via near duplication), to something about the neighborhood clusters to which they were assigned (e.g., cluster-level homogeneity), or perhaps to something else entirely.
By now, you have probably gathered that the theory underlying the maximum proportion match metric as an indicator of fraud is not well developed. There are multiple plausible processes generating the observed maximum proportion match distributions in any given dataset; trying to tease out whether fraud via near duplication is one of those data generating processes using maximum proportion match distributions is like trying to find specific needles that may or may not exist in a stack of other needles.
Still, we are learning more about near duplication and our data through the process, so let’s continue on. The next set of slides contains plots showing each interviewer’s maximum proportion match values across the different municipalities and clusters separately. These plots overlay each municipality’s full range of data (minus the focal interviewer’s) with each interviewer’s cluster-specific maximum proportion match distributions. We find these plots to be particularly useful for visualizing the “within-interviewer” distributions of maximum proportion match values across design stratification features like municipalities and neighborhood clusters (given, of course, that the interviewer completed interviews across multiple municipalities and neighborhood clusters).
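For a single focal interviewer, the basic overlay might be sketched as follows (the focal ID and all names here are illustrative choices of ours, not the code used to build the slides):

# Sketch: focal interviewer's points (green) plotted by neighborhood cluster against
# all other observations in the same municipality(ies) (grey)
focal <- 64 # illustrative focal interviewer ID
focal_munis <- sarajevo_data_stratmatch %>%
  filter(Interviewer == focal) %>%
  distinct(Municipality) %>%
  pull(Municipality)

ggplot(data = filter(sarajevo_data_stratmatch, Municipality %in% focal_munis)) +
  geom_jitter(aes(x = pct_match, y = clust_label, color = Interviewer == focal),
              alpha = 0.5,
              position = position_jitter(width = 0, height = .35, seed = 7734)) +
  scale_color_manual(values = c("FALSE" = "grey70", "TRUE" = "#35b779"),
                     labels = c("Other interviewers", paste0("Interviewer #", focal))) +
  labs(title = paste0("Interviewer #", focal, " in Municipality Context"),
       x = "Maximum Percent Match", y = "Neighborhood Cluster", color = NULL) +
  theme_minimal() +
  theme(panel.grid = element_blank())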
With these plots, it is interesting to return to those three interviewers who were identified as having potentially suspicious data above: Interviewers #47, #45, and #64. For Interviewers #47 and #45, we see more distributional variation across neighborhood clusters both across and within different municipalities (#45 only interviewed in one municipality). This is the case for these interviewers even though their overall pattern of maximum proportion match values is skewed towards larger values. Now, compare this with Interviewer #64 who, like Interviewer #45, only completed interviews in a single municipality. For this interviewer, we see that their relatively large maximum proportion match values exist across the four neighborhood clusters in which they completed interviews, and all four of their clusters are clearly on the high end of the overall range for the municipality. We still cannot determine with any degree of reasonable certainty whether this interviewer did anything nefarious, as such patterns could occur for benign reasons as well (e.g., they could have been sent to four relatively homogeneous clusters). Still, this information has not shaken our belief that this particular interviewer’s data should be subjected to additional scrutiny and perhaps flagged for sensitivity tests in later analyses using these data.
We wish to return to a point raised earlier about how predictable features of criminological data might systematically impact forensic analyses of near duplicates using maximum proportion or percent match metrics. Specifically, we noted earlier that the heavily skewed and often zero-inflated nature of crime and crime-relevant data might be expected to increase central tendencies for maximum percent match distributions and, if so, would then make reliance on arbitrary thresholds (e.g., >85% match) particularly unreliable for forensic detection of “likely fraud” in criminological data.
Put plainly, if we routinely ask about relatively rare events - or, in the public opinion context, when we ask about things with high degrees of consensus, things susceptible to strong social desirability effects, or anything exhibiting clustered survey response distributions - then we would expect higher levels of response similarity among respondents and, therefore, higher maximum percent match values. To use a concrete example, if almost everyone answers “Never” to a survey item asking how often they “Physically harmed someone else on purpose in the past two years,” then nearly everyone in the sample would match on that particular variable. If we have a large number of items like this in our survey data, then we can probably expect higher overall maximum percent match values in our data as well. We illustrate this issue using our BiH data below.
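A quick back-of-the-envelope simulation makes the point (purely illustrative; the probabilities here are assumed rather than estimated from our survey): if two respondents independently answer “Never” with probability 0.9 on each item, they will agree on roughly 0.9² + 0.1² = 82% of those items by chance alone.

# Toy illustration with assumed probabilities: chance agreement on zero-inflated items
set.seed(1138)
p_never <- 0.9 # assumed share answering "Never" (category 1)
resp_a <- rbinom(1000, 1, 1 - p_never) + 1 # 1 = "Never", 2 = any other response
resp_b <- rbinom(1000, 1, 1 - p_never) + 1
mean(resp_a == resp_b) # approximately 0.9^2 + 0.1^2 = 0.82

Collapsing all non-“Never” answers into a single category makes this a rough upper bound for that portion of the scale, but the basic intuition carries over: the more concentrated the item, the higher the chance-level match.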
Based on our prior subject domain experiences with analyzing criminological data, we expect each of the eight self-reported deviant/criminal behaviors to exhibit skewed distributions clustering near the low end of the response scales, with the modal category being “zero” self-reported crime involvement.11 Similarly, for items asking respondents “How morally acceptable is it to you to…” engage in the same eight delinquent acts, we expect clustering at or near “Never Acceptable” (the lowest response value) on these items. There are likely other variables in our data that are also highly skewed, but these are the items that we expect a priori to be particularly skewed. In fact, they are the only questions on the survey that were entirely self-reported (i.e., respondents read the questions and reported them on an answer sheet that was then sealed in an envelope).12 This means we are dealing here with questions that involve rare events (crime/deviance) and have clear social desirability valence (crime/deviance and moral acceptability of crime/deviance).
In total, the past and projected crime/deviance items and the moral acceptability items alone comprise 24 survey items in the BiH data that we expect to exhibit highly skewed distributions and, thus, to inflate near duplication metrics. This amounts to about 10.1 percent of the variables we used to calculate maximum percent match, which is potentially a non-negligible number of items given the stakes associated with using this metric (i.e., determining presence of “likely fraud”).
Instead of plotting each of these zero-inflated questions separately, below we simply plot their modes. These reflect the percent of respondents reporting “Never” engaging in any of the behavior in the past two years (i.e. “Past Crime”), “No chance” of engaging in the behavior in the future (i.e., “Projected Crime”), and that the behavior is “Never acceptable” morally (i.e., “Personal Morality”).
# Function for calculating and reporting percentage of modal response
mode_percentage <- function(data, variables = everything()) {
  results <- data %>%
    as_tibble() %>%
    pivot_longer(cols = all_of(variables), names_to = "variable", values_to = "value") %>%
    group_by(variable) %>%
    summarize(mode = names(table(value))[which.max(table(value))],
              percentage_mode = sum(table(value)[mode]) / sum(n()) * 100,
              modes = list(names(table(value))[table(value) == max(table(value))])) # identifies each value that is the mode in case 2 categories are equally the mode
  return(results)
}
# Use function to calculate percentage of modal response and plot past and projected crime (crm_plots) and moral beliefs about crime (mor_plots)
crm_vars <- sarajevo_data_pctmatchR %>%
  dplyr::select(starts_with(c("Q25_", "Q26_"))) %>%
  names()

mor_vars <- sarajevo_data_pctmatchR %>%
  dplyr::select(starts_with("Q50_")) %>%
  names()

# use `zap_labels` to remove value labels from data before running function (was leading to an error)
crm_modes <- mode_percentage(data = haven::zap_labels(sarajevo_data_pctmatchR), variables = crm_vars)
moral_modes <- mode_percentage(data = haven::zap_labels(sarajevo_data_pctmatchR), variables = mor_vars)
zi_modes <- bind_rows(crm_modes, moral_modes) %>%
  select(-modes) %>%
  mutate(label = ifelse(variable %in% c(crm_vars[1:8]), "Past Crime", ""),
         label = ifelse(variable %in% c(crm_vars[9:16]), "Projected Crime", label),
         label = ifelse(variable %in% c(mor_vars), "Personal Morality", label),
         label = factor(label, levels = c("Past Crime", "Projected Crime", "Personal Morality")))

zimode_plot <- ggplot(data = zi_modes, aes(x = rev(variable), y = percentage_mode, fill = rev(label))) +
  geom_bar(stat = "identity") +
  # geom_text(aes(label = paste(round(percentage_mode, digits = 0), "%")), vjust = -0.25) +
  facet_wrap(~rev(label), strip.position = "bottom", scales = "free_x") +
  scale_y_continuous(breaks = seq(0, 100, by = 10)) +
  scale_fill_manual(values = c("#440154FF", "#21908CFF", "#FDE725FF")) +
  labs(y = "Percentage Mode", x = "Question", fill = "Question Category:",
       title = "Percentage Reporting Mode by Question") +
  theme_minimal() +
  theme(panel.spacing = unit(0, "lines"),
        strip.background = element_blank(),
        strip.placement = "outside",
        panel.grid = element_blank(),
        legend.position = "none",
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

zimode_plot
The distributions of the percent in the zero category are very similar across these three sets of expected zero-inflated variables. The range of respondents answering the “zero” category across all 24 items is 78.47% - 97.08% (past crime: 84.67% - 97.08%; projected crime: 78.47% - 95.5%; morality: 81.39% - 95.01%), with an overall mean of 89.54% (past crime: 90.89%; projected crime: 88.44%; morality: 89.29%).
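For reference, summaries like those just reported can be pulled directly from the zi_modes object created above (a sketch; the column names are those produced by our mode_percentage() function):

# Range and mean of the modal-category percentages, overall and by question set
zi_modes %>%
  summarize(min = min(percentage_mode), max = max(percentage_mode), mean = mean(percentage_mode))

zi_modes %>%
  group_by(label) %>%
  summarize(min = min(percentage_mode), max = max(percentage_mode), mean = mean(percentage_mode))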
Given these zero-inflated distributions, a quick and easy sensitivity check is to remove these variables from our data and then calculate the maximum percent match separately for this subset of variables and for the larger subset of data without these variables. This should give us a sense of whether and to what degree the crime and morality items are impacting the maximum percent match values.
To do this, we filter our data into two distinct subsets (no crime/morality items; only crime/morality items), and then recalculate maximum percent match within each subset.
sarajevo_data_nocrmtrim <- sarajevo_data_noadtrim %>%
  select(!(c(starts_with("Q25"), starts_with("Q26"), starts_with("Q50"))))
# names(sarajevo_data_nocrmtrim) # Confirms these variables were removed
# write_dta(sarajevo_data_nocrmtrim, here("Data", "sarajevo_data_nocrmtrim.dta"))

sarajevo_data_crmtrim <- sarajevo_data_noadtrim %>%
  select((c("ID", starts_with("Q25"), starts_with("Q26"), starts_with("Q50"))))
# names(sarajevo_data_crmtrim) # Confirms these are the only variables in the data
# write_dta(sarajevo_data_crmtrim, here("Data", "sarajevo_data_crmtrim.dta"))

# Calculate proportion match
sarajevo_nocrm_pctmatch <- percentmatch(sarajevo_data_nocrmtrim, idvar = "ID", include_na = TRUE, timing = FALSE) %>%
  rename(ID = id,
         pct_match_nocrm = pm, # appended "nocrm" to indicate data source
         match_id_nocrm = id_m) %>%
  left_join(sarajevo_data_nocrmtrim) %>%
  relocate(pct_match_nocrm, match_id_nocrm, .after = last_col())

sarajevo_crm_pctmatch <- percentmatch(sarajevo_data_crmtrim, idvar = "ID", include_na = TRUE, timing = FALSE) %>%
  rename(ID = id,
         pct_match_crm = pm, # appended "crm" to indicate data source
         match_id_crm = id_m) %>%
  left_join(sarajevo_data_crmtrim) %>%
  relocate(pct_match_crm, match_id_crm, .after = last_col())
Now, let’s examine the descriptive statistics for the maximum percent match variable in each subset and compare them to those generated using our full data set containing all items.
pctmatch_nocrm_tab <- sarajevo_nocrm_pctmatch %>%
  select(pct_match_nocrm) %>%
  report_table() %>%
  mutate(sample = paste0("No Crime/Morality Items", " (", ncol(sarajevo_data_nocrmtrim) - 1, ")")) %>%
  relocate(sample)

pctmatch_crm_tab <- sarajevo_crm_pctmatch %>%
  select(pct_match_crm) %>%
  report_table() %>%
  mutate(sample = paste0("Only Crime/Morality Items", " (", ncol(sarajevo_data_crmtrim) - 1, ")")) %>%
  relocate(sample)

pctmatch_tab <- pctmatch_tab %>%
  mutate(sample = paste0("All Items", " (", ncol(sarajevo_data_noadtrim) - 1, ")")) %>%
  relocate(sample)

pctmatch_crmcomp <- bind_rows(pctmatch_tab, pctmatch_nocrm_tab, pctmatch_crm_tab)

gt(data = pctmatch_crmcomp) %>%
  cols_hide(columns = c(n_Obs, Variable, MAD, percentage_Missing)) %>%
  fmt_number(
    columns = c(Mean, SD, Median, Min, Max, Skewness, Kurtosis),
    decimals = 2,
    use_seps = FALSE) %>%
  cols_label(
    sample = "Sample (# of variables)")
Sample (# of variables) | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
---|---|---|---|---|---|---|---|
All Items (237) | 0.69 | 0.12 | 0.72 | 0.26 | 0.88 | −1.05 | 1.11 |
No Crime/Morality Items (213) | 0.66 | 0.12 | 0.69 | 0.25 | 0.88 | −0.87 | 0.56 |
Only Crime/Morality Items (24) | 0.95 | 0.11 | 1.00 | 0.25 | 1.00 | −3.25 | 11.95 |
From this table, it is clear that the crime and morality items impact the summary statistics describing the maximum percent match distribution. Specifically, removing these items lowered the mean maximum percent match from 69% to 66%. Meanwhile, the comparable statistic from the subset containing only crime and morality items is 95%, which is an extremely high degree of row-wise similarity - and is exactly what we expected given the zero-inflated nature of these data.
To get a better sense of these distributions, let’s plot them like we (and KR) did before.
pctmatch_comp_data <- sarajevo_data_pctmatchR %>%
  select(ID, pct_match) %>%
  left_join(sarajevo_nocrm_pctmatch) %>%
  select(ID, pct_match, pct_match_nocrm) %>%
  left_join(sarajevo_crm_pctmatch) %>%
  select(ID, pct_match, pct_match_nocrm, pct_match_crm) %>%
  mutate(pctmatch_nocrmdif = pct_match - pct_match_nocrm,
         pctmatch_crmdif = pct_match - pct_match_crm)

pctmatch_dist <- ggplot(data = sarajevo_data_pctmatchR, aes(x = pct_match)) +
  geom_histogram(aes(y = ..density..), bins = 30,
                 color = "black", fill = "#440154FF") +
  scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) +
  scale_y_continuous(limits = c(0, 5)) +
  labs(x = "All Items") +
  theme_minimal() +
  theme(panel.grid = element_blank())

pctmatch_nocrm_dist <- ggplot(data = sarajevo_nocrm_pctmatch, aes(x = pct_match_nocrm)) +
  geom_histogram(aes(y = ..density..), bins = 30,
                 color = "black", fill = "#21908CFF") +
  scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) +
  scale_y_continuous(limits = c(0, 5)) +
  labs(x = "No Crime/Morality") +
  theme_minimal() +
  theme(panel.grid = element_blank())

pctmatch_crm_dist <- ggplot(data = sarajevo_crm_pctmatch, aes(x = pct_match_crm)) +
  geom_histogram(aes(y = ..density..), bins = 30,
                 color = "black", fill = "#FDE725FF") +
  scale_x_continuous(limits = c(0.2, 1.1), breaks = seq(.2, 1, .1)) +
  labs(x = "Only Crime/Morality") +
  theme_minimal() +
  theme(panel.grid = element_blank())

pctmatch_dist + pctmatch_nocrm_dist + pctmatch_crm_dist +
  plot_annotation(
    title = 'Distribution of Maximum Percent Match by Variables Included')
A careful examination of the first two plots reveals that removal of the crime and morality items causes the distribution of maximum percent match values to shift somewhat toward lower values. The third plot helps us visualize why this happens - the distribution in the data subset containing only crime and morality items is shifted dramatically towards higher values of maximum percent match. Note that this plot has a substantially different y-axis scale: the distribution is literally “off the charts” compared to the other two distributions!
We can also identify how many observations in each data (sub)set are above KR’s potentially problematic (>80%) and “likely fraud” (>85%) thresholds.
gt85full <- pctmatch_comp_data %>% filter(round(pct_match, digits = 2) >= .85) %>% nrow()
gt80full <- pctmatch_comp_data %>% filter(round(pct_match, digits = 2) >= .80) %>% nrow()

gt85nocrm <- pctmatch_comp_data %>% filter(round(pct_match_nocrm, digits = 2) >= .85) %>% nrow()
gt80nocrm <- pctmatch_comp_data %>% filter(round(pct_match_nocrm, digits = 2) >= .80) %>% nrow()
totcrm <- pctmatch_comp_data %>% nrow()

gt85crm <- pctmatch_comp_data %>% filter(round(pct_match_crm, digits = 2) >= .85) %>% nrow()
gt80crm <- pctmatch_comp_data %>% filter(round(pct_match_crm, digits = 2) >= .80) %>% nrow()

ntot <- pctmatch_comp_data %>% nrow()

neardup_obs <- tribble(
  ~Data, ~gt85, ~flag85, ~gt80, ~flag80,
  "All items (237)", gt85full, (gt85full/ntot)*100, gt80full, (gt80full/ntot)*100,
  "No crime/morality items (213)", gt85nocrm, (gt85nocrm/ntot)*100, gt80nocrm, (gt80nocrm/ntot)*100,
  "Only crime/morality items (24)", gt85crm, (gt85crm/ntot)*100, gt80crm, (gt80crm/ntot)*100
)
neardup_obs %>%
  relocate(c(gt80, flag80), .before = gt85) %>%
  gt() %>%
  fmt_number(
    columns = c(flag85, flag80),
    decimals = 1
  ) %>%
  cols_label(
    Data = "Sample (# of variables)",
    gt85 = "n",
    gt80 = "n",
    flag85 = "%",
    flag80 = "%",
    .fn = md) %>%
  tab_spanner(label = ">85% match", columns = c(gt85, flag85)) %>%
  tab_spanner(label = ">80% match", columns = c(gt80, flag80)) %>%
  tab_header(title = md("**Number of observations flagged as near duplicates (N=822)**"))
**Number of observations flagged as near duplicates (N=822)**

Sample (# of variables) | >80% match (n) | >80% match (%) | >85% match (n) | >85% match (%) |
---|---|---|---|---|
All items (237) | 128 | 15.6 | 25 | 3.0 |
No crime/morality items (213) | 68 | 8.3 | 10 | 1.2 |
Only crime/morality items (24) | 755 | 91.8 | 731 | 88.9 |
As the table above shows, the full data have 25 observations above the 85% cutoff and 128 above the 80% cutoff. The data without the crime and morality items have 10 and 68 observations above the 85% and 80% cutoffs, respectively. The data including only the crime and morality items have 731 observations above the 85% cutoff and 755 above the 80% cutoff. Moreover, though the crime and morality items account for only 24 of the 237 items (about 10%) in the percent match analysis, their inclusion in the data approximately doubles the number of observations flagged as potentially problematic at the 80% and 85% maximum match thresholds!
Overall, this clearly illustrates how items we know to be highly clustered or skewed (e.g., zero-inflated) affect maximum percent match calculations and detection of near duplicates. In fact, if our survey had included just the crime and morality items, then virtually all of our observations would have been identified as near duplicates using KR’s criteria!
Given these findings, it might make sense to approach highly skewed (e.g., zero-inflated) items the same way KR approach missing data: remove them in the pre-processing stage prior to near duplicate analysis. That is, recall that KR removed any variable with more than 10% missing values (e.g., items missing due to skip patterns), since clustering on missingness would inflate percent match values and near duplicate detection. Likewise, if a large proportion of respondents provide the same response value on an item for potentially legitimate reasons, then that item is a potentially invalid and biased indicator of near duplication. We can see where this takes us by expanding our analysis to other items in the data that exhibit highly skewed or clustered response distributions.
Let’s start by identifying other items in the data that also exhibit high concentration of responses in a particular response value. We will start by identifying those items in which 75% or more of respondents answer in the modal category.13
pct <- function(numer, denom, digits = 0) {
  paste0(round(((numer/denom) * 100), digits = digits), "%")
}

# use `zap_labels` to remove value labels from data before running function (was leading to an error)
all_modes <- mode_percentage(data = haven::zap_labels(sarajevo_data_pctmatchR))

sub_vars <- sarajevo_data_pctmatchR %>%
  select(-c(ID, pct_match, match_id)) %>%
  ncol()

m75 <- all_modes %>%
  filter(round(percentage_mode, digits = 0) >= 75) %>%
  nrow()

m80 <- all_modes %>%
  filter(round(percentage_mode, digits = 0) >= 80) %>%
  nrow()

m85 <- all_modes %>%
  filter(round(percentage_mode, digits = 0) >= 85) %>%
  nrow()

m90 <- all_modes %>%
  filter(round(percentage_mode, digits = 0) >= 90) %>%
  nrow()
Using our arbitrary 75% criterion, 96 items exhibit substantial clustering, meaning that 75% or more of the 822 respondents chose the same response value on each of these items. This equates to about 41% of the 234 substantive items in the analysis.13
As we would expect, increasing our criterion results in identification of fewer substantially clustered items. For example, using 80%, 85%, and 90% criteria results in 32%, 23%, and 11% of our substantive items, respectively, being identified as substantially clustered in a single response value.
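These in-text percentages are simple counts over the sub_vars denominator, presumably computed with the pct() helper defined above:

# The inline figures above via the pct() helper (96 of 234 substantive items = ~41%)
pct(m75, sub_vars) # "41%"
pct(m80, sub_vars) # "32%"
pct(m85, sub_vars) # "23%"
pct(m90, sub_vars) # "11%"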
It is likely that most of these items relate to crime and/or morality. We can check by simply filtering the data by the 75% criterion and creating a data summary for just those variables from the main data set.
m75_modes <- all_modes %>%
  filter(round(percentage_mode, digits = 0) >= 75)

m75_vars <- all_modes %>%
  filter(round(percentage_mode, digits = 0) >= 75) %>%
  pull(variable)

sarajevo_data_m75 <- sarajevo_data_pctmatchR %>%
  select(all_of(m75_vars)) %>%
  sjlabelled::get_label() %>%
  as_tibble_row() %>%
  pivot_longer(everything(), names_to = "variable", values_to = "label") %>%
  left_join(m75_modes) %>%
  select(-modes)

gt(sarajevo_data_m75) %>%
  tab_options(container.height = px(500))
variable | label | mode | percentage_mode |
---|---|---|---|
Q10_10 | A person of a different ethnicity being nice to your face, but saying bad things behind back. | 1 | 74.57421 |
Q10_11 | Engaging in a dispute with someone of a different ethnicity | 1 | 78.22384 |
Q10_12 | Being victimized by someone of a different ethnicity | 1 | 78.22384 |
Q13_21 | I would feel like I’d be losing an important part of myself if I lost a very close friend | 1 | 77.61557 |
Q13_38 | There is a great deal of corruption in Bosnia | 1 | 82.96837 |
Q14_11 | Engage in risky or dangerous sexual behavior | 1 | 81.38686 |
Q14_14 | Cheat on your spouse or significant other | 1 | 88.68613 |
Q14_17 | Accidentally leave a store with something and did not go back to pay for it | 1 | 91.84915 |
Q14_19 | Engaged in a peaceful demonstration or protest | 1 | 76.39903 |
Q14_20 | Participated in a riot or violent protest | 1 | 91.97080 |
Q15_4 | Because that is the type of person I am | 5 | 76.27737 |
Q15_5 | Because I feel like it is the right thing to do | 5 | 80.65693 |
Q16_5 | Because I feel like it is the right thing to do | 5 | 76.27737 |
Q17_4 | Because that is the type of person I am | 5 | 74.69586 |
Q17_5 | Because I feel like it is the right thing to do | 5 | 77.98054 |
Q18_4 | Because that is the type of person I am | 5 | 82.96837 |
Q18_5 | Because I feel like it is the right thing to do | 5 | 85.64477 |
Q19_1 | Q19. I would feel very guilty [a sense of remorse, personal regret, or emotional pain] if I… Broke a promise or commitment to a friend or family member | 1 | 77.61557 |
Q19_4 | Lied about something important to a person I care about | 1 | 74.69586 |
Q19_5 | Cheated on my spouse or significant other | 1 | 84.54988 |
Q19_8 | Accidentally left a store with something and did not go back to pay for it | 1 | 82.23844 |
Q19_9 | Neglected an important duty, such as picking up children from school | 1 | 79.31873 |
Q20_1 | Q20. I would feel very ashamed if my family and close friends knew that I… Broke a promise or commitment to a friend or family member | 1 | 77.00730 |
Q20_4 | Lied about something important to them | 1 | 75.79075 |
Q20_5 | Cheated on my spouse or significant other | 1 | 84.06326 |
Q20_8 | Accidentally left a store with something and did not go back to pay for it | 1 | 81.26521 |
Q20_9 | Neglected an important duty, such as picking up children from school | 1 | 79.80535 |
Q22_6 | Borrow or steal from a family member without their knowledge | 6 | 85.88808 |
Q22_7 | Steal or embezzle from your work | 6 | 87.34793 |
Q22_8 | Shoplift, steal, or defraud a stranger | 6 | 87.46959 |
Q23_5 | Threaten to fight him | 6 | 81.63017 |
Q23_6 | Approach him and then hit or shove him | 6 | 85.03650 |
Q23_7 | Threaten to use a weapon on him | 6 | 87.83455 |
Q23_8 | Get a weapon and use it on him | 6 | 88.56448 |
Q24_1 | Q24. How guilty would you feel if you contemplated or thought about… Stealing from a family member | 5 | 88.32117 |
Q24_2 | Stealing or embezzling from your work | 5 | 89.65937 |
Q24_3 | Stealing or shoplifting from a stranger | 5 | 90.02433 |
Q24_4 | Threatening someone who angers you | 5 | 80.53528 |
Q24_5 | Hitting someone who angers you | 5 | 82.48175 |
Q24_6 | Using a weapon on someone that angers you | 5 | 89.90268 |
Q24_7 | Using marijuana or other illegal drugs. | 5 | 86.86131 |
Q24_8 | Selling marijuana or other illegal drugs. | 5 | 89.05109 |
Q25_1 | Q25. How often did you do the following in the past two years? Took money or property that didn’t belong to you worth less than 5 BAM. | 1 | 84.67153 |
Q25_2 | Took money or property that didn’t belong to you worth 5 BAM or more. | 1 | 90.38929 |
Q25_3 | Offered a bribe to police or other official (money or other incentive so they would do something they were not supposed to do). | 1 | 85.27981 |
Q25_4 | Threatened to use violence on someone else. | 1 | 86.73966 |
Q25_5 | Physically harmed someone else on purpose. | 1 | 93.43066 |
Q25_6 | Used marijuana or other illegal drug. | 1 | 93.55231 |
Q25_7 | Sold marijuana or other illegal drug. | 1 | 97.08029 |
Q25_8 | Attempt to access another person’s private information, such as a bank account or computer files, without his or her knowledge or permission | 1 | 95.98540 |
Q26_1 | Q26. What are the chances that you might do the following in the future? Take money or property that didn’t belong to you worth less than 5 BAM. | 1 | 82.60341 |
Q26_2 | Take money or property that didn’t belong to you worth 5 BAM or more. | 1 | 86.00973 |
Q26_3 | Offer a bribe to police or other official (money or other incentive so they will do something they are not supposed to do). | 1 | 78.46715 |
Q26_4 | Threaten to use violence on someone else. | 1 | 85.88808 |
Q26_5 | Physically harm someone else on purpose. | 1 | 92.09246 |
Q26_6 | Use marijuana or other illegal drug. | 1 | 92.70073 |
Q26_7 | Sell marijuana or other illegal drug. | 1 | 95.49878 |
Q26_8 | Attempt to access another person’s private information, such as a bank account or computer files, without his or her knowledge or permission | 1 | 94.28224 |
Q27_1 | Q27. How rewarding (beneficial, gratifying, useful) do you think it is to… Take money or property that didn’t belong to you worth less than 5 BAM. | 1 | 83.69830 |
Q27_2 | Take money or property that didn’t belong to you worth 5 BAM or more. | 1 | 85.27981 |
Q27_3 | Offer a bribe to police or other official (money or other incentive so they will do something they are not supposed to do). | 1 | 80.29197 |
Q27_4 | Threaten to use violence on another person. | 1 | 85.64477 |
Q27_5 | Physically harm another person on purpose. | 1 | 89.53771 |
Q27_6 | Use marijuana or other illegal drug. | 1 | 88.56448 |
Q27_7 | Sell marijuana or other illegal drug. | 1 | 90.26764 |
Q27_8 | Attempt to access another person’s private information, such as a bank account or computer files, without his or her permission. | 1 | 91.60584 |
Q28_2 | Take money or property that didn’t belong to you worth 5 BAM or more. | 5 | 76.39903 |
Q28_4 | Threaten to use violence on someone else. | 5 | 75.54745 |
Q28_5 | Physically harm someone else on purpose. | 5 | 85.88808 |
Q28_6 | Use marijuana or other illegal drug. | 5 | 83.57664 |
Q28_7 | Sell marijuana or other illegal drug. | 5 | 88.32117 |
Q28_8 | Attempt to access another person’s private information, such as a bank account or computer files, without his or her permission. | 5 | 86.61800 |
Q29_1 | Q29. Would most of your close family members consider it to be morally wrong if you were to… Take money or property that didn’t belong to you worth less than 5 BAM. | 5 | 74.69586 |
Q29_2 | Take money or property that didn’t belong to you worth 5 BAM or more. | 5 | 83.45499 |
Q29_3 | Offer a bribe to police or other official (money or other incentive so they will do something they are not supposed to do). | 5 | 80.29197 |
Q29_4 | Threaten to use violence on someone else. | 5 | 85.03650 |
Q29_5 | Physically harm someone else on purpose. | 5 | 90.14599 |
Q29_6 | Use marijuana or other illegal drug. | 5 | 89.90268 |
Q29_7 | Sell marijuana or other illegal drug. | 5 | 90.99757 |
Q29_8 | Attempt to access another person’s private information, such as a bank account or computer files, without his or her knowledge or permission | 5 | 89.05109 |
Q30_2 | Take money or property that didn’t belong to you worth 5 BAM or more. | 5 | 75.42579 |
Q30_4 | Threaten to use violence on someone else. | 5 | 76.15572 |
Q30_5 | Physically harm someone else on purpose. | 5 | 82.60341 |
Q30_6 | Use marijuana or other illegal drug. | 5 | 82.84672 |
Q30_7 | Sell marijuana or other illegal drug. | 5 | 85.27981 |
Q30_8 | Attempt to access another person’s private information, such as a bank account or computer files, without his or her knowledge or permission | 5 | 85.40146 |
Q4 | Q4. What best describes the family that raised you? | 1 | 90.87591 |
Q50_1 | Q50. How morally acceptable is it to you to… Take money or property that didn’t belong to you worth less than 5 BAM. | 1 | 81.38686 |
Q50_2 | Take money or property that didn’t belong to you worth 5 BAM or more. | 1 | 87.10462 |
Q50_3 | Offer a bribe to police or other official (money or other incentive so they will do something they are not supposed to do). | 1 | 82.60341 |
Q50_4 | Threaten to use violence on someone else. | 1 | 88.19951 |
Q50_5 | Physically harm someone else on purpose. | 1 | 92.94404 |
Q50_6 | Use marijuana or other illegal drug. | 1 | 92.21411 |
Q50_7 | Sell marijuana or other illegal drug. | 1 | 95.01217 |
Q50_8 | Attempt to access another person’s private information, such as a bank account or computer files, without his or her knowledge or permission | 1 | 94.89051 |
Q6 | Q6. Have you ever cohabitated with a romantic partner (that is, lived with while not married to him/her)? | 2 | 86.49635 |
As expected, most of these questions are in reference to crime or morality topics - but not all of them. For example, the “Q10” items ask respondents how often they “stress or worry about” different things. Within this set, the items that exhibit substantial clustering (i.e., >75% in a single response value) are Q10_10 through Q10_12, which specifically ask about stress or worry due to interactions with someone of a different ethnicity.14
Using this broader set of items identified as exhibiting substantial response clustering, we can again compare the maximum percentage match patterns with those found in the entire data set and the various subsets examined above.
# Create data set with only variables where 75% or more of responses are the same answer (and ID variable)
sarajevo_data_m75vars <- sarajevo_data_noadtrim %>%
  select(all_of(c("ID", m75_vars)))
# names(sarajevo_data_m75vars) # Confirms these are the only variables in the data

# Create data set with only variables where less than 75% of responses are the same answer
sarajevo_data_nom75vars <- sarajevo_data_noadtrim %>%
  select(-all_of(m75_vars))
# names(sarajevo_data_nom75vars) # Confirms the clustered variables were removed

sarajevo_m75_pctmatch <- percentmatch(sarajevo_data_m75vars, idvar = "ID", include_na = TRUE, timing = FALSE) %>%
  rename(ID = id,
         pct_match_m75 = pm,
         match_id_m75 = id_m) %>%
  left_join(sarajevo_data_noadtrim) %>%
  relocate(pct_match_m75, match_id_m75, .after = last_col())

sarajevo_nom75_pctmatch <- percentmatch(sarajevo_data_nom75vars, idvar = "ID", include_na = TRUE, timing = FALSE) %>%
  rename(ID = id,
         pct_match_nom75 = pm,
         match_id_nom75 = id_m) %>%
  left_join(sarajevo_data_noadtrim) %>%
  relocate(pct_match_nom75, match_id_nom75, .after = last_col())
As before, let’s first examine the descriptive statistics for the maximum percent match variable in each subset.
pctmatch_m75_tab <- sarajevo_m75_pctmatch %>%
  select(pct_match_m75) %>%
  report_table() %>%
  mutate(sample = paste0("75% or more concentration", " (", ncol(sarajevo_data_m75vars) - 1, ")")) %>%
  relocate(sample)

pctmatch_nom75_tab <- sarajevo_nom75_pctmatch %>%
  select(pct_match_nom75) %>%
  report_table() %>%
  mutate(sample = paste0("Less than 75% concentration", " (", ncol(sarajevo_data_nom75vars) - 1, ")")) %>%
  relocate(sample)

pctmatch_m75comp <- bind_rows(pctmatch_tab, pctmatch_nocrm_tab, pctmatch_crm_tab, pctmatch_nom75_tab, pctmatch_m75_tab)

gt(data = pctmatch_m75comp) %>%
  cols_hide(columns = c(n_Obs, Variable, MAD, percentage_Missing)) %>%
  fmt_number(
    columns = c(Mean, SD, Median, Min, Max, Skewness, Kurtosis),
    decimals = 2,
    use_seps = FALSE) %>%
  cols_label(
    sample = "Sample (# of variables)")
Sample (# of variables) | Mean | SD | Median | Min | Max | Skewness | Kurtosis |
---|---|---|---|---|---|---|---|
All Items (237) | 0.69 | 0.12 | 0.72 | 0.26 | 0.88 | −1.05 | 1.11 |
No Crime/Morality Items (213) | 0.66 | 0.12 | 0.69 | 0.25 | 0.88 | −0.87 | 0.56 |
Only Crime/Morality Items (24) | 0.95 | 0.11 | 1.00 | 0.25 | 1.00 | −3.25 | 11.95 |
Less than 75% concentration (141) | 0.57 | 0.11 | 0.58 | 0.27 | 0.82 | −0.27 | −0.43 |
75% or more concentration (96) | 0.90 | 0.13 | 0.95 | 0.31 | 1.00 | −1.99 | 4.24 |
As we saw with the “no crime and morality items” subset, the items with less than 75% concentration in one response value have much lower maximum percent match values than the full sample and than the clustered item subsets. Now, let’s plot the distributions.
pctmatch_comp_data <- pctmatch_comp_data %>%
  left_join(sarajevo_m75_pctmatch) %>%
  select(ID, starts_with("pct")) %>%
  left_join(sarajevo_nom75_pctmatch) %>%
  select(ID, starts_with("pct")) %>%
  mutate(pctmatch_m75dif = pct_match - pct_match_m75,
         pctmatch_nom75dif = pct_match - pct_match_nom75) %>%
  relocate(c(pct_match_m75, pct_match_nom75), .before = pctmatch_nocrmdif)

pctmatch_m75_dist <- ggplot(data = sarajevo_m75_pctmatch, aes(x = pct_match_m75)) +
  geom_histogram(aes(y = ..density..), bins = 30,
                 color = "black", fill = "#3B528BFF") +
  scale_x_continuous(limits = c(.2, .9), breaks = seq(.2, .9, .1)) +
  labs(x = "75% or more concentration") +
  theme_minimal() +
  theme(panel.grid = element_blank())

pctmatch_nom75_dist <- ggplot(data = sarajevo_nom75_pctmatch, aes(x = pct_match_nom75)) +
  geom_histogram(aes(y = ..density..), bins = 30,
                 color = "black", fill = "#5DC863FF") +
  scale_x_continuous(limits = c(.2, .9), breaks = seq(.2, .9, .1)) +
  labs(x = "Less than 75% concentration") +
  theme_minimal() +
  theme(panel.grid = element_blank())

pctmatch_dist + pctmatch_nocrm_dist + pctmatch_crm_dist + pctmatch_nom75_dist + pctmatch_m75_dist +
  plot_annotation(
    title = 'Distribution of Maximum Percent Match by Variables Included')
The plot above shows the distributions of the maximum percent match score for different subsets of items. The distributions for the “No Crime/Morality” and “Less than 75% concentration” subsets have shifted towards lower maximum percent match values (i.e., to the left) compared to the full data with all items, which is what we had expected. Interestingly, removal of the crime/morality items and the other items with 75% or more response concentration - that is, removal of items with high degrees of item response clustering - results in a distribution that more closely resembles the Gumbel extreme value distribution that KR suggested we should expect in the absence of fraudulent near duplication.
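We did not formally test that resemblance here, but as a rough visual check one could fit a Gumbel distribution to the trimmed metric with the extRemes package (loaded at the top of this post) and inspect a QQ plot against the fit - a sketch, not a formal test of KR’s claim:

# Sketch: fit a Gumbel distribution to the trimmed maximum percent match values and
# inspect a QQ plot against the fitted distribution
gumbel_fit <- fevd(sarajevo_nom75_pctmatch$pct_match_nom75, type = "Gumbel")
summary(gumbel_fit)
plot(gumbel_fit, type = "qq")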
gt85nom75 <- sarajevo_nom75_pctmatch %>% filter(round(pct_match_nom75, digits = 2) >= .85) %>% nrow()
gt80nom75 <- sarajevo_nom75_pctmatch %>% filter(round(pct_match_nom75, digits = 2) >= .80) %>% nrow()

gt85m75 <- sarajevo_m75_pctmatch %>% filter(round(pct_match_m75, digits = 2) >= .85) %>% nrow()
gt80m75 <- sarajevo_m75_pctmatch %>% filter(round(pct_match_m75, digits = 2) >= .80) %>% nrow()

neardup_obs2 <- neardup_obs %>%
  add_row(
    Data = c("Less than 75% concentration (141)", "75% or more concentration (96)"),
    gt85 = c(gt85nom75, gt85m75),
    flag85 = c((gt85nom75/ntot)*100, (gt85m75/ntot)*100),
    gt80 = c(gt80nom75, gt80m75),
    flag80 = c((gt80nom75/ntot)*100, (gt80m75/ntot)*100)
  )
neardup_obs2 %>%
  relocate(c(gt80, flag80), .before = gt85) %>%
  gt() %>%
  fmt_number(
    columns = c(flag85, flag80),
    decimals = 1
  ) %>%
  cols_label(
    Data = "Sample (# of variables)",
    gt85 = "n",
    gt80 = "n",
    flag85 = "%",
    flag80 = "%",
    .fn = md) %>%
  tab_spanner(label = ">85% match", columns = c(gt85, flag85)) %>%
  tab_spanner(label = ">80% match", columns = c(gt80, flag80)) %>%
  tab_header(title = md("**Number of observations flagged as near duplicates (N=822)**"))
**Number of observations flagged as near duplicates (N=822)**

Sample (# of variables) | >80% match (n) | >80% match (%) | >85% match (n) | >85% match (%) |
---|---|---|---|---|
All items (237) | 128 | 15.6 | 25 | 3.0 |
No crime/morality items (213) | 68 | 8.3 | 10 | 1.2 |
Only crime/morality items (24) | 755 | 91.8 | 731 | 88.9 |
Less than 75% concentration (141) | 6 | 0.7 | 0 | 0.0 |
75% or more concentration (96) | 677 | 82.4 | 615 | 74.8 |
sarajevo_data_stratmatch <- sarajevo_data_stratmatch %>%
  full_join(sarajevo_nocrm_pctmatch) %>%
  full_join(sarajevo_crm_pctmatch) %>%
  full_join(sarajevo_nom75_pctmatch) %>%
  full_join(sarajevo_m75_pctmatch)

save(sarajevo_data_stratmatch, file = here("Data", "sarajevo_data_stratmatch.RData"))
# save.image(file = here("Data", "duplicates.RData"))
As we saw previously with the crime and morality analysis, highly skewed or clustered response distributions pose serious problems for forensic detection of fraud via near duplication. Let’s start with those 96 highly skewed items that exhibited insufficient variation due to substantial response clustering (i.e., “75% or more concentration”). Using the maximum percent match metric and KR’s 85% match criterion for identifying “likely fraud” cases, we would flag a whopping 615 observations, or 74.8% of the sample, as above the 85% KR threshold. Using the more liberal criterion uncovers even more near duplicates, with 677 (82.4%) observations identified above the 80% threshold.
Meanwhile, removing these highly skewed items with 75% or more concentration in a single response value results in a drastic reduction of the number of observations identified as “likely fraud” due to near duplication. In fact, among the 141 items with sufficient variability (i.e., “less than 75% concentration”), not a single observation (0) was identified as potentially problematic using KR’s 85% “likely fraud” threshold, and only 6 were flagged using the more liberal 80% threshold.
Now that we know how common characteristics of criminological data inflate maximum percent match metrics, let’s revisit the elusive question of whether Interviewer #64’s data are “likely falsified” as potentially indicated by their interviewer-specific percent match distribution. Below we have reproduced two of the figures presented above using only the items identified as having sufficient variability for meaningful near duplication analysis, which we defined here as the 141 variables exhibiting less than 75% case-level clustering in one response value.
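The reproduced figures come from slide code not shown in full here; a minimal sketch of the second one (the interviewer-colored jitter plot, now drawn from the pct_match_nom75 variable merged into sarajevo_data_stratmatch above) would be:

# Sketch: the interviewer-by-municipality jitter plot, re-drawn using the maximum
# percent match metric computed from the 141 sufficiently variable items
ggplot(data = sarajevo_data_stratmatch) +
  geom_jitter(aes(x = pct_match_nom75, y = muni_label,
                  color = reorder(int_label, int_count, decreasing = TRUE)), alpha = .5,
              position = position_jitter(width = 0, height = .35, seed = 7734)) +
  geom_vline(aes(xintercept = 0.80), color = "grey", linetype = 2, linewidth = 1) +
  geom_vline(aes(xintercept = 0.85), color = "black", linetype = 2, linewidth = 1) +
  scale_y_discrete(limits = rev) +
  scale_color_manual(values = colorpal) +
  labs(title = "Maximum Percent Match (Trimmed Items) by Municipality & Interviewer",
       x = "Maximum Percent Match (items with less than 75% concentration)",
       y = "Municipality", color = "Interviewer") +
  theme_minimal() +
  theme(panel.grid = element_blank())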
Recall, when limited to these 141 items, none of our cases had percent match values exceeding KR’s 85% threshold, which is apparent again in these figures. Additionally, these figures show that Interviewer #64 has higher overall percent match values than the other interviewers in their respective municipality. However, in this analysis, their interviewer-specific percent match distribution is more in line with those of other interviewers. It seems plausible, then, that Interviewer #64 simply was assigned to relatively homogeneous areas and, as such, had a greater number of respondents reporting in those clustered response categories. Nonetheless, we are still likely to flag this interviewer’s data for sensitivity analyses in future research because, well, we told you that we have trust issues.
Well, that was a long and somewhat repetitive - or dare we say duplicative - journey. At the end of it, we now have the skills to identify exact and near duplicates in our data. We also have developed a sense of what types of additional analyses might be useful when initial analyses identify some near duplicate cases (e.g., maximum percent match by interviewer and across research design strata), and we have identified some potential sources of bias to watch out and account for (e.g., inflated estimates of near duplicates due to items with skewed or clustered response distributions). We intend to add these forensic tasks early in our workflow when collecting and analyzing new data sources, and we hope you will consider doing so too.
Initially, those near duplicate cases we identified as above KR’s 85% threshold gave us some pause. If you find some in your data, they should give you pause too. However, we hope you now have a better understanding of the maximum percent match metric and its uses as well as its limitations. Additionally, we urge you to remember our warnings about using KR’s arbitrary 85% threshold to make strong determinations about the presence of “likely fraud” in data. Like any heuristic, it is bound to fail in certain situations. For additional examples of such situations, we urge you to check out Simmons and colleagues’ response paper. We pulled the following two quotes from their paper to reinforce our warning here:
“[KR’s 85%] threshold based on simulations with synthetic data is not relevant for what we should see in real-world data.” (p.332)
“…it is possible to obtain any value of the maximum percent match statistic in non-falsified data, depending on the survey parameter. Thus, setting a threshold for the statistic and applying it uniformly across surveys is a flawed approach for detecting falsification. In fact, eliminating respondents from a dataset based on this measure may introduce selection bias into survey data and serve to reduce data quality, rather than improve it.” (p.335)
In criminology, we often collect survey data containing a substantial number of items with highly skewed and clustered response distributions (e.g., zero-inflated crime and morality items). This research context appears to be one in which such heuristics might frequently fail to generate reliable inferences - perhaps more often than not. In our view, this should not discourage criminologists from using these tools to better understand our data. Rather, we encourage learning and using these tools and their corresponding decision criteria as they were likely intended - as methods and heuristics that help us summarize, and sometimes raise questions worth answering about, our data.
Generally, when we detect a near duplicate - or any “significant” statistical value - we see this as the starting point rather than the end of a journey towards asking and attempting to answer important questions about our data. Then, as we learn more about our methods and data, we might find ourselves modifying heuristics, or even ditching them altogether, if they are not particularly helpful in our specific case. Remember, binary heuristics like statistical thresholds help us make important decisions. Yet those decisions, like the data upon which they are based, often are laden with non-negligible degrees of uncertainty, and the transparent communication and visualization of that uncertainty also are essential for good decision-making.
For instance, after searching for near duplicates in our BiH data, we feel much more confident moving forward with analyses under the assumption that fraud via near duplication is not likely to be much, if any, of a concern in these data.15 Despite our and others’ criticisms, KR’s thresholds were helpful to us in this process. They motivated us to ask important questions about our data, then to try answering them with our data, and ultimately to learn a great deal more about our data. Yet, despite our increased confidence - which stems from knowledge and a corresponding reduction in uncertainty about our data - we still do not have a definitive answer about the presence or absence of fraud via near (or other) duplication in our data, and we never will. Much of the remaining uncertainty stems from research design features that we can no longer do anything about, such as the fact that others collected our data for us and we were unable to directly supervise the process. Which leads us to our final concluding point: There is no simple heuristic substitute for the careful design, collection, description, modeling, and dissemination of one’s data.
Funding for data collection came from three sources: (1) Collaborative Research Award, Institute for Advanced Study, Indiana University; (2) Faculty Research Grant, Missouri State University; (3) Faculty Research Initiative, College of Arts and Sciences, University of North Carolina-Wilmington.↩︎
Interestingly, this type of exact duplication apparently was involved in criminology’s “Stewart Retractions” and, in fact, was not caught (at least not publicly) for a long time by Stewart’s co-authors - nor by anyone else in the field of criminology, for that matter. Of course, without access to the data, even exact duplication would be difficult to identify. Speaking of which, maybe social scientists should be required to sign a pledge saying that they have seen the data and checked for obvious indications of fraud such as exact duplication before publishing future papers. After all, signing honesty pledges before filing things like insurance claims has been shown to… Oh, wait, well damn. Never mind. Let’s just keep doing what we’ve been doing.↩︎
The only time I use Stata these days is when working with Master’s students. I may be the only faculty member in my department who regularly uses R, so students often are taught data analysis with Stata or SPSS.↩︎
Long story short: my function generally produced similar results, with a couple of exceptions. First, there appeared to be a small rounding discrepancy at the seventh decimal place in the percent match metric from my R function compared to KR’s Stata function. Second, there were some differences in the observation identified as the closest match for the focal observation. I assume this difference results from how the two functions handle cases where more than one observation shares the maximum percent match score with the focal observation. A toy illustration of the tie issue (with hypothetical values):
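pm <- c(0.80, 0.85, 0.85)  # hypothetical percent match values for 3 candidates
which.max(pm)              # returns 2: only the first of the tied maxima
which(pm == max(pm))       # returns 2 3: all tied candidates
↩︎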
KR note that demographic variables are used in sampling designs and should approximate population parameters, and thus are less apt to be erroneously duplicated - see footnote 9 on p.286 in KR’s article.↩︎
I actually asked ChatGPT to change the code above to calculate row means instead of column means. Although it initially gave me code to calculate column means, with simple prodding (I had to tell it to use tidyverse map and to give me row means instead of column means) it produced the following code: sarajevo_rowmiss <- map_dbl(seq_len(nrow(sarajevo_data_noad)), ~mean(is.na(sarajevo_data_noad[.x,])))
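For what it’s worth, a simpler base R equivalent (using the same data frame) would be: sarajevo_rowmiss <- rowMeans(is.na(sarajevo_data_noad))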
↩︎
On pg. 286 of their article, they mention that the 10% cutoff for questions was to avoid variables that are part of skip patterns, and that the 25% cutoff for observations was to “minimize the risk of having break offs skew the analysis and overstate the level of similarity between observations.”↩︎
The Gumbel distribution appears to be a common distribution in fields where predicting maximum values is important. To quote directly from Wikipedia: “This distribution might be used to represent the distribution of the maximum level of a river in a particular year if there was a list of maximum values for the past ten years. It is useful in predicting the chance that an extreme earthquake, flood or other natural disaster will occur.” As a minimal sketch, a Gumbel distribution can be fit with the extRemes package loaded above, where max_match is a placeholder for a vector of maximum percent match values:
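fit_gumbel <- fevd(max_match, type = "Gumbel")  # fit via maximum likelihood
summary(fit_gumbel)                             # parameter estimates and SEs
plot(fit_gumbel)                                # diagnostics, incl. QQ plot
↩︎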
We also relied on a tutorial on extreme value distributions by Naresh Devineni.↩︎
Within each neighborhood cluster, interviewers were sent to a randomly selected address as a starting point. From there, households were selected using a random walk technique and the right-hand principle, with contact attempted at every third household. Within each contacted household, the last birthday method was used to select one respondent from among all permanent household members older than 18.↩︎
We might even expect this zero-inflation to be more pronounced in our surveys given the cultural context. For instance, the populations of Bosnia and Herzegovina, and of Sarajevo in particular, are predominantly Muslim, and Islamic contexts tend to have lower crime rates. Additionally, our sample comprises adult respondents, and adults tend to self-report substantially lower levels of engagement in criminal behavior than teenagers and young adults.↩︎
For respondents who could not read, the protocols called for the interviewers to read the questions but have respondents answer them privately and seal their answers in an envelope.↩︎
Why 75%? Well, why not? We chose it because 75% is a round number that is close to the 78% minimum modal concentration identified in the crime and morality items, which were items that we expected a priori to be zero-inflated and potentially biasing to near duplicate detection. Hey, we are making this up as we go! If it feels arbitrary, that is because it is. We will return to the issue of using arbitrary thresholds and heuristics to make decisions later in the post.↩︎
The lack of variation on these sensitive questions could reflect a variety of causes, such as potentially high degrees of intraethnic homophily in interactions, low levels of perceived overt ethnic bias, or strong social desirability reporting biases on these items in the BiH sample.↩︎
Of course, this does not rule out other sophisticated forms of fraud, such as simulation-based data fabrication by a researcher or survey research organization. We told you that we had trust issues…↩︎