Trust Issues: Examining Near Duplicates in Survey Data

general
rstats
survey research
duplication
fraud
Do you know how to detect exact or near duplicate rows in your data? Read on to learn more!
Authors

Jake Day

Jon Brauer

Maja Kotlaja

Published

October 10, 2023

Near duplicates in survey data: Like “Multiplicity” but without the humor

Have you been duped by data duplicates?

Back in 2016, Andrew Gelman blogged about a study by Kuriakose and Robbins (2016) that investigated “near duplication” of survey responses across 1,000+ public opinion surveys as a way to discover potential data fraud. Around that time, we (Jake, Jon, and Maja Kotlaja) were working with a professional international research organization to collect household interview data in Belgrade, Serbia. Reading the post and study immediately struck fear in us. How would we know if the data we collected were legitimate? Jon had contracted with professional research organizations to collect survey data in the past and, as he had done before, we did our due diligence in identifying potential contractors and generally making sure they were above board. For instance, we made sure to get multiple quotes, checked each organization’s references and body of work, and had detailed exchanges about sampling and interview methods before finally selecting and entering into a contractual agreement with an organization (MASMI-Belgrade in Serbia). Meanwhile, the survey organization itself had also instilled confidence in us by doing their own data quality checks, and they had even thrown out one interviewer’s responses due to data quality concerns.

At some point, we had to assume that we were doing all that one reasonably could be expected to do - that is, aside from traveling to Serbia to directly oversee the survey organization and each interviewer ourselves. Like so many other researchers in similar situations domestically and internationally, eventually we had to trust that we had hired an experienced, professional organization that knew what it was doing and would carry out the work professionally. Yet, here’s the thing: I (Jake) generally am not a very trusting person. Maybe it is more accurate to say that I am a “trust, but verify” kind of person. As this was my first time collecting international survey data, I had also heard Jon’s horror stories of the extra (and perhaps ethnocentric) scrutiny that reviewers often apply to manuscripts containing analyses of international survey data. Also, though fraud can happen anywhere, the Kuriakose and Robbins (2016) study we mentioned above actually found that near duplicate survey observations were more common in data from non-OECD country samples compared to the OECD samples they examined (Serbia is a non-OECD country). So, for our own peace of mind and to satisfy potentially skeptical reviewers, we knew that someday we should learn how to detect instances of near duplication in our data.

Since we completed data collection in Serbia, a global pandemic has occurred and, concurrently with the pandemic, we also contracted with another international organization to collect similar survey data in Bosnia and Herzegovina. So, this problem has been at the back of our minds for a while, and we are finally ready to buckle down and investigate it with our own international survey data. In this post, we test the near duplication waters by focusing first on our most recently collected dataset from Sarajevo, Bosnia and Herzegovina.1 Afterwards, we plan to conduct similar investigations of data we collected in other international locations (Dhaka, Bangladesh; Belgrade, Serbia). As you will see, this is a lot of work for ultimately what will be a footnote in future papers generated using these data. Still, it is important work, and it is exactly the type of thing that we envisioned sharing on our blog when we launched it. Who knows - maybe our foray into duplication will help some of you dig a little deeper into your data and construct your own esoteric footnotes someday. You’re welcome?

Code
library(haven)
library(here)
library(janitor)
library(Statamarkdown)
library(easystats)
library(gt)
library(ggtext)
library(patchwork)
library(pals)
library(extRemes)
library(qqplotr)
library(tidyverse)

Exact duplicates: When working with data, no one likes a copycat (well, except maybe fraudsters).

Exact Duplications

The first thing most survey organizations likely do (and most researchers should do) is check for exact duplicate entries in the raw data file. The janitor package allows one to quickly identify exact duplicate entries.

Code
load(here("Data", "sarajevo_data.RData"))

#look for exact duplication after removing administrative variables:
sarajevo_data %>%
  get_dupes(-c(ID, starttime, Interviewer, Region, Municipality, Cluster, Settlement, weight))
No duplicate combinations found of: Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8a, Q8b, ... and 244 other variables

Using the janitor package, we found no exact duplicates across the whole range of survey questions (minus the administrative variables) in the Sarajevo, Bosnia and Herzegovina (hereafter BiH) data. This is unsurprising given the thoroughness and professionalism we saw from our contracted survey organization, Custom Concepts. So we’re off the hook, right? Footnote complete!? Maybe, but duplicating entries exactly would be a pretty brazen attempt at falsification, one that would likely be caught by a professional survey organization before the data even got to us.2

Nuanced or Near Duplication

Kuriakose and Robbins (2016) (“KR” from here) developed a more nuanced method for detecting likely duplicates by focusing on “near duplicates.” The basic logic is that a dishonest interviewer or survey firm could save time and money by duplicating responses from valid interviews, and that a sophisticated interviewer or firm engaging in fraud is unlikely to simply duplicate the results exactly because exact duplication is relatively easy to check and identify (see above). Rather, duplicating a valid response and then changing the identifier as well as the values of one or two (or more) data columns would avoid the crude duplicate detection methods employed above.
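To make the problem concrete, here is a tiny toy example (entirely hypothetical data, not from our survey) of how a near duplicate slips past an exact-duplicate check:

Code
#toy data: rows 1 and 2 agree on 3 of 4 substantive items (a 75% match)
toy <- tibble(id = 1:3,
              q1 = c(1, 1, 2),
              q2 = c(3, 3, 1),
              q3 = c(2, 2, 2),
              q4 = c(4, 5, 1))

#rows 1 and 2 differ on q4, so an exact duplicate check comes up empty:
toy %>%
  get_dupes(-id)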

KR’s process involves pairwise comparisons of every observation in the data set on the substantive variables (i.e., excluding administrative variables like the unique ID) to detect the “maximum percentage of variables that an observation shares with any other observation in the data” (p.284). Based on their simulations, KR (2016) determined that a maximum percent match of 85% or more is worthy of further scrutiny.
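In code, the core of KR’s metric boils down to brute-force pairwise comparison. Here is a minimal, unoptimized sketch of that logic. To be clear, this is illustrative only - it is neither KR’s Stata function nor the percentmatch() R function we actually use below, and its handling of missing values is simplified:

Code
#minimal sketch of the maximum proportion match logic (O(n^2); slow for large n)
max_prop_match <- function(df) {
  m <- as.matrix(df) #substantive variables only (no ID or administrative columns)
  n <- nrow(m)
  pm <- numeric(n)   #maximum proportion match for each row
  id_m <- integer(n) #row index of the closest-matching other row
  for (i in seq_len(n)) {
    best <- 0; best_j <- NA_integer_
    for (j in seq_len(n)[-i]) {
      #proportion of identical values among pairwise non-missing items:
      p <- mean(m[i, ] == m[j, ], na.rm = TRUE)
      if (!is.na(p) && p > best) { best <- p; best_j <- j }
    }
    pm[i] <- best; id_m[i] <- best_j
  }
  tibble(row = seq_len(n), pm = pm, match_row = id_m)
}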

Near duplicates: Sometimes cases are more similar than they appear.

As we will discuss in more detail later, KR’s identification of an 85% maximum percent match threshold for detecting fraud is problematic because it rests on a lot of untenable assumptions about the underlying data generating processes, including strong assumptions about study characteristics (e.g., survey item response values, distributions, and inter-item correlations) and about the methods or decisions of fraudsters when duplicating data. Below, we will use 80% and 85% maximum percent match thresholds as heuristics to aid in decision-making about potential fraud concerns in our data. However, we feel compelled to strongly caution readers NOT to use these thresholds like “significance levels” to forensically test for the presence or absence of fraud via near duplication. Doing so will inevitably result in undesirably high false positive and false negative rates. With that said, we still think KR’s process of examining percent match distributions is useful as one among many forensic tools that researchers should consider using to better understand their data and to check for obvious or apparent anomalies that beg for additional exploration.

Detecting Near Duplicates: Real World Application in R

I (Jake) initially started by using KR’s function directly in Stata. This meant I had to fire up my Stata version 14 that I probably haven’t opened in at least a couple of years.3 To conduct these analyses in R, I iterated and revised an existing R function meant to mimic KR’s Stata function, which we use below. Given my lack of real programming skills, it was quite the challenge for me. I will try to document its development in a future companion blog to this one tentatively titled: “Stumbling in the Dark: Building/Iterating an R Function to Match Stata’s percentmatch.” In that companion blog, I will walk through the development of the function and also examine whether it produces results consistent with KR’s Stata function applied to the same data.4

Regardless of whether one opts for Stata or R, some basic data pre-processing steps are required before examining near duplication. As usual, we will happily do this within the friendly confines of R and the tidyverse.

As with any data analysis, detecting potentially problematic near duplicates requires some data pre-processing or cleaning.

Data pre-processing

The first thing we need to do is remove all of the administrative variables and demographic variables (e.g., sex, age, ethnicity, and religion) so our data include only the substantive variables.5 As before with the exact duplication check using janitor, we will use a copy of the raw BiH data collected in 2021. Our thinking is that this should be among the first checks in one’s workflow before cleaning and analyzing data.

Code
vars_mis99 <-  c("Q7", "Q8a", "Q8b")

sarajevo_data_noad <- sarajevo_data %>%
  select(-c(starttime, Interviewer, Region, Municipality, Cluster, Settlement, Q1, Q2, Q44, Q45, weight)) %>%
  mutate(
    #recode missing value codes to NA (9 for most items; 99 for the items in vars_mis99):
    across(-any_of(c(vars_mis99, "Q45_oth", "ID")), ~ na_if(.x, 9)),
    across(all_of(vars_mis99), ~ na_if(.x, 99)))

We now have the raw BiH data with consistent NA values across all of the variables and none of the administrative variables that will artificially increase (or decrease) duplication. The next step is to remove cases where missing values exceed the thresholds set by KR. Specifically, they removed any variable with 10% or more missing values and any observation with 25% or more missing values.

Code
sarajevo_colmiss <- map(sarajevo_data_noad, ~ mean(is.na(.))) %>%
  as_tibble() %>%
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "pct_colmiss") %>%
  filter(pct_colmiss >= .07) %>% #display variables at/above 7% to show those near KR's 10% cutoff
  arrange(desc(pct_colmiss))

sarajevo_colmiss %>%
  gt() %>%
   cols_label(
    variable = "Variable",
    pct_colmiss = "% Missing") %>%
   fmt_percent(
    columns = pct_colmiss,
    decimals = 1) %>%
  tab_header(
    title = md("**Variables by % Missing**")) %>%
  tab_options(container.height = px(500))
Variables by % Missing
Variable % Missing
Q45_oth 98.9%
Q8b 45.2%
Q31_2 29.9%
Q31_8 29.8%
Q31_1 29.7%
Q31_3 28.7%
Q31_5 27.6%
Q31_4 26.9%
Q31_7 26.9%
Q31_6 26.4%
Q32_3 11.4%
Q14_18 10.3%
Q13_37 9.0%
Q11_8 8.5%
Q11_4 7.0%

As you can see above, there are some variables with a substantial amount of missing data. The variables with the most missing data, “Q45_oth” and “Q8b,” are related to skip patterns (Q8b asks about the respondent’s number of children and was only asked of those who identified as a parent; Q45_oth contains the specific answers given by respondents who identified as “other” ethnicity). After those, the substantive questions with the most missing data are Q31_1 - Q31_8. These were questions about respondents’ perceptions of crime and deviance in their neighborhood: “How many people in your neighborhood do these things at least once a year?” (e.g., “Take money or property that didn’t belong to you worth less than 5 BAM.”).

There are two other variables above the 10% missing threshold: Q32_3 (11.4%) and Q14_18 (10.3%). Q32_3 was part of a set of questions that asked: “Imagine that you did something that most of your family, friends, and other people in your neighborhood would disapprove of. Generally, how likely is it that these people would forgive you?” The specific item Q32_3 asked about the perceived likelihood that “Most of the other people in your neighborhood would forgive you.” Given this item’s missingness is an outlier among the set and considering the missing patterns in the Q31 set, there may be something about the neighborhood questions that specifically caused missing data (e.g., respondents’ greater inability or unwillingness to accurately report on crime-related questions about neighbors). Meanwhile, Q14_18 was part of a set of questions that asked “How often did you do the following in the past two years?” with the specific item asking how often the respondent “neglected an important duty.”

Given this pattern of missing data, we decided to stick with KR’s 10% criterion, though perhaps leaving in Q32_3 and Q14_18 is defensible as well since they have substantially fewer missing observations than the Q31 set of questions. On the other hand, Q45_oth and Q8b should clearly be removed, as they result from skip patterns, which KR specifically mention as a rationale for their 10% criterion (p.286).

The next step is to identify observations with more than 25% missing data. Before doing this, we removed Q45_oth and Q8b so as not to include them in the calculation. This turned out to be more difficult than expected and we actually turned to ChatGPT for help.6 We ended up using a simple map_dbl() function from the “purrr” package in the tidyverse.

Code
#Remove skip pattern questions:
sarajevo_data_noadskip <-  sarajevo_data_noad %>%
  select(-Q45_oth, -Q8b)

sarajevo_rowmiss <- tibble(
  pct_rowmiss = map_dbl(seq_len(nrow(sarajevo_data_noadskip)),
                        ~ mean(is.na(sarajevo_data_noadskip[.x, ])))) %>%
  mutate(ID = row_number()) %>% #row number equals ID in these data, but this is not a robust general solution
  relocate(ID)

#Print Results
sarajevo_rowmiss %>%
  arrange(desc(pct_rowmiss)) %>%
  filter(pct_rowmiss >= .25) %>%
  gt() %>%
   cols_label(
    pct_rowmiss = "% Missing") %>%
   fmt_percent(
    columns = pct_rowmiss,
    decimals = 1) %>%
  tab_header(
    title = md("**Observations by % Missing**"))
Observations by % Missing
ID % Missing
149 29.4%
31 28.2%
262 27.4%
345 27.0%

There are four cases that are missing values on more than 25% of the items (ID = 149, 31, 262, 345). Following KR, we will remove these from the data before checking duplicates. To summarize, we will remove the following survey questions: Q45_oth, Q8b, Q31_2, Q31_8, Q31_1, Q31_3, Q31_5, Q31_4, Q31_7, Q31_6, Q32_3, Q14_18 and the following observations: ID = 149, 31, 262, 345.

We are not sure exactly how KR came up with the 10% and 25% criteria for variables and observations respectively. The rationale for these cut-offs is not fully described in their article.7 Generally, we are suspicious of rules of thumb like these and, when encountering arbitrary thresholds like we do here, we prefer to assess how sensitive results are to alternative decision criteria. For now, we will plug our noses and move forward with these cutoffs, as this blog entry already bites off more than most readers are probably willing to chew.
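That said, for readers who do want to probe that sensitivity someday, a quick sketch like the following - using the missingness summaries computed above and a purely hypothetical grid of alternative cutoffs - would show how many variables and observations each combination of criteria would drop. A fuller check would then re-run the percent match calculations under each combination.

Code
#sketch: how many variables/observations would alternative cutoffs drop?
col_miss <- map_dbl(sarajevo_data_noadskip, ~ mean(is.na(.x)))
row_miss <- map_dbl(seq_len(nrow(sarajevo_data_noadskip)),
                    ~ mean(is.na(sarajevo_data_noadskip[.x, ])))

expand_grid(col_cut = c(.05, .10, .15),
            row_cut = c(.20, .25, .30)) %>%
  mutate(n_vars_dropped = map_dbl(col_cut, ~ sum(col_miss >= .x)),
         n_obs_dropped  = map_dbl(row_cut, ~ sum(row_miss >= .x)))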

Code
colmiss <- sarajevo_colmiss %>% filter(pct_colmiss >= .10) %>% pull(variable) #vars at/above 10% missing
rowmiss <- sarajevo_rowmiss %>% filter(pct_rowmiss >= .25) %>% pull(ID) #obs at/above 25% missing

sarajevo_data_noadtrim <- sarajevo_data_noad %>%
  select(-all_of(colmiss)) %>%
  filter(!ID %in% rowmiss)

After pre-processing, you need to identify and remove potential near duplicates that are due to missingness (e.g., item skip) patterns.

Calculating maximum percent match

With our pre-processing complete, we can now run our trimmed data through our function to produce the maximum percent (technically, maximum proportion) match metric for each observation (along with the ID of the observation with the closest match) and append it to our trimmed data. From there, we can identify those observations with greater than 85% maximum percent match, or what KR would consider a “near duplicate” of another observation and thus a potential falsification. In fact, they use the term “likely falsified” for rows exhibiting an 85% or higher maximum match, though, as we noted earlier and will discuss later, we strongly caution against using this (or any) simple threshold criterion to make forensic determinations about the likelihood of falsification or fraud.

Code
source("Day_percentmatch-function.R")

sarajevo_data_pctmatchR <- percentmatch(sarajevo_data_noadtrim, idvar = "ID", include_na = TRUE, timing = FALSE) %>%
  rename(ID = id,
         pct_match = pm,
         match_id = id_m) %>%
  left_join(sarajevo_data_noadtrim, by = "ID") %>%
  relocate(pct_match, match_id, .after = last_col())

sarajevo_data_pctmatchR85 <- sarajevo_data_pctmatchR %>%
  zap_label() %>%
  arrange(desc(pct_match)) %>%
  filter(round(pct_match, digits = 2) >= .85) %>%
  select(ID, pct_match, match_id)
#gt(sarajevo_data_pctmatchR85)

#80% cutoff
sarajevo_data_pctmatchR80 <- sarajevo_data_pctmatchR %>%
  zap_label() %>%
  arrange(desc(pct_match)) %>%
  filter(round(pct_match, digits = 2) >= .80) %>%
  select(ID, pct_match, match_id)

sarajevo_data_pctmatch_cut <- left_join(sarajevo_data_pctmatchR85, sarajevo_data_pctmatchR80)

gt(sarajevo_data_pctmatch_cut) %>%
 cols_label(
  pct_match = "Prop. Match",
  match_id = "Matching ID") %>%
 fmt_number(
  columns = pct_match,
  decimals = 3) %>%
  tab_header(
  title = md('**Observations with "Near Duplicates"**')) %>%
  tab_options(container.height = px(500))
Observations with “Near Duplicates”
ID Prop. Match Matching ID
336 0.882 681
681 0.882 336
29 0.878 552
169 0.878 464
464 0.878 169
552 0.878 29
141 0.869 421
421 0.869 141
87 0.861 105
105 0.861 87
51 0.852 752
137 0.852 326
182 0.852 800
326 0.852 137
353 0.852 466
406 0.852 552
466 0.852 353
477 0.852 495
495 0.852 477
752 0.852 51
760 0.852 477
800 0.852 182
83 0.848 154
154 0.848 83
493 0.848 552

As you can see above, when rounded to the second decimal, we have 25 observations that have at least 85% of their data for the substantive variables matching another observation in the data. According to their tutorial, KR would have interpreted this as indicating that 3.04% of the observations in our data are “likely falsified.” Again, we think this interpretation is premature (at best) and problematic, for reasons that should become clear later.

For now, note that this observed “near duplicate” percent is below the 5% and 10% benchmarks used by KR to indicate levels of near duplication that would bias results. If we relax the level of concern for near duplication to 80%, we would have 128 observations falling within that range (see below). So, while only about 3% meet KR criteria for near duplication, about 16% fall within a more relaxed 80% match criterion.

Although we have no principled reason to look at 80% as opposed to any other number, the 3% of our sample that falls clearly within KR’s criteria are at least worthy of further investigation.
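For transparency, here is a quick tally at both heuristic cutoffs, computed directly from the objects created above:

Code
#counts and proportions at the 85% and 80% heuristic cutoffs:
sarajevo_data_pctmatchR %>%
  summarize(n_gte85 = sum(round(pct_match, digits = 2) >= .85),
            prop_gte85 = mean(round(pct_match, digits = 2) >= .85),
            n_gte80 = sum(round(pct_match, digits = 2) >= .80),
            prop_gte80 = mean(round(pct_match, digits = 2) >= .80))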

In addition to looking at specific observations that have large maximum proportion match values, we can also look at the overall distribution of maximum proportion match values within the data.

Code
pctmatch_tab <- sarajevo_data_pctmatchR %>%
  select(pct_match) %>%
  report_table() 

gt(data = pctmatch_tab) %>%
  cols_hide(columns = c(n_Obs, MAD, percentage_Missing)) %>%
   fmt_number(
    columns = c(Mean, SD, Median, Min, Max, Skewness, Kurtosis),
    decimals = 2,
    use_seps = FALSE)  %>%
  tab_header(
  title = md('**Descriptive Statistics for Maximum Proportion Match**'))
Descriptive Statistics for Maximum Proportion Match
Variable Mean SD Median Min Max Skewness Kurtosis
pct_match 0.69 0.12 0.72 0.26 0.88 −1.05 1.11

The table above shows basic descriptive statistics for the maximum proportion match variable. The mean of the sample is 0.69, which is close to the mean of .66 that KR found in their simulation of 100,000 data sets with 1,000 observations, 100 variables, and random assignment of two distinct values in each cell (see p.285).
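As a rough sanity check on that comparison, KR’s baseline setup is easy to approximate in a few lines. Below is a scaled-down sketch - a single replicate with 200 rather than 1,000 observations, so the result should land a bit below .66 because the expected maximum grows with the number of pairwise comparisons:

Code
#scaled-down sketch of KR's baseline simulation: two equiprobable values per cell
set.seed(7734)
n <- 200; k <- 100
sim <- matrix(sample(0:1, n * k, replace = TRUE), nrow = n)

#for 0/1 data, manhattan distance counts the number of differing items:
match_prop <- 1 - as.matrix(dist(sim, method = "manhattan")) / k
diag(match_prop) <- NA #ignore self-matches
mean(apply(match_prop, 1, max, na.rm = TRUE)) #mean maximum proportion match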

To get a better sense of how these values are distributed, we can visualize the entire distribution using a histogram that is similar to those found throughout KR’s paper.

Code
pctmatch_hist <- ggplot(data = sarajevo_data_pctmatchR, aes(x = pct_match)) + 
  geom_histogram(aes(y = after_stat(density)), 
                 color = "white", fill = "#440154FF")  +
  scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) + 
  labs(title = "Distribution of Maximum Proportion Match",
       y = "Density", x = "Maximum Proportion Match") + 
  theme_minimal() +
  theme(panel.grid = element_blank())
pctmatch_hist

A couple of interesting things stand out with this distribution. First, it actually looks pretty similar to the distribution KR demonstrated in the appendix of their pre-print for a non-OECD country sample in which fraud was deemed unlikely to have occurred. There are a handful of observations above the 85% criterion, but there are no weird peaks in cases above 80% match like KR identified in the example of expected fraud they report in their published paper (see p.289).

Second, our distribution has a fatter left tail (i.e., the range extends as low as 0.26) compared to many of those in KR’s study. KR’s examples, including the simulated examples, rarely had observations exhibiting less than a 40% maximum match. Based on their examples, KR claim that the distributions of non-fraudulent data generally should resemble a Gumbel distribution, which is a known type of extreme value distribution.

At first glance, this might seem to raise a red flag for our data, as our observed maximum proportion match distribution displayed above does not appear to approximate a Gumbel distribution. However, KR do not offer any principled reasons to expect all non-fraudulent maximum proportion match distributions to specifically mirror a Gumbel distribution as opposed to, say, any other known or generalized extreme value distribution (or, for that matter, to any other distributional form).

In fact, if you visit the journal issue containing KR’s published paper, you will also find a response paper from Simmons and colleagues, a team of scholars at Pew Research Center. (They had previously posted a version on Pew’s website as well). In their response, Simmons and colleagues specifically challenged KR’s claim that the maximum percent match statistic should follow a Gumbel distribution, and they describe “several reasons why we should not expect this to be the case in general” (see pp.328-9). For example, they note that the “belief that the statistic should follow any particular smooth distribution depends on the assumption that the values are independent and identically distributed” (p.329) and, as they subsequently explain, these are invalid assumptions for modeling survey response data.

Nonetheless, for the sake of thoroughness, we examined the “fit” of our distribution to the Gumbel and to other extreme value distributions. For those interested in such minutiae, we have included this detour in an expandable section below. Here is the TL;DR version: Although the maximum proportion match distribution from our BiH data does not follow a theoretical Gumbel distribution, it does fit reasonably well with a Weibull distribution, which is another known extreme value distribution. Substantively, this might mean that our data are not that surprising given the (debatable) assumption that the maximum proportion match metric in our data reflects a data generating process that is congruent with extreme value theory.

Detour

Expand discussion: Extreme value distributions

The Gumbel distribution is an extreme-value distribution often used for modeling the distribution of maximum values (e.g., maximum percent match in our case).8 KR base their argument that the distribution of extreme values should follow a Gumbel distribution on observations of distributions generated by their simulated maximum percent match values for both uncorrelated and correlated data. They also note that their analyses of real-world survey data from reputable survey organizations, including those of relatively homogeneous groups (e.g., a survey of a small ethnic minority in a single metropolitan area), “closely approximate a Gumbel distribution” (p. 289).
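For reference, the Gumbel cumulative distribution function is F(x) = exp(-exp(-(x - location) / scale)), which is simple enough to eyeball in R (the parameter values below are rounded from the model we fit further down):

Code
#Gumbel CDF, for reference: F(x) = exp(-exp(-(x - mu) / sigma))
pgumbel <- function(x, mu, sigma) exp(-exp(-(x - mu) / sigma))
curve(pgumbel(x, mu = 0.63, sigma = 0.14), from = 0.2, to = 1,
      xlab = "Maximum proportion match", ylab = "Gumbel CDF")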

However, as noted above, Simmons and colleagues (2016) criticized KR’s lack of theoretical justification for expecting a Gumbel distribution. Much like the arbitrary 85% “likely fraud” threshold, Simmons and colleagues contend this distributional expectation stems from arbitrary design and parameter selections in KR’s simulations. They state: “As with the 85% cutoff, the probability of observing deviations from the expected distribution proposed by K&R will depend greatly on the characteristics of the specific survey and the target population” (p. 329).

While we agree with this critique, we also would not be surprised if maximum proportion match metrics tend to approximate generalized extreme value distributions in certain situations. Assuming extreme value theory is (sometimes) relevant to the proportion match metric, then one could assess the fit of an observed or empirical distribution to Gumbel and other extreme value distributions using standard statistical tools that permit more systematic assessment than that found in KR’s study. For instance, when someone wants to compare an empirical distribution to a theoretical distribution, they will often examine a quantile-quantile plot. Then, if they want to examine these differences more formally, they may utilize a goodness-of-fit test (although these become less useful in large samples). To do this, we use the “extRemes” package created by Eric Gilleland.9

First, we’ll fit a Gumbel distribution to the data using the fevd() function in the “extRemes” package. Then we’ll plot the theoretical distribution based on the estimated parameters against the empirical distribution using the ggplot extension package “qqplotr.”

Code
fit_gumbel <- fevd(x = sarajevo_data_pctmatchR$pct_match, type = "Gumbel", time.units = NULL, period.basis = NULL)

summary(fit_gumbel)

fevd(x = sarajevo_data_pctmatchR$pct_match, type = "Gumbel", 
    time.units = NULL, period.basis = NULL)

[1] "Estimation Method used: MLE"


 Negative Log-Likelihood Value:  -421.9886 


 Estimated parameters:
 location     scale 
0.6256592 0.1410741 

 Standard Error Estimates:
   location       scale 
0.005241089 0.003251954 

 Estimated parameter covariance matrix.
             location        scale
location 2.746901e-05 5.869466e-06
scale    5.869466e-06 1.057521e-05

 AIC = -839.9772 

 BIC = -830.5537 
Code
gumbel_param <- as_tibble_row(distill(fit_gumbel))

gumbel_qqplot <- ggplot(data = sarajevo_data_pctmatchR, aes(sample = pct_match)) + 
  stat_qq_band(bandType = "boot", fill = "#218F8B", alpha = 0.25, identity = TRUE, B = 1000, 
               distribution = "evd",
                dparams = list(loc = gumbel_param$location, scale = gumbel_param$scale)) + 
  stat_qq_line(color = "#218F8B", identity = TRUE,
               distribution = "evd", 
               dparams = list(loc = gumbel_param$location, scale = gumbel_param$scale)) + 
    stat_qq_point(color = "#440154" , alpha = 0.25, 
                  distribution = "evd", 
                  dparams = list(loc = gumbel_param$location, scale = gumbel_param$scale)) +
  labs(title = "Quantile-Quantile Plot of Gumbel Distribution",
       x = "Gumbel Distribution Quantiles",
       y = "Maximum Proportion Match Quantiles") + 
  scale_y_continuous(breaks=seq(.2,1,.1)) +
  scale_x_continuous(breaks = seq(.2, 1.8, .2)) + 
  coord_cartesian(ylim = c(.2, 1)) + 
  theme_light() +
  theme(panel.grid = element_blank(),
        panel.border = element_blank())
gumbel_qqplot

The above chart shows the theoretical distribution and 95% confidence interval (based on pointwise parametric bootstrap with 1,000 samples) represented as a line and confidence band with the empirical distribution represented as the individual data points.

Our empirical distribution clearly does not fit with a theoretical Gumbel distribution. However, recall that the Gumbel distribution is only one type of extreme value distribution. Instead of specifically assuming the restrictive Gumbel distribution, we can use the same procedures as above to estimate a generalized extreme value distribution that estimates a shape parameter in addition to the location and scale parameters. (The Gumbel distribution is the reduced form generalized extreme value distribution when the shape parameter is zero.)

Code
fit_gev <- fevd(x = sarajevo_data_pctmatchR$pct_match, type = "GEV", time.units = NULL, period.basis = NULL)

summary(fit_gev)

fevd(x = sarajevo_data_pctmatchR$pct_match, type = "GEV", time.units = NULL, 
    period.basis = NULL)

[1] "Estimation Method used: MLE"


 Negative Log-Likelihood Value:  -693.2064 


 Estimated parameters:
  location      scale      shape 
 0.6638157  0.1241461 -0.5632021 

 AIC = -1380.413 

 BIC = -1366.278 
Code
gev_param <- as_tibble_row(distill(fit_gev))

In the summary above, you’ll notice that the estimated shape parameter is a negative value: -0.56. This means that the distribution is better fit by a Weibull, or Type III, generalized extreme value distribution (see the data analysis classroom lesson). The nice thing about the generalized extreme value distribution, according to NASA’s Global Modeling and Assimilation Office, is that it “…unites the Gumbel, Fréchet and Weibull distributions into a single family to allow a continuous range of possible shapes.”

Now, we can produce another quantile-quantile plot to compare the fit of this generalized extreme value distribution to our observed proportion maximum distribution.

Code
gev_qqplot <- ggplot(data = sarajevo_data_pctmatchR, aes(sample = pct_match)) + 
  stat_qq_band(bandType = "boot", fill = "#218F8B", alpha = 0.25, identity = TRUE, B = 1000, 
               distribution = "evd",
               dparams = list(loc = gev_param$location, scale = gev_param$scale, shape = gev_param$shape)) + 
  stat_qq_line(color = "#218F8B", identity = TRUE,
               distribution = "evd", 
               dparams = list(loc = gev_param$location, scale = gev_param$scale, shape = gev_param$shape)) + 
    stat_qq_point(color = "#440154" , alpha = 0.25, 
                  distribution = "evd", 
                  dparams = list(loc = gev_param$location, scale = gev_param$scale, shape = gev_param$shape)) + 
  labs(title = "Quantile-Quantile Plot of Generalized Extreme Value Distribution",
       x = "Generalized Extreme Value Distribution Quantiles",
       y = "Maximum Proportion Match Quantiles") + 
  scale_y_continuous(breaks=seq(.2,1,.1)) +
  #scale_x_continuous(breaks = seq(.2, 1.8, .2)) + 
  coord_cartesian(ylim = c(.2, 1)) + 
  theme_light() +
  theme(panel.grid = element_blank(),
        panel.border = element_blank())
gev_qqplot

As you can see in the above plot, this generalized extreme value distribution fits our empirical distribution quite well and, unsurprisingly, it provides a much better fit to the data than a Gumbel distribution.
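Because the Gumbel is nested within the generalized extreme value distribution (it fixes the shape parameter at zero), one way to formalize “much better fit” is a likelihood-ratio test between the two fevd() fits, which the extRemes package provides via lr.test():

Code
#likelihood-ratio test of Gumbel (shape = 0) vs. GEV (shape estimated):
lr.test(fit_gumbel, fit_gev)

With the log-likelihoods reported above, the deviance difference is roughly 2 x (693.2 - 422.0), or about 542 on 1 degree of freedom, consistent with the large AIC/BIC gap between the two fits.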

Another way to assess the fit of the theoretical distribution to the empirical distribution is to simply simulate data from those theoretical distributions and plot it next to our empirical data. You can see this below, where we simulate 10,000 draws from both the Gumbel and Weibull distributions fitted above and plot their densities next to the empirical distribution found in our data.

Code
# Empirical Distribution Data
pctmatch <- sarajevo_data_pctmatchR %>%
  select(pct_match) %>%
  rename(pctmatch = pct_match) %>%
  mutate(dist = "Empirical")

# Gumbel Distribution Data
set.seed(7734)
gumbel <- tibble(
  pctmatch = revd(10000, loc = gumbel_param$location, scale = gumbel_param$scale),
  dist = "Gumbel")

# Weibull Distribution Data
set.seed(7734)
weibull <- tibble(
  pctmatch = revd(10000, loc = gev_param$location, scale = gev_param$scale, shape = gev_param$shape),
  dist = "Weibull")

distcomp <- bind_rows(pctmatch, gumbel, weibull)

#Compare theoretical distributions to empirical
distcomp_plot <- ggplot(data = distcomp, aes(color = dist)) +
  geom_density(aes(x = pctmatch), linewidth = 1) + 
  # geom_vline(aes(xintercept = 0.70), color = "grey", linetype = 2, linewidth = .5) +
  # geom_vline(aes(xintercept = 0.80), color = "grey", linetype = 2, linewidth = .5) +
  scale_color_manual(values = c("#440154", "#FDE725FF", "#218F8B")) +
    labs(title = "Density Plot of Emperical Distribution and Theoretical Extreme Value Distributions",
       x = "Maximum Proportion Match",
       y = "Density",
       color = "Distribution") + 
  scale_x_continuous(breaks = seq(.2, 1.8, .2)) + 
  theme_light() +
  theme(panel.grid = element_blank(),
        panel.border = element_blank())
distcomp_plot 

You can see from the plot above that our empirical distribution of maximum proportion match aligns well with the simulated data from the Weibull distribution. The main difference is that our actual data have more maximum proportion match values in the 70% - 80% range than would be expected if they followed the Weibull distribution more closely. You can also see from the plot that our empirical data do not align with the Gumbel distribution. Part of this is due to the fact that, unlike the maximum proportion values we are calculating, the Gumbel distribution is not constrained between zero and one, which is another reason to be skeptical of KR’s claim that this metric should follow a Gumbel distribution. Indeed, one of the strengths of the Weibull distribution in relation to the maximum proportion match metric is that it has a “light tail with a finite upper bound” (see Wikipedia’s “Extreme value theory” page).
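One way to see the bounded-support point with our own estimates: for a GEV with a negative shape parameter, the support has a finite upper bound at location - scale / shape. Plugging in the fitted values from above puts that bound right around our observed maximum:

Code
#finite upper bound implied by the fitted (Weibull-type) GEV:
with(gev_param, location - scale / shape)
#roughly 0.664 + 0.124 / 0.563, or about 0.88 - right around our observed max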

At the risk of unnecessarily repeating ourselves, it is not clear to us how exactly KR determined that their maximum percent match metric should follow a Gumbel distribution other than simply eyeballing the empirical distributions generated in their study. To examine this issue more formally, we would need to replicate their simulations. We are not going to do that here, but we definitely recommend Simmons and colleagues’ 2016 paper if you are interested in reading more about these issues. Rather, our main goal with this detour was to more formally examine the distributional properties of the maximum proportion match metric within the context of our sample. We ended up stumbling onto some extreme value theory and found that one type of extreme value distribution (Weibull) is a reasonable fit to our data.

In sum, assuming that the maximum proportion match metric in our data stems from a data generating process congruent with extreme value theory, our maximum proportion match calculations might not be that atypical or surprising. Again, though, Simmons and colleagues rightfully question the assumption that this metric should always match such distributions. Hence, we will end this detour by echoing Simmons and colleagues’ warning that a lack of fit between an observed percent match distribution in one’s own data and a Gumbel or other theoretical extreme value distribution will occur naturally - that is, for non-fraudulent reasons - in many social survey designs. For instance, Simmons and colleagues state:

Within any national population, we should expect to observe not a single distribution, but a mixture of distributions corresponding to the different subpopulations. Distinct subgroups within a surveyed population will share different numbers of characteristics or opinions that are measured in the survey and that essentially represent different data generating processes. Particularly homogenous subgroups within the population will appear as additional modes or bunching toward the tail of the distribution.

Therefore, an observed departure from theoretical extreme value distributions (e.g., clustering above 85% maximum percent match) should not be taken prima facie as evidence of fraud. Rather, we view such departures as reasons to further investigate one’s data, with the goal of trying to understand what processes (e.g., research design features and/or fraud) might have generated these distributional characteristics.

Click the link above to enter a rabbit hole about fitting our empirical maximum proportion distribution to theoretical extreme value distributions. Or take the blue pill to skip it and keep reading.

And, we’re back! Or perhaps you never left?

Another difference between our distribution and the distributions discussed by KR is that there appear to be more data points near the 80% match level. Indeed, the mode for our data is 0.74, which is larger than the modes of 60-66% reported by KR in the examples and simulations from the pre-print appendix. Interestingly, the mode is closest to the 70% found in the “Ethnic Minority Survey in Three US Counties” reported in Figure 8 of the pre-print appendix. This was meant to be an example of a survey of “a unique sub-population in a geographically concentrated area” (p. 23), which is perhaps more similar to our survey of a single city in Bosnia and Herzegovina. Of course, that particular survey did not have nearly the same concentration around the 80% level of maximum match. So there may be something else going on with these data that is producing these relatively large values.

A cynic might initially assume that something fishy is going on (e.g., interviewers or the survey organization falsifying data), but that the patterns were not quite detectable at KR’s 85% threshold. A less cynical alternative take is that there might be something about the nature of the research design and data themselves that generate relatively large values of maximum percent match for entirely non-fraudulent reasons.

Our initial hunch was that typical design and data features in criminological research would have a non-negligible impact on the percent match metric, thereby further reducing the utility of KR’s arbitrary threshold for detecting likely fraud in survey data. Specifically, based on our experiences with analyzing criminological survey data, we expected that a large number of respondents would match on identical (e.g., “zero”) response values across large item sets that ask about self-reported past criminal behavior and criminal intent, crime-related attitudes and beliefs, and other crime-relevant items (e.g., perceptions of family, peer, and neighborhood crime attitudes and behaviors). Thus, we expected the highly skewed and zero-inflated nature of these crime and crime-relevant survey response distributions might artificially inflate the central tendency of maximum percent match metrics.

To formally test this possibility, we could simulate data with known properties and vary things like zero-inflation or levels of inter-item agreement to assess the impact on simulated percent match distributions. However, Simmons and colleagues (2016) already varied some of the parameters from KR’s simulations to make similar points, so we will not reinvent that wheel here. Rather, we will first dig into the cynical explanation by examining the distributions of maximum percent match metrics for each individual interviewer in our BiH data. Then, following a detailed dig into interviewer-, municipality-, and cluster-specific proportion match distributions, we will briefly explore how features of criminological data might affect proportion match metrics. Overall, our goal in these investigations is to try to better understand our data and the processes that might have generated our observed maximum proportion match distributions, with the hope that doing so will help us more convincingly identify potential red flags (e.g., evidence of “likely fraud”) or, ideally, assuage our and others’ concerns about fraud.
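Though we will leave the fuller simulations to others, the zero-inflation intuition can be illustrated in a few lines. The sketch below uses toy four-option items with purely hypothetical parameters (and far fewer observations and items than a real survey) to show that the mean maximum proportion match climbs as a single response category comes to dominate:

Code
#toy illustration: zero-inflation inflates maximum proportion match
set.seed(7734)
sim_maxmatch <- function(p0, n = 200, k = 100) {
  #four-option items where the modal response ("0") occurs with probability p0:
  m <- matrix(sample(0:3, n * k, replace = TRUE,
                     prob = c(p0, rep((1 - p0) / 3, 3))), nrow = n)
  #average of each row's best match against every other row:
  mean(map_dbl(seq_len(n), function(i) {
    max(colMeans(m[i, ] == t(m[-i, , drop = FALSE])))
  }))
}
map_dbl(c(.25, .50, .75), sim_maxmatch) #mean max match climbs with the "zero" share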

Do not be alarmed if you find relatively high maximum percent match values (i.e., “near duplicates”). Sometimes this is expected due to research design and data features.

Beyond KR’s Percent Match: Exploring Interviewer, Design Strata, and Criminological Data Variations

Interviewer variations

When potentially problematic maximum proportion match values (KR’s “likely fraud” cases) are identified, we think this should be the start and not the endpoint of one’s investigation into near duplication patterns in their data. An obvious next step for us is to examine the distribution of maximum proportion match by interviewer. We happen to have interviewer ID variables in the raw data that make this possible. For one data set they found suspicious, KR produced a table showing, for every interviewer with at least one observation exceeding an 80% match, how many of that interviewer’s observations fell above and below 80%. In our data, we only have 14 interviewers, so we should be able to generate and visualize each interviewer’s distribution of maximum percent match values. Before we do, let’s first see how many interviews each interviewer performed in collecting our BiH data.

Code
int_count <- sarajevo_data %>%
  count(Interviewer) #frequency of interviews per interviewer

sarajevo_data %>%
  tabyl(Interviewer) %>% 
  adorn_pct_formatting() %>%
  gt() %>%
  cols_label(
    n = "freq.",
    percent = "%") %>%
  tab_header(
    title = md("**Interview Frequency**")) %>%
  tab_options(container.height = px(500))
Interview Frequency
Interviewer freq. %
45 40 4.8%
46 102 12.3%
47 130 15.7%
48 40 4.8%
49 52 6.3%
50 50 6.1%
62 30 3.6%
63 82 9.9%
64 40 4.8%
65 49 5.9%
66 80 9.7%
67 70 8.5%
68 51 6.2%
69 10 1.2%

The median number of interviews was 50.5 with a range of 10 - 130. If the near duplicates are spread out among all of the 14 interviewers, it may suggest that there was no systematic fraud present - at least, not at the level of the interviewer (e.g., a survey organization committing sophisticated fraud could near duplicate cases randomly across interviewers and design strata). Alternatively, if the near duplicates are concentrated among one or a few interviewers, then it might suggest the presence of strong “interviewer effects” on the maximum proportion match metric (e.g., interviewer fraud via near duplication). However, interviewer-level clustering might also occur for completely benign reasons as well.

Before going any further, this is a good place to pause and once again urge caution against jumping to any conclusions regarding fraud from near duplicate analyses - even after conducting more detailed interviewer-specific proportion match examinations. After all, interviewer-level clustering also might be expected to occur for entirely non-fraudulent reasons. For instance, interviewers tend to collect data around specific starting points and, as such, we might expect survey responses to be more homogeneous within versus across interviewers. Likewise, interviewers collecting data in areas with more homogeneous populations (e.g., areas with very low levels of self-reported crime) would be expected to have higher proportion match values, while interviewers collecting data from more heterogeneous populations and/or from multiple heterogeneous areas might be expected to have lower overall proportion match values.

Moreover, the belief that a high proportion match value is indicative of “likely fraud” itself is based on strong assumptions about how (interviewer, organization, or researcher) fraudsters engage in fraud. One might observe clustering of high proportion match values if a fraudster duplicated an interview or row of data and then lazily changed one, two, or a few variables. Yet, a more sophisticated fraudster that changed substantially more data - or an organization or researcher that used (quasi-)random or simulation techniques to generate partially duplicated or otherwise fraudulent data - would not necessarily generate clusters of high proportion match values. Rather, in these instances, we might expect to observe clusters of low proportion match values - or, in the simulation fraud example, no clusters at all! We will return to these points later. For now, we felt it was important to remind readers that these types of “forensic” data tools rarely, if ever, provide dispositive evidence of fraud. At best, they can provide us a deeper understanding of our data and potentially increase or reduce uncertainty about the probability of fraud.

Detecting near duplicates is easy. Detecting fraud? Not so easy.

Still, we think it is worthwhile to further investigate these near duplication patterns. So, to get a quick sense of interviewer-level variation in maximum proportion match distributions, we can plot a simple histogram for each interviewer.

Code
sarajevo_data_int <- sarajevo_data %>%
  select(ID, starttime, Interviewer, Region, Municipality, Cluster, Settlement, weight) %>%
  left_join(sarajevo_data %>% group_by(Interviewer) %>% summarize(int_count = n())) %>% #joining grouped data
  mutate(int_label = paste0('Int. ', Interviewer, ' (n = ', int_count, ')'))

sarajevo_data_intmatch <- inner_join(x = sarajevo_data_pctmatchR, y = sarajevo_data_int, by = "ID") %>%
  mutate(pct_match_gte80 = factor(ifelse(round(pct_match, digits = 2) >= .8, 1, 0)),
         pct_match_lt80 = factor(ifelse(round(pct_match, digits = 2) < .8, 1, 0)),
         pct_match_gte85 = factor(ifelse(round(pct_match, digits = 2) >= .85, 1, 0)),
         pct_match_lt85 = factor(ifelse(round(pct_match, digits = 2) < .85, 1, 0))) 

sarajevo_intmatch_plot <- ggplot(data = sarajevo_data_intmatch) + 
  geom_histogram(aes(x = round(pct_match, digits = 2), y = after_stat(density)), bins = 20, color = "white", fill = "#440154FF") + 
  geom_density(aes(x = round(pct_match, digits = 2)), color = "#21918c", linewidth = 1) + 
  geom_vline(aes(xintercept = 0.80), color = "red", linetype = 2) +
  facet_wrap(~int_label, ncol = 3) +
  labs(title = "Distribution of Maximum Percent Match by Interviewer ID",
       y = "Density", x = "Maximum Percent Match",
       caption = "Note: The histogram is created with 20 bins. The smoothed density curve is overlaid to avoid idisoyncracies \nwith binning in small samples. The red line is drawn at 80% maximum percent match.") + 
  theme_minimal() + 
  theme(plot.caption = element_text(hjust = 0)) 
print(sarajevo_intmatch_plot)

When examining the distributions of specific data sets, KR looked for peaks in the maximum percent match values above 80%. Not surprisingly, given the overall distribution, you see many interviewers with relatively large numbers of observations around the 80% match level. For a few of these interviewers (#46, #63, and #66), these seem to reflect distributions that are shifted towards the high end of the overall distribution in general. However, a few interviewers do appear to have somewhat atypical peaks after the 80% match level (#45, #47, and #64). We will investigate these (and all) interviewers further to see if we can determine just how atypical their distributional peaks are and whether these interviewers should be red-flagged as potentially invalid.

As noted, KR created a table of interviewers who had at least one interview with an 80% match or higher to see what percentage of each interviewer’s interviews were at or above an 80% match with another observation in the data. Since we only have 14 interviewers in our data, we can simply create this table for all of our interviewers:

Code
int_pctmatch <- sarajevo_data_intmatch %>%
  select(Interviewer, pct_match) %>%
  group_by(Interviewer) %>%
  report_table()

int_tab80 <- sarajevo_data_intmatch  %>%
  tabyl(Interviewer, pct_match_gte80) %>%
  adorn_totals("col") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  adorn_ns(position = "front") %>%
  rename(match_lt80 = "0",
         match_gte80 = "1") %>%
  mutate(Total = parse_number(Total)) 

int_tab85 <- sarajevo_data_intmatch  %>%
  tabyl(Interviewer, pct_match_gte85) %>%
  adorn_totals("col") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  adorn_ns(position = "front") %>%
  rename(match_lt85 = "0",
         match_gte85 = "1") %>%
  mutate(Total = parse_number(Total)) 

int_tab <- left_join(int_tab80, int_tab85) %>%
  select(-match_lt85) %>%
  relocate(Total, .after = match_gte85)

int_tabgt <- gt(int_tab) %>%
  cols_label(
    match_lt80 = "% Match < 80%",
    match_gte80 = "% Match \u2265 80%",
    match_gte85 = "% Match \u2265 85%",
    Total = "Total Interviews") %>%
  tab_header(
    title = md("**Number (and %) of Respondents by Interviewer & Maximum Percent Match**"))
int_tabgt %>%
  tab_options(container.height = px(500))
Number (and %) of Respondents by Interviewer & Maximum Percent Match
Interviewer code % Match < 80% % Match ≥ 80% % Match ≥ 85% Total Interviews
45 24 (60%) 16 (40%) 8 (20%) 40
46 80 (78%) 22 (22%) 2 (2%) 102
47 96 (74%) 34 (26%) 6 (5%) 130
48 40 (100%) 0 (0%) 0 (0%) 40
49 49 (94%) 3 (6%) 0 (0%) 52
50 50 (100%) 0 (0%) 0 (0%) 50
62 28 (93%) 2 (7%) 0 (0%) 30
63 66 (80%) 16 (20%) 3 (4%) 82
64 19 (48%) 21 (53%) 6 (15%) 40
65 49 (100%) 0 (0%) 0 (0%) 49
66 66 (85%) 12 (15%) 0 (0%) 78
67 66 (97%) 2 (3%) 0 (0%) 68
68 51 (100%) 0 (0%) 0 (0%) 51
69 10 (100%) 0 (0%) 0 (0%) 10

A couple of things jump out to us in this table. First, notice that the three interviewers who had potentially suspicious spikes (#45, #47, and #64) account for 20 of the 25 total observations with 85% match or higher. Also, note that two of the three have a substantially larger share of their interviews (40% for #45 and 52.5% for #64) that are at or above an 80% match with another observation. The next closest is Interviewer #47, who did the most interviews (130) and had 26.2% (34) of their interviews at or above 80%.

Visualizing interviewer variations

Soon, we will be diving deeper into this rabbit hole to examine interviewer-level maximum proportion match values across different research design strata, so it will help to leave the summary tables and histograms behind and, instead, transition to alternative ways to visualize each interviewer’s entire distribution of maximum proportion match values. Below, we produce a jitter plot of maximum proportion match by each interviewer. It is sorted in descending order from the interviewer that completed the most interviews (#47) to the least (#69). Note that the points are somewhat transparent, so darker colors indicate multiple points stacking upon those values. The points also include random variation along the vertical axis within each interviewer (i.e., points are randomly “jittered” or displaced up or down). This improves our ability to see how each interviewer’s observations are distributed in terms of maximum proportion match. We placed lines at the (arbitrary) 80% and 85% threshold levels so one can clearly see which interviewers have the most similar observations.

Code
sarajevo_intmatch_jitter <- ggplot(data = sarajevo_data_intmatch) + 
  geom_jitter(aes(x = pct_match, y = reorder(int_label, int_count)), color = "#440154", alpha = 0.25,
              position = position_jitter(width = 0, height = .25, seed = 7734)) + 
  geom_hline(yintercept = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.5, 13.5), color = "grey", linewidth = 0.5) +
  geom_vline(aes(xintercept = 0.80), color = "grey", linetype = 2, linewidth = 1) +
  geom_vline(aes(xintercept = 0.85), color = "black", linetype = 2, linewidth = 1) +
  scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) + 
  labs(title = "Maximum Percent Match by Interviewer ID",
       y = "Interviewer", x = "Maximum Percent Match",
       caption = ("***Note***: grey dashed line indicates 80% maximum percent match; black dashed line indicates 85% maximum percent match.")) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        plot.caption = element_markdown())
sarajevo_intmatch_jitter

It is interesting to see the distributional similarities between the two interviewers with the highest - and, at least according to KR, potentially most problematic - matches (#45 and #64). Substantively, the plot largely reproduces the histograms above. Where this jitter plot may be especially helpful is in examining how the maximum proportion match metric is distributed across interviewers while simultaneously considering other stratifying features of the data as well.

Municipality variations

When administering the BiH survey, the organization collected a two-stage “proportional stratified random sample” by, first, stratifying by the eight different municipalities in Sarajevo, then proportionally and randomly selecting 83 neighborhood clusters across these municipalities and, finally, interviewing no more than 10 people from each cluster.10 Below, we will summarize how the maximum percent match metric is distributed across and within these stratifying features of the data. We will pay particular attention to distributions across the eight municipalities, as a number of them are large enough to allow us to compare the distribution of the maximum percent match metric across multiple interviewers within the same municipality. First, let’s summarize and visualize maximum proportion match distributions by municipality:

Code
#Municipality
muni_pctmatch <- sarajevo_data_intmatch %>%
  select(Municipality, pct_match) %>%
  group_by(Municipality) %>%
  report_table()

muni_count <- sarajevo_data %>%
  group_by(Municipality) %>%
  count(Municipality)

muni_tab80 <- sarajevo_data_intmatch  %>%
  tabyl(Municipality, pct_match_gte80) %>%
  adorn_totals("col") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  adorn_ns(position = "front") %>%
  rename(match_lt80 = "0",
         match_gte80 = "1") %>%
  mutate(Total = parse_number(Total)) 

muni_tab85 <- sarajevo_data_intmatch  %>%
  tabyl(Municipality, pct_match_gte85) %>%
  adorn_totals("col") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  adorn_ns(position = "front") %>%
  rename(match_lt85 = "0",
         match_gte85 = "1") %>%
  mutate(Total = parse_number(Total)) 

muni_tab <- left_join(muni_tab80, muni_tab85) %>%
  select(-match_lt85) %>%
  relocate(Total, .after = match_gte85)

# Neighborhood Cluster:
clust_pctmatch <- sarajevo_data_intmatch %>%
  select(Cluster, pct_match) %>%
  group_by(Cluster) %>%
  report_table()

clust_count <- sarajevo_data %>%
  group_by(Cluster) %>%
  count(Cluster)

clust_tab80 <- sarajevo_data_intmatch  %>%
  tabyl(Cluster, pct_match_gte80) %>%
  adorn_totals("col") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  adorn_ns(position = "front") %>%
  rename(match_lt80 = "0",
         match_gte80 = "1") %>%
  mutate(Total = parse_number(Total)) 

clust_tab85 <- sarajevo_data_intmatch  %>%
  tabyl(Cluster, pct_match_gte85) %>%
  adorn_totals("col") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  adorn_ns(position = "front") %>%
  rename(match_lt85 = "0",
         match_gte85 = "1") %>%
  mutate(Total = parse_number(Total)) 

clust_tab <- left_join(clust_tab80, clust_tab85) %>%
  select(-match_lt85) %>%
  relocate(Total, .after = match_gte85)

#Merge with pctmatch & Interviewer data
sarajevo_data_stratmatch <- sarajevo_data_intmatch %>%
  left_join(sarajevo_data %>% group_by(Municipality) %>% summarize(muni_count = n())) %>%
  left_join(sarajevo_data %>% group_by(Cluster) %>% summarize(clust_count = n())) %>%
  mutate(muni_label = paste0('Muni. ', Municipality, ' (n = ', muni_count, ')'),
         muni_longlab = paste0('Municipality ', Municipality, ' (n = ', muni_count, ')'),
         clust_label = paste0('Clust. ', Cluster, ' (n = ', clust_count, ')'),
         int_label = fct_reorder(as_factor(int_label), desc(int_count)),
         muni_label = fct_reorder2(as_factor(muni_label), desc(muni_count), Municipality, .desc = FALSE),
         muni_longlab = fct_reorder2(as_factor(muni_longlab), desc(muni_count), Municipality, .desc = FALSE),
         clust_label = fct_reorder(as_factor(clust_label), Cluster))

save(sarajevo_data_stratmatch, file = here("Data", "sarajevo_data_stratmatch.RData"))

# Combined Tables (GT Versions)
muni_tabgt <- muni_tab %>%
  arrange(desc(Total)) %>%
  gt() %>%
  cols_label(
    match_lt80 = "% Match < 80%",
    match_gte80 = "% Match \u2265 80%",
    match_gte85 = "% Match \u2265 85%",
    Total = "Total Observations") %>%
  tab_header(
    title = md("**Number (and %) of Respondents by Municipality & Maximum Percent Match**"))
muni_tabgt
Number (and %) of Respondents by Municipality & Maximum Percent Match
Municipality % Match < 80% % Match ≥ 80% % Match ≥ 85% Total Observations
1 218 (91%) 22 (9%) 2 (1%) 240
2 103 (80%) 25 (20%) 3 (2%) 128
3 104 (81%) 24 (19%) 6 (5%) 128
4 70 (64%) 40 (36%) 14 (13%) 110
5 78 (98%) 2 (3%) 0 (0%) 80
6 50 (100%) 0 (0%) 0 (0%) 50
7 50 (100%) 0 (0%) 0 (0%) 50
8 21 (58%) 15 (42%) 0 (0%) 36
Code
sarajevo_munimatch_jitter <- ggplot(data = sarajevo_data_stratmatch) + 
  geom_jitter(aes(x = pct_match, y = muni_label), color = "#35b779", alpha = 0.25, 
              position = position_jitter(width = 0, height = .35, seed = 7734)) + 
  geom_hline(yintercept = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5), color = "grey", linewidth = 0.5) +
  geom_vline(aes(xintercept = 0.80), color = "grey", linetype = 2, linewidth = 1) +
  geom_vline(aes(xintercept = 0.85), color = "black", linetype = 2, linewidth = 1) +
  scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) + 
  scale_y_discrete(limits = rev) + 
  labs(title = "Maximum Percent Match by Municipality",
       y = "Municipality", x = "Maximum Percent Match",
       caption = ("***Note***: grey dashed line indicates 80% maximum percent match; black dashed line indicates 85% maximum percent match.")) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        plot.caption = element_markdown())
sarajevo_munimatch_jitter

The above chart is sorted by the number of interviews completed in each municipality. You will notice that relatively large maximum percent match values are concentrated in the municipalities with larger numbers of completed interviews (which are presumably larger population centers as well). Of course, you will also notice more variation in general in these municipalities. Now, let’s distinguish between different interviewers within each municipality.

A central goal of ours here is to determine whether the distributions of those "potentially problematic" interviewers with high proportion match values that we identified earlier stand out as atypical relative to the distributions of other interviewers operating in the same municipality. In this more fine-grained relative investigation, an atypical within-municipality interviewer distribution might be indicative of fraud. With that said, municipalities are quite large in a city like Sarajevo, so even an atypical within-municipality interviewer distribution might simply reflect an interviewer working from a starting point that was more homogeneous than those of other interviewers in the same larger population stratum. Still, let's see what we find.

Code
colorpal <- glasbey(16)[-c(14,15)] # creates a 14-color palette by dropping the 14th and 15th colors from the 16-color glasbey palette

sarajevo_intmunimatch_jitter <- ggplot(data = sarajevo_data_stratmatch) + 
  geom_jitter(aes(x = pct_match, y = muni_label, 
                  color = reorder(int_label, int_count, decreasing = TRUE)), alpha = .5,
                  position = position_jitter(width = 0, height = .35, seed = 7734)) + 
  geom_hline(yintercept = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5), color = "grey", linewidth = 0.5) +
  geom_vline(aes(xintercept = 0.80), color = "grey", linetype = 2, linewidth = 1) +
  geom_vline(aes(xintercept = 0.85), color = "black", linetype = 2, linewidth = 1) +
  scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) + 
  scale_y_discrete(limits = rev) + 
  scale_color_manual(values = colorpal) + 
  labs(title = "Maximum Percent Match by Municipality & Interviewer",
       y = "Municipality", x = "Maximum Percent Match",
       color = "Interviewer") +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        plot.caption = element_markdown())
sarajevo_intmunimatch_jitter

In the above chart, each observation is colored differently according to each interviewer. This gives us a rough sense of how maximum proportion match values are distributed across interviewers within each municipality. Admittedly, it is a little confusing given the number of different colors, but we use this chart to assess whether there is a good mixture of interviewers within each municipality and, if so, to assess whether an interviewer’s observations are (1) similarly or differently distributed across municipalities and (2) similarly or differently distributed relative to other interviewers within the same municipality.

On one hand, if certain interviewers tend to have larger maximum percentage match values within each municipality, then this might suggest “interviewer effects” - though, as noted earlier, such patterns might be generated via benign (e.g., neighborhood clustering) processes as well. On the other hand, if interviewer-specific observations are well mixed within each municipality - i.e., if it looks as if each interviewer’s distribution was randomly generated from the same underlying distribution - then this might minimize concerns about strong individual interviewer effects (though it does not rule out collective interviewer, organizational, or researcher fraud).

For example, take Municipality #4. While maximum proportion match values are generally high in this municipality, they are similarly high for each of the three interviewers who surveyed respondents there.

In contrast, look at Municipality #3. Here, there is generally a good mixture among 3 of the 4 interviewers who surveyed subjects within that municipality. However, the maximum proportion match observations from Interviewer #64 - one of the "potentially problematic" interviewers identified earlier - are concentrated at the high end of this municipality's overall distribution. This could be indicative of "likely fraud" on this interviewer's part - or, it might have occurred for benign reasons, such as a greater degree of respondent homogeneity in the neighborhood clusters they interviewed. Either way, this clear pattern is informative, as it tells us that Interviewer #64's observations were more similar to one another than were other interviewers' observations within the same municipality. Though this information falls short of being dispositive evidence of fraud, neither does it allay our fears of fraud. As such, it points us to the need for additional investigation of the data collected by this interviewer to see if we can uncover how and why their maximum proportion match distribution is atypically high, and it signals to us that it might be a good idea to conduct sensitivity analyses (i.e., with and without this interviewer's data).

What about the other two interviewers - #45 and #47 - who were flagged as "potentially problematic" by KR's threshold? First, let's look at Interviewer #47. This interviewer conducted the most interviews in the study, and their within-municipality distributions do not appear to be all that atypical relative to other interviewers in this figure. Their high number of large maximum proportion match values could simply be due to the fact that they collected more interviews than everyone else - so they have more observations across the range of values on this metric.

Meanwhile, all of Interviewer #45's interviews occurred in Municipality #4. In this within-municipality context, their interviews actually look a little better, as their maximum proportion match values appear similar to those of other interviewers within the same municipality. Again, you might also expect a similar pattern if these interviews were in fact generated via partial duplication. So, none of this is dispositive evidence for or against fraud - we do not think one can justifiably identify cases of "likely fraud" using this metric, as KR claim. Yet, the patterns do suggest that maximum proportion match values vary systematically across municipalities (just as Simmons and colleagues' 2016 paper anticipated). Accounting for such data generating processes operating at or above the interviewer level might uncover more useful information and lead researchers to different conclusions than they would infer from naively running a percent match test and applying simple thresholds to the entire dataset. In this case, we do not see anything particularly alarming about either Interviewer #47 or #45, though we still have questions about Interviewer #64.

Forensic data analysis rarely uncovers a smoking gun. Rather, as is common with data analysis in general, we are usually sifting through noise to seek a signal that may or may not exist.

We think there is another pattern you would want to see if interviewers were not falsifying data through partial duplication: an interviewer's maximum proportion match distributions should vary systematically across population strata like municipalities, tracking municipality-level variation rather than following the interviewer. Of course, observing this requires an interviewer to have performed interviews in multiple municipalities (about 5-6 interviewers completed interviews across multiple municipalities in the BiH data). In other words, you would want to see that an interviewer's maximum proportion match distributions are more reflective of the aggregate than of the interviewer. You can get some sense of this in the chart above, but it is difficult to decipher with all of the different colors. For example, Interviewer #46 appears to have completed interviews across three different municipalities: #1, #2, and #8. If you look at Interviewer #46's maximum proportion match values across those different municipalities, they do appear to reflect patterns within each municipality more so than the interviewer. To better visualize this tendency, we created the slides below. The slides will automatically scroll through each individual interviewer's figure, but you can also click through yourself at your own pace.
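The code that assembles the slide deck is not shown here, but below is a minimal sketch of the underlying idea: for each interviewer, plot all other interviewers' observations in grey and overlay the focal interviewer's observations in color. The `int_plot()` helper and the `int_plots` list are our own illustrative names, not objects used elsewhere in this post.

Code
# Sketch (not the exact slide code): one highlighted figure per interviewer
int_plot <- function(focal_int) {
  ggplot() + 
    geom_jitter(data = filter(sarajevo_data_stratmatch, int_label != focal_int),
                aes(x = pct_match, y = muni_label), color = "grey70", alpha = 0.25, 
                position = position_jitter(width = 0, height = .35, seed = 7734)) + 
    geom_jitter(data = filter(sarajevo_data_stratmatch, int_label == focal_int),
                aes(x = pct_match, y = muni_label), color = "#35b779", alpha = 0.8, 
                position = position_jitter(width = 0, height = .35, seed = 7734)) + 
    scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) + 
    scale_y_discrete(limits = rev) + 
    labs(title = paste0("Maximum Percent Match: ", focal_int),
         y = "Municipality", x = "Maximum Percent Match") +
    theme_minimal() +
    theme(panel.grid = element_blank())
}

# One figure per interviewer (e.g., to drop into a slide deck)
int_plots <- map(levels(sarajevo_data_stratmatch$int_label), int_plot)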

Neighborhood cluster variations

We might usefully extend our approach above by analyzing other stratifying features of the data, like neighborhood clusters. Recall that interviewers were sent to random starting points in 83 neighborhood clusters across the eight municipalities. Below, we present a table that summarizes the frequency of near duplicates by each municipality and for each neighborhood cluster.

Code
sarajevo_data_muni <- sarajevo_data %>%
  select(ID, Municipality) %>%
  left_join(sarajevo_data %>% group_by(Municipality) %>% summarize(muni_count = n())) %>% #joining grouped data
  mutate(muni_label = paste0('Muni. ', Municipality, ' (n = ', muni_count, ')')) 

sarajevo_data_munimatch <- inner_join(x = sarajevo_data_intmatch, y = sarajevo_data_muni, by = c("ID", "Municipality")) 

#Data with distinct Municipality information
muni_data <- sarajevo_data_munimatch %>%
  select(Municipality, muni_count, muni_label) %>%
  distinct() %>%
  arrange(Municipality)

#Data with distinct Municipality, Cluster, and Interviewer Information
intmuni_data <- sarajevo_data_munimatch %>%
  select(Municipality, muni_count, muni_label, Cluster, Interviewer, int_count, int_label) %>%
  distinct()

municlust_tab_func <- function(data, row, col, gval) {
  data %>%
    filter(Municipality == gval) %>%
    tabyl({{row}}, {{col}}) %>%
    adorn_totals("col") %>%
    adorn_percentages("row") %>%
    adorn_pct_formatting(rounding = "half up", digits = 0) %>%
    adorn_ns(position = "front") %>%
    mutate(Total = parse_number(Total),
           Municipality = paste0(gval))
}

unique_muni <- pull(distinct(sarajevo_data_munimatch, Municipality)) %>%
  zap_labels() %>%
  zap_label() %>%
  zap_formats() %>%
  zap_widths() %>%
  sort() 

municlust_tab80 <- map_df(.x = unique_muni, ~municlust_tab_func(data = sarajevo_data_munimatch,
                                                      row = Cluster,
                                                      col = pct_match_gte80,
                                                      gval = .x)) %>%
  rename(match_lt80 = "0",
         match_gte80 = "1") %>%
  mutate(Municipality = as.numeric(Municipality))


municlust_tab85 <- map_df(.x = unique_muni, ~municlust_tab_func(data = sarajevo_data_munimatch,
                                                           row = Cluster,
                                                           col = pct_match_gte85,
                                                           gval = .x)) %>%
    rename(match_lt85 = "0",
           match_gte85 = "1") %>%
  mutate(Municipality = as.numeric(Municipality))

municlust_tab <- left_join(municlust_tab80, municlust_tab85) %>%
  select(-match_lt85) %>%
  left_join(muni_data) %>%
  left_join(intmuni_data) %>%
  relocate(c(Total, Municipality, muni_label), .after = match_gte85) 

# GT Version
municlust_tabgt <- municlust_tab %>%
  group_by(muni_label) %>%
  arrange(desc(muni_count)) %>%
  gt() %>%
  cols_hide(columns = c(Municipality, muni_count, Total, Interviewer, int_label, int_count)) %>%
  cols_label(
    match_lt80 = "% Match < 80%",
    match_gte80 = "% Match \u2265 80%",
    match_gte85 = "% Match \u2265 85%",
    muni_label = "Municipality") %>%
  tab_header(
    title = md("**Number (and %) of Respondents by Municipality, Neighborhood Cluster, & Maximum Percent Match**")) %>%
  tab_options(container.height = px(500))
municlust_tabgt
Number (and %) of Respondents by Municipality, Neighborhood Cluster, & Maximum Percent Match
Cluster % Match < 80% % Match ≥ 80% % Match ≥ 85%
Muni. 1 (n = 240)
11 10 (100%) 0 (0%) 0 (0%)
12 10 (100%) 0 (0%) 0 (0%)
13 9 (90%) 1 (10%) 0 (0%)
14 10 (100%) 0 (0%) 0 (0%)
15 10 (100%) 0 (0%) 0 (0%)
16 9 (90%) 1 (10%) 0 (0%)
17 10 (100%) 0 (0%) 0 (0%)
18 8 (80%) 2 (20%) 0 (0%)
19 7 (70%) 3 (30%) 0 (0%)
110 3 (30%) 7 (70%) 2 (20%)
111 10 (100%) 0 (0%) 0 (0%)
112 8 (80%) 2 (20%) 0 (0%)
113 10 (100%) 0 (0%) 0 (0%)
114 10 (100%) 0 (0%) 0 (0%)
115 10 (100%) 0 (0%) 0 (0%)
116 10 (100%) 0 (0%) 0 (0%)
117 10 (100%) 0 (0%) 0 (0%)
118 10 (100%) 0 (0%) 0 (0%)
119 8 (80%) 2 (20%) 0 (0%)
120 10 (100%) 0 (0%) 0 (0%)
121 10 (100%) 0 (0%) 0 (0%)
122 10 (100%) 0 (0%) 0 (0%)
123 7 (70%) 3 (30%) 0 (0%)
124 9 (90%) 1 (10%) 0 (0%)
Muni. 2 (n = 130)
21 5 (56%) 4 (44%) 0 (0%)
22 9 (100%) 0 (0%) 0 (0%)
23 8 (80%) 2 (20%) 0 (0%)
24 7 (70%) 3 (30%) 1 (10%)
25 10 (100%) 0 (0%) 0 (0%)
26 8 (80%) 2 (20%) 0 (0%)
27 4 (40%) 6 (60%) 2 (20%)
29 10 (100%) 0 (0%) 0 (0%)
210 10 (100%) 0 (0%) 0 (0%)
211 9 (90%) 1 (10%) 0 (0%)
212 8 (80%) 2 (20%) 0 (0%)
213 6 (60%) 4 (40%) 0 (0%)
214 9 (90%) 1 (10%) 0 (0%)
Muni. 3 (n = 130)
31 7 (70%) 3 (30%) 0 (0%)
32 5 (50%) 5 (50%) 3 (30%)
33 4 (40%) 6 (60%) 2 (20%)
34 3 (30%) 7 (70%) 1 (10%)
35 9 (90%) 1 (10%) 0 (0%)
36 8 (100%) 0 (0%) 0 (0%)
37 10 (100%) 0 (0%) 0 (0%)
38 10 (100%) 0 (0%) 0 (0%)
39 8 (80%) 2 (20%) 0 (0%)
310 10 (100%) 0 (0%) 0 (0%)
311 10 (100%) 0 (0%) 0 (0%)
312 10 (100%) 0 (0%) 0 (0%)
313 10 (100%) 0 (0%) 0 (0%)
Muni. 4 (n = 110)
41 6 (60%) 4 (40%) 1 (10%)
42 5 (50%) 5 (50%) 3 (30%)
43 7 (70%) 3 (30%) 1 (10%)
44 6 (60%) 4 (40%) 1 (10%)
45 7 (70%) 3 (30%) 0 (0%)
46 10 (100%) 0 (0%) 0 (0%)
47 7 (70%) 3 (30%) 1 (10%)
48 3 (30%) 7 (70%) 4 (40%)
49 4 (40%) 6 (60%) 3 (30%)
410 6 (60%) 4 (40%) 0 (0%)
411 9 (90%) 1 (10%) 0 (0%)
Muni. 5 (n = 80)
51 10 (100%) 0 (0%) 0 (0%)
52 10 (100%) 0 (0%) 0 (0%)
53 10 (100%) 0 (0%) 0 (0%)
54 10 (100%) 0 (0%) 0 (0%)
55 8 (80%) 2 (20%) 0 (0%)
56 10 (100%) 0 (0%) 0 (0%)
57 10 (100%) 0 (0%) 0 (0%)
58 10 (100%) 0 (0%) 0 (0%)
Muni. 6 (n = 50)
61 10 (100%) 0 (0%) 0 (0%)
62 10 (100%) 0 (0%) 0 (0%)
63 10 (100%) 0 (0%) 0 (0%)
64 10 (100%) 0 (0%) 0 (0%)
65 10 (100%) 0 (0%) 0 (0%)
Muni. 7 (n = 50)
71 10 (100%) 0 (0%) 0 (0%)
72 10 (100%) 0 (0%) 0 (0%)
73 10 (100%) 0 (0%) 0 (0%)
74 10 (100%) 0 (0%) 0 (0%)
75 10 (100%) 0 (0%) 0 (0%)
Muni. 8 (n = 36)
81 6 (60%) 4 (40%) 0 (0%)
82 7 (70%) 3 (30%) 0 (0%)
83 2 (20%) 8 (80%) 0 (0%)
84 6 (100%) 0 (0%) 0 (0%)

The above table, with 83 rows of data, is fairly cumbersome to interpret. Of course, if you scroll through it, you may notice some patterns. Not surprisingly, municipalities with relatively large maximum percent match values have more clusters and more interviews within clusters with potentially problematic matches. However, there also appears to be variation in this pattern, even within municipality. These patterns may be easier to identify if we separately visualize the distributions by cycling through distinct jitter plots as we did above.

First, we can show each municipality and its corresponding clusters separately (similar to how we showed each interviewer separately above).
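As before, the slide code is not shown, but a sketch of how such municipality-by-cluster figures can be generated follows (`muni_clust_plot` and `municlust_plots` are illustrative names of our own):

Code
# Sketch: one jitter plot of maximum percent match by neighborhood cluster
# for each municipality
muni_clust_plot <- function(muni) {
  sarajevo_data_stratmatch %>%
    filter(Municipality == muni) %>%
    ggplot() + 
    geom_jitter(aes(x = pct_match, y = clust_label), color = "#35b779", alpha = 0.5, 
                position = position_jitter(width = 0, height = .35, seed = 7734)) + 
    geom_vline(xintercept = c(0.80, 0.85), color = c("grey", "black"), linetype = 2, linewidth = 1) +
    scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) + 
    scale_y_discrete(limits = rev) + 
    labs(title = paste0("Maximum Percent Match by Cluster, Municipality ", muni),
         y = "Neighborhood Cluster", x = "Maximum Percent Match") +
    theme_minimal() +
    theme(panel.grid = element_blank())
}

municlust_plots <- map(sort(unique(sarajevo_data_stratmatch$Municipality)), muni_clust_plot)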

These plots, particularly for the smaller municipalities, show that the distributions of maximum proportion match across neighborhood clusters generally reflect patterns within the municipality. Of course, there is also more variation in the larger municipalities. An issue with looking at the neighborhood clusters is that only one interviewer was assigned to each neighborhood. Thus, unlike with some of the municipalities, we cannot look at how the maximum percent match metric differs by interviewer within the same cluster. However, we can look at (1) how different interviewers' distributions vary within the same municipality across clusters and (2) how specific interviewers' distributions vary across municipalities and clusters. We are not sure how helpful these detailed breakdowns are for detecting fraud, but we tend to think that having more information is better than having less. As with municipalities, we would typically expect non-fraudulent data to show interviewer-level distributions varying systematically across neighborhoods in ways that reflect the larger maximum proportion match patterns - though seeing such systematic variation is not necessarily strong evidence of the absence of fraud, and failing to see it is not necessarily indicative of the presence of fraud. In any case, here are the plots presented separately by municipality, again with interviewers in different colors and this time also separately depicted across each of the different neighborhood clusters.

Again, one difficulty with these plots is that when you see an interviewer who has relatively large maximum proportion match values, you cannot be sure whether this is due to something peculiar with the interviewers (e.g., fraud via near duplication) or to something about the neighborhood clusters to which they were assigned (e.g., cluster-level homogeneity), or perhaps to something else entirely.

By now, you have probably gathered that the theory underlying the maximum proportion match metric as an indicator of fraud is not well developed. There are multiple plausible processes generating the observed maximum proportion match distributions in any given dataset; trying to tease out whether fraud via near duplication is one of those data generating processes using maximum proportion match distributions is like trying to find specific needles that may or may not exist in a stack of other needles.

Detecting fraud using maximum proportion match distributions is like trying to find specific needles that may or may not exist in a stack of other needles.

Still, we are learning more about near duplication and our data through the process, so let's continue on. The next set of slides contains plots showing each interviewer's maximum proportion match values across the different municipalities and clusters separately. These plots overlay each municipality's full range of data (minus the focal interviewer's) with each interviewer's cluster-specific maximum proportion match distributions. We find these plots to be particularly useful for visualizing the "within-interviewer" distributions of maximum proportion match values across design stratification features like municipalities and neighborhood clusters (provided, of course, that the interviewer completed interviews across multiple municipalities and neighborhood clusters).
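Here is a sketch of the overlay idea (`overlay_plot` is again an illustrative name): grey out all other interviewers' observations in the focal interviewer's municipalities, then color the focal interviewer's observations by neighborhood cluster.

Code
# Sketch of the overlay plots behind the slides
overlay_plot <- function(focal_int) {
  focal_data <- filter(sarajevo_data_stratmatch, int_label == focal_int)
  rest_data <- sarajevo_data_stratmatch %>%
    filter(int_label != focal_int, Municipality %in% focal_data$Municipality)
  ggplot() + 
    geom_jitter(data = rest_data, aes(x = pct_match, y = muni_label),
                color = "grey70", alpha = 0.25, width = 0, height = .35) + 
    geom_jitter(data = focal_data, aes(x = pct_match, y = muni_label, color = clust_label),
                alpha = 0.8, width = 0, height = .35) + 
    scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) + 
    labs(title = paste0("Cluster-Specific Match Values: ", focal_int),
         y = "Municipality", x = "Maximum Percent Match", color = "Cluster") +
    theme_minimal() +
    theme(panel.grid = element_blank())
}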

With these plots, it is interesting to return to those three interviewers who were identified as having potentially suspicious data above: Interviewers #47, #45, and #64. For Interviewers #47 and #45, we see more distributional variation across neighborhood clusters, both across and within different municipalities (#45 only interviewed in one municipality). This is the case for these interviewers even though their overall pattern of maximum proportion match values is skewed towards larger values. Now, compare this with Interviewer #64 who, like Interviewer #45, only completed interviews in a single municipality. For this interviewer, we see that their relatively large maximum proportion match values exist across the four neighborhood clusters in which they completed interviews, and all four of their clusters are clearly on the high end of the overall range for the municipality. We still cannot determine with any degree of reasonable certainty whether this interviewer did anything nefarious, as such patterns could occur for benign reasons as well (e.g., they could have been sent to four relatively homogeneous clusters). Still, this information has not shaken our belief that this particular interviewer's data should be subjected to additional scrutiny and perhaps flagged for sensitivity tests in later analyses using these data.

Criminological data variations

We wish to return to a point raised earlier about how predictable features of criminological data might systematically impact forensic analyses of near duplicates using maximum proportion or percent match metrics. Specifically, we noted earlier that the heavily skewed and often zero-inflated nature of crime and crime-relevant data might be expected to increase central tendencies for maximum percent match distributions and, if so, would then make reliance on arbitrary thresholds (e.g., >85% match) particularly unreliable for forensic detection of “likely fraud” in criminological data.

Put plainly, if we routinely ask about relatively rare events - or, in the public opinion context, when we ask about things with high degrees of consensus, things susceptible to strong social desirability effects, or anything exhibiting clustered survey response distributions - then we would expect higher levels of response similarity among respondents and, therefore, higher maximum percent match values. To use a concrete example, if almost everyone answers "Never" to a survey item asking how often they "Physically harmed someone else on purpose in the past two years," then nearly everyone in the sample would match on that particular variable. If we have a large number of items like this in our survey data, then we can probably expect higher overall maximum percent match values in our data as well. We illustrate this issue using our BiH data below.
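Before turning to our own data, here is a quick back-of-the-envelope illustration with simulated (not survey) data: if a share p of respondents give the modal answer to a two-category item, two randomly chosen respondents match on that item with probability of roughly p² + (1 − p)², which approaches 1 as p grows.

Code
# Toy simulation: with 90% answering "Never" (coded 1) on a two-category item,
# most respondent pairs match on that item
set.seed(1234)
p <- 0.90 # hypothetical share giving the modal answer
item <- sample(c(1, 2), size = 822, replace = TRUE, prob = c(p, 1 - p))
pairs <- combn(item, 2) # every pair of respondents (2 x choose(822, 2) matrix)
mean(pairs[1, ] == pairs[2, ]) # pairwise match rate, approximately p^2 + (1 - p)^2 = 0.82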

Rare event, zero-inflated, and other heavily skewed item response patterns legitimately increase the number of near duplicates in data. So, scantron data from dolphins - even honest ones - would be flagged as likely fraudulent.

Based on our prior subject-domain experience with analyzing criminological data, we expect each of the eight self-reported deviant/criminal behaviors to exhibit skewed distributions clustering near the low end of the response scales, with the modal category being "zero" self-reported crime involvement.11 Similarly, for items asking respondents "How morally acceptable is it to you to…" engage in the same eight delinquent acts, we expect clustering at or near "Never Acceptable" (the lowest response value) on these items. There are likely other variables in our data that are also highly skewed, but these are the items that we expect a priori to be particularly skewed. In fact, they are the only questions on the survey that were entirely self-reported (i.e., respondents read the questions and recorded their answers on an answer sheet that was then sealed in an envelope).12 This means we are dealing here with questions that involve rare events (crime/deviance) and have clear social desirability valence (crime/deviance and moral acceptability of crime/deviance).

In total, the past and projected crime/deviance items and the moral acceptability items alone comprise 24 survey items in the BiH data that we expect to exhibit highly skewed distributions and, thus, to inflate near duplication metrics. This amounts to about 10.1 percent of the variables we used to calculate maximum percent match, which is potentially a non-negligible number of items given the stakes associated with using this metric (i.e., determining presence of “likely fraud”).

Zero-inflated items (crime; morality) bias near duplicate detection

Instead of plotting each of these zero-inflated questions separately, below we simply plot their modes. These reflect the percent of respondents reporting "Never" engaging in each behavior in the past two years (i.e., "Past Crime"), "No chance" of engaging in the behavior in the future (i.e., "Projected Crime"), and that the behavior is "Never acceptable" morally (i.e., "Personal Morality").

Code
#Function for calculating and reporting percentage of modal response

mode_percentage <- function(data, variables = everything()) {
  results <- data %>%
    as_tibble() %>%
    pivot_longer(cols = all_of(variables), names_to = "variable", values_to = "value") %>%
    group_by(variable) %>%
    summarize(mode = names(table(value))[which.max(table(value))],
              percentage_mode = sum(table(value)[mode]) / sum(n()) * 100,
              modes = list(names(table(value))[table(value) == max(table(value))])) #identifies each value that is the mode in case 2 categories are equally the mode
  
  return(results)
}
Code
#Use function to calculate percentage of modal response and plot past and projected crime (crm_plots) and moral beliefs about crime (mor_plots)
crm_vars <- sarajevo_data_pctmatchR %>%
  dplyr::select(starts_with(c("Q25_", "Q26_"))) %>%
  names()

mor_vars <- sarajevo_data_pctmatchR %>%
  dplyr::select(starts_with("Q50_")) %>%
  names()

crm_modes <- mode_percentage(data = haven::zap_labels(sarajevo_data_pctmatchR), variables = crm_vars)
# use `zap_labels` to remove value labels from data before running function (was leading to an error).

moral_modes <- mode_percentage(data = haven::zap_labels(sarajevo_data_pctmatchR), variables = mor_vars)

zi_modes <- bind_rows(crm_modes, moral_modes) %>%
  select(-modes) %>%
  mutate(label = ifelse(variable %in% c(crm_vars[1:8]), "Past Crime", ""),
         label = ifelse(variable %in% c(crm_vars[9:16]), "Projected Crime", label), 
         label = ifelse(variable %in% c(mor_vars), "Personal Morality", label),
         label = factor(label, levels = c("Past Crime", "Projected Crime", "Personal Morality")))

zimode_plot <- ggplot(data = zi_modes, aes(x = rev(variable), y = percentage_mode, fill = rev(label))) +
  geom_bar(stat = "identity") +
  #geom_text(aes(label = paste(round(percentage_mode, digits = 0), "%")), vjust = -0.25) +
  facet_wrap(~rev(label), strip.position = "bottom", scales = "free_x") +
  scale_y_continuous(breaks = seq(0, 100, by = 10)) +
  scale_fill_manual(values = c("#440154FF", "#21908CFF", "#FDE725FF")) + 
  labs(y = "Percentage Mode", x = "Question", fill = "Question Category:",
       title = "Percentage Reporting Mode by Question") +
  theme_minimal() +
  theme(panel.spacing = unit(0, "lines"), 
        strip.background = element_blank(),
        strip.placement = "outside",
        panel.grid = element_blank(),
        legend.position = "none",
        axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
zimode_plot

The distributions of the percent in the zero category are very similar across these three sets of expected zero-inflated variables. The range of respondents answering the “zero” category across all 24 items is 78.47% - 97.08% (past crime: 84.67% - 97.08%; projected crime: 78.47% - 95.5%; morality: 81.39% - 95.01%), with an overall mean of 89.54% (past crime: 90.89%; projected crime: 88.44%; morality: 89.29%).
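These ranges and means can be pulled directly from the zi_modes object created above, for instance:

Code
# Reproduce the per-category ranges and means reported above
zi_modes %>%
  group_by(label) %>%
  summarize(min = min(percentage_mode),
            max = max(percentage_mode),
            mean = mean(percentage_mode))

zi_modes %>% summarize(overall_mean = mean(percentage_mode)) # mean across all 24 items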

Given these zero-inflated distributions, a quick and easy sensitivity check is to remove these variables from our data and then calculate the maximum percent match separately for this subset of variables and for the larger subset of data without these variables. This should give us a sense of whether and to what degree the crime and morality items are impacting the maximum percent match values.

To do this, we split our data into two distinct subsets (no crime/morality items; only crime/morality items) and then recalculate maximum percent match in each subset.

Code
sarajevo_data_nocrmtrim <- sarajevo_data_noadtrim %>%
  select(!(c(starts_with("Q25"), starts_with("Q26"), starts_with("Q50"))))
# names(sarajevo_data_nocrmtrim) #Confirms these variables were removed
# write_dta(sarajevo_data_nocrmtrim, here("Data", "sarajevo_data_nocrmtrim.dta"))

      
sarajevo_data_crmtrim <- sarajevo_data_noadtrim %>%
    select((c("ID", starts_with("Q25"), starts_with("Q26"), starts_with("Q50"))))
# names(sarajevo_data_crmtrim) #Confirms these are the only variables in the data
# write_dta(sarajevo_data_crmtrim, here("Data", "sarajevo_data_crmtrim.dta"))

#Calculate proportion match
sarajevo_nocrm_pctmatch <- percentmatch(sarajevo_data_nocrmtrim, idvar = "ID", include_na = TRUE, timing = FALSE) %>%
  rename(ID = id,
         pct_match_nocrm = pm, # appended "nocrm" to indicate data source
         match_id_nocrm = id_m) %>%
  left_join(sarajevo_data_nocrmtrim) %>%
  relocate(pct_match_nocrm, match_id_nocrm, .after = last_col())

sarajevo_crm_pctmatch <- percentmatch(sarajevo_data_crmtrim, idvar = "ID", include_na = TRUE, timing = FALSE) %>%
  rename(ID = id,
         pct_match_crm = pm, # appended "crm" to indicate data source
         match_id_crm = id_m) %>%
  left_join(sarajevo_data_crmtrim) %>%
  relocate(pct_match_crm, match_id_crm, .after = last_col())

Now, let’s examine the descriptive statistics for the maximum percent match variable in each subset and compare them to those generated using our full data set containing all items.

Code
pctmatch_nocrm_tab <- sarajevo_nocrm_pctmatch %>%
  select(pct_match_nocrm) %>%
  report_table() %>%
  mutate(sample = paste0("No Crime/Morality Items", " (", ncol(sarajevo_data_nocrmtrim) - 1, ")")) %>%
  relocate(sample)

pctmatch_crm_tab <- sarajevo_crm_pctmatch %>%
  select(pct_match_crm) %>%
  report_table() %>%
  mutate(sample = paste0("Only Crime/Morality Items", " (", ncol(sarajevo_data_crmtrim) - 1, ")")) %>%
  relocate(sample)

pctmatch_tab <- pctmatch_tab %>%
  mutate(sample = paste0("All Items", " (", ncol(sarajevo_data_noadtrim) - 1, ")")) %>%
  relocate(sample)

pctmatch_crmcomp <- 
  bind_rows(pctmatch_tab, pctmatch_nocrm_tab, pctmatch_crm_tab)

gt(data = pctmatch_crmcomp) %>%
  cols_hide(columns = c(n_Obs, Variable, MAD, percentage_Missing)) %>%
   fmt_number(
    columns = c(Mean, SD, Median, Min, Max, Skewness, Kurtosis),
    decimals = 2,
    use_seps = FALSE) %>%
    cols_label(
    sample = "Sample (# of variables)")
Sample (# of variables) Mean SD Median Min Max Skewness Kurtosis
All Items (237) 0.69 0.12 0.72 0.26 0.88 −1.05 1.11
No Crime/Morality Items (213) 0.66 0.12 0.69 0.25 0.88 −0.87 0.56
Only Crime/Morality Items (24) 0.95 0.11 1.00 0.25 1.00 −3.25 11.95

From this table, it is clear that the crime and morality items impact the summary statistics describing the maximum percent match distribution. Specifically, removing these items lowered the mean maximum percent match from 69% to 66%. Meanwhile, the comparable statistic from the subset containing only crime and morality items is 95%, which is an extremely high degree of row-wise similarity - and is exactly what we expected given the zero-inflated nature of these data.

To get a better sense of these distributions, let’s plot them like we (and KR) did before.

Code
pctmatch_comp_data <- sarajevo_data_pctmatchR %>%
  select(ID, pct_match) %>%
  left_join(sarajevo_nocrm_pctmatch) %>%
  select(ID, pct_match, pct_match_nocrm) %>%
  left_join(sarajevo_crm_pctmatch) %>%
  select(ID, pct_match, pct_match_nocrm, pct_match_crm) %>%
  mutate(pctmatch_nocrmdif = pct_match - pct_match_nocrm,
         pctmatch_crmdif = pct_match - pct_match_crm)

pctmatch_dist <- ggplot(data = sarajevo_data_pctmatchR, aes(x = pct_match)) + 
  geom_histogram(aes(y = ..density..), bins = 30,
                 color = "black", fill = "#440154FF") +
  scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) + 
  scale_y_continuous(limits = c(0, 5)) + 
  labs(x = "All Items") +
  theme_minimal() +
  theme(panel.grid = element_blank())

pctmatch_nocrm_dist <- ggplot(data = sarajevo_nocrm_pctmatch, aes(x = pct_match_nocrm)) + 
  geom_histogram(aes(y = ..density..), bins = 30,
                 color = "black", fill = "#21908CFF") +
  scale_x_continuous(limits = c(0.2, 0.9), breaks = seq(.2, .9, .1)) + 
  scale_y_continuous(limits = c(0, 5)) + 
  labs(x = "No Crime/Morality") +
  theme_minimal() +
  theme(panel.grid = element_blank())

pctmatch_crm_dist <- ggplot(data = sarajevo_crm_pctmatch, aes(x = pct_match_crm)) + 
  geom_histogram(aes(y = ..density..), bins = 30,
                 color = "black", fill = "#FDE725FF") +
  scale_x_continuous(limits = c(0.2, 1.1), breaks = seq(.2, 1, .1)) + 
  labs(x = "Only Crime/Morality") +
  theme_minimal() +
  theme(panel.grid = element_blank())

pctmatch_dist + pctmatch_nocrm_dist + pctmatch_crm_dist +
  plot_annotation(
  title = 'Distribution of Maximum Percent Match by Variables Included')

A careful examination of the first two plots reveals that removal of the crime and morality items causes the distribution of maximum percent match values to shift somewhat toward lower values. The third plot helps us to visualize why this happens - the distribution in the data subset containing only crime and morality items is shifted dramatically towards higher values of maximum percent match. Note that this plot has a substantially different y-axis scale: the distribution is literally "off the charts" compared to the other two distributions!

We can also identify how many observations in each data (sub)set are above KR's potentially problematic (>80%) and "likely fraud" (>85%) thresholds.

Code
gt85full <- pctmatch_comp_data %>% filter(round(pct_match, digits = 2) >= .85) %>% nrow()
gt80full <- pctmatch_comp_data %>% filter(round(pct_match, digits = 2) >= .80) %>% nrow()


gt85nocrm <- pctmatch_comp_data %>% filter(round(pct_match_nocrm, digits = 2) >= .85) %>% nrow()
gt80nocrm <- pctmatch_comp_data %>% filter(round(pct_match_nocrm, digits = 2) >= .80) %>% nrow()

gt85crm <- pctmatch_comp_data %>% filter(round(pct_match_crm, digits = 2) >= .85) %>% nrow()
gt80crm <- pctmatch_comp_data %>% filter(round(pct_match_crm, digits = 2) >= .80) %>% nrow()

ntot <- pctmatch_comp_data %>% nrow() # total number of observations (822)
  
neardup_obs <- tribble(
  ~Data, ~gt85, ~flag85, ~gt80, ~flag80, 
  "All items (237)", gt85full, (gt85full/ntot)*100, gt80full, (gt80full/ntot)*100, 
  "No crime/morality items (213)", gt85nocrm, (gt85nocrm/ntot)*100, gt80nocrm, (gt80nocrm/ntot)*100, 
  "Only crime/morality items (24)", gt85crm, (gt85crm/ntot)*100, gt80crm, (gt80crm/ntot)*100
)

neardup_obs %>%
  relocate(c(gt80, flag80), .before = gt85) %>%
gt() %>%
    fmt_number(
      columns = c(flag85, flag80), 
      decimals = 1
    ) %>%
    cols_label(
      Data = "Sample (# of variables)", 
      gt85 = "n",
      gt80 = "n", 
      flag85 = "%",
      flag80 = "%",
      .fn = md) %>% 
  tab_spanner(label = ">85% match", columns = c(gt85,flag85)) %>% 
  tab_spanner(label = ">80% match", columns = c(gt80,flag80)) %>% 
  tab_header(title = md("**Number of observations flagged as near duplicates (N=822)**"))
Number of observations flagged as near duplicates (N=822)
Sample (# of variables) >80% match >85% match
n % n %
All items (237) 128 15.6 25 3.0
No crime/morality items (213) 68 8.3 10 1.2
Only crime/morality items (24) 755 91.8 731 88.9

As the table above shows, the full data had 25 observations above the 85% cutoff and 128 above the 80% cutoff. The data without the crime and morality items has 10 and 68 observations above the 85% and 80% cutoffs, respectively. The data that includes only the crime and morality items includes 731 observations above the 85% cutoff and 755 above the 80% cutoff. Moreover, though the crime and morality items account for only 24 of the 237 items (about 10%) in the percent match analysis, their inclusion in the data approximately doubles the number of observations flagged as potentially problematic at the 80% and 85% maximum match thresholds!

Overall, this clearly illustrates how items we know to be highly clustered or skewed (e.g., zero-inflated) affect maximum percent match calculations and detection of near duplicates. In fact, if our survey simply included just the crime and morality items, then virtually all of our observations would have been identified as near-duplicates using KR’s criteria!

Given these findings, it might make sense to approach highly skewed (e.g., zero-inflated) items the same way KR approach missing data: by removing them in the pre-processing stage prior to near duplicate analysis. That is, recall that KR removed any variables with more than 10% missing (e.g., items missing due to skip patterns), since clustering on missingness would inflate percent match values and near duplicate detection. Likewise, if a large proportion of respondents provide similar response values on an item for potentially legitimate reasons, then that item is a potentially invalid and biased indicator of near duplication. We can see where this takes us by expanding our analysis to other items in the data that exhibit highly skewed or clustered response distributions.
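Schematically, such a pre-processing screen might look like the sketch below. The `screen_items()` helper is an illustrative name of our own, and the 75% modal-concentration cutoff is as arbitrary as KR's 10% missingness rule.

Code
# Sketch: drop items with too much missingness (per KR) or too much clustering
# in a single response value before computing maximum percent match
screen_items <- function(data, id = "ID", max_missing = 0.10, max_modal = 0.75) {
  keep <- sapply(data, function(x) {
    tab <- table(x) # tabulate non-missing responses
    mean(is.na(x)) <= max_missing && max(tab) / length(x) < max_modal
  })
  keep[id] <- TRUE # always retain the ID variable for percentmatch()
  data[keep]
}

# e.g., sarajevo_screened <- screen_items(sarajevo_data_noadtrim)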

Skewed or clustered response distributions in general bias near duplicate detection

Let’s start by identifying other items in the data that also exhibit high concentration of responses in a particular response value. We will start by identifying those items in which 75% or more of respondents answer in the modal category.13

Code
pct = function(numer, denom, digits = 0) {
  paste0(round(((numer/denom) *100), digits = digits), "%")
}

all_modes <- mode_percentage(data = haven::zap_labels(sarajevo_data_pctmatchR)) 
# use `zap_labels` to remove value labels from data before running function (was leading to an error).

sub_vars <- sarajevo_data_pctmatchR %>%
  select(-c(ID, pct_match, match_id)) %>%
  ncol()

m75 <- all_modes %>%
  filter(round(percentage_mode, digits = 0) >= 75) %>%
  nrow()

m80 <- all_modes %>%
  filter(round(percentage_mode, digits = 0) >= 80) %>%
  nrow()

m85 <- all_modes %>%
  filter(round(percentage_mode, digits = 0) >= 85) %>%
  nrow()

m90 <- all_modes %>%
  filter(round(percentage_mode, digits = 0) >= 90) %>%
  nrow()

Using our arbitrary 75% criterion, 96 items exhibit substantial clustering, meaning that 75% or more of the 822 respondents gave the same response value on each of these items. This equates to about 41% of the 234 substantive items in the analysis.

As we would expect, increasing our criterion results in identification of fewer substantially clustered items. For example, using 80%, 85%, and 90% criteria results in 32%, 23%, and 11% of our substantive items, respectively, being identified as substantially clustered in a single response value.
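These percentages simply divide each flagged-item count by the 234 substantive items (the sub_vars object computed above), for example:

Code
# Percent of the 234 substantive items flagged at each concentration criterion
round(100 * c(m75 = m75, m80 = m80, m85 = m85, m90 = m90) / sub_vars)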

It is likely that most of these items relate to crime and/or morality. We can check by simply filtering the data by the 75% criterion and creating a data summary for just those variables from the main data set.

Code
m75_modes <- all_modes %>%
  filter(round(percentage_mode, digits = 0) >= 75)

m75_vars <- all_modes %>%
  filter(round(percentage_mode, digits = 0) >= 75) %>%
  pull(variable) 

sarajevo_data_m75 <- sarajevo_data_pctmatchR %>%
  select(all_of(m75_vars)) %>%
  sjlabelled::get_label() %>%
  as_tibble_row() %>%
  pivot_longer(everything(), names_to = "variable", values_to = "label") %>%
  left_join(m75_modes) %>%
  select(-modes) 

gt(sarajevo_data_m75) %>%
  tab_options(container.height = px(500))
variable label mode percentage_mode
Q10_10 A person of a different ethnicity being nice to your face, but saying bad things behind back. 1 74.57421
Q10_11 Engaging in a dispute with someone of a different ethnicity 1 78.22384
Q10_12 Being victimized by someone of a different ethnicity 1 78.22384
Q13_21 I would feel like I’d be losing an important part of myself if I lost a very close friend 1 77.61557
Q13_38 There is a great deal of corruption in Bosnia 1 82.96837
Q14_11 Engage in risky or dangerous sexual behavior 1 81.38686
Q14_14 Cheat on your spouse or significant other 1 88.68613
Q14_17 Accidentally leave a store with something and did not go back to pay for it 1 91.84915
Q14_19 Engaged in a peaceful demonstration or protest 1 76.39903
Q14_20 Participated in a riot or violent protest 1 91.97080
Q15_4 Because that is the type of person I am 5 76.27737
Q15_5 Because I feel like it is the right thing to do 5 80.65693
Q16_5 Because I feel like it is the right thing to do 5 76.27737
Q17_4 Because that is the type of person I am 5 74.69586
Q17_5 Because I feel like it is the right thing to do 5 77.98054
Q18_4 Because that is the type of person I am 5 82.96837
Q18_5 Because I feel like it is the right thing to do 5 85.64477
Q19_1 Q19. I would feel very guilty [a sense of remorse, personal regret, or emotional pain] if I… Broke a promise or commitment to a friend or family member 1 77.61557
Q19_4 Lied about something important to a person I care about 1 74.69586
Q19_5 Cheated on my spouse or significant other 1 84.54988
Q19_8 Accidentally left a store with something and did not go back to pay for it 1 82.23844
Q19_9 Neglected an important duty, such as picking up children from school 1 79.31873
Q20_1 Q20. I would feel very ashamed if my family and close friends knew that I… Broke a promise or commitment to a friend or family member 1 77.00730
Q20_4 Lied about something important to them 1 75.79075
Q20_5 Cheated on my spouse or significant other 1 84.06326
Q20_8 Accidentally left a store with something and did not go back to pay for it 1 81.26521
Q20_9 Neglected an important duty, such as picking up children from school 1 79.80535
Q22_6 Borrow or steal from a family member without their knowledge 6 85.88808
Q22_7 Steal or embezzle from your work 6 87.34793
Q22_8 Shoplift, steal, or defraud a stranger 6 87.46959
Q23_5 Threaten to fight him 6 81.63017
Q23_6 Approach him and then hit or shove him 6 85.03650
Q23_7 Threaten to use a weapon on him 6 87.83455
Q23_8 Get a weapon and use it on him 6 88.56448
Q24_1 Q24. How guilty would you feel if you contemplated or thought about… Stealing from a family member 5 88.32117
Q24_2 Stealing or embezzling from your work 5 89.65937
Q24_3 Stealing or shoplifting from a stranger 5 90.02433
Q24_4 Threatening someone who angers you 5 80.53528
Q24_5 Hitting someone who angers you 5 82.48175
Q24_6 Using a weapon on someone that angers you 5 89.90268
Q24_7 Using marijuana or other illegal drugs. 5 86.86131
Q24_8 Selling marijuana or other illegal drugs. 5 89.05109
Q25_1 Q25. How often did you do the following in the past two years? Took money or property that didn’t belong to you worth less than 5 BAM. 1 84.67153
Q25_2 Took money or property that didn’t belong to you worth 5 BAM or more. 1 90.38929
Q25_3 Offered a bribe to police or other official (money or other incentive so they would do something they were not supposed to do). 1 85.27981
Q25_4 Threatened to use violence on someone else. 1 86.73966
Q25_5 Physically harmed someone else on purpose. 1 93.43066
Q25_6 Used marijuana or other illegal drug. 1 93.55231
Q25_7 Sold marijuana or other illegal drug. 1 97.08029
Q25_8 Attempt to access another person’s private information, such as a bank account or computer files, without his or her knowledge or permission 1 95.98540
Q26_1 Q26. What are the chances that you might do the following in the future? Take money or property that didn’t belong to you worth less than 5 BAM. 1 82.60341
Q26_2 Take money or property that didn’t belong to you worth 5 BAM or more. 1 86.00973
Q26_3 Offer a bribe to police or other official (money or other incentive so they will do something they are not supposed to do). 1 78.46715
Q26_4 Threaten to use violence on someone else. 1 85.88808
Q26_5 Physically harm someone else on purpose. 1 92.09246
Q26_6 Use marijuana or other illegal drug. 1 92.70073
Q26_7 Sell marijuana or other illegal drug. 1 95.49878
Q26_8 Attempt to access another person’s private information, such as a bank account or computer files, without his or her knowledge or permission 1 94.28224
Q27_1 Q27. How rewarding (beneficial, gratifying, useful) do you think it is to… Take money or property that didn’t belong to you worth less than 5 BAM. 1 83.69830
Q27_2 Take money or property that didn’t belong to you worth 5 BAM or more. 1 85.27981
Q27_3 Offer a bribe to police or other official (money or other incentive so they will do something they are not supposed to do). 1 80.29197
Q27_4 Threaten to use violence on another person. 1 85.64477
Q27_5 Physically harm another person on purpose. 1 89.53771
Q27_6 Use marijuana or other illegal drug. 1 88.56448
Q27_7 Sell marijuana or other illegal drug. 1 90.26764
Q27_8 Attempt to access another person’s private information, such as a bank account or computer files, without his or her permission. 1 91.60584
Q28_2 Take money or property that didn’t belong to you worth 5 BAM or more. 5 76.39903
Q28_4 Threaten to use violence on someone else. 5 75.54745
Q28_5 Physically harm someone else on purpose. 5 85.88808
Q28_6 Use marijuana or other illegal drug. 5 83.57664
Q28_7 Sell marijuana or other illegal drug. 5 88.32117
Q28_8 Attempt to access another person’s private information, such as a bank account or computer files, without his or her permission. 5 86.61800
Q29_1 Q29. Would most of your close family members consider it to be morally wrong if you were to… Take money or property that didn’t belong to you worth less than 5 BAM. 5 74.69586
Q29_2 Take money or property that didn’t belong to you worth 5 BAM or more. 5 83.45499
Q29_3 Offer a bribe to police or other official (money or other incentive so they will do something they are not supposed to do). 5 80.29197
Q29_4 Threaten to use violence on someone else. 5 85.03650
Q29_5 Physically harm someone else on purpose. 5 90.14599
Q29_6 Use marijuana or other illegal drug. 5 89.90268
Q29_7 Sell marijuana or other illegal drug. 5 90.99757
Q29_8 Attempt to access another person’s private information, such as a bank account or computer files, without his or her knowledge or permission 5 89.05109
Q30_2 Take money or property that didn’t belong to you worth 5 BAM or more. 5 75.42579
Q30_4 Threaten to use violence on someone else. 5 76.15572
Q30_5 Physically harm someone else on purpose. 5 82.60341
Q30_6 Use marijuana or other illegal drug. 5 82.84672
Q30_7 Sell marijuana or other illegal drug. 5 85.27981
Q30_8 Attempt to access another person’s private information, such as a bank account or computer files, without his or her knowledge or permission 5 85.40146
Q4 Q4. What best describes the family that raised you? 1 90.87591
Q50_1 Q50. How morally acceptable is it to you to… Take money or property that didn’t belong to you worth less than 5 BAM. 1 81.38686
Q50_2 Take money or property that didn’t belong to you worth 5 BAM or more. 1 87.10462
Q50_3 Offer a bribe to police or other official (money or other incentive so they will do something they are not supposed to do). 1 82.60341
Q50_4 Threaten to use violence on someone else. 1 88.19951
Q50_5 Physically harm someone else on purpose. 1 92.94404
Q50_6 Use marijuana or other illegal drug. 1 92.21411
Q50_7 Sell marijuana or other illegal drug. 1 95.01217
Q50_8 Attempt to access another person’s private information, such as a bank account or computer files, without his or her knowledge or permission 1 94.89051
Q6 Q6. Have you ever cohabitated with a romantic partner (that is, lived with while not married to him/her)? 2 86.49635

As expected, most of these questions are in reference to crime or morality topics - but not all of them. For example, the “Q10” items ask respondents how often they “stress or worry about” different things. Within this set, the items that exhibit substantial clustering (i.e., >75% in a single response value) are Q10_10 through Q10_12, which specifically ask about stress or worry due to interactions with someone of a different ethnicity.14

Using this broader set of items identified as exhibiting substantial response clustering, we can again compare the maximum percentage match patterns with those found in the entire data set and the various subsets examined above.

Code
#Create data set with only variables where 75% of responses are the same answer (and ID variable)
sarajevo_data_m75vars <- sarajevo_data_noadtrim %>%
  select(all_of(c("ID", m75_vars)))

  #names(sarajevo_data_m75vars) #Confirms these variables (plus ID) are the only variables in the data

#Create data set with only variables where less than 75% of responses are the same answer
sarajevo_data_nom75vars <- sarajevo_data_noadtrim %>%
  select(-all_of(m75_vars))

  #names(sarajevo_data_nom75vars)

sarajevo_m75_pctmatch <- percentmatch(sarajevo_data_m75vars, idvar = "ID", include_na = TRUE, timing = FALSE) %>%
  rename(ID = id,
         pct_match_m75 = pm,
         match_id_m75 = id_m) %>%
  left_join(sarajevo_data_noadtrim) %>%
  relocate(pct_match_m75, match_id_m75, .after = last_col())

sarajevo_nom75_pctmatch <- percentmatch(sarajevo_data_nom75vars, idvar = "ID", include_na = TRUE, timing = FALSE) %>%
  rename(ID = id,
         pct_match_nom75 = pm,
         match_id_nom75 = id_m) %>%
  left_join(sarajevo_data_noadtrim) %>%
  relocate(pct_match_nom75, match_id_nom75, .after = last_col())

As before let’s first examine the descriptive statistics for the maximum percent match variable.

Code
pctmatch_m75_tab <- sarajevo_m75_pctmatch %>%
  select(pct_match_m75) %>%
  report_table() %>%
  mutate(sample = paste0("75% or more concentration", " (", ncol(sarajevo_data_m75vars) - 1, ")")) %>%
  relocate(sample)

pctmatch_nom75_tab <- sarajevo_nom75_pctmatch %>%
  select(pct_match_nom75) %>%
  report_table() %>%
  mutate(sample = paste0("Less than 75% concentration", " (", ncol(sarajevo_data_nom75vars) - 1, ")")) %>%
  relocate(sample)

pctmatch_m75comp <- 
  bind_rows(pctmatch_tab, pctmatch_nocrm_tab, pctmatch_crm_tab, pctmatch_nom75_tab, pctmatch_m75_tab)

gt(data = pctmatch_m75comp) %>%
  cols_hide(columns = c(n_Obs, Variable, MAD, percentage_Missing)) %>%
   fmt_number(
    columns = c(Mean, SD, Median, Min, Max, Skewness, Kurtosis),
    decimals = 2,
    use_seps = FALSE) %>%
    cols_label(
    sample = "Sample (# of variables)")
Sample (# of variables) Mean SD Median Min Max Skewness Kurtosis
All Items (237) 0.69 0.12 0.72 0.26 0.88 −1.05 1.11
No Crime/Morality Items (213) 0.66 0.12 0.69 0.25 0.88 −0.87 0.56
Only Crime/Morality Items (24) 0.95 0.11 1.00 0.25 1.00 −3.25 11.95
Less than 75% concentration (141) 0.57 0.11 0.58 0.27 0.82 −0.27 −0.43
75% or more concentration (96) 0.90 0.13 0.95 0.31 1.00 −1.99 4.24

As we saw with the "no crime and morality items" subset, the items with less than 75% concentration in one response value have much lower maximum percent match values than both the full item set and the clustered item subsets. Now, let's plot the distributions.

Code
pctmatch_comp_data <- pctmatch_comp_data %>%
  left_join(sarajevo_m75_pctmatch) %>%
  select(ID, starts_with("pct")) %>%
  left_join(sarajevo_nom75_pctmatch) %>%
  select(ID, starts_with("pct")) %>%
  mutate(pctmatch_m75dif = pct_match - pct_match_m75,
         pctmatch_nom75dif = pct_match - pct_match_nom75) %>%
  relocate(c(pct_match_m75, pct_match_nom75), .before = pctmatch_nocrmdif)

pctmatch_m75_dist <- ggplot(data = sarajevo_m75_pctmatch, aes(x = pct_match_m75)) + 
  geom_histogram(aes(y = ..density..), bins = 30,
                 color = "black", fill = "#3B528BFF") +
  scale_x_continuous(limits = c(.2, 1.05), breaks = seq(.2, 1, .1)) + # widened beyond .9 so values up to 1.00 are not dropped
  labs(x = "75% or more concentration") +
  theme_minimal() +
  theme(panel.grid = element_blank())

pctmatch_nom75_dist <- ggplot(data = sarajevo_nom75_pctmatch, aes(x = pct_match_nom75)) + 
  geom_histogram(aes(y = ..density..), bins = 30,
                 color = "black", fill = "#5DC863FF") +
  scale_x_continuous(limits = c(.2, .9), breaks = seq(.2, .9, .1)) + 
  labs(x = "Less than 75% concentration") +
  theme_minimal() +
  theme(panel.grid = element_blank())

pctmatch_dist + pctmatch_nocrm_dist + pctmatch_crm_dist + pctmatch_nom75_dist + pctmatch_m75_dist +
  plot_annotation(
  title = 'Distribution of Maximum Percent Match by Variables Included')

The plot above shows the distributions of the maximum percent match score for different subsets of items. The distributions for the "No Crime/Morality" and "Less than 75% concentration" subsets have shifted towards lower maximum percent match values (i.e., to the left) compared to the full data with all items, which is what we had expected. Interestingly, removing the crime/morality items and the other items with 75% or more concentration in a single response value - that is, removing the items with high degrees of response clustering - results in a distribution that more closely resembles the Gumbel extreme value distribution that KR suggested we should expect in the absence of fraudulent near duplication.
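To eyeball that comparison, one could overlay a method-of-moments Gumbel fit on the "less than 75% concentration" histogram. This is a rough visual check under our assumptions, not a formal goodness-of-fit test; base R has no Gumbel density, so we write one out.

Code
# Rough visual check: overlay a Gumbel density fit by the method of moments
dgumbel <- function(x, loc, scale) {
  z <- (x - loc) / scale
  exp(-(z + exp(-z))) / scale # density of the (maximum) Gumbel distribution
}

pm <- sarajevo_nom75_pctmatch$pct_match_nom75
scale_hat <- sd(pm, na.rm = TRUE) * sqrt(6) / pi # method-of-moments scale estimate
loc_hat <- mean(pm, na.rm = TRUE) - 0.5772 * scale_hat # 0.5772 = Euler-Mascheroni constant

pctmatch_nom75_dist +
  stat_function(fun = dgumbel, args = list(loc = loc_hat, scale = scale_hat))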

Code
gt85nom75 <- sarajevo_nom75_pctmatch %>% filter(round(pct_match_nom75, digits = 2) >= .85) %>% nrow()
gt80nom75 <- sarajevo_nom75_pctmatch %>% filter(round(pct_match_nom75, digits = 2) >= .80) %>% nrow()

gt85m75 <- sarajevo_m75_pctmatch %>% filter(round(pct_match_m75, digits = 2) >= .85) %>% nrow()
gt80m75 <- sarajevo_m75_pctmatch %>% filter(round(pct_match_m75, digits = 2) >= .80) %>% nrow()

neardup_obs2 <- neardup_obs %>% 
  add_row(
    Data = c("Less than 75% concentration (141)", "75% or more concentration (96)"),
    gt85 = c(gt85nom75, gt85m75),
    flag85 = c((gt85nom75/ntot)*100, (gt85m75/ntot)*100),
    gt80 = c(gt80nom75, gt80m75), 
    flag80 = c((gt80nom75/ntot)*100, (gt80m75/ntot)*100)
  )

neardup_obs2 %>%
  relocate(c(gt80, flag80), .before = gt85) %>%
gt() %>%
    fmt_number(
      columns = c(flag85, flag80), 
      decimals = 1
    ) %>%
    cols_label(
      Data = "Sample (# of variables)", 
      gt85 = "n",
      gt80 = "n", 
      flag85 = "%",
      flag80 = "%",
      .fn = md) %>% 
  tab_spanner(label = ">85% match", columns = c(gt85,flag85)) %>% 
  tab_spanner(label = ">80% match", columns = c(gt80,flag80)) %>% 
  tab_header(title = md("**Number of observations flagged as near duplicates (N=822)**"))
Number of observations flagged as near duplicates (N=822)
Sample (# of variables) >80% match >85% match
n % n %
All items (237) 128 15.6 25 3.0
No crime/morality items (213) 68 8.3 10 1.2
Only crime/morality items (24) 755 91.8 731 88.9
Less than 75% concentration (141) 6 0.7 0 0.0
75% or more concentration (96) 677 82.4 615 74.8
Code
sarajevo_data_stratmatch <- sarajevo_data_stratmatch %>%
  full_join(sarajevo_nocrm_pctmatch) %>%
  full_join(sarajevo_crm_pctmatch) %>%
  full_join(sarajevo_nom75_pctmatch) %>%
  full_join(sarajevo_m75_pctmatch)

save(sarajevo_data_stratmatch, file = here("Data", "sarajevo_data_stratmatch.RData"))
#save.image(file = here("Data", "duplicates.RData"))

As we saw previously with the crime and morality analysis, highly skewed or clustered response distributions pose serious problems for forensic detection of fraud via near duplication. Let's start with those 96 highly skewed items that exhibited insufficient variation due to substantial response clustering (i.e., "75% or more concentration"). Using the maximum percent match metric and KR's 85% match criterion for identifying "likely fraud" cases, we would flag a whopping 615 observations, or 74.8% of the sample, as above the 85% KR threshold. Using the more liberal criterion uncovers even more near duplicates, with 677 (82.4%) observations identified above the 80% threshold.

Meanwhile, removing these highly skewed items with 75% or more concentration in a single response value results in a drastic reduction of the number of observations identified as “likely fraud” due to near duplication. In fact, among the 141 items with sufficient variability (i.e., “less than 75% concentration”), not a single observation (0) was identified as potentially problematic using KR’s 85% “likely fraud” threshold, and only 6 were flagged using the more liberal 80% threshold.

Mystery solved? Closing the case on Interviewer 64.

Now that we know how common characteristics of criminological data inflate maximum percent match metrics, let's revisit the elusive question of whether Interviewer #64's data are "likely falsified" as potentially indicated by their interviewer-specific percent match distribution. Below we have reproduced two of the figures presented above using only the items identified as having sufficient variability for meaningful near duplication analysis, which we defined here as the 141 variables exhibiting less than 75% case-level clustering in one response value.
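The figure code mirrors the earlier municipality-by-interviewer jitter plot; here is a sketch of how the first figure can be rerun on the restricted metric, using the pct_match_nom75 column that was merged into sarajevo_data_stratmatch in the previous code chunk.

Code
# Sketch: rerun the municipality-by-interviewer jitter plot on the restricted
# metric (pct_match_nom75, computed from the 141 sufficiently variable items)
ggplot(data = sarajevo_data_stratmatch) + 
  geom_jitter(aes(x = pct_match_nom75, y = muni_label, 
                  color = reorder(int_label, int_count, decreasing = TRUE)), alpha = .5,
              position = position_jitter(width = 0, height = .35, seed = 7734)) + 
  geom_vline(xintercept = c(0.80, 0.85), color = c("grey", "black"), linetype = 2, linewidth = 1) +
  scale_color_manual(values = colorpal) + 
  labs(title = "Maximum Percent Match (141 Items) by Municipality & Interviewer",
       y = "Municipality", x = "Maximum Percent Match", color = "Interviewer") +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  theme_minimal() +
  theme(panel.grid = element_blank())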

[Figures: No cases exceed 85% maximum percent match, and Interviewer #64 is less of an outlier, when analysis is limited to items with sufficient variation (n=141 variables).]

Recall that, when limited to these 141 items, none of our cases had percent match values exceeding KR’s 85% threshold, which is apparent again in these figures. The figures also show that Interviewer #64 still has higher overall percent match values than the other interviewers in their municipality; however, their interviewer-specific percent match distribution is now much more in line with those of the other interviewers. It seems plausible, then, that Interviewer #64 simply was assigned to relatively homogeneous areas and, as such, interviewed a greater number of respondents reporting in those clustered response categories. Nonetheless, we are still likely to flag this interviewer’s data for sensitivity analyses in future research because, well, we told you that we have trust issues.

Conclusion: Know Thy Data and, When Using Heuristics, Choose Wisely

Well, that was a long and somewhat repetitive - or dare we say duplicative - journey. At the end of it, we now have the skills to identify exact and near duplicates in our data. We have also developed a sense of what types of additional analyses might be useful when initial analyses identify near duplicate cases (e.g., maximum percent match by interviewer and across research design strata), and we have identified some potential sources of bias to watch out for and account for (e.g., inflated estimates of near duplicates due to items with skewed or clustered response distributions). We intend to add these forensic tasks early in our workflow when collecting and analyzing new data sources, and we hope you will consider doing so too.
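If you want an easy starting point for such checks, the janitor package offers a convenient one-liner for exact duplicates (a sketch, assuming your items sit in a data frame with ID columns like the hypothetical id and interviewer below; we did not use janitor in this post):

library(dplyr)
library(janitor)

# Flag exact duplicate rows, ignoring ID variables that should differ by design
sarajevo_items %>%
  select(-any_of(c("id", "interviewer"))) %>%
  get_dupes()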

Initially, those near duplicate cases we identified as above KR’s 85% threshold gave us some pause. If you find some in your data, they should give you pause too. However, we hope you now have a better understanding of the maximum percent match metric and its uses as well as its limitations. Additionally, we urge you to remember our warnings about using KR’s arbitrary 85% threshold to make strong determinations about the presence of “likely fraud” in data. Like any heuristic, it is bound to fail in certain situations. For additional examples of such situations, we urge you to check out Simmons and colleagues’ response paper. We pulled the following two quotes from their paper to reinforce our warning here:

“[KR’s 85%] threshold based on simulations with synthetic data is not relevant for what we should see in real-world data.” (p.332)

“…it is possible to obtain any value of the maximum percent match statistic in non-falsified data, depending on the survey parameter. Thus, setting a threshold for the statistic and applying it uniformly across surveys is a flawed approach for detecting falsification. In fact, eliminating respondents from a dataset based on this measure may introduce selection bias into survey data and serve to reduce data quality, rather than improve it.” (p.335)

In criminology, we often collect survey data containing a substantial number of items with highly skewed and clustered response distributions (e.g., zero-inflated crime and morality items). This research context appears to be a situation in which such heuristics might fail to generate reliable inferences, perhaps more often than not. In our view, this should not discourage criminologists from using these tools to better understand our data. Rather, we encourage learning and using these tools and their corresponding decision criteria as they were likely intended - as methods and heuristics that help us summarize and sometimes raise questions worth answering about our data.

Generally, when we detect a near duplicate - or any “significant” statistical value - we see this as the starting point rather than the end of a journey toward asking, and attempting to answer, important questions about our data. Then, as we learn more about our methods and data, we might find ourselves modifying heuristics, or even ditching them altogether, if they are not particularly helpful in our specific case. Remember, binary heuristics like statistical thresholds help us make important decisions. Yet those decisions, like the data upon which they are based, often carry non-negligible degrees of uncertainty, and transparent communication and visualization of that uncertainty are also essential for good decision-making.

For instance, after searching for near duplicates in our BiH data, we feel much more confident moving forward with analyses under the assumption that fraud via near duplication is not likely to be much, if any, of a concern in these data.15 Despite our and others’ criticisms, KR’s thresholds were helpful to us in this process. They motivated us to ask important questions about our data, then to try answering them with our data, and ultimately to learn a great deal more about our data. Yet, despite our increased confidence - which stems from knowledge and a corresponding reduction in uncertainty about our data - we still do not have a definitive answer about the presence or absence of fraud via near (or other) duplication in our data, and we never will. Much of the remaining uncertainty stems from research design features that we can no longer do anything about, such as the fact that others collected our data for us and we were unable to directly supervise that collection. This leads us to our final concluding point: there is no simple heuristic substitute for the careful design, collection, description, modeling, and dissemination of one’s data.

Make good decisions by learning as much as you can about your data. When making decisions, if you use thresholds or heuristics to choose, then choose wisely.

Footnotes

  1. Funding for data collection came from three sources: (1) Collaborative Research Award, Institute for Advanced Study, Indiana University; (2) Faculty Research Grant, Missouri State University; (3) Faculty Research Initiative, College of Arts and Sciences, University of North Carolina-Wilmington.↩︎

  2. Interestingly, this type of exact duplication apparently was involved in criminology’s “Stewart Retractions” and, in fact, was not caught (at least not publicly) for a long time by Stewart’s co-authors - nor by anyone else in the field of criminology, for that matter. Of course, without access to the data, even exact duplication would be difficult to identify. Speaking of which, maybe social scientists should be required to sign a pledge saying that they have seen the data and checked for obvious indications of fraud such as exact duplication before publishing future papers. After all, scientific research shows that signing honesty pledges before filing things like insurance claims has been shown to… Oh, wait, well damn. Never mind. Let’s just keep doing what we’ve been doing.↩︎

  3. The only time I use Stata these days is when working with Master’s students. I may be the only faculty member in my department who regularly uses R, so students often are taught data analysis with Stata or SPSS.↩︎

  4. Long story short: my function generally did produce similar results, with a couple exceptions. First, there appeared to be a small rounding discrepancy at the seventh decimal place in the percent match metric between my R function and KR’s Stata function. Second, there were some differences in which observation was identified as the closest match for the focal observation. I assume this difference results from how the different functions handle cases where more than one observation shares the maximum percent match score with the focal observation.↩︎

  5. KR note that demographic variables are used in sampling designs and should approximate population parameters, and thus are less apt to be erroneously duplicated - see footnote 9 on p. 286 of KR’s article.↩︎

  6. I actually asked ChatGPT to change the code above to calculate row means instead of column means. Although it initially gave me code to calculate column means, with simple prodding (I had to tell it to use tidyverse map and to give me row means instead of column means) it produced the following code: sarajevo_rowmiss <- map_dbl(seq_len(nrow(sarajevo_data_noad)), ~mean(is.na(sarajevo_data_noad[.x,])))↩︎

  7. On p. 286 of their article, they mention that the 10% cutoff for questions was meant to avoid variables that are part of skip patterns, and that the 25% cutoff for observations was to “minimize the risk of having break offs skew the analysis and overstate the level of similarity between observations.”↩︎

  8. The Gumbel distribution appears to be a common distribution used in sciences where predicting maximum values is important. To quote directly from Wikipedia: “This distribution might be used to represent the distribution of the maximum level of a river in a particular year if there was a list of maximum values for the past ten years. It is useful in predicting the chance that an extreme earthquake, flood or other natural disaster will occur.”↩︎

  9. We also relied on a tutorial on extreme value distributions by Naresh Devineni.↩︎

  10. Within each neighborhood cluster, interviewers were sent to a randomly selected address as a starting point. From there, households were selected with a random walk technique and right hand principle where contact was attempted at every third household. Within each household contacted, the last birthday method was used to select one respondent from among all permanent members of the household older than 18.↩︎

  11. We might even expect this zero-inflation to be more pronounced in our surveys given the cultural context. For instance, Bosnia and Herzegovina and particularly Sarajevo populations are predominantly Muslim, and Islamic contexts tend to have lower crime rates. Additionally, our sample is comprised of adult respondents, and adults tend to self-report substantially lower levels of engagement in criminal behavior compared to teenagers and young adults.↩︎

  12. For respondents who could not read, the protocols called for the interviewers to read the questions but have respondents answer them privately and seal their answers in an envelope.↩︎

  13. Why 75%? Well, why not? We chose it because 75% is a round number that is close to the 78% minimum modal concentration identified in the crime and morality items, which were items that we expected a priori to be zero-inflated and potentially biasing to near duplicate detection. Hey, we are making this up as we go! If it feels arbitrary, that is because it is. We will return to the issue of using arbitrary thresholds and heuristics to make decisions later in the post.↩︎

  14. The lack of variation on these sensitive questions could reflect a variety of causes, such as potentially high degrees of intraethnic homophily in interactions, low levels of perceived overt ethnic bias, or strong social desirability reporting biases on these items in the BiH sample.↩︎

  15. Of course, this does not rule out other sophisticated forms of fraud, such as simulation-based data fabrication by a researcher or survey research organization. We told you that we had trust issues…↩︎