Trust Issues: Examining Near Duplicates in Survey Data

survey research
Do you know how to detect exact or near duplicate rows in your data? Read on to learn more!

Jake Day

Jon Brauer

Maja Kotlaja


October 10, 2023

Near duplicates in survey data: Like “Multiplicity” but without the humor

Have you been duped by data duplicates?

Back in 2016, Andrew Gelman blogged about a study by Kuriakose and Robbins (2016) that investigated “near duplication” of survey responses across 1,000+ public opinion surveys as a way to discover potential data fraud. Around that time, we (Jake, Jon, and Maja) were working with a professional international research organization to collect household interview data in Belgrade, Serbia. Reading the post and study immediately struck fear in us. How would we know if the data we collected were legitimate? Jon had contracted with professional research organizations to collect survey data in the past and, as he had done before, we did our due diligence in identifying potential contractors and generally making sure they were above board. For instance, we made sure to get multiple quotes, checked each organization’s references and body of work, and had detailed exchanges about sampling and interview methods before finally selecting and entering into a contractual agreement with an organization (MASMI-Belgrade in Serbia). Meanwhile, the survey organization itself had also instilled confidence in us by doing their own data quality checks, and they had even thrown out one interviewer’s responses due to data quality concerns.

At some point, we had to assume that we were doing all that one could reasonably be expected to do - that is, aside from traveling to Serbia to directly oversee the survey organization and each interviewer ourselves. Like so many other researchers in similar situations domestically and internationally, eventually we had to trust that we had hired an experienced professional organization that knew what it was doing and would carry out the work professionally. Yet, here’s the thing: I (Jake) generally am not a very trusting person. Maybe it is more accurate to say that I am a “trust, but verify” kind of person. This was my first time collecting international survey data, and I had also heard Jon’s horror stories about the extra (and perhaps ethnocentric) scrutiny that reviewers often apply to manuscripts containing analyses of international survey data. Also, though fraud can happen anywhere, the Kuriakose and Robbins (2016) study we mentioned above actually found that near duplicate survey observations were more common in data from non-OECD country samples than in the OECD samples they examined (Serbia is a non-OECD country). So, for our own peace of mind and to satisfy potentially skeptical reviewers, we knew that someday we should learn how to detect instances of near duplication in our data.

Since we completed data collection in Serbia, a global pandemic occurred and, during the pandemic, we also contracted with another international organization to collect similar survey data in Bosnia and Herzegovina. So, this problem has been at the back of our minds for a while, and we are finally ready to buckle down and investigate it with our own international survey data. In this post, we test the near duplication waters by focusing first on our most recently collected dataset from Sarajevo, Bosnia and Herzegovina.1 Afterwards, we plan to conduct similar investigations of data we collected in other international locations (Dhaka, Bangladesh; Belgrade, Serbia). As you will see, this is a lot of work for what will ultimately be a footnote in future papers generated using these data. Still, it is important work, and it is exactly the type of thing that we envisioned sharing on our blog when we launched it. Who knows - maybe our departure into duplication will help some of you dig a little deeper into your data and construct your own esoteric footnotes someday. You’re welcome?


Exact duplicates: When working with data, no one likes a copycat (well, except maybe fraudsters).

Exact Duplications

The first thing most survey organizations likely do (and most researchers should do) is check for exact duplicate entries in the raw data file. The janitor package allows one to quickly identify exact duplicate entries.

library(here)
library(dplyr)
library(janitor)

load(here("Data", "sarajevo_data.RData"))

# Look for exact duplication after removing administrative variables:
sarajevo_data %>%
  get_dupes(-c(ID, starttime, Interviewer, Region, Municipality, Cluster, Settlement, weight))
No duplicate combinations found of: Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8a, Q8b, ... and 244 other variables

Using the janitor package, we found no exact duplicates across the whole range of survey questions (minus the administrative variables) in the Sarajevo, Bosnia and Herzegovina (hereafter BiH) data. This is unsurprising given the thoroughness and professionalism we saw from our contracted survey organization, Custom Concepts. So we’re off the hook, right? Footnote complete!? Maybe, but duplicating entries exactly would be a pretty brazen attempt at falsification, one that would likely be caught by a professional survey organization before the data even got to us.2

Nuanced or Near Duplication

Kuriakose and Robbins (2016) (“KR” from here) developed a more nuanced method for detecting likely duplicates by focusing on “near duplicates.” The basic logic is that a dishonest interviewer or survey firm could save time and money by duplicating responses from valid interviews, and that a sophisticated interviewer or firm engaging in fraud is unlikely to simply duplicate the results exactly because exact duplication is relatively easy to check and identify (see above). Rather, duplicating a valid response and then changing the identifier as well as the values of one or two (or more) data columns would avoid the crude duplicate detection methods employed above.
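To make that logic concrete, here is a minimal base R sketch using made-up toy data (not our survey data). The data frame `responses`, the copied row `fake`, and the column names are all hypothetical; the point is only to show that a copied row with a new ID and a single altered answer slips past an exact duplicate check.

```r
# Toy data: hypothetical responses, not drawn from our survey
responses <- data.frame(
  ID = 1:3,
  Q1 = c(1, 4, 2),
  Q2 = c(3, 3, 5),
  Q3 = c(2, 1, 4),
  Q4 = c(5, 2, 1)
)

# A "fraudulent" row: copy respondent 1, assign a new ID, change one answer
fake <- responses[1, ]
fake$ID <- 4
fake$Q4 <- 4
toy <- rbind(responses, fake)

# Exact duplicate check on the substantive columns only (ID dropped)
substantive <- toy[, setdiff(names(toy), "ID")]
any_exact <- any(duplicated(substantive))

# Share of answers the copied row still has in common with its source row
share_matching <- mean(unlist(substantive[1, ]) == unlist(substantive[4, ]))

any_exact
share_matching
```

In this toy case the exact check comes back `FALSE`, even though the copied row still shares three of its four answers with the row it was copied from.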

KR’s process involves pairwise comparisons of every observation in the data set on the substantive variables (i.e., excluding administrative variables such as the unique ID) to detect the “maximum percentage of variables that an observation shares with any other observation in the data” (p. 284). Based on their simulations, KR (2016) determined that a maximum percent match of 85% or more is worthy of further scrutiny.
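As a rough illustration of that statistic, the base R sketch below computes a maximum percent match for each row of a small, made-up matrix of answers. This is our own simplified reading of the quoted definition, not KR's code: the function name `max_percent_match` is hypothetical, and the sketch assumes at least two rows and no missing values in the substantive variables.

```r
# Hypothetical helper (our name, not KR's): for each row, the highest share of
# columns on which it matches any other row. Assumes no missing values.
max_percent_match <- function(dat) {
  m <- as.matrix(dat)
  n <- nrow(m)
  out <- numeric(n)
  for (i in seq_len(n)) {
    # Repeat row i down a matrix of the same shape, then compare cell by cell
    row_i <- matrix(m[i, ], nrow = n, ncol = ncol(m), byrow = TRUE)
    shares <- rowMeans(m == row_i)  # share of matching columns with each row
    shares[i] <- NA                 # ignore the comparison of row i with itself
    out[i] <- max(shares, na.rm = TRUE)
  }
  out
}

# Toy answers (made up): row 3 is a copy of row 1 with the last answer changed
toy <- rbind(
  c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  c(5, 4, 3, 2, 1, 5, 4, 3, 2, 1),
  c(1, 2, 3, 4, 5, 1, 2, 3, 4, 1),
  c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
)

pct <- max_percent_match(toy)
flagged <- which(pct >= 0.85)  # the 85% threshold comes from KR (2016)

pct
flagged
```

Here rows 1 and 3 each have a maximum percent match of 90% and are flagged, while rows 2 and 4 fall well below the threshold. The loop does one pass per observation, which is fine for a toy example but may be slow on a full survey file with many rows and variables.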