Stumbling in the Dark: Building/Iterating an R Function to Match Stata’s percentmatch

general
rstats
duplication
fraud
If you are looking for more information about the modified R function we used to detect near duplicates, then you have come to the right place.
Author

Jake Day

Published

October 9, 2023

For this non-programmer, iterating on a percentmatch function in R was not entirely unlike stumbling in the dark.

Motivation

In order for our near duplication workflow to be fully (and freely) reproducible, we wanted to be able to estimate the maximum percent/proportion match value entirely within R.1 This meant we had to find or create an R function that mimics Kuriakose and Robbins’ (2016) Stata function.2 On the surface, this does not seem like it should be a difficult task. Basically, for each observation in the data, you need to instruct R to compare it to every other observation and count how many columns are the same, then divide this matching count by the total number of columns to generate casewise proportion match values and, finally, report the maximum proportion match value observed.
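As a toy illustration of that casewise computation (using made-up survey answers, not our data, and comparing just one pair of observations by hand):

```r
# Two hypothetical respondents; the id column is excluded from the comparison
row_a <- c(id = 1, q1 = "yes", q2 = "no", q3 = "5", q4 = "often")
row_b <- c(id = 2, q1 = "yes", q2 = "no", q3 = "5", q4 = "rarely")

# Count matching substantive columns, then divide by the number compared
matches <- sum(row_a[-1] == row_b[-1])    # 3 of the 4 answers agree
prop_match <- matches / length(row_a[-1]) # 3 / 4 = 0.75
prop_match
```

The full function below repeats this comparison for every pair of observations and keeps, for each observation, the maximum proportion match and the id of the observation that produced it.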

Honestly, this would probably be a pretty trivial exercise for real programmers. I (Jake) am not a real programmer. So, upon starting this journey, I was hoping (and expecting) that someone had already figured this out in R; that way, I could just use someone else’s existing function without needing to create my own. As you will see when I get around to writing this up, there were some challenges with the existing user-created functions, particularly in terms of computational efficiency, that posed some application problems. Eventually, I hope to walk you through an abridged overview of my journey to iterating on Florent Bédécarrats’ “percentmatch” function. In the meantime, thanks for your patience as I muster up the courage to “code in public” and draft a description of my stumbling around in the dark to build an R version of the percentmatch function that 1) works as intended and 2) does not take all day to run. For now, the current version of the function, which we used in the main blog post, will have to suffice and is included below.

Code for percentmatch R Function

# percentmatch function to align with:

# Noble L. Kuriakose, 2015.
# "PERCENTMATCH: Stata module to calculate the highest percentage match (near duplicates) between observations," Statistical Software Components
# S457984, Boston College Department of Economics.
# <https://ideas.repec.org/c/boc/bocode/s457984.html>

# and to expand and correct (where necessary):

# fBedecarrats' percentmatch function (see: https://github.com/fBedecarrats/percentmatch)

library(tidyverse)
# First part of function converts the data to strings and converts NA values to string depending on include_na option
percentmatch <- function (data, idvar, include_na = TRUE, progress = FALSE, timing = FALSE) {
  if(timing) { #starts time clock if timing option is TRUE
    start_time <- Sys.time()
  }
  # Make specified idvar the first column; idvar should be the column name as a string,
  # and all_of() avoids the deprecated bare-string selection behavior
  data <- relocate(data, all_of(idvar))
  if (include_na) {
    data <- mutate(data,
                   across(everything(), as.character),
                   across(everything(), ~ ifelse(is.na(.x), "NA", .x)))
  } else {
    data <- mutate(data, across(everything(), as.character))
  }

  #Specifies parameters that will be used in for loop 
  t <- t(data)
  t[t == ""] <- NA
  id <- data[[1]]
  nr <- nrow(t)
  nc <- ncol(t)
  m <- vector(mode = "numeric", length = nc-1)
  pm <- vector(mode = "numeric", length = nc)
  id_m <- vector(mode = "character", length = nc)
  
  #Calculates maximum proportion match for each observation in the data and creates data with three variables (id, pm, id_m)
  for (i in 1:nc) {
    n <- t[,i]
    tt <- t[,-i]
    for (j in 1:(nc-1)) {
      # pmin takes minimum sum of non-missing values between columns being compared; subtract 1 so it doesn't include ID variable
      m[j] <- sum(n == tt[,j], na.rm = TRUE) / (pmin(sum(!is.na(n)), sum(!is.na(tt[, j]))) - 1) 
    }
    pm[i] <- max(m)
    id_m[i] <- tt[1,which.max(m)]
    if (progress) { #option to show progress is FALSE by default
      if (i %% 10 == 0) { 
        print(i) 
      }
    }
  }
  out <- tibble(id = as.numeric(id), # converting back to numeric (assumes the id variable is numeric)
                pm, 
                id_m = as.numeric(id_m))
  if(timing) {
    end_time <- Sys.time() #stops time clock if timing option is TRUE
    print(end_time - start_time)
  }
  return(out)
}
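A minimal usage sketch, assuming the function above has been sourced (it loads the tidyverse) and that idvar is passed as a string naming a numeric id column. The toy data here are made up for illustration:

```r
# Hypothetical example data: three respondents, three substantive columns
toy <- tibble(
  id = 1:3,
  q1 = c("yes", "yes", "no"),
  q2 = c("no",  "no",  "no"),
  q3 = c(5, 5, 2)
)

result <- percentmatch(toy, idvar = "id", timing = TRUE)
result
# Each row reports pm (the maximum proportion match) and id_m (the id of the
# closest-matching observation); rows 1 and 2 agree on every substantive
# column here, so pm = 1 for both
```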

Footnotes

  1. Achieving this important goal will require us to make our data publicly available as well. We plan on doing this…eventually. We hope to write a future blog post about the process of sharing this kind of sensitive data (e.g., asking about criminal and immoral behaviors and attitudes) and, perhaps, about the potential for sharing synthetic data as an alternative to consider when otherwise restricted in criminology. In this particular situation, we have not yet published in an academic journal using these data. When we do, we expect at least to provide replication data upon publication. We still have a lot of work to do before we get there. Meanwhile, until we reach the point where we can openly and ethically share the data, you should take everything we tell you regarding results from these data with a grain of salt - as you should also do with any study in and beyond criminology for which data are not made available. Beware, though, that such a justifiably skeptical attitude might leave you extremely salty, as you will likely find that the data for a lot of studies are simply unavailable. You may even encounter situations where authors claim their data do not even exist anymore, perhaps due to the death of a computer, a researcher, or both. And, yes, we even have footnotes for our preview pages.↩︎

  2. Having an efficient R version of the function would also allow us to replicate and extend KR’s and Simmons et al.’s (2016) simulation studies.↩︎