We are going to use this appendix to document our process for cleaning and wrangling the neighborhood-level data constructed from the individual-level community survey data (as well as official crime data). These data were used in “Chapter 6: Mapping Collective Efficacy” to examine neighborhood-level variation in collective efficacy and crime.
11.0.2 Load the Data
First, we need to load the crime rate data that was in the “Data” folder of the “ArcGIS Pro Project” folder shared with us via Dropbox. We will specifically use the CrimeRate Selected Neighborhood.xls data, as it includes an indicator variable for the neighborhoods that were sampled. It includes SHAPE_AREA and SHAPE_LEN variables that capture geographic shape/location features, but not the full multipolygon that may be necessary to construct a map (we’ll find out below). So we also downloaded the city’s neighborhood data (in September of 2024) from DataKC. At first glance, it appeared to have the same neighborhood information as the CrimeRate data, but we can check that to make sure. And, of course, we’ll also load the full recoded and analytical data from “Chapter 5: Measuring Collective Efficacy” that includes our collective efficacy and crime-related items, subscales, and scales.
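Below is a minimal sketch of this loading step. The file paths, the name of the DataKC export, and the name of the saved Chapter 5 object are assumptions for illustration; only the CrimeRate Selected Neighborhood.xls file name comes from the folder described above.

```r
library(readxl) # read the .xls crime rate file
library(readr)  # read the DataKC export

# Hypothetical paths: adjust to match the local project folders
kcmo_crimerate_raw <- read_excel("ArcGIS Pro Project/Data/CrimeRate Selected Neighborhood.xls")
kcmo_nbh_raw <- read_csv("Data/kcmo_neighborhoods_datakc.csv")

# The recoded analytical data from Chapter 5 (hypothetical file name)
load("Data/ch5_recoded_data.RData")
```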
As you can see above, the kcmo_crimerate_raw data has 30 variables, many of which share the same name. The most common case is the three “OBJECTID…#” items (there is also an “OBJECTID1” variable), along with what look like two “NBHID…#” and two “NBHNAME…#” variables. Fixing the names is generally easy (we can simply remove the duplicate columns). But before we do that, we want to make sure columns with the same name do indeed contain the same data.
From eyeballing the first six rows of the data above, it appears that the variables with the same name generally contain the same data. But there is one obvious exception: the OBJECTID...# variables (and the OBJECTID1 variable) do not always match. This is most obvious with OBJECTID...2 and OBJECTID...3, which sit beside each other in the data. While OBJECTID...8 appears to match OBJECTID...3, OBJECTID1 does not appear to match the other OBJECTID... columns. The various NBHNAME...# and NBHID...# items do appear to match.
You may also notice multiple sets of geographic variables: SHAPE_AREA and SHAPE_LEN, SHAPE_Area and SHAPE_Length, and Shape_Area and Shape_Length. While the first two sets appear to match, the third set appears to be completely different. We can ultimately cross-reference these with the data pulled directly from DataKC. For now, we want to more formally test whether these various “duplicate” items are indeed the same (or different).
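To make the logic of that test concrete, here is a minimal sketch of the kind of pairwise comparison our dupevar_test() function performs; the actual function used in the book may differ in its details and output.

```r
library(tibble)

# Count how many values in two columns match, and flag exact duplicates
dupevar_test <- function(data, var1, var2) {
  n_match <- sum(data[[var1]] == data[[var2]], na.rm = TRUE)
  tibble(
    comparison = paste(var1, "vs.", var2),
    n_match    = n_match,
    n_total    = nrow(data),
    identical  = n_match == nrow(data)
  )
}

# Example: compare two of the OBJECTID columns
dupevar_test(kcmo_crimerate_raw, "OBJECTID...2", "OBJECTID...3")
```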
As you can see above, OBJECTID...3 and OBJECTID...8 are identical to each other but almost completely different from OBJECTID...2 (matching on only 3 values). OBJECTID1 is likewise mostly different from OBJECTID...2 (matching on only 2 values) and completely different from OBJECTID...3 and OBJECTID...8. We will need to keep three different versions of the OBJECTID variables for now and compare them to the DataKC data to confirm which aligns with the public data.
Next we can look at the NBHID and NBHNAME variables.
```
# A tibble: 1 × 2
  NBHNAME...5             NBHNAME...10
  <chr>                   <chr>
1 Noble And Gregory Ridge Noble and Gregory Ridge
```
As you can see above, the one difference is the result of “And” being capitalized in NBHNAME...5 but not in NBHNAME...10. This gives us confidence that the NBHID variable is capturing distinct neighborhoods. We will cross-reference this with the data from DataKC to settle on which version we will use.
As you can see in the results of the duplication test, all of the shape variables are different. This is confusing because, for example, the SHAPE_AREA and SHAPE_Area variables look the same when eyeballing them. Apparently, this is a result of the values (floating point numbers) being stored at different levels of decimal precision. While the extra decimals don’t show up in the printed data frame output, they’re there beneath the surface. You can see this when we convert the values to character and print the first six rows of the data frame.
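Here is a sketch of that character-conversion check, assuming the raw data object is named kcmo_crimerate_raw; note that contains() matches case-insensitively, so it picks up all three spellings of the shape variables.

```r
library(dplyr)

# Convert the shape variables to character to expose the full stored precision
kcmo_crimerate_raw |>
  select(contains("shape")) |>
  mutate(across(everything(), as.character)) |>
  head()
```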
We’re not too concerned about this and, if we had to choose one, we’d choose the more precise number. But all of this reinforces the necessity of using the DataKC geography data directly. In fact, we can check the DataKC geography data to see at what level of precision its geography variables are stored.
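One simple way to check, assuming the city’s data is in kcmo_nbh_raw, is to count the characters each shape value occupies once converted to text:

```r
library(dplyr)

# How many characters does each shape value occupy when stored as text?
kcmo_nbh_raw |>
  summarize(across(contains("shape"), ~ max(nchar(as.character(.x)))))
```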
It’s pretty clear that the all-caps versions of the shape variables (SHAPE_AREA and SHAPE_LEN) in our crime rate data are very likely the same as those in the data from the city. All are stored as 13 characters (12 digits plus the decimal point). We will confirm this by merging the city data with the crime rate data. But first, we will want to trim the crime rate data to just the unique variables and perhaps rename some of them so they do not duplicate the variable names in the DataKC data.
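A sketch of the trim, rename, and merge steps follows; the column suffixes (e.g., ...4), the “_cr” renaming convention, and the choice of full_join() are illustrative rather than an exact record of our decisions.

```r
library(dplyr)

# Keep one copy of each variable and tag the crime rate shape variables so
# they don't collide with the city's versions after the merge
kcmo_crimerate_trim <- kcmo_crimerate_raw |>
  select(NBHID = NBHID...4, NBHNAME = NBHNAME...5,
         SHAPE_AREA_cr = SHAPE_AREA, SHAPE_LEN_cr = SHAPE_LEN,
         OBJECTID...2, OBJECTID...3, OBJECTID1)

# Merge the city's geography data with the trimmed crime rate data
kcmo_crimerate <- kcmo_nbh_raw |>
  full_join(kcmo_crimerate_trim, by = "NBHID")
```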
One thing to note about this merged data is that it has 3 more observations than the original data. This difference is explained in part by the 6 observations with 0 for NBHID and NA for NBHNAME in the city’s data and the 3 observations that are missing from the crime rate data. You can see this clearly in the simple table below. We believe these were the 3 neighborhoods that were intentionally selected out of the sampling frame for methodological reasons.
Now we should be able to use our dupevar_test() function to confirm which variables in the crime rate data are the same as the city’s data. Of course, now that we have merged it with the city data, this is all kind of moot.2
Now, we can trim the crime rate data to just those variables we need to identify neighborhood boundaries, map crime, and ultimately merge with our survey data. We will also create a factor variable that reflects the specific sample (Not Sampled vs. Random Sample vs. High Crime Sample) from which each neighborhood was drawn (or not).
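A minimal sketch of that step is below; the crime_rate and sampled variable names, and the 0/1/2 coding of the sampling indicator, are stand-ins rather than the actual names and codes in the data.

```r
library(dplyr)

kcmo_crimerate <- kcmo_crimerate |>
  # Keep only what we need for boundaries, mapping, and merging (illustrative)
  select(NBHID, NBHNAME, the_geom, crime_rate, sampled) |>
  # `sampled` and its 0/1/2 coding are hypothetical
  mutate(sample_type = factor(
    sampled,
    levels = c(0, 1, 2),
    labels = c("Not Sampled", "Random Sample", "High Crime Sample")
  ))
```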
11.0.4 Prepare the Geography & Crime Rate Data for Mapping
The Kansas City neighborhood data from DataKC that we joined/merged with the crime rate data includes the relevant “features,” or geographic information, we need to plot maps of the data. Specifically, the the_geom variable in that data includes the relevant geographic information to draw our maps. In drawing the maps, we will work with the “sf” package, one of the go-to packages for working with geographic shape/feature data (the “sf” stands for “simple features”). The “sf” package has a website with multiple vignettes for working with geographic data.
The sf package includes multiple operations for working with geographic features. In fact, we could have used it to directly read the shape (.shp) files that Dr. Kotlaja provided us. However, given our simpler goal of mapping neighborhood variation in collective efficacy and crime, the geographic features of the neighborhoods are all we really need. So we will use the st_as_sfc() and st_sf() functions from the sf package to point to the geographic information already included in our merged data.
```r
library(sf)

# Convert the WKT geometry strings into a simple features geometry column
crimerate_geom <- st_as_sfc(kcmo_crimerate$the_geom)

# Create the sf data frame (WGS 84, EPSG:4326)
kcmo_crimerate_sf <- st_sf(kcmo_crimerate, geometry = crimerate_geom, crs = 4326)
names(kcmo_crimerate_sf)
```
The survey data does not have a corresponding NBHID variable that directly maps onto the same variable in the geo-coded data files we were just working with. Of course, given that we know which neighborhoods were included in our sample, and there is a neighborhood number identifier (coded as NBHD), we can link the survey data to the geographic data with a little bit of work.
The first thing we’ll do is identify which neighborhoods are included in our recoded survey data.
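For example, assuming the recoded survey data object is named survey_recoded (a hypothetical name):

```r
library(dplyr)

# Tabulate the neighborhood identifier in the survey data
survey_recoded |>
  count(NBHD)
```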
As you can see in the above table, the NBHD variable simply numbers the neighborhoods sequentially. Dr. Kotlaja shared the corresponding NBHID values from the crime rate data, so we can create that variable within our survey data and merge it with the crime rate data.
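A sketch of that crosswalk-and-merge step is below; the NBHID values shown are placeholders for illustration, not the actual identifiers Dr. Kotlaja shared.

```r
library(dplyr)
library(tibble)

# Hypothetical crosswalk between the survey's sequential NBHD codes and the
# crime rate data's NBHID codes (values are placeholders)
nbh_crosswalk <- tribble(
  ~NBHD, ~NBHID,
      1,    101,
      2,    215,
      3,    307
)

# Attach NBHID to the survey data, then merge in the crime rate data
survey_recoded <- survey_recoded |>
  left_join(nbh_crosswalk, by = "NBHD") |>
  left_join(kcmo_crimerate, by = "NBHID")
```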
When R encounters duplicate column names, it simply appends “…#”, where the # represents the column number where the variable is located in the data.↩︎
We also already made some decisions in this regard. For example, recall that the two NBHNAME...# variables in the crime rate data had one value that differed: capital “And” vs. lowercase “and”. We simply checked that value in the city’s data to determine which was correct.↩︎