Ensuring Confidentiality of Geocoded Health Data: Assessing Geographic Masking Strategies for Individual-Level Data

Paul A Zandbergen

Abstract

Public health datasets increasingly use geographic identifiers such as an individual's address. Geocoding these addresses often provides new insights, since it becomes possible to examine spatial patterns and associations. Address information is typically considered confidential and is therefore not released or shared with others. Publishing maps with the locations of individuals, however, may also breach confidentiality, since addresses and associated identities can be discovered through reverse geocoding. One commonly used technique to protect confidentiality when releasing individual-level geocoded data is geographic masking. This typically consists of applying a certain amount of random perturbation in a systematic manner to reduce the risk of reidentification. A number of geographic masking techniques have been developed, as well as methods to quantify the risk of reidentification associated with a particular masking method. This paper presents a review of the current state of the art in geographic masking, summarizing the various methods and their strengths and weaknesses. Despite recent progress, no universally accepted or endorsed geographic masking technique has emerged. Researchers, nevertheless, are publishing maps using geographic masking of confidential locations. Any researcher publishing such maps is advised to become familiar with the different masking techniques available and their associated reidentification risks.

Figures

Figure 1
Disclosure of confidential information by publishing coordinates. Panel (a) shows a hypothetical set of coordinates. Plotting these on a small-scale map (b) provides an approximate location (here, Rio Rancho). Zooming in on a large-scale map (c) provides a precise location, which can be used to identify the street address associated with the coordinates (e.g., 1364 Peppoli Loop SE). Aerial imagery (d) can be used to confirm the specific residence.
Figure 2
Geocoding and reverse geocoding. Geocoding (a) is the process of assigning locations (i.e., coordinates) to address information. A tabular dataset of addresses becomes a map. Reverse geocoding (b) literally puts this in reverse and converts mapped locations to addresses. Errors in the geocoding and reverse geocoding process may result in mismatched address information; that is, the addresses obtained using reverse geocoding may not be identical to those used in the original geocoding.
Figure 3
Spatial aggregation of individual cases using census enumeration units. Individual geocoded locations (left) are aggregated using census tracts (right). The count of cases per census tract is used to determine relevant population-weighted indices, such as the number of cases per 10,000 residents. Determining incidence or disease rates, as opposed to raw counts, is one of the primary reasons for aggregation. As a secondary benefit, spatial aggregation greatly reduces the reidentification risk.
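The aggregation step in this caption can be sketched in a few lines. This is a minimal illustration, not the paper's procedure: it assumes each case has already been assigned to a census tract (for example by a point-in-polygon overlay), and the tract IDs, populations, and the function name `rates_per_10k` are all hypothetical.

```python
from collections import Counter


def rates_per_10k(case_tracts, tract_population):
    """Count cases per tract and express them as cases per 10,000 residents.

    case_tracts: iterable of tract IDs, one entry per individual case.
    tract_population: dict mapping tract ID to resident population.
    Tracts with zero population are skipped to avoid division by zero.
    """
    counts = Counter(case_tracts)
    return {
        tract: 10_000 * counts.get(tract, 0) / population
        for tract, population in tract_population.items()
        if population > 0
    }


# Hypothetical data: after aggregation only the tract ID of each case is kept.
case_tracts = ["A", "A", "B", "A", "C"]
tract_population = {"A": 6000, "B": 2500, "C": 4000, "D": 5000}

rates = rates_per_10k(case_tracts, tract_population)
for tract, rate in sorted(rates.items()):
    print(tract, rate)
```

Only tract-level rates leave this function, which is why aggregation lowers reidentification risk, at the cost of losing the within-tract spatial pattern.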
Figure 4
Conceptual illustration of geographic masking. A set of original locations (a) is created using address geocoding or field data collection using GPS. These locations correspond very closely to the residences of interest, although a certain amount of error might be present. For each location, a masked representation is created (b) by displacing the original location using one of several algorithms. Most algorithms include a certain degree of randomness in the displacement. The original locations are removed from the dataset, resulting in a set of masked locations (c) for publication and distribution purposes. The set of masked locations has the same number of observations as the set of original locations.
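The three stages in this caption (original locations, displacement, release of masked locations only) can be sketched as follows. This is one possible illustration under stated assumptions: coordinates are in a projected system with units of meters, the displacement uses a random direction and a random distance up to a hypothetical `max_distance`, and `mask_points` is an illustrative name rather than a method from the paper.

```python
import math
import random


def mask_points(points, max_distance, rng=None):
    """Return one displaced copy of each point; originals are not included.

    Each point is moved in a uniformly random direction by a distance drawn
    uniformly from [0, max_distance]. Drawing the distance this way places
    more masked points close to the original than far from it.
    """
    rng = rng or random.Random()
    masked = []
    for x, y in points:
        angle = rng.uniform(0.0, 2.0 * math.pi)
        distance = rng.uniform(0.0, max_distance)
        masked.append((x + distance * math.cos(angle),
                       y + distance * math.sin(angle)))
    return masked


# Hypothetical original locations (a), in meters.
original = [(0.0, 0.0), (100.0, 200.0), (-50.0, 30.0)]

# Masked representation (b); only this list would be released (c).
masked = mask_points(original, max_distance=250.0, rng=random.Random(1))
print(len(original), len(masked))
```

The seeded generator is used here only to make the sketch repeatable. In practice the seed and the displacement parameters would themselves need protecting, since knowledge of them can help an attacker reverse the mask.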
Figure 5
Graphical representation of common geographic masking techniques. The red dot indicates the original location and the blue dot one of the many possible masked locations.
Figure 6
Example of a geographic masking technique (i.e., random placement within a circle) using an additional spatial filter to constrain displacement. The red dot represents the original location; the yellow area represents all possible locations for the masked location; and the blue dot represents one possible masked location selected randomly. This filter can be used to avoid placement in areas where logically no population resides (such as water bodies or parks) or to limit displacement to a particular enumeration unit (such as the same census tract or postal code).
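A simple way to realize the filtered placement in this caption is rejection sampling: draw a candidate uniformly from the circle and keep it only if the filter accepts it. The sketch below assumes projected coordinates and represents the filter as a caller-supplied predicate. The rectangle standing in for a census tract and the names `mask_with_filter` and `in_tract` are hypothetical; a real filter would test against actual polygon geometry.

```python
import math
import random


def mask_with_filter(point, radius, is_allowed, rng=None, max_tries=1000):
    """Pick a random location within `radius` of `point` that passes the filter.

    Candidates are uniform over the circle's area (the square root on the
    random draw corrects for area growing with the square of the distance).
    Candidates rejected by `is_allowed` are discarded and redrawn.
    """
    rng = rng or random.Random()
    x, y = point
    for _ in range(max_tries):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        distance = radius * math.sqrt(rng.random())
        candidate = (x + distance * math.cos(angle),
                     y + distance * math.sin(angle))
        if is_allowed(candidate):
            return candidate
    raise ValueError("no allowed location found within the circle")


def in_tract(p):
    """Hypothetical filter: a 100 m x 100 m box standing in for one tract."""
    return 0.0 <= p[0] <= 100.0 and 0.0 <= p[1] <= 100.0


# Original location near the tract edge, so part of the circle is excluded.
original = (95.0, 50.0)
masked = mask_with_filter(original, 20.0, in_tract, rng=random.Random(42))
print(masked)
```

One design consequence is worth noting: the filter shrinks the set of possible masked locations (the yellow area in the figure), so for points near a boundary the effective protection is smaller than the nominal circle suggests.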
Figure 7
Illustration of the k-anonymity concept using record linkage. Medical records contain a number of different fields which are removed to protect confidentiality, including name and address. When combined with voting records, however, it becomes possible to uniquely identify individuals in the medical records by combining fields for ZIP code, birthdate, and sex. The k-anonymity provided by the released data is unacceptably low. By removing the field for birthdate (or replacing it with birth year), the k-anonymity is substantially increased and may reach acceptable levels. The concept of k-anonymity provides a quantitative measure of confidentiality protection. More specifically, it is a number that can be calculated for each subset of the data. For the example of medical records and voting records, values for k-anonymity can be calculated prior to release for all combinations of ZIP code and sex or any other field of interest. Adapted from [66].
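The calculation described in this caption amounts to grouping records by their quasi-identifier values and taking the smallest group size. The sketch below is an illustration under assumptions, not the method from [66]: the records, field names, and the function name `k_anonymity` are hypothetical, and only the fields that survive de-identification are modeled.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return (k, group_sizes) for the given quasi-identifier fields.

    group_sizes maps each observed combination of field values to the number
    of records sharing it; k is the size of the smallest such group.
    """
    if not records:
        raise ValueError("k-anonymity is undefined for an empty dataset")
    group_sizes = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(group_sizes.values()), group_sizes


# Hypothetical released medical records (name and address already removed).
records = [
    {"zip": "87124", "birthdate": "1970-03-02", "birth_year": 1970, "sex": "F"},
    {"zip": "87124", "birthdate": "1970-11-19", "birth_year": 1970, "sex": "F"},
    {"zip": "87124", "birthdate": "1982-06-30", "birth_year": 1982, "sex": "M"},
    {"zip": "87124", "birthdate": "1982-01-15", "birth_year": 1982, "sex": "M"},
]

# With the full birthdate, every combination is unique.
k_full, _ = k_anonymity(records, ["zip", "birthdate", "sex"])
# Replacing birthdate with birth year merges records into larger groups.
k_year, _ = k_anonymity(records, ["zip", "birth_year", "sex"])
print(k_full, k_year)
```

In this toy dataset, generalizing birthdate to birth year raises k from 1 to 2. Whether a given k is acceptable is a policy decision that the calculation itself does not settle.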

References

    1. Zandbergen P. A. Geocoding quality and implications for spatial analysis. Geography Compass. 2009;3(2):647–680. doi: 10.1111/j.1749-8198.2008.00205.x.
    1. Rushton G., Armstrong M. P., Gittler J., et al. Geocoding Health Data: The Use of Geographic Codes in Cancer Prevention and Control, Research and Practice. CRC Press; 2010.
    1. Krieger N., Chen J. T., Waterman P. D., Soobader M.-J., Subramanian S. V., Carson R. Geocoding and monitoring of US socioeconomic inequalities in mortality and cancer incidence: does the choice of area-based measure and geographic level matter? The public health disparities geocoding project. American Journal of Epidemiology. 2002;156(5):471–482. doi: 10.1093/aje/kwf068.
    1. Abe T., Stinchcomb D. Geocoding Health Data. CRC Press; 2010. Geocoding practices in cancer registries; pp. 111–125.
    1. Krieger N. Place, space, and health: GIS and epidemiology. Epidemiology. 2003;14(4):384–385. doi: 10.1097/01.ede.0000071473.69307.8a.
    1. Bissette J. M., Stover J. A., Newman L. M., Delcher P. C., Bernstein K. T., Matthews L. Assessment of geographic information systems and data confidentiality guidelines in STD programs. Public Health Reports. 2009;124(supplement 2):p. 58.
    1. Richardson D. B., Volkow N. D., Kwan M.-P., Kaplan R. M., Goodchild M. F., Croyle R. T. Medicine. Spatial turn in health research. Science. 2013;339(6126):1390–1392. doi: 10.1126/science.1232257.
    1. Gemmill A., Gunier R. B., Bradman A., Eskenazi B., Harley K. G. Residential proximity to methyl bromide use and birth outcomes in an agricultural population in California. Environmental Health Perspectives. 2013;121(6):737–743. doi: 10.1289/ehp.1205682.
    1. Fefferman N. H., O’Neil E. A., Naumova E. N. Confidentiality and confidence: is data aggregation a means to achieve both? Journal of Public Health Policy. 2005;26(4):430–449. doi: 10.1057/palgrave.jphp.3200029.
    1. Rushton G., Armstrong M. P., Gittler J., et al. Geocoding in cancer research: a review. The American Journal of Preventive Medicine. 2006;30(2):S16–S24. doi: 10.1016/j.amepre.2005.09.011.
    1. Reiter J. P., Kinney S. K. Sharing confidential data for research purposes: a primer. Epidemiology. 2011;22(5):632–635. doi: 10.1097/EDE.0b013e318225c44b.
    1. Goldberg D. W., Wilson J. P., Knoblock C. A. From text to geographic coordinates: the current state of geocoding. URISA Journal. 2007;19(1):33–46.
    1. Zandbergen P. A. Influence of geocoding quality on environmental exposure assessment of children living near high traffic roads. BMC Public Health. 2007;7, article 37 doi: 10.1186/1471-2458-7-37.
    1. Zandbergen P. A., Green J. W. Error and bias in determining exposure potential of children at school locations using proximity-based GIS techniques. Environmental Health Perspectives. 2007;115(9):1363–1370. doi: 10.1289/ehp.9668.
    1. Zandbergen P. A., Hart T. C., Lenzer K. E., Camponovo M. E. Error propagation models to examine the effects of geocoding quality on spatial analysis of individual-level datasets. Spatial and Spatio-Temporal Epidemiology. 2012;3(1):69–82. doi: 10.1016/j.sste.2012.02.007.
    1. Cayo M. R., Talbot T. O. Positional error in automated geocoding of residential addresses. International Journal of Health Geographics. 2003;2, article 10 doi: 10.1186/1476-072X-2-10.
    1. Jacquemin B., Lepeule J., Boudier A., et al. Impact of geocoding methods on associations between long-term exposure to urban air pollution and lung function. Environmental Health Perspectives. 2013;121(9):1054–1060.
    1. Jacquez G. M. A research agenda: does geocoding positional error matter in health GIS studies? Spatial and Spatio-Temporal Epidemiology. 2012;3(1):7–16. doi: 10.1016/j.sste.2012.02.002.
    1. Zimmerman D. L., Fang X. Estimating spatial variation in disease risk from locations coarsened by incomplete geocoding. Statistical Methodology. 2012;9(1-2):239–250. doi: 10.1016/j.stamet.2011.01.008.
    1. Duncan D. T., Castro M. C., Blossom J. C., Bennett G. G., Gortmaker S. L. Evaluation of the positional difference between two common geocoding methods. Geospatial Health. 2011;5(2):265–273.
    1. Zandbergen P. A. A comparison of address point, parcel and street geocoding techniques. Computers, Environment and Urban Systems. 2008;32(3):214–232. doi: 10.1016/j.compenvurbsys.2007.11.006.
    1. Goldberg D. W., Cockburn M. G. Improving geocode accuracy with candidate selection criteria. Transactions in GIS. 2010;14(1):149–176. doi: 10.1111/j.1467-9671.2010.01211.x.
    1. Zandbergen P. A., Chakraborty J. Improving environmental exposure analysis using cumulative distribution functions and individual geocoding. International Journal of Health Geographics. 2006;5, article 23 doi: 10.1186/1476-072X-5-23.
    1. Miranda M. L., Anthopolos R., Hastings D. A geospatial analysis of the effects of aviation gasoline on childhood blood lead levels. Environmental Health Perspectives. 2011;119(10):1513–1516. doi: 10.1289/ehp.1003231.
    1. Xue J., McCurdy T., Burke J., et al. Analyses of school commuting data for exposure modeling purposes. Journal of Exposure Science and Environmental Epidemiology. 2010;20(1):69–78. doi: 10.1038/jes.2009.3.
    1. Armstrong M. P., Rushton G., Zimmerman D. L. Geographically masking health data to preserve confidentiality. Statistics in Medicine. 1999;18(5):497–525.
    1. Sueda K., Miyaki T., Rekimoto J. Mobile and Ubiquitous Systems: Computing, Networking, and Services. Springer; 2012. Social geoscape: visualizing an image of the city for mobile UI using user generated geo-tagged objects; pp. 1–12.
    1. Kounadi O., Lampoltshammer T. J., Leitner M., Heistracher T. Accuracy and privacy aspects in free online reverse geocoding services. Cartography and Geographic Information Science. 2013;40(2):140–153. doi: 10.1080/15230406.2013.777138.
    1. Brownstein J. S., Cassa C. A., Mandl K. D. No place to hide—reverse identification of patients from published maps. New England Journal of Medicine. 2006;355(16):1741–1742. doi: 10.1056/NEJMc061891.
    1. Krumm J. A survey of computational location privacy. Personal and Ubiquitous Computing. 2009;13(6):391–399. doi: 10.1007/s00779-008-0212-5.
    1. Rekimoto J., Miyaki T., Ishizawa T. LifeTag: WiFi-based continuous location logging for life pattern analysis. Location- and Context-Awareness (Lecture Notes in Computer Science). 2007;4718:35–49.
    1. Searight K. R., Logan D. J., Bourland II Freddie J., Loher C. J., Charlton B. R. Reverse geocoding system using combined street segment and point datasets. Google Patents, 2010.
    1. Marshall R., Polk J., George R. A protocol for location transformations. TCS, 2011.
    1. Chen L.-C., Lai Y.-C., Yeh Y.-H., Lin J.-W., Lai C.-N., Weng H.-C. Enhanced mechanisms for navigation and tracking services in smart phones. Journal of Applied Research and Technology. 2013;11:272–282.
    1. Brownstein J. S., Cassa C. A., Kohane I. S., Mandl K. D. An unsupervised classification method for inferring original case locations from low-resolution disease maps. International Journal of Health Geographics. 2006;5, article 56 doi: 10.1186/1476-072X-5-56.
    1. Curtis A. J., Mills J. W., Leitner M. Spatial confidentiality and GIS: re-engineering mortality locations from published maps about Hurricane Katrina. International Journal of Health Geographics. 2006;5, article 44 doi: 10.1186/1476-072X-5-44.
    1. National Research Council. Putting People on the Map: Protecting Confidentiality with Linked Social-Spatial Data. National Academies Press; 2007.
    1. Olson K. L., Grannis S. J., Mandl K. D. Privacy protection versus cluster detection in spatial epidemiology. American Journal of Public Health. 2006;96(11):2002–2008. doi: 10.2105/AJPH.2005.069526.
    1. Boulos M. N. K., Curtis A. J., Abdelmalik P. Musings on privacy issues in health research involving disaggregate geographic data about individuals. International Journal of Health Geographics. 2009;8, article 46.
    1. Tenopir C., Allard S., Douglass K., et al. Data sharing by scientists: practices and perceptions. PLoS ONE. 2011;6(6):e21101. doi: 10.1371/journal.pone.0021101.
    1. Schofield P. N., Bubela T., Weaver T., et al. Post-publication sharing of data and tools. Nature. 2009;461(7261):171–173. doi: 10.1038/461171a.
    1. Duncan G. T., Pearson R. W. Enhancing access to microdata while protecting confidentiality: prospects for the future. Statistical Science. 1991;6(3):219–232.
    1. Cox L. Matrix masking methods for disclosure limitation in microdata. Survey Methodology. 1994;20(2):165–169.
    1. Allshouse W. B., Fitch M. K., Hampton K. H., et al. Geomasking sensitive health data and privacy protection: an evaluation using an E911 database. Geocarto International. 2010;25(6):443–452. doi: 10.1080/10106049.2010.496496.
    1. Fitch M. Geomasking algorithms to protect confidentiality of sexually transmitted infections in spatial epidemiology. Proceedings of the American Public Health Association Annual Meeting and Exposition; 2007.
    1. Hampton K. H., Fitch M. K., Allshouse W. B., et al. Mapping health data: improved privacy protection with donut method geomasking. American Journal of Epidemiology. 2010;172(9):1062–1069. doi: 10.1093/aje/kwq248.
    1. Lu Y., Yorke C., Zhan F. B. Considering risk locations when defining perturbation zones for geomasking. Cartographica. 2012;47(3):168–178.
    1. French J. L., Wand M. P. Generalized additive models for cancer mapping with incomplete covariates. Biostatistics. 2004;5(2):177–191. doi: 10.1093/biostatistics/5.2.177.
    1. Bell B. S. Biostatistical Applications in Cancer Research. Springer; 2002. Spatial analysis of disease-applications; pp. 151–182.
    1. Shi X., Alford-Teaster J., Onega T. Kernel density estimation with geographically masked points. Proceedings of the 17th International Conference on Geoinformatics (Geoinformatics '09); August 2009.
    1. Francis S. S., Selvin S., Yang W., Buffler P. A., Wiemels J. L. Unusual space-time patterning of the Fallon, Nevada leukemia cluster: evidence of an infectious etiology. Chemico-Biological Interactions. 2012;196(3):102–109. doi: 10.1016/j.cbi.2011.02.019.
    1. Claridge J., Diggle P., McCann C. M., et al. Fasciola hepatica is associated with the failure to detect bovine tuberculosis in dairy cattle. Nature Communications. 2012;3, article 853 doi: 10.1038/ncomms1840.
    1. Liang S., Banerjee S., Carlin B. P. Bayesian wombling for spatial point processes. Biometrics. 2009;65(4):1243–1253. doi: 10.1111/j.1541-0420.2009.01203.x.
    1. Choi A. L., Levy J. I., Dockery D. W., et al. Does living near a Superfund site contribute to higher polychlorinated biphenyl (PCB) exposure? Environmental Health Perspectives. 2006;114(7):1092–1098. doi: 10.1289/ehp.8827.
    1. Pereira G., De Vos A. J. B. M., Cook A., D’Arcy J. Holman C. Vector fields of risk: a new approach to the geographical representation of childhood asthma. Health and Place. 2010;16(1):140–146. doi: 10.1016/j.healthplace.2009.09.006.
    1. Kwan M.-P., Casas I., Schmitz B. C. Protection of geoprivacy and accuracy of spatial information: how effective are geographical masks? Cartographica. 2004;39(2):15–28.
    1. Zimmerman D. L., Pavlik C. Quantifying the effects of mask metadata disclosure and multiple releases on the confidentiality of geographically masked health data. Geographical Analysis. 2008;40(1):52–76. doi: 10.1111/j.0016-7363.2007.00713.x.
    1. Cassa C. A., Wieland S. C., Mandl K. D. Re-identification of home addresses from spatial locations anonymized by Gaussian skew. International Journal of Health Geographics. 2008;7, article 45 doi: 10.1186/1476-072X-7-45.
    1. Stinchcomb D. Procedures for geomasking to protect patient confidentiality. Proceedings of the ESRI International Health GIS Conference; 2004; Washington, DC, USA.
    1. Clifton K. J., Gehrke S. R. Application of geographic perturbation methods to residential locations in the Oregon household activity survey: proof of concept. Proceedings of the Transportation Research Board 92nd Annual Meeting; 2013.
    1. Cassa C. A., Grannis S. J., Overhage J. M., Mandl K. D. A context-sensitive approach to anonymizing spatial surveillance data: impact on outbreak detection. Journal of the American Medical Informatics Association. 2006;13(2):160–165. doi: 10.1197/jamia.M1920.
    1. Leitner M., Curtis A. Cartographic guidelines for geographically masking the locations of confidential point data. Cartographic Perspectives. 2004;(49):22–39.
    1. Zandbergen P. Validation of masking techniques for location privacy protection of individual-level health data. Proceedings of the American Public Health Association Annual Meeting; 2011; Washington, DC, USA.
    1. Leitner M., Curtis A. A first step towards a framework for presenting the location of confidential point data on maps-results of an empirical perceptual study. International Journal of Geographical Information Science. 2006;20(7):813–822. doi: 10.1080/13658810600711261.
    1. Wieland S. C., Cassa C. A., Mandl K. D., Berger B. Revealing the spatial distribution of a disease while preserving privacy. Proceedings of the National Academy of Sciences of the United States of America. 2008;105(46):17608–17613. doi: 10.1073/pnas.0801021105.
    1. Sweeney L. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 2002;10(5):557–570. doi: 10.1142/S0218488502001648.
    1. El Emam K., Dankar F. K. Protecting privacy using k-anonymity. Journal of the American Medical Informatics Association. 2008;15(5):627–637. doi: 10.1197/jamia.M2716.
    1. Sweeney L. Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 2002;10(5):571–588. doi: 10.1142/S021848850200165X.
    1. Aggarwal G., Feder T., Kenthapadi K., et al. Approximation algorithms for k-anonymity. Journal of Privacy Technology (JOPT). 2005.
    1. El Emam K., Dankar F. K., Issa R., et al. A globally optimal k-anonymity method for the de-identification of health data. Journal of the American Medical Informatics Association. 2009;16(5):670–682. doi: 10.1197/jamia.M3144.
    1. Kalnis P., Ghinita G., Mouratidis K., Papadias D. Preventing location-based identity inference in anonymous spatial queries. IEEE Transactions on Knowledge and Data Engineering. 2007;19(12):1719–1733. doi: 10.1109/TKDE.2007.190662.
    1. Khoshgozaran A., Shahabi C., Shirani-Mehr H. Location privacy: going beyond K-anonymity, cloaking and anonymizers. Knowledge and Information Systems. 2011;26(3):435–465. doi: 10.1007/s10115-010-0286-z.
    1. Gedik B., Liu L. Protecting location privacy with personalized k-anonymity: architecture and algorithms. IEEE Transactions on Mobile Computing. 2008;7(1):1–18. doi: 10.1109/TMC.2007.1062.
    1. Ghinita G., Zhao K., Papadias D., Kalnis P. A reciprocal framework for spatial K-anonymity. Information Systems. 2010;35(3):299–314. doi: 10.1016/j.is.2009.10.001.
    1. Xue M., Kalnis P., Pung H. K. Location and Context Awareness. Springer; 2009. Location diversity: enhanced privacy protection in location based services; pp. 70–87.
    1. Zandbergen P. A. Influence of street reference data on geocoding quality. Geocarto International. 2011;26(1):35–47. doi: 10.1080/10106049.2010.537374.
    1. Zandbergen P. A., Hart T. C. Geocoding accuracy considerations in determining residency restrictions for sex offenders. Criminal Justice Policy Review. 2009;20(1):62–90. doi: 10.1177/0887403408323690.
    1. Zinszer K., Jauvin C., Verma A., et al. Residential address errors in public health surveillance data: a description and analysis of the impact on geocoding. Spatial and Spatio-Temporal Epidemiology. 2010;1(2-3):163–168. doi: 10.1016/j.sste.2010.03.002.
    1. Mazumdar S., Rushton G., Smith B. J., Zimmerman D. L., Donham K. J. Geocoding accuracy and the recovery of relationships between environmental exposures and health. International Journal of Health Geographics. 2008;7, article 13 doi: 10.1186/1476-072X-7-13.
    1. Jacquemin B., Lepeule J., Boudier A., et al. Impact of geocoding methods on associations between long-term exposure to urban air pollution and lung function. Environmental Health Perspectives. 2013 doi: 10.1289/ehp.1206016.
    1. Healy M. A., Gilliland J. A. Quantifying the magnitude of environmental exposure misclassification when using imprecise address proxies in public health research. Spatial and Spatio-Temporal Epidemiology. 2012;3(1):55–67. doi: 10.1016/j.sste.2012.02.006.
    1. Roongpiboonsopit D., Karimi H. A. Quality assessment of online street and rooftop geocoding services. Cartography and Geographic Information Science. 2010;37(4):301–318. doi: 10.1559/152304010793454318.
    1. Zandbergen P. A. Positional accuracy of spatial data: non-Normal distributions and a critique of the national standard for spatial data accuracy. Transactions in GIS. 2008;12(1):103–130. doi: 10.1111/j.1467-9671.2008.01088.x.
    1. Krieger N., Waterman P., Lemieux K., Zierler S., Hogan J. W. On the wrong side of the tracts? Evaluating the accuracy of geocoding in public health research. American Journal of Public Health. 2001;91(7):1114–1116.
    1. Zhou Y., Dominici F., Louis T. A. A smoothing approach for masking spatial data. The Annals of Applied Statistics. 2010;4(3):1451–1475.
    1. Wang H., Reiter J. P. Multiple imputation for sharing precise geographies in public use data. The Annals of Applied Statistics. 2012;6(1):229–252.
    1. Huckett J. C. Synthetic Data Methods for Disclosure Limitation. ProQuest; 2008.
    1. Kamel Boulos M. N., Cai Q., Padget J. A., Rushton G. Using software agents to preserve individual health data confidentiality in micro-scale geographical analyses. Journal of Biomedical Informatics. 2006;39(2):160–170. doi: 10.1016/j.jbi.2005.06.003.
    1. Young C., Martin D., Skinner C. Geographically intelligent disclosure control for flexible aggregation of census data. International Journal of Geographical Information Science. 2009;23(4):457–482. doi: 10.1080/13658810801949835.

Source: PubMed
