The pattern of name tokens in narrative clinical text and a comparison of five systems for redacting them

Mehmet Kayaalp, Allen C Browne, Fiona M Callaghan, Zeyno A Dodd, Guy Divita, Selcuk Ozturk, Clement J McDonald

Abstract

Objective: To understand the factors that influence success in scrubbing personal names from narrative text.

Materials and methods: We developed a scrubber, the NLM Name Scrubber (NLM-NS), to redact personal names from narrative clinical reports. We hand-tagged words in a set of gold standard narrative reports as personal names or not, then measured the scrubbing success of NLM-NS and of four other scrubbing/name recognition tools (MIST, MITdeid, LingPipe, and ANNIE/GATE) against the gold standard reports. We ran three comparisons that used increasingly large name lists.

Results: The test reports contained more than 1 million words, of which 2388 were patient name tokens and 20,160 were provider name tokens. NLM-NS failed to scrub only 2 of the 2388 instances of patient name tokens. Its sensitivity was 0.999 on both patient and provider name tokens, and it missed fewer instances of patient name tokens than the other scrubbers in all comparisons. MIST produced the best all-token specificity and F-measure for name instances in our most relevant study (study 2), with values of 0.997 and 0.938, respectively. In that same comparison, NLM-NS was second best, with values of 0.986 and 0.748, respectively, and MITdeid was a close third, with values of 0.985 and 0.796, respectively. With the addition of the Clinical Center name list to their native name lists, LingPipe, MITdeid, MIST, and ANNIE/GATE all improved substantially. MITdeid and LingPipe gained the most, reaching patient name sensitivity of 0.995 (F-measure=0.705) and 0.989 (F-measure=0.386), respectively.
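
As a reading aid, the reported patient-name sensitivity can be reproduced from the counts above. The sketch below assumes the standard definitions of sensitivity, specificity, and F-measure (the balanced F1 form); the function and variable names are illustrative and do not come from the paper. The abstract does not report the underlying precision values or false-positive counts, so only sensitivity is evaluated here.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of actual name tokens that were scrubbed (recall)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-name tokens that were correctly left in place."""
    return true_neg / (true_neg + false_pos)


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F1, assumed here)."""
    return 2 * precision * recall / (precision + recall)


# Counts taken from the Results paragraph: 2388 patient name tokens,
# of which NLM-NS missed 2.
patient_name_tokens = 2388
missed_by_nlm_ns = 2

patient_sensitivity = sensitivity(
    true_pos=patient_name_tokens - missed_by_nlm_ns,
    false_neg=missed_by_nlm_ns,
)
print(round(patient_sensitivity, 3))  # 0.999
```

A high sensitivity paired with a lower F-measure, as reported for NLM-NS, is consistent with a scrubber that also redacts a number of tokens that are not names, since such false positives lower precision without affecting sensitivity.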

Discussion: The privacy risk due to the two name tokens missed by NLM-NS was statistically negligible, since neither individual could be distinguished among more than 150,000 people listed in the US Social Security Registry.

Conclusions: The nature and size of name lists have a substantial influence on scrubbing success. The use of very large name lists with frequency statistics accounts for much of the scrubbing success of NLM-NS.

Keywords: Chart Research; De-Identification; Electronic Medical Records; PHI.

Figures

Figure 1
Portion of a patient report after VTT tagging. Red signifies patient name, pink a numeric identifier, yellow a date, and green an age. This report includes only bogus PHI for demonstration purposes.

Source: PubMed
