NLM Scrubber: NLM's Software Application to De-identify Clinical Text Documents

April 3, 2026 updated by: National Library of Medicine (NLM)

NLM Scrubber: NLM's Software Application to De-identify Clinical Text Documents

Background: Electronic health records contain a vast amount of data about diseases and treatments. Researchers could use these data to test their ideas, but they would need records from more than just their own group of patients, and access to those records is restricted to protect patient privacy.

The U.S. National Library of Medicine (NLM) has created a computer tool called NLM Scrubber. This program recognizes and deletes personal information from health records. The researchers who developed the program now need access to the original records so that they can see how well it removes personal information from patient records and how they can make it more accurate.

Objectives:

To find ways to improve clinical text de-identification.

Eligibility:

No new participants. Researchers will review data that have already been collected.

Design:

Researchers will collect a random sample of reports. These will be from different doctors in different fields.

Researchers will manually remove personal information from the records.

Researchers will also automatically remove personal information from the original records using NLM Scrubber.

Researchers will compare the results of the computer program versus the manual changes. They will note when the program has not been removing personal information correctly. They will also note when the program has been deleting nonpersonal health information incorrectly.

Researchers will use the results to revise the program. They will keep testing it until the de-identification process is complete.
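The comparison step in this design can be sketched in a few lines of Python. This is a minimal illustration only, not NLM Scrubber's actual code: the token-level representation, the function name `compare_redactions`, and the sample sentence and labels are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def compare_redactions(
    tokens: Sequence[str],
    gold_pii: Sequence[bool],
    system_pii: Sequence[bool],
) -> Tuple[List[str], List[str]]:
    """Compare manual (gold) and automatic PII labels token by token.

    Returns two lists:
      missed        - tokens the annotators marked as PII but the program kept
      over_redacted - tokens the program removed although they are not PII
    """
    if not (len(tokens) == len(gold_pii) == len(system_pii)):
        raise ValueError("tokens and label sequences must have the same length")
    missed = [t for t, g, s in zip(tokens, gold_pii, system_pii) if g and not s]
    over_redacted = [t for t, g, s in zip(tokens, gold_pii, system_pii) if s and not g]
    return missed, over_redacted


# Made-up sentence and labels, for illustration only.
tokens = ["Mr.", "Smith", "was", "seen", "on", "03/14/2015", "for", "asthma"]
gold = [False, True, False, False, False, True, False, False]    # manual labels
system = [False, True, False, False, False, False, False, True]  # program output

missed, over_redacted = compare_redactions(tokens, gold, system)
print("Missed PII:", missed)
print("Erroneously redacted:", over_redacted)
```

In this hypothetical sample the program misses the date and wrongly removes a clinical term, which are the two kinds of error the researchers plan to record.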

Study Overview

Status

Enrolling by invitation

Conditions

Personally Identifiable Information

Detailed Description

This study is about the quality assessment, improvement, and monitoring of an automatic clinical text de-identification software application called NLM Scrubber, which has been developed at the National Library of Medicine (NLM). The application has been developed so that clinical reports can be used in secondary scientific studies (i.e., for secondary use) without breaching patient privacy. Research on methods for protecting patient privacy and on the development of NLM Scrubber has been conducted in compliance with, and following the guidelines of, HIPAA and the Privacy Act.

In order to further develop and improve NLM Scrubber and assess its de-identification performance effectively, the investigators require the original, unredacted samples from all potential clinical report types and sources. To this end, NLM investigators have been collaborating with entities within NIH, namely the NIH Clinical Center, BTRIS, and NCI, as well as outside entities: the Kentucky State Registry, administered by the University of Kentucky, and researchers from the University of Pittsburgh, who stated their interest in integrating NLM Scrubber into their application called Text Information Extraction System. These entities collect samples of various types of clinical reports for assessing and improving NLM Scrubber performance. However, we also need access to the original data in order to assess potential problems and improve the accuracy of NLM Scrubber.

Study Type

Observational

Enrollment (Estimated)

50000

Contacts and Locations

This section provides the contact details for those conducting the study, and information on where this study is being conducted.

Study Locations

United States
- Maryland
  - Bethesda, Maryland, United States
    - National Library of Medicine

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

1 day and older (Child, Adult, Older Adult)

Accepts Healthy Volunteers

Sampling Method

Probability Sample

Study Population

Everybody for whom a clinical narrative report is created.

Description

No new participant enrollment. Researchers will review data that have already been collected.

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

Observational Models: Other
Time Perspectives: Retrospective

Number of groups / cohorts

Cohorts and Interventions

Group / Cohort
1 Everybody for whom a clinical narrative report is created.

What is the study measuring?

Primary Outcome Measures

Outcome Measure	Measure Description	Time Frame
The rate of de-identification of PII	The HIPAA Privacy Rule defines 18 types of personally identifying information that need to be de-identified, including personal names, addresses, significant dates, and numeric identifiers (such as Social Security numbers). Our annotators label those words and numbers to create a gold standard, and NLM Scrubber tries to recognize and eliminate all of them. The rate of de-identification of PII measures how successfully it does so.	01/01/2017-01/31/2027

Secondary Outcome Measures

Outcome Measure	Measure Description	Time Frame
The rate of erroneously redacted clinical information	Although NLM Scrubber tries to eliminate only PII elements and preserve non-identifying study data, it inadvertently deletes some non-identifying study data elements (non-protected health information) as well. The rate of erroneously redacted clinical information measures the failure of NLM Scrubber to preserve non-identifying health information.	01/01/2017-01/31/2027
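The primary and secondary outcome measures could be computed along the following lines. This is a hedged sketch, not the study's published method: the record does not state the exact denominators, so the code assumes token-level labels, takes the de-identification rate as the share of gold-standard PII tokens that the program removed, and takes the erroneous redaction rate as the share of non-PII tokens that the program removed. The function name `outcome_rates` and the sample labels are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def outcome_rates(
    gold_pii: Sequence[bool],
    system_pii: Sequence[bool],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (de-identification rate, erroneous redaction rate).

    Assumed definitions (not taken from the study record):
      de-identification rate   = removed PII tokens / all gold PII tokens
      erroneous redaction rate = removed non-PII tokens / all non-PII tokens
    A rate is None when its denominator is zero.
    """
    if len(gold_pii) != len(system_pii):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(gold_pii, system_pii))
    tp = sum(1 for g, s in pairs if g and s)          # PII correctly removed
    fn = sum(1 for g, s in pairs if g and not s)      # PII missed
    fp = sum(1 for g, s in pairs if s and not g)      # non-PII wrongly removed
    tn = sum(1 for g, s in pairs if not g and not s)  # non-PII correctly kept
    deid_rate = tp / (tp + fn) if (tp + fn) else None
    erroneous_rate = fp / (fp + tn) if (fp + tn) else None
    return deid_rate, erroneous_rate


# Made-up labels for an eight-token report, for illustration only.
gold = [False, True, False, False, False, True, False, False]
system = [False, True, False, False, False, False, False, True]

deid_rate, erroneous_rate = outcome_rates(gold, system)
print("De-identification rate:", deid_rate)
print("Erroneous redaction rate:", erroneous_rate)
```

With these sample labels, one of the two PII tokens is removed and one of the six non-PII tokens is wrongly removed. Other units of analysis (for example, whole PII phrases rather than tokens) would give different figures.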

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Sponsor

National Library of Medicine (NLM)

Collaborators

National Institutes of Health Clinical Center (CC)

National Cancer Institute (NCI)

Investigators

Principal Investigator: Mehmet M Kayaalp, Ph.D., National Library of Medicine (NLM)

Publications and helpful links

The person responsible for entering information about the study voluntarily provides these publications. These may be about anything related to the study.

General Publications

Study record dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start

May 25, 2016

Primary Completion (Estimated)

January 31, 2027

Study Completion (Estimated)

January 31, 2027

Study Registration Dates

First Submitted

June 9, 2016

First Submitted That Met QC Criteria

June 9, 2016

First Posted (Estimated)

June 10, 2016

Study Record Updates

Last Update Posted (Actual)

April 6, 2026

Last Update Submitted That Met QC Criteria

April 3, 2026

Last Verified

December 16, 2025

More Information

Terms related to this study

Keywords

Other Study ID Numbers

999916122
16-LM-N122

Plan for Individual participant data (IPD)

Plan to Share Individual Participant Data (IPD)?

IPD Plan Description

We receive patient data, protected health information (PHI), from our collaborating data sources with the promise that we would protect PHI to the full extent and not share it with third parties.

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

Studies a U.S. FDA-regulated device product

This information was retrieved directly from the website clinicaltrials.gov without any changes. If you have any requests to change, remove or update your study details, please contact register@clinicaltrials.gov. As soon as a change is implemented on clinicaltrials.gov, this will be updated automatically on our website as well.
