Using Digital Data to Predict CHD

October 8, 2025 updated by: University of Pennsylvania

Using Digital Data to Predict Cardiovascular Health and Health Care Utilization

This project seeks to identify and characterize features derived from digital data (e.g., social media, online search, mobile media) that are associated with coronary heart disease (CHD) and related risk factors, and to develop models that use digital data and conventional predictive models to predict CHD risk and health care utilization.

Study Overview

Status

Completed

Intervention / Treatment

Detailed Description

Cardiovascular disease is the leading cause of death in the US. While secondary prevention approaches have improved patient longevity, risk factors and adverse health behaviors (e.g., physical inactivity, smoking) are highly prevalent, and in most contemporary series fewer than 1% of adults meet all the criteria for ideal cardiovascular (CV) health. The logistics and practicalities of meeting the goal of ideal CV health have not been clearly elucidated.

Practice guidelines recommend using the Framingham Risk Score (FRS) or other risk prediction tools to classify patients' risk of CV disease. These models, however, are imprecise, and there is increasing focus on identifying markers that provide better measures of risk. As digital platforms are increasingly used to document lifestyle and health behaviors, data from digital sources may provide a window into manifestations of novel risk factors and potentially a better characterization of existing risk factors. While it may seem like a cliche to mention the profound impact of digital data on everyday lives, these new media offer substantial opportunities for understanding behavioral, social, and environmental determinants of health.

This project seeks to identify and characterize features derived from digital data (e.g., social media, online search, mobile media) that are associated with coronary heart disease (CHD) and related risk factors, and to develop models that use digital data and conventional predictive models to predict CHD risk and health care utilization.

Study Type

Observational

Enrollment (Actual)

781

Contacts and Locations

This section provides the contact details for those conducting the study, and information on where this study is being conducted.

Study Locations

    • Pennsylvania
      • Philadelphia, Pennsylvania, United States, 19101
        • University of Pennsylvania Health System

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

30 years to 74 years (Adult, Older Adult)

Accepts Healthy Volunteers

Yes

Sampling Method

Non-Probability Sample

Study Population

We will identify patients ages 30-74 with and without CHD (ICD-9: 414.0; ICD-10: I63, I20-I25).
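
As an illustration of how the code list above might be applied to diagnosis codes in a medical record, here is a minimal sketch. The function name `is_chd_code` and the prefix-matching rule (treating `I25.10` as part of `I25`, and `414.01` as part of `414.0`) are assumptions made for this example, not the study's documented procedure.

```python
import re

# Code sets copied from the study population description. How subcodes are
# matched (by prefix / three-character category) is an assumption.
ICD9_PREFIXES = ("414.0",)
ICD10_EXACT_CATEGORIES = {"I63"}
ICD10_RANGE = range(20, 26)  # categories I20 through I25 inclusive


def is_chd_code(code: str, version: int) -> bool:
    """Return True if an ICD code falls inside the listed code set."""
    code = code.strip().upper()
    if version == 9:
        return any(code.startswith(prefix) for prefix in ICD9_PREFIXES)
    if version == 10:
        match = re.match(r"^I(\d{2})", code)
        if not match:
            return False
        category = int(match.group(1))
        return (
            category in ICD10_RANGE
            or f"I{category:02d}" in ICD10_EXACT_CATEGORIES
        )
    raise ValueError(f"unsupported ICD version: {version}")


if __name__ == "__main__":
    # Made-up example codes, not study data.
    for code, version in [("I25.10", 10), ("I10", 10), ("414.01", 9)]:
        print(code, version, is_chd_code(code, version))
```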

Description

Inclusion Criteria:

  • 30 - 74 years of age
  • Willing to sign informed consent
  • Primarily English speaking (for language analysis)
  • Has an account on any of the following digital data platforms (Facebook, Instagram, Twitter, Reddit, Google (Gmail), or a smartphone or wearable device such as Apple Health, Fitbit, Samsung Health, MapMyFitness, or Garmin) and is willing to share data
  • If has a social media account (Instagram or Facebook), willing to share historical and prospective data (60 days)
  • If has a Google (Gmail) account, willing to download and share a Google Takeout zip file
  • If has smartphone or wearable device, willing to share step data
  • Willing to share access to medical health records
  • Willing to share healthcare insurance information

Exclusion Criteria:

  • Patient does not meet age inclusion criteria above
  • Does not use and post on the digital data sources being studied, or is unwilling to donate data
  • Patient is in severe distress, e.g. respiratory, physical, or emotional distress
  • Patient is intoxicated, unconscious, or unable to appropriately respond to questions
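
The inclusion and exclusion criteria above can be read as a screening checklist. The sketch below is one hypothetical way to encode them. The `Candidate` fields, the `screening_failures` function, and the fixed platform set are all assumptions for illustration (the criteria say "such as", so the real device list may be broader), and the conditional sharing requirements for specific platforms are collapsed into a single flag.

```python
from dataclasses import dataclass, field

# Platform names copied from the inclusion criteria. Every field and function
# name below is hypothetical and chosen only for this illustration.
ACCEPTED_PLATFORMS = {
    "facebook", "instagram", "twitter", "reddit", "google",
    "apple health", "fitbit", "samsung health", "mapmyfitness", "garmin",
}


@dataclass
class Candidate:
    age: int
    signed_consent: bool
    primarily_english: bool
    platforms: set = field(default_factory=set)
    willing_to_share_data: bool = False
    shares_medical_records: bool = False
    shares_insurance_info: bool = False
    in_severe_distress: bool = False
    able_to_respond: bool = True


def screening_failures(c: Candidate) -> list:
    """Return the reasons a candidate fails screening (empty list = eligible)."""
    failures = []
    if not 30 <= c.age <= 74:
        failures.append("age outside 30-74")
    if not c.signed_consent:
        failures.append("no informed consent")
    if not c.primarily_english:
        failures.append("not primarily English speaking")
    if not ({p.lower() for p in c.platforms} & ACCEPTED_PLATFORMS):
        failures.append("no account on a studied platform")
    if not c.willing_to_share_data:
        failures.append("unwilling to share digital data")
    if not c.shares_medical_records:
        failures.append("unwilling to share medical records")
    if not c.shares_insurance_info:
        failures.append("unwilling to share insurance information")
    if c.in_severe_distress:
        failures.append("in severe distress")
    if not c.able_to_respond:
        failures.append("unable to respond to questions")
    return failures
```

Returning the list of failed checks, rather than a single boolean, keeps the reason for each exclusion visible.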

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

  • Observational Models: Case-Control
  • Time Perspectives: Cross-Sectional

Cohorts and Interventions

Group / Cohort
Intervention / Treatment
Case
Patients ages 30-74 with and without CHD (ICD-10: I63, I20-I25) within the last 5 years.
Interested participants may complete the informed consent online. After informed consent, the participant will be asked to share the digital data types that they use (Facebook, Instagram, Twitter, Google search, step data) and then participants will complete a cross-sectional survey.
Control
Patients ages 30-74 with a non-cardiovascular-related medical history
Interested participants may complete the informed consent online. After informed consent, the participant will be asked to share the digital data types that they use (Facebook, Instagram, Twitter, Google search, step data) and then participants will complete a cross-sectional survey.

What is the study measuring?

Primary Outcome Measures

Outcome Measure
Measure Description
Time Frame
Latent Dirichlet Allocation (LDA) Topics - Topics / Themes Discussed Between Patients With and Without Heart Disease
Time Frame: Through study completion, an average of 3 years

The primary outcome is the set of topics and features derived using LDA, a method for clustering language data.

For each participant, we included all available Facebook wall posts from the start of their account history through data collection, regardless of whether they occurred before or after a CHD diagnosis. We examined associations between linguistic features (unigrams, Linguistic Inquiry and Word Count (LIWC) categories, LDA topics) and cardiovascular case status (CHD presence vs. absence) using Pearson correlation and logistic regression. LDA, a systematic method to identify text-based themes, was applied to generate 200 clusters of co-occurring words ("topics"). For each feature type (unigram, LIWC category, LDA topic), we fit separate logistic regression models and calculated Pearson correlation coefficients to assess predictive value for case status. Each language-derived feature was encoded as a normalized frequency count per user to enable consistent comparison across participants.
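
To make the feature encoding concrete, the following minimal sketch covers two of the steps described above for unigrams only: normalizing word counts per user and computing a Pearson correlation against binary case status. It does not cover LDA topic modeling, LIWC scoring, or the logistic regression models. The function names, the whitespace tokenization, and the toy data are assumptions for illustration, not the study's pipeline.

```python
import math
from collections import Counter


def relative_frequencies(posts, vocabulary):
    """Encode one user's posts as each vocabulary word's share of all tokens."""
    # Lowercase + whitespace split is a simplifying assumption about tokenization.
    counts = Counter(word for post in posts for word in post.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {word: 0.0 for word in vocabulary}
    return {word: counts[word] / total for word in vocabulary}


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy, made-up data: one list of posts per user, with case status 1 = CHD.
    users = [
        (["hospital visit today", "back at the hospital"], 1),
        (["great run this morning"], 0),
        (["hospital food again"], 1),
        (["new bike new trails"], 0),
    ]
    vocabulary = ["hospital", "run"]
    status = [label for _, label in users]
    for word in vocabulary:
        freqs = [relative_frequencies(posts, [word])[word] for posts, _ in users]
        print(word, round(pearson(freqs, status), 3))
```

Dividing by each user's total token count keeps prolific and infrequent posters on the same scale.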

Other Outcome Measures

Outcome Measure
Measure Description
Time Frame
CHD Event
Time Frame: Through study completion, an average of 3 years

Reliability in predicting CHD-related events in patients, as measured by the Framingham Risk Score.

The Framingham Risk Score (FRS) is a validated means of predicting cardiovascular disease (CVD) risk. Input variables include age, cigarette smoking, total cholesterol, HDL cholesterol, systolic blood pressure measurement and treatment for hypertension. Point values are calculated based on each of these risks. A 10-year risk score can be derived as a percentage. Risk scores range from 0-20%.

Low Risk: Less than 10% risk that you will develop a heart attack or die from coronary disease in the next 10 years.

Intermediate Risk: A 10 to 20% risk that you will develop a heart attack or die from coronary disease in the next 10 years.

High Risk: A greater than 20% risk that you will develop a heart attack or die from coronary disease in the next 10 years.
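
The three categories above amount to a simple threshold rule on the 10-year risk percentage. A minimal sketch follows. The function name `frs_category` is hypothetical, and it takes an already computed risk percentage as input rather than deriving the score from the FRS point tables.

```python
def frs_category(ten_year_risk_percent: float) -> str:
    """Map a 10-year risk percentage to the category names listed above."""
    if ten_year_risk_percent < 0:
        raise ValueError("risk percentage cannot be negative")
    if ten_year_risk_percent < 10:
        return "low"           # less than 10%
    if ten_year_risk_percent <= 20:
        return "intermediate"  # 10 to 20%
    return "high"              # greater than 20%
```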

Health Care Utilization
Time Frame: Through study completion, an average of 3 years
Prediction of health care utilization cost for heart disease and non-heart disease subjects, as measured by insurance claims data.

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Study record dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start (Actual)

September 25, 2020

Primary Completion (Actual)

May 30, 2025

Study Completion (Actual)

June 1, 2025

Study Registration Dates

First Submitted

September 28, 2020

First Submitted That Met QC Criteria

September 28, 2020

First Posted (Actual)

October 5, 2020

Study Record Updates

Last Update Posted (Estimated)

November 4, 2025

Last Update Submitted That Met QC Criteria

October 8, 2025

Last Verified

October 1, 2025

More Information

Terms related to this study

Plan for Individual participant data (IPD)

Plan to Share Individual Participant Data (IPD)?

UNDECIDED

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

No

Studies a U.S. FDA-regulated device product

No
