OpenEvidence Safety and Comparative Efficacy of Four LLM's in Clinical Practice

May 19, 2026 updated by: Hannah Galvin, Cambridge Health Alliance

A Comparative Performance Evaluation of Four Publicly Available Large Language Models Against Gold Standard Medical References

OpenEvidence is an online tool that aggregates and synthesizes data from peer-reviewed medical studies, then produces a response to a user's questions using generative AI. While it is in use by a number of clinicians (including residents) today, there is little to no published data on whether the tool's outputs are accurate and whether this information appropriately informs clinical decision making. Similarly, a number of clinicians are turning to other large language models (LLMs) to assist in decision making when providing clinical care. While there have been a number of studies published on the accuracy of these LLMs' responses to medical board questions or clinical vignettes, there have been few studies to date examining their performance in a real-world clinical setting, and even fewer comparing this performance.

In this study, investigators have two goals:

To determine whether the use of the AI tool "OpenEvidence" leads to clinically appropriate decisions when utilized by family medicine, internal medicine, and psychiatry residents in the course of clinical practice.
To determine how the output of the OpenEvidence tool compares with three other commonly-used, publicly-available large language models (OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini) in answering common questions that residents have in the course of clinical practice.

To accomplish study goal #1, investigators have enlisted residents in the above specialties to use the OpenEvidence tool in the course of clinical practice. In order to mitigate any safety risks, the residents will also use a typical reference tool for their question, which is referred to as the "Gold Standard" tool. These tools include PubMed and UpToDate. The residents will:

State their clinical question.
Query OpenEvidence, capturing their prompt and the OpenEvidence output for data analysis. All residents will undergo training in prompt engineering at the start of the study.
State their clinical conclusion based on the OpenEvidence data.
Query the Gold Standard Resource.
State their final clinical conclusion.
Answer a question on whether their clinical conclusion was modified by the Gold Standard reference.
Answer a question on whether they had any clinical safety concerns on the output from OpenEvidence.

Attending physician Subject Matter Experts (SMEs), matched by specialty and with at least 5 years of post-training clinical experience, will then evaluate the residents' responses. Five years was chosen based on the book "Outliers" by Malcolm Gladwell, in which he asserts that 10,000 hours of focused practice is needed to achieve expertise in a field.

SMEs will be asked to evaluate the residents' initial clinical questions and their conclusions based only on OpenEvidence. They will be asked to rate the clinical appropriateness of those conclusions on a scale of 1-10. For questions where the SMEs rate the clinical appropriateness of the residents' conclusions poorly (< 5/10), they will be asked to review the OpenEvidence output and answer an additional question as to whether the output was incorrect or the resident misinterpreted the output from the tool.

To accomplish goal #2, the initial prompt entered by the residents into OpenEvidence will be copied by the research team into ChatGPT, Gemini, and Claude. The outputs from each tool (including OpenEvidence) will be surfaced to SMEs, who will be asked to rate each output based on accuracy, completeness, and bias. Likert scales will be used for these ratings. SMEs will also be asked an open-ended question to identify any patient safety issues from any of the outputs.

Study Overview

Status

Enrolling by invitation

Conditions

Intervention / Treatment

Other: AI clinical reference tool

Detailed Description

OpenEvidence is an online tool built out of the Mayo Clinic Platform Accelerate [OpenEvidence] that aggregates and synthesizes data from peer-reviewed medical studies, then produces a response to a user's questions using generative AI. While it is in use by a number of clinicians (including residents) today, there is little to no published data on whether the tool's outputs are accurate and whether this information appropriately informs clinical decision making. Similarly, a number of clinicians are turning to other large language models (LLMs) to assist in decision making when providing clinical care.

OpenEvidence is an online tool designed to aggregate and synthesize data from peer-reviewed clinical studies, subsequently generating responses to user inquiries through the application of generative AI. Although increasingly utilized by both seasoned clinicians and trainees, there is a notable absence of published data regarding the accuracy of the tool's outputs and their safety and efficacy in appropriately informing clinical decision-making. Concurrently, a growing number of clinicians are leveraging other publicly-available large language models (LLMs) to support decision-making in clinical care. While a number of studies have examined the accuracy of LLM responses to medical board questions or clinical vignettes, there is limited research on their performance in real-world clinical settings, and even fewer studies offer comparative analyses of this performance.

In a review of the literature, one article shows LLMs may be better at detecting anxiety than practitioners, but this was based on clinical vignettes. [Levkovich et al.] Another looked at the diagnostic sensitivity of LLMs using patient-reported outcome measures in a structured questionnaire. [Pagano et al.] An additional study comparing LLMs for oncology also uses fictional vignettes. [Benary et al.] A randomized controlled trial using clinical vignettes did not show any clinical improvement for providers who had access to LLMs. [Goh et al.] One case study explored integration of ChatGPT 3.5 into daily rounds and evaluated its use qualitatively, but did not compare it with other LLMs or gold standard reference tools. [Skryd et al.] Another compared ChatGPT's responses to American College of Radiology appropriateness criteria for breast pain and breast cancer screening, but again did not compare it with other LLMs. [Rao et al.] In our review, only one study evaluated LLMs in a real-world clinical setting. This was a series of papers that looked at their use for complex decision making in breast-cancer care, using a small number of actual cases and a standardized prompt template. [Griewing, Knitza et al.; Griewing, Gremke et al.] That study found issues with consistency and deterioration of accuracy (particularly with GPT 3.5), leading the authors to conclude that the clinical use of LLMs for that purpose was not yet feasible at the time of publication. Still, health system leaders see the use of these tools rapidly accelerating in clinical practice. For this reason, investigators believe it is imperative to study their safety and the clinical appropriateness of the decisions clinicians are making as a result of their use.

Cambridge Health Alliance (CHA) is a public, academic safety-net health system in the Boston area, serving a diverse population of patients. CHA has a robust primary care and outpatient psychiatry footprint, and supports a large graduate medical education program through both Harvard Medical School and Tufts University School of Medicine. Investigators chose residents as our primary study participants as many trainees are already using OpenEvidence, and found them more incentivized to participate in the study if given access to the tool at CHA (where it is otherwise blacklisted from network services and prohibited by policy until results of this study can be determined).

Study outcomes are as follows:

Determine whether the use of OpenEvidence leads to clinically appropriate decisions by residents in the course of clinical practice in a community health setting.

Determine how the output of OpenEvidence compares with three other commonly-used, publicly-available large language models (OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini) in accuracy, completeness, and bias when addressing clinical questions residents have in the course of clinical practice in a community health setting.

Methods:

Data collection is planned to take place over a 6-month period in order to minimize vendor version upgrades during the study period. Residents are grouped by specialty into "medicine" (internal medicine/family medicine) and "psychiatry" (adult/child psychiatry). In order to simplify matching to appropriate specialty subject matter experts, medicine residents are asked to use OpenEvidence only for adult primary care cases (excluding OB/GYN-related issues). Psychiatry residents are asked to use OpenEvidence only for adult psychiatry cases.

Before being accepted as participants, trainees were all asked to agree to the following:

Cross check any OpenEvidence query against a Gold Standard tool, defined in the study protocol to include PubMed, UpToDate, Dynamed, a clinical specialty society guideline, or other similar clinical reference source that must be documented in the study form.
Not enter any personal health information (PHI) into OpenEvidence (as defined below).
Attempt to use OpenEvidence at least 3 times per week, if appropriate to the clinical care of the patient.
Document 100% of their OpenEvidence queries into the study research form to avoid selection bias.

All residents will be given brief training in prompt engineering for healthcare before data collection begins. Standardized prompts will not be used, as one of the subgoals of the study is to understand what types of queries residents submit to OpenEvidence in a real world setting.

All residents will be educated on the definition of PHI, as follows:

Queries should not include any PHI, as defined by the Safe Harbor identifiers [HHS]; queries can include patient age in years (days/weeks/months for pediatrics), and legal sex; for patients age 89 or older, the user must instead use the term "over age 89" to comply with Safe Harbor standards.
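
The age rule above can be illustrated with a minimal sketch. The function name `age_for_query` is hypothetical and not part of the study protocol; the sketch covers only ages expressed in whole years and leaves out the days/weeks/months convention for pediatric patients.

```python
def age_for_query(age_years: int) -> str:
    """Return age text that follows the protocol's rule for queries.

    Hypothetical helper for illustration only; the study record does not
    describe any software that enforces this rule.
    """
    if age_years < 0:
        raise ValueError("age must be non-negative")
    # Protocol rule as stated: patients age 89 or older are never
    # described with an exact age.
    if age_years >= 89:
        return "over age 89"
    return f"{age_years} years"


print(age_for_query(47))
print(age_for_query(89))
print(age_for_query(93))
```

Age is only one of the Safe Harbor identifiers; a check like this would not by itself make a query free of PHI.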

Queries should not include patients suspected of having extremely rare conditions as defined by the National Organization for Rare Disorders, as these are also prone to reidentification [NORD]. If a rare condition is not initially suspected but becomes suspected through the research process of using the AI tool, the user will be asked to stop their query at that point.

Data collection will involve the use of a HIPAA-compliant Google Form within CHA's enterprise Google Workspace for Health cloud infrastructure. The data collection form will ask trainees to do the following:

Enter their initial clinical question.
Paste their OpenEvidence prompt (numbered sequentially for iterative prompts) and the full OpenEvidence output(s) generated.
Enter their clinical conclusion based on the OpenEvidence output.
Enter the Gold Standard reference tool used.
State their final clinical conclusion based on information from both OpenEvidence and the Gold Standard Reference tool.
Answer a question on the extent to which their initial clinical conclusion was modified by the Gold Standard reference.
Answer an open-ended question on whether they noticed any clinical safety issues, inaccuracies, or bias in the output from OpenEvidence.

Queries will be sorted by specialty (medicine vs. psychiatry), and each query will receive a sequential study number.

Attending physician Subject Matter Experts (SMEs) Board Certified in Internal Medicine, Family Medicine, or Psychiatry with at least 5 years of post-training clinical experience were recruited. Five years of post-training clinical experience was chosen based on the fact that Malcolm Gladwell, in his book, Outliers, asserts that 10,000 hours of focused practice is needed to achieve expertise in a field.

SMEs will be asked to evaluate the residents' initial clinical questions and their conclusions based only on OpenEvidence. They will be asked to rate the clinical appropriateness of those conclusions on a 10-point Likert scale. SMEs will also be provided with the OpenEvidence output for each query, and where the SME rates the clinical appropriateness of the residents' conclusions poorly (< 5/10), the SME will additionally be asked a follow-up question to assess whether the tool's output itself provided a clinically inappropriate response, in order to ascertain whether the trainee may have misinterpreted the tool's output. SME review will include a 2.5-5% overlap between reviewers to calculate a kappa score for interrater reliability.
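
As a rough illustration of the interrater calculation on the overlapping reviews, the sketch below computes an unweighted Cohen's kappa for two raters. The record does not say which kappa variant will be used (a weighted kappa is a common choice for ordinal scales), so the unweighted form, the function name `cohens_kappa`, and the example ratings are all assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(rater_a)
    # Observed agreement: share of items given the identical score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal score frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Invented ratings for four overlapping queries (not study data).
sme_1 = [8, 8, 3, 3]
sme_2 = [8, 8, 3, 8]
print(cohens_kappa(sme_1, sme_2))
```

With a 10-point scale, exact-match agreement is a strict criterion, which is one reason a weighted variant might be preferred in practice.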

In part two of the study, the research team will sort the OpenEvidence queries into themes, and choose a random sampling of queries from each specialty and theme for comparison between LLMs. The research team will confirm that prompts do not include any PHI according to study protocol. They will then copy the OpenEvidence prompts entered by residents for the selected queries and paste them exactly into ChatGPT, Gemini, Copilot, and Claude.
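
One way to picture the sampling step is a stratified random draw, with one stratum per specialty and theme. The sketch below is an assumption about how that could be done; the field names, the theme labels, the per-stratum sample size, and the fixed seed are hypothetical and do not come from the study record.

```python
import random
from collections import defaultdict


def sample_queries(queries, per_stratum, seed=0):
    """Draw up to `per_stratum` queries at random from each
    (specialty, theme) combination."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    strata = defaultdict(list)
    for q in queries:
        strata[(q["specialty"], q["theme"])].append(q)
    selected = []
    for key in sorted(strata):
        group = strata[key]
        # Take the whole stratum if it is smaller than the target size.
        k = min(per_stratum, len(group))
        selected.extend(rng.sample(group, k))
    return selected


# Invented query log: 24 queries over 2 specialties and 2 made-up themes.
queries = [
    {
        "study_id": i,
        "specialty": "medicine" if i % 2 else "psychiatry",
        "theme": "dosing" if i % 3 else "diagnosis",
    }
    for i in range(1, 25)
]
picked = sample_queries(queries, per_stratum=2)
print(len(picked))
```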

The outputs of each of the five tools (OpenEvidence, ChatGPT, Gemini, Copilot, and Claude) will be surfaced in a Google webform. SMEs will be asked to rate each output on a Likert scale for accuracy, completeness, and bias, as well as to answer a qualitative question identifying any patient safety issues in the output.

Results:

Primary outcome results will be reported as follows:

Clinical appropriateness of decision made by residents using OpenEvidence (mean with SD, median), by specialty

If, in cases of low clinical appropriateness, SME's identified that this was due not to the tool's output but instead due to the resident's interpretation of the tool, metrics will also be provided with these cases excluded
Interrater reliability (kappa value)
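
A minimal sketch of the primary outcome summary follows, assuming each SME review is stored as a score plus a flag for whether a low score was attributed to the resident's interpretation. The field names and example values are hypothetical, and the sample standard deviation is an assumption because the record does not say which SD will be reported.

```python
from statistics import mean, median, stdev


def summarize(scores):
    """Mean, sample SD, and median of a list of Likert scores."""
    return {
        "n": len(scores),
        "mean": mean(scores),
        "sd": stdev(scores),  # sample SD; needs at least two scores
        "median": median(scores),
    }


def primary_outcome(reviews):
    """Summaries with and without low scores blamed on misinterpretation."""
    all_scores = [r["score"] for r in reviews]
    kept = [
        r["score"]
        for r in reviews
        if not (r["score"] < 5 and r["misinterpreted"])
    ]
    return {
        "all": summarize(all_scores),
        "excluding_misinterpretation": summarize(kept),
    }


# Invented reviews for one specialty (not study data).
reviews = [
    {"score": 8, "misinterpreted": False},
    {"score": 9, "misinterpreted": False},
    {"score": 3, "misinterpreted": True},
    {"score": 7, "misinterpreted": False},
    {"score": 4, "misinterpreted": False},
]
result = primary_outcome(reviews)
print(result["all"])
print(result["excluding_misinterpretation"])
```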

Secondary outcome results will be reported as follows:

For each specialty and each variable (accuracy, completeness, and bias), investigators will report:

The "win" rate for each LLM and average margin from the second place score.
The effect size (using each LLM as a separate reference) using Cohen's d test.
Interrater reliability (kappa value)
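
The win rate and margin can be sketched as follows. This is one reading of the description, not the investigators' code: the top score on a query earns 1 point, tied winners split the point equally, and the margin is the gap between the top score and the second-highest score (0 when first place is shared), averaged over the queries a tool won or tied. The function name and all scores are invented for illustration.

```python
from collections import defaultdict


def win_rates(ratings):
    """ratings: list of dicts mapping tool name -> score for one query.
    Each dict must contain at least two tools."""
    points = defaultdict(float)
    margins = defaultdict(list)
    for scores in ratings:
        ranked = sorted(scores.values(), reverse=True)
        top = ranked[0]
        winners = [tool for tool, s in scores.items() if s == top]
        # A shared first place is recorded with a margin of 0.
        margin = top - ranked[1] if len(winners) == 1 else 0.0
        for tool in winners:
            points[tool] += 1 / len(winners)
            margins[tool].append(margin)
    n = len(ratings)
    tools = sorted({tool for scores in ratings for tool in scores})
    return {
        tool: {
            "win_rate_pct": 100 * points[tool] / n,
            "avg_margin": (
                sum(margins[tool]) / len(margins[tool])
                if margins[tool]
                else None  # the tool never placed first
            ),
        }
        for tool in tools
    }


# Invented accuracy scores for four queries (not study data).
ratings = [
    {"OpenEvidence": 9, "ChatGPT": 7, "Gemini": 6, "Claude": 8},
    {"OpenEvidence": 8, "ChatGPT": 8, "Gemini": 5, "Claude": 7},
    {"OpenEvidence": 6, "ChatGPT": 7, "Gemini": 7, "Claude": 9},
    {"OpenEvidence": 7, "ChatGPT": 5, "Gemini": 6, "Claude": 4},
]
result = win_rates(ratings)
for tool, stats in result.items():
    print(tool, stats)
```

Where two SMEs rated the same query, their scores would be averaged before being passed in, as the outcome descriptions state.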

Study Type

Observational

Enrollment (Estimated)

Contacts and Locations

This section provides the contact details for those conducting the study, and information on where this study is being conducted.

Study Locations

United States
- Massachusetts
  - Cambridge, Massachusetts, United States, 02139
    - Cambridge Health Alliance

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

Child
Adult
Older Adult

Accepts Healthy Volunteers

Yes

Sampling Method

Non-Probability Sample

Study Population

Post-graduate trainees in Internal Medicine, Family Medicine, Adult Psychiatry, or Child Psychiatry

Description

Inclusion Criteria:

Active trainees PGY-1 through PGY-6 in Internal Medicine, Family Medicine, Adult Psychiatry, or Child Psychiatry at Cambridge Health Alliance
Must agree to the study protocol requirements outlined in the study description.

Exclusion Criteria:

Anyone who does not meet inclusion criteria
Residents who plan to leave CHA prior to the end of the study collection period.

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

Number of groups / cohorts

Cohorts and Interventions

Group / Cohort	Intervention / Treatment
Medicine Residents: Trainees in internal medicine or family medicine	Other: AI clinical reference tool. Residents will use the OpenEvidence clinical reference tool in the course of routine clinical care. They must also use a Gold Standard clinical reference tool (e.g. PubMed, UpToDate) to mitigate risk.
Psychiatry Residents: Trainees in adult and child psychiatry	Other: AI clinical reference tool. Residents will use the OpenEvidence clinical reference tool in the course of routine clinical care. They must also use a Gold Standard clinical reference tool (e.g. PubMed, UpToDate) to mitigate risk.

What is the study measuring?

Primary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Clinical Appropriateness: Mean with SD	Clinical appropriateness score of resident decisions based on OpenEvidence output. This is a numeric score on a 10-point Likert scale. (For Likert scales described in this and all of our outcome metrics, higher scores are better outcomes.) Mean score with standard deviation will be used for the primary outcome.	6 months
Clinical Appropriateness: Median	Clinical appropriateness score of resident decisions based on OpenEvidence output. This is a numeric score on a 10-point Likert scale. Median clinical appropriateness scores will also be reported.	6 months
Clinical Appropriateness: Interrater Reliability	SMEs will evaluate Clinical Appropriateness scores of resident decisions based on OpenEvidence output on a 10-point Likert scale. Interrater reliability of SME Clinical Appropriateness scores will be calculated using kappa value.	6 months

Secondary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Comparative Accuracy of LLMs: Win Rate	Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for accuracy on a 10-point Likert scale. Whichever model wins will be given a score of "1". If there is a tie, each gets 0.5 or 0.33 points, depending on the division of the tie. For each specialty (medicine and psychiatry) and each LLM, we will report the "win rate" with average margin from the second-place LLM. For example, for all Medicine queries, we will report the percentage of times OpenEvidence "won" over the other LLMs on the accuracy Likert scale (where there is SME overlap to determine the kappa value, we will average the SMEs' scores). As it pertains to ACCURACY, we will report the WIN RATE as a percentage.	6 months
Comparative Accuracy: Margin of Win	Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for accuracy on a 10-point Likert scale. For each specialty (medicine and psychiatry), we will calculate the win rate for each LLM on accuracy and then also report: Average margin from 2nd place LLM for ACCURACY. If 1st and 2nd place tie, average margin will be reported as 0.	6 months
Comparative Accuracy: Effect Size	Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for accuracy on a 10-point Likert scale. For each specialty (medicine and psychiatry), we will use Cohen's d test to calculate the effect size of each LLM's average ACCURACY score in comparison to each of the other three LLMs. An effect size of 0.2 is small, 0.5 is medium, and 0.8 is considered large.	6 months
Comparative Accuracy of LLMs: Interrater Reliability	Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for accuracy on a 10-point Likert scale. Interrater reliability of Accuracy scores will be calculated using kappa value.	6 months
Comparative Completeness of LLMs: Win Rate	Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for completeness on a 10-point Likert scale. Whichever model wins will be given a score of "1". If there is a tie, each gets 0.5 or 0.33 points, depending on the division of the tie. For each specialty (medicine and psychiatry) and each LLM, we will report the "win rate" with average margin from the second-place LLM. For example, for all Medicine queries, we will report the percentage of times OpenEvidence "won" over the other LLMs on the completeness Likert scale (where there is SME overlap to determine the kappa value, we will average the SMEs' scores). As it pertains to COMPLETENESS, we will report the WIN RATE as a percentage.	6 months
Comparative Completeness: Margin of Win	Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for completeness on a 10-point Likert scale. For each specialty (medicine and psychiatry), we will calculate the win rate for each LLM on completeness and then also report: Average margin from 2nd place LLM for COMPLETENESS. If 1st and 2nd place tie, average margin will be reported as 0.	6 months
Comparative Completeness: Effect Size	Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for completeness on a 10-point Likert scale. For each specialty (medicine and psychiatry), we will use Cohen's d test to calculate the effect size of each LLM's average COMPLETENESS score in comparison to each of the other three LLMs. An effect size of 0.2 is small, 0.5 is medium, and 0.8 is considered large.	6 months
Comparative Completeness of LLMs: Interrater Reliability	Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for completeness on a 10-point Likert scale. Interrater reliability of Completeness scores will be calculated using kappa value.	6 months
Comparative Bias of LLMs: Win Rate	Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for signs of bias on a 10-point Likert scale. Whichever model wins will be given a score of "1". If there is a tie, each gets 0.5 or 0.33 points, depending on the division of the tie. For each specialty (medicine and psychiatry) and each LLM, we will report the "win rate" with average margin from the second-place LLM. For example, for all Medicine queries, we will report the percentage of times OpenEvidence "won" over the other LLMs on the bias Likert scale (where there is SME overlap to determine the kappa value, we will average the SMEs' scores). As it pertains to BIAS, we will report the WIN RATE as a percentage.	6 months
Comparative Bias: Margin of Win	Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for bias on a 10-point Likert scale. For each specialty (medicine and psychiatry), we will calculate the win rate for each LLM on bias and then also report: Average margin from 2nd place LLM for BIAS. If 1st and 2nd place tie, average margin will be reported as 0.	6 months
Comparative Bias: Effect Size	Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for bias on a 10-point Likert scale. For each specialty (medicine and psychiatry), we will use Cohen's d test to calculate the effect size of each LLM's average BIAS score in comparison to each of the other three LLMs. An effect size of 0.2 is small, 0.5 is medium, and 0.8 is considered large.	6 months
Comparative Bias of LLMs: Interrater Reliability	Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for bias on a 10-point Likert scale. Interrater reliability of Bias scores will be calculated using kappa value.	6 months
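
The effect-size rows above can be illustrated with a short sketch of Cohen's d using a pooled standard deviation. The pooled, independent-samples form is an assumption: the record does not state which variant will be used, and because every tool answers the same queries a paired effect size would also be defensible. The "negligible" label for values below 0.2 and the example scores are likewise additions for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(scores_a, scores_b):
    """Cohen's d for two score lists, using the pooled sample SD."""
    n_a, n_b = len(scores_a), len(scores_b)
    pooled_var = (
        (n_a - 1) * variance(scores_a) + (n_b - 1) * variance(scores_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(scores_a) - mean(scores_b)) / sqrt(pooled_var)


def label(d):
    """Map |d| onto the thresholds quoted in the outcome descriptions."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"  # label not used in the record; added here


# Invented accuracy scores for two tools on four queries (not study data).
tool_a = [8, 9, 7, 8]
tool_b = [6, 7, 5, 6]
d = cohens_d(tool_a, tool_b)
print(round(d, 3), label(d))
```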

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Sponsor

Cambridge Health Alliance

Investigators

Principal Investigator: Hannah K Galvin, MD, Cambridge Health Alliance

Publications and helpful links

The person responsible for entering information about the study voluntarily provides these publications. These may be about anything related to the study.

General Publications

Helpful Links

Study record dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start (Actual)

October 1, 2025

Primary Completion (Estimated)

May 30, 2026

Study Completion (Estimated)

September 30, 2026

Study Registration Dates

First Submitted

September 15, 2025

First Submitted That Met QC Criteria

September 26, 2025

First Posted (Actual)

September 30, 2025

Study Record Updates

Last Update Posted (Actual)

May 22, 2026

Last Update Submitted That Met QC Criteria

May 19, 2026

Last Verified

May 1, 2026

More Information

Terms related to this study

Other Study ID Numbers

CHA-IRB-25-26-444

Plan for Individual participant data (IPD)

Plan to Share Individual Participant Data (IPD)?

UNDECIDED

IPD Plan Description

We don't anticipate making individual participant data or data dictionaries available, but this is a novel area of research, so it is unclear what we might uncover that could assist other researchers in the future. If researchers approached us, and examining the specific queries and responses could assist in designing further studies, we would consider this along with our IRB, legal and compliance teams.

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

Studies a U.S. FDA-regulated device product

This information was retrieved directly from the website clinicaltrials.gov without any changes. If you have any requests to change, remove or update your study details, please contact register@clinicaltrials.gov. As soon as a change is implemented on clinicaltrials.gov, this will be updated automatically on our website as well.
