
Preliminary Evaluation of a Large Language Model-Based Tool for Complex Surgical Decision Support in Lung Cancer

Updated June 13, 2026 by: XiuYuan Chen, Peking University People's Hospital

This is an exploratory effect-size estimation study with two objectives: ① to obtain the point estimate and 95% confidence interval of the Win Ratio for the experimental group (GAPS-Agent) versus the control group (large language model) in blinded pairwise preference judgments by thoracic surgery expert adjudicators, to serve as a sample size planning parameter for subsequent multicenter confirmatory clinical trials; ② to preliminarily evaluate the value of GAPS-Agent within clinical workflows.

The study hypothesis is that, compared with a general-purpose large language model without medical enhancement (control group), a structured agentic workflow optimized on the basis of the GAPS evaluation framework (GAPS-Agent, experimental group) can help junior resident physicians generate clinical decision plans for complex lung cancer cases that are more strongly preferred by senior thoracic surgery expert adjudicators.

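Study Overview

Status

Enrolling by invitation

Conditions

Intervention / Treatment

Study Type

Interventional

Enrollment (Estimated)

Phase

Not Applicable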

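Contacts and Locations

This section provides contact details for those conducting the study, and information on where the study is being conducted.

Study Locations

China
- Beijing Municipality
  - Beijing, Beijing Municipality, China, 100044
    - Peking University People's Hospital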

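Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

Adult
Older Adult

Accepts Healthy Volunteers

No

Description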

Inclusion Criteria:

Resident Physician Subjects:
1. Holds a valid and legally effective Physician Practice License of the People's Republic of China;
2. Currently holds the rank of resident physician in a thoracic surgery department at a tertiary Class A (3A) hospital;
3. Agrees to complete all assessment tasks of the main study phase in accordance with the study protocol;
4. Can guarantee the time and effort required to complete all assessment tasks of the main study.
Study Cases:
1. The case was discussed at the Thoracic Oncology Multidisciplinary Team (MDT) conference of Peking University People's Hospital between January 2025 and May 2026;
2. The current version of the NCCN guidelines does not provide an explicit recommendation covering the management of the case;
3. Does not overlap with the GAPS evaluation set;
4. The case is presented as plain text in a structured format, with all direct and indirect identifiers removed and complete de-identification performed prior to inclusion;
5. From the pool of eligible cases, 12 cases will be randomly drawn using Python (numpy.random, with a fixed and archived seed) to serve as the main study cases. The cases will cover 6 themes (chest mass of undetermined diagnosis, early-stage lung cancer, locally advanced lung cancer, oligometastatic/oligoprogressive disease, special intraoperative situations, and tumor recurrence), with 2 cases per theme.
Adjudication Expert Panel:
1. Holds a valid and legally effective Physician Practice License of the People's Republic of China;
2. Currently holds the rank of attending physician or above in a thoracic surgery department at a tertiary Class A hospital;
3. Chairs or regularly participates in lung cancer multidisciplinary team (MDT) work in their department.
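
The random draw described for the study cases (12 cases, 2 per theme, fixed seed) could be sketched as follows. This is only an illustration under stated assumptions: the record names numpy.random and a fixed, archived seed, but it does not say how the draw is stratified, so drawing 2 cases independently within each theme is an assumption. The function name `draw_cases`, the pool layout, and the seed value are hypothetical.

```python
import numpy as np

def draw_cases(pool, per_theme=2, seed=20260610):
    """Draw `per_theme` case IDs from each theme without replacement.

    pool: dict mapping theme name -> list of eligible case IDs (hypothetical layout).
    seed: fixed seed so the draw can be reproduced and archived (value is illustrative).
    """
    rng = np.random.default_rng(seed)
    selected = {}
    # Iterate in sorted order so the result does not depend on dict ordering.
    for theme in sorted(pool):
        ids = sorted(pool[theme])
        if len(ids) < per_theme:
            raise ValueError(f"theme {theme!r} has fewer than {per_theme} eligible cases")
        picked = rng.choice(ids, size=per_theme, replace=False)
        selected[theme] = sorted(picked.tolist())
    return selected
```

Sorting the themes and case IDs before sampling means the same seed and the same pool always yield the same 12 cases, regardless of how the pool was assembled.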

Exclusion Criteria:

Resident Physician Subjects:
1. Has previously participated in the construction of the GAPS evaluation set or the development of GAPS-Agent;
2. Unable to complete the tasks of the study phase.
Study Cases:
1. Key case information is missing, such as text-form data on pathology (including IHC/NGS), imaging, laboratory tests, prior medical history, comorbidities, or PS score;
2. Decision-making for the case is strictly dependent on non-text information.
Adjudication Expert Panel:
1. Participated in the construction of the GAPS evaluation set, the content validity verification, or the development of GAPS-Agent for this study;
2. Has a direct conflict of interest with any product used in either arm of this study.

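Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

Primary Purpose: Other
Allocation: Randomized
Interventional Model: Parallel Assignment
Masking: Single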

Number of Arms

Arms and Interventions

Participant Group / Arm	Intervention / Treatment
Experimental: test arm GAPS-Agent	Other: GAPS-Agent. The research group has previously developed the GAPS evaluation framework for complex clinical decision-making in lung cancer. In this framework, G (Grounding) characterizes the cognitive depth of decision-making (ranging from knowledge retrieval to decisions that go beyond clinical guidelines), A (Authority) corresponds to the grading of evidence strength, P (Perturbation) describes the identification and management of real-world clinical confounding factors, and S (Strength) corresponds to the calibration of recommendation strength. Within this framework, the research group has constructed a 100-item evaluation set for complex lung cancer decision-making, along with its corresponding rubrics, and has invited multiple thoracic oncology experts to validate its content. Based on this, the research group developed GAPS-Agent, which uses an open-source large language model as its foundation and integrates functional modules such as guideline and evidence retri
Active Comparator: control arm LLM	Other: LLM. An open-source large language model without specific medical-domain enhancement.

What is the study measuring?

Primary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Overall plan Win Ratio	A total of 10 blinded expert judges made Win/Tie/Loss ternary preference judgments on 192 paired plan comparisons in terms of overall plan quality. The win ratio was calculated as Wins ÷ Losses, and the 95% confidence interval was estimated using a two-level (physician × case) cluster bootstrap resampling method (B = 10,000, quantile method on the log scale).	Measured at the time when experts completed their preference judgements. Calculated up to 3 weeks after the preference judgements.
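
The win ratio and its bootstrap interval can be illustrated with a minimal sketch. Several points are assumptions rather than the registered analysis: the record does not specify how the two levels are resampled, so this sketch resamples physicians and cases independently with replacement (a crossed cluster bootstrap); the count-matrix layout and the function names are hypothetical; ties are simply excluded from the ratio, as the definition Wins ÷ Losses implies.

```python
import numpy as np

def win_ratio(wins, losses):
    """Win ratio = total wins / total losses (ties do not enter the ratio)."""
    return float(np.sum(wins) / np.sum(losses))

def bootstrap_win_ratio_ci(wins, losses, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile CI for the win ratio, computed on the log scale.

    wins[p, c] / losses[p, c]: number of Win / Loss judgments for physician
    cluster p on case c (hypothetical layout).
    Assumed scheme: physicians and cases are each resampled with replacement,
    and the resampled cells are pooled before taking the ratio.
    """
    wins = np.asarray(wins, dtype=float)
    losses = np.asarray(losses, dtype=float)
    n_phys, n_cases = wins.shape
    rng = np.random.default_rng(seed)
    log_wr = []
    for _ in range(n_boot):
        p = rng.integers(0, n_phys, size=n_phys)    # resampled physician indices
        c = rng.integers(0, n_cases, size=n_cases)  # resampled case indices
        w = wins[np.ix_(p, c)].sum()
        l = losses[np.ix_(p, c)].sum()
        if w > 0 and l > 0:  # log ratio is undefined if either total is zero
            log_wr.append(np.log(w / l))
    if not log_wr:
        raise ValueError("no bootstrap replicate had both wins and losses")
    lo, hi = np.quantile(log_wr, [alpha / 2, 1 - alpha / 2])
    return win_ratio(wins, losses), float(np.exp(lo)), float(np.exp(hi))
```

Replicates with zero wins or zero losses are dropped here for simplicity; a real analysis plan would need to state how such replicates are handled.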

Secondary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Inter-rater agreement	For the ternary preference judgment results of 10 expert judges across 192 paired comparisons and 6 evaluation domains, Fleiss' kappa was used to assess inter-rater agreement. The kappa value and its 95% confidence interval are reported for each evaluation domain.	Measured at the time when experts completed their preference judgements. Calculated up to 3 weeks after the preference judgements.
Redundancy Win Ratio	A total of 10 blinded expert judges made Win/Tie/Loss ternary preference judgments on 192 paired plan comparisons in terms of overall plan quality. The win ratio was calculated as Wins ÷ Losses, and the 95% confidence interval was estimated using a two-level (physician × case) cluster bootstrap resampling method (B = 10,000, quantile method on the log scale).	Measured at the time when experts completed their preference judgements. Calculated up to 3 weeks after the preference judgements.
Evidence-based medicine adherence Win Ratio	A total of 10 blinded expert judges made Win/Tie/Loss ternary preference judgments on 192 paired plan comparisons in terms of overall plan quality. The win ratio was calculated as Wins ÷ Losses, and the 95% confidence interval was estimated using a two-level (physician × case) cluster bootstrap resampling method (B = 10,000, quantile method on the log scale).	Measured at the time when experts completed their preference judgements. Calculated up to 3 weeks after the preference judgements.
Actionability Win Ratio	A total of 10 blinded expert judges made Win/Tie/Loss ternary preference judgments on 192 paired plan comparisons in terms of overall plan quality. The win ratio was calculated as Wins ÷ Losses, and the 95% confidence interval was estimated using a two-level (physician × case) cluster bootstrap resampling method (B = 10,000, quantile method on the log scale).	Measured at the time when experts completed their preference judgements. Calculated up to 3 weeks after the preference judgements.
Completeness Win Ratio	A total of 10 blinded expert judges made Win/Tie/Loss ternary preference judgments on 192 paired plan comparisons in terms of overall plan quality. The win ratio was calculated as Wins ÷ Losses, and the 95% confidence interval was estimated using a two-level (physician × case) cluster bootstrap resampling method (B = 10,000, quantile method on the log scale).	Measured at the time when experts completed their preference judgements. Calculated up to 3 weeks after the preference judgements.
Safety Win Ratio	A total of 10 blinded expert judges made Win/Tie/Loss ternary preference judgments on 192 paired plan comparisons in terms of overall plan quality. The win ratio was calculated as Wins ÷ Losses, and the 95% confidence interval was estimated using a two-level (physician × case) cluster bootstrap resampling method (B = 10,000, quantile method on the log scale).	Measured at the time when experts completed their preference judgements. Calculated up to 3 weeks after the preference judgements.
GAPS automated rubric score	A third-party large language model, independent of the two study arms' base models, served as the judge model and automatically scored all 96 plans according to the GAPS rubric.	Generated up to 3 weeks after residents finished their plan generation.
Subject physician's self-confidence score	After submitting each case plan, the participating physicians self-rated their confidence in their own plan using a 1-5 point Likert scale.	Completed at the time when residents submitted their plans. Calculated up to 3 weeks after the submission.
Tool satisfaction score	After submitting each case plan, the participating physicians rated their satisfaction with the tool using a 1-5 point Likert scale.	Completed at the time when residents submitted their plans. Calculated up to 3 weeks after the submission.
Tool trustworthiness score	After submitting each case plan, the participating physicians rated the tool's credibility using a 1-5 point Likert scale.	Completed at the time when residents submitted their plans. Calculated up to 3 weeks after the submission.
Decision-making time	The time taken (in minutes) by each participating physician to complete the production of each case plan was automatically recorded by the evaluation platform. Differences between groups were analyzed using a linear mixed-effects model.	Completed at the time when residents submitted their plans. Calculated up to 3 weeks after the submission.
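
For the inter-rater agreement measure, Fleiss' kappa can be computed directly from a table of category counts. The sketch below is a plain NumPy illustration of the standard formula, not the study's analysis code; the function name `fleiss_kappa` and the input layout (one row per paired comparison, one column per Win/Tie/Loss category) are assumptions, and the confidence interval mentioned in the record is not covered here.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for a table of category counts.

    counts[i, j] = number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    if not np.allclose(counts.sum(axis=1), n_raters):
        raise ValueError("every item must have the same number of ratings")
    # Overall proportion of all ratings that fall in each category.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    # Per-item agreement: proportion of rater pairs that agree on the item.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()           # mean observed agreement
    p_e = np.square(p_j).sum()   # agreement expected by chance
    return float((p_bar - p_e) / (1 - p_e))
```

In this study's setting each row would hold the counts of Win, Tie, and Loss votes from the 10 judges for one paired comparison, with one table per evaluation domain.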

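Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Sponsor

Peking University People's Hospital

Study Record Dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.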

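Study Major Dates

Study Start (Actual)

June 10, 2026

Primary Completion (Estimated)

June 21, 2026

Study Completion (Estimated)

June 21, 2026

Study Registration Dates

First Submitted

June 10, 2026

First Submitted That Met QC Criteria

June 13, 2026

First Posted (Actual)

June 17, 2026

Study Record Updates

Last Update Posted (Actual)

June 17, 2026

Last Update Submitted That Met QC Criteria

June 13, 2026

Last Verified

June 1, 2026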
