Large Language Models Versus Human Examiners for Grading Physiotherapy Clinical Cases (PACE-AI)

June 24, 2026 updated by: Alfredo Lerín Calvo, Neuron, Spain

Agreement Between Large Language Models and Faculty Assessment in the Evaluation of Clinical Reasoning Case Examinations in Undergraduate Physiotherapy Education: A Comparative Reliability Study

This study evaluates whether large language models (LLMs) can reliably assess written clinical-reasoning case examinations completed by undergraduate physiotherapy students, compared with faculty assessment. In the course "Specific Methods in Physiotherapy" (third year of the Physiotherapy Degree), students solve complex clinical cases that require clinical reasoning, technical knowledge, and therapeutic decision-making. These cases are traditionally graded by faculty, a time-consuming process that may show inter-rater variability.

A set of de-identified student case examinations will be assessed using the rubric currently applied in the course, which covers clarity and structure of clinical reasoning, integration of the biopsychosocial model (ICF and APTA frameworks), accuracy in identifying pain mechanisms, coherence between diagnosis, hypotheses, and treatment, originality and depth of analysis, and professional writing. Each examination will be scored independently by three LLMs (for example, Claude, ChatGPT, and Gemini), each receiving an identical standardized prompt that embeds the same rubric, and by faculty serving as the reference standard.

To avoid overloading faculty, full double human grading may not be feasible; the human reference will therefore consist of expert faculty grading by two independent raters when resources allow, or otherwise by a single rater. In contrast, paired assessment is fully implemented across the AI models: each examination is scored by three LLMs, and each model is queried in duplicate, allowing the study to estimate both the agreement between models and the test-retest stability of each model.

The primary aim is to quantify agreement between LLM-generated scores and the faculty reference score. Secondary aims include agreement among the LLMs, test-retest reliability of each model, criterion-level agreement, the quality and usefulness of the qualitative feedback generated, the time and cost associated with each approach, and students' perceptions of the usefulness of human versus AI feedback.

The findings will clarify the strengths and limitations of LLMs as supportive tools for formative assessment in health-professions education and will inform criteria for their responsible and effective use. No LLM output will affect students' official grades, which remain the sole responsibility of faculty.

Study Overview

Detailed Description

BACKGROUND AND RATIONALE

The assessment of clinical case examinations in physiotherapy requires the appraisal of multiple dimensions, including clinical reasoning, selection of appropriate techniques, treatment dosage, ethical considerations, and patient communication. This grading process demands substantial faculty time and may be affected by evaluator fatigue and inter-rater variability. Large language models (LLMs) have shown notable capabilities in text comprehension, reasoning, and the generation of structured feedback, and preliminary evidence suggests they may provide consistent evaluations in medical education contexts. However, questions remain regarding their reliability, potential bias, and their ability to capture the complexity of clinical reasoning. There is currently limited empirical evidence on whether LLMs can complement human grading while maintaining quality standards and offering immediate formative feedback. This study addresses that gap through a systematic comparison between human assessment (reference standard) and LLM-assisted assessment using identical materials and criteria.

OBJECTIVES

Primary objective: To quantify the agreement between the scores generated by LLMs and the faculty reference score in the assessment of physiotherapy clinical-reasoning case examinations.

Secondary objectives:

  1. To estimate the agreement among different LLMs (inter-model reliability).
  2. To estimate the test-retest reliability of each LLM (intra-model reliability) when the same examination is scored on repeated, independent administrations.
  3. To evaluate agreement at the level of individual rubric criteria.
  4. To compare the quality, specificity, and formative usefulness of the qualitative feedback produced by LLMs and by faculty.
  5. To compare the time and cost associated with human and LLM assessment.
  6. To assess students' perceptions of the usefulness, fairness, and transparency of human versus AI feedback.

STUDY DESIGN

Cross-sectional inter-rater agreement and reliability study with repeated measures, in which the same set of de-identified student clinical case examinations is assessed independently by human and artificial-intelligence raters using a shared, predefined rubric. The study is observational and educational in nature; it does not modify the teaching or the official assessment received by students.

SETTING, PARTICIPANTS, AND MATERIALS

The study is conducted within the course "Specific Methods in Physiotherapy" (third year, Physiotherapy Degree), in which clinical cases are used as a central learning and assessment tool. The units of analysis are the written case examinations produced by enrolled students (approximately 60 to 80 examinations). All examinations are anonymized prior to assessment so that no rater can identify the author. The grading instrument is the rubric already used in the course, comprising the following criteria: (1) clarity and structure of clinical reasoning; (2) integration of the biopsychosocial model (ICF and APTA frameworks); (3) accuracy in identifying pain mechanisms; (4) coherence between diagnosis, hypotheses, and treatment; (5) originality and depth of analysis; and (6) appropriate professional writing. Each criterion yields a partial score, and the criteria sum to a global score.

RATERS AND ASSESSMENT PROCEDURE

Human assessment (reference standard): Each examination is scored by faculty with expertise in the course, applying the established rubric. The protocol is designed to accommodate two scenarios depending on faculty workload. In the preferred scenario, two faculty members score each examination independently (paired human correction), enabling estimation of human-human reliability and the use of the mean or consensus score as the reference. In the contingency scenario, to avoid overloading faculty, a single expert faculty rating (or, alternatively, the official course grade already assigned) is used as the reference standard; in that case, human-human reliability is not estimated within this study and is acknowledged as a limitation.

Artificial-intelligence assessment (paired AI correction): The same anonymized examinations are scored independently by three LLMs (for example, Claude, ChatGPT, and Gemini, in the versions available during the data-collection period). Each model receives an identical standardized prompt that embeds the same rubric and requests, for every examination, a partial score per criterion, a global score, and structured qualitative feedback. To assess intra-model (test-retest) reliability, each model is queried in duplicate in independent sessions under fixed generation parameters. This design implements "peer" correction across models: outputs are cross-compared between models (inter-model agreement) and against repeated runs of the same model (intra-model stability), mirroring the logic of paired review while remaining feasible without additional faculty burden. Grading order is randomized, and raters (human and AI) are blinded to one another's scores. The time required for each evaluation and the operating cost of each LLM are recorded.
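As an illustration of the randomized grading order and the duplicate independent sessions described above, the following is a minimal Python sketch. It is not part of the registered protocol: the function name `build_schedule`, the examination IDs, the model placeholders, and the seed are all hypothetical, and the sketch only produces a grading order; it does not call any LLM.

```python
import random


def build_schedule(exam_ids, models, n_runs=2, seed=2026):
    """Return a dict mapping (model, run) to an independently shuffled exam order.

    Each (model, run) pair stands for one independent session, so the
    duplicate runs of a model see the examinations in different orders.
    """
    rng = random.Random(seed)  # fixed seed so the order can be reproduced
    schedule = {}
    for model in models:
        for run in range(1, n_runs + 1):
            order = list(exam_ids)
            rng.shuffle(order)
            schedule[(model, run)] = order
    return schedule


# Hypothetical anonymized IDs and placeholder model names.
exam_ids = [f"EX{n:03d}" for n in range(1, 6)]
models = ["model_a", "model_b", "model_c"]

schedule = build_schedule(exam_ids, models)
for (model, run), order in schedule.items():
    print(model, run, order)
```

Storing the seed alongside the results would allow the grading order to be audited later.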

OUTCOME MEASURES

Primary outcome: agreement between each LLM's global score and the faculty reference global score.

Secondary outcomes: inter-model agreement among LLMs; intra-model test-retest reliability; criterion-level agreement; feedback quality (number of specific and actionable comments, coverage of case dimensions, and rated formative usefulness); efficiency (mean evaluation time and cost per evaluation); and student-perceived usefulness of human versus AI feedback (Likert scales).

STATISTICAL ANALYSIS

General approach: Analyses will be performed in R. Continuous variables will be summarized as mean and standard deviation or median and interquartile range, according to distribution (assessed with the Shapiro-Wilk test and graphical inspection); categorical variables as absolute and relative frequencies. All tests will be two-sided with an alpha of 0.05, and 95% confidence intervals (CI) will be reported for all reliability and agreement estimates.

Primary analysis (LLM vs faculty agreement): For the global score (continuous), agreement between each LLM and the faculty reference will be quantified with the intraclass correlation coefficient (ICC), two-way random-effects model, absolute-agreement definition, single-rater and average-rater forms [ICC(2,1) and ICC(2,k)], following the conventions of McGraw and Wong and the reporting guidance of Koo and Li. ICC values will be interpreted as poor (<0.50), moderate (0.50-0.75), good (0.75-0.90), and excellent (>0.90). Systematic bias will be examined with Bland-Altman analysis, reporting the mean difference and 95% limits of agreement, with inspection for proportional bias. For criterion-level (ordinal) scores, agreement will be quantified with Cohen's weighted kappa using quadratic weights. Because kappa is sensitive to prevalence and marginal imbalance (the "kappa paradox"), Gwet's AC1/AC2 and the prevalence-adjusted bias-adjusted kappa (PABAK) will be reported as robust complements. Categorical agreement coefficients will be interpreted using the Landis and Koch benchmarks. Pre-specified targets, consistent with the project objectives, are ICC >= 0.75 for the global score and weighted kappa >= 0.60 at the criterion level.
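To make the two ICC forms concrete, the sketch below computes ICC(2,1) and ICC(2,k) from the two-way ANOVA mean squares, using the standard Shrout and Fleiss expressions. It is written in Python purely for illustration (the protocol specifies R for the actual analyses), the function name is hypothetical, and the scores are invented toy values, not study data. Confidence intervals are omitted.

```python
def icc_two_way_random(scores):
    """Two-way random-effects, absolute-agreement ICC.

    scores: list of rows, one per examination; each row holds one
    score per rater (for example [faculty, llm]).
    Returns (ICC(2,1), ICC(2,k)).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-examination mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


# Invented toy scores: five examinations, columns = [faculty, one LLM].
toy_scores = [
    [7.5, 8.0],
    [6.0, 6.5],
    [9.0, 8.5],
    [5.5, 6.0],
    [8.0, 8.0],
]
icc_single, icc_average = icc_two_way_random(toy_scores)
print(f"ICC(2,1) = {icc_single:.2f}, ICC(2,k) = {icc_average:.2f}")
```

Because the absolute-agreement form includes the between-rater mean square in the denominator, a model that is consistently stricter or more lenient than faculty lowers the coefficient even when it ranks examinations identically.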

Inter-model reliability: Agreement among the three LLMs on the global score will be estimated with a two-way random-effects ICC for multiple raters; for criterion-level categorical scores, Fleiss' kappa and Gwet's AC will be used. Pairwise model comparisons will also be reported.
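For the criterion-level comparison among the three models, Fleiss' kappa can be sketched as follows. This is an illustrative Python version with a hypothetical function name and invented ratings; it treats categories as nominal, so it ignores how far apart two ordinal scores are.

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per item.

    ratings: one list per examination containing the category assigned
    by each rater (here, by each LLM) for a single rubric criterion.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who put item i in category j
    counts = [[row.count(c) for c in categories] for row in ratings]

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in p_cats)
    return (p_observed - p_expected) / (1 - p_expected)


# Invented example: four examinations, three models, two score levels.
example = [[1, 1, 2], [2, 2, 2], [1, 2, 2], [1, 1, 1]]
print(round(fleiss_kappa(example, categories=[1, 2]), 3))
```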

Intra-model (test-retest) reliability: For each model, agreement between duplicate runs will be quantified with the ICC and the percentage of exact agreement for categorical criteria, complemented by Gwet's AC. The standard error of measurement (SEM) and the minimal detectable change (MDC95) will be derived from the ICC to express measurement precision in score units.
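The usual expressions are SEM = SD x sqrt(1 - ICC) and MDC95 = 1.96 x sqrt(2) x SEM. The Python sketch below applies them; the function name and the scores are hypothetical, and the choice of SD (here, the sample standard deviation of the scores passed in) is an assumption, since the protocol does not state which SD will be used.

```python
import math
import statistics


def sem_and_mdc95(scores, icc):
    """Return (SEM, MDC95) in the same units as the scores."""
    sd = statistics.stdev(scores)          # sample standard deviation
    sem = sd * math.sqrt(1 - icc)          # standard error of measurement
    mdc95 = 1.96 * math.sqrt(2) * sem      # minimal detectable change, 95%
    return sem, mdc95


# Invented global scores and an assumed test-retest ICC of 0.75.
toy_global_scores = [2, 4, 4, 4, 5, 5, 7, 9]
sem, mdc95 = sem_and_mdc95(toy_global_scores, icc=0.75)
print(f"SEM = {sem:.2f}, MDC95 = {mdc95:.2f}")
```

Under these formulas, a difference between two runs of the same model that is smaller than MDC95 cannot be distinguished from measurement noise.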

Feedback quality (qualitative and quantitative): The qualitative feedback will be analyzed through structured content analysis with predefined categories aligned to the rubric dimensions. The number of specific, actionable comments and the coverage of case dimensions (reasoning, technique, communication, ethics) will be counted for each rater type and compared using chi-square or Fisher's exact tests for proportions and Kruskal-Wallis tests for counts, with appropriate post-hoc comparisons. Inter-coder reliability for the content-analysis coding will itself be reported (Cohen's or Gwet's coefficient) to ensure the trustworthiness of the categorization.

Efficiency: Mean evaluation time will be compared between humans and each LLM using paired t-tests or Wilcoxon signed-rank tests depending on distribution. Cost per evaluation (faculty time valued at standard institutional rates and LLM usage fees) will be summarized descriptively.

Student perception: Likert responses will be summarized as medians and IQRs and as the proportion of favorable responses. Differences in perceived usefulness between human and AI feedback will be tested with the Wilcoxon signed-rank test, and the Friedman test will be used when more than two feedback sources are compared, with appropriate post-hoc analysis.

Handling of the reference-standard contingency: If two independent faculty ratings are obtained, human-human reliability will be reported and the mean or consensus score used as the reference for all LLM comparisons; if only a single faculty rating (or the official course grade) is available, that single value serves as the reference, the absence of an in-study human-human reliability estimate is reported as a limitation, and, where available, historical course data on faculty agreement are cited as external context. The AI-centered analyses (inter-model and intra-model reliability) are unaffected by this contingency and are performed in all cases.

Sensitivity analyses and missing data: Sensitivity analyses will explore the influence of score distribution and of any examinations with extreme scores. The pattern of any missing or non-evaluable outputs (for example, an LLM failing to return a parsable score) will be described, and complete-case analysis will be the primary approach, with the proportion of missing data reported.
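One way to operationalize "non-evaluable output" is sketched below in Python. Everything about the output format is an assumption made for illustration: the protocol does not say that models return JSON, the key name `criteria` is hypothetical, and the maximum of 10 points per criterion is a placeholder because the rubric scale is not stated.

```python
import json

N_CRITERIA = 6  # the course rubric has six criteria


def parse_output(raw, max_per_criterion=10):
    """Return the list of criterion scores, or None if not evaluable.

    Assumes (hypothetically) that the model was asked to reply with
    JSON of the form {"criteria": [s1, ..., s6]}.
    """
    try:
        scores = json.loads(raw)["criteria"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    if not isinstance(scores, list) or len(scores) != N_CRITERIA:
        return None
    for s in scores:
        if isinstance(s, bool) or not isinstance(s, (int, float)):
            return None
        if not 0 <= s <= max_per_criterion:
            return None
    return scores


# Invented raw outputs: one valid, one refusal, one truncated.
raw_outputs = [
    '{"criteria": [8, 7, 9, 6, 7, 8]}',
    "Sorry, I cannot grade this.",
    '{"criteria": [8, 7]}',
]
parsed = [parse_output(r) for r in raw_outputs]
n_missing = sum(p is None for p in parsed)
print(f"non-evaluable outputs: {n_missing} of {len(parsed)}")
```

The complete-case analysis would then keep only examinations for which every rater returned a parsable score, with the non-evaluable proportion reported per model.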

SAMPLE SIZE JUSTIFICATION

This is a reliability/agreement study, so the sample-size rationale is based on the precision of the reliability estimates rather than on hypothesis testing of a between-group difference. For an expected ICC of approximately 0.75 estimated with several raters and a target 95% CI half-width of about 0.10-0.12, established approximations for ICC precision (Bonett) indicate that on the order of 40-60 examinations are required; for kappa-based criterion agreement, the approximations of Donner and Rotnitzky yield comparable figures. The approximately 60-80 available examinations therefore provide adequate precision for the planned estimates. These figures are indicative and will be refined once the final number of evaluable examinations and raters is confirmed.
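The order of magnitude quoted above can be checked with Bonett's approximation as it is commonly cited, n = 8 z^2 (1 - ICC)^2 (1 + (k - 1) ICC)^2 / (k (k - 1) w^2) + 1, where k is the number of raters and w is the full width of the confidence interval (twice the half-width). The Python sketch below is illustrative only: the function name is hypothetical, and the rater counts tried here (three and four) are assumptions, since the protocol says only "several raters".

```python
import math
from statistics import NormalDist


def bonett_n(icc, k, ci_width, conf=0.95):
    """Approximate number of examinations for a target ICC CI width."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    n = (
        8 * z ** 2 * (1 - icc) ** 2 * (1 + (k - 1) * icc) ** 2
        / (k * (k - 1) * ci_width ** 2)
        + 1
    )
    return math.ceil(n)


# Expected ICC 0.75, half-width 0.10 (full width 0.20).
for k in (3, 4):
    print(k, "raters:", bonett_n(0.75, k, ci_width=0.20))
```

With these inputs the approximation returns values inside the 40-60 range stated above; a wider interval (half-width 0.12) lowers the requirement further.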

ETHICS AND DATA PROTECTION

The study will be submitted to the Research Ethics Committee of Universidad Rey Juan Carlos for review and approval. Informed consent will be requested from participating students. All examinations are anonymized before assessment, data are processed in accordance with the EU General Data Protection Regulation (GDPR) and applicable national law, and participation does not affect students' final grades, which are determined exclusively by faculty. Students are informed that their de-identified case examinations will be assessed by both faculty and LLMs for educational-research purposes.

Study Type

Observational

Enrollment (Estimated)

65

Contacts and Locations

This section provides the contact details for those conducting the study, and information on where this study is being conducted.

Study Contact

Study Locations

    • Madrid
      • Madrid, Madrid, Spain, 28023
        • Centro Superior de Estudios Universitarios La Salle
        • Contact:

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

  • Adult
  • Older Adult

Accepts Healthy Volunteers

Yes

Sampling Method

Non-Probability Sample

Study Population

The study population comprises undergraduate students enrolled in the course "Specific Methods in Physiotherapy" (third year of the Physiotherapy Degree) during the study period, approximately 60 to 80 students. As part of the course, each student produces a written clinical-reasoning case examination. The de-identified examinations from consenting students constitute the units of analysis and are assessed independently, using the same predefined rubric, by faculty (reference standard) and by three large language models. No clinical intervention is applied and participation does not affect students' official grades, which are determined exclusively by faculty.

Description

Inclusion Criteria:

  • Students officially enrolled in the course "Specific Methods in Physiotherapy" (third year of the Physiotherapy Degree) during the study period.
  • Submission of a completed written clinical-reasoning case examination as part of the course.
  • Provision of informed consent for the anonymized examination to be used for educational-research purposes.

Exclusion Criteria:

  • Refusal to provide, or withdrawal of, informed consent.
  • Blank, incomplete, or non-evaluable examinations (e.g., no developed written response).
  • Examinations that cannot be reliably de-identified prior to assessment.

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

Cohorts and Interventions

Group / Cohort: Anonymized physiotherapy clinical case examinations
Single cohort consisting of de-identified written clinical-reasoning case examinations produced by undergraduate physiotherapy students in the course "Specific Methods in Physiotherapy." Each examination is assessed independently, using the same predefined rubric, by faculty (reference standard) and by three large language models (LLMs), with each model queried in duplicate to assess test-retest reliability. The examination is the unit of analysis; no participant follow-up is performed.
Intervention / Treatment:
  • Assessment of each anonymized examination by three large language models (for example, Claude, ChatGPT, and Gemini, in the versions available during data collection). Each model receives an identical standardized prompt embedding the study rubric and returns a score per criterion, a global score, and structured qualitative feedback. Each model is queried in duplicate in independent sessions under fixed generation parameters to estimate intra-model (test-retest) reliability, and outputs are compared across models to estimate inter-model agreement.
  • Assessment of the same anonymized examinations by faculty with expertise in the course, applying the identical rubric, serving as the reference standard. In the preferred scenario, two faculty members score each examination independently (paired human correction); if faculty workload precludes this, a single expert faculty rating, or the official course grade already assigned, is used as the reference. Faculty and LLM raters are blinded to one another's scores.

What is the study measuring?

Primary Outcome Measures

Outcome Measure
Measure Description
Time Frame
Agreement between LLM global scores and the faculty reference global score
Agreement between the global examination score generated by each large language model (LLM) and the faculty reference global score, computed for the same anonymized examinations. Agreement is quantified with the intraclass correlation coefficient (ICC), two-way random-effects model, absolute-agreement definition, single- and average-measures forms [ICC(2,1) and ICC(2,k)], with 95% confidence intervals. Systematic bias is examined with Bland-Altman analysis (mean difference and 95% limits of agreement). ICC is interpreted as poor (<0.50), moderate (0.50-0.75), good (0.75-0.90), or excellent (>0.90). Pre-specified target: ICC >= 0.75.
Single cross-sectional assessment during the data-collection period (approximately 2 months)
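The Bland-Altman quantities named in the primary outcome (mean difference and 95% limits of agreement) can be sketched as follows. This is an illustrative Python version with a hypothetical function name and invented scores; the inspection for proportional bias is not included.

```python
import statistics


def bland_altman(llm_scores, faculty_scores):
    """Return (mean difference, lower limit, upper limit).

    Differences are LLM minus faculty, so a positive mean difference
    means the model scores higher than the faculty reference on average.
    """
    diffs = [a - b for a, b in zip(llm_scores, faculty_scores)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    lower = mean_diff - 1.96 * sd_diff
    upper = mean_diff + 1.96 * sd_diff
    return mean_diff, lower, upper


# Invented global scores for four examinations.
llm = [7, 8, 6, 9]
faculty = [6, 8, 7, 7]
bias, loa_low, loa_high = bland_altman(llm, faculty)
print(f"bias = {bias:.2f}, limits of agreement = [{loa_low:.2f}, {loa_high:.2f}]")
```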

Secondary Outcome Measures

Outcome Measure
Measure Description
Time Frame
Criterion-level agreement between LLM and faculty scores
Agreement between LLM and faculty scores at the level of each individual rubric criterion (ordinal scores). Quantified with Cohen's weighted kappa (quadratic weights), with 95% confidence intervals. Because kappa is sensitive to prevalence and marginal imbalance, Gwet's AC1/AC2 and the prevalence-adjusted bias-adjusted kappa (PABAK) are reported as robust complements. Coefficients are interpreted using the Landis and Koch benchmarks. Pre-specified target: weighted kappa >= 0.60.
Single cross-sectional assessment during the data-collection period (approximately 2 months)
Intra-model test-retest reliability of each large language model
Stability of each LLM's scoring across two independent duplicate runs of the same examination under fixed generation parameters. Quantified with the ICC for the global score and percentage of exact agreement (complemented by Gwet's AC) for criterion-level scores, with 95% confidence intervals. The standard error of measurement (SEM) and the minimal detectable change (MDC95) are derived from the ICC to express measurement precision in score units.
Single cross-sectional assessment during the data-collection period (approximately 2 months)
Quality and coverage of the qualitative feedback
Quality of the qualitative feedback produced by each rater type (LLMs and faculty), assessed through structured content analysis with predefined categories aligned to the rubric dimensions. Outcomes include the number of specific, actionable comments per evaluation and the coverage of case dimensions (clinical reasoning, technique, communication, ethics). Counts and proportions are compared across rater types (chi-square or Fisher's exact tests for proportions; Kruskal-Wallis for counts, with post-hoc comparisons). Inter-coder reliability of the content-analysis coding is reported (Cohen's or Gwet's coefficient).
Assessed after completion of all evaluations, during the analysis period (approximately 3 months)
Mean evaluation time per examination: faculty versus LLM
Mean time required to evaluate one examination, recorded separately for faculty and for each LLM. Compared between humans and LLMs using paired t-tests or Wilcoxon signed-rank tests according to distribution, with 95% confidence intervals for the mean difference. Reported in minutes per examination.
Single cross-sectional assessment during the data-collection period (approximately 2 months)
Cost per evaluation: faculty versus LLM
Operating cost of evaluating one examination, comparing faculty time valued at standard institutional rates against LLM usage fees. Summarized descriptively (mean and standard deviation or median and interquartile range) per rater type and reported in euros per examination.
Single cross-sectional assessment during the data-collection period (approximately 2 months)
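For the criterion-level agreement measure listed among the secondary outcomes, the sketch below computes Cohen's weighted kappa with quadratic weights and a prevalence-adjusted bias-adjusted kappa. It is an illustrative Python version with hypothetical function names and invented ratings. The multi-category PABAK form used here, (k x p_o - 1) / (k - 1), is an assumption; it reduces to the familiar 2 x p_o - 1 when there are two categories. Gwet's AC1/AC2 is not shown.

```python
def weighted_kappa_quadratic(a, b, categories):
    """Cohen's weighted kappa with quadratic disagreement weights.

    a, b: ordinal scores from two raters for the same examinations.
    categories: the ordered list of possible score levels.
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(a)
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) / (k - 1)) ** 2   # 0 on the diagonal, 1 at the extremes
            num += w * observed[i][j]      # observed weighted disagreement
            den += w * row[i] * col[j]     # chance-expected weighted disagreement
    return 1 - num / den


def pabak(a, b, categories):
    """Prevalence-adjusted bias-adjusted kappa from exact agreement."""
    k = len(categories)
    p_o = sum(x == y for x, y in zip(a, b)) / len(a)
    return (k * p_o - 1) / (k - 1)


# Invented ratings on a three-level criterion.
faculty_levels = [1, 2, 3]
llm_levels = [1, 3, 3]
print(round(weighted_kappa_quadratic(faculty_levels, llm_levels, [1, 2, 3]), 3))
print(round(pabak(faculty_levels, llm_levels, [1, 2, 3]), 3))
```

Quadratic weights penalize a two-level disagreement four times as heavily as a one-level disagreement, which is why weighted kappa suits ordinal rubric levels better than the unweighted coefficient.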

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Study record dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start (Estimated)

August 1, 2026

Primary Completion (Estimated)

August 10, 2026

Study Completion (Estimated)

August 10, 2026

Study Registration Dates

First Submitted

June 24, 2026

First Submitted That Met QC Criteria

June 24, 2026

First Posted (Actual)

June 30, 2026

Study Record Updates

Last Update Posted (Actual)

June 30, 2026

Last Update Submitted That Met QC Criteria

June 24, 2026

Last Verified

June 1, 2026

More Information

Terms related to this study

Other Study ID Numbers

  • ALCNR005

Plan for Individual participant data (IPD)

Plan to Share Individual Participant Data (IPD)?

YES

IPD Plan Description

De-identified individual participant data (anonymized examination scores per rubric criterion and global score from all human and LLM raters, including duplicate LLM runs) and the corresponding data dictionary will be shared. The standardized LLM prompt, the scoring rubric, and the statistical analysis code will also be made available. Data will be deposited in the Zenodo open-access repository and assigned a permanent DOI. No directly identifying information will be shared; all examinations are anonymized prior to assessment.

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

No

Studies a U.S. FDA-regulated device product

No

