Cross-sectional Functional Stratification Based on Psychometric Profiling and Machine Learning in Patients With Substance Use Disorders (SUD) (SISAP-TUS)

May 14, 2026 updated by: Lauro Gutiérrez Castro

Unsupervised Deep Representation Learning for Clinical Stratification in Substance Use Disorders

Substance use disorders (SUDs) show considerable clinical heterogeneity that limits the usefulness of traditional categorical diagnoses. This observational, cross-sectional study aims to apply an unsupervised deep learning method, an autoencoder, to learn continuous latent representations from standardised psychometric data and to explore whether those representations can help stratify clinical subpopulations. The investigators will recruit 155 adults undergoing residential treatment for SUD. Participants will complete six validated instruments assessing impulsivity (BIS-11), anger regulation (STAXI-2), behavioural activation/avoidance (BADS), borderline symptomatology (BSL-23), generalised anxiety (GAD-7), and environmental reward (EROS). Demographic and clinical variables (age, sex, primary substance, years of use, prior treatments) will also be recorded.

After data cleaning and standardisation (z-scores), a symmetric autoencoder with a 12-dimensional bottleneck (architecture 21-32-24-12-24-32-21) will be trained using mean squared error loss. Regularisation includes L2 weight decay and dropout. The model will be trained 30 times with different random seeds to assess stability; the five best models (by validation pseudo-R²) will be combined into a weighted ensemble. Five-fold cross-validation will evaluate generalisation. For comparison, principal component analysis (PCA) will be applied to the same data. Gaussian mixture models (GMM) will be fitted on the latent space to explore potential clinical subgroups.
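As a rough illustration of the model size implied by the 21-32-24-12-24-32-21 architecture, the following Python sketch counts the trainable parameters of a fully connected network with those layer widths. It assumes plain dense layers with one bias per unit and no further parameters (for example from normalisation layers), which the record does not specify. The function name is hypothetical, and the study's own analysis code is stated to be in R.

```python
# Layer widths taken from the record: 21 inputs, 12-unit bottleneck, 21 outputs.
LAYER_SIZES = [21, 32, 24, 12, 24, 32, 21]


def count_dense_parameters(layer_sizes):
    """Count weights and biases of a fully connected network (assumed layout)."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per output unit
    return total


n_params = count_dense_parameters(LAYER_SIZES)
n_values = 155 * 21  # participants x input variables
print(n_params, n_values)  # 3601 3255
```

Under these assumptions the network has slightly more trainable parameters than there are observed data values, which is consistent with the record's emphasis on weight decay, dropout, repeated runs, and cross-validation.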

The primary outcome is the stability of the latent representation (coefficient of variation of validation MSE across runs). Secondary outcomes include reconstruction performance (pseudo-R²) of the ensemble, comparison with PCA, and the interpretability of latent dimensions via correlations with original variables. GMM results will be described using BIC, silhouette width, bootstrap stability, and clinical characterisation of clusters.

This study does not involve any intervention. Results will be hypothesis-generating and require external validation. No automated clinical decisions will be made.

Study Overview

Detailed Description

Substance use disorders (SUDs) are characterised by substantial heterogeneity in clinical presentation, behavioural patterns, emotional regulation difficulties, impulsivity, and treatment response. Individuals with the same categorical diagnosis may differ considerably in symptom severity, comorbid psychopathology, and psychosocial functioning. This variability limits the explanatory value of traditional diagnostic classifications and supports the development of dimensional and data-driven approaches for patient characterisation.

Recent advances in machine learning provide methods capable of identifying latent structures within complex clinical datasets. Autoencoders, a form of unsupervised deep learning, can learn compact nonlinear representations of multidimensional data while preserving relevant information from the original variables. Compared with traditional linear dimensionality reduction methods such as principal component analysis (PCA), autoencoders may better capture complex interactions among psychological and behavioural variables. When combined with probabilistic clustering approaches such as Gaussian mixture models (GMM), these latent representations may facilitate the identification of clinically meaningful patient subgroups.

The purpose of this observational study is to apply an autoencoder model to psychometric and clinical data obtained from adults receiving residential treatment for substance use disorders. The study aims to explore latent dimensions underlying symptom and behavioural variability and to evaluate whether these dimensions support stable subgroup identification.

Primary Objective:

To learn a 12-dimensional latent representation from standardised psychometric and clinical variables using an autoencoder model and evaluate the stability of this representation across repeated training procedures.

Secondary Objectives:

To compare the reconstruction performance of the autoencoder with principal component analysis (PCA).

To characterise the clinical meaning of the latent dimensions through correlations with the original variables.

To explore potential patient subgroups using Gaussian mixture models (GMM) applied to the latent space.

To assess the stability and interpretability of the identified subgroups.

Study Design:

This is a single-centre, observational, cross-sectional, non-interventional study conducted in a residential addiction treatment facility. Recruitment is planned from February 2024 through December 2025. The study is registered prior to dissemination of results.

Study Population:

Approximately 155 adults diagnosed with substance use disorder according to DSM-5 criteria will be included. Eligible participants must be 18 years of age or older, currently receiving residential treatment, capable of completing study questionnaires, and willing to provide written informed consent.

Participants with active psychotic disorders, severe cognitive impairment, significant language or literacy barriers, or imminent discharge from treatment will be excluded.

Measures and Data Collection:

Participants will complete a battery of validated self-report instruments assessing impulsivity, anger regulation, behavioural activation and avoidance, borderline symptomatology, anxiety, and environmental reward. Additional demographic and clinical variables will include age, sex, primary substance of use, years of substance use, and prior treatment history.

Questionnaires include:

  • Barratt Impulsiveness Scale (BIS-11)
  • State-Trait Anger Expression Inventory-2 (STAXI-2)
  • Behavioral Activation for Depression Scale (BADS)
  • Borderline Symptom List-23 (BSL-23)
  • Generalized Anxiety Disorder-7 (GAD-7)
  • Environmental Reward Observation Scale (EROS)

Data Analysis:

Clinical variables will be standardised prior to analysis. Missing values are expected to be minimal and will be handled using median imputation procedures. Redundant variables with excessive multicollinearity may be removed before modelling.
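The preprocessing described above could look roughly like the following Python sketch for a single variable. The order of operations (imputation first, then standardisation), the use of the population standard deviation, and the function names are assumptions; the record does not state them. The example scores are hypothetical, not study data.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def z_standardise(values):
    """Centre to mean 0 and scale to unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; a constant column would give 0
    return [(v - mu) / sigma for v in values]


# Hypothetical GAD-7 totals for five participants, one of them missing.
raw = [4, 10, None, 16, 10]
z = z_standardise(impute_median(raw))
```

A variable with no variance would need to be dropped before this step, since its standard deviation is zero.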

An autoencoder neural network will be trained to generate a reduced latent representation of the clinical data. Model performance and stability will be evaluated across repeated training runs and cross-validation procedures. Reconstruction accuracy will be compared with PCA using equivalent dimensionality.

The resulting latent space will subsequently be analysed using Gaussian mixture models to explore potential patient subgroups. Model selection will consider statistical fit, cluster stability, and clinical interpretability. Correlations between latent dimensions and original clinical variables will be examined to facilitate interpretation of the learned representations.
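One simple way to examine how latent dimensions relate to the original variables is a table of pairwise correlations. The sketch below uses Pearson's r, which is an assumption, since the record does not name the correlation coefficient. The function names and the example values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def loading_table(latent, variables):
    """Correlate every latent dimension with every original variable.

    Both arguments map a name to a list of per-participant values.
    """
    return {
        (z_name, v_name): pearson_r(z_values, v_values)
        for z_name, z_values in latent.items()
        for v_name, v_values in variables.items()
    }


# Hypothetical values for four participants (not study data).
latent = {"z1": [-1.0, 0.0, 1.0, 2.0]}
variables = {"GAD-7": [2.0, 4.0, 6.0, 8.0], "EROS": [30.0, 25.0, 20.0, 15.0]}
table = loading_table(latent, variables)
```

With 12 latent dimensions and 21 input variables, the full table would hold 252 coefficients.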

Ethical Considerations:

The study protocol has been approved by the corresponding Institutional Ethics Committee. All participants will provide written informed consent prior to participation. Data will be anonymised after collection, and no direct identifiers will be retained.

This study is observational and will not modify routine clinical treatment. No automated clinical decisions will be made based on model outputs. Participants may experience mild emotional discomfort or fatigue while completing questionnaires; psychological support will be available if needed.

The study will be conducted in accordance with the Declaration of Helsinki and applicable local ethical regulations.

Dissemination:

Results will be submitted for publication in peer-reviewed scientific journals and presented at academic conferences. De-identified data and analysis code may be shared publicly after publication to support transparency and reproducibility.

Study Type

Observational

Enrollment (Actual)

155

Contacts and Locations

This section provides the contact details for those conducting the study, and information on where this study is being conducted.

Study Locations

    • Jalisco
      • Ajijic, Jalisco, Mexico, 45920
        • Under The Tree

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

  • Adult

Accepts Healthy Volunteers

No

Sampling Method

Non-Probability Sample

Study Population

Adult patients (≥18 years) with a diagnosis of Substance Use Disorder (SUD) admitted to a residential detoxification and rehabilitation center. Consecutive recruitment between February 2024 and March 2026. Estimated final sample size is 155 participants. No healthy volunteers are included.

Description

Inclusion Criteria:

  • DSM-5 diagnosis of Substance Use Disorder (SUD), confirmed by a psychiatrist or clinical psychologist.
  • Age ≥ 18 years.
  • Currently admitted to a residential addiction treatment center at the time of assessment.
  • Ability to complete the psychometric questionnaires independently.
  • Written informed consent.

Exclusion Criteria:

  • Active psychotic disorder (e.g., schizophrenia, delusional disorder) not stabilized pharmacologically.
  • Severe cognitive impairment (dementia, severe brain injury) that prevents understanding the questionnaire items.
  • Language barriers or illiteracy that prevent self-administration of the scales.
  • Scheduled discharge from the center within 7 days of the assessment date.

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

Cohorts and Interventions

Group / Cohort: Total sample (residential treatment)
Adult patients (N=155) with DSM-5 TR substance use disorder receiving residential treatment. All participants completed six psychometric scales (BIS-11, STAXI-2, BADS, BSL-23, GAD-7, EROS) and provided demographic/clinical data in a single cross-sectional session. No intervention was administered.

Intervention / Treatment: This is a purely observational study. No drug, device, behavioral therapy, or other intervention was assigned. The study only involved standardized psychometric measurements.

What is the study measuring?

Primary Outcome Measures

Outcome Measure: Latent dimension scores

Measure Description: Twelve continuous latent dimensions derived from the bottleneck layer of a symmetric autoencoder trained on 21 standardized clinical variables. Each dimension represents a compressed, nonlinear combination of the original psychometric indicators (impulsivity, emotion regulation, behavioral activation, borderline symptoms, anxiety, and environmental reward). The dimensions are extracted for each participant after averaging the predictions of an ensemble of the five best autoencoder runs.

Unit of Measure: Standardized z-score (mean = 0, SD = 1 in the training sample)

Time Frame: Baseline (single assessment, cross-sectional)
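The selection and averaging of the five best runs might be sketched as follows. The record calls the ensemble "weighted" but does not give the weighting scheme. Weights proportional to each run's validation pseudo-R² are an assumption made here for illustration, as are the function name and the data layout.

```python
def top_k_weighted_average(runs, k=5):
    """Combine the k best runs into one weighted-average output vector.

    runs: list of (validation_pseudo_r2, output_vector) pairs for one participant.
    Assumes the k best scores are positive so they can serve as weights.
    """
    best = sorted(runs, key=lambda run: run[0], reverse=True)[:k]
    total = sum(score for score, _ in best)
    weights = [score / total for score, _ in best]  # assumed weighting scheme
    dim = len(best[0][1])
    return [
        sum(w * vector[i] for w, (_, vector) in zip(weights, best))
        for i in range(dim)
    ]


# Hypothetical example: three runs, keep the two best.
runs = [(0.5, [1.0, 1.0]), (0.1, [100.0, 100.0]), (0.5, [3.0, 3.0])]
combined = top_k_weighted_average(runs, k=2)
print(combined)  # [2.0, 2.0]
```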

Secondary Outcome Measures

Outcome Measure: Gaussian mixture model cluster membership

Measure Description: Categorical assignment of each participant to one of the clusters obtained by fitting a Gaussian mixture model with full covariance matrices to the 12-dimensional latent space. The number of clusters is determined by the Bayesian Information Criterion (BIC) and clinical interpretability. This outcome is exploratory and does not imply discrete subtypes.

Unit of Measure: Nominal (cluster number: 1, 2, …)

Time Frame: Baseline
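To make the BIC penalty concrete, the sketch below counts the free parameters of a full-covariance Gaussian mixture in 12 dimensions and applies the standard definition BIC = p·ln(n) − 2·ln(L), where lower values are better. The function names are hypothetical, and the log-likelihood would come from whatever fitting routine the investigators use.

```python
from math import log


def gmm_full_n_parameters(n_components, n_features):
    """Free parameters of a Gaussian mixture with full covariance matrices."""
    means = n_components * n_features
    covariances = n_components * n_features * (n_features + 1) // 2
    mixing_weights = n_components - 1  # weights sum to 1
    return means + covariances + mixing_weights


def bic(log_likelihood, n_parameters, n_samples):
    """Bayesian Information Criterion; lower values indicate a better trade-off."""
    return n_parameters * log(n_samples) - 2.0 * log_likelihood


for k in (1, 2, 3):
    print(k, gmm_full_n_parameters(k, 12))
# 1 90
# 2 181
# 3 272
```

With 155 participants, a two-component model already has more free parameters (181) than observations, so BIC will tend to penalise additional components heavily under this covariance structure.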

Outcome Measure: Autoencoder reconstruction pseudo-R²

Measure Description: Proportion of variance in the original 21 clinical variables that is explained by the autoencoder's reconstructions, defined as 1 - (MSE_model / MSE_null), where MSE_null is the mean squared error of a model predicting only the mean. This metric is calculated for the ensemble of the five best models and for each of the 30 independent runs separately.

Unit of Measure: Proportion (range 0 to 1)

Time Frame: Baseline (computed on the validation split and on the full sample after training)
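The pseudo-R² definition above, 1 - (MSE_model / MSE_null), can be written as a short Python sketch. It assumes the null model predicts each variable's mean computed on the same rows being scored, a detail the record does not specify. The function names and the toy data are hypothetical.

```python
def mse(actual, predicted):
    """Mean squared error over all cells of two equally shaped tables."""
    n_cells = sum(len(row) for row in actual)
    squared = sum(
        (a - p) ** 2
        for row_a, row_p in zip(actual, predicted)
        for a, p in zip(row_a, row_p)
    )
    return squared / n_cells


def pseudo_r2(actual, reconstructed):
    """1 - MSE_model / MSE_null, with a per-variable mean as the null model."""
    n_rows = len(actual)
    n_cols = len(actual[0])
    col_means = [sum(row[j] for row in actual) / n_rows for j in range(n_cols)]
    null_prediction = [col_means for _ in actual]
    return 1.0 - mse(actual, reconstructed) / mse(actual, null_prediction)


# Toy data: two participants, two variables (not study data).
actual = [[1.0, 2.0], [3.0, 6.0]]
print(pseudo_r2(actual, actual))                    # 1.0 (perfect reconstruction)
print(pseudo_r2(actual, [[2.0, 4.0], [2.0, 4.0]]))  # 0.0 (no better than the mean)
```

The value falls below 0 whenever the reconstruction is worse than predicting the mean, which can happen on held-out data. The 0 to 1 range therefore holds only when the model at least matches the null model.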

Outcome Measure: Autoencoder reconstruction mean squared error

Measure Description: Average squared difference between the original 21 standardized input variables and the reconstructed outputs produced by the autoencoder. Lower values indicate better reconstruction. Reported for the ensemble model and for each independent run.

Unit of Measure: Mean squared error (dimensionless, as data are z-standardized)

Time Frame: Baseline

Outcome Measure: Coefficient of variation of reconstruction MSE

Measure Description: Coefficient of variation (CV = standard deviation / mean) of the reconstruction MSE computed over 30 independent autoencoder training runs with different random seeds. This metric assesses the stability and reproducibility of the model.

Unit of Measure: Percentage (%)

Time Frame: Baseline (after all runs are completed)
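The stability metric can be computed as in the following Python sketch. It uses the sample standard deviation (n − 1 denominator), which is an assumption because the record does not say which estimator is used. The MSE values shown are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV in percent: 100 * standard deviation / mean."""
    return 100.0 * stdev(values) / mean(values)  # assumption: sample SD (n - 1)


# Hypothetical validation MSEs from three training runs (the study plans 30).
run_mse = [0.9, 1.0, 1.1]
cv_percent = coefficient_of_variation(run_mse)  # approximately 10.0
```

A smaller CV means the reconstruction error changes little from one random seed to the next.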

Outcome Measure: Cross-validated reconstruction R²

Measure Description: Mean R² (and standard deviation) obtained from 5-fold cross-validation repeated 3 times, using the same autoencoder architecture and hyperparameters. This evaluates how well the model generalises to unseen patients.

Unit of Measure: Proportion (range 0 to 1)

Time Frame: Baseline
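The resampling scheme, 5 folds repeated 3 times, can be illustrated with a small index generator. This is a sketch under assumed details (random shuffling per repeat, no stratification, a hypothetical function name); the record does not describe how folds are formed.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=3, seed=0):
    """Yield (train_indices, validation_indices) for each fold of each repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        folds = [order[i::n_splits] for i in range(n_splits)]
        for i, validation in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, validation


splits = list(repeated_kfold(155))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 15 124 31
```

With 155 participants, each of the 15 fits trains on 124 participants and is evaluated on the remaining 31.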

Outcome Measure: Explained variance by 12 principal components

Measure Description: Total proportion of variance explained by the first 12 principal components obtained from PCA applied to the same 21 standardized variables. This serves as a comparator for the autoencoder's reconstruction performance.

Unit of Measure: Proportion (range 0 to 1)

Time Frame: Baseline
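The PCA comparator reduces to a ratio of eigenvalues, sketched below in Python. The eigenvalues would come from the correlation matrix of the 21 standardized variables. The function name and the example values are hypothetical.

```python
def explained_variance_ratio(eigenvalues, n_components):
    """Share of total variance captured by the largest n_components eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:n_components]) / sum(ordered)


# Hypothetical eigenvalues for a 3-variable example (not study data).
ratio = explained_variance_ratio([1.0, 3.0, 2.0], n_components=2)  # 5/6
```

When computed on the same sample used to fit the PCA, this ratio equals 1 - (MSE_PCA / MSE_null) for the 12-component reconstruction, so it is on the same scale as the autoencoder's in-sample pseudo-R².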

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Investigators

  • Principal Investigator: Lauro Gutiérrez Castro, Under The Tree

Study record dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start (Actual)

March 25, 2024

Primary Completion (Actual)

February 18, 2026

Study Completion (Actual)

April 22, 2026

Study Registration Dates

First Submitted

May 9, 2026

First Submitted That Met QC Criteria

May 9, 2026

First Posted (Actual)

May 15, 2026

Study Record Updates

Last Update Posted (Actual)

May 18, 2026

Last Update Submitted That Met QC Criteria

May 14, 2026

Last Verified

May 1, 2026

More Information

Plan for Individual participant data (IPD)

Plan to Share Individual Participant Data (IPD)?

YES

IPD Plan Description

Individual participant data (IPD) that underlie the results reported in the manuscript will be shared after de-identification (anonymization). The data will include the 21 standardized clinical variables and the 12-dimensional latent representations for all 155 participants. Study protocol, statistical analysis plan, and R code will also be made available.

IPD Sharing Time Frame

Beginning 9 months and ending 36 months after article publication

IPD Sharing Access Criteria

Data will be available to researchers who provide a methodologically sound proposal for purposes of replicating the results or conducting secondary analyses. Proposals should be directed to the corresponding author. Requestors will need to sign a data access agreement.

IPD Sharing Supporting Information Type

  • STUDY_PROTOCOL
  • SAP
  • ICF
  • CSR

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

No

Studies a U.S. FDA-regulated device product

No
