mzLearn, a data-driven LC/MS signal detection algorithm, enables pre-trained generative models for untargeted metabolomics
This repository contains the codebase for mzEmbed, a framework for developing pre-trained generative models on untargeted metabolomics datasets and fine-tuning them for specific tasks.
Author:
mzLearn is a data-driven algorithm designed to autonomously detect metabolite signals from raw LC/MS data without requiring input parameters from the user. The algorithm processes raw LC/MS data files in the open-source mzML format, iteratively learning signal characteristics to ensure high-quality signal detection.
- Zero-parameter design: No prior knowledge or QC samples are required.
- Iterative learning: mzLearn autonomously refines signal detection, correcting for retention time (rt) and intensity drifts caused by batch effects and run order.
- Output: A two-dimensional table of detected features defined by median rt and m/z values, with normalized intensities across samples.
- Scalability: Capable of handling large-scale datasets (e.g., 2,075 files in a single run).
- Accessibility: mzLearn is available at http://mzlearn.com/.
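The output layout described above can be pictured with a small sketch. It assumes, purely for illustration, that each feature is a group of per-sample peak detections. The names `Peak` and `build_feature_table` and the sample values are hypothetical and are not part of mzLearn, and no drift correction or normalization is performed here.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Peak:
    """One detected peak in one sample (hypothetical structure)."""
    sample: str
    rt: float         # retention time
    mz: float         # mass-to-charge ratio
    intensity: float  # assumed to be already normalized


def build_feature_table(features, samples):
    """Return one row per feature: median rt, median m/z, then one
    intensity column per sample (None where the feature was not detected)."""
    table = []
    for peaks in features:
        by_sample = {p.sample: p.intensity for p in peaks}
        row = {
            "rt": median(p.rt for p in peaks),
            "mz": median(p.mz for p in peaks),
        }
        for s in samples:
            row[s] = by_sample.get(s)
        table.append(row)
    return table


# Made-up example data: two features observed across three samples.
samples = ["s1", "s2", "s3"]
features = [
    [Peak("s1", 5.10, 180.063, 0.9), Peak("s2", 5.12, 180.064, 1.1), Peak("s3", 5.11, 180.063, 1.0)],
    [Peak("s1", 7.40, 255.232, 0.5), Peak("s3", 7.44, 255.233, 0.7)],
]
table = build_feature_table(features, samples)
for row in table:
    print(row)
```

Each row corresponds to one feature and each sample contributes one column, which is the two-dimensional shape that downstream models would consume.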
mzEmbed extends mzLearn’s capabilities by combining outputs from multiple datasets to develop pre-trained generative models and applying them to a range of metabolomics applications.
- Pre-trained Model Development:
  - Combines metabolomics data from multiple studies to create robust pre-trained generative models.
  - Supports Variational Autoencoders (VAEs) for unsupervised learning of metabolite representations.
  - Supports hyperparameter tuning via grid search and Optuna.
  - Outputs embeddings that capture biological and demographic variability, such as age and disease state.
- Fine-Tuning Pre-Trained Models:
  - Allows fine-tuning of pre-trained models on independent datasets for improved task-specific performance.
  - Supports fine-tuning for binary classification, multi-class classification, and survival analysis.
- Task-Specific Model Refinement:
  - Retrains the last layer of fine-tuned models for specific tasks, such as clinical classification and survival analysis.
- Advanced Architectures:
  - Supports the development of joint learning models for treatment-independent, prognostic stratification of patients.
  - Implements adversarial learning to isolate treatment-specific predictive biomarkers (predictive stratification of patients).
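For readers unfamiliar with VAEs, the standard training objective is a reconstruction term plus a KL-divergence penalty on the latent distribution. The sketch below computes that textbook loss for a single sample in plain Python. The function name `vae_loss`, the mean-squared-error reconstruction term, and the `beta` weight are illustrative assumptions and may differ from what mzEmbed actually implements.

```python
import math


def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Textbook VAE loss for one sample (illustrative, not mzEmbed's code).

    x, x_recon : input and reconstructed feature intensities
    mu, logvar : mean and log-variance of the diagonal Gaussian q(z|x)
    beta       : weight on the KL term (assumed hyperparameter)
    """
    # Reconstruction error: mean squared error over features.
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon)) / len(x)
    # Closed-form KL divergence between N(mu, exp(logvar)) and N(0, 1).
    kl = -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, logvar))
    return recon + beta * kl


# A perfect reconstruction with a standard-normal posterior gives zero loss.
print(vae_loss([0.2, 0.8], [0.2, 0.8], mu=[0.0, 0.0], logvar=[0.0, 0.0]))
# Moving one latent mean away from zero adds a KL penalty of 0.5 * mu**2.
print(vae_loss([0.2, 0.8], [0.2, 0.8], mu=[1.0, 0.0], logvar=[0.0, 0.0]))
```

In practice this loss would be computed on batches with an automatic-differentiation library. The plain-Python form is only meant to make the two terms explicit.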
- Python 3.9 or higher
- Clone the repository:
    git clone git@github.com:ReviveMed/mzEmbed.git
    cd mzEmbed
- Install dependencies:
    pip install -r requirements.txt
- Build and install the package:
    cd mz_embed
    python -m build
    pip install -e .
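To confirm that the active interpreter meets the Python 3.9 requirement stated above, a small check such as the following can be run. The helper name `meets_requirement` is an illustrative assumption and is not part of the repository.

```python
import sys

MIN_VERSION = (3, 9)  # minimum version stated in the requirements


def meets_requirement(version=None, minimum=MIN_VERSION):
    """Return True if the given (major, minor) version is at least `minimum`."""
    if version is None:
        # Default to the interpreter running this script.
        version = sys.version_info[:2]
    return tuple(version[:2]) >= minimum


if meets_requirement():
    print("Python version OK")
else:
    found = sys.version.split()[0]
    print(f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]}+ is required, found {found}")
```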
The repository supports six main use cases, including pretraining, fine-tuning, and advanced learning architectures. The pretrain and finetune directories include example Python commands for each use case.
This project is licensed under the Academic and Non-Profit Use License. See the LICENSE.txt file for details.
If you use mzLearn or mzEmbed in your research, please cite:
mzLearn, a data-driven LC/MS signal detection algorithm, enables pre-trained generative models for untargeted metabolomics.
Leila Pirhaji, Jonah Eaton, Adarsh K. Jeewajee, Min Zhang, Matthew Morris, Maria Karasarides