mzLearn, a data-driven LC/MS signal detection algorithm, enables pre-trained generative models for untargeted metabolomics

This repository contains the codebase for mzEmbed, a framework for developing pre-trained generative models and fine-tuning them for specific tasks on untargeted metabolomics datasets.

Author:

Leila Pirhaji

Overview of mzLearn

mzLearn is a data-driven algorithm designed to autonomously detect metabolite signals from raw LC/MS data without requiring input parameters from the user. The algorithm processes raw LC/MS data files in the open-source mzML format, iteratively learning signal characteristics to ensure high-quality signal detection.

Key Features of mzLearn:

- Zero-parameter design: No prior knowledge or QC samples are required.
- Iterative learning: mzLearn autonomously refines signal detection, correcting for retention time (rt) and intensity drifts caused by batch effects and run order.
- Output: A two-dimensional table of detected features defined by median rt and m/z values, with normalized intensities across samples.
- Scalability: Capable of handling large-scale datasets (e.g., 2,075 files in a single run).
- Accessibility: mzLearn is available at http://mzlearn.com/.
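
To make the output format described above concrete, here is a minimal sketch of reading such a feature table with Python's standard library. The CSV layout and the column names (`mz`, `rt`, `sample_A`, ...) are assumptions made for illustration; the actual mzLearn export may be organized differently.

```python
import csv
import io

# Hypothetical mzLearn-style feature table: one row per detected feature,
# identified by its median m/z and rt, followed by one normalized-intensity
# column per sample. Column names and values are made up for illustration.
EXAMPLE_CSV = """mz,rt,sample_A,sample_B,sample_C
118.0863,1.42,0.91,1.05,0.98
180.0634,3.77,1.20,0.85,1.10
"""

def read_feature_table(text):
    """Return (features, intensities) parsed from CSV text.

    features is a list of (mz, rt) tuples; intensities maps each sample
    name to a list of values aligned with features.
    """
    reader = csv.DictReader(io.StringIO(text))
    sample_names = [name for name in reader.fieldnames if name not in ("mz", "rt")]
    features = []
    intensities = {name: [] for name in sample_names}
    for row in reader:
        features.append((float(row["mz"]), float(row["rt"])))
        for name in sample_names:
            intensities[name].append(float(row[name]))
    return features, intensities

features, intensities = read_feature_table(EXAMPLE_CSV)
print(len(features), "features across", len(intensities), "samples")
```

Keeping the (m/z, rt) pair separate from the per-sample intensities mirrors the "features by samples" shape of the table and makes it easy to hand the intensity matrix to a downstream model.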

Overview of mzEmbed Codebase

mzEmbed extends mzLearn’s capabilities by combining outputs from multiple datasets to develop pre-trained generative models and applying them to a range of metabolomics applications.
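
Combining feature tables from several studies requires deciding which features in one study correspond to features in another. The sketch below shows one simple way this could be done, by pairing features whose m/z and rt agree within a tolerance. It is an illustration only: the function name, the tolerances, and the greedy matching strategy are assumptions, not the method used by mzEmbed.

```python
def match_features(study_a, study_b, ppm_tol=10.0, rt_tol=0.2):
    """Pair features from two studies whose m/z and rt agree within tolerance.

    Each study is a list of (mz, rt) tuples. Returns a list of (i, j) index
    pairs. The tolerances are illustrative defaults, not mzEmbed settings.
    """
    pairs = []
    used_b = set()
    for i, (mz_a, rt_a) in enumerate(study_a):
        best_j, best_err = None, None
        for j, (mz_b, rt_b) in enumerate(study_b):
            if j in used_b:
                continue  # each feature in study_b is matched at most once
            ppm_err = abs(mz_a - mz_b) / mz_a * 1e6
            if ppm_err <= ppm_tol and abs(rt_a - rt_b) <= rt_tol:
                # keep the candidate with the smallest m/z error
                if best_err is None or ppm_err < best_err:
                    best_j, best_err = j, ppm_err
        if best_j is not None:
            pairs.append((i, best_j))
            used_b.add(best_j)
    return pairs

# Made-up features for two hypothetical studies
study_a = [(118.0863, 1.42), (180.0634, 3.77)]
study_b = [(180.0639, 3.80), (118.0862, 1.40), (250.1000, 5.00)]
print(match_features(study_a, study_b))
```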

Key Components of mzEmbed:

Pre-trained Model Development:
- Combines metabolomics data from multiple studies to create robust pre-trained generative models.
- Supports Variational Autoencoders (VAEs) for unsupervised learning of metabolite representations.
- Enables parameter optimization using grid search and Optuna for hyperparameter tuning.
- Outputs embeddings that capture biological and demographic variability, such as age and disease state.
Fine-Tuning Pre-Trained Models:
- Allows fine-tuning of pre-trained models on independent datasets for improved task-specific performance.
- Supports fine-tuning for binary classification, multi-class classification, and survival analysis.
Task-Specific Model Refinement:
- Retrains the last layer of fine-tuned models for specific tasks, such as clinical classification and survival analysis.
Advanced Architectures:
- Supports the development of joint learning models for treatment-independent, prognostic stratification of patients.
- Implements adversarial learning to isolate treatment-specific predictive biomarkers, that is, predictive stratification of patients.
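
For readers unfamiliar with the Variational Autoencoders mentioned above, the following is a minimal, framework-free sketch of the two ingredients of a common form of the VAE training objective: the reparameterization step and the KL term added to a reconstruction error. It is not taken from the mzEmbed code; the function names, the squared-error reconstruction term, and the `kl_weight` parameter are assumptions chosen for illustration.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_divergence(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def reconstruction_error(x, x_hat):
    """Sum of squared errors between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def vae_loss(x, x_hat, mu, log_var, kl_weight=1.0):
    """Reconstruction error plus a weighted KL term."""
    return reconstruction_error(x, x_hat) + kl_weight * kl_divergence(mu, log_var)

rng = random.Random(0)  # seeded for reproducibility
mu, log_var = [0.5, -0.2], [0.0, -1.0]  # made-up encoder outputs
z = reparameterize(mu, log_var, rng)
print("latent sample:", z)
print("loss:", vae_loss([1.0, 2.0], [0.9, 2.1], mu, log_var))
```

In a real model, `mu` and `log_var` would come from an encoder network and `x_hat` from a decoder applied to `z`; the latent vector is what serves as the sample embedding.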

Getting Started

Requirements

Python 3.9 or higher

Installation

Clone the repository:

git clone git@github.com:ReviveMed/mzEmbed.git
cd mzEmbed

Install dependencies:
```
pip install -r requirements.txt
```

Build the package:

cd mz_embed
python -m build
pip install -e .
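
After installation, a short script along the following lines can confirm that the interpreter meets the Python 3.9 requirement and that the package can be found. This is a sketch: it assumes the installed package is importable under the name `mz_embed` (the directory name used above), which may not match the actual import name.

```python
import importlib.util
import sys

def check_environment(package_name="mz_embed", min_version=(3, 9)):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < min_version:
        problems.append("Python %d.%d or higher is required" % min_version)
    # Assumption: the installed package is importable under the same name
    # as the mz_embed directory; adjust package_name if that is not the case.
    if importlib.util.find_spec(package_name) is None:
        problems.append("package %r is not importable" % package_name)
    return problems

for problem in check_environment():
    print("WARNING:", problem)
```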

Usage:

The repository supports six main use cases, including pretraining, fine-tuning, and advanced learning architectures. The pretrain and finetune directories include example Python commands for each use case.

License

This project is licensed under the Academic and Non-Profit Use License. See the LICENSE.txt file for details.

Citation

If you use mzLearn or mzEmbed in your research, please cite:

Leila Pirhaji, Jonah Eaton, Adarsh K. Jeewajee, Min Zhang, Matthew Morris, Maria Karasarides. mzLearn, a data-driven LC/MS signal detection algorithm, enables pre-trained generative models for untargeted metabolomics.
