PathoGFAIR: a collection of FAIR and adaptable (meta)genomics workflows for (foodborne) pathogens detection and tracking

PathoGFAIR is a collection of Galaxy-based FAIR workflows employing state-of-the-art tools to detect and track pathogens from metagenomic Nanopore sequencing. Although initially developed to detect pathogens in food datasets, the workflows can be applied to other metagenomic Nanopore pathogenic data. PathoGFAIR incorporates visualisations and reports for comprehensive results.

PathoGFAIR implementation

The core of the PathoGFAIR project is a series of 5 Galaxy-based workflows designed to process Nanopore sequencing data, detect pathogens, and track their presence across samples:

Preprocessing: quality control, quality trimming of reads, and removal of host sequences and other contaminating sequences.
Taxonomy Profiling: taxonomic profiling that identifies and visualizes the community abundances of the samples down to the subspecies level.
Gene-based Pathogen Identification: identification of all possible pathogens through their virulence factor (VF) genes and the specific locations of those genes; antimicrobial resistance (AMR) genes are identified within the same workflow.
Allele-based Pathogen Identification: identification of all SNPs and variants, and creation of the consensus sequences of all samples.
Samples Aggregation and Visualisation: visualisation of the outputs of the previous workflows: a heatmap of all pathogenic genes found across all samples, phylogenetic trees relating samples by the pathogenic genes they have in common, and bar charts for important tabular outputs, e.g. number of identified SNPs and variants per sample, number of removed host reads, and mapping depth and coverage.
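
The heatmap mentioned in the last workflow is, in essence, a samples-by-genes presence/absence matrix. The sketch below is only an illustration of that data structure, not the workflow's actual implementation; the function name, sample names, and gene names are all hypothetical.

```python
def presence_absence(genes_by_sample):
    """Build a samples x genes 0/1 matrix from per-sample gene lists."""
    # Columns: every gene seen in at least one sample, sorted for stable order
    genes = sorted({gene for found in genes_by_sample.values() for gene in found})
    samples = sorted(genes_by_sample)
    matrix = [
        [1 if gene in set(genes_by_sample[sample]) else 0 for gene in genes]
        for sample in samples
    ]
    return samples, genes, matrix


# Hypothetical input, for illustration only
example = {
    "sample_1": ["geneA", "geneB"],
    "sample_2": ["geneB"],
    "sample_3": [],
}

samples, genes, matrix = presence_absence(example)
print("genes:", genes)
for name, row in zip(samples, matrix):
    print(name, row)
```

A matrix of this shape can be passed directly to a plotting library's heatmap function; samples with no detected gene simply yield a row of zeros.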

Where to find the workflows?

Workflows are available on 2 workflow registries (Dockstore and WorkflowHub) and several Galaxy servers.

| Workflow Name | WorkflowHub | Dockstore | Galaxy Servers |
|---|---|---|---|
| Nanopore Preprocessing (v 0.1) | ID 1061 v 0.1 | nanopore-pre-processing/main:v0.1 | European Galaxy Server, United States Galaxy Server, Australian Galaxy Server |
| Taxonomy Profiling and Visualization with Krona (v 0.1) | ID 1059 v 0.1 | taxonomy-profiling-and-visualization-with-krona/main:v0.1 | European Galaxy Server, United States Galaxy Server, Australian Galaxy Server |
| Gene-based Pathogen Identification (v 0.1) | ID 1062 v 0.1 | gene-based-pathogen-identification/main:v0.1 | European Galaxy Server, United States Galaxy Server, Australian Galaxy Server |
| Allele-based Pathogen Identification (v 0.1) | ID 1063 v 0.1 | allele-based-pathogen-identification/main:v0.1 | European Galaxy Server, United States Galaxy Server, Australian Galaxy Server |
| Samples Aggregation and Visualisation (v 0.1) | ID 1060 v 0.1 | pathogen-detection-pathogfair-samples-aggregation-and-visualisation/main:v0.1 | European Galaxy Server, United States Galaxy Server, Australian Galaxy Server |
| PathoGFAIR 5in1 (v 0.1) | Soon | Soon | European Galaxy Server, United States Galaxy Server, Australian Galaxy Server |

How to learn to use the workflows?

To assist in understanding and using the workflows, we provide an extensive tutorial and a recording, both available via the Galaxy Training Network (GTN).

Use Cases

To demonstrate PathoGFAIR and its features, 130 samples from 2 studies (without or with prior pathogen isolation) were analysed.

All samples contained pathogens known beforehand and were sequenced using Oxford Nanopore technology.

Samples Without Prior Pathogen Isolation

Pathogens were deliberately spiked into 46 samples to mimic real-world scenarios, following a protocol developed in the context of PathoGFAIR.

The full analysis can be found in a dedicated Galaxy history.

Samples With Prior Pathogen Isolation

To further test PathoGFAIR, 84 public datasets were used. The full analysis can be found in a dedicated Galaxy history.

Benchmarking

To evaluate the effectiveness of the PathoGFAIR workflows, a benchmarking analysis was performed comparing PathoGFAIR's pathogen detection capabilities with those of other systems and pipelines.

This section provides detailed instructions to replicate the PathoGFAIR benchmarking process, as outlined in our dedicated protocol on protocols.io. The focus here is on running the selected systems/pipelines used in our benchmarking.

PathoGFAIR

Setup: Import the Galaxy history that contains the benchmarking results of PathoGFAIR.
Results:
- View the results directly.
- Rerun PathoGFAIR workflows on the same 46 samples, available under dataset number 47:biolytix_classified_samples.

CZID (IDseq)

Setup:
1. Create an Account on CZID: We used credentials from Engy Nasr, University of Freiburg.
2. Create a public or private Project with the following details:
  - Name
  - Description
  - Analysis Type: Metagenomics
  - Sequencing Type: Nanopore
3. Upload Datasets: Upload the 46 samples (total size: 7 GB) either locally or from a public repository:
  - Local upload (our recommendation): Download the samples from our published Galaxy history and upload them.
  - Online upload: Use the NCBI repository PRJNA982679.
4. Upload or Enter Metadata: Metadata fields we included are: host [Chicken], Ct value, sampling date, location, and nucleotide type [DNA], which are available in our metadata table in data/benchmark.
  
  For a full list of possible fields to include, see: CZID Metadata Dictionary.
Execution: Sample uploads started on Tuesday, October 15th, at 9 AM. Dataset upload finished at 9:58 AM. Analysis was fully completed for all samples 1 hour and 30 minutes after the dataset upload finished.
Results: View the analysis results in CZID Results; use the drop-down menu at the top left to switch between samples.
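
As a quick check of the CZID timings reported above, the arithmetic can be sketched as follows. The year is not stated in the text; 2024 is an assumption, chosen because 15 October 2024 falls on a Tuesday. The variable names are illustrative.

```python
from datetime import datetime, timedelta

# Assumption: the year is not given in the text; 2024 is used because
# 15 October 2024 is a Tuesday (weekday() == 1).
upload_start = datetime(2024, 10, 15, 9, 0)
upload_end = datetime(2024, 10, 15, 9, 58)
analysis_after_upload = timedelta(hours=1, minutes=30)

upload_duration = upload_end - upload_start
analysis_end = upload_end + analysis_after_upload
total_time = analysis_end - upload_start

print("Upload duration:", upload_duration)
print("Analysis finished at:", analysis_end.strftime("%H:%M"))
print("Total time from first upload:", total_time)
```

Under these figures, the upload took 58 minutes and the whole run took 2 hours and 28 minutes from the start of the upload.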

BugSeq

Setup:
1. Create an Account on BugSeq: We used credentials from Engy Nasr, University of Freiburg.
2. Upload datasets: Upload the 46 samples (total size: 7 GB) either locally or via BaseSpace, for which you have to contact BugSeq. We uploaded them from a local directory, as explained for CZID (IDseq).
3. Set up parameters
  - Platform: Nanopore
  - Device & Chemistry: MinION/GridION/Flongle - R9.4.1
  - Metagenomic Database: NCBI nt (BugSeq recommendation for metagenomics samples)
  - Sample Type: Generic (BugSeq recommendation)
  - Sequenced Material: DNA
  - Outbreak Analysis (Genomic Relatedness Visualization): yes
Execution: Sample uploads started on Tuesday, October 15th, at 11:30 AM. Dataset upload finished at 12:30 PM. Analysis was fully completed for all samples 1 hour and 30 minutes after the dataset upload finished.
Results: On Tuesday, October 15th, at 3:21 PM, we received an email stating that the analysis had failed because we had insufficient sample credits. We contacted BugSeq support directly after receiving the email. They replied that we could analyse only 10 samples, not 46, so BugSeq will be removed from the benchmark.

GitHub Repository

In the GitHub repository, you'll find sources (Jupyter notebooks) to replicate figures in the manuscript.

The notebooks are also designed to run on any Galaxy instance using Jupytool.

Folder structure

data: folder with 2 subfolders (1 per use case: samples_without_prior_pathogen_isolation and samples_with_prior_pathogen_isolation), each containing the metadata and the Galaxy-generated workflow outputs for the samples of that use case
bin: folder with two Jupyter notebooks (1 per use case) for
- importing all tables from the data folder
- generating all figures to the results folder.
results: folder with results after running the Jupyter notebooks on the workflows output datasets.
docs: folder with figures (docs/figures) and tables (docs/tables) mentioned in the paper.
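
The "importing all tables" step performed by the notebooks can be pictured with the minimal sketch below. It is not the notebooks' actual code: it assumes the tables are tab-separated files with a `.tsv` suffix sitting directly in the chosen folder, and `load_tables` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def load_tables(folder, suffix=".tsv"):
    """Read every table with the given suffix in `folder`.

    Returns a dict mapping the file name (without suffix) to a list of
    row dictionaries. Assumes tab-separated files with a header row.
    """
    tables = {}
    for path in sorted(Path(folder).glob(f"*{suffix}")):
        with path.open(newline="") as handle:
            tables[path.stem] = list(csv.DictReader(handle, delimiter="\t"))
    return tables
```

For example, `load_tables("data/samples_without_prior_pathogen_isolation")` would return one entry per table found in that subfolder, if the files match the assumptions above.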

Requirements

To reproduce the figures in results, run the notebooks in bin, which require the following:

jupyter
openpyxl
matplotlib
numpy
pandas
python=3.13.1
requests
seaborn
scikit-learn
scipy
upsetplot
clustergrammer2

These can be installed in a conda environment:

$ conda env create -f environment.yml
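
Once the environment is created and activated, a short check like the one below can confirm that the listed libraries are importable. This is a sketch, not part of the repository: the mapping from package names to import names is an assumption (for example, scikit-learn is imported as `sklearn`), and `jupyter` and `python` are left out because launching the notebook server already verifies them.

```python
from importlib.util import find_spec

# Conda package name -> Python import name.
# Assumption: these are the usual import names for the listed packages.
REQUIRED = {
    "openpyxl": "openpyxl",
    "matplotlib": "matplotlib",
    "numpy": "numpy",
    "pandas": "pandas",
    "requests": "requests",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "scipy": "scipy",
    "upsetplot": "upsetplot",
    "clustergrammer2": "clustergrammer2",
}


def missing_packages(required=REQUIRED):
    """Return the sorted package names whose import module cannot be found."""
    return sorted(
        package for package, module in required.items() if find_spec(module) is None
    )


print("Missing packages:", missing_packages() or "none")
```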

Usage

Activate conda environment
```
 $ conda activate pathogfair
```
Launch Jupyter
```
 $ jupyter notebook
```
Go to http://localhost:8888 (a page should open automatically in your browser)
Navigate to bin
Open the notebooks and run them
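
As an alternative to running the notebooks by hand, they could also be executed non-interactively with `jupyter nbconvert`. The sketch below only builds and prints the commands by default; the helper names and the `dry_run` parameter are hypothetical, and the notebook file names are taken from the bin folder referenced in this README.

```python
import subprocess
from pathlib import Path

# Notebook names as referenced in this README (one per use case)
NOTEBOOKS = [
    "samples_without_prior_pathogen_isolation.ipynb",
    "samples_with_prior_pathogen_isolation.ipynb",
]


def nbconvert_command(notebook):
    """Build the command that executes a notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        str(notebook),
    ]


def run_all(bin_dir="bin", dry_run=True):
    """Return the commands for all notebooks; run them unless dry_run is True."""
    commands = [nbconvert_command(Path(bin_dir) / name) for name in NOTEBOOKS]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)
    return commands


# Dry run: only show what would be executed
for command in run_all():
    print(" ".join(command))
```

Calling `run_all(dry_run=False)` from the repository root, inside the activated environment, would execute both notebooks and overwrite them with their outputs.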

Contributors

Engy Nasr
Anna Henger
Björn Grüning
Paul Zierep
Bérénice Batut

Contribution

Feel free to contribute, open issues, or provide feedback.

Citation

If you use or refer to this project in your research, please cite the associated paper:

PathoGFAIR: a collection of FAIR and adaptable (meta)genomics workflows for (foodborne) pathogens detection and tracking Engy Nasr, Anna Henger, Björn Grüning, Paul Zierep, Bérénice Batut bioRxiv 2024.06.26.600753; doi: https://doi.org/10.1101/2024.06.26.600753

Sources

All sources for the figures and tables in the paper can be found in this GitHub repository

Figures

Figure 1 created using Inkscape (SVG in docs/figures/Figure_1.svg)
Figure 2
- Panels (A), (B), (D), (E), (F) generated by bin/samples_without_prior_pathogen_isolation.ipynb
- Panel (C) corresponds to a phylogenetic tree generated using Galaxy and available in this Galaxy history (step 11427)
- Panels arranged and colored using Inkscape (SVG in docs/figures/Figure_2.svg)
Figure 3
- Panels generated by bin/samples_without_prior_pathogen_isolation.ipynb
- Panels arranged and colored using Inkscape (SVG in docs/figures/Figure_3.svg)
Supplementary Figure S1 generated by bin/samples_without_prior_pathogen_isolation.ipynb
Supplementary online Figure S2
Supplementary Figure S3 generated by bin/samples_without_prior_pathogen_isolation.ipynb
Supplementary Figure S4
- Phylogenetic tree generated using Galaxy and available in this Galaxy history (step 11425)
- Colors added using Inkscape (SVG in docs/figures/Supplementary_Figure_S4.svg)
Supplementary Figure S5 generated by bin/samples_with_prior_pathogen_isolation.ipynb
Supplementary online Figure S6
Supplementary Figure S7
- Panels generated by bin/samples_with_prior_pathogen_isolation.ipynb
- Panels arranged using Inkscape (SVG in docs/figures/Supplementary_Figure_S7.svg)
Supplementary Figure S8
- Phylogenetic tree generated using Galaxy and available in this Galaxy history (step 2742)
- Colors added using Inkscape (SVG in docs/figures/Supplementary_Figure_S8.svg)
Supplementary Figure S9
- Heatmap generated by bin/samples_without_prior_pathogen_isolation.ipynb
- Labels added using Inkscape (SVG in docs/figures/Supplementary_Figure_S9.svg)

Tables

Table 1: Comparison of features between PathoGFAIR and other similar pipelines or systems
Supplementary Table T1: source in docs/tables/Supplementary_Table_T1.tsv
Supplementary Table T2: source in docs/tables/Supplementary_Table_T2.tsv
Supplementary Table T3: source in docs/tables/Supplementary_Table_T3.tsv
Supplementary Table T4: source in docs/tables/Supplementary_Table_T4.tsv
