Align raw nanopore signals to a reference using subsequence DTW

hasindu2008/sigfish

sigfish

sigfish is an experimental toolkit that attempts to map raw nanopore signal data directly to a reference. It supports R9 DNA and RNA, R10 DNA, and the latest RNA004 kits from ONT.

This is under construction. The interface and parameters are not stable, and documentation is currently minimal.

This repository contains the CPU-only version of sigfish. For the FPGA-accelerated version, please refer to the sigfish-haru fork; its documentation is located at HARU and the publication is in GigaScience.

Building

sudo apt-get install zlib1g-dev   #install zlib development libraries
git clone https://github.com/hasindu2008/sigfish
cd sigfish
make

The commands to install zlib development libraries on some popular distributions:

On Debian/Ubuntu : sudo apt-get install zlib1g-dev
On Fedora/CentOS : sudo dnf/yum install zlib-devel
On OS X : brew install zlib

Usage

Currently, there are two subtools: dtw and eval.

dtw

Performs subsequence Dynamic Time Warping (sDTW) of raw signals in S/BLOW5 format to a reference in FASTA format. This is an all-to-all alignment and is not intended for large references.
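To illustrate what subsequence DTW means here, the following is a minimal sketch in Python, not the sigfish implementation (which is written in C and works on detected events and pore-model levels). The function name `subsequence_dtw`, the absolute-difference cost and the plain lists of floats are assumptions made for this example. The key difference from standard DTW is that the query may start and end anywhere on the reference at no cost.

```python
import math


def subsequence_dtw(query, reference):
    """Find the cheapest match of `query` inside `reference`.

    Returns (cost, start, end), where start/end are 0-based reference
    indices and end is exclusive (BED-like).
    Illustrative only: uses |q - r| as the local cost (an assumption).
    """
    n, m = len(query), len(reference)
    if n == 0 or m == 0:
        raise ValueError("query and reference must be non-empty")

    # cost[i][j]: cheapest alignment of query[0..i] that ends at reference[j]
    cost = [[math.inf] * m for _ in range(n)]
    # start[i][j]: reference index where that alignment begins
    start = [[0] * m for _ in range(n)]

    # Free start: the first query element may match any reference position.
    for j in range(m):
        cost[0][j] = abs(query[0] - reference[j])
        start[0][j] = j

    for i in range(1, n):
        # First column: the only predecessor is the cell directly above.
        cost[i][0] = cost[i - 1][0] + abs(query[i] - reference[0])
        start[i][0] = start[i - 1][0]
        for j in range(1, m):
            # Pick the cheapest of the three classic DTW predecessors.
            prev_cost, prev_start = min(
                (cost[i - 1][j - 1], start[i - 1][j - 1]),
                (cost[i - 1][j], start[i - 1][j]),
                (cost[i][j - 1], start[i][j - 1]),
            )
            cost[i][j] = prev_cost + abs(query[i] - reference[j])
            start[i][j] = prev_start

    # Free end: take the best cell in the last row.
    end = min(range(m), key=lambda j: cost[n - 1][j])
    return cost[n - 1][end], start[n - 1][end], end + 1


if __name__ == "__main__":
    print(subsequence_dtw([1.0, 2.0, 3.0], [9.0, 9.0, 1.0, 2.0, 3.0, 9.0, 9.0]))
```

The table is filled for every query element against every reference position, so time and memory grow with the product of the two lengths, which is consistent with the note that this approach is not intended for large references.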

Usage: sigfish dtw [OPTIONS] genome.fa/transcriptome.fa reads.blow5

Output is in a PAF-like format with the following columns:

Col  Type    Description
1    string  Read identifier
2    int     Raw signal length (number of samples)
3    int     Raw signal start index (0-based; BED-like; closed)
4    int     Raw signal end index (0-based; BED-like; open)
5    char    Relative strand: "+" or "-"
6    string  Reference name
7    int     Reference sequence length
8    int     Start on the reference sequence (0-based; BED-like; closed)
9    int     End on the reference sequence (0-based; BED-like; open)
10   int     Approximate number of matching bases in the alignment
11   int     Alignment block length on the reference in terms of bases
12   int     Mapping quality (0-255; 255 for missing)

Following optional tags are present:

Tag  Type  Description
tp   A     Type of alignment: P/primary
d1   f     DTW score (lower is better)
d2   f     DTW score of the next best alignment
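As a rough sketch of how the columns and tags above could be read in a script, the Python snippet below parses one output line. It assumes the fields are tab-separated and that the optional tags use the usual `TAG:TYPE:VALUE` form of PAF; the names `PafRecord` and `parse_paf_line` and the sample line are made up for illustration and do not come from sigfish.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    read_id: str
    signal_len: int
    signal_start: int   # 0-based, closed
    signal_end: int     # 0-based, open
    strand: str         # "+" or "-"
    ref_name: str
    ref_len: int
    ref_start: int      # 0-based, closed
    ref_end: int        # 0-based, open
    matches: int
    block_len: int
    mapq: int
    tags: dict


def parse_paf_line(line):
    """Parse one PAF-like line (assumed tab-separated) into a PafRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 columns, got %d" % len(fields))

    tags = {}
    for item in fields[12:]:
        # Assumed tag layout: NAME:TYPE:VALUE, e.g. d1:f:12.5
        name, typ, value = item.split(":", 2)
        if typ == "f":
            tags[name] = float(value)
        elif typ == "i":
            tags[name] = int(value)
        else:
            tags[name] = value

    return PafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]), fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]), tags,
    )


if __name__ == "__main__":
    # Made-up example line, not real sigfish output.
    sample = "read1\t5000\t100\t2100\t+\tchr1\t100000\t500\t750\t250\t250\t60\ttp:A:P\td1:f:12.5\td2:f:20.0"
    print(parse_paf_line(sample))
```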

If you specify --sam, the output will be in SAM format containing ss and si tags similar to the format described here. This output can be used to visualise the alignment using squigualiser.

Options:

basic options:
   -t INT                     number of processing threads [8]
   -K INT                     batch size (max number of reads loaded at once) [512]
   -B FLOAT[K/M/G]            max number of bytes loaded at once [20.0M]
   -h                         help
   -o FILE                    output to file [stdout]
   --verbose INT              verbosity level [4]
   --version                  print version
   --pore STR                 set the pore chemistry (r9, r10 or rna004) [auto]
advanced options:
   --kmer-model FILE          custom nucleotide k-mer model file (format similar to test/r9-models/r9.4_450bps.nucleotide.6mer.template.model)
   --rna                      the dataset is direct RNA
   -q INT                     the number of events in query signal to align [250]
   -p INT                     the number of events to trim at query signal start [50]
   --debug-break INT          break after processing the specified no. of batches
   --profile-cpu=yes|no       process section by section (used for profiling on CPU)
   --dtw-std                  use DTW standard instead of DTW subsequence
   --invert                   reverse the reference events instead of query
   --full-ref                 map to the full reference
   --from-end                 Map the end portion of the query instead of the beginning
   --sam                      Output in SAM format
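One way to picture the -p and -q options, based only on their descriptions above: the first p events of a read are trimmed, and the next q events form the query that is aligned. The Python sketch below shows that windowing under this reading. The function name `select_query_window` is hypothetical, the behaviour for reads shorter than p + q is an assumption (the slice simply returns fewer events), and --from-end is not modelled.

```python
def select_query_window(events, trim=50, query_len=250):
    """Drop the first `trim` events, then keep up to `query_len` events.

    Defaults mirror the documented defaults of -p (50) and -q (250).
    Hypothetical helper for illustration; not taken from the sigfish source.
    """
    if trim < 0 or query_len < 0:
        raise ValueError("trim and query_len must be non-negative")
    return events[trim:trim + query_len]


if __name__ == "__main__":
    events = list(range(1000))          # stand-in for detected events
    window = select_query_window(events)
    print(len(window), window[0], window[-1])
```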

eval

Evaluates mappings in PAF format (testset) by comparing them to a truthset, which is also in PAF format. For example, the testset can be the output from sigfish dtw, whereas the truthset can be the mapping output from Minimap2. The mapping position (contig name, coordinates and strand) for each read ID in the testset is compared against that in the truthset. A mapping is considered correct if the contig name and the strand in the testset exactly match the truthset and the mapping coordinates satisfy min(|diff_st|,|diff_end|) < THRESHOLD, where diff_st and diff_end are defined as follows:

truth:   -----------------------------
test:                    -----------------------------------
         |<---diff_st--->|            |<-----diff_end----->|

THRESHOLD is 100 by default.
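The correctness rule can be written down as a short Python sketch. This is an illustration of the criterion as stated above, not the code used by sigfish eval; the `Mapping` type and the `is_correct` function are names invented for this example.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    ref_name: str
    strand: str   # "+" or "-"
    start: int
    end: int


def is_correct(truth, test, threshold=100):
    """Apply the rule: same contig, same strand, and
    min(|diff_st|, |diff_end|) < threshold."""
    if truth.ref_name != test.ref_name or truth.strand != test.strand:
        return False
    diff_st = abs(test.start - truth.start)
    diff_end = abs(test.end - truth.end)
    return min(diff_st, diff_end) < threshold


if __name__ == "__main__":
    truth = Mapping("chr1", "+", 1000, 2000)
    print(is_correct(truth, Mapping("chr1", "+", 1050, 2500)))  # start is within 100
    print(is_correct(truth, Mapping("chr1", "-", 1050, 2500)))  # strand differs
```

Because the rule takes the minimum of the two differences, a mapping counts as correct when either its start or its end lies within the threshold of the truth coordinates; both do not need to match.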

The output includes mapping accuracy statistics for the testset as a whole, as well as broken down by mapping quality score.

Usage:

sigfish eval truth.paf test.paf

Options:

basic options:
   -h                         help
   --version                  print version
   --secondary STR            consider secondary mappings. yes or no.
   --tid-only                 consider reference name and strand only

Acknowledgement

  • The PAF-like output format of sigfish dtw was inspired by UNCALLED. sigfish eval was implemented by learning from UNCALLED pafstats.
  • The methodology in ReadUntil was referred to when implementing the alignment component in sigfish dtw.
  • The event detection code is from Oxford Nanopore's Scrappie basecaller.
  • The DTW code is from mlpy.
  • The pore-models are from Nanopolish.
  • Some code snippets have been taken from Minimap2 and Samtools.

Citation

You may cite the following:

Po Jui Shih, Hassaan Saadat, Sri Parameswaran, Hasindu Gamaarachchi, Efficient real-time selective genome sequencing on resource-constrained devices, GigaScience, Volume 12, 2023, giad046, https://doi.org/10.1093/gigascience/giad046

@article{shih2023efficient,
  title={Efficient real-time selective genome sequencing on resource-constrained devices},
  author={Shih, Po Jui and Saadat, Hassaan and Parameswaran, Sri and Gamaarachchi, Hasindu},
  journal={GigaScience},
  volume={12},
  pages={giad046},
  year={2023},
  publisher={Oxford University Press}
}
