sigtk

A simple toolkit for performing various operations on nanopore raw signal data. It is still at a very early stage of development, so expect changes. Currently, sigtk is single-threaded and has not been optimised for performance. It is intended for operations on relatively small datasets, for learning purposes and eyeballing.

Building

sudo apt-get install zlib1g-dev   #install zlib development libraries
git clone https://github.com/hasindu2008/sigtk
cd sigtk
make

The commands to install zlib development libraries on some popular distributions:

On Debian/Ubuntu : sudo apt-get install zlib1g-dev
On Fedora/CentOS : sudo dnf/yum install zlib-devel
On OS X : brew install zlib

Usage

synthetic reference (sref)

Prints a synthetic reference signal for a given reference genome using traditional pore models. A 6-mer pore-model is used for DNA and a 5-mer pore-model for RNA.

Usage: sigtk sref reference.fa

Specify --rna to use the RNA pore-model. Output is a tab-delimited text file with each row being a reference contig (one for + and another for - when DNA; only for + when RNA) and the columns being as described below:

Col	Type	Name	Description
1	string	ref_name	Reference contig name
2	int	ref_len	Length of the reference (no. of bases)
3	char	strand	The reference strand direction (+ or -)
4	int	sig_len	Length of the synthetic signal (no. of k-mers)
5	float*	sig_mean	Comma-separated mean current values of the synthetic signal
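
As a minimal sketch of how one row of this output could be read, the Python snippet below splits the five tab-delimited columns and expands the comma-separated `sig_mean` field. The `SrefRecord` type, the `parse_sref_line` helper, and the sample row (contig name and current values) are hypothetical and only illustrate the column layout described above.

```python
from typing import List, NamedTuple


class SrefRecord(NamedTuple):
    """One row of sref output (hypothetical container, not part of sigtk)."""
    ref_name: str
    ref_len: int
    strand: str
    sig_len: int
    sig_mean: List[float]


def parse_sref_line(line: str) -> SrefRecord:
    """Split a tab-delimited sref row into typed fields."""
    name, ref_len, strand, sig_len, means = line.rstrip("\n").split("\t")
    # Ignore empty items so that a trailing comma, if present, is harmless.
    values = [float(v) for v in means.split(",") if v]
    return SrefRecord(name, int(ref_len), strand, int(sig_len), values)


# Made-up example row: an 8-base contig contains 8 - 6 + 1 = 3 six-mers.
rec = parse_sref_line("chr_demo\t8\t+\t3\t85.1,90.25,78.0\n")
print(rec.ref_name, rec.strand, rec.sig_len, rec.sig_mean)
```

Comparing `sig_len` with `len(sig_mean)` is a cheap sanity check after parsing.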

Per-record raw-signal operations

The subtools in this section perform various operations on individual raw-signal records in a BLOW5/SLOW5 file. Those subtools can be used in one of the following forms:

# To perform subtool operation on all reads in a BLOW5 file
sigtk <subtool> reads.blow5
# To perform subtool operation on specified read IDs in a BLOW5 file
sigtk <subtool> reads.blow5 read_id1 read_id2 ..

By default, a tab-delimited text file with the first row being the header is printed. You can suppress the header using the -n flag, for easy use with command line tools such as awk. Some subtools can be invoked with -c for compact output that prints data in a custom encoding (explained in each subtool, if relevant). These subtools automatically detect if the raw signal data is for DNA or RNA, if applicable.

pa

Prints the raw signal in pico-amperes.

Col	Type	Name	Description
1	string	read_id	Read identifier name
2	int	len_raw_signal	The number of samples in the raw signal
3	float*	pa	Comma-separated raw signal in pico-amperes
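
This README does not state the scaling formula that sigtk uses. The sketch below assumes the conversion documented for SLOW5 files, pA = (raw + offset) * range / digitisation, where `digitisation`, `offset` and `range` are per-read fields stored in the file. The function name `raw_to_pa` and the sample values are hypothetical.

```python
from typing import List, Sequence


def raw_to_pa(raw: Sequence[int], digitisation: float, offset: float,
              range_: float) -> List[float]:
    """Scale raw ADC samples to pico-amperes (assumed SLOW5 convention)."""
    scale = range_ / digitisation
    return [(x + offset) * scale for x in raw]


# Made-up values chosen so that the scale factor is exactly 0.125.
pa = raw_to_pa([500, 510, 495], digitisation=8192.0, offset=10.0, range_=1024.0)
print(",".join(f"{v:.3f}" for v in pa))
```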

event

Event segmentation is based on the method in Oxford Nanopore's Scrappie basecaller.

By default, the output will be in the long intuitive form as explained below:

Col	Type	Name	Description
1	string	read_id	Read identifier name
2	int	event_idx	Event index (0-based)
3	int	raw_start	Raw signal start index for the event (0-based; BED-like; closed)
4	int	raw_end	Raw signal end index for the event (0-based; BED-like; open)
5	float	event_mean	Mean level of pico-ampere scaled signal for the event
6	float	event_std	Standard deviation of pico-ampere scaled signal for the event

To obtain a condensed output that consumes less space, with one record per row, specify the -c option:

Col	Type	Name	Description
1	string	read_id	Read identifier name
2	int	len_raw_signal	The number of samples in the raw signal
3	int	raw_start	Raw signal start index of the first event (0-based; BED-like; closed)
4	int	raw_end	Raw signal end index of the last event (0-based; BED-like; open)
5	int	num_event	Number of events
6	int*	events	Comma separated event lengths (based on no. raw signal samples)

The event 0 starts at raw signal index raw_start (0-based; BED-like; closed) and ends at raw_start+events[0] (0-based; BED-like; open). The event 1 starts at raw signal index raw_start+events[0] (0-based; BED-like; closed) and ends at raw_start+events[0]+events[1] (0-based; BED-like; open). Likewise, the events can be reconstructed by using the cumulative sum of events.
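
The cumulative-sum reconstruction described above can be sketched in Python as follows. The helper names `expand_events` and `parse_compact_row` and the sample row are hypothetical. The code assumes a data row without the header (as printed with -n).

```python
from itertools import accumulate
from typing import List, Tuple


def expand_events(raw_start: int, lengths: List[int]) -> List[Tuple[int, int]]:
    """Turn event lengths into (start, end) pairs: 0-based, start closed, end open."""
    # Boundaries are raw_start followed by the running sum of event lengths.
    bounds = list(accumulate(lengths, initial=raw_start))
    return list(zip(bounds[:-1], bounds[1:]))


def parse_compact_row(line: str) -> Tuple[str, int, List[Tuple[int, int]]]:
    """Parse one row of the -c output; returns (read_id, raw_end, intervals)."""
    read_id, _len_raw, raw_start, raw_end, _num_event, events = (
        line.rstrip("\n").split("\t"))
    lengths = [int(x) for x in events.split(",") if x]
    return read_id, int(raw_end), expand_events(int(raw_start), lengths)


# Made-up row: three events of lengths 5, 3 and 7 starting at index 100.
read_id, raw_end, intervals = parse_compact_row("read_demo\t200\t100\t115\t3\t5,3,7\n")
print(read_id, intervals)
```

If the row follows the column descriptions, the end of the last reconstructed interval equals `raw_end`, which makes a convenient consistency check. `accumulate(..., initial=...)` needs Python 3.8 or later.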

stat

Prints signal statistics.

Col	Type	Name	Description
1	string	read_id	Read identifier name
2	int	len_raw_signal	The number of samples in the raw signal
3	float	raw_mean	Mean of raw signal values
4	float	pa_mean	Mean of pico-amperes scaled signal
5	float	raw_std	Standard deviation of raw signal values
6	float	pa_std	Standard deviation of pico-amperes scaled signal
7	int	raw_median	Median of raw signal values
8	float	pa_median	Median of pico-amperes scaled signal

subtools under development

Note that these have not been tested much, and the interface and output may change at any time.

prefix

Under construction. Will change anytime. Only for direct RNA at the moment. Finds prefix segments in a raw signal such as adaptor and polyA.

Col	Type	Name	Description
1	string	read_id	Read identifier name
2	int	len_raw_signal	The number of samples in the raw signal
3	int	adapt_start	Raw signal start index of the adaptor
4	int	adapt_end	Raw signal end index of the adaptor
5	int	polya_start	Raw signal start index of the polyA tail
6	int	polya_end	Raw signal end index of the polyA tail

If --print-stat is specified, the following additional columns will be printed.

Type	Name	Description
float	adapt_mean	Mean of pico-amperes scaled signal of the adaptor
float	adapt_std	Standard deviation of pico-amperes scaled signal of the adaptor
float	adapt_median	Median of pico-amperes scaled signal of the adaptor
float	polya_mean	Mean of pico-amperes scaled signal of the polyA tail
float	polya_std	Standard deviation of pico-amperes scaled signal of the polyA tail
float	polya_median	Median of pico-amperes scaled signal of the polyA tail

jnn

Under construction. Will change anytime. Print segments found using JNN segmenter.

Col	Type	Name	Description
1	string	read_id	Read identifier name
2	int	len_raw_signal	The number of samples in the raw signal
3	int	num_seg	Number of segments found
4	string	seg	List of segments as explained below

...............|..........|..............|..........|............   <- signal and segments
              100        110            201        212              <- signal index (0-based)

Segments will be noted as: 100,110;201,212;

If -c is specified, output will be in the following short notation by using relative offsets.

...............|..........|..............|..........|............   <- signal and segments
              100        110            201        212              <- signal index (0-based)

               <---10----><-----91------><---11----->

100H10,91H11,
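
Both notations can be decoded back into absolute (start, end) pairs. The Python sketch below reads the compact form as `<gap>H<length>` items, where the gap is counted from the end of the previous segment (or from index 0 for the first one). This reading is inferred from the single example above, and the function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]


def parse_long(seg: str) -> List[Segment]:
    """Decode the default notation, e.g. '100,110;201,212;'."""
    out = []
    for item in seg.split(";"):
        if item:  # skip the empty piece after the trailing ';'
            start, end = item.split(",")
            out.append((int(start), int(end)))
    return out


def parse_compact(seg: str) -> List[Segment]:
    """Decode the -c notation, e.g. '100H10,91H11,' (assumed gap/length pairs)."""
    out = []
    pos = 0  # end of the previous segment; 0 before the first one
    for item in seg.split(","):
        if not item:  # skip the empty piece after the trailing ','
            continue
        gap, length = item.split("H")
        start = pos + int(gap)
        end = start + int(length)
        out.append((start, end))
        pos = end
    return out


print(parse_long("100,110;201,212;"))
print(parse_compact("100H10,91H11,"))
```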

ent

Under construction. Will change anytime. Calculates Shannon entropy for reads in a given S/BLOW5 file.

Col	Type	Name	Description
1	string	read_id	Read identifier name
2	float	raw_ent	entropy of raw signal samples
3	float	delta_ent	entropy after zig-zag delta
4	float	byte_ent	entropy after splitting and storing least significant byte and most significant byte of the signal samples separately: ent(LSB)+ent(MSB)
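
To make the three quantities concrete, here is a rough Python sketch of how they could be computed for one read. It is not sigtk's implementation. The log base (bits), the treatment of the first sample in the delta (taken relative to 0), and the 16-bit byte split are all assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def shannon_entropy(symbols: Sequence[int]) -> float:
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def zigzag_delta(samples: Sequence[int]) -> List[int]:
    """Delta-encode, then zig-zag map signed deltas to non-negative integers."""
    out = []
    prev = 0  # assumption: the first delta is taken relative to 0
    for x in samples:
        d = x - prev
        out.append(2 * d if d >= 0 else -2 * d - 1)  # 0,-1,1,-2 -> 0,1,2,3
        prev = x
    return out


def byte_split_entropy(samples: Sequence[int]) -> float:
    """ent(LSB) + ent(MSB) for 16-bit samples."""
    lsb = [x & 0xFF for x in samples]
    msb = [(x >> 8) & 0xFF for x in samples]
    return shannon_entropy(lsb) + shannon_entropy(msb)


samples = [500, 501, 502, 503]  # made-up raw signal
raw_ent = shannon_entropy(samples)
delta_ent = shannon_entropy(zigzag_delta(samples))
byte_ent = byte_split_entropy(samples)
print(raw_ent, delta_ent, byte_ent)
```

In this toy signal the deltas are mostly identical, so the entropy after zig-zag delta is lower than that of the raw samples.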

ss

Under construction. Will change anytime. Operations to convert to/from signal alignment string (ss). See https://hasindu2008.github.io/f5c/docs/output#resquiggle-paf-output-format for explanation of ss.

To convert a PAF file with ss tags to TSV, you can use:

sigtk ss paf2tsv in.paf

qts

Under construction. Will change anytime. Quantise the raw signal in a S/BLOW5 file. Takes a S/BLOW5 file as the input and writes the quantised output to a S/BLOW5 file.

Usage:

sigtk qts original.blow5 -o quantised.blow5

Options: -q INT : Number of LSB bits to truncate (set to 0). Default is 1.
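
Setting the q least significant bits of each sample to 0 can be illustrated with a bit mask, as in the Python sketch below. The function name `truncate_lsb` and the sample values are hypothetical, and whether sigtk applies any rounding before truncation is not stated here, so plain masking is an assumption.

```python
from typing import List, Sequence


def truncate_lsb(samples: Sequence[int], q: int = 1) -> List[int]:
    """Zero the q least significant bits of every sample."""
    if q < 0:
        raise ValueError("q must be non-negative")
    mask = ~((1 << q) - 1)  # e.g. q=2 -> ...11111100
    # Python integers behave like two's complement under '&',
    # so negative samples are handled as well.
    return [x & mask for x in samples]


quantised = truncate_lsb([501, 502, 503, -3], q=1)
print(quantised)
```

With q bits cleared, each sample can take only one in every 2^q of the original values, which is what makes the quantised signal more compressible.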

Acknowledgement

The event detection code is from Oxford Nanopore's Scrappie basecaller. The pore-models are from Nanopolish. Code snippets have been taken from Minimap2 and Samtools. The name of the tool sigtk in signal-space was inspired by seqtk in base-space. Kseq and ksort from klib are used. Segmentation method (aka jnn) was adapted from SquiggleKit and deeplexicon.
