buttery-eel
is a wrapper for guppy
and dorado
. It allows us to read SLOW5
files, and send that data to [guppy
] or dorado
server (https://community.nanoporetech.com/downloads) to basecall. It requires matching versions of guppy
and ont-pyguppy-client-lib
to work.
You can download guppy or dorado server here: https://community.nanoporetech.com/downloads. An ONT login is required to access that page, sorry no easy way around that one without legal headaches.
- Currently, the main branch is the multi-process version (parallel processes to communicate to/from Guppy client) that enables performance scaling for multi-GPU setups, especially for FAST basecalling or shorter reads. A simple single-process version (one process to communicate to/from the Guppy client) that works well for HAC and SUP models is available in the
singleproc
branch for learning purposes. - Before v0.3.3, the main branch was the single-process version (
singleproc
branch) and the multi=process version was under themultiproc
branch.
Using python3, preferably python3.7 to 3.9. Python 3.10 and higher does not yet have any pip wheel builds available for v6.3.8 and lower of guppy
Install a version of guppy
(something higher than 4) where GUPPY_VERSION
is the version, for example, 6.3.8
. Alternatively, you can install a version of dorado server
too.
Download: https://community.nanoporetech.com/downloads
The guppy
and ont-pyguppy-client-lib
versions need to match
git clone https://github.com/Psy-Fer/buttery-eel.git
cd buttery-eel
python3 -m venv venv3
source ./venv3/bin/activate
pip install --upgrade pip
pip install --upgrade setuptools wheel
# if your slow5 file uses zstd compression and you have zstd installed
# see slow5lib for more info
# set this first to ensure pyslow5 installs with zstd:
# export PYSLOW5_ZSTD=1
# if GUPPY_VERSION=6.3.8
# modify requirements.txt to have:
# ont-pyguppy-client-lib==6.3.8
# if using DORADO_SERVER_VERSION=7.1.4
# ont-pyguppy-client-lib==7.1.4
python setup.py install
# Alternatively, the new way of building things is to do the following command
# pip install .
buttery-eel --help
Suppose the name of the virtual environment you created is venv3 and resides directly in the root of the cloned buttery-eel git repository. In that case, you can use the wrapper script available under /path/to/repository/scripts/eel
for conveniently executing buttery-eel. This script will automatically source the virtual environment, find a free port, execute the buttery-eel with the parameters you specified and finally deactivate the virtual environment. If you add the path of /path/to/repository/scripts/
to your PATH environment variable, you can simply use buttery-eel as:
eel -g /path/to/ont-guppy/bin/ --config dna_r10.4.1_e8.2_400bps_hac_prom.cfg --device cuda:all -i reads.blow5 -o reads.reads # and any other parameters
Alternatively, you can manually execute buttery-eel if you have sourced the virtual environment. You must provide --port PORT --use_tcp
parameters manually in this case. Example:
buttery-eel -g /path/to/ont-guppy/bin/ --config dna_r10.4.1_e8.2_400bps_hac_prom.cfg --device cuda:all -i reads.blow5 -o reads.reads.fastq --port 5000 --use_tcp # and any other parameters
usage: buttery-eel [-h] -i INPUT -o OUTPUT -g GUPPY_BIN --config CONFIG [--guppy_batchsize GUPPY_BATCHSIZE] [--call_mods] [-q QSCORE] [--slow5_threads SLOW5_THREADS]
[--procs PROCS] [--slow5_batchsize SLOW5_BATCHSIZE] [--quiet] [--max_read_queue_size MAX_READ_QUEUE_SIZE] [--log LOG] [--moves_out]
[--do_read_splitting] [--min_score_read_splitting MIN_SCORE_READ_SPLITTING] [--detect_adapter] [--min_score_adapter MIN_SCORE_ADAPTER]
[--trim_adapters] [--detect_mid_strand_adapter] [--seq_sum] [--barcode_kits BARCODE_KITS] [--enable_trim_barcodes] [--require_barcodes_both_ends]
[--detect_mid_strand_barcodes] [--min_score_barcode_front MIN_SCORE_BARCODE_FRONT] [--min_score_barcode_rear MIN_SCORE_BARCODE_REAR]
[--min_score_barcode_mid MIN_SCORE_BARCODE_MID] [--profile] [-v]
buttery-eel - wrapping guppy/dorado for SLOW5 basecalling
optional arguments:
-h, --help show this help message and exit
--profile run cProfile on all processes - for debugging benchmarking (default: False)
-v, --version Prints version
Run Options:
-i INPUT, --input INPUT
input blow5 file or directory for basecalling (default: None)
-o OUTPUT, --output OUTPUT
output .fastq or unaligned .sam file to write (default: None)
-g GUPPY_BIN, --guppy_bin GUPPY_BIN
path to ont_guppy/bin or ont-dorado-server/bin folder (default: None)
--config CONFIG basecalling model config (default: dna_r9.4.1_450bps_fast.cfg)
--guppy_batchsize GUPPY_BATCHSIZE
number of reads to send to guppy/dorado at a time. (default: 4000)
--call_mods output MM/ML tags for methylation - will output sam - use with appropriate mod config (default: False)
-q QSCORE, --qscore QSCORE
A mean q-score to split fastq/sam files into pass/fail output (default: None)
--slow5_threads SLOW5_THREADS
Number of threads to use reading slow5 file (default: 4)
--procs PROCS Number of worker processes to use processing reads (default: 4)
--slow5_batchsize SLOW5_BATCHSIZE
Number of reads to process at a time reading slow5 (default: 4000)
--quiet Don't print progress (default: False)
--max_read_queue_size MAX_READ_QUEUE_SIZE
Number of reads to process at a time reading slow5 (default: 20000)
--log LOG guppy/dorado log folder path (default: buttery_basecaller_logs)
--moves_out output move table (sam format only) (default: False)
Sequencing summary Options:
--seq_sum Write out sequencing_summary.txt file (default: False)
Read splitting Options:
--do_read_splitting Perform read splitting based on mid-strand adapter detection (default: False)
--min_score_read_splitting MIN_SCORE_READ_SPLITTING
Minimum mid-strand adapter score for reads to be split (default: 50.0)
Adapter trimming Options:
--detect_adapter Enable detection of adapters at the front and rear of the sequence (default: False)
--min_score_adapter MIN_SCORE_ADAPTER
Minimum score for a front or rear adapter to be classified. Default is 60. (default: 60.0)
--trim_adapters Flag indicating that adapters should be trimmed. Default is False. (default: False)
--detect_mid_strand_adapter
Flag indicating that read will be marked as unclassified if the adapter sequence appears within the strand itself. Default is False. (default:
False)
Barcode demultiplexing Options:
--barcode_kits BARCODE_KITS
Strings naming each barcode kit to use. Default is to not do barcoding. (default: None)
--enable_trim_barcodes
Flag indicating that barcodes should be trimmed. (default: False)
--require_barcodes_both_ends
Flag indicating that barcodes must be at both ends. (default: False)
--detect_mid_strand_barcodes
Flag indicating that read will be marked as unclassified if barcodes appear within the strand itself. (default: False)
--min_score_barcode_front MIN_SCORE_BARCODE_FRONT
Minimum score for a front barcode to be classified (default: 60.0)
--min_score_barcode_rear MIN_SCORE_BARCODE_REAR
Minimum score for a rear barcode to be classified (default: 60.0)
--min_score_barcode_mid MIN_SCORE_BARCODE_MID
Minimum score for mid barcodes to be detected (default: 60.0)
Set up flags needed and run (--use_tcp
is needed but not forced in these early versions):
buttery-eel -g ont-guppy-6.3.8/bin --use_tcp -x "cuda:all" --config dna_r9.4.1_450bps_fast.cfg --port 5558 -i PAF25452_pass_bfdfd1d8_11.blow5 -o test.fastq
To call modifications, provide a modbases
model and the --call_mods
flag. Output will now be unaligned-sam containing the MM/ML
tags. It will also provide the move table
You must use guppy 6.3.0 or higher for mod calling
buttery-eel -g ont-guppy-6.3.8/bin --use_tcp -x "cuda:all" --config dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_fast.cfg --call_mods --port 5558 -i PAF25452_pass_bfdfd1d8_11.blow5 -o test.mod.sam
the --config
file can be found using this command with guppy guppy_basecaller --print_workflows
and looking up the appropriate kit and flowcell type. Specify the format like this --config dna_r9.4.1_450bps_fast.cfg
ending in .cfg
samtools fastq -TMM,ML test.mod.sam | minimap2 -ax map-ont -y ref.fa - | samtools view -Sb - | samtools sort - > test.aln.mod.bam
If you also wish to keep the quality scores in the unofficial qs tags or if mapping a regular unmapped sam the -T argument can be used in conjunction with minimap2 -y for example: -TMM,ML,qs
or -Tqs
. You can also get all sam tags with -T'*'
but you need samtools of v1.16 or higher.
If everything goes right, the server will be terminated at the end of the basecalling.
However, sometimes things go wrong, and the wrapper will temrinate before it terminates the server.
I have mostly fixed this but sometimes it still happens. Here is how you check for the server and then kill it.
# check for guppy instanaces
ps -ef | grep guppy
# That might give you a result like this
# hasindu 27946 27905 99 19:31 pts/22 01:25:29 /install/ont-guppy-6.3.8/bin/guppy_basecall_server --log_path buttery_guppy_logs --config dna_r9.4.1_450bps_hac_prom.cfg --port 5558 --use_tcp -x cuda:all --max_queued_reads 2000 --chunk_size 2000
# using the --port to see that it is indeed the one you started.
# you can then kill the process with, where in this case, `PID=27946`
kill <PID>
# then you can try again
ONT have 4 basecallers.
Albacore
(archived)Guppy
(current - production)Bonito
(research)Dorado
(preview - future production)
Albacore
(not used anymore) and Guppy
are closed source software, and researchers are required to sign a developer agreement to gain access to the source code. Any changes made by the researchers can't share that modified software, and there are a number of restrictions in place that are too boring to talk about.
Bonito
and Dorado
on the other hand, are MOSTLY open source. I say MOSTLY, because they have closed source dependencies, such as koi
, and their license isn't really "true" open source. Either way, it's close enough. This is great, because any modifications a researcher makes to those software, can be shared with the wider community.
We have had an interest in the basecallers as we are the developers of SLOW5
, a file format that is smaller and faster than FAST5
. While ONT is developing POD5
, there is a gap, where anyone using SLOW5
for downstream analysis and storage, have to convert their SLOW5
files back to FAST5
to re-basecall with newer versions of guppy
, bonito, or dorado. While we have built versions of guppy
that can read SLOW5
, we cannot distribute these due to developer agreement. We have built forks of both bonito and dorado and submitted pull requests on their official repositories, they were rejected by the ONT devs.
So there is a major road block for ease of use with SLOW5
for re-basecalling data, especially with guppy
, as that is the current production basecaller. I think this is something that stands in the way of more widespread adoption. While we can maintain our own forks of bonito and dorado as the changes to read SLOW5
are minimal, there is no work around to share a guppy
version that reads SLOW5
....UNTIL NOW!
Adaptive sampling has really pushed the need for interfaces with basecalling in real-time, and a python interface has been built to help control guppy_basecaller_client
and guppy_basecall_server
. There is an open library from ONT called pyguppyclient
, however I had some issues getting it to work, and support seems patchy. However it uses something called ont-pyguppy-client-lib
which is built and released with every guppy
release, and so is up to date with the latest versions. So, if we match the python library version, with the guppy
release version, they should be compatible.
So using this library and some helpful hints from the readfish/ReadUntil basecalling scripts by Matt Loose and Alexander Payne, buttery-eel
starts a guppy_basecall_server
, connects to it as a client, reads a SLOW5
file to get reads, converts them to a form to be accepted by guppy
, and sends them to the guppy_basecall_server
. It then collects the basecalled reads, and writes the data to a fastq. Once all the reads are processed, it sends a termination command to the guppy_basecall_server
and that's it.
The best thing about this, is all of the libraries and code is open, and so we can share a method that doesn't cause any legal headaches, and also doesn't require the ONT devs to accept any pull requests or code changes. This just wraps around any guppy
release like a "buttery eel", and you can use SLOW5
.
- Firstly, whoever maintains and develops the
ont-pyguppy-client-lib
at ONT, thank you. This library gives the minimum required to make this relatively easy to implement. - Hasindu Gamaarachchi for having the idea to do this, and for the issue posted on the slow5tools repo by SziKayLeung
- My partner Hilary for coming up with the name.
- Matt Loose and Alexander Payne for having the basics of this all along in your readfish code and being awesome in general
- ONT and their open source code of bonito and dorado for handling uSAM writing.
- Lastly, I'd like to say i'm a little surprised this wasn't suggested to us by the devs at ONT when they were rejecting our pull requests on Guppy, Bonito, and Dorado. Oh well.
- slow5lib/pyslow5
- ONT guppy
- ONT ont-pyguppy-client-lib
- basecaller code and flow mostly follows the methods used in readfish by Matt Loose and Alexander Payne