If you are anything like us, reading a README is the last thing you do when running code. PLEASE DON'T DO THAT FOR READFISH. This will effect changes to your sequencing and - if you use it incorrectly - cost you money. We have added a list of GOTCHAs at the end of this README. We have almost certainly missed some... so - if something goes wrong, let us know so we can add you to the GOTCHA hall of fame!
This is a Python3 package that integrates with the Read Until API. However, we use a slightly modified version - Read Until API V2.
The Read Until API provides a mechanism for an application to connect to a MinKNOW server to obtain read data in real-time. The data can be analysed in the way most fit for purpose, and a return call can be made to the server to unblock the read in progress and so direct sequencing capacity towards reads of interest.
This implementation of ReadFish requires Guppy version 3.4.5. and MinKNOW version core 3.6. It will not work on earlier versions and performance is not currently guranteed on later versions.
Update - 21/08/2020 - We currently have a pre-release version of ReadFish available which is compatible with MinKNOW 4.0 and Guppy 4. See the PyPI page for installation instructions.
**Guppy 3.4.5 is available from here https://mirror.oxfordnanoportal.com/software/analysis/ont-guppy_3.4.5_linux64.tar.gz **
Currently we only recommend LINUX for running ReadFish. We have not had effective performance on other platforms to date.
The code here has been tested with Guppy in GPU mode using GridION Mk1 and
NVIDIA RTX2080 on live sequencing runs and an NIVIDA GTX1080 using playback
on a simulated run (see below for how to test this).
This code is run at your own risk as it DOES affect sequencing output. You
are strongly advised to test your setup prior to running (see below for
example tests).
If you use this software please cite: 10.1101/2020.02.03.926956
Nanopore adaptive sequencing for mixed samples, whole exome capture and targeted panels
Alexander Payne, Nadine Holmes, Thomas Clarke, Rory Munro, Bisrat Debebe, Matthew Loose
bioRxiv 2020.02.03.926956; doi: https://doi.org/10.1101/2020.02.03.926956
# Make a virtual environment
python3 -m venv readfish
. ./readfish/bin/activate
pip install --upgrade pip
# Install our ReadFish Software
pip install git+https://github.com/LooseLab/read_until_api_v2@master
pip install git+https://github.com/LooseLab/readfish@master
# check install
$ readfish
usage: readfish [-h] [--version]
{targets,align,centrifuge,unblock-all,validate,summary} ...
positional arguments:
{targets,align,centrifuge,unblock-all,validate,summary}
Sub-commands
targets Run targeted sequencing
align ReadFish and Run Until, using minimap2
centrifuge ReadFish and Run Until, using centrifuge
unblock-all Unblock all reads
validate ReadFish TOML Validator
summary Summary stats from FASTQ files
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
See '<command> --help' to read about a specific sub-command.
# example run command - change arguments as necessary:
$ readfish targets --experiment-name "Test run" --device MN17073 --toml example.toml --log-file RU_log.log
For information on the TOML files see TOML.md.
To test readfish on your configuration we recommend first running a playback experiment to test unblock speed and then selection.
- Download an open access bulk FAST5 file from here. This file is 21Gb so make sure you have plenty of space.
- To configure a run for playback, you need to find and edit a sequencing TOML
file. These are typically located in
/opt/ont/minknow/conf/package/sequencing
. Edit a file such as sequencing_MIN106_DNA.toml and under the entry[custom_settings]
add a field:simulation = "/full/path/to/your_bulk.FAST5"
- If running GUPPY in GPU mode, set the parameter
break_reads_after_seconds = 1.0
tobreak_reads_after_seconds = 0.4
. - In the MinKNOW GUI, right click on a sequencing position and select
Reload Scripts
. Your version of MinKNOW will now playback the bulkfile rather than live sequencing. - Insert a configuration test flowcell into the sequencing device.
- Start a sequencing run as you would normally, selecting the corresponding flow cell type to the edited script (here FLO-MIN106) as the flowcell type.
- The run should start and immediately begin a mux scan. Let it run for around five minutes after which your read length histogram should look as below:
Now we shall test unblocking by running ru_unblock_all
which will simply eject
every single read on the flow cell.
- To do this run:
readfish unblock-all --device <YOUR_DEVICE_ID> --experiment-name "Testing ReadFish Unblock All"
- Leave the run for a further 5 minutes and observe the read length histogram. If unblocks are happening correctly you will see something like the below: A closeup of the unblock peak shows reads being unblocked quickly:
If you are happy with the unblock response, move onto testing basecalling.
To test selective sequencing you must have access to a guppy basecall server (>=3.4.0) and configure a TOML file. Here we provide an example TOML file.
-
First make a local copy of the example TOML file:
curl -O https://github.com/LooseLab/readfish/blob/master/examples/human_chr_selection.toml
-
Modify the
reference
field in the file to be the full path to a minimap2 index of the human genome. -
Modify the
targets
fields for each condition to reflect the naming convention used in your index. This is the sequence name only, up to but not including any whitespace. e.g.>chr1 human chromosome 1
would becomechr1
. If these names do not match, then target matching will fail. -
We provide a JSON schema and a script for validating configuration files which will let you check if the toml will drive an experiment as you expect:
readfish validate human_chr_selection.toml
Errors with the configuration will be written to the terminal along with a text description of the conditions for the experiment as below.
readfish validate examples/human_chr_selection.toml 😻 Looking good! Generating experiment description - please be patient! This experiment has 1 region on the flowcell Using reference: /path/to/reference.mmi Region 'select_chr_21_22' (control=False) has 2 targets of which 2 are in the reference. Reads will be unblocked when classed as single_off or multi_off; sequenced when classed as single_on or multi_on; and polled for more data when classed as no_map or no_seq.
-
If your toml file validates then run the following command:
readfish targets --device <YOUR_DEVICE_ID> \ --experiment-name "RU Test basecall and map" \ --toml <PATH_TO_TOML> \ --log-file ru_test.log
-
In the terminal window you should see messages reporting the speed of mapping of the form:
2020-02-24 16:45:35,677 ru.ru_gen 7R/0.03526s 2020-02-24 16:45:35,865 ru.ru_gen 3R/0.02302s 2020-02-24 16:45:35,965 ru.ru_gen 4R/0.02249s
Note: if these times are longer than 0.4 seconds you will have performance issues. Contact us via github issues for support.
-
In the MinKNOW messages interface you should see the experiment description as generated by the ru_validate command above.
The only way to test readfish on a playback run is to look at changes in read length for rejected vs accepted reads. To do this:
- Start a fresh simulation run using the bulkfile provided above.
- Restart the readfish command (as above):
readfish targets --device <YOUR_DEVICE_ID> \ --experiment-name "RU Test basecall and map" \ --toml <PATH_TO_TOML> \ --log-file ru_test.log
- Allow the run to proceed for at least 15 minutes (making sure you are writing out read data!).
- After 15 minutes it should look something like this: Zoomed in on the unblocks:
- Run
readfish summary
to check if your run has performed as expected. This file requires the path to your toml file followed by the path to your fastq reads. Typical results are provided below and show longer mean read lengths for the two selected chromosomes (here chr21 and chr22). Note the mean read lengths observed will be dependent on system performance. Optimal guppy configuration for your system is left to the user.contig number sum min max std mean median N50 chr1 1326 4187614 142 224402 14007 3158 795 48026 chr10 804 2843010 275 248168 15930 3536 842 47764 chr11 672 2510741 184 310591 18572 3736 841 73473 chr12 871 2317742 292 116848 9929 2661 825 37159 chr13 391 1090012 227 189103 12690 2788 781 41292 chr14 469 2323329 275 251029 20107 4954 830 68887 chr15 753 2189326 180 154830 12371 2907 812 40686 chr16 522 1673329 218 166941 12741 3206 862 39258 chr17 484 1609208 191 169651 15777 3325 816 73019 chr18 483 1525953 230 252901 14414 3159 813 40090 chr19 664 1898289 249 171742 13181 2859 820 46271 chr2 1474 4279420 234 222310 13090 2903 820 43618 chr20 489 1622910 229 171322 13223 3319 887 33669 chr21 32 1221224 1053 223477 56923 38163 13238 112200 chr22 47 724863 244 184049 28113 15423 6781 33464 chr3 1142 3554814 243 247771 15173 3113 760 62683 chr4 1224 4402210 210 221084 15769 3597 820 66686 chr5 1371 4495150 205 330821 16699 3279 801 65394 chr6 978 2725891 246 146169 10995 2787 791 37791 chr7 1039 3027136 166 263043 14705 2914 798 56567 chr8 848 2581406 238 229150 15618 3044 772 44498 chr9 893 3028224 259 247975 16011 3391 802 54953 chrM 144 216047 215 20731 2562 1500 864 1391 chrX 868 3124552 238 192451 15594 3600 832 49047 chrY 8 47071 510 31654 10743 5884 1382 31654
After completing your tests you should remove the simulation line from the sequencing_MIN106_DNA.toml file. You MUST then reload the scripts. If using Guppy GPU basecalling leave the break_reads_after_seconds parameter as 0.4.
These may or may not (!) be mistakes we have made already...
- If the previous run has not fully completed - i.e is still basecalling or processing raw data,you may connect to the wrong instance and see nothing happening. Always check the previous run has finished completely.
- If you have forgotten to remove your simultation line from your sequencing toml you will forever be trapped in an inception like resequencing of old data... Don't do this!
- If basecalling doesn't seem to be working check:
- Check your basecalling server is running.
- Check the ip of your server is correct.
- Check the port of your server is correct.
- If you are expecting reads to unblock but they do not - check that you have set
control=false
in your ru_generators toml file.control=true
will prevent any unblocks but does otherwise run the full analysis pipeline. - Oh no - every single read is being unblocked - I have nothing on target!
- Double check your reference file is in the correct location.
- Double check your targets exist in that reference file.
- Double check your targets are correctly formatted with contig name matching the record names in your reference (Exclude description - i.e the contig name up to the first whitespace).
- Where has my reference gone? If you are using a _live TOML file - e.g running iter_align or iter_cent, the previous reference MMI file is deleted when a new one is added. This obviosuly saves on disk space use(!) but can lead to unfortunate side effects - i.e you delete yoru MMI file. These can of course be recreated but user beware.
Happy ReadFishing!
We're really grateful to lots of people for help and support. Here's a few of them...
From the lab: Teri Evans, Sam Holt, Lewis Gallagher, Chris Alder, Thomas Clarke
From ONT: Stu Reid, Chris Wright, Rosemary Dokos, Chris Seymour, Clive Brown, George Pimm, Jon Pugh
From the Nanopore World: Nick Loman, Josh Quick, John Tyson, Jared Simpson, Ewan Birney, Alexander Senf, Nick Goldman, Miten Jain
And for our Awesome Logo please checkout out @tim_bassford from @TurbineCreative!
Please note that this repository is participating in a study into sustainability of open source projects. Data will be gathered about this repository for approximately the next 12 months, starting from June 2021.
Data collected will include number of contributors, number of PRs, time taken to close/merge these PRs, and issues closed.
For more information, please visit the informational page or download the participant information sheet.