democratizing genome assembly
demogenas is a pipeline for de novo genome assembly and genome evaluation. It's written in Snakemake.
In terms of dependencies you'll need:
Snakemake
- best installed via conda. Note that we have tested extensively only with Snakemake version 5.9.1, and we expect that some issues will arise with newer Snakemake versions.
Singularity
- globally installed. Version 3.11.4 and newer should work.
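Before running anything, it can help to confirm that both dependencies are reachable from your shell. The following is a minimal sketch, not part of demogenas: the helper name check_tool is made up for illustration, and it only assumes that each tool accepts a --version flag (both snakemake and singularity do).

```shell
# Report whether a tool is on PATH and print the first line of its version output.
# check_tool is a hypothetical helper, not something shipped with demogenas.
check_tool() {
    local tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool found: $("$tool" --version 2>&1 | head -n 1)"
    else
        echo "$tool NOT found on PATH" >&2
        return 1
    fi
}

# Check both dependencies and count how many are missing.
missing=0
for tool in snakemake singularity; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing required tool(s) missing"
```

If Snakemake reports a version other than 5.9.1, keep in mind the note above about newer versions.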
Further, you'll need to clone this repository to get the context for the workflow and a number of scripts that the workflow uses.
git clone --recursive https://github.com/chrishah/demogenas.git
cd demogenas
The master wrapper script can be executed as follows. You need to be inside the repository to execute it.
./demogenas
The wrapper script will call snakemake and trigger certain parts of the pipeline, indicated via the -m option.
To run the pipeline successfully you are going to need at least 2 files:
- data file - a tab-delimited file specifying the location and type of data for your samples (example: data/testdata/test.data.tsv)
- config file - specifying parameters for the pipeline (example: data/testdata/test.config.yaml)
The principal input data for now are:
- Illumina paired-end reads in fastq or bam format (specified in the data file in the columns f_fastq and r_fastq, or bam)
- ONT reads in fast5 format (data file column 'fast5_dir')
- long reads in fastq format (data file column 'long')
A given sample (data file column 'sample') can combine multiple datatypes specified in multiple lines with unique library names (data file column 'lib').
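To get a quick overview of which data types each library provides, the data file can be summarized with awk. This is only a sketch under stated assumptions: it assumes the file has a header row naming the columns listed above and that a data type a library does not provide is left as an empty cell. Neither is confirmed here, so check data/testdata/test.data.tsv before relying on it. The function name summarize_data_file is hypothetical.

```shell
# Print sample, lib, and the data columns that are filled in for each line.
# ASSUMPTIONS (not confirmed by the README): a header row with column names,
# and empty cells for data types that a library does not provide.
summarize_data_file() {
    awk -F'\t' '
        # Header line: remember the position of every column name.
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        {
            types = ""
            n = split("f_fastq r_fastq bam fast5_dir long", want, " ")
            for (j = 1; j <= n; j++) {
                c = col[want[j]]
                # Record the column name if it exists and the cell is not empty.
                if (c && $c != "") types = types (types == "" ? "" : ",") want[j]
            }
            print $col["sample"] "\t" $col["lib"] "\t" types
        }
    ' "$1"
}
```

It would be called as summarize_data_file data/testdata/test.data.tsv and prints one tab-separated line per library.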
The example data file (data/testdata/test.data.tsv) specifies 5 hypothetical samples comprising different kinds of input data:
- fastq_only - a sample for which multiple illumina libraries were sequenced and all data come in fastq format
- bam_only - a sample for which multiple illumina libraries were sequenced and all data come in bam format
- fastq_bam - a sample for which multiple illumina libraries were sequenced and data come as fastq and bam format
- fast5_only - a sample for which multiple ONT libraries were sequenced and data comes as fast5
- all_types - a sample for which multiple illumina and ONT libraries were sequenced and data come in fastq, bam and fast5 format
By default, demogenas will process (trim, error-correct, merge) and assemble all samples and data types in the data file automatically, with the particular steps (trimmers, correctors, assemblers) as specified in the config file.
./demogenas -t local -m assemble --configfile=data/testdata/test.config.yaml --dry
If you want to process only selected samples in the data file, you can specify the sample(s) of interest with the --select= option, e.g.:
# assemble one sample 'fastq_only'
./demogenas -t local -m assemble --configfile=data/testdata/test.config.yaml --dry --select=fastq_only
# assemble two samples 'fastq_only' and 'bam_only'
./demogenas -t local -m assemble --configfile=data/testdata/test.config.yaml --dry --select=fastq_only,bam_only
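When the list of samples comes from a script, the comma-separated value for --select= shown above can be built from individual names. A small sketch follows; join_samples and dry_run_cmd are hypothetical helper names, and the command line reuses only the flags from the examples above. It prints the command instead of running it, so you can inspect it first.

```shell
# Join sample names with commas, the form that --select= takes in the examples above.
# join_samples and dry_run_cmd are illustrative helpers, not part of demogenas.
join_samples() {
    local IFS=,
    echo "$*"
}

# Print the dry-run command for the given samples instead of executing it.
dry_run_cmd() {
    echo "./demogenas -t local -m assemble --configfile=data/testdata/test.config.yaml --dry --select=$(join_samples "$@")"
}

dry_run_cmd fastq_only bam_only
```

To actually execute the printed command from inside the repository, pipe it to a shell or copy it to the prompt.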
You can check the different behaviours of the pipeline for different sample/data types. Note that demogenas will do different things depending on the data types provided, e.g.:
# only fastq
./demogenas -t local -m assemble --configfile=data/testdata/test.config.yaml --dry --select=fastq_only
# only bam
./demogenas -t local -m assemble --configfile=data/testdata/test.config.yaml --dry --select=bam_only
# combination of fastq and bam
./demogenas -t local -m assemble --configfile=data/testdata/test.config.yaml --dry --select=fastq_bam
# fast5 only
./demogenas -t local -m assemble --configfile=data/testdata/test.config.yaml --dry --select=fast5_only
# combination of fastq, bam and fast5
./demogenas -t local -m assemble --configfile=data/testdata/test.config.yaml --dry --select=all_types
Here's the rulegraph of the workflow that would be executed by the last command above.
The config file controls which steps are performed, e.g.:
# process illumina data and assemble with all relevant assemblers
./demogenas -t local -m assemble --configfile=data/testdata/test.config.yaml --dry --select=fastq_only
# process illumina data and assemble only with spades
./demogenas -t local -m assemble --configfile=data/testdata/test.config.spadesonly.yaml --dry --select=fastq_only
If you don't want to go all the way and assemble, there are other modes, such as:
-m trim_illumina
-m correct_illumina
-m merge_illumina
-m eval_illumina
-m eval_kmer_plot
-m kmer_filter
-m call_ont
# trim illumina data
./demogenas -t local -m trim_illumina --configfile=data/testdata/test.config.yaml --dry --select=fastq_only
# trim and correct illumina data
./demogenas -t local -m correct_illumina --configfile=data/testdata/test.config.yaml --dry --select=fastq_bam
# trim, correct and merge illumina data
./demogenas -t local -m merge_illumina --configfile=data/testdata/test.config.yaml --dry --select=fastq_only
# trim illumina data and evaluate (fastqc, kmer spectrum)
./demogenas -t local -m eval_kmer_plot --configfile=data/testdata/test.config.yaml --dry --select=fastq_bam
The latter command would just trim Illumina reads and calculate and plot kmer frequencies.
Note that for a sample comprising only fast5 data, none of the Illumina-specific steps will be run. You can try it:
./demogenas -t local -m trim_illumina --configfile=data/testdata/test.config.yaml --dry --select=fast5_only
./demogenas -t local -m correct_illumina --configfile=data/testdata/test.config.yaml --dry --select=fast5_only
./demogenas -t local -m merge_illumina --configfile=data/testdata/test.config.yaml --dry --select=fast5_only
We have extra modes to evaluate your assemblies - this can be done at any time, even if not all assemblies are finished yet.
First, run prepare_assemblies mode. This will gather all assemblies that are finished at this stage (optionally restricted to a single sample id via the --select= option, as below).
./demogenas -m prepare_assemblies -t local --configfile=data/testdata/test.config.yaml --select="fastq_bam"
Since this is just a demo and no assembly has actually been run yet, the above command will not trigger any jobs; Snakemake will tell you that there's nothing to be done. However, if any assemblies for this sample had been completed, the command would have gathered them in a particular place. Check out the content of this target directory:
ls -1 results/fastq_bam/assembly_evaluation/assemblies/
You'll see a list of files. For now these are just empty files that ship with the repo for the purpose of this demo. Filenames as given by demogenas should be indicative of the origin of each file. The principal naming scheme is <assembler-trimmer-correction-merger>.min<length>.fasta. The file platanus-trimgalore-bless-usearch-auto.min1000.fasta, for example, was produced via platanus, based on reads trimmed with trimgalore, corrected with bless and merged with usearch. The assembly has been filtered to retain only scaffolds of a minimum length of 1000 bp. If one sticks to the principal naming scheme, one can also put in external assemblies for subsequent evaluation, such as the file something.min1000.fasta.
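The naming scheme can be taken apart with plain shell parameter expansion, which is handy for tabulating which tool combinations have finished. The sketch below uses a hypothetical helper, parse_assembly_name. Note that the example filename carries a fifth dash-separated field (auto) whose meaning is not described here, so the sketch simply reports it as extra.

```shell
# Split an assembly filename of the form <label>.min<length>.fasta into its parts.
# parse_assembly_name is an illustrative helper, not part of demogenas.
parse_assembly_name() {
    local base label minlen assembler trimmer corrector merger extra
    base=$(basename "$1" .fasta)   # drop directory and .fasta suffix
    label=${base%.min*}            # everything before ".min<length>"
    minlen=${base##*.min}          # the <length> part
    # Split the dash-separated label; anything beyond four fields lands in "extra".
    IFS=- read -r assembler trimmer corrector merger extra <<EOF
$label
EOF
    echo "assembler=$assembler trimmer=$trimmer corrector=$corrector merger=$merger extra=$extra minlen=$minlen"
}
```

For an external assembly such as something.min1000.fasta, only the first field and the minimum length are filled in; the remaining fields stay empty.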
Now, to evaluate all assemblies finished at this moment with the methods specified in the config file, run evaluate_assemblies mode, e.g.:
./demogenas -m evaluate_assemblies -t local --configfile=data/testdata/test.config.yaml --select="fastq_bam" --dry