MEGAHIT producing inflated assemblies? #239

soungalo · 2019-09-17T17:02:32Z

Hello,
I am trying to use megahit as part of a larger analysis pipeline and have run into some strange behavior. I produced a test set using the following steps:

1. Extracted a 1M genomic sequence from a reference genome
2. Aligned reads from the same species to the 1M fragment (using bwa mem)
3. Extracted only properly-aligned reads (samtools view -f 2 -q 30) and converted back to fastq. I ended up with ~340M bases (so ~340x coverage)
4. Assembled the reads using megahit (default params, paired end)
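
For anyone who wants to reproduce this, the steps above can be sketched roughly as the command lines below. This is only a sketch: the file names, the region string and the output directory are placeholders that do not come from the original post, and the commands are built as strings rather than executed.

```python
import shlex


def build_test_set_commands(ref, region, r1, r2, prefix, threads=4):
    """Return shell command strings for the four steps (nothing is executed).

    All paths and the region string are hypothetical placeholders.
    """
    q = shlex.quote
    frag = f"{prefix}.fragment.fa"
    bam = f"{prefix}.filtered.bam"
    out1 = f"{prefix}_R1.fastq"
    out2 = f"{prefix}_R2.fastq"
    return [
        # Step 1: extract the fragment from the reference and index it.
        f"samtools faidx {q(ref)} {q(region)} > {q(frag)}",
        f"bwa index {q(frag)}",
        # Steps 2-3: align, keep properly paired reads with MAPQ >= 30.
        f"bwa mem -t {threads} {q(frag)} {q(r1)} {q(r2)}"
        f" | samtools view -b -f 2 -q 30 -o {q(bam)} -",
        # Step 3 (continued): back to paired FASTQ; reads whose mate was
        # filtered out are discarded via -s /dev/null.
        f"samtools fastq -1 {q(out1)} -2 {q(out2)} -0 /dev/null -s /dev/null -n {q(bam)}",
        # Step 4: assemble with default MEGAHIT parameters.
        f"megahit -1 {q(out1)} -2 {q(out2)} -o {q(prefix + '_megahit')}",
    ]


if __name__ == "__main__":
    for cmd in build_test_set_commands(
        "reference.fa", "chr1:1-1000000", "reads_R1.fastq", "reads_R2.fastq", "test"
    ):
        print(cmd)
```

Note that the MAPQ filter is applied per read, so one mate of a pair can be dropped while the other is kept; the `-s /dev/null` in the `samtools fastq` step is there to discard such singletons.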

I expected to end up with an assembly size <= 1M, but was surprised to get ~2.43M (that's after discarding contigs < 200).
Any thoughts on why this might happen? Perhaps the parameters should be calibrated for the very high coverage I used? If so, how? Any suggestions?
I can provide the data if needed.

Thanks!

mingleiR · 2019-09-21T16:09:14Z

Very interesting test. I hope the author can offer some ideas.
In your test, MEGAHIT assembled a single genome rather than a metagenome.
Until the author replies, maybe you can try SPAdes or metaSPAdes.

voutcn · 2019-09-24T03:33:56Z

About three years ago I tried to assemble a viral genome using >1000x reads, which resulted in very fragmented contigs. Extremely high coverage means there are sequencing errors everywhere.

My solution was to use BBNorm (in the BBMap package) with target=70 to normalize the reads, and the contigs looked much better.

Recently Brian Bushnell, who developed BBMap, told me that using bbcms with bbcms.sh mincount=2 highcountfraction=0.6 might be better for metagenomes. I don't have the time and resources to run experiments, but you guys may want to try bbcms and/or bbnorm.
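
As a rough illustration of the two suggestions, the sketch below builds the corresponding command lines without running them. The `target`, `mincount` and `highcountfraction` values are the ones quoted above; the file names are placeholders, and the `in=`/`in2=`/`out=`/`out2=` arguments follow the usual BBTools convention for paired files, which should be checked against the scripts' own usage text.

```python
def bbnorm_command(r1, r2, out1, out2, target=70):
    """Command line for depth normalization with BBNorm (not executed)."""
    return [
        "bbnorm.sh",
        f"in={r1}", f"in2={r2}",
        f"out={out1}", f"out2={out2}",
        f"target={target}",  # target depth, as quoted in the thread
    ]


def bbcms_command(r1, r2, out1, out2, mincount=2, highcountfraction=0.6):
    """Command line for bbcms with the settings quoted in the thread (not executed)."""
    return [
        "bbcms.sh",
        f"in={r1}", f"in2={r2}",
        f"out={out1}", f"out2={out2}",
        f"mincount={mincount}",
        f"highcountfraction={highcountfraction}",
    ]


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    print(" ".join(bbnorm_command("R1.fq", "R2.fq", "norm_R1.fq", "norm_R2.fq")))
    print(" ".join(bbcms_command("R1.fq", "R2.fq", "cms_R1.fq", "cms_R2.fq")))
```

The resulting lists can be passed to `subprocess.run(..., check=True)` once BBTools is on the `PATH`; the normalized or filtered reads then go to the assembler in place of the raw reads.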

tseemann · 2019-10-22T20:15:32Z

@soungalo I agree with @voutcn. You essentially created an isolate genome but then used a metagenome assembler to attempt to recover it. A metagenome assembler makes different assumptions about the data, with the aim of recovering variable-coverage replicons. Unfortunately, this makes it more sensitive to read errors. For isolate assembly, more than 100x tends to make things worse, because more data just adds more noise (new random read errors) but no more signal (the underlying genome). Subsampling is the typical strategy, and it is what I use in shovill.
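
To make the subsampling idea concrete, here is a minimal sketch that works out what fraction of read pairs to keep for a target depth and then samples pairs at random. The 100x target is taken from the comment above as an illustrative value; this is not how shovill itself is implemented, and the function names are made up for the example.

```python
import random


def subsample_fraction(total_bases, genome_size, target_depth):
    """Fraction of reads to keep so the expected depth is target_depth."""
    current_depth = total_bases / genome_size
    return min(1.0, target_depth / current_depth)


def subsample_pairs(pairs, fraction, seed=42):
    """Keep each read pair with probability `fraction`.

    Sampling whole pairs (rather than R1 and R2 separately) keeps the
    mates in sync.
    """
    rng = random.Random(seed)
    return [pair for pair in pairs if rng.random() < fraction]


if __name__ == "__main__":
    # Numbers from the original post: ~340M bases over a 1M fragment.
    frac = subsample_fraction(340e6, 1e6, target_depth=100)
    print(f"keep about {frac:.2f} of the read pairs")  # about 0.29
```

In practice the computed fraction would usually be handed to an existing subsampling tool, using the same random seed for both read files so that the pairs stay matched.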

One extra comment, regarding "Extracted only properly-aligned reads (samtools view -f 2 -q 30)":
Tools like bwa mem perform local alignment, so the output includes alignments that use only short pieces of some reads, not the whole read. The bwa score threshold is 30, so pieces as small as 30bp can be included in the SAM file. You may want to use bowtie2 --end-to-end to force glocal alignment, given that the reads came from the same/similar genome.
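
If switching aligners is not an option, another way to act on this point is to drop alignments that cover only part of the read. The sketch below is one possible approach, not something proposed in the thread: it reads the CIGAR string of a SAM alignment line and keeps the read only if most of it is aligned rather than clipped. The 0.95 threshold and the function names are arbitrary choices for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_fraction(cigar):
    """Fraction of the read's bases that are aligned, not soft/hard clipped."""
    read_len = 0
    aligned = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "MI=X":   # bases of the read inside the alignment
            aligned += n
            read_len += n
        elif op in "SH":   # clipped bases: part of the read, not aligned
            read_len += n
        # D, N and P consume no read bases and are ignored here.
    return aligned / read_len if read_len else 0.0


def keep_sam_line(line, min_fraction=0.95):
    """True if a SAM alignment line (not a header line) is mostly aligned."""
    cigar = line.rstrip("\n").split("\t")[5]  # CIGAR is the 6th SAM column
    if cigar == "*":                          # unmapped read, no CIGAR
        return False
    return aligned_fraction(cigar) >= min_fraction


if __name__ == "__main__":
    print(aligned_fraction("150M"))     # 1.0
    print(aligned_fraction("30M120S"))  # 0.2
```

A read with CIGAR `30M120S` would pass `-f 2 -q 30` if it maps uniquely, even though only 30 of its 150 bases match the fragment; a filter like this one would remove it.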

voutcn mentioned this issue Oct 8, 2019

Error occurs when assembling contigs for k = 27, please refer to log for detail [Exit code -7] #222

Open

voutcn mentioned this issue Feb 23, 2020

The N50 is very short from soil sample #259

Open

voutcn mentioned this issue Jun 12, 2020

megahit exit code -6 #270

Closed

slambrechts mentioned this issue Apr 1, 2021

Discussion: How to assemble complicated metagenome e.g. soil metagenome-atlas/atlas#375

Closed

fpusan mentioned this issue Nov 7, 2024

Stopping in STEP1 -> 01.run_all_assemblies.pl. Program finished abnormally jtamames/SqueezeMeta#904

Closed
