Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MEGAHIT producing inflated assemblies? #239

Open
soungalo opened this issue Sep 17, 2019 · 3 comments
Open

MEGAHIT producing inflated assemblies? #239

soungalo opened this issue Sep 17, 2019 · 3 comments

Comments

@soungalo
Copy link

Hello,
I am trying to use megahit as part of a larger analysis pipeline, and have encountered a strange behavior. I produced a test set using the following steps:

  1. Extracted a 1M genomic sequence from a reference genome
  2. Aligned reads from the same species to the 1M fragment (using bwa mem)
  3. Extracted only properly-aligned reads (samtools view -f 2 -q 30) and converted back to fastq. I ended up with ~340M bases (so ~x340 coverage)
  4. Assembled the reads using megahit (default params - paired end)

I expected to end up with an assembly size <= 1M, but was surprised to get ~2.43M (that's after discarding contigs < 200).
Any thoughts on why this might happen? perhaps parameters should be calibrated for the very high coverage I used? If so, how? any suggestions?
I can provide the data if needed.

Thanks!

@mingleiR
Copy link

very interesting test. Hope the author give some idea.
In your test, MegaHit did the assembly of a single genome, rather than assembly of a metagenome.
Before the reply from the author, maybe you can try SPAdes or metaSPAdes

@voutcn
Copy link
Owner

voutcn commented Sep 24, 2019

About three years ago I tried to assemble a viral genome using >1000x reads and resulted in very fragmented contigs. Extremely high coverage means there are sequences errors everywhere.

My solution was to use BBNorm (in the BBMap package) with target=70 to normalize the reads and the contigs looked much better.

Recently Brian Bushnell, who developed BBMap told me that using bbcms with bbcms.sh mincount=2 highcountfraction=0.6 might be better for metagenomes. I don't have time and resources to do experiments but you guys may want to try bbmcs and/or bbnorm.

@tseemann
Copy link

@soungalo I agree with @voutcn . You essentially created an isolate genome but then used a metagenome assembler to attempt to recover it. A metagenome assembler has different assumptions about the data with the aim to recover variable coverage replicons. This makes it more sensitive to read errors unfortunately. For isolate assembly, more than 100x tends to make things worse. This is because more data just adds more noise (new random read errors) but no more signal (the underlying genome). Subsampling is the typical strategy, and what I use in shovill.

One extra comment: Extracted only properly-aligned reads (samtools view -f 2 -q 30)
Tools like bwa mem peform local alignment, so include alignments only using short pieces of some reads, not the whole read. The bwa score threshold is 30 so pieces as small as 30bp can be included n the SAM file. You may want to use bowtie2 --end-to-end to force glocal alignment given the reads came from the same/similar genome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants