MEGAHIT producing inflated assemblies? #239
Very interesting test. I hope the author can give some ideas.
About three years ago I tried to assemble a viral genome from >1000x reads and ended up with very fragmented contigs. Extremely high coverage means there are sequencing errors everywhere. My solution was to use BBNorm (in the BBMap package) with […]. Recently Brian Bushnell, who developed BBMap, told me that using bbcms with […]
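The exact options are truncated above, but for orientation, a representative BBNorm/bbcms run might look like the sketch below; the flag values are placeholders to tune per dataset, not the ones Brian recommended.

```
# Normalize an over-sequenced read set down to ~100x before assembly.
# The target= and min= values here are assumptions; see the BBNorm docs.
bbnorm.sh in=reads.fq out=normalized.fq target=100 min=5

# bbcms filters reads by k-mer depth using a count-min sketch; mincount=2
# is a placeholder, since the options from the original comment are lost.
bbcms.sh in=reads.fq out=filtered.fq mincount=2
```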
@soungalo I agree with @voutcn. You essentially created an isolate genome but then used a metagenome assembler to try to recover it. A metagenome assembler makes different assumptions about the data, with the aim of recovering variable-coverage replicons, which unfortunately makes it more sensitive to read errors. For isolate assembly, more than 100x tends to make things worse, because more data just adds more noise (new random read errors) but no more signal (the underlying genome is unchanged). Subsampling is the typical strategy, and what I use in […]. One extra comment: […]
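As a minimal sketch of that subsampling strategy, assuming ~1000x paired-end FASTQ input and using seqtk for illustration (the tool name in the comment above is cut off):

```
# Downsample ~1000x paired reads to ~100x: fraction = 100/1000 = 0.1.
# Using the same seed (-s) on both files keeps read pairs in sync.
seqtk sample -s100 reads_R1.fq 0.1 > sub_R1.fq
seqtk sample -s100 reads_R2.fq 0.1 > sub_R2.fq
```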
Hello,
I am trying to use megahit as part of a larger analysis pipeline and have encountered some strange behavior. I produced a test set using the following steps: […]
I expected to end up with an assembly size <= 1 Mbp, but was surprised to get ~2.43 Mbp (that's after discarding contigs < 200 bp).
Any thoughts on why this might happen? Perhaps parameters should be calibrated for the very high coverage I used? If so, how? Any suggestions?
I can provide the data if needed.
Thanks!
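One parameter worth trying for the very-high-coverage case, purely as a suggestion: MEGAHIT's --min-count, the minimum k-mer multiplicity kept in the de Bruijn graph (default 2). At ~1000x, error k-mers can still occur several times each, so raising it may prune them. A sketch with placeholder file names, combined with the subsampling suggested above:

```
# Assemble subsampled reads; --min-count 3 drops k-mers seen fewer than
# 3 times, which at extreme coverage are very likely sequencing errors.
megahit -1 sub_R1.fq -2 sub_R2.fq --min-count 3 --min-contig-len 200 -o mh_out

# Total assembly size over the contigs (>= 200 bp) that MEGAHIT kept:
awk '!/^>/ {n += length($0)} END {print n}' mh_out/final.contigs.fa
```

If the total drops back toward 1 Mbp after either change, the inflation was likely error-derived branch contigs rather than genuine extra sequence.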