-
Notifications
You must be signed in to change notification settings - Fork 28
How multiple sequence alignment partitioning works
Since MUSCLE cannot handle very long input sequences, Trycycler msa first partitions the sequence into pieces, then it conducts a multiple sequence alignment on each, and finally it stitches the alignments together:
Input sequences:
GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGACGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGACTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
Partitioned sequences:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGC CTAAACTG
GGCAGAGCGCGA CGTAAATTAC GAGTAAAAGGAGGGAGGAGCATTAAGCCATGC CTACTG
GGCAGAGCGCGA CTAAATTTAC GAGTAAAGGAGGGAGGAGCATAGCCATGC CTAAACTG
Aligned partitions:
GGCAGAG--CGA CGTAAA-TTAC GAGT-AAAGGAGGGGA-GAGCATTAAG-CATGC CTAAACTG
GGCAGAGCGCGA CGTAAA-TTAC GAGTAAAAGGA-GGGAGGAGCATTAAGCCATGC CT--ACTG
GGCAGAGCGCGA C-TAAATTTAC GAGT-AAAGGA-GGGAGGAGCAT--AGCCATGC CTAAACTG
Merged alignments:
GGCAGAG--CGACGTAAA-TTACGAGT-AAAGGAGGGGA-GAGCATTAAG-CATGCCTAAACTG
GGCAGAGCGCGACGTAAA-TTACGAGTAAAAGGA-GGGAGGAGCATTAAGCCATGCCT--ACTG
GGCAGAGCGCGAC-TAAATTTACGAGT-AAAGGA-GGGAGGAGCAT--AGCCATGCCTAAACTG
The sequence partitioning is done by grabbing a k-mer (size determined by --kmer
) in the first sequence and finding it each of the other sequences (looking in a chunk of each sequence with size determined by --lookahead
). If the k-mer is found once and only once in each sequence, then it forms the boundary of a partition. Then the position jumps ahead (determined by --step
) and the process repeats.
This means that the sequence will be partitioned into chunks that are typically --step
bp in size, though sometimes longer in repetitive or low-identity sequence regions.
To work through an example, imagine we had these input sequences:
GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGACGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGACTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
and parameters of --kmer 4 --step 10 --lookahead 20
(unrealistically small but appropriate for this toy example).
The first k-mer (GCGA
)is taken 10 bp (--step
) into the first sequence:
GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
And then that k-mer is searched for in each of the sequences (only looking in the part of each sequence defined by the lookahead margin):
GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG GGCAGAGCGCGACGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG GGCAGAGCGCGACTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG -------------------- lookahead
Since that k-mer was found once and only once in each sequence, it is considered a safe place to divide the sequences:
GGCAGAGCGA CGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA CGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA CTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
We then repeat the process, grabbing a k-mer 10 bp further on in the first sequence:
GGCAGAGCGA CGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
And searching for it in all sequences:
GGCAGAGCGA CGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG GGCAGAGCGCGA CGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG GGCAGAGCGCGA CTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG -------------------- lookahead
Once again, it was found once and only once, so it is a safe division point:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA CGTAAATTAC GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA CTAAATTTAC GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
We then again grab a k-mer from the first sequence:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
And search for it in all sequences:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG GGCAGAGCGCGA CGTAAATTAC GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG GGCAGAGCGCGA CTAAATTTAC GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG -------------------- lookahead
But this time the k-mer was not found once and only once in each sequence, so it's not a safe division point. We jump ahead 10 bp and try again with a new k-mer:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
And search for it in all sequences:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG GGCAGAGCGCGA CGTAAATTAC GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG GGCAGAGCGCGA CTAAATTTAC GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG ------------------------------ lookahead
Once again we failed to find a unique hit, so we do another 10 bp jump to get a new k-mer:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
And search:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG GGCAGAGCGCGA CGTAAATTAC GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG GGCAGAGCGCGA CTAAATTTAC GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG -------------------------------------- lookahead
And now we've finally found a once-in-each-sequence hit, so we have a safe division point:
GGCAGAGCGA CGTAAATTAC GAGTAAAGGAGGGGAGAGCATTAAGCATGC CTAAACTG
GGCAGAGCGCGA CGTAAATTAC GAGTAAAAGGAGGGAGGAGCATTAAGCCATGC CTACTG
GGCAGAGCGCGA CTAAATTTAC GAGTAAAGGAGGGAGGAGCATAGCCATGC CTAAACTG
In this way, Trycycler msa partitions the sequence into separate chunks while avoiding putting chunk boundaries in ambiguous or repetitive parts of the sequence.
- Home
- Software requirements
- Installation
-
How to run Trycycler
- Quick start
- Step 1: Generating assemblies
- Step 2: Clustering contigs
- Step 3: Reconciling contigs
- Step 4: Multiple sequence alignment
- Step 5: Partitioning reads
- Step 6: Generating a consensus
- Step 7: Polishing after Trycycler
- Illustrated pipeline overview
- Demo datasets
- Implementation details
- FAQ and miscellaneous tips
- Other pages
- Guide to bacterial genome assembly (choose your own adventure)
- Accuracy vs depth