Skip to content

How multiple sequence alignment partitioning works

Ryan Wick edited this page Nov 17, 2020 · 4 revisions

Since MUSCLE cannot handle very long input sequences, Trycycler msa first partitions the sequence into pieces, then it conducts a multiple sequence alignment on each, and finally it stitches the alignments together:

Input sequences:
  GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
  GGCAGAGCGCGACGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
  GGCAGAGCGCGACTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG

Partitioned sequences:
  GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGC    CTAAACTG
  GGCAGAGCGCGA  CGTAAATTAC  GAGTAAAAGGAGGGAGGAGCATTAAGCCATGC  CTACTG
  GGCAGAGCGCGA  CTAAATTTAC  GAGTAAAGGAGGGAGGAGCATAGCCATGC     CTAAACTG

Aligned partitions:
  GGCAGAG--CGA  CGTAAA-TTAC  GAGT-AAAGGAGGGGA-GAGCATTAAG-CATGC  CTAAACTG
  GGCAGAGCGCGA  CGTAAA-TTAC  GAGTAAAAGGA-GGGAGGAGCATTAAGCCATGC  CT--ACTG
  GGCAGAGCGCGA  C-TAAATTTAC  GAGT-AAAGGA-GGGAGGAGCAT--AGCCATGC  CTAAACTG

Merged alignments:
  GGCAGAG--CGACGTAAA-TTACGAGT-AAAGGAGGGGA-GAGCATTAAG-CATGCCTAAACTG
  GGCAGAGCGCGACGTAAA-TTACGAGTAAAAGGA-GGGAGGAGCATTAAGCCATGCCT--ACTG
  GGCAGAGCGCGAC-TAAATTTACGAGT-AAAGGA-GGGAGGAGCAT--AGCCATGCCTAAACTG

The sequence partitioning is done by grabbing a k-mer (size determined by --kmer) in the first sequence and finding it each of the other sequences (looking in a chunk of each sequence with size determined by --lookahead). If the k-mer is found once and only once in each sequence, then it forms the boundary of a partition. Then the position jumps ahead (determined by --step) and the process repeats.

This means that the sequence will be partitioned into chunks that are typically --step bp in size, though sometimes longer in repetitive or low-identity sequence regions.

Example partitioning

To work through an example, imagine we had these input sequences:

GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGACGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGACTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG

and parameters of --kmer 4 --step 10 --lookahead 20 (unrealistically small but appropriate for this toy example).

The first k-mer (GCGA)is taken 10 bp (--step) into the first sequence:

GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG

And then that k-mer is searched for in each of the sequences (only looking in the part of each sequence defined by the lookahead margin):

GGCAGAGCGACGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGACGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGACTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
--------------------
     lookahead

Since that k-mer was found once and only once in each sequence, it is considered a safe place to divide the sequences:

GGCAGAGCGA    CGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA  CGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA  CTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG

We then repeat the process, grabbing a k-mer 10 bp further on in the first sequence:

GGCAGAGCGA    CGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG

And searching for it in all sequences:

GGCAGAGCGA    CGTAAATTACGAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA  CGTAAATTACGAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA  CTAAATTTACGAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
              --------------------
                   lookahead

Once again, it was found once and only once, so it is a safe division point:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA  CGTAAATTAC  GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA  CTAAATTTAC  GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG

We then again grab a k-mer from the first sequence:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG

And search for it in all sequences:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA  CGTAAATTAC  GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA  CTAAATTTAC  GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
                          --------------------
                               lookahead

But this time the k-mer was not found once and only once in each sequence, so it's not a safe division point. We jump ahead 10 bp and try again with a new k-mer:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG

And search for it in all sequences:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA  CGTAAATTAC  GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA  CTAAATTTAC  GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
                          ------------------------------
                                    lookahead

Once again we failed to find a unique hit, so we do another 10 bp jump to get a new k-mer:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG

And search:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGCCTAAACTG
GGCAGAGCGCGA  CGTAAATTAC  GAGTAAAAGGAGGGAGGAGCATTAAGCCATGCCTACTG
GGCAGAGCGCGA  CTAAATTTAC  GAGTAAAGGAGGGAGGAGCATAGCCATGCCTAAACTG
                          --------------------------------------
                                        lookahead

And now we've finally found a once-in-each-sequence hit, so we have a safe division point:

GGCAGAGCGA    CGTAAATTAC  GAGTAAAGGAGGGGAGAGCATTAAGCATGC    CTAAACTG
GGCAGAGCGCGA  CGTAAATTAC  GAGTAAAAGGAGGGAGGAGCATTAAGCCATGC  CTACTG
GGCAGAGCGCGA  CTAAATTTAC  GAGTAAAGGAGGGAGGAGCATAGCCATGC     CTAAACTG

In this way, Trycycler msa partitions the sequence into separate chunks while avoiding putting chunk boundaries in ambiguous or repetitive parts of the sequence.

Clone this wiki locally