Skip to content
Sergei Kliver edited this page Jan 17, 2024 · 23 revisions

Synteny pipeline

Cactus-based pipeline for detection and visualization of synteny blocks via whole genome alignment (WGA)

Required software (might be incomplete)

Tools Source Description Comment
Windowmasker conda repeatmasking part of blast package
TRF conda repeatmasking
RepeatMasker conda repeatmasking
bedtools conda applying masking to fasta
BUSCOclade github tree reconstruction https://github.com/tomarovsky/BuscoClade
Progressive Cactus docker/github WGA docker-based installation preferred
HalToolkit cactus container/github/conda? Synteny maybe a bit painful for manual installation, but less than cactus
MAVR conda wrapper scripts available from channel mahajrod
RouToolPa conda dependency of MAVR available from channel mahajrod
MACE conda visualization available from channel mahajrod
Chromodotter conda visualization available from channel mahajrod

Stages:

I. Check input assemblies

To be sure that whole genome alignment (WGA) and hal-related tools will work, I recommend to check the scaffold ids in the input assemblies for special symbols (@, %$, |, and others) and remove them. NCBI and DNAzoo assemblies do not need such a verification. Probably, it is an overcare, but both Cactus and halToolkit may inherit old code (up to ancient Kent binaries) and may have issues with special symbols.

II. Mask the assemblies

Absolute minima is Windowmasker + TRF, but adding RepeatMasker is highly recommended too.

MAVR package has a script which runs all three tools, converts and combines output of all into a single gff and produces masked assembly.

Usage:

annotate_repeats_for_wga.py -i <input_fasta>  -s <taxon> -p <output_prefix> -t <threads>

Options: -s - taxa for RepeatMasker, for example, carnivora

Recommended CPU number : 40

Running time : for usual mammalian genome (2.5 Gbp) - 8-24 hours per genome, without Repeat masker - 2-4 h

Output: masked .fasta

III. Reconstruct phylogenetic tree

You could also use a precalculated tree. It should be a rooted and completely resolved (no three- or multifurcations) tree. Progressive Cactus in theory could work with not unresolved tree, but it will significantly increase running time. Cactus start from aligning adjacent leaves and reconstruct an ancestral genome, corresponding to closest node the internal node. If internal node has more than two descendant, it will do all-vs-all pairwise alignments before reconstruction. Next, procedure is repeated by following the tree till the root node. So for completely resolved tree with N leaves it will require N-1 pairwise alignments, for completely unresolved - N*(N-1)/2, respectively.

Recommended tool : BUSCOclade. It is a snakemake pipeline, so it installs all dependencies automatically and is able to use cluster. Input is just a set of fasta files with the assemblies. For fast tree use mafft alignment mode, for more precise - prank, respectively.

Recommended CPU number : use cluster

Usage:

follow the README in the github repo

Running time : 8-24 h per genome to run BUSCO, 5-12 h to run mafft or up to week to run prank (depends on number of genomes), up to 24 h to reconstruct the tree

IV. Run progressive cactus

I recommend to run cactus on a single machine with 70+ CPUs. Theoretically,you could generate all the commands and then run them on cluster, but I know no successes on it. During run cactus does a lot of IO on disk, and for mammalian-sized genome it require around 1-1.5 Tb space. By default it uses system temp folder. So if your machine doesn't have enough space change it via environment variables. 

Usage:

Docker image is highly recommended. It seems to be much faster than standalone cactus, and you will avoid all the dependency hell. I prefer to open container in interactive mode and then run the cactus command manually. Additional advantage of the container is that also includes halToolkit.


docker run -v $(pwd):/data --rm -it quay.io/comparative-genomics-toolkit/cactus:v2.6.7
cactus jobStore <config_file> <output_hal> --maxCores 100 --defaultMemory 10Gi --defaultDisk 20Gi --workDir ./tmp/ --logFile <logfile> --retryCount 10

Recommended CPU number : 70-100

Running time : 12-24 h per genome of mammalian size

Example of Cactus config file :

((felis_catus:0.04823724,(((((nyctereutes_procyonoides_china:0.00030531,nyctereutes_procyonoides:0.00032296):0.00476368,(vulpes_lagopus:0.0015788,(vulpes_vulpes:0.00145739,(vulpes_ferrilata:0.00000001,vulpes_corsac:0.00000001):0.00133068):0.000308):0.00350066):0.00029636,otocyon_megalotis:0.00545699):0.00017181,(((((canis_lupus_dingo:0.0001507,canis_lupus_dingo_alp:0.00016731):0.00021624,canis_lupus_familiaris:0.00034404):0.00014479,canis_lupus_orion:0.00035417):0.00117847,lycaon_pictus:0.00205181):0.0005135,chrysocyon_brachyurus:0.00244847):0.00267126):0.00050872,urocyon_littoralis:0.00519099):0.04705468):0.02630353,manis_javanica:0.08232512);
canis_lupus_dingo       genomes/canis_lupus_dingo.renamed.1000plus.fasta
canis_lupus_dingo_alp   genomes/canis_lupus_dingo_alp.renamed.1000plus.fasta
canis_lupus_familiaris  genomes/canis_lupus_familiaris.renamed.1000plus.fasta
canis_lupus_orion       genomes/canis_lupus_orion.renamed.1000plus.fasta
chrysocyon_brachyurus   genomes/chrysocyon_brachyurus.renamed.1000plus.fasta
felis_catus     genomes/felis_catus.renamed.1000plus.fasta
lycaon_pictus   genomes/lycaon_pictus.renamed.1000plus.fasta
manis_javanica  genomes/manis_javanica.renamed.1000plus.fasta
nyctereutes_procyonoides        genomes/nyctereutes_procyonoides.renamed.1000plus.fasta
nyctereutes_procyonoides_china  genomes/nyctereutes_procyonoides_china.renamed.1000plus.fasta
otocyon_megalotis       genomes/otocyon_megalotis.renamed.1000plus.fasta
urocyon_littoralis      genomes/urocyon_littoralis.renamed.1000plus.fasta
vulpes_corsac   genomes/vulpes_corsac.renamed.1000plus.fasta
vulpes_ferrilata        genomes/vulpes_ferrilata.renamed.1000plus.fasta
vulpes_lagopus  genomes/vulpes_lagopus.renamed.1000plus.fasta
vulpes_vulpes   genomes/vulpes_vulpes.renamed.1000plus.fasta

First line should contain a rooted tree in newick format. Next lines - genome labels and paths to the fasta files. Fastas must be uncompressed.

Output: .hal file with the alignment

V. Extraction of synteny blocks

halSynteny is a tool designed specially to extract synteny blocks from WGA in .hal format. It can extract synteny blocks for any pair of the genomes including the reconstructed ancestral ones. In pair one genome is query, second - target. It is single threaded and sometimes might be very slow. I know the cases when it was running for two-three weeks on .hal with 28 genomes. Fortunately, it allows parallelization by query genome scaffolds, i.e. you could run it independently for each query scaffold. However, it consumes a lot of memory. It is difficult to predict the RAM usage, but it is dependent on the size of alignment, distance between genomes and size of query chromosomes. For example, for machine with 768 GB RAM, alignment with 28 genomes , and longest chromosome of 350 Mbp it was safe to run no more 35 halSynteny jobs simultaneously. I prefer to generate list of commands for each scaffold of query and then run it using parallel. To reduce the number of commands you can avoid scaffold shorter than 10 kbp, and scaffolds shorter 1 kbp definitely must be filtered out.

Usage:

  1. if you wish to run halSynteny in single threaded mode for all your input genomes one by one:
> halsynteny_canidae2.cmds; 
SPECIES_LIST=(AHD_hap1 AHD_hap2 AHD_dnazoo basenji_UNSW-BAS-2019 basenji_MUID185726 boxer_canFam6 boxer_canFam3.1 great_dane GSD_canFam4 wolf dingo_alpine dingo labrador); 
for TARGET in ${SPECIES_LIST[@]}; 
    do 
    SPECIES_LIST=(${SPECIES_LIST[@]:1}); 
    for QUERY in ${SPECIES_LIST[@]}; 
        do 
        if [ "${TARGET}" = "${QUERY}" ]; 
            then 
            echo "Skipping self alignment"; 
            else 
            echo "halSynteny --maxAnchorDistance 50000 --minBlockSize 50000 --queryGenome ${QUERY} --targetGenome ${TARGET}  canidae2.hal ${QUERY}.to.${TARGET}.anchor50000.block50000.psl" >>  halsynteny_canidae2.cmds; 
            fi;  
         done; 
    done

parallel -j <thread_number> < halsynteny_canidae2.cmds
  1. if you wish to run halSynteny with parallelization for all your input genomes (you will need to generate a .len files in advance for each genome. .len file is a TAB separated file with two columns - scaffold id and scaffold length):
> halsynteny_all_vs_all.cmds ;

for TARGET in cryptoprocta_ferox felis_catus mirounga_angustirostris phoca_largha phoca_vitulina halichoerus_grypus pusa_sibirica erignathus_barbatus odobenus_rosmarus ailurus_fulgens aonyx_cinereus lutra_lutra enhydra_lutris lontra_canadensis pteronura_brasiliensis mustela_nigripes mustela_putorius_furo neogale_vison martes_flavigula martes_martes martes_zibellina bassariscus_astutus meles_meles procyon_lotor nasua_narica potos_flavus homo_sapiens canis_familiaris;
do
        mkdir -p all_vs_all/${TARGET};
        for QUERY in cryptoprocta_ferox felis_catus mirounga_angustirostris phoca_largha phoca_vitulina halichoerus_grypus pusa_sibirica erignathus_barbatus odobenus_rosmarus ailurus_fulgens aonyx_cinereus lutra_lutra enhydra_lutris lontra_canadensis pteronura_brasiliensis mustela_nigripes mustela_putorius_furo neogale_vison martes_flavigula martes_martes martes_zibellina bassariscus_astutus meles_meles procyon_lotor nasua_narica potos_flavus homo_sapiens martes_foina;
        do
                if [ "${TARGET}" = "${QUERY}" ];
                then
                        echo "Skipping self alignment";
                else
                        mkdir all_vs_all/${TARGET}/${QUERY};
                        for SCAF in `awk '{if ($2 >=10000) print $1}' /media/eternus3/skliver/mustelidae/cactus/genomes/len/${QUERY}*`;
                        do

                                echo "halSynteny --maxAnchorDistance 50000 --minBlockSize 50000 --queryGenome ${QUERY} --targetGenome ${TARGET} --queryChromosome ${SCAF}  mustpinpro.v5.hal all_vs_all/${TARGET}/${QUERY}/${SCAF}.${QUERY}.to.${TARGET}.anchor50000.block50000.psl" >>  halsynteny_all_vs_all.cmds;
                        done
                fi;
        done;
done

parallel -j 35 < halsynteny_all_vs_all.cmds

Options :

--maxAnchorDistance       - maximal distance between anchors (~sublocks) to merge
--minBlockSize            - minimal size of a synteny blocks

You can increase this number, but the fine structure will be lost.

Recommended CPU number : 35+, for big alignments it is limited by the RAM

Recommended RAM per job : Highly dependent on number of the genomes, size of chromosomes and phylogenetic distance between query and target

Running time : highly depends on size of genomes, phylogenetic distance and query chromosome size.

Output: .psl file with synteny blocks in case of unparalled run, or set of them if halSynteny was run on individual query scaffolds. Do not forget to merge them. Via cat for example

VI. Pairwise visualization via dotplot (only two species per plot)

VII. Visualization vs reference (many species vs one)

VIII. Visualization in chain (many species)