-
Notifications
You must be signed in to change notification settings - Fork 1
Synteny pipeline
Cactus-based pipeline for detection and visualization of synteny blocks via whole genome alignment (WGA)
Tools | Source | Description | Comment |
---|---|---|---|
Windowmasker | conda | repeatmasking | part of blast package |
TRF | conda | repeatmasking | |
RepeatMasker | conda | repeatmasking | |
bedtools | conda | applying masking to fasta | |
BUSCOclade | github | tree reconstruction | https://github.com/tomarovsky/BuscoClade |
Progressive Cactus | docker/github | WGA | docker-based installation preferred |
HalToolkit | cactus container/github/conda? | Synteny | maybe a bit painful for manual installation, but less than cactus |
MAVR | conda | wrapper scripts | available from channel mahajrod |
RouToolPa | conda | dependency of MAVR | available from channel mahajrod |
MACE | conda | visualization | available from channel mahajrod |
Chromodotter | conda | visualization | available from channel mahajrod |
To be sure that whole genome alignment (WGA) and hal-related tools will work, I recommend to check the scaffold ids in the input assemblies for special symbols (@, %$, |, and others) and remove them. NCBI and DNAzoo assemblies do not need such a verification. Probably, it is an overcare, but both Cactus and halToolkit may inherit old code (up to ancient Kent binaries) and may have issues with special symbols.
Absolute minima is Windowmasker + TRF, but adding RepeatMasker is highly recommended too.
MAVR package has a script which runs all three tools, converts and combines output of all into a single gff and produces masked assembly.
Usage:
annotate_repeats_for_wga.py -i <input_fasta> -s <taxon> -p <output_prefix> -t <threads>
Options: -s - taxa for RepeatMasker, for example, carnivora
Recommended CPU number : 40
Running time : for usual mammalian genome (2.5 Gbp) - 8-24 hours per genome, without Repeat masker - 2-4 h
Output: masked .fasta
You could also use a precalculated tree. It should be a rooted and completely resolved (no three- or multifurcations) tree. Progressive Cactus in theory could work with not unresolved tree, but it will significantly increase running time. Cactus start from aligning adjacent leaves and reconstruct an ancestral genome, corresponding to closest node the internal node. If internal node has more than two descendant, it will do all-vs-all pairwise alignments before reconstruction. Next, procedure is repeated by following the tree till the root node. So for completely resolved tree with N leaves it will require N-1 pairwise alignments, for completely unresolved - N*(N-1)/2, respectively.
Recommended tool : BUSCOclade. It is a snakemake pipeline, so it installs all dependencies automatically and is able to use cluster. Input is just a set of fasta files with the assemblies. For fast tree use mafft alignment mode, for more precise - prank, respectively.
Recommended CPU number : use cluster
Usage:
follow the README in the github repo
Running time : 8-24 h per genome to run BUSCO, 5-12 h to run mafft or up to week to run prank (depends on number of genomes), up to 24 h to reconstruct the tree
I recommend to run cactus on a single machine with 70+ CPUs. Theoretically,you could generate all the commands and then run them on cluster, but I know no successes on it. During run cactus does a lot of IO on disk, and for mammalian-sized genome it require around 1-1.5 Tb space. By default it uses system temp folder. So if your machine doesn't have enough space change it via environment variables.
Usage:
Docker image is highly recommended. It seems to be much faster than standalone cactus, and you will avoid all the dependency hell. I prefer to open container in interactive mode and then run the cactus command manually. Additional advantage of the container is that also includes halToolkit.
docker run -v $(pwd):/data --rm -it quay.io/comparative-genomics-toolkit/cactus:v2.6.7
cactus jobStore <config_file> <output_hal> --maxCores 100 --defaultMemory 10Gi --defaultDisk 20Gi --workDir ./tmp/ --logFile <logfile> --retryCount 10
Recommended CPU number : 70-100
Running time : 12-24 h per genome of mammalian size
Example of Cactus config file :
((felis_catus:0.04823724,(((((nyctereutes_procyonoides_china:0.00030531,nyctereutes_procyonoides:0.00032296):0.00476368,(vulpes_lagopus:0.0015788,(vulpes_vulpes:0.00145739,(vulpes_ferrilata:0.00000001,vulpes_corsac:0.00000001):0.00133068):0.000308):0.00350066):0.00029636,otocyon_megalotis:0.00545699):0.00017181,(((((canis_lupus_dingo:0.0001507,canis_lupus_dingo_alp:0.00016731):0.00021624,canis_lupus_familiaris:0.00034404):0.00014479,canis_lupus_orion:0.00035417):0.00117847,lycaon_pictus:0.00205181):0.0005135,chrysocyon_brachyurus:0.00244847):0.00267126):0.00050872,urocyon_littoralis:0.00519099):0.04705468):0.02630353,manis_javanica:0.08232512);
canis_lupus_dingo genomes/canis_lupus_dingo.renamed.1000plus.fasta
canis_lupus_dingo_alp genomes/canis_lupus_dingo_alp.renamed.1000plus.fasta
canis_lupus_familiaris genomes/canis_lupus_familiaris.renamed.1000plus.fasta
canis_lupus_orion genomes/canis_lupus_orion.renamed.1000plus.fasta
chrysocyon_brachyurus genomes/chrysocyon_brachyurus.renamed.1000plus.fasta
felis_catus genomes/felis_catus.renamed.1000plus.fasta
lycaon_pictus genomes/lycaon_pictus.renamed.1000plus.fasta
manis_javanica genomes/manis_javanica.renamed.1000plus.fasta
nyctereutes_procyonoides genomes/nyctereutes_procyonoides.renamed.1000plus.fasta
nyctereutes_procyonoides_china genomes/nyctereutes_procyonoides_china.renamed.1000plus.fasta
otocyon_megalotis genomes/otocyon_megalotis.renamed.1000plus.fasta
urocyon_littoralis genomes/urocyon_littoralis.renamed.1000plus.fasta
vulpes_corsac genomes/vulpes_corsac.renamed.1000plus.fasta
vulpes_ferrilata genomes/vulpes_ferrilata.renamed.1000plus.fasta
vulpes_lagopus genomes/vulpes_lagopus.renamed.1000plus.fasta
vulpes_vulpes genomes/vulpes_vulpes.renamed.1000plus.fasta
First line should contain a rooted tree in newick format. Next lines - genome labels and paths to the fasta files. Fastas must be uncompressed.
Output: .hal file with the alignment
halSynteny is a tool designed specially to extract synteny blocks from WGA in .hal format. It can extract synteny blocks for any pair of the genomes including the reconstructed ancestral ones. In pair one genome is query, second - target. It is single threaded and sometimes might be very slow. I know the cases when it was running for two-three weeks on .hal with 28 genomes. Fortunately, it allows parallelization by query genome scaffolds, i.e. you could run it independently for each query scaffold. However, it consumes a lot of memory. It is difficult to predict the RAM usage, but it is dependent on the size of alignment, distance between genomes and size of query chromosomes. For example, for machine with 768 GB RAM, alignment with 28 genomes , and longest chromosome of 350 Mbp it was safe to run no more 35 halSynteny jobs simultaneously. I prefer to generate list of commands for each scaffold of query and then run it using parallel. To reduce the number of commands you can avoid scaffold shorter than 10 kbp, and scaffolds shorter 1 kbp definitely must be filtered out.
Usage:
- if you wish to run halSynteny in single threaded mode for all your input genomes one by one:
> halsynteny_canidae2.cmds;
SPECIES_LIST=(AHD_hap1 AHD_hap2 AHD_dnazoo basenji_UNSW-BAS-2019 basenji_MUID185726 boxer_canFam6 boxer_canFam3.1 great_dane GSD_canFam4 wolf dingo_alpine dingo labrador);
for TARGET in ${SPECIES_LIST[@]};
do
SPECIES_LIST=(${SPECIES_LIST[@]:1});
for QUERY in ${SPECIES_LIST[@]};
do
if [ "${TARGET}" = "${QUERY}" ];
then
echo "Skipping self alignment";
else
echo "halSynteny --maxAnchorDistance 50000 --minBlockSize 50000 --queryGenome ${QUERY} --targetGenome ${TARGET} canidae2.hal ${QUERY}.to.${TARGET}.anchor50000.block50000.psl" >> halsynteny_canidae2.cmds;
fi;
done;
done
parallel -j <thread_number> < halsynteny_canidae2.cmds
- if you wish to run halSynteny with parallelization for all your input genomes (you will need to generate a .len files in advance for each genome. .len file is a TAB separated file with two columns - scaffold id and scaffold length):
> halsynteny_all_vs_all.cmds ;
for TARGET in cryptoprocta_ferox felis_catus mirounga_angustirostris phoca_largha phoca_vitulina halichoerus_grypus pusa_sibirica erignathus_barbatus odobenus_rosmarus ailurus_fulgens aonyx_cinereus lutra_lutra enhydra_lutris lontra_canadensis pteronura_brasiliensis mustela_nigripes mustela_putorius_furo neogale_vison martes_flavigula martes_martes martes_zibellina bassariscus_astutus meles_meles procyon_lotor nasua_narica potos_flavus homo_sapiens canis_familiaris;
do
mkdir -p all_vs_all/${TARGET};
for QUERY in cryptoprocta_ferox felis_catus mirounga_angustirostris phoca_largha phoca_vitulina halichoerus_grypus pusa_sibirica erignathus_barbatus odobenus_rosmarus ailurus_fulgens aonyx_cinereus lutra_lutra enhydra_lutris lontra_canadensis pteronura_brasiliensis mustela_nigripes mustela_putorius_furo neogale_vison martes_flavigula martes_martes martes_zibellina bassariscus_astutus meles_meles procyon_lotor nasua_narica potos_flavus homo_sapiens martes_foina;
do
if [ "${TARGET}" = "${QUERY}" ];
then
echo "Skipping self alignment";
else
mkdir all_vs_all/${TARGET}/${QUERY};
for SCAF in `awk '{if ($2 >=10000) print $1}' /media/eternus3/skliver/mustelidae/cactus/genomes/len/${QUERY}*`;
do
echo "halSynteny --maxAnchorDistance 50000 --minBlockSize 50000 --queryGenome ${QUERY} --targetGenome ${TARGET} --queryChromosome ${SCAF} mustpinpro.v5.hal all_vs_all/${TARGET}/${QUERY}/${SCAF}.${QUERY}.to.${TARGET}.anchor50000.block50000.psl" >> halsynteny_all_vs_all.cmds;
done
fi;
done;
done
parallel -j 35 < halsynteny_all_vs_all.cmds
Options :
--maxAnchorDistance - maximal distance between anchors (~sublocks) to merge
--minBlockSize - minimal size of a synteny blocks
You can increase this number, but the fine structure will be lost.
Recommended CPU number : 35+, for big alignments it is limited by the RAM
Recommended RAM per job : Highly dependent on number of the genomes, size of chromosomes and phylogenetic distance between query and target
Running time : highly depends on size of genomes, phylogenetic distance and query chromosome size.
Output: .psl file with synteny blocks in case of unparalled run, or set of them if halSynteny was run on individual query scaffolds. Do not forget to merge them. Via cat for example