
Collection of scripts to construct and curate phylogenomic datasets


fmarletaz/phylogenomics

Tools for phylogenomic analyses

This repository contains several Python scripts that facilitate steps of a phylogenomic analysis pipeline. They were used in particular for the following work:

Marlétaz F, Peijnenburg KTCA, Goto T, Satoh N, Rokhsar DS. A new spiralian phylogeny refines the position of the enigmatic arrow worms. in preparation

cross-conta.py

Index hopping causes a small fraction of reads to cross-contaminate Illumina libraries sequenced on the same lane. This is a minor problem for quantitative approaches, but in de novo assembly it can produce mislabelled assembled transcripts. This tool follows the same line of reasoning as Simion et al. (2018). Briefly, for a given transcriptome, it generates read counts against all libraries sequenced on the same lane using kallisto, and then filters out each transcript that has a higher count in another library than in the one it belongs to.

usage: cross-conta.py [options] <ctrl>

  ctrl        Control file including names of assembly and paired reads for each library

optional arguments:
  -p NPROC    Number of threads (default: 8)
  -f FOLD     Fold-enrichment to discard contig (default: 2)
  -m MINCOV   Minimal coverage of contig by corresp. reads (default: 2)
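The filtering rule described above can be illustrated with a minimal sketch. It is not the script's actual code: the function name, the input layout (a dictionary of per-library counts for each transcript) and the exact comparison (discard when another library's count exceeds `fold` times the transcript's own-library count) are assumptions made for illustration, and the minimal-coverage option is not modelled.

```python
def flag_cross_contaminants(counts, own_library, fold=2.0):
    """Return ids of transcripts that look like cross-contaminants.

    counts      -- dict: transcript id -> dict: library name -> read count
                   (hypothetical layout, e.g. parsed from kallisto output)
    own_library -- name of the library the assembly was built from
    fold        -- assumed meaning of the fold-enrichment option: discard when
                   another library's count is more than `fold` times the own count
    """
    flagged = []
    for transcript, per_library in counts.items():
        own = per_library.get(own_library, 0.0)
        # Highest count observed in any library other than the transcript's own.
        best_other = max(
            (value for library, value in per_library.items()
             if library != own_library),
            default=0.0,
        )
        if best_other > fold * own:
            flagged.append(transcript)
    return flagged


# Hypothetical counts for three transcripts assembled from library "A".
example_counts = {
    "t1": {"A": 100, "B": 5},
    "t2": {"A": 3, "B": 50},
    "t3": {"A": 10, "B": 15},
}
flagged = flag_cross_contaminants(example_counts, "A")
```

With the default `fold` of 2, only `t2` is flagged here; `t3` has a higher count in library `B`, but not by the required factor.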

phylostata.py

This utility computes various statistics on a collection of alignments and applies some filters.

Dependencies: the ete3 library, numpy and Biopython, all of which can be installed with conda.

Briefly, it checks the monophyly of each clade mentioned in the taxonomic list, computes the mutational saturation of each alignment, and excludes taxa whose divergence to the root exceeds a threshold. Usage is simple:

Usage: phyloStrata.py <taxlist> <suffix> <fasta files...>

taxlist must be formatted as a list of the taxa in the alignments, each with a generic clade name separated by a tab. The monophyly of the taxa in each clade will be checked. The suffix is used for the output file. Tree files must be named taxon.xx.xx and the corresponding fasta files taxon.al.hc.tr.fa.
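As a rough illustration of the taxlist format and of what a monophyly check amounts to, here is a small sketch. It makes several assumptions: the taxon is taken to be the first tab-separated column and the clade name the second, the taxon and clade names are made up, and the tree is a rooted nested tuple standing in for the tree objects the script would read with ete3. None of the function names come from the script.

```python
from collections import defaultdict


def read_taxlist(lines):
    """Group taxa by clade from 'taxon<TAB>clade' lines (assumed column order)."""
    clades = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected two tab-separated fields: {line!r}")
        clades[fields[1].strip()].append(fields[0].strip())
    return dict(clades)


def collect_leaf_sets(node, out):
    """Append the leaf set under every node of a nested-tuple tree to `out`."""
    if isinstance(node, str):
        leaves = frozenset([node])
    else:
        leaves = frozenset().union(*(collect_leaf_sets(child, out) for child in node))
    out.append(leaves)
    return leaves


def is_monophyletic(tree, taxa):
    """True if the taxa present in the rooted tree form exactly one subtree."""
    node_sets = []
    all_leaves = collect_leaf_sets(tree, node_sets)
    present = frozenset(taxa) & all_leaves  # ignore taxa missing from this tree
    return bool(present) and present in node_sets


# Hypothetical taxlist content and rooted gene tree.
taxlist_lines = [
    "Sagitta\tChaetognatha\n",
    "Spadella\tChaetognatha\n",
    "Lingula\tBrachiopoda\n",
    "Terebratalia\tBrachiopoda\n",
]
tree = (("Sagitta", "Spadella"), ("Lingula", "Capitella"))

clades = read_taxlist(taxlist_lines)
results = {name: is_monophyletic(tree, taxa) for name, taxa in clades.items()}
```

Taxa absent from a given tree are ignored in this sketch, so a clade represented by a single taxon in a gene tree counts as monophyletic; whether the script treats that case the same way is not stated in this README.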

concatenate-red.py

This script builds a concatenated alignment from the filtered alignments and the statistics file generated by phylostata.py.

Usage: concatenate-ext.py <taxlist> <suffix> <fasta files...>
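Concatenation itself can be sketched in a few lines. The example below is only an illustration under stated assumptions: alignments are plain dictionaries mapping taxon to aligned sequence, taxa missing from a gene are padded with gap characters, and the function name and partition layout are hypothetical. It does not read the statistics file or reproduce the script's options.

```python
def concatenate(alignments, taxa, missing="-"):
    """Join per-gene alignments into one supermatrix.

    alignments -- list of dicts: taxon -> aligned sequence (equal length per gene)
    taxa       -- taxa to include, in output order
    missing    -- character used to pad taxa absent from a gene (an assumption)
    Returns (supermatrix dict, list of (gene index, start, end) with 1-based ends).
    """
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for index, alignment in enumerate(alignments):
        lengths = {len(sequence) for sequence in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"alignment {index} is empty or has unequal lengths")
        length = lengths.pop()
        for taxon in taxa:
            # Pad with the missing-data character when the gene lacks this taxon.
            pieces[taxon].append(alignment.get(taxon, missing * length))
        partitions.append((index, start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Two hypothetical gene alignments with partially overlapping taxa.
genes = [
    {"a": "MK-L", "b": "MKAL"},
    {"a": "GG", "c": "GA"},
]
supermatrix, partitions = concatenate(genes, ["a", "b", "c"])
```

Keeping the partition boundaries alongside the supermatrix makes it possible to apply per-gene models later.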
