po2group_stats.pl
is a script to categorize orthologs from Proteinortho5 output according to genome groups. In the prot_finder workflow is a script, binary_group_stats.pl
, which does the same thing for column groups in a delimited TEXT binary matrix.
- Synopsis
- Description
- Usage
- Options
- Output
- Dependencies
- Run environment
- Author - contact
- Citation, installation, and license
- Changelog
perl po2group_stats.pl -i matrix.proteinortho -d genome_fasta_dir/ -g group_file.tsv -p > overall_stats.tsv
Categorize the genomes in an ortholog/paralog output matrix (option -i) from a Proteinortho5 calculation according to group affiliations. The group affiliations of the genomes are intended to get overall presence/absence statistics for groups of genomes and not simply single genomes (e.g. comparing 'marine', 'earth', 'commensal', 'pathogenic' etc. genome groups). Percentage inclusion (option -cut_i) and exclusion (option -cut_e) cutoffs can be set to define how strict the presence/absence of genome groups within an orthologous group (OG) are defined. Of course groups can also hold only single genomes to get single genome statistics. Group affiliations are defined in a mandatory tab-delimited group input file (option -g) with minimal two and maximal four groups.
Only alphanumeric (a-z, A-Z, 0-9), underscore (_), dash (-), and period (.) characters are allowed for the group names in the group file to avoid downstream problems with the operating/file system. As a consequence, also no whitespaces are allowed in these! Additionally, group names, genome filenames (should be enforced by the file system), and FASTA IDs considering all genome files (mostly locus tags; should be enforced by Proteinortho5) need to be unique.
Proteinortho5 (PO) has to be run with option -singles to include also genes without orthologs, so-called singletons/ORFans, for each genome in the PO matrix (see the PO manual). Additionally, option -selfblast is recommended to enhance paralog detection by PO.
To explain the logic behind the categorization, the following annotation for example groups will be used. A '1' exemplifies a group genome count in a respective OG >= the rounded inclusion cutoff, a '0' a group genome count <= the rounded exclusion cutoff. The presence and absence of OGs for the group affiliations are structured in different categories depending on the number of groups. For two groups (e.g. A and B) there are five categories: 'A specific' (A:B = 1:0), 'B specific' (0:1), 'cutoff core' (1:1), 'underrepresented' (0:0), and 'unspecific'. Unspecific OGs have a genome count for at least one group outside the cutoffs (exclusion cutoff < genome count < inclusion cutoff) and thus cannot be categorized. These 'unspecific' OGs will only be printed to a final annotation result file with option -u. Overall stats for all categories are printed to STDOUT in a final tab-delimited output matrix.
Three groups (A, B, and C) have the following nine categories: 'A specific' (A:B:C = 1:0:0), 'B specific' (0:1:0), 'C specific' (0:0:1), 'A absent' (0:1:1), 'B absent' (1:0:1), 'C absent' (1:1:0), 'cutoff core' (1:1:1), 'underrepresented' (0:0:0), and 'unspecific'.
Four groups (A, B, C, and D) are classified in 17 categories: 'A specific' (A:B:C:D = 1:0:0:0), 'B specific' (0:1:0:0), 'C specific' (0:0:1:0), 'D specific' (0:0:0:1), 'A-B specific' (1:1:0:0), 'A-C specific' (1:0:1:0), 'A-D specific' (1:0:0:1), 'B-C specific' (0:1:1:0), 'B-D specific' (0:1:0:1), 'C-D specific' (0:0:1:1), 'A absent' (0:1:1:1), 'B absent' (1:0:1:1), 'C absent' (1:1:0:1), 'D absent' (1:1:1:0), 'cutoff core' (1:1:1:1), 'underrepresented' (0:0:0:0), and 'unspecific'.
The resulting group presence/absence (according to the cutoffs) can also be printed to a binary matrix (option -b) in the result directory (option -r), excluding the 'unspecific' category. Since the categories are the logics underlying venn diagrams, you can also plot the results in a venn diagram using the binary matrix (option -p). The 'underrepresented' category is exempt from the venn diagram, because it is outside of venn diagram logics.
Here are venn diagrams illustrating the logic categories (see folder 'pics'):
There are two optional categories (which are only considered for the final print outs and in the final stats matrix, not for the binary matrix and the venn diagram): 'strict core' (option -co) for OGs where all genomes have an ortholog, independent of the cutoffs. Of course all the 'strict core' OGs are also included in the 'cutoff_core' category ('strict core' is identical to 'cutoff core' with -cut_i 1 and -cut_e 0). Option -s activates the detection of 'singleton/ORFan' OGs present in only one genome. Depending on the cutoffs and number of genomes in the groups, category 'underrepresented' includes most of these singletons.
Additionally, annotation is retrieved from multi-FASTA files created
with cds_extractor.pl
. See
cds_extractor.pl
for a description of the
format. These files are used as input for the PO analysis and with
option -d for po2group_stats.pl
. The annotations are printed
in category output files in the result directory.
Annotations are only pulled from one representative genome for each category present in the current OG. With option -co you can set a specific genome for the representative annotation for category 'strict core'. For all other categories the representative genome is chosen according to the order of the genomes in the group files, depending on the presence in each OG. Thus, the best annotated genome should be in the first group at the topmost position (especially for 'cutoff core'), as well as the best annotated ones at the top in all other groups.
In the result files, each orthologous group (OG) is listed in a row
of the resulting category files, the first column holds the OG
numbers from the PO input matrix (i.e. line number minus one). The
following columns specify the ID for each CDS, gene, EC number(s),
product, and organism are shown depending on their presence in the
CDS's annotation. The ID is in most cases the locus tag (see
cds_extractor.pl
). If several EC numbers exist
for a single CDS they are separated by a ';'. If the representative
genome within an OG includes paralogs (co-orthologs) these will be
printed in the following row(s) without a new OG number in the
first column.
The number of OGs in the category annotation result files are the same as listed in the venn diagram and the final stats matrix. However, since only annotation from one representative annotation is used the CDS number will be different to the final stats. The final stats include all the CDS in this category in all genomes present in the OG in groups >= the inclusion cutoff (i.e. for 'strict core' the CDS for all genomes in this OG are counted). Two categories are different, for 'unspecific' all unspecific groups are included, for 'underrepresented' all groups <= the exclusion cutoffs. This is also the reason, the 'pangenome' CDS count is greater than the 'included in categories' CDS count in the final stats matrix, as genomes in excluded groups are exempt from the CDS counts for most categories. 'Included in categories' is the OG/CDS sum of all non-optional categories ('*specific', '*absent', 'cutoff core', 'underrepresented', and 'unspecific'), since the optional categories are included in non-optionals. An exception to the difference in CDS counts are the 'singletons' category where OG and CDS counts are identical in the result files and in the overall final output matrix (as there is only one genome), as well as in group-'specific' categories for groups including only one genome.
At last, if you want the respective representative sequences for a category you can first filter the locus tags from the result file with Unix command-line tools:
grep -v "^#" result_file.tsv | cut -f 2 > locus_tags.txt
And then feed the locus tag list to
cds_extractor.pl
with option -l.
As a final note, in the prot_finder workflow is a
script, binary_group_stats.pl
, based upon po2group_stats.pl
,
which can calculate overall presence/absence statistics for column
groups in a delimited TEXT binary matrix (as with genomes here).
for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done
proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn]
perl po2group_stats.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -g group_file.tsv -r result_dir -cut_i 0.7 -cut_e 0.2 -b -p -co genome4.[faa|ffn] -s -u -a > overall_stats.tsv
-
-i=str, -input=str
Proteinortho (PO) result matrix (*.proteinortho or *.poff)
-
-d=str, -dir_genome=str
Path to the directory including the genome multi-FASTA PO input files (*.faa or *.ffn), created with
cds_extractor.pl
-
-g=str, -groups_file=str
Tab-delimited file with group affiliation for the genomes with minimal two and maximal four groups (easiest to create in a spreadsheet software and save in tab-separated format). All genomes from the PO matrix need to be included. Group names can only include alphanumeric (a-z, A-Z, 0-9), underscore (_), dash (-), and period (.) characters (no whitespaces allowed either). Example format with two genomes in group A, three genomes in group B and D, and one genome in group C:
group_A group_B group_C group_D
genome1.faa genome2.faa genome3.faa genome4.faa
genome5.faa genome6.faa genome7.faa
genome8.faa genome9.faa
-
-h, -help
Help (perldoc POD)
-
-r=str, -result_dir=str
Path to result folder [default = inclusion and exclusion percentage cutoff, './results_i#_e#']
-
-cut_i=float, -cut_inclusion=float
Percentage inclusion cutoff for genomes in a group per OG, has to be > 0 and <= 1. Cutoff will be rounded according to the genome number in each group and has to be > the rounded exclusion cutoff in this group. [default = 0.9]
-
-cut_e=float, -cut_exclusion=float
Percentage exclusion cutoff, has to be >= 0 and < 1. Rounded cutoff has to be < rounded inclusion cutoff. [default = 0.1]
-
-b, -binary_matrix
Print a binary matrix with the presence/absence genome group results according to the cutoffs (excluding 'unspecific' category OGs)
-
-p, -plot_venn
Plot venn diagram from the binary matrix (except 'unspecific' and 'underrepresented' categories) with function
venn
from R package gplots, requires option -b -
-co=(str), -core_strict=(str)
Include 'strict core' category in output. Optionally, give a genome name from the PO matrix to use for the representative output annotation. [default = topmost genome in first group]
-
-s, -singletons
Include singletons/ORFans for each genome in the output, activates also overall genome OG/CDS stats in final stats matrix for genomes with singletons
-
-u, -unspecific
Include 'unspecific' category representative annotation file in result directory
-
-a, -all_genomes_overall
Report overall stats for all genomes (appended to the final stats matrix), also those without singletons; will include all overall genome stats without option -s
-
-v, -version
Print version number to STDERR
-
STDOUT
The tab-delimited final stats matrix is printed to STDOUT. Redirect or pipe into another tool as needed.
-
./results_i#_e#
All output files are stored in a results folder
-
./results_i#_e#/[*_specific|*_absent|cutoff_core|underrepresented]_OGs.tsv
Tab-delimited files with OG annotation from a representative genome for non-optional categories
-
(./results_i#_e#/[*_singletons|strict_core|unspecific]_OGs.tsv)
Optional category tab-delimited output files with representative annotation
-
(./results_i#_e#/binary_matrix.tsv)
Tab-delimited binary matrix of group presence/absence results according to cutoffs (excluding 'unspecific' category)
-
(./results_i#_e#/venn_diagram.pdf)
Venn diagram for non-optional categories (except 'unspecific' and 'underrepresented' categories)
-
Statistical computing language R
Rscript
is needed to plot the venn diagram with option -p, tested with version 3.2.2 -
gplots (https://cran.r-project.org/web/packages/gplots/index.html)
Package needed for R to plot the venn diagram with function
venn
. Tested with gplots version 2.17.0.
The Perl script runs under UNIX and Windows flavors.
Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
For citation, installation, and license information please see the repository main README.md.
- v0.1.3 (06.06.2016)
- included check for file system conformity for group names
- some minor syntax changes and additions to error messages, basically adapting to
binary_group_stats.pl
- v0.1.2 (19.11.2015)
- added
pod2usage
-die for Getopts::Long call - minor POD/README change
- added
- v0.1.1 (30.10.2015)
- fixed bug for representative annotation in output files, the representative genome was not chosen according to genome order in the groups file
- v0.1 (23.10.2015)