Note: This repo contains the code for the training phase of GeneMarkS-2 only. To download and use the complete program, please visit topaz.gatech.edu
Article Name: Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes.
Authors: Alex Lomsadze^, Karl Gemayel^, Shiyuyun Tang and Mark Borodovsky
^ joint first authors
Affiliation: Georgia Institute of Technology
Group Website: topaz.gatech.edu
PubMed: www.ncbi.nlm.nih.gov/pubmed/29773659/
Structure: GeneMarkS-2 is made up of four components:
- gms2.pl : Controls the entire GeneMarkS-2 algorithm
- biogem : Implements the training stages of GeneMarkS-2
- gmhmmp2 : Implements the prediction stages of GeneMarkS-2
- compp : Used for checking for convergence by comparing consecutive prediction files
See the INSTALL file for more detail.
To run GeneMarkS-2, execute the Perl script 'gms2.pl'. Invoking 'perl gms2.pl' on its own prints the usage message showing all possible input parameters (see below). GeneMarkS-2 with its default parameters can be run by:
perl gms2.pl --seq sequence.fasta --genome-type TYPE --output OUT
where 'sequence.fasta' is the FASTA file containing the sequence and TYPE is bacteria, archaea or auto (automatic detection of the domain).
Usage: gms2.pl --seq SEQ --genome-type TYPE
Basic Options:
  --seq               File containing genome sequence in FASTA format
  --genome-type       Type of genome: archaea, bacteria, auto (default: auto)
  --gcode             Genetic code (default: auto. Supported: 11, 4, 25 and 15)
  --output            Name of output file (default: gms2.lst)
  --format            Format of output file (default: lst)
  --ext               Name of file with external information in GFF format (PLUS mode of GMS2)
  --fnn               Name of output file that will hold nucleotide sequences of predicted genes
  --faa               Name of output file that will hold protein sequences of predicted genes
  --gid               Change gene ID format
  --species           Name of the species to use inside the model file (default: unspecified)
  --advanced-options  Show the advanced options
Version: 1.14_1.24_lic
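For scripted runs, the basic invocation can be wrapped in a small helper. The following is a minimal Python sketch, not part of GeneMarkS-2: the function names build_gms2_command and run_gms2 are hypothetical, and the sketch assumes that 'perl' is on the PATH and that 'gms2.pl' is in the current directory. Only options listed in the usage message above are used.

```python
import subprocess

# Genome types listed in the usage message.
GENOME_TYPES = {"archaea", "bacteria", "auto"}


def build_gms2_command(seq_path, genome_type="auto", output="gms2.lst", fmt=None):
    """Build the argument list for a basic gms2.pl run (hypothetical helper)."""
    if genome_type not in GENOME_TYPES:
        raise ValueError("genome type must be one of: " + ", ".join(sorted(GENOME_TYPES)))
    cmd = [
        "perl", "gms2.pl",
        "--seq", seq_path,
        "--genome-type", genome_type,
        "--output", output,
    ]
    if fmt is not None:
        # Passed through unchanged; this sketch does not validate format names.
        cmd += ["--format", fmt]
    return cmd


def run_gms2(seq_path, **kwargs):
    """Run gms2.pl and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_gms2_command(seq_path, **kwargs), check=True)
```

For example, run_gms2("sequence.fasta", genome_type="bacteria", output="OUT") corresponds to the command line shown earlier with TYPE set to bacteria.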
GeneMarkS-2 uses GeneMark.hmm-2 as a core gene finder. Final output is generated by GeneMark.hmm-2.
Coordinates of predicted genes can be saved in GFF, GTF, GFF3 and LST formats.
LST is a custom human-readable format developed at GaTech for GeneMark.hmm. LST is the default output format in GeneMark.hmm-2.
The GFF, GTF and GFF3 formats were developed for, and have been widely used in, the description of genes in eukaryotic species. They are not yet widely adopted for describing genes in prokaryotic species. Almost all prokaryotic gene finders use custom formats by default and also support some variant of the GFF format with gene-finder-specific modifications.
GTF and GFF3 are derived from the original GFF format. The GFF, GTF and GFF3 formats use similar first 8 mandatory columns.
Deviations from the standard in the first 8 columns in GeneMark.hmm-2:
- Incomplete CDSs can be present in genomes due to gaps in the sequence assembly or linearization of a circular chromosome. Most frequently, incomplete CDSs are found at the beginning or at the end of a contig. Incomplete CDSs predicted by GeneMark.hmm-2 always start and end with a full codon. Thus, all predicted CDSs in GFF* formats have phase zero. For example, these three lines describe an incomplete gene on the direct (plus) strand shifted by 0, 1 and 2 nucleotides:
seq GeneMark.hmm2 CDS 1 474 24.07 + 0 ... partial 10 ...
seq GeneMark.hmm2 CDS 2 475 24.07 + 0 ... partial 10 ...
seq GeneMark.hmm2 CDS 3 476 24.07 + 0 ... partial 10 ...
- Incomplete CDSs can be predicted inside a sequence. Assembly gaps, represented by long stretches of the letter 'N' in the assembly, can lead to incomplete CDS structures inside the sequence. GeneMark.hmm-2 can predict such incomplete genes. For example:
seq GeneMark.hmm2 CDS 11 472 23.7 + 0 ... partial 11 ...
The same rule of starting and ending at a full codon applies to incomplete internal genes.
- CDS coordinates include the position of the stop codon in GFF and GFF3 formats. A stop codon is absent only when the CDS is incomplete at its 3' end. The GTF format specification excludes the stop codon from CDS coordinates. GeneMark.hmm-2 deviates from the GTF standard and always includes the stop codon in CDS coordinates.
- The score of a CDS feature is the log-odds score for the CDS in GeneMark.hmm-2 (not a P or E value).
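The full-codon rule for incomplete CDSs can be checked directly from the coordinates. The following Python sketch is illustrative only; the function names are hypothetical, and it relies on GFF coordinates being 1-based and inclusive. It uses the start and end values from the four example lines above.

```python
def cds_length(start, end):
    """Length in nucleotides of a feature with 1-based inclusive coordinates."""
    return end - start + 1


def is_codon_aligned(start, end):
    """True if the feature spans a whole number of codons."""
    return cds_length(start, end) % 3 == 0


# (start, end) pairs taken from the example lines above.
examples = [(1, 474), (2, 475), (3, 476), (11, 472)]

for start, end in examples:
    print(start, end, cds_length(start, end), is_codon_aligned(start, end))
```

All four examples span a whole number of codons (474 or 462 nucleotides), which is why phase zero is reported even though the genes are incomplete.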
Column 9 in GFF, GTF and GFF3
In GFF: column 9 was optional in the original GFF format.
In GeneMark.hmm-2, column 9 in GFF is formatted using the following rules:
- Key and value are separated by a space
- Key/value pairs are separated by a semicolon ';'
- The order of key/value pairs is arbitrary
For example: seq GeneMark.hmm2 CDS 1 474 24.07 + 0 gene_id 1; partial 10; gene_type atypical; gc 45; length 474;
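The column 9 rules above can be turned into a small parser. This is a minimal Python sketch under the stated rules only (space between key and value, semicolon between pairs, arbitrary order); the function name parse_attributes is hypothetical, and all values are kept as strings.

```python
def parse_attributes(column9):
    """Parse a GeneMark.hmm-2 GFF column 9 string into a dict of strings."""
    attrs = {}
    for field in column9.split(";"):
        field = field.strip()
        if not field:
            # Skip the empty piece after the trailing semicolon.
            continue
        parts = field.split(None, 1)  # split key from value on first whitespace
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        attrs[key] = value
    return attrs


attrs = parse_attributes("gene_id 1; partial 10; gene_type atypical;")
print(attrs["gene_id"], attrs["partial"], attrs["gene_type"])
```

Because the order of pairs is arbitrary and new keys may appear, looking values up by key in a dict is safer than relying on position.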
The following keys/values are currently supported:
-
gene_id number;
In GeneMarkS-2, the gene ID is an integer that starts at 1 and is incremented by 1 across all the contigs in the input file.
-
partial label;
The GFF* formats have no dedicated rule for labeling incomplete genes. Thus, information about the incomplete status of a CDS is stored with the key 'partial' and one of the values '01', '10' or '11'. The value indicates the side on which the CDS is incomplete:
- '01' incomplete from right
- '10' incomplete from left
- '11' incomplete from both sides
-
Attention:
- The order of the key/value pairs in column 9 (attribute) is arbitrary and may change between versions.
- Additional key/value pairs can be introduced in new versions of the code.
- Formatting rules for a CDS split by linearization of a circular chromosome were not specified in the described formats. These rules may be introduced in new versions of the code.
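Downstream tools that expect GTF-style CDS coordinates (stop codon excluded) may need to adjust GeneMark.hmm-2 output. The following Python sketch combines the 'partial' labels with the stop codon rule described earlier. It is not part of GeneMarkS-2; the function names are hypothetical. It assumes that 'left' and 'right' in the 'partial' value refer to positions on the sequence, so that the 3' end of a minus-strand CDS is its left side. Verify this assumption against real output before relying on it.

```python
def decode_partial(label):
    """Return (incomplete_left, incomplete_right) for a 'partial' value."""
    if label not in ("01", "10", "11"):
        raise ValueError("unexpected partial label: %r" % (label,))
    # First digit: left side; second digit: right side.
    return label[0] == "1", label[1] == "1"


def exclude_stop_codon(start, end, strand, partial=None):
    """Return (start, end) without the stop codon, GTF style.

    ASSUMPTION: left/right are sequence coordinates, so the 3' end is the
    right side on '+' and the left side on '-'.
    """
    left, right = decode_partial(partial) if partial else (False, False)
    if strand == "+":
        # No stop codon is present when the 3' (right) end is incomplete.
        return (start, end) if right else (start, end - 3)
    if strand == "-":
        # No stop codon is present when the 3' (left) end is incomplete.
        return (start, end) if left else (start + 3, end)
    raise ValueError("strand must be '+' or '-'")


# First example line above: plus strand, incomplete on the left only.
print(exclude_stop_codon(1, 474, "+", "10"))
```

A CDS that is incomplete at its 3' end is returned unchanged, since no stop codon is included in its coordinates.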