HanfeiBu/GeneMarkS-2-Training

 
 

Note: This repo contains the code for the training phase of GeneMarkS-2 only. To download and use the complete program, please visit topaz.gatech.edu

GeneMarkS-2

Article Name: Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes.

Authors: Alex Lomsadze^, Karl Gemayel^, Shiyuyun Tang and Mark Borodovsky

      ^ joint first authors

Affiliation: Georgia Institute of Technology

Group Website: topaz.gatech.edu

PubMed: www.ncbi.nlm.nih.gov/pubmed/29773659/

Install

Structure: GeneMarkS-2 is made up of four components:

  • gms2.pl : Controls the entire GeneMarkS-2 algorithm
  • biogem : Implements the training stages of GeneMarkS-2
  • gmhmmp2 : Implements the prediction stages of GeneMarkS-2
  • compp : Used for checking for convergence by comparing consecutive prediction files

See the INSTALL file for more detail.

Execution

To run GeneMarkS-2, execute the Perl script 'gms2.pl' with 'perl gms2.pl'. Invoked without arguments, it prints the usage message showing all possible input parameters (see below). GeneMarkS-2 can be run with its default parameters by:

perl gms2.pl --seq sequence.fasta --genome-type TYPE --output OUT

Here 'sequence.fasta' is the FASTA file containing the genome sequence, TYPE is bacteria, archaea or auto (automatic detection of the domain), and OUT is the name of the output file.

Usage

Usage: gms2.pl --seq SEQ --genome-type TYPE
Basic Options:
--seq                                   File containing genome sequence in FASTA format
--genome-type                           Type of genome: archaea, bacteria, auto (default: auto)
--gcode                                 Genetic code (default: auto. Supported: 11, 4, 25 and 15)
--output                                Name of output file (default: gms2.lst)
--format                                Format of output file (default: lst)
--ext                                   Name of file with external information in GFF format (PLUS mode of GMS2)
--fnn                                   Name of output file that will hold nucleotide sequences of predicted genes
--faa                                   Name of output file that will hold protein sequences of predicted genes
--gid                                   Change gene ID format
--species                               Name of the species to use inside the model file (default: unspecified)
--advanced-options                      Show the advanced options

Version: 1.14_1.24_lic
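The basic options above can also be assembled programmatically. The following is a minimal Python sketch, not part of GeneMarkS-2: the function name build_gms2_command and its parameters are hypothetical, and only flags listed in the usage message are used. It assumes 'gms2.pl' is reachable at the given path and that 'perl' is on PATH.

```python
from typing import List, Optional

# Genome types listed in the usage message.
GENOME_TYPES = {"archaea", "bacteria", "auto"}


def build_gms2_command(
    seq: str,
    genome_type: str = "auto",
    output: Optional[str] = None,
    out_format: Optional[str] = None,
    fnn: Optional[str] = None,
    faa: Optional[str] = None,
    script: str = "gms2.pl",  # assumption: path to the script
) -> List[str]:
    """Build an argument list for gms2.pl (hypothetical helper)."""
    if genome_type not in GENOME_TYPES:
        raise ValueError(f"unsupported genome type: {genome_type!r}")
    cmd = ["perl", script, "--seq", seq, "--genome-type", genome_type]
    # Append optional flags only when the caller supplied a value,
    # so gms2.pl falls back to its own defaults otherwise.
    optional = (
        ("--output", output),
        ("--format", out_format),
        ("--fnn", fnn),
        ("--faa", faa),
    )
    for flag, value in optional:
        if value is not None:
            cmd += [flag, value]
    return cmd


cmd = build_gms2_command("sequence.fasta", "bacteria", output="OUT")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.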

GeneMarkS-2 Output

GeneMarkS-2 uses GeneMark.hmm-2 as its core gene finder. The final output is generated by GeneMark.hmm-2.

GeneMark.hmm-2 Output

Coordinates of predicted genes can be saved in GFF, GTF, GFF3 and LST formats.

LST is a custom human-readable format developed at Georgia Tech for GeneMark.hmm. It is the default output format of GeneMark.hmm-2.

The GFF, GTF and GFF3 formats were developed for, and have been widely used in, the description of genes in eukaryotic species. They are not yet widely adopted for describing genes in prokaryotic species. Almost all prokaryotic gene finders use custom formats by default and also support one or another variant of GFF with gene-finder-specific modifications.

GTF and GFF3 are derived from the original GFF format. All three formats share a similar set of 8 mandatory leading columns.

Deviations from the standards in the first 8 columns of GeneMark.hmm-2 output:


  • Incomplete CDSs can be present in genomes due to gaps in the sequence assembly or the linearization of a circular chromosome. Most frequently, incomplete CDSs are found at the beginning or at the end of a contig. Incomplete CDSs predicted by GeneMark.hmm-2 always start and end with a full codon. Thus, all predicted CDSs in GFF* formats have phase zero. For example, these three lines describe an incomplete gene on the direct (plus) strand shifted by 0, 1 and 2 nucleotides:

    seq GeneMark.hmm2 CDS 1 474 24.07 + 0 ... partial 10 ...
    seq GeneMark.hmm2 CDS 2 475 24.07 + 0 ... partial 10 ...
    seq GeneMark.hmm2 CDS 3 476 24.07 + 0 ... partial 10 ...


  • Incomplete CDSs can also be predicted inside a sequence. Assembly gaps, represented by long stretches of the letter 'N', can lead to incomplete CDS structures inside the sequence. GeneMark.hmm-2 can predict such incomplete genes. For example:

    seq GeneMark.hmm2 CDS 11 472 23.7 + 0 ... partial 11 ...

    The same rule of starting and ending on a full codon applies to incomplete internal genes.


  • CDS coordinates include the position of the stop codon in the GFF and GFF3 formats. The stop codon is absent only when the CDS is incomplete at its 3' end. The GTF format specification excludes the stop codon from CDS coordinates. GeneMark.hmm-2 deviates from the GTF standard and always includes the stop codon in CDS coordinates.

  • The score of a CDS feature is the log-odds score of the CDS in GeneMark.hmm-2 (not a P-value or E-value).
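The coordinate rules above can be checked with a little arithmetic. The sketch below is illustrative only and not part of GeneMark.hmm-2: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive as in GFF, and the caller must say whether the CDS actually contains a stop codon.

```python
from typing import Tuple


def cds_codon_count(start: int, end: int) -> int:
    """Return the number of full codons spanned by a CDS.

    Raises ValueError if the span is not a whole number of codons,
    which should not happen for GeneMark.hmm-2 predictions according
    to the full-codon rule described above.
    """
    length = end - start + 1  # inclusive coordinates
    if length % 3 != 0:
        raise ValueError(f"CDS {start}..{end} is not a multiple of 3")
    return length // 3


def strict_gtf_coords(
    start: int, end: int, strand: str, has_stop: bool
) -> Tuple[int, int]:
    """Exclude the stop codon, as the GTF specification expects.

    has_stop is supplied by the caller (assumption): it is False when
    the CDS is incomplete at its 3' end.
    """
    if not has_stop:
        return start, end
    if strand == "+":
        return start, end - 3  # stop codon is the last 3 nt
    if strand == "-":
        return start + 3, end  # stop codon is the first 3 nt
    raise ValueError(f"unknown strand: {strand!r}")


# The three shifted examples and the internal example from the text.
for s, e in [(1, 474), (2, 475), (3, 476), (11, 472)]:
    print(s, e, cds_codon_count(s, e))
```

A converter to strict GTF would apply strict_gtf_coords to each CDS line and leave all other columns untouched.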

    Column 9 in GFF, GTF and GFF3

    Column 9 was optional in the original GFF format.

    In GeneMark.hmm-2, column 9 of the GFF output is formatted using the following rules:

    • A key and its value are separated by a space
    • Key/value pairs are separated by a semicolon ';'
    • The order of key/value pairs is arbitrary

    For example: seq GeneMark.hmm2 CDS 1 474 24.07 + 0 gene_id 1; partial 10; gene_type atypical; gc 45; length 474;

    The following keys/values are currently supported:

    • gene_id number;

      In GeneMarkS-2, the gene ID is an integer that starts at 1 and is incremented by 1 across all contigs in the input file.

    • partial label;

      The GFF* formats have no dedicated rule for labeling incomplete genes. Thus, the incomplete status of a CDS is stored under the key 'partial' with one of the values '01', '10' or '11'. The value indicates the side on which the CDS is incomplete:

      • '01' incomplete from right
      • '10' incomplete from left
      • '11' incomplete from both sides
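The column 9 rules can be illustrated with a short parser. This is a minimal sketch under stated assumptions, not code from GeneMark.hmm-2: the function names are hypothetical, the sample string separates every pair with a semicolon as the rules require, and values are kept as strings because only 'gene_id' and 'partial' are documented here.

```python
from typing import Dict, Optional

# Meaning of the 'partial' labels, as listed above.
PARTIAL_LABELS = {
    "01": "incomplete from right",
    "10": "incomplete from left",
    "11": "incomplete from both sides",
}


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split column 9 into a key -> value dictionary.

    Pairs are separated by ';' and a key is separated from its value
    by the first space. The order of pairs does not matter.
    """
    attrs: Dict[str, str] = {}
    for pair in column9.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip the empty piece after a trailing ';'
        key, _, value = pair.partition(" ")
        attrs[key] = value.strip()
    return attrs


def describe_partial(attrs: Dict[str, str]) -> Optional[str]:
    """Return a description of the 'partial' label, or None if absent."""
    label = attrs.get("partial")
    if label is None:
        return None
    if label not in PARTIAL_LABELS:
        raise ValueError(f"unexpected partial label: {label!r}")
    return PARTIAL_LABELS[label]


attrs = parse_attributes(
    "gene_id 1; partial 10; gene_type atypical; gc 45; length 474;"
)
print(attrs["gene_id"], describe_partial(attrs))
```

Looking keys up by name rather than by position keeps such a parser robust to reordered or newly added key/value pairs.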

Attention:

  • The order of the key/value pairs in column 9 (attribute) is arbitrary and may change between versions.
  • Additional key/value pairs can be introduced in new versions of the code.
  • Formatting rules for a CDS split by the linearization of a circular chromosome were not specified in the described formats. Such rules may be introduced in new versions of the code.

About

Project for Prof. Gleichsner

Languages

  • C++ 86.0%
  • AMPL 7.0%
  • Perl 4.6%
  • Python 2.0%
  • Shell 0.3%
  • Makefile 0.1%