is a script to determine MLST sequence types for E. coli genomes and extract allele sequences.
- Synopsis
- Description
- Usage
- Options
- Output
- Run environment
- Author - contact
- Citation, installation, and license
- Changelog
perl -a fas -g fasta
The script searches for multilocus sequence type (MLST) alleles in E. coli genomes according to Mark Achtman's scheme with seven housekeeping genes (adk, fumC, gyrB, icd, mdh, purA, and recA) [Wirth et al., 2006]. NUCmer from the MUMmer package is used to compare the given allele sequences to the bacterial genomes via nucleotide alignments.
Download the allele files (adk.fas ...) and the sequence type file ('publicSTs.txt') from this website:
To run
include all E. coli genome files (file
extension e.g. 'fasta'), all allele sequence files (file extension
'fas') and 'publicSTs.txt' in the current working directory. The
allele profiles are parsed from the created *.coord files and written
to a result file, plus additional information from the file
'publicSTs.txt'. Also, the corresponding allele sequences (obtained
from the allele input files) are concatenated for each E. coli genome
into a result multi-fasta file. Option -c can be used to initiate
an alignment for this multi-fasta file with ClustalW (standard
alignment parameters; has to be in the $PATH
or change variable
). The alignment fasta output file can be used
directly for RAxML. Careful: the Phylip alignment format from
ClustalW allows only 10 characters per strain ID.
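Because of the 10-character limit, strain IDs that share their first 10 characters may become indistinguishable in the alignment. The following is a minimal sketch for spotting such clashes before running with -c. It is not part of the script; the function name and the example IDs are hypothetical, and it assumes that longer IDs are simply cut after 10 characters.

```python
from collections import defaultdict

PHYLIP_ID_WIDTH = 10  # limit stated above; assumed to be applied by plain truncation


def find_truncation_collisions(strain_ids, width=PHYLIP_ID_WIDTH):
    """Group strain IDs that become identical once cut to `width` characters."""
    groups = defaultdict(list)
    for strain_id in strain_ids:
        groups[strain_id[:width]].append(strain_id)
    # Only prefixes shared by more than one ID are a problem.
    return {prefix: ids for prefix, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    # Hypothetical strain IDs for illustration only.
    ids = ["Sakai", "55989", "UPEC_isolate_01", "UPEC_isolate_02"]
    for prefix, clashing in find_truncation_collisions(ids).items():
        print(f"{prefix!r} is shared by: {', '.join(clashing)}")
```

If clashes are reported, shortening the IDs in the genome fasta headers before the run is probably the simplest fix.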
works with complete and draft genomes. However, each genome needs its own input file; several genomes cannot be combined in a single input file!
Results can only be obtained for genomes whose allele sequences have
been deposited in Achtman's allele database. If an
allele is not found in a genome, it is marked by a '?' in the result
profile file and a placeholder 'XXX' in the result fasta file. For
these cases a manual NUCmer or BLASTN search might be useful to fill the
gaps and
to get the corresponding 'new' allele
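The handling of missing alleles can be illustrated with a small sketch: a sequence type can only be looked up when all seven loci have an allele number, so any '?' leaves the ST unresolved. This is not the script's own code. The function names, the in-memory table layout, and the example values are assumptions for illustration, and the real layout of 'publicSTs.txt' may differ.

```python
GENES = ("adk", "fumC", "gyrB", "icd", "mdh", "purA", "recA")
MISSING = "?"  # marker used in the result profile file for an allele that was not found


def missing_loci(profile):
    """List the genes whose allele is marked as not found."""
    return [gene for gene in GENES if profile.get(gene, MISSING) == MISSING]


def lookup_st(profile, st_table):
    """Return the ST label for a complete profile, or None if unresolved.

    profile  -- dict mapping gene name to allele number (as str) or '?'
    st_table -- dict mapping a 7-tuple of allele numbers to an ST label
                (hypothetical layout, not the format of 'publicSTs.txt')
    """
    alleles = tuple(profile.get(gene, MISSING) for gene in GENES)
    if MISSING in alleles:
        return None  # at least one locus needs a manual NUCmer/BLASTN check
    return st_table.get(alleles)  # None as well if the combination is unknown


if __name__ == "__main__":
    # Made-up allele numbers and ST label, not real database entries.
    st_table = {("1", "2", "3", "4", "5", "6", "7"): "ST-example-1"}
    complete = dict(zip(GENES, ("1", "2", "3", "4", "5", "6", "7")))
    partial = dict(complete, purA=MISSING)
    print(lookup_st(complete, st_table))
    print(lookup_st(partial, st_table), missing_loci(partial))
```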
Non-NCBI fasta headers for the genome files must have a unique ID directly following the '>' (e.g. 'Sakai', '55989' ...).
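A quick way to check this requirement before a run is sketched below. It is an illustration under assumptions, not the script's own parsing: the ID is taken to be the first whitespace-delimited token after '>', NCBI-style headers are not treated specially, and the function names are hypothetical.

```python
def header_id(header_line):
    """Return the token directly following '>', or None if there is none."""
    if not header_line.startswith(">"):
        return None
    rest = header_line[1:]
    if not rest or rest[0].isspace():
        return None  # whitespace right after '>' means no ID directly follows
    return rest.split()[0]


def duplicated_ids(headers_by_file):
    """Map each ID used by more than one genome file to those file names.

    headers_by_file -- dict mapping a file name to its first header line
    """
    seen = {}
    for filename, header in headers_by_file.items():
        seen.setdefault(header_id(header), []).append(filename)
    return {
        ident: files
        for ident, files in seen.items()
        if ident is not None and len(files) > 1
    }


if __name__ == "__main__":
    # Hypothetical file names and headers for illustration only.
    headers = {
        "a.fasta": ">Sakai chromosome",
        "b.fasta": ">55989",
        "c.fasta": ">Sakai plasmid",
    }
    print(duplicated_ids(headers))
```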
perl -a fas -g fasta -c
-a, -alleles
File extension of the MLST allele fasta files, e.g. 'fas' (<=> -g).
-g, -genomes
File extension of the E. coli genome fasta files, e.g. 'fasta' (<=> -a).
-h, -help
Help (perldoc POD)
-c, -clustalw
Call ClustalW for alignment
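Since -a and -g only take file extensions and all input is expected in the current working directory, it can help to list what a run would pick up beforehand. The sketch below is a hypothetical pre-flight helper, not part of the script. It assumes plain `*.<extension>` matching and reads the 9 MB genome size cutoff mentioned in the changelog as 9 * 1024 * 1024 bytes.

```python
from pathlib import Path

MAX_GENOME_BYTES = 9 * 1024 * 1024  # assumption: "9 MB" interpreted as 9 MiB


def preflight(directory, allele_ext="fas", genome_ext="fasta",
              max_bytes=MAX_GENOME_BYTES):
    """Summarise which input files a run in `directory` would find."""
    root = Path(directory)
    genomes = sorted(root.glob(f"*.{genome_ext}"))
    return {
        # Allele files, matched by the extension given to -a.
        "alleles": sorted(p.name for p in root.glob(f"*.{allele_ext}")),
        # Genome files (extension given to -g) within the size cutoff.
        "genomes": [p.name for p in genomes if p.stat().st_size <= max_bytes],
        # Genome files above the cutoff, which would not be considered.
        "too_big": [p.name for p in genomes if p.stat().st_size > max_bytes],
        # The sequence type file has a fixed name.
        "has_st_file": (root / "publicSTs.txt").is_file(),
    }


if __name__ == "__main__":
    for key, value in preflight(".").items():
        print(f"{key}: {value}")
```

Seven allele files and `has_st_file: True` would be the expected picture for the scheme described above.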
Tab-separated allele profiles for the E. coli genomes, plus additional info from 'publicSTs.txt'
Multi-fasta file of all concatenated allele sequences for each genome
Text files that contain the coordinates of the NUCmer hits for each genome and allele
Error file, summarizing number of not found alleles or unclear NUCmer hits
Optional, ClustalW alignment in Phylip format
Optional, ClustalW alignment guide tree
The Perl script runs only on UNIX-like systems.
Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
For citation, installation, and license information please see the repository main
- v0.3 (30.01.2013)
- additional info in POD
- check if result files already exist and ask user what to do
- changed script name from
- v0.2 (20.10.2012)
- included a POD
- options with Getopt::Long
- skip input E. coli genome query files that are too big (cutoff set at 9 MB for an E. coli fasta file)
- draft E. coli genomes can now be used as input query files
- additional info in 'publicSTs.txt' now associated with the found STs in the output
- print to STDOUT which files were created
- new option -c to align the resulting allele sequences via ClustalW
- v0.1 (25.10.2011)