EnsembleTR is a tool for ensemble Tandem Repeat (TR) calling. It takes one or more VCF files with TR genotypes for a panel of samples and outputs a consensus set of genotypes.
python3 setup.py install --user
Type `EnsembleTR`. You should see the help message appear.
To run EnsembleTR, use the following command:
EnsembleTR --out output.vcf \
    --ref ref.fa \
    --vcfs vcf1.vcf,vcf2.vcf,...
Required parameters:
* `--vcfs <file.vcf,[file2.vcf]>`: Comma-separated list of input VCF files
* `--ref`: Reference genome (.fa)
* `--out`: Path to output VCF file
Both zipped and unzipped VCF files are accepted as input. EnsembleTR can currently process VCF files generated by HipSTR, GangSTR, adVNTR, and ExpansionHunter.
You must input a reference genome in FASTA format. This must be the same reference build used for TR calling in input files.
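When EnsembleTR is driven from a script, the required parameters above can be assembled programmatically. The following is a minimal sketch, not part of EnsembleTR itself: `build_ensembletr_command` is a hypothetical helper, and it only uses the three flags documented here (`--out`, `--ref`, `--vcfs`).

```python
from typing import List, Sequence


def build_ensembletr_command(vcfs: Sequence[str], ref: str, out: str) -> List[str]:
    """Assemble an EnsembleTR argument list (hypothetical helper, not shipped with EnsembleTR)."""
    if not vcfs:
        raise ValueError("at least one input VCF is required")
    for path in vcfs:
        # --vcfs takes a comma-separated list, so a comma inside a file name
        # would be read as a separator between two files.
        if "," in path:
            raise ValueError(f"file name must not contain a comma: {path}")
    return ["EnsembleTR", "--out", out, "--ref", ref, "--vcfs", ",".join(vcfs)]


cmd = build_ensembletr_command(["vcf1.vcf", "vcf2.vcf"], "ref.fa", "output.vcf")
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with unusual file names.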
For more information on the VCF file format, see the VCF spec. The EnsembleTR output VCF file contains several custom fields, described below.
INFO fields contain aggregated statistics about each TR. The following custom fields are added:
FIELD | DESCRIPTION |
---|---|
START | Start position of the TR |
END | End position of the TR |
PERIOD | Length of the repeat unit |
RU | Repeat motif |
METHODS | Methods that attempted to genotype this locus (AdVNTR, EH, HipSTR, GangSTR) |
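As an illustration of reading these INFO fields, here is a minimal sketch that splits a VCF INFO column into a dictionary using only the general VCF convention of `;`-separated `KEY=VALUE` pairs. The function name `parse_info` and the example INFO string are invented for illustration and are not taken from real EnsembleTR output.

```python
from typing import Dict


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column into a key -> value mapping."""
    fields: Dict[str, str] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue  # skip empty entries and the VCF missing-value marker
        key, sep, value = item.partition("=")
        # Flag-type INFO entries have no "=value" part; store an empty string.
        fields[key] = value if sep else ""
    return fields


# Invented example value (assumption), shaped after the field table above.
example_info = "START=1000;END=1019;PERIOD=2;RU=AC"
info = parse_info(example_info)
period = int(info["PERIOD"])  # length of the repeat unit
motif = info["RU"]            # repeat motif
```

Since PERIOD is the length of the repeat unit and RU is the motif, `len(motif) == period` is a cheap consistency check on a parsed record.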
FORMAT fields contain information specific to each genotype call. The following custom fields are added:
FIELD | DESCRIPTION |
---|---|
GT | Genotype |
GB | Base pair difference from ref allele |
NCOPY | Genotype given in number of copies of the repeat motif |
EXP | Boolean showing if the genotype alleles were expanded |
SCORE | Score of the consensus call |
GTS | Method(s) that support the consensus call |
ALS | Number of times each bp difference was seen across all calls |
INPUTS | Raw calls |
SCORE is calculated by aggregating quality information from the calls that are merged at each locus.
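The FORMAT fields can be read per sample with the standard VCF layout, in which the FORMAT column lists colon-separated keys and each sample column lists the matching values. The sketch below is illustrative only: `parse_sample` and `bp_diff_to_copy_diff` are hypothetical helpers, and the example values are invented. The second function shows how a base pair difference (as in GB) relates to a difference in repeat copies (as in NCOPY) when divided by PERIOD.

```python
from typing import Dict


def parse_sample(format_col: str, sample_col: str) -> Dict[str, str]:
    """Pair FORMAT keys with one sample's values (both colon-separated)."""
    keys = format_col.split(":")
    values = sample_col.split(":")
    # zip stops at the shorter list, which tolerates dropped trailing fields.
    return dict(zip(keys, values))


def bp_diff_to_copy_diff(bp_diff: int, period: int) -> float:
    """Convert a base pair difference from the reference into repeat copies."""
    if period <= 0:
        raise ValueError("period must be a positive integer")
    return bp_diff / period


# Invented example values (assumption), not real EnsembleTR output.
call = parse_sample("GT:SCORE", "0/1:0.9")
copies_gained = bp_diff_to_copy_diff(4, 2)  # +4 bp on a dinucleotide repeat
```

A result that is not a whole number indicates that the allele differs from the reference by a partial repeat unit.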
You can use statSTR from TRTools to compute various per-locus statistics for EnsembleTR VCF files.
For example, to compute per-locus allele frequency use the following command:
statSTR --vcf EnsembleTR_file.vcf.gz \
    --vcftype hipstr \
    --afreq \
    --out EnsembleTR_per_locus_allele_frequency
Chromosome 1 VCF file and tbi file
Chromosome 2 VCF file and tbi file
Chromosome 3 VCF file and tbi file
Chromosome 4 VCF file and tbi file
Chromosome 5 VCF file and tbi file
Chromosome 6 VCF file and tbi file
Chromosome 7 VCF file and tbi file
Chromosome 8 VCF file and tbi file
Chromosome 9 VCF file and tbi file
Chromosome 10 VCF file and tbi file
Chromosome 11 VCF file and tbi file
Chromosome 12 VCF file and tbi file
Chromosome 13 VCF file and tbi file
Chromosome 14 VCF file and tbi file
Chromosome 15 VCF file and tbi file
Chromosome 16 VCF file and tbi file
Chromosome 17 VCF file and tbi file
Chromosome 18 VCF file and tbi file
Chromosome 19 VCF file and tbi file
Chromosome 20 VCF file and tbi file
Chromosome 21 VCF file and tbi file
Chromosome 22 VCF file and tbi file
Phased variants of 3,202 samples from the 1000 Genomes Project (1kGP).
TRs imputed from 3,202 1kGP samples.
Total 70,692,015 variants + 1,089,670 TR markers.
All the coordinates are based on the hg38 human reference genome.
Chromosome 1 VCF file and tbi file
Chromosome 2 VCF file and tbi file
Chromosome 3 VCF file and tbi file
Chromosome 4 VCF file and tbi file
Chromosome 5 VCF file and tbi file
Chromosome 6 VCF file and tbi file
Chromosome 7 VCF file and tbi file
Chromosome 8 VCF file and tbi file
Chromosome 9 VCF file and tbi file
Chromosome 10 VCF file and tbi file
Chromosome 11 VCF file and tbi file
Chromosome 12 VCF file and tbi file
Chromosome 13 VCF file and tbi file
Chromosome 14 VCF file and tbi file
Chromosome 15 VCF file and tbi file
Chromosome 16 VCF file and tbi file
Chromosome 17 VCF file and tbi file
Chromosome 18 VCF file and tbi file
Chromosome 19 VCF file and tbi file
Chromosome 20 VCF file and tbi file
Chromosome 21 VCF file and tbi file
Chromosome 22 VCF file and tbi file
Use Beagle to impute TRs into SNP data:
java -Xmx4g -jar beagle.version.jar \
    gt=SNPs.vcf.gz \
    ref=${chrom}_final_SNP_merged.vcf.gz \
    out=imputed_TR_SNPs
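Because the reference panel is split by chromosome, the Beagle command is typically repeated once per chromosome. The following sketch builds one command per autosome. It rests on several assumptions: `beagle_command` is a hypothetical helper, the `chr1` to `chr22` labels used for `${chrom}` are a guess at the file naming, `beagle.version.jar` is the placeholder jar name from the command above, and a per-chromosome suffix is added to the output prefix so that runs do not overwrite each other.

```python
from typing import List


def beagle_command(
    chrom: str,
    jar: str = "beagle.version.jar",      # placeholder name from the command above
    gt: str = "SNPs.vcf.gz",
    out_prefix: str = "imputed_TR_SNPs",
) -> List[str]:
    """Build the Beagle argument list for one chromosome (hypothetical helper)."""
    return [
        "java", "-Xmx4g", "-jar", jar,
        f"gt={gt}",
        f"ref={chrom}_final_SNP_merged.vcf.gz",
        # Assumption: suffix the output with the chromosome to keep runs separate.
        f"out={out_prefix}_{chrom}",
    ]


# Assumption: chromosome labels follow the "chrN" convention.
commands = [beagle_command(f"chr{n}") for n in range(1, 23)]
# To execute: import subprocess; [subprocess.run(c, check=True) for c in commands]
```

Adjust the chromosome labels and file paths to match the names of the downloaded panel files.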