David E. Larson, Travis E. Abbott and Christopher C. Harris
October 26, 2011
The purpose of this program is to identify single nucleotide positions that are different between tumor and normal (or, in theory, any two bam files). It takes a tumor bam and a normal bam and compares the two to determine the differences. It outputs a file in a format very similar to Samtools consensus format. It uses the genotype likelihood model of MAQ (as implemented in Samtools) and then calculates the probability that the tumor and normal genotypes are different. This probability is reported as a somatic score. The somatic score is the Phred-scaled probability (between 0 to 255) that the Tumor and Normal genotypes are not different where 0 means there is no probability that the genotypes are different and 255 means there is a probability of 1 − 10^(255∕−10) that the genotypes are different between tumor and normal. This is consistent with how the SAM format reports such probabilities.
There are two modes, the joint genotyping mode (-J) takes into account the fact that the tumor and normal samples are not entirely independent and also takes into account the prior probability of a somatic mutation. This probability can be scaled to control the sensitivity of the algorithm. An accurate value for this prior would be 0.000001, but this may result in a severe lack of sensitivity at lower depths. A less realistic prior probability will generate more sensitive results at the expense of an increase in the number of false positives. To get a similar sensitivity to the default mode, we recommend using a prior of 0.01. The default mode treats the two samples as if they came from two different individuals. This mode uses a less accurate mathematical model, but yields good results, especially if the normal may contain some tumor cells or the tumor is quite impure.
bam-somaticsniper [options] -f <ref.fasta> <tumor.bam> <normal.bam> <snv_output_file>
Required Option:
-f FILE REQUIRED reference sequence in the FASTA format
Options:
-q INT filtering reads with mapping quality less than INT [0]
-Q INT filtering somatic snv output with somatic quality less than INT [15]
-L FLAG do not report LOH variants as determined by genotypes
-G FLAG do not report Gain of Referene variants as determined by genotypes
-p FLAG disable priors in the somatic calculation. Increases sensitivity for solid tumors.
-J FLAG Use prior probabilities accounting for the somatic mutation rate
-s FLOAT prior probability of a somatic mutation (implies -J) [0.01]
-T FLOAT theta in maq consensus calling model (for -c/-g) [0.850000]
-N INT number of haplotypes in the sample (for -c/-g) [2]
-r FLOAT prior of a difference between two haplotypes (for -c/-g) [0.001000]
-F STRING select output format (vcf or classic) [classic]
Minimally, you must provide the program the reference fasta the bams were aligned against (passed with the -f option), a tumor bam, a normal bam, and the filename of the resulting output file. We recommend filtering out reads with a mapping quality of 0 (i.e. use -q 1) as they are typically randomly placed in the genome. We have also found that few variants with a somatic score less than 15 validate, but you may decrease the minimum score or increase it to a higher threshold (eg -Q 40). To obtain high confidence sites, we recommend also thresholding the minimum average mapping quality for the variant base to 40 for reads aligned with BWA or 70 for reads aligned with MAQ. We have not tested other aligners at this time. Disabling priors is not recommended, but may increase sensitivity at the cost of a decrease in specificity.
We recommend that you utilize both the -G
and -L
options when running in order to reduce likely false positives with little impact on sensitivity. An example command-line is below:
bam-somaticsniper -Q 40 -G -L -f reference.fa tumor.bam normal.bam output.txt
A small number of basic Perl scripts are included in the SomaticSniper package (located in src/scripts of the source code release) to aid in filtering out likely false positives. In order to get the recommended filtering you should do the following. Defaults are set assuming that BWA short is the aligner used. Other aligners have not been tested and recommendations are not available. Before proceeding you will need to obtain and compile bam-readcount (https://github.com/genome/bam-readcount). You will also need to generate a samtools pileup (not mpileup) indel file. Handling of indel containing VCFs is not implemented.
-
Filter on standard filters using the indel file. This will also remove LOH calls e.g.
perl snpfilter.pl –snp-file your_sniper_file –indel-file your_indel_pileup
-
Adapt the remainder for use with bam-readcount e.g.
perl prepare_for_readcount.pl –snp-file your_sniper_file.SNPfilter
-
Run bam-readcount (I’d recommend using the same mapping quality -q setting as you ran SomaticSniper with) e.g.
bam-readcount -b 15 -f your_ref.fasta -l your_sniper_file.SNPfilter.pos your_tumor.bam > your_readcounts.rc
Run the false positive filter e.g. perl fpfilter.pl –snp-file your_sniper_file.SNPfilter –readcount-file your_readcounts.rc -
Lastly, run the "high confidence" filter which filters based on the Somatic Score and mapping quality e.g.
perl highconfidence.pl –snp-file your_sniper_file.SNPfilter.fp_pass
Your final set of high confidence and highly filtered indels is now in the file your_sniper_file.SNPfilter.fp_pass.hc
The output by SomaticSniper consists of line for all sites whose consensus differs from the reference base. Each of the three available output formats is described below
Classic:
Each line contains the following tab-separated values:
- Chromosome
- Position
- Reference base
- IUB genotype of tumor
- IUB genotype of normal
- Somatic Score
- Tumor Consensus quality
- Tumor variant allele quality
- Tumor mean mapping quality
- Normal Consensus quality
- Normal variant allele quality
- Normal mean mapping quality
- Depth in tumor (# of reads crossing the position)
- Depth in normal (# of reads crossing the position)
- Mean base quality of reads supporting reference in tumor
- Mean mapping quality of reads supporting reference in tumor
- Depth of reads supporting reference in tumor
- Mean base quality of reads supporting variant(s) in tumor
- Mean mapping quality of reads supporting variant(s) in tumor
- Depth of reads supporting variant(s) in tumor
- Mean base quality of reads supporting reference in normal
- Mean mapping quality of reads supporting reference in normal
- Depth of reads supporting reference in normal
- Mean base quality of reads supporting variant(s) in normal
- Mean mapping quality of reads supporting variant(s) in normal
- Depth of reads supporting variant(s) in normal
VCF
VCF output from SomaticSniper conforms to version 4.1 of the VCF specification. Hence, each non-header output line contains the following fields:
- Chromosome
- Position
- ID (unused)
- Reference base
- Alternate bases (comma separated)
- Quality (unused)
- Filters (unused)
- INFO (unused)
- FORMAT specification for each sample
- NORMAL sample data
- TUMOR sample data
The following FORMAT fields will be populated for each of NORMAL and TUMOR.
ID | Number | Type | Description |
---|---|---|---|
GT | 1 | String | Genotype |
IGT | 1 | String | Genotype when called independently (only filled if called in joint prior mode) |
DP | 1 | Integer | Total read depth |
DP4 | 4 | Integer | Number of high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases |
BCOUNT | 4 | Integer | Occurrence count for each base at this site (A,C,G,T) |
GQ | 1 | Integer | Genotype quality |
JGQ | 1 | Integer | Joint genotype quality (only filled if called in joint prior mode) |
VAQ | 1 | Integer | Variant quality |
BQ | . | Integer | Average base quality of each base in the call, reported in alphabetical order (A,C,G,T) |
MQ | 1 | Integer | Average mapping quality across all reads. |
AMQ | . | Integer | Average mapping quality of each base in the call, reported in alphabetical order (A,C,G,T) |
SS | 1 | Integer | Variant status relative to non-adjacent normal: 0=wildtype, 1=germline, 2=somatic, 3=LOH, 4=unknown |
SSC | 1 | Integer | Somatic Score |
Please first search Biostar and then ask a question there if needed. We automatically monitor Biostar for questions related to our tools.