SAMsift is a program for advanced filtering and tagging of SAM/BAM alignments using Python expressions.
# clone this repo and add it to PATH
git clone http://github.com/karel-brinda/samsift
cd samsift
export PATH=$(pwd)/samsift:$PATH
# filtering: keep only alignments with score >94, save them as filtered.bam
samsift -i tests/test.bam -o filtered.bam -f 'AS>94'
# filtering: keep only unaligned reads
samsift -i tests/test.bam -f 'FLAG & 0x04'
# filtering: keep only aligned reads
samsift -i tests/test.bam -f 'not(FLAG & 0x04)'
# filtering: keep only sequences containing ACCAGAGGAT
samsift -i tests/test.bam -f 'SEQ.find("ACCAGAGGAT")!=-1'
# filtering: keep only sequences containing A and T only (defined using regular expressions)
samsift -i tests/test.bam -f 're.match(r"^[AT]*$", SEQ)'
# filtering: sample alignments with 25% rate
samsift -i tests/test.bam -f 'random.random()<0.25'
# filtering: sample alignments with 25% rate with a fixed RNG seed
samsift -i tests/test.bam -f 'random.random()<0.25' -0 'random.seed(42)'
# filtering: keep only alignments of reads specified in tests/qnames.txt
samsift -i tests/test.bam -0 'q=open("tests/qnames.txt").read().splitlines()' -f 'QNAME in q'
# filtering: keep only first 5000 reads from chr1 and 5000 reads from chr2
samsift -i tests/test.bam -0 'c={"chr1":5000,"chr2":5000}' -f 'c[RNAME]>0' -c 'c[RNAME]-=1' -m nonstop-remove
# tagging: add tags 'ln' with sequence length and 'ab' with average base quality
samsift -i tests/test.bam -c 'ln=len(SEQ);ab=1.0*sum(QUALa)/ln'
# tagging: add a tag 'ii' with the number of the current alignment
samsift -i tests/test.bam -0 'i=0' -c 'i+=1;ii=i'
# updating: removing sequences and base qualities
samsift -i tests/test.bam -c 'a.query_sequence=""'
# updating: switching all reads to unaligned
samsift -i tests/test.bam -c 'a.flag|=0x4;a.reference_start=-1;a.cigarstring="";a.reference_id=-1;a.mapping_quality=0'
Using Bioconda:
# add all necessary Bioconda channels
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
# install samsift
conda install samsift
Using PIP from PyPI:
pip install --upgrade samsift
Using PIP from Github:
pip install --upgrade git+https://github.com/karel-brinda/samsift
Program: samsift (advanced filtering and tagging of SAM/BAM alignments using Python expressions) Version: 0.2.5 Author: Karel Brinda <[email protected]> Usage: samsift.py [-i FILE] [-o FILE] [-f PY_EXPR] [-c PY_CODE] [-m STR] [-0 PY_CODE] [-d PY_EXPR] [-t PY_EXPR] Basic options: -h, --help show this help message and exit -v, --version show program's version number and exit -i FILE input SAM/BAM file [-] -o FILE output SAM/BAM file [-] -f PY_EXPR filtering expression [True] -c PY_CODE code to be executed (e.g., assigning new tags) [None] -m STR mode: strict (stop on first error) nonstop-keep (keep alignments causing errors) nonstop-remove (remove alignments causing errors) [strict] Advanced options: -0 PY_CODE initialization [None] -d PY_EXPR debugging expression to print [None] -t PY_EXPR debugging trigger [True]
exec(INITIALIZATION)
for ALIGNMENT in ALIGNMENTS:
if eval(DEBUG_TRIGER):
print(eval(DEBUG_EXPR))
if eval(FILTER):
exec(CODE)
print(ALIGNMENT)
Python expressions and code. All expressions and code should be valid with
respect to Python 3. Expressions are evaluated
using the eval
function and code is executed using the exec function.
Initialization can be used for importing Python modules, setting global
variables (e.g., counters) or loading data from disk. Some modules (namely
random
and re
) are loaded without an explicit request.
Example (printing all alignments):
samsift -i tests/test.bam -f 'True'
SAM fields. Expressions and code can access variables mirroring the fields
from the alignment section of the SAM specification, i.e., QNAME
, FLAG
,
RNAME
, POS
(1-based), MAPQ
, CIGAR
, RNEXT
, PNEXT
,
TLEN
, SEQ
, and QUAL
. Several additional variables are defined to
simply accessing some useful information: QUALa
stores the base qualities
as an integer array; SEQs
, QUALs
, QUALsa
skip soft-clipped bases;
and RNAMEi
and RNEXTi
store the reference ids as integers.
Example (keeping only the alignments with leftmost position <= 10000):
samsift -i tests/test.bam -f 'POS<=10000'
SAMsift internally uses the PySam library and
the representation of the current alignment (an instance of the class
pysam.AlignedSegment) is
available as a variable a
. Therefore, the previous example is equivalent to
samsift -i tests/test.bam -f 'a.reference_start+1<=10000'
The a
variable can also be used for modifying the current alignment record.
Example (removing the sequence and the bases from every record):
samsift -i tests/test.bam -c 'a.query_sequence=""'
SAM tags. Every SAM tag is translated to a variable with the same name.
Example (removing alignments with a score smaller or equal to the sequence length):
samsift -i tests/test.bam -f 'AS>len(SEQ)'
If CODE
is provided, all two-letter variables except re
(the Python regex
module) are back-translated to tags after the code execution.
Example (adding a tag ab
carrying the average base quality):
samsift -i tests/test.bam -c 'ab=1.0*sum(QUALa)/len(QUALa)'
Errors. If an error occurs during an evalution of an expression or an
execution of a code (e.g., due to accessing an undefined tag), then SAMsift
behavior depends on the specified mode (-m
). With the strict mode (-m
strict
, default), SAMsift will immediately interrupt the computation and
report an error. With the -m nonstop-keep
option, SAMsift will continue
processing the alignments while keeping the error-causing alignments in the
output. With the -m nonstop-remove
option, all error-causing alignments
are skipped and ommited from the output.
- samtools view can filter alignments based on FLAGS, read group tags, and CIGAR strings.
- sambamba view supports, in addition to SAMtools, a filtration using simple Perl-like expressions. However, it is not possible to use floats or compare different tags.
- BamQL provides a simple query language for filtering SAM/BAM files.
- bamPals adds tags XB, XE, XP and XL.
- SamJavascript can filter alignments using JavaScript expressions.
- Picard FilterSamReads can also filter alignments using JavaScript expressions.
Please use Github issues.
See Releases.