The PacBio .m5 alignment format is used by several assembly tools such as pbdagcon and sparc, yet only blasr can output it. SAM/BAM is still the standard output format for alignment. This script converts SAM/BAM to m5, connecting general PacBio alignment results with downstream analysis software that needs m5 input.
Usage: python3 bam2m5.py <in.bam> <ref.fa> <score_scheme> <out.m5>
- in.bam
  input BAM file; it should be sorted by coordinate for efficiency.
  Note: because read coordinates are used in the m5 file, a BAM file generated by blasr should be produced with the -clipping (soft|hard) parameter.
- ref.fa
  reference FASTA file.
  The <ref.fa>.fai index is needed. Either you already have one, or the program will build one for you, in which case write permission to the directory containing <ref.fa> is required.
- score_scheme
  - scoring parameters used for the alignment, in the format match,mismatch,gap_open,gap_extend.
  - e.g. -5,6,10,0 means a score of -5 for a match, 6 for a mismatch, 10 for a gap open and 0 for each base of gap extension.
  - the default score schemes of these programs:
    - blasr -sam: -5,6,10,0
    - blasr -m 5: -5,6,0,5
    - bwa mem -x pacbio: 1,-1,-1,-1
  - note the opposite signs of the scores: in the blasr schemes a match is negative and penalties are positive, while in the bwa scheme a match is positive and penalties are negative.
  - if the score field of the m5 file is used by downstream analysis, you may choose the blasr -m 5 scheme to stay compatible with blasr -m 5 results, regardless of which score scheme the alignment software actually used.
- out.m5
  - output m5 file
Example: python3 bam2m5.py align.sorted.bam ref.fa -5,6,0,5 align.sorted.m5
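To make the score_scheme argument more concrete, here is a minimal sketch of how a string such as -5,6,0,5 could be parsed and applied to a pair of gapped alignment strings. It is not taken from bam2m5.py: the names ScoreScheme, parse_score_scheme and score_alignment are hypothetical, and the gap model (gap_open charged once per gap run, gap_extend charged for every gap base) is an assumption that may differ from what the script or a given aligner actually does.

```python
from collections import namedtuple

# Hypothetical container for the four comma-separated values.
ScoreScheme = namedtuple("ScoreScheme", "match mismatch gap_open gap_extend")


def parse_score_scheme(text):
    """Parse 'match,mismatch,gap_open,gap_extend' into a ScoreScheme."""
    parts = text.split(",")
    if len(parts) != 4:
        raise ValueError("expected 4 comma-separated integers, got %r" % text)
    return ScoreScheme(*(int(p) for p in parts))


def score_alignment(query_aln, target_aln, scheme):
    """Score two gapped strings of equal length, column by column.

    Assumption: each gap run costs gap_open once, plus gap_extend
    for every base in the run ('-' marks a gap).
    """
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have the same length")
    score = 0
    gap_side = None  # 'q' or 't' while inside a gap run, else None
    for q, t in zip(query_aln, target_aln):
        if q == "-" or t == "-":
            side = "q" if q == "-" else "t"
            if side != gap_side:
                score += scheme.gap_open  # a new gap run starts here
            score += scheme.gap_extend
            gap_side = side
        else:
            gap_side = None
            score += scheme.match if q == t else scheme.mismatch
    return score


scheme = parse_score_scheme("-5,6,0,5")
print(score_alignment("ACGT-A", "AC-TTA", scheme))  # prints -10
```

With the blasr -m 5 scheme, the four matches contribute -20 and the two single-base gaps contribute 5 each, giving -10. Under a blasr-style scheme a lower score means a better alignment, whereas under the bwa-style scheme 1,-1,-1,-1 a higher score is better.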
- Python >= 3.0:
  - this script is written in Python 3; Python 2 support may be added later
- BioUtil >= 0.2:
  - Python package that handles reading BAM and FASTA files. Use pip3 install BioUtil to install this package.
- cython >= 0.24 (optional):
  - used to speed up the code
- download the archive, unzip it, and change into the resulting directory
- if you want cython support (recommended), run python3 cython_build.py
- (optional) run cd test; bash run_test.sh to test the installation (blasr is needed)
- now you can invoke bam2m5.py
Yu XU, [email protected]
These scripts are released under the GPL2 license.