forked from zzwch/smartliu

Python package for preprocessing pipeline for smartseq2 tangversion data

XpChen79/smartliu

 
 

An in-house Command Line Interface to process tag-based scRNA-Seq data.

Installation

Dependencies

  1. Some common genome analysis tools, including but not limited to: hisat2, samtools, htseq-count, bamtools, bam2fastx, R, multiqc. See the [tools] section in the configs/mm10.config file for more information.

  2. It is highly recommended to install dependencies with mamba install your-tool-name. Mamba can manage Conda environments and saves time. If you are in mainland China, try the local Anaconda mirrors. Here is an example conda configuration (typically saved as ~/.condarc) using the mirrors of Tsinghua University.

channels:
  - defaults
show_channel_urls: true
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud

$mamba install -c bioconda fastqc cutadapt hisat2 samtools htseq bam2fastx bamtools multiqc

$mamba install -c conda-forge R=4.0 perl biopython

Some of these tools may fail to install this way; in that case, try installing them with the pip package manager.
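After installation, it can help to confirm that the tools are actually reachable on PATH before running the pipeline. The following is a minimal sketch, not part of smartliu: check_tools is a hypothetical helper, and the executable names are assumptions derived from the tool names listed above. The authoritative list is the [tools] section of configs/mm10.config.

```shell
# Hypothetical helper (not part of smartliu): report executables missing from PATH.
# Prints "missing: NAME" for each absent tool and returns non-zero if any is absent.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names below are assumed from the tools named in this README;
# adjust them to match the [tools] section of your config file.
if check_tools fastqc cutadapt hisat2 hisat2-build samtools htseq-count \
               bamtools bam2fastx multiqc R perl; then
    echo "all tools found"
else
    echo "install the missing tools before running smartliu"
fi
```

If a tool is reported missing even though its package is installed, the package may ship the executable under a different name, so check the package contents before reinstalling.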

  3. Some R packages and the Perl module JSON are required.
# For JSON.pm used in scripts/paired2single.pl
cpanm JSON
## It can also be installed with mamba if `cpanm` gives you problems
mamba install -c bioconda perl-json

# For R packages used in scripts/*.R
r_packages <- c("rjson", "rtracklayer", "mixtools", "ggplot2", "reshape2", "ggthemr", "gridExtra")
# You may use BiocManager::install(r_packages) to install them
  4. Genome and transcriptome index files built by hisat2-build.
    Reference sources: Ensembl http://www.ensembl.org/info/data/ftp/index.html and GENCODE http://www.gencodegenes.org/
  • Get the genome FASTA sequence:
    wget ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
  • Get the RefSeq gene annotation (note that this example URL points to the human GRCh38 annotation; use the file that matches your genome):
    wget ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ARCHIVE/ANNOTATION_RELEASE.107/GFF/ref_GRCh38.p2_top_level.gff3.gz
  • Alternatively, here are detailed steps for converting a local hg19 refGene table (in genePred format) to GTF.
#Download your gene set of interest for hg19. For this example, I'll use the refGene table, 
#but you can choose other gene sets, such as the knownGene table from the "UCSC Genes" track.
$rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz ./
#Unzip
$gzip -d refGene.txt.gz
#Remove the first "bin" column:
$cut -f 2- refGene.txt > refGene.input
#Convert to gtf:
$genePredToGtf file refGene.input hg19refGene.gtf
#Sort output by chromosome and coordinate
$cat hg19refGene.gtf  | sort -k1,1 -k4,4n > hg19refGene.gtf.sorted
#gff2gtf
$gffread my.gff3 -T -o transcripts.gtf
#gtf2gff
$gffread merged.gtf -o- > merged.gff3

#you may append Spike-in gtf to transcript.gtf
$cat ERCC92_RGC.gtf >> transcripts.gtf
#Given the limited computing resources of our lab server ("DELL T630", 56PC, 256GB),
#I chose HISAT2 (reported to be about 50 times faster) as the mapper instead of TopHat2.
#Build hisat2 index
#genome index
$hisat2-build -p 30 genome.fa genome
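#Add transcriptome information to the index as follows.
#(In recent HISAT2 releases these two scripts are named hisat2_extract_splice_sites.py and hisat2_extract_exons.py.)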
$extract_splice_sites.py transcripts.gtf > transcripts.ss
$extract_exons.py transcripts.gtf > transcripts.exon
$hisat2-build -p 30 --ss transcripts.ss --exon transcripts.exon genome.fa genome.trans
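Because the steps above mix sources that name chromosomes differently (Ensembl uses names such as 1, while UCSC uses chr1), it may be worth checking that every sequence name in the GTF also exists in the genome FASTA before building the index. Mismatched names can cause annotated features to be silently ignored downstream. The sketch below uses only awk; gtf_names_missing_from_fasta is a hypothetical helper, and the demo files are made up for illustration.

```shell
# Hypothetical helper: print sequence names used in a GTF (column 1)
# that do not appear as record names in an uncompressed FASTA file.
# Usage: gtf_names_missing_from_fasta genome.fa transcripts.gtf
gtf_names_missing_from_fasta() {
    awk -F '\t' '
        # First file (FASTA): remember each header name up to the first whitespace.
        FNR == NR {
            if (/^>/) { name = substr($0, 2); sub(/[ \t].*/, "", name); seen[name] = 1 }
            next
        }
        # Second file (GTF): skip comments and blank lines.
        /^#/ || NF == 0 { next }
        # Report each unknown sequence name once.
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}

# Tiny demonstration with made-up files (not real data).
printf '>1 dna:chromosome\nACGT\n>ERCC-00002\nACGT\n' > demo_genome.fa
printf 'chr1\tdemo\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n' > demo.gtf
printf 'ERCC-00002\tdemo\texon\t1\t4\t.\t+\t.\tgene_id "s1";\n' >> demo.gtf
gtf_names_missing_from_fasta demo_genome.fa demo.gtf   # prints: chr1
rm -f demo_genome.fa demo.gtf
```

An empty result means all GTF sequence names were found in the FASTA. Note that spike-in sequences appended to the GTF also need matching records in the genome FASTA for this check to pass.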

Install

$git clone 'Repository Address'
$cd smartliu
$pip install .

Test

Please use test_data to ensure everything is OK, e.g. smartliu -c mm10 -i test_data -o smart_mm10

Usage

Use smartliu --help to see how to get started.
