Skip to content

Commit

Permalink
add snakemake
Browse files Browse the repository at this point in the history
  • Loading branch information
HuaZou committed Sep 1, 2019
1 parent d0c1663 commit fbaaab5
Show file tree
Hide file tree
Showing 5 changed files with 262 additions and 2 deletions.
7 changes: 6 additions & 1 deletion Gemfile.lock
Original file line number Diff line number Diff line change
Expand Up @@ -23,10 +23,12 @@ GEM
http_parser.rb (~> 0.6.0)
ethon (0.12.0)
ffi (>= 1.3.0)
eventmachine (1.2.7)
eventmachine (1.2.7-x64-mingw32)
execjs (2.7.0)
faraday (0.15.4)
multipart-post (>= 1.2, < 3)
ffi (1.11.1)
ffi (1.11.1-x64-mingw32)
forwardable-extended (2.6.0)
gemoji (3.0.1)
Expand Down Expand Up @@ -205,6 +207,8 @@ GEM
jekyll-seo-tag (~> 2.1)
minitest (5.11.3)
multipart-post (2.1.1)
nokogiri (1.10.4)
mini_portile2 (~> 2.4.0)
nokogiri (1.10.4-x64-mingw32)
mini_portile2 (~> 2.4.0)
octokit (4.14.0)
Expand Down Expand Up @@ -240,11 +244,12 @@ GEM
unicode-display_width (1.6.0)

PLATFORMS
ruby
x64-mingw32

DEPENDENCIES
github-pages
rdiscount

BUNDLED WITH
1.16.1
2.0.2
5 changes: 4 additions & 1 deletion _posts/2019-08-30-Program-R10.md
Original file line number Diff line number Diff line change
Expand Up @@ -209,6 +209,8 @@ ggplot(data_m, aes(x = sampleID, y = ID)) +
#### 相关图

```R
library(corrplot)

corrplot(corr = cor(dat[1:7]), order = "AOE", type = "upper", tl.pos = "d")
corrplot(corr = cor(dat[1:7]), add = TRUE, type = "lower", method = "number",
order = "AOE", diag = FALSE, tl.pos = "n", cl.pos = "n")
Expand Down Expand Up @@ -258,6 +260,7 @@ dat %>% slice(dat.cln$ID_unit) %>%

```R
library(VennDiagram)

A <- sample(LETTERS, 18, replace = FALSE)
B <- sample(LETTERS, 18, replace = FALSE)
C <- sample(LETTERS, 18, replace = FALSE)
Expand Down Expand Up @@ -318,10 +321,10 @@ upset(movies, main.bar.color = "black",
library(ggplot2)

data <- read.table(choose.files(),header = TRUE)

data$color <- ifelse(data$padj<0.05 & abs(data$log2FoldChange)>= 1,
                     ifelse(data$log2FoldChange > 1,'red','blue'),'gray')
color <- c(red = "red",gray = "gray",blue = "blue")

ggplot(data, aes(log2FoldChange-log10(padj), col = color)) +
  geom_point() +
  theme_bw() +
Expand Down
252 changes: 252 additions & 0 deletions _posts/2019-09-01-Tools-snakemake.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,252 @@
---
layout: post
title: "snakemake:编写任务流程的工具"
date: 2019-09-01
updated: 2019-09-01
categories: 生信工具
tags: 分析软件
---

### What's Snakemake?

**Snakemake**是一种可重复和标准化的编写任务流程的工具。该流程工具基于友好的python语言,并且可以跨平台运行。生信分析领域经常会构建分析流程,通常我们会用snakemake语言构建。除此之外,**[Broad institude](https://www.broadinstitute.org/)**(布罗德研究所:麻省理工学院和哈佛大学的Eli和Edythe L. Broad研究所,通常被称为Broad研究所,是一个位于美国马萨诸塞州剑桥的生物医学和基因组研究中心)提出的**[workflow description language](https://software.broadinstitute.org/wdl/)**,这也可用于生信标准流程开发。

> The Snakemake workflow management system is a tool to create **reproducible and scalable** data analyses. Workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.
```python
rule targets:
input:
"plots/dataset1.pdf",
"plots/dataset2.pdf"

rule plot:
input:
"raw/{dataset}.csv"
output:
"plots/{dataset}.pdf"
shell:
"somecommand {input} {output}"
```

1. snakemake基于python语法定义处理输入文件的rule,每个rule对输入文件处理后得到输出文件;
2. 输入输出文件可以通过**通配符**匹配,从而进行多数据处理;
3. rule不仅仅包含shell命令,也可以是python或者其他处理输入输出文件的编程语言;
4. 安装snakemake软件最好通过**conda**(python3版本)安装:*conda install -c bioconda snakemake*



### snakemake的规则

#### **rule**及其命令

1. **rule**是snakefile的基本组成,每个rule一般包含input、output和shell三部分。其中比较特殊的是rule all命令,*它一般被用来定义一个流程的最后输出结果*

```python
rule all:
input:"path/to/inputfile", "path/to/other/inputfile"

rule NAME:
#多个输入输出可用逗号链接
input: "path/to/inputfile", "path/to/other/inputfile"
output: "path/to/outputfile", somename = "path/to/another/outputfile"
shell: "somecommand {input} {output}"
```



2. **wildcards**:获得匹配输入数据名字部分的通配符。

```python
rule complex_conversion:
input:
"{dataset}/inputfile"
output:
"{dataset}/file.{group}.txt"
shell:
"somecommand --group {wildcards.group} < {input} > {output}"
# dataset,group是两通配符,只要文件名符合该格式就会被处理。
```



3. **threads**:指定某个rule的分配给程序的线程数目。

4. **resources**:指定程序所需要的内存大小。

5. **expand**:扩展函数,可以将输入或者输出结果存成list。

6. **message**:运行rule时,输出在终端的信息。

7. **log**:用来生成程序运行的日志文件。

8. **Priority**:指定程序运行的优先性。

9. **params**:程序运行的参数。

10. **shell**:指定运行命令。

11. **run**:指定运行python命令。

12. **scripts**:指定运行的脚本。

13. **temp****protected**:temp类型中间文件会在结果输出后删除;protected类文件会保留下来。

14. **directory**:指定输出结果是目录。

```python
rule NAME:
input:
"path/to/inputfile"
output:
directory("path/to/outputdir")
shell:
"somecommand {input} {output}"
```



#### 配置文件

配置文件有两种类型:json和yaml。每次运行需要修改的文件或者参数等均可以写入配置文件,然后在snakefile内读入文件。

```yaml
samples: "Sample_path.tsv"

params:
left_primer: 29

results:
fastqc: "result/00.quality/fastqc"

logs: "result/logs"

# snakefile --> configfile: "config.yaml"
```



### 执行snakemake

#### 参数含义

```markdown
--dryrun, -n 不执行命令,用于debug
--printshellcmds, -p 在屏幕输出每个rule的shell命令
--reason, -r 输出每条rule执行的原因,默认FALSE
--cores, --jobs, -j 指定运行的核数,若不指定,则使用最大的核数
--force, -f 重新运行第一条rule或指定的rule
--forceall, -F 重新运行所有的rule,不管是否已经有输出结果
--forcerun, -R 重新执行Snakefile,当更新了rule时候使用此命令
--snakefile, -s 指定Snakefile,否则是当前目录下的Snakefile
```

#### 流程可视化命令

```shell
snakemake --dag result/06.taxonomy/taxonomy.silva.cluster.txt | dot -Tpdf > workflow.pdf
```

#### 集群投递

```markdown
snakemake --cluster "qsub -V -cwd -q 节点队列" -j 10
# --cluster /-c CMD: 集群运行指令
# qusb -V -cwd -q, 表示输出当前环境变量(-V),在当前目录下运行(-cwd), 投递到指定的队列(-q), 如果不指定则使用任何可用队列
# --local-cores N: 在每个集群中最多并行N核
# --cluster-config/-u FILE: 集群配置文件
```



### 转录组cufflinks分析案例

```python
# path to track and reference
TRACK = 'hg19.gtf'
REF = 'hg19.fa'


# sample names and classes
CLASS1 = '101 102'.split()
CLASS2 = '103 104'.split()
SAMPLES = CLASS1 + CLASS2


# path to bam files
CLASS1_BAM = expand('mapped/{sample}.bam', sample=CLASS1)
CLASS2_BAM = expand('mapped/{sample}.bam', sample=CLASS2)


rule all:
input:
'diffexp/isoform_exp.diff',
'assembly/comparison'


rule assembly:
input:
'mapped/{sample}.bam'
output:
'assembly/{sample}/transcripts.gtf',
dir='assembly/{sample}'
threads: 4
shell:
'cufflinks --num-threads {threads} -o {output.dir} '
'--frag-bias-correct {REF} {input}'


rule compose_merge:
input:
expand('assembly/{sample}/transcripts.gtf', sample=SAMPLES)
output:
txt='assembly/assemblies.txt'
run:
with open(output.txt, 'w') as out:
print(*input, sep="\n", file=out)


rule merge_assemblies:
input:
'assembly/assemblies.txt'
output:
'assembly/merged/merged.gtf', dir='assembly/merged'
shell:
'cuffmerge -o {output.dir} -s {REF} {input}'


rule compare_assemblies:
input:
'assembly/merged/merged.gtf'
output:
'assembly/comparison/all.stats',
dir='assembly/comparison'
shell:
'cuffcompare -o {output.dir}all -s {REF} -r {TRACK} {input}'


rule diffexp:
input:
class1=CLASS1_BAM,
class2=CLASS2_BAM,
gtf='assembly/merged/merged.gtf'
output:
'diffexp/gene_exp.diff', 'diffexp/isoform_exp.diff'
params:
class1=",".join(CLASS1_BAM),
class2=",".join(CLASS2_BAM)
threads: 8
shell:
'cuffdiff --num-threads {threads} {gtf} {params.class1} {params.class2}'
```

![](https://raw.githubusercontent.com/HuaZou/HuaZou.github.io/master/_posts/img/snakemake_cufflinks.png)



### Reference

1. [snakemake](https://snakemake.readthedocs.io/en/latest/index.html)
2. [snakemake使用笔记](https://www.jianshu.com/p/14b9eccc0c0e)


参考文章如引起任何侵权问题,可以与我[联系](https://github.com/HuaZou/),谢谢。
Binary file added _posts/img/R.pie.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added _posts/img/snakemake_cufflinks.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit fbaaab5

Please sign in to comment.