-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathtoyfish.Rmd
298 lines (186 loc) · 14.6 KB
/
toyfish.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
---
title: "Toyfish - a quick start guide with an example dataset"
output:
github_document:
toc: true
toc_depth: 4
---
## Dataset overview

(Image credit: [NOAA](https://www.fisheries.noaa.gov/west-coast/science-data/southern-california-shelf-rockfish-hook-and-line-survey))
To help users quickly learn and test loco-pipe on their own, we provide a heavily subsetted lcWGS dataset with which the pipeline can be run in a few minutes. This dataset that we call "toyfish" is part of a larger rockfish dataset that the Sudmant lab has generated together with several collaborators. It is composed of 30 rockfish individuals from a pair of sister taxa: sunset rockfish (Sebastes crocotulus) and vermilion rockfish (Sebastes miniatus). Each species has 15 specimens. Within vermilion, there are three populations that are genetically distinct: vermilion_1, vermilion_2 and vermilion_3; each population has 5 specimens.
As shown in the picture above, these two species are very similar in their phenotype and even experts cannot accurately to tell them apart. It is even harder to distinguish one vermilion population from another based only on visual cues. So hopefully, in addition to teaching users about loco-pipe, this example dataset will be a good demonstration of the power of lcWGS in characterizing population structure in natural populations.
Raw sequencing data were generated from an Illumina Novaseq sequencer, and we performed quality filtering and sequence alignment with [grenepipe](https://github.com/moiexpositoalonsolab/grenepipe). For reference genome, we used the vermilion genome generated by [Kolora et al. 2021](https://www.science.org/doi/full/10.1126/science.abg5332). Rockfish has 24 chromosomes; however, with the aim of having a small test dataset, we selected a 3Mbp region from chromosomes 15 and 16 (from 17Mbp to 20Mbp), and further subsetted their read depth to 40% of the original data. These bam files are stored in `toyfish/bams`. The subsetted reference genome sequence and its index file are stored in `toyfish/reference`
## Setting up the pipeline
1. Install [mamba](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html) or conda (https://docs.conda.io/projects/conda/en/stable/user-guide/install/index.html) if you have not already done so. A fresh install of mamba with miniforge (https://github.com/conda-forge/miniforge) is highly recommended because mamba is much faster than conda. It is ok if you prefer to use conda though; just replace all occurrences of `mamba` with `conda` in the code below.
2. Download `loco-pipe` from GitHub (e.g. using `git clone https://github.com/sudmantlab/loco-pipe.git`). We recommend you to download it to a folder where you store your software programs. We will refer to the full path of the directory that contains the `loco-pipe` folder as `SOFTWARE_DIR`.
3. Create the `loco-pipe` conda environment using mamba by running `mamba env create -f $SOFTWARE_DIR/loco-pipe/workflow/envs/loco-pipe.yaml` (replace $SOFTWARE_DIR with a real path).
4. (Optional) If you would like to run PCA with the software [PCAngsd](https://github.com/Rosemeis/pcangsd) using loco-pipe, you **must** install PCAngsd manually as it is not yet available on conda. Please install it to a conda environment named `pcangsd_lcpipe` using the script below. Even if you already have PCAngsd installed on your machine, you will need to run the following code to ensure that the version is compatible.
```{bash eval=FALSE}
# first set your working directory to a folder where you store your software programs
cd $SOFTWARE_DIR # replace $SOFTWARE_DIR with a real path
# download PCAngsd from Github
git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd
# check out the version the loco-pipe is based on
git checkout f90d41f7b9b245481781ae319c4a174376e5f471
# create an environment for PCAngsd
mamba env create -f $SOFTWARE_DIR/loco-pipe/workflow/envs/pcangsd.yaml
# activate the conda environment
conda activate pcangsd_lcpipe
# build PCAngsd
python setup.py build_ext --inplace
pip3 install -e .
# deactivate the conda environment
conda deactivate
```
5. (Optional) If you would like to run local PCA with the [lostruct](https://github.com/petrelharp/local_pca) package in R using loco-pipe, you **must** install lostruct (in addition to PCAngsd, see above) manually as it is not yet available on conda. Please install it to a conda environment named `lostruct_lcpipe` using the script below.
```{bash eval=FALSE}
# create a conda environment named lostruct_lcpipe and install R and some key R packages
mamba env create -f $SOFTWARE_DIR/loco-pipe/workflow/envs/lostruct.yaml
# activate the lostruct conda environment
conda activate lostruct_lcpipe
# launch R
R
# install lostruct
devtools::install_github("petrelharp/local_pca/lostruct", ref = "93ad59309151e44d2d3d0d7748cdc92f6121f564")
# quit R
q()
# deactivate the conda environment
conda deactivate
```
Note: depending on your system, you may need to ensure that lostruct is properly installed to the `lostruct_lcpipe` environment with something like the following
```{r eval=FALSE}
withr::with_libpaths(new = "/path/to/conda/envs/lostruct_lcpipe/lib/R/library",
devtools::install_github("petrelharp/local_pca/lostruct"))
```
## Preparing the project directory and required input files
The sequence alignment files, reference genome, sample table, chromosome table, and pipeline configuration file for toyfish are already set up for you. However, we still need to change some file paths to match the ones on your computer.
1. Define the environmental variable `SOFTWARE_DIR`. Replace `/path/to` with the path of the directory that contains `loco-pipe`
```{bash, eval = FALSE}
SOFTWARE_DIR=/path/to
```
2. Run the shell script `toyfish/prepare.sh` to update file paths in the sample table and the pipeline configuration file
```{bash, eval = FALSE}
bash $SOFTWARE_DIR/loco-pipe/toyfish/prepare.sh
```
3. If you want to run `toyfish` on a computer cluster, also edit the `cluster_config.yaml` file under `workflow/profiles` to make sure that your cluster settings are specified correctly.
## Launching the pipeline
1. Activate the `loco-pipe` environment with
```{bash, eval = FALSE}
conda activate loco-pipe
```
2. Plot the pipeline flowchart
```{bash, eval = FALSE}
mkdir -p $SOFTWARE_DIR/loco-pipe/toyfish/figures/flowchart
snakemake -n --forceall --rulegraph \
--directory $SOFTWARE_DIR/loco-pipe/toyfish \
--snakefile $SOFTWARE_DIR/loco-pipe/workflow/pipelines/loco-pipe.smk | \
dot -Tpdf > $SOFTWARE_DIR/loco-pipe/toyfish/figures/flowchart/toyfish.pdf
```
3. Launch the pipeline
On a single machine (change `--cores` to the number of cores that you have available):
```{bash, eval = FALSE}
snakemake \
--use-conda \
--conda-frontend mamba \
--directory $SOFTWARE_DIR/loco-pipe/toyfish \
--rerun-triggers mtime \
--scheduler greedy \
--printshellcmds \
--snakefile $SOFTWARE_DIR/loco-pipe/workflow/pipelines/loco-pipe.smk \
--cores 1 -n
```
On a computer cluster:
```{bash, eval = FALSE}
snakemake \
--use-conda \
--conda-frontend mamba \
--directory $SOFTWARE_DIR/loco-pipe/toyfish \
--rerun-triggers mtime \
--scheduler greedy \
--printshellcmds \
--snakefile $SOFTWARE_DIR/loco-pipe/workflow/pipelines/loco-pipe.smk \
--profile $SOFTWARE_DIR/loco-pipe/workflow/profiles/slurm \
--default-resources mem_mb=None disk_mb=None -n
```
Note that the ending `-n` flag means it is a dry run. If the dry run goes through, remove `-n` to actually run the pipeline.
## Example output
Below are some figures that are automatically plotted by loco-pipe. After running it with the toyfish dataset, you should see these figures recreated in the `toyfish/figures` folder.
#### Pipeline flowchart
[flow chart](toyfish/figures/flowchart/toyfish.pdf)
<br>
#### Depth distribution and choice of depth filters

A density plot showing the empirical sequencing depth distribution (in black), and the fitted truncated normal distribution (in blue). The mean and standard deviation of the fitted distribution are shown on the top corner, and the chosen minimum and maximum depth filters are represented by the red lines.
<br>
#### PCA
###### All samples combined

A PCA plot showing the top eight PC axes with all samples combined. Each point is an individual, and they are colored by the population that they belong to. A thinned SNP list is used for this analysis.
<br>
###### Vermilion only

A PCA plot showing the top eight PC axes with the vermilion samples only. Each point is an individual, and they are colored by the population that they belong to. A thinned SNP list is used for this analysis.
<br>
#### Admixture analysis
###### All samples combined

An admixture plot with all samples combined and K=2-7 source populations. Each bar is an individual, and each colors represent a source population. A thinned SNP list is used for this analysis.
<br>
###### Vermilion only

An admixture plot with the vermilion samples only and K=2-4 source populations. Each bar is an individual, and each colors represent a source population. A thinned SNP list is used for this analysis.
<br>
#### Fst between sunset and vermilion
###### Per-SNP estimates

Per-SNP Fst between sunset and vermilion rockfish along the subsetted genome.
<br>
###### In 100-SNP windows

Fst in 100-SNP windows between sunset and vermilion rockfish along the subsetted genome.
<br>
###### In 10000bp windows

Fst in 10000bp windows between sunset and vermilion rockfish along the subsetted genome.
<br>
#### Site frequency spectrum
###### Sunset

###### Vermilion

#### Theta and neutrality stats in 10000bp windows
###### Sunset

Nucleotide diversity (pi), Watterson's theta, and Tajima's D in 10000bp windows along the subsetted genome in sunset rockfish.
<br>
###### Vermilion

Nucleotide diversity (pi), Watterson's theta, and Tajima's D in 10000bp windows along the subsetted genome in vermilion rockfish.
<br>
#### Heterozygosity estimates

Average heterozygosity of each sample over the subsetted genome. Each point represents a sample, and they are grouped by the population that they belong to.
<br>
#### Local PCA
Note that since this is a highly susbsetted dataset, the local PCA results are for demonstration purpose only and any signals are likely artifacts rather than reflecting biologically meaningful patterns.
###### Local PCA MDS plot when combining all chromosomes

An MDS plot showing the z-score of each 100-SNP window along the top six MDS axes. Windows with z-scores higher than 3 or lower than -3 along a certain axis are considered outliers and are colored in magenta. Outlier windows along the same axis tend to show similar PCA patterns among themselves and distinct patterns when compared to the genome-wide average. The MDS is conducted combining all chromosomes for this plot, so outlier windows along the same axis in different chromosomes tend to shown similar patterns.
<br>
###### Local PCA outlier window consensus PCA plots when combining all chromosomes

Consensus PCA plots showing the combined PCA pattern of outlier windows along each MDS axis. A consensus PCA plot with all the non-outlier windows is also generated.
<br>
###### Local PCA MDS plot when running each chromsome separately

An MDS plot showing the z-score of each 100-SNP window along the top six MDS axes. Windows with z-scores higher than 3 or lower than -3 along a certain axis are considered outliers and are colored in magenta. Outlier windows along the same axis tend to show similar PCA patterns among themselves and distinct patterns when compared to the genome-wide average. The MDS is conducted for each chromosome separately for this plot, so outlier windows along the same axis in different chromosomes don't necessarily show similar patterns.
<br>
###### Local PCA outlier window consensus PCA plots when running each chromsome separately
[consensus PCA](toyfish/figures/lostruct/global/separated.pca.pdf)
Consensus PCA plots showing the combined PCA pattern of outlier windows along each MDS axis with each chromosome analysed separately.
<br>
```{r eval=FALSE, echo=FALSE}
# One of the most important quality control steps in the analysis of lcWGS data is the establishment of proper depth filters. Highly repetitive regions in the reference genome tend to have much lower mapping rate than the rest of the genome as reads cannot map uniquely to them. On the other hand, when such repetitive/duplicated regions are collapsed to a single sequence on the reference genome, too many reads may map incorrectly to it. These regions thus violate many basic assumptions used in downstream analyses and will bias their results if not filtered out. However, there has not been a consensus method to determine the depth filters, and common practices so far have included arbitrarily picking a value or taking a certain standard deviation away from the mean/median of the empirical distribution. But since the distribution of sequencing depth tends to have an irregular shape and a long tail, taking its standard deviation is often uninformative and is not a generalizable solution. With loco-pipe, we fit a normal distribution to the bulk of the empirical distribution, excluding both tails, and use the fitted normal distribution as a guideline for depth filter determination. This method thus has the advantage of reproducibility across datasets, and is more sound statistically than existing approaches.
```