Commit dc6ef5c

Initial commit of pipeline for SHIV analysis

peradecki committed Feb 6, 2023
1 parent 603ea8d commit dc6ef5c
Showing 45 changed files with 1,974 additions and 585 deletions.
200 changes: 58 additions & 142 deletions README.md
100644 → 100755
@@ -1,173 +1,89 @@
# Pacbio UMI Analysis Pipeline
This pipeline was developed in the Virus Persistence and Dynamics Section (VPDS) of the Immunology Laboratory (IL) in the Vaccine Research Center (VRC) at the National Institutes of Health (NIH).

## Requirements

This pipeline requires a Python environment. A conda environment is recommended, and a `requirements.txt` is included that lists all required modules.

### Note
You will need to adjust how the Python environment is loaded inside the scripts.

### pacbio-pipeline-with-blast.sh

1. Orients the fasta files against the reference `*.fasta`; output is written to the `oriented` folder.

2. Trims the forward and reverse PCR primer sequences using cutadapt. This step also performs insert filtering: inserts shorter than 90% of the desired insert length are discarded. The output is written to the `trimmed` folder.

3. Script `bin_by_UMI.py` gets called and generates results under the `umi_stats` folder. In this folder, `*_umi_seq.txt` gives the list of UMI sequences extracted from the data.

4. Script `extract_seq.py` gets called and extracts reads to the `sequences` folder based on their associated UMI sequence.

5. Script `cluster_bins.sh` gets called to generate a consensus sequence from each UMI bin. The output is written to the `sequences` folder.

6. Script `read_uc_file.py` gets called to produce a fasta file containing the consensus sequence with the highest read abundance from each UMI bin. The output is written to the `sequences/cluster_with_read_counts` folder. It then removes insert errors; the erroneous reads are saved in the `error_insert` folders.

7. Script `umi_cluster_reads.py` gets called to extract and save insert errors in the `error_insert` folder.

8. CCS reads obtained from steps 1 to 7 are saved in the `Tobe_blasted` folder for blasting against the reference.

9. Script `blast_nr.sh` gets called to blast the sequences in each fasta file against a local blast database. The results are saved in two places: `Tobe_blasted/Blast/*.out`, which contains the blast results, and `Tobe_blasted/cov_headers/*.txt`, which contains the headers matching the term of interest (for example, coronavirus).

10. Script `down_select_seqs.py` gets called to subset the sequences with headers from the `Tobe_blasted/cov_headers` folder for each fasta file. The output is saved in the `final_ccs_reads` folder.

11. Script `remove_fake_umis_nongen.sh` or `remove_fake_umis.sh` gets called to detect and set aside artifactual UMIs. The output is saved in `fake-umi-curation-nogen` or `fake-umi-curation`, respectively.

12. Final output is saved as a `*.fasta` file in the `final_post_curation/nongen` folder or the `final_post_curation/gen` folder, based on the choice in step 11.

## Usage

How to execute the pipeline:

1. Navigate to the pacbio-pipeline folder.

2. To run in batch execute the following command:

Run1:

scripts/batch-pacbio-pipeline-with-blast.sh <path-to-raw-fasta-files-directory> <path-to-config-script> nogen

or

Run2:

scripts/batch-pacbio-pipeline-with-blast.sh <path-to-raw-fasta-files-directory> <path-to-config-script>


where `path-to-raw-fasta-files-directory` is the folder where all the fasta files are located (there must be a `/` at the end, and the path must be absolute). `config_file` is the config script where the primer, error, and length information is defined. `gen_dist` is `nogen` to exclude genetic distance from network-based filtering, or omitted to include it.

Note: This only operates on fasta files, so if there are other files/folders in the project directory, those are unchanged.

3. To run an individual fasta file, execute the following command:

Run1:

qsub -pe round 8 scripts/pacbio-pipeline-with-blast.sh project file config_file nogen

or

Run2:

qsub -pe round 8 scripts/pacbio-pipeline-with-blast.sh project file config_file

### Example

qsub scripts/batch-pacbio-pipeline-with-blast.sh <path-to-raw-fasta-files-directory> <path-to-config-script> nogen

or

qsub scripts/batch-pacbio-pipeline-with-blast.sh <path-to-raw-fasta-files-directory> <path-to-config-script>



# Revert Mutations by Erroneous to Consensus Pipeline:

This pipeline conserves the positions of real variants detected by the variant caller and reverts the rest of the bases back to consensus. <!---This is now a pypi package see [rev-seqs](https://pypi.org/project/rev-seqs/)-->

## Components:

### make_variant_dict.py

This script reads the variant Excel file (obtained from the variant caller), turns it into a dictionary, and saves the dictionary in pickle format to be read later. The output contains all the real variants.

### revert_mutations_bam.py

This script reads either the pickle dictionary or a .vcf file, together with .bam/.bai files, and returns a .bam file reconstructed by conserving real variants and reverting the remaining bases to consensus.

### batch-revert_mutations.sh

This script reads .xlsx (or, if True, .vcf), .bam, and .bai files and saves the output in .bam format in the `reverted_mutations` folder. It then uses samtools to extract index and fasta files for each .bam file. The results are saved in the `reverted_mutations` folder.
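The samtools post-processing it describes might look like this (file name illustrative):

```bash
samtools index reverted_mutations/sample.bam
samtools fasta reverted_mutations/sample.bam > reverted_mutations/sample.fasta
```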
# PacBio UMI Analysis Pipeline

A command-line pipeline for analysis of PacBio sequencing data from libraries with unique molecular identifiers (UMIs).

Inputs to the pipeline are FASTA PacBio CCS reads and a configuration file describing necessary parameters, such as a reference file and primer sequences.

Outputs include UMI-consensus target sequences (i.e., single-copy/single-genome/single-molecule sequences) as well as intermediate outputs from various processing steps.

The pipeline assumes a single target sequence and the following read structure:

*5'***[ PCR forward primer ][ target sequence ][ RT primer ][ UMI ][ PCR reverse primer ]***3'*

The primer sequences and UMI region can be of any length. Reads need not be oriented prior to running the pipeline; the pipeline will orient the reads against the provided target sequence reference.

*This pipeline was developed by the Virus Persistence and Dynamics Section at the National Institutes of Health (NIH).*

## Requirements

This pipeline requires several command-line tools as well as a Python 3 environment.

The following command-line tools are required:
* [bcftools and samtools](https://www.htslib.org/download/)
* [BLAST+ / blastn](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download)
* [mafft](https://mafft.cbrc.jp/alignment/software/)
* [minimap2](https://github.com/lh3/minimap2)
* [seqkit](https://bioinf.shenwei.me/seqkit/)
* [cutadapt](https://cutadapt.readthedocs.io/en/stable/index.html)
* [vsearch](https://github.com/torognes/vsearch)

The following Python packages are also required:
* Biopython
* joblib
* tqdm
* networkx
* numpy
* pandas
* pysam
* scipy
* matplotlib

## Usage

How to execute:

1. Navigate to the pacbio-pipeline folder.

2. To run in batch execute the following command:

Run1:

scripts/batch-revert_mutations.sh <path-to-bam-files-directory> True

or

Run2:

scripts/batch-revert_mutations.sh <path-to-bam-files-directory>

where `path-to-bam-files-directory` is the folder where all the .bam/.bai/.vcf files are located. There must be a `/` at the end.

If `True`, it reads variant information from the variant caller's .vcf file.

If omitted, it reads variant information from an Excel (.xlsx) file.

### Example

qsub scripts/batch-revert_mutations.sh <path-to-bam-files-directory> True

or

qsub scripts/batch-revert_mutations.sh <path-to-bam-files-directory>

## Usage

The pipeline is executed by running the `pacbio-pipeline-with-blast.sh` script. The script requires a FASTA file and a configuration file providing run parameters.

Inputs are provided in the following way:

`./pacbio-pipeline-with-blast.sh $folder $file $config`, where `$folder` is the folder containing the FASTA file, `$file` is the name of the FASTA file (without the file extension), and `$config` is the path to a configuration file.

Upon completion, final single-copy sequences are placed in a folder named `sgs` within the folder containing the specified FASTA input.
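For example, to process a hypothetical `sample01.fasta` in `/data/shiv-run/` with the example SHIV Env config (both paths are illustrative):

```bash
./pacbio-pipeline-with-blast.sh /data/shiv-run/ sample01 config-scripts/SHIV_Env.sh
# Final single-copy sequences would then appear under /data/shiv-run/sgs/
```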

## Pipeline components:

## bin_by_UMI.py
This script has functions that read the trimmed files and look for the 8-base primerID sequence after the RT primer. Each primerID (UMI) with its associated read ID and sequence is written to a `*.csv` file. UMIs and their frequencies of occurrence are written to a `*.txt` file. UMI sequences with read counts above the inflection point of the UMI count distribution are preserved and written to a `*.fasta` file formatted as `>seqID_UMI`, followed by the sequence on the next line.

## extract_seq.py
This script extracts reads to the `sequences` folder based on their associated UMI sequence. It collects all the sequences with the same UMI and places them in one fasta file named `UMI_seq.fasta`.

## cluster_bins.sh
This script first uses cutadapt to trim the RT primer and the 8-base UMI sequence, and then uses `usearch -cluster_fast` to generate a consensus sequence based on 99% identity. This script generates a `trimmed` folder that holds the trimmed inserts left after removal of the RT primer and 8-base UMI sequence. It also generates a `centrioid_usearch` folder that holds the centroid sequences from usearch, and a `cluster_usearch` folder that holds the `*.uc` files generated by running usearch. These `*.uc` files are used to get the read abundance for each UMI sequence.

`pacbio-pipeline-with-blast.sh` performs a complete analysis, although the pipeline is broken down into components and scripts that may be run individually. Major steps are listed below; check the source code for additional information on the syntax of specific scripts.

### Preprocessing: Orientation and Primer Trimming
- `vsearch` is used to orient reads against the provided reference sequence.
- `cutadapt` is used to trim PCR forward and reverse primers.
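A minimal sketch of this stage, run directly with the two tools (the directory layout and the linked-adapter form of the cutadapt call are assumptions; the pipeline reads its actual parameters from the config script):

```bash
#!/bin/bash
# Illustrative preprocessing sketch; variable names follow config-scripts/SHIV_Env.sh.
source config-scripts/SHIV_Env.sh
mkdir -p oriented trimmed

# Orient reads against the reference sequence.
vsearch --orient sample01.fasta --db "$referenceFile" --fastaout oriented/sample01.fasta

# Trim the PCR primers as a linked adapter pair and length-filter the insert.
# (Assumption: the reverse-complemented reverse primer flanks the 3' end after orientation.)
cutadapt -g "$forward...$reversePrimer_RC" -e "$error" \
  --minimum-length "$minLength" --maximum-length "$maxLength" \
  -o trimmed/sample01.fasta oriented/sample01.fasta
```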

### UMI Binning (`bin_umis.py`)
- Attempts to match RT primer sequence to trimmed reads; if successful, the UMI is extracted and stored.
- The relevant section of the read can differ from the provided RT primer sequence by up to 1 base.
- UMIs with 10 or more CCS reads are kept as putative bins for consensus calling.
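The core of the binning step can be sketched with a one-liner; this toy version matches the RT primer exactly (the pipeline tolerates 1 mismatch) and assumes single-line FASTA records:

```bash
#!/bin/bash
# Tally putative UMI bins: take the 8 bases following an exact RT-primer match,
# then keep UMIs seen in at least 10 CCS reads.
RT=TATAATAAATCCCTTCCAGTCCCCCC  # RT primer from config-scripts/SHIV_Env.sh

awk -v rt="$RT" '!/^>/ { i = index($0, rt); if (i > 0) print substr($0, i + length(rt), 8) }' \
  trimmed/sample01.fasta | sort | uniq -c | awk '$1 >= 10 { print $2 "\t" $1 }' > umi_bins.txt
```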

## read_uc_file.py
This script reads the `*.uc` files generated by running usearch, calculates the read abundance associated with each consensus sequence, and writes the results to the `cluster_stats` folder. The script then takes the consensus with the highest read abundance and writes it to the `cluster_with_read_counts` folder. This folder holds a final fasta file that can be used for multiple sequence alignment. It then removes insert errors based on a set of criteria; the erroneous reads are saved in the `error_insert` folders. CCS reads are saved in the `final_ccs_reads` folder.

### UMI Consensus Determination
- `vsearch` is used to cluster reads within bins; only bins with 1 coherent cluster (largest cluster is at least 2x larger than the next cluster *and* largest cluster has at least 10 reads) are kept.
- Reads are mapped with `minimap2` to the bin consensus determined by `vsearch`.
- `bcftools` is used to call variants from the read mapping and determine the final consensus sequence.
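For a single bin, the consensus step might look roughly like this (file names are placeholders, and the exact options the pipeline passes to vsearch and bcftools may differ):

```bash
#!/bin/bash
# Illustrative per-bin consensus sketch.
vsearch --cluster_fast bin_ACGTACGT.fasta --id 0.99 \
  --centroids centroid.fasta --uc bin.uc --sizeout

# Map the bin's reads back to the centroid.
samtools faidx centroid.fasta
minimap2 -a centroid.fasta bin_ACGTACGT.fasta | samtools sort -o bin.bam
samtools index bin.bam

# Call variants against the centroid and apply them to get the final consensus.
bcftools mpileup -f centroid.fasta bin.bam | bcftools call -mv -Oz -o calls.vcf.gz
bcftools index calls.vcf.gz
bcftools consensus -f centroid.fasta calls.vcf.gz > consensus.fasta
```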

## umi_cluster_reads.py
This script gets called to extract and save possible insert errors in the `error_insert` folders.

### Sequence Filtering
- Bin consensus sequences are compared to the reference sequence with `blastn`; only sequences that are similar to the reference are kept.
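A sketch of this filter using the example SHIV database and search term from `config-scripts/SHIV_Env.sh` (the tabular output format and file names are assumptions):

```bash
#!/bin/bash
# Build a local BLAST database once, then keep only sequences whose hits mention the target.
makeblastdb -in shiv_db/SHIV-AD8-EO_full-length.fasta -dbtype nucl
blastn -query consensus.fasta -db shiv_db/SHIV-AD8-EO_full-length.fasta \
  -outfmt "6 qseqid stitle" -out blast_hits.out
grep -i "SHIV" blast_hits.out | cut -f1 | sort -u > keep_ids.txt
seqkit grep -f keep_ids.txt consensus.fasta > filtered.fasta
```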

## check_umi_counts.sh
This script checks whether the number of unique UMIs in the final output matches the number of UMIs in the final reads plus `insert_error`. If there is a mismatch, it saves the result as a `*.txt` file in the `project` folder; otherwise it prints nothing.

### Fake UMI Removal
- UMIs that are 1-base edit distance away from each other are compared; if one bin is greater than 2x the size of the other, the smaller bin is deemed fake. Its read count is added to the larger bin, but the reads are not re-processed.
- PCR errors for each bin are compiled; error positions are used as fingerprints to associate bins with each other. UMIs with 1-base edit distance that share fingerprints are merged.
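The comparison at the heart of the first rule can be illustrated with a toy sketch (this is not the pipeline's implementation, and the counts are made up):

```bash
#!/bin/bash
# Toy illustration: UMIs within 1 mismatch are compared, and the smaller
# bin is absorbed when the larger bin is more than twice its size.
hamming() {
  local a=$1 b=$2 d=0
  for ((i = 0; i < ${#a}; i++)); do
    [[ ${a:i:1} != "${b:i:1}" ]] && ((d++))
  done
  echo "$d"
}

umi_a=ACGTACGT; count_a=120
umi_b=ACGTACGA; count_b=7

if (( $(hamming "$umi_a" "$umi_b") <= 1 && count_a > 2 * count_b )); then
  echo "$umi_b deemed fake; its $count_b reads are credited to $umi_a"
fi
```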

## remove_fake_umis_nogen.sh
This script uses two functions. `umi_dedup_nogen.py` locates and sets aside fake UMIs based on an edit distance of 1 and the a_n > 2*n_b - 1 criterion; the fake UMIs are saved in the `fake-umi-curation-nogen` folder. `inflection_removal.py` removes low-count UMIs whose counts fall below the inflection point or knee point; passing `True` selects the inflection point, omitting it selects the knee point. Final output is saved in the `final_post_curation/nongen` folder as `*.fasta` files.

### PCR Error Correction
- PCR error positions (ambiguous bases in a bin) are checked and reverted to the consensus if the position is highly conserved in the sample (>90% identity) and one of the conflicting nucleotides matches the sample consensus.

## remove_fake_umis.sh
This script uses three functions. `umi_dedup.py` locates and sets aside fake UMIs based on a genetic distance of 0, an edit distance of 1, and the a_n > 2*n_b - 1 criterion; the fake UMIs are saved in the `fake-umi-curation` folder. `inflection_removal.py` removes low-count UMIs whose counts fall below the inflection point or knee point; passing `True` selects the inflection point, omitting it selects the knee point. `down_select_post_cure.py` takes the `final_ccs_reads` from before fake-UMI curation, extracts their headers, matches them against the headers in the `final_post_curation/gen` files, and saves the matched, unaligned final output in the `final_unaligned_post_cur` folder. Final output is saved in the `final_post_curation/gen` folder as `*.fasta` files.

## blast_nr.sh
This script takes each fasta file from the `Tobe_blasted` folder and blasts it against a local blast database. The output for each fasta file is written to a `.out` file containing information about each sequence and what it matched; these files are saved in the `Tobe_blasted/Blast` folder. Following this step, for each `.out` file, `grep` is used to extract the headers of sequences containing the term of interest ("coronavirus" in the case of CoV2) and save them as a `*.txt` file in the `Tobe_blasted/cov_headers` folder.

### Final Alignment
- Polished consensus sequences are written to the `$folder/sgs/` folder. Sequences are provided in 3 schemes: intra-sample aligned, aligned against the reference, and unaligned.
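The alignments could be produced with `mafft` along these lines (file names are illustrative):

```bash
#!/bin/bash
# Intra-sample alignment of the final single-copy sequences.
mafft --auto sgs/sample01.fasta > sgs/sample01_aligned.fasta

# Alignment against the reference: --add aligns the sequences to it.
mafft --add sgs/sample01.fasta reference-files/SHIV-AD8-EO_env.fasta \
  > sgs/sample01_ref_aligned.fasta
```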

## down_select_seqs.py
This script uses the header files from the `Tobe_blasted/cov_headers` folder to subset sequences and writes them as a fasta file in the `final_ccs_reads` folder.

## Citing UMI-pacbio-pipeline
If you use UMI-pacbio-pipeline in your work, please cite:

## Citation

If you use this pipeline in your work, please cite:

[Ko SH, Bayat Mokhtari E, Mudvari P, Stein S, Stringham CD, et al. (2021) High-throughput, single-copy sequencing reveals SARS-CoV-2 spike variants coincident with mounting humoral immunity during acute COVID-19. PLOS Pathogens 17(4): e1009431.](https://doi.org/10.1371/journal.ppat.1009431)

19 changes: 19 additions & 0 deletions config-scripts/SHIV_Env.sh
@@ -0,0 +1,19 @@
#!/bin/bash

forward=GAGCAGAAGACAGTGGCAATGA
forwardPrimer_RC=TCATTGCCACTGTCTTCTGCTC

reverse=CCCGCGTGGCCTCCTGAATTAT
reversePrimer_RC=ATAATTCAGGAGGCCACGCGGG

RTprimer=TATAATAAATCCCTTCCAGTCCCCCC
RTprimer_RC=GGGGGGACTGGAAGGGATTTATTATA

error=0.1
minLength=2800
maxLength=4000

referenceFile="reference-files/SHIV-AD8-EO_env.fasta"
dbFile="shiv_db/SHIV-AD8-EO_full-length.fasta"
term="SHIV"
umiLength=8
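Config scripts like this are plain bash variable definitions, so the pipeline scripts can load them with `source`; a minimal check (illustrative):

```bash
source config-scripts/SHIV_Env.sh
echo "Forward primer: $forward (insert ${minLength}-${maxLength} nt, UMI ${umiLength} nt)"
```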
5 changes: 3 additions & 2 deletions requirements.txt
@@ -1,8 +1,9 @@
biopython
Biopython
joblib
kneebow
tqdm
networkx
numpy
pandas
pysam
scipy
matplotlib
Empty file modified scripts/Count_stats.sh
100644 → 100755
Empty file.
1 change: 1 addition & 0 deletions scripts/_version.py
@@ -0,0 +1 @@
__version__ = "1"
Empty file modified scripts/bam2fasta.sh
100644 → 100755
Empty file.
Empty file modified scripts/batch-mafft.sh
100644 → 100755
Empty file.
50 changes: 25 additions & 25 deletions scripts/batch-pacbio-pipeline-with-blast.sh
100644 → 100755
@@ -1,44 +1,44 @@
#!/bin/sh

# Job Name
#$ -N batch-pipeline-with-blast
#$ -N pbptdev

# Execute the script from the current working directory
#$ -cwd

# Merge the output of the script, and any error messages generated to one file
#$ -j y

#$ -S /bin/bash

# Send the output of the script to a directory called 'UGE-output' under the current working directory (cwd)
if [ ! -d "UGE-output" ]; then # Create output directory in case it does NOT exist
mkdir UGE-output
fi
#$ -o UGE-output/
#$ -o /hpcdata/vrc_vpds/radeckipe/dev/umipbp-dev/job-outputs/

# Send mail when the job is submitted, and when the job completes
#$ -m be
#$ -pe threaded 16
#$ -M [email protected]

# Tell the job your memory requirements
#$ -l h_vmem=2G

# This batch processes all the fasta files in a single directory
# To run, navigate to the folder containing the batch script, then execute one of the following commands:
# qsub batch-pipeline.sh project config_file
# ./batch-pacbio-pipeline-with-blast.sh project config_file
# project: the folder where all the fasta files are. There must be a / at the end.
# config_file: This is the config script that loads all the variables.
# Config files are located in the folder config-scipts.
# Example config files are located in the folder config-scripts.
# This allows change of primers and read length without editing code.
# This only operates on fasta files, so if there are other files/folders in the project directory, those are unchanged.

project=$1
config_file=$2
gen_dist=$3

echo "Project: " $project > $project/project-log.txt
echo "Configuration file: " $config_file >> $project/project-log.txt
echo "Genetic distance: " $gen_dist >> $project/project-log.txt

module purge
module load uge
module load usearch
module load "BLAST+"
module load mafft
module load Anaconda3/2020.07
module load minimap2
module load samtools/1.13-GCC-4.8.4
module load bcftools/1.13-GCC-4.8.4
module load seqkit
export PATH=$PATH:/hpcdata/vrc_vpds/scripts/vsearch/vsearch-2.21.1-linux-x86_64-static/bin


ls $project | grep "fasta\$" | rev | cut -d "." -f2- | rev | while read -r file;
do
echo $file
qsub -pe round 8 scripts/pacbio-pipeline-with-blast.sh $project $file $config_file $gen_dist
module refresh
. /hpcdata/vrc_vpds/scripts/umipbp-dev/scripts/pacbio-pipeline-with-blast.sh $project $file $config_file
done
Empty file modified scripts/batch-revert_mutations.sh
100644 → 100755
Empty file.
24 changes: 24 additions & 0 deletions scripts/batch-umipbp.sh
@@ -0,0 +1,24 @@
#!/bin/bash

project=$1
config_file=$2
gen_dist=$3
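# project: folder containing the input fasta files (must end with /)
# config_file: config script defining primers, lengths, and reference/database paths
# gen_dist: optional; "nogen" excludes genetic distance from network-based filtering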

# echo "Project: " $project > $project/log.txt
# echo "Configuration file: " $config_file >> $project/log.txt
# echo "Genetic distance: " $gen_dist >> $project/log.txt

# module unload all
# module load usearch
# module load "BLAST+"
# module load mafft
# module load Anaconda3/2020.07

# # Activate Conda environment with cutadapt, Biopython, joblib, etc. (see requirements.txt) - do not use LOCUS system cutadapt
# source activate pbpenv

ls $project | grep "fasta\$" | rev | cut -d "." -f2- | rev | while read -r file;
do
echo $file
./scripts/pacbio-pipeline-with-blast.sh $project $file $config_file $gen_dist
done
