
The analysis for "CRISPR/Cas9 gene-editing of HSPCs from SCD patients has unintended consequences"


mcao0404/LongAmpseq

 
 


LongAmpseq

The analysis for "CRISPR/Cas9 gene-editing of HSPCs from SCD patients has unintended consequences". The pipelines for Illumina-based large deletion profiling (LongAmp-seq) and Nanopore-based long-read analysis are stored in the folders listed below.

Update 11/12/2020: switched the delly version from delly_v0.8.1_linux_x86_64bit to delly_v0.8.5_linux_x86_64bit.

Repository structure

Folder: Illumina
The scripts for processing LongAmp-seq data.

  • longamp_bwa_PCR.sh
    Processes raw Illumina data for PCR-based libraries and provides read-splitting patterns for grouping and visualization.
  • longamp_bwa_cellline.sh
    Processes raw Illumina data for the reporter cell line and provides read-splitting patterns for grouping and visualization.
  • bedfile.py
    Processes the read-splitting patterns output by the bash scripts and provides large deletion patterns and frequency analysis.
  • longamp_delly_calling.sh
    Performs delly variant calling.
  • longampfigures_distribution.py
    Produces the data visualizations.
  • delly
    The static (precompiled) delly binary. Please refer to the original GitHub page for further details: https://github.com/dellytools/delly

Folder: Nanopore
The scripts for processing Nanopore data (by Yilei Fu @ Treangen Lab).

Environment set up

We recommend using a virtual environment manager such as conda to set up the Python environment. The current version runs on Python 3; the following conda commands install the required packages:

conda create -n longamp python=3
conda activate longamp
conda install -c bioconda bwa samtools seqtk bedtools bcftools
conda install numpy scikit-learn pandas scipy
conda install -c conda-forge matplotlib autoconf

After setting up the environment, download the LongAmp-seq code either with git, using the commands below, or by downloading the zip file and unzipping it.

git clone https://github.com/baolab-rice/LongAmpseq.git
cd LongAmpseq/Illumina

You are then ready to process LongAmp-seq data.

LongAmp-seq analysis

Before you start, move the demultiplexed fastq files to the Illumina folder. (The standard fastq output of bcl2fastq should end with R1_001.fastq or R2_001.fastq.)
You will also need a reference genome ready for bwa alignment; the path to the reference genome in .fasta format is required for LongAmp-seq analysis.
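The naming requirement above can be checked before starting a run. The following is a minimal sketch, not part of the repository: the function name `find_read_pairs` is hypothetical, and it assumes paired-end data where every R1 file has an R2 mate with the same prefix.

```python
from pathlib import Path

# Suffixes taken from the bcl2fastq naming note above.
R1_SUFFIX = "R1_001.fastq"
R2_SUFFIX = "R2_001.fastq"

def find_read_pairs(folder):
    """Return (paired, unpaired) file names for the R1 fastq files in `folder`.

    Hypothetical helper: assumes each R1 file should have an R2 mate
    that differs only in the R1/R2 part of the suffix.
    """
    folder = Path(folder)
    paired, unpaired = [], []
    for r1 in sorted(folder.glob("*" + R1_SUFFIX)):
        # Swap the R1 suffix for the R2 suffix to get the expected mate name.
        r2 = r1.with_name(r1.name[: -len(R1_SUFFIX)] + R2_SUFFIX)
        if r2.exists():
            paired.append((r1.name, r2.name))
        else:
            unpaired.append(r1.name)
    return paired, unpaired
```

Calling `find_read_pairs(".")` from inside the Illumina folder lists the pairs that would be picked up and any R1 file whose mate is missing.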

  • Raw data processing

For processing raw Illumina fastq files (for common cell types that can use the reference genome directly):

bash longamp_bwa_PCR.sh [chromosome] [start index of long-range PCR] [index of cut site] [index of cut site +1] [end index of long-range PCR] [directory to the reference genome]

For example:

bash longamp_bwa_PCR.sh chr11 5245453 5248229 5248230 5250918 ~/Desktop/genomes/hg19/hg19.fa
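The positional arguments are related to each other, which makes a quick sanity check possible before launching the script. The sketch below is an illustration only: `check_longamp_args` is a hypothetical helper, and the checks (cut site inside the amplicon, fourth argument equal to cut site + 1) are assumptions read off the usage line, not rules enforced by the repository. For the chr11 example, the difference between the end and start indices is 5465, the same value used as the length of the long-range PCR in the visualization example.

```python
def check_longamp_args(chrom, pcr_start, cut, cut_plus_1, pcr_end):
    """Validate the coordinate arguments and return pcr_end - pcr_start.

    Hypothetical helper; the rules below are assumptions based on the
    usage line, not checks performed by longamp_bwa_PCR.sh itself.
    """
    if not chrom:
        raise ValueError("chromosome name must not be empty")
    if not (pcr_start < cut < pcr_end):
        raise ValueError("cut site must lie inside the long-range PCR amplicon")
    if cut_plus_1 != cut + 1:
        raise ValueError("fourth argument must be the cut site index + 1")
    return pcr_end - pcr_start

# Values from the chr11 example command.
span = check_longamp_args("chr11", 5245453, 5248229, 5248230, 5250918)
print(span)  # 5465
```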

The output in csv format (file name ending with "filtered_2+.csv") should look like:

Chromosome Start End Read ID Score Strand
chr11 59136655 59136956 M04808:132:000000000-CTP75:1:2107:19955:2557 60 -
chr11 59138619 59138657 M04808:132:000000000-CTP75:1:2107:19955:2557 60 -
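To illustrate how such rows can be read, the sketch below groups alignment segments by read ID and reports the unaligned gap between consecutive segments of the same read. It is a simplified illustration, not the logic of bedfile.py: the function name `gaps_by_read` is hypothetical, the rows are entered as Python tuples rather than parsed from the file, and it assumes the two rows sharing a read ID are segments of one split read.

```python
from collections import defaultdict

# The two example rows shown above, as (chrom, start, end, read_id, score, strand).
rows = [
    ("chr11", 59136655, 59136956, "M04808:132:000000000-CTP75:1:2107:19955:2557", 60, "-"),
    ("chr11", 59138619, 59138657, "M04808:132:000000000-CTP75:1:2107:19955:2557", 60, "-"),
]

def gaps_by_read(rows):
    """Map read ID -> list of (chrom, gap_start, gap_end, gap_length)."""
    segments = defaultdict(list)
    for chrom, start, end, read_id, _score, _strand in rows:
        segments[(read_id, chrom)].append((start, end))

    gaps = {}
    for (read_id, chrom), segs in segments.items():
        segs.sort()
        # Compare each segment with the next one along the chromosome.
        for (_s1, e1), (s2, _e2) in zip(segs, segs[1:]):
            if s2 > e1:
                gaps.setdefault(read_id, []).append((chrom, e1, s2, s2 - e1))
    return gaps

for read_id, read_gaps in gaps_by_read(rows).items():
    print(read_id, read_gaps)
# M04808:132:000000000-CTP75:1:2107:19955:2557 [('chr11', 59136956, 59138619, 1663)]
```

In this example the two segments are separated by 1663 bp of reference sequence, which is the kind of pattern a large deletion would produce.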

For reporter cell lines, please use

bash longamp_bwa_cellline.sh GFPBFP 1 1004 1005 9423 GFPBFP_9423bp_NEW_ref_cutsite_1004bp.fa

With the standard pipeline, three folders are created:
raw_data folder: containing original fastq files.
processing folder: containing all the intermediate files (including the "filteredsorted.bam" used for delly calling).
output folder: containing output for LongAmp-seq analysis listed as below:

filtered_2+.fastq: the collection of filtered full-length reads that split during alignment
filtered_2+.csv: the alignment pattern of split reads
filtered_1.fastq: the collection of filtered full-length reads that do NOT split during alignment

largedel_output.csv: all the read IDs containing large deletions
largedel_group.csv: large deletion patterns grouped by their start positions and lengths.

delly.csv: structural variants called by delly
delly.svg: visualization of delly output

  • Large deletion profile generation

The raw data processing scripts call the corresponding script and run it for you. To run this step yourself, use the commands below:

# for general use
python bedfile.py [the filtered_2+.csv file] [output1: largedel_output.csv] [output2: largedel_group.csv] [length of amplicon]

# or for using reporter cell line
# Step 1: count the reads that span the cut site without a large deletion:
echo $(cat [your file end with 30_filtered_1.fastq]|wc -l)/4|bc
# Step 2: run the following command
python bedfile_cellline.py [the filtered_2+.csv file] [output1: largedel_output.csv] [output2: largedel_group.csv] [length of amplicon, default=9423] [The number you got in Step 1]
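Step 1 divides the line count of the fastq file by four, because each fastq record occupies four lines. For readers who prefer Python, an equivalent sketch is shown below; `count_fastq_reads` is a hypothetical helper name, and it assumes an uncompressed, well-formed fastq file that ends with a newline.

```python
def count_fastq_reads(path):
    """Count reads in an uncompressed fastq file (four lines per record)."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    # Integer division mirrors the `wc -l` / 4 calculation piped through bc.
    return n_lines // 4
```

The returned number is the value to pass as the last argument of bedfile_cellline.py.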

After running the bedfile.py commands, you should have two csv files:
largedel_output.csv: all the read IDs containing large deletions
largedel_group.csv: large deletion patterns grouped by their start positions and lengths.
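Grouping by start position and length can be pictured with a small counting sketch. This is an illustration under assumptions, not the implementation in bedfile.py: `group_deletions` is a hypothetical function, deletions are given as (start, length) tuples, and the frequency is computed against the deletion-carrying reads plus an optional count of reads without a large deletion (similar in spirit to the number obtained in Step 1 for the reporter cell line). The actual scripts may normalize differently.

```python
from collections import Counter

def group_deletions(deletions, n_without_deletion=0):
    """Group (start, length) deletions and return (start, length, count, frequency).

    Hypothetical helper: frequency = count / (reads with a deletion
    + reads without a deletion); this denominator is an assumption.
    """
    counts = Counter(deletions)
    total = sum(counts.values()) + n_without_deletion
    if total == 0:
        return []
    return [
        (start, length, n, n / total)
        for (start, length), n in counts.most_common()
    ]

# Made-up example values, for illustration only.
example = [(100, 50), (100, 50), (200, 30)]
print(group_deletions(example, n_without_deletion=1))
# [(100, 50, 2, 0.5), (200, 30, 1, 0.25)]
```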

  • Variant calling using delly

The raw data processing scripts call the corresponding script and run it for you. To run this step yourself, first put the files ending with filteredsorted.bam in the Illumina folder, then run the command below:

# This command will process all the filtered and sorted bam files in the same directory.
bash longamp_delly_calling.sh [directory to the reference genome]

The output is a file whose name ends with filteredsorted.csv, which includes all the variants called by delly. This file is used for visualization in the next step.

  • Visualization

The raw data processing scripts call the corresponding script and run it for you. To run this step yourself, use the command below:

python longampfigures_distribution.py [the largedel_output.csv file]

You can also specify the chromosomal coordinates for the script:

python longampfigures_distribution.py [the largedel_output.csv file] [chromosome] [cut site] [length of long-range PCR]

For example:

python longampfigures_distribution.py [the largedel_output.csv file] chr11 5248229 5465

Please contact Yidan Pan ([email protected]) if you have any questions.
