Skip to content

PacBio Jasmine pipeline, penalized regression, and DMR identification

Notifications You must be signed in to change notification settings

aliciasillers/Preliminary-CpG-Data-Analysis

Repository files navigation

Prelimiary CpG Methylation Data

Introduction

This repository contains the scripts for processing and analyzing some preliminary data for my work on plant aging, first by following a workflow to extract CpG methylation data from PacBio HiFi subreads, then by identifying age-associated cpg sites and differentially methylated regions.

Conda Environment Creation

Environment for PacBio CpG workflow without snakemake

conda env create -n my.bio3 pbccs pbjasmine pbmm2

#or using the my.bio3.yml file in this repo
conda env create -n my.bio3 -f my.bio3.yml

Environment for PacBio CpG workflow with snakemake

conda env create -n snakebio3 pbccs pbjasmine pbmm2 snakemake

#or using the snakebio3.yml file in this repo
conda env create -n snakebio3 -f snakebio2.yml

Environment for cpg site analysis in r

conda env create -n regr -f regr.yml

Environment for dmr analysis in r

conda env create -n rmethyl -f rmethyl.yml

Part 1: PacBio subreads to CpG data workflow

This workflow takes PacBio subreads as input and produces a .bed file and a .bw file that give the probability of methylation of each CpG site across the genome. The .bw file can be viewed in IGV or another genome browser. The .bed file will be used for further analysis in this project. To run a sample through this workflow, the scripts in this repo must be run in the following order with commandline input:

  1. ccs.sh SUBREADSFILE SAMPLENAME
  2. jasmine.sh HIFIFILE SAMPLENAME
  3. align.sh 5MCFILE SAMPLENAME
  4. cpgtools.sh ALIGNEDFILE SAMPLENAME

The merge.sh script was used in between steps 2 and 3 for some samples that had both subreads and already existing 5mc.hifi.bam files.

The output .bed file will have the following tab-delimited columns of data:

  1. reference name
  2. start coordinate
  3. end coordinate
  4. modification probability
  5. haplotype
  6. coverage
  7. estimated modified site count (extrapolated from model modification probability)
  8. estimated unmodified site count (extrapolated from model modification probability)
  9. discretized modification probability (calculated from estimated mod/unmod site counts)

Part 2: Analysis Two Ways

Identifying Age-Associated CpG Sites (Epigenetic Clock)

Filtering

Because sequence coverage differs among samples, the data needs to be filtered to limit the numbers of sites that are represented in more samples than others. The first filtering script (filter1.sh) in this repository removes cpg sites that are covered in less than 30% of samples. The second (filter2.sh) returns a list of samples that have less than 10% of the maximum remaining coverage of cpg sites. It does not automatically remove the samples from the data table so that low coverage samples can be manually picked to include or exclude based on other factors. After this filtering has been completed, any remaining NA values are imputed with the column mean using the impute script (impute.sh).

Regression

The R package glmnet is used to perform penalized regression. regression.R takes the output from the imputation done in the filtering step and returns a table of CpG sites with the strongest associations with age. Run using regression.sh

Identifying Differentially Methylated Regions

Formating

MethylKit is an R program that can identify DMRs with high sensitivity. It requires input files to be in a specific format. The methylformat.sh script takes the bed file for each sample and creates a tab-separated file for that sample in the correct format for use with MethylKit.

Filtering

The first step is to filter CpG sites that have low coverage among the samples. The dmrfilter.sh script runs an R script that reads each formatted file and filters out CpG sites that are covered in less than half of the samples.

DMR identification with MethylKit

Run DMR identification with getdmrs.sh

Visualization

The previous script will automatically output bedgraph files that can be visualized in genome viewers such as IGV. It will also output a tab separated file with a list of CpG sites in DMRs. This file can be used in the next script (in progress) which will visualize changes in CpG sites among samples.

About

PacBio Jasmine pipeline, penalized regression, and DMR identification

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published