Skip to content
/ JAMS Public

Quantitative modeling of sequence, DNA accessibility, and methylation preferences of transcription factors using ChIP-seq data.

License

Notifications You must be signed in to change notification settings

csglab/JAMS

Repository files navigation

JAMS: Joint Accessibility-Methylation-Sequence models

Quantitative modeling of sequence, DNA accessibility, and methylation preferences of transcription factors using ChIP-seq data.

JAMS creates models for inferring the effect of CpG methylation on TF binding in vivo. It creates quantitative models that connect the strength of the binding signal observed in ChIP-seq to the DNA accessibility of the binding site, regional methylation level, DNA sequence, and base-resolution cytosine methylation.

JAMS has three main tasks (DATA, BUILD, and PREDICT). Briefly, DATA extracts the information required to build the JAMS model from BAM and BIGWIG files and creates a directory with the files required by BUILD. BUILD builds a generalized linear model as described in the Method section of JAMS' preprint. PREDICT predicts TF or background binding signal specified by a BED file, it can either predict on the first position of each genomic region or look for the best motif match.

Requirements:

After cloning, you can add the line export PATH=${DIR}/JAMS:$PATH to your .bashrc file.

Usage:

Input files can be downloaded with wget:

cd ./data/CTCF_demo/01_input_files/
bash download_HEK293_CTCF_demo_data.sh

01_JAMS_demo.sh contains an example of JAMS for CTCF in HEK293 (tasks DATA and PREDICT in the run). This should create a ./data/CTCF_demo/03_output/JAMS_CTCF_HEK293_GSM2026781_small_flanking_20bps folder.

For example:

./JAMS --task DATA BUILD \
       --experiment ${EXPERIMENT_ID} \
       --region ${PEAKS} \
       --wgbs_met_data ${METH} \
       --wgbs_unmet_data ${UNMETH} \
       --dna_acc_map ${DNA_ACC} \
       --genome_fa ${GEN_FA} \
       --pfm ${PFM} \
       --chr_sizes ${CHR_SIZES} \
       --mask_regions ${mask_regions} \
       --bam_control ${CONTROL_BAM} \
       --bam_pulldown ${PULLDOWN_BAM} \
       --tag_summit_range ${SUMMIT_RANGE} \
       --data_dir ${DATA_DIR}
       --flanking ${FLANKING} \
       --output_dir ${OUT_DIR}

ARGUMENTS:

Input files:
  • --region <PEAKS>

Peaks from ChiP-seq experiment, output from MACS "*_peaks.xls". When predict is used, the --region argument takes a bed files with columns as chr\tstart\tend\tname\tscore\tstrand\n

  • --wgbs_met_data <METH>

Methylated counts (BigWig).

  • --wgbs_unmet_data <UNMETH>

Unmethylated counts (BigWig).

  • --dna_acc_map <DNA_ACC>

Normalized DNA accessibility from ENCODE in bigwig file.

  • --bam_control <CONTROL_BAM>

Indexed sorted bam file for the ChIP-seq control.

  • --bam_pulldown <PULLDOWN_BAM>

Indexed and sorted bam file for ChIP-seq pulldown.

  • --genome_fa <GEN_FA>

Reference sequence in the FASTA format (indexed).

  • --pfm <PFM>

Position frequency matrix, Transfac format.

  • --chr_sizes <CHR_SIZES>

Chromosome sizes.

  • --mask_regions <mask_regions>

Genomic regions to be masked (repeats or/and blacklisted regions).

Input and output directories:

  • --data_dir <DATA_DIR>

The path to the folder that contains the required files to build the JAMS model.

  • --output_dir <OUT_DIR>

The path to the folder that will contain the output files.

Other Arguments:

  • --experiment <EXPERIMENT_ID>

Experiment ID. ex: CTCF_rep2_HEK293.

  • --tag_summit_range <SUMMIT_RANGE>

Range from the best motif match where the read tags will be extracted. Default= 400.

  • --flanking <FLANKING>

The number of base pairs around the core motif that will be included in the model.

Output:

An example of JAMS' output is provided as ./data/CTCF_demo/03_output/CTCF_HEK293_GSM2026781_hg38_from_motif_RANGE_400_neg_binomial_CpG_av_Met_methylation_flank_20.tar.gz. Next we describe the output files.

JAMS outputs the following files in the <OUT_DIR>/<EXPERIMENT_ID>_flanking_<FLANKING>bps folder:

  • <experimentName>_log.txt – Text file containing the logs of the GLM task.
  • <experimentName>_mCpG_only_model_total_cv_scatterplot.pdf – A PDF file showing the scatterplot of predicted vs. observed logarithm of tag density.
  • <experimentName>_dna_acc_vs_tags_heatmap.pdf – PDF file comparing the DNA accessibility and the observed tags of each ChIP-seq peak.
  • <experimentName>_DNA_accessibility_around_motif.pdf – Observed DNA accessibility on and around the best motif match.
  • <experimentName>_meth_vs_tags_heatmap_max_coverage.pdf, and <experimentName>_meth_vs_tags_heatmap_sensitive_coverage.pdf – Heatmap with the coverage and percentage of WGBS reads on the best motif match. max plots the coverage without the scale being cap, sensitive being capped at 60 is more informative.
  • <experimentName>_methylation_correlation_with_binding.pdf – PDF file summarizing the correlation of methylation at each position with the strength of TF binding. The statistics are presented for CpG's, non-CpG methylation events, and all methylation events combined. Note that since methylation events are locally correlated with each other, interpretation of these graphs can be difficult.
  • <experimentName>_methylation_stats.pdf – PDF file summarizing the frequency of methylation at different positions of the sequences. The statistics are presented for CpG's, non-CpG methylation events, and all methylation events combined.
  • <experimentName>_sequence_vs_tags_heatmap.pdf – DNA sequence of the motif and the observed tag counts from ChIP-seq.
  • <experimentName>_tags_vs_lenght.pdf – Scatterplot showing the tag counts from ChIP-seq and the length of the peak.
  • <experimentName>_dna_coefficients.pdf – DNA coefficients for the TF-specific and background parts of the JAMS model.
  • <experimentName>_mCpG_only_model_coefficients.txt – Text file containing the coefficients of a Negative binomial regression model that jointly considers DNA sequence, DNA accessibility and methylation at CpG sites to explain ChIP-seq tag counts.<experimentName>_mCpG_only_model_coefficients_with_FDR.txt includes the FDR.
  • <experimentName>_mCpG_only_model_control_n_pulldown_motif_logo.pdf – PDF file containing a graphical representation of the above model. The PDF essentially contains a motif logo (after scaling each position to have an average of zero), plus arrows that represent the effect of methylation: blue arrows represent methylation on the positive strand, and light brown arrows represent methylation on the reverse strand; upward arrows indicate positive and statistically significant effect on binding, and downward arrows represent negative and significant effect on binding. For both the TF-specific and background parts of the model.
  • <experimentName>_mCpG_only_model_control_motif_logo.pdf, and <experimentName>_mCpG_only_model_pulldown_motif_logo.pdf – Same as above but the TF-specific and background motifs are in a different PDF file.
  • <experimentName>_mCpG_only_model_control.motif_scaled.txt, and <experimentName>_mCpG_only_model_pulldown.motif_scaled.txt – Text file containing the exact values shown in the above PDF. Non-significant methylation effects are represented by NA's.
  • <experimentName>_mCpG_only_model_TF_binding.txt, <experimentName>_mCpG_only_model_TF_background.txt – Coefficient files that can be used by the PREDICT task.
  • <experimentName>_XX_train.tab, <experimentName>_c_train.tab, and <experimentName>_t_train.tab – the XX, c and t matrices and vectors described in the Methods section of the preprint.

Task: DATA

Only extracting data in and around the best motif match in ChIP-seq peaks.

An example is included in the 02_JAMS_demo_format_data.sh script:

./JAMS --task DATA \
       --experiment ${EXPERIMENT_ID} \
       --region ${PEAKS} \
       --wgbs_met_data ${METH} \
       --wgbs_unmet_data ${UNMETH} \
       --dna_acc_map ${DNA_ACC} \
       --genome_fa ${GEN_FA} \
       --pfm ${PWM} \
       --chr_sizes ${CHR_SIZES} \
       --mask_regions ${mask_regions} \
       --bam_control ${CONTROL_BAM} \
       --bam_pulldown ${PULLDOWN_BAM} \
       --tag_summit_range ${SUMMIT_RANGE} \
       --data_dir ${DATA_DIR}

This creates the input files to build the JAMS model:

  1. AffiMx output file: <experimentName>_affimx.affinity.txt
  2. Aligned sequence file: <experimentName>_aligned_sequences.tabulated.mx.txt
  3. Methylated read count file: <experimentName>_aligned_sequences.fasta.methylreads
  4. Non-methylated read count file: <experimentName>_aligned_sequences.fasta.nonmethylreads
  5. Normalized DNA accessibility file: <experimentName>_aligned_sequences.fasta.accessibility
  6. ChIP-seq tag count file: <experimentName>_summits.vRepeats.scores.txt
  7. PFM file for the motif that was used to align the sequences: *_pfm.txt

Task: BUILD

To build a JAMS model with the BUILD task you need to specify the directory with the files created in the DATA task.

For example:

./JAMS --task BUILD \
       --experiment ${EXPERIMENT_ID} \
       --flanking ${FLANKING} \
       --data_dir ${DATA_DIR} \
       --output_dir ${OUT_DIR}

The script 03_JAMS_demo_GLM.sh is an example of the BUILD task. Before, running extract the input files with:

gunzip ./data/CTCF_demo/02_formatted_data/small_demo/*.gz
bash 03_JAMS_demo_GLM.sh

Task: Predict

Predict TF or background signal on user specified genomic regions (BED format). It can either predict binding of the first position of each region (--motif_in_regions START) or look for the best motif match (--motif_in_regions SEARCH, if a PFM is provided). 260 high quality pretrain TF models are provided in ./data/JAMS_models/representative_JAMS_models.tar.gz. *_TF_binding.txt and *_background.txt files can be used with the --model_coeficients argument. When predict is used, the --region argument takes a bed files with columns as chr\tstart\tend\tname\tscore\tstrand\n.

For example:

./JAMS --task PREDICT \
       --motif_in_regions ${SEARCH_or_START} \
       --model_coeficients ${COEFFICIENTS} \
       --flanking 20 \
       --regions ${REGION} \
       --data_dir ${DATA_DIR} \
       --predicted_binding_prefix ${OUT_PREFIX} \
       --wgbs_met_data ${METH} --wgbs_unmet_data ${UNMETH} \
       --dna_acc_map ${DNA_ACC} --genome_fa ${GEN_FA} \
       --chr_sizes ${CHR_SIZES} --pfm ${PFM}

The script 04_JAMS_demo_predict.sh is an example of the PREDICT task.

Predict specific arguments:

  • --predicted_binding_prefix – Prefix for the output files. For example /directory/prefix.
  • --model_coeficients – Coefficients created by GLM task, files ending with *_TF_binding.txt and *_background.txt.
  • --motif_in_regionsSTART uses the first position of the bed file to predict binding. SEARCH, searches for the best motif match if a PFM is provided.
  • --region – takes a bed files with columns as chr\tstart\tend\tname\tscore\tstrand\n.

Output:

  • <OUT_PREFIX>_predicted_binding_heatmap.pdf – Heatmap with main predictors (DNA accessibility, motif sequence, and methylation) and predicted binding. Sorted by binding. If regions > 1000, the heatmap will contain 1000 regions randomly sampled.
  • <OUT_PREFIX>_predicted_binding_n_predictors.tab – Region names with predictors and predicted binding.
  • <OUT_PREFIX>_predicted_binding.tab – Region names with predicted binding.

Citation:

Hernandez-Corchado, A., Najafabadi, H.S. Toward a base-resolution panorama of the in vivo impact of cytosine methylation on transcription factor binding. Genome Biol 23, 151 (2022). https://doi.org/10.1186/s13059-022-02713-y

About

Quantitative modeling of sequence, DNA accessibility, and methylation preferences of transcription factors using ChIP-seq data.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published