For the manuscript "MaizeCODE reveals bi-directionally expressed enhancers that harbor molecular signatures of maize domestication", the pre-release MaizeCODE v0.1.0-manuscript contains the code used to analyze the data and generate the figures, complemented with the script "MaizeCode_extra_manuscript_figures.sh" in the "scripts/manuscript" folder.
- Clone the git repository anywhere you want, for example in a new folder called projects
git clone https://github.com/martienssenlab/maize-code.git ./maize-code
or to clone a specific branch 'devel'
git clone --branch devel https://github.com/martienssenlab/maize-code.git ./maize-code
You will be prompted to input your GitHub username and password.
If you only want to update the scripts, use `git pull`. If you want to update from a specific branch, for example 'devel', use `git pull origin devel`.
- cd into the maize-code folder that has been created, so following the same example
cd ./maize-code/
- Check that the following required packages are installed and in your $PATH (the versions noted here are known to work; there are no guarantees for other versions). Installation with conda is recommended, except for kentUtils, which needs to be installed from source
For all types of data:
bedtools 2.29.2
bowtie2 64-bit 2.4.1; Compiler: gcc version 7.5.0 (crosstool-NG 1.24.0.131_87df0e6_dirty)
cutadapt 2.10
deeptools 3.5.0
fastqc 0.11.9
homer 4.11
IDR 2.0.4.2
kentUtils (bedSort, bedGraphToBigWig)
macs2 2.2.7.1
meme 5.3.0
multiqc 1.11
sra-tools 2.11.0 (if downloading data from SRA)
pigz 2.3.4
samtools 1.10 (Using htslib 1.10.2)
seqkit 0.13.2
shortstack 3.8.5
STAR 2.7.5c
wget 1.20.1
R 4.0.3 + R packages: dplyr 1.0.6, tidyr 1.1.3, ggplot2 3.3.5, cowplot 1.1.1, RColorBrewer 1.1-2, AnnotationForge 1.32.0, rrvgo 1.5.3, topGO 2.42.0, purrr 0.3.4, limma 3.46.0, edgeR 3.32.1, stringr 1.4.0, ComplexUpset 1.2.1, wesanderson 0.3.6
Histone ChIPseq:
R 4.0.3 + R libraries: dplyr 1.0.6, tidyr 1.1.3, ggplot2 3.3.5, cowplot 1.1.1, RColorBrewer 1.1-2, purrr 0.3.4, ComplexUpset 1.2.1
multiqc 1.11
pigz 2.3.4
samtools 1.10 (Using htslib 1.10.2)
bedtools 2.29.2
bowtie2 64-bit 2.4.1; Compiler: gcc version 7.5.0 (crosstool-NG 1.24.0.131_87df0e6_dirty)
sra-tools 2.11.0 (if downloading data from SRA)
fastqc 0.11.9
cutadapt 2.10
macs2 2.2.7.1
IDR 2.0.4.2
deeptools 3.5.0
RNAseq samples:
R 4.0.3 + R libraries: dplyr 1.0.6, tidyr 1.1.3, ggplot2 3.3.5, cowplot 1.1.1, RColorBrewer 1.1-2, AnnotationForge 1.32.0, rrvgo 1.5.3, topGO 2.42.0, purrr 0.3.4, limma 3.46.0, edgeR 3.32.1, stringr 1.4.0
multiqc 1.11
pigz 2.3.4
samtools 1.10 (Using htslib 1.10.2)
bedtools 2.29.2
STAR 2.7.5c
sra-tools 2.11.0 (if downloading data from SRA)
fastqc 0.11.9
cutadapt 2.10
kentUtils (bedSort, bedGraphToBigWig)
deeptools 3.5.0
RAMPAGE samples:
R 4.0.3 + R libraries: dplyr 1.0.6, tidyr 1.1.3, ggplot2 3.3.5, cowplot 1.1.1, RColorBrewer 1.1-2
multiqc 1.11
pigz 2.3.4
samtools 1.10 (Using htslib 1.10.2)
bedtools 2.29.2
STAR 2.7.5c
sra-tools 2.11.0 (if downloading data from SRA)
fastqc 0.11.9
cutadapt 2.10
kentUtils (bedSort, bedGraphToBigWig)
macs2 2.2.7.1
IDR 2.0.4.2
deeptools 3.5.0
TF ChIPseq samples:
R 4.0.3 + R libraries: dplyr 1.0.6, tidyr 1.1.3, ggplot2 3.3.5, cowplot 1.1.1, RColorBrewer 1.1-2, purrr 0.3.4, ComplexUpset 1.2.1, stringr 1.4.0
multiqc 1.11
pigz 2.3.4
samtools 1.10 (Using htslib 1.10.2)
bedtools 2.29.2
bowtie2 64-bit 2.4.1; Compiler: gcc version 7.5.0 (crosstool-NG 1.24.0.131_87df0e6_dirty)
sra-tools 2.11.0 (if downloading data from SRA)
fastqc 0.11.9
cutadapt 2.10
macs2 2.2.7.1
IDR 2.0.4.2
deeptools 3.5.0
meme 5.3.0
homer 4.11
wget 1.20.1
shRNA samples:
R 4.0.3 + R libraries: dplyr 1.0.6, tidyr 1.1.3, ggplot2 3.3.5, wesanderson 0.3.6
multiqc 1.11
pigz 2.3.4
samtools 1.10 (Using htslib 1.10.2)
bedtools 2.29.2
bowtie2 64-bit 2.4.1; Compiler: gcc version 7.5.0 (crosstool-NG 1.24.0.131_87df0e6_dirty)
sra-tools 2.11.0 (if downloading data from SRA)
fastqc 0.11.9
cutadapt 2.10
shortstack 3.8.5
deeptools 3.5.0
seqkit 0.13.2
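A quick way to confirm that the tools are reachable before submitting a run is sketched below. It only tests whether each command is found in `$PATH`; it does not compare versions. The function name `check_tools` is made up for this example, and the executable names in the demo call (for instance `fasterq-dump` for sra-tools and `bamCoverage` for deeptools) are assumptions about what the pipeline calls.

```shell
# Report which of the given command names cannot be found in $PATH.
# Prints one "missing: <name>" line per absent tool; returns 1 if any is missing.
check_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Example for histone ChIPseq (executable names are assumptions, see above).
check_tools Rscript multiqc pigz samtools bedtools bowtie2 fasterq-dump \
    fastqc cutadapt macs2 idr bamCoverage \
    || echo "install the missing tools before running the pipeline"
```

Adapt the list of names to the data types you plan to process.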
- Organize your reference genome directories so that they are all in the same main folder and that each contains ONE fasta file (.fa extension), ONE GFF file (.gff or .gff* extension) and ONE GTF file (.gtf extension).
For example, have a `genomes/` folder that contains the `genomes/B73_NAM/` directory, where you can find the `genomes/B73_NAM/B73_NAM.fa`, `genomes/B73_NAM/B73_NAM.gff` and `genomes/B73_NAM/B73_NAM.gtf` files.
Other references should be in the same `genomes/` folder, following the same pattern, i.e. `genomes/W22_v2/W22.fa`, `genomes/W22_v2/W22.gff` and `genomes/W22_v2/W22.gtf`.
The GTF file can be created from a GFF file with cufflinks `gffread -T <gff_file> -o <gtf_file>`; check that 'transcript_id' and 'gene_id' look good in the 9th column.
The GFF file should have 'gene' in the 3rd column.
All files can be gzipped (.gz extension).
- Make the samplefile you want, following the pattern and examples below. A complete example of a samplefile is in the data folder (Example_samplefile.txt). For cleaner naming purposes, use "_samplefile.txt" as a suffix. The columns are the following:
Type of data | Line | Tissue | Type of sample | Replicate ID | Sequencing ID | Path to fastq | Paired-end or single-end data | Genome reference |
---|---|---|---|---|---|---|---|---|
ChIP | B73 | roots | H3K27ac | Rep1 | SRRxxxxxx | SRA | SE | B73_NAM |
ChIP | B73 | roots | Input | Rep1 | SRRxxxxxx | SRA | SE | B73_NAM |
RNAseq | W22 | ears | RNAseq | Rep1 | S01 | /home/maize-code/RNAseq/fastqs | PE | W22_v2 |
RAMPAGE | W22 | ears | RAMPAGE | Rep1 | rampage_exp1 | /home/maize-code/RAMPAGE/fastqs | PE | W22_v2 |
TF_TB1 | B73 | leaf | IP | Rep1 | SRRxxxxxx | SRA | PE | B73_v4 |
TF_TB1 | B73 | leaf | Input | Rep1 | SRRxxxxxx | SRA | PE | B73_v4 |
shRNA | NC350 | cn | shRNA | Rep1 | cn | /home/maize-code/shRNA/fastqs | SE | NC350_NAM |
mC | B73 | roots | mC | Rep1 | SRRxxxxxxx | SRA | PE | B73_NAM |
mC | B73 | roots | Pico | Rep1 | SRRxxxxxxx | SRA | PE | B73_NAM |
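Before submitting, it can help to check that every row of a samplefile has the nine columns shown above. The sketch below assumes the samplefile is tab-delimited with no header row, which is not stated here, so compare against Example_samplefile.txt first. The function name `validate_samplefile` is hypothetical.

```shell
# Check a samplefile: assumes tab-delimited text, 9 columns, no header row.
# Prints one message per problematic line; returns 1 if any is found.
validate_samplefile() {
    awk -F '\t' '
        NF != 9 { printf "line %d: expected 9 columns, found %d\n", NR, NF; bad = 1; next }
        $8 != "PE" && $8 != "SE" { printf "line %d: column 8 must be PE or SE, found %s\n", NR, $8; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# Demo on a throwaway file built from two rows of the table above
# (the second row is deliberately missing its genome reference).
demo=$(mktemp)
printf 'ChIP\tB73\troots\tH3K27ac\tRep1\tSRRxxxxxx\tSRA\tSE\tB73_NAM\n' > "$demo"
printf 'RNAseq\tW22\tears\tRNAseq\tRep1\tS01\t/home/maize-code/RNAseq/fastqs\tPE\n' >> "$demo"
validate_samplefile "$demo" || echo "samplefile needs fixing"
rm -f "$demo"
```

The check is purely structural: it does not verify that the fastq paths or genome reference folders exist.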
- Submit the `scripts/MaizeCode.sh` script, giving as arguments `-f <samplefile.txt>`, the samplefile of your choice, and `-p <path>`, the path to your folder that contains the different genome directories, i.e. the `genomes` folder mentioned above:
  `qsub scripts/MaizeCode.sh -f example_samplefile.txt -r /path/to/genomes`
- By default, it will proceed with the analysis. `-s` can be set so that it does not proceed with the analysis at all, or `-c` can be set if only single sample analysis should be performed but no combined analysis per line or between lines.
- If the analysis has not proceeded or if you want to analyze different samples together, make the new samplefile of your choice and submit the `scripts/MaizeCode.sh` script again.
  `qsub scripts/MaizeCode.sh -f new_samplefile.txt -r /path/to/genomes`
  The samples that have already been processed will not be repeated but will still be included in the analysis.
- Have a look at the results! (see Output below).
- There is one wrapper script, `MaizeCode.sh`, that launches sub-scripts depending on what needs to be done.
- Potentially, each script could be submitted on its own, but it could be tricky. Check the usage of each script beforehand by running it without arguments (or followed by `-h`).
- The shRNA and complete RAMPAGE pipelines are not ready yet
- It should work for both single-end and paired-end data
- It works perfectly with 2 replicates for every type of data (including two different inputs for ChIP). Adapting the scripts to allow for more variation in the number of replicates is under development (having only one ChIP input replicate works, as well as multiple RNAseq replicates).
- The whole pipeline creates a lot of report files, and probably files that are not necessary to keep, but for now I keep them for potential troubleshooting.
- For now I’ve used the `MaizeCode.sh` script from scratch for all samples of one inbred line at a time (all ChIPs and RNAseq from 5 tissues). It runs in ~19h (but it depends on the size of the files and how busy the cluster is...). Once the mapping and single-sample analysis have been done, reusing these samples in a different analysis is much quicker (~2h). The limiting steps are mapping ChIPseq samples and calling ChIPseq peaks (since this is done for each biological replicate, the merged file and both pseudo-replicates, and cannot be multi-threaded). That is probably the first step that could be optimized for faster runs.
- Always process the Input samples with their corresponding ChIP in the `MaizeCode.sh` script.
- The analysis will have to be adapted to the desired output, but running the default complete pipeline should give a first look at the data and generate all the files required for further analysis.
- These are still preliminary versions of the scripts!
- If you want to map samples to different references of the same inbred line (e.g. B73 v3 and B73 NAM), it is much safer to run a new analysis in a different folder
- `MaizeCode.sh` - wrapper script for the whole pipeline
  - Creates the different folders
  - Runs the `MaizeCode_check_environment.sh` script for each environment (datatype * reference) that needs to be created
  - Runs an instance of `MaizeCode_ChIP_sample.sh` or `MaizeCode_RNA_sample.sh` for each sample
  - Waits for the samples to be mapped
  - Runs the `MaizeCode_R_mapping_stats.r` script to plot the mapping statistics of all the samples in the samplefile into bar plots
  - Runs the `MaizeCode_analysis.sh` script if the `-s` argument (that stops after mapping) has not been given
  - Runs the `MaizeCode_analysis.sh -s` script if the `-c` argument (that stops after single sample analysis) has been given
  - By default, it will provide the `<reference_genome>_all_genes.bed` files created by the check_environment script as region files
- `MaizeCode_check_environment.sh`
  - Checks if there is ONE fasta and ONE gff3 file in the reference folder (and unzips them if required)
  - Makes a `chrom.sizes` file if not there (can be useful down the line, for bedGraphToBigWig for example)
  - Makes an `all_genes.bed` file if not there (will be used for analysis/plots)
  - Creates the template for the stat files
  - Makes the bowtie2 or STAR indexes (for ChIP and RNA, respectively) if not already there
- `MaizeCode_ChIP_sample.sh`
  - Copies fastq files from their original folder or GEO to the fastq/ folder (if not already done)
  - Runs fastQC on the raw data
  - Trims adapters, low quality and small reads (<20bp) with cutadapt
  - Runs fastQC on trimmed data
  - Maps with bowtie2
  - Removes PCR duplicates with samtools
  - Gets some mapping stats
- `MaizeCode_TF_sample.sh`
  - Copies fastq files from their original folder or GEO to the fastq/ folder (if not already done)
  - Runs fastQC on the raw data
  - Trims adapters, low quality and small reads (<20bp) with cutadapt
  - Runs fastQC on trimmed data
  - Maps with bowtie2
  - Removes PCR duplicates with samtools
  - Gets some mapping stats
- `MaizeCode_RNA_sample.sh` (shRNA NOT DONE YET; it will be processed by a different pipeline, using shortStack)
  - Copies fastq files from their original folder or GEO to the fastq/ folder (if not already done)
  - Runs fastQC on the raw data
  - Trims adapters and low quality with cutadapt
  - Runs fastQC on trimmed data
  - Maps with STAR with different settings depending on the type of data (RNAseq, shRNA, RAMPAGE)
  - Marks duplicates with STAR with different settings depending on the type of data (RNAseq, shRNA, RAMPAGE)
  - Creates stranded bigwig files with STAR and bedGraphToBigWig with different settings depending on the type of data (RNAseq, shRNA, RAMPAGE)
  - Gets some mapping stats
- `MaizeCode_analysis.sh` - wrapper script for the analysis pipeline
  - If new samples are to be analyzed individually, sends each group of samples of the same datatype to `MaizeCode_ChIP_analysis.sh` or `MaizeCode_RNA_analysis.sh`
  - Gathers peak statistics for all ChIPseq samples in the samplefile and launches `MaizeCode_R_peak_stats.r` to plot them
  - Gathers gene expression statistics for all RNAseq samples in the samplefile and launches `MaizeCode_R_gene_ex_stats.r` to plot them
  - Gathers TSS statistics for all RAMPAGE samples in the samplefile
  - Launches the `MaizeCode_line_analysis.sh` script for each reference present in the samplefile
  - If different lines are present, launches the `MaizeCode_combined_analysis.sh` script (NOT DONE YET)
- `MaizeCode_ChIP_analysis.sh`
  - Merges biological replicates and splits them into pseudo-replicates
  - For each type of file (replicate1, replicate2, pseudo-replicate1, pseudo-replicate2 and merged) in parallel:
    - Calls peaks with macs2 (calls broad peaks for H3K4me1, and narrow peaks for H3K4me3 and H3K27ac)
    - Makes bigwig files with deeptools (log2 FC vs Input, normalizing each file by CPM)
    - Plots Fingerprint
  - Waits for the previous steps to finish
  - Makes IDR analysis for biological replicates with idr
  - Makes a `selected_peaks` file with the peaks called in the merged sample and both pseudo-replicates with bedtools intersect
  - Makes some stats on the number of peaks
- `MaizeCode_TF_analysis.sh`
  - Merges biological replicates and splits them into pseudo-replicates
  - For each type of file (replicate1, replicate2, pseudo-replicate1, pseudo-replicate2 and merged) in parallel:
    - Calls peaks with macs2 (calls broad peaks for H3K4me1, and narrow peaks for H3K4me3 and H3K27ac)
    - Makes bigwig files with deeptools (log2 FC vs Input, normalizing each file by CPM)
    - Plots Fingerprint
  - Waits for the previous steps to finish
  - Makes IDR analysis for biological replicates with idr
  - Makes a `selected_peaks` file with the peaks called in the merged sample and both pseudo-replicates with bedtools intersect
  - Makes some stats on the number of peaks
  - Searches for motifs on the selected peaks and on the replicated peaks (present in both biological replicates) with meme
  - Searches for motifs on the selected peaks with homer (might be limited to the best combination of motifs/peak files in the future)
- `MaizeCode_RNA_analysis.sh`
  - Processes each sample in parallel
  - For RAMPAGE data:
    - Merges biological replicates and creates stranded tracks (bigwigs) with STAR and bedGraphToBigWig
    - Calls peaks (to identify TSS) with macs2 (should be grit, but it is not maintained and pretty cryptic)
    - Makes IDR analysis for biological replicates with idr
    - Makes some stats on the number of peaks/tss
  - For RNAseq data:
    - Merges biological replicates and creates stranded tracks (bigwigs) with STAR and bedGraphToBigWig
    - Makes some stats on the number of expressed genes
- `MaizeCode_line_analysis.sh` (analyses marked by *** are still under development)
  - Splits the samplefile into ChIPseq, TF, RNAseq and RAMPAGE samples
  - For RNAseq samples, if several tissues are present in the samplefile:
    - Makes sample and count tables
    - Calls differentially expressed genes between all pairs of tissues using the `MaizeCode_R_DEG.r` script
    - For B73_v4 samples, performs gene ontology analysis on each pairwise DEG using the `MaizeCode_R_DEG_GO.r` script
    - Identifies genes unique to each tissue in the samplefile and, for B73_v4 samples, performs gene ontology analysis on them using the `MaizeCode_R_GO.r` script
  - For ChIPseq samples:
    - Makes a single file, merging all selected peaks from all samples with bedtools merge
    - Gets the distance of each peak to the closest region from the regionfile with bedtools closest
    - Creates an Upset plot to show overlap among the different samples using the `MaizeCode_R_Upset.r` script
    - If several tissues are present in the samplefile, calculates differential peaks between the different tissues***
  - For TF samples:
    - Makes a single file, merging all selected peaks from all samples with bedtools merge
    - Intersects with H3K27ac samples (if present in the same samplefile, or if previously processed)
    - Gets the distance of each peak to the closest region from the regionfile with bedtools closest
    - Creates an Upset plot to show overlap among the different samples using the `MaizeCode_R_Upset_TF.r` script
    - Compares TF binding sites with DEGs, if both data types are present***
  - For RAMPAGE samples:
    - If several tissues are present in the samplefile, calls differential TSS between the different tissues***
  - On all the samples:
    - Plots heatmaps of the ChIPseq, RNAseq and RAMPAGE samples over the regionfile
    - Plots heatmaps and profiles of the ChIPseq samples over the differentially expressed genes (if they were called previously)
    - Plots heatmaps and profiles of the ChIPseq samples over genes split by expression quantiles (if RNA data is present)
    - Plots heatmaps and profiles on identified enhancers (only from H3K27ac distal peaks for now)
- `MaizeCode_combined_analysis.sh` (NOT DONE YET, but expectations are:)
  - Compares gene status (silent, constitutive and tissue-specific) between homolog genes
  - Compares enhancers
- `MaizeCode_R_mapping_stats.r`
  - Creates a plot representing the mapping statistics (both read numbers and distribution), named `combined/plots/mapping_stats_<samplefile_name>.pdf`
- `MaizeCode_R_peak_stats.r`
  - Creates a plot representing the number of peaks called in the different ChIPseq sub-samples, named `combined/plots/peak_stats_<samplefile_name>.pdf`
- `MaizeCode_R_gene_ex_stats.r`
  - Creates a plot summarizing the expression levels of genes in all RNAseq samples, named `combined/plots/gene_expression_stats_<samplefile_name>.pdf`
- `MaizeCode_R_Upset.r`
  - Creates an Upset plot of overlapping peaks and their presence in gene bodies, named `combined/plots/Upset_<analysis_name>.pdf`
- `MaizeCode_R_DEG.r`
  - Performs differential expression analysis with edgeR on all RNAseq samples present in the samplefile
  - Plots MDS (`combined/plots/MDS_<analysis_name>.pdf`) and BCV (`combined/plots/BCV_<analysis_name>.pdf`)
  - For each pair of tissues:
    - Creates a table of log2FC for all genes (named `combined/DEG/FC_<analysis_name>_<tissue1>_vs_<tissue2>.txt`)
    - Creates a table of differentially expressed genes (named `combined/DEG/DEG_<analysis_name>_<tissue1>_vs_<tissue2>.txt`)
  - Plots two heatmaps on all the differentially expressed genes: one by log(cpm), named `combined/plots/Heatmap_cpm_<analysis_name>.pdf`, and one scaled per row (z_score), named `combined/plots/Heatmap_zscore_<analysis_name>.pdf`
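Because the DEG script writes one FC table and one DEG table per pair of tissues, n tissues give n(n-1)/2 pairs. The sketch below lists the file names to expect for a given analysis. The function name, the analysis name `B73_on_all_genes` and the tissue order within each pair are assumptions for illustration; the actual script may order tissues differently.

```shell
# Print the FC and DEG table names for every unordered pair of tissues.
# $1 is the analysis name; the remaining arguments are tissue names.
list_pairwise_tables() {
    analysis=$1
    shift
    while [ "$#" -gt 1 ]; do
        first=$1
        shift
        for second in "$@"; do
            echo "combined/DEG/FC_${analysis}_${first}_vs_${second}.txt"
            echo "combined/DEG/DEG_${analysis}_${first}_vs_${second}.txt"
        done
    done
}

# Three tissues give three pairs, so six file names are printed.
list_pairwise_tables B73_on_all_genes roots ears leaf
```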
Directories:
From the main folder `<maizecode>` where `MaizeCode.sh` is run:
- `<maizecode>/ChIP`: Folder containing data from ChIPseq sample(s); only created if at least one ChIPseq sample has been analyzed
  - `<maizecode>/ChIP/fastq`: Folder containing raw and trimmed fastq files
  - `<maizecode>/ChIP/mapped`: Folder containing mapped and indexed data (bam and bam.bai files). It will contain mapped data before and after deduplication for each biological replicate, the merged replicates and the pseudo-replicates files.
  - `<maizecode>/ChIP/tracks`: Folder containing bigwig files and the all_genes.bed file for all genome references
  - `<maizecode>/ChIP/plots`: Folder containing the fingerprint plots for each sample and idr plots between biological replicates
  - `<maizecode>/ChIP/peaks`: Folder containing all peak files and idr analysis between biological replicates
  - `<maizecode>/ChIP/reports`: Folder containing the fastQC reports, trimming details, mapping details, idr details, the `summary_mapping_stats.txt` file that has a summary of mapping statistics for all ChIPseq samples ever processed and the `summary_ChIP_peaks.txt` file that has a summary of peak statistics for all ChIPseq samples ever processed
  - `<maizecode>/ChIP/logs`: Folder containing log files to go back to in case of error during environment building (`env_<genome_reference>.log`), mapping (`<sample_name>.log`), analysis of ChIP samples together (`<samplefile_name>.log`) and single sample analysis (`analysis_<sample_name>[|_Rep1|_Rep2|_pseudo1|_pseudo2|_merged].log`)
  - `<maizecode>/ChIP/chkpts`: Folder containing `touch` files to track success and completion of environment building (`env_<genome_ref>`), sample mapping (`<sample_name>`) and single sample analysis, i.e. peak calling and bigwig files (`analysis_<sample_name>`). These files prevent steps that were already performed from being repeated, so that only the combined analysis of different combinations of samples is run. If these files are deleted, the mapping and analysis steps will be repeated and will overwrite existing files.
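The checkpoint mechanism described for the `chkpts` folders can be pictured with the minimal sketch below. It is not the pipeline's own code: `run_once`, the sample name and the use of `echo` as a stand-in for a mapping command are all hypothetical.

```shell
# Run a step only if its checkpoint file is absent, then record success.
# $1: checkpoint directory, $2: checkpoint name, remaining arguments: the command.
run_once() {
    chkpt_dir=$1
    name=$2
    shift 2
    if [ -e "$chkpt_dir/$name" ]; then
        echo "skipping $name (checkpoint found)"
        return 0
    fi
    # The checkpoint is only touched when the command succeeds.
    "$@" && mkdir -p "$chkpt_dir" && touch "$chkpt_dir/$name"
}

# Demo in a temporary folder; 'echo' stands in for a real mapping command.
work=$(mktemp -d)
run_once "$work/ChIP/chkpts" B73_roots_H3K27ac_Rep1 echo "mapping sample"
run_once "$work/ChIP/chkpts" B73_roots_H3K27ac_Rep1 echo "mapping sample"
rm -rf "$work"
```

The first call runs the command and creates the checkpoint; the second call finds it and skips the step. Deleting the checkpoint file makes the step run again.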
- `<maizecode>/RNA`: Folder containing data from RNA sample(s); only created if at least one RNA sample has been analyzed
  - `<maizecode>/RNA/fastq`: Folder containing raw and trimmed fastq files
  - `<maizecode>/RNA/mapped`: Folder containing mapped and indexed data (bam and bam.bai files). It will contain mapped data before and after deduplication for each biological replicate and the merged replicates files
  - `<maizecode>/RNA/tracks`: Folder containing stranded bigwig files based on all or unique reads (4 files per biological replicate, plus 4 files for the merged replicates) and the all_genes.bed file for all genome references
  - `<maizecode>/RNA/TSS`: Folder containing the peaks (TSS) called on RAMPAGE data and the IDR analysis between biological replicates
  - `<maizecode>/RNA/plots`: Folder containing the idr plots between biological replicates for RAMPAGE samples
  - `<maizecode>/RNA/reports`: Folder containing the fastQC reports, trimming details, mapping details, the `summary_mapping_stats.txt` file that has a summary of mapping statistics for all RNA samples ever processed, the `summary_gene_expression.txt` file that has a summary of gene expression statistics for all RNAseq samples ever processed and the `summary_RAMPAGE_tss.txt` file that has a summary of peak/tss statistics for all RAMPAGE samples ever processed
  - `<maizecode>/RNA/logs`: Folder containing log files to go back to in case of error during environment building (`env_<genome_reference>.log`), mapping (`<sample_name>.log`), analysis of RNA samples together (`<samplefile_name>.log`) and single sample analysis (`analysis_<sample_name>.log`)
  - `<maizecode>/RNA/chkpts`: Folder containing `touch` files to track success and completion of environment building (`env_<genome_ref>`), sample mapping (`<sample_name>`) and single sample analysis, i.e. bigwig files (`analysis_<sample_name>`). These files prevent steps that were already performed from being repeated, so that only the combined analysis of different combinations of samples is run. If these files are deleted, the mapping and analysis steps will be repeated and will overwrite existing files.
- `<maizecode>/combined`: Folder containing data from combined analysis; only created if at least one sample has been mapped.
  `<analysis_name>` is a combination of the samplefile and regionfile names: `<samplefile_name>_on_<regionfile_name>`. By default, `<regionfile_name>` is `all_genes`
  - `<maizecode>/combined/DEG`: Folder containing differentially expressed genes analysis results. `FC_<sample1>_vs_<sample2>.txt` files are pairwise comparisons between sample1 and sample2 for all genes. `DEG_<sample1>_vs_<sample2>.txt` files only contain the differentially expressed genes (FDR<=0.05, |logFC|>2) between sample1 and sample2.
  - `<maizecode>/combined/peaks`: Folder containing combined ChIPseq peak files (`peaks_<analysis_name>.bed`), RAMPAGE TSS files (`tss_<analysis_name>.bed`) and matrices for Upset plots (`matrix_upset_<analysis_name>.txt`).
  - `<maizecode>/combined/matrix`: Folder containing matrix files for heatmap plotting (`regions_<analysis_name>.gz`, `tss_<analysis_name>.gz` and `deg_<analysis_name>.gz`), output regions from k-means clustering of the heatmaps (`<analysis_name>_regions_regions_k5.txt` and `<analysis_name>_tss_regions_k5.txt`) and the value tables to be used for the scales of heatmaps (`values_regions_<analysis_name>.txt` and `values_tss_<analysis_name>.txt`). 'regions' corresponds to the 'scale-regions' argument of deeptools, 'tss' corresponds to the 'reference-point --referencePoint TSS' argument of deeptools and 'k5' corresponds to the '--kmeans 5' argument of deeptools.
  - `<maizecode>/combined/plots`: Folder containing the mapping statistics plot (`mapping_stats_<analysis_name>.pdf`), the peak statistics plot (`peak_stats_<analysis_name>.pdf`), the gene expression statistics plot (`gene_expression_stats_<analysis_name>.pdf`), the Upset plots of peaks in gene bodies (`Upset_<analysis_name>.pdf`), the MDS and BCV plots from the DEG analysis (`MDS_<analysis_name>.pdf` and `BCV_<analysis_name>.pdf`, respectively), the heatmaps of differentially expressed genes clustered across all samples with log(cpm) values (`Heatmap_cpm_<analysis_name>.pdf`) and normalized for each gene (`Heatmap_zscore_<analysis_name>.pdf`), and the different deeptools heatmaps (`<analysis_name>_heatmaps_regions.pdf`, `<analysis_name>_heatmaps_regions_k5.pdf`, `<analysis_name>_heatmaps_tss.pdf` and `<analysis_name>_heatmaps_tss_k5.pdf`) and profiles. 'regions' corresponds to the 'scale-regions' argument of deeptools, 'tss' corresponds to the 'reference-point --referencePoint TSS' argument of deeptools and 'k5' corresponds to the '--kmeans 5' argument of deeptools.
  - `<maizecode>/combined/reports`: Folder containing the `summary_mapping_stats_<samplefile_name>.txt` file that has a summary of mapping statistics for all samples in each samplefile, the `summary_ChIP_peaks_<samplefile_name>.txt` file that has a summary of peak statistics for all ChIPseq samples in each samplefile, the `summary_gene_expression_<samplefile_name>.txt` file that has a summary of gene expression statistics for all RNAseq samples in each samplefile and the `summary_RAMPAGE_tss_<samplefile_name>.txt` file that has a summary of peak/tss statistics for all RAMPAGE samples in each samplefile.
  - `<maizecode>/combined/logs`: Folder containing log files to go back to in case of error during combined analysis (`analysis_<analysis_name>.log`)
  - `<maizecode>/combined/chkpts`: Folder containing `touch` files to track success of combined analysis (`<analysis_name>.log`). These files are only for success tracking and will be overwritten if an analysis with the same name is performed.
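The thresholds quoted for the DEG tables (FDR<=0.05, |logFC|>2) can be applied to an FC table with a short awk filter, for example to try a different cutoff later. This is a sketch under assumptions: the layout of the FC tables is not documented here, so the code assumes a tab-delimited file with a header row containing columns named `logFC` and `FDR` (the names edgeR uses in its result tables). The function name and gene IDs are hypothetical.

```shell
# Keep the header and the rows with FDR <= 0.05 and |logFC| > 2.
# Assumes a tab-delimited table whose header names columns "logFC" and "FDR".
filter_deg() {
    awk -F '\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "logFC") fc = i
                if ($i == "FDR") fdr = i
            }
            print
            next
        }
        fc && fdr && $fdr + 0 <= 0.05 && ($fc + 0 > 2 || $fc + 0 < -2)
    ' "$1"
}

# Demo on a throwaway table with made-up values.
demo=$(mktemp)
printf 'gene\tlogFC\tFDR\n' > "$demo"
printf 'geneA\t3.2\t0.001\n' >> "$demo"
printf 'geneB\t0.7\t0.0004\n' >> "$demo"
printf 'geneC\t-2.6\t0.03\n' >> "$demo"
filter_deg "$demo"   # prints the header, geneA and geneC
rm -f "$demo"
```

If the header does not contain both column names, only the header line is printed, which makes a layout mismatch easy to spot.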
- `<maizecode>/chkpts`: Folder containing `touch` files to track success of a run without analysis
Statistics:
- `summary_mapping_stats.txt`
  - Located in `<maizecode>/<ChIP|RNA>/reports/`
  - Tab-delimited file with 9 columns giving information for each sample (detailed in columns #1 to #4) on:
    - the genome reference it was mapped to (column #5),
    - the number of reads in total (column #6),
    - the number of reads (and percentage of the total reads) that pass filtering (column #7),
    - the number of reads (and percentage of the total reads) that map to the reference, including multi-mappers (column #8),
    - the number of reads (and percentage of the total reads) that are uniquely mapped (column #9).
- `summary_ChIP_peaks.txt` and `summary_ChIP_peaks_<samplefile_name>.txt`
  - Located in `<maizecode>/combined/reports/` and `<maizecode>/ChIP/reports/`
  - Tab-delimited file with 10 columns giving information for each histone mark (detailed in columns #1 to #3) on:
    - the number of peaks called in each biological replicate (columns #4 and #5, respectively),
    - the number of peaks in common between the biological replicates (all peaks given by the IDR analysis) and the percentage relative to each biological replicate (column #6),
    - the number of peaks in common between the biological replicates that pass the IDR threshold of 0.05 and the percentage relative to the number of peaks in common (column #7),
    - the number of peaks called when both replicates are merged (column #8),
    - the number of peaks shared by each pseudo-replicate (column #9),
    - the number of selected peaks (i.e. the peaks that will be used for downstream analysis), which are the peaks shared by the merged and both pseudo-replicates, and the percentage relative to the number of merged peaks (column #10).
- `summary_RAMPAGE_tss.txt` and `summary_RAMPAGE_tss_<samplefile_name>.txt`
  - Located in `<maizecode>/combined/reports/` and `<maizecode>/RNA/reports/`
  - Tab-delimited file with 8 columns giving information for each RAMPAGE sample (detailed in columns #1 to #3) on:
    - the number of annotated genes in the reference genome (column #4),
    - the number of peaks called in each biological replicate (columns #5 and #6, respectively),
    - the number of peaks in common between the biological replicates (all peaks given by the IDR analysis) and the percentage relative to each biological replicate (column #7),
    - the number of peaks in common between the biological replicates that pass the IDR threshold of 0.05 and the percentage relative to the number of peaks in common (column #8).
- `summary_gene_expression.txt` and `summary_gene_expression_<samplefile_name>.txt`
  - Located in `<maizecode>/combined/reports/` and `<maizecode>/RNA/reports/`
  - Tab-delimited file with 13 columns giving information for each RNAseq sample (detailed in columns #1 to #3) on:
    - the number of annotated genes in the reference genome (column #4),
    - the number of silent genes (cpm=0), lowly expressed genes (cpm<1) and highly expressed genes (cpm>1) in the first biological replicate (columns #5, #6 and #7, respectively),
    - the number of silent genes (cpm=0), lowly expressed genes (cpm<1) and highly expressed genes (cpm>1) in the second biological replicate (columns #8, #9 and #10, respectively),
    - the number of silent genes (cpm=0), lowly expressed genes (cpm<1) and highly expressed genes (cpm>1) after averaging both replicates (columns #11, #12 and #13, respectively).
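The three expression categories used in `summary_gene_expression.txt` can be reproduced on any list of cpm values as sketched below. The sketch reads "lowly expressed" as 0 < cpm < 1 so that the categories do not overlap; a value of exactly 1 falls in no category, since the definitions above leave that case open. The function name and the input format (one cpm value per line) are assumptions for illustration.

```shell
# Count silent (cpm = 0), lowly expressed (0 < cpm < 1) and highly expressed (cpm > 1) genes.
# Reads one cpm value per line; prints "silent low high".
count_expression_classes() {
    awk '
        $1 + 0 == 0 { silent++; next }
        $1 + 0 < 1  { low++; next }
        $1 + 0 > 1  { high++ }
        END { printf "%d %d %d\n", silent, low, high }
    ' "$1"
}

# Demo with five made-up cpm values.
demo=$(mktemp)
printf '0\n0.4\n12.5\n0\n3\n' > "$demo"
count_expression_classes "$demo"   # prints: 2 1 2
rm -f "$demo"
```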
Plots: (examples are in the github data folder)
-
ChIP/plots/Fingerprint_<sample_name>_<replicate>.png
Fingerprint plot from deeptools to assess the genome-wide distribution of reads for each ChIPseq sample replicate (Rep1, Rep2, merged, pseudo1 and pseudo2) and its corresponding Input.
Tool details: https://deeptools.readthedocs.io/en/develop/content/tools/plotFingerprint.html
[ generated by deeptools fromMaizeCode_ChIP_analysis.sh
] -
ChIP/plots/idr_<ChIPseq_sample_name>.png
Scatter plots and box plots from idr showing correlation between biological replicates of a ChIPseq sample.
Tool details: https://github.com/nboley/idr
[ generated by idr fromMaizeCode_ChIP_analysis.sh
] -
RNA/plots/idr_<RAMPAGE_sample_name>.png
Scatter plots and box plots from idr showing correlation between biological replicates of a RAMPAGE sample.
Tool details: https://github.com/nboley/idr
[ generated by idr fromMaizeCode_RNA_analysis.sh
] -
combined/plots/mapping_stats_<samplefile_name>.pdf
Bar plots showing number and distribution of uniquely mapped, multi-mapping, unmapped and filtered reads for all samples in<samplefile_name>_analysis_samplefile.txt
.
tool details: https://ggplot2.tidyverse.org/reference/geom_bar.html
tool details: https://wilkelab.org/cowplot/reference/index.html
[ generated in R byMaizeCode_R_mapping_stats.r
, started fromMaizeCode.sh
] -
combined/plots/peak_stats_<samplefile_name>.pdf
Bar plots showing the number of peaks called in each biological replicate, the number of peaks in common between the replicates, the number of peaks in common passing an IDR threshold of 0.05, the number of peaks called when the bam files of both biological replicates are merged, the number of peaks in common between pseudo-replicates (called on two random halves of the merged bam file) and the number of selected peaks (common between the merged and the pseudoreplicates) for all ChIPseq samples in<samplefile_name>_analysis_samplefile.txt
.
tool details: https://ggplot2.tidyverse.org/reference/geom_bar.html
[ generated in R byMaizeCode_R_mapping_stats.r
, started fromMaizeCode.sh
] -
combined/plots/gene_expression_stats_<samplefile_name>.pdf
Bar plots showing the distribution of genes in three large categories: unexpressed (cpm=0), lowly expressed (cpm<1) and highly expressed (cpm>1) for each biological replicate and when taking their average, for all RNAseq samples in<samplefile_name>_analysis_samplefile.txt
.
tool details: https://ggplot2.tidyverse.org/reference/geom_bar.html
[ generated in R byMaizeCode_R_mapping_stats.r
, started fromMaizeCode.sh
] -
combined/plots/Upset_<samplefile_name>_on_<regionfile_name>.pdf
Upset plots showing intersection between all the ChIP samples in the<samplefile_name>_analysis_samplefile.txt
, highlighting the peaks that are present on the regions in<regionfile_name>.bed
.
Tool details: https://github.com/hms-dbmi/UpSetR
[ generated in R byMaizeCode_R_Upset.r
, started fromMaizeCode_line_analysis.sh
] -
combined/plots/MDS_<samplefile_name>_on_<regionfile_name>.pdf
MDS plot (2D representation of variance) of all the RNAseq samples in <samplefile_name>_analysis_samplefile.txt that map to the same reference genome as the one used to create <regionfile_name>.txt.
Tool details: https://rdrr.io/bioc/edgeR/man/plotMDS.DGEList.html
[ generated in R by MaizeCode_R_DEG.r, started from MaizeCode_line_analysis.sh ] -
combined/plots/BCV_<samplefile_name>_on_<regionfile_name>.pdf
BCV plot (biological coefficient of variation) for all the genes in <regionfile_name>.txt, based on all the RNAseq samples in <samplefile_name>_analysis_samplefile.txt.
Tool details: https://rdrr.io/bioc/edgeR/man/plotBCV.html
[ generated in R by MaizeCode_R_DEG.r, started from MaizeCode_line_analysis.sh ] -
combined/plots/Heatmap_cpm_<samplefile_name>_on_<regionfile_name>.pdf
Clustered heatmap of all the differentially expressed genes between all pairs of RNAseq samples in <samplefile_name>_analysis_samplefile.txt, scaled by log(counts per million) of the RNAseq replicate samples (highlights gene expression levels).
Tool details: https://www.rdocumentation.org/packages/gplots/versions/3.1.0/topics/heatmap.2
[ generated in R by MaizeCode_R_DEG.r, started from MaizeCode_line_analysis.sh ] -
combined/plots/Heatmap_zscore_<samplefile_name>_on_<regionfile_name>.pdf
Clustered heatmap of all the differentially expressed genes between all pairs of RNAseq samples in <samplefile_name>_analysis_samplefile.txt, scaling each row by z-score among the RNAseq replicate samples (highlights differences between samples). -
combined/plots/<samplefile_name>_<regionfile_name>_heatmap_regions.pdf
Heatmap of the enrichment for all ChIP and RNAseq samples in <samplefile_name>_analysis_samplefile.txt on the regions from <regionfile_name>.bed, scaling each region to the same length, in decreasing order of overall enrichment in all samples.
Tool details: https://deeptools.readthedocs.io/en/develop/content/tools/plotHeatmap.html
[ generated in R by MaizeCode_R_DEG.r, started from MaizeCode_line_analysis.sh ] -
combined/plots/<samplefile_name>_on_<regionfile_name>_heatmap_regions_k5.pdf
Heatmap of the enrichment for all ChIP and RNAseq samples in <samplefile_name>_analysis_samplefile.txt on the regions from <regionfile_name>.bed, scaling each region to the same length, clustered into 5 groups by k-means.
Tool details: https://deeptools.readthedocs.io/en/develop/content/tools/plotHeatmap.html
[ generated by deeptools from MaizeCode_line_analysis.sh ] -
combined/plots/<samplefile_name>_on_<regionfile_name>_heatmap_tss.pdf
Heatmap of the enrichment for all ChIP and RNAseq samples in <samplefile_name>_analysis_samplefile.txt on the regions from <regionfile_name>.bed, aligning all regions by their transcription start site, in decreasing order of overall enrichment in all samples.
Tool details: https://deeptools.readthedocs.io/en/develop/content/tools/plotHeatmap.html
[ generated by deeptools from MaizeCode_line_analysis.sh ] -
combined/plots/<samplefile_name>_on_<regionfile_name>_heatmap_tss_k5.pdf
Heatmap of the enrichment for all ChIP and RNAseq samples in <samplefile_name>_analysis_samplefile.txt on the regions from <regionfile_name>.bed, aligning all regions by their transcription start site, clustered into 5 groups by k-means.
Tool details: https://deeptools.readthedocs.io/en/develop/content/tools/plotHeatmap.html
[ generated by deeptools from MaizeCode_line_analysis.sh ] -
combined/plots/<samplefile_name>_on_<regionfile_name>_heatmap_DEG.pdf
Heatmap of the enrichment for all ChIP samples in <samplefile_name>_analysis_samplefile.txt on the groups of differentially expressed genes called between all pairs of RNAseq samples, scaling each gene to the same length, in decreasing order of overall enrichment within each group of UP- and DOWN-regulated genes, using a specific scale for each mark (warning: genes differentially expressed between several pairs of tissues will be present in each of the corresponding clusters).
Tool details: https://deeptools.readthedocs.io/en/develop/content/tools/plotHeatmap.html
[ generated by deeptools from MaizeCode_line_analysis.sh ] -
combined/plots/<samplefile_name>_on_<regionfile_name>_heatmap_all_DEGs.pdf
Heatmap of the enrichment for all ChIP samples in <samplefile_name>_analysis_samplefile.txt on all the differentially expressed genes called between all pairs of RNAseq samples, scaling each gene to the same length and clustering into 5 groups by k-means, using a specific scale for each mark (warning: the generated clusters of genes are not linked to the samples in which they were originally called as differentially expressed).
Tool details: https://deeptools.readthedocs.io/en/develop/content/tools/plotHeatmap.html
[ generated by deeptools from MaizeCode_line_analysis.sh ]
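
For readers unfamiliar with deeptools, the heatmap variants listed above (regions scaled to the same length or aligned on the TSS, sorted by enrichment or clustered by k-means) could be produced roughly as sketched here. This is not copied from MaizeCode_line_analysis.sh: the helper names matrix_cmd and heatmap_cmd, the file names, the 2000 bp flanks and the 5000 bp body length are all placeholder assumptions, and the commands are only printed, not executed.

```shell
#!/bin/sh
# Illustrative sketch: prints deeptools commands instead of running them.
# File names and window sizes are placeholders, not the pipeline's values.

# Print a computeMatrix command.
# $1: mode (scale-regions or reference-point), $2: BED file,
# $3: output matrix, remaining arguments: bigWig files.
matrix_cmd() {
  mode=$1; bed=$2; out=$3
  shift 3
  if [ "$mode" = "scale-regions" ]; then
    extra="--regionBodyLength 5000"   # every region scaled to the same length
  else
    extra="--referencePoint TSS"      # regions aligned on their start site
  fi
  echo "computeMatrix $mode -R $bed -S $* -b 2000 -a 2000 $extra -o $out"
}

# Print a plotHeatmap command.
# $1: matrix, $2: output PDF, $3 (optional): number of k-means clusters.
heatmap_cmd() {
  matrix=$1; pdf=$2; k=${3:-}
  if [ -n "$k" ]; then
    opt="--kmeans $k"                 # e.g. 5 for the *_k5.pdf plots
  else
    opt="--sortRegions descend"       # decreasing overall enrichment
  fi
  echo "plotHeatmap -m $matrix -out $pdf $opt"
}

matrix_cmd scale-regions regions.bed regions.gz chip.bw rna.bw
heatmap_cmd regions.gz heatmap_regions.pdf
heatmap_cmd regions.gz heatmap_regions_k5.pdf 5
matrix_cmd reference-point regions.bed tss.gz chip.bw rna.bw
heatmap_cmd tss.gz heatmap_tss.pdf
```

Printing the commands first makes it easy to check the options before running them by hand on real bigWig and BED files.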
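
The three expression categories plotted in gene_expression_stats_<samplefile_name>.pdf can also be illustrated with a minimal sketch. It is not taken from MaizeCode_R_mapping_stats.r: the two-column input (gene ID, cpm), the function name classify_cpm and the handling of cpm exactly equal to 1 (counted here as highly expressed, since the description above leaves that case open) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count genes per expression category from a
# tab-separated table without header (column 1: gene ID, column 2: cpm).
classify_cpm() {
  awk -F'\t' '
    BEGIN { un = 0; low = 0; high = 0 }
    $2 == 0 { un++; next }     # unexpressed: cpm = 0
    $2 < 1  { low++; next }    # lowly expressed: 0 < cpm < 1
            { high++ }         # highly expressed (assumption: includes cpm = 1)
    END {
      printf "unexpressed\t%d\n", un
      printf "low\t%d\n", low
      printf "high\t%d\n", high
    }
  ' "$1"
}

# Toy example with four made-up genes
tmp=$(mktemp)
printf 'geneA\t0\ngeneB\t0.4\ngeneC\t12.5\ngeneD\t0\n' > "$tmp"
classify_cpm "$tmp"
rm -f "$tmp"
```

Running the function once per replicate, and once on the averaged cpm values, would give the counts shown as bars in the plot.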