Before diving too far into the documentation, it is important to understand some of the nomenclature and semantics used throughout. The focus is on words that are important to the workflow and that may be ambiguous when used without definition. Some of these words may seem to overlap in definition, and in certain situations they do. Even then, each category should be given its proper designation to distinguish it from the others.
We'll focus first on the three S's that relate to what we are working with, Subject, Specimen, and Sample.
- Subject: the who or what we are working with. This could be a patient or an experiment. It is important to remember which of the downstream identifiers are associated with a specific subject.
- Specimen: collected from a subject, the specimen is the start of the protocol. A specimen for iGUIDE could be considered a tube of starting gDNA. This is the actual material that will be worked with for the protocol.
- Sample: While many people commonly use specimen and sample interchangeably, here we note that a sample comes from a specimen. We make this distinction because there are multiple ways to work up a single specimen, and each of these different ways is a different sample. Samples are taken from the specimen, just as the specimen is taken from the subject.
In the following workflow, you'll notice that in certain places, we refer to 'sampleName' (such as in the sampleInfo file), or 'specimen' (such as in the supplemental data file). These designations are consistent with the above definitions and it is expected that the user will follow these customs.
How do we distinguish Subject, Specimen, and Sample? During processing, these identifiers need to be distinguished from each other using different nomenclature. Below is an example of a naming scheme for the three identifiers:
Subject    Specimen   Sample
{patient}  {Spec.ID}  {Spec.ID-info-rep.ID}
pND405     iGSP0015   iGSP0015-neg-1
pND405     iGSP0015   iGSP0015-neg-2
pND405     iGSP0015   iGSP0015-neg-3
Here we have an example workflow. Subject identifiers are not usually part of the processing; we typically consider the subject during data interpretation, when reviewing the output reports. Subject identifiers can be included in supplemental files with runs. Specimens can have identifiers (or IDs). For iGUIDE, it is easiest to use a single alpha-numeric string as an identifier (without delimiters!).
Following the above practice, the specimen ID can be included in a sample ID (or sampleName) along with additional information. As indicated above, iGUIDE treats sampleNames as three-part strings: the specimen ID is at the beginning, separated by a "-" delimiter from the additional information. The last part of the string is a replicate identifier, expected to be numeric. In practice, we find it best to create 4 samples for processing from a single specimen. This limits the possibility of PCR jackpotting and allows an analyst to utilize capture-recapture statistics for population estimation. The remainder of the string that is not captured in the first or last components is not directly used by iGUIDE, except as a unique identifier of the specimen. Therefore, it is a great place to indicate sample-specific treatments.
Given the above example, three different samples have been indicated, all from a single specimen and a single subject. During processing, the user will indicate each sample in the sampleName column of the sampleInfo file. When iGUIDE returns the analysis, each specimen will be indicated. So while three samples go into the pipeline, data will be combined in the output to represent the single specimen.
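The naming convention above can be illustrated with a short Python sketch. This is not part of iGUIDE; the function and class names are hypothetical, and it only assumes the "{specimen}-{info}-{replicate}" layout described in this section, with an alpha-numeric specimen ID and a numeric replicate.

```python
from typing import Dict, List, NamedTuple


class SampleName(NamedTuple):
    specimen: str   # alpha-numeric specimen ID, no delimiters
    info: str       # optional free-form middle part (may be empty)
    replicate: int  # numeric replicate identifier


def parse_sample_name(sample_name: str) -> SampleName:
    """Split a sampleName into specimen, middle information, and replicate."""
    parts = sample_name.split("-")
    if len(parts) < 2:
        raise ValueError(f"expected at least '{{specimen}}-{{rep}}': {sample_name!r}")
    specimen, *middle, rep = parts
    if not specimen.isalnum():
        raise ValueError(f"specimen ID must be alpha-numeric: {specimen!r}")
    if not rep.isdigit():
        raise ValueError(f"replicate must be numeric: {rep!r}")
    return SampleName(specimen, "-".join(middle), int(rep))


def group_by_specimen(sample_names: List[str]) -> Dict[str, List[str]]:
    """Group sampleNames by specimen, mirroring how replicates are pooled."""
    groups: Dict[str, List[str]] = {}
    for name in sample_names:
        groups.setdefault(parse_sample_name(name).specimen, []).append(name)
    return groups


names = ["iGSP0015-neg-1", "iGSP0015-neg-2", "iGSP0015-neg-3"]
print(parse_sample_name(names[0]))
print(group_by_specimen(names))
```

Grouping the three example sample names yields a single key, iGSP0015, which matches the statement that the three samples are reported as one specimen.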
More information can be found about specimen and sampleNames in this user guide.
While writing this documentation, I thought it would be helpful to explain in a general sense what an experiment might look like, using the respective terminology of this software.
For a respective subject (patient, individual experiment, ...) that has been treated with the marker dsODN during genome editing, specimens are harvested from various conditions (with nuclease, with different targets controlled by gRNAs, ...). This harvesting yields genomic DNA, which is commonly cataloged into a specimen database holding metadata and experimental parameters for the different specimens.
Samples are then taken from these specimens, typically 4 samples (see protocol from iGUIDE manuscript), and processed through the iGUIDE protocol. Before sequencing, a sampleInfo sheet would be constructed where each row of a csv file indicates a different sample that was processed, along with the sample's barcode and demultiplexing information.
During sequencing (or after), a run specific configuration file (config file) would be constructed by one or two parties. There is run specific information that needs to be included, such as: target sequence patterns, nuclease profiles, treatment information, etc. If a variable changes throughout the samples, then it can be indicated in the sampleInfo file, while if it is constant, it can be indicated in the config file.
The latter part of the config is reviewed and checked by the individual who will computationally process the run. This portion of the config file contains parameters that modify or tune the software to run on different systems.
After the computational processing has completed, a stat report and an analytical report are generated in the reports directory. These can be reviewed by the respective parties.
Additionally, if multiple runs contain samples to be analyzed together, auxiliary commands in iGUIDE allow for the computational analyst to generate new reports combining multiple sequencing runs together.
If the user is unsure whether their experiment would work with this type of analysis, feel free to contact the maintainers of iGUIDE.
Once installed, iGUIDE utilization is broken down into subcommands, as indicated in Figure 1 below. These commands are described here to give the user an understanding of how the software works from a workflow viewpoint.
Primary subcommands: Used for standard or primary workflow of processing sequencing runs.
setup
: This subcommand initiates a project or run directory. It requires a config file and will create the new project directory within the iGUIDE/analysis directory.
run
: This subcommand will process a run given a config file using a Snakemake workflow (https://snakemake.readthedocs.io/en/stable/). Therefore, Snakemake specific commands can be passed into the run subcommand. All Snakemake specific commands should come after a -- break in the options.
Auxiliary subcommands: Used for auxiliary workflows which further dive into analyzing the processed data.
eval
: Short for evaluation, this subcommand will analyze a run's data and yield an RDS file (R-based data file). Supplemental data can additionally be passed into the evaluation to group specimens together for analysis and include metadata. This output object has a host of broad analyses that are based on the input information.
report
: This will generate a full report on the given config file(s) or input evaluated RDS file. By default, the report is produced as an html document, but this can be changed to a pdf if the correct latex libraries are installed. Additionally, all figures and tables can be output as independent files (pdf and png formats for figures and csv format for tables).
summary
: Similar to the report but with some reduced utility, this subcommand will output a single text file that overviews the data. This is readable on the terminal and is helpful for getting quick answers to data questions if working on the command line.
Additional subcommands: Used for cleanup and helpful suggestions for processing.
clean
: After processing, most intermediate data files are removed as they are designated temporary, but other files that are no longer needed, such as input data and log files, still exist within the run directory and may inflate its size. The clean subcommand will remove files that are no longer required. A "clean" run directory can still be used with eval, report, and summary. Additionally, this subcommand can remove the entire run directory by passing the --remove_proj flag.
hints
: Prints out a message with Snakemake option hints to help with using the run subcommand.
A workflow is simply how data is moved from an unprocessed state (like sequencing data off an Illumina sequencer) to a processed state (a final report). Below we will review the primary and auxiliary workflows iGUIDE is designed to handle.
In the primary workflow, we consider how to get from input sequence information to processed reports. To initiate this process, the user needs to gather the information and complete two files, the configuration file (config file) and the sample information file (sampleInfo file). These two files tell iGUIDE how to process the sequence information: sample-specific parameters should be included in the sampleInfo file, while constant parameters can simply be specified in the config file. Once these two files are completed, they can be deposited into their respective directories (config file --> iGUIDE/configs and sampleInfo file --> iGUIDE/sampleInfo). Additionally, if a supplemental file (supp file) is to be included, it is easiest to deposit this file with the sampleInfo file, in iGUIDE/sampleInfo.
Figure 2. Primary workflow for processing input sequencing files to processed runs with data deliverables like reports and figures.
With the config, sampleInfo, and potentially supp files in place, the user can
use iguide setup {path/to/[run].config.yml}
to create a new run directory.
In Figure 1, three runs have been developed, named proj1, proj2, and proj3.
Each of these would have a different config and sampleInfo file. With the files
in their respective directories, the user would run
iguide setup configs/proj1.config.yml
to create the "proj1" run directory
in the analysis directory, and then repeat the command with the other two config
files to have a total of three empty run directories under the analysis
directory.
Once the run directories are set up, the input data needs to be located. This can be done in a number of ways. In the config file, the user can specify the path to the sequence files (preferably not demultiplexed, see later sections for skipping demultiplexing). Alternatively, the user can create symbolic links to the data within the input_data directory of the run directory, or simply deposit the sequence files (fastq.gz) into the input_data directory.
With the config file, sampleInfo file, and sequencing files ready, the user can start processing with iguide run configs/{run}.config.yml. Recall that the run subcommand is built on a Snakemake workflow, so additional Snakemake options can be passed after -- when issuing the command. For example, iguide run configs/proj1.config.yml -- --cores 6 --nolock -k tells Snakemake to use 6 cores for processing, to not lock the working directory (helpful for running multiple processing runs at the same time), and to keep going even if one job has an error.
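When launching runs from a script, the placement of the -- separator is easy to get wrong. The following Python sketch is not part of iGUIDE, and build_run_command is a hypothetical helper. It only assembles the argument list for the example command above, using the flags already named in this section, and does not execute anything.

```python
import shlex
from typing import List, Sequence


def build_run_command(config_path: str, snakemake_args: Sequence[str] = ()) -> List[str]:
    """Assemble 'iguide run' arguments, placing Snakemake options after '--'."""
    cmd = ["iguide", "run", config_path]
    if snakemake_args:
        cmd.append("--")             # everything after this is passed to Snakemake
        cmd.extend(snakemake_args)
    return cmd


cmd = build_run_command(
    "configs/proj1.config.yml",
    ["--cores", "6", "--nolock", "-k"],
)
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run(cmd, check=True) on a machine where iGUIDE is installed and its environment is active.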
Allowing the iguide run command to go to completion will yield a processed data run. At this point, calling the same run command on the project should make Snakemake return a message indicating that there is nothing to do. If processing gets terminated for some reason, iguide run and Snakemake will pick up from where the processing left off.
If the user is content with the processing, then they can run the
iguide clean
command to clean up a specific run directory (shown in
Figure 3 below). This leaves the output data (useful in the auxiliary workflow)
and the reports, but will remove input_data and log files. Additionally if the
user wants to remove the run directory completely, they can also use the
iguide clean
command with an optional flag.
After running the primary workflow on several runs, or if the user would like to change specific parameters (gene lists, target sequences, ...) then the auxiliary workflow becomes quite useful.
There are three subcommands included in this workflow: eval, report, and summary. Each of them works in a similar way, but they have different outputs.
The iguide eval subcommand is a focal point of the auxiliary workflow. This command will process one or more runs and analyze them in a consistent manner, so the user can be confident they don't have a mixed data set. This subcommand will output a binary R-based file (*.rds) which can be read into an R environment with the function base::readRDS(). This file contains a host of analyses and can be used with the other two subcommands, report and summary.
The iguide report subcommand will output an html or pdf analysis of the evaluated dataset. This is the standard deliverable from the iGUIDE package. Additionally, the command can generate the figures and tables along with the report. iguide summary is very similar, but only generates a text-file based report. Both will take eval output files as input, but they can also be used with the same input as would be given to eval, that is, config file(s).
Supplemental files carrying specimen-based metadata can also be included in the auxiliary commands. Any specimen not indicated in the supp file will be dropped from the analysis. This means the user can select which samples are included in the analysis by specifying the associated specimens to include, even if the specimens are across multiple runs.
With this knowledge in hand, the remainder of the documentation should have more context as to how it is applied to processing data with the iGUIDE software.
- A relatively recent Linux computer with more than 2 GB of RAM
We do not currently support Windows or Mac. (iGUIDE may be able to run on Windows using the [WSL](https://docs.microsoft.com/en-us/windows/wsl/about), but it has not been tested).
To install iGUIDE, simply clone the repository to the desired destination:
git clone https://github.com/cnobles/iGUIDE.git
Then initiate the install using the install script. If the user would like the installed environment to be named something other than 'iguide', the new conda environment name can be provided to the install.sh script as shown below:
cd path/to/iGUIDE
bash install.sh
Or specify a different environment name:
cd path/to/iGUIDE
bash install.sh -e {env_name}
Additionally, help information on how to use install.sh can be accessed with the -h flag:
bash install.sh -h
If the user would like to run a test of the software during the installation, the install script has a -t option for this purpose. The command below will install the software with the environment named 'iguide' and test the software with the built-in simulated dataset during installation. Be ready for the testing to take a little bit of time, though (up to 30 mins or so):
bash install.sh -e iguide -t
Otherwise, the testing can be initiated after install using the following command:
bash etc/tests/test.sh {env} {cores}
Where {env}
would be the environment the user would like to test, "iguide"
by default, and {cores}
would be the number of cores to run the test on. The
test will complete faster given more cores.
The test dataset can be regenerated with a script provided in the iGUIDE/etc/tests/construct_scripts directory, simulate_incorp_data.R. This script is configured by a partner config file, sim_config.yml. By editing this configuration, the user can change the size of the simulated data output, rerun the script to generate new data, and develop a new test for iGUIDE:
cd etc/tests/construct_scripts
Rscript simulate_incorp_data.R sim_config.yml
There are two scripts included in the tools/rscripts directory that work with the simulated data. The first is designed to check the accuracy compared to the "truth" dataset that the simulated data was built on. To run that script, follow the command below:
Rscript tools/rscripts/check_test_accuracy.R configs/simulation.config.yml etc/tests/Data/truth.csv -v
The second script checks output files by their md5 digest; therefore, any changes to the test (including generating new data, changing the aligner, changing parameters, ...) could make the test fail:
Rscript tools/rscripts/check_file_digests.R etc/tests/simulation.digests.yml -v
Both testing scripts will exit with exit code 1 if they fail, which makes them easy to build into integration testing.
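To make the digest-based check more concrete, here is a minimal Python sketch of the general technique: comparing files against a table of expected md5 digests and reporting the mismatches. It is not the check_file_digests.R script and does not read the digests YAML file; the function names and the dictionary layout are assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def md5_digest(path: str, chunk_size: int = 65536) -> str:
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_digests(expected: Dict[str, str]) -> List[str]:
    """Return the paths that are missing or whose digest differs from the expected one."""
    failures = []
    for path, expected_digest in expected.items():
        if not Path(path).is_file() or md5_digest(path) != expected_digest:
            failures.append(path)
    return failures
```

A wrapper for integration testing could call sys.exit(1) whenever check_digests returns a non-empty list, which mirrors the exit-code behavior described above.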
Over time, components of iGUIDE will be updated, including environmental builds, the command-line interface (python library or lib), and the supporting R-package (iguideSupport or pkg), as well as the standard code base. To update these, pull the latest release from GitHub with the following command after installation:
git pull origin master
Once this has updated, the user should update their install by running the install script with the update option:
bash install.sh -u all
It is recommended to update everything if the user is unsure of what has been updated. If the user wants to update only specific parts of the software, though, they can use env, pkg, or lib after the -u flag to specify a component.
It is recommended that, after updating, the user rerun the testing scripts to make sure the software is working appropriately on the specified system.
To uninstall iGUIDE, the user will need to remove the environment and the directory.
To remove the environment and channels used with conda:
cd path/to/iGUIDE
bash etc/uninstall.sh
Or:
cd path/to/iGUIDE
bash etc/uninstall.sh {env_name}
If the user would rather remove only the environment created for iGUIDE, it is recommended to use conda directly. This will leave the channels within the conda config for use with other conda configurations:
conda env remove -n iguide
Or:
conda env remove -n {env_name}
To remove the iGUIDE directory and conda, the following two commands can be used:
# Remove iGUIDE directory and software
rm -r path/to/iGUIDE
# Remove conda
rm -r path/to/miniconda3
Configuration files, or configs for short, contain both run-related and pipeline-related information. This is by design. For reproducibility it is easiest to have what was processed and how it was processed in the same location. There should be one config file for each sequencing run to be processed. Below is a brief summary of how to 'configure' your config file to your specific run.
Config files need to be named in the format '{RunName}.config.yml', where {RunName} is a parameter set within the config file for the run. For example, the default run configuration file is named simulation.config.yml, so the run name is simulation.
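As a small illustration of this naming rule, the Python sketch below recovers the run name from a config file name. It is not part of iGUIDE and run_name_from_config is a hypothetical helper. The character check reflects the Run_Name description later in this documentation (alpha-numeric characters, underscores, and dashes, but no dots).

```python
import re
from pathlib import Path

CONFIG_SUFFIX = ".config.yml"
# Allowed run name characters: letters, digits, underscore, and dash.
RUN_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def run_name_from_config(config_path: str) -> str:
    """Extract '{RunName}' from a path ending in '{RunName}.config.yml'."""
    filename = Path(config_path).name
    if not filename.endswith(CONFIG_SUFFIX):
        raise ValueError(f"config file must end with {CONFIG_SUFFIX!r}: {filename!r}")
    run_name = filename[: -len(CONFIG_SUFFIX)]
    if not RUN_NAME_PATTERN.fullmatch(run_name):
        raise ValueError(f"run name contains unsupported characters: {run_name!r}")
    return run_name


print(run_name_from_config("configs/simulation.config.yml"))
```

Because the run name links files and directories together, a check like this can catch a mismatch between the file name and the Run_Name value before processing starts.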
Config files can be deposited anywhere in the user's directory, but a dedicated directory has been included in the release of iGUIDE. For convenience, config files can be placed in iGUIDE/configs/.
For sample specific information, input is more easily placed in a sampleInfo file. See the included section regarding sample info files.
Config files are in YAML format, but are broken into two parts. The first contains run specific information that should be filled out by an individual familiar with the sequence data used in the laboratory bench-side protocol. Additionally, they should be aware of the biochemistry related to the enzymes and sequences they are using.
The second part (below the divide ----) should be filled out by an individual familiar with the bioinformatic processing. Explanations of the different portions can be found in the following pages.
Run_Name
- This is the name of the sequencing run, and should only contain alpha-numeric characters. Underscores (_) and dashes (-) are also allowed within the run name parameters. Other symbols should not be included, such as a dot (.). The run name is further used by the software to link files and directories together, so it will need to be consistent whenever it is used. Examples include: iGUIDE_190201_B6V99, 181213_PD1_T-cell_exp.
Sample_Info
- This is a file path to the sample information file. It can either be an absolute file path or relative file path. If the file path is relative though, it will need to be relative to the Snakefile used by the iGUIDE software. For more information about this file, please see the Sample Information page.
Supplemental_Info
- Similar to Sample_Info, this is a file path to a supplementary file which can contain information related to experimental parameters or patient information. This will be used during the report output, which will group samples with identical parameters. The format for this file is quite loose, and it only requires a single column, Specimen, which should match the names of specimens in the sample information file. For more information about this file, please see the Supplemental Information page. If no file is to be used, set the value for this parameter to "." and make sure to set suppFile in the run portion of the config to FALSE.
Ref_Genome
- This is a designation for the reference genome to be used during processing. The genome will need to be included in the R libraries through Bioconductor prior to running the software. The human genome draft hg38 is included by default. Please see information on the Bioconductor package 'BSgenome' for installing alternative genomes.
Aligner
- Options include either 'blat' or 'bwa', though at this time, only 'blat' is supported. Future versions of iGUIDE may support other alignment software. Please contact the maintainers if you have a favorite you would like to see listed here.
UMItags
- This is a logical parameter indicating whether to use unique molecular index (UMI) sequence tags ('TRUE') or to only use unique fragment lengths ('FALSE'; see SonicAbundance, https://doi.org/10.1093/bioinformatics/bts004) to quantify abundances of unique observations.
Seq_Path
- This is the file path to the sequence files. Rather than repeating the path for each below, just include the path to the directory containing the files.
R1 / R2 / I1 / I2
- These parameters should be the file names of the sequence files to be analyzed by the iGUIDE software. It is recommended to pass complete sequencing files to iGUIDE rather than demultiplexing prior to analysis.
Sample_Name_Column
- This is the name of the column in the sample information file which contains identifiable information about samples. An appropriate format for the sample names is "{specimen}-{rep}" where 'specimen' is an alpha-numeric designator for the specimen and 'rep' is a numeric identifier for technical or biological replicates, separated by a dash (-). Replicates will be pooled during the final analysis, so if you want them to be separate in the report, make sure you give each specimen a different identifier. For example, iGSP0002-1 and iGSP0002-2 will be pooled together for the report and analysis, but iGSP0002-1 and iGSP0003-1 will not. These names will be used in naming files, so do not include any special characters that will confuse file management. Try to stick to common delimiters, such as "-" and "_". A good practice is to put specimen identifiers at the beginning, replicate identifiers at the end following a "-", and anything else descriptive in the middle. For example, iGSP0002-neg-1 can specify the priming orientation the sample was processed with.
R{1/2}_Leading_Trim
- Sequence to be removed from the 5' or beginning of the R1 or R2 sequences. Commonly a linker or fixed sequence that is part of the priming scheme during amplification. If no sequence should be removed, just include ".". If the sequence is sample or specimen specific, it can be included in the sample information file and indicated in these fields as "sampleInfo:{column}", where 'column' is the column name with the data in the sample information file.
R{1/2}_Overreading_Trim
- Similar to the Leading_Trim parameters, these parameters indicate the sequence that should be removed from the 3' or end of the reads if it is present. Again, if no sequence should be removed, use a "." or, if the data is present in the sample information file, "sampleInfo:{column}".
R2_Leading_Trim_ODN
- This is a key parameter difference between iGUIDE and the original GUIDEseq method. This parameter indicates the sequence that is part of the dsODN but is not primed against. This sequence should directly follow the R2_Leading_Trim sequence and should be a reverse complement of the beginning of the R1_Overreading_Trim sequence if the iGUIDE dsODN is being used. For GUIDEseq, simply include ".", or if you have multiple sequences, then specify in the sample information file as "sampleInfo:{column}".
Target_Sequences
- This parameter specifies the target sequences, not including the PAM sequences for guide RNAs. An acceptable input format would be {target_name} : "{sequence}" (e.g. B2M.3 : "GAGTAGCGCGAGCACAGCTANGG"), and additional target sequences can be included, one per line, each indented at the same level. The input format of {target_name} : {target_seq} needs to be maintained for proper function. The 'target_name' in this situation will need to match the 'target_name' used in the On_Target_Sites and Treatment parameters. 'target_name' should follow a common format and use standard delimiters, such as "-", "_", and ".". For example: B2M.3, TRAC.1.5, TruCD33v5.
On_Target_Sites
- This parameter indicates the specific location for editing by the target enzyme. There should be one line for each on-target site, even if there is more than one on-target site for a given target sequence. Typically the input format should follow {target_name} : "{seqname}:{+/-}:{position}", where 'target_name' matches the name of the given target sequence, and if multiple on-target sites exist, then the names can be expanded using a {target_name}'# notation. Additionally, the notation can be expanded to {target_name} : "{seqname}:{+/-/*}:{min.position}-{max.position}", where '*' indicates either orientation and 'min.position' and 'max.position' represent the numerical range for the on-target site. The value for each on-target site specifies the location or genomic coordinates of nuclease activity. The 'seqname' indicates the chromosome or sequence name, an orientation of '+' or '-' is given to the location depending on the editing orientation (in line with positional numbering is '+' and opposite is '-', unknown or both is '*'), and the 'position' or 'min/max.position' indicates the nucleotide(s) of editing. For Cas9, the position of editing is commonly between the 3rd and 4th nucleotide from the 3' end of the targeting sequence (not including the PAM). Being off by a nucleotide or so will not cause any problems. Example below:
On_Target_Sites :
  TRAC.5 : "chr14:+:22547664"
  TRBC.4'1 : "chr7:+:142792020"
  TRBC.4'2 : "chr7:+:142801367"
  PD1.3 : "chr2:-:241858808"
  TRAC.3.4 : "chr14:-:22550616-22550625"
  B2M.3 : "chr15:*:44711569-44711570"
  CIITA.15.1 : "chr16:+:10916399"
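The on-target site notation can be checked before a run with a small parser. The Python sketch below is an illustration only, not the parser iGUIDE uses; the class and function names are hypothetical. It accepts both the single-position and the min-max forms described above and treats a single position as a range of length one.

```python
import re
from typing import NamedTuple


class OnTargetSite(NamedTuple):
    seqname: str  # chromosome or sequence name, e.g. "chr14"
    strand: str   # "+", "-", or "*" (either or unknown orientation)
    start: int    # position, or min.position for a range
    end: int      # same as start for a single position, else max.position


SITE_PATTERN = re.compile(
    r"(?P<seqname>[^:]+):(?P<strand>[+\-*]):(?P<start>\d+)(?:-(?P<end>\d+))?"
)


def parse_on_target_site(value: str) -> OnTargetSite:
    """Parse '{seqname}:{+/-/*}:{position}' or '{seqname}:{+/-/*}:{min}-{max}'."""
    match = SITE_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid on-target site: {value!r}")
    start = int(match["start"])
    end = int(match["end"]) if match["end"] else start
    if end < start:
        raise ValueError(f"max.position is smaller than min.position: {value!r}")
    return OnTargetSite(match["seqname"], match["strand"], start, end)


print(parse_on_target_site("chr14:+:22547664"))
print(parse_on_target_site("chr15:*:44711569-44711570"))
```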
Treatment
- This parameter indicates how samples were treated. If samples were all treated differently, then this information can be included in the sample information file as all : "sampleInfo:{column}" where 'column' is the name of the column with the information. If a single sample was treated with more than one target sequence, then delimit multiple target names by a semicolon (;), e.g. all : "B2M;TRAC;TRBC". Additionally, each specimen can be indicated individually on a new line. Only specimen names should be given here and provided individually, not sample identifiers. This means that if your sample names follow the suggested format, "{specimen}-{replicate}", you would only specify the "{specimen} : {treatment}" underneath this parameter.
Specimen nuclease treatment
Nuclease
- Similar to target treatment above, this parameter dictates which nuclease(s) were used on the specimens. This refers to the class of nuclease, such as Cas9 or Cpf1, which behave differently when they edit DNA. Notation can follow the same format as above: if all specimens were treated with the same class of nuclease, then just specify 'all : "{nuclease_profile}"', or list them out by specimen. Additionally, you can specify the column in sampleInfo in the same format as above. Currently, iGUIDE does not support processing for specimens with multiple classes of nuclease profiles. Only one profile can be specified per specimen.
Nuclease_Profiles
- See below section on nuclease profiles.
Below are parameters that are used to process the large amount of data, such as setting memory suggestions if resources are specified or parameters for sequence alignments. While these figures may not be relevant to the bench scientist, they are particulars for computational scientists.
Resource management is not required, but it can help when using HPC or limiting jobs. You are encouraged to spend some time optimizing if you would like; these parameters work well on the designer's platform.
Read_Types
- This parameter should include which read types will be used in the analysis, e.g. ["R1", "R2", "I1", "I2"]. This follows list notation in Python. If only single barcoding or some other method is employed and a read type is not included, simply leave it out of the list.
Genomic_Reads
- This parameter is similar to Read_Types but only indicates which reads contain genomic information rather than indexing.
readNamePattern
- This is a regex pattern with which to gather read names. It should not make the read name sequencing orientation specific; R1 and R2 should have the same read name. The default, [\w\:\-\+]+, works well for Illumina based read names. For R-based scripts to interpret the regex correctly, you will need to use double escapes, [\\w\\:\\-\\+]+.
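The behavior of the default pattern can be seen with a short Python sketch. The two FASTQ headers below are made-up examples of Illumina-style read names, and the helper name is hypothetical; the sketch only shows that the pattern stops at the first space, so the paired reads share one name.

```python
import re

# Default pattern from the config, written as a Python raw string.
READ_NAME_PATTERN = re.compile(r"[\w\:\-\+]+")

# The double-escaped form spells the same pattern in an ordinary string literal.
assert "[\\w\\:\\-\\+]+" == r"[\w\:\-\+]+"


def read_name(fastq_header: str) -> str:
    """Return the orientation-independent read name from a FASTQ header line."""
    match = READ_NAME_PATTERN.match(fastq_header.lstrip("@"))
    if match is None:
        raise ValueError(f"no read name found in header: {fastq_header!r}")
    return match.group(0)


# Hypothetical headers for the two reads of one pair.
r1_header = "@M00123:45:000000000-ABCDE:1:1101:15589:1332 1:N:0:1"
r2_header = "@M00123:45:000000000-ABCDE:1:1101:15589:1332 2:N:0:1"
print(read_name(r1_header) == read_name(r2_header))
```

The part after the space, which differs between R1 and R2, is never matched because a space is not in the character class.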
defaultMB / demultiMB / trimMB / filtMB / consolMB / alignMB / qualCtrlMB / assimilateMB / evaluateMB / reportMB
- Controls the amount of memory allocated to each of these processes during Snakemake processing. While working on a server or multicored machine, these parameters will work internally to help schedule jobs. Each value will act as an upper limit for the amount of MB of RAM to expect the process to take, and schedule jobs appropriately using the --resources mem_mb={limitMB} flag with Snakemake. During HPC use, these parameters can be combined with the cluster config to schedule specific memory requirements for jobs. Additionally, if --restart-times {x} is used, where "x" is the number of times to restart a job if it fails, then the amount of memory for the job will increase by a unit of the parameter. For example, if a trimming job fails because it runs out of memory, then restarting the job will try to allocate 2 times the memory for the second attempt. All parameters should be in megabytes (MB).
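The restart behavior amounts to simple arithmetic, sketched below in Python. This is an illustration of the rule as described, not code taken from the iGUIDE Snakefile; the function name and the 4000 MB value are hypothetical. It assumes the attempt counter starts at 1 for the first try.

```python
def memory_for_attempt(base_mb: int, attempt: int) -> int:
    """Memory requested for a job: one unit of the parameter per attempt."""
    if attempt < 1:
        raise ValueError("attempt numbering starts at 1")
    return base_mb * attempt


trim_mb = 4000  # hypothetical trimMB value, in megabytes
for attempt in (1, 2, 3):
    print(attempt, memory_for_attempt(trim_mb, attempt))
```

Snakemake can pass an attempt counter to callable resource definitions, so a rule could express the same idea along the lines of mem_mb = lambda wildcards, attempt: attempt * config["trimMB"]. Whether iGUIDE uses exactly this form is an assumption.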
skipDemultiplexing
- Logical (either TRUE or FALSE) to indicate if demultiplexing should be carried out. If TRUE, sequence files (*.fastq.gz) need to be placed or linked in the input_data directory of an existing project directory (as with iguide setup), one sequence file for each type (R1, R2, I1, I2). These need to be identified in the "Run" portion of the config file. If FALSE, then demultiplexed files need to be included in the input_data directory of an existing project directory. The files need to be appropriately named, in the format of {sampleName}.{readtype}.fastq.gz, where sampleName matches the 'sampleName' column found in the associated 'sampleInfo' file, and readtype is R1, R2, I1, or I2. If UMItags is FALSE, then only R1 and R2 file types are required for analysis; if UMItags is TRUE, then I2 is a required file type as well.
barcode{1/2}Length
- Integer values indicating the number of nucleotides in the barcodes or indexing sequences.
barcode{1/2}
- Character values (e.g. "I1") indicating which reads to find the associated indexing information for demultiplexing.
bc{1/2}Mismatch
- An integer value indicating the number of tolerated mismatches in the barcode sequences for either barcode 1 or 2.
R{1/2}leadMismatch
- Integer values indicating the number of allowed mismatches in either R1 or R2 leading sequence trimming. Recommend to set to less than 10% error.
R2odnMismatch
- Integer value indicating the number of allowed mismatches in the unprimed ODN sequence, typically should be set to 0.
R{1/2}overMismatch
- Integer values indicating the number of allowed mismatches in either R1 or R2 overreading trimming. This is converted into a percent matching and should be thought of as a number of mismatches allowed out of the total length of the overreading trim sequence.
R{1/2}overMaxLength
- Searching for overread trimming in sequences can be time consuming while not producing different results. For this reason, the total length of the searched-for sequence can be limited here. For example, if ATGCGTCGATCGTACTGCGTTCGAC is used as the overreading sequence and 5 mismatches are allowed, then the tolerance is 5/25 mismatches, or 80% matching. If the maximum length is set to 20, only the first 20 nucleotides of the sequence, ATGCGTCGATCGTACTGCGT, will be aligned for overread trimming. With an 80% matching requirement, 16 out of 20 nucleotides will need to align for overread trimming to be initiated.
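The arithmetic in this example can be checked with a short Python sketch. The function name is hypothetical, and rounding the required number of matches up to a whole nucleotide is an assumption; the sequence and values are the ones used in the text.

```python
def overread_requirement(over_seq, mismatches, max_length):
    """Return (percent matching, searched sequence, required matches).

    The percent matching is derived from the full overreading sequence,
    then applied to the (possibly truncated) sequence that is searched.
    """
    matches = len(over_seq) - mismatches
    percent_match = 100 * matches / len(over_seq)
    searched = over_seq[:max_length]
    # Ceiling division, assuming partial nucleotides are rounded up.
    required = -(-(matches * len(searched)) // len(over_seq))
    return percent_match, searched, required


percent, searched, required = overread_requirement(
    "ATGCGTCGATCGTACTGCGTTCGAC", mismatches=5, max_length=20
)
print(percent, searched, required)
```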
bins
- The number of bins to separate filtered sequences into for parallel processing. Increasing the number of bins can help spread out the work required for processing and keep memory requirements lower.
level
- The number of reads that should be targeted for each bin. Bins are filled to the level amount, leaving the remaining bins empty if earlier bins already contain all the reads. If all bins would "overflow", then reads are instead distributed evenly across the bins.
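One way to picture how bins and level interact is the following Python sketch. It is an interpretation of the description above rather than the iGUIDE implementation; the function name is hypothetical, and the handling of remainders in the even split is an assumption.

```python
def fill_bins(n_reads, bins, level):
    """Return the number of reads assigned to each bin."""
    if n_reads > bins * level:
        # Every bin would overflow, so spread the reads evenly instead.
        base, extra = divmod(n_reads, bins)
        return [base + (1 if i < extra else 0) for i in range(bins)]
    counts = []
    remaining = n_reads
    for _ in range(bins):
        # Fill each bin up to the level, then leave later bins empty.
        take = min(level, remaining)
        counts.append(take)
        remaining -= take
    return counts


print(fill_bins(250, bins=5, level=100))
print(fill_bins(1003, bins=5, level=100))
```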
BLATparams
- A character string to be included with the BLAT call. A suggested example has been provided in the simulation config file. For options, please see the BLAT help by typing blat into the command line after activating iguide.
BWAparams
- A character string to be included with the BWA call. A suggested example has been provided in the simulation config file. For options, please see the BWA help by typing bwa mem into the command line after activating iguide.
maxAlignStart
- Integer value indicating the number of nucleotides at the beginning of the alignment that will be allowed to not align. Another way of thinking of this is the maximum start position on the query rather than the target reference. A default value of 5 means that the alignment needs to start in the first 5 nucleotides or the alignment is discarded during quality control filtering.
minPercentIdentity
- This is a value between 0 and 100 indicating the minimum global percent identity allowed for an alignment. If an alignment has less, it is discarded during quality control filtering.
{min/max}TempLength
- Specify the minimum (min) and maximum (max) template length expected. Joined alignments between R1 and R2 that are outside of this range are considered artifacts and are discarded or classified as chimeras.
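The three alignment filters above (maxAlignStart, minPercentIdentity, and {min/max}TempLength) can be read as simple threshold checks. The Python sketch below is illustrative only: the class, function names, and the threshold values other than the maxAlignStart default of 5 are hypothetical, and the start position is assumed to be 1-based on the query.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    query_start: int         # assumed 1-based start position on the read
    percent_identity: float  # global percent identity, 0 to 100


def passes_alignment_qc(aln, max_align_start=5, min_percent_identity=95.0):
    """Keep alignments that start early enough and are similar enough."""
    return (
        aln.query_start <= max_align_start
        and aln.percent_identity >= min_percent_identity
    )


def template_length_ok(length, min_temp_length, max_temp_length):
    """Check a joined R1/R2 template length against the expected range."""
    return min_temp_length <= length <= max_temp_length


print(passes_alignment_qc(Alignment(query_start=3, percent_identity=98.5)))
print(passes_alignment_qc(Alignment(query_start=12, percent_identity=98.5)))
print(template_length_ok(3000, min_temp_length=30, max_temp_length=2500))
```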
refGenes / oncoGeneList / specialGeneList
- These are special reference files stored either as text or as Bioconductor GenomicRanges objects. They can be in an '.rds' format or table format ('.csv' or '.tsv'). The file parameter should indicate the file path to the file (relative paths should be relative to the Snakefile), and the symbolCol parameter should indicate the column in the data object which contains the reference names to be used in the analysis.
maxTargetMismatch
- The maximum number of mismatches between the reference genome and target sequence allowed for consideration to be a target matched incorporation site. This is an integer value and is compared to the target sequence(s).
upstreamDist
- The distance upstream of the incorporation site to look for a target similar sequence within the criteria specified by maxTargetMismatch.
downstreamDist
- The distance downstream of the incorporation site to look for a target similar sequence within the criteria specified by maxTargetMismatch.
pileUpMin
- An integer value indicating the number of alignments required to overlap before being considered a 'pileUp'.
recoverMultihits
- While multihit alignments are often difficult to analyze, some information can still be gleaned from the data given reasonable assumptions. Setting this parameter to TRUE will still only focus on sites that are uniquely mapped, but if a multihit includes a unique site and other locations, contributions are given to the unique site location. Further, reads and their contributions, umitags and fragments, are not double counted but instead evenly distributed to all included unique sites. Note that some sequencing artifacts may arise in "off-target" associated sites, and users should be careful about concluding anything from these alignment artifacts. Leaving this option as FALSE is recommended if the user does not have a target sequence that is located in a repetitive sequence.
suppFile
- Logical (TRUE or FALSE) indicating if the supplemental file provided in Supplemental_Info should be used in the default report generated at the end of processing. If set to FALSE, the Supplemental_Info parameter is not required for processing.
{tables/figures}
- Logicals indicating if tables and figures should be generated from the report. Data will be included under the reports directory in the project run directory. If set to TRUE, figures will be generated in both PDF and PNG formats at 300 dpi, while tables will be generated in a comma-separated values (csv) format.
reportData
- Logical indicating if an RData object should be saved during the report generation in the reports directory.
infoGraphic
- Logical indicating if an info graphic displaying the genomic distribution of incorporations should be generated at the beginning of the report. Besides being aesthetically pleasing, the graphic gives the report a unique twist and can provide the knowledgeable user with information about the report at the very beginning.
signature
- Character string included at the beginning of reports to denote the author, analyst, laboratory, etc. Make sure you change it if you don't want Chris getting credit for your work.
An additional component of the first part of the config file is the Nuclease Profiles. The user can specify which nuclease they are using and include a profile to help identify edit sites. Nucleases can range from Cas9 to Cpf1 or TALEN-based nickases.
Note: For TALEN and dual flanking nickases or nucleases, each side will need to be input as a different target. Specify the sequence in Target_Sequences and the actual editing site in On_Target_Sites. Make sure you include two distinct identifiers for the sequences and on-target sites, then specify the target treatment as {target_seq1};{target_seq2}.
Any name can be given in the Nuclease section, but that name needs to match the profile name as well. So if you want to call it "Cas9v2", then just make sure you have a profile named "Cas9v2".
Below is some ASCII art that indicates the differences between nucleases. Additionally, below the art are example profiles for input into the iGUIDE software:
    Editing strategies by designer nucleases

    Cas9 :
                     ><   PAM
    ATGCATGCATGCATGCATGCA TGG   (sense strand)

     TGCATGCATGCATGCATGCA NGG   # gRNA
     |||||||||||||||||||| |||
    TACGTACGTACGTACGTACGT ACC   (anti-sense strand)
                     ><         # Dominant cutpoint

    Cpf1 : Also known as Cas12a (similar nuclease structure for CasX)
                           ><   # Dominant cutpoint
    GTTTG ATGCATGCATGCATGCATGCATGCATGC   (sense strand)
     PAM
     TTTV ATGCATGCATGCATGCATGCA   # gRNA, nuclease activity leaves overhang
     |||| |||||||||||||||||||||
    CAAAC TACGTACGTACGTACGTACGTACGTACG   (anti-sense strand)
                                ><   # Dominant cutpoint

    TALEN : Protein-DNA binding domain fused with FokI nickase
    ATATATATATATATATATAT GCATGCATGCATGCAT GCGCGCGCGCGCGCGCGCGC   (sense strand)
    \\\\\\\\\\\\\\\\\\\\ |------->
                                <-------| \\\\\\\\\\\\\\\\\\\\
    TATATATATATATATATATA CGTACGTACGTACGTA CGCGCGCGCGCGCGCGCGCG   (anti-sense strand)
    # Proteins bind flanking the cleavage site and cut in the "insert" sequence.

    CasCLOVER : Clo051 or another nickase with CRISPR-based binding domains
    ATCCT ATGCATGCATGCATGCATGC TTAACCGGTTAACCGG TACGTACGTACGTACGTACG CGGTC   (sense strand)
      ||| ||||||||||||||||||||
      PAM   Target Sequence    \------->
                                      <-------\   Target Sequence    PAM
                                                |||||||||||||||||||| |||
    TAGGA TACGTACGTACGTACGTACG AATTGGCCAATTGGCC ATGCATGCATGCATGCATGC GCCAG   (anti-sense strand)
Below are the example profiles:
    Nuclease_Profiles :
      Cas9 :
        PAM : "NGG"
        PAM_Loc : "3p"
        PAM_Tol : 1
        Cut_Offset : -4
        Insert_size : FALSE

      Cpf1 :
        PAM : "TTTV"
        PAM_Loc : "5p"
        PAM_Tol : 1
        Cut_Offset : 26     #(Anywhere between 23 and 28)
        Insert_size : FALSE

      CasX :
        PAM : "TTCN"
        PAM_Loc : "5p"
        PAM_Tol : 1
        Cut_Offset : 22     #(Anywhere between 16 and 29)
        Insert_size : FALSE

      TALEN :
        PAM : FALSE
        PAM_Loc : FALSE
        PAM_Tol : 0
        Cut_Offset : Mid_insert
        Insert_size : "15:21"

      CasCLOVER :
        PAM : "NGG"
        PAM_Loc : "3p"
        PAM_Tol : 1
        Cut_Offset : Mid_insert
        Insert_size : "10:30"
PAM
- protospacer adjacent motif - should be specified here and can contain ambiguous nucleotides.
PAM_Loc
- indicates the location of the PAM with respect to the pattern, either '5p', '3p' or FALSE.
PAM_Tol
- indicates the tolerance for mismatches in the PAM sequence (ignored if PAM is FALSE).
Cut_Offset
- indicates the offset from the 5' nucleotide of the PAM sequence where the nuclease creates a double strand break. If PAM is FALSE, the offset is taken from the 5' position of the target sequence instead. Also accepts "mid_insert" to specify the middle of the region between paired alignments.
Insert_size
- is used if target sequences are expected to flank each other for editing, such as with TALENs, and indicates the expected size of the insert. To input a range, delimit the min and max by a colon, i.e. 15:21.
All names of nucleases used to treat specimens need to have a profile. Additional profiles should be added under the 'Nuclease_Profiles' parameter.
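To make the PAM, PAM_Tol, and Insert_size fields more concrete, here is a minimal Python sketch of how an ambiguous PAM could be compared against an observed sequence and how a colon-delimited range could be parsed. The function names are hypothetical and this is not the iGUIDE implementation; the ambiguity table follows the standard IUPAC nucleotide codes.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pam_mismatches(pam, observed):
    """Count positions where the observed base is not allowed by the PAM."""
    if len(pam) != len(observed):
        raise ValueError("PAM and observed sequence must be the same length")
    return sum(
        base not in IUPAC[code]
        for code, base in zip(pam.upper(), observed.upper())
    )


def pam_matches(pam, observed, tolerance):
    """True when the number of mismatches is within the PAM_Tol value."""
    return pam_mismatches(pam, observed) <= tolerance


def parse_insert_size(value):
    """Split an Insert_size range such as "15:21" into (min, max)."""
    low, high = value.split(":")
    return int(low), int(high)


print(pam_matches("NGG", "TGA", tolerance=1))
print(pam_matches("TTTV", "TTTT", tolerance=0))
print(parse_insert_size("15:21"))
```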
Sample information files (or sampleInfo files) contain information that may change from specimen to specimen. They need to contain at least 3 columns of information: sample names, barcode 1, and barcode 2 sequences. Additionally, other parameters defined in the config file can be defined in the sample information file if they change from specimen to specimen.
The run-specific config file will need to point to the sample information file. For convenience, a directory can be found at iGUIDE/sampleInfo/ for depositing these files.
SampleInfo files need to have a specific naming format that follows '{RunName}.sampleinfo.csv'.
An appropriate format for the sample names is "{specimen}-{rep}" where 'specimen' is an alpha-numeric designator for the specimen and 'rep' is a numeric identifier for technical or biological replicates, separated by a dash (-). Replicates will be pooled during the final analysis, so if you want them to be separate in the report, make sure you give each specimen a different identifier.
For example, iGSP0002-1 and iGSP0002-2 will be pooled together for the report and analysis, but iGSP0002-1 and iGSP0003-1 will not. These names will be used in naming files, so do not include any special characters that will confuse file management. Try to stick to common delimiters, such as - and _. Using a dot (.) as a delimiter is not currently supported.
A good practice is to put specimen identifiers at the beginning, replicate identifiers at the end following a "-", and anything else descriptive in the middle. For example, iGSP0002-neg-1 can specify the orientation the sample was processed with.
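The naming convention above can be illustrated with a small Python sketch that splits a sample name into its specimen, descriptive, and replicate parts and then groups replicates by specimen. The function name is hypothetical and the validation rules are assumptions drawn from the text (dash-delimited, numeric replicate at the end, no dots).

```python
from collections import defaultdict


def split_sample_name(sample_name):
    """Split "{specimen}-{info}-{rep}" or "{specimen}-{rep}" into parts."""
    if "." in sample_name:
        raise ValueError("dots are not supported as delimiters")
    parts = sample_name.split("-")
    if len(parts) < 2 or not parts[-1].isdigit():
        raise ValueError("expected a numeric replicate after the last dash")
    specimen = parts[0]
    info = "-".join(parts[1:-1])  # empty when no descriptive part is given
    return specimen, info, int(parts[-1])


# Replicates that share a specimen identifier are pooled together.
pooled = defaultdict(list)
for name in ["iGSP0002-1", "iGSP0002-2", "iGSP0003-1", "iGSP0002-neg-1"]:
    specimen, _, _ = split_sample_name(name)
    pooled[specimen].append(name)

print(dict(pooled))
```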
Supplemental information files (or supp files) contain information that may change from specimen to specimen. They have only one required column, "Specimen", but subsequent columns will be used to define conditions. Let's use the supp file below as an example:
    # Supplemental csv file example, padding included for visualization
    Specimen, Nuclease, gRNA
    iGXA,     Cas9,     TRAC
    iGXB,     Cas9,     TRAC
    iGXC,     Cas9,     B2M
    iGXD,     Cas9,     B2M
    iGXE,     Mock,     Mock
    iGXF,     Mock,     Mock
This type of setup would indicate that there are 6 specimens to be analyzed (iGXA - iGXF). Each of these would correlate with their sampleName'd replicates, so for iGXA, all samples with the format iGXA-{number} or iGXA-{info}-{number} would be pooled into the iGXA specimen.
Additionally, there are three conditions, defined by the distinct data excluding information in the "Specimen" column. So in this case, the conditions are "Cas9-TRAC", "Cas9-B2M", and "Mock-Mock". Within the report format, there are several analyses that are conditionally based rather than specimen based. This adds to the flexibility and utility of the reporting functions supplied with iGUIDE.
If the user would rather have every specimen analyzed independently and reported in that manner, then they can either run a report without a supp file or include a column in the supp file that distinguishes each specimen from the others.
Column names and formatting are transferred directly into the report. Additionally, this file sets the order presented in the report. If "iGXC" comes before "iGXB" in the supp file, then it will be ordered as such throughout the report. Conditions follow this format as well. As presented above, the report will order the conditions as "Cas9-TRAC", "Cas9-B2M", and "Mock-Mock", which is the order of first observation.
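The way conditions are derived from a supp file can be sketched in Python as follows. This is only an illustration of the idea, assuming conditions are formed by joining the non-Specimen columns with a dash as in the example names above; the function name is hypothetical.

```python
import csv
import io

SUPP = """Specimen, Nuclease, gRNA
iGXA, Cas9, TRAC
iGXB, Cas9, TRAC
iGXC, Cas9, B2M
iGXD, Cas9, B2M
iGXE, Mock, Mock
iGXF, Mock, Mock
"""


def read_conditions(text):
    """Map each condition to its specimens, in order of first observation."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    conditions = {}
    for row in reader:
        specimen = row["Specimen"].strip()
        condition = "-".join(
            value.strip() for key, value in row.items() if key != "Specimen"
        )
        # Plain dicts keep insertion order, which preserves the file order.
        conditions.setdefault(condition, []).append(specimen)
    return conditions


for condition, specimens in read_conditions(SUPP).items():
    print(condition, specimens)
```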
Once the config and sampleInfo files have been configured, a run directory can be created using the command below where {ConfigFile} is the path to your configuration file:
    cd path/to/iGUIDE
    iguide setup {ConfigFile}
The directory should look like this (RunName is specified in the ConfigFile):
    > tree analysis/{RunName}
    analysis/{RunName}/
    ├── config.yml -> {path to ConfigFile}
    ├── input_data
    ├── logs
    ├── output
    ├── process_data
    └── reports
Components within the run directory:
- config.yml - This is a symbolic link to the config file for the run
- input_data - Directory where input fastq.gz files can be deposited
- logs - Directory containing log files from processing steps
- output - Directory containing output data from the analysis
- process_data - Directory containing intermediate processing files
- reports - Directory containing output reports and figures
As a current convention, all processing is done within the analysis directory. The above command will create a directory under the analysis directory for the run specified by the config ('/iGUIDE/analysis/{RunName}'). At the end of this process, iGUIDE will give the user a note to deposit the input sequence files into the /analysis/{RunName}/input_data directory. Copy the fastq.gz files from the sequencing instrument into this directory if you do not have paths to the files specified in the config file.
iGUIDE typically uses each of the sequencing files (R1, R2, I1, and I2) for processing since it is based on a dual barcoding scheme. If I1 and I2 are concatenated into the read names of R1 and R2, it is recommended that you run bcl2fastq ... --create-fastq-for-index-reads on the machine output directory to generate the I1 and I2 files.
As iGUIDE has its own demultiplexing, it is recommended not to use the Illumina machine demultiplexing through input of index sequences in the SampleSheet.csv. If your sequence data has already been demultiplexed though, please see the usage section for setup instructions.
As long as the config and sampleInfo files are present and in their respective locations, you can get a quick view of what samples are related to the project. Using the iguide list_samples command will produce an overview table on the console or write the table to a file (specified by the output option). Additionally, if a supplemental information file is associated with the run, the data will be combined with the listed table:
    > iguide list_samples configs/simulation.config.yml

    Specimen Info for : simulation.

     specimen   replicates       gRNA        nuclease
    ---------- ------------ --------------- ----------
       iGXA         1            TRAC         Cas9v1
       iGXB         1        TRAC;TRBC;B2M    Cas9v1
       iGXD         1             NA            NA
Once the input_data directory has the required sequencing files, the run can be processed using the following command:
    cd path/to/iGUIDE/
    iguide run {ConfigFile}
Snakemake offers a great number of resources for managing the processing through the pipeline. I recommend familiarizing yourself with the utility (https://snakemake.readthedocs.io/en/stable/). Here are some helpful snakemake options that can be passed to iGUIDE by appending them to the iguide command after --:
- [--cores X] - multicore processing, where X specifies the number of cores to use.
- [--nolock] - prevents locking of the working directory, allowing multiple sessions to run at the same time.
- [--notemp] - keeps all temporary files, which are otherwise removed.
- [-k, --keep-going] - keeps processing if one or more jobs error out.
- [-w X, --latency-wait X] - waits X seconds for the output files to appear before erroring out.
- [--restart-times X] - X is the number of times to restart a job if it fails. Defaults to 0, but is used in iguide to increase memory allocation.
- [--resources mem_mb=X] - defines resources; for iguide, mem_mb is the amount of memory, in MB, to allow for the whole run. For HPC, this can be coupled with --cluster-config to request specific resources for each job.
- [--rerun-incomplete, --ri] - re-runs all jobs whose output is recognized as incomplete, useful if your run gets terminated before finishing.
- [--cluster-config FILE] - a JSON or YAML file that defines wildcards used for HPC.
After the iguide run command has completed, the final run directory will contain a number of output and report files depending on the config parameters. Additionally, if the user is content with the analysis, they can use the iguide clean command to "clean up" the run directory. This will remove input data files, log files, and any remaining process data files, but will leave output and report files. This makes the "cleaned" run directories still compatible with the auxiliary workflow. A clean run directory will look something like the tree below:
    > tree analysis/{RunName}
    analysis/{RunName}/
    ├── config.yml -> {path to ConfigFile}
    ├── input_data
    ├── process_data
    ├── logs
    ├── output
    │   ├── incorp_sites.{RunName}.rds
    │   ├── stats.core.{RunName}.csv
    │   └── stats.eval.{RunName}.csv
    └── reports
        ├── report.{RunName}.html
        ├── runstats.{RunName}.html
        └── summary.{RunName}.txt
There are several standard output files. The incorp_sites.{RunName}.rds is the intermediate object that can be reprocessed into the final data object and reports if the user would like to change most parameters. The stats files contain processing related information in a condensed form. These stats can be viewed in a more interpretable fashion in the runstats.{RunName}.html report.
The report.{RunName}.html is the main data analysis report. The summary is a similar report but in a text-based format. There are ample descriptions within the report template that will be included with the report. If the user would like to customize this report, they can modify the report template, found at tools/rscripts/report_templates/iGUIDE_report_template.Rmd. Custom Rmd templates can also be provided through the iguide report command, which will use eval output objects to "knit" reports in html or pdf output formats.
Should you have any questions or comments and would like to contact the maintainer and designer of the iGUIDE software, please send an email to Chris [dot] L [dot] Nobles [at] Gmail [dot] com, with iGUIDE in the subject.