SimPlot++ is an open-source multi-platform application designed by Stéphane Samson, Étienne Lord and Vladimir Makarenkov (Université du Québec à Montréal). It is implemented in Python. SimPlot++ produces publication-ready SimPlot and BootScan plots using 43 nucleotide and 20 amino acid distance models. Intergenic and intragenic recombination events can be identified using the Phi, χ2, NSS and Proportion tests. SimPlot++ also generates and analyzes interactive sequence similarity networks, supports multiprocessing, and provides distance calculability diagnostics.
SimPlot++ offers the following features:
- Create and save consensus sequences (representing a given sequence group)
- Run a SimPlot (sequence similarity plot) analysis
- Run a BootScan analysis
- Identify informative sites
- Create and analyze sequence similarity networks
- Run Phi, NSS and χ2 statistical tests to detect recombination
- Provide distance calculability diagnostics
SimPlot++ for Windows is available either as an executable file or as a Python script.
The Windows installer can be found on the release page. Click on the `SimPlot++-x.x-win64.msi` file to download it.
A `requirements.txt` file containing all required libraries is available in the GitHub repository.
Assuming Python 3.8 or higher is installed on the machine, the script should run well with the libraries installed.
Here is an example of how to run the script in Windows:
- After downloading the source code, go to the folder containing `main.py`.
- Create a new virtual environment (venv) in Windows PowerShell using `python3 -m venv SimPlot++_env`.
- Still in PowerShell, enter the new venv using `SimPlot++_env/Scripts/Activate.ps1`.
- Install the required libraries using `pip install -r requirements.txt`.
- Launch SimPlot++ using `python3 main.py`.
SimPlot++ for Linux/UNIX and Mac OS is available as a Python script.
A `requirements.txt` file containing all required libraries is available in the GitHub repository.
Assuming Python 3.8 or higher is installed on the machine, the script should run well with the libraries installed.
Here is an example of how to run the script in Linux/UNIX or Mac OS:
- After downloading the source code, go to the folder containing `main.py`.
- If you do not have `virtualenv` installed, run `python3 -m pip install --user virtualenv`.
- Create a new virtual environment (venv) in your terminal using `python3 -m venv SimPlot++_env`.
- Still in the terminal, enter the new venv using `source SimPlot++_env/bin/activate`.
- Install the required libraries using `python3 -m pip install -r requirements.txt`.
- Launch SimPlot++ using `python3 main.py`.
The group page allows users to load multiple sequence files (in the Fasta, Nexus, Pir, Phylip, Stockholm or Clustal format) and to manually organize individual sequences into groups. Sequences must be aligned before they are loaded into SimPlot++. For each group created by the user, SimPlot++ generates a consensus sequence. The % threshold for consensus sequences can be modified in the preference tab of the menu. The consensus groups are essential for the SimPlot, BootScan and similarity network analyses. The groups can be saved in a Nexus file to avoid recreating them every time.
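To illustrate what a consensus threshold does, here is a minimal sketch of a majority-rule consensus. It is not SimPlot++'s own implementation: the function name `consensus`, the `ambiguous` placeholder symbol and the exact tie handling are assumptions made for this example.

```python
from collections import Counter

def consensus(sequences, threshold=0.5, ambiguous="N"):
    """Return a majority-rule consensus for equal-length aligned sequences."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned (same length)")
    result = []
    for column in zip(*sequences):
        symbol, count = Counter(column).most_common(1)[0]
        # Keep the most frequent symbol only if it reaches the threshold;
        # otherwise mark the position as ambiguous (hypothetical convention).
        result.append(symbol if count / len(column) >= threshold else ambiguous)
    return "".join(result)

group = ["ACGTAC", "ACGTTC", "ACCTGC"]
print(consensus(group, threshold=0.6))  # prints ACGTNC
```

Raising the threshold makes the consensus stricter: with `threshold=0.7`, the third column (G, G, C) no longer qualifies and also becomes ambiguous.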
- Select your aligned sequence file (DNA or AA) in an accepted format (Fasta, Nexus, Pir, Phylip, Stockholm or Clustal) through the File browser button.
- Once the file is loaded, the sequence IDs will appear in the Ungrouped Sequences section, on the bottom right.
- Groups can be created and deleted, and sequence IDs can be moved from the Ungrouped Sequences section to the Group section.
- Groups can be renamed by clicking on their names.
- Group colors can be modified by clicking on the corresponding colored circles.
- A group that has no sequences will prevent the user from running most of the available analyses.
- A minimum of two groups, each containing at least one sequence, is necessary to run an analysis.
- Once your groups are created, customized and ready to be used, we suggest saving them locally using the Save groups to .Nexus button.
- This feature will save your initial data in the Nexus format along with the group content, allowing you to skip the group creation process the next time you start SimPlot++.
- Once these steps are completed, you can run your analysis of choice by selecting it on the top bar of the application.
Below is a summary of the steps presented:
A SimPlot analysis slides a window of a specified size over the multiple sequence alignment (MSA), moving it by a specified advancement step. Every sub-MSA covered by the window is extracted, and a distance matrix is generated from it using the selected distance model. This distance matrix is then used to produce a similarity plot for every consensus sequence against the reference sequence chosen by the user. The variations in similarity between the reference sequence and the consensus sequences can be used, for example, to detect potential recombination events.
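The sliding-window idea can be sketched in a few lines of Python. This is an illustration only, not the SimPlot++ code: it assumes a simple p-distance (proportion of differing sites) in place of the selectable distance models, and the names `p_distance` and `sliding_similarity` are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites, skipping positions with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return None  # distance cannot be computed for this window
    return sum(x != y for x, y in pairs) / len(pairs)

def sliding_similarity(reference, query, window=4, step=2):
    """Return (window midpoint, similarity) pairs along the alignment."""
    points = []
    for start in range(0, len(reference) - window + 1, step):
        d = p_distance(reference[start:start + window],
                       query[start:start + window])
        # Similarity is taken as 1 - distance (assumption for this sketch).
        similarity = None if d is None else 1.0 - d
        points.append((start + window // 2, similarity))
    return points

reference = "ACGTACGTAC"
query = "ACGTTCGAAC"
print(sliding_similarity(reference, query, window=4, step=2))
# prints [(2, 1.0), (4, 0.75), (6, 0.5), (8, 0.75)]
```

Plotting the similarity values against the window midpoints for each consensus sequence gives the general shape of a similarity plot; a drop in one curve that coincides with a rise in another is the kind of pattern that may indicate recombination.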
- 43 DNA distance and 20 amino acid distance models are available (including models from Biopython and Cogent3)
- Multiprocessing functionality is available
- Matplotlib-based plots with a toolbar to easily customize and save the outputs in multiple formats
- Plots can be viewed in a pop-up window (with the toolbar)
- A new quality control window opens an interactive HTML page that gives access to additional information, including the distance calculability diagnostics
Window length: Determines the size of the sliding window
Step: Determines the size of the advancement step of the sliding window (with overlap)
Strip gap: Determines the maximum number of ambiguous positions allowed in a consensus sequence (optional)
Multiprocessing: Allows multiple windows to be analyzed simultaneously. Recommended for large datasets
Distance Model: 43 DNA models and 20 amino acid models available
A new feature of SimPlot++ is the quality control feature for the SimPlot analysis, which allows the user to visualize gap-related data (such as the regions where the number of gaps in the consensus sequence was higher than the permitted gap threshold).
Overall data completeness (whether genetic distances were successfully computed or not in each window) can also be viewed through this feature.
Since certain models can generate errors when computing distances between genetically distant sequences (divisions by zero, log of negative numbers, etc.), it is recommended to use this feature to check whether such issues occurred. If they did, it is recommended to modify the analysis parameters, consensus groups, consensus threshold, or the distance model used. As a general rule, simpler models tend to be more lenient.
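As a concrete case of a distance that cannot always be calculated, consider the standard Jukes-Cantor (JC69) correction, d = -3/4 ln(1 - 4p/3), where p is the proportion of differing sites. The logarithm is undefined once p reaches 0.75. The sketch below returns `None` in that case; how SimPlot++ itself flags such windows is not shown here, and the function name is hypothetical.

```python
import math

def jukes_cantor(p):
    """JC69 distance from the proportion p of differing sites, or None if undefined."""
    argument = 1.0 - 4.0 * p / 3.0
    if argument <= 0.0:
        # Log of a non-positive number: the distance is not calculable.
        return None
    return -0.75 * math.log(argument)

for p in (0.0, 0.3, 0.75, 0.9):
    print(p, jukes_cantor(p))
```

The raw p-distance, by contrast, is defined for any pair of gap-free windows, which is one way to see why simpler models tend to be more lenient.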
The sequence similarity network analysis is an interactive representation of a SimPlot analysis using a window in which every group (including the reference group) is represented by a network node. These nodes are connected by an edge depending on the calculated global (over the whole sequence) or local (over sub-sequences of a selected length) similarity.
By adjusting the minimum similarity threshold required to show each of the edge types (global and local), it is possible to gain better insight into the relationships among the groups. Furthermore, the network similarity representation can be limited to a specific range of the full MSA (in order to analyze a gene or region of interest).
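The effect of the similarity threshold on the network can be sketched as a simple edge filter. The group names and similarity values below are made-up illustrative data, and `build_edges` is a hypothetical helper, not part of SimPlot++.

```python
def build_edges(similarity, threshold):
    """Return sorted (node_a, node_b, similarity) edges at or above threshold."""
    edges = []
    for (a, b), value in similarity.items():
        if value >= threshold:
            edges.append((a, b, value))
    return sorted(edges)

# Hypothetical global similarities between three consensus groups.
global_similarity = {
    ("GroupA", "GroupB"): 0.92,
    ("GroupA", "GroupC"): 0.71,
    ("GroupB", "GroupC"): 0.85,
}
print(build_edges(global_similarity, threshold=0.80))
# prints [('GroupA', 'GroupB', 0.92), ('GroupB', 'GroupC', 0.85)]
```

Lowering the threshold to 0.70 would also draw the GroupA-GroupC edge; the same filter could be applied separately to local (sub-sequence) similarities with their own threshold.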
Datatables allow the user to view the most important similarity regions between the network nodes (i.e. MSA sequences or sub-sequences). The results of the Proportion test are also available in a datatable (discussed in more detail in the statistical methods section).
The graph data and visualization can be saved in an HTML file. The graph itself can be saved as either a .png or .svg (the option is in the user preference page) directly from the toolbox in the HTML file.
Bootscanning is a pipeline consisting of 4 main steps, all done using a sliding window analysis (as in the SimPlot analysis).
- The subsequences extracted from the consensus groups are bootstrapped N times.
- For each of the N bootstrapped sub-MSAs, a distance matrix is generated.
- A phylogenetic tree is inferred for each distance matrix (either with Neighbor-Joining or UPGMA).
- The conflicting phylogenetic signals are quantified and expressed as the % of trees where each sequence is the nearest neighbor of the reference sequence.
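The four steps above can be sketched for a single window as follows. This is a simplified illustration under stated assumptions: it uses a plain p-distance, and it replaces the tree inference step (Neighbor-Joining or UPGMA) with a shortcut that simply picks the candidate with the smallest distance to the reference. The function names are hypothetical.

```python
import random

def p_distance(a, b):
    """Proportion of differing sites between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def bootscan_window(reference, candidates, replicates=100, seed=0):
    """Percent of bootstrap replicates in which each candidate is closest to reference."""
    rng = random.Random(seed)
    length = len(reference)
    wins = {name: 0 for name in candidates}
    for _ in range(replicates):
        # Resample alignment columns with replacement.
        columns = [rng.randrange(length) for _ in range(length)]
        ref = [reference[i] for i in columns]
        distances = {
            name: p_distance(ref, [seq[i] for i in columns])
            for name, seq in candidates.items()
        }
        # Shortcut instead of a tree: the smallest distance wins.
        # Ties go to the candidate listed first in the dictionary.
        closest = min(distances, key=distances.get)
        wins[closest] += 1
    return {name: 100.0 * count / replicates for name, count in wins.items()}

window_ref = "ACGTACGTACGT"
window_groups = {"GroupA": "ACGTACGTACGA", "GroupB": "ACGAACTTACCT"}
print(bootscan_window(window_ref, window_groups, replicates=200))
```

Repeating this for every position of the sliding window and plotting the percentages gives a BootScan-style profile. A real analysis would infer a tree per replicate, which can disagree with the closest-distance shortcut when evolutionary rates differ between lineages.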
- 43 DNA distance models are available for generating the distance matrices
- Multiprocessing functionality is available
- Matplotlib-based plots with a toolbar to easily customize and save the outputs in multiple formats
- Plots can be viewed in a pop-up window (with the toolbar)
Bootstrap: Number of replicates to be generated for each sub-MSA (each position of the sliding window)
Tree model: Neighbor-Joining or UPGMA
Window length: Size of the sliding window
Step: Sliding window advancement step
Distance Model: 43 DNA substitution models are available
Multiprocessing: Allows multiple windows to be analyzed simultaneously (recommended for large datasets)
The FindSites scan is used to locate possible regions of recombination by identifying informative sites. The first step of the analysis is to select a sequence assumed to have originated from a recombination event, two sequences of interest (one from each of the two possible parental evolutionary lines), and a fourth sequence as an outgroup. Informative sites are those positions at which two of the four sequences share one nucleotide and the other two sequences share another (different) nucleotide.
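The definition of an informative site translates directly into code. The sketch below is an illustration rather than the SimPlot++ implementation; the function name is hypothetical, and skipping columns that contain a gap is an assumption.

```python
def informative_sites(query, parent_a, parent_b, outgroup):
    """Return 0-based positions where the four sequences split two against two."""
    sites = []
    for position, column in enumerate(zip(query, parent_a, parent_b, outgroup)):
        if "-" in column:
            continue  # assumption: gapped columns are not considered
        counts = sorted(column.count(symbol) for symbol in set(column))
        if counts == [2, 2]:
            sites.append(position)
    return sites

print(informative_sites("AACGT", "AACTT", "TGCGA", "TGCTA"))
# prints [0, 1, 3, 4]
```

In this toy example, the query groups with `parent_a` at positions 0, 1 and 4 but with `parent_b` at position 3; a run of sites supporting one parent followed by a run supporting the other is the pattern that suggests a recombination breakpoint.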
Statistical tests for detecting recombination events from the PhiPack package by Trevor Bruen et al. have been implemented in SimPlot++. The multiprocessing option is recommended for faster computation.
The Phi, Phi-profile, Max χ2 and NSS tests are available for both the ungrouped (raw sequences) and grouped consensus sequences.
Additional information on these tests can be found here.
Moreover, a new, simple Proportion test has been designed as a complement to the traditional SimPlot analysis in order to quickly identify the most likely mosaic regions (i.e. possible recombination events) in the grouped sequences. This test is based on the proportion of genetic distances extracted from the SimPlot distance matrices. The Proportion score is an indicator of signal strength, but it should not always be interpreted as a recombination signal.
Please email us at [email protected] or [email protected] with any questions or feedback.