`finder` is a gene annotation pipeline that automates downloading short reads, aligning them, and using the assembled transcripts to generate gene annotations. It additionally uses protein sequences and reports gene predictions from `BRAKER2`. It is a fast, scalable, platform-independent tool that generates gene annotations in GTF format. `finder` accepts inputs through the command line interface. It finds several novel genes/transcripts and also reports the tissues/conditions in which they were found. `finder` is released as a docker image. Users need to have python3 installed on their system to run `finder`. The header script will create either a `docker` container or a `singularity` container, depending on what is installed on the system, with preference given to `docker`.
If you use `finder` for your research, please cite:
Sagnik Banerjee, Priyanka Bhandary, Margaret Woodhouse, Taner Z Sen, Roger P Wise, and Carson M Andorf. FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences. BMC Bioinformatics.
`finder` requires a number of software packages to be installed. This might cause version conflicts with software that is already installed on your system. Hence, the developers have decided to enforce the use of `finder` within a conda environment.
git clone https://github.com/sagnikbanerjee15/Finder.git
wget https://github.com/sagnikbanerjee15/Finder/archive/refs/tags/finder_v1.1.0.tar.gz
tar -xvzf finder_v1.1.0.tar.gz
cd finder_v1.1.0
echo "export PATH=\$PATH:$(pwd)" >> ~/.bashrc
source ~/.bashrc
You can run `finder` using the command outlined in [this section](#Running Finder). When the `run_finder` command is executed, it pulls the latest docker image from Docker Hub. Depending on what is installed, the program creates either a docker or a singularity container and executes the main program inside it. If you wish to build the docker image locally, execute the following command:
docker build -t sagnikbanerjee15/finder:1.1.0 .
Please remember to add proxies if you are on a VPN.
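The container choice described above (docker if available, otherwise singularity) can be illustrated with a minimal Python sketch. This is not the code used by `run_finder`; the function name `choose_framework` is hypothetical, and the sketch only assumes that the container engines are discoverable on `PATH`.

```python
import shutil
from typing import Callable, Optional


def choose_framework(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return 'docker' or 'singularity', preferring docker when both exist.

    `which` is injectable so the selection logic can be checked without
    either engine being installed (hypothetical helper, not part of FINDER).
    """
    for name in ("docker", "singularity"):  # tuple order encodes the preference
        if which(name) is not None:
            return name
    raise RuntimeError("Neither docker nor singularity was found on PATH")


if __name__ == "__main__":
    try:
        print("Container framework:", choose_framework())
    except RuntimeError as err:
        print(err)
```

The `--framework` option listed in the help menu lets you state the choice explicitly instead of relying on detection.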
`finder` runs `BRAKER2`, which depends on `GeneMark-ET`. `GeneMark-ET` is hosted at the University of Georgia website. The license prohibits redistribution of the software, which is why it could not be included in this package. Hence, users have to download the software manually and provide its path as input to `finder`. Please follow the instructions below to download the software and the key:
- Open a browser of your choice
- Go to this website
- Select the option GeneMark-ES/ET/EP ver 4.62_lic (2nd from top) and LINUX 64
- Enter your name, institution, country and email-id and click on the button that says I agree to the terms of this license agreement
- Right click on the link that says Please download program here and select Copy Link Address
- Then type in `wget` and paste the path you just copied. This command will download the file gmes_linux_64.tar.gz into the current directory
- Now, right click on the link that says 64_bit and select Copy Link Address
- Then type in `wget` and paste the path you just copied. This command will download the file gm_key_64.tar.gz into the current directory. Please note that this key will expire one year after the date of download.
- Execute the following commands:
tar -xvzf gm_key_64.tar.gz
tar -xvzf gmes_linux_64.tar.gz
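Because the key expires one year after download, it can help to note the download date and check how much time is left before a long run. The sketch below is only an illustration: the function name is hypothetical, the download date must be supplied by you, and "one year" is treated as 365 days, in line with the wording of the `--genemark_license` help text.

```python
from datetime import date, timedelta

# Assumption: the one-year validity is treated as exactly 365 days.
KEY_VALIDITY = timedelta(days=365)


def key_days_remaining(downloaded_on: date, today: date) -> int:
    """Days left before the key expires; negative once it has expired."""
    return (downloaded_on + KEY_VALIDITY - today).days


if __name__ == "__main__":
    # Hypothetical download date, replace with the date you fetched the key.
    remaining = key_days_remaining(date(2021, 6, 1), date.today())
    if remaining <= 0:
        print("The GeneMark key has expired; download a new one.")
    else:
        print(f"The GeneMark key is valid for another {remaining} day(s).")
```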
Please follow the instructions below to generate gene annotations using Arabidopsis thaliana. A `csv` file template is provided with the release in `example/Arabidopsis_thaliana_metadata.csv`. Keep all the headers intact and replace the data with your samples of choice. Note that FINDER can work both with data downloaded from `NCBI` and with data in local directories. Below is a detailed description of each column of the metadata file. All the fields must be present in the metadata file. Mandatory fields must contain valid data, whereas other fields like `Description`, `Date` and `Read Length` can be left vacant.
Column Name | Column Description | Mandatory |
---|---|---|
BioProject | Name of the BioProject that the data belongs to. If you are using locally saved data, please enter a dummy project name. Please note that FINDER will NOT be able to process empty BioProject fields. | YES |
SRA Accession | Enter the SRA Accession number of the sample that you expect finder to use for generating the gene annotations. FINDER will use this ID to download the read samples from NCBI-SRA. If you wish to use data that is not currently uploaded to NCBI, enter the name of the local file. Do not enter any file extension in this field. For example, if your filename is sample1.fastq, please enter sample1 in this field. finder assumes all files have the extension fastq. If there are files in your system that end with .fq, please rename those to *.fastq. For paired-ended samples, do not include the pair information in this field. For example, if you have 2 files sample2_1.fastq and sample2_2.fastq, please enter sample2 in this field. | YES |
Tissues | Mention the tissue type or condition from which the sample has been collected. finder will report the tissues that are associated with a particular transcript. This can be used to find gene models that are expressed in a specific tissue and/or condition. | YES |
Description | A brief description of the data. This field is not mandatory and is not used by finder. It is up to the user to enter whatever metadata is deemed important. | NO |
Date | Enter the date of producing the RNA-Seq sample. This field is not mandatory and is not used by finder. | NO |
Read Length (bp) | Enter the length of the reads. This field is not mandatory and is not used by finder. | NO |
Ended | Enter either PE or SE for paired-ended reads or single-ended reads. No other value should be entered. | YES |
RNA-Seq | Enter 1 for all the rows. This field is included for future extensions. | YES |
process | Enter 1 if you wish to process the sample. If a value of 0 is present, finder will ignore the sample. | YES |
Location | Enter the location of the directory. For samples to be downloaded from NCBI, this field should be left empty. If the location of a directory is provided here, finder will assume that the sample is present in it. finder will generate an error if the sample is not found in this directory. It is not necessary to have all the samples in the same directory. | YES |
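The rules in the table above can be checked before starting a run. The following is a minimal sketch, not part of FINDER: the helper names are hypothetical, and the header strings are assumed to match the column names in the table, which may differ from the exact headers in the shipped template.

```python
import csv
import io

# Column names as listed in the table above (assumed to match the template).
COLUMNS = ["BioProject", "SRA Accession", "Tissues", "Description", "Date",
           "Read Length (bp)", "Ended", "RNA-Seq", "process", "Location"]


def check_row(row: dict) -> list:
    """Return a list of problems found in one metadata row (empty if none)."""
    def value(col: str) -> str:
        return (row.get(col) or "").strip()

    problems = []
    for col in ("BioProject", "SRA Accession", "Tissues"):
        if not value(col):
            problems.append(f"{col} is empty")
    if value("Ended") not in ("PE", "SE"):
        problems.append("Ended must be PE or SE")
    if value("RNA-Seq") != "1":
        problems.append("RNA-Seq must be 1")
    if value("process") not in ("0", "1"):
        problems.append("process must be 0 or 1")
    return problems


def check_metadata(text: str) -> dict:
    """Map 1-based data row numbers to their problems for CSV content."""
    report = {}
    for number, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        problems = check_row(row)
        if problems:
            report[number] = problems
    return report


# Two hypothetical local samples; the second has an invalid Ended value.
sample = (
    ",".join(COLUMNS) + "\n"
    "PRJ_local,sample1,leaf,,,,SE,1,1,/data/reads\n"
    "PRJ_local,sample2,root,,,,XX,1,1,/data/reads\n"
)
print(check_metadata(sample))
```

To check a real file, read it with `open(path).read()` and pass the text to `check_metadata`.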
To optimize disk space usage, `finder` processes the read samples of one BioProject at a time. Once the data is downloaded and the reads are mapped, FINDER removes the downloaded data (unless `-no_cleanup` is specified) to save disk space. Samples that were already present locally are not removed.
Help menu for FINDER can be launched by the following command:
run_finder -h
usage: run_finder [-h] [--version] --metadatafile METADATAFILE --output_directory OUTPUT_DIRECTORY --genome GENOME --organism_model {VERT,INV,PLANTS,FUNGI} --genemark_path GENEMARK_PATH --genemark_license GENEMARK_LICENSE [--cpu CPU] [--genome_dir_star GENOME_DIR_STAR]
[--genome_dir_olego GENOME_DIR_OLEGO] [--verbose VERBOSE] [--protein PROTEIN] [--no_cleanup] [--preserve_raw_input_data] [--checkpoint CHECKPOINT] [--perform_post_completion_data_cleanup] [--run_tests] [--addUTR] [--skip_cpd] [--exonerate_gff3 EXONERATE_GFF3]
[--star_shared_mem] [--framework {docker,singularity}]
Generates gene annotation from RNA-Seq data
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
Required arguments:
--metadatafile METADATAFILE, -mf METADATAFILE
Please enter the name of the metadata file. Enter 0 in the last column of those samples which you wish to skip processing. The columns should represent the following in order --> BioProject, SRA Accession, Tissues, Description, Date, Read Length, Ended (PE or SE), RNA-Seq, process, Location. If the sample is skipped it will not be downloaded. Leave the directory path blank if you are downloading the samples. In the end of the run the program will output a csv file with the directory path filled out. Please check the provided csv file for more information on how to configure the metadata file.
--output_directory OUTPUT_DIRECTORY, -out_dir OUTPUT_DIRECTORY
Enter the name of the directory where all other operations will be performed
--genome GENOME, -g GENOME
Enter the SOFT-MASKED genome file of the organism
--organism_model {VERT,INV,PLANTS,FUNGI}, -om {VERT,INV,PLANTS,FUNGI}
Enter the type of organism
--genemark_path GENEMARK_PATH, -gm GENEMARK_PATH
Enter the path to genemark
--genemark_license GENEMARK_LICENSE, -gml GENEMARK_LICENSE
Enter the licence file. Please make sure your license file is less than 365 days old
Optional arguments:
--cpu CPU, -n CPU Enter the number of CPUs to be used.
--genome_dir_star GENOME_DIR_STAR, -gdir_star GENOME_DIR_STAR
Please enter the location of the genome index directory of STAR
--genome_dir_olego GENOME_DIR_OLEGO, -gdir_olego GENOME_DIR_OLEGO
Please enter the location of the genome index directory of OLego
--verbose VERBOSE, -verb VERBOSE
Enter a verbosity level
--protein PROTEIN, -p PROTEIN
Enter the protein fasta
--no_cleanup, -no_cleanup
Provide this option if you do not wish to remove any intermediate files. Please note that this will NOT remove any files and might take up a large amount of space
--preserve_raw_input_data, -preserve
Set this argument if you want to preserve the raw fastq files. All other temporary files will be removed. These fastq files can be later used.
--checkpoint CHECKPOINT, -c CHECKPOINT
Enter a value if you wish to restart operations from a certain check point. Please note if you have new RNA-Seq samples, then FINDER will override this argument and computation will take place from read alignment. If there are missing data in any step then also FINDER will enforce restart of operations from a previous
. For example, if you wish to run assembly on samples for which alignments are not available then FINDER will readjust this value and set it to 1.
1. Align reads to reference genome (Will trigger removal of all alignments and start from beginning)
2. Assemble with PsiCLASS (Will remove all assemblies)
3. Find genes with FINDER (entails changepoint detection)
4. Predict genes using BRAKER2 (Will remove previous results of gene predictions with BRAKER2)
5. Annotate coding regions
6. Merge FINDER annotations with BRAKER2 predictions and protein sequences
--perform_post_completion_data_cleanup, -pc_clean
Set this field if you wish to clean up all the intermediate files after the completion of the execution. If this operation is requested prior to generation of all the important files then it will be ignored and finder will proceed to annotate the genome.
--run_tests, -rt Modify behaviour of finder to accelerate tests. This will reduce the downloaded fastq files to a bare minimum and also check the other installations
--addUTR, --addUTR Turn on this option if you wish BRAKER to add UTR sequences
--skip_cpd, --skip_cpd
Turn on this option to skip changepoint detection. Could be effective for grasses
--exonerate_gff3 EXONERATE_GFF3, -egff3 EXONERATE_GFF3
Enter the exonerate output in gff3 format
--star_shared_mem, --star_shared_mem
Turn on this option if you want STAR to load the genome index into shared memory. This saves memory if multiple finder runs are executing on the same host, but might not work in your cluster environment.
--framework {docker,singularity}, -fm {docker,singularity}
Enter your choice of framework
`finder` can be launched using the following command:
run_finder -no_cleanup -mf Arabidopsis_thaliana_metadata.csv -n $CPU -out_dir $PWD/FINDER_test_ARATH -g $PWD/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa -p $PWD/uniprot_ARATH.fasta -preserve 1> $PWD/FINDER_test_ARATH.output 2> $PWD/FINDER_test_ARATH.error
This program will download and run the entire process of annotation. The duration of execution will depend on your internet speed and the number of cores you assigned to FINDER. Also, FINDER is designed in a way to handle a large number of RNA-Seq samples. So the speedup might not be noticeable with just a few samples.
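The help menu lists `--organism_model`, `--genemark_path` and `--genemark_license` as required arguments. As a hedged illustration, the sketch below assembles a `run_finder` argument list that includes them, using only flags shown in the help menu. The function name and every path in the example are placeholders, not values taken from the FINDER distribution.

```python
import shlex

ORGANISM_MODELS = {"VERT", "INV", "PLANTS", "FUNGI"}  # choices from the help menu


def build_run_finder_command(metadata, out_dir, genome, organism_model,
                             genemark_path, genemark_license,
                             cpu=1, protein=None):
    """Assemble a run_finder argument list from the documented flags."""
    if organism_model not in ORGANISM_MODELS:
        raise ValueError(f"unsupported organism model: {organism_model}")
    argv = ["run_finder",
            "-mf", metadata,
            "-out_dir", out_dir,
            "-g", genome,
            "-om", organism_model,
            "-gm", genemark_path,
            "-gml", genemark_license,
            "-n", str(cpu)]
    if protein is not None:
        argv += ["-p", protein]  # optional protein fasta
    return argv


# Placeholder paths; substitute the locations on your own system.
example = build_run_finder_command(
    metadata="Arabidopsis_thaliana_metadata.csv",
    out_dir="FINDER_test_ARATH",
    genome="Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa",
    organism_model="PLANTS",
    genemark_path="/path/to/genemark",
    genemark_license="/path/to/genemark_key",
    cpu=8,
    protein="uniprot_ARATH.fasta",
)
print(shlex.join(example))
```

The resulting list can be passed to `subprocess.run(example, check=True)` once the placeholder paths have been replaced.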
We recommend that you preserve all intermediate files while `finder` runs, and then run the following command to remove them once the run is complete.
run_finder -no_cleanup -mf Arabidopsis_thaliana_metadata.csv -n $CPU -out_dir $PWD/FINDER_test_ARATH -g $PWD/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa -p $PWD/uniprot_ARATH.fasta -preserve -pc_clean 1> $PWD/FINDER_test_ARATH.output 2> $PWD/FINDER_test_ARATH.error
`finder` allows users to enforce execution from a specific checkpoint. Requesting a particular checkpoint does not mean that `finder` will skip all previous steps. It means that `finder` will remove all files generated by the processes after that checkpoint, so that those modules recalculate them. Below is a description of all the checkpoints that `finder` can accept:
- Align reads to reference genome - Requesting `finder` to start from this checkpoint will trigger removal of all previous alignments.
- Assemble with `PsiCLASS` - Requesting `finder` to start from this checkpoint will trigger removal of assemblies that were previously generated. Aligned files will not be removed. If some RNA-Seq samples are not aligned, FINDER will align those first before attempting to assemble them.
- Find genes with `finder` - `finder` will regenerate all files post assembly by PsiCLASS.
- Predict genes using `BRAKER2` - `finder` will rerun the BRAKER2 step.
- Annotate coding regions - `finder` will restart from annotating the coding sequences.
- Merge `finder` annotations with `BRAKER2` predictions and protein sequences - `finder` will generate merged annotations from RNA-Seq samples, predictions and protein sequences.
If you wish to start `finder` from downloading the SRA samples, please delete the output directory and start over.
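The `--checkpoint` help text states that FINDER overrides the requested value when new RNA-Seq samples are present or when alignments are missing, and restarts from read alignment. The sketch below is a simplified model of that documented behaviour, written only for illustration. It is not FINDER's implementation, and the function and parameter names are hypothetical.

```python
# Checkpoint numbers and labels as listed in the --checkpoint help text.
CHECKPOINTS = {
    1: "Align reads to reference genome",
    2: "Assemble with PsiCLASS",
    3: "Find genes with FINDER",
    4: "Predict genes using BRAKER2",
    5: "Annotate coding regions",
    6: "Merge FINDER annotations with BRAKER2 predictions and protein sequences",
}


def effective_checkpoint(requested: int,
                         has_new_samples: bool = False,
                         alignments_missing: bool = False) -> int:
    """Model which checkpoint a run starts from, per the help text.

    New samples or missing alignments force a restart from checkpoint 1;
    otherwise the requested checkpoint is used unchanged.
    """
    if requested not in CHECKPOINTS:
        raise ValueError(f"checkpoint must be one of {sorted(CHECKPOINTS)}")
    if has_new_samples or alignments_missing:
        return 1
    return requested


start = effective_checkpoint(2, alignments_missing=True)
print(f"Starting from checkpoint {start}: {CHECKPOINTS[start]}")
```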
All relevant output files generated by `finder` can be found in the `final_GTF_files` directory under the output directory. Below is the list of files and the data they contain:
- braker.gtf - gene models generated by `BRAKER2`
- braker_utr.gtf - gene models, with UTR models, generated by `BRAKER2`
- combined_redundant_transcripts_removed.gtf - GTF file from PsiCLASS output
- combined_split_transcripts_with_bad_SJ_redundancy_removed.gtf - GTF file after splitting transcripts. This file is generated only from RNA-Seq expression evidence
- combined_with_CDS.gtf - `finder` output with CDS predicted by GeneMark-S/T
- combined_with_CDS_high_conf.gtf - `finder` gene models with high confidence
- combined_with_CDS_low_conf.gtf - `finder` gene models with low confidence
- combined_with_CDS_BRAKER_appended_high_conf.gtf - High confidence gene models from RNA-Seq evidence combined with BRAKER2 gene models
- combined_with_CDS_high_and_low_confidence_merged.gtf - High and low confidence gene models from RNA-Seq evidence combined with BRAKER2 gene models
- FINDER_BRAKER_PROT.gtf - High confidence gene models from RNA-Seq evidence combined with BRAKER2 gene models and gene models from protein evidence
- tissue/condition to transcript - A file with two columns. The first column lists the transcripts and the second column lists the tissues/conditions they were found in. In future versions, we will include the functionality of extracting transcripts specific to a tissue/condition.
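A quick way to compare the GTF files listed above is to count the distinct transcripts in each. The sketch below relies only on the standard GTF layout (nine tab-separated columns with a `transcript_id "..."` attribute in the ninth). The helper name and the sample records, including the source column and the identifiers, are made up for illustration.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def count_transcripts(gtf_lines) -> int:
    """Count distinct transcript_id values in an iterable of GTF lines."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return len(ids)


# Made-up records: two exons of one transcript and one exon of another.
sample = [
    "# example\n",
    'chr1\tFINDER\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "g1.t1";\n',
    'chr1\tFINDER\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "g1.t1";\n',
    'chr1\tFINDER\texon\t100\t250\t.\t+\t.\tgene_id "g1"; transcript_id "g1.t2";\n',
]
print(count_transcripts(sample))
```

Since an open file is an iterable of lines, a file handle for any of the GTF files in `final_GTF_files` can be passed directly to `count_transcripts`.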
`finder` generates several intermediate files and folders. This section contains a detailed outline of the contents of each folder and what each file represents.
`finder` is configured to write information to a log file named `progress.log`, located in the output directory. When reporting issues, please make sure you attach the log file.
`finder` offers users the opportunity to add data to already completed annotation runs. Users need to update the metadata.csv file with the new RNA-Seq data and rerun `finder`. The program will determine an optimal starting point. `finder` will skip downloading already processed RNA-Seq samples and will proceed with the new data. Users also have the option of removing some previously supplied RNA-Seq samples.
`finder` offers users 2 utilities which can be used independently.
- `downloadAndDumpFastqFromSRA.py` - A Python program that optimizes the download of data from SRA. IDs of RNA-Seq runs (or of any other sequencing runs) need to be provided as a newline-separated file. The program will download the RNA-Seq files using the requested number of cores, convert them to fastq, and remove the `.sra` files. `downloadAndDumpFastqFromSRA.py` will continuously query the SRA database in the event of a failure.
python downloadAndDumpFastqFromSRA.py -h
usage: download_and_dump_fastq_from_SRA.py [-h] --sra SRA --output OUTPUT [--cpu CPU]
Parallel download of fastq data from NCBI. Program will create the output directory if it is not present. If fastq file is present, then downloading is skipped. Program optimizes downloading of sra files and converting to fastq by utilizing multiple CPU cores.
optional arguments:
-h, --help show this help message and exit
--sra SRA, -s SRA Please enter the name of the file which has all the SRA ids listed one per line. Please note the bioproject IDS cannot be processed
--output OUTPUT, -o OUTPUT Please enter the name of the output directory. Download will be skipped if file is present
--cpu CPU, -n CPU Enter the number of CPUs to be used.
- `verifyInputsToFINDER.py` - This program verifies whether all the requested samples are in fact from a transcriptomic source of the organism whose genome is being annotated.
python verifyInputsToFINDER.py -h
usage: verify_inputs_to_finder.py [-h] --metadatafile METADATAFILE --srametadb SRAMETADB --taxon_id TAXON_ID
Verifies whether all the data are transcriptomic and from the organism under consideration
optional arguments:
-h, --help show this help message and exit
--metadatafile METADATAFILE, -mf METADATAFILE Please enter the name of the metadata file. Enter 0 in the last column of those samples which you wish to skip processing. The columns should represent the following in order --> BioProject,Run,tissue_group,tissue,description,Date,read_length,ended (PE or SE),directorypath,download,skip. If the sample is skipped it will not be downloaded. Leave the directory path blank if you are downloading the samples. In the end of the run the program will output a csv file with the directory path filled out. Please check the provided csv file for more information on how to configure the metadata file.
--srametadb SRAMETADB, -m SRAMETADB Enter the location of the SRAmetadb file.
--taxon_id TAXON_ID, -t TAXON_ID Enter the taxonomic id of the organism. Enter -1 if you are working on a non-model organism or a sub-species for which no taxonomic id exists.
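The download utility expects a file with one SRA ID per line. As an illustration, the sketch below derives such a list from metadata content by keeping rows that are marked for processing and have no local `Location`. The helper name is hypothetical, the header strings are assumed to match the metadata column names described earlier, and the accessions in the example are placeholders.

```python
import csv
import io


def sra_ids_to_download(metadata_text: str) -> list:
    """Return unique accessions with process == 1 and an empty Location."""
    ids = []
    for row in csv.DictReader(io.StringIO(metadata_text)):
        wanted = (row.get("process") or "").strip() == "1"
        remote = not (row.get("Location") or "").strip()
        accession = (row.get("SRA Accession") or "").strip()
        if wanted and remote and accession and accession not in ids:
            ids.append(accession)
    return ids


# Reduced example with only the columns this sketch reads (placeholder IDs).
sample_metadata = (
    "SRA Accession,process,Location\n"
    "SRR000001,1,\n"
    "sample1,1,/data/reads\n"
    "SRR000002,0,\n"
    "SRR000001,1,\n"
)
ids = sra_ids_to_download(sample_metadata)
print("\n".join(ids))
```

Writing the printed lines to a file gives the input for `--sra`, with `--output` naming the download directory and `--cpu` the number of cores.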
MIT License
Copyright (c) [2021] [Banerjee]
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Please report all issues here.