GitHub - snu-cdrc/gencube: Efficient retrieval, download, and unification of genomic data from leading biodiversity databases

Centralized retrieval and integration of multi-omics resources from leading databases

Keun Hong Son^1,2,3, and Je-Yoel Cho^1,2,3

¹ Department of Biochemistry, College of Veterinary Medicine, Seoul National University, Seoul, Korea
² Comparative Medicine and Disease Research Center (CDRC), Science Research Center (SRC), Seoul National University, Seoul, Korea
³ BK21 PLUS Program for Creative Veterinary Science Research and Research Institute for Veterinary Science, Seoul National University, Seoul, Korea

Manuscript

bioRxiv (Posted July 22, 2024)

🔥 Your interest and contributions help Gencube evolve to meet your needs! 👨‍🔬
⭐️ If you have any questions, ideas, or suggestions, please share them on our Issues page. 👈
🚀 We'd love to hear from you! 😊

gencube enables researchers to search for, download, and unify genome assemblies and diverse types of annotations, and retrieve integrated metadata for high-throughput sequencing-based multi-omics resources suitable for specific requirements.

Databases accessed from gencube

GenBank: NCBI GenBank Nucleotide Sequence Database
RefSeq: NCBI Reference Sequence Database
GenArk: UCSC Genome Archive
Ensembl Rapid Release: Ensembl genome browser that provides frequent updates for newly sequenced species
Zoonomia TOGA: Tool to infer Orthologs from Genome Alignments
INSDC: International Nucleotide Sequence Database Collaboration
SRA: NCBI Sequence Read Archive
ENA: EMBL-EBI European Nucleotide Archive
DDBJ: DNA Data Bank of Japan

Detailed information of each database

GenBank & RefSeq README.txt - genome, geneset, sequence
UCSC GenArk paper - genome, geneset, annotation
Ensembl Rapid Release Help & Docs & Ensembl 2023 paper - genome, geneset, sequence, crossgenome
Zoonomia TOGA README.txt & Paper - geneset, crossgenome
Search in SRA Entrez, Entrez Help & SRA Advanced Search Builder - seqmeta

Installation

The latest release can be installed with

$ pip install gencube

Alternative

# Still setting up!!
$ conda install -c bioconda gencube

Email and NCBI API key for E-utilities

When you first run gencube, you'll be prompted for your email and NCBI API key, which are saved in the .gencube_entrez_info file in your home directory for future use.

All gencube key subcommands use NCBI's Entrez Utilities (E-Utilities), requiring an email. Without an NCBI API key, you can make 3 requests per second; with an NCBI API key, this limit increases to 10 requests per second. If you submit your NCBI API key, you can perform tasks at more than three times the speed when using the seqmeta subcommand, especially when fetching metadata. If possible, it is recommended to submit your API key.

$ gencube
Email address: 
NCBI API key (type 'no' to skip):

To update the submitted information, run the following command.

$ gencube info

Tutorials

gencube consists of six main subcommands excluding info

$ gencube
usage: gencube [-h] {genome,geneset,annotation,sequence,crossgenome,seqmeta,info} ...

gencube v1.0.0

positional arguments:
  {genome,geneset,annotation,sequence,crossgenome,seqmeta,info}
    genome              Search, download, and modify chromosome labels for genome assemblies
    geneset             Search, download, and modify chromosome labels for genesets (gene annotations)
    annotation          Search, download, and modify chromosome labels for various genome annotations, such as gaps and repeats
    sequence            Search and download sequence data of genesets
    crossgenome         Search and download comparative genomics data, such as homology, and codon or protein alignments
    seqmeta             Search, retrive, and integrate metadata of experimental sequencing data
    info                Resubmit email and NCBI API key for use with NCBI's Entrez Utilities (E-Utilities)

options:
  -h, --help            show this help message and exit

0. The positional argument and options shared among the `genome`, `geneset`, `sequence`, `annotation`, and `crossgenome` subcommand

When using the above five subcommands, it's important to find genome assemblies required for personal research. Below are the positional argument and options shared by the these subcommands to browse and search for specific genome assemblies.

positional arguments:
  keywords              Taxonomic names to search for genomes
                        You can provide various forms such as species names or accession numbers
                        Examples: homo_sapiens, human, GCF_000001405.40, GCA_000001405.29, GRCh38, hg38
                        
                        Multiple names can be combined and will be merged in the search results
                        To specify multiple names, separate them with spaces

options:
  -h, --help            show this help message and exit
  -v level, --level level
                        Specify the genome assembly level (default: complete,chromosome)
                        complete   : Fully assembled genomes
                        chromosome : Assembled at the chromosome level
                        scaffold   : Assembled into scaffolds, but not to the chromosome level
                        contig     : Contiguous sequences without gaps
                        
  -r, --refseq          Show genomes that have RefSeq accession (GCF_* format)
  -u, --ucsc            Show genomes that have UCSC name
  -l, --latest          Show genomes corresponding to the latest version

Examples

# Search using scientific or common name
# It is recommended to use the scientific name for a more precise search.
$ gencube genome homo_sapiens 
$ gencube genome canis_lupus_familiaris
$ gencube genome human # less accurate
$ gencube genome dog   # less accurate

# Search using assembly name
$ gencube genome GRCh38
$ gencube genome grch38 # Case sensitivity is not an issue
$ gencube genome GRCm39 GRCm38 # Multiple keywords can be searched simultaneously

# Search using UCSC name
$ gencube genome hg38
$ gencube genome hg38 hg19
$ gencube genome mm39 mm10
$ gencube genome canfam4 canfam5 canfam6

# Search using GenBank (GCA_*) or RefSeq (GCF_*) accession
$ gencube genome GCA_021950905.1
$ gencube genome GCF_000001405.40
$ gencube genome GCF_000001405.40 GCA_021950905.1

# Show searched genomes corresponding to all genome assembly levels
$ gencube genome homo_sapiens --level complete,chromosome # default
$ gencube genome homo_sapiens --level complete,chromosome,scaffold,contig
$ gencube genome homo_sapiens --level scaffold,contig

# Only show genomes that have RefSeq accession and UCSC name, and correspond to the latest version
$ gencube genome homo_sapiens --refseq --ucsc --latest

Example output displayed in the terminal

🔥 $ gencube genome GCF_000001405.40 GCA_021950905.1

# Search assemblies in NCBI database
  Keyword: ['GCF_000001405.40', 'GCA_021950905.1']

  Total 3 genomes are searched.

# Convert JSON to dataframe format.
  Filter options
  Level:   ['Complete', 'Chromosome']
  RefSeq:  False
  UCSC:    False
  Latest:  False

# Check accessibility to GenArk, Ensembl Rapid Release
  UCSC GenArk  : 4167 genomes across 2813 species
  Ensembl Rapid: 2272 genomes across 1522 species

+----+------------------------+---------+------------+------------------+--------+----------+-----------+
|    | Assembly name          |   Taxid | Release    | NCBI             | UCSC   | GenArk   | Ensembl   |
+====+========================+=========+============+==================+========+==========+===========+
|  0 | HG002.mat.cur.20211005 |    9606 | 2022/02/04 | GCA_021951015.1  |        | v        | v         |
+----+------------------------+---------+------------+------------------+--------+----------+-----------+
|  1 | HG002.pat.cur.20211005 |    9606 | 2022/02/04 | GCA_021950905.1  |        | v        | v         |
+----+------------------------+---------+------------+------------------+--------+----------+-----------+
|  2 | GRCh38.p14             |    9606 | 2022/02/03 | GCF_000001405.40 | hg38   | v        |           |
+----+------------------------+---------+------------+------------------+--------+----------+-----------+

1. `genome` subcommand

Search, download, and modify chromosome labels for genome assemblies
You can download genome data in FASTA format from four different databases (GenBank, RefSeq, GenArk, Ensembl Rapid Release). Each database uses a different soft-masking method, and you can selectively download the data as needed. You can also download unmasked and hard-masked genomes from the Ensembl Rapid Release database.

options:
  -m, --metadata        Save metadata for the searched genomes
  -d, --download        Download "fasta" formatted genome file
  -db types, --database types
                        Database where genome file is downloaded (default: refseq)
                        Default is from the RefSeq database
                        If not available, download from the GenBank database
                        genbank : by NCBI GenBank
                        refseq  : by NCBI RefSeq
                        genark  : by UCSC GenArk
                        ensembl : by Ensembl Rapid Release
  -c type, --chr_style type
                        Chromosome label style used in the download file (default: ensembl)
                        ensembl : 1, 2, X, MT & unknowns (GenBank IDs)
                        gencode : chr1, chr2, chrX, chrM & unknowns (GenBank IDs)
                        ucsc    : chr1, chr2, chrX, chrM & unknowns (UCSC-specific IDs)
                                  !! Limited use if UCSC IDs are not issued
                        raw     : Uses raw file labels without modification
                                 - NCBI GenBank: CM_* or other-form IDs
                                 - NCBI RefSeq : NC_*, NW_* or other-form IDs
                                 - GenArk      : GenBank or RefSeq IDs
                                 - Ensembl     : Ensembl IDs
  -mk type, --masking type
                        Masking type for output data (default: soft)
                        soft : soft-masked
                        hard : hard-masked
                        none : unmasked
  -cl 1-9, --compresslevel 1-9
                        Compression level for output data (default: 6)
                        Lower numbers are faster but have lower compression
                         
  --recursive           Download files regardless of their presence only if integrity check is not possible

Examples

# Download the full information metadata of searched genomes
$ gencube genome homo_sapiens --metadata
$ gencube genome canis_lupus_familiaris --metadata

# Download genome file under the default conditions (RefSeq or GenBank)
$ gencube genome GCF_011100685.1 --download
$ gencube genome GCF_011100685.1 -d

# Download genome file from a specific database
$ gencube genome GCF_011100685.1 --download --database ensemble
# Download multiple genomes from various databases
$ gencube genome GCF_011100685.1 --download --database refseq,genark,ensemble

# Change the chromosome labels
$ gencube genome GCF_011100685.1 --download --chr_style ensembl # default
$ gencube genome GCF_011100685.1 --download --chr_style gencode
$ gencube genome GCF_011100685.1 --download --chr_style ucsc

# Change the masking type
$ gencube genome GCF_011100685.1 --download --masking soft # default
$ gencube genome GCF_011100685.1 --download --masking hard
$ gencube genome GCF_011100685.1 --download --masking none

# Set the compression level of the file to 2.
$ gencube genome GCF_011100685.1 --download --compresslevel 6 # default
$ gencube genome GCF_011100685.1 --download --compresslevel 1

Example output displayed in the terminal

🔥 $ gencube genome GCF_011100685.1 --download --chr_style gencode

# Search assemblies in NCBI database
  Keyword: ['GCF_011100685.1']

  Total 1 genomes is searched

# Filter genomes based on the following criteria
  Level:   ['Complete', 'Chromosome']
  RefSeq:  False
  UCSC:    False
  Latest:  False

# Check accessibility to GenArk, Ensembl Rapid Release
  UCSC GenArk  : 5122 genomes across 3389 species
  Ensembl Rapid: 2582 genomes across 1788 species

+----+-----------------+---------+------------+-----------------+---------+----------+-----------+
|    | Assembly name   |   Taxid | Release    | NCBI            | UCSC    | GenArk   | Ensembl   |
+====+=================+=========+============+=================+=========+==========+===========+
|  0 | UU_Cfam_GSD_1.0 |    9615 | 2020/03/10 | GCF_011100685.1 | canFam4 | v        | v         |
+----+-----------------+---------+------------+-----------------+---------+----------+-----------+

[GCA_011100685.1 / GCF_011100685.1 / UU_Cfam_GSD_1.0]
- refseq
  Canis_lupus_familiaris-UU_Cfam_GSD_1.0_assembly_report.txt: 263kB [00:00, 673kB/s]                                          
  Canis_lupus_familiaris-UU_Cfam_GSD_1.0-refseq.sm.fa.gz: 100%|████████████████████████████| 742M/742M [02:39<00:00, 4.88MB/s]

# Change chromosome names and masking method: gencode-style & soft-masked
[GCA_011100685.1 / GCF_011100685.1 / UU_Cfam_GSD_1.0]
  Downloaded genome: ['refseq.sm']
  - refseq.sm
    Modify chromosome names
    Write compressed fasta file (compresslevel: 6)
    Processing time: 455 seconds

  !! If the file appears to have any problems, please delete it and retry the process

2. `geneset` subcommand

Search, download, and modify chromosome labels for genesets (gene annotations)

options:
  -d types, --download types
                        Type of gene set
                        refseq_gtf    : RefSeq gene set (GTF format)
                        refseq_gff    : RefSeq gene set (GFF)
                        gnomon        : RefSeq Gnomon gene prediction (GFF)
                        cross         : RefSeq Cross-species alignments (GFF)
                        same          : RefSeq Same-species alignments (GFF)
                        agustus       : GenArk Augustus gene prediction (GFF)
                        xenoref       : GenArk XenoRefGene (GFF)
                        genark_ref    : GenArk RefSeq gene models (GFF)
                        ensembl_gtf   : Ensembl Rapid Release gene set (GTF)
                        ensembl_gff   : Ensembl Rapid Release gene set (GFF)
                        toga_gtf      : Zoonomia TOGA gene set (GTF)
                        toga_bed      : Zoonomia TOGA gene set (BED)
                        toga_pseudo   : Zoonomia TOGA processed pseudogenes (BED)
  -c type, --chr_style type
                        Chromosome label style used in the download file (default: ensembl)
                        ensembl : 1, 2, X, MT & unknowns (GenBank IDs)
                        gencode : chr1, chr2, chrX, chrM & unknowns (GenBank IDs)
                        ucsc    : chr1, chr2, chrX, chrM & unknowns (UCSC-specific IDs)
                                  !! Limited use if UCSC IDs are not issued
                        raw     : Uses raw file labels without modification
                                 - NCBI GenBank: CM_* or other-form IDs
                                 - NCBI RefSeq : NC_*, NW_* or other-form IDs
                                 - GenArk      : GenBank or RefSeq IDs
                                 - Ensembl     : Ensembl IDs
                         
  --recursive           Download files regardless of their presence only if integrity check is not possible

Examples

# search usable and accessible data
gencube geneset canis_lupus_familiaris
gencube geneset GCF_011100685.1

# Download specific geneset file from a database
$ gencube geneset GCF_011100685.1 --download refseq_gtf
$ gencube geneset GCF_011100685.1 --download agustus
$ gencube geneset GCF_011100685.1 --download toga_gtf

# Download multiple genesets from various databases
$ gencube geneset GCF_011100685.1 --download refseq_gtf,agustus,toga_gtf

Example output displayed in the terminal

🔥 $ gencube geneset canis_lupus_familiaris

# Search assemblies in NCBI database
  Keyword: ['canis_lupus_familiaris']

  Total 34 genomes are searched

# Filter genomes based on the following criteria
  Level:   ['Complete', 'Chromosome']
  RefSeq:  False
  UCSC:    False
  Latest:  False

# Check accessibility to GenArk, Ensembl Rapid Release and Zoonomia server
  UCSC GenArk  : 5122 genomes across 3389 species
  Ensembl Rapid: 2582 genomes across 1788 species
  Zoonomia TOGA: 951 genomes across 973 species

+----+---------------------+---------+------------+-----------------+---------+----------+-----------+------------+
|    | Assembly name       |   Taxid | Release    | NCBI            | UCSC    | GenArk   | Ensembl   | Zoonomia   |
+====+=====================+=========+============+=================+=========+==========+===========+============+
|  0 | LK_Cfam_Beagle_1.1  |    9615 | 2024/10/30 | GCA_044048985.1 |         |          |           |            |
+----+---------------------+---------+------------+-----------------+---------+----------+-----------+------------+
|  1 | ASM4364393v1        |    9615 | 2024/10/24 | GCA_043643935.1 |         |          |           |            |
+----+---------------------+---------+------------+-----------------+---------+----------+-----------+------------+
|  2 | UNSW_CanFamBas_1.2  |    9615 | 2021/02/23 | GCA_013276365.2 |         | v        |           |            |
+----+---------------------+---------+------------+-----------------+---------+----------+-----------+------------+
|  3 | UMICH_Zoey_3.1      |    9615 | 2019/05/30 | GCF_005444595.1 | canFam5 |          | v         | v          |
+----+---------------------+---------+------------+-----------------+---------+----------+-----------+------------+
|  4 | ROS_Cfam_1.0        |    9615 | 2020/09/03 | GCF_014441545.1 |         | v        | v         |            |
+----+---------------------+---------+------------+-----------------+---------+----------+-----------+------------+
|  5 | Yella_v2            |    9615 | 2023/09/07 | GCA_031165255.1 |         | v        |           |            |
+----+---------------------+---------+------------+-----------------+---------+----------+-----------+------------+
|  6 | CA611_1.0           |    9615 | 2023/08/31 | GCA_031010295.1 |         | v        |           |            |
+----+---------------------+---------+------------+-----------------+---------+----------+-----------+------------+
|  7 | OD_1.0              |    9615 | 2023/08/31 | GCA_031010635.1 |         | v        |           |            |
+----+---------------------+---------+------------+-----------------+---------+----------+-----------+------------+
|  8 | BD_1.0              |    9615 | 2023/08/31 | GCA_031010765.1 |         | v        |           |            |
+----+---------------------+---------+------------+-----------------+---------+----------+-----------+------------+
|  9 | Lak_Megaphage_Wal-2 |    9615 | 2024/06/07 | GCA_964164975.1 |         |          |           |            |
+----+---------------------+---------+------------+-----------------+---------+----------+-----------+------------+
| 10 | Lak_Megaphage_Wal-1 |    9615 | 2024/06/07 | GCA_964162815.1 |         |          |           |            |
+----+---------------------+---------+------------+-----------------+---------+----------+-----------+------------+
| 11 | UU_Cfam_GSD_1.0     |    9615 | 2020/03/10 | GCF_011100685.1 | canFam4 | v        | v         | v          |
+----+---------------------+---------+------------+-----------------+---------+----------+-----------+------------+
...

# Check accessible data in databases
+----+---------------------+---------+-------------------------------+-----------------------+-----------+--------------+
|    | Assembly name       | UCSC    | RefSeq                        | GenArk                | Ensembl   | Zoonomia     |
+====+=====================+=========+===============================+=======================+===========+==============+
|  0 | LK_Cfam_Beagle_1.1  |         |                               |                       |           |              |
+----+---------------------+---------+-------------------------------+-----------------------+-----------+--------------+
|  1 | ASM4364393v1        |         |                               |                       |           |              |
+----+---------------------+---------+-------------------------------+-----------------------+-----------+--------------+
|  2 | UNSW_CanFamBas_1.2  |         |                               | agustus, xenoref      |           |              |
+----+---------------------+---------+-------------------------------+-----------------------+-----------+--------------+
|  3 | UMICH_Zoey_3.1      | canFam5 | gtf, gff, gnomon, cross, same |                       | ensembl   | mouse, human |
+----+---------------------+---------+-------------------------------+-----------------------+-----------+--------------+
|  4 | ROS_Cfam_1.0        |         | gtf, gff, gnomon, cross, same | agustus, xenoref, ref | ensembl   |              |
+----+---------------------+---------+-------------------------------+-----------------------+-----------+--------------+
|  5 | Yella_v2            |         |                               | agustus, xenoref      |           |              |
+----+---------------------+---------+-------------------------------+-----------------------+-----------+--------------+
|  6 | CA611_1.0           |         |                               | agustus, xenoref      |           |              |
+----+---------------------+---------+-------------------------------+-----------------------+-----------+--------------+
|  7 | OD_1.0              |         |                               | agustus, xenoref      |           |              |
+----+---------------------+---------+-------------------------------+-----------------------+-----------+--------------+
|  8 | BD_1.0              |         |                               | agustus, xenoref      |           |              |
+----+---------------------+---------+-------------------------------+-----------------------+-----------+--------------+
|  9 | Lak_Megaphage_Wal-2 |         |                               |                       |           |              |
+----+---------------------+---------+-------------------------------+-----------------------+-----------+--------------+
| 10 | Lak_Megaphage_Wal-1 |         |                               |                       |           |              |
+----+---------------------+---------+-------------------------------+-----------------------+-----------+--------------+
| 11 | UU_Cfam_GSD_1.0     | canFam4 | gtf, gff, gnomon, cross, same | agustus, xenoref, ref | ensembl   | mouse, human |
+----+---------------------+---------+-------------------------------+-----------------------+-----------+--------------+
...

3. `annotation` subcommand

Search, download, and modify chromosome labels for various genome annotations, such as gaps and repeats

options:
  -d types, --download types
                        Download annotation file.
                        gap : Genomic gaps - AGP defined (bigBed format)
                        sr   : Simple tandem repeats by TRF (bigBed)
                        td   : Tandem duplications (bigBed)
                        wm   : Genomic intervals masked by WindowMasker + SDust (bigBed)
                        rmsk : Repeated elements annotated by RepeatMasker (bigBed)
                        cpg  : CpG Islands - Islands < 300 bases are light green (bigBed)
                        gc   : GC percent in 5-Base window (bigWig)
  -c type, --chr_style type
                        Chromosome label style used in the download file (default: ensembl)
                        ensembl : 1, 2, X, MT & unknowns (GenBank IDs)
                        gencode : chr1, chr2, chrX, chrM & unknowns (GenBank IDs)
                        ucsc    : chr1, chr2, chrX, chrM & unknowns (UCSC-specific IDs)
                                  !! Limited use if UCSC IDs are not issued
                        raw     : Uses raw file labels without modification
                                 - NCBI GenBank: CM_* or other-form IDs
                                 - NCBI RefSeq : NC_*, NW_* or other-form IDs
                                 - GenArk      : GenBank or RefSeq IDs
                                 - Ensembl     : Ensembl IDs
                         
  --recursive           Download files regardless of their presence only if integrity check is not possible

Examples

# search usable and accessible data
gencube annotation canis_lupus_familiaris
gencube annotation GCF_011100685.1

# Download specific geneset file from a database
gencube annotation GCF_011100685.1 --download sr
gencube annotation GCF_011100685.1 --download td
gencube annotation GCF_011100685.1 --download rmsk

# Download multiple annotations from various databases
gencube annotation GCF_011100685.1 --download sr,td,rmsk,gc

4. `sequence` subcommand

Search and download sequence data of genesets

options:
  -d types, --download types
                        Download "fasta" formatted sequence file
                        1. Nucleotide sequences:
                           refseq_rna         : Accessioned RNA sequences annotated on the genome assembly
                           refseq_rna_genomic : RNA features based on the genome sequence
                           refseq_cds_genomic : CDS features based on the genome sequence
                           refseq_pseudo      : Pseudogene and other gene regions without transcribed RNA or translated protein products
                           ensembl_cdna       : Ensembl Rapid Release cDNA sequences of transcripts
                           ensembl_cds        : Ensembl Rapid Release coding sequences (CDS)
                           ensembl_repeat     : Ensembl repeat modeler sequences
                        2. Protein sequences:
                           refseq_pep         : Accessioned protein sequences annotated on the genome assembly
                           refseq_pep_cds     : CDS features translated into protein sequences
                           ensembl_pep        : Ensembl Rapid Release protein sequences
                         
  --recursive           Download files regardless of their presence only if integrity check is not possible

Examples

# search usable and accessible data
gencube sequence canis_lupus_familiaris
gencube sequence GCF_011100685.1

# Download specific geneset file from a database
$ gencube sequence GCF_011100685.1 --download refseq_rna
$ gencube sequence GCF_011100685.1 --download ensembl_cdna
$ gencube sequence GCF_011100685.1 --download ensembl_pep

# Download multiple genesets from various databases
$ gencube sequence GCF_011100685.1 --download refseq_rna,ensembl_cdna,refseq_pep,ensembl_pep

5. `crossgenome` subcommand

Search and download comparative genomics data, such as homology, and codon or protein alignment

options:
  -d types, --download types
                        ensembl_homology   : Homology data from Ensembl Rapid Release,
                                             detailing gene orthology relationships across species
                        toga_homology      : Homology data from TOGA, providing predictions of
                                             orthologous genes based on genome alignments
                        toga_align_codon   : Codon alignment data from TOGA, showing aligned codon
                                             sequences between reference and query species
                        toga_align_protein : Protein alignment data from TOGA, detailing aligned
                                             protein sequences between reference and query species
                        toga_inact_mut     : List of inactivating mutations from TOGA, identifying
                                             mutations that disrupt gene function
                         
  --recursive           Download files regardless of their presence only if integrity check is not possible

Examples

# search usable and accessible data
gencube crossgenome canis_lupus_familiaris
gencube crossgenome GCF_011100685.1

# Download specific geneset file from a database
$ gencube crossgenome GCF_011100685.1 --download toga_homology
$ gencube crossgenome GCF_011100685.1 --download toga_align_codon

# Download multiple crossgenome data
$ gencube crossgenome GCF_011100685.1 --download toga_homology,toga_align_codon

6. `seqmeta` subcommand

Searches for high-throughput sequencing data corresponding to user-specified keywords, retrieves the related sample metadata, and integrates it into experiment-level and study-level tables.

$ gencube seqmeta
usage: gencube seqmeta [-h] [-o string] [-st string] [-sr string] [-pl string] [-sl string] [-fi string] [-pr string] [-ly string] [-ac string] [-bp string] [-bs string]
                       [-as string] [-ti string] [-at string] [-pd range] [-md range] [-rl range] [-mb string] [-tw string] [-ex keywords] [-d] [-m]
                       [keywords ...]

Search, retrive, and integrate metadata of experimental sequencing data

positional arguments:
  keywords              Keywords to search for sequencing-based experimental data. You can provide various forms
                        Examples: liver, k562, cancer, breast_cancer, etc
                        
                        Multiple keywords can be combined
                        Keywords separated by commas will combine their results
                        Keywords separated by spaces will intersect their results
                        Example: liver,lung cancer,tumor

options:
  -h, --help            show this help message and exit
  -o string, --organism string
                        Scientific name or common name (as found in the NCBI Taxonomy Browser)
                        Example: homo_sapiens or human
  -st string, --strategy string
                        Sequencing strategy:
                        wgs, wga, wxs, targeted_capture, synthetic_long_read, gbs, rad_seq, tn_seq, clone_end, amplicon
                        clone, rna_seq, mrna_seq, ncrna_seq, ribo_seq, rip_seq, mirna_seq, ssrna_seq, est, fl_cdna, atac_seq
                        dnase_hypersensitivity, faire_seq, chip_seq, chip, mre_seq, bisulfite_seq, mbd_seq, medip_seq, hi_c
                        chia_pet, tethered_chromatin_conformation_capture
  -sr string, --source string
                        Source of the biological data:
                        genomic, genomic_single_cell, transcriptomic, transcriptomic_single_cell, metagenomic
                        metatranscriptomic, synthetic, viral_rna, other
  -pl string, --platform string
                        Name of the sequencing platform:
                        abi_solid, bgiseq, capillary, complete_genomics, dnbseq, element, genapsys, genemind, helicos
                        illumina, ion_torrent, ls454, oxford_nanopore, pacbio_smrt, tapestri, ultima, vela_diagnostics 
  -sl string, --selection string
                        Library selection methodology:
                        5_methylcytidine_antibody, cage, cdna, cdna_oligo_dt, cdna_randompriming, chip, chip_seq, dnase
                        hmpr, hybrid_selection, inverse_rrna, mbd2_protein_methyl_cpg_binding_domain, mda, mf, mnase, msll
                        oligo_dt, other, padlock_probes_capture_method, pcr, polya, race, random, random_pcr
                        reduced_representation, repeat_fractionation, restriction_digest, rt_pcr, size_fractionation
                        unspecified
  -fi string, --filter string
                        Option to find SRA records that are cross-referenced with other NCBI databases
                        (PubMed, PubMed Central (PMC), Nucleotide, Assembly, and others):
                        sra_all, sra_assembly, sra_bioproject, sra_bioproject_all, sra_biosample, sra_biosample_all, sra_gap
                        sra_gap_all, sra_gds, sra_genome, sra_nuccore, sra_nuccore_alignment, sra_nuccore_wgs, sra_omim
                        sra_pmc, sra_public, sra_pubmed, sra_taxonomy
  -pr string, --properties string
                        Option to narrow search results by controlled-vocabulary library's annotations:
                        aligned_data, cloud_gs, cloud_s3, location_gs_us, location_s3_us_east1, location_s3_us_east_1
                        location_s3_us_west_2, filetype_10x_genomics_bam_file, filetype_ab1, filetype_activ_sars2_vcf
                        filetype_archive, filetype_archive/gzip, filetype_assembled_contigs
                        filetype_assembly/realign_summary, filetype_assembly_of_unidentified_reads, filetype_bai
                        filetype_bam, filetype_bam_header, filetype_basemodification, filetype_complete_genomics
                        filetype_crai, filetype_cram, filetype_fast5, filetype_fasta, filetype_fastq
                        filetype_geo_feature_count, filetype_helicos, filetype_illumina_native, filetype_nanopore
                        filetype_pacbio_base_modification_report, filetype_pacbio_metadata, filetype_pacbio_native
                        filetype_realign_to_de_novo_assembly, filetype_reference_fasta, filetype_run, filetype_run_realign
                        filetype_run_zq, filetype_sff, filetype_solid_native, filetype_source, filetype_sra_lite
                        filetype_sra_normalized, filetype_srf, filetype_tar_archive_of_complete_genomics_tree, filetype_tenx
                        filetype_vcf, filetype_vcf_index, filetype_vdbcache, filetype_vdbcache_zq, filetype_wgmlst_sig
                        filetype_wgmlst_signature, has_data, instrument_454_gs, instrument_454_gs_20, instrument_454_gs_flx
                        instrument_454_gs_flx_titanium, instrument_454_gs_junior, instrument_ab_310_genetic_analyzer
                        instrument_ab_3130_genetic_analyzer, instrument_ab_3130xl_genetic_analyzer
                        instrument_ab_3500_genetic_analyzer, instrument_ab_3500xl_genetic_analyzer
                        instrument_ab_3730_genetic_analyzer, instrument_ab_3730xl_genetic_analyzer
                        instrument_ab_5500_genetic_analyzer, instrument_ab_5500xl_genetic_analyzer
                        instrument_ab_5500xl_w_genetic_analysis_system, instrument_ab_solid_3_plus_system
                        instrument_ab_solid_4_system, instrument_ab_solid_4hq_system, instrument_ab_solid_pi_system
                        instrument_ab_solid_system, instrument_ab_solid_system_2_0, instrument_ab_solid_system_3_0
                        instrument_bgiseq_50, instrument_bgiseq_500, instrument_complete_genomics, instrument_dnbseq_g400
                        instrument_dnbseq_g400_fast, instrument_dnbseq_g50, instrument_dnbseq_t7, instrument_element_aviti
                        instrument_fastaseq_300, instrument_genexus, instrument_genolab_m, instrument_gridion
                        instrument_gs111, instrument_helicos_heliscope, instrument_hiseq_x_five, instrument_hiseq_x_ten
                        instrument_illumina_genome_analyzer, instrument_illumina_genome_analyzer_ii
                        instrument_illumina_genome_analyzer_iix, instrument_illumina_hiscansq
                        instrument_illumina_hiseq_1000, instrument_illumina_hiseq_1500, instrument_illumina_hiseq_2000
                        instrument_illumina_hiseq_2500, instrument_illumina_hiseq_3000, instrument_illumina_hiseq_4000
                        instrument_illumina_hiseq_x, instrument_illumina_hiseq_x_ten, instrument_illumina_iseq_100
                        instrument_illumina_miniseq, instrument_illumina_miseq, instrument_illumina_novaseq_6000
                        instrument_illumina_novaseq_x_plus, instrument_ion_genesudio_s5, instrument_ion_genesudio_s5_plus
                        instrument_ion_genesudio_s5_prime, instrument_ion_s5, instrument_ion_s5_xl
                        instrument_ion_torrent_genexus, instrument_ion_torrent_pgm, instrument_ion_torrent_proton
                        instrument_ion_torrent_s5, instrument_ion_torrent_s5_xl, instrument_mgiseq_2000rs, instrument_minion
                        instrument_nextseq_1000, instrument_nextseq_2000, instrument_nextseq_500, instrument_nextseq_550
                        instrument_onso, instrument_pacbio_rs, instrument_pacbio_rs_ii, instrument_promethion
                        instrument_revio, instrument_sentosa_sq301, instrument_sequel, instrument_sequel_ii
                        instrument_sequel_iie, instrument_tapestri, instrument_ug_100, instrument_unspecified
                        study_type_cancer_genomics, study_type_epigenetics, study_type_exome_sequencing
                        study_type_metagenomics, study_type_other, study_type_pooled_clone_sequencing
                        study_type_population_genomics, study_type_synthetic_genomics, study_type_transcriptome_analysis
                        study_type_transcriptome_sequencing, study_type_whole_genome_sequencing
  -ly string, --layout string
                        Library layout of the sequencing data:
                        paired, single
  -ac string, --access string
                        Data accessibility:
                        public, controlled
  -bp string, --bioproject string
                        BioProject accession in the form of PRJNA#, PRJEB#, or PRJDB#
  -bs string, --biosample string
                        BioSample accession in the form of SAMN#, SAMEA#, or SAMD#
  -as string, --accession string
                        SRA/ENA/DDBJ accession
                        Study with accessions in the form of SRP#, ERP#, or DRP#
                        Sample with accessions in the form of SRS#, ERS#, or DRS#
                        Experiment with accessions in the form of SRX#, ERX#, or DRX#
                        Run with accessions in the form of SRR#, ERR#, or DRR#
  -ti string, --title string
                        Descriptive name of the dataset
  -at string, --author string
                        Researcher or group that submitted the data
                         Example: SON_KH
  -pd range, --publication range
                        Publication Date
                        YYYY.MM.DD : YYYY.MM.DD format
                        Example: 2016, 2016.07, 2016.07.01, 2016.07:2023.02
  -md range, --modification range
                        Modification Date
                        YYYY.MM.DD : YYYY.MM.DD format
                        Example: 2016, 2016.07, 2016.07.01, 2016.07:2023.02
  -rl range, --readlength range
                        Length of the sequencing readsExample: 100 or 100:500
  -mb string, --mbases string
                        Number of mega bases in the SRA Runs
  -tw string, --textword string
                        General search term for finding datasets by specific words in metadata
  -ex keywords, --exclude keywords
                        Exclude the results for the keywords used in this option
                        Example: cell_line,normal,crispr
                         
  -d, --detail          Show the number of searched results for each option and keyword
                         
  -m, --metadata        Save integrated metadata
                         
  -u, --url             Save the file, including the URL address of the raw data (.fastq)

Examples

# Search for specific sequencing data for a specific species
$ gencube seqmeta --organism homo_sapiens --strategy rna_seq
$ gencube seqmeta --organism homo_sapiens --strategy chip_seq
$ gencube seqmeta --organism human --strategy chip_seq

# Search for cancer data for specific tissues
$ gencube seqmeta --organism human --strategy chip_seq liver,lung   # liver OR lung
$ gencube seqmeta --organism human --strategy chip_seq cancer,tumor # cancer OR tumor
$ gencube seqmeta --organism human --strategy chip_seq liver,lung cancer,tumor # (liver OR lung) AND (cancer OR tumor)

# Exclude results containing specific keywords
$ gencube seqmeta --organism human --strategy chip_seq liver,lung cancer,tumor --exclude cell_line,crispr # cell_line OR crispr

# Use wild card (*) to search for a broader range of results
$ gencube seqmeta --organism human --strategy chip_seq liver,lung cancer*,tumor* # (liver OR lung) AND (cancer* OR tumor*)

# Use ^ for phrase (not word) search
$ gencube seqmeta --organism human --strategy chip_seq liver,lung cancer,tumor --exclude cell_line^,crispr # "cell line" OR crispr

# Search using accession
$ gencube seqmeta PRJNA838583
$ gencube seqmeta SRP375422
# or specifically
$ gencube seqmeta --bioproject PRJNA838583
$ gencube seqmeta --accession SRP375422

# Search using a custom query
$ gencube seqmeta '(((human[Organism]) AND ("chip seq"[Strategy])) AND ((liver OR lung) AND (cancer OR tumor)))'

# Output the number of search results for each option and keyword
$ gencube seqmeta --organism human --strategy chip_seq --exclude cell_line,crispr liver,lung cancer,tumor --detail

# Save the integrated metadata
$ gencube seqmeta --organism human --strategy chip_seq --exclude cell_line,crispr liver,lung cancer,tumor --metadata

Using the --url option, you can obtain additional files that include the URLs of the raw files.

🔥 $ gencube seqmeta -o dog -st chip_seq --metadata --url

Experiment	Type	URL
SRX23379089	paired	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR277/078/SRR27712678/SRR27712678_1.fastq.gz
SRX23379089	paired	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR277/078/SRR27712678/SRR27712678_2.fastq.gz
SRX23379093	paired	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR277/074/SRR27712674/SRR27712674_1.fastq.gz
SRX23379093	paired	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR277/074/SRR27712674/SRR27712674_2.fastq.gz
SRX1726522	single	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR344/009/SRR3440079/SRR3440079.fastq.gz
SRX1559581	single	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR314/007/SRR3143427/SRR3143427.fastq.gz
SRX1559583	single	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR314/009/SRR3143429/SRR3143429.fastq.gz
...

Structure of output files and directories

🗂️ Displayed results

$ tree
├── gencube_output
│   ├── Canis_lupus_familiaris-CanFam3.1_chr_name.txt
│   ├── Canis_lupus_familiaris-Dog10K_Boxer_Tasha_chr_name.txt
│   ├── genome
│   │   ├── Canis_lupus_familiaris-CanFam3.1-refseq.sm.ens-id.fa.gz
│   │   └── Meta_genome_canfam3_complete-chromosome_241125_103030.txt
│   ├── geneset
│   │   ├── Canis_lupus_familiaris-CanFam3.1-refseq.ens-id.gtf.gz
│   │   └── Canis_lupus_familiaris-Dog10K_Boxer_Tasha-ensembl_ensembl.ens-id.gtf.gz
│   ├── annotation
│   │   ├── Canis_lupus_familiaris-CanFam3.1-genark.chrom.sizes.ens-id.txt
│   │   ├── Canis_lupus_familiaris-CanFam3.1-genark.gc5Base.ens-id.bw
│   │   └── Canis_lupus_familiaris-CanFam3.1-genark.simpleRepeat.ens-id.bb
│   └── sequence
│   │   ├── Canis_lupus_familiaris-CanFam3.1-refseq.protein.faa.gz
│   │   └── Canis_lupus_familiaris-CanFam3.1-refseq.rna.fna.gz
│   ├── crossgenome
│   │   ├── Canis_lupus_familiaris-UU_Cfam_GSD_1.0-ensembl_ensembl_homology.tsv.gz
│   │   ├── Canis_lupus_familiaris-UU_Cfam_GSD_1.0-toga_human_homology.tsv.gz
│   │   └── Canis_lupus_familiaris-UU_Cfam_GSD_1.0-toga_mouse_homology.tsv.gz
│   └── seqmeta
│       ├── Meta_seq_dog_241125_112002_experiment_n279.txt
│       ├── Meta_seq_dog_241125_112002_experiment_n279_urls.txt
│       └── Meta_seq_dog_241125_112002_study_n14.txt
└── gencube_raw_download
    ├── bedGraphToBigWig
    ├── bedToBigBed
    ├── bigBedToBed
    ├── bigWigToBedGraph
    ├── Canis_lupus_familiaris-CanFam3.1_assembly_report.txt
    ├── Canis_lupus_familiaris-CanFam3.1-genark.chrom.sizes.txt
    ├── Canis_lupus_familiaris-CanFam3.1-genark.gc5Base.bw
    ├── Canis_lupus_familiaris-CanFam3.1-genark.simpleRepeat.bb
    ├── Canis_lupus_familiaris-CanFam3.1-refseq.gtf.gz
    ├── Canis_lupus_familiaris-CanFam3.1-refseq.sm.fa.gz
    ├── Canis_lupus_familiaris-Dog10K_Boxer_Tasha_assembly_report.txt
    └── Canis_lupus_familiaris-Dog10K_Boxer_Tasha-ensembl_ensembl.gtf.gz

🔥 Commands used to generate the above results

# Genome
$ gencube genome canfam3 --donwload
$ gencube genome canfam3 --metadata
# Gene set
$ gencube geneset canfam3 -d refseq_gtf
$ gencube geneset canfam3 -d ensembl_gtf
# Annotation
$ gencube annotation canfam3 -d sr
$ gencube annotation canfam3 -d gc
# Sequence
$ gencube sequence canfam3 -d refseq_rna
$ gencube sequence canfam3 -d refseq_pep
# Cross-genome
$ gencube crossgenome GCF_011100685.1 -d ensembl_homology
$ gencube crossgenome GCF_011100685.1 -d toga_homology
# Seqmeta
$ gencube seqmeta -o dog -st chip_seq --metadata --url

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
examples/gencube_output		examples/gencube_output
figures		figures
gencube		gencube
.editorconfig		.editorconfig
.gitignore		.gitignore
HISTORY.md		HISTORY.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Centralized retrieval and integration of multi-omics resources from leading databases

Manuscript

Databases accessed from gencube

Detailed information of each database

Installation

Email and NCBI API key for E-utilities

Tutorials

0. The positional argument and options shared among the `genome`, `geneset`, `sequence`, `annotation`, and `crossgenome` subcommand

Examples

Example output displayed in the terminal

1. `genome` subcommand

Examples

Example output displayed in the terminal

2. `geneset` subcommand

Examples

Example output displayed in the terminal

3. `annotation` subcommand

Examples

4. `sequence` subcommand

Examples

5. `crossgenome` subcommand

Examples

6. `seqmeta` subcommand

Examples

Using the --url option, you can obtain additional files that include the URLs of the raw files.

Structure of output files and directories

Credits

About

Releases

Packages

Contributors 2

Languages

License

snu-cdrc/gencube

Folders and files

Latest commit

History

Repository files navigation

Centralized retrieval and integration of multi-omics resources from leading databases

Manuscript

Databases accessed from gencube

Detailed information of each database

Installation

Email and NCBI API key for E-utilities

Tutorials

0. The positional argument and options shared among the genome, geneset, sequence, annotation, and crossgenome subcommand

Examples

Example output displayed in the terminal

1. genome subcommand

Examples

Example output displayed in the terminal

2. geneset subcommand

Examples

Example output displayed in the terminal

3. annotation subcommand

Examples

4. sequence subcommand

Examples

5. crossgenome subcommand

Examples

6. seqmeta subcommand

Examples

Using the --url option, you can obtain additional files that include the URLs of the raw files.

Structure of output files and directories

Credits

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

0. The positional argument and options shared among the `genome`, `geneset`, `sequence`, `annotation`, and `crossgenome` subcommand

1. `genome` subcommand

2. `geneset` subcommand

3. `annotation` subcommand

4. `sequence` subcommand

5. `crossgenome` subcommand

6. `seqmeta` subcommand

Packages