pagetitle
SARS-CoV-2 Genomics

Common File Formats

This page lists some common file formats used in bioinformatics, in alphabetical order. The heading of each format links to a page with more details about it.

Generally, files can be classified into two categories: text files and binary files.

  • Text files can be opened with standard text editors and manipulated using command-line tools (such as head, less, grep and cat). However, many of the standard formats listed on this page are easier to read with dedicated software that displays their content in a more user-friendly way. For example, the NEWICK format is used to store phylogenetic trees and, although it can be opened in a text editor, it is better opened with software such as FigTree, which visualises the tree as a graph.
  • Binary files are often used to store data more efficiently. Typically, specific tools need to be used with those files. For example, the BAM format is used to store sequences aligned to a reference genome and can be manipulated with dedicated software such as samtools.

Text files are very often compressed to save storage space. A common compression format used in bioinformatics is gzip, which has the extension .gz. Many bioinformatic tools can read compressed files directly. For example, FASTQ files (used to store NGS sequencing data) are often compressed and named with the extension .fq.gz.
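
As a minimal sketch of how a compressed text file can be read without decompressing it first, the Python snippet below uses the standard-library gzip module to write a small file and read it back. The file name example.fq.gz and the record contents are made up for illustration.

```python
import gzip
import os
import tempfile

# Hypothetical FASTQ record, used only for illustration.
record = "@read1\nACGT\n+\nIIII\n"

# Write the record to a gzip-compressed file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.fq.gz")
with gzip.open(path, "wt") as handle:
    handle.write(record)

# Opening in text mode ("rt") decompresses the content on the fly.
with gzip.open(path, "rt") as handle:
    lines = [line.rstrip("\n") for line in handle]

print(lines)
```

The same gzip.open call should work for any gzip-compressed text format on this page, such as .vcf.gz files.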

BAM ("Binary Alignment Map")

  • Binary file.
  • Stores the same information as a SAM file, but in a compressed binary form.
  • File extension: .bam

BED ("Browser Extensible Data")

  • Text file.
  • Stores coordinates of genomic regions.
  • File extension: .bed
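
The sketch below shows one way to read a region from a BED line in Python. It assumes the usual BED layout of tab-separated columns, where the first three are the sequence name, a 0-based start and an exclusive end. The function name parse_bed_line, the coordinates and the region name are hypothetical.

```python
from typing import NamedTuple


class BedRegion(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def parse_bed_line(line: str) -> BedRegion:
    """Parse the first three columns of a tab-separated BED line."""
    fields = line.rstrip("\n").split("\t")
    return BedRegion(fields[0], int(fields[1]), int(fields[2]))


# Made-up region on the SARS-CoV-2 reference sequence name.
region = parse_bed_line("MN908947.3\t100\t250\texample_region\n")

# With a 0-based start and an exclusive end, the length is end - start.
print(region.chrom, region.end - region.start)
```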

CSV ("Comma Separated Values")

  • Text file.
  • Stores tabular data, with values separated by commas (also see the TSV format).
  • File extension: .csv

These files can be opened with spreadsheet programs (such as Microsoft Excel). They can also be created from spreadsheet programs by going to File > Save As... and selecting "CSV (Comma delimited)" as the file format.
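
Outside of a spreadsheet program, CSV files can also be read programmatically. A minimal sketch using Python's standard-library csv module is shown here; the column names and values are invented for illustration.

```python
import csv
import io

# Made-up table; in practice this would come from open("samples.csv").
text = "sample,lineage\nsample01,B.1.1.7\nsample02,BA.2\n"

# DictReader uses the first row as column names.
rows = list(csv.DictReader(io.StringIO(text)))

for row in rows:
    print(row["sample"], row["lineage"])
```

Passing delimiter="\t" to csv.DictReader should handle tab-separated (TSV) files in the same way.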

FAST5

  • Binary file. More specifically, this is a Hierarchical Data Format (HDF5) file.
  • Used by Nanopore platforms to store the called sequences (in FASTQ format) as well as the raw electrical signal data from the pore.
  • File extension: .fast5

FASTA

  • Text file.
  • Stores nucleotide or amino acid sequences.
  • File extensions: .fa or .fas or .fasta
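
To make the layout of these files concrete, here is a small Python sketch of a reader for them. It assumes the usual FASTA convention of a header line starting with ">" followed by one or more sequence lines. The function name read_fasta and the example sequences are hypothetical, and dedicated libraries are likely a better choice for real analyses.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Made-up sequences; seq1 is split over two lines.
example = io.StringIO(">seq1\nACGT\nACGT\n>seq2\nTTGA\n")
records = list(read_fasta(example))
print(records)
```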

FASTQ

  • Text file, but often compressed with gzip.
  • Stores sequences and their quality scores.
  • File extensions: .fq or .fastq (compressed as .fq.gz or .fastq.gz)
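
The following Python sketch illustrates how sequences and quality scores sit together in such a file. It assumes the common four-line record layout (header, sequence, "+" separator, quality string) and the Phred+33 quality encoding. Both are assumptions about the input rather than something guaranteed for every file, and the function name read_fastq is hypothetical.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, scores) from four-line FASTQ records."""
    while True:
        block = list(islice(handle, 4))
        if len(block) < 4:
            break  # end of file (or an incomplete trailing record)
        header, sequence, _, quality = (line.rstrip("\n") for line in block)
        # Assumed Phred+33 encoding: score = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# Made-up read with four bases and four quality characters.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```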

GFF ("General Feature Format")

  • Text file.
  • Stores gene coordinates and other features.
  • File extension: .gff

NEWICK

  • Text file.
  • Stores phylogenetic trees, including node names and edge lengths.
  • File extensions: .tree or .treefile

SAM ("Sequence Alignment Map")

  • Text file.
  • Stores sequences aligned to a reference genome (also see the BAM format).
  • File extension: .sam
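
Because SAM is a tab-separated text format, an alignment line can be split into named fields. The Python sketch below assumes the eleven mandatory SAM columns in their standard order; the function name parse_sam_line and the example read are made up. For real data, dedicated tools such as samtools remain the better option.

```python
# The eleven mandatory SAM columns, in order.
SAM_COLUMNS = [
    "qname", "flag", "rname", "pos", "mapq", "cigar",
    "rnext", "pnext", "tlen", "seq", "qual",
]


def parse_sam_line(line: str) -> dict:
    """Parse the mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    # Convert the numeric columns from text to integers.
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    # Bit 0x4 of the flag marks an unmapped read.
    record["is_unmapped"] = bool(record["flag"] & 0x4)
    return record


# Made-up alignment; the position is 1-based in SAM.
line = "read1\t0\tMN908947.3\t101\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
record = parse_sam_line(line)
print(record["rname"], record["pos"], record["cigar"])
```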

TSV ("Tab-Separated Values")

  • Text file.
  • Stores tabular data, with values separated by tabs (also see the CSV format).
  • File extensions: .tsv or .txt

These files can be opened with spreadsheet programs (such as Microsoft Excel). They can also be created from spreadsheet programs by going to File > Save As... and selecting "Text (Tab delimited)" as the file format.

VCF ("Variant Call Format")

  • Text file, but often compressed with gzip.
  • Stores SNP/indel variants.
  • File extension: .vcf (or compressed as .vcf.gz)
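
As a rough illustration of how variants are laid out in a VCF file, the Python sketch below skips the header lines (which start with "#") and reads the first five tab-separated columns of each record: CHROM, POS, ID, REF and ALT. The function name read_vcf, the simple SNP/indel labelling and the example variants are assumptions made for this sketch.

```python
import io


def read_vcf(handle):
    """Yield (chrom, pos, ref, alt, kind) for each ALT allele in a VCF."""
    for line in handle:
        if line.startswith("#"):
            continue  # meta-information and column header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # ALT may list several alleles separated by commas.
        for allele in alt.split(","):
            if len(ref) == 1 and len(allele) == 1:
                kind = "SNP"
            elif len(ref) != len(allele):
                kind = "indel"
            else:
                kind = "other"
            yield chrom, int(pos), ref, allele, kind


# Made-up variants, for illustration only.
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "MN908947.3\t100\t.\tA\tG\t.\tPASS\t.\n"
    "MN908947.3\t200\t.\tACG\tA\t.\tPASS\t.\n"
)
variants = list(read_vcf(example))
print(variants)
```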