more docs
jguhlin committed Apr 30, 2024
1 parent 29841d3 commit 077cd40
Showing 12 changed files with 282 additions and 89 deletions.
36 changes: 28 additions & 8 deletions FORMAT.md
Original file line number Diff line number Diff line change
@@ -1,18 +1,12 @@
# SFASTA File Format Definition
Version 0.0.2
Version 0.0.3

## Reasoning / Need
FASTA/Q format is slow for random access, even with an index provided by samtools.

## Pros/Cons
By concatenating all of the sequences into successive blocks, it becomes more difficult to add or remove sequences at a whim. However, many large sequence files are rarely changed (NT, UniProt, nr, reads, etc).

## Warning
This format is still in heavy development; backwards compatibility is very unlikely and not desirable at this stage.

## File Format
Bincoded, using serde.

### Overview
* Directory struct
* Parameters struct
@@ -23,7 +17,33 @@ Bincoded, using serde.
* Index struct
* Order Struct

# TODO: Add masking stream...
## Header
| Data Type | Name | Description |
| ---:|:--- |:--- |
| [u8; 6] | b"sfasta" | Indicates an sfasta file |
| u64 | Version | Version of the SFASTA file (currently set to 1) |
| struct | [Directory](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/directory.rs#L53) | Directory of the file; u64 bytes pointing to indices, sequence blocks, etc... |
| struct | [Parameters](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/parameters.rs#L5) | Parameters used to create the file |
| struct | [Metadata](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/metadata.rs#L3) | unused |
| structs | [SequenceBlock](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/sequence_block.rs#L11) | Sequences, split into block_size chunks, and compressed on disk (see: [SequenceBlockCompressed](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/sequence_block.rs#L154)) |
| Vec<u64> | Sequence Block Offsets | Location of each sequence block, in bytes, from the start of the file. Stored on disk as bitpacked u32. |
| [Header](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/header.rs) Region ||
| enum | [CompressionType](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/structs.rs#L23) | Type of compression used for the Headers. |
| u64 | Block Locations Position | Position of the header blocks locations |
| [u8] | Header Block | Header (everything after the sequence ID in a FASTA file) stored as blocks of u8 on disk, zstd compressed. |
| [u64] | Header Block Offsets | Location of each header block, in bytes, from the start of the file. |
| [IDs](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/id.rs) Region ||
| enum | [CompressionType](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/structs.rs#L23) | Type of compression used for the IDs. |
| u64 | Block Locations Position | Position of the ID blocks locations |
| [u8] | ID Block | IDs (the sequence identifiers from a FASTA file) stored as blocks of u8 on disk, zstd compressed. |
| [u64] | ID Block Offsets | Location of each ID block, in bytes, from the start of the file. |
| [Masking](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/masking.rs) Region ||
| u64 | bitpack_len | Length of each bitpacked block |
| u8 | num_bits| Number of bits used to bitpack each integer |
| [Packed] | BitPacked Masking Instructions | Bitpacked masking instructions. [See here](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/masking/ml32bit.rs) |
| [struct] | [SeqLocs](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/sequence_block.rs) | Sequence locations, stored as a vector of u64. |
| Special | [Dual Index](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/dual_level_index/dual_index.rs) | See file for more description. Rest of this table TBD |


### Directory
```
62 changes: 27 additions & 35 deletions README.md
Original file line number Diff line number Diff line change
@@ -183,8 +183,6 @@ Uncompressed: 272M

Compression speed is slower, but this is primarily due to index creation. For comparison, samtools takes 2.07 seconds to generate the index for bgzip. Also of note: pigz, ennaf, and zstd do not support indexing, while crabz does (using the bgzf format).



## Nanopore Reads
As a FASTA file.

@@ -245,6 +243,31 @@ Uncompressed: 2.7G
| bgzip (excl index) | 635M |
| Zstd (no index) | 663M |

## Illumina Reads
11Gb of reads (FASTQ)

### Compression Speed
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `sfa convert --threads 14 reads.fastq` | 80.794 ± 3.379 | 77.700 | 87.570 | 9.59 ± 0.53 |
| `bgzip -kf --threads 16 reads.fastq` | 48.920 ± 0.264 | 48.645 | 49.279 | 5.81 ± 0.21 |
| `pigz -kf -p 16 reads.fastq` | 80.712 ± 1.461 | 78.945 | 83.181 | 9.58 ± 0.39 |
| `ennaf --dna --fastq --temp-dir /tmp reads.fastq -o reads.naf` | 38.276 ± 0.246 | 37.951 | 38.602 | 4.54 ± 0.17 |
| `zstd -k reads.fastq -f -T16` | 8.423 ± 0.303 | 7.915 | 8.889 | 1.00 |
| `crabz -f bgzf -p 16 reads.fastq -o reads.fastq.gz` | 22.176 ± 3.802 | 17.889 | 26.813 | 2.63 ± 0.46 |

### Genome Size
Uncompressed: 2.7G

| Compression Type | Size |
|---|--|
| NAF (no index) | 1.9Gb |
| sfasta (incl index) | 2.5Gb |
| bgzip (excl index) | 635M |
| Zstd (no index) | 663M |


# Future Plans
## Implement NAF-like algorithm
[NAF](https://github.com/KirillKryukov/naf) has an advantage with 4bit encoding. It's possible to implement this, and use 2bit when possible, to gain additional speed-ups. Further, there is some SIMD support for 2bit and 4bit DNA/RNA encoding.
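
To illustrate what 2bit encoding means here, the following is a minimal sketch that packs four unambiguous bases into each byte. The mapping (A=0, C=1, G=2, T=3), the bit order, and the function names `encode_base`, `pack_2bit`, and `unpack_2bit` are assumptions made for illustration; they are not taken from NAF or from sfasta. Anything outside A/C/G/T is rejected, which is where a 4bit path would be needed, and case is not preserved by this sketch.

```rust
/// Map an unambiguous DNA base to a 2-bit code (assumed mapping: A=0, C=1, G=2, T=3).
/// Returns None for anything else (for example N), which would need a wider encoding.
fn encode_base(base: u8) -> Option<u8> {
    match base.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a sequence into bytes, four bases per byte, first base in the lowest bits.
/// Returns None if the sequence contains a base that does not fit in 2 bits.
fn pack_2bit(seq: &[u8]) -> Option<Vec<u8>> {
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    for (i, &base) in seq.iter().enumerate() {
        let code = encode_base(base)?;
        packed[i / 4] |= code << ((i % 4) * 2);
    }
    Some(packed)
}

/// Unpack `len` bases from a buffer produced by `pack_2bit`.
/// The length must be stored separately, because the last byte may be partly unused.
fn unpack_2bit(packed: &[u8], len: usize) -> Vec<u8> {
    const BASES: [u8; 4] = *b"ACGT";
    (0..len)
        .map(|i| BASES[((packed[i / 4] >> ((i % 4) * 2)) & 0b11) as usize])
        .collect()
}
```

With this layout a sequence takes a quarter of its original size before any general-purpose compression is applied, which is the kind of gain the paragraph above refers to.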
@@ -273,7 +296,6 @@ To make it easier to use in other programs and in python/jupyter
## Small file optimization
Sfasta is currently optimized for larger files.


## GFA file format support
Graph genome file format is in dire need of an optimized format

@@ -294,43 +316,13 @@ cargo fuzz run parse_sfasta -- -detect_leaks=0 -rss_limit_mb=4096mb -max_len=838
## I get a strange symbol near the progress bar
You need to install a font that supports Unicode. I'll see if there is a way to auto-detect.

## XZ compression is fast until about halfway, then slows to a crawl.
The buffers can store lots of sequence, but the compression algorithm takes longer.

## Why so many dependencies?
Right now it works with a wide range of compression functions. Once some are determined to be the best, others could be dropped from future versions. The file format itself has a version identifier, so we could ask people to roll back to an older version if they need to.

## Why samtools comparison?
I've got plenty of experiments trying to get a fast gzip compressed multi-threaded reader, but even when mounted on a ramdisk, it is too slow. Samtools is an awesome, handy tool that has the 'faidx' function, of which I am a huge fan. While the faidx is not its main function, it is not optimized for large datasets, thus the test is a little unfair. Still, it's helpful to have something to compare to.
I've got plenty of experiments trying to get a fast gzip compressed multi-threaded reader, but even when mounted on a ramdisk, it is too slow. Samtools is an awesome, handy tool that has the 'faidx' function, which I use almost constantly. While faidx is a handy utility function, it is not optimized for large datasets, thus the test is a little unfair. Still, it's helpful to have something to compare to.

# File Format
The best source is currently this file: [conversion.rs](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/conversion.rs#L148). Masking is converted into instructions in a u32, see [ml32bit.rs](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/masking/ml32bit.rs).

## Header
| Data Type | Name | Description |
| ---:|:--- |:--- |
| [u8; 6] | b"sfasta" | Indicates an sfasta file |
| u64 | Version | Version of the SFASTA file (currently set to 1) |
| struct | [Directory](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/directory.rs#L53) | Directory of the file; u64 bytes pointing to indices, sequence blocks, etc... |
| struct | [Parameters](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/parameters.rs#L5) | Parameters used to create the file |
| struct | [Metadata](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/metadata.rs#L3) | unused |
| structs | [SequenceBlock](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/sequence_block.rs#L11) | Sequences, split into block_size chunks, and compressed on disk (see: [SequenceBlockCompressed](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/sequence_block.rs#L154)) |
| Vec<u64> | Sequence Block Offsets | Location of each sequence block, in bytes, from the start of the file. Stored on disk as bitpacked u32. |
| [Header](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/header.rs) Region ||
| enum | [CompressionType](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/structs.rs#L23) | Type of compression used for the Headers. |
| u64 | Block Locations Position | Position of the header blocks locations |
| [u8] | Header Block | Header (everything after the sequence ID in a FASTA file) stored as blocks of u8 on disk, zstd compressed. |
| [u64] | Header Block Offsets | Location of each header block, in bytes, from the start of the file. |
| [IDs](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/id.rs) Region ||
| enum | [CompressionType](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/structs.rs#L23) | Type of compression used for the IDs. |
| u64 | Block Locations Position | Position of the ID blocks locations |
| [u8] | ID Block | IDs (the sequence identifiers from a FASTA file) stored as blocks of u8 on disk, zstd compressed. |
| [u64] | ID Block Offsets | Location of each ID block, in bytes, from the start of the file. |
| [Masking](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/masking.rs) Region ||
| u64 | bitpack_len | Length of each bitpacked block |
| u8 | num_bits| Number of bits used to bitpack each integer |
| [Packed] | BitPacked Masking Instructions | Bitpacked masking instructions. [See here](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/masking/ml32bit.rs) |
| [struct] | [SeqLocs](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/data_types/sequence_block.rs) | Sequence locations, stored as a vector of u64. |
| Special | [Dual Index](https://github.com/jguhlin/sfasta/blob/main/libsfasta/src/dual_level_index/dual_index.rs) | See file for more description. Rest of this table TBD |
The format is found in the docs.

![Genomics Aotearoa](./static/genomics-aotearoa.png)
9 changes: 1 addition & 8 deletions book/src/SUMMARY.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,4 @@
# Summary

- [About](./about.md)
- [Format](./file_format.md)
- [Structs](./structs.md)
- [Directory](./structs/directory.md)
- [Parameters](./structs/parameters.md)
- [Metadata](./structs/metadata.md)
- [Datatypes](./datatypes.md)
- [Compressed Blocks](./datatypes/compressed_blocks.md)
- [Block Index](./block_index.md)
- [Format](./file_format.md)
225 changes: 217 additions & 8 deletions book/src/file_format.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,223 @@
# File Format
The SFASTA format is held together by [bincode](https://github.com/bincode-org/bincode), which serializes structs and other data to the file.

## File Format Table
Unlike many other formats, SFASTA is laid out more like a database and is meant to be navigated through its directory rather than read front to back. Linear processing is difficult and provides no benefit. Once the directory is read, it is possible to seek to a file location and parse the data stored there.

It also means if the order of the data changes, the file format remains stable, assuming the directory remains in the same location at the head of the file.
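
The seek-based access pattern can be sketched with nothing more than the standard library. This is an illustration under stated assumptions, not the libsfasta reader: `has_magic` and `read_at` are hypothetical helpers, the magic bytes are passed in as a parameter rather than assumed, and decoding the returned bytes (bincode, decompression) is left out entirely.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Check whether the file starts with the expected magic bytes.
/// `magic` is a parameter because this sketch does not assume its exact value.
fn has_magic<R: Read + Seek>(reader: &mut R, magic: &[u8]) -> io::Result<bool> {
    reader.seek(SeekFrom::Start(0))?;
    let mut buf = vec![0u8; magic.len()];
    reader.read_exact(&mut buf)?;
    Ok(buf == magic)
}

/// Jump straight to a location taken from the directory and read `len` raw bytes.
/// Nothing before `loc` needs to be read or parsed first.
fn read_at<R: Read + Seek>(reader: &mut R, loc: u64, len: usize) -> io::Result<Vec<u8>> {
    reader.seek(SeekFrom::Start(loc))?;
    let mut buf = vec![0u8; len];
    reader.read_exact(&mut buf)?;
    Ok(buf)
}
```

Because every lookup starts from an absolute location, sections can move around in the file without breaking readers, as long as the directory itself stays where they expect it.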

## File Format

| Name | Type | Description |
| ---- | ---- | ----------- |
| Header | str | b"SFASTA" |
| Version | u64 | Indicated the version, for future compatibility |
| Directory | Struct: [Directory](./structs/directory.md) | Locations of index, sequences, masking, etc... |
| Parameters | Struct: [Parameters](./structs/parameters.md) | Parameters used to create the file |
| Metadata | Struct: [Metadata](./structs/metadata.md) | Metadata about the file |
| Compressed Blocks | [Compressed Blocks](./datatypes/compressed_blocks.md) | Compressed blocks of data |
| Block Index | [Block Index](./block_index.md) | Index of the compressed blocks |
| Header | str | b"SFASTA" - The "Magic Bytes" |
| Version | u64 | Indicates the version, for future compatibility |
| Directory | [Directory](#directory) | Locations of index, sequences, masking, etc... |
| Parameters | [Parameters](#parameters) | Parameters used to create the file |
| Metadata | [Metadata](#metadata) | Metadata about the file |
| Compressed Blocks | [Compressed Blocks](#compressed-blocks) | Individual compressed blocks |
| Block Locations | [Fractal tree](#fractal-tree) | Stored in this order: headers, ids, masking, sequences, scores |
| Headers | [Block Store Headers](#block-store-headers) | Stored in the following order: Headers, IDs, Masking, Sequences, Scores |

## Directory
| Field | Type | Description |
|:-----:|:----:|:-----------:|
| Index Loc | u64 | Location of the Index Fractal Tree |
| IDs Loc | u64 | Location of the ID Block Store Headers |
| SeqLocs Loc | u64 | Location of the SeqLocs Store Headers |
| Scores Loc | u64 | Location of the quality scores store headers |
| Masking Loc | u64 | Location of the masking store headers |
| Headers loc | u64 | Location of the headers store headers |
| Sequences Loc | u64 | Location of the sequences stores headers |
| Flags Loc | u64 | Not yet implemented |
| Signals Loc | u64 | Not yet implemented |
| Mods Loc | u64 | Not yet implemented |

For all directory entries, 0 indicates None.
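
The zero-means-None convention maps directly onto `Option<NonZeroU64>` from the Rust standard library. A minimal sketch of the conversion in both directions follows; the helper names `loc_from_raw` and `loc_to_raw` are hypothetical and not part of libsfasta.

```rust
use std::num::NonZeroU64;

/// Interpret a raw directory entry: 0 means the section is not present.
fn loc_from_raw(raw: u64) -> Option<NonZeroU64> {
    NonZeroU64::new(raw)
}

/// Convert back to the raw value, writing 0 for a missing section.
fn loc_to_raw(loc: Option<NonZeroU64>) -> u64 {
    loc.map_or(0, NonZeroU64::get)
}
```

Since offset 0 is always occupied by the start of the file, no real section can live there, so using 0 as the "absent" marker costs nothing.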

```rust
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct Directory
{
    pub index_loc: Option<NonZeroU64>,
    pub ids_loc: Option<NonZeroU64>,
    pub seqlocs_loc: Option<NonZeroU64>,
    pub scores_loc: Option<NonZeroU64>,
    pub masking_loc: Option<NonZeroU64>,
    pub headers_loc: Option<NonZeroU64>,
    pub sequences_loc: Option<NonZeroU64>,
    pub flags_loc: Option<NonZeroU64>,
    pub signals_loc: Option<NonZeroU64>,
    pub mods_loc: Option<NonZeroU64>,
}
```

## Parameters
| Field | Type | Description |
|:-----:|:----:|:-----------:|
| block size | u32 | Block size used for storage. |

This struct is likely to be removed, with each block store header storing the block size for its own data type instead.

```rust
#[derive(Debug, Clone, bincode::Encode, bincode::Decode)]
pub struct Parameters
{
    pub block_size: u32,
}
```

## Metadata
Metadata is stored uncompressed, so that files can be found with grep and the metadata accessed rapidly. All fields are open-ended, and their formats are not enforced.

| Field | Type | Description |
|:-----:|:----:|:-----------:|
| created_by | Option<String> | |
| citation_doi | Option<String> | |
| citation_url | Option<String> | |
| citation_authors | Option<String> | |
| date_created | u64 | |
| title | Option<String> | |
| description | Option<String> | |
| notes | Option<String> | |
| download_url | Option<String> | |
| homepage_url | Option<String> | |
| version | Option<usize> | |

```rust
#[derive(
    Debug,
    Clone,
    bincode::Encode,
    bincode::Decode,
    Default,
    Serialize,
    Deserialize,
)]
pub struct Metadata
{
    pub created_by: Option<String>,
    pub citation_doi: Option<String>,
    pub citation_url: Option<String>,
    pub citation_authors: Option<String>,
    pub date_created: u64,
    pub title: Option<String>,
    pub description: Option<String>,
    pub notes: Option<String>,
    pub download_url: Option<String>,
    pub homepage_url: Option<String>,
    pub version: Option<usize>,
}
```

## Compressed Blocks
| Field | Type | Description |
|:-----:|:----:|:-----------:|
| data | Vec<u8> | |

A block is written out when the block size is reached or when file processing is complete. How data is converted to bytes (u8) depends on the store being used (Strings, Masking, etc.).

Data is bincoded to Vec<u8>, each block is compressed individually, and the result is bincoded into the file as Vec<u8>.
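
The blocking step can be sketched as follows. This is a simplified illustration, not the libsfasta implementation: compression and bincode are omitted, and `split_into_blocks` and `locate` are hypothetical helper names. It shows that every block is full except possibly the last one, and how a fixed uncompressed block size lets a reader work out which block holds a given byte.

```rust
/// Split `data` into blocks of at most `block_size` bytes.
/// Every block is full except possibly the last, which holds whatever
/// remains when the input ends. Each block would then be compressed on its own.
fn split_into_blocks(data: &[u8], block_size: usize) -> Vec<Vec<u8>> {
    assert!(block_size > 0, "block_size must be non-zero");
    data.chunks(block_size).map(|chunk| chunk.to_vec()).collect()
}

/// For a byte offset into the uncompressed stream, return the index of the
/// block that holds it and the position inside that block.
fn locate(offset: u64, block_size: u32) -> (u64, u32) {
    let size = u64::from(block_size);
    (offset / size, (offset % size) as u32)
}
```

Because each block is compressed independently, reading one region only requires decompressing the blocks that overlap it rather than the whole stream.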

## Block Store Headers
| Field | Type | Description |
|:-----:|:----:|:-----------:|
| Compression Config | [CompressionConfig](#compression-config) | Configuration for this datatype |
| Location of the Blocks Fractal Tree Index | [FractalTree](#fractal-tree) | Location of the Fractal Tree |
| Block Size | u32 | Size of uncompressed blocks for this data type |

## Compression Config
| Field | Type | Description |
|:-----:|:----:|:-----------:|
| Compression Type | enum | [Compression Type](#compression-type) |
| Compression Level | i8 | Compression level to use for this data type |

```rust
pub struct CompressionConfig
{
    pub compression_type: CompressionType,
    pub compression_level: i8,

    #[serde(skip)]
    pub compression_dict: Option<Vec<u8>>,
}
```

## Compression Type
```rust
#[derive(
    PartialEq,
    Eq,
    Debug,
    Clone,
    Copy,
    bincode::Encode,
    bincode::Decode,
    Serialize,
    Deserialize,
)]
#[non_exhaustive]
pub enum CompressionType
{
    ZSTD,   // 1 should be default compression ratio
    LZ4,    // 9 should be default compression ratio
    SNAPPY, // Implemented
    GZIP,   // Implemented
    NONE,   // No Compression
    XZ,     // Implemented, 6 is default ratio
    BROTLI, // Implemented, 6 is default
    BZIP2,  // Implemented
    BIT2,   // Not implemented
    BIT4,   // Not implemented
}
```

## Fractal Tree
The fractal tree exists in two states: the Build state and the OnDisk state. Only the OnDisk state is stored in the file.

Fractal trees are made up of [nodes](#node).

```rust
pub struct FractalTreeDisk<K: Key, V: Value>
{
    pub root: NodeDisk<K, V>,
    pub start: u64, /* On disk position of the fractal tree, such that
                     * all locations are start + offset */
    pub compression: Option<CompressionConfig>,
}
```

### Node
| Field | Type | Description |
|:-----:|:----:|:-----------:|
| is_root | bool | Is the root node? |
| is_leaf | bool | Is a leaf node? |
| state | NodeState | Is the node on the disk, or loaded into memory? |
| keys | Vec(K) | Vector of the keys |
| children | Vec(NodeDisk) | Children of this node (invalid if is_leaf is true) |
| values | Vec(V) | Values of this leaf node (invalid if is_leaf is false) |

```rust
#[derive(Debug, Clone)]
pub struct NodeDisk<K, V>
{
    pub is_root: bool,
    pub is_leaf: bool,
    pub state: NodeState,
    pub keys: Vec<K>,
    pub children: Option<Vec<Box<NodeDisk<K, V>>>>,
    pub values: Option<Vec<V>>,
}
```

### Node State
Represents if the fractal tree node is currently on disk, compressed, or stored in memory (and thus accessible). If stored on disk, contains the location of the data.

```rust
#[derive(Debug, Clone)]
pub enum NodeState
{
    InMemory,
    Compressed(Vec<u8>),
    OnDisk(u32),
}
```
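
As a rough sketch of how an on-disk node could be resolved to a file position, the snippet below combines the tree's `start` with the value held in `NodeState::OnDisk`. It assumes that the `u32` is an offset relative to `start`, following the "all locations are start + offset" comment on `FractalTreeDisk`. The enum is redeclared locally so the example stands alone, and `absolute_position` is a hypothetical helper, not a libsfasta function.

```rust
/// Local copy of the `NodeState` enum from the text, so the sketch is self-contained.
#[derive(Debug, Clone)]
enum NodeState {
    InMemory,
    Compressed(Vec<u8>),
    OnDisk(u32),
}

/// Absolute file position of a node that still lives on disk.
/// Returns None when the node's data is already held in memory
/// (either decoded or as compressed bytes), so no seek is needed.
/// Assumption: the `u32` in `OnDisk` is an offset relative to `tree_start`.
fn absolute_position(tree_start: u64, state: &NodeState) -> Option<u64> {
    match state {
        NodeState::OnDisk(offset) => Some(tree_start + u64::from(*offset)),
        NodeState::InMemory | NodeState::Compressed(_) => None,
    }
}
```

Storing a 32-bit relative offset instead of a 64-bit absolute one keeps each node reference small, at the cost of limiting how far a node can sit from the start of its tree.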
1 change: 0 additions & 1 deletion book/src/structs/compressed_blocks.md

This file was deleted.
