added huffman refs
eyaler committed Oct 27, 2022
1 parent 1f0dffa commit 662f17c
Showing 2 changed files with 6 additions and 5 deletions.
2 changes: 1 addition & 1 deletion TODO.md
@@ -13,7 +13,7 @@
 - Ablation benchmarks
 - Auto-caps should use modifiers for next letter/word/sentence/paragraph or block-level, over simple mode instead of falling back to raw
 - Dictionary compression for long texts
-- [Fast Huffman one-shift decoder](https://researchgate.net/publication/3159499_On_the_implementation_of_minimum_redundancy_prefix_codes)
+- [Fast Huffman one-shift decoder](https://researchgate.net/publication/3159499_On_the_implementation_of_minimum_redundancy_prefix_codes), or [follow-up](https://arxiv.org/pdf/1410.3438.pdf) [works](https://arxiv.org/pdf/2108.05495.pdf)
 - [Base139](https://github.com/kevinAlbs/Base122/issues/3#issuecomment-263787763)
 - Compress the JS itself and use eval, considering also JS packing e.g. [JSCrush](http://iteral.com/jscrush), [RegPack](https://siorki.github.io/regPack), [Roadroller](https://lifthrasiir.github.io/roadroller)
 - Benchmark [Roadroller](https://lifthrasiir.github.io/roadroller) entropy coding
9 changes: 5 additions & 4 deletions ztml/huffman.py
@@ -3,15 +3,16 @@
 Even though we later compress with DEFLATE which does its own Huffman encoding internally,
 I found that for text compression, it is significantly beneficial to pre-encode with Huffman.
 Canonical encoding obviates saving or reconstructing an explicit codebook.
-Instead, we save a string of symbols ordered by increasing frequency,
-and a sparse dictionary from codeword lengths to bases and offsets
-(see paper, but note it is my custom implementation).
+Instead, we save a string of symbols and a sparse dictionary from codeword lengths to bases and offsets
+(see Moffat paper, but note it is my custom implementation).
 A minimalistic JS decoder code is generated.
 References:
 https://wikipedia.org/wiki/Canonical_Huffman_code
 https://github.com/ilanschnell/bitarray/blob/master/doc/canonical.rst
-https://researchgate.net/publication/3159499_On_the_implementation_of_minimum_redundancy_prefix_codes
+https://researchgate.net/publication/3159499_On_the_implementation_of_minimum_redundancy_prefix_codes (Moffat)
+https://arxiv.org/pdf/1410.3438.pdf
+https://arxiv.org/pdf/2108.05495.pdf
 """
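
The scheme the docstring describes (a string of symbols plus a sparse dictionary from codeword lengths to bases and offsets) can be illustrated with a minimal Python sketch. This is not the code in `ztml/huffman.py`, and it decodes in Python rather than generating JS. All function names (`code_lengths`, `canonical_tables`, `encode`, `decode`) and the exact table layout `{length: (base, offset)}` are assumptions made for illustration only.

```python
import heapq
from collections import Counter
from itertools import count


def code_lengths(text):
    """Return {symbol: codeword length} for a Huffman code over text."""
    freqs = Counter(text)
    if len(freqs) == 1:  # a lone symbol still needs a 1-bit codeword
        return {next(iter(freqs)): 1}
    tiebreak = count()  # unique counter so the heap never compares the lists
    heap = [(f, next(tiebreak), [s]) for s, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        for s in group1 + group2:  # every merge adds one bit to its members
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), group1 + group2))
    return lengths


def canonical_tables(lengths):
    """Return (symbols, table, codebook) for a canonical code.

    symbols:  all symbols, ordered by (codeword length, symbol)
    table:    {length: (base, offset)} only for lengths in use (sparse);
              base is the first codeword of that length, offset is the
              index in symbols of the symbol that owns it
    codebook: {symbol: bit string}, needed by the encoder only
    """
    ordered = sorted(lengths, key=lambda s: (lengths[s], s))
    table, codebook = {}, {}
    code = prev_len = 0
    for offset, s in enumerate(ordered):
        n = lengths[s]
        code <<= n - prev_len  # pad with zeros when the length grows
        prev_len = n
        table.setdefault(n, (code, offset))
        codebook[s] = format(code, f'0{n}b')
        code += 1
    return ''.join(ordered), table, codebook


def encode(text, codebook):
    return ''.join(codebook[ch] for ch in text)


def decode(bits, symbols, table):
    """Decode a bit string using only symbols and the sparse table."""
    used = sorted(table)
    # One past the last symbol index of each length, from the next offset
    ends = {n: table[used[i + 1]][1] if i + 1 < len(used) else len(symbols)
            for i, n in enumerate(used)}
    out = []
    value = length = 0
    for bit in bits:
        value = (value << 1) | int(bit)
        length += 1
        if length in table:
            base, offset = table[length]
            index = offset + value - base
            if value >= base and index < ends[length]:
                out.append(symbols[index])
                value = length = 0
    return ''.join(out)
```

Because codewords of equal length are consecutive integers, the decoder never needs an explicit codebook: the first codeword and first symbol index per length are enough. In this sketch `decode(encode(text, codebook), symbols, table)` returns `text` for any non-empty input.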

