In the world of dictionary coding and probability-based encoding, the floating-point weirdness that is arithmetic coding is a refreshing and surprisingly efficient lossless compression algorithm. The algorithm works in two stages: the first translates a string into a floating-point range, and the second translates that range into a binary sequence. Let’s take a look at how each stage works.
+
Stage 1: Floating Point Ranges
+
At a very broad level, arithmetic coding works by counting how often each character appears and recording those frequencies in a table. Each frequency is then mapped onto a number line between 0 and 1. So, if we have the character frequency table shown below for the word “HELLO”, we end up with the number line that follows it.
+
+
+
Character | Frequency | Probability
H         | 1         | 20%
E         | 1         | 20%
L         | 2         | 40%
O         | 1         | 20%
+
+
+
+
+
To do the encoding, we need a floating point range representing our encoded string. So, for example, let’s encode “HELLO”. We start out by encoding just the letter “H”, which would give us the range of 0 to 0.2. However, we’re not just encoding “H” so, we need to encode “E”. To encode “E” we take the range from encoding “H”, 0 to 0.2, and apply our same frequency table to that. You can see this represented below.
+
+
Blown up, you can see that we’re essentially copying the number line down, but fitting it within the range of 0 to 0.2 instead of 0 to 1.
+
+
Now, we’ll encode the letter “E”, and we can see it falls within the range of 0.04 to 0.08.
+
+
As we move through this process, this copying of the number line and fitting it within the previous range continues until we encode our entire string. If you’re familiar with floating point arithmetic in computers, though, you know that computers aren’t good with decimals, especially long ones. There are some workarounds, but in general floating point math is too inefficient or inaccurate to make arithmetic coding work quickly or correctly for compression.
+
The answer to this issue is called finite-precision arithmetic coding, with the above approach of fitting the number line within a range known as the infinite-precision version because we (supposedly) have an infinite amount of precision.
+
Now if we continue this process, we get a range representing 0.06368 to 0.06496. The difference between these two numbers is just 0.00128, a big difference from the 0.2 difference when encoding just “H”. You can imagine that larger files will have an even smaller difference between the two ranges, spelling out the need for finite-precision arithmetic coding.
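To make the narrowing process concrete, here is a minimal sketch of stage 1 in Python using ordinary floats (the same floats whose limited precision causes the problems described above). The layout of the cumulative ranges is an assumption for illustration, so the exact endpoints may differ from the figures, but the width of the final range is always the product of the probabilities (0.2 × 0.2 × 0.4 × 0.4 × 0.2 = 0.00128).

# A sketch of stage 1: narrow a [low, high) range one character at a time.
# The H, E, L, O layout of the number line below is an illustrative assumption.
ranges = {"H": (0.0, 0.2), "E": (0.2, 0.4), "L": (0.4, 0.8), "O": (0.8, 1.0)}

low, high = 0.0, 1.0
for char in "HELLO":
    char_low, char_high = ranges[char]
    width = high - low
    low, high = low + width * char_low, low + width * char_high

print(low, high)  # Any number inside this final range encodes "HELLO"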
+
Stage 2: Binary Search
+
The next, and luckily final, stage is to run a binary-search-like algorithm over the number line to find a binary range that lies within our range from the first stage.
+
The way this works is actually quite simple. We take our number line from 0 to 1, and lay it out.
+
+
Then, we plot our range on the number line and place our current target in the middle of the number line: 0.5.
+
+
The trick is to see whether our range falls on the left or right side of our target (0.5). In this case, our range pretty clearly falls on the left hand side so we output a 0. If it fell on the right hand side we would output a 1.
+
Now here’s where things get interesting. We change the top end of the interval we’re searching from 1 to 0.5, so now we’re looking at the range from 0 to 0.5, with our target at 0.25 (halfway between 0 and 0.5).
+
+
You can see the range moves closer to our target and the area between gets a little larger. It’s important to note we’re not changing the range, just looking at it magnified. Our range is still below 0.25 so we’ll output a 0 and repeat this process for the range of 0-0.25.
+
This continues until you’re left with a binary sequence that represents a target, just like the 0.5 and 0.25 from the earlier steps, that lies within our encoded range from stage 1. This binary stream is the coded (compressed) version, as we can use it to get back to the original string with the right frequency table.
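Here is a minimal sketch of this bisection idea, under the simplifying assumption that we chase the midpoint of the encoded range and stop once the binary interval fits entirely inside it; the function name and stopping rule are illustrative, not a standard implementation.

# A sketch of stage 2: emit bits by halving the number line until the
# binary interval [bit_low, bit_high) fits inside the encoded range [low, high).
def range_to_bits(low, high):
    target = (low + high) / 2  # chase the middle of the encoded range
    bits = []
    bit_low, bit_high = 0.0, 1.0
    while bit_low < low or bit_high > high:
        mid = (bit_low + bit_high) / 2
        if target < mid:   # the target falls in the left half, so output a 0
            bits.append(0)
            bit_high = mid
        else:              # the target falls in the right half, so output a 1
            bits.append(1)
            bit_low = mid
    return bits

print(range_to_bits(0.04, 0.08))  # bits for the "HE" range from earlier: [0, 0, 0, 0, 1, 1]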
+
Infinite vs. Finite Precision
+
Infinite precision is the process that we just went over with two stages. However, as we saw, the more characters we encode, the smaller the difference between our range floor and ceiling gets. This means that as we encode more and more characters, the top and bottom of the range will eventually meet and represent the same value, because a typical 32-bit system cannot represent infinite precision. There are ways around this, such as increasing the floating point number’s precision or using arbitrary-precision arithmetic, but these solutions respectively don’t work for all data or are very inefficient.
+
The harder solution is to combine the two steps into one stage, which is called finite-precision arithmetic coding because it only requires a finite amount of precision to operate. This version works by encoding the first character and then immediately checking whether the range falls entirely above or below 0.5. If it does, it outputs the binary digit for that half and “blows up” (rescales) the range so that it doesn’t lose precision. There is also an important corner case of emitting a “10” or “01” when the range lies within 0.25–0.75, which requires carrying memory over between encoding steps.
+
To put it simply, infinite-precision arithmetic coding is a simple and easy way to understand arithmetic coding while finite-precision arithmetic coding is more complicated but scalable and efficient.
+
Now unfortunately I can’t explain how to implement your own version of finite-precision arithmetic coding well enough to be comprehensive, so I’ll redirect you to a wonderful article by Mark Nelson that explains how to write an arithmetic coder with infinite and finite precision. There are also some wonderful online lectures by mathematicalmonk on YouTube that go into detail about finite-precision coding in a visual way. If anything in this article doesn’t make sense to you, I can’t recommend mathematicalmonk’s YouTube lectures and Mark Nelson’s article enough. Arithmetic Compression from Compressor Head on YouTube is also a great and enjoyable primer on the topic.
+
Adaptive Arithmetic Coding
+
One last variation of arithmetic coding worth mentioning is the idea of adaptive arithmetic coding. This version works in mostly the same way as typical arithmetic coding except that rather than building a frequency table from the source, it builds the frequency table as it encodes each character.
+
This means if you were encoding “HELLO”, it would start with “H” representing 100% of the table. When it encoded the “E”, it would update the frequency table so “H” would represent 50% of the table and “E” the remaining 50%. This concept allows arithmetic coding to adapt to the content as it’s encoding, which allows it to achieve a higher compression ratio.
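A minimal sketch of the bookkeeping involved, just to show how the table evolves as characters are processed (a real adaptive coder also needs a way to handle characters it hasn’t seen yet, which is glossed over here):

from collections import Counter

# Update the frequency table after each character so the probabilities
# always reflect the text seen so far.
counts = Counter()
for char in "HELLO":
    counts[char] += 1
    total = sum(counts.values())
    probabilities = {c: n / total for c, n in counts.items()}
    print(char, probabilities)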
+
Resources
+
If you’re interested in learning more about arithmetic coding, check out these great resources:
Dictionary coding is one of the most primitive and powerful forms of compression in use today. In fact, we use it every day in English. I’ve actually used it already. Did you catch that?
+
The contraction “I’ve” is technically a form of dictionary coding, because when we read the word we automatically expand it to “I have” in our mind. This simple concept is used everywhere, from spoken languages and mathematical notation to file encodings.
+
More advanced forms of dictionary coding, which go well beyond a simple search-and-replace step, form the basis for many different compression algorithms such as LZ, Huffman, and more.
+
The Concept
+
Let’s get an idea of how you could use dictionary coding to compress data.
+
So let’s say we’re working in a restaurant and we have to tell the chefs, on paper, what food needs to be prepared. Now our restaurant has three things on the menu:
+
+
Pizza
+
Fries
+
Milkshakes
+
+
Rather than writing down “pizza” or “fries” every time someone orders pizza or fries, we can assign each item a unique code.
+
+
1 - Pizza

2 - Fries

3 - Milkshake
+
+
Now when we’re preparing order tickets for the kitchen, we can simply write 1, 2, or 3. This same simple concept is used in even the most advanced modern compression algorithms.
+
Implementation
+
Implementing a dictionary coder and decoder is actually very simple. All we’re really doing is replacing the long text with corresponding codes to encode it, and replacing codes with the text it represents to decode it.
The key to dictionary coding is that it solves a different problem than general-purpose encoders. A dictionary coder must be built with a known list of words (or bytes) that are very common in order to see any real difference. For example, you could build a dictionary coder with the most common English words and encode Shakespeare, which would probably give you a good compression ratio. But if you were to use the same dictionary to encode a PDF or HTML file, you would get much worse results. You also have to make sure that both the coder and decoder have the same dictionary; otherwise the encoded text will be useless.
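Here is a minimal sketch of that search-and-replace idea for the restaurant example above; the dictionary contents, function names, and word-by-word splitting are illustrative assumptions, and a real coder would typically work on bytes with a pre-agreed dictionary.

# A toy dictionary coder: both sides must share the same dictionary.
dictionary = {"pizza": "1", "fries": "2", "milkshake": "3"}
reverse_dictionary = {code: word for word, code in dictionary.items()}

def encode(text):
    # Replace every known word with its shorter code.
    return " ".join(dictionary.get(word, word) for word in text.split())

def decode(text):
    # Replace every code with the word it stands for.
    return " ".join(reverse_dictionary.get(code, code) for code in text.split())

ticket = encode("pizza fries milkshake")
print(ticket)          # 1 2 3
print(decode(ticket))  # pizza fries milkshake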
+
Common Usage
+
While dictionary coders may sound imperfect and niche, they’re quite the opposite. Web compression algorithms like Brotli use a dictionary with the most common words, HTML tags, JavaScript tokens, and CSS properties to encode web assets. This dictionary, while large, is insignificant compared to the savings it provides across the files it helps decode.
+
Here are some modern algorithms that employ dictionary coding:
Dynamic Markov Compression is an obscure form of compression that uses Markov chains to model the patterns represented in a file.
+
Markov Chains
+
Markov Chains are a simple way to model the transitions between states based on a measurable probability. For example, we could use a Markov Chain to model the weather and the probability that it will become sunny if it’s already raining, or vice-versa.
+
+
Each circle represents a state, and each arrow represents a transition. In this example, we have two states, raining and sunny, a perfect representation of true weather. Each state has two possible transitions, it can transition to itself again or it can transition to another state. The likelihood of each transition is defined by a percentage representing the probability that the transition occurs.
+
Now let’s say it’s sunny and we’re following this model. According to the model there’s a 50% chance it’s sunny again tomorrow or a 50% chance it’s rainy tomorrow. If it becomes rainy, then there’s a 25% chance it’s rainy the day after that or a 75% chance it’s sunny the day after that.
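As a quick illustration, here is a minimal Python sketch of the two-state weather chain described above; the dictionary layout and seven-day simulation are just one way to write it.

import random

# transitions[state] maps each possible next state to its probability.
transitions = {
    "sunny": {"sunny": 0.50, "rainy": 0.50},
    "rainy": {"rainy": 0.25, "sunny": 0.75},
}

state = "sunny"
for day in range(7):
    print(f"Day {day}: {state}")
    next_states = list(transitions[state])
    weights = [transitions[state][s] for s in next_states]
    state = random.choices(next_states, weights=weights)[0]  # pick the next state at random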
+
Markov Chains may sound scary but the essence of how they work is quite simple. Markov Chains are the statistical model behind a lot of the technology we use today from Google’s PageRank search algorithm to predictive text on smartphone keyboards. If you’d like to learn more, check out this wonderful article by Victor Powell and Lewis Lehe that goes into depth about how Markov Chains work. They also have a wonderful interactive demo.
+
Markov Chain Powered Compression
+
Dynamic Markov Compression is a very obscure and complicated subject when it comes to implementation, and unfortunately I cannot claim to understand it well enough to explain it myself. Though, I have written a similar algorithm from my own trial-and-error that employs stateful Markov chains that model a file. You can find the source code within the Raisin project under the compressor/mcc package. This code is un-optimized and a bit messy as it exists only as a research project to learn more about DMC.
+
If you have a better understanding of DMC and would like to contribute to this article, we would appreciate any and all contributions!
+
Resources
+
If you’re interested in trying to implement DMC yourself or are just interested in the algorithm, here are a few helpful resources as a jumping-off point:
Push an array of Huffman leaf objects, each containing a character and its associated frequency, into a priority queue. To start building the tree, pop two leaves from the queue and assign them as the left and right children of a new node whose frequency is the sum of the two.
+
Push this new node back into the priority queue.
+
Continue this process until the size of the queue is 1.
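A minimal sketch of these steps in Python, using heapq as the priority queue; the tuple layout and list-based nodes are illustrative choices, not the only way to structure a Huffman tree.

import heapq
from collections import Counter

def build_tree(text):
    # Each queue entry is (frequency, tie_breaker, node); a node is either a
    # single character (a leaf) or a [left, right] pair (an internal node).
    queue = [(freq, i, char) for i, (char, freq) in enumerate(Counter(text).items())]
    heapq.heapify(queue)

    tie_breaker = len(queue)  # keeps heapq from ever comparing two nodes directly
    while len(queue) > 1:
        freq_left, _, left = heapq.heappop(queue)
        freq_right, _, right = heapq.heappop(queue)
        heapq.heappush(queue, (freq_left + freq_right, tie_breaker, [left, right]))
        tie_breaker += 1

    return queue[0][2]  # the root node of the Huffman tree

print(build_tree("I AM SAM. I AM SAM. SAM I AM."))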
+
Since its creation by David A. Huffman in 1952, Huffman coding has been regarded as one of the most efficient and optimal methods of compression. Huffman’s optimal compression ratios are made possible through its character-counting functionality. Unlike many algorithms in the Lempel-Ziv suite, Huffman encoders scan the file and generate a frequency table and tree before beginning the true compression process. Before discussing different implementations, let’s dive deeper into how the algorithm works.
+
The Algorithm
+
Although Huffman encoding may seem confusing from an outside view, we can break it into three simple steps: frequency counting, tree building, and character encoding.
Let’s start out by going over the frequency counting step. Throughout all of the examples, I will be using the following sample input string:
+
I AM SAM. I AM SAM. SAM I AM.
THAT SAM-I-AM! THAT SAM-I-AM! I DO NOT LIKE THAT SAM-I-AM!
+
+
The Huffman encoder starts out by going over the inputted text and outputting a table correlating each character to the number of times it appears in the text. For the sample input, the table would look like this:
+
+
+
Frequency | Character
1         | N
1         | \n
1         | K
1         | D
1         | L
1         | E
2         | O
3         | H
3         | !
3         | .
6         | -
6         | S
7         | T
8         | I
12        | M
15        | A
17        | (space)
+
+
+
+
+
As displayed above, the table is sorted to ensure consistency in each step of the compression process.
+
Tree Building
+
Once the frequency table is created, the Huffman encoder builds a Huffman tree. A Huffman tree follows the same structure as a normal binary tree, containing nodes and leaves. Each Huffman leaf contains two values: the character and its corresponding frequency.
+
To build the tree, we traverse our table of frequencies and characters, pushing the characters with the highest frequencies toward the top of the tree, and continue the traversal until each table value is represented by a Huffman leaf.
+
That might be confusing, so let’s break it down step by step.
+
Huffman compression works by taking existing 8-bit characters and assigning them new codes with a smaller number of bits. To optimize the compression, the characters with the highest frequency are given the shortest bit values.
+
A Huffman tree helps us assign and visualize the new bit value assigned to each existing character. Similar to a binary tree, if we start at the root node, we can traverse the tree by using 1 to move to the right and 0 to move to the left. The position of a leaf node relative to the root node is used to determine its new bit value.
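A minimal sketch of that traversal, reusing the build_tree sketch from the tree-building steps earlier; the recursive helper and its name are illustrative.

def assign_codes(node, prefix="", codes=None):
    # Walk the tree: 0 for a left branch, 1 for a right branch.
    if codes is None:
        codes = {}
    if isinstance(node, list):  # internal node: [left, right]
        assign_codes(node[0], prefix + "0", codes)
        assign_codes(node[1], prefix + "1", codes)
    else:                       # leaf: a single character
        codes[node] = prefix
    return codes

tree = build_tree("I AM SAM. I AM SAM. SAM I AM.")
print(assign_codes(tree))  # frequent characters end up with the shortest codes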
+
A huffman tree for our example is depicted below:
+
As shown in the image, Huffman trees can get very large and complicated very easily. To see a sample tree for any text go to url.
+
To understand more about the programmatic implementation of tree building, click here.
+
Character Encoding
+
Character encoding is the final step for most Huffman encoders. Once a tree and frequency table have been built, the final step is to encode the characters from the initial file and write the encoded bytes to a new file.

Tree Traversal

Tree traversal is the first way of encoding the input of a Huffman encoder. For each character, the tree is traversed recursively until a leaf with a matching character is found.
+
This method can easily get complicated and very inefficient as the tree has to be traversed multiple times.
+
For a simpler and quicker solution, we can use array indexing.
+
Array Indexing
+
When compared to the previous tree traversal method, array indexing is much less complicated and significantly faster.
+
Before encoding the characters, the tree is traversed once and the values for each leaf are outputted in two corresponding arrays. The first array contains the value of each character, while the second contains its updated bit value.
+
Once created, the arrays are traversed and each character in the input is replaced with its updated bit value.
+
Once a new output text is generated, it is encoded as a byte array and written to the output file.
Lempel-Ziv, commonly referred to as LZ77/LZ78 depending on the variant, is one of the oldest, simplest, and most widespread compression algorithms out there. Its power comes from its simplicity, speed, and decent compression rates. Now before we dive into an implementation, let’s understand the concept behind Lempel-Ziv and the various algorithms it has spawned.
+
The Algorithm(s)
+
Lempel-Ziv at its core is very simple. It works by taking an input string of characters, finding repetitive characters, and outputting an “encoded” version. To get an idea of it, here’s an example.
As you can see, the algorithm simply takes an input string, in this case “Hello everyone! Hello world!”, and encodes it character by character. If it tries to encode a character it has already seen, it checks whether it has also seen the next character. This repeats until the sequence it’s checking, starting from the characters it’s currently encoding, hasn’t been seen before; at that point it outputs a “token”, which is <16,6> in this example, and continues.
+
The <16,6> token is quite simple to understand too: it consists of two numbers and some syntactic sugar to make it easy to read. The first number corresponds to how many characters it should look backwards, and the next number tells it how many characters to go forwards and copy. This means that in our example, <16,6> expands into “Hello ” as it goes 16 characters backwards and copies the next 6 characters.
+
This is the essential idea behind the algorithm, however it should be noted that there are many variations of this algorithm with different names. For example, in some implementations, the first number means go forwards from the beginning instead of backwards from the current position. Small (and big) differences like these are the reason for so many variations:
It’s also important to understand the difference between LZ77 and LZ78, the first two Lempel-Ziv algorithms. LZ77 works very similarly to the example above, using a token to represent an offset and length, while LZ78 uses a more complicated dictionary approach. For a more in-depth explanation, make sure to check out this wonderful article explaining LZ78.
+
Implementations
+
Now because there are so many different variations of Lempel-Ziv algorithms, there isn’t a single “LZ” implementation. With that being said, if you are interested in implementing a Lempel-Ziv algorithm yourself, you’ll have to choose an algorithm to start with. LZSS is a great jumping-off point as it’s a basic evolution of LZ77 and can be implemented very easily while achieving a respectable compression ratio. If you’re interested in another algorithm, head back to the algorithms overview.
Lempel-Ziv-Storer-Szymanski, which we’ll refer to as LZSS, is a simple variation of the common LZ77 algorithm. It uses the same token concept with an offset and length to tell the decoder where to copy the text, except it only places the token when the token is shorter than the text it is replacing.
+
The idea behind this is that it will never increase the size of a file by adding tokens everywhere for repeated letters. You can imagine that LZ77 would easily increase the file size if it simply encoded every repeated letter “e” or “i” as a token, which may take at least 5 bytes depending on the file and implementation instead of just 1 for LZSS.
+
Implementing an Encoder
+
Let’s take a look at some examples, so we can see exactly how it works. The Wikipedia article for LZSS has a great example for this, which I’ll use here, and it’s worth a read as an introduction to LZSS.
+
So let’s encode an excerpt of Dr. Seuss’s Green Eggs and Ham with LZSS (credit to Wikipedia for this example).
+
I AM SAM. I AM SAM. SAM I AM.

THAT SAM-I-AM! THAT SAM-I-AM! I DO NOT LIKE THAT SAM-I-AM!

DO WOULD YOU LIKE GREEN EGGS AND HAM?

I DO NOT LIKE THEM,SAM-I-AM.
I DO NOT LIKE GREEN EGGS AND HAM.
+
+
This text takes up 192 bytes in a typical UTF-8 encoding. Let’s take a look at the LZSS encoded version.
+
I AM SAM. <10,10>SAM I AM.

THAT SAM-I-AM! T<15,14>I DO NOT LIKE<29,15>

DO WOULD YOU LIKE GREEN EGGS AND HAM?

I<69,15>EM,<113,8>.<29,15>GR<64,16>.
+
+
This encoded, or compressed, version only takes 148 bytes to store (without a magic number to describe the file type), which is 77% of the original file size, or a compression ratio of 1.3. Not bad!
+
Analysis
+
Now let’s take a second to understand what’s happening before you start trying to conquer the world with LZSS.
+
As we can see, the “tokens” are reducing the size of the file by referencing pieces of text that are longer than the actual token. Let’s look at the first line:
+
I AM SAM. <10,10>SAM I AM.
+
+
The encoder works character by character. On the first character, ‘I’, it checks its search buffer to see if it’s already seen an ‘I’. The search buffer is essentially the encoder’s memory: for every character it encodes, it adds it into the search buffer so it can “remember” it. Because it hasn’t seen an ‘I’ already (the search buffer is empty), it just outputs an ‘I’, adds it to the search buffer, and moves to the next character. The next character is ‘ ‘ (a space). The encoder checks the search buffer to see if it’s seen a space before, and it hasn’t, so it outputs the space and moves forward.
+
Once it gets to the second space (after “I AM”), LZ77 comes into play. It’s already seen a space before because it’s in the search buffer so it’s ready to output a token, but first it tries to maximize how much text the token is referencing. If it didn’t do this you could imagine that for every character it’s already seen it would output something similar to <5,1>, which is 5 times larger than any character. So once it finds a character that it’s already seen, it moves on to the next character and checks if it’s already seen the next character directly after the previous character. Once it finds a sequence of characters that it hasn’t already seen, then it goes back one character to the sequence of characters it’s already seen and prepares the token.
+
Once the token is ready, the difference between LZ77 and LZSS starts to shine. At this point LZ77 simply outputs the token, adds the characters to the search buffer and continues. LZSS does something a little smarter, it will check to see if the size of the outputted token is larger than the text it’s representing. If so, it will output the text it represents, not the token, add the text to the search buffer, and continue. If not, it will output the token, add the text it represents to the search buffer and continue.
+
+
Coming back to our example, the space character has already been seen, but a space followed by an “S” hasn’t been seen yet (“ S”), so we prepare the token representing the space. The token in our case would be “<3,1>”, which means go back three characters and copy 1 character(s). Next we check to see if our token is longer than our text, and “<3,1>” is indeed longer than “ “, so it wouldn’t make sense to output the token, so we output the space, add it to our search buffer, and continue.
+
This entire process continues until we get to the “I AM SAM. “. At this point we’ve already seen an “I AM SAM. “ but haven’t seen an “I AM SAM. S” so we know our token will represent “I AM SAM. “. Then we check to see if “I AM SAM. “ is longer than “<10,10>”, which it is, so we output the token, add the text to our search buffer and continue along.
+
+
This process continues, encoding tokens and adding text to the search buffer character by character until it’s finished encoding everything.
+
Takeaways
+
There’s a lot of information to unpack here, but the algorithm at a high level is quite simple:
+
+
Loop character by character
+
Check if it’s seen the character before
+
If so, check the next character and prepare a token to be outputted
+
If the token is longer than the text it’s representing, don’t output a token
+
Add the text to the search buffer and continue
+
+
+
If not, add the character to the search buffer and continue
+
+
+
+
It’s important to remember that no matter the outcome, token or no token, the text is always appended to the search buffer so it can always “remember” the text it’s already seen.
+
Implementation
+
Now let’s take a stab at building our very own version so we can understand it more deeply.
+
As with most of these algorithms, we have an implementation written in Go in our raisin project. If you’re interested in what a more performant or real-world example of these algorithms looks like, be sure to check it out. However for this guide we’ll use Python to make it more approachable so we can focus on understanding the algorithm and not the nuances of the language.
+
Character Loop
+
Let’s get started with a simple loop that goes over each character for encoding. As we can see from our takeaways, the character-by-character loop is what powers LZSS.
+
text="HELLO"
+encoding="utf-8"
+
+text_bytes=text.encode(encoding)
+
+forcharintext_bytes:
+ print(bytes([char]).decode(encoding))# Print the character
+
+
Output:
+
H
E
L
L
O
+
+
Although the code is functionally pretty simple, there are a few important things going on here. You can see that looping character by character isn’t as simple as for char in text; first we have to encode the string and then loop over the encoded version. This is because it converts our string into an array of bytes, represented as a Python object called bytes. When we print the character out, we have to convert it from a byte (represented as a Python int) back to a string so we can see it.
+
The reason we do this is because a byte is really just a number from 0-255 as it is represented in your computer as 8 1’s and 0’s, called binary. If you don’t already have a basic understanding of how computers store our language, you should get acquainted with it on our encodings page.
+
Search Buffers
+
Great, we have a basic program working which can loop over our text and print it out, but that’s pretty far off from compression, so let’s keep going. The next step for our program is to implement our “memory” so the program can check to see if it’s already seen a character.
+
text="HELLO"
+encoding="utf-8"
+
+text_bytes=text.encode(encoding)
+
+search_buffer=[]# Array of integers, representing bytes
+
+forcharintext_bytes:
+ print(bytes([char]).decode(encoding))# Print the character
+search_buffer.append(char)# Add the character to our search buffer
+
+
The only new behavior is that we now add each character to the search buffer with the append method.
+
Checking Our Search Buffer
+
Now let’s try to implement the LZ part of LZSS: we need to start looking backwards for characters we’ve already seen. This can be accomplished quite easily using the list.index method.
+
for char in text_bytes:
    if char in search_buffer:
        print(f"Found at {search_buffer.index(char)}")

    print(bytes([char]).decode(encoding))  # Print the character
    search_buffer.append(char)  # Add the character to our search buffer
+
+
Output:
+
H
E
L
Found at 2
L
O
+
+
Notice the if char in search_buffer check; without it, Python would throw a ValueError whenever the value is not in the list.
+
Building Tokens
+
Now let’s build a token and output it when we find the character.
+
i = 0
for char in text_bytes:
    if char in search_buffer:
        index = search_buffer.index(char)  # The index where the character appears in our search buffer
        offset = i - index  # Calculate the relative offset
        length = 1  # Set the length of the token (how many characters it represents)

        print(f"<{offset},{length}>")  # Build and print our token
    else:
        print(bytes([char]).decode(encoding))  # Print the character

    search_buffer.append(char)  # Add the character to our search buffer

    i += 1
+
+
Output:
+
H
E
L
<1,1>
O
+
+
We’re nearly there! This is actually a rough implementation of LZ77, however there’s one issue. If we have a word that repeats twice, it will copy each character instead of the entire word.
+
text = "SAM SAM"
+
+
Output
+
S
A
M

<4,1>
<4,1>
<4,1>
+
+
Note: <4,1> is technically correct, since each character it represents appears 4 characters behind the beginning of its token.
+
That’s not exactly right; we should see <4,3> instead of three <4,1> tokens. So let’s write some code that can check our search buffer for more than one character.
+
Checking the Search Buffer for More Characters
+
Let’s modify our code to check the search buffer for more than one character.
+
def elements_in_array(check_elements, elements):
    i = 0
    offset = 0
    for element in elements:
        if len(check_elements) <= offset:
            # All of the elements in check_elements are in elements
            return i - len(check_elements)

        if check_elements[offset] == element:
            offset += 1
        else:
            offset = 0

        i += 1
    return -1

text = "SAM SAM"
encoding = "utf-8"

text_bytes = text.encode(encoding)

search_buffer = []  # Array of integers, representing bytes
check_characters = []  # Array of integers, representing bytes

i = 0
for char in text_bytes:
    check_characters.append(char)
    index = elements_in_array(check_characters, search_buffer)  # The index where the characters appear in our search buffer

    if index == -1 or i == len(text_bytes) - 1:
        if len(check_characters) > 1:
            index = elements_in_array(check_characters[:-1], search_buffer)
            offset = i - index - len(check_characters) + 1  # Calculate the relative offset
            length = len(check_characters)  # Set the length of the token (how many characters it represents)

            print(f"<{offset},{length}>")  # Build and print our token
        else:
            print(bytes([char]).decode(encoding))  # Print the character

        check_characters = []

    search_buffer.append(char)  # Add the character to our search buffer

    i += 1
+
+
Output:
+
S
A
M

<4,3>
+
+
It works! But there’s quite a lot to unpack here so let’s go through it line by line.
+
The first and largest addition is the elements_in_array function. The code here essentially checks to see if specific elements are within an array in an exact order. If so, it will return the index in the array where the elements start, and if not it will return -1.
+
Moving on to our main loop, we can see we now have check_characters defined at the top. This variable tracks what characters we’re looking for in our search_buffer. As we loop through, we use check_characters.append(char) to add the current character to the characters we’re searching for. Then we check to see if check_characters can be found within search_buffer with elements_in_array.
+
Now we have the best part: the logic. If we couldn’t find a match, or we’re on the last character, we want to output something. If check_characters holds more than one character, then everything up to the last character was found earlier, so we output a token representing that match. Otherwise, we couldn’t find a match for a single character, so we just output that character.
+
And that’s essentially LZ77! Try it out with some different strings to see for yourself. However, you might notice that we’re trying to implement LZSS, not LZ77, so we have one more piece to implement.
+
Comparing Token Sizes
+
This crucial piece is the process described earlier of comparing the size of tokens versus the text it represents. Essentially we’re saying, if the token takes up more space than the text it’s representing then don’t output a token, just output the text.
+
Lucky for us this is a pretty simple change. Our main loop now looks like so:
+
for char in text_bytes:
    check_characters.append(char)
    index = elements_in_array(check_characters, search_buffer)  # The index where the characters appear in our search buffer

    if index == -1 or i == len(text_bytes) - 1:
        if len(check_characters) > 1:
            index = elements_in_array(check_characters[:-1], search_buffer)
            offset = i - index - len(check_characters) + 1  # Calculate the relative offset
            length = len(check_characters)  # Set the length of the token (how many characters it represents)

            token = f"<{offset},{length}>"  # Build our token

            if len(token) > length:
                # Length of token is greater than the length it represents, so output the characters
                print(bytes(check_characters).decode(encoding))  # Print the characters
            else:
                print(token)  # Print our token
        else:
            print(bytes([char]).decode(encoding))  # Print the character

        check_characters = []

    search_buffer.append(char)  # Add the character to our search buffer

    i += 1
+
+
Output:
+
S
A
M

SAM
+
+
The key is the len(token) > length which checks if the length of the token is longer than the length of the text it’s representing. If it is, it simply outputs the characters, otherwise it outputs the token.
+
Sliding Windows
+
The last piece of the puzzle is something you might have noticed if you’re already trying to compress large files: the search buffer gets big. Let’s say we’re compressing a 1 GB file. As we go over each character, we add it to the search buffer and continue, but on each iteration we also search the entire search buffer for certain characters. This quickly adds up for larger files. In our 1 GB file scenario, near the end we’ll have to search almost 1 billion bytes to encode a single character.
+
It should be pretty obvious that this is very inefficient. And unfortunately, there is no magic solution; you have to make a tradeoff. With every compression algorithm you have to decide between speed and compression ratio. Do you want a fast algorithm that can’t reduce the file size very much, or a slow algorithm that reduces the file size more? The answer is: it depends. And so, the tradeoff in LZ77’s case is to create a “sliding window”.
+
The “sliding window” is actually quite simple, all you do is cap off the maximum size of the search buffer. When you add a character to the search buffer that makes it larger than the maximum size of the sliding window then you remove the first character. That way the window is “sliding” as you move through the file, and the algorithm doesn’t slow down!
+
max_sliding_window_size = 4096

...

for char in text_bytes:

    ...

    search_buffer.append(char)  # Add the character to our search buffer

    if len(search_buffer) > max_sliding_window_size:  # Check to see if it exceeds the max_sliding_window_size
        search_buffer = search_buffer[1:]  # Remove the first element from the search_buffer

    ...
+
+
These changes should be pretty self-explanatory. We’re just checking to see if the length of the search_buffer is greater than the max_sliding_window_size, and if so we pop the first element off of the search_buffer.
+
Keep in mind that while a maximum sliding window size of 4096 characters is typical, it may be hard to observe during testing, so try setting it much lower (like 3-4) and test it with some different strings to see how it works.
+
Putting it all together
+
That’s everything that makes up LZSS, but for the sake of completing our example, let’s clean it up so we can call a function with some text, an optional max_sliding_window_size, and have it return the encoded text, rather than just printing it out.
+
encoding="utf-8"
+
+defencode(text,max_sliding_window_size=4096):
+ text_bytes=text.encode(encoding)
+
+ search_buffer=[]# Array of integers, representing bytes
+check_characters=[]# Array of integers, representing bytes
+output=[]# Output array
+
+ i=0
+ forcharintext_bytes:
+ check_characters.append(char)
+ index=elements_in_array(check_characters,search_buffer)# The index where the characters appears in our search buffer
+
+ ifindex==-1ori==len(text_bytes)-1:
+ iflen(check_characters)>1:
+ index=elements_in_array(check_characters[:-1],search_buffer)
+ offset=i-index-len(check_characters)+1# Calculate the relative offset
+length=len(check_characters)# Set the length of the token (how many character it represents)
+
+ token=f"<{offset},{length}>"# Build our token
+
+ iflen(token)>length:
+ # Length of token is greater than the length it represents, so output the character
+output.extend(check_characters)# Output the characters
+else:
+ output.extend(token.encode(encoding))# Output our token
+else:
+ output.extend(check_characters)# Output the character
+
+ check_characters=[]
+
+ search_buffer.append(char)# Add the character to our search buffer
+
+ iflen(search_buffer)>max_sliding_window_size:# Check to see if it exceeds the max_sliding_window_size
+search_buffer=search_buffer[1:]# Remove the first element from the search_buffer
+
+ i+=1
+
+ returnbytes(output)
+
+print(encode("SAM SAM",1).decode(encoding))
+print(encode("supercalifragilisticexpialidocious supercalifragilisticexpialidocious",1024).decode(encoding))
+print(encode("LZSS will take over the world!",256).decode(encoding))
+print(encode("It even works with 😀s thanks to UTF-8",16).decode(encoding))
+
+
The function definition is pretty simple, we can just move our text and max_sliding_window_size outside of the function and wrap our code in a function definition. Then we simply call it with some different values to test it, and that’s it!
+
The finished code can be found in lzss.py in the examples GitHub repo.
+
Lastly, there are a few bugs in our program that we encounter with larger files. If we have some text, for example:
+
ISAM YAM SAM
+
+
When the encoder gets to the space right before the “SAM”, it will look for a space in the search buffer which it finds. Then it will search for a space and an “S” (“ S”) which it doesn’t find, so it continues and starts looking for an “A”. The issue here is that it skips looking for an “S” and continues to encode the “AM” not the “SAM”.
+
In some rare circumstances the code may generate a reference with a length that is larger than its offset which will result in an error.
+
To fix this, we’ll need to rewrite the logic in our encoder a little bit.
+
for char in text_bytes:
    index = elements_in_array(check_characters, search_buffer)  # The index where the characters appear in our search buffer

    if elements_in_array(check_characters + [char], search_buffer) == -1 or i == len(text_bytes) - 1:
        if i == len(text_bytes) - 1 and elements_in_array(check_characters + [char], search_buffer) != -1:
            # Only if it's the last character do we add the next character to the text the token is representing
            check_characters.append(char)

        if len(check_characters) > 1:
            index = elements_in_array(check_characters, search_buffer)
            offset = i - index - len(check_characters)  # Calculate the relative offset
            length = len(check_characters)  # Set the length of the token (how many characters it represents)

            token = f"<{offset},{length}>"  # Build our token

            if len(token) > length:
                # Length of token is greater than the length it represents, so output the characters
                output.extend(check_characters)  # Output the characters
            else:
                output.extend(token.encode(encoding))  # Output our token

            search_buffer.extend(check_characters)  # Add the characters to our search buffer
        else:
            output.extend(check_characters)  # Output the character
            search_buffer.extend(check_characters)  # Add the characters to our search buffer

        check_characters = []

    check_characters.append(char)

    if len(search_buffer) > max_sliding_window_size:  # Check to see if it exceeds the max_sliding_window_size
        search_buffer = search_buffer[1:]  # Remove the first element from the search_buffer

    i += 1
+
+
To fix the first issue, we add the current char to check_characters only at the end of the loop and instead check whether check_characters + [char] is found. If it isn’t, we know that check_characters on its own was found, so we can continue as normal, and check_characters gets cleared before char is appended for the next iteration. We also add a check on the last iteration to include the current char in check_characters, as otherwise our logic wouldn’t run on the last character and a token wouldn’t be created.
+
To resolve the other problem we simply have to move the search_buffer.append(char) calls up into our logic and change them to search_buffer.extend(check_characters). This way we only update our search buffer when we’ve already tried to find a token.
+
Implementing a Decoder
+
What’s the use of encoding something some text if we can’t decode it? For that we’ll need to build ourselves a decoder.
+
Luckily for us, building a decoder is actually much simpler than an encoder because all it needs to know how to do is convert a token (“<5,2>”) into the literal text it represents. The decoder doesn’t care about search buffers, sliding windows, or token lengths, it has only one job.
+
So, let’s get started. We’re going to decode character-by-character just like our encoder so we’ll start with our main loop inside of a function. We’ll also need to encode and decode the strings so we’ll keep the encoding = "utf-8".
+
encoding="utf-8"
+
+defdecode(text):
+
+ text_bytes=text.encode(encoding)# The text encoded as bytes
+output=[]# The output characters
+
+ forcharintext_bytes:
+ output.append(char)# Add the character to our output
+
+ returnbytes(output)
+
+print(decode("supercalifragilisticexpialidocious <35,34>").decode(encoding))
+
+
Here we’re setting up the structure for the rest of our decoder by setting up our main loop and declaring everything within a neat self-contained function.
+
Identifying Tokens
+
The next step is to start doing some actual decoding. The goal of our decoder is to convert a token into text, so we need to first identify a token and extract our offset and length before we can convert it into text.
+
Notice the various components of a token that need to be identified and extracted so we can find the text they represent
+
Let’s make a small change so we can identify the start and end of a token.
+
    for char in text_bytes:
        if char == "<".encode(encoding)[0]:
            print("Found opening of a token")
        elif char == ">".encode(encoding)[0]:
            print("Found closing of a token")

        output.append(char)  # Add the character to our output

    return bytes(output)
+
+
Because we’re going character by character, we can simply check whether the character is a token opening character or closing character to tell if we’re inside a token. Let’s add some more code to track the numbers on either side of the comma, our separator.
+
inside_token = False
scanning_offset = True

length = []  # Length number encoded as bytes
offset = []  # Offset number encoded as bytes

for char in text_bytes:
    if char == "<".encode(encoding)[0]:
        inside_token = True  # We're now inside a token
        scanning_offset = True  # We're now looking for the offset number
    elif char == ",".encode(encoding)[0]:
        scanning_offset = False
    elif char == ">".encode(encoding)[0] and inside_token:
        inside_token = False  # We're no longer inside a token

        # Convert length and offset to integers
        length_num = int(bytes(length).decode(encoding))
        offset_num = int(bytes(offset).decode(encoding))

        print(f"Found token with length: {length_num}, offset: {offset_num}")

        # Reset length and offset
        length, offset = [], []
    elif inside_token:
        if scanning_offset:
            offset.append(char)
        else:
            length.append(char)

    output.append(char)  # Add the character to our output
+
+
Output:
+
Found token with length: 34, offset: 35
supercalifragilisticexpialidocious <35,34>
+
+
We now have a bunch of if statements that give our loop some more control flow. Let’s go over the changes.
+
First off we have four new variables outside of the loop:
+
+
inside_token - Tracks whether or not we’re inside a token

scanning_offset - Tracks whether we’re currently scanning for the offset number or the length number (1st or 2nd number in the token)

length - Used to store the bytes (or characters) that represent the token’s length

offset - Used to store the bytes (or characters) that represent the token’s offset
+
+
Inside of the loop, we check if the character is a <, ,, or a > and modify the variables accordingly to track where we are. If the character isn’t any of those and we’re inside a token then we want to add the character to either the offset or length because that means the character is an offset or length.
+
Lastly, if the character is a >, that means we’re exiting the token, so let’s convert our length and offset into a Python int. We have to do this because they’re currently represented as a list of bytes, so we need to convert those bytes into a Python string and convert that string into an int. Then we finally print that we’ve found a token.
+
Translating Tokens
+
Now we have one last step left: translating tokens into the text they represent. Thanks to Python list slicing this is quite simple.
+
    for char in text_bytes:
        if char == "<".encode(encoding)[0]:
            inside_token = True  # We're now inside a token
            scanning_offset = True  # We're now looking for the offset number
        elif char == ",".encode(encoding)[0]:
            scanning_offset = False
        elif char == ">".encode(encoding)[0] and inside_token:
            inside_token = False  # We're no longer inside a token

            # Convert length and offset to integers
            length_num = int(bytes(length).decode(encoding))
            offset_num = int(bytes(offset).decode(encoding))

            # Get the text that the token represents
            referenced_text = output[-offset_num:][:length_num]

            output.extend(referenced_text)  # referenced_text is a list of bytes so we use extend to add each one to output

            # Reset length and offset
            length, offset = [], []
        elif inside_token:
            if scanning_offset:
                offset.append(char)
            else:
                length.append(char)
        else:
            output.append(char)  # Add the character to our output

    return bytes(output)
+
In order to calculate the piece of text that a token is referencing, we can simply use our offset and length to find the text in the current output. We use a negative slice to go offset_num characters backwards and grab up to length_num elements from there. This gives us a referenced_text holding the text the token references. Finally we add the referenced_text to our output and we’re finished.
+
Lastly, we only want to add a character to the output if we’re not in a token, so we add an else to the end of our logic that handles that case.
+
And that’s it! We now have an LZSS decoder, and by extension an LZ77 decoder, since a decoder doesn’t need to worry about whether a token is longer than the text it references.
+
Implementation Conclusion
+
We’ve gone through step-by-step building an encoder and decoder and learned the purpose of each component. Let’s do some basic benchmarks to see how well it works.
One thing to keep in mind is that when we refer to a “character”, we really mean a “byte”. Our loop runs byte by byte, not character by character. This distinction is subtle but significant. In the world of encodings, not every character is a single byte. For example in utf-8, any English letter or symbol is a single byte, but more complicated characters like Arabic, Mandarin, or emoji characters require multiple bytes despite being a single “character” (see the quick check after the list below).
+
+
4 bytes - 😀
+
1 byte - H
+
3 bytes - 话
+
6 bytes - يَّ
+
+
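As a quick sanity check of those byte counts, here is a tiny snippet using Python’s built-in utf-8 encoding:

# Each character encodes to a different number of bytes in utf-8.
for char in ["😀", "H", "话"]:
    print(char, len(char.encode("utf-8")), "bytes")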
If you’re interested in learning more about how bytes work, check out the Wikipedia articles on Bytes and Unicode or our reference page on bytes.
+
+
+
+
+
+
+
+
The following is intended to be a comprehensive list of lossless compression algorithms (in no particular order), however if you feel like an algorithm is missing, please let us know.