Update README.md

kedartatwawadi authored Mar 3, 2017
1 parent 8ccecd3 commit 1a83cdf
Also on the theoretical side, there are connections between predictors trained with log-loss and universal compression:
1. EE376c Lecture Notes on Prediction: [http://web.stanford.edu/class/ee376c/lecturenotes/Chapter2\_CTW.pdf](http://web.stanford.edu/class/ee376c/lecturenotes/Chapter2_CTW.pdf)

Another interesting thing to note is that RNN-based models have already been partially used in state-of-the-art lossless compressors, mainly for context mixing. These compressors estimate the probability of the next character from multiple human-designed contexts/features (e.g., the past 20 characters, 4 words, alternate characters, or only the higher bits of the last 10 bytes). These probabilities are then "mixed" (something like boosting with experts) using an LSTM-based context mixer.
In fact, most of the leading text compressors on the [Hutter prize](http://prize.hutter1.net/) leaderboard use LSTMs for model mixing. For example, [CMIX](http://www.byronknoll.com/cmix.html) uses an LSTM for context mixing, as shown in the flowchart below.

Eg: CMIX compressor: [http://www.byronknoll.com/cmix.html](http://www.byronknoll.com/cmix.html)

![cmix_image](http://www.byronknoll.com/images/architecture.png)
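
CMIX itself mixes with an LSTM; as a rough, simplified illustration of the context-mixing idea (not CMIX's actual code), here is a minimal logistic mixer in the spirit of PAQ-style compressors, combining expert probabilities in the logit domain and updating the weights online against the log-loss:

```python
import math

def logit(p):
    # map a probability to log-odds; clamp to avoid infinities
    p = min(max(p, 1e-6), 1 - 1e-6)
    return math.log(p / (1 - p))

def mix(probs, weights):
    """Combine expert probabilities for 'next bit = 1' in the logit domain."""
    s = sum(w * logit(p) for p, w in zip(probs, weights))
    return 1.0 / (1.0 + math.exp(-s))

def update(probs, weights, bit, lr=0.02):
    """One online gradient step on the log-loss of the mixed prediction."""
    err = bit - mix(probs, weights)          # -(d log-loss / d logit)
    return [w + lr * err * logit(p) for p, w in zip(probs, weights)]

# toy usage: two stand-in 'experts' and a short bit stream
weights = [0.5, 0.5]
for bit in [1, 1, 0, 1, 1, 0, 1, 1]:
    experts = [0.8, 0.4]                     # predictions from two hypothetical models
    p = mix(experts, weights)
    weights = update(experts, weights, bit)
```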

### Applications

1. **Improved intuitive understanding** of RNN-based structures for compression. This understanding can later be used to improve more complex image/video compressors.
2. **Wide applications** to generic text/DNA/parameter compression, i.e., wherever arithmetic encoding is used.
3. **Theoretical connections** with log-loss based predictors, which can be understood through simple linear-RNN networks.

## 2. Experiments
We plan to conduct some fundamental experiments first before going on to compress real DNA/text datasets.

The plan is to test with sources such as:

1. i.i.d. sources
2. k-Markov sources
3. 0-entropy sources with very high markovity, for example X_n = X_{n-20} xor X_{n-40} xor X_{n-60}
4. Images, e.g. [https://arxiv.org/abs/1601.06759](https://arxiv.org/abs/1601.06759)

### IID sources
We first start with the simplest sources, i.i.d. sources over a binary alphabet, and check whether we can compress them well. The expected cross-entropy loss for i.i.d. sequences is lower-bounded by the binary entropy of the source, so the aim is to reach this log-loss limit, which will confirm that arithmetic encoding works well.

We observe that for i.i.d. sources, even a small model (an 8-cell, 2-layer network) is able to perform optimally with a very small training sequence (length 1000).
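
To make the log-loss target concrete, here is a small self-contained check (not the repo's code) that the cross-entropy of the true predictor on an i.i.d. Bernoulli(p) sequence matches the binary entropy `$H(p)$`, which is also the best achievable arithmetic-coding rate:

```python
import numpy as np

p = 0.1                        # Bernoulli parameter of the i.i.d. source
n = 10**6
rng = np.random.default_rng(0)
x = rng.random(n) < p          # i.i.d. binary sequence

def cross_entropy(x, q):
    """Log-loss (bits/symbol) of a model that always predicts P(1) = q."""
    return -np.mean(x * np.log2(q) + (~x) * np.log2(1 - q))

H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))   # H(0.1) ≈ 0.469 bits
print(cross_entropy(x, p), H)                      # nearly equal
```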

### 0-entropy sources
Our next sources are 0-entropy stationary sources. By 0-entropy we mean they have zero entropy rate (the `$m^{th}$` order entropy converges to 0 as `$m \rightarrow \infty$`).
Our sources are very simple binary sources such as:

```mathjax
X_n = X_{n-1} + X_{n-k}
```
where `$k$` is the parameter we choose (the + is over the binary alphabet, i.e., XOR). In this case, the process is stationary and is deterministic once you fix the first `$k$` symbols; thus it has entropy rate 0. Note that it looks i.i.d. up to order `$k-1$`, so any model of lower order won't be able to compress it at all.
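
A minimal sketch of this source (assuming the first `$k$` symbols are i.i.d. fair bits, which the text implies but does not state), together with an add-one adaptive order-0 baseline showing that, as claimed below, a low-order model gets essentially no compression on it:

```python
import numpy as np

def xor_source(n, k, seed=0):
    """X_n = X_{n-1} XOR X_{n-k}; the first k symbols are i.i.d. fair bits."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n, dtype=np.uint8)
    x[:k] = rng.integers(0, 2, size=k)
    for i in range(k, n):
        x[i] = x[i - 1] ^ x[i - k]
    return x

def order0_adaptive_bits(x):
    """Log-loss (bits/symbol) of an add-one adaptive order-0 model."""
    ones, total, bits = 1, 2, 0.0
    for b in x:
        p1 = ones / total
        bits += -np.log2(p1 if b else 1 - p1)
        ones += int(b)
        total += 1
    return bits / len(x)

x = xor_source(10**5, k=30)
print(order0_adaptive_bits(x))   # ≈ 1.0 bit/symbol: no gain without order-k context
```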

We conduct experiments by varying `$k$`. Higher `$k$` is generally more difficult to compress for standard compressors like LZ (and in fact a lower-order adaptive arithmetic encoder won't be able to compress at all). Some of the observations are as follows:

#### Parameters:
* All results are for 2-epoch runs (1 epoch training & 1 epoch compression).
* The generated input files are of size `$10^8$` symbols, which is also one of the standard lengths for compression comparison.
* The model is a 32-cell, 2- and 3-layer network (we also try other models; more on that later).
* The sequence length for training was 64 (lengths higher than 64 get difficult to train).
* Unlike standard RNN training, we retain the state at the end of each batch so that it can be passed correctly to the next chunk, and the batches are parsed sequentially through the text (see the sketch after this list).
* We have a validation set, which we run every 100th iteration. The validation text is generated with the same parameters and has length 10,000.
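
A minimal sketch of this state-carrying, truncated-backpropagation loop (illustration only, written with PyTorch rather than the repo's TensorFlow code; the sizes match the parameters above):

```python
import torch
import torch.nn as nn

seq_len, hidden, layers = 64, 32, 2
model = nn.LSTM(input_size=1, hidden_size=hidden, num_layers=layers, batch_first=True)
head = nn.Linear(hidden, 2)                       # predicts P(next bit)
opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_on_stream(bits):                        # bits: 1-D tensor of 0/1
    state = None                                  # (h, c); None means zero state at the start
    for start in range(0, len(bits) - seq_len - 1, seq_len):
        # chunks are taken in original order, so the carried state is meaningful
        x = bits[start:start + seq_len].float().view(1, seq_len, 1)
        y = bits[start + 1:start + seq_len + 1].long().view(1, seq_len)
        out, state = model(x, state)
        loss = loss_fn(head(out).reshape(-1, 2), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        # keep the state values but cut the gradient history (truncated BPTT)
        state = tuple(s.detach() for s in state)
```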


#### 2-layer network
The graph shows the learning curves for the 2-layer model on inputs with markovity [10, 20, 30, 40, 50, 60]. We observe that the model takes longer to learn for higher markovity. This is expected, as the model (intuitively) explores smaller orders before going on to higher ones.
Also observe that the model is not able to learn at all in 1 epoch for markovity 60.

![val-2](images/val_64_2_layer.png)

#### 3-layer network
The 3-layer model has similar difficulties: it is also not able to learn the markovity-60 text. This suggests the problem has to do with information flow, and perhaps is due to the vanishing-gradients issue.

![val-1](images/val_64_3_layer.png)

The overall results are as follows. The numbers are bits per symbol; as the input is binary, the worst we should do is 1 bit/symbol. We compare the results with the universal compressor XZ.

| Markovity | 3-layer NN | 2-layer NN | XZ |
| --- | --- | --- | --- |
| 10 | 0.0001 | 0.0001 | 0.004 |
| 20 | 0.0002 | 0.0005 | 0.05 |
| 30 | 0.01 | 0.07 | 0.4 |
| 40 | 0.1 | 0.27 | 0.58 |
| 50 | 0.3 | 0.4 | 0.65 |
| 60 | 1 | 1 | 0.63 |

The results show that, even over pretty large files (~100 MB), the models perform very well up to markovity 40-50. For longer markovity, however, the model is not able to figure out much, while LZ captures some structure, mainly because of the form of the LZ algorithm. (I will regenerate the markovity-60 data a few more times to confirm, as 0.63 looks a bit lower than expected.)
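
For reference, the XZ baseline can be approximated on a generated sequence with Python's built-in `lzma` module (which produces the XZ format). This sketch assumes the input is stored one byte per binary symbol, which may differ from how the original files were written, and reuses `xor_source` from the earlier sketch:

```python
import lzma
import numpy as np

def xz_bits_per_symbol(bits):
    """Compress a 0/1 sequence (one byte per symbol) with XZ; report bits/symbol."""
    raw = np.asarray(bits, dtype=np.uint8).tobytes()
    compressed = lzma.compress(raw, preset=9)
    return 8.0 * len(compressed) / len(bits)

print(xz_bits_per_symbol(xor_source(10**6, k=30)))   # xor_source() defined above
```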

This suggests that we should be able to use LZ-based features along with the NN to improve compression. It also suggests that directly dealing with sources with very long dependencies (images, DNA) in a naive way will not work because of the ~50-markovity limit; we need to pass the context in a more clever way.

#### Analysis of how sequence length impacts the learning

It was observed that the sequence length used for truncated backpropagation dramatically impacts learning. One positive is that the network sometimes learns dependencies longer than the sequence length. Very long sequence lengths, however, suffer from the vanishing-gradients issue, and we still need to think about how to solve that.

For a 16-cell, 2-layer network with sequence length 8, we were able to train for markovity 10 very well (so even though we do not explicitly backpropagate beyond 8 steps, there is still some learning of longer dependencies). However, anything above that (markovity 15, 20, ...) gets very difficult to train.

It would be interesting to see whether the RNN is able to figure out such deep correlations, and useful to quantify the amount of state information required to achieve the entropy limits for these sources (which RNN models, how many layers).
![train-1](images/loss_8.png)
![val-1](images/val_loss_8.png)

## Feb 17 Update

### IID sources and 3-4 Markov Sources

I tried some small Markov sources and i.i.d. sources. The network is easily able to learn the distribution (within a few iterations).

I tried with two real datasets. The first one is the chromosome 1 DNA dataset.

### Hutter prize dataset

The Hutter prize is a competition for compressing the Wikipedia knowledge dataset (100 MB) into 16 MB or less. Compressors like gzip get to about 34 MB, while LZTurbo with more careful preprocessing gets to about 25 MB. The best state-of-the-art compressors (which incidentally also use neural networks for context mixing) get close to 15 MB. Our basic character-level model (1024-cell, 3-layer) compresses to close to 16.5 MB (roughly 16.5/100 × 8 ≈ 1.3 bits per character), which again is comparatively good.

![hutter](char-rnn-tensorflow/images/img3.png)

The overall observation is that the neural network compresses lower-order context very easily and quite well. Capturing higher-order contexts needs more careful training, or a change in the model, which I am still exploring.

## Feb 24 Update

Expand All @@ -105,45 +155,6 @@ I believe, using simpler models, with the new changes can significantly boost th
2. Run it on images/video? (still needs some work): see PixelRNN
3. Read more about the context mixing algorithms used in video codecs etc.

## Mar 3 Update

1. Running on a validation set of length 10,000 (every training batch is a short sequence of length 64). It is observed that the model generalizes very well in most cases (I observed one case where the model overfit the data significantly and did not generalize well to unseen sequences; I am still investigating that scenario).

2. I am able to train well for sequences up to markovity 50 (for a training sequence of length 64); above that, the model does not learn well. Detailed results, including the comparison with XZ (one of the best universal compressors), are in the table in the Experiments section above.

