Useless but mildly interesting language model using compressors built-in to Python.
You can "train" it using some training data:
data = open(my_favorite_text_file).read().lower()
alphabet = "qwertyuiopasdfghjklzxcvbnm,.;1234567890 "
model = ziplm.ZipModel(alphabet, training=data)
"".join(model.sample_sequence(10)) # sample 10 characters from the alphabet
You can also run it without any training data, and just forward sample to see what kinds of patterns gzip likes:
alphabet = "abc"
model = ziplm.ZipModel(alphabet)
"".join(model.sample_sequence(100)) # I get 'ccabcabcabcabcabcabcabcabcabcabcabcabcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbccabcbcbcbc'
You can also get the probability for a sequence:
alphabet = "qwertyuiopasdfghjklzxcvbnm "
model = ziplm.ZipModel(alphabet)
model.sequence_logprob("this is my favorite string") # I get -83.8
You can also try using bz2
and lzma
as language models by passing them as the compressor
argument to the model
import lzma
model = ziplm.ZipModel(alphabet, compressor=lzma)
"".join(model.sample_sequence(100)) # I get 'cccbaaaaacccccabcacccbaaaaabaacaabaacaabaacaabaabacaaaaaaaaaaacccbabacaaaaaaaaaaaccccacaaccbaaaaaccc'
This works because of two facts:
- A language model is nothing but a distribution on the next token given previous tokens,
$p(x \mid c)$ . - There is a general equivalence between probability distributions and codes.
The second point is what makes this interesting. Information theory tells us that we can derive codes from probability distributions. That is, if I have some datapoints
Since I can convert probability distributions to codes in this way, I can also convert codes to probability distributions. If I have a code (like gzip) that describes my datapoint with length
The ZipLM model works by converting code lengths to probabilities in this way. If I have a vocabulary of size
It's pretty bad, but it doesn't generate total junk. Here I trained the gzip model in Moby Dick---from the Project Gutenberg text---and the output at least has some recognizable parts:
data = open("mobydick.txt").read().lower()
alphabet = "qwertyuiopasdfghjkl;'zxcvbnm,. "
model = ziplm.Model(alphabet, data)
"".join(model.sample_sequence(100))
This gives me
"'theudcanvas. ;cm,zumhmcyoetter toauuo long a one aay,;wvbu.mvns. x the dtls and enso.;k.like bla.njv'"
which at least seems to have "long a one" in it.