Textcoder

Textcoder is a proof-of-concept tool for steganographically encoding secret messages such that they appear as ordinary, unrelated text.

It works by taking the secret message, encrypting it to produce a pseudorandom bit stream, and then using arithmetic coding to decompress that bit stream based on a statistical model derived from an LLM. This produces text which appears to be sampled randomly from the LLM, while actually encoding the secret message in the specific token choices.
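As a rough illustration of the decompression step, here is a simplified sketch of how a bit stream can steer an LLM's token choices. This is not Textcoder's actual implementation (which uses bijective arithmetic coding and encryption); the function name bits_to_tokens is purely illustrative, and a floating-point interval stands in for the integer arithmetic a real coder would use:

# Simplified sketch, not Textcoder's implementation: treat the bit stream as a
# fraction in [0, 1) and repeatedly pick the token whose cumulative-probability
# band contains it, narrowing the band as an arithmetic decoder would. Real
# coders use integer arithmetic to avoid this version's float precision loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_to_tokens(bits, model, tokenizer, n_tokens=20):
    x = sum(b / 2 ** (i + 1) for i, b in enumerate(bits))  # bits -> [0, 1)
    lo, hi = 0.0, 1.0
    ids = [tokenizer.bos_token_id]
    for _ in range(n_tokens):
        with torch.no_grad():
            logits = model(torch.tensor([ids])).logits[0, -1]
        cum = torch.cumsum(torch.softmax(logits, dim=-1), dim=-1)
        target = (x - lo) / (hi - lo)  # rescale x into the current band
        tok = min(int(torch.searchsorted(cum, target)), cum.numel() - 1)
        band_lo = float(cum[tok - 1]) if tok > 0 else 0.0
        lo, hi = lo + (hi - lo) * band_lo, lo + (hi - lo) * float(cum[tok])
        ids.append(tok)
    return tokenizer.decode(ids)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
print(bits_to_tokens([1, 0, 1, 1, 0, 1, 0, 0], model, tokenizer))

Decoding reverses this: given the text, replay the model's probability distributions to recover which interval each token selected, and hence the original bits.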

For example, the secret message "hello, world!" could be encoded into the following text:

% echo 'hello, world!' | textcoder -p 'foobar' > encoded.txt
% cat encoded.txt
"Goodbye 2024! Can't wait to start fresh with a brand new year and a new chance to slay the game in 2025 #NewYearNewMe #ConfidenceIsKey" - @SlayMyGameLife7770 (280 character limit) I just ordered my fave coffee from Dunkin' yesterday but I almost spilled it on my shirt, oh no! #DunkinCoffeePlease #FashionBlunders Life is just trying to keep up with its favorite gamers rn. Wish I could say I'm coding instead of gaming, but when i have to put down my controller for a sec

A user with knowledge of the password could then decode the message as follows:

% cat encoded.txt | textcoder -d -p 'foobar'
hello, world!

Running the Project

Textcoder is packaged with Poetry, so Poetry must be installed first. After installing Poetry, clone the repository and install the dependencies:

git clone https://github.com/shawnz/textcoder.git
cd textcoder
poetry install

Additionally, Textcoder makes use of the Llama 3.2 1B Instruct language model. Using this model requires accepting the Llama 3.2 community license agreement. After accepting the license, install the Hugging Face Hub command-line interface and log in to your Hugging Face account:
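If the CLI is not already present on your system, one common way to get it (an assumption about your environment, not a step the project mandates) is from pip:

pip install -U "huggingface_hub[cli]"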

huggingface-cli login

WARNING: The Llama 3.2 license does not permit using the model with the intent of representing its output as human-generated. This may limit the situations in which you are allowed to use Textcoder. See this issue for details.

You can now run Textcoder using the textcoder command. To encode a message, run:

echo '<message>' | poetry run textcoder -p '<password>' > encoded.txt

To decode a message, run:

cat encoded.txt | poetry run textcoder -d -p '<password>'

Known Issues

Conflicting Tokenizations

The Llama tokenizer used in this project sometimes permits multiple tokenizations of the same string. As a result, the arithmetic coder can produce a sequence of tokens that does not exactly match the canonical tokenization of the text it decodes to. In these cases, decoding will fail and you may need to run the encoding process again.

To mitigate this, the encoder tries decoding the output before returning it. This validation step costs extra time and memory. To skip it, use the -n (--no-validate) parameter.
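A minimal sketch of this kind of round-trip check, assuming a Hugging Face tokenizer (the function name is illustrative, not Textcoder's API, and special tokens are glossed over):

def tokens_round_trip(token_ids, tokenizer):
    # Decode the chosen tokens to text, then re-tokenize the text. If the
    # re-tokenization differs from the original ids, the message would not
    # be recoverable and encoding should be retried.
    text = tokenizer.decode(token_ids)
    return tokenizer.encode(text, add_special_tokens=False) == token_ids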

Non-Deterministic Behaviour

The Llama model used in this project is not guaranteed to behave completely deterministically. Due to issues such as floating-point inaccuracy and differing algorithms across hardware and software versions, outputs can change between platforms, or even between successive runs on the same platform. When this happens, the output cannot be decoded.

This project takes some steps to use deterministic algorithms where possible, but that doesn't guarantee that different platforms will produce the same outputs. For best results, decode messages on the same platform they were encoded on.
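For reference, these are the kinds of PyTorch settings such determinism efforts typically involve (a general sketch, not necessarily the exact flags Textcoder sets):

import torch

torch.manual_seed(0)                      # fixed RNG seed
torch.use_deterministic_algorithms(True)  # error out on nondeterministic ops
torch.backends.cudnn.benchmark = False    # no autotuned, hardware-dependent kernels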

To help mitigate the impact of hardware differences, consider disabling hardware acceleration using the -a (--no-acceleration) parameter. However, this will greatly decrease performance.
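For example, to encode a message with acceleration disabled:

echo '<message>' | poetry run textcoder -a -p '<password>' > encoded.txt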

Acknowledgements

This project wouldn't have been possible without Matt Timmermans' Bijective Arithmetic Coding algorithm, which is what makes it possible to decompress arbitrary bit streams. His code was ported to Python with the assistance of Anthropic's Claude 3.5 Sonnet for inclusion in this project. See the original ported code here.

Similar Projects
