This is the changelog for the open source version of tiktoken.
- Optimise regular expressions for a 20% performance improvement, thanks to @paplorinc!
- Add `text-embedding-3-*` models to `encoding_for_model`
- Check content hash for downloaded files
- Allow pickling `Encoding` objects. Registered `Encoding` will be pickled by reference
- Work around a PyO3 bug for frozenset conversion
Thank you to @paplorinc, @mdwelsh, @Praneet460!
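
As an illustration of the content hash check mentioned above, here is a minimal sketch. It assumes SHA-256 and uses a hypothetical helper name, `matches_expected_hash`; the library's actual implementation may differ.

```python
import hashlib


def matches_expected_hash(data: bytes, expected_hash: str) -> bool:
    """Return True when the SHA-256 hex digest of `data` equals `expected_hash`.

    Hypothetical helper for illustration; not tiktoken's own function.
    """
    actual_hash = hashlib.sha256(data).hexdigest()
    return actual_hash == expected_hash


# Simulate a downloaded file and the digest recorded for it ahead of time.
payload = b"example vocabulary file contents"
recorded_digest = hashlib.sha256(payload).hexdigest()

# An intact download passes; a modified one does not.
print(matches_expected_hash(payload, recorded_digest))
print(matches_expected_hash(payload + b"tampered", recorded_digest))
```

A caller would typically discard the cached or downloaded file and retry when the check fails.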
- Build wheels for Python 3.12
- Update version of PyO3 to allow multiple imports
- Avoid permission errors when using default cache logic
- Add `encoding_name_for_model`, undo some renames to variables that are implementation details
- Add `tiktoken._educational` submodule to better document how byte pair encoding works
- Ensure `encoding_for_model` knows about several new models
- Add `decode_with_offsets`
- Better error for failures with the plugin mechanism
- Make more tests public
- Update versions of dependencies
- Add `decode_batch` and `decode_bytes_batch`
- Improve error messages and handling
- `tiktoken` will now make a best-effort attempt to replace surrogate pairs with the corresponding Unicode character and will replace lone surrogates with the Unicode replacement character.
- Add encoding for GPT-4
- Build aarch64 wheels
- Make `blobfile` an optional dependency
Thank you to @messense for the environment variable that makes cargo not OOM under emulation!
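
The surrogate handling noted above can be sketched in plain Python. This is an illustration under the assumption that a UTF-16 round trip is used; `normalize_surrogates` is a hypothetical name, not part of the public API.

```python
def normalize_surrogates(text: str) -> str:
    """Merge surrogate pairs into one character and replace lone surrogates.

    Encoding with "surrogatepass" lets surrogate code points through as raw
    UTF-16 code units; decoding with "replace" then joins valid pairs and
    turns unpaired surrogates into U+FFFD.
    """
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


# A valid high/low surrogate pair becomes the single character U+1F600.
paired = normalize_surrogates("\ud83d\ude00")
print(paired == "\U0001F600")

# A lone surrogate becomes the Unicode replacement character.
lone = normalize_surrogates("x\ud800y")
print(lone == "x\ufffdy")
```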
- Improve performance by 5-20%; thank you to @nistath!
- Add `gpt-3.5-turbo` models to `encoding_for_model`
- Add prefix matching to `encoding_for_model` to better support future model versions
- Fix a bug in the README instructions on extending tiktoken
- Update the set of available encodings
- Add packaging metadata
- Add `tiktoken.encoding_for_model` to get the encoding for a specific model
- Improve portability of caching logic
Thank you to @fritzo, @arvid220u, @khanhvu207, @henriktorget for various small corrections
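
To illustrate the idea behind prefix matching in `encoding_for_model`, here is a rough sketch. The two tables are small illustrative excerpts and `encoding_name_for` is a hypothetical function; the real lookup tables are larger and the real logic may differ.

```python
from typing import Dict

# Illustrative excerpts only (assumption): exact model names first,
# then prefixes that cover dated or future variants of a model.
EXACT_MODELS: Dict[str, str] = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
}
MODEL_PREFIXES: Dict[str, str] = {
    "gpt-4-": "cl100k_base",
    "gpt-3.5-turbo-": "cl100k_base",
}


def encoding_name_for(model_name: str) -> str:
    """Resolve a model name to an encoding name, falling back to prefixes."""
    if model_name in EXACT_MODELS:
        return EXACT_MODELS[model_name]
    for prefix, encoding_name in MODEL_PREFIXES.items():
        if model_name.startswith(prefix):
            return encoding_name
    raise KeyError(f"No encoding known for model {model_name!r}")


print(encoding_name_for("gpt-4"))       # exact match
print(encoding_name_for("gpt-4-0314"))  # matched through the "gpt-4-" prefix
```

Falling back to prefixes means a newly released variant of a known model family resolves without a table update.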
- Avoid use of `blobfile` for public files
- Add support for Python 3.8
- Add py.typed
- Improve the public tests
- Initial release