
Fix excessive token usage with Unicode text in realtime event serialization #2444

Open
josharsh wants to merge 1 commit into main from fix-unicode-token-usage

Conversation


@josharsh josharsh commented Jul 4, 2025

Non-ASCII characters in realtime event data (such as Cyrillic, Chinese, and Arabic) were being unnecessarily escaped during JSON serialization, causing significant token overhead.

This fix adds ensure_ascii=False to the json.dumps() calls used when sending realtime WebSocket events, preserving Unicode characters in their original form.

Token savings:

  • 54-60% size reduction for Unicode-heavy schemas
  • ~116+ tokens saved per typical function schema with Cyrillic descriptions
  • Backward compatible: outputs valid JSON that parses identically

Fixes issue #2428 where Pydantic schema descriptions with Cyrillic text caused 3.6x token overhead.

The fix updates both the sync and async realtime connection send() methods to use ensure_ascii=False, which keeps UTF-8 text intact instead of expanding it into \uXXXX escape sequences.
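
The change itself is a one-argument difference in the serialization call. A minimal sketch of the effect (the event shape below is illustrative, not the SDK's internal code):

import json

# Default: json.dumps escapes every non-ASCII character into a \uXXXX sequence.
event = {"type": "session.update", "session": {"instructions": "Имя пользователя"}}
escaped = json.dumps(event)                      # ...\u0418\u043c\u044f...
compact = json.dumps(event, ensure_ascii=False)  # ...Имя пользователя...

# Both strings parse back to the identical object; the compact form is simply shorter.
assert json.loads(escaped) == json.loads(compact)
print(len(escaped), len(compact))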

  • I understand that this repository is auto-generated and my pull request may not be merged

Changes being requested

Additional context & links

@josharsh josharsh requested a review from a team as a code owner July 4, 2025 21:11

tg-bomze commented Jul 6, 2025

Unfortunately, your code (pip install git+https://github.com/josharsh/openai-python.git@fix-unicode-token-usage) does not solve the problem in my example:

import json
import os
from pydantic import BaseModel, Field
from openai import OpenAI

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")

user_prompt = "My name John"
messages = [{"role": "user", "content": user_prompt}]

class Schema(BaseModel):
    user_name: str = Field(description="Имя пользователя, если он его называл. Если нет, то оставь пустую строку")

schema = Schema.model_json_schema()

print("--- SCHEMA (ensure_ascii=False) ---")
str_schema = json.dumps(schema, ensure_ascii=False)
num_tokens = len(enc.encode(str_schema))
print(str_schema)
print(f"Num tokens: {num_tokens}")

print("\n--- SCHEMA (ensure_ascii=True) ---")
str_schema = json.dumps(schema, ensure_ascii=True)
num_tokens = len(enc.encode(str_schema))
print(str_schema)
print(f"Num tokens: {num_tokens}")

print("\n--- USER PROMPT ---")
num_tokens = len(enc.encode(user_prompt))
print(user_prompt)
print(f"Num tokens: {num_tokens}")

print("\n--- MESSAGES ---")
str_messages = str(messages)
num_tokens = len(enc.encode(str_messages))
print(str_messages)
print(f"Num tokens: {num_tokens}")

api_key = os.environ["OPENAI_API_KEY"]
with OpenAI(api_key=api_key) as client:
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=messages,
        response_format=Schema,
    )

print("\n--- Prompt tokens (from response) ---")
print(f"Num tokens: {response.usage.prompt_tokens}")

Result:

--- SCHEMA (ensure_ascii=False) ---
{"properties": {"user_name": {"description": "Имя пользователя, если он его называл. Если нет, то оставь пустую строку", "title": "User Name", "type": "string"}}, "required": ["user_name"], "title": "Schema", "type": "object"}
Num tokens: 65

--- SCHEMA (ensure_ascii=True) ---
{"properties": {"user_name": {"description": "\u0418\u043c\u044f \u043f\u043e\u043b\u044c\u0437\u043e\u0432\u0430\u0442\u0435\u043b\u044f, \u0435\u0441\u043b\u0438 \u043e\u043d \u0435\u0433\u043e \u043d\u0430\u0437\u044b\u0432\u0430\u043b. \u0415\u0441\u043b\u0438 \u043d\u0435\u0442, \u0442\u043e \u043e\u0441\u0442\u0430\u0432\u044c \u043f\u0443\u0441\u0442\u0443\u044e \u0441\u0442\u0440\u043e\u043a\u0443", "title": "User Name", "type": "string"}}, "required": ["user_name"], "title": "Schema", "type": "object"}
Num tokens: 233

--- USER PROMPT ---
My name John
Num tokens: 3

--- MESSAGES ---
[{'role': 'user', 'content': 'My name John'}]
Num tokens: 16

--- Prompt tokens (from response) ---
Num tokens: 240

I think this is a problem on the OpenAI servers' side: they receive the JSON object and then serialize it themselves without ensure_ascii=False.
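
A rough accounting with the numbers printed above supports that reading (treating the small remainder as message/framing overhead, which is an assumption):

# Rough accounting from the output above.
escaped_schema_tokens = 233    # json.dumps(schema, ensure_ascii=True)
unescaped_schema_tokens = 65   # json.dumps(schema, ensure_ascii=False)
billed_prompt_tokens = 240     # response.usage.prompt_tokens

# The billed count sits a handful of tokens above the escaped form but ~175
# above the unescaped form, so the server appears to tokenize the escaped
# serialization of the schema.
print(billed_prompt_tokens - escaped_schema_tokens)    # 7
print(billed_prompt_tokens - unescaped_schema_tokens)  # 175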
