
Fix excessive token usage with Unicode text in realtime event serialization #2444

Open
josharsh wants to merge 1 commit into main from fix-unicode-token-usage

Conversation


@josharsh josharsh commented Jul 4, 2025

Non-ASCII characters in realtime event data (such as Cyrillic, Chinese, and Arabic) were being unnecessarily escaped during JSON serialization, causing significant token overhead.

This fix adds ensure_ascii=False to the json.dumps() calls used when sending realtime WebSocket events, preserving Unicode characters in their original form.

Token savings:

  • 54-60% size reduction for Unicode-heavy schemas
  • ~116+ tokens saved per typical function schema with Cyrillic descriptions
  • Backward compatible: outputs valid JSON that parses identically

Fixes issue #2428 where Pydantic schema descriptions with Cyrillic text caused 3.6x token overhead.

The fix updates both the sync and async realtime connection send() methods to use ensure_ascii=False, which keeps UTF-8 text intact instead of expanding it into \uXXXX escape sequences.
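
The change itself is a one-argument difference in the serialization call. A minimal sketch of the effect (the event shape below is illustrative, not the SDK's internal code):

import json

# Default: json.dumps escapes every non-ASCII character into a \uXXXX sequence.
event = {"type": "session.update", "session": {"instructions": "Имя пользователя"}}
escaped = json.dumps(event)                      # ...\u0418\u043c\u044f...
compact = json.dumps(event, ensure_ascii=False)  # ...Имя пользователя...

# Both strings parse back to the identical object; the compact form is simply shorter.
assert json.loads(escaped) == json.loads(compact)
print(len(escaped), len(compact))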

  • I understand that this repository is auto-generated and my pull request may not be merged

Changes being requested

Additional context & links

@josharsh josharsh requested a review from a team as a code owner July 4, 2025 21:11

tg-bomze commented Jul 6, 2025

Unfortunately, your code (pip install git+https://github.com/josharsh/openai-python.git@fix-unicode-token-usage) does not solve the problem in my example:

import json
import os
from pydantic import BaseModel, Field
from openai import OpenAI

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")

user_prompt = "My name John"
messages = [{"role": "user", "content": user_prompt}]

class Schema(BaseModel):
    user_name: str = Field(description="Имя пользователя, если он его называл. Если нет, то оставь пустую строку")

schema = Schema.model_json_schema()

print("--- SCHEMA (ensure_ascii=False) ---")
str_schema = json.dumps(schema, ensure_ascii=False)
num_tokens = len(enc.encode(str_schema))
print(str_schema)
print(f"Num tokens: {num_tokens}")

print("\n--- SCHEMA (ensure_ascii=True) ---")
str_schema = json.dumps(schema, ensure_ascii=True)
num_tokens = len(enc.encode(str_schema))
print(str_schema)
print(f"Num tokens: {num_tokens}")

print("\n--- USER PROMPT ---")
num_tokens = len(enc.encode(user_prompt))
print(user_prompt)
print(f"Num tokens: {num_tokens}")

print("\n--- MESSAGES ---")
str_messages = str(messages)
num_tokens = len(enc.encode(str_messages))
print(str_messages)
print(f"Num tokens: {num_tokens}")

api_key = os.environ["OPENAI_API_KEY"]
with OpenAI(api_key=api_key) as client:
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=messages,
        response_format=Schema,
    )

print("\n--- Prompt tokens (from response) ---")
print(f"Num tokens: {response.usage.prompt_tokens}")

Result:

--- SCHEMA (ensure_ascii=False) ---
{"properties": {"user_name": {"description": "Имя пользователя, если он его называл. Если нет, то оставь пустую строку", "title": "User Name", "type": "string"}}, "required": ["user_name"], "title": "Schema", "type": "object"}
Num tokens: 65

--- SCHEMA (ensure_ascii=True) ---
{"properties": {"user_name": {"description": "\u0418\u043c\u044f \u043f\u043e\u043b\u044c\u0437\u043e\u0432\u0430\u0442\u0435\u043b\u044f, \u0435\u0441\u043b\u0438 \u043e\u043d \u0435\u0433\u043e \u043d\u0430\u0437\u044b\u0432\u0430\u043b. \u0415\u0441\u043b\u0438 \u043d\u0435\u0442, \u0442\u043e \u043e\u0441\u0442\u0430\u0432\u044c \u043f\u0443\u0441\u0442\u0443\u044e \u0441\u0442\u0440\u043e\u043a\u0443", "title": "User Name", "type": "string"}}, "required": ["user_name"], "title": "Schema", "type": "object"}
Num tokens: 233

--- USER PROMPT ---
My name John
Num tokens: 3

--- MESSAGES ---
[{'role': 'user', 'content': 'My name John'}]
Num tokens: 16

--- Prompt tokens (from response) ---
Num tokens: 240

I think this is a problem on the OpenAI servers' side: they receive the JSON object and then serialize it themselves without ensure_ascii=False.
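
A rough accounting with the numbers printed above supports that reading (treating the small remainder as message/framing overhead, which is an assumption):

# Rough accounting from the output above.
escaped_schema_tokens = 233    # json.dumps(schema, ensure_ascii=True)
unescaped_schema_tokens = 65   # json.dumps(schema, ensure_ascii=False)
billed_prompt_tokens = 240     # response.usage.prompt_tokens

# The billed count sits a handful of tokens above the escaped form but ~175
# above the unescaped form, so the server appears to tokenize the escaped
# serialization of the schema.
print(billed_prompt_tokens - escaped_schema_tokens)    # 7
print(billed_prompt_tokens - unescaped_schema_tokens)  # 175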
