Streaming does not work as expected #52

Closed
djmaze opened this issue Jan 13, 2024 · 7 comments

@djmaze
Contributor

djmaze commented Jan 13, 2024

When using "stream": true, the results are returned in the expected format, but (as opposed to OpenAI) they seem to be returned in a big batch after the complete response has been generated. Is this because of a limitation in tabbyAPI or in ExllamaV2?

@bdashore3
Member

This is not happening on my end. Can you please give me some reproduction info?

@djmaze
Contributor Author

djmaze commented Jan 14, 2024

Here is an example using the openai client library (adapted from its documentation). You need to fill in your own Tabby URL and OpenAI API key. You can also use any other model. (I tried with turboderp/Mixtral-8x7B-instruct-exl2 and LoneStriker/Nous-Capybara-34B-4.65bpw-h6-exl2.)

import time
from openai import OpenAI

# Toggle to use Tabby instead of OpenAI
use_tabby = False

if use_tabby:
    client = OpenAI(
        base_url="http://your-tabby-server",
        api_key="dummy"
    )
    model="LoneStriker--Nous-Capybara-34B-4.65bpw-h6-exl2"
else:
    client = OpenAI(
        api_key="your-api-key"
    )
    model = "gpt-3.5-turbo"


prompt = "Write a python webserver which returns the number pi at the url `/pi`"


# Time the create() call itself (until the stream object is returned)
start = time.time()
stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
end = time.time()

print("Completions:", end - start)

# Time how long it takes to consume the stream chunk by chunk
start = time.time()
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
end = time.time()

print("Stream:", end - start)

When running this, you can see that with OpenAI the characters are printed one by one, whereas with Tabby there is a long delay after client.chat.completions.create and the streaming loop then finishes almost instantly: in my case 4.7 s for the completions call and 0.07 s for the streaming, compared to 0.9 s for the completions and 2.88 s for the streaming with OpenAI.

@bdashore3
Member

I was able to reproduce the issue using your code. To fix this on my end, I removed the end="" from the print statement in the streaming loop. Doing so made streaming work properly rather than streaming line by line.

@bdashore3
Member

Closing this issue due to inactivity.

@djmaze
Contributor Author

djmaze commented Jan 24, 2024

Oh right, almost. The print() call needs end="", flush=True. With that change it works for me, so it looks like this is working correctly.

Also, for a real test, I had to add max_tokens=1024 to the completion call so the response could exceed a certain length. There is a mismatch with OpenAI here, by the way: when leaving out the max_tokens parameter, OpenAI returns the whole response (even if it is much longer). This behaviour makes it harder to use tabbyAPI with existing OpenAI clients/UIs.
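
For reference, here is a minimal sketch of the adjusted reproduction, combining the flush=True fix and the max_tokens cap mentioned above (the base_url, API key, model name, and the 1024 value are just the placeholders from the script in this thread):

import time
from openai import OpenAI

# Placeholder values; point these at your own Tabby instance and model.
client = OpenAI(base_url="http://your-tabby-server", api_key="dummy")
model = "LoneStriker--Nous-Capybara-34B-4.65bpw-h6-exl2"
prompt = "Write a python webserver which returns the number pi at the url `/pi`"

start = time.time()
stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    stream=True,
    max_tokens=1024,  # allow the response to exceed the shorter default length
)
print("Completions:", time.time() - start)

start = time.time()
for chunk in stream:
    # flush=True prints each token as it arrives instead of buffering per line
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
print("Stream:", time.time() - start)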

@djmaze
Contributor Author

djmaze commented Jan 24, 2024

(Adding to this, I would expect max_tokens to default to the model's context size.)

@bdashore3
Member

Defaulting max_tokens to the max sequence length is bad practice for local models as they can devolve really quickly. Other frontends such as SillyTavern use 150 as the default value. I increased this from the default value of 16 that OAI uses for its completions endpoint.

As for your problem with setting max_tokens in the request: there is a new feature coming soon to override generation parameter defaults from Tabby's side. You can try it by pulling the formatting branch.
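
Until that feature lands, one client-side stopgap is to wrap the call so requests without an explicit max_tokens still get a larger cap. This is only an illustrative sketch: the helper name and the 1024 default are made up here, not part of tabbyAPI or the openai library.

def create_with_default_max_tokens(client, default_max_tokens=1024, **kwargs):
    # Only fills in max_tokens when the caller did not set it explicitly.
    kwargs.setdefault("max_tokens", default_max_tokens)
    return client.chat.completions.create(**kwargs)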
