Streaming does not work as expected #52

Closed
djmaze opened this issue Jan 13, 2024 · 7 comments

@djmaze
Contributor

djmaze commented Jan 13, 2024

When using "stream": true, the results are returned in the expected format, but (as opposed to OpenAI) they seem to be returned in a big batch after the complete response has been generated. Is this because of a limitation in tabbyAPI or in ExllamaV2?

@bdashore3
Member

This is not happening on my end. Can you please give me some reproduction info?

@djmaze
Contributor Author

djmaze commented Jan 14, 2024

Here is an example using the openai client library (adapted from its documentation). You need to fill in your own Tabby URL and OpenAI API key. You can also use any other model. (I tried with turboderp/Mixtral-8x7B-instruct-exl2 and LoneStriker/Nous-Capybara-34B-4.65bpw-h6-exl2.)

import time
from openai import OpenAI

# Toggle to use Tabby instead of OpenAI
use_tabby = False

if use_tabby:
    client = OpenAI(
        base_url="http://your-tabby-server",
        api_key="dummy"
    )
    model="LoneStriker--Nous-Capybara-34B-4.65bpw-h6-exl2"
else:
    client = OpenAI(
        api_key="your-api-key"
    )
    model = "gpt-3.5-turbo"


prompt = "Write a python webserver which returns the number pi at the url `/pi`"


# Time the create() call itself (until the stream object is returned)
start = time.time()
stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
end = time.time()

print("Completions:", end - start)

# Time how long it takes to consume the stream chunk by chunk
start = time.time()
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
end = time.time()

print("Stream:", end - start)

When running this, you can see that with OpenAI the characters are printed one by one, whereas with Tabby there is a long delay after client.chat.completions.create and the streaming loop then finishes almost instantly: in my case 4.7 s for the completions call and 0.07 s for the streaming, compared to 0.9 s for the completions and 2.88 s for the streaming with OpenAI.

@bdashore3
Member

I was able to reproduce the issue using your code. To fix this on my end, I removed the end="" from the print statement in the streaming loop. Doing so made streaming work properly rather than streaming line by line.

@bdashore3
Member

Closing this issue due to inactivity.

@djmaze
Contributor Author

djmaze commented Jan 24, 2024

Oh right, almost. The print() call needs end="", flush=True. With that change it works for me, so it looks like this is working correctly.

Also, for a real test, I had to add max_tokens=1024 to the completion call so the response could exceed a certain length. There is a mismatch with OpenAI here, by the way: when leaving out the max_tokens parameter, OpenAI returns the whole response (even if it is much longer). This behaviour makes it harder to use tabbyAPI with existing OpenAI clients/UIs.
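
For reference, here is a minimal sketch of the adjusted reproduction, combining the flush=True fix and the max_tokens cap mentioned above (the base_url, API key, model name, and the 1024 value are just the placeholders from the script in this thread):

import time
from openai import OpenAI

# Placeholder values; point these at your own Tabby instance and model.
client = OpenAI(base_url="http://your-tabby-server", api_key="dummy")
model = "LoneStriker--Nous-Capybara-34B-4.65bpw-h6-exl2"
prompt = "Write a python webserver which returns the number pi at the url `/pi`"

start = time.time()
stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    stream=True,
    max_tokens=1024,  # allow the response to exceed the shorter default length
)
print("Completions:", time.time() - start)

start = time.time()
for chunk in stream:
    # flush=True prints each token as it arrives instead of buffering per line
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
print("Stream:", time.time() - start)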

@djmaze
Contributor Author

djmaze commented Jan 24, 2024

(Adding to this, I would expect max_tokens to default to the model's context size.)

@bdashore3
Member

Defaulting max_tokens to the max sequence length is bad practice for local models as they can devolve really quickly. Other frontends such as SillyTavern use 150 as the default value. I increased this from the default value of 16 that OAI uses for its completions endpoint.

As for your problem with setting max_tokens in the request: there is a new feature coming soon to override generation parameter defaults from Tabby's side. You can try it by pulling the formatting branch.
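
Until that feature lands, one client-side stopgap is to wrap the call so requests without an explicit max_tokens still get a larger cap. This is only an illustrative sketch: the helper name and the 1024 default are made up here, not part of tabbyAPI or the openai library.

def create_with_default_max_tokens(client, default_max_tokens=1024, **kwargs):
    # Only fills in max_tokens when the caller did not set it explicitly.
    kwargs.setdefault("max_tokens", default_max_tokens)
    return client.chat.completions.create(**kwargs)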
