Fix max number of tokens for synthetic data generator #170

jackcook · 2025-05-20T19:03:07Z

When using prompt_tokens_max without prompt_tokens_stdev, the sampled prompt length is occasionally one token more than the specified maximum. This can be reproduced as follows:

from guidellm.utils import IntegerRangeSampler

MIN_VALUE = 5
MAX_VALUE = 15

# variance=None mirrors leaving prompt_tokens_stdev unset
irs = IntegerRangeSampler(
    average=(MAX_VALUE - MIN_VALUE) // 2,
    variance=None,
    min_value=MIN_VALUE,
    max_value=MAX_VALUE,
    random_seed=None,
)
it = iter(irs)

for _ in range(10000):
    # Fails intermittently: the sampler can yield MAX_VALUE + 1 (16)
    assert next(it) <= MAX_VALUE

The assertion will fire, despite the max being set to 15. This happens because random.randint, which is used by IntegerRangeSampler, generates numbers up to and including the max value it is given. This PR fixes that.
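
For reference, the inclusive upper bound of random.randint can be checked with the standard library alone. The sketch below does not reproduce the IntegerRangeSampler code or this PR's diff; the `MAX_VALUE + 1` argument is a hypothetical illustration of how an inclusive bound can overshoot by one, and the variable names are made up for the example.

```python
import random

MIN_VALUE = 5
MAX_VALUE = 15

# Seeded only so the sketch is reproducible.
rng = random.Random(0)

# randint(a, b) includes b, so a hypothetical "+ 1" on the upper
# bound lets MAX_VALUE + 1 through.
off_by_one = {rng.randint(MIN_VALUE, MAX_VALUE + 1) for _ in range(10_000)}

# Passing MAX_VALUE directly keeps every sample in [MIN_VALUE, MAX_VALUE].
in_range = {rng.randint(MIN_VALUE, MAX_VALUE) for _ in range(10_000)}

# randrange(a, b) excludes b, so this is the call where "+ 1" is needed.
half_open = {rng.randrange(MIN_VALUE, MAX_VALUE + 1) for _ in range(10_000)}

print(max(off_by_one), max(in_range), max(half_open))
```

Only the first set contains a value above MAX_VALUE; the other two calls stay within the requested range.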

markurtz

Looks great, thanks @jackcook!

Fix max number of tokens for synthetic data generator

e42a834

markurtz approved these changes May 21, 2025

Merge branch 'main' into main

fac78e7

markurtz merged commit b85c6b9 into neuralmagic:main May 21, 2025
9 of 11 checks passed
