Invalid argument: max value for start_offset is 10_000, but got 20000 #5637

Closed
tchaton opened this issue Jan 17, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@tchaton

tchaton commented Jan 17, 2025

Describe the bug
A clear and concise description of what the bug is.

Steps to reproduce (if applicable)
Steps to reproduce the behavior:

  1. Ingest a dataset
  2. Find a query with more than 20k responses

import requests
from time import time

QUERY = "paris"
MAX_HITS = 10000  # page size requested per call
SEARCH_URL = "http://localhost:7280/api/v1/fineweb/search"

t0 = time()
responses = []
session = requests.Session()

# First page: no start_offset.
response = session.post(SEARCH_URL, json={"query": QUERY, "max_hits": MAX_HITS})
data = response.json()
NUM_HITS = data["num_hits"]
responses.extend(data["hits"])
print(len(responses))

# Following pages: start_offset is the number of hits collected so far.
# The request with start_offset=20000 is rejected with the error in the title.
while len(responses) < NUM_HITS:
    response = session.post(
        SEARCH_URL,
        json={"query": QUERY, "max_hits": MAX_HITS, "start_offset": len(responses)},
    )
    data = response.json()
    if "hits" not in data:
        raise RuntimeError(data)
    if not data["hits"]:
        break  # avoid looping forever if a page comes back empty
    responses.extend(data["hits"])
    print(len(responses))

print(len(responses))
print(time() - t0)

I am trying to search through fineweb and I want to collect all the matches. However, this doesn't seem to be possible because start_offset is capped at 10,000.

Expected behavior
A clear and concise description of what you expected to happen.

I want an easy way to collect all the matches. Even better, I just want their ids.

Configuration:
Please provide:

  1. Output of quickwit --version
  2. The index_config.yaml
@tchaton tchaton added the bug label Jan 17, 2025
@trinity-1686a
Contributor

Deep pagination based only on start_offset is inefficient: to fetch 10k docs with a start_offset of 100k, the search has to find the best 110k results and then drop the first 100k. For that reason, we don't support deep pagination that way.
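
To make the cost concrete, here is a minimal sketch of offset-based paging over a plain list of scores. It is only an illustration, not Quickwit's implementation: `page_by_offset` is a hypothetical helper, and `heapq.nlargest` stands in for the engine's top-k collector.

```python
import heapq


def page_by_offset(scores, start_offset, max_hits):
    """Return one page of the best scores, skipping the first `start_offset`."""
    # The top-k selection has to keep start_offset + max_hits candidates,
    # even though only the last max_hits of them are returned.
    top = heapq.nlargest(start_offset + max_hits, scores)
    return top[start_offset:]


# Fetching 10 results at offset 100 still ranks 110 candidates.
page = page_by_offset(list(range(1000)), start_offset=100, max_hits=10)
```

The work grows with start_offset + max_hits on every request, so each deeper page costs more than the previous one.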
Currently there is no alternative in the Quickwit API. If you don't mind using the ES-compatible API instead, we support both search_after and scroll, which don't suffer from that performance degradation (at least on Quickwit; scrolls are deprecated on ES).
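
A rough sketch of what search_after paging could look like for the query in this issue. Several things here are assumptions rather than confirmed details: the endpoint path `/api/v1/_elastic/fineweb/_search`, the Elasticsearch-style request and response shape (`hits.hits[].sort`), and the `sort_field` argument, which has to name a sortable field in your index whose values are distinct enough to act as a cursor. `http_post` and `iter_hits` are hypothetical helper names; check the ES-compatible API docs before relying on any of it.

```python
import json
from urllib import request

# Assumed URL of the ES-compatible search endpoint for the "fineweb" index.
ES_SEARCH_URL = "http://localhost:7280/api/v1/_elastic/fineweb/_search"


def http_post(body, url=ES_SEARCH_URL):
    """Send one JSON search request and return the decoded JSON response."""
    req = request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


def iter_hits(query, sort_field, page_size=1000, post=http_post):
    """Yield every hit for `query`, paging with search_after, not start_offset."""
    search_after = None
    while True:
        body = {
            "query": {"query_string": {"query": query}},
            "size": page_size,
            # Assumption: `sort_field` is sortable in the index.
            "sort": [{sort_field: {"order": "asc"}}],
        }
        if search_after is not None:
            body["search_after"] = search_after
        hits = post(body).get("hits", {}).get("hits", [])
        if not hits:
            return
        yield from hits
        # Resume right after the sort values of the last hit on this page.
        search_after = hits[-1]["sort"]
```

Usage would be along the lines of `for hit in iter_hits("paris", "some_sort_field"): ...`, where `some_sort_field` is a placeholder. Each request only ranks `page_size` documents past the cursor, so the cost per page stays flat however deep you go.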

@tchaton
Author

tchaton commented Jan 22, 2025

Oh interesting. It wasn't documented, or at least I didn't find it in the docs or in the Swagger UI.

@rdettai
Collaborator

rdettai commented Jan 23, 2025

The endpoints mentioned by Trinity are documented here.

@rdettai rdettai closed this as completed Jan 23, 2025