Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(weave): download dataset with images in rows in parallel #3939

Merged
merged 5 commits into from
Mar 24, 2025

Conversation

gtarpenning
Copy link
Member

@gtarpenning gtarpenning commented Mar 24, 2025

Description

WB-23955

Use the future executor to parallelize row requests. The only issue I can see is when the queue is full, but I think in that case this executes exactly as fast as it currently does.

This pr:

  • utilizes the async_batch_processor to handle row download requests, massively speeding up row downloads when rows have objects in them (like images).

Testing

Before After
29.7s 2.24s
Screenshot 2025-03-24 at 10 13 51 AM Screenshot 2025-03-24 at 10 09 59 AM

Image dataset test (run eval against downloaded dataset):

size prod eval time (s) branch eval time (s)
10 1.92s 0.66s
100 14.66s 0.96s
1000 175.16s 20.16s

@circle-job-mirror
Copy link

circle-job-mirror bot commented Mar 24, 2025

@gtarpenning gtarpenning marked this pull request as ready for review March 24, 2025 19:19
@gtarpenning gtarpenning requested a review from a team as a code owner March 24, 2025 19:19
@gtarpenning gtarpenning changed the title perf(weave): download dataset rows in parallel perf(weave): download dataset with images in rows in parallel Mar 24, 2025
Comment on lines +514 to +515
for future in futures:
yield future.result()
Copy link
Collaborator

@andrewtruong andrewtruong Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to make sure this is expected:

As written, if you raise anywhere during iteration, no further futures will be returned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in practice table_ref should be defined here. i'll move that check outside the process row fn.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, there already is that check at the top. ill make this less aggro but just returning none?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point here is that if any of the future.result() raises an Exception, then none of the other futures will be yielded. Is that what you intend?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. I think that is what we want right? During an eval, if anything fails I believe we fail the entire thing. Additionally, all i'm doing here is a thin wrapper over the existing functionality that downloaded in the main thread, which definitely would have errored. So I think this is safe. If we in the future decided to allow partial-evals, or "resuming" evals, or fixing errored traces in evals, we might have to rethink this. Unless you can see an obvious alternative?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, just want to make sure you thought about it. That makes sense to me :)

@gtarpenning gtarpenning merged commit c1fa204 into master Mar 24, 2025
132 of 133 checks passed
@gtarpenning gtarpenning deleted the griffin/populate-cache-on-dataset-create branch March 24, 2025 20:09
@github-actions github-actions bot locked and limited conversation to collaborators Mar 24, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants