perf(weave): download dataset with images in rows in parallel #3939
Preview this PR with FeatureBee: https://beta.wandb.ai/?betaVersion=c2396195c153d89d6857bef2a5a2aec4dabced36
for future in futures:
    yield future.result()
Just want to make sure this is expected:
As written, if you raise anywhere during iteration, no further futures will be returned.
I think in practice table_ref should be defined here. I'll move that check outside the process row fn.
Oh, there already is that check at the top. I'll make this less aggro by just returning None?
My point here is that if any of the future.result() calls raises an exception, then none of the other futures will be yielded. Is that what you intend?
Hmm, I think that is what we want, right? During an eval, if anything fails I believe we fail the entire thing. Additionally, all I'm doing here is adding a thin wrapper over the existing functionality that downloaded in the main thread, which definitely would have errored. So I think this is safe. If in the future we decide to allow partial evals, "resuming" evals, or fixing errored traces in evals, we might have to rethink this. Unless you can see an obvious alternative?
ok, just want to make sure you thought about it. That makes sense to me :)
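For reference, the fail-fast behavior discussed in this thread can be reproduced with a small standalone sketch. This is not the code from the PR: `process_row` and `iter_rows` are hypothetical stand-ins, and the failure on row 2 is forced purely for illustration. It assumes the standard library's `concurrent.futures.ThreadPoolExecutor`.

```python
from concurrent.futures import ThreadPoolExecutor


def process_row(i):
    # Hypothetical stand-in for the per-row download; row 2 fails on purpose.
    if i == 2:
        raise ValueError(f"row {i} failed")
    return i * 10


def iter_rows(n):
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(process_row, i) for i in range(n)]
        for future in futures:
            # result() re-raises the worker's exception here, which ends
            # the generator before any later future is yielded.
            yield future.result()


received = []
error = None
try:
    for value in iter_rows(5):
        received.append(value)
except ValueError as exc:
    error = exc
```

Rows 3 and 4 may well have completed in their worker threads, but the consumer never sees them, which matches the "fail the entire thing" behavior described above.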
Description
WB-23955
Use the future executor to parallelize row requests. The only issue I can see is when the queue is full, but I think in that case this executes exactly as fast as it currently does.
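A minimal sketch of the general approach, assuming the "future executor" refers to something like `concurrent.futures.ThreadPoolExecutor`. The names `fetch_row` and `iter_rows_parallel`, the worker count, and the simulated latency are all hypothetical and are not taken from the PR.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_row(index):
    # Hypothetical stand-in for a network request that downloads one row.
    time.sleep(0.05)
    return {"index": index}


def iter_rows_parallel(num_rows, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit every row up front so the requests overlap in time.
        futures = [executor.submit(fetch_row, i) for i in range(num_rows)]
        # Iterate in submission order so rows come back in dataset order,
        # even if later requests happen to finish first.
        for future in futures:
            yield future.result()


rows = list(iter_rows_parallel(16))
```

Iterating the futures in submission order, rather than with `as_completed`, keeps the yielded rows in the same order as a sequential download would produce.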
This pr:
Testing
29.7s
2.24s
Image dataset test (run eval against downloaded dataset):
1.92s
0.66s
14.66s
0.96s
175.16s
20.16s