perf(weave): download dataset with images in rows in parallel #3939

gtarpenning · 2025-03-24T17:12:38Z

Description

Use the future executor to parallelize row requests. The only potential issue I can see is a full queue, but I think in that case this runs exactly as fast as the current code does.

This PR uses the async_batch_processor to handle row download requests, massively speeding up row downloads when rows contain objects (like images).
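For readers who want a feel for the approach, here is a minimal sketch of fetching rows in parallel while still yielding them in their original order. It is not the code in this PR: it uses the standard library `ThreadPoolExecutor` as a stand-in for Weave's future executor, and `iter_rows_parallel`, `fetch_row`, and `max_workers` are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def iter_rows_parallel(
    items: Iterable[T],
    fetch_row: Callable[[T], R],  # hypothetical per-row download function
    max_workers: int = 8,  # assumed pool size, not taken from the PR
) -> Iterator[R]:
    """Submit every row fetch up front, then yield results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # All fetches start running in the background as soon as they are submitted.
        futures = [executor.submit(fetch_row, item) for item in items]
        # Iterating the list (rather than as_completed) keeps the row order stable.
        for future in futures:
            yield future.result()
```

Because the futures are consumed in submission order, a slow early row delays the rows behind it from being yielded, but their downloads still overlap in the background.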

Testing

| Before | After |
| --- | --- |
| `29.7s` | `2.24s` |

Image dataset test (run eval against downloaded dataset):

| size | prod eval time (s) | branch eval time (s) |
| --- | --- | --- |
| 10 | `1.92s` | `0.66s` |
| 100 | `14.66s` | `0.96s` |
| 1000 | `175.16s` | `20.16s` |

circle-job-mirror · 2025-03-24T17:13:10Z

Preview this PR with FeatureBee: https://beta.wandb.ai/?betaVersion=c2396195c153d89d6857bef2a5a2aec4dabced36

weave/trace/vals.py

andrewtruong · 2025-03-24T19:38:10Z

+            for future in futures:
+                yield future.result()


Just want to make sure this is expected:

As written, if you raise anywhere during iteration, no further futures will be returned.

I think in practice table_ref should be defined here. I'll move that check outside the process-row function.

Oh, there is already that check at the top. I'll make this less aggro by just returning None?

My point here is that if any of the future.result() calls raises an exception, then none of the remaining futures will be yielded. Is that what you intend?

Hmm, I think that is what we want, right? During an eval, if anything fails I believe we fail the entire thing. Additionally, all I'm doing here is adding a thin wrapper over the existing functionality, which downloaded in the main thread and definitely would have errored. So I think this is safe. If we decide in the future to allow partial evals, "resuming" evals, or fixing errored traces in evals, we might have to rethink this. Unless you can see an obvious alternative?

OK, just wanted to make sure you thought about it. That makes sense to me :)
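The behaviour discussed in this thread can be reproduced with a small standalone sketch. It assumes a plain `ThreadPoolExecutor` rather than Weave's executor, and `load`, `yield_results`, and `collect` are hypothetical helpers, not functions from `weave/trace/vals.py`.

```python
from concurrent.futures import ThreadPoolExecutor


def load(i):
    """Hypothetical row loader that fails for one specific row."""
    if i == 2:
        raise ValueError(f"row {i} failed")
    return i


def yield_results(n):
    """Same loop shape as the diff: yield each future's result in order."""
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(load, i) for i in range(n)]
        for future in futures:
            # result() re-raises the worker's exception, which ends the generator.
            yield future.result()


def collect(n):
    """Consume the generator and record what arrived before any failure."""
    seen = []
    error = None
    try:
        for row in yield_results(n):
            seen.append(row)
    except ValueError as exc:
        error = exc
    return seen, error


seen, error = collect(5)
print(seen)   # [0, 1]
print(error)  # row 2 failed
```

Rows 3 and 4 are still loaded by the pool, but they are never yielded because the generator stops at the first exception. That matches the fail-the-whole-eval behaviour agreed on above; supporting partial evals would mean catching the exception per future instead.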

perf(weave): download dataset rows in parallel

d6fda04

test

e37bdbc

andrewtruong reviewed Mar 24, 2025

require

d4f143f

gtarpenning marked this pull request as ready for review March 24, 2025 19:19

gtarpenning requested a review from a team as a code owner March 24, 2025 19:19

gtarpenning requested a review from andrewtruong March 24, 2025 19:19

gtarpenning changed the title ~~perf(weave): download dataset rows in parallel~~ perf(weave): download dataset with images in rows in parallel Mar 24, 2025

Merge branch 'master' into griffin/populate-cache-on-dataset-create

cbbbf1c

andrewtruong reviewed Mar 24, 2025

less aggro

9eaccd6

andrewtruong approved these changes Mar 24, 2025

gtarpenning merged commit c1fa204 into master Mar 24, 2025
132 of 133 checks passed

gtarpenning deleted the griffin/populate-cache-on-dataset-create branch March 24, 2025 20:09

github-actions bot locked and limited conversation to collaborators Mar 24, 2025
