feat: add distribute add columns by ray #3369

Jay-ju · 2025-01-11T10:54:34Z

#3228

LuQQiu

Thanks for adding this great capability!

python/python/lance/ray/distribute_task.py

LuQQiu

Looks good to me! Will see if @westonpace has time to give a second pass

westonpace

Thanks for getting this started. We'll want more user-facing documentation at some point. Is your goal to expand on this?

westonpace · 2025-01-14T21:47:30Z

python/python/tests/test_ray.py

+
+@pytest.mark.filterwarnings("ignore::DeprecationWarning")
+@pytest.mark.skip(
+    reason="This test local can run, but not in CI." "it's blocked by ray env"


What error do we get? The CI should be able to test Ray.

The error is that the test_ray module is not found.

I'm not sure why we would get that error. I can confirm that Ray tests are running in the latest CI runs.

Can you remove the skip so we can see the failure in the CI and work through it?

@westonpace https://github.com/lancedb/lance/actions/runs/12859405923/job/35849807699?pr=3369
How can I log in to this environment to inspect what is running?

python/python/tests/test_ray.py

python/python/lance/ray/in_place_api.py

SaintBacchus · 2025-01-20T03:01:22Z

Hi @Jay-ju, can you add some documentation for this PR?

SaintBacchus · 2025-01-20T03:07:28Z

python/python/lance/ray/distribute_task.py

+        self.partition = partition
+
+
+class CustomTask:


CustomTask may be hard to understand. How about using RayBaseTask for these functions?

SaintBacchus · 2025-01-20T03:09:04Z

python/python/lance/ray/in_place_api.py

+from .distribute_task import DistributeCustomTasks
+
+
+def custom_inplace(


What does in_place mean in this file's name?

SaintBacchus · 2025-01-20T03:17:55Z

python/python/tests/test_ray.py

+        .write_lance(tmp_path, schema=schema)
+    )
+    lance_ds = LanceDatasource(uri=tmp_path)
+    add_columns(DistributeCustomTasks(lance_ds), generate_label, ["height"])


Can we register the add_columns function on the Lance datasource, like write_lance?

@SaintBacchus
I will align with the team on the Ray side later. We need to work out together how this code should be maintained.

Yeah, this should go into a datasink (which allows you to do write_X).

> Yeah, this should go into a datasink (which allows you to do write_X).

@richardliaw
Doesn't putting add-columns into the datasink create a bit of a semantic conflict?
I would rather have a custom task, which is what I originally proposed in my PR to the Ray community. Currently the datasource only supports read/write, but a multimodal data format like Lance would benefit from more operations, such as update, add_column, compaction, and so on.
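
To make the custom-task idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not this PR's implementation: `ThreadPoolExecutor` stands in for Ray tasks, a dict of lists stands in for `pa.RecordBatch`, and `add_columns_to_fragment` and this `add_columns` signature are hypothetical. Only the shape of the call (a value function plus the columns it reads, as in `add_columns(..., generate_label, ["height"])` from the test) is taken from the PR.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Stand-in for pa.RecordBatch (assumption): column name -> list of values.
Batch = Dict[str, List]


def generate_label(batch: Batch) -> Batch:
    """Derive a new column from "height"; returns only the new column."""
    return {"label": ["tall" if h >= 180 else "short" for h in batch["height"]]}


def add_columns_to_fragment(
    fragment: Batch,
    value_fn: Callable[[Batch], Batch],
    read_columns: List[str],
) -> Batch:
    """Hypothetical per-fragment task: project read_columns, then run value_fn."""
    projected = {name: fragment[name] for name in read_columns}
    new_cols = value_fn(projected)
    num_rows = len(next(iter(fragment.values())))
    for name, values in new_cols.items():
        if len(values) != num_rows:
            raise ValueError(
                f"column {name!r} has {len(values)} rows, expected {num_rows}"
            )
    return new_cols


def add_columns(
    fragments: List[Batch],
    value_fn: Callable[[Batch], Batch],
    read_columns: List[str],
) -> List[Batch]:
    """Run one task per fragment, then attach the results in a single step."""
    with ThreadPoolExecutor() as pool:
        results = list(
            pool.map(
                lambda f: add_columns_to_fragment(f, value_fn, read_columns),
                fragments,
            )
        )
    # New columns are attached only after every per-fragment task succeeded;
    # the input fragments themselves are left unmodified.
    return [{**frag, **new} for frag, new in zip(fragments, results)]


fragments = [
    {"id": [1, 2], "height": [170, 185]},
    {"id": [3], "height": [192]},
]
updated = add_columns(fragments, generate_label, ["height"])
```

The point of the sketch is the split between independent per-fragment work, which can be distributed, and one final step that makes the new column visible, which is the part that does not map cleanly onto a read or a write.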

However, I still hope that sinks can be uniformly maintained in the Ray community because there has been a situation where Lance cannot write through Ray due to interface changes in Ray sinks.

We chatted with @westonpace yesterday -- I think starting a discussion on evolving datasinks to take care of state updates of the underlying storage makes sense (as compared to your previous PR, which put it into the read/datasource API).

> However, I still hope that sinks can be uniformly maintained in the Ray community because there has been a situation where Lance cannot write through Ray due to interface changes in Ray sinks.

We also plan on upstreaming our datasink into Ray soon :)

Then yes, I agree with richardliaw. We can make a datasink (or add flags to the datasink) that sinks data into additional columns instead of creating a brand new dataset.
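
As a rough illustration of the "flag on the datasink" idea, the sketch below uses a hypothetical in-memory `ToySink` with a `mode` argument. It is neither Ray's datasink interface nor Lance's sink; the class name, the `mode` values, and the dict-of-lists batch type are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Stand-in for pa.RecordBatch (assumption): column name -> list of values.
Batch = Dict[str, List]


class ToySink:
    """Hypothetical in-memory sink with a mode flag (illustration only)."""

    def __init__(self, existing: Optional[Batch] = None, mode: str = "create"):
        if mode not in ("create", "add_columns"):
            raise ValueError(f"unknown mode: {mode!r}")
        if mode == "add_columns" and not existing:
            raise ValueError("add_columns mode needs an existing dataset")
        self.mode = mode
        self.data: Batch = dict(existing or {})
        self.rows_written = 0

    def write(self, batch: Batch) -> None:
        num_rows = len(next(iter(batch.values())))
        if self.mode == "create":
            # Replace whatever was there with a brand-new dataset.
            self.data = dict(batch)
        else:
            # Sink the batch into additional columns of the existing data.
            clash = set(batch) & set(self.data)
            if clash:
                raise ValueError(f"columns already exist: {sorted(clash)}")
            existing_rows = len(next(iter(self.data.values())))
            if num_rows != existing_rows:
                raise ValueError("new columns must match the existing row count")
            self.data.update(batch)
        self.rows_written += num_rows


sink = ToySink(existing={"height": [170, 185]}, mode="add_columns")
sink.write({"label": ["short", "tall"]})
```

In this toy version the same `write` entry point either replaces the dataset or extends it, and the extra checks in `add_columns` mode (no column clash, matching row count) are the kind of constraints a real sink would have to enforce against the stored data.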

> We chatted with @westonpace yesterday -- I think starting a discussion on evolving datasinks to take care of state updates of the underlying storage makes sense (as compared to your previous PR, which put it into the read/datasource API).

@richardliaw How should this be understood? Something like this:

    @ray.remote(scheduling_strategy=ctx.scheduling_strategy)
    class DataSink:
        def __init__(self):
            self.rows_written = 0
            self.enabled = True

        # new add columns function ??
        def add_columns(self, value_fn: Callable[[pa.RecordBatch], pa.RecordBatch], read_columns: List[str]) -> None:
            xxxx

> However, I still hope that sinks can be uniformly maintained in the Ray community because there has been a situation where Lance cannot write through Ray due to interface changes in Ray sinks.
>
> We also plan on upstreaming our datasink into Ray soon :)
>
> Then yes, I agree with richardliaw. We can make a datasink (or add flags to the datasink) that sinks data into additional columns instead of creating a brand new dataset.

@westonpace
When is the Ray Lance datasink expected to be merged into Ray? I would like to see how this code should be handled.

LuQQiu · 2025-01-24T00:37:16Z

error_linux_39_x86.log

github-actions bot added the enhancement and python labels on Jan 11, 2025

Jay-ju force-pushed the merge_column_distribute_ray branch 19 times, most recently from 1812181 to 5500c36, on January 13, 2025 11:33

LuQQiu reviewed Jan 14, 2025

Jay-ju force-pushed the merge_column_distribute_ray branch 3 times, most recently from 613bbdd to 7cd0508, on January 14, 2025 03:32

LuQQiu requested a review from westonpace January 14, 2025 18:42

LuQQiu approved these changes Jan 14, 2025

westonpace approved these changes Jan 14, 2025

feat: add distribute add columns by ray

827272f

Jay-ju force-pushed the merge_column_distribute_ray branch from bab57fb to 827272f on January 15, 2025 01:45

feat: add distribute add columns by ray

ae0ab26

SaintBacchus reviewed Jan 20, 2025

Jay-ju commented Jan 11, 2025 (edited)

LuQQiu left a comment (edited)

LuQQiu left a comment

westonpace left a comment

westonpace Jan 14, 2025

Jay-ju Jan 15, 2025

westonpace Jan 15, 2025

Jay-ju Jan 20, 2025 (edited)

Jay-ju Jan 20, 2025

SaintBacchus commented Jan 20, 2025

SaintBacchus Jan 20, 2025

SaintBacchus Jan 20, 2025

SaintBacchus Jan 20, 2025

Jay-ju Jan 20, 2025

richardliaw Jan 22, 2025 (edited)

Jay-ju Jan 23, 2025 (edited)

Jay-ju Jan 23, 2025

richardliaw Jan 23, 2025

westonpace Jan 23, 2025

Jay-ju Jan 31, 2025 (edited)

Jay-ju Jan 31, 2025 (edited)

LuQQiu commented Jan 24, 2025
