-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add distribute add columns by ray #3369
base: main
Are you sure you want to change the base?
Conversation
1812181
to
5500c36
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this great capability!
613bbdd
to
7cd0508
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Will see if @westonpace has time to give a second pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting this started. We'll want more user-facing documentation at some point. Is your goal to expand on this?
python/python/tests/test_ray.py
Outdated
|
||
@pytest.mark.filterwarnings("ignore::DeprecationWarning") | ||
@pytest.mark.skip( | ||
reason="This test local can run, but not in CI." "it's blocked by ray env" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What error do we get? The CI should be able to test Ray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_ray module not found
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we would get that error. I can confirm that Ray tests are running in the latest CI runs:
Can you remove the skip so we can see the failure in the CI and work through it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@westonpace https://github.com/lancedb/lance/actions/runs/12859405923/job/35849807699?pr=3369
How can I log in to this environment to check the running content?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bab57fb
to
827272f
Compare
Hi @Jay-ju can you add some doc about this pr? |
self.partition = partition | ||
|
||
|
||
class CustomTask: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CustomTask
may be hard to understand. How about use RayBaseTask
for these functions?
from .distribute_task import DistributeCustomTasks | ||
|
||
|
||
def custom_inplace( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this file in_place
mean?
.write_lance(tmp_path, schema=schema) | ||
) | ||
lance_ds = LanceDatasource(uri=tmp_path) | ||
add_columns(DistributeCustomTasks(lance_ds), generate_label, ["height"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will align the team on Ray's side later. We need to take a look at the way this code is maintained together.
@SaintBacchus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this should go into a datasink (which allows you do to write_X
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this should go into a datasink (which allows you do to
write_X
)
@richardliaw
Does putting the form of adding columns into the datasink seem to have a bit of semantic conflict?
I would rather have a custom task. This is the PR I originally proposed in the Ray community. Because currently the datasource only supports read/write, but multimodal data like lance would rather be able to support more semantics, such as update, add_column, compaction, and so on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I still hope that sinks can be uniformly maintained in the Ray community because there has been a situation where Lance cannot write through Ray due to interface changes in Ray sinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WE chatted with @westonpace yesterday -- I think starting a discussion on evolving datasinks to take care of state updates of the underlying storage makes sense (as compared to your previous PR which was putting it into the read/datasource API)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I still hope that sinks can be uniformly maintained in the Ray community because there has been a situation where Lance cannot write through Ray due to interface changes in Ray sinks.
We also plan on upstreaming our datasink into Ray soon :)
Then yes, I agree with richardliaw. We can make a datasink (or add flags to the datasink) that sinks data into additional columns instead of creating a brand new dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WE chatted with @westonpace yesterday -- I think starting a discussion on evolving datasinks to take care of state updates of the underlying storage makes sense (as compared to your previous PR which was putting it into the read/datasource API)
@richardliaw How should this be understood? just like this:
@ray.remote(scheduling_strategy=ctx.scheduling_strategy)
class DataSink:
def __init__(self):
self.rows_written = 0
self.enabled = True
# new add columns function ??
def add_columns(self, value_fn: Callable[[pa.RecordBatch], pa.RecordBatch],
read_columns: List[str]) -> None:
xxxx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I still hope that sinks can be uniformly maintained in the Ray community because there has been a situation where Lance cannot write through Ray due to interface changes in Ray sinks.
We also plan on upstreaming our datasink into Ray soon :)
Then yes, I agree with richardliaw. We can make a datasink (or add flags to the datasink) that sinks data into additional columns instead of creating a brand new dataset.
@westonpace
When is this ray lance datasink expected to be merged into ray? I want to see how these codes should be processed.
#3228