Skip to content

feat(RHOAIENG-29330):Deny RayCluster creation with Ray Version mismatches #881

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: ray-jobs-feature
Choose a base branch
from

Conversation

LilyLinh
Copy link
Contributor

Issue link

Jira

What changes have been made

This PR implements Ray version validation to prevent compatibility issues between the CodeFlare SDK and user-specified runtime images.

Solution: Added automatic Ray version detection and validation during cluster configuration:

  • Version Detection: New utility function extract_ray_version_from_image() parses Ray versions from various image name formats (e.g., ray:2.47.1, quay.io/modh/ray:2.47.1-py311-cu121)
  • Validation Logic: validate_ray_version_compatibility() compares runtime image Ray version against SDK version (2.47.1)
  • Integration: Validation automatically runs during ClusterConfiguration.__post_init__()
  • User Experience: Clear error messages guide users to fix version mismatches; warnings for undetectable versions
  • Documentation: Updated cluster configuration docs with Ray version requirements and best practices

Verification steps

Build the SDK, test its functionality in a notebook.

  • In Cluster create and configure step, when using runtime image with correct Ray version image="quay.io/modh/ray:2.47.1-py311-cu121", the cluster is created and up successfully; when using a different Ray version ( e.g., image="quay.io/modh/ray:2.46.1-py311-cu121"), it should show an error message.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 13, 2025
@LilyLinh LilyLinh requested a review from kryanbeane August 13, 2025 09:48
Copy link

codecov bot commented Aug 13, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.76%. Comparing base (5a77f7b) to head (dc47dc9).

Additional details and impacted files
@@                 Coverage Diff                  @@
##           ray-jobs-feature     #881      +/-   ##
====================================================
+ Coverage             93.65%   93.76%   +0.10%     
====================================================
  Files                    20       21       +1     
  Lines                  1717     1747      +30     
====================================================
+ Hits                   1608     1638      +30     
  Misses                  109      109              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@LilyLinh LilyLinh changed the base branch from main to ray-jobs-feature August 13, 2025 10:01
Copy link
Contributor

@kryanbeane kryanbeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only small changes but this is great work @LilyLinh well done :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say we can get rid of this file. It's good practice to test as much as possible, but if we hardcode Ray version like on line 26 assert RAY_VERSION == "2.47.1" this will fail the moment we bump the ray version so I think we don't need to include this test file. But keep up the good practice of testing everything!

@@ -0,0 +1,188 @@
# Copyright 2024 IBM, Red Hat
Copy link
Contributor

@kryanbeane kryanbeane Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Copyright 2024 IBM, Red Hat
# Copyright 2022-2025 IBM, Red Hat

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: (nit means nit pick, and is just a suggestion so you don't need to change it etc) - i don't know how lisences work or if we need the years to be up to date so not sure if this matters at all, just pointing it out!

Copy link
Contributor Author

@LilyLinh LilyLinh Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. It should be update to "2022-2025" as stated here (https://www.apache.org/legal/src-headers.html#3part) and the codeflare project launched on 2022? It is said that we can do pre-commit hook on yml file to auto update the current year https://github.com/nanoufo/copyright-hook

@@ -289,3 +291,15 @@ def check_type(value, expected_type):
return isinstance(value, expected_type)

return check_type(value, expected_type)

def _validate_ray_version_compatibility(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will need to be moved to the ray jobs work instead of ray cluster. You can just copy and paste this as the function itself looks good just in the wrong place! Btw you can use this https://github.com/kryanbeane/codeflare-sdk/blob/5b6c937be4f2b177ed609ee040b8a67ece093523/demo-notebooks/guided-demos/5_rayjob_lifecycled_cluster.ipynb to for easier testing once you add it to the Ray job work just ensure you rebuild the .whl file and reinstall

@@ -247,3 +249,114 @@ def test_ray_usage_stats_enabled(mocker):
def test_cleanup():
os.remove(f"{aw_dir}test-all-params.yaml")
os.remove(f"{aw_dir}aw-all-params.yaml")


def test_ray_version_validation_compatible_image(mocker):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again this needs to be moved. the tests themselves look good, just in the wrong place and might need slight refactoring to make it more rayjob focused. (like some of the mocker.patch lines will need changing etc) MSG if you want a hand!

Copy link
Contributor

openshift-ci bot commented Aug 14, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from kryanbeane. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants