feat(RHOAIENG-29330):Deny RayCluster creation with Ray Version mismatches #881

Draft
wants to merge 1 commit into
base: ray-jobs-feature
73 changes: 73 additions & 0 deletions docs/sphinx/user-docs/cluster-configuration.rst
@@ -44,6 +44,79 @@ requirements for creating the Ray Cluster.
documentation on building a custom image
`here <https://github.com/opendatahub-io/distributed-workloads/tree/main/images/runtime/examples>`__.

Ray Version Compatibility
-------------------------

The CodeFlare SDK requires that the Ray version in your runtime image matches the Ray version used by the SDK itself. When you specify a custom runtime image, the SDK automatically checks the Ray version in the image name against this requirement.

Version Validation Behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The SDK performs the following validation when creating a cluster:

1. **Compatible versions**: If the runtime image contains Ray 2.47.1, the cluster will be created successfully.

2. **Version mismatch**: If the runtime image contains a different Ray version, cluster creation will fail with a detailed error message explaining the mismatch.

3. **Unknown versions**: If the SDK cannot determine the Ray version from the image name (e.g., SHA-based tags), a warning will be issued but cluster creation will continue.

Examples
~~~~~~~~

**Compatible image (recommended)**:

.. code:: python

   # This will work - versions match
   cluster = Cluster(ClusterConfiguration(
       name='ray-example',
       image='quay.io/modh/ray:2.47.1-py311-cu121'
   ))

**Incompatible image (will fail)**:

.. code:: python

   # This will fail with a version mismatch error
   cluster = Cluster(ClusterConfiguration(
       name='ray-example',
       image='ray:2.46.0'  # Different version!
   ))

**SHA-based image (will warn)**:

.. code:: python

   # This will issue a warning but continue
   cluster = Cluster(ClusterConfiguration(
       name='ray-example',
       image='quay.io/modh/ray@sha256:abc123...'
   ))

Best Practices
~~~~~~~~~~~~~~

- **Use versioned tags**: Always use semantic version tags (e.g., ``ray:2.47.1``) rather than ``latest`` or SHA-based tags for better version detection.

- **Test compatibility**: When building custom images, test them with the CodeFlare SDK to ensure compatibility.

- **Check SDK version**: You can check the Ray version used by the SDK with:

  .. code:: python

     from codeflare_sdk.common.utils.constants import RAY_VERSION
     print(f"CodeFlare SDK uses Ray version: {RAY_VERSION}")

**Why is version matching important?**

Ray version mismatches can cause:

- Incompatible API calls between the SDK and Ray cluster
- Unexpected behavior in job submission and cluster management
- Potential data corruption or job failures
- Difficult-to-debug runtime errors


Ray Usage Statistics
--------------------

32 changes: 32 additions & 0 deletions src/codeflare_sdk/common/utils/test_constants.py
@@ -0,0 +1,32 @@
# Copyright 2024 IBM, Red Hat
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import pytest
from codeflare_sdk.common.utils.constants import RAY_VERSION, CUDA_RUNTIME_IMAGE


class TestConstants:
    """Test constants module for expected values."""

    def test_ray_version_is_defined(self):
        """Test that RAY_VERSION constant is properly defined."""
        assert RAY_VERSION is not None
        assert isinstance(RAY_VERSION, str)
        assert RAY_VERSION == "2.47.1"

    def test_cuda_runtime_image_is_defined(self):
        """Test that CUDA_RUNTIME_IMAGE constant is properly defined."""
        assert CUDA_RUNTIME_IMAGE is not None
        assert isinstance(CUDA_RUNTIME_IMAGE, str)
        assert "quay.io/modh/ray" in CUDA_RUNTIME_IMAGE
188 changes: 188 additions & 0 deletions src/codeflare_sdk/common/utils/test_validation.py
@@ -0,0 +1,188 @@
# Copyright 2024 IBM, Red Hat
@kryanbeane (Contributor), Aug 14, 2025

Suggested change:
- # Copyright 2024 IBM, Red Hat
+ # Copyright 2022-2025 IBM, Red Hat

Contributor:

nit ("nit" means nitpick and is just a suggestion, so you don't need to change it): I don't know how licenses work or whether we need the years to be up to date, so I'm not sure this matters at all. Just pointing it out!

@LilyLinh (Contributor Author), Aug 14, 2025

You are right. It should be updated to "2022-2025", as stated here (https://www.apache.org/legal/src-headers.html#3part), since the CodeFlare project launched in 2022 (is that right?). Apparently we can add a pre-commit hook in the YAML config to auto-update the current year: https://github.com/nanoufo/copyright-hook

#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import pytest
from codeflare_sdk.common.utils.validation import (
    extract_ray_version_from_image,
    validate_ray_version_compatibility,
)
from codeflare_sdk.common.utils.constants import RAY_VERSION


class TestRayVersionDetection:
    """Test Ray version detection from container image names."""

    def test_extract_ray_version_standard_format(self):
        """Test extraction from standard Ray image formats."""
        # Standard format
        assert extract_ray_version_from_image("ray:2.47.1") == "2.47.1"
        assert extract_ray_version_from_image("ray:2.46.0") == "2.46.0"
        assert extract_ray_version_from_image("ray:1.13.0") == "1.13.0"

    def test_extract_ray_version_with_registry(self):
        """Test extraction from images with registry prefixes."""
        assert extract_ray_version_from_image("quay.io/ray:2.47.1") == "2.47.1"
        assert (
            extract_ray_version_from_image("docker.io/rayproject/ray:2.47.1")
            == "2.47.1"
        )
        assert (
            extract_ray_version_from_image("gcr.io/my-project/ray:2.47.1") == "2.47.1"
        )

    def test_extract_ray_version_with_suffixes(self):
        """Test extraction from images with version suffixes."""
        assert (
            extract_ray_version_from_image("quay.io/modh/ray:2.47.1-py311-cu121")
            == "2.47.1"
        )
        assert extract_ray_version_from_image("ray:2.47.1-py311") == "2.47.1"
        assert extract_ray_version_from_image("ray:2.47.1-gpu") == "2.47.1"
        assert extract_ray_version_from_image("ray:2.47.1-rocm62") == "2.47.1"

    def test_extract_ray_version_complex_registry_paths(self):
        """Test extraction from complex registry paths."""
        assert (
            extract_ray_version_from_image("quay.io/modh/ray:2.47.1-py311-cu121")
            == "2.47.1"
        )
        assert (
            extract_ray_version_from_image("registry.company.com/team/ray:2.47.1")
            == "2.47.1"
        )

    def test_extract_ray_version_no_version_found(self):
        """Test cases where no version can be extracted."""
        # SHA-based tags
        assert (
            extract_ray_version_from_image(
                "quay.io/modh/ray@sha256:6d076aeb38ab3c34a6a2ef0f58dc667089aa15826fa08a73273c629333e12f1e"
            )
            is None
        )

        # Non-semantic versions
        assert extract_ray_version_from_image("ray:latest") is None
        assert extract_ray_version_from_image("ray:nightly") is None
        assert (
            extract_ray_version_from_image("ray:v2.47") is None
        )  # Missing patch version

        # Non-Ray images
        assert extract_ray_version_from_image("python:3.11") is None
        assert extract_ray_version_from_image("ubuntu:20.04") is None

        # Empty or None
        assert extract_ray_version_from_image("") is None
        assert extract_ray_version_from_image(None) is None

    def test_extract_ray_version_edge_cases(self):
        """Test edge cases for version extraction."""
        # Version with 'v' prefix should not match our pattern
        assert extract_ray_version_from_image("ray:v2.47.1") is None

        # Multiple version-like patterns - should match the first valid one
        assert (
            extract_ray_version_from_image("registry/ray:2.47.1-based-on-1.0.0")
            == "2.47.1"
        )


class TestRayVersionValidation:
    """Test Ray version compatibility validation."""

    def test_validate_compatible_versions(self):
        """Test validation with compatible Ray versions."""
        # Exact match
        is_compatible, message = validate_ray_version_compatibility(
            f"ray:{RAY_VERSION}"
        )
        assert is_compatible is True
        assert "Ray versions match" in message

        # With registry and suffixes
        is_compatible, message = validate_ray_version_compatibility(
            f"quay.io/modh/ray:{RAY_VERSION}-py311-cu121"
        )
        assert is_compatible is True
        assert "Ray versions match" in message

    def test_validate_incompatible_versions(self):
        """Test validation with incompatible Ray versions."""
        # Different version
        is_compatible, message = validate_ray_version_compatibility("ray:2.46.0")
        assert is_compatible is False
        assert "Ray version mismatch detected" in message
        assert "CodeFlare SDK uses Ray" in message
        assert "runtime image uses Ray" in message

        # Older version
        is_compatible, message = validate_ray_version_compatibility("ray:1.13.0")
        assert is_compatible is False
        assert "Ray version mismatch detected" in message

    def test_validate_empty_image(self):
        """Test validation with no custom image (should use default)."""
        # Empty string
        is_compatible, message = validate_ray_version_compatibility("")
        assert is_compatible is True
        assert "Using default Ray image compatible with SDK" in message

        # None
        is_compatible, message = validate_ray_version_compatibility(None)
        assert is_compatible is True
        assert "Using default Ray image compatible with SDK" in message

    def test_validate_unknown_version(self):
        """Test validation when version cannot be determined."""
        # SHA-based image
        is_compatible, message = validate_ray_version_compatibility(
            "quay.io/modh/ray@sha256:6d076aeb38ab3c34a6a2ef0f58dc667089aa15826fa08a73273c629333e12f1e"
        )
        assert is_compatible is True
        assert "Warning: Cannot determine Ray version" in message

        # Custom image without version
        is_compatible, message = validate_ray_version_compatibility(
            "my-custom-ray:latest"
        )
        assert is_compatible is True
        assert "Warning: Cannot determine Ray version" in message

    def test_validate_custom_sdk_version(self):
        """Test validation with custom SDK version."""
        # Compatible with custom SDK version
        is_compatible, message = validate_ray_version_compatibility(
            "ray:2.46.0", "2.46.0"
        )
        assert is_compatible is True
        assert "Ray versions match" in message

        # Incompatible with custom SDK version
        is_compatible, message = validate_ray_version_compatibility(
            "ray:2.47.1", "2.46.0"
        )
        assert is_compatible is False
        assert "CodeFlare SDK uses Ray 2.46.0" in message
        assert "runtime image uses Ray 2.47.1" in message

    def test_validate_message_content(self):
        """Test that validation messages contain expected guidance."""
        # Mismatch message should contain helpful guidance
        is_compatible, message = validate_ray_version_compatibility("ray:2.46.0")
        assert is_compatible is False
        assert "compatibility issues" in message.lower()
        assert "unexpected behavior" in message.lower()
        assert "please use a runtime image" in message.lower()
        assert "update your sdk version" in message.lower()
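
The tests above pin down the observable behavior of `extract_ray_version_from_image` and `validate_ray_version_compatibility`, but the diff shown here does not include `validation.py` itself. The following is a minimal sketch of one implementation that would satisfy those assertions, assuming a regular expression over the image tag. The pattern, the full message wording beyond the asserted substrings, and the local `RAY_VERSION` stand-in are assumptions for illustration, not the PR's actual code.

```python
import re
from typing import Optional, Tuple

# Stand-in for codeflare_sdk.common.utils.constants.RAY_VERSION (assumed value).
RAY_VERSION = "2.47.1"

# Hypothetical pattern: a repository named exactly "ray", a colon, then
# MAJOR.MINOR.PATCH. A leading "v" or a missing patch number does not match.
_RAY_TAG_PATTERN = re.compile(r"(?:^|/)ray:(\d+\.\d+\.\d+)")


def extract_ray_version_from_image(image_name: Optional[str]) -> Optional[str]:
    """Return the Ray version found in the image tag, or None if undetectable."""
    if not image_name:
        return None
    match = _RAY_TAG_PATTERN.search(image_name)
    return match.group(1) if match else None


def validate_ray_version_compatibility(
    image_name: Optional[str], sdk_ray_version: str = RAY_VERSION
) -> Tuple[bool, str]:
    """Return (is_compatible, message) for a runtime image name."""
    if not image_name:
        # No custom image: the SDK default is assumed to match.
        return True, "Using default Ray image compatible with SDK"

    image_version = extract_ray_version_from_image(image_name)
    if image_version is None:
        # SHA digests, "latest", and similar tags carry no version information,
        # so the caller is expected to warn and continue.
        return True, (
            f"Warning: Cannot determine Ray version from image '{image_name}'."
        )

    if image_version != sdk_ray_version:
        return False, (
            f"Ray version mismatch detected: CodeFlare SDK uses Ray "
            f"{sdk_ray_version}, but runtime image uses Ray {image_version}. "
            "This can cause compatibility issues and unexpected behavior. "
            "Please use a runtime image with a matching Ray version or "
            "update your SDK version."
        )

    return True, f"Ray versions match: {image_version}"
```

Returning a `(bool, message)` tuple rather than raising keeps the policy decision (fail on mismatch, warn on unknown) with the caller, which matches the three outcomes described in the documentation change.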