setup instructions for RHOAI 2.20 #185

Merged (1 commit) on May 19, 2025

7 changes: 6 additions & 1 deletion SETUP.md
@@ -45,10 +45,15 @@ Instructions are provided for the following Red Hat OpenShift AI ***stable*** releases:
+ [RHOAI 2.16 Uninstall](./setup.RHOAI-v2.16/UNINSTALL.md)

Instructions are provided for the following Red Hat OpenShift AI ***fast*** releases:
+ Red Hat OpenShift AI 2.20
+ [RHOAI 2.20 Cluster Setup](./setup.RHOAI-v2.20/CLUSTER-SETUP.md)
+ [RHOAI 2.20 Team Setup](./setup.RHOAI-v2.20/TEAM-SETUP.md)
+ [UPGRADING from RHOAI 2.19](./setup.RHOAI-v2.20/UPGRADE-FAST.md)
+ [RHOAI 2.20 Uninstall](./setup.RHOAI-v2.20/UNINSTALL.md)
+ Red Hat OpenShift AI 2.19
+ [RHOAI 2.19 Cluster Setup](./setup.RHOAI-v2.19/CLUSTER-SETUP.md)
+ [RHOAI 2.19 Team Setup](./setup.RHOAI-v2.19/TEAM-SETUP.md)
+ [UPGRADING from RHOAI 2.18](./setup.RHOAI-v2.19/UPGRADE.md)
+ [UPGRADING from RHOAI 2.18](./setup.RHOAI-v2.19/UPGRADE-FAST.md)
+ [RHOAI 2.19 Uninstall](./setup.RHOAI-v2.19/UNINSTALL.md)

## Kubernetes
6 changes: 2 additions & 4 deletions setup.RHOAI-v2.16/UPGRADE-FAST.md
@@ -25,10 +25,8 @@ First, update the MLBatch modifications to the default RHOAI configuration maps.
oc apply -f setup.RHOAI-v2.16/mlbatch-upgrade-configmaps.yaml
```

There are no MLBatch modifications to the default RHOAI configuration maps
beyond those already made in previous installs. Therefore, you can simply
approve the install plan replacing the example plan name below with the actual
value on your cluster:
Next, you can approve the install plan, replacing the example plan name
below with the actual value on your cluster:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl
```
6 changes: 3 additions & 3 deletions setup.RHOAI-v2.19/UPGRADE-FAST.md
@@ -1,4 +1,4 @@
# Upgrading from RHOAI 2.19
# Upgrading from RHOAI 2.18

These instructions assume you installed and configured RHOAI 2.18 following
the MLBatch [install instructions for RHOAI-v2.18](../setup.RHOAI-v2.18/CLUSTER-SETUP.md)
@@ -14,8 +14,8 @@ oc get ip -n redhat-ods-operator
Typical output would be:
```sh
NAME CSV APPROVAL APPROVED
install-kpzzl rhods-operator.2.18.0 Manual false
install-nqrbp rhods-operator.2.19.0 Manual true
install-kpzzl rhods-operator.2.19.0 Manual false
install-nqrbp rhods-operator.2.18.0 Manual true
```

Before approving the upgrade, you must manually remove v1alpha1 MultiKueue CRD's
4 changes: 2 additions & 2 deletions setup.RHOAI-v2.19/UPGRADE-STABLE.md
@@ -15,8 +15,8 @@ oc get ip -n redhat-ods-operator
Typical output would be:
```sh
NAME CSV APPROVAL APPROVED
install-kpzzl rhods-operator.2.16.0 Manual false
install-nqrbp rhods-operator.2.19.0 Manual true
install-kpzzl rhods-operator.2.19.0 Manual false
install-nqrbp rhods-operator.2.16.0 Manual true
```

Assuming the install plan exists, you can begin the upgrade process.
171 changes: 171 additions & 0 deletions setup.RHOAI-v2.20/CLUSTER-SETUP.md
@@ -0,0 +1,171 @@
# Cluster Setup

The cluster setup installs Red Hat OpenShift AI and configures Scheduler Plugins, Kueue,
cluster roles, and priority classes.

## Priorities

Create `default-priority`, `high-priority`, and `low-priority` priority classes:
```sh
oc apply -f setup.RHOAI-v2.20/mlbatch-priorities.yaml
```

## Scheduler Configuration

MLBatch configures Kubernetes scheduling to accomplish two objectives:
+ Obtaining gang (all or nothing) scheduling for multi-Pod workloads.
+ Packing Pods whose GPU request is less than the number of GPUs on a Node to
maximize the number of Nodes available for Pods that request all the GPUs on a Node.

This is done by installing the Coscheduling out-of-tree scheduler plugin and configuring
the default NodeResourcesFit scheduler plugin to pack in the GPU dimension.


```sh
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
--set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"},{"args":{"permitWaitingTimeSeconds":300},"name":"Coscheduling"}]'
```
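
The `pluginConfig` JSON passed above is equivalent to the following YAML, shown here only for readability; the Helm command remains the source of truth:
```yaml
- name: NodeResourcesFit
  args:
    scoringStrategy:
      type: RequestedToCapacityRatio
      resources:
      - name: nvidia.com/gpu
        weight: 1
      requestedToCapacityRatio:
        shape:
        - utilization: 0
          score: 0
        - utilization: 100
          score: 10
- name: Coscheduling
  args:
    permitWaitingTimeSeconds: 300
```
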
Patch scheduler-plugins pod priorities:
```sh
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.20/scheduler-priority-patch.yaml scheduler-plugins-controller
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.20/scheduler-priority-patch.yaml scheduler-plugins-scheduler
```
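
A quick way to confirm that both deployments restarted cleanly after the patches, assuming nothing else runs in the `scheduler-plugins` namespace:
```sh
oc get pods -n scheduler-plugins
```
Both the `scheduler-plugins-controller` and `scheduler-plugins-scheduler` pods should report `Running`.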



## Red Hat OpenShift AI

Create the Red Hat OpenShift AI subscription:
```sh
oc apply -f setup.RHOAI-v2.20/mlbatch-subscription.yaml
```
Create the mlbatch NetworkPolicy in the redhat-ods-applications namespace:
```sh
oc apply -f setup.RHOAI-v2.20/mlbatch-network-policy.yaml
```
Identify install plan:
```sh
oc get ip -n redhat-ods-operator
```
```
NAMESPACE NAME CSV APPROVAL APPROVED
redhat-ods-operator install-kmh8w rhods-operator.2.20.0 Manual false
```
Approve the install plan, replacing the generated plan name below with the actual
value:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kmh8w
```
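
After approving, the installation can be monitored until the ClusterServiceVersion reports success:
```sh
oc get csv -n redhat-ods-operator
```
The `rhods-operator.2.20.0` CSV should eventually reach the `Succeeded` phase before proceeding.
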
Create DSC Initialization:
```sh
oc apply -f setup.RHOAI-v2.20/mlbatch-dsci.yaml
```
Create Data Science Cluster:
```sh
oc apply -f setup.RHOAI-v2.20/mlbatch-dsc.yaml
```
The provided DSCI and DSC are intended to install a minimal set of Red Hat OpenShift
AI managed components: `codeflare`, `kueue`, `ray`, and `trainingoperator`. The
remaining components, such as `dashboard`, can optionally be enabled.
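
For example, the dashboard could later be enabled by patching the Data Science Cluster created above; this is a sketch assuming the DSC is named `mlbatch-dsc` (as in the provided manifests) and that the standard `managementState` field is used:
```sh
oc patch dsc mlbatch-dsc --type merge \
  --patch '{"spec":{"components":{"dashboard":{"managementState":"Managed"}}}}'
```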

The configuration of the managed components differs from the default Red Hat OpenShift
AI configuration as follows:
- Kubeflow Training Operator:
- `gang-scheduler-name` is set to `scheduler-plugins-scheduler`,
- Kueue:
- `manageJobsWithoutQueueName` is enabled,
- `batch/job` integration is disabled,
- `waitForPodsReady` is disabled,
- `fairSharing` is enabled,
- `enableClusterQueueResources` metrics is enabled,
- Codeflare operator:
- the AppWrapper controller is enabled and configured as follows:
- `userRBACAdmissionCheck` is disabled,
- `schedulerName` is set to `scheduler-plugins-scheduler`,
- `queueName` is set to `default-queue`,
- `slackQueueName` is set to `slack-cluster-queue`
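
These settings are applied through the RHOAI-managed configuration; one way to spot-check them after the components come up is to list the Kueue-related ConfigMaps and inspect their contents. The namespace and the name pattern below are assumptions and may need adjusting on your cluster:
```sh
oc get configmaps -n redhat-ods-applications | grep -i kueue
oc get configmap <kueue-config-name> -n redhat-ods-applications -o yaml
```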

## Autopilot

The Helm chart values and customization instructions can be found [in the official documentation](https://github.com/IBM/autopilot/blob/main/helm-charts/autopilot/README.md). As-is, Autopilot will run on GPU nodes.

- Add the Autopilot Helm repository

```bash
helm repo add autopilot https://ibm.github.io/autopilot/
helm repo update
```

- Install the chart (the command is idempotent). The config file customizes the Helm values and is optional.

```bash
helm upgrade autopilot autopilot/autopilot --install --namespace=autopilot --create-namespace -f your-config.yml
```
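
To confirm the deployment, check that Autopilot pods are scheduled on the GPU nodes; this assumes the `autopilot` namespace used in the command above:
```bash
oc get pods -n autopilot -o wide
```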

### Enabling Prometheus metrics

After completing the installation, manually label the namespace so that Prometheus can scrape the Autopilot metrics:

```bash
oc label ns autopilot openshift.io/cluster-monitoring=true
```

The `ServiceMonitor` labeling is not required.
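
To verify that the label was applied:
```bash
oc get namespace autopilot --show-labels
```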

## Kueue Configuration

Create Kueue's default flavor:
```sh
oc apply -f setup.RHOAI-v2.20/default-flavor.yaml
```

## Cluster Role

Create `mlbatch-edit` role:
```sh
oc apply -f setup.RHOAI-v2.20/mlbatch-edit-role.yaml
```

## Slack Cluster Queue

Create the designated slack `ClusterQueue`, which will be used to automate
minor adjustments to cluster capacity caused by node failures and
scheduler maintenance.
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
name: slack-cluster-queue
spec:
namespaceSelector: {}
cohort: default-cohort
preemption:
withinClusterQueue: LowerOrNewerEqualPriority
reclaimWithinCohort: Any
borrowWithinCohort:
policy: Never
resourceGroups:
- coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
flavors:
- name: default-flavor
resources:
- name: "cpu"
nominalQuota: 8000m
- name: "memory"
nominalQuota: 128Gi
- name: "nvidia.com/gpu"
nominalQuota: 8
- name: "nvidia.com/roce_gdr"
nominalQuota: 1
- name: "pods"
nominalQuota: 100
EOF
```
Edit the above quantities to adjust the quota to the desired
values. Pod counts are optional and can be omitted from the list of
covered resources. The `lendingLimit` for each resource will be
dynamically adjusted by the MLBatch system to reflect reduced cluster
capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
detailed discussion of the role of the slack `ClusterQueue`.
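
As an illustration of this dynamic adjustment, a resource entry whose `lendingLimit` has been reduced by MLBatch might look like the following sketch (the values are hypothetical):
```yaml
- name: "nvidia.com/gpu"
  nominalQuota: 8
  lendingLimit: 6  # temporarily reduced, e.g. while two GPUs are out of service
```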
91 changes: 91 additions & 0 deletions setup.RHOAI-v2.20/TEAM-SETUP.md
@@ -0,0 +1,91 @@
# Team Setup

A *team* in MLBatch is a group of users that share a resource quota.

Before setting up your teams and quotas, please read [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md)
for a discussion of our recommended best practices.


Setting up a new team requires the cluster admin to create a project,
a user group, a quota, a queue, and the required role bindings as described below.

Create project:
```sh
oc new-project team1
```
Create user group:
```sh
oc adm groups new team1-edit-group
```
Add users to the group, for example:
```sh
oc adm groups add-users team1-edit-group user1
```
Bind cluster role to group in namespace:
```sh
oc adm policy add-role-to-group mlbatch-edit team1-edit-group --role-namespace="" --namespace team1
```
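
To double-check the binding, listing the role bindings in the project should show the `mlbatch-edit` cluster role bound to `team1-edit-group`:
```sh
oc get rolebindings -n team1 -o wide
```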

Specify the intended quota for the namespace by creating a `ClusterQueue`:
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
name: team1-cluster-queue
spec:
namespaceSelector: {}
cohort: default-cohort
preemption:
withinClusterQueue: LowerOrNewerEqualPriority
reclaimWithinCohort: Any
borrowWithinCohort:
policy: Never
resourceGroups:
- coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
flavors:
- name: default-flavor
resources:
- name: "cpu"
nominalQuota: 8000m
# borrowingLimit: 0
# lendingLimit: 0
- name: "memory"
nominalQuota: 128Gi
# borrowingLimit: 0
# lendingLimit: 0
- name: "nvidia.com/gpu"
nominalQuota: 16
# borrowingLimit: 0
# lendingLimit: 0
- name: "nvidia.com/roce_gdr"
nominalQuota: 4
# borrowingLimit: 0
# lendingLimit: 0
- name: "pods"
nominalQuota: 100
# borrowingLimit: 0
# lendingLimit: 0
EOF
```
Edit the above quantities to adjust the quota to the desired values. Pod counts
are optional and can be omitted from the list of covered resources.

Uncomment all `borrowingLimit` lines to prevent this namespace from borrowing
quota from other namespaces. Uncomment all `lendingLimit` lines to prevent other
namespaces from borrowing quota from this namespace.
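
Once created, the quota and, after workloads are admitted, its current usage can be reviewed with:
```sh
oc describe clusterqueue team1-cluster-queue
```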

Create a `LocalQueue` to bind the `ClusterQueue` to the namespace:
```sh
oc apply -n team1 -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
name: default-queue
spec:
clusterQueue: team1-cluster-queue
EOF
```
We recommend naming the local queue `default-queue` as `AppWrappers` will
default to this queue name.
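
To confirm the queue is bound correctly, listing the `LocalQueue` should show `team1-cluster-queue` in the CLUSTERQUEUE column:
```sh
oc get localqueue -n team1
```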

23 changes: 23 additions & 0 deletions setup.RHOAI-v2.20/UNINSTALL.md
@@ -0,0 +1,23 @@
# Uninstall

***First, remove all team projects and corresponding cluster queues.***
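
For a team created as in [TEAM-SETUP.md](./TEAM-SETUP.md), this could look like the following sketch (substituting your team's names for `team1`):
```sh
oc delete clusterqueue team1-cluster-queue
oc delete project team1
```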

Then to uninstall the MLBatch controllers and reclaim the corresponding
namespaces, run:
```sh
# OpenShift AI uninstall
oc delete dsc mlbatch-dsc
oc delete dsci mlbatch-dsci
oc delete subscription -n redhat-ods-operator rhods-operator
oc delete csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator
oc delete crd featuretrackers.features.opendatahub.io \
dscinitializations.dscinitialization.opendatahub.io \
datascienceclusters.datasciencecluster.opendatahub.io
oc delete operators rhods-operator.redhat-ods-operator
oc delete operatorgroup -n redhat-ods-operator rhods-operator
oc delete namespace redhat-ods-applications redhat-ods-monitoring redhat-ods-operator

# Coscheduler uninstall
helm uninstall -n scheduler-plugins scheduler-plugins
oc delete namespace scheduler-plugins
```
27 changes: 27 additions & 0 deletions setup.RHOAI-v2.20/UPGRADE-FAST.md
@@ -0,0 +1,27 @@
# Upgrading from RHOAI 2.19

These instructions assume you installed and configured RHOAI 2.19 following
the MLBatch [install instructions for RHOAI-v2.19](../setup.RHOAI-v2.19/CLUSTER-SETUP.md)
or the [fast stream upgrade instructions for RHOAI-v2.19](../setup.RHOAI-v2.19/UPGRADE-FAST.md).

Your subscription will have automatically created an unapproved
install plan to upgrade to RHOAI 2.20.

Before beginning, verify that the expected install plan exists:
```sh
oc get ip -n redhat-ods-operator
```
Typical output would be:
```sh
NAME CSV APPROVAL APPROVED
install-kpzzl rhods-operator.2.20.0 Manual false
install-nqrbp rhods-operator.2.19.0 Manual true
```

There are no MLBatch modifications to the default RHOAI configuration maps
beyond those already made in previous installs. Therefore, you can simply
approve the install plan replacing the example plan name below with the actual
value on your cluster:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl
```
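
After the plan is approved, the upgrade can be monitored until the new ClusterServiceVersion replaces the 2.19 one:
```sh
oc get csv -n redhat-ods-operator
```
The `rhods-operator.2.20.0` CSV should eventually reach the `Succeeded` phase.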
4 changes: 4 additions & 0 deletions setup.RHOAI-v2.20/default-flavor.yaml
@@ -0,0 +1,4 @@
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
name: default-flavor