diff --git a/SETUP.md b/SETUP.md
index 1b756e9..ed3dd33 100644
--- a/SETUP.md
+++ b/SETUP.md
@@ -45,10 +45,15 @@ Instructions are provided for the following Red Hat OpenShift AI ***stable*** re
   + [RHOAI 2.16 Uninstall](./setup.RHOAI-v2.16/UNINSTALL.md)
 
 Instructions are provided for the following Red Hat OpenShift AI ***fast*** releases:
++ Red Hat OpenShift AI 2.20
+  + [RHOAI 2.20 Cluster Setup](./setup.RHOAI-v2.20/CLUSTER-SETUP.md)
+  + [RHOAI 2.20 Team Setup](./setup.RHOAI-v2.20/TEAM-SETUP.md)
+  + [UPGRADING from RHOAI 2.19](./setup.RHOAI-v2.20/UPGRADE-FAST.md)
+  + [RHOAI 2.20 Uninstall](./setup.RHOAI-v2.20/UNINSTALL.md)
 + Red Hat OpenShift AI 2.19
   + [RHOAI 2.19 Cluster Setup](./setup.RHOAI-v2.19/CLUSTER-SETUP.md)
   + [RHOAI 2.19 Team Setup](./setup.RHOAI-v2.19/TEAM-SETUP.md)
-  + [UPGRADING from RHOAI 2.18](./setup.RHOAI-v2.19/UPGRADE.md)
+  + [UPGRADING from RHOAI 2.18](./setup.RHOAI-v2.19/UPGRADE-FAST.md)
   + [RHOAI 2.19 Uninstall](./setup.RHOAI-v2.19/UNINSTALL.md)
 
 ## Kubernetes
diff --git a/setup.RHOAI-v2.16/UPGRADE-FAST.md b/setup.RHOAI-v2.16/UPGRADE-FAST.md
index eeb9bb3..619b8c7 100644
--- a/setup.RHOAI-v2.16/UPGRADE-FAST.md
+++ b/setup.RHOAI-v2.16/UPGRADE-FAST.md
@@ -25,10 +25,8 @@ First, update the MLBatch modifications to the default RHOAI configuration
 maps.
 ```sh
 oc apply -f setup.RHOAI-v2.16/mlbatch-upgrade-configmaps.yaml
 ```
 
-There are no MLBatch modifications to the default RHOAI configuration maps
-beyond those already made in previous installs. Therefore, you can simply
-approve the install plan replacing the example plan name below with the actual
-value on your cluster:
+Next, you can approve the install plan, replacing the example plan name
+below with the actual value on your cluster:
 ```sh
 oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl
 ```
diff --git a/setup.RHOAI-v2.19/UPGRADE-FAST.md b/setup.RHOAI-v2.19/UPGRADE-FAST.md
index 913d9c0..52515e1 100644
--- a/setup.RHOAI-v2.19/UPGRADE-FAST.md
+++ b/setup.RHOAI-v2.19/UPGRADE-FAST.md
@@ -1,4 +1,4 @@
-# Upgrading from RHOAI 2.19
+# Upgrading from RHOAI 2.18
 
 These instructions assume you installed and configured RHOAI 2.18 following
 the MLBatch [install instructions for RHOAI-v2.18](../setup.RHOAI-v2.18/CLUSTER-SETUP.md)
@@ -14,8 +14,8 @@ oc get ip -n redhat-ods-operator
 Typical output would be:
 ```sh
 NAME            CSV                     APPROVAL   APPROVED
-install-kpzzl   rhods-operator.2.18.0   Manual     false
-install-nqrbp   rhods-operator.2.19.0   Manual     true
+install-kpzzl   rhods-operator.2.19.0   Manual     false
+install-nqrbp   rhods-operator.2.18.0   Manual     true
 ```
 
 Before approving the upgrade, you must manually remove v1alpha1 MultiKueue CRD's
diff --git a/setup.RHOAI-v2.19/UPGRADE-STABLE.md b/setup.RHOAI-v2.19/UPGRADE-STABLE.md
index 6275137..91df26b 100644
--- a/setup.RHOAI-v2.19/UPGRADE-STABLE.md
+++ b/setup.RHOAI-v2.19/UPGRADE-STABLE.md
@@ -15,8 +15,8 @@ oc get ip -n redhat-ods-operator
 Typical output would be:
 ```sh
 NAME            CSV                     APPROVAL   APPROVED
-install-kpzzl   rhods-operator.2.16.0   Manual     false
-install-nqrbp   rhods-operator.2.19.0   Manual     true
+install-kpzzl   rhods-operator.2.19.0   Manual     false
+install-nqrbp   rhods-operator.2.16.0   Manual     true
 ```
 
 Assuming the install plan exists you can begin the upgrade process.
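+
+Once the plan is approved, you can watch the upgrade progress by monitoring
+the ClusterServiceVersion phase until it reports `Succeeded` (a standard OLM
+check, not specific to MLBatch):
+```sh
+oc get csv -n redhat-ods-operator -w
+```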
diff --git a/setup.RHOAI-v2.20/CLUSTER-SETUP.md b/setup.RHOAI-v2.20/CLUSTER-SETUP.md
new file mode 100644
index 0000000..6873325
--- /dev/null
+++ b/setup.RHOAI-v2.20/CLUSTER-SETUP.md
@@ -0,0 +1,171 @@
+# Cluster Setup
+
+The cluster setup installs Red Hat OpenShift AI and configures Scheduler Plugins, Kueue,
+cluster roles, and priority classes.
+
+## Priorities
+
+Create `default-priority`, `high-priority`, and `low-priority` priority classes:
+```sh
+oc apply -f setup.RHOAI-v2.20/mlbatch-priorities.yaml
+```
+
+## Scheduler Configuration
+
+MLBatch configures Kubernetes scheduling to accomplish two objectives:
++ Obtaining gang (all or nothing) scheduling for multi-Pod workloads.
++ Packing Pods whose GPU request is less than the number of GPUs on a Node to
+  maximize the number of Nodes available for Pods that request all the GPUs on a Node.
+
+This is done by installing the Coscheduling out-of-tree scheduler plugin and configuring
+the default NodeResourcesFit scheduler plugin to pack in the GPU dimension.
+
+```sh
+helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
+  scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
+  --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"},{"args":{"permitWaitingTimeSeconds":300},"name":"Coscheduling"}]'
+```
+Patch the scheduler-plugins pod priorities:
+```sh
+oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.20/scheduler-priority-patch.yaml scheduler-plugins-controller
+oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.20/scheduler-priority-patch.yaml scheduler-plugins-scheduler
+```
+
+## Red Hat OpenShift AI
+
+Create the Red Hat OpenShift AI subscription:
+```sh
+oc apply -f setup.RHOAI-v2.20/mlbatch-subscription.yaml
+```
+Create the mlbatch NetworkPolicy in the redhat-ods-applications namespace:
+```sh
+oc apply -f setup.RHOAI-v2.20/mlbatch-network-policy.yaml
+```
+Identify the install plan:
+```sh
+oc get ip -n redhat-ods-operator
+```
+```
+NAMESPACE             NAME            CSV                     APPROVAL   APPROVED
+redhat-ods-operator   install-kmh8w   rhods-operator.2.20.0   Manual     false
+```
+Approve the install plan, replacing the generated plan name below with the actual
+value:
+```sh
+oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kmh8w
+```
+Create the DSC Initialization:
+```sh
+oc apply -f setup.RHOAI-v2.20/mlbatch-dsci.yaml
+```
+Create the Data Science Cluster:
+```sh
+oc apply -f setup.RHOAI-v2.20/mlbatch-dsc.yaml
+```
+The provided DSCI and DSC are intended to install a minimal set of Red Hat OpenShift
+AI managed components: `codeflare`, `kueue`, `ray`, and `trainingoperator`. The
+remaining components, such as `dashboard`, can be optionally enabled.
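+
+Before proceeding further, you can optionally confirm that the enabled
+components have been reconciled by checking the DataScienceCluster status.
+This is a sketch: the `phase` field and its `Ready` value are assumed from
+the DataScienceCluster API and may differ across releases:
+```sh
+oc get dsc mlbatch-dsc -o jsonpath='{.status.phase}{"\n"}'
+```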
+
+The configuration of the managed components differs from the default Red Hat OpenShift
+AI configuration as follows:
+- Kubeflow Training Operator:
+  - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`,
+- Kueue:
+  - `manageJobsWithoutQueueName` is enabled,
+  - `batch/job` integration is disabled,
+  - `waitForPodsReady` is disabled,
+  - `fairSharing` is enabled,
+  - the `enableClusterQueueResources` metrics option is enabled,
+- Codeflare operator:
+  - the AppWrapper controller is enabled and configured as follows:
+    - `userRBACAdmissionCheck` is disabled,
+    - `schedulerName` is set to `scheduler-plugins-scheduler`,
+    - `queueName` is set to `default-queue`,
+    - `slackQueueName` is set to `slack-cluster-queue`
+
+## Autopilot
+
+Helm chart values and customization instructions can be found [in the official documentation](https://github.com/IBM/autopilot/blob/main/helm-charts/autopilot/README.md). As-is, Autopilot will run on GPU nodes.
+
+- Add the Autopilot Helm repository:
+
+```bash
+helm repo add autopilot https://ibm.github.io/autopilot/
+helm repo update
+```
+
+- Install the chart (the command is idempotent). The config file customizes the Helm values and is optional:
+
+```bash
+helm upgrade autopilot autopilot/autopilot --install --namespace=autopilot --create-namespace -f your-config.yml
+```
+
+### Enabling Prometheus metrics
+
+After completing the installation, manually label the namespace so that its metrics can be scraped by Prometheus:
+
+```bash
+oc label ns autopilot openshift.io/cluster-monitoring=true
+```
+
+The `ServiceMonitor` labeling is not required.
+
+## Kueue Configuration
+
+Create Kueue's default flavor:
+```sh
+oc apply -f setup.RHOAI-v2.20/default-flavor.yaml
+```
+
+## Cluster Role
+
+Create the `mlbatch-edit` role:
+```sh
+oc apply -f setup.RHOAI-v2.20/mlbatch-edit-role.yaml
+```
+
+## Slack Cluster Queue
+
+Create the designated slack `ClusterQueue`, which will be used to automate
+minor adjustments to cluster capacity caused by node failures and
+scheduler maintenance:
+```sh
+oc apply -f- << EOF
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+  name: slack-cluster-queue
+spec:
+  namespaceSelector: {}
+  cohort: default-cohort
+  preemption:
+    withinClusterQueue: LowerOrNewerEqualPriority
+    reclaimWithinCohort: Any
+    borrowWithinCohort:
+      policy: Never
+  resourceGroups:
+  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
+    flavors:
+    - name: default-flavor
+      resources:
+      - name: "cpu"
+        nominalQuota: 8000m
+      - name: "memory"
+        nominalQuota: 128Gi
+      - name: "nvidia.com/gpu"
+        nominalQuota: 8
+      - name: "nvidia.com/roce_gdr"
+        nominalQuota: 1
+      - name: "pods"
+        nominalQuota: 100
+EOF
+```
+Edit the above quantities to adjust the quota to the desired
+values. Pod counts are optional and can be omitted from the list of
+covered resources. The `lendingLimit` for each resource will be
+dynamically adjusted by the MLBatch system to reflect reduced cluster
+capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
+detailed discussion of the role of the slack `ClusterQueue`.
diff --git a/setup.RHOAI-v2.20/TEAM-SETUP.md b/setup.RHOAI-v2.20/TEAM-SETUP.md
new file mode 100644
index 0000000..85c9429
--- /dev/null
+++ b/setup.RHOAI-v2.20/TEAM-SETUP.md
@@ -0,0 +1,91 @@
+# Team Setup
+
+A *team* in MLBatch is a group of users that share a resource quota.
+
+Before setting up your teams and quotas, please read [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md)
+for a discussion of our recommended best practices.
+
+Setting up a new team requires the cluster admin to create a project,
+a user group, a quota, a queue, and the required role bindings as described below.
+
+Create the project:
+```sh
+oc new-project team1
+```
+Create the user group:
+```sh
+oc adm groups new team1-edit-group
+```
+Add users to the group, for example:
+```sh
+oc adm groups add-users team1-edit-group user1
+```
+Bind the cluster role to the group in the namespace:
+```sh
+oc adm policy add-role-to-group mlbatch-edit team1-edit-group --role-namespace="" --namespace team1
+```
+
+Specify the intended quota for the namespace by creating a `ClusterQueue`:
+```sh
+oc apply -f- << EOF
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+  name: team1-cluster-queue
+spec:
+  namespaceSelector: {}
+  cohort: default-cohort
+  preemption:
+    withinClusterQueue: LowerOrNewerEqualPriority
+    reclaimWithinCohort: Any
+    borrowWithinCohort:
+      policy: Never
+  resourceGroups:
+  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
+    flavors:
+    - name: default-flavor
+      resources:
+      - name: "cpu"
+        nominalQuota: 8000m
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "memory"
+        nominalQuota: 128Gi
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "nvidia.com/gpu"
+        nominalQuota: 16
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "nvidia.com/roce_gdr"
+        nominalQuota: 4
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "pods"
+        nominalQuota: 100
+        # borrowingLimit: 0
+        # lendingLimit: 0
+EOF
+```
+Edit the above quantities to adjust the quota to the desired values. Pod counts
+are optional and can be omitted from the list of covered resources.
+
+Uncomment all `borrowingLimit` lines to prevent this namespace from borrowing
+quota from other namespaces. Uncomment all `lendingLimit` lines to prevent other
+namespaces from borrowing quota from this namespace.
+
+Create a `LocalQueue` to bind the `ClusterQueue` to the namespace:
+```sh
+oc apply -n team1 -f- << EOF
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: LocalQueue
+metadata:
+  name: default-queue
+spec:
+  clusterQueue: team1-cluster-queue
+EOF
+```
+We recommend naming the local queue `default-queue` as `AppWrappers` will
+default to this queue name.
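+
+To sanity-check the team setup, verify that both queues were created and are
+active (plain `oc get` commands; the exact columns printed vary with the Kueue
+version):
+```sh
+oc get clusterqueue team1-cluster-queue
+oc get localqueue -n team1 default-queue
+```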
+
diff --git a/setup.RHOAI-v2.20/UNINSTALL.md b/setup.RHOAI-v2.20/UNINSTALL.md
new file mode 100644
index 0000000..776045d
--- /dev/null
+++ b/setup.RHOAI-v2.20/UNINSTALL.md
@@ -0,0 +1,23 @@
+# Uninstall
+
+***First, remove all team projects and corresponding cluster queues.***
+
+Then, to uninstall the MLBatch controllers and reclaim the corresponding
+namespaces, run:
+```sh
+# OpenShift AI uninstall
+oc delete dsc mlbatch-dsc
+oc delete dsci mlbatch-dsci
+oc delete subscription -n redhat-ods-operator rhods-operator
+oc delete csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator
+oc delete crd featuretrackers.features.opendatahub.io \
+  dscinitializations.dscinitialization.opendatahub.io \
+  datascienceclusters.datasciencecluster.opendatahub.io
+oc delete operators rhods-operator.redhat-ods-operator
+oc delete operatorgroup -n redhat-ods-operator rhods-operator
+oc delete namespace redhat-ods-applications redhat-ods-monitoring redhat-ods-operator
+
+# Coscheduler uninstall
+helm uninstall -n scheduler-plugins scheduler-plugins
+oc delete namespace scheduler-plugins
+```
diff --git a/setup.RHOAI-v2.20/UPGRADE-FAST.md b/setup.RHOAI-v2.20/UPGRADE-FAST.md
new file mode 100644
index 0000000..b5b17d0
--- /dev/null
+++ b/setup.RHOAI-v2.20/UPGRADE-FAST.md
@@ -0,0 +1,27 @@
+# Upgrading from RHOAI 2.19
+
+These instructions assume you installed and configured RHOAI 2.19 following
+the MLBatch [install instructions for RHOAI-v2.19](../setup.RHOAI-v2.19/CLUSTER-SETUP.md)
+or the [fast stream upgrade instructions for RHOAI-v2.19](../setup.RHOAI-v2.19/UPGRADE-FAST.md).
+
+Your subscription will have automatically created an unapproved
+install plan to upgrade to RHOAI 2.20.
+
+Before beginning, verify that the expected install plan exists:
+```sh
+oc get ip -n redhat-ods-operator
+```
+Typical output would be:
+```sh
+NAME            CSV                     APPROVAL   APPROVED
+install-kpzzl   rhods-operator.2.20.0   Manual     false
+install-nqrbp   rhods-operator.2.19.0   Manual     true
+```
+
+There are no MLBatch modifications to the default RHOAI configuration maps
+beyond those already made in previous installs.
Therefore, you can simply +approve the install plan replacing the example plan name below with the actual +value on your cluster: +```sh +oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl +``` diff --git a/setup.RHOAI-v2.20/default-flavor.yaml b/setup.RHOAI-v2.20/default-flavor.yaml new file mode 100644 index 0000000..6cbccf3 --- /dev/null +++ b/setup.RHOAI-v2.20/default-flavor.yaml @@ -0,0 +1,4 @@ +apiVersion: kueue.x-k8s.io/v1beta1 +kind: ResourceFlavor +metadata: + name: default-flavor diff --git a/setup.RHOAI-v2.20/mlbatch-dsc.yaml b/setup.RHOAI-v2.20/mlbatch-dsc.yaml new file mode 100644 index 0000000..66336bc --- /dev/null +++ b/setup.RHOAI-v2.20/mlbatch-dsc.yaml @@ -0,0 +1,32 @@ +apiVersion: datasciencecluster.opendatahub.io/v1 +kind: DataScienceCluster +metadata: + name: mlbatch-dsc +spec: + components: + codeflare: + managementState: Managed + dashboard: + managementState: Removed + datasciencepipelines: + managementState: Removed + kserve: + managementState: Removed + serving: + ingressGateway: + certificate: + type: SelfSigned + managementState: Removed + name: knative-serving + kueue: + managementState: Managed + modelmeshserving: + managementState: Removed + ray: + managementState: Managed + trainingoperator: + managementState: Managed + trustyai: + managementState: Removed + workbenches: + managementState: Removed diff --git a/setup.RHOAI-v2.20/mlbatch-dsci.yaml b/setup.RHOAI-v2.20/mlbatch-dsci.yaml new file mode 100644 index 0000000..77785c3 --- /dev/null +++ b/setup.RHOAI-v2.20/mlbatch-dsci.yaml @@ -0,0 +1,14 @@ +apiVersion: dscinitialization.opendatahub.io/v1 +kind: DSCInitialization +metadata: + name: mlbatch-dsci +spec: + applicationsNamespace: redhat-ods-applications + monitoring: + managementState: Managed + namespace: redhat-ods-monitoring + serviceMesh: + managementState: Removed + trustedCABundle: + customCABundle: "" + managementState: Managed diff --git a/setup.RHOAI-v2.20/mlbatch-edit-role.yaml b/setup.RHOAI-v2.20/mlbatch-edit-role.yaml new file mode 100644 index 0000000..fd86cc6 --- /dev/null +++ b/setup.RHOAI-v2.20/mlbatch-edit-role.yaml @@ -0,0 +1,151 @@ +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: mlbatch-edit +rules: +- apiGroups: + - "" + resources: + - pods + verbs: + - delete + - get + - list + - watch +- apiGroups: + - apps + resources: + - deployments + - statefulsets + verbs: + - delete + - get + - list + - watch +- apiGroups: + - "" + resources: + - services + - secrets + - configmaps + - persistentvolumeclaims + verbs: + - create + - delete + - get + - list + - patch + - update + - watch +- apiGroups: + - kueue.x-k8s.io + resources: + - "*" + verbs: + - get + - list + - watch +- apiGroups: + - kubeflow.org + resources: + - pytorchjobs + verbs: + - create + - delete + - get + - list + - patch + - update + - watch +- apiGroups: + - ray.io + resources: + - rayjobs + - rayclusters + verbs: + - create + - delete + - get + - list + - patch + - update + - watch +- apiGroups: + - batch + resources: + - jobs + verbs: + - delete + - get + - list + - watch +- apiGroups: + - workload.codeflare.dev + resources: + - appwrappers + verbs: + - create + - delete + - get + - list + - patch + - update + - watch +- apiGroups: + - scheduling.k8s.io + resources: + - priorityclasses + verbs: + - get + - list + - watch +- apiGroups: + - scheduling.x-k8s.io + resources: + - podgroups + verbs: + - create + - delete + - get + - list + - patch + - update + - watch +- apiGroups: + - "" + 
resources: + - events + verbs: + - get + - list + - watch +- apiGroups: + - "" + resources: + - namespaces + - pods/log + verbs: + - get +- apiGroups: + - "" + resources: + - pods/exec + - pods/portforward + verbs: + - create +- apiGroups: + - route.openshift.io + resources: + - routes + verbs: + - get + - list + - watch + - delete +- apiGroups: + - "" + - project.openshift.io + resources: + - projects + verbs: + - get diff --git a/setup.RHOAI-v2.20/mlbatch-network-policy.yaml b/setup.RHOAI-v2.20/mlbatch-network-policy.yaml new file mode 100644 index 0000000..d116279 --- /dev/null +++ b/setup.RHOAI-v2.20/mlbatch-network-policy.yaml @@ -0,0 +1,25 @@ +kind: NetworkPolicy +apiVersion: networking.k8s.io/v1 +metadata: + name: mlbatch-ods-applications + namespace: redhat-ods-applications +spec: + podSelector: {} + ingress: + - ports: + - protocol: TCP + port: 8443 + - protocol: TCP + port: 8080 + - protocol: TCP + port: 8081 + - protocol: TCP + port: 5432 + - protocol: TCP + port: 8082 + - protocol: TCP + port: 8099 + - protocol: TCP + port: 8181 + - protocol: TCP + port: 9443 # default webhook of components diff --git a/setup.RHOAI-v2.20/mlbatch-priorities.yaml b/setup.RHOAI-v2.20/mlbatch-priorities.yaml new file mode 100644 index 0000000..77c8f3b --- /dev/null +++ b/setup.RHOAI-v2.20/mlbatch-priorities.yaml @@ -0,0 +1,26 @@ +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: low-priority +value: 1 +preemptionPolicy: PreemptLowerPriority +globalDefault: false +description: "This is the priority class for all lower priority jobs." +--- +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: default-priority +value: 5 +preemptionPolicy: PreemptLowerPriority +globalDefault: true +description: "This is the priority class for all jobs (default priority)." +--- +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: high-priority +value: 10 +preemptionPolicy: PreemptLowerPriority +globalDefault: false +description: "This is the priority class defined for highly important jobs that would evict lower and default priority jobs." 
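+# Illustrative usage (comment only; not applied by this manifest): a workload
+# opts into one of these classes by name in its Pod template, e.g.
+#   spec:
+#     priorityClassName: high-priority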
diff --git a/setup.RHOAI-v2.20/mlbatch-subscription.yaml b/setup.RHOAI-v2.20/mlbatch-subscription.yaml new file mode 100644 index 0000000..52e271f --- /dev/null +++ b/setup.RHOAI-v2.20/mlbatch-subscription.yaml @@ -0,0 +1,190 @@ +apiVersion: v1 +kind: Namespace +metadata: + name: redhat-ods-operator +--- +apiVersion: v1 +kind: Namespace +metadata: + name: redhat-ods-applications +--- +apiVersion: operators.coreos.com/v1 +kind: OperatorGroup +metadata: + name: rhods-operator + namespace: redhat-ods-operator +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: codeflare-operator-config + namespace: redhat-ods-applications +data: + config.yaml: | + appwrapper: + enabled: true + Config: + autopilot: + injectAntiAffinities: true + monitorNodes: true + resourceTaints: + nvidia.com/gpu: + - key: autopilot.ibm.com/gpuhealth + value: ERR + effect: NoSchedule + - key: autopilot.ibm.com/gpuhealth + value: TESTING + effect: NoSchedule + - key: autopilot.ibm.com/gpuhealth + value: EVICT + effect: NoExecute + defaultQueueName: default-queue + enableKueueIntegrations: true + kueueJobReconciller: + manageJobsWithoutQueueName: true + waitForPodsReady: + blockAdmission: false + enable: false + schedulerName: scheduler-plugins-scheduler + slackQueueName: slack-cluster-queue + userRBACAdmissionCheck: false +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: mlbatch-kueue + namespace: redhat-ods-operator +data: + controller_manager_config.yaml: | + apiVersion: config.kueue.x-k8s.io/v1beta1 + kind: Configuration + health: + healthProbeBindAddress: :8081 + metrics: + bindAddress: :8443 + enableClusterQueueResources: true + webhook: + port: 9443 + leaderElection: + leaderElect: true + resourceName: c1f6bfd2.kueue.x-k8s.io + controller: + groupKindConcurrency: + Job.batch: 5 + Pod: 5 + Workload.kueue.x-k8s.io: 5 + LocalQueue.kueue.x-k8s.io: 1 + Cohort.kueue.x-k8s.io: 1 + ClusterQueue.kueue.x-k8s.io: 1 + ResourceFlavor.kueue.x-k8s.io: 1 + clientConnection: + qps: 50 + burst: 100 + #pprofBindAddress: :8082 + waitForPodsReady: + enable: false + blockAdmission: false + manageJobsWithoutQueueName: true + #managedJobsNamespaceSelector: + # matchLabels: + # kueue-managed: "true" + #internalCertManagement: + # enable: false + # webhookServiceName: "" + # webhookSecretName: "" + integrations: + frameworks: + # - "batch/job" + - "kubeflow.org/mpijob" + - "ray.io/rayjob" + - "ray.io/raycluster" + - "jobset.x-k8s.io/jobset" + - "kubeflow.org/mxjob" + - "kubeflow.org/paddlejob" + - "kubeflow.org/pytorchjob" + - "kubeflow.org/tfjob" + - "kubeflow.org/xgboostjob" + # - "pod" + # - "deployment" # requires enabling pod integration + # - "statefulset" # requires enabling pod integration + externalFrameworks: + - "AppWrapper.v1beta2.workload.codeflare.dev" + # podOptions: + # namespaceSelector: + # matchExpressions: + # - key: kubernetes.io/metadata.name + # operator: NotIn + # values: [ kube-system, kueue-system ] + fairSharing: + enable: true + preemptionStrategies: [LessThanOrEqualToFinalShare, LessThanInitialShare] + #resources: + # excludeResourcePrefixes: [] + # transformations: + # - input: nvidia.com/mig-4g.5gb + # strategy: Replace | Retain + # outputs: + # example.com/accelerator-memory: 5Gi + # example.com/accelerator-gpc: 4 +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: mlbatch-training-operator + namespace: redhat-ods-operator +data: + manager_config_patch.yaml: | + apiVersion: apps/v1 + kind: Deployment + metadata: + name: training-operator + spec: + template: + spec: + containers: + - name: 
training-operator + image: $(image) + args: + - "--zap-log-level=2" + - --pytorch-init-container-image + - $(image) + - "--webhook-secret-name" + - "kubeflow-training-operator-webhook-cert" + - "--webhook-service-name" + - "kubeflow-training-operator" + - "--gang-scheduler-name=scheduler-plugins-scheduler" + volumes: + - name: cert + secret: + defaultMode: 420 + secretName: kubeflow-training-operator-webhook-cert +--- +apiVersion: operators.coreos.com/v1alpha1 +kind: Subscription +metadata: + name: rhods-operator + namespace: redhat-ods-operator +spec: + channel: fast + installPlanApproval: Manual + name: rhods-operator + source: redhat-operators + sourceNamespace: openshift-marketplace + startingCSV: rhods-operator.2.20.0 + config: + env: + - name: "DISABLE_DSC_CONFIG" + volumeMounts: + - name: mlbatch-kueue + mountPath: /opt/manifests/kueue/components/manager/controller_manager_config.yaml + subPath: controller_manager_config.yaml + - name: mlbatch-training-operator + mountPath: /opt/manifests/trainingoperator/rhoai/manager_config_patch.yaml + subPath: manager_config_patch.yaml + volumes: + - name: mlbatch-kueue + configMap: + name: mlbatch-kueue + - name: mlbatch-training-operator + configMap: + name: mlbatch-training-operator diff --git a/setup.RHOAI-v2.20/scheduler-priority-patch.yaml b/setup.RHOAI-v2.20/scheduler-priority-patch.yaml new file mode 100644 index 0000000..278802f --- /dev/null +++ b/setup.RHOAI-v2.20/scheduler-priority-patch.yaml @@ -0,0 +1,3 @@ +- op: add + path: /spec/template/spec/priorityClassName + value: system-node-critical diff --git a/setup.tmpl/Makefile b/setup.tmpl/Makefile index a7fe221..2217c49 100644 --- a/setup.tmpl/Makefile +++ b/setup.tmpl/Makefile @@ -25,6 +25,8 @@ docs: gotmpl ../tools/gotmpl/gotmpl -input ./TEAM-SETUP.md.tmpl -output ../setup.RHOAI-v2.16/TEAM-SETUP.md -values RHOAI-v2.16.yaml ../tools/gotmpl/gotmpl -input ./CLUSTER-SETUP.md.tmpl -output ../setup.RHOAI-v2.19/CLUSTER-SETUP.md -values RHOAI-v2.19.yaml ../tools/gotmpl/gotmpl -input ./TEAM-SETUP.md.tmpl -output ../setup.RHOAI-v2.19/TEAM-SETUP.md -values RHOAI-v2.19.yaml + ../tools/gotmpl/gotmpl -input ./CLUSTER-SETUP.md.tmpl -output ../setup.RHOAI-v2.20/CLUSTER-SETUP.md -values RHOAI-v2.20.yaml + ../tools/gotmpl/gotmpl -input ./TEAM-SETUP.md.tmpl -output ../setup.RHOAI-v2.20/TEAM-SETUP.md -values RHOAI-v2.20.yaml ../tools/gotmpl/gotmpl -input ./CLUSTER-SETUP.md.tmpl -output ../setup.k8s/CLUSTER-SETUP.md -values Kubernetes.yaml ../tools/gotmpl/gotmpl -input ./TEAM-SETUP.md.tmpl -output ../setup.k8s/TEAM-SETUP.md -values Kubernetes.yaml diff --git a/setup.tmpl/RHOAI-v2.20.yaml b/setup.tmpl/RHOAI-v2.20.yaml new file mode 100644 index 0000000..e02c1f7 --- /dev/null +++ b/setup.tmpl/RHOAI-v2.20.yaml @@ -0,0 +1,7 @@ +# Values for RHOAI 2.20 + +RHOAI: true +VERSION: RHOAI-v2.20 +VERSION_NUMBER: 2.20.0 +KUBECTL: oc +FAIRSHARE: true