Scheduling k8s jobs to multiple clusters #398

nuclearcat · 2024-01-25T10:11:34Z

Right now we have a simple scheduler which just picks one hardcoded cluster (context) and submits the job to it.
To scale up and compile many kernels at once, we need to support multiple clusters (contexts) in the config and distribute jobs between them.
As the issue is a bit urgent, I suggest doing it in 2 steps:

First step: implement it in the simplest way possible, e.g. just support multiple contexts per runtime in the config and pick one of them randomly.

The main issue is how to convert:

  k8s-gke-eu-west4:
    lab_type: kubernetes
    context: 'gke_android-kernelci-external_europe-west4-c_kci-eu-west4'

to

  k8s-google:
    lab_type: kubernetes
    context: 
        - 'gke_android-kernelci-external_europe-west4-c_kci-eu-west4'
        - 'gke_android-kernelci-external_europe-west1-d_kci-eu-west1'
        - 'gke_android-kernelci-external_us-west1-a_kci-us-west1'

(We will have 5 clusters in total.)
Then, when a job is submitted to this runtime, the load will be distributed roughly evenly across the available clusters.
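
To keep existing single-context entries working during the conversion, the config loader could accept both forms. This is only a minimal sketch: `normalize_contexts` is a hypothetical helper name, not an existing function in the code base, and it assumes the `context` value has already been parsed from YAML into a Python string or list.

```python
def normalize_contexts(value):
    """Return the 'context' config value as a non-empty list of strings.

    Hypothetical helper: accepts the old form (a single string) as well
    as the new form (a list of strings).
    """
    if isinstance(value, str):
        contexts = [value]
    else:
        contexts = list(value or [])
    if not contexts or not all(isinstance(c, str) and c for c in contexts):
        raise ValueError(
            "'context' must be a string or a non-empty list of strings"
        )
    return contexts
```

With this, `k8s-gke-eu-west4` keeps working unchanged, and `k8s-google` yields a list of three contexts.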

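I suggest the following:
1) Support multiple contexts in the config
2) Use a random function and pick one of the contexts based on the random value
e.g.

import random
# randrange excludes the upper bound, unlike randint, so the index is always valid
rndval = random.randrange(len(contexts))
context = contexts[rndval]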

Second step: once other teams can start working on other issues and we can spend more time on a more elegant design, we need to design and implement more sophisticated logic to pick the context based on kernel "size" + cluster "size" + load on the cluster.
Several considerations:
1) Some kernels require "big" clusters (RAM and CPU capacity), of which we have only 2 at the moment. This applies at least to the "allmodconfig" and "gki_defconfig" kernels.
2) In the future we might need to monitor the load on each cluster and pick the context based on it. For example, if a cluster has too many jobs in the pending state, it is better to pick another cluster.
3) We need some kind of "weight" for each cluster that gives us the approximate computing capacity of the cluster.

The only critical requirement is number 1, as an allmodconfig kernel will just cause an OOM on a small cluster.
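
A rough sketch of how the three considerations could fit together, to make the discussion concrete. Everything here is an assumption for illustration: the `pick_context` function, the per-cluster fields `big`, `weight` and `pending`, and the `max_pending` threshold are hypothetical names, and how the pending count would be obtained from a cluster is left open.

```python
import random

# Kernel configs named above as needing a "big" cluster
BIG_CONFIGS = {'allmodconfig', 'gki_defconfig'}


def pick_context(clusters, needs_big, max_pending=None, rng=random):
    """Pick a context name from a list of cluster dicts.

    Hypothetical fields per cluster: 'context' (str), 'big' (bool),
    'weight' (number > 0) and optionally 'pending' (int).
    """
    # Consideration 1 (critical): big kernels only go to big clusters
    candidates = [c for c in clusters if c['big'] or not needs_big]
    if not candidates:
        raise RuntimeError("no cluster is large enough for this job")
    # Consideration 2: prefer clusters that are not overloaded, but
    # fall back to all candidates rather than refusing the job
    if max_pending is not None:
        idle = [c for c in candidates if c.get('pending', 0) <= max_pending]
        candidates = idle or candidates
    # Consideration 3: random choice, biased by approximate capacity
    weights = [c['weight'] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]['context']
```

A caller would pass something like `needs_big=(defconfig in BIG_CONFIGS)`. The size filter is applied first so that load and weight can never route an allmodconfig build to a small cluster.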

nuclearcat added the enhancement New feature or request label Feb 7, 2024

nuclearcat added this to KernelCI v2 2024 Feb 7, 2024

nuclearcat self-assigned this Feb 7, 2024

nuclearcat added the techdebt Something that works for now, but should be done better label Feb 7, 2024
