Scheduling k8s jobs to multiple clusters #398

Open
nuclearcat opened this issue Jan 25, 2024 · 0 comments
Labels
enhancement New feature or request techdebt Something that works for now, but should be done better

Comments

@nuclearcat
Member

Right now we have a simple scheduler that just picks one hardcoded cluster (context) and submits the job to it.
To scale up and compile many kernels at once, we need to support multiple clusters (contexts) in the config and distribute jobs between them.
As the issue is a bit urgent, I suggest doing it in 2 steps:

First step: implement it in the simplest way possible, i.e. let a runtime list multiple contexts in the config and pick one of them randomly.

The main issue is how to convert:

  k8s-gke-eu-west4:
    lab_type: kubernetes
    context: 'gke_android-kernelci-external_europe-west4-c_kci-eu-west4'

to

  k8s-google:
    lab_type: kubernetes
    context: 
        - 'gke_android-kernelci-external_europe-west4-c_kci-eu-west4'
        - 'gke_android-kernelci-external_europe-west1-d_kci-eu-west1'
        - 'gke_android-kernelci-external_us-west1-a_kci-us-west1'

(We will have 5 clusters in total.)
So when a job is submitted to this runtime, the load will be spread roughly evenly across the available clusters.
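
One way to keep the old single-string form working alongside the new list form is to normalize the `context` value when the config is loaded. This is only a minimal sketch, not the actual scheduler code: the helper name `normalize_contexts` and the plain-dict shape of a runtime entry are assumptions made for illustration.

```python
def normalize_contexts(runtime_config):
    """Return the runtime's contexts as a list, accepting a string or a list.

    `runtime_config` is assumed to be the dict parsed from one runtime
    entry of the YAML config (hypothetical shape).
    """
    value = runtime_config.get("context")
    if value is None:
        return []
    if isinstance(value, str):
        # Old style: a single context name.
        return [value]
    # New style: a list of context names.
    return list(value)


old_style = {
    "lab_type": "kubernetes",
    "context": "gke_android-kernelci-external_europe-west4-c_kci-eu-west4",
}
new_style = {
    "lab_type": "kubernetes",
    "context": [
        "gke_android-kernelci-external_europe-west4-c_kci-eu-west4",
        "gke_android-kernelci-external_europe-west1-d_kci-eu-west1",
    ],
}
print(normalize_contexts(old_style))
print(normalize_contexts(new_style))
```

With this in place, the rest of the code can always treat the contexts as a list, and existing single-context entries do not need to be rewritten.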

I suggest the following:
1) Support multiple contexts in the config.
2) Use a random function to pick one of the contexts,
e.g.

import random
# random.choice avoids an off-by-one: randint(0, len(contexts)) includes the upper bound
context = random.choice(contexts)

Second step: once other teams can start working on other issues and we can spend more time on a more elegant design, we need to design and implement more sophisticated logic that picks the context based on kernel "size", cluster "size", and the load on the cluster.
Several considerations:
1) Some kernels require "big" clusters (RAM and CPU capacity), of which we currently have only 2. These are at least the "allmodconfig" and "gki_defconfig" kernels.
2) In the future we might need to monitor the load on each cluster and pick the context based on it. For example, if a cluster has too many jobs in the pending state, it is better to pick another cluster.
3) We need some kind of "weight" for each cluster that gives us its approximate computing capacity.

The only critical requirement is item 1, as an allmodconfig kernel build will simply run out of memory (OOM) on a small cluster.
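
To make the three considerations concrete, here is a rough sketch of how such a selection could look. Everything in it is an assumption for discussion rather than a proposed final design: the `Cluster` fields, the `max_pending` threshold, the `pick_context` name, and the example cluster names are all hypothetical, and the pending-job count would have to come from real cluster monitoring.

```python
import random
from dataclasses import dataclass

# Kernel configs named in this issue as needing a "big" cluster.
BIG_CONFIGS = {"allmodconfig", "gki_defconfig"}


@dataclass
class Cluster:
    context: str           # kubectl context name
    big: bool              # enough RAM/CPU for the configs in BIG_CONFIGS
    weight: float          # approximate relative computing capacity (> 0)
    pending_jobs: int = 0  # would be filled in from cluster monitoring


def pick_context(clusters, defconfig, max_pending=10, rng=random):
    """Pick a context for one build job.

    Hard rule: "big" configs only go to big clusters.
    Soft rules: prefer clusters with at most `max_pending` pending jobs,
    then choose randomly in proportion to weight.
    """
    candidates = [c for c in clusters if c.big or defconfig not in BIG_CONFIGS]
    if not candidates:
        raise RuntimeError(f"no cluster can build {defconfig}")
    # Fall back to all eligible clusters if every one of them is busy.
    idle = [c for c in candidates if c.pending_jobs <= max_pending] or candidates
    weights = [c.weight for c in idle]
    return rng.choices(idle, weights=weights, k=1)[0].context


clusters = [
    Cluster("small-a", big=False, weight=1.0),
    Cluster("big-a", big=True, weight=4.0, pending_jobs=50),
    Cluster("big-b", big=True, weight=4.0),
]
print(pick_context(clusters, "allmodconfig"))
print(pick_context(clusters, "defconfig"))
```

The size check is a hard filter because it is the critical requirement, while load and weight only bias the choice, so a job still gets scheduled when all eligible clusters are busy.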

@nuclearcat nuclearcat added the enhancement New feature or request label Feb 7, 2024
@nuclearcat nuclearcat self-assigned this Feb 7, 2024
@nuclearcat nuclearcat added the techdebt Something that works for now, but should be done better label Feb 7, 2024