(2/5) [nexus] Add Affinity/Anti-Affinity groups to database #7444

smklein · 2025-01-30T21:12:48Z

Pulled out of #7076

Updates schema to include Affinity/Anti-Affinity groups, but does not use these schemas yet.

hawkw

A few questions, but overall, this looks good.

hawkw · 2025-01-31T00:23:55Z

schema/crdb/dbinit.sql

+    -- TODO Add constraint that if kind is instance, instance_id is not NULL?
+    -- Or will that break backwards compatibility?


do we want to figure this out before merging this branch?

Tying a specific column to only some rows where the column isn't applicable to all rows seems nearly identical to the address column in the dataset table that we just removed.

What if instead of having this table that may contain different types of resources, we instead changed this table from sled_resource to instance_resource and removed the sled_resource_kind enum altogether? If we ever need to track resources other than instances this way we can create a new table.

I think this should also allow us to make the instance-id non-nullable. Can we fill those in with a schema migration?

This is related to the comment above, about backfilling the instance_id: https://github.com/oxidecomputer/omicron/pull/7444/files#r1936496281

Theoretically, if we did the following:

JOIN with the vmm table to fill the instance_id field

Add a constraint that "if kind = instance, instance_id is NOT NULL"

Then we could fail an upgrade, if we were partially through the instance_start saga -- in a spot where the instance was provisioned, but the VMM record didn't exist.

Unfortunately, I do think it's possible for an upgrade to happen with a saga still partway through execution - and in this case, if this happened, the upgrade would get stuck and require manual intervention to be able to apply the CONSTRAINT (the operator would need to find the corresponding instance UUID here and insert it manually).

To be clear my primary concern is not about the fact that the instance_id is nullable, but that the instance_id column is only applicable to one type of sled resource in what is supposed to be a heterogeneous table.

Your point about not being able to use the VMM table because the instance may not be allocated to it yet, and that a running instance won't be part of an (anti-)affinity group make sense for why the instance_id needs to live outside of of the VMM table as well as within it.

I still think this table should become homogeneous, and if so, having instance_id nullable is probably fine. We could also keep this table as is (or make it homogeneous to prevent future fields not applicable to all rows being added) and also:

Add an optional column into the group membership tables to point to the the assigned resource in the sled_resource/vmm_resource table when the instance gets reserved rather than putting it directly in that table. That way only each row in the affinity group tables are explicitly aware of which VMM they are tied to without other users of the sled_resource table later on trying to reason about whether instance_id is valid or should be.

Add an optional column into the group membership tables to point to the the assigned resource in the sled_resource/vmm_resource table when the instance gets reserved rather than putting it directly in that table. That way only each row in the affinity group tables are explicitly aware of which VMM they are tied to without other users of the sled_resource table later on trying to reason about whether instance_id is valid or should be.

Instances may belong to any number of affinity groups - with this proposal, would we need to write to all the "instance membership" tables any time an instance with multi-group-membership starts/stops? My concern here is that the write amplification is unbounded unless we limit the number of affinity groups / anti-affinity groups instances belong to. I also fear this may get more complicated if we allow "things other than instances" to be members (which is the goal - RFD 522 discusses allowing anti-affinity groups to contain other affinity groups as members). If memberships could be from instances or other affinity groups, which rows would we need to update to signify that "there exists a resource on this sled, belonging to this (anti-)affinity group"?

To be clear my primary concern is not about the fact that the instance_id is nullable, but that the instance_id column is only applicable to one type of sled resource in what is supposed to be a heterogeneous table.

I'm down to flatten this out; I might punt it to a follow-up PR, as the "enum of one" for sled resources existed before this chain of PRs.

To be clear my primary concern is not about the fact that the instance_id is nullable, but that the instance_id column is only applicable to one type of sled resource in what is supposed to be a heterogeneous table.

I'm down to flatten this out; I might punt it to a follow-up PR, as the "enum of one" for sled resources existed before this chain of PRs.

That works for me. Thank you!

Add an optional column into the group membership tables to point to the the assigned resource in the sled_resource/vmm_resource table when the instance gets reserved rather than putting it directly in that table. That way only each row in the affinity group tables are explicitly aware of which VMM they are tied to without other users of the sled_resource table later on trying to reason about whether instance_id is valid or should be.

Instances may belong to any number of affinity groups - with this proposal, would we need to write to all the "instance membership" tables any time an instance with multi-group-membership starts/stops? My concern here is that the write amplification is unbounded unless we limit the number of affinity groups / anti-affinity groups instances belong to. I also fear this may get more complicated if we allow "things other than instances" to be members (which is the goal - RFD 522 discusses allowing anti-affinity groups to contain other affinity groups as members). If memberships could be from instances or other affinity groups, which rows would we need to update to signify that "there exists a resource on this sled, belonging to this (anti-)affinity group"?

Oh boy, that is quite the gaping hole in my suggestion! I forgot that instances could belong to multiple affinity groups, and that they could be recursive. This is completely untenable and we definitely should not do it! Thanks for pointing this out, and sorry for the silly suggestion!

I think leaving the field nullable and flattening in a separate PR will cover my concerns. I appreciate the feedback ❤️

hawkw · 2025-01-31T00:27:48Z

schema/crdb/affinity/up11.sql

@@ -0,0 +1 @@
+ALTER TABLE omicron.public.sled_resource ADD COLUMN IF NOT EXISTS instance_id UUID;


This will, I believe, mean that when this migration runs, all existing sled reservations will have instance_id set to NULL, right? At present, all sled resource migrations are for VMMs. So perhaps we should be going and looking at the vmm row corresponding to the id column in sled_resource and using the value of its instance_id field, so that VMMs which have an associated instance will have that instance's ID added to their resource reservation.

I generally agree with you. My hesitation comes from:

omicron/nexus/src/app/sagas/instance_start.rs

Lines 136 to 138 in 7f05fce

builder.append(alloc_server_action());

builder.append(alloc_propolis_ip_action());

builder.append(create_vmm_record_action());

Specifically: the creation of a sled resource record is not atomic with the creation of a VMM record. It is possible that a sled resource exists with no corresponding VMM.

Here's the other part: this record should only exist for instances that are running, right? And if an instance is running, it won't be a part of an affinity group anyway.

So, TL;DR:

To populate this column, an instance would need to be stopped and restarted

After that, we will populate this column during the sled provisioning pathway

All other instances - if they somehow remain "running" through an upgrade, and which may or may not have VMMs - will get fixed when they are stopped / started again.

Specifically: the creation of a sled resource record is not atomic with the creation of a VMM record. It is possible that a sled resource exists with no corresponding VMM.

Hmm, that does make sense. I still find adding the instance ID to sled_resource a bit concerning, but I can't come up with an alternative solution either, so carry on! It might be nice to have a comment explaining this someplace?

Here's the other part: this record should only exist for instances that are running, right? And if an instance is running, it won't be a part of an affinity group anyway.

Just to clarify, you mean that an instance which is running when we perform this migration won't be part of an affinity group, not that running instances are never part of affinity groups, right? I would have thought that running instances were still members of affinity groups, because other instances in that group need to know where that instance is? Or am I misunderstanding?

Specifically: the creation of a sled resource record is not atomic with the creation of a VMM record. It is possible that a sled resource exists with no corresponding VMM.

Hmm, that does make sense. I still find adding the instance ID to sled_resource a bit concerning, but I can't come up with an alternative solution either, so carry on! It might be nice to have a comment explaining this someplace?

I can add some commentary in dbinit.sql

Here's the other part: this record should only exist for instances that are running, right? And if an instance is running, it won't be a part of an affinity group anyway.

Just to clarify, you mean that an instance which is running when we perform this migration won't be part of an affinity group, not that running instances are never part of affinity groups, right? I would have thought that running instances were still members of affinity groups, because other instances in that group need to know where that instance is? Or am I misunderstanding?

That's right - this field being NULL basically means "we won't look up affinity group info for this resource".

I meant to say: Instances can only be added to affinity groups when they're stopped -- if that's the case, there shouldn't be a "sled_resource" row for that instance anyway.

Gotcha, thanks for clarifying! I think comments in dbinit regarding the instance_id in sled_resource would make me happier, at least, although if you end up doing something more like what @andrewjstone is getting at in #7444 (comment), that might render my concerns moot

hawkw · 2025-01-31T00:32:33Z

schema/crdb/dbinit.sql

+    -- The UUID of an instance, if this resource belongs to an instance.
+    instance_id UUID


I assume this is being added so that the sled_resource table can also be used to track which instances are present on a sled? But, this information is already present in the vmm table, where records have both the sled ID of the VMM that the sled is on and the ID of the instance that is running on that VMM. For my education,¹ would you mind explaining what putting instance IDs here accomplishes?

Footnotes

Don't take this as me saying I don't like adding this, I'm just curious about what it accomplishes in the overall design. :) ↩

It's a good question: #7076 (comment)

The TL;DR is that "the VMM table rows is not atomically created with the resource table rows" - it is possible to create a reservation on behalf of an instance without necessarily creating a VMM record.

andrewjstone · 2025-01-31T18:16:28Z

nexus/db-model/src/affinity.rs

+// License, v. 2.0. If a copy of the MPL was not distributed with this
+// file, You can obtain one at https://mozilla.org/MPL/5.0/.
+
+// Copyright 2024 Oxide Computer Company


Nit: It's 2025 now :D

I often omit the copyright notices just to avoid this issue.

In my defense it was 2024 when I added this file. updated

andrewjstone · 2025-01-31T19:43:56Z

schema/crdb/dbinit.sql

+    -- TODO Add constraint that if kind is instance, instance_id is not NULL?
+    -- Or will that break backwards compatibility?


Tying a specific column to only some rows where the column isn't applicable to all rows seems nearly identical to the address column in the dataset table that we just removed.

What if instead of having this table that may contain different types of resources, we instead changed this table from sled_resource to instance_resource and removed the sled_resource_kind enum altogether? If we ever need to track resources other than instances this way we can create a new table.

I think this should also allow us to make the instance-id non-nullable. Can we fill those in with a schema migration?

smklein added 2 commits January 30, 2025 12:19

[nexus] Add Affinity/Anti-Affinity groups to database

4020517

Merge branch 'affinity-api' into affinity-db-model

e3113ff

smklein mentioned this pull request Jan 30, 2025

[nexus] Affinity and Anti-Affinity Groups #7076

Open

44 tasks

smklein requested review from hawkw and gjcolombo January 30, 2025 21:24

hawkw reviewed Jan 31, 2025

View reviewed changes

Merge branch 'affinity-api' into affinity-db-model

a1c97d4

andrewjstone reviewed Jan 31, 2025

View reviewed changes

smklein added 3 commits January 31, 2025 14:25

Merge branch 'affinity-api' into affinity-db-model

8bc8f0c

Typed UUID

900f09c

comments

aba9596

andrewjstone approved these changes Jan 31, 2025

View reviewed changes

hawkw approved these changes Jan 31, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(2/5) [nexus] Add Affinity/Anti-Affinity groups to database #7444

(2/5) [nexus] Add Affinity/Anti-Affinity groups to database #7444

smklein commented Jan 30, 2025

hawkw left a comment •

edited

Loading

hawkw Jan 31, 2025

andrewjstone Jan 31, 2025

smklein Jan 31, 2025

andrewjstone Jan 31, 2025

smklein Jan 31, 2025

smklein Jan 31, 2025

andrewjstone Jan 31, 2025

andrewjstone Jan 31, 2025

hawkw Jan 31, 2025

smklein Jan 31, 2025

hawkw Jan 31, 2025

smklein Jan 31, 2025

hawkw Jan 31, 2025

hawkw Jan 31, 2025

smklein Jan 31, 2025 •

edited

Loading

andrewjstone Jan 31, 2025

smklein Jan 31, 2025

andrewjstone Jan 31, 2025

		-- TODO Add constraint that if kind is instance, instance_id is not NULL?
		-- Or will that break backwards compatibility?

		@@ -0,0 +1 @@
		ALTER TABLE omicron.public.sled_resource ADD COLUMN IF NOT EXISTS instance_id UUID;

	builder.append(alloc_server_action());
	builder.append(alloc_propolis_ip_action());
	builder.append(create_vmm_record_action());

		-- The UUID of an instance, if this resource belongs to an instance.
		instance_id UUID

(2/5) [nexus] Add Affinity/Anti-Affinity groups to database #7444

Are you sure you want to change the base?

(2/5) [nexus] Add Affinity/Anti-Affinity groups to database #7444

Conversation

smklein commented Jan 30, 2025

hawkw left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Footnotes

smklein Jan 31, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hawkw left a comment •

edited

Loading

smklein Jan 31, 2025 •

edited

Loading