Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(2/5) [nexus] Add Affinity/Anti-Affinity groups to database #7444

Open
wants to merge 6 commits into
base: affinity-api
Choose a base branch
from

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Jan 30, 2025

Pulled out of #7076

Updates schema to include Affinity/Anti-Affinity groups, but does not use these schemas yet.

Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions, but overall, this looks good.

Comment on lines 250 to 251
-- TODO Add constraint that if kind is instance, instance_id is not NULL?
-- Or will that break backwards compatibility?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to figure this out before merging this branch?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tying a specific column to only some rows where the column isn't applicable to all rows seems nearly identical to the address column in the dataset table that we just removed.

What if instead of having this table that may contain different types of resources, we instead changed this table from sled_resource to instance_resource and removed the sled_resource_kind enum altogether? If we ever need to track resources other than instances this way we can create a new table.

I think this should also allow us to make the instance-id non-nullable. Can we fill those in with a schema migration?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to the comment above, about backfilling the instance_id: https://github.com/oxidecomputer/omicron/pull/7444/files#r1936496281

Theoretically, if we did the following:

  • JOIN with the vmm table to fill the instance_id field
  • Add a constraint that "if kind = instance, instance_id is NOT NULL"

Then we could fail an upgrade, if we were partially through the instance_start saga -- in a spot where the instance was provisioned, but the VMM record didn't exist.

Unfortunately, I do think it's possible for an upgrade to happen with a saga still partway through execution - and in this case, if this happened, the upgrade would get stuck and require manual intervention to be able to apply the CONSTRAINT (the operator would need to find the corresponding instance UUID here and insert it manually).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear my primary concern is not about the fact that the instance_id is nullable, but that the instance_id column is only applicable to one type of sled resource in what is supposed to be a heterogeneous table.

Your point about not being able to use the VMM table because the instance may not be allocated to it yet, and that a running instance won't be part of an (anti-)affinity group make sense for why the instance_id needs to live outside of of the VMM table as well as within it.

I still think this table should become homogeneous, and if so, having instance_id nullable is probably fine. We could also keep this table as is (or make it homogeneous to prevent future fields not applicable to all rows being added) and also:

Add an optional column into the group membership tables to point to the the assigned resource in the sled_resource/vmm_resource table when the instance gets reserved rather than putting it directly in that table. That way only each row in the affinity group tables are explicitly aware of which VMM they are tied to without other users of the sled_resource table later on trying to reason about whether instance_id is valid or should be.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an optional column into the group membership tables to point to the the assigned resource in the sled_resource/vmm_resource table when the instance gets reserved rather than putting it directly in that table. That way only each row in the affinity group tables are explicitly aware of which VMM they are tied to without other users of the sled_resource table later on trying to reason about whether instance_id is valid or should be.

Instances may belong to any number of affinity groups - with this proposal, would we need to write to all the "instance membership" tables any time an instance with multi-group-membership starts/stops? My concern here is that the write amplification is unbounded unless we limit the number of affinity groups / anti-affinity groups instances belong to. I also fear this may get more complicated if we allow "things other than instances" to be members (which is the goal - RFD 522 discusses allowing anti-affinity groups to contain other affinity groups as members). If memberships could be from instances or other affinity groups, which rows would we need to update to signify that "there exists a resource on this sled, belonging to this (anti-)affinity group"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear my primary concern is not about the fact that the instance_id is nullable, but that the instance_id column is only applicable to one type of sled resource in what is supposed to be a heterogeneous table.

I'm down to flatten this out; I might punt it to a follow-up PR, as the "enum of one" for sled resources existed before this chain of PRs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear my primary concern is not about the fact that the instance_id is nullable, but that the instance_id column is only applicable to one type of sled resource in what is supposed to be a heterogeneous table.

I'm down to flatten this out; I might punt it to a follow-up PR, as the "enum of one" for sled resources existed before this chain of PRs.

That works for me. Thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an optional column into the group membership tables to point to the the assigned resource in the sled_resource/vmm_resource table when the instance gets reserved rather than putting it directly in that table. That way only each row in the affinity group tables are explicitly aware of which VMM they are tied to without other users of the sled_resource table later on trying to reason about whether instance_id is valid or should be.

Instances may belong to any number of affinity groups - with this proposal, would we need to write to all the "instance membership" tables any time an instance with multi-group-membership starts/stops? My concern here is that the write amplification is unbounded unless we limit the number of affinity groups / anti-affinity groups instances belong to. I also fear this may get more complicated if we allow "things other than instances" to be members (which is the goal - RFD 522 discusses allowing anti-affinity groups to contain other affinity groups as members). If memberships could be from instances or other affinity groups, which rows would we need to update to signify that "there exists a resource on this sled, belonging to this (anti-)affinity group"?

Oh boy, that is quite the gaping hole in my suggestion! I forgot that instances could belong to multiple affinity groups, and that they could be recursive. This is completely untenable and we definitely should not do it! Thanks for pointing this out, and sorry for the silly suggestion!

I think leaving the field nullable and flattening in a separate PR will cover my concerns. I appreciate the feedback ❤️

@@ -0,0 +1 @@
ALTER TABLE omicron.public.sled_resource ADD COLUMN IF NOT EXISTS instance_id UUID;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will, I believe, mean that when this migration runs, all existing sled reservations will have instance_id set to NULL, right? At present, all sled resource migrations are for VMMs. So perhaps we should be going and looking at the vmm row corresponding to the id column in sled_resource and using the value of its instance_id field, so that VMMs which have an associated instance will have that instance's ID added to their resource reservation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally agree with you. My hesitation comes from:

builder.append(alloc_server_action());
builder.append(alloc_propolis_ip_action());
builder.append(create_vmm_record_action());

Specifically: the creation of a sled resource record is not atomic with the creation of a VMM record. It is possible that a sled resource exists with no corresponding VMM.

Here's the other part: this record should only exist for instances that are running, right? And if an instance is running, it won't be a part of an affinity group anyway.

So, TL;DR:

  • To populate this column, an instance would need to be stopped and restarted
  • After that, we will populate this column during the sled provisioning pathway
  • All other instances - if they somehow remain "running" through an upgrade, and which may or may not have VMMs - will get fixed when they are stopped / started again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically: the creation of a sled resource record is not atomic with the creation of a VMM record. It is possible that a sled resource exists with no corresponding VMM.

Hmm, that does make sense. I still find adding the instance ID to sled_resource a bit concerning, but I can't come up with an alternative solution either, so carry on! It might be nice to have a comment explaining this someplace?

Here's the other part: this record should only exist for instances that are running, right? And if an instance is running, it won't be a part of an affinity group anyway.

Just to clarify, you mean that an instance which is running when we perform this migration won't be part of an affinity group, not that running instances are never part of affinity groups, right? I would have thought that running instances were still members of affinity groups, because other instances in that group need to know where that instance is? Or am I misunderstanding?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically: the creation of a sled resource record is not atomic with the creation of a VMM record. It is possible that a sled resource exists with no corresponding VMM.

Hmm, that does make sense. I still find adding the instance ID to sled_resource a bit concerning, but I can't come up with an alternative solution either, so carry on! It might be nice to have a comment explaining this someplace?

I can add some commentary in dbinit.sql

Here's the other part: this record should only exist for instances that are running, right? And if an instance is running, it won't be a part of an affinity group anyway.

Just to clarify, you mean that an instance which is running when we perform this migration won't be part of an affinity group, not that running instances are never part of affinity groups, right? I would have thought that running instances were still members of affinity groups, because other instances in that group need to know where that instance is? Or am I misunderstanding?

That's right - this field being NULL basically means "we won't look up affinity group info for this resource".

I meant to say: Instances can only be added to affinity groups when they're stopped -- if that's the case, there shouldn't be a "sled_resource" row for that instance anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, thanks for clarifying! I think comments in dbinit regarding the instance_id in sled_resource would make me happier, at least, although if you end up doing something more like what @andrewjstone is getting at in #7444 (comment), that might render my concerns moot

Comment on lines 247 to 248
-- The UUID of an instance, if this resource belongs to an instance.
instance_id UUID
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is being added so that the sled_resource table can also be used to track which instances are present on a sled? But, this information is already present in the vmm table, where records have both the sled ID of the VMM that the sled is on and the ID of the instance that is running on that VMM. For my education,1 would you mind explaining what putting instance IDs here accomplishes?

Footnotes

  1. Don't take this as me saying I don't like adding this, I'm just curious about what it accomplishes in the overall design. :)

Copy link
Collaborator Author

@smklein smklein Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good question: #7076 (comment)

The TL;DR is that "the VMM table rows is not atomically created with the resource table rows" - it is possible to create a reservation on behalf of an instance without necessarily creating a VMM record.

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/5.0/.

// Copyright 2024 Oxide Computer Company
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It's 2025 now :D

I often omit the copyright notices just to avoid this issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my defense it was 2024 when I added this file. updated

Comment on lines 250 to 251
-- TODO Add constraint that if kind is instance, instance_id is not NULL?
-- Or will that break backwards compatibility?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tying a specific column to only some rows where the column isn't applicable to all rows seems nearly identical to the address column in the dataset table that we just removed.

What if instead of having this table that may contain different types of resources, we instead changed this table from sled_resource to instance_resource and removed the sled_resource_kind enum altogether? If we ever need to track resources other than instances this way we can create a new table.

I think this should also allow us to make the instance-id non-nullable. Can we fill those in with a schema migration?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants