-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(2/5) [nexus] Add Affinity/Anti-Affinity groups to database #7444
base: affinity-api
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions, but overall, this looks good.
schema/crdb/dbinit.sql
Outdated
-- TODO Add constraint that if kind is instance, instance_id is not NULL? | ||
-- Or will that break backwards compatibility? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to figure this out before merging this branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tying a specific column to only some rows where the column isn't applicable to all rows seems nearly identical to the address
column in the dataset
table that we just removed.
What if instead of having this table that may contain different types of resources, we instead changed this table from sled_resource
to instance_resource
and removed the sled_resource_kind
enum altogether? If we ever need to track resources other than instances this way we can create a new table.
I think this should also allow us to make the instance-id
non-nullable. Can we fill those in with a schema migration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is related to the comment above, about backfilling the instance_id
: https://github.com/oxidecomputer/omicron/pull/7444/files#r1936496281
Theoretically, if we did the following:
- JOIN with the vmm table to fill the
instance_id
field - Add a constraint that "if kind = instance, instance_id is NOT NULL"
Then we could fail an upgrade, if we were partially through the instance_start
saga -- in a spot where the instance was provisioned, but the VMM record didn't exist.
Unfortunately, I do think it's possible for an upgrade to happen with a saga still partway through execution - and in this case, if this happened, the upgrade would get stuck and require manual intervention to be able to apply the CONSTRAINT (the operator would need to find the corresponding instance UUID here and insert it manually).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear my primary concern is not about the fact that the instance_id
is nullable, but that the instance_id
column is only applicable to one type of sled resource in what is supposed to be a heterogeneous table.
Your point about not being able to use the VMM table because the instance may not be allocated to it yet, and that a running instance won't be part of an (anti-)affinity group make sense for why the instance_id
needs to live outside of of the VMM table as well as within it.
I still think this table should become homogeneous, and if so, having instance_id nullable is probably fine. We could also keep this table as is (or make it homogeneous to prevent future fields not applicable to all rows being added) and also:
Add an optional column into the group membership tables to point to the the assigned resource in the sled_resource
/vmm_resource
table when the instance gets reserved rather than putting it directly in that table. That way only each row in the affinity group tables are explicitly aware of which VMM they are tied to without other users of the sled_resource
table later on trying to reason about whether instance_id
is valid or should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an optional column into the group membership tables to point to the the assigned resource in the sled_resource/vmm_resource table when the instance gets reserved rather than putting it directly in that table. That way only each row in the affinity group tables are explicitly aware of which VMM they are tied to without other users of the sled_resource table later on trying to reason about whether instance_id is valid or should be.
Instances may belong to any number of affinity groups - with this proposal, would we need to write to all the "instance membership" tables any time an instance with multi-group-membership starts/stops? My concern here is that the write amplification is unbounded unless we limit the number of affinity groups / anti-affinity groups instances belong to. I also fear this may get more complicated if we allow "things other than instances" to be members (which is the goal - RFD 522 discusses allowing anti-affinity groups to contain other affinity groups as members). If memberships could be from instances or other affinity groups, which rows would we need to update to signify that "there exists a resource on this sled, belonging to this (anti-)affinity group"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear my primary concern is not about the fact that the instance_id is nullable, but that the instance_id column is only applicable to one type of sled resource in what is supposed to be a heterogeneous table.
I'm down to flatten this out; I might punt it to a follow-up PR, as the "enum of one" for sled resources existed before this chain of PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear my primary concern is not about the fact that the instance_id is nullable, but that the instance_id column is only applicable to one type of sled resource in what is supposed to be a heterogeneous table.
I'm down to flatten this out; I might punt it to a follow-up PR, as the "enum of one" for sled resources existed before this chain of PRs.
That works for me. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an optional column into the group membership tables to point to the the assigned resource in the sled_resource/vmm_resource table when the instance gets reserved rather than putting it directly in that table. That way only each row in the affinity group tables are explicitly aware of which VMM they are tied to without other users of the sled_resource table later on trying to reason about whether instance_id is valid or should be.
Instances may belong to any number of affinity groups - with this proposal, would we need to write to all the "instance membership" tables any time an instance with multi-group-membership starts/stops? My concern here is that the write amplification is unbounded unless we limit the number of affinity groups / anti-affinity groups instances belong to. I also fear this may get more complicated if we allow "things other than instances" to be members (which is the goal - RFD 522 discusses allowing anti-affinity groups to contain other affinity groups as members). If memberships could be from instances or other affinity groups, which rows would we need to update to signify that "there exists a resource on this sled, belonging to this (anti-)affinity group"?
Oh boy, that is quite the gaping hole in my suggestion! I forgot that instances could belong to multiple affinity groups, and that they could be recursive. This is completely untenable and we definitely should not do it! Thanks for pointing this out, and sorry for the silly suggestion!
I think leaving the field nullable and flattening in a separate PR will cover my concerns. I appreciate the feedback ❤️
@@ -0,0 +1 @@ | |||
ALTER TABLE omicron.public.sled_resource ADD COLUMN IF NOT EXISTS instance_id UUID; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will, I believe, mean that when this migration runs, all existing sled reservations will have instance_id
set to NULL
, right? At present, all sled resource migrations are for VMMs. So perhaps we should be going and looking at the vmm
row corresponding to the id
column in sled_resource
and using the value of its instance_id
field, so that VMMs which have an associated instance will have that instance's ID added to their resource reservation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I generally agree with you. My hesitation comes from:
omicron/nexus/src/app/sagas/instance_start.rs
Lines 136 to 138 in 7f05fce
builder.append(alloc_server_action()); | |
builder.append(alloc_propolis_ip_action()); | |
builder.append(create_vmm_record_action()); |
Specifically: the creation of a sled resource record is not atomic with the creation of a VMM record. It is possible that a sled resource exists with no corresponding VMM.
Here's the other part: this record should only exist for instances that are running, right? And if an instance is running, it won't be a part of an affinity group anyway.
So, TL;DR:
- To populate this column, an instance would need to be stopped and restarted
- After that, we will populate this column during the sled provisioning pathway
- All other instances - if they somehow remain "running" through an upgrade, and which may or may not have VMMs - will get fixed when they are stopped / started again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically: the creation of a sled resource record is not atomic with the creation of a VMM record. It is possible that a sled resource exists with no corresponding VMM.
Hmm, that does make sense. I still find adding the instance ID to sled_resource
a bit concerning, but I can't come up with an alternative solution either, so carry on! It might be nice to have a comment explaining this someplace?
Here's the other part: this record should only exist for instances that are running, right? And if an instance is running, it won't be a part of an affinity group anyway.
Just to clarify, you mean that an instance which is running when we perform this migration won't be part of an affinity group, not that running instances are never part of affinity groups, right? I would have thought that running instances were still members of affinity groups, because other instances in that group need to know where that instance is? Or am I misunderstanding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically: the creation of a sled resource record is not atomic with the creation of a VMM record. It is possible that a sled resource exists with no corresponding VMM.
Hmm, that does make sense. I still find adding the instance ID to
sled_resource
a bit concerning, but I can't come up with an alternative solution either, so carry on! It might be nice to have a comment explaining this someplace?
I can add some commentary in dbinit.sql
Here's the other part: this record should only exist for instances that are running, right? And if an instance is running, it won't be a part of an affinity group anyway.
Just to clarify, you mean that an instance which is running when we perform this migration won't be part of an affinity group, not that running instances are never part of affinity groups, right? I would have thought that running instances were still members of affinity groups, because other instances in that group need to know where that instance is? Or am I misunderstanding?
That's right - this field being NULL
basically means "we won't look up affinity group info for this resource".
I meant to say: Instances can only be added to affinity groups when they're stopped -- if that's the case, there shouldn't be a "sled_resource" row for that instance anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, thanks for clarifying! I think comments in dbinit regarding the instance_id
in sled_resource
would make me happier, at least, although if you end up doing something more like what @andrewjstone is getting at in #7444 (comment), that might render my concerns moot
schema/crdb/dbinit.sql
Outdated
-- The UUID of an instance, if this resource belongs to an instance. | ||
instance_id UUID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is being added so that the sled_resource
table can also be used to track which instances are present on a sled? But, this information is already present in the vmm
table, where records have both the sled ID of the VMM that the sled is on and the ID of the instance that is running on that VMM. For my education,1 would you mind explaining what putting instance IDs here accomplishes?
Footnotes
-
Don't take this as me saying I don't like adding this, I'm just curious about what it accomplishes in the overall design. :) ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good question: #7076 (comment)
The TL;DR is that "the VMM table rows is not atomically created with the resource table rows" - it is possible to create a reservation on behalf of an instance without necessarily creating a VMM record.
nexus/db-model/src/affinity.rs
Outdated
// License, v. 2.0. If a copy of the MPL was not distributed with this | ||
// file, You can obtain one at https://mozilla.org/MPL/5.0/. | ||
|
||
// Copyright 2024 Oxide Computer Company |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: It's 2025 now :D
I often omit the copyright notices just to avoid this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my defense it was 2024 when I added this file. updated
schema/crdb/dbinit.sql
Outdated
-- TODO Add constraint that if kind is instance, instance_id is not NULL? | ||
-- Or will that break backwards compatibility? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tying a specific column to only some rows where the column isn't applicable to all rows seems nearly identical to the address
column in the dataset
table that we just removed.
What if instead of having this table that may contain different types of resources, we instead changed this table from sled_resource
to instance_resource
and removed the sled_resource_kind
enum altogether? If we ever need to track resources other than instances this way we can create a new table.
I think this should also allow us to make the instance-id
non-nullable. Can we fill those in with a schema migration?
Pulled out of #7076
Updates schema to include Affinity/Anti-Affinity groups, but does not use these schemas yet.