
[CORE-10974] Add bandwidth QoS controls #9881

Open · wants to merge 12 commits into master from pedro-CORE-10974

Conversation

@coutinhop (Member) commented Feb 21, 2025

Description

Add bandwidth limiting for ingress and egress, implemented via `tc qdisc` configuration based on the {ingress,egress} bandwidth and burst fields of the QoSControls struct in WorkloadEndpoint. In Kubernetes, these values come from annotations on the pods.
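For orientation, a rough sketch of the fields being described (illustrative only; the authoritative struct lives in the libcalico-go WorkloadEndpoint API and may carry more fields):

```go
// Illustrative sketch only, not the actual API definition.
type QoSControls struct {
	IngressBandwidth int64 // bits per second
	EgressBandwidth  int64 // bits per second
	IngressBurst     int64 // bits
	EgressBurst      int64 // bits
}
```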

Felix FV tests added to verify that the `tc qdisc` configuration is properly added and that the bandwidth is properly limited with `iperf3`.

Some code borrowed from the Kubernetes bandwidth plugin (credit given).

Related issues/PRs

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Add bandwidth limiting QoS control to workloads.

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@coutinhop coutinhop self-assigned this Feb 21, 2025
@coutinhop coutinhop requested a review from a team as a code owner February 21, 2025 01:41
@marvin-tigera marvin-tigera added this to the Calico v3.30.0 milestone Feb 21, 2025
@marvin-tigera marvin-tigera added the release-note-required (Change has user-facing impact, no matter how small) and docs-pr-required (Change is not yet documented) labels Feb 21, 2025
@coutinhop coutinhop force-pushed the pedro-CORE-10974 branch 2 times, most recently from 3e04ce5 to 60759ea on February 21, 2025 02:11
@nelljerram (Member) left a comment

Some questions and corrections...

return fmt.Errorf("create ifb qdisc: %s", err)
}
return nil
}
Member

Why is there so much asymmetry between the ingress and egress coding?

Member Author

From my understanding, we can only add one tbf qdisc per interface, and it only works for ingress. So for egress, the workaround is to create an ingress qdisc, an ifb interface, mirror the egress traffic from the original interface to the ifb, and then add the tbf qdisc to the ifb interface...
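For readers following the thread, a minimal sketch of that workaround using the vishvananda/netlink library (names and error handling are simplified and illustrative; this is not the PR's exact code):

```go
import (
	"fmt"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// setupEgressLimit mirrors traffic arriving on the workload's host-side
// interface to an ifb device and attaches the tbf qdisc there, since tbf
// can only shape traffic leaving the interface it is attached to.
func setupEgressLimit(hostIface, ifbName string, rate uint64, buffer, limit uint32) error {
	host, err := netlink.LinkByName(hostIface)
	if err != nil {
		return fmt.Errorf("lookup %s: %w", hostIface, err)
	}
	ifb := &netlink.Ifb{LinkAttrs: netlink.LinkAttrs{Name: ifbName}}
	if err := netlink.LinkAdd(ifb); err != nil {
		return fmt.Errorf("create ifb: %w", err)
	}
	if err := netlink.LinkSetUp(ifb); err != nil {
		return fmt.Errorf("set ifb up: %w", err)
	}
	// Ingress qdisc on the workload interface, so a filter can attach to it.
	ingress := &netlink.Ingress{QdiscAttrs: netlink.QdiscAttrs{
		LinkIndex: host.Attrs().Index,
		Handle:    netlink.MakeHandle(0xffff, 0),
		Parent:    netlink.HANDLE_INGRESS,
	}}
	if err := netlink.QdiscAdd(ingress); err != nil {
		return fmt.Errorf("create ingress qdisc: %w", err)
	}
	// Match-all u32 filter with a mirred action redirecting packets to the ifb.
	filter := &netlink.U32{
		FilterAttrs: netlink.FilterAttrs{
			LinkIndex: host.Attrs().Index,
			Parent:    netlink.MakeHandle(0xffff, 0),
			Protocol:  unix.ETH_P_ALL,
		},
		Actions: []netlink.Action{netlink.NewMirredAction(ifb.Attrs().Index)},
	}
	if err := netlink.FilterAdd(filter); err != nil {
		return fmt.Errorf("add mirred filter: %w", err)
	}
	// The tbf qdisc then shapes the mirrored traffic on the ifb device.
	tbf := &netlink.Tbf{
		QdiscAttrs: netlink.QdiscAttrs{
			LinkIndex: ifb.Attrs().Index,
			Handle:    netlink.MakeHandle(1, 0),
			Parent:    netlink.HANDLE_ROOT,
		},
		Rate:   rate,
		Buffer: buffer,
		Limit:  limit,
	}
	if err := netlink.QdiscAdd(tbf); err != nil {
		return fmt.Errorf("create ifb qdisc: %w", err)
	}
	return nil
}
```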


Providing some information here, if this can help:

  • LinuxBridge (OpenStack Neutron plugin) uses tbf for ingress and police filter for egress (reflected by the list filter code).
  • Libvirt uses htb for ingress and police for egress. Notice that the direction is reversed because libvirt is doing rx (ingress) limit request -> police and tx (egress) limit request -> htb, which is correct from the hypervisor perspective, but reversed from the interface/VM's perspective.
  • Regardless, all of the tools use the token bucket algorithm (tbf).

- Improve error messages
- Fix ingress bandwidth testing
- Other misc corrections
coutinhop added a commit to coutinhop/operator that referenced this pull request Feb 22, 2025
It will be replaced by felix functionality. Needs to merge in sync with
projectcalico/calico#9881.
@fasaxc (Member) left a comment

Couple of queries from me but it looks like Nell has done a more thorough review already!

I suppose my general question is how do we make sure everything always gets cleaned up, bearing in mind that the CNI plugin can remove the cali device under felix's feet. Will we still clean up any left-over bandwidth control device?

Do we need to worry about races when changing the limits? You remove the limit and then re-add it, so there'll be a moment when a pod could send too much data. Might be OK, depending on the use case (and it might not be possible to adjust the value on the fly).

@coutinhop (Member Author) commented Feb 26, 2025

Thank you!

> Couple of queries from me but it looks like Nell has done a more thorough review already!
>
> I suppose my general question is how do we make sure everything always gets cleaned up, bearing in mind that the CNI plugin can remove the cali device under felix's feet. Will we still clean up any left-over bandwidth control device?

Yes, the tc qdisc configuration is removed when the interface is removed, regardless of who does it (felix or the CNI plugin), so that takes care of the ingress controls. The egress controls (which use a mirror ifb interface) are torn down by felix regardless of who removed the "main" interface...

> Do we need to worry about races when changing the limits? You remove the limit and then re-add it, so there'll be a moment when a pod could send too much data. Might be OK, depending on the use case (and it might not be possible to adjust the value on the fly).

Hm, right, my laziness is probably to blame here, but you make a good point on altering the config in place instead of removing and re-adding. Will do!
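A minimal sketch of the in-place update with the vishvananda/netlink library (assumed helper shape, not the PR's final code): `netlink.QdiscReplace` issues a single netlink replace, so the tbf parameters change atomically and there is no window with the limit removed.

```go
// Sketch: replace the tbf parameters in place rather than del+add.
func updateIngressQdisc(link netlink.Link, rate uint64, buffer, limit uint32) error {
	qdisc := &netlink.Tbf{
		QdiscAttrs: netlink.QdiscAttrs{
			LinkIndex: link.Attrs().Index,
			Handle:    netlink.MakeHandle(1, 0),
			Parent:    netlink.HANDLE_ROOT,
		},
		Rate:   rate,
		Buffer: buffer,
		Limit:  limit,
	}
	if err := netlink.QdiscReplace(qdisc); err != nil {
		return fmt.Errorf("replace tbf qdisc on %s: %w", link.Attrs().Name, err)
	}
	return nil
}
```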

@coutinhop coutinhop removed the release-note-required (Change has user-facing impact, no matter how small) label Feb 26, 2025
@marvin-tigera marvin-tigera added the release-note-required (Change has user-facing impact, no matter how small) label Feb 26, 2025
Guard QoS setup against BPF mode.

Add cleanup of QoS for when BPF is enabled.

Update QoS bandwidth in place instead of remove+readd.

Minor test fixes.
if oldIngressBw != 0 {
err := removeIngressQdisc(old.Name)
if oldName != newName || oldIngressBw != newIngressBw || oldIngressBurst != newIngressBurst {
if oldName == newName && oldIngressBw != 0 && newIngressBw != 0 {
Member

This logic seems to assume that the old values match the dataplane; we try not to make such assumptions in Felix. So what I'd expect to see is logic that reads back the dataplane state, compares it with the desired state and then "fixes" the dataplane by adding/removing/changing the qdisc.

I'd probably use a DeltaTracker for this with a struct that tracks the state of the dataplane:

type qosState struct {
	IngressBandwidth int
	IngressBurst     int
}

then you need to be able to read back those values from the dataplane to populate the `Dataplane()` side of the delta tracker, and populate the `Desired()` side with the values from the datastore.

Then the delta tracker has methods for iterating over all the updates/additions/deletions that are needed.
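To illustrate the read-back-and-fix pattern independent of the DeltaTracker API (hypothetical helper names; the actual tracker is richer than this sketch):

```go
// Hypothetical reconcile sketch: read the dataplane back via netlink rather
// than trusting cached "old" values, then converge it to the desired state.
// readIngressQoS, addIngressQdisc, updateIngressQdisc and removeIngressQdisc
// are illustrative names, not the PR's actual helpers.
func reconcileIngressQoS(ifaceName string, desired *qosState) error {
	actual, err := readIngressQoS(ifaceName) // e.g. netlink.QdiscList + match the tbf qdisc
	if err != nil {
		return fmt.Errorf("read qdisc state for %s: %w", ifaceName, err)
	}
	switch {
	case desired == nil && actual == nil:
		return nil // nothing configured, nothing wanted
	case desired == nil:
		return removeIngressQdisc(ifaceName) // limit no longer wanted
	case actual == nil:
		return addIngressQdisc(ifaceName, desired) // limit missing from dataplane
	case *actual != *desired:
		return updateIngressQdisc(ifaceName, desired) // fix in place
	}
	return nil
}
```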

@zhanz1 left a comment

Introducing myself as I'm making some comments in this PR: this is Zhan from the Bloomberg team. We had a meeting with Nell and Patrick and discussed a parameter that is missing from the Neutron bandwidth QoS API but that we'd like to have in Calico. Please see my comments for more details. Thanks!


latency := latencyInUsec(latencyInMillis)
limitInBytes := limit(uint64(rateInBytes), latency, uint32(burstInBytes))

qdisc := &netlink.Tbf{
@zhanz1 commented Feb 27, 2025

When creating tbf for ingress & egress traffic, we may also need to specify peakrate, which allows users to specify the maximum depletion rate of the bucket (i.e., burst/buffer). @nelljerram This is what I was referring to during our meeting, the parameter that is missing on the Neutron QoS API.

Also, some clarifications on the Neutron QoS parameters (as we discussed in our meeting): Even though it says max_burst_kbps in the table, it should be max-burst-kbits. For example, in the "notes" section of the doc, they wrote that the burst value "will be 800kbit". Looking into the Neutron code and taking LinuxBridge as an example again, even though it says max_burst_kbps when setting up tbf, the underlying function still treats it as burst_kb, not kbps. We will sync up with the Neutron community on our end to get clarification on this.

Currently, the Neutron API only supports max-kbps and max-burst-kbits, which map to:

Neutron               tc-tbf/tc-police/tc-htb
-----------------------------------------------
max-kbps              rate
max-burst-kbits       burst

According to the token bucket algorithm, if we don't specify peakrate, interfaces will be allowed to transmit/receive burst bytes at the maximum available bandwidth on the machine, assuming the bucket is full at the start. This is less than ideal. However, with peakrate, we can control the speed at which tokens flow out of the bucket, which is equivalent to controlling the burst bandwidth.

Since the Neutron QoS API does not support peakrate yet, we can do something just like the limits on connections: a default value that applies to all. We will work with the Neutron community on getting this in.

In terms of the code here, I do see that peakrate is supported in the library you use: tbf, police, and htb (in htb it is called ceil). However, my experience with Go is limited, so please double-check and see if this makes sense.

Member Author

Thanks @zhanz1 for the clarification re: burst, this does make sense. We'll work on adding peakrate to the Tbf config (it is indeed supported; I'll only double-check any necessary unit conversion).
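For the record, a sketch of what setting peakrate on the Tbf config might look like with the vishvananda/netlink library (variable names and unit conversions are assumptions, per the caveat above):

```go
// Sketch only: tbf with peakrate. The *BytesPerSec/*Bytes/*Ticks values are
// assumed to be pre-converted from the user-facing kbit parameters.
qdisc := &netlink.Tbf{
	QdiscAttrs: netlink.QdiscAttrs{
		LinkIndex: link.Attrs().Index,
		Handle:    netlink.MakeHandle(1, 0),
		Parent:    netlink.HANDLE_ROOT,
	},
	Rate:     rateBytesPerSec,     // long-term average rate
	Buffer:   bufferTicks,         // main bucket size, in scheduler ticks
	Limit:    limitBytes,          // bytes that may queue waiting for tokens
	Peakrate: peakrateBytesPerSec, // caps how fast the bucket may drain
	Minburst: minburstBytes,       // peakrate bucket size; typically the MTU
}
if err := netlink.QdiscReplace(qdisc); err != nil {
	return fmt.Errorf("configure tbf qdisc: %w", err)
}
```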

Member

@zhanz1 Do you know if someone is working to address the misnaming of max-burst-kbits on the Neutron API? I guess it will probably be best just to explain this prominently in the API documentation, instead of creating all the pain of changing the field name.

@zhanz1 commented Feb 28, 2025

@coutinhop Thanks for taking a look!
@nelljerram We do have a plan to ask the Neutron community about this issue in the future, but we haven't prioritized that yet. In that conversation, ideally, we want to make sure this parameter is not used or understood differently by other Neutron drivers and their QoS extensions (e.g., another Neutron driver treating it as kbps instead of kbits, although I think this is unlikely), even though in the tc case it is bits rather than bps.

Change logic of maybeUpdateQoSBandwidth() to consider the dataplane
state (actual `tc qdisc` config) instead of the datastore values.

Fix QoS cleanup on BPF startup to not short-circuit on errors.
@nelljerram (Member) left a comment

Just filing my comments so far. I still have more to look at.

@@ -317,6 +318,22 @@ func EnsureQdisc(ifaceName string) (bool, error) {
		log.WithField("iface", ifaceName).Debug("Already have a clsact qdisc on this interface")
		return true, nil
	}

	// Clean up QoS config if it exists
Member

I think it's worth commenting here that this is for transition from an iptables dataplane to eBPF, and reflects that it's currently impossible for eBPF and QoS to coexist.

func GetTBFValues(rateInBits, burstInBits uint64) (rateInBytes uint64, bufferInBytes, limitInBytes uint32) {
	rateInBytes = rateInBits / 8
	burstInBytes := burstInBits / 8
	bufferInBytes = buffer(uint64(rateInBytes), uint32(burstInBytes))
Member

Can we inline `buffer` here? I find the levels of indirection make this code hard to follow.
Ditto for `latencyInUsec` and `limit` in the next two lines.
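To illustrate, the inlined version might read roughly like this (a sketch assuming the helpers match the CNI bandwidth plugin versions they were borrowed from; `rateBits`, `burstBits`, and `latencyMillis` are assumed inputs):

```go
// Sketch of the suggested inlining (illustrative, not the merged code).
rateBytes := rateBits / 8
burstBytes := burstBits / 8
// Time (in usec) to transmit one burst at the configured rate...
bufferUsec := float64(burstBytes) * 1000000.0 / float64(rateBytes)
// ...converted to kernel scheduler ticks for the tbf "buffer" parameter.
bufferTicks := uint32(bufferUsec * netlink.TickInUsec())
// Queue limit: bytes that can arrive during latencyMillis at the configured
// rate, plus one burst.
latencyUsec := latencyMillis * 1000.0
limitBytes := uint32(float64(rateBytes)*latencyUsec/1000000.0) + uint32(burstBytes)
```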

Member

Ditto for `time2Tick` in `buffer`.

Member Author

Done.

}

func time2Tick(time uint32) uint32 {
	return uint32(float64(time) * float64(netlink.TickInUsec()))
Member

Also, I think it would be good to rename `time` to `timeUsec`.

}

func buffer(rate uint64, burst uint32) uint32 {
	return time2Tick(uint32(float64(burst) * float64(netlink.TIME_UNITS_PER_SEC) / float64(rate)))
Member

TIME_UNITS_PER_SEC is actually USEC_PER_SEC, so this is:

BURST_BYTES * USEC_PER_SEC / RATE_BYTES_PER_SEC

= the time in usec for a BURST to be transmitted at RATE

applying time2Tick => the number of ticks needed for a BURST to be transmitted at RATE
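For example, with illustrative numbers (not from the PR): at rate = 1,250,000 bytes/s (10 Mbit/s) and burst = 125,000 bytes (1 Mbit), we get 125000 × 1000000 / 1250000 = 100,000 usec, i.e. the 0.1 s needed to transmit one burst at the configured rate; `time2Tick` then converts that duration into kernel scheduler ticks.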

Member

I don't see how it's then correct to handle that as a number of bytes.

}

func latencyInUsec(latencyInMillis float64) float64 {
	return float64(netlink.TIME_UNITS_PER_SEC) * (latencyInMillis / 1000.0)
Member

It's OK, I see that TIME_UNITS_PER_SEC is actually USEC_PER_SEC. Can we replace `netlink.TIME_UNITS_PER_SEC` everywhere by 1000000? That would be more comprehensible.

Member

(Or, if you really want a constant, by a constant named ONE_MILLION.)


- Generally use TokenBucketState pointers
- Use *TokenBucketState more widely
@nelljerram (Member) left a comment

Filing some more comments. I've made a branch at https://github.com/nelljerram/calico/tree/pedro-CORE-10974 with some of the refactorings I've suggested.

nelljerram and others added 2 commits February 28, 2025 17:10
- Merge suggested changes. A separate 'qos' package still needed to be split from 'intdataplane' in order to avoid an import loop.
- Properly attribute and copyright code from the CNI bandwidth plugin
- Add a proper update function for kdd workloads to the felix FV infra
- Adjust test expectations (using 'Consistently()' where appropriate)
- Correctly use '%w' for outputting errors
- Inline calculations used for converting rate and burst in bits {per second,} to TokenBucketState{} values
- Other misc changes
Labels
docs-pr-required (Change is not yet documented) · release-note-required (Change has user-facing impact, no matter how small)