-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CORE-10974] Add bandwidth QoS controls #9881
base: master
Are you sure you want to change the base?
Conversation
3e04ce5
to
60759ea
Compare
Add bandwidth limiting for ingress and egress, implemented via `tc qdisc` configuration based on the {ingress,egress} bandwidth and burst fields of the QoSControls struct in WorkloadEndpoint. In kubernetes, they come from annotations on the kubernetes pods. Felix FV tests added to verify that the `tc qdisc` configuration is properly added and that the bandwidth is properly limited with `iperf3`. Some code borrowed from the kubernetes bandwidth plugin (credit given).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions and corrections...
return fmt.Errorf("create ifb qdisc: %s", err) | ||
} | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there so much asymmetry between the ingress and egress coding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding, we can only add one tbf
qdisc per interface, and it only works for ingress. So for egress, the workaround is to create an ingress
qdisc, an ifb
interface, mirror the egress traffic from the original interface to the ifb
, and then add the tbf
qdisc to the ifb
interface...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Providing some information here, if this can help:
- LinuxBridge (OpenStack Neutron plugin) uses
tbf
for ingress andpolice
filter for egress (reflected by the list filter code). - Libvirt uses
htb
for ingress andpolice
for egress. Notice that the direction is reversed because libvirt is doingrx (ingress) limit request -> police
andtx (egress) limit request -> htb
, which is correct from the hypervisor perspective, but reversed from the interface/VM's perspective. - Regardless, all of the tools use the token bucket algorithm (tbf).
- Improve error messages - Fix ingress bandwidth testing - Other misc corrections
It will be replaced by felix functionality. Needs to merge in sync with projectcalico/calico#9881.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of queries from me but it looks like Nell has done a more thorough review already!
I suppose my general question is how do we make sure everything always gets cleaned up, bearing in mind that the CNI plugin can remove the cali device under felix's feet. Will we still clean up any left-over bandwidth control device?
Do we need to worry about races when changing the limits? You remove the limit and then re-add it so there'll be a moment when a pod could send too much data. Might be OK, depending on use case (And it might not be possible to adjust the value on the fly).
Thank you!
Yes, the
Hm, right, my laziness is probably to blame here, but you make a good point on altering the config in-place instead of removing+readding, will do! |
Guard QoS setup against BPF mode. Add cleanup of QoS for when BPF is enabled. Update QoS bandwidth in place instead of remove+readd. Minor test fixes.
if oldIngressBw != 0 { | ||
err := removeIngressQdisc(old.Name) | ||
if oldName != newName || oldIngressBw != newIngressBw || oldIngressBurst != newIngressBurst { | ||
if oldName == newName && oldIngressBw != 0 && newIngressBw != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic seems to assume that the old values match the dataplane; we try not to make such assumptions in Felix. So what I'd expect to see is logic that reads back the dataplane state, compares it with the desired state and then "fixes" the dataplane by adding/removing/changing the qdisc.
I'd probabaly use a DeltaTracker for this with a struct that tracks the state of the dataplane:
type qosState struct {
IngressBandwidth int
IngressBurst int
}
then you need to be able to read back those values from the dataplane to populate the "Dataplane()` side of the delta tracker, and populate the "Desired()" side with the values from the datastore.
Then the delta tracker has methods for iterating over all the updates/additions/deletions that are needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Introducing myself as I'm making some comments in this PR: This is Zhan from the Bloomberg team. We had a meeting with Nell and Patrick and discussed a parameter that is missing in Neutron Bandwidth QoS API but we'd like to have it in Calico. Please see my comments for more details. Thanks!
return fmt.Errorf("create ifb qdisc: %s", err) | ||
} | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Providing some information here, if this can help:
- LinuxBridge (OpenStack Neutron plugin) uses
tbf
for ingress andpolice
filter for egress (reflected by the list filter code). - Libvirt uses
htb
for ingress andpolice
for egress. Notice that the direction is reversed because libvirt is doingrx (ingress) limit request -> police
andtx (egress) limit request -> htb
, which is correct from the hypervisor perspective, but reversed from the interface/VM's perspective. - Regardless, all of the tools use the token bucket algorithm (tbf).
latency := latencyInUsec(latencyInMillis) | ||
limitInBytes := limit(uint64(rateInBytes), latency, uint32(burstInBytes)) | ||
|
||
qdisc := &netlink.Tbf{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When creating tbf
for ingress & egress traffic, we may also need to specify peakrate
, which allows users to specify the maximum depletion rate of the bucket (i.e., burst/buffer). @nelljerram This is what I was referring to during our meeting, the parameter that is missing on the Neutron QoS API.
Also, some clarifications on the Neutron QoS parameters (as we discussed in our meeting): Even though it says max_burst_kbps
in the table, it should be max-burst-kbits
. For example, in the "notes" section of the doc, they wrote that the burst value "will be 800kbit". Looking into the Neutron code and taking LinuxBridge as an example again, even though it says max_burst_kbps
when setting up tbf
, the underlying function still treats it as burst_kb
, not kbps
. We will sync up with the Neutron community on our end to get clarification on this.
Currently, Neutron API only supports max-kbps
and max-burst-kbits
, which maps to:
Neutron tc-tbf/tc-police/tc-htb
-----------------------------------------------
max-kbps rate
max-burst-kbits burst
According to the token bucket algorithm, if we don't specify peakrate
, interfaces will be allowed to transmit/receive burst
bytes at the maximum available bandwidth on the machine, assuming the bucket is full at the start. This is less ideal. However, with peakrate
, we can control the speed that tokens flow out of the bucket, which is equivalent to controlling the burst bandwidth.
Since Neutron QoS API does not support peakrate
yet, we can do something just like the limits on connections - a default value that applies to all. We will work with the Neutron community on getting this in.
In terms of the code here, I do see peakrate
is supported in the library you use: tbf
, police
, and htb
(in htb
it is called ceil
). However, my experience with Go is limited so please double-check and see if this makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zhanz1 for the clarification re: burst
, this does make sense. We'll work on adding peakrate
to the Tbf
config (it is indeed supported, I'll only double check any necessary unit conversion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhanz1 Do you know if someone is working to address the misnaming of max-burst-kbits
on the Neutron API? I guess it will probably be best just to explain this prominently in the API documentation, instead of creating all the pain of changing the field name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@coutinhop Thanks for taking a look!
@nelljerram We do have the plan to ask the Neutron community on this issue in the future, but we haven't prioritize that yet. In the conversion, ideally, we want to make sure this parameter is not used or understood differently for other Neutron drivers and their QoS extensions (e.g., another Neutron driver treats it as kbps
instead of kbits
- Although I think this is unlikely), even though in the tc
case, it is bits
instead of bps
.
Change logic of maybeUpdateQoSBandwidth() to consider the dataplane state (actual `tc qdisc` config) instead of the datastore values. Fix QoS cleanup on BPF startup to not short-circuit on errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just filing my comments so far. I still have more to look at.
felix/bpf/tc/attach.go
Outdated
@@ -317,6 +318,22 @@ func EnsureQdisc(ifaceName string) (bool, error) { | |||
log.WithField("iface", ifaceName).Debug("Already have a clsact qdisc on this interface") | |||
return true, nil | |||
} | |||
|
|||
// Clean up QoS config if it exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth commenting here that this is for transition from an iptables dataplane to eBPF, and reflects that it's currently impossible for eBPF and QoS to coexist.
func GetTBFValues(rateInBits, burstInBits uint64) (rateInBytes uint64, bufferInBytes, limitInBytes uint32) { | ||
rateInBytes = rateInBits / 8 | ||
burstInBytes := burstInBits / 8 | ||
bufferInBytes = buffer(uint64(rateInBytes), uint32(burstInBytes)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we inline buffer
here? I find the levels of indirection make this code hard to follow.
Ditto for latencyInUsec
and limit
in the next two lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto for time2tick
in buffer
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
|
||
func time2Tick(time uint32) uint32 { | ||
return uint32(float64(time) * float64(netlink.TickInUsec())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I think good to rename time
to timeUsec
.
} | ||
|
||
func buffer(rate uint64, burst uint32) uint32 { | ||
return time2Tick(uint32(float64(burst) * float64(netlink.TIME_UNITS_PER_SEC) / float64(rate))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIME_UNITS_PER_SEC is actually USEC_PER_SEC, so this is:
BURST_BYTES * USEC_PER_SEC / RATE_BYTES_PER_SEC
= the time in usec for a BURST to be transmitted at RATE
applying time2Tick => the number of ticks needed for a BURST to be transmitted at RATE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how it's then correct to handle that as a number of bytes.
} | ||
|
||
func latencyInUsec(latencyInMillis float64) float64 { | ||
return float64(netlink.TIME_UNITS_PER_SEC) * (latencyInMillis / 1000.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's OK, I see that TIME_UNITS_PER_SEC
is actually USEC_PER_SEC. Can we replace netlink.TIME_UNITS_PER_SEC
everywhere by 1000000 ? That will be more comprehensible.
} | ||
|
||
func latencyInUsec(latencyInMillis float64) float64 { | ||
return float64(netlink.TIME_UNITS_PER_SEC) * (latencyInMillis / 1000.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Or, if you really want a constant, by a constant named ONE_MILLION
.)
latency := latencyInUsec(latencyInMillis) | ||
limitInBytes := limit(uint64(rateInBytes), latency, uint32(burstInBytes)) | ||
|
||
qdisc := &netlink.Tbf{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhanz1 Do you know if someone is working to address the misnaming of max-burst-kbits
on the Neutron API? I guess it will probably be best just to explain this prominently in the API documentation, instead of creating all the pain of changing the field name.
- Generally use TokenBucketState pointers - Use *TokenBucketState more widely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filing some more comments. I've made a branch at https://github.com/nelljerram/calico/tree/pedro-CORE-10974 with some of the refactorings I've suggested.
- Merge suggested changes. A separate 'qos' package still needed to be split from 'intdataplane' in order to avoid an import loop. - Properly attribute and copyright code from the CNI bandwidth plugin - Add a proper update function for kdd workloads to the felix FV infra - Adjust test expectations (using 'Consistently()' where appropriate) - Correctly use '%w' for outputting errors - Inline calculations used for converting rate and burst in bits {per second,} to TokenBucketState{} values - Other misc changes
Description
Add bandwidth limiting for ingress and egress, implemented via
tc qdisc
configuration based on the {ingress,egress} bandwidth and burst fields of the QoSControls struct in WorkloadEndpoint. In kubernetes, they come from annotations on the kubernetes pods.Felix FV tests added to verify that the
tc qdisc
configuration is properly added and that the bandwidth is properly limited withiperf3
.Some code borrowed from the kubernetes bandwidth plugin (credit given).
Related issues/PRs
Todos
Release Note
Reminder for the reviewer
Make sure that this PR has the correct labels and milestone set.
Every PR needs one
docs-*
label.docs-pr-required
: This change requires a change to the documentation that has not been completed yet.docs-completed
: This change has all necessary documentation completed.docs-not-required
: This change has no user-facing impact and requires no docs.Every PR needs one
release-note-*
label.release-note-required
: This PR has user-facing changes. Most PRs should have this label.release-note-not-required
: This PR has no user-facing changes.Other optional labels:
cherry-pick-candidate
: This PR should be cherry-picked to an earlier release. For bug fixes only.needs-operator-pr
: This PR is related to install and requires a corresponding change to the operator.