
CGROUP aware resource monitor on memory #38718

Open
wujiaqi opened this issue Mar 12, 2025 · 3 comments
Labels: area/overload_manager, enhancement

Comments


wujiaqi commented Mar 12, 2025

Title: Add a CGROUP aware resource monitor for memory

Description:
I'm opening this issue to have a preliminary discussion on how to implement this. Someone on my team can do the implementation once we get agreement.

We run an Istio Ingress Gateway today with the overload manager configured to load-shed on memory utilization thresholds. This is to prevent OOMKills of our pods, especially during high-load events. However, the fixed_heap resource monitor that exists today only reports the memory that tcmalloc believes is allocated. OOMKills are based on what the OS sees, not what tcmalloc thinks, so it is important to have a monitor that reflects the OS view. fixed_heap is often substantially lower than what is reported in cgroups.

Below is an experiment I conducted to demonstrate the discrepancy.

During Load
Docker stats

CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT   MEM %
2696a94996b9   envoy     50.56%     489.5MiB / 512MiB  95.61%

Envoy metric

overload.envoy.resource_monitors.fixed_heap.pressure: 87

After Load
Docker stats

CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT   MEM %
2696a94996b9   envoy     0.48%     343.1MiB / 512MiB   67.01%

Envoy metric

overload.envoy.resource_monitors.fixed_heap.pressure: 16

As you can see, heap pressure is much lower than the OS-reported memory consumption.

I am proposing to add a new memory resource monitor based on cgroups rather than tcmalloc stats. Since systems are currently in transition, with some on cgroups v1, others on cgroups v2, and some in hybrid mode, it would be worth abstracting this detail away in the configuration to just "cgroups enabled". During object construction we can detect whether the system is on cgroups v1 or v2, for example by checking the filesystem for the presence of the hierarchies:

If the following files are present, the system is on cgroups v2:

  • /sys/fs/cgroup/memory.max
  • /sys/fs/cgroup/memory.current

Otherwise, if the following directory exists, the system is on cgroups v1:

  • /sys/fs/cgroup/memory

We will pick the highest available cgroup version on the system during construction.
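
To make the detection concrete, below is a minimal standalone C++ sketch (illustrative only, not Envoy code; CgroupVersion and detectCgroupVersion are hypothetical names) of the probe described above:

#include <filesystem>
#include <iostream>

// Hypothetical helper: probe the unified (v2) hierarchy first, then fall
// back to the legacy (v1) per-controller memory directory.
enum class CgroupVersion { None, V1, V2 };

CgroupVersion detectCgroupVersion() {
  namespace fs = std::filesystem;
  // cgroups v2: memory.max / memory.current live directly under /sys/fs/cgroup.
  if (fs::exists("/sys/fs/cgroup/memory.max") &&
      fs::exists("/sys/fs/cgroup/memory.current")) {
    return CgroupVersion::V2;
  }
  // cgroups v1: a dedicated memory controller directory exists instead.
  if (fs::is_directory("/sys/fs/cgroup/memory")) {
    return CgroupVersion::V1;
  }
  return CgroupVersion::None;
}

int main() {
  switch (detectCgroupVersion()) {
  case CgroupVersion::V2: std::cout << "cgroups v2\n"; break;
  case CgroupVersion::V1: std::cout << "cgroups v1\n"; break;
  default: std::cout << "no memory cgroup hierarchy found\n"; break;
  }
  return 0;
}

Checking v2 before v1 means that on a hybrid system with both hierarchies mounted, the monitor would prefer the v2 files, which matches the "highest available" rule above.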

Appreciate the feedback, thanks.

Relevant Links:

Related issue: #36681

cc @ramaraochavali

@wujiaqi added the enhancement and triage labels on Mar 12, 2025
@botengyao added the area/overload_manager label and removed the triage label on Mar 13, 2025
@botengyao (Member)

+@KBaichoo

Thanks @wujiaqi, this makes sense to me, and a cgroup-version-aware memory_utilization resource monitor can be added.

A cgroup-based CPU resource monitor was added recently; you can take a reference from #34713.
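
For illustration, here is a rough standalone sketch of the pressure calculation such a monitor could perform on a cgroups v2 system (hypothetical code, not the extension's actual implementation; a real monitor would follow the resource monitor extension pattern referenced above, and would read the v1 files memory.usage_in_bytes and memory.limit_in_bytes when v1 is detected):

#include <fstream>
#include <iostream>
#include <optional>
#include <string>

// Hypothetical helper: compute pressure as current usage divided by the
// cgroup memory limit, mirroring what docker stats reports as MEM %.
std::optional<double> computeCgroupV2MemoryPressure() {
  std::ifstream max_file("/sys/fs/cgroup/memory.max");
  std::ifstream current_file("/sys/fs/cgroup/memory.current");
  std::string max_str;
  unsigned long long current = 0;
  if (!(max_file >> max_str) || !(current_file >> current)) {
    return std::nullopt; // hierarchy not present or files unreadable
  }
  // "max" means the cgroup has no memory limit, so there is no meaningful
  // pressure to report against.
  if (max_str == "max") {
    return std::nullopt;
  }
  const unsigned long long limit = std::stoull(max_str);
  if (limit == 0) {
    return std::nullopt;
  }
  return static_cast<double>(current) / static_cast<double>(limit);
}

int main() {
  if (const auto pressure = computeCgroupV2MemoryPressure()) {
    std::cout << "cgroup memory pressure: " << *pressure << "\n";
  } else {
    std::cout << "no cgroup memory limit to report against\n";
  }
  return 0;
}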

@KBaichoo (Contributor)

Docker stats

CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT   MEM %
2696a94996b9   envoy     0.48%     343.1MiB / 512MiB   67.01%

Envoy metric

overload.envoy.resource_monitors.fixed_heap.pressure: 16

See also https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/bootstrap/v3/bootstrap.proto#config-bootstrap-v3-memoryallocatormanager as a way to configure giving back some memory to the OS.

A cgroup aware resource monitor sounds like a great enhancement!


wujiaqi commented Mar 13, 2025

I did happen to test that as well; it works nicely. After an idle period the memory gets released.

CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT   MEM %
2696a94996b9   envoy     0.65%     87.49MiB / 512MiB   17.09%
overload.envoy.resource_monitors.fixed_heap.pressure: 14
tcmalloc.released_by_timer: 92

Though what I don't understand is how to make a judgment call on the value to set for bytes_to_release. I tried finding some literature in the tcmalloc docs, but it wasn't super clear to me; it gave some insights on memory fragmentation and so on. I would appreciate any insight.

memory_allocator_manager:
  bytes_to_release: 31460000 # arbitrarily chose ~30MB
