8343191: Cgroup v1 subsystem fails to set subsystem path #21808
👋 Welcome back schernyshev! A progress list of the required criteria for merging this PR into |
❗ This change is not yet ready to be integrated. |
@sercher The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
What testing have you done? Did you run existing container tests in:
As far as I can tell this breaks privileged container runs. I.e. |
I've done the standard tiers (1-3), and additionally "jtreg:jdk/internal/platform/cgroup" and "gtest::cgroupTest". I see now some of the dockers are failing. I am looking into it. |
Thanks Severin! It was the problematic change in the logic that skips duplicate cgroup controller mount points. Failing tests are mounting duplicates of the host's cgroups with
Yes. See https://bugs.openjdk.org/browse/JDK-8261242 for details. This patch shouldn't change it and the logic of |
Here's an updated version of the patch. The long standing behavior was to leave Create a new cgroup for memory
Run the following script
In the above script, a containerized process ( The result would be ($JAVA_HOME points to JDK before fix)
JDK updated version:
The updated version falls back to the mount point (only when Testing
Have you checked on cg v2? Is this a problem there as well? |
Hi Severin, thanks for this question. I didn't check cg v2 because the issue (NPE) was observed in v1 hosts only. It's an open question what happens if a process is moved between cgroups in v2 mode. I will look into it and file an issue if there are problems in v2. |
It looks to me that v2 mode is not affected, at least the way it is in v1. In v2 mode, cgroup is mounted either at leaf node (private namespace), or the complete hierarchy at /sys/fs/cgroup (host namespace). In host mode it works right away, as the full hierarchy is accessible. With a cgroup v2 created like this:
The result would be
In the private namespace (it's a default setting in v2 hosts), it may fail migrating the process between cgroups (a docker issue?). It may look like the cgroup files are not mapped at all, while
The following script
will display
means there are no files in When moved into a subgroup, such as
the cgroup will be mounted at /sys/fs/cgroup, and the correct memory limit is displayed (thanks to the controller path adjustment), inherited from the parent.
The JBS issue doesn't mention |
OK, but why is https://bugs.openjdk.org/browse/JDK-8322420 not in effect in such a case?
It would be good to see the full boot JVM output at the trace level. I'm wondering why the adjustment isn't sufficient for the use-case the bug describes. I.e. if the move happens before the JVM starts then there is a chance it works OK by detecting some limit. If not, it would really be useful to understand it better. If, however, the cgroup move happens after the JVM has started, there is nothing in the JVM which "corrects" the detected physical memory (i.e. heap size et al.) and/or detected CPUs. It's not supported to do that dynamically.
I also wonder, then, if the issue is NPE if JDK-8336881 would fix that issue. The controller adjustment doesn't yet happen on the Java (Metrics) level. Only hotspot so far. |
Answering my own question. Because the
On cg v2, on the other hand, Edit:
So on cg v1 you start out and end with a |
Exactly. That's why JDK-8322420 is not in effect and also JDK-8336881 does not fix it on Java side (path stays uninitialized in certain conditions). |
jlong lowest_limit = phys_mem;
if (limit > 0 && limit < lowest_limit) {
  lowest_limit = limit;
  os::free(limit_cg_path); // handles nullptr
  limit_cg_path = os::strdup(cg_path);
}
We can avoid the duplicate copy of the original cgroup path, which is already captured in orig, by using:
jlong lowest_limit = limit < 0 ? phys_mem : limit;
julong orig_limit = ((julong)lowest_limit) != phys_mem ? lowest_limit : phys_mem;
And on line 91 we change the condition from:
if ((julong)lowest_limit != phys_mem) {
to:
if ((julong)lowest_limit != orig_limit) {
Thanks! Accepted.
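The snippet under review keeps track of the lowest limit seen so far together with the cgroup path it was found at. A minimal Java sketch of that bookkeeping follows, assuming the surrounding loop walks from the given cgroup path up to the root. The class name, the method name, and the in-memory map of limits are hypothetical stand-ins for reading the controller files; this is not the code from the patch.

```java
import java.util.Map;

class LowestLimitSketch {
    /**
     * Walks from cgPath up to "/" and returns the path carrying the lowest
     * positive limit below physMem, or null when no such limit exists.
     * The limits map is a hypothetical stand-in for reading cgroup files.
     */
    static String pathWithLowestLimit(String cgPath, Map<String, Long> limits, long physMem) {
        long lowest = physMem;
        String best = null;
        String path = cgPath;
        while (true) {
            long limit = limits.getOrDefault(path, -1L); // -1 means "unlimited"
            if (limit > 0 && limit < lowest) {
                lowest = limit;
                best = path;
            }
            if (path.equals("/")) {
                break;
            }
            // Step to the parent cgroup.
            int idx = path.lastIndexOf('/');
            path = (idx <= 0) ? "/" : path.substring(0, idx);
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Long> limits = Map.of("/a/b", 200L, "/a", 100L);
        System.out.println(pathWithLowestLimit("/a/b", limits, 1000L));
    }
}
```

With a limit of 200 on /a/b and 100 on /a, the sketch selects /a, since the parent limit is the stricter one.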
if (cpus != host_cpus && cpus < lowest_limit) {
  lowest_limit = cpus;
  os::free(limit_cg_path); // handles nullptr
  limit_cg_path = os::strdup(cg_path);
Same here with the extra allocation of cg_path.
done.
if (strstr((char*)cgroup_path, "../") != nullptr) {
  log_warning(os, container)("Cgroup v2 path at [%s] is [%s], cgroup limits can be wrong.",
                             mount_path, cgroup_path);
}
Why the cast to char*?
We should probably move this warning to CgroupUtil::adjust_controller, right before we've determined that we actually need to adjust. I wonder, though, if we should just print the warning and set the cgroup_path to / and return early. Otherwise, path adjustment will run with no different result.
Removed the extra (char*) cast.
> We should probably move this warning to CgroupUtil::adjust_controller, right before we've determined that we actually need to adjust. I wonder, though, if we should just print the warning and set the cgroup_path to / and return early. Otherwise, path adjustment will run with no different result.
"../" only appears in a corner case with cgroupns=private and the process moved to the outer group. In that specific case we should avoid concatenating with whatever starts with "../".
if (!cgroupPath.equals("/")) {
    // When moved to a subgroup, between subgroups, the path suffix will change.
    // Rely on path adjustment that determines the actual suffix.
    path += cgroupPath;
}
This seems a simpler solution than the hotspot one. While I prefer this one, please make them consistent at the least.
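To make the two rules from this thread concrete (do not append "/" and do not concatenate a path containing "../"), here is a small Java sketch. It assumes that falling back to the mount point is the desired behaviour in the "../" case, which matches the trace output shown later in this conversation. The class and method names are hypothetical, and the real patch additionally logs a warning.

```java
class SubsystemPathSketch {
    /**
     * Builds the directory holding the cgroup interface files.
     * Hypothetical helper: falls back to the mount point when the
     * cgroup path is "/" or points outside the namespace via "../".
     */
    static String subsystemPath(String mountPoint, String cgroupPath) {
        if (cgroupPath.equals("/")) {
            return mountPoint;
        }
        if (cgroupPath.contains("../")) {
            // Process was moved outside the cgroup namespace root;
            // the concatenated path would not exist.
            return mountPoint;
        }
        return mountPoint + cgroupPath;
    }

    public static void main(String[] args) {
        System.out.println(subsystemPath("/sys/fs/cgroup/memory", "/../../test"));
        System.out.println(subsystemPath("/sys/fs/cgroup/memory", "/a/b"));
    }
}
```

In the "../" case the limits read from the mount point may not be the ones that apply to the process, which is why the thread settles on emitting a warning once.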
 * @requires os.family == "linux"
 * @modules java.base/jdk.internal.platform
 * @library /test/lib
 * @build jdk.test.whitebox.WhiteBox CheckOperatingSystemMXBean
CheckOperatingSystemMXBean seems unused.
fixed
Co-authored-by: Severin Gehwolf <[email protected]>
Here's the summary of the latest state of the PR. The updated code
# Conflicts: # src/java.base/linux/classes/jdk/internal/platform/CgroupUtil.java
@sercher this pull request can not be integrated into
git checkout JDK-8343191
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
FYI: I'll try to test and review this more thoroughly next week.
if (strstr(cgroup_path, "../") == nullptr) {
  ss.print_raw(cgroup_path);
} else {
  log_warning(os, container)("Cgroup cpu/memory controller path includes '../', detected limits won't be accurate");
}
Please move this warning to CgroupUtil::adjust_controller and abort the adjustment. We don't need to issue this warning multiple times, and we'd not be able to adjust it to a path that will work. Showing the warning once should be sufficient. We shouldn't see this path in any non-moved scenarios. It would perhaps help if we included some detail why this warning is being shown. I suggest:
cgroup controller path seems to have moved (includes '.../'), detected limits won't be accurate
Would you also recommend including the paths in that warning? Something like
cgroup controller path at '/sys/fs/cgroup' seems to have moved to '../../test', detected limits won't be accurate
This way it will have all the necessary information to investigate customer cases.
Seems fine yes.
Updated the log message. The log example in cg v1:
[0.001s][trace][os,container] OSContainer::init: Initializing Container Support
[0.001s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.001s][debug][os,container] Detected cgroups hybrid or legacy hierarchy, using cgroups v1 controllers
[0.001s][warning][os,container] Cgroup memory controller path at '/sys/fs/cgroup/memory' seems to have moved to '/../../test', detected limits won't be accurate
[0.001s][debug ][os,container] OSContainer::init: is_containerized() = true because all controllers are mounted read-only (container case)
[0.001s][trace ][os,container] Path to /cpu.cfs_quota_us is /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_quota_us
[0.001s][trace ][os,container] CPU Quota is: -1
[0.002s][trace ][os,container] Path to /cpu.cfs_period_us is /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_period_us
[0.002s][trace ][os,container] CPU Period is: 100000
[0.002s][trace ][os,container] OSContainer::active_processor_count: 48
[0.002s][trace ][os,container] CgroupSubsystem::active_processor_count (cached): 48
[0.002s][trace ][os,container] total physical memory: 133623721984
[0.002s][trace ][os,container] Path to /memory.limit_in_bytes is /sys/fs/cgroup/memory/memory.limit_in_bytes
[0.002s][trace ][os,container] Memory Limit is: 419430400
[0.004s][trace ][os,container] CgroupSubsystem::active_processor_count (cached): 48
[0.027s][trace ][os,container] Path to /cpu.cfs_quota_us is /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_quota_us
[0.027s][trace ][os,container] CPU Quota is: -1
[0.027s][trace ][os,container] Path to /cpu.cfs_period_us is /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_period_us
[0.027s][trace ][os,container] CPU Period is: 100000
[0.027s][trace ][os,container] OSContainer::active_processor_count: 48
openjdk version "24-internal" 2025-03-18
OpenJDK Runtime Environment (build 24-internal-adhoc.bellsoft.jdk)
OpenJDK 64-Bit Server VM (build 24-internal-adhoc.bellsoft.jdk, mixed mode, sharing)
@jerboaa Could you please take a look?
@sercher This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
Hi Jonas,
The PR is now ready and under review. I couldn't get back to it earlier due
to ongoing work. Apologies for the delay.
On Fri, Jan 24, 2025 at 11:07, Jonas wrote:
… Hi @sercher and @jerboaa, may I ask what the current state of this PR is, because we are waiting for the bug fix?
/reviewers 2 |
CgroupV1Controller::set_subsystem_path needs a high-level comment update to describe the logic happening.
Testing:
And after the patch this would become this, right?
/sys/fs/cgroup/cpu,cpuacct/system.slice/garden.service/garden/bad/2f57368b-0eda-4e52-64d8-af5c
/sys/fs/cgroup/cpu,cpuacct/
It depends on whether it was a subgroup in the initial path. If bad/2f57368b-0eda-4e52-64d8-af5c is the subgroup, the reduction will be
/sys/fs/cgroup/cpu,cpuacct/system.slice/garden.service/garden/bad/2f57368b-0eda-4e52-64d8-af5c
/sys/fs/cgroup/cpu,cpuacct/bad/2f57368b-0eda-4e52-64d8-af5c
/sys/fs/cgroup/cpu,cpuacct/bad
/sys/fs/cgroup/cpu,cpuacct/
The above case doesn't seem to be reflected by any gtest test case (or others); please add those.
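A minimal Java sketch of the subgroup detection discussed here, assuming the idea is to drop leading components of the cgroup path until the remainder exists under the mount point. The class name, the method name, and the exists predicate are hypothetical and not taken from the patch; the actual logic lives in CgroupV1Controller::set_subsystem_path and may differ in detail.

```java
import java.util.Set;
import java.util.function.Predicate;

class SubgroupDetectionSketch {
    /**
     * Strips leading components from cgroupPath until mountPoint + suffix
     * exists, and returns that directory. Falls back to the mount point
     * when no suffix matches. The exists predicate is a hypothetical
     * stand-in for a file system check.
     */
    static String resolve(String mountPoint, String cgroupPath, Predicate<String> exists) {
        String suffix = cgroupPath;
        while (!suffix.isEmpty() && !suffix.equals("/")) {
            String candidate = mountPoint + suffix;
            if (exists.test(candidate)) {
                return candidate;
            }
            // Drop the first path component, e.g. "/a/b/c" becomes "/b/c".
            int next = suffix.indexOf('/', 1);
            suffix = (next < 0) ? "" : suffix.substring(next);
        }
        return mountPoint;
    }

    public static void main(String[] args) {
        String mount = "/sys/fs/cgroup/cpu,cpuacct";
        String cgroupPath = "/system.slice/garden.service/garden/bad/2f57368b-0eda-4e52-64d8-af5c";
        Set<String> present = Set.of(mount + "/bad/2f57368b-0eda-4e52-64d8-af5c");
        System.out.println(resolve(mount, cgroupPath, present::contains));
    }
}
```

Passing the existence check as a predicate keeps the sketch testable without a real cgroup file system, in the same spirit as the gtest cases requested above.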
Co-authored-by: Severin Gehwolf <[email protected]>
Done, added |
The subgroup path reduction is covered by
With additional logging added before line 77, this could be looking like
Before the fix, the current path adjustment scheme would produce the following order:
Only the last path is valid in the container; the others do not exist. The result will be 200m, while the correct value is 100m.
The cgroup v1 subsystem fails to initialize mounted controllers properly in certain cases, which may leave controllers undetected or inactive. We observed the behavior in CloudFoundry deployments; it also affects host systems.
The relevant /proc/self/mountinfo line is
/proc/self/cgroup:
Here, Java runs inside a containerized process that is being moved between cgroups due to load balancing.
Let's examine the condition at line 64 here
jdk/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp
Lines 59 to 72 in 55a7cf1
It is always FALSE and the branch is never taken. The issue was spotted earlier by @jerboaa in JDK-8288019.
The original logic was intended to find the common prefix of _root and cgroup_path and concatenate the remaining suffix to the _mount_point (lines 67-68). That could lead to the following results:
Example input
result _path
Here, cgroup_path comes from /proc/self/cgroup 3rd column. The man page (https://man7.org/linux/man-pages/man7/cgroups.7.html#NOTES) for control groups states:
This explicitly states the "pathname is relative to the mount point of the hierarchy". Hence, the correct result could have been
However, if Java runs in a container, /proc/self/cgroup and /proc/self/mountinfo are mapped (read-only) from the host, because docker uses --cgroupns=host by default in cgroup v1 hosts. Then _root and cgroup_path belong to the host and do not exist in the container. In containers Java must fall back to the _mount_point of the corresponding cgroup controller.
When --cgroupns=private is used, _root and cgroup_path are always equal to /.
In hosts, the cgroup_path should always be added to the mount point, no matter how it compares to the _root.
The PR fixes CgroupUtil::adjust_controller so that it handles the case when a process is moved to a supergroup or a sibling (in --cgroupns=private it produces invalid "../" paths). It also changes CgroupV1Controller::set_subsystem_path in cgroup v1 mode, so that it detects the actual subgroup part of the given cgroup_path, because exactly this part should be concatenated to the mount point to get the correct path of the cgroup files. The PR updates the Java metrics side accordingly.
New tests are proposed that cover processes moved between cgroups.
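For readers unfamiliar with the input, the cgroup_path in this description is the third colon-separated column of a /proc/self/cgroup line (hierarchy-ID:controller-list:cgroup-path, per cgroups(7)). The following Java sketch shows how that column can be picked out for a given controller. The class name, the method name, and the sample lines are hypothetical; the JDK's own parser is not reproduced here.

```java
import java.util.Arrays;
import java.util.List;

class ProcSelfCgroupSketch {
    /**
     * Returns the cgroup path (3rd column) of the line whose controller
     * list (2nd column, comma-separated) contains the given controller,
     * or null when no line matches.
     */
    static String cgroupPathFor(String controller, List<String> lines) {
        for (String line : lines) {
            // Limit of 3 keeps any ':' inside the path intact.
            String[] cols = line.split(":", 3);
            if (cols.length != 3) {
                continue; // malformed line, skip it
            }
            if (Arrays.asList(cols[1].split(",")).contains(controller)) {
                return cols[2];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Sample lines are made up for illustration.
        List<String> lines = List.of("11:memory:/a/b", "4:cpu,cpuacct:/c");
        System.out.println(cgroupPathFor("memory", lines));
        System.out.println(cgroupPathFor("cpuacct", lines));
    }
}
```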
Progress
Issue
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21808/head:pull/21808
$ git checkout pull/21808
Update a local copy of the PR:
$ git checkout pull/21808
$ git pull https://git.openjdk.org/jdk.git pull/21808/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 21808
View PR using the GUI difftool:
$ git pr show -t 21808
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21808.diff