8343191: Cgroup v1 subsystem fails to set subsystem path #21808

Open
wants to merge 20 commits into
base: master

sercher
Contributor

@sercher sercher commented Oct 31, 2024

The cgroup v1 subsystem fails to initialize mounted controllers properly in certain cases, which may leave controllers undetected or inactive. We observed this behavior in CloudFoundry deployments; it also affects host systems.

The relevant /proc/self/mountinfo line is

2207 2196 0:43 /system.slice/garden.service/garden/good/2f57368b-0eda-4e52-64d8-af5c /sys/fs/cgroup/cpu,cpuacct ro,nosuid,nodev,noexec,relatime master:25 - cgroup cgroup rw,cpu,cpuacct

/proc/self/cgroup:

11:cpu,cpuacct:/system.slice/garden.service/garden/bad/2f57368b-0eda-4e52-64d8-af5c
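To make the mismatch between the two files easier to see, here is a minimal sketch that extracts the root and mount point from a mountinfo line. It assumes the field layout described in proc(5) (root and mount point are the 4th and 5th fields, and a lone "-" separates the optional fields from the filesystem type). MountInfo and parse_mountinfo_line are hypothetical names used only for illustration; this is not the HotSpot parser.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical holder for the fields of interest.
struct MountInfo {
  std::string root;         // 4th field: root of the mount within the hierarchy
  std::string mount_point;  // 5th field: where the hierarchy is mounted
  std::string fs_type;      // first field after the "-" separator
};

// Returns false if the line does not have the expected shape.
bool parse_mountinfo_line(const std::string& line, MountInfo& out) {
  std::istringstream in(line);
  std::string id, parent, dev, tok;
  if (!(in >> id >> parent >> dev >> out.root >> out.mount_point)) {
    return false;
  }
  // Skip mount options and optional fields up to the "-" separator.
  while (in >> tok) {
    if (tok == "-") {
      return static_cast<bool>(in >> out.fs_type);
    }
  }
  return false;
}
```

For the line above this yields a root ending in .../garden/good/..., while /proc/self/cgroup reports a path ending in .../garden/bad/..., so neither string is a prefix of the other.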

Here, Java runs inside a containerized process that is being moved between cgroups due to load balancing.

Let's examine the condition at line 64 here

if (strcmp(_root, cgroup_path) == 0) {
  ss.print_raw(_mount_point);
  _path = os::strdup(ss.base());
} else {
  char *p = strstr((char*)cgroup_path, _root);
  if (p != nullptr && p == _root) {
    if (strlen(cgroup_path) > strlen(_root)) {
      ss.print_raw(_mount_point);
      const char* cg_path_sub = cgroup_path + strlen(_root);
      ss.print_raw(cg_path_sub);
      _path = os::strdup(ss.base());
    }
  }
}

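The condition p != nullptr && p == _root is always false, so the branch is never taken: strstr returns a pointer into cgroup_path, which cannot equal the separate _root buffer. The issue was spotted earlier by @jerboaa in JDK-8288019.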
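A small sketch of why the check cannot succeed, assuming the root and the cgroup path are held in separate buffers (as they are when parsed from different files). The helper names original_check and is_prefix are hypothetical and only illustrate the difference between comparing pointers and comparing a prefix.

```cpp
#include <cassert>
#include <cstring>

// Mirrors the shape of the original check: strstr() returns a pointer into
// `cgroup_path`, so comparing it with `root` (a different buffer) fails.
bool original_check(const char* root, const char* cgroup_path) {
  const char* p = strstr(cgroup_path, root);
  return p != nullptr && p == root;
}

// What the check presumably meant: `root` is a string prefix of `cgroup_path`.
bool is_prefix(const char* root, const char* cgroup_path) {
  assert(root != nullptr && cgroup_path != nullptr);
  return strncmp(cgroup_path, root, strlen(root)) == 0;
}
```

With root "/a" and cgroup path "/a/b" stored in two arrays, original_check is false while is_prefix is true.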

The original logic was intended to find the common prefix of _root and cgroup_path and append the remaining suffix to _mount_point (lines 67-68). That could lead to the following results:

Example input

_root = "/a"
cgroup_path = "/a/b"
_mount_point = "/sys/fs/cgroup/cpu,cpuacct"

result _path

"/sys/fs/cgroup/cpu,cpuacct/b"

Here, cgroup_path comes from the third column of /proc/self/cgroup. The man page (https://man7.org/linux/man-pages/man7/cgroups.7.html#NOTES) for control groups states:

...
       /proc/pid/cgroup (since Linux 2.6.24)
              This file describes control groups to which the process
              with the corresponding PID belongs.  The displayed
              information differs for cgroups version 1 and version 2
              hierarchies.
              For each cgroup hierarchy of which the process is a
              member, there is one entry containing three colon-
              separated fields:

                  hierarchy-ID:controller-list:cgroup-path

              For example:

                  5:cpuacct,cpu,cpuset:/daemons
...
              [3]  This field contains the pathname of the control group
                   in the hierarchy to which the process belongs. This
                   pathname is relative to the mount point of the
                   hierarchy.

The man page explicitly states that the "pathname is relative to the mount point of the hierarchy". Hence, the correct result would have been

/sys/fs/cgroup/cpu,cpuacct/a/b
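The rule quoted from the man page can be sketched as follows: split a /proc/self/cgroup entry at its first two colons and treat the third field as relative to the mount point. CgroupEntry, parse_cgroup_line and path_under_mount are hypothetical names for illustration, not HotSpot functions, and the sketch covers only the host case.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical holder for one /proc/<pid>/cgroup entry.
struct CgroupEntry {
  std::string hierarchy_id;
  std::string controllers;
  std::string path;
};

// Splits "hierarchy-ID:controller-list:cgroup-path" at the first two colons.
// Returns false when the line does not have three fields.
bool parse_cgroup_line(const std::string& line, CgroupEntry& out) {
  const std::size_t first = line.find(':');
  if (first == std::string::npos) return false;
  const std::size_t second = line.find(':', first + 1);
  if (second == std::string::npos) return false;
  out.hierarchy_id = line.substr(0, first);
  out.controllers = line.substr(first + 1, second - first - 1);
  out.path = line.substr(second + 1);
  return true;
}

// Per cgroups(7), the third field is relative to the hierarchy's mount point.
std::string path_under_mount(const std::string& mount_point,
                             const CgroupEntry& e) {
  assert(!e.path.empty() && e.path[0] == '/');
  return e.path == "/" ? mount_point : mount_point + e.path;
}
```

For the man page example 5:cpuacct,cpu,cpuset:/daemons and a mount point of /sys/fs/cgroup/cpu,cpuacct, this gives /sys/fs/cgroup/cpu,cpuacct/daemons.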

However, if Java runs in a container, /proc/self/cgroup and /proc/self/mountinfo are mapped (read-only) from the host, because Docker uses --cgroupns=host by default on cgroup v1 hosts. In that case _root and cgroup_path refer to host paths that do not exist in the container, so in containers Java must fall back to the _mount_point of the corresponding cgroup controller.

When --cgroupns=private is used, _root and cgroup_path are always equal to /.

On hosts, cgroup_path should always be appended to the mount point, no matter how it compares to _root.

The PR fixes CgroupUtil::adjust_controller so that it handles the case where a process is moved to a supergroup or a sibling group (with --cgroupns=private this produces invalid "../" paths). It also changes CgroupV1Controller::set_subsystem_path in cgroup v1 mode so that it detects the actual subgroup part of the given cgroup_path, because exactly this part must be appended to the mount point to get the correct path to the cgroup files. The PR updates the Java metrics side accordingly.

New tests are proposed that cover processes moved between cgroups.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8343191: Cgroup v1 subsystem fails to set subsystem path (Bug - P3)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21808/head:pull/21808
$ git checkout pull/21808

Update a local copy of the PR:
$ git checkout pull/21808
$ git pull https://git.openjdk.org/jdk.git pull/21808/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 21808

View PR using the GUI difftool:
$ git pr show -t 21808

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21808.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Oct 31, 2024

👋 Welcome back schernyshev! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Oct 31, 2024

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 31, 2024
@openjdk

openjdk bot commented Oct 31, 2024

@sercher The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot-runtime
  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@mlbridge

mlbridge bot commented Oct 31, 2024

@jerboaa
Contributor

jerboaa commented Oct 31, 2024

What testing have you done? Did you run existing container tests in:

test/jdk/jdk/internal/platform
test/hotspot/jtreg/containers

As far as I can tell this breaks privileged container runs. I.e. docker run --privileged --memory 400m --memory-swap 400m ... /opt/jdk/bin/java -Xlog:os+container=trace wouldn't pick up the 400m container limit on CG v1?

@sercher
Contributor Author

sercher commented Nov 1, 2024

I've run the standard tiers (1-3), and additionally "jtreg:jdk/internal/platform/cgroup" and "gtest::cgroupTest". I now see that some of the Docker tests are failing. I am looking into it.

@sercher
Contributor Author

sercher commented Nov 1, 2024

Thanks Severin! The cause was the problematic change in the logic that skips duplicate cgroup controller mount points. The failing tests mount duplicates of the host's cgroups with --volume=/sys/fs/cgroup:/cgroup-in:ro. As they are in fact mounted read-write, the logic picked up the rw mount option and falsely detected "host mode". Also, --privileged creates rw mounts, so the entire approach needs correction. I am converting this PR to a draft for now.

@sercher sercher marked this pull request as draft November 1, 2024 13:14
@openjdk openjdk bot removed the rfr Pull request is ready for review label Nov 1, 2024
@jerboaa
Contributor

jerboaa commented Nov 4, 2024

As they're in fact mounting read-write, the logic picked up rw mount option and falsely detected "host mode". Also the --privileged creates rw mounts, so the entire approach needs correction.

Yes. See https://bugs.openjdk.org/browse/JDK-8261242 for details. This patch shouldn't change that, and the logic of OSContainer::is_containerized() shouldn't change semantically in any scenario.

@sercher
Contributor Author

sercher commented Nov 7, 2024

Here's an updated version of the patch. The long-standing behavior was to leave _path uninitialized when _root is not "/" and not equal to cgroup_path. The issue can be reproduced as follows.

Create a new cgroup for memory

sudo mkdir -p /sys/fs/cgroup/memory/test

Run the following script

docker run --tty=true --rm --volume=$JAVA_HOME:/jdk --memory 400m ubuntu:latest \
    sh -c "sleep 10 ; /jdk/bin/java -Xlog:os+container=trace -version" | grep Memory\ Limit &
sleep 10
HOSTPID=$(sudo ps -ef | awk '/container=trace/ && !/docker/ && !/awk/ { print $2 }')
echo $HOSTPID | sudo tee /sys/fs/cgroup/memory/test/cgroup.procs
sleep 10

In the above script, a containerized process (/bin/sh) is moved to cgroup /test before /jdk/bin/java gets executed. Java inherits cgroup /test from its parent process, so its _root will be /docker/<CONTAINER_ID> and its cgroup_path will be /test.

The result would be ($JAVA_HOME points to JDK before fix)

9804
[0.001s][trace][os,container] Memory Limit failed: -2
[0.001s][trace][os,container] Memory Limit failed: -2
[0.002s][trace][os,container] Memory Limit failed: -2
[0.043s][trace][os,container] Memory Limit failed: -2

JDK updated version:

10001
[0.001s][trace  ][os,container] Memory Limit is: 419430400
[0.001s][trace  ][os,container] Memory Limit is: 419430400
[0.002s][trace  ][os,container] Memory Limit is: 419430400
[0.035s][trace  ][os,container] Memory Limit is: 419430400

The updated version falls back to the mount point (only when _root is other than "/").
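The fallback rule can be sketched roughly as below. This is an illustration under stated assumptions, not the code in this patch: subsystem_path is a hypothetical free function, std::string stands in for the stringStream handling, and the behavior for _root equal to "/" (append cgroup_path to the mount point) is assumed to stay as it was.

```cpp
#include <cassert>
#include <string>

// Hypothetical illustration of the path selection for a cgroup v1 controller.
std::string subsystem_path(const std::string& mount_point,
                           const std::string& root,
                           const std::string& cgroup_path) {
  assert(!root.empty() && !cgroup_path.empty());
  if (root == "/") {
    // Full hierarchy is visible: the cgroup path is relative to the mount.
    return cgroup_path == "/" ? mount_point : mount_point + cgroup_path;
  }
  if (cgroup_path == root) {
    return mount_point;
  }
  // Root is a proper path prefix: append only the part below the root.
  const std::string prefix = root + "/";
  if (cgroup_path.compare(0, prefix.size(), prefix) == 0) {
    return mount_point + cgroup_path.substr(root.size());
  }
  // Root and cgroup path are unrelated (e.g. the process was moved):
  // fall back to the mount point instead of leaving the path unset.
  return mount_point;
}
```

In the reproducer above, the root /docker/<CONTAINER_ID> and the cgroup path /test are unrelated, so the sketch returns /sys/fs/cgroup/memory, which is where the 400m limit is visible inside the container.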

Testing

  • Standard tiers (1-3)
  • jtreg:test/jdk/jdk/internal/platform
  • jtreg:test/hotspot/jtreg/containers
  • gtest:cgroupTest

@sercher sercher marked this pull request as ready for review November 7, 2024 18:32
@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 7, 2024
@jerboaa
Contributor

jerboaa commented Nov 8, 2024

Have you checked on cg v2? Is this a problem there as well?

@sercher
Contributor Author

sercher commented Nov 8, 2024

Hi Severin, thanks for the question. I didn't check cg v2 because the issue (NPE) was observed on v1 hosts only.
I believe that's because v2 uses --cgroupns=private by default, in which the cgroup is mounted at the hierarchy leaf, so both _root and cgroup_path are /.

It's an open question what happens if a process is moved between cgroups in v2 mode. I will look into it and file an issue if there are problems in v2.

@sercher
Contributor Author

sercher commented Nov 8, 2024

It looks to me that v2 mode is not affected, at least not in the way v1 is. In v2 mode, the cgroup is mounted either at the leaf node (private namespace) or as the complete hierarchy at /sys/fs/cgroup (host namespace).

In host mode it works right away, as the full hierarchy is accessible. With a cgroup v2 created like this:

sudo mkdir -p /sys/fs/cgroup/test
echo 200000000 | sudo tee /sys/fs/cgroup/test/memory.max

The result would be

[0.000s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.001s][debug][os,container] Detected cgroups v2 unified hierarchy
[0.001s][trace][os,container] Adjusting controller path for memory: /sys/fs/cgroup/test
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/test/memory.max
[0.001s][trace][os,container] Memory Limit is: 199999488

In the private namespace (the default setting on v2 hosts), migrating the process between cgroups may fail (a Docker issue?). It looks as if the cgroup files are not mapped at all, while cgroup_path appears to be set relative to the old cgroup (although the old cgroup isn't mapped).

[0.000s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.001s][debug][os,container] Detected cgroups v2 unified hierarchy
[0.001s][trace][os,container] Adjusting controller path for memory: /sys/fs/cgroup/../../test
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../../test/memory.max
[0.001s][debug][os,container] Open of file /sys/fs/cgroup/../../test/memory.max failed, No such file or directory
[0.001s][trace][os,container] Memory Limit failed: -2
[0.001s][trace][os,container] Memory Limit is: -2
[0.001s][debug][os,container] container memory limit failed: -2, using host value 4105613312
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../../memory.max
[0.001s][debug][os,container] Open of file /sys/fs/cgroup/../../memory.max failed, No such file or directory
[0.001s][trace][os,container] Memory Limit failed: -2
[0.001s][trace][os,container] Memory Limit is: -2
[0.001s][debug][os,container] container memory limit failed: -2, using host value 4105613312
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../memory.max
[0.001s][debug][os,container] Open of file /sys/fs/cgroup/../memory.max failed, No such file or directory
[0.001s][trace][os,container] Memory Limit failed: -2
[0.001s][trace][os,container] Memory Limit is: -2
[0.001s][debug][os,container] container memory limit failed: -2, using host value 4105613312
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/memory.max
[0.001s][debug][os,container] Open of file /sys/fs/cgroup/memory.max failed, No such file or directory
[0.001s][trace][os,container] Memory Limit failed: -2
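The trace above shows the lookup retrying one directory level higher after each failure. A sketch of that walk, inferred from the log output only (candidate_dirs is a hypothetical name, not the HotSpot implementation of the controller adjustment):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Lists the directories probed when walking from the cgroup path up to the
// mount point, dropping the last path component on each step.
std::vector<std::string> candidate_dirs(const std::string& mount_point,
                                        std::string cgroup_path) {
  assert(!mount_point.empty());
  if (cgroup_path == "/") {
    cgroup_path.clear();  // the root cgroup maps to the mount point itself
  }
  std::vector<std::string> dirs;
  for (;;) {
    dirs.push_back(mount_point + cgroup_path);
    const std::size_t slash = cgroup_path.rfind('/');
    if (slash == std::string::npos) {
      break;  // nothing left to strip
    }
    cgroup_path.erase(slash);  // drop the last path component
  }
  return dirs;
}
```

For the mount point /sys/fs/cgroup and the path /../../test this produces the four directories seen in the log. Since "../" components point outside the mount, none of them can be opened from inside the private namespace.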

The following script

sudo docker run --tty=true --rm --volume=$JAVA_HOME:/jdk --memory 400m ubuntu:latest \
    sh -c "N=\$(ls -la /sys/fs/cgroup | wc -l) ; sleep 10 ; echo \$N ; ls -la /sys/fs/cgroup | wc -l" &
sleep 10
HOSTPID=$(sudo ps -ef | awk '/sys\/fs\/cgroup/ && !/docker/ && !/awk/ && !/grep/ { print $2 }')
echo $HOSTPID | sudo tee /sys/fs/cgroup/test/cgroup.procs > /dev/null
sleep 5

will display

74
1

which means there are no files in /sys/fs/cgroup after migration. This doesn't seem to be something that can be fixed in Java (and it doesn't have much to do with this PR either).

When moved into a subgroup, such as

sudo docker run --tty=true --rm --volume=$JAVA_HOME:/jdk --memory 400m ubuntu:latest \
    sh -c "sleep 10 ; /jdk/bin/java -Xlog:os+container=trace -version" &
sleep 5
HOSTPID=$(sudo ps -ef | awk '/container=trace/ && !/docker/ && !/awk/ { print $2 }')
CGPATH=$(cat /proc/$HOSTPID/cgroup | cut -f3 -d: )
sudo mkdir -p "/sys/fs/cgroup$CGPATH/test" 
echo $HOSTPID | sudo tee "/sys/fs/cgroup$CGPATH/test/cgroup.procs" > /dev/null
sleep 10

the cgroup will be mounted at /sys/fs/cgroup, and the correct memory limit, inherited from the parent, is displayed (thanks to the controller path adjustment).

[0.001s][debug][os,container] Detected cgroups v2 unified hierarchy
[0.001s][trace][os,container] Adjusting controller path for memory: /sys/fs/cgroup/test
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/test/memory.max
[0.001s][debug][os,container] Open of file /sys/fs/cgroup/test/memory.max failed, No such file or directory
[0.001s][trace][os,container] Memory Limit failed: -2
[0.001s][trace][os,container] Memory Limit is: -2
[0.001s][debug][os,container] container memory limit failed: -2, using host value 4105613312
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/memory.max
[0.001s][trace][os,container] Memory Limit is: 419430400

@jerboaa
Contributor

jerboaa commented Nov 11, 2024

I didn't check cg v2 because the issue (NPE) was observed in v1 hosts only.

The JBS issue doesn't mention NullPointerException. It would be good to list the observed NPE issue.

@jerboaa
Contributor

jerboaa commented Nov 11, 2024

Create a new cgroup for memory

sudo mkdir -p /sys/fs/cgroup/memory/test

Run the following script

docker run --tty=true --rm --volume=$JAVA_HOME:/jdk --memory 400m ubuntu:latest \
    sh -c "sleep 10 ; /jdk/bin/java -Xlog:os+container=trace -version" | grep Memory\ Limit &
sleep 10
HOSTPID=$(sudo ps -ef | awk '/container=trace/ && !/docker/ && !/awk/ { print $2 }')
echo $HOSTPID | sudo tee /sys/fs/cgroup/memory/test/cgroup.procs
sleep 10

In the above script, a containerized process (/bin/sh) is moved to cgroup /test before /jdk/bin/java gets executed. Java inherits cgroup /test from its parent process, its _root will be /docker/<CONTAINER_ID>, cgroup_path will be /test.

OK, but why is https://bugs.openjdk.org/browse/JDK-8322420 not in effect in such a case?

The result would be ($JAVA_HOME points to JDK before fix)

9804
[0.001s][trace][os,container] Memory Limit failed: -2
[0.001s][trace][os,container] Memory Limit failed: -2
[0.002s][trace][os,container] Memory Limit failed: -2
[0.043s][trace][os,container] Memory Limit failed: -2

JDK updated version:

10001
[0.001s][trace  ][os,container] Memory Limit is: 419430400
[0.001s][trace  ][os,container] Memory Limit is: 419430400
[0.002s][trace  ][os,container] Memory Limit is: 419430400
[0.035s][trace  ][os,container] Memory Limit is: 419430400

It would be good to see the full boot JVM output at the trace level. I'm wondering why the adjustment isn't sufficient for the use-case the bug describes. I.e. if the move happens before the JVM starts then there is a chance it works OK by detecting some limit. If not it would really be useful to understand it better.

If, however, the cgroup move happens after the JVM has started, there is nothing in the JVM which "corrects" the detected physical memory (i.e. heap size et al.) and/or detected CPUs. It's not supported to do that dynamically.

@jerboaa
Contributor

jerboaa commented Nov 11, 2024

I didn't check cg v2 because the issue (NPE) was observed in v1 hosts only.

The JBS issue doesn't mention NullPointerException. It would be good to list the observed NPE issue.

I also wonder, then, if the issue is an NPE, whether JDK-8336881 would fix it. The controller adjustment doesn't yet happen on the Java (Metrics) level. Only in hotspot so far.

@jerboaa
Contributor

jerboaa commented Nov 11, 2024

In the above script, a containerized process (/bin/sh) is moved to cgroup /test before /jdk/bin/java gets executed. Java inherits cgroup /test from its parent process, its _root will be /docker/<CONTAINER_ID>, cgroup_path will be /test.

OK, but why is https://bugs.openjdk.org/browse/JDK-8322420 not in effect in such a case?

Answering my own question. Because the set_subsystem_path() function for cg v1 in this unusual setup returns null.

[0.001s][trace][os,container] OSContainer::init: Initializing Container Support
[0.001s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.002s][debug][os,container] Detected cgroups hybrid or legacy hierarchy, using cgroups v1 controllers
[0.002s][trace][os,container] Adjusting controller path for memory: (null)
[0.002s][debug][os,container] read_string: subsystem path is null
[0.002s][trace][os,container] Memory Limit failed: -2
[0.002s][debug][os,container] read_string: subsystem path is null
[0.002s][trace][os,container] Memory Limit failed: -2
[0.002s][trace][os,container] No lower limit found for memory in hierarchy /sys/fs/cgroup/memory, adjusting to original path /test
[0.002s][debug][os,container] OSContainer::init: is_containerized() = true because all controllers are mounted read-only (container case)
[0.003s][trace][os,container] Path to /cpu.cfs_quota_us is /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_quota_us
[0.003s][trace][os,container] CPU Quota is: -1
[0.003s][trace][os,container] Path to /cpu.cfs_period_us is /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_period_us
[0.003s][trace][os,container] CPU Period is: 100000
[0.003s][trace][os,container] OSContainer::active_processor_count: 12
[0.003s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 12
[0.003s][trace][os,container] total physical memory: 67163226112
[0.003s][debug][os,container] read_string: subsystem path is null
[0.003s][trace][os,container] Memory Limit failed: -2
[0.005s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 12
[0.021s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 12
openjdk 24-internal 2025-03-18
OpenJDK Runtime Environment (build 24-internal-adhoc.sgehwolf.jdk-jdk)
OpenJDK 64-Bit Server VM (build 24-internal-adhoc.sgehwolf.jdk-jdk, mixed mode, sharing)

On cg v2, on the other hand, set_subsystem_path() will never set the path to a null value.

Edit:
Yet, cg v2 will get into trouble too. For example, with rootless podman on cg v2 you'd end up with this instead:

[0.008s][trace][os,container] OSContainer::init: Initializing Container Support
[0.008s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.008s][debug][os,container] Detected cgroups v2 unified hierarchy
[0.008s][trace][os,container] Adjusting controller path for memory: /sys/fs/cgroup/../../../../../../test
[0.008s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../../../../../../test/memory.max
[0.008s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../../test/memory.max failed, No such file or directory
[0.008s][trace][os,container] Memory Limit failed: -2
[0.008s][trace][os,container] Memory Limit is: -2
[0.008s][debug][os,container] container memory limit failed: -2, using host value 6204755968
[0.008s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../../../../../../memory.max
[0.008s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../../memory.max failed, No such file or directory
[0.008s][trace][os,container] Memory Limit failed: -2
[0.009s][trace][os,container] Memory Limit is: -2
[0.009s][debug][os,container] container memory limit failed: -2, using host value 6204755968
[0.009s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../../../../../memory.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../memory.max failed, No such file or directory
[0.009s][trace][os,container] Memory Limit failed: -2
[0.009s][trace][os,container] Memory Limit is: -2
[0.009s][debug][os,container] container memory limit failed: -2, using host value 6204755968
[0.009s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../../../../memory.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../../memory.max failed, No such file or directory
[0.009s][trace][os,container] Memory Limit failed: -2
[0.009s][trace][os,container] Memory Limit is: -2
[0.009s][debug][os,container] container memory limit failed: -2, using host value 6204755968
[0.009s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../../../memory.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../memory.max failed, No such file or directory
[0.009s][trace][os,container] Memory Limit failed: -2
[0.009s][trace][os,container] Memory Limit is: -2
[0.009s][debug][os,container] container memory limit failed: -2, using host value 6204755968
[0.009s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../../memory.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../memory.max failed, No such file or directory
[0.009s][trace][os,container] Memory Limit failed: -2
[0.009s][trace][os,container] Memory Limit is: -2
[0.009s][debug][os,container] container memory limit failed: -2, using host value 6204755968
[0.009s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../memory.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../memory.max failed, No such file or directory
[0.009s][trace][os,container] Memory Limit failed: -2
[0.009s][trace][os,container] Memory Limit is: -2
[0.009s][debug][os,container] container memory limit failed: -2, using host value 6204755968
[0.009s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/memory.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/memory.max failed, No such file or directory
[0.009s][trace][os,container] Memory Limit failed: -2
[0.009s][trace][os,container] Memory Limit is: -2
[0.009s][debug][os,container] container memory limit failed: -2, using host value 6204755968
[0.009s][trace][os,container] No lower limit found for memory in hierarchy /sys/fs/cgroup, adjusting to original path /../../../../../../test
[0.009s][trace][os,container] Adjusting controller path for cpu: /sys/fs/cgroup/../../../../../../test
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../../../../../test/cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../../test/cpu.max failed, No such file or directory
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../../../../../test/cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../../test/cpu.max failed, No such file or directory
[0.009s][trace][os,container] CPU Period failed: -2
[0.009s][trace][os,container] OSContainer::active_processor_count: 6
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../../../../../cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../../cpu.max failed, No such file or directory
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../../../../../cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../../cpu.max failed, No such file or directory
[0.009s][trace][os,container] CPU Period failed: -2
[0.009s][trace][os,container] OSContainer::active_processor_count: 6
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../../../../cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../cpu.max failed, No such file or directory
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../../../../cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../cpu.max failed, No such file or directory
[0.009s][trace][os,container] CPU Period failed: -2
[0.009s][trace][os,container] OSContainer::active_processor_count: 6
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../../../cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../../cpu.max failed, No such file or directory
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../../../cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../../cpu.max failed, No such file or directory
[0.009s][trace][os,container] CPU Period failed: -2
[0.009s][trace][os,container] OSContainer::active_processor_count: 6
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../../cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../cpu.max failed, No such file or directory
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../../cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../cpu.max failed, No such file or directory
[0.009s][trace][os,container] CPU Period failed: -2
[0.009s][trace][os,container] OSContainer::active_processor_count: 6
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../cpu.max failed, No such file or directory
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../cpu.max failed, No such file or directory
[0.009s][trace][os,container] CPU Period failed: -2
[0.009s][trace][os,container] OSContainer::active_processor_count: 6
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../cpu.max failed, No such file or directory
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../cpu.max failed, No such file or directory
[0.009s][trace][os,container] CPU Period failed: -2
[0.009s][trace][os,container] OSContainer::active_processor_count: 6
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/cpu.max failed, No such file or directory
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/cpu.max failed, No such file or directory
[0.009s][trace][os,container] CPU Period failed: -2
[0.009s][trace][os,container] OSContainer::active_processor_count: 6
[0.009s][trace][os,container] No lower limit found for cpu in hierarchy /sys/fs/cgroup, adjusting to original path /../../../../../../test
[0.009s][debug][os,container] OSContainer::init: is_containerized() = true because all controllers are mounted read-only (container case)
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../../../../../test/cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../../test/cpu.max failed, No such file or directory
[0.009s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../../../../../test/cpu.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../../test/cpu.max failed, No such file or directory
[0.009s][trace][os,container] CPU Period failed: -2
[0.009s][trace][os,container] OSContainer::active_processor_count: 6
[0.009s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 6
[0.009s][trace][os,container] total physical memory: 6204755968
[0.009s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../../../../../../test/memory.max
[0.009s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../../test/memory.max failed, No such file or directory
[0.009s][trace][os,container] Memory Limit failed: -2
[0.009s][trace][os,container] Memory Limit is: -2
[0.009s][debug][os,container] container memory limit failed: -2, using host value 6204755968
[0.011s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 6
[0.104s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../../../../../test/cpu.max
[0.105s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../../test/cpu.max failed, No such file or directory
[0.105s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/../../../../../../test/cpu.max
[0.105s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../../test/cpu.max failed, No such file or directory
[0.105s][trace][os,container] CPU Period failed: -2
[0.105s][trace][os,container] OSContainer::active_processor_count: 6
[0.112s][trace][os,container] total physical memory: 6204755968
[0.112s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../../../../../../test/memory.max
[0.112s][debug][os,container] Open of file /sys/fs/cgroup/../../../../../../test/memory.max failed, No such file or directory
[0.112s][trace][os,container] Memory Limit failed: -2
[0.112s][trace][os,container] Memory Limit is: -2
[0.112s][debug][os,container] container memory limit failed: -2, using host value 6204755968
openjdk version "24-internal" 2025-03-18
OpenJDK Runtime Environment (fastdebug build 24-internal-adhoc.sgehwolf.jdk-jdk)
OpenJDK 64-Bit Server VM (fastdebug build 24-internal-adhoc.sgehwolf.jdk-jdk, mixed mode, sharing)

@jerboaa
Contributor

jerboaa commented Nov 11, 2024

So on cg v1 you start out and end with a subsystem_path() == null and on cg v2 you start out and end with a subsystem_path() == /../../../../../../test. In both cases the memory limit of 400m won't be detected.

@sercher
Contributor Author

sercher commented Nov 11, 2024

On cg v2, on the other hand, set_subsystem_path() will never set the path to a null value.

Exactly. That's why JDK-8322420 is not in effect and also JDK-8336881 does not fix it on Java side (path stays uninitialized in certain conditions).

@openjdk openjdk bot removed the rfr Pull request is ready for review label Nov 27, 2024
Comment on lines 65 to 70
jlong lowest_limit = phys_mem;
if (limit > 0 && limit < lowest_limit) {
  lowest_limit = limit;
  os::free(limit_cg_path); // handles nullptr
  limit_cg_path = os::strdup(cg_path);
}
Contributor

We can avoid the duplicate copy of the original cgroup path, which is already captured in orig by using:

jlong lowest_limit = limit < 0 ? phys_mem : limit;
julong orig_limit = ((julong)lowest_limit) != phys_mem ? lowest_limit : phys_mem;

And on line 91 we change the condition from:

if ((julong)lowest_limit != phys_mem) {

to:

if ((julong)lowest_limit != orig_limit) {

Contributor Author

Thanks! Accepted.

if (cpus != host_cpus && cpus < lowest_limit) {
  lowest_limit = cpus;
  os::free(limit_cg_path); // handles nullptr
  limit_cg_path = os::strdup(cg_path);
Contributor

Same here with the extra allocation of cg_path.

Contributor Author

done.

Comment on lines 319 to 322
if (strstr((char*)cgroup_path, "../") != nullptr) {
  log_warning(os, container)("Cgroup v2 path at [%s] is [%s], cgroup limits can be wrong.",
                             mount_path, cgroup_path);
}
Contributor

Why the cast to char*?

We should probably move this warning to CgroupUtil::adjust_controller, right before we've determined that we actually need to adjust. I wonder, though, if we should just print the warning and set the cgroup_path to / and return early. Otherwise, path adjustment will run with no different result.

Contributor Author

Removed extra (char*) cast.

> We should probably move this warning to CgroupUtil::adjust_controller, right before we've determined that we actually need to adjust. I wonder, though, if we should just print the warning and set the cgroup_path to / and return early. Otherwise, path adjustment will run with no different result.

"../" only appears in the corner case where cgroupns=private and the process has been moved to the outer group. In that specific case we should avoid concatenating the mount point with anything that starts with "../".

Comment on lines 62 to 66
if (!cgroupPath.equals("/")) {
    // When moved to a subgroup, between subgroups, the path suffix will change.
    // Rely on path adjustment that determines the actual suffix.
    path += cgroupPath;
}
Contributor

This seems a simpler solution than the hotspot one. While I prefer this one, please make them consistent at the least.

* @requires os.family == "linux"
* @modules java.base/jdk.internal.platform
* @library /test/lib
* @build jdk.test.whitebox.WhiteBox CheckOperatingSystemMXBean
Contributor

CheckOperatingSystemMXBean seems unused.

Contributor Author

fixed

@sercher sercher marked this pull request as ready for review November 30, 2024 00:26
@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 30, 2024
@sercher
Contributor Author

sercher commented Dec 3, 2024

Here's the summary of the latest state of the PR. The updated code

  • special-cases the condition where _root is / and cgroup_path includes ../. This occurs in containers when the process's cgroup is moved to a supergroup; the cgroup files are then mapped inside the container at the controller's mount point. The special case is needed because path adjustment always fails for a cgroup path with the prefix ../;
  • calls stat() on the cgroup path to make sure the directory exists, but only when _root != / and _root != cgroup_path. This issue is specific to cgroup v1 containers, where /proc/self/cgroup comes from the host and the cgroup files are mapped at the controller's mount point; the mapping may include subgroups that need to be walked to locate the smallest limits;
  • brings Java Metrics in line with the hotspot logic;
  • fixes an NPE in Java Metrics;
  • fixes an uninitialized path issue in the hotspot cgroup v1 subsystem when _root != / and _root != cgroup_path;
  • fixes a logical error with lowest_limit in the path adjustment in the CgroupUtil::adjust_controller() methods;
  • adds container tests on subgroups in hotspot and metrics.
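
For illustration only, here is a minimal sketch of the prefix check behind the uninitialized-path item, using the _root = "/a", cgroup_path = "/a/b" example from the description. path_from_root is a hypothetical helper, not the code in this patch; it assumes _root is not "/" and returns an empty string for the case where _root is not a prefix of cgroup_path, which is the case that left the path unset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper (not HotSpot code): returns mount_point plus the part of
// cgroup_path that follows root. Returns "" when root is not a path prefix of
// cgroup_path. Assumes root != "/", which would be handled separately.
std::string path_from_root(const char* mount_point, const char* root, const char* cgroup_path) {
  assert(mount_point != nullptr && root != nullptr && cgroup_path != nullptr);
  assert(std::strcmp(root, "/") != 0);
  const std::size_t root_len = std::strlen(root);
  // Compare contents, not pointers: root must match the start of cgroup_path.
  if (std::strncmp(cgroup_path, root, root_len) != 0) {
    return std::string();
  }
  const char* suffix = cgroup_path + root_len;
  // Require a component boundary so that "/a" does not match "/ab".
  if (*suffix != '\0' && *suffix != '/') {
    return std::string();
  }
  return std::string(mount_point) + suffix;
}
```

With a mount point of /sys/fs/cgroup/cpu this yields /sys/fs/cgroup/cpu/b for the example above, and an empty result for the good/bad paths from the description, where a fallback such as the stat() check is needed.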

# Conflicts:
#	src/java.base/linux/classes/jdk/internal/platform/CgroupUtil.java
@openjdk

openjdk bot commented Dec 5, 2024

@sercher this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8343191
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added merge-conflict Pull request has merge conflict with target branch and removed merge-conflict Pull request has merge conflict with target branch labels Dec 5, 2024
Contributor

@jerboaa jerboaa left a comment

FYI: I'll try to test and review this more thoroughly next week.

Comment on lines 318 to 322
if (strstr(cgroup_path, "../") == nullptr) {
  ss.print_raw(cgroup_path);
} else {
  log_warning(os, container)("Cgroup cpu/memory controller path includes '../', detected limits won't be accurate");
}
Contributor

Please move this warning to CgroupUtil::adjust_controller and abort the adjustment, we don't need to issue this warning multiple times, and we'd not be able to adjust it to a path that will work. Showing the warning once should be sufficient. We shouldn't see this path in any non-moved scenarios. It would perhaps help if we included some detail why this warning is being shown. I suggest:

cgroup controller path seems to have moved (includes '../'), detected limits won't be accurate
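
A minimal sketch of the "../" check under discussion, for illustration only. controller_dir is a hypothetical helper, not a HotSpot function. It assumes that a moved cgroup path is recognised by the substring "../", and that in this case only the mount point is used as the lookup directory.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper (not HotSpot code): builds the directory in which the
// controller interface files are looked up.
std::string controller_dir(const char* mount_point, const char* cgroup_path) {
  assert(mount_point != nullptr && cgroup_path != nullptr);
  std::string dir(mount_point);
  if (std::strcmp(cgroup_path, "/") == 0) {
    return dir;  // root cgroup: the files live directly at the mount point
  }
  if (std::strstr(cgroup_path, "../") != nullptr) {
    // The process was moved outside the cgroup namespace root, so the suffix
    // cannot be resolved under the mount point. Do not append it.
    return dir;
  }
  dir += cgroup_path;  // ordinary subgroup: append the suffix
  return dir;
}
```

For the path from the logs, controller_dir("/sys/fs/cgroup", "/../../../../../../test") stays at /sys/fs/cgroup instead of producing a directory that cannot be opened. Emitting the warning once would be the caller's job.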

Contributor Author

Would you also recommend including the paths in that warning? Something like:
cgroup controller path at '/sys/fs/cgroup' seems to have moved to '../../test', detected limits won't be accurate
This way it will have all the information needed to investigate customer cases.

Contributor

Seems fine yes.

Contributor Author

Updated the log message. The log example in cg v1:

[0.001s][trace][os,container] OSContainer::init: Initializing Container Support
[0.001s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.001s][debug][os,container] Detected cgroups hybrid or legacy hierarchy, using cgroups v1 controllers
[0.001s][warning][os,container] Cgroup memory controller path at '/sys/fs/cgroup/memory' seems to have moved to '/../../test', detected limits won't be accurate
[0.001s][debug  ][os,container] OSContainer::init: is_containerized() = true because all controllers are mounted read-only (container case)
[0.001s][trace  ][os,container] Path to /cpu.cfs_quota_us is /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_quota_us
[0.001s][trace  ][os,container] CPU Quota is: -1
[0.002s][trace  ][os,container] Path to /cpu.cfs_period_us is /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_period_us
[0.002s][trace  ][os,container] CPU Period is: 100000
[0.002s][trace  ][os,container] OSContainer::active_processor_count: 48
[0.002s][trace  ][os,container] CgroupSubsystem::active_processor_count (cached): 48
[0.002s][trace  ][os,container] total physical memory: 133623721984
[0.002s][trace  ][os,container] Path to /memory.limit_in_bytes is /sys/fs/cgroup/memory/memory.limit_in_bytes
[0.002s][trace  ][os,container] Memory Limit is: 419430400
[0.004s][trace  ][os,container] CgroupSubsystem::active_processor_count (cached): 48
[0.027s][trace  ][os,container] Path to /cpu.cfs_quota_us is /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_quota_us
[0.027s][trace  ][os,container] CPU Quota is: -1
[0.027s][trace  ][os,container] Path to /cpu.cfs_period_us is /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_period_us
[0.027s][trace  ][os,container] CPU Period is: 100000
[0.027s][trace  ][os,container] OSContainer::active_processor_count: 48
openjdk version "24-internal" 2025-03-18
OpenJDK Runtime Environment (build 24-internal-adhoc.bellsoft.jdk)
OpenJDK 64-Bit Server VM (build 24-internal-adhoc.bellsoft.jdk, mixed mode, sharing)

Contributor Author

@jerboaa Could you please take a look?

@bridgekeeper

bridgekeeper bot commented Jan 22, 2025

@sercher This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@sonallux

sonallux commented Jan 24, 2025

Hi @sercher and @jerboaa, may I ask what the current state of this PR is? We are waiting for this bug fix.

@sercher
Contributor Author

sercher commented Jan 24, 2025 via email

@sercher
Contributor Author

sercher commented Jan 27, 2025

/reviewers 2

@openjdk

openjdk bot commented Jan 27, 2025

@sercher
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

Contributor

@jerboaa jerboaa left a comment

CgroupV1Controller::set_subsystem_path needs its high-level comment updated to describe the logic.

Testing:

And after the patch this would become this, right?

/sys/fs/cgroup/cpu,cpuacct/system.slice/garden.service/garden/bad/2f57368b-0eda-4e52-64d8-af5c
/sys/fs/cgroup/cpu,cpuacct/

It depends on whether it was a subgroup in the initial path. If bad/2f57368b-0eda-4e52-64d8-af5c is the subgroup, the reduction will be

/sys/fs/cgroup/cpu,cpuacct/system.slice/garden.service/garden/bad/2f57368b-0eda-4e52-64d8-af5c
/sys/fs/cgroup/cpu,cpuacct/bad/2f57368b-0eda-4e52-64d8-af5c
/sys/fs/cgroup/cpu,cpuacct/bad
/sys/fs/cgroup/cpu,cpuacct/

The above case doesn't seem to be covered by any gtest test case (or other test); please add those.

@sercher sercher requested a review from jerboaa February 20, 2025 17:35
@sercher
Contributor Author

sercher commented Feb 24, 2025

> CgroupV1Controller::set_subsystem_path needs high level comment update to describe the logic happening.

Done, added

@sercher
Contributor Author

sercher commented Feb 24, 2025

> The above case, doesn't seem to be reflected by any gtest test case (or others), please add those.

The subgroup path reduction is covered by TestMemoryWithSubgroups#testMemoryLimitSubgroupV1 (this isn't possible in gtests, as it requires touching the cgroup tree on a host system). The code checks that the subgroup directory exists and skips non-existent directories, which is reflected in the test log below. The last log line comes from CgroupV1Controller::set_subsystem_path.

========== NEW TEST CASE:      Cgroup V1 subgroup memory limit: 100m
[COMMAND]
docker run --tty=true --rm --privileged --cgroupns=host --memory 200m jdk-internal:test-containers-docker-TestMemoryWithSubgroups-subgroup sh -c mkdir -p /sys/fs/cgroup/memory/test ; echo 100m > /sys/fs/cgroup/memory/test/memory.limit_in_bytes ; echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs ; /jdk/bin/java -Xlog:os+container=trace -version
[2025-02-24T22:33:28.049656218Z] Gathering output for process 23369
[ELAPSED: 5385 ms]
[STDERR]

[STDOUT]
[0.002s][trace][os,container] OSContainer::init: Initializing Container Support
[0.003s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.004s][debug][os,container] Detected cgroups hybrid or legacy hierarchy, using cgroups v1 controllers
[0.004s][trace][os,container] set_subsystem_path: cgroup v1 path reduced to: /test.

With additional logging added before line 77, this could look like:

[STDOUT]
[0.002s][trace][os,container] OSContainer::init: Initializing Container Support
[0.002s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.003s][debug][os,container] Detected cgroups hybrid or legacy hierarchy, using cgroups v1 controllers
[0.004s][trace][os,container] set_subsystem_path: skipped non existent directory: /docker/cc32e455402a8c98d1df6a81c685a540e7e528e714c981b10845c31b64d8a370/test.
[0.004s][trace][os,container] set_subsystem_path: skipped non existent directory: /cc32e455402a8c98d1df6a81c685a540e7e528e714c981b10845c31b64d8a370/test.
[0.005s][trace][os,container] set_subsystem_path: cgroup v1 path reduced to: /test.
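
The order in this log (drop the leading component, check again) can be sketched roughly as follows. This is an illustrative sketch, not the patch itself: reduce_cgroup_path and dir_exists are hypothetical names, and the existence check is passed in as a parameter so the logic can be exercised without a real cgroup tree.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <sys/stat.h>

// Returns true if path names an existing directory (thin wrapper over stat()).
bool dir_exists(const std::string& path) {
  struct stat st;
  return stat(path.c_str(), &st) == 0 && S_ISDIR(st.st_mode);
}

// Hypothetical helper: drops leading components of cgroup_path until
// mount_point + suffix exists. Falls back to "/" (the mount point itself).
std::string reduce_cgroup_path(
    const std::string& mount_point,
    const std::string& cgroup_path,
    const std::function<bool(const std::string&)>& exists = dir_exists) {
  assert(!cgroup_path.empty() && cgroup_path[0] == '/');
  std::string suffix = cgroup_path;
  while (!suffix.empty()) {
    if (exists(mount_point + suffix)) {
      return suffix;  // first suffix that is visible inside the container
    }
    // Drop the leading component: "/a/b/c" -> "/b/c", "/c" -> "".
    const std::size_t next = suffix.find('/', 1);
    suffix = (next == std::string::npos) ? std::string() : suffix.substr(next);
  }
  return "/";
}
```

Called with a path of the form /docker/<id>/test, where only <mount point>/test exists, it tries the full path, then /<id>/test, and returns /test.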

Before the fix, the current path adjustment scheme would produce the following order:

/sys/fs/cgroup/memory/docker/cc32e455402a8c98d1df6a81c685a540e7e528e714c981b10845c31b64d8a370/test
/sys/fs/cgroup/memory/docker/cc32e455402a8c98d1df6a81c685a540e7e528e714c981b10845c31b64d8a370
/sys/fs/cgroup/memory/docker
/sys/fs/cgroup/memory

Only the last path is valid in the container; the others do not exist. The result would be 200m, while the correct value is 100m.
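
The 200m versus 100m outcome can be reproduced with a small model of the upward walk. This is a sketch under stated assumptions: parent_walk and lowest_limit are hypothetical names, and limit_at stands in for reading memory.limit_in_bytes, returning -1 when the directory does not exist in the container.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Lists the cgroup paths visited when trimming trailing components of
// cgroup_path until "/" is reached (the order shown above).
std::vector<std::string> parent_walk(std::string cgroup_path) {
  assert(!cgroup_path.empty() && cgroup_path[0] == '/');
  std::vector<std::string> visited;
  while (cgroup_path != "/") {
    visited.push_back(cgroup_path);
    const std::size_t last = cgroup_path.rfind('/');
    cgroup_path = (last == 0) ? "/" : cgroup_path.substr(0, last);
  }
  visited.push_back("/");
  return visited;
}

// Returns the smallest positive limit found on the walk, or host_limit if no
// visited path has a smaller one. limit_at returns -1 for unreadable paths.
long long lowest_limit(const std::string& start, long long host_limit,
                       const std::function<long long(const std::string&)>& limit_at) {
  long long lowest = host_limit;
  for (const std::string& path : parent_walk(start)) {
    const long long limit = limit_at(path);
    if (limit > 0 && limit < lowest) {
      lowest = limit;
    }
  }
  return lowest;
}
```

If only "/" (200m) and "/test" (100m) are readable, a walk that starts from the host path /docker/<id>/test never visits /test and ends at 200m, while a walk that starts from the reduced path /test finds 100m.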
