Skip to content

Commit

Permalink
driver/base/cpu: Retry online operation if -EBUSY
Browse files Browse the repository at this point in the history
Booting the kernel with "maxcpus=1" is a common technique for CPU
partitioning and isolation. It delays the CPU bringup process until
when the bootup scripts are ready to bring CPUs online by writing 1 to
/sys/device/system/cpu/cpu<X>/online. However, it was found that not
all the CPUs were online after bootup. The collection of offline CPUs
are different after every reboot.

Further investigation reveals that some "online" write operations
fail with an -EBUSY error. This error is returned when CPU hotplug is
temporiarly disabled when cpu_hotplug_disable() is called.

During bootup, the main caller of cpu_hotplug_disable() is
pci_call_probe() for PCI device initialization. By measuring the times
spent with cpu_hotplug_disabled set in a typical 2-socket server, most
of them last less than 10ms.  However, there are a few that can last
hundreds of ms. Note that the cpu_hotplug_disabled period of different
devices can overlap leading to longer cpu_hotplug_disabled hold time.

Since the CPU hotplug disable condition is transient and it is not
that easy to modify all the existing bootup scripts to handle this
condition, the kernel can help by retrying the online operation when
an -EBUSY error is returned. This patch retries the online operation
in cpu_subsys_online() when an -EBUSY error is returned for up to 5
times after an exponentially increasing delay that can last a total of
at least 620ms of waiting time by calling msleep().

With this patch in place, booting up the patched kernel with "maxcpus=1"
does not leave any CPU in an offline state in 10 reboot attempts.

Reported-by: Vishal Agrawal <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
  • Loading branch information
Waiman-Long authored and gregkh committed Aug 5, 2023
1 parent 3356855 commit 1fd7ab3
Showing 1 changed file with 19 additions and 0 deletions.
19 changes: 19 additions & 0 deletions drivers/base/cpu.c
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
#include <linux/cpufeature.h>
#include <linux/tick.h>
#include <linux/pm_qos.h>
#include <linux/delay.h>
#include <linux/sched/isolation.h>

#include "base.h"
Expand Down Expand Up @@ -50,12 +51,30 @@ static int cpu_subsys_online(struct device *dev)
int cpuid = dev->id;
int from_nid, to_nid;
int ret;
int retries = 0;

from_nid = cpu_to_node(cpuid);
if (from_nid == NUMA_NO_NODE)
return -ENODEV;

retry:
ret = cpu_device_up(dev);

/*
* If -EBUSY is returned, it is likely that hotplug is temporarily
* disabled when cpu_hotplug_disable() was called. This condition is
* transient. So we retry after waiting for an exponentially
* increasing delay up to a total of at least 620ms as some PCI
* device initialization can take quite a while.
*/
if (ret == -EBUSY) {
retries++;
if (retries > 5)
return ret;
msleep(10 * (1 << retries));
goto retry;
}

/*
* When hot adding memory to memoryless node and enabling a cpu
* on the node, node number of the cpu may internally change.
Expand Down

0 comments on commit 1fd7ab3

Please sign in to comment.