Bug Description
scontrol reconfigure fails with the error message "scontrol: error: NodeNames=juju-d566c2-1 MemSpecLimit=1024 is invalid, reset to 0" after adding a compute node to slurm.conf, if the machine the slurmd charm is deployed to has less than 1G of memory allocated to it.
The cause of this error is that the value of RealMemory can never be less than the value of MemSpecLimit. In the slurmd charm, however, the value of RealMemory is dynamic, as it is reported by slurmd -C and interpreted by the charm (slurm-charms/charms/slurmd/src/utils/machine.py, lines 23 to 43 at 8fd9d73), while the value of MemSpecLimit is set to the constant "1024" (slurm-charms/charms/slurmd/src/charm.py, line 367 at 8fd9d73).
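For illustration, here is a minimal sketch of how the reported RealMemory can end up below the hard-coded limit. The helper name and parsing details are assumptions, not the charm's actual machine.py code:

```python
import subprocess


def real_memory_mib() -> int:
    """Return the RealMemory value (in MiB) reported by `slurmd -C`.

    Illustrative only: `slurmd -C` prints a node definition such as
    `NodeName=host CPUs=1 ... RealMemory=976 ...`; this helper simply
    extracts the RealMemory field from that output.
    """
    output = subprocess.run(
        ["slurmd", "-C"], capture_output=True, text=True, check=True
    ).stdout
    for token in output.split():
        if token.startswith("RealMemory="):
            return int(token.removeprefix("RealMemory="))
    raise ValueError("RealMemory not found in `slurmd -C` output")


# The charm, meanwhile, pins MemSpecLimit to a constant, so nothing stops
# the reported RealMemory (e.g. slightly under 1024 on a machine with <1G
# allocated) from falling below it.
MEM_SPEC_LIMIT = 1024
```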
slurmd -C can report that the machine's available RealMemory is less than 1G, but the slurmd charm has no way of handling this edge case. Since the bad node configuration is written to the slurm.conf file without any validation, the invalid value isn't caught until scontrol reconfigure is run to signal all the daemons to reload their configuration files.
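As a sketch of the kind of guard that is currently missing (the function name and call site are hypothetical), the charm could reject a node definition whose MemSpecLimit exceeds the reported RealMemory before it is ever written to slurm.conf:

```python
def validate_node_memory(real_memory: int, mem_spec_limit: int) -> None:
    """Hypothetical pre-write check for node memory parameters.

    Slurm requires MemSpecLimit <= RealMemory, so catching the violation
    here surfaces the problem when the configuration is rendered instead
    of failing later during `scontrol reconfigure`.
    """
    if mem_spec_limit > real_memory:
        raise ValueError(
            f"MemSpecLimit={mem_spec_limit} exceeds RealMemory={real_memory}; "
            "Slurm will reject this node definition"
        )
```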
The easiest fix here is to develop a heuristic for determining what MemSpecLimit should be when the machine has < 1G of memory available. This is really only a problem for test and/or exploratory deployments, and isn't an optimal configuration for production-level clusters, so we should maybe also log a warning that the node memory configuration is not optimal for production deployments, but is suitable for test/exploratory deployments.
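One possible shape for that heuristic is sketched below; the constant, logger, and fallback value of 0 (Slurm's own default for MemSpecLimit) are assumptions, not the charm's actual code:

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_MEM_SPEC_LIMIT = 1024  # MiB; the value the charm currently hard-codes


def choose_mem_spec_limit(real_memory: int) -> int:
    """Pick a MemSpecLimit (in MiB) that is always valid for this node.

    Slurm rejects MemSpecLimit > RealMemory, so on machines reporting less
    than 1G we fall back to 0 (no memory reserved for system daemons) and
    warn that the node is undersized for production use.
    """
    if real_memory >= DEFAULT_MEM_SPEC_LIMIT:
        return DEFAULT_MEM_SPEC_LIMIT

    logger.warning(
        "node reports only %d MiB of RealMemory; setting MemSpecLimit=0. "
        "This is suitable for test/exploratory deployments but not optimal "
        "for production clusters.",
        real_memory,
    )
    return 0
```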
To Reproduce
Assume Juju controller is bootstrapped on LXD cloud...
juju deploy slurmd --base [email protected] --channel "edge" --constraints="virt-type=virtual-machine"
juju deploy slurmctld --base [email protected] --channel "edge" --constraints="virt-type=virtual-machine"
juju integrate slurmd slurmctld
juju run slurmctld/leader resume <slurmd/0-hostname>
See uncaught SlurmOpsError exception in juju debug-log --include slurmctld/0 --level ERROR
Environment
LXD version: 5.21.2 LTS
Juju client version: 3.6.1-genericlinux-amd64
Juju controller version: 3.6.0
slurmctld revision number: 86
slurmd revision number: 107
Relevant log output
unit-slurmctld-0: 14:20:49 ERROR unit.slurmctld/0.juju-log slurmd:5: command ['scontrol', 'reconfigure'] failed with message scontrol: error: NodeNames=juju-d566c2-1 MemSpecLimit=1024 is invalid, reset to 0
slurm_reconfigure error: Socket timed out on send/recv operation
unit-slurmctld-0: 14:20:50 ERROR unit.slurmctld/0.juju-log slurmd:5: Uncaught exception while in charm code:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-slurmctld-0/charm/./src/charm.py", line 499, in<module>
main.main(SlurmctldCharm)
File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/__init__.py", line 348, in main
return _legacy_main.main(
^^^^^^^^^^^^^^^^^^
File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/main.py", line 45, in main
return _main.main(charm_class=charm_class, use_juju_for_storage=use_juju_for_storage)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/_main.py", line 543, in main
manager.run()
File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/_main.py", line 529, in run
self._emit()
File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/_main.py", line 518, in _emit
_emit_charm_event(self.charm, self.dispatcher.event_name, self._juju_context)
File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/_main.py", line 134, in _emit_charm_event
event_to_emit.emit(*args, **kwargs)
File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 347, in emit
framework._emit(event)
File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 857, in _emit
self._reemit(event_path)
File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 947, in _reemit
custom_handler(event)
File "/var/lib/juju/agents/unit-slurmctld-0/charm/src/interface_slurmd.py", line 153, in _on_relation_changed
self.on.partition_available.emit()
File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 347, in emit
framework._emit(event)
File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 857, in _emit
self._reemit(event_path)
File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 947, in _reemit
custom_handler(event)
File "/var/lib/juju/agents/unit-slurmctld-0/charm/./src/charm.py", line 308, in _on_write_slurm_conf
self._slurmctld.scontrol("reconfigure")
File "/var/lib/juju/agents/unit-slurmctld-0/charm/lib/charms/hpc_libs/v0/slurm_ops.py", line 943, in scontrol
return _call("scontrol", *args).stdout
^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/juju/agents/unit-slurmctld-0/charm/lib/charms/hpc_libs/v0/slurm_ops.py", line 153, in _call
raise SlurmOpsError(f"command {cmd} failed. stderr:\n{result.stderr}")
charms.hpc_libs.v0.slurm_ops.SlurmOpsError: command ['scontrol', 'reconfigure'] failed. stderr:
scontrol: error: NodeNames=juju-d566c2-1 MemSpecLimit=1024 is invalid, reset to 0
slurm_reconfigure error: Socket timed out on send/recv operation
unit-slurmctld-0: 14:20:50 ERROR juju.worker.uniter.operation hook "slurmd-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
Additional context
No response