[Bug]: The runner does not terminate the job commands gracefully but sends sigkill #2233

Open
r4victor opened this issue Jan 27, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@r4victor
Collaborator

Steps to reproduce

  1. Start a run (e.g. a dev environment)
  2. Stop the run
  3. The shim logs an error indicating that the container exited with code 137 (128 + 9, i.e. SIGKILL):
time=2025-01-27T15:59:59.342462+05:00 level=error msg=failed to run err=container exited with exit code 137 task=1d2e336f-b4c8-416f-8211-f35e9cc5b71e

There is also this runner log message:

time=2025-01-27T05:46:50.87042-05:00 level=error msg=Executor failed err=[executor.go:145 executor.(*RunExecutor).Run] [executor.go:339 executor.(*RunExecutor).execJob] signal: killed
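
Both messages describe the same event. By the usual shell and container-runtime convention, a process terminated by signal N is reported with exit code 128 + N, so 137 corresponds to signal 9 (SIGKILL). A minimal Go sketch of that arithmetic follows; `signalFromExitCode` is a hypothetical helper for illustration, not part of dstack.

```go
package main

import (
	"fmt"
	"syscall"
)

// signalFromExitCode is a hypothetical helper (not dstack code).
// It decodes the "128 + signal number" exit code convention and
// reports false when the code does not follow that convention.
func signalFromExitCode(code int) (syscall.Signal, bool) {
	if code <= 128 || code > 255 {
		return 0, false
	}
	return syscall.Signal(code - 128), true
}

func main() {
	if sig, ok := signalFromExitCode(137); ok {
		// Prints the signal number and its platform-specific description.
		fmt.Printf("exit code 137 -> signal %d (%v)\n", int(sig), sig)
	}
}
```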

Actual behaviour

The runner executes the user commands as ["/bin/bash", "-i", "-c", ...commands joined by &&]. On stopping, it sends SIGINT. If the command fails to terminate within ex.killDelay (10s), the runner sends SIGKILL:

	cmd := exec.CommandContext(ctx, ex.jobSpec.Commands[0], ex.jobSpec.Commands[1:]...)
	cmd.Cancel = func() error {
		// returns error on Windows
		return gerrors.Wrap(cmd.Process.Signal(os.Interrupt))
	}
	cmd.WaitDelay = ex.killDelay // kills the process if it doesn't exit in time
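
The effect of this configuration can be reproduced outside the runner. The sketch below is a stand-alone approximation, not the runner's code: it assumes a POSIX `/bin/sh` and `sleep` are available, uses a child that ignores SIGINT as a stand-in for the interactive shell, and shortens the delays. `runIgnoringInterrupt` is a hypothetical name.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// runIgnoringInterrupt is a hypothetical demo (not dstack code).
// It starts a child that ignores SIGINT, cancels it after 100ms,
// and lets WaitDelay escalate to a kill 300ms later.
func runIgnoringInterrupt() error {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	// An ignored signal stays ignored across exec, so sleep never sees SIGINT.
	cmd := exec.CommandContext(ctx, "/bin/sh", "-c", "trap '' INT; exec sleep 5")
	cmd.Cancel = func() error {
		// Called when ctx is done: ask politely first.
		return cmd.Process.Signal(os.Interrupt)
	}
	// If the child is still alive this long after Cancel, os/exec kills it.
	cmd.WaitDelay = 300 * time.Millisecond

	return cmd.Run()
}

func main() {
	start := time.Now()
	err := runIgnoringInterrupt()
	fmt.Printf("err=%v after %v\n", err, time.Since(start).Round(100*time.Millisecond))
}
```

On Linux the returned error should read `signal: killed`, the same text that appears in the runner log, and the call should return well before the 5 seconds that `sleep` was asked to run.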

The problem is caused by how bash (and other shells as well) handles signals in interactive mode. It neither propagates SIGINT nor exits:

SIGNALS
       When bash is interactive, in the absence of any traps, it ignores SIGTERM (so that  kill  0  does  not  kill  an  interactive
       shell),  and  SIGINT  is caught and handled (so that the wait builtin is interruptible).  In all cases, bash ignores SIGQUIT.
       If job control is in effect, bash ignores SIGTTIN, SIGTTOU, and SIGTSTP.

Sending SIGTERM is not an option either, since it is ignored completely.

Possible solutions:

  • Send SIGHUP. It makes the shell exit, and "Before exiting, an interactive shell resends the SIGHUP to all jobs, running or stopped." The problem with this is that daemon processes often ignore SIGHUP (e.g. anything shielded by nohup).
  • Trap signals (e.g. SIGTERM) and terminate all jobs in the interactive shell and the shell itself.

Expected behaviour

No response

dstack version

master

Server logs

Additional information

No response

@r4victor r4victor added the bug Something isn't working label Jan 27, 2025
@r4victor
Collaborator Author

Correction after some testing.

  1. If I run only one command, it does receive SIGINT:
type: task
commands:
  - python trap.py # runs a loop, traps signals and prints
  2. If I run multiple commands, the foreground one receives SIGHUP:
type: task
commands:
  - python trap.py # now it prints SIGHUP
  - something else
  3. If I run multiple commands and some are in the background, then only the foreground one receives SIGHUP:
type: task
commands:
  - python trap.py >&1 & # prints nothing
  - python trap.py # prints SIGHUP

So the problem is not as critical, since in most cases SIGHUP is sent and only bash itself gets killed.

The following problems do remain:

  • Processes may ignore SIGHUP and not perform a clean exit.
  • Background processes do not receive any signal, so they get killed.

Ideally, all jobs/processes would receive SIGTERM.
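
One conceivable way to get there is for the runner to signal the shell's descendants directly instead of relying on the shell to forward anything. The Go sketch below is only an illustration under stated assumptions: the pid-to-parent map is assumed to have been collected elsewhere (for example from /proc on Linux), `descendants` and `terminateDescendants` are hypothetical names, and `syscall.Kill` is Unix-only. Processes that have daemonized and been reparented away from the shell would not be found this way.

```go
package main

import (
	"fmt"
	"sort"
	"syscall"
)

// descendants returns every transitive child of root in breadth-first
// order, given a pid -> parent pid map (assumed to be collected elsewhere).
func descendants(parents map[int]int, root int) []int {
	children := make(map[int][]int)
	for pid, ppid := range parents {
		children[ppid] = append(children[ppid], pid)
	}
	for _, c := range children {
		sort.Ints(c) // deterministic order regardless of map iteration
	}

	var out []int
	seen := map[int]bool{root: true} // guards against malformed, cyclic input
	queue := []int{root}
	for len(queue) > 0 {
		pid := queue[0]
		queue = queue[1:]
		for _, c := range children[pid] {
			if seen[c] {
				continue
			}
			seen[c] = true
			out = append(out, c)
			queue = append(queue, c)
		}
	}
	return out
}

// terminateDescendants sends SIGTERM to each descendant of root.
// The root itself (the interactive shell) is skipped because it ignores SIGTERM.
func terminateDescendants(parents map[int]int, root int) {
	for _, pid := range descendants(parents, root) {
		_ = syscall.Kill(pid, syscall.SIGTERM) // best effort; the process may already be gone
	}
}

func main() {
	// Made-up process table: 10 is the shell, 11 and 12 are its jobs, 13 is a grandchild.
	parents := map[int]int{10: 1, 11: 10, 12: 10, 13: 11, 20: 1}
	fmt.Println(descendants(parents, 10))
}
```

The shell would still have to be stopped through the existing SIGINT/SIGHUP path, and the kill delay remains the fallback for processes that do not exit after SIGTERM.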
