Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diamond leaving around zombie children of collector processes #674

Open
bakerkj opened this issue Aug 14, 2017 · 0 comments
Open

Diamond leaving around zombie children of collector processes #674

bakerkj opened this issue Aug 14, 2017 · 0 comments

Comments

@bakerkj
Copy link
Contributor

bakerkj commented Aug 14, 2017

I see child processes of diamond collectors occasionally become and stay zombies. The processes that have become zombies are (ping and unbound-control). I am also seeing messages in the log "Took too long to run! Killed!" in the logs.

From looking at the code (I am running something very close to master as of May 16th with a few small modifications to some of the collectors not related to this problem). These external programs are invoked by the UnboundCollector and PingCollector which both use the run_command() found in collector.py. run_command() uses subprocess.Popen(command, stdout=subprocess.PIPE).communicate(). The collect loop is run via utils/scheduler.py:collector_process(). This also controls the timeout behavior with signal alarm and the timeout SIGALRMException prints the log message "Took too long to run! Killed!".

I believe what is happening is that the signal.alarm is triggered because these processes (ping and unbound-control) are sometimes taking longer than the allowed timeout to run. This alarm interrupts the wait() call inside of subprocess.communicate(). The python collector thread carries on to try again at some future time. The ping/unbound-control commands eventually die, but since the wait() call was interrupted their parent (the collector thread) never comes back to check their exit code and those processes are left around as Zombies.

I believe the correct thing to do would be for the collector thread (maybe in the run_command() method), in the case of an alarm, would be to abruptly kill and then wait on the child process before going on to do other things.

Has anyone else seen this issue?

Does my proposed solution sound reasonable?

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant