Thursday, November 27, 2014

Killing subprocess hierarchies with Python

I'm currently working on a distributed Python project that uses Flask as a communication interface between nodes. I'm doing all my testing locally, and starting instances of the application manually becomes tiresome quickly. I decided to write a test harness to simplify things. Things were going swimmingly until I tried to implement some cleanup functionality to kill all the nodes. Here's why: Flask forks off a child process when you start it up (typically the work of the reloader that debug mode turns on):
$ ./app.py 
$ ps | grep app.py
dkudrow    4230    Ss    env/bin/python ~/src/project/app.py
dkudrow    4233    Sl    ~/src/project/env/bin/python ~/src/project/app.py
In my test harness, I start these with subprocess as suggested by the kind folks at Python:
import os
import subprocess
cmd = [os.path.join(os.getcwd(), 'app.py')]  # join, since getcwd() has no trailing slash
proc = subprocess.Popen(cmd)
After sending a SIGTERM like this,
proc.terminate()
I'm left with this mess:
$ ps | grep app.py
dkudrow    4230    Z+      env/bin/python ~/src/project/app.py
dkudrow    4233    Sl+   ~/src/project/env/bin/python ~/src/project/app.py
The first process is killed and becomes a zombie: it stays in the process table until the harness reaps it with proc.wait(). Its child never sees the signal and keeps running. Like many things in Python, a simple task has just become irritatingly difficult. Thankfully there is a solution carefully hidden in the documentation.
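
The leftover child can be reproduced without Flask. The sketch below is only an illustration for POSIX systems: the PARENT and GRANDCHILD one-liners are hypothetical stand-ins for app.py and the process it spawns, and orphan_survives_terminate is a helper name made up for this post. It suggests that proc.terminate() signals only the direct child, and that proc.wait() is what clears the zombie entry.

```python
import os
import signal
import subprocess
import sys

# Hypothetical stand-ins: PARENT plays the role of app.py, GRANDCHILD the
# process it spawns. Neither is taken from the real project.
GRANDCHILD = "import time; time.sleep(30)"
PARENT = (
    "import subprocess, sys, time; "
    "p = subprocess.Popen([sys.executable, '-c', %r]); "
    "print(p.pid); sys.stdout.flush(); time.sleep(30)" % GRANDCHILD
)


def orphan_survives_terminate():
    """Return True if the grandchild is still alive after proc.terminate()."""
    proc = subprocess.Popen([sys.executable, "-c", PARENT],
                            stdout=subprocess.PIPE)
    # The parent prints the grandchild's PID once it has been started.
    grandchild_pid = int(proc.stdout.readline().strip())
    proc.terminate()   # SIGTERM goes to the direct child only
    proc.wait()        # reap the child so it does not linger as a zombie
    try:
        os.kill(grandchild_pid, 0)   # signal 0: existence check, sends nothing
        survived = True
    except OSError:
        survived = False
    if survived:
        os.kill(grandchild_pid, signal.SIGTERM)   # clean up after the demo
    proc.stdout.close()
    return survived
```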

While subprocess doesn't yet (or may never?) support management of process groups, the os module does. The os.killpg(pgid, sig) function sends a signal to every process in a process group. We can mix and match modules here to achieve what we want.

First off, we set the preexec_fn parameter of the Popen constructor:
proc = subprocess.Popen(cmd, preexec_fn=os.setsid)
preexec_fn specifies a function to call in the new process before cmd is exec'd. In this case we pass os.setsid, which simply makes the setsid system call in the new process. You can refer to the man page for more details, but in a few words, setsid creates a new session and a new process group and makes the calling process the leader of both.
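
A quick way to check what setsid did is to compare the child's IDs with its PID. The following is a minimal sketch, assuming a POSIX system (tested mentally against Linux semantics); the sleeping one-liner is a hypothetical stand-in for app.py and describe_child is a name invented here. On Python 3.2 and later, passing start_new_session=True to Popen makes the same setsid call without needing preexec_fn.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for app.py: a child that just sleeps.
CMD = [sys.executable, "-c", "import time; time.sleep(30)"]


def describe_child(cmd):
    """Start cmd under setsid and report how its IDs relate to its PID."""
    proc = subprocess.Popen(cmd, preexec_fn=os.setsid)
    try:
        return {
            # A group leader's process group ID equals its own PID.
            "group_leader": os.getpgid(proc.pid) == proc.pid,
            # Likewise for the session ID of a session leader.
            "session_leader": os.getsid(proc.pid) == proc.pid,
            # os.getpgid(0) is the harness's own process group.
            "shares_our_group": os.getpgid(proc.pid) == os.getpgid(0),
        }
    finally:
        proc.kill()
        proc.wait()
```

Calling describe_child(CMD) should report the child as leader of its own group and session, separate from the harness's group.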

We can then call
os.killpg(proc.pid, signal.SIGTERM)
and since proc is a process group leader, its PID is also the process group ID, so every process in the group (proc and its children) will receive the SIGTERM.
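
Put together, the two calls could be wrapped up for a test harness roughly like this. It is a sketch rather than the harness from my project: start_in_new_group and terminate_group are hypothetical helper names, and the final proc.wait() is an addition that reaps the leader so no zombie is left behind.

```python
import errno
import os
import signal
import subprocess


def start_in_new_group(cmd, **popen_kwargs):
    """Start cmd as the leader of its own session and process group."""
    return subprocess.Popen(cmd, preexec_fn=os.setsid, **popen_kwargs)


def terminate_group(proc, sig=signal.SIGTERM):
    """Signal every process in proc's group, then reap proc itself."""
    try:
        # proc leads its group, so its PID doubles as the process group ID.
        os.killpg(proc.pid, sig)
    except OSError as exc:
        if exc.errno != errno.ESRCH:   # ESRCH: the group is already gone
            raise
    return proc.wait()   # collect the exit status so no zombie is left
```

In the harness, proc = start_in_new_group(cmd) would take the place of the Popen call and terminate_group(proc) the place of proc.terminate(). A descendant that calls setsid or setpgid itself leaves the group and will not be signalled.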

Problem solved.