Hey Erik, thought I'd share this with you. I ran into a problem on Beowulf where slave processes were left hanging in limbo after the master process died or was killed, and several recent ps -aux listings show a few other users, probably unwittingly, experiencing the same thing.  It doesn't seem to cost much in the way of CPU cycles, but all the same it's getting pretty messy. Here's something that can be put in a script to keep everything clean:

kill `ps -aux | grep $USER | grep -v grep | grep -v ps | awk '{print $2}'`
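A narrower variant is possible if only the stranded slaves should go. The sketch below assumes that orphaned processes get reparented to init (PID 1), which may not hold on every system, and that the slave program's name is known; myslave is a made-up name for illustration. The function only filters "PID PPID COMMAND" lines, so it can be checked without killing anything:

```shell
#!/bin/sh
# orphan_pids PROG: read "PID PPID COMMAND" lines on stdin and print the
# PIDs whose parent is init (PPID 1) and whose command name equals PROG.
# Assumption: orphans are reparented to PID 1 on this system.
orphan_pids() {
    prog=$1
    awk -v prog="$prog" '$2 == 1 && $3 == prog { print $1 }'
}

# Intended use on a node (myslave is a hypothetical program name):
#   pids=$(ps -u "$USER" -o pid= -o ppid= -o comm= | orphan_pids myslave)
#   [ -n "$pids" ] && kill $pids
```

Legitimate daemons also have PPID 1, which is why the sketch matches on the program name as well.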

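The kill one-liner unfortunately has to be run on every node where stray processes are still hanging around, but it does seem to work.  Note that it signals every process whose ps line mentions $USER, so it should only be run when nothing else of yours is running on that node.  The following makes a script to kill across the nodes: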

grep -v "#" < machines.my | awk '{print "ssh", $1, "./killbatch"}' > killall
chmod 700 killall

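where machines.my can be replaced with the user's own or the default machine file.  killbatch (the script containing the kill line) has to be executable and sit in the top level of the user's home directory, since that is where ssh starts the remote command.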
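If the machine file has blank lines, trailing comments, repeated hosts, or host:n entries (the form some MPICH machine files use for multi-CPU nodes), the generator can be made a bit more forgiving. This is only a sketch under those assumptions; gen_killall is a made-up function name:

```shell
#!/bin/sh
# gen_killall MACHINEFILE: print one "ssh HOST ./killbatch" line per
# unique host listed in MACHINEFILE.
gen_killall() {
    # Strip comments, skip blank lines, drop an optional ":n" suffix,
    # and emit each host only once.
    sed -e 's/#.*//' "$1" |
        awk 'NF { sub(/:.*/, "", $1)
                  if (!seen[$1]++) print "ssh", $1, "./killbatch" }'
}

# Intended use:
#   gen_killall machines.my > killall && chmod 700 killall
```

Unlike grep -v "#", this keeps a host that has a comment after it on the same line.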

Thanks!
...matt


It is better for civilization to be going down the drain than to be coming up it.
        -- Henry Allen