HOW-TO: Kill Zombie Processes with Parent Process ID (PPID) of 1 ~ pimp-my-rig reloaded

As system administrators, we are often briefed on the careful use of the "kill" command. Indeed, to use this command you really, really need to know what you're doing. Seasoned system administrators often talk about the first time they used it, and their racing heart before they hit Enter.

It's not that bad, though. Like all critical tools, this command can be the last resort you need to resolve a complex issue, a break-glass-in-case-of-fire sort of thing. But the "kill" command doesn't always work. I'd say it works about 99% of the time. You need to have been a system administrator for a while to run into the failure case, as it doesn't come along very often.

I decided to post this solution because last week I encountered a runaway Samba (smbd) process. Samba had to be stopped, but instead of shutting down gracefully, only the parent process exited and it left behind a lot of child processes. They in turn became zombie processes, about 257 of them.

It would not have mattered much, but the load on the server increased to match the number of zombie processes. In a matter of seconds, it became a critical problem that needed to be dealt with. To illustrate how critical the situation was, here are a couple of indicators.

[root@vhost ~]# ps -ef | grep smbd | wc -l
257
[root@vhost ~]# uptime
15:18:09 up 95 days,  5:04,  6 users,  load average: 257.91, 257.98, 255.54
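
One caveat with a count like the one above: "ps -ef | grep smbd" can also match the grep process itself, so the number may be off by one. Here is a minimal sketch of a stricter count that matches the command name exactly. The function name count_by_name is a hypothetical helper, and the usage line assumes a procps-style ps that accepts "-eo comm=".

```shell
#!/bin/sh
# Hypothetical helper: reads one command name per line on stdin and
# prints how many lines exactly equal the name given as $1.
count_by_name() {
    awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

# Usage on a live system (assumes procps ps):
#   ps -eo comm= | count_by_name smbd
```

Because the match is exact, neither "grep smbd" nor a similarly named daemon such as nmbd is counted.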

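When ps was executed, it returned smbd processes with a parent process ID (PPID) of 1, meaning they had been orphaned and adopted by init. I call them zombie processes here, although strictly speaking a zombie is a process that ps marks as <defunct>.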

[root@vhost ~]# ps -ef | grep smb | tail
nobody   31590     1  0 12:29 ?        00:00:00 smbd -D
nobody   31705     1  0 11:55 ?        00:00:00 smbd -D
nobody   31743     1  0 14:31 ?        00:00:00 smbd -D
nobody   32049     1  0 12:19 ?        00:00:00 smbd -D
nobody   32415     1  0 12:04 ?        00:00:00 smbd -D
nobody   32484     1  0 14:25 ?        00:00:00 smbd -D
nobody   32508     1  0 13:35 ?        00:00:00 smbd -D
nobody   32511     1  0 12:04 ?        00:00:00 smbd -D
nobody   32535     1  0 13:22 ?        00:00:00 smbd -D
nobody   31599     1  0 11:51 ?        00:00:00 smbd -D
[root@vhost ~]# kill -9 31599

Killing any of the smbd processes didn't do much. The server load held steady at 257 and none of the smbd processes were terminated.
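
When "kill -9" has no effect, it helps to look at the process state and not just the parent ID. In ps, state Z marks a true zombie, while D marks a process stuck in uninterruptible sleep, and on Linux the D-state processes are the ones that count toward the load average. The sketch below breaks the children of PID 1 down by state. The function name init_children_by_state is hypothetical, and the usage line assumes a procps-style ps.

```shell
#!/bin/sh
# Hypothetical helper: reads "PID PPID STAT COMMAND" lines on stdin and
# prints one "STATE COUNT" line per state letter, considering only
# processes whose parent is PID 1.
init_children_by_state() {
    awk '$2 == 1 { s = substr($3, 1, 1); n[s]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Usage on a live system (assumes procps ps):
#   ps -eo pid=,ppid=,stat=,comm= | init_children_by_state
```

A large count next to Z or D among the children of init is the situation described here. A large count next to S is usually just ordinary daemons.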

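Once a process becomes a zombie like this (parented to PID 1 and unresponsive even to "kill -9"), no amount of "kill" will terminate it. A true zombie has already exited and is only waiting for its parent to collect its exit status, so there is nothing left for a signal to act on. The only other widely known way to get rid of such a process is to reboot the host. You and I both know this is to be avoided as much as possible.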

The other way to do it is not known to many. Only a handful of the system administrators I personally know knew how to use it. The command is "telinit", and it is the alternative to rebooting the server. To clear the zombie processes, or stuck processes with a PPID of 1, execute "telinit u" or "telinit U". This makes init (the process with ID 1) restart or, as the man page puts it, re-execute itself, without rebooting the host.
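
The procedure can be sketched as a small wrapper that re-executes init only when PID 1 actually has zombie children. This is a rough sketch and not a hardened tool: the function names count_init_zombies and reexec_init_if_needed and the DRY_RUN variable are all hypothetical, and it assumes a host where "telinit u" is available and a procps-style ps.

```shell
#!/bin/sh
# Hypothetical helper: reads "PPID STAT" lines on stdin and prints the
# number of zombie (state Z) processes whose parent is PID 1.
count_init_zombies() {
    awk '$1 == 1 && $2 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Hypothetical wrapper: takes the zombie count as $1 and re-executes init
# only if it is greater than zero. With DRY_RUN=1 it just prints the command.
reexec_init_if_needed() {
    zombies=$1
    if [ "$zombies" -gt 0 ]; then
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "would run: telinit u"
        else
            telinit u
        fi
    else
        echo "no zombie children of init; nothing to do"
    fi
}

# Usage on a live system, as root (assumes procps ps):
#   reexec_init_if_needed "$(ps -eo ppid=,stat= | count_init_zombies)"
```

Setting DRY_RUN=1 first is a cheap way to confirm what the wrapper would do before touching init on a production host.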

Once this was done, the server load dropped immediately. The smbd processes that used to be zombies were gone as well.

[root@vhost ~]# uptime
15:21:22 up 95 days,  5:07,  6 users,  load average: 49.06, 184.80, 229.74

About 5 minutes after executing telinit, the load had re-stabilized.

So if a fellow administrator asks whether you know how to kill a zombie process, let them know that you belong to the handful of seasoned system administrators who don't consider a reboot the solution for terminating a zombie process.

I have asked this question in several interviews, and so far only a few candidates have gotten it right. I consider it a major point in hiring.