It's not that bad, though. Like all critical tools, this command could be the "last-resort" option you need to resolve a complex issue: an "in case of fire, break glass" sort of thing. But the "kill" command doesn't always work. I'd say it works about 99% of the time. You need to have been a system administrator for a long while to encounter this problem, as it doesn't come along very often.
I decided to post this solution because last week I encountered a runaway Samba (smbd) process. Samba had to be stopped, but instead of shutting down gracefully, only the parent process exited, leaving behind a lot of child processes. They in turn became zombie processes, about 257 of them.
It would not have mattered much, but the load on the server increased to match the number of zombie processes. In a matter of seconds, it became a critical problem that needed to be dealt with. To illustrate the criticality of the situation, here are a couple of indicators.
[root@vhost ~]# ps -ef | grep smbd | wc -l
257
[root@vhost ~]# uptime
15:18:09 up 95 days, 5:04, 6 users, load average: 257.91, 257.98, 255.54
When ps was executed, it returned smbd processes with a parent process ID (PPID) of 1, shown in the third column of the output: the zombie processes.
[root@vhost ~]# ps -ef | grep smb | tail
nobody 31590 1 0 12:29 ? 00:00:00 smbd -D
nobody 31705 1 0 11:55 ? 00:00:00 smbd -D
nobody 31743 1 0 14:31 ? 00:00:00 smbd -D
nobody 32049 1 0 12:19 ? 00:00:00 smbd -D
nobody 32415 1 0 12:04 ? 00:00:00 smbd -D
nobody 32484 1 0 14:25 ? 00:00:00 smbd -D
nobody 32508 1 0 13:35 ? 00:00:00 smbd -D
nobody 32511 1 0 12:04 ? 00:00:00 smbd -D
nobody 32535 1 0 13:22 ? 00:00:00 smbd -D
nobody 31599 1 0 11:51 ? 00:00:00 smbd -D
[root@vhost ~]# kill -9 31599
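The listing above can be narrowed to just the smbd processes owned by init, along with their state. What follows is a minimal sketch, assuming a procps-style ps that accepts -eo with the pid, ppid, stat and comm fields; the function name init_children is made up for illustration. A STAT value starting with Z marks a defunct (zombie) entry, while D marks a process stuck in uninterruptible sleep.

```shell
# init_children NAME: read "PID PPID STAT COMM" lines on stdin and print
# "PID STAT" for every process called NAME whose parent is init (PPID 1).
# Hypothetical helper for illustration, not part of any standard tool.
init_children() {
    awk -v name="$1" '$2 == 1 && $4 == name { print $1, $3 }'
}

# Intended use on a live system (assumes a procps-style ps):
#   ps -eo pid=,ppid=,stat=,comm= | init_children smbd
```

Keeping the filter separate from the ps call makes it easy to try against saved output before running it on a struggling server.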
Killing any of the smbd processes didn't do much. The server load held steady at 257 and none of the smbd processes were terminated.
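One way to confirm that a kill had no effect is to probe the PID with signal 0, which delivers nothing and only reports whether the process table entry still exists. This is a small sketch; proc_status is a hypothetical name, and the probe is only reliable when run as root or against your own processes, since a permission error also makes it report "gone".

```shell
# proc_status PID: print "present" if the PID still has a process table
# entry, "gone" otherwise. Signal 0 performs the existence and permission
# check without sending anything to the process.
proc_status() {
    if kill -0 "$1" 2>/dev/null; then
        echo present
    else
        echo gone
    fi
}

# Example (PID taken from the listing in this post):
#   kill -9 31599; sleep 1; proc_status 31599
```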
Once a process becomes a zombie (here, one left behind with a PPID of 1), no amount of "kill" will terminate it. The only commonly known way to terminate the zombie process is to reboot the host. You and I both know this is to be avoided as much as possible.
The other way to do it is not known to many. Only a handful of the system administrators I personally know knew how to use this command. The command is "telinit", and it is the alternative to rebooting the server. To kill the zombie process(es), or processes with a PPID of 1, execute "telinit u" or "telinit U". Executing "telinit u" makes init (the process with ID 1) restart or, as the man page states it, re-execute itself, without having to reboot the host.
Once this was done, the server load dropped immediately. The smbd processes that used to be zombies were also gone.
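Since this touches PID 1, it may be worth wrapping the call in a small guard. The sketch below assumes a host where telinit is on the PATH; the function name reexec_init and the DRY_RUN variable are hypothetical, added only so the command can be previewed before it is run for real.

```shell
# reexec_init: ask init to re-execute itself via "telinit u".
# With DRY_RUN=1 the command is only printed, never executed.
# Hypothetical wrapper; the only real command involved is "telinit u".
reexec_init() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "telinit u"
        return 0
    fi
    if [ "$(id -u)" -ne 0 ]; then
        echo "reexec_init: must be run as root" >&2
        return 1
    fi
    telinit u
}

# Preview first, then run for real as root:
#   DRY_RUN=1 reexec_init
#   reexec_init
```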
[root@vhost ~]# uptime
15:21:22 up 95 days, 5:07, 6 users, load average: 49.06, 184.80, 229.74
About 5 minutes after executing telinit, the load had re-stabilized.
So if a fellow administrator asks you if you know how to kill a zombie process, let them know that you belong to that handful of seasoned system administrators who don't consider a reboot the solution for terminating a zombie process.
I have asked this question in several interviews, and so far only a few candidates got it right. I consider it a major point in hiring.
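To watch the recovery without eyeballing each uptime line, the 1-minute figure can be pulled out with sed. This is a rough sketch that assumes the "load average: a, b, c" layout shown in the output above; load_1min is a hypothetical helper name.

```shell
# load_1min: read uptime-style output on stdin and print only the
# 1-minute load average. Assumes the "load average: a, b, c" layout
# with a dot as the decimal separator.
load_1min() {
    sed -n 's/.*load average: *\([0-9.]*\),.*/\1/p'
}

# Example: print the 1-minute load every 10 seconds.
#   while :; do uptime | load_1min; sleep 10; done
```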