HOW-TO: Kill Zombie Processes with a Parent Process ID (PPID) of 1

As system administrators, we are often briefed on the careful use of the "kill" command. Indeed, to use this command you need to know what you're doing -- you really, really need to know what you're doing. Seasoned system administrators often talk about the first time they used it and the racing heartbeat just before executing it.

It's not that bad, though. Like most critical tools, "kill" is the last-resort option for resolving a complex issue -- the in-case-of-fire-break-glass kind of thing. But the "kill" command doesn't always work. I'd say it will about 99% of the time. You need to have been a system administrator long enough to encounter the exception, as it doesn't come along very often.

I decided to post this solution because last week I encountered a runaway Samba (smbd) process. Samba had to be stopped, but instead of shutting down gracefully, only the parent process exited, leaving a lot of child processes behind. These in turn became zombie processes -- about 257 of them.

It would not have mattered much, but the load on the server increased to match the number of zombie processes. In a matter of seconds, it became a critical problem that needed to be dealt with. To illustrate how critical the situation was, here are a couple of indicators.

[root@vhost ~]# ps -ef | grep smbd | wc -l
257
[root@vhost ~]# uptime
15:18:09 up 95 days,  5:04,  6 users,  load average: 257.91, 257.98, 255.54

When ps was executed, it returned smbd processes with a parent process ID (PPID) of 1 -- zombie processes.

[root@vhost ~]# ps -ef | grep smb | tail
nobody   31590     1  0 12:29 ?        00:00:00 smbd -D
nobody   31705     1  0 11:55 ?        00:00:00 smbd -D
nobody   31743     1  0 14:31 ?        00:00:00 smbd -D
nobody   32049     1  0 12:19 ?        00:00:00 smbd -D
nobody   32415     1  0 12:04 ?        00:00:00 smbd -D
nobody   32484     1  0 14:25 ?        00:00:00 smbd -D
nobody   32508     1  0 13:35 ?        00:00:00 smbd -D
nobody   32511     1  0 12:04 ?        00:00:00 smbd -D
nobody   32535     1  0 13:22 ?        00:00:00 smbd -D
nobody   31599     1  0 11:51 ?        00:00:00 smbd -D
[root@vhost ~]# kill -9 31599

Killing any of the smbd processes didn't do much. Server load held steady at 257 and none of the smbd processes were terminated.

Once a process becomes a zombie (or is orphaned with a PPID of 1), no amount of "kill" can terminate it. The commonly known way to get rid of it is a host reboot. You and I both know this is to be avoided as much as possible.
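For reference, here is one way to double-check what you are dealing with before resorting to anything drastic: the first one-liner counts processes that have been re-parented to init (PPID 1), and the second lists any that are truly in the zombie (Z) state. The smbd name is just this scenario's example; substitute your own process name.

[root@vhost ~]# ps -eo pid,ppid,stat,comm | awk '$2 == 1 && $4 == "smbd"' | wc -l
[root@vhost ~]# ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'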

The other way to do it is not known to many. Only a handful of the system administrators I personally know knew how to use this command. The command is "telinit", and it is the alternative to rebooting the server. To get rid of the zombie processes (or processes with a PPID of 1), execute "telinit u" or "telinit U". This makes init (the process with PID 1) restart -- or, as the man page states, re-execute -- without having to reboot the host.
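For clarity, the actual fix on the affected host is just a one-liner; a follow-up ps confirms whether the leftover smbd processes are gone (your output will obviously differ).

[root@vhost ~]# telinit u
[root@vhost ~]# ps -ef | grep smbd | grep -v grep | wc -l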

Once this was done, the drop in server load was immediate. The smbd processes that used to be zombies were also gone.

[root@vhost ~]# uptime
15:21:22 up 95 days,  5:07,  6 users,  load average: 49.06, 184.80, 229.74

About 5 minutes after executing telinit, the load had re-stabilized.

So if a fellow administrator asks whether you know how to kill a zombie process, let them know you belong to the handful of seasoned system administrators who don't consider a reboot the solution.

I have asked this question in several interviews; so far only a few have gotten it right. And I consider it a major point in hiring.



INFO: NetApp iSCSI to VMware Virtual Infrastructure

iSCSI, the implementation of the SCSI protocol over IP packets, is one of the cheapest (if not the cheapest) ways to connect to external storage. It transmits data from host to storage, and back, over the IP network. It is cheap because putting up a Fibre Channel infrastructure costs more, and nowadays almost every infrastructure already has 10-gigabit connectivity over fiber anyway.

iSCSI enables the transmission of SCSI commands inside IP packets. This allows organizations to consolidate storage and gain better utilization, not to mention it is easier to manage and back up. The protocol gives hosts the "illusion" that the storage is directly or locally attached, while in reality it sits somewhere else.

The biggest implementation of iSCSI I have seen so far is for disaster recovery (DR) purposes. On a VMware infrastructure, iSCSI can be used wherever there is a need to share disks or LUNs. It is easily implemented since most servers nowadays come with an adequate number of network interfaces, and it is independent of the physical hardware as long as everything is connected via IP.

The same implementation of NetApp iSCSI is used in the virtual infrastructure that I administer. Everything worked after presenting the LUNs and assigning IP addresses, and it sticks to the notion of keeping everything simple so it is easier to manage. However, while researching best practices for a VMware infrastructure with NetApp storage over iSCSI, I found posts in the VMware forums and the NetApp community suggesting that the NetApp Virtual Storage Console (VSC) plugin for vSphere be installed.

True enough, when the VSC plugin was installed, its interface showed alerts on the screen. Below is a screenshot of what it looks like.


Apparently, although everything works after the initial configuration, the settings are not optimal. Further tweaking is needed, and the VSC interface itself can adjust the settings -- multipath settings, active-active configuration, failover settings, and so on.


Correcting or applying the right settings to the current environment is easy. All it takes is to right-click the alert icon and select "Set Recommended Values..".


It will then ask which of the recommended values should be applied. All the needed options are ticked by default. The interface also displays which settings it changes.


When the recommended values are applied, the interface immediately changes from "Alert!" to "Normal". But a reboot is still required, as indicated by the "Pending Reboot" status.

After rebooting the ESXi host, the optimal settings take effect. The VSC interface in the vSphere Client window confirms this.
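If you want to double-check from the command line as well, the standard ESXi 5.x esxcli storage namespaces can show what actually landed on the host. This assumes the local shell or SSH is enabled and is only a sanity check, not part of the VSC workflow.

~ # esxcli storage nmp device list
~ # esxcli storage core path list

The first command lists each device with its storage array type (SATP) and path selection policy; the second lists every path, which is handy for confirming the expected number of active iSCSI paths.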

To ensure that the guest machines are not impacted by the change, it is recommended to put the ESXi host in maintenance mode first. This is easy to do since vMotion ensures no downtime is incurred. It is the same route I took, and it also allows you to reboot immediately after applying the changes suggested by the Virtual Storage Console.

After applying the tweaks, you will notice faster access to the storage from the guest machines. If you have a similar infrastructure, install the VSC plugin -- the benefits pay off immediately.


HOW-TO: Prep Work for Linux P2V (Physical to Virtual)

Virtualization is the technology trend that is "in" nowadays. It is debatable whether the money saved on hardware (and other variable costs) offsets the cost of whatever virtualization technology is deployed or implemented, but that is beside the point of this article.

Physical to virtual (otherwise known as P2V) is the process of converting a bare-metal host into its virtual equivalent. For Linux hosts, there is only one way to do this: the conversion is run from a Windows host that has the VMware Standalone Converter installed. The best way to do it without downtime is to do your homework beforehand. Prep work is vital in a production environment.

The quickest way to minimize downtime is to choose to shut down the physical machine and power on the virtual machine once the conversion (or P2V) is done -- a tick box offered during the conversion process. However, in my experience the virtual machine doesn't always boot after the conversion. How, then, do we ensure that the virtual machine boots up after conversion?

One key strength of VMware is that its virtual machines (and their corresponding drivers) are given a generic set of hardware. The hardware presented to the VMs is as generic as possible for maximum compatibility. This enables a VM to migrate (or vMotion) across compatible architectures with minimal or even no impact at all. The end-user will not even notice the vMotion occur or experience any glitch while the machines move from one ESXi host to another. A continuous ping might drop a packet or two while a vMotion is in progress, but most of the time even that is not felt by the user.

Given the above, one way to ensure a successful P2V is to make sure hardware-specific dependencies do not exist on the virtual machine (or VM). What are hardware-specific dependencies? The MAC address is one. Virtual machines are automatically assigned a different MAC address than the one on the physical host. If the MAC address is hard-wired in the OS configuration, the resulting Linux VM will obviously not like it.

On Fedora flavors of Linux, as well as its community enterprise counterpart CentOS, there are two locations where the MAC address is hard-coded. The file /etc/udev/rules.d/70-persistent-net.rules contains the hard-coded MAC addresses. As seen in the screenshot below, the host has two (2) network interface cards; CentOS has detected both and assigned them the names eth0 and eth1. Their corresponding MAC addresses are also listed. This is how the OS knows which NIC is assigned which name.
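In case the screenshot does not load, an entry in that file looks roughly like the line below; the MAC address shown is a made-up placeholder.

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0c:29:12:34:56", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"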


Sometimes the files ifcfg-eth0 and ifcfg-eth1 -- located in /etc/sysconfig/network-scripts -- also contain the hard-coded MAC addresses. This is another way for the OS to tie a particular NIC to its device name. Below are the entries for the file ifcfg-eth0.
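Likewise, in case the screenshot is missing, a typical ifcfg-eth0 with a hard-coded MAC looks something like the snippet below; the addresses are placeholders.

DEVICE=eth0
HWADDR=00:0C:29:12:34:56
BOOTPROTO=static
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes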


In both of the above scenarios, you may choose to delete the lines containing the MAC addresses or place a hash (#) at the beginning of each line so the OS ignores them.
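If you prefer doing this from the shell right before the conversion, something along these lines should work. Take a backup first; the paths and interface names below are the stock CentOS ones and may differ on your host.

cp -a /etc/sysconfig/network-scripts /root/network-scripts.bak
sed -i 's/^HWADDR/#HWADDR/' /etc/sysconfig/network-scripts/ifcfg-eth*
sed -i 's/^SUBSYSTEM/#SUBSYSTEM/' /etc/udev/rules.d/70-persistent-net.rules

The first sed comments out the HWADDR lines; the second comments out the persistent-net rules so they are regenerated on the VM's first boot.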

Other times, the VM refuses to boot due to a missing driver module in the initrd image, and the solution varies from case to case. Still, it helps to do your homework beforehand so you can minimize downtime and anticipate the headaches that may come your way.
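As one concrete piece of that homework, on CentOS releases that use dracut you could pre-build an initramfs that already includes the usual VMware storage drivers before the cutover. The driver names below (mptspi for the LSI Logic controller, vmw_pvscsi for the paravirtual one) are common choices, but verify which virtual SCSI controller the converted VM will actually be given.

dracut --force --add-drivers "mptspi vmw_pvscsi" /boot/initramfs-$(uname -r).img $(uname -r)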



INFO: Root Backdoor to VMware ESXi Host

Administering the virtual infrastructure has been the focus of my tasks as of late. Learning and understanding the technology was a challenge at first. It comes with the same set of rules that govern physical machines, plus extended technology (and imagination) -- really giving meaning to the acronym elastic sky x (or ESX). While the name might be gibberish, the technology it stands for is way, way interesting.

While ESXi is an advanced technology, it has its own set of quirks and limitations. At the top of this list is root password protection. Never forget the root password, or else... Although there are proven steps to "crack" (for lack of a better word) the root password, they are far too complex for beginners.

On the VMware knowledge base site, we read this:

ESXi 3.5, ESXi 4.x, and ESXi 5.x
Reinstalling the ESXi host is the only supported way to reset a password on ESXi. Any other method may lead to a host failure or an unsupported configuration due to the complex nature of the ESXi architecture. ESXi does not have a service console and as such traditional Linux methods of resetting a password, such as single-user mode do not apply.

While reinstalling does not take that long, it requires tedious re-configuration right after. So never forget the root password... Again, never forget the root password... But no such advice is fool-proof.

It helps to anticipate, so well before forgetfulness hits, set up a backdoor to the ESXi host -- passwordless SSH. Here's how to set it up.

First, you'll need to generate SSH keys. This is outlined in our previous article on passwordless SSH. If you already have SSH keys in place, skip this step.
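For reference, the minimal key generation step looks like this; accepting the default file location (~/.ssh/id_rsa) keeps the copy command below working as-is.

ssh-keygen -t rsa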

Next, you will need to copy the generated public key to the ESXi host. In order to do that, execute the command below:

cat ~/.ssh/id_rsa.pub | ssh root@[ESXi HOST] 'cat >> /etc/ssh/keys-root/authorized_keys'

Input the root password when prompted.

NOTE: Replace the string [ESXi HOST] with the fully qualified domain name of the ESXi host or its corresponding IP address.

There are other ways to do this as well, but to me the one-liner above hits the spot.

To verify, SSH to the ESXi host and watch passwordless SSH do its wonders. If it fails, make sure the ESXi host is not in lockdown mode and that its SSH service is running.
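A quick test is to run a harmless command over SSH; if the key was installed correctly, it returns without prompting for a password. The placeholder is the same one used above.

ssh root@[ESXi HOST] uptime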

To increase security and protect the ESXi host from that same backdoor, disable the SSH service and/or put the host in lockdown mode. To turn the backdoor back on, use the configuration controls via vSphere.

Just note that whenever an ESXi host undergoes a configuration reset, the passwordless SSH setup is wiped out. Although this rarely happens, the backdoor then needs to be set up again.
