Post written by contributing authors Kris Boyd and Chris Colotti
As part of the Disaster Recovery solution that Duncan Epping and Chris Colotti developed, Kris Boyd has been working on a comparable VMware View solution using the same basic principles. However, both solutions ran into a unique situation with the vCenter virtual machine when trying to use Site Recovery Manager’s ability to change the IP addresses of the guest OS in the recovery site. This situation only seems to affect vCenter virtual machines and is a very specific condition. It is unique because in most cases Site Recovery Manager talks only to an upper-layer vCenter Server, whereas in our Disaster Recovery solutions there is also a vCenter Server that is itself managed by another vCenter and SRM, as depicted below.
Most of us know that SRM can change the IP address of a virtual machine that is part of a recovery plan, and in most cases this works just fine. There are a few situations, such as virtual appliances, that have some issues due to the version of VMware Tools. Most recently we have seen a specific issue when the protected machine is a vCenter Server (non-appliance) virtual machine, which is the case in both DR solutions. We wanted to take a few moments to describe the condition and the high-level workaround required to deal with the vCenter Server virtual machine.
Note: This is not meant to provide the exact steps, but is intended to highlight the areas of consideration. It also only applies when the recovery site is using a different IP range and addresses must be changed during the failover process.
As you can see from the diagram above, all of the infrastructure components were VMs that were failed over to the recovery site under the management of Site Recovery Manager. Since vCenter was one of these machines, we tried (unsuccessfully) to re-IP that VM, along with all of the other servers with static IPs, using the feature in SRM. As it turns out, not only does SRM fail to re-IP the machine, but the SRM recovery test itself will fail if you tell it to re-IP vCenter.
When we dug into this, we found the real reason for the failure. The issue has to do with the vCenter services not starting successfully while waiting for the network to come up. SRM needs VMware Tools running in the guest in order to change the IP address of a virtual machine. In the case of vCenter Server, the vCenter services try to start before the Tools are running. When the vCenter service starts and cannot connect to its database (because the machine is now on a different network), the service start simply hangs, and the guest sits at “Applying Computer Settings” indefinitely. This prevents VMware Tools from ever starting, which in turn prevents SRM from changing the IP address. It’s pretty much a rock-and-a-hard-place situation. SRM was successful in its attempt to re-IP all of the other servers, but vCenter required a set of manual steps to accomplish the same task.
These manual tasks also had their own set of challenges, because if you do things wrong, you may find that Windows will never let you log in to this server again, for the same reason. Without going into the gory details, here is what you need to take away from this:
- Never assume anything when testing a disaster recovery plan
- vCenter must have its IP updated manually during a failover
- Since other services are dependent on vCenter, you should include a wait step in the SRM recovery process to give you time to update the vCenter IP.
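
One way to picture the wait step from the last point is a small check that blocks until vCenter answers on its new address. The sketch below is only an illustration under our own assumptions: it is not an SRM API or an SRM-provided script, the function name `wait_for_port` is hypothetical, and it assumes vCenter's web service listens on TCP port 443 once the IP has been updated by hand.

```python
import socket
import time


def wait_for_port(host, port=443, timeout_s=900.0, interval_s=10.0,
                  connect_timeout_s=5.0):
    """Poll host:port until a TCP connection succeeds or timeout_s elapses.

    Hypothetical helper: returns True as soon as the port accepts a
    connection, False if it never does within the time budget.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the new IP.
            with socket.create_connection((host, port),
                                          timeout=connect_timeout_s):
                return True
        except OSError:
            # Refused, unreachable, or timed out: vCenter is not ready yet.
            pass
        # Give up if another full interval would run past the deadline.
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_port("192.0.2.10")` (a placeholder documentation address, not one from our environment) would hold for up to 15 minutes while someone updates the vCenter IP manually. An open port only shows that something is listening, not that every vCenter service is healthy, so treat this as a minimum check before letting dependent services continue.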