So here I am again after working a couple of days on a troubleshooting issue for a customer during a VMware vCloud upgrade that we finally got fixed. As a matter of fact, I seem to now be the resident expert on getting the entire stack of VMware vCloud products upgraded in the right order and sequence. I wanted to really dig into the importance of Network Time Protocol as it relates to VMware vCloud Director and what we learned about a few things. Now, we all know VMware has long stressed the importance of Guest OS timekeeping, and the issue has long been debated. I am not going to deal with that here whatsoever; that horse has been beaten to a bloody pulp. No, instead I want to point out a few things that are called out in the installation documents but, if not validated, can cause you to, well, lose a lot of TIME troubleshooting. Let’s start with the symptoms we saw.
We were following the detailed document I developed for these end-to-end upgrades (not available yet, but hopefully soon), although it is based on the high-level procedures I wrote about previously. We finished the vShield Manager upgrades, and when we tried to validate that a NAT-routed network could be deployed, we started getting an error that vCloud Director “Could not find a host to place the Appliance”. Generally this error is seen when not all the hosts see the same storage, or when one is not connected to the Distributed Switch. It can also be seen if the hosts are not prepared and available. All of this was checked over and over, and still nothing to be seen. Ultimately we tried to “Reconnect” to the vCenter, and that is where we saw new errors. One of the four cells always failed when trying to connect.
So, as a good VMware VCDX, we began to methodically troubleshoot with the help of Bangalore, Palo Alto, and GSS. I pretty much called in the full house blitz for a number of reasons. We dumped the database before and after the upgrade for review, we dumped vCenter logs, and we dumped vCloud Director logs. There were very few errors other than a few port group errors, some odds and ends, etc. We verified the vShield Manager could create a vShield without vCloud Director, and that worked fine too. Literally, we went around in circles. Now, early on we did happen to notice in passing that one cell’s time was way off, but we did not think anything of it at the time……now you see the moral of my story coming. We stopped the cell with the bad time and started testing cell by cell, for both vCenter reconnection and deploying vShield Edge devices. To our amazement everything started working…until we got to the cell with the wrong time.
We then inspected all four cells and discovered that NTP was not running on any of them…..okay, what’s the deal there, we thought. We got the Linux team to get NTP configured, checked the time, and it was dead on across all the cells, so we rebooted them all for giggles. Upon reboot, the cell with the wrong time……STILL SHOWED WRONG!! Now we were curious what was going on. Then, after about 5 minutes, the time showed correct! I am not one to go down without a fight with a machine, so we kept digging, and here is what we found.
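For anyone wanting to spot this faster than we did, here is the kind of quick check we ended up running on each cell. This is just a sketch assuming RHEL-based cells; `ntp.example.com` is a placeholder for your own internal time source, so substitute accordingly.

```shell
# Quick NTP health check on a vCloud Director cell (RHEL).
# ntp.example.com is a placeholder for your internal time source.

service ntpd status          # is the daemon even running?
chkconfig --list ntpd        # will it start on reboot?
ntpq -p                      # are we actually peered and syncing?

# One-time correction before (re)starting the daemon:
service ntpd stop
ntpdate ntp.example.com
service ntpd start
chkconfig ntpd on            # make it survive the next reboot
```

Five minutes of this on each cell up front would have saved us days on the back end.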
Ultimate Root Cause:
- First, NTP was not set up on the cells, so we fixed that
- Second, the VMware Tools were in fact installed, and by default they do NOT get time from the host
- Third, ONE ESXi host that happened to be running the badly behaving cell had NTP enabled, but it was not actually getting updates, and its clock was incorrect
- Fourth, per this knowledge base article (scroll to the bottom), VMware Tools will always do a one-time sync from the host on Tools service startup and certain other key operations, even if you have the overall setting to get time from the host DISABLED
- Fifth, the vCloud Director Installation Guide on page 13 does state that cells must have time in sync.
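On the Tools side, the periodic sync setting is easy to check from inside the guest. A sketch; the .vmx option names in the comments are the ones commonly cited for disabling the one-time syncs, so verify them against the KB article before relying on them.

```shell
# Inside the guest: check whether Tools periodic time sync is on.
vmware-toolbox-cmd timesync status   # prints "Enabled" or "Disabled"

# "Disabled" here only stops the PERIODIC sync. Per the KB article,
# the one-time syncs (Tools startup, snapshot restore, etc.) still
# happen unless they are switched off in the VM's .vmx file with
# options along the lines of (check the KB for the exact list):
#   time.synchronize.tools.startup = "FALSE"
#   time.synchronize.continue      = "FALSE"
#   time.synchronize.restore       = "FALSE"
#   time.synchronize.resume.disk   = "FALSE"
```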
Interesting, right? So what was happening originally is that the bad cell was ALWAYS getting bad time from the host on reboot, and since NTP was not running on the cell, the clock in turn was never corrected. Even once we enabled NTP, on reboot the Tools still did their nice little one-time sync, tossing off the time until NTP reset it. As we say in Boston….WICKED MESSY!
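Checking the host itself is just as quick if you have Tech Support Mode/SSH enabled. A rough sketch, assuming ntpd and ntpq are present on the ESXi host as they were in our environment:

```shell
# On the ESXi host (Tech Support Mode / SSH):
/etc/init.d/ntpd status
ntpq -p                  # no reachable peers = the silent failure we hit
/etc/init.d/ntpd restart
```

You can see the same thing from the vSphere Client under the host's Time Configuration, but the client happily shows NTP as "enabled" on a host that is not actually getting updates, which is exactly what fooled us.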
We ended up putting the bad HOST in maintenance mode for someone to look at later, migrated the VM to a good host with good NTP updates, ensured all CELLS had NTP running, and all was good with life. The moral of this whole story is actually a couple of things:
- If you think Network Time is not important to vSphere and vCloud Director…YOU ARE WRONG…it is extremely important
- If you think VMware Tools does not get time from ESXi hosts ever…YOU ARE WRONG…it most certainly does
- Ensure all hosts in a cluster have NTP enabled and are actually getting updates.
- Ensure all VMware vCloud Director cells all have NTP enabled.
- Ensure vShield Manager, vCenter, and Chargeback are all getting correct time.
- Time did not seem to be as strict with vCloud Director 1.0.1 as it is in 1.5; this I know for sure
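If you want to keep an eye on cell skew going forward, even something as dumb as comparing epoch timestamps will catch a cell like ours. A sketch with placeholder cell names and a hypothetical 2-second tolerance (check the Installation Guide for the actual requirement):

```shell
# check_skew SECS1 SECS2 [TOLERANCE]: compare two epoch timestamps and
# flag drift beyond the tolerance. Default is 2s -- a hypothetical
# value; check the vCloud Director Installation Guide for the real one.
check_skew() {
  local t1=$1 t2=$2 tol=${3:-2}
  local d=$(( t1 - t2 ))
  [ "$d" -lt 0 ] && d=$(( -d ))
  if [ "$d" -le "$tol" ]; then echo "OK"; else echo "SKEW ${d}s"; fi
}

# Usage idea: compare each cell against a reference clock, e.g.
#   ref=$(date +%s)
#   for c in cell01 cell02 cell03 cell04; do   # placeholder names
#     echo "$c: $(check_skew "$ref" "$(ssh "$c" date +%s)")"
#   done
```

Nothing fancy, but had something like this been in place, the one cell drifting would have shown up long before it broke vShield Edge deployments.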