In the past week there has been a rash of reports about people disabling VMware DRS while vCloud Director was in place. I have tweeted about this multiple times and also brought it up on the Community Podcast yesterday. The more I bring it up, the more people ask why someone would do such a thing, and what really happens when they do. I wanted to take a moment to address both questions. First off, the reason this is so important is that vCloud Director uses DRS resource pools to manage the Provider and Organization Virtual Datacenters. This is no secret: if you have played with the product and looked in vCenter, you have seen a lot of resource pools. If you have yet to install it, now you know: vCloud Director relies on resource pools, like it or not. I affectionately refer to the allocation models as “Resource Pools Done Right”; many have heard me say this at VMworld over and over.
Why Do Folks Disable DRS?
I have seen a few reasons why people completely disable DRS. One is that it is a VMware GSS troubleshooting step, but GSS needs to be made aware that you are running vCloud Director; I have been told that if they know this going in, they will not have you disable it. In another case, a rogue vCenter administrator disabled it because they did not know the cluster was managed by vCloud Director. This is one reason we all suggest a second, locked-down vCenter, separate from one that many people may have access to; prevention, unfortunately, is one of the keys here. The last is simply a “clicked too fast” mistake while editing settings in vCenter. This is probably more common than you think; even some of the best people I know have made it. If you need to prevent DRS moves, simply change the setting to Manual, but please do not uncheck the box.
Isn’t There Any Kind of Warning?
Why yes, there is! The screenshot is below. However, the warning does not currently mention that you will completely break vCloud Director. This is being looked at, similar to the managed-object warnings on the virtual machines themselves. The problem is that people still ignore the message and continue anyway.
So What Actually Happens?
Without sugarcoating it: pretty much everything in vCloud Director breaks. This is where we get into the meat of the issue, and you need some background. vCloud Director, as we all know, has its own database. That database contains metadata and other information that ties back to vCenter UUID and MoRef information. Should these IDs change in vCenter, vCloud Director's database does not learn about the update. When vCloud Director creates an object, however, it records the IDs on both sides. Think of this as a uni-directional update from vCloud Director to vCenter. We have always said that changes made in vCenter are not always reflected in vCloud Director, so again, this is no secret. Below is a simple configuration: one cluster with a single Provider vDC and an Org vDC. There is also a vApp deployed to this Organization for argument's sake.
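That one-way bookkeeping can be sketched in a few lines of Python. This is a toy model only, not vCloud Director's actual schema; every name, MoRef, and UID below is invented for illustration:

```python
# Toy model of vCloud Director's one-way bookkeeping: vCD records the
# vCenter MoRef of each backing resource pool at creation time, and
# vCenter never pushes updates back. (Illustrative structures only.)

# vCloud Director "database": Org vDC name -> MoRef recorded at creation
vcd_db = {"OrgVDC-Gold": "resgroup-101"}

# vCenter inventory: live MoRef -> resource pool name
vcenter = {"resgroup-101": "OrgVDC-Gold (abc123)"}

def resolve_org_vdc(name):
    """Look up the backing pool the way vCD would: by its stored MoRef."""
    moref = vcd_db[name]
    return vcenter.get(moref)  # None if that MoRef no longer exists

print(resolve_org_vdc("OrgVDC-Gold"))  # pool found while IDs agree

# Disabling DRS destroys every resource pool. Even if someone recreates
# the pool by hand with the same name, vCenter hands out a NEW MoRef...
del vcenter["resgroup-101"]
vcenter["resgroup-202"] = "OrgVDC-Gold (abc123)"

# ...but vCD's database still holds the old one, so the lookup breaks.
print(resolve_org_vdc("OrgVDC-Gold"))  # None: vCD can no longer find its pool
```

The point of the sketch is simply that the mapping is keyed on IDs vCenter assigns, and nothing flows back to vCloud Director when those IDs change.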
Now let’s disable DRS and see what happens in vCloud Director. As expected, the VM is still there and associated with vCloud Director as a managed object, but it is no longer associated with the Organization's resources in any way. Imagine a large number of Organization vDCs where all the vApps just get dropped to the root pool. No more service levels for all those paying customers, for one!
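What disabling DRS does to the inventory tree can be modeled the same way. Again, this is an illustrative sketch with invented names, not vCenter internals:

```python
# Toy inventory model: unchecking "Turn On vSphere DRS" destroys every
# resource pool and reparents the VMs to the cluster's root pool, so all
# shares/limits/reservations the pools carried are gone. (Names invented.)

cluster = {
    "root_pool": {"vms": []},
    "child_pools": {"ProviderVDC (uid-1)": ["OrgVDC-A (uid-2)"]},
    "pool_vms": {"OrgVDC-A (uid-2)": ["vApp-web-01"]},
}

def disable_drs(cluster):
    """Simulate disabling DRS: flatten all pools into the root pool."""
    for vms in cluster["pool_vms"].values():
        cluster["root_pool"]["vms"].extend(vms)
    cluster["child_pools"].clear()
    cluster["pool_vms"].clear()

disable_drs(cluster)
print(cluster["root_pool"]["vms"])  # every vApp VM now sits at the root
```

One click, and every Organization's resource controls evaporate at once.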
We also start seeing errors in vCloud Director for simple operations like powering on a virtual machine, and we get all kinds of angry errors when we try to deploy a new vApp. You do get the detailed name of the resource pool that was removed. I think you get the point by now, so I will not bother with more screenshots.
So How The Heck Do I Fix This!?
Well, it is not easy, but it is also not impossible, and it is not really something you should do without GSS help. First, you need to re-enable DRS; from there, a lot of re-work is needed to re-create the resource pools and re-map them to the vCloud Director database. This may not be as easy as it sounds: as you can see from the initial screenshot, each resource pool name also has a UID associated with it, and you need to re-create them exactly as they were before the mistake was made. Again, this is not for people to do on their own. I may play with this in my now-broken lab to see if I can fix it, but I will NOT be posting the database tables and other information should I get it to work; I am not taking that responsibility should someone break something.

The real issue is not two pools, but what if you had 10, 50, or 100 to re-create? The whole time, your users will not be able to do anything in their organizations, and that is not good. You also have to know exactly which resource pool every virtual machine was associated with. Some of this may be in the vCloud Director database, as indicated by the error above, and obviously it is important to get each organization's virtual machines back under the right resource controls. I have no idea how the Chargeback data collectors will be affected by this either, or what may happen to the billing reports. One other thing to consider: the “System vDC”, which is created automatically when you create a Provider vDC, must also be re-created.
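To see why the recreation has to be exact, here is a sketch of the remapping work recovery would entail. These are toy structures with invented names and UIDs; the real fix touches the vCloud Director database and should only be attempted with GSS on the line:

```python
# Sketch: after pools are re-created, each vCD record must be re-pointed
# at the new pool's fresh MoRef, matched by the EXACT original name
# (display name + UID suffix). All names/MoRefs here are hypothetical.

def remap_pools(vcd_records, vcenter_pools):
    """Repoint each vCD record at the recreated pool with the identical
    name. Returns the pool names that found no match (still broken)."""
    by_name = {p["name"]: p["moref"] for p in vcenter_pools}
    unmatched = []
    for rec in vcd_records:
        if rec["pool_name"] in by_name:
            rec["moref"] = by_name[rec["pool_name"]]
        else:
            unmatched.append(rec["pool_name"])
    return unmatched

# Two hypothetical Org vDC records from the vCD database:
records = [
    {"pool_name": "OrgVDC-Gold (a1b2c3)",   "moref": "resgroup-101"},
    {"pool_name": "OrgVDC-Silver (d4e5f6)", "moref": "resgroup-102"},
]
# Only one pool was recreated with its exact original name:
recreated = [{"name": "OrgVDC-Gold (a1b2c3)", "moref": "resgroup-305"}]

unmatched = remap_pools(records, recreated)
print(records[0]["moref"])  # resgroup-305: remapped to the new pool
print(unmatched)            # ['OrgVDC-Silver (d4e5f6)']: still broken
```

Now multiply the unmatched list by 50 or 100 pools, plus every VM-to-pool association, and the scale of the outage becomes clear.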
The Moral of The Story
Lock users out of changing this setting with RBAC, or use a completely separate, locked-down vCenter. Sometimes protecting people from themselves is the best option. If you do get into trouble, call support; you are going to need them, and that is what they are there for. Better yet, just don't do it, and think twice before you click. You can always set DRS to Manual to prevent migrations while keeping those resource pools intact. The thing to remember is the one-way update from vCloud Director to vCenter; it does not currently flow the other way. I am investigating ways to recover from this and other cases where MoRef IDs have changed in vCenter, but it will take some time and possibly tools like vCenter Orchestrator. There is also the Inventory Snapshot VMware Labs Fling, which I have yet to test to see whether it re-creates objects with the original MoRef IDs, but I plan on seeing the results this coming week.