I have discovered that during the intense process of host remediation using VMware Update Manager, VMware HA on a cluster may become misconfigured and no longer function correctly. This can cause issues when VUM commands a host to enter Maintenance Mode and all the VMs need to VMotion away, but all eligible hosts are reporting errors with their HA configuration (usually because all the primary HA agents are no longer responding and hosts can’t reconfigure themselves after exiting maintenance mode).
Without eligible hosts to VMotion VMs to, the enter-maintenance-mode task will fail. VUM handles this failure depending on how it is configured when the update task was created. I usually set it to Retry with 1 minute delay, a maximum of 5 times. This allows me to notice the failure if I happen to glance at the client and figure out what to do to help it along. VUM could also Fail the update task but this is annoying if the task hasn’t gotten very far with other hosts in the cluster. It would be nice if the host could be skipped after itÂ couldn’tÂ enter maintenance mode and VUM could focus on the rest of the hosts.
It seems the best thing to do is to just disable HA on the cluster before starting VUM remediation. Instead of HA being disabled/enabled for each host during the process, they are all disabled. This saves time on the overall process and now VUM will not fail the entire task.
With that being said, in addition to VUM skipping hosts that can’t enter maintenance mode, it would also bee nice if the cluster itself could recognize the total HA misconfiguration and globally disable/enable it accordingly. That’s what the Administrator would have to do anyway so automating this would be nice.