Tuesday, 15 November 2016

vSphere 6.5: What is vCenter High Availability

In 6.0 we had the option to provide high availability for the Platform Services Controller by deploying redundant PSC nodes in the same SSO domain and using either a manual repoint command or a load balancer to switch to another PSC if the current one went down. For vCenter nodes, however, there was no such option. The only ways to get HA for a vCenter node were to configure Fault Tolerance or to place the vCenter virtual machine in an HA-enabled cluster.

With the release of vSphere 6.5, a much-awaited feature has been added to provide redundancy, or high availability, for your vCenter node too. This is the vCenter High Availability (VCHA) feature.

The design of VCHA is somewhat similar to a regular clustering mechanism. Before we get to how it works, here are a few prerequisites for VCHA:

1. Applicable to vCenter Server Appliance only. Embedded VCSA is currently not supported.
2. Three unique ESXi hosts, one for each node (Active, Passive and Witness).
3. Three unique datastores, one to contain each of these nodes.
4. The same Single Sign-On domain for the Active and Passive nodes.
5. One public IP to access and use vCenter.
6. Three private IPs in a different subnet from the public IP. These are used for internal communication to check node state.
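
The addressing rules in items 5 and 6 can be sanity-checked before deployment. Below is a minimal Python sketch that uses only the standard-library ipaddress module. The function name, the example addresses and the /24 prefixes are made up for illustration and are not part of any VMware tooling.

```python
import ipaddress


def check_vcha_addressing(public_ip, public_prefix, private_ips, private_prefix):
    """Return a list of problems; an empty list means the plan passes.

    Hypothetical helper: it checks only the rules stated in the list above.
    """
    problems = []
    public_net = ipaddress.ip_interface(f"{public_ip}/{public_prefix}").network

    if len(private_ips) != 3:
        problems.append("expected exactly 3 private IPs (Active, Passive, Witness)")
    if len(set(private_ips)) != len(private_ips):
        problems.append("private IPs must be unique")

    for ip in private_ips:
        private_net = ipaddress.ip_interface(f"{ip}/{private_prefix}").network
        # The private (heartbeat) network must not share a subnet with the public one.
        if private_net.overlaps(public_net):
            problems.append(f"{ip} overlaps the public network {public_net}")
    return problems


if __name__ == "__main__":
    # Example values only; substitute your own addressing plan.
    print(check_vcha_addressing("192.168.1.10", 24,
                                ["10.0.0.1", "10.0.0.2", "10.0.0.3"], 24))
```

An empty list means the public and private addresses satisfy the two rules above. It says nothing about the host, datastore or SSO prerequisites.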

vCenter High Availability (VCHA) Deployment:
Once your vCenter is configured for high availability, three nodes are deployed: the Active node, the Passive node and the Witness (Quorum) node. The Active node is the one whose public IP vNIC is up. This public IP is used to connect to the vSphere Web Client for management.

The second node is the Passive node, which is an exact clone of the Active node. It has the same memory, CPU and disk configuration as the Active node. On this node the public IP vNIC is down and the vNIC used for the private IP is up. The private network between the Active and Passive nodes is used for cluster operations. The Active node's database and files are updated regularly, and this information has to be synced to the Passive node over the private network.

The third node, also called the quorum node, acts as a witness. It is introduced to avoid the split-brain scenario that arises from a network partition. During a network partition we cannot have two active nodes up and running, so the quorum node decides which node is active and which has to be passive.
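
To see why a third node prevents split-brain, here is a minimal Python sketch of a generic majority-quorum rule. It is an illustration of the idea only, not VMware's actual implementation, and the node names and the has_quorum function are assumptions made for this example.

```python
CLUSTER = ("active", "passive", "witness")


def has_quorum(node, reachable):
    """True if `node` plus the peers it can reach form a majority of the cluster."""
    peers = {p for p in reachable if p in CLUSTER and p != node}
    # With three nodes, a majority is two: the node itself plus at least one peer.
    return len(peers) + 1 > len(CLUSTER) // 2


if __name__ == "__main__":
    # Partition example: Active is cut off; Passive and Witness still see each other.
    views = {
        "active": [],
        "passive": ["witness"],
        "witness": ["passive"],
    }
    for name, reachable in views.items():
        print(name, has_quorum(name, reachable))
```

In this example the isolated Active node loses quorum while the Passive node keeps it, so only one side of the partition is allowed to serve as active.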

vPostgres replication is used to replicate the database between the Active and Passive nodes, and this replication is synchronous. The vCenter files are replicated using native Linux rsync, which is asynchronous.

What happens during a failover?

When the Active node goes down, the Passive node becomes the Active node and assumes the public IP address. The VCHA cluster enters a degraded state, since one of the nodes is down. The failover is not transparent: there is an RTO of ~5 minutes.
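
The role switch described above can be pictured with a small Python sketch. This is a toy model under stated assumptions: the Node class, its fields and the fail_over function are hypothetical names, and marking the failed node as passive is a simplification of my own, not something taken from the product.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str                    # "active" or "passive"
    healthy: bool = True
    public_nic_up: bool = False  # only the Active node has the public vNIC up


def fail_over(active, passive):
    """Promote the passive node if the active node is unhealthy.

    Returns the node that holds the Active role afterwards.
    """
    if active.healthy:
        return active  # nothing to do
    if not passive.healthy:
        raise RuntimeError("no healthy node left to promote")
    # Assumption: the failed node gives up the role and the public IP.
    active.role, active.public_nic_up = "passive", False
    passive.role, passive.public_nic_up = "active", True
    return passive


if __name__ == "__main__":
    a = Node("vc-a", "active", healthy=False, public_nic_up=True)
    b = Node("vc-b", "passive")
    print(fail_over(a, b).name)
```

The sketch switches roles instantly. In a real failover, services have to start on the promoted node, which is where the RTO of ~5 minutes comes from.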

Your cluster can also enter a degraded state when the Active node is still running and healthy but either the Passive or the Witness node is down. In short, if any one node in the cluster is down, VCHA is in a degraded state. More about VCHA states and deployment will follow in a later article.
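
The rule "one node down means degraded" can be written as a tiny Python function. This is a sketch only: the function name is made up, and the label returned when more than one node is down is a placeholder, since that case is not covered in this post.

```python
def cluster_state(active_up, passive_up, witness_up):
    """Classify the VCHA cluster from the up/down status of its three nodes."""
    nodes_up = sum([bool(active_up), bool(passive_up), bool(witness_up)])
    if nodes_up == 3:
        return "healthy"
    if nodes_up == 2:
        return "degraded"  # exactly one node is down, whichever it is
    # Placeholder label: this post does not describe the multi-failure case.
    return "more than one node down"


if __name__ == "__main__":
    print(cluster_state(True, True, False))   # Witness down
    print(cluster_state(False, True, True))   # Active down, after failover
```

Both example calls fall into the degraded case, which matches the description above: it does not matter which single node is lost.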

Hope this was helpful.