Support for multiple vmknics with vSAN

Products

VMware vSAN

Issue/Introduction

This article provides information on deploying active/active network configurations using multiple vmknics for vSAN clusters.

Environment

VMware vSAN (All Versions)

Resolution

Customers have expressed interest in deploying active/active fabrics using multiple vmknics for load balancing and failover purposes. This design is motivated by, and analogous to, multipathing in traditional storage deployments. While this configuration is supported, there are certain scenarios (described below) where the guest OS may become unresponsive. Therefore, VMware does not recommend using this configuration for clusters running vSAN versions earlier than 6.7.

A partial failure of the fabric occurs when some, but not all, of the network links of a fabric are down. As a result, a subset of the hosts in the cluster cannot communicate with each other.

In the event of a partial failure of the fabric, all vSAN communications on the affected hosts are redirected to the healthy fabric. Failing over to a healthy link (on the other fabric) may take up to 90 seconds or the TCP timeout value, whichever is greater. Because the failover is not instantaneous, the guest OS on the impacted host may become unresponsive, depending on how the guest OS handles storage timeouts. If the guest OS stalls (becomes unresponsive), manual intervention and remediation are required to restore normal operations.
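The timing rule above can be sketched as a small calculation. This is only an illustration: the function names and the idea of comparing against a single guest disk timeout are assumptions made for the example, and real guest behavior depends on how each OS handles storage timeouts.

```python
# Illustrative sketch only; names and the single-timeout guest model are assumptions.

FAILOVER_FLOOR_SECONDS = 90.0  # "up to 90 seconds", as stated in this article


def worst_case_failover_seconds(tcp_timeout_s: float) -> float:
    """Upper bound on partial-fabric failover time: the greater of 90 s
    and the TCP timeout value."""
    return max(FAILOVER_FLOOR_SECONDS, tcp_timeout_s)


def guest_may_stall(guest_disk_timeout_s: float, tcp_timeout_s: float) -> bool:
    """Rough check: a guest whose storage timeout is shorter than the
    worst-case failover window may become unresponsive."""
    return guest_disk_timeout_s < worst_case_failover_seconds(tcp_timeout_s)
```

For example, under these assumptions a guest with a hypothetical 60-second disk timeout would be flagged as at risk whenever the failover window is 90 seconds or longer.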

In the case of full fabric failure, the failover process is instantaneous, and therefore, there is no impact on the behavior of the guest OS.

To deploy active/active network configurations, follow these guidelines to set up the fabrics:

Note: These guidelines are applicable for layer-2 and layer-3 setups.

Ensure that the fabrics have complete physical and logical isolation (that is, they are air-gapped).
VMware recommends that the vmknics be on separate subnets. For more information, see Multi-homing on ESXi/ESX.
The vmknics should be backed by multiple physical NICs for adequate fault tolerance.
Ensure that the traffic is not load balanced between the vmknics.

For more information on the setup details, see the VMware vSAN Network Design Guide.
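The separate-subnet guideline can be checked before deployment with a short script. The following is a minimal sketch using Python's standard ipaddress module; the function name, the vmknic names, and the addresses in the usage note are hypothetical and are not taken from any VMware tool.

```python
import ipaddress


def overlapping_vmknic_subnets(vmknics: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of vmknic names whose subnets overlap.

    `vmknics` maps a vmknic name to its address in CIDR form,
    for example "192.168.10.11/24" (hypothetical values).
    """
    # Derive the network each interface address belongs to.
    networks = {
        name: ipaddress.ip_interface(cidr).network
        for name, cidr in vmknics.items()
    }
    names = sorted(networks)
    # Compare every unordered pair once.
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if networks[a].overlaps(networks[b])
    ]
```

Calling `overlapping_vmknic_subnets({"vmk1": "192.168.10.11/24", "vmk2": "192.168.20.11/24"})` returns an empty list, which is consistent with the recommendation. Any pair that is returned identifies vmknics that share a subnet.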

vSAN 6.7 introduced RDTFastFailover, which provides redundancy at the RDT layer. When multiple vSAN vmk interfaces exist, vSAN monitors the health of each interface. An interface is declared unhealthy after it misses 10 CMMDS heartbeats, which takes approximately 10 seconds. RDT then fails over connections from that interface to a healthy interface.
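The health-monitoring behavior described above can be modeled roughly as follows. This is a conceptual sketch, not the vSAN implementation: the class and function names are hypothetical, and resetting the counter when a heartbeat arrives (that is, counting consecutive misses) is an assumption that the article does not confirm.

```python
from dataclasses import dataclass
from typing import Optional

MISSED_HEARTBEAT_LIMIT = 10  # threshold stated in this article


@dataclass
class VmkInterface:
    """Hypothetical model of one vSAN vmk interface."""
    name: str
    missed: int = 0

    @property
    def healthy(self) -> bool:
        return self.missed < MISSED_HEARTBEAT_LIMIT

    def record(self, heartbeat_received: bool) -> None:
        # Assumption: misses are counted consecutively and reset on success.
        self.missed = 0 if heartbeat_received else self.missed + 1


def pick_interface(current: VmkInterface,
                   interfaces: list[VmkInterface]) -> Optional[VmkInterface]:
    """Keep the current interface while it is healthy; otherwise return
    the first healthy alternative, or None if no interface is healthy."""
    if current.healthy:
        return current
    for candidate in interfaces:
        if candidate.healthy:
            return candidate
    return None
```

In this model, connections stay on an interface through nine missed heartbeats and move to another healthy interface on the tenth, which mirrors the roughly 10-second detection window described above.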

For multiple vmknic configurations in vSAN 6.7 and later, see the VMware vSAN design guide that corresponds to the installed ESXi version.