DRS fails to load balance VMs due to an EVC mismatch after a network disconnection between the vCenter Server and the ESXi hosts in an EVC-enabled Cluster

Article ID: 378718

Products

VMware vCenter Server

Issue/Introduction

You might observe the following symptoms after a network outage in which the vCenter Server loses communication with the hosts in an EVC-enabled Cluster. The disconnection between the vCenter Server and the hosts can be caused by a network issue on either side.

  • DRS fails to load balance the Virtual Machines, and many of the VMs end up running on a single host in the Cluster. This impacts Production because that one host is overloaded with all the VMs.
  • Manual vMotion of Virtual Machines to the other hosts in the Cluster fails with the EVC error message below for these 6 features (ibpb, ibrs, stibp, ssbd, fcmd, and mdclear):

    The target host does not support the virtual machine's current hardware requirements.

    Use a cluster with Enhanced vMotion Compatibility (EVC) enabled to create a uniform set of CPU features across the cluster, or use per-VM EVC for a consistent set of CPU features for a virtual machine and allow the virtual machine to be moved to a host capable of supporting that set of CPU features. See KB article 1003212 for cluster EVC information.

    Microarchitectural Data clear is unsupported.
    FCMD is unsupported.
    Speculative Store Bypass Disable is unsupported.
    Single Thread Indirect Branch Predictor is unsupported.
    Indirect Branch Restricted Speculation is unsupported.
    Indirect Branch Prediction Barrier is unsupported

  • Logs on the vCenter Server (/var/log/vmware/vpxd/vpxd.log) show entries similar to the snippet below:

    YYYY:MM:DDTHH:MM:SSZ info vpxd[pid] [Originator@6876 sub=vpxLro opID=<opID>] [VpxLRO] -- BEGIN session[<SessionID>] -- ProvChecker -- vim.vm.check.ProvisioningChecker.checkRelocate -- <SessionID>
    YYYY:MM:DDTHH:MM:SSZ info vpxd[pid] [Originator@6876 sub=VmCheck opID=<opID>] CompatCheck results: (vim.vm.check.Result) [
    -->    (vim.vm.check.Result) {
    -->       vm = 'vim.VirtualMachine:<VC GUID>:<vm-MoID>',
    -->       host = 'vim.HostSystem:<VC GUID>:<host-MoID>',
    -->       error = (vmodl.MethodFault) [
    -->          (vim.fault.FeatureRequirementsNotMet) {
    -->             faultMessage = (vmodl.LocalizableMessage) [
    -->                (vmodl.LocalizableMessage) {
    -->                   key = "com.vmware.vim.vmfeature.cpuid.ibpb",
    -->                },
    -->                (vmodl.LocalizableMessage) {
    -->                   key = "com.vmware.vim.vmfeature.cpuid.ibrs",
    -->                },
    -->                (vmodl.LocalizableMessage) {
    -->                   key = "com.vmware.vim.vmfeature.cpuid.stibp",
    -->                },
    -->                (vmodl.LocalizableMessage) {
    -->                   key = "com.vmware.vim.vmfeature.cpuid.ssbd",
    -->                },
    -->                (vmodl.LocalizableMessage) {
    -->                   key = "com.vmware.vim.vmfeature.cpuid.fcmd",
    -->                },
    -->                (vmodl.LocalizableMessage) {
    -->                   key = "com.vmware.vim.vmfeature.cpuid.mdclear",
    -->                },
    -->                (vmodl.LocalizableMessage) {
    -->                   key = "com.vmware.vim.vpxd.vmcheck.featureRequirementsNotMet.useClusterOrPerVmEvc",
    -->                }
    -->             ],
    -->             featureRequirement = (vim.vm.FeatureRequirement) [
    -->                (vim.vm.FeatureRequirement) {
    -->                   key = "cpuid.ibpb",
    -->                   featureName = "cpuid.ibpb",
    -->                   value = "Bool:Min:1"
    -->                },
    -->                (vim.vm.FeatureRequirement) {
    -->                   key = "cpuid.ibrs",
    -->                   featureName = "cpuid.ibrs",
    -->                   value = "Bool:Min:1"
    -->                },
    -->                (vim.vm.FeatureRequirement) {
    -->                   key = "cpuid.stibp",
    -->                   featureName = "cpuid.stibp",
    -->                   value = "Bool:Min:1"
    -->                },
    -->                (vim.vm.FeatureRequirement) {
    -->                   key = "cpuid.ssbd",
    -->                   featureName = "cpuid.ssbd",
    -->                   value = "Bool:Min:1"
    -->                },
    -->                (vim.vm.FeatureRequirement) {
    -->                   key = "cpuid.fcmd",
    -->                   featureName = "cpuid.fcmd",
    -->                   value = "Bool:Min:1"
    -->                },
    -->                (vim.vm.FeatureRequirement) {
    -->                   key = "cpuid.mdclear",
    -->                   featureName = "cpuid.mdclear",
    -->                   value = "Bool:Min:1"
    -->                }
    -->             ],
    -->             host = 'vim.HostSystem:<VC GUID>:<host-MoID>',
    -->             msg = "",
    -->          }
    -->       ],
    -->    }
    --> ]

  • Manually verifying the config file /etc/vmware/config on the ESXi host shows that it does not list the 6 features mentioned above (a related PowerCLI check follows this list).
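
For reference, the per-VM CPU feature requirements cited in the fault above can also be read directly from the VM's runtime information. A minimal PowerCLI sketch, assuming an existing Connect-VIServer session; "VM01" is a placeholder name:

    # List the CPU feature requirements the VM currently carries; the fault
    # above cites cpuid.ibpb, ibrs, stibp, ssbd, fcmd and mdclear.
    $vm = Get-VM -Name "VM01"
    $vm.ExtensionData.Runtime.FeatureRequirement |
        Where-Object { $_.Key -match 'ibpb|ibrs|stibp|ssbd|fcmd|mdclear' } |
        Select-Object Key, Value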

Environment

vSphere 7.x and 8.x

Resolution

This is a known issue; there is currently no resolution. Broadcom Engineering is investigating to determine the root cause.

To avoid Virtual Machines migrating to a single host in the Cluster, follow the recommendations in the KB article VM migrations initiated by DRS due to incompatibility between the VMs and their host.

  • Setting the DRS advanced option "CompatCheckTransientFailureTimeSeconds" to the value "-1" helps prevent the VMs from being consolidated onto a single host and causing production impact (see the PowerCLI sketch after this list).
  • From vCenter Server 8.0 U3 onwards, the default value of CompatCheckTransientFailureTimeSeconds is "-1", so the advanced option does not need to be added.
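
For clusters running earlier builds, the advanced option can be added with PowerCLI. A minimal sketch, assuming the VMware.PowerCLI module is available; "vcenter.example.com" and "Cluster01" are placeholder names:

    # Connect to the vCenter Server (prompts for credentials).
    Connect-VIServer -Server vcenter.example.com

    # Create the DRS advanced option on the Cluster. If the option already
    # exists, retrieve it with Get-AdvancedSetting and change it with
    # Set-AdvancedSetting instead.
    $cluster = Get-Cluster -Name "Cluster01"
    New-AdvancedSetting -Entity $cluster -Type ClusterDRS `
        -Name "CompatCheckTransientFailureTimeSeconds" -Value "-1" -Confirm:$false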

Workaround

Follow any of the methods below to work around the EVC mismatch:

Method 1 (Recommended):

Re-apply the same EVC mode on the impacted EVC-enabled Clusters (a PowerCLI sketch follows the steps below).

  • Note down the existing EVC mode on the Cluster.
  • Disable EVC on the Cluster.
  • Re-enable EVC on the Cluster with the same EVC mode.
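
The same steps can be scripted with PowerCLI's Set-Cluster cmdlet. A minimal sketch, assuming "Cluster01" is a placeholder name and that passing $null to -EvcMode disables EVC:

    $cluster = Get-Cluster -Name "Cluster01"
    $evcMode = $cluster.EVCMode    # note the existing mode, e.g. "intel-cascadelake"

    # Disable EVC, then re-enable it with the same mode.
    Set-Cluster -Cluster $cluster -EvcMode $null -Confirm:$false
    Set-Cluster -Cluster $cluster -EvcMode $evcMode -Confirm:$false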

Method 2:

Manually add the missing features to the config file on each host in the impacted Clusters:

  • Connect to the ESXi host using an SSH client (e.g., PuTTY).
  • Edit the config file /etc/vmware/config

    vi /etc/vmware/config

  • Add the missing entries for the 6 features mentioned in the Issue/Introduction section:

    featMask.evc.cpuid.FCMD = "Val:1"
    featMask.evc.cpuid.IBPB = "Val:1"
    featMask.evc.cpuid.IBRS = "Val:1"
    featMask.evc.cpuid.MDCLEAR = "Val:1"
    featMask.evc.cpuid.SSBD = "Val:1"
    featMask.evc.cpuid.STIBP = "Val:1"

  • Perform the same operation on each host in the Cluster.
  • Retry vMotion to relocate the VMs to these hosts (a PowerCLI spot-check follows below).
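
The spot-check below reads the masked feature capabilities the host reports to vCenter. A minimal PowerCLI sketch, assuming an existing Connect-VIServer session; "esxi01.example.com" is a placeholder host name:

    # Show the six EVC feature masks as seen by vCenter for this host.
    $vmhost = Get-VMHost -Name "esxi01.example.com"
    $vmhost.ExtensionData.Config.MaskedFeatureCapability |
        Where-Object { $_.Key -match 'ibpb|ibrs|stibp|ssbd|fcmd|mdclear' } |
        Select-Object Key, Value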

Additional Information

The attached PowerCLI script (Compare_EVC_VC_and_ESXi.ps1) can be used to compare the EVC mode configured on the Cluster against the EVC configuration on each host in the same Cluster.

Update the vCenter Server FQDN, username, and password in the script before executing it.
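
The core of the comparison the script performs is sketched below. This is a simplified illustration rather than the attached script itself, assuming an existing Connect-VIServer session; "Cluster01" is a placeholder name:

    $cluster = Get-Cluster -Name "Cluster01"
    # Read the Cluster's current EVC feature capability keys via the EVC manager.
    $evcMgr = Get-View ($cluster.ExtensionData.EvcManager())
    $vcKeys = $evcMgr.EvcState.FeatureCapability.Key

    foreach ($vmhost in ($cluster | Get-VMHost)) {
        # A healthy host masks every feature key defined by the Cluster's EVC mode.
        $hostKeys = $vmhost.ExtensionData.Config.MaskedFeatureCapability.Key
        $missing  = $vcKeys | Where-Object { $_ -notin $hostKeys }
        if ($missing) { "$($vmhost.Name) - ALERT, EVC MISMATCH on the Host" }
        else          { "$($vmhost.Name) - EVC Status is GOOD" }
    }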

Sample result after executing the script:

Cluster Name:  Cluster01
Cluster EVC Mode: intel-cascadelake

vCenter Server Current EVC EvcState.FeatureCapability Count: 75

<Host1> - Host maskedFeatureCapability bits Count: 89
Comparing the VC EVC features on the Host.
<Host1>  - EVC Status is GOOD

<Host2>  - Host maskedFeatureCapability bits Count: 84
Comparing the VC EVC features on the Host.
** <Host2> - ALERT, EVC MISMATCH on the Host **


<Host3>  - Host maskedFeatureCapability bits Count: 89
Comparing the VC EVC features on the Host.
<Host3>  - EVC Status is GOOD

Attachments

Compare_EVC_VC_and_ESXi.ps1