After enabling Fault Tolerance (FT) on a Virtual Machine, performance or hung issues reported

Article ID: 377099

Products

VMware vSphere ESXi

VMware vCenter Server

Issue/Introduction

After Fault Tolerance (FT) is enabled on a virtual machine, the following symptoms may be reported:

  • A Windows virtual machine might report high CPU utilization (above 85%) within the guest OS when running any application that uses internet connectivity (e.g. Chrome, Edge) or any application that needs external connectivity from the guest OS.
  • A Linux virtual machine (e.g. RHEL) hangs or becomes unresponsive.

Environment

VMware vCenter Server

VMware vSphere ESXi

Cause

  • The ESXi host hardware might not be compatible with the Fault Tolerance requirements, or
  • The Fault Tolerance checklist has not been followed.

Resolution

1. Check that the ESXi host hardware is compatible with the VMware Fault Tolerance (FT) feature.

For example, to check the compatibility of Dell servers running ESXi 8.0 U2, refer to the Broadcom Compatibility Guide: https://compatibilityguide.broadcom.com/search?program=server&persona=live&column=partnerName&order=asc

2. Follow the Fault Tolerance checklist: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/7-0/vsphere-availability.html

3. Make sure that you are not using vSphere features that are unsupported with Fault Tolerance: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/7-0/vsphere-availability.html

4. Refer to Fault Tolerance Requirements, Limits, and Licensing: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/7-0/vsphere-availability.html

5. If the issue persists after all compatibility checks have been validated, try reducing the vCPU count of the FT VM (from 8 to 4, or from 4 to 2) and check whether performance improves once the overprovisioned vCPUs are removed.
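The vCPU step-down described above can be written out as a small planning helper. This is only an illustrative sketch: the function name `vcpu_step_down_plan` and the `DOCUMENTED_STEPS` table are hypothetical, and the helper covers only the two reductions this article mentions (8 to 4 and 4 to 2). It does not change any VM setting. It only lists the vCPU counts to try, in order.

```python
from typing import Dict, List

# Reductions named in this article: 8 -> 4 and 4 -> 2.
# Any other starting count is outside what the article describes.
DOCUMENTED_STEPS: Dict[int, int] = {8: 4, 4: 2}


def vcpu_step_down_plan(current_vcpus: int) -> List[int]:
    """Return the successive vCPU counts to try, following the documented steps."""
    plan: List[int] = []
    count = current_vcpus
    while count in DOCUMENTED_STEPS:
        count = DOCUMENTED_STEPS[count]
        plan.append(count)
    return plan


if __name__ == "__main__":
    for start in (8, 4, 2):
        print(start, "->", vcpu_step_down_plan(start))
```

Re-check guest performance after each step before moving to the next one, so that the VM is not reduced further than necessary.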

Additional Information

1. If the Primary VM cannot communicate with the Secondary VM through the FT network, it is expected behavior that the Primary VM temporarily stops responding for a few seconds.

When the FT network is disconnected, the Primary VM remains in a stun state until it determines that it cannot communicate with the Secondary VM and cancels synchronization.

While stunned, the Primary VM waits for a response from the Secondary VM. The length of the wait depends on how long the underlying network stack takes to return a code to FT indicating a network disconnection, up to a maximum timeout of 8 seconds (configurable).

The following log snippet from vmware.log shows an example:

YYYY-MM-DDTHH:MM:SSZ cpu19:2606776)WARNING: FTCpt: 4501: (1879478490531949822 pri) Error reading zero on socket (2049/8000 ms elapsed/timeout): Already disconnected

This is by design and expected.
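To see how long each stun lasted, the elapsed and timeout values can be pulled out of log lines like the one above. The following is a minimal sketch, assuming the message layout is exactly as in that snippet. The names `parse_ftcpt_line` and `FtSocketWait` are hypothetical and introduced only for illustration; the regular expression has been checked only against that single sample line.

```python
import re
from typing import NamedTuple, Optional

# Matches the FTCpt warning layout shown in the snippet above (assumed stable).
FTCPT_RE = re.compile(
    r"FTCpt: \d+: \((?P<ft_id>\d+) (?P<role>\w+)\) "
    r".*?\((?P<elapsed>\d+)/(?P<timeout>\d+) ms elapsed/timeout\): (?P<reason>.+)$"
)


class FtSocketWait(NamedTuple):
    role: str          # "pri" in the sample line
    elapsed_ms: int    # time spent waiting before the error was returned
    timeout_ms: int    # configured maximum wait (8000 ms in the sample line)
    reason: str        # trailing message, e.g. "Already disconnected"

    @property
    def hit_timeout(self) -> bool:
        """True if the wait ran for the full timeout instead of ending early."""
        return self.elapsed_ms >= self.timeout_ms


def parse_ftcpt_line(line: str) -> Optional[FtSocketWait]:
    """Return the wait details from one log line, or None if it does not match."""
    match = FTCPT_RE.search(line)
    if match is None:
        return None
    return FtSocketWait(
        role=match["role"],
        elapsed_ms=int(match["elapsed"]),
        timeout_ms=int(match["timeout"]),
        reason=match["reason"].strip(),
    )


sample = (
    "YYYY-MM-DDTHH:MM:SSZ cpu19:2606776)WARNING: FTCpt: 4501: "
    "(1879478490531949822 pri) Error reading zero on socket "
    "(2049/8000 ms elapsed/timeout): Already disconnected"
)
event = parse_ftcpt_line(sample)
if event is not None:
    print(event.role, event.elapsed_ms, event.timeout_ms, event.hit_timeout)
```

For the sample line, the wait ended after 2049 ms because the network stack reported the disconnection, well before the 8000 ms timeout.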

 

2. To reduce some of the additional overhead of FT, consider turning off encryption for the FT VM. Encryption is optional when "FT is working in an internal network".

Steps:

  • Turn off FT.
  • In the VM's "Edit Settings -> VM Options -> Encryption -> Encrypted FT" setting, select "Disabled".
  • Turn on FT.

Attachments

ft.emt.gz