Unexpected virtual machine resets triggered by vSphere High Availability in a vSAN cluster

Article ID: 428779

Products

VMware vSAN

Issue/Introduction

In the vSphere Client, an error may be reported under Events for the ESXi host.

The virtual machine's vmware.log (under /vmfs/volumes/<datastore>/) may record a VMware Tools heartbeat timeout:

yyyy-mm-ddThh:mm:ss In(05) vcpu-0 - Tools: [AppStatus] Last heartbeat value 823042 (last received 0s ago)

yyyy-mm-ddThh:mm:ss In(05) vcpu-0 - Tools: Tools heartbeat timeout.

yyyy-mm-ddThh:mm:ss In(05) vcpu-0 - Tools: [RunningStatus] Last heartbeat value 823043 (last received 21s ago)

The following event, "is going to reset state because failure timer expired", may be logged in /var/run/log/fdm.log on the ESXi host:

yyyy-mm-ddThh:mm:ss Db(167) Fdm[2105951]: [Originator@6876 sub=Policy opID=WorkQueue-78c20581] VM /vmfs/volumes/vsan:xxxxxxxxxxxxxxxx-xxxxxxxxxxxxxxxx/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/<VM_name>.vmx is going to reset state because failure timer expired. Reset no: 1 out of Max allowed reset count: 3
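To see which VMs were reset and how many resets each has used, the fdm.log entries can be condensed into one line per event. The following is a minimal sketch, assuming the message format matches the excerpt above. The function name `list_ha_resets` is hypothetical and is not part of any VMware tooling.

```shell
# Hypothetical helper: list HA reset events from an fdm.log-style file.
# Assumes each event line matches the format in the excerpt above.
# Prints "<vmx path> reset <n> of <max>" for every matching line.
list_ha_resets() {
    grep -F 'is going to reset state because failure timer expired' "$1" |
        sed -n 's/.* VM \(.*\.vmx\) is going to reset state.*Reset no: \([0-9]*\) out of Max allowed reset count: \([0-9]*\).*/\1 reset \2 of \3/p'
}

# Example (on the ESXi host): list_ha_resets /var/run/log/fdm.log
```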

/var/run/log/vmware.log may be flooded with a rapid stream of small unmap requests:

yyyy-mm-ddThh:mm:ss [439183365] [cpu##] ZDOMTraceUnmapIO: reqID 8061928

yyyy-mm-ddThh:mm:ss [439183366] [cpu##] ZDOMTraceUnmapIO: reqID 8061929

yyyy-mm-ddThh:mm:ss [439183367] [cpu##] ZDOMTraceUnmapIO: reqID 8061930
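To judge how rapidly unmap requests are arriving, the entries can be counted per second. This is a sketch only: it assumes the first field of each line is an ISO-style timestamp, as in the excerpt above, and the function name `count_unmaps_per_second` is hypothetical.

```shell
# Hypothetical helper: count ZDOMTraceUnmapIO entries per second.
# Assumes the first whitespace-separated field is a timestamp whose first
# 19 characters are yyyy-mm-ddThh:mm:ss (fractional seconds are ignored).
count_unmaps_per_second() {
    awk '/ZDOMTraceUnmapIO/ { n[substr($1, 1, 19)]++ }
         END { for (t in n) print t, n[t] }' "$1" | sort
}

# Example: count_unmaps_per_second <path_to_log>
```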

Look for journal flush messages in the journal log shortly before the VM reset (log in to the vCenter through SSH and run /usr/bin/journalctl):

yyyy-mm-ddThh:mm:ss [cpu##] ZDOMTraceVtxJournalFlush: LSN 347076607, replay 26203

yyyy-mm-ddThh:mm:ss [cpu##] ZDOMTraceVtxJournalFlush: LSN 347076613, replay 250

yyyy-mm-ddThh:mm:ss [cpu##] ZDOMTraceVtxJournalFlush: LSN 347076615, replay 66

yyyy-mm-ddThh:mm:ss [cpu##] ZDOMTraceVtxJournalFlush: LSN 347076623, replay 342

yyyy-mm-ddThh:mm:ss [cpu##] ZDOMTraceVtxJournalFlush: LSN 347076630, replay 296
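The journal flush entries can be reduced to timestamp, LSN, and replay columns so they are easier to line up against the reset time. This is a sketch under assumptions: the function name `journal_flush_summary` is hypothetical, and it treats the first field of each line as the timestamp, as in the excerpts above. Default journalctl output may use a different prefix; `-o short-iso` is one way to get an ISO timestamp first.

```shell
# Hypothetical helper: read journal text on stdin and print
# "<first field> <LSN> <replay>" for each ZDOMTraceVtxJournalFlush entry.
journal_flush_summary() {
    awk '/ZDOMTraceVtxJournalFlush/ {
        lsn = ""; replay = ""
        for (i = 1; i < NF; i++) {
            if ($i == "LSN")    { lsn = $(i + 1); sub(/,$/, "", lsn) }
            if ($i == "replay") { replay = $(i + 1) }
        }
        print $1, lsn, replay
    }'
}

# Example: /usr/bin/journalctl -o short-iso | journal_flush_summary
```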

Environment

VMware vSAN 8.x

VMware vSAN 9.x

Cause

During routine storage optimization tasks (such as unmap operations), vSAN may temporarily require additional resources. In some instances, this can cause the virtual machine to experience latency or become temporarily inaccessible, which can lead to a VM reset.

Resolution

Improvements to how unmap operations are managed will be included in future releases.
In the meantime, the following workaround is available for ESXi 8.x hosts:

- Log in to each ESXi host and run the command below to decrease the unmap throttle threshold to 10%. This may help to alleviate the performance impact.

esxcfg-advcfg -s 10 /VSAN/zDOMUnmapThrottleThreshold

- If the issue persists, run the command below on each ESXi host and then reboot the hosts. This setting allocates 25% of the thread pool to space reclamation. If the vSAN datastore is nearing critical capacity and a significant number of VMs or files have recently been deleted, this configuration accelerates the reclamation of that free space and helps to improve performance.

esxcfg-advcfg -s 25 /VSAN/zDOMPercentUnmapThreadsPerPool
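After applying either setting, the current values can be read back with the `-g` (get) option of `esxcfg-advcfg`. The wrapper below is a minimal sketch. The function name `show_unmap_settings` is hypothetical, and the two option paths are the ones used in the workaround above.

```shell
# Hypothetical helper: read back the two vSAN advanced options
# set by the workaround, one esxcfg-advcfg call per option.
show_unmap_settings() {
    for opt in /VSAN/zDOMUnmapThrottleThreshold /VSAN/zDOMPercentUnmapThreadsPerPool; do
        esxcfg-advcfg -g "$opt"
    done
}

# Example (run on each ESXi host): show_unmap_settings
```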
