VMkernel Race Condition: Next2Write Index Inconsistency Between Checkpoint Save and Restore Operations Causes NIC Activation Failures

Article ID: 408934

Updated On:

Products

VMware vSphere ESX 8.x

VMware vSphere ESX 7.x

Issue/Introduction

Ports are in a blocked state after vMotion, causing application downtime.

In TKGI Kubernetes clusters, pods restart due to node network-related timeouts, with vmxnet3 reporting a tx hang.

Environment

ESXi Version : 7.0.3.01900

ESXi Version : 8.0.3.24784735 

 

Cause


This is caused by a race condition between the checkpoint save and checkpoint restore operations during vMotion. If the VMkernel on the source ESX host processes and delivers a packet after the VM has been quiesced, the next2write index of the Rx completion ring advances. Because the checkpoint saved the earlier next2write index, the restore operation finds the gen bit of the Rx descriptor to be invalid, and NIC activation fails.
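The race can be illustrated with a small model. The following Python sketch is not VMkernel code. It assumes a simplified completion ring in which each delivered packet stamps the slot at next2write with the current generation value, and in which activation requires that the slot at the saved next2write does not yet carry that value. The names `CompletionRing`, `checkpoint_save`, `activate`, and `simulate_race`, and the ring size, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

RING_SIZE = 4  # hypothetical ring size, for illustration only


@dataclass
class CompletionRing:
    """Simplified model of an Rx completion ring (assumption, not VMkernel code)."""
    gen_bits: list = field(default_factory=lambda: [0] * RING_SIZE)
    next2write: int = 0
    gen: int = 1

    def deliver_packet(self):
        # Stamp the slot with the current generation, then advance the index.
        self.gen_bits[self.next2write] = self.gen
        self.next2write += 1
        if self.next2write == RING_SIZE:
            # On wrap-around the generation flips.
            self.next2write = 0
            self.gen ^= 1


def checkpoint_save(ring):
    # Only the index and generation are captured here; the descriptors
    # themselves are modeled as living outside the checkpoint.
    return {"next2write": ring.next2write, "gen": ring.gen}


def activate(gen_bits, saved):
    # Assumed validation: the slot at next2write must still be unwritten,
    # so its gen bit must differ from the saved generation.
    return gen_bits[saved["next2write"]] != saved["gen"]


def simulate_race():
    ring = CompletionRing()
    ring.deliver_packet()                 # normal traffic before quiesce
    saved = checkpoint_save(ring)         # checkpoint save
    ok_without_race = activate(ring.gen_bits, saved)
    ring.deliver_packet()                 # packet delivered after the save
    ok_with_race = activate(ring.gen_bits, saved)
    return ok_without_race, ok_with_race
```

In this model, `simulate_race()` reports a successful activation when no packet arrives after the save, and a failed activation once a single packet is delivered between save and restore, because the slot at the stale index has already been written.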

Logs to validate 

=============

vmware.log 

2025-08-19T19:27:13.823Z In(05) vmx - MigrateSetState: Transitioning from state MIGRATE_FROM_VMX_WAITING (9) to MIGRATE_FROM_VMX_PRECOPY (10).
2025-08-19T19:27:28.448Z In(05) vmx - MigrateWaitForData: Waited for 19.18 seconds.
2025-08-19T19:27:28.449Z In(05) vmx - MigrateRPC_DrainPendingWork: Draining pending remote user messages before restore...
2025-08-19T19:27:28.449Z In(05) vmx - MigrateRPC_DrainPendingWork: All pending work completed.
2025-08-19T19:27:28.449Z In(05) vmx - MigrateSetState: Transitioning from state MIGRATE_FROM_VMX_PRECOPY (10) to MIGRATE_FROM_VMX_CHECKPT (11).
2025-08-19T19:27:28.449Z In(05) vmx - SVMotionFixParentPaths: No snapshot paths need to be validated
2025-08-19T19:27:28.449Z In(05) vmx - Migrate_Open: Restoring from <10.196.xx.xx> with migration id 5020850807499805208
2025-08-19T19:27:28.449Z In(05) vmx - DUMPER: Restoring checkpoint version 8.
2025-08-19T19:27:28.449Z In(05) vmx - Checkpointed in VMware ESX, 7.0.3, build-24585291, Linux Host
2025-08-19T19:27:28.449Z No(00) vmx - ConfigDB: Setting sched.swap.derivedName = "/vmfs/volumes/vsan:<UUID>/lt-cmg12u-MG-VM-1-PlgI-3ee5axxxx.vswp"
2025-08-19T19:27:28.449Z In(05) vmx - ConfigDB: Ignoring request to write config file
2025-08-19T19:27:28.449Z No(00) vmx - PowerOnTiming: Module Migrate took 19181614 us


2025-08-19T19:27:28.413Z In(05) vcpu-0 - Migrate: VM successfully stunned.
2025-08-19T19:27:28.449Z In(05) worker-6953703 - Migrate: Remote Log: Destination waited for 19.18 seconds.
2025-08-19T19:27:28.449Z In(05) worker-6953703 - Migrate: Remote Log: Beginning checkpoint restore.
2025-08-19T19:27:28.449Z In(05) worker-6953703 - Migrate: Remote Log: Switching to checkpoint state.


2025-08-19T19:27:28.538Z In(05) vcpu-0 - VMXNET3 user: failed to activate 'Ethernetx', status: 0xbad0001

 

vmkernel.log

2025-08-19T19:27:28.504Z In(182) vmkernel: cpu18:2401914)Net: 2238: connected lt-cmg12u-MG-VM-1-xxxx.ethx ethx to vDS, portID 0x400001b
2025-08-19T19:27:28.538Z In(182) vmkernel: cpu44:2401914)Vmxnet3: 12036: Invalid gen bit for rq: 0, World_Handle: 0x45390489f000
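To check collected logs for these two signatures, a small script along the following lines may help. It is a minimal sketch that assumes the message formats match the excerpts above; the number after `Vmxnet3:` in vmkernel.log is matched loosely because it may differ between builds. The function name `find_signatures` is hypothetical.

```python
import re

# Patterns derived from the log excerpts in this article.
VMX_PATTERN = re.compile(
    r"VMXNET3 user: failed to activate '([^']+)', status: (0x[0-9a-fA-F]+)"
)
VMK_PATTERN = re.compile(r"Vmxnet3: \d+: Invalid gen bit for rq: (\d+)")


def find_signatures(lines):
    """Return (timestamp, source, matched text) for each matching log line."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The timestamp is assumed to be the first space-separated field.
        timestamp = line.split(" ", 1)[0]
        match = VMX_PATTERN.search(line)
        if match:
            hits.append((timestamp, "vmware.log", match.group(0)))
            continue
        match = VMK_PATTERN.search(line)
        if match:
            hits.append((timestamp, "vmkernel.log", match.group(0)))
    return hits
```

The function accepts any iterable of lines, so an open file handle for vmware.log or vmkernel.log can be passed directly. A vmware.log hit and a vmkernel.log hit with the same timestamp, shortly after a "Beginning checkpoint restore" message, are consistent with this issue.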

 

Resolution

There is no workaround.

The issue is fixed in the following versions:

  • ESXi 7.0.3 P10
  • ESXi 8.0.3.0 P06
  • ESXi 9.0.0.0