NSX Manager UI Inaccessible and Proton Service Restarts Due to Underlying Storage Latency

Products

VMware NSX

Issue/Introduction

The NSX Manager UI becomes inaccessible and the proton service restarts automatically. The issue occurs when the proton service loses quorum because storage latency or path drops affect multiple NSX Manager nodes at the same time.

ESXi host (/var/log/vmkernel.log) logs indicate concurrent storage path failures and I/O aborts on multiple nodes, suggesting a centralized storage infrastructure issue:
- <Time Stamp> In(182) vmkernel: cpu49:2098438)qlnativefc: vmhba2(54:0.0): qlnativefcEhAbort:2769:qlnativefcEhAbort: abortCommand mbx success.
  <Time Stamp> In(182) vmkernel: cpu64:50446744)qlnativefc: vmhba2(54:0.0): qlnativefcStatusEntry:2069:C0:T8:L244 - FCP command status: 0x5-0x0 (0x8) portid=011800 oxid=0x506 cdb=280000 len=4096 rspInfo=0x0 resid=0x0 fwResid=0x0 host status = 0x8 device status $
  <Time Stamp> In(182) vmkernel: cpu49:2099329)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x28 (0x45cc23413880, 2100713) to dev "naa.624a937016d2f698eb5d4fe400011dd2" on path "vmhba2:C0:T8:L244" Failed:
  <Time Stamp> In(182) vmkernel: cpu49:2099329)NMP: nmp_ThrottleLogForDevice:3898: H:0x8 D:0x0 P:0x0 . Act:EVAL. cmdId.initiator=0x430ce349fbc0 CmdSN 0x5f995b
  <Time Stamp> Wa(180) vmkwarning: cpu49:2099329)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:235: NMP device "naa.624a937016d2f698eb5d4fe400011dd2" state in doubt; requested fast path state update...
  <Time Stamp> In(182) vmkernel: cpu49:2099329)ScsiDeviceIO: 4686: Cmd(0x45cc23413880) 0x28, CmdSN 0x5f995b from world 2100713 to dev "naa.624a937016d2f698eb5d4fe400011dd2" failed H:0x8 D:0x0 P:0x0
  <Time Stamp> In(182) vmkernel: cpu48:2100749 opID=34ce1fb3)Fil3: 444: timeSpentInFdsIOUs: 40312411 timeSpentInConvUS: 0 timeSpentInStatsUS: 0 hbWaitUS: 5 hbStatus: 0
  <Time Stamp> In(182) vmkernel: cpu49:2098438)qlnativefc: vmhba2(54:0.0): qlnativefcEhAbort:2751:SCSI command timeout counter incremented to 14
  <Time Stamp> In(182) vmkernel: cpu48:2100749 opID=34ce1fb3)Fil3: 451: isTimeoutFsRetry: 0, sfdLockNotFree: 0, fileopTries: 149, optlockTries: 140, timeoutTries: 9, cancelTries: 50
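
When the same check has to be repeated on several hosts, counting NMP failures per device and path can show quickly whether they share a common LUN or HBA. The following is a minimal sketch, assuming the vmkernel.log lines follow the format shown in the excerpt above. The function name summarize_path_failures and the regular expressions are illustrative and are not part of any VMware tool.

```python
import re
from collections import Counter

# Patterns follow the NMP line formats shown in the excerpt above (assumption:
# other ESXi builds may word these messages differently).
NMP_FAILED = re.compile(
    r'to dev "(?P<dev>naa\.[0-9a-f]+)" on path "(?P<path>vmhba\d+:C\d+:T\d+:L\d+)" Failed'
)
STATE_IN_DOUBT = re.compile(r'NMP device "(?P<dev>naa\.[0-9a-f]+)" state in doubt')


def summarize_path_failures(lines):
    """Count failed commands per (device, path) and 'state in doubt' warnings per device."""
    failed = Counter()
    in_doubt = Counter()
    for line in lines:
        match = NMP_FAILED.search(line)
        if match:
            failed[(match["dev"], match["path"])] += 1
            continue
        match = STATE_IN_DOUBT.search(line)
        if match:
            in_doubt[match["dev"]] += 1
    return failed, in_doubt


# Two lines copied from the excerpt; timestamps are left as placeholders.
SAMPLE = [
    '<Time Stamp> In(182) vmkernel: cpu49:2099329)NMP: nmp_ThrottleLogForDevice:3893: '
    'Cmd 0x28 (0x45cc23413880, 2100713) to dev "naa.624a937016d2f698eb5d4fe400011dd2" '
    'on path "vmhba2:C0:T8:L244" Failed:',
    '<Time Stamp> Wa(180) vmkwarning: cpu49:2099329)WARNING: NMP: '
    'nmp_DeviceRequestFastDeviceProbe:235: NMP device '
    '"naa.624a937016d2f698eb5d4fe400011dd2" state in doubt; requested fast path state update...',
]

failed, in_doubt = summarize_path_failures(SAMPLE)
for (dev, path), count in failed.items():
    print(f"{dev} via {path}: {count} failed command(s)")
```

Running the same function against vmkernel.log from each affected host and comparing the device and path keys can help confirm whether the failures point to one shared array or fabric.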

NSX Manager OS-level logs (/var/log/kern.log) indicate persistent I/O aborts, suggesting unstable virtual disk presentation from the hypervisor. The aborts occur simultaneously across multiple NSX Manager nodes, resulting in cluster degradation:
- <Time Stamp> <nsx_manager_1> kernel - - - [ 7664.980240] mptscsih: ioc0: attempting task abort! (sc=ffff8881125e3110)
  <Time Stamp> <nsx_manager_1> kernel - - - [ 7664.980255] sd 2:0:0:0: [sda] tag#16 CDB: Write(10) 2a 00 0a 27 27 d8 00 00 08 00
  <Time Stamp> <nsx_manager_1> kernel - - - [ 7665.348254] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff8881125e3110)
  <Time Stamp> <nsx_manager_2> kernel - - - [ 7014.769254] mptscsih: ioc0: attempting task abort! (sc=ffff88811174d010)
  <Time Stamp> <nsx_manager_2> kernel - - - [ 7014.769263] sd 2:0:0:0: [sda] tag#69 CDB: Write(10) 2a 00 04 82 91 30 00 00 10 00
  <Time Stamp> <nsx_manager_2> kernel - - - [ 7015.157261] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88811174d010)
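
To check whether more than one manager node is affected, the task aborts can be counted per node. This is a rough sketch, assuming the kern.log line layout shown above, with the node name as the token directly before "kernel". The function name task_aborts_per_node and the sample host names nsx-mgr-1 and nsx-mgr-2 are hypothetical.

```python
import re
from collections import Counter

# Assumption: the host name is the token immediately before "kernel", as in the excerpt.
TASK_ABORT = re.compile(
    r"(?P<host>\S+) kernel - - - \[\s*(?P<uptime>\d+\.\d+)\] "
    r"mptscsih: \S+ attempting task abort!"
)


def task_aborts_per_node(lines):
    """Return a Counter of 'attempting task abort' events keyed by node name."""
    counts = Counter()
    for line in lines:
        match = TASK_ABORT.search(line)
        if match:
            counts[match["host"]] += 1
    return counts


# Sample lines modeled on the excerpt; host names and timestamps are placeholders.
SAMPLE = [
    "<Time Stamp> nsx-mgr-1 kernel - - - [ 7664.980240] mptscsih: ioc0: attempting task abort! (sc=ffff8881125e3110)",
    "<Time Stamp> nsx-mgr-1 kernel - - - [ 7665.348254] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff8881125e3110)",
    "<Time Stamp> nsx-mgr-2 kernel - - - [ 7014.769254] mptscsih: ioc0: attempting task abort! (sc=ffff88811174d010)",
]

aborts = task_aborts_per_node(SAMPLE)
affected_nodes = sorted(aborts)
print(f"Nodes with task aborts: {affected_nodes}")
```

If two or more nodes appear in the result for the same time period, the pattern is consistent with the multi-node storage impact described in this article.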

Application-level logs (/var/log/proton/proton_restart.log) show Corfu database writes stuck in a pending state due to underlying storage unavailability:
- <Time Stamp> OverwriteException storm begins — LogUnit-BatchProcessor-0 starts throwing on every batch entry
  <Time Stamp> Queue grows explosively: 70 → 101 → 267 → 1430 entries, all failing
  Total: 1,481 OverwriteExceptions in a single second at 21:15:04
  <Time Stamp> LNM detects: RPC connection to Heartbeat+ServiceMonitor server is down (LNM01)
  <Time Stamp> PROTON=UNKNOWN — Heartbeat server down event
  <Time Stamp> StaleRevisionUpdateException in proton — Corfu writes are conflicting
  <Time Stamp> Secondary OverwriteException bursts (queue draining, 6 more exceptions)
  <Time Stamp> PROTON RESTARTS: GMLE leadership safety violation handler triggered for groupType: mp

Environment

VMware NSX

Cause

The cause is latency or path failure in the storage infrastructure backing the ESXi cluster that hosts the NSX Managers. When storage drops affect at least two NSX Manager nodes simultaneously, pending write operations exhaust resources and quorum is lost, which renders the UI inaccessible. The system then forces a restart of the proton service across the nodes as an automatic self-recovery mechanism.
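
The quorum arithmetic behind the "at least two nodes" threshold can be illustrated with a short sketch. It assumes a three-node NSX Manager cluster and a simple strict-majority rule. The function has_quorum is illustrative only and does not represent how the proton service or Corfu evaluates cluster health internally.

```python
def has_quorum(total_nodes, healthy_nodes):
    """Strict-majority rule: more than half of the nodes must be healthy."""
    return healthy_nodes > total_nodes // 2


TOTAL_NODES = 3  # assumption: a three-node NSX Manager cluster

# One storage-impacted node leaves two healthy nodes; two impacted nodes leave one.
one_node_impacted = has_quorum(TOTAL_NODES, TOTAL_NODES - 1)
two_nodes_impacted = has_quorum(TOTAL_NODES, TOTAL_NODES - 2)

print(f"Quorum with one node impacted: {one_node_impacted}")
print(f"Quorum with two nodes impacted: {two_nodes_impacted}")
```

Under these assumptions, a single impacted node leaves a majority intact, while two impacted nodes do not, which matches the threshold described above.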

Resolution

Investigate the SAN/storage array backing the specific ESXi cluster where the NSX Managers reside.
Review the storage switch and array logs that correspond to the NMP path failure timestamps to identify the source of the latency or path drops.
Stabilize the underlying storage infrastructure to prevent further quorum loss in the NSX management plane.
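
To line up the NMP path failure timestamps with storage switch and array logs, it can help to collapse the individual events into a few time windows first. The sketch below assumes ISO 8601 UTC timestamps such as 2024-01-01T21:15:04.000Z. The sample values, the 60-second gap, and the function name failure_windows are hypothetical and should be adjusted to the environment.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # assumption: ISO 8601 UTC with fractional seconds


def failure_windows(timestamps, max_gap=timedelta(seconds=60)):
    """Merge event times into (start, end) windows; a gap above max_gap starts a new window."""
    windows = []
    for ts in sorted(timestamps):
        if windows and ts - windows[-1][1] <= max_gap:
            # Event is close to the previous one: extend the current window.
            windows[-1] = (windows[-1][0], ts)
        else:
            windows.append((ts, ts))
    return windows


# Hypothetical timestamps taken from NMP failure lines on one host.
raw = [
    "2024-01-01T21:15:04.000Z",
    "2024-01-01T21:15:30.000Z",
    "2024-01-01T21:20:00.000Z",
]
events = [datetime.strptime(value, TS_FORMAT) for value in raw]
windows = failure_windows(events)

for start, end in windows:
    print(f"Check array and switch logs between {start.isoformat()} and {end.isoformat()}")
```

Each resulting window gives a start and end time to search for in the array and fabric logs.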

Additional Information

NSX Manager Cluster is Degraded and multiple services go down randomly

Storage latency causes NSX Manager cluster instability