NSX Manager cluster degrades and multiple services fail randomly

Article ID: 419529

Updated On:

Products

VMware NSX

Issue/Introduction

  • The NSX Manager cluster enters a degraded state.
    • Cluster services randomly fail or go down on the Manager nodes.
    • The alarms typically self-resolve after a short period.

  • The /var/log/kern.log file on one or more Manager nodes displays SCSI host driver task aborts similar to the following:
    2025-11-14T06:14:19.658Z nsxmgr.cor.local kernel - - - [ 1591.362466] mptscsih: ioc0: attempting task abort! (sc=ffff88810efdd910)
    2025-11-14T06:14:19.682Z nsxmgr.cor.local kernel - - - [ 1591.766465] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88810efdd910)

  • The /var/log/corfu.9000.log file on the Manager nodes displays stream log errors similar to the following:
    2025-11-14T06:14:21.260Z | ERROR |       LogUnit-BatchProcessor-0 |           o.c.i.BatchProcessor | batchWriteProcessor: stream log error. Batch: [queue size=7]. StreamLog: [trim mark=51387158].
    2025-11-14T06:14:21.260Z | ERROR |       LogUnit-BatchProcessor-0 |           o.c.i.BatchProcessor | batchWriteProcessor: stream log error. Batch: [queue size=6]. StreamLog: [trim mark=51387158].

  • The /var/log/proton/proton_restart.log file contains entries similar to the following:
    2025-11-14T06:14:22.350Z  INFO application-restartor restartor 52576 - [nsx@4413 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (GMLE leadership safety violation handler triggered for groupType: mp) =====
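
The three log symptoms above can be checked together with a small script. The following is a minimal sketch, not an official tool: the regular expressions are derived only from the sample lines in this article, and the names `SIGNATURES`, `classify_line`, and `count_symptoms` are hypothetical. Real log lines may vary between NSX versions, so treat the patterns as a starting point.

```python
import re
from collections import Counter

# Patterns derived from the sample log lines in this article (assumption:
# real messages keep the same wording).
SIGNATURES = {
    "scsi_task_abort": re.compile(r"mptscsih: ioc\d+: attempting task abort!"),
    "corfu_stream_log_error": re.compile(r"batchWriteProcessor: stream log error"),
    "proton_restart": re.compile(r"APPLICATION IS GOING RESTART"),
}


def classify_line(line):
    """Return the name of the first symptom whose pattern matches, else None."""
    for name, pattern in SIGNATURES.items():
        if pattern.search(line):
            return name
    return None


def count_symptoms(lines):
    """Count matching lines per symptom across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        name = classify_line(line)
        if name is not None:
            counts[name] += 1
    return counts
```

To use it on a Manager node, pass an open file object for each of the three log files named above to `count_symptoms`. Nonzero counts for all three symptoms in the same time frame are consistent with the issue described here.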

Environment

VMware NSX

Cause

High write latencies in the Corfu database, caused by underlying host or storage performance issues, lead to overall cluster instability.
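
One way to support this diagnosis is to check whether Corfu stream log errors closely follow SCSI task aborts in time. The sketch below assumes that every log line begins with an ISO-8601 UTC timestamp in the format shown in the samples in this article. The function names and the 5-second window are illustrative assumptions, and proximity in time suggests a relationship but does not prove it.

```python
from datetime import datetime, timedelta

# Timestamp format seen in the sample log lines, e.g. 2025-11-14T06:14:19.658Z
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_ts(line):
    """Parse the leading timestamp (first whitespace-delimited token) of a line."""
    return datetime.strptime(line.split()[0], TS_FORMAT)


def errors_following_aborts(abort_lines, corfu_lines, window_seconds=5.0):
    """Return timestamps of Corfu errors logged within window_seconds
    after any SCSI task abort. The default window is an arbitrary assumption."""
    window = timedelta(seconds=window_seconds)
    abort_times = [parse_ts(line) for line in abort_lines]
    matched = []
    for line in corfu_lines:
        ts = parse_ts(line)
        if any(timedelta(0) <= ts - abort <= window for abort in abort_times):
            matched.append(ts)
    return matched
```

With the sample lines from this article, the Corfu error at 06:14:21.260Z falls about 1.6 seconds after the task abort at 06:14:19.658Z, so it is reported as following the abort.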

Resolution

To resolve this issue, you must relieve the storage latency affecting the impacted NSX Manager node.

  1. Identify the impacted NSX Manager node, that is, the node whose /var/log/kern.log shows the SCSI host driver (mptscsih) task aborts.

  2. Migrate the impacted NSX Manager virtual machine (using vSphere vMotion or Storage vMotion) to a different ESXi host or datastore that has healthy storage performance metrics.
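
Step 1 can be sketched as a per-node count of task aborts, assuming the kern.log lines from all Manager nodes have been collected in one place. The sketch assumes the hostname is the second whitespace-delimited field, as in the sample kern.log lines in this article. The function name `aborts_per_node` is hypothetical.

```python
from collections import Counter


def aborts_per_node(kern_log_lines):
    """Count mptscsih 'attempting task abort' messages per hostname.

    Assumes the kern.log layout shown in this article:
    <timestamp> <hostname> kernel - - - [uptime] <message>
    """
    counts = Counter()
    for line in kern_log_lines:
        if "mptscsih" in line and "attempting task abort" in line:
            fields = line.split()
            if len(fields) >= 2:
                counts[fields[1]] += 1
    return counts
```

Calling `aborts_per_node(lines).most_common()` lists the nodes ordered by abort count. The node or nodes with nonzero counts are the candidates for migration in step 2.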