NetApp ONTAP Select VM: Panics and Reboots due to vNVRAM Flush Timeouts on ESXi

Article ID: 436516

Products

VMware vSphere ESXi

Issue/Introduction

A NetApp ONTAP Select virtual machine (VM) experiences intermittent reboots (panics).

Review of the logs reveals the following symptoms:

  • ONTAP Panic String: vnvram_flush: vnvram_sim_kern_nvram_write_n failed with 60.
  • ESXi Storage Errors: vmkernel logs show SCSI command failures such as H:0x2 D:0x8 P:0x0 (Busy) and H:0xc D:0x0 P:0x0 (Retry) on the affected HBA (e.g., vmhba0).
  • Path Instability: Log entries showing lpfc_start_devloss and LOGO received from NPORT, forcing the host into device loss timeouts.
  • Heartbeat Degradation: hostd logs show the VM heartbeat status changing from Green to Yellow just prior to the reboot.


    2026-04-10T08:27:08.758Z In(166) Hostd[2099492]: [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/67b0c3ce-ac0f0709-c0a3-###########/vm/vm.vmx] Setting heartbeat to yellow; Heartbeat (in 30s): expected=30 (yellow<=80%, red<=40%), actual=18 (60%)
    2026-04-10T08:27:08.758Z Db(167) Hostd[2099492]: [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/67b0c3ce-ac0f0709-c0a3-###########/vm/vm.vmx] Updating current heartbeatStatus: green -> yellow
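
The heartbeat line above carries its own thresholds: the VM delivered 18 of the 30 expected heartbeats in the 30-second window (60%), which is at or below the 80% yellow threshold but above the 40% red threshold. The following is a minimal Python sketch of that classification. It assumes the message format matches the sample line exactly; the function name `classify_heartbeat` and the regular expression are illustrative and not part of any VMware tool.

```python
import re

# Pattern derived only from the sample hostd line in this article.
# Other ESXi builds may format the message differently (assumption).
HEARTBEAT_RE = re.compile(
    r"expected=(?P<expected>\d+) "
    r"\(yellow<=(?P<yellow>\d+)%, red<=(?P<red>\d+)%\), "
    r"actual=(?P<actual>\d+)"
)


def classify_heartbeat(line):
    """Return (percent, status) for a hostd heartbeat line, or None."""
    match = HEARTBEAT_RE.search(line)
    if match is None:
        return None
    expected = int(match["expected"])
    actual = int(match["actual"])
    if expected == 0:
        return None
    percent = 100.0 * actual / expected
    # Thresholds are read from the log line itself, not hard-coded.
    if percent <= int(match["red"]):
        status = "red"
    elif percent <= int(match["yellow"]):
        status = "yellow"
    else:
        status = "green"
    return percent, status


sample = (
    "Setting heartbeat to yellow; Heartbeat (in 30s): expected=30 "
    "(yellow<=80%, red<=40%), actual=18 (60%)"
)
print(classify_heartbeat(sample))
```

Running this over every heartbeat line in hostd.log gives a quick view of how far the VM degraded before each reboot.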

Example

Logs from vmkernel.log

2026-04-10T08:25:20.331Z Wa(180) vmkwarning: cpu5:2098191)WARNING: lpfc: lpfc_start_devloss:4565: vmhba0 3248 Start 10 sec devloss tmo WWPN ##:##:##:##:##:##:##:00 NPort x012800

2026-04-10T08:25:20.358Z In(182) vmkernel: cpu43:3688490)lpfc: lpfc_handle_status:5631: vmhba0 3271: FCP cmd x8a failed <0/2> sid x010c00, did x012800, oxid xac7 iotag xded Invalid RPI Host Retry
2026-04-10T08:25:20.358Z In(182) vmkernel: cpu45:2098280)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x8a (0x45f#######, 2109563) to dev "naa.#########################" on path "vmhba0:C0:T0:L2" Failed:
2026-04-10T08:25:20.358Z In(182) vmkernel: cpu45:2098280)NMP: nmp_ThrottleLogForDevice:3898: H:0x2 D:0x8 P:0x0 . Act:EVAL. cmdId.initiator=0x430####### CmdSN 0x##########
2026-04-10T08:25:20.358Z Wa(180) vmkwarning: cpu45:2098280)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:235: NMP device "naa.#########################" state in doubt; requested fast path state update...
2026-04-10T08:25:20.358Z In(182) vmkernel: cpu43:3688490)lpfc: lpfc_handle_status:5631: vmhba0 3271: FCP cmd x28 failed <0/4> sid x010c00, did x012800, oxid xeea iotag x1210 Invalid RPI Host Retry
2026-04-10T08:25:20.358Z In(182) vmkernel: cpu45:2098280)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x28 (0x430#######, 2099472) to dev "naa.#########################" on path "vmhba0:C0:T0:L4" Failed:
2026-04-10T08:25:20.358Z In(182) vmkernel: cpu45:2098280)NMP: nmp_ThrottleLogForDevice:3898: H:0x2 D:0x8 P:0x0 . Act:EVAL. cmdId.initiator=0x430####### CmdSN 0x##########
2026-04-10T08:25:20.358Z Wa(180) vmkwarning: cpu45:2098280)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:235: NMP device "naa.#########################" state in doubt; requested fast path state update...
2026-04-10T08:25:20.373Z In(182) vmkernel: cpu45:2098280)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x28 (0x430#######, 2099472) to dev "naa.#########################" on path "vmhba0:C0:T0:L4" Failed:
2026-04-10T08:25:20.373Z In(182) vmkernel: cpu45:2098280)NMP: nmp_ThrottleLogForDevice:3898: H:0xc D:0x0 P:0x0 . Act:NONE. cmdId.initiator=0x430####### CmdSN 0x##########
2026-04-10T08:25:20.400Z In(182) vmkernel: cpu45:2098280)NMP: nmp_ThrottleLogForDevice:3825: last error status from device naa.######################### repeated 10 times
2026-04-10T08:25:20.405Z In(182) vmkernel: cpu45:2098280)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x8a (0x45ea0d13f340, 2109563) to dev "naa.#########################" on path "vmhba0:C0:T0:L2" Failed:
2026-04-10T08:25:20.405Z In(182) vmkernel: cpu45:2098280)NMP: nmp_ThrottleLogForDevice:3898: H:0xc D:0x0 P:0x0 . Act:NONE. cmdId.initiator=0x443####### CmdSN 0x##########
2026-04-10T08:25:20.405Z In(182) vmkernel: cpu45:2098280)ScsiDeviceIO: 4644: Cmd(0x45ea1d8474c0) 0x8a, CmdSN 0x8001017b from world 2109563 to dev "naa.#########################" failed H:0xc D:0x0 P:0x0
2026-04-10T08:25:20.405Z In(182) vmkernel: cpu45:2098280)ScsiDeviceIO: 4644: Cmd(0x45f2203188c0) 0x8a, CmdSN 0x300008 from world 2109563 to dev "naa.#########################" failed H:0xc D:0x0 P:0x0
2026-04-10T08:25:20.407Z In(182) vmkernel: cpu45:2098280)NMP: nmp_ThrottleLogForDevice:3825: last error status from device naa.######################### repeated 10 times
2026-04-10T08:25:20.411Z In(182) vmkernel: cpu47:2098280)NMP: nmp_ThrottleLogForDevice:3825: last error status from device naa.######################### repeated 20 times
2026-04-10T08:25:20.412Z In(182) vmkernel: cpu47:2098280)ScsiDeviceIO: 4644: Cmd(0x45f22020eac0) 0x8a, CmdSN 0x8001016e from world 2109563 to dev "naa.#########################" failed H:0xc D:0x0 P:0x0
2026-04-10T08:25:20.412Z In(182) vmkernel: cpu47:2098280)ScsiDeviceIO: 4644: Cmd(0x45f2202456c0) 0x8a, CmdSN 0x80010177 from world 2109563 to dev "naa.#########################" failed H:0xc D:0x0 P:0x0
2026-04-10T08:25:20.412Z In(182) vmkernel: cpu47:2098280)ScsiDeviceIO: 4644: Cmd(0x45ea1d9fb6c0) 0x88, CmdSN 0x80010100 from world 2109563 to dev "naa.#########################" failed H:0xc D:0x0 P:0x0
2026-04-10T08:25:20.412Z In(182) vmkernel: cpu47:2098280)ScsiDeviceIO: 4644: Cmd(0x45ea1d9662c0) 0x88, CmdSN 0x800100d3 from world 2109563 to dev "naa.#########################" failed H:0xc D:0x0 P:0x0
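
When an excerpt like the one above runs to thousands of lines, tallying the H:/D:/P: status triplets per path makes the pattern easier to see. The Python sketch below assumes the `nmp_ThrottleLogForDevice` messages appear in pairs as in the sample: a "Failed:" line naming the path, followed by a line carrying the status triplet. The function name `tally_nmp_failures` is hypothetical, and the sample lines are abbreviated copies of the log entries above.

```python
import re
from collections import Counter

# Both patterns are derived from the sample log lines in this article only.
PATH_RE = re.compile(r'on path "(?P<path>vmhba\d+:C\d+:T\d+:L\d+)" Failed:')
STATUS_RE = re.compile(r"H:0x[0-9a-fA-F]+ D:0x[0-9a-fA-F]+ P:0x[0-9a-fA-F]+")


def tally_nmp_failures(lines):
    """Count (path, status) pairs from paired NMP throttle-log lines."""
    counts = Counter()
    pending_path = None
    for line in lines:
        if "nmp_ThrottleLogForDevice" not in line:
            continue
        path_match = PATH_RE.search(line)
        if path_match:
            # Remember the path until the matching status line arrives.
            pending_path = path_match["path"]
            continue
        status_match = STATUS_RE.search(line)
        if status_match and pending_path is not None:
            counts[(pending_path, status_match.group(0))] += 1
            pending_path = None
    return counts


# Abbreviated versions of the vmkernel.log lines shown in this article.
sample_lines = [
    'NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x8a to dev "naa.#" on path "vmhba0:C0:T0:L2" Failed:',
    "NMP: nmp_ThrottleLogForDevice:3898: H:0x2 D:0x8 P:0x0 . Act:EVAL.",
    'NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x28 to dev "naa.#" on path "vmhba0:C0:T0:L4" Failed:',
    "NMP: nmp_ThrottleLogForDevice:3898: H:0xc D:0x0 P:0x0 . Act:NONE.",
    "NMP: nmp_ThrottleLogForDevice:3825: last error status from device naa.# repeated 10 times",
]

for (path, status), count in sorted(tally_nmp_failures(sample_lines).items()):
    print(path, status, count)
```

Because ESXi throttles these messages ("repeated 10 times"), the counts are a lower bound on the number of failed commands, not an exact total.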

 

 

Environment

NetApp ONTAP Select

VMware ESXi (all versions)

Cause

The primary cause is a vNVRAM flush timeout triggered by physical layer errors and storage path instability on the ESXi host.

When an ESXi host encounters transient errors (such as H:0xc or D:0x8) and has only a single path per fabric, it must wait for the driver to time out before it can switch fabrics. For latency-sensitive VMs such as NetApp ONTAP Select, this delay can exceed the VM's internal watchdog timer, causing the VM to panic and reboot. Physical layer evidence typically includes elevated Invalid Tx Word Counts or Invalid CRC Counts in HBA statistics.
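
The single-path-per-fabric condition can be spotted from runtime path names such as `vmhba0:C0:T0:L2`, the format that appears in the vmkernel log. The Python sketch below assumes each HBA connects to exactly one fabric, so the number of paths per adapter approximates the number of paths per fabric; verify that assumption against the actual zoning. The helper name `single_path_adapters`, the device IDs, and the path inventory are all hypothetical.

```python
from collections import defaultdict


def single_path_adapters(device_paths):
    """Return {device: [adapters that have exactly one path to it]}.

    device_paths maps a device ID to a list of runtime path names,
    for example "vmhba0:C0:T0:L2". Assumes one HBA per fabric.
    """
    result = {}
    for device, paths in device_paths.items():
        per_adapter = defaultdict(set)
        for runtime_name in paths:
            # The adapter is the first colon-separated field.
            adapter = runtime_name.split(":", 1)[0]
            per_adapter[adapter].add(runtime_name)
        lonely = sorted(a for a, p in per_adapter.items() if len(p) == 1)
        if lonely:
            result[device] = lonely
    return result


# Hypothetical inventory: device names and paths are illustrative only.
example = {
    "naa.example1": ["vmhba0:C0:T0:L2", "vmhba1:C0:T0:L2"],
    "naa.example2": [
        "vmhba0:C0:T0:L4", "vmhba0:C0:T1:L4",
        "vmhba1:C0:T0:L4", "vmhba1:C0:T1:L4",
    ],
}
print(single_path_adapters(example))
```

In this made-up inventory, the first device has one path on each adapter and is reported, while the second has two targets per adapter and is not.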

Resolution

To resolve this issue, the underlying physical layer instability must be addressed:

  1. Inspect Physical Components: Check the SFP modules and fiber optic cabling for the affected HBA. Replace any components showing high error counts.
  2. Verify HBA Statistics: Run the following command on the ESXi host to check for physical layer errors:

    localcli storage san fc stats get -a <vmhba_name>

    Look for non-zero values in Invalid Tx Word Count or Invalid CRC Count.

    Example  

    esxcli storage core adapter stats get    

    vmhba0:
       Successful Commands: 1766406282
       Blocks Read: 66195583291
       Blocks Written: 31367756678
       Read Operations: 816874319
       Write Operations: 944220471
       Reserve Operations: 0
       Reservation Conflicts: 0
       Failed Commands: 46540
       Failed Blocks Read: 0
       Failed Blocks Written: 0
       Failed Read Operations: 16380
       Failed Write Operations: 28328
       Failed Reserve Operations: 0
       Total Splits: 0
       PAE Commands: 0


    esxcli storage san fc stats get    


       Adapter: vmhba0
       Tx Frames: 1029054136
       Rx Frames: 2104483743
       Lip Count: 0
       Error Frames: 0
       Dumped Frames: 0
       Link Failure Count: 0
       Loss of Signal Count: 0
       PrimSeq Protocol Err Count: 0
       Invalid Tx Word Count: 8676
       Invalid CRC Count: 0
       Input Requests: 0
       Output Requests: 0
       Control Requests: 0

     

  3. Hardware Diagnostics: Engage the server hardware vendor to perform health checks on the Host Bus Adapter (HBA).
  4. Storage Array Analysis: Work with the storage vendor to investigate why the array is returning "Busy" (D:0x8) statuses for the host initiators.
  5. Path Redundancy: Ensure the host has redundant paths on each fabric to allow for faster failover without exceeding VM watchdog timers.
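
The counter check in step 2 can be scripted when many hosts or adapters need review. The Python sketch below parses text in the `Name: value` layout of the FC statistics example above and reports non-zero values for the two counters named in step 2. It assumes the output has been captured to a string beforehand; the function name `find_fc_errors` is hypothetical, and the sample text is a shortened copy of the example output.

```python
# Counters named in step 2 of this article.
WATCHED = ("Invalid Tx Word Count", "Invalid CRC Count")


def find_fc_errors(output):
    """Return {adapter: {counter: value}} for non-zero watched counters."""
    errors = {}
    adapter = None
    for raw in output.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # Not a "Name: value" line.
        value = value.strip()
        if key == "Adapter":
            adapter = value
        elif key in WATCHED and adapter is not None and value.isdigit():
            if int(value) > 0:
                errors.setdefault(adapter, {})[key] = int(value)
    return errors


# Shortened copy of the FC statistics example in this article.
sample_output = """\
   Adapter: vmhba0
   Tx Frames: 1029054136
   Rx Frames: 2104483743
   Link Failure Count: 0
   Invalid Tx Word Count: 8676
   Invalid CRC Count: 0
"""
print(find_fc_errors(sample_output))
```

These counters are cumulative, so comparing two captures taken some time apart shows whether errors are still accumulating or date from an earlier event.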