Congestion Oversubscription and Credit Stall events reported in /var/log/vmkernel.log on ESXi Server as well as vCenter Server

Article ID: 390100

Products

  • VMware vSphere ESXi
  • VMware vSphere ESXi 5.0
  • VMware vSphere ESXi 5.5
  • VMware vSphere ESXi 5.x - View
  • VMware vSphere ESXi 6.0
  • VMware vSphere ESXi 7.0
  • VMware vSphere ESXi 8.0

Issue/Introduction

Symptoms

  • Existing VMFS datastores intermittently disappear or become unmounted from specific ESXi hosts within the cluster.
  • In the vSphere Client, the storage device (LUN) is visible under Storage Devices, but the Datastore column shows “Not Consumed.”
  • Running esxcfg-volume -l returns no output, indicating that no mountable volumes are detected.
  • Rescanning storage on an ESXi host fails.

Messages similar to the following are displayed:

Frequent PowerOn Reset Unit Attentions are occurring on path vmhba0:C1:T0:L0. This may indicate a storage problem. Affected device: naa.###################. Affected datastores:vmdatafiles
cpu0:2659)ScsiCore: 1460: Power-on Reset occurred on vmhbaX:C0:T2:L0
cpu7:2055)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x41244038d380, 2056) to dev "naa.###################" on path "vmhbaX:C0:T1:L0" Failed: H:0xb D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:NONE

 

The /var/log/vmkernel.log file reports SCSI warnings against the paths for one or more devices, which indicate permanent device loss.

The logs may report the H:0x1 SCSI host status ("no connection"), logical unit not supported (H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0), or logical unit not accessible (H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0xa).
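To triage many such lines at once, the status fields can be extracted programmatically. The following is a minimal Python sketch, not a VMware tool: the function name parse_scsi_status and the KNOWN_SENSE table are illustrative assumptions, and the table contains only the meanings stated in this article.

```python
import re

# H:/D:/P: status triple, optionally followed by three sense bytes
# (sense key, ASC, ASCQ), as printed in vmkernel.log.
STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)"
    r"(?: (Valid|Possible) sense data:"
    r" (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+))?"
)

# Assumption: only the sense combinations named in this article are listed.
KNOWN_SENSE = {
    (0x5, 0x25, 0x0): "logical unit not supported",
    (0x2, 0x4, 0xA): "logical unit not accessible",
}


def parse_scsi_status(line):
    """Return the decoded status fields of one log line, or None if absent."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host, device, plugin = (int(match.group(i), 16) for i in (1, 2, 3))
    info = {"host": host, "device": device, "plugin": plugin,
            "sense": None, "meaning": None}
    if host == 0x1:
        info["meaning"] = "no connection"
    if match.group(4) is not None:
        info["sense"] = tuple(int(match.group(i), 16) for i in (5, 6, 7))
        # Only sense bytes flagged "Valid" are looked up; "Possible" sense
        # data is returned as-is without interpretation.
        if match.group(4) == "Valid":
            info["meaning"] = KNOWN_SENSE.get(info["sense"], info["meaning"])
    return info


if __name__ == "__main__":
    sample = "H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0"
    print(parse_scsi_status(sample))
```

Codes that are not in the table are returned as raw integers so they can be checked against the storage vendor's documentation.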

Running esxcli storage san fc list returns output similar to:

FcDevice:
  Adapter: vmhbaX
  Port ID: 000000
  Node Name: 20:##:##:24:##:18:#7:12
  Port Name: 21:##:##:24:##:18:#7:12
  Speed: 8 Gbps
  Port Type: NPort
  Port State: ONLINE
  Model Description: HPE SN1100Q 16Gb 2p FC HBA
  Hardware Version:BK3210407-20  F
  OptionROM Version: 3.68
  Firmware Version: 9.15.05 (d0d5)
  Driver Name: qlnativefc
Error getting field DriverVersion

Events similar to the following are reported in /var/log/vmkernel.log on the ESXi host and in /var/log/vmware/vpxd/vpxd.log on vCenter Server:

YYYY-MM-DDTHH:MM:SS.015Z In(182) vmkernel: cpu36:2098098)StorageFPIN: 1276: Report FC FPIN Congestion Oversubscription event (hostWWPN xxxxxxxxxxxxxxxx tgtWWPN yyyyyyyyyyyyyyyy) to vobd. 231 events have occurred since last report.

YYYY-MM-DDTHH:MM:SS.079Z In(182) vmkernel: cpu45:2098118)StorageFPIN: 1276: Report FC FPIN Congestion Credit Stall event (hostWWPN xxxxxxxxxxxxxxxx tgtWWPN yyyyyyyyyyyyyyyy) to vobd. 6 events have occurred since last report.

YYYY-MM-DDTHH:MM:SS.079Z Wa(180) vmkwarning: cpu52:[REDACTED_ID])WARNING: lpfc : vmhbaX lpfc_els_rcv_fpin_cgn:[REDACTED_ID]: [REDACTED_ID] FPIN CONGESTION WARNING Notification type Credit Stall (x2) Event Duration 10000 mSecs

YYYY-MM-DDTHH:MM:SS.079Z In(182) vmkernel: cpu52:[REDACTED_ID])StorageFPIN: [REDACTED_ID]: Report FC FPIN Congestion Credit Stall event (hostWWPN [MASKED_WWPN] tgtWWPN [MASKED_WWPN]) to vobd. 4 events have occurred since last report.

YYYY-MM-DDTHH:MM:SS.079Z In(182) vmkernel: cpu52:[REDACTED_ID])StoragePath: [REDACTED_ID]: Calling MPP NMP for link event 2 on adapter vmhbaX (hostWWPN=[MASKED_WWPN] targetWWPN=[MASKED_WWPN] targetNum = [MASKED_NUM])

YYYY-MM-DDTHH:MM:SS.079Z Wa(180) vmkwarning: cpu52:[REDACTED_ID])WARNING: NMP: nmpHandleLinkEvent:[REDACTED_ID]: Marking path vmhbaX:C0:T0:L00 flaky on link event 2 with timeoutMS = 20000 flakyMarkTC = [MASKED_ID], reEvalFlakyPathTime = 20000
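Because each StorageFPIN line carries a count of events since the last report, it can be useful to total those counts per initiator/target pair before engaging the fabric team. The following Python sketch assumes the line format shown in the samples above; the names FPIN_RE and summarize_fpin are illustrative and not part of any VMware tooling.

```python
import re
from collections import Counter

# Pattern derived from the sample StorageFPIN lines in this article.
FPIN_RE = re.compile(
    r"StorageFPIN: \S+ Report FC FPIN (?P<kind>.+?) event "
    r"\(hostWWPN (?P<host>\S+) tgtWWPN (?P<tgt>\S+)\) to vobd\. "
    r"(?P<count>\d+) events have occurred since last report"
)


def summarize_fpin(lines):
    """Sum the reported event counts per (event type, host WWPN, target WWPN)."""
    totals = Counter()
    for line in lines:
        match = FPIN_RE.search(line)
        if match:
            key = (match["kind"], match["host"], match["tgt"])
            totals[key] += int(match["count"])
    return totals


if __name__ == "__main__":
    # Assumption: the script is run where a copy of vmkernel.log is readable.
    with open("/var/log/vmkernel.log", errors="replace") as log:
        for (kind, host, tgt), total in summarize_fpin(log).most_common():
            print(f"{total:8d}  {kind}  host={host}  target={tgt}")
```

Pairs with the highest totals are a reasonable starting point when asking the fabric administrator which switch ports to inspect.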

Events similar to the following are reported in /var/run/log/vobd.log:

YYYY-MM-DDTHH:MM:SS.460Z In(14) vobd[XXXXXXX]:  [HardwareCorrelator] XXXXXXXXXXXXXus: [vob.hardware.fpin.fc.congestion.creditstall] FPIN FC credit stall congestion: Host WWPN ############### , target WWPN ###############.

YYYY-MM-DDTHH:MM:SS.460Z In(14) vobd[XXXXXXX]:  [HardwareCorrelator] XXXXXXXXXXXXXus: [esx.problem.hardware.fpin.fc.congestion.creditstall] FPIN FC credit stall congestion: Host WWPN ############### , target WWPN ###############.

YYYY-MM-DDTHH:MM:SS.461Z In(14) vobd[XXXXXXX]:  [HardwareCorrelator] XXXXXXXXXXXXXus: [vob.hardware.fpin.fc.congestion.creditstall] FPIN FC credit stall congestion: Host WWPN ############### , target WWPN ###############.

YYYY-MM-DDTHH:MM:SS.461Z In(14) vobd[XXXXXXX]:  [HardwareCorrelator] XXXXXXXXXXXXXus: [esx.problem.hardware.fpin.fc.congestion.creditstall] FPIN FC credit stall congestion: Host WWPN ############### , target WWPN ###############.
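The vobd.log entries can be tallied in a similar way. The Python sketch below assumes the line layout shown in the samples above; VOB_RE and count_vobd_fpin are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Pattern derived from the sample vobd.log lines in this article.
VOB_RE = re.compile(
    r"\[(?P<event>(?:vob|esx\.problem)\.hardware\.fpin\.[\w.]+)\]"
    r".*?Host WWPN (?P<host>\S+)\s*, target WWPN (?P<tgt>[^\s.]+)"
)


def count_vobd_fpin(lines):
    """Count FPIN entries per (event ID, host WWPN, target WWPN)."""
    counts = Counter()
    for line in lines:
        match = VOB_RE.search(line)
        if match:
            counts[(match["event"], match["host"], match["tgt"])] += 1
    return counts


if __name__ == "__main__":
    # Assumption: the script is run where a copy of vobd.log is readable.
    with open("/var/run/log/vobd.log", errors="replace") as log:
        for (event, host, tgt), n in sorted(count_vobd_fpin(log).items()):
            print(f"{n:6d}  {event}  host={host}  target={tgt}")
```

In the samples above, each notification appears under both a vob.* and an esx.problem.* event ID with the same timestamp, so the two counts should be compared rather than added together.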

Environment

VMware vSphere ESXi 8.0.x

Cause

Fabric Performance Impact Notification (FPIN) support was added in ESXi 8.0 U2 to make fabric-related issues and events easier to understand. When fabric events occur, the FPIN module also logs them to /var/log/vmkernel.log. FPIN tracks and reports the following event types:

  • Link Integrity
  • Delivery
  • Congestion
  • Peer Congestion

Resolution

When FPIN events are received, investigate the health of the switching fabric immediately. In the example above, the initiator is receiving FPINs that point to both congestion (oversubscription) and a credit stall. A credit stall is reported when the buffer-to-buffer (B2B) credits for an ISL connection are reaching zero, which typically happens when there is a slow-drain device in the fabric. Regardless of the event type, the presence of FPIN events means the fabric should be investigated for a root cause without delay; otherwise, fabric health could continue to degrade and an outage could occur.