Red Hat OCP Heartbeat Disconnections due to ESXi Host "Loss of Signal" on Fibre Channel HBAs
search cancel

Red Hat OCP Heartbeat Disconnections due to ESXi Host "Loss of Signal" on Fibre Channel HBAs

book

Article ID: 436770

calendar_today

Updated On:

Products

VMware vSphere ESX 8.x

Issue/Introduction

Customers using Red Hat OpenShift (OCP) High Availability architectures on ESXi virtual machines may report the following symptoms:

  • Guest OS logs show heartbeat issue between cluster nodes.
  • The underlying ESXi hosts show multiple "Lost access to volume... due to connectivity issues" messages:

YYYY-MM-DDTHH:MM:SS In(166) Hostd[2101374]: [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 226553 : Lost access to volume xxx-xxx-xxx-xxx (xxx) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
YYYY-MM-DDTHH:MM:SS In(166) Hostd[2102729]: [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 230491 : Lost access to volume yyy-yyy-yyy-yyy (yyy) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
YYYY-MM-DDTHH:MM:SS In(166) Hostd[2101126]: [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 230493 : Lost access to volume zzz-zzz-zzz-zzz (zzz) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

  • The vmkernel.log of underlying ESXi hosts show HBA card aborts the I/O (SCSI code H:0x8):

YYYY-MM-DDTHH:MM:SS  In(182) vmkernel: cpu176:2099434)ScsiDeviceIO: 4605: Cmd(0x45eb31ff6280) 0x2a, cmdId.initiator=xxxxxx CmdSN 0x5b from world xxxxxx to dev "naa.xxxxxx" failed H:0x8 D:0x0 P:0x0 Cancelled from driver layer

  • The HBA card of underlying ESXi hosts show signal loss:

  Adapter  Tx          Rx          Lip    Error   Dumped  Link           Loss of       PrimSeq Protocol  Invalid Tx   Invalid    Input     Output    Control
  ↓        Frames      Frames      Count  Frames  Frames  Failure Count  Signal Count  Err Count         Word Count   CRC Count  Requests  Requests  Requests
  -------  ------      ------      -----  ------  ------  -------------  ------------  ----------------  -----------  ---------  --------  --------  --------
  vmhbaxxx xxxxxxx       xxxxxxxxxxx  0      0       0       2              2             0                 xx          0          0         0         0

Environment

VMware vSphere ESXi 8.0

Cause

The issue is caused by a physical layer failure in the Fibre Channel fabric (fiber cable, SFP module, or physical switch port). This results in an intermittent loss of optical signal to the Host Bus Adapters (HBAs), leading the ESXi host to lose access to storage LUNs. 

Log analysis of the affected HBAs (e.g., vmhbaxxx) confirms physical link issues:

  • Link Failure Count: Indicates the physical FC link went down and back up.
  • Loss of Signal Count: Indicates the HBA detected a loss of optical signal from the fabric

Resolution

Inspect Physical Layer:

  • Inspect and/or replace the fiber optic cables connected to the reported vmhba ports.
  • Reseat or replace the SFP modules on both the ESXi host and the SAN switch.
  • Inspect SAN Switch

Additional Information

HBA statistics

https://knowledge.broadcom.com/external/article/413021/esxi-hosts-appear-offline-in-the-san-swi.html