An administrator may observe Windows VMs crashing and/or reporting disk timeout errors when CRC errors are present in the Fibre Channel fabric. Linux VMs may also crash or remount their disks in read-only mode for the same reason.
When reviewing /var/log/vmkernel.log on the affected host, there are a large number of timed-out commands for only one of the two HBAs in that host. Here is an example of some of them:
2025-03-19T21:01:31.681Z cpu28:2097532)qlnativefc: vmhba2(82:0.1): qlnativefcEhVirtualReset: aborting sp 0x45b972002880 handle 327 from RISC. serialNumber=800e0028, Command timeout=32646 sec
2025-03-19T21:01:31.881Z cpu28:2097532)qlnativefc: vmhba2(82:0.1): qlnativefcEhVirtualReset: aborting sp 0x45b971602440 handle 366 from RISC. serialNumber=800e0006, Command timeout=32646 sec
2025-03-19T21:01:32.082Z cpu28:2097532)qlnativefc: vmhba2(82:0.1): qlnativefcEhVirtualReset: aborting sp 0x45b971602240 handle 368 from RISC. serialNumber=800e000d, Command timeout=32646 sec
2025-03-19T21:01:32.282Z cpu28:2097532)qlnativefc: vmhba2(82:0.1): qlnativefcEhVirtualReset: aborting sp 0x45b972001a80 handle 388 from RISC. serialNumber=800e0046, Command timeout=32646 sec
2025-03-19T21:01:32.482Z cpu28:2097532)qlnativefc: vmhba2(82:0.1): qlnativefcEhVirtualReset: aborting sp 0x45b972001280 handle 3b2 from RISC. serialNumber=800e0054, Command timeout=32646 sec
2025-03-19T21:01:32.682Z cpu28:2097532)qlnativefc: vmhba2(82:0.1): qlnativefcEhVirtualReset: aborting sp 0x45b979000300 handle 1cd from RISC. serialNumber=800e0032, Command timeout=32600 sec
2025-03-19T21:01:32.882Z cpu28:2097532)qlnativefc: vmhba2(82:0.1): qlnativefcEhVirtualReset: aborting sp 0x45b97b000bc0 handle 8d from RISC. serialNumber=800e005e, Command timeout=32684 sec
2025-03-19T21:01:33.082Z cpu28:2097532)qlnativefc: vmhba2(82:0.1): qlnativefcEhVirtualReset: aborting sp 0x45b97be02ec0 handle 652 from RISC. serialNumber=800e0010, Command timeout=32686 sec
2025-03-19T21:01:33.282Z cpu28:2097532)qlnativefc: vmhba2(82:0.1): qlnativefcEhVirtualReset: aborting sp 0x45b97b000dc0 handle 722 from RISC. serialNumber=800e0009, Command timeout=32686 sec
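A quick way to confirm the imbalance is to count the abort/timeout messages per adapter directly from the log. This is a minimal sketch assuming the qlnativefc abort messages shown above; adjust the search string for the driver in use, and note that older entries may be in the rotated vmkernel log files:

# Count qlnativefc abort messages per HBA in the current vmkernel.log
grep qlnativefcEhVirtualReset /var/log/vmkernel.log | grep -c vmhba1
grep qlnativefcEhVirtualReset /var/log/vmkernel.log | grep -c vmhba2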
By comparison, there are zero timed-out commands for vmhba1; in fact, there are no transmit errors at all for vmhba1. From the pathing policy information, we can see that there are working paths on both HBAs, meaning a roughly equal number of commands is being issued through each HBA:
naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:
Device Display Name: DGC Fibre Channel Disk (naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
Storage Array Type: VMW_SATP_ALUA_CX
Storage Array Type Device Config: {navireg=on, ipfilter=on} {implicit_support=on; explicit_support=on; explicit_allow=on; alua_followover=on; action_OnRetryErrors=on; {TPG_id=2,TPG_state=AO}{TPG_id=1,TPG_state=ANO}}
Path Selection Policy: VMW_PSP_RR
Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0; lastPathIndex=3: NumIOsPending=0,numBytesPending=0}
Path Selection Policy Device Custom Config:
Working Paths: vmhba2:C0:T0:L0, vmhba1:C0:T0:L0
Is USB: false
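In addition to the log review, per-adapter Fibre Channel link statistics can help confirm which adapter is degrading. As a sketch, assuming an ESXi release that exposes the esxcli FC namespace, the link error counters (loss of sync, link failures, invalid CRC, and similar; exact field names vary by release and driver) can be compared between the two adapters:

# List FC adapters, then compare link error counters between them
esxcli storage san fc list
esxcli storage san fc stats get -A vmhba1
esxcli storage san fc stats get -A vmhba2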
When IO is evenly distributed across all HBAs yet only one of those HBAs is accumulating transmit errors over time, this points to a Layer 1 (physical) issue somewhere along that HBA's path to the array and back. It is also crucial to review the fabric switch ports for transmit errors (CRC errors, encoding errors, etc.) and then remediate the issue at the physical layer, which typically means reseating or replacing fibre cables, SFPs, and so on.
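On the switch side, the error counters for the ports connected to the affected HBA and the array should be reviewed (and cleared after remediation so that any recurrence is obvious). The commands are vendor-specific; the examples below assume Brocade FOS and Cisco MDS NX-OS respectively, with placeholder port numbers:

# Brocade FOS: per-port error counters (crc_err, enc_out, etc.) and SFP diagnostics
porterrshow
sfpshow <slot/port>
# Cisco MDS NX-OS: interface error counters
show interface fc1/1 counters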