Alert: "Lost access to volume due to connectivity issues" reported on iSCSI-connected datastores

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

Multiple hosts experience intermittent storage connectivity interruptions.
The following error is reported in the host event logs:

"Lost access to volume XXX due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly"
Check whether the iSCSI server IP configured for the impacted cluster has been accidentally reused in a new cluster or assigned to a new ESXi host NIC:
- In vSphere Client
- Go to Storage Adapters
- Select the iSCSI adapter
- Check the Dynamic/Static Discovery addresses.
- Confirm that the iSCSI server IP is consistent across all hosts in the cluster.
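
The consistency check in the last step can be sketched in Python once the discovery addresses have been collected from each host. This is a minimal sketch, not a VMware tool: the function name `find_inconsistent_hosts`, the host names, and the addresses are all hypothetical, and the input is assumed to be gathered by hand from the Dynamic/Static Discovery tabs.

```python
from collections import Counter


def find_inconsistent_hosts(discovery_by_host):
    """Return hosts whose discovery addresses differ from the most common set."""
    # Normalise each host's addresses into a hashable, order-independent form.
    normalised = {host: frozenset(addrs) for host, addrs in discovery_by_host.items()}
    if not normalised:
        return {}
    # Assumption: the address set shared by the most hosts is the cluster baseline.
    baseline, _ = Counter(normalised.values()).most_common(1)[0]
    return {
        host: {"missing": sorted(baseline - addrs), "extra": sorted(addrs - baseline)}
        for host, addrs in normalised.items()
        if addrs != baseline
    }


# Hypothetical inventory (documentation-range addresses, not real targets).
cluster = {
    "esx01": ["192.0.2.10:3260", "192.0.2.11:3260"],
    "esx02": ["192.0.2.10:3260", "192.0.2.11:3260"],
    "esx03": ["192.0.2.10:3260", "192.0.2.99:3260"],
}
print(find_inconsistent_hosts(cluster))
```

A host listed in the result points at an address that most of the cluster does not use, which is worth comparing against the IP assignments of other clusters and host NICs.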

Environment

VMware ESXi 7.x

VMware ESXi 8.x

Cause

The storage disconnection occurs because the iSCSI initiators do not receive a response from the storage target for several seconds. When no response arrives within the timeout period, the connection is marked offline and then recovers automatically.

/var/log/syslog.log displays the following events:

YYYY-MM-DDT##:##:## iscsid[2099254]: connection 7:0 (iqn.2010-xx.com.xxxxstorage:flasharray.3d7e8667f27df993 if=iscsi_vmk@vmk2 addr=100.104.XX.XX:3260 (TPGT:1 ISID:0x7)  (T0 C6)) Nop-out timeout after 10 sec in state (3).
YYYY-MM-DDT##:##:## iscsid[2099254]: Notice: Setting NODELACK for target=iqn.2010-06.com.purestorage:flasharray.3d7e8667f27df993 (host=100.104.XX.XX)
YYYY-MM-DDT##:##:## iscsid[2099254]: Notice: Setting NODELACK for target=iqn.2010-06.com.purestorage:flasharray.3d7e8667f27df993 (host=100.104.XX.XX)
YYYY-MM-DDT##:##:## iscsid[2099254]: connection 7:0 (iqn.2010-06.com.purestorage:flasharray.3d7e8667f27df993 if=iscsi_vmk@vmk2 addr=100.104.XX.XX:3260 (TPGT:1 ISID:0x7)  (T0 C6)) has recovered (3 attempts)
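
To see how often each connection times out and recovers, the iscsid events above can be tallied with a short Python script. This is a rough sketch: the regular expressions are modelled only on the two message shapes shown in this article, and the function name `summarize_iscsid` is hypothetical.

```python
import re

# Patterns derived from the sample messages above; other builds may log differently.
NOP_TIMEOUT = re.compile(
    r"connection (\d+:\d+) .*addr=(\S+) .*Nop-out timeout after (\d+) sec"
)
RECOVERED = re.compile(
    r"connection (\d+:\d+) .*addr=(\S+) .*has recovered \((\d+) attempts?\)"
)


def summarize_iscsid(lines):
    """Count Nop-out timeouts and recoveries per (connection, target address)."""
    summary = {}
    for line in lines:
        for kind, pattern in (("timeouts", NOP_TIMEOUT), ("recoveries", RECOVERED)):
            match = pattern.search(line)
            if match:
                key = (match.group(1), match.group(2))
                entry = summary.setdefault(key, {"timeouts": 0, "recoveries": 0})
                entry[kind] += 1
    return summary


# Shortened copies of the sample lines above.
sample = [
    "iscsid[2099254]: connection 7:0 (iqn.x if=iscsi_vmk@vmk2 addr=100.104.XX.XX:3260 "
    "(TPGT:1 ISID:0x7)  (T0 C6)) Nop-out timeout after 10 sec in state (3).",
    "iscsid[2099254]: connection 7:0 (iqn.x if=iscsi_vmk@vmk2 addr=100.104.XX.XX:3260 "
    "(TPGT:1 ISID:0x7)  (T0 C6)) has recovered (3 attempts)",
]
print(summarize_iscsid(sample))
```

A connection with many timeouts against a single target address suggests that the problem is specific to one path rather than the whole array.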

In the vmkernel logs, it can be seen that the iSCSI connections are frequently being marked as "OFFLINE" and "ONLINE".

YYYY-MM-DDT##:##:##.#### cpu67:2099071)WARNING: iscsi_vmk: iscsivmk_StopConnection:738: vmhba64:CH:2 T:0 CN:0: iSCSI connection is being marked "OFFLINE"
YYYY-MM-DDT##:##:##.#### cpu69:2099071)WARNING: iscsi_vmk: iscsivmk_StartConnection:919: vmhba64:CH:2 T:0 CN:0: iSCSI connection is being marked "ONLINE"
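
The OFFLINE/ONLINE transitions can be counted per adapter channel to show which connections flap most. The following Python sketch assumes the message format shown in the two lines above; the function name `count_flaps` is hypothetical.

```python
import re
from collections import defaultdict

# Pattern modelled on the vmkernel sample lines above.
STATE = re.compile(
    r"iscsi_vmk: iscsivmk_(?:Stop|Start)Connection:\d+: "
    r'(vmhba\d+:CH:\d+ T:\d+ CN:\d+): iSCSI connection is being marked "(OFFLINE|ONLINE)"'
)


def count_flaps(lines):
    """Count OFFLINE and ONLINE transitions for each iSCSI connection."""
    counts = defaultdict(lambda: {"OFFLINE": 0, "ONLINE": 0})
    for line in lines:
        match = STATE.search(line)
        if match:
            counts[match.group(1)][match.group(2)] += 1
    return dict(counts)


# Copies of the sample lines above, without the timestamp prefix.
sample = [
    'cpu67:2099071)WARNING: iscsi_vmk: iscsivmk_StopConnection:738: '
    'vmhba64:CH:2 T:0 CN:0: iSCSI connection is being marked "OFFLINE"',
    'cpu69:2099071)WARNING: iscsi_vmk: iscsivmk_StartConnection:919: '
    'vmhba64:CH:2 T:0 CN:0: iSCSI connection is being marked "ONLINE"',
]
print(count_flaps(sample))
```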

This indicates a connectivity issue between the ESXi host and the iSCSI network. It is confirmed by the host status code "H:0x1" in the vmkernel logs:

YYYY-MM-DDT##:##:##.####  cpu30:2098300)NMP: nmp_ThrottleLogForDevice:3867: Cmd 0x12 (0x45baad9d0b88, 0) to dev "naa.##" on path "vmhba64:C2:T0:L0" Failed:
YYYY-MM-DDT##:##:##.####  cpu30:2098300)NMP: nmp_ThrottleLogForDevice:3875: H:0x1 D:0x0 P:0x0 .

Host Status [0x1]: NO_CONNECT

This status is returned if the connection to the LUN is lost. This can occur if the LUN is no longer visible to the host from the array side or if the physical connection to the array has been removed.
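
The H/D/P triplet in the NMP log line can be decoded with a few lines of Python. This is an illustrative sketch: only the 0x1 (NO_CONNECT) entry comes from this article, the other names in the lookup table are assumptions based on the commonly documented SCSI host status values and should be verified against the official reference, and `decode_status` is a hypothetical helper name.

```python
import re

# 0x1 is taken from this article; the remaining entries are assumed values.
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
    0x5: "ABORT",
    0x8: "RESET",
}

STATUS = re.compile(r"H:0x([0-9a-fA-F]+) D:0x([0-9a-fA-F]+) P:0x([0-9a-fA-F]+)")


def decode_status(line):
    """Extract host, device, and plugin status codes from an NMP log line."""
    match = STATUS.search(line)
    if match is None:
        return None
    host, device, plugin = (int(group, 16) for group in match.groups())
    return {
        "host": host,
        "host_name": HOST_STATUS.get(host, "UNKNOWN"),
        "device": device,
        "plugin": plugin,
    }


print(decode_status("NMP: nmp_ThrottleLogForDevice:3875: H:0x1 D:0x0 P:0x0 ."))
```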

Because the iSCSI LUNs lose connectivity, the associated datastores raise "Lost access to volume due to connectivity issues" alerts. This can also be seen in the vobd logs:

YYYY-MM-DDT##:##:##.#### : [scsiCorrelator] : [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.##. Path vmhba64:C4:T0:L0 is down. Affected datastores: "###".
YYYY-MM-DDT##:##:##.#### : [scsiCorrelator] : [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C4:T0:L1 changed state from on

Resolution

To resolve this issue:

Verify the storage array configuration.
Verify the host networking configuration.
- To fix a VMkernel networking misconfiguration:
  1. When using multiple VMkernel ports for software iSCSI, ensure that the number of VMkernel ports is less than or equal to the number of physical network interfaces.
  2. Check the MTU size across the environment and make it uniform end to end (either standard or jumbo frames, not a mix).
  3. Follow the Best Practices for Configuring Networking with Software iSCSI.
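
The first two checks above can be expressed as a small Python helper once the values have been gathered from the host and the network. This is only a sketch under stated assumptions: the function name `check_iscsi_networking`, the argument layout, and all interface names and MTU values are hypothetical, and the data is assumed to be collected manually.

```python
def check_iscsi_networking(vmk_mtus, uplinks, path_mtus):
    """Return a list of findings for the iSCSI VMkernel and MTU checks.

    vmk_mtus:  mapping of iSCSI VMkernel port name -> MTU
    uplinks:   list of physical NIC names used for iSCSI
    path_mtus: mapping of other path components (vSwitch, switch, array) -> MTU
    """
    findings = []
    # Check 1: VMkernel ports must not outnumber physical NICs.
    if len(vmk_mtus) > len(uplinks):
        findings.append(
            f"{len(vmk_mtus)} VMkernel ports but only {len(uplinks)} physical NICs"
        )
    # Check 2: every component on the path must use the same MTU.
    all_mtus = {**vmk_mtus, **path_mtus}
    if len(set(all_mtus.values())) > 1:
        detail = ", ".join(f"{name}={mtu}" for name, mtu in sorted(all_mtus.items()))
        findings.append(f"MTU mismatch: {detail}")
    return findings


# Hypothetical example: three VMkernel ports, two uplinks, one vSwitch left at 1500.
for finding in check_iscsi_networking(
    {"vmk1": 9000, "vmk2": 9000, "vmk3": 9000},
    ["vmnic2", "vmnic3"],
    {"vSwitch1": 1500},
):
    print(finding)
```

An empty result means both checks passed for the values supplied; it does not validate the physical switches or the array unless their MTUs are included in `path_mtus`.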

If the issue persists, collect a TCP dump while the issue is occurring and share it with Broadcom Support for further investigation.

Additional Information

Handling Transient APD Conditions

iSCSI connection flapping between 'offline' and 'online'

Lost access to volume due to connectivity issues OR Path redundancy to storage device degraded