Hosts cannot read/write to their Virtual Volume (vVols) datastores - READ command failed
search cancel

Hosts cannot read/write to their Virtual Volume (vVols) datastores - READ command failed

book

Article ID: 318011

calendar_today

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:


1. A subset of ESXi hosts in a cluster cannot read/write to their presented Virtual Volume (vVols) datastores. 

2. Protocol endpoints are still accessible from the host 

3. Attempts to create a new test virtual machine fails 

4. Rebooting the host does not help 

5. You are unable to browse config vVols 

6. A network interruption in the environment can lead to this problem  

7. Host displays 'No such device or address' upon listing vVols from vVol directory  

[root@esx-02:/vmfs/volumes/vvol:xxxxxxxxxxx] ls
ls: ./naa.xxxxxxxxxxx: No such device or address
ls: ./jump-2: No such device or address
ls: ./naa.xxxxxxxxxxx: No such device or address
ls: ./mc-ctf-kerbergross-dc-template-bak: No such device or address
ls: ./naa.xxxxxxxxxxx: No such device or address
ls: ./all-your-ports-are-belong-to-us-grimm: No such device or address
ls: ./naa.xxxxxxxxxxx: No such device or address
ls: ./GRIMM-meerkat: No such device or address
ls: ./naa.xxxxxxxxxxx: No such device or address
ls: ./Jenkins-Kubernetes: No such device or address
ls: ./naa.xxxxxxxxxxx: No such device or address
ls: ./jump-22.data-strategy.range: No such device or address
ls: ./naa.xxxxxxxxxxx: No such device or address
ls: ./media-uploader: No such device or address

Log excerpts from a host facing this problem : 

VMkernel.log -

[YYYY-MM-DDTHH:MM:SS] cpu25:2168990)osfs: OSFS_VmklinkLookup:508: Failure (p [vvol], c xxxxxxxxxxx-0000000000000006)
[YYYY-MM-DDTHH:MM:SS] cpu25:2168990)osfs: OSFS_Lookup:2779: Lookup error: file = naa.xxxxxxxxxxx, status = No filesystem on the device
[YYYY-MM-DDTHH:MM:SS] cpu40:2168991)VVol: ScsiExtentRefreshCapacity:959: Refresh capacity for VVOL(0xe2000026d56f0000): 8388608 -> 8388608
[YYYY-MM-DDTHH:MM:SS] cpu34:2097338)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu34:2097338)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu34:2097338)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu34:2097338)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu40:2168991)osfs: OSFS_MountChild:4216: Failed to probe OSFS child for device name 'naa.xxxxxxxxxxx': No filesystem on the device
[YYYY-MM-DDTHH:MM:SS] cpu25:2168990)osfs: DebugDumpVmklinkResponse:814: {ID: 17d; type:LOOKUP; pid:[vvol]; cid:xxxxxxxxxxx-0000000000000006; aid:xxxxxxxxxxx-0000000000000006; status:No filesystem on the device; bufLen:0
[YYYY-MM-DDTHH:MM:SS] cpu25:2168990)osfs: OSFS_VmklinkLookup:508: Failure (p [  vvol], c xxxxxxxxxxx-0000000000000006)
[YYYY-MM-DDTHH:MM:SS] cpu25:2168990)osfs: OSFS_Lookup:2779: Lookup error: file = docs.cyber.range, status = No filesystem on the device
[YYYY-MM-DDTHH:MM:SS] cpu40:2168991)VVol: VVolAllocAndInitDevice:8133: Creating a device for naa.xxxxxxxxxxx (Type VVOL_OBJTYPE_CONFIG)
[YYYY-MM-DDTHH:MM:SS] cpu34:2097338)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu34:2097338)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu34:2097338)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu34:2097338)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu40:2168991)osfs: OSFS_MountChild:4216: Failed to probe OSFS child for device name 'naa.xxxxxxxxxxx': No filesystem on the device
[YYYY-MM-DDTHH:MM:SS] cpu25:2168990)osfs: DebugDumpVmklinkResponse:814: {ID: 17e; type:LOOKUP; pid:[vvol]; cid:xxxxxxxxxxx-0000000000000006; aid:xxxxxxxxxxx-0000000000000006; status:No filesystem on the device; bufLen:0

vvold.log -


[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] ObjectStoreFileSystemImpl::UnmountNamespace: BEGIN (opIDS = )
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] OSFSIpcCall: Sending IPC request with opcode 3
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] OSFSIpcCall: Receiving IPC response
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] OSFSIpcCall: IPC finished for opID . OpCode: 195887107 Response: Not found
[YYYY-MM-DDTHH:MM:SS] error vvold[2100395] [Originator@6876 sub=Libs] OSFSIpcCall: IPC failed for opID . OpCode: 195887107 Error: IPC returned error : Not found
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] OSFSIpcCall: END (opIDS = ) IPC returned error : Not found
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] ObjectStoreFileSystemImpl::UnmountNamespace: END (opIDS = ) UmountNamespace IPC failed for name () uuid (naa.xxxxxxxxxxx) vimOpID ()
[YYYY-MM-DDTHH:MM:SS] error vvold[2100395] [Originator@6876 sub=Default] VVolDevice::UnmountDevice, failed to unmount - could not find device (xxxxxxxxxxx-0000-000000000006/naa.xxxxxxxxxxx)
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Default] VVolDevice:PrepareDeviceForBatchUnbind Preparing device for unbind naa.xxxxxxxxxxx
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] GetTypedFileSystems: fstype vvol
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] GetTypedFileSystems: uuid vvol:xxxxxxxxxxx-0000000000000006
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] GetTypedFileSystems: uuid vvol:xxxxxxxxxxx-000000000000000f
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] GetTypedFileSystems: uuid vvol:xxxxxxxxxxx-0000000000000011
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] GetTypedFileSystems: uuid vvol:xxxxxxxxxxx-0000000000000010
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] 2100395:VVOLLIB : VVolLib_UserLockVVol:10215: Successfully acquired user lock on VVol naa.xxxxxxxxxxx (curTime:5552 startTime:5552 timeoutSec:5

[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] ObjectStoreFileSystemImpl::UnmountNamespace: BEGIN (opIDS = )
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] OSFSIpcCall: Sending IPC request with opcode 3
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] OSFSIpcCall: Receiving IPC response
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] OSFSIpcCall: IPC finished for opID . OpCode: 195887107 Response: Not found
[YYYY-MM-DDTHH:MM:SS] error vvold[2100395] [Originator@6876 sub=Libs] OSFSIpcCall: IPC failed for opID . OpCode: 195887107 Error: IPC returned error : Not found
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] OSFSIpcCall: END (opIDS = ) IPC returned error : Not found
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] ObjectStoreFileSystemImpl::UnmountNamespace: END (opIDS = ) UmountNamespace IPC failed for name () uuid (naa.xxxxxxxxxxx) vimOpID ()
[YYYY-MM-DDTHH:MM:SS] error vvold[2100395] [Originator@6876 sub=Default] VVolDevice::UnmountDevice, failed to unmount - could not find device (6000d310-0581-ac02-0000-000000000006/naa.xxxxxxxxxxx)
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Default] VVolDevice:PrepareDeviceForBatchUnbind Preparing device for unbind naa.xxxxxxxxxxx
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] GetTypedFileSystems: fstype vvol
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] GetTypedFileSystems: uuid vvol:xxxxxxxxxxx-0000000000000006
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] GetTypedFileSystems: uuid vvol:xxxxxxxxxxx-000000000000000f
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] GetTypedFileSystems: uuid vvol:xxxxxxxxxxx-0000000000000011
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] GetTypedFileSystems: uuid vvol:xxxxxxxxxxx-0000000000000010
[YYYY-MM-DDTHH:MM:SS] info vvold[2100395] [Originator@6876 sub=Libs] 2100395:VVOLLIB : VVolLib_UserLockVVol:10215: Successfully acquired user lock on VVol naa.xxxxxxxxxxx (curTime:5552 startTime:5552 timeoutSec:5

[YYYY-MM-DDTHH:MM:SS] cpu3:2097298)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu3:2097298)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu3:2097298)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu3:2097298)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu3:2097298)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu3:2097298)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu3:2097298)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx] READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)
[YYYY-MM-DDTHH:MM:SS] cpu33:2097338)VVol: 2421: [vvolDev: naa.xxxxxxxxxxx]READ command (ioFlags 0x200128, cmd 0) failed: Not supported (0xbad0020)

[YYYY-MM-DDTHH:MM:SS] cpu66:2129727)VVol: 2298: [vvolDev: naa.xxxxxxxxxxx] Failed to issue command 0x26 (Address is not aligned on required boundary)
[YYYY-MM-DDTHH:MM:SS] cpu66:2129727)VVol: 2298: [vvolDev: naa.xxxxxxxxxxx] Failed to issue command 0x26 (Address is not aligned on required boundary)
[YYYY-MM-DDTHH:MM:SS] cpu66:2129727)VVol: 2298: [vvolDev: naa.xxxxxxxxxxx] Failed to issue command 0x26 (Address is not aligned on required boundary)
[YYYY-MM-DDTHH:MM:SS] cpu66:2129727)VVol: 2298: [vvolDev: naa.xxxxxxxxxxx] Failed to issue command 0x26 (Address is not aligned on required boundary)
[YYYY-MM-DDTHH:MM:SS] cpu66:2129727)VVol: 2298: [vvolDev: naa.xxxxxxxxxxx] Failed to issue command 0x26 (Address is not aligned on required boundary)

[YYYY-MM-DDTHH:MM:SS] cpu92:15182124)osfs: OSFS_MountChild:4216: Failed to probe OSFS child for device name 'naa.xxxxxxxxxxx': No filesystem on the device
[YYYY-MM-DDTHH:MM:SS] cpu49:15182103)osfs: OSFS_MountChild:4216: Failed to probe OSFS child for device name 'naa.xxxxxxxxxxx': No filesystem on the device
[YYYY-MM-DDTHH:MM:SS] cpu92:15182124)osfs: OSFS_MountChild:4216: Failed to probe OSFS child for device name 'naa.xxxxxxxxxxx': No filesystem on the device
[YYYY-MM-DDTHH:MM:SS] cpu49:15182103)osfs: OSFS_MountChild:4216: Failed to probe OSFS child for device name 'naa.xxxxxxxxxxx': No filesystem on the device

[YYYY-MM-DDTHH:MM:SS] warning vvold[2100375] [Originator@6876 sub=Default] VasaSession::ConvertSoapError: [SESSION HEALTH] Connection-local error with VP SC5020-360876 [Connected -> TransportError], session appears healthy (last successful command 30164 msecs ago)
[YYYY-MM-DDTHH:MM:SS] error vvold[2100375] [Originator@6876 sub=Default] VasaOp[#30] ===> FINAL FAILURE getEvents, error (TRANSPORT_FAULT / UNKNOWN FAULT) VP (SC5020-360876) Container (SC5020-360876) timeElapsed=1 msecs (#outstanding 0)
[YYYY-MM-DDTHH:MM:SS] error vvold[2100375] [Originator@6876 sub=Default] VasaSession::EventPollerCB VP SC5020-360876: getEvents failed (TRANSPORT_FAULT, UNKNOWN FAULT) [session state: Connected]
[YYYY-MM-DDTHH:MM:SS] error vvold[2100377] [Originator@6876 sub=Default] VasaOp[#31] ===> FINAL FAILURE getEvents, error (INVALID_SESSION / Vasa not initialized. / ) VP (SC5020-360876) Container (SC5020-360876) timeElapsed=21717 msecs (#outstanding 0)
[YYYY-MM-DDTHH:MM:SS] error vvold[2100377] [Originator@6876 sub=Default] VasaSession::EventPollerCB VP SC5020-360876: getEvents failed (INVALID_SESSION, Vasa not initialized. / ) [session state: OutOfSync]
[YYYY-MM-DDTHH:MM:SS] error vvold[2100151] [Originator@6876 sub=Default] VasaSession::DoSetContext: setContext for VP SC5020-360876 (url: https://172.xx.xx.xxx:3034/vasa-provider/vasa2/vasaService) failed [connectionState: OutOfSync]: INVALID_SESSION (Vasa not initialized. / )
[YYYY-MM-DDTHH:MM:SS] info vvold[2100151] [Originator@6876 sub=Default] VasaOp::EventPollerCB [#32]: ===> Issuing 'getEvents' to VP [SC5020-360876:TransportError (Outstanding 0/8)]
[YYYY-MM-DDTHH:MM:SS] error vvold[2100151] [Originator@6876 sub=Default] VasaOp[#32] ===> FINAL FAILURE getEvents, error (INVALID_SESSION / Bad session state (TransportError)) VP (SC5020-360876) Container (SC5020-360876) timeElapsed=242 msecs (#outstanding 0)
[YYYY-MM-DDTHH:MM:SS] error vvold[2100151] [Originator@6876 sub=Default] VasaSession::EventPollerCB VP SC5020-360876: getEvents failed (INVALID_SESSION, Bad session state (TransportError)) [session state: TransportError]
[YYYY-MM-DDTHH:MM:SS] info vvold[2100373] [Originator@6876 sub=Default] VasaSession::InitSoap: https://172.xx.xx.xxx:3034/vasa-provider/vasa2/vasaService session state (TransportError) - recreating soap

Environment

VMware vSphere ESXi 7.0

Cause


An optimization (SCSIBlockUnsupportedOpcodesAndPages) to VMware’s Pluggable Storage Architecture (PSA) was introduced in ESXi 7.0 U3 which is designed to prevent issuing commands to storage devices after receiving a SCSI Command failure with an Invalid Operation Code.

To add more context to the statement above, ESX will not issue commands failed with "0x5 0x20" after the first failure as target will anyway fail the request with same error. This bitmap is maintained at device level, in case of vVol configuration this will be at PE device level. Since the reads failed, we mark this as unsupported command and start failing further READ requests from PSA itself.

For the impacted vVol Datastores, we can see that Protocol Endpoint (PE) Devices previously failed SCSI Commands with error code “0x5 0x20” which translates to ILLEGAL REQUEST / INVALID COMMAND OPERATION CODE.

Due to the newer SCSIBlockUnsupportedOpcodesAndPages optimization, these devices are flagged with an Unsupported status which is then cached persistently which means that all future SCSI command attempts will fail, thus resulting in the vVol Datastore Read/Write failures we observed. 

 

Resolution

This issue is resolved in vSphere ESXi 7.0 U3f (build number 20036589).

Apply the following workaround below to disable optimization – 

1. SSH to all the ESXi hosts where the vVol PE is presented.

2. Validate whether the parameter is set to 0 : esxcfg-advcfg -g /Scsi/SCSIBlockUnsupportedOpcodesAndPages  

3. Run the command to disable the SCSIBlockUnsupportedOpcodesAndPages feature - esxcfg-advcfg -s 0 /Scsi/SCSIBlockUnsupportedOpcodesAndPages

4. Get a list of all vVol Protocol Endpoint Devices by running the following command - 

# esxcli storage core device list | grep 'Is VVOL PE\|  Display Name'|grep true -B1

5. For each of the devices identified in Step # 4, run the following VSISH command to clear the cached unsupported status, replacing the "Device Name" to reflect each individual Protocol Endpoint -

# vsish -e set /storage/scsifw/devices/<DEVICE NAME>/resetOpcodeAndPageBitmaps 1

Example - 

# esxcfg-advcfg -s 0 /Scsi/SCSIBlockUnsupportedOpcodesAndPages

# vsish -e set /storage/scsifw/devices/naa.600508b1001cb7cd0fed8eec2d95ccf6/resetOpcodeAndPageBitmaps 1