Intermittent Host CPU spikes when using Veeam Volume Backups with Nimble Storage Array

Article ID: 314348

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
Host CPU intermittently spikes to 100% because of leftover Nimble snapshots. Automatic host compliance checks and log exports may also trigger the CPU spike in this scenario. Log entries similar to the following appear in these cases:
**vmkernel.log**
2023-09-28T21:19:23.090Z cpu18:2098344)WARNING: NMP: nmp_PathDetermineFailure:3527: Cmd (0x9e) PDL error (0x5/0x25/0x0) - path vmhba1:C0:T##16:L## device eui.######################## - triggering path failover
2023-09-28T21:19:23.090Z cpu18:2098344)WARNING: NMP: nmpCompleteRetryForPath:394: Logical device "eui.########################": awaiting fast path state update before retrying failed command again...
2023-09-28T21:19:24.085Z cpu3:9757009)WARNING: VMW_SATP_ALUA: satp_alua_issueInquiry:83: Target completed the Inquiry VPD Page 0x83 request with good status but returned junk data for path vmhba#:C0:T##:L##
2023-09-28T21:19:24.085Z cpu3:9757009)WARNING: VMW_SATP_ALUA: satp_alua_getTargetPortInfo:160: Could not get page 83 INQUIRY data for path "vmhba1:C0:T##:L##" - Failure (195887105) 
2023-09-28T21:19:24.085Z cpu3:9757009)WARNING: VMW_SATP_ALUA: satp_alua_issueInquiry:83: Target completed the Inquiry VPD Page 0x83 request with good status but returned junk data for path vmhba1:C0:T##:L##
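
To check how often these warnings occur and which devices are involved, a copy of vmkernel.log can be scanned offline. The following is a minimal Python sketch, not a VMware tool: the function name `summarize_vmkernel` and both regular expressions are illustrative assumptions derived only from the sample lines above, so they may need adjusting for other ESXi builds.

```python
import re
from collections import Counter

# Patterns modeled on the sample vmkernel.log lines in this article (assumption:
# other ESXi versions may word these messages differently).
# Sense data 0x5/0x25/0x0 is ILLEGAL REQUEST / LOGICAL UNIT NOT SUPPORTED.
PDL_RE = re.compile(
    r"nmp_PathDetermineFailure:\d+: Cmd \((0x[0-9a-fA-F]+)\) PDL error "
    r"\((0x[0-9a-fA-F]+/0x[0-9a-fA-F]+/0x[0-9a-fA-F]+)\) - path (\S+) device (\S+)"
)
JUNK_RE = re.compile(
    r"satp_alua_issueInquiry:\d+: .*returned junk data for path (\S+)"
)


def summarize_vmkernel(lines):
    """Count PDL warnings per device and junk page 0x83 replies per path."""
    pdl_by_device = Counter()
    junk_by_path = Counter()
    for line in lines:
        match = PDL_RE.search(line)
        if match:
            pdl_by_device[match.group(4)] += 1
            continue
        match = JUNK_RE.search(line)
        if match:
            junk_by_path[match.group(1)] += 1
    return pdl_by_device, junk_by_path
```

The function accepts any iterable of lines, for example an open file handle for a vmkernel.log copied from the host or extracted from a support bundle. Devices with repeated PDL warnings around the Veeam backup window are candidates for the leftover-snapshot scenario described under Cause.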

Cause

This issue can be caused by leftover Nimble snapshots. The process is similar to the following:
"Veeam is asking the array to take a snapshot and online it, so it can do its backup. During this time, if the ESXi host does a re-scan, it will see the online snapshot as a device. Then, when Veeam asks the array to delete the snapshot and the array does, VMware will lose access to the device and report a PDL. To keep this from happening, for the volumes that Veeam is backing up, you must adjust the ACL for the hosts to 'Volume Only' instead of 'Volume and Snapshot.' This way the host will not see the online snapshots."

The ESXi host, believing the device is still available, retries all SCSI commands indefinitely. This affects management agents such as hostd and vpxa, whose commands receive no response until the device is accessible again. As a result, the ESXi host can show as not responding in vCenter Server, or suffer performance issues as tasks queue up. Host CPU spikes are also observed in these scenarios. hostd in particular is susceptible to storage-related and networking-related issues.

Resolution

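Engage the storage array vendor for resolution, for example to review the host ACL setting ("Volume Only" instead of "Volume and Snapshot") on the volumes that Veeam backs up.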
Links for reference:
https://helpcenter.veeam.com/docs/backup/vsphere/nimble_add_name.html?ver=120
https://infosight.hpe.com/InfoSight/media/cms/active/public/pubs_NimbleOS_5.0.x_Help.whz/mgd1501525212807.html