SAP HANA VM reporting soft CPU lockups and TX hangs during a vMotion
Article ID: 412869
Products
VMware vSphere ESXi
Issue/Introduction
An SAP HANA VM hung or lost network connectivity during a live compute vMotion. The vMotion took place while the VM was in production and under load.
In the guest OS logs you may see reports of a network interface resetting or vmxnet3 TX hangs, similar to the following examples:
vmxnet3 0000:13:00.0 eth1: intr type 3, mode 0, 11 vectors allocated
vmxnet3 0000:13:00.0 eth1: NIC Link is Up 10000 Mbps
vmxnet3 0000:13:00.0 eth1: resetting
...........
kernel: vmxnet3 0000:13:00.0 eth1: tx hang
The same guest OS logs also report that the kernel was in a CPU 'soft lockup' state, which would explain the TX hangs:
watchdog: BUG: soft lockup - CPU#146 stuck for 21s! [kworker/146:2:3384941]
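To check a guest log for the messages quoted above, a small scanner along the following lines may help. This is a minimal sketch: the function name `scan_guest_log` and the regular expressions are illustrative assumptions modelled only on the example lines in this article, and other kernel or driver versions may word these messages differently.

```python
import re

# Patterns modelled on the example messages in this article (assumption:
# other kernel/driver versions may format these lines differently).
SOFT_LOCKUP = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s")
TX_HANG = re.compile(r"vmxnet3 \S+ (\S+): tx hang")
NIC_RESET = re.compile(r"vmxnet3 \S+ (\S+): resetting")


def scan_guest_log(lines):
    """Return a list of (event_kind, detail) tuples found in the log lines."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            cpu, seconds = match.group(1), match.group(2)
            events.append(("soft_lockup", f"CPU {cpu} stuck for {seconds}s"))
            continue
        match = TX_HANG.search(line)
        if match:
            events.append(("tx_hang", match.group(1)))
            continue
        match = NIC_RESET.search(line)
        if match:
            events.append(("nic_reset", match.group(1)))
    return events
```

The function accepts any iterable of strings, so an open log file can be passed directly, for example `scan_guest_log(open(path))` where `path` is the guest log to inspect.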
Environment
VMware vSphere ESXi
Cause
On a Linux OS, a 'soft lockup' watchdog timeout can occur when the kernel is busy working through a very large number of objects that need to be scanned, freed, or allocated.
On a VM, if the lockup is not caused by an actual bug in the guest OS, the likely cause is that the hypervisor did not schedule the guest for a prolonged time. This can result from resource contention on the host or from complications during a vMotion.
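For context on the "stuck for 21s" value in the watchdog message: in the Linux kernel the soft lockup threshold is twice the `kernel.watchdog_thresh` sysctl, which defaults to 10 seconds, giving a 20 second threshold. The sketch below reads that setting inside the guest. The helper names are illustrative assumptions, and the `/proc` path is only available on Linux guests.

```python
from pathlib import Path


def read_watchdog_thresh(path="/proc/sys/kernel/watchdog_thresh"):
    """Return kernel.watchdog_thresh in seconds, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def soft_lockup_threshold_seconds(watchdog_thresh):
    """The kernel reports a soft lockup after 2 * watchdog_thresh seconds."""
    return 2 * watchdog_thresh
```

With the default setting, `soft_lockup_threshold_seconds(10)` gives 20, which is consistent with a CPU being reported as stuck for 21 seconds.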
Resolution
SAP HANA VMs running on VMware have very specific requirements to run optimally. Live migration of an SAP HANA VM has additional specific requirements, so it is recommended to consult the best practice guide and ensure it is being followed. Please use this link for the best practice guide.
With regard to performing a vMotion of an SAP HANA VM, take note of the following points from the guide:
Caution: While vMotion is a fantastic tool that helps you manage and operate production SAP HANA VMs, be very careful migrating SAP HANA VMs because doing so may cause severe performance issues that impact SAP HANA users and long-running transactions.
Don't live migrate SAP HANA VMs while a virus scanner or a backup job is running inside the VM or while people are using the SAP HANA VMs, because this can cause a soft lock of the SAP HANA application.
Don't live migrate SAP HANA VMs while a VM snapshot-based backup job is running outside the VM. This can cause a soft lock or data inconsistencies in the SAP HANA application.
Use vMotion only during non-peak times (for example, when CPU utilization is less than 25%).
Allocate sufficient bandwidth to the vMotion network, ideally 25 GbE or more.
Avoid having "noisy neighbors" active on the ESXi host during a vMotion migration of SAP HANA VMs. A noisy neighbor is another VM that is using up the host's resources, leaving few left for other activities.
A dedicated vMotion network is a strict requirement. The network should have enough bandwidth to support a fast migration time, which depends on the active SAP HANA memory; for example, >= 4GB SAP HANA VMs. A vMotion network with 25GbE or higher bandwidth is preferred. Multiple vMotion network cards will help parallelize the vMotion process and lower the impact on the VM performance and time.
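The points above can be turned into a simple pre-migration checklist. The following is a minimal sketch, not a supported tool: the function names, parameters, and the 0.8 link-efficiency factor are illustrative assumptions, and the input values must be gathered by the operator. The time estimate covers a single full copy of active memory and ignores pages that are dirtied and re-sent during the migration, so it should be read as a lower bound.

```python
def estimate_transfer_seconds(active_memory_gib, link_gbit_per_s, efficiency=0.8):
    """Lower-bound time to copy active memory once over the vMotion link.

    efficiency is an assumed fraction of the nominal link rate that is
    actually usable; it is not a value from the best practice guide.
    """
    if link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and 0 < efficiency <= 1")
    bits_to_copy = active_memory_gib * 1024**3 * 8
    return bits_to_copy / (link_gbit_per_s * 1e9 * efficiency)


def vmotion_precheck(cpu_util_percent, vmotion_link_gbit_per_s,
                     dedicated_vmotion_network,
                     in_guest_backup_or_scan_running,
                     snapshot_backup_running):
    """Return a list of warnings based on the guidance quoted above."""
    warnings = []
    if cpu_util_percent >= 25:
        warnings.append("CPU utilization is not below 25%; wait for a non-peak time.")
    if vmotion_link_gbit_per_s < 25:
        warnings.append("vMotion bandwidth is below the preferred 25 GbE.")
    if not dedicated_vmotion_network:
        warnings.append("A dedicated vMotion network is a strict requirement.")
    if in_guest_backup_or_scan_running:
        warnings.append("A virus scan or backup job is running inside the VM.")
    if snapshot_backup_running:
        warnings.append("A snapshot-based backup job is running outside the VM.")
    return warnings
```

An empty list from `vmotion_precheck` means none of the listed conditions were flagged; it does not cover the noisy-neighbor check, which depends on the other VMs running on the host.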
Additional Information
When troubleshooting or investigating similar issues it is recommended to engage with SAP support to get their analysis of the situation. Applicable SAP Notes/KB articles could be the following: