The Gateway service in the Virtual Appliance suddenly crashed and was restarted.
Upon investigation, no clues were found except for the following kernel error in the OS log /var/log/messages:
Dec 23 14:29:59 gateway.local kernel: [15505712.281271] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [watchdog/0:11]
Release : 10.0
Component : API GATEWAY
In a virtual (non-physical) Gateway Appliance, the above message likely indicates that the underlying hypervisor (VMware/ESX) caused a CPU soft lockup of more than 20 seconds during a specific maintenance task.
Common causes are snapshots or vMotion. When these take too long, the Gateway process can become unresponsive for more than 15 seconds, at which point the Controller triggers a service restart.
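To check whether lockups coincide with snapshot or vMotion windows, it can help to pull every soft-lockup entry out of the log with its timestamp. The following is a minimal sketch, not an official tool: the regular expression is modelled only on the single sample line shown in this article, and the function names `find_soft_lockups` and `scan_file` are hypothetical. Other kernel versions may format the message differently.

```python
import re

# Pattern derived from the one sample line in this article (an assumption);
# adjust it if your kernel logs the soft lockup differently.
SOFT_LOCKUP = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: .*"
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!"
)


def find_soft_lockups(lines):
    """Return (timestamp, host, cpu, seconds) for each soft-lockup line."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append(
                (match["ts"], match["host"], int(match["cpu"]), int(match["secs"]))
            )
    return events


def scan_file(path="/var/log/messages"):
    """Scan a syslog file (hypothetical helper; reading it usually needs root)."""
    with open(path, errors="replace") as handle:
        return find_soft_lockups(handle)
```

Calling `scan_file()` on the appliance returns a list of events whose timestamps can then be compared against the snapshot and vMotion history recorded on the hypervisor side.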
A snapshot is usually a safe action to take even when the Gateway service (SSG) is up and running. However, the impact of a snapshot may differ from case to case, depending on the specifications and current load of the VMware system, as well as the latency and I/O of the storage where the snapshot is saved.
For this reason, we always recommend taking snapshots during periods of low traffic and, where possible, without the memory option. Without it, the snapshot does not capture the live state of the virtual machine, which drastically reduces the time needed to complete the snapshot.
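For environments that script snapshots, the "without memory" recommendation might look like the sketch below. It assumes snapshots are driven through the pyVmomi library and that `vm` is a `vim.VirtualMachine` object already obtained from a vCenter/ESX session; neither is stated in this article. The helper name `snapshot_without_memory` is hypothetical, and setting `quiesce=False` is an additional assumption made here to keep the operation as light as possible.

```python
def snapshot_without_memory(vm, name, description=""):
    """Request a snapshot that skips the memory dump.

    `vm` is assumed to be a pyVmomi vim.VirtualMachine (or any object
    exposing the same CreateSnapshot_Task method).
    """
    return vm.CreateSnapshot_Task(
        name=name,
        description=description,
        memory=False,   # do not capture the live memory state of the VM
        quiesce=False,  # assumption: skip guest file-system quiescing too
    )
```

The call returns a task object that the caller would still need to wait on; connection handling and error checking are deliberately left out of this sketch.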
For cases where a maintenance action involves VMware vMotion, we recommend excluding (isolating) the Gateway Virtual Appliance from it.
An additional recommendation is to discuss with the vendor (VMware) how to improve snapshot efficiency or minimise its impact.