vdpi process crash on ESXi host causes an NSX alarm, Application on NSX node <hostname> has crashed

Article ID: 323542

Products

VMware NSX

Issue/Introduction

Symptoms:
  • You are running VMware NSX 4.x.
  • In the NSX Manager UI, the following alarm is raised:
Application on NSX node <hostname> has crashed. The number of core files found is 1. Collect the Support Bundle including core dump files and contact VMware Support team. 
  • On the ESXi host, the log file /var/run/log/vobd.log contains entries similar to:
[esx.problem.application.core.dumped] An application (/usr/lib/vmware/nsx-vdpi/bin/vdpi) running on ESXi host has crashed (2 time(s) so far). A core file may have been created at /var/core/vdpi-zdump.001.
  • On the ESXi host, a vdpi core dump file is generated:
/var/core/vdpi-zdump.xxx
  • On the ESXi host, /var/run/log/nsx-syslog.log contains the following entry more than 20 times:
Revalidating domains to generation number <x>
Note: The value of <x> does not change between FQDN revalidations.
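
The vobd.log symptom above can be checked with a short script. The following is a minimal Python sketch, not a VMware-provided tool: the function name find_vdpi_crashes and the regular expression are assumptions modeled on the sample entry quoted above, and the exact log wording may differ between ESXi builds.

```python
import re

# Pattern modeled on the vobd.log entry quoted in this article (assumption:
# the message wording is the same on the ESXi build being checked).
CORE_DUMP_RE = re.compile(
    r"\[esx\.problem\.application\.core\.dumped\].*?"
    r"\((?P<binary>[^)]*vdpi)\).*?"
    r"\((?P<count>\d+) time\(s\) so far\).*?"
    r"created at (?P<core>\S+?)\.?$"
)


def find_vdpi_crashes(log_text):
    """Return (crash_count, core_path) tuples for vdpi core-dump entries."""
    crashes = []
    for line in log_text.splitlines():
        match = CORE_DUMP_RE.search(line.rstrip())
        if match:
            crashes.append((int(match.group("count")), match.group("core")))
    return crashes
```

To use it, read the contents of /var/run/log/vobd.log into a string and pass it to find_vdpi_crashes; each returned tuple holds the crash count reported so far and the core file path named in the entry.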


Environment

VMware NSX-T Data Center

Cause

  • Under normal circumstances, the log entry 'Revalidating domains to generation number <x>' appears between 10 and 20 times during FQDN changes. When this issue occurs, the entry appears more than 20 times.
  • The VDPI crash occurs when the FQDN in a context profile firewall rule is changed while traffic that matches the existing FQDN is still flowing through that rule.
  • The process gets caught in a loop and leads to memory issues causing the VDPI crash.
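
The revalidation count described above can be checked against the contents of nsx-syslog.log. The following Python sketch is illustrative only: the function names, the grouping by generation number, and the assumption that <x> is an integer are not from VMware documentation. The threshold of 20 is the upper end of the normal range stated in this article.

```python
import re
from collections import Counter

# Assumption: <x> in the log entry is an integer generation number.
REVALIDATE_RE = re.compile(r"Revalidating domains to generation number\s+(\d+)")

# Upper end of the normal range given in this article.
NORMAL_MAX = 20


def count_revalidations(log_text):
    """Count revalidation entries per generation number."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REVALIDATE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suspicious_generations(log_text, threshold=NORMAL_MAX):
    """Return generation numbers that were logged more than `threshold` times."""
    counts = count_revalidations(log_text)
    return {gen: n for gen, n in counts.items() if n > threshold}
```

A non-empty result from suspicious_generations, together with a vdpi core dump on the same host, is consistent with the condition described in this article, but it is not proof on its own.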

Resolution

This issue is resolved in NSX 4.1.2.
VMware NSX 4.1.2 Release Notes:

  • Fixed Issue 3245179: VDPI crash.
    FQDN resolution rule application failure. VDPI restart.

Workaround:
To prevent this issue, do not change the FQDN used in a context profile firewall rule while traffic for that FQDN is flowing through the rule.
Instead, disable the rule, make the change, and then enable the rule again.
Because disabling the rule may impact traffic, perform this change during a maintenance window.
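
The order of operations in the workaround can be sketched as follows. This is a hedged Python illustration, not an NSX API client: change_fqdn_safely and its three callback parameters are hypothetical names, and each callback is assumed to wrap whatever method you use to manage the rule (for example, the NSX Manager UI workflow or your own automation).

```python
def change_fqdn_safely(disable_rule, apply_fqdn_change, enable_rule):
    """Run the workaround steps in order: disable, change, re-enable.

    All three arguments are hypothetical callables supplied by the caller.
    The rule is re-enabled even if the change fails, so the firewall rule
    is not left disabled by accident.
    """
    disable_rule()
    try:
        apply_fqdn_change()
    finally:
        enable_rule()
```

Re-enabling in the finally block is a design choice of this sketch: if the FQDN change raises an error, the rule returns to its previous enabled state and the error is still reported to the caller.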