ESXi host fails with purple diagnostic screen during NSX-T in-place upgrade



Article ID: 324589


Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • NSX UI displays upgrade failure error for the affected ESXi host.
  • ESXi host loses management network connectivity and gets disconnected from vCenter.
  • Management network connectivity is restored on its own after some time.
  • The ESXi host enters NSX Maintenance Mode after the failure, and all virtual machines on the host lose network connectivity.
  • Virtual machine network connectivity is restored after removing the ESXi host from NSX Maintenance Mode.
  • Retrying the upgrade results in the same symptoms.
  • During every upgrade attempt, the ESXi host crashes and creates vmkernel zdump files.
  • In the /var/log/vobd.log, you will find entries similar to:

  2023-01-06T18:43:12.215Z: [UserLevelCorrelator] 130936553us: [esx.problem.host.coredump] An unread host kernel core dump has been found.
  2023-01-06T19:59:40.414Z: [UserLevelCorrelator] 129648579us: [esx.problem.host.coredump] An unread host kernel core dump has been found.

 
  • The vmkernel backtrace shows that the crash occurred while unloading the VDR (virtual distributed router) module:
2023-01-06T18:16:51.366Z cpu56:84891812)@BlueScreen: #PF Exception 14 in world 84891812:vmkload_mod IP 0x420002505188 addr 0x1a
 PTEs:0x843ab78027;0x1a817e007;0x0;
 2023-01-06T18:16:51.380Z cpu56:84891812)Code start: 0x420002400000 VMK uptime: 151:22:26:19.013
 2023-01-06T18:16:51.401Z cpu56:84891812)0x453961d1b518:[0x420002505188]MCSLockWork@vmkernel#nover+0x8 stack: 0x4200045f2dbc
 2023-01-06T18:16:51.423Z cpu56:84891812)0x453961d1b520:[0x42000242b98a]vmk_SpinlockLock@vmkernel#nover+0xf stack: 0x4308c9202a18
 2023-01-06T18:16:51.448Z cpu56:84891812)0x453961d1b530:[0x4200045f2dbb]VdrPolicyDelete@(nsxt-vdrb-17883598)#<None>+0x70 stack: 0x430abaa025b0
 2023-01-06T18:16:51.474Z cpu56:84891812)0x453961d1b560:[0x4200045f34f3]VdrDeletePolicyList@(nsxt-vdrb-17883598)#<None>+0x48 stack: 0x433549202fd8


Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values will vary depending on your environment.
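As a quick check, the coredump events can be counted with a standard grep. The following is a minimal sketch that runs against the sample line from the excerpt above; on a live ESXi host you would point grep at /var/log/vobd.log directly:

```shell
# Sample line mirroring the vobd.log excerpt above (stand-in data for
# illustration; on a live host, read /var/log/vobd.log instead).
sample='2023-01-06T18:43:12.215Z: [UserLevelCorrelator] 130936553us: [esx.problem.host.coredump] An unread host kernel core dump has been found.'

# Count how many unread-coredump events were logged.
count="$(printf '%s\n' "$sample" | grep -c 'esx.problem.host.coredump')"
echo "$count"

# On a live ESXi host:
# grep -c 'esx.problem.host.coredump' /var/log/vobd.log
```

A non-zero count after each upgrade attempt matches the symptom of a new vmkernel zdump per attempt.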

Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 3.x

Cause

  • The ESXi host crashes while unloading the VDR module.
  • This issue occurs when all of the following conditions are true:
  • The NSX-T upgrade starts from a version earlier than 3.2 (for example, 3.0.x to 3.2.x or 3.1.x to 3.2.x).
  • In-place mode is selected for the host upgrade.
  • HCX policy-based routing is configured on the ESXi host.

Resolution

This issue is resolved in VMware NSX-T Data Center 3.2. In-place upgrades starting from NSX-T 3.2 or later are not affected by this issue.

Workaround:
The issue can be addressed by implementing one of the following workarounds:
  • Changing the host upgrade mode from "In-place" to "Maintenance mode".
  • Migrating any VM that is connected to a policy-route-enabled segment off the host being upgraded.
  • Removing the Mobility Optimized Networking policy routes configuration from HCX.


Additional Information

  • For additional information on "In-place" and "Maintenance mode" host upgrades, refer to the NSX-T documentation.
  • For additional information on HCX policy routes, refer to the HCX documentation.
  • To check whether policy-based routing is configured on the ESXi host, run the following commands on the ESXi host.
To list the policy tables:

net-vdr --policyTable -l

Example output:

 Id  Name                                     Ref
 --- ------------------------------           ---
 5   8926fa40-f856-4379-b237-3e17a67a73fc     25


To see what the policies are for a certain table:

 net-vdr --policy -l -B x

Note: x is the policy table Id shown in the output of the previous command.

Example output:

net-vdr --policy -l -B 5
 
 Policy Table 5
 Destination      GenMask          Flags    Ref Action   HitCount
 -----------      -------          -----    --- ------   --------
 0.0.0.0          0.0.0.0                   1   allow    0
 10.0.0.0         255.0.0.0                 1   allow    0
 172.16.0.0       255.240.0.0               1   allow    0
 192.168.0.0      255.255.0.0               1   allow    0
 Policy count = 4
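The two commands above can be combined into a small loop that inspects every policy table found on the host. The sketch below uses the example `net-vdr --policyTable -l` output from this article as stand-in data so the Id-parsing step is visible; on a live ESXi host you would capture the real command output instead, as noted in the comments:

```shell
# Stand-in data: the example output shown above. On a live ESXi host,
# capture the real output instead:
#   output="$(net-vdr --policyTable -l)"
output=' Id  Name                                     Ref
 --- ------------------------------           ---
 5   8926fa40-f856-4379-b237-3e17a67a73fc     25'

# Skip the two header lines and print the first column (the table Id).
ids="$(printf '%s\n' "$output" | awk 'NR > 2 { print $1 }')"
echo "$ids"

# On a live ESXi host, list the policies of each table:
# for id in $ids; do net-vdr --policy -l -B "$id"; done
```

If the first command prints no table Ids, policy-based routing is not configured on the host and this issue does not apply.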