In container environments NSX-T host upgrades fail at 10% or get stuck trying to put host in maintenance mode

Article ID: 324227

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
The following conditions are met:
 
  •  The NSX-T ESXi host upgrade fails with this exception:
   Install of offline bundle failed on host <UUID> with error
   Exception('VSIP Filters not cleared') Exception: VSIP Filters not cleared
  •  The ESXi host contains stale dvfilter entries, identified by "bad vnic uuid" in the VNIC UUID field of the following output:
   # vsipioctl getfilters

   Filter Name : nic-79253-eth47-vmware-sfw.1
   VM UUID :
   VNIC Index : 0
   VNIC UUID : bad vnic uuid
   VIF ID : bfcf0811-dbe8-492c-8fed-e15e05687601
   Service Profile : --NOT SET--
   Filter Hash : 8497
  • Containers are deployed in the environment
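
The stale-filter check above can be scripted. The following is a minimal Python sketch, assuming the output of vsipioctl getfilters has already been captured as text (for example over SSH) and follows the "Field : value" layout shown above. The function name find_stale_filters is hypothetical and is not part of any VMware tooling.

```python
from typing import List


def find_stale_filters(getfilters_output: str) -> List[str]:
    """Return the names of filters whose VNIC UUID reads 'bad vnic uuid'."""
    stale = []
    current_name = None
    for raw_line in getfilters_output.splitlines():
        # Each field is printed as "Key : value"; split on the first colon.
        key, sep, value = raw_line.partition(":")
        if not sep:
            continue  # blank line or a line without a field separator
        key = key.strip()
        value = value.strip()
        if key == "Filter Name":
            current_name = value
        elif key == "VNIC UUID" and value == "bad vnic uuid" and current_name:
            stale.append(current_name)
    return stale


# Sample taken from the output shown in this article.
SAMPLE = """\
Filter Name : nic-79253-eth47-vmware-sfw.1
VM UUID :
VNIC Index : 0
VNIC UUID : bad vnic uuid
VIF ID : bfcf0811-dbe8-492c-8fed-e15e05687601
Service Profile : --NOT SET--
Filter Hash : 8497
"""

print(find_stale_filters(SAMPLE))  # ['nic-79253-eth47-vmware-sfw.1']
```

An empty result suggests the host has no stale dvfilters of this kind. The sketch only reads text and does not attempt to remove any filter.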

Note: In some cases this issue may instead manifest as a failure to enter maintenance mode at the start of the upgrade process:
 
  • The ESXi host gets stuck while entering maintenance mode because some of the VMs or worker nodes fail to migrate to other ESXi hosts.
  • VM tasks report an error similar to:

    Failed waiting for data. Error 195887105. Failure.


Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 2.x

Cause

This issue occurs because, in some cases, when VMs running containers are migrated off a host with vMotion, the dvfilter is not destroyed as part of the cleanup process. This leaves stale dvfilters behind on the ESXi host.

An NSX-T upgrade must delete the dvfilters on a host. Because these stale dvfilters cannot be removed, the host upgrade fails.

Resolution

This issue is resolved in VMware ESXi 6.5 P04 and VMware ESXi 6.7 Update 3, available at VMware Downloads.

Before performing an NSX-T upgrade, ESXi hosts should be upgraded to the fixed version.

Workaround:

To work around this issue without upgrading ESXi, reboot the ESXi host to clear the stale dvfilters.

There are two options:

Automatically reboot all hosts during the upgrade:

Change the group upgrade mode to reboot hosts by default as follows:
  • Find the group ID for the cluster using GET api/v1/upgrade/upgrade-unit-groups?component_type=HOST, or from the UI.
  • Retrieve the group with GET api/v1/upgrade/upgrade-unit-groups/<group id>
  • In the response, change the extended_configuration value by setting {"key" : "rebootless_upgrade", "value" : "false"}.
  • PUT the modified payload to api/v1/upgrade/upgrade-unit-groups/<group id>
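
The payload change in the steps above can be sketched as follows. This is a minimal Python example, assuming the GET response is a JSON object in which extended_configuration is a list of key/value objects, as the snippet in the step suggests. The function name disable_rebootless_upgrade and the sample group fields are hypothetical, and the HTTP calls themselves (GET and PUT with authentication) are left out.

```python
import copy


def disable_rebootless_upgrade(group: dict) -> dict:
    """Return a copy of an upgrade-unit-group payload with
    rebootless_upgrade set to "false" in extended_configuration."""
    updated = copy.deepcopy(group)  # leave the caller's dict untouched
    # Assumption: extended_configuration is a list of {"key", "value"} objects.
    config = updated.setdefault("extended_configuration", [])
    for entry in config:
        if entry.get("key") == "rebootless_upgrade":
            entry["value"] = "false"
            break
    else:
        # Key was not present in the response, so add it.
        config.append({"key": "rebootless_upgrade", "value": "false"})
    return updated


# Hypothetical GET response, trimmed to the fields this sketch touches.
group = {
    "id": "example-group-id",
    "extended_configuration": [
        {"key": "rebootless_upgrade", "value": "true"},
    ],
}
payload = disable_rebootless_upgrade(group)
print(payload["extended_configuration"])
```

The returned dictionary is what would be sent in the PUT request. Keeping the rest of the GET response unchanged matters because the PUT replaces the whole group object.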

or

Manually reboot hosts:
After the reboot, the NSX-T upgrade must be performed before the host exits maintenance mode, because moving VMs onto and off the host again may reintroduce the problem.

  1. Select the option to pause the upgrade when an upgrade unit fails to upgrade.
  2. Start host upgrade.
  3. Once a host upgrade fails due to the issue described in this article:

    a) Confirm the host is in vSphere maintenance mode and reboot it to clear the stale dvfilters.
    b) After the reboot, the ESXi host should remain in vSphere maintenance mode.
    c) To retry the NSX-T upgrade, the host must be taken out of NSX maintenance mode.
       Select System > Fabric > Nodes > Host Transport Nodes.
       Select the ESXi host and, from Actions, click "Exit Maintenance Mode".
       If "Exit Maintenance Mode" is not available in the UI, use the API:
       POST https://manager/api/v1/transport-nodes/{node_id}?action=exit_maintenance_mode
    d) On the Upgrade Coordinator, click Reset to clear the error.
    e) Restart the host upgrade. It will now retry the problem host and allow the upgrade to proceed.
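
The API call in step c) can be prepared with the Python standard library. This is a minimal sketch that only builds the request object: the function name build_exit_maintenance_request, the manager hostname, and the node ID below are hypothetical, and authentication and certificate handling are omitted because they depend on the environment.

```python
import urllib.request
from urllib.parse import quote


def build_exit_maintenance_request(manager: str, node_id: str) -> urllib.request.Request:
    """Build the POST request that takes a transport node out of
    NSX maintenance mode. The request is not sent here."""
    url = (
        f"https://{manager}/api/v1/transport-nodes/"
        f"{quote(node_id, safe='')}?action=exit_maintenance_mode"
    )
    return urllib.request.Request(url, method="POST")


# Hypothetical manager address and node ID, for illustration only.
req = build_exit_maintenance_request("nsx-manager.example.com", "example-node-id")
print(req.get_method(), req.full_url)
```

Sending the request would additionally need credentials accepted by the NSX Manager, for example by adding an Authorization header before calling urllib.request.urlopen.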