NSX Manager is using 100% of CPU when pushing a large number of firewall rules or rules to a large number of hosts

Article ID: 345889

Updated On:

Products

VMware NSX Networking

Issue/Introduction

This article explains how to lower NSX Manager CPU usage and ensure that firewall rules are published to all applicable hosts.

Symptoms:
  • NSX Manager shows it is using 100% of CPU.
  • A large number of firewall rules have been recently added to the environment.
  • A large number of hosts are in the environment and configured to receive firewall rules.
  • NSX Manager is unresponsive to Web Interface or API commands.
  • ESXi to NSX Manager communication channel appears down for several hosts.
  • In the vsfwd.log file on a host whose communication channel appears down, you see entries similar to:

    Re-read credentials to broker <IP Address>:5671: Logging in: Input/output error
    2018-04-18T16:00:04UTC rmqClient Closing, No Ack received for Client netClient index 7

     
  • In the vsm.log file on NSX Manager, you see entries similar to the following for the affected hosts:

    2018-04-18 12:34:50.894 MDT ERROR HeartbeatManagerHeartbeatTimer HeartbeatManager$HeartbeatTask:297 - Client has not responded to the heartbeat for longer than the alert threshold. Peer name = 'com.vmware.vshield.userworld', client token = 'host-71', client id = '<UUID>', last heartbeat response = '4', last published heartbeat = '74'

    Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
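
To check a host for the vsfwd.log symptoms listed above, the two message fragments can be counted with grep. The following is a minimal sketch, not an official diagnostic: the function name count_channel_errors is hypothetical, and the log location used in the usage note (/var/log/vsfwd.log on the ESXi host) is an assumption that may differ in your environment.

```shell
# Hypothetical helper: count lines in a vsfwd log that contain either of the
# two message fragments quoted in the Symptoms section.
# Usage: count_channel_errors <path-to-vsfwd.log>
count_channel_errors() {
    if [ ! -r "$1" ]; then
        echo "cannot read log file: $1" >&2
        return 1
    fi
    # grep -c prints the number of matching lines; it exits non-zero when
    # that number is 0, so "|| true" keeps the function's status at success.
    grep -c -e 'Logging in: Input/output error' \
            -e 'No Ack received for Client' "$1" || true
}
```

For example, running count_channel_errors /var/log/vsfwd.log on an affected host (path assumed) prints the number of matching lines. A non-zero count only suggests the condition described here and should be read together with the other symptoms.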


Environment

VMware NSX for vSphere 6.3.x
VMware NSX for vSphere 6.2.x

Cause

This issue occurs because the NSX Manager's communication channel to the ESXi hosts is down or unavailable. As a result, NSX Manager repeatedly tries to reconnect to the ESXi hosts and synchronize the firewall rules.

Resolution

This issue is resolved in:

Workaround:
To work around this issue if you do not want to upgrade:
  1. Stop the vsfwd service on all hosts by running this command on each host. This should clear out the pending queues:

    /etc/init.d/vShield-Stateful-Firewall stop

  2. Restart the NSX Manager and wait a few minutes until the services show a ready state in the User Interface (UI) before proceeding to the next step.

    Note: There are no host syncs during this time because vsfwd is down.
     
  3. Start the vsfwd service on a few hosts (5-8 hosts) at a time by running this command:

    /etc/init.d/vShield-Stateful-Firewall start

    Note: This causes NSX Manager CPU usage to spike for a few minutes (approximately 10).

  4. Once the CPU spike subsides, start the vsfwd service on the next batch of hosts. Repeat until the service is running on all hosts.
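
The batched restart in steps 3 and 4 could be scripted from an administration workstation along the following lines. This is a sketch under stated assumptions, not a supported tool: the function names make_batches and start_in_batches and the RUN_CMD variable are hypothetical, SSH access as root to each host is assumed, and the fixed pause only approximates "wait until the spike is done" (checking NSX Manager CPU before each batch is safer). By default the script only prints what it would run.

```shell
# Print the given hosts grouped into lines of at most <size> hosts each.
# Usage: make_batches <size> <host>...
make_batches() {
    size=$1; shift
    line=""; count=0
    for host in "$@"; do
        line="${line:+$line }$host"
        count=$((count + 1))
        if [ "$count" -eq "$size" ]; then
            printf '%s\n' "$line"
            line=""; count=0
        fi
    done
    # Emit the final, partially filled batch if there is one.
    if [ -n "$line" ]; then printf '%s\n' "$line"; fi
}

# Start vsfwd on each host, one batch at a time, pausing after every batch.
# RUN_CMD defaults to "echo" (dry run); set RUN_CMD=ssh to run for real
# (assumes non-interactive SSH login as root is available on the hosts).
# Usage: start_in_batches <size> <wait_seconds> <host>...
start_in_batches() {
    size=$1; wait_seconds=$2; shift 2
    make_batches "$size" "$@" | while read -r batch; do
        for host in $batch; do
            # </dev/null stops ssh from consuming the batch list on stdin.
            ${RUN_CMD:-echo} "root@$host" \
                /etc/init.d/vShield-Stateful-Firewall start </dev/null
        done
        sleep "$wait_seconds"
    done
}
```

For example, start_in_batches 5 600 esx01 esx02 esx03 esx04 esx05 esx06 (host names are placeholders) prints the start command for the first five hosts, waits about 10 minutes, and then handles the remaining host. A batch size of 5 and a 600-second pause reflect the 5-8 hosts and roughly 10 minutes mentioned in the steps above.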