cfgagent service goes down randomly on multiple hosts along with other services such as nestdb and opsagent

Article ID: 415542

Products

VMware NSX

Issue/Introduction

  • Services go down on multiple hosts, and the hosts appear as disconnected in the NSX UI.

  • To check the agent status of an affected host, click the 'Monitor' tab, scroll down to 'Agent Status', and click it to list the agent services and their status.

    • NSX_NESTDB shows 'Down'/Red


Environment

VMware NSX

Cause

  • The host log shows "keepalive expired" messages for various NSX agents such as nestdb, cfgagent, opsagent, and nsx-proxy.
    Log file: /var/run/log/nsx-syslog
    netopa[2855716]: NSX 2855716 - [nsx@6876 comp="nsx-esx" subcomp="nsx-netopa" s2comp="nsx-rpc" tid="2855728" level="INFO"] RpcConnection[28 Connected to tcp://127.0.0.1:2480 0] Closing (keepalive expired)
    netopa[2855716]: NSX 2855716 - [nsx@6876 comp="nsx-esx" subcomp="nsx-netopa" s2comp="nsx-rpc" tid="2855728" level="INFO"] RpcConnection[28 Closed to tcp://127.0.0.1:2480 0] Notifying channels on connection down (keepalive expired)
    cfgAgent[2854619]: NSX 2854619 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" s2comp="nsx-rpc" tid="6638C700" level="info"] RpcConnection[307 Connected to tcp://127.0.0.1:2480 0] Closing (keepalive expired)
    cfgAgent[2854619]: NSX 2854619 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" s2comp="nsx-rpc" tid="6638C700" level="info"] RpcConnection[307 Closed to tcp://127.0.0.1:2480 0] Notifying channels on connection down (keepalive expired)
    nsx-opsagent[3072442]: NSX 3072442 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsx-rpc" tid="3072444" level="INFO"] RpcConnection[2 Negotiating to tcp://127.0.0.1:2480 0] Closing (keepalive expired)
    nsx-opsagent[3072442]: NSX 3072442 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsx-rpc" tid="3072444" level="INFO"] RpcConnection[2 Closed to tcp://127.0.0.1:2480 0] Notifying channels on connection down (keepalive expired)
  • The dfwpktlogs files show a very high rate of logging for an explicit "Any to Any" allow rule.
    For example, if the rule ID is 1234 (any to any), count its log entries by running the following command from the /var/run/log directory:
    root# less dfwpktlogs* | grep -i "1234" | wc -l
    626617    (number of hits)
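
The two checks above can be sketched as small shell helpers. This is a minimal sketch, not a supported tool: the function names count_keepalive_expired and count_rule_hits are hypothetical, the first function assumes each log entry carries the subcomp="..." field shown in the excerpts above, and the second assumes the dfwpktlogs files passed to it are uncompressed text.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and not part of NSX.

# Count "keepalive expired" lines per agent, keyed on the subcomp="..." field
# that appears in the nsx-syslog entries.
# $1: path to an nsx-syslog file
count_keepalive_expired() {
    sed -n 's/.*subcomp="\([^"]*\)".*keepalive expired.*/\1/p' "$1" |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Count dfwpktlogs lines that contain the rule ID as a whole word.
# $1: rule ID; remaining arguments: uncompressed log files.
# A whole-word match can still hit other fields such as a port number,
# so treat the result as an upper bound.
count_rule_hits() {
    rule_id="$1"
    shift
    cat "$@" | grep -c -w -- "$rule_id"
}
```

Assuming the log locations used in this article, the helpers could be called as `count_keepalive_expired /var/run/log/nsx-syslog` and `count_rule_hits 1234 /var/run/log/dfwpktlogs*`. Unlike the plain `grep -i "1234"` above, the whole-word match does not count longer numbers such as 12345.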

Resolution

  • Enabling logging on an explicit (any to any) rule in a production environment for any sustained period of time is not recommended.
  • If traffic that matches an explicit (any to any) rule must be logged, create a new rule specific to the traffic flow in question and enable logging on that rule only.
    To disable logging, using the "Default Layer2 Section" as an example:
    • Log in to NSX Manager and go to Security > Distributed Firewall > ETHERNET, then expand Default Layer2 Section.
    • Click the settings icon for Default Layer2 Rule.
    • Disable Logging and apply the change.

If the nsx-nestdb service is down on any host:

  • Check whether the service is down on the ESXi host. SSH into the host and run:
    • /etc/init.d/nsx-nestdb status
    • If the service is down, the output is 'NSX-NESTDB not running'.

  • Start the nsx-nestdb service on the ESXi host:
    • /etc/init.d/nsx-nestdb start

  • Repeat for all affected hosts.
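
The check-and-start steps above can be combined into one function that is run on each affected host. This is a minimal sketch: the function name ensure_nestdb_running and the NESTDB_INIT variable are hypothetical additions (the variable only exists so the init script path can be overridden), and the match relies on the 'NSX-NESTDB not running' status text quoted above.

```shell
#!/bin/sh
# Sketch only: start nsx-nestdb when its status reports that it is not running.
# NESTDB_INIT is a hypothetical override for the init script path.
NESTDB_INIT="${NESTDB_INIT:-/etc/init.d/nsx-nestdb}"

ensure_nestdb_running() {
    # Capture the status text, including anything written to stderr.
    status="$("$NESTDB_INIT" status 2>&1)"
    case "$status" in
        *"not running"*)
            echo "nsx-nestdb is down, starting it"
            "$NESTDB_INIT" start
            ;;
        *)
            echo "nsx-nestdb status: $status"
            ;;
    esac
}
```

Calling `ensure_nestdb_running` in an SSH session on the host leaves a running service untouched and starts it only when the status output contains "not running".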

Additional Information

If the issue persists after performing the above steps, open a support ticket with Broadcom: Creating and managing Broadcom support cases