ESXi hosts may experience operational issues if L2 DFW default rule logging is enabled

Article ID: 326455

Products

VMware vSphere ESXi, VMware NSX, VMware vDefend Firewall, VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

  • The NSX-T Distributed Firewall (DFW) comprises rules at Layer 2 and Layer 3.
  • Layer 2 rules by their very nature are stateless.
  • For the majority of environments, the Layer 2 Ethernet section has a default Any/Any/Any/Allow rule and the Layer 3 section is where custom rules are added.
  • Logging can be enabled on any DFW rule; the log entries are written to /var/log/dfwpktlogs.log on the ESXi host.
  • Enabling logging on the default stateless Layer 2 rule effectively turns on logging for all traffic in an environment.
  • Depending on the size of the environment, this can generate tens of thousands of log lines per second on an ESXi host.
  • This enormous level of logging has the potential to impact some operations of an ESXi host.
  • In /var/run/log/dfwpktlogs.log on the host, the flood of log entries looks similar to:
    2021-11-12T20:59:37.464Z ef608aa0 L2 match PASS 1 IN 1450 00:50:##:##:##:60->00:50:##:##:##:4a ETHTYPE 0800
    2021-11-12T20:59:37.464Z ef608aa0 L2 match PASS 1 IN 1450 00:50:##:##:##:60->00:50:##:##:##:4a ETHTYPE 0800
    2021-11-12T20:59:37.464Z ef608aa0 L2 match PASS 1 IN 1450 00:50:##:##:##:60->00:50:##:##:##:4a ETHTYPE 0800
    2021-11-12T20:59:37.464Z ef608aa0 L2 match PASS 1 IN 1450 00:50:##:##:##:60->00:50:##:##:##:4a ETHTYPE 0800
    2021-11-12T20:59:37.464Z ef608aa0 L2 match PASS 1 IN 553 00:50:##:##:##:60->00:50:##:##:##:4a ETHTYPE 0800
    2021-11-12T20:59:37.464Z ef608aa0 L2 match PASS 1 IN 83 00:50:##:##:##:60->00:50:##:##:##:4a ETHTYPE 0800
  • This issue appears to particularly affect vSAN environments. Possible symptoms include:

    vSAN health service timeout

    # esxcli vsan health cluster list
    Query timeout. Please try again

    vSAN service restart timeout

    # /etc/init.d/vsanmgmtd restart
    wait for 17215108 termination timed out
    vsanperfsvc stopped.
    vsanperfsvc started.



  • DFW policies may also intermittently enter an "Unknown" state.


  • Node status may show 'Degraded' on one or more hosts under System > Fabric > Hosts > Clusters in the NSX UI.

     
    • Click the 'View Details' option of the ESXi host showing 'Degraded'. In the Overview tab, 'Controller Connectivity' shows a 'Down'/Red status.

    • Click the 'Monitor' tab and scroll down to 'Agent Status'. Click 'Agent Status' to list the agent services and their status.

      • NSX_NESTDB shows 'Down'/Red
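
To gauge whether a host is seeing the L2 log flooding described above, the entries can be counted per second. The following is a minimal sketch, assuming awk, sort, and head are available in the host shell; `l2_rate` is a hypothetical helper name and not an ESXi or NSX command.

```shell
#!/bin/sh
# Sketch: report the busiest one-second buckets of L2 DFW log entries.
# l2_rate is a hypothetical helper name, not an ESXi or NSX command.
l2_rate() {
    # Default to the log path named in this article; pass another file to override.
    log="${1:-/var/run/log/dfwpktlogs.log}"
    # $1 is the ISO 8601 timestamp; its first 19 characters are YYYY-MM-DDTHH:MM:SS.
    awk '/ L2 match / { count[substr($1, 1, 19)]++ }
         END { for (t in count) print count[t], t }' "$log" |
        sort -rn | head -n 5
}
```

Calling `l2_rate` with no argument reads the default log path and prints up to five lines, each holding a count followed by the second it was observed in. Counts in the thousands or higher for a single second would be consistent with the flooding shown in the sample above.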

Resolution

Enabling logging on the default L2 Ethernet DFW rule in a production environment is not recommended for any sustained period.
If logging must be enabled on an L2 rule, create a new L2 rule specific to the traffic flow in question and enable logging on that rule only.
To disable logging, follow these steps:
Log in to NSX Manager, then go to Security > Distributed Firewall > ETHERNET and expand Default Layer2 Section.

Click the settings for Default Layer2 Rule.

Disable logging and apply the change.



If the nsx-nestdb service is down on any hosts:

  • Check whether the service is down on the ESXi host. SSH into the host and run:
    • /etc/init.d/nsx-nestdb status
    • A stopped service reports 'NSX-NESTDB not running'.

  • Start the nsx-nestdb service on the ESXi host:
    • /etc/init.d/nsx-nestdb start

  • Repeat for all affected hosts.
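
The check-and-start steps above can be combined into one small function to run on each affected host. This is a sketch only: `ensure_nestdb` is a hypothetical helper name, and it assumes the status output contains the 'not running' text quoted above whenever the service is stopped.

```shell
#!/bin/sh
# Sketch: start nsx-nestdb only when its init script reports it is not running.
# ensure_nestdb is a hypothetical helper name; the optional argument exists
# only so the function can be pointed at a different script for testing.
ensure_nestdb() {
    init="${1:-/etc/init.d/nsx-nestdb}"
    # The article gives 'NSX-NESTDB not running' as the problem status.
    if "$init" status 2>&1 | grep -q 'not running'; then
        echo "nsx-nestdb is not running; starting it"
        "$init" start
    else
        echo "nsx-nestdb did not report 'not running'; leaving it alone"
    fi
}
```

Run `ensure_nestdb` with no argument in an SSH session on the host, then confirm the result with /etc/init.d/nsx-nestdb status. The function deliberately does nothing when the status output is anything other than the known problem status, so unexpected states are left for manual review.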