Edge Datapath CPU very high alarm

Article ID: 330483

Updated On:

Products

VMware NSX

Issue/Introduction

Title: Alarm for Edge Datapath CPU usage very high
Event ID: edge_health.edge_datapath_cpu_very_high
Alarm Description

  • Purpose: Indicates Edge Datapath CPU usage is high
  • Impact: Rx drops will be observed when usage reaches 100%

Environment

VMware NSX-T Data Center

Edge Form factors:

  • Bare Metal Edge
  • VM Edge
 

Cause

Reason for very high CPU usage: 

  • Current CPU usage on the Edge node can be obtained with the 'get dataplane cpu stats' Edge CLI command, which shows the packets per second and the CPU utilization for each CPU core. 100% CPU usage means that one or more CPU cores have reached their maximum capacity.

    Sample output for get dataplane cpu stats:

    > get dataplane cpu stats
    Wed Jun 26 2024 UTC 20:50:50.460
    CPU Usage
    Core      : 0
    Crypto    : 0 pps
    Intercore : 0 pps
    Kni       : 0 pps
    Rx        : 1930 pps
    Slowpath  : 0 pps
    Tx        : 0 pps
    Usage     : 100%
    
    Core      : 2
    Crypto    : 0 pps
    Intercore : 0 pps
    Kni       : 0 pps
    Rx        : 1920 pps
    Slowpath  : 0 pps
    Tx        : 0 pps
    Usage     : 100%
  • One possible cause is that the traffic rate has reached 100% of what the CPU can process.
  • CPU usage also increases when there is a large number of fragmented packets. Checking the MTU size along the path and adjusting the packet size can help reduce fragmentation.
  • The number of fragmented packets on a Logical router interface can be obtained using the 'get gateway interface <Logical router interface UUID> stats' Edge CLI command. The Logical router interface UUID is obtained using the 'get interfaces' Edge CLI command under the Logical router VRF.
  • CPU usage may be high on only a subset of CPUs if the traffic is being hashed only to that subset.
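
The per-core check described above can be scripted once the output of 'get dataplane cpu stats' has been saved as text. The following is a minimal Python sketch, not an NSX tool: the function names and the 100% threshold are assumptions made for illustration, and the sample text is copied from the output shown earlier in this section.

```python
import re

# Text copied from the 'get dataplane cpu stats' sample in this article.
SAMPLE = """\
Wed Jun 26 2024 UTC 20:50:50.460
CPU Usage
Core      : 0
Crypto    : 0 pps
Intercore : 0 pps
Kni       : 0 pps
Rx        : 1930 pps
Slowpath  : 0 pps
Tx        : 0 pps
Usage     : 100%

Core      : 2
Crypto    : 0 pps
Intercore : 0 pps
Kni       : 0 pps
Rx        : 1920 pps
Slowpath  : 0 pps
Tx        : 0 pps
Usage     : 100%
"""


def parse_cpu_stats(text):
    """Return one dict per core, e.g. {'core': 0, 'rx': 1930, 'usage': 100}."""
    cores = []
    for line in text.splitlines():
        match = re.match(r"\s*(\w+)\s*:\s*(\d+)", line)
        if not match:
            continue  # timestamp, header, and blank lines do not match
        key, value = match.group(1).lower(), int(match.group(2))
        if key == "core":
            cores.append({"core": value})  # a "Core" line starts a new record
        elif cores:
            cores[-1][key] = value
    return cores


def hot_cores(cores, threshold=100):
    """Return the IDs of cores at or above the threshold (hypothetical default)."""
    return [c["core"] for c in cores if c.get("usage", 0) >= threshold]


cores = parse_cpu_stats(SAMPLE)
hot = hot_cores(cores)
if hot and len(hot) < len(cores):
    print("Only a subset of cores is saturated:", hot)
elif hot:
    print("All listed cores are saturated:", hot)
else:
    print("No core is at the threshold")
```

If only some cores appear in the saturated list, that matches the case where traffic is hashed to a subset of CPUs; if all of them do, the overall traffic rate is the more likely cause.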

    Sample output for get interfaces under a given VRF:

    > get interfaces
    Wed Jun 26 2024 UTC 20:54:40.086
    Logical Router
    UUID                                  VRF  LR-ID  Name      Type
    ec39a7e1-####-####-####-cc6f579cc5dc  1    3      ########   SERVICE_ROUTER_TIER0
    Interfaces (IPv6 DAD Status A-DAD_Success, F-DAD_Duplicate, T-DAD_Tentative, U-DAD_Unavailable)
        Interface     : d0d151da-####-####-####-89a05c454cc4
        Ifuid         : ###
        Mode          : cpu
        Port-type     : cpu
        Enable-mcast  : true
    
        Interface     : 76a2d89a-####-####-####-815d27147b1c
        Ifuid         : ###
        Mode          : blackhole
        Port-type     : blackhole
    	
        Interface     : fcdd7a2a-####-####-####-034c5c0e05e7
        Ifuid         : ###
        Name          : ######
        Fwd-mode      : IPV4_AND_IPV6
        Internal name : uplink-###
        Mode          : lif
        Port-type     : uplink
        IP/Mask       : 100.##.##.##/24
        MAC           : 02:50:56:##:##:##
        VLAN          : 3901
        Access-VLAN   : untagged
        LS port       : 411f98f0-####-####-####-fbcb0b800338
        Urpf-mode     : STRICT_MODE
        DAD-mode      : LOOSE
        RA-mode       : SLAAC_DNS_THROUGH_RA(M-0, 0-0)
        Admin         : up
        Op_state      : up
        Enable-mcast  : True
        MTU           : 1500
        arp_proxy     :


    Sample output for get gateway interface:

    > get gateway interface fcdd7a2a-####-####-####-034c5c0e05e7
    Wed Jun 26 2024 UTC 20:53:35.065
    interface      : fcdd7a2a-####-####-####-034c5c0e05e7
    ifuid          : ###
    VRF            : ec39a7e1-####-####-####-cc6f579cc5dc
    name           : #######
    mode           : lif
    IP/Mask        : 100.##.##.##/24
    Fwd-mode       : IPV4_AND_IPV6
    MAC            : 02:50:56:##:##:##
    VLAN           : 3901
    Segment port   : 411f98f0-####-####-####-fbcb0b800338
    urpf-mode      : STRICT_MODE
    admin          : up
    op_state       : up
    MTU            : 1500
    arp_proxy      :
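
To see why the MTU matters for fragmentation, the arithmetic can be worked through for the MTU of 1500 shown in the interface output above. This is a rough Python sketch of standard IPv4 fragmentation only. It assumes a 20-byte IPv4 header without options and does not account for overlay encapsulation overhead; the function name is hypothetical.

```python
import math


def ipv4_fragment_count(packet_len, mtu, header_len=20):
    """Estimate how many IPv4 fragments a packet of packet_len bytes needs.

    packet_len and mtu are in bytes and include the IPv4 header.
    Assumes a fixed header_len with no IP options (an assumption).
    """
    if mtu <= header_len + 8:
        raise ValueError("MTU is too small to carry any fragment payload")
    if packet_len <= mtu:
        return 1  # fits in one packet, no fragmentation
    payload = packet_len - header_len
    # Fragment payloads other than the last must be a multiple of 8 bytes.
    per_fragment = (mtu - header_len) // 8 * 8
    return math.ceil(payload / per_fragment)


for size in (1500, 1501, 4020, 9000):
    print(size, "bytes ->", ipv4_fragment_count(size, mtu=1500), "fragment(s)")
```

With an MTU of 1500, a packet that is one byte too large already doubles the packet count the datapath has to handle, which is why aligning the packet size with the path MTU reduces CPU load.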

Resolution

Steps to Resolve
For 3.0.0 and higher

Recommended Action: 

  • Collect the support bundle when the alarm is raised.
  • Consider increasing the Edge appliance form factor size and rebalancing services on this Edge node to other Edge nodes in the same cluster or other Edge clusters.
  • Higher CPU usage is expected with higher packet rates. If the packet rate on the Edge node is low while CPU usage is high, check whether flow-cache is disabled by invoking the 'get dataplane flow-cache config' Edge CLI command. If it is disabled, consider re-enabling it using the command 'set dataplane flow-cache enabled' followed by 'restart service dataplane'. Note: restarting the dataplane service causes a momentary disruption in traffic.

    Sample output for get dataplane flow-cache config:

    > get dataplane flow-cache config
    Wed Jun 26 2024 UTC 20:56:07.488
    Enabled            : true
    Mega_hard_timeout_ms: 4966
    Mega_size          : 262144
    Mega_soft_timeout_ms: 4898
    Micro_size         : 262144
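
When this check has to be repeated across several Edge nodes, the saved output of 'get dataplane flow-cache config' can be inspected with a short script. The following Python sketch is illustrative only: the function names are hypothetical, and the sample text is copied from the output above.

```python
import re

# Text copied from the 'get dataplane flow-cache config' sample in this article.
SAMPLE = """\
Wed Jun 26 2024 UTC 20:56:07.488
Enabled            : true
Mega_hard_timeout_ms: 4966
Mega_size          : 262144
Mega_soft_timeout_ms: 4898
Micro_size         : 262144
"""


def parse_flow_cache_config(text):
    """Return the 'key : value' lines as a dict with lowercase keys."""
    config = {}
    for line in text.splitlines():
        # A single-word key followed by a colon; the timestamp line does not match.
        match = re.match(r"\s*(\w+)\s*:\s*(\S+)\s*$", line)
        if match:
            config[match.group(1).lower()] = match.group(2)
    return config


def flow_cache_enabled(text):
    """True only if the output contains an 'Enabled' field set to true."""
    return parse_flow_cache_config(text).get("enabled", "").lower() == "true"


if flow_cache_enabled(SAMPLE):
    print("flow-cache is enabled")
else:
    print("flow-cache is disabled or the field is missing; review before re-enabling")
```

The script only reports the state. Re-enabling flow-cache and restarting the dataplane service remain manual steps to be done in a maintenance window.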

Maintenance window required for remediation? Yes

Additional Information