Routing subsystem on the edge node is down alarm

Article ID: 369208

Products

VMware NSX

Issue/Introduction

Title: Alarm to indicate that the status of the routing subsystem on the edge node is down.
Event ID: routing_down

Alarm description:

  • Purpose: The routing status determines the participation of the edge node in forwarding data traffic. This alarm indicates that the status of the routing subsystem on the edge node is down.
  • Impact: If the status of the routing subsystem on the edge node is down, no traffic can be forwarded out of the edge node, resulting in complete traffic loss.

Environment

VMware NSX-T Data Center
VMware NSX

Cause

  • In the edge node, the routing subsystem drives the communication to the external domain.
  • The routing subsystem principally works on a service router and uses the networking routing protocols to install and maintain routes to the external domain.
  • With the help of the routes configured, the edge node communicates between the local and the external network. 
  • The status of the routing subsystem depends on at least one BGP, OSPF, or BFD session being configured and in the UP/ESTABLISHED state on the edge node.
  • If none of these sessions is present, or if all of them are in the DOWN state, the routing subsystem is considered to be down.
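
The rule described above can be sketched in a few lines of Python. This is only an illustration of the logic as stated in this article, not how NSX evaluates the alarm internally. The function name and the plain lists of state strings are assumptions made for the example; the healthy states (ESTABLISHED for BGP, Full for OSPF, UP for BFD) are taken from the resolution steps in this article.

```python
def routing_subsystem_is_up(bgp_states, ospf_states, bfd_states):
    """Return True if at least one BGP, OSPF, or BFD session is healthy.

    Each argument is a list of session state strings (hypothetical input
    format). Empty lists mean no session of that type is present.
    """
    bgp_up = any(state.upper() == "ESTABLISHED" for state in bgp_states)
    # The edge CLI reports OSPF states such as "Full/DR", so match on the prefix.
    ospf_up = any(state.upper().startswith("FULL") for state in ospf_states)
    bfd_up = any(state.upper() == "UP" for state in bfd_states)
    return bgp_up or ospf_up or bfd_up
```

With no sessions at all, or with every session in a non-healthy state, the function returns False, which corresponds to the condition that raises this alarm.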

Resolution

Steps to resolve
For 3.0.0 and higher

  • Check the edge node's health:
    • Check for the presence of the EdgeHealthAlarm, and follow the recommended actions provided in that alarm.
  • Check the service router configuration:
    • Verify if a service router is configured on the edge node using the API GET /policy/api/v1/infra/tier-0s.

      Sample Response: GET /policy/api/v1/infra/tier-0s
      {
        "sort_ascending": true,
        "sort_by": "display_name",
        "result_count": 1,
        "results": [
          {
            "resource_type": "Tier0",
            "id": "vmc_prv",
            "display_name": "/infra/tier-0s/vmc_prv",
            "path": "/infra/tier-0s/vmc_prv",
            "parent_path": "/infra/tier-0s/vmc_prv",
            "relative_path": "vmc_prv",
            "ha_mode": "ACTIVE_STANDBY",
            "transit_subnets": [ "##.##.##.#/24" ],
            "force_whitelisting": false,
            "_create_user": "admin",
            "_create_time": 1516667421694,
            "_last_modified_user": "admin",
            "_last_modified_time": 1516667421694,
            "_system_owned": false,
            "_protection": "NOT_PROTECTED",
            "_revision": 0
          }
        ]
      }  
    • If no service routers are configured, the routing subsystem on the edge node will be down, and communication over the uplinks will not be possible.
    • If a service router is configured, check for the presence of a BGP, OSPF, or BFD configuration and the state of its sessions. At least one BGP, OSPF, or BFD session must be in the UP/ESTABLISHED state for the routing subsystem to be up.
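
As a rough sketch of the service router check above, the following Python builds the GET request for the Tier-0 list and extracts the gateway IDs from a response body. The manager host name, the credentials, the use of HTTP basic authentication, and both function names are assumptions for illustration; only the API path and the results/id fields come from the sample response above. The request is built but not sent, so certificate handling is left to the reader's environment.

```python
import base64
import json
import urllib.request


def build_tier0_request(manager_host, username, password):
    """Build (but do not send) a GET request for the Tier-0 gateway list.

    manager_host, username, and password are placeholders for your own
    environment; basic authentication is an assumption of this sketch.
    """
    url = f"https://{manager_host}/policy/api/v1/infra/tier-0s"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def tier0_gateway_ids(response_text):
    """Return the "id" of every Tier-0 gateway listed in the response body."""
    body = json.loads(response_text)
    return [gateway["id"] for gateway in body.get("results", [])]
```

An empty list from tier0_gateway_ids corresponds to the case where no Tier-0 gateway, and therefore no Tier-0 service router, is configured.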
  • Check for BGP configuration and the state:
    • Check for the presence of BGP neighbors and the current state of the sessions using the API GET /policy/api/v1/infra/tier-0s/{tier-0-id}/locale-services/{locale-service-id}/bgp/neighbors/status

    • In the above request, replace tier-0-id with the actual name of the T0 gateway. The locale-service-id is usually default.

      Sample Response: GET /policy/api/v1/infra/tier-0s/{tier-0-id}/locale-services/{locale-service-id}/bgp/neighbors/status
      {
        "cursor": "########-####-####-####-############",
        "sort_ascending": true,
        "sort_by": "displayName",
        "result_count": 1,
        "tier0_path": "/infra/sites/default/enforcement-points/default",
        "results": [{
          "edge_path": "/infra/sites/default/enforcement-points/default/edge-clusters/########-####-####-####-############/edge-nodes/########-####-####-####-############",
          "source_address": "##.##.##.#",
          "neighbor_address": "##.##.##.#", ------------------> Neighbor IP Address
          "remote_as_number": "1",
          "remote_port": 179,
          "local_port": 179,
          "connection_status": "CONNECTED", ---------------> Status
          "messages_received": 12,
          "messages_sent": 10,
          "connection_drop_count": 0,
          "hold_time": 180,
          "keep_alive_time": 30,
          "graceful_restart": true,
          "last_updated_timestamp": 11999191991991
        }]
      }
    • If BGP neighbors are configured and none of the sessions is in the ESTABLISHED state, check for the presence of the bgp_down alarm, and follow the recommended actions provided in that alarm.
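
The BGP check above can be scripted once the status response has been parsed into a Python dictionary. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the state field name defaults to connection_status only because that is what the sample response in this article shows. Confirm the field name and its values against the response from your own NSX version before relying on it.

```python
def bgp_neighbor_states(status_body, state_key="connection_status"):
    """Map each neighbor address to its reported session state.

    status_body is the parsed JSON response. state_key is the field that
    carries the session state (assumed from the sample in this article).
    """
    return {
        entry.get("neighbor_address"): entry.get(state_key)
        for entry in status_body.get("results", [])
    }


def any_bgp_session_established(states):
    """Return True if at least one neighbor session is ESTABLISHED."""
    return any(str(state).upper() == "ESTABLISHED" for state in states.values())
```

If any_bgp_session_established returns False while neighbors are configured, that matches the case in which the bgp_down alarm should be checked.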
  • Check for OSPF configuration and the state:
    • Check for the presence of OSPF neighbor sessions using the API GET /policy/api/v1/infra/tier-0s/{tier-0-id}/locale-services/{locale-service-id}/ospf/neighbors.

    • In the above request, replace tier-0-id with the actual name of the T0 gateway. The locale-service-id is usually default.

      Sample Response: GET /policy/api/v1/infra/tier-0s/{tier-0-id}/locale-services/{locale-service-id}/ospf/neighbors
      {
          "gateway_path": "/infra/tier-0s/tier0",
          "last_update_timestamp": 1605794187118,
          "results": [
              {
                  "edge_path": "/infra/sites/default/enforcement-points/default/edge-clusters/########-####-####-####-############/edge-nodes/0",
                  "neighbors": [
                      {
                          "neighbor_address": "##.##.##.#", -------------------> OSPF Neighbor Address
                          "neighbor_status_info": [
                              {
                                  "interface_name": "uplink-270:##.##.##.#",
                                  "source_address": "##.##.##.##",
                                  "priority": 1,
                                  "state": "Full",
                                  "last_state_change": "2d12h37m",
                                  "dead_time": "39.895s",
                                  "retransmit_counter": 0,
                                  "request_counter": 0,
                                  "database_summary_counter": 0
                              }
                          ]
                      }
                  ]
              }
          ],
          "result_count": 2,
          "sort_by": "display_name",
          "sort_ascending": true
      }
    • If OSPF is configured and none of the sessions is in the FULL state, check for the presence of the ospf_neighbor_went_down alarm, and follow the recommended actions provided in that alarm.
    • To check the OSPF neighbor state from the edge node CLI, first invoke the NSX CLI command get logical-routers.

      Sample CLI output: get logical-routers
      Edge1> get logical-routers
      Logical Router
      UUID                                   VRF    LR-ID  Name                              Type                        Ports   Neighbors
      ########-####-####-####-############   0      0                                        TUNNEL                      4       10/5000
      ########-####-####-####-############   1      3      SR-tier0                          SERVICE_ROUTER_TIER0        6       0/50000
      ########-####-####-####-############   3      1      DR-tier0                          DISTRIBUTED_ROUTER_TIER0    6       2/50000
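    • Switch to the service router context using the NSX CLI command vrf <vrf_id_of_service_router>, where the VRF ID is the value in the VRF column of the SERVICE_ROUTER_TIER0 entry in the get logical-routers output.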
    • Invoke the NSX CLI command get ospf neighbor to obtain the Neighbor ID and check the current State.

      Sample CLI output: get ospf neighbor
      nsx-edge-1(tier0_sr)> get ospf neighbor
       
          Neighbor ID     Pri State           Dead Time Address         Interface            RXmtL RqstL DBsmL
          ##.##.##.##       1 Full/DR           30.173s ##.##.##.##     uplink-###:##.##.##.#     0     0     0
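
When the get ospf neighbor output has been captured as text, the state check can be automated. The following is a minimal sketch that assumes the nine-column layout shown in the sample output above; the function names are hypothetical, and the parser would need adjusting if the column layout differs in your release.

```python
def parse_ospf_neighbors(cli_output):
    """Extract neighbor rows from captured "get ospf neighbor" output."""
    neighbors = []
    for line in cli_output.splitlines():
        fields = line.split()
        # Data rows have nine columns with a numeric priority in the second
        # column; this skips the prompt line, blank lines, and the header row.
        if len(fields) == 9 and fields[1].isdigit():
            neighbors.append({
                "neighbor_id": fields[0],
                "state": fields[2],
                "address": fields[4],
                "interface": fields[5],
            })
    return neighbors


def any_ospf_neighbor_full(neighbors):
    """Return True if at least one neighbor is in a Full state (e.g. Full/DR)."""
    return any(n["state"].upper().startswith("FULL") for n in neighbors)
```

A False result from any_ospf_neighbor_full with OSPF configured corresponds to the case in which the ospf_neighbor_went_down alarm should be checked.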
  • Check for BFD configuration and the state:
    • Follow the steps below to check for the presence of BFD sessions.
      • Invoke the NSX CLI command get logical-routers.

        Sample CLI output: get logical-routers
        Edge1> get logical-routers
        Logical Router
        UUID                                   VRF    LR-ID  Name                              Type                        Ports   Neighbors
        ########-####-####-####-############   0      0                                        TUNNEL                      4       10/5000
        ########-####-####-####-############   1      3      SR-tier0                          SERVICE_ROUTER_TIER0        6       0/50000
        ########-####-####-####-############   3      1      DR-tier0                          DISTRIBUTED_ROUTER_TIER0    6       2/50000
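      • Switch to the service router context using the NSX CLI command vrf <vrf_id_of_service_router>, where the VRF ID is the value in the VRF column of the SERVICE_ROUTER_TIER0 entry in the get logical-routers output.
      • Invoke the NSX CLI command get bfd-sessions and verify whether any session's current state is UP.

        Sample CLI output: get bfd-sessions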
        Edge1(tier0_vrf_sr[7])> get bfd-sessions
        BFD Session
        Dest_port                     : 3784 -----------------------------------> Destination Port
        Diag                          : No Diagnostic
        Encap                         : vlan
        Forwarding                    : last true (current true)
        Interface                     : ########-####-####-####-############
        Intf_type                     : LR_PORT
        Keep-down                     : false
        Last_admin_down_diag_time     : 2024-04-17 13:15:18
        Last_cp_diag                  : No Diagnostic
        Last_cp_rmt_diag              : No Diagnostic
        Last_cp_rmt_state             : up
        Last_cp_state                 : up
        Last_down_time                : 2024-04-17 13:15:18
        Last_fwd_state                : UP
        Last_local_down_diag          : Neighbor Signaled Session Down ---------> Edge Diag Code
        Last_remote_admin_down_time   : 2024-04-17 13:15:18
        Last_remote_down_diag         : Administratively Down
        Last_up_time                  : 2024-04-17 13:15:19
        Local_address                 : ##.##.##.# -----------------------------> Local Address
        Local_discr                   : 673456400 ------------------------------> Local Discriminator
        Min_rx_ttl                    : 255
        Multiplier                    : 3
        Received_remote_diag          : No Diagnostic
        Received_remote_state         : up
        Remote_address                : ##.##.##.## ----------------------------> Remote Address
        Remote_admin_down             : false
        Remote_diag                   : No Diagnostic
        Remote_discr                  : 4097 -----------------------------------> Remote Discriminator
        Remote_min_rx_interval        : 1000
        Remote_min_tx_interval        : 1000
        Remote_multiplier             : 3
        Remote_state                  : up
        Router                        : ########-####-####-####-############
        Router_down                   : false
        Rx_cfg_min                    : 500
        Rx_interval                   : 1000
        Service-link                  : false
        Session_type                  : UPLINK
        State                         : up -------------------------------------> State
        Tx_cfg_min                    : 500 ------------------------------------> Configured Transmit Min Interval
        Tx_interval                   : 1000 -----------------------------------> Transmit Interval
        Type                          : IPv4
      • If BFD sessions are configured and none of the sessions is in the UP state, check for the presence of the bfd_down_on_external_interface alarm, and follow the recommended actions provided in that alarm.
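
The BFD check can be scripted in the same way from captured get bfd-sessions output. The sketch below assumes that each session begins with a "BFD Session" line followed by "Key : value" lines, as in the sample above; the function names are hypothetical. The arrow annotations in the sample are explanatory notes added in this article and are not expected in real CLI output.

```python
def parse_bfd_sessions(cli_output):
    """Split captured "get bfd-sessions" output into one dict per session."""
    sessions = []
    current = None
    for line in cli_output.splitlines():
        stripped = line.strip()
        if stripped == "BFD Session":
            # A new session block starts here (assumed delimiter).
            current = {}
            sessions.append(current)
        elif ":" in stripped and current is not None:
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return sessions


def any_bfd_session_up(sessions):
    """Return True if at least one session reports State as up."""
    return any(s.get("State", "").lower() == "up" for s in sessions)
```

A False result from any_bfd_session_up with BFD sessions configured corresponds to the case in which the bfd_down_on_external_interface alarm should be checked.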

Is a maintenance window required for remediation?

No