Intermittent Connectivity Issues Observed for VMs Attached to NSX Overlay Segments

Article ID: 406572

Updated On:

Products

VMware NSX

Issue/Introduction

  • VMs attached to NSX overlay segments may experience connectivity issues after being vMotioned.
  • cfgAgent may log MAC/ARP updates that point to an incorrect TEP in the host log nsx-syslog.log:
    cfgAgent[2103498]: NSX 2103498 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="9BFB9700" level="info"] VXLAN Message         ARP (v4) Update: len = 56       SwitchID:0, VNI:69633   Num of entries: 1               #0      VM IP:192.168.##.##    VM MAC:00:50:56:##:##:aa       VTEP address type: 1    VTEP IP:10.##.##.#1     VTEP IPv6:0000:0000:0000:0000:0000:0000:0000:0000   VTEP MAC:00:50:56:##:##:cc     --->     INCORRECT TEP
    2025-07-04T14:22:26.251Z cfgAgent[2103498]: NSX 2103498 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="9BFB9700" level="info"] VXLAN Message         ARP (v4) Update: len = 56       SwitchID:0, VNI:69633   Num of entries: 1               #0      VM IP:192.168.##.##    VM MAC:00:50:56:##:##:aa       VTEP address type: 1    VTEP IP:10.##.##.#2     VTEP IPv6:0000:0000:0000:0000:0000:0000:0000:0000   VTEP MAC:00:50:56:##:##:bb     --->     CORRECT TEP
    
    cfgAgent[2103498]: NSX 2103498 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="9BFB9700" level="info"] VXLAN Message         VM MAC Update: len = 54         SwitchID:0, VNI:69633   Num of removed entries: 0       Num of added entries: 1                 #0      VM MAC:00:50:56:##:##:aa       VTEP address type: 1    VTEP IP:10.##.##.#1     VTEP IPv6:0000:0000:0000:0000:0000:0000:0000:0000   VTEP MAC:00:50:56:##:##:cc     --->     INCORRECT TEP
    cfgAgent[2103498]: NSX 2103498 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="9BFB9700" level="info"] VXLAN Message         VM MAC Update: len = 54         SwitchID:0, VNI:69633   Num of removed entries: 0       Num of added entries: 1                 #0      VM MAC:00:50:56:##:##:aa       VTEP address type: 1    VTEP IP:10.##.##.#2     VTEP IPv6:0000:0000:0000:0000:0000:0000:0000:0000   VTEP MAC:00:50:56:##:##:bb     --->     CORRECT TEP
  • NSX-RPC keepalive failures are logged in the host log nsx-syslog.log:
    nsx-opsagent[2104023]: NSX 2104023 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsx-rpc" tid="2104218" level="ERROR" errorCode="RPC31"] RpcConnection[41561 Negotiating to tcp://127.0.0.1:4554 0] Keepalive failed - haven't received response in time (last request was sent 60 seconds ago, response received - never)
    cfgAgent[2103498]: NSX 2103498 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" s2comp="nsx-rpc" tid="9B19D700" level="error" errorCode="RPC31"] RpcConnection[68925 Negotiating to tcp://127.0.0.1:4554 0] Keepalive failed - haven't received response in time (last request was sent 59 seconds ago, response received - never)
    nsx-proxy[2103518]: NSX 2103518 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="2103684" level="ERROR" errorCode="RPC31"] RpcConnection[107939 Negotiating to tcp://127.0.0.1:4554 0] Keepalive failed - haven't received response in time (last request was sent 60 seconds ago, response received - never)
    nsx-opsagent[2103194]: NSX 2103194 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsx-rpc" tid="2103265" level="ERROR" errorCode="RPC31"] RpcConnection[241 Connected on tcp://127.0.0.1:4554 0] Keepalive failed - haven't received response in time (last request was sent 59 seconds ago, response received - 299 seconds ago)
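Both log signatures above can be checked with a short shell sketch. It assumes a POSIX shell with awk and assumes the log is at /var/log/nsx-syslog.log on the host. The function names keepalive_failures and macs_with_multiple_vteps are hypothetical helpers for illustration, not NSX tools. The parsing relies only on the field layout shown in the sample log lines above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, and the log path used in the
# usage comment at the bottom is an assumption.

# Count NSX-RPC keepalive failures (errorCode RPC31) per reporting subcomp.
# Prints "<count> <subcomp>" lines, sorted by subcomp name.
keepalive_failures() {
    awk '/errorCode="RPC31"/ && /Keepalive failed/ {
        if (match($0, /subcomp="[^"]*"/)) {
            # Strip the leading subcomp=" (9 chars) and the closing quote.
            name = substr($0, RSTART + 9, RLENGTH - 10)
            count[name]++
        }
    }
    END { for (n in count) print count[n], n }' "$1" | sort -k2
}

# List VM MACs that cfgAgent reported behind more than one VTEP IP.
# Prints "<vm_mac> <vtep_ip>,<vtep_ip>,..." in order of first appearance.
# A MAC listed here is only a candidate for review: a normal vMotion also
# moves a MAC between TEPs.
macs_with_multiple_vteps() {
    awk '/subcomp="cfgAgent"/ && /VXLAN Message/ {
        if (!match($0, /VM MAC:[^ \t]+/)) next
        mac = substr($0, RSTART + 7, RLENGTH - 7)
        if (!match($0, /VTEP IP:[^ \t]+/)) next
        vtep = substr($0, RSTART + 8, RLENGTH - 8)
        if (!((mac, vtep) in seen)) {
            seen[mac, vtep] = 1
            if (n[mac]++) list[mac] = list[mac] "," vtep
            else list[mac] = vtep
        }
    }
    END { for (m in n) if (n[m] > 1) print m, list[m] }' "$1" | sort
}

# Usage (path is an assumption):
#   keepalive_failures /var/log/nsx-syslog.log
#   macs_with_multiple_vteps /var/log/nsx-syslog.log
```

A MAC that keeps reappearing behind a TEP the VM no longer runs on, together with a high keepalive failure count, matches the symptoms described in this article.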

Environment

VMware NSX

Cause

When NSX transport nodes experience frequent NSX-RPC keepalive failures, a vMotion can leave a duplicate MAC/IP entry in cfgAgent's LogSwitchStateMsg state in nestDB. The duplicate entry maps the VM to the wrong TEP, so the VM loses network connectivity.

Resolution

This issue is resolved in VMware NSX 4.2.3, VCF 9.0.1, VCF 9.1, and later releases, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.



Workaround

  • If LogSwitchStateMsg has stale forwarding entries, restart the cfgAgent.

    Command to check LogSwitchStateMsg in nestDB:
    /opt/vmware/nsx-nestdb/bin/nestdb-cli --cmd get vmware.nsx.nestdb.LogSwitchStateMsg --beautify --json
    Example output:
            {
                "object_type" : "vmware.nsx.nestdb.LogSwitchStateMsg",
                "value" : {
                    "id" : "b8c9####-0c14-####-9a4b-5736####fb95",
                    "vtep" : [
                        {
                            "vtep_ip" : "10.##.##.#1",
                            "vtep_label" : {
                                "label" : 69633
                            },
                            "segment_id" : "10.####",
                            "vtep_mac" : "00:50:56:##:##:cc"
                        }
                    ],
                    "mac" : [
                        {
                            "mac" : "00:50:56:##:##:aa",        <-- VM MAC
                            "vtep_ip" : "10.##.##.#1",          <-- should be the VM's actual destination TEP
                            "vtep_mac" : "00:50:56:##:##:cc"
                        }
                    ]
                }
            }

    Command to perform cfgAgent restart:
    /etc/init.d/nsx-cfgagent restart

  • To help prevent the issue from recurring, disable excessive DFW logging, which is a known cause of NSX-RPC keepalive failures on NSX transport nodes.
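The check in the first workaround step can be made easier to read with a small filter that reduces the LogSwitchStateMsg dump to MAC and VTEP pairs. This is a minimal sketch: list_mac_entries is a hypothetical helper name, and it assumes the beautified JSON prints one key per line with "mac" before "vtep_ip" inside each MAC entry, as in the example output above. Whether an entry is stale still has to be judged by comparing the printed VTEP IP against the TEP of the host where the VM actually runs.

```shell
#!/bin/sh
# Sketch only: list_mac_entries is a hypothetical helper, and the parsing
# assumes the one-key-per-line layout shown in the example output.

# Read beautified LogSwitchStateMsg JSON on stdin and print one
# "<vm_mac> <vtep_ip>" line per entry of the "mac" array.
list_mac_entries() {
    awk -F'"' '
        # A "mac" key with a string value; the "mac" : [ array line has
        # fewer quote-separated fields and is skipped.
        $2 == "mac" && NF >= 5 { mac = $4; next }
        # The first "vtep_ip" after a MAC belongs to that MAC entry.
        # "vtep_ip" keys in the "vtep" array are ignored (no MAC pending).
        $2 == "vtep_ip" && mac != "" { print mac, $4; mac = "" }
    '
}

# Usage on the host, with the commands from this article:
#   /opt/vmware/nsx-nestdb/bin/nestdb-cli --cmd get vmware.nsx.nestdb.LogSwitchStateMsg --beautify --json | list_mac_entries
# If an entry maps a VM MAC to a TEP other than its actual destination TEP:
#   /etc/init.d/nsx-cfgagent restart
```

The restart is left as a manual step so that cfgAgent is only restarted after a stale entry has been confirmed.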