HCX Extended VLANs with MON Enabled Fail to Pass Traffic After CGW Service Restart

Article ID: 429789


Products

VMware HCX

Issue/Introduction

  • Following a patch to HCX 4.11.3.0, extended VLANs with Mobility Optimized Networking (MON) enabled may experience traffic loss.
    • While most segments recover, specific VLANs may remain down.

Symptoms:

  • HCX Network Extension (NE) appliance services (ZMQ/CGW) crash and restart.

  • Log pattern found in /var/log/messages on the NE appliance: panic: RestartZMQ: Failed to bind(tcp://127.0.0.1:5500), err: system call interrupted.

  • Traffic fails for specific VLANs while others on the same NE pair recover.

  • Absence of "learned MAC change" entries for the impacted VLAN in the Network Extension appliance log (/var/log/messages).

    <timestamp> NE-R1 cgw 10378 - - [Info-arper] : New shadow IP (IP rule) = {tapbr4 00:50:56:##:##:## ff:ff:ff:ff:ff:ff 1 00:50:56:##:##:## 00:00:00:00:00:00 #.#.#.#  #.#.#.#}
    <timestamp> NE-R1 cgw 10378 - - [Info-arper] : New shadow IP (IP rule) = {tapbr4 00:50:56:##:##:## ff:ff:ff:ff:ff:ff 1 00:50:56:##:##:## 00:00:00:00:00:00 #.#.#.#  #.#.#.#}
    <timestamp> NE-R1 cgw 10378 - - [Info-arper] : New shadow IP (IP rule) = {tapbr3 00:50:56:##:##:## ff:ff:ff:ff:ff:ff 1 00:50:56:##:##:## 00:00:00:00:00:00 #.#.#.#  #.#.#.#}
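Both log symptoms can be checked in one pass. The following is a minimal sketch, assuming a readable copy of /var/log/messages and the log formats shown in this article. The function names are illustrative and are not part of HCX.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not HCX tooling.

# Count ZMQ bind panics (the "RestartZMQ: Failed to bind" pattern).
# $1 = path to the log file.
count_zmq_panics() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so the exit status is deliberately ignored.
    grep -c 'panic: RestartZMQ: Failed to bind' "$1" || true
}

# Count "learned MAC change" entries for one tap interface.
# $1 = path to the log file, $2 = interface name such as tapbr4.
count_mac_learned() {
    # The trailing comma keeps tapbr3 from also matching tapbr30.
    grep 'learned MAC change' "$1" | grep -c "Intf: $2," || true
}
```

For example, `count_zmq_panics /var/log/messages` followed by `count_mac_learned /var/log/messages tapbr4`. A non-zero panic count together with a zero learned-MAC count for the affected interface is consistent with the symptoms described here.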

Environment

  • VMware HCX 4.11.#

  • NSX-T / NSX 4.x

  • Mobility Optimized Networking (MON) Enabled

 

Cause

  • The issue is triggered by a crash of the ZMQ (Messaging Service) and CGW (Configuration Agent) within the HCX NE appliance.
  • Upon auto-remediation, the CGW service must re-learn the On-Premises Gateway MAC address via ARP probing to maintain MON-optimized traffic steering.
    • In some instances, the On-Premises Gateway fails to respond to these probes, or the CGW fails to process the response, leaving the ARP entry as 00:00:00:00:00:00.
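To see which extended segments are in this state, the "New shadow IP" entries can be filtered for the all-zero MAC. This is a rough sketch, assuming the all-zero field in those entries is the unresolved gateway MAC, as the sample logs in this article suggest. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes the all-zero MAC in a "New shadow IP" entry
# is the gateway MAC that the CGW has not yet resolved.

# Print each tap interface (for example tapbr3) that has at least one
# "New shadow IP" entry containing 00:00:00:00:00:00.
# $1 = path to the log file.
unresolved_taps() {
    grep 'New shadow IP' "$1" \
        | grep '00:00:00:00:00:00' \
        | sed -n 's/.*{\(tapbr[0-9][0-9]*\) .*/\1/p' \
        | sort -u
}
```

Running `unresolved_taps /var/log/messages` prints one interface name per line. Each name is printed once, however many matching entries it has.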

Resolution

This issue is resolved in HCX 4.11.4, available at Broadcom downloads.
Refer to the VMware HCX 4.11.4 Release Notes.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

 

  • Verify Service Health: Access the HCX Manager and check for Network-Extension appliance stability. Review /var/log/messages for ZMQ/CGW panic or restart events.

  • Verify MAC Learning: Log in to the impacted NE-R appliance via CCLI and check whether the gateway MAC has been learned for the failing segment:

    • Navigate to HCX Cloud Manager Admin Console.

    • Enter CCLI -> list -> go <Appliance_Number> -> ssh.

    • Run: grep -r "learned MAC change" /var/log/messages

    • Check for the impacted Bridge ID (e.g., br5) to see if the NewMac is populated.

      • To list the bridges on the appliance, run: brctl show

Example of an entry for a segment where the gateway MAC has not been learned (00:00:00:00:00:00):

<timestamp> NE-R1 cgw 10378 - - [Info-arper] : New shadow IP (IP rule) = {tapbr3 00:50:56:##:##:## ff:ff:ff:ff:ff:ff 1 00:50:56:##:##:## 00:00:00:00:00:00 #.#.#.#  #.#.#.#}
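The per-bridge check above can be repeated for every bridge instead of one at a time. This is a minimal sketch, assuming the bridge names printed by brctl show match the BrId values in the log (for example br3). Both function names are illustrative.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not HCX tooling.

# Read `brctl show` output on stdin and print one bridge name per line.
# The header line is skipped. Continuation lines, which list extra
# interfaces, start with whitespace and are ignored.
bridge_names() {
    awk 'NR > 1 && NF > 0 && $0 !~ /^[ \t]/ { print $1 }'
}

# Read bridge names on stdin and report, for each one, whether the log
# contains a "learned MAC change" entry for it.
# $1 = path to the log file.
report_mac_learning() {
    log="$1"
    while read -r br; do
        if grep 'learned MAC change' "$log" | grep -q "BrId: $br,"; then
            echo "$br learned"
        else
            echo "$br missing"
        fi
    done
}
```

On the appliance this would be run as `brctl show | bridge_names | report_mac_learning /var/log/messages`. A bridge reported as "missing" is a candidate for the workaround described in this article.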


Workaround:

  • Perform HA Failover: If a specific VLAN remains down and the MAC address is not being learned (remaining 00:00:00:00:00:00), initiate an HA Failover of the Network Extension appliance pair. This forces a fresh initialization of the bridge ports and ARP resolution.
  • After the failover, the appliance should learn the on-premises gateway MAC. A log entry such as the following confirms it:
    <timestamp> UTC NE-R1 cgw 10378 - - [Info-configer] : l2ArpResolver: learned MAC change: <BrId: br3, Intf: tapbr3, Ip: 10.105.62.1, NewMac: <Learned GW MAC>, OldMac: 00:00:00:00:00:00, When: <timestamp> +0000 UTC>
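To confirm recovery after the failover without reading the log by eye, the "learned MAC change" entries can be reduced to their key fields. This is a minimal sketch, assuming the entry layout shown in the sample above (BrId, Intf, Ip, NewMac, OldMac in that order). The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: relies on the field order of the sample log entry.

# Print "bridge ip oldmac newmac" for every "learned MAC change" entry.
# $1 = path to the log file.
mac_changes() {
    sed -n 's/.*learned MAC change: <BrId: \([^,]*\), Intf: [^,]*, Ip: \([^,]*\), NewMac: \([^,]*\), OldMac: \([^,]*\),.*/\1 \2 \4 \3/p' "$1"
}

# Succeed only if the most recent entry for the given bridge carries a
# NewMac other than 00:00:00:00:00:00.
# $1 = path to the log file, $2 = bridge ID such as br3.
gateway_learned() {
    new=$(mac_changes "$1" | awk -v br="$2" '$1 == br { mac = $4 } END { print mac }')
    [ -n "$new" ] && [ "$new" != "00:00:00:00:00:00" ]
}
```

For example, `gateway_learned /var/log/messages br3 && echo "br3 recovered"`. The check fails if the bridge has no entry at all, which matches the failure state this article describes.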
    

     

If the issue persists or a root cause analysis (RCA) is required, engage Broadcom Support. See Creating and managing Broadcom cases.

Additional Information