Determine the root cause of the tunnel flapping and instability on the HCX IX, NE, and other appliances, and fix it. Also analyze the high memory alerts on all hosts.
Tunnels on the HCX IX, NE, and other appliances are flapping or unstable, and high memory usage alerts are raised on all hosts in the cluster.
HCX Manager logs show that the appliances had insufficient memory, which made them unstable.
system_events
{"id":60002,"level":6,"timestamp":1690140168,"UTC":"<date> <time> +0000 UTC","message":"Memory usage is high","metadata":{"cache":" ","free":" ","total":" ","used":" "}}
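To check how often this event occurs, the system_events records can be filtered by event id. The following is a minimal sketch, assuming the records are available as one JSON object per line; the function name high_memory_events and the sample list are illustrative and not part of HCX.

```python
import json
from datetime import datetime, timezone

# Event id taken from the system_events record shown above.
HIGH_MEMORY_EVENT_ID = 60002


def high_memory_events(lines):
    """Yield (utc_datetime, message) for each high-memory system event.

    Assumes one JSON object per line; lines that are not JSON objects are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON record, ignore it
        if not isinstance(event, dict):
            continue
        if event.get("id") == HIGH_MEMORY_EVENT_ID and "timestamp" in event:
            # "timestamp" is treated as epoch seconds (an assumption based on its value).
            when = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
            yield when, event.get("message", "")


# Sample input: the record from the log excerpt, plus a non-JSON line.
sample = [
    '{"id":60002,"level":6,"timestamp":1690140168,'
    '"UTC":"<date> <time> +0000 UTC","message":"Memory usage is high",'
    '"metadata":{"cache":" ","free":" ","total":" ","used":" "}}',
    "not a json line",
]

events = list(high_memory_events(sample))
for when, message in events:
    print(when.isoformat(), message)
```

Counting the yielded events per hour or per day gives a rough view of when the memory pressure started and whether it is still ongoing.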
Appliance messages log (NE and IX):
NE-R1 GatewayLogs[1052]: [Info-opsEvent] : new system event: SystemEvent[<date> <time> +0000 UTC, <date> <time> +0000 UTC, 60002, critical, Memory usage is high, map[cache: free: total: used: ]]
NE-R1 GatewayLogs[1052]: [Warning-ops] : Memory usage is probably high (free: %3)
SM-IX-R1 GatewayLogs[1136]: [Info-opsEvent] : new system event: SystemEvent[<date> <time> +0000 UTC, <date> <time> +0000 UTC, 60002, critical, Memory usage is high, map[cache: free: total: used: ]]
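The same check can be applied to the appliance messages logs to see which appliances reported the event. This is a rough sketch, assuming the log lines follow the exact layout of the excerpts above; the regular expression and the function name count_high_memory_events are illustrative and may need adjusting for other HCX versions.

```python
import re
from collections import Counter

# Pattern derived from the excerpts above (an assumption about the log layout):
# <appliance> GatewayLogs[<pid>]: [Info-opsEvent] : new system event:
#   SystemEvent[<start>, <end>, <id>, <severity>, <message>, map[...]]
EVENT_RE = re.compile(
    r"^(?P<appliance>\S+) GatewayLogs\[\d+\]: \[Info-opsEvent\] : "
    r"new system event: SystemEvent\[.*?, (?P<event_id>\d+), "
    r"(?P<severity>\w+), (?P<message>[^,]+),"
)


def count_high_memory_events(lines, event_id="60002"):
    """Return a Counter of matching system events per appliance name."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match and match.group("event_id") == event_id:
            counts[match.group("appliance")] += 1
    return counts


# Sample input: the three log lines quoted above.
sample_log = [
    "NE-R1 GatewayLogs[1052]: [Info-opsEvent] : new system event: "
    "SystemEvent[<date> <time> +0000 UTC, <date> <time> +0000 UTC, 60002, "
    "critical, Memory usage is high, map[cache: free: total: used: ]]",
    "NE-R1 GatewayLogs[1052]: [Warning-ops] : Memory usage is probably high (free: %3)",
    "SM-IX-R1 GatewayLogs[1136]: [Info-opsEvent] : new system event: "
    "SystemEvent[<date> <time> +0000 UTC, <date> <time> +0000 UTC, 60002, "
    "critical, Memory usage is high, map[cache: free: total: used: ]]",
]

per_appliance = count_high_memory_events(sample_log)
for appliance, count in sorted(per_appliance.items()):
    print(appliance, count)
```

If every appliance on a given host reports the event in the same time window, that points to host-level memory contention rather than a problem with a single appliance.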
The issue was caused by memory contention on the hosts where the appliances are running. The contention may be due to high memory usage by the workload VMs, which affects the stability of the HCX appliances.
The tunnel flapping may reappear if memory conditions on the ESXi hosts remain unstable. Keep host memory contention low so that the HCX appliances are not impacted.
Critical impact on the segments extended with HCX NE, which affects workload connectivity. Extending new segments might fail due to NE tunnel flapping.