Traffic handled by services on the service interface, such as Load Balancer and NAT, is impacted. The Load Balancer VIP reports a status of "Down" and is not pingable, although the backend pool members are reachable.
When reviewing the service interface the VIP is bound to, the VIP address is no longer displayed in the proxy_arp field of the Edge Service Router service interface.
Packet captures on the ESXi uplinks used by the datapath interfaces of the Edge VM show that ARP replies are not being sent for the proxy_arp addresses.
Note: In the example below, 192.168.179.180 is the VIP and should be in the proxy_arp list.
[root@esx-02:~] pktcap-uw --uplink vmnic0 --capture UplinkSndKernel,UplinkRcvKernel --mac 00:50:56:9d:33:6a --ethtype 0x0806 --ng -o - | tcpdump-uw -enr -
The name of the uplink is vmnic0.
The session capture point is UplinkSndKernel,UplinkRcvKernel.
The session filter mac(src or dst) is 00:50:56:9d:33:6a.
The session filter Ethernet type is 0x0806.
The output file format is pcapng.
pktcap: The output file is -.
pktcap: No server port specifed, select 59368 as the port.
pktcap: Local CID 2.
pktcap: Listen on port 59368.
pktcap: Main thread: 371054500736.
pktcap: Dump Thread: 371055036160.
pktcap: Recv Thread: 371055564544.
reading from file -, link-type EN10MB (Ethernet)
pktcap: Accept...
pktcap: Vsock connection from port 1046 cid 2.
13:57:13.904505 00:50:56:01:06:ce > 00:50:56:9d:33:6a, ethertype 802.1Q (0x8100), length 64: vlan 132, p 0, ethertype ARP, Request who-has 192.168.132.1 tell 192.168.132.254, length 46
13:57:13.904910 00:50:56:9d:33:6a > 00:50:56:01:06:ce, ethertype 802.1Q (0x8100), length 64: vlan 132, p 0, ethertype ARP, Reply 192.168.132.1 is-at 00:50:56:9d:33:6a, length 46
13:57:25.680277 00:50:56:01:06:ce > 00:50:56:9d:33:6a, ethertype 802.1Q (0x8100), length 64: vlan 134, p 0, ethertype ARP, Request who-has 192.168.179.180 tell 192.168.179.129, length 46
13:57:25.680748 00:50:56:9d:33:6a > 00:50:56:01:06:ce, ethertype 802.1Q (0x8100), length 64: vlan 134, p 0, ethertype ARP, Reply 192.168.179.180 is-at 00:50:56:9d:33:6a, length 46 >>> This is the working state. When the issue occurs, this proxy_arp reply is not seen.
13:57:55.010416 00:50:56:9d:33:6a > 00:50:56:01:06:ce, ethertype 802.1Q (0x8100), length 64: vlan 134, p 0, ethertype ARP, Request who-has 192.168.179.129 (00:50:56:01:06:ce) tell 192.168.179.190, length 46
13:57:55.010764 00:50:56:01:06:ce > 00:50:56:9d:33:6a, ethertype 802.1Q (0x8100), length 64: vlan 134, p 0, ethertype ARP, Reply 192.168.179.129 is-at 00:50:56:01:06:ce, length 46
-----------------------------------------------------------
[root@esx-02:~] net-stats -l | grep 00:50:56:9d:33:6a
67108878 5 9 DvsPortset-0 00:50:56:9d:33:6a NSX-T Edge 01.eth2
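To narrow the capture to the VIP only, the same pktcap-uw session can be filtered on the tcpdump-uw side. This is a sketch using the MAC and VIP values from this example; because the frames are 802.1Q tagged, the vlan keyword is placed before the ARP host filter:
pktcap-uw --uplink vmnic0 --capture UplinkSndKernel,UplinkRcvKernel --mac 00:50:56:9d:33:6a --ethtype 0x0806 --ng -o - | tcpdump-uw -enr - vlan and arp host 192.168.179.180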
/opt/vmware/nsx-nestdb/bin/nestdb-cli --cmd get vmware.nsx.nestdb.ArpTableConfigMsg --beautify --json
{
  "result" : [
    {
      "object_type" : "vmware.nsx.nestdb.ArpTableConfigMsg",
      "value" : {
        "entry" : [
          {
            "ip" : "192.168.179.180",
            "mac" : "0",
            "type" : "PROXY_ARP_ENTRY"
          }
        ],
        "log_router_port_id" : "da73ef7a-ab81-4ff0-a957-ae0ab8d0bf3f",
        "id" : "a5845dcc-4076-46eb-8a48-52f11814dc48"
      }
    }
  ]
}
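To list just the proxy-ARP IPs from that output, the same nestdb-cli command can be piped through grep (a simple illustrative filter; in the failure state, 192.168.179.180 is missing from the result):
/opt/vmware/nsx-nestdb/bin/nestdb-cli --cmd get vmware.nsx.nestdb.ArpTableConfigMsg --beautify --json | grep '"ip"'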
set debug
get mediator dump <file_name> >>> After running, check /var/vmware/file-store/<file_name> for the dump.
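Once the dump has been generated, it can be searched for the VIP to confirm whether the ArpTableConfig entry is present. This is illustrative; <file_name> is the placeholder used above, and the dump format may vary by version:
grep -i "192.168.179.180" /var/vmware/file-store/<file_name>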
$ zgrep -h "ping-latency" /var/log/corfu/corfu-metrics.* | grep "2023-03-28" | grep -v "mean=[0-9]\." | grep -v "mean=[0-4][0-9]\." | sort --key=1 | sed 's/,/ /g' | sed 's/=/ /g' | awk '{print $1 ", " $3 ", " $15}'
2023-03-28T01:50:43.842Z, failure-detector_ping-latency, 296.977376 >>> The latency here is well above the 10 ms requirement.
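A minimal variant of the pipeline above that prints every sample whose mean exceeds the 10 ms requirement, assuming (as the original command does) that the mean value lands in field 15 after the sed substitutions:
zgrep -h "ping-latency" /var/log/corfu/corfu-metrics.* | sed 's/[,=]/ /g' | awk '$15+0 > 10 {print $1 ", " $3 ", " $15}'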
watch -n 30 'ls -lathr /config/corfu/LAYOUT*'
---------------------------------------------
Every 30.0s: ls -lathr /config/corfu/LAYOUT* nsx-mngr-01: Thu Apr 6 21:50:49 2023
-rw-r----- 1 corfu corfu 435 Mar 31 16:22 /config/corfu/LAYOUTS_0.ds
-rw-r----- 1 corfu corfu 435 Mar 31 17:58 /config/corfu/LAYOUT_CURRENT.ds >>> Watch the timestamp on LAYOUT_CURRENT.ds; when it updates, a Corfu layout change has occurred.
-rw-r----- 1 corfu corfu 435 Mar 31 17:58 /config/corfu/LAYOUTS_1.ds
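As an alternative to eyeballing the watch output, a small shell loop can log a timestamped line every time LAYOUT_CURRENT.ds is rewritten. This is a sketch, assuming GNU stat is available on the Manager appliance:
# Log a line whenever the Corfu layout file's modification time changes
last=""
while true; do
  cur=$(stat -c %y /config/corfu/LAYOUT_CURRENT.ds)
  if [ "$cur" != "$last" ]; then
    echo "$(date -u +%FT%TZ) LAYOUT_CURRENT.ds updated: $cur"
    last="$cur"
  fi
  sleep 30
done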
2023-03-28T14:44:51.595Z INFO FullSyncIdsLoader AbstractFullStateSyncDataBuilder 8386 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Providers [ArpTableConfig, BfdPeer, BgpNeighbor, BridgeCluster, ComputeCollectionProperties, Crl, DADProfile, DStaticRouteConfig, DhcpIpPool, DhcpRelay, DhcpStaticBinding, DhcpV6IpPool, DhcpV6StaticBinding, DiscoveredNodeParent, DnsForwarder, EVPNService, EVPNTunnelEndpoint, EastWestSIPolicy, EdgeCluster, ExcludeList, FQDNAnalysisConfig, FirewallSchedule, FirewallSection, FirewallUrlCategorizationConfig, GiNodeSolutionInfo, HostDiscoveredNodeRelationship, IDSComputeCollectionSettings, IDSProfile, IDSSection, IDSSignatures, IDSStandaloneHostsSettings, IPPrefixList, IPSecVPNDPDProfile, IPSecVPNIKEProfile, IPSecVPNLocalEndpoint, IPSecVPNPeerEndpoint, IPSecVPNService, IPSecVPNSession, IPSecVPNTunnelProfile, IPSet, IdfwEnabledComputeCollection, IdfwMasterSwitchSetting, IdfwStandaloneHostsSetting, InstanceRuntime, InterSiteL2Forwarder, IpfixCollectorConfig, IpfixConfigModel, IpfixSwitchGlobalConfigModel, L2VpnService, L2VpnSession, LOG_ROUTER, LOG_ROUTER_PORT, LOG_SERVICE_ROUTER_CLUSTER, LRQoSProfile, Layer3SIPolicy, LbRule, LbTcpProfile, LoadBalancerApplicationProfile, LoadBalancerMonitor, LoadBalancerPersistenceProfile, LoadBalancerPool, LoadBalancerService, LoadBalancerSslProfile, LoadBalancerVirtualServer, LogicalDhcpServer, LogicalPort, LogicalRouter, LogicalSwitch, MACSet, MetadataProxy, MplsConfig, MulticastConfig, NDRAProfile, NSGroup, NSGroupMembershipStatusEvent, NSProfile, NorthSouthSIPolicy, NsxGlobalConfigEntity, OspfInterfaceConfig, PBRRule, PBRSection, PortMirroringSession, RealizationStateBarrier, RouteMap, RoutingBfdConfig, SIExcludeList, ServiceChain, ServiceConfig, ServiceDefinition, ServiceInsertionRule, ServiceInsertionSection, ServiceProfile, TransportNodeState, UpmProfile]
2023-03-28T14:44:51.595Z INFO FullSyncIdsLoader AbstractFullStateSyncDataBuilder 8386 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading IDs from ArpTableConfig
2023-03-28T14:44:52.103Z WARN FullSyncIdsLoader AbstractView 8386 Got a wrong epoch exception, updating epoch to 4020 and invalidate view
2023-03-28T14:44:52.103Z INFO FullSyncIdsLoader AbstractView 8386 layoutHelper: Retried 0 times, SystemDownHandlerTriggerLimit = 60
2023-03-28T14:44:53.108Z WARN FullSyncIdsLoader AbstractView 8386 Got a wrong epoch exception, updating epoch to 4020 and invalidate view
2023-03-28T14:44:53.108Z INFO FullSyncIdsLoader AbstractView 8386 layoutHelper: Retried 1 times, SystemDownHandlerTriggerLimit = 60
***
2023-03-28T14:46:09.757Z WARN FullSyncIdsLoader AbstractView 8386 Server still not ready. Waiting for server to start accepting requests.
2023-03-28T14:46:09.757Z INFO FullSyncIdsLoader AbstractView 8386 layoutHelper: Retried 0 times, SystemDownHandlerTriggerLimit = 60
2023-03-28T14:46:11.105Z INFO FullSyncIdsLoader CorfuCompileProxy 8386 accessInner: Encountered trimmed address space while accessing version 1292775699 of stream f4d4cd92-067f-3427-9d5b-1ff763257067 on attempt 0
2023-03-28T14:46:11.166Z INFO StateSyncShardingCoordinator-leadership-executor AbstractFullStateSyncDataBuilder 8386 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Stopping FullStateSyncDataBuilder
INFO | jvm 1 | 2023/03/28 14:46:14 | Exception in thread "FullSyncIdsLoader"
INFO | jvm 1 | 2023/03/28 14:46:14 | org.corfudb.runtime.exceptions.unrecoverable.UnrecoverableCorfuInterruptedError: Interrupted while completing future
2023-03-28T14:46:45.923Z INFO StateSyncShardingCoordinator-leadership-executor AbstractFullStateSyncDataBuilder 8386 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Activating FullStateSyncDataBuilder
2023-03-28T09:03:11.225Z INFO MessageProcessor FullSyncTransactionCreator 11137 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="adapter-mp"] Full sync entity update for ARP_TABLE_CONFIG: Added 0, Deleted 0, Updated 6
2023-03-28T14:47:41.368Z INFO MessageProcessor FullSyncTransactionCreator 11137 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="adapter-mp"] Full sync entity update for ARP_TABLE_CONFIG: Added 0, Deleted 6, Updated 0
2023-03-28T14:47:00.463Z WARN MessageProcessor ConfigStoreEntityTypes 771 - [nsx@6876 comp="nsx-controller" level="WARNING" subcomp="adapter-mp"] Invalid NsxConfigMsg processed. Entity type = ARP_TABLE_CONFIG, field number = 54, NsxConfigMsg = operation: DELETE entity_id { 43bc6afa-1d50-4111-acae-2b9ee3efa025 } entity_type: ARP_TABLE_CONFIG
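To locate these full-sync events on a Manager node, the logs can be searched for the same message. This is a sketch; the exact log file name varies by version, so the whole log directory is searched here:
grep -r "Full sync entity update for ARP_TABLE_CONFIG" /var/log/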
The root cause of this issue is sustained network latency between NSX-T Manager nodes that exceeds the NSX-T system requirement of less than 10 ms.
High network latency between NSX-T Manager nodes can result in frequent cluster reconfiguration events within CorfuDB, as evidenced by the observed Corfu layout changes.
The preceding log excerpts document the state in which the ArpTableConfig was read as empty.
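In addition to the corfu-metrics check above, a plain ICMP test between Manager nodes can serve as a quick latency spot check. This is illustrative; <peer-manager-ip> is a placeholder for another Manager node's IP, and the avg value in the rtt summary should stay below 10 ms:
ping -c 20 <peer-manager-ip>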
In NSX-T 3.2.x, with architecture improvements, the StateSync component was no longer needed and was removed from the product. As such, this scenario does not apply to NSX-T 3.2.x and later releases.
In NSX-T 3.1.x and earlier, ensure that NSX-T Manager nodes meet the system requirement of less than 10 ms of network latency between nodes.
This issue is resolved in NSX-T 3.2.x and above releases.
set debug
get mediator forcesync
set debug
get mediator dump <file_name> >>> After running, check /var/vmware/file-store/<file_name> for the dump.
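After running the forcesync, the nestdb check from the symptoms above can be repeated on the affected Edge node to confirm that the VIP (192.168.179.180 in this example) is listed again:
/opt/vmware/nsx-nestdb/bin/nestdb-cli --cmd get vmware.nsx.nestdb.ArpTableConfigMsg --beautify --json | grep '"ip"'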