The purpose of this article is to raise awareness of NAT issues experienced in NSX-T 2.4.0 and 2.4.1
Symptoms:
THESE ISSUES WILL ONLY OCCUR WHEN SNAT AND DNAT ARE CONFIGURED ON T0 LOGICAL ROUTERS RUNNING NSX-T 2.4.0 OR 2.4.1
Scenario 1 - Intra-Tier1 Communication:
-- Workloads behind the same T1 communicate with one another through their NAT'ed IPs configured on the T0. Communication between the workloads fails because of an RPF-Check drop on the T0 LinkedPort (facing the T1) when the active T0 and T1 SR instances are on the same Edge Node
-- SNAT and DNAT are configured on the T0
This issue is commonly seen in a PKS deployment with a NAT topology
Note: This issue can be seen even when there is no PKS in the environment; a PKS deployment is used here only as an example
Environmental Overview:
-- T1 MGMT PKS hosts all the PKS Control Plane components, such as the Ops Manager, BOSH, and PKS VMs
-- T1 MGMT PKS has a Service Router (SR) instance configured
-- The T0 and T1 SR instances are active on the same Edge Node
Topology information:
-- Opsmgr IP: 192.168.50.11
-- Bosh Director IP: 192.168.50.12
-- Opsmgr's IP x.x.50.11 is SNAT'ed on the T0 to IP x.x.100.81
-- Bosh's IP x.x.50.12 is DNAT'ed on the T0 to IP x.x.100.82
How Issue Occurs:
-- Opsmgr (x.x.50.11) communicates with the Bosh Director over Bosh's NAT'ed IP x.x.100.82
-- The expected traffic flow is that Opsmgr's IP gets SNAT'ed from x.x.50.11 to x.x.100.81 and Bosh's IP gets DNAT'ed from x.x.100.82 to x.x.50.12 as the packet traverses the T0
-- A packet capture on the source host where Opsmgr resides shows the ICMP echo request (below) with source IP x.x.50.11 and destination IP x.x.100.82 egressing Opsmgr's vnic
[root@esxcna03-s1:~] pktcap-uw --switchport 67108879 --dir 0 --stage 0 -o - | tcpdump-uw -enr - icmp
reading from file -, link-type EN10MB (Ethernet)
16:58:03.199507 00:50:56:xx:xx:xx > 02:50:56:xx:xx:xx, ethertype IPv4 (0x0800), length 98: x.x.50.11 > x.x.100.82: ICMP echo request, id 2322, seq 1, length 64
-- A packet capture on the destination host shows that these ICMP echo requests never arrive on the switchport
[root@esxcna01-s1:~] pktcap-uw --switchport 67108875 --dir 1 --stage 0 -o - | tcpdump-uw -enr - icmp
reading from file -, link-type EN10MB (Ethernet)
tcpdump-uw: pcap_loop: error reading dump file: Interrupted system call
[root@esxcna01-s1:~]
-- Running a packet capture on the T0 LinkedPort facing the T1, we see the ICMP echo requests arrive. 573ae3e8-6e47-4033-aff2-f896c88b521e is the UUID of this LinkedPort
nsxtedge01> start capture interface 573ae3e8-6e47-4033-aff2-f896c88b521e expression icmp
17:19:42.184909 02:50:56:xx:xx:xx > 02:50:56:xx:xx:xx, ethertype IPv4 (0x0800), length 98: x.x.50.11 > x.x.100.82: ICMP echo request, id 2339, seq 1094, length 64
Packet Walk:
nsxtedge01> get firewall 573ae3e8-6e47-4033-aff2-f896c88b521e connection | find icmp
0x04000a2974000065: x.x.50.11 -> x.x.50.12 (x.x.100.82) dir in protocol icmp
-- Checking the stats of the T0 LinkedPort to the T1, we see the RPF-Check packet drop counter increasing
nsxtedge01> get logical-router interface 573ae3e8-6e47-4033-aff2-f896c88b521e stats
interface : 573ae3e8-6e47-4033-aff2-f896c88b521e
ifuid : 283
VRF : 43705f83-f1fd-4581-be22-9f2492816d69
name : LinkedPort_Cent-T1
<Output Snipped>
Statistics
RX-Packets : 3917
RX-Bytes : 380142
RX-Drops : 3859
RPF-Check : 3764 <--------------- RPF-Check Drops increasing
<Output Snipped>
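The RPF-Check drop above can be pictured with a minimal sketch. The prefixes, interface name, and the strict-uRPF model below are illustrative assumptions, not values taken from the edge: a strict RPF check drops a packet when the route back to its source IP does not point out the interface the packet arrived on, which is what happens to traffic hair-pinned on a single Edge Node.

```python
import ipaddress

# Hypothetical routing table for the T0: prefix -> egress interface
ROUTES = {
    "192.168.50.0/24": "LinkedPort_Cent-T1",  # workload subnet, reached via the T1
    "10.10.100.0/24": "uplink",               # NAT range, reached via the uplink
}

def route_lookup(ip):
    """Longest-prefix match; returns the egress interface for ip, or None."""
    best_iface, best_len = None, -1
    addr = ipaddress.ip_address(ip)
    for prefix, iface in ROUTES.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_iface, best_len = iface, net.prefixlen
    return best_iface

def rpf_check(src_ip, arrival_iface):
    """Strict uRPF: accept only if the route back to src_ip uses the arrival interface."""
    return route_lookup(src_ip) == arrival_iface

# The workload's original packet passes RPF on the linked port:
assert rpf_check("192.168.50.11", "LinkedPort_Cent-T1")
# A hair-pinned packet whose source sits in the NAT range fails RPF and is dropped:
assert not rpf_check("10.10.100.81", "LinkedPort_Cent-T1")
```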
Scenario 2 - Inter-Tier1 Communication:
-- Workloads behind different T1s communicate with each other using their NAT'ed IPs configured on the T0. Communication between the workloads fails when the T1s are configured with only a Distributed Router (DR) instance and no Service Router (SR) instance
-- SNAT and DNAT are configured on the T0
This issue is commonly seen in PKS environments where workloads behind different T1s talk to each other via their NAT'ed IPs configured on the T0. For illustration, we'll use the environment below.
Environment Information:
-- T1-Mgmt-PKS has all PKS Control Plane Components
-- T1-k8s has Workload VMs
-- None of the T1s have a Service Router (SR) instance, only Distributed Router (DR) instances
-- Workloads are NAT'ed at T0
Topology information:
-- Opsmgr IP: x.x.50.11
-- Opsmgr's IP x.x.50.11 is SNAT'ed on the T0 to IP x.x.100.81
-- K8s VM (called RedNode01) IP: x.x.60.11
-- RedNode01's IP x.x.60.11 is DNAT'ed on the T0 to IP x.x.100.90
-- Ping from Opsmgr (x.x.50.11) to RedNode01's NAT'ed IP (x.x.100.90). Notice that the response comes back from x.x.60.11, which is RedNode01's real IP rather than its NAT'ed IP. This is unexpected
-- A packet capture on the source host (where Opsmgr resides), taken as the packet egresses the source port, shows the packet leave with source IP x.x.50.11 and destination IP x.x.100.90
[root@esxcna03-s1:~] pktcap-uw --switchport 67108872 --dir 0 --stage 0 -o - | tcpdump-uw -enr -
reading from file -, link-type EN10MB (Ethernet)
21:03:22.230309 02:50:56:xx:xx:xx > 02:50:56:xx:xx:xx, ethertype IPv4 (0x0800), length 98: x.x.50.11 > x.x.100.90: ICMP echo request, id 3649, seq 1, length 64
-- A packet capture on the destination host (where RedNode01 resides) shows the packet come in with Opsmgr's real IP instead of its NAT'ed IP. The destination IP, on the other hand, has been DNAT'ed
pktcap-uw --switchport 67108876 --dir 1 --stage 1 -o - | tcpdump-uw -enr -
reading from file -, link-type EN10MB (Ethernet)
21:06:14.124226 02:50:56:xx:xx:xx > 00:50:56:xx:xx:xx, ethertype IPv4 (0x0800), length 98: x.x.50.11 > x.x.60.11: ICMP echo request, id 3649, seq 144, length 64
-- In the firewall connection table on the edge node where the T0 SR instance resides, we do not see SNAT occur, i.e. x.x.50.11 is never translated to x.x.100.81
nsxtedge01> get firewall 76183f46-1883-4f55-af2c-daebcd1209b8 connection | find icmp
0x04002fbdbc000007: x.x.50.11 -> x.x.60.11 (x.x.100.90) dir in protocol icmp
-- Though pings appear to work, this will break TCP traffic: the reply arrives from the real IP rather than the NAT'ed IP the client connected to, so the three-way handshake never completes.
Packet Walk: [diagrams of the Request and Response paths]
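Why the missing SNAT on the return path breaks TCP can be sketched in a few lines. The RFC 1918 addresses below are placeholders standing in for the masked x.x ranges above: a TCP client only matches a SYN-ACK whose source IP equals the destination IP it sent the SYN to.

```python
def client_accepts_synack(syn_dst_ip, synack_src_ip):
    """A client's TCP stack matches a SYN-ACK to its pending connection only
    when the SYN-ACK's source equals the SYN's destination."""
    return synack_src_ip == syn_dst_ip

DNAT_IP = "172.16.100.90"   # RedNode01's NAT'ed IP (placeholder addressing)
REAL_IP = "192.168.60.11"   # RedNode01's real IP (placeholder addressing)

# With correct NAT, the reply is un-DNAT'ed back to the NAT'ed IP on the way out:
assert client_accepts_synack(DNAT_IP, DNAT_IP)
# With the bug, the reply leaves with the real IP, so the handshake never completes:
assert not client_accepts_synack(DNAT_IP, REAL_IP)
```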
This issue is resolved in VMware NSX-T Data Center 2.4.2.
Workaround:
The workarounds below apply to both issues. Customers can employ either one, based on their configuration and design requirements.
Workaround 1:
Keep the active T0 and T1 SR instances on separate Edge Nodes.
Note: The caveat with this workaround is that after an Edge Node failure, both the T0 and T1 SRs can end up on the same Edge Node, re-introducing the RPF-Check failure.
Workaround 2:
Deploy a second Edge cluster with two or more additional Edge Nodes and attach the T1s to this new Edge cluster. In this case, ensure that the T0 is attached to Edge cluster 1 and the T1s to Edge cluster 2. Also attach any subsequently created T1s to the second Edge cluster to keep them separate from the T0's Edge cluster.
For Scenario 1, this configuration prevents traffic from hair-pinning on a single Edge Node, so RPF-Check no longer drops the packets
For Scenario 2, this configuration ensures the firewall is enabled on the T1 SR, effectively converting the T0 LinkedPort into a downlink port
For PKS-deployed T1s
On some PKS-deployed T1s, the above change may not be possible through the NSX UI, since PKS T1s are protected objects and are not editable from the UI. To modify these T1s, use an API call with the X-Allow-Overwrite header to force the change. Steps below:
Step-1: Get current T1 configuration using a GET call
GET https://192.168.110.17/api/v1/logical-routers/3e4077e6-b6e4-49f0-9d37-fa24f13526a9
Sample Output
{
"router_type": "TIER1",
"advanced_config": {
"external_transit_networks": [],
"internal_transit_network": "169.x.x.x/28"
},
"allocation_profile": {
"enable_standby_relocation": false
},
"firewall_sections": [
{
"target_id": "2ed4ebe4-8b82-44f8-b100-94b9191e7a3d",
"target_type": "FirewallSection",
"is_valid": true
}
],
"resource_type": "LogicalRouter",
"id": "3e4077e6-b6e4-49f0-9d37-fa24f13526a9",
"display_name": "Test-1",
"_create_user": "admin",
"_create_time": 1563402475042,
"_last_modified_user": "admin",
"_last_modified_time": 1563402475042,
"_system_owned": false,
"_protection": "NOT_PROTECTED",
"_revision": 0
}
Step-2: Identify the UUID of the Edge cluster you want the above T1 attached to
This can be identified in the UI or with the API call below
GET /api/v1/edge-clusters
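If you are scripting the lookup, the cluster UUID can be pulled out of the GET /api/v1/edge-clusters response by display name. A minimal sketch; the payload below is abbreviated sample data shaped like the API response, not real output:

```python
def find_edge_cluster_id(payload, display_name):
    """Return the id of the edge cluster whose display_name matches, else None."""
    for cluster in payload.get("results", []):
        if cluster.get("display_name") == display_name:
            return cluster.get("id")
    return None

# Abbreviated sample shaped like the GET /api/v1/edge-clusters response body
sample = {"results": [
    {"id": "436c8915-11fe-46cb-8a0b-18ad284e0e03", "display_name": "EdgeCluster-2"},
]}

assert find_edge_cluster_id(sample, "EdgeCluster-2") == "436c8915-11fe-46cb-8a0b-18ad284e0e03"
```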
Step-3: Run a PUT call with the correct Edge cluster and the X-Allow-Overwrite header for protected objects
The following should be added to the body returned by the GET call in Step-1. Note that the body must carry the current _revision value from the GET, or the PUT will be rejected.
********************
"router_type": "TIER1",
"edge_cluster_id": "436c8915-11fe-46cb-8a0b-18ad284e0e03", #Added this section
"edge_cluster_member_indices": [
0,
1
],
********************
PUT https://192.168.110.17/api/v1/logical-routers/3e4077e6-b6e4-49f0-9d37-fa24f13526a9
Sample Body:
{
"router_type": "TIER1",
"edge_cluster_id": "436c8915-11fe-46cb-8a0b-18ad284e0e03",
"edge_cluster_member_indices": [
0,
1
],
"advanced_config": {
"external_transit_networks": [],
"internal_transit_network": "169.x.x.x/28"
},
"allocation_profile": {
"enable_standby_relocation": false
},
"firewall_sections": [
{
"target_id": "2ed4ebe4-8b82-44f8-b100-94b9191e7a3d",
"target_type": "FirewallSection",
"is_valid": true
}
],
"resource_type": "LogicalRouter",
"id": "3e4077e6-b6e4-49f0-9d37-fa24f13526a9",
"display_name": "Test-1",
"_create_user": "admin",
"_create_time": 1563402475042,
"_last_modified_user": "admin",
"_last_modified_time": 1563402475042,
"_system_owned": false,
"_protection": "NOT_PROTECTED",
"_revision": 0
}
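The Step-1/Step-3 round trip can also be scripted. A sketch of assembling the PUT body and headers; the helper name, the abbreviated GET body, and the commented-out requests call are illustrative assumptions, with the manager address and credentials to be supplied for your environment:

```python
def build_put_body(get_body, edge_cluster_id, member_indices):
    """Copy the Step-1 GET response body and add the edge-cluster attachment fields."""
    body = dict(get_body)                  # keeps id, display_name, _revision, ...
    body["edge_cluster_id"] = edge_cluster_id
    body["edge_cluster_member_indices"] = list(member_indices)
    return body

# Header that allows the PUT to modify a PKS-protected object
HEADERS = {"Content-Type": "application/json", "X-Allow-Overwrite": "true"}

# Abbreviated stand-in for the Step-1 GET response body
get_body = {"router_type": "TIER1",
            "id": "3e4077e6-b6e4-49f0-9d37-fa24f13526a9",
            "_revision": 0}

put_body = build_put_body(get_body, "436c8915-11fe-46cb-8a0b-18ad284e0e03", [0, 1])
assert put_body["edge_cluster_id"] == "436c8915-11fe-46cb-8a0b-18ad284e0e03"
assert put_body["_revision"] == 0          # PUT must echo the revision from the GET
# requests.put(f"https://{nsx_mgr}/api/v1/logical-routers/{put_body['id']}",
#              headers=HEADERS, json=put_body, auth=(user, password))
```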