The purpose of this article is to raise awareness of NAT issues experienced in NSX-T 2.4.0 and 2.4.1
Symptoms:
THESE ISSUES WILL ONLY OCCUR WHEN SNAT AND DNAT ARE CONFIGURED ON T0 LOGICAL ROUTERS RUNNING NSX-T 2.4.0 OR 2.4.1
Scenario 1 - Intra-Tier1 Communication:
-- Workloads behind the same T1 communicate with one another through their NAT'ed IPs configured on the T0. Communication between the workloads fails because of an RPF-Check drop on the T0 LinkedPort (facing the T1) when the active T0 and T1 SR instances are on the same Edge Node
-- SNAT and DNAT are configured on the T0
This issue is commonly seen in a PKS deployment with a NAT topology
Note: This issue can be seen even when there is no PKS in the environment; a PKS deployment is used here only as an example
Environmental Overview:
-- T1 MGMT PKS hosts all the PKS Control Plane components, such as the Ops Manager, BOSH, and PKS VMs
-- T1 MGMT PKS has a Service Router (SR) instance configured
-- The T0 and T1 SR instances are active on the same Edge Node
Topology information:
-- Opsmgr IP: 192.168.50.11
-- Bosh Director IP: 192.168.50.12
-- Opsmgr's IP x.x.50.11 is SNAT'ed on the T0 to IP x.x.100.81
-- Bosh's IP x.x.50.12 is DNAT'ed on the T0 to IP x.x.100.82
How Issue Occurs:
-- Opsmgr (x.x.50.11) communicates with the Bosh Director over Bosh's NAT'ed IP x.x.100.82
-- The expected traffic flow is that Opsmgr's IP gets SNAT'ed from x.x.50.11 to x.x.100.81 and Bosh's IP gets DNAT'ed from x.x.100.82 to x.x.50.12 as the packet traverses the T0
-- A packet capture on the source host where Opsmgr resides shows the ICMP echo request (below) with source IP x.x.50.11 and destination IP x.x.100.82 egressing Opsmgr's vnic
[root@esxcna03-s1:~] pktcap-uw --switchport 67108879 --dir 0 --stage 0 -o - | tcpdump-uw -enr - icmp
reading from file -, link-type EN10MB (Ethernet)
16:58:03.199507 00:50:56:xx:xx:xx > 02:50:56:xx:xx:xx, ethertype IPv4 (0x0800), length 98: x.x.50.11 > x.x.100.82: ICMP echo request, id 2322, seq 1, length 64
-- A packet capture on the destination host shows that these ICMP echo requests never arrive on the switchport
[root@esxcna01-s1:~] pktcap-uw --switchport 67108875 --dir 1 --stage 0 -o - | tcpdump-uw -enr - icmp
reading from file -, link-type EN10MB (Ethernet)
tcpdump-uw: pcap_loop: error reading dump file: Interrupted system call
[root@esxcna01-s1:~]
-- Running a packet capture on the T0 LinkedPort facing the T1, we see the ICMP echo requests arrive. 573ae3e8-6e47-4033-aff2-f896c88b521e is the UUID of this LinkedPort
nsxtedge01> start capture interface 573ae3e8-6e47-4033-aff2-f896c88b521e expression icmp
17:19:42.184909 02:50:56:xx:xx:xx > 02:50:56:xx:xx:xx, ethertype IPv4 (0x0800), length 98: x.x.50.11 > x.x.100.82: ICMP echo request, id 2339, seq 1094, length 64
Packet Walk:
nsxtedge01> get firewall 573ae3e8-6e47-4033-aff2-f896c88b521e connection | find icmp
0x04000a2974000065: x.x.50.11 -> x.x.50.12 (x.x.100.82) dir in protocol icmp
-- Checking the stats of the T0 LinkedPort to the T1, we see the RPF-Check packet drop counter increasing
nsxtedge01> get logical-router interface 573ae3e8-6e47-4033-aff2-f896c88b521e stats
interface : 573ae3e8-6e47-4033-aff2-f896c88b521e
ifuid : 283
VRF : 43705f83-f1fd-4581-be22-9f2492816d69
name : LinkedPort_Cent-T1
<Output Snipped>
Statistics
RX-Packets : 3917
RX-Bytes : 380142
RX-Drops : 3859
RPF-Check : 3764 <--------------- RPF-Check Drops increasing
<Output Snipped>
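The RPF-Check drop above can be pictured with a minimal sketch. The prefixes, interface name, and the strict-uRPF model below are illustrative assumptions, not values taken from the edge: a strict RPF check drops a packet when the route back to its source IP does not point out the interface the packet arrived on, which is what happens to traffic hair-pinned on a single Edge Node.

```python
import ipaddress

# Hypothetical routing table for the T0: prefix -> egress interface
ROUTES = {
    "192.168.50.0/24": "LinkedPort_Cent-T1",  # workload subnet, reached via the T1
    "10.10.100.0/24": "uplink",               # NAT range, reached via the uplink
}

def route_lookup(ip):
    """Longest-prefix match; returns the egress interface for ip, or None."""
    best_iface, best_len = None, -1
    addr = ipaddress.ip_address(ip)
    for prefix, iface in ROUTES.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_iface, best_len = iface, net.prefixlen
    return best_iface

def rpf_check(src_ip, arrival_iface):
    """Strict uRPF: accept only if the route back to src_ip uses the arrival interface."""
    return route_lookup(src_ip) == arrival_iface

# The workload's original packet passes RPF on the linked port:
assert rpf_check("192.168.50.11", "LinkedPort_Cent-T1")
# A hair-pinned packet whose source sits in the NAT range fails RPF and is dropped:
assert not rpf_check("10.10.100.81", "LinkedPort_Cent-T1")
```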
Scenario 2 - Inter-Tier1 Communication:
-- Workloads behind different T1s communicate with each other using their NAT'ed IPs configured on the T0. Communication between the workloads fails when the T1s are configured with only a Distributed Router (DR) instance and no Service Router (SR) instance
-- SNAT and DNAT are configured on the T0
This issue is commonly seen in PKS environments where workloads behind different T1s talk to each other via their NAT'ed IPs configured on the T0. For illustration, we'll use the environment below.
Environment Information:
-- T1-Mgmt-PKS has all PKS Control Plane Components
-- T1-k8s has Workload VMs
-- None of the T1s have a Service Router (SR) instance, only Distributed Router (DR) instances
-- Workloads are NAT'ed at T0
Topology information:
-- Opsmgr IP: x.x.50.11
-- Opsmgr's IP x.x.50.11 is SNAT'ed on the T0 to IP x.x.100.81
-- K8s VM (called RedNode01) IP: x.x.60.11
-- RedNode01's IP x.x.60.11 is DNAT'ed on the T0 to IP x.x.100.90
-- Ping from Opsmgr (x.x.50.11) to RedNode01's NAT'ed IP (x.x.100.90). Notice that the response comes back from x.x.60.11, which is RedNode01's real IP rather than its NAT'ed IP. This is unexpected
-- A packet capture on the source host (where Opsmgr resides), taken as the packet egresses the source port, shows the packet leave with source IP x.x.50.11 and destination IP x.x.100.90
[root@esxcna03-s1:~] pktcap-uw --switchport 67108872 --dir 0 --stage 0 -o - | tcpdump-uw -enr -
reading from file -, link-type EN10MB (Ethernet)
21:03:22.230309 02:50:56:xx:xx:xx > 02:50:56:xx:xx:xx, ethertype IPv4 (0x0800), length 98: x.x.50.11 > x.x.100.90: ICMP echo request, id 3649, seq 1, length 64
-- A packet capture on the destination host (where RedNode01 resides) shows the packet come in with Opsmgr's real IP instead of its NAT'ed IP. The destination IP, on the other hand, has been DNAT'ed
pktcap-uw --switchport 67108876 --dir 1 --stage 1 -o - | tcpdump-uw -enr -
reading from file -, link-type EN10MB (Ethernet)
21:06:14.124226 02:50:56:xx:xx:xx > 00:50:56:xx:xx:xx, ethertype IPv4 (0x0800), length 98: x.x.50.11 > x.x.60.11: ICMP echo request, id 3649, seq 144, length 64
-- In the firewall connection table on the edge node where the T0 SR instance resides, we do not see SNAT occur, i.e. x.x.50.11 is never translated to x.x.100.81
nsxtedge01> get firewall 76183f46-1883-4f55-af2c-daebcd1209b8 connection | find icmp
0x04002fbdbc000007: x.x.50.11 -> x.x.60.11 (x.x.100.90) dir in protocol icmp
-- Though pings appear to work, this will break TCP traffic: the reply arrives from the real IP rather than the NAT'ed IP the client connected to, so the three-way handshake never completes.
Packet Walk: [diagrams of the Request and Response paths]
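Why the missing SNAT on the return path breaks TCP can be sketched in a few lines. The RFC 1918 addresses below are placeholders standing in for the masked x.x ranges above: a TCP client only matches a SYN-ACK whose source IP equals the destination IP it sent the SYN to.

```python
def client_accepts_synack(syn_dst_ip, synack_src_ip):
    """A client's TCP stack matches a SYN-ACK to its pending connection only
    when the SYN-ACK's source equals the SYN's destination."""
    return synack_src_ip == syn_dst_ip

DNAT_IP = "172.16.100.90"   # RedNode01's NAT'ed IP (placeholder addressing)
REAL_IP = "192.168.60.11"   # RedNode01's real IP (placeholder addressing)

# With correct NAT, the reply is un-DNAT'ed back to the NAT'ed IP on the way out:
assert client_accepts_synack(DNAT_IP, DNAT_IP)
# With the bug, the reply leaves with the real IP, so the handshake never completes:
assert not client_accepts_synack(DNAT_IP, REAL_IP)
```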
This issue is resolved in VMware NSX-T Data Center 2.4.2.
Workaround:
The workarounds below apply to both issues. Customers can employ either one, based on their configuration and design requirements.
Workaround 1:
Keep the active T0 and T1 SR instances on separate Edge Nodes.
Note: The caveat with this workaround is that after an Edge Node failure, both the T0 and T1 SRs can end up on the same Edge Node, re-introducing the RPF-Check failure.
Workaround 2:
Deploy a second Edge cluster with two or more additional Edge Nodes and attach the T1s to this new Edge cluster. In this case, ensure that the T0 is attached to Edge cluster 1 and the T1s to Edge cluster 2. Also attach any subsequently created T1s to the second Edge cluster to keep them separate from the T0's Edge cluster.
For Scenario 1, this configuration prevents traffic from hair-pinning on a single Edge Node, so RPF-Check no longer drops the packets
For Scenario 2, this configuration ensures the firewall is enabled on the T1 SR, effectively converting the T0 LinkedPort into a downlink port
For PKS-deployed T1s
On some PKS-deployed T1s, the above change may not be possible through the NSX UI, since PKS T1s are protected objects and are not editable from the UI. To modify these T1s, use an API call with the X-Allow-Overwrite header to force the change. Steps below:
Step-1: Get current T1 configuration using a GET call
GET https://192.168.110.17/api/v1/logical-routers/3e4077e6-b6e4-49f0-9d37-fa24f13526a9
Sample Output
{
"router_type": "TIER1",
"advanced_config": {
"external_transit_networks": [],
"internal_transit_network": "169.x.x.x/28"
},
"allocation_profile": {
"enable_standby_relocation": false
},
"firewall_sections": [
{
"target_id": "2ed4ebe4-8b82-44f8-b100-94b9191e7a3d",
"target_type": "FirewallSection",
"is_valid": true
}
],
"resource_type": "LogicalRouter",
"id": "3e4077e6-b6e4-49f0-9d37-fa24f13526a9",
"display_name": "Test-1",
"_create_user": "admin",
"_create_time": 1563402475042,
"_last_modified_user": "admin",
"_last_modified_time": 1563402475042,
"_system_owned": false,
"_protection": "NOT_PROTECTED",
"_revision": 0
}
Step-2: Identify the UUID of the Edge cluster you want the above T1 attached to
This can be identified in the UI or with the API call below
GET /api/v1/edge-clusters
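If you are scripting the lookup, the cluster UUID can be pulled out of the GET /api/v1/edge-clusters response by display name. A minimal sketch; the payload below is abbreviated sample data shaped like the API response, not real output:

```python
def find_edge_cluster_id(payload, display_name):
    """Return the id of the edge cluster whose display_name matches, else None."""
    for cluster in payload.get("results", []):
        if cluster.get("display_name") == display_name:
            return cluster.get("id")
    return None

# Abbreviated sample shaped like the GET /api/v1/edge-clusters response body
sample = {"results": [
    {"id": "436c8915-11fe-46cb-8a0b-18ad284e0e03", "display_name": "EdgeCluster-2"},
]}

assert find_edge_cluster_id(sample, "EdgeCluster-2") == "436c8915-11fe-46cb-8a0b-18ad284e0e03"
```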
Step-3: Run a PUT call with the correct Edge cluster and the X-Allow-Overwrite header for protected objects
The following should be added to the body returned by the GET call in Step-1. Note that the body must carry the current _revision value from the GET, or the PUT will be rejected.
********************
"router_type": "TIER1",
"edge_cluster_id": "436c8915-11fe-46cb-8a0b-18ad284e0e03", #Added this section
"edge_cluster_member_indices": [
0,
1
],
********************
PUT https://192.168.110.17/api/v1/logical-routers/3e4077e6-b6e4-49f0-9d37-fa24f13526a9
Sample Body:
{
"router_type": "TIER1",
"edge_cluster_id": "436c8915-11fe-46cb-8a0b-18ad284e0e03",
"edge_cluster_member_indices": [
0,
1
],
"advanced_config": {
"external_transit_networks": [],
"internal_transit_network": "169.x.x.x/28"
},
"allocation_profile": {
"enable_standby_relocation": false
},
"firewall_sections": [
{
"target_id": "2ed4ebe4-8b82-44f8-b100-94b9191e7a3d",
"target_type": "FirewallSection",
"is_valid": true
}
],
"resource_type": "LogicalRouter",
"id": "3e4077e6-b6e4-49f0-9d37-fa24f13526a9",
"display_name": "Test-1",
"_create_user": "admin",
"_create_time": 1563402475042,
"_last_modified_user": "admin",
"_last_modified_time": 1563402475042,
"_system_owned": false,
"_protection": "NOT_PROTECTED",
"_revision": 0
}
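The Step-1/Step-3 round trip can also be scripted. A sketch of assembling the PUT body and headers; the helper name, the abbreviated GET body, and the commented-out requests call are illustrative assumptions, with the manager address and credentials to be supplied for your environment:

```python
def build_put_body(get_body, edge_cluster_id, member_indices):
    """Copy the Step-1 GET response body and add the edge-cluster attachment fields."""
    body = dict(get_body)                  # keeps id, display_name, _revision, ...
    body["edge_cluster_id"] = edge_cluster_id
    body["edge_cluster_member_indices"] = list(member_indices)
    return body

# Header that allows the PUT to modify a PKS-protected object
HEADERS = {"Content-Type": "application/json", "X-Allow-Overwrite": "true"}

# Abbreviated stand-in for the Step-1 GET response body
get_body = {"router_type": "TIER1",
            "id": "3e4077e6-b6e4-49f0-9d37-fa24f13526a9",
            "_revision": 0}

put_body = build_put_body(get_body, "436c8915-11fe-46cb-8a0b-18ad284e0e03", [0, 1])
assert put_body["edge_cluster_id"] == "436c8915-11fe-46cb-8a0b-18ad284e0e03"
assert put_body["_revision"] == 0          # PUT must echo the revision from the GET
# requests.put(f"https://{nsx_mgr}/api/v1/logical-routers/{put_body['id']}",
#              headers=HEADERS, json=put_body, auth=(user, password))
```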