Tanzu Application Service container-to-container (c2c) networking packets are dropped in NSX environments


Article ID: 298181


Updated On:

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

This Knowledge Base (KB) article describes how container-to-container (c2c) networking in Tanzu Application Service (TAS) is impacted in certain VMware NSX environments.

For insight into how c2c works within TAS, please review this KB article.

VMs deployed by BOSH have checksum offloading enabled on the eth0 interface. Packet drops are caused by invalid TCP checksums on VXLAN overlay packets. These packets can be discarded by either network devices or the receiving VM, leading to network timeouts.


Environment

Product Version: 6.0, 10.0

Cause

There are two possible causes 

 

Issue #1 

Older versions of vSphere 7 have a known issue in the vmxnet3 driver. It is fixed in 7.0 Update 3q and later. For more information on this vmxnet3 bug, please see this KB article. Most or all production vSphere environments to date should already have this fix.

This bug is mentioned in the release notes here as well. 

https://docs.vmware.com/en/VMware-vSphere/7.0/rn/vsphere-esxi-70u3q-release-notes/index.html

"PR 3282224: VXLAN traffic generated from a guest VM to port 4789 fails"

 

Issue #2

Container-to-container VXLAN communication in TAS is dropped because of TCP checksum errors: the NSX overlay does not support inner VXLAN offload combined with Geneve encapsulation. Even if Elastic Application Services does not have the NCP tile installed, you can still be impacted by this issue. If the NSX segment backing the BOSH-deployed VMs uses an overlay, container-to-container networking will not function unless the workaround described in this KB is applied.

See KB https://knowledge.broadcom.com/external/article/423328/container-to-container-vxlan-communicati.html for more details.  

This issue is most often observed after upgrading the BOSH Linux stemcell to 1.894 or later. The latest Windows stemcells see packet drops as well.

 

 

Resolution

 

Windows Workaround

Use bosh ssh to connect to the Windows Diego cell and start PowerShell. The commands below disable checksum offload for TCP and verify that the setting took effect. The change persists across reboots; however, a BOSH VM create operation caused by a stemcell upgrade, or running bosh recreate, will re-enable TCP checksum offload.

PS C:\> Start-Process powershell -Verb runAs
PS C:\> Disable-NetAdapterChecksumOffload -Name "*" -TcpIPv6 -TcpIPv4

Verify that the settings took effect:
PS C:\> Get-NetAdapterChecksumOffload -Name "*"

 

Linux Workaround 1


This method will restore the communication, however it will not persist if the VM is recreated for any reason. This method involves using the BOSH cli to ssh into the instances and running a command:

 

bosh -d <CF-GUID> ssh diego_cell -c "sudo /usr/sbin/ethtool -K eth0 tx-checksum-ip-generic off"


Substitute your cf deployment name for <CF-GUID>, and make any other changes your BOSH CLI usage requires. If the problem also occurs in isolation segments, run the same command there:
 

bosh -d <p-isolation-segment-GUID> ssh isolated_diego_cell -c "sudo /usr/sbin/ethtool -K eth0 tx-checksum-ip-generic off"


Please substitute the naming to match the environment's naming convention for the isolation segment. 
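
To confirm that the workaround took effect on a cell, the offload state can be read back with ethtool -k (lowercase k lists features). The following is a minimal sketch, not an official tool: tx_checksum_state is a hypothetical helper name, and it assumes the usual "feature: state" layout of ethtool -k output.

```shell
#!/usr/bin/env bash
# tx_checksum_state: hypothetical helper (not part of BOSH or TAS tooling).
# Reads `ethtool -k <iface>` output on stdin and prints the state
# ("on" or "off") of the tx-checksum-ip-generic feature.
tx_checksum_state() {
  # Split each line on ": ", then keep only the first word of the value
  # so that suffixes such as "[fixed]" are dropped.
  awk -F': *' '/tx-checksum-ip-generic:/ { split($2, f, " "); print f[1]; exit }'
}

# Example usage on a Diego cell, after connecting with bosh ssh:
#   sudo /usr/sbin/ethtool -k eth0 | tx_checksum_state
```

After the workaround is applied, the helper is expected to report off for eth0. If it reports on, the VM was probably recreated since the command was last run.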


Linux Workaround 2


This method will restore the communication, and it will persist if the VM is recreated or restarted for any reason. This method involves leveraging os-conf via a BOSH runtime-config.

Create a file called os-conf-c2c.yml with the following content:
 

releases:
- name: os-conf
  version: 22.2.1
 
 
addons:
  - name: os-configuration
    include:
      jobs:
      - name: rep
        release: diego
      deployments:
      - cf-GUID
    jobs:
    - name: pre-start-script
      release: os-conf
      properties:
        script: |-
          #!/bin/bash
 
          /usr/sbin/ethtool -K eth0 tx-checksum-ip-generic off
          echo "ACTION==\"add|change\", SUBSYSTEM==\"net\", KERNEL==\"eth*|en*\", RUN+=\"/usr/sbin/ethtool -K \$name tx-checksum-ip-generic off\"" > /etc/udev/rules.d/61-net.tx-checksum-ip-generic.rules


Substitute the values specific to your environment, such as the os-conf release version and the cf-GUID deployment name. If isolation segments also need this change, adjust the placement rules accordingly. For example, remove the deployments block to apply the addon to every VM that runs the rep job from the diego release.
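
The escaped quotes in the echo line of the pre-start script can be hard to read. As an illustration only, the sketch below runs the same echo against a scratch file (rule_file is a hypothetical temporary path, not /etc/udev/rules.d) so the resulting udev rule can be inspected without changing the system.

```shell
#!/usr/bin/env bash
# Write the rule to a temporary file instead of /etc/udev/rules.d,
# so nothing on the host is modified. rule_file is a hypothetical scratch path.
rule_file="$(mktemp)"

# Same echo as in the pre-start script: \" yields a literal double quote and
# \$name keeps $name unexpanded so that udev, not the shell, substitutes it.
echo "ACTION==\"add|change\", SUBSYSTEM==\"net\", KERNEL==\"eth*|en*\", RUN+=\"/usr/sbin/ethtool -K \$name tx-checksum-ip-generic off\"" > "$rule_file"

# Show the rule exactly as udev would read it.
cat "$rule_file"
```

The written rule matches add and change events for network interfaces whose kernel names start with eth or en, and runs the same ethtool command for each matching interface. The string $name is left literal in the file for udev to replace with the device name.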

Once the file is saved, it can be uploaded to BOSH:

bosh update-config --type=runtime --name=os-conf-c2c os-conf-c2c.yml


Then run Apply Changes on all applicable tiles (tiles that include diego_cells).