Tanzu Application Service c2c impact from vmxnet3 driver bug

Article ID: 298181

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

This Knowledge Base (KB) article describes a rare bug in the vmxnet3 driver in VMware Tools that can impact container-to-container (c2c) networking in Tanzu Application Service (TAS).

For insight into how c2c works within TAS, please review this KB article.

VMs deployed by BOSH have checksum offloading enabled on the eth0 interface. The bug produces invalid checksums on VXLAN overlay packets. As a result, these packets can be discarded by network devices or by the receiving VM, leading to network timeouts.

For more information on this vmxnet3 bug, please see this KB article.

Environment

Product Version: 4.0

Resolution

This issue is fixed in vSphere 7.0 Update 3q (ESXi 7.0 Update 3q), as listed in the release notes:

https://docs.vmware.com/en/VMware-vSphere/7.0/rn/vsphere-esxi-70u3q-release-notes/index.html

"PR 3282224: VXLAN traffic generated from a guest VM to port 4789 fails"

If an environment needs a workaround before the ESXi upgrade can take place, temporarily disabling transmit checksum offloading restores communication. The following command does this on BOSH-deployed VMs:

/usr/sbin/ethtool -K eth0 tx-checksum-ip-generic off
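
To confirm that the setting has taken effect on a VM, the current offload state can be read back with `ethtool -k` (lowercase `-k` lists features, uppercase `-K` changes them). The sketch below is a minimal helper and not part of the original procedure; the function name `tx_csum_state` is hypothetical, and it assumes the feature is reported on a line of the form `tx-checksum-ip-generic: off`.

```shell
#!/bin/bash
# Hypothetical helper: read `ethtool -k` output on stdin and print the
# state ("on" or "off") of the tx-checksum-ip-generic feature.
tx_csum_state() {
  awk '$1 == "tx-checksum-ip-generic:" { print $2; exit }'
}

# Example use on a BOSH-deployed VM (assumes the interface is eth0):
#   sudo /usr/sbin/ethtool -k eth0 | tx_csum_state
```

If the helper prints `off`, the workaround is active on that interface; `on` means offloading is still enabled.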

 

Workaround 1

This method restores communication, but it does not persist if the VM is recreated for any reason. It uses the BOSH CLI to SSH into the instances and run a command:

 

bosh -d <CF-GUID> ssh diego_cell -c "sudo /usr/sbin/ethtool -K eth0 tx-checksum-ip-generic off"


Substitute your cf deployment name for <CF-GUID>, and adjust the command as needed for your BOSH CLI setup. If the issue also occurs in isolation segments, run the same command there:
 

bosh -d <p-isolation-segment-GUID> ssh isolated_diego_cell -c "sudo /usr/sbin/ethtool -K eth0 tx-checksum-ip-generic off"


Substitute the deployment and instance group names to match the environment's naming convention for the isolation segment.
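
When several deployments are affected (the main cf deployment plus one or more isolation segments), the commands above can be generated from a list. The sketch below only prints the commands for review and does not run them. The function name `print_offload_cmds` and the deployment names in the example call are hypothetical placeholders; the real names come from the output of `bosh deployments`.

```shell
#!/bin/bash
# Hypothetical dry-run helper: print one `bosh ssh` command per
# "deployment:instance_group" pair so it can be reviewed before running.
print_offload_cmds() {
  local pair deployment group
  for pair in "$@"; do
    deployment=${pair%%:*}   # text before the first colon
    group=${pair#*:}         # text after the first colon
    printf 'bosh -d %s ssh %s -c "sudo /usr/sbin/ethtool -K eth0 tx-checksum-ip-generic off"\n' \
      "$deployment" "$group"
  done
}

# Placeholder names; substitute the values from the target environment.
print_offload_cmds "cf-GUID:diego_cell" "p-isolation-segment-GUID:isolated_diego_cell"
```

Printing first makes it easier to check the deployment and instance group names before anything is changed on the cells.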


Workaround 2

This method restores communication and persists if the VM is recreated or restarted for any reason. It applies the setting through the os-conf release, configured in a BOSH runtime config.

Create a file called os-conf-c2c.yml with the following content:
 

releases:
- name: os-conf
  version: 22.2.1
 
 
addons:
  - name: os-configuration
    include:
      jobs:
      - name: rep
        release: diego
      deployments:
      - cf-GUID
    jobs:
    - name: pre-start-script
      release: os-conf
      properties:
        script: |-
          #!/bin/bash
 
          /usr/sbin/ethtool -K eth0 tx-checksum-ip-generic off
          echo "ACTION==\"add|change\", SUBSYSTEM==\"net\", KERNEL==\"eth*|en*\", RUN+=\"/usr/sbin/ethtool -K \$name tx-checksum-ip-generic off\"" > /etc/udev/rules.d/61-net.tx-checksum-ip-generic.rules


Substitute any values that are specific to the target environment, such as the os-conf release version and the cf-GUID deployment name. If isolation segments also need the fix, modify the placement rules accordingly. For example, the deployments block can be removed so that the addon applies to every VM that runs the rep job from the diego release.
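
The `echo` line in the pre-start script relies on backslash escaping inside double quotes, which is easy to break when editing. As an illustration only, the sketch below writes the same udev rule with a quoted here-document, so the shell expands nothing inside it. The function name `write_udev_rule` is hypothetical, and the example writes to a temporary file rather than to `/etc/udev/rules.d/`.

```shell
#!/bin/bash
# Hypothetical helper: write the udev rule to the path given as $1.
# The quoted 'EOF' delimiter disables shell expansion, so $name reaches
# the file literally and is left for udev to substitute.
write_udev_rule() {
  cat > "$1" <<'EOF'
ACTION=="add|change", SUBSYSTEM=="net", KERNEL=="eth*|en*", RUN+="/usr/sbin/ethtool -K $name tx-checksum-ip-generic off"
EOF
}

# Write to a temporary file and display the result for inspection.
rule_file=$(mktemp)
write_udev_rule "$rule_file"
cat "$rule_file"
```

The file content should match what the escaped `echo` in the runtime config produces, which gives a quick way to compare the two forms after an edit.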

Once the file is saved, it can be uploaded to BOSH:

bosh update-config --type=runtime --name=os-conf-c2c os-conf-c2c.yml


Then run Apply Changes in Ops Manager for all applicable tiles (tiles that include diego_cell instances).