Apply changes fails in Isolation segment tile due to vxlan-policy-agent error


Article ID: 372871


Products

VMware Tanzu Application Service

Issue/Introduction

Apply changes fails with an error similar to:

Task xxxx | 09:59:48 | Error: Action Failed get_task: Task xxx-xxx-xxx result: 1 of 15 pre-start scripts failed. Failed Jobs: vxlan-policy-agent. Successful Jobs: loggregator_agent, silk-cni, cfdot, bpm, garden-cni, smbdriver, nfsv3driver, bosh-dns, syslog_forwarder, garden, mapfs, silk-daemon, cflinuxfs3-rootfs-setup, cflinuxfs4-rootfs-setup.

In the vxlan-policy-agent logs on the isolation_diego_cell VM, pre-start.stderr.log shows the following failure:

pre-start error: lock: open lock file: open /var/vcap/data/garden-cni/iptables.lock: no such file or directory
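To confirm that a failed cell is hitting this specific problem, the log can be searched for the error signature. The following is a minimal sketch, not an official diagnostic: has_lock_error is a hypothetical helper, and the demo runs against a scratch file seeded with the error line quoted above. On a BOSH-managed VM, job logs normally live under /var/vcap/sys/log/, so the real file would likely be /var/vcap/sys/log/vxlan-policy-agent/pre-start.stderr.log (an assumption, not stated in this article).

```shell
#!/usr/bin/env bash
# Sketch: report whether a pre-start.stderr.log contains the lock-file error.
# has_lock_error is a hypothetical helper, not part of BOSH or silk-release.
has_lock_error() {
  log_file="$1"
  grep -q 'open lock file: open .*/garden-cni/iptables\.lock: no such file or directory' "$log_file"
}

# Demo against a scratch file seeded with the error line from this article.
sample="$(mktemp)"
echo 'pre-start error: lock: open lock file: open /var/vcap/data/garden-cni/iptables.lock: no such file or directory' > "$sample"

if has_lock_error "$sample"; then
  echo "lock-file race signature found"
else
  echo "signature not found"
fi
```

On an affected cell, the same function could be pointed at the real log file instead of the scratch file.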

Environment

VMware Tanzu Platform for Cloud Foundry 4.x

Tanzu Isolation Segment 4.x

Cause

This issue is caused by a race condition that occurs while the silk-release and garden-cni jobs are starting.

Silk-release:

  1. Creates the directory (garden-cni) if it doesn’t exist.
  2. Creates the lock file (iptables.lock) if it doesn’t exist and opens it.
  3. Locks the file.

Garden-cni job:

  1. Deletes the garden-cni directory during pre-start to clean up everything in it.

If the garden-cni job deletes the directory between steps 1 and 2 of the silk-release pre-start, the lock file cannot be created because its parent directory no longer exists, and the error above occurs.
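The interleaving described above can be replayed deterministically in a scratch directory. This is an illustrative sketch only: it does not run the actual job scripts, the work, dir, lock, and result variables are hypothetical names, and touch stands in for the file-open call made by the real pre-start.

```shell
#!/usr/bin/env bash
# Sketch: replay the problematic ordering in a temporary directory.
# Nothing here touches /var/vcap; all paths are scratch paths.
work="$(mktemp -d)"
dir="$work/garden-cni"
lock="$dir/iptables.lock"

mkdir -p "$dir"   # silk-release step 1: create the directory
rm -rf "$dir"     # garden-cni pre-start: clean-up deletes the directory

# silk-release step 2: creating the lock file fails, because the
# parent directory no longer exists.
if touch "$lock" 2>/dev/null; then
  result="lock file created"
else
  result="open lock file failed: no such file or directory"
fi

echo "$result"
rm -rf "$work"
```

Swapping the mkdir and rm lines gives the harmless ordering, in which the directory exists when the lock file is created.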

Resolution

This issue is fixed in cf-networking and silk-release version 3.47.0, which is included in the following TAS, IST, and TASW tile versions:

4.0.26
5.0.16
6.0.6
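A tile version can be compared against this list with a small helper. The sketch below is not an official compatibility check. It assumes that later patch releases in the same 4.0, 5.0, or 6.0 line also carry the fix, which this article does not state explicitly, and it reports any other release line as unknown. has_fix is a hypothetical function name.

```shell
#!/usr/bin/env bash
# Sketch: classify a tile version as "fixed", "affected", or "unknown".
# Assumption: patches at or above the listed version in the same
# major.minor line include the fix; other lines are not covered here.
has_fix() {
  version="$1"
  case "$version" in
    *.*.*) ;;                              # need major.minor.patch
    *) echo "unknown"; return 0 ;;
  esac
  major="${version%%.*}"
  rest="${version#*.}"
  minor="${rest%%.*}"
  patch="${rest#*.}"
  patch="${patch%%[!0-9]*}"                # drop any suffix after the patch number
  case "$major.$minor" in
    4.0) min_patch=26 ;;
    5.0) min_patch=16 ;;
    6.0) min_patch=6 ;;
    *) echo "unknown"; return 0 ;;
  esac
  [ -n "$patch" ] || { echo "unknown"; return 0; }
  if [ "$patch" -ge "$min_patch" ]; then
    echo "fixed"
  else
    echo "affected"
  fi
}

has_fix "4.0.25"
has_fix "5.0.16"
```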

As a workaround, manually recreate the failing Diego cell by running the following bosh command:

bosh -d deployment-guid recreate iso_cell_guid --no-converge --fix