Upgrade of TKGi clusters takes significantly more time than normal

Article ID: 373941

Products

VMware Tanzu Kubernetes Grid Integrated Edition
VMware Tanzu Kubernetes Grid Integrated Edition (Core)
VMware Tanzu Kubernetes Grid Integrated Edition 1.x
VMware Tanzu Kubernetes Grid Integrated Edition Starter Pack (Core)
VMware Tanzu Kubernetes Grid Integrated (TKGi)

Issue/Introduction

During the upgrade of the TKGi tile and its clusters, the upgrade process took much longer than previous upgrades, although it did complete without issue.

The upgrade of the Kubernetes clusters showed a similar delay: a cluster with 1 master and 1 worker node, which usually takes around 15 minutes, now takes around 1.5 hours.

Environment

TKGi in general

Cause

The status of the running BOSH Director task was checked during the upgrade of one cluster with

bosh task <ID> --debug

and

bosh task <ID>

Both indicated that the step of pushing the jobs (services) to the service instance was taking a very long time.
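
For reference, a minimal sketch of locating and following the in-progress task with the BOSH CLI (the task ID is a placeholder):

# List tasks that are currently in progress and note the ID of the cluster upgrade task
bosh tasks

# Follow the task output live, then capture the full debug log to review the timing of each step
bosh task <ID>
bosh task <ID> --debug > task-<ID>-debug.log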

Investigating the BOSH Director and the worker node VM where the task was running showed a large queue on the connection from the Director to the worker node.

This was identified with

netstat -putan

by looking for a large queue size in the Send-Q column.

netstat -putan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:2323          0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:3333            0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -
tcp        0      0 10.xx.xx.21:22          10.xx.xx.5:61705        ESTABLISHED -
tcp        0 132432 10.xx.xx.21:PORT        10.xx.xx.5:51370        ESTABLISHED -
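
As an illustration only (assuming the standard Linux netstat column layout shown above, where Send-Q is the third field), connections with a large Send-Q can be filtered out directly:

# Print TCP connections whose Send-Q (third column, in bytes) exceeds roughly 100 KB
netstat -putan | awk '$1 ~ /^tcp/ && $3 > 100000 {print}'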

A ping with a larger payload size and fragmentation disallowed was also tried:

ping <WORKER IP> -M do -s 1400

Some of the reply packets did not come back, which can be an indication of dropped packets along the network path.
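
A minimal sketch of sweeping payload sizes to find the largest size that passes with the "do not fragment" flag set (the worker IP is a placeholder and the payload sizes are examples only):

# The largest payload that succeeds, plus 28 bytes of IP/ICMP headers, is the usable path MTU
for size in 1200 1300 1400 1472; do
    echo "payload ${size} bytes:"
    ping -c 3 -M do -s ${size} <WORKER IP>
done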

This pointed to a networking issue in the underlying infrastructure.

 

Resolution

A vMotion of the affected VMs was completed (the ESXi host was put into maintenance mode), which improved the speed, and the task completed.

It was confirmed that a single ESXi host had an issue; the BOSH Director was hosted on this ESXi host, and the overall network speed was affected.
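
As an illustration, if the govc CLI is available and configured against the vCenter (the VM and host names below are placeholders), the ESXi host running the BOSH Director VM can be identified and put into maintenance mode:

# Show details of the BOSH Director VM, including the ESXi host it is running on
govc vm.info <BOSH_DIRECTOR_VM_NAME>

# Put the affected host into maintenance mode so its VMs are vMotioned away
# (DRS in fully automated mode, or a manual migration, is required to drain the host)
govc host.maintenance.enter <HOST_NAME>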

 

Additional Information

Running netstat -putan on both the BOSH Director (source) and the worker node (destination) can be used to determine whether the Send-Q or Recv-Q of the connection between them is unusually large.

In this scenario, the Send-Q on the BOSH Director side showed approximately 200,000 bytes waiting to be sent to the affected worker node (the Send-Q value reported by netstat is in bytes).
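
For example, the queue of the specific connection can be watched over time from the BOSH Director (the worker IP is a placeholder):

# Re-run netstat every 5 seconds and show only the connection to the worker node;
# a Send-Q value that stays high or keeps growing points to a transmission problem
watch -n 5 "netstat -putan | grep '<WORKER IP>'"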