During the upgrade process of the cluster and the tiles
The upgrade process took much longer than previous upgrades.
However, it did seem to complete without issue. The upgrades of the Kubernetes clusters show a similar delay.
A cluster with 1 master and 1 worker usually takes around 15 minutes but now takes 1.5 hours.
TKGi in general
Verified the status of the running BOSH Director task during the upgrade of one cluster:
bosh task <ID> --debug
and
bosh task <ID>
Both indicated that the operation of pushing the jobs (services) to the service instance was taking a very long time.
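If the ID of the in-flight task is not known, it can be looked up first. As a minimal sketch with the standard BOSH CLI (the task ID 123 below is only a placeholder):
bosh tasks (lists the tasks currently in progress, with their IDs and deployments)
bosh task 123 --debug (follows the debug log of that task, so the slow job-push steps become visible)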
Investigating the BOSH Director and the worker node VM where the task was running showed a large queue on the connection to that worker node.
We were able to identify this with:
netstat -putan (looking for a large queue size in Send-Q)
netstat -putan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address        Foreign Address       State        PID/Program name
tcp        0      0 127.0.0.1:2323       0.0.0.0:*             LISTEN       -
tcp        0      0 0.0.0.0:3333         0.0.0.0:*             LISTEN       -
tcp        0      0 0.0.0.0:22           0.0.0.0:*             LISTEN       -
tcp        0      0 10.xx.xx.21:22       10.xx.xx.5:61705      ESTABLISHED  -
tcp        0 132432 10.xx.xx.21:PORT     10.3xx.xx.5:51370     ESTABLISHED  -
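To check whether the queue is growing or draining, and whether the TCP connection is retransmitting, the same sockets can be watched over time. As a rough sketch (replace <WORKER IP> with the worker node address):
watch -n 2 "netstat -putan | grep <WORKER IP>" (re-runs the check every 2 seconds so a steadily growing Send-Q stands out)
ss -tin dst <WORKER IP> (shows TCP internals for connections to that address; a rising retrans counter together with a large Send-Q points to loss on the path)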
We also tried ping with a larger packet size to test the MTU along the path:
ping -M do -s 1400 <WORKER IP>
Some of the ping replies did not come back, which can be an indication of dropped packets over the network path.
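As a sketch, the loss and the path MTU can be quantified by sending a fixed number of probes at different sizes (1400 and 1472 are example sizes, not values taken from this case):
ping -M do -s 1400 -c 50 <WORKER IP>
ping -M do -s 1472 -c 50 <WORKER IP> (1472 bytes of payload plus 28 bytes of ICMP/IP headers equals a standard 1500-byte MTU; "Frag needed" replies indicate a smaller MTU on the path, while silent loss in the packet-loss summary indicates drops)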
This pointed to a networking issue in the underlying infrastructure.
We completed a vMotion (the affected host was put into maintenance mode), which improved the speed, and the task completed.
It was confirmed that a single ESXi host had an issue; the BOSH Director was hosted on this ESXi host, and the overall network speed was affected.
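To confirm which ESXi host the BOSH Director VM is running on, the vCenter UI can be used, or, as a sketch with the govc CLI (assuming govc is already configured against the vCenter; the VM name below is a placeholder):
govc vm.info <BOSH DIRECTOR VM NAME> (the Host: field in the output shows the ESXi host the VM is currently running on)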
Running netstat -putan on both the BOSH source and the destination VM can be used to determine whether the Send-Q or Recv-Q values are elevated.
In this scenario we had roughly 200 000 bytes of data sitting in the send queue on the BOSH Director side, waiting to be sent to the affected worker node.
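As a rough sketch of sampling that queue on the BOSH Director side over time (replace <WORKER IP>; the awk column assumes the default netstat layout shown above, where Send-Q is the third column):
while true; do echo "$(date +%T) $(netstat -putan | grep <WORKER IP> | awk '{print $3}')"; sleep 5; done
A value that keeps growing instead of draining means the Director cannot get data onto the wire towards the worker, as was the case here.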