During the upgrade process of the cluster and the tiles
The upgrade process took much longer than previous upgrades.
However, it did seem to complete without issue. The upgrades of the Kubernetes clusters show a similar delay.
A cluster with 1 master and 1 worker usually takes around 15 minutes but now takes 1.5 hours.
TKGi in general
Verified the status of the running BOSH Director task during the upgrade of one cluster:
bosh task <ID> --debug
and
bosh task <ID>
Both indicated that the operation of pushing the jobs (services) to the service instance was taking a very long time.
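If the ID of the in-flight task is not known, it can be looked up first. As a minimal sketch with the standard BOSH CLI (the task ID 123 below is only a placeholder):
bosh tasks (lists the tasks currently in progress, with their IDs and deployments)
bosh task 123 --debug (follows the debug log of that task, so the slow job-push steps become visible)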
Investigating the BOSH Director and the worker node VM where the task was running showed a large queue on the connection to that worker node.
We were able to identify this with:
netstat -putan (looking for a large queue size in Send-Q)
netstat -putan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address        Foreign Address       State        PID/Program name
tcp        0      0 127.0.0.1:2323       0.0.0.0:*             LISTEN       -
tcp        0      0 0.0.0.0:3333         0.0.0.0:*             LISTEN       -
tcp        0      0 0.0.0.0:22           0.0.0.0:*             LISTEN       -
tcp        0      0 10.xx.xx.21:22       10.xx.xx.5:61705      ESTABLISHED  -
tcp        0 132432 10.xx.xx.21:PORT     10.3xx.xx.5:51370     ESTABLISHED  -
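To check whether the queue is growing or draining, and whether the TCP connection is retransmitting, the same sockets can be watched over time. As a rough sketch (replace <WORKER IP> with the worker node address):
watch -n 2 "netstat -putan | grep <WORKER IP>" (re-runs the check every 2 seconds so a steadily growing Send-Q stands out)
ss -tin dst <WORKER IP> (shows TCP internals for connections to that address; a rising retrans counter together with a large Send-Q points to loss on the path)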
We also tried ping with a larger packet size to test the MTU along the path:
ping -M do -s 1400 <WORKER IP>
Some of the ping replies did not come back, which can be an indication of dropped packets over the network path.
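As a sketch, the loss and the path MTU can be quantified by sending a fixed number of probes at different sizes (1400 and 1472 are example sizes, not values taken from this case):
ping -M do -s 1400 -c 50 <WORKER IP>
ping -M do -s 1472 -c 50 <WORKER IP> (1472 bytes of payload plus 28 bytes of ICMP/IP headers equals a standard 1500-byte MTU; "Frag needed" replies indicate a smaller MTU on the path, while silent loss in the packet-loss summary indicates drops)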
This pointed to a networking issue in the underlying infrastructure.
We completed a vMotion (the affected host was put into maintenance mode), which improved the speed, and the task completed.
It was confirmed that a single ESXi host had an issue; the BOSH Director was hosted on this ESXi host, and the overall network speed was affected.
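To confirm which ESXi host the BOSH Director VM is running on, the vCenter UI can be used, or, as a sketch with the govc CLI (assuming govc is already configured against the vCenter; the VM name below is a placeholder):
govc vm.info <BOSH DIRECTOR VM NAME> (the Host: field in the output shows the ESXi host the VM is currently running on)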
Running netstat -putan on both the BOSH source and the destination VM can be used to determine whether the Send-Q or Recv-Q values are elevated.
In this scenario we had roughly 200 000 bytes of data sitting in the send queue on the BOSH Director side, waiting to be sent to the affected worker node.
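As a rough sketch of sampling that queue on the BOSH Director side over time (replace <WORKER IP>; the awk column assumes the default netstat layout shown above, where Send-Q is the third column):
while true; do echo "$(date +%T) $(netstat -putan | grep <WORKER IP> | awk '{print $3}')"; sleep 5; done
A value that keeps growing instead of draining means the Director cannot get data onto the wire towards the worker, as was the case here.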