Upgrade of TKG/PKS/PAS is very slow



Article ID: 298632


Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

This article describes how to determine whether the BOSH Director blobstore is the cause of slowness when upgrading TKG/PKS/TAS.

Symptoms:
When upgrading TKG/PKS, each master node takes up to 20 minutes to upgrade and each worker node takes up to 40 minutes, consistently across all clusters.

Environment

Product Version: 1.7

Cause

Disk or network slowness affecting the BOSH Director blobstore can make blob downloads slow, which in turn slows down every VM update during the upgrade.

Resolution

You can use the bosh-agent logs on a worker node (typically /var/vcap/bosh/log/current) to identify how long it takes to download packages. The bosh-agent log contains the commands executed, each with start and finish output. For example:
 
2020-11-24_10:36:33.50706 [Cmd Runner] 2020/11/24 10:36:33 DEBUG - Running command 'arping -c 1 -U -I eth0 172.30.16.6'
2020-11-24_10:36:33.52261 [Cmd Runner] 2020/11/24 10:36:33 DEBUG - Stdout: ARPING 172.30.16.6 from 172.30.16.6 eth0
2020-11-24_10:36:33.52263 Sent 1 probes (1 broadcast(s))
2020-11-24_10:36:33.52264 Received 0 response(s)
2020-11-24_10:36:33.52264 [Cmd Runner] 2020/11/24 10:36:33 DEBUG - Stderr: 
2020-11-24_10:36:33.52265 [Cmd Runner] 2020/11/24 10:36:33 DEBUG - Successful: true (0)


In the above example, the arping command starts with the "Running command 'arping..." line and has Stdout, Stderr, and Successful lines associated with it, which usually come one right after another.

You need to identify 'bosh-blobstore-dav' commands where the Stdout, Stderr, and Successful lines come a few minutes after the "Running command" line. For example:

2020-11-24_10:36:30.88086 [Cmd Runner] 2020/11/24 10:36:30 DEBUG - Running command 'bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get f3e0c9a9-1205-49e9-b1b1-a3c4ed964b15 /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get940214647'
2020-11-24_10:36:31.81050 [Action Dispatcher] 2020/11/24 10:36:31 INFO - Received request with action get_task
2020-11-24_10:36:31.81053 [Action Dispatcher] 2020/11/24 10:36:31 DEBUG - Payload
...

2020-11-24_10:36:33.50706 [Cmd Runner] 2020/11/24 10:36:33 DEBUG - Running command 'arping -c 1 -U -I eth0 172.30.16.6'
2020-11-24_10:36:33.52261 [Cmd Runner] 2020/11/24 10:36:33 DEBUG - Stdout: ARPING 172.30.16.6 from 172.30.16.6 eth0
2020-11-24_10:36:33.52263 Sent 1 probes (1 broadcast(s))
2020-11-24_10:36:33.52264 Received 0 response(s)
2020-11-24_10:36:33.52264 [Cmd Runner] 2020/11/24 10:36:33 DEBUG - Stderr: 
2020-11-24_10:36:33.52265 [Cmd Runner] 2020/11/24 10:36:33 DEBUG - Successful: true (0)


...

2020-11-24_10:42:54.24887 [Cmd Runner] 2020/11/24 10:42:54 DEBUG - Stdout: 
2020-11-24_10:42:54.24889 [Cmd Runner] 2020/11/24 10:42:54 DEBUG - Stderr: 
2020-11-24_10:42:54.24889 [Cmd Runner] 2020/11/24 10:42:54 DEBUG - Successful: true (0)

 

In the above example, the "Running command 'bosh-blobstore-dav" line does not have any Stdout, Stderr, or Successful lines immediately below it. Before those lines appear, we see another command (arping) that does come with its Stdout, Stderr, and Successful lines, which shows that the arping command finished while the blobstore download was still running.

At 10:42:54 we see Stdout, Stderr, and Successful lines without any "Running command" line immediately above them, so that is a very good candidate for the output of the "bosh-blobstore-dav" command that started 6 minutes earlier. This would explain the slowness during the upgrade, since downloading a package should only take a few seconds at most.
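To make such gaps easier to spot, you can filter the agent log down to the blobstore invocations and the command-completion lines so the timestamps can be compared at a glance. This is only a sketch; it assumes the bosh-agent log is at /var/vcap/bosh/log/current, which may be rotated or located differently in your environment:

# List blobstore downloads alongside command-completion lines so the
# time gap between a "get" and its completion stands out:
grep -E "Running command 'bosh-blobstore-dav|Successful:" /var/vcap/bosh/log/current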

In addition to the above, you can also run the "bosh-blobstore-dav" command manually from the master or worker VMs and measure how long it takes to download a blob. For example:
 

time bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get f3e0c9a9-1205-49e9-b1b1-a3c4ed964b15 /tmp/filename
 
real    7m6.797s
user    0m3.726s
sys     0m11.478s


If the download takes a long time (for example, a few minutes), that confirms the system is experiencing slowness, most likely related to network or disk performance, which should be investigated with your IaaS team.
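As an additional data point to help distinguish disk slowness from network slowness, you can run a rough write-throughput test on the BOSH Director VM. The sketch below assumes the internal blobstore keeps its data on the Director's persistent disk under /var/vcap/store and that there is enough free space for a 512 MB test file; adjust the path to your environment and remove the test file afterwards:

# Rough write-throughput check; oflag=direct bypasses the page cache
# so the reported rate reflects the underlying disk (path is an assumption):
dd if=/dev/zero of=/var/vcap/store/dd-test.tmp bs=1M count=512 oflag=direct
rm /var/vcap/store/dd-test.tmp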