How to prevent a persistent volume failure during Tanzu Kubernetes Grid Integrated Edition maintenance windows

Article ID: 316963

Products

VMware Tanzu Kubernetes Grid Integrated (TKGi)

Issue/Introduction

This article provides steps to prevent persistent volume operations from being performed during a Tanzu Kubernetes Grid Integrated Edition (TKGI) maintenance window.

Environment

VMware PKS 1.x

Cause

During a TKGI maintenance window, the underlying network may be unstable, which can cause persistent volume operations to fail or leave volumes in an inconsistent state.

Resolution

When you have planned network maintenance or a general maintenance window that may impact your TKGI environment, perform the following steps before the maintenance operation begins:

Disable the BOSH resurrector. Run the command bosh update-resurrection off
Log in to each cluster that will go into maintenance.
List all the running jobs. Run a command similar to bosh -d <service-instance> ssh master -c "sudo monit summary"
Stop all the running jobs. Run a command similar to bosh -d <service-instance> ssh master -c "sudo monit stop <job>"
Check that each job has stopped. Run a command similar to bosh -d <service-instance> ssh master -c "sudo monit status <job>"
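
The pre-maintenance steps can be sketched as a small shell script. This is a minimal sketch, not an official tool: the helper names run and pre_maintenance and the DRY_RUN variable are hypothetical, and the job names must be supplied by you from the monit summary output. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: run, pre_maintenance, and DRY_RUN are hypothetical names,
# not part of bosh or TKGI.
set -eu

# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    # Printed form is flattened: the quotes around the -c argument are not shown.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: pre_maintenance <service-instance> <job> [<job> ...]
pre_maintenance() (
  deployment="$1"
  shift
  # Step 1: disable the BOSH resurrector.
  run bosh update-resurrection off
  # Step 2: list the jobs on the master instances.
  run bosh -d "$deployment" ssh master -c "sudo monit summary"
  # Step 3: stop each job passed on the command line.
  for job in "$@"; do
    run bosh -d "$deployment" ssh master -c "sudo monit stop $job"
  done
  # Step 4: check the status of each stopped job.
  for job in "$@"; do
    run bosh -d "$deployment" ssh master -c "sudo monit status $job"
  done
)

# Run only when arguments are given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
  pre_maintenance "$@"
fi
```

Assuming the sketch is saved as pre-maintenance.sh, running sh pre-maintenance.sh <service-instance> <job> prints the commands for review, and prefixing the call with DRY_RUN=0 executes them. Repeat for each cluster that will go into maintenance.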

Once the maintenance window is complete, perform the following steps:

Log in to each cluster that went into maintenance.
List all the jobs and their status. Run a command similar to bosh -d <service-instance> ssh master -c "sudo monit summary"
Start the previously stopped jobs. Run a command similar to bosh -d <service-instance> ssh master -c "sudo monit start <job>"
Check that each job has resumed. Run a command similar to bosh -d <service-instance> ssh master -c "sudo monit status <job>"
Enable the BOSH resurrector. Run the command bosh update-resurrection on
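
The post-maintenance steps can be sketched the same way. This is a minimal sketch under stated assumptions: run, post_maintenance, and DRY_RUN are hypothetical names, and the job list is whatever you stopped before the window. Note that the resurrector is turned back on only after the jobs have been started and checked.

```shell
#!/bin/sh
# Sketch only: run, post_maintenance, and DRY_RUN are hypothetical names,
# not part of bosh or TKGI.
set -eu

# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    # Printed form is flattened: the quotes around the -c argument are not shown.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: post_maintenance <service-instance> <job> [<job> ...]
post_maintenance() (
  deployment="$1"
  shift
  # Step 1: list the jobs and their current status.
  run bosh -d "$deployment" ssh master -c "sudo monit summary"
  # Step 2: start each job that was stopped before the window.
  for job in "$@"; do
    run bosh -d "$deployment" ssh master -c "sudo monit start $job"
  done
  # Step 3: check that each job has resumed.
  for job in "$@"; do
    run bosh -d "$deployment" ssh master -c "sudo monit status $job"
  done
  # Step 4: re-enable the BOSH resurrector.
  run bosh update-resurrection on
)

# Run only when arguments are given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
  post_maintenance "$@"
fi
```

Assuming the sketch is saved as post-maintenance.sh, running sh post-maintenance.sh <service-instance> <job> prints the commands for review, and prefixing the call with DRY_RUN=0 executes them.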

Note: During the maintenance window, all persistent volume operations will be queued and will resume once the service has started again.
