How to prevent a persistent volume failure during Tanzu Kubernetes Grid Integrated Edition maintenance windows

Article ID: 316963

Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated (TKGi)

Issue/Introduction

This article provides steps to prevent persistent volume operations from being performed during a Tanzu Kubernetes Grid Integrated Edition (TKGI) maintenance window.

Environment

VMware PKS 1.x

Cause

During a TKGI maintenance window, the underlying network may be unstable, which can cause persistent volume operations to fail or leave volumes in an inconsistent state.

Resolution


When a planned network or general maintenance window may impact your TKGI environment, perform the following steps before the maintenance operation begins:

  1. Disable the BOSH resurrector by running the command bosh update-resurrection off
  2. Log in to each cluster that will go into maintenance.
  3. List all running jobs with a command similar to bosh -d <service-instance> ssh master -c "sudo monit summary"
  4. Stop each running job with a command similar to bosh -d <service-instance> ssh master -c "sudo monit stop <job>"
  5. Verify that each job has stopped with a command similar to bosh -d <service-instance> ssh master -c "sudo monit status <job>"
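The pre-maintenance steps above can be collected into a small script. The following is a minimal sketch, not a supported tool: the DEPLOYMENT and JOBS values are hypothetical placeholders (take the real ones from your BOSH deployment list and from the monit summary output), and the run helper and DRY_RUN switch are illustrative additions that only print the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: pre-maintenance steps for one cluster deployment.
set -eu

DEPLOYMENT="${DEPLOYMENT:-service-instance_example}"  # hypothetical deployment name
JOBS="${JOBS:-example-job}"                           # hypothetical monit job names, space separated
DRY_RUN="${DRY_RUN:-1}"                               # 1 = print commands only, 0 = execute them

run() {
  # Print the command (quoting is not preserved in this preview);
  # execute it only when DRY_RUN=0.
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

pre_maintenance() {
  # Step 1: disable the BOSH resurrector.
  run bosh update-resurrection off
  # Step 3: list the jobs on the master instance group.
  run bosh -d "$DEPLOYMENT" ssh master -c "sudo monit summary"
  # Steps 4 and 5: stop each job, then check its status.
  for job in $JOBS; do
    run bosh -d "$DEPLOYMENT" ssh master -c "sudo monit stop $job"
    run bosh -d "$DEPLOYMENT" ssh master -c "sudo monit status $job"
  done
}

pre_maintenance
```

Run with the defaults, the script only prints the commands it would issue, so the list can be reviewed before setting DRY_RUN=0 for the real run.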

Once the maintenance window is complete, perform the following steps:

  1. Log in to each cluster that went into maintenance.
  2. List all jobs and their status with a command similar to bosh -d <service-instance> ssh master -c "sudo monit summary"
  3. Start each previously stopped job with a command similar to bosh -d <service-instance> ssh master -c "sudo monit start <job>"
  4. Verify that each job has resumed with a command similar to bosh -d <service-instance> ssh master -c "sudo monit status <job>"
  5. Enable the BOSH resurrector by running the command bosh update-resurrection on
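The post-maintenance steps can be sketched the same way. As before, this is an illustration under assumptions: DEPLOYMENT and JOBS are hypothetical placeholders, JOBS should name the jobs that were stopped before the maintenance window, and the run helper with its DRY_RUN switch is an illustrative addition that prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: post-maintenance steps for one cluster deployment.
set -eu

DEPLOYMENT="${DEPLOYMENT:-service-instance_example}"  # hypothetical deployment name
JOBS="${JOBS:-example-job}"                           # hypothetical names of the previously stopped jobs
DRY_RUN="${DRY_RUN:-1}"                               # 1 = print commands only, 0 = execute them

run() {
  # Print the command (quoting is not preserved in this preview);
  # execute it only when DRY_RUN=0.
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

post_maintenance() {
  # Step 2: list the jobs and their current status.
  run bosh -d "$DEPLOYMENT" ssh master -c "sudo monit summary"
  # Steps 3 and 4: start each stopped job, then check its status.
  for job in $JOBS; do
    run bosh -d "$DEPLOYMENT" ssh master -c "sudo monit start $job"
    run bosh -d "$DEPLOYMENT" ssh master -c "sudo monit status $job"
  done
  # Step 5: re-enable the BOSH resurrector last, once the jobs are back.
  run bosh update-resurrection on
}

post_maintenance
```

The resurrector is re-enabled only after the jobs have been restarted, matching the order of the steps above.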

Note: During the maintenance window, all persistent volume operations will be queued and will resume once the service has started again.