Smarts NCM / Voyence Control: Database performance issues when a VM snapshot is running against a controldb host VM

Article ID: 331246

Updated On:

Products

VMware Smart Assurance

Issue/Introduction

Symptoms:


Performance of the NCM PostgreSQL Control Database (controldb) can degrade severely when a snapshot is running against the Virtual Machine (VM) where the controldb is installed.

Environment

VMware Smart Assurance - NCM

Cause

In busy NCM production environments, the storage read/write profile of the NCM controldb binary change log often causes rapid expansion of the VM snapshot binary change files. These files reside on the VM host and are external to the NCM controldb VM. Their growth can tie up significant VM host resources, and the change files associated with a running snapshot can consume several times more of the VM host's available storage than the NCM controldb VM itself occupies. Cases have been seen where the snapshot-related files associated with an NCM controldb VM have grown to over 20 times the size of the original NCM controldb VM. This kind of growth can adversely affect the stability of the underlying VM host.

Binary change files of this size are extremely difficult for the VM host to work with, and the challenges of successfully committing them are significant. The files are usually that large because they are growing continuously at a rapid rate, even as the VM host attempts to coalesce the data they contain into the VM snapshot base image and then sequentially purge the coalesced (and therefore unneeded) data from the snapshot binary change files. This process can consume up to double the compute and storage IOPS resources on the VM host that normal operation of the NCM controldb VM consumes. Further, because this work is performed by VM host processes that have higher priority than any activity running on the NCM controldb VM, it cannot be controlled by resource reservation rules assigned to the NCM controldb VM.
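
To get a rough feel for how quickly snapshot change files could reach the sizes described above, the following Python sketch estimates the time needed for them to grow to a given multiple of the VM size. It assumes, purely for illustration, that the change files grow at a constant rate. The function name and the example figures (a 200 GB VM and 50 GB of changes per hour) are hypothetical and are not measurements from any NCM environment.

```python
def hours_until_snapshot_exceeds(vm_size_gb: float,
                                 change_rate_gb_per_hour: float,
                                 multiple: float) -> float:
    """Estimate hours until snapshot change files reach `multiple` times the VM size.

    Assumes a constant (linear) growth rate, which is a simplification;
    real growth depends on the workload and the hypervisor.
    """
    if change_rate_gb_per_hour <= 0:
        raise ValueError("change_rate_gb_per_hour must be positive")
    target_size_gb = vm_size_gb * multiple
    return target_size_gb / change_rate_gb_per_hour


# Hypothetical example values, not taken from a real system.
estimate = hours_until_snapshot_exceeds(vm_size_gb=200,
                                        change_rate_gb_per_hour=50,
                                        multiple=20)
print(f"Estimated hours to reach 20x the VM size: {estimate}")
```

An estimate of this kind is only a planning aid. The actual change rate should be measured on the VM host before any conclusion is drawn.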

Resolution

Do not run snapshots against any production NCM controldb VM except under the following scenarios:
  • A sophisticated backup solution makes use of VM snapshots
In this scenario, the backup solution may engage the VM host to sequentially back up each running VM on the host by:
  1.  Creating a snapshot of the VM
  2.  Copying the VM's base state out of the snapshot to use as a backup
  3.  Immediately eliminating the snapshot by committing the snapshot as soon as the base image has been copied to the backup media
IMPORTANT NOTE:

Careful resource analysis must be performed before deploying any backup solution of this type, to confirm that the significant resources needed to create, run, and commit its short-term snapshots will be available and that the snapshots can be reliably committed in a timely fashion. It is preferable to schedule backups of this kind so that the related snapshot processing does not compete with other scheduled off-peak tasks running on the VM host or on any of the VMs residing on that host.
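
The create, copy, and commit sequence described in this scenario can be sketched in Python as shown below. This is a minimal illustration, not the behavior of any particular backup product. The three callables are hypothetical placeholders for whatever the backup solution and hypervisor tooling actually provide. The point of the sketch is that the commit step runs even when the copy step fails, so the snapshot is never left running longer than necessary.

```python
from typing import Callable


def backup_with_short_lived_snapshot(
    create_snapshot: Callable[[], str],
    copy_base_image: Callable[[str], None],
    commit_snapshot: Callable[[str], None],
) -> None:
    """Run a snapshot-based backup and always commit the snapshot afterwards.

    All three callables are hypothetical hooks supplied by the caller:
      create_snapshot  -> returns an identifier for the new snapshot
      copy_base_image  -> copies the VM base state to the backup media
      commit_snapshot  -> commits (removes) the snapshot
    """
    snapshot_id = create_snapshot()
    try:
        copy_base_image(snapshot_id)
    finally:
        # Commit immediately, whether or not the copy succeeded.
        commit_snapshot(snapshot_id)
```

If the copy step raises an exception, the exception still propagates to the caller after the snapshot has been committed, so the failed backup remains visible to the scheduler or operator.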
 
  • Manual changes are being performed on the NCM controldb VM
In this scenario, manual changes may include restoring an NCM or NCM VM backup, performing an NCM upgrade, or modifying the NCM controldb VM itself to expand memory, storage, or other resources assigned to the VM. The purpose of such a snapshot is to allow quick recovery from NCM controldb outages resulting from unanticipated events that occur only during manual maintenance.
 
IMPORTANT NOTE:
 
When making manual changes to a production NCM controldb host, it is good practice to declare a formal maintenance window. The first steps in any manual change process performed during a maintenance window should be to:
  1. Stop all NCM services on all servers in the NCM instance
  2. Stop the NCM controldb VM
  3. Create a VM level backup of the inactive NCM controldb VM
Once a stable backup of the NCM controldb VM has been obtained, the VM can be started again. At that point the VM is in a state where a running snapshot may be useful. Maintenance windows usually greatly reduce the load on the NCM controldb VM, which in turn reduces any snapshot-related load on the underlying VM host and makes snapshot processing significantly easier.

Once a manual change to the NCM controldb has been completed, do not leave the snapshot running while the change is being verified. Verification of manual changes often requires several days of testing, which is an unacceptable time span for any snapshot running against an NCM controldb VM. If the changes cannot be verified and must be reverted, use the backup of the NCM controldb VM that was made at the beginning of the maintenance window, while the NCM controldb VM was stopped, rather than a running snapshot.
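
One way to avoid leaving a maintenance snapshot running by accident is to check its age against a limit agreed on for the maintenance window. The Python sketch below shows such a check. The function name and the four-hour limit used in the example are hypothetical. This article does not define a specific maximum snapshot age, so the limit should be chosen for the environment concerned.

```python
from datetime import datetime, timedelta


def snapshot_is_overdue(created_at: datetime,
                        now: datetime,
                        max_age: timedelta) -> bool:
    """Return True when a snapshot has been running longer than max_age."""
    return now - created_at > max_age


# Hypothetical example: snapshot taken at 08:00, checked at 13:00,
# with an assumed limit of four hours.
created = datetime(2024, 1, 1, 8, 0)
checked = datetime(2024, 1, 1, 13, 0)
if snapshot_is_overdue(created, checked, timedelta(hours=4)):
    print("Snapshot has exceeded the agreed maintenance window limit")
```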