VCF Services Platform Cluster Health Checks

Article ID: 389510


Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

This article assists in diagnosing and resolving issues across various services by leveraging health check reports and relevant data from the cluster.

Environment

  • VCF Operations 9.0
  • VCF Automation 9.0
  • VCF Identity Broker 9.0

Resolution

Health Check Overview

Health checks are background tests that run in the cluster to determine both its integrity and its state. It is acceptable for some tests to return a non-200 status while the system is going through Day 2 activities. Workflows in VCF Ops LCM use this state to decide whether to execute certain tasks or to wait for a particular state. All tests write log entries that can be imported into Log Insight for an overview of cluster health.

 

Prefix      Description
platform-   All infrastructure and cluster related health checks to ensure the integrity of the platform in which VCF Services are hosted
vidb-       Identity Broker related health checks
vcfa-       VCF Automation related health checks

 

The health check endpoint is available on each VCF Services Platform Cluster on the PRIMARY VIP on port 30006:

curl -k https://[PRIMARY_VIP]:30006/status

HTTP Status Code   Description
200                All tests are successful
218                Some tests are failing
503                All tests are failing
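
The status codes above can also be interpreted programmatically. The following is a minimal sketch, not an official client: the function names `describe_status` and `fetch_status_code` are hypothetical, the format of the response body is not covered here, and certificate verification is disabled only to mirror the `-k` flag of the curl command.

```python
import ssl
import urllib.error
import urllib.request

# Meaning of each documented HTTP status code returned by /status.
STATUS_MEANINGS = {
    200: "All tests are successful",
    218: "Some tests are failing",
    503: "All tests are failing",
}


def describe_status(code: int) -> str:
    """Translate a status code from the health endpoint into its meaning."""
    return STATUS_MEANINGS.get(code, f"Unexpected status code {code}")


def fetch_status_code(primary_vip: str, timeout: float = 10.0) -> int:
    """Query https://<primary_vip>:30006/status and return the HTTP status code.

    Certificate checks are skipped, which is the equivalent of `curl -k`.
    """
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    url = f"https://{primary_vip}:30006/status"
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx responses such as 503; the code is still useful.
        return err.code
```

For example, `describe_status(fetch_status_code("[PRIMARY_VIP]"))` (with the placeholder replaced by the real address) would return one of the three descriptions from the table.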

 

Health Checks

[  "platform-vmsp-platform-sftp",  "platform-vmsp-platform-backup-sftp",  "platform-api-server-core",  "platform-control-plane-core",  "platform-control-plane-nodes-core",  "platform-daemonsets-core",  "platform-deployments-core",  "platform-dns-core",  "platform-etcd-core",  "platform-machines-core",  "platform-networking-daemonsets-core",  "platform-networking-deployments-core",  "platform-node-disk-utilization-prom",  "platform-opsmgmt-dns",  "platform-package-core",  "platform-statefulsets-core",  "platform-storage-capacity-prom",  "platform-vc-serviceaccount-http"]

 

VIDB

[  "vidb-synthetic-checker-vidb-external-vidbReadyStatus-http",  "vidb-package-core"]

VCFA

[  "vcfa-services-health-check-prelude-health-reporter-http",  "vcfa-package-core"]
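
When reviewing a list of check names, it can help to group them by the prefixes described in the overview. The sketch below assumes you already have the names as a list of strings (for example, copied from the lists above); the helper name `group_by_prefix` is hypothetical.

```python
from collections import defaultdict

# Prefixes documented in the health check overview.
PREFIXES = ("platform-", "vidb-", "vcfa-")


def group_by_prefix(names):
    """Group health check names by their documented prefix.

    Names that match none of the prefixes are collected under "unknown".
    """
    groups = defaultdict(list)
    for name in names:
        for prefix in PREFIXES:
            if name.startswith(prefix):
                groups[prefix].append(name)
                break
        else:
            groups["unknown"].append(name)
    return dict(groups)
```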

Health Check Overview and Remediation

Control Plane and Node Tests

  • platform-control-plane-nodes-core
  • platform-machines-core
  • platform-control-plane-core

Description

These health tests monitor the state of the control plane and the number of nodes within the cluster. A small VCFA deployment is a single-node configuration. VCFA Medium and Large deployments, as well as VIDB environments, require a minimum of three control plane nodes to ensure high availability and resilience.

Node rotation can happen in two ways: planned and unplanned.

A planned change occurs when a Day 2 action is invoked, such as a change to the DNS configuration or a disk resize, in which each node is replaced. Note that node replacement requires the VirtualMachineTemplate to be available in vCenter. A planned node replacement provisions a new node in the cluster first (for example, going from 3 nodes to 4). Once the new node is deployed, services are drained from the tainted node and rescheduled to the new one.

An unplanned change occurs when Node Problem Detector replaces a node. This happens as a last resort, when a node is deemed unhealthy (for example, disk capacity is breached or the overlay network is unstable). In an unplanned replacement the node is removed first and a new node is then provisioned. As a result, these health check tests fail in the interim, because the cluster is running 2 nodes instead of 3.
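
The difference between the two rotation styles can be illustrated with a small simulation. This is only a sketch: the real check logic is not documented here, so the comparison of ready nodes against the desired count is an assumption, and the names `control_plane_check_passes` and `simulate` are hypothetical.

```python
def control_plane_check_passes(ready_nodes: int, desired_nodes: int) -> bool:
    """Assumed rule: the check passes when at least the desired number of nodes is ready."""
    return ready_nodes >= desired_nodes


def simulate(node_counts, desired_nodes=3):
    """Evaluate the assumed rule for each step of a node rotation."""
    return [control_plane_check_passes(n, desired_nodes) for n in node_counts]


# Planned rotation: a new node is added first (3 -> 4), then the old one is removed.
PLANNED = [3, 4, 3]
# Unplanned rotation: the unhealthy node is removed first (3 -> 2), then replaced.
UNPLANNED = [3, 2, 3]
```

Under this assumption, `simulate(PLANNED)` stays passing at every step, while `simulate(UNPLANNED)` fails in the middle step, which matches the temporary failures described above.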

Debugging & Remediation

  1. Ensure that no maintenance activity is taking place and that no nodes are being replaced.
  2. Check the status of the test "platform-vc-serviceaccount-check-http" to make sure that authentication to vCenter is active. If this test is failing, see the remediation steps for this test below.
  3. If the "platform-control-plane-core" test is failing, it could be because the VirtualMachine template was removed from vCenter. See Appendix 1 - Synchronize Package below.
  4. There is a series of self-remediating actions that should resolve these tests. If they are still failing after approximately 60 minutes, please log a Broadcom support request with a support bundle.

Package Management Test

  • platform-package-core

Description

The PackageDeployment is used to deploy the core platform and the hosted VIDB/VCFA services. It contains a high-level status of all deployed services, showing whether each one is in progress, successful, or in a failed state.

Debugging & Remediation

  1. Ensure that no maintenance activity is taking place and that no nodes are being replaced.
  2. It is possible to push images from the core package, including the vCenter template. See Appendix 1 - Synchronize Package below.
  3. If this test is still failing after approximately 60 minutes, please log a Broadcom support request with a support bundle.

Core Services Tests

  • platform-core-daemonsets-core
  • platform-core-deployments-core
  • platform-core-statefulsets-core
  • platform-coredns-core
  • platform-etcd-core
  • platform-api-server-core
  • platform-networking-daemonsets-core
  • platform-networking-deployments-core

Description

These tests relate to services that are not fully ready. It is normal to see these tests fail while nodes are being replaced and services are being scheduled on new nodes. All of these core services are required for the integrity and high availability of the platform in which VIDB/VCFA is hosted.

Debugging & Remediation

  1. Ensure that no maintenance activity is taking place and that no nodes are being replaced.
  2. A VirtualMachine may have been deleted from vCenter. In that case the cluster automatically attempts to provision a new node to join the cluster, and it can take up to 60 minutes for all nodes to be replaced.
  3. Check the package test "platform-package-core", which reports the status of the Core Package and any package deployment issues.
  4. It is possible to push images from the core package, including the vCenter template. See Appendix 1 - Synchronize Package below.
  5. If this test is still failing after approximately 60 minutes, please log a Broadcom support request with a support bundle.

Backup/Restore Tests

  • platform-vmsp-platform-sftp
  • platform-vmsp-platform-backup-sftp

Description

These tests are enabled when SFTP is configured in VCF Ops LCM. The "platform-vmsp-platform-sftp" test checks the connection to the SFTP endpoint, and the "platform-vmsp-platform-backup-sftp" test ensures that scheduled backups are written to the backup folder on the SFTP server.

Debugging & Remediation

  1. Check to make sure the SFTP server is reachable over the network from the VCFA/VIDB cluster
  2. Check SFTP server disk usage
  3. Check the SFTP credentials in VCF Ops LCM and ensure you can SSH into the SFTP instance
  4. Check for errors in the VCF Ops LCM log files located in /var/log/vrlcm

Infrastructure Tests

  • platform-opsmgmt-dns
  • platform-snapshotcheck-snapshot
  • platform-vc-serviceaccount-check-http
  • platform-storage-capacity-prom
  • platform-node-disk-utilization-high-prom

Description

  • platform-opsmgmt-dns - This test checks that the DNS name of the VCF Ops LCM host is resolvable by the VCFA/VIDB cluster.
  • platform-snapshotcheck-snapshot - This test checks that there are no VirtualMachine snapshots on any of the nodes. Snapshots cause a failure when rotating the cluster nodes, because PersistentVolumes cannot detach while snapshots exist.
  • platform-vc-serviceaccount-check-http - This test checks that the credentials used by the cluster are valid.
  • platform-storage-capacity-prom - This test queries all persistent volumes attached to the cluster and fails when any disk breaches 90%.
  • platform-node-disk-utilization-high-prom - This test queries each node's local disk and fails when usage breaches 90%.
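
The 90% threshold used by the two capacity tests can be sketched as follows. This is an illustration rather than the actual test implementation: the names `breaches_threshold` and `failing_volumes` are hypothetical, and treating "breaches" as strictly greater than 90% is an assumption.

```python
THRESHOLD_PERCENT = 90  # threshold named in the test descriptions


def breaches_threshold(used_bytes: int, capacity_bytes: int,
                       threshold_percent: int = THRESHOLD_PERCENT) -> bool:
    """Return True when usage is strictly above the threshold (assumed semantics)."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    # Cross-multiply instead of dividing to avoid floating point rounding at the boundary.
    return used_bytes * 100 > threshold_percent * capacity_bytes


def failing_volumes(volumes: dict) -> list:
    """Return the names of volumes whose (used, capacity) pair breaches the threshold."""
    return sorted(
        name for name, (used, capacity) in volumes.items()
        if breaches_threshold(used, capacity)
    )
```

A single volume above the threshold is enough for the storage capacity test to fail, so a non-empty result from `failing_volumes` corresponds to a failing test in this model.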

Debugging & Remediation

  1. platform-opsmgmt-dns - Check that DNS records for the VCF Ops LCM host are available, and check the DNS servers configured within VCF Ops LCM. In the VCF Ops LCM user interface you can run the "Update DNS Configuration" workflow; see Appendix 2.
  2. platform-snapshotcheck-snapshot - Remove any snapshots from the cluster nodes in vCenter. The nodes can be found within VCF Ops LCM; see Appendix 3.
  3. platform-vc-serviceaccount-check-http - If this test is failing, the connection to vCenter is not working and the credentials need to be reset; see Appendix 6 - Resetting vCenter Credentials.
  4. platform-storage-capacity-prom - See Appendix 4.
  5. platform-node-disk-utilization-high-prom - See Appendix 5.

 

VCF Service Tests

Description

  • vidb-synthetic-checker-vidb-external-vidbReadyStatus-http - This test checks whether all VIDB services are up, running, and ready to serve requests.
  • vcfa-services-health-check-prelude-health-reporter-http - This test checks whether all VCFA services are up, running, and ready to serve requests.

Debugging & Remediation

  • vidb-synthetic-checker-vidb-external-vidbReadyStatus-http - Self-healing capabilities will attempt to restore services. If this health check is still failing after 60 minutes, please log a Broadcom support request with a support bundle.
  • vcfa-services-health-check-prelude-health-reporter-http - Self-healing capabilities will attempt to restore services. If this health check is still failing after 60 minutes, please log a Broadcom support request with a support bundle.
  • For both of these services, if the VCF component (VCFA/VIDB) is unavailable, you can restore from the latest backup; see Appendix 7 - Restoring a VCF Service from SFTP Backup.

 

Appendix 1 - Synchronize Package

Description

This action, available via the VCF Ops LCM API, deploys the core platform package to the cluster and ensures that the VirtualMachineTemplate is present in vCenter, redeploying it if necessary.

  • Core services will be deployed to the cluster without causing downtime. In most cases, this operation should have no impact, but if a service is down, redeployment may resolve the issue. Additionally, if any container images are missing or corrupted, all images and associated services will be redeployed.
  • The VirtualMachineTemplate will also be redeployed to vCenter if it is not present in the correct folder.

Prerequisites

  • Ensure you have access to the VCF Ops LCM API with an authentication token. The token is base64-encoded Basic authentication.
  • Ensure that the VCFA or VIDB version binaries are deployed, because the VirtualMachineTemplate is located within the package.
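
A base64-encoded Basic authentication token follows the standard HTTP Basic scheme, in which `username:password` is base64-encoded and sent in the `Authorization` header. The sketch below shows that encoding only; the helper name `basic_auth_header` is hypothetical, and the credentials in the usage note are placeholders rather than real VCF Ops LCM accounts.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build a standard HTTP Basic Authorization header from a username and password."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

For example, `basic_auth_header("user", "pass")` returns `{"Authorization": "Basic dXNlcjpwYXNz"}`.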

 

Invoke Actions API

Variable          Description
vcf_op_hostname   VCF Ops LCM hostname
environmentid     EnvironmentId of either the VCFA or VIDB component within VCF Ops LCM.
                  The environmentid can be found in the URL, for example:
                  https://{{hostname}}/vcf-operations/ui/management/lifecycle/lcm/lcops/environments/########-####-####-####-############
product           vra|vidb
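
Because the environmentid is taken from the browser URL, it can be extracted with a small helper. This is a sketch under two assumptions: the ID is the path segment directly after `environments`, and it is UUID-shaped hexadecimal as the `########-####-…` pattern suggests. The function name `environment_id_from_url` is hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumed shape of the ID, based on the 8-4-4-4-12 pattern shown in the example URL.
UUID_RE = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def environment_id_from_url(url: str) -> str:
    """Return the path segment that follows 'environments' in a VCF Ops LCM URL."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if "environments" not in segments:
        raise ValueError("URL does not contain an 'environments' path segment")
    index = segments.index("environments")
    if index + 1 >= len(segments):
        raise ValueError("URL does not contain an ID after 'environments'")
    candidate = segments[index + 1]
    if not UUID_RE.match(candidate):
        raise ValueError(f"Segment {candidate!r} is not UUID-shaped")
    return candidate
```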

Get the requestId from VCF Ops LCM

Get the status of this asynchronous request from VCF Ops LCM

Appendix 2 - Update Cluster DNS

Browse to the product within VCF Ops LCM and run the "Update DNS Configuration" action.

Appendix 3 - View Cluster Nodes

Browse to the product within VCF Ops LCM and the VM Names (cluster nodes) can be seen.

Appendix 4 - Changing Attached Storage Size

Browse to the product within VCF Ops LCM and choose Storage Resize Action

Volume Group Name             Description
Database                      3 x persistent volumes attached to the cluster that store the Postgres data. 1 of the services is the Postgres master with the other 2 as replicas for automated failover.
Message Broker                VCFA Only - 3 x persistent volumes attached to the cluster that store RabbitMQ message data
Operations Orchestrator Data  VCFA Only - 3 x persistent volumes attached to the cluster that store ...
Shared Storage Metadata       1 x persistent volume attached to the cluster that stores the S3 Objectstorage metadata
Shared Storage Data           3 x persistent volumes attached to the cluster that store the S3 Objectstorage data
Event Tailer Volume           1 x persistent volume attached to the cluster to maintain the state of the events collector
Log Buffer                    Between 3 and 10 x persistent volumes attached to the cluster for the logging agent buffer (3 for a small profile, 6 for a medium profile, and 10 for a large profile)
Metrics Storage               1 x persistent volume attached to the cluster that stores cluster and application Prometheus metrics
Application Registry          1 x persistent volume attached to the cluster that stores all images used by the cluster and VCFA/VIDB packages
Support Data                  1 x persistent volume attached to the cluster that stores support/log bundles
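
The volume counts in the table can be tallied to estimate how many persistent volumes to expect. The sketch below simply sums the table values; it assumes every listed volume group is present in every deployment (apart from those marked VCFA Only), which the table does not state explicitly. The function name `expected_volume_count` is hypothetical.

```python
# Log Buffer volume count per deployment profile, as listed in the table.
LOG_BUFFER_VOLUMES = {"small": 3, "medium": 6, "large": 10}

# (volume group name, number of volumes, VCFA only?) taken from the table.
VOLUME_GROUPS = [
    ("Database", 3, False),
    ("Message Broker", 3, True),
    ("Operations Orchestrator Data", 3, True),
    ("Shared Storage Metadata", 1, False),
    ("Shared Storage Data", 3, False),
    ("Event Tailer Volume", 1, False),
    ("Metrics Storage", 1, False),
    ("Application Registry", 1, False),
    ("Support Data", 1, False),
]


def expected_volume_count(profile: str, is_vcfa: bool) -> int:
    """Sum the persistent volumes from the table for a profile and product type."""
    total = LOG_BUFFER_VOLUMES[profile]
    for _name, count, vcfa_only in VOLUME_GROUPS:
        if vcfa_only and not is_vcfa:
            continue  # skip volume groups marked "VCFA Only" for other products
        total += count
    return total
```

Under these assumptions, a small VCFA deployment would have 20 persistent volumes and a large one 27.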

Appendix 5 - Changing Local Node Storage Disk Size

Description

Each cluster node has 100 GB of local storage, which should be sufficient in most cases. If disk usage grows, it is possible to increase the local storage of all cluster nodes. Note that automation is triggered when disk usage reaches 90% for more than 5 minutes and automatically rotates the node; see the Control Plane and Node Tests section above. Node storage is increased through an API request to VCF Ops LCM.

Prerequisites

  • Ensure you have access to the VCF Ops LCM API with an authentication token. The token is base64-encoded Basic authentication.

Invoke Actions API

Sample to increase disk space from 100Gi to 150Gi. Note that this replaces each node in the cluster, one at a time.

Variable          Description
vcf_op_hostname   VCF Ops LCM hostname
environmentid     EnvironmentId of either the VCFA or VIDB component within VCF Ops LCM.
                  The environmentid can be found in the URL, for example:
                  https://{{hostname}}/vcf-operations/ui/management/lifecycle/lcm/lcops/environments/########-####-####-####-############
product           vra|vidb

 

Appendix 6 - Resetting vCenter Credentials

Description

The VCF Services Cluster uses 2 credentials in vCenter:

  • Admin / Breakglass account - this account uses the "VCF Services Platform Admin" vCenter role and is used to manage the provisioning account and automatically rotate the service account credentials based on the vCenter password policy.
  • Provisioning account - this account uses the "VCF Services Platform" vCenter role and is used to manage VCF Service Cluster nodes in vCenter.

Prerequisites

  • Ensure you have access to the VCF Ops LCM API with an authentication token. The token is base64-encoded Basic authentication.

Invoke Actions API

Variable                        Description
vcf_op_hostname                 VCF Ops LCM hostname
environmentid                   EnvironmentId of either the VCFA or VIDB component within VCF Ops LCM.
                                The environmentid can be found in the URL, for example:
                                https://{{hostname}}/vcf-operations/ui/management/lifecycle/lcm/lcops/environments/########-####-####-####-############
product                         vra|vidb
vcenter_admin_username_base64   vCenter username in base64 format
vcenter_admin_password_base64   vCenter password in base64 format
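
The two base64 variables can be produced with standard base64 encoding of the plain-text values. The following sketch assumes UTF-8 text and plain (not URL-safe) base64, which the table does not specify; the helper names `to_base64` and `encode_vcenter_credentials` are hypothetical.

```python
import base64


def to_base64(value: str) -> str:
    """Encode a string as standard base64 (UTF-8 input, ASCII output)."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def encode_vcenter_credentials(username: str, password: str) -> dict:
    """Return the two base64 variables named in the table above."""
    return {
        "vcenter_admin_username_base64": to_base64(username),
        "vcenter_admin_password_base64": to_base64(password),
    }
```

Encoding in code avoids a common pitfall of shell-based encoding, where `echo` appends a trailing newline that ends up inside the encoded value.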

 

Appendix 7 - Restoring a VCF Service from SFTP Backup

Description

The VCF Services Cluster performs a full backup once per day and incremental backups of VCFA/VIDB every 15 minutes. In the event of an outage, or if you want to revert to a previous backup, you can restore from SFTP.

Note: restoring from SFTP will delete the service from the cluster and subsequently restore Postgres data and the VCF Service (VCFA/VIDB) hosted on the cluster.

Prerequisites

  • Ensure SFTP is configured within VCF Ops LCM (Settings section).
  • Ensure Backup is configured. Backups are then taken automatically every 15 minutes and stored on the SFTP server under the backup path.

Browse to the product within VCF Ops LCM and choose the Backup and Restore / Restore action.

 

Action Items

Rename

From                                           Rename                                      Notes
core-daemonsets-k8s                            platform-daemonsets-core
core-deployments-k8s                           platform-deployments-core
core-pds-k8s                                   platform-packages-core                      Why does this not include all tenant PackageDeployments?
core-statefulsets-k8s                          platform-statefulsets-core
coredns-k8s                                    -core
etcd-k8s                                       platform-etcd-core
kube-api-k8s                                   platform-api-server-core
kubeadmcontrolplane-k8s                        platform-kubeadmcontrolplane-core
machines-k8s                                   platform-machines-core
networking-daemonsets-k8s                      platform-networking-daemonsets-core
networking-deployments-k8s                     platform-networking-deployments-core
opsmgmt-dns                                    platform-opsmgmt-dns
optional-checks-vmsp-platform-vmsp-sftp        platform-configuration-sftp
optional-checks-vmsp-platform-vmspBackup-sftp  platform-backups-sftp
snapshotcheck-snapshot                         platform-snapshots
vc-serviceaccount-check-http                   platform-virtualcenter-serviceaccount-http

New Public Facing Schema - Patching

apiVersion: xxxxxxxx/v1alpha1
kind: Schema
metadata:
  annotations:
    meta.helm.sh/release-name: vmsp-hooks
    meta.helm.sh/release-namespace: vmsp-platform
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    helm.toolkit.fluxcd.io/name: vmsp-hooks
    helm.toolkit.fluxcd.io/namespace: vmsp-platform
    hooks.vmsp.vmware.com/access: public
  name: patch-configuration
  namespace: vmsp-platform
spec:
  context:
    name: patch-configuration
    type: hook
    version: v1.0.0
  schema:
    openAPIv3Schema:
      paths:
        /webhooks/vmsp-platform/kubectl/patch:
          post:
            description: This action patches a cluster configuration
            summary: This action patches a cluster configuration