This article assists in diagnosing and resolving issues across various services by leveraging health check reports and relevant data from the cluster.
Health Check Overview
Health checks are background tests that run in the cluster to determine its integrity and state. It is fine for some tests to return a non-200 status while the system is going through Day 2 activities. Workflows in VCF Ops LCM use this state to determine whether to execute certain tasks or wait for a certain state. All tests write log entries that can be imported into Log Insight for a general overview of cluster health.
| Prefix | Description |
|---|---|
| platform- | All infrastructure and cluster related health checks to ensure the integrity of the platform in which VCF Services are hosted |
| vidb- | Identity Broker related health checks |
| vcfa- | VCF Automation related health checks |
The health check endpoint is available on the primary VIP of each VCF Services Platform Cluster, on port 30006:
curl -k https://[PRIMARY_VIP]:30006/status
| HTTP Status Code | Description |
|---|---|
| 200 | All tests are successful |
| 218 | Some tests are failing |
| 503 | All tests are failing |
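The status codes in the table can be folded into a small helper. The following is a minimal sketch and not part of the product: the names `classify_status` and `fetch_status_code` are illustrative, and certificate verification is disabled only to mirror the `-k` flag in the curl example. It relies solely on the HTTP status code; the response body format is not described here, so it is not parsed.

```python
import ssl
import urllib.error
import urllib.request

# Meanings taken from the status code table above.
STATUS_MEANINGS = {
    200: "all tests are successful",
    218: "some tests are failing",
    503: "all tests are failing",
}


def classify_status(code: int) -> str:
    """Return the documented meaning of a health check status code."""
    return STATUS_MEANINGS.get(code, f"undocumented status code {code}")


def fetch_status_code(primary_vip: str, port: int = 30006, timeout: float = 10.0) -> int:
    """Query the /status endpoint and return only the HTTP status code."""
    # Equivalent of curl -k: skip certificate verification (assumption, lab use only).
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    url = f"https://{primary_vip}:{port}/status"
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses such as 503; the code is still available.
        return error.code
```

A call such as `classify_status(fetch_status_code("[PRIMARY_VIP]"))`, with the placeholder replaced by the real address, would then report which of the three documented states the cluster is in.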
["platform-vmsp-platform-sftp", "platform-vmsp-platform-backup-sftp", "platform-api-server-core", "platform-control-plane-core", "platform-control-plane-nodes-core", "platform-daemonsets-core", "platform-deployments-core", "platform-dns-core", "platform-etcd-core", "platform-machines-core", "platform-networking-daemonsets-core", "platform-networking-deployments-core", "platform-node-disk-utilization-prom", "platform-opsmgmt-dns", "platform-package-core", "platform-statefulsets-core", "platform-storage-capacity-prom", "platform-vc-serviceaccount-http"]
["vidb-synthetic-checker-vidb-external-vidbReadyStatus-http", "vidb-package-core"]
["vcfa-services-health-check-prelude-health-reporter-http", "vcfa-package-core"]
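When reviewing a set of test names, it can help to group them by the documented prefixes so that platform, Identity Broker and VCF Automation issues are triaged separately. The sketch below assumes you already have a list of test names; the `failing` list and the function name `group_by_prefix` are hypothetical and used only for illustration.

```python
from collections import defaultdict

# Prefixes from the prefix table, mapped to the area they cover.
PREFIXES = {
    "platform-": "platform",
    "vidb-": "Identity Broker",
    "vcfa-": "VCF Automation",
}


def group_by_prefix(test_names):
    """Group health check names by their documented prefix."""
    groups = defaultdict(list)
    for name in test_names:
        for prefix in PREFIXES:
            if name.startswith(prefix):
                groups[prefix].append(name)
                break
        else:
            # No documented prefix matched this name.
            groups["unknown"].append(name)
    return dict(groups)


# Hypothetical input: names of tests that an operator wants to triage.
failing = ["platform-etcd-core", "vidb-package-core", "platform-dns-core", "vcfa-package-core"]
for prefix, names in group_by_prefix(failing).items():
    print(PREFIXES.get(prefix, "unknown"), names)
```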
platform-control-plane-nodes-core
These health tests monitor the state of the control plane and the number of nodes within the cluster. A small VCFA deployment runs as a single-node configuration. VCFA Medium and Large deployments, as well as VIDB environments, require a minimum of three control plane nodes to ensure high availability and resilience.
Node rotation can happen in two ways: planned and unplanned.
A planned change occurs when a Day 2 action that replaces each node is invoked, such as a change to the DNS configuration or a disk resize. Note that node replacement requires the VirtualMachineTemplate to be available in vCenter. A planned node replacement provisions a new node in the cluster first (for example, going from 3 nodes to 4). Once the new node is deployed, services are drained from the tainted node and rescheduled to the new one.
An unplanned change occurs when Node Problem Detector replaces a node. This happens as a last resort, when a node is deemed unhealthy, for example because disk capacity is breached or the overlay network is unstable. An unplanned node replacement removes the node first and then provisions a new one. As a result, these health checks fail while the cluster is running 2 nodes instead of 3.
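The difference between the two replacement orders can be illustrated with node counts. This is a simplified sketch: the function names are hypothetical, and the comparison against the desired node count is an assumption about how such a check could behave, not the actual test logic.

```python
def node_count_sequence(kind: str, desired: int = 3):
    """Return the node counts seen during one node replacement."""
    if kind == "planned":
        # A new node is provisioned first, then the old node is drained and removed.
        return [desired, desired + 1, desired]
    if kind == "unplanned":
        # The unhealthy node is removed first, then a replacement is provisioned.
        return [desired, desired - 1, desired]
    raise ValueError(f"unknown replacement kind: {kind}")


def below_desired(counts, desired: int = 3):
    """Flag each step at which the cluster has fewer nodes than desired."""
    return [count < desired for count in counts]


for kind in ("planned", "unplanned"):
    counts = node_count_sequence(kind)
    print(kind, counts, below_desired(counts))
```

Only the unplanned sequence dips below the desired count, which matches the description of the node tests failing while the cluster runs 2 nodes instead of 3.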
The PackageDeployment is used to deploy the core platform and the hosted VIDB/VCFA services. It contains a high-level status of all deployed services, showing whether each is in progress, successful, or failed.
These tests relate to services that are not fully ready. It is normal to see them fail while nodes are being replaced and services are scheduled onto new nodes. All of these core services are required for the integrity and high availability of the platform on which VIDB/VCFA is hosted.
These tests are enabled when SFTP is configured in VCF Ops LCM. The "platform-vmsp-platform-sftp" test checks the connection to the SFTP endpoint, and the "platform-vmsp-platform-backup-sftp" test ensures that scheduled backups are written to the backup folder on the SFTP server.
This action, available via the LCM API, deploys the core platform package to the cluster and ensures that the VirtualMachineTemplate is present in vCenter, redeploying it if necessary.
| Variable | Description |
|---|---|
| vcf_op_hostname | VCF Ops LCM Hostname |
| environmentid | EnvironmentId of either the VCFA or VIDB component within VCF Ops LCM. The environmentid can be found in the URL, for example: https://{{hostname}}/vcf-operations/ui/management/lifecycle/lcm/lcops/environments/########-####-####-####-############ |
| product | vra or vidb |
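Since the environmentid is read from the browser URL, a small helper can pull it out and sanity-check its shape. This is a sketch under assumptions: the function name is hypothetical, the host in the usage example is a placeholder, and the ID is assumed to be hexadecimal in the 8-4-4-4-12 layout suggested by the example URL.

```python
import re
from urllib.parse import urlparse

# Shape shown in the example URL: 8-4-4-4-12 characters (hexadecimal assumed).
ENVIRONMENT_ID_PATTERN = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def extract_environment_id(url: str) -> str:
    """Return the path segment that follows 'environments' in the URL."""
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    try:
        candidate = segments[segments.index("environments") + 1]
    except (ValueError, IndexError):
        raise ValueError("URL does not contain an environment ID") from None
    if not ENVIRONMENT_ID_PATTERN.match(candidate):
        raise ValueError(f"unexpected environment ID format: {candidate}")
    return candidate


# Placeholder host and ID, for illustration only.
example = (
    "https://lcm.example.com/vcf-operations/ui/management/lifecycle/lcm/lcops/"
    "environments/0a1b2c3d-1111-2222-3333-444455556666"
)
print(extract_environment_id(example))
```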
Browse to the product within VCF Ops LCM and run the "Update DNS Configuration" action.
Browse to the product within VCF Ops LCM to see the VM names (cluster nodes).
Browse to the product within VCF Ops LCM and choose Storage Resize Action.
| Volume Group Name | Description |
|---|---|
| Database | 3 x persistent volumes attached to the cluster that store the Postgres data. 1 of the services is the Postgres master with the other 2 as replicas for automated failover. |
| Message Broker | VCFA Only - 3 x persistent volumes attached to the cluster that store RabbitMQ message data |
| Operations Orchestrator Data | VCFA Only - 3 x persistent volumes attached to the cluster that store ... |
| Shared Storage Metadata | 1 x persistent volume attached to the cluster that stores the S3 Objectstorage metadata |
| Shared Storage Data | 3 x persistent volumes attached to the cluster that store the S3 Objectstorage data |
| Event Tailer Volume | 1 x persistent volume attached to the cluster to maintain the state of the events collector |
| Log Buffer | Between 3 and 10 x persistent volumes attached to the cluster for the logging agent buffer (3 for a small profile, 6 for a medium profile, and 10 for a large profile) |
| Metrics Storage | 1 x persistent volume attached to the cluster that stores cluster and application Prometheus metrics |
| Application Registry | 1 x persistent volume attached to the cluster that stores all images used by the cluster and VCFA/VIDB packages |
| Support Data | 1 x persistent volume attached to the cluster that stores support/log bundles |
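The volume counts in the table can be added up per product and profile. The sketch below is plain arithmetic over the table values; the function name is hypothetical, and it assumes that every listed volume group is present in a deployment and that VIDB uses the same small/medium/large log buffer sizing, neither of which is stated in this article.

```python
# Volume counts from the table above.
COMMON_VOLUMES = {
    "Database": 3,
    "Shared Storage Metadata": 1,
    "Shared Storage Data": 3,
    "Event Tailer Volume": 1,
    "Metrics Storage": 1,
    "Application Registry": 1,
    "Support Data": 1,
}
VCFA_ONLY_VOLUMES = {"Message Broker": 3, "Operations Orchestrator Data": 3}
LOG_BUFFER_VOLUMES = {"small": 3, "medium": 6, "large": 10}


def persistent_volume_count(product: str, profile: str) -> int:
    """Total persistent volumes for a product ('vcfa' or 'vidb') and profile."""
    if product not in ("vcfa", "vidb"):
        raise ValueError(f"unknown product: {product}")
    total = sum(COMMON_VOLUMES.values()) + LOG_BUFFER_VOLUMES[profile]
    if product == "vcfa":
        # Message Broker and Operations Orchestrator Data are VCFA only.
        total += sum(VCFA_ONLY_VOLUMES.values())
    return total


for profile in LOG_BUFFER_VOLUMES:
    print(profile, persistent_volume_count("vcfa", profile))
```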
Each cluster node has 100Gi of local storage, which should be sufficient in most cases. If disk usage grows, the storage of all cluster nodes can be increased. Note that when disk usage stays at 90% or above for more than 5 minutes, automation rotates the node automatically; see the node tests section above. Node storage is increased through an API request to VCF Ops LCM.
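The rotation rule (90% disk usage sustained for more than 5 minutes) can be expressed as a check over usage samples. This is an illustrative sketch only: the function name, the sample format, and the exact boundary handling (at or above 90%, strictly longer than 300 seconds) are assumptions, not the automation's actual implementation.

```python
def should_rotate(samples, threshold=90.0, window_seconds=300):
    """Return True if usage stays at or above threshold for more than the window.

    samples: iterable of (timestamp_seconds, usage_percent), oldest first.
    """
    breach_start = None
    for timestamp, usage in samples:
        if usage >= threshold:
            if breach_start is None:
                # First sample of a new breach period.
                breach_start = timestamp
            if timestamp - breach_start > window_seconds:
                return True
        else:
            # Usage dropped below the threshold, so the breach period resets.
            breach_start = None
    return False


# One sample per minute at 95% usage for six minutes.
print(should_rotate([(t, 95.0) for t in range(0, 361, 60)]))
```

A brief dip below the threshold resets the timer in this sketch, so only a sustained breach triggers rotation.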
Sample to increase disk space from 100Gi to 150Gi. Note that this replaces each node in the cluster, one by one.
| Variable | Description |
|---|---|
| vcf_op_hostname | VCF Ops LCM Hostname |
| environmentid | EnvironmentId of either the VCFA or VIDB component within VCF Ops LCM. The environmentid can be found in the URL, for example: https://{{hostname}}/vcf-operations/ui/management/lifecycle/lcm/lcops/environments/########-####-####-####-############ |
| product | vra or vidb |
The VCF Services Cluster uses 2 credentials in vCenter:
Prerequisites
| Variable | Description |
|---|---|
| vcf_op_hostname | VCF Ops LCM Hostname |
| environmentid | EnvironmentId of either the VCFA or VIDB component within VCF Ops LCM. The environmentid can be found in the URL, for example: https://{{hostname}}/vcf-operations/ui/management/lifecycle/lcm/lcops/environments/########-####-####-####-############ |
| product | vra or vidb |
| vcenter_admin_username_base64 | vCenter Username in base64 format |
| vcenter_admin_password_base64 | vCenter Password in base64 format |
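The two credential variables expect base64-encoded values. A minimal way to produce them is shown below; the function name and the sample values are placeholders. Keep in mind that base64 is an encoding, not encryption, so the encoded values should be handled as carefully as the plain credentials.

```python
import base64


def to_base64(value: str) -> str:
    """Encode a string as base64 text, without a trailing newline."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


# Placeholder credentials, for illustration only.
vcenter_admin_username_base64 = to_base64("admin")
vcenter_admin_password_base64 = to_base64("example-password")
print(vcenter_admin_username_base64)
```

Encoding the value directly avoids a common pitfall of shell pipelines, where a trailing newline from the input ends up inside the encoded value.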
The VCF Services Cluster performs a full backup once per day and incremental backups of VCFA/VIDB every 15 minutes. In the event of an outage, or if you want to revert to a previous backup, you can restore from SFTP.
Note: restoring from SFTP will delete the service from the cluster and subsequently restore Postgres data and the VCF Service (VCFA/VIDB) hosted on the cluster.
Browse to the product within VCF Ops LCM and choose Backup and Restore / Restore Action.
| From | Rename | Notes |
|---|---|---|
| core-daemonsets-k8s | platform-daemonsets-core | |
| core-deployments-k8s | platform-deployments-core | |
| core-pds-k8s | platform-packages-core | Why does this not include all tenant PackageDeployments? |
| core-statefulsets-k8s | platform-statefulsets-core | |
| coredns-k8s | -core | |
| etcd-k8s | platform-etcd-core | |
| kube-api-k8s | platform-api-server-core | |
| kubeadmcontrolplane-k8s | platform-kubeadmcontrolplane-core | |
| machines-k8s | platform-machines-core | |
| networking-daemonsets-k8s | platform-networking-daemonsets-core | |
| networking-deployments-k8s | platform-networking-deployments-core | |
| opsmgmt-dns | platform-opsmgmt-dns | |
| optional-checks-vmsp-platform-vmsp-sftp | platform-configuration-sftp | |
| optional-checks-vmsp-platform-vmspBackup-sftp | platform-backups-sftp | |
| snapshotcheck-snapshot | platform-snapshots | |
| vc-serviceaccount-check-http | platform-virtualcenter-serviceaccount-http | |
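If older logs or dashboards still use the names in the From column, the rename table can be applied as a simple lookup. The sketch below copies the complete rows from the table; the `coredns-k8s` row is left out because its new name is incomplete in the table, and the function name is hypothetical.

```python
# Old name -> new name, copied from the rename table.
# coredns-k8s is omitted because its target name is incomplete in the table.
RENAMES = {
    "core-daemonsets-k8s": "platform-daemonsets-core",
    "core-deployments-k8s": "platform-deployments-core",
    "core-pds-k8s": "platform-packages-core",
    "core-statefulsets-k8s": "platform-statefulsets-core",
    "etcd-k8s": "platform-etcd-core",
    "kube-api-k8s": "platform-api-server-core",
    "kubeadmcontrolplane-k8s": "platform-kubeadmcontrolplane-core",
    "machines-k8s": "platform-machines-core",
    "networking-daemonsets-k8s": "platform-networking-daemonsets-core",
    "networking-deployments-k8s": "platform-networking-deployments-core",
    "opsmgmt-dns": "platform-opsmgmt-dns",
    "optional-checks-vmsp-platform-vmsp-sftp": "platform-configuration-sftp",
    "optional-checks-vmsp-platform-vmspBackup-sftp": "platform-backups-sftp",
    "snapshotcheck-snapshot": "platform-snapshots",
    "vc-serviceaccount-check-http": "platform-virtualcenter-serviceaccount-http",
}


def renamed(check_name: str) -> str:
    """Return the new name for a check, or the input unchanged if it is not listed."""
    return RENAMES.get(check_name, check_name)


print(renamed("etcd-k8s"))
```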
apiVersion: xxxxxxxx/v1alpha1
kind: Schema
metadata:
  annotations:
    meta.helm.sh/release-name: vmsp-hooks
    meta.helm.sh/release-namespace: vmsp-platform
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    helm.toolkit.fluxcd.io/name: vmsp-hooks
    helm.toolkit.fluxcd.io/namespace: vmsp-platform
    hooks.vmsp.vmware.com/access: public
  name: patch-configuration
  namespace: vmsp-platform
spec:
  context:
    name: patch-configuration
    type: hook
    version: v1.0.0
  schema:
    openAPIv3Schema:
      paths:
        /webhooks/vmsp-platform/kubectl/patch:
          post:
            description: This action patches a cluster configuration
            summary: This action patches a cluster configuration