This article describes the most common errors encountered during the deployment of Cloud Automation appliances and how to identify them.
Symptoms:
The following services and cloud proxies follow a similar troubleshooting methodology:
Most common misconfigurations:
These errors can be found in the following logs:
NTP error in /var/log/bootstrap/firstboot.log
2022-09-xx 17:42:15Z /etc/bootstrap/firstboot.d/00-apply-ntp-servers.sh starting...
+ set -e
++ ovfenv -q --key ntp-servers
+ ovf_ntpServers=<IP_Address>
+ '[' '!' -z <IP_Address> ']'
+ /usr/local/bin/vracli ntp systemd --set <IP_Address> --local
Couldn't reach NTP server <IP_Address>: No response received from <IP_Address>.
No reachable NTP server found
2022-09-xx 17:42:21Z Script /etc/bootstrap/firstboot.d/00-apply-ntp-servers.sh failed, error status 1
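NTP reachability can be verified from the appliance console before retrying the deployment. A minimal sketch, reusing the same vracli call the firstboot script makes (<ntp.example.com> is a placeholder for your NTP server):
vracli ntp show-config                                # display the current NTP configuration
vracli ntp systemd --set <ntp.example.com> --local    # retry the call made by 00-apply-ntp-servers.sh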
DNS error in /var/log/bootstrap/firstboot.log
+ '[' '!' -e /etc/bootstrap/firstboot.d/02-setup-kubernetes ']'
+ '[' '!' -x /etc/bootstrap/firstboot.d/02-setup-kubernetes ']'
+ log '/etc/bootstrap/firstboot.d/02-setup-kubernetes starting...'
++ date '+%Y-%m-%d %H:%M:%S'
+ echo '2022-06-xx 14:40:19 /etc/bootstrap/firstboot.d/02-setup-kubernetes starting...'
2022-06-xx 14:40:19 /etc/bootstrap/firstboot.d/02-setup-kubernetes starting...
+ /etc/bootstrap/firstboot.d/02-setup-kubernetes
+ export -f wait_health
+ timeout 300s bash -c wait_health
Running check eth0-ip
Running check non-default-hostname
Running check single-aptr
make: *** [/opt/health/Makefile:38: single-aptr] Error 1
make: Target 'firstboot' not remade because of errors.
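Once DNS is corrected, the failing check can be re-run on its own. A hedged sketch; the Makefile path is taken from the error above:
make -C /opt/health single-aptr    # re-run only the single-aptr health check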
Failed to get peer URLs
Running check eth0-ip
+ kubeadm init phase preflight --config /tmp/kubeadm.config
W0629 01:29:36.553417 4882 utils.go:69] The recommended value for "resolvConf" in "KubeletConfiguration" is: /run/systemd/resolve/resolv.conf; the provided value is: /etc/resolv.conf
[preflight] Running pre-flight checks
	[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
	[WARNING Hostname]: hostname "<SHORTNAME>" could not be reached
	[WARNING Hostname]: hostname "<SHORTNAME>": lookup <FQDN> on <IP_Address>:53: server misbehaving
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
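Name resolution can be confirmed from the appliance itself. A minimal sketch, assuming the iface-ip helper used by the appliance health checks is available:
hostname -f                  # must print the FQDN, not a shortname
host "$(hostname -f)"        # forward lookup of the appliance FQDN
host "$(iface-ip eth0)"      # reverse lookup of the eth0 address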
-- Logs begin at Fri 2022-08-19 18:09:16 UTC, end at Mon 2022-08-22 19:47:53 UTC. --
Aug 22 00:00:00 <SHORTNAME> kubelet[3429978]: E0822 00:00:00.076279 3429978 kubelet.go:2263] node "<SHORTNAME>" not found
Aug 22 00:00:00 <SHORTNAME> kubelet[3429978]: E0822 00:00:00.177177 3429978 kubelet.go:2263] node "<SHORTNAME>" not found
Aug 22 00:00:00 <SHORTNAME> kubelet[3429978]: E0822 00:00:00.277715 3429978 kubelet.go:2263] node "<SHORTNAME>" not found
Aug 22 00:00:00 <SHORTNAME> kubelet[3429978]: E0822 00:00:00.285714 3429978 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"<SHORTNAME>", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"<SHORTNAME>", UID:"<SHORTNAME>", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"<SHORTNAME>"}, FirstTimestamp:time.Date(2022, time.August, 21, 23, 58, 32, 375534888, time.Local), LastTimestamp:time.Date(2022, time.August, 21, 23, 58, 32, 375534888, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://vra-k8s.local:6443/api/v1/namespaces/default/events": dial tcp <SHORTNAME>:6443: connect: connection refused'(may retry after sleeping)
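The connection refused error on vra-k8s.local:6443 indicates the local control plane endpoint is down. A minimal sketch for checking the alias and the kubelet; on a typical appliance, vra-k8s.local is an /etc/hosts alias:
grep vra-k8s.local /etc/hosts            # the alias should map to this node's IP
systemctl status kubelet --no-pager      # confirm the kubelet service state
journalctl -u kubelet --no-pager | tail -n 20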
Jul 10 06:26:29 AriaAutoFQDN.x.com systemd[1]: Starting etcd.service...
Subject: A start job for unit etcd.service has begun execution
Defined-By: systemd
Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
A start job for unit etcd.service has begun execution.
The job identifier is 4066.
Jul 10 06:26:29 AriaAutoFQDN.x.com kubelet[5269]: E0710 06:26:29.858528 5269 kubelet.go:2263] node "AriaAutoFQDN.x.com" not found
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9349]: curl: (7) Failed to connect to 127.0.0.1 port 2381 after 0 ms: Couldn't connect to server
...
...
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9343]: setting maximum number of CPUs to 10, total number of available CPUs is 10
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9343]: advertising using detected default host "x.x.x.x"
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9343]: the server is already initialized as member before, starting as etcd member...
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9343]: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs
Jul 10 06:26:29 AriaAutoFQDN.x.com systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 06:26:29 AriaAutoFQDN.x.com systemd[1]: etcd.service: Failed with result 'exit-code'.
A start job for unit etcd.service has finished with a failure.
The job identifier is 4066 and the job result is failed.
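etcd can be inspected directly on the node. A minimal sketch; the 2381 health port is taken from the curl failure above and may differ between builds:
systemctl status etcd --no-pager
curl -s http://127.0.0.1:2381/health          # expect {"health":"true"} when etcd is healthy
journalctl -u etcd --no-pager | tail -n 20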
Prerequisite infrastructure services may be misconfigured or encountering an issue.
Note: This includes the configuration of the proxy appliance deployment as it relates to these infrastructure services. Be sure to check both the external configurations and the appliance configurations.
Review the NTP, DNS, and shortname sections below; they apply to all proxy variations:
Validating the forward lookup (name resolution)
Note: Fully Qualified Domain Names are required. Do not use shortnames. There should be a single A record for each appliance and VIP.
root@appliance [ ~ ]# nslookup <vra.example.com>
Server: 192.168.xx.xx
Address: 192.168.xx.xx#53
Name: <vra.example.com>
Address: 192.168.20.xxx
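To confirm that only a single A record exists, a hedged sketch (<vra.example.com> is a placeholder):
/usr/bin/dig +noall +answer <vra.example.com> A | wc -l    # expect exactly 1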
Validating the reverse lookup
Note: There should be a single PTR record. CNAMEs are not supported (except in multitenant environments), and duplicate records cause issues.
root@appliance [ ~ ]# nslookup 192.168.20.xxx
xxx.20.168.192.in-addr.arpa name = <vra.example.com>
Version 8.6.2 (single-aptr check in /opt/health/Makefile):
single-aptr: eth0-ip
	$(begin_check)
	echo Check the ip address if eth0 resolves only to a single hostname
	[ 1 -eq $$( host $$( iface-ip eth0 ) | wc -l ) ]
	$(end_check)
Version 8.7 onwards (single-aptr check in /opt/health/Makefile):
single-aptr: eth0-ip
	$(begin_check)
	echo Check the ip address if eth0 resolves only to a single hostname
	[ 1 -eq $$( /usr/bin/dig +noall +answer -x $$( iface-ip eth0 ) | grep "PTR" | wc -l ) ]
	$(end_check)
The check can be reproduced manually with the following dig variations:
/usr/bin/dig +noall +answer +nocookie -x $( iface-ip eth0 )
/usr/bin/dig +noall +answer +noedns -x $( iface-ip eth0 )
/usr/bin/dig +noall +answer -x $( iface-ip eth0 )
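A minimal sketch that runs all three variations and counts the PTR answers; each run should return exactly 1:
for opts in "+nocookie" "+noedns" ""; do
    /usr/bin/dig +noall +answer $opts -x "$(iface-ip eth0)" | grep -c "PTR"
done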
Scenarios:
Once Steps 1-3 are complete, consider the following:
sh /data-collector-status --traceroute
Note: The required URL can change based on the proxy location, as explained in:
Note: VMware Aria Automation Orchestrator (formerly vRealize Orchestrator) will require manual creation of the load balancer.
Note: Only perform these steps if you have verified the symptoms in the logs per the Symptoms Section. Ensure you have valid snapshots.
kubectl get vaconfig -o yaml | tee /root/vaconfig.yaml    # save the vaconfig resource to a file
vracli cluster leave                                      # remove this node from the cluster
kubectl apply -f /root/vaconfig.yaml --force              # re-apply the saved vaconfig
/opt/scripts/deploy.sh                                    # redeploy the services
vracli cluster leave                                                # remove this node from the cluster
/opt/scripts/recover_etcd.sh --confirm /root/backup-123456789.db    # restore etcd from the backup file
kubectl get vaconfig -o yaml | tee /root/vaconfig.yaml    # save the vaconfig resource to a file
vracli cluster leave                                      # remove this node from the cluster
kubectl apply -f /root/vaconfig.yaml --force              # re-apply the saved vaconfig
vracli cluster join primary-node --preservedata           # rejoin the primary node, preserving data
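After any of these sequences, a hedged post-check can confirm recovery (the prelude namespace is where Aria Automation services typically run):
kubectl get nodes              # all nodes should report Ready
kubectl get pods -n prelude    # service pods should be Running or Completed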