This article describes the most common errors encountered during the deployment of Automation appliances and how to identify them.
Symptoms:
The following services and cloud proxies follow a similar troubleshooting methodology:
Most common misconfigurations:
These errors can be found in the following logs:
NTP issues
/var/log/bootstrap/firstboot.log
2022-09-xx 17:42:15Z /etc/bootstrap/firstboot.d/00-apply-ntp-servers.sh starting...
+ set -e
++ ovfenv -q --key ntp-servers
+ ovf_ntpServers=<IP_Address>
+ '[' '!' -z <IP_Address> ']'
+ /usr/local/bin/vracli ntp systemd --set <IP_Address> --local
Couldn't reach NTP server <IP_Address>: No response received from <IP_Address>.
No reachable NTP server found
2022-09-xx 17:42:21Z Script /etc/bootstrap/firstboot.d/00-apply-ntp-servers.sh failed, error status 1
DNS issues
DNS error in /var/log/bootstrap/firstboot.log
+ '[' '!' -e /etc/bootstrap/firstboot.d/02-setup-kubernetes ']'
+ '[' '!' -x /etc/bootstrap/firstboot.d/02-setup-kubernetes ']'
+ log '/etc/bootstrap/firstboot.d/02-setup-kubernetes starting...'
++ date '+%Y-%m-%d %H:%M:%S'
+ echo '2022-06-xx 14:40:19 /etc/bootstrap/firstboot.d/02-setup-kubernetes starting...'
2022-06-xx 14:40:19 /etc/bootstrap/firstboot.d/02-setup-kubernetes starting...
+ /etc/bootstrap/firstboot.d/02-setup-kubernetes
+ export -f wait_health
+ timeout 300s bash -c wait_health
Running check eth0-ip
Running check non-default-hostname
Running check single-aptr
make: *** [/opt/health/Makefile:38: single-aptr] Error 1
make: Target 'firstboot' not remade because of errors.
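The failing check comes from the appliance health Makefile referenced in the error (/opt/health/Makefile). As a hedged sketch, after correcting the DNS records it should be possible to re-run just the failing target directly with make (target names taken from the log above):
root@appliance [ ~ ]# make -C /opt/health single-aptr
root@appliance [ ~ ]# make -C /opt/health eth0-ip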
Shortname issues
+ kubeadm init phase preflight --config /tmp/kubeadm.config
Failed to get peer URLs
W0629 01:29:36.553417 4882 utils.go:69] The recommended value for "resolvConf" in "KubeletConfiguration" is: /run/systemd/resolve/resolv.conf; the provided value is: /etc/resolv.conf
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
[WARNING Hostname]: hostname "<SHORTNAME>" could not be reached
[WARNING Hostname]: hostname "<SHORTNAME>": lookup <FQDN>1 on <IP_Address>:53: server misbehaving
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
-- Logs begin at Fri 2022-08-19 18:09:16 UTC, end at Mon 2022-08-22 19:47:53 UTC. --
Aug 22 00:00:00 <SHORTNAME> kubelet[3429978]: E0822 00:00:00.076279 3429978 kubelet.go:2263] node "<SHORTNAME> " not found
Aug 22 00:00:00 <SHORTNAME> kubelet[3429978]: E0822 00:00:00.177177 3429978 kubelet.go:2263] node "<SHORTNAME> " not found
Aug 22 00:00:00 <SHORTNAME> kubelet[3429978]: E0822 00:00:00.277715 3429978 kubelet.go:2263] node "<SHORTNAME> " not found
Aug 22 00:00:00 <SHORTNAME> kubelet[3429978]: E0822 00:00:00.285714 3429978 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"<SHORTNAME> ", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"<SHORTNAME> ", UID:"<SHORTNAME>", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"<SHORTNAME> "}, FirstTimestamp:time.Date(2022, time.August, 21, 23, 58, 32, 375534888, time.Local), LastTimestamp:time.Date(2022, time.August, 21, 23, 58, 32, 375534888, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://vra-k8s.local:6443/api/v1/namespaces/default/events": dial tcp <SHORTNAME> :6443: connect: connection refused'(may retry after sleeping)
Etcd Corruption
Jul 10 06:26:29 AriaAutoFQDN.x.com systemd[1]: Starting etcd.service...
Subject: A start job for unit etcd.service has begun execution
Defined-By: systemd
Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
A start job for unit etcd.service has begun execution.
The job identifier is 4066.
Jul 10 06:26:29 AriaAutoFQDN.x.com kubelet[5269]: E0710 06:26:29.858528 5269 kubelet.go:2263] node "AriaAutoFQDN.x.com" not found
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9349]: curl: (7) Failed to connect to 127.0.0.1 port 2381 after 0 ms: Couldn't connect to server
...
...
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9343]: setting maximum number of CPUs to 10, total number of available CPUs is 10
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9343]: advertising using detected default host "x.x.x.x"
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9343]: the server is already initialized as member before, starting as etcd member...
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9343]: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs
Jul 10 06:26:29 AriaAutoFQDN.x.com systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 06:26:29 AriaAutoFQDN.x.com systemd[1]: etcd.service: Failed with result 'exit-code'.
A start job for unit etcd.service has finished with a failure.
The job identifier is 4066 and the job result is failed.
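The excerpt above comes from the systemd journal. A quick way to pull the same entries from a running appliance (standard journalctl and grep, nothing product-specific assumed):
root@appliance [ ~ ]# journalctl -u etcd --no-pager | grep -iE "cluster info|peer urls|FAILURE"
root@appliance [ ~ ]# journalctl -u kubelet --no-pager | grep "not found" | tail -n 20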
Prerequisite infrastructure services may be misconfigured or encountering an issue.
Note: This also includes how the proxy appliance deployment is configured with respect to these infrastructure services. Be sure to check both the external configuration and the appliance configuration.
Review the NTP, DNS, and Shortname sections below.
Step 1: NTP issues
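As a minimal sketch for confirming the appliance time configuration: the vracli ntp subcommands below are assumed to be available on the appliance (the --set form appears in the firstboot log above), and timedatectl is standard systemd tooling:
root@appliance [ ~ ]# vracli ntp show-config    # review the configured NTP servers (assumed subcommand)
root@appliance [ ~ ]# vracli ntp status         # check reachability and sync state (assumed subcommand)
root@appliance [ ~ ]# timedatectl               # confirm the system clock reports synchronized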
Step 2: DNS issues
Validating the forward lookup (name resolution)
Note: Fully Qualified Domain Names are required. Do not use shortnames. There should be a single A record for each appliance and VIP.
root@appliance [ ~ ]# nslookup <vra.example.com>
Server: 192.xxx.xx.xx
Address: 192.xxx.xx.xx#53
Name: <vra.example.com>
Address: 192.xxx.xx.xxx
Validating the reverse lookup
Note: There should be a single PTR record. CNAMEs are not supported (with the exception of multitenant environments), and a duplicated record causes issues.
root@appliance [ ~ ]# nslookup 192.xxx.xx.xx
xx.xx.xxx.192.in-addr.arpa name = <vra.example.com>
Version 8.6.2
single-aptr: eth0-ip
$(begin_check)
echo Check the ip address if eth0 resolves only to a single hostname
[ 1 -eq $$( host $$( iface-ip eth0 ) | wc -l ) ]
$(end_check)
Version 8.7 onwards
single-aptr: eth0-ip
$(begin_check)
echo Check the ip address if eth0 resolves only to a single hostname
[ 1 -eq $$( /usr/bin/dig +noall +answer -x $$( iface-ip eth0 ) | grep "PTR" | wc -l ) ]
$(end_check)
Because these versions use dig for the check, it is recommended to run the following commands and confirm that each returns a single PTR record:
/usr/bin/dig +noall +answer +nocookie -x $( iface-ip eth0 )
/usr/bin/dig +noall +answer +noedns -x $( iface-ip eth0 )
/usr/bin/dig +noall +answer -x $( iface-ip eth0 )
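Each command should return exactly one PTR record for the appliance address. As a quick sketch for counting the answers, mirroring the 8.7 health check above (iface-ip is the appliance helper used in the Makefile):
root@appliance [ ~ ]# /usr/bin/dig +noall +answer -x $( iface-ip eth0 ) | grep -c "PTR"
An output of 1 is expected; 0 indicates a missing PTR record and anything greater than 1 indicates a duplicated record.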
Scenarios:
Step 3: Shortname issues
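As a hedged sketch for verifying that the appliance identifies itself by its FQDN rather than a shortname (standard Linux commands; the expected values follow from the FQDN requirement noted above):
root@appliance [ ~ ]# hostname -f                      # should return the FQDN, not the shortname
root@appliance [ ~ ]# hostnamectl status | grep "Static hostname"
root@appliance [ ~ ]# grep -v "^#" /etc/hosts          # confirm there are no shortname-only entries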
Step 4: Additional validation for VMware Aria Automation Orchestrator (formerly known as VMware vRealize Orchestrator)
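As a hedged sketch for spot-checking the embedded Orchestrator services once the Kubernetes layer is up, assuming the Automation services run in the prelude namespace and the Orchestrator pod names contain "vco" (the <vco-pod-name> placeholder is hypothetical and should be replaced with the actual pod name):
root@appliance [ ~ ]# kubectl -n prelude get pods | grep -i vco
root@appliance [ ~ ]# kubectl -n prelude logs <vco-pod-name> --all-containers --tail=50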
Step 5: Network Load Balancer
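As a hedged sketch for checking the load balancer path: the /health endpoint below reflects the health monitor used in the standard Automation load-balancing configuration (treat the exact endpoint as an assumption for your version), and the FQDN placeholders are illustrative:
root@appliance [ ~ ]# curl -k -s -o /dev/null -w "%{http_code}\n" https://<vra.example.com>/health    # VIP; expect 200
root@appliance [ ~ ]# curl -k -s -o /dev/null -w "%{http_code}\n" https://<node-fqdn>/health          # each node directly; expect 200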
Etcd Corruption:
Note: Only perform these steps if you have verified the symptoms in the logs per the Symptoms Section. Ensure you have valid snapshots.
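Both procedures below pass an etcd backup file to the recovery script. As a quick sketch for locating the most recent backup and preserving a copy in /root (the /data/etcd-backup/ location and the backup-<timestamp>.db naming follow the examples below):
root@appliance [ ~ ]# ls -lt /data/etcd-backup/ | head -n 5
root@appliance [ ~ ]# cp /data/etcd-backup/backup-123456789.db /root/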
Single VA Deployment:
Example: /opt/scripts/recover_etcd.sh --confirm /root/backup-123456789.db
kubectl get vaconfig -o yaml | tee /root/vaconfig.yaml
vracli cluster leave
kubectl apply -f /root/vaconfig.yaml --force
/opt/scripts/deploy.sh
Clustered Deployment (3 nodes):
Designate one of the nodes as the primary node. On the primary node, locate the etcd backup in /data/etcd-backup/ and preserve a copy in /root.
vracli cluster leave
Example:
/opt/scripts/recover_etcd.sh --confirm /root/backup-123456789.db
kubectl get vaconfig -o yaml | tee /root/vaconfig.yaml
vracli cluster leave
kubectl apply -f /root/vaconfig.yaml --force
vracli cluster join primary-node --preservedata
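Once the remaining nodes have rejoined, a hedged way to confirm the cluster has reformed before proceeding (standard kubectl; no product-specific commands assumed):
root@appliance [ ~ ]# kubectl get nodes                              # all three nodes should report Ready
root@appliance [ ~ ]# kubectl -n kube-system get pods | grep etcd    # etcd pods should be Running on each node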