This article describes the most common errors encountered during the deployment of Cloud Automation appliances and how to identify them.
Symptoms:
The following services and cloud proxies follow a similar troubleshooting methodology:
Most common misconfigurations:
These errors can be found in the following logs:
NTP error in /var/log/bootstrap/firstboot.log
2022-09-xx 17:42:15Z /etc/bootstrap/firstboot.d/00-apply-ntp-servers.sh starting...
+ set -e
++ ovfenv -q --key ntp-servers
+ ovf_ntpServers=<IP_Address>
+ '[' '!' -z <IP_Address> ']'
+ /usr/local/bin/vracli ntp systemd --set <IP_Address> --local
Couldn't reach NTP server <IP_Address>: No response received from <IP_Address>.
No reachable NTP server found
2022-09-xx 17:42:21Z Script /etc/bootstrap/firstboot.d/00-apply-ntp-servers.sh failed, error status 1
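NTP reachability can be verified from the appliance console before retrying the deployment. A minimal sketch, reusing the same vracli call the firstboot script makes (<ntp.example.com> is a placeholder for your NTP server):
vracli ntp show-config                                # display the current NTP configuration
vracli ntp systemd --set <ntp.example.com> --local    # retry the call made by 00-apply-ntp-servers.sh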
DNS error in /var/log/bootstrap/firstboot.log
+ '[' '!' -e /etc/bootstrap/firstboot.d/02-setup-kubernetes ']'
+ '[' '!' -x /etc/bootstrap/firstboot.d/02-setup-kubernetes ']'
+ log '/etc/bootstrap/firstboot.d/02-setup-kubernetes starting...'
++ date '+%Y-%m-%d %H:%M:%S'
+ echo '2022-06-xx 14:40:19 /etc/bootstrap/firstboot.d/02-setup-kubernetes starting...'
2022-06-xx 14:40:19 /etc/bootstrap/firstboot.d/02-setup-kubernetes starting...
+ /etc/bootstrap/firstboot.d/02-setup-kubernetes
+ export -f wait_health
+ timeout 300s bash -c wait_health
Running check eth0-ip
Running check non-default-hostname
Running check single-aptr
make: *** [/opt/health/Makefile:38: single-aptr] Error 1
make: Target 'firstboot' not remade because of errors.
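Once DNS is corrected, the failing check can be re-run on its own. A hedged sketch; the Makefile path is taken from the error above:
make -C /opt/health single-aptr    # re-run only the single-aptr health check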
Failed to get peer URLs
Running check eth0-ip
+ kubeadm init phase preflight --config /tmp/kubeadm.config
W0629 01:29:36.553417 4882 utils.go:69] The recommended value for "resolvConf" in "KubeletConfiguration" is: /run/systemd/resolve/resolv.conf; the provided value is: /etc/resolv.conf
[preflight] Running pre-flight checks
	[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
	[WARNING Hostname]: hostname "<SHORTNAME>" could not be reached
	[WARNING Hostname]: hostname "<SHORTNAME>": lookup <FQDN> on <IP_Address>:53: server misbehaving
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
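Name resolution can be confirmed from the appliance itself. A minimal sketch, assuming the iface-ip helper used by the appliance health checks is available:
hostname -f                  # must print the FQDN, not a shortname
host "$(hostname -f)"        # forward lookup of the appliance FQDN
host "$(iface-ip eth0)"      # reverse lookup of the eth0 address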
-- Logs begin at Fri 2022-08-19 18:09:16 UTC, end at Mon 2022-08-22 19:47:53 UTC. --
Aug 22 00:00:00 <SHORTNAME> kubelet[3429978]: E0822 00:00:00.076279 3429978 kubelet.go:2263] node "<SHORTNAME>" not found
Aug 22 00:00:00 <SHORTNAME> kubelet[3429978]: E0822 00:00:00.177177 3429978 kubelet.go:2263] node "<SHORTNAME>" not found
Aug 22 00:00:00 <SHORTNAME> kubelet[3429978]: E0822 00:00:00.277715 3429978 kubelet.go:2263] node "<SHORTNAME>" not found
Aug 22 00:00:00 <SHORTNAME> kubelet[3429978]: E0822 00:00:00.285714 3429978 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"<SHORTNAME>", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"<SHORTNAME>", UID:"<SHORTNAME>", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"<SHORTNAME>"}, FirstTimestamp:time.Date(2022, time.August, 21, 23, 58, 32, 375534888, time.Local), LastTimestamp:time.Date(2022, time.August, 21, 23, 58, 32, 375534888, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://vra-k8s.local:6443/api/v1/namespaces/default/events": dial tcp <SHORTNAME>:6443: connect: connection refused'(may retry after sleeping)
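The connection refused error on vra-k8s.local:6443 indicates the local control plane endpoint is down. A minimal sketch for checking the alias and the kubelet; on a typical appliance, vra-k8s.local is an /etc/hosts alias:
grep vra-k8s.local /etc/hosts            # the alias should map to this node's IP
systemctl status kubelet --no-pager      # confirm the kubelet service state
journalctl -u kubelet --no-pager | tail -n 20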
Jul 10 06:26:29 AriaAutoFQDN.x.com systemd[1]: Starting etcd.service...
Subject: A start job for unit etcd.service has begun execution
Defined-By: systemd
Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
A start job for unit etcd.service has begun execution.
The job identifier is 4066.
Jul 10 06:26:29 AriaAutoFQDN.x.com kubelet[5269]: E0710 06:26:29.858528 5269 kubelet.go:2263] node "AriaAutoFQDN.x.com" not found
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9349]: curl: (7) Failed to connect to 127.0.0.1 port 2381 after 0 ms: Couldn't connect to server
...
...
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9343]: setting maximum number of CPUs to 10, total number of available CPUs is 10
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9343]: advertising using detected default host "x.x.x.x"
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9343]: the server is already initialized as member before, starting as etcd member...
Jul 10 06:26:29 AriaAutoFQDN.x.com etcd[9343]: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs
Jul 10 06:26:29 AriaAutoFQDN.x.com systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 06:26:29 AriaAutoFQDN.x.com systemd[1]: etcd.service: Failed with result 'exit-code'.
A start job for unit etcd.service has finished with a failure.
The job identifier is 4066 and the job result is failed.
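etcd can be inspected directly on the node. A minimal sketch; the 2381 health port is taken from the curl failure above and may differ between builds:
systemctl status etcd --no-pager
curl -s http://127.0.0.1:2381/health          # expect {"health":"true"} when etcd is healthy
journalctl -u etcd --no-pager | tail -n 20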
Prerequisite infrastructure services may be misconfigured or encountering an issue.
Note: This includes the configuration of the proxy appliance deployment as it relates to these infrastructure services. Be sure to check both the external configurations and the appliance configurations.
Review the NTP, DNS, and shortname sections below; they apply to all proxy variations:
Validating the forward lookup (name resolution)
Note: Fully Qualified Domain Names are required. Do not use shortnames. There should be a single A record for each appliance and VIP.
root@appliance [ ~ ]# nslookup <vra.example.com>
Server: 192.168.xx.xx
Address: 192.168.xx.xx#53
Name: <vra.example.com>
Address: 192.168.20.xxx
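To confirm that only a single A record exists, a hedged sketch (<vra.example.com> is a placeholder):
/usr/bin/dig +noall +answer <vra.example.com> A | wc -l    # expect exactly 1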
Validating the reverse lookup
Note: There should be a single PTR record. CNAMEs are not supported (except in multitenant environments), and duplicate records cause issues.
root@appliance [ ~ ]# nslookup 192.168.20.xxx
xxx.20.168.192.in-addr.arpa name = <vra.example.com>
Version 8.6.2 (single-aptr check in /opt/health/Makefile):
single-aptr: eth0-ip
	$(begin_check)
	echo Check the ip address if eth0 resolves only to a single hostname
	[ 1 -eq $$( host $$( iface-ip eth0 ) | wc -l ) ]
	$(end_check)
Version 8.7 onwards (single-aptr check in /opt/health/Makefile):
single-aptr: eth0-ip
	$(begin_check)
	echo Check the ip address if eth0 resolves only to a single hostname
	[ 1 -eq $$( /usr/bin/dig +noall +answer -x $$( iface-ip eth0 ) | grep "PTR" | wc -l ) ]
	$(end_check)
The check can be reproduced manually with the following dig variations:
/usr/bin/dig +noall +answer +nocookie -x $( iface-ip eth0 )
/usr/bin/dig +noall +answer +noedns -x $( iface-ip eth0 )
/usr/bin/dig +noall +answer -x $( iface-ip eth0 )
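A minimal sketch that runs all three variations and counts the PTR answers; each run should return exactly 1:
for opts in "+nocookie" "+noedns" ""; do
    /usr/bin/dig +noall +answer $opts -x "$(iface-ip eth0)" | grep -c "PTR"
done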
Scenarios:
Once Steps 1-3 are complete, consider the following:
sh /data-collector-status --traceroute
Note: The required URL can change based on the proxy location, as explained in:
Note: VMware Aria Automation Orchestrator (formerly vRealize Orchestrator) will require manual creation of the load balancer.
Note: Only perform these steps if you have verified the symptoms in the logs per the Symptoms Section. Ensure you have valid snapshots.
kubectl get vaconfig -o yaml | tee /root/vaconfig.yaml    # save the vaconfig resource to a file
vracli cluster leave                                      # remove this node from the cluster
kubectl apply -f /root/vaconfig.yaml --force              # re-apply the saved vaconfig
/opt/scripts/deploy.sh                                    # redeploy the services
vracli cluster leave                                                # remove this node from the cluster
/opt/scripts/recover_etcd.sh --confirm /root/backup-123456789.db    # restore etcd from the backup file
kubectl get vaconfig -o yaml | tee /root/vaconfig.yaml    # save the vaconfig resource to a file
vracli cluster leave                                      # remove this node from the cluster
kubectl apply -f /root/vaconfig.yaml --force              # re-apply the saved vaconfig
vracli cluster join primary-node --preservedata           # rejoin the primary node, preserving data
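After any of these sequences, a hedged post-check can confirm recovery (the prelude namespace is where Aria Automation services typically run):
kubectl get nodes              # all nodes should report Ready
kubectl get pods -n prelude    # service pods should be Running or Completed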