This knowledge article describes a workaround for BOSH failing to start after upgrading Ops Manager to version 3.x. When this issue is present, duplicate bosh group entries are found in the postgres UAA database on the Bosh Director VM.
The reason why the multiple bosh entries appear in the UAA database is currently being investigated.
The symptoms of this issue include the following:
1. The UAA monit process does not exist and the Credhub monit process is not started
2. No listening ports for the UAA (8443) and Credhub (8844) processes (a quick check is shown after this list)
3. An "Invalid result size" error for bosh.*.read in the /var/vcap/sys/log/uaa.log file:
Caused by: org.springframework.dao.IncorrectResultSizeDataAccessException: Invalid result size found for:bosh.*.read
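A quick way to confirm symptom 2 on the Bosh Director VM is to look for listeners on those two ports (this reuses the netstat output requested in the troubleshooting list below; no output means neither process is listening):

bosh/0:~# netstat -tlnp | egrep ':8443|:8844'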
Logs to gather for troubleshooting:
1. The /var/vcap/sys/log directory on the Bosh Director VM (see the collection example after this list)
2. netstat -tlnp command output on the Bosh Director VM
3. monit summary command output on the Bosh Director VM
4. Output of: /var/vcap/packages/postgres-13/bin/psql -h 127.0.0.1 -U postgres -d uaa -c "select * from groups" | tee groups.txt
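If the /var/vcap/sys/log directory needs to be attached to a support request, one simple way to bundle it (the archive name and /tmp location below are only examples, assuming enough free space) is:

bosh/0:~# tar -czf /tmp/bosh-director-logs.tgz /var/vcap/sys/log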
Product Version: 2.10
The current workaround for this issue consists of deleting the older duplicate bosh entries within the UAA database and restarting the UAA and Credhub processes.
This can be done by following the steps below:
STEP 1
SSH into the Bosh Director VM if possible.
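If an SSH session to the Director is not already set up, one common approach (not specific to this issue) is to download the Director's SSH private key from the BOSH Director tile's Credentials tab in Ops Manager, copy it to the Ops Manager VM, and connect as the bbr user. The key path and IP address below are placeholders for your environment:

ubuntu@opsman:~$ chmod 600 /tmp/director_ssh_key
ubuntu@opsman:~$ ssh -i /tmp/director_ssh_key bbr@10.0.0.5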
STEP 2
Log into the postgres uaa database, select all rows from the groups table, and pipe the output into a file called groups.txt:
bosh/0:~# /var/vcap/packages/postgres-13/bin/psql -h 127.0.0.1 -U postgres -d uaa -c "select * from groups" | tee groups.txt
STEP 3
Check for any duplicates within the groups.txt file
bosh/0:~# cat groups.txt | awk -F "|" '{print $2}' | sort | uniq -c | sort -nk 1 | egrep " 2 "
Example output if the issue exists:
2 bosh.*.admin
2 bosh.*.read
2 bosh.admin
2 bosh.read
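As an alternative to the awk pipeline, the duplicates can be listed directly in SQL (a sketch reusing the same psql binary and uaa database from step 2):

bosh/0:~# /var/vcap/packages/postgres-13/bin/psql -h 127.0.0.1 -U postgres -d uaa -c "select displayname, count(*) from groups group by displayname having count(*) > 1;"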
STEP 4
If the duplicates are bosh entries like the output in step 3, then delete the OLDEST bosh.*.admin, bosh.*.read, bosh.admin, and bosh.read entries.
As an example, we could have the following groups.txt output:
id                                   | displayname  | created                 | lastmodified            | version | identity_zone_id | description
-------------------------------------+--------------+-------------------------+-------------------------+---------+------------------+-------------
a2d73896-fdfc-4984-9465-676dc2d83cd1 | bosh.*.read  | 2022-07-07 16:24:24.969 | 2022-07-07 16:24:24.969 | 0       | uaa              |
ef01eed1-452a-43eb-a557-45614b7f8cb5 | bosh.*.admin | 2022-07-07 16:24:25.007 | 2022-07-07 16:24:25.007 | 0       | uaa              |
4d9518d9-7fd8-4e20-a127-86c31def99ad | bosh.read    | 2022-07-07 16:24:25.065 | 2022-07-07 16:24:25.065 | 0       | uaa              |
f9fd7e71-895a-4298-9c85-57f0c6b089b6 | bosh.admin   | 2022-07-07 16:24:10.46  | 2022-07-07 16:24:10.46  | 0       | uaa              |
8bf94c7f-7520-4899-a0b2-aafd4bf6b2ac | bosh.admin   | 2023-01-10 15:05:46.985 | 2023-01-10 15:05:46.986 | 0       | uaa              |
5296839d-aa69-42fc-b17c-c225d50a4a46 | bosh.*.read  | 2023-01-10 15:05:54.358 | 2023-01-10 15:05:54.358 | 0       | uaa              |
7db7cee0-db3a-4457-ac2e-b44e9ef54a6e | bosh.*.admin | 2023-01-10 15:05:54.367 | 2023-01-10 15:05:54.367 | 0       | uaa              |
398dc9b8-52a8-419f-b24a-5aa40073f6b8 | bosh.read    | 2023-01-10 15:05:54.376 | 2023-01-10 15:05:54.376 | 0       | uaa              |
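To make the oldest rows easier to spot, the bosh groups can also be listed sorted by their created timestamps (again reusing the psql command from step 2); for each duplicated displayname, the row with the earlier created value is the one to delete:

bosh/0:~# /var/vcap/packages/postgres-13/bin/psql -h 127.0.0.1 -U postgres -d uaa -c "select id, displayname, created from groups where displayname like 'bosh%' order by displayname, created;"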
STEP 5
We can use the queries below to delete the oldest bosh entries, based on the groups.txt output seen in step 4:

-- Deleting the oldest bosh.read entry
delete from groups WHERE id = '4d9518d9-7fd8-4e20-a127-86c31def99ad';

-- Deleting the oldest bosh.*.read entry
delete from groups WHERE id = 'a2d73896-fdfc-4984-9465-676dc2d83cd1';

-- Deleting the oldest bosh.admin entry
delete from groups WHERE id = 'f9fd7e71-895a-4298-9c85-57f0c6b089b6';

-- Deleting the oldest bosh.*.admin entry
delete from groups WHERE id = 'ef01eed1-452a-43eb-a557-45614b7f8cb5';
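The delete statements can be run from an interactive psql session opened with the same binary and database used in step 2. Paste each statement at the uaa=# prompt, confirm it reports DELETE 1, and exit with \q:

bosh/0:~# /var/vcap/packages/postgres-13/bin/psql -h 127.0.0.1 -U postgres -d uaa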
STEP 6
Restart the Credhub and UAA processes on the Bosh Director VM
bosh/0:~# monit restart uaa && monit restart credhub
Check monit to see if the uaa and credhub processes are running again
bosh/0:~# monit summary
The Monit daemon 5.2.5 uptime: 2d 3h 8m

Process 'nats'                      running
Process 'bosh_nats_sync'            running
Process 'postgres'                  running
Process 'director'                  running
Process 'worker_1'                  running
Process 'worker_2'                  running
Process 'worker_3'                  running
Process 'worker_4'                  running
Process 'worker_5'                  running
Process 'director_scheduler'        running
Process 'metrics_server'            running
Process 'director_sync_dns'         running
Process 'director_nginx'            running
Process 'health_monitor'            running
Process 'uaa'                       running
Process 'credhub'                   running
Process 'system-metrics-agent'      running
Process 'system-metrics-server'     running
Process 'blobstore_nginx'           running
System 'system_1aa3bf4f-e994-4e79-71df-ff8c8d4c2f80' running
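In addition to monit summary, the listening ports from symptom 2 can be rechecked; both 8443 and 8844 should now show a LISTEN entry:

bosh/0:~# netstat -tlnp | egrep ':8443|:8844'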
NOTE: If the monit commands above do not kill the uaa and credhub processes properly, one can try to kill each process using the pkill and kill commands, as shown in the steps below.
DISCLAIMER: Using pkill or kill is a brute-force approach to killing processes that may not be safe to execute in every scenario. Before running the commands below, please contact Tanzu Support.
STEP 6.1
Kill the UAA process via the pkill command
bosh/0:~# pkill uaa
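To confirm the UAA process is actually gone (no output means it has been killed; monit should then bring it back up on its own), a quick check is:

bosh/0:~# pgrep -fa uaa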
STEP 6.2
Identify the process listening on port 8844, which is the default port Credhub runs on
bosh/0:~# ss -plant | grep 8844
LISTEN  0  100  *:8844  *:*  users:(("java",pid=5442,fd=33))
STEP 6.3
Kill the process via the PID seen in Step 6.2
bosh/0:~# kill 5442
STEP 6.4
Check monit to see if the uaa and credhub processes are running again
bosh/0:~# monit summary
The Monit daemon 5.2.5 uptime: 2d 3h 8m

Process 'nats'                      running
Process 'bosh_nats_sync'            running
Process 'postgres'                  running
Process 'director'                  running
Process 'worker_1'                  running
Process 'worker_2'                  running
Process 'worker_3'                  running
Process 'worker_4'                  running
Process 'worker_5'                  running
Process 'director_scheduler'        running
Process 'metrics_server'            running
Process 'director_sync_dns'         running
Process 'director_nginx'            running
Process 'health_monitor'            running
Process 'uaa'                       running
Process 'credhub'                   running
Process 'system-metrics-agent'      running
Process 'system-metrics-server'     running
Process 'blobstore_nginx'           running
System 'system_1aa3bf4f-e994-4e79-71df-ff8c8d4c2f80' running
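Once all processes show running, the Director can also be exercised end to end from a jumpbox or the Ops Manager VM with the BOSH CLI, since an authenticated command has to obtain a token from UAA. The bosh-env alias below is a placeholder for your own configured environment:

ubuntu@opsman:~$ bosh -e bosh-env deployments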