This knowledge article describes a workaround for BOSH failing to start after upgrading Ops Manager to version 3.x. When this issue is present, duplicate bosh group entries are found in the postgres UAA database on the Bosh Director VM.
The reason why the multiple bosh entries appear in the UAA database is currently being investigated.
The symptoms of this issue include the following:
1. The UAA monit process does not exist and the Credhub monit process is not started
2. No listening ports for the UAA (8443) and Credhub (8844) processes (a quick check is shown after this list)
3. An "Invalid result size" error for bosh.*.read in the /var/vcap/sys/log/uaa.log file:
Caused by: org.springframework.dao.IncorrectResultSizeDataAccessException: Invalid result size found for:bosh.*.read
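A quick way to confirm symptom 2 on the Bosh Director VM is to look for listeners on those two ports (this reuses the netstat output requested in the troubleshooting list below; no output means neither process is listening):

bosh/0:~# netstat -tlnp | egrep ':8443|:8844'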
Logs to gather for troubleshooting:
1. The /var/vcap/sys/log directory on the Bosh Director VM (see the collection example after this list)
2. netstat -tlnp command output on the Bosh Director VM
3. monit summary command output on the Bosh Director VM
4. Output of: /var/vcap/packages/postgres-13/bin/psql -h 127.0.0.1 -U postgres -d uaa -c "select * from groups" | tee groups.txt
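If the /var/vcap/sys/log directory needs to be attached to a support request, one simple way to bundle it (the archive name and /tmp location below are only examples, assuming enough free space) is:

bosh/0:~# tar -czf /tmp/bosh-director-logs.tgz /var/vcap/sys/log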
Product Version: 2.10
The current workaround for this issue consists of deleting the older duplicate bosh entries within the UAA database and restarting the UAA and Credhub processes.
This can be done by following the steps below:
STEP 1
SSH into the Bosh Director VM if possible.
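If an SSH session to the Director is not already set up, one common approach (not specific to this issue) is to download the Director's SSH private key from the BOSH Director tile's Credentials tab in Ops Manager, copy it to the Ops Manager VM, and connect as the bbr user. The key path and IP address below are placeholders for your environment:

ubuntu@opsman:~$ chmod 600 /tmp/director_ssh_key
ubuntu@opsman:~$ ssh -i /tmp/director_ssh_key bbr@10.0.0.5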
STEP 2
Log into the postgres uaa database, select all rows from the groups table, and pipe the output into a file called groups.txt:
bosh/0:~# /var/vcap/packages/postgres-13/bin/psql -h 127.0.0.1 -U postgres -d uaa -c "select * from groups" | tee groups.txt
STEP 3
Check for any duplicates within the groups.txt file
bosh/0:~# cat groups.txt | awk -F "|" '{print $2}' | sort | uniq -c | sort -nk 1 | egrep " 2 "
Example output if the issue exists:
2 bosh.*.admin
2 bosh.*.read
2 bosh.admin
2 bosh.read
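As an alternative to the awk pipeline, the duplicates can be listed directly in SQL (a sketch reusing the same psql binary and uaa database from step 2):

bosh/0:~# /var/vcap/packages/postgres-13/bin/psql -h 127.0.0.1 -U postgres -d uaa -c "select displayname, count(*) from groups group by displayname having count(*) > 1;"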
STEP 4
If the duplicates are bosh entries like the output in step 3, then delete the OLDEST bosh.*.admin, bosh.*.read, bosh.admin, and bosh.read entries.
As an example, we could have the following groups.txt output:
id                                   | displayname  | created                 | lastmodified            | version | identity_zone_id | description
-------------------------------------+--------------+-------------------------+-------------------------+---------+------------------+-------------
a2d73896-fdfc-4984-9465-676dc2d83cd1 | bosh.*.read  | 2022-07-07 16:24:24.969 | 2022-07-07 16:24:24.969 | 0       | uaa              |
ef01eed1-452a-43eb-a557-45614b7f8cb5 | bosh.*.admin | 2022-07-07 16:24:25.007 | 2022-07-07 16:24:25.007 | 0       | uaa              |
4d9518d9-7fd8-4e20-a127-86c31def99ad | bosh.read    | 2022-07-07 16:24:25.065 | 2022-07-07 16:24:25.065 | 0       | uaa              |
f9fd7e71-895a-4298-9c85-57f0c6b089b6 | bosh.admin   | 2022-07-07 16:24:10.46  | 2022-07-07 16:24:10.46  | 0       | uaa              |
8bf94c7f-7520-4899-a0b2-aafd4bf6b2ac | bosh.admin   | 2023-01-10 15:05:46.985 | 2023-01-10 15:05:46.986 | 0       | uaa              |
5296839d-aa69-42fc-b17c-c225d50a4a46 | bosh.*.read  | 2023-01-10 15:05:54.358 | 2023-01-10 15:05:54.358 | 0       | uaa              |
7db7cee0-db3a-4457-ac2e-b44e9ef54a6e | bosh.*.admin | 2023-01-10 15:05:54.367 | 2023-01-10 15:05:54.367 | 0       | uaa              |
398dc9b8-52a8-419f-b24a-5aa40073f6b8 | bosh.read    | 2023-01-10 15:05:54.376 | 2023-01-10 15:05:54.376 | 0       | uaa              |
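To make the oldest rows easier to spot, the bosh groups can also be listed sorted by their created timestamps (again reusing the psql command from step 2); for each duplicated displayname, the row with the earlier created value is the one to delete:

bosh/0:~# /var/vcap/packages/postgres-13/bin/psql -h 127.0.0.1 -U postgres -d uaa -c "select id, displayname, created from groups where displayname like 'bosh%' order by displayname, created;"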
STEP 5
We can use the queries below to delete the oldest bosh entries, based on the groups.txt output seen in step 4:

-- Deleting the oldest bosh.read entry
delete from groups WHERE id = '4d9518d9-7fd8-4e20-a127-86c31def99ad';

-- Deleting the oldest bosh.*.read entry
delete from groups WHERE id = 'a2d73896-fdfc-4984-9465-676dc2d83cd1';

-- Deleting the oldest bosh.admin entry
delete from groups WHERE id = 'f9fd7e71-895a-4298-9c85-57f0c6b089b6';

-- Deleting the oldest bosh.*.admin entry
delete from groups WHERE id = 'ef01eed1-452a-43eb-a557-45614b7f8cb5';
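The delete statements can be run from an interactive psql session opened with the same binary and database used in step 2. Paste each statement at the uaa=# prompt, confirm it reports DELETE 1, and exit with \q:

bosh/0:~# /var/vcap/packages/postgres-13/bin/psql -h 127.0.0.1 -U postgres -d uaa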
STEP 6
Restart the Credhub and UAA processes on the Bosh Director VM
bosh/0:~# monit restart uaa && monit restart credhub
Check monit to see if the uaa and credhub processes are running again
bosh/0:~# monit summary
The Monit daemon 5.2.5 uptime: 2d 3h 8m

Process 'nats'                      running
Process 'bosh_nats_sync'            running
Process 'postgres'                  running
Process 'director'                  running
Process 'worker_1'                  running
Process 'worker_2'                  running
Process 'worker_3'                  running
Process 'worker_4'                  running
Process 'worker_5'                  running
Process 'director_scheduler'        running
Process 'metrics_server'            running
Process 'director_sync_dns'         running
Process 'director_nginx'            running
Process 'health_monitor'            running
Process 'uaa'                       running
Process 'credhub'                   running
Process 'system-metrics-agent'      running
Process 'system-metrics-server'     running
Process 'blobstore_nginx'           running
System 'system_1aa3bf4f-e994-4e79-71df-ff8c8d4c2f80' running
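In addition to monit summary, the listening ports from symptom 2 can be rechecked; both 8443 and 8844 should now show a LISTEN entry:

bosh/0:~# netstat -tlnp | egrep ':8443|:8844'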
NOTE: If the monit commands above do not kill the uaa and credhub processes properly, one can try to kill each process using the pkill and kill commands, as shown in the steps below.
DISCLAIMER: Using pkill or kill is a brute-force approach to killing processes that may not be safe to execute in every scenario. Before running the commands below, please contact Tanzu Support.
STEP 6.1
Kill the UAA process via the pkill command
bosh/0:~# pkill uaa
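To confirm the UAA process is actually gone (no output means it has been killed; monit should then bring it back up on its own), a quick check is:

bosh/0:~# pgrep -fa uaa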
STEP 6.2
Identify the process listening on port 8844, which is the default port Credhub runs on
bosh/0:~# ss -plant | grep 8844
LISTEN  0  100  *:8844  *:*  users:(("java",pid=5442,fd=33))
STEP 6.3
Kill the process via the PID seen in Step 6.2
bosh/0:~# kill 5442
STEP 6.4
Check monit to see if the uaa and credhub processes are running again
bosh/0:~# monit summary
The Monit daemon 5.2.5 uptime: 2d 3h 8m

Process 'nats'                      running
Process 'bosh_nats_sync'            running
Process 'postgres'                  running
Process 'director'                  running
Process 'worker_1'                  running
Process 'worker_2'                  running
Process 'worker_3'                  running
Process 'worker_4'                  running
Process 'worker_5'                  running
Process 'director_scheduler'        running
Process 'metrics_server'            running
Process 'director_sync_dns'         running
Process 'director_nginx'            running
Process 'health_monitor'            running
Process 'uaa'                       running
Process 'credhub'                   running
Process 'system-metrics-agent'      running
Process 'system-metrics-server'     running
Process 'blobstore_nginx'           running
System 'system_1aa3bf4f-e994-4e79-71df-ff8c8d4c2f80' running
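Once all processes show running, the Director can also be exercised end to end from a jumpbox or the Ops Manager VM with the BOSH CLI, since an authenticated command has to obtain a token from UAA. The bosh-env alias below is a placeholder for your own configured environment:

ubuntu@opsman:~$ bosh -e bosh-env deployments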