Bosh failing to start due to duplicate bosh entries within the UAA Database

Article ID: 293761

Updated On:

Products

Operations Manager

Issue/Introduction

This knowledge article describes a workaround for Bosh failing to start after Ops Manager is upgraded to version 3.x. When this issue is present, the UAA database in Postgres on the Bosh Director VM has been observed to contain multiple (duplicate) bosh entries.

The cause of the duplicate bosh entries in the UAA database is still being investigated.

The symptoms of this issue include the following: 

1. The UAA Monit process does not exist, and the Credhub Monit process is not started.
[Screenshot: monit summary output]

2. Nothing is listening on the UAA and Credhub ports (8443 and 8844, respectively).
[Screenshot: netstat output]

3. An "Invalid result size" error for bosh.*.read appears in /var/vcap/sys/log/uaa.log:
Caused by: org.springframework.dao.IncorrectResultSizeDataAccessException: Invalid result size found for:bosh.*.read
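Symptoms 2 and 3 can also be checked from a shell on the Bosh Director VM. The following is a minimal sketch, not a supported tool: the function names port_is_listening and uaa_log_has_duplicate_error are hypothetical, the port check assumes ss -tln (or netstat -tln) output is piped in, and the log check simply searches the file you pass for the exception text shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking symptoms 2 and 3 (a sketch, not a supported tool).

# port_is_listening PORT
# Reads `ss -tln` or `netstat -tln` output on stdin. Succeeds if a LISTEN line
# has a local address ending in ":PORT" followed by whitespace.
port_is_listening() {
  grep 'LISTEN' | grep -q ":$1[[:space:]]"
}

# uaa_log_has_duplicate_error LOGFILE
# Succeeds if LOGFILE contains the IncorrectResultSizeDataAccessException
# reported for a bosh.* group.
uaa_log_has_duplicate_error() {
  grep -q 'IncorrectResultSizeDataAccessException: Invalid result size found for:bosh' "$1"
}
```

For example, ss -tln | port_is_listening 8443 || echo "UAA is not listening on 8443" reports the missing UAA listener, and uaa_log_has_duplicate_error /var/vcap/sys/log/uaa.log succeeds when the error above is present.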

 

Logs to gather for troubleshooting:

  1. All logs located in /var/vcap/sys/log on the Bosh Director VM
  2. Output of the netstat -tlnp command on the Bosh Director VM
  3. Output of the monit summary command on the Bosh Director VM
  4. Contents of the groups table in the UAA database, which can be extracted with the following command on the Bosh Director VM:
/var/vcap/packages/postgres-13/bin/psql -h 127.0.0.1 -U postgres -d uaa -c "select * from groups" | tee groups.txt
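To gather all four items in one pass, a sketch like the following could be used. It is an illustration under assumptions, not an official collection script: collect_bosh_diagnostics and run_or_note are hypothetical names, the log directory and psql path default to the ones used in this article (they can be overridden with the LOG_DIR and PSQL environment variables), and a missing command is recorded in its output file instead of aborting the run.

```shell
#!/usr/bin/env bash
# Hypothetical helper that bundles the troubleshooting data listed above.
# LOG_DIR and PSQL default to the paths used in this article; override them
# through the environment if your layout differs.

# run_or_note OUTFILE CMD [ARGS...]
# Runs CMD and saves its stdout and stderr to OUTFILE. If CMD is not found,
# writes a note to OUTFILE instead of failing.
run_or_note() {
  local outfile="$1"
  shift
  if command -v "$1" >/dev/null 2>&1; then
    "$@" >"$outfile" 2>&1
  else
    echo "$1: command not found" >"$outfile"
  fi
}

# collect_bosh_diagnostics OUT_DIR
collect_bosh_diagnostics() {
  local out_dir="$1"
  local log_dir="${LOG_DIR:-/var/vcap/sys/log}"
  local psql="${PSQL:-/var/vcap/packages/postgres-13/bin/psql}"

  mkdir -p "$out_dir" || return 1

  # A failing command leaves its error output in its file; failures are not fatal.
  run_or_note "$out_dir/netstat.txt" netstat -tlnp || true
  run_or_note "$out_dir/monit_summary.txt" monit summary || true
  run_or_note "$out_dir/groups.txt" "$psql" -h 127.0.0.1 -U postgres -d uaa \
    -c "select * from groups" || true

  # Archive the whole log directory, if it exists.
  if [ -d "$log_dir" ]; then
    tar -czf "$out_dir/logs.tgz" -C "$log_dir" . || return 1
  fi
}
```

Run as root on the Bosh Director VM, collect_bosh_diagnostics /tmp/bosh-diag should leave netstat.txt, monit_summary.txt, groups.txt, and logs.tgz in /tmp/bosh-diag.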


Environment

Product Version: 2.10

Resolution

The current workaround is to delete the older copy of each duplicated bosh entry in the UAA database and then restart the UAA and Credhub processes.

To do this, follow the steps below:

STEP 1
SSH into the Bosh Director VM if possible.
STEP 2
Log into Postgres, connect to the uaa database, dump the contents of the groups table, and pipe the output into a file named groups.txt:
bosh/0:~# /var/vcap/packages/postgres-13/bin/psql -h 127.0.0.1 -U postgres -d uaa -c "select * from groups" | tee groups.txt

STEP 3
Check groups.txt for duplicate entries. The command below counts the entries for each displayname and prints those that appear exactly twice:
bosh/0:~# cat groups.txt | awk -F "|" '{print $2}' | sort | uniq -c | sort -nk 1 | egrep " 2 " 
Example output if the issue exists:
2 bosh.*.admin  
2 bosh.*.read  
2 bosh.admin  
2 bosh.read 
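The command above reports only names that appear exactly twice, and its output keeps psql's column padding. A slightly more general sketch, assuming groups.txt was produced by the psql command in Step 2 (aligned output with a header row), trims the padding and reports every displayname that occurs more than once. The function name list_duplicate_groups is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "COUNT DISPLAYNAME" for every displayname that
# appears more than once in psql's aligned output of `select * from groups`.
list_duplicate_groups() {
  LC_ALL=C awk -F'|' '
    # Rows containing "|" are the header and data rows; the first one is the
    # header. The "----+----" separator, "(N rows)" footer and blank lines
    # have no "|" and are skipped by the NF > 1 test.
    NF > 1 && !header_seen { header_seen = 1; next }
    NF > 1 {
      name = $2
      gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim psql column padding
      count[name]++
    }
    END { for (n in count) if (count[n] > 1) printf "%d %s\n", count[n], n }
  ' "$1" | LC_ALL=C sort -k2
}
```

Running list_duplicate_groups groups.txt against the Step 4 example prints one line per duplicated name, for instance 2 bosh.*.read.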

STEP 4
If the duplicates are bosh entries like those in the Step 3 output, delete the OLDEST bosh.*.admin, bosh.*.read, bosh.admin, and bosh.read entries, using the created column to tell which entry is older.

For example, groups.txt might contain the following:
id | displayname | created | lastmodified | version | identity_zone_id | description                              

a2d73896-fdfc-4984-9465-676dc2d83cd1 | bosh.*.read | 2022-07-07 16:24:24.969 | 2022-07-07 16:24:24.969 | 0 | uaa | 
ef01eed1-452a-43eb-a557-45614b7f8cb5 | bosh.*.admin | 2022-07-07 16:24:25.007 | 2022-07-07 16:24:25.007 | 0 | uaa | 
4d9518d9-7fd8-4e20-a127-86c31def99ad | bosh.read | 2022-07-07 16:24:25.065 | 2022-07-07 16:24:25.065 | 0 | uaa | 
f9fd7e71-895a-4298-9c85-57f0c6b089b6 | bosh.admin | 2022-07-07 16:24:10.46 | 2022-07-07 16:24:10.46 | 0 | uaa | 
8bf94c7f-7520-4899-a0b2-aafd4bf6b2ac | bosh.admin | 2023-01-10 15:05:46.985 | 2023-01-10 15:05:46.986 | 0 | uaa | 
5296839d-aa69-42fc-b17c-c225d50a4a46 | bosh.*.read | 2023-01-10 15:05:54.358 | 2023-01-10 15:05:54.358 | 0 | uaa | 
7db7cee0-db3a-4457-ac2e-b44e9ef54a6e | bosh.*.admin | 2023-01-10 15:05:54.367 | 2023-01-10 15:05:54.367 | 0 | uaa | 
398dc9b8-52a8-419f-b24a-5aa40073f6b8 | bosh.read | 2023-01-10 15:05:54.376 | 2023-01-10 15:05:54.376 | 0 | uaa |
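Picking the oldest row by eye is error-prone, so the selection can be scripted. This is a minimal sketch, assuming groups.txt is psql's aligned output from Step 2 and that created values all use the YYYY-MM-DD HH:MM:SS.fff form shown above, which orders correctly as plain text. The function name oldest_duplicate_ids is hypothetical; it prints each duplicated displayname followed by the id of its oldest row.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each displayname that appears more than once in
# groups.txt (psql aligned output), print "DISPLAYNAME ID" for the row with
# the oldest "created" value. Timestamps such as "2022-07-07 16:24:24.969"
# are compared as plain strings (C locale), which orders them correctly as
# long as every row uses that same format.
oldest_duplicate_ids() {
  LC_ALL=C awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    NF > 1 && !header_seen { header_seen = 1; next }   # skip the column header row
    NF > 1 {
      id = trim($1); name = trim($2); created = trim($3)
      count[name]++
      if (!(name in oldest) || created < oldest[name]) {
        oldest[name] = created
        oldest_id[name] = id
      }
    }
    END { for (n in count) if (count[n] > 1) print n, oldest_id[n] }
  ' "$1" | LC_ALL=C sort
}
```

Against the example above, the output pairs bosh.*.admin with ef01eed1-452a-43eb-a557-45614b7f8cb5, bosh.*.read with a2d73896-fdfc-4984-9465-676dc2d83cd1, bosh.admin with f9fd7e71-895a-4298-9c85-57f0c6b089b6, and bosh.read with 4d9518d9-7fd8-4e20-a127-86c31def99ad, which are the IDs deleted in Step 5.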

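STEP 5
Open a psql session on the uaa database (for example, run the psql command from Step 2 without the -c option and the pipe to tee), then run queries like the ones below to delete the oldest bosh entries. The IDs shown match the Step 4 example; replace them with the IDs of the oldest entries in your own groups.txt.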
-- Delete the oldest bosh.read entry
DELETE FROM groups WHERE id = '4d9518d9-7fd8-4e20-a127-86c31def99ad';

-- Delete the oldest bosh.*.read entry
DELETE FROM groups WHERE id = 'a2d73896-fdfc-4984-9465-676dc2d83cd1';

-- Delete the oldest bosh.admin entry
DELETE FROM groups WHERE id = 'f9fd7e71-895a-4298-9c85-57f0c6b089b6';

-- Delete the oldest bosh.*.admin entry
DELETE FROM groups WHERE id = 'ef01eed1-452a-43eb-a557-45614b7f8cb5';
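Running the deletes inside one transaction means that either all of them apply or none do. The sketch below, with the hypothetical function name build_delete_sql, prints such a transaction for the IDs you pass and rejects anything that is not a lowercase UUID, as a guard against pasting the wrong column. It only prints SQL; it does not connect to the database.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one transaction that deletes the given group IDs.
# It only prints SQL; it does not connect to the database.
build_delete_sql() {
  local uuid_re='^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
  local id

  if [ "$#" -eq 0 ]; then
    echo "usage: build_delete_sql ID [ID...]" >&2
    return 1
  fi

  # Refuse anything that is not a lowercase UUID, so a mis-copied column
  # (or a stray quote) never reaches the generated SQL.
  for id in "$@"; do
    if ! [[ $id =~ $uuid_re ]]; then
      echo "refusing non-UUID id: $id" >&2
      return 1
    fi
  done

  echo "BEGIN;"
  for id in "$@"; do
    printf "DELETE FROM groups WHERE id = '%s';\n" "$id"
  done
  echo "COMMIT;"
}
```

The output could then be piped into psql, for example build_delete_sql <id> ... | /var/vcap/packages/postgres-13/bin/psql -h 127.0.0.1 -U postgres -d uaa -v ON_ERROR_STOP=1, where each <id> comes from your own groups.txt. With ON_ERROR_STOP set, psql stops at the first failing statement, so the transaction is never committed.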

STEP 6
Restart the Credhub and UAA processes on the Bosh Director VM 
bosh/0:~# monit restart uaa && monit restart credhub


Check monit to see if the uaa and credhub processes are running again

bosh/0:~# monit summary
The Monit daemon 5.2.5 uptime: 2d 3h 8m 

Process 'nats'                      running
Process 'bosh_nats_sync'            running
Process 'postgres'                  running
Process 'director'                  running
Process 'worker_1'                  running
Process 'worker_2'                  running
Process 'worker_3'                  running
Process 'worker_4'                  running
Process 'worker_5'                  running
Process 'director_scheduler'        running
Process 'metrics_server'            running
Process 'director_sync_dns'         running
Process 'director_nginx'            running
Process 'health_monitor'            running
Process 'uaa'                       running
Process 'credhub'                   running
Process 'system-metrics-agent'      running
Process 'system-metrics-server'     running
Process 'blobstore_nginx'           running
System 'system_1aa3bf4f-e994-4e79-71df-ff8c8d4c2f80' running
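Instead of reading the whole summary, you can check just the two processes of interest. This minimal sketch assumes the plain-text monit summary format shown above (one "Process 'name'   status" line per process); other Monit versions may format the summary differently, in which case the pattern would need adjusting. The function name monit_all_running is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `monit summary` output (the plain-text format shown
# above) on stdin and succeed only if every named process is "running".
monit_all_running() {
  local summary name
  summary="$(cat)"
  for name in "$@"; do
    if ! printf '%s\n' "$summary" |
      grep -Eq "^Process '${name}'[[:space:]]+running[[:space:]]*$"; then
      echo "not running: $name" >&2
      return 1
    fi
  done
}
```

For example, monit summary | monit_all_running uaa credhub succeeds once both processes report running, and a loop such as until monit summary | monit_all_running uaa credhub; do sleep 10; done waits for that state (it will wait indefinitely if they never start).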

NOTE: If the monit commands above do not stop the uaa and credhub processes properly, you can try killing each process with the pkill and kill commands, as described in Steps 6.1 to 6.4.

DISCLAIMER: Using pkill or kill is a brute-force way of stopping processes and may not be safe in every scenario. Contact Tanzu Support before running the commands below.

 

STEP 6.1
Kill the UAA process via the pkill command
bosh/0:~# pkill uaa


STEP 6.2
Identify the process listening on port 8844, the default port for Credhub:

bosh/0:~# ss -plant | grep 8844
LISTEN 0 100 *:8844 *:* users:(("java",pid=5442,fd=33))
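The PID can also be extracted from the ss output automatically. In this sketch the hypothetical function pid_listening_on reads ss -plant (or ss -tlnp) output on stdin and prints the PID of the process in LISTEN state on the given port; it deliberately stops short of killing anything, in line with the disclaimer above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ss -plant` (or `ss -tlnp`) output on stdin and print
# the PID of the process in LISTEN state on the given port. It does not kill anything.
pid_listening_on() {
  awk -v port="$1" '
    # Column 1 is the socket state and column 4 the local address:port.
    $1 == "LISTEN" && $4 ~ (":" port "$") && match($0, /pid=[0-9]+/) {
      print substr($0, RSTART + 4, RLENGTH - 4)   # drop the "pid=" prefix
      found = 1
      exit
    }
    END { exit !found }'
}
```

With the example output above, ss -plant | pid_listening_on 8844 prints 5442.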


STEP 6.3
Kill the process using the PID found in Step 6.2 (5442 in this example):

bosh/0:~# kill 5442


STEP 6.4
Check monit to see if the uaa and credhub processes are running again

bosh/0:~# monit summary
The Monit daemon 5.2.5 uptime: 2d 3h 8m 

Process 'nats'                      running
Process 'bosh_nats_sync'            running
Process 'postgres'                  running
Process 'director'                  running
Process 'worker_1'                  running
Process 'worker_2'                  running
Process 'worker_3'                  running
Process 'worker_4'                  running
Process 'worker_5'                  running
Process 'director_scheduler'        running
Process 'metrics_server'            running
Process 'director_sync_dns'         running
Process 'director_nginx'            running
Process 'health_monitor'            running
Process 'uaa'                       running
Process 'credhub'                   running
Process 'system-metrics-agent'      running
Process 'system-metrics-server'     running
Process 'blobstore_nginx'           running
System 'system_1aa3bf4f-e994-4e79-71df-ff8c8d4c2f80' running