This article will detail some methods for troubleshooting an unexpected increase in the TAS metric bbs.LRPsExtra.
A brief summary of this metric is:
Total number of LRP instances that are no longer desired but still have a BBS record. When Diego wants to add more apps, the BBS sends a request to the Auctioneer to spin up additional LRPs.
This means that there are potentially records of LRPs remaining in the database that are no longer desired and should be deleted. More info on this metric can be found here, under "BBS time to handle requests":
To check the current value of this metric, run the following cf nozzle command:
cf nozzle -f ValueMetric | grep -e 'origin:"bbs"' | grep -i extra
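If only the numeric value is of interest, the firehose output can be filtered further. The helper below is a minimal sketch, not part of the cf nozzle plugin: the function name `lrps_extra_values` is hypothetical, and it assumes each matching line contains the fragment name:"LRPsExtra" value:<number>. If your plugin version prints a different layout, adjust the pattern accordingly.

```shell
# Hypothetical helper: print only the numeric value of bbs LRPsExtra metrics
# read from standard input.
# Assumption: each metric line contains  name:"LRPsExtra" value:<number>
lrps_extra_values() {
  # Only consider lines emitted by the bbs origin, then keep the number after "value:".
  sed -n '/origin:"bbs"/s/.*name:"LRPsExtra" value:\([0-9.e+-]*\).*/\1/p'
}
```

It could be used as `cf nozzle -f ValueMetric | lrps_extra_values`, which prints one number per received metric envelope.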
The cf nozzle plugin needs to be installed from here:
All TAS environments.
In very rare cases, an app container fails to exit cleanly: the main process exits, but the envoy process keeps running, which leaves the app instance in the CLAIMED state. Even though the BBS and rep keep sending termination commands to garden, the orphaned container cannot be cleaned up in this situation.
The main indicator of extra LRPs is the presence of entries in the Diego database that are in the CLAIMED state but have no actual processes behind them.
To identify these, first query the TAS Diego database:
$ bosh -d <cf deployment> ssh mysql/0
$ sudo mysql --defaults-file=/var/vcap/jobs/pxc-mysql/config/mylogin.cnf -D diego
mysql> SELECT *
FROM actual_lrps
JOIN domains ON actual_lrps.domain = domains.domain
LEFT JOIN desired_lrps ON actual_lrps.process_guid = desired_lrps.process_guid
WHERE actual_lrps.presence = 0 AND desired_lrps.process_guid IS NULL;
This should return the list of extra LRPs, typically in the CLAIMED state. The number of rows returned should match the current value of bbs.LRPsExtra.
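When many rows are returned, it can help to see how the extra LRPs are distributed across cells. The following is a sketch only: it reuses the tables, columns, and filter from the query above and adds a GROUP BY on cell_id. The function name `extra_lrps_query` is hypothetical and simply prints the SQL so it can be piped into the mysql client.

```shell
# Hypothetical helper: print a variant of the query above that counts
# extra LRPs per cell instead of listing every row.
extra_lrps_query() {
  cat <<'SQL'
SELECT actual_lrps.cell_id, COUNT(*) AS extra_lrps
FROM actual_lrps
JOIN domains ON actual_lrps.domain = domains.domain
LEFT JOIN desired_lrps ON actual_lrps.process_guid = desired_lrps.process_guid
WHERE actual_lrps.presence = 0 AND desired_lrps.process_guid IS NULL
GROUP BY actual_lrps.cell_id;
SQL
}
```

On the mysql VM, the output could be piped into the same client invocation used earlier, for example `extra_lrps_query | sudo mysql --defaults-file=/var/vcap/jobs/pxc-mysql/config/mylogin.cnf -D diego`. Each row then shows one cell_id and how many extra LRPs it holds.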
To resolve bbs.LRPsExtra, restart the `garden` job on the diego_cell that is supposed to be hosting the app instance. First, take the VM instance ID from the 'cell_id' column in the output of the previous SQL query, then run the `ps` command to capture the process list for later investigation:
$ bosh -d <cf deployment> ssh diego_cell/<cell_id> -c "sudo ps axjfww"
Finally, restart the `garden` job on the cell:
$ bosh -d <cf deployment> ssh diego_cell/<cell_id> -c "sudo /var/vcap/bosh/bin/monit restart garden"
After the garden job has been restarted on all listed diego_cells, the bbs.LRPsExtra metric should return to 0.
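If several cells are affected, the two per-cell commands above can be wrapped in a loop. This is a rough sketch under stated assumptions: the function name `restart_garden_on_cells`, the `DRY_RUN` variable, and the `ps-<cell_id>.txt` output file names are all hypothetical conveniences, and the bosh commands are the same ones shown earlier in this article.

```shell
# Hypothetical helper: for each cell ID, save the process list and then
# restart the garden job. Usage: restart_garden_on_cells <cf deployment> <cell_id>...
# Set DRY_RUN=1 to print what would be done without calling bosh.
restart_garden_on_cells() {
  deployment="$1"
  shift
  for cell_id in "$@"; do
    target="diego_cell/$cell_id"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would capture process list and restart garden on $target in $deployment"
      continue
    fi
    # Save the process list first so it is available for later investigation.
    bosh -d "$deployment" ssh "$target" -c "sudo ps axjfww" > "ps-$cell_id.txt"
    # Then restart the garden job on that cell.
    bosh -d "$deployment" ssh "$target" -c "sudo /var/vcap/bosh/bin/monit restart garden"
  done
}
```

Running it once with `DRY_RUN=1` and the cell IDs taken from the SQL output is a cheap way to confirm the list of targets before any job is actually restarted.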