SDDC Manager sosrest service failures and hung backup tasks

Products

VMware Cloud Foundation 4.x
VMware Cloud Foundation 5.x
VMware SDDC Manager
VMware Cloud Foundation

Issue/Introduction

In VMware Cloud Foundation environments, the SDDC Manager UI may display a high number of running tasks with missing descriptions. Users may observe that SDDC backup tasks hang at subtasks such as 'Backup SDDC Manager System Configuration' or 'Package and Encrypt SDDC Manager Backup'. Additionally, SOS health-check operations may remain "In Progress" indefinitely during host IP configuration retrieval.

The sosrest service may fail to start/restart or report errors while running. This often leads to rapid growth of the vcf-sos.log and /var/log/messages files, potentially exhausting space on the root disk partition (/).
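To gauge how close the root partition is to filling up, a minimal sketch such as the following can be run on the appliance. The helper names fs_used_pct and log_sizes are hypothetical and used only for illustration; the log paths are the ones named in this article.

```shell
#!/bin/sh
# Hypothetical helper: report how full a filesystem is, as an integer percentage.
# Usage: fs_used_pct <mount-point>
fs_used_pct() {
    # df -P prints one line per filesystem; column 5 is the capacity (e.g. 42%).
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: print the size of each log file that exists,
# silently skipping any that are absent.
log_sizes() {
    for f in "$@"; do
        [ -f "$f" ] && du -h "$f"
    done
    return 0
}

echo "Root partition used: $(fs_used_pct /)%"
log_sizes /var/log/messages /var/log/vmware/vcf/sddc-support/vcf-sos.log
```

If the two log files account for most of the used space, that is consistent with the log growth described above.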

Symptoms and Log Evidence: Checking the service status with systemctl status sosrest.service -l shows errors similar to the following:

YYYY-MM-DDT HH:MM:SS sosrest[####]: During handling of the above exception, another exception occurred:
YYYY-MM-DDT HH:MM:SS sosrest[####]: Traceback (most recent call last):
YYYY-MM-DDT HH:MM:SS sosrest[####]: File "framework/workflowhandler.py", line 154, in get_workflow_status
YYYY-MM-DDT HH:MM:SS sosrest[####]: File "framework/dbinterface/db_api.py", line 415, in db_to_json
YYYY-MM-DDT HH:MM:SS sosrest[####]: framework.dbinterface.db_api.DBException: Converting from db to json failed

The SOS log at /var/log/vmware/vcf/sddc-support/vcf-sos.log contains errors similar to the following:

YYYY-MM-DDT HH:MM:SS ERROR [vcf_sos] [db_api.py::return_a_session::##::MainThread] DB commit failed
YYYY-MM-DDT HH:MM:SS ERROR [vcf_sos] [db_api.py::db_to_json::###::MainThread] Converting to JSON from db failed
YYYY-MM-DDT HH:MM:SS sqlite3.DatabaseError: database disk image is malformed
YYYY-MM-DDT HH:MM:SS framework.dbinterface.db_api.DBException: Upserting Json to DB failed
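One quick way to confirm these signatures is to count how often each appears in the log. The sketch below is illustrative only; count_signature is a hypothetical helper name, and the search strings are copied from the log excerpts above.

```shell
#!/bin/sh
# Hypothetical helper: count lines in a file that contain a fixed string.
# Usage: count_signature <string> <file>
count_signature() {
    # -c prints the number of matching lines; -F treats the pattern literally.
    # grep exits 1 when nothing matches, so "|| true" keeps the status clean.
    grep -cF -- "$1" "$2" || true
}

LOG=/var/log/vmware/vcf/sddc-support/vcf-sos.log
if [ -r "$LOG" ]; then
    for sig in "database disk image is malformed" \
               "DB commit failed" \
               "Converting to JSON from db failed"; do
        printf '%s: %s\n' "$sig" "$(count_signature "$sig" "$LOG")"
    done
fi
```

A non-zero count for "database disk image is malformed" matches the corruption scenario this article addresses.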

Environment

VMware Cloud Foundation 4.x
VMware Cloud Foundation 5.x
VMware Cloud Foundation 9.x

Cause

The issue is typically caused by corruption within the SOS service database or temporary JSON files. This can occur if:

Required files for the sosrest service become out of sync.
A disk partition on the SDDC Manager appliance becomes full.
A storage event occurs while the service is writing data.
The sosrest service was forcefully stopped or restarted during an active operation.

Resolution

Note: This procedure removes the backup history and SOS health check history from the SDDC Manager UI. Take a snapshot of the SDDC Manager appliance before proceeding.
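Because the delete statements in the procedure are not reversible without the snapshot, it can help to preview how many rows they would remove. The following is a sketch under stated assumptions: count_query and preview_count are hypothetical helpers, the WHERE conditions are taken from the delete commands in this article, and the database connection options mirror those commands. Run it as root on the appliance.

```shell
#!/bin/sh
# Path to psql. The procedure uses /usr/pgsql/13/bin/psql on VCF 5.1.1 and
# higher and plain "psql" on 5.1.0 and lower; override PSQL accordingly.
PSQL=${PSQL:-psql}

# Hypothetical helper: build a read-only count query for a WHERE condition
# on the task_metadata table.
count_query() {
    printf 'select count(*) from task_metadata where %s' "$1"
}

# Hypothetical helper: run the count query against the platform database.
# -t suppresses headers and -A disables column alignment, leaving the number.
preview_count() {
    "$PSQL" -h localhost -U postgres -d platform -tA -c "$(count_query "$1")"
}

# Example invocations (not run automatically):
# preview_count "task_url like '%sos%'"
# preview_count "task_type='SDDCMANAGER_BACKUP'"
```

These queries only read from the table; they do not change any data.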

1. SSH into the SDDC Manager appliance as the vcf user and elevate to root with su -
2. Stop the sosrest service:

- systemctl stop sosrest.service

3. Relocate the corrupted database and temporary status files to a backup directory:

- mv /opt/vmware/vcf/sddc-support/soservice.db /home/vcf
- mv /opt/vmware/vcf/sddc-support/status.json /home/vcf
- mv /opt/vmware/vcf/sddc-support/.status-tmp.json /home/vcf

4. Clear the SOS-related tasks from the Task Aggregator database:

For VCF 5.1.1 and higher:

- /usr/pgsql/13/bin/psql -h localhost -U postgres -d platform -c "delete from task_metadata where task_url like '%sos%'"
- /usr/pgsql/13/bin/psql -h localhost -U postgres -d platform -c "delete from task_metadata where task_type='SDDCMANAGER_BACKUP'"

For VCF 5.1.0 and lower:

- psql -h localhost -U postgres -d platform -c "delete from task_metadata where task_url like '%sos%'"
- psql -h localhost -U postgres -d platform -c "delete from task_metadata where task_type='SDDCMANAGER_BACKUP'"

5. Start the sosrest service:

- systemctl start sosrest.service

6. Restart all SDDC services to synchronize the UI:

- /opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh

7. Log in to the SDDC Manager UI and verify that "Fetching Task" entries have resolved and the dashboard is functional.
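The file relocation in step 3 moves the files directly into /home/vcf. As a variant, the sketch below moves them into a timestamped subdirectory so that repeated runs do not overwrite earlier copies, and skips any file that is not present. The helper name relocate_files and the sos-backup-... directory name are assumptions made for illustration; they are not part of the documented procedure.

```shell
#!/bin/sh
# Hypothetical helper: move each existing file into a destination directory,
# reporting and skipping any file that is missing.
# Usage: relocate_files <dest-dir> <file>...
relocate_files() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        if [ -e "$f" ]; then
            mv -- "$f" "$dest"/ && echo "moved: $f"
        else
            echo "skipped (not found): $f"
        fi
    done
}

# Example invocation (run only after stopping sosrest.service):
# relocate_files "/home/vcf/sos-backup-$(date +%Y%m%d-%H%M%S)" \
#     /opt/vmware/vcf/sddc-support/soservice.db \
#     /opt/vmware/vcf/sddc-support/status.json \
#     /opt/vmware/vcf/sddc-support/.status-tmp.json
```

The example call is left commented out so that pasting the block into a shell defines the function without moving anything.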

Additional Information

If stale tasks remain after completing the steps above, attempt script-based removal as described in the following article:

Cleanup stale tasks after SDDC Manager recovery.