VMware Secure Token Service (STS) fails with "No healthy upstream" error

Products

VMware vCenter Server

Issue/Introduction

When attempting to access the vCenter Server Management Interface (VAMI) or the vCenter Server Appliance (vCSA) user interface, you may observe the following:

The vCenter UI displays the error: "no healthy upstream".
The VMware Secure Token Service (STS) is in a Stopped state in the VAMI.

Manual attempts to restart the STS service fail, or the service stops shortly after starting.

In the vCenter log file /var/log/vmware/vmon/vmon.log, the following entries are recorded:

In(05) host-2765 Constructed command: /usr/bin/python /usr/lib/vmidentity/install/sts-prestart-script.py
Wa(03) host-2765 Service pre-start command completed successfully.
Wa(03) host-2765 Service exited. Exit code 120

The STS pre-start script failure is recorded in /var/log/vmware/sso/sts-prestart.log.

Checking disk usage on the vCSA with df shows output similar to the following. The sample is in 1K blocks; df -h reports the same data in human-readable units:

Filesystem 1K-blocks Used Available Use% Mounted on
devtmpfs 4096 0 4096 0% /dev
tmpfs 29819644 2292 29817352 1% /dev/shm
tmpfs 11927860 1348 11926512 1% /run
tmpfs 4096 0 4096 0% /sys/fs/cgroup
/dev/mapper/vg_root_0-lv_root_0 49222292 19977688 26711844 43% /
/dev/sda3 498900 37428 424776 9% /boot
tmpfs 29819648 4900 29814748 1% /tmp
/dev/sda2 10202 1978 8224 20% /boot/efi
/dev/mapper/vg_lvm_snapshot-lv_lvm_snapshot 1030987928 28 978542924 1% /storage/lvm_snapshot
/dev/mapper/db_vg-db 51282400 3352864 45292124 7% /storage/db
/dev/mapper/lifecycle_vg-lifecycle 102618040 3961596 93397592 5% /storage/lifecycle
/dev/mapper/dblog_vg-dblog 25618660 2965552 21326416 13% /storage/dblog
/dev/mapper/vtsdblog_vg-vtsdblog 25618660 32804 24259164 1% /storage/vtsdblog
/dev/mapper/log_vg-log 25618660 22443040 1848928 93% /storage/log
/dev/mapper/netdump_vg-netdump 10210580 24 9670296 1% /storage/netdump
/dev/mapper/autodeploy_vg-autodeploy 25618660 40 24291928 1% /storage/autodeploy
/dev/mapper/archive_vg-archive 205305832 184772116 10031984 95% /storage/archive
/dev/mapper/core_vg-core 102618040 68137136 29222052 70% /storage/core
/dev/mapper/imagebuilder_vg-imagebuilder 25618660 36 24291932 1% /storage/imagebuilder
/dev/mapper/updatemgr_vg-updatemgr 102618040 7156960 90202228 8% /storage/updatemgr
/dev/mapper/seat_vg-seat 1474794768 12839344 1386966268 1% /storage/seat
/dev/mapper/vtsdb_vg-vtsdb 1474794768 45328 1399760284 1% /storage/vtsdb

Environment

VMware vCenter Server 8.x

Cause

This issue occurs when the /storage/log partition on the vCenter Server Appliance reaches or approaches 100% capacity. When the partition is full, the vCenter Service Manager (vmon) cannot write the Process ID (PID) files it needs to track and manage service states, so the services exit with exit code 120.

Resolution

To resolve this issue, identify the full partition and free space on it.
Note: Always take an offline snapshot of the vCenter Server before performing disk cleanup or configuration changes.

Identify the Full Partition:
- Log in to the vCenter Server Appliance via SSH as the root user.
- Run the following command to check disk space:
  df -h
- Locate the entry for /storage/log. If the Use% is 95% or higher, proceed to the next step.
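The check in this step can be scripted. The following is a minimal sketch, not an official tool: the function names usage_pct and check_partition are hypothetical, and the default threshold of 95 is taken from the step above. It relies only on df -P and awk.

```shell
#!/bin/sh
# Print the Use% of the filesystem that holds the given path, as a bare integer.
usage_pct() {
    # -P selects the POSIX output format: one line per filesystem, Use% in column 5.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 if usage is below the threshold, 1 if it is at or above it.
check_partition() {
    mount_point=$1
    threshold=${2:-95}   # 95 mirrors the Use% value named in the step above
    pct=$(usage_pct "$mount_point")
    if [ "$pct" -ge "$threshold" ]; then
        echo "$mount_point is at ${pct}% (threshold ${threshold}%): free space before restarting services"
        return 1
    fi
    echo "$mount_point is at ${pct}%: below the ${threshold}% threshold"
    return 0
}

# Only run the check where the partition exists (for example, on the appliance itself).
if [ -d /storage/log ]; then
    check_partition /storage/log || true
fi
```

A different mount point or threshold can be passed as arguments, for example check_partition /storage/archive 90.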
Clear Disk Space:
- Navigate to the log directory:
  cd /storage/log
- Identify large files or old log bundles that can be removed or truncated.
- Refer to the article "vCenter log disk exhaustion or /storage/log full" for specific instructions on safely clearing log files.
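To see which files are consuming the most space before deciding what to clean up, something like the following may help. It is a sketch under the assumption that find, du, sort, and head are available in the appliance shell; largest_files is a hypothetical helper name. It only lists files and deletes nothing.

```shell
#!/bin/sh
# List the N largest files under a directory, largest first, with sizes in KiB.
largest_files() {
    dir=$1
    count=${2:-10}
    # -xdev keeps find on the same filesystem as the starting directory.
    find "$dir" -xdev -type f -exec du -k {} + 2>/dev/null \
        | sort -rn \
        | head -n "$count"
}

# Only run where the log partition exists (for example, on the appliance itself).
if [ -d /storage/log ]; then
    largest_files /storage/log 10
fi
```

Review the listed files against the referenced article before removing or truncating anything.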
Check for Known Log Growth Issues (vCenter 8.0 Update 3):
- If you are running vCenter 8.0 U3, check the size of /storage/log/vmware/vmware-updatemgr/updatemgr-vmon.log.stderr.
- If this file is exceptionally large, follow the workaround in the article "File '/storage/log/vmware/vmware-updatemgr/updatemgr-vmon.log.stderr' is very large causing vCenter services to not start" to truncate the file and modify the service configuration.
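The size of that file can be read without opening it. The sketch below only reports the size; the helper name size_bytes is hypothetical, and no size threshold is applied because this article does not define one.

```shell
#!/bin/sh
# Print the size of a file in bytes.
size_bytes() {
    # tr strips the padding some wc implementations add around the count.
    wc -c < "$1" | tr -d ' '
}

# Path taken from the step above.
STDERR_LOG=/storage/log/vmware/vmware-updatemgr/updatemgr-vmon.log.stderr

if [ -f "$STDERR_LOG" ]; then
    bytes=$(size_bytes "$STDERR_LOG")
    echo "$STDERR_LOG: $bytes bytes ($((bytes / 1048576)) MiB)"
else
    echo "$STDERR_LOG not found"
fi
```

Compare the reported size with the free space shown by df for /storage/log to judge whether this file is the main consumer.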
Restart vCenter Services:
- Once space has been freed, restart all services to ensure they can write their PID files correctly:
  service-control --stop --all && service-control --start --all
Verify Service Status:
- Confirm all critical services are running:
  service-control --status --all
- Verify that the vCenter UI is now accessible and the "No healthy upstream" error is resolved.
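To pick out the services that are still stopped, the status output can be filtered. This sketch assumes that service-control --status --all prints section headings such as "Running:" and "Stopped:", each followed by service names separated by whitespace; that layout is an assumption, so verify it against the output on your appliance. stopped_services is a hypothetical helper name.

```shell
#!/bin/sh
# Read service-control status text on stdin and print one stopped service per line.
# Assumed layout: a "Stopped:" heading followed by whitespace-separated names.
stopped_services() {
    awk '
        /^Stopped:/     { in_stopped = 1; next }   # start of the Stopped section
        /^[A-Za-z]+:/   { in_stopped = 0 }         # any other heading ends it
        in_stopped      { for (i = 1; i <= NF; i++) print $i }
    '
}

# Only run where the service-control command is available.
if command -v service-control >/dev/null 2>&1; then
    service-control --status --all 2>/dev/null | stopped_services
fi
```

Some services are stopped by default, so a non-empty list is not necessarily a fault; check specifically that the STS service is not in it.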