SDDC Manager Upgrade fails at 1-hour timeout during 'tdnf' package query with 'IDENTITY_SAML_TOKEN_EXPIRED'

Article ID: 434263

Updated On:

Products

VMware Cloud Foundation

Issue/Introduction

  • SDDC Manager platform upgrade to version 9.x fails with the state COMPLETED_WITH_FAILURE
  • The SDDC Manager UI may show the update progress as 0 and may become inaccessible.
  • In /var/log/vmware/vcf/sddc-manager/sddcManagerServer.log the following error is observed:

    ERROR: axios.error.response.data {"errorCode":"IDENTITY_SAML_TOKEN_EXPIRED","message":"The SAML token has expired on YYYY-MM-DD"}

  • The VMware_Cloud_Foundation_Services_and_Platform_Upgrades task transitions to failure exactly 3600 seconds (1 hour) after initiation, as observed in /var/log/vmware/vcf/lcm/thirdparty/upgrades/########-####-####-####-########/vcf-platform/upgrade/vcf_platform_upgrade.log:

    INFO: Updated /var/log/vmware/vcf/lcm/thirdparty/upgrades/########-####-####-####-########/vcf-platform/upgrade/vcf_platform_upgrade.status status file with data OrderedDict([('upgradeId', '########-####-####-####-########'), ('resourceId', '########-####-####-####-########'), ('upgradeStatusCode', 'INPROGRESS'), ('progress', 0), ('error', {'errorCode': None, 'errorDescription': None}), ('startTime', 1773073900)])
    INFO: Updated /var/log/vmware/vcf/lcm/thirdparty/upgrades/########-####-####-####-########/vcf-platform/upgrade/vcf_platform_upgrade.status status file with data OrderedDict([('upgradeId', '########-####-####-####-########'), ('resourceId', '########-####-####-####-########'), ('upgradeStatusCode', 'INPROGRESS'), ('progress', 0), ('error', {'errorCode': None, 'errorDescription': None}), ('startTime', 1773073900)])
    INFO: Execute cmd: tdnf --disablerepo=* list installed > /var/log/vmware/vcf/lcm/thirdparty/upgrades/########-####-####-####-########/vcf-platform/upgrade/tdnf_list_before_upgrade.txt

  • The /var/log/vmware/capengine/cap-update/workflow.log or /var/log/vmware/vcf/lcm/lcm-debug.log shows the process stalled at the following command: 

    INFO: Execute cmd: tdnf --disablerepo=* list installed
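
The symptoms above can be checked with a short script. The following is a minimal sketch, not a supported tool: it assumes the log lines have the same shape as the excerpts in this article, and the function names summarize and expected_failure_time are hypothetical. It reports whether the SAML error and the tdnf query appear in a set of log lines, and converts the startTime epoch value from the status entry into the time at which the 1-hour limit would be reached.

```python
import re
from datetime import datetime, timezone

# Signatures taken from the log excerpts in this article.
SAML_MARKER = "IDENTITY_SAML_TOKEN_EXPIRED"
TDNF_MARKER = "Execute cmd: tdnf --disablerepo=* list installed"
START_TIME_RE = re.compile(r"\('startTime', (\d+)\)")
TIMEOUT_SECONDS = 3600  # the 1-hour limit described in this article


def summarize(lines):
    """Report which signatures of this issue appear in the given log lines."""
    summary = {"saml_expired": False, "tdnf_query": False, "start_time": None}
    for line in lines:
        if SAML_MARKER in line:
            summary["saml_expired"] = True
        if TDNF_MARKER in line:
            summary["tdnf_query"] = True
        match = START_TIME_RE.search(line)
        if match:
            # Keep the most recent startTime seen (epoch seconds).
            summary["start_time"] = int(match.group(1))
    return summary


def expected_failure_time(start_time):
    """Return the UTC time at which the 1-hour limit would be reached."""
    return datetime.fromtimestamp(start_time + TIMEOUT_SECONDS, tz=timezone.utc)
```

For example, passing the lines of vcf_platform_upgrade.log to summarize() and then calling expected_failure_time() on the reported start_time gives a timestamp that can be compared with the time the task was marked as failed.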

Environment

VCF 9.x

Cause

This issue is primarily caused by a transient hang or severe I/O delay in the Photon OS package manager (`tdnf`) while querying the local RPM database. This triggers a hardcoded 1-hour timeout gate within the SDDC Manager Lifecycle Management (LCM) orchestrator.

Why the SAML error occurs: The IDENTITY_SAML_TOKEN_EXPIRED error is a secondary symptom. Because the upgrade task hangs for over 60 minutes, the internal services may become unresponsive or lose connectivity to vCenter. When the system finally attempts to resume or report status after the timeout, it finds that the original SAML authentication token has expired and cannot be refreshed while the services are in this stalled state, which leaves SDDC Manager down.
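
To illustrate the general pattern of a timeout gate around an external command, the following is a minimal sketch. It is not the LCM orchestrator's implementation: the function name run_with_timeout and the returned status strings are hypothetical. It only shows how a command that never returns ends in a timeout result instead of a normal exit code.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return 'COMPLETED', 'FAILED' or 'TIMED_OUT' (illustrative names)."""
    try:
        # subprocess.run kills the child process when the timeout expires.
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return "TIMED_OUT"
    return "COMPLETED" if result.returncode == 0 else "FAILED"
```

With a 3600-second limit, a package query that hangs would yield the timed-out result one hour after it started, which matches the failure timing described in this article.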

Resolution

Since the underlying RPM lock or I/O bottleneck is transient, no manual intervention (such as clearing RPM database locks or manually removing packages) is required.

  1. Wait for the task to fully terminate: Ensure the LCM workflow has stopped the parent process (this usually happens automatically once the 1-hour timeout is reached).
  2. Retry the Upgrade: Navigate to the SDDC Manager UI and retry the upgrade workflow.
  3. Verify Success: The transient lock is typically released when the previous process is terminated, allowing the retry to succeed automatically.
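
Before retrying, it may help to confirm that no tdnf process from the previous attempt is still running. The following is a minimal sketch, assuming a Linux-style /proc filesystem on the appliance and sufficient privileges to read other processes' command lines. The function name find_processes is hypothetical, and the script only reports matching process IDs without changing anything.

```python
import os


def find_processes(name, proc_root="/proc"):
    """Return the PIDs whose executable name (first argument) equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                argv = fh.read().split(b"\0")
        except OSError:
            continue  # process exited or is not readable
        executable = os.path.basename(argv[0].decode(errors="replace"))
        if executable == name:
            pids.append(int(entry))
    return sorted(pids)


if __name__ == "__main__":
    leftover = find_processes("tdnf")
    if leftover:
        print("tdnf still running, PIDs:", leftover)
    else:
        print("No tdnf process found.")
```

An empty result suggests the previous query has terminated. A non-empty result suggests waiting longer before retrying the upgrade.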