STS Health status report RED without any service state change | Alarm 'Health status changed alarm' on Datacenters triggered by event 584978 'sts status changed from green to red']
search cancel

STS Health status report RED without any service state change | Alarm 'Health status changed alarm' on Datacenters triggered by event 584978 'sts status changed from green to red']

book

Article ID: 374405

calendar_today

Updated On:

Products

VMware vCenter Server VMware vCenter Server 7.0 VMware vCenter Server 8.0

Issue/Introduction

We receive an alarm in the vCenter
 [Alarm 'Health status changed alarm' on Datacenters triggered by event 584978 'sts status changed from green to red']

After a few seconds, we received an Alarm that status changed to green.

Alarm 'Health status changed alarm' on Datacenters triggered by event 584984 'sts status changed from red to green']

Upon checking the vcenter service we don't see the STS service in the stopped state. The service is running without any issues.

Environment

vCenter 7.0x and vCenter 8.0.x

Cause

  • In the vmafdvmdirclient.log we can see that a password reset happened for the machine account for the vCenter ([email protected]).
  • At the same time, if the vMON is polling the service health state for the service(in this case STS), the older token used to login become invalid.
  • Due to this Bind error will be noticed, the STS is not able to log in, it throws these errors. 
    In vmware-identity-sts.log,
    Authentication failed for user  ([email protected]). in tenant
    IDM rejected authentication by UPN.
    Invalid credentials.


    How to validate these from logs?

    less journalctl_-b--0.txt | grep Alarm | egrep sts
    Jul 19 14:00:24 otsl2nb0.otp-adc.local vpxd[5914]: Event [584980] [1-1] [2024-07-19T05:00:24.382335Z] [vim.event.EventEx] [info] [] [] [584980] [Alarm 'Health status changed alarm' on Datacenters triggered by event 584978 'sts status changed from green to red']
    Jul 19 14:00:26 otsl2nb0.otp-adc.local vpxd[5914]: Event [584986] [1-1] [2024-07-19T05:00:26.37667Z] [vim.event.EventEx] [info] [] [] [584986] [Alarm 'Health status changed alarm' on Datacenters triggered by event 584984 'sts status changed from red to green']


# In this log we can check that the password refresh for the machine account was performed at 2024-07-19T05:00:05 UTC.
/var/log/vmware/vmdir/ vmafdvmdirclient.log

2024-07-19T05:00:05.032Z:t@140708178532096:INFO: _VmDirLdapModReplaceAttribute (cn=FQDN_vCenter,ou=Domain Controllers,dc=vsphere,dc=local)(userPassword)
2024-07-19T05:00:05.033Z:t@140708178532096:INFO: Act (FQDN_vCenter@vsphere.local) password refreshed

# During the same time we can see that vmon tried to poll the Health status for the service and reported the error.
 less vmon.log | grep sts | egrep failed
2024-07-19T05:00:05.506Z Wa(03) host-2267 <sts> Health of service failed. Health data:

# In the below log we can see that till 2024-07-19T04:55:00 UTC the health status for the STS was published. 
# When the polling interval came after 4-5 minutes, we experienced the error "Provided credentials are not valid". 
# The reason is that at the very same time the password for the machine account ([email protected]) was refreshed.

"/var/log/vmware/sso/sts-health-status.log"
2024-07-19T04:39:44.292Z - __main__ - INFO - Publishing health status as GREEN to vMon.
2024-07-19T04:44:49.643Z - __main__ - INFO - Publishing health status as GREEN to vMon.
2024-07-19T04:49:55.524Z - __main__ - INFO - Publishing health status as GREEN to vMon.
2024-07-19T04:55:00.097Z - __main__ - INFO - Publishing health status as GREEN to vMon.
2024-07-19T05:00:05.490Z - __main__ - INFO - Stdout: 2024-07-19 14:00:04,740 main ERROR Unable to locate appender "fileAppender" for logger config "root"
2024-07-19 14:00:04,741 main ERROR Unable to locate appender "perfLogAppender" for logger config "com.vmware.identity.performanceSupport.PerfDataSink"
com.vmware.vim.sso.client.exception.AuthenticationFailedException: Provided credentials are not valid.
Refreshing STS signing certificates and retrying VMware STS check

2024-07-19T05:00:05.490Z - __main__ - INFO - Stderr: Picked up JAVA_TOOL_OPTIONS: -Xms32M -Xmx128M     -Dcom.sun.org.apache.xml.internal.security.ignoreLineBreaks=true     -Dorg.apache.xml.security.ignoreLineBreaks=true
log4j:WARN No appenders could be found for logger (com.vmware.identity.interop.NativeLibraryPreloader).
log4j:WARN Please initialize the log4j system properly.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/vmware/lib64/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/vmware/common-jars/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Exception in thread "main" com.vmware.vim.sso.client.exception.AuthenticationFailedException: Provided credentials are not valid.
        at com.vmware.vim.sso.client.impl.SecurityTokenServiceImpl$RequestResponseProcessor.handleFaultCondition(SecurityTokenServiceImpl.java:1066)
        at com.vmware.vim.sso.client.impl.SecurityTokenServiceImpl$RequestResponseProcessor.sendRequest(SecurityTokenServiceImpl.java:988)
        at com.vmware.vim.sso.client.impl.SecurityTokenServiceImpl$RequestResponseProcessor.executeRoundtrip(SecurityTokenServiceImpl.java:902)
        at com.vmware.vim.sso.client.impl.SecurityTokenServiceImpl.acquireToken(SecurityTokenServiceImpl.java:155)
        at com.vmware.identity.installer.STSInstaller.check_sts(STSInstaller.java:159)
        at com.vmware.identity.health.StsHealthStatus.main(StsHealthStatus.java:25)

2024-07-19T05:00:05.490Z - __main__ - INFO - Failed to acquire token after 2.410531 seconds.  <--- # Here we can see the token was received after 60 sec. Total time taken to receive the token : 62.41 sec. 
2024-07-19T05:00:05.490Z - __main__ - ERROR - Publishing health status as RED to vMon.  <-- Health status reported as RED.
2024-07-19T05:00:25.337Z - __main__ - INFO - Publishing health status as GREEN to vMon.
2024-07-19T05:05:10.850Z - __main__ - INFO - Publishing health status as GREEN to vMon.


# Here is proof that the token we received on 2024-07-19T05:00:05 UTC was an invalid one as below.
# In this log we can see that the machine account reported an "Authentication failed" status. 
# In addition to that we can see LDAP error 49, which is the error code for LDAP when we have "Invalid credentials".

less vmware-identity-sts.log | grep otsl2nb0 | egrep Authentication | egrep failed

2024-07-19T05:00:05.313Z INFO sts[78:tomcat-http--39] [CorId=56d1f9ef-c88c-4441-9983-ba9957d6dbea] [com.vmware.identity.idm.server.IdentityManager] Authentication failed for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [26] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]
2024-07-19T05:00:05.443Z INFO sts[87:tomcat-http--48] [CorId=71cdfd4d-67a9-4bd3-bd5a-0477e2cab770] [com.vmware.identity.idm.server.IdentityManager] Authentication failed for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [27] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]


[Details of the failure of the user]
2024-07-19T04:58:50.979Z INFO sts[79:tomcat-http--40] [CorId=80ed8c24-a272-4519-a122-505cc8a5cacf] [com.vmware.identity.sts.impl.STSImpl] Exiting validate() token...
2024-07-19T05:00:05.286Z INFO sts[78:tomcat-http--39] [CorId=56d1f9ef-c88c-4441-9983-ba9957d6dbea] [com.vmware.identity.sts.ws.handlers.SOAPHeadersExtractor] Found 1 {http://docs.xxx-xxx.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd}Security headers
2024-07-19T05:00:05.287Z INFO sts[78:tomcat-http--39] [CorId=56d1f9ef-c88c-4441-9983-ba9957d6dbea] [com.vmware.identity.sts.impl.STSImpl] Entering issue() token...
2024-07-19T05:00:05.291Z WARN sts[78:tomcat-http--39] [CorId=56d1f9ef-c88c-4441-9983-ba9957d6dbea] [com.vmware.identity.interop.ldap.LdapErrorChecker] Error received by LDAP client: com.vmware.identity.interop.ldap.OpenLdapClientLibrary, error code: 49  
2024-07-19T05:00:05.291Z WARN sts[78:tomcat-http--39] [CorId=56d1f9ef-c88c-4441-9983-ba9957d6dbea] [com.vmware.identity.idm.server.ServerUtils] cannot bind connection: [ldap://FQDN_vCenter389, FQDN_vCenter@vsphere.local]
2024-07-19T05:00:05.291Z ERROR sts[78:tomcat-http--39] [CorId=56d1f9ef-c88c-4441-9983-ba9957d6dbea] [com.vmware.identity.idm.server.ServerUtils] cannot establish ldap connection with URI: [ldap://FQDN_vCenter:389] because [Invalid credentials] therefore will not attempt to use any secondary URIs
2024-07-19T05:00:05.291Z WARN sts[78:tomcat-http--39] [CorId=56d1f9ef-c88c-4441-9983-ba9957d6dbea] [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider] Failed to authenticate using SRP binding
com.vmware.identity.interop.ldap.InvalidCredentialsLdapException: Invalid credentials

 

[Token within the limit]
2024-07-19T04:39:44.124Z INFO sts[69:tomcat-http--30] [CorId=c7e3a16c-2144-42f4-8efd-7a1db33f2db0] [com.vmware.identity.idm.server.IdentityManager] Authentication succeeded for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [3] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]
2024-07-19T04:44:49.451Z INFO sts[51:tomcat-http--12] [CorId=4156777e-1ee3-4e51-ac4f-c237496f073a] [com.vmware.identity.idm.server.IdentityManager] Authentication succeeded for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [3] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]
2024-07-19T04:49:55.349Z INFO sts[62:tomcat-http--23] [CorId=5d2ab340-1aaa-4936-9a69-81cd76e0c3e3] [com.vmware.identity.idm.server.IdentityManager] Authentication succeeded for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [4] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]
2024-07-19T04:54:59.852Z INFO sts[40:tomcat-http--1] [CorId=244fc77d-de6c-44bc-a9bc-9c778b3365ae] [com.vmware.identity.idm.server.IdentityManager] Authentication succeeded for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [3] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]
2024-07-19T05:00:24.971Z INFO sts[50:tomcat-http--10] [CorId=650d63b0-6405-4fa9-825e-3c0446b4da43] [com.vmware.identity.idm.server.IdentityManager] Authentication succeeded for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [12] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]
2024-07-19T05:00:25.744Z INFO sts[52:tomcat-http--12] [CorId=9bedbb2b-fd22-4fe8-8e4f-8128bbb3e69c] [com.vmware.identity.idm.server.IdentityManager] Authentication succeeded for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [4] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]
2024-07-19T05:05:10.533Z INFO sts[70:tomcat-http--30] [CorId=d69d7ea9-924f-4f68-95f4-9027e46435f2] [com.vmware.identity.idm.server.IdentityManager] Authentication succeeded for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [4] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]

 

The time between 4:54:59 AM and 5:00:24 AM is: 5 minutes, and 25 seconds

 

 

There are two scripts involved in this issue. 
1. sts-vmon-health-checker.py
2. vmonEventPublisher.py 


Checking the definition for these scripts
========================================================

/usr/lib/vmidentity/install/sts-vmon-health-checker.py

35     logfile = "/var/log/vmware/sso/sts-health-status.log"

109 def get_sts_status():
    110     '''Checks status of STS service.
    111
    112     :return: STS service status as string.
    113     '''
    114
.............
    119             -Dvmware.log.dir=/var/log/vmware/sso/ \
    120             -XX:ErrorFile=/var/log/vmware/sso/hs_err_stsinstaller_pid%p.log \
    121             -XX:HeapDumpPath=/var/log/vmware/sso/ \
    122             com.vmware.identity.health.StsHealthStatus"


## In our case token can be one cause of the issue, as we noticed a machine account password change. 
 
    133                 if total_time > 60: <---Suggested timeout change
    134                     # To avoid superfluous logs, log only when it takes longer than 60
    135                     # seconds to acquire a token.
    136                     log.info("Acquired token succesfully in %f seconds." % total_time)
    137                 log.info("Publishing health status as GREEN to vMon.")
    138                 health = GREEN
    139             else:
    .. 
    142                 log.info("Failed to acquire token after %f seconds." % total_time)
    143                 log.error("Publishing health status as RED to vMon.")
    144                 health = RED


/usr/lib/vmware-vmon/vmonEventPublisher.py 

     50 def get_machine_accnt_certandkey():
     51     """
     52     Get certificate and key for the machine account.
     53     """
     54     ks = VmKeyStore('VKS')
     55     ks.load('machine')
     56     cert = ks.get_certificate('machine')
     57     key = ks.get_key('machine')
     58     return (cert, key)


     61 def health_status_to_color(health_status):
     62     """
     63     Convert vMon health status string to corresponding color string.
     64     """
     65     if health_status == 'healthy' or health_status == 'green':
     66         return 'green'
     67     elif health_status == 'initialized' or health_status == 'yellow':
     68         return 'yellow'
     69     elif health_status == 'unhealthy' or health_status == 'red':
     70         return 'red'
     71     else:
     72         return 'unknown'

 

 

Resolution

What to do if we see the Alarm?
- There is no major action to be taken. Any admin can keep track of the Alarm and the frequency. 
If the Alarm is triggered every 5 minutes when the vmon checks the health then, please validate the status of the STS service. 
Command: service-control --status --all. 
VAMI : https://vcIP/FQDN:5480 > services
- If the service is stopped then these need attention from VMware.  


- increase the timeout in /usr/lib/vmidentity/install/sts-vmon-health-checker.py    

    133                 if total_time > 90:

Additional Information

/usr/lib/vmidentity/install/sts-vmon-health-checker.py

35     logfile = "/var/log/vmware/sso/sts-health-status.log"

109 def get_sts_status():
    110     '''Checks status of STS service.
    111
    112     :return: STS service status as string.
    113     '''
    114
.............
    119             -Dvmware.log.dir=/var/log/vmware/sso/ \
    120             -XX:ErrorFile=/var/log/vmware/sso/hs_err_stsinstaller_pid%p.log \
    121             -XX:HeapDumpPath=/var/log/vmware/sso/ \
    122             com.vmware.identity.health.StsHealthStatus"


## In our case token can be one cause of the issue, as we noticed a machine account password change. 
 
    133                 if total_time > 60: <---Suggested timeout change
    134                     # To avoid superfluous logs, log only when it takes longer than 60
    135                     # seconds to acquire a token.
    136                     log.info("Acquired token succesfully in %f seconds." % total_time)
    137                 log.info("Publishing health status as GREEN to vMon.")
    138                 health = GREEN
    139             else:
    .. 
    142                 log.info("Failed to acquire token after %f seconds." % total_time)
    143                 log.error("Publishing health status as RED to vMon.")
    144                 health = RED