STS Health status report RED without any service state change | Alarm 'Health status changed alarm' on Datacenters triggered by event 584978 'sts status changed from green to red']
search cancel

STS Health status report RED without any service state change | Alarm 'Health status changed alarm' on Datacenters triggered by event 584978 'sts status changed from green to red']

book

Article ID: 374405

calendar_today

Updated On:

Products

VMware vCenter Server VMware vCenter Server 7.0 VMware vCenter Server 8.0

Issue/Introduction

We receive an alarm in the vCenter
 Alarm 'Health status changed alarm' on Datacenters triggered by event 584978 'sts status changed from green to red'

After a few seconds, we received an Alarm that the status changed to green.

Alarm 'Health status changed alarm' on Datacenters triggered by event 584984 'sts status changed from red to green']

Upon checking the vCenter service, we don't see the STS service in the stopped state. The service is running without any issues.

Environment

VMware vCenter Server 7.0

VMware vCenter Server 8.0

Cause

  • In the /var/log/vmware/vmdir/ vmafdvmdirclient.log we can see that a password reset happened for the machine account for the vCenter ([email protected]).
  • At the same time, if the vMON is polling the service health state for the service(in this case STS), the older token used to log in becomes invalid.
  • Due to this Bind error will be noticed, the STS is not able to log in, it throws these errors. 
    In vmware-identity-sts.log,
    Authentication failed for user  ([email protected]). in tenant
    IDM rejected authentication by UPN.
    Invalid credentials.


    How to validate these from logs?

    less journalctl_-b--0.txt | grep Alarm | egrep sts

    Jul 19 14:00:24 vcenter_fqdn vpxd[5914]: Event [584980] [1-1] [YYYY-MM-DDT05:00:24.382335Z] [vim.event.EventEx] [info] [] [] [584980] [Alarm 'Health status changed alarm' on Datacenters triggered by event 584978 'sts status changed from green to red']

    Jul 19 14:00:26 vcenter_fqdn vpxd[5914]: Event [584986] [1-1] [YYYY-MM-DDT05:00:26.37667Z] [vim.event.EventEx] [info] [] [] [584986] [Alarm 'Health status changed alarm' on Datacenters triggered by event 584984 'sts status changed from red to green']


# In this log we can check that the password refresh for the machine account was performed at YYYY-MM-DDT05:00:05 UTC.
/var/log/vmware/vmdir/ vmafdvmdirclient.log

YYYY-MM-DDT05:00:05.032Z:t@140708178532096:INFO: _VmDirLdapModReplaceAttribute (cn=FQDN_vCenter,ou=Domain Controllers,dc=vsphere,dc=local)(userPassword)
YYYY-MM-DDT05:00:05.033Z:t@140708178532096:INFO: Act (FQDN_vCenter@vsphere.local) password refreshed

# During the same time, we can see that vmon tried to poll the Health status for the service and reported the error.
 less vmon.log | grep sts | egrep failed
YYYY-MM-DDT05:00:05.506Z Wa(03) host-2267 <sts> Health of service failed. Health data:

# In the below log we can see that till YYYY-MM-DDT04:55:00 UTC the health status for the STS was published. 
# When the polling interval came after 4-5 minutes, we experienced the error "Provided credentials are not valid". 
# The reason is that at the very same time the password for the machine account ([email protected]) was refreshed.

/var/log/vmware/sso/sts-health-status.log
YYYY-MM-DDT04:39:44.292Z - __main__ - INFO - Publishing health status as GREEN to vMon.
YYYY-MM-DDT04:44:49.643Z - __main__ - INFO - Publishing health status as GREEN to vMon.
YYYY-MM-DDT04:49:55.524Z - __main__ - INFO - Publishing health status as GREEN to vMon.
YYYY-MM-DDT04:55:00.097Z - __main__ - INFO - Publishing health status as GREEN to vMon.
YYYY-MM-DDT05:00:05.490Z - __main__ - INFO - Stdout: YYYY-MM-DD 14:00:04,740 main ERROR Unable to locate appender "fileAppender" for logger config "root"
YYYY-MM-DD 14:00:04,741 main ERROR Unable to locate appender "perfLogAppender" for logger config "com.vmware.identity.performanceSupport.PerfDataSink"
com.vmware.vim.sso.client.exception.AuthenticationFailedException: Provided credentials are not valid.
Refreshing STS signing certificates and retrying VMware STS check

YYYY-MM-DDT05:00:05.490Z - __main__ - INFO - Stderr: Picked up JAVA_TOOL_OPTIONS: -Xms32M -Xmx128M     -Dcom.sun.org.apache.xml.internal.security.ignoreLineBreaks=true     -Dorg.apache.xml.security.ignoreLineBreaks=true
log4j:WARN No appenders could be found for logger (com.vmware.identity.interop.NativeLibraryPreloader).
log4j:WARN Please initialize the log4j system properly.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/vmware/lib64/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/vmware/common-jars/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Exception in thread "main" com.vmware.vim.sso.client.exception.AuthenticationFailedException: Provided credentials are not valid.
        at com.vmware.vim.sso.client.impl.SecurityTokenServiceImpl$RequestResponseProcessor.handleFaultCondition(SecurityTokenServiceImpl.java:1066)
        at com.vmware.vim.sso.client.impl.SecurityTokenServiceImpl$RequestResponseProcessor.sendRequest(SecurityTokenServiceImpl.java:988)
        at com.vmware.vim.sso.client.impl.SecurityTokenServiceImpl$RequestResponseProcessor.executeRoundtrip(SecurityTokenServiceImpl.java:902)
        at com.vmware.vim.sso.client.impl.SecurityTokenServiceImpl.acquireToken(SecurityTokenServiceImpl.java:155)
        at com.vmware.identity.installer.STSInstaller.check_sts(STSInstaller.java:159)
        at com.vmware.identity.health.StsHealthStatus.main(StsHealthStatus.java:25)

YYYY-MM-DDT05:00:05.490Z - __main__ - INFO - Failed to acquire token after 2.410531 seconds.  <--- # Here we can see the token was received after 60 sec. Total time taken to receive the token : 62.41 sec. 
YYYY-MM-DDT05:00:05.490Z - __main__ - ERROR - Publishing health status as RED to vMon.  <-- Health status reported as RED.
YYYY-MM-DDT05:00:25.337Z - __main__ - INFO - Publishing health status as GREEN to vMon.
YYYY-MM-DDT05:05:10.850Z - __main__ - INFO - Publishing health status as GREEN to vMon.


# Here is proof that the token we received on YYYY-MM-DDT05:00:05 UTC was an invalid one as below.
# In this log we can see that the machine account reported an "Authentication failed" status. 
# In addition to that we can see LDAP error 49, which is the error code for LDAP when we have "Invalid credentials".

less vmware-identity-sts.log | grep vceter_fqdn | egrep Authentication | egrep failed

YYYY-MM-DDT05:00:05.313Z INFO sts[78:tomcat-http--39] [CorId=56d1f9ef-c88c-4441-9983-ba9957d6dbea] [com.vmware.identity.idm.server.IdentityManager] Authentication failed for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [26] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]
YYYY-MM-DDT05:00:05.443Z INFO sts[87:tomcat-http--48] [CorId=71cdfd4d-67a9-4bd3-bd5a-0477e2cab770] [com.vmware.identity.idm.server.IdentityManager] Authentication failed for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [27] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]


[Details of the failure of the user]
YYYY-MM-DDT04:58:50.979Z INFO sts[79:tomcat-http--40] [CorId=80ed8c24-a272-4519-a122-505cc8a5cacf] [com.vmware.identity.sts.impl.STSImpl] Exiting validate() token...
YYYY-MM-DDT05:00:05.286Z INFO sts[78:tomcat-http--39] [CorId=56d1f9ef-c88c-4441-9983-ba9957d6dbea] [com.vmware.identity.sts.ws.handlers.SOAPHeadersExtractor] Found 1 {http://docs.xxx-xxx.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd}Security headers
YYYY-MM-DDT05:00:05.287Z INFO sts[78:tomcat-http--39] [CorId=56d1f9ef-c88c-4441-9983-ba9957d6dbea] [com.vmware.identity.sts.impl.STSImpl] Entering issue() token...
YYYY-MM-DDT05:00:05.291Z WARN sts[78:tomcat-http--39] [CorId=56d1f9ef-c88c-4441-9983-ba9957d6dbea] [com.vmware.identity.interop.ldap.LdapErrorChecker] Error received by LDAP client: com.vmware.identity.interop.ldap.OpenLdapClientLibrary, error code: 49  
YYYY-MM-DDT05:00:05.291Z WARN sts[78:tomcat-http--39] [CorId=56d1f9ef-c88c-4441-9983-ba9957d6dbea] [com.vmware.identity.idm.server.ServerUtils] cannot bind connection: [ldap://FQDN_vCenter389, FQDN_vCenter@vsphere.local]
YYYY-MM-DDT05:00:05.291Z ERROR sts[78:tomcat-http--39] [CorId=56d1f9ef-c88c-4441-9983-ba9957d6dbea] [com.vmware.identity.idm.server.ServerUtils] cannot establish ldap connection with URI: [ldap://FQDN_vCenter:389] because [Invalid credentials] therefore will not attempt to use any secondary URIs
YYYY-MM-DDT05:00:05.291Z WARN sts[78:tomcat-http--39] [CorId=56d1f9ef-c88c-4441-9983-ba9957d6dbea] [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider] Failed to authenticate using SRP binding
com.vmware.identity.interop.ldap.InvalidCredentialsLdapException: Invalid credentials


 

[Token within the limit]
YYYY-MM-DDT04:39:44.124Z INFO sts[69:tomcat-http--30] [CorId=c7e3a16c-2144-42f4-8efd-7a1db33f2db0] [com.vmware.identity.idm.server.IdentityManager] Authentication succeeded for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [3] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]
YYYY-MM-DDT04:44:49.451Z INFO sts[51:tomcat-http--12] [CorId=4156777e-1ee3-4e51-ac4f-c237496f073a] [com.vmware.identity.idm.server.IdentityManager] Authentication succeeded for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [3] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]
YYYY-MM-DDT04:49:55.349Z INFO sts[62:tomcat-http--23] [CorId=5d2ab340-1aaa-4936-9a69-81cd76e0c3e3] [com.vmware.identity.idm.server.IdentityManager] Authentication succeeded for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [4] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]
YYYY-MM-DDT04:54:59.852Z INFO sts[40:tomcat-http--1] [CorId=244fc77d-de6c-44bc-a9bc-9c778b3365ae] [com.vmware.identity.idm.server.IdentityManager] Authentication succeeded for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [3] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]
YYYY-MM-DDT05:00:24.971Z INFO sts[50:tomcat-http--10] [CorId=650d63b0-6405-4fa9-825e-3c0446b4da43] [com.vmware.identity.idm.server.IdentityManager] Authentication succeeded for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [12] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]
YYYY-MM-DDT05:00:25.744Z INFO sts[52:tomcat-http--12] [CorId=9bedbb2b-fd22-4fe8-8e4f-8128bbb3e69c] [com.vmware.identity.idm.server.IdentityManager] Authentication succeeded for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [4] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]
YYYY-MM-DDT05:05:10.533Z INFO sts[70:tomcat-http--30] [CorId=d69d7ea9-924f-4f68-95f4-9027e46435f2] [com.vmware.identity.idm.server.IdentityManager] Authentication succeeded for user [FQDN_vCenter@vsphere.local] in tenant [vsphere.local] in [4] milliseconds with provider [vsphere.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider]

The time between 4:54:59 AM and 5:00:24 AM is 5 minutes and 25 seconds

 

 

There are two scripts involved in this issue. 
1. sts-vmon-health-checker.py
2. vmonEventPublisher.py 


Checking the definition for these scripts
========================================================

/usr/lib/vmidentity/install/sts-vmon-health-checker.py

35     logfile = "/var/log/vmware/sso/sts-health-status.log"

109 def get_sts_status():
    110     '''Checks status of STS service.
    111
    112     :return: STS service status as string.
    113     '''
    114
.............
    119             -Dvmware.log.dir=/var/log/vmware/sso/ \
    120             -XX:ErrorFile=/var/log/vmware/sso/hs_err_stsinstaller_pid%p.log \
    121             -XX:HeapDumpPath=/var/log/vmware/sso/ \
    122             com.vmware.identity.health.StsHealthStatus"


## In our case token can be one cause of the issue, as we noticed a machine account password change. 
 
    133                 if total_time > 60: <---Suggested timeout change
    134                     # To avoid superfluous logs, log only when it takes longer than 60
    135                     # seconds to acquire a token.
    136                     log.info("Acquired token succesfully in %f seconds." % total_time)
    137                 log.info("Publishing health status as GREEN to vMon.")
    138                 health = GREEN
    139             else:
    .. 
    142                 log.info("Failed to acquire token after %f seconds." % total_time)
    143                 log.error("Publishing health status as RED to vMon.")
    144                 health = RED


/usr/lib/vmware-vmon/vmonEventPublisher.py 

     50 def get_machine_accnt_certandkey():
     51     """
     52     Get certificate and key for the machine account.
     53     """
     54     ks = VmKeyStore('VKS')
     55     ks.load('machine')
     56     cert = ks.get_certificate('machine')
     57     key = ks.get_key('machine')
     58     return (cert, key)


     61 def health_status_to_color(health_status):
     62     """
     63     Convert vMon health status string to corresponding color string.
     64     """
     65     if health_status == 'healthy' or health_status == 'green':
     66         return 'green'
     67     elif health_status == 'initialized' or health_status == 'yellow':
     68         return 'yellow'
     69     elif health_status == 'unhealthy' or health_status == 'red':
     70         return 'red'
     71     else:
     72         return 'unknown'


 

Resolution

- There is no major action to be taken. Any admin can keep track of the Alarm and the frequency. 
If the Alarm is triggered every 5 minutes when the vmon checks the health then please validate the status of the STS service. 
Command: service-control --status --all 

From VAMI: https://vcIP/FQDN:5480 > services
- If the service is stopped, then these need attention from VMware.  


- increase the timeout in /usr/lib/vmidentity/install/sts-vmon-health-checker.py    

    133                 if total_time > 90:

Additional Information

/usr/lib/vmidentity/install/sts-vmon-health-checker.py

35     logfile = "/var/log/vmware/sso/sts-health-status.log"

109 def get_sts_status():
    110     '''Checks status of STS service.
    111
    112     :return: STS service status as string.
    113     '''
    114
.............
    119             -Dvmware.log.dir=/var/log/vmware/sso/ \
    120             -XX:ErrorFile=/var/log/vmware/sso/hs_err_stsinstaller_pid%p.log \
    121             -XX:HeapDumpPath=/var/log/vmware/sso/ \
    122             com.vmware.identity.health.StsHealthStatus"


## In our case token can be one cause of the issue, as we noticed a machine account password change. 
 
    133                 if total_time > 60: <---Suggested timeout change
    134                     # To avoid superfluous logs, log only when it takes longer than 60
    135                     # seconds to acquire a token.
    136                     log.info("Acquired token succesfully in %f seconds." % total_time)
    137                 log.info("Publishing health status as GREEN to vMon.")
    138                 health = GREEN
    139             else:
    .. 
    142                 log.info("Failed to acquire token after %f seconds." % total_time)
    143                 log.error("Publishing health status as RED to vMon.")
    144                 health = RED