A customer performed an upgrade, but the smoke-tests errand failed for both Tanzu Application Service (TAS) and Isolation Segment (ISO).
The errand got stuck at:
Retrieving logs for app SMOKES-APP-xxxxxxxx-xxxx ...
Sample error:
• Failure [55.679 seconds]
Isolation Segment Workflow
/var/vcap/packages/smoke-tests/src/isolation_segment_test.go:16
Default isolation segment - Compute isolation disabled.
/var/vcap/packages/smoke-tests/src/isolation_segment_test.go:181
can be pushed and connected to the `default` isolation segment [It]
/var/vcap/packages/smoke-tests/src/isolation_segment_test.go:207
No future change is possible. Bailing out early after 0.123s.
Unable to print logs for app.
Got stuck at:
Retrieving logs for app SMOKES-APP-xxxxxxxx-xxxx in org system / space SMOKES-SPACE-xxxxxxxx-xxxx as smoke_tests...
Waiting for:
(?i)\[(app).*/\d+\]
/var/vcap/packages/smoke-tests/src/isolation_segment_test.go:231
------------------------------
Curiously, the smoke-tests succeeded when executed from a jumpbox in the customer's environment.
To verify that apps were generating logs, we deployed test apps to both TAS and ISO and ran "cf logs APPNAME" to tail their logs. We restarted each app while "cf logs" was tailing, to confirm that new log entries were being created. We noticed that some requests issued on behalf of the app carried a negative "start_time" parameter, i.e., a point in time before the Unix epoch.
Retrieving logs for app PCF-spring-music-iso1 in org pubtest / space dev as smoke_tests...
REQUEST: [2025-02-18T16:56:11-06:00]
GET /api/v1/read/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx?descending=true&limit=1&start_time=-6795364578871345152 HTTP/1.1
Host: log-cache.system-domain.example.com
Authorization: [PRIVATE DATA HIDDEN]
RESPONSE: [2025-02-18T16:56:11-06:00]
HTTP/1.1 200 OK
Grpc-Metadata-Content-Type: application/grpc
X-Vcap-Request-Id: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Content-Length: 978
Content-Type: application/json
Date: Tue, 18 Feb 2025 22:54:18 GMT
The failure to retrieve logs is a known issue that occurs when the clock of the machine running the "cf logs" command is behind the actual time:
https://github.com/cloudfoundry/cli/issues/1929
To solve this, fix the time synchronization / NTP configuration in your foundation.
Make sure that the value for "NTP Server" is correct in the BOSH Director tile, Director Config tab:
Apply changes to all relevant tiles, selecting the "Recreate VMs deployed by the BOSH Director" checkbox:
To confirm that the configured value for the NTP server has been populated, you can inspect this file on each VM:
/etc/chrony/sources.d/bosh.sources
To check the time sync status of VMs, run "timedatectl status" over "bosh ssh", substituting the desired deployment and job names:
Examples:
bosh -d cf-xxxxxxxxxxxxxxxxxxxx ssh clock_global -c "timedatectl status"
bosh -d cf-xxxxxxxxxxxxxxxxxxxx ssh diego_cell -c "timedatectl status"
bosh -d p-isolation-segment-xxxxxxxxxxxxxxxxxxxx ssh isolated_router -c "timedatectl status"
bosh -d p-isolation-segment-xxxxxxxxxxxxxxxxxxxx ssh isolated_diego_cell -c "timedatectl status"