Apps Manager slowness caused by "could not be resolved (110: Operation timed out)" errors

Article ID: 438451

Products

VMware Tanzu Platform Core

Issue/Introduction

Apps Manager may appear slow or laggy when a single Apps Manager instance enters a bad DNS resolution state. In this condition, the affected instance logs repeated nginx name resolution timeout errors for internal foundation endpoints, while other Apps Manager instances may remain healthy.

The issue may be isolated to one instance, for example APP/REV/1/PROC/WEB/5, and log entries on the affected instance show errors similar to:

[error] 143#0: *123456 <hostname> could not be resolved (110: Operation timed out)

Affected hostnames may include:

api.<system-domain>
log-cache.<system-domain>
app-usage.<system-domain>

In some cases, manual connectivity tests such as curl from within the affected container may still succeed, and elevated CPU usage may also be observed on the affected instance.

Environment

VMware Tanzu Application Service

Cause

The likely cause is that the internal nginx resolver in a specific Apps Manager instance becomes unresponsive or hangs. The nginx process then fails to resolve the system domain hostnames it needs to proxy requests, which produces 30-second timeouts (110: Operation timed out) and high CPU usage as the process works through the backlog of stalled requests.

Resolution

To restore healthy behavior, the affected Apps Manager instance must be restarted. This clears the Nginx runtime state and re-initializes the internal resolver.

Identify the Faulty Instance

1. Review the Apps Manager logs to identify which specific instance index is reporting the `110: Operation timed out` errors.
2. Note the instance index (e.g., `5`).
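
The log review above can be sketched as a small shell helper that counts resolver timeout errors per instance index. This is a minimal sketch, not an official tool: the function name `count_resolver_timeouts` is hypothetical, and it assumes log lines carry a source tag such as `[APP/REV/1/PROC/WEB/5]` or `[APP/PROC/WEB/5]`, as in `cf logs` output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count nginx resolver timeout errors per instance index.
# Reads log lines on stdin and prints "<instance_index> <error_count>",
# highest count first. Assumes each line carries a tag like [APP/.../WEB/<index>].
count_resolver_timeouts() {
  grep -F 'could not be resolved (110: Operation timed out)' \
    | sed -n 's#.*\[APP/[^]]*/\([0-9][0-9]*\)\].*#\1#p' \
    | sort -n \
    | uniq -c \
    | awk '{ print $2, $1 }' \
    | sort -k2,2nr
}

# Example usage (assumes the cf CLI is logged in and targets the org and
# space that host Apps Manager):
#   cf logs apps-manager-js-blue --recent | count_resolver_timeouts
```

An instance index whose count is far higher than the others is the likely candidate for a restart. Recent logs cover only a limited window, so a log aggregation platform may give a fuller picture if one is available.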

Restart the Instance

cf restart-app-instance apps-manager-js-blue <instance_index>

*Note: Replace `apps-manager-js-blue` (or `green`) with the correct app name and `<instance_index>` with the index identified in the logs.*
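
To reduce the chance of restarting the wrong instance, the restart command can be wrapped in a helper that validates its arguments first. This is a sketch under stated assumptions: the function name `restart_apps_manager_instance` and the `DRY_RUN` variable are hypothetical, and the cf CLI is assumed to be logged in and targeting the org and space that host Apps Manager.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around `cf restart-app-instance`.
# Usage: restart_apps_manager_instance <app_name> <instance_index>
# Set DRY_RUN=1 to print the command instead of running it.
restart_apps_manager_instance() {
  local app="$1" index="$2"

  if [ -z "$app" ]; then
    echo "app name is required" >&2
    return 2
  fi

  # Reject empty or non-numeric instance indexes.
  case "$index" in
    ''|*[!0-9]*)
      echo "instance index must be a non-negative integer: '$index'" >&2
      return 2
      ;;
  esac

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "cf restart-app-instance $app $index"
  else
    cf restart-app-instance "$app" "$index"
  fi
}

# Example usage:
#   DRY_RUN=1 restart_apps_manager_instance apps-manager-js-blue 5
```

Running with `DRY_RUN=1` first shows exactly which app name and index would be restarted.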

Additional Information

If the issue recurs, collect troubleshooting data before restarting the instance:

1. Identify the Diego Cell hosting the affected replica.
2. On that Diego Cell, collect:
   - bosh-dns logs
   - rep logs
   - garden logs
3. If NSX-T is in scope, collect network evidence before restart, including:
   - DNS allow rules
   - packet traces
   - Traceflow results
   - any host-specific firewall realization evidence
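
The Diego Cell log collection above can be sketched with the BOSH CLI. This is a minimal sketch, assuming BOSH CLI access to the foundation: the function name `collect_cell_logs` and the `DRY_RUN` variable are hypothetical, and the deployment name and Diego Cell instance are placeholders you must supply. `bosh logs` downloads logs for all jobs on the instance, which should include bosh-dns, rep, and garden.

```shell
#!/usr/bin/env bash
# Hypothetical helper: download the logs of one Diego Cell with `bosh logs`.
# Usage: collect_cell_logs <deployment> <instance_group/instance_id> [output_dir]
# Set DRY_RUN=1 to print the command instead of running it.
collect_cell_logs() {
  local deployment="$1" cell="$2" outdir="${3:-.}"

  if [ -z "$deployment" ] || [ -z "$cell" ]; then
    echo "usage: collect_cell_logs <deployment> <instance_group/instance_id> [output_dir]" >&2
    return 2
  fi

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "bosh -d $deployment logs $cell --dir $outdir"
  else
    bosh -d "$deployment" logs "$cell" --dir "$outdir"
  fi
}

# Example usage (placeholders, not real names):
#   collect_cell_logs <deployment_name> diego_cell/<instance_id> /tmp/apps-manager-dns
#
# One possible way to find the hosting cell, stated as an assumption: the
# process stats for the app report a host address per instance index, which
# can be matched against the output of `bosh -d <deployment_name> vms`.
#   cf curl "/v3/apps/$(cf app apps-manager-js-blue --guid)/processes/web/stats"
```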
