Occasionally, GET requests to /deployment/api/deployments and /deployment/api/requests served by the catalog-service fail with 503 errors. The Envoy access logs show the 503 UC (Upstream Connection) error:
"GET /deployment/api/deployments/<request-uuid> HTTP/2" 503 UC 0 95 2 - "<cliend-ip>" "curl/8.5.0" "<deployment-id>" "<host>" "<catalog-pod-ip>:8000"
Aria Automation 8.18.1
This is caused by a race condition between Envoy's upstream connection pool timeout and Tomcat's HTTP keep-alive timeout. Tomcat closes idle HTTP connections after 60 seconds (the default connectionTimeout), while Envoy's default upstream idle timeout is 1 hour. When a client request arrives just after Tomcat has closed the connection, but before Envoy has processed the TCP FIN/RST, Envoy reuses the stale connection from its pool and the request fails with a 503 UC (Upstream Connection failure) error. During this brief race window the connection appears valid in Envoy's pool but is already closed on the backend.
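The race can sometimes be reproduced by letting the upstream connection sit idle for close to Tomcat's 60-second timeout between requests. A hypothetical test loop; the <vra-fqdn> placeholder, the $ACCESS_TOKEN variable (a previously obtained API token), and the exact interval that triggers the failure are assumptions:
while true; do curl -sk -o /dev/null -w '%{http_code}\n' -H "Authorization: Bearer $ACCESS_TOKEN" https://<vra-fqdn>/deployment/api/deployments; sleep 60; done
Most iterations print a 2xx status; an iteration that lands just after Tomcat closes the idle connection intermittently prints 503.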
The fix adds an idleConnection timeout setting with a value of 55 seconds to the catalog-service ingress configuration, so that Envoy retires idle upstream connections before Tomcat's 60-second timeout closes them.
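For context, the resulting route configuration would look roughly like the excerpt below. This is an illustrative sketch assuming the chart renders a Contour HTTPProxy (suggested by the timeoutPolicy/idleConnection fields in the template); the surrounding route fields are assumptions based on the path and port seen in the access log above:
routes:
- conditions:
  - prefix: /deployment
  services:
  - name: catalog-service
    port: 8000
  timeoutPolicy:
    idleConnection: 55s
Because 55 seconds is shorter than Tomcat's 60-second keep-alive, Envoy drops idle upstream connections first and never attempts to reuse one that Tomcat has already closed.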
Workaround
Applies to Aria Automation 8.18.1 GA and subsequent 8.18.1 patch releases.
Install patch
1. Take simultaneous virtual machine snapshots (without memory) of all the Aria Automation nodes.
2. Run the following command on one of the Aria Automation nodes (see the notes after step 4 for how to inspect the script before running it):
base64 -d <<< '/Td6WFoAAATm1rRGBMDaBrIQIQEcAAAAAAAAAFT2j8DgCDEDUl0AEYhCRj30GGqmZ696n3KPnE7ymvKEhtgGsA0/WeoTDCiOuVzRXeicm55ozATJUVSREzKiXBRanFRM5pfndfx0W4uDIuAob8NR5CuOxV3aBXmKlbC5Qe0fwnhysZ/iTqnJ9liUo3yMkvx1Rb7SLHbaMDlTynLKoLEsKhSXLA6pXINmsrhaELQ44jijFHat7/ZIMBaf6AdKjtvm3yBrVZu05ksO1pGmFFAlVnsfckA6ipKAzSNJm2IZwrSkKeBqaB54z9ncswbIShoMKLlJIzX4RqsU25MYoN62h8ugSzs3ikcnKQQ01XyY1noqhCjZiNPxask3fsUJb+B07p5oYKI0RSKD4Vi4jmYjZI/9PpHxdzhQvz7XV6h0UAmaNSUtg8PN1PTmeQqgouxHg11SyycBIJpIptOgCH575GBcGwPDm/r7RNHlD152YNAWZY7Gbz+eKYypYxtvLtFJ2QAN7efkytlr8V40qytaMiq29J0+jzOZfzWRYhtGt0b+CrSJ+2sm5kcY9kc6KMD5Bfs4AMk6pXvKY4eOr2jq0Eoz88RATJ0qpOBadqmYqLPTyquXl31NJVcAhj9LfWqxFdNGRKbVabeScupbktknO6XNTnSPXUBBRZVuH3SMyELcNCzO1/rBUzpFthhChBog+8t8rTJ3OxH7a8zvsYuikSN+5Aff65+eQa17QpisUIQ8mM6CWaVQPcQ7NTDTa+OjmVZYILNTPUlyncpJavUXLMersM/32jBfo3NseqCmCTMBYYhwdb77+lvH91aB0vH2P/XAtvW1ekqlYBDoxoPjs3aCRpuyn9/oLkGXh1TmIYHcYJT7XV8khLeVZOzh0J022p6NLIV72+t38ewa8LTYQGUZD0UxsGO1TP55pafE0NcGVhksGgpwi0p3fw193D0RLgYvTL1BlxzkE6OEmsttCjrqT4l+elS18hLRnu2hSzAh9di6oz8cHCX1WwbasjiIamE0IyNzCAooX8qiM96NGTcb1BEuQmFB5Sz8pf9Cw8mnLBy3DlsuO0o5G/ZI8tObOxf3pWf7gIifSN1Cnb8y7ZXaM641yu1v9O/QnMLP7DU275zKnhgslrlw1PEcpRDvgIdvGLo768ti3FurRcTu4XfRDlrLsd66MAAAAFrd8yADa8UlAAH2BrIQAAAY/oMkscRn+wIAAAAABFla' | xz -d | bash -
3. To apply and persist the change, the services must be redeployed (requires ~30 minutes of downtime):
/opt/scripts/deploy.sh
4. Logs can be found at /var/log/vmware/prelude in the format patch-vcfapr913-<timestamp>.log
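Note: the patch payload from step 2 can be reviewed before execution by decoding it to a file instead of piping it directly to bash (the file path below is arbitrary):
base64 -d <<< '<base64-payload-from-step-2>' | xz -d > /tmp/patch-vcfapr913.sh
less /tmp/patch-vcfapr913.sh
bash /tmp/patch-vcfapr913.sh
While the redeploy in step 3 runs, progress can be followed using the log path from step 4:
tail -f /var/log/vmware/prelude/patch-vcfapr913-*.log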
Verification
To verify the patch has been applied:
1. Check the ingress.yaml file on all nodes:
vracli cluster exec -- bash -c 'grep -A 1 "timeoutPolicy:" /opt/charts/catalog-service/templates/ingress.yaml'
2. The output should show the following (three times in an HA setup):
- timeoutPolicy:
    idleConnection: 55s
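Optionally, confirm that the setting also reached the deployed ingress object. This is a hedged check; the HTTPProxy resource kind and the prelude namespace are assumptions based on the chart path above:
kubectl -n prelude get httpproxy -o yaml | grep -B 2 -A 1 'timeoutPolicy:'
The same idleConnection: 55s value should appear for the catalog-service routes.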