etc.
The access from your client to the app in the container involves multiple components and multiple points where the failure could occur: client (app or browser) > Load Balancer (LB) > Gorouter > Diego Cell > Envoy Proxy > app in container
This article covers how to troubleshoot and narrow down the problem scope.
To start, access the app from the nearest place (inside of container), then move outward from the container step by step, until you identify at which stage the problem is introduced.
For example, in this article, the problematic app is named “test” and deployed with route “test
.mydomain
.com
”.
Access the app inside of the container. This step helps to break down the problem and identify whether it is an app issue or platform issue. The test cases in this article are with `curl` GET
request at /PATH
, please replace HTTP method / path, and add necessary header / data accordingly for your own app.
1. Use ssh to get into the app container:
$ cf ssh test
$ curl -v http://localhost:8080/PATH
$ curl -v -k https://localhost:61001/PATH
If both fail, then the application is having problems. If just step #3 fails, then there is an issue with Envoy.
If these commands all succeed, then the app is working and the problem is elsewhere. Continue troubleshooting.
$ cf curl /v2/apps/$(cf app test --guid)/stats { "0": { "state": "RUNNING", "isolation_segment": null, "stats": { "name": "test", "uris": [ "test.mydomain.com" ], "host": "10.197.2.25", "port": 61072, ... } } } }
61072/61078/61074
are HTTP/HTTPS/SSH
external ports, 8080/61001/61002
are app/envoy/sshd
ports on the container network.$ cf ssh test -c 'echo $CF_INSTANCE_PORTS' [{"external":61072,"internal":8080,"external_tls_proxy":61078,"internal_tls_proxy":61001},{"external":61074,"internal":2222,"external_tls_proxy":61080,"internal_tls_proxy":61002}]
In case `cf ssh
` is being disabled on the platform, please locate the Diego Cell (with IP 10.197.2.25
) and check its NAT rules. In below example, export port 61078
maps container app HTTP port 8080
, and external port 61080
maps container envoy HTTPS port.
$ bosh -d DEPLOYMENT ssh diego_cell/<INDEX> -c "sudo iptables -L -t nat" ... DNAT tcp -- anywhere b649d48b-97f7-4ba4-8b25-d18d33d9fec9.diego-cell.infra.cf-f611e346a9cfc876c06f.bosh tcp dpt:61072 to:10.255.128.38:8080 DNAT tcp -- anywhere b649d48b-97f7-4ba4-8b25-d18d33d9fec9.diego-cell.infra.cf-f611e346a9cfc876c06f.bosh tcp dpt:61074 to:10.255.128.38:2222 DNAT tcp -- anywhere b649d48b-97f7-4ba4-8b25-d18d33d9fec9.diego-cell.infra.cf-f611e346a9cfc876c06f.bosh tcp dpt:61078 to:10.255.128.38:61001 DNAT tcp -- anywhere b649d48b-97f7-4ba4-8b25-d18d33d9fec9.diego-cell.infra.cf-f611e346a9cfc876c06f.bosh tcp dpt:61080 to:10.255.128.38:61002 ...
3. Then `bosh ssh
` into the Diego Cell which hosts the container:
$ curl http://localhost:61072/PATH $ curl -k -v https://localhost:61078/PATH
$ curl -v http://10.197.2.25:61072/PATH $ curl -k -v https://10.197.2.25:61078/PATH
If these commands all succeed, then port forwarding on the Cell is working OK & the problem is elsewhere. Continue troubleshooting.
If the first command fails, there is an issue forwarding plain HTTP traffic to the app.
If the second fails, there is an issue forwarding HTTPS traffic onto Envoy.
If the app can be accessed successfully with the above steps, then the problem is neither with app container and not with Diego Cell. The next step is to test against Gorouter endpoint with the Host attribute.
Use bosh ssh
to access one of your Gorouter instance, it doesn't matter which one. Then run this command:
$ curl -k -v -H “Host: test.mydomain.com” https://GOROUTER_IP/PATH
If this command fails, then there is an issue with Gorouter or it's routing tables. See the section below on how to dump the Gorouter routing tables and how to review the Gorouter logs.
If there are no problems with above steps, the problem could be with load balancer or something else. These are outside of the platform's scope and you need to talk with your Network or Load Balancer Admin to get assistance testing from these additional hops in the request path.
502 Bad Gateway
usually occurs in two scenarios
900
s), and Gorouter returns 502
to the client.503 Service Unavailable
one scenario is due to time gap between Gorouter and Diego cell.
404 Not Found
There are two types of "not found" responses:
404
with response body "404 Not Found: Requested route ('mydomain.com') does not exist.
"500 Internal Server Error
1. Fetch the routing table, on any Gorouter VM.
$ /var/vcap/jobs/gorouter/bin/retrieve-local-routes | jq . ... "test.mydomain.com": [ { "address": "10.197.2.25:61078", "tls": true, "ttl": 120, "tags": { "component": "route-emitter", "instance_id": "0", "process_id": "6b817f74-98d2-4602-a695-7fe5290ffb61", "process_instance_id": "b57df7c3-9b4c-4e01-7cd7-14b1", "process_type": "web", "source_id": "6b817f74-98d2-4602-a695-7fe5290ffb61" }, "private_instance_id": "b57df7c3-9b4c-4e01-7cd7-14b1", "server_cert_domain_san": "b57df7c3-9b4c-4e01-7cd7-14b1" } ], ...
2. Test connection from Gorouter to container:
# copy ca_certs content from /var/vcap/jobs/gorouter/config/gorouter.yml, format it into /tmp/ca.pem # this will fail because certificate subject name doesn't match, but it can tell if server certificate verification succeeds or not $ curl -v https://10.197.2.25:61078 --cacert /tmp/ca.pem ... * server certificate verification OK ... curl: (51) SSL: certificate subject name (1f20ff49-0d5e-4694-6d7c-4667) does not match target host name '10.213.60.22' # test with -k to skip SSL validation, this should succeed $ curl -v -k https://10.197.2.25:61078
3. Read the Goroute Logs:
2020/03/18 12:37:12 http: proxy error: x509: certificate is not valid for any names, but wanted to match a8a190e7-01c7-419d-4386-a1fb
The Gorouter connects to the backend container and talks to the envoy proxy over TLS. To avoid connecting to wrong containers, Gorouter verifies that the envoy certificate SAN matches the SAN in the routing table. If it doesn’t match, Gorouter will delete the route from its routing table because the backend container is for a different app.
test.mydomain.com - [2020-03-31T23:41:35.633829784Z] "GET /xxxx?xxxxxx=5000 HTTP/1.1" 200 0 107 "-" "Typhoeus - https://github.com/typhoeus/typhoeus" "10.197.2.250:41198" "10.197.2.18:9024" x_forwarded_for:"10.197.2.25, 10.197.2.250" x_forwarded_proto:"https" vcap_request_id:"e9550647-11aa-4c79-592a-5fb12d9eddf5" response_time:0.010781 gorouter_time:0.000211 app_id:"-" app_index:"-" x_b3_traceid:"2e7cbc3649835a0c" x_b3_spanid:"2e7cbc3649835a0c" x_b3_parentspanid:"-" b3:"2e7cbc3649835a0c-2e7cbc3649835a0c"
The response_time
in Gorouter access logs is the app response time, it's vital information to troubleshoot app slowness, if the slowness is only with certain apps, the problem is mostly caused by the app itself rather than the platform.
The gorouter_time
is identical to round-trip time, app response_time, which is the time consumed on Gorouter itself. Please refer to this article - Troubleshooting Slow Requests for more details.