We have a Python script that calls a CARA API (run-release) to run deployments, and it intermittently receives an odd error. Some deployments return a "Connection reset by peer" error; others run or rerun normally. The deployment itself continues in CARA unaffected. Only the API call is affected, because it receives an early failure response.
It doesn't seem to be application-specific or architecture-specific. The error consistently occurs after 5 minutes even though the API call being made specifies a timeout significantly larger than 5 minutes.
socket.error: [Errno 104] Connection reset by peer
The problem can be reproduced if we pause a running deployment and wait for 5 minutes.
The management server's nolio_dm_all.log file shows the following error at the same time:
2021-02-25T10:48:52.179-06:00 [http-nio-8080-exec-6] ERROR (com.nolio.releasecenter.controllers.RCApiController:1917) - RC API Controller method error occurred.
org.apache.catalina.connector.ClientAbortException: java.io.IOException: An existing connection was forcibly closed by the remote host
The ClientAbortException (wrapping a java.io.IOException) and the message "An existing connection was forcibly closed by the remote host" indicate a network or client (in this case Python) problem. However, the "Connection reset by peer" error is not typically associated with client timeout settings, so a cause at the network level makes more sense.
Release : 6.7
Component : CA Release Automation Data Management Server
The problem was resolved after identifying a change recently made to a Load Balancer's profile where it closed/reset connections that were open for longer than 5 minutes. The profile was updated to close the connection after a longer period of time. However, the Network team did raise a point that running the deployment asynchronously (vs. the longer synchronous calls with a large timeout) and then periodically checking the status might be better.
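The asynchronous approach suggested by the Network team could look roughly like the sketch below. It is only an illustration: `start_release` and `get_status` are hypothetical callables that you would implement with your own CARA API requests, and the status names in `done_states` are placeholders, not values confirmed by the CARA documentation. The point is that each HTTP request is short, so no single connection stays open long enough to hit a 5-minute limit on the Load Balancer.

```python
import time


def wait_for_release(start_release, get_status, poll_interval=30.0,
                     max_wait=3600.0, done_states=("SUCCEEDED", "FAILED"),
                     sleep=time.sleep, clock=time.monotonic):
    """Start a deployment, then poll its status with short requests.

    start_release and get_status are hypothetical callables supplied by
    the caller; done_states holds placeholder status names.
    """
    # Assumed to return quickly with an identifier for the deployment.
    release_id = start_release()
    deadline = clock() + max_wait
    while True:
        # Each status check is its own short-lived request.
        status = get_status(release_id)
        if status in done_states:
            return release_id, status
        if clock() >= deadline:
            raise TimeoutError(
                "release %s still %r after %s seconds"
                % (release_id, status, max_wait))
        sleep(poll_interval)
```

Choose a `poll_interval` well below the idle limit of any device between the client and the management server.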
This article does not suggest that all "ClientAbortException: java.io.IOException: An existing connection was forcibly closed by the remote host" errors are the result of a Network/LoadBalancer/Firewall policy issue. In this case it was extremely helpful/telling to know the error on the client side. Having the client error gives the full picture.
If the problem can be reproduced and you need to troubleshoot it, try making the API call from other clients (such as curl, Postman, Bamboo, Jenkins, TeamCity, TFS, etc.) and see whether they also receive this error. Running these tests from different areas of your network may help isolate the conditions under which the error occurs.
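When comparing clients or network locations, it helps to record how long each call survived before it failed. The helper below is a minimal sketch for that, assuming Python 3; `time_call` is a hypothetical name, and `call` stands for whatever function in your script issues the run-release request.

```python
import time


def time_call(call, clock=time.monotonic):
    """Run call() and return (elapsed_seconds, error_or_None)."""
    start = clock()
    try:
        call()
        return clock() - start, None
    except OSError as exc:
        # In Python 3, ConnectionResetError (errno 104) is an OSError subclass.
        return clock() - start, exc
```

If the failures cluster around 300 seconds regardless of client, an idle or maximum-lifetime limit on an intermediate device is a reasonable suspect.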