You have an application that has crashed. Running cf logs or cf events shows a cryptic error like the one below:
App instance exited with guid 26799e0c-d098-48b3-8cab-313a4c75474c payload: { "instance"=>"f1cd7363-4b62-4236-582c-21ced40d4994", "index"=>0, "reason"=>"CRASHED", "exit_description"=> "2 error(s) occurred: *2 error(s) occurred: * Exited with status 255 * cancelled * cancelled", "crash_count"=>1, "crash_timestamp"=>1450113930812542373, "version"=>"1c9b8366-56ce-4f97-911c-e3dcb740a40c"}
This article discusses the information present in the log output, what it means, and how you can use this to track down application problems.
This message indicates that your application has crashed. Application crashes can happen for a variety of reasons: the application has exceeded its memory limit, the application's health check has failed, or the application itself has terminated. In some cases, the crash log will tell you why the crash occurred, but in others, you may need to consult additional logs to narrow down the culprit. The remainder of this article discusses specific crash situations and their meanings.
The message that is generated when an application process exits is a good starting point for debugging an application crash as it tells us some important bits of information.
Unfortunately, the exit_description itself can be a bit cryptic, but here's a breakdown of the common messages you'll see and what they mean.
Your application has exited normally and has returned an exit code of 0. This often happens when a program finishes its execution and is complete. While this may seem like a normal thing, apps that run on PWS should never finish. A well-behaved application should continue to listen for HTTP requests or work indefinitely. If you see this error, check for cases where your application might finish without any errors.
There is nothing specific about an exit status of 4; it is a generic exit code that can mean many things. To narrow down the error, you'll need to look at the circumstances and log messages preceding the crash.
Look for the line Timed out after 60s: health check never passed. This message can occur when you are pushing or restarting an application, and it indicates that your application did not start fast enough. The platform places a timeout, which defaults to 60 seconds, on application startup. If the application does not start within this limit, you will see the error above and a crash with exit code 4. To work around this, you can increase the timeout, up to a maximum of 180 seconds, with the -t argument to cf push or with the timeout attribute in a manifest.yml file.
Ex: cf push -t 180 my-cool-app
The other option is to adjust your application to start up within the given time limit. There's no easy way to do this, and it might require application code changes. The list below contains some of the common causes of slow application startup.
Another potential error that you might see associated with an exit code of 4 is healthcheck failed: failure to make TCP connection: dial tcp <container-ip>:8080: getsockopt: connection refused. This error message indicates that the health check for the application has failed. It can happen on push or application restart if the application is not listening on the assigned port. When this happens, make sure that your application reads the assigned port, which is placed into the $PORT environment variable, and listens for requests on that port. The other possible startup issue that would generate this message is the application not starting up and listening on the port fast enough. See the previous paragraphs for details on resolving that issue.
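As an illustration of reading the assigned port, here is a minimal Python sketch. It assumes the platform sets PORT as described above. The names resolve_port, HealthyHandler, and main, the fallback of 8080 for local runs, and the choice to bind to 0.0.0.0 are all assumptions made for this example, not part of the platform.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def resolve_port(environ=os.environ, default=8080):
    """Return the port from $PORT, or a default for local runs (assumption)."""
    return int(environ.get("PORT", default))


class HealthyHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a small 200 response."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main():
    # Listen on the assigned port rather than a hard-coded one.
    server = HTTPServer(("0.0.0.0", resolve_port()), HealthyHandler)
    # serve_forever() blocks, so the process keeps listening and does not exit.
    server.serve_forever()
```

Calling main() as the process entry point starts the listener. The important part is that the port comes from the environment at startup.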
It is also possible for health checks to fail after an application has been running for a period of time. In this case, you will see the same error: healthcheck failed: failure to make TCP connection: dial tcp <container-ip>:8080: getsockopt: connection refused. The difference is that this means your application stopped listening for requests or did not respond to a health check request fast enough. When this happens, the recommendation is to check the load on your application. If you see high CPU usage this would be an indication that you need to scale your application. If you're not seeing high CPU usage then you may have a concurrency problem in the application. The recommendation for that scenario is to check for deadlock or blocking issues.
Lastly, it is possible to get an exit code of 64 which is a variation on exit code 4. Exit code 4 means that the TCP health check failed because a connection was actively refused. Exit code 64 means that a connection could not be established but instead of being refused, the connection attempt just timed out. The difference is typically that a connection is actively refused when nothing is listening on the port, whereas a timeout can occur when something is listening but not responding fast enough or when a firewall silently drops the packets.
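The difference between a refused connection and a timed-out one can be reproduced with a plain TCP connection attempt. The sketch below is not the platform's health check. It is a hypothetical helper, tcp_check, that classifies a single attempt the same way the paragraph above does.

```python
import socket


def tcp_check(host, port, timeout=1.0):
    """Return 'ok', 'refused', or 'timeout' for one TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        # Nothing is listening on the port: the connection is actively refused.
        return "refused"
    except socket.timeout:
        # No answer within the time limit, for example dropped packets
        # or a listener that is too busy to accept the connection.
        return "timeout"
```

Running tcp_check against your application's host and port from a nearby machine can help tell which of the two situations you are in.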
These error codes are the same as 4 and 64, except they indicate that an HTTP-based health check has failed rather than a TCP-based health check. Again, a 6 indicates that the connection was actively refused, whereas a 65 indicates that the connection timed out.
This message is telling us that the application has exceeded its memory quota and the system has killed it. If you're seeing this issue, you'll need to take a closer look at the memory usage of your application. Application logs may show additional information here. For example, Java applications may indicate there was an OutOfMemoryError. If application logs are not sufficient, a profiler or memory dump tool may be necessary to investigate the problem.
Your application has exited in error and has returned exit code X. This happens when the application encountered an error or exception, could not handle it, and exited. It's not possible to say from this what caused the error, but a well-written application should have logged additional information to indicate why it has exited. Look through your application logs prior to the time of the crash for more clues.
Note that the X in the error message will be an integer from 0 to 255 (an unsigned char). If your application returns a negative integer, it is converted to an unsigned char. This means that if X is 255, the app could have returned either -1 or 255, since both yield 255 when interpreted as an unsigned char.
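The conversion to an unsigned char amounts to keeping only the low 8 bits of the returned value. A small Python sketch of that arithmetic follows. The function name as_exit_status is hypothetical and used only for illustration.

```python
def as_exit_status(code):
    """Map an integer return value to the 0-255 range of an unsigned char."""
    # Keeping the low 8 bits is the same as taking the value modulo 256.
    return code & 0xFF


print(as_exit_status(-1))   # 255
print(as_exit_status(255))  # 255
print(as_exit_status(1))    # 1
```

Because -1 and 255 map to the same status, the reported value alone cannot tell you which one the application returned.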
When an app running on PWS crashes, PWS only gets a very small amount of information from the application. This information is conveyed through the error code above. In most cases, you'll need additional information to track down problems. Here are some common ways to do that.
You may also be curious about the format of the crash description itself.
For example:
"exit_description"=> "2 error(s) occurred: *2 error(s) occurred: * Exited with status 255 * cancelled * cancelled",
Why are there two errors? Why does it say "cancelled"?
This is because Diego actually starts two processes in every application container. There is your application process, and there is a very small process that allows you to SSH into the application container. When your application crashes, the SSH process is automatically cancelled. All of this output is then generated as Diego converts its internal error structure into a human-readable string. For all intents and purposes, you can ignore the lines with "cancelled" and the fact that it's reporting two errors. Instead, focus on the "Exited with status X" to determine the cause of the crash.
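If you are scanning many crash events, the status can be pulled out of the description with a short script. This is a minimal sketch that assumes the description keeps the "Exited with status X" wording shown in the example above. The function name exit_status is hypothetical.

```python
import re


def exit_status(exit_description):
    """Return the status from 'Exited with status X', or None if absent."""
    # The surrounding 'cancelled' entries and error counts are ignored.
    match = re.search(r"Exited with status (\d+)", exit_description)
    return int(match.group(1)) if match else None


sample = ("2 error(s) occurred: *2 error(s) occurred: "
          "* Exited with status 255 * cancelled * cancelled")
print(exit_status(sample))  # 255
```

A result of None means the description did not contain an exit status, in which case the other messages discussed in this article are the place to look.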