
Kubernetes On-Premise Location Engines Not Launching


Article ID: 202083


Updated On:




Engines are not launching on the OPL (On-Premise Location) cluster. On the detailed Boot Summary page, one or more of the engines never leave the Booting status.
Here is an example of the messages found in the crane log:

{"asctime": "2020-10-20 18:30:46,363", "funcName": "handle_command", "levelname": "WARNING", "pathname": "agent/", "message": "Executing command!"}

{"asctime": "2020-10-20 18:30:46,363", "funcName": "run_containers", "levelname": "INFO", "pathname": "agent/", "message": "Performing run containers commands"}

{"asctime": "2020-10-20 18:30:46,364", "funcName": "run_containers_serially", "levelname": "INFO", "pathname": "agent/", "message": "Creating containers serially"}

{"asctime": "2020-10-20 18:30:46,364", "funcName": "run_containers_serially", "levelname": "INFO", "pathname": "agent/", "message": "Performing run container (1 / 1)"}

{"asctime": "2020-10-20 18:30:46,364", "funcName": "run_container", "levelname": "INFO", "pathname": "agent/", "message": "Creating container"}

{"asctime": "2020-10-20 18:30:46,697", "funcName": "__retry_internal", "levelname": "WARNING", "pathname": "site-packages/retry/", "message": ", retrying in 1 seconds..."}

{"asctime": "2020-10-20 18:30:47,704", "funcName": "__retry_internal", "levelname": "WARNING", "pathname": "site-packages/retry/", "message": ", retrying in 1 seconds..."}

{"asctime": "2020-10-20 18:30:48,710", "funcName": "__retry_internal", "levelname": "WARNING", "pathname": "site-packages/retry/", "message": ", retrying in 1 seconds..."}

{"asctime": "2020-10-20 18:30:49,717", "funcName": "__retry_internal", "levelname": "WARNING", "pathname": "site-packages/retry/", "message": ", retrying in 1 seconds..."}

{"asctime": "2020-10-20 18:30:50,497", "funcName": "update_status", "levelname": "INFO", "pathname": "agent/", "message": "Logging agent status"}

{"asctime": "2020-10-20 18:30:50,727", "funcName": "__retry_internal", "levelname": "WARNING", "pathname": "site-packages/retry/", "message": ", retrying in 1 seconds..."}

{"asctime": "2020-10-20 18:30:51,735", "funcName": "__retry_internal", "levelname": "WARNING", "pathname": "site-packages/retry/", "message": ", retrying in 1 seconds..."}

{"asctime": "2020-10-20 18:30:51,817", "funcName": "mainloop", "levelname": "INFO", "pathname": "agent/", "message": "Getting command"}

{"asctime": "2020-10-20 18:30:52,743", "funcName": "__retry_internal", "levelname": "WARNING", "pathname": "site-packages/retry/", "message": ", retrying in 1 seconds..."}

{"asctime": "2020-10-20 18:30:53,750", "funcName": "__retry_internal", "levelname": "WARNING", "pathname": "site-packages/retry/", "message": ", retrying in 1 seconds..."}

{"asctime": "2020-10-20 18:30:54,759", "funcName": "__retry_internal", "levelname": "WARNING", "pathname": "site-packages/retry/", "message": ", retrying in 1 seconds..."}

{"asctime": "2020-10-20 18:30:55,767", "funcName": "create_deployment", "levelname": "ERROR", "pathname": "agent/container_management/kubernetes/", "message": "No pods were created for the deployment: r-v4-5f8f2cd1555f2-0-0-c"}

{"asctime": "2020-10-20 18:30:55,768", "funcName": "start_container", "levelname": "INFO", "pathname": "agent/", "message": "Starting container r-v4-5f8f2cd1555f2-0-0-c"}
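
To locate this failure in a long crane log, the lines can be filtered for the deployment error. The following is a minimal sketch, assuming the log holds one JSON object per line as in the excerpt above. The function name `find_pod_creation_failures` is illustrative and not part of the product; the sample lines are copied from the excerpt.

```python
import json


def find_pod_creation_failures(lines):
    """Return (timestamp, message) pairs for 'No pods were created' errors."""
    failures = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(record, dict):
            continue
        message = record.get("message", "")
        if record.get("levelname") == "ERROR" and "No pods were created" in message:
            failures.append((record.get("asctime"), message))
    return failures


# Sample lines taken from the log excerpt above.
sample = [
    '{"asctime": "2020-10-20 18:30:54,759", "funcName": "__retry_internal", '
    '"levelname": "WARNING", "pathname": "site-packages/retry/", '
    '"message": ", retrying in 1 seconds..."}',
    '{"asctime": "2020-10-20 18:30:55,767", "funcName": "create_deployment", '
    '"levelname": "ERROR", "pathname": "agent/container_management/kubernetes/", '
    '"message": "No pods were created for the deployment: r-v4-5f8f2cd1555f2-0-0-c"}',
]

for timestamp, message in find_pod_creation_failures(sample):
    print(timestamp, message)
```

In practice the `lines` argument could be an open file handle for the crane log, since iterating over a file yields one line at a time.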

The OPL is configured to allow 100 engines on each agent.
CPU and memory usage on the Kubernetes cluster is well under the configured maximums.


A cluster upgrade reset the pod quota from a value greater than 100 to 20. On this cluster 3 pods are always running (1 bzm-app pod and 2 crane pods), which leaves at most 17 pods for engines. Any test configured to run up to 17 engines succeeded if no other tests were running on the cluster. Any test that required more engines than the number of pods still available under the quota failed to run.

If you add VERBOSE=TRUE to the crane deployment environment section and regenerate the agent, the following additional messages will appear in the crane log file:

First, an indication that each of the configured taurus-cloud containers is being started (one for each engine):

{"asctime": "2020-10-21 17:28:17,432", "funcName": "log", "levelname": "INFO", "pathname": "agent/container_management/cleaner/", "message": "[cleaner:run] {\"message\": \"Adding container r-v4-5f9051f410381-0-0-c-67d8965567-6\"}"}

And then after the verification step is logged:

{"asctime": "2020-10-21 17:28:18,131", "funcName": "verification", "levelname": "INFO",  .....

the pods are deleted immediately, as indicated by lines similar to the following:

{"asctime": "2020-10-21 17:28:25,150", "funcName": "process_command", "levelname": "INFO", "pathname": "agent/", "message": "Thread: Parallel handler #7. Start handle command: removeContainer"}


Release : SAAS



The Kubernetes pod quota must be set to a value large enough to allow the maximum number of engines configured for each agent in the OPL, plus some extra to cover the pods that are always running.
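
A rough way to size the quota is sketched below. It assumes one pod per engine and treats the agent count and the optional headroom as inputs you supply; the function name `minimum_pod_quota` and the example of 2 agents are assumptions for illustration, not values confirmed by this article.

```python
def minimum_pod_quota(engines_per_agent, agent_count, always_running_pods, headroom=0):
    """Smallest pod quota that lets every agent run its maximum number of engines."""
    if min(engines_per_agent, agent_count, always_running_pods, headroom) < 0:
        raise ValueError("all inputs must be non-negative")
    return engines_per_agent * agent_count + always_running_pods + headroom


# 100 engines per agent (from this case), 2 agents (assumed),
# and 3 always-running pods (1 bzm-app pod and 2 crane pods).
print(minimum_pod_quota(100, 2, 3))  # 203
```

Compare the result with the pod limit currently set on the cluster and raise the quota if it is lower.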