Consul Fails to Start During Upgrade in Cloud Foundry

Article ID: 297610


Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

The upgrade of Pivotal Cloud Foundry may fail due to Consul issues.

The upgrade fails with the following error message:

Started updating job consul_server-partition-260de9892e7d24109dfe > consul_server-partition-260de9892e7d24109dfe/0 (canary). 
Failed: `consul_server-partition-260de9892e7d24109dfe/0' is not running after update (00:05:57)
Error 400007: `consul_server-partition-260de9892e7d24109dfe/0' is not running after update
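
When the failure is buried in a long deploy log, the failing instance can be extracted mechanically so you know which VM to inspect. The following is a minimal sketch that assumes the message format shown above (a backtick, the job name, a slash, the index, and a closing apostrophe). The function name failing_instance is hypothetical and is not part of BOSH.

```python
import re

# Assumed format, taken from the error text in this article:
#   Error 400007: `<job>/<index>' is not running after update
_FAILED = re.compile(
    r"Error \d+: `(?P<job>[^/']+)/(?P<index>\d+)' is not running after update"
)

def failing_instance(deploy_output):
    """Return (job, index) for the first failing instance, or None if no match."""
    match = _FAILED.search(deploy_output)
    if match is None:
        return None
    return match.group("job"), int(match.group("index"))

if __name__ == "__main__":
    line = ("Error 400007: `consul_server-partition-260de9892e7d24109dfe/0' "
            "is not running after update")
    print(failing_instance(line))
```

The job name and index returned here identify the VM to log in to when following the Debugging Instructions.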

 

Environment


Cause

This is a general error message. It indicates a problem with the software running on the VM. In this article the VM in question is consul_server, so the error means that the Consul software failed to start. The message alone does not identify the specific problem; see Debugging Instructions below for how to investigate further.

 

Resolution

In many cases, Consul server failures in PCF can be corrected by wiping the data from the nodes and resetting them. This process gives the cluster a fresh start, and because no persistent data is stored on the Consul server, the operation is harmless.

Because this process is quick, non-destructive, and has a high success rate for fixing Consul problems, Pivotal recommends trying it first, before doing any additional debugging.

To perform this process, follow the instructions in the "Failed Deploys, Upgrades, Split-Brain Scenarios, etc." section at the following link:

https://github.com/cloudfoundry-incubator/consul-release/tree/master#failure-recovery

If you need assistance with these instructions, please open a support ticket.  If performing the steps at the link above does not help, please proceed to the next section.


Debugging Instructions

When this problem occurs, you can debug further by performing the following steps:

  • Capture the logs from the failing VM. You can do this through Ops Manager on the Status page for the Elastic Runtime tile, by running bosh logs, or by manually copying the /var/vcap/sys/log directory off the VM.
  • SSH into the failing VM and run monit summary as the root user. This command lists the processes deployed to the VM and indicates which one is not running properly.
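
If you save the monit summary output to a file, a short script can pick out the processes that are not running. The following is a minimal sketch, not a supported tool. It assumes that each process appears on a line of the form Process 'name' followed by a status, which is the usual layout of monit summary output. The sample text and the process names in it are illustrative only and were not captured from a real VM.

```python
import re

# Assumed line format: Process '<name>'   <status>
# Lines that do not start with "Process" (header, System entries) are ignored.
_LINE = re.compile(r"^Process\s+'(?P<name>[^']+)'\s+(?P<status>.+?)\s*$")

def unhealthy_processes(summary_text):
    """Return (name, status) pairs for processes whose status is not 'running'."""
    problems = []
    for line in summary_text.splitlines():
        match = _LINE.match(line.strip())
        if match and match.group("status").lower() != "running":
            problems.append((match.group("name"), match.group("status")))
    return problems

# Illustrative sample only; real output will differ.
SAMPLE = """\
The Monit daemon 5.2.5 uptime: 5m

Process 'consul_agent'              not monitored
Process 'metron_agent'              running
System 'system_localhost'           running
"""

if __name__ == "__main__":
    for name, status in unhealthy_processes(SAMPLE):
        print(f"{name}: {status}")
```

Any process reported by this script is a good starting point when reading the captured logs, since the log subdirectory for that process usually contains the startup error.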

Once you have captured the information above, you can review it to better understand the problem, or open a support ticket and Pivotal Support will help diagnose the issue.