PCC instance VM flapping between states

Article ID: 294271

Products

VMware Tanzu Gemfire

Issue/Introduction

Symptoms:
During a PCC deployment, the server VMs (specifically the gemfire-server job) toggle between the failing, starting, and running states.

In addition, the logs directory for the GemFire server contains hundreds of log files, each less than 100 KB in size.
Note: Log files are generally on the order of megabytes.

Environment


Cause

This is caused by a bug in GemFire that is fixed in later versions.

In GemFire, locators hold cluster configuration details. When GemFire server processes join the cluster, they send a request to locators for cluster configuration.

Due to infrastructure issues, such as a power outage, the locators might get into a bad state in which they do not respond to cluster configuration requests from the servers. As a result, the servers enter an infinite restart (reconnect) loop.

With every reconnect, a GemFire server creates a new log file. Hence, you might see a lot (hundreds) of log files under /var/vcap/sys/log/gemfire-server/gemfire on the server VMs.
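To confirm this symptom on a server VM, one option is to count the small log files in that directory. The following is a minimal sketch, not an official diagnostic: count_small_logs is a hypothetical helper name, and the 100 KB default threshold is taken from the symptom description above.

```shell
# Hypothetical helper: count regular files under a directory that are
# smaller than a size threshold (default 100 KB, per the symptom above).
count_small_logs() {
  dir="$1"
  threshold_kb="${2:-100}"
  # "-size -Nc" matches files strictly smaller than N bytes.
  find "$dir" -type f -size "-$((threshold_kb * 1024))c" | wc -l | tr -d ' '
}

# Example, run on a gemfire-server VM:
#   count_small_logs /var/vcap/sys/log/gemfire-server/gemfire
```

A count in the hundreds is consistent with the reconnect loop described above, while a handful of small files is not conclusive on its own.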

Resolution

Although the gemfire-locator job on the locator VMs appears to be running consistently while the gemfire-server jobs switch between starting, running, and failing, no action is needed on the GemFire servers.

To get around this problem, stop all the gemfire-locator jobs on the locator VMs with monit stop, and then start them one by one.
Note: monit restart is not a substitute. There must be a point at which none of the gemfire-locator jobs are running.

Follow the instructions below:

1. Stop the gemfire-locator job on each locator VM. You can reference each locator by its zero-based index, for example locator/0, locator/1, and so on.
bosh -e ENV -d DEPLOYMENT ssh locator/0 
sudo su
monit stop gemfire-locator
2. Verify that the gemfire-locator jobs on all locator VMs are actually stopped by running bosh -e ENV -d DEPLOYMENT instances --ps. All the gemfire-locator jobs should be in the stopped state.

3. Now start the gemfire-locator jobs on each locator VM. 
bosh -e ENV -d DEPLOYMENT ssh locator/0 
sudo su
monit start gemfire-locator
After the above commands are executed, the gemfire-server jobs should stop flapping and settle in the running state. Run watch bosh -e ENV -d DEPLOYMENT instances --ps for a minute to make sure the gemfire-server jobs are no longer switching states.
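With several locators, the stop and start steps above can be scripted from a workstation instead of opening an SSH session to each VM. The following is a rough sketch under stated assumptions, not a supported tool: locator_monit and for_each_locator are hypothetical helper names, bosh_env and bosh_dep are placeholder variables for your environment and deployment, and the monit path is an assumption about the stemcell layout that you should check on your VMs. It relies on the -c (command) option of bosh ssh.

```shell
# Placeholders: set these to your BOSH environment alias and deployment name.
bosh_env="${bosh_env:-ENV}"
bosh_dep="${bosh_dep:-DEPLOYMENT}"
# Assumed monit location on the VM; adjust if it differs on your stemcell.
MONIT="${MONIT:-/var/vcap/bosh/bin/monit}"

# Run "monit <action> gemfire-locator" on locator/<index>.
locator_monit() {
  action="$1"
  index="$2"
  bosh -e "$bosh_env" -d "$bosh_dep" ssh "locator/$index" \
    -c "sudo $MONIT $action gemfire-locator"
}

# Apply an action to locator/0 .. locator/(count-1), one VM at a time.
for_each_locator() {
  action="$1"
  count="$2"
  i=0
  while [ "$i" -lt "$count" ]; do
    locator_monit "$action" "$i" || return 1
    i=$((i + 1))
  done
}

# Usage, assuming three locators:
#   for_each_locator stop 3
#   bosh -e "$bosh_env" -d "$bosh_dep" instances --ps   # confirm all are stopped
#   for_each_locator start 3
```

The sketch deliberately leaves the verification in step 2 as a manual check between the two loops, because monit stop returns before the process has necessarily exited and every gemfire-locator job must be stopped before any is started again.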