Gemfire Broker "upgrade-all-service-instances" errand fails with "Response exceeded maximum allowed length" error

Products

VMware Tanzu Gemfire

Issue/Introduction

This article describes workarounds for the prime-cluster-for-pcc errand failing on a service instance while the upgrade-all-service-instances errand runs against a Gemfire HA cluster deployment.

The root cause is that the log output of the prime-cluster-for-pcc errand exceeds the maximum allowed response length, which is set at 1MB.

SYMPTOM OF THE ISSUE
Running the upgrade-all-service-instances errand on the Gemfire broker deployment:

bosh -d $GEMFIRE_BROKER_DEPLOYMENT run-errand upgrade-all-service-instances

The upgrade-all-service-instances errand fails when it runs the prime-cluster-for-pcc errand on one of the VMs in a Gemfire cluster service instance. In this example, the failing service instance deployment is service-instance_587e5c5c-a103-4bc7-8838-3b45e5c15f82.

2210962 error Wed Apr 5 15:49:41 UTC 2023 Wed Apr 5 15:50:29 UTC 2023 p-cloudcache-91f0a7c53bd58bf88c91 service-instance_587e5c5c-a103-4bc7-8838-3b45e5c15f82 run errand prime-cluster-for-pcc from deployment service-instance_587e5c5c-a103-4bc7-8838-3b45e5c15f82 Response exceeded maximum allowed length
[upgrade-all-service-instances] 2023/04/05 18:48:59.134046 [upgrade-all] FINISHED PROCESSING Status: FAILED; Summary: Number of successful operations: 9; Number of skipped operations: 0; Number of service instance orphans detected: 0; Number of deleted instances before operation could happen: 0; Number of busy instances which could not be processed: 0; Number of service instances that failed to process: 1 [587e5c5c-a103-4bc7-8838-3b45e5c15f82]
[upgrade-all-service-instances] 2023/04/05 18:48:59.134053 [587e5c5c-a103-4bc7-8838-3b45e5c15f82] Operation failed: bosh task id 2211658: Failed for bosh task: 2211661
Stderr Error: failed to run job-process: exit status 1 (exit status 1)
1 errand(s)
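
The failure summary above lists the GUIDs of the service instances that could not be processed, and each GUID maps to a BOSH deployment named service-instance_<GUID>. As a minimal sketch, assuming the errand output has been saved to a file (the helper name failed_instance_guids and the file name errand.log are hypothetical), the GUIDs can be pulled out like this:

```shell
# Print the GUIDs of service instances that the errand summary reports as failed.
# Reads errand output on stdin. The summary wording is taken from the log above
# and may differ in other product versions.
failed_instance_guids() {
  grep -oE 'failed to process: [0-9]+ \[[^]]*\]' \
    | grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' \
    | sort -u
}

# Example (errand.log is a hypothetical file holding the saved errand output):
#   failed_instance_guids < errand.log
```

Prefixing each printed GUID with service-instance_ gives the value to use for $GEMFIRE_SERVICE_INSTANCE in the workaround commands.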

Environment

Product Version: 1.14

Resolution

WORKAROUND
The prime-cluster-for-pcc errand fails because its log output exceeds the 1MB limit.

Two workarounds are available:

Option 1: Silencing the log output for the prime-cluster-for-pcc errand
Option 2: Deleting unused Gemfire regions

OPTION 1
This option removes the debug flag from the initialization script and comments out the region response logging in the rebalance script. Use it if you cannot delete unused Gemfire regions.

STEP 1
Turn off debugging for the initialization script

bosh -d $GEMFIRE_SERVICE_INSTANCE ssh locator-server -c "sudo sed -i 's/^set -ex/set -e/' /var/vcap/packages/cluster_utils/bin/check_initialization"

STEP 2
Comment out Gemfire regions logging in the rebalance script

bosh -d $GEMFIRE_SERVICE_INSTANCE ssh locator-server -c "sudo sed -i 's/log \"response: \$region_count_response_json\"/\#log \"response: \$region_count_response_json\"/' /var/vcap/packages/cluster_utils/bin/rebalance"
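
Before editing the scripts on a VM, it can help to see what the two sed expressions from Steps 1 and 2 actually change. The sketch below applies the same expressions to sample lines on a workstation. The function names strip_trace_flag and comment_region_log are hypothetical, and the sample lines are stand-ins for the real script contents.

```shell
# Step 1 expression: replace a leading "set -ex" with "set -e", which turns off
# command tracing (-x) and keeps exit-on-error (-e).
strip_trace_flag() {
  sed 's/^set -ex/set -e/'
}

# Step 2 expression: prefix the region response log call with "#" so it becomes
# a comment. The "$" is literal here because it is not at the end of the pattern.
comment_region_log() {
  sed 's/log "response: $region_count_response_json"/#log "response: $region_count_response_json"/'
}

# Sample lines standing in for the real scripts (assumed, not copied from them).
printf '%s\n' 'set -ex' | strip_trace_flag
printf '%s\n' 'log "response: $region_count_response_json"' | comment_region_log
```

The Step 2 pattern is not anchored, so running it a second time on an already edited script adds another "#" in front of the line. The line remains a comment, so repeated runs do not change behavior.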

STEP 3
Re-run the upgrade-all-service-instances errand on the Gemfire broker deployment, and observe that the error is no longer present:

bosh -d $GEMFIRE_BROKER_DEPLOYMENT run-errand upgrade-all-service-instances

OPTION 2
If a Gemfire service instance has many regions (more than 1000), the errand log output can exceed the 1MB limit and cause this error. Deleting unused or unnecessary Gemfire regions may resolve the issue.

STEP 1
Using the gfsh utility, delete any unused or unnecessary Gemfire regions.
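
As a rough sketch of this step, the helper below turns a list of region names into gfsh destroy region commands and prints them for review instead of running them. The function name destroy_region_commands and the file names are hypothetical, and the list of regions to remove is assumed to have been prepared by hand, for example from the output of the gfsh list regions command.

```shell
# Emit one gfsh "destroy region" command per region name read from stdin.
# Blank lines and lines starting with "#" are skipped so the list can be annotated.
destroy_region_commands() {
  while IFS= read -r region || [ -n "$region" ]; do
    case "$region" in
      ''|'#'*) continue ;;
    esac
    # gfsh region paths start with a slash; strip one if present, then add it back.
    printf 'destroy region --name=/%s\n' "${region#/}"
  done
}

# Example (regions-to-delete.txt is a hypothetical, hand-reviewed list):
#   destroy_region_commands < regions-to-delete.txt > destroy-regions.gfsh
```

After reviewing the generated file, the commands can be run from a connected gfsh session, for instance with run --file=destroy-regions.gfsh. Destroying a region removes its data, so the list should contain only regions confirmed as unused.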

STEP 2
Re-run the upgrade-all-service-instances errand on the Gemfire broker deployment, and observe that the error is no longer present:

bosh -d $GEMFIRE_BROKER_DEPLOYMENT run-errand upgrade-all-service-instances