Gemfire Connection Limit Exceeded, Causing Data Collection Issues and Cloud Proxies Going Offline
Article ID: 312266

Products

VMware Aria Suite

Issue/Introduction

  • Data collection gets stuck at random intervals, roughly every few weeks.
  • Cloud Proxies go offline.
  • Management packs cannot be integrated.


Error Message in Collector Logs:

2023-09-27T17:23:48,037+0000 INFO  [Auto discovery worker thread 1]  com.integrien.alive.collector.CommunicatorThread.sendTask - Failed to send task.
org.apache.geode.cache.client.ServerRefusedConnectionException: ##.##.##.##(vRealize Ops )<ec><v26>:10002 refused connection: exceeded max-connections 800
    at org.apache.geode.internal.cache.tier.sockets.Handshake.readMessage(Handshake.java:334) ~[geode-core-9.15.2.jar:?]
    
2023-09-27T17:27:17,117+0000 ERROR [Auto discovery worker thread 2]  com.vmware.vcops.platform.common.collector.GemfireCommunicator.sendChunk - Failed to create entry[key = CHK:15:94027:1:1234567890] in region CONTROLLER_TASKS_REGION
org.apache.geode.cache.client.ServerRefusedConnectionException: ##.##.##.##(vRealize Ops )<ec><v26>:10002 refused connection: exceeded max-connections 800
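
To confirm that a collector log shows this signature, and how often, the log can be scanned for the refusal message. The following is a minimal sketch, not a VMware-provided tool: the function names are hypothetical, the log path is left to the caller, and the patterns assume the timestamp and message formats seen in the excerpt above.

```python
import re
from collections import Counter

# Matches the refusal line shown in the excerpt and captures the limit value.
REFUSED = re.compile(
    r"ServerRefusedConnectionException: .*refused connection: "
    r"exceeded max-connections (\d+)"
)
# Matches the leading timestamp of a log entry, e.g. 2023-09-27T17:23:48,037+0000
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2})T(\d{2}):\d{2}:\d{2}")


def count_refusals(lines):
    """Count refused connections per (date, hour) of the enclosing log entry."""
    counts = Counter()
    current = ("unknown", "--")  # used until the first timestamped line is seen
    for line in lines:
        ts = TIMESTAMP.match(line)
        if ts:
            current = (ts.group(1), ts.group(2))
        if REFUSED.search(line):
            counts[current] += 1
    return counts


def count_refusals_in_file(path):
    """Same as count_refusals, reading the log file at `path` (caller-supplied)."""
    with open(path, errors="replace") as fh:
        return count_refusals(fh)
```

Calling `count_refusals_in_file` with the path to a collector log returns a `Counter` keyed by date and hour, which makes it easier to see when the refusals started and whether they are continuous.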

 

Environment

VMware Aria Operations 8.10.x
VMware Aria Operations 8.12.x
VMware Aria Operations 8.14.x
VMware Aria Operations 8.x

Cause

Gemfire is the component Aria Operations uses for cluster communication, including communication between analytics nodes and remote collectors. When Gemfire encounters an exception while reading from a socket, it does not clean up its internal structures properly. As a result, the connection count eventually exceeds the limit (max-connections 800 in the log above) and the server refuses new connections, which disrupts data collection.
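
The general failure pattern can be illustrated with a toy model. This is a sketch of the class of bug described above, not Gemfire's actual code: `ToyServer`, `simulate`, and their parameters are hypothetical names, and the model only assumes that each connection holds one slot against a fixed limit and that the slot is not released when a read fails.

```python
class ConnectionLimitExceeded(Exception):
    """Raised when the toy server has no free connection slots."""


class ToyServer:
    """Stand-in for a server with a fixed connection cap (not Gemfire code)."""

    def __init__(self, max_connections, cleanup_on_error):
        self.max_connections = max_connections
        self.cleanup_on_error = cleanup_on_error
        self.open_slots = 0  # slots currently counted against the limit

    def handle(self, read_fails):
        """Accept one client, read from it, then release its slot."""
        if self.open_slots >= self.max_connections:
            raise ConnectionLimitExceeded(
                f"exceeded max-connections {self.max_connections}"
            )
        self.open_slots += 1
        try:
            if read_fails:
                raise OSError("simulated socket read failure")
        except OSError:
            if not self.cleanup_on_error:
                return  # the bug being modeled: the slot is never released
        self.open_slots -= 1


def simulate(server, requests, fail_every):
    """Send `requests` clients; every `fail_every`-th read fails.

    Returns how many clients were accepted before the first refusal,
    or `requests` if no client was refused.
    """
    for i in range(1, requests + 1):
        try:
            server.handle(read_fails=(i % fail_every == 0))
        except ConnectionLimitExceeded:
            return i - 1
    return requests
```

In this model, `simulate(ToyServer(8, cleanup_on_error=False), 1000, 10)` stops after 80 clients because each failed read leaks one slot, while the same run with `cleanup_on_error=True` serves all 1000. The slow accumulation is consistent with the symptom appearing only every few weeks, and with a restart clearing it temporarily.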

Resolution

VMware Engineering is currently investigating and working on a permanent fix.

Workaround:

The currently known workarounds are:

1. Restart the cluster (Cloud Proxies, Remote Collectors, and analytics nodes):

  • https://knowledge.broadcom.com/external/article?legacyId=59207

2. Replace every Remote Collector (RC) that was converted to a Cloud Proxy with a freshly deployed Cloud Proxy, since Cloud Proxies do not use Gemfire for communication.