We have a standalone APM 10.7 SP3 installation that we are currently upgrading to an APM cluster. We created two collectors (col1 and col2) with identical specifications and installed a fresh EM 10.7 SP3 on both. We then reconfigured the standalone EM as a MOM with 50/50 weighted load balancing across the two collectors. After starting the EMs (collectors and MOM), one agent (out of around 400) appeared in the MOM, and then the system became unresponsive: both Workstation and WebView stopped responding. There are no errors in the MOM or collector logs, but the processes appear to be frozen. The MOM shows very heavy memory usage, as seen below.
What recommendations and suggestions should we consider for this type of upgrade?
Release : 10.7.0
Component : APMISP
This behavior is a consequence of the APM product design.
In an APM standalone environment:
Standalone EM keeps the /data and /traces from agents.
In an APM cluster environment:
Collector EMs keep the /data and /traces from agents. MOM does not keep the /data and /traces from agents.
You are upgrading from an APM standalone environment to an APM cluster environment. This is not a standard upgrade.
For this type of upgrade, the best procedure is a parallel upgrade, meaning that you set up a new environment instead of upgrading in place. Install the MOM and collectors in parallel while keeping the APM standalone environment running.
In this case, there is no easy way to split the /data and /traces directories across two collectors. Instead, copy or move /data and /traces to one collector and remove them from the MOM EM.
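The copy step above could be scripted along the following lines. This is only a minimal sketch: it assumes that the historical stores are directories named data and traces directly under the EM home, that both EMs are stopped while it runs, and that both homes are reachable from the same host. The function name copy_history and the example paths are hypothetical and are not part of the APM product.

```python
import shutil
from pathlib import Path

# Assumption: the historical stores live in "data" and "traces"
# directly under the EM home directory.
HISTORY_DIRS = ("data", "traces")


def copy_history(standalone_home, collector_home):
    """Copy the history directories from the old standalone EM home
    into the home of the single collector chosen to hold them.

    Returns the names of the directories that were copied.
    """
    copied = []
    for name in HISTORY_DIRS:
        src = Path(standalone_home) / name
        dst = Path(collector_home) / name
        if not src.is_dir():
            # Nothing to copy for this store.
            continue
        # dirs_exist_ok (Python 3.8+) merges into a directory that a
        # fresh collector install may already have created.
        shutil.copytree(src, dst, dirs_exist_ok=True)
        copied.append(name)
    return copied


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(copy_history("/opt/apm/standalone_em", "/opt/apm/col1_em"))
```

The sketch copies rather than moves, so the originals remain on the MOM EM until the collector has been verified; removing them from the MOM is left as a separate manual step.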
An APM cluster requires more memory, a larger heap size, and more CPUs than a standalone EM, even when the number of agents, metrics, and traces is the same.
Here are some of the contributing factors:
Communication between the MOM and the collectors.
Size of the historical data and traces.
Other tasks that the MOM EM and collectors must perform that a standalone EM does not.
All of these require more memory and a larger heap size, as well as more CPUs.
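Because the size of the historical data and traces is one of the factors listed above, it can help to measure those directories before sizing the collector that will hold them. The following is a minimal sketch that totals the file sizes under a directory. The function name dir_size_bytes and the example paths are hypothetical, and the result is only an on-disk figure, not a heap-size recommendation.

```python
import os


def dir_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            fpath = os.path.join(dirpath, fname)
            # Skip symbolic links so linked files are not counted.
            if not os.path.islink(fpath):
                total += os.path.getsize(fpath)
    return total


if __name__ == "__main__":
    # Hypothetical EM home, for illustration only.
    for name in ("data", "traces"):
        path = os.path.join("/opt/apm/standalone_em", name)
        print(name, dir_size_bytes(path) / (1024 ** 3), "GiB")
```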
This appears to be related to an APM cluster performance issue. You can review the following documents to narrow down the issue: