Successfully installed Telegraf Agents intermittently go into an unhealthy status on random VMs for 5-10 minute periods, then return to a healthy status.
Aria Operations 8.18.2 and below
The ucp-minion collection cycle might not complete within 5 minutes.
The Telegraf Agent status may intermittently flip to "Unhealthy" due to a failure in the meps metrics collection. Specifically, when the system attempts to fetch process statistics, a psutil.NoSuchProcess error occurs if an agent process exits or terminates during the collection cycle.
Because the agent's health is determined by the real-time status of these endpoint processes, this missing process causes the agent to be reported as unhealthy for that interval. The status typically returns to "Running" in the subsequent collection cycle once the process is successfully captured, leading to the observed "flipping" behavior.
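The race described above can be illustrated with a minimal shell sketch. This is not the ucp-minion code: the function name `collect_one` and the "ok"/"gone" labels are hypothetical, and `kill -0` stands in for the process-statistics lookup. It only shows how a process that exits between being listed and being inspected produces a "no such process" condition, and how a collector can skip it instead of failing the whole cycle.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of the race; not the actual collector.

# Probe one PID. kill -0 sends no signal; it only reports whether the
# PID still exists. A vanished PID is reported as "gone" rather than
# being treated as a fatal error.
collect_one() {
    if kill -0 "$1" 2>/dev/null; then
        echo "ok $1"
    else
        echo "gone $1"
    fi
}

# Start a short-lived process and remember its PID, as a collector
# would when it first lists the processes to inspect.
true &
pid=$!

# The process exits (and is reaped) before its statistics are read.
wait "$pid"

live=$(collect_one "$$")     # the current shell is still running
gone=$(collect_one "$pid")   # the short-lived process no longer exists

echo "$live"
echo "$gone"
```

A collector that treats the "gone" case as a skipped entry, rather than as a failed collection, would not report the agent as unhealthy for that interval.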
Workaround for 8.18.2 and below only:
1. SSH to Cloud Proxy as root.
2. Go to the /ucp/downloads/salt directory, then run ll to list its contents.
cd /ucp/downloads/salt
ll
3. Note the file permissions and owner of ucp-minion.zip. In this example, the permissions are -rw-r--r-- and the owner and group are admin admin.
4. Make a backup of the existing ucp-minion.zip.
mv ucp-minion.zip ucp-minion.zip_bkp
5. Download the attached ucp-minion.zip, then use WinSCP or another utility to transfer it to /ucp/downloads/salt.
6. Make sure the permissions and owner of the zip file are the same as noted in step 3. If not, run the following commands to change the file owner and permissions.
chown admin:admin ucp-minion.zip
chmod 644 ucp-minion.zip
7. Go to Managed Telegraf Agents in the Aria Operations UI. Select all the Windows Servers on that Cloud Proxy and perform the Agent Action "Update".
8. Wait for the action to complete and make sure all the Windows Servers where the content upgrade was performed show a "Last Action" status of "Content upgrade success". Then wait 10-15 minutes and check the agent status; it should change from "Agent Unhealthy" to "Agent Running".
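Steps 3, 4, and 6 can also be scripted so that the original mode and owner are captured and re-applied rather than read off the listing by eye. The following is a minimal sketch, assuming GNU `stat` is available on the Cloud Proxy. The helper names `save_meta` and `restore_meta` are hypothetical and not part of the product. Numeric user and group IDs are used instead of names so that the recorded values can be passed straight to `chown`.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Print the octal mode and numeric owner:group of a file,
# for example "644 1000:1000".
save_meta() {
    stat -c '%a %u:%g' "$1"
}

# Re-apply a previously recorded "MODE UID:GID" string to a file.
restore_meta() {
    _saved="$1"
    _file="$2"
    _mode=${_saved%% *}    # text before the first space
    _owner=${_saved#* }    # text after the first space
    chown "$_owner" "$_file" && chmod "$_mode" "$_file"
}

# Possible usage on the Cloud Proxy, run as root (left commented out):
#   cd /ucp/downloads/salt
#   meta=$(save_meta ucp-minion.zip)        # step 3
#   mv ucp-minion.zip ucp-minion.zip_bkp    # step 4
#   ... transfer the new ucp-minion.zip ... # step 5
#   restore_meta "$meta" ucp-minion.zip     # step 6
```

If the transferred file already has the same mode and owner, `restore_meta` changes nothing, so it is safe to run unconditionally.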
IMPORTANT
The fix provided in this KB article has not been rolled into any releases after 8.18.2. It is not advised to apply the fix to 8.18.3 and above or VCF 9.x.
Please contact Broadcom Support for more information, or subscribe to this knowledge article to get updates on this issue.