Agent Operation Failed message appears instantly when upgrading Telegraf agent

Article ID: 381643

Products

VMware Aria Suite

Issue/Introduction

The Telegraf agent fails to upgrade with the following error message:

Agent Operation Failed: Please check the health of the Cloud Proxy and the Salt service. Retry the action if components are healthy.

 

/storage/log/vcops/log/vcops-bridge.log reports:

2024-10-09T08:08:01,621+0000 ERROR [ServerConnection on port 10000 Thread 17141] com.vmware.vcops.bridge.server.UCPManager.bootstrapUcpAgent_aroundBody10 - Unable to perform operation: contentupgrade, Exception Detail: vcId=_ vcIp=- vmMor=<ResourceID for target Agent>
java.lang.NullPointerException: null
        at com.vmware.vcops.bridge.server.UCPManager.bootstrapUcpAgent_aroundBody10(UCPManager.java:1195) ~[vcops-bridge-server-1.0-SNAPSHOT.jar:?]
        at com.vmware.vcops.bridge.server.UCPManager.bootstrapUcpAgent_aroundBody11$advice(UCPManager.java:96) ~[vcops-bridge-server-1.0-SNAPSHOT.jar:?]
        at com.vmware.vcops.bridge.server.UCPManager.bootstrapUcpAgent(UCPManager.java:1) ~[vcops-bridge-server-1.0-SNAPSHOT.jar:?]
        at com.vmware.vcops.bridge.server.DataRetrieverServer.bootstrapUcpAgent(DataRetrieverServer.java:10869) ~[vcops-bridge-server-1.0-SNAPSHOT.jar:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
        at java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
        at com.vmware.vcops.platform.gemfire.GemfireFunction.invokeHandlerMethod(GemfireFunction.java:112) ~[alive_platform.jar:?]
        at com.vmware.vcops.platform.gemfire.GemfireFunction.execute(GemfireFunction.java:60) ~[alive_platform.jar:?]
        at com.vmware.vcops.platform.gemfire.GemfireFunctionHandler$FunctionHandler.execute(GemfireFunctionHandler.java:368) ~[alive_platform.jar:?]
        at com.vmware.vcops.platform.gemfire.GemfireFunctionHandler$TopGemfireFunction.execute(GemfireFunctionHandler.java:165) ~[alive_platform.jar:?]
        at org.apache.geode.internal.cache.tier.sockets.command.ExecuteFunction70.executeFunctionLocally(ExecuteFunction70.java:401) ~[gemfire-core-10.0.1.jar:?]
        at org.apache.geode.internal.cache.tier.sockets.command.ExecuteFunction70.cmdExecute(ExecuteFunction70.java:262) ~[gemfire-core-10.0.1.jar:?]
        at org.apache.geode.internal.cache.tier.sockets.BaseCommand.execute(BaseCommand.java:191) ~[gemfire-core-10.0.1.jar:?]
        at org.apache.geode.internal.cache.tier.sockets.ServerConnection.doNormalMessage(ServerConnection.java:895) ~[gemfire-core-10.0.1.jar:?]
        at org.apache.geode.internal.cache.tier.sockets.ServerConnection.doOneMessage(ServerConnection.java:1109) ~[gemfire-core-10.0.1.jar:?]
        at org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1391) ~[gemfire-core-10.0.1.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
        at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.lambda$initializeServerConnectionThreadPool$3(AcceptorImpl.java:707) ~[gemfire-core-10.0.1.jar:?]
        at org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:124) ~[gemfire-logging-10.0.1.jar:?]
        at java.lang.Thread.run(Unknown Source) ~[?:?]
2024-10-09T08:08:01,623+0000 ERROR [ServerConnection on port 10000 Thread 17141] com.vmware.vcops.bridge.server.BridgeTracerAspect.processBridgeResult - Agent Operation Failed: Please check the health of the Cloud Proxy and the Salt service. Retry the action if components are healthy. null

 

/storage/log/vcops/log/web.log also reports:

2024-10-09T08:08:01,502+0000 INFO  [ajp-nio-127.0.0.1-8009-exec-1379] com.vmware.vcops.ui.action.ucp.UcpAgentManagementAction.startOrStopAgent - startOrStopAgent action started
2024-10-09T08:08:01,624+0000 INFO  [ajp-nio-127.0.0.1-8009-exec-1379] com.vmware.vcops.ui.action.ucp.UcpAgentManagementAction.startOrStopAgent - startOrStopAgent action ended
2024-10-09T08:08:01,624+0000 ERROR [ajp-nio-127.0.0.1-8009-exec-1379] com.vmware.vcops.ui.util.PreResultInterceptor.processErrors - functionName = bootstrapUcpAgent, succeededPartially = false, errorMessage = Agent Operation Failed: Please check the health of the Cloud Proxy and the Salt service. Retry the action if components are healthy. null

2024-10-09T08:08:01,624+0000 INFO  [ajp-nio-127.0.0.1-8009-exec-1379] com.vmware.vcops.ui.util.PreResultInterceptor.processErrors - Component: TODO
Url: /ui/ucpAgentManagement.action
Params: mainAction=startOrStopAgent (
  Bridge Client function 'bootstrapUcpAgent' - Oct 09 08:08:01:502 - 121ms (
    Bridge Server function 'bootstrapUcpAgent [node: ops05]' - Oct 09 08:08:01:504 - 119ms
  checkIfUserHasPrivileges - Oct 09 08:08:01:504 - 1ms.
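
To confirm that a node shows the same signature, the bridge log can be scanned for the contentupgrade error followed directly by the NullPointerException. The following is a minimal sketch, not an official tool: the function name `find_upgrade_npe` is hypothetical, and the two marker strings are copied from the vcops-bridge.log excerpt above.

```python
# Marker strings copied from the vcops-bridge.log excerpt in this article.
OPERATION_ERROR = "Unable to perform operation: contentupgrade"
NPE_LINE = "java.lang.NullPointerException"


def find_upgrade_npe(log_lines):
    """Return the timestamps of contentupgrade errors that are
    immediately followed by a NullPointerException line.

    find_upgrade_npe is a hypothetical helper name used for illustration.
    """
    hits = []
    previous = ""
    for line in log_lines:
        # The exception line follows the ERROR line that carries the timestamp.
        if line.startswith(NPE_LINE) and OPERATION_ERROR in previous:
            # The timestamp is the first space-separated field of the ERROR line.
            hits.append(previous.split(" ", 1)[0])
        previous = line
    return hits
```

The function accepts any iterable of lines, so an open file handle for /storage/log/vcops/log/vcops-bridge.log can be passed directly. An empty result means only that this exact pattern was not found in the file that was read.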

Environment

Aria Operations 8.18.x

Cause

An incorrect startOrStopAgent action is initiated when an upgrade of the Telegraf agent is started, which causes the upgrade to fail almost instantly.

This issue is under investigation. It has mainly been observed on agents running on physical servers.

Resolution

Workaround: To upgrade the affected agents, use the API instead of the UI.

Locate the ResourceIDs of the affected agents:

  1. In the Aria Operations product UI, go to Operations -> Applications.
  2. Select 'Manage Telegraf Agents'.
  3. Locate the affected agent in the list, select the vertical ellipsis between the checkbox and the agent name, and then select 'Go to Details'.
  4. The ResourceID appears in the URL, as in the following example:
    https://<Operations IP or FQDN>/vcf-operations/ui/inventory;mode=hierarchy;resourceId=ZZZZZZZZ-ZZZZ-ZZZZ-ZZZZ-ZZZZZZZZZZZZ;tab=summary
  5. Copy this ResourceID to a text file, or paste it directly into the request body during the API procedure below.
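
When several agents are affected, copying each ID by hand is error-prone. The following is a minimal sketch of pulling the `resourceId` value out of a copied details URL. The helper name `extract_resource_id` is hypothetical, and the pattern assumes that real ResourceIDs are hexadecimal UUIDs in the 8-4-4-4-12 layout suggested by the placeholder above.

```python
import re

# Assumption: ResourceIDs are hexadecimal UUIDs (8-4-4-4-12), as the
# placeholder in the example URL suggests.
RESOURCE_ID_RE = re.compile(
    r"resourceId=([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
)


def extract_resource_id(url):
    """Return the resourceId found in a details-page URL, or None.

    extract_resource_id is a hypothetical helper name used for illustration.
    """
    match = RESOURCE_ID_RE.search(url)
    return match.group(1) if match else None
```

A result of `None` means the URL did not contain a `resourceId` in the expected form. In that case, copy the value from the browser address bar manually.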

 

Upgrade using API:

  1. Connect to the Aria Operations suite-api at: https://<Operations IP or FQDN>/suite-api
  2. Click 'Authorize' and enter the admin credentials.
  3. Expand 'Applications' in the list.
  4. Click 'PUT /api/applications/agents/upgrade' to expand it, and select 'Try it out'.
  5. Replace the example ResourceIDs in the request body with the ResourceID(s) found earlier:
    {
      "contextResourceIDs" : [ "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX", "YYYYYYYY-YYYY-YYYY-YYYY-YYYYYYYYYYYY" ]
    }

    Note that the ResourceIDs in the example above are placeholders made of X's and Y's; you must enter the IDs located earlier. Do not remove anything outside the square brackets. If there is only one ResourceID, remove the comma and the second ResourceID.

    Based on the example URL above, for a single agent:

    {
      "contextResourceIDs" : [ "ZZZZZZZZ-ZZZZ-ZZZZ-ZZZZ-ZZZZZZZZZZZZ" ]
    }
  6. After you have replaced the ResourceID(s), select 'Execute'. Make sure the response code is a 2xx code, and verify in the Operations product UI that the upgrade has started.
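
For scripted use, the same request can be prepared outside the Swagger page. The following is a minimal sketch using only the Python standard library; it builds the request but does not send it. The function names `build_upgrade_request` and `is_success` are hypothetical. The full path is an assumption that joins the /suite-api base URL with the 'PUT /api/applications/agents/upgrade' operation shown above. The sketch does not show how to obtain credentials: the caller must supply a complete `Authorization` header value that is valid for the environment.

```python
import json
import urllib.request


def build_upgrade_request(host, resource_ids, auth_header):
    """Build (but do not send) the PUT request for the agent upgrade.

    build_upgrade_request is a hypothetical helper name. The path below is
    assumed from the suite-api base URL plus the operation in this article.
    """
    if not resource_ids:
        raise ValueError("at least one ResourceID is required")
    # Same body shape as the example in step 5.
    body = json.dumps({"contextResourceIDs": list(resource_ids)}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{host}/suite-api/api/applications/agents/upgrade",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            # Assumption: the caller passes a full, valid Authorization value.
            "Authorization": auth_header,
        },
    )


def is_success(status_code):
    """Return True for any 2xx response code, as required in step 6."""
    return 200 <= status_code < 300
```

The returned object can be passed to `urllib.request.urlopen`, and the response status can then be checked with `is_success`. As with the Swagger procedure, confirm in the Operations product UI that the upgrade has actually started.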