Failed to connect remote agent


Article ID: 212056


Products

CA Release Automation - DataManagement Server (Nolio)
CA Release Automation - Release Operations Center (Nolio)

Issue/Introduction

Periodically, while executing deployments we get the following error:
Failed to connect remote agent

If we retry several times it eventually works. The problem always occurs while it is trying to execute the action: Get File Or Folder From Remote Agent.

Cause

The cause was a PING message that timed out after 60 seconds. The agent executing the action tries to send the nimi ping message directly to the remote agent defined in the Remote Agent NODE-ID field of the Get File Or Folder From Remote Agent action. If it cannot ping the remote agent within 60 seconds, the action fails with this error.

Environment

Release : 6.6

Component : CA RELEASE AUTOMATION CORE

Resolution

There are 2 potential solutions for this problem. 

  1. Apply cumulative patch 6.6.8 and update the configuration settings so that the agent can wait longer than 60 seconds. 
  2. Change the default configuration so that agents will not try to communicate directly with the remote agents. 

 

Implement only one of the solutions described above. Details for implementing each solution are described below.

 

Option #1

The 60 second timeout that an agent waits was a hardcoded value. Cumulative fix 6.6.8 makes this value configurable. To implement this fix you need to:

  1. Apply cumulative fix 6.6.8 to your NAC and NES (Execution Servers) - as you normally would when applying cumulative fixes.
  2. Once the cumulative fix has been applied you need to update the nimi_config.xml file on each agent that needs to have its ping timeout adjusted. This is done by adding the /config/nimi/routing/timeout/ping property.
    Note: The value is defined in milliseconds. So setting it to 120000 is equal to 120 seconds.
    Example:
    <config>
        <nimi>
            <routing>
                <timeout>
                    <request>180000</request>
                    <ping>120000</ping>
                </timeout>
            </routing>
        </nimi>
    </config>
  3. After making this change, restart the agent. Once the setting has been applied, you can confirm it while the action runs by looking for the following message in the nolio_all.log file on the agent:
Send Request was called for:[email protected]<hostname> request:[email protected]vfrkpingtst03 objectId:com.nolio.platform.shared.communication.[email protected] Waiting 120000ms for response.
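
The configuration change and the log check in the steps above can be sketched in Python. This is a minimal illustration, not a vendor tool: the function names set_ping_timeout and logged_wait_times are hypothetical, and the log pattern relies only on the "Waiting <N>ms for response" text shown in the sample message. Note that the standard parser drops XML comments, so back up nimi_config.xml before writing any result to disk.

```python
import re
import xml.etree.ElementTree as ET


def set_ping_timeout(config_xml, ping_ms):
    """Return config_xml with /config/nimi/routing/timeout/ping set to ping_ms."""
    root = ET.fromstring(config_xml)
    if root.tag != "config":
        raise ValueError("expected <config> as the root element")
    node = root
    # Walk down the path, creating any level that does not exist yet.
    for tag in ("nimi", "routing", "timeout", "ping"):
        child = node.find(tag)
        if child is None:
            child = ET.SubElement(node, tag)
        node = child
    node.text = str(ping_ms)  # value is in milliseconds
    return ET.tostring(root, encoding="unicode")


def logged_wait_times(log_text):
    """Return every N found in 'Waiting Nms for response' log entries."""
    return [int(ms) for ms in re.findall(r"Waiting (\d+)ms for response", log_text)]
```

A typical use would be to read conf/nimi_config.xml, pass its text to set_ping_timeout with 120000, write the result back while the agent is stopped, and after the restart pass the contents of nolio_all.log to logged_wait_times to check that 120000 appears.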

 

 

Option #2

By default, an agent will attempt to communicate directly with another agent if the route to that agent is not greater than 2. The settings below change this behavior so that everything goes through the NES. For example, if two agents report to the same NES then, by default, each agent will try to communicate directly with the other. With this change, operations such as Get File Or Folder From Remote Agent go through the NES instead.

Before making these changes it is recommended to consider:

  1. This change may add additional pressure on the NES. This doesn’t mean that it will automatically result in the NES becoming overloaded. However, it is something that should be acknowledged and kept in mind while testing/evaluating/using.
  2. These settings live on each agent machine. Therefore, to implement this, it would need to be manually changed for each agent that has this kind of problem. 
  3. These settings may help, but they should be thoroughly tested before being relied on in a production environment. Preferably, work with your network team to get realistic expectations of how long messages take to travel between nodes, and apply the fix outlined in Option #1. Another option, if the agents are in separate locations or data centers, is to add a NES in each data center, have the agents in each data center connect to their respective NES, and configure the two NES to communicate with each other. The route would then be greater than 2 and the agent would go through the NES.

 

To apply this solution there are two configuration settings:

  1. full_route_check
    description: If false, the agent checks the size of the route to a remote agent. If the route is not greater than max_route_check_size, it attempts to connect directly to that agent.
    default value: false
    recommended value: false
  2. max_route_check_size
    description: A max/cap on how many nodes can be in a route in order for an agent to connect directly to a remote agent.
    default value: 2
    recommended value: 0
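
To see why a value of 0 routes everything through the NES, the documented check can be modeled in a few lines of Python. This is only a sketch of the behavior described above for full_route_check set to false, not the product's actual code. The function name connects_directly is hypothetical, and it assumes every route has a size of at least 1.

```python
def connects_directly(route_size, max_route_check_size=2):
    """Model of the documented check when full_route_check is false.

    The agent connects directly when the route is not greater than
    max_route_check_size; otherwise the traffic goes through the NES.
    """
    return route_size <= max_route_check_size


# Default cap of 2: a route of size 2 is attempted directly.
print(connects_directly(2))
# Recommended cap of 0: no route of size 1 or more qualifies.
print(connects_directly(2, max_route_check_size=0))
```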

Both of these settings are part of the routing xml child node. An example of the routing xml child node with these two settings is included below so that you can see:

  • Where you can find the settings - if they do exist.
  • Where you need to add them - if they do not exist.

 

By default, these settings are not defined in the conf/nimi_config.xml file. If you confirm that the full_route_check setting is not defined in the file, it does not need to be explicitly defined, since the default and recommended values are the same.

 

Example with both settings present (full_route_check and max_route_check_size):

<config>
    <nimi>
        <routing>
            <threadpool>...</threadpool>
            <full_route_check>false</full_route_check>
            <max_route_check_size>0</max_route_check_size>
            <timeout>...</timeout>
        </routing>
    </nimi>
</config>
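
Because these settings are usually absent from the file, it can help to report the values an agent is effectively using. The Python sketch below does that under stated assumptions: the function name effective_routing_settings is hypothetical, and the fallback values are the defaults listed in this article (false and 2), not values read from the product.

```python
import xml.etree.ElementTree as ET

# Defaults as listed in this article, used when a setting is not in the file.
DEFAULTS = {"full_route_check": "false", "max_route_check_size": "2"}


def effective_routing_settings(config_xml):
    """Return both routing settings, falling back to the listed defaults."""
    root = ET.fromstring(config_xml)
    routing = root.find("nimi/routing")
    settings = {}
    for name, default in DEFAULTS.items():
        # findtext returns None when the element is missing.
        value = routing.findtext(name) if routing is not None else None
        settings[name] = value.strip() if value and value.strip() else default
    return settings
```

Running it against the text of conf/nimi_config.xml before and after the edit shows whether max_route_check_size has changed from the default of 2 to 0.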

After making these changes you should stop the agent, clear the contents of the NOLIOAGENT_HOME/persistency folder and restart the agent.
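
Clearing the persistency folder can be scripted. The following Python sketch is one possible way to do it, with a hypothetical function name clear_persistency. It assumes the agent has already been stopped and that the path passed in is the agent's NOLIOAGENT_HOME directory. Stopping and restarting the agent is left to your usual procedure.

```python
import shutil
from pathlib import Path


def clear_persistency(agent_home):
    """Delete everything inside <agent_home>/persistency, keeping the folder.

    Returns the number of top-level entries removed.
    """
    folder = Path(agent_home) / "persistency"
    if not folder.is_dir():
        raise FileNotFoundError(f"persistency folder not found: {folder}")
    removed = 0
    for entry in folder.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a real subdirectory recursively
        else:
            entry.unlink()  # remove a file or a symbolic link
        removed += 1
    return removed
```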