How To: Identify Bottleneck In NES_DISTRIBUTION Artifact Distribution Phase

Article ID: 204933


Products

CA Release Automation - Release Operations Center (Nolio)
CA Release Automation - DataManagement Server (Nolio)

Issue/Introduction

We are seeing the "Distribute to execution server" phase take a long time to complete. It is understood that two main things happen during this time:

  1. The artifact retrieval agent obtains a copy on its local machine from the artifacts source.
  2. The artifact is transferred to the appropriate Execution Servers.

How can we identify potential bottlenecks in getting the artifact to the appropriate Execution Servers (#2 above, the NES_DISTRIBUTION phase)?

Environment

Release : 6.6

Component : CA RELEASE AUTOMATION CORE

Cause

The Nolio Release Automation Release Operations Center (ROC) Web UI does not provide this detail. You will need to gather the logs from all of your Execution Servers and trace the transfer through them.

Resolution

There are two prerequisites before you can identify potential bottlenecks in the NES_DISTRIBUTION phase. 

  1. You must identify the MD5 of the artifact file. The following KB article may help if you do not have an easier way of identifying it: How To: Identify How Long The Artifact Retrieval Process Took. A sketch for computing it from a local copy of the file follows this list.
    • Note: Artifact repositories usually expose this information as a property of the artifact.
  2. You must have the logs folder from all of your Execution Servers.
    • If your Management Server, Execution Servers and/or Agents are on Linux, then you might want to consider using the Nolio RA Collect Logs Scripts available on the Nolio Release Automation Community.
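
If you already have a local copy of the artifact file (for example, on the artifact retrieval agent's machine), the following minimal Python sketch is one way to compute its MD5 yourself. The file path used here is a placeholder:

import hashlib

def artifact_md5(path, chunk_size=8 * 1024 * 1024):
    # Hash the file in chunks so large artifacts do not need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as artifact:
        for chunk in iter(lambda: artifact.read(chunk_size), b""):
            digest.update(chunk)
    # The Nimi logs show the MD5 in upper case (see the example messages below).
    return digest.hexdigest().upper()

# Hypothetical path - replace with the real location of your artifact copy.
print(artifact_md5("/tmp/my_artifact.zip"))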

 

Once you have the MD5 and the logs from your Execution Servers, you can begin by searching the logs for:

GETTER_<MD5_of_artifact_file> got chunk,

Example Message:

./nes_serverA/nimi.log.1: 2020-08-12 07:38:14,441 [FileTransferWorker-2848] DEBUG (com.nolio.nimi.filetransfer.impl.AbsFileTransferWorker:322) - GETTER_C70EC4482650B95173A5EC9479827203 got chunk, [10,383,360] out of [310,248,458] - [3 %].

 

The set of messages found from the search string above will show the chunks of data received by the NES. From these messages you can determine how long it took for a NES to receive all of the chunks of data.
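
As an illustration only, the following Python sketch shows one way to pull the timestamps out of the "got chunk" messages and report how long each Execution Server spent receiving the chunks. It assumes the logs from each server have been copied into per-server folders and that the timestamp layout matches the example above; the MD5 and folder names are placeholders:

import re
from datetime import datetime
from pathlib import Path

# Placeholders - substitute your artifact MD5 and the folders holding each NES's logs.
MD5 = "C70EC4482650B95173A5EC9479827203"
LOG_DIRS = {"nes_serverA": Path("./nes_serverA"), "nes_serverB": Path("./nes_serverB")}

# Timestamp layout taken from the example message above: 2020-08-12 07:38:14,441
chunk_line = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*GETTER_" + MD5 + r" got chunk"
)

for server, log_dir in LOG_DIRS.items():
    times = []
    for log_file in log_dir.glob("nimi.log*"):
        with log_file.open(errors="ignore") as handle:
            for line in handle:
                match = chunk_line.match(line)
                if match:
                    times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    if times:
        # Time between the first and last chunk received for this artifact.
        print(f"{server}: {len(times)} chunks received over {max(times) - min(times)}")

Comparing the durations reported per server highlights which NES, if any, received the file noticeably more slowly than the others.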

 

If you identify a server that is experiencing a slow transfer, then you can confirm which remote server was supplying the file by searching for:

GETTER_<MD5_of_artifact_file> got the route

Example Message:

./nimi.log.1: 2020-08-12 12:35:33,953 [DiscoveryWorker-18210] DEBUG (com.nolio.nimi.filetransfer.impl.AbsFileTransferWorker:490) - GETTER_C70EC4482650B95173A5EC9479827203 got the route. The real source is not directly accessible, will use [nid:es_ServerB] instead.
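
The short sketch below, again illustrative only, extracts the node id reported in that route message so you can see which server supplied the file. The pattern is based on the example message above, and the MD5 and log folder are placeholders:

import re
from pathlib import Path

# Placeholders - substitute your artifact MD5 and the log folder of the slow server.
MD5 = "C70EC4482650B95173A5EC9479827203"
LOG_DIR = Path("./nes_serverA")

# Captures the node id in brackets, e.g. [nid:es_ServerB], from the route message.
route_line = re.compile(r"GETTER_" + MD5 + r" got the route.*\[(nid:[^\]]+)\]")

for log_file in LOG_DIR.glob("nimi.log*"):
    with log_file.open(errors="ignore") as handle:
        for line in handle:
            match = route_line.search(line)
            if match:
                # e.g. "nimi.log.1: source was nid:es_ServerB"
                print(f"{log_file.name}: source was {match.group(1)}")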

 

If you identify a server (ServerA) that experiences a slower-than-expected transfer, and you know which remote server (ServerB) supplied the file along with the relevant timeframes, you have details you can take to your network team to identify possible network errors, congestion, bandwidth limitations, or other issues that might account for the slow transfer.

 

Additional Information

The Nimi component is what Execution Servers use to transfer files between each other and agents. You may find the following type of message in your nimi logs:

com.nolio.nimi.SocketBufferFullException: Gave up after 30000ms waiting for the message to be written by Netty to the socket; message={message type=`FILES`, destination node=`nid:es_<ServerB>`, connection=NimiConnectionImpl{remoteAddress=/ip.address.of.remoteNes:6600, localAddress=/ip.address.of.localNes:6600, connectionID=nid:es_<serverB>, channel=[id: 0x15acb06c, /ip.address.of.remoteNes:54242 => /ip.address.of.localNes:6600], closed=false, lastAccessedTime=1597232343514}}

SocketBufferFull's is not really an error per se, it's a valid state where at one end we send data into the connection (tcp socket), but the other end isn't consuming it as fast as we send it, so operating system's socket buffer fills up and you can't put more data into it, which Nimi tries to do and then backs down. For example if the network is slow, or there's some packet loss, that's what will occur.