We observed a scenario in which existing processes that had been executing fine started to freeze in the initialization stage. The parameter values were not being populated, and this caused the processes to freeze. Detailed investigation showed that built-in parameters such as release-name and release-id were not receiving values at the agent end during execution.
<Please see attached file for image>
From the agent logs we can see that the executing agent sends a request for the parameter value to its parent execution server:
2017-09-26 12:27:39,249 [job-7260973-jobServer-8117379-9:Init Delivery Path(P72266771.F72266793.E72266797):Set Parameter Value - String] DEBUG (com.nolio.platform.shared.parameter.ParameterResolver:280) - Querying remote resource [es_**********] about value of [root/nolio/Application Parameters/Var Test SS]
2017-09-26 12:27:39,249 [job-7260973-jobServer-8117379-9:Init Delivery Path(P72266771.F72266793.E72266797):Set Parameter Value - String] DEBUG (com.nolio.platform.shared.parameter.ParameterRequestsManager:272) - ParameterRequestsManager: registering local step for key root/nolio/Application Parameters/Var Test SS
2017-09-26 12:27:39,249 [job-7260973-jobServer-8117379-9:Init Delivery Path(P72266771.F72266793.E72266797):Set Parameter Value - String] DEBUG (com.nolio.platform.shared.datamodel.Action:284) - Requesting value for parameter [root/nolio/Application Parameters/Var Test SS]
2017-09-26 12:27:39,249 [job-7260973-jobServer-8117379-9:Init Delivery Path(P72266771.F72266793.E72266797):Set Parameter Value - String] DEBUG (com.nolio.platform.shared.communication.CommunicationNetwork:414) - Sending internal message:[ID:[email protected]_1A_Agent, from:Test_1A_Agent, to:[email protected]_***********- ParameterValueRequest: returnAddress=Test_1A_Agent,requestedKey=root/nolio/Application Parameters/Var Test SS,returnService=REMOTE_PARAM_RESOLUTION,]
A review of the execution server logs suggests a network issue obstructing communication between the execution server and its connected agents. Some file routing messages are being discarded, which may indicate a routing problem, and no route can be found to the target node:
2017-09-26 12:29:17,604 [New I/O server worker #1-4] WARN (com.nolio.nimi.routing.impl.DiscoveryManagerImpl:126) - Discarding file routing message file_3686A3C2F791549148FEB483202A4DAC
2017-09-26 12:29:17,606 [New I/O client worker #1-2] WARN (com.nolio.nimi.routing.impl.DiscoveryManagerImpl:126) - Discarding file routing message file_3686A3C2F791549148FEB483202A4DAC
2017-09-26 12:36:17,292 [DiscoveryWorker-177915] WARN (com.nolio.nimi.comm.impl.OutboundConnectionsImpl:349) - failed to create sender, no route was found for target node [nid:es_***********].
2017-09-26 12:36:17,292 [OutboundConnectionsImpl-66016] ERROR (com.nolio.nimi.comm.impl.ForwarderImpl:162) - Failed to forward message from: node-id to:nid:es_***********
com.nolio.nimi.comm.NimiCommException: failed to create sender, no route was found for target node [nid:es_***********].
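To check how often these routing failures occur in a larger log, a small script can count the message fragments shown in the excerpts above. This is only a minimal sketch: the fragments are copied from the log lines in this article, while the function name and the labels are illustrative and not part of the product.

```python
from collections import Counter

# Message fragments copied from the execution server log excerpts above.
# The labels on the left are illustrative names, not product terminology.
SIGNATURES = {
    "discarded_route": "Discarding file routing message",
    "no_route": "no route was found for target node",
    "forward_failed": "Failed to forward message",
}


def count_routing_failures(lines):
    """Count how many lines contain each known routing-failure fragment."""
    counts = Counter()
    for line in lines:
        for label, fragment in SIGNATURES.items():
            if fragment in line:
                counts[label] += 1
    return counts
```

Passing an open log file to `count_routing_failures` works because the function accepts any iterable of lines. Non-zero counts for `no_route` or `forward_failed` point to the same symptom as described here.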
We also found that the connection between sibling super-nodes cannot be established:
2017-09-26 12:31:56,121 [New I/O client boss #1] WARN (com.nolio.nimi.comm.impl.nettysupport.BasicHandler:51) - java.net.ConnectException : Connection timed out: no further information
2017-09-26 12:31:56,121 [New I/O client boss #1] INFO (com.nolio.nimi.comm.impl.NimiConnectionImpl:133) - connection [NimiConnectionImpl{remoteAddress=null, localAddress=null, connectionID=null, channel=null, closed=true, lastAccessedTime=1506421895122}] is closed.
2017-09-26 12:31:56,122 [ReverseConnectionWorker-60099] WARN (com.nolio.nimi.comm.impl.NetworkConnectionManagerImpl:281) - could not create connection to [Agent-IP:6600]
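A timeout like the one above can be checked independently of the product with a plain TCP connection test, run from the execution server machine. The sketch below is an assumption-laden example: port 6600 is taken from the log line above, the host passed in is a placeholder for the agent address, and the function name is hypothetical. It only tests TCP reachability from the machine where it runs, not the product's own routing.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("agent-host", 6600)` (with the real agent host name or IP substituted) returning `False` would be consistent with the "could not create connection" warning, and would suggest checking firewalls and routing between the two machines.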
From our analysis, we conclude that a network issue was obstructing communication between the execution server and the agent, and hence the parameter value was not being resolved at the agent end.
We recommend checking the items below and, if all of them are fine, re-running the release/process.