In this use case, the discovery_server probe shows high CPU consumption while the probeDiscovery queue keeps growing and its messages are not being cleared. The hub logs show the hub disconnecting the discovery_server from the probeDiscovery queue:
hub: Subscriber 'discovery_server' at 'IP/local port' attached to queue 'probeDiscovery' (subject:probe_discovery requested bulk:0, granted bulk:1, minimum bulk:0, wait:0, heartbeat: 2, reply timeout: 60), time used: 0 ms
hub: Reply not received on queue route for 'probeDiscovery' (timeout), disconnecting
The discovery queue contains a discovery graph so large that computing its checksum takes a very long time. The discovery_server computes the checksum to determine whether the graph is the same as the previous one. Computing the checksum is CPU intensive and, in this case, takes longer than the hub's postroute_reply_timeout.
If the discovery_server probe does not reply before the timeout, the hub drops the queue subscription and then resends the current message when the discovery_server resubscribes, so the issue loops.
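One way to confirm the loop described above is to count how often the timeout disconnect is logged for the queue. The following is a minimal sketch, not a supported tool: the regular expression is based only on the log line quoted in this article, and the function name count_timeout_disconnects and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Pattern based on the hub log line quoted in this article (an assumption
# about the exact wording; adjust it if your hub version logs differently).
TIMEOUT_RE = re.compile(
    r"hub: Reply not received on queue route for '([^']+)' \(timeout\), disconnecting"
)


def count_timeout_disconnects(log_lines):
    """Return a Counter mapping queue name to number of timeout disconnects."""
    counts = Counter()
    for line in log_lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample input: the same disconnect repeated, as in a loop.
sample_lines = [
    "hub: Reply not received on queue route for 'probeDiscovery' (timeout), disconnecting",
    "hub: some unrelated message",
    "hub: Reply not received on queue route for 'probeDiscovery' (timeout), disconnecting",
]

disconnects = count_timeout_disconnects(sample_lines)
```

A count that keeps rising for probeDiscovery while the queue depth grows is consistent with the resend loop; a single occurrence on its own is not.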
To work around this hub behaviour, adjust the hub postroute_reply_timeout for the queue.
Open the hub in Raw Configure and find reply_timeout in the <postroute><queuename> section. Set it to 0 for probeDiscovery, then apply the changes to restart the hub.
The logs will confirm the value is set, for example:
<date><time> [thread] hub: Subscriber 'discovery_server' at '<IP>/51139' attached to queue 'probeDiscovery' (subject:probe_discovery requested bulk:0, granted bulk:1, minimum bulk:0, wait:0, heartbeat: 2, reply timeout: 0), time used: 4 ms
This should allow the discovery_server to process the graph and then continue with the rest of the queued messages.
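The check of the hub log for the new reply timeout can also be scripted. The sketch below assumes the "attached to queue" line keeps the layout shown in the examples in this article; the function name reply_timeout_for_queue is hypothetical, and the two sample lines are copied from the excerpts above.

```python
import re

# Pattern based on the "attached to queue" log lines quoted in this article.
ATTACH_RE = re.compile(
    r"Subscriber '(?P<subscriber>[^']+)' at '[^']*' "
    r"attached to queue '(?P<queue>[^']+)'"
    r".*?reply timeout:\s*(?P<reply_timeout>\d+)"
)


def reply_timeout_for_queue(log_lines, queue="probeDiscovery"):
    """Return the reply timeout from the last attach line for the queue, or None."""
    latest = None
    for line in log_lines:
        match = ATTACH_RE.search(line)
        if match and match.group("queue") == queue:
            latest = int(match.group("reply_timeout"))
    return latest


before = (
    "hub: Subscriber 'discovery_server' at 'IP/local port' attached to queue "
    "'probeDiscovery' (subject:probe_discovery requested bulk:0, granted bulk:1, "
    "minimum bulk:0, wait:0, heartbeat: 2, reply timeout: 60), time used: 0 ms"
)
after = (
    "<date><time> [thread] hub: Subscriber 'discovery_server' at '<IP>/51139' "
    "attached to queue 'probeDiscovery' (subject:probe_discovery requested bulk:0, "
    "granted bulk:1, minimum bulk:0, wait:0, heartbeat: 2, reply timeout: 0), "
    "time used: 4 ms"
)

timeout_before = reply_timeout_for_queue([before])
timeout_after = reply_timeout_for_queue([before, after])
```

The function returns the value from the most recent matching line, so running it on the log after the hub restart shows whether the change took effect.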