After OS patching and reboots, the Data Collector (DC) was started before the Data Aggregator (DA).
The Data Collector showed Not Connected until it was restarted, at which point it began working properly again.
How can this outage be prevented during future OS patching cycles?
All supported DX NetOps Performance Management releases
The Data Aggregator must be running and ready to accept Data Collector connections and communications before the Data Collector is started.
Ensure the Data Aggregator is started and running before starting the Data Collector.
To determine whether the Data Aggregator is ready to accept Data Collector connections, review the karaf.log on the Data Aggregator.
The default directory for karaf.log is /opt/IMDataAggregator/apache-karaf-<version>/data/log.
After a restart of the dadaemon service, the Data Aggregator is ready to accept Data Collector connections once the following message appears in its karaf.log.
INFO | ExtenderThread-1 | 2022-02-01T00:32:17,648 | DistributedItemRepositoryManager | DistributedItemRepositoryManager 1523 | .im.item-repository.impl | | DISTIREP-JMS: DistIrepMgr - Add RequestProcessor as messageListener for queue://DADistItemRepositoryMgr, workerPoolTheadCount=100
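The log check above could be scripted so that a patching runbook waits for the Data Aggregator before starting the Data Collector. The following is a minimal sketch, not a vendor-supplied tool. The function names da_ready and wait_for_da, the SKIP_LINES argument, and the timeout handling are hypothetical. It also assumes the marker text from the quoted log line is the same in your release. SKIP_LINES is intended to hold the line count of karaf.log recorded before dadaemon was restarted, so that a marker left over from an earlier start is ignored.

```shell
#!/bin/sh
# Marker text copied from the karaf.log line quoted in this article.
# Assumption: the wording is the same in the release you are running.
DA_READY_MARKER='Add RequestProcessor as messageListener for queue://DADistItemRepositoryMgr'

# da_ready LOGFILE [SKIP_LINES]   (hypothetical helper)
# Returns 0 if the marker appears in LOGFILE after the first SKIP_LINES lines.
da_ready() {
    log_file=$1
    skip_lines=${2:-0}
    [ -r "$log_file" ] || return 1
    awk -v skip="$skip_lines" -v marker="$DA_READY_MARKER" '
        NR > skip && index($0, marker) { found = 1; exit }
        END { exit !found }
    ' "$log_file"
}

# wait_for_da LOGFILE SKIP_LINES TIMEOUT_SECONDS [INTERVAL_SECONDS]   (hypothetical helper)
# Polls da_ready until it succeeds; returns 1 if the timeout is reached first.
wait_for_da() {
    log_file=$1
    skip_lines=$2
    timeout=$3
    interval=${4:-5}
    waited=0
    while ! da_ready "$log_file" "$skip_lines"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

One possible use: on the Data Aggregator, record the line count with wc -l before restarting dadaemon, then call wait_for_da with the full path to karaf.log under /opt/IMDataAggregator/apache-karaf-<version>/data/log, that line count, and a timeout. Start the Data Collector only if the call returns 0. If karaf.log rotates during the restart, the recorded line count no longer applies and 0 should be passed instead.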
A properly running DC dcmd service shows two processes in its status output: one for the dcmd karaf service and one for the ICMP service.
If the DC is started before the DA, the dcmd service will die but leave the ICMP service running. Because of this, systemctl or service status commands will show the dcmd service as active when it is not running.
In this state, further attempts to start the dcmd service either return an error or appear to succeed without anything actually starting.
To recover, run a systemctl or service stop command against the dcmd service to shut down the leftover ICMP service. The start command should then work and result in a running dcmd service.
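The stop-then-start recovery described above can be sketched as a small shell function. This is an illustration only. The function name recover_dcmd and the SYSTEMCTL override variable are hypothetical; the dcmd service name and the systemctl stop, start, and status subcommands are the ones referred to in this article. It assumes a systemd host and a user with permission to manage the service.

```shell
#!/bin/sh
# Command used to manage the service. Defaults to systemctl; the override
# exists only so the sequence can be dry-run (for example SYSTEMCTL=echo).
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# recover_dcmd   (hypothetical helper)
# Stops dcmd to clear the leftover ICMP process, starts it again,
# then prints the status so the two expected processes can be checked.
recover_dcmd() {
    "$SYSTEMCTL" stop dcmd || return 1
    "$SYSTEMCTL" start dcmd || return 1
    "$SYSTEMCTL" status dcmd
}
```

After running recover_dcmd on the Data Collector host, the status output should again list both the dcmd karaf process and the ICMP process.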