How to properly restart the DX NetOps environment

Products

CA Performance Management Network Observability

Issue/Introduction

This article lists the specific commands, and the correct order, for restarting the services on each DX NetOps component server.

The system that a NetOps component runs on sometimes requires a restart or reboot, for example for regular system maintenance or OS patches. When this is done, some steps are required to maintain the health of the system and ensure a trouble-free restart.

Environment

All supported DX NetOps releases

Resolution

Data Aggregator (DA) Host
The DA runs the dadaemon and activemq services on the DA host. When restarting, it is recommended that you run both the stop and start commands even if the processes are not running.
Stopping and starting the DA dadaemon service on its own is done using the 'service' command.
Run the commands as root, or as the sudo user that owns the installation (the same user that ran the installation).
Check the activemq and/or dadaemon service status:
service dadaemon status
service activemq status

Stop and then restart the services in the following order:

Stop the dadaemon service:
service dadaemon stop
If it is still running, stop the activemq service:
service activemq stop

Start the dadaemon service:
service dadaemon start
Note that the activemq service should start when the dadaemon service is started.
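The DA sequence above can be sketched as a small shell function. This is a minimal sketch, not a supported script: the run helper, the DRY_RUN variable and the restart_da name are assumptions made for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to execute them (as root on the DA host).
# run, DRY_RUN and restart_da are illustrative names, not part of DX NetOps.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

restart_da() {
  run service dadaemon stop
  run service activemq stop    # run even if activemq is already down
  run service dadaemon start   # activemq should start along with dadaemon
}

restart_da
```

Review the printed commands first, then rerun with DRY_RUN=0 once the order looks right.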

Per the user guides, it is recommended that the dadaemon service have a cron job configured to automatically restart it after 60 seconds should it go down on its own. This is very important to note: if the goal is to stop the Data Repository (DR) DB, the DA dadaemon service must be shut down first to close any open connections it has with the DR DB. If this cron job is configured, disable or remove it before stopping the dadaemon to prevent the dadaemon from restarting on its own.
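One way to disable such a cron entry is to comment it out rather than delete it. The sketch below assumes the entry can be recognized by the string "dadaemon" on its line. The exact cron line depends on how it was set up at your site, so the sample entry shown is hypothetical, as is the comment_out_dadaemon function name.

```shell
#!/bin/sh
# Reads crontab text on stdin and writes it to stdout with every active
# line that mentions "dadaemon" commented out. Lines that are already
# comments are passed through unchanged.
# Assumption: the restart entry contains the string "dadaemon".
comment_out_dadaemon() {
  sed -e '/^[[:space:]]*#/b' -e '/dadaemon/s/^/#/'
}

# Preview on a sample line (this entry is hypothetical):
printf '%s\n' '* * * * * /hypothetical/restart_dadaemon.sh' | comment_out_dadaemon

# To apply it to the current user's real crontab (not run here):
#   crontab -l | comment_out_dadaemon | crontab -
```

Commenting the line out keeps the original entry in place, so it can be restored by removing the leading '#' after maintenance.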

Fault-Tolerant Data Aggregator (DA) Host

In a fault-tolerant DA environment, the DA and activemq services are managed by the proxy server.

The dadaemon and activemq services are stopped and started using scripts that tell the consul service whether the DA is available. The proxy then decides which DA to start and when.

To stop the DA in a proxy environment, put it into maintenance mode:

/opt/IMDataAggregator/scripts/dadaemon maintenance

To start it again, activate it:

/opt/IMDataAggregator/scripts/dadaemon activate

Please note that it can take more than 5 minutes for the proxy to send the actual start commands to the DA host.

To see when the start message arrives, monitor this log file:

/opt/IMDataAggregator/consul-ext/data/logs/consul-ext.log
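Rather than watching the log by eye, a small polling helper can wait for a new matching line. This is a sketch under assumptions: wait_for_log_line is a hypothetical helper, and the text of the start message is not given in this article, so the pattern must be supplied by the operator after looking at the log.

```shell
#!/bin/sh
# wait_for_log_line FILE PATTERN [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Polls FILE for a line matching PATTERN that was appended after the
# call started. Returns 0 when found, 1 on timeout, 2 if FILE is missing.
# Hypothetical helper; it does not handle log rotation.
wait_for_log_line() {
  file=$1
  pattern=$2
  timeout=${3:-900}
  interval=${4:-10}

  [ -f "$file" ] || return 2
  # Only lines after the current end of the file count as "new".
  first_new=$(( $(wc -l < "$file") + 1 ))

  waited=0
  while :; do
    if tail -n "+$first_new" "$file" | grep -q -- "$pattern"; then
      return 0
    fi
    [ "$waited" -ge "$timeout" ] && return 1
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Example call (PATTERN is a placeholder to replace with the real message text):
#   wait_for_log_line /opt/IMDataAggregator/consul-ext/data/logs/consul-ext.log 'PATTERN' 900 10
```

The default timeout of 900 seconds is an arbitrary choice that leaves room for the delay of more than 5 minutes mentioned above.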

Data Collector (DC) Host
The DC runs the dcmd and activemq services on the DC host. When restarting, it is recommended that you run both the stop and start commands even if the processes are not running.
Stopping and starting the DC dcmd and activemq services on their own is done using the 'service' command.
Run the commands as root, or as the sudo user that owns the installation (the same user that ran the installation).

Check the activemq and/or dcmd service status:
service dcmd status
service activemq status

Stop the dcmd service:
service dcmd stop
If it is still running, stop the activemq service:
service activemq stop

Start the dcmd service:
service dcmd start
Note that the activemq service should start when the dcmd service is started.
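The "if still running" check in the DC steps can be expressed with the exit status of the status command. This sketch assumes that 'service activemq status' exits with 0 only while the service is running, which is the usual convention but should be confirmed on your host. The run, is_running and restart_dc names are illustrative.

```shell
#!/bin/sh
# Dry-run by default: stop/start commands are printed, not executed.
# Set DRY_RUN=0 to execute them (as root on the DC host).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Assumption: the status command exits 0 only when the service is running.
is_running() {
  service "$1" status >/dev/null 2>&1
}

restart_dc() {
  run service dcmd stop
  if is_running activemq; then
    run service activemq stop   # only needed if activemq outlived dcmd
  fi
  run service dcmd start        # activemq should start along with dcmd
}

restart_dc
```

In dry-run mode the status check still queries the real service, so the printed plan reflects the state of activemq at that moment.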

Data Repository (DR) Host
The DR host only requires that the DB be stopped prior to system restarts. The Vertica processes that run the DB can be left as is during a restart. The key detail is ensuring the DR DB is stopped. Also, as noted above in the DA section, the dadaemon service for the DA MUST be shut down before stopping the DR DB. If the DA is still running when you try to stop the DB via the adminTools utility, this message is shown:
"Error: NOTICE 2519: Cannot shut down while users are connected"
If that message appears in a popup in adminTools when trying to stop the DB, check that the dadaemon service is shut down and not running.
To shut down the DR DB (can be done from any active node in a multi-node cluster):
Log into the DR DB host as the dradmin or equivalent DB admin user created during installation.
Go to the /opt/vertica/bin directory and run:
./adminTools
If the environment is configured properly you may alternatively run this from any location:
/opt/vertica/bin/adminTools
Choose option 4 "Stop the Database".
Choose the DB name to stop. Standard environments should only have one entry.
Enter the password and wait for the DB to stop.
When complete, in the adminTools main menu select option 1 "View Database Cluster State" to ensure all nodes show as down before they are rebooted.
To start the DB again after the host is rebooted, if it was not automatically restarted, choose option 3 "Start the Database", choose the DB to restart, enter the correct password and wait for the restart to complete.
To check the status once more, choose option 1 "View Database Cluster State" to ensure all nodes show as up after the reboot.
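The menu steps above can also be sketched with the adminTools command-line tools stop_db, view_cluster and start_db. This is an illustration only, not the procedure this article prescribes: the database name exampledb is hypothetical, password handling is deliberately left out, and the function names are assumptions. The cluster state reported this way is the same one shown by menu option 1.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to execute them (as the dradmin or equivalent user).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

ADMINTOOLS=/opt/vertica/bin/adminTools

# Stop the named DB, then show the cluster state to confirm nodes are down.
stop_dr_db() {
  run "$ADMINTOOLS" -t stop_db -d "$1"
  run "$ADMINTOOLS" -t view_cluster -d "$1"
}

# Start the named DB, then show the cluster state to confirm nodes are up.
start_dr_db() {
  run "$ADMINTOOLS" -t start_db -d "$1"
  run "$ADMINTOOLS" -t view_cluster -d "$1"
}

# "exampledb" is a hypothetical name; use the DB name shown in adminTools.
stop_dr_db "${DB_NAME:-exampledb}"
```
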
Note that the DB will not shut down while open connections from the dadaemon exist. Even once the dadaemon is shut down, we have seen instances where the DB did not fully complete its shutdown despite adminTools showing its status as down. This is known as a 'dirty' DB shutdown: the DB appears to be down to the user and the system but is not fully down. When the restart is then performed, the DB will not restart without restoration using the last known good epoch. While there is no simple way to check the DB state outside of the adminTools UI, patience will often help avoid this problem. When stopping the DB, allow if possible an extra 20-40 minutes before restarting the DR host.

NetOps Portal Host
The CAPM host runs four primary services:

caperfcenter_console
caperfcenter_devicemanager
caperfcenter_eventmanager
caperfcenter_sso

There are also MySQL services. Under normal conditions the MySQL services do not require a user-initiated stop or start cycle.
The four primary services should be stopped and started in a specific order for best results. When stopping the services, do so in this order:

Stop the caperfcenter_console service
Stop the caperfcenter_devicemanager service
Stop the caperfcenter_eventmanager service
Stop the caperfcenter_sso service

Restart the services in the reverse order. Please note the delay before the console service is started:

Start the caperfcenter_sso service
Start the caperfcenter_eventmanager service
Start the caperfcenter_devicemanager service
Wait one minute, then start the console service:
Start the caperfcenter_console service
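The stop and start order for the Portal services, including the one-minute wait before the console service, can be sketched as follows. The run helper, the DRY_RUN variable and the function names are assumptions for illustration. By default the script only prints what it would do.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to execute them (as root on the Portal host).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Stop order: console, devicemanager, eventmanager, sso.
stop_portal() {
  for svc in console devicemanager eventmanager sso; do
    run systemctl stop "caperfcenter_$svc"
  done
}

# Start order is the reverse, with a one-minute wait before the console.
start_portal() {
  for svc in sso eventmanager devicemanager; do
    run systemctl start "caperfcenter_$svc"
  done
  run sleep 60
  run systemctl start caperfcenter_console
}

stop_portal
start_portal
```

Keeping the two orders in separate functions makes it possible to stop the services before a reboot and start them again afterwards.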

To check the status of an individual Portal service, run:
systemctl status caperfcenter_<serviceName>
For example:
systemctl status caperfcenter_console

To stop an individual Portal service, run:
systemctl stop caperfcenter_<serviceName>
For example:
systemctl stop caperfcenter_console

To start an individual Portal service, run:
systemctl start caperfcenter_<serviceName>
For example:
systemctl start caperfcenter_console