A Data Repository node is consuming excessive CPU resources
A stale process left over after the upgrade is impacting node 1 of a 3-node cluster. It is consuming 98% of the CPU resources on that Vertica node.
As a result, the database cannot be started or used.
The details of the offending process are:
Note that this system had been upgraded just 10 days earlier, from the older CAPM release 2.4.1 to the latest 2.8 release.
The /opt/vertica/bin/dialog command and its related process are started when the /opt/vertica/bin/adminTools UI is launched by the dradmin user. Under normal circumstances, something like the following is running when adminTools has been started properly.
Another key clue is that the errant process reported an older release of Vertica than was actually installed: it showed release 7.0.2-5 when it should have shown 7.1.2-6.
To resolve this:
This was a somewhat unique situation. It is unlikely that other users will run into it, but in case they do, the information is published here for the user community.
What made the problem odd is that launching adminTools showed the current release, 7.1.2-6, which is correct.
It is also unclear why the process was not killed during the upgrade, since its start date of 10/23 is the same day the upgrade was run.
An educated guess is that someone logged in to node 1 and launched the adminTools UI to stop the database in preparation for the upgrade, then closed the terminal window without first exiting adminTools properly. That left the process running and prevented the node from starting properly after the upgrade.
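If that guess is right, checking each node for long-running adminTools dialog processes before and after an upgrade could catch the condition early. The following is a minimal Python sketch of such a check, offered as an illustration rather than as the resolution used in this case. It assumes a Linux `ps` that accepts `-eo pid,user,pcpu,etime,args`; the function names and the one-hour age threshold are hypothetical choices, and the script only reports candidates without killing anything.

```python
import subprocess

# Path taken from the article; adjust if the installation differs.
DIALOG_PATH = "/opt/vertica/bin/dialog"


def parse_etime(etime):
    """Convert a ps 'etime' value ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    # Pad missing leading fields (hours, or hours and minutes) with zeros.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_stale_dialog(ps_output, min_age_seconds=3600):
    """Return dialog processes older than min_age_seconds (assumed threshold)."""
    stale = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue
        pid, user, pcpu, etime, args = fields
        age = parse_etime(etime)
        if DIALOG_PATH in args and age >= min_age_seconds:
            stale.append({
                "pid": int(pid),
                "user": user,
                "cpu": float(pcpu),
                "age_seconds": age,
                "args": args,
            })
    return stale


def list_processes():
    """Run ps and return its text output (assumes a procps-style ps)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,user,pcpu,etime,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Running `find_stale_dialog(list_processes())` on a node returns the candidates for manual review. Any entry owned by dradmin that predates the upgrade is worth inspecting before the database is started.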