All jobs are stuck in running state or Job Status is "Queued for execution"

Products

VMware Smart Assurance

Issue/Introduction

Symptoms:

All jobs in Smarts NCM are stuck in a running state, or show a Job Status of "Queued for execution".

Environment

NCM-10.1.x

Cause

There can be multiple causes for this issue. One possible cause is a loss of connectivity between the Application Server (AS) and a Device Server (DS). This can cause the command files that the AS and DS hosts use to communicate with each other to become unsynchronized.

Resolution

Always test connectivity between the AS and DS hosts before attempting the steps listed below, since those steps will have little or no effect if connectivity is the problem. One way to test connectivity is to use the OpenSSL instance installed by NCM on both the AS and DS hosts to test the connection directly from a command line session on each host. To do this, run the following command in a command line session on the AS host and on each DS host to establish a direct connection to the other host outside of NCM:

openssl s_client -connect {target host ip}:443 -CApath {NCM home path}/conf/CA/

A successful connection will usually yield a fairly verbose result that contains information about the connection request and the certificate validation. Among the various lines returned, a successful connection will contain output similar to the following two lines:

...
CONNECTED(00000003)
...
SSL handshake has read 2726 bytes and written 383 bytes
...

If the connection fails, the last line of the output will usually return a failure code in the form of a non-zero number. If this occurs, troubleshoot connectivity before proceeding with the steps below, which may be unnecessary once connectivity is restored. However, if the connection succeeds, or if the issue persists after connectivity is restored, it is likely that the command files NCM uses to communicate between the AS and the DS are genuinely out of sync. Out-of-sync command files must be cleared from the instance so that new jobs can create an entirely new set of command files that are properly synchronized and able to flow normally between the AS and DS hosts.
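The success and failure criteria described above can be checked mechanically. The following is a minimal sketch, not part of NCM: `check_s_client_output` is a hypothetical helper name, and it assumes that the failure code mentioned above is the `Verify return code` line that `openssl s_client` prints near the end of its output.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with NCM): reads `openssl s_client` output
# on stdin and classifies it.
#   returns 0: connected and certificate verification reported code 0
#   returns 1: no CONNECTED(...) line was found
#   returns 2: connected, but no "Verify return code: 0 (ok)" line was found
check_s_client_output() {
    output=$(cat)
    printf '%s\n' "$output" | grep -q '^CONNECTED(' || return 1
    printf '%s\n' "$output" | grep -q 'Verify return code: 0 (ok)' || return 2
    return 0
}

# Example use (replace the {placeholders} exactly as in the command above):
#   openssl s_client -connect {target host ip}:443 \
#       -CApath {NCM home path}/conf/CA/ </dev/null 2>&1 | check_s_client_output
#   echo "result: $?"
```

Redirecting standard input from `/dev/null` makes `s_client` exit once the handshake completes instead of waiting for interactive input.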

To clear NCM command files from the instance, do as follows:

Instance Wide Preliminary Steps:

Open a separate Linux shell session as 'root' on each NCM Device Server (DS) host in the instance, on the Application Server (AS) host, and on the DB host (if the database resides on a separate host from the AS). Keep all of these sessions open at the same time.
Run the following command to set NCM-related shell session variables in the AS host shell session as well as in all DS host shell sessions opened in Step 1 above:

source /etc/voyence.conf
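Later steps change into directories under `$VOYENCE_HOME` before deleting files, so it may be worth confirming that the variable was actually set in each shell. The sketch below assumes that `/etc/voyence.conf` is what defines `VOYENCE_HOME` (as the later steps imply); `require_voyence_home` is a hypothetical helper name, not an NCM command.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with NCM): fail early if VOYENCE_HOME is
# unset or does not point to an existing directory. If the variable were
# empty, a later `cd $VOYENCE_HOME/data/...` would fail and the `find`
# cleanup commands would then run in whatever directory the shell was
# already in.
require_voyence_home() {
    if [ -z "${VOYENCE_HOME:-}" ]; then
        echo "VOYENCE_HOME is not set; source /etc/voyence.conf first" >&2
        return 1
    fi
    if [ ! -d "$VOYENCE_HOME" ]; then
        echo "VOYENCE_HOME is not a directory: $VOYENCE_HOME" >&2
        return 1
    fi
    echo "VOYENCE_HOME=$VOYENCE_HOME"
}

# Example use, after sourcing /etc/voyence.conf in the same shell:
#   require_voyence_home
```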

Stop NCM services on all Device Server (DS) hosts to prevent receipt and processing of any new SNMP traps from devices under management during this process:

/etc/init.d/vcmaster stop

Application Server:

Log into the NCM Client Application as 'sysadmin'.
Cancel all currently running jobs from all users, including the system user (e.g., pull jobs scheduled in response to SNMP device configuration state change traps), as well as any pending jobs that are scheduled to begin during the maintenance window declared for this process. Note: It is not necessary to cancel recurring series jobs. However, you must verify that they are not scheduled to create any child jobs during this process.
Log out of the NCM Client Application.
Switch to the controldb/DB shell you opened in Step 1 above (which may also be the AS shell if the controldb resides on the AS host) and log into the controldb using the following command (Note: The current controldb password will be required):

su - pgdba -c 'psql voyencedb voyence'

Run the following query to confirm that no jobs are in a running state:

SELECT
  status,
  count(*)
FROM cm_job
WHERE status LIKE '%running'
GROUP BY status;

If the query returns any rows (that is, a count greater than zero for any status), run the following query to cancel the remaining jobs that failed to cancel through the NCM Client in Step 5 above:

UPDATE cm_job
SET    status = 'enum.taskStatus.canceled'
WHERE  status LIKE '%running';

Log out of the controldb using the following command:

\q
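Once out of the psql session, the running-job count can be re-checked from the shell if desired. This is a sketch under assumptions: `sum_running_jobs` is a hypothetical helper name, it expects the count query's output in psql's unaligned, tuples-only format (`psql -A -t`, one `status|count` line per status), and the status strings used in the example are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with NCM): reads "status|count" lines on
# stdin (psql unaligned, tuples-only output) and prints the total count.
# Empty input, meaning the query returned no rows, prints 0.
sum_running_jobs() {
    awk -F'|' 'NF >= 2 { total += $NF } END { print total + 0 }'
}

# Example use (the current controldb password will be required):
#   su - pgdba -c "psql -A -t -c \"SELECT status, count(*) FROM cm_job WHERE status LIKE '%running' GROUP BY status;\" voyencedb voyence" | sum_running_jobs
```

A printed total of 0 indicates that no jobs remain in a running state.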

Switch to the AS shell you opened in Step 1 above and run the following command to stop services on the AS host:

/etc/init.d/vcmaster stop

Run the following commands in sequence in the AS shell to clear all command files residing on the AS host:

cd $VOYENCE_HOME/data/appserver/pops
find . -name "acmd_*xml" -exec rm -f {} \;
find . -name "cmd_*xml" -exec rm -f {} \;
find . -name "status_*" -exec rm -f {} \;

Device Server:

Switch to each DS shell in turn, then run the following commands in sequence in each DS to clear all command files residing on each DS host (Note: If the AS host is also a DS because the Combination Server option was selected at time of installation, the commands indicated below need to be run on the AS host as well):

cd $VOYENCE_HOME/data/devserver/syssync
find . -name "acmd_*xml" -exec rm -f {} \;
find . -name "cmd_*xml" -exec rm -f {} \;
find . -name "status_*" -exec rm -f {} \;
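Because the cleanup commands delete files without confirmation, it can help to preview what they will match before running them. The sketch below is not part of NCM: `list_command_files` is a hypothetical helper name that lists files matching the same three name patterns in a given directory without removing anything. Unlike the commands above, it restricts matches to regular files with `-type f`.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with NCM): list, without deleting, the
# files under the given directory that match the acmd_*xml, cmd_*xml, and
# status_* patterns used by the cleanup commands.
list_command_files() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    find "$dir" -type f \
        \( -name 'acmd_*xml' -o -name 'cmd_*xml' -o -name 'status_*' \) -print
}

# Example use on the AS host and on a DS host, respectively:
#   list_command_files "$VOYENCE_HOME/data/appserver/pops"
#   list_command_files "$VOYENCE_HOME/data/devserver/syssync"
```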

Start all NCM Services on all NCM hosts again by running the following command in each shell you opened in Step 1 above - first on the AS host, then on each DS host:

/etc/init.d/vcmaster start

Confirm that jobs are running again by logging into the NCM Client from your workstation, then running a "Test Credentials" job against one device residing on each device server in your instance.