Schema attribute creation and updates are not applied correctly

Products

CA Service Management - Service Desk Manager CA Service Desk Manager ServiceDesk

Issue/Introduction

When attempting to publish schema changes, receiving an error during startup attempt. Error that presents is "error 1067: The process terminated unexpectedly". A subsequent startup attempt will cause the schema changes to fail, relegating the changes into several ver_old files.

C:\PROGRA~2\CA\SERVIC~1\site
ddict.sch.ver_old
tblobj.cfg.ver_old

C:\PROGRA~2\CA\SERVIC~1\site\mods
server_secondary_custom.ver.ver_old

C:\PROGRA~2\CA\SERVIC~1\site\mods\majic
wsp.mods.pub.ver_old
wsp.mods.ver_old

In the scenario below that leads to the problem:

On Server1/BG, create a schema update
Failover to make Server2/BG and Server1/SB (run pdm_server_control -b on Server2). This will automatically stop services on Server1/SB
On Server1/SB, stop services, run pdm_publish, hit "N" to allow for manual startup/failover.
Attempt to start services on Server1/SB.
Startup of services on Server1/SB fails. "error 1067: The process terminated unexpectedly"
Subsequent restart of Server1 causes schema change loss on Server1 (ver_old files)

Environment

CA Service Desk Manager (all versions) running Advance Availability.

Cause

There is a checksum failure taking place during startup. The stdlogs across all servers involved may indicate the following messages during startup attempts

Logging which may present in the given servers. Entries of interest highlighted.

Server1 may report:

04/24 14:12:29.24 SERVER1 pdm_rfbroker_nxd         372 ERROR        rfbroker.c            1422 Register server SERVER1 (24267) failed: checksum mismatch

04/24 14:12:29.25 SERVER1 pdm_rfbroker_nxd         372 SIGNIFICANT  api_misc.c             585 Requesting shutdown of system. Reason (Server checksum mismatch)

Server2 may in turn report:

04/24 14:12:29.20 SERVER2 pdm_rfbroker_nxd        6988 ERROR        ServerStatusMonitor.  1956 Unable to register server SERVER1 (24267) as checksum count 3 would exceed maximum of 2 (2 servers registered)

In the given scenario, there will also exist a Server3/App server in play and has its own checksum values that are causing the checksum tests to fail.

During startup of the given servers in Advance Availability, a checksum test is done on the ddict.sch, majic, and wsp.mods files to ensure consistency across these servers. What is happening is that Server1, Server2, and Server3 will each have differing content in some or all of these files, causing the checksum test to fail when Server1/SB tries to start under such conditions,

A common scenario that could give rise to this situation: all three servers each have their own unique version of the ddict.sch files which is causing the fault.

In such a scenario with schema changes that are being made, the Server3/App may not have been regularly recycled previously, which could cause its own ddict.sch, majic, and wsp.mods files to fall out of sync with both Server1 or Server2. When a custom change is being rolled out, the general procedure would be to establish the changes first between BG and SB Servers, then do a recycle to disseminate the changes out to the other constituent App servers in the environment.

Resolution

Preventive action is to perform a recycle of Services on the SB and app servers before any schema updates are performed on Server1.

Revising the above steps:

On Server2/SB, and Server3/App, restart services. If there are multiple app servers, restart each one in turn. This will cause the ddict.sch, majic, and wsp.mods files on Server2/SB and Server3/App to synchronise with Server1/BG.
On Server1/BG, create a schema update
Failover to make Server2/BG and Server1/SB. (run pdm_server_control -b on Server2). This will automatically stop services on Server1/SB
On Server1/SB, run pdm_publish, hit "N" to allow for manual startup/failover.
Start services on Server1/SB

Server1/SB services should then startup successfully. You can then run "pdm_server_control -b" on Server1/SB to failover and make Server1/BG and Server2/SB, then recycle Server2 and the app servers to have them synchronise with Server1/BG.

Additional Information

If you are in a situation where the schema updates appear to be lost on Server1, started services after receiving the "error 1067" message, you will need to manually restore the ver_old files as the active files on Server1/SB.

Verify that the given ver_old files contain the updated schema content

NX_ROOT\site
ddict.sch.ver_old
tblobj.cfg.ver_old

NX_ROOT\site\mods
server_secondary_custom.ver.ver_old

NX_ROOT\site\mods\majic
wsp.mods.pub.ver_old
wsp.mods.ver_old
Stop services on Server1/SB. Make sure to maintain Server2/BG in a running state; do not failover to make Server1/BG.
Make a backup of these files on Server1/SB (copy the files out to a separate location away from the SDM install directory; do not copy/paste the files in place)

NX_ROOT\site
ddict.sch.ver_old
tblobj.cfg.ver_old
ddict.sch
tblobj.cfg

NX_ROOT\site\mods
server_secondary_custom.ver.ver_old
server_secondary_custom.ver

NX_ROOT\site\mods\majic
wsp.mods.pub.ver_old
wsp.mods.ver_old
wsp.mods.pub
wsp.mods
Delete the existing versions of these files on Server1/SB

NX_ROOT\site
ddict.sch
tblobj.cfg

NX_ROOT\site\mods
server_secondary_custom.ver

NX_ROOT\site\mods\majic
wsp.mods.pub
wsp.mods
Rename the existing ver_old files on Server1/SB

NX_ROOT\site
ddict.sch.ver_old -> ddict.sch
tblobj.cfg.ver_old -> tblobj.cfg

NX_ROOT\site\mods
server_secondary_custom.ver.ver_old -> server_secondary_custom.ver

NX_ROOT\site\mods\majic
wsp.mods.pub.ver_old -> wsp.mods.pub
wsp.mods.ver_old -> wsp.mods
Recycle all of the app servers to have them synchronise their checksum content with Server2/BG
Run pdm_server_control -v on Server1/SB to suppress version control

Server1/SB services should then startup successfully. You can then run "pdm_server_control -b" on Server1/SB to failover and make Server1/BG and Server2/SB, then recycle Server2 and the app servers to have them synchronise with Server1/BG.