Multiple SpectroSERVERs fail over to secondary at once without any problems seen on the SpectroSERVERs.

Article ID: 121249

Updated On:

Products

Spectrum

Issue/Introduction

  • Multiple SpectroSERVERs fail over to secondary simultaneously.
  • They all fail back to primary simultaneously without any intervention.
  • No crashes or dumps are seen on the SpectroSERVERs. The only error logged is the following in the VNM.OUT:

ERROR TRACE at CsIHVnmMdl.cc(2276): The OneClick application on host <host> has not responded to updates from this server for a period of time. The application is either not responding or a network problem is preventing access.

Cause

The VNM errors point to a possible communication or networking issue between OneClick (OC) and the SpectroSERVER (SS). Such an issue can cause the SpectroSERVER to fail over, but it should not cause the SS to crash.

The secondary SpectroSERVER continuously monitors the status of its paired primary SpectroSERVER using heartbeat signals.
Upon detecting the primary's failure, the secondary SpectroSERVER initiates the failover sequence and becomes active. OneClick performs a similar check and fails over in the console. None of this is orchestrated by the MLS, which only manages the landscape map. Each SS and OC pulls that map so that they all work from a single map.

Resolution

Verify the following:

1. Hostname resolution from the SS to the OC and from the OC to the SS

2. Ensure the IP address or hostname of the OC is in the $SPECROOT/.hostrc file on the SS

3. Ensure the required ports are open through firewalls between the SS and OC. Reference the "Communication Across Firewalls" section of the documentation for a list of the ports that must be open for SS/OC communication.

4. Check whether network issues are causing dropped packets. The best way to test is from the OC to the SS, filtering on ports 14001 and 14002 and checking the CORBA communications. When these errors appear in the VNM.OUT and during the false failovers, you should see no response to CORBA requests from the SS.
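
Checks 1 and 3 above can be partly scripted. The following is a minimal sketch, not a Spectrum utility. It assumes Python 3 is available on the host you run it from, that ports 14001 and 14002 (the ports named in step 4) are the ones of interest, and that `ss-primary.example.com` is a placeholder for your own SS or OC hostname. A successful TCP connection only shows that the port was reachable at that moment; it does not prove that CORBA requests are being answered.

```python
import socket
import sys


def resolve(host):
    """Return the sorted IP addresses that host resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_host(host, ports=(14001, 14002)):
    """Collect name resolution and port reachability results for one host."""
    return {
        "addresses": resolve(host),
        "ports": {port: port_open(host, port) for port in ports},
    }


if __name__ == "__main__":
    # Placeholder hostname; pass the real SS or OC hostname as an argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "ss-primary.example.com"
    print(check_host(target))
```

Run it from the SS against the OC and from the OC against the SS, since resolution and firewall rules can differ in each direction.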

Additional Information

There are two parts to SpectroSERVER fault tolerance, and they are independent of each other.

The secondary SS "polls" (via API call) the primary SS every minute. If the polling determines the primary is down, the secondary starts SNMP polling, processing SNMP traps, and so on, and essentially takes over the role of the primary SS. If a network issue is preventing the polling, both SpectroSERVERs can be 'active'.

The OneClick server polls both the primary and the backup SpectroSERVER, via CORBA, every 10s, with a 'deeper' poll every 60s. If the polling determines the primary is down, OneClick will fail over and connect to the secondary. If a network issue is blocking the communication, it is possible for the OneClick server to fail over to a secondary that is not 'active', which will display as a grey/suppressed VNM icon.
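
Because the two mechanisms decide independently, a partial network problem can leave them disagreeing. The sketch below is a simplified illustrative model of the two paragraphs above, not Spectrum code: the `Links` class, the `outcome` function and all field names are hypothetical, and timing (the 10s, 60s and one-minute poll intervals) is ignored.

```python
from dataclasses import dataclass


@dataclass
class Links:
    secondary_to_primary: bool  # the secondary SS poll of the primary gets through
    oc_to_primary: bool         # the OneClick poll of the primary gets through
    primary_up: bool = True     # the primary SS process is really running


def outcome(links):
    """Derive what each side decides from what it can see of the primary."""
    seen_by_secondary = links.primary_up and links.secondary_to_primary
    seen_by_oc = links.primary_up and links.oc_to_primary

    # Each decision uses only that component's own view of the primary.
    secondary_active = not seen_by_secondary
    oc_target = "primary" if seen_by_oc else "secondary"

    return {
        "secondary_active": secondary_active,
        "both_active": links.primary_up and secondary_active,
        "oneclick_connected_to": oc_target,
        # OneClick is on a secondary that never took over.
        "grey_vnm_icon": oc_target == "secondary" and not secondary_active,
    }


if __name__ == "__main__":
    # Only the OneClick-to-primary path is broken.
    print(outcome(Links(secondary_to_primary=True, oc_to_primary=False)))
    # Only the secondary-to-primary path is broken.
    print(outcome(Links(secondary_to_primary=False, oc_to_primary=True)))
```

In this model the first case yields the grey VNM icon and the second yields two active SpectroSERVERs, matching the two failure modes described above.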

The API calls are not just "Are you up?" but require an intelligent response that only a process that is not hung could give. This avoids problems when the SS hangs but does not crash: if we only checked that the SS process was running, we would not fail over. The primary therefore needs to give an intelligent response that only an active SS process could give.
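
The difference between "the process exists" and "the process can answer" can be illustrated with a generic sketch. This is not how Spectrum implements its API poll; the worker thread, the `start_worker` and `is_responsive` names, and the "reply with nonce + 1" protocol are all assumptions made for illustration.

```python
import queue
import threading


def _worker(requests, hung):
    """Answer each request with nonce + 1 unless the worker is 'hung'."""
    while True:
        nonce, reply = requests.get()
        if hung.is_set():
            continue  # simulate a hang: still running, but never answers
        reply.put(nonce + 1)


def start_worker():
    """Start the simulated server; return (thread, request queue, hang switch)."""
    requests = queue.Queue()
    hung = threading.Event()
    thread = threading.Thread(target=_worker, args=(requests, hung), daemon=True)
    thread.start()
    return thread, requests, hung


def is_responsive(requests, nonce, timeout=0.5):
    """Return True only if a correct answer arrives within timeout seconds."""
    reply = queue.Queue()
    requests.put((nonce, reply))
    try:
        return reply.get(timeout=timeout) == nonce + 1
    except queue.Empty:
        return False


if __name__ == "__main__":
    thread, requests, hung = start_worker()
    print("alive:", thread.is_alive(), "responsive:", is_responsive(requests, 41))
    hung.set()
    print("alive:", thread.is_alive(), "responsive:", is_responsive(requests, 41))
```

After the hang is triggered the worker is still alive, so a simple existence check would pass, while the request that needs a computed reply times out.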