SS hangs in mutex causing OneClick to be switched over and the SS will not stop

Article ID: 410085

Products

Network Observability Spectrum

Issue/Introduction

OneClick displayed a yellow box around the client and showed it as switched over; however, the primary SpectroSERVER (SS) was still running.

An attempt to stop the SpectroSERVER with the ./stopSS.pl script did not succeed: the script just sits in the "Stopping SpectroSERVER" phase.

 

To diagnose the issue, run pstack:

Obtain the process ID of the SpectroSERVER (<SS_pid>). Use either the Linux "top" command or "ps -ef | grep -i SpectroSERVER".

As root (or with sudo), run pstack in a loop and append the stack output to a file:
while true; do pstack <SS_pid>; sleep 30; done >> pstack_SS.out

For example, if the process ID of the SpectroSERVER is 3442:
while true; do pstack 3442; sleep 30; done >> pstack_SS.out

Let this run for about 15 minutes, then stop it with Ctrl+C.
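The capture loop above can also be written so that it stops on its own. The following is a minimal sketch, not a supported tool: the function name capture_stacks and the STACK_CMD override are hypothetical names introduced here for illustration, and the defaults (900 seconds total, 30 seconds between samples, pstack_SS.out as the output file) simply mirror the steps above. It also adds a timestamp line before each sample, which the original loop does not do.

```shell
#!/bin/bash
# capture_stacks PID [DURATION_SECONDS] [INTERVAL_SECONDS] [OUTFILE]
# Appends a timestamped stack dump to OUTFILE every INTERVAL seconds
# until DURATION seconds have elapsed or the process no longer exists.
capture_stacks() {
    local pid=$1
    local duration=${2:-900}
    local interval=${3:-30}
    local outfile=${4:-pstack_SS.out}
    # STACK_CMD is a hypothetical override (useful for dry runs); default is pstack.
    local dump_cmd=${STACK_CMD:-pstack}
    local end=$(( $(date +%s) + duration ))

    while [ "$(date +%s)" -lt "$end" ]; do
        # kill -0 sends no signal; it only checks that the process exists.
        kill -0 "$pid" 2>/dev/null || break
        {
            echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
            "$dump_cmd" "$pid"
        } >> "$outfile" 2>&1
        sleep "$interval"
    done
}
```

With the example process ID from above, running "capture_stacks 3442" as root would sample for about 15 minutes and then return, leaving the output in pstack_SS.out.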


Please upload the pstack_SS.out file to a support ticket.

In the pstack_SS.out file, "Thread 1" is waiting on a mutex lock in the triggerFailSafe code:


Thread 1 (Thread 0x7f286467dbc0 (LWP 95352)):
#0  0x00007f28544a94cd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f28544a2ac9 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f285ddd2b39 in CsHPSERequestSender::rem_request_sender(unsigned int) () from /opt/SPECTRUM/lib/../SS/libhpse.so.1
#3  0x00007f28625db1c6 in SnmpItcInterface::triggerFailSafe(unsigned int) () from /opt/SPECTRUM/lib/../SS/libsv1mm.so.1
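To check a capture file for this signature before uploading it, something like the sketch below may help. It is an illustration only: the function name find_failsafe_lock is hypothetical, and the match is based solely on the frame names shown in the trace above (__lll_lock_wait or pthread_mutex_lock together with SnmpItcInterface::triggerFailSafe in the same thread).

```shell
#!/bin/bash
# find_failsafe_lock FILE
# Prints the header line of every thread dump in FILE whose frames include
# both a mutex wait and SnmpItcInterface::triggerFailSafe.
# Returns 0 if at least one such thread was found, 1 otherwise.
find_failsafe_lock() {
    awk '
        # Report the thread collected so far if it matched, then reset the flags.
        function emit() {
            if (header != "" && waiting && failsafe) { print header; hits++ }
            waiting = 0; failsafe = 0
        }
        /^Thread [0-9]+ /                    { emit(); header = $0; next }
        /__lll_lock_wait|pthread_mutex_lock/ { waiting = 1 }
        /SnmpItcInterface::triggerFailSafe/  { failsafe = 1 }
        END { emit(); exit (hits > 0 ? 0 : 1) }
    ' "$1"
}
```

Running "find_failsafe_lock pstack_SS.out" prints one header line per matching sample. A single match only shows that the thread was waiting at that moment; the same thread matching in consecutive samples is what suggests a hang.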

Cause

A defect in the SNMPv3 processing code that optimizes the performance of unmanaged trap processing by using a separate mutex to avoid thread latencies.

Resolution

This is scheduled to be fixed in NetOps 25.4.2 and later.

Prior to 25.4.2, add the following entry to the $SPECROOT/SS/.vnmrc file to resolve the issue:

fix_v3_profile_lock=false

You must restart the SpectroSERVER for this change to take effect.
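The .vnmrc change can be made by hand in any editor. For repeatable changes across several servers, a small helper along the following lines is one option. This is a sketch under assumptions: the function name set_vnmrc_option and the ".bak" backup suffix are hypothetical, and it assumes entries are plain key=value lines, as the entry above suggests.

```shell
#!/bin/bash
# set_vnmrc_option FILE KEY VALUE
# Backs up FILE to FILE.bak, then sets KEY=VALUE: an existing KEY= line is
# rewritten in place, otherwise the entry is appended at the end of the file.
set_vnmrc_option() {
    local file=$1 key=$2 value=$3

    cp -p "$file" "$file.bak" || return 1

    if grep -q "^${key}=" "$file"; then
        # Read from the backup and write into the original file so that the
        # original keeps its ownership and permissions.
        sed "s|^${key}=.*|${key}=${value}|" "$file.bak" > "$file"
    else
        # Make sure the last existing line ends with a newline before appending.
        if [ -n "$(tail -c1 "$file")" ]; then
            echo >> "$file"
        fi
        echo "${key}=${value}" >> "$file"
    fi
}
```

For this article the call would be: set_vnmrc_option "$SPECROOT/SS/.vnmrc" fix_v3_profile_lock false. Running it a second time leaves a single entry rather than adding a duplicate.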

In most cases the SS will not stop. You will need to run kill -9 against the process ID of the SS, and then initialize and reload a previously saved database.

Additional Information

Engineering is investigating how to resolve this under defect DE175710. As of Sep 11, a fix is not expected until after 25.4.3, so the fix_v3_profile_lock=false entry is needed until the fix is available.