Collecting Spectrum Performance data

Products

Spectrum Network Observability

Issue/Introduction

We are having various Performance issues within Spectrum. This may be sluggish behavior of the OneClick console, spikes in CPU and/or RAM seen on a SpectroSERVER, or occasional hangs or lockups. What are some common reasons for such behavior, and what data is collected to determine the cause?

Environment

Spectrum 10.x for PerfCollector9
Spectrum 10.3.2 and higher for Self-Health Monitoring
Spectrum 10.4.1 and higher for InsideView

Cause

There are various causes for performance drain in Spectrum. This document will discuss some common areas to check.

Resolution

Available Resources:

One thing to be aware of is the CPU and memory allotted to the servers. While the minimum requirement for a single SpectroSERVER is 8GB, in practice a server that is tasked with polling thousands of devices and managing alarms, custom configurations and integrations should have around 32GB of RAM. For a standalone OneClick server integrated with products like Service Desk, DX Infrastructure Manager, DX Performance Center, DX Operations Insight, etc., 16GB of RAM or more is recommended (especially if all of those are integrated).

A quad-core processor is recommended. It's also important to note that in virtual machine clusters, the available resources should be dedicated to the SpectroSERVER or OneClick server rather than drawn from a shared pool.

Tomcat memory allocation: By default this is also a low value, and allowing a higher memory allocation for Tomcat can improve the performance of OneClick. The memory setting is stored in <SPECROOT>/tomcat/bin/catalina.sh (Linux) or OneClickService.conf (Windows), and the value is set via the GUI at OneClick Admin > Web Server Memory. 8192M is required for OneClick, or 16384M (16GB) if -ANY- integrations are enabled with OneClick.
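As a rough illustration of the Tomcat sizing guidance above, the sketch below compares a configured heap value against the figures quoted in this article. The function and constant names are hypothetical and not part of Spectrum; the configured value is assumed to be read manually from OneClick Admin > Web Server Memory.

```python
# Hypothetical helper, not part of Spectrum.
# The two thresholds are the figures quoted in this article.
HEAP_MB_NO_INTEGRATIONS = 8192
HEAP_MB_WITH_INTEGRATIONS = 16384


def tomcat_heap_shortfall_mb(configured_mb, integrations_enabled):
    """Return how many MB the configured heap falls short of the article's figure.

    Returns 0 when the configured value already meets or exceeds it.
    """
    if integrations_enabled:
        required = HEAP_MB_WITH_INTEGRATIONS
    else:
        required = HEAP_MB_NO_INTEGRATIONS
    return max(0, required - configured_mb)


# Example: 4096M configured on a OneClick server with integrations enabled.
print(tomcat_heap_shortfall_mb(4096, integrations_enabled=True))  # prints 12288
```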

Collecting Performance Data:

1. Log in and run the perfCollector9 script located in <SPECROOT>. Details can be found here.
2. The perfCollector9 script writes its output to <SPECROOT>/Performance. Here you will find the performance data files, as well as a compressed archive of the same, named results.tar.gz.
3. Attach results.tar.gz to a Support Case to have the data evaluated.
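Before attaching the archive to a case, it can be useful to confirm that it actually contains the expected files. The following is a minimal sketch using Python's standard tarfile module; the function name is hypothetical, and the archive path must be supplied by you (for example, the results.tar.gz under <SPECROOT>/Performance).

```python
import tarfile


def list_archive_members(archive_path):
    """Return the sorted names of the regular files inside a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


# Usage (path is an assumption, substitute your own install location):
#   for name in list_archive_members("/path/to/Performance/results.tar.gz"):
#       print(name)
```

An empty list, or a list without any .prf files, suggests the collection did not complete as expected.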

Reading the Performance Data:

If you review the contents of the <SPECROOT>/Performance directory you will see <hostname>_1.prf, <hostname>_2.prf files, etc. These are performance log files that hold data collected from the SSPerformance event 0x00010f91. If you don't see them, be sure you ran ./perfCollector9 as the Spectrum install owner. By themselves they are not of much use; however, Spectrum Support Engineers have a proprietary app which can parse the data. These *.prf files are included in the results.tar.gz file and provide trending metrics on CPU, memory, event rate, trap rate, thread use and much more. In addition, starting in Spectrum 10.4.1, these views are available in the OneClick console.

Also within the Performance directory (as well as in the results.tar.gz archive), you will find text files which collect counts of certain events that can contribute to performance loss. These files show the event count per model handle, and you can use Locator Search to find the problematic devices. The timeframe for this data collection is roughly the period for which the SSPerformance model retains data, which is defined by the DDM database retention (45 days by default). Each file name starts with the hostname of the server, followed by the event code that was collected. The "data" files can be ignored; you want the "summary" files.

<hostname>_0x10d35_summary.txt – Device Stopped Responding to Polls

This event summary doc shows the count of Stopped Responding alarms asserted per model. A high count of this alarm indicates a "flapping" device. We recommend addressing models which show a count of 100 or more 0x10D35 events during the perfCollector period.

<hostname>_0x10daa_summary.txt – SNMP Management Agent Lost

This summary doc shows models with excessive "MAL" alarms, indicating either an issue with the device's SNMP agent or, more likely, network latency that prevents consistent SNMP responses. Take a closer look at any model in this doc with a MAL alarm count higher than 100. Try increasing the DCM timeout for those models to allow Spectrum additional time to receive a consistent SNMP response from the device.
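The threshold of 100 events mentioned for the 0x10d35 and 0x10daa summaries can be applied with a small filter such as the sketch below. The layout of the summary files is not described in this article, so the code ASSUMES each relevant line has the form "<model_handle> <count>"; the function name and the sample values are made up for illustration and the parsing will likely need adapting to the real files.

```python
def models_over_threshold(lines, threshold=100):
    """Return (model_handle, count) pairs with count >= threshold, highest first.

    ASSUMPTION: each relevant line looks like "<model_handle> <count>".
    Lines that do not match this assumed layout are skipped.
    """
    flagged = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # header, blank line, or unexpected layout
        handle, count = parts[0], int(parts[1])
        if count >= threshold:
            flagged.append((handle, count))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Made-up sample data in the assumed layout.
sample = ["0x100a1 250", "0x100b2 12", "0x100c3 100"]
print(models_over_threshold(sample))
# prints [('0x100a1', 250), ('0x100c3', 100)]
```

The model handles returned can then be looked up with Locator Search in OneClick.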

<hostname>_0x10f94_summary.txt – Trap rate exceeds 100 per second

This summary doc lists models that Spectrum has detected generating a consistently high rate of traps. Every model in this list should have its trap rate evaluated. Configure only the necessary traps on the device, or use Trap Director to filter out unwanted traps. Place the models in Maintenance if this cannot be addressed right away.

<hostname>_0x10253_summary.txt – Trap Storm Detected

The Trap Storm Detected summary will show you those models which received a burst of traps - the count of the number of trap storms is recorded here. Any models shown here should be looked at for the cause of the trap storms.

<hostname>_0x10050_summary.txt – Excessive Reconfigurations

By default, Spectrum is set up to reconfigure interface models if it detects any changes to the ifstack. If a flapping interface on the network causes excessive interface reconfiguration, this can have an adverse effect on Spectrum performance. This doc lists the model handles of interfaces for which Spectrum has detected an excessive rate of reconfigurations. Any models in this list should be addressed. There are KB articles describing in further detail how to manage these, and starting in 10.3.2 there is an option for Spectrum to handle these automatically, in the Self Health subview under the VNM Information tab.

<hostname>_0x10f21_summary.txt – Excessive Global Collection search time

If Spectrum detects Global Collection searches taking an excessive amount of time, those GC models are listed here. Check the Search Criteria of each listed Global Collection for external attributes, which consume additional Spectrum resources. There are a number of "internal" attributes which can be used instead, such as swapping "X-ifAlias" for "ifAlias".
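When working through a Performance directory with many files, the naming pattern described above (<hostname>_<event code>_summary.txt) can be used to label each summary file. The sketch below is one possible way to do this; the function name is hypothetical, and the descriptions are simply the ones listed in this article.

```python
import re

# Event codes and descriptions as listed in this article.
EVENT_DESCRIPTIONS = {
    "0x10d35": "Device Stopped Responding to Polls",
    "0x10daa": "SNMP Management Agent Lost",
    "0x10f94": "Trap rate exceeds 100 per second",
    "0x10253": "Trap Storm Detected",
    "0x10050": "Excessive Reconfigurations",
    "0x10f21": "Excessive Global Collection search time",
}

# <hostname>_<event code>_summary.txt; "data" files do not match.
SUMMARY_NAME = re.compile(r"^(?P<host>.+)_(?P<code>0x[0-9a-fA-F]+)_summary\.txt$")


def describe_summary_file(filename):
    """Return (hostname, event_code, description), or None for other files."""
    match = SUMMARY_NAME.match(filename)
    if match is None:
        return None
    code = match.group("code").lower()
    description = EVENT_DESCRIPTIONS.get(code, "unknown event code")
    return match.group("host"), code, description


# "nms01" is a made-up hostname used only for illustration.
print(describe_summary_file("nms01_0x10daa_summary.txt"))
# prints ('nms01', '0x10daa', 'SNMP Management Agent Lost')
```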

Additional text files to review

There are a few other text files contained within the Performance Directory that may be helpful to review.

<hostname>_DevPortInfo.txt - shows models with over 100 ports configured. Be aware of any models with many hundreds or even thousands of ports, as they can cause performance issues (for example, devices with a high port count may have Discovery issues).

<hostname>_SizerInfo.txt - shows polling statistics. Pay attention to the poll interval counts: too many devices with too low or too high a poll interval can cause problems. The default poll interval is 300 seconds, and for most devices this should be sufficient (except perhaps for models showing up in the 0x10D35 summary mentioned above). Be aware that having too many models with too short a poll interval puts unnecessary stress on the SpectroSERVER.