SA S500 server unreachable and not communicating

Products

Security Analytics

Issue/Introduction

An S500 may be found unexpectedly down and not responding. It may come up briefly and then become unresponsive again within a few minutes. Connecting a USB serial cable to the console lets you view the console output, which may include messages like:

2022-06-06T02:51:02-04:00 hostname_here kernel: : [16000412.476274] mce: [Hardware Error]: Machine check events logged
2022-06-06T02:51:04-04:00 hostname_here kernel: : [16000414.726654] EDAC MC1: 8 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#1 (channel:2 slot:1 page:0x348efd9 offset:0xb80 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:1 ha:0 channel_mask:4 rank:5)

The keywords in these lines are Hardware, memory, and DIMM.
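As an illustration of how to read such a line, the following Python sketch splits an EDAC memory error message into its individual fields. The function name parse_edac_fields and the regular expression are assumptions made for this example and are not part of Security Analytics. The sketch assumes that the log lines follow the format of the sample above.

```python
import re

# Hypothetical helper, not part of Security Analytics.
# Assumes the format of the sample line:
#   "EDAC MC1: 8 CE memory read error on <label> (<key:value details>)"
EDAC_RE = re.compile(
    r"EDAC (?P<mc>MC\d+): (?P<count>\d+) (?P<kind>CE|UE) memory \w+ error on "
    r"(?P<label>\S+) \((?P<details>.*)\)"
)


def parse_edac_fields(line):
    """Return a dict of fields from one EDAC memory error line, or None."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    fields = {
        "controller": m.group("mc"),
        "count": int(m.group("count")),   # number of errors reported
        "kind": m.group("kind"),          # CE = corrected, UE = uncorrected
        "label": m.group("label"),        # e.g. CPU_SrcID#1_Ha#0_Chan#2_DIMM#1
    }
    # The details are space-separated key:value tokens. Tokens without
    # a colon (such as "-" and "OVERFLOW") are skipped.
    for token in m.group("details").split():
        key, sep, value = token.partition(":")
        if sep:
            fields[key] = value
    return fields
```

For the sample line above, the returned dict would include the label CPU_SrcID#1_Ha#0_Chan#2_DIMM#1 along with the channel, slot, socket, and rank values, which are the details to look for when identifying the affected DIMM.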

This does not apply to Dell hardware.

Environment

Security Analytics running on the S500 hardware.

Cause

The memory DIMMs are failing. When the application accesses the failed banks, it crashes. After the system is powered up again, it will crash again at some point.

Resolution

To determine whether this is the problem you are seeing, look for the keyword DIMM in /var/log/messages. For example, as root, run "grep DIMM /var/log/messages".

For the best chance of collecting a CSR before the system shuts down again, first stop the Security Analytics application by running "scotus stop" as root. Then collect a CSR from the command line by running csr.sh. The CSR is stored in /home/csr. Copy the .bz2 file to your desktop and attach it to your support case.

Sample messages from /var/log/messages:

2022-06-06T02:51:02-04:00 hostname kernel: : [16000412.476274] mce: [Hardware Error]: Machine check events logged
2022-06-06T02:51:04-04:00 hostname kernel: : [16000414.726654] EDAC MC1: 8 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#1 (channel:2 slot:1 page:0x348efd9 offset:0xb80 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:1 ha:0 channel_mask:4 rank:5)
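When the log contains many lines like these, it can help to total the reported errors per DIMM label before opening a support case. The following Python sketch does that. The function name count_dimm_errors and the regular expression are assumptions made for this example, and the sketch assumes that the log lines follow the sample format shown above.

```python
import re
from collections import Counter

# Assumption: lines look like the sample above, that is,
# "EDAC MC<n>: <count> CE memory read error on <label containing DIMM> (...)".
DIMM_RE = re.compile(
    r"EDAC MC\d+: (\d+) (?:CE|UE) memory \w+ error on (\S*DIMM\S*)"
)


def count_dimm_errors(lines):
    """Sum the reported error counts per DIMM label across log lines."""
    totals = Counter()
    for line in lines:
        m = DIMM_RE.search(line)
        if m:
            totals[m.group(2)] += int(m.group(1))
    return totals
```

As root, the function could be called with an open file, for example count_dimm_errors(open("/var/log/messages", errors="replace")). Calling most_common() on the result lists the labels with the highest totals first.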

If any DIMMs have failed, they must be replaced. To replace them, power down the system if it is not already down. Disconnect all data and power cables at the rear of the appliance, labeling them so that you know where to reconnect each one when the system is ready to power up. Slide the chassis forward, then follow the instructions below.

To access the DIMMs:

Lift the two blue tabs on each side of the computer chassis top to unlock a small hinged lid.
Open the lid that covers the fans, folding it forward.
Pull up the blue tab on the right side to release the large cover.
Push the large cover back, lift it, and set it aside.
There are silkscreen labels on the motherboard with the DIMM bank IDs. The DIMM banks are labeled A through H and each bank has slots 1, 2 and 3. Support will let you know which DIMMs to replace.
Next, reach the DIMMs that need replacing. Banks A, B, C, and D are under a clear plastic cover. Banks E, F, G, and H are under the PCI card riser. For banks A-D, lift the clear cover. For banks E-H, lift the PCI riser at the back of the chassis.
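The bank layout described above can be summarized as a small lookup. This Python sketch is only an illustration: the function name access_path and the ID format (a bank letter followed by a slot number, such as "B2") are assumptions, and the identifiers that support provides may be written differently.

```python
# Hypothetical helper based only on the layout described above:
# banks A-D sit under the clear plastic cover, banks E-H under the
# PCI card riser, and each bank has slots 1, 2, and 3.
def access_path(dimm_id):
    """Return which part must be lifted to reach a DIMM such as "B2"."""
    if len(dimm_id) != 2:
        raise ValueError(f"expected a bank letter and slot number: {dimm_id!r}")
    bank, slot = dimm_id[0].upper(), dimm_id[1]
    if slot not in ("1", "2", "3"):
        raise ValueError(f"unknown slot in {dimm_id!r}")
    if bank in ("A", "B", "C", "D"):
        return "clear plastic cover"
    if bank in ("E", "F", "G", "H"):
        return "PCI card riser"
    raise ValueError(f"unknown bank in {dimm_id!r}")
```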

Reverse the process to replace the covers. Pull the two plastic tabs in the side rails forward to release the rail locks.