Random Lock-ups and Disks Disabled

Hi everyone, I'm at a bit of a loss - every couple of days the server goes offline and becomes unresponsive (including the GUI and webGUI). Sometimes a disk is randomly disabled; once re-added it often stays enabled, but another disk gets disabled a few days later.

I've replaced all SATA and power cables and swapped the drives around to try to find a pattern, but can't see one.

Someone suggested that 8TB Seagate Ironwolf drives have an issue with spinning down, so I disabled spin-down, which didn't help.

I've tried running it with all Docker containers disabled and just one VM running (no hardware passthrough), and I still get hangs.

I'm not sure if the diagnostics are any help, as I think the log clears on each hang, but I've attached them just in case.

Any suggestions on what to try next?

Much appreciated, guys!
 

Hardware basics:

ASRock X79 Extreme 11 with E5-2670

SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 06)

Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)

Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)

32GB DDR3 1333MHz ECC Memory (4x 8GB)

7x 8TB Seagate Ironwolf ST8000VN04-2M2101 (Parity+6 data)

2x 3TB Seagate Barracuda ST3000DM007-1WY10G (data)

2x 2TB Seagate NAS ST2000VN000-1HJ164 (data)

1x 2TB Seagate Barracuda ST2000DM006-2DM164 (data)

Plus 3x Cache pools with 7 SSDs total (mix of Samsung and Crucial/Micron)

diagnostics-20220417-1647.zip

2 hours ago, Husker_N7242C said:

I enabled this and I have PM'd you the log from after today's lock-up with a bit more explanation.

Next time please use the forum.

 

I'm not seeing anything logged that's relevant to the crashing, so it could be hardware related. One thing you can try is booting the server in safe mode with all Docker/VMs disabled and letting it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

As for the disks getting disabled, see below:

On 4/23/2022 at 5:42 PM, JorgeB said:

Not seeing anything logged relevant to the crashing, could be hardware related...


Thanks again.
I've disabled EPC and low current spinup on all drives (8 of which are the model mentioned in the guide, and another is a 3TB which had both features enabled). This is currently the best-value NAS Seagate drive in Australia (and has been for some time). Is there a way to build the SeaChest executables into Unraid and have them automatically check and disable these features on all Seagate drives when a new drive is added? It seems like this would be a relatively simple script to embed as a bug fix. (I know very little about this stuff, really; it just seems silly to expect users to fix this bug themselves and repeat the process every time they add a drive.)

Regarding the crashing, I've ordered a new PSU (I haven't tested the current one, but it's due for replacement anyway), and I'll run a memory test and go from there.

Are there any hardware diagnostic tools you would suggest for checking the LSI controllers and the motherboard? Anywhere else that helpful logs might be hiding?


 

  • 2 weeks later...
On 4/23/2022 at 5:42 PM, JorgeB said:

Not seeing anything logged relevant to the crashing, could be hardware related...


I wasn't able to pin down a hardware fault at first. I ran memtest86 with 3 passes on each module one at a time, then for 2 days straight with all 8 modules, and got 0 errors. I replaced the HX850 PSU with a new RM850x single-rail unit and checked that my total power draw is easily within spec. I booted again and had a freeze within a day.
Syslog now shows memory errors (below). Unfortunately it doesn't show which module (I understand why it doesn't).

So I probably have a memory issue, but if memtest86 can't find a fault, I'm not sure how to work out which module is bad. I can't really afford to replace 64GB of ECC memory, TBH. Any ideas on how to narrow this down?

Apr 29 07:29:50 NAS mcelog: Fallback Socket memory error count 32533 exceeded threshold: 2274769153 in 24h
Apr 29 07:29:50 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:50 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:50 NAS mcelog: Too many trigger children running already
Apr 29 07:29:50 NAS mcelog: Fallback Socket memory error count 17062 exceeded threshold: 2274704085 in 24h
Apr 29 07:29:50 NAS mcelog: Fallback Socket memory error count 32533 exceeded threshold: 2274736619 in 24h
Apr 29 07:29:50 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:50 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:50 NAS mcelog: Cannot collect child 9515: No child processes
Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 32112 exceeded threshold: 2274801266 in 24h
Apr 29 07:29:51 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:51 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 18505 exceeded threshold: 2274819772 in 24h
Apr 29 07:29:51 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:51 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 18505 exceeded threshold: 2274838278 in 24h
Apr 29 07:29:51 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:51 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:51 NAS mcelog: Too many trigger children running already
Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 18505 exceeded threshold: 2274819772 in 24h
Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 32112 exceeded threshold: 2274801266 in 24h
Apr 29 07:29:51 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:51 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:52 NAS kernel: mce_notify_irq: 120 callbacks suppressed
Apr 29 07:29:52 NAS kernel: mce: [Hardware Error]: Machine check events logged
Apr 29 07:29:52 NAS kernel: mce: [Hardware Error]: Machine check events logged
Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 18025 exceeded threshold: 2274856304 in 24h
Apr 29 07:29:52 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:52 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 12193 exceeded threshold: 2274868498 in 24h
Apr 29 07:29:52 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:52 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 12193 exceeded threshold: 2274880692 in 24h
Apr 29 07:29:52 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:52 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:52 NAS mcelog: Too many trigger children running already
Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 18025 exceeded threshold: 2274856304 in 24h
Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 12193 exceeded threshold: 2274868498 in 24h
Apr 29 07:29:52 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:52 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:53 NAS mcelog: Fallback Socket memory error count 31707 exceeded threshold: 2274912400 in 24h
Apr 29 07:29:53 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:53 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:53 NAS mcelog: Fallback Socket memory error count 23544 exceeded threshold: 2274935945 in 24h
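One way to try to narrow this down without pulling modules, assuming the kernel's sb_edac driver actually supports this X79/C600 platform (an assumption; it may not load on every board), is to read the per-DIMM corrected-error counters that the EDAC subsystem exposes under sysfs. Unlike the mcelog fallback decoder above (`CHANNEL:? DIMM:?`), EDAC attributes each corrected error to a specific memory controller, channel, and slot:

```shell
#!/bin/sh
# Sketch: dump per-DIMM corrected-error counts from the EDAC sysfs tree.
# Assumption: the sb_edac driver loads on this Sandy Bridge-E/X79 platform;
# if /sys/devices/system/edac is absent, the driver is missing or the
# platform is unsupported.
modprobe sb_edac 2>/dev/null || true

found=0
for f in /sys/devices/system/edac/mc/mc*/dimm*/dimm_ce_count; do
    [ -e "$f" ] || continue
    found=1
    printf '%s: %s corrected errors\n' "$f" "$(cat "$f")"
done
[ "$found" -eq 1 ] || echo "no EDAC data (driver not loaded or platform unsupported)"
```

If the tree exists, a steadily climbing `dimm_ce_count` under one `mc*/dimm*` path points at a single slot, and the neighbouring `dimm_label` file can sometimes be matched to the board's silkscreen labels.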

14 minutes ago, itimpi said:

Are you using the memtest that comes with Unraid or did you download a version from memtest86.com that can test ECC ram?

The one that comes with UNRAID. From what I can see, memtest86 supports testing ECC memory. Is there another version that would be better for DDR3 ECC?

11 minutes ago, JorgeB said:

You should replace that DIMM, if there's an uncorrectable error the server will halt, system event log might have more info on the affected DIMM, if not just remove one at a time until the errors stop.

Thanks for replying again. Of the 8 DIMMs, it's possible that more than one has a fault. It's also quad channel, so I'd have to remove more than one at a time, and it will be a LONG process of elimination. There MUST be a way to scan the memory properly: if UNRAID can encounter the issue within 24 hours so consistently, surely a memory-testing program can too. I'll check the system event log for a hint as well.

5 minutes ago, JorgeB said:

It will depend on the board, Supermicro server boards for example usually always list the affected DIMM in the SEL.

It's just an ASRock X79 Extreme 11. Pretty cool board (for its time), but I don't know of any diagnostic tools like that for it. It threw a boot code last week, before I ran all of the memory tests, but it was a general memory error code that didn't indicate a particular DIMM.

6 hours ago, Husker_N7242C said:

The one that comes with UNRAID. From what I can see, memtest86 supports testing ECC memory. Is there another version that would be better for DDR3 ECC?

AFAIK, the memtest included with Unraid is an older version (licensing issues) and does not detect errors on ECC modules.

As suggested, you should try the version from the official website: https://www.memtest86.com/

13 hours ago, ChatNoir said:

AFAIK, the memtest included with Unraid is an older version (licensing issues) and does not detect errors on ECC modules.

As suggested, you should try the version from the official website: https://www.memtest86.com/

Looks like you are right! Version 5 doesn't support ECC, and higher versions don't support older BIOSes... I guess Unraid is stuck between a rock and a hard place. It should either include both and describe the choice in the boot menu, or just remove it completely. I think I would have noticed this if I'd had to download and create a boot USB myself, but I missed it because I know UNRAID recommends ECC memory, so I incorrectly assumed the memory tool packaged with the OS would work with ECC memory. Probably needs a re-think.

3 hours ago, Husker_N7242C said:

or should include both and describe the choice in the boot menu.

The problem is that licensing issues do not allow the later versions to be included.

 

I think it would be a good idea to make it more obvious that a later version can be downloaded and used for free (for personal use).

