Random Lock-ups and Disks Disabled

Hi everyone, I'm at a bit of a loss - every couple of days the server goes offline and becomes unresponsive (including the GUI and webGUI). Sometimes a disk is randomly disabled; once re-added it often stays enabled, but another disk gets disabled a few days later.

I've replaced all SATA and power cables and swapped the drives around to try to find a pattern, but can't see one.

Someone suggested that 8TB Seagate Ironwolf drives have an issue with spinning down, so I disabled spin-down, which didn't help.

I've tried running it with all Docker containers disabled and just one VM running (no hardware passthrough), and I still get hangs.

I'm not sure if the diagnostics are any help, as I think the log clears on each hang, but I've attached them just in case.

Any suggestions on what to try next?

Much appreciated, guys!
 

Hardware basics:

ASRock X79 Extreme 11 with E5-2670

SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 06)

Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)

Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)

32GB DDR3 1333MHz ECC Memory (4x 8GB)

7x 8TB Seagate Ironwolf ST8000VN04-2M2101 (Parity+6 data)

2x 3TB Seagate Barracuda ST3000DM007-1WY10G (data)

2x 2TB Seagate NAS ST2000VN000-1HJ164 (data)

1x 2TB Seagate Barracuda ST2000DM006-2DM164 (data)

Plus 3x Cache pools with 7 SSDs total (mix of Samsung and Crucial/Micron)

diagnostics-20220417-1647.zip

2 hours ago, Husker_N7242C said:

I enabled this and I have PM'd you the log from after today's lock-up with a bit more explanation.

Next time please use the forum.

 

I'm not seeing anything logged that's relevant to the crashing, so it could be hardware related. One thing you can try is booting the server in safe mode with all Docker/VMs disabled and letting it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

As for the disks getting disabled, see below:

On 4/23/2022 at 5:42 PM, JorgeB said:

Not seeing anything logged relevant to the crashing, could be hardware related...


Thanks again.
I've disabled EPC and low current spinup on all drives (8 of which are the model mentioned in the guide, and another is a 3TB which had both features enabled). This is currently the best-value NAS Seagate drive in Australia (and has been for some time). Is there a way to build the SeaChest executables into Unraid and have them automatically check and disable these features on all Seagate drives when a new drive is added? It seems like this would be a relatively simple script to embed as a bug fix. (I know very little about this stuff, really; it just seems silly to expect users to fix this bug themselves and repeat the process every time they add a drive.)

Regarding the crashing, I've ordered a new PSU (I haven't tested the current one, but it's due for replacement anyway), and I'll run a memory test and go from there.

Are there any hardware diagnostic tools you would suggest for checking the LSI controllers and the motherboard? Anywhere else that helpful logs might be hiding?


 

  • 2 weeks later...
On 4/23/2022 at 5:42 PM, JorgeB said:

Not seeing anything logged relevant to the crashing, could be hardware related...


I wasn't able to pin down a hardware fault at first. I ran memtest86 with 3 passes on each module one at a time, then for 2 days straight with all 8 modules, and got 0 errors. I replaced the HX850 PSU with a new RM850x single-rail unit and checked that my total power draw is easily within spec. I booted again and had a freeze within a day.
Syslog now shows memory errors (below). Unfortunately it doesn't show which module (I understand why it doesn't).

So I probably have a memory issue, but if memtest86 can't find a fault, I'm not sure how to work out which module is bad. I can't really afford to replace 64GB of ECC memory, TBH. Any ideas on how to narrow this down?

Apr 29 07:29:50 NAS mcelog: Fallback Socket memory error count 32533 exceeded threshold: 2274769153 in 24h
Apr 29 07:29:50 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:50 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:50 NAS mcelog: Too many trigger children running already
Apr 29 07:29:50 NAS mcelog: Fallback Socket memory error count 17062 exceeded threshold: 2274704085 in 24h
Apr 29 07:29:50 NAS mcelog: Fallback Socket memory error count 32533 exceeded threshold: 2274736619 in 24h
Apr 29 07:29:50 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:50 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:50 NAS mcelog: Cannot collect child 9515: No child processes
Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 32112 exceeded threshold: 2274801266 in 24h
Apr 29 07:29:51 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:51 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 18505 exceeded threshold: 2274819772 in 24h
Apr 29 07:29:51 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:51 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 18505 exceeded threshold: 2274838278 in 24h
Apr 29 07:29:51 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:51 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:51 NAS mcelog: Too many trigger children running already
Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 18505 exceeded threshold: 2274819772 in 24h
Apr 29 07:29:51 NAS mcelog: Fallback Socket memory error count 32112 exceeded threshold: 2274801266 in 24h
Apr 29 07:29:51 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:51 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:52 NAS kernel: mce_notify_irq: 120 callbacks suppressed
Apr 29 07:29:52 NAS kernel: mce: [Hardware Error]: Machine check events logged
Apr 29 07:29:52 NAS kernel: mce: [Hardware Error]: Machine check events logged
Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 18025 exceeded threshold: 2274856304 in 24h
Apr 29 07:29:52 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:52 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 12193 exceeded threshold: 2274868498 in 24h
Apr 29 07:29:52 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:52 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 12193 exceeded threshold: 2274880692 in 24h
Apr 29 07:29:52 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:52 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:52 NAS mcelog: Too many trigger children running already
Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 18025 exceeded threshold: 2274856304 in 24h
Apr 29 07:29:52 NAS mcelog: Fallback Socket memory error count 12193 exceeded threshold: 2274868498 in 24h
Apr 29 07:29:52 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:52 NAS mcelog: Location: SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:53 NAS mcelog: Fallback Socket memory error count 31707 exceeded threshold: 2274912400 in 24h
Apr 29 07:29:53 NAS mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
Apr 29 07:29:53 NAS mcelog: Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Apr 29 07:29:53 NAS mcelog: Fallback Socket memory error count 23544 exceeded threshold: 2274935945 in 24h
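One way to try to narrow this down without pulling modules, assuming the kernel's sb_edac driver actually supports this X79/C600 platform (an assumption; it may not load on every board), is to read the per-DIMM corrected-error counters that the EDAC subsystem exposes under sysfs. Unlike the mcelog fallback decoder above (`CHANNEL:? DIMM:?`), EDAC attributes each corrected error to a specific memory controller, channel, and slot:

```shell
#!/bin/sh
# Sketch: dump per-DIMM corrected-error counts from the EDAC sysfs tree.
# Assumption: the sb_edac driver loads on this Sandy Bridge-E/X79 platform;
# if /sys/devices/system/edac is absent, the driver is missing or the
# platform is unsupported.
modprobe sb_edac 2>/dev/null || true

found=0
for f in /sys/devices/system/edac/mc/mc*/dimm*/dimm_ce_count; do
    [ -e "$f" ] || continue
    found=1
    printf '%s: %s corrected errors\n' "$f" "$(cat "$f")"
done
[ "$found" -eq 1 ] || echo "no EDAC data (driver not loaded or platform unsupported)"
```

If the tree exists, a steadily climbing `dimm_ce_count` under one `mc*/dimm*` path points at a single slot, and the neighbouring `dimm_label` file can sometimes be matched to the board's silkscreen labels.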

14 minutes ago, itimpi said:

Are you using the memtest that comes with Unraid or did you download a version from memtest86.com that can test ECC ram?

The one that comes with UNRAID. From what I can see, memtest86 supports testing ECC memory. Is there another version that would be better for DDR3 ECC?

11 minutes ago, JorgeB said:

You should replace that DIMM, if there's an uncorrectable error the server will halt, system event log might have more info on the affected DIMM, if not just remove one at a time until the errors stop.

Thanks for replying again. Of the 8 DIMMs, it's possible that more than one has a fault. It's also quad channel, so I'd have to remove more than one at a time, and it will be a LONG process of elimination. There MUST be a way to scan the memory properly: if UNRAID can encounter the issue within 24 hours so consistently, surely a memory-testing program can too. I'll check the system event log for a hint as well.

5 minutes ago, JorgeB said:

It will depend on the board, Supermicro server boards for example usually always list the affected DIMM in the SEL.

It's just an ASRock X79 Extreme 11. Pretty cool board (for its time), but I don't know of any diagnostic tools like that for it. It threw a boot code last week, before I ran all of the memory tests, but it was a general memory error code that didn't indicate a particular DIMM.

6 hours ago, Husker_N7242C said:

The one that comes with UNRAID. From what I can see, memtest86 supports testing ECC memory. Is there another version that would be better for DDR3 ECC?

AFAIK, the memtest included with Unraid is an older version (licensing issues) and does not detect errors on ECC modules.

As suggested, you should try the version from the official website: https://www.memtest86.com/

13 hours ago, ChatNoir said:

AFAIK, the memtest included with Unraid is an older version (licensing issues) and does not detect errors on ECC modules.

As suggested, you should try the version from the official website: https://www.memtest86.com/

Looks like you are right! Version 5 doesn't support ECC, and higher versions don't support older BIOSes... I guess Unraid is stuck between a rock and a hard place. It should either include both and describe the choice in the boot menu, or just remove it completely. I think I would have noticed this if I'd had to download and create a boot USB myself, but I missed it because I know UNRAID recommends ECC memory, so I incorrectly assumed the memory tool packaged with the OS would work with ECC memory. Probably needs a re-think.

3 hours ago, Husker_N7242C said:

or should include both and describe the choice in the boot menu.

The problem is that licensing issues do not allow the later versions to be included.

 

I think it would be a good idea to make it more obvious that a later version can be downloaded and used for free (for personal use).

