Server repeatedly crashing



Hey folks,

 

So, I've been living the Unraid dream for just over 6 months now. However, over the last few weeks I've been battling a server that frequently becomes entirely unresponsive. By this I mean that not only is the network down (no ssh or ping), but when running it using the local GUI, even that freezes and stops responding to user input.

 

I've identified that it freezes early on when the parity check kicks in - if Unraid attempts to start one due to the unclean shutdown I have to race to log in and stop it! But I think it happens at other times too - it went again last night and I've disabled the parity check schedule till I can work out how to sort this.

 

While trying to diagnose this, I've had it crash despite removing almost all plugins and disabling both the VM manager and Docker, so I've run out of things to turn off! I've also been persisting the syslog to the flash drive, and there's no entry prior to the freeze that would (obviously) explain what's going on. I also ran memtest for about 6 hours yesterday with 0 errors, having seen that it fixed someone else's similar-sounding problem.
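When a box hard-freezes without logging anything, one way to at least locate each freeze in a persisted syslog is to look for long silences between consecutive entries. The following is a minimal sketch, assuming classic "Mon DD HH:MM:SS" syslog timestamps; the function name `find_gaps`, the 10-minute threshold, and the last sample line are made up for illustration.

```python
from datetime import datetime, timedelta

def find_gaps(lines, threshold=timedelta(minutes=10), year=2020):
    """Return (line_before_gap, line_after_gap, gap_length) tuples.

    The year is an assumption: classic syslog timestamps do not carry one.
    """
    gaps = []
    prev_time, prev_line = None, None
    for line in lines:
        # Classic syslog lines start with a 15-character "Mon DD HH:MM:SS" stamp.
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not begin with a timestamp
        if prev_time is not None and stamp - prev_time > threshold:
            gaps.append((prev_line, line, stamp - prev_time))
        prev_time, prev_line = stamp, line
    return gaps

# The first two lines are from this thread; the third is a hypothetical
# entry standing in for the first message logged after a reboot.
sample = [
    "Apr 13 09:47:15 Alfred kernel: usb 4-4: USB disconnect, device number 2",
    "Apr 13 09:47:16 Alfred kernel: usb 3-4: USB disconnect, device number 2",
    "Apr 13 10:30:00 Alfred kernel: hypothetical first line after a reboot",
]

for before, after, gap in find_gaps(sample):
    print(f"silence of {gap} after: {before}")
```

The entries just before each silence are the ones worth reading closely, even if none of them looks like an obvious cause. An idle server can also be quiet for long stretches, so the threshold would need tuning.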

 

I did have several unclean shutdowns in short succession a while ago which caused some issues (had to rebuild my docker Plex library as it wouldn't load) and I'm fairly certain there will be inconsistencies that the parity check needs to fix - but I assume even if the array was completely borked, Unraid should run regardless.

 

Any advice would be *very* welcome! Otherwise all I've got left is to start from scratch. Our lights are run thru this box using Hass and the wife is starting to lose patience with me having to go through a twenty minute song and dance every time she wants to go to a different room :D

 

Attached is the diagnostics zip. I've also added a longer syslog that spans multiple crashes in case it's useful.

 

(Last note: this is running v6.8.2. It was on v6.8.3 and doing the same; I just wondered whether the downgrade would help!)

 

Rough specs if they're useful:

 

Mobo:     ASRock B450 PRO4

CPU:      AMD Ryzen 7 2700

RAM:      8GB DDR4-2666

GPU:      GT 710 1GB

PSU:      EVGA 550w

Parity:    4TB WD Red

Array:    4TB WD Red & 2x 2TB WD (Green I think)

 

alfred-diagnostics-20200413-0951.zip syslog

Edited by BishBashBodge
Link to comment

I vaguely remember seeing some of that when setting up but think I came to the conclusion that given my build was stable, I'd leave it alone. That said, I've now disabled "Global C-State Control" and set "Power Supply Idle Control" in BIOS to "Typical Current Idle".

 

I had a bit of a mare with BIOS originally as the latest update at the time ruined all of the IOMMU groups which stopped me from setting up VMs like I wanted to. Got through that (not that I immediately remember how - fairly sure I deferred to SpaceInvader's expertise) in the end but I get chills every time I go in there now :D

 

Unfortunately, with that applied it has just locked up a couple of minutes into a parity check. Worth ruling out though and I really should have done that from the beginning given how widely those settings seem to be advocated for a rig like mine.

 

Now running an extended SMART test just 'cos I'm desperate.

 

I'll keep digging through that bug report but I think the link to the comment you highlighted mentioned all the useful bits.

 

Any other thoughts massively appreciated.

Link to comment

12 hours down of the memtest, 9 passes and no errors. I've left it running, but at only 8 GB it's had a fairly thorough thrashing :D

 

It stayed up for the entire duration of running the SMART tests (6-7 hours perhaps) whilst the array was offline and then locked up 5 minutes into the parity check I subsequently attempted. So, with a clean bill of health from the drives, I'd be inclined to believe it's not hardware(?).

 

If that was the case though, you'd think it'd at least be able to log the triggering event.

 

I'm still so new to Unraid, I've no clue where to go from here!

Link to comment

So, I ran some prime95 stress tests and pinned the box for some time without issue. Relieved that this didn't freak it out.

 

Unfortunately, while testing a large move of music to one of the two 2TB Toshiba drives (I thought they were WD Greens, I was wrong), the box hung again. In honesty, the value of that data was sufficiently little that I just binned it after the reboot.

 

With limited options and a full knowledge that the parity drive is effectively useless (as I'm certain there are errors and I'm unable to run a parity check) I've taken what is likely a risky and foolhardy route and shrunk the array - removing the two 2TB drives. Perhaps this could help the power issue as it ran fine for 4 months without them?! This was only an option as those drives had yet to be populated with anything, with the 4TB sitting at about 85% full (I'd intended on spreading this out a little at some point). If it is a power issue, this could help; if it was caused by something I did when adding those drives, likewise.

 

An hour and a half into a parity rebuild and over 20% complete is looking promising and worlds better than previous attempts so I'm moderately hopeful I might get something back out of this - albeit with the chance that some of my data could well still be corrupt.

 

Feel free to tell me I'm an impatient idiot 😦 I'll report back later.

 

Edited by BishBashBodge
Link to comment
16 hours ago, BishBashBodge said:

It stayed up for the entire duration of running the SMART tests (6-7 hours perhaps) whilst the array was offline and then locked up 5 minutes into the parity check

Prime95 and memtest also show the system is stable, so this shouldn't be a power issue.

 

5 hours ago, BishBashBodge said:

removing the two 2TB drives.

You took out the two 2TB disks. Those disks were connected to the onboard controller (that AHCI controller shares the same PCIe bus with the USB controller), and the syslog contains some USB disconnect messages. The remaining two 4TB disks were connected to the ASM1062 ...... So all of this points to some problem on PCIe 03:00. I think your mainboard has an issue: when there is load on PCIe 03:00, the system crashes. Please describe in more detail what happens when the problem occurs. What do you see on the local console (not the local GUI)? A hang? A reboot? An error message?

 

Apr 13 09:47:15 Alfred kernel: usb 4-4: USB disconnect, device number 2
Apr 13 09:47:16 Alfred kernel: usb 3-4: USB disconnect, device number 2
Apr 13 09:47:16 Alfred kernel: usb 3-4.1: USB disconnect, device number 3
Apr 13 09:47:16 Alfred acpid: input device has been disconnected, fd 5
Apr 13 09:47:16 Alfred kernel: usb 3-4.2: USB disconnect, device number 4
03:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller [1022:43d5] (rev 01)
	Subsystem: ASRock Incorporation Device [1849:43d0]
	Kernel driver in use: xhci_hcd
03:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller [1022:43c8] (rev 01)
	Subsystem: ASRock Incorporation Device [1849:43c8]
	Kernel driver in use: ahci
	Kernel modules: ahci
03:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge [1022:43c6] (rev 01)
	Kernel driver in use: pcieport
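The two observations above (USB disconnects in the syslog, and USB and SATA functions sitting on the same 03:00 device) can be pulled out of the diagnostics mechanically. This is a rough sketch, not part of the Unraid tooling: the function names and regular expressions are illustrative, and the sample lines are copied from this post.

```python
import re
from collections import defaultdict

# Matches kernel messages such as "usb 3-4.1: USB disconnect, device number 3".
USB_DISCONNECT = re.compile(r"usb (\d+)-([\d.]+): USB disconnect")
# Matches lspci header lines such as "03:00.1 SATA controller [0106]: ...".
LSPCI_FUNCTION = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2})\.([0-7]) ([^\[]+?) \[")

def usb_disconnects(syslog_lines):
    """Count USB disconnect events per USB bus number."""
    counts = defaultdict(int)
    for line in syslog_lines:
        match = USB_DISCONNECT.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return dict(counts)

def functions_by_device(lspci_lines):
    """Group lspci function classes by their shared bus:device address."""
    groups = defaultdict(list)
    for line in lspci_lines:
        match = LSPCI_FUNCTION.match(line)
        if match:
            groups[match.group(1)].append(match.group(3))
    return dict(groups)

SYSLOG = [
    "Apr 13 09:47:15 Alfred kernel: usb 4-4: USB disconnect, device number 2",
    "Apr 13 09:47:16 Alfred kernel: usb 3-4: USB disconnect, device number 2",
    "Apr 13 09:47:16 Alfred kernel: usb 3-4.1: USB disconnect, device number 3",
    "Apr 13 09:47:16 Alfred acpid: input device has been disconnected, fd 5",
    "Apr 13 09:47:16 Alfred kernel: usb 3-4.2: USB disconnect, device number 4",
]

LSPCI = [
    "03:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller [1022:43d5] (rev 01)",
    "\tKernel driver in use: xhci_hcd",
    "03:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller [1022:43c8] (rev 01)",
    "\tKernel driver in use: ahci",
    "03:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge [1022:43c6] (rev 01)",
    "28:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)",
]

print(usb_disconnects(SYSLOG))
print(functions_by_device(LSPCI))
```

Note that the sketch does not map a USB bus number back to a PCI controller; that link would have to be confirmed separately on the running system.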

 

No disks are using the FCH SATA controller, so I suggest you replug those 2TB disks into the OTHER SATA ports and post diagnostics again.

 

28:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
	Subsystem: ASRock Incorporation FCH SATA Controller [AHCI mode] [1849:7901]
	Kernel driver in use: ahci
	Kernel modules: ahci

 

Edited by Benson
Link to comment

Hey thanks @Benson hell of a lot of detail there. Appreciate it.

 

The parity rebuild succeeded and I was flying high this morning.

 

Just tried pulling in some new data and the relatively heavy write (think it was about 4GB in out of 40GB) using an rsync operation from a remote server via a docker image. Unfortunately it tanked and crashed during this operation.

 

Following that the parity check once again causes the machine to crash.

 

I've fully, physically removed the two 2TB drives so I think that potential Mobo issue could be able to be ruled out as the cause for crash?

Also, could plugging & unplugging a keyboard count for the USB disconnect? I've been swapping my USB hub with kbd & mouse back and forward from my desktop to the server for diagnosis :D 

 

 

p.s. You guys are awesome. Thanks.

 

Edited by BishBashBodge
Link to comment
54 minutes ago, BishBashBodge said:

Just tried pulling in some new data and the relatively heavy write (think it was about 4GB in out of 40GB) using an rsync operation from a remote server via a docker image. Unfortunately it tanked and crashed during this operation.

That means the problem is not only on the 03:00 SATA controller, or is not SATA controller related at all.

 

54 minutes ago, BishBashBodge said:

Also, could plugging & unplugging a keyboard count for the USB disconnect?

Yes

 

Could you also try plugging the Unraid USB into another port that is not under the 03:00 USB controller?

 

 

On 4/13/2020 at 5:29 PM, BishBashBodge said:

I've identified that it freezes early on when the parity check kicks in


If none of the above helps, I would suggest you start the array in maintenance mode, then perform a parity check / sync (please run it several times; there is no need to let it complete). This should show whether it is a hardware problem or not. If there is no problem in maintenance mode, where the disks are not mounted, then I think something may be wrong in a data disk filesystem and causing the system to crash.

Edited by Benson
Link to comment

Hey folks,

 

So.. following on from your earlier queries (for which, once again - thanks!)...

 

1) Tried a few different usb ports (on different controllers) and one continued to crash whilst the other, though possibly by coincidence, seemed to kill network connectivity?!

2) Tried running parity check with array in maintenance mode - crashed.

3) Upgraded to 6.9-beta1 - crashed on parity check.

 

Am I in "just rebuild the server" territory at this point? Ultimately, the sod's worked since October with zero issues so unless there's subtle hardware failure at play, this thing should run! There's a whole bunch of set up with vms (along with IOMMU GPU passthru magic) and docker for Plex and home automation duties that'd take some time but I'm not seeing many other routes. I'm still of naive hope that I'll either be able to get the data off, or use UD to get the data back on?

 

Wisdom still very welcome.

 

Apologies that this is feeling all a little desperate and defeatist. All of your assistance has been invaluable, I guess I'm just not used to feeling quite as tech-clueless as this!

 

Link to comment
9 hours ago, BishBashBodge said:

hope that I'll either be able to get the data off, or use UD to get the data back on?

In most cases, data (storage) migrates to a new motherboard as plug and play. Getting all services running again may need some tuning / config.

For IOMMU, you may need more effort to source suitable hardware, or you may need to work with the ACS patch.

 

It is hard to say whether this is a hardware or software problem.

 

First, one thing drew my attention: the 8GB 2666MHz RAM is running at just 1333MHz, which is abnormal. You said memtest passed (personally I don't fully trust memtest), but you could try reseating the module, moving it to another slot, or swapping it; most crash cases are caused by memory issues.
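A side note on reading that figure, offered as a hedged sanity check rather than a diagnosis: DDR memory performs two transfers per clock cycle, so a DDR4-2666 module (2666 MT/s) runs on a 1333 MHz clock. Whether 1333 is abnormal therefore depends on the unit the diagnostics reported, which is an assumption here. If the value is a clock in MHz, it matches the rated speed; if it is a transfer rate in MT/s, the module really is running at half speed. The helper name below is made up for illustration.

```python
def ddr_clock_mhz(transfer_rate_mts: int) -> float:
    """DDR does two transfers per clock cycle, so clock = transfer rate / 2."""
    return transfer_rate_mts / 2

print(ddr_clock_mhz(2666))  # 1333.0
```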

 

For mainboard troubleshooting, I usually unplug the Unraid USB and disks, install Windows on a spare disk for stress testing, and swap some hardware if needed.

Edited by Benson
Link to comment
