Unraid becomes unresponsive - Hard reboot required

Kris6673 · May 12, 2022

Hello there,

I'm seeing issues with my server hanging randomly. It's happened a few times and I'm not sure what to do. My memory is not the best in the world, but I'll be as precise as possible.

My setup is as follows:

- Motherboard: Asus Prime Z370-P BIOS Version 3004 - REALTEK NIC

- RAM: Corsair Vengeance LPX 2x8 GB 3000 MHz and HyperX Fury 2x8 GB 3200 MHz

- GPU: iGPU UHD 630

- CPU: i7-8700 (before 31.03.22), i3-8100 (in between), Intel i7-8700K (after 08.05.22)

- HBA: Dell Perc H310 flashed to IT mode

- Power Supply: Corsair TX550

- Unraid USB: Kingston DataTraveler_3.0 64GB

- Unraid version: 6.9.2

Both RAM sets are in different channels and are running at stock speed with no XMP, since the system won't POST with XMP enabled.

I use the iGPU for Plex and unmanic HEVC transcoding.

Syslog is mirrored to the USB drive.

It acted strangely and hung randomly a couple of months ago, but then I had a CPU die in the server and thought that was the cause.

The dead CPU was an i7-8700. That CPU threw CPU stall errors, and from what I can see, the current one does not.

I replaced the CPU with an i3-8100 and went on my merry way. It hung once or twice here too in the roughly 1.5 months I had it, but I didn't think anything of it for some reason.

First time (April 8, I think): The GUI, Docker and SMB were all unresponsive, but I could SSH into it. I couldn't force a reboot over SSH, so a hard reboot was required. The parity sync seemed to crash it, but after a hard reboot I ran the parity sync again and everything was fine.

Second time (May 11): The GUI returned an NGINX 502 error, but a few Docker containers were still running and accessible, like qBittorrent, NginxProxyManager and Plex. SSH worked too. I collected diagnostics after the hard reboot: kbkunraid-diagnostics-20220512-1704.zip

A parity sync was running here IIRC, and I used this command to see that it was stuck: /root/mdcmd status | egrep "mdResync|mdState|sbSync"
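
As a rough sketch of how the output of that command could be turned into a progress figure: the helper below assumes mdcmd status prints key=value lines and that mdResyncPos and mdResyncSize are among them. Those two field names and the function name resync_progress are assumptions, not something shown in this thread.

```bash
#!/usr/bin/env bash
# Sketch: summarise parity-sync progress from `mdcmd status` style output.
# Assumption: input is key=value lines that include mdState, mdResyncPos
# and mdResyncSize (the last two names are not confirmed in this thread).
resync_progress() {
    awk -F= '
        $1 == "mdState"      { state = $2 }
        $1 == "mdResyncPos"  { pos = $2 }
        $1 == "mdResyncSize" { size = $2 }
        END {
            if (size > 0)
                printf "%s %.1f%%\n", state, 100 * pos / size
            else
                printf "%s no-sync-data\n", state
        }'
}

# On the server this would be fed from the real command:
#   /root/mdcmd status | resync_progress
```

If two runs a minute or so apart print the same percentage, the sync is probably not advancing.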

Third time (May 12): I turned off the Docker service and ran the parity sync. The GUI, Docker and SMB were all unresponsive, and this time there was no SSH either. This hang is the most recent one. Diagnostics were collected after the hard reboot: kbkunraid-diagnostics-20220512-2027.zip

One of my drives is showing SMART errors (value: 188) and needs to be replaced, but I can't get the system stable for the 20 hours required for a parity sync. I have done my best google-fu and searched the forums. I have seen the possible solution of upgrading to 6.10-RC8 for the MACVLAN setting issue, but since the system is unstable I thought it might break while updating, leaving me SOL. As described earlier, I tried disabling the Docker service to mimic the upgrade, but I'm not sure it has the same effect.

Could the mismatched RAM cause this mess?

USB device fault maybe?

I found someone having the same fault code with the kernel 5.8.X but the fix was upgrading to 5.9.X or higher.

I'm quite overwhelmed, and I hope someone can point me in the right direction. Thank you for your time.

syslog.zip

Edited May 12, 2022 by Kris6673
USB device question, title is rude and wrong, google-fu Linux kernel stuffs

JorgeB · May 13, 2022

There are a lot of XFS-related call traces. They don't identify the filesystem, so run xfs_repair (without -n) on all XFS filesystems; the hard crashing could be related to this:

Kris6673 · May 13, 2022

I'll run that for all the disks. After it has run, I'll try the parity sync again and report back with the results.

15 hours ago, Kris6673 said:

I replaced the CPU with an i3-8100 and went on my merry way. It hung once or twice here too in the roughly 1.5 months I had it, but I didn't think anything of it for some reason.

I found out this is wrong. The first hang described was the only one this CPU had.

It just seems strange to me that it doesn't crash unless the parity sync is running. I can run multiple transcodes with Plex and unmanic without it hanging. And it was stable on the same kernel for almost a year with a CPU with the same iGPU in it.

Kris6673 · May 13, 2022

11 hours ago, JorgeB said:

There are a lot of XFS-related call traces. They don't identify the filesystem, so run xfs_repair (without -n) on all XFS filesystems; the hard crashing could be related to this:

The first 2 disks just finished with this error:

xfs_repair dev/sdc
Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
unable to verify superblock, continuing...
.found candidate secondary superblock...
unable to verify superblock, continuing...

...........(for a long time)

Sorry, could not find valid secondary superblock
Exiting now.

All disks have said the same at the start.

I feel like I could be in trouble ^^

Edited May 13, 2022 by Kris6673

JorgeB · May 13, 2022

7 minutes ago, Kris6673 said:

xfs_repair dev/sdc

That's not how you do it, you need to use the md device, see the check filesystem wiki.
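
For reference, a minimal sketch of the difference, assuming Unraid 6.9.x device naming where array disk N is exposed as /dev/mdN and the array is started in maintenance mode. Going through the md device keeps parity in sync with any repair writes, whereas /dev/sdc is the raw whole disk rather than the filesystem on it, which is why xfs_repair could not find a superblock there. The function name xfs_check_cmd is hypothetical; it only prints the command so it can be reviewed before being run.

```bash
#!/usr/bin/env bash
# Sketch: build the xfs_repair command line for an Unraid array disk.
# Assumptions: array disk N maps to /dev/mdN (Unraid 6.9.x naming) and the
# array is in maintenance mode. Nothing is executed, the command is printed.
xfs_check_cmd() {
    disk_num=$1
    mode=${2:-check}
    case $disk_num in
        ''|*[!0-9]*) echo "disk number must be numeric" >&2; return 1 ;;
    esac
    if [ "$mode" = "repair" ]; then
        echo "xfs_repair /dev/md${disk_num}"      # actually repairs
    else
        echo "xfs_repair -n /dev/md${disk_num}"   # -n: report only, change nothing
    fi
}

xfs_check_cmd 1          # prints: xfs_repair -n /dev/md1
xfs_check_cmd 1 repair   # prints: xfs_repair /dev/md1
```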

Kris6673 · May 13, 2022

16 minutes ago, JorgeB said:

That's not how you do it, you need to use the md device, see the check filesystem wiki.

I see now. Thank you!

I think it all went okay now. I have attached the log from running it on all disks.

Do I go for the parity sync again now?

xfs_repair_results.txt

Kris6673 · May 13, 2022

I went for the parity sync, and the system hung after about 3 hours. Hard reboot and it's online again. The syslog seems to contain nothing relating to the hang.

I'll remove the i915 from my go file now and try to parity sync again.

kbkunraid-diagnostics-20220514-0023.zip syslog (1).zip

JonathanM · May 13, 2022

23 minutes ago, Kris6673 said:

Hard reboot and it's online again. The syslog seems to contain nothing relating to the hang.

Logs are kept in RAM by default to avoid wearing out the flash drive unnecessarily. You must set up the syslog server and specify a destination to keep the logs. If you log to the flash drive, be sure to disable it again after you solve the issue.

Kris6673 · May 13, 2022

12 minutes ago, JonathanM said:

Logs are kept in RAM by default to avoid wearing out the flash drive unnecessarily. You must set up the syslog server and specify a destination to keep the logs. If you log to the flash drive, be sure to disable it again after you solve the issue.

Is the setting "Mirror to Flash" not enough?

JonathanM · May 13, 2022

4 minutes ago, Kris6673 said:

Is the setting "Mirror to Flash" not enough?

Yes. I assumed from the way you worded your statement that I quoted that you didn't have it set up.

If there is nothing logged, that typically means it's hardware related, where the crash happens before anything can be logged. Since your combo has a history of hangs, the first thing I would do is swap around hardware if possible. Intermittent faults like this can be a bear to track down, because you change something, and the error can appear to go away until it happens again randomly.

Maybe run with 1/2 RAM for a period of time, then switch to the other pair?

JorgeB · May 14, 2022

8 hours ago, Kris6673 said:

I'll remove the i915 from my go file now and try to parity sync again.

Yes, try this.

Kris6673 · May 14, 2022

7 hours ago, JonathanM said:

Yes. I assumed from the way you worded your statement that I quoted that you didn't have it set up.

If there is nothing logged, that typically means it's hardware related, where the crash happens before anything can be logged. Since your combo has a history of hangs, the first thing I would do is swap around hardware if possible. Intermittent faults like this can be a bear to track down, because you change something, and the error can appear to go away until it happens again randomly.

Maybe run with 1/2 RAM for a period of time, then switch to the other pair?

Yeah, tracking stuff like this down is pretty rough, especially if you don't have spare parts lying around.

I think taking out a pair of RAM sticks is the next step.

22 minutes ago, JorgeB said:

Yes, try this.

After about 7 hours (the longest time so far) it hung again, but this time with an error I have not seen before from the GUI: 500 Internal Server Error.

I could still ping it, but not SSH into it. I'm unsure if the parity sync was still running.

There are actual entries in the syslog now too. syslog (2).zip

JorgeB · May 14, 2022

Crashing looks more hardware related, you can try running memtest if you haven't yet, can also try upgrading to v6.10-rc8 mostly to rule out any kernel compatibility issue.

Kris6673 · May 14, 2022

5 hours ago, JorgeB said:

Crashing looks more hardware related, you can try running memtest if you haven't yet, can also try upgrading to v6.10-rc8 mostly to rule out any kernel compatibility issue.

I'll run the memtest a bit later today. In the meantime I tried booting into safe mode and running the sync again. This time only the parity sync is frozen; the GUI is still fine. So this time I have actual relevant diagnostics!

I cannot stop the parity sync and I had to do another hard reboot.

syslog (3).zip kbkunraid-diagnostics-20220514-1538.zip

itimpi · May 14, 2022

If it keeps stopping during parity sync, this suggests to me that you may well have a power-related issue, as that is the one time that all drives are simultaneously active.

JorgeB · May 15, 2022

18 hours ago, Kris6673 said:

This time only the parity sync is frozen,

Unraid driver is crashing, this sometimes happens with some hardware/kernel combinations, I would suggest upgrading to v6.10.0-rc8 and trying again.

Kris6673 · May 15, 2022

18 hours ago, itimpi said:

If it keeps stopping during parity sync, this suggests to me that you may well have a power-related issue, as that is the one time that all drives are simultaneously active.

I sure hope not, but it's a possibility.

On 5/14/2022 at 10:12 AM, JorgeB said:

Crashing looks more hardware related, you can try running memtest if you haven't yet, can also try upgrading to v6.10-rc8 mostly to rule out any kernel compatibility issue.

Memtest failed with over 10,000 errors: MemTest86.log

I'm gonna test both pairs of RAM individually and in different RAM slots.

It does say in the memtest log, that it could be the BIOS on the mobo that's bad.

I sure hope it's not the memory controller on the CPU that's bad 😅

Kris6673 · May 17, 2022

System is stable now. Both pairs of RAM were okay separately, but together in the same system, they broke stuff. Very strange, but nice that it works now.

Thanks for the help, all of you!

bombz · November 1, 2022

I have also recently run into this problem with both of my Unraid servers.
Once the parity check kicks off, you can no longer access the system, and ping requests do not get a response either.
The only way to shut down safely is with the powerdown command from the local console.
I'm wondering why this started out of nowhere recently on both NAS boxes.
