Unraid becomes unresponsive - Hard reboot required


Solved by JorgeB


Hello there, 

I'm seeing issues with my server hanging randomly. It's happened a few times and I'm not sure what to do. My memory is not the best in the world, but I'll be as precise as possible.

My setup is as follows:

- Motherboard: Asus Prime Z370-P, BIOS version 3004, Realtek NIC

- RAM: 2x Corsair Vengeance LPX 2x8 GB 3000 MHz and 2x HyperX Fury 2x8 GB 3200 MHz

- GPU: iGPU UHD 630

- CPU: Before 31.03.22: i7-8700

In between: i3-8100

After 08.05.22: Intel i7-8700K

- HBA: Dell Perc H310 flashed to IT mode

- Power Supply: Corsair TX550

- Unraid USB: Kingston DataTraveler_3.0 64GB

- Unraid version: 6.9.2

 

Both RAM sets are in different channels and running at stock speed, no XMP, since with that enabled it won't POST.

I use the iGPU for Plex and Unmanic HEVC transcoding.

Syslog is mirroring to the USB.

 

It acted strange and hung randomly a couple of months ago, but back then I had a CPU die in the server and thought that was the cause.

The dead CPU was an i7-8700. That CPU threw CPU stall errors, which, from what I can see, is not happening now.

I replaced the CPU with an i3-8100 and went on my merry way. It hung once or twice here too in the roughly 1.5 months I had it, but I didn't think anything of it for some reason.

 

First time (April 8, I think): The GUI/Docker/SMB were all unresponsive, but I could SSH into it. I couldn't force a reboot via SSH, and a hard reboot was required. The parity sync seemed to crash it, but after a hard reboot and another parity sync, everything was fine.

 

Second time (May 11): The GUI got an NGINX 502 error; a few Docker containers were still running and accessible, like qBittorrent, NginxProxyManager and Plex. SSH worked too. I got diagnostics after the hard reboot: kbkunraid-diagnostics-20220512-1704.zip

A parity sync ran here, IIRC, and I used this command to see that it was stuck: /root/mdcmd status | egrep "mdResync|mdState|sbSync"
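
For reference, one way to tell a stalled sync from a merely slow one is to run that status command twice, a minute or so apart, and compare the position counter. A rough sketch, assuming the usual field names in the mdcmd output on this version:

/root/mdcmd status | egrep "mdResync|mdState|sbSync"
sleep 60
/root/mdcmd status | egrep "mdResync|mdState|sbSync"
# if mdResyncPos has not moved between the two runs while mdState is still STARTED,
# the sync is stalled rather than just slow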

 

Third time (May 12): I turned off the Docker service and ran the parity sync. The GUI/Docker/SMB were all unresponsive, and no SSH this time. This hang is the most recent one. Diagnostics were collected after the hard reboot: kbkunraid-diagnostics-20220512-2027.zip

 

One of my drives is showing SMART attribute 188 errors and needs to be replaced, but I can't get the system stable for the 20 hours required for a parity sync. I have done my best google-fu and searching on the forums. I have seen the possible solution of upgrading to 6.10-RC8 for the macvlan issue, but since the system is unstable I thought it might break while updating, leaving me SOL. As described earlier, I tried disabling the Docker service to mimic the upgrade; I'm not sure if it has the same effect.
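
For context, that failing attribute can be read directly with smartctl; a quick sketch, with /dev/sdX standing in for the suspect drive:

smartctl -A /dev/sdX | egrep "ID#|188"
# attribute 188 (Command_Timeout) counts commands that timed out;
# a rising raw value usually points at the drive itself or its cabling/power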

Could the mismatched RAM cause this mess?

USB device fault maybe?

I found someone having the same fault code with kernel 5.8.x, but the fix was upgrading to 5.9.x or higher.

 

I'm quite overwhelmed, and I hope someone can point me in the right direction. Thank you for your time.

syslog.zip

Edited by Kris6673
USB device question, title is rude and wrong, google-fu Linux kernel stuffs
Link to comment

I'll run that for all the disks. After it has run, I'll try the parity sync again and report back with the results.

 

15 hours ago, Kris6673 said:

 

I replaced the CPU with an i3-8100 and went on my merry way. It hung once or twice here too in the roughly 1.5 months I had it, but I didn't think anything of it for some reason.

I found out this is wrong. The first hang described was the only one this CPU had.

It just seems strange to me that it doesn't crash unless a parity sync tries to run. I can run multiple transcodes with Plex and Unmanic without it hanging, and it was stable on the same kernel for almost a year with a CPU that has the same iGPU.

 

Link to comment
11 hours ago, JorgeB said:

There are a lot of XFS-related call traces; it doesn't identify the filesystem, so run xfs_repair (without -n) on all XFS filesystems. The hard crashing could be related to this:

 

 

The first 2 disks just finished with this error: 

xfs_repair /dev/sdc
Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
unable to verify superblock, continuing...
.found candidate secondary superblock...
unable to verify superblock, continuing...

...........(for a long time)

Sorry, could not find valid secondary superblock
Exiting now.

 

All disks have said the same at the start.

I feel like I could be in trouble ^^
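
(Worth noting: the run above points xfs_repair at the whole disk, /dev/sdc, rather than at the partition or array device holding the filesystem, which would explain the missing superblock. On Unraid the check is normally run with the array started in maintenance mode against the md device; a rough sketch, with md1 standing in for the first data disk:)

# array started in maintenance mode; repeat for each XFS data disk
xfs_repair /dev/md1
# using the mdX device keeps parity in sync; pointing at /dev/sdc scans the raw disk,
# which has no XFS superblock at sector 0, hence the "bad magic number" error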

Edited by Kris6673
Link to comment
23 minutes ago, Kris6673 said:

Hard reboot and it's online again. The syslog seems to be empty of anything relating to the hang.

Logs are in RAM by default to keep from wearing out the flash drive unnecessarily. You must set up the syslog server and specify a destination to keep the logs; be sure to disable it again after you solve the issue if you log to the flash drive.
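
A quick way to confirm the logs are actually persisting somewhere is to look on the flash drive after the next hard reboot; a sketch, assuming the default mirror-to-flash location under /boot/logs/:

ls -lh /boot/logs/
tail -n 50 /boot/logs/syslog   # the mirrored copy should survive a hard reboot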

Link to comment
12 minutes ago, JonathanM said:

Logs are in RAM by default to keep from wearing out the flash drive unnecessarily. You must set up the syslog server and specify a destination to keep the logs; be sure to disable it again after you solve the issue if you log to the flash drive.

Is the setting "Mirror to Flash" not enough? 

Link to comment
4 minutes ago, Kris6673 said:

Is the setting "Mirror to Flash" not enough? 

Yes. I assumed from the way you worded your statement that I quoted that you didn't have it set up.

 

If there is nothing logged, that typically means it's hardware related, where the crash happens before anything can be logged. Since your combo has a history of hangs, the first thing I would do is swap around hardware if possible. Intermittent faults like this can be a bear to track down, because you change something, and the error can appear to go away until it happens again randomly. 

 

Maybe run with 1/2 RAM for a period of time, then switch to the other pair?

Link to comment
7 hours ago, JonathanM said:

Yes. I assumed from the way you worded your statement that I quoted that you didn't have it set up.

 

If there is nothing logged, that typically means it's hardware related, where the crash happens before anything can be logged. Since your combo has a history of hangs, the first thing I would do is swap around hardware if possible. Intermittent faults like this can be a bear to track down, because you change something, and the error can appear to go away until it happens again randomly. 

 

Maybe run with 1/2 RAM for a period of time, then switch to the other pair?

Yeah, tracking stuff like this down is pretty rough, especially if you don't have spare parts lying around :)

I think taking out a pair of RAM sticks is the next step. 

 

22 minutes ago, JorgeB said:

Yes, try this.

After about 7 hours (the longest time so far) it hung again, but this time with an error I have not seen before from the GUI: a 500 Internal Server Error.

I could still ping it, but not SSH into it. I'm unsure if the parity sync was still running. 

There are actual entries in the syslog now too: syslog (2).zip

 

Link to comment
5 hours ago, JorgeB said:

Crashing looks more hardware related; you can try running memtest if you haven't yet, and you can also try upgrading to v6.10-rc8, mostly to rule out any kernel compatibility issue.

I'll run the memtest a bit later today. In the meantime I tried booting into safe mode and running the sync again. This time only the parity sync is frozen; the GUI is still fine. So this time I have actual relevant diagnostics!

I cannot stop the parity sync and I had to do another hard reboot. 
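
(For what it's worth, a parity check can normally also be cancelled from the command line; a sketch, assuming mdcmd on this version accepts the usual nocheck argument:)

/root/mdcmd nocheck
/root/mdcmd status | egrep "mdResync|mdState"   # confirm the sync has actually stopped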

 

syslog (3).zip kbkunraid-diagnostics-20220514-1538.zip

Link to comment
18 hours ago, itimpi said:

If it keeps stopping during a parity sync, this suggests to me that you may well have a power-related issue, as that is one time that all drives are simultaneously active.

I sure hope not, but it's a possibility.

On 5/14/2022 at 10:12 AM, JorgeB said:

Crashing looks more hardware related; you can try running memtest if you haven't yet, and you can also try upgrading to v6.10-rc8, mostly to rule out any kernel compatibility issue.

Memtest failed with over 10,000 errors: MemTest86.log. I'm gonna test both pairs of RAM individually and in different RAM slots.

It does say in the memtest log that it could be the BIOS on the mobo that's bad.

I sure hope it's not the memory controller on the CPU that's bad 😅

 

Link to comment
  • 5 months later...

I have also recently run into this issue with both of my Unraid servers.
Once a parity sync kicks off, you can no longer access the system; ping requests do not respond either.
The only way to safely shut down is with the powerdown command via the local console.
Wondering why this sparked out of nowhere recently on both NAS boxes.
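
(If the local console is still responsive when this happens, it may be worth grabbing diagnostics before shutting down; a sketch using the built-in commands:)

diagnostics   # writes a diagnostics zip to the flash drive (under /boot/logs/, if memory serves)
powerdown     # clean shutdown that stops the array first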

Link to comment
