Issues with repeated hanging and unclean shutdowns

August 12, 2024

Hi,

I've had my first Unraid setup for a few months now, and it has unfortunately been rather unstable. Every few weeks something hangs it, forcing me to shut it down. I am running it on a brand new TerraMaster F4-424 Pro with 4 brand new HDDs + 1 brand new NVMe as cache. I've run memtest86 and extended SMART tests without seeing any issues.

Occasionally I've had 1 or 2 sync errors that occurred after cold shutdowns. These have been corrected, and I've since installed the Dynamix File Integrity plugin to have more control over the health of my system. I've also tweaked and fixed some things over time and thought I finally had it sorted out, until yesterday.

The system hung and I could not access it in any way. In the end I had to force a cold shutdown. Once I brought it up again, pretty much all files on one of my disks appeared to be corrupted according to the Dynamix File Integrity plugin.

I have a hard time understanding how almost all files on that specific disk were corrupted like that, but it seems that is indeed the case.

After the cold shutdown the parity check started (which apparently has error correction enabled) and began correcting thousands of issues.

I have attached the syslog leading up to the forced shutdown from my remote syslog server, as well as the diagnostics from after the server was back up.

Any help in getting to the root cause of this issue would be much appreciated.

tower-diagnostics-20240812-1128.zip

August 12, 2024

Community Expert

Multiple call traces logged. Memtest didn't find anything, but it's only definitive if it finds errors, so if you have multiple sticks, try using the server with just one; if the problem is the same, try with a different one. That will basically rule out bad RAM.

August 12, 2024

Author

4 minutes ago, JorgeB said:

Multiple call traces logged

Can you please elaborate on what this means?

I'm afraid I don't have multiple sticks, but I can pick up a new one later today. What do you want me to do once I've changed it, run memtest again?

August 12, 2024

Community Expert

48 minutes ago, l1ck said:

Can you please elaborate on what this means?

They can mean a lot of things, including hardware or software issues, but they look more hardware related to me for now.

49 minutes ago, l1ck said:

What do you want me to do once I've changed it, run memtest again?

You can, but as mentioned, it's only definitive if it finds errors, so I would recommend just using the server normally. If the same crashes continue to happen, it's unlikely that they are RAM related.

August 12, 2024

Author

Okay, so RAM is a probable culprit for the instability as of now. I am a bit reluctant to just keep running the server, though, now that I have so many corrupted files.

Can we assume that the RAM is related to the corruption of the files as well and not just the crashing?

I ran a check just 5 days ago where all checksums were fine. Now 5 days later one of the disks is pretty much trashed.

Were you able to spot anything suspicious related to disk2?
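For readers who want an independent cross-check of the "checksums were fine five days ago" observation, here is a minimal sketch of a manifest-based verification. It is a standalone illustration and does not reproduce how the Dynamix File Integrity plugin stores or checks its hashes; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_mismatches(root, manifest):
    """Return relative paths that are missing or whose hash no longer matches."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

A manifest built while the data is known to be good and stored off the array could later be passed to `find_mismatches` to see which files on a given disk changed. Note that on a machine with suspect RAM the hashing itself can produce false mismatches.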

August 12, 2024

Community Expert

38 minutes ago, l1ck said:

Can we assume that the RAM is related to the corruption of the files as well and not just the crashing?

It's a possibility.

August 12, 2024

Author

Alright. In the meantime, about the trashed disk: should I try to rebuild it from parity?

The parity sync started up after the cold shutdown and autocorrected some parity before I noticed it was enabled (I paused it 9% in).

I assume the remaining 91% can be rebuilt using the parity, or am I better off just deleting the data on that specific disk and starting over?

I currently have 1 parity and 3 data drives.

August 13, 2024

Community Expert

I would avoid rebuilding disks while there are stability issues.
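As background to the rebuild question, here is a toy sketch of single parity modelled as a byte-wise XOR across the data disks. It is an illustration only, not Unraid's code, and the disk contents are made up. It shows why a rebuild can only return whatever the current parity encodes: once a correcting check has rewritten parity to match a corrupted data disk, rebuilding from that parity reproduces the corrupted bytes.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical three-byte "disks" standing in for one stripe of the array.
disk1 = b"\x10\x20\x30"
disk2_good = b"\x0a\x0b\x0c"
disk3 = b"\x01\x02\x03"

# Parity computed while all data was intact.
parity_old = xor_parity([disk1, disk2_good, disk3])

# Rebuilding disk2 from the old parity recovers the original content.
rebuilt_from_old = xor_parity([parity_old, disk1, disk3])

# disk2 gets corrupted, then a correcting check rewrites parity to match it.
disk2_bad = b"\xff\x0b\x00"
parity_corrected = xor_parity([disk1, disk2_bad, disk3])

# Rebuilding from the corrected parity now returns the corrupted content.
rebuilt_from_corrected = xor_parity([parity_corrected, disk1, disk3])

print(rebuilt_from_old == disk2_good)        # True
print(rebuilt_from_corrected == disk2_bad)   # True
```

In this simplified model, only the part of parity that has not yet been "corrected" still encodes the pre-corruption data, and even that only holds if the other disks and the machine doing the arithmetic are reliable.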

August 13, 2024

Author

Ok.

I spent some time going through the log leading up to the shutdown.

The first issue is of course that the system became unresponsive, which as you said could be RAM related.

I think I noticed a second one though, which, as I understand it, would be unrelated to the RAM?

I pressed the power key after determining that I could not access the system anymore.

However, the system never went down; instead, the last line I see is this --> Active pids left on /mnt/*

I waited about 25 minutes for it to resolve itself before doing an unclean shutdown.

I have the Tips and Tweaks plugin installed and configured to kill ssh and bash processes before shutting down the array.

Any idea what could have been preventing the system from shutting down gracefully?

August 13, 2024

Community Expert

Try booting in safe mode to rule out plugins, but it could be related to the stability issues.

August 13, 2024

Author

Do you mean I should run it in safe mode for an extended period of time to see if it crashes again?

August 13, 2024

Community Expert
Solution

That was for the issue of not being able to stop the array, but you can also try booting the server in safe mode with all Docker containers/VMs disabled and letting it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

August 15, 2024

Author

So I've changed out the RAM (ran memtest on the new stick as well without errors) and booted it up in safe mode.

Before I did that I uninstalled all dockers except for Plex, and most of the plugins (I am not using VMs and have that option disabled).

So far, so good. I'll let it run like this for some time and observe.

I did notice something odd regarding the parity check that you might be able to shed some light on, though.

When it initially booted up from the unclean shutdown, Unraid automatically started the parity check and eventually found 2.5 million errors.

During the check I saw that the parity drive was being written to, so I assumed that error correction was enabled for this type of check.

Today I manually started a parity check while in safe mode, with error correction disabled.

However, I see the same behaviour: the parity drive is continuously being written to as the parity check is again finding hundreds of thousands of errors.

Is the automated parity check after an unclean shutdown not correcting the errors? If so, that explains why it is again finding thousands of errors.

But why is it writing to the parity drive?

I guess the same question goes for my manually initiated parity check.

August 15, 2024

Community Expert

IIRC there's an old bug where the first check after an array start is always correcting.

August 15, 2024

Author

Okay, but what about the manual parity check I started with error correction disabled? This one is also writing to the parity disk.

August 15, 2024

Community Expert

I would suggest you are likely to get better informed feedback if you attach your system’s diagnostics zip file to your next post in this thread with this problem occurring so we can see what is going on.

August 15, 2024

Author

3 hours ago, itimpi said:

I would suggest you are likely to get better informed feedback if you attach your system’s diagnostics zip file to your next post in this thread with this problem occurring so we can see what is going on.

Hi!

Of course, sorry about that! Please see the attached diagnostics.

tower-diagnostics-20240815-1729.zip

FYI - I was previously seeing a lot of messages like these.

Aug 14 21:41:29 Tower kernel: pcieport 0000:00:1c.6: AER: Corrected error message received from 0000:00:1c.6
Aug 14 21:41:29 Tower kernel: pcieport 0000:00:1c.6: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 14 21:41:29 Tower kernel: pcieport 0000:00:1c.6: device [8086:54be] error status/mask=00000001/00002000
Aug 14 21:41:29 Tower kernel: pcieport 0000:00:1c.6: [ 0] RxErr

I was reading up on the forum and was able to get rid of them by setting pcie_aspm=off in the syslinux config as recommended in another post.

The reason I bring it up is that I had only set it in the OS section, and I was reminded of it now that I booted the server in safe mode. I have since added pcie_aspm=off to the other sections as well.

Please let me know if you have any comments on that solution.
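To check whether the AER messages really stopped after the change, a small script can tally them per device from a saved syslog. This is a minimal sketch: the regular expression is modelled only on the lines quoted in this post, and the function name is made up for the example.

```python
import re
from collections import Counter

# Matches lines such as:
# "pcieport 0000:00:1c.6: PCIe Bus Error: severity=Corrected, type=Physical Layer, ..."
AER_RE = re.compile(
    r"pcieport (\S+): PCIe Bus Error: severity=(\w+), type=([^,]+),"
)


def count_aer_errors(lines):
    """Count PCIe bus error reports keyed by (device, severity, error type)."""
    counts = Counter()
    for line in lines:
        match = AER_RE.search(line)
        if match:
            device, severity, err_type = match.groups()
            counts[(device, severity, err_type.strip())] += 1
    return counts


SAMPLE = [
    "Aug 14 21:41:29 Tower kernel: pcieport 0000:00:1c.6: AER: Corrected error message received from 0000:00:1c.6",
    "Aug 14 21:41:29 Tower kernel: pcieport 0000:00:1c.6: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)",
    "Aug 14 21:41:29 Tower kernel: pcieport 0000:00:1c.6: device [8086:54be] error status/mask=00000001/00002000",
    "Aug 14 21:41:29 Tower kernel: pcieport 0000:00:1c.6: [ 0] RxErr",
]

for key, n in count_aer_errors(SAMPLE).items():
    print(key, n)
```

Running it over the log from before and after the boot option was added to every section would show whether the count for 0000:00:1c.6 actually drops to zero in each boot mode.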

August 15, 2024

Community Expert

According to the diagnostics you are currently running a correcting check:

Aug 15 00:34:03 Tower kernel: mdcmd (36): check correct

Aug 15 02:14:22 Tower kernel: md: recovery thread: P corrected, sector=2414576848
Aug 15 02:14:22 Tower kernel: md: recovery thread: P corrected, sector=2414587320
Aug 15 02:14:22 Tower kernel: md: recovery thread: P corrected, sector=2414597848
Aug 15 02:14:22 Tower kernel: md: recovery thread: P corrected, sector=2414608456

which would explain why you are getting writes.
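The same lookup can be repeated on any saved syslog with a short script. This is a sketch under the assumption that the log lines follow the format quoted above; it reports whatever argument follows "check" in each mdcmd line rather than assuming a particular spelling for a non-correcting run, and the function name is made up for the example.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this post.
CHECK_RE = re.compile(r"mdcmd \(\d+\): check\s*(\S*)")
CORRECTED_RE = re.compile(r"md: recovery thread: (\w+) corrected, sector=(\d+)")


def summarize_parity_log(lines):
    """Return (check arguments in order seen, corrected-sector counts per label)."""
    modes = []
    corrected = Counter()
    for line in lines:
        match = CHECK_RE.search(line)
        if match:
            modes.append(match.group(1).lower() or "(no argument)")
            continue
        match = CORRECTED_RE.search(line)
        if match:
            corrected[match.group(1)] += 1
    return modes, corrected


SAMPLE = [
    "Aug 15 00:34:03 Tower kernel: mdcmd (36): check correct",
    "Aug 15 02:14:22 Tower kernel: md: recovery thread: P corrected, sector=2414576848",
    "Aug 15 02:14:22 Tower kernel: md: recovery thread: P corrected, sector=2414587320",
]

modes, corrected = summarize_parity_log(SAMPLE)
print(modes, dict(corrected))
```

If a check that was started with correction disabled still shows up with the same argument as a correcting one, that points at how the check was launched rather than at the disks.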

August 15, 2024

Author

Yeah, this is what is throwing me off. I am 100% sure I started a non-correcting check.

When it initially booted up from the unclean shutdown, Unraid automatically started a parity check and found 2.5 million errors.

During the check I saw that the parity drive was being written to, and this finished yesterday. It was unclear to me, though, whether this check was running with correction enabled or not.

I started a new check soon after (with correction disabled), and I am currently at 1.65 million errors (63.6% done). It's about the same level as the previous run.


Solved by JorgeB
