l1ck Posted August 12

Hi,

So I've had my first Unraid setup for a few months now and it has unfortunately been rather unstable. Every few weeks something hangs it, forcing me to shut it down. I am running it on a brand new TerraMaster F4-424 Pro with 4 brand new HDDs + 1 brand new NVMe as cache. I've run memtest86 and extended SMART tests without seeing any issues.

Occasionally I've had 1 or 2 sync errors that have occurred after cold shutdowns. These have been corrected, and I've since installed the Dynamix File Integrity plugin to have more control over the health of my system. I've also tweaked and fixed some stuff over time and thought I had it finally sorted out, until yesterday.

The system hung and I could not access it in any way. In the end I had to force a cold shutdown. Once I brought it up again, pretty much all files on one of my disks appear to have been corrupted according to the Dynamix File Integrity plugin. I have a hard time understanding how almost all files on that specific disk were corrupted like that, but it seems that is indeed the case. After the cold shutdown the parity check started (which apparently has error correction enabled) and it began correcting thousands of issues.

I have attached the syslog leading up to me forcing a shutdown (from my remote syslog server), as well as the diagnostics from after the server was back up. Any help in getting to the root cause of this issue would be much appreciated.

tower-diagnostics-20240812-1128.zip
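For reference, the File Integrity plugin stores its checksums as extended attributes on the files, so a flagged file can be spot-checked manually from the console. A rough sketch (the path is just an example, and the exact attribute name depends on the hash method configured in the plugin):

# Show the extended attributes stored for a flagged file (example path)
getfattr -d /mnt/disk2/some/flagged/file

# Recompute the hash and compare it by eye against the stored value,
# using whichever hash method the plugin was configured with (md5 here as an example)
md5sum /mnt/disk2/some/flagged/file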
JorgeB Posted August 12

There are multiple call traces logged. Memtest not finding anything doesn't rule out RAM, since it's only definitive when it finds errors. If you have multiple sticks, try using the server with just one; if the problem persists, try a different stick. That will basically rule out bad RAM.
l1ck Posted August 12

4 minutes ago, JorgeB said: Multiple call traces logged

Can you please elaborate on what this means?

As for multiple sticks: I'm afraid not, but I can pick up a new one later today. What do you want me to do once I've changed it, run memtest again?
JorgeB Posted August 12

48 minutes ago, l1ck said: Can you please elaborate on what this means?

They can mean a lot of things, including hardware or software issues, but they look more hardware related to me for now.

49 minutes ago, l1ck said: What do you want me to do once I've changed it, run memtest again?

You can, but as mentioned, memtest is only definitive if it finds errors, so I would recommend just using the server normally. If the same crashes continue with the new stick, it's unlikely to be RAM related.
l1ck Posted August 12

Okay, so RAM is a probable culprit for the instability as of now. I am a bit reluctant to just keep running the server, though, now that I have so many corrupted files. Can we assume that the RAM is related to the corruption of the files as well, and not just the crashing? I ran a check just 5 days ago where all checksums were fine. Now, 5 days later, one of the disks is pretty much trashed. Were you able to spot anything suspicious related to disk2?
JorgeB Posted August 12

38 minutes ago, l1ck said: Can we assume that the RAM is related to the corruption of the files as well and not just the crashing?

It's a possibility.
l1ck Posted August 12

Alright, in the meantime, about the trashed disk: should I try to rebuild it from parity? The parity sync started after the cold shutdown and auto-corrected some parity before I noticed it was enabled (I paused it 9% in). I assume the remaining 91% can be rebuilt using the parity, or am I better off just deleting the data on that specific disk and starting over? I currently have 1 parity and 3 data drives.
JorgeB Posted August 13

I would avoid rebuilding disks while there are stability issues.
l1ck Posted August 13

Ok. I was spending some time going through the log leading up to the shutdown. The first issue is of course that the system became unresponsive, which as you said could be RAM related. I think I noticed a second one though, which as I understand it would be unrelated to the RAM?

I pressed the power key after determining that I could not access the system anymore. However, the system never went down; instead the last line I see is this --> Active pids left on /mnt/*

I waited about 25 minutes for it to resolve itself before doing an unclean shutdown. I have the Tips and Tweaks plugin installed and configured to kill SSH and bash processes before shutting down the array. Any idea what could have been preventing the system from shutting down gracefully?
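In case it helps for next time: when a shutdown stalls on "Active pids left on /mnt/*", it is possible to see from an SSH/console session which processes are still holding the array mounts, assuming fuser/lsof are available on the install. A rough sketch (the paths are just examples, adjust to your own shares/disks):

# Show processes with open files on a mount point (-v verbose, -m treat argument as a mount)
fuser -vm /mnt/user

# More detail on the open files themselves; +D recurses into the directory,
# which can be slow on a large share
lsof +D /mnt/disk2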
JorgeB Posted August 13

Try booting in safe mode to rule out plugins, but it could also be related to the stability issues.
l1ck Posted August 13

Do you mean I should run it in safe mode for an extended period of time to see if it crashes again?
JorgeB Posted August 13 (Solution)

That was for the issue of not being able to stop the array, but you can also try booting the server in safe mode with all Docker containers/VMs disabled and letting it run as a basic NAS for a few days. If it still crashes it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
l1ck Posted August 15

So I've changed out the RAM (ran memtest on the new stick as well, without errors) and booted it up in safe mode. Before I did that I uninstalled all Docker containers except for Plex, and most of the plugins (I am not using VMs and have that option disabled). So far, so good. I'll let it run like this for some time and observe.

I did notice something odd regarding the parity check that you might shed some light on, though. When it initially booted up after the unclean shutdown, Unraid automatically started the parity check and eventually found 2.5 million errors. During the check I saw that the parity drive was being written to, so I assumed that error correction was enabled for this type of check.

Today I manually started a parity check while in safe mode, with error correction disabled. However, I see the same behaviour: the parity drive is continuously being written to as the parity check is again finding hundreds of thousands of errors. Is the automated parity check after an unclean shutdown not correcting the errors? If so, that explains why it is again finding thousands of errors. But why is it writing to the parity drive? I guess the same question goes for my manually initiated parity check.
JorgeB Posted August 15

IIRC there's an old bug where the first check after an array start is always correcting.
l1ck Posted August 15

Okay, but what about the manual parity check I started with error correction disabled? This one is also writing to the parity disk.
itimpi Posted August 15

I would suggest you are likely to get better informed feedback if you attach your system's diagnostics zip file, taken while the problem is occurring, to your next post in this thread so we can see what is going on.
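If it saves a step: a fresh diagnostics archive can be generated from Tools > Diagnostics in the web GUI or, if I remember the stock tooling right, from a console/SSH session, with the zip landing in the logs folder on the flash drive:

# Generate a diagnostics zip (written under /boot/logs on a stock install)
diagnostics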
l1ck Posted August 15

3 hours ago, itimpi said: I would suggest you are likely to get better informed feedback if you attach your system's diagnostics zip file, taken while the problem is occurring, to your next post in this thread so we can see what is going on.

Hi! Of course, sorry about that! Please see the attached diagnostics. tower-diagnostics-20240815-1729.zip

FYI - I was previously seeing a lot of messages like these:

Aug 14 21:41:29 Tower kernel: pcieport 0000:00:1c.6: AER: Corrected error message received from 0000:00:1c.6
Aug 14 21:41:29 Tower kernel: pcieport 0000:00:1c.6: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 14 21:41:29 Tower kernel: pcieport 0000:00:1c.6: device [8086:54be] error status/mask=00000001/00002000
Aug 14 21:41:29 Tower kernel: pcieport 0000:00:1c.6: [ 0] RxErr

I was reading up on the forum and was able to get rid of them by setting pcie_aspm=off in the syslinux config, as recommended in another post. The reason I bring it up is that I had only added it to the default OS boot entry, and I was reminded of it now that I booted the server up in safe mode. I have since added pcie_aspm=off to the other boot entries as well. Please let me know if you have any comments on that solution.
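For anyone following along, the pcie_aspm=off parameter goes on the "append" line of each boot entry in syslinux.cfg on the flash drive (/boot/syslinux/syslinux.cfg). A sketch of what two of the entries might look like after the change; the exact labels and other options vary by Unraid version, so treat this as illustrative rather than a copy-paste template:

label Unraid OS
  menu default
  kernel /bzimage
  append pcie_aspm=off initrd=/bzroot

label Unraid OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append pcie_aspm=off initrd=/bzroot unraidsafemode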
itimpi Posted August 15

According to the diagnostics you are currently running a correcting check:

Aug 15 00:34:03 Tower kernel: mdcmd (36): check correct
Aug 15 02:14:22 Tower kernel: md: recovery thread: P corrected, sector=2414576848
Aug 15 02:14:22 Tower kernel: md: recovery thread: P corrected, sector=2414587320
Aug 15 02:14:22 Tower kernel: md: recovery thread: P corrected, sector=2414597848
Aug 15 02:14:22 Tower kernel: md: recovery thread: P corrected, sector=2414608456

which would explain why you are getting writes.
l1ck Posted August 15

Yeah, this is what is throwing me off. I am 100% sure I started a non-correcting check.

When it initially booted up after the unclean shutdown, Unraid automatically started a parity check and found 2.5 million errors. During that check I saw that the parity drive was being written to, and it finished yesterday. To me it was unclear, though, whether that check was running with correction enabled or not.

I started a new check soon after (with correction disabled), and I am currently at 1.65 million errors (63.6% done). It's about the same level as the previous run.
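One way to double-check which mode a running check was actually started in is to look at the mdcmd lines in the syslog, along the lines of the sketch below. A correcting check shows up as "check correct" (as in the lines itimpi quoted above); a non-correcting one should be logged differently (typically "check nocorrect", if I recall the mdcmd syntax correctly):

# Show how the parity checks were started according to the syslog
grep -i "mdcmd" /var/log/syslog | grep -i "check"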