New Unraid User - Frequent system crashes. Help!


Solved by Vr2Io


Hi All, I've had my system set up for a few weeks now. After I put the hardware together and created my first share, it seemed to run without issue for a week or two.

 

I then started adding things from the community apps and it was perhaps at this point I started experiencing crashes. The network share would be inaccessible from other systems on my network, and the WebGUI would not respond. At this point I would hard power down, power back on, re-log in and re-start the array. 

 

Plugins that are installed: 

Community Applications

CA Mover Tuning

Fix Common Problems

Parity Check Tuning

unBALANCE

Unraid Connect

 

I have two docker containers that are not currently started, not sure if their presence affects anything but I haven't been able to get them to work yet: 

 

blueiris

DiskSpeed

 

The system crashed several times yesterday, at approximately 10:48AM, then 3:37PM then 9:37PM. So far it's been running since the last crash for 13 hours without crashing. 

 

I am attaching the three diagnostic zip files from yesterday, please let me know if there are any recommended actions. Thanks!

 

znas-diagnostics-20240102-1048.zip znas-diagnostics-20240102-1537.zip znas-diagnostics-20240102-2137.zip
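For anyone trying to spot a pattern in the crash times, here is a minimal sketch that reads the timestamps out of the attachment names above and prints the gap between consecutive captures. The filename pattern is inferred from these three files only, and the helper names (`capture_time`, `gaps`) are made up for illustration. The timestamps are when the diagnostics were captured, so they only approximate the crash times.

```python
import re
from datetime import datetime

# Filenames exactly as attached in this post.
FILES = [
    "znas-diagnostics-20240102-1048.zip",
    "znas-diagnostics-20240102-1537.zip",
    "znas-diagnostics-20240102-2137.zip",
]

# Assumed pattern: <name>-diagnostics-YYYYMMDD-HHMM.zip
STAMP = re.compile(r"-diagnostics-(\d{8})-(\d{4})\.zip$")


def capture_time(filename):
    """Return the capture time encoded in a diagnostics filename."""
    match = STAMP.search(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    return datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M")


def gaps(filenames):
    """Return the time between consecutive captures, oldest first."""
    times = sorted(capture_time(name) for name in filenames)
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    for gap in gaps(FILES):
        print(gap)  # prints 4:49:00, then 6:00:00
```

Three data points are too few to call the spacing regular, but the same approach works on a longer list of diagnostics files.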

Edited by zali01
Link to comment

Unfortunately there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all docker/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

Link to comment
12 minutes ago, JorgeB said:

Unfortunately there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all docker/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

 

Thanks for the response! The system is currently about 45% through a parity check after the last crash. Should I wait for it to complete (the concern here is that it may crash before finishing) and then clean shut down? Or should I try to shut down now and then do a safe mode restart?

 

If the latter, does it continue the parity check after restarting? Or will it have to start over? Or does it not do a parity check in safe mode? 

 

Current estimated finish for parity check is 11 hours. 

Link to comment
6 hours ago, zali01 said:

If the latter, does it continue the parity check after restarting?

The default Unraid behaviour is to restart from the beginning, although if you have the Parity Check Tuning plugin installed, it has an option to resume from the point previously reached.

Link to comment
1 hour ago, itimpi said:

The default Unraid behaviour is to restart from the beginning, although if you have the Parity Check Tuning plugin installed, it has an option to resume from the point previously reached.

Thanks @itimpi. I do have the parity check tuning plugin installed, but I was planning on restarting into safe mode as @JorgeB suggested. Will restarting in safe mode disable those plugins? I would assume it does, in which case wouldn't the behavior be as @JorgeB suggested (to do a full parity check if it's interrupted)? 

 

My plan was to let it continue the parity check; it's about 80% done now. If it crashes, I'll restart into safe mode at that point. If it doesn't crash, I'll still restart into safe mode but won't have to do another parity check.

 

Follow-up question: if I restart into safe mode, can I then "start" plugins one by one while in that mode over time?

Link to comment
7 hours ago, zali01 said:

Will restarting in safe mode disable those plugins

It will disable the plugin as you mention. I'm not sure what will happen when you next go into Normal mode if the parity check was not completed before rebooting into Safe mode. It's not something I have tested as far as I can remember.

 

There is no easy way to selectively start plugins while in Safe mode except by re-installing them. Having said that, it is probably possible using the command line, but I'm not sure of the details.

Link to comment
4 hours ago, itimpi said:

It will disable the plugin as you mention. I'm not sure what will happen when you next go into Normal mode if the parity check was not completed before rebooting into Safe mode. It's not something I have tested as far as I can remember.

 

There is no easy way to selectively start plugins while in Safe mode except by re-installing them. Having said that, it is probably possible using the command line, but I'm not sure of the details.

 

A few new updates... 

I left the system running the parity check last night, hoping it would finish and that I could then safely/cleanly shut down. Anticipating this, I decided I would uninstall all plugins that I added because if I can't even have my system stable then there's no use for plugins. So I uninstalled all plugins except community applications and unraid connect. I also deleted both docker containers that I had installed. I was hoping that by doing this, I could pre-emptively avoid whatever was causing the crash, if it was indeed software-related (plugin configuration/etc)

 

Unfortunately, when I woke up this morning the system had already become unresponsive, the same crash behavior as before. I restarted into GUI Safe mode (because I'm not yet well-versed enough to run the system without a GUI), and have started the array. The parity check has automatically started. 

 

I noticed in the Plugins tab, it lists No plugins installed (not even community apps or unraid connect). That's just an observation. 

 

I am glad to see that safe mode still allows network connectivity, which is critical since obviously this is a NAS system. I just wasn't 100% sure since I am coming from a 99% Microsoft Windows background, in which safe mode disables network connectivity. 

 

I have attached the diagnostics file in case anyone can identify anything. 

 

20 hours ago, JorgeB said:

Unfortunately there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all docker/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

I am really hoping that what @JorgeB said is not the case (that it's a hardware issue). I spent a not-insignificant amount of $ on new parts for this system, hoping I can migrate all of the services currently running on my desktop PC to it. Including Plex, Blueiris, influxdb, grafana, tautulli, and probably others I can't remember off the top of my head. Most of the parts are brand new, specifically everything except the hard drives (which are from serverpartdeals). The parts I bought are leaning toward higher end too, not cheap unknown brands. Here is a list of the parts I bought: https://pcpartpicker.com/list/rskDkJ 

 

It would be strange to me for any of these brand new parts to be causing a failure, and if that's the case I don't know how I would go about troubleshooting them individually. I don't really have another system with similar sockets/etc to swap parts with. My current desktop PC was built in 2018, so not all parts are swappable/compatible. 

 

In the event that the system does continue to run in safe mode successfully for at least a few days and then continuing for perhaps 1-3 weeks, this would be great but I would then ask how come it crashed last night after I deleted all plugins/docker containers. 

 

Thoughts anyone? 

znas-diagnostics-20240104-0827.zip

Link to comment

Update: Still running in safe mode. So far the system has not crashed while in safe mode, uptime is currently at 1 day 1 hr 15min. 

 

As I mentioned, I hope it does not crash as this would validate my assumptions that new hardware should generally perform reliably. But this would introduce the question as to why the last crash happened even though I deleted all plugins and docker containers. 

 

Again, if anyone has suggestions/ideas I'm all ears. Thanks!

Link to comment

Update, well unfortunately the system crashed again even though I was running in safe mode. I have attached the diagnostics file.

znas-diagnostics-20240105-2016.zip

 

I noticed the log listed a particular media file from 2007 as corrupt, not sure if I should just delete that file to avoid issues in the future. 

 

Can anyone suggest what approach I should take next? Aside from the hard drives from serverpartdeals, these are brand new parts that I bought to eventually build a high powered server to host several services and fulfill multiple purposes. 

 

I'm feeling kind of frustrated that I haven't even gotten it to be a stable NAS yet as lately it's been crashing almost every day, sometimes several times a day. Parity checks don't even get to finish before the system crashes again. 

 

Reading through some of the other help requests, it seems some folks who had other unrelated issues had some success with using an earlier version of Unraid. Is this something I should try? 

 

@JorgeB @itimpi

 

Thanks

 

 

Link to comment
  • Solution
Posted (edited)

A call trace was found in the previous log. Have you run a memory test to verify system stability?

 

1 hour ago, zali01 said:

it seems some folks who had other unrelated issues had some success with using an earlier version of Unraid. Is this something I should try? 

 

Yes. Some posts say newer Intel platforms tend to be unstable with the current Unraid, but I can't verify that because I'm still on 9th gen.
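If you want to check a log for call traces yourself, here is a minimal sketch that scans a syslog file for a few kernel-trouble markers. The marker list is my own assumption of what is worth flagging, not something taken from the diagnostics in this thread, and the function name and the path you pass in (for example the syslog extracted from a diagnostics zip) are hypothetical.

```python
import re
import sys

# Assumed markers that commonly accompany kernel trouble in a Linux syslog.
MARKERS = re.compile(
    r"Call Trace|kernel BUG|general protection fault|Machine Check",
    re.IGNORECASE,
)


def find_suspect_lines(lines):
    """Return (line_number, text) for every line that matches a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if MARKERS.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_syslog.py <path-to-syslog>
    with open(sys.argv[1], errors="replace") as handle:
        for number, text in find_suspect_lines(handle):
            print(f"{number}: {text}")
```

A hit only tells you where to start reading. A call trace can come from bad RAM, a driver bug, or other causes, so it is a reason to run a memory test rather than proof of a specific fault.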

Edited by Vr2Io
Link to comment
14 minutes ago, Vr2Io said:

A call trace was found in the previous log. Have you run a memory test to verify system stability?

 

 

Yes. Some posts say newer Intel platforms tend to be unstable with the current Unraid, but I can't verify that because I'm still on 9th gen.

Thanks for the suggestions. Yes, I did run a memtest when building the system in December. I don't recall seeing any errors, but if the system keeps crashing I can probably try running it again. 

 

For now, I have restored a backup of my flash drive that I made last week. Who knows if that will help anything but we will see. If it crashes again I will try running another memtest.

Link to comment
Posted (edited)
On 1/5/2024 at 10:25 PM, zali01 said:

Thanks for the suggestions. Yes, I did run a memtest when building the system in December. I don't recall seeing any errors, but if the system keeps crashing I can probably try running it again. 

 

For now, I have restored a backup of my flash drive that I made last week. Who knows if that will help anything but we will see. If it crashes again I will try running another memtest.

The system had run for about 2 or 3 days without a crash, and then it crashed last night. 

 

I took @Vr2Io's advice and started taking action to run another memtest.

On 1/5/2024 at 10:08 PM, Vr2Io said:

A call trace was found in the previous log. Have you run a memory test to verify system stability?

 

 

Yes. Some posts say newer Intel platforms tend to be unstable with the current Unraid, but I can't verify that because I'm still on 9th gen.

 

I discovered the original memtest that I ran back in December was a memtest associated with the motherboard directly in the BIOS (if I understand correctly). That test originally ran for 2-3 hours and passed. 

 

AsusMemtestPass.thumb.jpeg.ac825300469964964c4dbe7bf9d0f52a.jpeg

 

Today I ran the memtest that is an option in the Unraid boot menu (where you choose GUI/safe mode/etc), and that test ran for less than an hour before finding errors. 

 

UnraidMemtestFail.thumb.jpeg.b4d3bbda58b1e92c9e2270bcf4c05a65.jpeg

 

I immediately set up a return for the RAM sticks since I just bought them a month or two ago, and ordered a new pair of RAM sticks. They should get here tomorrow. 

 

Question 1: Was the RAM test that was done through the ASUS BIOS an inferior test? Why would that test pass and the unraid memtest detect errors/fail?

 

Question 2: Is it safe to assume that most if not all of my system crashes are due to these errors/issues that the memtest is finding? 

 

I've seen a number of posts where folks said their system was crashing and replacing the RAM fixed the issue. I'm hoping I will be one of those folks soon. 

Thanks also to @JorgeB for the suggestion of hardware being a potential cause of these crashes, and to @itimpi for contributing as well. 

On 1/3/2024 at 11:59 AM, JorgeB said:

Unfortunately there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all docker/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

 

Thanks in advance

Edited by zali01
Link to comment
5 minutes ago, itimpi said:

The free-standing RAM tests are nearly always more thorough than ones built into a BIOS.

 

Any RAM faults can cause unpredictable errors.

Very interesting. Thanks for the info! 

 

When I install the new sticks tomorrow, should I run the Unraid memtest against them or should I just try to load Unraid and let it start doing its parity check and hope it doesn't crash anymore? 

 

If I do run the memtest against the new sticks, how long should I run it for? I've seen some folks say they ran it for like 4 days continuously, is that something reasonable/expected? I've also seen other posts where people said don't bother running it unless you are experiencing issues such as crashes and are trying to troubleshoot. 

 

Thanks again. 

Link to comment

my 2 cents:
Your Memtest screenshots show that the BIOS test detected 4000 MHz Corsair sticks while the test from Unraid shows you are clocking them at 4800 MHz. Did you by any chance leave the memory clock settings on AUTO? Most motherboards are horrible at auto-detecting and setting correct values for memory. It's possible that on your previous test with the BIOS the memory was clocked at the default 4000 MHz speed, while on later reboots the board decided your system could handle more and started overclocking (they call that AI overclocking but it's really bad in my experience).
Anyhow, my suggestion is to verify what settings you have for the memory/CPU speed in the BIOS. Disable any auto overclocking features, use XMP when possible, and set the correct voltage for DDR5 as per the RAM sticks you bought. Run the test again. If it still fails, maybe something else is going on, but it's also possible you simply haven't found the right voltage/speed/CAS settings for your memory.
I run an AMD setup for Unraid but even on Intel the biggest problem I always face is finding the right settings for memory. It can take hours to find the perfect sweet spot to run stable (even without overclocking).
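If you'd rather check the running speed from the OS than dig through the BIOS, here is a minimal sketch that compares the rated and configured speed of each module as reported by `dmidecode --type 17`. It assumes `dmidecode` is available and run as root, and that the output uses the `Speed:` and `Configured Memory Speed:` (or the older `Configured Clock Speed:`) field names. The function name `parse_speeds` is made up for illustration.

```python
import re
import shutil
import subprocess

# Field names as printed by dmidecode for SMBIOS type 17 (Memory Device).
RATED = re.compile(r"^[ \t]*Speed:[ \t]*(\d+)[ \t]*(?:MT/s|MHz)", re.MULTILINE)
CONFIGURED = re.compile(
    r"^[ \t]*Configured (?:Memory|Clock) Speed:[ \t]*(\d+)[ \t]*(?:MT/s|MHz)",
    re.MULTILINE,
)


def parse_speeds(dmidecode_text):
    """Return one (rated, configured) pair per populated memory slot."""
    pairs = []
    # Each module's fields follow a "Memory Device" heading.
    for block in dmidecode_text.split("Memory Device")[1:]:
        rated = RATED.search(block)
        configured = CONFIGURED.search(block)
        # Empty slots report "Unknown" instead of a number and are skipped.
        if rated and configured:
            pairs.append((int(rated.group(1)), int(configured.group(1))))
    return pairs


if __name__ == "__main__" and shutil.which("dmidecode"):
    result = subprocess.run(
        ["dmidecode", "--type", "17"],
        capture_output=True, text=True, check=False,
    )
    for rated, configured in parse_speeds(result.stdout):
        note = "  <-- above rated speed" if configured > rated else ""
        print(f"rated {rated}, configured {configured}{note}")
```

This reports what the firmware advertises, so treat it as a quick cross-check against the BIOS screen, not as a replacement for a memory test.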

Link to comment
7 hours ago, zali01 said:

Question 1: Was the RAM test that was done through the ASUS BIOS an inferior test? Why would that test pass and the unraid memtest detect errors/fail?

That depends on how robust each memtest program is; different software uses different test algorithms.

 

Unraid has updated the built-in memtest86+ version since 6.12.5. The old version was also unreliable, and I always use a third-party memtest program.

 

Anyway, it all depends on which test program is used, not on whether it comes from the BIOS, Unraid, or anywhere else.

 

image.png.ee897608b80b403141986336ce3b0329.png

 

https://github.com/memtest86plus/memtest86plus?tab=readme-ov-file#memory-testing-philosophy (your BIOS memtest comes from PassMark Software, which is a competitor)

 

image.png.c7bceb46eecb0ed0e7a7ba652ddc4bae.png

 

 

Edited by Vr2Io
Link to comment
15 hours ago, trurl said:

At least one pass. Often it will fail quickly if it will fail at all. Even one error is one too many.

 

Just got the new sticks and ran for one pass, so far no errors! 

 

1313460541_20240109-UnraidMemtestPass.thumb.jpeg.5095615f8b4b796c1d52a445217de734.jpeg

 

14 hours ago, sheldon said:

my 2 cents:
Your Memtest screenshots show that the BIOS test detected 4000 MHz Corsair sticks while the test from Unraid shows you are clocking them at 4800 MHz. Did you by any chance leave the memory clock settings on AUTO? Most motherboards are horrible at auto-detecting and setting correct values for memory. It's possible that on your previous test with the BIOS the memory was clocked at the default 4000 MHz speed, while on later reboots the board decided your system could handle more and started overclocking (they call that AI overclocking but it's really bad in my experience).
Anyhow, my suggestion is to verify what settings you have for the memory/CPU speed in the BIOS. Disable any auto overclocking features, use XMP when possible, and set the correct voltage for DDR5 as per the RAM sticks you bought. Run the test again. If it still fails, maybe something else is going on, but it's also possible you simply haven't found the right voltage/speed/CAS settings for your memory.
I run an AMD setup for Unraid but even on Intel the biggest problem I always face is finding the right settings for memory. It can take hours to find the perfect sweet spot to run stable (even without overclocking).

Man, I tried going into the BIOS and looking at the memory clock setting and it was not easy to decipher. I don't think I've ever messed with memory clock settings before, any overclocking I've done with my Windows (primary) system was only done through the manufacturer's GUI after loading the OS. I've always favored stability over performance. 

 

Hopefully now that I've got these new RAM sticks in there, I won't have to try and mess with the BIOS settings again. And hopefully now it runs without crashing so I can start installing docker containers and a Windows VM and the other fun stuff that I had planned. 

 

10 hours ago, Vr2Io said:

That depends on how robust each memtest program is; different software uses different test algorithms.

 

Unraid has updated the built-in memtest86+ version since 6.12.5. The old version was also unreliable, and I always use a third-party memtest program.

 

Anyway, it all depends on which test program is used, not on whether it comes from the BIOS, Unraid, or anywhere else.

 

image.png.ee897608b80b403141986336ce3b0329.png

 

https://github.com/memtest86plus/memtest86plus?tab=readme-ov-file#memory-testing-philosophy (your BIOS memtest comes from PassMark Software, which is a competitor)

 

image.png.c7bceb46eecb0ed0e7a7ba652ddc4bae.png

 

Thanks for that info @Vr2Io, this is one of the many things I am learning through this project. I never knew the significance of memtests, and never knew there was a difference between the different approaches that various sources were using. 

 

Thanks, all! Hopefully I will follow up later with a successful update. 

Link to comment

Update: so far the system just crossed 24h uptime after I replaced the ram that showed errors in memtest. I'm hopeful it will continue to stay up. Most recent parity check after the last crash is currently at 84.7% complete. 

 

Thanks again everyone!

Link to comment

Update: current uptime is 2 days, 7 hours. At this point I won't update this thread again until/unless it crashes. I've most recently installed nextcloud and other apps/dockers/plugins but I have a long way to go. Thanks!

Link to comment
