toughiv Posted July 6
Please help, this is driving me insane. Please note I used to have a lot more frequent crashes when I was running lots of containers ... I clicked scrub in anger and wiped out all my appdata... I've got to rebuild everything :''(
I have done the following:
- Attached both the persistent syslog and the anonymised diagnostics.
- Ran memtest for 3 passes and it came back clean.
- All disks pass the SMART test.
- Noticed the system clock is off (will sort) and also noticed .
- Made the change from macvlan to ipvlan.
The only thing I can think of is that 3 passes of memtest is not enough OR that the networking setup is causing issues.
Just as an FYI, my Unraid box is hooked up to pfSense. It has 1 x 1Gb port connecting it. I have 3 VLANs to manage different subnets. All the subnets, outbound NAT, routing, etc. are handled by pfSense.
Attachments: tower-smart-20240708-0148 (2).zip, tower-smart-20240708-0148 (1).zip, tower-smart-20240708-0148.zip, tower-smart-20240708-0147.zip, tower-smart-20240708-0146.zip, tower-diagnostics-20240706-1135.zip, syslog-1720306751
JorgeB Posted July 6
If you mean that the server is rebooting by itself, that is almost always a hardware problem.
toughiv (Author) Posted July 6
1 hour ago, JorgeB said: "If you mean that the server is rebooting by itself, that is almost always a hardware problem."
I am then assuming the most usual culprit is RAM? Then drives? I've not had a hardware issue before and am struggling to diagnose it. Got any pointers on common tests to run, in order of probability? Thank you
JorgeB Posted July 7
RAM, PSU and board/CPU would be the main suspects. If you have multiple sticks, try running the server with just one; if it behaves the same, try a different one. That will basically rule out bad RAM.
toughiv (Author) Posted July 16
Hey @JorgeB, how would you test board/CPU?
- I have run memtest for over 24h with no errors.
- I have checked my PSU with a power supply tester - no issues.
- Ran a SMART report on all drives - no issues.
The crashes are not triggered by anything in particular, like heavy file moves or stressing the GPU. They occurred even when not running docker containers, just idling. The frequency varies: sometimes once in a 24-48h window, other times 5 times in a row, then again 6 hours later.
I have only had these issues since upgrading from Unraid 5.X to Unraid 6.X.
JorgeB Posted July 16
28 minutes ago, toughiv said: "how would you test board/CPU?"
You'd need to swap with a different one.
toughiv (Author) Posted July 24
On 7/16/2024 at 9:12 AM, JorgeB said: "You'd need to swap with a different one."
Okay, so I am still experiencing crashes. I have done:
- 36 hour memtest -> 0 errors
- PSU test -> nothing wrong with PSU
- Swapped CPU for a new one
- Swapped Mobo for a new one
All that expense for naught, as it crashed again 1 hour after booting up the new motherboard and CPU...
JorgeB Posted July 24
Is the server crashing or still rebooting on its own? Does it still do that if you don't start the VM and docker services?
toughiv (Author) Posted July 24
Should I run it in safe mode for a while, without turning on VM/Docker, to see if the restarts occur? Then incrementally bring services back online until crashes occur?
- Safe mode
- Not safe mode (no docker)
- Not safe mode w/ docker
JorgeB Posted July 24
1 hour ago, toughiv said: "should i run it in safe mode for a while to see if the restarts occur? and dont turn on VM/Docker Then incrementally bring services back online until crashes occur?"
It's worth a try.
toughiv (Author) Posted July 30
On 7/24/2024 at 11:09 AM, JorgeB said: "It's worth a try."
So I did the following:
- Safe Mode w/ no GUI = 3 days with no crashes, so proceeded to the next step
- GUI Mode with no array = 2 days with no crashes, so proceeded to the next step
- GUI Mode with array, but no Docker and VMs = 1 day, but then a crash
All disks are healthy. What do you think this could be?
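The staged test in the post above amounts to a search for the first configuration that crashes. A minimal Python sketch of that bookkeeping follows, using only the three results reported in the post. The `Stage` class and the `first_crashing_stage` helper are hypothetical names made up for illustration; they are not part of Unraid or any of its tools.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Stage:
    name: str       # configuration under test
    days_run: int   # how long it ran (or ran before a crash was seen)
    crashed: bool   # whether a crash/reboot was observed at this stage


# Results as reported in the post, ordered from least to most enabled.
STAGES = [
    Stage("Safe Mode, no GUI", 3, False),
    Stage("GUI Mode, no array", 2, False),
    Stage("GUI Mode, array started, no Docker/VMs", 1, True),
]


def first_crashing_stage(stages: List[Stage]) -> Optional[Stage]:
    """Return the first stage (in order) where a crash was seen, else None."""
    for stage in stages:
        if stage.crashed:
            return stage
    return None


culprit = first_crashing_stage(STAGES)
if culprit is None:
    print("No stage crashed; enable the next service and keep testing.")
else:
    print(f"First crash at: {culprit.name} (after {culprit.days_run} day(s))")
```

A clean run at an earlier stage is weaker evidence than a crash at a later one: with crashes that are sometimes 24-48 hours apart, a 2-3 day clean run does not prove that stage is safe.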
JorgeB Posted July 30
Try with docker enabled but VMs disabled.
toughiv (Author) Posted July 30
1 hour ago, JorgeB said: "Try with docker enabled but VMs disabled."
I meant with both docker and VMs off. Just idling, with the GUI and the array started.
JorgeB Posted July 30
Not sure I follow, I understood it only crashed with VMs running, not idling.
toughiv (Author) Posted July 30 (edited)
5 hours ago, JorgeB said: "Not sure I follow, I understood it only crashed with VMs running, not idling."
Originally it was crashing with Docker running. I had one VM, but it was barely anything and a new addition to the stack. This has been ongoing for quite some time; I ensured the array started automatically and just let it restart and do its thing... However, the crashes seemed to be more frequent lately, to the point where it was becoming a blocker to me doing my stuff. That's when I decided to make this thread and get real serious about trying to diagnose this issue.
So after I did the checks mentioned before:
1) RAM memtest for 36 hours
2) PSU test
3) SMART reporting for all drives
You then said it may be a CPU/MoBo problem, since a system that just halts and restarts is, more often than not, a hardware issue. So I went on eBay and spent a couple hundred on those new bits. However, the issue persists.
That's why I then ran the series of tests:
- Running both [Safe Mode] and [GUI w/o Array] didn't cause any crashes.
- It seems turning on the array is causing crashes... and the crashes happen more frequently if I run all my docker containers.
My gut is telling me two things:
1) The array, once started, cannot be stopped - it always says "retry unmounting disk shares"
2) Maybe there is a driver / storage issue somewhere (given crashes happen with the array turned on)
However, nothing shows in the persistent syslog.
Edited July 30 by toughiv
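When a persistent syslog shows nothing obvious, one thing that can still be checked is what was logged just before each boot. Below is a rough Python sketch of that idea. It assumes the persistent syslog is a plain text file and that each boot starts with the kernel's usual `Linux version` message; the helper name `lines_before_each_boot` and the sample lines are made up for illustration and are not real Unraid log output.

```python
from collections import deque
from typing import Deque, Iterable, List

BOOT_MARKER = "Linux version"  # assumption: the kernel's first message at boot


def lines_before_each_boot(lines: Iterable[str], context: int = 3) -> List[List[str]]:
    """For every boot marker, collect up to `context` lines logged just before it."""
    recent: Deque[str] = deque(maxlen=context)  # rolling window of latest lines
    result: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if BOOT_MARKER in line:
            result.append(list(recent))
        recent.append(line)
    return result


# Synthetic sample, NOT real log output: two boots with two lines in between.
SAMPLE = [
    "boot: Linux version X.Y (synthetic sample)",
    "service A started (synthetic sample)",
    "periodic job ran (synthetic sample)",
    "boot: Linux version X.Y (synthetic sample)",
]

for number, before in enumerate(lines_before_each_boot(SAMPLE, context=2), start=1):
    print(f"Boot {number}: {len(before)} line(s) logged just before it")
    for entry in before:
        print("   ", entry)
```

With a real file, an open file handle such as `open(path, errors="replace")` can be passed as `lines`. An unexpected reset typically shows up as a new boot marker with no shutdown messages before it, so the last few lines before each marker are the ones worth reading.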
JorgeB Posted July 30
OK, I think I misunderstood your previous post as meaning VMs were running when it crashed. IMHO, if it crashes with just the array running, without docker and VMs, it's almost certainly hardware. You should still try to figure out what is preventing the array from stopping, even though that is most likely unrelated to the crashing.
toughiv (Author) Posted July 30
2 minutes ago, JorgeB said: "OK, I think I misunderstood your previous post, as VMs were running when it crashed, IMHO, if it crashes with just the array running, without docker and VMs, it's almost certainly hardware, but you should try to figure out what is preventing the array from stopping, even though that is most likely unrelated to the crashing."
It has just crashed now, and even while booting it'll crash again, doing so a couple of times until it manages to stay up and running. Given that it crashes before the array has started in those instances, what do you think that could be?
In fact it just happened while I was watching the screen: it crashed when the third-party drivers (nvidia plugin) were being installed on boot... maybe a coincidence. It is booting back up now... let's see if it happens there again.
Yep, it happened there twice!! Maybe it is the nvidia plugin?
toughiv (Author) Posted July 30
2 minutes ago, toughiv said: "It has just crashed now and even when booting, it'll crash again and do so a couple times until it manages to stay up and running Given that it'll crash before the array has started in those instances, what do you think that could be? In fact it just happened and i was watching the screen and it crashed when the third party drivers (nvidia plugin) was being installed on boot ... maybe a coincidence. It is booting back up now...let's see if it happens there again. Yep it happened there twice!! Maybe it is the nvidia plugin?"
I just tried to boot into Safe Mode - GUI - No plugins and it still crashed... maybe a red herring on the nvidia front.
toughiv (Author) Posted July 30
Just now, toughiv said: "I just tried to boot into Safe Mode - Gui - No plugins and it still crashed... maybe red herring on the nvidia front"
It actually still tries to install the nvidia third-party plugin even when that safe mode option is selected.
toughiv (Author) Posted July 30
In fact, perhaps it just means the GPU is dying?
JorgeB Posted July 30
44 minutes ago, toughiv said: "what do you think that could be?"
I think it still continues to point to hardware, but it's still difficult to say which component exactly, since there are several that can cause that. If you have already swapped board and CPU, RAM and PSU would be the main remaining suspects. If you have multiple sticks of RAM, try just one; if it behaves the same, try the other one. That will basically rule out the RAM.
toughiv (Author) Posted July 31
22 hours ago, JorgeB said: "I think it still continues to point to hardware, but still difficult to say which component exactly, since there are several that can cause that, if you have already swapped board and CPU, RAM and PSU would be the main remaining suspects, if you have multiple sticks of RAM, try just one, if the same, try the other one, that will basically rule out the RAM."
So I just ran the box without the GPU plugged in and it crashed.
Even though I ran memtest for 36 hours, could it still be the RAM? If so, I'll buy a little 16GB of RAM and give it a go...
Feels like such an odd issue.
JorgeB Posted July 31
10 minutes ago, toughiv said: "Even though i run memtest for 36H, it could still be the RAM?"
Yep, memtest is only definitive if it finds errors.
toughiv (Author) Posted July 31
50 minutes ago, JorgeB said: "Yep, memtest is only definitive if it finds errors."
Okay, please don't close this thread - I'll get some RAM bought and post the results (fingers crossed!)
toughiv (Author) Posted August 23
@JorgeB - I have replaced the RAM and Unraid still randomly restarted.
So now I have:
- Proven the PSU is fine
- Replaced the MoBo
- Replaced the CPU
- Replaced the RAM
- Run it without the GPU
It still crashes in every case. Surely this is an Unraid issue?