Lime1028

Everything posted by Lime1028

  1. Glad you were able to sort it out. In the end it turned out to be a dead CPU core. What was happening is that the core priority order was such that the dead core wouldn't get any of the normal load and would only be put into use when the system was under heavy load, like a parity check. When the system sent something to that core, it would fail and the system would crash. AMD was a bit annoying to deal with on the RMA, as I had to ship the CPU to the US at my own cost (a bit crazy to ask for international shipping on an RMA), but they did confirm that it was broken and sent another CPU back. They didn't pre-pay the import tax on the replacement CPU, though, and refused to when I asked about it, so I had to cover that too. In the end, RMAing the CPU cost almost as much as buying a new one, as this was just a cheap Ryzen 3600. By contrast, when I RMAed my 3080 after a shunt resistor blew, Asus just sent me a shipping label and a week later a new card was on my doorstep.
  2. Thanks for the reply. I'm running the newest BIOS. I've tried with fTPM both disabled and enabled. RAM is running at its default 2666 MHz. Ultimately, I now think the CPU is the issue. I managed to get the system stable enough to run Prime95 for about 10 hours, and it looks like core 4 has some serious issues; every time, it's core 4 that fails. I only had one day left on my warranty, so I decided to RMA it. Let's hope that fixes the problems.
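For anyone wanting to reproduce this kind of per-core testing: on Linux, Prime95 ships as the mprime binary, and a run can be pinned to a single core with taskset. A rough sketch (the binary path and core number are illustrative, not from my actual setup):

```shell
# Run the Prime95 torture test pinned to one logical CPU (core 4 here,
# matching the core that kept failing). -t starts the torture test;
# taskset -c restricts the process to the listed CPU.
taskset -c 4 ./mprime -t
```

Repeating this per core makes it much easier to attribute failures to a specific core than letting the scheduler spread the load.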
  3. I'm back again with the same problem that's been plaguing me for months, but now with more data. To state the issue plainly, my server can't go more than about 10 hours without crashing. It was more like 2 hours a few weeks ago, and I've been able to stretch it out a bit longer now, but regardless, this makes it completely unusable. Things I've tried:
     1. Two different motherboards (MSI B450 Tomahawk MAX and ASRock B550 Pro4), with the newest BIOS on one and the second newest on the other (for some reason I can't get Unraid to boot with the newest MSI BIOS).
     2. Three different sets of RAM, one of them being multi-bit ECC RAM.
     3. Setting the "Typical Idle Power" setting in BIOS.
     4. Disabling C-States in BIOS.
     5. Adding the line "/usr/local/sbin/zenstates --c6-disable" to the "go" file in the config folder.
     The one thing that worked, and that falsely led me to believe I had solved the problem last time, was completely swapping the system out for an ancient Intel-based setup (old board/CPU/RAM I had lying around). After having no luck with the first motherboard (MSI) and trying everything I could think of, I figured the board was the issue, and all I had on hand was an old Intel board and a CPU to go along with it. I threw it in and it worked. So the board was the problem, and I RMA'd it. Ran the server for a week on the Intel system with no issues. MSI didn't actually check the board out and just sent me a refurbished board in a beat-up box. Rebuilding the system with the replacement board brought me back to square one. I got my hands on another AM4 board, and stability is arguably a bit better, but what has really made the difference so far is items 4 and 5 on the above list, which increased the time before a crash from somewhere between 15 minutes and 2 hours up to about 9 or 10 hours. However, it's still crashing, and seeing as a parity check takes me about 26 hours, I can't even get through one. At this point I have a couple of questions. Any idea what might be causing this? Is my CPU the problem (Ryzen 3600)? If so, is there any way I can go about testing it? Any recommendations for RAM and motherboards that might solve my issue? (The official compatibility page seems many years out of date.) I wasn't able to log the most recent crash with the ECC RAM, but I have the previous crash, which was with standard DDR4. Logs and diagnostics attached. Thanks for any advice, help, or for just generally taking the time to read this. tower-diagnostics-20220703-1622.zip tower-syslog-20220703-2021.zip
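For anyone finding this thread later, item 5 above amounts to a one-line addition to the flash drive's startup script. A minimal sketch, assuming the standard Unraid path of /boot/config/go (back the file up first):

```shell
# Keep a backup of the startup script, then append the C6-disable call
# so it runs on every boot.
cp /boot/config/go /boot/config/go.bak
echo '/usr/local/sbin/zenstates --c6-disable' >> /boot/config/go
# Confirm the line is present:
grep 'c6-disable' /boot/config/go
```

The change only takes effect on the next boot, since the go file runs at array startup.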
  4. Well, life stuff got in the way of validating everything early in the week. In the end you guys were correct: a hardware error, seemingly the motherboard to be exact. I'm not entirely sure what on the board is causing it, as everything works fine until I start a parity check. It might be the SATA controller; since Unraid runs off a USB drive, it doesn't need to touch anything SATA-related until the check starts. Either way, the board still had a few months of warranty left on it, so it's currently being shipped off to MSI and we'll see what they find. Got a loaner board from a friend in there right now and everything is back up and running. Jorge and trurl, thank you both for your help with these issues and your patience with me. It's all much appreciated.
  5. I've got a really jerry-rigged setup going on right now with a different motherboard, CPU, and RAM. It's currently running a parity check, so if it passes (it's estimating about 19 hours), that will confirm the issue is in one of those three components. I also tried updating the BIOS on the original motherboard, but the newest BIOS just refused to boot Unraid no matter how I adjusted the boot settings. Restoring the older BIOS brought it back to booting, and crashing on parity check, as normal. I'm still curious as to which component is the issue. Unraid runs fine, and I can do anything within the interface I want; it's only when it begins a parity check that it crashes. Maybe something to do with the SATA ports on the motherboard? Either way, I'll send an update, crash or pass.
  6. So I've confirmed two things. Firstly, the power supply is not the problem; I swapped it out for another I had and the issues still persist, but it was worth a try. Secondly, if I run the check without writing corrections, it will hang at some point like normal but it won't crash the server. If I do select it to write corrections, it will instead crash when it hangs. Also, where it hangs is inconsistent; I've had it stop anywhere between 2 and 180 GB into the check. None of this really sheds any light on what the problem is, but it's progress, I guess.
  7. I have performed memtests with no errors. I've tested a few different versions with similar errors. Last night I began testing on 6.10.2-rc3 and it has not crashed yet, 19 hours into the check. However, about 3 minutes into the check it got to 28.1 GB (of 12 TB) and it has stalled there ever since. Unlike previous cases, the server has not crashed out, and the elapsed time (which usually freezes right before it crashes) has continued to count up. The estimated speed has continued to drop and is now at 85.1 KB/s. The issue is that nothing hardware-wise has changed in the server from when it was previously running to now, and memtests and drive health checks have all come back clean. I swapped the USB drive based on some previous advice, but the issue persists. Is there any easy way for me to use a different USB drive to boot a completely clean system on the same hardware (minus the drives)? It would be the same CPU, MB, RAM, etc., just different drives, to not ruin my data, and a different USB drive to handle the separate install. Presumably, if it managed to build an array in that case, then it's either the drives or the software. I've added new logs, though only two lines have been added since I began yesterday (logs of which were posted in my previous reply). tower-syslog-20220604-1835.zip
  8. Had to take a break from working on this due to work stuff, but I'm back on it now. Still no solution. I've got a new set of logs, though; this time it failed at 0.2%, which is a bit better than normal. The server isn't bricked this time, so I'm able to download the logs. I also downloaded the diagnostics. Unlike normal, the elapsed time is still counting 15 minutes into the check, though the speed keeps dropping; it's at about 7 MB/s now and I can't actually tell if it's making any progress. Gonna let it sit until it crashes. Still no idea as to the cause, though I did see someone mention crashes during a parity check where it turned out their PSU was damaged. Not sure if my errors are representative of that, but if/when this crashes I might try swapping the PSU. tower-syslog-20220603-2309.zip tower-diagnostics-20220603-1916.zip
  9. I just verified that my RAM is not overclocked and that Power Supply Idle Control is set to "typical current idle", as indicated in the thread. So it seems that's not the issue. As a sanity check I changed the RAM speed around a bit and still got the same issues. Got some more logs from earlier, but it's likely more of the same. tower-syslog-20220512-2005.zip
  10. Well, that sucks. I can't imagine what hardware it would be, though; this server was running just fine for months. I've run memory tests and drive tests, and nothing has ever shown any issues.
  11. Unfortunately, it turns out that there are still issues. There was some problem plaguing the server for a while that was causing crashes, but I couldn't get any logs of it. Now it's just stuck: it's been 42.4 GB into the parity check for about 6 hours now. However, it hasn't crashed and the server is still responsive, though the logs show it throwing the same errors over and over (logs attached). tower-syslog-20220512-0332.zip
  12. So far so good. I've updated to v6.10.0-rc7, and the parity check is at 1% right now. I'll report back if it fails again or if it's successful. Thanks for the help.
  13. Sorry about that, that was all I could get at the time. After a few more attempts I've managed to get the parity check to fail without the entire server bricking itself. The check is still completely stalled and hasn't moved for an hour. This set of logs should be more complete (though it doesn't capture the entire time the server has been on, it's just repeating the same errors over and over anyway). tower-syslog-20220510-1521.zip
  14. Hello, I'm having this issue with my server where every time I try to do a parity check it gets between 5 and 50 GB in (on a 12 TB array) and then completely locks up. I can't do anything after it freezes and have to force a shutdown by pressing the power button. I was running 6.9.2 but decided to restore the previous version, which was 6.8.3. After a few failed checks, I tried another parity check and it spewed out a bunch of USB read/write errors when it crashed. I swapped the USB drive for a new one, and while it's no longer giving me any USB errors, it's still crashing the same way on parity checks. When it crashes it completely locks up, so I don't have the true syslog files; however, I did set up a syslog server to try to get an idea of what is going on. Two crashes are logged in the Crash 1.txt and Crash 2.txt files attached, in addition to the diagnostics files and a syslog file that was saved before starting a parity check (as I can't download it after the lock-up). I'm at a loss for what is going on. I ran a memory test and everything came back clean. None of the drives have changed from when it was last running. This all started happening after a power outage took the server offline. If any other info is needed, please let me know and I'll see what I can do. Crash 1.txt Crash 2.txt tower-diagnostics-20220508-2242.zip tower-syslog-20220509-2232.zip
  15. It's a Corsair RM750x. There are some splitters in use. Perhaps I'll open it up tomorrow morning and try redistributing the drives across the power connectors.
  16. Does anyone else have any ideas of what might be causing these crashes/freezes? It still happens every time a parity check is attempted, usually runs for 5-10 minutes before freezing.
  17. I've attempted everything in the highlighted comment regarding C-states and RAM speeds, unfortunately the problem still persists. I'm not particularly surprised seeing as those settings were set to their default values before and it was working fine. Thanks for the info though.
  18. Thanks for the reply! As requested I've attached diagnostics. The PSU is definitely not underpowered as it ran for over a year in the system, however it is possible it got damaged in the power outage, as it's only since then that the problems began. I tried spinning up the disk on the server and the server didn't freeze up or have any issues, but this might not be the same level of load as a parity check. tower-diagnostics-20220108-1730.zip
  19. Hello all, I'm hoping you can help me. A power outage took down my Unraid server, and now it crashes every time it tries to complete a parity check, which it attempts every time the drives are mounted. In other words, the server is dead in the water at the moment. It gets about 1-3% of the way through the parity check (it doesn't seem to stop consistently at one point), then the whole thing freezes. Even checking the logs or trying to pause the check is impossible. I set up an external log server to get the syslog, which I have attached. I did a memtest and ran some drive tests, all of which came back clean. I'm at a loss at this point. Thanks to anyone taking the time to read this! syslog.txt
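For reference, the external log server mentioned above can be as simple as an rsyslog instance on another Linux machine listening for UDP syslog; Unraid's syslog-server settings then point at that machine's IP. A minimal sketch, assuming rsyslog with its standard imudp module, the common port 514, and a log path chosen just for illustration:

```shell
# On the receiving machine: configure rsyslog to accept remote UDP
# syslog and write everything it receives to one file.
cat <<'EOF' | sudo tee /etc/rsyslog.d/10-remote.conf
module(load="imudp")
input(type="imudp" port="514")
*.* /var/log/unraid-remote.log
EOF
sudo systemctl restart rsyslog
```

Because the log lines leave the server as they are generated, they survive a hard lock-up that would otherwise take the in-memory syslog down with it.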
  20. All the hardware was working fine before I last shut it down. I also see no signs of any issues when running health checks on the drives. Everything within the system is about a year and a half old. All the drives are proper NAS drives that should have years of life. If someone knows what they're looking for in the diagnostics maybe they can spot something, but to my untrained eye the hardware seems fine.
  21. I see two possible initiating factors. This started recently after an unfortunate event where I was transferring files to the server from another computer over the network. That computer died (its graphics card blew a shunt resistor) while transferring files. The Unraid server was then taken offline to move it, to gain access to some wires, and in the process I swapped the graphics card (I removed an ancient AMD card and put in a new Nvidia GT 710, because it's passive and the fan on the old one was starting to make a ton of noise). Ever since then there have been errors, though only ever on the parity check. I don't know if it's because of the card (it's working fine for video output) or because of damage caused when that computer crashed during the data transfer.
  22. My apologies. It completely locked up when I tried to get the diagnostics. I restarted it and got them, though I haven't tried to run a parity check again, as it will probably crash it again. tower-diagnostics-20210720-0952.zip
  23. Hello, I have been having this issue recently and could use some help. I'm able to start up the machine and log in to Unraid through the web console. I was able to start the array and everything was working fine, but when I started a parity check it ran normally for a few seconds, got a couple of gigs into the check, and then stopped. The web console started running really slowly, and some tabs in it just wouldn't load or show anything, like the dashboard and docker tabs. Pulling the syslog, or what I could get of it, I noticed the following errors:
      Jul 19 22:24:54 kernel: mdcmd (42): check
      Jul 19 22:24:54 kernel: md: recovery thread: check P Q ...
      Jul 19 22:25:16 kernel: ------------[ cut here ]------------
      Jul 19 22:25:16 kernel: kernel BUG at drivers/md/unraid.c:345!
      Jul 19 22:25:16 kernel: invalid opcode: 0000 [#1] SMP NOPTI
      Jul 19 22:25:16 kernel: CPU: 11 PID: 2308 Comm: mdrecoveryd Not tainted 5.10.28-Unraid #1
      It looks like the check crashed 20 seconds after it started. I've attached the entire syslog file; the crash happens near the end. I have no idea what these errors mean or how to fix them. I tried updating my BIOS, and unfortunately that hasn't helped. tower-syslog-20210720-0530.zip