March 30, 2024

I've been accumulating a bunch of errors that I need support with:

1. Hardware errors: Most urgently, hardware errors in the syslog like this (these have been filling the syslog every hour from March 25 until today).
2. Flash is not read/write: I've had this error a few times. It has disappeared for now, so I don't have screenshots. I don't know what caused it or why it disappeared; maybe it's affecting other things.
3. Failure updating plugins: This is the reason I went to the syslog in the first place. Plugins have been failing to update for a while (the error also happens with User Scripts and Unraid Connect).
4. Machine Check Events detected on your server: I read in a few other threads (machine-check-events-detected-on-server, problems-with-ryzen-cpu) that this issue is safe to ignore, but I want to see if it's linked to the cache memory failures I mentioned at the beginning.
5. Macvlan and Bridging found: I had this error a while ago in Fix Common Problems. I added it to the ignored list, but now I cannot un-ignore it.

iboolnas-diagnostics-20240330-0543.zip

Edited March 31, 2024 by iBoolGuy
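To put a number on "every hour", the syslog entries can be counted per hour. The following is only a minimal sketch: it assumes traditional syslog timestamps (`Mar 25 14:03:11 host ...`) and that the relevant kernel lines carry the `[Hardware Error]` marker. The sample lines and the hostname `tower` are made up for illustration and are not taken from the attached diagnostics.

```python
from collections import Counter


def count_hardware_errors_per_hour(lines):
    """Count lines containing '[Hardware Error]', grouped by 'Mon DD HH:00'."""
    counts = Counter()
    for line in lines:
        if "[Hardware Error]" not in line:
            continue
        parts = line.split()
        if len(parts) < 3:
            continue  # not a line with a 'Mon DD HH:MM:SS' prefix
        month, day, clock = parts[0], parts[1], parts[2]
        hour = clock.split(":")[0]
        counts[f"{month} {day} {hour}:00"] += 1
    return counts


# Hypothetical sample lines; a real run would read the lines from the syslog file.
sample = [
    "Mar 25 14:03:11 tower kernel: mce: [Hardware Error]: Machine check events logged",
    "Mar 25 14:03:11 tower kernel: [Hardware Error]: Corrected error, no action required.",
    "Mar 25 14:10:00 tower kernel: usb 1-1: new high-speed USB device",
    "Mar 25 15:03:12 tower kernel: mce: [Hardware Error]: Machine check events logged",
]

hourly = count_hardware_errors_per_hour(sample)
for bucket, n in sorted(hourly.items()):
    print(bucket, n)
```

A steady count in every hourly bucket would match the "every hour" pattern described above, while bursts around particular times might point at load or temperature instead.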
March 30, 2024 (Community Expert)

4 hours ago, iBoolGuy said:
Macvlan and Bridging found: This error I had a while ago in Fix Common Problems, I added it to the ignored list, but now I cannot un-ignore it:

You do not want to ignore this, as it is known to make systems unstable. It is strongly recommended that you take one of the recommended actions: switch to ipvlan Docker networking or, if you continue to use macvlan, turn off bridging on eth0.

4 hours ago, iBoolGuy said:
2. flash is not read write: I've had this error a few times happen, it disappeared now so I don't have screenshots. I don't know what's caused it or why it disappeared.. maybe it's affecting other things

This tends to indicate a problem either with the flash drive or with the USB port to which it is attached.
April 2, 2024 (Author)

On 3/30/2024 at 11:28 AM, itimpi said:
This tends to indicate either problems with the flash drive or with the USB port to which it is attached.

This error happened again:

A few months ago I bought 3 SanDisk 16GB Cruzer Blade USB 2.0 flash drives. Before them I had a tiny Samsung 32GB one, which had the same issue I'm seeing now. You're telling me these flash drives fail this often? What sort of USB port could be this bad? In my head I blame unRAID for being sensitive; it's how I feel, but I don't know. This is odd.

As for the second issue, macvlan: the reason I use it is that I'm using a bond on eth0. That way, if I switch the cable by mistake between the two ports on the NIC, my static IP will not be reset on the router. The two ports have different MAC addresses, and the bond solved that issue for me.

Could you @ someone who might be able to help with the first issue I mentioned, the hardware errors and what they could be related to?

Thank you very much!
March 1, 2025 (Author)

Can anyone help me here? Since this post last year I'm still getting these errors daily:

I replaced the flash drive and resolved the macvlan/bridging errors, so those aren't causing any issues. My diagnostics file is attached to the original post if you want to check it; the errors are still exactly the same.

Edited March 1, 2025 by iBoolGuy
March 1, 2025

Those errors say something went wrong with your memory, but because of ECC they were corrected. This means that either your memory is broken or the CPU (which has the memory controller on it) is. If you had more weird errors, it could be something to do with the power supply.

I don't think Memtest86+ handles ECC, but you can try it. The commercial MemTest86 does support ECC and detects ECC errors, so you could use that to detect them, then swap modules to see whether the error follows the module or stays at the same position. If it follows the module(s), replace them; if it stays, your CPU is broken.
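The swap test described above comes down to a small decision rule. Here is a minimal sketch of it; the slot names (`A1`, `B1`) and module labels are hypothetical, and the function name is made up for illustration. It only encodes the reasoning from the post: an error that moves with the module points at the module, and an error that stays at the same position points at the CPU side.

```python
def interpret_swap_test(layout_before, failing_slot_before,
                        layout_after, failing_slot_after):
    """Interpret a module swap test.

    layout_* map a slot name to the module installed in it.
    failing_slot_* name the slot the error was reported for
    (None if no error was seen after the swap).
    """
    suspect_module = layout_before[failing_slot_before]

    # The test only says something if the suspect module actually changed slots.
    if layout_after.get(failing_slot_before) == suspect_module:
        return "inconclusive: suspect module was not moved"
    if failing_slot_after is None:
        return "inconclusive: no error after the swap"
    if layout_after.get(failing_slot_after) == suspect_module:
        return "module: the error followed the module"
    if failing_slot_after == failing_slot_before:
        return "position: the error stayed put (CPU side)"
    return "inconclusive: error moved to an unrelated slot"


before = {"A1": "module_1", "B1": "module_2"}
after = {"A1": "module_2", "B1": "module_1"}  # the two modules swapped

print(interpret_swap_test(before, "A1", after, "B1"))  # error followed module_1
print(interpret_swap_test(before, "A1", after, "A1"))  # error stayed in slot A1
```

In practice the error has to be mapped to a physical slot first, which depends on how the board and the test tool report addresses, so treat the slot names as placeholders.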
March 1, 2025 (Author)

11 hours ago, Wody said:
Those errors say something went wrong with your memory, but because of ECC they were corrected. But this means either your memory is broken, or the CPU (which has the memory controller on it) is. If you had more weird errors, it could be something about the power supply. Memtest86+ doesn't handle ECC I think, but you can try that, the commercial memtest86 does support ECC errors and detects them as well, you could use that to detect them, then switch modules to see if the error follows the module, or stays at the same position. If it's the module(s) replace them, if it stays, your CPU is broken.

Very helpful reply, thanks a lot.

Could those be false positives? As in, could some software error be triggering those hardware errors? Or is it really, really broken?

Also, that memory was very expensive. How would it just "break" like this? Is it something I did? Is there something I could do better? Or is it just what it is?

Third and last question: is it OK if I choose to ignore the error? What consequences could this sort of error have? (I mean, since ECC is correcting them, should I still be worried?)

Thank you!
March 3, 2025

On 3/1/2025 at 7:19 PM, iBoolGuy said:
Or is it really really broken? Also, that memory was very expensive.. how would it just "break?" 3rd and last question, is it ok if I choose to ignore the error? What consequences could happen from this sort of error? (I mean since ECC is correcting them.. should I still be worried? Thank you!

Running a memory test would show whether it is really broken or not. As far as I know, it can't be a software error, because it's the module itself reporting errors. If it were normal (non-ECC) memory, you would get crashes, corrupted data, or weird errors.

As for why it would just break: memory is basically a whole bunch of switches that are on or off, so eventually they break down. Heat can also be an issue, and there can be electrical issues. ECC memory verifies that these bits of memory are working correctly and, if they are not, throws these errors.

It could also be a compatibility issue. Not all modules suit all motherboards. Usually that means the board won't boot at all, but all kinds of things can happen (for example, I have a broken module where one board says no module is inserted, while another board boots fine with it but hangs 2 seconds into a memtest).

So yes, you should be worried and run a memory test. If that throws errors, the memory has to be replaced. You could also run the test on another computer afterwards to see if it fails there too. If it does, it's the modules; otherwise it could be something else. Since the memory controller is on the CPU, it can also be the board or the CPU that is broken, or the torque may not be correct so that things aren't making proper contact. But ECC errors are usually the modules.
March 3, 2025 (Author)

17 hours ago, Wody said:
Running a memory-test would show if it was really broken or not. But as far as I know, it can't be a software error, because it's the module itself reporting errors. If it was normal memory, you would get crashes, or corrupted data, or weird errors. As for why would it just break, memory is basically a whole bunch of switches that are on or off, so eventually they break down, plus things like heat can be an issue, and there can be electrical issues. With ECC memory, it verifies that these bits of memory are working correctly, and if not, they throw these errors. It could also be compatibility issue, not all modules are fit for all motherboards, usually it means the board won't boot at all, but all kinds of stuff can happen (for example, I have a broken module where one board says no module is inserted, on another board it boots fine, but run a memtest and it hangs the computer after 2 seconds). So yes, you should be worried and run a memory-test, if that throws error, the memory has to be replaced. You could also run the test on another computer after that, to see if it fails there too, if it does, it's the modules, otherwise it could be something else. Since the memory-controller is on the CPU, it can also be the board that is broken, or the CPU, and it can also be that the torque isn't correct so things aren't making contact properly. But ECC errors usually are the modules.

In case I forgot to state it, I have a 5900X and 4*32GB of 3200 MT/s RAM.

So I obtained the MemTest86 software, put it on a flash drive, and got to work, with the ECC tests (Polling and Injection) enabled. It's been 4 hours, and Pass 1 completed with 0 errors.

Reading what you said again, and looking at the errors again, they do say L3 Cache level/Unified Memory Controller, so maybe it is the CPU! I have 2 other CPUs, a 5950X and a 5800X; I could try those, I guess?
I'll wait for the 4 passes to complete. Then, if no errors occur, I'll put in another CPU, run the system, and see how that goes.
March 3, 2025 (Community Expert)

7 minutes ago, iBoolGuy said:
I have a 5900X, and 4*32GB of 3200 MT/s ram.

That's above spec, even more so if they are dual rank. See here: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/#findComment-819173
March 3, 2025 (Author)

1 hour ago, JorgeB said:
That's above spec, more if they are dual rank, see here: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/#findComment-819173

Hey, this is very helpful, thank you! Oh yeah, it is dual rank: Kingston Server Premier 32GB 3200MHz DDR4 ECC CL22 DIMM 2Rx8. I guess we'll be running at 2667 MT/s. I will test and report back.
March 3, 2025 (Author)

BTW, is it OK if the board auto-clocks up? I have all the OC settings on auto/default, yet the actual BCLK becomes 100.48MHz, so the memory clock goes to 2679.52MHz instead of 2666MHz. I've read online that it's "normal" for boards to do that. Is that considered within spec, or must I bring it down?
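For scale, the size of that drift can be worked out directly. This is a minimal sketch under two assumptions that are not stated in the post: the nominal BCLK is 100 MHz, and the memory speed scales linearly with BCLK. The nominal DDR4-2666 rate is taken as 2666.67. The function names are made up for illustration.

```python
NOMINAL_BCLK_MHZ = 100.0  # assumed nominal base clock


def bclk_deviation_percent(actual_bclk_mhz):
    """How far the measured BCLK sits above (or below) nominal, in percent."""
    return (actual_bclk_mhz - NOMINAL_BCLK_MHZ) / NOMINAL_BCLK_MHZ * 100.0


def scaled_memory_rate(nominal_rate, actual_bclk_mhz):
    """Memory rate after scaling linearly with the measured BCLK."""
    return nominal_rate * actual_bclk_mhz / NOMINAL_BCLK_MHZ


deviation = bclk_deviation_percent(100.48)
rate = scaled_memory_rate(2666.67, 100.48)
print(f"BCLK deviation: {deviation:.2f} %")
print(f"Scaled memory rate: {rate:.2f}")
```

Under these assumptions the drift is about half a percent, and the scaled rate lands close to the 2679.52 figure reported above. The small remaining difference would be consistent with rounding in the displayed BCLK.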
March 3, 2025

3 hours ago, iBoolGuy said:
BTW, is it ok if the board auto clocks up?

There's usually a slight difference between specs and reality. As long as you haven't enabled overclocking and the bus speed is stable, it should be fine.
March 3, 2025 (Author)

5 minutes ago, Wody said:
There's usually a slight difference between specs and reality, as long as you haven't enabled overclocking and the bus-speed is stable it should be fine.

I've read in the board's manual that yes, the board follows the specification from the modules' SPD, so I think that's alright.

But the errors are back:

I turned off Spread Spectrum Control.
I turned off Boost override.
I disabled C-state and set another setting to Typical Current something.
I disabled the audio controller.
I disabled the internal/motherboard built-in Ethernet ports.
I modified the fan curve.

Confession: 2 of the RAM sticks came from Micron; the other 2, which I purchased 2 years later, came from "Hynix", I think it was called.
Confession #2: I ran for around 2 years, I think, without a UPS, and we have the most unstable power situation.

Anyway, I think the MemTest86 run concluded that the memory is fine. I'll try to run it on my PC and see if it's any different, and I'll try another CPU in the server to see how that goes.
March 4, 2025

1 hour ago, iBoolGuy said:
2 of the RAM sticks came from Micron, the other 2 I purchased 2 years later came from "Hynix" I think it was called.

Generally, mixing memory is not recommended. So you can also try running with only the Micron modules and, if they give errors, try only the Hynix ones. The trick is usually to find the most basic configuration in which things run and then start adding to it. When it fails, the cause is usually the last thing you added, whether it's broken, incompatible, etc.

Since the address is the same every time, you can also try swapping modules. If the error follows the module, it's something about the module. If the address stays the same, it's the processor or something in between, maybe some dirt or dust that lets things work most of the time but not always.

The message does say 'no action required', so you can ignore it, but it isn't how things should be.