jimbo123 Posted November 1, 2020 (edited)

Hi all, I've been running unRAID for about 10 years now,.. mostly on auto-pilot, and the only issues I've had are a few failed drives. I run a monthly parity check,.. and up until the last few weeks I have never had any sync errors with the check at all. The last few parity checks have been showing corrected errors. Today, 1st Nov,.. it did its normal monthly check and again found and corrected sync errors (217).

Nov 1 00:00:01 storage kernel: mdcmd (42): check CORRECT
Nov 1 00:00:01 storage kernel:
Nov 1 00:00:01 storage kernel: md: recovery thread woken up ...
Nov 1 00:00:01 storage kernel: md: recovery thread checking parity...
Nov 1 00:00:01 storage kernel: md: using 1536k window, over a total of 3907018532 blocks.
Nov 1 00:05:16 storage kernel: md: correcting parity, sector=65721672
Nov 1 00:09:27 storage kernel: md: correcting parity, sector=122070112
Nov 1 00:09:28 storage kernel: md: correcting parity, sector=122183264
Nov 1 00:09:28 storage kernel: md: correcting parity, sector=122187616
Nov 1 00:09:28 storage kernel: md: correcting parity, sector=122196320
Nov 1 00:09:28 storage kernel: md: correcting parity, sector=122213728
Nov 1 00:09:28 storage kernel: md: correcting parity, sector=122218080
Nov 1 00:09:28 storage kernel: md: correcting parity, sector=122252896
Nov 1 00:09:29 storage kernel: md: correcting parity, sector=122435680
Nov 1 00:09:29 storage kernel: md: correcting parity, sector=122440032
Nov 1 00:09:29 storage kernel: md: correcting parity, sector=122514016
Nov 1 00:09:29 storage kernel: md: correcting parity, sector=122540128
Nov 1 00:09:30 storage kernel: md: correcting parity, sector=122548832
Nov 1 00:09:30 storage kernel: md: correcting parity, sector=122557536
Nov 1 00:09:30 storage kernel: md: correcting parity, sector=122561888
Nov 1 00:09:30 storage kernel: md: correcting parity, sector=122566240
Nov 1 00:09:30 storage kernel: md: correcting parity, sector=122592352
Nov 1 00:09:30 storage kernel: md: correcting parity, sector=122596704
Nov 1 00:20:28 storage kernel: md: correcting parity, sector=270032168

After it had finished,.. I manually kicked off the check again after a few hours,.. and it found errors again (it's almost done and is at 218).

Nov 1 14:15:11 storage kernel: md: recovery thread woken up ...
Nov 1 14:15:11 storage kernel: md: recovery thread checking parity...
Nov 1 14:15:11 storage kernel: md: using 1536k window, over a total of 3907018532 blocks.
Nov 1 14:20:20 storage kernel: md: correcting parity, sector=65721672
Nov 1 14:24:08 storage kernel: md: correcting parity, sector=117798280
Nov 1 14:24:13 storage kernel: md: correcting parity, sector=118812296
Nov 1 14:24:15 storage kernel: md: correcting parity, sector=119217032
Nov 1 14:24:15 storage kernel: md: correcting parity, sector=119260552
Nov 1 14:24:16 storage kernel: md: correcting parity, sector=119452040
Nov 1 14:24:27 storage kernel: md: correcting parity, sector=122070112
Nov 1 14:24:28 storage kernel: md: correcting parity, sector=122183264
Nov 1 14:24:28 storage kernel: md: correcting parity, sector=122187616
Nov 1 14:24:28 storage kernel: md: correcting parity, sector=122196320
Nov 1 14:24:28 storage kernel: md: correcting parity, sector=122213728
Nov 1 14:24:28 storage kernel: md: correcting parity, sector=122218080
Nov 1 14:24:28 storage kernel: md: correcting parity, sector=122252896
Nov 1 14:24:29 storage kernel: md: correcting parity, sector=122435680
Nov 1 14:24:29 storage kernel: md: correcting parity, sector=122440032
Nov 1 14:24:29 storage kernel: md: correcting parity, sector=122514016
Nov 1 14:24:29 storage kernel: md: correcting parity, sector=122540128
Nov 1 14:24:29 storage kernel: md: correcting parity, sector=122548832
Nov 1 14:24:29 storage kernel: md: correcting parity, sector=122557536

Have I got a disk that is on its way out ? I see no other errors in the log,.. previously, when I was having disk errors, there'd be something in there.

I'm running 5.0.5 with the following layout:
4TB parity
2 x 2TB
2 x 4TB

I've looked at the smartctl details, and the only thing that looks out of place is in the attached file (WDC_WD40EFRX-68N32N0_3704261.txt). The events in there seem old, but I'm no expert here. The other 4 drives show output similar to the file 'WDC_WD40EFRX-68N32N0_WD-WCC7K5NPLNN1.txt'.

Can the sectors associated with the errors be matched back to a disk ?

Hoping someone can shed some light on what's going on and help me identify whether I do have a disk that needs to be replaced.

Thanks in advance,
Jim.....

WDC_WD40EFRX-68N32N0_3704261.txt
WDC_WD40EFRX-68N32N0_WD-WCC7K5NPLNN1.txt

Edited November 29, 2020 by jimbo123
trurl Posted November 1, 2020
Have you done memtest?
jimbo123 Posted November 1, 2020 (edited)
8 hours ago, trurl said: Have you done memtest?
No, I hadn't. I'll do that shortly. Is there a way to run memtest while the server is running,.. as I run it headless ?
EDIT: memtest running now,.. 50% done, no errors yet.
Edited November 1, 2020 by jimbo123
jimbo123 Posted November 1, 2020
Memtest looks ok so far,.. the 2nd pass is now done, still with no errors. There's only 2GB in this system. Will leave it going for a few more hours.
jimbo123 Posted November 2, 2020
Several hours and passes later,.. still no errors detected,.. did a couple with ECC on,.. still no errors. Will leave it running memtest for a few more hours. Anything else I should check ?
jimbo123 Posted November 3, 2020
Memory checks seem fine,.. is there anything else I should be checking ?
trurl Posted November 3, 2020
On 11/1/2020 at 8:30 AM, jimbo123 said: Can the sectors associated with the errors be matched back to a disk ?
No, but the fact that the checks were getting different sectors is what made me suspect RAM. You might try again after checking all connections. The controller would be the next suspect, I think. Tell us more about your hardware.
It is very difficult for us to support a version of Unraid we haven't seen in years, and the latest version has a lot more to help us troubleshoot. When you get stable again you should upgrade.
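To make the "different sectors" observation concrete, here is a minimal Python sketch that compares the corrected sectors of two checks. The sector numbers are a subset copied from the two syslog excerpts in the first post; `compare_runs` is a hypothetical helper written for this illustration and is not part of Unraid.

```python
# Sector numbers copied from the two syslog excerpts in the first post
# (a subset of each run, for brevity).
first_check = [65721672, 122070112, 122183264]
second_check = [65721672, 117798280, 118812296, 122070112, 122183264]


def compare_runs(first, second):
    """Split the corrected sectors of two checks into three groups."""
    a, b = set(first), set(second)
    return {
        "repeated": sorted(a & b),      # corrected in both checks
        "only_first": sorted(a - b),    # corrected in the first check only
        "only_second": sorted(b - a),   # corrected in the second check only
    }


result = compare_runs(first_check, second_check)
for name, sectors in result.items():
    print(f"{name}: {sectors}")
```

Sectors that land in `only_first` or `only_second` are the ones that differ between checks. Note that the second excerpt in the first post was taken before that check had finished, so a comparison of the full lists would need the complete syslog.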
jimbo123 Posted November 3, 2020
Hardware is one of the early HP Microservers,.. an N36L. Nothing has really changed; it's running the onboard controller. I updated the BIOS firmware to enable the CD/DVD drive port to operate as a SATA disk, or something similar, from memory. One parity disk and 4 data drives, as mentioned earlier. I'll update to current stable and see how things behave.
jimbo123 Posted November 4, 2020
OK,.. so I unplugged all drives, board, cables, etc,.. plugged them back in and kicked off a parity check - it still encounters sync errors. While it goes through the check, I'll read up on the upgrade process and then get it current.
trurl Posted November 4, 2020
5 hours ago, jimbo123 said: read up on the upgrade process
https://wiki.unraid.net/Upgrading_to_UnRAID_v6
jimbo123 Posted November 5, 2020 (edited)
Have now upgraded to 6.8.3. Seems like a similar number of errors are being corrected during the check:
EDIT: Seems to have grown to more than before,..
Edited November 5, 2020 by jimbo123
trurl Posted November 5, 2020
Is this a correcting parity check? After it finishes and before rebooting, go to Tools - Diagnostics and attach the complete Diagnostics ZIP file to your NEXT post in this thread.
In fact, don't reboot at all if you can help it while we are trying to track this down. We need to be able to compare this parity check with the next one in syslog, and syslog resets when you reboot.
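For anyone wanting to do that comparison by hand, below is a rough Python sketch that groups the corrected sectors in a syslog by parity check. It assumes the message wording shown in the 5.0.5 excerpt in the first post ("md: recovery thread checking parity..." and "md: correcting parity, sector=N"); other Unraid versions may word these lines differently, so the two patterns would need adjusting. `sectors_per_check` is a hypothetical helper name, and the sample lines are copied from the first post.

```python
import re

# Assumption: message wording as in the 5.0.5 syslog excerpt in the first post.
START = "md: recovery thread checking parity"
SECTOR = re.compile(r"md: correcting parity, sector=(\d+)")


def sectors_per_check(lines):
    """Return one list of corrected sectors per parity check found in the log."""
    runs = []
    for line in lines:
        if START in line:
            runs.append([])  # a new check begins here
            continue
        match = SECTOR.search(line)
        if match and runs:
            runs[-1].append(int(match.group(1)))
    return runs


sample = """\
Nov 1 00:00:01 storage kernel: md: recovery thread checking parity...
Nov 1 00:05:16 storage kernel: md: correcting parity, sector=65721672
Nov 1 00:09:27 storage kernel: md: correcting parity, sector=122070112
Nov 1 14:15:11 storage kernel: md: recovery thread checking parity...
Nov 1 14:20:20 storage kernel: md: correcting parity, sector=65721672
Nov 1 14:24:08 storage kernel: md: correcting parity, sector=117798280
""".splitlines()

runs = sectors_per_check(sample)
for number, sectors in enumerate(runs, start=1):
    print(f"check {number}: {sectors}")
```

Because the grouping relies on both checks being in the same log, it only works if the server has not been rebooted between them, which is the reason for the advice above.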
jimbo123 Posted November 5, 2020
Yes, correcting as it goes. I've attached the diagnostic zip below. I've also included one that I took about 1 hr into the parity check from yesterday, which may be of use.
storage-diagnostics-20201106-0837.zip
storage-diagnostics-20201105-1026.zip
jimbo123 Posted November 5, 2020
7 hours ago, trurl said: In fact, don't reboot at all if you can help it while we are trying to track this down. We need to be able to compare this parity check with the next one in syslog, and syslog resets when you reboot.
Based on this, I'll kick off another parity check and provide the diagnostic zip when it's completed.
JorgeB Posted November 6, 2020
That Microserver uses ECC RAM, so it's unlikely to be a memory problem. It could be a disk, but IMHO it's most likely a board/controller issue. Do you have another controller you could use, or another board/PC you could transplant the disks to?
jimbo123 Posted November 6, 2020
Hi JorgeB,
Yes, I have 2 others doing other things, one as a low-end ESX box (40L), which, by the looks of it, Unraid can now do, so it's probably the obvious candidate. Will just have to figure out the whole BIOS thing, as from memory, the onboard SATA port used for the CD/DVD drive at the top wouldn't support a disk and some Russian guy came up with some custom firmware to overcome this restriction,... or something similar.
The 2nd parity check just finished, results below,.. pretty consistent over the 2 runs, only fewer errors this time:
I've also attached the diagnostic zip again.
Jim.....
storage-diagnostics-20201106-1936.zip
jimbo123 Posted November 12, 2020
Did another check earlier today which came up with errors again,.. more this time. Diag zip attached. Will swap the board with the one running ESX shortly and see what happens.
storage-diagnostics-20201112-2126.zip
jimbo123 Posted November 12, 2020
Replaced the board, it's now a 40L with 16GB RAM. Kicked off a parity check,.. looks the same, as it's correcting errors. Diag zip attached.
storage-diagnostics-20201112-2225.zip
JorgeB Posted November 12, 2020
If there were issues before, this check is expected to correct some errors. If the issue is fixed, the next one should find 0 sync errors.
jimbo123 Posted November 12, 2020
And more errors,.. is this now pointing to a disk/s issue ?
JorgeB Posted November 12, 2020
3 minutes ago, JorgeB said: If there were issues before this check is expected to correct some errors, if the issue is fixed next one should find 0 sync errors.
jimbo123 Posted November 12, 2020
2 minutes ago, JorgeB said: If there were issues before this check is expected to correct some errors, if the issue is fixed next one should find 0 sync errors.
Clean shutdown,.. a check was done earlier in the day, with many corrected errors, prior to replacing the board. Will wait for this one to finish and then run another tomorrow.
JorgeB Posted November 12, 2020
1 minute ago, jimbo123 said: Clean shutdown,.. c
It's not about that. If there is a hardware problem, some sectors were being wrongly corrected, so once that's solved it's normal for the 1st check to still find errors. But if the 2nd one still finds more, then it's still not fixed.
jimbo123 Posted November 12, 2020
So the check finished,.. lots of corrected errors. Will kick off another now.
jimbo123 Posted November 14, 2020
OK,.. I'd say I've still got a problem here somewhere after the board replacement. Currently doing another check,.. did a few yesterday also, 2 of which looked better, but the one going now is still correcting many errors. Given I'm now running with a different board and different memory, is this now pointing to a disk ?