June 17, 2010
I've tried checking parity a few times in a row, but I'm getting a write error on the parity drive. Should I be concerned, and what can I do to fix it?
June 17, 2010
> I've tried checking parity for a few times in a row but I'm getting a write error on the parity drive. Should I be concerned and what can I do to fix it?

Yes, you should be concerned. You should post a syslog, because what you said is roughly equivalent to: "I'm hearing a rattle when I drive my car. What bolt do I need to tighten to silence it?"

You have some kind of hardware issue. It could be almost anything (well, probably not the mouse or monitor): power supply, cabling, hard disk, motherboard, memory, BIOS configuration, clock speed, voltages, memory timing, etc.

The only clue will be a syslog from the time period when you experienced the parity error. Once you post it, we can advise which tests to perform to try to isolate the problem.
June 17, 2010 (Author)
I just read this: http://lime-technology.com/wiki/index.php?title=Troubleshooting#Capturing_your_syslog

Here's the syslog. It seems kind of messy.

Attachment: syslog.txt
June 17, 2010
> I just read this: http://lime-technology.com/wiki/index.php?title=Troubleshooting#Capturing_your_syslog
> here's the syslog...seems kinda messy.

I see no evidence of a "write" error. If a "write" error had occurred, unRAID would have immediately taken the disk off-line and you would see a "red" indicator adjacent to it on the web interface.

Are you seeing a "red" indicator? If not, where are you seeing the error?

To see if it is a disk problem, you can run the following commands on your disks:

    smartctl -d ata -a /dev/sda
    smartctl -d ata -a /dev/sdb
    smartctl -d ata -a /dev/sdc
    smartctl -d ata -a /dev/sdd

You are looking for the parameters for re-allocated sectors or sectors pending re-allocation.
June 17, 2010 (Author)
Basically, the web GUI was showing:

            Model / Serial No.                Temperature  Size         Free  Reads   Writes  Errors
    parity  WDC_WD10EAVS-00D_WD-WCAU49248571  26°C         976,762,552  -     65,743  0       0

but with Writes = 1 and "Sync errors: 1".

The output above shows no error because I just rebooted the array, which cleared the statistics. I'm doing another check, which won't be done until 2am or so.

There's no red indicator and everything runs fine. I'm not sure if it's a temperature warning from SMART. The hottest my parity drive has gotten was 39°C, after 2-3 parity checks in a row.

I'll run the commands when the parity check is done and post the results tomorrow.

Thanks for the prompt reply!
June 17, 2010 (Author)
Here are the SMART reports. I don't think anything's wrong? Also, how do I get the report for the other 6400AAKS? /dev/sde didn't work.

Attachments: smart1.txt, smart2.txt, smart3.txt, smart4.txt
June 17, 2010
It might not be sde. What disks show up in the "Devices" link of unRAID for that drive?

From your syslog, it is sdf.

    Jun 15 14:45:02 Tower emhttp: Device inventory:
    Jun 15 14:45:02 Tower emhttp: pci-0000:00:0e.0-scsi-0:0:0:0 host1 (sda) WDC_WD10EARS-00Y5B1_WD-WCAV5C749570
    Jun 15 14:45:02 Tower emhttp: pci-0000:00:0e.0-scsi-1:0:0:0 host2 (sdb) WDC_WD6400AAKS-75A7B0_WD-WCASY1489916
    Jun 15 14:45:02 Tower emhttp: pci-0000:00:0f.0-scsi-0:0:0:0 host3 (sdc) WDC_WD10EAVS-00D7B1_WD-WCAU49248571
    Jun 15 14:45:02 Tower emhttp: pci-0000:00:0f.0-scsi-1:0:0:0 host4 (sdd) WDC_WD10EAVS-14D7B1_WD-WCAU49737055
    Jun 15 14:45:02 Tower emhttp: pci-0000:04:09.0-scsi-0:0:0:0 host13 (sdf) WDC_WD6400AAKS-75A7B0_WD-WCASY1185307
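If reading the inventory by eye is error-prone, the same lookup can be scripted. The following is only a sketch: it assumes the syslog lines look exactly like the emhttp "Device inventory" lines quoted above, and the names `INVENTORY_RE` and `parse_inventory` are made up for this illustration, not part of unRAID.

```python
import re

# Matches lines such as:
#   ... emhttp: pci-0000:00:0e.0-scsi-0:0:0:0 host1 (sda) WDC_WD10EARS-00Y5B1_WD-WCAV5C749570
# The layout is assumed from the syslog excerpt in this thread.
INVENTORY_RE = re.compile(r"emhttp: (\S+) (host\d+) \((sd[a-z]+)\) (\S+)")

# Two inventory lines copied from the thread, plus the header line.
SAMPLE = """\
Jun 15 14:45:02 Tower emhttp: Device inventory:
Jun 15 14:45:02 Tower emhttp: pci-0000:00:0f.0-scsi-0:0:0:0 host3 (sdc) WDC_WD10EAVS-00D7B1_WD-WCAU49248571
Jun 15 14:45:02 Tower emhttp: pci-0000:04:09.0-scsi-0:0:0:0 host13 (sdf) WDC_WD6400AAKS-75A7B0_WD-WCASY1185307
"""


def parse_inventory(syslog_text):
    """Return {device name: {"path", "host", "ident"}} for each inventory line."""
    devices = {}
    for line in syslog_text.splitlines():
        match = INVENTORY_RE.search(line)
        if match is None:
            continue  # not an inventory entry (e.g. the header line)
        path, host, dev, ident = match.groups()
        devices[dev] = {"path": path, "host": host, "ident": ident}
    return devices


if __name__ == "__main__":
    for dev, info in sorted(parse_inventory(SAMPLE).items()):
        print(f"/dev/{dev}: {info['ident']} on {info['host']}")
```

Looking up the key `"sdf"` in the result gives the second 6400AAKS, which is the device name to pass to smartctl.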
June 17, 2010 (Author)
    parity device: pci-0000:00:0f.0-scsi-0:0:0:0 host3 (sdc) WDC_WD10EAVS-00D7B1_WD-WCAU49248571
    disk1 device: pci-0000:00:0f.0-scsi-1:0:0:0 host4 (sdd) WDC_WD10EAVS-14D7B1_WD-WCAU49737055
    disk2 device: pci-0000:00:0e.0-scsi-0:0:0:0 host1 (sda) WDC_WD10EARS-00Y5B1_WD-WCAV5C749570
    disk3 device: pci-0000:00:0e.0-scsi-1:0:0:0 host2 (sdb) WDC_WD6400AAKS-75A7B0_WD-WCASY1489916
    disk4 device: pci-0000:04:09.0-scsi-0:0:0:0 host13 (sdf) WDC_WD6400AAKS-75A7B0_WD-WCASY1185307

Every time I sync the drives I get 1 sync error at around 80+%. But how do I find out what the error is?

Edit: ah, that's how you see which drive it is, eh?
June 17, 2010 (Author)
> Also the Devices page on the web page should list what device it is.

Yeah, I saw it over there.
June 17, 2010 (Author)
OK, I think I may have found the problem. I've done another set of SMART tests and compared them to the initial results. These are my findings:

    Device Model: WDC WD10EAVS-00D7B1   (sda) (onboard SATA)
    199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 2

    Device Model: WDC WD10EARS-00Y5B1   (sdb) (onboard SATA)
    199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 1

    Device Model: WDC WD6400AAKS-75A7B0   (sdc) (onboard SATA)
    199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 133

    Device Model: WDC WD10EAVS-14D7B1   (sdd) (onboard SATA)
    199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0

    Device Model: WDC WD6400AAKS-75A7B0   (sdf) (Supermicro controller)
    199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0

I highly doubt it is a wiring issue. I have just taken the case apart and everything is connected properly. The power/reset/USB cable is near the first few HDDs, but I'd be surprised if that alone caused enough interference.

I moved the power/USB cable and made sure that all the SATA connectors are secure. I will run another set of tests now to see what's up. If there's still a problem, I'll connect the second-row SATA cables to the motherboard to see if it's really the bottom-most row getting interference.

I swapped the drives to the controller and re-ran smartctl -d ata -t short /dev/sd*, but I'm still getting the same CRC error counts.
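A note on comparing these readings: the raw UDMA_CRC_Error_Count is, as far as I know, a running total that the drive keeps itself, so moving a drive to another port would not be expected to reset it. What matters is whether the number grows between two runs. Below is a minimal sketch of that comparison. It assumes the ten-column attribute line layout shown in the post above (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value); `parse_attribute` and `raw_delta` are hypothetical helper names, not smartctl features.

```python
def parse_attribute(line):
    """Parse one SMART attribute line into a dict, or return None if it does not fit.

    Assumed column order: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
    """
    fields = line.split()
    if len(fields) < 10 or not fields[0].isdigit():
        return None
    try:
        raw = int(fields[9])
    except ValueError:
        return None  # raw value is not a plain integer
    return {"id": int(fields[0]), "name": fields[1], "raw": raw}


def raw_delta(before_line, after_line):
    """Return how much the raw value grew between two readings of the same attribute."""
    before = parse_attribute(before_line)
    after = parse_attribute(after_line)
    if before is None or after is None or before["id"] != after["id"]:
        raise ValueError("lines are not two readings of the same attribute")
    return after["raw"] - before["raw"]


if __name__ == "__main__":
    # Attribute line copied from the post; the same reading is used twice here
    # because the thread reports identical counts before and after the swap.
    reading = "199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 133"
    print(raw_delta(reading, reading))
```

A delta of zero after a parity check would suggest that no new CRC errors were logged on the current cabling; a growing count would point back at the link between drive and controller.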
June 18, 2010 (Author)
Well, after moving all the drives to the controller I'm still getting sync errors. This cannot possibly be a cabling issue. What should I do next?

Can someone help or am I talking to myself here? :'(
June 18, 2010
> Can someone help or am I talking to myself here? :'(

That's a little rude. You aren't paying people for their help here.

Did you try connecting the drives directly to the cables instead of through the backplane you have there? Could be that.
June 18, 2010 (Author)
No, I haven't tried connecting to the HDDs directly, bypassing the backplane. I have moved the drives to a different row and I'm still getting 1-2 sync errors; the CRC error counts are still the same. I'm going to pull the drives with CRC errors out and check them on another PC with the WD tool.

EDIT: To further test whether it was a cabling issue, I swapped the drive with 133 CRC errors into a slot where one of the drives showed 0 CRC errors. I ran smartctl -d ata -t short /dev/sd_ and still found the same number of CRC errors. It seems that regardless of where I move the drives with CRC errors, I'm still getting the same reading from SMART.

I have also run reiserfsck --check /dev/md# on all the drives with CRC errors (except for the parity drive, which also has 2 CRC errors) and found no errors. I am now running a long SMART test on the drive with 133 CRC errors to see if anything changes.

As mentioned, the parity drive itself shows 2 CRC errors. Coincidentally (or maybe not so much of a coincidence), I am getting 2 parity sync errors. Could this be the error that unRAID is reporting?

To summarize, what I have done so far to troubleshoot:
1) Ran the parity check multiple times; still getting the same error.
2) Unplugged all motherboard connectors and replugged them; still getting the same error.
3) Moved all drives to the controller; still getting the same error.
4) Swapped the drive with CRC errors into a slot where another drive showed 0 CRC errors, to rule out a cabling issue; still getting the same CRC error count.

What I am doing or plan to do:
1) Run a long SMART test on the drive with 133 CRC errors.
2) Remove the drives with CRC errors, put them into a Windows machine, and run diagnostics with the WD tool.
3) Remove the parity drive and convert it to a storage drive. Move the data from another 1 TB drive with 0 CRC errors onto the old parity drive, then make that drive with 0 CRC errors the new parity drive and do another sync.

One thing I am curious about: can a SMART test be performed even when the drives are spun down?
June 18, 2010
> One thing I am curious is that SMART test can be performed even when the drives are spun down?

A short test may be able to run while the drive is spun down, but a long test needs to run while the drive is spinning. Whatever drive you are running the long test on, make sure to disable the spin-down timer for that drive. The spin-down timer will kill any SMART test that is running when it triggers.
June 18, 2010 (Author)
That's nice to know. Anyway, I didn't bother completing the long test; I believe I have done enough to rule out bad cabling.

I disabled the parity drive, took it out, and put it into my desktop. Sure enough, the WD tool showed a CRC error count of 0, while HD Tune found the same 2 CRC errors that smartctl found.

I'm now in the process of zero-filling the drive. The first pass is almost done (3 hours for a 1 TB drive). I will run the second pass afterward and then do another SMART test with HD Tune to see if this is resolved.
June 18, 2010
A 'sync error' is not a write error. What 'parity check' does is march through the array, stripe by stripe, reading each data disk and the parity disk. (Think of a 'stripe' as the set of 4K blocks at the same logical address on each disk in the array.) For each stripe, all data blocks and the parity block are XOR'ed together. If the resultant 4K block is not "all zeros", then a sync error is reported. This means one of the 4K blocks was incorrect, but we don't know which one.

What the software does after detecting a sync error is XOR the data disk blocks together and write the resulting parity block to the parity disk. This is why you see the write count for parity also incremented each time a sync error occurs.

Note that if you run parity check again and still see a sync error, this implies that one of the array disks is returning bad data without reporting that it's bad. This is the worst kind of error, and I'd be really surprised if it was the disk (but it's not impossible). Probably it's the cabling, which would include connectors and backplane, or the disk controller, motherboard, motherboard RAM, or power supply.

One thing I can do is generate a 'test' release for you that will report a little more information in the system log, e.g., the sector where the sync error is occurring. This may provide more clues about where the fault lies. Send me an email: [email protected]
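The stripe-by-stripe check described above can be illustrated with a small simulation. This is only a sketch of the XOR idea, not unRAID's actual driver code: disks are modelled as Python lists of 4K `bytes` blocks, and the function names `xor_blocks` and `parity_check` are invented for this example.

```python
from functools import reduce

BLOCK_SIZE = 4096  # 4K blocks, as in the description above


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_check(data_disks, parity_disk):
    """Walk every stripe, count sync errors, and rewrite bad parity blocks.

    data_disks: list of disks, each a list of BLOCK_SIZE-byte blocks.
    parity_disk: list of blocks; corrected in place when a stripe is out of sync.
    Returns (sync_errors, parity_writes).
    """
    sync_errors = 0
    parity_writes = 0
    for index in range(len(parity_disk)):
        stripe = [disk[index] for disk in data_disks]
        # Data blocks XOR parity block should come out as all zeros.
        if any(xor_blocks(stripe + [parity_disk[index]])):
            sync_errors += 1
            # We cannot tell which block was wrong, so parity is recomputed
            # from the data blocks and written back.
            parity_disk[index] = xor_blocks(stripe)
            parity_writes += 1
    return sync_errors, parity_writes


if __name__ == "__main__":
    disk1 = [bytes([1]) * BLOCK_SIZE, bytes([2]) * BLOCK_SIZE]
    disk2 = [bytes([4]) * BLOCK_SIZE, bytes([8]) * BLOCK_SIZE]
    # Second parity block is deliberately wrong (all zeros).
    parity = [xor_blocks([disk1[0], disk2[0]]), bytes(BLOCK_SIZE)]
    print(parity_check([disk1, disk2], parity))  # first pass finds the bad stripe
    print(parity_check([disk1, disk2], parity))  # second pass should be clean
```

In this model a second pass is clean because the simulated disks always return the same data. A sync error that keeps reappearing on real hardware therefore suggests that something in the read path is returning different or bad data from one read to the next, which matches the reasoning in the post.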
June 18, 2010 (Author)
Thanks for the insight, Tom. I'd imagine it's not the controller or the backplane. As mentioned, regardless of whether the HDDs are connected to the onboard controller or the Supermicro controller, I'm still getting the same sync error. I have also swapped the drives to a different backplane, so it shouldn't be the backplane either (hopefully).

Of the given list, I'd cast the most suspicion on the RAM. I'd be happy to run any test build that could help identify the problem.
June 21, 2010 (Author)
I've been trying to figure out which drive is causing the error. Two 1 TB drives are confirmed to be fine (parity and one data disk). When I was going to test a second 1 TB data drive, I messed up and unRAID tried to clear the drive. I forced a shutdown over telnet when it had cleared 3%. The superblock got messed up, but luckily I managed to get it rebuilt. The system is currently running a complete rebuild.

How long would it take to scan and rebuild a 1 TB drive?