unRaid 4.5.4 Parity drive write error

June 17, 2010

I've run a parity check a few times in a row, but I keep getting a write error on the parity drive. Should I be concerned, and what can I do to fix it?

June 17, 2010

I've run a parity check a few times in a row, but I keep getting a write error on the parity drive. Should I be concerned, and what can I do to fix it?

Yes, you should be concerned.

You should post a syslog... Because you basically said something equivalent to:

"I'm hearing a rattle when I drive my car. What bolt do I need to tighten to silence it?"

You basically have some kind of hardware issue. It could be almost anything (well, almost... probably not the mouse or monitor): power supply, cabling, hard disk, motherboard, memory, BIOS configuration, clock speed, voltages, memory timing, etc.

The only clue will come from a syslog covering the time period in which you experienced the parity error. Post it, and then we can advise which tests to run to try to isolate the problem.

June 17, 2010

Author

I just read this:

http://lime-technology.com/wiki/index.php?title=Troubleshooting#Capturing_your_syslog

here's the syslog...seems kinda messy.

syslog.txt

June 17, 2010

I just read this:

http://lime-technology.com/wiki/index.php?title=Troubleshooting#Capturing_your_syslog

here's the syslog...seems kinda messy.

I see no evidence of a "write" error. If a "write" error had occurred, unRAID would have immediately taken the disk off-line and you would see a "red" indicator adjacent to it on the web interface. Are you seeing a "red" indicator?

If not, where are you seeing the error?

To see if it is a disk problem, you can run the following commands on your disks:

smartctl -d ata -a /dev/sda

smartctl -d ata -a /dev/sdb

smartctl -d ata -a /dev/sdc

smartctl -d ata -a /dev/sdd

You are looking for the attributes that count reallocated sectors or sectors pending reallocation (Reallocated_Sector_Ct and Current_Pending_Sector in the smartctl output).
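For anyone who would rather not read the attribute table by eye, here is a minimal sketch of how the output of the smartctl commands above could be scanned for those counters. The function names and the SAMPLE text are made up for illustration: only the UDMA_CRC_Error_Count row comes from this thread, and the other two rows are placeholders. It assumes the usual ten-column attribute table that `smartctl -a` prints.

```python
# Illustrative sample of a smartctl attribute table. Only the 199 row is
# taken from this thread; the 5 and 197 rows are hypothetical placeholders.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       133
"""

# Attributes worth a second look when their raw value is above zero.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "UDMA_CRC_Error_Count")


def parse_smart_attributes(text):
    """Map attribute name -> raw value for every attribute row in the text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip it
    return attrs


def nonzero_watched(attrs):
    """Return only the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


if __name__ == "__main__":
    print(nonzero_watched(parse_smart_attributes(SAMPLE)))
```

To use it on a real drive, save the output of one of the smartctl commands to a file and pass the file's contents to parse_smart_attributes instead of SAMPLE.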

June 17, 2010

Author

Basically the web gui was showing:

        Model / Serial No.                Temperature  Size         Free  Reads   Writes  Errors
parity  WDC_WD10EAVS-00D_WD-WCAU49248571  26°C         976,762,552  -     65,743  0       0

with write = 1

and,

Sync errors: 1

The above shows no errors because I just rebooted the array, which cleared the statistics. I'm running another check, which won't finish until 2am or so. There's no red indicator and everything runs fine. I'm not sure if it's a temperature warning from SMART. The hottest my parity drive has gotten is 39C, after 2-3 parity checks in a row. I'll run the commands when the parity check is done and post the results tomorrow.

Thanks for the prompt reply!

June 17, 2010

Author

Here's a screen shot of the error I got:


June 17, 2010

Author

Here are the SMART reports. I don't think anything's wrong?

Also, how do I get the report for the other 6400AAKS? /sde didn't work.

June 17, 2010

It might not be sde. What disk shows up in the "Devices" link of unRAID for that drive?

From your syslog, it is sdf.

Jun 15 14:45:02 Tower emhttp: Device inventory:

Jun 15 14:45:02 Tower emhttp: pci-0000:00:0e.0-scsi-0:0:0:0 host1 (sda) WDC_WD10EARS-00Y5B1_WD-WCAV5C749570

Jun 15 14:45:02 Tower emhttp: pci-0000:00:0e.0-scsi-1:0:0:0 host2 (sdb) WDC_WD6400AAKS-75A7B0_WD-WCASY1489916

Jun 15 14:45:02 Tower emhttp: pci-0000:00:0f.0-scsi-0:0:0:0 host3 (sdc) WDC_WD10EAVS-00D7B1_WD-WCAU49248571

Jun 15 14:45:02 Tower emhttp: pci-0000:00:0f.0-scsi-1:0:0:0 host4 (sdd) WDC_WD10EAVS-14D7B1_WD-WCAU49737055

Jun 15 14:45:02 Tower emhttp: pci-0000:04:09.0-scsi-0:0:0:0 host13 (sdf) WDC_WD6400AAKS-75A7B0_WD-WCASY1185307
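The lookup done by hand above (finding which /dev/sdX belongs to which drive) can also be scripted. The following is a rough sketch, assuming the inventory lines keep the shape shown in this syslog excerpt; the regular expression and the device_map name are my own and not part of unRAID.

```python
import re

# Two of the inventory lines quoted above, used as sample input.
INVENTORY = """\
Jun 15 14:45:02 Tower emhttp: pci-0000:00:0f.0-scsi-0:0:0:0 host3 (sdc) WDC_WD10EAVS-00D7B1_WD-WCAU49248571
Jun 15 14:45:02 Tower emhttp: pci-0000:04:09.0-scsi-0:0:0:0 host13 (sdf) WDC_WD6400AAKS-75A7B0_WD-WCASY1185307
"""

# Assumed line shape: "emhttp: <pci path> host<N> (<sdX>) <model_serial>"
LINE = re.compile(r"emhttp: \S+ host\d+ \((sd[a-z]+)\) (\S+)")


def device_map(syslog_text):
    """Map each model/serial string to its device node, e.g. /dev/sdf."""
    return {m.group(2): "/dev/" + m.group(1) for m in LINE.finditer(syslog_text)}


if __name__ == "__main__":
    for drive, node in device_map(INVENTORY).items():
        print(node, drive)
```

The resulting node is what goes on the end of the smartctl command for that drive.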

June 17, 2010

Author

parity device: pci-0000:00:0f.0-scsi-0:0:0:0 host3 (sdc) WDC_WD10EAVS-00D7B1_WD-WCAU49248571
disk1 device: pci-0000:00:0f.0-scsi-1:0:0:0 host4 (sdd) WDC_WD10EAVS-14D7B1_WD-WCAU49737055

disk2 device: pci-0000:00:0e.0-scsi-0:0:0:0 host1 (sda) WDC_WD10EARS-00Y5B1_WD-WCAV5C749570

disk3 device: pci-0000:00:0e.0-scsi-1:0:0:0 host2 (sdb) WDC_WD6400AAKS-75A7B0_WD-WCASY1489916

disk4 device: pci-0000:04:09.0-scsi-0:0:0:0 host13 (sdf) WDC_WD6400AAKS-75A7B0_WD-WCASY1185307

Every time I sync the drives I get 1 sync error at around 80+%. But how do I find out what the error is?

edit: ah...that's how you see which drive it is eh?

June 17, 2010

Also the Devices page on the web page should list what device it is.

June 17, 2010

Author

Also the Devices page on the web page should list what device it is.

Yeah I saw it over there.

June 17, 2010

Author

OK, I think I may have found the problem. I've run another set of SMART tests and compared them to the initial results. These are my findings:

Device Model: WDC WD10EAVS-00D7B1

199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 2 (sda) (onboard sata)

Device Model: WDC WD10EARS-00Y5B1

199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 1 (sdb) (onboard sata)

Device Model: WDC WD6400AAKS-75A7B0

199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 133 (sdc) (onboard sata)

Device Model: WDC WD10EAVS-14D7B1

199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0 (sdd) (onboard sata)

Device Model: WDC WD6400AAKS-75A7B0

199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0 (sdf) (Supermicro controller)

I highly doubt it is a wiring issue. I have just taken the case apart and everything is connected properly. The power/reset/USB cable is near the first few HDDs, but I'd be surprised if that alone caused enough interference. I moved the power/USB cable and made sure that all the SATA connectors are secure. I will run another set of tests now to see what's up with it. If there's still a problem, I'll connect the second-row SATA cables to the mobo to see if it's really the bottommost row getting interference.

I swapped the drives to the controller, and re-ran

smartctl -d ata -tshort /dev/sd*

but I'm still getting the same CRC error.
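A note on reading these numbers: as far as I know, the raw value of UDMA_CRC_Error_Count is a running total kept by the drive itself, so it is not expected to drop back to zero after a cable is reseated or the drive is moved to another port. What would point to a link that is still bad is the count going up between two runs. Here is a small sketch of that comparison. The first snapshot uses the counts posted above, keyed by the poster's own labels; the second snapshot simply repeats them, since the post reports unchanged counts. The crc_increases helper is hypothetical.

```python
# Raw UDMA_CRC_Error_Count values as posted above, keyed by the poster's labels.
posted = {"sda": 2, "sdb": 1, "sdc": 133, "sdd": 0, "sdf": 0}

# The poster reports the same counts after moving the drives to the controller.
after_move = dict(posted)


def crc_increases(before, after):
    """Return {label: increase} for every drive whose counter went up."""
    return {
        label: after[label] - before[label]
        for label in before
        if label in after and after[label] > before[label]
    }


if __name__ == "__main__":
    grown = crc_increases(posted, after_move)
    if grown:
        print("New CRC errors since the first snapshot:", grown)
    else:
        print("No counter increased between the two snapshots.")
```

In practice it would be safer to key the snapshots by serial number rather than by sdX label, because the device letters follow the port and can change when drives are moved.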

June 18, 2010

Author

Well, after moving all the drives to the controller I'm still getting sync errors. This cannot possibly be a cabling issue. What should I do next? Can someone help or am I talking to myself here? :'(

June 18, 2010

Can someone help or am I talking to myself here? :'(

That's a little rude. You aren't paying people for their help here.

Did you try connecting the drives directly to the cables instead of through the backplane you have there? Could be that.

June 18, 2010

Author

No, I haven't tried connecting to the HDDs directly, bypassing the backplane. I have moved the drives to a different row and I'm still getting 1-2 sync errors, and the CRC error counts are still the same. I'm going to pull the drives with CRC errors and check them on another PC with the WD tool.

EDIT: To further test whether it was a cabling issue, I swapped the drive with 133 CRC errors into a slot where one of the drives showed 0 CRC errors. I ran smartctl -d ata -tshort /dev/sd_ and still found the same number of CRC errors. It seems that regardless of where I move the drives with CRC errors, I still get the same reading from SMART.

I have also run reiserfsck --check /dev/md# on all the drives with CRC errors (except for the parity drive, which also has 2 CRC errors) and found no errors.

I am now running a long SMART test on the drive with 133 CRC errors to see if anything changes. As mentioned, the parity drive itself shows 2 CRC errors, and coincidentally (or maybe not so much of a coincidence) I am getting 2 parity sync errors. Could this be the error that unRAID is reporting?

To summarize, what I have done so far to troubleshoot:

1) ran parity test multiple times, still getting the same error

2) unplugged all mobo connectors and replugged them, still getting the same error

3) moved all drives to controller, still getting the same error

4) swapped drive with crc error, into slot where other drive showed 0 crc error to rule out cabling issue, still getting the same crc error

What I am doing/plan to do:

1) running long smart test on drive with 133 crc error

2) remove drive with crc error, toss them into windows and do a diagnostic with WD tool

3) remove parity drive, convert it to storage drive. move data from another 1tb drive with 0 crc error to parity drive. move drive with 0 crc error to parity and do another sync.

One thing I am curious about: can a SMART test be performed even when the drives are spun down?

June 18, 2010

One thing I am curious about: can a SMART test be performed even when the drives are spun down?

A short test may be able to run while the drive is spun down, but a long test needs to be run while the drive is spinning. Whatever drive you are running the long test on, make sure to disable the spin-down timer for that drive. The spin-down timer will kill any SMART test that is running when it triggers.

June 18, 2010

Author

That's nice to know.

Anyway, I didn't bother completing the long test. I believe I have done enough to rule out bad cabling. I disabled the parity drive, took it out and tossed it into my desktop. Sure enough, the WD tool showed a CRC error count of 0 ::) HD Tune found the same 2 CRC errors that smartctl found. I'm now in the process of zero-fill formatting the drive. The first pass is almost done (3 hours for a 1 TB drive). I will run the second pass afterward and do another SMART test with HD Tune to see if this is resolved.

June 18, 2010

A 'sync error' is not a write error. What 'parity check' does is march through the array, stripe-by-stripe, reading each data disk and the parity disk. (Think of a 'stripe' as the set of 4K blocks at the same logical address on each disk in the array.)

For each stripe, all data blocks and the parity block are XOR'ed together - if the resultant 4K block is not "all zeros", then a sync error is reported - this means one of the 4K blocks was incorrect, but we don't know which one. But what the software does, after detecting a sync error, is to XOR the data disk blocks together, and write the resulting parity block to the parity disk. This is why you see the write count for parity also incremented each time a sync error occurs.
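The check described above can be sketched in a few lines. This is only a toy illustration of the XOR logic, not unRAID's actual code: xor_blocks and check_stripe are hypothetical names, and 4-byte blocks stand in for the real 4K blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def check_stripe(data_blocks, parity_block):
    """Check one stripe.

    Returns (in_sync, parity_to_keep). If XOR-ing every data block with the
    parity block is not all zeros, the stripe has a sync error and the parity
    is recomputed from the data blocks, which is the extra parity write.
    """
    combined = xor_blocks(list(data_blocks) + [parity_block])
    in_sync = not any(combined)
    if in_sync:
        return True, parity_block
    return False, xor_blocks(data_blocks)


if __name__ == "__main__":
    # Toy stripe: three data disks, 4-byte blocks instead of 4K.
    data = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
    good_parity = xor_blocks(data)
    print(check_stripe(data, good_parity)[0])  # stripe is in sync

    # Corrupt the first parity byte to simulate a bad block somewhere.
    bad_parity = bytes([good_parity[0] ^ 0xFF]) + good_parity[1:]
    in_sync, rewritten = check_stripe(data, bad_parity)
    print(in_sync, rewritten == good_parity)
```

Note that the sketch, like the check it imitates, cannot tell which block in the stripe was wrong. It always treats the data blocks as correct and rewrites parity to match them.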

Note that if you run parity check again, if you still see a sync error, this implies that one of the array disks is returning bad data without reporting that it's bad. This is the worst kind of error & I'd be really surprised if it was the disk (but not impossible). Probably it's the cabling, which would include connectors and backplane, or disk controller or motherboard or motherboard RAM or power supply.

One thing I can do is generate a 'test' release for you that will report a little more information in the system log, eg, the sector where the sync error is occurring. This may provide more clues about where the fault lies. Send me an email: [email protected]

June 18, 2010

Author

Thanks for the insight, Tom. I'd imagine it's not the controller or the backplane. As mentioned, regardless of whether the HDDs are connected to the onboard controller or the Supermicro controller, I'm still getting the same sync error. I have also swapped the drives to a different backplane, so it shouldn't be the backplane either (hopefully). Of the given list, I'd cast the most suspicion on the RAM.

I'd be happy to run any test build that could help identify the problem.

June 21, 2010

Author

Been trying to figure out which drive is causing the error. Two 1TB drives confirmed to be fine. (parity and one data disk). When I was going to test a second 1 TB drive, I messed up and unRaid tried to clear the drive. Forced shutdown over telnet when it cleared 3%. Superblock got messed up but luckily I managed to get it rebuilt. The system is currently running a complete rebuild. How long would it take to scan and rebuild a 1tb drive?
