WD EARS - Is pre-clearing necessary? - General Support (V5 and Older)

July 25, 2011

I just received two WD EARS 2.0 TB drives from Newegg. I upgraded my unit to unRAID 4.7, set the partition alignment setting to 4k, and attempted to replace my 1.5 TB parity drive with one of the new units. The parity sync failed with 288 errors.

Figuring I may have received a bad drive, I swapped it with the other new drive and the parity sync failed once again... with the same number of errors. (Sorry, I am at work and do not have access to my syslogs.)

Since this server is used primarily for storing media (movies, music, ebooks), I really am not concerned about the minor speed increase AF provides... I am just looking to upgrade my capacity while replacing drives that have been running for a year or so.

This morning, after seeing the second failure, I jumpered pins 7-8 and reset the format setting to the default (I forget what the default says). The parity check was running when I left for work, and I will not know whether it completed for a few more hours.

My questions are:

1- Is a pre-clear necessary on these drives to get them to work under the AF settings?

2- Why would both drives fail with an identical number of errors? (This may never be answered as I do not have the 1st syslog).

3- By jumpering the unit, do I lose the ability to move to AF at a later point (not in Unraid)?... sorry but some of the posts almost made it sound like it would tank the drive if you unjumper it... almost like it damages the drive.

July 25, 2011

Do you run regular parity checks on a monthly basis?

The process for upgrading a parity drive is:

A) Run a parity check and ensure that it completes without errors

B) Stop the array, power down

C) Replace the drive

D) Boot the server, start the array (clicking the 'I'm sure I want to do this' checkbox)

E) Wait for the parity sync to complete

F) Run another parity check and ensure that it completes without errors

It sounds like you skipped step A. It is very likely that the parity errors existed in your array before you replaced your 1.5 TB drive. When you get home from work, please capture a syslog before powering down your server and post it here.

Have you powered on your WDEARS drive after installing the jumper? If so, then you should leave the jumper in place for the rest of the drive's life. When adding that particular drive to your array, go to the unRAID settings page and set the partition alignment setting to 'unaligned'. Once that drive is successfully added, then you can change the setting back to '4k mbr aligned'.

1) No. As long as unRAID's partition alignment is configured correctly, the drives will work with proper AF alignment even if they haven't been precleared. Your situation is slightly complicated by the fact that you now have one drive with a jumper and one drive without a jumper. The drive with a jumper should be added to the array with the 'mbr unaligned' setting in place. The drive without a jumper should be added to the array with the '4k mbr aligned' setting in place. Keep in mind that the unRAID partition alignment setting only affects future disks added, not disks currently in the array. So you need to adjust the setting before adding the new drives.

2) Most likely the errors existed before you added the new disks.

3) No. By using the jumper you are forcing the drive to align at sector 64 (which is correct for AF drives). As long as you follow my advice above, everything should be fine. Just don't add the jumpered drive with the 'mbr 4k aligned' setting enabled, as doing so would effectively negate the jumper and force the drive to run without 4k alignment. It is true that removing the jumper can brick the drive; several forum members have reported exactly that. If you have never powered on the drive with the jumper installed, then it is safe to remove it. If you have powered on the drive with the jumper installed, then just leave it installed for the rest of the drive's life.
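To make the jumper/setting interaction concrete, here is a minimal sketch. It assumes the jumper behaves as a +1 shift on every logical sector, and that the 'unaligned' and '4k aligned' settings start the partition at sectors 63 and 64 respectively. Both are assumptions consistent with the description above, and the function name is hypothetical.

```python
SECTORS_PER_4K = 8  # eight 512-byte logical sectors per 4 KiB physical sector


def is_4k_aligned(partition_start_lba: int, jumpered: bool) -> bool:
    """Return True if the partition's first sector lands on a 4 KiB boundary.

    Assumption: the jumper shifts every logical sector by one on the platter.
    """
    physical_sector = partition_start_lba + (1 if jumpered else 0)
    return physical_sector % SECTORS_PER_4K == 0


# Assumed start sectors: 63 for 'unaligned', 64 for '4k aligned'.
for start, jumpered in [(63, False), (63, True), (64, False), (64, True)]:
    label = "jumper" if jumpered else "no jumper"
    print(f"start={start}, {label}: aligned={is_4k_aligned(start, jumpered)}")
```

Under this model only two combinations end up aligned: jumper with the 'unaligned' setting, or no jumper with the '4k aligned' setting.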

July 25, 2011

Author

Thanks for the fast response.

No, I do not run regular parity checks... for some reason I was under the impression that if there was a parity error, the system would flag it automatically. From here on out, I will be doing monthly checks... thanks.

Yes, I did skip step A. This explains why I have 288 errors with each new drive... again, thinking that parity errors would autodetect I assumed these errors meant a failure with the new drives.

I will have a syslog to you later on.

Thanks again for your help.

July 25, 2011

Thanks for the fast response.

No, I do not run regular parity checks... for some reason I was under the impression that if there was a parity error, the system would flag it automatically. From here on out, I will be doing monthly checks... thanks.

When reading data, the parity disk is only read if the data drive reports a read error. When a read error happens, the RAID software reads the data from all the other drives and the parity disk and uses that to calculate what the original data should have been.

Because of this behavior, if you are just reading from your array (which is probably the most common activity), errors on the parity drive will not be detected.

When writing data to a disk in the array, the parity must also be updated to match. To do this, the RAID software calculates the new parity block from the old data, the old parity block, and the new data. These extra reads, plus the two writes (new data and new parity), are why writes to the array are much slower than reads.
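The read-recovery and write-update steps described above can be sketched in a few lines. This assumes simple single-disk XOR parity; the block contents and the helper name `xor_blocks` are made up for illustration, and this is not unRAID's actual code.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks, one two-byte block each.
data = [b"\x11\x22", b"\x0f\xf0", b"\xaa\x55"]
parity = xor_blocks(*data)

# Read error on disk 1: rebuild its block from the other disks plus parity.
rebuilt = xor_blocks(data[0], data[2], parity)
assert rebuilt == data[1]

# Write to disk 1: new parity = old parity XOR old data XOR new data.
# That costs two reads (old data, old parity) and two writes (new data, new parity).
new_block = b"\x12\x34"
new_parity = xor_blocks(parity, data[1], new_block)
data[1] = new_block
assert new_parity == xor_blocks(*data)
```

Note that the update never touches the other data disks, which is why only the written disk and the parity disk are involved in a write.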

Performing monthly parity checks helps in another way: the reads confirm that all the data on all the drives is still accessible, and if there are any "marginal blocks", this gives the firmware on the affected drives a chance to refresh the poor blocks or, if that fails, to remap the failing blocks to spare blocks on the drive.

Yes, I did skip step A. This explains why I have 288 errors with each new drive... again, thinking that parity errors would autodetect I assumed these errors meant a failure with the new drives.

Actually, I don't think this explains the 288 errors. Since you replaced your original parity drive with a new disk, there is no existing parity to compare the newly calculated parity against, so the software has no way to call these "parity errors". What I think is going on is that one (or more) of the data disks is having problems reading some of its blocks; when that happens the RAID software cannot calculate the parity block, so it reports an error. Hopefully the log file will clarify this.

Regards,

Stephen

July 25, 2011

Two things could have happened:

a) If sdsnyr94 was upgrading his parity drive from a 1.5 TB drive to a 2 TB drive, unRAID should have run a parity sync and calculated new parity (without errors, assuming the data disks are healthy).

b) If sdsnyr94 assigned the 2 TB drive to one of the data disk slots, then unRAID would have run a parity-swap in which the old parity data would have been copied to the 2 TB disk and the remainder cleared (filled with zeros). The old parity disk would then have been re-assigned as a data disk.

If a) happened, then you are right that the data disks are suspect. If b) happened, then any parity errors from the old disk would have been copied to the new disk, and it is still possible that the data disks are all perfectly healthy. Once we see the syslog we'll be able to know for sure how to proceed.

July 25, 2011

Author

OK, here are the syslogs.

Syslog01 is from last night, no jumper.

Syslog02 (located in next post) is current, with jumper.

I have the same 288 errors after syncing with jumper.

On both logs, the final error repeats 2000+ more times... I cut it so it would fit under the attachment limit.

syslog01.txt

July 25, 2011

Author

Syslog02

syslog02.txt

July 25, 2011

Jul 24 21:42:56 MediaServer kernel: ata4.00: failed command: READ DMA
Jul 24 21:42:56 MediaServer kernel: ata4.00: cmd c8/00:80:9f:00:00/00:00:00:00:00/e0 tag 0 dma 65536 in
Jul 24 21:42:56 MediaServer kernel:          res 51/40:4f:c7:00:00/40:01:00:00:00/e0 Emask 0x9 (media error)
Jul 24 21:42:56 MediaServer kernel: ata4.00: status: { DRDY ERR }
Jul 24 21:42:56 MediaServer kernel: ata4.00: error: { UNC }

These types of errors often indicate a cabling problem. Could be a defective cable, but more likely just a loose cable or bad connection. Power down, open up your server, and reseat all your SATA power and data cables. Power back up and start a parity check. If it completes with no errors, then run a second parity check.

If you see any parity check errors, then the next step is to make sure your SATA cables are working properly. Swap them with spares (if you have any). Connect all drives directly to the motherboard, bypass all backplanes (if you are using any). Same goes for power cables - power each drive directly, bypass all power splitters. Run another parity check.

If you still see errors then the SATA ports on your motherboard might be defective. Hopefully you don't get to this point.

July 26, 2011

Author

Frustrating part is that I saw that error last night, and ata4 appears to be the EARS drive. I swapped the cable and the power cable before I ran the parity sync this morning.... but the error remains.

I just reseated all the connectors again and am trying another parity check... I will report back.

July 26, 2011

In my experience, "UNC" media errors are almost NEVER a cabling problem; they indicate unreadable sectors on the disk. UNC = uncorrectable.
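For anyone reading these kernel lines, the first two fields of the `res` value are the ATA status and error registers, and the `status: { ... }` and `error: { ... }` lines are just those bits spelled out. A rough decoder is sketched below. The bit names follow the ATA register definitions, only the flags libata commonly prints are listed, and the function names are hypothetical.

```python
# ATA status and error register bits (subset commonly printed by libata).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_bits(value: int, table: dict) -> list:
    """Return the names of all flags from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def decode_res(res: str) -> tuple:
    """Decode the status and error registers from a libata 'res' field."""
    status_hex, rest = res.split("/", 1)
    error_hex = rest.split(":")[0]
    return (decode_bits(int(status_hex, 16), STATUS_BITS),
            decode_bits(int(error_hex, 16), ERROR_BITS))


# The READ DMA failure quoted earlier in the thread.
print(decode_res("51/40:4f:c7:00:00/40:01:00:00:00/e0"))
```

ICRC is the flag that reports an interface CRC error on the link, while UNC is reported by the drive itself when it cannot read a sector.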

July 26, 2011

Author

Well, once again it fails and I get 288 errors.

Is it still possible to go back to the 1.5 TB drive I had installed, and see if the parity sync completes?

July 26, 2011

If you haven't done any writes to the array since starting, then you can use the "trust my parity" procedure to put the disk back. Just search the Wiki; it's easy to find. You may still get some errors right at the start of the parity check, because unRAID changes some data in the housekeeping area of the disks, but that should be the only place with errors.

Peter

July 26, 2011

Author

Well, the 1.5 TB HDD is back in place and there are no parity errors. If I were dealing with only one new drive, I would say I have a dud... but the fact that I got exactly 288 errors with both drives makes me question that.

Any suggestions from here?

July 26, 2011

The 288 errors represent a single "write" operation to the parity drive. It is the "md" sync window set in your Devices-settings.

The same error occurs in both logs. Basically, a timeout when attempting to write to the disk.

Since it occurs with two different disks, I'd suspect a cabling issue (power or data is bad or is intermittent), or disk controller port, or drive tray with an intermittent connection.

Jul 24 23:23:53 MediaServer kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 24 23:23:53 MediaServer kernel: ata4.00: failed command: WRITE DMA EXT
Jul 24 23:23:53 MediaServer kernel: ata4.00: cmd 35/00:00:9f:0e:ad/00:04:26:00:00/e0 tag 0 dma 524288 out
Jul 24 23:23:53 MediaServer kernel:          res 40/00:ff:00:00:00/40:01:00:00:00/00 Emask 0x4 (timeout)
Jul 24 23:23:53 MediaServer kernel: ata4.00: status: { DRDY }
Jul 24 23:23:53 MediaServer kernel: ata4: hard resetting link
Jul 24 23:23:53 MediaServer kernel: ata4: nv: skipping hardreset on occupied port
Jul 24 23:23:59 MediaServer kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 24 23:24:03 MediaServer kernel: ata4: SRST failed (errno=-16)
Jul 24 23:24:03 MediaServer kernel: ata4: hard resetting link
Jul 24 23:24:03 MediaServer kernel: ata4: nv: skipping hardreset on occupied port
Jul 24 23:24:09 MediaServer kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 24 23:24:14 MediaServer kernel: ata4: SRST failed (errno=-16)
Jul 24 23:24:14 MediaServer kernel: ata4: hard resetting link
Jul 24 23:24:14 MediaServer kernel: ata4: nv: skipping hardreset on occupied port
Jul 24 23:24:19 MediaServer kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 24 23:24:49 MediaServer kernel: ata4: SRST failed (errno=-16)
Jul 24 23:24:49 MediaServer kernel: ata4: limiting SATA link speed to 1.5 Gbps
Jul 24 23:24:49 MediaServer kernel: ata4: hard resetting link
Jul 24 23:24:49 MediaServer kernel: ata4: nv: skipping hardreset on occupied port
Jul 24 23:24:54 MediaServer kernel: ata4: SRST failed (errno=-16)
Jul 24 23:24:54 MediaServer kernel: ata4: reset failed, giving up
Jul 24 23:24:54 MediaServer kernel: ata4.00: disabled
Jul 24 23:24:54 MediaServer kernel: ata4.00: device reported invalid CHS sector 0
Jul 24 23:24:54 MediaServer kernel: ata4: EH complete
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] Unhandled error code
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] CDB: cdb[0]=0x2a: 2a 00 26 ad 0e 9f 00 04 00 00
Jul 24 23:24:54 MediaServer kernel: end_request: I/O error, dev sdd, sector 648875679
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] Unhandled error code
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] CDB: cdb[0]=0x2a: 2a 00 26 ad 12 9f 00 01 00 00
Jul 24 23:24:54 MediaServer kernel: end_request: I/O error, dev sdd, sector 648876703
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] Unhandled error code
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] CDB: cdb[0]=0x2a: 2a 00 26 ad 13 9f 00 04 00 00
Jul 24 23:24:54 MediaServer kernel: end_request: I/O error, dev sdd, sector 648876959
Jul 24 23:24:54 MediaServer kernel: md: disk0 write error <------------------ there are 288 of these
Jul 24 23:24:54 MediaServer kernel: handle_stripe write error: 648875616/0, count: 1
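As a sanity check on the "288 errors = one write window" reading, the three failed CDBs in the log can be decoded by hand. They are SCSI WRITE(10) commands (opcode 0x2a), with a 4-byte big-endian LBA in bytes 2-5 and a 2-byte transfer length in bytes 7-8. The sketch below does the arithmetic; the idea that md logs one error per 4 KiB block is an assumption on my part, and the function name is hypothetical.

```python
SECTOR_BYTES = 512
BLOCK_BYTES = 4096  # assumed size of the unit md counts as one "write error"


def decode_write10(cdb_hex: str) -> tuple:
    """Return (starting LBA, sector count) from a WRITE(10) CDB hex dump."""
    cdb = bytes.fromhex(cdb_hex)
    if cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    sectors = int.from_bytes(cdb[7:9], "big")
    return lba, sectors


# The three CDBs copied from the syslog excerpt above.
cdbs = [
    "2a 00 26 ad 0e 9f 00 04 00 00",
    "2a 00 26 ad 12 9f 00 01 00 00",
    "2a 00 26 ad 13 9f 00 04 00 00",
]

decoded = [decode_write10(c) for c in cdbs]
total_sectors = sum(sectors for _, sectors in decoded)
blocks = total_sectors * SECTOR_BYTES // BLOCK_BYTES

for lba, sectors in decoded:
    print(f"LBA {lba}, {sectors} sectors")
print(f"{total_sectors} sectors = {blocks} blocks of 4 KiB")
```

The decoded LBAs match the `end_request` sector numbers in the log, the three writes are contiguous, and together they cover 2304 sectors, which is exactly 288 blocks of 4 KiB.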