July 25, 2011
I just received two WD EARS 2.0 TB drives from Newegg. I upgraded my unit to unRAID 4.7, set the setting for 4k, and attempted to replace my 1.5 TB parity drive with one of the new units. The parity sync failed with 288 errors. Figuring I may have received a bad drive, I swapped it with the other new drive and the parity sync once again failed... with the same number of errors. (Sorry, I am at work and do not have access to my syslogs.)

Since this is used primarily for storing media (movies, music, ebooks) I really am not concerned about the minor speed increase the AF provides... I am just looking to upgrade my capacity while replacing drives that have been running for a year or so.

This morning, after seeing the 2nd failure, I jumpered pins 7-8 and reset the format setting to default (I forget what the default says), and the parity check was running when I left for work. I will not know if it completed for a few more hours.

My questions are:
1- Is a pre-clear necessary on these drives to get them to work under the AF settings?
2- Why would both drives fail with an identical number of errors? (This may never be answered as I do not have the 1st syslog.)
3- By jumpering the unit, do I lose the ability to move to AF at a later point (not in unRAID)?... sorry, but some of the posts almost made it sound like it would tank the drive if you unjumper it... almost like it damages the drive.
July 25, 2011
Do you run regular parity checks on a monthly basis? The process for upgrading a parity drive is:

A) Run a parity check and ensure that it completes without errors
B) Stop the array, power down
C) Replace the drive
D) Boot the server, start the array (clicking the 'I'm sure I want to do this' checkbox)
E) Wait for the parity sync to complete
F) Run another parity check and ensure that it completes without errors

It sounds like you skipped step A. It is very likely that the parity errors existed in your array before you replaced your 1.5 TB drive. When you get home from work, please capture a syslog before powering down your server and post it here.

Have you powered on your WD EARS drive after installing the jumper? If so, then you should leave the jumper in place for the rest of the drive's life. When adding that particular drive to your array, go to the unRAID settings page and set the partition alignment setting to 'mbr unaligned'. Once that drive is successfully added, you can change the setting back to '4k mbr aligned'.

1) No. As long as unRAID's partition alignment is configured correctly, the drives will work with proper AF alignment even if they haven't been precleared. Your situation is slightly complicated by the fact that you now have one drive with a jumper and one drive without a jumper. The drive with a jumper should be added to the array with the 'mbr unaligned' setting in place. The drive without a jumper should be added to the array with the '4k mbr aligned' setting in place. Keep in mind that the unRAID partition alignment setting only affects disks added in the future, not disks currently in the array. So you need to adjust the setting before adding the new drives.

2) Most likely the errors existed before you added the new disks.

3) No. By using the jumper you are forcing the drive to align at sector 64 (which is correct for AF drives). As long as you follow my advice above, everything should be fine.
Just don't add the jumpered drive with the '4k mbr aligned' setting enabled, as doing so would effectively negate the jumper and force the drive to run without 4k alignment.

It is true that removing the jumper can brick the drive; several forum members have reported exactly that. If you have never powered on the drive with the jumper installed, then it is safe to remove it. If you have powered on the drive with the jumper installed, then just leave it installed for the rest of the drive's life.
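The jumper and alignment advice above comes down to a small piece of arithmetic, sketched here in Python. This is only an illustration, not unRAID's code. It assumes that 'mbr unaligned' starts the partition at sector 63 and '4k mbr aligned' at sector 64, and that the 7-8 jumper makes the drive shift every sector address up by one; the function names are made up for the example.

```python
# Sketch only: models the alignment arithmetic, not unRAID's actual code.
LOGICAL_SECTOR = 512      # bytes per sector as reported to the OS
PHYSICAL_SECTOR = 4096    # bytes per physical sector on an Advanced Format drive
SECTORS_PER_PHYSICAL = PHYSICAL_SECTOR // LOGICAL_SECTOR  # 8


def physical_start(partition_start: int, jumpered: bool) -> int:
    """Return the on-platter sector where the partition really begins.

    Assumption: the 7-8 jumper makes the drive add 1 to every logical
    sector address, so logical sector 63 lands on sector 64.
    """
    return partition_start + (1 if jumpered else 0)


def is_aligned(partition_start: int, jumpered: bool) -> bool:
    """True when the partition begins on a 4 KiB physical-sector boundary."""
    return physical_start(partition_start, jumpered) % SECTORS_PER_PHYSICAL == 0


if __name__ == "__main__":
    cases = [
        ("no jumper, '4k mbr aligned' (start 64)", 64, False),
        ("no jumper, 'mbr unaligned' (start 63)", 63, False),
        ("jumper,    'mbr unaligned' (start 63)", 63, True),
        ("jumper,    '4k mbr aligned' (start 64)", 64, True),
    ]
    for label, start, jumpered in cases:
        print(f"{label}: aligned={is_aligned(start, jumpered)}")
```

Under those assumptions only the first and third combinations come out aligned, which is why a jumpered drive should go in as 'mbr unaligned' and an unjumpered one as '4k mbr aligned'.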
July 25, 2011 Author
Thanks for the fast response. No, I do not run regular parity checks... for some reason I was under the impression that if there was a parity error, the system would flag it automatically. From here on out, I will be doing monthly checks... thanks.

Yes, I did skip step A. This explains why I have 288 errors with each new drive... again, thinking that parity errors would be detected automatically, I assumed these errors meant a failure with the new drives. I will have a syslog to you later on. Thanks again for your help.
July 25, 2011
> Thanks for the fast response. No, I do not run regular parity checks... for some reason I was under the impression that if there was a parity error, the system would flag it automatically. From here on out, I will be doing monthly checks... thanks.

When reading data, the parity disk is only read if a data drive reports a read error. When a read error happens, the RAID software reads the data from all the other drives and the parity disk and uses that to calculate what the original data should have been. Because of this behavior, if you are just reading from your array (which is probably the most common activity), errors on the parity drive will not be detected.

When writing data to a disk in the array, the parity must also be updated to match. To do this, the RAID software can calculate the new parity block from the old data, the old parity block and the new data. These extra reads and the two writes (new data and new parity) are why writes to the array are much slower than reads.

Performing monthly parity checks helps in another way: these reads make sure all the data on all the drives is still accessible, and if there are any "marginal blocks" this gives the firmware on the affected drives a chance to refresh the poor blocks or, if that fails, to remap the failing blocks to spare blocks on the drive.

> Yes, I did skip step A. This explains why I have 288 errors with each new drive... again, thinking that parity errors would autodetect I assumed these errors meant a failure with the new drives.

Actually I don't think this explains the 288 errors. Since you have replaced your original parity drive with new disks, there is no original parity to compare the new parity against, so there is no way for the software to say these are "parity errors".
What I think is going on is that one (or more) of the data disks is having problems reading some of its blocks, and when this happens the RAID software cannot calculate the parity block, so it is calling this an error. Hopefully the log file will clarify this.

Regards, Stephen
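The read, rebuild and write behaviour described above can be sketched in a few lines of Python. This is a toy model, assuming single XOR parity across the data disks; the block contents and helper name are invented for illustration and are not taken from unRAID itself.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks, one 4-byte block each.
data = [b"\x11\x22\x33\x44", b"\xa0\xb0\xc0\xd0", b"\x0f\x0e\x0d\x0c"]
parity = xor_blocks(data)

# Rebuild: disk 1 reports a read error, so combine the other disks with parity.
rebuilt = xor_blocks([data[0], data[2], parity])

# Write: update disk 1 using only the old data, the old parity and the new data.
new_block = b"\xde\xad\xbe\xef"
new_parity = xor_blocks([parity, data[1], new_block])

# Recomputing parity from every disk gives the same answer as the shortcut.
full_parity = xor_blocks([data[0], new_block, data[2]])

print(rebuilt == data[1], new_parity == full_parity)
```

Note that a normal read of `data[0]` never touches `parity` at all, which is why a stale or damaged parity block goes unnoticed until a parity check (or a rebuild) actually reads it.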
July 25, 2011
>> Yes, I did skip step A. This explains why I have 288 errors with each new drive... again, thinking that parity errors would autodetect I assumed these errors meant a failure with the new drives.

> Actually I don't think this explains the 288 errors, since you have replaced your original parity drive with new disks there is nothing to compare the new parity to the original one, so there is no way for the software to say these are "parity errors". What I think is going on is that one (or more) of the data disks is having problems reading some of its blocks and when this happens the raid software cannot calculate parity block, so is calling this an error. Hopefully the log file will clarify this.

Two things could have happened:

a) If sdsnyr94 was upgrading his parity drive from a 1.5 TB drive to a 2 TB drive, unRAID should have run a parity sync and calculated new parity (without errors, assuming the data disks are healthy).

b) If sdsnyr94 assigned the 2 TB drive to one of the data disk slots, then unRAID would have run a parity-swap in which the old parity data would have been copied to the 2 TB disk and the remainder cleared (filled with zeros). The old parity disk would then have been re-assigned as a data disk.

If a) happened, then you are right that the data disks are suspect. If b) happened, then any parity errors from the old disk would have been copied to the new disk, and it is still possible that the data disks are all perfectly healthy. Once we see the syslog we'll know for sure how to proceed.
July 25, 2011 Author
OK, here are the syslogs. Syslog01 is from last night, no jumper. Syslog02 (located in next post) is current, with jumper. I have the same 288 errors after syncing with jumper. On both logs, the final error repeats 2000+ more times... I cut it so it would fit under the attachment limit.

syslog01.txt
July 25, 2011
Jul 24 21:42:56 MediaServer kernel: ata4.00: failed command: READ DMA
Jul 24 21:42:56 MediaServer kernel: ata4.00: cmd c8/00:80:9f:00:00/00:00:00:00:00/e0 tag 0 dma 65536 in
Jul 24 21:42:56 MediaServer kernel: res 51/40:4f:c7:00:00/40:01:00:00:00/e0 Emask 0x9 (media error)
Jul 24 21:42:56 MediaServer kernel: ata4.00: status: { DRDY ERR }
Jul 24 21:42:56 MediaServer kernel: ata4.00: error: { UNC }

These types of errors often indicate a cabling problem. Could be a defective cable, but more likely just a loose cable or bad connection. Power down, open up your server, and reseat all your SATA power and data cables. Power back up and start a parity check. If it completes with no errors, then run a second parity check.

If you see any parity check errors, then the next step is to make sure your SATA cables are working properly. Swap them with spares (if you have any). Connect all drives directly to the motherboard, bypass all backplanes (if you are using any). Same goes for power cables - power each drive directly, bypass all power splitters. Run another parity check.

If you still see errors then the SATA ports on your motherboard might be defective. Hopefully you don't get to this point.
July 26, 2011 Author
Frustrating part is that I saw that error last night, and ata4 appears to be the EARS drive. I swapped the cable and the power cable before I ran the parity sync this morning.... but the error remains. I just reseated all the connectors again and am trying another parity check... I will report back.
July 26, 2011
> Jul 24 21:42:56 MediaServer kernel: ata4.00: failed command: READ DMA
> Jul 24 21:42:56 MediaServer kernel: ata4.00: cmd c8/00:80:9f:00:00/00:00:00:00:00/e0 tag 0 dma 65536 in
> Jul 24 21:42:56 MediaServer kernel: res 51/40:4f:c7:00:00/40:01:00:00:00/e0 Emask 0x9 (media error)
> Jul 24 21:42:56 MediaServer kernel: ata4.00: status: { DRDY ERR }
> Jul 24 21:42:56 MediaServer kernel: ata4.00: error: { UNC }
>
> These types of errors often indicate a cabling problem. Could be a defective cable, but more likely just a loose cable or bad connection. Power down, open up your server, and reseat all your SATA power and data cables. Power back up and start a parity check. If it completes with no errors, then run a second parity check.
>
> If you see any parity check errors, then the next step is to make sure your SATA cables are working properly. Swap them with spares (if you have any). Connect all drives directly to the motherboard, bypass all backplanes (if you are using any). Same goes for power cables - power each drive directly, bypass all power splitters. Run another parity check.
>
> If you still see errors then the SATA ports on your motherboard might be defective. Hopefully you don't get to this point.

In my experience, "UNC Media Errors" are almost NEVER a cabling problem, but unreadable sectors on disks.

UNC = UN-Correctable
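For anyone reading these kernel lines by hand, here is a rough Python sketch that decodes the first two bytes of the `res` field into the flag names the kernel prints. The bit tables follow the standard ATA status and error registers; the function name is made up, and the remark about ICRC being the flag usually linked to link or cable trouble is general background rather than something stated in this thread.

```python
# Bit names follow the ATA status and error registers; this is a reading aid,
# not a replacement for the kernel's own decoding.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_res(res_field: str):
    """Split a libata 'res' field into named status and error flags."""
    head = res_field.split(":")[0]          # e.g. "51/40"
    status_hex, error_hex = head.split("/")
    status, error = int(status_hex, 16), int(error_hex, 16)
    status_flags = [name for bit, name in STATUS_BITS.items() if status & bit]
    error_flags = [name for bit, name in ERROR_BITS.items() if error & bit]
    return status_flags, error_flags


# The failed READ DMA from the syslog excerpt above.
print(decode_res("51/40:4f:c7:00:00/40:01:00:00:00/e0"))
# The timed-out WRITE DMA EXT seen elsewhere in the same syslog.
print(decode_res("40/00:ff:00:00:00/40:01:00:00:00/00"))
```

The first call yields DRDY and ERR with UNC set, matching the `status: { DRDY ERR }` and `error: { UNC }` lines in the log, so the drive itself is reporting an unreadable sector. An ICRC flag would point more toward the link than the media.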
July 26, 2011 Author
Well, once again it fails and I get 288 errors. Is it still possible to go back to the 1.5 TB drive I had installed, and see if the parity sync completes?

July 26, 2011
If you haven't done any writes to the array since starting, then you can use the "trust my parity" procedure to put the disk back. Just search the Wiki; it's easy to find. You may still get some errors right at the start of the parity check due to unRAID changing some data in the housekeeping area of the disks, but that should be the only place with errors.

Peter

July 26, 2011 Author
Well, the 1.5 TB HDD is back in place and there are no parity errors. If I was dealing with only 1 new drive, I would say I have a dud.... but the fact that I got exactly 288 errors with both drives makes me question that. Any suggestions from here?
July 26, 2011
The 288 errors represent a single "write" operation to the parity drive. It is the "md" sync window set in your Devices settings.

The same error occurs in both logs. Basically, a timeout when attempting to write to the disk. Since it occurs with two different disks, I'd suspect a cabling issue (power or data is bad or is intermittent), or disk controller port, or drive tray with an intermittent connection.

Jul 24 23:23:53 MediaServer kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 24 23:23:53 MediaServer kernel: ata4.00: failed command: WRITE DMA EXT
Jul 24 23:23:53 MediaServer kernel: ata4.00: cmd 35/00:00:9f:0e:ad/00:04:26:00:00/e0 tag 0 dma 524288 out
Jul 24 23:23:53 MediaServer kernel: res 40/00:ff:00:00:00/40:01:00:00:00/00 Emask 0x4 (timeout)
Jul 24 23:23:53 MediaServer kernel: ata4.00: status: { DRDY }
Jul 24 23:23:53 MediaServer kernel: ata4: hard resetting link
Jul 24 23:23:53 MediaServer kernel: ata4: nv: skipping hardreset on occupied port
Jul 24 23:23:59 MediaServer kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 24 23:24:03 MediaServer kernel: ata4: SRST failed (errno=-16)
Jul 24 23:24:03 MediaServer kernel: ata4: hard resetting link
Jul 24 23:24:03 MediaServer kernel: ata4: nv: skipping hardreset on occupied port
Jul 24 23:24:09 MediaServer kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 24 23:24:14 MediaServer kernel: ata4: SRST failed (errno=-16)
Jul 24 23:24:14 MediaServer kernel: ata4: hard resetting link
Jul 24 23:24:14 MediaServer kernel: ata4: nv: skipping hardreset on occupied port
Jul 24 23:24:19 MediaServer kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 24 23:24:49 MediaServer kernel: ata4: SRST failed (errno=-16)
Jul 24 23:24:49 MediaServer kernel: ata4: limiting SATA link speed to 1.5 Gbps
Jul 24 23:24:49 MediaServer kernel: ata4: hard resetting link
Jul 24 23:24:49 MediaServer kernel: ata4: nv: skipping hardreset on occupied port
Jul 24 23:24:54 MediaServer kernel: ata4: SRST failed (errno=-16)
Jul 24 23:24:54 MediaServer kernel: ata4: reset failed, giving up
Jul 24 23:24:54 MediaServer kernel: ata4.00: disabled
Jul 24 23:24:54 MediaServer kernel: ata4.00: device reported invalid CHS sector 0
Jul 24 23:24:54 MediaServer kernel: ata4: EH complete
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] Unhandled error code
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] CDB: cdb[0]=0x2a: 2a 00 26 ad 0e 9f 00 04 00 00
Jul 24 23:24:54 MediaServer kernel: end_request: I/O error, dev sdd, sector 648875679
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] Unhandled error code
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] CDB: cdb[0]=0x2a: 2a 00 26 ad 12 9f 00 01 00 00
Jul 24 23:24:54 MediaServer kernel: end_request: I/O error, dev sdd, sector 648876703
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] Unhandled error code
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
Jul 24 23:24:54 MediaServer kernel: sd 4:0:0:0: [sdd] CDB: cdb[0]=0x2a: 2a 00 26 ad 13 9f 00 04 00 00
Jul 24 23:24:54 MediaServer kernel: end_request: I/O error, dev sdd, sector 648876959
Jul 24 23:24:54 MediaServer kernel: md: disk0 write error <------------------ there are 288 of these
Jul 24 23:24:54 MediaServer kernel: handle_stripe write error: 648875616/0, count: 1
Jul 24 23:24:54 MediaServer kernel: md: disk0 write error
Jul 24 23:24:54 MediaServer kernel: handle_stripe write error: 648875624/0, count: 1
Jul 24 23:24:54 MediaServer kernel: md: disk0 write error
Jul 24 23:24:54 MediaServer kernel: handle_stripe write error: 648875632/0, count: 1
Jul 24 23:24:54 MediaServer kernel: md: disk0 write error
Jul 24 23:24:54 MediaServer kernel: handle_stripe write error: 648875640/0, count: 1
Jul 24 23:24:54 MediaServer kernel: md: disk0 write error
Jul 24 23:24:54 MediaServer kernel: handle_stripe write error: 648875648/0, count: 1
Jul 24 23:24:54 MediaServer kernel: md: disk0 write error
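As a cross-check on the 288 figure, the three failed requests in the excerpt above can be decoded with a short Python sketch. It reads each CDB as a SCSI WRITE(10) command (opcode 0x2a, a 4-byte start address and a 2-byte length) and assumes each md stripe covers 8 sectors of 512 bytes, which is what the step of 8 between the `handle_stripe` sector numbers suggests. The function and constant names are invented for the example.

```python
def decode_write10(cdb_hex: str):
    """Decode a SCSI WRITE(10) CDB into (start LBA, sector count)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) command block")
    lba = int.from_bytes(cdb[2:6], "big")       # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")     # bytes 7-8: transfer length
    return lba, count


# The three CDBs logged next to the I/O errors in the syslog excerpt.
cdbs = [
    "2a 00 26 ad 0e 9f 00 04 00 00",
    "2a 00 26 ad 12 9f 00 01 00 00",
    "2a 00 26 ad 13 9f 00 04 00 00",
]
writes = [decode_write10(c) for c in cdbs]
total_sectors = sum(count for _, count in writes)

SECTORS_PER_STRIPE = 8  # assumption: one md stripe = 4 KiB = 8 x 512-byte sectors
stripes = total_sectors // SECTORS_PER_STRIPE

for lba, count in writes:
    print(f"write at sector {lba}, {count} sectors")
print(f"{total_sectors} sectors in total = {stripes} stripes")
```

The decoded start sectors match the `end_request` lines, and the three requests are contiguous and add up to 2304 sectors, or 288 stripes under the 8-sector assumption. That is consistent with the 288 errors being one batch of queued parity writes that was lost when the drive stopped responding, rather than 288 separate faults.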
July 26, 2011 Author
Not sure it could be the cabling, as the original Samsung 1.5 TB drive is back and functioning fine using the same cabling. I am using the MB SATA connections, and the MB is an older Gigabyte GA-K8N Pro SLI... could there be an incompatibility between the two? I wish I still had the log from the 1st EARS drive to compare... the only thing I know is that I got the 288 errors, and they both failed about 1-1/2 hrs into the parity sync.

Now which way should I go? Run the pre-clear on each drive using this MB? Run it from a different PC? Use the WD utilities?

Joe, you mentioned a timeout when attempting to write to the disk. Is it possible that the system is trying to spin down the disk too early (I think I have it set to an hour)? Or could it be something that the "green" part of the WD drive is doing?

Thanks guys for all the help... I know you cannot answer most of this in a definitive way, since you don't have my hardware in front of you. Any other suggestions you have, I am more than willing to try... as long as we keep my 2+ TB of media intact.
July 26, 2011
> Not sure it could be the cabling, as the original Samsung 1.5Tb drive is back and functioning fine using the same cabling. I am using the MB SATA connections, and the MB is an older Gigabyte GA-K8N Pro SLI... could there be an incompatibility between the two? I wish I still had the log from the 1st EARS drive to compare... the only thing I know is that I got the 288 errors, and they both failed about 1-1/2 hrs into the parity sync. Now which way should I go? Run the pre-clear on each drive using this MB? Run it from a different PC? Use the WD utilities? Joe- You mentioned about a timeout attempting to write to the disk. Is it possible that the system is trying to spin down the disk too early (i think I have it set to an hour). Or could it be something that the "green" part of the WD drive is doing? Thanks guys for all the help... I know you cannot answer most of this in a definitive way, since you don't have my hardware in front of you. Any other suggestions you have, I am more than willing to try... as long as we keep my 2+Tb of media intact.

The number of errors is 288, the size of the "md sync" window (in other words, it was a single "write" to the disk that failed). It timed out. The disk did not respond.

Jul 24 23:23:53 MediaServer kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 24 23:23:53 MediaServer kernel: ata4.00: failed command: WRITE DMA EXT
Jul 24 23:23:53 MediaServer kernel: ata4.00: cmd 35/00:00:9f:0e:ad/00:04:26:00:00/e0 tag 0 dma 524288 out
Jul 24 23:23:53 MediaServer kernel: res 40/00:ff:00:00:00/40:01:00:00:00/00 Emask 0x4 (timeout)
Jul 24 23:23:53 MediaServer kernel: ata4.00: status: { DRDY }

The disk controller tried 4 times to reset the link and re-establish communications, but it could not.
Jul 24 23:23:53 MediaServer kernel: ata4: hard resetting link
Jul 24 23:23:53 MediaServer kernel: ata4: nv: skipping hardreset on occupied port
Jul 24 23:23:59 MediaServer kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 24 23:24:03 MediaServer kernel: ata4: SRST failed (errno=-16)
Jul 24 23:24:03 MediaServer kernel: ata4: hard resetting link
Jul 24 23:24:03 MediaServer kernel: ata4: nv: skipping hardreset on occupied port
Jul 24 23:24:09 MediaServer kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 24 23:24:14 MediaServer kernel: ata4: SRST failed (errno=-16)
Jul 24 23:24:14 MediaServer kernel: ata4: hard resetting link
Jul 24 23:24:14 MediaServer kernel: ata4: nv: skipping hardreset on occupied port
Jul 24 23:24:19 MediaServer kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 24 23:24:49 MediaServer kernel: ata4: SRST failed (errno=-16)
Jul 24 23:24:49 MediaServer kernel: ata4: limiting SATA link speed to 1.5 Gbps
Jul 24 23:24:49 MediaServer kernel: ata4: hard resetting link
Jul 24 23:24:49 MediaServer kernel: ata4: nv: skipping hardreset on occupied port
Jul 24 23:24:54 MediaServer kernel: ata4: SRST failed (errno=-16)
Jul 24 23:24:54 MediaServer kernel: ata4: reset failed, giving up

It has nothing to do with the spin-down timer. It could be something related to temperature... a connection or cable opening up when warm.

Joe L.
July 26, 2011 Author
Thanks Joe, the 288 number now makes more sense. I want to test the 2 drives in another unit. Will the pre-clear script tell me if there are failures on the drive? If so, will it run on Ubuntu, and where is it located (do I need to download it or is it on my flash)?

July 26, 2011
I've read a few stories about the EARS having issues such as this when run at high temps. About 1.5 hrs in means the drives would all be hot by that time. So, how's the cooling in your case?

Peter

July 26, 2011 Author
There are 5 fans in this case, not including the processor fan. The case is usually pretty cool, and I did have the case closed while doing the parity syncs.
July 26, 2011
> Not sure it could be the cabling, as the original Samsung 1.5Tb drive is back and functioning fine using the same cabling. I am using the MB SATA connections, and the MB is an older Gigabyte GA-K8N Pro SLI... could there be an incompatibility between the two? I wish I still had the log from the 1st EARS drive to compare... the only thing I know is that I got the 288 errors, and they both failed about 1-1/2 hrs into the parity sync. Now which way should I go? Run the pre-clear on each drive using this MB? Run it from a different PC? Use the WD utilities?

Since you have restored your original 1.5 TB parity drive and confirmed that it and the data still match, I would say take the two new drives, put them in another PC and run the WD utilities on them. There are a number of postings on this forum that indicate the initial failure rate of these drives might be as high as 20%; see posts like:

http://lime-technology.com/forum/index.php?topic=13425.msg127918
http://lime-technology.com/forum/index.php?topic=12789.msg121545
http://lime-technology.com/forum/index.php?topic=12501.msg119524

...and so on.

All of the WD 2 TB drives I have had problems with have been identified as bad by the WD diagnostic tool's quick test (though in one case the tool never completed this test).

What I do today when I get a new drive is:

1. Install it in my desktop PC and run the WD diagnostic: the extended test, then fill the drive with zeroes, then a second extended test (this takes about 18 hours total).
2. If the drive passes this, install it in my unRAID box and run two passes of the preclear script on it (which takes about 48 hours).
3. Only if the preclear passes do I add the drive to the unRAID array.

I've had about 4 of these WD 2.0 TB drives fail to pass these tests, including one time when the original drive failed and its replacement drive also failed (though the replacement for the replacement worked).

Regards, Stephen
July 26, 2011 Author
Wow... I had no idea that these drives would be such a problem. I will run them through the WD Utilities and see what my results are.

July 26, 2011
I take it you did the "trust my parity" procedure when you put the 1.5 TB drive back in place? Otherwise, you would have just rebuilt the parity and possibly not had any errors show (there will never be any parity errors during a parity build).

I agree with your present direction though; give those drives a good testing before trying to use them again.

Peter

July 26, 2011 Author
No, I just let it rebuild the parity..... I wanted to see if I got similar results with a known working drive.

July 27, 2011 Author
OK, both drives failed the writing of zeros test...... and now both drives are going to be on their way back to Newegg. Thanks for the tips VCA, when I get the replacements I will take those exact steps. You made it sound like you have a few of these drives.... of the ones that did not fail initially, have they been reliable, or is it still hit or miss whether they will flake out? Thanks again to everyone for their help...

July 27, 2011
Returning to the title of this thread: 'WD EARS - Is pre-clearing necessary?' The answer is 'no', but this thread provides good evidence why it is a good idea! It is rare to receive two defective drives at the same time, but it can happen. I hope your replacement drives are healthy.