PeterB Posted August 21, 2015

When power returned after a powercut earlier today, I received a notification: "Array has 1 disk with read errors". On checking the SMART attributes, I find that my newest drive (5232 power-on hours), a WD Red 3TB, was showing a raw value of 254 for the Raw Read Error Rate. In the succeeding six hours, that value has risen to 1481:

1 Raw Read Error Rate 0x002f 200 200 051 Pre-fail Always - 1481

I googled this attribute string, and many sites advise ignoring this value. However, this is the first time I have seen a non-zero value for this attribute, on any drive.

What is the advice/experience of unRAID users - should I be concerned? Would WD be likely to swap this drive under warranty?

Edit to add: The Main tab on the unRAID webGUI is reporting:

Disk 2 WDC_WD30EFRX-68EUZN0_WD-WMC4N2597206 - 3 TB (sdc) 41 C 51,111 19,533 16,800 xfs 3 TB 2.32 TB 682 GB

That's 16,800 errors. I have been aware of lengthy 'Buffering...' delays when playing a movie, which may be related.
SSD Posted August 21, 2015

I would be concerned about the reported errors. You should post your syslog or, even better, generate and post your diagnostics file.

The raw read error rate does not appear to be a problem. SMART is reporting a normalized value of 200 and a failure threshold of 51. If the normalized value dropped close to 51, it would be a sign of trouble. That said, a sudden change in this value in conjunction with the errors you are reporting could be related.

I would suggest running the short and long SMART tests. They are non-destructive (they won't alter any data) and, if they fail, would be proof that the drive is failing. Remember to disable spin-down when running the long test, as spinning down can result in a false failure.
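The comparison described above (normalized VALUE against the failure threshold, ignoring the raw number) can be sketched in a few lines. This is only an illustration: it assumes the attribute row is in the usual `smartctl -A` column layout with underscore-joined names, and the helper names `parse_attribute` and `margin` are made up for this example.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized value (higher is better for this attribute)
    worst: int   # lowest normalized value recorded so far
    thresh: int  # failure threshold set by the manufacturer
    raw: str     # vendor-specific raw field, kept as text on purpose


def parse_attribute(line: str) -> SmartAttribute:
    """Parse one attribute row, assuming the column order
    ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW."""
    fields = line.split()
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        raw=" ".join(fields[9:]),
    )


def margin(attr: SmartAttribute) -> int:
    """Distance between the normalized value and the failure threshold."""
    return attr.value - attr.thresh


# The row from the first post, written with smartctl-style underscores.
row = "1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 1481"
attr = parse_attribute(row)
print(f"{attr.name}: value={attr.value} thresh={attr.thresh} margin={margin(attr)}")
```

For the row quoted in this thread the margin is 200 - 51 = 149, which is why the attribute itself does not look alarming even though the raw field is non-zero.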
BobPhoenix Posted August 21, 2015

PeterB wrote: "... Would WD be likely to swap this drive under warranty? ... I have been aware of lengthy 'Buffering...' delays when playing a movie, which may be related."

WD had an option to RMA a drive for performance reasons. So even if they wouldn't take it because of a Raw Read Error Rate above zero, the fact that you are having performance problems should allow you to RMA it. That assumes the performance option on an RMA is still present, as it was the last time I had to return a drive to them (a 2TB EARS, to give you perspective on how long ago that was).
SSD Posted August 21, 2015

I am sure they would swap it out. I just hate refurbs; I have not had good luck with them. I'd rather know whether the problem is really the drive before returning a disk with near-perfect attributes. If you could return it for an exchange, that would be a very different prospect.
BobPhoenix Posted August 21, 2015

I tend to agree. I've had a 50/50 success rate with refurbs. Because of that, if I get a refurb I usually use it only as an external backup drive and not in my array.
SSD Posted August 21, 2015

This is the reason I think the value of a warranty is somewhat overrated. If warranty comes into the picture, you are often exchanging your problem for someone else's problem, because all the manufacturer does is run some diagnostics that show the drive is good and put it in the "good" pile. Of course, they zap all the SMART attributes first. Now, some drives require repair, but a drive that has been disassembled by human hands to replace parts is not necessarily factory fresh either.

All in all, I'd say refurbs are a crap shoot. Many users don't know what they are doing and send in perfectly fine drives for replacement. If you get one of those as a replacement, you hit the lottery. But many times the problems people have are nuanced, the drive manufacturer won't pick them up, and they'll just send the drive to someone else as a refurb.
PeterB Posted August 22, 2015

I attach my diagnostics file from yesterday evening.

The short SMART test ran for several minutes (for half that time it was showing 90% complete) but returned with no error.

The long SMART test has been running for more than 12 hours now, and has been showing 90% complete for much of that time.

Edit to add: another 12 hours later, it still says 90% complete. I'm sure that this can't be right.

tower-diagnostics-20150821-1957.zip
BobPhoenix Posted August 22, 2015

SSD wrote: "... All in all, I'd say refurbs are a crap shoot. ..."

True, but 50 percent of my returns have come back as NEW drives, and 50 percent of those were larger drives than I sent in. That was during the flood shortage, so I don't expect it any more, but if they don't have a refurb available they will send you a new drive. When I get a new drive back, I usually use it.
PeterB Posted August 24, 2015

Well, the long SMART test completed after about 36 hours, and reported "Completed without error". The SMART attributes are still reporting a raw read error rate of 1987, but unRAID reports 17,952 errors.

I have to say that I have little confidence in that drive. I fail to understand how a drive which reports almost 18k errors to the O/S can be considered 'good', but I'm not sure what I can do about it.

I'm considering replacing it with a new drive and then hammering the old one with preclears.
RobJ Posted August 24, 2015

Those results aren't necessarily related. Except for CRC errors, SMART only reports physical issues with the drive. The drive error counts on the unRAID Main page could also reflect interface issues, problems with the cabling or controller, etc. If something happens to the controller and the drive is dropped from the system, you get a huge unRAID error count for the drive, yet it's almost certainly in perfect condition.
PeterB Posted August 25, 2015

RobJ wrote: "Those results aren't necessarily related."

Indeed, but both counts (the SMART raw read error rate and the unRAID error count) increase at the same time. As far as I'm aware, the SMART raw read error rate has nothing to do with interface performance.

RobJ wrote: "Except for CRC errors, SMART only reports physical drive issues with the drive. The unRAID Main page drive error counts could also be interface issues, problems with the cabling or controller, etc. If something happens to the controller, and the drive is dropped from the system, then you get a huge unRAID error count for the drive, yet it's almost certainly in perfect condition."

Well, interestingly, despite clocking up 1664 unRAID errors in the space of a few minutes, the drive doesn't get dropped from the system.
A small chunk of the system log shows this:

Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502560
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502568
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502576
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502584
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502592
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502600
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502608
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502616
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502624
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502632
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502640
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502648
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502656
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502664
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502672
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502680
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502688
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502696
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502704
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502712
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502720

I haven't counted, but I presume that there are 1664 of these lines. I also presume that the sector numbers are going up in multiples of 8 because of the 4k block size (8 sectors/block).

The machine hasn't been opened for months, so no cabling has been disturbed.

I suspect that what is happening is that the drive encounters the raw read errors and, while it recovers from each of those, there is a lengthy delay resulting in a controller timeout.
These timeouts will be what unRAID is reporting.

Having just rebooted for the rc6 update, I became aware that the errors are occurring during the cachedirs reads. Further errors, and lengthy delays, occur when certain files are read. I have seen OpenELEC displaying the 'buffering ...' status when playing video files from this drive.

I have had experience of an older (1TB) WD drive which exhibited extremely slow progress between ~70-80% through each pass of preclearing. I suspect that this drive is behaving in a similar fashion (although there were no SMART raw read errors clocked on the older drive).

I really think that I need to order a new drive (it could take a couple of weeks to obtain a 3TB drive here in the Philippines), then run preclears on this 'faulty' drive in order to get a better measure of the performance issue, with a view to RMAing it.
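The two presumptions in this post (one log line per counted error, and a stride of 8 sectors per 4k block) can be checked directly from the syslog. Below is a rough sketch that assumes the md driver reports 512-byte sector numbers and that the log lines have the format quoted above; `md_error_sectors` and `strides` are illustrative helper names, not part of unRAID.

```python
import re

SECTOR_BYTES = 512   # assumption: md reports 512-byte sector numbers
BLOCK_BYTES = 4096   # the 4k block size mentioned in the post

MD_ERROR = re.compile(r"md: disk(\d+) read error, sector=(\d+)")


def md_error_sectors(log_lines, disk):
    """Collect the sector numbers of md read errors for one array disk."""
    sectors = []
    for line in log_lines:
        match = MD_ERROR.search(line)
        if match and int(match.group(1)) == disk:
            sectors.append(int(match.group(2)))
    return sectors


def strides(sectors):
    """Set of differences between consecutive reported sectors."""
    return {b - a for a, b in zip(sectors, sectors[1:])}


# First three lines of the excerpt quoted in the post.
sample = [
    "Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502560",
    "Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502568",
    "Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502576",
]

sectors = md_error_sectors(sample, disk=2)
print("error lines:", len(sectors))
print("strides seen:", strides(sectors))
print("sectors per block:", BLOCK_BYTES // SECTOR_BYTES)
```

Fed the whole syslog instead of the three sample lines, `len(sectors)` could be compared against the 1664 shown on the Main page, and a single stride of 8 would be consistent with one error line per 4k block.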
RobJ Posted August 25, 2015

I should be asleep, but I saw the email of your post and felt I had to come back, partly because you're getting frustrated (I'm sorry!), but also because you are misunderstanding the SMART info. I'd also like to request another Diagnostics zip, hopefully from the same session as those read errors.

Raw_Read_Error_Rate is an error rate, not a counter, just like Seek_Error_Rate, and the ONLY thing of interest, the ONLY thing we can interpret, is the VALUE and WORST for them. Both of your error rate VALUEs are 200, which is perfect. You are still referring to 'raw read errors', but that is NOT what this attribute is about. The fact that it is non-zero and LOOKS like a counter that is increasing is throwing you off. It is not a counter, it is an error rate, and each manufacturer uses the RAW field in differing ways, if they use it at all, with coded values that only they know the meaning of.

According to the earlier SMART report, the drive looks great, with no issues at all. We'll check the next SMART report to verify that that hasn't changed.

The small section of the syslog you showed above shows a drive that, from the unRAID module's viewpoint, cannot read anything. I suspect the drive has already been dropped from the system by the kernel, but the unRAID module does not know that. I've seen this happen many times. I do not know that for sure, but if you have the syslog that includes all of that, we should be able to see what went wrong. What is ALWAYS necessary, though, are the VERY FIRST errors, not the later read errors that may occur. Once a drive has trouble, and especially if it stops responding for any reason, you can completely discount all of the errors that follow.

Something IS wrong, but it really doesn't look like the drive, and I don't want to speculate without the syslog. I need to sleep now, but I'll check it in the morning, if someone else hasn't helped first.
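The point about the VERY FIRST errors can be turned into a small syslog filter: find the earliest line that looks like trouble and treat the md read errors after it as follow-on noise. This is a sketch only; the keyword list in `ERROR_HINT` is a guess chosen for illustration, not something unRAID or the kernel defines, and a real syslog may need a different filter.

```python
import re

# Hypothetical keyword filter; widen or narrow it for a real syslog.
ERROR_HINT = re.compile(r"error|fail|timeout|reset", re.IGNORECASE)
MD_READ_ERROR = re.compile(r"md: disk\d+ read error")


def first_error(log_lines):
    """Return (index, line) for the earliest line matching ERROR_HINT, else None."""
    for index, line in enumerate(log_lines):
        if ERROR_HINT.search(line):
            return index, line
    return None


def md_errors_after(log_lines, index):
    """Count md read-error lines that follow the line at `index`."""
    return sum(1 for line in log_lines[index + 1:] if MD_READ_ERROR.search(line))


# Lines taken from the excerpt PeterB posted earlier in the thread.
sample = [
    "Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502560",
    "Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502568",
    "Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502576",
]

hit = first_error(sample)
if hit is not None:
    index, line = hit
    print(f"first error at line {index}: {line}")
    print(f"follow-on md read errors: {md_errors_after(sample, index)}")
```

On the short excerpt the first hit is simply the first md line; run against a complete syslog, the line it returns is the one worth posting, along with the few lines just before it.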
PeterB Posted August 29, 2015

RobJ, many thanks for your continued advice. I hear what you say about that SMART attribute but, from my observation, the raw value is a counter with ever-increasing values. It currently sits at a constant value of 2114.

However, as suddenly as the 'problem' appeared, it has gone away again. The attribute was increasing, and the unRAID errors were being reported, after two 'power-on' boots. Subsequent boots (2 power cuts already today, before 8am, and several over the previous days) have not shown any further error reports although, as I said, the attribute raw value is sitting at 2114.

I will keep an eye on it but, for the time being, all seems well.
PeterB Posted August 29, 2015

Had a short spell of this happening again this evening (after the third powercut of the day). However, it did not happen during directory caching, as on the two previous occasions, but while playing a movie. I noticed OpenELEC displaying "Buffering ..." again, but it then carried on playing the movie normally, and it still is, more than an hour later.

The raw read error rate has jumped from the previous 2114 to 2205. During that time, unRAID reported 1536 errors on that drive.

The start of this episode appears in the logfile as:

Aug 29 22:38:14 Tower autofan: Highest disk temp is 39°C, adjusting fan speed from: 185 (72% @ 1872rpm) to: 150 (58% @ 1612rpm)
Aug 29 22:39:20 Tower rpc.mountd[9176]: authenticated mount request from 172.22.1.26:779 for /mnt/user/xbmc (/mnt/user/xbmc)
Aug 29 22:41:55 Tower kernel: sd 1:0:2:0: [sdd] UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Aug 29 22:41:55 Tower kernel: sd 1:0:2:0: [sdd] Sense Key : 0x3 [current] [descriptor]
Aug 29 22:41:55 Tower kernel: sd 1:0:2:0: [sdd] ASC=0x11 ASCQ=0x0
Aug 29 22:41:55 Tower kernel: sd 1:0:2:0: [sdd] CDB: opcode=0x88 88 00 00 00 00 01 04 99 63 40 00 00 04 00 00 00
Aug 29 22:41:55 Tower kernel: blk_update_request: critical medium error, dev sdd, sector 4372128576
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128512
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128520
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128528
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128536
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128544
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128552
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128560
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128568
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128576
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128584
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128592
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128600
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128608
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128616
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128624
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128632
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128640
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128648
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128656
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128664
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128672
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128680
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128688
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128696
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128704
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128712
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128720
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128728
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128736
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128744

This was some four hours after the system booted up.

tower-diagnostics-20150829-2337.zip
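The kernel lines at the start of that excerpt can be decoded by hand. In the SCSI command set, opcode 0x88 is READ(16), whose CDB carries an 8-byte big-endian starting LBA in bytes 2-9 and a 4-byte transfer length in bytes 10-13; sense key 0x3 is MEDIUM ERROR, and ASC 0x11 with ASCQ 0x00 is "unrecovered read error". The sketch below applies that layout to the CDB bytes from the log; the function name `decode_read16` is made up for this example.

```python
READ_16 = 0x88

# Standard SCSI meanings for the codes seen in the log above.
SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}


def decode_read16(cdb_hex: str):
    """Return (lba, sector_count) from a READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != READ_16:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")       # bytes 2-9: starting LBA
    count = int.from_bytes(cdb[10:14], "big")    # bytes 10-13: transfer length
    return lba, count


# CDB bytes copied from the 22:41:55 kernel line.
lba, count = decode_read16("88 00 00 00 00 01 04 99 63 40 00 00 04 00 00 00")
print("start LBA:", lba)
print("sectors requested:", count)
print("sense:", SENSE_KEYS[0x3], "/", ASC_ASCQ[(0x11, 0x00)])
```

The decoded LBA is 4372128576, the same sector named in the `blk_update_request: critical medium error` line, and the request was for 1024 sectors. The md error lines start 64 sectors lower (4372128512), which would be consistent with a data partition starting at sector 64, though that offset is an inference from the numbers rather than something the log states.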
SSD Posted August 29, 2015

Whatever brand/model of drive you are using may work that way, but it is not true of all drives. Many of them stick at 0, and others have huge numbers that, as RobJ mentioned, are not interpretable without technical docs that drive manufacturers do not publish.

But all signs are pointing to this being a real issue that is observable by both the drive and the OS. Unless it is some sort of connection issue, it sounds like the drive is bad.
tr0910 Posted June 7, 2016

This is an old topic, but I want to keep the raw read error rate discussions together.

The drive is a Samsung 2TB 203WI. It worked fine for years, but suddenly started spitting and fussing about its raw read error rate, which seems to have stabilized at 52,000 now. SMART thinks this means "Failing Now", but the drive is working without issue again. Thanks to 6.2 having 2 parity slots, we slammed it into the parity 2 slot on a small server to see what happens. Several preclears brought the pending sectors down to 0 from about 700.

I know we are rolling the dice with this one, but I have never had a 203WI fail, and I want to see what happens.
Archived
This topic is now archived and is closed to further replies.