Raw Read Error Rate.

August 21, 2015

When power returned after a powercut earlier today, I received a notification: "Array has 1 disk with read errors".

On checking the SMART attributes, I find that my newest drive (5232 power on hours), a WD Red 3TB, was showing a raw value of 254 for the Raw Read Error Rate. In the succeeding six hours, that value has risen to 1481:

1	Raw Read Error Rate	0x002f	200	200	051	Pre-fail	Always	-	1481

I googled this attribute string, and many sites advise to ignore this value. However, this is the first time I have seen a non-zero value for this attribute, on any drive.

What is the advice/experience of unRAID users - should I be concerned? Would WD be likely to swap this drive under warranty?

Edit to add:

The main tab on the unRAID webGui is reporting:

Disk 2	WDC_WD30EFRX-68EUZN0_WD-WMC4N2597206 - 3 TB (sdc)	41 C	51,111	19,533	16,800	xfs	3 TB	2.32 TB	682 GB

That's 16,800 errors.

I have been aware of lengthy 'Buffering...' delays when playing a movie, which may be related.

August 21, 2015

I would be concerned about the reported errors. You should post your syslog, or even better, generate and post your diagnostics file.

The raw read error rate does not appear to be a problem. SMART is reporting a normalized value of 200 against a failure threshold of 51. If the normalized value dropped close to 51, it would be a sign of trouble. That being said, a sudden change in the raw value, coinciding with the errors you are reporting, could be related.
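To make the VALUE versus THRESH comparison concrete, here is a minimal Python sketch that parses the attribute row posted above and reports how much margin is left. The class and function names are made up for illustration, and it assumes the row is tab-separated with the ten columns that smartctl prints (ID, name, flag, value, worst, thresh, type, updated, when failed, raw).

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    attr_id: int
    name: str
    flag: str
    value: int    # normalized value, higher is healthier
    worst: int    # lowest normalized value seen so far
    thresh: int   # attribute is flagged once value <= thresh
    attr_type: str
    updated: str
    when_failed: str
    raw: str      # vendor-specific, kept as text on purpose


def parse_attribute_row(row: str) -> SmartAttribute:
    """Split one tab-separated attribute row (assumed layout, see above)."""
    fields = [field.strip() for field in row.split("\t")]
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    return SmartAttribute(
        int(fields[0]), fields[1], fields[2],
        int(fields[3]), int(fields[4]), int(fields[5]),
        fields[6], fields[7], fields[8], fields[9],
    )


def margin(attr: SmartAttribute) -> int:
    """Distance between the normalized value and the failure threshold."""
    return attr.value - attr.thresh


def is_failing(attr: SmartAttribute) -> bool:
    """True once the normalized value has reached the threshold."""
    return attr.value <= attr.thresh


# The row from the first post, rebuilt with explicit tab separators.
ROW = "\t".join([
    "1", "Raw Read Error Rate", "0x002f", "200", "200", "051",
    "Pre-fail", "Always", "-", "1481",
])
attr = parse_attribute_row(ROW)
print(attr.name, "margin:", margin(attr), "failing:", is_failing(attr))
```

Note that the raw value (1481) plays no part in the check; only the normalized value and the threshold do.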

I would suggest running the short and long SMART tests. They are non-destructive (won't alter any data) and if they fail, that would be proof the drive is failing. Remember to disable spin down if running the long test, as a spin down during the test can result in a false failure.
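If you prefer the command line to the webGui, the tests can be started with smartctl from smartmontools. The Python sketch below only builds the command lines; the helper names are hypothetical, and /dev/sdc is an assumption taken from the "(sdc)" shown on the Main tab, so check the device letter before running anything, since it can change between boots.

```python
import subprocess


def selftest_command(device: str, kind: str) -> list[str]:
    """Build the smartctl command that starts a short or long self-test."""
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    return ["smartctl", "-t", kind, device]


def selftest_log_command(device: str) -> list[str]:
    """Build the smartctl command that prints the self-test log."""
    return ["smartctl", "-l", "selftest", device]


def run(command: list[str]) -> str:
    """Run a command and return its output (needs root and smartmontools)."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout


# Assumed device name; nothing is executed here, the commands are only printed.
DEVICE = "/dev/sdc"
for kind in ("short", "long"):
    print(" ".join(selftest_command(DEVICE, kind)))
print(" ".join(selftest_log_command(DEVICE)))
```

The test runs inside the drive itself, so the command returns straight away; the result shows up later in the self-test log.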

August 21, 2015

...
Would WD be likely to swap this drive under warranty?

...

I have been aware of lengthy 'Buffering...' delays when playing a movie, which may be related.

WD had an option to RMA a drive for performance reasons. So even if they wouldn't take it because of a Raw Read Error Rate above zero, the fact that you are having performance problems should allow you to RMA it. That assumes the performance option on an RMA is still present, as it was the last time I had to return a drive to them (a 2TB EARS, to give you perspective on how long ago that was).

August 21, 2015

I am sure they would swap it out. I just hate refurbs. Not had good luck with them. I'd rather know if the problem is really the drive before returning a disk with near perfect attributes. If you could return it for an exchange, that is a very different prospect.

August 21, 2015

I tend to agree. I've had a 50/50 success rate with refurbs. Because of that, if I get a refurb I usually use it as a backup external drive only and not in my array.

August 21, 2015

This is the reason I think the value of warranty is somewhat over-rated.

If warranty comes into the picture, you are often exchanging your problem for someone else's problem, because all the manufacturer does is run some diagnostics that show the drive is good, and put your drive in the "good" pile. Of course they zap all the SMART attributes first. Now some drives require repair, but a drive that has been disassembled by human hands to replace parts is not necessarily factory fresh either.

All in all, I'd say refurbs are a crap shoot. Many users don't know what they are doing, and send in perfectly fine drives for replacement. If you get one of those as a replacement, you hit the lottery. But many times, the problems people have are nuanced and the drive manufacturer won't pick them up, and they'll just send the drive to someone else as a refurb.

August 22, 2015

Author

I attach my diagnostics file from yesterday evening. The short SMART test ran for several minutes - half the time it was showing 90% complete - but returned with no error. The long SMART test has been running for more than 12 hours now, and has been showing 90% complete for much of that time.

Edit to Add:

And another 12 hours later, it still says 90% complete - I'm sure that this can't be right.

tower-diagnostics-20150821-1957.zip

August 22, 2015

I tend to agree. I've had a 50/50 success rate with refurbs. Because of that if I get a refurb I usually use it as backup external drive only and not in my array.

This is the reason I think the value of warranty is somewhat over-rated.

If warranty comes into the picture, you are often exchanging your problem for someone else's problem, because all the manufacturer does is run some diagnostics that show the drive is good, and put your drive in the "good" pile. Of course they zap all the SMART attributes first. Now some drives require repair, but a drive that has been disassembled by human hands to replace parts is not necessarily factory fresh either.

All in all, I'd say refurbs are a crap shoot. Many users don't know what they are doing, and send in perfectly fine drives for replacement. If you get one of those as a replacement, you hit the lottery. But many times, the problems people have are nuanced and the drive manufacturer won't pick them up, and they'll just send the drive to someone else as a refurb.

True, but 50 percent of my returns have come back as NEW drives, and 50 percent of those were larger drives than I sent in. That was during the flood shortage, so I don't expect it any more, but if they don't have a refurb available they will send you a new drive. When I get a new drive back I usually use it.

August 24, 2015

Author

Well, the long SMART test completed after about 36 hours, and reported "Completed without error".

The SMART attributes are still reporting a raw error rate of 1987, but unRAID reports 17,952 errors.

I have to say that I have little confidence in that drive - I fail to understand how a drive which reports almost 18k errors to the O/S can be considered to be 'good', but I'm not sure what I can do about it.

I'm considering replacing it with a new drive and then hammering the old one with preclears.

August 24, 2015

Those results aren't necessarily related. Except for CRC errors, SMART only reports physical issues with the drive itself. The unRAID Main page drive error counts could also reflect interface issues, problems with the cabling or controller, etc. If something happens to the controller, and the drive is dropped from the system, then you get a huge unRAID error count for the drive, yet it's almost certainly in perfect condition.

August 25, 2015

Author


Those results aren't necessarily related.

Indeed, but both counts (SMART raw read error rate and unRAID error count) increase at the same time. As far as I'm aware, the SMART raw read error rate has nothing to do with interface performance.

Except for CRC errors, SMART only reports physical issues with the drive itself. The unRAID Main page drive error counts could also reflect interface issues, problems with the cabling or controller, etc. If something happens to the controller, and the drive is dropped from the system, then you get a huge unRAID error count for the drive, yet it's almost certainly in perfect condition.

Well, interestingly, despite clocking up 1664 unRAID errors in the space of a few minutes, the drive doesn't get dropped from the system.

A small chunk of the system log shows this:

Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502560
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502568
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502576
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502584
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502592
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502600
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502608
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502616
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502624
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502632
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502640
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502648
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502656
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502664
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502672
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502680
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502688
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502696
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502704
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502712
Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502720

I haven't counted, but I presume that there are 1664 of these lines. I also presume that the sector numbers advance in steps of 8 because of the 4k block size (8 sectors per block).
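Rather than counting by hand, the syslog lines can be checked with a few lines of Python. This is only a sketch: the function names are made up, it assumes the "md: diskN read error, sector=..." wording shown above, and it assumes 512-byte logical sectors, which is what makes 8 sectors add up to one 4k block.

```python
import re

# Matches the md driver lines pasted above.
MD_ERROR = re.compile(r"md: disk(\d+) read error, sector=(\d+)")

SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def md_error_sectors(lines, disk=2):
    """Return the sector numbers reported for one array disk, in log order."""
    sectors = []
    for line in lines:
        match = MD_ERROR.search(line)
        if match and int(match.group(1)) == disk:
            sectors.append(int(match.group(2)))
    return sectors


def strides(sectors):
    """Set of differences between consecutive sector numbers."""
    return {later - earlier for earlier, later in zip(sectors, sectors[1:])}


# First three lines of the excerpt above; feed in the whole syslog to count all of them.
sample = [
    "Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502560",
    "Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502568",
    "Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502576",
]
sectors = md_error_sectors(sample)
print("lines:", len(sectors))
print("strides:", strides(sectors))
print("bytes per step:", 8 * SECTOR_BYTES)
```

Running it over the full log would also confirm (or not) the presumed total of 1664 lines.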

The machine hasn't been opened for months, so no cabling has been disturbed.

I suspect that what is happening is that the drive encounters the raw read errors and while it recovers from each of those there is a lengthy delay resulting in a controller timeout. These timeouts will be what unRAID is reporting.

Having just rebooted for the rc6 update, I became aware that the errors are occurring during the cachedirs reads. Further errors, and lengthy delays, occur when certain files are read. I have seen OpenELEC displaying the 'buffering ...' status when playing video files from this drive.

I have had experience of an older (1TB) WD drive which exhibited extremely slow progress between ~70-80% through each pass of preclearing - I suspect that this drive is behaving in a similar fashion (although there were no SMART raw read errors clocked on the older drive).

I really think that I need to order up a new drive (could take a couple of weeks to obtain a 3TB drive here in the Philippines), then run preclears of this 'faulty' drive in order to get a better measure of the performance issue, with a view to RMAing it.

August 25, 2015

I should be asleep, but saw the email of your post and felt I had to come back, partly because you're getting frustrated (I'm sorry!), but also because you are misunderstanding the SMART info. And also to request another Diagnostics zip, hopefully from the same session with those read errors.

Raw_Read_Error_Rate is an error rate not a counter, just like Seek_Error_Rate, and the ONLY thing of interest, the ONLY thing we can interpret, is the VALUE and WORST for them. Both of your error rate VALUEs are 200, which is perfect. You are still referring to 'raw read errors', but that is NOT what this attribute is about. And the fact that it is non-zero and LOOKS like a counter that is increasing is throwing you off. It is not a counter, it is an error rate, and each manufacturer uses the RAW in differing ways, if they use it at all, with coded values that only they know the meaning of.

According to the earlier SMART report, the drive looks great, no issues at all. We'll check the next SMART report to verify that that hasn't changed.

The small section from the syslog you showed above shows a drive that, from the unRAID module's viewpoint, cannot read anything. I suspect the drive has already been dropped from the system by the kernel, but the unRAID module does not know that. I've seen this happen many times. I do not know that for sure, but if you have the syslog that includes all that, we should be able to see what went wrong. What is ALWAYS necessary though are the VERY FIRST errors, not the later read errors that may occur. Once a drive has trouble, and especially if it stops responding for any reason, then you can completely discount all of the errors that follow.
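As a rough illustration of pulling out the VERY FIRST errors, the Python sketch below returns the lines leading up to the first md read error for a given disk. The function name is made up, the earlier lines in the example are placeholders rather than real log output, and /var/log/syslog as the log location is an assumption.

```python
def context_before_first_md_error(lines, disk, before=20):
    """Return up to `before` lines preceding the first md read error, plus that line."""
    marker = f"md: disk{disk} read error"
    for index, line in enumerate(lines):
        if marker in line:
            start = max(0, index - before)
            return lines[start:index + 1]
    return []  # no md read error for this disk in the log


# Placeholder lines stand in for whatever the kernel logged earlier.
example = [
    "<earlier line 1>",
    "<earlier line 2>",
    "<earlier line 3>",
    "Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502560",
    "Aug 25 12:04:41 Tower kernel: md: disk2 read error, sector=5390502568",
]
for line in context_before_first_md_error(example, disk=2, before=2):
    print(line)

# Against a real log (path is an assumption):
# with open("/var/log/syslog", errors="replace") as handle:
#     first = context_before_first_md_error(handle.read().splitlines(), disk=2)
```

Everything after the returned slice is the flood of follow-on errors that can be discounted.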

Something IS wrong, but it really doesn't look like the drive, and I don't want to speculate without the syslog. I need to sleep now, but I'll check it in the morning, if someone else hasn't helped first.

August 29, 2015

Author

RobJ, many thanks for your continued advice. I hear what you say about that SMART attribute but, from my observation, the raw value is a counter with ever-increasing values - it currently sits at a constant value of 2114.

However, as suddenly as the 'problem' appeared, it has gone away again. The attribute was increasing, and the unRAID errors were being reported, after two 'power-on' boots. However, subsequent boots (2 power cuts already today, before 8am, and several over the previous days) have not shown any further error reports, although, as I said, the attribute raw value is sitting at 2114.

I will keep an eye on it but, for the time being, all seems well.

August 29, 2015

Author

Had a short spell of this happening again this evening (after the third powercut of the day). However, it did not happen during directory caching, as on the two previous occasions, but while playing a movie.

I noticed OpenELEC displaying "Buffering ..." again, but it then carried on playing the movie normally - and it still is, more than an hour later.

Raw read error rate has jumped from the previous 2114 to 2205. During that time, errors on that drive, reported by unRAID, show 1536.

The start of this episode appears in the logfile as:

Aug 29 22:38:14 Tower autofan: Highest disk temp is 39°C, adjusting fan speed from: 185 (72% @ 1872rpm) to: 150 (58% @ 1612rpm)
Aug 29 22:39:20 Tower rpc.mountd[9176]: authenticated mount request from 172.22.1.26:779 for /mnt/user/xbmc (/mnt/user/xbmc)
Aug 29 22:41:55 Tower kernel: sd 1:0:2:0: [sdd] UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Aug 29 22:41:55 Tower kernel: sd 1:0:2:0: [sdd] Sense Key : 0x3 [current] [descriptor] 
Aug 29 22:41:55 Tower kernel: sd 1:0:2:0: [sdd] ASC=0x11 ASCQ=0x0 
Aug 29 22:41:55 Tower kernel: sd 1:0:2:0: [sdd] CDB: opcode=0x88 88 00 00 00 00 01 04 99 63 40 00 00 04 00 00 00
Aug 29 22:41:55 Tower kernel: blk_update_request: critical medium error, dev sdd, sector 4372128576
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128512
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128520
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128528
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128536
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128544
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128552
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128560
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128568
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128576
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128584
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128592
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128600
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128608
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128616
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128624
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128632
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128640
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128648
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128656
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128664
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128672
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128680
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128688
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128696
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128704
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128712
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128720
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128728
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128736
Aug 29 22:41:55 Tower kernel: md: disk2 read error, sector=4372128744

This was some four hours after the system booted up.
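For anyone reading along, the kernel lines at the start of that excerpt can be decoded. Opcode 0x88 is a SCSI READ(16) command, which carries the starting LBA in bytes 2-9 and the transfer length in bytes 10-13; sense key 0x3 is MEDIUM ERROR and ASC 0x11 / ASCQ 0x00 is an unrecovered read error. The Python sketch below (function name made up for illustration) unpacks the CDB from the log. The 64-sector difference between the block-layer sector and the first md sector is just what these lines show; reading it as the partition start offset is an assumption.

```python
def decode_read16(cdb_hex: str):
    """Return (start_lba, transfer_length) from a READ(16) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)  # fromhex ignores the spaces between bytes
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    start_lba = int.from_bytes(cdb[2:10], "big")
    transfer_length = int.from_bytes(cdb[10:14], "big")
    return start_lba, transfer_length


# CDB copied from the "[sdd] CDB:" line above.
CDB = "88 00 00 00 00 01 04 99 63 40 00 00 04 00 00 00"
lba, length = decode_read16(CDB)

FIRST_MD_SECTOR = 4372128512  # first "md: disk2 read error" line above
offset = lba - FIRST_MD_SECTOR

print("start LBA:", lba)            # 4372128576, same as the blk_update_request line
print("sectors requested:", length) # 1024
print("4k blocks in request:", length // 8)
print("device sector minus md sector:", offset)
```

So a single failed 1024-sector read accounts for 128 of the md read error lines (one per 4k block), which would explain why the unRAID error count jumps in large steps for what may be only a handful of failed requests.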

tower-diagnostics-20150829-2337.zip

August 29, 2015

Whatever brand/model drive you are using may work that way, but it is not true of all drives. Many of them stick at 0 and others have huge numbers that, as RobJ mentioned, are not interpretable without technical docs that drive manufacturers do not publish. But all signs are pointing to this being a real issue that is observable by the drive and the OS. Unless it is some sort of connection issue, it sounds like the drive is bad.

June 7, 2016

This is an old topic, but I want to keep the raw read error rate discussions together.

Drive is a Samsung 2TB 203WI. Working fine for years, but suddenly started spitting and fussing about raw read error rate.

Seems to have stabilized at 52,000 now. SMART thinks this means "Failing Now" but the drive is working without issue again. Thanks to 6.2 having 2 parity slots, we slammed it into the parity 2 spot on a small server to see what happens.

Several preclears brought the pending sectors down to 0 from about 700. I know we are rolling the dice with this one, but I have never had a 203WI fail, and I want to see what happens.
