Intel 520 SSD "Uncorrectable Error Count" warnings

burnaby_boy · August 22, 2015

Last night I added an Intel 520 120GB SSD as a cache drive to my array which is running unRAID 6.0.1. The array is set to notify me by email with notices, warnings and alerts. This morning I began receiving emails warning me of an increasing "uncorrectable error count" on the SSD. After a quick search I discovered that this drive has a bug that causes the uncorrectable error count to increase, though there is no problem with the drive. Is there any way to shut off the SMART health warnings for just this drive?

MortenSchmidt · March 28, 2016

I have the same problem with the same drive, an Intel 520 120GB. I'm flooded with "uncorrectable error count" emails and notifications in the webGui.

I hooked the drive up to a Windows PC running the Intel SSD Toolbox today to check for new firmware, but the drive already has the latest firmware version, "400i".

Did you find a solution?

What did you find that suggested this is caused by a bug in the SSD? Since the power-on hours are also reported wrongly in Linux but correctly in Windows with the SSD Toolbox, I'm thinking the problem may lie with Linux or its unRAID implementation. This topic seems to suggest the drive uses "extended logs" and requires running "smartctl -x" instead of "smartctl -a":

https://communities.intel.com/thread/53491?start=0&tstart=0

For what it's worth, with smartctl -a (and in the unRAID webGui) the power-on hours are reported like this:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       909799h+57m+24.280s

But with smartctl -x it adds this interpretation:

Device Statistics (GP Log 0x04)
Page Offset Size         Value  Description
  1  =====  =                =  == General Statistics (rev 2) ==
  1  0x010  4            15004  Power-on Hours

So the -x output makes a lot more sense, and it matches the value I see in Windows with the SSD Toolbox.
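For anyone who wants to pull that number out in a script, here is a minimal Python sketch that parses the Device Statistics block quoted above. The function name `parse_device_statistics` is made up for illustration, and the pattern assumes the exact column layout shown in this post; other smartctl versions may lay the columns out differently, so treat it as a starting point only.

```python
import re

# Sample text copied from the `smartctl -x` output quoted above.
SAMPLE = """\
Device Statistics (GP Log 0x04)
Page Offset Size         Value  Description
  1  =====  =                =  == General Statistics (rev 2) ==
  1  0x010  4            15004  Power-on Hours
"""

# Columns: page, offset, size, value, description (layout assumed from the sample).
_ROW = re.compile(r"^\s*(\d+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(.+?)\s*$")


def parse_device_statistics(text):
    """Return {description: value} for the numeric rows of the block.

    Header and section-title rows do not match the pattern and are skipped.
    """
    stats = {}
    for line in text.splitlines():
        match = _ROW.match(line)
        if match:
            stats[match.group(5)] = int(match.group(4))
    return stats


stats = parse_device_statistics(SAMPLE)
print(stats["Power-on Hours"])
```

On a live system the text would come from running `smartctl -x` against the device instead of the hard-coded sample.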

I have just rebooted the server, so the uncorrectable notifications aren't popping up yet. I can update this later when they return (from what I have read, the uncorrectable error count raw value resets at power-on with these SSDs).

MortenSchmidt · March 28, 2016

Back again. Took all of 10 minutes for the uncorrectable error count notifications to start popping up.

With smartctl -a and the unRAID webGui:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
187 Uncorrectable_Error_Cnt 0x000f   114   114   050    Pre-fail  Always       -       76820312

(BTW, please note that worst and value are way above the threshold)

With smartctl -x the same raw value is reported, but this is added below:

Device Statistics (GP Log 0x04)
Page Offset Size         Value  Description
  4  =====  =                =  == General Errors Statistics (rev 1) ==
  4  0x008  4                0  Number of Reported Uncorrectable Errors
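The comparison between the two outputs can be sketched in Python. This is only a heuristic of my own, not anything unRAID or smartctl does: it flags the raw value of attribute 187 as suspect when the normalized value and worst are above the threshold and the Device Statistics log reports zero uncorrectable errors. The function names are hypothetical and the inputs are the figures quoted above.

```python
def parse_attribute_line(line):
    """Split one row of the `smartctl -a` attribute table into named fields."""
    parts = line.split()
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "value": int(parts[3]),
        "worst": int(parts[4]),
        "thresh": int(parts[5]),
        "raw": int(parts[9]),
    }


def raw_alert_looks_spurious(attr, device_stat_errors):
    """Heuristic (an assumption, not a documented rule): a non-zero raw value
    is suspect if the normalized readings are healthy and the Device
    Statistics log counts no reported uncorrectable errors."""
    healthy = attr["value"] > attr["thresh"] and attr["worst"] > attr["thresh"]
    return attr["raw"] > 0 and healthy and device_stat_errors == 0


# Row copied from the `smartctl -a` output quoted above.
ATTR_187 = (
    "187 Uncorrectable_Error_Cnt 0x000f   114   114   050    "
    "Pre-fail  Always       -       76820312"
)
# "Number of Reported Uncorrectable Errors" from the `smartctl -x` output.
DEVICE_STAT_ERRORS = 0

attr = parse_attribute_line(ATTR_187)
print(raw_alert_looks_spurious(attr, DEVICE_STAT_ERRORS))
```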

RobJ · March 28, 2016

What you can do is uncheck 187 for this drive, in its SMART monitoring config.

MortenSchmidt · March 29, 2016

Thanks. Found that checkbox and it works.

MortenSchmidt · April 3, 2016

I kept looking at this a bit longer - the other way to get rid of the false warnings is to change from monitoring raw to normalized values. This way one could still monitor field 187's normalized value.
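To illustrate the difference between the two modes, here is a rough Python sketch. It assumes raw mode notifies whenever the raw value increases between checks (my reading of the behaviour, not confirmed from the unRAID source) and that normalized mode follows the usual SMART convention of failing when the value drops to or below a non-zero threshold. The function names are made up, and the two readings are the values for field 187 quoted in this thread.

```python
def raw_mode_alert(previous_raw, current_raw):
    """Assumed raw-mode rule: notify when the raw value has increased."""
    return current_raw > previous_raw


def normalized_mode_alert(value, thresh):
    """Usual SMART convention: fail when value <= threshold.
    A threshold of 0 means the attribute never fails on its normalized value."""
    return thresh > 0 and value <= thresh


# Field 187 as quoted in this thread (earlier reading, later reading).
readings = [
    {"raw": 76820312, "value": 114, "thresh": 50},
    {"raw": 199827011, "value": 118, "thresh": 50},
]

earlier, later = readings
print("raw mode alert:", raw_mode_alert(earlier["raw"], later["raw"]))
print("normalized mode alert:", normalized_mode_alert(later["value"], later["thresh"]))
```

With these figures the raw rule fires while the normalized rule stays quiet, which matches the notifications described above.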

I also looked closer at the smartctl -x report, which breaks out which fields are relevant to pre-failure prediction and which are not.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   100   100   000    -    0
  9 Power_On_Hours          -O--CK   000   000   000    -    909937 (110 2 0)
 12 Power_Cycle_Count       -O--CK   100   100   000    -    116
170 Unknown_Attribute       PO--CK   100   100   010    -    0
171 Unknown_Attribute       -O--CK   100   100   000    -    0
172 Unknown_Attribute       -O--CK   100   100   000    -    0
174 Unknown_Attribute       -O--CK   100   100   000    -    27
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      POSR--   118   118   050    -    199827011
192 Power-Off_Retract_Count -O--CK   100   100   000    -    27
225 Unknown_SSD_Attribute   -O--CK   100   100   000    -    327259
226 Unknown_SSD_Attribute   -O--CK   100   100   000    -    65535
227 Unknown_SSD_Attribute   -O--CK   100   100   000    -    30
228 Power-off_Retract_Count -O--CK   100   100   000    -    65535
232 Available_Reservd_Space PO--CK   100   100   010    -    0
233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
241 Total_LBAs_Written      -O--CK   100   100   000    -    327259
242 Total_LBAs_Read         -O--CK   100   100   000    -    146395
249 Unknown_Attribute       PO--C-   100   100   000    -    11051
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

I decided to add fields 170, 184 and 232 since these are classified as "prefailure warning" fields; 232 (Available_Reservd_Space) also earns its place by its title. (But not 249, since it already has a high raw value and, like 187, is not 'auto-kept', so it presumably resets at power-on like 187 does.)
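The same selection can be reproduced from the flags column. The Python sketch below parses the table pasted above, picks the fields with the P (prefailure warning) flag, and then keeps only those that also carry K (auto-keep). The helper name `parse_rows` is made up for illustration, and the flag positions are taken from the legend under the table.

```python
# Attribute rows copied from the `smartctl -x` table above.
TABLE = """\
  5 Reallocated_Sector_Ct   -O--CK   100   100   000    -    0
  9 Power_On_Hours          -O--CK   000   000   000    -    909937 (110 2 0)
 12 Power_Cycle_Count       -O--CK   100   100   000    -    116
170 Unknown_Attribute       PO--CK   100   100   010    -    0
171 Unknown_Attribute       -O--CK   100   100   000    -    0
172 Unknown_Attribute       -O--CK   100   100   000    -    0
174 Unknown_Attribute       -O--CK   100   100   000    -    27
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      POSR--   118   118   050    -    199827011
192 Power-Off_Retract_Count -O--CK   100   100   000    -    27
225 Unknown_SSD_Attribute   -O--CK   100   100   000    -    327259
226 Unknown_SSD_Attribute   -O--CK   100   100   000    -    65535
227 Unknown_SSD_Attribute   -O--CK   100   100   000    -    30
228 Power-off_Retract_Count -O--CK   100   100   000    -    65535
232 Available_Reservd_Space PO--CK   100   100   010    -    0
233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
241 Total_LBAs_Written      -O--CK   100   100   000    -    327259
242 Total_LBAs_Read         -O--CK   100   100   000    -    146395
249 Unknown_Attribute       PO--C-   100   100   000    -    11051
"""


def parse_rows(text):
    """Return a list of {id, name, flags} dicts, one per attribute row."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # A data row starts with a numeric ID and has a six-character flags field.
        if len(parts) >= 8 and parts[0].isdigit() and len(parts[2]) == 6:
            rows.append({"id": int(parts[0]), "name": parts[1], "flags": parts[2]})
    return rows


rows = parse_rows(TABLE)
# Flag positions per the legend: P O S R C K (index 0 = P, index 5 = K).
prefail = [r["id"] for r in rows if r["flags"][0] == "P"]
prefail_kept = [r["id"] for r in rows if r["flags"][0] == "P" and r["flags"][5] == "K"]

print("prefailure fields:", prefail)
print("prefailure and auto-kept:", prefail_kept)
```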

Interestingly, the only field that unRAID monitors by default that is classified as a prefailure warning field for this SSD is 187 (which it monitors incorrectly in raw mode).

I'm not sure whether it is better to monitor all those fields including 187 in normalized mode, or all of them excluding 187 in raw mode. I'm leaning toward the latter, thinking that will give the earliest warning possible, but on the other hand, since it is just a cache device, maybe we don't need to be concerned with the raw values.

In case anyone at Limetech sees this, the obvious improvement suggestion here is to offer the possibility to monitor some fields in raw mode and others in normalized mode, or to implement the extended attributes so that field 187 can be monitored correctly in raw mode.
