burnaby_boy Posted August 22, 2015
Last night I added an Intel 520 120GB SSD as a cache drive to my array which is running unRAID 6.0.1. The array is set to notify me by email with notices, warnings and alerts. This morning I began receiving emails warning me of an increasing "uncorrectable error count" on the SSD. After a quick search I discovered that this drive has a bug that causes the uncorrectable error count to increase, though there is no problem with the drive. Is there any way to shut off the SMART health warnings for just this drive?
MortenSchmidt Posted March 28, 2016
I have this too, with the same drive, an Intel 520 120GB. I'm flooded with "uncorrectable error count" emails and notifications in the webGui. I had the drive hooked up to a Windows PC running the Intel SSD Toolbox today to check for new firmware, but the drive already has the latest firmware version, "400i".

Did you find a solution? What did you find that suggested this is caused by a bug in the SSD? Since the power-on hours reporting is also messed up in Linux but is correct in Windows with the SSD Toolbox, I'm thinking the problem may be with Linux or its unRAID implementation.

This topic seems to suggest the drive works with "extended logs" and requires running "smartctl -x" instead of "smartctl -a": https://communities.intel.com/thread/53491?start=0&tstart=0

For what it's worth, with smartctl -a (and in the unRAID webGui) the power-on hours are reported like this:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       909799h+57m+24.280s

But smartctl -x adds this interpretation:

Device Statistics (GP Log 0x04)
Page Offset Size         Value  Description
  1  =====  =                =  == General Statistics (rev 2) ==
  1  0x010  4            15004  Power-on Hours

So the -x output makes a lot more sense, and it is the same value I see in Windows with the SSD Toolbox.

I have just rebooted the server, so the uncorrectable notifications aren't popping up yet. I can update this later when they return (from what I have read, the uncorrectable error count raw value resets at power-on with these SSDs).
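For anyone who wants to read the Device Statistics value from a script instead of by eye, here is a minimal Python sketch. It assumes the section layout matches the smartctl -x output quoted in the post above (Page, Offset, Size, Value, Description); other smartctl versions may print extra columns, so the pattern would need adjusting. The names SAMPLE, ROW and device_statistics are made up for this illustration.

```python
import re

# Sample text copied from the smartctl -x output quoted in the post above.
SAMPLE = """\
Device Statistics (GP Log 0x04)
Page Offset Size         Value  Description
  1  =====  =                =  == General Statistics (rev 2) ==
  1  0x010  4            15004  Power-on Hours
"""

# Assumed row layout: page, offset (0x...), size, value, description.
ROW = re.compile(r"^\s*(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(-?\d+)\s+(.+?)\s*$")


def device_statistics(text):
    """Return {description: value} for data rows in the Device Statistics section."""
    stats = {}
    in_section = False
    for line in text.splitlines():
        if line.startswith("Device Statistics"):
            in_section = True
            continue
        if not in_section:
            continue
        if not line.strip():
            break  # a blank line ends the section
        match = ROW.match(line)
        if match:  # column headers and "=====" page headers do not match
            stats[match.group(5)] = int(match.group(4))
    return stats


stats = device_statistics(SAMPLE)
print(stats["Power-on Hours"])
```

On a live system the text would come from running smartctl -x against the drive rather than from SAMPLE.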
MortenSchmidt Posted March 28, 2016
Back again. It took all of 10 minutes for the uncorrectable error count notifications to start popping up.

With smartctl -a and the unRAID webGui:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
187 Uncorrectable_Error_Cnt 0x000f   114   114   050    Pre-fail  Always       -       76820312

(BTW, please note that worst and value are way above the threshold.)

With smartctl -x the same raw value is reported, but this is added below:

Device Statistics (GP Log 0x04)
Page Offset Size         Value  Description
  4  =====  =                =  == General Errors Statistics (rev 1) ==
  4  0x008  4                0  Number of Reported Uncorrectable Errors
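The point about value and worst being well above the threshold can be checked mechanically. The sketch below assumes the usual SMART convention that an attribute counts as failing when its normalized VALUE is at or below a non-zero THRESH; the helper names are hypothetical, and the row is the one quoted in the post above.

```python
# Attribute row copied from the smartctl -a output quoted in the post above.
ROW_187 = "187 Uncorrectable_Error_Cnt 0x000f 114 114 050 Pre-fail Always - 76820312"


def parse_attribute_row(line):
    """Split one row of the smartctl -a attribute table into named fields."""
    # RAW_VALUE is the 10th column and may contain spaces, so keep it whole.
    parts = line.split(None, 9)
    if len(parts) != 10:
        raise ValueError("unexpected attribute row: %r" % line)
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "value": int(parts[3]),
        "worst": int(parts[4]),
        "thresh": int(parts[5]),
        "type": parts[6],
        "raw": parts[9],
    }


def failing_normalized(attr):
    """Assumed convention: failing when VALUE <= THRESH; THRESH 0 means no threshold."""
    return attr["thresh"] > 0 and attr["value"] <= attr["thresh"]


attr = parse_attribute_row(ROW_187)
print(attr["raw"], failing_normalized(attr), attr["value"] - attr["thresh"])
```

For this row the raw counter is huge, yet the normalized check reports no failure, with a margin of 64 points above the threshold.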
RobJ Posted March 28, 2016
What you can do is uncheck 187 for this drive, in its SMART monitoring config.
MortenSchmidt Posted March 29, 2016
Thanks. Found that checkbox and it works.
MortenSchmidt Posted April 3, 2016
I kept looking at this a bit longer. The other way to get rid of the false warnings is to change from monitoring raw values to monitoring normalized values. This way one could still monitor field 187's normalized value. I also looked more closely at the smartctl -x report, which breaks out which fields are relevant to pre-failure prediction and which are not.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   100   100   000    -    0
  9 Power_On_Hours          -O--CK   000   000   000    -    909937 (110 2 0)
 12 Power_Cycle_Count       -O--CK   100   100   000    -    116
170 Unknown_Attribute       PO--CK   100   100   010    -    0
171 Unknown_Attribute       -O--CK   100   100   000    -    0
172 Unknown_Attribute       -O--CK   100   100   000    -    0
174 Unknown_Attribute       -O--CK   100   100   000    -    27
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      POSR--   118   118   050    -    199827011
192 Power-Off_Retract_Count -O--CK   100   100   000    -    27
225 Unknown_SSD_Attribute   -O--CK   100   100   000    -    327259
226 Unknown_SSD_Attribute   -O--CK   100   100   000    -    65535
227 Unknown_SSD_Attribute   -O--CK   100   100   000    -    30
228 Power-off_Retract_Count -O--CK   100   100   000    -    65535
232 Available_Reservd_Space PO--CK   100   100   010    -    0
233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
241 Total_LBAs_Written      -O--CK   100   100   000    -    327259
242 Total_LBAs_Read         -O--CK   100   100   000    -    146395
249 Unknown_Attribute       PO--C-   100   100   000    -    11051
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

I decided to add fields 170, 184 and 232 since these are classified as "prefailure warning" fields, and also field 232 because of its title. (But not 249, since that already has a high raw value and, like 187, is not 'auto-kept', so presumably it resets at power-on like 187 does.)
Interestingly, the only field that unRAID monitors by default that is classified as a prefailure warning field for this SSD is 187 (which it monitors incorrectly in raw mode). I'm not sure whether it is better to monitor all those fields including 187 in normalized mode, or all of them except 187 in raw mode. I'm leaning toward the latter, thinking that it will give the earliest warning possible, but on the other hand, since it is just a cache device, maybe we don't need to be concerned with the raw values.

In case anyone at Limetech sees this, the obvious improvement suggestion here is to offer the possibility to monitor some fields in raw mode and others in normalized mode. Or implement the extended attributes so field 187 can be monitored in raw mode correctly.
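The field selection described in this post can be reproduced from the FLAGS column of the smartctl -x table. This is only a sketch: it assumes the six flag positions are P, O, S, R, C, K in that order, as the legend in the quoted output shows, and the names TABLE, FLAG_ORDER and parse_rows are invented for the example.

```python
# Rows copied from the smartctl -x attribute table quoted in the post above.
TABLE = """\
  5 Reallocated_Sector_Ct   -O--CK   100   100   000    -    0
  9 Power_On_Hours          -O--CK   000   000   000    -    909937 (110 2 0)
 12 Power_Cycle_Count       -O--CK   100   100   000    -    116
170 Unknown_Attribute       PO--CK   100   100   010    -    0
171 Unknown_Attribute       -O--CK   100   100   000    -    0
172 Unknown_Attribute       -O--CK   100   100   000    -    0
174 Unknown_Attribute       -O--CK   100   100   000    -    27
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      POSR--   118   118   050    -    199827011
192 Power-Off_Retract_Count -O--CK   100   100   000    -    27
225 Unknown_SSD_Attribute   -O--CK   100   100   000    -    327259
226 Unknown_SSD_Attribute   -O--CK   100   100   000    -    65535
227 Unknown_SSD_Attribute   -O--CK   100   100   000    -    30
228 Power-off_Retract_Count -O--CK   100   100   000    -    65535
232 Available_Reservd_Space PO--CK   100   100   010    -    0
233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
241 Total_LBAs_Written      -O--CK   100   100   000    -    327259
242 Total_LBAs_Read         -O--CK   100   100   000    -    146395
249 Unknown_Attribute       PO--C-   100   100   000    -    11051
"""

# Flag positions, left to right, per the legend printed under the table.
FLAG_ORDER = "POSRCK"


def parse_rows(text):
    """Parse attribute rows into dicts with a set of active flag letters."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 7)  # the raw value may contain spaces
        if len(parts) < 8 or not parts[0].isdigit():
            continue
        flags = {letter for letter, ch in zip(FLAG_ORDER, parts[2]) if ch == letter}
        rows.append({"id": int(parts[0]), "name": parts[1],
                     "flags": flags, "raw": parts[7]})
    return rows


rows = parse_rows(TABLE)
prefail = [r["id"] for r in rows if "P" in r["flags"]]
not_kept = [r["id"] for r in rows if "K" not in r["flags"]]
print(prefail)   # [170, 184, 187, 232, 249]
print(not_kept)  # [187, 249]
```

The first list is the set of prefailure warning fields, and the second shows that 187 and 249 are the only two without the auto-keep flag, which matches the reasoning for leaving 249 out.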
This topic is now archived and is closed to further replies.