[SOLVED] Warning over increase in SMART [188] RAW VALUE remains Unchanged??


danioj


Hi Guys,

 

If the Subject doesn't do it, here goes a detailed description:

 

unRAID v6.1.6 concerning Parity Drive: ST8000AS0002-1NA17Z_Z8404KRE - 8 TB (sdh)

 

Last night I had two warning emails from my MAIN server within 5 minutes of each other, letting me know that the RAW # of S.M.A.R.T. attribute 188 had increased by 1 each time:

 

Email 1: 12:36 AM

 

unRAID Main Status: Warning [MAIN] - command timeout is 1

 

Event: unRAID Parity disk SMART health [188]

Subject: Warning [MAIN] - command timeout is 1

Description: ST8000AS0002-1NA17Z_Z8404KRE (sdh)

Importance: warning

 

Email 2: 12:41 AM

 

unRAID Main Status: Warning [MAIN] - command timeout is 2

 

Event: unRAID Parity disk SMART health [188]

Subject: Warning [MAIN] - command timeout is 2

Description: ST8000AS0002-1NA17Z_Z8404KRE (sdh)

Importance: warning

 

Still being awake I quickly checked the GUI and refreshed the S.M.A.R.T data on the disk Attributes page. This confirmed indeed that the RAW # for that attribute had increased BUT the VALUE had not changed at ALL?  :-\

 

Attributes
#	Attribute Name	Flag	Value	Worst	Threshold	Type	Updated	Failed	Raw Value
1	Raw read error rate	0x000f	120	099	006	Pre-fail	Always	Never	239904320
3	Spin up time	0x0003	090	090	000	Pre-fail	Always	Never	0
4	Start stop count	0x0032	100	100	020	Old age	Always	Never	752
5	Reallocated sector count	0x0033	100	100	010	Pre-fail	Always	Never	0
7	Seek error rate	0x000f	080	060	030	Pre-fail	Always	Never	102709815
9	Power on hours	0x0032	095	095	000	Old age	Always	Never	4962 (6m, 23d, 18h)
10	Spin retry count	0x0013	100	100	097	Pre-fail	Always	Never	0
12	Power cycle count	0x0032	100	100	020	Old age	Always	Never	3
183	Runtime bad block	0x0032	100	100	000	Old age	Always	Never	0
184	End-to-end error	0x0032	100	100	099	Old age	Always	Never	0
187	Reported uncorrect	0x0032	100	100	000	Old age	Always	Never	0
188	Command timeout	0x0032	100	099	000	Old age	Always	Never	2
189	High fly writes	0x003a	100	100	000	Old age	Always	Never	0
190	Airflow temperature cel	0x0022	075	064	045	Old age	Always	Never	25 (min/max 16/36)
191	G-sense error rate	0x0032	100	100	000	Old age	Always	Never	0
192	Power-off retract count	0x0032	100	100	000	Old age	Always	Never	24
193	Load cycle count	0x0032	100	100	000	Old age	Always	Never	961
194	Temperature celsius	0x0022	025	040	000	Old age	Always	Never	25 (0 16 0 0 0)
195	Hardware ECC recovered	0x001a	120	099	000	Old age	Always	Never	239904320
197	Current pending sector	0x0012	100	100	000	Old age	Always	Never	0
198	Offline uncorrectable	0x0010	100	100	000	Old age	Offline	Never	0
199	UDMA CRC error count	0x003e	200	200	000	Old age	Always	Never	206
240	Head flying hours	0x0000	100	253	000	Old age	Offline	Never	224940322195149
241	Total lbas written	0x0000	100	253	000	Old age	Offline	Never	82503073336
242	Total lbas read	0x0000	100	253	000	Old age	Offline	Never	193686638341

 

Now normally I would look at that report, ignore the RAW # for that attribute, glance at the VALUE, see that it is unchanged at a healthy 100 with a WORST of 099, and conclude that the drive is fine and there is no issue. Given the VALUE is still 100, everything is perfecto mundo!
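
To make the VALUE versus RAW distinction concrete, here is a minimal Python sketch that parses one tab-separated row copied from the Attributes table above and compares the normalised VALUE with the THRESHOLD. The `SmartRow` type and the `parse_row` and `below_threshold` helpers are illustrative names invented for this example; they are not part of unRAID or smartmontools.

```python
from typing import NamedTuple


class SmartRow(NamedTuple):
    attr_id: int
    name: str
    value: int      # normalised value (higher is better)
    worst: int      # lowest normalised value recorded so far
    threshold: int  # vendor failure threshold
    raw: str        # vendor-specific raw field, kept as text


def parse_row(line: str) -> SmartRow:
    """Parse one tab-separated row laid out like the GUI table above."""
    f = line.split("\t")
    # Columns: #, name, flag, value, worst, threshold, type, updated, failed, raw
    return SmartRow(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), f[9])


def below_threshold(row: SmartRow) -> bool:
    """True when the normalised value has dropped to or below a non-zero threshold."""
    return row.threshold > 0 and row.value <= row.threshold


row = parse_row("188\tCommand timeout\t0x0032\t100\t099\t000\tOld age\tAlways\tNever\t2")
print(row.raw, row.value, row.worst, below_threshold(row))
```

The RAW field changed from 0 to 2 here while VALUE stayed at 100, which is exactly the situation described above: the raw counter moved, but the normalised health figure did not.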

 

I felt OK with that and went to bed!  :)

 

However, after thinking about it this morning, I am slightly concerned that unRAID felt the need to WARN me about the increase?  :o This must mean that there is a rule in the notification logic which is specifically there to capture this increase and warn the user of it, presumably so they can do something about it or monitor the increases. Either way, that is not the "ignore it" attitude I was going to apply.  :-\

 

So, given unRAID's reaction to this increase does this mean that my assessment is incorrect and I should be worried and that there is a corrective action here?

 

Any comments or suggestions are welcome. Thanks in advance everyone.  :)

 

Daniel

Link to comment

If it's a Seagate drive then see the dozens of other threads about the SMART field for Command Timeout.

 

Thanks for replying. I knew that the RAW value was nothing to worry about as I'd been following those threads in the background for a while.

 

My issue was why unRAID was notifying me that it's an issue. Was there something I was missing that the notification logic was written to capture? I thought I had misunderstood. I hadn't seen this thread from 2 weeks ago though: http://lime-technology.com/forum/index.php?topic=44269.msg422480#msg422480

 

Seems the question by gundamguy from this thread (which was unanswered) is still the pertinent one given the likelihood of Seagate doing something about it is low to nil and people (including myself) are not going to stop buying Seagates.

 

It's not an issue. Seagate just reports the attribute differently than WD. Just disable monitoring of that attribute for those drives.

 

The issue is that it's not an issue but being treated as one. People not in the know (many users) don't know that it's not actually an issue and can be safely ignored.

 

There might need to be more logic about the way smart values are captured and reported to prevent confusion.

 

There's nothing LT can do about this one though.  Seagate drives report it differently, therefore will need to contact Seagate and see if they'll change...

 

I realize that this is an issue with Seagate, and I realize they aren't going to fix it. So why not have some logic in the smart notification code that ignores 188 if it's from a Seagate drive? Is that adding more issues?

 

*danioj smells feature / fix request??*
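
The rule proposed above (skip attribute 188 when the drive is a Seagate) could look something like the following Python sketch. It only illustrates the idea: `should_notify`, the `IGNORED` table, and the `ST` model-prefix test are assumptions made for this example, not unRAID's actual notification code.

```python
# Attribute IDs to skip, keyed by model-number prefix (hypothetical table).
# "ST" is the prefix of the Seagate model in this thread (ST8000AS0002).
IGNORED = {"ST": {188}}


def should_notify(model: str, attr_id: int, old_raw: int, new_raw: int) -> bool:
    """Return True if a raw-value increase should raise a warning."""
    if new_raw <= old_raw:
        return False  # nothing increased, nothing to report
    for prefix, attrs in IGNORED.items():
        if model.startswith(prefix) and attr_id in attrs:
            return False  # attribute is on the ignore list for this vendor
    return True


print(should_notify("ST8000AS0002-1NA17Z", 188, 1, 2))
print(should_notify("ST8000AS0002-1NA17Z", 199, 206, 207))
```

A per-vendor table keeps the exception narrow: other attributes on the same drive, and attribute 188 on other vendors' drives, would still raise warnings.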

Link to comment

OK - I'm going to have to re-open this thread. I know the advice was not to worry and ignore BUT I just couldn't do it and I am glad I didn't.

 

I continued to get the odd email notice about the Command Timeout (because I hadn't set it to ignore yet) and I have now noticed that the VALUE is also decreasing. At first only the RAW # was increasing (as posted above) while the VALUE remained the same; now the emails continue, but I also have this ....

 

188	Command timeout	0x0032	096	096	000	Old age	Always	Never	10

 

when 2 days ago it was ....

 

188	Command timeout	0x0032	100	099	000	Old age	Always	Never	2

 

Now it seems that people suggest that this can be because of an old cable or an old drive but the drive is less than a year old and the cable the same. In fact the whole setup is only about 8 months old.

 

Would appreciate any advice on what course of action others would take right now. Clearly it can't be ignored though, surely!

Link to comment

For that specific attribute, Yes ... I would just ignore it.  If you want to be sure it's not just a poorly seated cable, you could shut down; replace SATA cable; and then reboot ... but the reality is these are known anomalies with Seagate drives and simply nothing to be concerned about.

 

As Bonienl noted, the parameter's not even going to be monitored in the next release.

 

 

 

Link to comment

Fair enough. I clearly misunderstood. I thought we were only ignoring the RAW # of that attribute not the attribute as a whole.

 

*clicks ignore on the S.M.A.R.T Settings Page for this attribute*

 

*Sets thread label back to SOLVED*

 

Thank You!

Link to comment

Sigh. Sorry to keep going with this thread BUT the weirdness continues. There HAS to be something wrong here. The Command Timeout Value has now gone back to BETTER values than it had before. I've only ever seen that after a pre_clear!??

 

So, we have gone from

 

188	Command timeout	0x0032	100	099	000	Old age	Always	Never	2

 

to

 

188	Command timeout	0x0032	096	096	000	Old age	Always	Never	10

 

and now ...

 

188	Command timeout	0x0032	100	096	000	Old age	Always	Never	11

 

Firstly, I know you guys are saying ignore this, and I am sure some will say the increase in the VALUE again is just another example of this attribute and its weird behaviour. But to almost re-baseline itself like this is weird. Putting that aside for a second, coupled with that I have now noticed that this drive (which happens to be my parity drive) is registering errors on the Array too:

 

Device	Identification	Temp.	Reads	Writes	Errors	FS	Size	Used	Free	View
Parity	ST8000AS0002-1NA17Z_Z8404KRE - 8 TB (sdh)	32 C	20,343,936	1,578,255	59,472

 

Yesterday my scheduled monthly Parity Check completed fine:

 

Event: unRAID Parity check
Subject: Notice [MAIN] - Parity check finished (0 errors)
Description: Duration: unavailable (no parity-check entries logged)
Importance: normal

 

I feel there is something weird happening with this drive. What would others do (assuming you are going to say ignore the command timeout variable again) with these parity disk errors? I have checked the S.M.A.R.T report and all looks ok and there are no S.M.A.R.T errors to note (other than the one I am to ignore of course).

 

However, I now have masses of these lines in my Syslog:

 

18:15:48 main kernel: md: disk0 read error, sector=xxxxxxx

 

Which of course is the log of the 59,472 errors I have now noted in the GUI. Interesting that I had no warning from unRAID about these errors (but anyway, that's a discussion for another day).
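
For anyone wanting to tally these lines rather than scroll through them, here is a small Python sketch that counts `md: diskN read error` lines per disk. The regular expression follows the syslog line quoted above; the `count_read_errors` helper and the sector numbers in the sample are made up for illustration (the real ones were masked in the post).

```python
import re
from collections import Counter

# Matches lines such as: "18:15:48 main kernel: md: disk0 read error, sector=..."
READ_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\S+)")


def count_read_errors(lines):
    """Count md read-error lines per disk name."""
    counts = Counter()
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "18:15:48 main kernel: md: disk0 read error, sector=1000",  # hypothetical sector
    "18:15:48 main kernel: md: disk0 read error, sector=1008",  # hypothetical sector
    "18:15:49 main kernel: some unrelated message",
]
print(count_read_errors(sample))
```

To run it against a real log, pass an open file object instead of `sample`, since the function only needs an iterable of lines.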

 

What would others do here?

syslog.txt.zip

ST8000AS0002-1NA17Z_Z8404KRE-20160102-1818.txt

Link to comment

I, on the other hand, just think this is a Seagate drive with Seagate firmware acting like a typical Seagate product and reporting Seagate nonsense values for certain SMART aspects, as far as SMART attribute 188 is concerned.

 

Now those read errors on the other hand, are something to be concerned about.

 

I would start by at least performing a SMART short test on that drive and see what the results are, if it can even finish the test.
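
For reference, a short test can also be started from the command line with smartmontools. The Python sketch below only builds and prints the two `smartctl` commands involved (start a short self-test, then read the self-test log). The helper names are invented for this example, and `/dev/sdh` is the device name from this thread, so substitute your own. Actually running them needs root and an installed `smartctl`.

```python
import subprocess


def short_test_cmd(device):
    """Command that starts a SMART short self-test on the device."""
    return ["smartctl", "-t", "short", device]


def selftest_log_cmd(device):
    """Command that prints the drive's self-test log."""
    return ["smartctl", "-l", "selftest", device]


def run(cmd):
    """Run a command and return its stdout (not called in this sketch)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


# Print the commands instead of running them, so the sketch is safe to try anywhere.
print(" ".join(short_test_cmd("/dev/sdh")))
print(" ".join(selftest_log_cmd("/dev/sdh")))
```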

Link to comment

Lol - ok I don't usually believe in coincidences but I'll go with it re att #188.

 

Just ran a Short Test on the drive: "Completed without error"

 

No new errors in the syslog. What do you think, long test??

 

 

Link to comment

Which of course is the log of the 59,472 errors I have now noted in the GUI. Interesting that I had no warning from unRAID about these errors (but anyway, that's a discussion for another day).

 

When Array status notification is enabled, it generates a report at the given interval and gives a warning notification when disks have read errors.

 

Footnote: System notifications need to be enabled in the first place to receive notifications  ;D

Link to comment

Sorry, it turns out I DID get an email about the errors. My notification settings are spot on and are clearly working. I only got 1 notification email though, which I assume is down to how often these error reports are sent by email. It would be stupid to send one every time there was 1 error. LOL - 59k emails!  ;)

 

Well, when they started anyway:

 

Event: unRAID array errors

Subject: Warning [MAIN] - array has errors

Description: Array has 1 disk with read errors

Importance: warning

 

Parity disk - ST8000AS0002-1NA17Z_Z8404KRE (sdh) (errors 256)

 

It turns out the disk started to get these errors when it was in the middle of this month's Parity Check.

Screen_Shot_2016-01-02_at_7_16.47_PM.png.1493c539fde0bd0b8eb323280b22bed9.png

Link to comment

What is the smart report? Any values look off?

 

S.M.A.R.T report looks spot on as far as I am concerned. Perfecto Mundo ....

 

Unless I have missed something????

 

My personal experience with this is that when read errors occur and at the same time the disk SMART report shows current pending sectors and/or reallocated sectors, it is better to replace the disk. Otherwise do a long test, but it is mostly safe to keep using the disk.

 

Link to comment

That sounds reasonable. Given I don't have pending and/or reallocated sectors, I have just started the long test. Let's see what it comes back with ....

 

"O self-test in progress, 10% complete..."

Link to comment

Your S.M.A.R.T. data looks fine => as already noted, Seagate reports a lot of raw values that are meaningless for the end user; and the Command Timeout issue is well known and can simply be ignored.

 

Personally, the ONLY things I watch for are pending sectors (very bad) and the reallocated sector count. Reallocated sectors are NOT bad by themselves ... modern drives are DESIGNED to have spare sectors to use in place of any sectors that are found to be unreliable/defective. What's bad is if a sector deteriorates to the point where the current data on it can't be recovered and reallocated, which puts it in the "pending" category ... something you definitely don't want. I replace any drive that has ANY pending sector. For virtually all other parameters, all I look at is the Value column, to see if it's deteriorating significantly; some of those are also known to drop a lot on Seagates [e.g. High Fly Writes and Seek Error Rate], but are nothing to worry about.
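
The rule of thumb above can be written down as a tiny Python sketch. It takes raw values keyed by SMART attribute ID (5 is Reallocated sector count and 197 is Current pending sector, as in the table earlier in the thread). The `triage` function and its three labels are invented for this illustration and encode only the personal policy described in this post, not any official guidance.

```python
def triage(raw_by_id):
    """Classify a drive from raw SMART values keyed by attribute ID."""
    pending = raw_by_id.get(197, 0)      # Current pending sector
    reallocated = raw_by_id.get(5, 0)    # Reallocated sector count
    if pending > 0:
        return "replace"  # any pending sector at all: replace the drive
    if reallocated > 0:
        return "watch"    # spares in use: not bad by itself, keep an eye on it
    return "ok"


# Raw values for the parity drive in this thread: no pending or reallocated
# sectors, Command timeout (188) at 11.
print(triage({5: 0, 197: 0, 188: 11}))
```

Note that attribute 188 plays no part in the decision, which matches the advice given here about whether to replace a drive.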

 

 

 

 

Link to comment

Looking at your earlier SMART reports, you do appear to have a faulty SATA cable, and with so many CRC errors I wouldn’t be surprised if it is the cause of the read errors you’re seeing:

 

December 29th

199 UDMA_CRC_Error_Count    0x003e   200   191   000    Old_age   Always       -       206

 

January 1st

199 UDMA_CRC_Error_Count    0x003e   200   191   000    Old_age   Always       -       64717

 

Today

199 UDMA_CRC_Error_Count    0x003e   200   191   000    Old_age   Always       -       66564

 

An increase of more than 1 in quick succession usually indicates a bad cable.
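
The jump is easier to see as differences between reports. Here is a short Python sketch using the three raw values quoted above; the `raw_deltas` helper is just an illustrative name.

```python
# Raw values of attribute 199 (UDMA CRC error count) from the three reports above.
snapshots = [
    ("December 29th", 206),
    ("January 1st", 64717),
    ("Today", 66564),
]


def raw_deltas(snaps):
    """Return (label, increase since the previous snapshot) for each later snapshot."""
    return [
        (label, cur - prev)
        for (_, prev), (label, cur) in zip(snaps, snaps[1:])
    ]


for label, delta in raw_deltas(snapshots):
    print(f"{label}: +{delta}")
```

That works out to an increase of 64,511 between December 29th and January 1st, and a further 1,847 by the time of this post.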

Link to comment

Interesting Interesting Interesting ..... It appears that my error issue has GONE! Command Timeout variable has stopped increasing, no more unRAID errors in the Syslog and things are looking peachy!

 

What did I do? I reseated the SATA cable: I pulled it out and plugged it back into the drive and then the motherboard.

 

So it appears it WAS a SATA cable issue.

 

However, it brings me back to the thing that originally got me looking into things and that was the Command Timeout variable. It appears (based on my research) that the command timeout variable can increase as a result of a bad SATA cable. Now granted there were other variables in this equation too such as UDMA_CRC_Error_Count and also the Read Errors reported by unRAID.

 

However, I feel like the reason I found this issue is the Command Timeout increase. I guess my point is, I don't think it should just be ignored. It has certainly helped me find an issue here which might otherwise have taken me longer to find.

 

JM2C.

 

Thank you to all those who took the time to reply and help me with this issue. I appreciate it.

Link to comment

It did help you find an "issue" -- and it was certainly worth reseating the cable ... but the reality is it didn't have any negative impact => the system still ran fine, and there were no data integrity issues (the drive was simply running a tad slower due to frequent needs to reissue commands to it).  The system would have continued to work just fine without doing anything.

 

So it's not a parameter you want to use when deciding whether or not to replace a drive.

 

Link to comment
