danioj Posted December 29, 2015

Hi Guys,

If the Subject doesn't do it, here goes a detailed description:

unRAID v6.1.6
Concerning Parity Drive: ST8000AS0002-1NA17Z_Z8404KRE - 8 TB (sdh)

Last night I had two warning emails from my MAIN server within 5 mins of each other letting me know that the RAW # of S.M.A.R.T attribute # 188 had increased by 1 each time:

Email 1: 12:36 AM
unRAID Main Status: Warning [MAIN] - command timeout is 1
Event: unRAID Parity disk SMART health [188]
Subject: Warning [MAIN] - command timeout is 1
Description: ST8000AS0002-1NA17Z_Z8404KRE (sdh)
Importance: warning

Email 2: 12:41 AM
unRAID Main Status: Warning [MAIN] - command timeout is 2
Event: unRAID Parity disk SMART health [188]
Subject: Warning [MAIN] - command timeout is 2
Description: ST8000AS0002-1NA17Z_Z8404KRE (sdh)
Importance: warning

Still being awake, I quickly checked the GUI and refreshed the S.M.A.R.T data on the disk Attributes page. This confirmed that the RAW # for that attribute had indeed increased BUT the VALUE had not changed at ALL?
Attributes

  #  Attribute Name            Flag    Value  Worst  Threshold  Type      Updated  Failed  Raw Value
  1  Raw read error rate       0x000f  120    099    006        Pre-fail  Always   Never   239904320
  3  Spin up time              0x0003  090    090    000        Pre-fail  Always   Never   0
  4  Start stop count          0x0032  100    100    020        Old age   Always   Never   752
  5  Reallocated sector count  0x0033  100    100    010        Pre-fail  Always   Never   0
  7  Seek error rate           0x000f  080    060    030        Pre-fail  Always   Never   102709815
  9  Power on hours            0x0032  095    095    000        Old age   Always   Never   4962 (6m, 23d, 18h)
 10  Spin retry count          0x0013  100    100    097        Pre-fail  Always   Never   0
 12  Power cycle count         0x0032  100    100    020        Old age   Always   Never   3
183  Runtime bad block         0x0032  100    100    000        Old age   Always   Never   0
184  End-to-end error          0x0032  100    100    099        Old age   Always   Never   0
187  Reported uncorrect        0x0032  100    100    000        Old age   Always   Never   0
188  Command timeout           0x0032  100    099    000        Old age   Always   Never   2
189  High fly writes           0x003a  100    100    000        Old age   Always   Never   0
190  Airflow temperature cel   0x0022  075    064    045        Old age   Always   Never   25 (min/max 16/36)
191  G-sense error rate        0x0032  100    100    000        Old age   Always   Never   0
192  Power-off retract count   0x0032  100    100    000        Old age   Always   Never   24
193  Load cycle count          0x0032  100    100    000        Old age   Always   Never   961
194  Temperature celsius       0x0022  025    040    000        Old age   Always   Never   25 (0 16 0 0 0)
195  Hardware ECC recovered    0x001a  120    099    000        Old age   Always   Never   239904320
197  Current pending sector    0x0012  100    100    000        Old age   Always   Never   0
198  Offline uncorrectable     0x0010  100    100    000        Old age   Offline  Never   0
199  UDMA CRC error count      0x003e  200    200    000        Old age   Always   Never   206
240  Head flying hours         0x0000  100    253    000        Old age   Offline  Never   224940322195149
241  Total lbas written        0x0000  100    253    000        Old age   Offline  Never   82503073336
242  Total lbas read           0x0000  100    253    000        Old age   Offline  Never   193686638341

Now normally I would look at that report, ignore the RAW # for that Attribute, glance at the VALUE, and see that it is unchanged and reports a healthy VALUE of 100 and a worst of 099.
Job's a good one: the disk is fine. My conclusion would normally be that there is not an issue. Given the actual VALUEs are all 100, everything is perfecto mundo! I felt OK with that and went to bed!

However, after thinking about it this morning, I am slightly concerned that unRAID felt the need to WARN me about the increase. This must mean that there is a rule in the notification logic which is there specifically to capture this increase and warn the user of it, presumably so they can do something about it or monitor the increases. Either way, that is not the "ignore it" attitude I was going to apply.

So, given unRAID's reaction to this increase, does this mean that my assessment is incorrect, that I should be worried, and that there is a corrective action here?

Any comments or suggestions are welcome. Thanks in advance everyone.

Daniel
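Editor's note: the distinction the post relies on (normalized VALUE versus RAW counter) can be sketched in a few lines. This is only an illustration under the usual SMART convention that an attribute counts as failing when its normalized VALUE drops to or below its THRESHOLD; the class and function names are hypothetical, and treating a threshold of 0 as "never trips" is an assumption, not something stated in the thread.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    attr_id: int
    name: str
    value: int      # normalized value (higher is healthier)
    worst: int      # lowest normalized value recorded so far
    threshold: int  # vendor-defined failure threshold
    raw: int        # vendor-specific raw counter


def is_failing(attr: SmartAttribute) -> bool:
    """True when the normalized value has reached the threshold.

    Assumption: a threshold of 0 is treated as 'never trips'.
    """
    return attr.threshold > 0 and attr.value <= attr.threshold


# Attribute 188 exactly as reported in the table above.
command_timeout = SmartAttribute(188, "Command timeout", 100, 99, 0, 2)
print(is_failing(command_timeout))  # False
```

By this rule the RAW increase from 1 to 2 changes nothing; only VALUE and THRESHOLD enter the comparison.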
BRiT Posted December 29, 2015

If it's a Seagate drive then see the dozens of other threads about the SMART field for Command Timeout.
danioj Posted December 29, 2015 (Author)

Quote (BRiT): If it's a Seagate drive then see the dozens of other threads about the SMART field for Command Timeout.

Thanks for replying. I knew that the RAW value was nothing to worry about as I'd been following those threads in the background for a while. My issue was why unRAID was notifying me that it's an issue. Was there something I was missing that the notification logic was written to capture? I thought I had misunderstood.

I hadn't seen this thread from 2 weeks ago though:

http://lime-technology.com/forum/index.php?topic=44269.msg422480#msg422480

Seems the question by gundamguy from this thread (which was unanswered) is still the pertinent one, given the likelihood of Seagate doing something about it is low to nil and people (including myself) are not going to stop buying Seagates.

Quote: It's not an issue. Just Seagate reports the attribute differently than WD. Just disable monitoring of that attribute for those drives.

Quote: The issue is that it's not an issue but being treated as one. People not in the know (many users) don't know that it's not actually an issue and can be safely ignored. There might need to be more logic about the way SMART values are captured and reported to prevent confusion.

Quote: There's nothing LT can do about this one though. Seagate drives report it differently, therefore will need to contact Seagate and see if they'll change...

Quote: I realize that this is an issue with Seagate, and I realize they aren't going to fix it. So why not have some logic in the SMART notification code that ignores 188 if it's from a Seagate drive? Is that adding more issues?

* danioj smells feature / fix request??
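Editor's note: the per-vendor filter asked about in the quoted thread might look roughly like the sketch below. Everything here is hypothetical: the function names, the ignore set, and especially the vendor check (assuming a model string starting with "ST" means Seagate) are illustrative assumptions and not unRAID's notification code.

```python
# Attributes to stay quiet about on Seagate drives (assumption: only 188).
SEAGATE_IGNORED_ATTRIBUTES = {188}


def looks_like_seagate(model: str) -> bool:
    # Assumption: Seagate model strings begin with "ST",
    # as in ST8000AS0002 from this thread.
    return model.upper().startswith("ST")


def should_notify(model: str, attr_id: int) -> bool:
    """Decide whether a change in this attribute should raise a warning."""
    if looks_like_seagate(model) and attr_id in SEAGATE_IGNORED_ATTRIBUTES:
        return False
    return True


print(should_notify("ST8000AS0002-1NA17Z", 188))  # False
print(should_notify("ST8000AS0002-1NA17Z", 197))  # True
```

A prefix check like this is fragile, which may be one reason a simple per-attribute on/off setting is the easier route.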
bonienl Posted December 29, 2015

The solution is simple, in the next release the setting for this attribute defaults to OFF.
danioj Posted December 29, 2015 (Author)

Quote (bonienl): The solution is simple, in the next release the setting for this attribute defaults to OFF.

Cool, thanks for the reply bonienl.
danioj Posted January 1, 2016 (Author)

OK - I'm going to have to re-open this thread. I know the advice was not to worry and to ignore it BUT I just couldn't do it and I am glad I didn't.

I continued to get the odd email notice about the Command Timeout (because I hadn't set it to ignore yet) and I have now noticed that the VALUE is also decreasing. At first it was just RAW that was increasing (as posted above) while VALUE remained the same. Now the emails continue but I also have this ....

188 Command timeout 0x0032 096 096 000 Old age Always Never 10

when 2 days ago it was ....

188 Command timeout 0x0032 100 099 000 Old age Always Never 2

Now it seems that people suggest this can be because of an old cable or an old drive, but the drive is less than a year old and the cable the same. In fact the whole setup is only about 8 months old.

I would appreciate any advice on what course of action others would take right now. Clearly it can't be ignored though, surely!
garycase Posted January 1, 2016

For that specific attribute, Yes ... I would just ignore it. If you want to be sure it's not just a poorly seated cable, you could shut down; replace SATA cable; and then reboot ... but the reality is these are known anomalies with Seagate drives and simply nothing to be concerned about. As Bonienl noted, the parameter's not even going to be monitored in the next release.
danioj Posted January 1, 2016 (Author)

Quote (garycase): For that specific attribute, Yes ... I would just ignore it. If you want to be sure it's not just a poorly seated cable, you could shut down; replace SATA cable; and then reboot ... but the reality is these are known anomalies with Seagate drives and simply nothing to be concerned about. As Bonienl noted, the parameter's not even going to be monitored in the next release.

Fair enough. I clearly misunderstood. I thought we were only ignoring the RAW # of that attribute, not the attribute as a whole.

*clicks ignore on the S.M.A.R.T Settings Page for this attribute*

*Sets thread label back to SOLVED*

Thank You!
danioj Posted January 2, 2016 (Author)

Sigh. Sorry to keep going with this thread BUT the weirdness continues. There HAS to be something wrong here.

The Command Timeout VALUE has now gone back to BETTER values than it had before. I've only ever seen that after a pre_clear!??

So, we have gone from

188 Command timeout 0x0032 100 099 000 Old age Always Never 2

to

188 Command timeout 0x0032 096 096 000 Old age Always Never 10

and now ...

188 Command timeout 0x0032 100 096 000 Old age Always Never 11

Firstly, I know you guys are saying ignore this, and I am sure some will say the increase in the VALUE again is just another example of this attribute and its weird behaviour. But to almost re-baseline itself like this is weird.

Putting that aside for a second, coupled with that I have now noticed that this drive (which happens to be my parity drive) is registering errors on the Array too:

Device: Parity
Identification: ST8000AS0002-1NA17Z_Z8404KRE - 8 TB (sdh)
Temp.: 32 C
Reads: 20,343,936
Writes: 1,578,255
Errors: 59,472

Yesterday my scheduled monthly Parity Check completed fine:

Event: unRAID Parity check
Subject: Notice [MAIN] - Parity check finished (0 errors)
Description: Duration: unavailable (no parity-check entries logged)
Importance: normal

I feel there is something weird happening with this drive. What would others do (assuming you are going to say ignore the command timeout variable again) with these parity disk errors?

I have checked the S.M.A.R.T report and all looks OK, and there are no S.M.A.R.T errors to note (other than the one I am to ignore, of course). However, I now have masses of these lines in my Syslog:

18:15:48 main kernel: md: disk0 read error, sector=xxxxxxx

Which of course is the log of the 59,472 errors I have now noted in the GUI. Interesting that I had no Warning from unRAID as to these errors (but anyway, that's a discussion for another day).

What would others do here?
Attachments: syslog.txt.zip, ST8000AS0002-1NA17Z_Z8404KRE-20160102-1818.txt
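Editor's note: rather than eyeballing "masses of these lines", the read errors can be tallied per disk with a short script. This is a minimal sketch that assumes every error line follows the exact `md: diskN read error, sector=...` shape quoted in the post; the sector numbers in the sample are made-up placeholders, since the post elides the real ones.

```python
import re
from collections import Counter

# Matches lines shaped like the one quoted in the post above.
READ_ERROR = re.compile(r"md: disk(\d+) read error, sector=(\d+)")


def count_read_errors(lines):
    """Return a Counter mapping md disk number -> number of read-error lines."""
    counts = Counter()
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Placeholder sample; real input would be the lines of the syslog file.
sample = [
    "18:15:48 main kernel: md: disk0 read error, sector=1000",
    "18:15:48 main kernel: md: disk0 read error, sector=1008",
    "18:15:49 main kernel: some unrelated message",
]
print(count_read_errors(sample))  # Counter({0: 2})
```

Here disk0 corresponds to the parity disk, as in the poster's log line.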
BRiT Posted January 2, 2016

I, on the other hand, just think this is a Seagate drive with Seagate firmware acting like a typical Seagate product and reporting Seagate nonsense values for certain SMART aspects, as far as SMART Attribute 188 is concerned.

Now those read errors, on the other hand, are something to be concerned about. I would start by at least performing a SMART short test on that drive and see what the results are, if it can even finish the test.
danioj Posted January 2, 2016 (Author)

Quote (BRiT): I, on the other hand, just think this is a Seagate drive with Seagate firmware acting like a typical Seagate product and reporting Seagate nonsense values for certain SMART aspects, as far as SMART Attribute 188 is concerned. Now those read errors, on the other hand, are something to be concerned about. I would start by at least performing a SMART short test on that drive and see what the results are, if it can even finish the test.

Lol - OK, I don't usually believe in coincidences but I'll go with it re att #188.

Just ran a Short Test on the drive:

"Completed without error"

No new errors in the syslog. What do you think, long test??
bonienl Posted January 2, 2016

Quote (danioj): Which of course is the log of the 59,472 errors I have now noted in the GUI. Interesting that I had no Warning from unRAID as to these errors (but anyway, that's a discussion for another day).

When Array status notification is enabled, it generates a report at the given interval and gives a warning notification when disks have read errors.

Footnote: System notifications need to be enabled in the first place to receive notifications.
BRiT Posted January 2, 2016

What is the smart report? Any values look off?
danioj Posted January 2, 2016 (Author)

Quote (danioj): Which of course is the log of the 59,472 errors I have now noted in the GUI. Interesting that I had no Warning from unRAID as to these errors (but anyway, that's a discussion for another day).

Quote (bonienl): When Array status notification is enabled, it generates a report at the given interval and gives a warning notification when disks have read errors. Footnote: System notifications need to be enabled in the first place to receive notifications.

Sorry, it turns out I DID get an email about the errors. My notification settings are spot on and are clearly working. I only got 1 notification email though, but I assume this is due to the frequency at which these error reports are sent via email. It would be stupid to send them every time there was 1 error. LOL - 57k emails!

Well, when they started anyway:

Event: unRAID array errors
Subject: Warning [MAIN] - array has errors
Description: Array has 1 disk with read errors
Importance: warning
Parity disk - ST8000AS0002-1NA17Z_Z8404KRE (sdh) (errors 256)

It turns out the disk started to get these errors when it was in the middle of this month's Parity Check.
danioj Posted January 2, 2016 (Author)

Quote (BRiT): What is the smart report? Any values look off?

S.M.A.R.T report looks spot on as far as I am concerned. Perfecto Mundo .... Unless I have missed something?

Attachment: ST8000AS0002-1NA17Z_Z8404KRE-20160102-1921.txt
bonienl Posted January 2, 2016

Quote (danioj): ... Would be stupid to send them every time there was 1 error. LOL - 57k emails!

If you don't get tired of receiving emails, you can up the status check frequency to every hour.
bonienl Posted January 2, 2016

Quote (BRiT): What is the smart report? Any values look off?

Quote (danioj): S.M.A.R.T report looks spot on as far as I am concerned. Perfecto Mundo .... Unless I have missed something?

My personal experience with this is that when read errors occur and at the same time the disk SMART report shows current pending sectors and/or reallocated sectors, it is better to replace the disk. Otherwise do a long test, but it is mostly safe to keep using the disk.
danioj Posted January 2, 2016 (Author)

Quote (BRiT): What is the smart report? Any values look off?

Quote (danioj): S.M.A.R.T report looks spot on as far as I am concerned. Perfecto Mundo .... Unless I have missed something?

Quote (bonienl): My personal experience with this is that when read errors occur and at the same time the disk SMART report shows current pending sectors and/or reallocated sectors, it is better to replace the disk. Otherwise do a long test, but it is mostly safe to keep using the disk.

That sounds reasonable. Given I don't have pending and/or reallocated sectors, I have just started the long test. Let's see what it comes back with ....

"Self-test in progress, 10% complete..."
JorgeB Posted January 2, 2016

199 UDMA_CRC_Error_Count 0x003e 200 191 000 Old_age Always - 66564

This usually indicates a bad SATA cable. Monitor it, and replace the cable if it continues to increase.
garycase Posted January 2, 2016

Your S.M.A.R.T. data looks fine => as already noted, Seagate reports a lot of raw values that are meaningless for the end user; and the Command Timeout issue is well known and can simply be ignored.

Personally, the ONLY things I watch for are pending sectors (very bad) and the reallocated sector count. Reallocated sectors are NOT by themselves bad ... modern drives are DESIGNED to have spare sectors to use instead of any sectors that are found to be unreliable/defective. What's bad is if a sector deteriorates to the point where the current data on it can't be recovered and reallocated -- which puts it in the "pending" category ... something you definitely don't want. I replace any drive that has ANY pending sector.

For virtually all other parameters, all I look at is the Value column, to see if it's deteriorating significantly, and there are some of those that are also known to drop a lot on Seagates [e.g. High Fly Writes and Seek Error Rate], but are nothing to worry about.
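Editor's note: garycase's rule of thumb can be written down as a tiny decision function. This is a sketch of one poster's personal policy, not a general standard; the function name and the three labels are hypothetical, and the inputs are the RAW values of attributes 197 (Current pending sector) and 5 (Reallocated sector count).

```python
def triage(pending_sectors: int, reallocated_sectors: int) -> str:
    """Classify a drive using the policy described in the post above.

    pending_sectors:     RAW value of attribute 197
    reallocated_sectors: RAW value of attribute 5
    """
    if pending_sectors > 0:
        # Any pending sector at all means the drive gets replaced.
        return "replace"
    if reallocated_sectors > 0:
        # Reallocations alone are not bad, but the count is worth watching.
        return "watch"
    return "ok"


# The parity drive in this thread reports 0 for both attributes.
print(triage(pending_sectors=0, reallocated_sectors=0))  # ok
```

Attribute 188 deliberately plays no part in the decision.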
JorgeB Posted January 2, 2016

Looking at your earlier SMART reports, you do appear to have a faulty SATA cable, and with so many CRC errors I wouldn't be surprised if it is the cause of the read errors you're seeing:

December 29th
199 UDMA_CRC_Error_Count 0x003e 200 191 000 Old_age Always - 206

January 1st
199 UDMA_CRC_Error_Count 0x003e 200 191 000 Old_age Always - 64717

Today
199 UDMA_CRC_Error_Count 0x003e 200 191 000 Old_age Always - 66564

An increase of more than 1 in quick succession usually indicates a bad cable.
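Editor's note: the growth JorgeB points at is easier to see as differences between consecutive readings. A minimal sketch using the three RAW values of attribute 199 quoted above; "Today" is taken to be January 2, 2016, the date of the post, and the function name is hypothetical.

```python
# (date, RAW value of attribute 199) as quoted in the post above.
readings = [
    ("2015-12-29", 206),
    ("2016-01-01", 64717),
    ("2016-01-02", 66564),
]


def crc_deltas(samples):
    """Return (date, increase since previous reading) for each later sample."""
    return [
        (later_date, later_raw - earlier_raw)
        for (_, earlier_raw), (later_date, later_raw) in zip(samples, samples[1:])
    ]


for date, delta in crc_deltas(readings):
    print(date, delta)
# 2016-01-01 64511
# 2016-01-02 1847
```

A counter that had stopped at 206 would be of little concern; it is the continued climb that points at the cable.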
danioj Posted January 5, 2016 (Author)

Interesting Interesting Interesting .....

It appears that my error issue has GONE! The Command Timeout variable has stopped increasing, there are no more unRAID errors in the Syslog and things are looking peachy!

What did I do? I reseated the SATA cable, as in, I pulled it out and put it back into the drive and then the motherboard. So it appears it WAS a SATA cable issue.

However, it brings me back to the thing that originally got me looking into things, and that was the Command Timeout variable. It appears (based on my research) that the command timeout variable can increase as a result of a bad SATA cable. Now granted, there were other variables in this equation too, such as UDMA_CRC_Error_Count and also the Read Errors reported by unRAID. However, I feel like the reason I found this issue is the Command Timeout increase.

I guess my point is, I don't think it should just be ignored. It has certainly helped me find an issue here which might have taken me longer to find. JM2C.

Thank you to all those who took the time to reply and help me with this issue. I appreciate it.
garycase Posted January 6, 2016

It did help you find an "issue" -- and it was certainly worth reseating the cable ... but the reality is it didn't have any negative impact => the system still ran fine, and there were no data integrity issues (the drive was simply running a tad slower due to frequent needs to reissue commands to it). The system would have continued to work just fine without doing anything. So it's not a parameter you want to use when deciding whether or not to replace a drive.