disk error alert after gui plugin upgrade

mostlydave · January 22, 2015

I installed the new GUI update this morning and email notifications started working, but now I'm getting this error every minute or so:

unRAID Cache disk SMART health [188]: 01-22-2015 07:34 AM

Warning: command timeout is 21475229705

ST31000528AS_5VP714HH (sdc)

unRAID Cache disk SMART health [195] Warning: hardware ecc recovered is 70793036 ST31000528AS_5VP714HH (sdc) warning

Everything looks good in the disk settings; it passed the short SMART test, and the extended test is running now.

Do I need to replace this drive? Is there a way to acknowledge the warning so it stops alerting me every minute?

bonienl · January 22, 2015

Warning [195] is a bug

mostlydave · January 22, 2015

It actually started with this when I checked the archived alerts:

01-22-2015 07:34 AM unRAID Cache disk SMART health [188] Warning: command timeout is 21475229705 ST31000528AS_5VP714HH (sdc) warning

Should I just ignore the [195] messages? They're being sent about once a minute, with a different 8-digit value each time.

Do I need to report this as a bug somewhere?

bonienl · January 22, 2015

Yeah you can ignore the [195] message.

The correction is already done in the code and will become available in a future update.

mostlydave · January 22, 2015

Thanks. Would you mind updating this thread once it's released? Unfortunately, I'm going to have to turn off the alerts for now...

bonienl · January 22, 2015

Same here... I'm getting these:

Event: unRAID Disk 1 SMART health [188]

Subject: Warning: command timeout is 1

Description: ST2000DM001-9YN164_Z1E08HKB (sdc)

Importance: warning

Event: unRAID Parity disk SMART health [188]

Subject: Warning: command timeout is 15

Description: ST2000DM001-1CH164_Z2F0T8DJ (sdb)

Importance: warning

//hellboy

Do you see the command timeout counter increment each time (i.e. do you get a new notification every minute)?

mostlydave · January 22, 2015

I only got the command timeout once, after updating and rebooting. The alerts coming every minute or so are all 195s.

bonienl · January 22, 2015

There are five SMART attributes monitored:

SMART 5 – Reallocated_Sector_Count.

SMART 187 – Reported_Uncorrectable_Errors.

SMART 188 – Command_Timeout.

SMART 197 – Current_Pending_Sector_Count.

SMART 198 – Offline_Uncorrectable.

After a reboot, any of these attributes will give a warning notification if its value is greater than zero. From then on, notifications are only given when the value of that particular attribute increases.
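The notification rule described above can be sketched roughly as follows. This is only an illustration of the stated behaviour, not the actual monitor script; the function names `should_notify` and `replay` and the sample readings are made up for the example.

```python
from typing import List, Optional


def should_notify(previous: Optional[int], current: int) -> bool:
    """Decide whether a reading of a monitored attribute raises a warning."""
    if previous is None:
        # First reading after a reboot: warn on any non-zero value.
        return current > 0
    # Later readings: warn only when the value has gone up.
    return current > previous


def replay(readings: List[int]) -> List[bool]:
    """Apply the rule to a sequence of readings taken since the last reboot."""
    previous = None
    decisions = []
    for value in readings:
        decisions.append(should_notify(previous, value))
        previous = value
    return decisions


# Hypothetical Command_Timeout (188) readings: non-zero at boot, then one increase.
decisions = replay([1, 1, 1, 2])
print(decisions)
```

With these made-up readings, only the first and the last one would trigger a warning.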

The "bug" with [195] is that a check is done on an attribute identifier containing the digit 5, while it should check for the exact value 5 instead.
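To make the difference concrete, here is a small sketch of the two kinds of check. It is not the code from the plugin, just an assumed reconstruction of the mistake; the function names are hypothetical, and the attribute IDs are the ones mentioned in this thread.

```python
def buggy_matches_attr_5(attr_id: int) -> bool:
    # Assumed form of the bug: any identifier containing the digit 5 matches.
    return "5" in str(attr_id)


def fixed_matches_attr_5(attr_id: int) -> bool:
    # Intended check: only the exact identifier 5 matches.
    return attr_id == 5


# IDs seen in this thread: the five monitored ones plus 195 and 225.
ids = [5, 187, 188, 195, 197, 198, 225]
false_positives = [
    i for i in ids
    if buggy_matches_attr_5(i) and not fixed_matches_attr_5(i)
]
print(false_positives)
```

Under this assumption, 195 (Hardware_ECC_Recovered) and 225 are treated as if they were attribute 5, which is consistent with the notifications reported in this thread.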

mostlydave · January 22, 2015

The drive in question passed both the long and short SMART tests after the reboot.

bonienl · January 23, 2015

When these SMART attributes are non-zero it doesn't automatically mean there is a drive error, hence a warning.

When values start to increase, it should get attention, as it may mean that a drive failure is imminent. Statistics have shown that the likelihood of a failure goes up when these attributes go up.

PeterB · January 23, 2015

I'm getting something similar - one of my drives (a Samsung) reports a SMART value for 225, Load Cycle Count. The value is currently 6953 and I'm receiving an email every time that value changes. I'm not sure what I'm expected to do about this - should I be replacing the drive? SMART tests always report no fault.

How can I stop this happening - can I reset the SMART value?

Edit to add:

I have another drive (a WD) which reports 193 Load Cycle Count. The current value is 267120, but I'm not getting any status reports for that one.

Edit 2:

Ah, I've just understood what you wrote above. The code is meant to be checking attribute 5 - the reallocated sector count but, because of a coding error, it's checking every attribute whose identifier includes a '5'. Okay, looking forward to the next update to save my mailbox filling!

bonienl · January 24, 2015

Correct description of the bug!

bonienl · January 24, 2015

For those experiencing the SMART notifications bug, I've made a correction.

The new file also includes the feature to ignore the initial value reading and use that as a threshold for subsequent readings, i.e. only send a notification when the initial value starts to increase.

Download the attached file monitor.zip, unzip it and copy the file 'monitor' to /usr/local/emhttp/plugins/dynamix/scripts

(this will overwrite the existing file).

Note: You can copy the file first to your flash drive and from there move/copy it to the final destination.

monitor.zip

Squid · January 24, 2015

It would probably be wise to copy the file to your flash drive, and then add a line to your go file:

cp /boot/monitor /usr/local/emhttp/plugins/dynamix/scripts

That way the updated file will survive a reboot. However, bear in mind that when an update for the WebGUI comes out, you should remove that line from the go file and the file from the flash, as it may (probably will) break the update.

Squid · January 24, 2015

The new file also includes the feature to ignore the initial value reading and use that as a threshold for subsequent readings, i.e. only send a notification when the initial value starts to increase.

Personally, I don't mind the initial emails after a reboot (to remind me of where my drives' health stands), but how do you change that setting beyond saving monitor.ini and then copying it back on a reboot?

bonienl · January 24, 2015

When the file "monitor.ini" or the respective entry in that file does not exist, then the file gets updated without sending a notification. It is not a setting as such, but 'default' behavior.

Squid · January 24, 2015

Before I check out your code, there is a big problem with that operation as you've described.

If a drive has NO reallocated sectors, then the occurrence of the first reallocated sector will not be reported.

I've changed your code to include monitoring attribute 184 (end-to-end error), where a single occurrence constitutes a drive failure. This will not get reported. And it is possible (especially with 184) for these errors to occur at power-on... way before your script is even running.
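The concern can be illustrated with a rough sketch that compares the two policies. It assumes, as the objection implies, that a zero reading leaves no entry in monitor.ini, so the first non-zero reading is recorded silently. The function names and the readings are hypothetical; this is not the code from monitor.zip.

```python
from typing import Dict, List


def notify_with_baseline(saved: Dict[int, int], attr: int, value: int) -> bool:
    """Policy as described: a missing entry is recorded without a notification."""
    if value == 0:
        return False  # assumption: zero values are never stored
    if attr not in saved:
        saved[attr] = value  # first non-zero reading becomes the silent baseline
        return False
    increased = value > saved[attr]
    saved[attr] = max(saved[attr], value)
    return increased


def notify_always(saved: Dict[int, int], attr: int, value: int) -> bool:
    """Alternative: no initial value, so anything above the last known value is reported."""
    previous = saved.get(attr, 0)
    saved[attr] = max(previous, value)
    return value > previous


def replay(policy, readings: List[int], attr: int = 5) -> List[bool]:
    saved: Dict[int, int] = {}
    return [policy(saved, attr, value) for value in readings]


# Hypothetical Reallocated_Sector_Count readings: healthy, then a first bad sector.
readings = [0, 0, 1, 1, 2]
baseline_result = replay(notify_with_baseline, readings)
always_result = replay(notify_always, readings)
print(baseline_result, always_result)
```

In this sketch the baseline policy stays silent when the count goes from 0 to 1 and only reports the later step to 2, whereas the always-report policy flags both increases.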

bonienl · January 24, 2015

Yes, I am aware that *any* initial value will go unreported. The easiest way to solve this is to always report (no initial value). I guess that is the safer approach, despite the possible initial notifications upon reboot.

I updated the script and removed the initial value setting, but I am open to better suggestions.

monitor.zip

Squid · January 24, 2015

Save monitor.ini to the flash at array power-down, and restore it at power-up.
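A minimal sketch of that save/restore idea is shown below. The location of monitor.ini is not given in this thread, so the paths are placeholders inside a temporary directory, and the helper name `copy_if_present` and the file contents are made up. How the two calls would be hooked into power-down and power-up is left open.

```python
import shutil
import tempfile
from pathlib import Path


def copy_if_present(src: Path, dst: Path) -> bool:
    """Copy src over dst if src exists; report whether a copy happened."""
    if not src.is_file():
        return False
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)  # contents only, no permissions or timestamps
    return True


# Demonstration in a throwaway directory; the real paths would differ.
with tempfile.TemporaryDirectory() as tmp:
    live = Path(tmp) / "ram" / "monitor.ini"      # hypothetical live location
    backup = Path(tmp) / "flash" / "monitor.ini"  # hypothetical copy on the flash drive

    live.parent.mkdir(parents=True)
    live.write_text("[sdc]\n188=1\n")  # made-up contents

    saved = copy_if_present(live, backup)      # at array power-down
    live.unlink()                              # simulate the state lost on reboot
    restored = copy_if_present(backup, live)   # at power-up
    restored_text = live.read_text()
    missing = copy_if_present(Path(tmp) / "absent.ini", live)  # nothing to restore

print(saved, restored, missing)
```

Restoring the saved values would let the "only notify on increase" rule carry across reboots, at the cost of losing the reminder emails after each boot.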
