disk error alert after gui plugin upgrade


mostlydave

I installed the new GUI update this morning and email notifications started working, but now I'm getting this error every minute or so:

 

unRAID Cache disk SMART health [188]: 01-22-2015 07:34 AM

Warning: command timeout is 21475229705

ST31000528AS_5VP714HH (sdc)

 

unRAID Cache disk SMART health [195] Warning: hardware ecc recovered is 70793036 ST31000528AS_5VP714HH (sdc) warning

 

Everything looks good in the disk settings: the drive passed the short SMART test, and the extended test is running now.

Do I need to replace this drive? Is there a way to acknowledge the warning so it stops alerting me every minute?

Link to comment

Warning [195] is a bug  :(

 

Link to comment

When I checked the archived alerts, it actually started with this:

 

01-22-2015 07:34 AM unRAID Cache disk SMART health [188] Warning: command timeout is 21475229705 ST31000528AS_5VP714HH (sdc) warning

 

Should I just ignore the [195] messages? They are being sent about once a minute, with a different 8-digit value each time.

 

Do I need to report this as a bug somewhere?

Link to comment

Same here... I'm getting these:

 

Event: unRAID Disk 1 SMART health [188]

Subject: Warning: command timeout is 1

Description: ST2000DM001-9YN164_Z1E08HKB (sdc)

Importance: warning

 

 

Event: unRAID Parity disk SMART health [188]

Subject: Warning: command timeout is 15

Description: ST2000DM001-1CH164_Z2F0T8DJ (sdb)

Importance: warning

 

//hellboy

 

Do you see the command timeout counter increment each time (i.e. do you get a new notification every minute)?

 

Link to comment

There are five SMART attributes monitored:

 

SMART 5 – Reallocated_Sector_Count.

SMART 187 – Reported_Uncorrectable_Errors.

SMART 188 – Command_Timeout.

SMART 197 – Current_Pending_Sector_Count.

SMART 198 – Offline_Uncorrectable.

 

After a reboot, any of these attributes will trigger a warning notification if its value is greater than zero; thereafter, a notification is sent only when that attribute's value increases.
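
As a rough illustration of that rule, here is a minimal Python sketch. It is not the code of the actual monitor script (which is not shown in this thread); the function name `should_notify` and the `last_seen` dictionary are hypothetical, and only the five attribute IDs come from the list above.

```python
# Hypothetical sketch of the notification rule described above;
# this is not the code of the actual monitor script.

MONITORED = {5, 187, 188, 197, 198}  # attribute IDs listed in the post


def should_notify(attr_id, raw_value, last_seen):
    """Return True if a warning should be sent for this reading.

    last_seen maps attribute ID -> highest value seen since boot
    (assumed to start out empty after a reboot).
    """
    if attr_id not in MONITORED:
        return False
    previous = last_seen.get(attr_id, 0)  # nothing seen yet counts as zero
    last_seen[attr_id] = max(previous, raw_value)
    return raw_value > previous


last_seen = {}
print(should_notify(188, 15, last_seen))  # True: first non-zero reading after boot
print(should_notify(188, 15, last_seen))  # False: value unchanged
print(should_notify(188, 16, last_seen))  # True: value increased
```

With this rule, a drive that reports a stable non-zero value should produce one warning per reboot rather than one per check.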

 

The "bug" with [195] is that the check matches any attribute identifier containing the digit 5, whereas it should match only the exact identifier 5.
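
To make the difference concrete, here is a minimal Python sketch. The real check in the monitor script is not shown in this thread, so both function names and the string-based test are assumptions made purely for illustration.

```python
# Illustration only: the real check in the monitor script is not shown
# in this thread, so both functions below are hypothetical.

def buggy_matches_attr_5(attr_id):
    # Substring test: true for any identifier that contains the digit 5.
    return "5" in str(attr_id)


def fixed_matches_attr_5(attr_id):
    # Exact comparison: true only for attribute 5 itself.
    return int(attr_id) == 5


for attr_id in (5, 188, 195, 225):
    print(attr_id, buggy_matches_attr_5(attr_id), fixed_matches_attr_5(attr_id))
# 5 True True
# 188 False False
# 195 True False
# 225 True False
```

A substring test of this kind would explain why an attribute such as 195, whose raw value changes constantly on some drives, generates a fresh warning on every check.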

 

 

 

Link to comment

A non-zero value for these SMART attributes doesn't automatically mean there is a drive error, which is why it is reported only as a warning.

When the values start to increase, the drive deserves attention, as this may mean that a drive failure is imminent. Statistics have shown that the likelihood of a failure goes up as these attributes go up.

 

Link to comment

I'm getting something similar: one of my drives (a Samsung) reports a SMART value for 225, Load Cycle Count. The value is currently 6953 and I'm receiving an email every time that value changes. I'm not sure what I'm expected to do about this. Should I be replacing the drive? SMART tests always report no fault.

 

How can I stop this happening - can I reset the SMART value?

 

Edit to add:

I have another drive (a WD) which reports 193, Load Cycle Count. The current value is 267120, but I'm not getting any status reports for that one.

 

Edit 2:

Ah, I've just understood what you wrote above.  The code is meant to be checking attribute 5 - the reallocated sector count but, because of a coding error, it's checking every attribute whose identifier includes a '5'.  Okay, looking forward to the next update to save my mailbox filling!

Link to comment

Correct description of the bug!

Link to comment

For those experiencing the SMART notifications bug, I've made a correction.

 

The new file also includes the feature to ignore the initial value reading and use that as a threshold for subsequent readings, i.e. only send a notification when the initial value starts to increase.

 

Download the attached file monitor.zip, unzip it, and copy the file 'monitor' to /usr/local/emhttp/plugins/dynamix/scripts (this will overwrite the existing file).

 

Note: You can copy the file first to your flash drive and from there move/copy it to the final destination.

monitor.zip

Link to comment

It would probably be wise to copy the file to your flash drive and then add a line to your go file:

 

cp /boot/monitor /usr/local/emhttp/plugins/dynamix/scripts

 

That way the updated file will survive a reboot. However, bear in mind that when an update for the WebGUI comes out, you should remove that line from the go file and the file from the flash drive, as it may (and probably will) break the update.

Link to comment

The new file also includes the feature to ignore the initial value reading and use that as a threshold for subsequent readings, i.e. only send a notification when the initial value starts to increase.

Personally, I don't mind the initial emails after a reboot (they remind me of where my drives' health stands), but how do you change that setting, other than by saving monitor.ini and copying it back after a reboot?

Link to comment

When the file "monitor.ini" or the respective entry in that file does not exist, then the file gets updated without sending a notification. It is not a setting as such, but 'default' behavior.

Before I check out your code, there is a big problem with that operation as you've described it.

If a drive has NO reallocated sectors, then the occurrence of the first reallocated sector will not be reported.

I've changed your code to also monitor attribute 184 (End-to-End Error), where a single occurrence constitutes a drive failure. That occurrence will not get reported. And it is possible (especially with 184) for these errors to occur at power-on, well before your script is even running.
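
The power-on case can be sketched as follows. This is a hypothetical Python illustration of the "initial value as threshold" behaviour as described in this thread, not the contents of the attached script; `check_with_baseline` is a made-up name and the `saved` dictionary merely stands in for the entries kept in monitor.ini.

```python
# Hypothetical illustration of the "initial value as threshold" behaviour;
# `saved` stands in for the entries stored in monitor.ini.

def check_with_baseline(attr_id, value, saved):
    """Store the first reading silently, then report only increases."""
    if attr_id not in saved:
        saved[attr_id] = value
        return False  # no notification, even if the value is already non-zero
    notify = value > saved[attr_id]
    saved[attr_id] = max(saved[attr_id], value)
    return notify


saved = {}
# Attribute 184 already reads 1 when the first check runs after power-on:
print(check_with_baseline(184, 1, saved))  # False: absorbed into the baseline
print(check_with_baseline(184, 1, saved))  # False: unchanged
print(check_with_baseline(184, 2, saved))  # True: only a later increase is reported
```

Under these assumptions, an error that is already present at the first check never triggers a notification unless the counter rises again.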

Link to comment

Yes, I am aware that *any* initial value will go unreported. The easiest way to solve it is to always report (no initial value); I guess that is the safer approach, despite the possible initial notifications upon reboot.

 

I updated the script and removed the initial value setting, but I am open to better suggestions :)

monitor.zip

Link to comment

Archived

This topic is now archived and is closed to further replies.
