Preclear plugin


Update to Beta:

 

2016.03.22

- Fix: Better S.M.A.R.T. report

- Add: Save a report in /boot/preclear_reports

 

2016.03.21

- Add: pause any preclear operations while the array starts/stops

- Add: Initial SMART reporting

 

Does that mean you can stop the array during a preclear, add additional hard drives, restart the array, and the preclear will resume?

This does NOT mean you can reboot the system?

 

It will pause all current clearing sessions while the array starts and stops. If you reboot the server, all progress will be lost.

Link to comment

Looking good. I don't know about anyone else but when eyeballing a SMART report I find I use the SMART Attribute ID numbers. I guess they could be considered superfluous.

 

I use the IDs as a whitelist. These are the currently whitelisted IDs: 4 5 9 183 184 187 188 190 196 197 198 199

 

The 9 (Power On Hours) and 190 (Airflow Temperature Celsius) are just for testing.
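
For anyone curious how an ID whitelist like this can work, here is a minimal sketch in Python. It is not the plugin's actual code: the function name `filter_attributes` and the sample rows are made up for illustration, and it assumes the usual column layout printed by `smartctl -A` (ID first, raw value last).

```python
# Hypothetical sketch: keep only whitelisted SMART attribute rows.
WHITELIST = {4, 5, 9, 183, 184, 187, 188, 190, 196, 197, 198, 199}

def filter_attributes(smartctl_output, whitelist=WHITELIST):
    """Return {id: (name, raw_value)} for whitelisted attribute rows."""
    selected = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row ("ID# ...") and any other text are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id in whitelist:
            # Raw values can contain spaces, so join everything from column 10 on.
            selected[attr_id] = (fields[1], " ".join(fields[9:]))
    return selected

# Illustrative sample rows, not taken from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   072   072   000    Old_age   Always       -       24879
194 Temperature_Celsius     0x0022   044   056   000    Old_age   Always       -       44
"""

print(filter_attributes(SAMPLE))
```

With the sample above, only 5 and 9 survive, because 1 and 194 are not on the list.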

Link to comment

Ah. Is there any reason we need to display 188 - Command Timeout for a preclear?

 

I was under the impression that, by and large, the advice was to ignore this attribute? I think any change in it is also set to be ignored by default in the web GUI (as of 6.1.7, I think)?

 

http://lime-technology.com/forum/index.php?topic=44068.msg420647#msg420647

Link to comment

Hi guys, what do you think of the SMART report? Anything missing?

 

mwFeiqB.png

 

* For consistency with everywhere else in unRAID, I would prefer the attribute numbers visible.

 

* In my opinion, I don't think you need the Status column with changed/no change, as we can all see the changes or lack thereof.

 

* Temps can be found in 190 or 194.  Some drives only use 190, some only use 194, many use both.  You could look for 190, if not exists use 194.

 

* The over temp message "Failed in past" I've only seen a few times, seems harmless, just means temp hit high temp threshold at some point in past.  I don't think I have seen that "Failed in past" on any other attribute, but I suppose it's possible.  I don't think it's necessary to display it.

 

* 4 Start_Stop_Count is quite unimportant.  I don't think I have ever seen it associated with a single drive issue.  If you want it, then you should probably add 12 Power_Cycle_Count, 192 Power-Off_Retract_Count, and 193 Load_Cycle_Count, so the user can see the relations between them.  But these are all non-critical, won't fail the drive.

 

* I know you've selected a set of attributes to open the discussion.  I personally would prefer seeing as few as possible, keep it clean and simple, to avoid confusion.  Showing 9 and 190/194 gives a nice time and temp history, and 5 and 197 display the most critical items concerning bad sectors.  And it's always nice to show any significant change to 199, so you can warn them about SATA cable quality.  But I think that more than that isn't necessary, unless the attribute is important and changes significantly.

 

* The only critical attribute you have chosen is 5 Reallocated_Sector_Ct.  Only critical attributes can fail a drive, those with a flag value with the one bit on, those that say 'Pre-fail'.  I believe the ones most likely to fail a drive are 5 Reallocated_Sector_Ct, 1 Raw_Read_Error_Rate, 7 Seek_Error_Rate, and 3 Spin_Up_Time.  For all but 5, you would want to display the VALUE, not the RAW.

 

* If you are sure you want 188, be aware that some drives have it as a single small number, but some Seagates have it as a set of 3 small numbers, a 48 bit value that should be treated as 3 16 bit values.  If it's a huge number, break it into 3 numbers.

 

* A number of those attributes won't exist on some drives.

 

* Not used to seeing a drive go through multiple Preclear cycles and not age a single hour!

 

* At some point, you may be asked for an SSD Preclear option, which could be just a zeroing plus signature plus SMART display and evaluation, no pre or post reads.  The SMART attributes for SSDs are generally very different, and the age and wear level may be the most interesting ones (but with very inconsistent attribute numbers).

 

* 'ATENTION' should be 'ATTENTION'.
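
A few of the points above (190 with a fallback to 194, splitting a huge 188 value into three 16-bit numbers, and the "one bit" in the flag that marks a Pre-fail attribute) can be sketched in Python roughly like this. The function names are hypothetical, and the order in which the three 16-bit words are returned (low word first) is an assumption, not something the thread specifies.

```python
# Hypothetical helpers; names and word order are assumptions for illustration.

def pick_temperature(attrs):
    """attrs maps attribute ID -> raw value. Prefer 190, fall back to 194."""
    for attr_id in (190, 194):
        if attr_id in attrs:
            return attrs[attr_id]
    return None  # the drive reports neither attribute

def split_command_timeout(raw):
    """Treat a huge 188 raw value as three 16-bit words (low word first)."""
    if raw <= 0xFFFF:
        return (raw,)  # a single small number, show it as-is
    return (raw & 0xFFFF, (raw >> 16) & 0xFFFF, (raw >> 32) & 0xFFFF)

def is_prefail(flag):
    """The lowest bit of the attribute flag marks a Pre-fail (critical) attribute."""
    return bool(flag & 0x0001)
```

For example, a raw value of `0x000200010003` would come back as the three small numbers 3, 1 and 2, while a plain 4 stays a single number.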

Link to comment

1) It's easy to add back;

 

2) It somewhat facilitates the job, so I'm leaving it;

 

3) Yes, I know. It was triggered when a fan failed last year;

 

4) Ok, Start_Stop_Count  removed.

 

5) I'll take a special look into those attributes;

 

6) Sometimes, other attributes may fail the drive, e.g. http://lime-technology.com/forum/index.php?topic=47764.0 . It's a work in progress, so we should add those related to failure, whether pre-fail or old age.

 

7) I'll remove 188;

 

8) Yes, I know. I will show only those that exist on the clearing drive;

 

9) It's a "short test" function that reads/writes only 1 GB of data each cycle. Do you think I'd wait 120 hours to test a cosmetic change?  ;D

 

10) It's already in place, just skip pre read and post read and you're good to go. It won't try to run the head-stress step on an SSD either.

 

11) Nice catch. I haven't had the time to search for typos/misspellings yet.

 

Thanks a lot for your input!

Link to comment

2) It somewhat facilitates the job, so I'm leaving it;

I would suggest leaving it blank for unchanged.    Would make it more obvious exactly which values should be examined further.  Also then you do not need to actually say it has changed - just include the details.

Link to comment

Well, I do agree it looks cleaner. What do you guys think?

 

#########################################################################################################################
#                                                                                                                       #
#                                                  S.M.A.R.T. Status                                                    #
#                                                                                                                       #
#                                                                                                                       #
#   ATTRIBUTE                INITIAL  CYCLE 1  STATUS                                                                   #
#   Start_Stop_Count         4991     4991                                                                              #
#   Reallocated_Sector_Ct    0        0                                                                                 #
#   Power_On_Hours           24879    24893    Increased '14'                                                           #
#   Runtime_Bad_Block        2        2                                                                                 #
#   End-to-End_Error         31       31                      ->FAILING NOW!<-                                          #
#   Reported_Uncorrect       0        21       Increased '21'                                                           #
#   Command_Timeout          4        4                                                                                 #
#   Airflow_Temperature_Cel  44       42       Decreased '-2' ->Failed in Past<-                                        #
#   Current_Pending_Sector   0        72       Increased '72'                                                           #
#   Offline_Uncorrectable    0        72       Increased '72'                                                           #
#   UDMA_CRC_Error_Count     0        0                                                                                 #
#                                                                                                                       #
#########################################################################################################################
#   SMART overall-health self-assessment test result: PASSED                                                            #
#########################################################################################################################

Link to comment

6) Sometimes, other attributes may fail the drive, e.g. http://lime-technology.com/forum/index.php?topic=47764.0 . It's a work in progress, so we should add those related to failure, whether pre-fail or old age.

That's an interesting case, that falls in a gray area not specified by the standard.  From a SMART standpoint, I don't even know what "Failing now" means when it isn't an attribute marked as a critical one.  I took a look at how other drives handle it, and found many that don't have it, some that mark it as critical, a number of Seagates that handle it as non-critical but with a high threshold (like yours), and even some Samsungs that mark it as critical but with a threshold of zero, which can't be reached!

 

The thing is though, if you want to add it, you would need to add a bunch of others that are associated with drive failure much more often than 'End-to-end error'.  That would include all attributes marked as critical, and perhaps all attributes with a threshold.  I'm certainly not an expert, but I've probably looked at a thousand SMART reports and yours may be the first involving 'End-to-end error'.

 

I like the display better without all the 'Unchanged's.  The Decreased should not have the negative sign ("Decreased '-2'" means it went up 2 not down 2).  And they are numbers, should not have quotes around them.  I think I would prefer the simpler Up 1 and Down 2 though.

 

I don't expect everyone to agree with me on every point!  But I do like where this is going.

Link to comment

I like the simplicity of Up 1 and Down 2.  These are in some ways cosmetic points, but getting them right makes the reports easier to digest and look for anomalies.
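
A minimal Python sketch of the status text being discussed (blank when unchanged, otherwise "Up N" or "Down N" with no sign and no quotes) could look like this. The function name `describe_change` is made up for illustration and is not the plugin's code.

```python
# Hypothetical sketch of the proposed STATUS column text.

def describe_change(initial, current):
    """Return '' when unchanged, else 'Up N' or 'Down N' (N is always positive)."""
    delta = current - initial
    if delta == 0:
        return ""  # leave the column blank so real changes stand out
    direction = "Up" if delta > 0 else "Down"
    return f"{direction} {abs(delta)}"

print(describe_change(24879, 24893))  # Power_On_Hours from the report above
print(describe_change(44, 42))        # Airflow_Temperature_Cel from the report above
```

Using the numbers from the report earlier in the thread, Power_On_Hours would read "Up 14" and the temperature "Down 2".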
Link to comment

Any reason both my disks show sdg "Preclear in progress... Starting..."? I started the preclear about 9 hours ago.  If I check the preclear status I see the following.

 

Sorry: Device /dev/sdf is busy.: 1
root@Servo:/usr/local/emhttp#

I swapped to the beta and it looks like everything is working as it should.  Thanks!

Link to comment

Hi,

I've had similar issues to others with the new beta plugin with regards to it filling up the /var/log space.

Started a thread here as I didn't realise it was due to the plugin, https://lime-technology.com/forum/index.php?topic=47893.0

 

I've run,

 

mount -o remount,size=256m /var/log 

 

and I can see that has extended the space.
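
If it helps anyone keep an eye on this, here is a small Python sketch that reports how full the filesystem behind a path is. It only uses the standard `shutil.disk_usage` call; the function names and the 90% threshold are arbitrary choices for illustration, and nothing here is part of the plugin.

```python
import shutil

def usage_percent(path="/var/log"):
    """Return the used space of the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # avoid dividing by zero on pseudo-filesystems
    return 100.0 * usage.used / usage.total

def is_nearly_full(path="/var/log", threshold=90.0):
    """True when usage is at or above `threshold` percent (threshold is arbitrary)."""
    return usage_percent(path) >= threshold
```

Calling `usage_percent("/var/log")` before and after the remount should show the percentage drop once the size has been raised.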

 

Will it have broken the 2 preclears that I have been running against some 8TB Seagate drives or will it have kept on running?

 

It also appears to have had the side effect of preventing me from accessing my unRAID shares from a Windows machine (but I assume that is due to the /var/log space filling up).

 

 

Link to comment

OK, I'm currently running 6.2.0-beta19.  I just added 2 more 4TB drives, one for Parity 2 and one for data expansion.  I ran the preclear beta script and both preclears finished successfully.  Now when I go to add the drive as Parity 2, it says it will do a parity check or rebuild.  Is that correct?  If I go to add a data disk, it says click OK and it will begin clearing the drive.  Is that right?  Will it go quickly because it's already precleared?  Shouldn't it skip the clear and give me a different message?  Thanks!

Link to comment

The Parity2 rebuild will take a while, approximately the same as a parity check. The data disk will be added instantly if it was successfully precleared.

Link to comment

Parity 2 added no problem.  Ran parity check and it's all good.  Data disk did not auto add.  I'm rerunning the preclear on the data disk, so we'll see what happens.

Link to comment

Note that when adding a new disk, unRAID will always say next to the start button that it's going to be cleared. The difference is that when the array starts, a precleared disk is not actually cleared.

 

P.S. I just found that v6.2-beta clears a new disk while the array is online. This does not replace preclear as a way of stress testing a new disk.

Link to comment

Since the primary purpose for the preclear function has been made obsolete (FINALLY!), I would like to propose a few changes to this project. Firstly, I can't see a need for the abbreviated write zeros and signature function any more, so efforts imho should be focused on testing and rehabilitation of new and possibly marginal older drives. To that end, I would like to see more effort put into a possibly non-destructive badblocks run option, plus the ability to mix and match the current read/write dd routine with the more aggressive badblocks options. Having the ability to zero and prepare the drive for unRAID inclusion can be an optional last step.

 

Comments? Additions? Modifications?

Link to comment

When I studied Joe L.'s code, I realized that the primary function of the original script has, for a long time, been to detect malfunctions in hard drives. Little coding is needed to clear a disk and write the clear signature. In fact, it can be done in fewer than 40 lines of code. All the remaining code is there to produce pleasant user output, retrieve and compare SMART attributes, generate reports, prevent the wrong disks from being cleared, stress the disk heads, etc.

 

Of course I'm open to any suggestions. We can easily add other methods to complement those in place. If you can point me to proven badblocks tests we could add, that would be great.
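
As a starting point for that discussion, here is a minimal Python sketch that only builds a badblocks command line and does not run anything. It relies on the standard badblocks flags `-n` (non-destructive read-write mode), `-w` (destructive write-mode test), `-s` (show progress) and `-v` (verbose); the function name and the `/dev/sdX` placeholder are hypothetical, and how such a run would be wired into the plugin is left open.

```python
# Hypothetical sketch: build (but do not execute) a badblocks command line.

def badblocks_command(device, destructive=False):
    """Return an argv list for badblocks in non-destructive or destructive mode."""
    # -n and -w are mutually exclusive, so exactly one of them is chosen.
    mode = "-w" if destructive else "-n"
    return ["badblocks", mode, "-s", "-v", device]

print(" ".join(badblocks_command("/dev/sdX")))
```

Keeping the destructive mode opt-in matches the idea of offering the non-destructive run as the default and the more aggressive write test as a deliberate choice.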

 

But I would never add a disk to my array before taking a good look at its SMART attributes, and that kind of awareness is not officially offered yet.

Link to comment