Re: preclear_disk.sh - a new utility to burn-in and pre-clear disks for quick add



I did 2 "preclears" on WD 1 TB drives - on two different servers. Both hung at 88% - took approx. 25 hours (that's what makes it difficult to just "retest" ;-))

I understand, it is also what makes it difficult for me to test...  Combine that with the fact that the only WD 1TB drive I own is already part of my array (and nearly full), and I have no desire to clear it, and you can see why testing can take as long as it does.

 

Can you do me a favor and let me know the "geometry" of the drive that fails to clear?

 

You can do that by typing:

fdisk -l /dev/sdX

 

where sdX = the actual drive in your array.  (replace the X with the correct drive letter)

 

Joe L.


I did 2 "preclears" on WD 1 TB drives - on two different servers. Both hung at 88% - took approx. 25 hours (that's what makes it difficult to just "retest" ;-))

I understand, it is also what makes it difficult for me to test...  Combine that with the fact that the only WD 1TB drive I own is already part of my array (and nearly full), and I have no desire to clear it, and you can see why testing can take as long as it does.

 

Can you do me a favor and let me know the "geometry" of the drive that fails to clear?

 

You can do that by typing:

fdisk -l /dev/sdX

 

where sdX = the actual drive in your array.  (replace the X with the correct drive letter)

 

Joe L.

 

Sure - here you go:

 

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes

1 heads, 63 sectors/track, 31008336 cylinders

Units = cylinders of 63 * 512 = 32256 bytes

Disk identifier: 0x00000000

 

   Device Boot      Start         End      Blocks   Id  System

/dev/sdc1               2    31008336   976762552+  83  Linux

Partition 1 does not end on cylinder boundary.

 

The "funny" thing is, that the pre-read runs always to 100% - so maybe you can check your code about differences in the handling of pre-read and post-read?

 

cheers, Guzzi


The "funny" thing is, that the pre-read runs always to 100% - so maybe you can check your code about differences in the handling of pre-read and post-read?

cheers, Guzzi

That's what makes it really interesting...  The pre-read and post-read... they both use the exact same function. 

They are handled identically.  The only difference is the wording of the messages they display during their progress.

 

The other difference, of course, is that the post-read occurs a few hours after the pre-read, after the drive has warmed up... and after it has been written to.

 

Edit: and the post-read uses a different "block size", since the clearing process writes zeros where the old cylinders/heads/sectors were defined.  This ended up tripping an obscure "bash" bug in the tracking of background processes.  This was the cause of the freeze at 88%.  A fixed version is now attached to the first post in this thread.

Joe L.


But always at 88%??

 

Don't remember the details, but I have seen the 88% hang on a Seagate 1TB as well....

 

Jim

 

 

To be exact:

the hang occurs at 888,330,240,000 bytes read...

To be even more exact, it occurs after waiting 4096 times for background processes as it iterates through the "read_entire_disk" loop.  ;D   This number apparently comes up during the "post-read" processing of certain disks with smaller cylinder sizes coupled with a large number of cylinders.  That is why some users saw the problem and others did not: it depended upon the specific geometry of the disks involved.  It would never show up on smaller disks, as they did not loop as many times.
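
For the curious, the looping pattern that trips the bug looks roughly like this.  (A simplified sketch, NOT the actual "read_entire_disk" code; /dev/sdX and the numbers are placeholders.)

[pre]
#!/bin/bash
# Each pass launches a background "dd" read and then waits for it.
# On bash 3.2, once roughly 4096 background jobs have been spawned, the
# shell's internal job table can loop forever in delete_job(), hanging
# the script even though no "dd" is left running.
blocks=1953525168          # total 512-byte sectors on a 1TB drive, for example
skip=0
while [ "$skip" -lt "$blocks" ]
do
  dd if=/dev/sdX of=/dev/null bs=512 count=1 skip=$skip >/dev/null 2>&1 &
  wait                     # wait for the background read before continuing
  let skip=skip+200000     # step forward; thousands of iterations on a big disk
done
[/pre]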

 

I had made some changes here, on my server, attempting to find the problem some of you have been experiencing, and I had reduced the number of disk blocks read in each iteration of the "read" loop of a disk.  This resulted in my preclear_disk.sh program freezing after 4 1/2 hours of clearing a 1.5TB disk.  It exhibited the exact same symptoms as described (but it froze in the pre-read phase).  No "dd" was evident, and the "bash" shell was using up 98% of the CPU.

 

I ended up using the GNU debugger to attach to the running "bash" shell to find the poor thing locked in the delete_job() function in an infinite loop.  This then gave me the clue I needed to track down the bug report. 
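
For anyone curious how that is done, attaching to a hung process goes roughly like this (the PID below is a placeholder for whatever "ps" reports for the spinning shell):

[pre]
ps -ef | grep preclear     # find the process ID of the hung "bash"
gdb -p 1234                # attach the GNU debugger to it (1234 = its PID)
(gdb) bt                   # backtrace -- it showed the loop in delete_job()
(gdb) detach               # release the process
(gdb) quit
[/pre]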

 

It is a bash bug, as reported here: http://www.mail-archive.com/[email protected]/msg03060.html

 

The good news is this bug in the shell is fixed in version 4 of "bash".  The bad news is we are on version 3.2.  (And I have no idea when Slackware will advance to the new version.)

 

I don't yet have a fix for my script, but at least I think I know how to fix it.

 

So... I'll need to think of a solution.  I'll probably just get rid of the "wait" in the loop, since the 5 individual "dd" commands reading 1 block each should take much less time than the 6th read of 2000 blocks of data.   Lots of testing is in order here...  and, as you know, each test takes a LOT of time.

 

A quick fix for anyone who is having problems with a freeze at 88% (or any other %) might be to edit the program and to just delete the line with the "wait" at line 283.  It is certainly worth a try.

(Deleting that line did not work... but a fixed version is now attached to the first post in this thread.  The shell did not care whether the "wait" was issued; it failed when attempting to store the process ID of the 4096th background process.)

The line to be deleted is marked below.  (It did not fix the bug.  :-[ )

[pre]
    if [ "$skip" -gt "$blocks" ]
    then
      let skip=($blocks)
      end_of_disk="y"
    fi
    wait # make sure the background random blocks are read before continuing  <--  DELETE THIS ENTIRE LINE  (line number 283)
  done
[/pre]

Oh yes, my testing also pointed out a stupid mistake I made in the lines reading a single block from the disk.  It did not affect the clearing of the disk, but it was not doing everything I thought it was doing in torturing the disk while reading.  I'll fix that in the next version I release too. 

 

Edit: July 21, 2009. The above suggestion did not fix the script... but a fixed script is now attached to the first post in this thread.

 

Joe L.


(only works if you have a valid "mail" command installed)

Aaaaaaaa!!!!!!!

 

HOW do I do that??

 

I've been searching these boards for the mail command in unraid...

 

(It's a newb speaking... It will probably turn out to be something embarrassingly simple)

 

And how about a POP3 and a SMTP server in the unraid box?

 

I apologize for this being off topic, but if it has been discussed elsewhere on the boards, can somebody please point me in the right direction?

 

Thanks


(only works if you have a valid "mail" command installed)

Aaaaaaaa!!!!!!!

 

HOW do I do that??

 

I've been searching these boards for the mail command in unraid...

 

I apologize for this being off topic, but if it has been discussed elsewhere on the boards, can somebody please point me in the right direction?

 

If I recall, there have been 3 mail scripts written, although they probably build on each other.  This should get you started.  (The info about the missing library was for older versions of unRAID, the newest versions include it.)

* Email Notifications

If I recall, there have been 3 mail scripts written, although they probably build on each other.  This should get you started.  (The info about the missing library was for older versions of unRAID, the newest versions include it.)

* Email Notifications

 

No, I was talking about the built-in mail command which is present on every Linux system, but is missing in unRAID.

 

The Email Notifications script that you mentioned requires an external email account.

 

Yours,

Purko

 


If I recall, there have been 3 mail scripts written, although they probably build on each other.  This should get you started.  (The info about the missing library was for older versions of unRAID, the newest versions include it.)

* Email Notifications

 

No, I was talking about the built-in mail command which is present on every Linux system, but is missing in unRAID.

 

The Email Notifications script that you mentioned requires an external email account.

 

Yours,

Purko

 

You are correct.  unRAID has no e-mail server of its own, and probably will not ever have one in its stock configuration (too much bloat, and not normally needed in a file-server).  It does not even have a "mail" command.  You can install one if you wish using additional packages.  (We are hoping that Tom will add e-mail notifications at some point... but even then, it will probably just be, as with the scripts linked to, a way to alert you if the server needs attention.)

 

If you wish a bit more than the described scripts, then I suggest installing something like "ssmtp" and "mailx."  Between them you will be able to have working mail accounts on the unRAID server and to use them to forward status mail to you when administrative action is needed.  I would never install something as complex as "sendmail," as it is far too big a nightmare for most to configure, administer, and make secure.
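
To give an idea of the effort involved, "ssmtp" needs little more than one small config file.  (Everything below is a placeholder sketch, not a tested unRAID configuration -- substitute your own mail server and account details.)

[pre]
# /etc/ssmtp/ssmtp.conf -- a minimal sketch
# Where mail addressed to "root" should really go:
root=you@example.com
# Your ISP's or provider's SMTP server:
mailhub=smtp.example.com:587
# Login details, if your server requires authentication:
AuthUser=you@example.com
AuthPass=your_password
UseSTARTTLS=YES
[/pre]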

 

Joe L.


OK, newest  version 0.9.3 of preclear_disk.sh is now attached to the first post in this thread.

 

First... It should fix the problem where it froze at the 88% point in the post-read phase for some users.  The script itself was fine, but it tripped across a bug in the "bash" shell that sometimes would cause it to loop forever (using 98% of your CPU) and stop clearing the disk.  This occurred if you had over 4096 background processes in a single script.  Certain disk drives with smaller cylinder sizes and large numbers of cylinders ended up with that happening.  It therefore showed itself with certain drive geometries and not with others.  If you had a smaller disk, or larger cylinder sizes, the 4096 number was never reached.  (My luck: it never happened here on any of my disks, so I never saw it until I accidentally forced it to occur.)

 

Next, it adds the ability to be notified via e-mail of the progress of the pre-clear.

 

new options are:

      -m [email protected] = optional recipient address.  If blank and the -M
            option is used, it will default to the default e-mail address of "root"

      -M 1 = Will send an e-mail message at the end with the final results
              (default if -m is used, but no other -M option given)

      -M 2 = Will send an e-mail same as 1, plus at the end of each cycle (if multiple
            cycles are specified)

      -M 3 = Will send an e-mail same as 2, plus at the start and end of the pre-read,
            zeroing, and post-read

      -M 4 = Will send an e-mail same as 3, plus at intervals of 25%
            during the long tests

      The -m and -M options require that a valid working mail command is installed.
      One version that has worked (bashmail) is affiliated with the unraid_notify script.
      There are others that will also work.

      All of these mail scripts need to be configured to work with your mail server.
      The unraid_notify script has instructions on how to configure this.
      See http://lime-technology.com/forum/index.php?topic=2470.0 for unraid_notify
      and http://lime-technology.com/forum/index.php?topic=2961.0 for the mail script.
      NOTE: The latest version of the mail script has to be used; earlier versions
      of the mail script affiliated with unraid_notify do not support the standard
      mail syntax needed by this pre-clear script.
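
For example, a couple of hypothetical invocations (the device name and the address below are placeholders -- substitute your own):

[pre]
preclear_disk.sh -m root /dev/sdc
      # mails just the final results to "root" (same as -M 1)

preclear_disk.sh -M 3 -m someone@example.com /dev/sdc
      # also mails at the start and end of the pre-read, zeroing, and post-read
[/pre]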

 

Joe L.


If you do use the mail programs listed in the posts above (from unraid_notify and its mail offshoot) you will have to use the

 

    -m [email protected]

 

command line parameter

 

It will not default to the e-mail address in the unraid_notify.config file.  Maybe we can change the mail script to handle "root" as a recipient someday...  I'm still hoping that brianbone will update the package into separate mail and unraid_notify packages!

 

Jim


After running 3 iterations on my new 1TB green disk I had

 

< 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0

---

> 5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 5

64c64

< 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0

---

> 196 Reallocated_Event_Count 0x0032 199 199 000 Old_age Always - 1

 

Are 5 reallocated sectors anything to worry about?  I was hoping for 0! :)

 

This is still running on the old version of the script...  Maybe I should try the new version... (I started my test the morning before Joe posted the new version!)  I did start a cycle again on a different controller (one cycle this time - and still the old script)

 

Another thought...  Should we start a new thread for preclear disk result questions and keep this thread for questions/comments about the functionality of preclear?

 

Jim

 


After running 3 iterations on my new 1TB green disk I had

 

< 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0

---

> 5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 5

64c64

< 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0

---

> 196 Reallocated_Event_Count 0x0032 199 199 000 Old_age Always - 1

 

Are 5 reallocated sectors anything to worry about?  I was hoping for 0! :)

 

This is still running on the old version of the script...  Maybe I should try the new version... (I started my test the morning before Joe posted the new version!)  I did start a cycle again on a different controller (one cycle this time - and still the old script)

 

Another thought...  Should we start a new thread for preclear disk result questions and keep this thread for questions/comments about the functionality of preclear?

 

Jim

 

If it stays at 5, in my opinion, no problem.  If it increases over time, then you might want to use the RMA process.  Odds are good it will stabilize.  I have one 250Gig drive that has had 100 relocated sectors since the first time I ran smartctl on it.  That number has never changed on that disk.

 

I'd say, download the new version of preclear_disk.sh and run another set of test cycles and see if it shows an increase in re-allocated sectors.  (The new version stress-tests the drive more; the old one had a bug that prevented the random cylinders from being read in addition to the linear read that was properly occurring.)  If the number stays at 5, fine; if not, another test cycle might be in order.  At that point you will have all the evidence you need to decide if an RMA is warranted.
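
If you want to compare runs by hand, something like this works (a sketch; the device and file names are placeholders):

[pre]
smartctl -a /dev/sdc > smart_before.txt    # snapshot before the cycle
# ... run the preclear cycle on /dev/sdc ...
smartctl -a /dev/sdc > smart_after.txt     # snapshot afterwards
diff smart_before.txt smart_after.txt      # any change in Reallocated_Sector_Ct shows here
[/pre]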

 

You might want to start a thread with your preclear experience.  It will allow the questions about the output to all be in one spot.

 

Joe L.


You might want to start a thread with your preclear experience.  It will allow the questions about the output to all be in one spot.

 

Joe L.

 

Done!  I started a new thread that can be devoted to just questions about the results of the script.  Hopefully all the gurus will monitor that thread too!

 

Thanks again, Joe, for a great script!

 

Results discussion thread can be found here


OK, newest  version 0.9.3 of preclear_disk.sh is now attached to the first post in this thread.

 

Nice!

 

The new version zoomed past the cursed 88%, and finished normally.

 

Funny things happened, though, while it was running: a whole bunch of processes got severely deadlocked in the "disk sleeping" state, including samba, rtorrent, and some of my telnet sessions.  As processes in such a state are not killable by any means, I almost pulled the power plug from the wall at one point, while also pulling my hair... But I waited it out.  For 15 hours!!!

 

During all that time the whole system was not totally locked: some telnet sessions were still responsive, and the overall CPU usage reported by htop was in the low 30% range.

 

Once the preclear script was done doing its job,  the deadlocked processes got back to normal eventually, and I was able to cleanly restart the system.

 

So I wonder, could this possibly be some new bug introduced in the new version of preclear?  Joe mentioned that this new version stresses the disk much more seriously (something the previous version wasn't doing due to a bug).  Could this have some kind of connection to my disaster?

 

Yours,

Purko

 


OK, newest  version 0.9.3 of preclear_disk.sh is now attached to the first post in this thread.

 

Nice!

 

The new version zoomed past the cursed 88%, and finished normally.

 

Funny things happened, though, while it was running: a whole bunch of processes got severely deadlocked in the "disk sleeping" state, including samba, rtorrent, and some of my telnet sessions.  As processes in such a state are not killable by any means, I almost pulled the power plug from the wall at one point, while also pulling my hair... But I waited it out.  For 15 hours!!!

 

During all that time the whole system was not totally locked: some telnet sessions were still responsive, and the overall CPU usage reported by htop was in the low 30% range.

 

Once the preclear script was done doing its job,  the deadlocked processes got back to normal eventually, and I was able to cleanly restart the system.

 

So I wonder, could this possibly be some new bug introduced in the new version of preclear?  Joe mentioned that this new version stresses the disk much more seriously (something the previous version wasn't doing due to a bug).  Could this have some kind of connection to my disaster?

 

Yours,

Purko

 

It is an indication that you have a deadlock of some kind.  Since the pre-clear only reads or writes the drive being cleared, it might be the combined resources needed by everything else you have running.  It has to be something at a pretty low level... below the file-system.  Perhaps something deadlocked in the device driver for your disk controller (you did have a lot of file activity going on).

 

Did you see anything in your syslog?


Joe,

 

A couple of questions:

 

1) When I run the -t option on a disk that I recently precleared, one of the messages it prints about the drive is: "Partition 1 does not end on cylinder boundary."  What does that mean?  Is this a bad thing?

 

2) The -n option is described as follows: "Do NOT perform preread and postread of entire disk to allow SMART firmware to reallocate bad blocks in the clearing process."  Could you explain this a bit more, please?  It sounds like something I would always want to do, but since it is an option and not a default I thought I'd ask you to be sure.

 

JT

 


Joe,

 

A couple of questions:

 

1) When I run the -t option on a disk that I recently precleared, one of the messages it prints about the drive is: "Partition 1 does not end on cylinder boundary."  What does that mean?  Is this a bad thing?

I know, you would think it should.  The pre-clear script uses every block possible... exactly the same as unRAID. (although it does not use the entire first cylinder, apparently to make it backward compatible with the partitioning on some Windows OSes.)

I'm guessing that the remaining sectors on that last cylinder are probably those used as spares by the disk itself if it finds it needs to reallocate one it is unable to read.  (Strictly a guess.)  I did a bit more searching using Google.  Odds are high your disk is reporting it has 255 disk heads, or some number far greater than the one or two it might actually have.  This is to make the ancient Cylinder/Heads/Sectors counts in the partition table fit the bits available.  In reality, the disk itself does not use C/H/S to locate a given sector.  Internally, it uses linear addressing.  You can ignore the "warning".
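
(As a sanity check, the numbers in the fdisk output quoted earlier in this thread do add up:)

[pre]
# 63 sectors/track x 512 bytes = 32256 bytes per "cylinder" (fdisk's Units line)
# 31008336 cylinders x 32256 bytes = the full capacity of the drive:
echo $(( 31008336 * 32256 ))    # prints 1000204886016 -- fdisk's "1000.2 GB"
[/pre]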

 

It is not anything you need to worry about.

 

2) The -n option is described as follows: "Do NOT perform preread and postread of entire disk to allow SMART firmware to reallocate bad blocks in the clearing process."  Could you explain this a bit more, please?  It sounds like something I would always want to do, but since it is an option and not a default I thought I'd ask you to be sure.

 

JT

 

That might be better worded to say:

Do NOT perform preread and postread of entire disk, which allows SMART firmware to reallocate bad blocks in the clearing process.

 

The SMART firmware on a disk will only detect a bad sector on a disk when an attempt is made to read it.  Writing to a disk is done blindly, because to read what was just written would require the disk to spin a whole revolution before the disk head was back over the sector just written, and it would really slow down the drive performance.

 

Therefore, to check the entire disk for bad sectors, you must read every sector on the disk.  Any failures will flag the sector as needing re-allocation.

 

When a bad sector on the disk is detected, it is marked for subsequent re-allocation at a future time.  (It cannot immediately re-allocate it, as it has no idea what the content should be, since the read failed.)

 

When the marked sector is subsequently written, as occurs when we write zeros to the disk after the pre-read, the SMART firmware on the disk can re-locate the sector, and since we are writing it, the disk knows exactly what to write to the re-located sector.

 

The subsequent post-read in the pre-clear script is to identify any additional problems with reading the blocks just written with zeros.
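
(If you want to watch those counters yourself between phases, something like this works; /dev/sdc is a placeholder for your drive:)

[pre]
# Sectors pending re-allocation, and sectors already re-allocated:
smartctl -A /dev/sdc | egrep "Current_Pending_Sector|Reallocated_Sector_Ct"
[/pre]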

 

You would always want to do both the pre-read and the post-read... unless you are in a real rush, and only want to zero the drive. (perhaps prior to selling it on e-bay)   

 

In other words, you would almost never use the "-n" option.

 

Joe L.


Very good explanation, Joe.  I believe that I now completely understand.  It looks like you really tried to think of everything when you wrote the preclear script.  I did start a 3rd preclear with the -n option believing it worked in exactly the opposite manner.  I guess I'll just let it finish.

  • 2 weeks later...

A couple of questions:

 

1) Help! I'm lost. I downloaded the script to my PC, but I assume I have to get it onto my unRAID flash drive to run it? When do I unzip it, if ever? Any noob walkthroughs appreciated... I'll look through the wiki as well.

 

 

2) Is the script capable of doing more than one drive at a time? If so, how? I believe unRAID can do more than one drive at a time, but you can't access the array, of course. I have three 1 TB drives that I wish to clear. I can leave it all overnight, no problem, but if they could all be done concurrently, that would be ideal.

 

TIA,

Joe

