Re: preclear_disk.sh - a new utility to burn-in and pre-clear disks for quick add



Follow the instructions in the wiki: 

http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems

You have file-system corruption.

 

Hi Joe L.,

It is obvious that there is file-system corruption, but the question is what caused it. Since some messages in the posted syslog seem to indicate problems communicating with the drive, I am not sure it is really a good idea to run the file-system check in this state.

What I do not understand is this: how can there be file-system corruption when unRAID reports not a single error, neither for parity nor for any drive? Isn't data corruption on a drive something that parity should either have covered or at least flagged?

 

Link to comment

Errors on the unRAID interface are "read" errors.  The errors in the syslog are several layers lower than the "md" driver.  The "md" driver just uses the reiserfs under it. 

 

The errors in the syslog could be caused by a bad disk, bad memory, bad cabling, loose cabling (either data or power), a bad disk controller, or a bad motherboard chipset.  (That narrows it down a lot, doesn't it?)  For certain, the reiserfs file-system errors will never show up on the unRAID management screen.

 

Step 1.  Post a full syslog

Step 2.  Test your memory.  Verify the timing, voltage and clock speed are correct for your specific memory strips.  Do NOT rely on the BIOS to set them for you. (Some get it right, many get it wrong)

Step 3.  Get a SMART report on the drive.  Post it.  If it looks good, then go ahead with fixing the corruption with the steps as outlined in the wiki.

 

Joe L.
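
As a rough illustration of why the array can report zero errors while a file system on it is corrupt, here is a minimal Python sketch of single-parity XOR. It is not unRAID's "md" driver code; the function names and the tiny 4-byte "disks" are hypothetical. The point it tries to show is that parity only records what the bytes are, not whether they form a valid file system.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(missing_index, blocks, parity):
    """Reconstruct one missing block from the remaining blocks plus parity."""
    others = [b for i, b in enumerate(blocks) if i != missing_index]
    return xor_parity(others + [parity])


# Three hypothetical data disks, one 4-byte stripe each.
disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(disks)

# A failed disk can be rebuilt from the survivors and parity.
assert rebuild(1, disks, parity) == disks[1]

# If wrong data is handed to the array and parity is updated to match,
# parity stays consistent and will faithfully "rebuild" the wrong data.
disks[2] = b"\x00\x00\x00\x00"
parity = xor_parity(disks)
assert rebuild(2, disks, parity) == b"\x00\x00\x00\x00"
```

Under these assumptions, a parity check can only say that data and parity agree or disagree; judging the file-system structure is left to a tool such as reiserfsck.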

Link to comment

Errors on the unRAID interface are "read" errors.  The errors in the syslog are several layers lower than the "md" driver.  The "md" driver just uses the reiserfs under it.  

So, for my understanding: when I write data to the box, I cannot know whether it was written properly unless I verify it myself by reading it back and comparing it to the source? So it is only the parity drive that will be "read" during write operations, and thus checked for whether it can be read properly?

The errors in the syslog could be caused by a bad disk, bad memory, bad cabling, loose cabling (either data or power) a bad disk controller, or a bad motherboard chipset.    (That narrows it down a lot, doesn't it)  For certain, the reiserfs file system errors will never show up on the unRAID management screen.

Not such a big problem, because the box was stable for a year and all was fine; there were never errors in the syslog (I checked from time to time).

The only change was the addition of this precleared disk (even the controller and power cables had already been prepared a year ago for extensions).

That narrows it down to:

- power supply (because of the additional disk): I don't think so, because the errors were only on this one disk

- controller: it's a Sil3114 that has been in the box for a year, but this is the first disk connected to it, so it might be faulty

- the cable of this specific disk

- the disk itself

If you check the preclear result for this disk some posts above, I noted that there were some errors in the syslog before, when it was precleared in my unRAID test box.

Step 1.   Post a full syslog

I just wanted to do that, but the box hung: no telnet, no console, no web interface (neither port 80 nor 8080). The console just reports file-system corruption and no longer reacts to keystrokes :-(

Step 2.   Test your memory.  Verify the timing, voltage and clock speed are correct for your specific memory strips.  Do NOT rely on the BIOS to set them for you. (Some get it right, many get it wrong)

As mentioned above, that should be fine; it was done a long time ago and I never had problems with the box until this disk was added.

Step 3.   Get a smart report on the drive.  Post it.  If it looks good, then go ahead with fixing the corruption

with the steps as outlined in the wiki.

Well, I will have to do a hard power-off to get access again, or is there any other way to bring it back to life? Ping still works, so the IP stack is alive...

 

Edit: OK, I switched off the box and restarted it. What I didn't want to happen happened: a parity sync started and reported 270 sync errors. I assume that parity was updated, but I trusted my parity drive and assume the errors were on the new hard disk.

So I assume there is no longer any way to start without the "bad new disk" and use the proper data from the rest of the array, because parity is now "bad" too, having replicated the errors from the drive, right?  S....t :-(

SMART report and syslog now attached:

syslog-2010-04-20.txt

SmartReport.txt

Link to comment

Hi Joe L.

I ran reiserfsck and it recommends running with --rebuild-tree. I will wait for your feedback before doing so.

By the way, I stopped the parity sync right after the reboot, so it showed 270 sync errors but did not complete (so far only the start of the drive has been synced, not the rest, so parity is more or less useless for this array for now, right?).

Should I replace the drive with a new one, run parity, and try to save as much data from the current md11 drive as possible? (Just asking, because to my understanding I currently don't have a protected array anyway... and I would get the drive out of the array.)

 

root@XMS-GMI-02:~# reiserfsck --check /dev/md16

reiserfsck 3.6.19 (2003 www.namesys.com)

 

*************************************************************

** If you are using the latest reiserfsprogs and  it fails **

** please  email bug reports to [email protected], **

** providing  as  much  information  as  possible --  your **

** hardware,  kernel,  patches,  settings,  all reiserfsck **

** messages  (including version),  the reiserfsck logfile, **

** check  the  syslog file  for  any  related information. **

** If you would like advice on using this program, support **

** is available  for $25 at  www.namesys.com/support.html. **

*************************************************************

 

Will read-only check consistency of the filesystem on /dev/md16

Will put log info to 'stdout'

 

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

###########

reiserfsck --check started at Tue Apr 20 08:23:35 2010

###########

Replaying journal..

Reiserfs journal '/dev/md16' in blocks [18..8211]: 0 transactions replayed

Checking internal tree../  1 (of   3)/  3 (of 170)/114 (of 170)block 81835406: The level of the node (0) is not correct, (1) expected

the problem in the internal node occured (81835406), whole subtree is skipped                                                     /  4 (of 170)/  1 (of 170)block 81893123: The level of the node (0) is not correct, (1) expected

the problem in the internal node occured (81893123), whole subtree is skipped                                                     / 39 (of 170)/ 56 (of 170)block 88049724: The level of the node (0) is not correct, (1) expected

the problem in the internal node occured (88049724), whole subtree is skipped                                        /  2 (of   3)/ 59 (of  86)/  2 (of 170)block 120818206: The level of the node (0) is not correct, (1) expected

the problem in the internal node occured (120818206), whole subtree is skipped                                       /  3 (of   3)/ 80 (of 148)/100 (of 165)block 163348502: The level of the node (0) is not correct, (1) expected

the problem in the internal node occured (163348502), whole subtree is skipped                                                    / 83 (of 148)/132 (of 170)block 180782520: The level of the node (0) is not correct, (1) expected

the problem in the internal node occured (180782520), whole subtree is skipped                                                    / 84 (of 148)/ 21 (of 170)block 24707073: The level of the node (0) is not correct, (1) expected

the problem in the internal node occured (24707073), whole subtree is skipped                                        finished

Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.

Bad nodes were found, Semantic pass skipped

7 found corruptions can be fixed only when running with --rebuild-tree

###########

reiserfsck finished at Tue Apr 20 08:36:25 2010

###########

root@XMS-GMI-02:~#

 

Link to comment

One issue is that if you format the drives outside of unRAID, you will have to recalculate parity over the entire array when you add them.

 

You could possibly downgrade to unRAID 4.5.1 to add the drives and format them. Once added, you could then upgrade back to unRAID 4.5.3.

Link to comment

These are the results after running preclear on a new 2TB Samsung drive for use as my parity drive.

 

============================================================================
S.M.A.R.T. error count differences detected after pre-clear
note, some 'raw' values may change, but not be an indication of a problem
55c55
<   1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
---
>   1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       10
72c72
< 200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       1
---
> 200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       3
============================================================================

 

Should I be worried and run through preclear again or am I looking too much into this?

 

Thanks!

Link to comment

Your drive is fine.  (you did not read line 2)

'raw' values may change, but not be an indication of a problem

 

The only "raw" values we typically care about are sectors pending re-allocation and sectors already re-allocated.

 

Joe L.
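
For readers who want to pull just those two counters out of a report, here is a minimal Python sketch. It assumes the usual ten-column attribute table printed by smartctl, and that the two counters appear as attribute IDs 5 (Reallocated_Sector_Ct) and 197 (Current_Pending_Sector). The function name is hypothetical, and the rows for IDs 5 and 197 in the sample are made up for illustration; they are not from the drive discussed above.

```python
WATCHED_IDS = {5, 197}  # Reallocated_Sector_Ct, Current_Pending_Sector


def watched_raw_values(report_text):
    """Return {attribute_name: raw_value} for the watched attribute IDs."""
    found = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in WATCHED_IDS:
            # The tenth column is the raw value.
            found[fields[1]] = int(fields[9])
    return found


# Illustrative sample only: rows 5 and 197 are invented for this sketch.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       10
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       3
"""

for name, raw in watched_raw_values(SAMPLE).items():
    print(f"{name}: raw={raw}", "(worth a closer look)" if raw > 0 else "(ok)")
```

The other raw values, such as Raw_Read_Error_Rate, are deliberately ignored here, in line with the advice above.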

Link to comment

Can you get a smartctl report on them?

 

The only time I've ever seen something like you describe is when the disk is defective.  I suppose it could also occur if a device is deadlocked with another process trying to access it, but that is not likely.

 

Yes, please describe in more detail what you've tried and the output you are seeing.

Link to comment

thanks Joe..

 

I've replaced the 3 Samsungs with 3 WD 2TB EADS and preclear is running on them currently.

 

The F2 HD154UI act as if they just are not initializing when I try to preclear them.  It just sits at time elapsed :01 and hangs.  

 

Hardware:

 

SUPERMICRO X7SBE

Intel Celeron 430 1.80GHz

Kingston 2Gb Ddr2 800Mhz Ecc Kvr800D2E5 1G (2 X 2GB)

Supermicro Add-on Card AOC-SIM1U+

Supermicro Add-on Card AOC-SASLP-MV8 (X2)

Supermicro CSE-M35T-1B Mobile Rack (Black) (X4)

 

 

Is it possible there is a firmware or jumper issue with the Samsungs?  This really took me by surprise.  I've always had great success with Samsungs in my Windows PCs.

Is it possible that my motherboard SATA settings are at fault here? (AHCI?)

 

Actually this is the post that got me thinking...

 

http://lime-technology.com/forum/index.php?topic=5229.msg48823#msg48823

Link to comment

Where can I find this preclear_disk.sh and download it? I have been looking for it for two hours and getting very frustrated about it. I need to start over again on the start up of my new unRAID. The unformat bug confused me by showing up on all drives and I totally screwed up two hard drives.

 

Thanks in advance

Link to comment

It is attached to the bottom of the first post in this thread.    The link to download it will not be visible unless you first log onto the forum.  Perhaps you were not logged on when you went looking.

 

Joe L.

Link to comment

Where can I find this preclear_disk.sh and download it? I have been looking for it for two hours and getting very frustrated about it.

It is attached to the bottom of the first post in this thread.  The link to download it will not be visible unless you first log onto the forum.

Perhaps you can add a little note to that effect at the end of the first post.

 

Link to comment

These lines are present at the TOP of the first post... so you don't have to scroll all the way to the bottom to read them.

They've been there for a very long time.

I've put both of those processes together in a shell script I wrote. 

You do not need to set up a spare PC and or a "burn-in" unRAID test machine for this purpose any longer.

It is named:

preclear_disk.sh

It is attached to this post, look for the download link near the end of this first long post (in this thread)

(The link for the attachment is only visible after you log in as a user of this forum)

 

You had a good idea, but I had a note in place...

I guess he never looked in the wiki either, as it has a link to the thread in the forum and this text:

    *  The Preclear Disk thread, begins with a very long post that includes a number of screen shots.

    * The download link is at the very bottom of the first post.

Link to comment

Update..

 

So the 3 WD20EADS all precleared successfully in ~32 hrs.  I shut down the server, put the 3 F2 HD154UI back in, and all 3 failed the short SMART test.  WTF?  Amazingly bad quality, or an incompatibility with my hardware/unRAID?

 

I RMA'd all 3 back to newegg and ordered 3 more WD20EADS just to be safe. 

 

Also tried to kick off a preclear on the (F2 HD154UI ) and same thing as before.  It failed to get past 0%.  Interesting to say the least.

 

 

 

 

 


Link to comment

Hi all!

 

Just finished preclearing my old 1TB Seagate. Does anyone see anything abnormal in the info below?

 

============================================================================
==
== Disk /dev/sde has been successfully precleared
==
============================================================================
S.M.A.R.T. error count differences detected after pre-clear
note, some 'raw' values may change, but not be an indication of a problem
54c54
<   1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       69956695
---
>   1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       202538958
58c58
<   7 Seek_Error_Rate         0x000f   068   060   030    Pre-fail  Always       -       7309450
---
>   7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       7396424
66,67c66,67
< 190 Airflow_Temperature_Cel 0x0022   069   063   045    Old_age   Always       -       31 (Lifetime Min/Max 26/31)
< 195 Hardware_ECC_Recovered  0x001a   050   028   000    Old_age   Always
---
> 190 Airflow_Temperature_Cel 0x0022   066   063   045    Old_age   Always       -       34 (Lifetime Min/Max 26/35)
> 195 Hardware_ECC_Recovered  0x001a   051   028   000    Old_age   Always
71,73c71,73
< 240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       143610821479152
< 241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3339162421
< 242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       2123312550
---
> 240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       195167608900350
> 241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       997463318
> 242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       1442950787

 

Link to comment

 

Yes, the raw read error rate and seek error rate both seem to have improved from the start of the process.
Link to comment


 

So now preclear has the superpower of repairing disks!

Not repairing, but exercising.    And if that "breaks in" the mechanical parts to where they are not binding as much as when first assembled, then yes, the error rates can improve.  Exactly the same as your gas mileage improving once your car engine is broken in after the first few hundred miles.
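
To see the "improved" normalized values at a glance, a small Python sketch like the following could compare the before ("<") and after (">") rows of the preclear diff. It assumes the column layout shown in the reports in this thread (marker, ID, name, flag, then the normalized value); the function name is hypothetical. For most attributes a higher normalized value is generally the better one.

```python
def normalized_changes(diff_text):
    """Map attribute name -> (value_before, value_after) for changed values."""
    before, after = {}, {}
    for line in diff_text.splitlines():
        fields = line.split()
        # Expected row shape: marker, ID, name, flag, VALUE, ...
        if len(fields) < 5 or fields[0] not in ("<", ">") or not fields[1].isdigit():
            continue
        target = before if fields[0] == "<" else after
        target[fields[2]] = int(fields[4])
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


# Rows copied from the Seagate report earlier in the thread.
SAMPLE = """\
54c54
<   1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       69956695
---
>   1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       202538958
58c58
<   7 Seek_Error_Rate         0x000f   068   060   030    Pre-fail  Always       -       7309450
---
>   7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       7396424
"""

for name, (old, new) in normalized_changes(SAMPLE).items():
    print(f"{name}: {old} -> {new}")
```

On this sample it reports Raw_Read_Error_Rate going from 114 to 119 and Seek_Error_Rate from 68 to 69, which matches the observation that both rates improved slightly.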
Link to comment
  • 4 weeks later...

Hey guys, I've been lurking around and learning as much as I can for a few weeks now.  I started to put together a system, precleared a few drives, and started copying stuff to them.  I'm pretty new to Linux but follow instructions pretty well.  I have Win XP Pro and I copied stuff by just dragging the files to folders I created on DISK1 and DISK2.  DISK1 went fine, moved 500GB over no problem.

During the file copy on DISK2 it hung at a particular spot and gave me the Windows pop-up notice "Cannot copy FILEXXX: The specified network name is no longer available."  Then when trying to access that file over the network to delete it, I got the message "Runtime Error! Program: C:\WINDOWS\explorer.exe This application has requested the Runtime to terminate it in an unusual way...."  Then I clicked OK and all the windows closed down.  I opened them up again and everything seemed to be fine.  The //tower control panel reports no problems, all green, good parity, 0 errors.  I finished copying about 1.5TB to the drive.  I managed to delete the problem file by highlighting 3 files and deleting them, then copied them back in a different order.  The file that hung before now copied OK and works (it's an AVI video), but now a different file got hung up the same way and the same events happened.

 

I'm assuming this is a hard drive disk error?  I still have all my data on my Windows computer and am not going to delete anything just yet...

This is my preclear log of that drive.  This is the only one of the 6 I precleared so far to give me the "G-Sense Error rate"...

 

 

== Disk /dev/sdc has been successfully precleared
==
============================================================================
S.M.A.R.T. error count differences detected after pre-clear
note, some 'raw' values may change, but not be an indication of a problem
55c55
<   1 Raw_Read_Error_Rate     0x002f   252   252   051    Pre-fail  Always       -       0
---
>   1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       34
65c65
< 191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
---
> 191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
67c67
< 195 Hardware_ECC_Recovered  0x003a   252   252   000    Old_age   Always
---
> 195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always
============================================================================

 

Should I RMA the drive back, re-preclear it, format it?

BTW this is a great forum and I'm so glad I found unRAID!

Thanks for any help/advice.

Link to comment


None of the above.

 

The statistics from the pre-clear show absolutely nothing wrong.

 

You really should learn how to interpret the various columns in the output.  The first column is the current value.  The next is the "worst" value ever recorded for that parameter, and the third column is the threshold at which the drive is considered as failed.

 

252  looks to be an initial factory value, reset to 100 once the drive is put into use. 

 

The G-sense "normalized" value was zero at the start of your pre-clear, and zero after.  None of your normalized values changed.  Unless you have a different parameter that says FAILING_NOW, you are probably fine.

 

If you are having troubles, you should follow the advice given in the wiki under troubleshooting.  Post a syslog.  If there are errors in it describing reiserfs errors, then run a file-system check as described in the wiki.

 

Joe L.
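
As a small illustration of the three columns described above, the following Python sketch splits an attribute row into current value, worst value and threshold, and checks whether the current value has reached the threshold. The names are hypothetical, and treating a threshold of 0 as "informational only, never failing" is an assumption of this sketch.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int      # current normalized value
    worst: int      # worst normalized value recorded so far
    threshold: int  # failure threshold for the normalized value


def parse_attribute(line):
    """Parse one attribute row (ID, name, flag, VALUE, WORST, THRESH, ...)."""
    fields = line.split()
    return SmartAttribute(int(fields[0]), fields[1],
                          int(fields[3]), int(fields[4]), int(fields[5]))


def at_or_below_threshold(attr):
    """True when the current value has reached the failure threshold.

    Assumption: a threshold of 0 marks an informational attribute.
    """
    return attr.threshold > 0 and attr.value <= attr.threshold


# Rows copied from the post-preclear side of the report above.
ROWS = [
    "  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       34",
    "191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1",
]

for row in ROWS:
    attr = parse_attribute(row)
    status = "FAILED" if at_or_below_threshold(attr) else "ok"
    print(f"{attr.name}: value={attr.value} worst={attr.worst} "
          f"threshold={attr.threshold} -> {status}")
```

Both rows are well above their thresholds, which is consistent with the assessment that the preclear statistics show nothing wrong.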

Link to comment
