[SOLVED] Replacing a disk - mdcmd check NOCORRECT found errors

DrMexSS · March 19, 2011

UPDATE: I have solved the problem. If you don't want to read everything I have summarized the steps in my 4th post in this thread.

Hi everyone,

I have doubts about my next step.

I have replaced a disk in my system because I bought a bigger one. After doing that I ran

/root/mdcmd check NOCORRECT

It's now running and it's finding errors... aaaaaaaaaahhhhhhhh

My system:

Unraid 4.5.6 - Plus key
USB: Sandisk cruzer (16gb)
Mobo: Asus P5Q-VM (asus) :   Intel® G45 / ICH10
NIC: Intel pro/1000
CPU: Celeron S 430 (1x 1800 MHz)
RAM: 2 GB

Disks:
Parity : Western Digital 2TB   : WDC_WD20EARS-00S8B1_WD-WCAVY2494470
Disk 1: Seagate 1TB             : ST31000528AS_6VP0EL3G
Disk 2: Seagate 500 GB        : ST3500830AS_9QG67DQX (THIS is the one I wanted to change)
Disk 3: Hitachi 750 GB          : Hitachi_HDT721075SLA380_STA401MG03XLPA
Disk 4: Seagate 1 TB            : ST31000528AS_6VP05W02
Disk 5: Western Digital 2TB   : WDC_WD20EARS-00S8B1_WD-WCAVY2502205

I replaced disk 2 with a new Western Digital 2TB (WD20EARS).

The process I followed to change the disk was:

1º- Parity check.

2º- Power down, change the disk, power on

NOTE: I did NOT use the preclear.sh script !!!!!

3º- Devices tab > the new disk was assigned automatically to the missing slot.

4º- Start button, "I'm sure..."

The system rebuilt the disk with zero errors.

Then, as I said, I ran a non-correcting parity check :: /root/mdcmd check NOCORRECT

It hasn't finished yet. Here is a capture of the process:

Syslog:

Mar 19 00:16:39 Tower login[1401]: ROOT LOGIN  on `tty1'
Mar 19 00:18:44 Tower kernel: mdcmd (39): check NOCORRECT
Mar 19 00:18:44 Tower kernel: 
Mar 19 00:18:44 Tower kernel: md: recovery thread woken up ...
Mar 19 00:18:44 Tower kernel: md: recovery thread checking parity...
Mar 19 00:18:44 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks.
Mar 19 00:18:45 Tower kernel: md: parity incorrect: 26272
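
For a long check it can be easier to count the "parity incorrect" lines than to read the whole syslog. The following is a minimal sketch, not an unRAID tool: the function name `parity_error_values` is hypothetical, and the sample text is copied from the capture above. It only extracts the number printed after the message and makes no assumption about what that number means.

```python
import re

# Sample lines copied from the syslog capture in this post.
SYSLOG = """\
Mar 19 00:18:44 Tower kernel: mdcmd (39): check NOCORRECT
Mar 19 00:18:44 Tower kernel: md: recovery thread checking parity...
Mar 19 00:18:45 Tower kernel: md: parity incorrect: 26272
"""

PATTERN = re.compile(r"md: parity incorrect: (\d+)")


def parity_error_values(syslog_text):
    """Return the number logged after each 'md: parity incorrect:' message."""
    return [int(match.group(1)) for match in PATTERN.finditer(syslog_text)]


errors = parity_error_values(SYSLOG)
print(f"{len(errors)} parity error(s) logged: {errors}")
```

On a live server the same function could be fed the contents of the syslog file instead of the sample string.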

I have errors (at least one). What should I do next?

This is my guess:

1- Stop the array.

2- Run preclear.sh on the new disk 2.

3- Assign it to the array again.

4- Start the array. It will start the rebuild process again.

Is this correct?

Thank you.

UPDATE: It has finished, and it found only that one error.

UPDATE: I have attached the smartctl info (command: smartctl -a -d ata /dev/sdc | todos)

resultsSMARTCTL-2011-03-19.txt

dgaschk · March 19, 2011

Post a SMART report for the drive.

DrMexSS · March 19, 2011

The test has finished, just one error.

I have attached the smart report.

Thank you

DrMexSS · March 23, 2011

I have done the following:

-Stop the array

-Unassign disk

-Start the array

-Stop the array

-Run the preclear script on disk 2

Results: NO errors

================================================================== 1.9
=                unRAID server Pre-Clear disk /dev/sdc
=               cycle 1 of 1, partition start on sector 63 
= Disk Pre-Clear-Read completed                                 DONE
= Step 1 of 10 - Copying zeros to first 2048k bytes             DONE
= Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE
= Step 3 of 10 - Disk is now cleared from MBR onward.           DONE
= Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4       DONE
= Step 5 of 10 - Clearing MBR code area                         DONE
= Step 6 of 10 - Setting MBR signature bytes                    DONE
= Step 7 of 10 - Setting partition 1 to precleared state        DONE
= Step 8 of 10 - Notifying kernel we changed the partitioning   DONE
= Step 9 of 10 - Creating the /dev/disk/by* entries             DONE
= Step 10 of 10 - Verifying if the MBR is cleared.              DONE
= Disk Post-Clear-Read completed                                DONE
Disk Temperature: 32C, Elapsed Time:  27:10:05
========================================================================1.9
==  WDC WD20EARS-00MVWB0    WD-WMAZA3272095
== Disk /dev/sdc has been successfully precleared
== with a starting sector of 63
============================================================================
** Changed attributes in files: /tmp/smart_start_sdc  /tmp/smart_finish_sdc
                ATTRIBUTE   NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS      RAW_VALUE
      Temperature_Celsius =   118     121            0        ok          32
No SMART attributes are FAILING_NOW

0 sectors were pending re-allocation before the start of the preclear.
0 sectors were pending re-allocation after pre-read in cycle 1 of 1.
0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.
0 sectors are pending re-allocation at the end of the preclear,
    the number of sectors pending re-allocation did not change.
0 sectors had been re-allocated before the start of the preclear.
0 sectors are re-allocated at the end of the preclear,
    the number of sectors re-allocated did not change.

-Then I assigned the disk again to the Disk 2 slot

-Start the array

The rebuilding process started.

- I did NOT run the "mdcmd check NOCORRECT" command, but after the rebuild I still see the 1 error. What should I do now? Run a normal parity check and forget about it?

I have attached the new smartctl report of the disk 2

smartctl_post.txt

DrMexSS · March 23, 2011

Well, it's solved.

After the rebuild I ran a NOCORRECT check and now I don't have any errors. I suppose that the error from the previous post is a bug: the system remembers that there WAS a sync error.

So, to sum up, the complete process I followed is:

-I wanted to replace one disk with a bigger one

-I ran a parity check (the regular one, with corrections)

-It found 0 errors.

-Stop the array

-Unassign the disk (disk 2 in my case)

-Power down

-Change the disk

Note: I forgot to run the preclear script on the new disk

-Power on

-Assign the new disk to the old slot

-Start the array :: Rebuilding

-When it finishes, run the command

/root/mdcmd check NOCORRECT

You can see the progress in the web browser

-It found 1 error

-Stop the array

-Unassign the disk

-Start the array.

-Stop the array.

-Run the preclear.sh script on the disk (28 hours... phew)

-Assign the disk to the slot

-Start the array. Rebuild.

-It finishes. It shows 1 error. I assume it's the old one.

-Run the command:

/root/mdcmd check NOCORRECT

It shows NO errors.

dgaschk · March 23, 2011

Another option is that there is an error in the data on the new disk. The original error could have been on any of the array disks, including parity. The rebuild used the data on all remaining disks plus the parity. If the original error was on the replaced disk then it has been corrected. If the error was not on the replaced disk then the original error still exists and the corresponding bit on the new disk has been flipped incorrectly to match parity. If parity was wrong then there is a single wrong bit on the new disk. If parity was correct then there is a wrong bit on the new disk and on the other data disk that still has a wrong bit. Clear as mud, right?

unRAID is designed to support the failure of an entire disk and does this well. It currently fails to handle occasional bit errors gracefully. The parity check indicates that a bit error has occurred but does nothing to indicate on which disk the error lies. It is difficult to determine the source of parity check errors.

The error may be gone, and you will probably never notice a single error on one or at most 2 disks.
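
Why a parity check cannot name the faulty disk can be seen in a toy model. This is only a sketch under stated assumptions: it models single parity as a byte-wise XOR across the data disks, uses three tiny made-up "disks", and the helper names (`compute_parity`, `mismatched_offsets`, `flip_bit`) are hypothetical, not unRAID functions.

```python
from functools import reduce
from operator import xor


def compute_parity(data_disks):
    """XOR the bytes at the same offset on every data disk."""
    return bytes(reduce(xor, column) for column in zip(*data_disks))


def mismatched_offsets(data_disks, parity):
    """Return the offsets where the stored parity disagrees with the data."""
    expected = compute_parity(data_disks)
    return [i for i, (e, p) in enumerate(zip(expected, parity)) if e != p]


def flip_bit(block, offset, mask=0x01):
    """Return a copy of block with one bit flipped at the given offset."""
    damaged = bytearray(block)
    damaged[offset] ^= mask
    return bytes(damaged)


# Made-up disk contents, two bytes each.
disks = [bytes([0x0F, 0x33]), bytes([0xF0, 0x55]), bytes([0xAA, 0x00])]
parity = compute_parity(disks)

# Flip the same bit on each data disk in turn, then on parity itself.
reports = []
for victim in range(len(disks)):
    damaged = [flip_bit(d, 1) if i == victim else d for i, d in enumerate(disks)]
    reports.append(mismatched_offsets(damaged, parity))
parity_report = mismatched_offsets(disks, flip_bit(parity, 1))

# Every scenario yields the same report, so the check alone cannot tell them apart.
print(reports, parity_report)
```

The check reports the same offset whichever disk holds the bad bit, which is why a non-correcting check can only say that something is inconsistent, not where.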

DrMexSS · March 25, 2011

But I have run the test with the NOCORRECT option. The second time it did NOT find any errors. I think everything is ok now.

In my opinion the error happened when Unraid rebuilt the disk the first time. I mean:

-Before opening the case, I ran a parity check. -> No errors.

-I replaced the old disk with the new one.

-Unraid rebuilt the data (this is where the 1 error appeared, because:)

-I ran the check with the NOCORRECT option -> It found the error

-I unassigned the disk and precleared it.

-I put the disk back into the system. Unraid rebuilt the data again.

-I ran the check with the NOCORRECT option -> It did not find any errors.

So, the second rebuild was done correctly and my data is safe.

P.S.: Maybe I have been paranoid, it's just one bit!! (or byte?)

lionelhutz · March 25, 2011

dgaschk wrote: "Another option is that there is an error in the data on the new disk. The original error could have been on any of the array disks, including parity. The rebuild used the data on all remaining disks plus the parity. If the original error was on the replaced disk then it has been corrected. If the error was not on the replaced disk then the original error still exists and the corresponding bit on the new disk has been flipped incorrectly to match parity. If parity was wrong then there is a single wrong bit on the new disk. If parity was correct then there is a wrong bit on the new disk and on the other data disk that still has a wrong bit. Clear as mud, right?"

Not really. unRAID has absolutely no knowledge of the expected state of the bits on any single disk during the rebuild process. The disk rebuilding function assumes the other data disks and parity are correct because it must use all that data to rebuild the new disk. If there was a bit that was wrong on another disk, that would introduce a corresponding wrong bit on the disk being rebuilt, and these 2 data errors would effectively cancel each other out as far as the parity check was concerned. So, in a perfect world, the parity check would never have an error right after a disk rebuild.
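
The cancelling effect can be illustrated with a small model. This is a sketch only, assuming single parity computed as a byte-wise XOR over the data disks; the disk contents are made up and the helpers `xor_blocks` and `rebuild` are hypothetical names, not unRAID code.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR the bytes at the same offset across all given blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def rebuild(remaining_disks, parity):
    """Reconstruct the missing disk from the surviving disks plus parity."""
    return xor_blocks(list(remaining_disks) + [parity])


# Made-up contents for three data disks; parity is computed while all are good.
disk1 = bytes([0x12, 0x34])
disk2 = bytes([0x56, 0x78])
disk3 = bytes([0x9A, 0xBC])
parity = xor_blocks([disk1, disk2, disk3])

# A bit silently goes bad on disk1 after parity was computed.
bad_disk1 = bytes([disk1[0] ^ 0x01, disk1[1]])

# disk2 is replaced and rebuilt from the (now wrong) disk1, disk3 and parity.
rebuilt_disk2 = rebuild([bad_disk1, disk3], parity)

# The rebuilt disk inherits a matching wrong bit ...
print("rebuilt disk2 matches original:", rebuilt_disk2 == disk2)
# ... yet the array is consistent again, so a parity check finds nothing.
print("parity check passes:", xor_blocks([bad_disk1, rebuilt_disk2, disk3]) == parity)
```

In this model a rebuild always leaves the array consistent with parity, so a mismatch found right after a rebuild points to something going wrong during or after the rebuild itself.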

Peter

dgaschk · March 25, 2011

Isn't this what I said?
