lost my data, help me find my mistakes

September 7, 2010

For about two months now, I have been getting parity errors when doing a parity check. The logs are clean. I repeat - there are no hardware errors in the log.

The number of errors is usually between 100,000 and 4,000,000. I have done about 12 parity checks on the array in an effort to produce a hardware error in the log that would give me a clue about where to look.

The "smartctl -a -d ata XXX" information is unremarkable except for one drive with a reallocated sector count of 2151. [start-Edit-1] This is for "smartctl -a -d ata /dev/disk/by-id/ata-ST31500341AS_9VS1D9MJ" [End-Edit-1] [start-Edit-2] I have been watching the 2151 number since the parity check errors started two months ago, and it has remained the same. [End-Edit-2]

I finally couldn't take it anymore, so I decided to do an "md5sum" of each drive in the array, record the result, repeat, and compare the two results.

For 18 of the 20 1500GB drives, both passes produced an identical "md5sum" of the entire drive contents. For two of the drives, the two passes differ.

$ cat /boot/config/md5sum.ata-ST31500341AS_9VS0L63S

d8fe535ce8e4b43b1bc677089c786344 /dev/disk/by-id/ata-ST31500341AS_9VS0L63S

fe516955ce64f2b322c5d8d3da20d292 /dev/disk/by-id/ata-ST31500341AS_9VS0L63S

$ cat /boot/config/md5sum.ata-ST31500341AS_9VS1D9MJ

fdadf84b9e77566b5685f493510b7ae0 /dev/disk/by-id/ata-ST31500341AS_9VS1D9MJ

ff4c1267ba6d503851fcb8f2722beca7 /dev/disk/by-id/ata-ST31500341AS_9VS1D9MJ
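The repeat-and-compare check described above can be sketched roughly as follows. This is only an illustration, not the exact commands used for the results shown: `repeat_md5` is a hypothetical helper name, and the target can be a whole device under /dev/disk/by-id (read as root) or an ordinary file.

```shell
# Hash the same target several times and report whether every pass agrees.
# repeat_md5 is a hypothetical helper, not an unRAID or coreutils command.
repeat_md5() (
    target=$1
    passes=${2:-2}          # default: two passes
    [ -r "$target" ] || { echo "cannot read $target" >&2; exit 2; }
    first=
    i=1
    while [ "$i" -le "$passes" ]; do
        # Read through stdin so md5sum prints "-" instead of the path,
        # then keep only the digest.
        sum=$(md5sum < "$target" | cut -d' ' -f1)
        echo "pass $i: $sum"
        if [ -z "$first" ]; then
            first=$sum
        elif [ "$sum" != "$first" ]; then
            echo "MISMATCH on pass $i"
            exit 1
        fi
        i=$((i + 1))
    done
    echo "all $passes passes agree"
)
```

For example, `repeat_md5 /dev/disk/by-id/ata-ST31500341AS_9VS0L63S 2` would hash that drive twice and stop at the first disagreement. A mismatch with no writes in between means the same bytes were not delivered twice, which points at the read path (drive, cable, or controller) rather than at the stored data itself.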

What is also interesting is that these two drives are on the same 2-port sil3132 controller:

disk3 device: pci-0000:03:00.0-scsi-0:0:0:0 host3 (sde) ST31500341AS_9VS0L63S

disk6 device: pci-0000:03:00.0-scsi-1:0:0:0 host4 (sdf) ST31500341AS_9VS1D9MJ
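To see at a glance which disks share a controller, the PCI address can be pulled out of lines like the two above. A small sketch, assuming the device lines have been saved or piped in exactly the format shown here (`by_controller` is a made-up name for illustration):

```shell
# Read "diskN device: pci-<address>-scsi-<port> ..." lines on stdin and
# print "<pci address> <disk name>", sorted so disks on the same
# controller end up next to each other.
by_controller() {
    awk '$2 == "device:" {
        addr = $3
        sub(/^pci-/, "", addr)       # drop the "pci-" prefix
        sub(/-scsi-.*$/, "", addr)   # drop the port part, keep the PCI address
        print addr, $1
    }' | sort
}
```

Fed the two lines above, this prints `0000:03:00.0 disk3` and `0000:03:00.0 disk6`, so both suspect drives sit behind the same PCI device.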

One more time, I repeat - There are no errors in the logs (except for "md" reported parity check errors during a "parity check").

So, it looks like I have lost my data?

Any ideas about the root cause of this problem (cables, controllers, drives, etc.) and any suggestions on how I do not repeat my mistake in the future?

Thank you for reading.

September 7, 2010

The one drive with over 2000 re-allocated sectors must surely have failed its threshold. That drive MUST be replaced.

Is it one of the two you are comparing with different checksums?

It could easily be one controller. Have you tried swapping the cable on the disk that gives different results with the cable from a disk that is consistent?

September 7, 2010

Author

I do not know what the "threshold" is, so I will give the entire smartctl output:

Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===

Device Model: ST31500341AS

Serial Number: 9VS1D9MJ

Firmware Version: CC1H

User Capacity: 1,500,301,910,016 bytes

Device is: Not in smartctl database [for details use: -P showall]

ATA Version is: 8

ATA Standard is: ATA-8-ACS revision 4

Local Time is: Tue Sep 7 15:44:03 2010 Local time zone must be set--see zic m

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status: (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status: ( 0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: ( 609) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: ( 1) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: ( 2) minutes.

SCT capabilities: (0x103f) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 196285440

3 Spin_Up_Time 0x0003 098 092 000 Pre-fail Always - 0

4 Start_Stop_Count 0x0032 096 096 020 Old_age Always - 4369

5 Reallocated_Sector_Ct 0x0033 048 048 036 Pre-fail Always - 2151

7 Seek_Error_Rate 0x000f 069 060 030 Pre-fail Always - 8576433

9 Power_On_Hours 0x0032 088 088 000 Old_age Always - 11189

10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 4

12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 255

184 Unknown_Attribute 0x0032 100 100 099 Old_age Always - 0

187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0

188 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0

189 High_Fly_Writes 0x003a 070 070 000 Old_age Always - 30

190 Airflow_Temperature_Cel 0x0022 069 055 045 Old_age Always - 31 (Lifetime Min/Max 24/34)

194 Temperature_Celsius 0x0022 031 045 000 Old_age Always - 31 (0 14 0 0)

195 Hardware_ECC_Recovered 0x001a 048 019 000 Old_age Always - 196285440

197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0

198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0

199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0

240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 108512348737042

241 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 3646033974

242 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 3693611445

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1

SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS

1 0 0 Not_testing

2 0 0 Not_testing

3 0 0 Not_testing

4 0 0 Not_testing

5 0 0 Not_testing

Selective self-test flags (0x0):

After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

September 7, 2010

Author

> Is it one of the two you are comparing with different checksums?

Yes, I edited the original post in order to add that missing information. However, your eyes were quicker than my hand.

One of the two drives with the different checksum is, in fact, the drive with 2151 reallocated sectors in the smartctl output.

I like your suggestion of swapping two known good cables from two other drives in the array with the two that are reporting different md5sum information and repeating the md5sum test.

This is to pin it down to the controller or the cables, no?

Also, is it possible that the suspect drive is having an interaction with the controller that is disturbing the md5sum of the other drive on the same controller? Should one of my tests be to isolate this suspect drive on its own controller?

Thanks for the ideas. It will take about 20 hours to do the md5sum, so I'll wait a little bit to see if you want me to do anything different, or if there is some secret "go faster" option on the md5sum command that someone knows about.

September 7, 2010

I'm pretty sure that 'reallocated sectors' must be caused by problems within that drive - I don't believe that the controller can have any effect on that. Data on this drive would certainly be suspect.

With regard to the variable checksums .... are you absolutely sure that nothing is writing to those disks between calculating the first and second sums? For instance, I have squeezeboxserver installed on my drive1, and that will be updating logfiles etc. relatively regularly.

If nothing is writing to the drives, then suspicion would point to the controller - first of all try unplugging and re-seating the card and cables.

Don't panic yet, because random read errors don't necessarily mean that the data is lost and, provided that you haven't done a parity re-build while the random errors have been occurring, there's a good chance that, once the errors have been eliminated, you can recover the data from the one failed drive. However, be cautious about how you proceed because, with one failed drive, your data is currently unprotected.

September 7, 2010

Author

> With regard to the variable checksums .... are you absolutely sure that nothing is writing to those disks between calculating the first and second sums?

Well, one can never be absolutely sure. So, you are right - that is something to keep in one's mind.

However, I do not have any additional software installed, and all remote mounts of this share are with "-o ro" (readonly).

Further, I did a "clear statistics" in the "unraid web based management utility" [edit-1] before starting the first md5sum [edit-1]. The utility reports that there were no writes to any of the drives since that last "clear statistics".
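A second way to cross-check the "no writes" claim, independent of the web utility, is the kernel's own per-device counter. The sketch below assumes a Linux system that exposes /sys/block/<dev>/stat, where the seventh field is the number of sectors written since boot; `sectors_written` is a hypothetical helper name.

```shell
# Print the "sectors written" counter (field 7) from a sysfs block stat file.
# Pass the path to the stat file, e.g. /sys/block/sdf/stat.
sectors_written() {
    awk '{ print $7 }' "$1"
}

# Usage sketch (sdf is the device name shown for disk6 earlier in the thread):
#   before=$(sectors_written /sys/block/sdf/stat)
#   ... run both md5sum passes ...
#   after=$(sectors_written /sys/block/sdf/stat)
#   [ "$before" = "$after" ] && echo "no writes" || echo "device was written to"
```

If the counter is unchanged across both passes, a differing checksum cannot be explained by something writing to the disk in between.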

All that being said, I am still not "absolutely sure". So your point is well taken.

September 7, 2010

> I'm pretty sure that 'reallocated sectors' must be caused by problems within that drive - I don't believe that the controller can have any effect on that. Data on this drive would certainly be suspect.

> With regard to the variable checksums .... are you absolutely sure that nothing is writing to those disks between calculating the first and second sums? For instance, I have squeezeboxserver installed on my drive1, and that will be updating logfiles etc. relatively regularly.

> If nothing is writing to the drives, then suspicion would point to the controller - first of all try unplugging and re-seating the card and cables.

> Don't panic yet, because random read errors don't necessarily mean that the data is lost and, provided that you haven't done a parity re-build while the random errors have been occurring, there's a good chance that, once the errors have been eliminated, you can recover the data from the one failed drive. However, be cautious about how you proceed because, with one failed drive, your data is currently unprotected.

If the drives were mounted as read-only, then I don't think any writes are possible.

As far as the re-allocated sectors... The line is here:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

5 Reallocated_Sector_Ct 0x0033 048 048 036 Pre-fail Always - 2151

The current normalized value is "48" and the failure threshold is "36". You are getting really close to the point where there are no spare sectors left to re-allocate. Most definitely, get that drive replaced as soon as possible. unRAID has been busy protecting your data: when an unreadable sector was reported, it reconstructed the data from parity and the other disks and then rewrote it to the drive. That allowed the re-allocation to occur. You are pushing your luck if you think that drive will last much longer.
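The same reading can be applied to every row of the attribute table at once. A rough sketch, assuming the table is in the column layout shown in the smartctl output earlier in the thread (VALUE in column 4, THRESH in column 6, RAW_VALUE in column 10); `smart_headroom` is just an illustrative name, not a smartmontools command.

```shell
# Read a smartctl attribute table on stdin and print, for each attribute,
# how far the normalized VALUE still is above its failure THRESH.
# An attribute counts as failed once VALUE drops to THRESH or below.
smart_headroom() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        value  = $4 + 0     # "048" -> 48
        thresh = $6 + 0     # "036" -> 36
        printf "%s value=%d thresh=%d headroom=%d raw=%s\n",
               $2, value, thresh, value - thresh, $10
    }'
}
```

For the line quoted above this prints `Reallocated_Sector_Ct value=48 thresh=36 headroom=12 raw=2151`, that is, 12 points left before the drive reports the attribute as failed.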

As for the other points: it is unlikely to be a cabling issue, since that would present itself as CRC errors in the syslog, and you said you are not seeing them. It is highly likely to be the drive itself or the controller and, since two disks are involved, probably the controller.

It is just as likely that the data is just fine on the disk, and that when you attach the disk to a good disk controller the files will be just fine (if the disk controller is what is bad).

Good luck. There is no simple way to shorten your process unless you take the MD5SUM of specific files rather than the entire drive. You might get fooled, though, if the errors do not show themselves in the smaller sample of data. It can't hurt to try: do repeated md5sums on one or two ISO files. Given the large number of errors you were reporting, it will show quickly enough if it is a hardware issue.

Joe L.

September 7, 2010

Author

That is a very good idea about choosing one very big file or a group of big files for the md5sum. I am going to do that.

I also believe it could be a controller issue. For completeness, without endangering the status of the array, I am going to switch the two disks to an adjacent cable-controller combination that has been working for the other drives.

Step 1: reproduce md5sum differences with a big file or file group to save time.

Step 2: switch to known good controller/cable combination and repeat the md5sum on the same file(s).

Step 3: report back the results.

Thank you so much for the clear-headed reasoning. Of course, it's much easier to think logically instead of emotionally when it isn't your data that's in danger.

September 8, 2010

Author

I didn't have much luck randomly picking a file that had different md5sum results. So, I created a shell script to do an md5sum on each file on the disk. I ran pass 1, then pass 2, and compared the results line by line with "diff", "tr", and "cut".
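The script itself is not reproduced here, but a two-pass per-file comparison along those lines might look like the sketch below. It is a simplified stand-in that uses only "find", "md5sum", "sort", and "diff"; the function names are invented for illustration.

```shell
# Print "digest  path" for every regular file under $1, sorted by path
# so that two passes line up for diff.
hash_tree() (
    cd "$1" || exit 2
    find . -type f -exec md5sum {} + | sort -k2
)

# Hash the whole tree twice and show any files whose digest changed.
# Exit status is 0 when both passes agree, non-zero otherwise.
compare_passes() (
    dir=$1
    pass1=$(mktemp)
    pass2=$(mktemp)
    hash_tree "$dir" > "$pass1"
    hash_tree "$dir" > "$pass2"
    diff "$pass1" "$pass2"
    status=$?
    rm -f "$pass1" "$pass2"
    exit "$status"
)
```

Running `compare_passes /mnt/disk6` would print a diff entry for each file that hashed differently between the two passes. Small files may be served from the page cache on the second pass, so an empty diff is only meaningful for files that were actually read from disk twice.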

I had to choose a file or a couple of files whose size exceeded the memory size of the unraid host (2GB). Otherwise, the file could be cached and we would not be reading the disk a second time. I decided anything more than 3GB should do the trick.
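Picking out candidates above that size can be done with "find". A small sketch, where `files_larger_than` is a hypothetical helper and the size is given in bytes:

```shell
# List regular files under directory $1 that are larger than $2 bytes.
# The "c" suffix makes find compare the size in bytes without rounding.
files_larger_than() {
    find "$1" -type f -size +"$2"c
}

# Usage sketch: files over 3 GiB (3 * 1024^3 bytes) on disk6
#   files_larger_than /mnt/disk6 3221225472
```

An alternative that may work on kernels providing it is to flush the page cache between passes as root with `sync; echo 3 > /proc/sys/vm/drop_caches`, which would let smaller files be re-read from disk as well. That was not tried in this thread.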

Some progress:

$ ls -l /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

-rwx------ 1 root root 5756338176 Apr 12 2009 /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

$ md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts ; md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts ; md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts ; md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

c03b17dbed0b1582e5f12d7a7ef4abe8 /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

c50f432e74fda2dd78007d6d41c36e18 /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

7707b687a57157203348fdeffdde1006 /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

65977e56bf50f7fa8d5980766caef383 /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

$ ls -l /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

-rwx------ 1 root root 1073739776 Oct 14 2008 /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB

-rwx------ 1 root root 1073739776 Oct 14 2008 /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB

-rwx------ 1 root root 1073739776 Oct 14 2008 /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

$ md5sum /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB;md5sum /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB;md5sum /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB;md5sum /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

bd28034c161f554de0d9ad77ced29f98 /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB

ee89736ef06711b6c5a528cc486ececb /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB

25378e4c1af08fb5bf7737dca127901e /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

bd28034c161f554de0d9ad77ced29f98 /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB

de406bd44a79710915eb031763be0fc7 /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB

25378e4c1af08fb5bf7737dca127901e /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

bd28034c161f554de0d9ad77ced29f98 /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB

ee89736ef06711b6c5a528cc486ececb /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB

992488f8cee6c7e5ff5380103d7d2e38 /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

bd28034c161f554de0d9ad77ced29f98 /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB

ee89736ef06711b6c5a528cc486ececb /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB

25378e4c1af08fb5bf7737dca127901e /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

Now I am going to plug these drives into a known good controller-cable combination and repeat the md5sum tests.

Note that I still do not have any errors in the logs.

September 8, 2010

Author

Perhaps I have not lost my data after all! Is there a way to change the subject line of the original post?

It appears that moving the two disks to a known good controller-cable combination has yielded consistent md5sum results.

See if you agree with my next steps to isolate the fault.

1. Switch the disks back to the original controller and cables to see if I can reproduce the problem - perform md5sum.

2. Swap in brand new sata cables and retry the md5sum.

3. Try removing controller and re-inserting into the PCIe slot and retry the md5sum.

4. Try identical controller from my unraid test box and retry the md5sum.

5. Throw up my hands and post back here for additional ideas.

<< Hopefully I will be "Back In Action" and never get to step 5 >>

$ md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts ; md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts ; md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts ; md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

<6/media/recordings/LooneyTunesBackinAction-252522-0.ts