lost my data, help me find my mistakes

For about two months now, I have parity errors when doing a parity check.  The logs are clean.  I repeat - there are no hardware errors in the log.

 

The number of errors is usually between 100,000 and 4,000,000.  I have done about 12 parity checks on the array in an effort to produce a hardware error in the log that would give me a clue about where to look.

 

The "smartctl -a -d ata XXX" information is unremarkable except for one drive with a reallocated sector count of 2151.  [start-Edit-1] This is for "smartctl -a -d ata /dev/disk/by-id/ata-ST31500341AS_9VS1D9MJ" [End-Edit-1]  [start-Edit-2]  I have been watching the 2151 number since the parity check errors started two months ago, and it has remained the same [End-Edit-2]

 

I finally couldn't take it anymore and I decided to do an "md5sum" of each drive in the array, record the result, repeat, and compare the result.

 

18 of the 20 1500GB drives had identical "md5sum" of their entire contents.  Two of the drives are different.

 

$ cat /boot/config/md5sum.ata-ST31500341AS_9VS0L63S

d8fe535ce8e4b43b1bc677089c786344  /dev/disk/by-id/ata-ST31500341AS_9VS0L63S

fe516955ce64f2b322c5d8d3da20d292  /dev/disk/by-id/ata-ST31500341AS_9VS0L63S

 

$ cat /boot/config/md5sum.ata-ST31500341AS_9VS1D9MJ

fdadf84b9e77566b5685f493510b7ae0  /dev/disk/by-id/ata-ST31500341AS_9VS1D9MJ

ff4c1267ba6d503851fcb8f2722beca7  /dev/disk/by-id/ata-ST31500341AS_9VS1D9MJ
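
The two-pass comparison described above can be sketched as a small shell function. This is only a sketch: the name check_twice is made up for illustration, and it assumes GNU md5sum and a target much larger than RAM, so that the second pass really comes from the disk and not from the page cache.

```shell
# Hash the same target twice and report whether the two digests agree.
# The target is only read, never written; it may be a regular file or a
# block device such as /dev/disk/by-id/ata-...
# check_twice is a hypothetical helper name, not an unRAID command.
check_twice() {
    local target first second
    target=$1
    first=$(md5sum "$target") || return 2    # 2 = md5sum itself failed
    second=$(md5sum "$target") || return 2
    first=${first%% *}                       # keep only the digest column
    second=${second%% *}
    if [ "$first" = "$second" ]; then
        echo "consistent $first $target"
    else
        echo "MISMATCH $first $second $target"
        return 1                             # 1 = the two reads disagreed
    fi
}
```

Calling it as check_twice /dev/disk/by-id/ata-ST31500341AS_9VS0L63S would print a single line per drive, which is easier to scan across 20 drives than pairs of raw digests.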

 

What is also interesting is that these two drives are on the same 2-port sil3132 controller:

 

disk3 device:   pci-0000:03:00.0-scsi-0:0:0:0 host3 (sde) ST31500341AS_9VS0L63S

disk6 device:   pci-0000:03:00.0-scsi-1:0:0:0 host4 (sdf) ST31500341AS_9VS1D9MJ
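
With 20 drives it is easy to miss which ones share a controller. Assuming device lines in the same format as the two above, a rough sketch that pulls out the PCI address and lists the disks grouped by it could look like this (disks_by_controller is a made-up name):

```shell
# Read "diskN device: pci-ADDR-scsi-... hostN (sdX) MODEL_SERIAL" lines on
# stdin and print "ADDR diskN", sorted so that disks sharing a controller
# end up on adjacent lines. Lines in any other format are ignored.
disks_by_controller() {
    sed -n 's/^\(disk[0-9]*\) device: *pci-\([0-9a-f:.]*\)-.*/\2 \1/p' | sort
}
```

For the two lines above this prints 0000:03:00.0 next to both disk3 and disk6, which is the shared sil3132.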

 

One more time, I repeat - There are no errors in the logs (except for "md" reported parity check errors during a "parity check").

 

So, it looks like I have lost my data?

 

Any ideas about the root cause of this problem (cables, controllers, drives, etc.) and any suggestions on how I do not repeat my mistake in the future?

 

Thank you for reading.

 

The one drive with over 2000 reallocated sectors has to have failed its threshold.  That drive MUST be replaced.

 

Is it one of the two you are comparing with different checksums?

 

It could easily be one controller.  Have you tried swapping the cable on the disk that gives differing results with the cable from a disk that is consistent?

 

 

  • Author

I do not know what the "threshold" is, so I will give the entire smartctl output:

 

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen

Home page is http://smartmontools.sourceforge.net/

 

=== START OF INFORMATION SECTION ===

Device Model:    ST31500341AS

Serial Number:    9VS1D9MJ

Firmware Version: CC1H

User Capacity:    1,500,301,910,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Tue Sep  7 15:44:03 2010 Local time zone must be set--see zic m

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 609) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  118   099   006    Pre-fail  Always   -           196285440
  3 Spin_Up_Time            0x0003  098   092   000    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032  096   096   020    Old_age   Always   -           4369
  5 Reallocated_Sector_Ct   0x0033  048   048   036    Pre-fail  Always   -           2151
  7 Seek_Error_Rate         0x000f  069   060   030    Pre-fail  Always   -           8576433
  9 Power_On_Hours          0x0032  088   088   000    Old_age   Always   -           11189
 10 Spin_Retry_Count        0x0013  100   100   097    Pre-fail  Always   -           4
 12 Power_Cycle_Count       0x0032  100   100   020    Old_age   Always   -           255
184 Unknown_Attribute       0x0032  100   100   099    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032  100   100   000    Old_age   Always   -           0
188 Unknown_Attribute       0x0032  100   100   000    Old_age   Always   -           0
189 High_Fly_Writes         0x003a  070   070   000    Old_age   Always   -           30
190 Airflow_Temperature_Cel 0x0022  069   055   045    Old_age   Always   -           31 (Lifetime Min/Max 24/34)
194 Temperature_Celsius     0x0022  031   045   000    Old_age   Always   -           31 (0 14 0 0)
195 Hardware_ECC_Recovered  0x001a  048   019   000    Old_age   Always   -           196285440
197 Current_Pending_Sector  0x0012  100   100   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age   Always   -           0
240 Head_Flying_Hours       0x0000  100   253   000    Old_age   Offline  -           108512348737042
241 Unknown_Attribute       0x0000  100   253   000    Old_age   Offline  -           3646033974
242 Unknown_Attribute       0x0000  100   253   000    Old_age   Offline  -           3693611445

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

  • Author

> Is it one of the two you are comparing with different checksums?

 

Yes, I edited the original post in order to add that missing information.  However, your eyes were quicker than my hand.

 

One of the two drives with the different checksum is, in fact, the drive with 2151 reallocated sectors in the smartctl output.

 

I like your suggestion of swapping two known good cables from two other drives in the array with the two that are reporting different md5sum information and repeating the md5sum test.

 

This is to pin it down to the controller or the cables, no?

 

Also, is it possible that the suspect drive is interacting with the controller in a way that disturbs the md5sum of the other drive on the same controller?  Should one of my tests be to isolate this suspect drive on its own controller?

 

Thanks for the ideas.  It will take about 20 hours to do the md5sum, so I'll wait a little bit to see if you want me to do anything different, or if there is some secret "go faster" option on the md5sum command that someone knows about.

 

I'm pretty sure that 'reallocated sectors' must be caused by problems within that drive - I don't believe that the controller can have any effect on that.  Data on this drive would certainly be suspect.

 

With regard to the variable checksums .... are you absolutely sure that nothing is writing to those disks between calculating the first and second sums?  For instance, I have squeezeboxserver installed on my drive1, and that will be updating logfiles etc. relatively regularly.

 

If nothing is writing to the drives, then suspicion would point to the controller - first of all try unplugging and re-seating the card and cables.

 

Don't panic yet, because random read errors don't necessarily mean that the data is lost and, provided that you haven't done a parity re-build while the random errors have been occurring, there's a good chance that, once the errors have been eliminated, you can recover the data from the one failed drive.  However, be cautious about how you proceed because, with one failed drive, your data is currently unprotected.

  • Author

> With regard to the variable checksums .... are you absolutely sure that nothing is writing to those disks between calculating the first and second sums?

 

Well, one can never be absolutely sure.  So, you are right - that is something to keep in one's mind.

 

However, I do not have any additional software installed, and all remote mounts of this share are with "-o ro" (readonly).

 

Further, I did a "clear statistics" in the "unraid web based management utility" [edit-1] before starting the first md5sum [edit-1].  The utility reports that there were no writes to any of the drives since that last "clear statistics".

 

All that being said, I am still not "absolutely sure".  So your point is well taken.

 

If the drives were mounted as read-only, then I don't think any writes are possible.

 

As far as the re-allocated sectors...  The line is here:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  5 Reallocated_Sector_Ct  0x0033  048  048  036    Pre-fail  Always      -      2151

The current normalized value is "48" and the failure threshold is "36".  You are getting really close to where there are no spare sectors left to re-allocate.  Most definitely, get that drive replaced as soon as possible.  unRAID has been busy protecting your data: when an unreadable sector was reported, it reconstructed the data from parity and the other disks and then re-wrote it to the drive.  That allowed the re-allocation to occur.  You are pushing your luck if you think that drive will last much longer.
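
To make the value-versus-threshold comparison less error-prone than reading it off by eye, here is a rough sketch that prints the remaining margin (VALUE minus THRESH) for each attribute line. It assumes the ten-column attribute table layout shown in this thread (as printed by "smartctl -A"); smart_margin is a made-up name. The margin is in normalized units, not a count of spare sectors, and attributes with a threshold of 000 can never trip.

```shell
# Read smartctl attribute-table lines on stdin and print, per attribute,
# the normalized VALUE, the THRESH, and the difference between them.
# Header lines and anything that does not start with a numeric ID are skipped.
smart_margin() {
    awk '
    # Strip leading zeros so that 048 is read as decimal 48 by every awk.
    function dec(s) { sub(/^0+/, "", s); return (s == "") ? 0 : s + 0 }
    $1 ~ /^[0-9]+$/ && NF >= 10 {
        v = dec($4); t = dec($6)
        printf "%s %s value=%d thresh=%d margin=%d\n", $1, $2, v, t, v - t
    }'
}
```

Fed the Reallocated_Sector_Ct line above, it reports value=48 thresh=36 margin=12.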

 

As for the other points: it is unlikely to be a cabling issue, since that would show up as CRC errors in the syslog, and you said you are not seeing any.  It is most likely the drive itself or the controller and, since two disks are involved, probably the controller.

 

If it is a bad disk controller, it is just as likely that the data on the disk is fine, and the files will read correctly once you attach the disk to a good controller.

 

 

Good luck.  There is no simple way to shorten your process unless you take the md5sum of specific files rather than the entire drive.  You might get fooled, though, if the errors do not show themselves in the smaller sample of data.  It can't hurt to try: do repeated md5sums on one or two ISO files.  Given the large number of errors you were reporting, a hardware issue should show itself quickly enough.

 

Joe L.

  • Author

That is a very good idea about choosing one very big file or a group of big files for the md5sum.  I am going to do that.

 

I also believe it could be a controller issue.  For completeness, without endangering the status of the array, I am going to switch the two disks to an adjacent cable-controller combination that has been working for the other drives.

 

Step 1: reproduce md5sum differences with a big file or file group to save time.

Step 2: switch to known good controller/cable combination and repeat the md5sum on the same file(s).

Step 3: report back the results.

 

Thank you so much for the clear headed reasoning.  Of course, it's much easier to think logically instead of emotionally when it isn't your data that's in danger  ;)

 

  • Author

I didn't have much luck randomly picking a file that had different md5sum results.  So I created a shell script to do an md5sum on each file on the disk.  I did pass 1, then pass 2, and compared the results line by line with "diff", "tr", and "cut".
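
The script itself was not posted. A minimal sketch of the same idea (hash every file, do it again, diff the two listings) might look like the following. The function names are made up, and it assumes GNU find and md5sum and a disk holding far more data than RAM, so the second pass is not served from cache.

```shell
# Print one md5sum line per regular file under a directory, sorted by path
# so that two passes over the same tree line up for diff.
hash_tree() {
    find "$1" -type f -exec md5sum {} + | sort -k2
}

# Hash the whole tree twice and show only the lines that differ.
# Returns 0 when both passes agree, 1 when any digest changed.
compare_passes() {
    local dir pass1 pass2 status
    dir=$1
    pass1=$(mktemp) || return 2
    pass2=$(mktemp) || return 2
    hash_tree "$dir" > "$pass1"
    hash_tree "$dir" > "$pass2"
    status=0
    diff "$pass1" "$pass2" || status=$?
    rm -f "$pass1" "$pass2"
    return "$status"
}
```

Running compare_passes /mnt/disk6 prints nothing if every file hashed the same both times. Any "<" and ">" lines from diff name the files whose digests changed between passes.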

 

I had to choose a file or a couple of files whose size exceeded the memory size of the unraid host (2GB).  Otherwise, the file could be cached and we would not be reading the disk a second time.  I decided anything more than 3GB should do the trick.
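
One way to list candidate files above a size cutoff, sketched under the assumption that GNU find is available (files_larger_than is a made-up name, and the byte threshold is whatever you decide comfortably exceeds RAM):

```shell
# List regular files under a directory that are strictly larger than
# min_bytes. The "c" suffix tells find to count the size in bytes.
files_larger_than() {
    local dir min_bytes
    dir=$1
    min_bytes=$2
    find "$dir" -type f -size +"${min_bytes}c"
}
```

For the 3GB cutoff mentioned above, files_larger_than /mnt/disk6 3221225472 would list every file over 3 GiB on that disk.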

 

Some progress:

 

$ ls -l /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

-rwx------ 1 root root 5756338176 Apr 12  2009 /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

$ md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts ; md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts ; md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts ; md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

c03b17dbed0b1582e5f12d7a7ef4abe8  /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

c50f432e74fda2dd78007d6d41c36e18  /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

7707b687a57157203348fdeffdde1006  /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

65977e56bf50f7fa8d5980766caef383  /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

 

$ ls -l /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

-rwx------ 1 root root 1073739776 Oct 14  2008 /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB

-rwx------ 1 root root 1073739776 Oct 14  2008 /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB

-rwx------ 1 root root 1073739776 Oct 14  2008 /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

 

$ md5sum /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB;md5sum /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB;md5sum /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB;md5sum /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

bd28034c161f554de0d9ad77ced29f98  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB

ee89736ef06711b6c5a528cc486ececb  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB

25378e4c1af08fb5bf7737dca127901e  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

 

bd28034c161f554de0d9ad77ced29f98  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB

de406bd44a79710915eb031763be0fc7  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB

25378e4c1af08fb5bf7737dca127901e  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

 

bd28034c161f554de0d9ad77ced29f98  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB

ee89736ef06711b6c5a528cc486ececb  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB

992488f8cee6c7e5ff5380103d7d2e38  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

 

bd28034c161f554de0d9ad77ced29f98  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB

ee89736ef06711b6c5a528cc486ececb  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB

25378e4c1af08fb5bf7737dca127901e  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB
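
Spotting the odd digest in a wall of output like this is tedious. As a rough sketch, assuming standard md5sum output lines (a 32-character digest, two spaces, then the path), the repeated runs can be piped through a filter that reports only the files that produced more than one distinct digest. The name unstable_files is made up.

```shell
# Read md5sum output from several repeated runs on stdin and print
# "<number of distinct digests> <path>" for every path that did not
# hash the same way each time. Blank or unrelated lines are ignored.
unstable_files() {
    awk 'length($1) == 32 {
        path = substr($0, 35)            # skip digest plus two separator chars
        if (!((path, $1) in seen)) {     # first time this digest is seen for path
            seen[path, $1] = 1
            distinct[path]++
        }
    }
    END {
        for (p in distinct)
            if (distinct[p] > 1) print distinct[p], p
    }' | sort -k2
}
```

On the four passes above it would flag VTS_02_2.VOB and VTS_02_3.VOB with 2 distinct digests each and stay silent about VTS_02_1.VOB.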

 

Now I am going to plug these drives into a known good controller-cable combination and repeat the md5sum tests.

 

Note that I still do not have any errors in the logs.

 

  • Author

Perhaps I have not lost my data after all!  Is there a way to change the subject line of the original post?  ???

 

It appears that moving the two disks to a known good controller-cable combination has yielded consistent md5sum results.

 

See if you agree with my next steps to isolate the fault.

1. Switch it back to original to see if I can reproduce the problem - perform md5sum.

2. Swap in brand new sata cables and retry the md5sum.

3. Try removing controller and re-inserting into the PCIe slot and retry the md5sum.

4. Try identical controller from my unraid test box and retry the md5sum.

5. Throw up my hands and post back here for additional ideas.

<< Hopefully I will be "Back In Action" and never get to step 5  ;D >>

 

$ md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts ; md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts ; md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts ; md5sum /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

c50f432e74fda2dd78007d6d41c36e18  /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

c50f432e74fda2dd78007d6d41c36e18  /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

c50f432e74fda2dd78007d6d41c36e18  /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

c50f432e74fda2dd78007d6d41c36e18  /mnt/disk6/media/recordings/LooneyTunesBackinAction-252522-0.ts

 

$ md5sum /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB;md5sum /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB;md5sum /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB;md5sum /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

 

bd28034c161f554de0d9ad77ced29f98  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB

ee89736ef06711b6c5a528cc486ececb  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB

25378e4c1af08fb5bf7737dca127901e  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

 

bd28034c161f554de0d9ad77ced29f98  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB

ee89736ef06711b6c5a528cc486ececb  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB

25378e4c1af08fb5bf7737dca127901e  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

 

bd28034c161f554de0d9ad77ced29f98  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB

ee89736ef06711b6c5a528cc486ececb  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB

25378e4c1af08fb5bf7737dca127901e  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

 

bd28034c161f554de0d9ad77ced29f98  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_1.VOB

ee89736ef06711b6c5a528cc486ececb  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_2.VOB

25378e4c1af08fb5bf7737dca127901e  /mnt/disk3/media/disc/DVD/XYZZY/VIDEO_TS/VTS_02_3.VOB

 

Not to derail you, but why does it seem like SATA cards fail fairly often (compared to other card types), but SATA controllers on motherboards never seem to fail?  I rarely ever encounter a MB SATA failure (actually I've never encountered it personally), but have thrown away a half dozen or more failed SATA cards.

Just as an FYI, I had a new SIL3132 based card give MD5 errors and parity check errors. I got another one and had the same problems. So, I did encounter a problem card implementation, but never figured out exactly why. It's been a while, but I seem to recall the card worked OK in my Windows machine with Vista at the time.

 

Peter

 


 

I googled the chip and did find a large number of reports of MD5 errors using the 3132 with Mac OS.  Maybe it has a problem with Linux too.

 

And here is an old thread:

http://lime-technology.com/forum/index.php?topic=4601.15

 

It was a Syba card using the 3132 chipset and causing problems similar to the OP's.

  • Author

Yes, I proved it was the controller, and I recovered all my data!

 

After exhausting all other possibilities, I swapped the sil3132 for another sil3132 and parity check had *zero* errors.  I put back the old sil3132 for one more try, and I had sync errors within 5 minutes of starting the check.

 

I think my biggest mistake was holding on to the notion that if there was something wrong, then the hardware would register some kind of error - perhaps a CRC or something.

 

From reading other posts on this forum, I already knew that an undetected error could happen, but I didn't think it would ever happen to me.  I thought I would get some kind of indication like an unexplained crash or hardware error in a log somewhere.

 

I can't tell you how close I was to doing an "initconfig" because I thought it was a software or procedural problem (like not shutting down properly or doing the first parity check with one of the ports on "native ide" instead of "ahci").

 

From now on, when I see a parity sync error without any hardware faults, I will assume that my hardware is bad before I assume that unraid somehow dropped a bit somewhere, or that I had somehow configured or managed unraid improperly.

 

If the operating system had no idea where the problem was, I can't fathom how the unraid software could possibly figure it out for me either.  I guess it would be nice to know if a particular brand of controller had both firmware and software driver support for detecting errors and diagnosing itself.  If I knew which controllers had this feature, I would probably spend the extra money for one of these if I were using it in a raid system.

 

Thank you all for your assistance!

 


Excellent news...

 

One question, to help those who might follow.   Are the two sil3132 cards the same brand?  If not, which brands were they?

(Not that it is definitive, but if we get a lot of reports of a specific brand/model card misbehaving, we can alert others to avoid it.)

 

If they are the same brand/model, then you just have a defective card.  If it is not an item you can RMA, I'd test its capabilities as a "wheel-chock".

 

Place it behind the wheel of your car and test its ability to impede the forward and backward motion of the car as you repeatedly drive over it.  It might work better at that than it does at handling unRAID data.  If not, well... you'll feel better. ;)

 

Joe L.

Very glad you found your problem. These types of intermittent problems are the most frustrating and hard to fix problems!

 

I cannot remember a previous problem so definitively isolated to a controller; they tend to either work or not work. But this shows that the replacement method is highly effective (although sometimes expensive) at isolating problems.

Archived

This topic is now archived and is closed to further replies.
