My unRAID is rather unwell

fitbrit · January 31, 2011

I'm running 4.6 rc5. The 20-data-drive server comprises a Centurion 590 with 12 drives: parity at the top, then the cache drive, followed by data drives 1-10. The bottom 4 drives are run off a Supermicro 8-port PCIe x4 card. The remaining ten drives are in two Sans Digital 5-bay eSATA enclosures, run by a single Sil3132(?) eSATA PCIe x1 card. Currently drives 12 and 13 have been removed and were going to be replaced by bigger drives.

After some recent problems, I mistakenly believed all was well because I successfully rebuilt parity with a new Seagate 2TB LP drive. However, whenever I started a parity check, lots of errors showed up very quickly. Having read about some of the problems with the model of parity drive I use, I decided to run a non-correcting parity check to completion. The results were alarming: tens if not hundreds of thousands of errors reported in the area that shows how much progress the check has made, and tens of thousands in the main disk status area. Additionally, I'm still hearing a clicking from the main server case. At first I thought it was my cache drive, but I'm now pretty sure it's not that one.

We just had a power outage which lasted longer than my UPS was able to support while I was out. When I returned and restarted the server, it took some time to mount the drives. The parity check that started showed lots of errors very quickly again.

Now I'm not sure what to do. I don't trust the parity, but am not sure whether it's due to a bad parity drive or another drive that's failing. I'm attaching the log from the unmenu readout. Any help appreciated.

Jan302011logPt1.txt

fitbrit · January 31, 2011

Part 2 of syslog...

And SMART report on parity drive:

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)

=== START OF INFORMATION SECTION ===
Device Model:     ST32000542AS
Serial Number:    5XW1FDFJ
Firmware Version: CC34
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Mon Jan 31 12:20:52 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                       was never started.
                                       Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                       without error or no self-test has ever
                                       been run.
Total time to complete Offline
data collection:                 ( 623) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                       Auto Offline data collection on/off support.
                                       Suspend Offline collection upon new
                                       command.
                                       No Offline surface scan supported.
                                       Self-test supported.
                                       Conveyance Self-test supported.
                                       Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                       power-saving mode.
                                       Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                       General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                       SCT Feature Control supported.
                                       SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x000f   097   082   006    Pre-fail  Always       -       43827459
 3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
 4 Start_Stop_Count        0x0032   094   094   020    Old_age   Always       -       6607
 5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       10150913
 9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       981
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   094   094   020    Old_age   Always       -       6512
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       8391
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   077   064   045    Old_age   Always       -       23 (Lifetime Min/Max 23/33)
194 Temperature_Celsius     0x0022   023   040   000    Old_age   Always       -       23 (0 19 0 0)
195 Hardware_ECC_Recovered  0x001a   049   040   000    Old_age   Always       -       43827459
197 Current_Pending_Sector  0x0012   100   095   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   095   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       210273008879140
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3018893516
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1466551571

SMART Error Log Version: 1
ATA Error Count: 8466 (device log contains only the most recent five errors)
       CR = Command Register [HEX]
       FR = Features Register [HEX]
       SC = Sector Count Register [HEX]
       SN = Sector Number Register [HEX]
       CL = Cylinder Low Register [HEX]
       CH = Cylinder High Register [HEX]
       DH = Device/Head Register [HEX]
       DC = Device Command Register [HEX]
       ER = Error register [HEX]
       ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 8466 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 00 a9 6d c9 00  Error: UNC at LBA = 0x00c96da9 = 13200809

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 25 00 00 af 6b c9 e0 00      00:00:27.714  READ DMA EXT
 27 00 00 00 00 00 e0 00      00:00:27.713  READ NATIVE MAX ADDRESS EXT
 ec 00 00 00 00 00 a0 00      00:00:27.712  IDENTIFY DEVICE
 ef 03 42 00 00 00 a0 00      00:00:27.712  SET FEATURES [set transfer mode]
 27 00 00 00 00 00 e0 00      00:00:27.688  READ NATIVE MAX ADDRESS EXT

Error 8465 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 00 a9 6d c9 00  Error: UNC at LBA = 0x00c96da9 = 13200809

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 25 00 00 af 6b c9 e0 00      00:00:23.929  READ DMA EXT
 27 00 00 00 00 00 e0 00      00:00:23.929  READ NATIVE MAX ADDRESS EXT
 ec 00 00 00 00 00 a0 00      00:00:23.928  IDENTIFY DEVICE
 ef 03 42 00 00 00 a0 00      00:00:23.927  SET FEATURES [set transfer mode]
 27 00 00 00 00 00 e0 00      00:00:23.903  READ NATIVE MAX ADDRESS EXT

Error 8464 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 00 a9 6d c9 00  Error: UNC at LBA = 0x00c96da9 = 13200809

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 25 00 00 af 6b c9 e0 00      00:00:20.176  READ DMA EXT
 27 00 00 00 00 00 e0 00      00:00:20.175  READ NATIVE MAX ADDRESS EXT
 ec 00 00 00 00 00 a0 00      00:00:20.174  IDENTIFY DEVICE
 ef 03 42 00 00 00 a0 00      00:00:20.174  SET FEATURES [set transfer mode]
 27 00 00 00 00 00 e0 00      00:00:20.150  READ NATIVE MAX ADDRESS EXT

Error 8463 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 00 a9 6d c9 00  Error: UNC at LBA = 0x00c96da9 = 13200809

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 25 00 00 af 6b c9 e0 00      00:00:16.412  READ DMA EXT
 27 00 00 00 00 00 e0 00      00:00:16.411  READ NATIVE MAX ADDRESS EXT
 ec 00 00 00 00 00 a0 00      00:00:16.410  IDENTIFY DEVICE
 ef 03 42 00 00 00 a0 00      00:00:16.410  SET FEATURES [set transfer mode]
 27 00 00 00 00 00 e0 00      00:00:16.386  READ NATIVE MAX ADDRESS EXT

Error 8462 occurred at disk power-on lifetime: 965 hours (40 days + 5 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 00 a9 6d c9 00  Error: UNC at LBA = 0x00c96da9 = 13200809

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 25 00 00 af 6b c9 e0 00      00:00:12.665  READ DMA EXT
 27 00 00 00 00 00 e0 00      00:00:12.664  READ NATIVE MAX ADDRESS EXT
 ec 00 00 00 00 00 a0 00      00:00:12.663  IDENTIFY DEVICE
 ef 03 42 00 00 00 a0 00      00:00:12.663  SET FEATURES [set transfer mode]
 27 00 00 00 00 00 e0 00      00:00:12.637  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
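
For anyone reading the error log entries above: all five report the same sector. A minimal Python sketch of how the register bytes map to the LBA that smartctl prints, assuming the usual layout where SN, CL and CH hold the low 24 bits of the address (the helper name `lba_from_registers` is made up for illustration, and the higher-order bytes of a 48-bit command are not shown in this register dump):

```python
def lba_from_registers(sn, cl, ch):
    """Combine Sector Number, Cylinder Low and Cylinder High into the low 24 LBA bits."""
    return (ch << 16) | (cl << 8) | sn

# Register bytes copied from the "ER ST SC SN CL CH DH" line of error 8466.
regs = "40 51 00 a9 6d c9 00"
er, st, sc, sn, cl, ch, dh = (int(byte, 16) for byte in regs.split())

lba = lba_from_registers(sn, cl, ch)
unc = bool(er & 0x40)  # bit 6 of the error register is the UNC (uncorrectable data) flag

print(hex(lba), lba, "UNC" if unc else "no UNC")
```

This reproduces the `0x00c96da9 = 13200809` shown in the log, so the drive kept failing on one and the same LBA within the first half minute after each power-up.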

Jan302011logPt2.txt

bcbgboy13 · January 31, 2011

Hi fitbrit,

Thanks to your post in another place I found out about Unraid, and I will try to help you here as much as I can.

You are using a rather unusual configuration, and I believe this is why people are not proposing any ideas. After all, a wrong idea can lead to data loss, and no one wants to venture there.

Now, I am a pure hardware guy and will wait for the Linux gurus to come up with suggestions regarding the software side, but while waiting I will throw out some ideas.

1. Please post your complete hardware configuration: motherboard model (with BIOS version if possible) and the PSU you are using.

2. Your syslog is incomplete. There are a lot of lines before the "part 1" you posted, basically the section where Unraid boots. I'd like to see the enumeration of the different devices, as you have these exotic "external" drives and I have never seen a configuration like yours.

3. You are using a Seagate 2TB LP hard drive as parity, and it still has the original CC34 firmware (which has been discussed as bad).

Now, I have not personally used Seagate in more than 10 years, and I also consider this particular model to be a "poor" choice, but I do try to keep in touch with the various discussions, and I have heard that a clicking noise is often a sign of poor power delivery to the drive.

There is also a consensus that the "RAW" data in the SMART report has meaning only to the HD manufacturers (especially for Seagate HDs), but on the other hand some of the attributes look like real data: temperature, power-on hours, etc.

Now, according to this SMART report you have used this HD for 981 hours. That is a good match with the SMART error log, which places the last five errors at 965 hours, so it looks like actual data.

Attribute 12 is the power cycle count, and it is 6512. If one assumes that this is the actual value too, then it looks very high: around 7 per hour. But an intermittent power connection to this HD might explain it.
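
The "around 7 per hour" figure can be checked with a short Python sketch. It assumes the RAW_VALUE column of attributes 9 and 12 really holds plain counts, which is exactly the assumption made above; the helper name `raw_values` is made up for illustration.

```python
# Attribute lines copied from the SMART report earlier in the thread.
SMART_LINES = """\
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       981
 12 Power_Cycle_Count       0x0032   094   094   020    Old_age   Always       -       6512
"""

def raw_values(text):
    """Map attribute ID -> leading integer of the RAW_VALUE column."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit():
            values[int(fields[0])] = int(fields[9])
    return values

raw = raw_values(SMART_LINES)
cycles_per_hour = raw[12] / raw[9]
print(f"{cycles_per_hour:.2f} power cycles per power-on hour")
```

With the values from the report this comes out at roughly 6.6 cycles per power-on hour.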

Then we have attribute 187, "Reported_Uncorrect", with a raw value of 8391, and this may in fact be actual data too, as you have 8466 errors in the SMART error log (where only the last five are kept). What is important here is that the "value", "worst" and "threshold" numbers are almost the same (001, 001 and 000), so it looks like this HD is bad one way or another.
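
The comparison for attribute 187 can be sketched in Python as follows. The `Attr` tuple and `parse_attr` helper are made-up names for illustration, and the sketch assumes the ten-column layout of the smartctl attribute table pasted earlier in the thread.

```python
from typing import NamedTuple

class Attr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized VALUE column
    worst: int   # WORST column
    thresh: int  # THRESH column
    raw: int     # leading integer of RAW_VALUE

def parse_attr(line):
    """Parse one row of the smartctl attribute table."""
    f = line.split()
    return Attr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), int(f[9]))

# Row copied from the SMART report earlier in the thread.
LINE_187 = "187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       8391"
ATA_ERROR_COUNT = 8466  # from "ATA Error Count: 8466" in the same report

attr = parse_attr(LINE_187)
margin = attr.value - attr.thresh   # how far the normalized value sits above the threshold
gap = ATA_ERROR_COUNT - attr.raw    # difference between the two error counters

print(f"{attr.name}: value {attr.value}, worst {attr.worst}, threshold {attr.thresh}, margin {margin}")
print(f"ATA error count exceeds the raw value by {gap}")
```

The normalized value sits just 1 above its threshold, and the two error counters differ by only 75, which is why the raw value looks like a genuine count.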

That is all from me for now.

fitbrit · February 1, 2011

Thanks very much, bcbgboy. Your name sounds familiar; was it RFD? My configuration isn't that unusual for some of the expert users here. The eSATA boxes allow one to expand the array outside the limits of one's case. If I'd known I was going to get so much storage over time, I would have invested in a Norco 4224.

The Seagate data is strange in that there have not been that many power cycles at all. Could this be the head parking issue with the CC34 firmware? I said it was a new drive, so under 1000 hours seems right to me too. It's such a pain to update the firmware, but I guess I'll have to try it and see if it works out. The reason I chose the Seagate was that I was getting fed up with DOA WD drives, or with them giving up the ghost a few weeks after installation. I've had six or more WD drives go bad or arrive bad in the past year. In fact, one of the RMA replacements was DOA too. I loved the Samsung drives I have, but now the F4 also seems to have firmware problems.

My plan is to just replace the parity drive for now and rebuild parity. However, I just wanted to check with some experts that that was not a bad thing to do at this stage.

SSD · February 1, 2011

The kinds of errors you are seeing in your smart report are normally caused by some type of problem with the connection between the computer and the drive - not with the drive itself. Could be a bad or loose cable, bad controller port, bad drive cage, or even a bad/broken connector on the drive. Most commonly the problem is that the cable is not securely plugged in on one end or the other. Locking cables will sometimes fix these types of intermittent problems. If you don't have one, I'd recommend replacing the cable with a new one (or at least unplugging and replugging both ends) and trying again.
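
One way to judge whether reseating or replacing the cable helped is to save the attribute table (for example from `smartctl -A` on the drive) before and after, then see which raw counters moved. A rough Python sketch of that comparison follows. The choice of watched attributes is an assumption, the helper names are made up, and the AFTER snapshot is hypothetical data invented purely for illustration.

```python
# Raw counters worth watching across a cable swap; this selection is an assumption.
WATCHED = {12: "Power_Cycle_Count", 187: "Reported_Uncorrect", 199: "UDMA_CRC_Error_Count"}

def raw_values(text):
    """Map attribute ID -> leading integer of the RAW_VALUE column."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():
            values[int(fields[0])] = int(fields[9])
    return values

def changed_counters(before, after):
    """Return {attribute name: increase} for watched counters whose raw value changed."""
    old, new = raw_values(before), raw_values(after)
    return {name: new[i] - old[i] for i, name in WATCHED.items()
            if i in old and i in new and new[i] != old[i]}

# Rows copied from the SMART report earlier in the thread.
BEFORE = """\
 12 Power_Cycle_Count       0x0032   094   094   020    Old_age   Always       -       6512
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       8391
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

# HYPOTHETICAL snapshot after one reboot with a new cable (not real data).
AFTER = """\
 12 Power_Cycle_Count       0x0032   094   094   020    Old_age   Always       -       6513
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       8391
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

print(changed_counters(BEFORE, AFTER))
```

In this made-up example only the power cycle count moves, by the one expected reboot. If Reported_Uncorrect or the power cycle count kept climbing after the swap, the cable alone would be less likely to be the culprit.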

fitbrit · February 1, 2011

Thanks very much, bjp999.

It kind of makes sense:

I've had parity problems with several drives now with my newish motherboard, a Supermicro C2SEA like the one Tom uses in his servers. I had all-locking cables, but changed the one on my parity drive when my last two parity drives had problems. I later moved those drives to data slots, after successful pre-clears and after using them for a while in Windows without issue. Taking your post into account, I think it might be the motherboard port that's screwy. Bummer, because this is an RMA'd board, with the first one being DOA. I have more internal ports available than I have slots in my main server cage, so I could move all the drives down one port and use one of the unused connections on my Supermicro 4x SAS board to take up the displaced drive.

Here I was thinking I wouldn't need to do drive rearrangement on this scale for a while!
