I've made a few mistakes...

January 5, 2013

Background

I've been going through some struggles with my unRAID server since I set it up. When I first built it, I was very impatient and formatted 6TB of data back to the stone age because I was too lazy to wait for the data to copy before building the server. It took months to get that data back. Since setting this up in March of 2012, I've had 3 (count 'em, 3) drive failures, and I'm pretty certain I'm working on a fourth. This past October, when one of my drives (#3) failed out of warranty, I shut the machine down and walked away for 2 months. Back then, I was running unRAID 5.0 rc3, and I had a 7-drive, 12TB array.

Present Day

I picked up a 3TB drive over the holidays, with the intention of replacing my failed (red ball) 2TB drive. Popped it in, and data rebuild wouldn't happen because (duh) my new drive was larger than my parity drive. I should have known better. But that's OK. I ordered another 2TB drive and figured, no big deal, I needed another 2TB of space, I'll just plug in the 2TB drive, let the rebuild do its thing, replace the parity drive with the 3TB drive, parity rebuild, then the old parity drive becomes data drive number 7. :-). Sound logical so far? You're about to hate me.

About an hour in, I noticed that the rebuild speed had slowed to about 30kB/s. I can't remember why anymore. I didn't pull the logs. I didn't even look at them. I just powered the server off. Then I started it back up. Over the course of this operation, I would do this several times. In actuality, it didn't look like I had caused any damage. So I started everything up again and noticed that one of the disks (couldn't tell which by ear) was clicking. Didn't sound good, and it was about that time I realized I was on the verge of a multiple drive failure.

Now, a logical person with my technical acumen (smart, but not very adept at low-level hard drive logs, or really anything low-level - I work in native mobile applications) would have stopped there, collected himself, and started the process I have begun now: collecting logs to post here. Actually, I sort of did. I read the logs I had and noticed lots of messages about one of my drives failing to start. I powered off a few more times, double, triple, quadruple checked my cabling, hopped on the forum, and read for about 30 seconds before deciding I didn't know where to get help in the forums for my issue. I'd done a few Google and forum searches, but I just had a feeling I was going to need unique help for my situation. Once I saw that posting for 5.0 rc issues required rc8, guess what illogical thing I did next? You guessed it. I upgraded my broken server to rc8, and said, "who cares if I lose two drives?"

For those of you still willing to listen to my story, I put my array back together with disk2 replaced with the new 2TB drive (the old red-ball), the new 3TB drive in the parity slot, and the old parity drive in the space where the mad clicker was. I RMA'd that thing. Started up, and parity started to rebuild. So far so good. Now to try mounting the drives in Windows as user shares. Worked! So far, so good. I knew I was out a bunch of data, but I didn't care. It was just a bunch of media I could re-download, and I could fill in the gaps with what I was missing. But trying to browse the contents of the user shares gave me REISERFS errors on device md3, asking me about doing a fsck. What's more, my console is going crazy, repeating this message to me over and over again, and I can't stop the array (well I can, but I'm trying to be good so the last person still reading this won't give up on me).

Dilemma

I'm at the point now where I feel like it might be best to just blow the whole thing away and start all over. I'd be setting myself back about 7 months of downloading, but why should I care? I'll somehow, someway, be able to rebuild. My photos and important documents are all backed up to Time Machine and the cloud in various ways. I just need a stable NAS - and someday, I'd like to migrate my downloading applications to my NAS as well so I can repurpose the crappy little Windows box I have that stuff on now.

What do I do? Do I try to recover the array I have now, and bring this thing back to life? Or format every disk again and start over?

I've been considering moving to FreeNAS as well. Mainly because I've had so many drive failures with unRAID. Again, I don't know that unRAID is to blame, but I've never had hard drives fail on me before. Never. Now in an array like this, all the time. It very well could be the drives, but I just don't really know or understand. I'm willing to learn, and I'm willing to take a step back and get everything right this time.

Attached are my last two syslogs that I pulled during restarts. As I was writing this, I realized that I had made it difficult to pull the logs from my system via SMB and I did another hard power down. I promise, if anyone is willing to answer me, and be patient with me, I'll walk away from my NAS after posting this, wait a few days for a reply, do your bidding, and wait patiently again. If you need disk stat stuff, I'll stop and provide that too. If you want me to take my whole array apart and put it back together again, I'll begrudgingly do that as well. I'm generally very independent and will take off after being nudged in the right direction, but this once, I just want someone to take me under their wing and get me through this whole process with as few additional scars as possible and the ideal setup. Screenshots, logs, anything. You got it. I don't care if I'm not done working on this until March. I'm gonna get it right this time. Thanks.

Ok, new problem. My 2 syslogs are 1GB in size between the two of them so they're not going to fit in this post. If you're interested in assisting me, let me know and I'll pare them down/put them in a shared Dropbox folder or something.

January 5, 2013

Sorry this has been frustrating for you. Please send the logs, or links to the logs, to me via email and we'll get this figured out.

[email protected]

January 5, 2013

I doubt it's unraid causing the hard drive issues. Do you know what the problem with them is/was?

Did you preclear the drives before adding them?

January 5, 2013

Author

I doubt it's unraid causing the hard drive issues. Do you know what the problem with them is/was?

Honestly, I don't remember - could have been SMART errors - could have been bad sectors. I do know that I've done my best to replace the offending WD drives with Samsung and I've had better luck (no red ball errors on those yet). Although when I did run diagnostics on all my drives a few months back, I got discouraged, because from reading a little on this site on how to interpret these reports, ALL of my disks, including the brand new ones I'd gotten as replacements and/or from Newegg, were "failing". So yeah.

Did you preclear the drives before adding them?

Yeah I forgot to mention that in the background portion of my sob story. I did make sure to preclear all my drives when first building my system. After that - when replacing drives I was always without a real backup plan while I had a red ball, so I'd just rebuild from parity out of the box.

***RANDOM*** Does anyone have a quick shell script that can pull some drive statistics and system information from my server? Perhaps something I can just leave on the USB drive in case I have issues again, so in the future I'm less likely to panic and try to solve the problem with everything but the assistance of a more knowledgeable community? Actually, I can probably write my own script; a shopping list of important stuff to grab with it would be enough.
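
A minimal sketch of what such a script might look like, assuming bash on the server and that smartctl is installed. The function name collect_diagnostics, the output file names, and the 5000-line syslog cutoff are illustrative choices, not anything that ships with unRAID. It gathers basic system details, the tail of the syslog, and one SMART report per /dev/sd? device into a single directory:

```shell
#!/bin/bash
# Sketch only: gather basic system and drive information into one directory.
# Names, paths and the list of things collected are assumptions, not an
# official unRAID tool.

collect_diagnostics() {
    local outdir="$1" dev
    mkdir -p "$outdir" || return 1

    # General system state. A missing tool is noted instead of aborting.
    {
        date
        uname -a
        echo "--- memory ---"
        free 2>&1 || echo "free not available"
        echo "--- mounted file systems ---"
        df -h 2>&1 || echo "df failed"
    } > "$outdir/system.txt" 2>&1

    # Keep only the tail of the syslog so a runaway log stays manageable.
    if [ -r /var/log/syslog ]; then
        tail -n 5000 /var/log/syslog > "$outdir/syslog-tail.txt"
    fi

    # One SMART report per whole disk, if smartctl is installed.
    if command -v smartctl >/dev/null 2>&1; then
        for dev in /dev/sd?; do
            [ -b "$dev" ] || continue
            smartctl -a "$dev" > "$outdir/smart-$(basename "$dev").txt" 2>&1 || true
        done
    fi
}

# Example call (commented out so that sourcing this file does nothing):
# collect_diagnostics /boot/diagnostics
```

Calling collect_diagnostics /boot/diagnostics would write the files to the flash drive, assuming it is mounted at /boot. Trimming the syslog keeps the bundle small enough to attach to a post even when the full log has grown very large.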

January 5, 2013

Author

I just wanted to update this thread. Tom was kind enough to take time out and help me through my issues.

For the reference of others who may follow me, below is our correspondence:

Tom:

Hi Dan,
The dropbox folder is still syncing. One of the logs is 721MB so far and that’s not good because probably this is filling all your RAM on the server.

So, tell me please the current state of your server. That is, how many drives with data, is parity valid, does it boot? etc.

Cheers,

Tom

Me:

Hi Tom,

I'm able to boot the server. It doesn't go into an unusable state until I try to browse the network share "Videos/TV Shows" or "Videos/Movies".

I think (but I'm not certain) that disk3 is the new 2TB drive I just bought to replace the one that went redball on me. It seems to think it's full... even though it was just added to the array and couldn't have possibly finished its parity rebuild.

Tom:

Ok that looks good. So what you need to do is check and if necessary repair the file systems on each of your data drives.

1. Stop the array.

2. On the Main page you are going to Start the array again, but first check the Maintenance mode checkbox.

3. From the console, or telnet session, type this series of commands:

reiserfsck -y /dev/md1

This will start a file system check on disk1. The program might take a while to run depending on how many files are on the volume and the degree of corruption, if any. If the program finds corruption it will tell you to re-run the program again but this time including a particular switch, such as “--fix-fixable”. If this is the case, go ahead and do it. Files that it finds but can’t place in directories properly will be put in a “lost+found” directory in the root of the disk.

When done with the first disk, move on to the next one:

reiserfsck -y /dev/md2

etc.

reiserfsck -y /dev/md6

will be your last device.

Once all your file systems have been checked/repaired, you should be able to access normally. Run a parity-check at some point to see if any errors pop up in there. Try not to power-down your server without first clicking on ‘Shutdown’ or hit the server front panel power button once and let it shut down on its own.
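
For anyone following along, the disk-by-disk sequence above could be scripted roughly as follows. This is a sketch and not part of Tom's instructions: check_disks and the DRY_RUN variable are made-up names, and it assumes the array was started in Maintenance mode with the data disks at /dev/md1 through /dev/md6. By default it only prints the commands. Repairs such as --fix-fixable are deliberately left as a manual step, since reiserfsck says which one it wants.

```shell
#!/bin/bash
# Sketch only: walk the data disks in order and run the read-only check.
# Assumes Maintenance mode and devices named /dev/md<N>.

check_disks() {
    local first="$1" last="$2" n dev
    for n in $(seq "$first" "$last"); do
        dev="/dev/md$n"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: just show what would be run.
            echo "reiserfsck --check -y $dev"
        else
            echo "=== checking $dev ==="
            # A non-zero exit status means the disk needs attention, so stop
            # there and let the operator read the report before going on.
            reiserfsck --check -y "$dev" || {
                echo "$dev: review the output above before continuing" >&2
                return 1
            }
        fi
    done
}

# Example calls (commented out so that sourcing this file does nothing):
# check_disks 1 6              # print the six commands
# DRY_RUN=0 check_disks 1 6    # actually run the checks
```

Stopping at the first disk that reports a problem is a design choice: it keeps that disk's output on screen instead of letting it scroll away behind the next check.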

This worked. As I expected from what I could ascertain, md3 was never fully rebuilt, and required a tree rebuild. Whatever data was there was now gone for good, but I was fine with that. I just wanted access to the 4 remaining drives so I could begin to rebuild with some remnants of my media server still in place.

I'd like to extend my thanks again to Tom for taking the time out to give me a hand with this. Completely unexpected but very welcome great customer support. I'll be faithfully sticking with unRAID until the day I can just buy a 30TB SSD for $50. :-)

January 5, 2013

Author

Guess I spoke too soon. During the parity-check, disk6 redballed. I believe this is one of the disks that was a replacement RMA, so it's starting to come full-circle. Just so I don't have to blame myself, I'll start harboring a grudge against Western Digital now.

Anyway, I double checked the SATA cable - seemed fine. At first I wasn't able to get a SMART report after simply stopping the array. I stopped the array, activated shutdown from the webpage :-) checked the cable and rebooted. The console seemed to indicate on startup that there were two disks now that were not connecting right away. One is the one I have kept inactive until the RMA shows up, and the other, I'm assuming, is the redball. Here's the latest screen grab.

After rebooting, I tried again to get a SMART report - this time I was able to. It's here:

=== START OF INFORMATION SECTION ===
Device Model: WDC WD20EARX-008FB0
Serial Number: WD-WCAZAE131140
Firmware Version: 51.0AB51
User Capacity: 2,000,398,934,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sat Jan 5 11:05:02 2013 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84)Offline data collection activity
     was suspended by an interrupting command from host.
     Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)The previous self-test routine completed
     without error or no self-test has ever 
     been run.
Total time to complete Offline 
data collection: (32700) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
     Auto Offline data collection on/off support.
     Suspend Offline collection upon new
     command.
     Offline surface scan supported.
     Self-test supported.
     Conveyance Self-test supported.
     Selective Self-test supported.
SMART capabilities: (0x0003)Saves SMART data before entering
     power-saving mode.
     Supports SMART auto save timer.
Error logging capability: (0x01)Error logging supported.
     General Purpose Logging supported.
Short self-test routine 
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x30b5)SCT Status supported.
     SCT Feature Control supported.
     SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 25
  3 Spin_Up_Time 0x0027 194 185 021 Pre-fail Always - 5275
  4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 82
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
  7 Seek_Error_Rate 0x002e 200 196 000 Old_age Always - 0
  9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2829
 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 27
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 22
193 Load_Cycle_Count 0x0032 196 196 000 Old_age Always - 12567
194 Temperature_Celsius 0x0022 120 119 000 Old_age Always - 30
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
ATA Error Count: 67 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 67 occurred at disk power-on lifetime: 2829 hours (117 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0 Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
  -- -- -- -- -- -- -- -- ---------------- --------------------
  ef 10 02 00 00 00 a0 08 01:36:07.344 SET FEATURES [Reserved for Serial ATA]
  ec 00 00 00 00 00 a0 08 01:36:07.343 IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 08 01:36:07.343 SET FEATURES [set transfer mode]
  ef 10 02 00 00 00 a0 08 01:36:07.343 SET FEATURES [Reserved for Serial ATA]
  ec 00 00 00 00 00 a0 08 01:36:07.342 IDENTIFY DEVICE

Error 66 occurred at disk power-on lifetime: 2829 hours (117 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 45 00 00 00 a0 Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
  -- -- -- -- -- -- -- -- ---------------- --------------------
  ef 03 45 00 00 00 a0 08 01:36:07.343 SET FEATURES [set transfer mode]
  ef 10 02 00 00 00 a0 08 01:36:07.343 SET FEATURES [Reserved for Serial ATA]
  ec 00 00 00 00 00 a0 08 01:36:07.342 IDENTIFY DEVICE
  ef 10 02 00 00 00 a0 08 01:36:06.801 SET FEATURES [Reserved for Serial ATA]

Error 65 occurred at disk power-on lifetime: 2829 hours (117 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0 Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
  -- -- -- -- -- -- -- -- ---------------- --------------------
  ef 10 02 00 00 00 a0 08 01:36:07.343 SET FEATURES [Reserved for Serial ATA]
  ec 00 00 00 00 00 a0 08 01:36:07.342 IDENTIFY DEVICE
  ef 10 02 00 00 00 a0 08 01:36:06.801 SET FEATURES [Reserved for Serial ATA]
  ec 00 00 00 00 00 a0 08 01:36:06.800 IDENTIFY DEVICE

Error 64 occurred at disk power-on lifetime: 2829 hours (117 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0 Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
  -- -- -- -- -- -- -- -- ---------------- --------------------
  ef 10 02 00 00 00 a0 08 01:36:06.801 SET FEATURES [Reserved for Serial ATA]
  ec 00 00 00 00 00 a0 08 01:36:06.800 IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08 01:36:06.800 SET FEATURES [set transfer mode]
  ef 10 02 00 00 00 a0 08 01:36:06.800 SET FEATURES [Reserved for Serial ATA]
  ec 00 00 00 00 00 a0 08 01:36:06.799 IDENTIFY DEVICE

Error 63 occurred at disk power-on lifetime: 2829 hours (117 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0 Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
  -- -- -- -- -- -- -- -- ---------------- --------------------
  ef 03 46 00 00 00 a0 08 01:36:06.800 SET FEATURES [set transfer mode]
  ef 10 02 00 00 00 a0 08 01:36:06.800 SET FEATURES [Reserved for Serial ATA]
  ec 00 00 00 00 00 a0 08 01:36:06.799 IDENTIFY DEVICE
  ef 10 02 00 00 00 a0 08 01:36:06.797 SET FEATURES [Reserved for Serial ATA]

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
    1 0 0 Not_testing
    2 0 0 Not_testing
    3 0 0 Not_testing
    4 0 0 Not_testing
    5 0 0 Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

And I have attached my latest syslog. Much smaller than last time.

syslog.txt

January 5, 2013

There are communications errors. Could be a bad cable or port. Try a new SATA cable. Then a different SATA port.
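
When comparing cables or ports, it can help to pull the same two numbers out of each saved SMART report rather than rereading the whole thing. A rough sketch follows, assuming the report was saved to a text file in the smartctl layout shown earlier in this thread. The function names smart_raw and ata_error_count are made up for illustration.

```shell
#!/bin/bash
# Sketch only: extract a few values from a saved "smartctl -a" report.
# Assumes the attribute table layout shown earlier (name in column 2,
# raw value in the last column).

smart_raw() {
    # $1 = attribute name, $2 = path to the saved report
    awk -v name="$1" '$2 == name { print $NF }' "$2"
}

ata_error_count() {
    # $1 = path to the saved report
    awk '/^ATA Error Count:/ { print $4 }' "$1"
}

# Example calls (commented out; the file name is hypothetical):
# smart_raw UDMA_CRC_Error_Count smart-sde.txt
# ata_error_count smart-sde.txt
```

Run against the report posted above, these would give 0 for UDMA_CRC_Error_Count and 67 for the ATA error count. A rising UDMA_CRC_Error_Count is commonly taken as a hint of cabling trouble, so recording both values before and after swapping a cable or port gives something concrete to compare.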

January 10, 2013

Author

Trying to keep this all in one place. Here's an update on my communications with Tom.

Tom:

That’s great! What do you suppose was the original source of the problems? If you suspect bad hardware then it’s probably best to keep an eye on the logs for a while to see if any errors start cropping up.

As mentioned before, ‘reiserfsck’ will save orphaned files in the ‘lost+found’ directory. This directory might not be visible via network but you can check for it using the command line. If you see one there you can type this command to make everything visible via the disk share:

chmod -R 777 /mnt/disk3/lost+found
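
To see which disks actually ended up with a lost+found directory before changing any permissions, a small loop like the one below could help. It is only a sketch: find_lost_found is a made-up name, and the /mnt/disk* layout is assumed from the command above.

```shell
#!/bin/bash
# Sketch only: report each disk that has a lost+found directory and how many
# entries (files and directories) sit underneath it.

find_lost_found() {
    local root="${1:-/mnt}" dir count
    for dir in "$root"/disk*/lost+found; do
        # If nothing matches, the pattern stays literal and is skipped here.
        [ -d "$dir" ] || continue
        count=$(find "$dir" -mindepth 1 | wc -l | tr -d ' ')
        echo "$dir: $count entries"
    done
}

# Example call (commented out so that sourcing this file does nothing):
# find_lost_found /mnt
```

Disks that print nothing had no orphaned files saved, so the chmod only needs to be run on the ones that are listed.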

Me:

Guess I spoke too soon. I've reopened the thread... as I got redballed again while the parity check was running. Another WD drive... so I posted screenshot, smart report, syslog, and the first responder thinks it's a cabling problem. I've tried several SATA cables. None of them seem to work any better than the others. Starting to think those two ports on my motherboard are not so good (although I've had failures elsewhere). Either way, I've sent out RMAs for both those drives.

Kicking myself for it again, but I tried the "this drive is good" remove and re-add trick with disk6 when I didn't see errors referencing it on startup last time. Data-rebuild failed immediately. Lost and found will be helpful later on I think. Hopefully I'll be able to put these drives into a SATA dock and connect them to my Ubuntu machine to recover some of my data.

And today's update:

I needed to step away a bit, but I'm still trying to recover as much of my server as I can. I tried different SATA cables, then, since I was out of ports (the two malfunctioning ports were the last two on my board), I had to order another RAID card from Monoprice. I received it, but I still got errors just starting the system after I migrated the drives. The errors looked similar to the earlier ones, and the logs are attached.

SMART reports also attached:

sdd is the drive NOT attached to the array; sde is the drive attached to the array with the orange ball from the below screenshot.

syslog011013.txt

SMARTsdd011013.txt

SMARTsde011013.txt
