Lost 1 drive, havoc ensued

May 19, 2009

Long story. I'm running 4.4.2 on a 6-disk array. After an evening of using just disk5, I went to power down for the night. disk1-4 were spun down while disk5 and parity were still up. I hit the spin up button and everything seemed fine. But after I clicked the stop button, the page wouldn't refresh. The box wasn't pingable, nor could I telnet to it. After a reboot, I still couldn't get to any of the shares, and the terminal showed some skewed text without any real error. At that point I thought a disk was semi-failing.

So I rebooted it and went to the prompt this time. All disks showed as mounted. So I started doing touch /mnt/disk1/1, touch /mnt/disk2/2. When I got to disk4, the box froze. I thought I had found the culprit and shut down the box to unplug disk4's power. After power up, I could get to the webpage, and I saw disk4 was missing. Good. I started the array from the page, but it froze yet again. I went to the console, and df showed only /mnt/disk3 being there. So I thought maybe it was disk3 that was bad. Another power down, then I plugged in disk4 and unplugged disk3. Oops, I should have powered up with all 6 disks plugged in. When it came up, the page said: "Too many wrong and/or missing disks!"

What should I do now? Is my data still protected as long as I find the bad disk and replace it? How can I find out which disk is bad? How do I get unRAID to recognize my drives again and start rebuilding? Currently disk3 shows missing since I unplugged it, and disk4 has a blue dot next to it after I unplugged it and plugged it back in.

BTW, I couldn't post this in the 4.4 forum, the post button doesn't do anything.

May 19, 2009

Author

syslog

May 19, 2009

Your diagnostic methods remind me of trying to locate a bad spark plug! I'm going to have to kid you here, by saying we're in a more modern age now. If we're wondering about a hard drive, we don't have to treat it like a dumb spark plug, we just ask it what is wrong! We have SMART in our drives, and we have syslogs, both chock full of info that they want to tell us! On a more serious note, that really is the wrong way, a very bad way, to diagnose any RAID array. If this was not an unRAID array, you would have lost ALL of the data on all of the drives.

And besides, I don't think there is anything seriously wrong with any of the drives. It is extremely rare for drive issues to crash the system, they just cause a lot of errors, and sometimes slow the system way down, but don't actually freeze or crash it. It is possible (but rare) to have corruption in the Reiser file system, after a bad shutdown or serious power issue, that could cause a characteristic crash, with very characteristic evidence of a kernel panic on the physical console. But that would only require a single reboot, and the use of reiserfsck (like scandisk) to clean it up.

What would have been the most helpful was a syslog immediately after the apparent crashes, if it was available from the console. And even if you can't get a syslog, a digital camera pic of the console can help. I do recommend the Troubleshooting page, below in my sig. For now, we need to first recover your array, so please do re-hook up all drives, and see if a reboot will keep the system up long enough to do the Trust My Array procedure, which will restore the array and all drives back to normal. Your syslog shows no issues at all with the system, besides the obvious status of the 2 drives, which is completely normal, just the way I would have expected them to be, based on your story of actions taken. Now we need a syslog captured right after the system fails again, or a good idea of what the last messages on the screen are, in order to figure out what really is the problem.

Two other things you can try. The first is to run a 2 to 4 hour memory test, from the unRAID boot screen, to eliminate a memory problem. Memory problems are a big source of crashes. The next thing you can do is obtain SMART reports for ALL of the drives, including Disk 3 once re-connected. Please see the Troubleshooting page, Obtaining a SMART report section.

May 19, 2009

Author

Thanks for all the suggestions, I will bring everything back and try the trust my array method. The reason I just unplugged drives is that every time I tried to access the share via XP, it locked up the system, or at least appeared to. (Share timing out, gui timing out, can't ping or telnet, and command prompt not responding to any key strokes.) My experience with bad drives in XP is that the system freezes up for a looong time until it finally times out. That, and touching a file froze up the box.

I did not realize the syslog isn't being appended to until I copied over the latest to the flash drive. The syslog right after the crash was actually on the flash drive before this. I'll get another once I get the array restored tonight, hopefully it's still intact.

May 20, 2009

Author

I did get 1 "screen shot" before I started unplugging drives. Here it is:

May 20, 2009

Author

After I plugged in all the drives, booted up and followed the trust my array steps, I got to the part where it says to click Start to bring all the drives to green. The gui froze after that, so I went to the console and tried to copy the syslog. Before I finished typing the "g" for log, the keyboard went unresponsive. The screen went "skewed" again, which I am attaching to this post. After a reboot, the gui showed all drives as blue balls. And the Start button is labeled:

Start will record all disk information, bring the array on-line, and start Parity-Sync (if parity is present). The array is immediately available, but is unprotected until Parity-Sync completes.

I was able to run a SMART report on all drives, and they all said PASSED. I don't know what to do next. Since I have the case open now, I can hear some periodic sound from one of the drives. It sounds like the motor shuts down then starts up right away in a split second. But it's so random, I haven't been able to find out which drive is doing that. While I wait, I will run a mem test.

May 20, 2009

Author

More SMART reports and syslog, but I doubt the syslog is helpful since it didn't crash.

Quick overview of the SMART reports:

Reallocated_Sector_Ct is 0 for all drives

Current_Pending_Sector is 10 for the parity drive

highest Temperature_Celsius was 30

May 20, 2009

The parity disk has some problems ...

197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 10

198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 10

But this does not seem like the kind of thing that would crash the server.

Look here for a description of these (and other) SMART attributes.
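For anyone checking several drives at once, here is a minimal sketch of how the sector-related attributes could be picked out of a SMART attribute table. It assumes the ten-column layout seen in the two lines above (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value) and a plain integer in the raw column. The function name `flag_sector_attrs` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read a SMART attribute table on stdin and print
# the sector-health attributes whose raw value (column 10) is nonzero.
# Assumes the 10-column layout shown in this thread.
flag_sector_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ &&
         ($10 + 0) > 0 { print $2, $10 }'
}

# Demo using the two parity-drive lines quoted above.
flag_sector_attrs <<'EOF'
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 10
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 10
EOF
```

With a live drive, the input would typically come from something like `smartctl -A /dev/sdX | flag_sector_attrs`, where `/dev/sdX` is a placeholder for the actual device.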

May 20, 2009

Author

mem test first pass returned no errors. I don't think it's the memory.

May 20, 2009

Author

Thanks for the link. It appears I may have some sector errors on the parity drive, but like you said, it shouldn't be crashing the server.

May 21, 2009

Author

What should I do next? Help.

May 21, 2009

Sorry, life got in the way. Volunteer help can be very unreliable, especially from me!

This looks like a tough one. There is a fair amount of evidence, but nothing conclusive.

* Clear corruption, in 2 different sessions, of the screen buffer of the current terminal session. Appears to include an 8 byte slide of the various lines from the buffered console, plus an overwrite of the 8 bytes between each line buffer

* No apparent corruption elsewhere. Both syslogs and all of the SMART reports are correctly structured, and contain no evidence whatsoever of any buggy modules or memory corruption. So far, only the console display shows anything wrong.

* System misbehavior: apparent freezes, keyboard unresponsive, network offline. So far, cannot tell if these freezes are crashes or panics, or just endless loops (100% CPU) or stuck in a race condition

* Possible power interruption, or motor control issue. A motor that is dipping down then returning to full revs seems very significant. Power supply issues? Getting too hot? Which motor, fan or power supply or drive? Drive SMART reports show no evidence of this.

* Memory test, 1 pass only, shows no problems, but flaky memory bits need more passes, perhaps overnight, to be identified. Symptoms so far don't 'feel' like memory issues, not random enough, but that is not conclusive.

* No evidence at all of drive or cabling issues. That leaves motherboard, addon controller, power supply, over heating chipsets, flaky memory under load or when heated up, kernel incompatibility with some hardware component, some other module bug?

Some things to try, may or may not help, but may provide more info:

* Repeat the Trust My Array procedure a couple more times, to determine if it freezes in the same way at the same point. Repeat other tasks also, that had previously 'crashed', to tell if they crash randomly or are completely repeatable. It's really helpful to know what is repeatable, and what is not.

* Run the memory test overnight or when not doing anything else; we want to force any weak memory bits to fail.

* Try booting different unRAID versions, both forward and backward, v4.3.3 and v4.5-beta6. We want to 'try' to eliminate software compatibility issues in a particular kernel release. Get syslogs whenever possible, you never know which one will have the important clue.

* Try unhooking power to ALL drives, and booting, then listening for the dipping motor. Especially check to see if it is a fan or the power supply. You said that it was a 'periodic sound', but also that it was 'random'?

* Carefully, feel for extremely hot components on the motherboard, but be careful to avoid any static discharge, don't touch any exposed electrical traces.

* If you can find a substitute power supply, try it.

That should keep you busy for a bit!

May 21, 2009

Author

Thank you for the volunteer help. I was hoping someone from lime would take this too.

For the trust my array method, can I still do that right now when all my drives are blue and the start button says it'll start a sync? I'm guessing after running the trust my array method, it'll go back to "normal"?

After I posted, the memory test finished 3 passes, all without error, so I'm leaning against a memory error, since the motor noise is a bigger sign. When I say periodic, I just mean I hear it randomly but repeatedly. No pattern so far. I didn't want to leave the drives on for too long to test the memory, so I'll test the memory tonight with power to all the drives unhooked.

If it's a drive issue, I suppose if I hook it up to an external power supply and power up the drive individually, I'd hear the bad drive's motor that way. Maybe this is something I can try too. It does not appear to be the power supply fan etc, but I'll try unhooking power to all the drives. I'll also create a script to copy the syslog with minimal key strokes so I can get the syslog before it crashes, and report back what I find. Thanks again for the help.
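A rough sketch of what such a syslog-copy script might look like, not a tested unRAID tool. It assumes the live log is at /var/log/syslog and the flash drive is mounted at /boot; both paths can be overridden. The function name `save_syslog` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: copy the live syslog to the flash drive under a
# timestamped name, then flush it to disk so it survives a freeze.
# Assumed defaults: log at /var/log/syslog, flash mounted at /boot.
save_syslog() {
    src="${1:-/var/log/syslog}"
    dest_dir="${2:-/boot}"
    # One file per second at most; a second copy within the same
    # second overwrites the first.
    dest="$dest_dir/syslog-$(date +%Y%m%d-%H%M%S).txt"
    cp "$src" "$dest" && sync && echo "$dest"
}

# Usage (uncomment to run with the assumed defaults):
# save_syslog
```

Saving this under a one- or two-letter file name keeps the typing short, and the `sync` is there so the copy is written out to the flash before the next freeze rather than left in cache.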

May 21, 2009

The 'Trust My Array' procedure starts with the Restore button, so that resets the config back to zero anyway, whatever the current state is. I can't think of any reason at all, why you couldn't repeat it. Once you Start the array, everything should return to normal, as normal as your system can be currently, with a parity check in progress, which you can abort. Once everything is fixed, you will want to run a full parity check.

Memory sounds good.

May 21, 2009

Author

Finally, I got something useful in the syslog. I noticed that it would crash soon after I did a df. So I just kept doing df and copying the syslog over and over. Here's a snippet:

May 20 20:34:19 unRAID kernel: hda: dma_timer_expiry: dma status == 0x61

May 20 20:34:29 unRAID login[1311]: ROOT LOGIN on `tty1'

May 20 20:34:29 unRAID kernel: hda: DMA timeout error

May 20 20:34:29 unRAID kernel: hda: dma timeout error: status=0x50 { DriveReady SeekComplete }
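To see at a glance which device the DMA messages point at across a longer syslog, a small tally like the following could help. This is only a sketch: it assumes the standard syslog layout seen in the snippet above (month, day, time, host, then `kernel:` and the device name), and the function name `dma_events_by_device` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count kernel DMA timeout / timer-expiry messages
# per device. Reads the files given as arguments, or stdin if none.
# Assumes field 5 is "kernel:" and field 6 is the device name plus ":".
dma_events_by_device() {
    awk '$5 == "kernel:" && tolower($0) ~ /dma/ &&
         tolower($0) ~ /(timeout|timer_expiry)/ {
             dev = $6; sub(/:$/, "", dev); count[dev]++
         }
         END { for (d in count) print d, count[d] }' "$@"
}

# Demo using the four lines from the snippet above.
dma_events_by_device <<'EOF'
May 20 20:34:19 unRAID kernel: hda: dma_timer_expiry: dma status == 0x61
May 20 20:34:29 unRAID login[1311]: ROOT LOGIN on `tty1'
May 20 20:34:29 unRAID kernel: hda: DMA timeout error
May 20 20:34:29 unRAID kernel: hda: dma timeout error: status=0x50 { DriveReady SeekComplete }
EOF
```

On the snippet above this counts three DMA-related kernel lines for hda and ignores the login line. Against a saved log it would be called as `dma_events_by_device /boot/syslog.txt`, where the file name is a placeholder.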