[SOLVED] Drive or controller issues - could use some advice

October 7, 2013

Been happily running unRaid for a bit over two years now, stuck with 4.7 because I never felt the need to upgrade to >2TB drives. My hardware specs are in this post, all unchanged except for some drive changes (most of the EADS and some of the EARS drives have been replaced with WD Red 2TB EFRX drives).

Last Friday, I finally upgraded to unRAID 5.0, and on Saturday I upgraded my parity drive to a 4TB WD Red (which, I should add, involved opening the case, unplugging that drive and pushing some other cables aside). Early Sunday morning, I found two drives missing with a bunch of errors in the syslog. After some reading, digging and searching, I figured it was a cabling issue, and indeed after reseating the offending cables, things seemed happy again (well, except for the parity drive, which completely lost track of things with the two failed drives). However, today, after about 22 hours of uptime, the same drives started acting up again.

Syslog shows things like this:

Oct  7 13:51:56 garfield kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x180000 action 0x6 frozen
Oct  7 13:51:56 garfield kernel: ata4: edma_err_cause=00000020 pp_flags=00000001, SError=00180000
Oct  7 13:51:56 garfield kernel: ata4: SError: { 10B8B Dispar }
Oct  7 13:51:56 garfield kernel: ata4: hard resetting link
Oct  7 13:51:57 garfield kernel: ata4: SATA link down (SStatus 0 SControl 310)
Oct  7 13:51:57 garfield kernel: ata4.00: link offline, clearing class 1 to NONE
Oct  7 13:51:57 garfield kernel: ata4: hard resetting link
Oct  7 13:51:58 garfield kernel: ata4: SATA link down (SStatus 0 SControl 310)
Oct  7 13:51:58 garfield kernel: ata4.00: link offline, clearing class 1 to NONE
Oct  7 13:51:58 garfield kernel: ata4: hard resetting link
Oct  7 13:51:59 garfield kernel: ata4: SATA link down (SStatus 0 SControl 310)
Oct  7 13:51:59 garfield kernel: ata4.00: link offline, clearing class 1 to NONE
Oct  7 13:51:59 garfield kernel: ata4.00: disabled
Oct  7 13:51:59 garfield kernel: ata4: EH complete
Oct  7 13:51:59 garfield kernel: sd 4:0:0:0: rejecting I/O to offline device
Oct  7 13:51:59 garfield kernel: sd 4:0:0:0: [sde] killing request
Oct  7 13:51:59 garfield kernel: sd 4:0:0:0: rejecting I/O to offline device
Oct  7 13:51:59 garfield kernel: sd 4:0:0:0: [sde] Unhandled error code
Oct  7 13:51:59 garfield kernel: sd 4:0:0:0: [sde]
Oct  7 13:51:59 garfield kernel: Result: hostbyte=0x01 driverbyte=0x00
Oct  7 13:51:59 garfield kernel: sd 4:0:0:0: [sde] CDB:
Oct  7 13:51:59 garfield kernel: cdb[0]=0x28: 28 00 48 7b 86 18 00 01 e0 00
Oct  7 13:51:59 garfield kernel: end_request: I/O error, dev sde, sector 1216054808
Oct  7 13:51:59 garfield kernel: md: disk8 read error, sector=1216055232
Oct  7 13:51:59 garfield kernel: ata4.00: detaching (SCSI 4:0:0:0)
Oct  7 13:51:59 garfield kernel: md: disk8 read error, sector=1216054744
Oct  7 13:51:59 garfield kernel: md: disk8 read error, sector=1216054752

and similar for ata3/disk7. Full syslog obviously attached. At some point, both drives (which were /dev/sdd and /dev/sde) get re-detected as /dev/sds and /dev/sdt.
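As a cross-check on the excerpt above, the raw numbers in the log can be decoded by hand. The sketch below assumes the standard SATA SError bit positions (bit 19 for a 10b-to-8b decode error, bit 20 for a disparity error) and the standard SCSI READ(10) layout (opcode 0x28, big-endian LBA in bytes 2-5, transfer length in bytes 7-8). The function names are illustrative only.

```python
# Decode two raw values from the syslog excerpt:
# the SError register and the READ(10) command block (CDB).

# Only the two SError bits that appear in this log are listed here.
SERROR_BITS = {
    19: "10B8B",   # 10b-to-8b decode error on the link
    20: "Dispar",  # running disparity error on the link
}


def decode_serror(value):
    """Return the names of the known SError bits set in value."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]


def decode_read10(cdb_hex):
    """Return (lba, block_count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    count = int.from_bytes(cdb[7:9], "big")
    return lba, count


print(decode_serror(0x180000))
print(decode_read10("28 00 48 7b 86 18 00 01 e0 00"))
```

Both SError flags describe errors on the SATA link itself rather than on the disk media. The decoded LBA, 1216054808, matches the sector in the `end_request` line. It is also exactly 64 sectors above the md read error at sector 1216054744, which would be consistent with a partition that starts at sector 64, although that is an inference from the numbers and not something the log states.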

After yesterday's incidents, I ran fsck on both drives and everything seemed in order. Also, SMART doesn't show any problems for either drive (or at least as far as I can tell), so for now I'm going to assume the drives are fine. But then what is the issue here? My initial hunch was cabling, and messing with the relevant cables did seem to solve the problem briefly, but it returned today.

My current list of candidates for causing this problem, in order of likelihood:

1. SATA cables (moving them about after they've been in the exact same position for 2+ years may have harmed them)

2. Controller issues - both troublesome drives are connected to the same Adaptec, I'm not quite sure why this would start acting up now though

3. Icy dock trouble - both drives sit next to each other in the same dock
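One way to narrow down the candidates above is to tally which ATA ports and array disks the errors actually land on. The following is a minimal sketch, assuming the message formats seen in the excerpt; the regular expressions, the `summarize` function and the category names are illustrative and not part of unRAID or the kernel.

```python
import re
from collections import Counter, defaultdict

# Patterns modelled on the log lines quoted in this post (an assumption).
ATA_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")
MD_RE = re.compile(r"kernel: md: (disk\d+) read error")


def summarize(lines):
    """Count libata events per port and md read errors per array disk."""
    ports = defaultdict(Counter)
    md_errors = Counter()
    for line in lines:
        line = line.rstrip()
        m = ATA_RE.search(line)
        if m:
            port, msg = m.groups()
            if msg.startswith("exception"):
                ports[port]["exceptions"] += 1
            elif "link down" in msg:
                ports[port]["link_down"] += 1
            elif msg == "disabled":
                ports[port]["disabled"] += 1
            continue
        m = MD_RE.search(line)
        if m:
            md_errors[m.group(1)] += 1
    return ports, md_errors


# A few lines copied from the excerpt, used as sample input.
SAMPLE = [
    "Oct  7 13:51:56 garfield kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x180000 action 0x6 frozen",
    "Oct  7 13:51:57 garfield kernel: ata4: SATA link down (SStatus 0 SControl 310)",
    "Oct  7 13:51:59 garfield kernel: ata4.00: disabled",
    "Oct  7 13:51:59 garfield kernel: md: disk8 read error, sector=1216055232",
]

ports, md_errors = summarize(SAMPLE)
for port in sorted(ports):
    print(port, dict(ports[port]))
print(dict(md_errors))

# To run it against a full log instead (path is an example):
#   with open("syslog.txt") as fh:
#       ports, md_errors = summarize(fh)
```

If every affected port turns out to belong to one controller or one dock, that points away from a coincidence of two bad drives, though it cannot by itself distinguish between cable, controller and dock.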

Based on the syslog and smart output (all smart outputs are mashed together in a single file), does anyone have any other ideas/suggestions?

Oh, there's one drive that has some serious issues: /dev/sdo, which is my cache drive. I also noticed this yesterday, so that drive is obviously now also up for replacement, but it's not currently my biggest worry, as you can imagine.

syslog.zip

20131007_smart_summary.txt

October 8, 2013

Disk 7 and 8 look like cabling issues.

October 9, 2013

Author

Sorry for the slow response, life got in the way. Thanks for the confirmation, I've replaced the cables. Now when I boot the system, I still have a red ball for disk8; all other disks are green and all are in the correct positions according to my pre-5.0 screenshot. SMART output for disk8 reports all's well, and the short self-test completes without error.

Right now, I'm not entirely sure how to proceed. Here's what I'm thinking the current state is:

- All data drives are probably (mostly) intact, disk7 and disk8 might have some filesystem corruption, but nothing fsck won't be able to fix;

- Because of the dual disk failure (and possible writes to the array after this happened), I'm going to assume parity is invalid.

As per Joe's response in this thread, I'm thinking initconfig might be the way to go for me as well, assuming that will automatically mark the red balled drive blue as well. I also need to run an fsck against at least disk7 and disk8. Should I do this after I start the array, or would it be better to run it against /dev/sdd1 outside of the array? I'm thinking stressing the drive before starting the array may also give an indication of its stability.
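The "stress the drive first" idea can be approximated with a read-only pass over the device. This is only a sketch under assumptions: the function name is made up, the device path would be whatever the disk is currently detected as (for example `/dev/sdd`, which needs root to open), and a clean pass only shows that the link and drive survived a sustained read, not that the filesystem is consistent. Because it opens the device read-only, it does not write anything to the disk.

```python
import time


def sequential_read_test(path, chunk_size=1024 * 1024, max_bytes=None):
    """Read path from the start; return (bytes_read, seconds, error).

    error is None on a clean pass, otherwise the OSError that stopped the read.
    """
    total = 0
    error = None
    start = time.monotonic()
    try:
        # buffering=0 gives an unbuffered raw file object.
        with open(path, "rb", buffering=0) as dev:
            while max_bytes is None or total < max_bytes:
                want = chunk_size
                if max_bytes is not None:
                    want = min(chunk_size, max_bytes - total)
                chunk = dev.read(want)
                if not chunk:
                    break  # end of device or file
                total += len(chunk)
    except OSError as exc:
        error = exc
    return total, time.monotonic() - start, error


# Example call (device name is an assumption, run as root):
#   total, seconds, error = sequential_read_test("/dev/sdd")
#   print(total, "bytes in", round(seconds), "s, error:", error)
```

Watching the syslog while this runs would show whether the link resets seen earlier come back under load.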

Any guidance will be much appreciated.

One thing's for sure: had this been a traditional RAID, my life would've been a lot simpler by now: I'd just have to redownload all 26TB that's on the array (needless to say, I'm quite happy with unRaid).

syslog.txt

October 9, 2013

It appears (from your comment r.e. needing to re-download everything) that you don't have backups of your data.

So be cautious about doing a "New Config" -- which will eliminate the ability to rebuild your red-balled disk.

I'd simply replace disk8 with a new disk (now that you're running v5 and have a 4TB parity drive, this could be a 4TB disk) ... and let the system rebuild it.

October 9, 2013

Note: If you think the red-ball is an error, I'd do the following before doing a "New Config".

Remove disk8 and attach it to a Windows PC. Use the free LinuxReader [ http://www.diskinternals.com/linux-reader/ ] to read the contents of the disk => don't just "look" at the disk ... actually copy all the files from it [to ensure they're actually readable] -- at least do enough of them that you have confidence in the readability of the data.

If you don't encounter any issues reading the data, then you can put it back in the system and do a New Config.

HOWEVER ... remember that a red-ball means UnRAID encountered an error when writing to the disk (not reading from it). The safest thing to do is either replace the drive (as I noted above); or at least copy all of the data off of it (e.g. with LinuxReader) to a backup drive; then pre-clear the disk before you put it back in the array.
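The "actually read every file" check described above can also be done without a second disk to copy to. The sketch below is not LinuxReader and not an unRAID tool; it assumes the disk is mounted read-only somewhere (the mount point in the example is hypothetical) and simply reads each file to the end, recording any path that raises an I/O error.

```python
import os


def find_unreadable(root, chunk_size=1024 * 1024):
    """Read every file under root in full; return (files_ok, failures).

    failures is a list of (path, error_message) for files or directories
    that could not be read.
    """
    ok = 0
    failures = []

    def on_walk_error(exc):
        # Called by os.walk when a directory cannot be listed.
        failures.append((exc.filename, str(exc)))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    while fh.read(chunk_size):
                        pass  # discard the data, we only care that the read works
                ok += 1
            except OSError as exc:
                failures.append((path, str(exc)))
    return ok, failures


# Example call (mount point is hypothetical):
#   ok, failures = find_unreadable("/mnt/olddisk8")
#   print(ok, "files read,", len(failures), "failures")
```

A clean result shows the files are readable, not that their contents are correct; copying them to a backup drive, as suggested above, remains the safer option.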

October 9, 2013

Author

You're correct, I don't have backups. It is simply too much (and not nearly important enough) to back up effectively, although it would of course still suck if I'd lost it.

My worry about rebuilding disk8 from parity is that I'm not sure parity is valid. But I suppose if I replaced the disk, I would still have the data on it and the only disk affected by the rebuild would be disk8. I'll probably know soon enough if the parity turns out to be invalid because that'll likely leave me with an unreadable (or at least very corrupted) disk8. Your advice seems sound, I guess I was overthinking it, thanks.

As for the reason it got red-balled, I'm reasonably sure it's because of a controller/cable fault, not a drive fault. Still, there too your advice is sound. I'll just grab the old 2TB parity drive that has just finished pre-clearing with no errors, let unRAID work its magic and shelve the current disk8 as a fall-back.

October 9, 2013

"You're correct, I don't have backups. It is simply too much (and not nearly important enough) to backup effectively, although it would of course still suck if I'd lost it"

As anyone who follows my posts here knows, I'm a VERY strong believer in backups. My arrays have 39TB of data -- ALL backed up on offline disks I keep for that purpose.

As long as it's truly "not ... important enough" to back up, that's fine. Just remember that when something catastrophic happens -- and don't be upset that you've actually lost data. That's usually when folks suddenly decide that backups aren't such a bad idea after all.

October 10, 2013

Author

I hear what you're saying about backups, and I can't say I disagree, but at the same time I'm unsure if I'm prepared to spend the dough on an extra set of storage for backup purposes. Maybe a subset. Guess I'll browse your post history to get an idea of how you've approached the actual process; if you feel so strongly about it, I'm sure you've posted your setup more than once.

As for my situation: the rebuild completed overnight and everything looks OK, no errors at all in syslog. Of course, the previous two times things ran for about a day before something started failing, so I'm not celebrating just yet. The only thing I'm not too sure about now is whether I should run an fsck against disks 7 and 8 (and maybe even all others?). Even without any crashes, normal Linux filesystems tend to run an automatic fsck every so many days or mounts, but UnRAID so far hasn't done any as far as I can tell...

October 10, 2013

No need to browse my posts ... I use a very simple process:

=> I have a backup disk in an external caddy [ http://www.newegg.com/Product/Product.aspx?Item=N82E16817153071 ]

=> Anything I copy to my array, I copy to the backup disk.

=> When the backup disk gets full, I replace it with another one; mark it "Backup #xx" (xx is currently 17); store it in a WiebeTech DriveBox [ http://www.amazon.com/DriveBox-3851-0000-11-Hard-Disk-Case/dp/B004UALLPE ]; and put it in my fireproof/waterproof safe in another part of the house. I also "print" a directory of the backup disk using Directory Printer [I actually "print" it to a PDF file] and save all these directories in a folder on my hard drive.
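For anyone without Directory Printer, the "print a directory of the backup disk" step above can be approximated with a short script. This is a sketch and not the tool mentioned in the post: it writes a plain-text catalog instead of a PDF, and the function names, paths and label are placeholders.

```python
import os


def list_tree(root):
    """Return sorted (relative_path, size_in_bytes) for every file under root."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # lstat avoids failing on broken symbolic links.
            entries.append((os.path.relpath(full, root), os.lstat(full).st_size))
    return sorted(entries)


def write_catalog(root, out_path, label):
    """Write a text catalog of root to out_path; return the number of files listed."""
    entries = list_tree(root)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(f"{label}: {len(entries)} files\n")
        for rel, size in entries:
            out.write(f"{size:>15,d}  {rel}\n")
    return len(entries)


# Example call (paths and label are placeholders):
#   write_catalog("/mnt/backupdisk", "backup-17.txt", "Backup #17")
```

Keeping one catalog file per backup disk makes it possible to find which offline disk holds a given file without plugging each one in.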

So I only need to buy perhaps one drive a year (or even less with 3 & 4 TB drives) for backups ... but of course it depends on what rate you're adding data to your array and what size drives you use for backups.

Obviously, if you have a very large array, and haven't been maintaining backups of it, it can cost a good bit to initially back it all up. But after that, it's a pretty nominal expense -- in fact, I've only bought about half of my backup disks for that purpose. The rest were disks I used for that purpose after I had replaced them with a larger one; or replaced a failed disk and RMA'd the bad one ... then just used the refurbished disk I got back from the manufacturer for backups; etc.

I've simply spent FAR too much time building up the data I have to even think about doing it again ... and to me it's inexpensive "insurance" to keep backups.

October 10, 2013

One more thought r.e. backups => I tell all my friends there's one simple rule r.e. backing up your data: Always assume that at midnight your system is going to lose all of its hard drives. If there's anything you're going to be upset about losing, be sure you have backups. If not, don't worry about it.

October 10, 2013

Author

I guess you managed to convince me. I'll just need to think of a way to automate this, as most of the stuff that would suck if I'd lost it is placed onto the array automatically (sickbeard, mostly). Perhaps I could somehow hook into the mover script.

October 10, 2013

I'll just need to think of a way to automate this ...

Yes, it's a more difficult problem when your content is being added automatically. I don't have that issue.

I can think of two relatively simple approaches (both cost a few $$):

(1) This one requires a 2nd backup disk (to ensure there's always space for the copies). Set up a backup utility (e.g. SyncBack, Robocopy, SyncTool, etc.) to automatically run once/day and automatically copy everything from your array with a file date/time later than the last run of the script (or just = "today", etc.). The destination for the copy could be the current backup disk ... but to ensure it never fails due to lack of space, I'd copy it to a dedicated extra disk; then just manually move everything to the backup disk (which would keep the spare disk always empty except when you had to change backup disks). The spare disk wouldn't need to be all that large -- $50 should buy one plenty large enough.

(2) This costs the most, but is clearly the easiest ==> Instead of using offline disks, just build a 2nd UnRAID server for your backups. Then just have a sync utility run once/day to ensure it mirrors your main server. Clearly this costs a good bit more ... case/PSU/CPU/motherboard/memory that you wouldn't need with offline disks. Probably an extra $400 to $500. It wouldn't need to be on except during the backups -- setting it for WOL would let the script turn it on a few minutes before the actual sync ran; then shut it down when complete.
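For the WOL part of approach (2), the wake-up itself is a single UDP broadcast. The sketch below builds the standard Wake-on-LAN "magic packet" (six 0xFF bytes followed by the target MAC address repeated 16 times). The function names are illustrative, the MAC address in the example is a placeholder, and the backup server's BIOS and network card would need WOL enabled for it to have any effect.

```python
import socket


def build_magic_packet(mac):
    """Return the 102-byte magic packet for a MAC like '00:11:22:33:44:55'."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("MAC address must be 6 bytes")
    return b"\xff" * 6 + raw * 16


def send_wol(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet on the local network."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))


# Example call (placeholder MAC address):
#   send_wol("00:11:22:33:44:55")
```

A scheduled job could call `send_wol`, wait a few minutes for the backup server to boot, and then start the sync.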

I've actually toyed with doing this; but chose not to -- NOT because of the cost; but because I keep my backup disks in a fireproof/waterproof safe.
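For approach (1), the "copy everything newer than the last run" step could look roughly like this. It is a sketch and not SyncBack, Robocopy or SyncTool: the function name and paths are placeholders, the caller is assumed to store the timestamp of the previous run somewhere, and there is no handling for a full destination disk.

```python
import os
import shutil


def copy_newer_than(src_root, dst_root, since_epoch):
    """Copy files under src_root modified after since_epoch into dst_root.

    Relative paths are preserved. Returns the sorted list of copied paths.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            if os.path.getmtime(src) <= since_epoch:
                continue  # unchanged since the last run
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)  # copy2 also preserves the modification time
            copied.append(rel)
    return sorted(copied)


# Example call (paths are placeholders; last_run is a Unix timestamp
# saved at the end of the previous run):
#   copy_newer_than("/mnt/user/TV", "/mnt/spare/TV", last_run)
```

Run once a day from a scheduler, this would fill the spare disk with only the new material, which can then be moved to the current backup disk by hand as described in (1).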
