
Unknown near-catastrophe....need advice



I'm working on recovering from a pretty bad situation and I need some advice.

 

A few days ago I went to access some of my data and I noticed entire user shares that were empty...this didn't make much sense.

 

Checking the unRAID web interface I saw that one of my drives was marked RED and it indicated 7 errors.

2 other drives in the array also showed errors (one with 93 errors, the other with 69 errors) but they were still green.

Additionally, looking at the read/write stats on the drives I noticed some strange things:

 

Device   Model / Serial No.  Temp   Size (KB)       Free  Reads      Writes       Errors
parity   _                   *      1,953,514,552   -     433,384    703,653      0
disk1    WDC_WD20EARS-00M    0°C    1,953,514,552   -     160        134,874,888  93
disk2    SAMSUNG_HD103UJ_    25°C   976,762,552     -     128,332    60           0
*disk3   WDC_WD20EADS-00R    0°C    1,953,514,552   -     90         134,548,877  7
disk4    WDC_WD20EARS-00M    *      1,953,514,552   -     781,639    214,171      0
disk5    WDC_WD20EARS-00M    *      1,953,514,552   -     445,872    72,117       0
disk6    WDC_WD10EADS-00L    0°C    976,762,552     -     445,872    134,871,616  69
disk7    WDC_WD20EARS-00M    *      1,953,514,552   -     1,660,823  62,410       0
cache    WDC_WD7501AALS-_    *      732,574,552     -     528,746    1,341,675    0

(*disk3 is the drive that was marked RED.)

 

The capture was taken after I unmounted the drives, so free space isn't reported.  I'll admit that most of those drives have very little free space; this is my fault, I've been meaning to address it, and I'm not sure whether it contributed to my problems.

 

What is really strange is that the 3 drives that had issues all had 134M+ writes to them and relatively few reads.  I don't know why; I don't think I'd been writing that much data, especially since the parity drive only has 703K writes in the same period of time.

 

After stopping the array, and before shutting down, I attempted to run a read-only reiserfsck check on each of the drives.  The drives without errors all completed the check fine, but reiserfsck complained loudly about the other 3 drives.
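
For reference, the read-only check I ran was along these lines (the device name is just a placeholder for whichever disk is being checked; the array has to be stopped, or at least the disk unmounted, first):

# read-only filesystem check; makes no changes to the disk
reiserfsck --check /dev/sdX1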

 

I then powered down the server to locate the problematic drives.  Interestingly, all 3 of those drives were connected to the same SFF-8087 SAS breakout cable... did something go crazy with the cable/controller?  I dunno.

 

So I powered down, pulled the drives, and I'm working on them on a separate system.  I ran reiserfsck on them one at a time and they all checked out; no major complaints or corruption indicated by fsck.

 

I'm still a bit nervous about all of those writes... why did they happen?  Why roughly the same number on all 3 drives?

 

What should my next step be?  Throw them back in the array, boot it up, "trust the array," cross my fingers, and do a parity check?

 

Any other ideas or suggestions?

 

Once I get it back up and stable, I'm going to add another drive and make sure those drives have a bit more free space on them....

 

Thanks.

 

 


A few days ago I went to access some of my data and I noticed entire user shares that were empty...this didn't make much sense.

Checking the unRAID web interface I saw that one of my drives was marked RED and it indicated 7 errors.

2 other drives in the array also showed errors (one with 93 errors, the other with 69 errors) but they were still green.

 

The first drive had a write failure and was mounted read-only. The other 2 likely had read failures due to ?? (could be a number of reasons). Post SMART reports for all of your drives, and the syslog... that will give the clues as to why they "failed".
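
Something along these lines will capture them (device names are placeholders -- repeat for each disk in the array; /boot assumes you want the output saved to the flash drive so it survives a reboot):

# full SMART report for one drive; on some controllers you may need to add '-d ata'
smartctl -a /dev/sdb > /boot/smart_sdb.txt
# copy the current syslog to the flash drive
cp /var/log/syslog /boot/syslog.txt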


I'm running unRAID version 4.7.  I've attached the syslog files.  I copied them before I restarted the box; unfortunately, they don't have much that's useful in them.  When I first had problems, I stopped the array, but some of the drives that had been flagged with errors were completely non-responsive.  I proceeded to start validating the filesystems on the drives that appeared to be okay... that took a couple of days, but they all checked out.  In the meantime, the syslog files rolled over and I no longer had the syslog data from the actual time of the incident, but I've attached what I have.

 

EDIT: okay, I *couldn't* attach the syslog files, because they are each about 15MB, and even after compression they are still about 900K each... I've got 3 of them, and these messages are repeated throughout:

Jul 18 04:40:01 UnRaidServer emhttp: shcmd (4418): umount /mnt/disk3 >/dev/null 2>&1
Jul 18 04:40:01 UnRaidServer emhttp: _shcmd: shcmd (4418): exit status: 1
Jul 18 04:40:01 UnRaidServer emhttp: shcmd (4419): rmdir /mnt/disk3 >/dev/null 2>&1
Jul 18 04:40:01 UnRaidServer emhttp: _shcmd: shcmd (4419): exit status: 1
Jul 18 04:40:01 UnRaidServer emhttp: shcmd (4420): umount /mnt/cache >/dev/null 2>&1
Jul 18 04:40:01 UnRaidServer emhttp: _shcmd: shcmd (4420): exit status: 1
Jul 18 04:40:01 UnRaidServer emhttp: shcmd (4421): rmdir /mnt/cache >/dev/null 2>&1
Jul 18 04:40:01 UnRaidServer emhttp: _shcmd: shcmd (4421): exit status: 1
Jul 18 04:40:01 UnRaidServer emhttp: Retry unmounting disk share(s)...
Jul 18 04:40:01 UnRaidServer emhttp: mdcmd: write: Invalid argument
Jul 18 04:40:01 UnRaidServer emhttp: disk_spinning: open: No such file or directory
Jul 18 04:40:01 UnRaidServer kernel: mdcmd (5803): spindown 0
Jul 18 04:40:01 UnRaidServer kernel: md: disk0: ATA_OP_STANDBYNOW1 ioctl error: -22
Jul 18 04:40:02 UnRaidServer emhttp: disk_spinning: open: No such file or directory
Jul 18 04:40:02 UnRaidServer emhttp: disk_spinning: open: No such file or directory
Jul 18 04:40:02 UnRaidServer emhttp: shcmd (4422): /usr/sbin/hdparm -y /dev/sdj >/dev/null
Jul 18 04:40:03 UnRaidServer emhttp: _shcmd: shcmd (4422): exit status: 52
Jul 18 04:40:06 UnRaidServer emhttp: shcmd (4423): umount /mnt/disk3 >/dev/null 2>&1
Jul 18 04:40:06 UnRaidServer emhttp: _shcmd: shcmd (4423): exit status: 1
Jul 18 04:40:06 UnRaidServer emhttp: shcmd (4424): rmdir /mnt/disk3 >/dev/null 2>&1
Jul 18 04:40:06 UnRaidServer emhttp: _shcmd: shcmd (4424): exit status: 1
Jul 18 04:40:06 UnRaidServer emhttp: shcmd (4425): umount /mnt/cache >/dev/null 2>&1
Jul 18 04:40:06 UnRaidServer emhttp: _shcmd: shcmd (4425): exit status: 1
Jul 18 04:40:06 UnRaidServer emhttp: shcmd (4426): rmdir /mnt/cache >/dev/null 2>&1
Jul 18 04:40:06 UnRaidServer emhttp: _shcmd: shcmd (4426): exit status: 1
Jul 18 04:40:06 UnRaidServer emhttp: Retry unmounting disk share(s)...
Jul 18 04:40:11 UnRaidServer emhttp: shcmd (4427): umount /mnt/disk3 >/dev/null 2>&1
Jul 18 04:40:11 UnRaidServer emhttp: _shcmd: shcmd (4427): exit status: 1
Jul 18 04:40:11 UnRaidServer emhttp: shcmd (4428): rmdir /mnt/disk3 >/dev/null 2>&1
Jul 18 04:40:11 UnRaidServer emhttp: _shcmd: shcmd (4428): exit status: 1
Jul 18 04:40:11 UnRaidServer emhttp: shcmd (4429): umount /mnt/cache >/dev/null 2>&1
Jul 18 04:40:11 UnRaidServer emhttp: _shcmd: shcmd (4429): exit status: 1
Jul 18 04:40:11 UnRaidServer emhttp: shcmd (4430): rmdir /mnt/cache >/dev/null 2>&1
Jul 18 04:40:11 UnRaidServer emhttp: _shcmd: shcmd (4430): exit status: 1
Jul 18 04:40:11 UnRaidServer emhttp: Retry unmounting disk share(s)...
Jul 18 04:40:13 UnRaidServer emhttp: mdcmd: write: Invalid argument
Jul 18 04:40:13 UnRaidServer emhttp: disk_spinning: open: No such file or directory
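
In hindsight, something like the following would at least have shown whether anything was still holding /mnt/disk3 open while emhttp kept retrying the unmount (generic commands, not something I actually captured at the time):

# show any processes with files open under the mount point
fuser -vm /mnt/disk3
# or, with more detail
lsof +D /mnt/disk3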

 

I've also attached SMART reports for all of the drives that reported read/write errors... they are fine.

 

As far as "mounting" the drives in another system... I never did that.  I PUT the drives in another system (my forensics box), but I didn't make any changes to any of the drives... until the very end.

I started working with the 1TB drive first, figuring that since it was smaller it would be quicker to experiment with.  I used ddrescue to pull a full image of the drive and started working with that... I was expecting the worst, and I wanted to eliminate other potential system hardware problems, hence the use of a known-good machine and the forensics/data-recovery tools that I'm familiar and comfortable with.
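
For the record, the imaging step was essentially this (the device name, image path, and map-file name are just examples, not my exact command):

# clone the suspect 1TB drive to an image file; the map file lets the run be resumed if interrupted
ddrescue /dev/sdX /mnt/recovery/disk6.img /mnt/recovery/disk6.map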

 

The ddrescue imaged the drive with no hiccups.  After that I ran reiserfsck against the image, and it was fine.  From there I ran reiserfsck in read-only check mode against the original 1TB drive... it came up clean.  I then did the same thing for the 2TB drive that reported the write error... it was clean.  Finally I ran it against the other 2TB drive that logged read errors... fsck reported 6 corruptions and suggested re-running with --fix-fixable, which I did (rough sequence sketched below).  I understand that by doing that outside of the unRAID system I invalidated parity... I'm okay with that.

With 3 drives that had 137M writes on them while the parity drive only had 703K writes, I don't really trust the parity disk much anyway... it will be interesting to see how many parity errors are reported when I do the parity check.  I could make an image of the parity drive before I start, but I don't know what good a parity drive image would do me when multiple drives have had questionable activity on them.
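
To be concrete, that check-and-fix sequence went roughly like this (device names are placeholders, and attaching the image as a loop device is just one way to do it -- not necessarily exactly what I typed):

# check the image first, attached as a loop device
losetup /dev/loop0 /mnt/recovery/disk6.img
reiserfsck --check /dev/loop0
losetup -d /dev/loop0

# the same read-only check against the original drives, one partition at a time
reiserfsck --check /dev/sdX1

# only on the drive that reported corruptions, the fix that fsck itself suggested
reiserfsck --fix-fixable /dev/sdX1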

 

At this point, I think everything is as good as it can be, although I don't know whether any of the files themselves were corrupted by the 137M write operations that occurred on those 3 drives.  I really wish I understood what could have caused them... there were only 528K reads from the cache drive, so the writes couldn't have come from there.  And with only 703K writes to the parity drive, it really doesn't make sense to me how all those write operations landed on the 3 drives that were showing problems without a corresponding update to the parity disk...

 

I was considering building up a new system, adding some blank drives, and migrating all of the data back into the array, but after reviewing the status of everything, things may not be as bad as they initially seemed (depending on whether or not the actual files got corrupted by the mystery writes).

 

Before I reinstall the drives, power back up, and try to get the array to start, I thought I'd check whether there is anything else I should do first.  I assume from previous experience that when I start things up, the one drive is still going to be marked RED, so the only way to get unRAID happy again will be to follow the "Trust My Array" procedure...

smart.zip

I assume from previous experience that when I start things up, the one drive is still going to be marked RED, so the only way to get unRAID happy again will be to follow the "Trust My Array" procedure...

I'm not going to comment on the rest of your situation, but at this point I think it's even easier to get started again by doing a new configuration and rebuilding parity from your existing data. I wouldn't even assign your parity drive for the first start; just assign the data drives, see how some test network reads go, and get a syslog from that.
