All drives on AOC-SASLP-MV8 inaccessible

Dephcon · January 7, 2013

Hello,

I noticed this message in my telnet session when I got home from work:

Message from syslogd@Vault13 at Sun Jan  6 17:29:54 2013 ...
Vault13 kernel: Disabling IRQ #16

All the drives connected to my MV8 are green, spun up, but not reporting temps. I cannot open/copy any files from the drive shares on those disks. I'm running rc8a and my syslog is attached.

Thanks.

syslog.zip

Dephcon · January 7, 2013

I should also add that this is not the first time this has happened. I usually just stop the array and reboot to fix it, since it interrupts the wife's TV time after work and quick fixes keep everyone sane. This has probably happened 4-6 times since I've been on rc8a.

dgaschk · January 8, 2013

Disable all add-ons until the problem is diagnosed. Post a syslog that includes a problem event.

Dephcon · January 8, 2013

Well, I told myself that if it did it again I would remove all my add-ons and play ball, and it did at 8:30 this morning. Any objections to me keeping my apc and screen plugins on? We have dirty power and the UPS is a must, and screen is handy as I need to pre-clear a couple of drives I got on an advance RMA.

Dephcon · January 12, 2013

Looks like the issue has occurred again... good thing the drives I'm currently pre-clearing are on the onboard SATA ports. Syslog attached. Drives 2-7 and 9 are connected to the MV8; the rest are on the onboard ports.

Thanks.

syslog2.zip

limetech · January 12, 2013

Dephcon wrote: "Well I told myself if it did it again I would remove all my add-ons and play ball, and it did at 8:30 this morning. Any objections to me keeping my apc and screen plugins on? We have dirty power and the UPS is a must and screen is handy as I need to pre-clear a couple drive I got an advanced RMA on."

What PSU are you using, and how good is your UPS? Does it also 'condition' the power? Hard drives and controllers are notoriously susceptible to bad power such as ripple and spikes.

Dephcon · January 12, 2013

It's a Seasonic-built "XFX 750W PRO750W XXX Edition Single Rail". My UPS is an APC Back-UPS RS 1500; I'm not sure how it conditions the power.

Our power isn't as "dirty" as I may have let on, but it's an apartment and we get the occasional blip.

Dephcon · January 12, 2013

Update:

When I woke up, the webgui service and my telnet connections were down. I plugged the monitor in and the console was spamming this style of message (I think so; it was too fast to read):

Jan 12 00:22:25 Vault13 kernel: REISERFS error (device md7): vs-13070 red trying to find stat data of [2429 2467 0x0 SD]

Jan 12 00:22:25 Vault13 kernel: REISERFS error (device md7): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [2429 2467 0x0 SD]

I had to do a hard shutdown.

I have an untested and unverified LSI SAS3081E-R collecting dust that I could flash to IT mode. Could it be that my MV8 is bad? Would moving half of the drives from the MV8 to this LSI be safe and possibly very informative? Should I stay at RC8a until this is resolved?

RobJ · January 12, 2013

I'm afraid your server does not look stable at all to me, in its current configuration. Both syslogs show relatively numerous system crashes, with modules often 'tainted'. Both times you lost the mvs interrupt, there was a simultaneous crash, with 'swapper tainted'.

The 2 primary causes, in my opinion, are memory issues or power issues (as Tom mentioned), but it could also be other things such as bad motherboard, bad overclocking, or a bad or incompatible card installed.

The first thing I would test is the memory. Perhaps there is a bad module, or perhaps it is configured wrong or too fast, or needs a voltage adjustment. I would run memtest for a full day. Check whether this memory is considered compatible with this motherboard, then whether the memory settings are correct for it.

The next thing is power issues. If you have another PSU, give it a try. You might also try running without that UPS; perhaps it is doing something to the waveform that is causing occasional trouble. One of the other problems in the syslogs is that the USB interface between the server and the UPS is constantly dropping then recovering, which is obviously wrong. I don't know what is causing that. The only possibilities are that it is related to the unstable operation of the server, that something is wrong with the USB interface on either end or with the cable itself, or that something is wrong with the UPS. You might try hooking the UPS to a Windows machine, installing the UPS software, and determining whether the UPS *thinks* it is running correctly and whether it continues dropping the USB connection.

Other possibilities: check for an updated BIOS for the motherboard, try removing all add-on cards and/or moving them to other slots (if there are any), and make sure the server is not overclocked and has all settings as conservative or slow as possible. Can't think of anything else, but there probably is...

dgaschk · January 12, 2013

The file system is corrupt. See my sig.

Dephcon · January 12, 2013

It's currently doing a parity check due to the hard shutdown. Shall I let it continue, or just cancel and move on to memtest and checking the disks' file systems?

Thanks,

RobJ · January 12, 2013

I'd stop and begin testing the memory. I don't consider a machine usable if I can't trust the memory. There's no telling what kinds of new corruptions, bad behaviors and invalid test results are possible with untrusted memory.

I forgot to mention the file system, and one drive with an HPA (which you probably already know about since you aren't using the drive). I suspect the file systems are OK, because it looked to me that all of the file system errors occurred on drive(s) that were NOT accessible, on the mvsas card after its interrupt was disabled. Therefore the ReiserFS errors were on an emulated Reiser system, but how can you successfully emulate a ReiserFS when half the other drives are also missing?!? I think it was a mercy at that point that you could not write to those hard disks, and corrupt their file systems. While I think the actual file systems on the drives are OK, once you have seen a REISERFS error, you will have to test them anyway, to make sure. I would test all of the file systems with reiserfsck, once you feel confident in your server performing correctly.

Dephcon · January 12, 2013

Sounds good.

I ran an 8 hour memtest when I built this system, but shit happens I guess. I have a brand new HX650 Pro PSU that I can try if the memtest comes back OK. I also flashed my LSI 3081E-R to IT firmware this afternoon, so I can also try with half the drives on that and half on my MV8 if need be. Apparently they're similar in performance (the LSI might be slightly better).

I'll report back with my findings tomorrow.

Dephcon · January 13, 2013

Update:

All the disks checked out fine with zero errors and I ran memtest for 6 hours without issue. I updated the BIOS on my motherboard (it was quite old), reset all the settings to default, then disabled all the stuff I don't need (audio, serial, firewire, etc). I installed my LSI 3081E-R and I now have disks 1-4 on the MV8, disks 5-8 on the LSI, and disk 9, cache and parity on the on-board controller.

One thing that may have been an issue: I had the server on the carpet (not plush at all) with a bottom-mount PSU. I noticed the exhaust air was "warm", so now it's on my desk.

So I'm going to do a parity check and, hopefully, if nothing bad happens, I will be rebuilding disk 8, which I RMA'd because I didn't like the seek noise it made. I already have the advance ship disk from WD pre-cleared and I need to ship disk 8 back ASAP so I don't get charged $150.

Hopefully if I have another failure it will be limited to the MV8...

I'm also planning to rebuild either 3, 4 or 6 onto a new 3TB Red and then copy the data on the other two to other disks. They're 1.5TB Seagates that I've had forever and are on multiple RMAs... I think one is the original purchased drive. Actually, disk 8 was just recently rebuilt from a 4th 1.5TB Seagate that was on its 3rd re-certified replacement. I've just had enough of these awful drives.

Dephcon · January 14, 2013

I just upgraded to RC10 so I guess I'll leave it at that for now. If the issue re-surfaces I'll start a new thread.

Thanks for everyone's help.

RobJ · January 15, 2013

Sounds like you are feeling better about the machine, but I still feel you should monitor the syslog every now and then, and look for those 'call traces'. Search for 'tainted' or 'call trace'. They should not be happening, at all.
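A minimal sketch of that search, for anyone who would rather script it than grep by hand. The function name `find_suspect_lines` is made up for illustration, and the sample lines are copied from the Jan 23 log excerpt later in this thread; nothing here is an unRAID tool.

```python
import re

# The two strings suggested above, matched case-insensitively so that
# both "Tainted:" and "Call Trace:" are caught.
SUSPECT = re.compile(r"tainted|call trace", re.IGNORECASE)

def find_suspect_lines(lines):
    """Return (line_number, text) pairs for every line mentioning a taint or call trace."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if SUSPECT.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits

# Sample input: a few lines taken from the log posted later in the thread.
sample = [
    'Jan 23 18:27:14 Vault13 kernel: irq 16: nobody cared (try booting with the "irqpoll" option)',
    "Jan 23 18:27:14 Vault13 kernel: Pid: 0, comm: swapper/1 Tainted: G O 3.4.24-unRAID #1",
    "Jan 23 18:27:14 Vault13 kernel: Call Trace:",
    "Jan 23 18:27:50 Vault13 emhttp: Spinning up all drives...",
]

for number, text in find_suspect_lines(sample):
    print(number, text)
```

To run it against a live log, pass an open file instead of the sample list, e.g. `find_suspect_lines(open(path))`, where `path` is wherever your syslog lives (assumed here, not confirmed in this thread).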

Dephcon · January 15, 2013

Oh don't worry, I'm going to be all over it... though at the same time the wife is happy that the "tv" works again.

I wonder if I can monitor the syslog with nagios/cmk...
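One way this might look is a small check that follows the usual Nagios plugin convention of exit code 0 for OK, 1 for WARNING and 2 for CRITICAL. This is only a sketch: the function name `check_syslog` and the choice of which message maps to which severity are my own assumptions, not anything nagios or check_mk ships.

```python
# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

def check_syslog(lines):
    """Return (exit_code, status_message) for an iterable of syslog lines.

    Assumed severity mapping: a disabled IRQ is CRITICAL because the drives
    behind it stop responding; a call trace alone is only a WARNING.
    """
    disabled = 0
    traces = 0
    for line in lines:  # single pass, so a file handle works too
        if "Disabling IRQ" in line:
            disabled += 1
        elif "call trace" in line.lower():
            traces += 1
    if disabled:
        return CRITICAL, f"CRITICAL - {disabled} 'Disabling IRQ' message(s)"
    if traces:
        return WARNING, f"WARNING - {traces} call trace(s)"
    return OK, "OK - no call traces or disabled IRQs"

code, message = check_syslog([
    "Jan 23 18:27:14 Vault13 kernel: Call Trace:",
    "Jan 23 18:27:14 Vault13 kernel: Disabling IRQ #16",
])
print(code, message)
```

A real plugin would print the message and then call `sys.exit(code)` so the monitoring server can read the status.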

Dephcon · January 23, 2013

This just popped up in my telnet session:

Message from syslogd@Vault13 at Wed Jan 23 18:27:14 2013 ...
Vault13 kernel: Disabling IRQ #16

And this is what was in the log:

Jan 23 18:27:14 Vault13 kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
Jan 23 18:27:14 Vault13 kernel: Pid: 0, comm: swapper/1 Tainted: G O 3.4.24-unRAID #1
Jan 23 18:27:14 Vault13 kernel: Call Trace:
Jan 23 18:27:14 Vault13 kernel: [] __report_bad_irq+0x29/0xb4
Jan 23 18:27:14 Vault13 kernel: [] note_interrupt+0x137/0x1ac
Jan 23 18:27:14 Vault13 kernel: [] ? acpi_idle_enter_bm+0x276/0x2a9
Jan 23 18:27:14 Vault13 kernel: [] handle_irq_event_percpu+0x109/0x11a
Jan 23 18:27:14 Vault13 kernel: [] ? free_notes_attrs+0x3/0x35
Jan 23 18:27:14 Vault13 kernel: [] ? handle_percpu_irq+0x3b/0x3b
Jan 23 18:27:14 Vault13 kernel: [] handle_irq_event+0x25/0x3c
Jan 23 18:27:14 Vault13 kernel: [] ? handle_percpu_irq+0x3b/0x3b
Jan 23 18:27:14 Vault13 kernel: [] handle_fasteoi_irq+0x6d/0xab
Jan 23 18:27:14 Vault13 kernel: [] ? do_IRQ+0x37/0x9b
Jan 23 18:27:14 Vault13 kernel: [] ? common_interrupt+0x29/0x30
Jan 23 18:27:14 Vault13 kernel: [] ? acpi_idle_enter_bm+0x276/0x2a9
Jan 23 18:27:14 Vault13 kernel: [] ? cpuidle_enter+0xe/0x12
Jan 23 18:27:14 Vault13 kernel: [] ? cpuidle_idle_call+0x59/0xac
Jan 23 18:27:14 Vault13 kernel: [] ? cpu_idle+0x41/0x65
Jan 23 18:27:14 Vault13 kernel: [] ? start_secondary+0x9a/0x9c
Jan 23 18:27:14 Vault13 kernel: handlers:
Jan 23 18:27:14 Vault13 kernel: [] usb_hcd_irq
Jan 23 18:27:14 Vault13 kernel: [] mpt_interrupt
Jan 23 18:27:14 Vault13 kernel: [] mvs_interrupt
Jan 23 18:27:14 Vault13 kernel: Disabling IRQ #16
Jan 23 18:27:50 Vault13 emhttp: Spinning up all drives...
Jan 23 18:27:50 Vault13 emhttp: shcmd (80): /usr/sbin/hdparm -S0 /dev/sdc &> /dev/null
Jan 23 18:27:50 Vault13 kernel: mdcmd (373): spinup 0
Jan 23 18:27:50 Vault13 kernel: mdcmd (374): spinup 1
Jan 23 18:27:50 Vault13 kernel: mdcmd (375): spinup 2
Jan 23 18:27:50 Vault13 kernel: mdcmd (376): spinup 3
Jan 23 18:27:50 Vault13 kernel: mdcmd (377): spinup 4
Jan 23 18:27:50 Vault13 kernel: mdcmd (378): spinup 5
Jan 23 18:27:50 Vault13 kernel: mdcmd (379): spinup 6
Jan 23 18:27:50 Vault13 kernel: mdcmd (380): spinup 7
Jan 23 18:27:50 Vault13 kernel: mdcmd (381): spinup 8
Jan 23 18:27:50 Vault13 kernel: mdcmd (382): spinup 9
Jan 23 18:28:03 Vault13 emhttp: title not found

I spun up the drives to check if any of them had become disconnected. This is on RC10, btw.
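The "handlers:" block in that trace lists every driver registered on the interrupt line that was disabled. Here that is the USB host controller (usb_hcd_irq), the LSI card's driver (mpt_interrupt) and the MV8's driver (mvs_interrupt), all sharing IRQ 16. Below is a rough sketch for pulling that list out of a syslog automatically. The function name and regexes are my own, written against the log format shown above, so treat them as assumptions.

```python
import re

NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")
# A handler line looks like "kernel: [<address>] name" with nothing after the name.
HANDLER = re.compile(r"kernel: \[[^\]]*\] (\w+)\s*$")

def handlers_for_disabled_irqs(lines):
    """Map each IRQ reported as 'nobody cared' to the handler names listed for it."""
    shared = {}
    irq = None
    in_handlers = False
    for line in lines:
        line = line.rstrip()
        started = NOBODY_CARED.search(line)
        if started:
            irq = int(started.group(1))
            shared.setdefault(irq, [])
            in_handlers = False
        elif irq is not None and line.endswith("kernel: handlers:"):
            in_handlers = True
        elif in_handlers:
            entry = HANDLER.search(line)
            if entry:
                if entry.group(1) not in shared[irq]:
                    shared[irq].append(entry.group(1))
            else:
                # First non-handler line ends the block.
                irq, in_handlers = None, False
    return shared

# Lines copied from the log above (call trace shortened to one entry).
sample = [
    'Jan 23 18:27:14 Vault13 kernel: irq 16: nobody cared (try booting with the "irqpoll" option)',
    "Jan 23 18:27:14 Vault13 kernel: Call Trace:",
    "Jan 23 18:27:14 Vault13 kernel: [] __report_bad_irq+0x29/0xb4",
    "Jan 23 18:27:14 Vault13 kernel: handlers:",
    "Jan 23 18:27:14 Vault13 kernel: [] usb_hcd_irq",
    "Jan 23 18:27:14 Vault13 kernel: [] mpt_interrupt",
    "Jan 23 18:27:14 Vault13 kernel: [] mvs_interrupt",
    "Jan 23 18:27:14 Vault13 kernel: Disabling IRQ #16",
]

print(handlers_for_disabled_irqs(sample))
# → {16: ['usb_hcd_irq', 'mpt_interrupt', 'mvs_interrupt']}
```

Once the kernel disables a shared line, every device on it stops getting interrupts, so this list shows which controllers are affected at once.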
