All drives on AOC-SASLP-MV8 inaccessible



Hello,

 

I noticed this message in my telnet session when I got home from work:

 

Message from syslogd@Vault13 at Sun Jan  6 17:29:54 2013 ...
Vault13 kernel: Disabling IRQ #16

 

All the drives connected to my MV8 are green, spun up, but not reporting temps.  I cannot open/copy any files from the drive shares on those disks.  I'm running rc8a and my syslog is attached.
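When the kernel prints "Disabling IRQ #16", it means that interrupt line kept firing but no registered handler claimed it, so the kernel shut the whole line off — taking down every device sharing it. A quick way to see who shares IRQ 16 (a sketch; the exact device names on this box are an assumption):

```shell
# Show the header row plus the IRQ 16 line from /proc/interrupts; on this
# server the mvsas (MV8) handler likely shares the line with USB and the LSI.
grep -E 'CPU0|^ *16:' /proc/interrupts
```

If several controllers appear on the same line, one misbehaving device can get all of them disabled at once.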

 

Thanks.

syslog.zip

Link to comment

I should also add that this is not the first time this has happened. I usually just stop the array and reboot to fix it, since it interrupts the wife's TV time after work and quick fixes keep everyone sane. This has probably happened 4-6 times since I've been on rc8a.

Link to comment

Well, I told myself if it did it again I would remove all my add-ons and play ball, and it did at 8:30 this morning. Any objections to me keeping my apc and screen plugins on? We have dirty power so the UPS is a must, and screen is handy as I need to pre-clear a couple of drives I got an advance RMA on.

Link to comment

What PSU are you using, and how good is your UPS? Does it also 'condition' the power? Hard drives and controllers are notoriously susceptible to bad power, such as ripple and spikes.

 

Link to comment

Update:

 

When I woke up, the webgui service and my telnet connections were down. I plugged the monitor in and the console was spamming this style of message (I think; it was scrolling too fast to read):

 

Jan 12 00:22:25 Vault13 kernel: REISERFS error (device md7): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [2429 2467 0x0 SD]

Jan 12 00:22:25 Vault13 kernel: REISERFS error (device md7): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [2429 2467 0x0 SD]

 

I had to do a hard shutdown :(

 

 

I have an untested and unverified LSI SAS3081E-R collecting dust that I could flash to IT mode. Could it be that my MV8 is bad? Would moving half of the drives from the MV8 to this LSI be safe, and possibly very informative? Should I stay on RC8a until this is resolved?

Link to comment

I'm afraid your server does not look stable at all to me, in its current configuration.  Both syslogs show relatively numerous system crashes, with modules often 'tainted'.  Both times you lost the mvs interrupt, there was a simultaneous crash, with 'swapper tainted'.

 

The 2 primary causes, in my opinion, are memory issues or power issues (as Tom mentioned), but it could also be other things, such as a bad motherboard, bad overclocking, or a bad or incompatible card installed.

 

The first thing I would test is the memory; perhaps there is a bad module, or perhaps it is configured wrong or too fast, or needs a voltage adjustment. I would run memtest for a full day. Check whether this memory is considered compatible with this motherboard, then whether the memory settings are correct for it.
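A bootable memtest86+ run is the thorough option described above. As a quicker in-OS sanity check, the memtester utility can lock and exercise a chunk of RAM — a sketch, assuming memtester is installed (it is not part of stock unRAID):

```shell
# Lock 2 GB of RAM and run 3 test passes; size and pass count are
# illustrative. Run this with the array stopped so memory pressure is low.
memtester 2G 3
```

This will not catch errors in memory the OS itself occupies, which is why the day-long boot-time memtest is still the real test.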

 

The next thing is power issues; if you have another PSU, give it a try. You might also try running without that UPS; perhaps it is doing something to the waveform that is causing occasional trouble. One of the other problems in the syslogs is that the USB interface between the server and the UPS is constantly dropping and then recovering, which is obviously wrong. I don't know what could cause that. The only choices are: it is related to the unstable operation of the server, something is wrong with the USB interfaces on either end or with the cable itself, or something is wrong with the UPS. You might try hooking the UPS to a Windows machine, installing the UPS software, and determining whether the UPS *thinks* it is running correctly, and whether it continues dropping the USB connection.
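From the server side, the UPS link can also be queried through apcupsd's apcaccess tool (assuming the 'apc' plugin mentioned earlier is built on apcupsd, which is typical but an assumption here):

```shell
# Show key UPS fields: STATUS should read ONLINE, LINEV is the line voltage
# the UPS measures, LOADPCT the load, BATTV the battery voltage.
apcaccess status | grep -E 'STATUS|LINEV|LOADPCT|BATTV'
```

If this command times out or the fields go stale while the syslog shows the USB device disconnecting, that points at the link (cable/ports) or the UPS itself rather than the server's drivers.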

 

Other possibilities: check for an updated BIOS for the motherboard, try removing all add-on cards and/or moving them to other slots (if there are any), and make sure the server is not overclocked and has all settings as conservative as possible. Can't think of anything else, but there probably is...

Link to comment

I'd stop and begin testing the memory, I don't consider a machine usable if I can't trust the memory.  No telling what kinds of new corruptions and bad behaviors and invalid test results are possible with untrusted memory.

 

I forgot to mention the file system, and one drive with an HPA (which you probably already know about, since you aren't using the drive). I suspect the file systems are OK, because it looked to me like all of the file system errors occurred on drive(s) that were NOT accessible, on the mvsas card after its interrupt was disabled. The ReiserFS errors were therefore on an emulated ReiserFS — but how can you successfully emulate a ReiserFS when half the other drives are also missing?!? I think it was a mercy at that point that you could not write to those hard disks and corrupt their file systems. While I think the actual file systems on the drives are OK, once you have seen a REISERFS error you will have to test them anyway, to make sure. I would test all of the file systems with reiserfsck, once you feel confident in your server performing correctly.
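The reiserfsck pass could be sketched like this — a read-only check of each array file system against unRAID's /dev/mdN devices, with the array started in maintenance mode (the 1-9 disk numbering is assumed from this particular server):

```shell
# Read-only check of each array file system. Do NOT run --fix-fixable or
# --rebuild-tree until --check has reported what, if anything, is wrong.
for i in $(seq 1 9); do
  echo "=== /dev/md$i ==="
  reiserfsck --check /dev/md$i </dev/null
done
```

Checking the md devices rather than the raw sdX partitions keeps parity in sync if a repair pass is ever needed later.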

Link to comment

Sounds good. 

 

I ran an 8-hour memtest when I built this system, but shit happens I guess. I have a brand new HX650 Pro PSU that I can try if the memtest comes back OK. I also flashed my LSI 3081E-R to IT firmware this afternoon, so I can try with half the drives on that and half on my MV8 if need be; apparently they're similar in performance (the LSI might be slightly better).

 

I'll report back with my findings tomorrow.

Link to comment

Update:

 

All the disks checked out fine with zero errors, and I ran memtest for 6 hours without issue. I updated the BIOS on my motherboard (it was quite old), reset all the settings to default, then disabled all the stuff I don't need (audio, serial, firewire, etc.). I installed my LSI 3081E-R, and I now have disks 1-4 on the MV8, disks 5-8 on the LSI, and disk 9, cache, and parity on the on-board controller.

 

One thing that may have been an issue was that I had the server on the carpet (not plush at all) with a bottom-mount PSU; I noticed the exhaust air was "warm", so now it's on my desk.

 

So I'm going to do a parity check, and hopefully, if nothing bad happens, I will be rebuilding disk 8, which I RMA'd because I didn't like the seek noise it made. I already have the advance-ship disk from WD pre-cleared, and I need to ship disk 8 back ASAP so I don't get charged $150.

 

Hopefully if I have another failure it will be limited to the MV8...

 

I'm also planning to rebuild either disk 3, 4, or 6 onto a new 3TB Red and then copy the data on the other two to other disks. They're 1.5TB Seagates that I've had forever and that are on multiple RMAs... I think one is the original purchased drive. Actually, disk 8 was just recently rebuilt from a 4th 1.5TB Seagate that was on its 3rd re-certified replacement. I've just had enough of these awful drives.

Link to comment
  • 2 weeks later...

This just popped up in my telnet session:

 

Message from syslogd@Vault13 at Wed Jan 23 18:27:14 2013 ...
Vault13 kernel: Disabling IRQ #16

 

And this is what was in the log:

 

Jan 23 18:27:14 Vault13 kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
Jan 23 18:27:14 Vault13 kernel: Pid: 0, comm: swapper/1 Tainted: G O 3.4.24-unRAID #1
Jan 23 18:27:14 Vault13 kernel: Call Trace:
Jan 23 18:27:14 Vault13 kernel: [] __report_bad_irq+0x29/0xb4
Jan 23 18:27:14 Vault13 kernel: [] note_interrupt+0x137/0x1ac
Jan 23 18:27:14 Vault13 kernel: [] ? acpi_idle_enter_bm+0x276/0x2a9
Jan 23 18:27:14 Vault13 kernel: [] handle_irq_event_percpu+0x109/0x11a
Jan 23 18:27:14 Vault13 kernel: [] ? free_notes_attrs+0x3/0x35
Jan 23 18:27:14 Vault13 kernel: [] ? handle_percpu_irq+0x3b/0x3b
Jan 23 18:27:14 Vault13 kernel: [] handle_irq_event+0x25/0x3c
Jan 23 18:27:14 Vault13 kernel: [] ? handle_percpu_irq+0x3b/0x3b
Jan 23 18:27:14 Vault13 kernel: [] handle_fasteoi_irq+0x6d/0xab
Jan 23 18:27:14 Vault13 kernel: [] ? do_IRQ+0x37/0x9b
Jan 23 18:27:14 Vault13 kernel: [] ? common_interrupt+0x29/0x30
Jan 23 18:27:14 Vault13 kernel: [] ? acpi_idle_enter_bm+0x276/0x2a9
Jan 23 18:27:14 Vault13 kernel: [] ? cpuidle_enter+0xe/0x12
Jan 23 18:27:14 Vault13 kernel: [] ? cpuidle_idle_call+0x59/0xac
Jan 23 18:27:14 Vault13 kernel: [] ? cpu_idle+0x41/0x65
Jan 23 18:27:14 Vault13 kernel: [] ? start_secondary+0x9a/0x9c
Jan 23 18:27:14 Vault13 kernel: handlers:
Jan 23 18:27:14 Vault13 kernel: [] usb_hcd_irq
Jan 23 18:27:14 Vault13 kernel: [] mpt_interrupt
Jan 23 18:27:14 Vault13 kernel: [] mvs_interrupt
Jan 23 18:27:14 Vault13 kernel: Disabling IRQ #16
Jan 23 18:27:50 Vault13 emhttp: Spinning up all drives...
Jan 23 18:27:50 Vault13 emhttp: shcmd (80): /usr/sbin/hdparm -S0 /dev/sdc &> /dev/null
Jan 23 18:27:50 Vault13 kernel: mdcmd (373): spinup 0
Jan 23 18:27:50 Vault13 kernel: mdcmd (374): spinup 1
Jan 23 18:27:50 Vault13 kernel: mdcmd (375): spinup 2
Jan 23 18:27:50 Vault13 kernel: mdcmd (376): spinup 3
Jan 23 18:27:50 Vault13 kernel: mdcmd (377): spinup 4
Jan 23 18:27:50 Vault13 kernel: mdcmd (378): spinup 5
Jan 23 18:27:50 Vault13 kernel: mdcmd (379): spinup 6
Jan 23 18:27:50 Vault13 kernel: mdcmd (380): spinup 7
Jan 23 18:27:50 Vault13 kernel: mdcmd (381): spinup 8
Jan 23 18:27:50 Vault13 kernel: mdcmd (382): spinup 9
Jan 23 18:28:03 Vault13 emhttp: title not found

 

I spun up the drives to check if any of them had become disconnected.  This is on RC10, btw.
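For what it's worth, the trace's own hint ("try booting with the \"irqpoll\" option") can be tried by adding that parameter to the kernel line in unRAID's syslinux config. A sketch of the fragment, with the stock file path and label assumed:

```shell
# /boot/syslinux/syslinux.cfg (path assumed) -- add irqpoll to the append line
label unRAID OS
  kernel /bzimage
  append initrd=/bzroot irqpoll
```

irqpoll makes the kernel poll all handlers when an interrupt goes unhandled instead of disabling the line; it is a workaround that costs some performance, not a fix for whatever is making IRQ 16 storm.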

Link to comment

Archived

This topic is now archived and is closed to further replies.
