Dephcon Posted January 7, 2013
Hello, I noticed this message in my telnet session when I got home from work:
Message from syslogd@Vault13 at Sun Jan 6 17:29:54 2013 ...
Vault13 kernel: Disabling IRQ #16
All the drives connected to my MV8 are green and spun up, but not reporting temps. I cannot open/copy any files from the drive shares on those disks. I'm running rc8a and my syslog is attached. Thanks.
syslog.zip
Dephcon (Author) Posted January 7, 2013
I should also add that this is not the first time this has happened. I usually just stop the array and reboot to fix it, since it interrupts the wife's TV time after work and quick fixes keep everyone sane. This has probably happened 4-6 times since I've been on rc8a.
dgaschk Posted January 8, 2013
Disable all add-ons until the problem is diagnosed. Post a syslog that includes a problem event.
Dephcon (Author) Posted January 8, 2013
Well, I told myself that if it did it again I would remove all my add-ons and play ball, and it did at 8:30 this morning. Any objections to me keeping my APC and screen plugins on? We have dirty power, so the UPS is a must, and screen is handy as I need to pre-clear a couple of drives I got on an advance RMA.
Dephcon (Author) Posted January 12, 2013
Looks like the issue has occurred again... good thing the drives I'm currently pre-clearing are on the onboard SATA ports. Syslog attached. Drives 2-7 and 9 are connected to the MV8; the rest are on the onboard ports. Thanks.
syslog2.zip
limetech Posted January 12, 2013
Quoting Dephcon: "Well, I told myself that if it did it again I would remove all my add-ons and play ball, and it did at 8:30 this morning. Any objections to me keeping my APC and screen plugins on? We have dirty power, so the UPS is a must, and screen is handy as I need to pre-clear a couple of drives I got on an advance RMA."
What PSU are you using, and how good is your UPS? Does it also 'condition' the power? Hard drives and controllers are notoriously susceptible to bad power such as ripple and spikes.
Dephcon (Author) Posted January 12, 2013
It's a Seasonic-built "XFX 750W PRO750W XXX Edition Single Rail". My UPS is an APC Back-UPS RS 1500; I'm not sure how it conditions the power. Our power isn't as "dirty" as I may have let on, but it's an apartment and we get the occasional blip.
Dephcon (Author) Posted January 12, 2013
Update: When I woke up, the webgui service and my telnet connections were down. I plugged the monitor in and the console was spamming this style of message (I think so, it was too fast to read):
Jan 12 00:22:25 Vault13 kernel: REISERFS error (device md7): vs-13070 red trying to find stat data of [2429 2467 0x0 SD]
Jan 12 00:22:25 Vault13 kernel: REISERFS error (device md7): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [2429 2467 0x0 SD]
I had to do a hard shutdown.
I have an untested and unverified LSI SAS3081E-R collecting dust that I could flash to IT mode. Could it be that my MV8 is bad? Would moving half of the drives from the MV8 to this LSI be safe and possibly very informative? Should I stay at RC8a until this is resolved?
RobJ Posted January 12, 2013
I'm afraid your server does not look stable at all to me in its current configuration. Both syslogs show relatively numerous system crashes, with modules often 'tainted'. Both times you lost the mvs interrupt, there was a simultaneous crash, with 'swapper tainted'. The two primary causes, in my opinion, are memory issues or power issues (as Tom mentioned), but it could also be other things such as a bad motherboard, bad overclocking, or a bad or incompatible card installed.
The first thing I would test is the memory. Perhaps there is a bad module, or perhaps it is configured wrong or too fast, or needs a voltage adjustment. I would run memtest for a full day. Check whether this memory is considered compatible with this motherboard, then whether the memory settings are correct for it.
The next thing is power issues. If you have another PSU, give it a try. You might also try running without that UPS; perhaps it is doing something to the waveform that is causing occasional trouble.
One of the other problems in the syslogs is that the USB interface between the server and the UPS is constantly dropping then recovering, which is obviously wrong. I don't know what that could be. The only choices are that it is related to the unstable operation of the server, or something is wrong with the USB interfaces on either end or the cable itself, or something is wrong with the UPS. You might try hooking the UPS to a Windows machine, installing the UPS software, and determining whether the UPS *thinks* it is running correctly and whether it continues dropping the USB connection.
Other possibilities: check for an updated BIOS for the motherboard, try removing all add-on cards and/or moving them to other slots (if there are any), and make sure the server is not overclocked and has all settings as conservative or slow as possible. Can't think of anything else, but there probably is...
dgaschk Posted January 12, 2013
The file system is corrupt. See my sig.
Dephcon (Author) Posted January 12, 2013
It's currently doing a parity check due to the hard shutdown. Shall I let it continue, or just cancel and move on to memtest and checking the disks' file systems? Thanks.
RobJ Posted January 12, 2013
I'd stop and begin testing the memory. I don't consider a machine usable if I can't trust the memory. No telling what kinds of new corruptions and bad behaviors and invalid test results are possible with untrusted memory.
I forgot to mention the file system, and one drive with an HPA (which you probably already know about since you aren't using the drive). I suspect the file systems are OK, because it looked to me that all of the file system errors occurred on drive(s) that were NOT accessible, on the mvsas card after its interrupt was disabled. Therefore the ReiserFS errors were on an emulated Reiser system, but how can you successfully emulate a ReiserFS when half the other drives are also missing?!? I think it was a mercy at that point that you could not write to those hard disks and corrupt their file systems.
While I think the actual file systems on the drives are OK, once you have seen a REISERFS error, you will have to test them anyway, to make sure. I would test all of the file systems with reiserfsck, once you feel confident in your server performing correctly.
Dephcon (Author) Posted January 12, 2013
Sounds good. I ran an 8 hour memtest when I built this system, but shit happens I guess. I have a brand new HX650 Pro PSU that I can try if the memtest comes back OK. I also flashed my LSI 3081E-R to IT firmware this afternoon, so I can also try with half the drives on that and half on my MV8 if need be. Apparently they're similar in performance (the LSI might be slightly better). I'll report back with my findings tomorrow.
Dephcon (Author) Posted January 13, 2013
Update: All the disks checked out fine with zero errors and I ran memtest for 6 hours without issue. I updated the BIOS on my motherboard (it was quite old), reset all the settings to default, then disabled all the stuff I don't need (audio, serial, firewire, etc).
I installed my LSI 3081E-R and I now have disks 1-4 on the MV8, disks 5-8 on the LSI, and disk 9, cache and parity on the on-board controller. One thing that may have been an issue was that I had the server on the carpet (not plush at all) with a bottom-mount PSU. I noticed the exhaust air was "warm", so now it's on my desk.
So I'm going to do a parity check and, hopefully, if nothing bad happens, I will be rebuilding disk 8, as I RMA'd it because I didn't like the seek noise it made. I already have the advance ship disk from WD pre-cleared and I need to ship disk 8 back ASAP so I don't get charged $150. Hopefully if I have another failure it will be limited to the MV8...
I'm also planning to rebuild either 3, 4 or 6 onto a new 3TB Red and then copy the data on the other two to other disks. They're 1.5TB Seagates that I've had forever and are on multiple RMAs... I think one is the original purchased drive. Actually, disk 8 was just recently rebuilt from a 4th 1.5TB Seagate that was on its 3rd re-certified replacement. I've just had enough of these awful drives.
Dephcon (Author) Posted January 14, 2013
I just upgraded to RC10 so I guess I'll leave it at that for now. If the issue re-surfaces I'll start a new thread. Thanks for everyone's help.
RobJ Posted January 15, 2013
Sounds like you are feeling better about the machine, but I still feel you should monitor the syslog every now and then, and look for those 'call traces'. Search for 'tainted' or 'call trace'. They should not be happening, at all.
Dephcon (Author) Posted January 15, 2013
Oh don't worry, I'm going to be all over it... though at the same time the wife is happy that the "tv" works again. I wonder if I can monitor the syslog with nagios/cmk...
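For anyone wanting to automate the syslog check suggested above, here is a minimal Python sketch of a Nagios-style check. It is only an illustration, not a tested plugin: the function names, the pattern list, and the default path /var/log/syslog are assumptions, and the only Nagios convention relied on is the usual plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import re

# Strings worth alerting on: the two search terms suggested in the thread,
# plus the two kernel messages quoted in earlier posts.
PATTERNS = [r"tainted", r"call trace", r"disabling irq", r"reiserfs error"]
MATCHER = re.compile("|".join(PATTERNS), re.IGNORECASE)

# Conventional Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def scan_lines(lines):
    """Return every line that contains one of the suspicious strings."""
    return [line.rstrip("\n") for line in lines if MATCHER.search(line)]


def check_syslog(path="/var/log/syslog"):
    """Return (exit_code, status_message) for the syslog at `path`.

    The default path is an assumption; adjust it for your system.
    """
    try:
        with open(path, errors="replace") as handle:
            hits = scan_lines(handle)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - cannot read {path}: {exc}"
    if hits:
        return CRITICAL, f"CRITICAL - {len(hits)} suspicious line(s), last: {hits[-1]}"
    return OK, "OK - no call traces or disabled IRQs found"
```

A wrapper script would print the message and exit with the returned code, which is how a Nagios or check_mk local check would normally consume it. Note that this scans the whole file on every run, so an old event keeps alerting until the log rotates or the server reboots.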
Dephcon (Author) Posted January 23, 2013
This just popped up in my telnet session:
Message from syslogd@Vault13 at Wed Jan 23 18:27:14 2013 ...
Vault13 kernel: Disabling IRQ #16
And this is what was in the log:
Jan 23 18:27:14 Vault13 kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
Jan 23 18:27:14 Vault13 kernel: Pid: 0, comm: swapper/1 Tainted: G O 3.4.24-unRAID #1
Jan 23 18:27:14 Vault13 kernel: Call Trace:
Jan 23 18:27:14 Vault13 kernel: [] __report_bad_irq+0x29/0xb4
Jan 23 18:27:14 Vault13 kernel: [] note_interrupt+0x137/0x1ac
Jan 23 18:27:14 Vault13 kernel: [] ? acpi_idle_enter_bm+0x276/0x2a9
Jan 23 18:27:14 Vault13 kernel: [] handle_irq_event_percpu+0x109/0x11a
Jan 23 18:27:14 Vault13 kernel: [] ? free_notes_attrs+0x3/0x35
Jan 23 18:27:14 Vault13 kernel: [] ? handle_percpu_irq+0x3b/0x3b
Jan 23 18:27:14 Vault13 kernel: [] handle_irq_event+0x25/0x3c
Jan 23 18:27:14 Vault13 kernel: [] ? handle_percpu_irq+0x3b/0x3b
Jan 23 18:27:14 Vault13 kernel: [] handle_fasteoi_irq+0x6d/0xab
Jan 23 18:27:14 Vault13 kernel: [] ? do_IRQ+0x37/0x9b
Jan 23 18:27:14 Vault13 kernel: [] ? common_interrupt+0x29/0x30
Jan 23 18:27:14 Vault13 kernel: [] ? acpi_idle_enter_bm+0x276/0x2a9
Jan 23 18:27:14 Vault13 kernel: [] ? cpuidle_enter+0xe/0x12
Jan 23 18:27:14 Vault13 kernel: [] ? cpuidle_idle_call+0x59/0xac
Jan 23 18:27:14 Vault13 kernel: [] ? cpu_idle+0x41/0x65
Jan 23 18:27:14 Vault13 kernel: [] ? start_secondary+0x9a/0x9c
Jan 23 18:27:14 Vault13 kernel: handlers:
Jan 23 18:27:14 Vault13 kernel: [] usb_hcd_irq
Jan 23 18:27:14 Vault13 kernel: [] mpt_interrupt
Jan 23 18:27:14 Vault13 kernel: [] mvs_interrupt
Jan 23 18:27:14 Vault13 kernel: Disabling IRQ #16
Jan 23 18:27:50 Vault13 emhttp: Spinning up all drives...
Jan 23 18:27:50 Vault13 emhttp: shcmd (80): /usr/sbin/hdparm -S0 /dev/sdc &> /dev/null
Jan 23 18:27:50 Vault13 kernel: mdcmd (373): spinup 0
Jan 23 18:27:50 Vault13 kernel: mdcmd (374): spinup 1
Jan 23 18:27:50 Vault13 kernel: mdcmd (375): spinup 2
Jan 23 18:27:50 Vault13 kernel: mdcmd (376): spinup 3
Jan 23 18:27:50 Vault13 kernel: mdcmd (377): spinup 4
Jan 23 18:27:50 Vault13 kernel: mdcmd (378): spinup 5
Jan 23 18:27:50 Vault13 kernel: mdcmd (379): spinup 6
Jan 23 18:27:50 Vault13 kernel: mdcmd (380): spinup 7
Jan 23 18:27:50 Vault13 kernel: mdcmd (381): spinup 8
Jan 23 18:27:50 Vault13 kernel: mdcmd (382): spinup 9
Jan 23 18:28:03 Vault13 emhttp: title not found
I spun up the drives to check if any of them had become disconnected. This is on RC10, btw.
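The "handlers:" part of that report lists usb_hcd_irq, mpt_interrupt and mvs_interrupt, which suggests that a USB host controller, presumably the LSI card, and the MV8 all sit on IRQ 16, so disabling it would affect all three. As a rough aid for reading similar reports, here is a minimal Python sketch that pulls the handler names out of a saved syslog. It assumes one syslog entry per line with a "kernel: " prefix, and the function name is made up for illustration.

```python
import re

NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")
# Handler entries look like "[<address>] name"; the address may be missing.
HANDLER = re.compile(r"^\[[^\]]*\]\s+(\w+)\s*$")


def shared_irq_handlers(log_text):
    """Map each IRQ reported as 'nobody cared' to the handlers listed for it."""
    handlers_by_irq = {}
    current_irq = None
    in_handler_list = False
    for line in log_text.splitlines():
        # Only kernel messages are relevant; skip emhttp and other sources.
        _, found, message = line.partition("kernel: ")
        if not found:
            continue
        message = message.strip()
        match = NOBODY_CARED.search(message)
        if match:
            current_irq = int(match.group(1))
            handlers_by_irq.setdefault(current_irq, [])
            in_handler_list = False
        elif message == "handlers:":
            in_handler_list = current_irq is not None
        elif message.startswith("Disabling IRQ"):
            # End of the report for this IRQ.
            current_irq = None
            in_handler_list = False
        elif in_handler_list:
            entry = HANDLER.match(message)
            if entry:
                handlers_by_irq[current_irq].append(entry.group(1))
    return handlers_by_irq
```

Call trace entries are ignored because they appear before the "handlers:" marker, so only the drivers actually registered on the IRQ end up in the result.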
Archived
This topic is now archived and is closed to further replies.