July 7, 201213 yr Bad server crash. The situation started when I noticed lots of repeated syslog messages related to port 5 on my AOC-SASLP-MV8 controller (disk10 is on that port). Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = 0x1000000 Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = 0x1000000 Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = 0x1000000 Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = 0x1000000 Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: usr/srusr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = 0x1000000usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq stsusr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irqusr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq susr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq stsusr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq susr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts =usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. Jul 7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Porusr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800. I shut down my processes (sabnzbd, transmission, subsonic, cachedirs), and went to the webinterface - all were green balls. I tried stopping server through webinterface, and it then hung on me and stopped responding. I was able to check the syslog with unmenu one time before unmenu also stopped responding. I waited for a while, still nothing. Then fired up a command window and hit /sbin/powerdown which did not work (I forgot the error message it came back with but think it's the same I later got when trying /root/mdcmd stop). I googled the forums a bit and followed the instructions from this thread: http://lime-technology.com/forum/index.php?topic=6255.0 The unmount commands would hang semi-indefinitely, eventually completing for most of the drives. One stood out, however - disk 10 reported the device was busy. I'm sure I did not have processes accessing files on this particular drive (they all use disk2 only), and the busy status has to somehow be related to the syslog errors above for port5, which is disk10. ps-ef revealed a hdparm -C for this drive (/dev/sdf) running. I also took out a /root/mdcmd status which is attached. It reports DISK_OK for all the disks, and there is one error for disk10, none for any others. I tried killing that and a bunch of other stuff. Eventually the unmount for disk10 stopped reporting budy and instead reported drive not mounted. I proceeded to try /root/mdcmd stop again and got an error message along the lines of "error in line 11: cannot write to device, device busy" (should have noted down the exact error message). Invoking /sbin/powerdown still also did not work, even poweroff command did nothing and I eventually resorted to holding the power button to turn the bastard off. Upon rebooting, I was greeted with a red dot for disk10. All others are green, and the server can be started unprotected. I'm really unsure what to do at this point. Can I trust the parity given the circumstances above? Am I better off rebuilding parity (thereby trusting disk10? Or would it be worth invoking the "trust my parity procedure and checking to see if I have any inconsistencies at all? I have included a post-reboot syslog also. If inspecting the post-reboot syslog you will see I have changed disk6 & 7 to free ports on the mvsas controller - the JMB363 controller they were on was giving me syslog errors as well, but I believe that to be unrelated to the problem described above (and had verified I could successfully parity-check with no errors after the JBM363 errors had appeared. mdcmd-status-2012-07-07.txt syslog-2012-07-07.zip syslog-2012-07-07-post-reboot.zip
July 8, 201213 yr Author Nobody feeling chatty tonight? I ended up running a reiserfsch on the redballed disk10, and it came back 100% clean. Therefore elected to use the trust parity procedure instead of rebuilding the data on disk10. Before I initiated this, I had rebooted the server in order to be running a clean go script, and unfortunately this had the side effect that the array automatically started unprotected, and lots of writes to the disks were performed in this step. The parity check now in progress found 69 errors immediately (within the first 0.1%) and no additional errors so far the following 2 percent. I can't know for sure if the parity mismatches were because of the improper shutdown or because the array was started without disk10, but in any case I'm hopeful (almost confident) I've done the right thing here. Now, has anyone ever seen anything like the mvsas errors I listed above??? How about this stuff: Jul 3 04:09:08 FileServer kernel: sdb: sdb1 Jul 3 04:09:08 FileServer kernel: ------------[ cut here ]------------ Jul 3 04:09:08 FileServer kernel: WARNING: at drivers/ata/libata-core.c:5186 ata_qc_issue+0x10b/0x308() Jul 3 04:09:08 FileServer kernel: Hardware name: P35-DS3R Jul 3 04:09:08 FileServer kernel: Modules linked in: md_mod xor ide_gd_mod i2c_i801 i2c_core r8169 pata_jmicron jmicron ahci mvsas libsas scst scsi_transport_sas Jul 3 04:09:08 FileServer kernel: Pid: 12231, comm: smartctl Not tainted 2.6.32.9-unRAID #8 Jul 3 04:09:08 FileServer kernel: Call Trace: Jul 3 04:09:08 FileServer kernel: [<c102449e>] warn_slowpath_common+0x60/0x77 Jul 3 04:09:08 FileServer kernel: [<c10244c2>] warn_slowpath_null+0xd/0x10 Jul 3 04:09:08 FileServer kernel: [<c11b624d>] ata_qc_issue+0x10b/0x308 Jul 3 04:09:08 FileServer kernel: [<c11ac0ba>] ? scsi_init_sgtable+0x5b/0x76 Jul 3 04:09:08 FileServer kernel: [<c11ba260>] ata_scsi_translate+0xd1/0xff Jul 3 04:09:08 FileServer kernel: [<c11a816c>] ? scsi_done+0x0/0xd Jul 3 04:09:08 FileServer kernel: [<c11a816c>] ? scsi_done+0x0/0xd Jul 3 04:09:08 FileServer kernel: [<c11baa40>] ata_sas_queuecmd+0x120/0x1d7 Jul 3 04:09:08 FileServer kernel: [<c11bc6df>] ? ata_scsi_pass_thru+0x0/0x21d Jul 3 04:09:08 FileServer kernel: [<f841369a>] sas_queuecommand+0x65/0x20d [libsas] Jul 3 04:09:08 FileServer kernel: [<c11a816c>] ? scsi_done+0x0/0xd Jul 3 04:09:08 FileServer kernel: [<c11a82c0>] scsi_dispatch_cmd+0x147/0x181 Jul 3 04:09:08 FileServer kernel: [<c11ace4d>] scsi_request_fn+0x351/0x376 Jul 3 04:09:08 FileServer kernel: [<c11269a8>] __generic_unplug_device+0x26/0x29 Jul 3 04:09:08 FileServer kernel: [<c112934a>] blk_execute_rq_nowait+0x56/0x73 Jul 3 04:09:08 FileServer kernel: [<c11293dc>] blk_execute_rq+0x75/0x91 Jul 3 04:09:08 FileServer kernel: [<c11292cc>] ? blk_end_sync_rq+0x0/0x28 Jul 3 04:09:08 FileServer kernel: [<c11295fa>] ? blk_recount_segments+0x16/0x24 Jul 3 04:09:08 FileServer kernel: [<c11250c6>] ? blk_rq_bio_prep+0x47/0x78 Jul 3 04:09:08 FileServer kernel: [<c11ac58a>] scsi_execute+0xbf/0x113 Jul 3 04:09:08 FileServer kernel: [<c11bb6d0>] ata_cmd_ioctl+0xf8/0x1ba Jul 3 04:09:08 FileServer kernel: [<c11bb940>] ata_sas_scsi_ioctl+0x1ae/0x1e8 Jul 3 04:09:08 FileServer kernel: [<c11334a5>] ? radix_tree_lookup_slot+0xd/0xf Jul 3 04:09:08 FileServer kernel: [<f841333d>] sas_ioctl+0x36/0x43 [libsas] Jul 3 04:09:08 FileServer kernel: [<c11a94c8>] scsi_ioctl+0x299/0x2ad Jul 3 04:09:08 FileServer kernel: [<f8413307>] ? sas_ioctl+0x0/0x43 [libsas] Jul 3 04:09:08 FileServer kernel: [<c11b326d>] sd_ioctl+0x80/0x8c Jul 3 04:09:08 FileServer kernel: [<c112a420>] __blkdev_driver_ioctl+0x50/0x62 Jul 3 04:09:08 FileServer kernel: [<c112ad1c>] blkdev_ioctl+0x8b0/0x8dc Jul 3 04:09:08 FileServer kernel: [<c1131e2d>] ? kobject_get+0x12/0x17 Jul 3 04:09:08 FileServer kernel: [<c112b0f8>] ? get_disk+0x4a/0x61 Jul 3 04:09:08 FileServer kernel: [<c101b028>] ? kmap_atomic+0x14/0x16 Jul 3 04:09:08 FileServer kernel: [<c11334a5>] ? radix_tree_lookup_slot+0xd/0xf Jul 3 04:09:08 FileServer kernel: [<c104a179>] ? filemap_fault+0xb8/0x305 Jul 3 04:09:08 FileServer kernel: [<c1048c43>] ? unlock_page+0x18/0x1b Jul 3 04:09:08 FileServer kernel: [<c1057c63>] ? __do_fault+0x3a7/0x3da Jul 3 04:09:08 FileServer kernel: [<c105985f>] ? handle_mm_fault+0x42d/0x8f1 Jul 3 04:09:08 FileServer kernel: [<c108b6c6>] block_ioctl+0x2a/0x32 Jul 3 04:09:08 FileServer kernel: [<c108b69c>] ? block_ioctl+0x0/0x32 Jul 3 04:09:08 FileServer kernel: [<c10769d5>] vfs_ioctl+0x22/0x67 Jul 3 04:09:08 FileServer kernel: [<c1076f33>] do_vfs_ioctl+0x478/0x4ac Jul 3 04:09:08 FileServer kernel: [<c105c983>] ? expand_downwards+0x109/0x136 Jul 3 04:09:08 FileServer kernel: [<c1076f93>] sys_ioctl+0x2c/0x45 Jul 3 04:09:08 FileServer kernel: [<c1002935>] syscall_call+0x7/0xb Jul 3 04:09:08 FileServer kernel: ---[ end trace d2bdebf2054f7c59 ]--- Jul 3 04:09:09 FileServer preclear_disk-start[12265]: smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build) Jul 3 04:09:09 FileServer preclear_disk-start[12265]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net Jul 3 04:09:09 FileServer preclear_disk-start[12265]: Jul 3 04:09:09 FileServer preclear_disk-start[12265]: === START OF INFORMATION SECTION === Jul 3 04:09:09 FileServer preclear_disk-start[12265]: Model Family: Western Digital Caviar Blue Serial ATA family ... The above was timed right down to the second I initiated the first preclear. Ever seen anything like this before? Not necessarily related to the JMB363 SATA controller as I first assumed, is it?
July 8, 201213 yr Nobody feeling chatty tonight? I ended up running a reiserfsch on the redballed disk10, and it came back 100% clean. Therefore elected to use the trust parity procedure instead of rebuilding the data on disk10. That would have run it on the emulated (simulated) file system. It had absolutely nothing to do with the physical disk. Before I initiated this, I had rebooted the server in order to be running a clean go script, and unfortunately this had the side effect that the array automatically started unprotected, and lots of writes to the disks were performed in this step. The parity check now in progress found 69 errors immediately (within the first 0.1%) and no additional errors so far the following 2 percent. I can't know for sure if the parity mismatches were because of the improper shutdown or because the array was started without disk10, but in any case I'm hopeful (almost confident) I've done the right thing here. I'm not as certain. A small number of parity errors at the very beginning of a disk are probably in the housekeeping area of the disk tracking journal entries. They might be as of the result of the hard shutdown. Good luck. Personally, I would NEVER use the "trust" process. If a disk is "red" it is GUARANTEED a write to it failed. You are asking that it be trusted even though it does NOT contain what was written to it. Joe L.
July 8, 201213 yr Author Thanks. I think it may be a little more nuanced in this case. I had suffered an improper shutdown, which doesn't exactly make one want to trust the parity either, does it? Between a rock and a hard place - either trust the redballed disk10 or rebuild it from a parity which very likely was corrupted due to the improper shutdown. If only using the web-interface my only option would have been to unassign disk10 and then reassign it, after which point unraid would have rebuild that drive from parity. The reiserfsck I did first was on the actual drive (with the array stopped, checking /dev/sdf1). Using the trust procedure allowed me to get back to a protected state and then re-sync the parity immediately after. One could say I used the same commands used to "trust my parity", but in this case the purpose and effect was to "trust my disk10". As it happened, the drive redballed about 30% through the parity sync. Have moved it to a different controller port and checked all the cables are plugged all the way in, seems to have gotten further this time.
July 8, 201213 yr Disk10 was disabled by unRAID and the parity disk was not. Parity is more likely correct. Examining SMART reports could provide more info. The attempt to preserve disk10 and rebuild parity is mistaken. It will take exactly as long to rebuild disk 10 as it does to generate new parity. Disk 10 has definitely has a write failure and the correct information could be restored from parity. The "trust" procedure does not return the array to a protected state if one the disks has bad data.
July 9, 201213 yr Author OK guys, you win. I suppose the fact that parity was deemed valid could indicate my hamfisted attempts at manually stopping the array may have been successful for all disks except disk10 (including parity). Will keep in mind should there come another time - Thanks. Did anybody have any input on the syslog siplets I quoted in the above posts? BTW: I did check SMART data and the disk's got a 100% clean bill of health (no reallocated or pending sectors etc., short selftest completes without errors)
Archived
This topic is now archived and is closed to further replies.