4.7 ugly server crash - red dot - HELP Please - General Support (V5 and Older)

July 7, 201213 yr

Bad server crash.

The situation started when I noticed lots of repeated syslog messages related to port 5 on my AOC-SASLP-MV8 controller (disk10 is on that port).

Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = 0x1000000
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = 0x1000000
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = 0x1000000
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = 0x1000000
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: usr/srusr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = 0x1000000usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq stsusr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irqusr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq susr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq stsusr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq susr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts = usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Port 5 irq sts =usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.
Jul  7 14:16:20 FileServer kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 2071:Porusr/src/sas/trunk/mvsas_tgt/mv_sas.c 2069:port 5 ctrl sts=0x199800.

I shut down my processes (sabnzbd, transmission, subsonic, cachedirs), and went to the webinterface - all were green balls. I tried stopping server through webinterface, and it then hung on me and stopped responding. I was able to check the syslog with unmenu one time before unmenu also stopped responding. I waited for a while, still nothing.

Then fired up a command window and hit /sbin/powerdown which did not work (I forgot the error message it came back with but think it's the same I later got when trying /root/mdcmd stop).

I googled the forums a bit and followed the instructions from this thread: http://lime-technology.com/forum/index.php?topic=6255.0

The unmount commands would hang semi-indefinitely, eventually completing for most of the drives. One stood out, however - disk 10 reported the device was busy. I'm sure I did not have processes accessing files on this particular drive (they all use disk2 only), and the busy status has to somehow be related to the syslog errors above for port5, which is disk10. ps-ef revealed a hdparm -C for this drive (/dev/sdf) running.

I also took out a /root/mdcmd status which is attached. It reports DISK_OK for all the disks, and there is one error for disk10, none for any others.

I tried killing that and a bunch of other stuff. Eventually the unmount for disk10 stopped reporting budy and instead reported drive not mounted. I proceeded to try /root/mdcmd stop again and got an error message along the lines of "error in line 11: cannot write to device, device busy" (should have noted down the exact error message).

Invoking /sbin/powerdown still also did not work, even poweroff command did nothing and I eventually resorted to holding the power button to turn the bastard off.

Upon rebooting, I was greeted with a red dot for disk10. All others are green, and the server can be started unprotected.

I'm really unsure what to do at this point. Can I trust the parity given the circumstances above?

Am I better off rebuilding parity (thereby trusting disk10?

Or would it be worth invoking the "trust my parity procedure and checking to see if I have any inconsistencies at all?

I have included a post-reboot syslog also. If inspecting the post-reboot syslog you will see I have changed disk6 & 7 to free ports on the mvsas controller - the JMB363 controller they were on was giving me syslog errors as well, but I believe that to be unrelated to the problem described above (and had verified I could successfully parity-check with no errors after the JBM363 errors had appeared.

mdcmd-status-2012-07-07.txt

syslog-2012-07-07.zip

syslog-2012-07-07-post-reboot.zip

Quote

July 8, 201213 yr

Author

Nobody feeling chatty tonight?

I ended up running a reiserfsch on the redballed disk10, and it came back 100% clean. Therefore elected to use the trust parity procedure instead of rebuilding the data on disk10.

Before I initiated this, I had rebooted the server in order to be running a clean go script, and unfortunately this had the side effect that the array automatically started unprotected, and lots of writes to the disks were performed in this step. The parity check now in progress found 69 errors immediately (within the first 0.1%) and no additional errors so far the following 2 percent. I can't know for sure if the parity mismatches were because of the improper shutdown or because the array was started without disk10, but in any case I'm hopeful (almost confident) I've done the right thing here.

Now, has anyone ever seen anything like the mvsas errors I listed above???

How about this stuff:

Jul  3 04:09:08 FileServer kernel:  sdb: sdb1
Jul  3 04:09:08 FileServer kernel: ------------[ cut here ]------------
Jul  3 04:09:08 FileServer kernel: WARNING: at drivers/ata/libata-core.c:5186 ata_qc_issue+0x10b/0x308()
Jul  3 04:09:08 FileServer kernel: Hardware name: P35-DS3R
Jul  3 04:09:08 FileServer kernel: Modules linked in: md_mod xor ide_gd_mod i2c_i801 i2c_core r8169 pata_jmicron jmicron ahci mvsas libsas scst scsi_transport_sas
Jul  3 04:09:08 FileServer kernel: Pid: 12231, comm: smartctl Not tainted 2.6.32.9-unRAID #8
Jul  3 04:09:08 FileServer kernel: Call Trace:
Jul  3 04:09:08 FileServer kernel:  [<c102449e>] warn_slowpath_common+0x60/0x77
Jul  3 04:09:08 FileServer kernel:  [<c10244c2>] warn_slowpath_null+0xd/0x10
Jul  3 04:09:08 FileServer kernel:  [<c11b624d>] ata_qc_issue+0x10b/0x308
Jul  3 04:09:08 FileServer kernel:  [<c11ac0ba>] ? scsi_init_sgtable+0x5b/0x76
Jul  3 04:09:08 FileServer kernel:  [<c11ba260>] ata_scsi_translate+0xd1/0xff
Jul  3 04:09:08 FileServer kernel:  [<c11a816c>] ? scsi_done+0x0/0xd
Jul  3 04:09:08 FileServer kernel:  [<c11a816c>] ? scsi_done+0x0/0xd
Jul  3 04:09:08 FileServer kernel:  [<c11baa40>] ata_sas_queuecmd+0x120/0x1d7
Jul  3 04:09:08 FileServer kernel:  [<c11bc6df>] ? ata_scsi_pass_thru+0x0/0x21d
Jul  3 04:09:08 FileServer kernel:  [<f841369a>] sas_queuecommand+0x65/0x20d [libsas]
Jul  3 04:09:08 FileServer kernel:  [<c11a816c>] ? scsi_done+0x0/0xd
Jul  3 04:09:08 FileServer kernel:  [<c11a82c0>] scsi_dispatch_cmd+0x147/0x181
Jul  3 04:09:08 FileServer kernel:  [<c11ace4d>] scsi_request_fn+0x351/0x376
Jul  3 04:09:08 FileServer kernel:  [<c11269a8>] __generic_unplug_device+0x26/0x29
Jul  3 04:09:08 FileServer kernel:  [<c112934a>] blk_execute_rq_nowait+0x56/0x73
Jul  3 04:09:08 FileServer kernel:  [<c11293dc>] blk_execute_rq+0x75/0x91
Jul  3 04:09:08 FileServer kernel:  [<c11292cc>] ? blk_end_sync_rq+0x0/0x28
Jul  3 04:09:08 FileServer kernel:  [<c11295fa>] ? blk_recount_segments+0x16/0x24
Jul  3 04:09:08 FileServer kernel:  [<c11250c6>] ? blk_rq_bio_prep+0x47/0x78
Jul  3 04:09:08 FileServer kernel:  [<c11ac58a>] scsi_execute+0xbf/0x113
Jul  3 04:09:08 FileServer kernel:  [<c11bb6d0>] ata_cmd_ioctl+0xf8/0x1ba
Jul  3 04:09:08 FileServer kernel:  [<c11bb940>] ata_sas_scsi_ioctl+0x1ae/0x1e8
Jul  3 04:09:08 FileServer kernel:  [<c11334a5>] ? radix_tree_lookup_slot+0xd/0xf
Jul  3 04:09:08 FileServer kernel:  [<f841333d>] sas_ioctl+0x36/0x43 [libsas]
Jul  3 04:09:08 FileServer kernel:  [<c11a94c8>] scsi_ioctl+0x299/0x2ad
Jul  3 04:09:08 FileServer kernel:  [<f8413307>] ? sas_ioctl+0x0/0x43 [libsas]
Jul  3 04:09:08 FileServer kernel:  [<c11b326d>] sd_ioctl+0x80/0x8c
Jul  3 04:09:08 FileServer kernel:  [<c112a420>] __blkdev_driver_ioctl+0x50/0x62
Jul  3 04:09:08 FileServer kernel:  [<c112ad1c>] blkdev_ioctl+0x8b0/0x8dc
Jul  3 04:09:08 FileServer kernel:  [<c1131e2d>] ? kobject_get+0x12/0x17
Jul  3 04:09:08 FileServer kernel:  [<c112b0f8>] ? get_disk+0x4a/0x61
Jul  3 04:09:08 FileServer kernel:  [<c101b028>] ? kmap_atomic+0x14/0x16
Jul  3 04:09:08 FileServer kernel:  [<c11334a5>] ? radix_tree_lookup_slot+0xd/0xf
Jul  3 04:09:08 FileServer kernel:  [<c104a179>] ? filemap_fault+0xb8/0x305
Jul  3 04:09:08 FileServer kernel:  [<c1048c43>] ? unlock_page+0x18/0x1b
Jul  3 04:09:08 FileServer kernel:  [<c1057c63>] ? __do_fault+0x3a7/0x3da
Jul  3 04:09:08 FileServer kernel:  [<c105985f>] ? handle_mm_fault+0x42d/0x8f1
Jul  3 04:09:08 FileServer kernel:  [<c108b6c6>] block_ioctl+0x2a/0x32
Jul  3 04:09:08 FileServer kernel:  [<c108b69c>] ? block_ioctl+0x0/0x32
Jul  3 04:09:08 FileServer kernel:  [<c10769d5>] vfs_ioctl+0x22/0x67
Jul  3 04:09:08 FileServer kernel:  [<c1076f33>] do_vfs_ioctl+0x478/0x4ac
Jul  3 04:09:08 FileServer kernel:  [<c105c983>] ? expand_downwards+0x109/0x136
Jul  3 04:09:08 FileServer kernel:  [<c1076f93>] sys_ioctl+0x2c/0x45
Jul  3 04:09:08 FileServer kernel:  [<c1002935>] syscall_call+0x7/0xb
Jul  3 04:09:08 FileServer kernel: ---[ end trace d2bdebf2054f7c59 ]---
Jul  3 04:09:09 FileServer preclear_disk-start[12265]: smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Jul  3 04:09:09 FileServer preclear_disk-start[12265]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Jul  3 04:09:09 FileServer preclear_disk-start[12265]: 
Jul  3 04:09:09 FileServer preclear_disk-start[12265]: === START OF INFORMATION SECTION ===
Jul  3 04:09:09 FileServer preclear_disk-start[12265]: Model Family:     Western Digital Caviar Blue Serial ATA family
...

The above was timed right down to the second I initiated the first preclear. Ever seen anything like this before? Not necessarily related to the JMB363 SATA controller as I first assumed, is it?

Quote

July 8, 201213 yr

Nobody feeling chatty tonight?

I ended up running a reiserfsch on the redballed disk10, and it came back 100% clean. Therefore elected to use the trust parity procedure instead of rebuilding the data on disk10.

That would have run it on the emulated (simulated) file system. It had absolutely nothing to do with the physical disk.

Before I initiated this, I had rebooted the server in order to be running a clean go script, and unfortunately this had the side effect that the array automatically started unprotected, and lots of writes to the disks were performed in this step. The parity check now in progress found 69 errors immediately (within the first 0.1%) and no additional errors so far the following 2 percent. I can't know for sure if the parity mismatches were because of the improper shutdown or because the array was started without disk10, but in any case I'm hopeful (almost confident) I've done the right thing here.

I'm not as certain. A small number of parity errors at the very beginning of a disk are probably in the housekeeping area of the disk tracking journal entries. They might be as of the result of the hard shutdown.

Good luck.

Personally, I would NEVER use the "trust" process. If a disk is "red" it is GUARANTEED a write to it failed. You are asking that it be trusted even though it does NOT contain what was written to it.

Joe L.

Quote

July 8, 201213 yr

Author

Thanks. I think it may be a little more nuanced in this case. I had suffered an improper shutdown, which doesn't exactly make one want to trust the parity either, does it? Between a rock and a hard place - either trust the redballed disk10 or rebuild it from a parity which very likely was corrupted due to the improper shutdown.

If only using the web-interface my only option would have been to unassign disk10 and then reassign it, after which point unraid would have rebuild that drive from parity.

The reiserfsck I did first was on the actual drive (with the array stopped, checking /dev/sdf1). Using the trust procedure allowed me to get back to a protected state and then re-sync the parity immediately after. One could say I used the same commands used to "trust my parity", but in this case the purpose and effect was to "trust my disk10".

As it happened, the drive redballed about 30% through the parity sync. Have moved it to a different controller port and checked all the cables are plugged all the way in, seems to have gotten further this time.

Quote

July 8, 201213 yr

Disk10 was disabled by unRAID and the parity disk was not. Parity is more likely correct. Examining SMART reports could provide more info. The attempt to preserve disk10 and rebuild parity is mistaken. It will take exactly as long to rebuild disk 10 as it does to generate new parity. Disk 10 has definitely has a write failure and the correct information could be restored from parity. The "trust" procedure does not return the array to a protected state if one the disks has bad data.

Quote

July 9, 201213 yr

Author

OK guys, you win. I suppose the fact that parity was deemed valid could indicate my hamfisted attempts at manually stopping the array may have been successful for all disks except disk10 (including parity). Will keep in mind should there come another time - Thanks.

Did anybody have any input on the syslog siplets I quoted in the above posts?

BTW: I did check SMART data and the disk's got a 100% clean bill of health (no reallocated or pending sectors etc., short selftest completes without errors)

Quote

4.7 ugly server crash - red dot - HELP Please

Featured Replies

Archived

Account

Navigation

Search

Configure browser push notifications

Chrome (Android)

Chrome (Desktop)

Safari (iOS 16.4+)

Safari (macOS)

Edge (Android)

Edge (Desktop)

Firefox (Android)

Firefox (Desktop)