January 13, 2016

My server is acting up and is basically unusable -- things hang and then can't be killed (even with -9). I've had this happen with innocuous things like chmod and lsof. I can't shut down the system remotely since "shutdown -r now" hangs (after sending a broadcast that the system is going down). It just hangs and nothing else happens. The web server is unresponsive. When I telnet in, I can see the share data but get the 'hang' errors above when trying to do anything with some of it. In particular, anything that tries to access data written recently hangs (and takes the entire telnet session with it).

So, I found this in my syslog:

root@Tower:/var/log# grep "1\:0\:6" syslog*
syslog:Jan 12 23:04:48 Tower kernel: sd 1:0:6:0: [sdk] Synchronizing SCSI cache
syslog:Jan 12 23:04:49 Tower kernel: scsi target1:0:6: attempting target reset! scmd(ffff880009fa8600)
syslog:Jan 12 23:04:49 Tower kernel: sd 1:0:6:0: [sdk] CDB: opcode=0x88 88 00 00 00 00 00 00 5c 57 88 00 00 00 80 00 00
syslog:Jan 12 23:04:49 Tower kernel: scsi target1:0:6: target reset: SUCCESS scmd(ffff880009fa8600)
syslog:Jan 12 23:04:49 Tower kernel: sd 1:0:6:0: [sdk] CDB: opcode=0x88 88 00 00 00 00 00 00 5c 57 88 00 00 00 80 00 00
root@Tower:/var/log#

This is good (?)...well, in the sense that it explains some of my problem -- files older than this are ok but newer ones aren't. It seems that one of my drives is dead. fdisk -l /dev/sdk returns nothing, so I think the drive has just become non-responsive. I believe that what I'm seeing in "/mnt/disk4" is being reconstructed using parity information, but I don't know how to tell for sure.

Anyway, all I want to do now is shut down and reboot. I can't do it remotely and probably just need to hold down the power button. Assuming the problem is that a drive 'disappeared' due to failure or a bad cable, shouldn't unraid deal more appropriately with such a problem? Is there a way other than "shutdown -r now" to try to remotely reboot?
I'm running v6 (6.1-r3 I think).

PS: server uptime is 114 days, and nothing changed in that period.

PPS: In looking at the forums, this seems to be a fairly common problem, but potentially misdiagnosed/undiagnosed since unraid doesn't deal with this kind of hardware failure particularly well. Here's one: http://lime-technology.com/forum/index.php?topic=44305
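
A note on the "can't be killed, even with -9" symptom: on Linux this usually means the process is in uninterruptible sleep (state D), waiting on I/O that never completes, which would fit a disk that has stopped answering. A minimal sketch for listing such processes follows; it assumes a ps that accepts -o (procps does), and list_d_state is a made-up helper name, not an unRAID tool.

```shell
# list_d_state is a hypothetical helper, not part of unRAID.
# It reads "PID STAT COMMAND" lines on stdin and prints the ones
# whose state column starts with D (uninterruptible sleep).
list_d_state() {
    awk '$2 ~ /^D/ { print }'
}

# Show every process currently stuck in state D.
# The header line is dropped because "STAT" does not start with D.
ps -eo pid,stat,comm | list_d_state
```

If chmod, lsof or shutdown show up in that list, they are blocked inside the kernel and no signal will remove them until the I/O returns or the machine is reset.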
January 20, 2016 Author

This happened again. I am not physically near the server, and can't reboot it. Is there a way to unmount the unresponsive drive and try again?
January 20, 2016

It looks like the controller has lost communication with the disk. There isn't much you can do remotely to fix that. It looks likely that a site visit is needed.

Regarding the link you gave, there is just no information given in that posting for anyone to be able to offer any suggestions.
January 20, 2016 Author

I did an emergency reboot (remotely). I posted details in a separate thread here: http://lime-technology.com/forum/index.php?topic=45782.0
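
This post does not say which method was used for the emergency reboot; the details are in the linked thread. One common Linux mechanism when "shutdown -r now" hangs is the kernel's magic SysRq interface, sketched below as an assumption rather than as what was actually done here. The helper names sysrq_write and sysrq_reboot and the DRY_RUN variable are invented for this example. By default the script only prints the writes it would make; it touches /proc only when run as root with DRY_RUN=0.

```shell
# Hypothetical helpers for illustration; not unRAID commands.
# sysrq_write VALUE FILE: print the write, or perform it when DRY_RUN=0.
sysrq_write() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        echo "$1" > "$2"
    else
        echo "echo $1 > $2"
    fi
}

sysrq_reboot() {
    sysrq_write 1 /proc/sys/kernel/sysrq   # enable all SysRq functions
    sysrq_write s /proc/sysrq-trigger      # request an emergency sync
    sysrq_write u /proc/sysrq-trigger      # request read-only remount
    sysrq_write b /proc/sysrq-trigger      # reboot immediately, no clean shutdown
}

# Dry run by default: prints the four writes without performing them.
sysrq_reboot
```

The final "b" write resets the machine without stopping the array or unmounting anything, so it is a last resort for when a normal shutdown cannot complete.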
February 10, 2016 Author

Yes. The remote emergency reboot 'fixed' the problem. I was able to get the server back up and running -- disk read/write and remote access all worked as expected, but I do have a flaky drive. It stayed up for about 2-3 weeks, then the same error which I documented here happened again. I need to replace the drive, but getting access (even to a failing drive) was a lifesaver.