August 1, 2009
We lost power here a few days ago. Since then my unRAID server has been locking up about two hours after I bring it up. I am pretty sure it is a hardware problem, but I was hoping someone "in the know" could verify there is nothing in the syslog that might help. I looked but didn't see anything offhand.
August 1, 2009
Everything looks quite normal to me. When your server "locks up", how does it "lock up"? Can you log in via the console? In any case, it could be a corrupted file system as a result of the power failure. You might want to check each of your disks following the procedure described in the wiki: http://lime-technology.com/wiki/index.php/Check_Disk_Filesystems
Joe L.
August 1, 2009 Author
When it locks up you get no video (just a black screen with no cursor or words, like if the monitor is powered off), no keyboard, no network response of any kind. The server is still powered up but it is totally unresponsive. I have attached the latest syslog. You will see tons of disk errors. It was running a parity check during these errors, and the report says there are 2191 errors.
August 1, 2009
The disk errors all appear to involve Disk 2 (sde), uncorrectable (UNC) media errors, so you should obtain a SMART report for sde. For instructions, please see the Troubleshooting page, Obtaining a SMART report section. It may also be a good idea to run a SMART long test, mentioned at the bottom of that section. I did not actually look at EVERY error, but in skimming through the second syslog, they all appear to be the same. Disk 2 may be failing, although we can't draw any conclusions until we see the SMART report. It is possible that the drive was hanging the system while dealing with the bad sectors. By the way, with all of those errors, you should stop the parity check or sync; in the syslog, it looks like a sync was started?
Some suggestions, from a fellow nForce board user: you appear to have an older nForce board, judging from the fact that it only has OHCI USB support (USB 1.1). The latest Linux kernel versions seem to me to be better at handling the nForce quirks. I strongly recommend upgrading to unRAID v4.5-beta6 to get the latest kernel. If you don't want to upgrade that far, you should at least upgrade to v4.4.2, and consider adding swncq=0 to your syslinux.cfg file (see the Boot Codes wiki page). The early SWNCQ support seemed especially buggy, and your syslog shows a number of FIS-related errors (related to SWNCQ). The better option is to upgrade to v4.5-beta6, see "How do I upgrade the unRAID software?"
The nForce family of chipsets has had a number of characteristics, some good, some bad, and I've had my share of the bad. One of the lesser known quirks is the "kernel: spurious 8259A interrupt: IRQ7", found in all or most of the nForce based syslogs, and as far as I know, never seen in other syslogs. It appears to be harmless in most of the systems I have seen it in, but from personal experience I can say it can sometimes be a serious problem.
On my system, it would randomly 'fire' its interrupt, such that on a rare occasion, the kernel would disable the interrupt, thereby disabling ALL subsystems that had been assigned to use it. Since my kernel at the time, probably expressing its thrill-seeking nature, had assigned 3 of my more important subsystems (my networking chipset, one disk controller with TWO drives attached, and my unRAID flash drive) to IRQ7, they ALL went down together! An unRAID system with 2 drives down, the networking gone, and an inaccessible USB flash drive does not work very well! What I had to do was reserve IRQ7 in the BIOS config, so that it was unavailable for assignment, and I have had no problems since.
In your syslog, the USB port used by your flash drive was reset at the end of the first syslog. The USB support for it happens to be using Interrupt 23, which I think is the ACPI controlled hook to IRQ7 (7+16). I can't say for sure, but it is possible the reset was caused by a 'spurious IRQ7'. One thing you can try is to go into your BIOS at boot up, locate the menu for physical reservations of IRQs, and reserve IRQ 7, so it won't be used. Then check future syslogs for those resets, "kernel: usb 1-4: reset full speed USB device using ohci_hcd and address 2". A correctly programmed reset should be harmless, and seems to be here, but they are not normal, and should not be occurring.
For completeness, I will mention one more *oddity* (I don't have a better name for it). Your fifth onboard SATA port was ID'd as scsi5 and associated with ata5, but there is no "SATA link up/down" report for it. In all of the hundreds of syslogs that I have examined, I have never ever seen this before! There is always a "SATA link down", or a "SATA link up" followed by the link speed and the rest of the setup for that drive. Your third Promise SATA port is empty, and is properly reported as such with "ata9: SATA link down".
I have no idea of the significance of this, if there is any, but I don't like inconsistencies.
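The syslog checks suggested in this post (the spurious IRQ7 messages, the USB resets, and the disk errors) could be tallied with a short script. This is only a minimal sketch: the function name scan_syslog is made up for illustration, and the "I/O error, dev" substring is an assumption about how the kernel logs the failed reads, so adjust it to whatever actually appears in your syslog.

```shell
#!/bin/bash
# Sketch only: count the syslog messages discussed in this thread.
# scan_syslog is a hypothetical helper name, not part of unRAID.

scan_syslog() {
    local log="${1:-/var/log/syslog}"
    local n

    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi

    # grep -c prints the number of matching lines; it exits non-zero
    # when there are none, hence the "|| true".
    n=$(grep -c 'spurious 8259A interrupt: IRQ7' "$log" || true)
    echo "spurious_irq7=$n"

    n=$(grep -c 'reset full speed USB device' "$log" || true)
    echo "usb_resets=$n"

    # ASSUMPTION: failed reads are logged as "I/O error, dev sdX, sector N".
    # Print one line per device with its error count.
    grep -o 'I/O error, dev [a-z]*' "$log" \
        | awk '{print $NF}' \
        | sort \
        | uniq -c \
        | awk '{print "io_errors[" $2 "]=" $1}'
    return 0
}
```

Running `scan_syslog /var/log/syslog` before and after reserving IRQ 7 in the BIOS would give a rough way to compare the two syslogs. It does not replace the SMART report, which is still needed to judge the drive itself.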
August 2, 2009 Author
Thank you very much for the completeness of your observations. The "lockup" issue seems to be resolved. The server has been up for almost a full day now. What I didn't mention before is that I upgraded to the latest beta during the power outage. That is when the lockups started happening. Yesterday I went back to my original version (4.4) and all is well again. I am not sure whether I should report this as a bug. I will be buying a new motherboard this fall so I can run VMware on the unRAID box, so I think I will wait until then to upgrade to the latest version of unRAID.
August 4, 2009
"Posted by: pat_smith1969 When it locks up you get no video (just a black screen with no cursor or words, like if the monitor is powered off), no keyboard, no network response of any kind. The server is still powered up but it is totally unresponsive."
Sadly, I just encountered this same problem, and it wasn't a power outage but rather my 2-year-old pushing the power button while I was copying my 30GB .m2ts file. Every time I reboot the system, I get a fresh syslog, so I can't really tell anything from the log. Is there a way to get that syslog out when your system is "totally unresponsive"? I am using the Pro version 4.4.2. I've checked my disks following the Check Disk Filesystems steps ( http://lime-technology.com/wiki/index.php/Check_Disk_Filesystems ) and no corruption was found.
thanks, ~joy
August 4, 2009
I forgot to mention how I encountered this "totally unresponsive" issue. This has happened twice already, both times when I was trying to copy files off a USB external hard drive from Computer A to my server.
August 4, 2009
"Every time I reboot the system, I get a new fresh syslog so I can't really tell from the log. Is there a way to get that syslog out when your system is "totally unresponsive"? [...] I've checked my disks from the http://lime-technology.com/wiki/index.php/Check_Disk_Filesystems ( Check Disk Filesystems ) steps and no corruptions were found."
You can edit your go file, in the config folder of your unRAID flash drive, to add a line to copy the syslog to your flash drive, which will let you examine and post it from a Windows machine. Using WordPad (or any editor that preserves Linux end-of-line characters) add the following lines to your go file:
    cp /var/log/syslog /boot/syslog.txt
    chmod a-x /boot/syslog.txt
The chmod line just turns off the Hidden and System flags, so it isn't strictly necessary, but it is helpful if you haven't set your Windows system to show you Hidden files.
Check Disk Filesystems would have been my first recommendation! So that leaves the very unlikely possibility of a corrupted flash drive, or more probably, a hardware issue. You did not say how far it gets before you lose the monitor screen and responsiveness. Do you get the initial video and boot screens?
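One possible refinement of the syslog copy, offered only as a sketch: a single cp in the go file captures the syslog as it stands at that moment, so to keep whatever was logged shortly before a lockup, the copy could be repeated at an interval. The function names snapshot_syslog and snapshot_loop and the 300-second interval in the usage note are made up for illustration; nothing here is part of unRAID itself.

```shell
#!/bin/bash
# Sketch only: keep a recent copy of the syslog on the flash drive so
# the last snapshot taken before a lockup survives the reboot.
# snapshot_syslog and snapshot_loop are hypothetical helper names.

snapshot_syslog() {
    local src="$1" dst="$2"
    # Copy to a temporary name first, then rename, so an interrupted
    # copy does not replace the previous good snapshot.
    cp "$src" "$dst.tmp" && mv "$dst.tmp" "$dst"
    # Ask the kernel to flush pending writes out to the device.
    sync
}

snapshot_loop() {
    # $1 source, $2 destination, $3 seconds between copies,
    # $4 number of copies to make (0 or omitted means run forever).
    local src="$1" dst="$2" interval="$3" count="${4:-0}"
    local i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        snapshot_syslog "$src" "$dst"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

With these functions saved in a script that the go file runs, a line such as `snapshot_loop /var/log/syslog /boot/syslog.txt 300 &` would refresh the copy every five minutes in the background. The interval is a trade-off: a shorter one loses less of the log but writes to the flash drive more often.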
August 4, 2009
"You did not say how far it gets before you lose the monitor screen and responsiveness. Do you get the initial video and boot screens?"
The first time was about 6 hours into copying files over. The second time was about 1 hour in. My Computer A complained that the file path was too deep or something (I can't remember the exact warning message), then it lost the mapped drive. I checked my server screen and saw nothing but black. Hitting the keyboard gets no response, but the server is still running. Well, at least I can see the fans and lights are running.
August 5, 2009
"Verify that networking is lost, try using Telnet to access it."
Neither telnet nor ping can reach my server. I am going to try to reproduce it and see if I can get the syslog tonight.
thanks, ~joy