August 17, 2009

My Unraid 4.4.2 server feels like a Windows OS that I have to restart once a week. Somehow, copying data from my XP computer causes Unraid to hang. When it happens, the Unraid power is still on and the hard drive lights are green, but it doesn't respond to telnet or to the keyboard on the server. I have encountered this problem 4 times already, and each time I had to manually turn it off and then back on again. Every time I turn it on, it starts a parity check (not sure that's the proper way, but all 4 times returned 0 errors). I have also followed the Check Disk Filesystems steps, and no corruption was found on any drive. The problem seems to happen randomly, because copying the same file again doesn't reproduce it. The last time it happened was 2 weeks ago, so I thought the problem had gone away after I installed the Supermicro 8-port SATA card.

Here's the syslog that I got from Unraid, but I don't know how useful it is, since a new syslog is generated every time you start the server.

Here are the things that you might want to know.

1. It's random. It can happen during the first 30 mins or 6 hrs into copying.
2. Copying 1 TB of data from a USB drive mounted on the server works perfectly fine. I think it ran for about 20 hrs.
3. When it hangs, pinging, telnet, or using the keyboard on the server doesn't do anything.
4. It happened on 2 different XP computers while copying USB hard drive data to the Unraid server.

System info:

P4 3.2 GHz
Abit IC7-G mobo (4 SATA - using 3)
2GB GSkill memory
Supermicro AOC-SAT2-MV8 (8 SATA - using 3)
2 WD 2TB hard drives
4 WD 1TB hard drives
Norco 4220
Sony USB Pro License key from Tom
Router DLink DIR-655

Here's the link to the other thread that I thought was related, but I guess I should have started a new topic, since someone else may or may not have encountered this before.

http://lime-technology.com/forum/index.php?topic=4106.msg36479#msg36479

BTW, I do have another Pro License USB drive. Should I give that a try? If I do replace the USB drive, is it as simple as that, or do I have to redo Unraid all over again?

Thanks in advance, ~joy
August 17, 2009

I have not reviewed the syslog; however, a locked system is usually a hardware issue. I've had this happen with a couple of MSI boards and PCI cards. I used some boot parameters and it alleviated the issue. YMMV. I think I used noapic or irqpoll and it worked.

Example in syslinux.cfg (just an example):

append initrd=bzroot rootdelay=10 acpi=off nolapic noapic irqpoll

Here are other options.

* nolapic
  o may be required to get some motherboards to work
* noapic
  o may be required to get some motherboards to work
  o may be required for boards based on the nForce 5 or higher chipsets, until unRAID v4.4 final
* acpi=off
  o may be required to get some motherboards to work
* acpi=force
  o may be required to get some motherboards to work (Asus P4SDX may need this)
* irqpoll
  o wastes cpu cycles, but is sometimes needed for some motherboards
* pci=routeirq
  o may be required to get some motherboards to work
* pci=noacpi
  o may be required to get some motherboards to work
* pci=nomsi
  o may be required to get some motherboards to work
* swncq=0
  o only if needed, for boards based on the nForce 5 or higher chipsets, and possibly only for unRAID v4.4-beta2
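A minimal sketch of adding boot parameters like the ones above from the command line instead of an editor. It assumes the flash drive is mounted at /boot and the file is /boot/syslinux.cfg; the function name add_boot_params is hypothetical and not an unRAID tool. It keeps a .bak copy and only changes append lines that load bzroot.

```shell
#!/bin/bash
# Hypothetical helper (not part of unRAID): add extra kernel parameters to the
# "append" line(s) that load bzroot in a syslinux.cfg, keeping a backup copy.
add_boot_params() {
    local cfg="$1"
    shift
    local params="$*"

    # Keep the untouched file as <cfg>.bak so the change is easy to undo.
    cp "$cfg" "$cfg.bak" || return 1

    # On matching lines, replace trailing whitespace with " <params>".
    # Other lines (labels, the Memtest86+ entry) pass through unchanged.
    sed "/^[[:space:]]*append .*initrd=bzroot/ s|[[:space:]]*\$| $params|" \
        "$cfg.bak" > "$cfg"
}
```

Assumed usage: add_boot_params /boot/syslinux.cfg acpi=off nolapic, followed by a reboot. Copying syslinux.cfg.bak back over syslinux.cfg undoes the change.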
August 19, 2009 (Author)

I just ran a 6 hr memtest and it passed. I just want to confirm this before making this type of change. Is the following correct? Does it matter if I use Word or Vi to make the changes?

=================================
default menu.c32
menu title Lime Technology LLC
prompt 0
timeout 50
label unRAID OS
  menu default
  kernel bzimage
  append initrd=bzroot rootdelay=10 acpi=off nolapic noapic irqpoll
label Memtest86+
  kernel memtest
====================================

Also, is there any way to capture the syslog before it locks up? Will the tail command work?

tail -f syslog > syslog.backup

One last question: is there a way to monitor the CPU temp while Unraid is running? I noticed my CPU temp was hitting 58C when I was checking some BIOS setup stuff.

thanks, ~joy
August 19, 2009 (Author)

This is bad. After I replied to my previous post, I got another lockup about 5 mins into copying data over to the server. Unfortunately, I didn't capture the log; I didn't realize the log directory is wiped when the server starts up. So I am copying it to my USB drive with this command now:

tail -f syslog > /boot/mylog/syslog.backup

I was tailing it with telnet when it happened, but there was nothing new when it locked up. It was repeating these 2 lines in telnet:

Aug 19 00:00:15 Tower kernel: ACPI: Transitioning device [FAN] to D0
Aug 19 00:00:15 Tower kernel: ACPI: Unable to turn cooling device [f78165a0] 'on'

Does anyone know what this is doing? I have 3 fan plugs on my motherboard: 1 CPU, 1 NB and 1 System. The only one without a fan plugged in is the System one. Btw, this is without WeeboTech's changes. I want to confirm my previous post before committing this change.

Update: I was able to reproduce it again :-( Attached is the log. Strangely enough, my tail log didn't capture the last two lines that my telnet session was showing. Unraid locked up at the time these 2 lines showed up:

Aug 19 00:31:52 Tower kernel: ACPI: Transitioning device [FAN] to D0
Aug 19 00:31:52 Tower kernel: ACPI: Unable to turn cooling device [f78165a0] 'on'

thanks, ~joy
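One plausible reason the tail copy was missing its last lines (an assumption, not something confirmed in this thread) is that the data had not yet been flushed to the flash drive when the server locked up. A minimal sketch that copies the syslog to the flash and forces a flush each time is shown here. The function name snapshot_syslog is hypothetical; the default paths /var/log/syslog and /boot/mylog come from the posts above.

```shell
#!/bin/bash
# Hypothetical helper: copy the current syslog to the flash drive and flush
# cached writes, so the copy is more likely to survive a hard lockup.
snapshot_syslog() {
    local src="${1:-/var/log/syslog}"
    local dest_dir="${2:-/boot/mylog}"

    mkdir -p "$dest_dir" || return 1

    # Copy the whole log, then ask the kernel to write cached data to disk.
    cp "$src" "$dest_dir/syslog.backup" && sync
}
```

Assumed usage, run from the console or a telnet session: while sleep 10; do snapshot_syslog; done & which refreshes the copy every 10 seconds. A shorter interval loses fewer lines at the cost of more writes to the flash drive.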
August 19, 2009

If the fans are working, it's an ACPI bug (which a lot of MBs have). Definitely try the no-ACPI boot codes.
August 20, 2009 (Author)

I think I am losing my head on this one. I bought a new CPU cooler to replace my stock one and the temp dropped from 58C to 40C. I have not seen those ACPI complaints anymore but, as always, new problems show up.

Without the ACPI changes:
1. Copying data from my computer to the server still loses the connection. I can't telnet to it or ping it. Obviously, I can't see the tower/main page, but I can access the server from the server keyboard. Is there a command line to reboot Unraid gracefully like the button from tower/main page? I tried the following, hoping I could see tower/main again, but no luck.

/etc/rc.d/rc.inet1 stop
/etc/rc.d/rc.inet1 start

With the ACPI changes:
1. Same as before: copying data to the server and the server locks up.

thanks, ~joy
August 20, 2009

"Is there a command line to reboot Unraid gracefully like the button from tower/main page?"

You can reboot gracefully by going through a series of commands. You can "try" to take the array off-line cleanly by typing the following series of commands:

cd
cp /var/log/syslog /boot/syslog.txt
killall smbd nmbd
sync
for disk in /mnt/disk* /mnt/cache
do
  umount $disk
done
mdcmd stop

Then you can power down by typing:

poweroff

or reboot with

reboot

When you have connectivity, attach the copy of the syslog you made to the flash drive to your next post. It might just have the clues needed to figure out what else is happening.

If all you want to get to is the web-management page, you might be able to do that by

killall emhttp
nohup /usr/local/sbin/emhttp &

If you can get to it, it might save you from typing all the other commands above.

Joe L.
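The command series above could be wrapped in a small script so it can be typed as one command at the console. This is only a sketch: the function names run and stop_array and the DRY_RUN switch are hypothetical additions, and it uses cd / where the post uses plain cd. The commands themselves are the ones listed above.

```shell
#!/bin/bash
# Hypothetical wrapper around the commands from the post above.
# Set DRY_RUN=1 to print each command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

stop_array() {
    # Leave any directory on the array so the disks can be unmounted.
    cd / || return 1

    run cp /var/log/syslog /boot/syslog.txt   # save the log to the flash drive
    run killall smbd nmbd                     # stop Samba so no shares stay open
    run sync                                  # flush pending writes

    local disk
    for disk in /mnt/disk* /mnt/cache; do
        [ -e "$disk" ] || continue            # skip names that do not exist
        run umount "$disk"
    done

    run mdcmd stop                            # take the array off-line
}
```

Assumed usage: DRY_RUN=1 stop_array to preview the commands, then stop_array followed by poweroff or reboot. As in the original post, this is a "try"; any step can fail if a disk is still busy.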
August 21, 2009 (Author)

This is great. I think this should have been on the Wiki. I am planning to use a spare router and some different cables, just to make sure it's not a DLink 655 router issue.

thanks, ~joy
August 21, 2009

"This is great. I think this should have been on the Wiki. ..."

So have you added it to the wiki yet then?
August 21, 2009

"This is great. I think this should have been on the Wiki. ..."
"So have you added it to the wiki yet then"

lol
August 21, 2009 (Author)

:'( Different router and cables are still a no-go. Copying data is still locking up the server.

Here's the change in syslinux.cfg:

append initrd=bzroot rootdelay=10 acpi=off nolapic

Here's the log. This is strange, since it started to copy data around 00:30. The last file that it attempted to copy over was at 02:00, but there's nothing in the log.

Aug 21 00:03:19 Tower emhttp: shcmd (76): /etc/rc.d/rc.samba stop >/dev/null
Aug 21 00:03:19 Tower emhttp: shcmd (77): /etc/rc.d/rc.nfsd stop >/dev/null
Aug 21 00:03:20 Tower emhttp: shcmd (78): hostname Tower
Aug 21 00:03:20 Tower emhttp: shcmd (79): echo '# Generated' >/etc/hosts
Aug 21 00:03:20 Tower emhttp: shcmd (80): echo '127.0.0.1 Tower localhost' >>/etc/hosts
Aug 21 00:03:20 Tower emhttp: shcmd (81): cp /etc/exports- /etc/exports
Aug 21 00:03:21 Tower emhttp: shcmd (82): /etc/rc.d/rc.samba start >/dev/null
Aug 21 00:03:21 Tower emhttp: shcmd (83): /etc/rc.d/rc.nfsd start >/dev/null
Aug 21 00:07:52 Tower in.telnetd[1858]: connect from 192.168.1.100 (192.168.1.100)
Aug 21 00:08:07 Tower login[1859]: ROOT LOGIN on `pts/0' from `192.168.1.100'
Aug 21 00:09:10 Tower in.telnetd[1871]: connect from 192.168.1.100 (192.168.1.100)
Aug 21 00:09:24 Tower login[1872]: ROOT LOGIN on `pts/1' from `192.168.1.100'

I still have another USB Pro License. Can I just use that and replace my current one to see if it works? If that still doesn't work, I guess I will have to change my mobo and CPU, most likely to the Supermicro C2SEA + E5200 or E8400.

thanks, ~joy
August 22, 2009 (Author)

I caught something while tailing the syslog. Surprisingly, this time the server didn't lock up and was able to retain its connection. Of course, the copying stopped. The only difference this time was that I had disabled User Share. Does anyone know what this error means, other than a lost/regained connection?

Aug 22 01:22:22 Tower kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Aug 22 01:22:22 Tower kernel: Tx Queue <0>
Aug 22 01:22:22 Tower kernel: TDH <22>
Aug 22 01:22:22 Tower kernel: TDT <3e>
Aug 22 01:22:22 Tower kernel: next_to_use <3e>
Aug 22 01:22:22 Tower kernel: next_to_clean <22>
Aug 22 01:22:22 Tower kernel: buffer_info[next_to_clean]
Aug 22 01:22:22 Tower kernel: time_stamp <5e97>
Aug 22 01:22:22 Tower kernel: next_to_watch <22>
Aug 22 01:22:22 Tower kernel: jiffies <5f50>
Aug 22 01:22:22 Tower kernel: next_to_watch.status <0>
Aug 22 01:22:24 Tower kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Aug 22 01:22:24 Tower kernel: Tx Queue <0>
Aug 22 01:22:24 Tower kernel: TDH <22>
Aug 22 01:22:24 Tower kernel: TDT <3e>
Aug 22 01:22:24 Tower kernel: next_to_use <3e>
Aug 22 01:22:24 Tower kernel: next_to_clean <22>
Aug 22 01:22:24 Tower kernel: buffer_info[next_to_clean]
Aug 22 01:22:24 Tower kernel: time_stamp <5e97>
Aug 22 01:22:24 Tower kernel: next_to_watch <22>
Aug 22 01:22:24 Tower kernel: jiffies <6018>
Aug 22 01:22:24 Tower kernel: next_to_watch.status <0>
Aug 22 01:22:26 Tower kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Aug 22 01:22:26 Tower kernel: Tx Queue <0>
Aug 22 01:22:26 Tower kernel: TDH <22>
Aug 22 01:22:26 Tower kernel: TDT <3e>
Aug 22 01:22:26 Tower kernel: next_to_use <3e>
Aug 22 01:22:26 Tower kernel: next_to_clean <22>
Aug 22 01:22:26 Tower kernel: buffer_info[next_to_clean]
Aug 22 01:22:26 Tower kernel: time_stamp <5e97>
Aug 22 01:22:26 Tower kernel: next_to_watch <22>
Aug 22 01:22:26 Tower kernel: jiffies <60e0>
Aug 22 01:22:26 Tower kernel: next_to_watch.status <0>
Aug 22 01:22:27 Tower kernel: ------------[ cut here ]------------
Aug 22 01:22:27 Tower kernel: WARNING: at net/sched/sch_generic.c:219 dev_watchdog+0xf0/0x16d()
Aug 22 01:22:27 Tower kernel: NETDEV WATCHDOG: eth0 (e1000): transmit timed out
Aug 22 01:22:27 Tower kernel: Modules linked in: md_mod ata_piix piix ide_core sata_mv sata_sil libata e1000
Aug 22 01:22:27 Tower kernel: Pid: 0, comm: swapper Not tainted 2.6.27.7-unRAID #3
Aug 22 01:22:27 Tower kernel: [<c011cc08>] warn_slowpath+0x61/0x86
Aug 22 01:22:27 Tower kernel: [<c011545b>] enqueue_task+0xa/0x14
Aug 22 01:22:27 Tower kernel: [<c01154eb>] activate_task+0x16/0x1b
Aug 22 01:22:27 Tower kernel: [<c011816c>] try_to_wake_up+0x11c/0x125
Aug 22 01:22:27 Tower kernel: [<c01157c0>] __wake_up_common+0x34/0x58
Aug 22 01:22:27 Tower kernel: [<c0115fc8>] complete+0x28/0x36
Aug 22 01:22:27 Tower kernel: [<c015185a>] dma_pool_free+0xde/0x128
Aug 22 01:22:27 Tower kernel: [<c0130a7b>] clocksource_get_next+0x39/0x3f
Aug 22 01:22:27 Tower kernel: [<c012fafa>] update_wall_time+0x584/0x71d
Aug 22 01:22:27 Tower kernel: [<c020e218>] strlcpy+0x14/0x41
Aug 22 01:22:27 Tower kernel: [<c02c2491>] dev_watchdog+0xf0/0x16d
Aug 22 01:22:27 Tower kernel: [<c012ece6>] sched_clock_cpu+0x13e/0x149
Aug 22 01:22:27 Tower kernel: [<c011ab49>] scheduler_tick+0xa0/0xc9
Aug 22 01:22:27 Tower kernel: [<c012dd86>] hrtimer_run_pending+0x1a/0x78
Aug 22 01:22:27 Tower kernel: [<c02c23a1>] dev_watchdog+0x0/0x16d
Aug 22 01:22:27 Tower kernel: [<c01239be>] run_timer_softirq+0x107/0x15a
Aug 22 01:22:27 Tower kernel: [<c01204b9>] __do_softirq+0x6c/0xcf
Aug 22 01:22:27 Tower kernel: [<c012054e>] do_softirq+0x32/0x36
Aug 22 01:22:27 Tower kernel: [<c0105179>] do_IRQ+0x54/0x67
Aug 22 01:22:27 Tower kernel: [<c01035a3>] common_interrupt+0x23/0x28
Aug 22 01:22:27 Tower kernel: [<c0107daa>] default_idle+0x2a/0x3d
Aug 22 01:22:27 Tower kernel: [<c01019e1>] cpu_idle+0xbd/0xd5
Aug 22 01:22:27 Tower kernel: =======================
Aug 22 01:22:27 Tower kernel: ---[ end trace 1c5a44a222f55607 ]---
Aug 22 01:22:27 Tower ifplugd(eth0)[976]: Link beat lost.
Aug 22 01:22:30 Tower kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Aug 22 01:22:30 Tower ifplugd(eth0)[976]: Link beat detected.
August 24, 2009

This is good! You finally got the defective part to reveal itself. It looks to me like you have a bad network card or chipset. Try disabling it in the BIOS setup, and installing a good network card, and test again.
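To check whether a saved syslog shows the same NIC symptoms, the two message strings from the log in this thread can simply be counted. This is a minimal sketch: the function name nic_hang_report is hypothetical, and the exact message text is an assumption based on the e1000 lines posted above, so other drivers or kernel versions may word it differently.

```shell
#!/bin/bash
# Hypothetical helper: count the NIC hang messages seen in this thread's log.
nic_hang_report() {
    local log="${1:-/var/log/syslog}"
    local hangs timeouts

    # grep -c prints the number of matching lines (0 when there are none).
    hangs=$(grep -c 'Detected Tx Unit Hang' "$log")
    timeouts=$(grep -c 'NETDEV WATCHDOG' "$log")

    echo "tx_unit_hangs=$hangs watchdog_timeouts=$timeouts"
}
```

Assumed usage: nic_hang_report /boot/syslog.txt on the copy saved to the flash drive. Non-zero counts point toward the network card or its driver, though a hard lockup can still leave nothing in the log, as happened earlier in this thread.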
August 24, 2009 (Author)

"This is good! You finally got the defective part to reveal itself. It looks to me like you have a bad network card or chipset. Try disabling it in the BIOS setup, and installing a good network card, and test again."

I hope so, because I just finished copying 70GB of data over the network with an old cheapo TrendNet Gig card. :-) I am going to try copying 500GB over tonight and see what happens. (crossing fingers)

thanks, ~joy
August 25, 2009 (Author)

Yup, that's it. It's a NIC issue. The 500GB copy over the network went very well. I want to say thank you to all who contributed. At times, I think my choice of words may have been a bit demanding. Sorry, it's my bad QA habit of expecting everything to be fixed or working.

I would think Unraid should have better error handling than this. Personally, I feel a bad NIC shouldn't lock up your system.

thanks, ~joy
August 25, 2009

This same scenario happened to me too. I wound up replacing EVERYTHING except the NIC. It took me over a year, but I finally tried a new NIC and all was good. I saw that it happened to someone else too. At least I was able to suggest a new NIC to him, and he saved troubleshooting time. NICs are cheap. I'd suggest that anyone having a lock-up problem while transferring large files or large sets of files try replacing the NIC first.
September 3, 2009

"I would think Unraid should have better error handling than this. Personally, I feel a bad NIC shouldn't lock up your system."

I'm glad you were finally able to resolve this. You are certainly not the first to have NIC issues that required replacement; it is much more common than it should be. Perhaps just a technicality, but in this case it was a hardware issue that was crashing the Linux OS, not really the fault of unRAID itself. unRAID runs on top of Linux, and can't generally be faulted if the OS and its hardware support are crashing. As you say, it would be nice if there were better handling of bad NICs in Linux ...
This topic is now archived and is closed to further replies.