June 24, 2008 I suddenly noticed that I could no longer access files on my unRAID server from any other computer connected to the network. I then connected to the Server Management Utility using my browser. Everything looked fine. All disks showed green dots under the disk status, temps were good, etc. So I decided to try restarting the machine. Using the buttons on the Server Management Utility screen, I first tried stopping the array to take it offline, intending to restart it afterwards. Instead, it just hung after the Stop command. I failed to capture the syslog before this, so I can't provide any more info. I wasn't expecting it to lock up at that point. I restarted the server manually and it booted up. I checked the status again with the Server Management Utility running in my browser and it showed the command "Starting" with one of my disks showing "Mounting" under the Free column. It seemed to take a long time to start. I couldn't find any more information about this in the online manual or wiki, so I was wondering if anyone else knows what might be going on here? After waiting for about 20 minutes, "Started" finally appeared under the command area. All disks look like they are back to normal. It is now going through a parity check and so far, so good. In case this happens again, are there any suggestions for what I need to do? Of course I need to capture the syslog. Anything else? Is this a symptom of other problems that might be starting? Thanks
June 24, 2008 Why don't you capture a syslog now - it may give some clues. I had something similar happen when I was using IDE drives with round cables. It didn't get better with a reboot though. I also have a slightly odd behavior where Samba reloads a default config after many days (weeks?) of power. Never get drives stuck mounting though.
June 24, 2008 Author The server seems to be acting a little funny. The strange thing is everything looks fine from the management utility or browser console (whatever you call it). It is 75% through the parity check and there are no errors. The disk status shows that all drives are healthy. However, when I try to play a media file (specifically some of my music files) from several of the machines connected to the network, I have trouble. I was also trying to sync a portable player and WMP was having trouble reading the tracks and the file sync/transfer was just reporting 'error'. Is this because the server is going through a parity check? I've run parity checks and have been able to both read and write to the server without any problems in the past. I've attached my syslog to this post. Let me know if it gives you any clues. Thanks for your help.
June 24, 2008 Author There's something I forgot to mention in my first post. When I went to the management utility, I noticed 3 of my 4 drives were spun down. So before stopping the array, I picked the button to spin up all the drives. Then after all the green dots stopped flashing, I picked the stop button. So I wonder if, even though the management utility showed the green dots had stopped flashing, the drives were not fully spun up when I picked the stop button.
June 24, 2008 I am not an expert at reading these logs (but I'm learning). Hopefully someone else will chime in. I did notice two points in the log where there were substantial delays:
Jun 23 18:17:42 Media kernel: can't shrink filesystem on-line
Jun 23 18:20:33 Media emhttp: shareCount not found
Jun 23 18:21:48 Media last message repeated 2 times
Jun 23 18:37:18 Media kernel: ReiserFS: md1: replayed 1177 transactions in 1178 seconds
It appears that there is some issue with user shares? ("shareCount not found" took 3 minutes to return.) Someone who knows more than me needs to look at this. The last line I am familiar with. Whenever there is a dirty shutdown you will find unapplied transactions that need to be written to disk. The fact that it took 20 minutes to replay 1177 transactions seems excessively long. You should run smartctl on your "disk1" disk to make sure it is not having problems. (Follow the Troubleshooting link in my sig and search for smartctl for instructions.) I wish that unRAID would do a "sync" command internally from time to time to prevent large numbers of unapplied transactions staying unwritten for long periods of time. There is another 10 minute delay almost at the bottom of the log, which again seems related to shares:
Jun 23 18:37:24 Media emhttp: shcmd (13): killall -HUP smbd
Jun 23 18:47:28 Media emhttp: shareUseCache not found
Note that you are running 4.3b6 - a version that contains an issue with adding new disks to the array. Make sure to update before trying to add a new drive.
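For anyone who wants to spot delays like these without reading the syslog by eye, here is a minimal Python sketch. It is only an illustration, not anything unRAID provides: the function names `parse_time` and `find_gaps`, the 120-second threshold, and the assumed year 2008 (syslog timestamps omit the year) are all assumptions. The sample input is just the six lines quoted in the post above, treated as one sequence, so the gaps are between the excerpted lines rather than between adjacent lines of the full log.

```python
from datetime import datetime

# The six syslog lines quoted in the post (not the full log).
LOG = """\
Jun 23 18:17:42 Media kernel: can't shrink filesystem on-line
Jun 23 18:20:33 Media emhttp: shareCount not found
Jun 23 18:21:48 Media last message repeated 2 times
Jun 23 18:37:18 Media kernel: ReiserFS: md1: replayed 1177 transactions in 1178 seconds
Jun 23 18:37:24 Media emhttp: shcmd (13): killall -HUP smbd
Jun 23 18:47:28 Media emhttp: shareUseCache not found
"""


def parse_time(line, year=2008):
    """Parse the 15-character syslog timestamp at the start of a line.

    The year is an assumption because syslog does not record it.
    %b expects English month abbreviations in the default C locale.
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_gaps(lines, threshold_seconds=120):
    """Return (gap_seconds, previous_line, next_line) for each large gap."""
    gaps = []
    for prev, cur in zip(lines, lines[1:]):
        delta = (parse_time(cur) - parse_time(prev)).total_seconds()
        if delta > threshold_seconds:
            gaps.append((int(delta), prev, cur))
    return gaps


if __name__ == "__main__":
    for seconds, prev, cur in find_gaps(LOG.splitlines()):
        print(f"{seconds:5d} s gap before: {cur}")
```

To run it against a real log, read the file and pass its non-empty lines to `find_gaps`. The threshold is arbitrary; lower it to catch shorter stalls.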
June 24, 2008 Author This morning I checked to see if the parity check completed without errors. It looks like the server isn't working now. After hitting the Refresh command, the browser then showed "Problem loading page". I can no longer make a Telnet connection. So I am going to try connecting a monitor to see what's happening.
June 24, 2008 Author I've attached a screen shot to this post. The cursor is flashing at the bottom of the screen and unRAID doesn't respond to any keystrokes. What should I do next? Should I try manually restarting the server? Should I remove the USB flash drive and update the files to the latest version of unRAID?
June 24, 2008 Looks like the kernel crashed. I surmise that there is an issue with a directory, and when Samba (smbd) looked at it, the kernel had issues. Do you remember which disk was accessed last, or which file you tried to access? I suspect that particular disk may have issues; do a reiserfsck on it after the volume is unmounted. First and foremost, reboot the server and do a memtest to ensure memory is not an issue.
June 24, 2008 Author I think it was probably disk1. There's a slight chance it could also be disk3.
June 24, 2008 You have had a kernel crash. Your only choice is to restart the server by rebooting. It will do a parity check since you did not cleanly shut down. Once that is completed, perform a file system check on each of your disks. Instructions are in the wiki here: http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems Your stack trace shows the crash occurred in the SMB daemon. Odds are it is file-system corruption that caused it to reference memory that did not exist. An outside chance is bad memory, incorrect memory voltage, or a timing setting in the BIOS. If the file systems check out OK, stop the array and reboot again, this time choosing the memory test option to check it out. Only after those tasks are completed, consider upgrading to the latest release. No sense complicating your life until then. Joe L.
June 24, 2008 I would suggest doing a memory test first. It doesn't have to be overnight. At least 1 pass. If you have bad memory, things will slowly get worse if any bad cells continue to be accessed. If the memory test is good, then you have eliminated one possible cause of corruption.
June 24, 2008 Author It is bad memory (or a faulty memory socket on my Abit AB9 Pro motherboard). I'm using a Crucial Ballistix 2 x 1GB kit. I went through a bad memory problem in early May. At that time I contacted my vendor and they replaced the memory. I also made sure the voltage settings in the BIOS were high enough to support the memory. For the Crucial Ballistix you need the DDR2 voltage to be set at 2.2V. The server worked fine after installing the new memory. Since the memory was pretty inexpensive (maybe the price reflects the quality?), I decided to buy a spare 2GB kit. I'll try experimenting with different combinations. Hopefully, the motherboard isn't the problem. If I learn that the Crucial Ballistix is a bad match for this motherboard, does anybody have a recommendation for inexpensive, reliable memory that will work with the Abit AB9 Pro mobo?
June 24, 2008 Did your memory test prove bad memory? I'm using OCZ memory on my AB9 PRO. I also re-tuned the memory in the BIOS (you could try the same). For some reason the clock speed was 272 and not 266, so I adjusted it manually.
June 24, 2008 Good job Brian, and by the way, superb troubleshooting in that thread about the P5B VM DO power problems. I'll just comment that there was a parity check running in the background, which explains how long the transaction replay was taking, and probably the extent of the other delays too. That 20 minutes explains the delay in mounting Disk 1 in the first post. The 3 occurrences of 'shareCount not found' (probably one per data disk) are an unexplained oddity. I've only seen one other instance of it, a single instance there and also unexplainable, so for now I'll chalk it up to bad memory here. I have found instances of 'shareUseCache not found' in all 3 license types, both those with a Cache drive and those without. It often displays when User Shares appear to be in the process of being configured, and may appear inconsistently on stopping the array. All of this is unimportant though while memory is suspect. Must have good memory... Any troubleshooting and error reports are worthless until memory can be trusted.
June 24, 2008 Author After restarting the server, the first thing I did was run memtest. It immediately showed errors. The screen just started spouting rows of red. So I shut down the server and removed one stick of memory. I restarted and, before running memtest, I went into the BIOS to double check the voltage setting. For some strange reason the setting was back to the default of 1.8V. So I changed it to 2.2V. If the BIOS was reset to defaults, I wonder why all the other settings were not changed, like the boot order, etc. Then I ran memtest on the single stick of memory. It seems okay, but I only ran one full pass. Then I shut down the server and swapped the memory sticks. This way I could test to see if the other was bad. The second stick did not display any errors after a full pass. I shut it down another time and put the first stick back in the 2nd slot. So now they are both installed. I am running memtest and so far (after only about 40 minutes) it's made a little over one pass and there have been no errors. This problem might have been caused by the wrong voltage setting. And I wonder how it could have changed automatically to 1.8V after running for over a month at 2.2V. I'm going to restart the server and run unRAID. Then I will test the drives to make sure they are okay by running smartctl and reiserfsck.
June 24, 2008 "This problem might have been caused by the wrong voltage setting." The problem WAS caused by the wrong voltage setting.... you can be sure....
June 24, 2008 Author Well... hopefully this won't happen again. Last time, after I changed the voltage setting in the BIOS, I disconnected the keyboard and monitor. So the server was headless. There's no way to make changes to the BIOS settings when it is like that. So I just don't understand how it could have changed. I went ahead and ran smartctl on all the drives. I don't know exactly how to interpret the results, but they don't appear to show anything out of the ordinary. I've attached them to this post if anyone is interested. I will also probably run reiserfsck just to make sure everything is OK, unless you think it would be a total waste of time.
June 24, 2008 These smartctl outputs look fine. For some reason, disk3 is running about 7C hotter than the others. Still under 40C, so no problem. Thought you might be interested. Seems like memory voltage is the issue. I would recommend that you use the "Load Defaults" BIOS option, and then reset any changed settings (including your voltage change). I have never heard of a BIOS setting changing on its own, but have heard of problems caused by doing a BIOS update and then NOT loading defaults afterwards. Hopefully you'll soon be back enjoying your unRAID instead of troubleshooting it.
June 24, 2008 Author The first 3 drives (parity, disk1, and disk2) are in a 3-bay drive cage that has a 120mm fan attached to it. The fan sucks outside air directly onto those drives. I've attached a pic if anyone is interested. I bought it from a friend, but I think you might find them on eBay. Here's one they sell at Tigerdirect: http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=1580897&CatId=32 The 4th drive (disk3) is mounted below and doesn't have a fan blowing directly on it. In the future I hope to replace my chassis with something that would work well for all drives.