January 20, 2010

Hi,

I recently updated to 4.5. I have 5 data disks and 1 parity disk, 3.75TB in total. Thus far my UNRAID has always run flawlessly. I do see several EMask drive errors in the boot logs, so I recently replaced all the data cables with new ones. The EMask errors remained (are they ok?).

I was doing a backup last week of about 85GB. The backup stopped abruptly with the messages ‘network connection lost’ and ‘could not size the file’. I was using Norton Ghost 15 from a Windows 7 box, but the remote OS and backup program do not seem to matter. The backup stopped, and the UNRAID server hung and became unresponsive. I was forced to do a hard restart.

When the system came back up I replaced a hard disk, which I thought was the culprit. The parity rebuilt with 127 errors. I tried the backup again, with the same results: system frozen, then after a hard reset, this time only 1 parity error.

So instead of using Norton Ghost, I simply started a copy over the network to the UNRAID system, which at that time had just finished a complete parity check with zero errors. I tried an 85GB file, and 5%-10% into the copy the same thing happened. The UNRAID server is once again frozen and unresponsive and will need a hard restart, which is probably what is causing the parity errors, yes?

I do notice a lot of UDMA_CRC_Error_Counts on the parity disk, but that number is no longer growing. I thought the data cable might have been too close to the power supply, so I re-routed it.

I can’t get the log because the system keeps crashing during the copy. I now suspect the parity drive could be causing my issues, and I have a replacement drive. I haven’t replaced it yet, as I am now seeking the experts’ help.

What should I do from here? I’ve already safeguarded the important data from the UNRAID elsewhere.

Thanks,
Reuben
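One way to check whether the UDMA_CRC_Error_Count really has stopped growing is to save the SMART attribute table before and after a test copy and compare the raw values. The sketch below is only an illustration: it assumes the text was saved from `smartctl -A` and follows the usual ten-column attribute layout with the raw value last, and the function names are made up for this example.

```python
from typing import Optional


def crc_error_count(smart_attributes: str) -> Optional[int]:
    """Return the raw UDMA_CRC_Error_Count from `smartctl -A` style text, or None."""
    for line in smart_attributes.splitlines():
        fields = line.split()
        # Assumed row layout:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            return int(fields[9])
    return None


def crc_errors_grew(before: str, after: str) -> bool:
    """True if the raw counter increased between the two snapshots."""
    old = crc_error_count(before)
    new = crc_error_count(after)
    if old is None or new is None:
        raise ValueError("UDMA_CRC_Error_Count not found in one of the snapshots")
    return new > old
```

If the counter keeps climbing during a copy, the cable or port is still suspect; if it stays flat, the old count is most likely history from before the cables were replaced.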
January 20, 2010

“The EMask errors remained – (are they ok?)”

No. They are not normal.

First, and this is general advice to ANYONE asking for assistance in troubleshooting their server: post a syslog.
What you've done so far is the equivalent of saying you've been driving your car around for several months and every once in a while it stops running. You've tried driving to different locations, and you even used different tires, but it still breaks down. You've tried a different oil filter, but it did not help. You've tried different roads... but the car still stops running.

I realize you think you gave the clues we would need, but in reality, you have not. Some of the things you tried were good, but possibly not applicable to your problem. Moving data cables away from power cables is good. Providing a clean source of power is critical. No disk will work properly if the power supply is not up to it. (Describe your hardware: your specific power supply, motherboard, etc. We know you have 6 disks... we really can't tell much more without specifics.)

Unless you can provide the important clues that can assist us (the syslog), we cannot assist you in any meaningful way. All anybody can do is give general advice.

A system that freezes is almost always hardware... or software (kernel crashes). Software is not very likely, as many of us use the exact same software and we do not crash, so that leaves hardware.

Step 1. Perform a memory test. Check the BIOS settings to ensure the memory voltage, timing, and speed are set for your specific RAM modules. It is not enough to let the BIOS set them automatically, as it often gets them wrong.

Step 2. Open a system console or telnet session and type
tail -f /var/log/syslog
Let it run while you perform another copy of files to your server.

One possibility is that the errors are filling the syslog to where the kernel runs out of memory for its own use. When that happens, processes are terminated in an attempt to free memory.
(The Samba networking process perhaps, causing you to lose network connectivity.)

Basically, if you start to see lots of errors being added to the syslog, stop the file transfer, capture a syslog (instructions in the wiki), zip it up, and attach it to your next post. If you are running unRAID 4.5, capturing the syslog is very easy from your browser.

Joe L.
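A plain `tail -f` in a telnet session loses everything when the server freezes, so it can help to copy each new chunk of the syslog to persistent storage and force it to disk right away. The following is a rough sketch of that idea, not a tested unRAID tool: it assumes a Python 3 interpreter is available on the server (stock unRAID may not ship one), and the destination path in the usage note is hypothetical.

```python
import os
import time


def copy_new_lines(src_path: str, dst_path: str, offset: int = 0) -> int:
    """Append everything in src_path past `offset` to dst_path,
    force it to disk, and return the new offset."""
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read()
        new_offset = src.tell()
    if chunk:
        with open(dst_path, "ab") as dst:
            dst.write(chunk)
            dst.flush()
            os.fsync(dst.fileno())  # ask the OS to write the data out now
    return new_offset


def follow(src_path: str, dst_path: str, interval: float = 1.0) -> None:
    """Poll src_path forever, similar to `tail -f`, persisting each new chunk."""
    offset = 0
    while True:
        offset = copy_new_lines(src_path, dst_path, offset)
        time.sleep(interval)
```

Calling `follow("/var/log/syslog", "/boot/syslog-live.txt")` before starting the copy would leave whatever was logged up to the last poll on the flash drive. The sketch does not handle log rotation, and a hard freeze can still swallow the final second or so of messages.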
January 20, 2010 Author

Sorry, I deserved that. I was at work and remoted into my home network. When I saw the crash again I was so upset that I put together that first post (which, as you pointed out, was not that helpful).

I did run the tail on the log file, but when the UNRAID vanished, so did my putty session.

Tonight I will put together a more comprehensive message, with a log if possible, and send it up.

Do I need to allow the parity check to complete after a crash, or may I stop it?

Thanks,
Reuben
January 20, 2010

“I did run the tail on the log file, but when the UNRAID vanished, so did my putty session.”

You can try again on the system console...

“Tonight I will put together a more comprehensive message, with a log if possible and send it up.”

I have a dance lesson early this evening (I do have a life other than unRAID), but will be home afterwards and take a look if nobody else does.

“Do I need to allow the parity check to complete after a crash, or may I stop it?”

Yes, you do.

“Thanks, Reuben”

You are welcome... I know the analogy in my original response was probably a simplification. I just wanted others following this thread in the future to see what is needed to provide the analysis clues. With no real clues, it can only be very general assistance.
January 20, 2010 Author

So I probably won't post anything tonight because the parity check will take 8 hours. However, you mentioned I must let the parity check complete. Maybe this is my issue: the very first time the UNRAID hung and I did the hard reset, I never let the parity check complete when it came back up. I just pulled the 'thought to be bad' drive out, replaced it, and then let the next parity check complete. Could I have screwed up right there? Maybe this is the root of my issues?

Should I start from scratch somehow, or keep on the current path?

Thanks a lot for your attention.

Reuben
January 20, 2010 Author

Here is the syslog before the crash. Please notice several of these:

Jan 20 17:37:41 UNRAID-NAS kernel: ata7: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xf t4

Any ideas?

Thanks again,
Reuben

unraid_log__pre-crash.txt
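When a syslog contains "several of these", it is useful to know which ports they come from and how many there are on each. As a minimal sketch, assuming the exception lines always follow the format shown in the post above, the counts per port can be pulled out with a regular expression. The function name and the pattern are illustrative only.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   kernel: ata7: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xf t4
EXCEPTION_RE = re.compile(
    r"kernel: (ata\d+)(?:\.\d+)?: exception Emask (0x[0-9a-fA-F]+)"
)


def count_ata_exceptions(syslog_text: str) -> Counter:
    """Count exception lines per (port, Emask value) pair."""
    counts = Counter()
    for line in syslog_text.splitlines():
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts
```

Feeding it the contents of a saved syslog gives one entry per port and Emask value, which makes it easier to see whether the errors cluster on the ports of a single controller or cable.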
January 23, 2010 Author

I wanted to tie this up in hopes that the post might be useful to someone. In the end, there was no way for me to get the logs post-crash. The UNRAID would become totally unresponsive, and none of my attempts to tail the syslog to a file were successful.

The bottom-line problem turned out to be a couple of Western Digital ‘AAKS’ drives, to be exact “WD5000AAKS” (WD Caviar SE16). I had previously known about the potential problems (http://www.lime-technology.com/wiki/index.php?title=Hardware_Compatibility#Hard_Disk_Drives), however I had one in my UNRAID since ‘day one’ and there were never any issues until I added the second one. I didn’t use the UNRAID immediately after adding the second “WD5000AAKS”, so when the hanging (freezing, crashing, unresponsiveness, etc.) started I did not put two and two together. According to the forums, these drives coming out of spin-down is the root cause of the crashing.

The error messages I spoke about in my syslog were totally unrelated to my hanging issues:

Jan 23 02:41:28 UNRAID-NAS kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 23 02:41:28 UNRAID-NAS kernel: ata2.00: ATA-8: WDC WD10EADS-00L5B1, 01.01A01, max UDMA/133
Jan 23 02:41:28 UNRAID-NAS kernel: ata2.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 0/32)
Jan 23 02:41:28 UNRAID-NAS kernel: ata2.00: configured for UDMA/133
Jan 23 02:41:28 UNRAID-NAS kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xf t4
Jan 23 02:41:28 UNRAID-NAS kernel: ata2: hotplug_status 0x2
Jan 23 02:41:28 UNRAID-NAS kernel: ata2: hard resetting link
Jan 23 02:41:28 UNRAID-NAS kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 23 02:41:28 UNRAID-NAS kernel: ata2.00: configured for UDMA/133
Jan 23 02:41:28 UNRAID-NAS kernel: ata2: EH complete

After researching the internet, the above error “exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xf t4” turns out to be related to my “Promise TX4 SATA300” interface card.
Apparently, users with this card see these Emask errors, but they seem to be benign according to the reports I’ve read. Since my new motherboard has only four built-in SATA ports, I added one of the Promise TX4 SATA300 cards back in to cover my seven-drive setup. Sure enough, the two drives connected to the Promise TX4 SATA300 still show these errors, but as I mentioned they do not seem to cause me any problems.

Before I realized it was just the infamous ‘AAKS’ issue, I had already sprung for a full upgrade of my UNRAID. I purchased a GigaByte G31M-ES2L motherboard (tested by prostuff1 here: http://lime-technology.com/forum/index.php?topic=3429.0), a Pentium E5300 dual core, 2 GB of DDR2 memory, a new 750 watt power supply, and two Green WD10EADS 1TB drives to replace the problem children. All for under $400 from http://microcenter.com/. I purchased from their ‘brick and mortar’ store in Long Island NYC.

I hope this helps. As already advised, definitely read the hardware recommendations prior to building your UNRAID, and stay away from the “WD5000AAKS” hard drives when considering drives for your UNRAID.

Btw, I do use the “WD5000AAKS” hard drives in an unrelated system in a RAID5 configuration with no issues.

I still love my UNRAID.

Good Luck!
Reuben
NYC
January 24, 2010

“I purchased a GigaByte G31M-ES2L motherboard”

Then you should get yourself very familiar with the HPA issue with GigaByte motherboards. Read here: http://lime-technology.com/wiki/index.php?title=UnRAID_Topical_Index#HPA
January 24, 2010 Author

Thanks, good looking out! I was checking into that here. So far there are no signs of the HPA issue. I do have the latest BIOS, and it seems that if it saves at all, it would save to the USB flash drive, which shouldn’t be an issue. However, to be safe I moved the parity drive from the first SATA connector to the fourth (I don’t want to take any chances).

This board has dual auto-saving BIOS chips. If one fails it defaults to the other, and then tries to re-save back to the one that failed. I think with this feature maybe they removed the auto-save to hard drive ‘feature’ under the redundancy and overkill laws. lol

Thanks,
Reuben