April 3, 2011: I'm still in the process of troubleshooting, but I wanted to go ahead and get a post in to fill out the backstory of my issues while I'm waiting on a memtest.
I've been using UnRAID (initially free, and now Pro) for a few months now and was initially very pleased with the results. My hardware is mostly new, consisting of an ASUS M4A79T Deluxe, an Athlon X2 250 CPU, 4GB of Patriot DDR3 RAM, and 7 SATA drives (a mix of Western Digital, Hitachi, and Seagate). It is powered by a 650 watt Corsair power supply. Over this system's lifetime, I believe I've upgraded the UnRAID version 3 times. Currently I'm using version 4.7.
The above is the server's current configuration. Since the initial setup as an UnRAID server, I've changed the RAM (2GB generic to the 4GB Patriot), replaced the PSU (500 watt Thermaltake to the Corsair 650), and added a couple of drives (notably, I've replaced a dead Western Digital 2TB and added two 1.5TB Seagates).
Recently, the system has been unusually unstable. It's been hard locking (no local terminal response, web response, telnet response, or samba response). It will do this seemingly of its own accord when idle, although that's a bit infrequent. However, if I attempt to copy a file of reasonable size (i.e. more than a couple hundred MB) from one location on the UnRAID server to another location on the UnRAID server via samba, it will consistently lock after a few seconds. This occurs when using either disk shares or user shares.
I feel confident that this isn't a power or cooling problem. I don't think any specific disk is at fault, as SMART shows nothing unusual and the lockups happen during a copy regardless of which disks are being copied to or from. Memtest has been running for quite some time now with no errors. I have tailed the syslog while attempting to trigger the failure, but it shows nothing unusual. Likewise, the syslog shows no immediately apparent red flags from system startup.
At this point, I'm at my wit's end.
All of my data is on here, and with the frequent crashing, I don't feel that it's very secure. Likewise, the family depends on this server functioning for our backups and media serving. Any advice would be appreciated. I will update/repost with log data in a bit after I'm done with memtest and have tried a handful of other things.
April 3, 2011 (Author): Attached are my syslog and the SMART reports for each drive, except for the 7th drive, which is a hot spare and was disconnected for testing. Attachments: syslog.txt, hdc.txt, sda.txt, sdc.txt
April 3, 2011: I don't see anything obvious... Can you try telnetting into your unRAID server and then typing: tail -f -n 20 /var/log/syslog
This way, whenever the syslog is updated, the new lines are output to your telnet screen. Leave it running, and when your server crashes you will be able to see the last 20-30 lines of your syslog to see what happened.
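If the telnet window itself is lost when the box locks up, a variation on the same idea is to keep copying new syslog lines to storage that survives a reboot. Below is a minimal sketch of that, not a tested unRAID tool. The destination path mentioned in the comments is an assumption (any persistent location would do), and the names `copy_new_lines` and `watch` are made up for this example.

```python
import os
import time


def copy_new_lines(src_path, dst_path, offset=0):
    """Append whatever was written to src_path after `offset` to dst_path.

    Returns the new offset, so the caller can pass it back in next time.
    """
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read()
    if chunk:
        with open(dst_path, "ab") as dst:
            dst.write(chunk)
            dst.flush()
            # Ask the OS to push the data to the device now, so that a
            # hard lock a moment later is less likely to lose it.
            os.fsync(dst.fileno())
    return offset + len(chunk)


def watch(src_path, dst_path, interval=1.0):
    """Poll src_path forever, mirroring new content to dst_path."""
    offset = 0
    while True:
        offset = copy_new_lines(src_path, dst_path, offset)
        time.sleep(interval)
```

Something like `watch("/var/log/syslog", "/boot/syslog-copy.txt")` would be the intended use, assuming `/boot` is the flash drive on your install. This does not handle the log being rotated or truncated, and it writes to the flash often, so it is only meant to run while chasing a crash.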
April 4, 2011 (Author): Thanks for the reply, but as I said in the OP: I have tailed the syslog while attempting to trigger the failure, but it shows nothing unusual. Likewise, the syslog shows no immediately apparent red flags from system startup. Nothing at all was output to the syslog when the crashes occurred. I'm stumped.
April 4, 2011 (Author): Can you clarify what system specs you need that aren't included in paragraph 2 of my first post?
April 4, 2011: It is probably a combination of memory and BIOS issues (you probably have an old BIOS version). After the memtest, make sure you flash to the latest BIOS, as you have some problematic messages in the syslog that can be cured by a newer version. After that, load the default settings and disable anything you won't use (serial and parallel ports, audio, FireWire, floppy, even the IDE controller if you are not going to use any older PATA drives). You can run another memtest (just a single pass) to make sure that the "branded" memory plays nice with the new BIOS. If the new memory requires a higher voltage, make sure you set this up manually.
April 4, 2011: How about running top in another window to keep track of memory usage and processes when it crashes? I assume you've tried reseating all cables and interface cards? Have you checked your network connection (ethtool eth0) and the SMART values (smartctl -a /dev/sdX)? You could also try testing the hard drives: hdparm -tT /dev/[hs]d? I once had a brand new drive that was flaky and caused weird system crashes that weren't obvious from the logs...
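For anyone checking several drives, the SMART step can be scripted. The sketch below assumes the usual ten-column attribute table that smartctl prints; the list of attributes to watch is my own pick rather than an official rule, and the sample text is invented for illustration (it is not from the reports attached to this thread).

```python
import subprocess

# Attributes whose non-zero raw value is commonly treated as a warning sign.
# This selection is an assumption for the example, not a smartmontools rule.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_smart_attributes(text):
    """Map attribute name -> raw value (as a string) from smartctl output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = fields[9]
    return attrs


def suspicious(attrs):
    """Return the watched attributes that have a non-zero raw value."""
    found = {}
    for name in WATCHED:
        raw = attrs.get(name)
        if raw is not None and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


def read_attributes(device):
    """Run `smartctl -A` on a device such as /dev/sda (needs root)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_attributes(result.stdout)


# Invented sample rows, only to show the expected shape of the input.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

flagged = suspicious(parse_smart_attributes(SAMPLE))
```

A clean result here does not clear a drive. As this thread shows, a drive can misbehave while its SMART values look normal.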
April 9, 2011 (Author): Sorry for the delayed response; I had to go out of town for a few days and was unable to work on the problem. I will check things out with hdparm and ethtool, but in the meantime a new possibility has arisen. When I got back into town and attempted to boot the system back up, it would just hang repeatedly. After some googling of this forum and others, I started seeing numerous posts about the Seagate 1.5TB drives (the early ones, like mine) being extremely wonky. They don't necessarily fail, but they behave very badly. Even after a firmware update (which I have done), they have a spotty record for reliability. So I disconnected the remaining 1.5TB from the system and it booted right up with no issue. At this point, I'm thinking this unit could potentially be the cause of some of the oddity I'm seeing. After a bit, I was able to get the system to start back up with the drive connected. I'm currently in the process of shuffling the data off of it to other drives manually, but it's taking an extremely long time. Rsyncing an 8GB file from the 1.5TB to another drive is taking nearly an hour per GB. Even with parity calculation, that seems extreme. Any ideas there?
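To put "an hour per GB" into the units that transfer tools usually report, here is the arithmetic as a small sketch. It assumes 1 GB = 1000 MB, and the 20 MB/s comparison rate is a hypothetical figure chosen only for contrast, not a measured or documented unRAID write speed.

```python
def throughput_mb_per_s(size_gb, hours, mb_per_gb=1000):
    """Average transfer rate in MB/s for size_gb copied in `hours`."""
    return size_gb * mb_per_gb / (hours * 3600)


def hours_to_copy(size_gb, rate_mb_per_s, mb_per_gb=1000):
    """Hours needed to copy size_gb at a steady rate in MB/s."""
    return size_gb * mb_per_gb / rate_mb_per_s / 3600


# One GB per hour, as described in the post above.
observed = throughput_mb_per_s(1, 1)          # about 0.28 MB/s

# The 8GB file at the observed rate versus a hypothetical 20 MB/s.
slow_hours = hours_to_copy(8, observed)       # about 8 hours
assumed_hours = hours_to_copy(8, 20)          # about 0.11 hours (under 7 minutes)
```

At roughly 0.28 MB/s the copy is running orders of magnitude below what a SATA drive normally sustains, which fits a drive or file system problem better than parity overhead.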
April 9, 2011 (Author): During my rsync, I checked the syslog and found a whole load of REISERFS warnings about "free space seems wrong" and "invalid format found in block" on md5 (the 1.5TB). I'm currently running a reiserfsck and hoping for the best.
April 10, 2011 (Author): Yep, looks like my suspicions about that 1.5TB drive were correct. The reiserfsck turned up many issues. I performed a tree rebuild and rebooted the server. I then cancelled the parity check and am continuing to move the (hopefully mostly non-corrupted) data off the problem drive and onto others. The reiserfsck also appears to have fixed my uber-slow transfer speed issue. Once this is done, I'll pull the bum drive and recalculate parity. Then I can continue to test for the original issue and see whether it still persists.
April 10, 2011: A file system error does not necessarily mean the drive is failing. Pre-clear it after replacement and see if it passes.
April 12, 2011 (Author): Update: So I moved the data around and pulled the suspect drive. I realize that a few reiser errors do not necessarily mean the drive is messed up, but given the ill reputation these drives have online, I'm not taking chances at the moment. After pulling the drive, I recomputed parity. This completed earlier today. Since then, the array had been working as expected. I was able to copy to/from and delete files over samba without incident. Just now, I attempted to organize some more data, which involved moving files around via samba. Once again, this caused the system to hard-lock. On reboot, it again started a parity check. At this point I took bcbgboy13's advice and reflashed my BIOS with the newest version. Afterwards, I skimmed the startup syslog and found most of the same warnings and errors that were present before. However, the system continues to pass memtests without issue. I just can't for the life of me figure out why this thing is so damned unstable, and why this specific use case causes the failure. Any additional advice is appreciated. I'll likely pursue the hdparm tests next.
April 12, 2011: This does sound pretty weird... I'm wondering if there is a SAFE way to start pulling drives to try to isolate whether it's being caused by the HDs or not... It almost sounds like a heat/component issue rather than a drive issue...
July 30, 2011 (Author): Eventually I did resolve this problem. It would seem that my issues were caused by a number of things. 1) The Seagate 1.5s were definitely problematic. I've yet to use either of them for an application where something strange didn't happen with storage I/O. So pulling them from the system helped immensely. 2) After pulling them, the system would still hard lock, though less frequently. Also, the video card in the system died. I replaced it with an identical card, and it fried again approximately 2 weeks later. At this point, I began to suspect a power issue. Sure enough, I swapped the PSU with another unit and the rest of my issues vanished. The PSU itself seems to function correctly, as I've since placed it in another machine and had no issues. I can only assume that the specific combination of hardware in the UnRAID server was overtaxing it just enough to be problematic. So if you can, try swapping in a better or more powerful PSU. If not, try removing as many power-drawing devices from the machine as possible during testing, then slowly add them back to see if the stability decreases.
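The "remove everything, then add devices back one at a time" approach can be written down as a simple test plan. This is only a sketch of the bookkeeping; the function name and the device labels are placeholders, not a record of what was actually tested in this thread.

```python
def staged_configs(core, optional):
    """Build test stages: start with the core parts, add one device per stage.

    If stage N is stable and stage N+1 is not, the device added in
    stage N+1 (or the extra load it puts on the PSU) is the first suspect.
    """
    stages = [list(core)]
    for device in optional:
        stages.append(stages[-1] + [device])
    return stages


# Placeholder labels; substitute the actual parts in your own machine.
plan = staged_configs(
    core=["board, CPU, RAM"],
    optional=["video card", "drive 1", "drive 2"],
)

for number, parts in enumerate(plan, start=1):
    print(f"Stage {number}: {', '.join(parts)}")
```

Run the same stress at every stage (for example, the large samba copy that triggered the lockups) and give each stage enough time before moving on, since the lockups here were intermittent.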