August 9, 2009
I inserted a new drive to expand my array. The machine booted up fine and I started the rebuild via the web GUI overnight. No problems. I came down in the morning and, after a refresh of the screen, I could no longer access the web GUI. Checked the network: nothing got kicked out, link lights on. The machine is still rebuilding as expected (well, all the lights are still blinking). Hmm... So I hooked up a monitor and keyboard locally and entered my root login and password. The computer displays: Linux 2.6.26.5-unRaid then drops back to login. It's a valid login, because an incorrect one tells me so. I think I'm just going to have to powerfail the array and bring it up again so the other half can watch her shows again, but it certainly seems odd to me. Has anyone had a similar experience? I'm still on 4.4beta 2 as it's been happy.
August 9, 2009
Your description is exactly what happens when you run out of RAM. The kernel out-of-memory processing kills processes it thinks might not be needed, in an attempt to free more memory. Apparently emhttp was one of those killed, and your login shell (when you log in via the console) is being killed soon after you log in.

From what you've described, you've probably had a ton of errors logged to the syslog, filling it and using up all your RAM. From my own experience, it is really hard to regain control other than by rebooting, but then you lose the history of what happened in the syslog. You can try logging in until a session lives long enough for you to look around, but odds are pressing the power button will be your only recourse.

Next time you want to expand the array, you might want to use the preclear_disk.sh script I created. It allows you to pre-test and pre-clear a new drive, so that when you add it to the array the array is only off-line for a few minutes. It would probably have identified a drive with errors before they filled the system log.

Joe L.
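As a rough illustration of the point above (a syslog held in RAM growing until the kernel starts killing processes), here is a minimal shell sketch that reports how large a log file has become and whether it contains traces of out-of-memory kills. The function names are made up for this example, and the search pattern is an assumption: the exact wording of the kernel's OOM messages varies between kernel versions.

```shell
#!/bin/sh
# Sketch only: helper names and the OOM pattern are assumptions, not unRAID features.

# Print the size of a file in whole megabytes.
log_size_mb() {
    bytes=$(wc -c < "$1")
    echo $((bytes / 1024 / 1024))
}

# Count lines that look like OOM-killer activity.
# grep -c prints 0 and returns non-zero when nothing matches, hence "|| true".
count_oom_lines() {
    grep -c -i -E 'oom-killer|out of memory' "$1" || true
}

# Possible use on the console, if a session survives long enough:
#   log_size_mb /var/log/syslog
#   count_oom_lines /var/log/syslog
#   free -m
```

A syslog measured in hundreds of megabytes, together with a non-zero OOM count, would be consistent with the explanation given in the reply.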
August 9, 2009 Author
Ah, it was a bit of a catch-22 because I couldn't get to the syslog to see what was causing the issues. I ended up just cycling the system, and unRAID picked up the process as a straight data rebuild. Nice to have a better idea that what was going on was 'normal' in some sense.

I am a little concerned at the prospect of one of my drives being a drive with errors... ack! Now I need to figure out which is the culprit. I guess I'll run each drive through a round of SpinRite and see what it comes up with. I'm looking into running the monthly parity check script as well as some of the other add-ons for a little more peace of mind.

Your preclear script certainly looks interesting. My expansions are actually 'upgrades' at this point since I'm maxed out, but anything that helps keep the array up is good. Now I just need to wait for this data rebuild to finish so I can see the contents of my flash drive again and install some add-ons.

Thanks for the info, Joe.
August 9, 2009
Running out of RAM is not normal... it is an indication of something very wrong. In my own case, I had a non-standard process that used up all my memory. It was something that I was developing (in other words, it was my own fault I ran it out of RAM). In my case, I was eventually able to log on and regain control.

While the "upgrade" of a disk is in progress, you can monitor the syslog, just in case. On the system console you can type
    tail -f /var/log/syslog
and it will follow the tail end of the syslog. If errors occur, you'll see them. (In the worst case, you can snap a picture and post it if a crash occurs.)

Before doing any "upgrade" of a disk, it is highly recommended that you run a full parity check. It will usually alert you to any disks that are acting up and would interfere with the actual upgrade.

I am a bit confused by your statement that you cannot get access to your server while a data rebuild is in progress... Are you replacing an existing disk, upsizing it? Or are you adding a new disk, and have assigned it to a new "slot" in the array? It is only in the latter case that you are locked out from the server while unRAID clears the new disk. If just upgrading an existing disk, you should still have full access.

Joe L.
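Following the syslog with tail -f can scroll quickly during a rebuild, so one option is to filter the stream down to lines that look like disk or array trouble. The sketch below is only an illustration: the function name is invented, the pattern list is an assumption modeled on kernel messages of the kind posted later in this thread, and --line-buffered is a GNU grep option that may not exist in other grep implementations.

```shell
#!/bin/sh
# Sketch only: the pattern is a guess at "interesting" lines and will need tuning.

# Read log lines on stdin and pass through only those that suggest disk trouble.
filter_disk_errors() {
    grep --line-buffered -E \
        'ata[0-9]+(\.[0-9]+)?: (exception|status:|error:|hard resetting link)|I/O error|md: disk[0-9]+ read error'
}

# Possible use on the system console:
#   tail -f /var/log/syslog | filter_disk_errors
```

Routine lines are dropped, so anything that appears on screen is worth a closer look or a photograph.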
August 9, 2009 Author
Joe L. wrote: "I am a bit confused by your statement that you cannot get access to your server while a data rebuild is in progress... Are you replacing an existing disk, upsizing it? Or are you adding a new disk, and have assigned it to a new "slot" in the array?"

I can get access while the data rebuild is going on; that part's all good. When I go to my 'flash' share, I can see the config and boot directories but no files. Strange, I don't recall it being that way, but since playing with it during a rebuild seems sub-optimal, I thought maybe it was a lockout mechanism. The files are there when I plug my thumb drive into my computer. I don't know that I can address it till tomorrow.
August 9, 2009
You just need to configure Windows Explorer on your PC to show system and hidden files. The files will then show as usual in Explorer when you browse \\tower\flash. It has nothing to do with the "expansion" process.

Joe L.
August 9, 2009 Author
D'oh. I re-installed recently. Onward and upward. I appreciate all the help you give around here.
August 9, 2009 Author
Ah ha! I did indeed catch the system in a kernel panic and out of memory. No further than that, though. Joy. If there's an issue I suspect it may be the new drive, so I'd put the old one back in and give it a whirl. Except that I can't, because now the array needs a 1.5TB drive and the old one was a 750GB. Ugh. This is going to be fun... hahaha
August 10, 2009 Author
So to add to the saga... It looks like Disk 12 in the array is shutting itself down during the rebuild, causing massive failures. (I suspect that's what's taking place, since there is no longer a temperature reading.) Over 6M read errors. I caught the syslog when it was 500MB, right before failure. For completeness:

Aug 10 01:10:51 Media kernel: ata13.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Aug 10 01:10:51 Media kernel: ata13.00: port_status 0x20200000
Aug 10 01:10:51 Media kernel: ata13.00: cmd 25/00:b8:27:bd:18/00:02:36:00:00/e0 tag 0 dma 356352 in
Aug 10 01:10:51 Media kernel: res 71/04:04:9d:00:32/04:00:00:00:00/e0 Emask 0x1 (device error)
Aug 10 01:10:51 Media kernel: ata13.00: status: { DRDY DF ERR }
Aug 10 01:10:51 Media kernel: ata13.00: error: { ABRT }
Aug 10 01:10:51 Media kernel: ata13.00: both IDENTIFYs aborted, assuming NODEV
Aug 10 01:10:51 Media kernel: ata13.00: revalidation failed (errno=-2)
Aug 10 01:10:51 Media kernel: ata13: failed to recover some devices, retrying in 5 secs
Aug 10 01:10:56 Media kernel: ata13: hard resetting link
Aug 10 01:10:56 Media kernel: ata13: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 10 01:10:56 Media kernel: ata13.00: both IDENTIFYs aborted, assuming NODEV
Aug 10 01:10:56 Media kernel: ata13.00: revalidation failed (errno=-2)
Aug 10 01:10:56 Media kernel: ata13: failed to recover some devices, retrying in 5 secs
Aug 10 01:11:01 Media kernel: ata13: hard resetting link
Aug 10 01:11:02 Media kernel: ata13: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 10 01:11:02 Media kernel: ata13.00: both IDENTIFYs aborted, assuming NODEV
Aug 10 01:11:02 Media kernel: ata13.00: revalidation failed (errno=-2)
Aug 10 01:11:02 Media kernel: ata13.00: disabled
Aug 10 01:11:02 Media kernel: ata13: EH complete
Aug 10 01:11:02 Media kernel: sd 13:0:0:0: [sdn] Result: hostbyte=0x04 driverbyte=0x00
Aug 10 01:11:02 Media kernel: end_request: I/O error, dev sdn, sector 907590951
Aug 10 01:11:02 Media kernel: sd 13:0:0:0: [sdn] Result: hostbyte=0x04 driverbyte=0x00
Aug 10 01:11:02 Media kernel: end_request: I/O error, dev sdn, sector 907591647
Aug 10 01:11:02 Media kernel: sd 13:0:0:0: [sdn] Result: hostbyte=0x04 driverbyte=0x00
Aug 10 01:11:02 Media kernel: end_request: I/O error, dev sdn, sector 907592239
Aug 10 01:11:02 Media kernel: md: disk12 read error
Aug 10 01:11:02 Media kernel: handle_stripe read error: 907590888/12, count: 1
Aug 10 01:11:02 Media kernel: md: disk12 read error
Aug 10 01:11:02 Media kernel: handle_stripe read error: 907590896/12, count: 1
Aug 10 01:11:02 Media kernel: md: disk12 read error
Aug 10 01:11:02 Media kernel: handle_stripe read error: 907590904/12, count: 1
Aug 10 01:11:02 Media kernel: md: disk12 read error
Aug 10 01:11:02 Media kernel: handle_stripe read error: 907590912/12, count: 1
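With a syslog of several hundred megabytes, reading it line by line is impractical, so it can help to tally the errors per device first. The following is a minimal sketch that counts "I/O error, dev ..." and "md: diskN read error" messages in a saved copy of the log. The function names are invented for illustration, and grep -o is assumed to be available (it is in GNU grep).

```shell
#!/bin/sh
# Sketch only: tallies two message types seen in the excerpt above.

# Print "<device> <count>" for every device named in an I/O error line.
io_errors_by_dev() {
    grep -o 'I/O error, dev [a-z0-9]*' "$1" | awk '{print $NF}' |
        sort | uniq -c | awk '{print $2, $1}'
}

# Print "<diskN> <count>" for every array disk reporting md read errors.
md_read_errors_by_disk() {
    grep -o -E 'md: disk[0-9]+ read error' "$1" | awk '{print $2}' |
        sort | uniq -c | awk '{print $2, $1}'
}

# Possible use against a saved copy of the log:
#   io_errors_by_dev /boot/syslog.txt
#   md_read_errors_by_disk /boot/syslog.txt
```

The path /boot/syslog.txt is just a placeholder for wherever the log copy was saved. A single device or disk dominating both tallies points at where to look first.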
August 11, 2009
Your first instinct was to run SpinRite on it, but we have been finding that it is best to test for the real problem first. I think I can say that of the hard drive related cases we have seen recently, the majority have actually been interface problems, NOT physical drive problems. Interface problems could include bad cables, bad or loose connectors, bad backplane connections, poorly seated backplane connections, or power problems, including a bad power supply, too many drives on one insufficient rail, bad power connectors, and bad or loose power splitters. SpinRite is a great tool for real drive problems, but it can't do anything for interface problems. I believe some users have RMA'd drives without realizing that there wasn't really anything wrong with the drive itself, just a cabling or connection problem!

Your drive issue is obviously serious enough for the system to disable it, and it was probably causing a huge number of drive errors, which used up a lot of your RAM. But the syslog extract above only shows its last moments, as it is being disabled. What we need to see are the first errors that occurred, to determine what the real issue is. It could very well be a bad cable! My feeling is that a drive that is found on each boot but then is disabled, AND did not make unusual noises, is probably fine but has interface issues.

Please see my sig for the Troubleshooting link, for instructions on capturing and posting your syslog. Also, obtain a SMART report for the drive. From these two reports, we should be able to advise what the problem is and the next steps to take.
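For the SMART report, smartctl -a /dev/sdX prints the full report and smartctl -A /dev/sdX prints only the attribute table. As a rough aid for reading that table, the sketch below pulls out a few attributes commonly used to tell media problems from interface problems. The function name is invented, the attribute names are the usual smartmontools labels but can differ by drive vendor, and the device letter is an assumption, since it can change between boots.

```shell
#!/bin/sh
# Sketch only: attribute names are the common smartmontools labels and may
# differ on a given drive; the raw value is taken from the 10th column.

# Read "smartctl -A" output on stdin and print "<attribute> <raw value>"
# for a handful of attributes.
smart_key_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ {print $2, $10}'
}

# Possible use (replace sdn with the drive's current device name):
#   smartctl -A /dev/sdn | smart_key_attrs
```

Broadly, growing reallocated or pending sector counts suggest the drive surface itself, while a climbing UDMA_CRC_Error_Count is more typical of cabling or connector trouble.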
August 11, 2009 Author
Thanks Rob. All good troubleshooting steps. I did make sure the cables were seated and nothing had shifted as part of my troubleshooting. Seeing the drive temp no longer showing reminded me of the old Seagate firmware issue that caused the drive to randomly spin down. I thought of SpinRite once I realized it was probably a drive issue, in an attempt to recover it enough for the rebuild. (All the drive did to SpinRite was crash it hard, something I hadn't seen before.) I've since gone another direction.

I wasn't aware of smartctl until last night, but I'll be sure to try running it on one of my other boxes. Interesting to me was that the system passed a parity check not 3 days before. I guess it's possible it was on the edge.

What I was able to grab syslog-wise (up until the repetitions started) is below for those interested. Looks like I'll be addressing ata7 next, and using smartctl to start.

Syslog: http://docs.google.com/View?id=dcdwsrd_124ggg44gck
August 13, 2009
Thank you for the syslog. So that really was all of the error messages for sdn (on ata13.00). I was positive there were many more, as those looked like just the tail end of the issue. DF (Device Fault) and "both IDENTIFYs aborted" are firsts for me, obviously a serious loss of contact at the drive level. It was disabled extremely quickly, in less than 12 seconds, which included a pair of 5 second waits. It still maintained the SATA link (at 3.0 Gbps), so the drive was still powered and at least some subsystems were working, but apparently the higher level drive functions had crashed and were not responding. Because the drive seems to be identified and recovered fine on each boot, I like the idea of a firmware crash here as a theory of what may be happening. A firmware upgrade might be a solution.

ata7: SError: { UnrecovData 10B8B Dispar BadCRC Handshk }

This looks more conventional, concerning sdh on ata7.00. We have found this to be characteristic of a bad cable. Replacing the cable with a better quality one has invariably stopped the errors.
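To see which ports are reporting the cable-style signature quoted above, a saved syslog can be searched for SError lines that include BadCRC. This is only a sketch under assumptions: the function name is invented, the pattern is matched against the message format shown in this thread, and a hit is a hint to check the cable, not a diagnosis.

```shell
#!/bin/sh
# Sketch only: lists ata ports whose SError lines mention BadCRC.

# Print "<ataN> <count>" for each port with BadCRC in an SError line.
crc_error_ports() {
    grep -o -E 'ata[0-9]+: SError: \{[^}]*BadCRC[^}]*\}' "$1" |
        cut -d: -f1 | sort | uniq -c | awk '{print $2, $1}'
}

# Possible use against a saved copy of the log:
#   crc_error_ports /boot/syslog.txt
```

As before, /boot/syslog.txt stands in for wherever the log was saved. The port number then has to be matched to a physical drive and cable, for example via the sd lines that follow the port's messages in the same log.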
August 13, 2009 Author
Thanks for looking at that, Rob. Glad to bring something new to the table. Yup, that was it. Really strange. The rest of the syslog just repeated the same error for more than 500 megs, so I truncated it, but there was no other indication. I did try to flash the HDD up to the latest firmware as one of my last troubleshooting steps, since it wasn't on the latest (which could potentially have caused the issue in the first place on a drive of that ilk), but that didn't help.

I often think back to a phrase I once heard... everything that's broke was working yesterday.

I will definitely swap out the SATA cable on ata7 (sdh) this weekend (god knows I have enough of them). Good to have a lead.