hooger Posted February 12, 2013
This is the first time anything like this has happened to me, so please bear with me. I have searched around but am still unsure exactly what to do.
Starting yesterday, I noticed that the monthly parity check reported a large number of errors, in excess of 6000. Thinking that was odd, I searched the forums and found advice to run a non-correcting parity check twice to see whether the errors were the same each time, then to run a memtest, and then other procedures. So I started to do just that.
I was about 70% through my first non-correcting parity check, while watching media streamed from the tower, when my media stopped playing and there was nothing in my TV share. I logged on to the web GUI and noticed that disk6 had been red-balled. This drive (Seagate 3TB) was added recently (2 weeks ago); I precleared it and it passed with no problems.
I've checked all the SATA connections and all the power connections, and I swapped the SATA cable for a brand new one. The drive is still red. I ran a SMART short test, and it failed halfway through, reporting a host interruption. I ran it again, and it completed with no errors. I'm guessing my next step is to run the long SMART test, but after that I don't know what to do.
System Specs:
unRAID Version: unRAID Server Pro, Version 5.0-rc8a
Motherboard: ASUSTeK - P5N-D
Processor: Q6600 (I think; I'm not sure, it doesn't list it here) - 2.666 GHz
Cache:
Memory: 2GB
Network: 1000Mb/s - Full Duplex
Syslog and SMART report added. Link to syslog
EDIT: I realized that in that syslog I forgot to start the array; here's the updated log.
I'm also assuming all data on the 6th drive is lost, right? Seeing as there were a load of parity issues... I'm actually assuming that none of my data is safe.
EDIT: I just noticed that my cache drive is now showing "unformatted"... It's directly connected to the motherboard, and not on the same controller as drive 6. Is there a way to fix that too, or is everything now gone? (I had a lot of program data set up there.)
EDIT2 (RobJ): removed typo in 2nd link
smart.txt
dgaschk Posted February 12, 2013
The updated syslog link does not work. Please zip and attach.
hooger Posted February 12, 2013
Sorry about that. I've attached the log.
A little update: I took the cache drive out and was able to recover the data on it. I replaced it with a new WD Black 640GB drive and put my data back on it. That part is running smoothly; however, my "Downloads" cache-only share no longer works. I can access /mnt/cache/, but this causes problems for some of my programs (SB, SAB, etc.). Any ideas why the Downloads cache-only share doesn't work?
Other than that, I swapped out the power cables and the SATA cables again. Now the drive shows up as orange, so I'm in the process of rebuilding the data. Honestly, I'm not that concerned if the data is lost, as long as the drive is still working.
syslog2.zip
RobJ Posted February 12, 2013
There are a number of problems.
1. I fixed a typo in your 2nd link, but it's always best to attach the entire syslog, zipped.
2. Your attached SMART report is for Disk 5, and looks fine, but is probably not the one you meant to attach.
3. The syslog ends in "kernel: [Hardware Error]: Machine check events logged", which is serious. I think the primary cause is memory, which is also the easiest to test, and almost the easiest to fix. Please run a long memory test. The unRAID boot menu has a good one; let it run for several hours at least, or until errors occur. If memory checks out fine, then there may be another issue with your motherboard, or power issues. I'm not sufficiently experienced in Linux to know where those 'check events' are logged. Hopefully, a real Linux expert will advise you where to look, *before* you next reboot.
4. Your Cache drive has bad sectors at the very beginning, which stops it from mounting. This has to be fixed. I see you have now replaced it.
5. Your Parity drive Disk 6 (3TB drive) is attached to one of the slowest SATA ports you have, on a Silicon Image card, which limits it to UDMA/100 speed at best. Worse yet, it is acting a little jiggly, raising an exception 3 or 4 times, which resulted in the exception handler slowing it to UDMA/66, which will limit its speed to less than half of what it is capable of. Your 3TB drives are your fastest, so reconnect both of them to your fastest onboard ports, and make sure the cables (both ends) are tight. A loose cable *might* be causing the exceptions, or something is wrong with the first port on that card, or this is another indication of power issues. At some point, you should also consider retiring both Silicon Image cards, as they limit the performance of any attached drives.
6. There is no evidence of issues with Disk 6, but for that we would have needed to see the syslog from when the problem occurred. It would have been much better if you had captured it after those errors occurred, before rebooting. At the moment, Disk 6 looks fine.
But since there are indications of either memory or power issues, I don't doubt other problems could occur too. It's absolutely imperative you rule out or fix any memory issues first. Then you may need to try a better power supply.
Edit: I mixed up the 2 3TB drives, Parity and Disk 6. Inexcusable.
RobJ Posted February 12, 2013
hooger wrote:
> A little update: I took the cache drive out and was able to recover the data on it. I replaced it with a new WD Black 640GB drive and put my data back on it. That part is running smoothly; however, my "Downloads" cache-only share no longer works. I can access /mnt/cache/, but this causes problems for some of my programs (SB, SAB, etc.). Any ideas why the Downloads cache-only share doesn't work?
Not sure how you were able to recover data from the old Cache drive. It would not surprise me if, even if you got it to mount, the Reiser file system had been corrupted. That could cause an incomplete recovery, resulting in issues with the contents of the new Cache disk.
hooger Posted February 12, 2013
RobJ wrote:
> Not sure how you were able to recover data from the old Cache drive. It would not surprise me if, even if you got it to mount, the Reiser file system had been corrupted. That could cause an incomplete recovery, resulting in issues with the contents of the new Cache disk.
I unplugged it and installed it in a spare computer. Then, after some searching, I found a tool with a GUI that would let me read the drive and copy the data off. I thought it would be a long shot, but it worked.
EDIT: I should note that the only data I recovered was the config files for SAB, CP, and SB. I tested them on a VM install of the programs, and they seemed to work fine. I haven't copied them to the server yet.
About the parity drive: I just checked the SATA cable, and it is indeed connected to one of the onboard SATA ports. I have two Seagate 3TB drives:
Parity: ST3000DM001-1CH166_W1F16W9X (sdi) 2930266532
Disk 6: ST3000DM001-1CH166_Z1F1QJHT (sda) 2930266532
Disk 6 is plugged into a PCI card, and the parity drive into an onboard port. I know it's not ideal, but I'm saving my money for an upgrade. I can try connecting both 3TB drives to onboard ports.
About the power supply: it is a relatively new Corsair 850W PSU, which I upgraded to a few weeks back from a 400W PSU.
I will run a memtest, but should I wait until unRAID has finished its "Data-Rebuild", or just cancel it now?
And yes, the SMART log was the wrong one. I apologize; it was late and I was tired. I have attached the proper one this time (I hope... hah).
What is the best expansion card on a budget?
smart.txt
dgaschk Posted February 12, 2013
The drive Z1F1QJHT is experiencing power problems. The Power-Off_Retract_Count is about 4 times the Power_Cycle_Count. This could be a bad or loose power connection.
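The comparison above can be repeated on any `smartctl -A` report. Here is a minimal Python sketch, assuming the usual ten-column attribute table. The function names are made up for illustration, and the sample rows use invented raw values (25 and 98), not the numbers from the attached smart.txt. As I understand it, Power-Off_Retract_Count rises when the heads are retracted because power went away, so a count well above the number of power cycles suggests the drive is losing power while the system is still running.

```python
from typing import Dict, Optional

# Illustrative rows in the layout printed by `smartctl -A`. The raw values
# are invented for this example, not copied from the thread's smart.txt.
SAMPLE_REPORT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       25
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       98
"""


def parse_raw_values(report: str) -> Dict[str, int]:
    """Map attribute name -> raw value for every attribute row in the report."""
    values: Dict[str, int] = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry the raw value in
        # the tenth column; the header row and other text are skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def retract_ratio(values: Dict[str, int]) -> Optional[float]:
    """Power-off retracts per power cycle, or None if either count is missing."""
    cycles = values.get("Power_Cycle_Count")
    retracts = values.get("Power-Off_Retract_Count")
    if not cycles or retracts is None:
        return None
    return retracts / cycles


if __name__ == "__main__":
    ratio = retract_ratio(parse_raw_values(SAMPLE_REPORT))
    print("retracts per power cycle:", ratio)
```

To try it on a real drive, paste the attribute table from your own report into `SAMPLE_REPORT`. Where to draw the line is a judgment call; the sketch only computes the ratio.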
hooger Posted February 12, 2013
dgaschk wrote:
> The drive Z1F1QJHT is experiencing power problems. The Power-Off_Retract_Count is about 4 times the Power_Cycle_Count. This could be a bad or loose power connection.
That SMART report was taken after changing the power cable, but once this memtest is done I will change it to another power cable.
EDIT: I have swapped out the power cable for a new, separate one straight from the PSU that isn't powering anything else. I ran the memtest for 4 hours, with 5 passes and no errors. I will run an overnight memtest tonight. I have attached a zip containing another syslog, and SMART reports for my cache, parity, and "red-balled" disk.
hooger Posted February 13, 2013
hooger wrote:
> EDIT: I have swapped out the power cable for a new, separate one straight from the PSU that isn't powering anything else. I ran the memtest for 4 hours, with 5 passes and no errors. I will run an overnight memtest tonight. I have attached a zip containing another syslog, and SMART reports for my cache, parity, and "red-balled" disk.
Apparently you can't attach a file when you click modify. I'm sorry for the double post; please correct me if I'm wrong... I couldn't see an attach button.
EDIT: I just happened to be checking the syslog again (after uploading this one) and noticed this was logged:
2502: Feb 12 20:44:22 Tower kernel: md: using 1536k window, over a total of 2930266532 blocks.
2501: Feb 12 20:44:22 Tower kernel: md: recovery thread rebuilding disk6 ...
2500: Feb 12 20:44:22 Tower kernel: md: recovery thread woken up ...
2499: Feb 12 20:44:22 Tower kernel: mdcmd (43): check NOCORRECT
2498: Feb 12 20:38:29 Tower kernel: [Hardware Error]: Machine check events logged
That hardware error is back.
logs.zip
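For anyone who wants to check a saved syslog for the same message without scrolling through it, here is a small Python sketch. It only does a plain text search for the message quoted in this thread. The function names are hypothetical, and the sample text is the five lines from the post above, put in chronological order with the viewer's line numbers removed.

```python
from typing import List

# The kernel message quoted in this thread.
MCE_MARKER = "[Hardware Error]: Machine check events logged"

# The five lines from the post, oldest first, viewer line numbers removed.
SAMPLE_SYSLOG = """\
Feb 12 20:38:29 Tower kernel: [Hardware Error]: Machine check events logged
Feb 12 20:44:22 Tower kernel: mdcmd (43): check NOCORRECT
Feb 12 20:44:22 Tower kernel: md: recovery thread woken up ...
Feb 12 20:44:22 Tower kernel: md: recovery thread rebuilding disk6 ...
Feb 12 20:44:22 Tower kernel: md: using 1536k window, over a total of 2930266532 blocks.
"""


def find_mce_lines(syslog_text: str) -> List[str]:
    """Return every line that contains the machine check message."""
    return [line.strip() for line in syslog_text.splitlines() if MCE_MARKER in line]


def timestamp_of(line: str) -> str:
    """Classic syslog lines begin with 'Mon DD HH:MM:SS'; return that prefix."""
    return " ".join(line.split()[:3])


if __name__ == "__main__":
    for hit in find_mce_lines(SAMPLE_SYSLOG):
        print(timestamp_of(hit), "->", hit)
```

To scan a real log, read the file into a string and pass it to `find_mce_lines`; I am assuming the log sits at /var/log/syslog on the server. This only tells you when the message appeared, not what the underlying machine check was.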
RobJ Posted February 13, 2013
hooger wrote:
> I unplugged it and installed it in a spare computer. Then, after some searching, I found a tool with a GUI that would let me read the drive and copy the data off. I thought it would be a long shot, but it worked.
> EDIT: I should note that the only data I recovered was the config files for SAB, CP, and SB. I tested them on a VM install of the programs, and they seemed to work fine. I haven't copied them to the server yet.
> About the parity drive: I just checked the SATA cable, and it is indeed connected to one of the onboard SATA ports. I have two Seagate 3TB drives:
> Parity: ST3000DM001-1CH166_W1F16W9X (sdi) 2930266532
> Disk 6: ST3000DM001-1CH166_Z1F1QJHT (sda) 2930266532
> Disk 6 is plugged into a PCI card, and the parity drive into an onboard port. I know it's not ideal, but I'm saving my money for an upgrade. I can try connecting both 3TB drives to onboard ports.
> About the power supply: it is a relatively new Corsair 850W PSU, which I upgraded to a few weeks back from a 400W PSU.
> I will run a memtest, but should I wait until unRAID has finished its "Data-Rebuild", or just cancel it now?
> And yes, the SMART log was the wrong one. I apologize; it was late and I was tired. I have attached the proper one this time (I hope... hah).
> What is the best expansion card on a budget?
No need to apologize; in fact, I'm the one who needs to apologize, for mixing up your 2 3TB drives! I'm sorry. I somehow switched the Parity drive with Disk 6. So there's nothing at issue with the Parity drive, and the exceptions and slowdowns were actually on Disk 6. And in view of dgaschk's comment (very good catch, by the way!), Disk 6 has been having other issues too. Its SMART report looks fine though, apart from the power issue, so Disk 6 is probably a good drive once it's connected to a good port with good power. I see you have done that.
I noticed that the first syslog showed your server had 2GB of RAM, and your second syslog showed it had 4GB. You did not mention anything about increasing the memory, so I have to verify: did you add memory? If not, this is obviously a red flag.
I would not interrupt the rebuild. You are rebuilding Disk 6?
I'm curious what the tool with the GUI was that you used to copy off the data.
As to the "best expansion card on a budget", that's something I would like to learn too!
hooger Posted February 13, 2013
Well, if anything, you made me realize I should have my fastest drives on the onboard SATA ports. So no harm done.
About the RAM: yes, my mistake. I recently received 4GB of higher-quality RAM, so I swapped out the 2GB of RAM for the better sticks.
Yes, unRAID is rebuilding Disk 6. Once it's done, I hope everything will be sorted out.
I'm still unsure about the hardware error... I've Googled around, and the usual answer is that it's a problem with the RAM. So after the rebuild I will run an overnight memtest.
I will edit in a link to the tool, as I'm on my phone right now.
EDIT: This was the post about reading the drive on Windows, which led me to this site: http://yareg.akucom.de/
hooger Posted February 25, 2013
In case anyone was wondering, I seem to have fixed the "[Hardware Error]: Machine check events logged" error. It turns out my motherboard had a VERY old BIOS on it, and it didn't correctly support my CPU. So a simple BIOS upgrade and I was good to go.
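If you suspect the same cause, it helps to know which BIOS the board is currently running before comparing it with the manufacturer's download page. A rough Python sketch follows, assuming the kernel exposes DMI data under /sys/class/dmi/id as most Linux systems do. Whether unRAID 5.0-rc8a populates that directory is an assumption on my part, and the function name is made up for illustration.

```python
from pathlib import Path
from typing import Dict

# Standard Linux sysfs location for DMI/SMBIOS strings. Assumed to be
# present; the function falls back to "unknown" when a file is missing.
DMI_DIR = Path("/sys/class/dmi/id")
FIELDS = ("bios_vendor", "bios_version", "bios_date", "board_name")


def read_dmi(dmi_dir: Path = DMI_DIR) -> Dict[str, str]:
    """Read the BIOS and board strings, one small text file per field."""
    info: Dict[str, str] = {}
    for name in FIELDS:
        try:
            info[name] = (dmi_dir / name).read_text().strip()
        except OSError:
            # Missing file or no permission: report it instead of crashing.
            info[name] = "unknown"
    return info


if __name__ == "__main__":
    for key, value in read_dmi().items():
        print(f"{key}: {value}")
```

The output is only the currently installed version and date; whether that version supports a given CPU still has to be checked against the board maker's CPU support list.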