May 28, 2009
Greetings,
Well, I've received my first red indicator beside a drive. I have included a syslog as well (4.3.3 Pro - setup in sig). I have read the wiki regarding such events. I have tried to run a smartctl test, but it won't run (via mymain or the prompt). I'm not actually sure what I should do now. Can I run a SMART test to verify whether the drive is okay, or whether it's actually dead somehow?
I googled "COMRESET failed (errno=-16)", which seemed to be a recurring issue, but it didn't get me very far. I need some help - thx
May 28, 2009
> I googled "COMRESET failed (errno=-16)", which seemed to be a recurring issue, but it didn't get me very far. I need some help - thx
COMRESET indicates unRAID is attempting to reset the communications from the disk controller to the drive.
The basic procedure for you to follow is in the wiki: http://lime-technology.com/wiki/index.php/Troubleshooting#What_do_I_do_if_I_get_a_red_ball_next_to_a_hard_disk.3F
First, check for a loose cable or connector. If that fails to find anything, and if "smartctl" commands do not show successful output, first try a different SATA cable, then a different SATA connector on the interface card, and a different POWER connector (they fail too... ask me how I know). If the same drive is still failed, you will need to replace it.
Until you have replaced the drive and are back with all green indicators, you are not protected from a second concurrent disk failure. If there are any "critical" files on the failed disk, or on any other disk, make a copy of them on one of the other working disks while waiting for a replacement disk. That way, a second disk failure will not lose a file that is impossible to replace.
Whatever you do, if you replace the drive with a new drive, DO NOT press the "Restore" button... it does not restore data. After you replace the drive, press "Start" to start the array and begin the reconstruction process.
Joe L.
May 28, 2009
I just wanted to make sure that the following was clear (I just reread the wiki and am not sure that it is).
When one drive has a red ball next to it, yet the array is started, unRAID has kicked the "red" drive from the array, but is SIMULATING it using the other array drives + parity. You may or may not notice degraded performance on the simulated disk (probably you would if you were paying close attention), but performance to other drives would not be impacted.
If you were to lose a second drive, the ability to simulate the failed disk would be lost and you would lose data on both failed disks.
While in simulated mode, writes to the failed disk ARE STILL ALLOWED. These writes are simulated using parity and are not actually written to the failed disk.
Even if you correct the underlying problem with the failed disk, it will not magically turn green and be welcomed back into the array automatically. You would either need to rebuild the disk or use the "Trust my parity" procedure.
If you have done writes to the disk while it is simulated, the parity check at the end of the "trust my parity" procedure will find parity sync errors. This is not a bad thing (the parity check will correct these errors), but might be a little scary if you aren't expecting it. Even if you haven't written to the disk there may still be some parity sync issues, especially if the computer crashed when the drive was disabled, which happens sometimes.
Note that if you have written data to the simulated disk, it will be LOST by using the "trust my parity" process. If this is an issue, a rebuild (which preserves data written to the simulated disk) may be more suitable.
If ultimately you determine that the disk has failed (as opposed to a loose cable), you will need to rebuild the disk onto a new physical drive.
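The simulation described above can be illustrated with a toy model. This is only a sketch, assuming single parity computed as a bytewise XOR across the data disks; the disk contents and the helper name `xor_blocks` are made up for illustration and are not unRAID's actual code.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Toy "disks": three data disks plus one parity disk (made-up contents).
disks = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
parity = xor_blocks(disks)

# Disk 1 gets a red ball: its contents are simulated from the others + parity.
failed = 1
survivors = [d for i, d in enumerate(disks) if i != failed]
simulated = xor_blocks(survivors + [parity])
assert simulated == disks[failed]

# A write to the simulated disk only changes parity; the physical disk is untouched.
new_data = b"\xff\xee\xdd"
parity = xor_blocks(survivors + [new_data])
stale_physical = disks[failed]  # what is still on the disconnected drive

# Rebuilding from parity recovers the newly written data...
rebuilt = xor_blocks(survivors + [parity])
# ...while trusting the stale physical disk leaves parity out of sync with it.
in_sync = xor_blocks(survivors + [stale_physical]) == parity
print(rebuilt == new_data, in_sync)
```

In this model a rebuild returns the data written while the disk was simulated, whereas putting the old physical disk back leaves a mismatch that a correcting parity check would then overwrite, which is the data-loss case described above.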
May 28, 2009
Something very serious happened at May 28 19:44:52, and it caused a brief loss of communications to 7 drives! The system then quickly began recovery efforts, and was successful with 6 of them, but could not regain contact with Disk 12 (sdj, ata9.00, ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT****127), and disabled it, which is unfortunately fatal. All further efforts just produced errors, and the drive was red balled.
May 28 19:46:03 Tower kernel: ata9: reset failed, giving up
May 28 19:46:03 Tower kernel: ata9.00: disabled
Earlier during boot and setup, all 16 drives were identified and set up with no issues at all, at maximum speeds, so there is no hint of communications difficulties before that event at 19:44:52. That means your drives should be fine, or at least they were fine, just before this.
The only causes I can think of are that the system was dropped or kicked (!), which you would probably have mentioned, or that you were hit with a very serious electrical spike or brief outage. Was there a lightning strike nearby?
The advice given above should get you back up and running. You will want to test Disk 12, once you have regained access to it. And if it was a bad spike, you should probably test ALL of your drives, and run a parity check, to see if other data corruption could have occurred. The latest unRAID version has a read-only parity check option, which might be a good choice here, since if a data drive received a corrupting spike, the parity info could be more correct than the data drive's data. You would need to upgrade to v4.5-beta6 first, and follow the specific instructions for a read-only parity check (can't remember them myself!).
I think I'll add one last recommendation: do not under any circumstances write to Disk 12. It can only complicate recovery efforts, especially if another drive has spike-related issues.
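For anyone wanting to repeat this kind of syslog reading on their own log, here is a minimal sketch that groups kernel ATA messages by port. The regular expression and the function name `group_ata_messages` are assumptions for illustration, and the only sample lines used are the two quoted above.

```python
import re
from collections import defaultdict

# Matches kernel ATA lines such as "May 28 19:46:03 Tower kernel: ata9.00: disabled".
ATA_LINE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"(?P<port>ata\d+)(?:\.\d+)?:\s+(?P<msg>.*)$"
)

def group_ata_messages(lines):
    """Return {port: [(timestamp, message), ...]} for ATA kernel messages."""
    by_port = defaultdict(list)
    for line in lines:
        m = ATA_LINE.match(line.strip())
        if m:
            by_port[m["port"]].append((m["when"], m["msg"]))
    return dict(by_port)

sample = [
    "May 28 19:46:03 Tower kernel: ata9: reset failed, giving up",
    "May 28 19:46:03 Tower kernel: ata9.00: disabled",
]
events = group_ata_messages(sample)
for port, msgs in events.items():
    print(port, msgs)
```

Run over a full syslog, a grouping like this makes it easier to see how many ports reported trouble at the same timestamp.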
May 28, 2009 (Author)
Thx for all the help. So if it's a loose cable, then it won't automatically recognise it as fixed when it starts up again?
> Something very serious happened at May 28 19:44:52, and it caused a brief loss of communications to 7 drives! The system then quickly began recovery efforts, and was successful with 6 of them, but could not regain contact with Disk 12 (sdj, ata9.00, ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT****127), and disabled it, which is unfortunately fatal. All further efforts just produced errors, and the drive was red balled.
Nothing obvious took place. We were watching something and the movie paused for a second or two, then continued. The movie wasn't even on that disk. The server is powered through a UPS and a separate surge protector. There were no lightning strikes or noticeable spikes. I upgraded a drive to a larger one last week, so perhaps a cable has been slightly dislodged?
I think I'll wait for a replacement drive to be shipped before I proceed - if all goes well, it should arrive tomorrow.
Last thing - I could NOT run a smart test. When I tried, it failed. I guess I need to get comms back with the drive before I do this. I think I'll wait for a new drive to arrive and replace this one - then do tests on it after it's out.
May 28, 2009
> Thx for all the help. So if it's a loose cable, then it won't automatically recognise it as fixed when it starts up again?
Nope... once a disk is marked as bad, the only way to get it back online is to un-assign it, reboot, and then re-assign it. Just use the "Start" button and not the "Restore" button... because unless you are using the "Trust my ..." procedure, you can lose data if the disk is really bad.
> Last thing - I could NOT run a smart test. When I tried, it failed. I guess I need to get comms back with the drive before I do this. I think I'll wait for a new drive to arrive and replace this one - then do tests on it after it's out.
If the disk failed, or the connector came loose, then smartctl commands will fail... In either case, something caused a major lockup of many of your disks, and the disk controller had to re-establish communications with all of them when the failure occurred. It was able to reset all but the one.
When you get the new drive, install it in place of the failed drive and then use the "Start" button to rebuild the old disk contents onto the new one. At a later time you can try the old drive on a different port and see if it responds at all. If yes, great... if not, RMA it.
Joe L.
May 28, 2009
I would like to re-emphasize that this was a 7 drive problem, not just a single drive problem. All of these drives were working perfectly until this event at 19:44:52, then 7 different drives on 3 different controllers, WITHIN THE SAME SECOND, reported communications-related difficulties! The affected drives are sdc, sdf, sdg, sdh, sdi, sdj, and sdn.
From your additional comments, and another look at the error types, I'm beginning to lean more toward a physical jarring event, rather than an electrical one. Some of the errors are PHY changes, which I understand are like hotplug events, as if the drive was temporarily dislodged from its slot (or briefly lost power). And the fact that one of them (sdj) still can't be accessed also seems like it may be a physical event, as if the drive was jarred loose. Many of the drives were issued hard resets, and responded, while sdj did not respond, and perhaps did not even receive the hard reset. It would not have surprised me if 3 more drives had been unrecoverable.
Is it possible that your server is in another room, and someone (perhaps a kid) was playing ball, and the ball slammed the server? Or perhaps there is another reason for a hard jarring of the server.
You should open it up and check all cable connections, and especially any backplane connections. Make sure all drives are still firmly seated, cables too.
May 29, 2009 (Author)
> Nope... once a disk is marked as bad, only way to get it back online is to un-assign-it, reboot, and then re-assign it.
1 - What happens to parity if I unassign the drive, then re-assign it, and it is actually dead? (I'm worried about losing the parity info and being unable to reconstruct the drive.)
2 - If I unassign the drive, am I still covered against a further drive failure, or do I have to re-calculate parity? (As it is, with the drive having a red indicator, I assume a further drive failure will result in losing info on both drives?)
3 - If I unassign the suspect drive, can I still re-construct it with a replacement?
4 - If it was a cable connection and I rectify it, do I have to unassign it to run a smart test?
> Is it possible that your server is in another room, and someone (perhaps a kid) was playing ball, and the ball slammed the server? Or perhaps there is another reason for a hard jarring of the server. You should open it up and check all cable connections, and especially any backplane connections. Make sure all drives are still firmly seated, cables too.
You pose some scary things. With no kids and the server in a secure position (no physical impact short of an earthquake), I'm not sure what could have caused such a thing. It's got me concerned. I'm about to open it up and check all the connections. I have a replacement drive on the way - should be here tomorrow hopefully.
Thx
May 29, 2009 (Author)
I have checked all the cables, replaced the one to disk12 and rebooted the server. Disk12 (red indicated) is now available via a smart test. Here are the results, along with a syslog after the reboot. There don't appear to be any errors, and the smart test says the drive passes.
If the drive appears fine, should I use the "Trust my Array" procedure, or unassign/reassign and reconstruct? My parity was calculated 5 days ago.
Thx
May 29, 2009
> 1 - What happens to parity if I unassign the drive then re-assign the drive and it is actually dead? (I'm worried about losing the parity info and being unable to reconstruct the drive.)
An un-assigned drive is exactly the same as a failed drive, as far as unRAID is concerned. If the drive is really dead, it will not be a candidate for rebuilding, the indicator will still be red, and your parity data will still be intact, even if you re-assign it (if you even can... it might not even be recognized by the BIOS if it is really dead).
> 2 - If I unassign the drive, am I still covered by a further drive failure, or do I have to re-calculate parity? (As it is with the drive having a red indicator, I assume a further drive failure will result in losing info on both drives?)
No need to re-calculate parity if you un-assign the failed drive. And no, an un-assigned drive does not protect you from a subsequent failure on a different drive. To get that protection you would need to re-calculate parity without the failed drive, and to remove a drive from the array and re-compute parity without it you would need to press the dreaded button labeled "Restore". That sets a new configuration based on the currently assigned and working drives and then computes parity with those drives (throwing away any parity from any prior drive configuration, along with the ability to recover any data, until parity is calculated on the remaining working drives). You DO NOT want to do this, as you do not want to lose the ability to restore data from your failed drive... So... do NOT press "Restore"...
> 3 - If I unassign the suspect drive, can I still re-construct it with a replacement?
Yes, you can, once you install it and then assign it as a replacement to the failed slot in the array...
> 4 - If it was a cable connection and I rectify it, do I have to unassign it to run a smart test?
No, you can run smartctl on a drive, regardless of whether it is assigned or not, as long as you know the /dev/sd? (or /dev/hd?) device name of the drive. You will need to un-assign, reboot, and then re-assign to get it to rebuild on the drive... or, if all the drives used when you last calculated parity are good, you can try the "trust my ..." procedure as described in the wiki. But don't do that just yet.
> You pose some scary things. With no kids and the server in a secure position (no physical impact short of an earthquake) I'm not sure what could have caused such a thing. It's got me concerned. I'm about to open it up and check all the connections.
It could have been thermal expansion: everything "shifted" as it expanded, and something lost its connection momentarily. With as much as happened, is it possible that the drives involved were all in one portion of the case, or on one series of power splitters? RobJ is trying to identify a common event that might have affected all the drives involved. Any ideas you might have, or a description of the physical arrangement of drives in the enclosure, might give a clue. Are the 7 drives all in a common cage assembly?
Joe L.
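As a rough illustration of the point above that smartctl only needs the device name, here is a small sketch that builds and optionally runs a smartctl command. The helper names are hypothetical; `-H`, `-a` and `-t short` are standard smartctl options, and `/dev/sdj` is simply the name Disk 12 had in the syslog above (it may differ after a reboot or a move to another port).

```python
import shutil
import subprocess

def smartctl_command(device, action="health"):
    """Build a smartctl command line for a device such as /dev/sdj."""
    flags = {
        "health": ["-H"],               # overall health self-assessment
        "all": ["-a"],                  # full SMART report
        "short_test": ["-t", "short"],  # start a short self-test
    }
    return ["smartctl", *flags[action], device]

def run_smartctl(device, action="health"):
    """Run smartctl if it is installed; return its output, or None if it is not."""
    if shutil.which("smartctl") is None:
        return None
    result = subprocess.run(
        smartctl_command(device, action), capture_output=True, text=True
    )
    return result.stdout

# Only print the command here; actually running it needs root and a real drive.
print(smartctl_command("/dev/sdj", "all"))
```

If the drive or its cable is not responding, the real command can be expected to fail or report that it cannot open the device, which matches what was seen earlier in the thread.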
May 29, 2009
> If the drive appears fine, should I use the "Trust my Array" procedure, or unassign/reassign and reconstruct. My parity was calculated 5 days ago.
It is good you know your parity was calculated recently...
It is your choice... I would use the "Trust my parity" procedure, and set disk slot 99 as invalid... But you could just as easily let it rebuild the drive.
Be absolutely certain to follow the steps as described in the wiki. If you press "Restore" and do not get the proper command entered at the command line to set slot 99 as not valid, you will lose your parity data, and your ability to restore data from parity and the other data drives.
As long as you type the "mdcmd set invalidslot 99" command and get back the expected response before pressing "Start", you should be fine. When the array comes back online after you press "Start", it will begin a full parity check. I'd let it go to completion.
Joe L.
May 29, 2009 (Author)
Thx Joe - really helpful answers. I've learned a lot about how these things actually work.
At this point, I think I will unassign/reassign the drive and let it rebuild. It seems easier to me (and less scary) than that "Trust my array" procedure. Is one better than the other?
What happens if I use "rebuild" and the drive fails during that process? Am I still protected, or is current parity lost when the rebuild process is started (thereby re-calculating it)?
Does the smart test result data look good? Should I trust the drive?
I looked carefully at the HDD setup. There doesn't seem to be any common connection between the HDDs. They all reside in different places within the server (no backplanes). They are bunched (well, mostly) in groups of 3 within a TT icage that has its own fan (each drive is screwed in with 4 screws). I tried to separate them between different power connectors when I built it. Disk12 (red), along with D11 and D13, is well away from the others, located toward the rear of the server. When I pulled the SATA cable from the TX4 card, I thought it seemed slightly dislodged, but then again, SATA cables feel that way anyway. Currently it is winter here in NZ, so the drive temps aren't that high either.
It really is a mystery. My machine has been very reliable, so I hope I don't suddenly start having issues. I wish I KNEW what caused it so I could sleep a little easier.
Thx again.
May 29, 2009
Your syslog looks great, not even any transactions replayed, and that SMART report is fine too. The system looks so good now, that I would definitely run the Trust My Array procedure. It is quick, easy, and safe when used in situations like this. You will be back to normal in just a few minutes, with a parity check running, which you should probably allow to complete.
May 29, 2009
> At this point, I think I will unassign/reassign the drive and let it rebuild. It seems easier to me (and less scary) than that "Trust my array" procedure. Is one better than the other?
The risk you take in letting it rebuild is that for the duration of the rebuilding process you are unprotected from a second concurrent failure of any one of the other disks in the array. If it takes 12 hours to rebuild, it will be 12 hours before you have parity protection of all your data.
If you use the trust-my-parity procedure, you will only be without parity protection for a few minutes (once you start the array you will be protected from a subsequent drive failure). For that reason, your data is less at risk of loss from another disk failure using the trust-my-parity procedure.
> What happens if I use "rebuild" and the drive fails during that process? Am I still protected, or is current parity lost when the rebuild process is started (thereby re-calculating it)?
If the same drive fails, no problem... your parity data is still available to rebuild the next drive you put into place as a replacement. If a different drive fails, then you lose the data on both drives.
> Does the smart test result data look good? Should I trust the drive?
It looks good. I'd trust it.
> It really is a mystery. My machine has been very reliable, so I hope I don't suddenly start having issues. I wish I KNEW what caused it so I could sleep a little easier
I know how you feel... I had an intermittent "Y" power splitter in my array. Caused me some hair loss until I found it.
Joe L.
May 29, 2009 (Author)
Thx guys. I'll give the "Trust my Array" process a whirl. A few minutes sounds better than the 18 hours my system takes.
Just out of interest, can you use the server during a data rebuild? Should you use it? What happens if a power outage happens during a rebuild? (I have a UPS, but it is not going to power the server much past 30 mins.)
Update - Okay, just did the above and everything seemed to go fine. unRAID is now performing a parity check. The wiki makes a comment about sync errors. I did get some (59), but as the check progresses no more are showing up. The wiki does say that by the time they are reported they have already been corrected. Seeing as I got some parity errors, do I need to run another parity check at the completion of this one?
Update 2 - Got up this morning to find the server had completed its parity check and everything seems fine. unMenu reports "Parity updated 59 times to address sync errors." None of the drives showed any errors.
Thx again for the help.
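The sync-error behaviour reported above can be mimicked with a toy parity check. This is a sketch only, assuming XOR parity and counting mismatches per byte rather than per disk block as a real array would; `parity_check` and the sample bytes are invented for illustration.

```python
def parity_check(data_disks, parity, correct=True):
    """Compare stored parity with the XOR of the data disks.

    Returns (parity_after_check, error_count). With correct=False the
    check is read-only and the stored parity is returned unchanged.
    """
    fixed = bytearray(parity)
    errors = 0
    for pos, column in enumerate(zip(*data_disks)):
        expected = 0
        for byte in column:
            expected ^= byte
        if fixed[pos] != expected:
            errors += 1
            if correct:
                fixed[pos] = expected
    return bytes(fixed), errors

data_disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"]
stale_parity = b"\x11\x00\x33\x00"  # positions 1 and 3 are out of sync

new_parity, first_pass = parity_check(data_disks, stale_parity)
_, second_pass = parity_check(data_disks, new_parity)
print(first_pass, second_pass)
```

In this model the first pass counts and fixes the mismatches and a second pass finds none, which is consistent with the wiki's remark that errors have already been corrected by the time they are reported. The `correct=False` path corresponds to the read-only style of check mentioned earlier in the thread.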