December 8, 2009

Last night I upgraded disk 7, replacing a 1TB Seagate with a 1.5TB WD. The rebuild seemed to go fine and the web status page said the parity was rebuilt without error. This took from about 20:48 to 05:08.

Tonight I tried to access files on one of the shares but the directory was showing as empty, so I looked at the status page and it showed two drives (disk1 and disk4) had not spun up. I hit the spin up button and after a while the green ball appeared beside both of these, but the drive temperature was not showing. It was then I noticed the 14929 errors on both of these drives.

The syslog shows that there was some sort of link issue with these two drives (which I think are ata2 and ata8) during the rebuild at around 01:48, and then the rebuild finished at 05:08. A few hours later, at 08:12, these two drives start reporting errors until about 09:32, when the errors stop; they don't restart until I tried to access the share at about 19:01.

At first I thought it might be a problem with one of my 2-port SIL PCIe SATA cards, but one of the problem drives is attached to the motherboard. So what's the next step?

I've split the log in two parts; the first part is attached.

Regards,
Stephen
December 8, 2009 (Author)

I am wondering if it might be a power issue: both of these drives are connected to the same SATA power cable from the power supply, while the other drives, which are not affected, are connected to a different cable.

Stephen
December 8, 2009

I am at work right now, but when I get home I will take a look at the syslog parts and see if I can come up with anything.
December 8, 2009 (Author)

The unRAID status page looks like this:

parity  ata-WDC_WD15EADS-00P8B0_WD-WMAVU0375809  *  1,465,138,552  -            3,947,917  13,713     0
disk1   ata-WDC_WD10EADS-65L5B1_WD-WCAU4A880139  *  976,762,552    406,358,168  3,947,917  13,713     15,036
disk2   Not installed
disk3   ata-WDC_WD10EADS-65L5B1_WD-WCAU4A878787  *  976,762,552    261,262,896  2,591,926  6          0
disk4   ata-WDC_WD10EADS-00L5B1_WD-WCAU4A158377  *  976,762,552    356,381,024  2,591,926  6          15,036
disk5   Not installed
disk6   ata-WDC_WD10EACS-00ZJB0_WD-WCASJ2064707  *  976,762,552    3,448,472    2,604,716  6          0
disk7   ata-WDC_WD15EADS-00P8B0_WD-WCAVU0359932  *  1,465,138,552  487,753,120  28,297     4,087,329  0
disk8   ata-WDC_WD15EADS-00P8B0_WD-WCAVU0266415  *  1,465,138,552  661,921,420  3,953,596  6          0
disk9   ata-WDC_WD10EADS-00L5B1_WD-WCAU49319526  *  976,762,552    447,812      2,615,326  6          0

Note that disk7, with 4,087,329 writes (the second-to-last column), is the drive that was just upgraded. The only other drives with a significant number of writes are disk1 and parity; they both have 13,713 writes.

Stephen
December 8, 2009

I just had a drive go offline because the reallocated sector count was rising very fast. It went from 5 to over 2,000 in about 3 days. Thankfully Seagate is replacing the replacement at no charge and I should get the new drive in a couple of days.

See if you can get SMART reports on the drives and see what happens. I ran SMART on my one drive this morning before shutting down the server and it flat out FAILED the SMART test.
December 8, 2009 (Author)

> I just had a drive go offline because the reallocated sector count was rising very fast. It went from 5 to over 2,000 in about 3 days. Thankfully Seagate is replacing the replacement at no charge and I should get the new drive in a couple of days.
> See if you can get SMART reports on the drives and see what happens. I ran SMART on my one drive this morning before shutting down the server and it flat out FAILED the SMART test.

I did try to get a SMART report last night from unmenu, but I got back an error that the device was not available or something similar. So I'm guessing that unRAID has shut down those two devices (sda and sdg) completely?

At this point I have not stopped the array or rebooted. The only things I have tried are to spin up the drives (they have not come up) and to look at files on the working drives (which seem ok). Once I've done a reboot I'll probably run SMART tests and reiserfs checks before bringing the array online.

My current plan (apart from waiting for more suggestions) is to:

1. shut down unRAID
2. power down
3. check the cables on the affected drives (I think there is a power splitter in use here, so I might change or remove it)
4. power up
5. see if I can see the missing drives
6. run SMART tests on the missing drives
7. run the reiserfs checks on the missing drives
8. if 5-7 are ok then try starting the array

Stephen
December 8, 2009 (Author)

One thing that is bothering me about this is that the unRAID main status page is still showing green lights on all of the drives, despite two of them not really working. The only indications that anything is wrong are:

1. the drive temperatures for the two bad drives are missing
2. the error count for the two drives is now non-zero

Since these are not colour-coded, one can easily miss them in a casual glance (i.e. bring up the page, check all the lights are green and then assume all is ok).

Stephen
December 8, 2009

You can try getting a SMART report manually (not through unmenu). I don't have a direct link at hand but you can probably find directions in the FAQ or Topical Index.
December 9, 2009 (Author)

> You can try getting a SMART report manually (not through unmenu). I don't have a direct link at hand but you can probably find directions in the FAQ or Topical Index.

After I rebooted I was able to run short SMART tests and the drives appear ok. So it appears that the drivers got completely disabled.

Stephen
December 9, 2009 (Author)

I was able to shut down the system. After I hit the "stop" button unRAID finally changed the status lights to red and marked the two misbehaving drives as missing. Why didn't it do that while the system was running?

I then powered off and checked the cables. The two drives that had stopped working were both on the same molex to SATA power splitter, so I replaced that with a new one, reseated the SATA data cables and started the system. I ran a memory test pass and all was fine, so I let the system boot. The array started without incident and all drives appear fine.

The only indication of past troubles in the system log is this:

Dec 9 07:01:10 saturn kernel: ReiserFS: md6: checking transaction log (md6)
Dec 9 07:01:10 saturn kernel: ReiserFS: md1: replayed 2 transactions in 0 seconds
Dec 9 07:01:10 saturn kernel: ReiserFS: md8: Using r5 hash to sort names
Dec 9 07:01:10 saturn kernel: ReiserFS: md4: replayed 2 transactions in 0 seconds

where ReiserFS reports replaying 2 transactions on the two drives that had issues (md1 and md4).

I've attached the complete syslog for future reference.

Stephen
December 9, 2009

>> You can try getting a SMART report manually (not through unmenu). I don't have a direct link at hand but you can probably find directions in the FAQ or Topical Index.
> After I rebooted I was able to run short SMART tests and the drives appear ok. So it appears that the drivers got completely disabled.
> Stephen

I would still check the power cable, especially if they share a common cable as you suspected. A loose power cable that makes intermittent contact might only work at times... and it WILL cause hair-loss. (You'll be pulling out your hair trying to get to your data.)

Joe L.
December 9, 2009

> I was able to shut down the system. After I hit the "stop" button unRAID finally changed the status lights to red and marked the two misbehaving drives as missing. Why didn't it do that while the system was running?

Unless you press the "refresh" button on your browser, the web-management display is not updated. It probably would have shown the status as off-line if you had.

Joe L.
December 9, 2009

> I would still check the power cable, especially if they share a common cable as you suspected. A loose power cable that makes intermittent contact might only work at times... and it WILL cause hair-loss. (You'll be pulling out your hair trying to get to your data.)
> Joe L.

I agree with this statement 100%. It would be interesting to see the SMART reports for the 2 drives also, to make sure something else did not cause the problem. I would also watch the GUI closely to make sure nothing else happens.

A long test on the drives is probably worth doing just to make sure nothing is wrong with them. Get the SMART report before, then run the test, and then get the SMART report again. You can then compare the results and see if anything has changed.
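The before/after comparison suggested above can be sketched in Python. This is only an illustration, not anything from unRAID or unmenu: it assumes the two reports were saved as text in the attribute-table layout printed by smartmontools (`smartctl -A /dev/sdX`), and the function names and sample values are made up for the example.

```python
from typing import Dict, Tuple

def parse_smart_attributes(report: str) -> Dict[str, int]:
    """Map attribute name -> raw value from a smartctl-style attribute table."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row and blank lines are skipped by this check.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs

def changed_attributes(before: str, after: str) -> Dict[str, Tuple[int, int]]:
    """Return {name: (old_raw, new_raw)} for attributes whose raw value changed."""
    old = parse_smart_attributes(before)
    new = parse_smart_attributes(after)
    return {n: (old[n], new[n]) for n in old if n in new and old[n] != new[n]}

# Made-up sample reports (illustrative values only, not from this thread's drives).
HEADER = "ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE\n"
SAMPLE_BEFORE = HEADER + (
    "  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 5\n"
    "194 Temperature_Celsius   0x0022 118 110 000 Old_age  Always - 32\n"
)
SAMPLE_AFTER = HEADER + (
    "  5 Reallocated_Sector_Ct 0x0033 180 180 140 Pre-fail Always - 2048\n"
    "194 Temperature_Celsius   0x0022 118 110 000 Old_age  Always - 32\n"
)

if __name__ == "__main__":
    for name, (old, new) in changed_attributes(SAMPLE_BEFORE, SAMPLE_AFTER).items():
        print(f"{name}: {old} -> {new}")
```

Only raw values are compared here; a rising raw count on an attribute such as Reallocated_Sector_Ct is the kind of change worth a closer look.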
December 9, 2009 (Author)

>> I was able to shut down the system. After I hit the "stop" button unRAID finally changed the status lights to red and marked the two misbehaving drives as missing. Why didn't it do that while the system was running?
> Unless you press the "refresh" button on your browser, the web-management display is not updated. It probably would have shown the status as off-line if you had.
> Joe L.

Actually I did press the browser's refresh button many times, as well as the refresh button within the page, and neither caused the green lights to go red or the text to change to say "missing disk". The contents of the page were changing, because things like the drive temperatures and transfer counts changed.

I also tried holding down the shift key while clicking the browser's refresh button and that had no effect (usually this causes the browser to regrab a fresh copy of a page rather than using the cached copy).

Stephen
December 9, 2009 (Author)

>> I would still check the power cable, especially if they share a common cable as you suspected. A loose power cable that makes intermittent contact might only work at times... and it WILL cause hair-loss. (You'll be pulling out your hair trying to get to your data.)
>> Joe L.
> I agree with this statement 100%. It would be interesting to see the SMART reports for the 2 drives also, to make sure something else did not cause the problem. I would also watch the GUI closely to make sure nothing else happens.
> A long test on the drives is probably worth doing just to make sure nothing is wrong with them. Get the SMART report before, then run the test, and then get the SMART report again. You can then compare the results and see if anything has changed.

Yes, my chief suspect was the power cable as well. The main reasons for this were:

1. I had just done a drive upgrade, so that meant I had done some mechanical work in the case, so I might have loosened something
2. Both failed drives were on the same power splitter and were, in fact, the only drives on that particular power cable from the power supply (the reason I needed the splitter was really to convert from molex to SATA power connectors, as I had run out of SATA power connectors)

When I had the case open and powered up I wiggled the power splitter and heard the two drives make some noise, so I'm thinking that there's a loose contact in the splitter. After powering down I replaced the splitter with a new one.

Stephen
December 10, 2009

I think you have correctly figured out the problem; it seems consistent with the syslog. All you really need to see in the syslog are the following 2 lines, which occur rather early. After that, all other errors can be ignored. After being disabled, sdb and sdg are no longer accessible, at all, until after the next reboot. And there was nothing you could do about it *but* reboot. The drives themselves are probably completely fine.

Dec 7 08:13:17 saturn kernel: ata2.00: disabled
Dec 7 08:13:18 saturn kernel: ata8.00: disabled
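For anyone facing a similarly huge syslog, a small Python sketch like the following could pull out just those "disabled" lines. The regular expression is an assumption based only on the two lines quoted above, and the function name is made up for the example; other kernel versions may word the message differently.

```python
import re

# Pattern inferred from the two syslog lines quoted in this thread (an assumption).
DISABLED_RE = re.compile(r"kernel: (ata\d+)\.\d+: disabled")

def disabled_ports(syslog_lines):
    """Return ATA port names (e.g. 'ata2') reported as disabled, in first-seen order."""
    ports = []
    for line in syslog_lines:
        match = DISABLED_RE.search(line)
        if match and match.group(1) not in ports:
            ports.append(match.group(1))
    return ports

# Sample input copied from the lines posted in this thread.
SAMPLE = [
    "Dec 7 08:13:17 saturn kernel: ata2.00: disabled",
    "Dec 7 08:13:18 saturn kernel: ata8.00: disabled",
    "Dec 9 07:01:10 saturn kernel: ReiserFS: md6: checking transaction log (md6)",
]

if __name__ == "__main__":
    print(disabled_ports(SAMPLE))
```

To run it against a real log, pass an open file instead of the sample list, e.g. `disabled_ports(open("syslog"))`, where the file name is whatever your saved syslog is called.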
December 10, 2009 (Author)

> I think you have correctly figured out the problem; it seems consistent with the syslog. All you really need to see in the syslog are the following 2 lines, which occur rather early. After that, all other errors can be ignored. After being disabled, sdb and sdg are no longer accessible, at all, until after the next reboot. And there was nothing you could do about it *but* reboot. The drives themselves are probably completely fine.
> Dec 7 08:13:17 saturn kernel: ata2.00: disabled
> Dec 7 08:13:18 saturn kernel: ata8.00: disabled

Ah, I see them now. So upon running into a situation (like power loss) where the drive will not respond, the driver for that drive will be taken off line, which prevents anything from modifying the drive?

Any thoughts on why the web status page still kept showing the drives as ok (i.e. green lights) and did not change to red until I hit the "stop" button?

Stephen
December 10, 2009

The driver is not taken offline, but the device is marked as disabled. The exception handler continues trying and later actually recognizes drives at both of these device 'slots', and sets them up as sdj and sdk, device IDs that unRAID knows nothing about. unRAID is at a higher level, and only knows sdb and sdg, and does not even know that both of them are gone. I did not examine any further in that huge syslog, but usually those devices will continue to be recovered and lost multiple times.

This is a real problem. Good software development practices often include creating independent objects and separation between them, by both encapsulating devices and services, and creating layers within protocols. The problem is that communication sometimes suffers, so that one object never learns important details of other objects.

In this case, it is the kernel and its 'driver' modules that are dealing directly with the hardware, and are using the exception handler mechanisms to deal with issues. unRAID (or actually the high level I/O system) sits somewhat higher in the hierarchy, and knows nothing of the exceptions or actions taken or errors returned. It only knows what is finally returned to it, either the correct result of a request (such as a requested block to read), or a generic error return (such as a read error). And that is why drives remain inexplicably green, until an operation is finally attempted that discloses the true status of the drive.

I think at some point, it would be good for Tom to add additional error detection at the lower level where the unRAID modules intercept the I/O of array drives. There must be some way to detect drive disabling, even if it is as crude as tailing the syslog when errors are returned, and analyzing for relevant exception lines. This will be tricky, and includes the complexity of communicating in a new way back to the higher level I/O routines. Perhaps it will entail creating an advanced exception handler that encapsulates the current exception handlers.
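As a rough illustration of the mismatch described above (the array expecting sdb and sdg while the kernel now presents sdj and sdk), here is a Python sketch that compares a list of expected device names against what the kernel currently exposes. It is not how unRAID works internally; it assumes a Linux system where block devices appear as entries under /sys/block, and the function name and the expected-device list are hypothetical.

```python
import os

def device_drift(expected, block_dir="/sys/block"):
    """Compare expected sdX names with the sdX entries present in block_dir.

    Returns (missing, unexpected) as two sorted lists.
    """
    present = {name for name in os.listdir(block_dir) if name.startswith("sd")}
    wanted = set(expected)
    return sorted(wanted - present), sorted(present - wanted)

if __name__ == "__main__":
    # Hypothetical list of devices the array was started with.
    missing, unexpected = device_drift(["sda", "sdb", "sdg"])
    if missing:
        print("Expected devices no longer present:", ", ".join(missing))
    if unexpected:
        print("Devices present but not expected:", ", ".join(unexpected))
```

A check like this only notices that a device name has disappeared or a new one has appeared; it does not say why, so the syslog is still where the actual cause would be found.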