June 19, 2009
So I don't even know where to begin, since it's been a long couple of weeks. My Unraid array had been running like clockwork for the last 9 months or so since I purchased the product. I've added additional disks and eventually got up to 14 drives total in my system. This was a large mixture of drives from 1.5TB down to 400GB, from all different manufacturers. There was the occasional hiccup but things worked rather well.

Now we go to 3-4 weeks ago, when I decided to purchase two new 1.5TB drives. I inserted them into my server and then did something I've never done before: I decided to disable parity. Yeah, I know, dumb idea, but I had run a full parity check just the day before and everything was running like clockwork. Plus my writes were so slow, at around 6-13MB a second, that it would've taken forever to copy data from the old drives to the new. So I copied data over through MC, but I had a couple of weird lockups inside MC as well as copy failures. This concerned me some, but then I noticed that I was getting errors on a couple of the 400GB drives that I was replacing. One completely failed on me, so I lost around 400GB, but that was my fault. I thought that was weird since I had run a full parity check a couple of days before and they appeared fine.

So I decided I'd finish copying what I could and, what the heck, buy 4 more 1.5TB drives, since they were such a good deal, to get rid of all the older drives in my system. I ran the array as is for a few days while I waited for the new drives. Once they arrived I pulled the old drives out of the system, added the four new drives to the array, and reassigned the old 1.5TB parity drive that I had been running with previously. After about 35 hours or so I was ready to use the array. Again, like I said, the parity check was that slow before the help of other members. I went ahead and started copying data off the 500GB and 750GB drives in my system to further consolidate.
I continued to notice weird issues, but it was hard to tell what exactly was happening as I'm still horrible at interpreting Linux syslogs. I couldn't pinpoint these errors to one specific drive. I then started getting errors on one of my two older 1.5TB Seagate drives. I was unable to stop the array, so I would have to power off manually, which required rebuild after rebuild. So I copied the data off, and since I knew that pulling the drive meant redoing parity, I decided the parity disk should go as well, since it's the same vintage. I'll just send them both in for repair, since they were purchased a month or so after the first 1.5TB drives were sold by Seagate. I knew these had been reported to have issues, so why not use the warranty I have.

I was so frustrated with these 2-day-long parity checks after every weird failure that I decided to post asking for help to decrease this time so I could troubleshoot more easily. It was recommended to move my 4 largest drives to my onboard controller, so I moved the drives, consolidated data even further, and went from 2 Supermicro SAT2-MV8 controllers down to 1 with 3 drives attached and four onboard. This greatly sped up parity checks, and now they finish in about 11 hours instead of 32-38. Now I was really starting to think my 3 weeks of hell might be over, but after the parity check completed more weird errors happened. I still had issues stopping the array, and I currently get errors listed for two of my drives. Right now I can't access either one, either from the share or from /mnt/disk2 or /mnt/disk5, without getting input/output errors. Can someone please help me?
June 19, 2009
The 2 drives listed at the bottom of the UnMENU Disk Management screen as "Not in Protected Array" are actually Disk 2 and Disk 5, sdb and sdc. They were very suddenly dropped from the array, simultaneously, as if they had been disconnected. The third drive on that card also had communications difficulties, but after several resets it was recovered. These 2 drives could not be recovered, and when they finally responded, they were individually added as sdi and sdj. So there is no way to access them as Disk 2 and Disk 5 until you reboot.

I would say that very likely you have something loose: either the SATA cables or the power cables, or a backplane is coming loose. The other possibility is that there is something wrong with the disk controller card. There does not appear to be anything wrong with these drives, and there may not be anything wrong with your other drives either. I have more to say, but have to work now, more later.

By the way, there is no reason to run the parity check all the way through when you can't trust the system anyway. Just abort it. Once you have the system configured and stable, then run it. And use the Trust My Array procedure when you can. More later ... For now, shut down and check all cables and connections.
June 20, 2009
A considerable number of transactions were replayed on Disk 2 and Disk 5 ONLY, which is very unusual. The only way I can explain that is that this same simultaneous loss of the same 2 drives had happened in the previous session too, with them being suddenly dropped from the array while the rest of the drives were correctly unmounted. Another important detail about their 'demise' here: it occurred within seconds of all of the drives being spun down, which is too close to be coincidental. I have no explanation at all for it. The third drive came very close to being lost also. I have to wonder about this card ... You might try this card in the other PCI slot.

Hopefully you found something loose. Was that 400GB drive connected to this card? If so, it probably is perfectly fine.

Because of the various crashes, I think you should run reiserfsck on all of the data drives (but NOT the parity drive), when you can. Please see Check Disk File systems.

emhttp: disk_spinning: open: No such file or directory

There were perhaps a thousand of these messages, a new one to me. I have to assume they were associated with the 2 drives that in one sense were no longer there, yet actually were there, which confused the system.
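For anyone wanting to script the file system checks suggested above, here is a minimal sketch in Python that only builds the command lines and does not run anything. It assumes the data disks are exposed as /dev/mdN (N = disk number) and that the read-only `--check` mode of reiserfsck is what you want; the helper name `reiserfsck_commands` is made up for illustration. The parity drive has no file system, so it is never included.

```python
def reiserfsck_commands(data_disk_numbers):
    """Build read-only reiserfsck commands for the given data disk numbers.

    Assumption: data disk N is reachable as /dev/mdN. Nothing is executed
    here; the caller decides whether and when to run each command.
    """
    commands = []
    for n in sorted(set(data_disk_numbers)):
        if n < 1:
            # Assumption: data disks are numbered from 1; anything lower is
            # rejected so the parity drive can never be checked by mistake.
            raise ValueError(f"not a data disk number: {n}")
        commands.append(["reiserfsck", "--check", f"/dev/md{n}"])
    return commands


if __name__ == "__main__":
    for cmd in reiserfsck_commands([2, 5]):
        print(" ".join(cmd))
```

Printing the commands first and running them one at a time by hand keeps you in control, which seems wise on a system that is still dropping drives.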
June 20, 2009 Author
First I need to preface this with thank you so much for your thoughtful responses. I checked the connections this morning for both the power and the SATA. It appears that the power is really tightly connected in the molex, so I pushed them in again, but I can't imagine that the connection is the issue. I also looked at the SATA connections and pushed them all in again. Nothing seemed loose, so I'm not sure it's a connection issue. I suppose it could be a SATA cable. I have plenty of those to replace them with. I could also move them all down to other drive bays to test whether it's a backplane issue.
I have a wedding in Milwaukee, WI and am leaving in twenty minutes; I just wanted to test and get back to you since you took time out of your day to help me. I can remote into home and run some things over this weekend, like the reiserfsck on the drives that stay up. As far as moving the card goes, it will be impossible, since the card then extends over the onboard SATA ports. If the problems persist when I return, I can replace it with the one I just removed. Again, thanks for all your help. I'll respond on Monday at the latest.
June 20, 2009
"I suppose it could be a SATA cable."

It does not feel like a cable issue to me. I did not see any of the typical errors associated with bad cables. Plus, the problems were near simultaneous on the 3 drives attached to the card, very quickly after all drives were spun down. That makes it very hard to blame the cables, even if bad.

One thing to test: individually turn off spin down for Disk 2, 5, and 6 (the 3 drives on the card). This should help to confirm or deny a connection between spin down and this card's handling of it.

Your syslog does show evidence of another problem, not one you can do anything about; only Tom may be able to improve it. This is not the first time I have seen the kernel and the unRAID modules not communicating sufficiently. The kernel or drive controlling module had not only disabled the drives, but 'detached' them (see below), yet unRAID was not aware of this and continued to pass on various I/O commands, which then result in phony read errors, write errors, and stripe errors (not in this case, but in others I have seen). Perhaps Tom, on detecting any serious I/O errors, could check drive params somewhere and verify the physical drive status. It's in the syslog, so it seems like it should be accessible somewhere.
Jun 19 07:46:31 server kernel: ata3: SATA link down (SStatus 0 SControl 300)
Jun 19 07:46:31 server kernel: ata3: limiting SATA link speed to 1.5 Gbps
Jun 19 07:46:32 server kernel: ata3: hard resetting link
Jun 19 07:46:32 server kernel: ata3: SATA link down (SStatus 0 SControl 310)
Jun 19 07:46:32 server kernel: ata3.00: disabled
Jun 19 07:46:32 server kernel: ata3: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
Jun 19 07:46:32 server kernel: ata3: edma_err_cause=00000010 pp_flags=00000000, dev connect
Jun 19 07:46:32 server kernel: ata3: hard resetting link
Jun 19 07:46:33 server kernel: ata3: SATA link down (SStatus 0 SControl 300)
Jun 19 07:46:33 server kernel: ata3: EH complete
Jun 19 07:46:33 server kernel: ata3.00: detaching (SCSI 2:0:0:0)
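If reading syslogs by eye is painful, a small script can tally these events per SATA port. The following is only a rough sketch in Python: the function name `summarize_ata_events` and the regular expression are my own, written against the exact message wording in the excerpt above, and other kernel versions may word these messages differently.

```python
import re
from collections import defaultdict

# Matches kernel libata lines shaped like the excerpt, for example:
# "Jun 19 07:46:31 server kernel: ata3: SATA link down (SStatus 0 SControl 300)"
# The optional ".00" suffix (as in "ata3.00") is folded into the port name.
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+kernel:\s+"
    r"(?P<port>ata\d+)(?:\.\d+)?:\s+(?P<msg>.*)$"
)


def summarize_ata_events(lines):
    """Count link-down, hard-reset, disabled and detaching events per port."""
    summary = defaultdict(
        lambda: {"link_down": 0, "hard_reset": 0, "disabled": 0, "detached": 0}
    )
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an ataN kernel message
        port, msg = m.group("port"), m.group("msg")
        if msg.startswith("SATA link down"):
            summary[port]["link_down"] += 1
        elif msg.startswith("hard resetting link"):
            summary[port]["hard_reset"] += 1
        elif msg == "disabled":
            summary[port]["disabled"] += 1
        elif msg.startswith("detaching"):
            summary[port]["detached"] += 1
    return dict(summary)


# The eleven lines from the excerpt above, verbatim.
SAMPLE = """\
Jun 19 07:46:31 server kernel: ata3: SATA link down (SStatus 0 SControl 300)
Jun 19 07:46:31 server kernel: ata3: limiting SATA link speed to 1.5 Gbps
Jun 19 07:46:32 server kernel: ata3: hard resetting link
Jun 19 07:46:32 server kernel: ata3: SATA link down (SStatus 0 SControl 310)
Jun 19 07:46:32 server kernel: ata3.00: disabled
Jun 19 07:46:32 server kernel: ata3: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
Jun 19 07:46:32 server kernel: ata3: edma_err_cause=00000010 pp_flags=00000000, dev connect
Jun 19 07:46:32 server kernel: ata3: hard resetting link
Jun 19 07:46:33 server kernel: ata3: SATA link down (SStatus 0 SControl 300)
Jun 19 07:46:33 server kernel: ata3: EH complete
Jun 19 07:46:33 server kernel: ata3.00: detaching (SCSI 2:0:0:0)
""".splitlines()

if __name__ == "__main__":
    for port, counts in summarize_ata_events(SAMPLE).items():
        print(port, counts)
```

A port that shows a "detached" count is one the kernel has given up on, which matches the drives that could only come back under a new device name.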
June 23, 2009 Author
@RobJ I didn't get back until late last night, so I started work this morning on figuring out this issue. I've moved the drives down a row to a different backplane. They are still on the same card but now going to different ports with different cables. I also decided to start running the reiserfsck command on all my drives. I'm only on the first drive and it's been running for almost 2 hours now and it's still going, so this might take a while. I'll let you know what I find out and then post another syslog. Thanks again for your help.

*UPDATE* The two drives that seem to be causing all the issues dropped again. I am still running the file system checks on the remaining drives and will attempt a controller swap this evening. I might also contact Seagate to see if I can get a newer firmware for these drives.
June 25, 2009 Author
So I swapped out the controller, and at first the controller wasn't even showing during boot even though it looked fully seated. I then reseated it after a shutdown and rebooted, and then all the drives showed. In addition to doing this, I also moved all three drives that were having issues up to the first row, which runs off the onboard SATA controller. So this evening, after a full parity check ran with 1 sync error, it seemed to be working. Now all of a sudden disk 1 dropped. *ugh* What a pain in the butt. I've attached a new syslog as I'm not sure where to go from here. Maybe the drive firmware I'm running isn't compatible with the SAT2-MV8 card?
June 25, 2009
Essentially the same issue happened. The system booted mostly cleanly, ran a parity check without issue, then after an hour spun the drives down. Immediately (about 2 seconds) after the drives were spun down, the 3 drives connected to the card were dropped, and all 3 SATA links went down. Two (Disk 3 and Disk 4, sdd and sda) were recovered correctly; one (Disk 1, sdb) was not, and was recovered under a different device ID, sdi. As usual with the loss of Disk 1, there were read, write, handle stripe read, and handle stripe write errors reported. I suspect that a subsequent set of spin downs might cause the loss of more drives connected to this card. This is obviously not a workable configuration.

It looks like you moved different Seagate 1.5TB drives to the MV8 card, so that seems to rule out specific drives. All of the drives involved so far are Seagate 1.5TB drives with firmware CC1H, which is I believe an OK firmware release, but perhaps someone else can confirm? This card behaved identically to the other, so it is probably not a faulty card. That leaves 2 possible incompatibilities: the card is not compatible with these drives after spin down, or the card is not compatible with this motherboard.

Have you tested my other suggestion, to individually disable spin down for drives connected to the card? Although probably not desirable permanently, that really seems worth a test.
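The spin down connection described above can be checked mechanically by looking for link-down messages that follow a spin down within a few seconds. Below is a rough Python sketch of that idea. Big assumptions: the spin down line format (`kernel: mdcmd (N): spindown D`) is my guess at how the syslog records it, the helper names are made up, and the sample lines are illustrative only (they are not copied from the attached syslog, apart from the wording of the link-down message).

```python
import re
from datetime import datetime

# Assumed spin down line format; adjust to whatever your syslog really shows.
SPINDOWN_RE = re.compile(r"kernel: mdcmd \(\d+\): spindown (\d+)")
LINK_DOWN_RE = re.compile(r"kernel: (ata\d+): SATA link down")


def parse_ts(line, year=2009):
    """Parse the 15-character syslog timestamp; syslog omits the year."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def link_drops_after_spindown(lines, window_seconds=5, year=2009):
    """Return (port, delay_in_seconds) for each link-down message seen
    within window_seconds after the most recent spin down."""
    last_spindown = None
    hits = []
    for line in lines:
        if SPINDOWN_RE.search(line):
            last_spindown = parse_ts(line, year)
            continue
        m = LINK_DOWN_RE.search(line)
        if m and last_spindown is not None:
            delay = (parse_ts(line, year) - last_spindown).total_seconds()
            if 0 <= delay <= window_seconds:
                hits.append((m.group(1), delay))
    return hits


# Illustrative lines only: the spindown entry and the ata5 entry are invented.
EXAMPLE = [
    "Jun 19 07:46:29 server kernel: mdcmd (45): spindown 2",
    "Jun 19 07:46:31 server kernel: ata3: SATA link down (SStatus 0 SControl 300)",
    "Jun 19 09:10:00 server kernel: ata5: SATA link down (SStatus 0 SControl 300)",
]

if __name__ == "__main__":
    for port, delay in link_drops_after_spindown(EXAMPLE):
        print(f"{port} dropped {delay:.0f}s after a spin down")
```

If every drop on the card lands inside the window and none show up with spin down disabled, that would back up the spin down theory; drops outside the window would point elsewhere.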
July 4, 2009
"All of the drives involved so far are Seagate 1.5TB drives, with firmware CC1H, which is I believe an OK firmware release, but perhaps someone else can confirm?"

When I first read the original post I saw this quote coming from a mile away. I can't confirm this specific firmware, but I can confirm drive problems with Seagate drives larger than 750GB, where the firmware has been updated or is up to date, and Seagate would not acknowledge my problem. The problem with Seagates greater than 750GB is that they have this weird timing out issue, and it doesn't only occur when doing something intensive. For instance, even when just playing an mp3, about once an hour the drive would skip the song like a bad record for about 10 seconds before resuming. Windows can handle it (although it's annoying); raid controllers (or in this case unraid) cannot, as they see this as a drive failure (and it is, since raid controllers don't allow for timeouts for a drive to catch up).

Anyways, out of the 3 Seagate drives larger than 750GB that I had (one was 1TB and the other two 1.5TB), they all went back for permanent return. Get rid of those drives and I bet you'll get rid of your problems. They're awful. I had logs showing the drive failure on the 1TB, but Seagate said my drive was not associated with the issue, basically saying it didn't exist even though I had a drive that was failing (well, doing the lag thing) with proof. I was a loyal Seagate customer from 1992-2008 (including 15 750GB drives in my raid5), but am taking a hiatus until they get things figured out.

Keep us posted on what happens with your server.

Nate