February 28, 2014
I am so FRUSTRATED by this situation! I've tried to fix this problem in every way I can think of. I've been using unRaid for a while, I've never experienced anything like this, and I have no idea how to fix it. Let me post some basic facts about my system. Yes, it's overkill, and no, I don't care. I just can't understand why all this started happening at once.

Here are the unRaid machine specs:
unRAID Version: unRAID Server Pro, Version 5.0.5
Motherboard: MSI - Z77A-GD65 (MS-7751)
Processor: Intel® Core i5-3570K CPU @ 3.40GHz
Cache: CPU Internal L1 = 256 kB (max. 256 kB), L2 = 1024 kB (max. 1024 kB), L3 = 6144 kB (max. 6144 kB)
Memory: 32768 MB (max. 32 GB); BANK 0 through BANK 3 = 8192 MB, 1333 MHz each
Network: eth0: 1000Mb/s - Full Duplex
Uptime: 8 days, 3 hours, 45 minutes, 55 seconds

Here are my hard drives:
Parity Disk PASSED (DiskId: ST3000DM001-1CH166) 3TB
Disk 1 PASSED (DiskId: WDC_WD30EZRX-00DC0B0) 3TB
Disk 2 PASSED (DiskId: WDC_WD30EZRX-00DC0B0) 3TB
Disk 3 PASSED (DiskId: WDC_WD30EZRX-00D8PB0) 3TB
Disk 4 PASSED (DiskId: WDC_WD30EZRX-00DC0B0) 3TB
Disk 5 PASSED (DiskId: WDC_WD30EZRX-00MMMB0) 3TB
Disk 6 PASSED (DiskId: WDC_WD20EARS-00MVWB0) 3TB

Other notes that I HOPE are of interest to someone:
I have NEVER had a SMART error, even when this problem occurs. No errors are ever reported in the Syslog or as SMART errors.
I'm using an RPC-4224 case.
I have three AOC-SAS2LP-MV8 cards, and I have tested every single drive tray on the RPC-4224 to ensure that all the backplanes are operating as expected.
My hard drives NEVER get above 40C, even when I'm writing to them for several hours. They normally stay at about 23 - 33C under "normal" circumstances, and I've been pleased by that.

So I guess you are wondering, because I sure am -- WHAT IS THE PROBLEM?
Every time, and I mean EVERY TIME, I run the parity check, one of the drives gets DISABLED and turns red a few minutes into the process. It's not even the same drive twice, and no errors are written to the Syslog or reported as SMART errors. So my server thinks that a parity sync has never been performed; it will not make it through the process. I always have to do "New Config", re-assign all the drives, and then start the array. After I do all of that, all my data is still there, and then it goes through the "initial parity process". That part works fine and no drives are disabled during it. I just get that "Parity has not been checked in 16129" message at the top of the "myMain" page. Then, within 30 minutes of starting the sync process, one of my drives (chosen seemingly at random) shows as disabled, the dreaded "Red-Balled".

This is driving me crazy! I have NO IDEA what to do next. I am convinced that all my hard drives are good. Do any of you have any idea why this would be happening? I will try anything! I love unRaid and this is the only problem I've ever had using it. Please, someone throw out an idea, because I'm at a loss. I've built a lot of machines, I'm a computer programmer by day, and needless to say I can't stand not being able to solve a problem. If the same drive were being disabled every time, I might suspect a hard drive failure, but it just disables drives at random for no apparent reason. Please help if you can.
February 28, 2014 Author
"I'll be the first to say it.... Post a system log."

Not a problem, I can do that. Here is the Syslog since the last reboot (8 days). Thanks to anyone who can solve the case. At this point I'm thinking about sending $100 to anybody's PayPal account just so I can get over it. I'm racking my brain trying to figure out what the heck is going on.

unRaid_Syslog.txt
February 28, 2014
Cables, memory, or power ... the 3 most likely culprits.

First, I'd unplug two of your memory modules. In an unbuffered board, the memory subsystem is far more reliable if you keep the bus loading to a max of 2 installed modules. Try that and see if it makes a difference.

Second, what power supply are you using? (make/model) And how do you have it connected to your backplanes?

Third, double-check your cables and be sure they are ALL securely seated.
February 28, 2014 Author
OK, I have NOT tried the RAM thing, but I can do that for sure. I guess that could cause a problem.

The power supply is a CORSAIR AX series AX860, connected to the backplanes by Molex connectors (4 on each cable back to the power supply).

I have double-checked and triple-checked all cables (power, SAS) and all are connected securely. I also tested all the SAS cards separately in every single slot of the motherboard.
February 28, 2014
"First, I'd unplug two of your memory modules. In an unbuffered board, the memory subsystem is far more reliable if you keep the bus loading to a max of 2 installed modules."

Huh... (DaleWilliams squirrels away that factoid for future use.)

Sounds like you may have tried this, but I was thinking of more fans cooling the drives. A parity check is one of the rare instances when every drive will spin up and run at the same time.

That's a nice power supply... do you have another around the house that you could swap in for a while?
February 28, 2014 Author
I don't have another power supply to try at the moment. The highest temperature I've ever seen was one drive at 41C during a parity check.

I have acted on the first suggestion and removed 2 memory modules. I have just kicked off the parity check, so I guess I'll know in about 8 to 10 hours. I'll update this thread in case anyone else runs into a similar problem.
February 28, 2014
Short of a miracle, my bet is on RAM or PSU. I had weird issues with my desktop for 6 months: it would BSOD at random. I replaced the PSU, as it was old and most searches pointed to it as the culprit. No go. As a desperate measure I pulled 2 old RAM modules from it and never got a BSOD again. Granted, by the time I figured out it was my RAM, the whole setup was screwed, and I am now looking at a full Windows reinstall, but that is another story altogether.
February 28, 2014 Author
This time it failed within minutes of starting the parity check, and then unRaid disabled disk 3. I think it disabled disk 1 last time, and disk 2 the time before that. Again, NO SMART errors!

Disk 3 is disabled, yet I can still access it, and can even open files that were copied to that drive last night. I have no clue what to do.
February 28, 2014 Author
"Have you actually run memtest from the unRAID boot menu?"

I have run the test before, but not since this started happening. I think I will do that now. Thanks for the reminder!

Memtest is in progress; I will post an update when the test finishes. Thanks to everyone on this board for supporting the user base!
February 28, 2014
"Disk 3 is disabled, but yet I can still access and even access stuff that was copied to that drive last night."

No help for the original issue, but I think it's critical that you understand why you can see and work with a disabled disk. This ability is what allows unraid to rebuild 1 failed disk, so if you don't understand it, you won't be in a good position to work with it.

Whenever a write to a disk fails, unraid disables the disk, so no more data is sent to it. Instead, ALL disk operations to the disabled disk are redirected to the phantom disk that is maintained by parity calculations spanning the entire array. You could continue to operate this way, but if another disk fails, you immediately lose access to all the data on BOTH disks that failed. That's why it is critical to deal with a single disk failure promptly, and never run an unraid array (or any array really) with a disk that you suspect is bad.

When I first started playing with unraid, I figured: it's fault tolerant, I can have a failed disk rebuilt no problem, so why should I care about the health of my disks? If one fails, unraid will tell me and fix it. WRONG ATTITUDE. It bit me big time when one disk failed, another questionable disk acted up during the rebuild, and I lost some data. I know that doesn't apply to you right now, I'm just throwing out the info in general.

As to your specific problem, try removing 2 of your AOC controller cards and operate off of just 1 for a troubleshooting run, since you should have enough ports to operate that way with your limited drive count. I suspect the three identical cards are causing some sort of timing issue.
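For anyone who wants to see the "phantom disk" idea in concrete terms, here is a minimal sketch in Python. It assumes simple single-parity XOR across equal-sized blocks, which is the general technique behind single-parity arrays; the function names and the toy byte strings are made up for illustration and are not unRAID's actual code or on-disk layout.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_disks):
    """Parity block = XOR of the same block on every data disk."""
    return xor_blocks(data_disks)


def emulate_missing(surviving_disks, parity):
    """Recreate the one missing disk's block from the survivors plus parity."""
    return xor_blocks(surviving_disks + [parity])


# Toy "disks" (hypothetical contents, all the same length).
disks = [b"disk1data", b"disk2data", b"disk3data"]
parity = compute_parity(disks)

# Pretend disk 3 was disabled: its contents are still readable via parity.
rebuilt = emulate_missing([disks[0], disks[1]], parity)
print(rebuilt == disks[2])
```

With only one parity block there is a single equation per byte position, so it can stand in for exactly one missing disk. Lose a second disk and there are two unknowns, which is why the data on both becomes inaccessible.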
February 28, 2014 Author
That's an interesting idea about the cards. I may try that later. Thanks for the suggestion.
February 28, 2014
"That's an interesting idea about the cards, I may try that later. Thanks for the suggestion"

I had similar problems when one of my SASLP-MV8 cards was not quite properly seated. It would work most of the time, but then fail a drive in an unpredictable manner, exactly as you are seeing. It also seemed particularly prone to failing under load.
February 28, 2014 Author
OK, I have just verified that all cards are properly seated. I'm not saying there isn't an issue with the motherboard playing nicely with these cards, but they do appear to be seated correctly.
February 28, 2014 Author
"As to your specific problem, try removing 2 of your AOC controller cards and operate off of just 1 for a troubleshooting run, since you should have enough ports to operate that way with your limited drive count. I suspect the three identical cards are causing some sort of timing issue."

OK jonathanm, I have reconfigured the RPC-4224 so that all 7 drives are running off of one AOC-SAS2LP-MV8 controller. I guess I just have to wait and see what happens.
February 28, 2014
"OK jonathanm, I have reconfigured the RPC-4224 so that all 7 drives are running off of one of the AOC-SAS2LP-MV8 controller. I guess I just have to wait and see what happens."

Did you remove the extra cards from their slots?
March 1, 2014 Author
"Did you remove the extra cards from their slots?"

No, I did not, but there are no drives connected to those backplanes. It's been running for about 30 minutes now and so far there are no errors. Of course, if this does fix the problem, what does that mean? Does this motherboard just not handle these cards very well? Supporting future upgrades was kind of the point of having three cards.
March 1, 2014
As Jonathan suggested, you should remove the unused boards from the system.

I would also leave the extra memory modules unplugged ... in fact, I'd leave the system that way permanently unless you have a specific reason you need more than 16GB. [Note: For those who didn't realize the impact of installing additional modules, you should watch Item #10 in the Corsair Memory Basics presentation ... it's designed to show why it's a good idea to use registered RAM modules when possible; but clearly shows why you want to keep the bus loading as low as possible. http://www.xlrq.com/stacks/corsair/153707/index.html ]

Note: While it seems unlikely, there may be an issue with the SAS2LP-MV8 in an x4 slot. An x8 device is SUPPOSED to work just fine in such a case ... it should just work with 4 lanes instead of the designed 8. But if all is well with a single card in, then I'd plug the 2nd card in the x8 slot (assuming you used the x16 for the first one) ... and then connect the drives to that card. If that also works well, try splitting the drives between those 2 cards. If that also works well, then I'd just leave the 3rd card out. With 8 motherboard ports, and 2 8-port cards, you'll have support for 24 drives with just 2 cards anyway!! Any reason you wanted more ??
March 1, 2014
Note: If you do want all the drives on add-in controllers, using an older SASLP-MV8 (as dgaschk suggested) might do the trick. Since it's an x4 card, it wouldn't be operating outside of its designed speed in your x4 slot.
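The port count and the x4-versus-x8 question above can be put into rough numbers with a short Python sketch. The figures in it are assumptions rather than anything measured in this thread: about 500 MB/s of usable bandwidth per PCIe 2.0 lane, and about 150 MB/s sustained per drive during a parity check. The helper names are made up for illustration.

```python
# Assumed figures (not from the thread): ~500 MB/s usable per PCIe 2.0 lane,
# ~150 MB/s sustained per drive while every drive is read at once.
PCIE2_MB_PER_LANE = 500
DRIVE_MB_PER_SEC = 150


def total_ports(motherboard_ports, cards, ports_per_card=8):
    """Drive ports available from onboard SATA plus add-in cards."""
    return motherboard_ports + cards * ports_per_card


def slot_headroom(lanes, drives):
    """Slot bandwidth minus what the attached drives could demand, in MB/s."""
    return lanes * PCIE2_MB_PER_LANE - drives * DRIVE_MB_PER_SEC


print(total_ports(8, 2))    # 24 ports: 8 onboard + 2 cards x 8
print(slot_headroom(4, 8))  # 800 MB/s spare with 8 drives on 4 lanes
print(slot_headroom(8, 8))  # 2800 MB/s spare with 8 drives on 8 lanes
```

Under those assumptions, raw bandwidth is not the limit even with a full card in an x4 slot, which fits the point that an x8 card is supposed to work with 4 lanes. If the x4 slot does misbehave, it would more likely be a compatibility problem than a throughput one.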
March 1, 2014 Author
I will remove the unused cards as Jonathan suggested. Right now I'm at 40% on the parity check and no drives have been disabled, so that's an improvement at least. I hope that is the light I see at the end of the tunnel. Thanks to everybody for their insight on this; I'm thankful that I had a community to ask.
March 1, 2014 Author
Parity Check completed without incident and the extra controller cards have been removed. I may try some of the additional things that were suggested in this thread at some point, but for now.... a working system is a happy system!! Thanks to everyone in the unRAID Server Community for their input and suggestions.
March 2, 2014
I'd leave the system with 16GB (only 2 installed modules); add one more controller card in the x8 slot (leaving the x4 slot empty) ... and confirm that all works well like that. Assuming it does, you'll have support for 24 drives ... which I suspect will be plenty for a good while.
This topic is now archived and is closed to further replies.