steve1977 Posted July 16, 2019
Disk 12 just failed. Do the diagnostics show whether the issue is with the disk or the controller? I'm deciding whether to rebuild parity from scratch or to set up a new array (which may be the safer option). Thanks!
itimpi Posted July 16, 2019
If you actually posted the diagnostics, it would help.
steve1977 Posted July 16, 2019
Sorry, here it is: tower-diagnostics-20190716-0938.zip
steve1977 Posted July 18, 2019
Any thoughts on whether this is a disk or a cable/controller error?
KC Posted July 19, 2019
Run a SMART report on the disk if you can:
http://SrvrName-IP/Main/Device?name=disk?
inserting, of course, your server name or IP and your disk number, for example:
http://136.294.75.128/Main/Device?name=disk##
It'll tell you what's wrong. If you suspect otherwise, replace both the data and power cables on it and test again. Do you have backups or a spare disk? I personally check my array every day for errors/issues. I know it's not very cool, but it keeps you ahead of the curve, right?
Edited July 19, 2019 by KC
steve1977 Posted July 19, 2019
I ran both a quick and an extended SMART test. Both completed without error. The diagnostics were taken after the SMART tests, but I don't know how best to read them.
KC Posted July 19, 2019
Did you run SMART on all drives? If not, start now. If they all pass SMART, do a parity check with no correction. If they all pass? Sometimes shit happens. To be Sammy Safety, recheck all power and SAS/SATA connections and repeat. Otherwise? Drives can and will correct themselves if they find a few bits off. However, not reporting that via SMART is kind of rare. See this older topic; SMART and bit correction still work the same way on physical drives:
https://forums.unraid.net/topic/27913-parity-disk-red-balled/?tab=comments#comment-260742
***Edit: run the long SMART test, to be sure.
Edited July 19, 2019 by KC
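For anyone who would rather script the SMART check across all drives than open each device page, here is a minimal Python sketch. It assumes smartmontools' `smartctl` is on the PATH and relies on the exit-status bitmask described in the smartctl man page. The function names are made up for illustration, and drives sitting behind a controller in MegaRAID mode may need smartctl's `-d megaraid,N` device type, which this sketch does not handle.

```python
import subprocess

# Meaning of each bit in smartctl's exit status, per the smartctl man page.
EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART or other ATA command failed",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attribute at or below threshold",
    5: "some attribute was at or below threshold in the past",
    6: "device error log contains error records",
    7: "device self-test log contains error records",
}


def decode_exit_status(status: int) -> list[str]:
    """Return a description for every bit set in a smartctl exit status."""
    return [text for bit, text in EXIT_BITS.items() if status & (1 << bit)]


def check_drive(device: str) -> list[str]:
    """Run `smartctl -a` on one device and decode the flags it raised."""
    result = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit status is data here, not an error
    )
    return decode_exit_status(result.returncode)
```

Calling `check_drive("/dev/sdb")` for each array member (the path is a placeholder) returns an empty list when smartctl raised no flags. An empty list says nothing about cables or the controller, so it narrows the question rather than answering it.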
JorgeB Posted July 19, 2019
6 hours ago, steve1977 said: "Any thoughts whether this is a disk or a cable/controller error?"
The diags were taken after rebooting, so they're not much help, but the disk looks fine. Also, check/replace the cables on disk13.
steve1977 Posted July 19, 2019
Let me attach a new diagnostic taken right after running the extended SMART test. I only ran it on disk 12, but will try the other disks later.
I keep having this issue with Unraid. I have changed cables several times, so that's not the issue. I figured out, though, that the real issue is overheating of my controller card. Once I kept the slot next to the controller card empty, the issue was largely resolved, as the card was no longer exposed to the heat of the GPU. The issue still seems to happen when building/rebuilding parity (I assume that parity building puts a lot of load/heat on the controller card). It only happens with drives connected to the controller card.
It has seemed largely under control for close to a year now, so I am wondering whether the issue this time is indeed a broken disk. If there are no signs in the extended SMART test, I assume my old heat issue is back? Maybe I need to buy another fan? Any other ideas welcome!
JorgeB Posted July 19, 2019
The LSI is in RAID mode; IT mode would be best.
steve1977 Posted July 19, 2019
Mmh... Would this reduce heat? Or would it make any other difference? I believe I had it in IT mode before (not sure, though). I remember it was hugely complicated to change, for whatever reason...
steve1977 Posted July 19, 2019
I think I still have another controller card. How can I see what mode it is in? I am thinking of swapping it in if the parity rebuild fails yet again.
JorgeB Posted July 19, 2019
30 minutes ago, steve1977 said: "Would this reduce heat?"
No.
31 minutes ago, steve1977 said: "Or any other difference this would bring?"
It would use the HBA driver instead of the RAID driver. If nothing else, that is the most tested and most used driver, and performance is possibly a little better too.
steve1977 Posted July 19, 2019
Did you see in the log that it is in RAID mode? I can switch it out and try the old card, which may be in IT mode. I need to somehow check whether it is, though, and I remember that doing this with the DOS boot disk was a huge pain. But maybe I am lucky and the old card is already in IT mode.
JorgeB Posted July 19, 2019
39 minutes ago, steve1977 said: "Did you see it in the log that it is in raid mode?"
Yes, and it's using the megaraid driver, which is unusual for an LSI 2008-based controller. They usually still use the HBA driver even in RAID mode; the megaraid driver is usually for the more advanced RAID controllers, with BBU, etc.
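One way to see which kernel driver a controller is bound to, without reading through the syslog, is the "Kernel driver in use" line that `lspci -k` prints for each device. Below is a minimal Python sketch, assuming `lspci` (pciutils) is available on the server. `megaraid_sas` is the Linux module name for the MegaRAID driver, while cards in IT mode are normally handled by `mpt3sas` on recent kernels. The function names are hypothetical.

```python
import subprocess


def kernel_drivers(lspci_k_output: str) -> dict[str, str]:
    """Map each PCI device line to its 'Kernel driver in use' value."""
    drivers: dict[str, str] = {}
    device = None
    for line in lspci_k_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented lines start a new device entry.
            device = line.strip()
        elif device is not None and "Kernel driver in use:" in line:
            drivers[device] = line.split(":", 1)[1].strip()
    return drivers


def raid_driver_devices(drivers: dict[str, str]) -> list[str]:
    """Return the devices currently bound to the MegaRAID driver."""
    return [dev for dev, drv in drivers.items() if drv == "megaraid_sas"]


def read_lspci() -> str:
    """Capture `lspci -k` output from the running system."""
    result = subprocess.run(
        ["lspci", "-k"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Running `raid_driver_devices(kernel_drivers(read_lspci()))` on the server lists any controller still bound to the RAID driver. This only reports the driver binding; it does not say which firmware is flashed on the card.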
WizADSL Posted July 19, 2019
Do you know if your power supply is properly sized and healthy? I've had problems in the past that would only appear during a parity check (all drives spinning) because the power supply, although of the proper capacity, had become weak over time and showed no noticeable issues until it was under a higher load.
steve1977 Posted July 19, 2019
My server has a long history of ejecting disks, and I have changed many components in an attempt to make it work: the mobo, the PSU twice (!), the controller twice (!), and the cabling several times. All in an attempt to stop Unraid from ejecting one disk every couple of weeks. The symptoms have always been the same. So my 1,000W PSU should not be the issue. I had thought about buying a UPS to get a "cleaner" electricity supply, but haven't done so, and I doubt that's the unlock.
Maybe 1-2 years ago, I finally found what I thought was the answer to many years of issues: overheating of the controller. Someone in the forum had suggested this. I planned to buy a slot fan, but as a first step moved the controller card to a different slot (one slot away from the GPU, with an empty slot in between). I also thought the slot itself might have been the issue, but it started "ejecting the disk" yet again when I added a second GPU to the now-empty slot. So it only runs stable when the slot next to the controller is empty.
I don't know why the card is set to megaraid, but it may well be that someone suggested it a few years ago. I bought 3 or even 4 M1015 cards, as I thought a faulty or counterfeit controller card might be the issue. I also thought it might be a FW issue, and since I had struggled to flash it myself in the past, I bought an already flashed card. That's the current one with megaraid. I think I still have 1-2 others somewhere at home, so I can try another card and see whether it is in IT mode and more stable.
Having said this, this card/setup has now finally been stable for more than a year, which is the best I have ever had in many, many years of Unraid issues. I also used Unraid without parity for a while. While this defeats the purpose, it was always rock solid.
My controller overheating issue never impacted the actual disks or usage; it only impacts parity, in that it ejects the disk (when I say "eject", I should probably say disable, i.e., the red cross). The parity rebuild now seems to be running at 40%, so maybe it will go through. Let me wait and see whether it does.
I am thinking of trying three more things:
1) Switch to another card, which may be flashed to IT mode.
2) Use two controller cards with 4 disks each (instead of 8 on one). This should reduce the load on each card?
3) Buy a fan. Ideally I can find something that can be positioned so that I can even use the now-empty slot for a second GPU (although my GPU passthrough has issues anyway when adding a second GPU, even when not passing it through, which doesn't make a lot of sense but is a replicable issue).
Thanks again for the guidance on this board. As you can tell, I really must love Unraid to keep trying to use it despite struggling with it so much, let alone the cost of a new mobo/CPU and the three GPUs. I hope it will one day be 100% stable, and I believe the controller card (heat!) may be what is preventing that today.
KC Posted July 19, 2019
Is the behavior the same if all plugins are removed? Just going stock for a day or a week? Looks like you've got a lot of plugins installed.
WizADSL Posted July 19, 2019
Have you considered upgrading your MB BIOS? The diagnostics you posted show you are running version 0402 dated 06/13/17, and version 1902 dated 07/19/2019 is available.
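The installed BIOS version can be read on a headless box over SSH, since Linux exposes the DMI fields under `/sys/class/dmi/id`. The Python sketch below only reads those fields (it does not perform an update), and the function name and the "unknown" fallback are assumptions made for illustration.

```python
from pathlib import Path


def read_bios_info(dmi_dir: str = "/sys/class/dmi/id") -> dict[str, str]:
    """Read BIOS vendor, version and date from the sysfs DMI directory."""
    info: dict[str, str] = {}
    for field in ("bios_vendor", "bios_version", "bios_date"):
        path = Path(dmi_dir) / field
        # Fall back to a placeholder if the platform does not expose the field.
        info[field] = path.read_text().strip() if path.is_file() else "unknown"
    return info
```

Calling `read_bios_info()` before and after a flash is a quick way to confirm that the new version actually took.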
steve1977 Posted July 19, 2019
5 minutes ago, KC said: "Is the behavior the same if all plugins are removed? Just going stock for a day or week? Looks like you've got a lot of plugins installed."
I highly doubt that this is the issue. I used to run Unraid without plugins (or maybe only with the UD plugin, which I need). I doubt the plugins have anything to do with my issues. The dockers might, as they cause read/write activity on the array, which may put even more load on the controller during a rebuild.
steve1977 Posted July 19, 2019
7 minutes ago, WizADSL said: "Have you considered upgrading you MB BIOS? The diagnostics you posted show you are running version 0402 dated 06/13/17 and version 1902 dated 07/19/2019 is available."
I've been thinking of doing so, but it is a bit painful as I run Unraid headless (and don't have a screen). A pity that BIOS upgrades cannot be done headless. I'd need to move the server to another room where I can connect it to a television. I have been thinking of doing that so I can also play around with some overclocking of the CPU. I have shied away from it so far, as I am not sure whether my CPU fan is large enough for overclocking, and my CPU does not seem to benefit from it that much anyway.
Do you think the BIOS upgrade could change my controller issues? Why? If yes, I could make the effort and give it a try (maybe together with an attempt to OC).
KC Posted July 19, 2019
No worries, but it's hard to diagnose something if you have X number of add-ons installed. I'd suggest removing them all and going stock for a while; it'll be easier to find out exactly what's going on.
WizADSL Posted July 19, 2019
I would definitely update the BIOS. There are a lot of changes/fixes that can be made in the BIOS code which could affect your problem.
I WOULD NOT OVERCLOCK ***** ESPECIALLY IF YOU SUSPECT HEAT IS AN ISSUE -NOW- *****
WizADSL Posted July 19, 2019
No, but that is one of the things he'd like to attempt if he connects it to a monitor in order to do the BIOS upgrade. I would strongly discourage overclocking ANY server, let alone one where heat may already be an issue WITHOUT overclocking.