steve1977

Disk 12 failed


Disk 12 just failed. Can anyone tell from the diagnostics whether the issue is with the disk or the controller? I am deciding whether to rebuild parity from scratch or to set up a new array (which may be on the safer side). Thanks!


Run a SMART report on the disk if you can.

http://SrvrName-IP/Main/Device?name=disk? (inserting, of course, your own server name or IP and disk number), for example:

http://136.294.75.128/Main/Device?name=disk##

It'll tell you what's wrong.
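
As a small sketch of the URL pattern above, the helper below fills in the server and disk number. The function name `smart_page_url` and the host `tower.local` are made up for illustration; only the `/Main/Device?name=diskN` path comes from the pattern above.

```python
from urllib.parse import urlencode


def smart_page_url(server: str, disk_number: int) -> str:
    """Build the web UI address of one array disk's device page.

    `server` is a hostname or IP; `disk_number` is the array slot (disk12 -> 12).
    """
    if disk_number < 1:
        raise ValueError("array disk numbers start at 1")
    query = urlencode({"name": f"disk{disk_number}"})
    return f"http://{server}/Main/Device?{query}"


# Hypothetical host name, used only for illustration.
print(smart_page_url("tower.local", 12))
```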

If you suspect something else, replace both the data and power cables on it and test again.

Do you have backups or a spare disk?

I personally check my array every day for errors/issues. I know it's not very cool - but it keeps you ahead of the curve, right?

Edited by KC


I ran both a quick and an extended SMART test. Both completed without error. The diagnostics were captured after the SMART tests, but I don't know how best to read them.


Did you run SMART on all drives?

If not, start now.

If they all pass SMART, do a parity check with no correction.

If that passes too? Sometimes shit happens. To be Sammy Safety, recheck all power and SAS/SATA connections and repeat.

Otherwise? Drives can and will correct themselves if they find a few bits off. However, a drive doing that without reporting it via SMART is kind of rare.
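
If you want to eyeball the reports quickly, here is a rough Python sketch, not an official tool. It assumes the attribute table layout printed by `smartctl -A` (ten whitespace-separated columns, raw value last) and the common ATA attribute names, which vary by vendor. The sample text is invented for illustration and is not taken from the posted diagnostics. The grouping is only a rule of thumb: reallocated or pending sectors point at the disk itself, while CRC errors usually point at the cable or connection.

```python
# Attribute names are the common ATA ones; some vendors use different names.
MEDIA_ATTRS = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}
LINK_ATTRS = {"UDMA_CRC_Error_Count"}


def parse_attributes(text):
    """Map attribute name -> raw value for rows whose raw value is a plain integer."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with the numeric attribute ID and have at least 10 columns.
        if len(parts) >= 10 and parts[0].isdigit() and parts[9].isdigit():
            attrs[parts[1]] = int(parts[9])
    return attrs


def classify(attrs):
    """Group non-zero counters by what they usually implicate (rule of thumb only)."""
    return {
        "disk_media": sorted(n for n in MEDIA_ATTRS if attrs.get(n, 0) > 0),
        "cable_or_controller": sorted(n for n in LINK_ATTRS if attrs.get(n, 0) > 0),
    }


# Invented sample in the smartctl -A layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       37
"""

print(classify(parse_attributes(SAMPLE)))
```

Keep in mind the CRC counter is a lifetime total, so an old non-zero value only matters if it keeps climbing between checks.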

See this older topic; SMART and bit correction still work the same way on physical drives: https://forums.unraid.net/topic/27913-parity-disk-red-balled/?tab=comments#comment-260742

***Edit
Long SMART test - to be sure.

Edited by KC

6 hours ago, steve1977 said:

Any thoughts whether this is a disk or a cable/controller error?

Diags are after rebooting, so not much help, but the disk looks fine.

 

Also, check/replace cables on disk13.


Let me attach a new diagnostic right after running the extended SMART test. I only ran it on disk 12, but will try the other disks later.

 

I keep having this issue with Unraid. I have changed cables several times, so that's not the cause. I eventually figured out that the real issue is overheating of my controller card. Once I kept the slot next to the controller card empty, the issue was largely resolved, as the card was no longer exposed to the heat of the GPU. The issue still seems to happen, though, when building or rebuilding parity (I assume that parity building puts a lot of load/heat on the controller card). It only happens with drives connected to the controller card.

 

It has seemed largely under control for close to a year now, so I am wondering whether the issue this time is indeed a broken disk. If the extended SMART test shows no signs of trouble, I assume that my old heat issue is back? Maybe I need to buy another fan? Any other ideas welcome!


Mmh... Would this reduce heat? Or any other difference this would bring? I believe I had it in IT mode before (not sure though). I remember it was hugely complicated to change it for whatever reason...


I think I still have another controller card. How can I see what mode it is in? I am thinking of swapping it in if the parity rebuild fails yet again.

30 minutes ago, steve1977 said:

Would this reduce heat?

No.

 

31 minutes ago, steve1977 said:

Or any other difference this would bring?

It would use the HBA driver instead of the RAID driver. If nothing else, that's the most tested and most widely used driver, and performance is possibly a little better.


Did you see it in the log that it is in RAID mode? I can switch it out and try the old card, which may be in IT mode. I need to check somehow whether it is, though, and I remember that doing this with the DOS boot disk was a huge pain. But maybe I am lucky and the old card is already in IT mode.

39 minutes ago, steve1977 said:

Did you see it in the log that it is in RAID mode?

Yes, and it's using the megaraid driver, which is unusual for an LSI 2008 based controller. They usually still use the HBA driver even in RAID mode; the megaraid driver is usually for the more advanced RAID controllers, with BBU, etc.
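
For anyone wanting to check this on their own system, here is a minimal sketch. It assumes the text layout of `lspci -k`, where each device line is followed by indented lines including `Kernel driver in use:`. The sample output is illustrative, not copied from the posted diagnostics, and the driver names `megaraid_sas`, `mpt2sas` and `mpt3sas` are the Linux kernel module names as far as I know. As noted above, seeing the HBA driver does not by itself prove the card is in IT mode.

```python
RAID_DRIVERS = {"megaraid_sas"}
HBA_DRIVERS = {"mpt2sas", "mpt3sas"}


def controller_drivers(lspci_k_output):
    """Return (device line, driver) pairs from `lspci -k` style text."""
    results = []
    device = None
    for line in lspci_k_output.splitlines():
        if line and not line[0].isspace():
            # Unindented lines start a new PCI device entry.
            device = line.strip()
        elif device and "Kernel driver in use:" in line:
            results.append((device, line.split(":", 1)[1].strip()))
    return results


def driver_hint(driver):
    """Rough interpretation of the driver name; not a definitive firmware check."""
    if driver in RAID_DRIVERS:
        return "MegaRAID driver: card is probably running RAID firmware"
    if driver in HBA_DRIVERS:
        return "HBA driver: could be IT or IR firmware"
    return "not an LSI SAS driver known to this sketch"


# Illustrative sample only; real output depends on the card and firmware.
SAMPLE = "\n".join([
    "01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2008 [Falcon] (rev 03)",
    "\tSubsystem: IBM ServeRAID M1015 SAS/SATA Controller",
    "\tKernel driver in use: megaraid_sas",
    "\tKernel modules: megaraid_sas",
])

for device, driver in controller_drivers(SAMPLE):
    print(device)
    print("  ->", driver, "|", driver_hint(driver))
```

On a live system you would feed it the real `lspci -k` output instead of `SAMPLE`.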


Do you know if your power supply is properly sized and healthy?  I've had problems in the past that would only appear during a parity check (all drives spinning) because the power supply was of the proper capacity but had become weak over time and didn't have any noticeable issues until it had a higher load on it.


My server has a long history of ejecting disks. I have changed so many components in an attempt to make it work: the mobo, the PSU twice (!), the controller twice (!), plus the cabling several times. All in an attempt to stop Unraid from ejecting one disk every couple of weeks. The symptoms have always been the same. So, my 1,000W PSU should not be the issue. I had thought about buying a UPS to secure a "cleaner" electricity supply, but haven't done so, and I doubt that's the unlock.

 

Maybe 1-2 years ago, I finally found what I thought was the answer to many years of issues: overheating of the controller. Someone in the forum had suggested this. I planned to buy a slot fan, but as a first step moved the controller card to a different slot (one slot away from the GPU, with an empty slot in between). I also thought that the issue may have been the slot itself, but it started "ejecting the disk" yet again when I added a second GPU to the now empty slot. So, it only runs stable when the slot next to the controller is empty.

 

I don't know why the card is set to megaraid, but it may well be that someone suggested it a few years ago. I bought 3 or even 4 M1015 cards, as I thought that a faulty or counterfeit controller card might be the issue. I also thought it might be a FW issue, and since I had struggled to flash it myself in the past, I bought an already flashed card. That's the current one with megaraid. I think I still have 1-2 others somewhere at home, so I can try another card and see whether it is in IT mode and more stable. Having said this, this card/setup has now finally been stable for more than a year, which is the best I have ever had in many, many years of Unraid issues.

 

For a while I also ran Unraid without parity. While this defeats the purpose, it was always rock solid. My controller overheating issue never impacted the actual disks or usage; it only impacts parity, in that it ejects the disk (when I say "eject", I should probably say disable, i.e., the red cross).

 

The parity rebuild now seems to be at 40%, so maybe it will go through. Let me wait and see whether it completes.

 

I am thinking to try three more things:

 

1) Switch to another card, which may be flashed to IT mode

 

2) Use two controller cards with 4 disks each (instead of 8 on one). This should reduce the load on each card?

 

3) Buy a fan. Ideally I can find something that can be positioned in a way that I can even use the now empty slot for a second GPU (although my GPU passthrough anyways has issues when adding a second GPU (even when not passing it through), which doesn't make a lot of sense, but is a replicable issue).

 

Thanks again for the guidance on this board. As you can tell from this, I really must love Unraid, since I keep trying to use it although I have been struggling with it so much. Let alone the cost of a new mobo/CPU and the three GPUs. I hope it will one day be 100% stable, and I believe the controller card (heat!) may be what is preventing that today.


Is the behavior the same if all plugins are removed? Just going stock for a day or week?

Looks like you've got a lot of plugins installed.



Have you considered upgrading your MB BIOS?  The diagnostics you posted show you are running version 0402 dated 06/13/17 and version 1902 dated 07/19/2019 is available.

5 minutes ago, KC said:

Is the behavior the same if all plugins are removed? Just going stock for a day or week?

Looks like you've got a lot of plugins installed.

Highly doubt that this is the issue. I used to run Unraid without plugins (or maybe only with the UD plugin, which I need), so I doubt the plugins have anything to do with it. The dockers, maybe, as they cause read/write activity on the array, which may put even more load on the controller during a rebuild.

7 minutes ago, WizADSL said:

Have you considered upgrading your MB BIOS?  The diagnostics you posted show you are running version 0402 dated 06/13/17 and version 1902 dated 07/19/2019 is available.

I've been thinking of doing so, but it is a bit painful as I run Unraid headless (and don't have a screen). A pity that BIOS upgrades cannot be done headless. I'd need to move it to another room where I can connect it to a television. I have been thinking of doing that and then also playing around with some overclocking of the CPU. I have shied away from it, as I am not sure whether my CPU fan is large enough for overclocking, and my CPU does not seem to benefit from it that much anyway.

 

Do you think that the BIOS upgrade could make a difference to my controller issues? Why? If yes, I could make the effort and give it a try (maybe together with an attempt to OC).


No worries - but it's hard to diagnose something if you have X# of addons installed.

I'd suggest removing them all and going stock for a while - be easier to find out exactly what's going on.


I would definitely update the BIOS.  There are a lot of changes/fixes that can be made in the BIOS code which could affect your problem.  I WOULD NOT OVERCLOCK ***** ESPECIALLY IF YOU SUSPECT HEAT IS AN ISSUE -NOW- *****


No, but that is one of the things he'd like to attempt if he connects it to a monitor in order to do the BIOS upgrade.  I would strongly discourage overclocking ANY server let alone one where heat may already be an issue WITHOUT overclocking.
