Array Disks missing/going offline



Dropped in some other CPUs, but no luck. Not only does the server shut down randomly, but if left alone it also turns back on randomly.
I would assume my board is toast at this point, unless there's some other logical explanation for this behavior.

What an adventure this has been.


Bummer. Yeah, sounds like the board is starting to give its last then. It's still possible that the original PSU may have caused the motherboard's premature end - PSU noise can cause damage to onboard regulators, but it's pretty uncommon - so I'd say the old PSU is likely acceptable. Might be worth doing some basic tests, if you have the capability.

  • 2 weeks later...

UPDATE

 

I ordered a replacement board, which arrived a few days ago. To solve my 2U PSU woes, I decided to migrate to a new case so I could use a standard ATX supply and save a lot of money. I grabbed an 850W Gold from EVGA and the Fractal XL R2. I had to do the 3.3V electrical tape trick on my HDDs since I'm no longer using a server-grade PSU, but it went fine, and upon first boot all drives were showing up and none seemed to be disappearing. The array was not started, and I left it sitting like this for over 12 hours to see if it was stable. Everything seemed fine, so I started the array, let the rebuild begin, and went to sleep. I didn't check on it until about noon the following day, when I tried to remote in from work and was not able to access the server. When I got home I booted it back up and let it just sit - no array started - and it stayed on for multiple days with no problem. I decided to try starting the array one last time, and it shut off again within an hour.
 

At this point the only 3 suspects I have left are

 

a. RAM. I have done a nearly 24-hour memtest with no errors; it only stopped when the server randomly powered off. This test was done several months ago, when I first started having random shutdowns. I've never SEEN bad RAM cause a server to go offline in such a manner (not even IPMI was accessible), but I suppose anything is possible? It's pretty difficult to test this RAM, though. I've got all 18 slots filled, so that would take a while.

 

b. Unraid. I'm far too ignorant to even try to explain how Unraid could be at fault here, but it has been a constant throughout all of this, so it's still a suspect to me.

 

c. The HDDs themselves. I suppose the best thing is to run extended SMART tests on all drives, one at a time, and see what happens?
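For suspect (c), a minimal sketch of queuing extended self-tests, assuming `smartctl` from smartmontools is available on the server. The device paths in the usage note and the helper names are hypothetical placeholders. Note that `smartctl -t long` only starts the test and returns right away; the drive runs it in the background, so to truly go one at a time you would wait for each drive to finish (checking with `smartctl -l selftest`) before starting the next.

```python
import subprocess


def start_long_test_command(device: str) -> list[str]:
    """Command that asks a drive to begin an extended (long) SMART self-test."""
    return ["smartctl", "-t", "long", device]


def selftest_log_command(device: str) -> list[str]:
    """Command that prints the drive's self-test log, to check progress/results."""
    return ["smartctl", "-l", "selftest", device]


def start_extended_tests(devices: list[str], dry_run: bool = True) -> list[list[str]]:
    """Build (and optionally run) the start-test command for each drive.

    With dry_run=True nothing is executed; the commands are only returned
    so they can be reviewed first.
    """
    commands = []
    for device in devices:
        cmd = start_long_test_command(device)
        commands.append(cmd)
        if not dry_run:
            # Returns as soon as the drive accepts the test request.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `start_extended_tests(["/dev/sdb", "/dev/sdc"])` with the default `dry_run=True` just returns the two command lists, which is a cheap way to sanity-check the drive list before touching real hardware.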
 

Any thoughts or suggestions would be wonderful; I'm starting to pull my hair out. Having a server offline this long really makes you appreciate high availability...


If your board has a dedicated management port (and you're using it) the disappearing IPMI access means the BMC died too -- BMC should run sans RAM, on most systems. Hell, the BMC in some systems will run without a CPU. The only thing I can imagine faulting the BMC and shutting down a system would be failing hardware (CPUs, motherboard, RAM, power supply, cooling) -- but even with an overheat situation, the BMC will normally stay up.

 

I've never heard of this "3.3v electrical tape trick" and it horrifies me. Electrical tape is awful and frankly I would petition distributors to stop selling it fully. There is no situation in which electrical tape is the best option, literally ever, especially for things like this. Please use Kapton tape for mods like this in the future -- it's not easily punctured, its adhesive doesn't turn into a greasy black snot smear after brief exposure to warmth and/or moisture, it doesn't stretch or shrink or slide out from beneath pressure -- all qualities "electrical tape" revels in. Kapton tape. Polyimide tape.

 

Also, having to work around a problem related to the power supply not being good enough for hard disks, on a dual-CPU server-grade system where we've been suspecting the power supply -- gotta say it doesn't instill confidence.

 

Anyway - if you're still suspecting it's an Unraid problem (despite your report of it doing the exact same thing in the memtest before according to previous messages, which 100% clears Unraid of responsibility if this is the case) feel free to boot that Windows environment you built, use the storage drivers you slipstreamed into it to assemble a striped Storage Space, and run a batch of benchmarks and/or stress tests on it so Windows is doing more than just sitting idle at the equivalent of a stopped array screen and never crashing... I'd imagine you have your own benchmarking tools selected, since the sort of person who can build a PE with that sort of integration packed in tends to have related tools picked out already. If Windows runs for several hours but Unraid still power-faults, you have your answer. Actually, do yourself a favor and do that anyway. It seems to be a looming question in your mind, it'd be best to solve it instead of continuing to be offline for so long with no motion.

 

Ideally something here brings a solution one step closer. Good luck, and let us know. I apologize if I'm intermittently slow; things don't look to be quieting down for me in the nearish future. I'll try to check back in regularly.

  • 3 weeks later...

I was so frustrated I pushed it aside for a week because I couldn't stand to look at it, but I came back feeling optimistic that I could solve this. Since the problem has to be either the OS or the RAM (since it's the ONLY part that hasn't been swapped out), I took out 6 of the 18 DIMMs (those 6 are IBM RAM, the rest Samsung), which caused the server to boot noticeably faster. Things seemed to be normal for several days, and I was even able to complete a rebuild without the server shutting down halfway through. I still lost 1 disk (8TB) worth of data somehow, but it was almost entirely media which can be re-downloaded, so oh well I guess... At this point I was feeling pretty good and the problem seemed to be gone!

But then tonight, out of nowhere, I noticed the server was once again off. I turned it back on and let it start doing a parity check. It's shut off twice more since then.
I'm truly at a loss now. This is a different motherboard, with different CPUs, a different PSU, and RAM that tested without error for almost 24 hours straight. There's really no rhyme or reason to the shutdowns. It's sat in Windows for well over a day with no problem. It's only ever shut down while booted into Unraid. I realize there's a small chance it could still be a bad stick of RAM, but holy hell, testing 18 DIMMs is not an easy task, especially when it can go days on end without problems. There's no good way to approach this that isn't going to take weeks or months. I've already lost 8TB of data from the crashes, drives going offline, and attempted parity syncs and rebuilds. I'm not even posting to ask for help anymore; I guess I'm just venting at this point. I dunno what to do anymore, aside from backing up what data I can, literally wiping the thing clean, and starting fresh. If I do that, I'd want to re-flash the USB that Unraid lives on, but if I understand it correctly you can't really re-use the USB because it burns the GUID. Maybe I'm misunderstanding this, but I was under the assumption that once you use the USB creator tool, you can't run it again on the same drive.
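For what it's worth, testing 18 DIMMs doesn't have to mean 18 separate runs. A rough sketch of a halving approach, assuming exactly one bad stick and that a "no crash" result can be trusted after a long enough soak (a big assumption given the multi-day gaps between shutdowns). The DIMM labels are hypothetical, and real boards have slot population rules per CPU that this ignores.

```python
def rounds_needed(n_dimms: int) -> int:
    """Worst-case number of halving rounds to isolate one bad DIMM."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float error.
    return (n_dimms - 1).bit_length()


def split_suspects(suspects: list[str]) -> tuple[list[str], list[str]]:
    """Split suspects into the half to install now and the half to set aside."""
    mid = (len(suspects) + 1) // 2
    return suspects[:mid], suspects[mid:]


def narrow(suspects: list[str], installed_half_crashed: bool) -> list[str]:
    """Return the remaining suspects after one soak test of the installed half."""
    installed, set_aside = split_suspects(suspects)
    return installed if installed_half_crashed else set_aside


# Hypothetical labels for the 18 sticks.
dimms = [f"DIMM{i:02d}" for i in range(1, 19)]
print(rounds_needed(len(dimms)))  # prints 5
```

Five soak rounds in the worst case is still slow if each one needs days to be convincing, but it beats eighteen.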

On 5/18/2021 at 3:39 AM, codefaux said:

I've never heard of this "3.3v electrical tape trick" and it horrifies me. Electrical tape is awful and frankly I would petition distributors to stop selling it fully. There is no situation in which electrical tape is the best option, literally ever, especially for things like this. Please use Kapton tape for mods like this in the future -- it's not easily punctured, its adhesive doesn't turn into a greasy black snot smear after brief exposure to warmth and/or moisture, it doesn't stretch or shrink or slide out from beneath pressure -- all qualities "electrical tape" revels in. Kapton tape. Polyimide tape.


A lot of WD drives won't spin up when powered by a consumer PSU (mine are white-label Red drives shucked from WD Easystore/Elements enclosures), so you have to cover the 3.3V line. My old setup was a 2U case/PSU, and the drives plugged into a backplane that was powered via Molex. In my new case (Fractal XL) I'm using a standard ATX supply, so I had to perform this 'mod'. Yes, Kapton tape would be ideal, but I didn't have any handy, and electrical tape has been used many times without issue, so it is what it is, I guess.

