Array Disks missing/going offline


Recommended Posts

I recently upgraded to the latest version of Unraid and decided to try the new additional pools feature. I started with a pool of 2x 500GB SSDs. That went fine and I moved appdata to it. Following Spaceinvader One's advice, I created another pool with a 3TB drive as a download pool. I didn't bother with a preclear because this data isn't considered critical. I assigned the drive to the pool and attempted to format it. The format started but stopped prematurely. I tried again with the same result. I had another drive handy, so I tried that one and got the same result. I shut down the server and re-seated everything. Same problem after reboot, but now disk 5 is showing as disabled.

After some basic troubleshooting, other disks have begun to sometimes go "missing". At this point I've tried to keep the array offline until I figure out what's happening. Sometimes a disk will show up under unassigned devices and then disappear completely, but after refreshing the page it will be back in its slot. At one point all devices showed up, so I started the array. Disk 5 is still disabled, but now disk 2 is showing as unmountable.

In the BIOS the disks don't seem to be disappearing at all, and the LED indicators on the front of the drive sleds don't indicate that power is being lost. Based on what I'm seeing, this seems to be something with Unraid. Any input on what to try is greatly appreciated.

tower-diagnostics-20210426-2044.zip

Link to comment

Previous to this problem, I was having an issue where my server was randomly shutting down. After much time and testing I concluded my PSU was simply not enough (500W) and must have been shutting off under heavy load. My drives are powered from a backplane with 2 Molex connectors (each Molex powers 4 drive bays). To temporarily solve the issue while I decide what to do long term (2U PSUs are kind of rare right now), I connected the backplane to a second PSU with a jumper. This setup has worked fine for several weeks now, so I didn't immediately suspect it could be the cause, especially since there have been no more random shutdowns. I suspect my issue could be my HBA (H310) or possibly the cables? The confusing thing to me is that in the BIOS the drives didn't disappear, but when I physically removed a drive the BIOS DID reflect that without a reboot, so I know it could tell if a drive was dropping offline. I guess I'll test in another OS for more conclusive results and report back. Till then, anything in particular you recommend trying?

thanks!!

Link to comment

I've tested the PSU and HBA/cables, but the problem remains. I reverted back to the onboard SATA ports, and for a moment I thought everything was working okay, but as soon as the array started the problem returned, and Unraid is telling me 3 disks are now unmountable. The drives all appear to still have all their data intact when mounted manually with the array stopped. I would assume doing a new config here would be fine, however I'm worried the disks may continue to "drop" afterwards.

Link to comment

Skimming your logs, I'm seeing an utter TON of disk read errors. Everyone complains about missing logs, does anyone ever use them? Crap, man.

 

Do you have good SMART data coming back from your disks? Tune in here if you have no idea what I'm talking about -- I'd rather not rewrite the info a dozen times, but REPLY HERE. I won't help multiple people in one thread.

 

 

 

If you do have SMART data, check it for all of your disks. If you don't have SMART data, or you don't know what you're looking at (see above), let me know here.

Edited by codefaux
Wrong diagnostic.zip loaded
Link to comment
8 hours ago, 2Piececombo said:

I've tested the PSU and HBA/cables, but the problem remains. I reverted back to the onboard SATA ports, and for a moment I thought everything was working okay, but as soon as the array started the problem returned, and Unraid is telling me 3 disks are now unmountable. The drives all appear to still have all their data intact when mounted manually with the array stopped. I would assume doing a new config here would be fine, however I'm worried the disks may continue to "drop" afterwards.

Pinging you with a quote so you get the notification - looks like you're not subscribed to replies on this thread and failing disks are worth a ping.

Link to comment

I will pull each drive out and test it manually tonight, but I have not seen any SMART warnings pop up other than a temp warning on one of the drives at one point, though I see no reason that would be related to this issue.

 

The issue only seems to happen once the array is started; I can mount the disks just fine while the array is stopped with no issue. But once the array starts, things get weird. I'm considering pulling the data off all my disks and just nuking the whole thing and starting over. I really need my server up and running again sooner rather than later, and I feel like I'm not making any progress on diagnosing the root problem here.

Link to comment
12 hours ago, codefaux said:

missing logs, does anyone ever use them?

 

I'm sure that's what this comment was based on:

On 4/27/2021 at 3:00 AM, JorgeB said:

Looks like a power/connection problem.

 

 

12 hours ago, codefaux said:

Do you have good SMART data coming back from your disks?

You can see the SMART reports of all attached disks in the Diagnostics posted. The usually monitored attributes for all disks look OK, but none of them have ever had an extended SMART test run on them.

 

16 minutes ago, 2Piececombo said:

I will pull each drive out and test it manually tonight

 

I suggest NOT removing any disks. You can run an extended SMART test on each disk from the Unraid webUI by clicking on each disk to get to its page.
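
If you prefer a terminal, the same thing can be kicked off with smartctl -- just a generic sketch, substitute your actual device for /dev/sdX:

# Start an extended (long) self-test; it runs in the background on the drive itself
smartctl -t long /dev/sdX

# Check the self-test log later to see whether the extended test actually completed
smartctl -l selftest /dev/sdX

# Full attribute dump for the usual suspects (reallocated, pending, CRC errors)
smartctl -a /dev/sdX

Either way the results end up in the same SMART report the webUI shows, since they're stored on the drive itself.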

 

But as noted, probably not a disk problem.

 

Check all connections, power and SATA, both ends, including splitters. Make sure each connection doesn't have any tension on it, such as you might get from neatly bundling cables.

 

Post new diagnostics after checking connections and starting the array so we can see what the current state is as far as disabled or unmountable disks.
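
If the webUI is being flaky when you try to grab them, I believe you can also generate the same zip from a terminal session (from memory, it lands on the flash drive under logs):

# Same as Tools -> Diagnostics; the zip should appear under /boot/logs
diagnostics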

 

 

 

Link to comment
18 minutes ago, trurl said:

SMART reports of all attached disks in the Diagnostics posted

Apologies - I missed that folder. Actually looking at this I may have opened someone else's diagnostic.zip file by mistake. I'm still working on getting a good flow going. I'll be more careful.

 

 

19 minutes ago, trurl said:

But as noted, probably not a disk problem.

I concur.

Link to comment
46 minutes ago, 2Piececombo said:

I feel like I'm not making any progress

I understand the sentiment, and it can be difficult to have your entire server offline. I definitely know how that goes; I'm new here because I had major problems recently myself. We'll work with you, but time zones can make things tricky. Once we have a fresh diagnostics dump, we can take another look and ideally we'll have that eureka moment.

Link to comment

I started an extended SMART test. The GUI said it never went past 10%, but I refreshed the page after about 30-45 minutes and it now says the last SMART test passed. Not sure if that meant the extended test or not. I'm not good enough at reading Unraid diagnostic logs yet.
 

With the array stopped, disk 1 has a red X and says "device is disabled, contents emulated", and disk 5 shows missing. If I start the array, 3 disks show that they are unmountable (don't remember which 3, but I don't really wanna start the array again). At this point, is it possible to pick up where I left off? At one point disk 5 did a parity... sync? I'm 99% sure it DIDN'T say rebuild, and it went very quickly -- like multiple GB/s IIRC, too fast for a rebuild. Am I able to somehow import those disks back into the array and tell Unraid not to worry about it and just rebuild parity as-is? If so, will I then run into problems with dockers that were started temporarily while the drives were misbehaving?

 

Also to note, I still hear one of the drives "restarting", it sounds like. But again, this only seems to happen while booted into Unraid. I'll continue testing to see if I can recreate it elsewhere. In the meantime, have fun with my logs :P

 

thanks for the help guys!

tower-diagnostics-20210429-2253.zip

Edited by 2Piececombo
spelling..
Link to comment

I'm gonna be honest here, this sounds pretty severe. Your NICs are dropping and coming back frequently, your disks are coming and going, your SATA controllers are bouncing, and I see messages about your UPS making and breaking contact frequently.

 

Bluntly -- looking at how much of this syslog seems just... completely unstable, I really suspect hardware. The power supply is a big one, especially if it was cheap and/or came with a case. CPU, RAM, a failing motherboard, overclocking, or a failing cooler -- something hardware, and it seems severe.

 

This is beyond my ability to troubleshoot over the internet, but I'll still field any other questions you have to the best of my ability.

Link to comment

The UPS is expected, as I've moved the server into my office away from its usual home, so it is no longer connected to its UPS. Those other issues though... that sounds pretty bad.

 

The PSU is a 500W 80+ Gold 2U from Seasonic.

I'm fine with replacing it; the problem is finding another 2U. They seem to not exist right now, at least anything over 500W. I'd really like to go for an 800W, but I just can't find one. So maybe it's time to change cases as well.

 

All that aside, I still want to have a better idea of what to do about the array. Once I'm no longer dealing with possibly failing hardware, what do I do about the disks being unmountable (but data still intact)? New config, remap the drives, then rebuild parity?

Link to comment
2 hours ago, 2Piececombo said:

disks being unmountable (but data still intact)

If a disk really is unmountable that would indicate that the filesystem is so corrupt that it can't be mounted. Often these can be repaired at least partially. But it might just be an artifact of the hardware problems you are having.
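
When you do get to the repair stage, a read-only check is the usual first step. A rough sketch, assuming the disk is XFS and the array is started in Maintenance mode (use the mdX device that matches the disk number, e.g. disk1 -> /dev/md1, so parity stays in sync):

# -n = no-modify mode; reports what xfs_repair would change without touching anything
xfs_repair -n /dev/md1

# Only after reviewing that output, run the actual repair (still in Maintenance mode)
xfs_repair /dev/md1

Only worth doing once the hardware side is sorted, though.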

 

Since multiple disks are affected that would indicate a power or controller issue.

12 hours ago, 2Piececombo said:

If I start the array, 3 disks show that they are unmountable (don't remember which 3, but I don't really wanna start the array again)

But you didn't take those latest diagnostics with the array started so they don't show any mounted disks at all. Looking through syslog it doesn't look like the array was ever started after that bootup, maybe too many missing disks to even attempt to start.

 

Reviewed the earlier diagnostics, which were also taken without the array started, so nothing was mounted at the time. But digging through syslog, it looks like it immediately had problems communicating with disk1, and disk1 couldn't be mounted. Disk2 and disk3 did mount; disk5 couldn't be mounted. Nothing assigned as disk4, apparently. And cache mounted.
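
If you want to follow along in your own syslog, these are roughly the patterns worth grepping for (nothing Unraid-specific, just the usual signs of a link or power problem):

# Pull the interesting lines out of the live log
grep -iE 'ata[0-9]+|link (reset|down)|I/O error|UDMA|offline' /var/log/syslog | less

The same works on the syslog copy inside the diagnostics zip once it's extracted.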

 

Are the problem disks all plugged in to this controller?

08:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)
	Subsystem: Dell SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1028:1f4f]
	Kernel driver in use: mpt3sas
	Kernel modules: mpt3sas

Maybe you need to reseat the card? Is there a heat issue?

Link to comment
2 hours ago, trurl said:

If a disk really is unmountable that would indicate that the filesystem is so corrupt that it can't be mounted. Often these can be repaired at least partially. But it might just be an artifact of the hardware problems you are having.

The array doesn't autostart because the disks show up as missing, but I can still use the drop-down menu and "re-assign" them, at which point Unraid seems to think all is well and I can start the array. But when I do, it shows 3 disks as unmountable.

 

2 hours ago, trurl said:

 

Are the problem disks all plugged in to this controller?


08:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)
	Subsystem: Dell SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1028:1f4f]
	Kernel driver in use: mpt3sas
	Kernel modules: mpt3sas

Only one disk is plugged into it now. I switched back to the onboard SATA ports thinking the issue could be my HBA, but the problem didn't go away. Also, I have already reseated everything; that was one of the first things I did once I noticed the odd behavior.

 

I would simply try another PSU (as previously mentioned, I have one PSU powering the drives and another powering everything else), but removing the 24-pin connector requires the cooler to be taken off, and taking the cooler off requires the board to be taken out of the chassis, so I've been avoiding this. But I suppose at this point I have no other choice but to do it.

 

My main concern at this point is that the drive (disk 5) continues to go online/offline over and over, regardless of which PSU is powering it or what it's plugged into (HBA/SATA). And since I only observe this in Unraid and nowhere else, I'm not confident that the "nuclear" option of dumping all the data to another location and starting over from scratch will even fix it. If this were a power/motherboard/HBA/etc. problem, the same thing SHOULD happen while booted into something else, but it doesn't. ONLY in Unraid. What are the odds Unraid is somehow the cause?

I'm feeling rather lost at this point. Even with all the help and suggestions, I feel like the main problem isn't being addressed (don't get me wrong, I truly appreciate everyone here), but I'm still stuck on the fact that this issue only seems to happen while booted into Unraid. I mentioned previously that I could hear the drive 'restarting' or something, and after taking the disk in question out and connecting it to another system, it doesn't do that. Again, only in Unraid. :/

 

I will try a new PSU, and might even make a new Unraid USB, pull out all my current disks, and throw in some spares to see if my install is somehow bugged.

 

Thanks again everyone

Link to comment
1 hour ago, 2Piececombo said:

 

I would simply try another PSU (as previously mentioned, I have one PSU powering the drives and another powering everything else), but removing the 24-pin connector requires the cooler to be taken off

As a loose suggestion, order a 24-pin extender from Amazon or something. Then you can unplug the PSU from the extender instead. For next time, obviously -- it doesn't help now, unfortunately.

 

 

Regarding the "main problem not being addressed" -- you bet it isn't, because we're still trying to figure out what the problem IS. We can't address it until we identify it. It WILL NOT be fixed until after something GETS fixed. Stop telling us you feel like it isn't fixed yet, we're still here because it isn't fixed yet. For now.

 

Regarding things working on another Linux OS but not in Unraid -- I literally cannot believe that system runs stable in any Linux-based OS, because Unraid is JUST LINUX with some really neat extra tools. There's no magic here. I mean, the whole filesystem/jagged parity/all-in-one nature is pretty amazing, but it's done by hard work, creation, and layering all these extra things on top of Linux. You've got kernel messages indicating errors before the initrd even gets to anything unique to Unraid.

 

Now, from what I've seen, you've been told repeatedly that it seems like the PSU, and you haven't swapped it yet. I feel like we're not making any progress here, and the potential fix isn't being addressed.

 

 

 

Link to comment

  

45 minutes ago, codefaux said:

Also I just looked up your motherboard -- it has a 24-pin power, and TWO 8-pin power headers. Are you connecting all of them as you should be?

 

My second CPU wouldn't work without it ;) I actually build PCs as part of my job, so I can confidently say all my cables are good :P

 

 

4 hours ago, codefaux said:

As a loose suggestion, order a 24-pin extender from Amazon or something. Then you can unplug the PSU from the extender instead. For next time, obviously -- it doesn't help now, unfortunately.

 

 

Regarding the "main problem not being addressed" -- you bet it isn't, because we're still trying to figure out what the problem IS. We can't address it until we identify it. It WILL NOT be fixed until after something GETS fixed. Stop telling us you feel like it isn't fixed yet, we're still here because it isn't fixed yet. For now.

 

 

I actually thought about the PSU extender, but sadly it would not fit given how cramped my case already is :/

 

If my words came off in any way as a signal of anger or frustration towards anyone trying to help, I apologize. I wrote that after having spent multiple hours on the phone with CenturyLink. Any anger I felt at the time was merely residual from the hell that is trying to get anything done with that god-awful company.

 

6 hours ago, trurl said:

Are you saying that you can boot this very same hardware in another Linux and the disks are accessible? If that is not what you are saying, then what?

4 hours ago, codefaux said:

Regarding the "main problem not being addressed" -- you bet it isn't, because we're still trying to figure out what the problem IS. We can't address it until we identify it. It WILL NOT be fixed until after something GETS fixed. Stop telling us you feel like it isn't fixed yet, we're still here because it isn't fixed yet. For now.

 

Regarding things working on another Linux OS but not in Unraid -- I literally cannot believe that system runs stable in any Linux-based OS, because Unraid is JUST LINUX with some really neat extra tools. There's no magic here. I mean, the whole filesystem/jagged parity/all-in-one nature is pretty amazing, but it's done by hard work, creation, and layering all these extra things on top of Linux. You've got kernel messages indicating errors before the initrd even gets to anything unique to Unraid.

 

 

Well, my test didn't include booting into Linux, but rather a customized WinPE environment. Admittedly not the BEST environment to use, but I had it handy and figured I'd at least be able to observe something in Disk Management. I'll try a more proper test now; I've just been short on time, which has made the whole situation more stressful. I will also disassemble the ol' gal and test out that damn PSU. BTW, anyone selling a 700W+ 2U PSU? :P

 

cheers guys

Edited by 2Piececombo
Link to comment
3 hours ago, 2Piececombo said:

customized WinPE environment

The Windows kernel operates drastically differently from the Linux kernel. I'll give you the benefit of the doubt and assume the customized environment contains driver packages to actually load and activate all of the hardware on the machine. Honestly I'm surprised even that much worked, with how unstable everything looks in those logs.

 

3 hours ago, 2Piececombo said:

I apologize

I understand the frustration, and the justification of having just dealt with useless tech support and so on, and I'll accept the apology with the footnote that you did it twice and not just once, on two separate days which likely did not both involve prior tech support calls, and neither time proved to be helpful. I'm sorry for reacting with the equivalent of a brick, but we're unpaid and don't have direct access to your hardware. Certain considerations are going to have to continue to be made until this is resolved or walked away from, unfortunately.

 

If you still suspect things are okay and it's "just an Unraid thing", feel free to fire up a free Linux distro like SystemRescueCD and use it to do some basic drive tests. If that runs and Unraid doesn't, you're on to something.
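
By "basic drive tests" I mean read-only stuff along these lines -- a loose sketch, replace /dev/sdX with the actual device and triple-check you've picked the right disk:

# Read the entire surface and discard the data; any read error or link reset will show up in the kernel log
dd if=/dev/sdX of=/dev/null bs=1M status=progress

# Non-destructive bad block scan (the default mode is read-only; do NOT add -w)
badblocks -sv /dev/sdX

# Watch the kernel log in a second terminal while the above runs, to catch resets as they happen
dmesg -w

If the drive drops off the bus during either of those, it's not an Unraid thing.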

 

Unfortunately, I have no further ideas, so unless something important changes here -- i.e. you find a significant lead on the problem, or eliminate the current suspects -- I'll only reply to questions I can address.

Link to comment

Things have improved. After a good CPU repasting using the last of my NT-H1, I used a 1200W PSU that I know to be good and reconnected everything. First boot seemed normal enough, and when I got to the UI things seemed to be stable. No disappearing drives or other weirdness. Starting the array left me with disk 5 showing as good, but disk 1 missing/emulated. Given that no other disks are showing up as unmountable, I guess I can rebuild that disk?

 

The other issue is that both drives in my 2nd pool (2x 500GB SSDs) are showing as having no filesystem. I had previously moved appdata and VMs off my main cache drive to this second cache pool. Well, copied, I guess, because all the data is still on the other cache drive. So it seems I haven't lost anything I had put on this pool, if I'm understanding things correctly. The other missing drive (the download one) was never actually formatted to begin with, so its warning can be ignored.

 

I took diags with the array stopped, then again after it started, in case that could be useful in some way. As of now it would seem it was the PSU all along. I'm just going to let it sit for a bit and watch it. I stopped all dockers to be safe. If all seems well and stable after a bit, I'll rebuild the missing disk, unless anyone has a better suggestion.

 

17 hours ago, codefaux said:

and I'll accept the apology with the footnote that you did it twice and not just once, on two separate days which likely did not both involve prior tech support calls

 

Unfortunately I did have to talk to them two days in a row. And for once it wasn't tech support, it was the billing dept. Better or worse, I don't really know. That entire company can burn to the ground for all I care. Thanks for all the help and suggestions though, appreciate it.

 

 

Huge thanks to you all. This community is why I use Unraid and not something else. You're all wonderful. Accept my humble apologies for being "that guy".

cheers

Link to comment

Hey man, I'm real glad to hear it seems good so far. Keep an ear to the ground and check those logs regularly for a little bit, but it sounds like it's time for that unclenching that comes with finally figuring it out.
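
Easiest way to keep an eye on them, loosely speaking:

# Follow the syslog live; leave it running in a terminal window for a while
tail -f /var/log/syslog

There's also a syslog mirror option under Settings somewhere, if I remember right, that can copy the log to the flash drive so it survives a reboot.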

 

I can't comment on the cache pools, as I have no experience with that, but it sounds like things are alright.

 

Rebuilding the disk does sound like the next action, and I agree with giving it a stable wait period before starting on it again.

 

1 hour ago, 2Piececombo said:

Accept my humble apologies for being "that guy". 

Hey man, I was that guy last week. Same forum. I'm not exactly advocating it, but as long as we level off and keep it civil, no harm done in my book. Let us know if you need anything else.

Link to comment

Sadly, the server just randomly shut down a few minutes ago, and wouldn't turn back on for several minutes. I gave it another shot, but it restarted twice in a row after several minutes. I'm confident the RAM is good (did a 24-hour memtest somewhat recently), so it's either the CPU or the motherboard. Luckily I have some alternate LGA 1366 CPUs I can swap in (higher clocks, fewer cores -- sorta worth it?).

 

Will update once I disassemble it again and swap them out. Stay tuned.

Link to comment
