unRAID freeze completely randomly



Hello!

I decided I wanted to try unRAID for a new NAS build, so I installed it and have had it running for a few months. But I'm experiencing issues severe enough that I can't consider this build complete.

I get errors at random. It wasn't as bad in the beginning, but back then I never really hammered the NAS much either.

I get a whole bunch of different errors. Attached are some of the ones I've screenshotted via my BMC. I obtained a diagnostics dump after confirming an error had appeared on screen but before it made Unraid unresponsive, which always happens eventually.

I also suspect it could be related to Docker. I currently only run an rTorrent container. I've tried heavily limiting the number of files it's allowed to keep open at once and limiting its ability to create connections, but that had no effect; Unraid still crashes completely. The server stays in a good state longer if I don't start any containers, but once I start rTorrent the problems always begin within about two days of uptime.
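For reference, the kind of limits I mean look roughly like this (the image name and exact values here are placeholders, not my actual settings):

    # Cap the container's open-file limit with Docker's --ulimit flag
    # ("my-rtorrent-image" is a placeholder)
    docker run -d --name rtorrent --ulimit nofile=1024:1024 my-rtorrent-image

    # And in .rtorrent.rc, cap open files/sockets and peer connections:
    network.max_open_files.set = 128
    network.max_open_sockets.set = 300
    throttle.max_peers.normal.set = 40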

I've run MemTest86 on the hardware to check it, but it found nothing.

unraid error 2.png

unraid error.png

unraid error3.png

tower-diagnostics-20190506-2015.zip


I've added the diagnostics dump to my OP. I've rebooted now, but my trial has expired along with my two extensions, so I can't access the WebUI anymore. And /dev/md3 doesn't seem to exist either. I suspect this is what Unraid has named its pool, and that it's not a reference to one of my disks.

I have run a parity check several times, and it completed without any issues.

5 hours ago, Rudde said:

And /dev/md3 doesn't seem to exist either.

The mdX type devices all relate to array disks, with 'X' being the disk slot number. This means that md3 is equivalent to disk3.

7 hours ago, trurl said:

That diagnostic is nearly 2 weeks old.

 

Have you done a memtest recently?

The memtest was done in the same time frame. The server hasn't really been used since that diagnostics was taken, because the problems described here had already been present for a long time before it.

3 hours ago, itimpi said:

The mdX type devices all relate to array disks, with 'X' being the disk slot number. This means that md3 is equivalent to disk3.

Okay, thanks. Do they count Disk 1 -> md1, Disk 2 -> md2, and so on? And where do cache and parity 1 and 2 land in this scheme?

4 hours ago, Rudde said:

Do they count Disk 1 -> md1, Disk 2 -> md2, and so on? And where do cache and parity 1 and 2 land in this scheme?

disk1 is md1, etc. The md devices are only for disks mounted in the parity array, so cache isn't part of this. Parity is sometimes referred to as disk0, but it can't be md0 because it doesn't have a filesystem.

 

If you look at syslog you will see Unraid taking inventory of the disks as slot0, slot1, etc. Parity2 is slot29, after any possible data disks.
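If you want to see the mapping yourself, something along these lines from the console should show it (disk3 is just an example):

    # The device mounted at a disk share is the corresponding md device
    df -h /mnt/disk3          # shows /dev/md3 for an array data disk

    # Unraid's disk inventory lines in the syslog
    grep -i 'import' /var/log/syslog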

 

md is really about the disks as they are used with parity. When working with the md devices, parity is part of that: writing to an md device updates parity, a disabled disk can still be accessed as an md device via the parity calculation, and when repairing a filesystem you always use the md device so parity will be maintained.
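For example, a filesystem check/repair of disk3 would be run against the md device so parity stays in sync (array started in Maintenance mode; disk3 is just an example):

    xfs_repair -n /dev/md3    # -n = no-modify dry run, report problems only
    xfs_repair /dev/md3       # actual repair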

 

For disks outside the array, the sd device will sometimes be referred to instead. Don't assume a specific sd device always refers to the same disk, since that can change between boots, especially if you add or remove disks.
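If you need a stable way to identify a particular physical disk regardless of boot order, the serial-based links are one option:

    # Persistent, serial-number-based names that point at the current sdX devices
    ls -l /dev/disk/by-id/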

  • 2 weeks later...

I've managed to run xfs_repair on the device in question. It has had absolutely no effect on any of the issues described in this thread. I have no idea what to do next. Is there anything in the diagnostics files indicating what's wrong?

8 hours ago, trurl said:

Could you give us some details about that? The output from running the repair would be preferred.

There were no errors in it; it was the standard output.

These issues also occurred well before that error popped up, and it feels like the error least relevant to the symptoms I'm experiencing. Why would a completely corrupt storage disk even crash unRAID?

On 5/29/2019 at 5:36 PM, trurl said:

Can you get us a new diagnostic from after running the repair?

Yes. Here is a completely fresh one I took just now, after a fresh boot, having not turned it on since the last crash.

No errors have appeared on the monitor yet during this boot, prior to taking the diagnostics. I will let it run, see if an error appears on screen, and take another one before it becomes completely unresponsive and I have to hard-reset it.

tower-diagnostics-20190531-0916.zip


Now it has crashed again, and it is not responsive: it's on and displays the error, but I can't connect to it over SSH or access the WebUI to get another diagnostics.

While this error is not one of those I've managed to screenshot, it is an error I've gotten before: it mentions "nf_nat_setup_info".

 

This happened after I started stress-testing my rTorrent container, since I've suspected it's related to Docker in some way. I started adding a few torrents, one after another, and it crashed very quickly.

 

When I got this error message last time, I tried to move things away from the SSD because I didn't want it involved with the Docker instances; I was afraid it had depleted sectors, so I moved the data to rule that out.

