Experiencing unresponsive array among other issues



Hey all, been having some strange issues lately that I'm not even sure how to properly describe.

 

Over the past 2 weeks or so, I've been experiencing issues like the array becoming unresponsive, the webUI not loading all elements, the docker service getting stuck, and being unable to reboot gracefully. About 3-4 times in this period I've had to power cycle the server via iDRAC because a graceful shutdown wouldn't complete. I can use the iDRAC console or SSH into Unraid to run a reboot command, but it just says the server is going down for a reboot and then sits there until I power cycle.

 

I believe the issue is with my cache drives. I have a 2x800GB SSD cache pool. They have been running hot fairly often, but the temperature usually comes back down shortly after I get the notifications.

 

In my /root directory, I see a file called dead.letter. Here are the contents of that file:

Event: Unraid Parity check
Subject: Notice [STEVENET] - Parity check started
Description: Size: 8 TB
Importance: warning

fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error
fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error
fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error
fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error
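(For what it's worth, I believe that same trim call can be run by hand to see if the error is reproducible, assuming the pool is still mounted at the default /mnt/cache:

fstrim -v /mnt/cache

On a healthy pool that should report how many bytes were trimmed instead of the Remote I/O error.)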

When I console in through iDRAC, I see some messages like this (attached).

 

My SMART checks all passed when I last ran them a day or two ago, after I started experiencing these issues.

 

Is this a sign of my cache drive(s) failing, or possibly something else? Not sure which steps to take next.

 

Thanks

Screen Shot 2020-06-13 at 5.44.11 PM.png

8 hours ago, johnnie.black said:

SAS2 LSI controllers don't currently support trim; see if you can connect the SSDs to another controller, like onboard AHCI SATA ports if available.

Thanks for the info. I'll have to take a look when I get home. I thought I read that on the R720xd the onboard SATA is disabled.
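(I'm assuming I can check from the console which controller each disk is actually hanging off of with something like:

ls -l /dev/disk/by-path/

since the by-path names should indicate whether a device is behind the SAS HBA or an onboard ata/AHCI port, and lspci should show whether an onboard SATA controller is even exposed on this board.)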

 

how important is it to run trim on the SSDs? would I be better off disabling the trim plugin if I can't use the onboard SATA?

2 minutes ago, Debaser said:

how important is it to run trim on the SSDs?

That's not a question with a fast or easy answer; you can always google it and read the various opinions.

 

10 minutes ago, Debaser said:

would I be better off disabling the trim plugin if I can't use the onboard SATA?

It won't cause any problems, but you might as well disable it.

 

  • 2 weeks later...

Since my last post about a week ago, my server has needed a hard reboot 2-3 more times. It's currently in a state where some containers and services are working, but some are not. Not sure what is going on. I'm pretty sure if I try a graceful shutdown at this point it will just hang and require another hard reboot.

 

Pages in Unraid such as Docker, Dashboard, etc. are unresponsive or do not load. Others such as Main and Shares seem to be loading fine.

 

Anyone have any advice on how to troubleshoot this?


Hi there,

 

Saw your email in to support and have read through this thread. You're definitely getting some good advice from folks in here. Johnnie's last post is especially important, although you can also use the iDRAC connection to print the log to the screen in real time with this command:

 

tail /var/log/syslog -f

 

Then when the server hard crashes, you can take a screenshot of what you see and post it back here. The trouble with crashes is that oftentimes the crash occurs before the events can be written to the log file, so having it printed to the screen can be just fast enough to capture the crash event in the log.
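If you'd rather not keep an iDRAC window open, the same thing can be captured from another machine over SSH; this is just a sketch, substitute your server's IP or hostname:

ssh root@YOUR-SERVER-IP tail -f /var/log/syslog | tee unraid-syslog.txt

That way the output also lands in a file on the other machine, which will survive the crash.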

 

It would be helpful to know when these issues started occurring. If your hardware has been fine for months and months and all of a sudden these issues come out of nowhere, it is likely the result of a problem in the hardware. It might need better cooling, cabling, etc., and keep in mind that the temperature warnings in Unraid are generic. You have to look up the actual hardware you're using to find the temp ratings and then manually set those temps in the disk settings page for those devices. Otherwise you may be giving yourself a false impression that heat isn't the issue when the drives are actually being forced to operate at a much higher temp than they should be.
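For the actual numbers, the live temperature of each drive can be read with smartctl; this is just an example, substitute your real device for sdX:

smartctl -A /dev/sdX | grep -i temp

The rated operating limits themselves come from the manufacturer's data sheet for that model, not from Unraid.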

 

Lastly, if you are banging your head against the wall on this, the best thing to do is to try reducing the number of components in your system setup (apps, containers, VMs, cache drive) until you isolate the issue. Try running the array without the cache assigned at all. Turn off all docker containers. Let it just run for a while and see if it crashes. Then slowly start turning services on one by one until you recreate the issue. Once you've found the key variable that causes the crash, we can try to replicate it to determine if it's a bug, or give you advice on how to solve it within the software via a configuration tweak.

17 hours ago, johnnie.black said:

You can enable this and post the syslog after a crash, but if it's a hardware problem most likely there won't be anything logged before crashing.

 

5 hours ago, jonp said:

Hi there,

Saw your email in to support and have read through this thread. [...]

Thanks, guys, for the responses. I've gone and set up a remote syslog server, and I'm also tailing the syslog in my iDRAC console. I'll post back with any information on the next crash.

 

@jonp - this issue did just start happening in the past few weeks, after more than a year of running perfectly stable. I did notice the cache SSD temperature warnings and realized that Unraid has generic temperature warning levels. I found my SSDs should warn around 60C and go critical around 75C, so I have adjusted those levels; heat does not seem to be the issue.

 

I haven't made any hardware changes recently, and according to Grafana and the Unraid dashboard, my resources don't seem to be strapped. I've been transcoding a large video library from x264 to x265 using tdarr since January, which may be putting some strain on the CPUs. I've dropped it from 3 workers to 1 to lessen the strain, but it does not seem like anything has changed.


Hey guys, update here. I've been running the syslog server since last night. As far as I can tell, everything is still running smoothly, other than Grafana showing my cache read/write at a constant 0. I have noticed this usually happens before I notice any major issues. Most of my docker containers are running on the cache, so I believe there should be some constant read/write activity.
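(As a sanity check outside of Grafana, I'm assuming the counters in /proc/diskstats would confirm whether the cache devices are truly idle, e.g. by running this twice a few seconds apart, with sdX being one of the cache SSDs:

grep sdX /proc/diskstats

If the read/write counts aren't moving between runs, the pool really has stopped doing I/O.)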

 

I've attached my syslog output. There are plenty of critical errors such as the following:

 

2020-06-24 21:22:55	Kernel.Info	192.168.1.204	Jun 24 21:22:53 stevenet kernel: BTRFS info (device loop2): no csum found for inode 4004 start 0
2020-06-24 21:22:55	Kernel.Critical	192.168.1.204	Jun 24 21:22:53 stevenet kernel: BTRFS critical (device loop2): corrupt leaf: root=7 block=12937510912 slot=43, bad key order, prev (18446744073709551606 128 12001943552) current (18446744073709551606 16 72057081092011947)
2020-06-24 21:22:55	Kernel.Critical	192.168.1.204	Jun 24 21:22:53 stevenet kernel: BTRFS critical (device loop2): corrupt leaf: root=7 block=12937510912 slot=43, bad key order, prev (18446744073709551606 128 12001943552) current (18446744073709551606 16 72057081092011947)
2020-06-24 21:22:55	Kernel.Info	192.168.1.204	Jun 24 21:22:53 stevenet kernel: BTRFS info (device loop2): no csum found for inode 4004 start 4096

After some googling, it appears loop2 would be my docker.img? Could this be the cause of my system crashing on me? What would my troubleshooting steps be here, delete my docker image and reinstall my containers?
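(If it helps confirm, I believe the mapping from loop devices to backing files can be checked with:

losetup -a

which should show whether /dev/loop2 is in fact backed by the docker.img file.)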

 

Or am I going down the wrong path here?

SyslogCatchAll-2020-06-24.txt SyslogCatchAll-2020-06-25.txt

1 hour ago, johnnie.black said:

It is the docker image.

Thanks. I've deleted the docker image and re-installed my containers from Community Applications. I stepped away during the install of the containers. It looks like it finished, as all my containers are there, but sometime after that the system had a kernel panic and became unresponsive. I had to power cycle via the iDRAC.

 

I've attached a screenshot of the unresponsive console with the kernel panic, as well as the syslog output from today.

 

After power cycling, Unraid is back up and everything seems to be running OK for now. I'll continue to monitor and see what happens.
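(Next thing I might check is the pool itself, since docker.img lives on the cache; assuming the default /mnt/cache mount point, btrfs dev stats /mnt/cache should show whether the devices are accumulating read/write errors, and btrfs scrub start /mnt/cache would verify checksums across the pool if things look suspect.)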

 

 

TeamViewer_3BRGMZ1r37.png

SyslogCatchAll-2020-06-25.txt

  • 2 weeks later...

Hey all, checking in here: the system has been stable for about 2 weeks now. It seems the culprit may have been a Pi-hole docker container I was running that was having issues of its own to begin with. Ever since removing that container, the server has been stable. Going to keep monitoring for a while, but I'm hopeful that may have been the solution. Appreciate the help!

