Experiencing unresponsive array among other issues



Hey all, been having some strange issues lately that I'm not even sure how to properly describe.

 

Over the past 2 weeks or so, I've been experiencing issues like the array becoming unresponsive, the webUI not loading all elements, the docker service getting stuck, and being unable to reboot gracefully. About 3-4 times in this period I've had to power cycle the server via iDRAC because a graceful shutdown wouldn't complete. I can use the iDRAC console or SSH into Unraid to run a reboot command, but it just says the server is going down for a reboot and then sits there until I power cycle.

 

I believe the issue is with my cache drives. I have a 2x800GB SSD cache pool. They have been running hot fairly often, but the temperature usually comes back down shortly after I get the notifications.

 

In my /root directory, I see a file called dead.letter. Here are the contents of that file:

Event: Unraid Parity check
Subject: Notice [STEVENET] - Parity check started
Description: Size: 8 TB
Importance: warning

fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error
fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error
fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error
fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error
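(For what it's worth, I believe that same trim call can be run by hand to see if the error is reproducible, assuming the pool is still mounted at the default /mnt/cache:

fstrim -v /mnt/cache

On a healthy pool that should report how many bytes were trimmed instead of the Remote I/O error.)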

When I console in through iDRAC, I see some messages like this (attached).

 

My SMART checks all passed when I last ran them a day or two ago, after I started experiencing these issues.

 

Is this a sign of my cache drive(s) failing, or possibly something else? Not sure which steps to take next.

 

Thanks

Screen Shot 2020-06-13 at 5.44.11 PM.png

8 hours ago, johnnie.black said:

SAS2 LSI controllers don't currently support trim; see if you can connect the SSDs to another controller, like onboard AHCI SATA ports if available.

Thanks for the info. I'll have to take a look when I get home. I thought I read that on the R720xd the onboard SATA is disabled.
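(I'm assuming I can check from the console which controller each disk is actually hanging off of with something like:

ls -l /dev/disk/by-path/

since the by-path names should indicate whether a device is behind the SAS HBA or an onboard ata/AHCI port, and lspci should show whether an onboard SATA controller is even exposed on this board.)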

 

how important is it to run trim on the SSDs? would I be better off disabling the trim plugin if I can't use the onboard SATA?

2 minutes ago, Debaser said:

how important is it to run trim on the SSDs?

That's not a question with a fast or easy answer; you can always google it and read the various opinions.

 

10 minutes ago, Debaser said:

would I be better off disabling the trim plugin if I can't use the onboard SATA?

It won't cause any problems, but you might as well disable it.

 

  • 2 weeks later...

Since my last post about a week ago, my server has needed a hard reboot 2-3 more times. It's currently in a state where some containers and services are working, but some are not. Not sure what is going on. I'm pretty sure if I try a graceful shutdown at this point it will just hang and require another hard reboot.

 

Pages in Unraid such as Docker, Dashboard, etc. are unresponsive or do not load. Others such as Main and Shares seem to be loading fine.

 

Anyone have any advice on how to troubleshoot this?


Hi there,

 

Saw your email in to support and have read through this thread. You're definitely getting some good advice from folks in here. Johnnie's last post is especially important, although you can also use the iDRAC connection to print the log to the screen in real time with this command:

 

tail /var/log/syslog -f

 

Then when the server hard crashes, you can take a screenshot of what you see and post it back here. The trouble with crashes is that oftentimes the crash occurs before the events can be written to the log file, so having it printed to the screen can be just fast enough to capture the crash event in the log.
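If you'd rather not keep an iDRAC window open, the same thing can be captured from another machine over SSH; this is just a sketch, substitute your server's IP or hostname:

ssh root@YOUR-SERVER-IP tail -f /var/log/syslog | tee unraid-syslog.txt

That way the output also lands in a file on the other machine, which will survive the crash.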

 

It would be helpful to know when these issues started occurring. If your hardware has been fine for months and months and all of a sudden these issues come out of nowhere, it is likely the result of a problem in the hardware. It might need better cooling, cabling, etc., and keep in mind that the temperature warnings in Unraid are generic. You have to look up the actual hardware you're using to find the temp ratings and then manually set those temps in the disk settings page for those devices. Otherwise you may be giving yourself a false impression that heat isn't the issue when the drives are actually being forced to operate at a much higher temp than they should be.
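For the actual numbers, the live temperature of each drive can be read with smartctl; this is just an example, substitute your real device for sdX:

smartctl -A /dev/sdX | grep -i temp

The rated operating limits themselves come from the manufacturer's data sheet for that model, not from Unraid.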

 

Lastly, if you are banging your head against the wall on this, the best thing to do is to try reducing the number of components in your system setup (apps, containers, VMs, cache drive) until you isolate the issue. Try running the array without the cache assigned at all. Turn off all docker containers. Let it just run for a while and see if it crashes. Then slowly start turning services on one by one until you recreate the issue. Once you've found the key variable that causes the crash, we can try to replicate it to determine if it's a bug, or give you advice on how to solve it within the software via a configuration tweak.

17 hours ago, johnnie.black said:

You can enable this and post the syslog after a crash, but if it's a hardware problem most likely there won't be anything logged before crashing.

 

5 hours ago, jonp said:

Hi there,

Saw your email in to support and have read through this thread. [...]

Thanks, guys, for the responses. I've gone and set up a remote syslog server, and I'm also tailing the syslog in my iDRAC console. I'll post back with any information on the next crash.

 

@jonp - this issue did just start happening in the past few weeks, after more than a year of running perfectly stable. I did notice the cache SSD temperature warnings and realized that Unraid has generic temperature warning levels. I found my SSDs should warn around 60C and go critical around 75C, so I have adjusted those levels; heat does not seem to be the issue.

 

I haven't made any hardware changes recently, and according to Grafana and the Unraid dashboard, my resources don't seem to be strapped. I've been transcoding a large video library from x264 to x265 using tdarr since January, which may be putting some strain on the CPUs. I've dropped it from 3 workers to 1 to lessen the strain, but it does not seem like anything has changed.


Hey guys, update here. I've been running the syslog server since last night. As far as I can tell, everything is still running smoothly, other than Grafana showing my cache read/write at a constant 0. I have noticed this usually happens before I notice any major issues. Most of my docker containers are running on the cache, so I believe there should be some constant read/write activity.
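(As a sanity check outside of Grafana, I'm assuming the counters in /proc/diskstats would confirm whether the cache devices are truly idle, e.g. by running this twice a few seconds apart, with sdX being one of the cache SSDs:

grep sdX /proc/diskstats

If the read/write counts aren't moving between runs, the pool really has stopped doing I/O.)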

 

I've attached my syslog output. There are plenty of critical errors such as the following:

 

2020-06-24 21:22:55	Kernel.Info	192.168.1.204	Jun 24 21:22:53 stevenet kernel: BTRFS info (device loop2): no csum found for inode 4004 start 0
2020-06-24 21:22:55	Kernel.Critical	192.168.1.204	Jun 24 21:22:53 stevenet kernel: BTRFS critical (device loop2): corrupt leaf: root=7 block=12937510912 slot=43, bad key order, prev (18446744073709551606 128 12001943552) current (18446744073709551606 16 72057081092011947)
2020-06-24 21:22:55	Kernel.Critical	192.168.1.204	Jun 24 21:22:53 stevenet kernel: BTRFS critical (device loop2): corrupt leaf: root=7 block=12937510912 slot=43, bad key order, prev (18446744073709551606 128 12001943552) current (18446744073709551606 16 72057081092011947)
2020-06-24 21:22:55	Kernel.Info	192.168.1.204	Jun 24 21:22:53 stevenet kernel: BTRFS info (device loop2): no csum found for inode 4004 start 4096

After some googling, it appears loop2 would be my docker.img? Could this be the cause of my system crashing on me? What would my troubleshooting steps be here, delete my docker image and reinstall my containers?
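(If it helps confirm, I believe the mapping from loop devices to backing files can be checked with:

losetup -a

which should show whether /dev/loop2 is in fact backed by the docker.img file.)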

 

Or am I going down the wrong path here?

SyslogCatchAll-2020-06-24.txt SyslogCatchAll-2020-06-25.txt

1 hour ago, johnnie.black said:

It is the docker image.

Thanks. I've deleted the docker image and re-installed my containers from Community Applications. I stepped away during the install of the containers. It looks like it finished, as all my containers are there, but sometime after that the system had a kernel panic and became unresponsive. I had to power cycle via the iDRAC.

 

I've attached a screenshot of the unresponsive console with the kernel panic, as well as the syslog output from today.

 

After power cycling, Unraid is back up and everything seems to be running OK for now. I'll continue to monitor and see what happens.
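(Next thing I might check is the pool itself, since docker.img lives on the cache; assuming the default /mnt/cache mount point, btrfs dev stats /mnt/cache should show whether the devices are accumulating read/write errors, and btrfs scrub start /mnt/cache would verify checksums across the pool if things look suspect.)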

 

 

TeamViewer_3BRGMZ1r37.png

SyslogCatchAll-2020-06-25.txt

  • 2 weeks later...

Hey all, checking in here: the system has been stable for about 2 weeks now. It seems the culprit may have been a Pi-hole docker container I was running that was having issues of its own to begin with. Ever since removing that container, the server has been stable. Going to keep monitoring for a while, but I'm hopeful that may have been the solution. Appreciate the help!

