Debaser Posted June 13, 2020

Hey all, I've been having some strange issues lately that I'm not even sure how to properly describe. Over the past 2 weeks or so, my array has been becoming unresponsive, the webui stops loading all its elements, the docker service gets stuck, and the server won't reboot gracefully. About 3-4 times in this period I've had to power cycle the server via iDRAC, with no graceful shutdown. I can use the iDRAC console or ssh into Unraid to run a reboot command, but it just says the server is going down for a reboot and then sits there until I power cycle.

I believe the issue is with my cache drives. I have a 2x800GB SSD cache pool. They have been running hot often, but the temperature usually comes back down shortly after I get the notifications.

In my /root directory, I see a file called dead.letter. Here are the contents of that file:

Event: Unraid Parity check
Subject: Notice [STEVENET] - Parity check started
Description: Size: 8 TB
Importance: warning

fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error
fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error
fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error
fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error

When I console in through iDRAC, I see some messages like this (attached). My SMART checks all passed when I last ran them, which was a day or two ago, after I started experiencing these issues. Is this a sign of my cache drive(s) failing, or possibly something else? Not sure which steps to take next. Thanks
JorgeB Posted June 14, 2020

Please post the diagnostics: Tools -> Diagnostics
Debaser Posted June 14, 2020

Sorry, diagnostics attached.

stevenet-diagnostics-20200614-1835.zip
JorgeB Posted June 15, 2020

SAS2 LSI controllers don't currently support trim. See if you can connect the SSDs to another controller, like onboard AHCI SATA ports if available.
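If you want to verify whether trim is actually reachable on a given drive, lsblk can report discard support from the console (a quick check; I'm assuming the SSDs show up as /dev/sdb and /dev/sdc, so substitute your actual device names):

# zeros in the DISC-GRAN and DISC-MAX columns mean the kernel can't issue trim through that controller path
lsblk --discard /dev/sdb /dev/sdc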
Debaser Posted June 15, 2020

8 hours ago, johnnie.black said: SAS2 LSI controllers don't currently support trim...

Thanks for the info. I'll have to take a look when I get home; I thought I read that on the R720XD the onboard SATA is disabled.

How important is it to run trim on the SSDs? Would I be better off disabling the trim plugin if I can't use the onboard SATA?
JorgeB Posted June 15, 2020

2 minutes ago, Debaser said: how important is it to run trim on the SSDs?

That's not a question with a fast or easy answer; you can always google it and read the various opinions.

10 minutes ago, Debaser said: would I be better off disabling the trim plugin if I can't use the onboard SATA?

It won't cause any problems, but you might as well disable it.
Debaser Posted June 15, 2020

Gotcha, thanks for the info. So you're saying the SSD trim isn't likely what's causing these issues with Unraid? It's been stable for the last 24 hours or so, but it was pretty unstable for the past week.
JorgeB Posted June 15, 2020

No. Lack of trim can slow down writes, though that depends on how much writing you do and the SSDs used, and it can also have a small impact on SSD life. Other than that, it shouldn't be your problem.
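If you ever want to test trim by hand after moving the drives to different ports, a verbose fstrim run from the console will show whether it works (just a sanity check, not required):

# -v reports how many bytes were discarded; the same Remote I/O error as in dead.letter means the controller is still blocking trim
fstrim -v /mnt/cache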
Debaser Posted June 24, 2020

Since my last post about a week ago, my server has needed a hard reboot 2-3 more times. It's currently in a state where some containers and services are working but some are not; not sure what is going on. I'm pretty sure if I try a graceful shutdown at this point it will just hang and require another hard reboot.

Pages in Unraid such as Docker, Dashboard, etc. are unresponsive or do not load. Others such as Main and Shares seem to be loading fine.

Anyone have any advice on how to troubleshoot this?
JorgeB Posted June 24, 2020

You can enable the syslog server (Settings -> Syslog Server) and post the syslog after a crash, but if it's a hardware problem there most likely won't be anything logged before crashing.
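Once it's enabled, you can confirm the remote target is actually receiving entries with a one-line test from the Unraid console (assuming the syslog server listens at 192.168.1.100 on the default UDP port 514; substitute your own host and port):

# send a test entry to the remote syslog host; it should appear in the remote log within a second or two
logger -n 192.168.1.100 -P 514 "unraid remote syslog test"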
jonp Posted June 24, 2020

Hi there,

Saw your email into support and have read through this thread. You're definitely getting some good advice from folks in here. Johnnie's last post is especially important, and you can also use the iDRAC connection to print the log to the screen in real time with this command:

tail /var/log/syslog -f

Then when the server hard crashes, you can take a screenshot of what you see and post it back here. The trouble with crashes is that often the crash occurs before the events can be written to the log file, so having the log printed to the screen can be just fast enough to capture the crash event.

It would be helpful to know when these issues started occurring. If your hardware has been fine for months and months and all of a sudden these issues come out of nowhere, it is likely the result of a problem in the hardware: it might need better cooling, cabling, etc. Also keep in mind that the temperature warnings in Unraid are generic. You have to look up the actual hardware you're using to find the temp ratings and then manually set those temps in the disk settings page for those devices. Otherwise you may be giving yourself a false impression that heat isn't the issue when perhaps the drives are being forced to operate at a much higher temp than they should be.

Lastly, if you are banging your head against the wall on this, the best thing to do is to try reducing the number of components in your system setup (apps, containers, VMs, cache drive) until you isolate the issue. Try running the array without the cache assigned at all. Turn off all docker containers. Let it just run for a while and see if it crashes. Then slowly start turning services on one by one until you recreate the issue. Once you've found the key variable that causes the crash, we can try to reproduce it if it's a bug, or give you advice on how to solve it within the software via a configuration tweak.
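One small addition to that tail command: if the console scrolls too fast to read, you can filter the stream down to the lines that usually precede a hang (just an example filter; broaden it if it hides too much):

# follow the syslog but only print likely-relevant lines
tail -f /var/log/syslog | grep -iE 'error|warn|panic|btrfs|call trace'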
Debaser Posted June 25, 2020

17 hours ago, johnnie.black said: You can enable the syslog server (Settings -> Syslog Server) and post the syslog after a crash...

5 hours ago, jonp said: Saw your email into support and have read through this thread...

Thanks guys for the responses. I've set up a remote syslog server, as well as tailing the syslog in my iDRAC console. I'll post back with any information on the next crash.

@jonp - this issue did just start happening in the past few weeks, after more than a year of running perfectly stable. I did notice the cache SSD temperature warnings and realized that Unraid has some generic temperature warning levels. I found my SSDs should warn around 60C and go critical around 75C, so I have adjusted those levels; heat does not seem to be the issue. I haven't made any hardware changes recently, and according to Grafana and the Unraid dashboard, my resources don't seem to be strapped. I've been transcoding a large video library from x264 to x265 using tdarr since January, which may be putting some strain on the CPUs. I've dropped it from 3 workers to 1 to lessen the strain, but it does not seem like anything has changed.
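As a cross-check on those thresholds, the drive's live temperature can be read straight from SMART at the console (assuming a cache SSD at /dev/sdb; substitute your device):

# prints the current temperature attribute so it can be compared against the dashboard reading
smartctl -A /dev/sdb | grep -i temp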
Debaser Posted June 25, 2020

Hey guys, update here. I've been running the syslog server since last night. As far as I can tell everything is still running smoothly, other than Grafana showing my cache read/write at a constant 0. I have noticed this usually happens before I notice any major issues; most of my docker containers run on the cache, so I believe there should be some constant read/write activity.

I've attached my syslog output. There are plenty of critical errors such as the following:

2020-06-24 21:22:55 Kernel.Info 192.168.1.204 Jun 24 21:22:53 stevenet kernel: BTRFS info (device loop2): no csum found for inode 4004 start 0
2020-06-24 21:22:55 Kernel.Critical 192.168.1.204 Jun 24 21:22:53 stevenet kernel: BTRFS critical (device loop2): corrupt leaf: root=7 block=12937510912 slot=43, bad key order, prev (18446744073709551606 128 12001943552) current (18446744073709551606 16 72057081092011947)
2020-06-24 21:22:55 Kernel.Critical 192.168.1.204 Jun 24 21:22:53 stevenet kernel: BTRFS critical (device loop2): corrupt leaf: root=7 block=12937510912 slot=43, bad key order, prev (18446744073709551606 128 12001943552) current (18446744073709551606 16 72057081092011947)
2020-06-24 21:22:55 Kernel.Info 192.168.1.204 Jun 24 21:22:53 stevenet kernel: BTRFS info (device loop2): no csum found for inode 4004 start 4096

After some googling, it appears loop2 would be my docker.img? Could this be the cause of my system crashing on me? What would my troubleshooting steps be here: delete my docker image and reinstall my containers? Or am I going down the wrong path?

SyslogCatchAll-2020-06-24.txt SyslogCatchAll-2020-06-25.txt
JorgeB Posted June 25, 2020

3 minutes ago, Debaser said: it appears loop2 would be my docker.img?

It usually is, but we'd need the full diags to confirm, and if it is, it needs to be recreated.
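You can also check directly from the console; losetup lists every loop device alongside the file backing it (on a typical setup loop2 points at the docker image, but check your own output):

# list loop devices and their backing files
losetup -l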
Debaser Posted June 25, 2020

10 minutes ago, johnnie.black said: It usually is, but we'd need the full diags to confirm...

Got it, thanks. Diagnostics attached.

stevenet-diagnostics-20200625-1059.zip
Debaser Posted June 25, 2020

1 hour ago, johnnie.black said: It is the docker image.

Thanks. I've deleted the docker image and re-installed my containers from Community Applications. I stepped away during the install of the containers. It looks like it finished, as all my containers are there, but sometime afterwards the system had a kernel panic and became unresponsive. I had to do a power cycle via the iDRAC.

I've attached a screenshot of the unresponsive console with the kernel panic, as well as the syslog output from today. After power cycling, Unraid is back up and everything seems to be running OK for now. I'll continue to monitor and see what happens.

SyslogCatchAll-2020-06-25.txt
Debaser Posted July 1, 2020

So, I had a few good days with zero issues, then around 1am today (Tuesday) the server crashed and went unresponsive again, and I had to power cycle through the iDRAC interface.

Here's the syslog from the past 2 days, plus another kernel panic on the console output. Any ideas?

SyslogCatchAll-2020-06-29.txt SyslogCatchAll-2020-06-30.txt
JorgeB Posted July 1, 2020

The crash appears to be network related. Try simplifying your network as much as possible to test: remove any bonding if enabled, as well as any dockers with custom IP addresses.
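To see what you're currently running in that regard, both are easy to check from the console (assuming the bond, if any, uses the default name bond0):

# shows the bonding mode and member NICs if bonding is active (the file won't exist if it isn't)
cat /proc/net/bonding/bond0
# lists docker networks; containers with custom IP addresses typically sit on a macvlan network here
docker network ls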
Debaser Posted July 16, 2020

Hey all, checking in here. The system has been stable for about 2 weeks now. It seems the culprit may have been a Pi-hole docker container I was running, which was having issues on its own to begin with. Ever since removing that container, the server has been running stable. Going to keep monitoring for a while, but hopeful that may have been the solution. Appreciate the help!