DBJordan Posted June 20, 2022 Share Posted June 20, 2022 (edited) Unraid freezes multiple times a week and I have to cold reboot it. (This isn't just the http server going down -- it won't respond to key presses from a keyboard directly connected to the server.) This configuration used to work for months at a time, but it seems something has gone wrong. I've run memtest86 overnight with no findings. I'm not sure what else to try. Any ideas? truesource-diagnostics-20220620-1353.zip Edited June 20, 2022 by DBJordan Quote Link to comment
JorgeB Posted June 21, 2022 Share Posted June 21, 2022 Enable the syslog server and post that after a crash. Quote Link to comment
DBJordan Posted June 21, 2022 Author Share Posted June 21, 2022 Good idea. I've set it up and started the array. It usually takes a few days, so I'll post once I see a freeze. Thanks! Quote Link to comment
DBJordan Posted June 22, 2022 Author Share Posted June 22, 2022 It didn't hard crash yet, but after the parity check completed it did take the array offline for some reason. I started the array back up. Here are the logs at this point. I'll post again once I see it go completely unresponsive. SyslogCatchAll-2022-06-21.txt SyslogCatchAll-2022-06-22.txt Quote Link to comment
JorgeB Posted June 23, 2022 Share Posted June 23, 2022 Btrfs is detecting data corruption (more info here). You should set the RAM to 1866MT/s as explained here, and also check the power supply idle control setting. After the RAM is correctly set, reset the btrfs stats and keep monitoring; if more corruption is found, run memtest. Quote Link to comment
DBJordan Posted June 25, 2022 Author Share Posted June 25, 2022 Thanks for the help! I was able to set RAM to 1866 but couldn't find an option to handle power supply idle control or c-states. I tried this: Started here and picked "CPU Configuration". Once in there, I changed C6 mode from "enabled" to "disabled." Also, btrfs scrub detected some irreparable errors in the syslog:
22-06-25 17:42:53 Kernel.Info 172.16.100.100 Jun 25 17:42:53 Truesource kernel: BTRFS info (device nvme0n1p1): device stats zeroed by btrfs (25391)
2022-06-25 17:42:53 Kernel.Info 172.16.100.100 Jun 25 17:42:53 Truesource kernel: BTRFS info (device nvme0n1p1): device stats zeroed by btrfs (25391)
2022-06-25 17:42:56 Kernel.Info 172.16.100.100 Jun 25 17:42:56 Truesource kernel: BTRFS info (device nvme0n1p1): device stats zeroed by btrfs (25402)
2022-06-25 17:42:56 Kernel.Info 172.16.100.100 Jun 25 17:42:56 Truesource kernel: BTRFS info (device nvme0n1p1): device stats zeroed by btrfs (25402)
2022-06-25 17:43:10 Kernel.Info 172.16.100.100 Jun 25 17:43:09 Truesource kernel: BTRFS info (device nvme0n1p1): scrub: started on devid 1
2022-06-25 17:43:10 Kernel.Info 172.16.100.100 Jun 25 17:43:09 Truesource kernel: BTRFS info (device nvme0n1p1): scrub: started on devid 2
2022-06-25 17:44:00 Kernel.Warning 172.16.100.100 Jun 25 17:44:00 Truesource kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 932456402944 on dev /dev/nvme0n1p1, physical 222713057280, root 5, inode 6909694, offset 2324074496, length 4096, links 1 (path: PRIVATE)
2022-06-25 17:44:00 Kernel.Error 172.16.100.100 Jun 25 17:44:00 Truesource kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
2022-06-25 17:44:00 Kernel.Error 172.16.100.100 Jun 25 17:44:00 Truesource kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 932456402944 on dev /dev/nvme0n1p1
2022-06-25 17:44:40 Kernel.Warning 172.16.100.100 Jun 25 17:44:40 Truesource kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1262362894336 on dev /dev/nvme0n1p1, physical 574094385152, root 5, inode 13617139, offset 347021312, length 4096, links 1 (path: PRIVATE)
2022-06-25 17:44:40 Kernel.Error 172.16.100.100 Jun 25 17:44:40 Truesource kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
2022-06-25 17:44:40 Kernel.Error 172.16.100.100 Jun 25 17:44:40 Truesource kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1262362894336 on dev /dev/nvme0n1p1
2022-06-25 17:44:41 Kernel.Warning 172.16.100.100 Jun 25 17:44:40 Truesource kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1280042659840 on dev /dev/nvme0n1p1, physical 591774150656, root 5, inode 14185170, offset 340320256, length 4096, links 1 (path: PRIVATE)
2022-06-25 17:44:41 Kernel.Error 172.16.100.100 Jun 25 17:44:40 Truesource kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
2022-06-25 17:44:41 Kernel.Error 172.16.100.100 Jun 25 17:44:40 Truesource kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1280042659840 on dev /dev/nvme0n1p1
2022-06-25 17:44:41 Kernel.Info 172.16.100.100 Jun 25 17:44:40 Truesource kernel: BTRFS info (device nvme0n1p1): scrub: finished on devid 1 with status: 0
2022-06-25 17:46:02 Kernel.Warning 172.16.100.100 Jun 25 17:46:02 Truesource kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 932456402944 on dev /dev/nvme1n1p1, physical 222692085760, root 5, inode 6909694, offset 2324074496, length 4096, links 1 (path: PRIVATE)
2022-06-25 17:46:02 Kernel.Error 172.16.100.100 Jun 25 17:46:02 Truesource kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
2022-06-25 17:46:02 Kernel.Error 172.16.100.100 Jun 25 17:46:02 Truesource kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 932456402944 on dev /dev/nvme1n1p1
2022-06-25 17:48:19 Kernel.Warning 172.16.100.100 Jun 25 17:48:19 Truesource kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1262362894336 on dev /dev/nvme1n1p1, physical 574073413632, root 5, inode 13617139, offset 347021312, length 4096, links 1 (path: PRIVATE)
2022-06-25 17:48:19 Kernel.Error 172.16.100.100 Jun 25 17:48:19 Truesource kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
2022-06-25 17:48:19 Kernel.Error 172.16.100.100 Jun 25 17:48:19 Truesource kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1262362894336 on dev /dev/nvme1n1p1
2022-06-25 17:48:22 Kernel.Warning 172.16.100.100 Jun 25 17:48:21 Truesource kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1280042659840 on dev /dev/nvme1n1p1, physical 591753179136, root 5, inode 14185170, offset 340320256, length 4096, links 1 (path: PRIVATE)
2022-06-25 17:48:22 Kernel.Error 172.16.100.100 Jun 25 17:48:21 Truesource kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
2022-06-25 17:48:22 Kernel.Error 172.16.100.100 Jun 25 17:48:21 Truesource kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1280042659840 on dev /dev/nvme1n1p1
2022-06-25 17:48:22 Kernel.Info 172.16.100.100 Jun 25 17:48:22 Truesource kernel: BTRFS info (device nvme0n1p1): scrub: finished on devid 2 with status: 0
Quote Link to comment
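For anyone reading a scrub log like the one in the post above, a short script can make the pattern easier to see. What follows is a minimal sketch in Python and is not part of the original thread: the function name `summarize_checksum_errors` and the regular expression are illustrative assumptions, written only against the message format shown in this post. It groups checksum errors by inode and logical address and lists which devices reported each one.

```python
import re
from collections import defaultdict

# Pattern for the kernel's scrub checksum warning, as it appears in the log above.
# (Assumption: other kernel versions may word this message differently.)
CHECKSUM_RE = re.compile(
    r"checksum error at logical (?P<logical>\d+) on dev (?P<dev>[^,\s]+), "
    r"physical \d+, root \d+, inode (?P<inode>\d+)"
)


def summarize_checksum_errors(log_text):
    """Return {(inode, logical): [devices]} for every checksum warning found."""
    hits = defaultdict(set)
    for match in CHECKSUM_RE.finditer(log_text):
        key = (int(match["inode"]), int(match["logical"]))
        hits[key].add(match["dev"])
    return {key: sorted(devices) for key, devices in hits.items()}


# Two warnings copied from the log above, one per device.
SAMPLE = (
    "BTRFS warning (device nvme0n1p1): checksum error at logical 932456402944 "
    "on dev /dev/nvme0n1p1, physical 222713057280, root 5, inode 6909694, "
    "offset 2324074496, length 4096, links 1 (path: PRIVATE)\n"
    "BTRFS warning (device nvme0n1p1): checksum error at logical 932456402944 "
    "on dev /dev/nvme1n1p1, physical 222692085760, root 5, inode 6909694, "
    "offset 2324074496, length 4096, links 1 (path: PRIVATE)\n"
)

if __name__ == "__main__":
    for (inode, logical), devices in summarize_checksum_errors(SAMPLE).items():
        print(f"inode {inode} at logical {logical}: {', '.join(devices)}")
```

Run against the full log in the post, this kind of grouping shows the same three inodes failing on both /dev/nvme0n1p1 and /dev/nvme1n1p1, which would be consistent with scrub reporting that it was unable to fix them up.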
DBJordan Posted June 25, 2022 Author Share Posted June 25, 2022 Oh, lemme set memtest running, will let you know if anything pops up. Quote Link to comment
JorgeB Posted June 26, 2022 Share Posted June 26, 2022 9 hours ago, DBJordan said: Also, btrfs scrub detected some irreparable errors in the syslog: Delete/replace affected files, then reset the stats and keep monitoring for more errors. Quote Link to comment
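The "reset the stats and keep monitoring" step can also be checked from a script. The following is a rough Python sketch, not something posted in the thread: it assumes the usual `btrfs device stats <mountpoint>` output format of `[device].counter value` lines, and the helper names and the `/mnt/cache` mount point in the comment are hypothetical. The counters themselves are normally zeroed with `btrfs device stats -z <mountpoint>`.

```python
import subprocess


def parse_device_stats(output):
    """Parse `btrfs device stats` text into {device: {counter: value}}."""
    stats = {}
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("["):
            continue  # skip blank lines or anything that is not a counter line
        name, value = line.rsplit(None, 1)
        device, counter = name[1:].split("].", 1)
        stats.setdefault(device, {})[counter] = int(value)
    return stats


def nonzero_counters(stats):
    """Keep only the devices and counters that are above zero."""
    return {
        device: {name: count for name, count in counters.items() if count}
        for device, counters in stats.items()
        if any(counters.values())
    }


def read_stats(mountpoint):
    """Run `btrfs device stats` on a mount point (needs root and btrfs-progs)."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mountpoint],
        capture_output=True, text=True, check=True,
    )
    return parse_device_stats(result.stdout)


# Hypothetical usage on the server itself, assuming the pool is mounted at /mnt/cache:
#     problems = nonzero_counters(read_stats("/mnt/cache"))
#     if problems: print("new btrfs errors:", problems)
```

An empty result after a scrub would suggest no new errors have been counted since the last reset.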
DBJordan Posted June 26, 2022 Author Share Posted June 26, 2022 (edited) Ooo, sneaky. I'll give that a shot and see what happens. Thanks! Update: no errors found during scrub. Will keep capturing syslogs and see what happens next. Thanks! Edited June 26, 2022 by DBJordan Quote Link to comment
nadmax Posted June 26, 2022 Share Posted June 26, 2022 Hi, I also have this issue after upgrading to 6.10.3. Typically:
Server becomes unreachable on the network
Physically, server is on but unresponsive - even via direct keyboard / mouse interaction
I had to give it a hard reset a couple of times but it won't last 8 hours before freezing again, even the parity checks don't get a chance to complete
I have reverted to the previous version as my backups are stored on the Unraid environment and I couldn't afford to be without. After roll back, the server has been stable for just over 24 hours. I'll have a look at the BIOS settings as recommended above and change accordingly if needed. Quote Link to comment
JorgeB Posted June 27, 2022 Share Posted June 27, 2022 12 hours ago, nadmax said: I also have this issue after upgrading to 6.10.3. Typically: You should start your own thread, and post the diagnostics and the log from the syslog server after a crash. Quote Link to comment
DBJordan Posted July 17, 2022 Author Share Posted July 17, 2022 Still get some intermittent reboots a few times a week. Had a crash and reboot just before 0500. Auto-starting the array after unexpected shutdown is disabled, so the logs say nothing after reboot until I logged in to the webpage at 1123. Have noticed the automated mover exits with a 1. When I run it manually, it completes with return code of 0. Any thoughts on whether this is an indicator as to why the system is subsequently rebooting? Logs attached. SyslogCatchAll-2022-07-17.txt Quote Link to comment
DBJordan Posted July 23, 2022 Author Share Posted July 23, 2022 Hmm I'm going to try to replace the motherboard and cpu. They're getting long in the tooth, anyway. Will let you know how it goes! Quote Link to comment
JorgeB Posted July 24, 2022 Share Posted July 24, 2022 There's nothing relevant in the log posted; that usually indicates a hardware problem. Quote Link to comment
Solution DBJordan Posted August 4, 2022 Author Solution Share Posted August 4, 2022 No problems so far. Thanks for the help trying to get the old hardware working. Can you mark this one solved or whatever it is you do? I'll make a new topic if anything new pops up. Quote Link to comment