Airwu Posted February 17

My Unraid server has recently become unstable, and I have been unable to access the system.

Timeline:
1. I bought the server about a year ago, and it ran Unraid without any issues until last month.
2. About a month ago, the server began rebooting randomly (approximately every 3 days).
3. About a week ago, several CPU cores randomly went to 100% usage (also approximately every 3 days). When this happened, the web page became inaccessible. SSH login was still possible, but the system did not respond to the reboot command and could only be restarted using reboot -nf.
4. Starting today, the system shows several cores at 100% usage immediately after every restart.

Attempted Solutions:
1. Passed the memtest.
2. The disks passed the extended SMART test.
3. Errors previously indicated by "zpool status -v" have been fixed.
4. After executing the diagnostics command, there was no response.
5. I can't access my data.
6. I found this message in syslog:

Feb 17 21:19:49 Tower root: Starting Nginx server daemon...
Feb 17 21:19:49 Tower kernel: mdcmd (31): set md_num_stripes 1280
Feb 17 21:19:49 Tower kernel: mdcmd (32): set md_queue_limit 80
Feb 17 21:19:49 Tower kernel: mdcmd (33): set md_sync_limit 5
Feb 17 21:19:49 Tower kernel: mdcmd (34): set md_write_method 1
Feb 17 21:19:49 Tower kernel: mdcmd (35): start STOPPED
Feb 17 21:19:49 Tower kernel: unraid: allocating 15750K for 1280 stripes (3 disks)
Feb 17 21:19:49 Tower kernel: md1p1: running, size: 3907018532 blocks
Feb 17 21:19:49 Tower emhttpd: shcmd (27): udevadm settle
Feb 17 21:19:49 Tower emhttpd: Opening encrypted volumes...
Feb 17 21:19:49 Tower emhttpd: shcmd (28): touch /boot/config/forcesync
Feb 17 21:19:49 Tower emhttpd: Mounting disks...
Feb 17 21:19:49 Tower emhttpd: mounting /mnt/disk1
Feb 17 21:19:49 Tower emhttpd: shcmd (29): mkdir -p /mnt/disk1
Feb 17 21:19:49 Tower emhttpd: /usr/sbin/zpool import -f -d /dev/md1p1 2>&1
Feb 17 21:19:52 Tower emhttpd: pool: disk1
Feb 17 21:19:52 Tower emhttpd: id: 9902428395924024116
Feb 17 21:19:52 Tower emhttpd: shcmd (30): /usr/sbin/zpool import -f -N -o autoexpand=on -d /dev/md1p1 9902428395924024116 disk1
Feb 17 21:19:58 Tower emhttpd: shcmd (31): /usr/sbin/zpool online -e disk1 /dev/md1p1
Feb 17 21:19:58 Tower rsyslogd: action 'action-3-builtin:omfwd' resumed (module 'builtin:omfwd') [v8.2102.0 try https://www.rsyslog.com/e/2359 ]
Feb 17 21:19:59 Tower emhttpd: /usr/sbin/zpool status -PL disk1 2>&1
Feb 17 21:19:59 Tower emhttpd: pool: disk1
Feb 17 21:19:59 Tower emhttpd: state: ONLINE
Feb 17 21:19:59 Tower emhttpd: scan: scrub repaired 0B in 03:19:55 with 0 errors on Thu Feb 15 01:19:56 2024
Feb 17 21:19:59 Tower emhttpd: config:
Feb 17 21:19:59 Tower emhttpd: NAME STATE READ WRITE CKSUM
Feb 17 21:19:59 Tower emhttpd: disk1 ONLINE 0 0 0
Feb 17 21:19:59 Tower emhttpd: /dev/md1p1 ONLINE 0 0 0
Feb 17 21:19:59 Tower emhttpd: errors: No known data errors
Feb 17 21:19:59 Tower emhttpd: shcmd (32): /usr/sbin/zfs set mountpoint=/mnt/disk1 disk1
Feb 17 21:19:59 Tower emhttpd: shcmd (33): /usr/sbin/zfs set atime=off disk1
Feb 17 21:19:59 Tower emhttpd: shcmd (34): /usr/sbin/zfs mount disk1
Feb 17 21:19:59 Tower emhttpd: shcmd (35): /usr/sbin/zpool set autotrim=off disk1
Feb 17 21:19:59 Tower emhttpd: shcmd (36): /usr/sbin/zfs set compression=off disk1
Feb 17 21:20:00 Tower emhttpd: mounting /mnt/nvme
Feb 17 21:20:00 Tower emhttpd: shcmd (37): mkdir -p /mnt/nvme
Feb 17 21:20:00 Tower emhttpd: shcmd (38): /usr/sbin/zpool import -f -N -o autoexpand=on -d /dev/nvme1n1p1 -d /dev/nvme0n1p1 7424498333111026621 nvme
Feb 17 21:20:00 Tower kernel: VERIFY3(rs_get_end(rs, rt) >= end) failed (115970260992 >= 58546911126228992)
Feb 17 21:20:00 Tower kernel: PANIC at range_tree.c:482:range_tree_remove_impl()
Feb 17 21:20:00 Tower kernel: Showing stack for process 9822
Feb 17 21:20:00 Tower kernel: CPU: 8 PID: 9822 Comm: metaslab_group_ Tainted: P O 6.1.74-Unraid #1
Feb 17 21:20:00 Tower kernel: Hardware name: Default string Default string/MS-Terminator B660M, BIOS H3.41G 04/29/2022
Feb 17 21:20:00 Tower kernel: Call Trace:
Feb 17 21:20:00 Tower kernel: <TASK>
Feb 17 21:20:00 Tower kernel: dump_stack_lvl+0x44/0x5c
Feb 17 21:20:00 Tower kernel: spl_panic+0xd0/0xe8 [spl]
Feb 17 21:20:00 Tower kernel: ? bt_grow_leaf+0xc3/0xd6 [zfs]
Feb 17 21:20:00 Tower kernel: ? zfs_btree_find_in_buf+0x4c/0x94 [zfs]
Feb 17 21:20:00 Tower kernel: ? zfs_btree_find+0x16d/0x1b0 [zfs]
Feb 17 21:20:00 Tower kernel: range_tree_remove_impl+0x1ea/0x406 [zfs]
Feb 17 21:20:00 Tower kernel: ? zio_wait+0x1ee/0x1fd [zfs]
Feb 17 21:20:00 Tower kernel: space_map_load_callback+0x70/0x79 [zfs]
Feb 17 21:20:00 Tower kernel: space_map_iterate+0x2d3/0x324 [zfs]
Feb 17 21:20:00 Tower kernel: ? spa_stats_destroy+0x16c/0x16c [zfs]
Feb 17 21:20:00 Tower kernel: space_map_load_length+0x93/0xcb [zfs]
Feb 17 21:20:00 Tower kernel: metaslab_load+0x33b/0x6e3 [zfs]
Feb 17 21:20:00 Tower kernel: ? _raw_spin_unlock_irqrestore+0x24/0x3a
Feb 17 21:20:00 Tower kernel: ? __wake_up_common_lock+0x88/0xbb
Feb 17 21:20:00 Tower kernel: metaslab_preload+0x4c/0x97 [zfs]
Feb 17 21:20:00 Tower kernel: taskq_thread+0x266/0x38a [spl]
Feb 17 21:20:00 Tower kernel: ? wake_up_q+0x44/0x44
Feb 17 21:20:00 Tower kernel: ? taskq_dispatch_delay+0x106/0x106 [spl]
Feb 17 21:20:00 Tower kernel: kthread+0xe4/0xef
Feb 17 21:20:00 Tower kernel: ? kthread_complete_and_exit+0x1b/0x1b
Feb 17 21:20:00 Tower kernel: ret_from_fork+0x1f/0x30
Feb 17 21:20:00 Tower kernel: </TASK>
Feb 17 21:20:02 Tower SysDrivers: SysDrivers Build Complete
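With a boot log this long, it can help to pull out only the kernel assertion lines and note which command was logged just before them. The following is a minimal Python sketch of that idea. The helper name `find_kernel_panics` and the regular expression are illustrative assumptions based on the syslog layout seen in this post (timestamp, host, source, message); they are not part of Unraid or ZFS.

```python
import re

# Assumed line layout, matching the excerpt in the post:
# "Feb 17 21:20:00 Tower kernel: message"
ENTRY = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<source>[^:]+): (?P<msg>.*)$"
)


def find_kernel_panics(lines):
    """Return (timestamp, message, previous non-kernel message) for every
    kernel line whose message starts with VERIFY or PANIC."""
    hits = []
    last_other = None
    for line in lines:
        m = ENTRY.match(line.strip())
        if m is None:
            continue  # skip anything that does not look like a syslog entry
        if m["source"] == "kernel":
            if m["msg"].startswith(("VERIFY", "PANIC")):
                hits.append((m["ts"], m["msg"], last_other))
        else:
            last_other = m["msg"]
    return hits


# Four lines copied from the log in the post.
SAMPLE = [
    "Feb 17 21:19:59 Tower emhttpd: shcmd (34): /usr/sbin/zfs mount disk1",
    "Feb 17 21:20:00 Tower emhttpd: shcmd (38): /usr/sbin/zpool import -f -N "
    "-o autoexpand=on -d /dev/nvme1n1p1 -d /dev/nvme0n1p1 7424498333111026621 nvme",
    "Feb 17 21:20:00 Tower kernel: VERIFY3(rs_get_end(rs, rt) >= end) failed "
    "(115970260992 >= 58546911126228992)",
    "Feb 17 21:20:00 Tower kernel: PANIC at range_tree.c:482:range_tree_remove_impl()",
]

for ts, msg, before in find_kernel_panics(SAMPLE):
    print(ts, "|", msg)
    print("    preceded by:", before)
```

On these sample lines, both hits are preceded by the import of the nvme pool, which matches what the full log shows: disk1 imports and mounts cleanly, and the panic appears right after the nvme import starts.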
trurl Posted February 17
15 minutes ago, Airwu said: Passed the memtest
How long did you let it run?
Airwu Posted February 17
When I disable disk auto start, I can log in to the web page and run diagnostics: tower-diagnostics-20240217-2204.zip
Airwu Posted February 17
Just now, trurl said: How long did you let it run?
About 1 hour.
JorgeB Posted February 18
Symptoms suggest a hardware issue. Memtest is only definitive if it finds errors. Try running the server with just one stick of RAM; if the problem is the same, try the other one. That will basically rule out a RAM issue.
Airwu Posted February 18
1 hour ago, JorgeB said: Symptoms suggest a hardware issue. Memtest is only definitive if it finds errors. Try running the server with just one stick of RAM; if the problem is the same, try the other one. That will basically rule out a RAM issue.
I'll try later, thank you.
Airwu Posted February 18 (Solution)
Maybe this is the ZFS filesystem crashing or a ZFS bug. I put my NVMe disk into another computer running the latest Ubuntu and then used zpool to open it, and the system crashed again. I couldn't read my data. I found a GitHub page about this crash, https://github.com/openzfs/zfs/issues/13483 , where someone had the same crash as me, and I used these ZFS values to open the disk and read my data:
vfs.zfs.spa.load_verify_data=0
vfs.zfs.spa.load_verify_metadata=0
vfs.zfs.recover=1
vfs.zfs.zil.replay_disable=1
Now I am changing all my disks from zfs to btrfs.
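The four values quoted in this post use the FreeBSD sysctl naming (`vfs.zfs.*`). On Linux, OpenZFS exposes tunables as kernel module parameters instead, normally visible under /sys/module/zfs/parameters/. The Python sketch below only does the renaming and prints a module options line. The name mapping is my assumption of the Linux equivalents, and the helper names are hypothetical, so check that each parameter exists on the actual system before relying on it.

```python
# FreeBSD sysctl name -> Linux OpenZFS module parameter.
# Assumed equivalents; verify against /sys/module/zfs/parameters/ first.
FREEBSD_TO_LINUX = {
    "vfs.zfs.spa.load_verify_data": "spa_load_verify_data",
    "vfs.zfs.spa.load_verify_metadata": "spa_load_verify_metadata",
    "vfs.zfs.recover": "zfs_recover",
    "vfs.zfs.zil.replay_disable": "zil_replay_disable",
}

# The values exactly as quoted in the post.
QUOTED = """
vfs.zfs.spa.load_verify_data=0
vfs.zfs.spa.load_verify_metadata=0
vfs.zfs.recover=1
vfs.zfs.zil.replay_disable=1
"""


def parse_tunables(text):
    """Parse whitespace-separated name=value pairs into a dict (order kept)."""
    pairs = {}
    for token in text.split():
        name, _, value = token.partition("=")
        pairs[name] = value
    return pairs


def modprobe_options(tunables):
    """Render one 'options zfs ...' line, e.g. for a file in /etc/modprobe.d/."""
    parts = [f"{FREEBSD_TO_LINUX[name]}={value}" for name, value in tunables.items()]
    return "options zfs " + " ".join(parts)


def sysfs_paths(tunables):
    """Map each setting to the sysfs file where the Linux parameter would live."""
    return {
        f"/sys/module/zfs/parameters/{FREEBSD_TO_LINUX[name]}": value
        for name, value in tunables.items()
    }


print(modprobe_options(parse_tunables(QUOTED)))
# prints: options zfs spa_load_verify_data=0 spa_load_verify_metadata=0 zfs_recover=1 zil_replay_disable=1
```

These settings relax safety checks during import, so they are best treated as a temporary rescue measure. A cautious approach would be to import the pool read-only (`zpool import -o readonly=on`), copy the data off, and then remove the settings again.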
JorgeB Posted February 19
Or the zfs filesystem may be corrupt; that can also happen as a symptom of an underlying hardware issue, like bad RAM.
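The two numbers in the VERIFY3 line from the first post give a rough hint in this direction. The left value, 115970260992, is about 108 GiB, which is a plausible offset on an NVMe pool, while the right value, 58546911126228992, is about 52 PiB. The Python sketch below splits the large value at bit 52. The cut-off is an arbitrary choice for illustration and `split_high_bits` is a hypothetical helper, not anything from ZFS.

```python
def split_high_bits(value, low_bits=52):
    """Split value into (high, low): the bits above low_bits, and the rest."""
    mask = (1 << low_bits) - 1
    return value >> low_bits, value & mask


# Values copied from the log line:
# VERIFY3(rs_get_end(rs, rt) >= end) failed (115970260992 >= 58546911126228992)
rs_end = 115970260992        # left side of the failed check
end = 58546911126228992      # right side of the failed check

high, low = split_high_bits(end)
print("high bits:", hex(high))
print("low part :", low)
print("low part minus rs_end:", low - rs_end)
```

The low part comes out close to the left-hand value, with only the top bits making the number huge. That pattern is consistent with stray high bits in a stored offset, which fits the idea of on-disk corruption, but it does not show whether RAM, the drive, or a software bug introduced it.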
Airwu Posted February 19
5 hours ago, JorgeB said: Or the zfs filesystem may be corrupt; that can also happen as a symptom of an underlying hardware issue, like bad RAM.
I use memory without ECC, so maybe my system is not suitable for ZFS.
JorgeB Posted February 19
It's suitable if it's working correctly, but IMHO ECC is always better, especially if you care for data integrity.
Airwu Posted February 22
On 2/19/2024 at 10:56 PM, JorgeB said: It's suitable if it's working correctly, but IMHO ECC is always better, especially if you care for data integrity.
I ran memtest last night for about 12 hours and found no errors.
itimpi Posted February 22
6 hours ago, Airwu said: I ran memtest last night for about 12 hours and found no errors
That is good, but it is worth mentioning that although a failure in memtest is definitive, it is still possible to pass it and have RAM issues when the system is under heavy load.
itimpi Posted February 22
5 minutes ago, Airwu said: so, what can I do?
It all depends on whether you continue to get problems. If it still looks like they may be RAM based, then one possibility is to run with fewer sticks of RAM, as that puts less load on the memory controller. If you still get the same failures, that tends to rule out RAM issues. You should also make sure that there is no overclocking going on and that the RAM is not being clocked faster than the motherboard/CPU is capable of handling. It is easy to get RAM rated at a higher speed than the motherboard/CPU can support, and a lot of people just assume they can run the RAM at the rated speed without taking motherboard/CPU limitations into consideration.
Airwu Posted February 22
I understand, thank you.