I can't log in to my server and can't access my data


Solved by Airwu

My Unraid server has recently become unstable, and I can no longer access the system.

 

Timeline:

1 I bought the server about a year ago, and it ran Unraid without any issues until last month.

2 About a month ago, the server began experiencing random reboots (approximately every 3 days).

3 About a week ago, the server started randomly showing several CPU cores at 100% usage (also approximately every 3 days). When this happened, the web UI became inaccessible; SSH login was still possible, but the system did not respond to the reboot command and could only be restarted with reboot -nf.

4 Starting today, immediately after restarting, the system again showed several cores at 100% usage.

 

 

Attempted Solutions:

1 Passed the memtest.

2 An extended SMART test on the disks passed.

3 Errors previously indicated by "zpool status -v" have been fixed.

4 Running the diagnostics command produced no response.

5 I can't access my data

6 I found this message in the syslog:

Feb 17 21:19:49 Tower root: Starting Nginx server daemon...
Feb 17 21:19:49 Tower kernel: mdcmd (31): set md_num_stripes 1280
Feb 17 21:19:49 Tower kernel: mdcmd (32): set md_queue_limit 80
Feb 17 21:19:49 Tower kernel: mdcmd (33): set md_sync_limit 5
Feb 17 21:19:49 Tower kernel: mdcmd (34): set md_write_method 1
Feb 17 21:19:49 Tower kernel: mdcmd (35): start STOPPED
Feb 17 21:19:49 Tower kernel: unraid: allocating 15750K for 1280 stripes (3 disks)
Feb 17 21:19:49 Tower kernel: md1p1: running, size: 3907018532 blocks
Feb 17 21:19:49 Tower emhttpd: shcmd (27): udevadm settle
Feb 17 21:19:49 Tower emhttpd: Opening encrypted volumes...
Feb 17 21:19:49 Tower emhttpd: shcmd (28): touch /boot/config/forcesync
Feb 17 21:19:49 Tower emhttpd: Mounting disks...
Feb 17 21:19:49 Tower emhttpd: mounting /mnt/disk1
Feb 17 21:19:49 Tower emhttpd: shcmd (29): mkdir -p /mnt/disk1
Feb 17 21:19:49 Tower emhttpd: /usr/sbin/zpool import -f -d /dev/md1p1 2>&1
Feb 17 21:19:52 Tower emhttpd:    pool: disk1
Feb 17 21:19:52 Tower emhttpd:      id: 9902428395924024116
Feb 17 21:19:52 Tower emhttpd: shcmd (30): /usr/sbin/zpool import -f -N -o autoexpand=on  -d /dev/md1p1 9902428395924024116 disk1
Feb 17 21:19:58 Tower emhttpd: shcmd (31): /usr/sbin/zpool online -e disk1 /dev/md1p1
Feb 17 21:19:58 Tower rsyslogd: action 'action-3-builtin:omfwd' resumed (module 'builtin:omfwd') [v8.2102.0 try https://www.rsyslog.com/e/2359 ]
Feb 17 21:19:59 Tower emhttpd: /usr/sbin/zpool status -PL disk1 2>&1
Feb 17 21:19:59 Tower emhttpd:   pool: disk1
Feb 17 21:19:59 Tower emhttpd:  state: ONLINE
Feb 17 21:19:59 Tower emhttpd:   scan: scrub repaired 0B in 03:19:55 with 0 errors on Thu Feb 15 01:19:56 2024
Feb 17 21:19:59 Tower emhttpd: config:
Feb 17 21:19:59 Tower emhttpd:  NAME          STATE     READ WRITE CKSUM
Feb 17 21:19:59 Tower emhttpd:  disk1         ONLINE       0     0     0
Feb 17 21:19:59 Tower emhttpd:    /dev/md1p1  ONLINE       0     0     0
Feb 17 21:19:59 Tower emhttpd: errors: No known data errors
Feb 17 21:19:59 Tower emhttpd: shcmd (32): /usr/sbin/zfs set mountpoint=/mnt/disk1 disk1
Feb 17 21:19:59 Tower emhttpd: shcmd (33): /usr/sbin/zfs set atime=off disk1
Feb 17 21:19:59 Tower emhttpd: shcmd (34): /usr/sbin/zfs mount disk1
Feb 17 21:19:59 Tower emhttpd: shcmd (35): /usr/sbin/zpool set autotrim=off disk1
Feb 17 21:19:59 Tower emhttpd: shcmd (36): /usr/sbin/zfs set compression=off disk1
Feb 17 21:20:00 Tower emhttpd: mounting /mnt/nvme
Feb 17 21:20:00 Tower emhttpd: shcmd (37): mkdir -p /mnt/nvme
Feb 17 21:20:00 Tower emhttpd: shcmd (38): /usr/sbin/zpool import -f -N -o autoexpand=on  -d /dev/nvme1n1p1 -d /dev/nvme0n1p1 7424498333111026621 nvme
Feb 17 21:20:00 Tower kernel: VERIFY3(rs_get_end(rs, rt) >= end) failed (115970260992 >= 58546911126228992)
Feb 17 21:20:00 Tower kernel: PANIC at range_tree.c:482:range_tree_remove_impl()
Feb 17 21:20:00 Tower kernel: Showing stack for process 9822
Feb 17 21:20:00 Tower kernel: CPU: 8 PID: 9822 Comm: metaslab_group_ Tainted: P           O       6.1.74-Unraid #1
Feb 17 21:20:00 Tower kernel: Hardware name: Default string Default string/MS-Terminator B660M, BIOS H3.41G 04/29/2022
Feb 17 21:20:00 Tower kernel: Call Trace:
Feb 17 21:20:00 Tower kernel: <TASK>
Feb 17 21:20:00 Tower kernel: dump_stack_lvl+0x44/0x5c
Feb 17 21:20:00 Tower kernel: spl_panic+0xd0/0xe8 [spl]
Feb 17 21:20:00 Tower kernel: ? bt_grow_leaf+0xc3/0xd6 [zfs]
Feb 17 21:20:00 Tower kernel: ? zfs_btree_find_in_buf+0x4c/0x94 [zfs]
Feb 17 21:20:00 Tower kernel: ? zfs_btree_find+0x16d/0x1b0 [zfs]
Feb 17 21:20:00 Tower kernel: range_tree_remove_impl+0x1ea/0x406 [zfs]
Feb 17 21:20:00 Tower kernel: ? zio_wait+0x1ee/0x1fd [zfs]
Feb 17 21:20:00 Tower kernel: space_map_load_callback+0x70/0x79 [zfs]
Feb 17 21:20:00 Tower kernel: space_map_iterate+0x2d3/0x324 [zfs]
Feb 17 21:20:00 Tower kernel: ? spa_stats_destroy+0x16c/0x16c [zfs]
Feb 17 21:20:00 Tower kernel: space_map_load_length+0x93/0xcb [zfs]
Feb 17 21:20:00 Tower kernel: metaslab_load+0x33b/0x6e3 [zfs]
Feb 17 21:20:00 Tower kernel: ? _raw_spin_unlock_irqrestore+0x24/0x3a
Feb 17 21:20:00 Tower kernel: ? __wake_up_common_lock+0x88/0xbb
Feb 17 21:20:00 Tower kernel: metaslab_preload+0x4c/0x97 [zfs]
Feb 17 21:20:00 Tower kernel: taskq_thread+0x266/0x38a [spl]
Feb 17 21:20:00 Tower kernel: ? wake_up_q+0x44/0x44
Feb 17 21:20:00 Tower kernel: ? taskq_dispatch_delay+0x106/0x106 [spl]
Feb 17 21:20:00 Tower kernel: kthread+0xe4/0xef
Feb 17 21:20:00 Tower kernel: ? kthread_complete_and_exit+0x1b/0x1b
Feb 17 21:20:00 Tower kernel: ret_from_fork+0x1f/0x30
Feb 17 21:20:00 Tower kernel: </TASK>
Feb 17 21:20:02 Tower SysDrivers: SysDrivers Build Complete

 

1 hour ago, JorgeB said:

Symptoms suggest a hardware issue. Memtest is only definitive if it finds an error, so try running the server with just one stick of RAM; if the problem persists, try the other one. That will basically rule out a RAM issue.

I'll try it later, thank you.

  • Solution

Maybe this is ZFS filesystem corruption or a ZFS bug.

I put my NVMe disks into another computer running the latest Ubuntu, then used zpool to import the pool, and that system crashed again. I couldn't read my data.

I found a GitHub issue about this crash, https://github.com/openzfs/zfs/issues/13483 , where someone hit the same crash as me, and I used these ZFS tunables to import the pool and read my data:

Quote

vfs.zfs.spa.load_verify_data=0

vfs.zfs.spa.load_verify_metadata=0

vfs.zfs.recover=1

vfs.zfs.zil.replay_disable=1
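Those vfs.zfs.* names are FreeBSD-style sysctls; on a Linux box (like the Ubuntu machine mentioned above) the same knobs are OpenZFS kernel module parameters. A rough sketch of the equivalent recovery attempt follows; the read-only import is an extra precaution I am assuming here, not something the poster states they used, and the pool name nvme is taken from the syslog above:

```shell
# Run as root. zfs_recover tells ZFS to press on past normally-fatal
# on-disk inconsistencies -- use only to salvage data, never routinely.
echo 0 > /sys/module/zfs/parameters/spa_load_verify_data
echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata
echo 1 > /sys/module/zfs/parameters/zfs_recover
echo 1 > /sys/module/zfs/parameters/zil_replay_disable

# Import read-only so nothing is written back to the damaged pool,
# then copy the data off before destroying/recreating the pool.
zpool import -f -o readonly=on nvme
```

If the module is not yet loaded, the same values can be passed at load time (e.g. `modprobe zfs zfs_recover=1`).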

 

Now I am converting all my disks from ZFS to btrfs.

6 hours ago, Airwu said:

I ran memtest last night for about 12 hours and found no errors

That is good, but it is worth mentioning that although a failure in memtest is definitive, it is still possible to pass it and have RAM issues when the system is under heavy load.

5 minutes ago, Airwu said:

so, what can I do?

It all depends on whether you continue to get problems. If it still looks like they may be RAM based, then one option is to run with fewer sticks of RAM, as that puts less load on the memory controller; if you still get the same failures, that tends to rule out RAM issues.

 

You should also make sure that there is no overclocking going on and that the RAM is not being clocked faster than the motherboard/CPU can handle. It is easy to buy RAM rated at a higher speed than the system can support, and a lot of people assume they can run the RAM at its rated speed without taking motherboard/CPU limitations into account.
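One way to check this (my addition, not part of the reply above) is to compare each DIMM's rated speed with the speed it is actually configured to run at:

```shell
# Run as root. "Speed" is the DIMM's rated speed; "Configured Memory
# Speed" is what the BIOS is actually clocking it at. A configured
# speed above the CPU/board's official maximum suggests XMP/overclocking
# that is worth disabling while troubleshooting stability.
dmidecode -t memory | grep -E 'Part Number|Speed'
```

Disabling the XMP/EXPO profile in the BIOS drops the RAM back to its standard JEDEC speed, which is a useful test configuration even if it is slower.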

 

