I can't log in to my server and can't access my data

Airwu · February 17

My Unraid server has recently become unstable, and I can no longer access the system.

Timeline:

1 I bought the server about a year ago, and it ran Unraid without any issues until last month.

2 About a month ago, the server began rebooting randomly (approximately every 3 days). Posted:

3 About a week ago, several CPU cores began randomly showing 100% usage (also approximately every 3 days). When this happened, the web page became inaccessible. SSH login still worked, but the system did not respond to the reboot command and could only be restarted with reboot -nf.

4 Starting today, several cores show 100% usage again immediately after a restart.

Attempted Solutions:

1 Passed the memtest.

2 The extended SMART test on the disks passed.

3 The errors previously reported by "zpool status -v" have been fixed.

4 The diagnostics command did not respond after I ran it.

5 I can't access my data.

6 I found this message in the syslog:

Feb 17 21:19:49 Tower root: Starting Nginx server daemon...
Feb 17 21:19:49 Tower kernel: mdcmd (31): set md_num_stripes 1280
Feb 17 21:19:49 Tower kernel: mdcmd (32): set md_queue_limit 80
Feb 17 21:19:49 Tower kernel: mdcmd (33): set md_sync_limit 5
Feb 17 21:19:49 Tower kernel: mdcmd (34): set md_write_method 1
Feb 17 21:19:49 Tower kernel: mdcmd (35): start STOPPED
Feb 17 21:19:49 Tower kernel: unraid: allocating 15750K for 1280 stripes (3 disks)
Feb 17 21:19:49 Tower kernel: md1p1: running, size: 3907018532 blocks
Feb 17 21:19:49 Tower emhttpd: shcmd (27): udevadm settle
Feb 17 21:19:49 Tower emhttpd: Opening encrypted volumes...
Feb 17 21:19:49 Tower emhttpd: shcmd (28): touch /boot/config/forcesync
Feb 17 21:19:49 Tower emhttpd: Mounting disks...
Feb 17 21:19:49 Tower emhttpd: mounting /mnt/disk1
Feb 17 21:19:49 Tower emhttpd: shcmd (29): mkdir -p /mnt/disk1
Feb 17 21:19:49 Tower emhttpd: /usr/sbin/zpool import -f -d /dev/md1p1 2>&1
Feb 17 21:19:52 Tower emhttpd:    pool: disk1
Feb 17 21:19:52 Tower emhttpd:      id: 9902428395924024116
Feb 17 21:19:52 Tower emhttpd: shcmd (30): /usr/sbin/zpool import -f -N -o autoexpand=on  -d /dev/md1p1 9902428395924024116 disk1
Feb 17 21:19:58 Tower emhttpd: shcmd (31): /usr/sbin/zpool online -e disk1 /dev/md1p1
Feb 17 21:19:58 Tower rsyslogd: action 'action-3-builtin:omfwd' resumed (module 'builtin:omfwd') [v8.2102.0 try https://www.rsyslog.com/e/2359 ]
Feb 17 21:19:59 Tower emhttpd: /usr/sbin/zpool status -PL disk1 2>&1
Feb 17 21:19:59 Tower emhttpd:   pool: disk1
Feb 17 21:19:59 Tower emhttpd:  state: ONLINE
Feb 17 21:19:59 Tower emhttpd:   scan: scrub repaired 0B in 03:19:55 with 0 errors on Thu Feb 15 01:19:56 2024
Feb 17 21:19:59 Tower emhttpd: config:
Feb 17 21:19:59 Tower emhttpd:  NAME          STATE     READ WRITE CKSUM
Feb 17 21:19:59 Tower emhttpd:  disk1         ONLINE       0     0     0
Feb 17 21:19:59 Tower emhttpd:    /dev/md1p1  ONLINE       0     0     0
Feb 17 21:19:59 Tower emhttpd: errors: No known data errors
Feb 17 21:19:59 Tower emhttpd: shcmd (32): /usr/sbin/zfs set mountpoint=/mnt/disk1 disk1
Feb 17 21:19:59 Tower emhttpd: shcmd (33): /usr/sbin/zfs set atime=off disk1
Feb 17 21:19:59 Tower emhttpd: shcmd (34): /usr/sbin/zfs mount disk1
Feb 17 21:19:59 Tower emhttpd: shcmd (35): /usr/sbin/zpool set autotrim=off disk1
Feb 17 21:19:59 Tower emhttpd: shcmd (36): /usr/sbin/zfs set compression=off disk1
Feb 17 21:20:00 Tower emhttpd: mounting /mnt/nvme
Feb 17 21:20:00 Tower emhttpd: shcmd (37): mkdir -p /mnt/nvme
Feb 17 21:20:00 Tower emhttpd: shcmd (38): /usr/sbin/zpool import -f -N -o autoexpand=on  -d /dev/nvme1n1p1 -d /dev/nvme0n1p1 7424498333111026621 nvme
Feb 17 21:20:00 Tower kernel: VERIFY3(rs_get_end(rs, rt) >= end) failed (115970260992 >= 58546911126228992)
Feb 17 21:20:00 Tower kernel: PANIC at range_tree.c:482:range_tree_remove_impl()
Feb 17 21:20:00 Tower kernel: Showing stack for process 9822
Feb 17 21:20:00 Tower kernel: CPU: 8 PID: 9822 Comm: metaslab_group_ Tainted: P           O       6.1.74-Unraid #1
Feb 17 21:20:00 Tower kernel: Hardware name: Default string Default string/MS-Terminator B660M, BIOS H3.41G 04/29/2022
Feb 17 21:20:00 Tower kernel: Call Trace:
Feb 17 21:20:00 Tower kernel: <TASK>
Feb 17 21:20:00 Tower kernel: dump_stack_lvl+0x44/0x5c
Feb 17 21:20:00 Tower kernel: spl_panic+0xd0/0xe8 [spl]
Feb 17 21:20:00 Tower kernel: ? bt_grow_leaf+0xc3/0xd6 [zfs]
Feb 17 21:20:00 Tower kernel: ? zfs_btree_find_in_buf+0x4c/0x94 [zfs]
Feb 17 21:20:00 Tower kernel: ? zfs_btree_find+0x16d/0x1b0 [zfs]
Feb 17 21:20:00 Tower kernel: range_tree_remove_impl+0x1ea/0x406 [zfs]
Feb 17 21:20:00 Tower kernel: ? zio_wait+0x1ee/0x1fd [zfs]
Feb 17 21:20:00 Tower kernel: space_map_load_callback+0x70/0x79 [zfs]
Feb 17 21:20:00 Tower kernel: space_map_iterate+0x2d3/0x324 [zfs]
Feb 17 21:20:00 Tower kernel: ? spa_stats_destroy+0x16c/0x16c [zfs]
Feb 17 21:20:00 Tower kernel: space_map_load_length+0x93/0xcb [zfs]
Feb 17 21:20:00 Tower kernel: metaslab_load+0x33b/0x6e3 [zfs]
Feb 17 21:20:00 Tower kernel: ? _raw_spin_unlock_irqrestore+0x24/0x3a
Feb 17 21:20:00 Tower kernel: ? __wake_up_common_lock+0x88/0xbb
Feb 17 21:20:00 Tower kernel: metaslab_preload+0x4c/0x97 [zfs]
Feb 17 21:20:00 Tower kernel: taskq_thread+0x266/0x38a [spl]
Feb 17 21:20:00 Tower kernel: ? wake_up_q+0x44/0x44
Feb 17 21:20:00 Tower kernel: ? taskq_dispatch_delay+0x106/0x106 [spl]
Feb 17 21:20:00 Tower kernel: kthread+0xe4/0xef
Feb 17 21:20:00 Tower kernel: ? kthread_complete_and_exit+0x1b/0x1b
Feb 17 21:20:00 Tower kernel: ret_from_fork+0x1f/0x30
Feb 17 21:20:00 Tower kernel: </TASK>
Feb 17 21:20:02 Tower SysDrivers: SysDrivers Build Complete
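
For anyone reading a long syslog like the one above, the two lines that matter are the VERIFY3 assertion and the PANIC line that follows it. A minimal Python sketch for pulling them out is below. The function name find_zfs_panics and the regular expressions are illustrative assumptions written against the exact line format shown in this post, not part of any Unraid or ZFS tooling.

```python
import re

# Patterns written against the kernel lines shown in the log above.
VERIFY_RE = re.compile(r"kernel: (VERIFY\d*\(.*\) failed .*)$")
PANIC_RE = re.compile(
    r"kernel: PANIC at (?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)"
)


def find_zfs_panics(lines):
    """Return one dict per PANIC line, paired with the VERIFY line before it."""
    results = []
    last_verify = None
    for raw in lines:
        line = raw.rstrip("\n")
        m = VERIFY_RE.search(line)
        if m:
            # Remember the failed assertion until the PANIC line arrives.
            last_verify = m.group(1)
            continue
        m = PANIC_RE.search(line)
        if m:
            results.append({
                "file": m.group("file"),
                "line": int(m.group("line")),
                "func": m.group("func"),
                "assertion": last_verify,
            })
            last_verify = None
    return results


# Two lines copied from the syslog excerpt in this post.
SAMPLE = [
    "Feb 17 21:20:00 Tower kernel: VERIFY3(rs_get_end(rs, rt) >= end) failed (115970260992 >= 58546911126228992)",
    "Feb 17 21:20:00 Tower kernel: PANIC at range_tree.c:482:range_tree_remove_impl()",
]

if __name__ == "__main__":
    for hit in find_zfs_panics(SAMPLE):
        print(f"{hit['file']}:{hit['line']} in {hit['func']}(): {hit['assertion']}")
```

To scan a real file, pass an open file object instead of SAMPLE, for example find_zfs_panics(open("syslog.txt")), where syslog.txt is a placeholder path.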

trurl · February 17

15 minutes ago, Airwu said:

Passed the memtest

How long did you let it run?

Airwu · February 17

When I disable disk auto-start, I can log in to the web page and run diagnostics.

tower-diagnostics-20240217-2204.zip

Airwu · February 17

Just now, trurl said:

How long did you let it run?

about 1 hour

JorgeB · February 18

Symptoms suggest a hardware issue. Memtest is only definitive if it finds errors, so try running the server with just one stick of RAM; if the problem is the same, try the other one. That will basically rule out a RAM issue.

Airwu · February 18

1 hour ago, JorgeB said:

Symptoms suggest a hardware issue. Memtest is only definitive if it finds errors, so try running the server with just one stick of RAM; if the problem is the same, try the other one. That will basically rule out a RAM issue.

I'll try that later, thank you.

Airwu · February 18

Maybe this is a ZFS filesystem crash or a ZFS bug.

I put my NVMe disk into another computer running the latest Ubuntu and then tried to open the pool with zpool. The system crashed again, and I couldn't read my data.

I found a GitHub issue about this crash, https://github.com/openzfs/zfs/issues/13483 , where someone had the same crash as me. I used these ZFS values to open the disk and read my data:

Quote

vfs.zfs.spa.load_verify_data=0

vfs.zfs.spa.load_verify_metadata=0

vfs.zfs.recover=1

vfs.zfs.zil.replay_disable=1
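
The four values above are written as FreeBSD-style sysctl names. On Linux (such as the Ubuntu machine mentioned in this post), OpenZFS tunables are normally set as module parameters instead. The Python sketch below only prints candidate commands and does not run anything. It assumes the Linux equivalents are spa_load_verify_data, spa_load_verify_metadata, zfs_recover and zil_replay_disable under /sys/module/zfs/parameters; confirm that each file exists on your system before writing to it, and treat these settings as a last-resort way to copy data off a damaged pool.

```python
# FreeBSD-style sysctl names and values exactly as quoted in the post.
QUOTED_TUNABLES = {
    "vfs.zfs.spa.load_verify_data": 0,
    "vfs.zfs.spa.load_verify_metadata": 0,
    "vfs.zfs.recover": 1,
    "vfs.zfs.zil.replay_disable": 1,
}

# Assumed Linux module-parameter names for the same tunables.
# Check that each one exists under PARAM_DIR before relying on it.
LINUX_NAMES = {
    "vfs.zfs.spa.load_verify_data": "spa_load_verify_data",
    "vfs.zfs.spa.load_verify_metadata": "spa_load_verify_metadata",
    "vfs.zfs.recover": "zfs_recover",
    "vfs.zfs.zil.replay_disable": "zil_replay_disable",
}

PARAM_DIR = "/sys/module/zfs/parameters"


def linux_commands(tunables):
    """Build (but do not execute) one shell command per tunable."""
    return [
        f"echo {value} > {PARAM_DIR}/{LINUX_NAMES[name]}"
        for name, value in tunables.items()
    ]


if __name__ == "__main__":
    # Dry run: print the commands so they can be reviewed and run as root.
    for cmd in linux_commands(QUOTED_TUNABLES):
        print(cmd)
```

The script is a dry run on purpose, so the commands can be reviewed before anything is changed. The parameters would need to be set before the pool is imported.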

Now I am changing all my disks from ZFS to btrfs.

JorgeB · February 19

Or the ZFS filesystem may be corrupt; that can also happen as a symptom of an underlying hardware issue, like bad RAM.

Airwu · February 19

5 hours ago, JorgeB said:

Or the ZFS filesystem may be corrupt; that can also happen as a symptom of an underlying hardware issue, like bad RAM.

I use memory without ECC, so maybe my setup is not suitable for ZFS.

JorgeB · February 19

It's suitable if it's working correctly, but IMHO ECC is always better, especially if you care for data integrity.

Airwu · February 22

On 2/19/2024 at 10:56 PM, JorgeB said:

It's suitable if it's working correctly, but IMHO ECC is always better, especially if you care for data integrity.

I ran memtest last night for about 12 hours and found no errors.

itimpi · February 22

6 hours ago, Airwu said:

I ran memtest last night for about 12 hours and found no errors

That is good, but it is worth mentioning that although a failure in memtest is definitive, it is still possible to pass it and have RAM issues when the system is under heavy load.

Airwu · February 22

So, what can I do?

itimpi · February 22

5 minutes ago, Airwu said:

So, what can I do?

It all depends on whether you continue to get problems. If it still looks like they may be RAM-related, one possibility is to run with fewer sticks of RAM, as that puts less load on the memory controller. If you still get the same failures, that tends to rule out RAM issues.

You should also make sure that there is no overclocking going on and that the RAM is not being clocked faster than the motherboard/CPU is capable of handling. It is easy to buy RAM rated at a higher speed than the motherboard/CPU can support, and a lot of people just assume they can run the RAM at its rated speed without taking motherboard/CPU limitations into consideration.
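
One way to check the RAM clock from a running system is to compare the speed fields reported by dmidecode -t memory. The Python sketch below parses that text. It assumes a dmidecode version that labels the fields "Speed" and "Configured Memory Speed" in MT/s (older versions use different labels). The sample text and the 3200 MT/s limit are made-up illustration values, not taken from this server; look up the real limit in your CPU and motherboard specifications.

```python
import re

# Matches the two speed fields in `dmidecode -t memory` output.
# Lines such as "Speed: Unknown" (empty slots) are skipped on purpose.
SPEED_RE = re.compile(
    r"^\s*(Speed|Configured Memory Speed):\s*(\d+)\s*MT/s", re.MULTILINE
)


def memory_speeds(dmidecode_text):
    """Return (rated, configured) MT/s pairs, one per populated module."""
    pairs = []
    rated = None
    for field, value in SPEED_RE.findall(dmidecode_text):
        if field == "Speed":
            rated = int(value)
        elif rated is not None:
            pairs.append((rated, int(value)))
            rated = None
    return pairs


def over_limit(pairs, platform_limit):
    """Return the modules configured faster than the platform limit."""
    return [p for p in pairs if p[1] > platform_limit]


# Hypothetical dmidecode excerpt, for illustration only.
SAMPLE = """\
Memory Device
\tSize: 16 GB
\tSpeed: 3600 MT/s
\tConfigured Memory Speed: 3600 MT/s
Memory Device
\tSize: No Module Installed
\tSpeed: Unknown
\tConfigured Memory Speed: Unknown
"""

if __name__ == "__main__":
    pairs = memory_speeds(SAMPLE)
    # 3200 is a placeholder; use the limit from your own hardware specs.
    for rated, configured in over_limit(pairs, platform_limit=3200):
        print(f"module rated {rated} MT/s is running at {configured} MT/s")
```

If a module shows up in the over-limit list, lowering the memory speed in the BIOS (for example by disabling the XMP profile) is the usual way to bring it back within the supported range.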

Airwu · February 22

I understand, thank you.
