Daily Unresponsive Server 6.12.8

kbareis · February 20

Hi all, I have been banging my head against an issue I am seeing with Unraid. My server has been solid for months but within the last two weeks or so it has taken a nose dive.

System Specs:
5800x Ryzen

32GB Gskill

Asus X570 Tuf

2x Crucial NVME drives

Smattering of 3-4 TB WD and Seagate drives

Bios Changes-

C States disabled

Power set to typical

DOCP off

Last night the lock up was so bad it broke the filesystem on my mirrored btrfs cache drives. Booted today without Docker and VMs running and the thing still crashed.

Dump from Syslog server attached

syslogDump.log

kbareis · February 20

For those that don't want to download the log, here's a snippet of one of the syslog errors

Feb 20 13:18:33 Oldtown kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
Feb 20 13:18:33 Oldtown kernel: rcu: #0119-....: (1319739 ticks this GP) idle=ca84/1/0x4000000000000000 softirq=781013/781013 fqs=516199
Feb 20 13:18:33 Oldtown kernel: #011(t=1320207 jiffies g=2460473 q=3174295 ncpus=16)
Feb 20 13:18:33 Oldtown kernel: CPU: 9 PID: 16255 Comm: find Tainted: P      D    O       6.1.74-Unraid #1
Feb 20 13:18:33 Oldtown kernel: Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS (WI-FI), BIOS 4602 02/23/2023
Feb 20 13:18:33 Oldtown kernel: RIP: 0010:xfs_buf_get_map+0x108/0x804 [xfs]
Feb 20 13:18:33 Oldtown kernel: Code: e8 ee 65 03 00 0f 0b bd 8b ff ff ff e9 eb 06 00 00 48 8d 54 f5 40 4c 8b 22 49 83 e4 fe 75 07 49 89 d4 49 83 cc 01 41 f6 c4 01 <74> 41 48 89 d0 48 83 c8 01 49 39 c4 75 de 48 8b 6d 30 48 85 ed 74
Feb 20 13:18:33 Oldtown kernel: RSP: 0018:ffffc90017827b10 EFLAGS: 00000202
Feb 20 13:18:33 Oldtown kernel: RAX: 0000000000000001 RBX: ffff88814b446c00 RCX: 000000003b627298
Feb 20 13:18:33 Oldtown kernel: RDX: ffff888112411290 RSI: ffff8882759e3a80 RDI: ffffc90017827b60
Feb 20 13:18:33 Oldtown kernel: RBP: ffff888112410000 R08: ffffffffa0d7fac7 R09: 0000000000000000
Feb 20 13:18:33 Oldtown kernel: R10: 0000000000000000 R11: ffff8881069ee018 R12: ffff888112411a91
Feb 20 13:18:33 Oldtown kernel: R13: ffff8882759e3a80 R14: ffff888106e7e000 R15: ffffc90017827c40
Feb 20 13:18:33 Oldtown kernel: FS:  000014f85bf33740(0000) GS:ffff88880ea40000(0000) knlGS:0000000000000000
Feb 20 13:18:33 Oldtown kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 20 13:18:33 Oldtown kernel: CR2: 0000000000456020 CR3: 0000000150ace000 CR4: 0000000000750ee0
Feb 20 13:18:33 Oldtown kernel: PKRU: 55555554
Feb 20 13:18:33 Oldtown kernel: Call Trace:
Feb 20 13:18:33 Oldtown kernel: <IRQ>
Feb 20 13:18:33 Oldtown kernel: ? rcu_dump_cpu_stacks+0x95/0xb9
Feb 20 13:18:33 Oldtown kernel: ? rcu_sched_clock_irq+0x345/0xa45
Feb 20 13:18:33 Oldtown kernel: ? tick_init_jiffy_update+0x7c/0x7c
Feb 20 13:18:33 Oldtown kernel: ? update_process_times+0x62/0x81
Feb 20 13:18:33 Oldtown kernel: ? tick_sched_timer+0x43/0x71
Feb 20 13:18:33 Oldtown kernel: ? __hrtimer_run_queues+0xeb/0x190
Feb 20 13:18:33 Oldtown kernel: ? hrtimer_interrupt+0x9c/0x16e
Feb 20 13:18:33 Oldtown kernel: ? __sysvec_apic_timer_interrupt+0xc5/0x12f
Feb 20 13:18:33 Oldtown kernel: ? sysvec_apic_timer_interrupt+0x80/0xa6
Feb 20 13:18:33 Oldtown kernel: </IRQ>
Feb 20 13:18:33 Oldtown kernel: <TASK>
Feb 20 13:18:33 Oldtown kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20
Feb 20 13:18:33 Oldtown kernel: ? xfs_buf_get_map+0x9b/0x804 [xfs]
Feb 20 13:18:33 Oldtown kernel: ? xfs_buf_get_map+0x108/0x804 [xfs]
Feb 20 13:18:33 Oldtown kernel: xfs_buf_read_map+0x51/0x1b3 [xfs]
Feb 20 13:18:33 Oldtown kernel: ? xfs_buf_readahead_map+0x5/0x50 [xfs]
Feb 20 13:18:33 Oldtown kernel: xfs_buf_readahead_map+0x30/0x50 [xfs]
Feb 20 13:18:33 Oldtown kernel: ? xfs_buf_readahead_map+0x5/0x50 [xfs]
Feb 20 13:18:33 Oldtown kernel: xfs_da_reada_buf+0x6c/0xa1 [xfs]
Feb 20 13:18:33 Oldtown kernel: xfs_dir2_leaf_readbuf+0x260/0x2f5 [xfs]
Feb 20 13:18:33 Oldtown kernel: xfs_dir2_leaf_getdents+0xe0/0x322 [xfs]
Feb 20 13:18:33 Oldtown kernel: ? xfs_bmap_last_offset+0x8a/0xc2 [xfs]
Feb 20 13:18:33 Oldtown kernel: xfs_readdir+0x14e/0x190 [xfs]
Feb 20 13:18:33 Oldtown kernel: iterate_dir+0x97/0x146
Feb 20 13:18:33 Oldtown kernel: __do_sys_getdents64+0x6b/0xd8
Feb 20 13:18:33 Oldtown kernel: ? compat_filldir+0x17a/0x17a
Feb 20 13:18:33 Oldtown kernel: do_syscall_64+0x6b/0x81
Feb 20 13:18:33 Oldtown kernel: entry_SYSCALL_64_after_hwframe+0x64/0xce
Feb 20 13:18:33 Oldtown kernel: RIP: 0033:0x14f85c00d283
Feb 20 13:18:33 Oldtown kernel: Code: 89 df e8 20 05 fb ff 48 83 c4 08 48 89 e8 5b 5d c3 66 0f 1f 44 00 00 b8 ff ff ff 7f 48 39 c2 48 0f 47 d0 b8 d9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 61 0b 11 00 f7 d8
Feb 20 13:18:33 Oldtown kernel: RSP: 002b:00007fffb298e2b8 EFLAGS: 00000293 ORIG_RAX: 00000000000000d9
Feb 20 13:18:33 Oldtown kernel: RAX: ffffffffffffffda RBX: 00000000004588f0 RCX: 000014f85c00d283
Feb 20 13:18:33 Oldtown kernel: RDX: 0000000000008000 RSI: 0000000000458920 RDI: 0000000000000008
Feb 20 13:18:33 Oldtown kernel: RBP: 00000000004588f4 R08: 00000000ffffffff R09: 000000000044f6c0
Feb 20 13:18:33 Oldtown kernel: R10: 0000000000000100 R11: 0000000000000293 R12: ffffffffffffff88
Feb 20 13:18:33 Oldtown kernel: R13: 0000000000000000 R14: 0000000000444c90 R15: 000000000000106f
Feb 20 13:18:33 Oldtown kernel: </TASK>

trurl · February 20

Probably filesystem corruption but not enough context to identify the disk

Attach Diagnostics to your NEXT post in this thread.

kbareis · February 20

Hey @trurl-

Diags are hard to get due to the device becoming unresponsive to webui, ssh, and input from hardwired keyboard and mouse. I setup an external syslog server just to try and diagnose. Attaching a fresh diagnostics from when it rebooted. Likely won't have all the errors that we are looking for. Happy to gather more off the syslog server if that is of interest. Currently doing a parity check on the drives.

oldtown-diagnostics-20240220-1536.zip

trurl · February 21

All disks mounted and nothing in syslog yet.

JorgeB · February 21

Still a good idea to run xfs_repair without -n on all xfs filesystems.

trurl · February 21

4 hours ago, JorgeB said:

run xfs_repair

Check filesystem from webUI to make sure you get the right command. Post the output.

kbareis · February 21

Thanks all! Parity check is just finishing up and then I will run the xfs repair.

One thing I have tweaked since posting is unassigning my cache drives due to constant btrfs errors being logged out to syslog. Things appear to be running better now that they are out. 18 hours uptime so far! If I had to guess at this point, I believe the btrfs filesystem was hosed and was causing full system instability. When I have tried to do data recovery mounting the btrfs cache drives with read-only mount commands is also when I see instability. Going to run the system without cache drives for a week or so to test stability and if good, likely reformat them, and test them for issues prior to dropping them back in.

Memtest also passed three loops without issue.

kbareis · February 23

Dropping an update here for anyone who might be having stability issues with Ryzen and Unraid 6.12.8.

Memtested my RAM again and ensured 3 successful passes
Down clocked it from 3600 to 2400 (could do a bit higher per the document below but I decided to just run DDR4 default spec).
Disabled C-States on the processor (bios)
Set my Power Idle Control to typical (bios)
Removed both cache SSDs
Did a parity check to ensure data on the array
Changed Docker from macvlan to ipvlan per some stability issue reports on these forms and reddit
Used recovery tool to pull files off BTRFS cache drives
Rebuilt the cache drives from scratch ensuring that only critical data was corruption free for the rebuild

Overall I have been up for longer than I have been in two weeks. Will keep posting on this thread with any additional updates or recommendations. My best guess at this point is that due to some sort of corruption event from ether cstates, power, memory timings, or macvlan that caused btrfs, vm container or the docker container to become unstable thus rendering the UI to freeze and stop responding after 24-48 hours.

kbareis · February 25

Well I spoke to soon... attaching syslog and diagnostics. Same thing, can ping the server but ssh and webgui down. Looks like issues started at 19:28 in the logs with some kernel errors on several cores that quickly spread into btrfs issues which bricked the whole thing.

syslog-192.168.1.30.log oldtown-diagnostics-20240224-1954.zip

JorgeB · February 25

Looks more hardware related to me, one thing you can try is to boot the server in safe mode with all docker containers/VMs disabled, let it run as a basic NAS for a few days, if it still crashes it's likely a hardware problem, if it doesn't start turning on the other services one by one.

kbareis · February 25

I’ll try but this hardware has been running in this config since last year. There are numerous posts on these forums of issues on 6.12.x. I worry there is an unknown software issue. I am considering ether moving to zfs for the cache disks and container images or completely leaving unraid

kbareis · March 21

Marking @JorgeB as the correct answer. I have swapped basically everything and what I finally came to was unstable RAM. Both sticks individually passed 3x passes. Both passed together with 3x passes but after running a 7x pass they failed.

Currently, I have a 8x2 set of Corsair in there and uptime is at 5+ days

Daily Unresponsive Server 6.12.8

Recommended Posts

kbareis

Link to comment

kbareis

Link to comment

trurl

Link to comment

kbareis

Link to comment

trurl

Link to comment

JorgeB

Link to comment

trurl

Link to comment

kbareis

Link to comment

kbareis

Link to comment

kbareis

Link to comment

JorgeB

Link to comment

kbareis

Link to comment

kbareis

Link to comment

Join the conversation