
Kernel oops while cache rebuilding with BTRFS operation running


WEHA


So two days ago I had to change the network configuration (enable a VLAN), so I decided to update Unraid as well.

From the moment I stopped the array, one NVMe cache drive was "missing"...

I had this happen before, but only after a reboot. I have already updated the BIOS and replaced the NVMe SSDs (Intel before, now Samsung).

After a cold boot the cache drive was back; I reassigned it and updated Unraid from 6.7.0 to 6.7.2.

Everything went well until a few hours ago, when I could no longer connect to one VM, and SMB shares do not seem to work either.

I connected over SSH to find the load average above 280 (now 303 and steadily rising).

The processes themselves don't seem to use that much CPU, so I'm guessing it's I/O.

Every tool that checks I/O status hangs, so I'm unable to see what is going on.
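For what it's worth, one way to confirm the I/O theory, assuming ps itself still responds (it only reads /proc), is to list the processes stuck in uninterruptible sleep (state D), since those are what push the load average up:

# list D-state (uninterruptible I/O wait) processes; these cannot be killed
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'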

In the GUI it said the mover was running; I got that to stop, and then it said a BTRFS operation was running.

I found this thread and ran btrfs balance status /mnt/cache, which returned: No balance found on '/mnt/cache'
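For completeness, two other read-only checks that can reveal a BTRFS operation running on the pool (assuming it is mounted at /mnt/cache):

# is a scrub in progress?
btrfs scrub status /mnt/cache
# is a device replace in progress?
btrfs replace status /mnt/cache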

I also found a kernel oops in dmesg (below).

 

His problem was solved after a reboot, but he did not have a cache pool.

Is it safe to just reboot here? I won't lose my data, will I?

The diagnostics download is running; not sure if it will finish...

 

[71615.611132] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
[71615.611268] PGD 8000000f138a1067 P4D 8000000f138a1067 PUD f68c96067 PMD 0
[71615.611341] Oops: 0000 [#1] SMP PTI
[71615.611431] CPU: 0 PID: 26293 Comm: fstrim Not tainted 4.19.56-Unraid #1
[71615.611515] Hardware name: ASUSTeK COMPUTER INC. P10S WS/P10S WS, BIOS 3402 07/12/2018
[71615.611590] RIP: 0010:btrfs_trim_fs+0x166/0x369
[71615.611658] Code: 00 00 48 c7 44 24 38 00 00 00 00 49 8b 45 10 48 c7 44 24 40 00 00 00 00 48 c7 44 24 30 00 00 00 00 48 89 44 24 20 48 8b 43 68 <48> 8b 80 80 00 00 00 48 8b 80 f8 03 00 00 48 8b 80 a8 01 00 00 0f
[71615.611873] RSP: 0018:ffffc9002eaa7c90 EFLAGS: 00010297
[71615.611942] RAX: 0000000000000000 RBX: ffff8890339ae400 RCX: 0000000000000000
[71615.612014] RDX: ffff888fff1e9c00 RSI: 00000192c100d000 RDI: ffff88901ee0e378
[71615.612086] RBP: 00000000ffffffe4 R08: 0000606fc0a09bb0 R09: ffffffff8122acea
[71615.612158] R10: ffffea000d907200 R11: ffff88903f220b80 R12: ffff88901ee0e000
[71615.612230] R13: ffffc9002eaa7d20 R14: 0000000000000000 R15: ffff8887bec88000
[71615.612302] FS:  000014a5d096c780(0000) GS:ffff88903f200000(0000) knlGS:0000000000000000
[71615.612375] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[71615.612468] CR2: 0000000000000080 CR3: 000000029787c001 CR4: 00000000003626f0
[71615.612553] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[71615.612625] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[71615.612697] Call Trace:
[71615.612763]  btrfs_ioctl_fitrim.isra.7+0xfe/0x135
[71615.612832]  btrfs_ioctl+0x4f6/0x28ad
[71615.612900]  ? queue_var_show+0x12/0x15
[71615.612967]  ? _copy_to_user+0x22/0x28
[71615.613035]  ? cp_new_stat+0x14b/0x17a
[71615.613102]  ? vfs_ioctl+0x19/0x26
[71615.613167]  vfs_ioctl+0x19/0x26
[71615.613233]  do_vfs_ioctl+0x526/0x54e
[71615.613300]  ? __se_sys_newfstat+0x3c/0x5f
[71615.613368]  ksys_ioctl+0x39/0x58
[71615.613434]  __x64_sys_ioctl+0x11/0x14
[71615.613524]  do_syscall_64+0x57/0xf2
[71615.613604]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[71615.613674] RIP: 0033:0x14a5d0a9e397
[71615.613741] Code: 00 00 90 48 8b 05 f9 2a 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 2a 0d 00 f7 d8 64 89 01 48
[71615.613956] RSP: 002b:00007fff498e2be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[71615.614029] RAX: ffffffffffffffda RBX: 00007fff498e2d40 RCX: 000014a5d0a9e397
[71615.614123] RDX: 00007fff498e2bf0 RSI: 00000000c0185879 RDI: 0000000000000003
[71615.614208] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000416bb0
[71615.614279] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000415aa0
[71615.614351] R13: 0000000000415a20 R14: 0000000000415aa0 R15: 000014a5d096c6b0
[71615.614423] Modules linked in: vhost_net tun vhost tap kvm_intel kvm xt_nat veth xt_CHECKSUM ipt_MASQUERADE ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle iptable_nat nf_nat_ipv4 nf_nat ip6table_filter ip6_tables iptable_filter ip_tables xfs md_mod nfsd lockd grace sunrpc ipmi_devintf bonding igb i2c_algo_bit x86_pkg_temp_thermal intel_powerclamp coretemp hid_logitech_hidpp wmi_bmof mxm_wmi crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate intel_uncore pcc_cpufreq intel_rapl_perf i2c_i801 ahci ie31200_edac libahci video acpi_pad button hid_logitech_dj i2c_core nvme nvme_core aacraid cp210x usbserial cdc_acm wmi backlight [last unloaded: kvm]
[71615.615190] CR2: 0000000000000080
[71615.615725] ---[ end trace eb0e9ccf73a2e8b9 ]---
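For context on the trace: Comm: fstrim and the btrfs_ioctl_fitrim frame show the oops was triggered by an FITRIM ioctl from fstrim, which on Unraid typically comes from a scheduled TRIM job (e.g. the Dynamix SSD TRIM plugin); roughly equivalent to running:

# discard unused blocks on the mounted filesystem (what the scheduled TRIM does)
fstrim -v /mnt/cache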

 

Link to comment

There are read/write errors on both cache devices:

Dec 15 06:42:51 Tower kernel: BTRFS info (device nvme1n1p1): bdev /dev/nvme0n1p1 errs: wr 2376234322, rd 2009373892, flush 25123281, corrupt 0, gen 0
Dec 15 06:42:51 Tower kernel: BTRFS info (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 357, flush 0, corrupt 0, gen 0

Mainly on cache1, which is most likely dropping offline regularly, but there are also some read errors on cache2.
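Those counters are cumulative, so a sketch of how to read them and, once the underlying cause is fixed, zero them so that new errors stand out:

# show per-device error counters for the pool
btrfs device stats /mnt/cache
# reset the counters after fixing the cause (-z zeroes them)
btrfs device stats -z /mnt/cache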

 

Probably best to back up what you can and reformat the pool; then see here for better pool monitoring. You need to fix whatever is causing those errors or you will continue to have pool issues.
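A minimal backup sketch, assuming the pool is still readable at /mnt/cache and there is room on an array disk (the destination path is just an example):

# copy everything off the cache while it is still mountable
rsync -av /mnt/cache/ /mnt/disk1/cache_backup/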

Link to comment
1 minute ago, johnnie.black said:

Mainly on cache1, which is most likely dropping offline regularly, but there are also some read errors on cache2.
 

Probably best to back up what you can and reformat the pool; then see here for better pool monitoring. You need to fix whatever is causing those errors or you will continue to have pool issues.

Oh dear... I didn't know there was a problem like this.

But how do I continue now? Is there any way to get it properly restarted? Or do I just force a reboot?

Stopping or killing anything that is running on the cache just hangs; nothing stops or gets killed.

Link to comment
7 hours ago, johnnie.black said:

Most likely. If you can still access the cache data, back up anything you can now; it can become unmountable after rebooting.

 

There are some recovery options here if needed.
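One of the usual last-resort options, for reference, is btrfs restore, which copies files off an unmountable filesystem without writing to it (the device and destination below are examples):

# pull files read-only from a pool member into a directory on the array
btrfs restore -v /dev/nvme1n1p1 /mnt/disk1/restore/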

OK, so I have now reset the system and the array is in a stopped state.

The cache disks appear normal, but can I trust Unraid to mount and fix the pool properly? How can I check whether or not it will mount correctly?

 

Thank you so far for your assistance.

Link to comment

I went back to Unraid and saw this message:

Unraid Cache disk message: 16-12-2019 20:31

Warning [TOWER] - Cache pool BTRFS missing device(s)
Samsung_SSD_970_EVO_2TB (nvme0n1)

 

So I'm assuming I'd better not use disk 0n1; can I just unassign 0n1 and use 1n1 alone?

EDIT2: According to this I can just unassign one disk and carry on; can you tell me which is the safest option?

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?tab=comments#comment-480418

 

EDIT: With mount -o degraded,usebackuproot,ro /dev/sdX1 /x I can read 1n1, so it seems fine.
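A related read-only check, for reference, to see how the kernel currently views the pool and whether a member is reported missing:

# list btrfs filesystems, their member devices, and any missing devices
btrfs filesystem show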

Edited by WEHA
Link to comment

So I tried figuring it out (I made a backup first):

  • removing the "faulty" cache disk results in not being able to read the cache
  • assigning 1n1 from slot 2 to slot 1 has the same result
  • reassigning 0n1 to slot 1 and 1n1 to slot 2, at which point Unraid warns that 0n1 will lose all data: now I can read the cache

However, now it says 4TB protected, but these are two 2TB drives?

Usually when this happens the size changes while it's "recovering", but that is not happening now; it stays the same.

How can I correct this? (This is balancing, right?)
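For reference only, and not necessarily the right fix for this particular pool: the generic upstream way to clear a stale missing member and re-mirror the data is to delete the missing device and run a converting balance, e.g.:

# remove the stale missing device from the pool
btrfs device remove missing /mnt/cache
# re-mirror data/metadata as RAID1 across the remaining devices
# (the 'soft' filter skips chunks that are already RAID1)
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt/cache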

 

Link to comment
10 minutes ago, johnnie.black said:

 


btrfs fi usage -T /mnt/cache

 

 

Looks like this needs to be fixed, but to be honest I don't understand what I'm reading...

From what I gather it is indeed partly RAID1? One thing I do notice: the 7.28TiB device size is four times 1.82TiB (one 2TB drive), so the pool seems to carry four device entries for what are physically two drives, presumably stale entries left over from the reassignments.

Overall:
    Device size:                   7.28TiB
    Device allocated:              1.92TiB
    Device unallocated:            5.36TiB
    Device missing:                1.82TiB
    Used:                          1.82TiB
    Free (estimated):              2.73TiB      (min: 2.73TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

                  Data      Metadata System
Id Path           RAID1     RAID1    RAID1     Unallocated
-- -------------- --------- -------- --------- -----------
 3 /dev/nvme0n1p1 250.00GiB  2.00GiB         -     1.57TiB
 4 /dev/nvme0n1p1  43.00GiB        -         -     1.78TiB
 2 /dev/nvme1n1p1 980.00GiB  3.00GiB  32.00MiB   879.99GiB
 1 missing        687.00GiB  1.00GiB  32.00MiB  -688.03GiB
-- -------------- --------- -------- --------- -----------
   Total          980.00GiB  3.00GiB  32.00MiB     3.54TiB
   Used           930.08GiB  1.13GiB 176.00KiB

Balance status:

No balance found on '/mnt/cache'
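If a balance does get started later, a simple way to watch its progress, assuming watch is available:

# refresh the balance progress every 10 seconds
watch -n 10 btrfs balance status /mnt/cache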

 

Link to comment
