Issues with Cache... (btrfs)

TheSystemAdmin · January 28, 2022

Hello unRAID Community!

I was watching Plex when it disconnected on me. I hopped onto my webGUI and received no notifications, but it did not look good.

1. Several (not all) containers were stopped.

2. All VMs are gone ("No Virtual Machines installed")

3. Several TBs of data are not showing up in Windows or through the "Shares" tab, but the utilization on the disks appears to be correct.

Logs are spamming this:

Jan 28 11:37:55 TSA-NAS01 kernel: blk_update_request: I/O error, dev sdk, sector 73447704 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jan 28 11:37:55 TSA-NAS01 kernel: BTRFS error (device sdf1): bdev /dev/sdk1 errs: wr 52, rd 8464053, flush 0, corrupt 0, gen 0
Jan 28 11:37:55 TSA-NAS01 kernel: sd 1:0:0:0: [sdf] tag#31 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00 cmd_age=0s
Jan 28 11:37:55 TSA-NAS01 kernel: sd 1:0:0:0: [sdf] tag#31 CDB: opcode=0x88 88 00 00 00 00 00 00 3e ae 20 00 00 00 20 00 00
Jan 28 11:37:55 TSA-NAS01 kernel: blk_update_request: I/O error, dev sdf, sector 4107808 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
Jan 28 11:37:55 TSA-NAS01 kernel: BTRFS error (device sdf1): bdev /dev/sdf1 errs: wr 54, rd 10210651, flush 0, corrupt 0, gen 0
Jan 28 11:37:55 TSA-NAS01 kernel: sd 2:0:0:0: [sdk] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00 cmd_age=0s
Jan 28 11:37:55 TSA-NAS01 kernel: sd 2:0:0:0: [sdk] tag#18 CDB: opcode=0x88 88 00 00 00 00 00 00 3e 0e 20 00 00 00 20 00 00
Jan 28 11:37:55 TSA-NAS01 kernel: blk_update_request: I/O error, dev sdk, sector 4066848 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
Jan 28 11:37:55 TSA-NAS01 kernel: BTRFS error (device sdf1): bdev /dev/sdk1 errs: wr 52, rd 8464054, flush 0, corrupt 0, gen 0
Jan 28 11:37:55 TSA-NAS01 kernel: BTRFS info (device sdf1): no csum found for inode 72150 start 1000931328
Jan 28 11:37:55 TSA-NAS01 kernel: sd 1:0:0:0: [sdf] tag#22 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00 cmd_age=0s
Jan 28 11:37:55 TSA-NAS01 kernel: sd 1:0:0:0: [sdf] tag#22 CDB: opcode=0x88 88 00 00 00 00 00 04 61 59 18 00 00 00 08 00 00
Jan 28 11:37:55 TSA-NAS01 kernel: blk_update_request: I/O error, dev sdf, sector 73488664 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jan 28 11:37:55 TSA-NAS01 kernel: BTRFS error (device sdf1): bdev /dev/sdf1 errs: wr 54, rd 10210652, flush 0, corrupt 0, gen 0
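The "errs:" lines above carry running per-device counters (wr, rd, flush, corrupt, gen), which are easier to compare once pulled out of the log spam. Below is a minimal Python sketch that keeps the most recent counters per device. The function name `latest_btrfs_counters` is made up for illustration, and the regular expression assumes the exact line format shown in this thread.

```python
import re

# Assumption: lines follow the format pasted in this thread, e.g.
# "BTRFS error (device sdf1): bdev /dev/sdk1 errs: wr 52, rd 8464053, ..."
COUNTER_RE = re.compile(
    r"BTRFS error \(device (?P<fs>\w+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTER_NAMES = ("wr", "rd", "flush", "corrupt", "gen")


def latest_btrfs_counters(lines):
    """Return {bdev: {counter: value}}, keeping the last line seen per device."""
    latest = {}
    for line in lines:
        match = COUNTER_RE.search(line)
        if match:
            latest[match["bdev"]] = {name: int(match[name]) for name in COUNTER_NAMES}
    return latest


# Three of the lines quoted in the post above.
sample = [
    "Jan 28 11:37:55 TSA-NAS01 kernel: BTRFS error (device sdf1): bdev /dev/sdk1 errs: wr 52, rd 8464053, flush 0, corrupt 0, gen 0",
    "Jan 28 11:37:55 TSA-NAS01 kernel: BTRFS error (device sdf1): bdev /dev/sdf1 errs: wr 54, rd 10210651, flush 0, corrupt 0, gen 0",
    "Jan 28 11:37:55 TSA-NAS01 kernel: BTRFS error (device sdf1): bdev /dev/sdk1 errs: wr 52, rd 8464054, flush 0, corrupt 0, gen 0",
]

counters = latest_btrfs_counters(sample)
for bdev, values in sorted(counters.items()):
    print(bdev, values)
```

Both pool members show millions of read errors but zero corruption or generation errors, which fits devices that stopped answering reads rather than devices returning bad data.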

From what I can tell from my quick (panicked) Google searches, there is something wrong with my cache.

I have a pool of 2 SSDs that show 0 errors. If I try to scrub them, I get an aborted status:

UUID:             bdbe2a64-9dd0-40b4-82fb-75fba1b30eca
Scrub started:    Fri Jan 28 11:21:47 2022
Status:           aborted
Duration:         0:00:00
Total to scrub:   178.97GiB
Rate:             0.00B/s
Error summary:    no errors found
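A scrub that aborts with a duration of 0:00:00 most likely never read anything, so "no errors found" here probably does not mean the pool is healthy. The Python sketch below parses the status text and flags that case. The helper names `parse_scrub_status` and `scrub_never_ran` are hypothetical, and the parsing assumes the "Key: value" layout shown above.

```python
def parse_scrub_status(text):
    """Split 'Key:   value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def scrub_never_ran(fields):
    """True when the scrub aborted before doing any work (assumed heuristic)."""
    return fields.get("Status") == "aborted" and fields.get("Duration") == "0:00:00"


# The scrub output quoted in the post above.
status_text = """\
UUID:             bdbe2a64-9dd0-40b4-82fb-75fba1b30eca
Scrub started:    Fri Jan 28 11:21:47 2022
Status:           aborted
Duration:         0:00:00
Total to scrub:   178.97GiB
Rate:             0.00B/s
Error summary:    no errors found
"""

fields = parse_scrub_status(status_text)
if scrub_never_ran(fields):
    print("Scrub aborted immediately; 'no errors found' is not meaningful here.")
```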

Also getting this on the Balance Status:

Before I start ripping things apart and re-seating cables, I wanted to make sure I'm heading in the right direction. While losing data is not the end of the world, I would rather not have to rebuild everything.

Both SSDs are connected straight to the motherboard while the rest of my data disks are through an HBA.

I do have backups via the CloudBerry app to a Backblaze B2 bucket, which does show data (woo!). I also have backups via the CA Backup / Restore Appdata plugin, which appears to have run today at 3am, though it currently reports it has no backup sets since that data is now missing on the unRAID side. (Again, also in Backblaze.)

Any help would be really appreciated!

Thank you.

tsa-nas01-diagnostics-20220128-1140.zip

JorgeB · January 28, 2022

Jan 28 10:35:32 TSA-NAS01 kernel: ahci 0000:03:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0xb0010000 flags=0x0000]

Problem with the onboard SATA controller, both cache devices dropped offline because of that:

Jan 28 10:36:33 TSA-NAS01 kernel: ata1.00: disabled
Jan 28 10:37:30 TSA-NAS01 kernel: ata2.00: disabled

This is quite common with some Ryzen boards. Rebooting should bring the pool back, but if it keeps happening it's best to use an add-on controller (or replace the board).
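To check whether the same thing happens again, the syslog can be scanned for the two signatures quoted above: the AMD-Vi IO_PAGE_FAULT on the AHCI controller and the "ataN.00: disabled" lines that follow. This is a rough Python sketch. The function name `find_controller_dropouts` is hypothetical, and the patterns assume the message format seen in this diagnostics file.

```python
import re

# Assumption: messages look like the ones quoted in this thread.
PAGE_FAULT_RE = re.compile(r"kernel: ahci (\S+): AMD-Vi: Event logged \[IO_PAGE_FAULT")
ATA_DISABLED_RE = re.compile(r"kernel: (ata\d+\.\d+): disabled")


def find_controller_dropouts(lines):
    """Return (PCI addresses with page faults, ATA links reported as disabled)."""
    faults, disabled = [], []
    for line in lines:
        fault = PAGE_FAULT_RE.search(line)
        if fault:
            faults.append(fault.group(1))
        dropped = ATA_DISABLED_RE.search(line)
        if dropped:
            disabled.append(dropped.group(1))
    return faults, disabled


# The three syslog lines quoted above.
sample = [
    "Jan 28 10:35:32 TSA-NAS01 kernel: ahci 0000:03:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0xb0010000 flags=0x0000]",
    "Jan 28 10:36:33 TSA-NAS01 kernel: ata1.00: disabled",
    "Jan 28 10:37:30 TSA-NAS01 kernel: ata2.00: disabled",
]

faults, disabled = find_controller_dropouts(sample)
print("Controllers with IO_PAGE_FAULT:", faults)
print("Disabled ATA links:", disabled)
```

A page fault followed within a minute or two by every onboard port going "disabled" points at the controller rather than at an individual drive or cable.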

Squid · January 28, 2022

Cabling certainly appears to be the prime suspect (the drive isn't even showing any SMART report at all)

6 minutes ago, TheSystemAdmin said:

Though it currently reports it has no backup sets since that data is now missing on the unRAID side

Since it looks like you sync the backup from the plugin to backblaze it's probably not a major issue, but I don't recommend storing a backup of the drive you're backing up on the drive itself.

TheSystemAdmin · January 28, 2022

2 minutes ago, Squid said:

Cabling certainly appears to be the prime suspect (the drive isn't even showing any SMART report at all)

Since it looks like you sync the backup from the plugin to backblaze it's probably not a major issue, but I don't recommend storing a backup of the drive you're backing up on the drive itself.

True. I have been debating plugging in an external drive and having it back up to that for a local copy, but the data footprint is so small that pulling from the cloud wouldn't take more than an hour or two.

TheSystemAdmin · January 28, 2022

8 minutes ago, JorgeB said:
Jan 28 10:35:32 TSA-NAS01 kernel: ahci 0000:03:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0xb0010000 flags=0x0000]
Problem with the onboard SATA controller, both cache devices dropped offline because of that:
Jan 28 10:36:33 TSA-NAS01 kernel: ata1.00: disabled
Jan 28 10:37:30 TSA-NAS01 kernel: ata2.00: disabled
This is quite common with some Ryzen boards. Rebooting should bring the pool back, but if it keeps happening it's best to use an add-on controller (or replace the board).

A reboot appears to have resolved it: data is back, containers started up, and the VMs are showing again.

Will definitely consider replacing the board if this issue occurs a second time. Been debating on switching to Intel but the wife won't approve any more tech spending for a few months. Haha

TheSystemAdmin · January 28, 2022

Since my "system" share is on the cache and I lost both drives, unRAID just ran with what it had on the array? Would that account for data not reflecting, performance being terrible and VMs missing?

JorgeB · January 28, 2022

5 minutes ago, TheSystemAdmin said:

Would that account for data not reflecting, performance being terrible and VMs missing?

Correct, all pool data became inaccessible.
