BTRFS errors, docker and vms won't start

November 8, 2024

I'm seeing lots of BTRFS errors in logs, examples below...

Nov  8 04:46:47 Max kernel: BTRFS warning (device nvme0n1p1: state EA): csum failed root 5 ino 18493665 off 180224 csum 0x27aafbf4 expected csum 0x27aafb74 mirror 1
Nov  8 04:46:47 Max kernel: BTRFS error (device nvme0n1p1: state EA): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 224, gen 0



Nov  8 09:15:13 Max kernel: BTRFS info (device loop3: state E): forced readonly
Nov  8 09:15:13 Max kernel: BTRFS warning (device loop3: state E): Skipping commit of aborted transaction.
Nov  8 09:15:13 Max kernel: BTRFS: error (device loop3: state EA) in cleanup_transaction:1992: errno=-5 IO failure

Obviously the Docker and VM issues are symptoms of the 'forced readonly', but I'm not sure what is causing the BTRFS errors or how to recover. Any help would be greatly appreciated. Diagnostics attached.

max-diagnostics-20241108-0936.zip
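The `bdev ... errs:` line in the log excerpt above carries the per-device error counters btrfs keeps (write, read and flush I/O errors, corruption errors, generation errors). As a rough sketch, assuming the syslog lines look exactly like the one quoted in this post, they can be pulled out with a small Python helper. The function name `parse_bdev_errs` is made up for illustration.

```python
import re

# Matches the per-device counter summary the kernel prints, e.g.
# "bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 224, gen 0"
ERRS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_bdev_errs(line):
    """Return (device, counters) for a 'bdev ... errs:' line, or None."""
    match = ERRS_RE.search(line)
    if match is None:
        return None
    counters = {
        name: int(value)
        for name, value in match.groupdict().items()
        if name != "dev"
    }
    return match.group("dev"), counters


# Log line copied from the first post.
line = (
    "Nov  8 04:46:47 Max kernel: BTRFS error (device nvme0n1p1: state EA): "
    "bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 224, gen 0"
)
dev, counters = parse_bdev_errs(line)
print(dev, counters)
```

Here `wr`, `rd` and `flush` are all 0 while `corrupt` is 224, so the drive is not reporting failed I/O; the data it returns simply does not match the stored checksums.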

November 8, 2024

Community Expert

Btrfs csum errors are usually caused by bad RAM.
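One way to sanity-check that against the log is to compare the two checksum values in the `csum failed` line from the first post. This is only a sketch of the arithmetic, not a diagnosis: it XORs the computed checksum with the expected one and counts how many bits differ.

```python
# Values copied from the "csum failed" line in the first post.
computed = 0x27AAFBF4  # "csum": checksum calculated from the data as read
expected = 0x27AAFB74  # "expected csum": checksum stored in the filesystem

diff = computed ^ expected
flipped_bits = bin(diff).count("1")

print(f"xor = {diff:#010x}, bits differing = {flipped_bits}")
# prints: xor = 0x00000080, bits differing = 1
```

The two values differ in a single bit. Corrupted file data would normally produce a checksum that differs in many bits, so a one-bit difference is consistent with a bit flip in the checksum value itself, which is the sort of thing faulty memory produces. It does not prove RAM is the cause.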

November 12, 2024

Author

Hi, thanks for the suggestion. I've swapped out the RAM for new modules and I'm still getting issues. I can start the server and log on and everything's fine, but as soon as I start the array it craps out.

November 12, 2024

Author

I've run a SMART self-test on the cache drive and it found no issues, although I noticed an error count of 31 with no further information... [attached image: image.png]

November 12, 2024

Community Expert

Run a correcting scrub on the pool and post the results.

November 12, 2024

Author

I just want to check before running... is this right? Just click scrub with 'repair corrupted blocks' selected?

November 12, 2024

Community Expert

Yep
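For anyone doing this from a shell rather than the GUI button, a correcting scrub is roughly what `btrfs scrub start` does by default; `-r` makes it read-only instead. Below is a minimal Python sketch that only builds the command line. The mount point `/mnt/cache` and the helper name `build_scrub_cmd` are assumptions for illustration; substitute the actual pool path.

```python
def build_scrub_cmd(mount_point, read_only=False):
    """Build a 'btrfs scrub start' command for the given mount point.

    -B  stay in the foreground and print statistics when finished
    -d  print statistics per device
    -r  read-only: report errors without repairing them
    """
    cmd = ["btrfs", "scrub", "start", "-B", "-d"]
    if read_only:
        cmd.append("-r")
    cmd.append(mount_point)
    return cmd


# /mnt/cache is an assumed mount point, not taken from the diagnostics.
print(" ".join(build_scrub_cmd("/mnt/cache")))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` as root. On a single-device pool like this one there is no second copy of the data to repair from, so data blocks with bad checksums are reported rather than fixed.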

November 13, 2024

Author

I've started the process 3 times now. The first time it stopped after 600MB. I left it overnight in case it's a super long-running process, but it didn't progress, and I had to hard reboot the server as Unraid wasn't responsive. Just run it again and it stopped at 10% with the following...

November 13, 2024

Community Expert

25 minutes ago, aphillippe said:

Just run it again and it stopped at 10% with the following...

Do you mean the scrub stopped, or the server stopped responding?

If the former, also post new diags please.

November 13, 2024

Author

Retried several times and it finally finished.

[attached image: image.png]

I think there was an issue with Docker or VMs flooding the network. After starting the array I noticed network issues (other devices losing network/internet), and unplugging the Unraid server from the switch solved it instantly. It only happened after the array started, so I assume it was Docker or a VM. I suspect the scrub may have been finishing in the background while the network issues made the UI and SSH unresponsive. The result above is from after I disabled Docker and VMs and ran it again. It seems stable with the array started but no Docker containers or VMs for now.

So, is my SSD toast? Do I wipe and reinstall? Can I restore from my appdata backup or will that data be corrupt too? Thanks

Edited November 13, 2024 by aphillippe

November 13, 2024

Community Expert

The files mentioned in the syslog have data corruption. Delete them or restore them from backup, then re-run the scrub to confirm 0 errors.
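If a log line gives only a root and an inode number, as the `csum failed` line in the first post does, `btrfs inspect-internal inode-resolve` can map the inode back to a path. The sketch below collects the unique (root, inode) pairs from such lines and builds that command. The helper names and the `/mnt/cache` mount point are assumptions for illustration, and the path passed to `inode-resolve` has to be inside the subvolume the inode belongs to (root 5 is the top-level subvolume).

```python
import re

# Matches e.g. "csum failed root 5 ino 18493665 off 180224 csum ..."
CSUM_RE = re.compile(r"csum failed root (?P<root>-?\d+) ino (?P<ino>\d+) off (?P<off>\d+)")


def affected_inodes(lines):
    """Return unique (root, inode) pairs from 'csum failed' lines, in order."""
    seen = []
    for line in lines:
        match = CSUM_RE.search(line)
        if match is None:
            continue
        key = (int(match.group("root")), int(match.group("ino")))
        if key not in seen:
            seen.append(key)
    return seen


def resolve_cmd(ino, path):
    """Build the command that maps an inode number to file path(s)."""
    return ["btrfs", "inspect-internal", "inode-resolve", str(ino), path]


# Log line copied from the first post.
log_lines = [
    "Nov  8 04:46:47 Max kernel: BTRFS warning (device nvme0n1p1: state EA): "
    "csum failed root 5 ino 18493665 off 180224 csum 0x27aafbf4 "
    "expected csum 0x27aafb74 mirror 1",
]
for root, ino in affected_inodes(log_lines):
    # /mnt/cache is an assumed mount point for the pool.
    print(root, " ".join(resolve_cmd(ino, "/mnt/cache")))
```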

November 14, 2024

Author

Ok, I've deleted all the mentioned files. Rerunning the scrub, it seems to be stuck doing nothing...

I've hit cancel. It shows up in the logs but doesn't seem to actually cancel.

I've tried to reboot the server but logs now show this...

And no reboot. I'm guessing there's something more fundamental than just those three files?

Thanks for the help so far, by the way. Much appreciated

November 14, 2024

Community Expert

This error usually means bad RAM, or board/CPU.

November 15, 2024

Author

I've already replaced the RAM. What's the next step? I don't have a spare board or CPU to swap out. Return both? Would wiping the SSD and putting new file system on there help? Or would this (or a similar) issue likely surface again? I'm a bit stuck at this point.

November 15, 2024

Author

I left it running overnight and got this in the logs, if it helps...

November 15, 2024

Community Expert

24 minutes ago, aphillippe said:

Would wiping the SSD and putting new file system on there help?

You can try, and if new issues continue to appear, it will basically point to a hardware problem.
