BTRFS issues

BradJ · September 17, 2022

I noticed all my Dockers were not working. Upon trying to restart the Docker service I was getting "Docker service unable to start."

I did a reboot to see if that would fix the issue. The diagnostics posted are before the reboot.

Upon reboot, Docker failed to start again. I noticed a lot of BTRFS errors on my cache drives in the logs. So I ran a Scrub. No help - tons of errors in the logs.

I don't know what to do next. These are relatively new redundant cache drives, just a few weeks old.

The first diagnostics file is from right when I first noticed the problem. The second is from after the reboot and scrub.

Please help!

Brad

tower-diagnostics-20220916-2012.zip tower-diagnostics-20220916-2055.zip

BradJ · September 17, 2022

This is the last thing in the log when I try to start the Docker service:

Sep 16 21:06:08 Tower root: mount: /var/lib/docker: wrong fs type, bad option, bad superblock on /dev/loop2, missing codepage or helper program, or other error.
Sep 16 21:06:08 Tower kernel: BTRFS error (device loop2): bad tree block start, want 24931565568 have 6449542684566666880
Sep 16 21:06:08 Tower kernel: BTRFS error (device loop2): bad tree block start, want 24931565568 have 12792500698256506278
Sep 16 21:06:08 Tower kernel: BTRFS warning (device loop2): couldn't read tree root
Sep 16 21:06:08 Tower kernel: BTRFS error (device loop2): open_ctree failed
Sep 16 21:06:08 Tower root: mount error
Sep 16 21:06:08 Tower emhttpd: shcmd (651): exit status: 1

Should I recreate the docker image file or is something bigger happening here?

BradJ · September 17, 2022

Okay, I have recreated the docker image and my Dockers are all working again.

I'm just not sure what the underlying issue is/was.

Can anything be determined from the logs?

JorgeB · September 17, 2022

6 hours ago, BradJ said:

I'm just not sure what the underlying issue is/was.

Cache2 dropped offline:

Sep 14 04:40:02 Tower kernel: sd 2:0:0:0: [sdk] tag#22 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Sep 14 04:40:02 Tower kernel: sd 2:0:0:0: [sdk] tag#22 CDB: opcode=0x2a 2a 00 00 00 08 80 00 00 08 00
Sep 14 04:40:02 Tower kernel: blk_update_request: I/O error, dev sdk, sector 2176 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Sep 14 04:40:02 Tower kernel: BTRFS warning (device sdj1): lost page write due to IO error on /dev/sdk1 (-5)

Check/replace cables and also see here for better pool monitoring.
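
For anyone hunting for the same kind of dropout in their own syslog, the block-layer error lines quoted above can be counted per device. This is only a minimal sketch: the function name `count_io_errors` and the regular expression are illustrative assumptions based on the two log lines in this post, and other kernel versions may word the message differently.

```python
import re
from collections import Counter

# Matches kernel block-layer errors shaped like the line quoted in the post:
#   blk_update_request: I/O error, dev sdk, sector 2176 op 0x1:(WRITE) ...
# The pattern is an assumption derived from that single example.
IO_ERROR = re.compile(r"I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)")


def count_io_errors(lines):
    """Return a Counter mapping device name to the number of I/O error lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


sample = [
    "Sep 14 04:40:02 Tower kernel: blk_update_request: I/O error, dev sdk, "
    "sector 2176 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0",
    "Sep 14 04:40:02 Tower kernel: BTRFS warning (device sdj1): lost page "
    "write due to IO error on /dev/sdk1 (-5)",
]

for dev, n in count_io_errors(sample).items():
    print(f"{dev}: {n} I/O error line(s)")
```

A device that shows a burst of these lines while the other pool member shows none points toward that one drive's cabling or power, which is the pattern JorgeB identified here.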

BradJ · September 17, 2022

JorgeB to the rescue again!

I ran the script and I have tons of errors:

[/dev/sdb1].write_io_errs 0
[/dev/sdb1].read_io_errs 0
[/dev/sdb1].flush_io_errs 0
[/dev/sdb1].corruption_errs 0
[/dev/sdb1].generation_errs 0
[/dev/sdc1].write_io_errs 304324335
[/dev/sdc1].read_io_errs 4301948
[/dev/sdc1].flush_io_errs 2290865
[/dev/sdc1].corruption_errs 14826483
[/dev/sdc1].generation_errs 16809

Do you recommend I replace the cache2 cable and then run another scrub?
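
Output in the format above can also be filtered down to just the non-zero counters. The following is a rough sketch, not the monitoring script referenced in this thread: `failing_devices` is a hypothetical helper name, and the line format it expects is assumed from the output pasted in this post.

```python
import re

# One stats line looks like: [/dev/sdc1].write_io_errs 304324335
# (format assumed from the output pasted above).
STAT_LINE = re.compile(
    r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$"
)


def failing_devices(stats_text):
    """Return {device: {counter: value}} containing only non-zero counters."""
    failing = {}
    for line in stats_text.splitlines():
        match = STAT_LINE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognised lines
        value = int(match.group("value"))
        if value > 0:
            device = failing.setdefault(match.group("dev"), {})
            device[match.group("counter")] = value
    return failing


sample = """\
[/dev/sdb1].write_io_errs 0
[/dev/sdb1].read_io_errs 0
[/dev/sdc1].write_io_errs 304324335
[/dev/sdc1].generation_errs 16809
"""

for dev, counters in failing_devices(sample).items():
    print(dev, counters)
```

With the numbers from this post, only /dev/sdc1 would be reported, matching the drive that dropped offline.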

JorgeB · September 17, 2022

Yes, check/replace power cable also, and check that all errors were corrected by the scrub.

BradJ · September 22, 2022

There may have been some tension on the SATA cable. I rerouted and reseated the SATA cable.

I ran another Scrub and no errors are being reported.

I reset the BTRFS stats according to the post you referenced about the BTRFS monitor script. After re-running the script all errors are now 0.

The script is now scheduled to run daily to monitor the cache pool.

Once again, thank you JorgeB. I would be lost without you.
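
A daily check like the one described above could be approximated as follows. This is a hedged sketch and not the script from the referenced post: it assumes the installed btrfs-progs supports the `--check` option of `btrfs device stats` (which makes the command exit non-zero when any counter is non-zero), it assumes the pool is mounted at `/mnt/cache`, and the names `pool_is_clean` and `main` are hypothetical.

```python
import subprocess


def pool_is_clean(mount_point, run=subprocess.run):
    """Return True when `btrfs device stats --check` exits with status 0.

    Any non-zero status is treated as "not clean". That includes non-zero
    error counters, but also failures such as a missing mount point or
    insufficient privileges, so a False result deserves a manual look.
    The `run` parameter exists only so the call can be replaced in tests.
    """
    result = run(
        ["btrfs", "device", "stats", "--check", mount_point],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def main(argv):
    """Check the pool given as argv[1] (default: /mnt/cache, an assumption)."""
    mount_point = argv[1] if len(argv) > 1 else "/mnt/cache"
    if pool_is_clean(mount_point):
        print(f"{mount_point}: all btrfs error counters are zero")
        return 0
    print(f"{mount_point}: btrfs reported errors or the check failed")
    return 1
```

A scheduled job could call `sys.exit(main(sys.argv))` and send a notification on a non-zero exit status. The counters are persistent, so they stay non-zero after a repair until they are reset, which is why resetting them after the clean scrub was a necessary step here.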
