Ymetro Posted July 21, 2020

One of the two SSDs in my cache pool appears to have BTRFS errors, which is forcing the filesystem into read-only mode and crashing my Docker containers. The diagnostics zip is attached. I cannot seem to fix it with btrfs check --repair /dev/sdi1 -p; it gives a few errors (which I cannot recall at the moment) during the first of the seven steps ([1/7]). If needed, I can put the array into maintenance mode to run a check; please let me know if that is necessary.

The SSD log says:

Quote
Jul 21 21:46:10 PCUS kernel: ata8: SATA max UDMA/133 abar m2048@0xf33ff000 port 0xf33ff180 irq 26
Jul 21 21:46:10 PCUS kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 21 21:46:10 PCUS kernel: ata8.00: supports DRM functions and may not be fully accessible
Jul 21 21:46:10 PCUS kernel: ata8.00: disabling queued TRIM support
Jul 21 21:46:10 PCUS kernel: ata8.00: ATA-9: Samsung SSD 850 EVO 1TB, S2RFNX0H502874J, EMT02B6Q, max UDMA/133
Jul 21 21:46:10 PCUS kernel: ata8.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
Jul 21 21:46:10 PCUS kernel: ata8.00: supports DRM functions and may not be fully accessible
Jul 21 21:46:10 PCUS kernel: ata8.00: disabling queued TRIM support
Jul 21 21:46:10 PCUS kernel: ata8.00: configured for UDMA/133
Jul 21 21:46:10 PCUS kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
Jul 21 21:46:10 PCUS kernel: sd 8:0:0:0: [sdi] Write Protect is off
Jul 21 21:46:10 PCUS kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Jul 21 21:46:10 PCUS kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 21 21:46:10 PCUS kernel: sdi: sdi1
Jul 21 21:46:10 PCUS kernel: sd 8:0:0:0: [sdi] Attached SCSI removable disk
Jul 21 21:46:10 PCUS kernel: BTRFS: device fsid ea8b613a-3208-4cd9-a512-80eb1fb736c5 devid 1 transid 14672899 /dev/sdi1
Jul 21 21:46:53 PCUS emhttpd: Samsung_SSD_850_EVO_1TB_S2RFNX0H502874J (sdi) 512 1953525168
Jul 21 21:46:53 PCUS emhttpd: import 30 cache device: (sdi) Samsung_SSD_850_EVO_1TB_S2RFNX0H502874J
Jul 21 21:47:04 PCUS kernel: BTRFS info (device sdi1): disk space caching is enabled
Jul 21 21:47:04 PCUS kernel: BTRFS info (device sdi1): has skinny extents
Jul 21 21:47:04 PCUS kernel: BTRFS info (device sdi1): enabling ssd optimizations
Jul 21 21:47:04 PCUS kernel: BTRFS info (device sdi1): start tree-log replay
Jul 21 21:47:04 PCUS kernel: BTRFS warning (device sdi1): block group 3243245568 has wrong amount of free space
Jul 21 21:47:04 PCUS kernel: BTRFS warning (device sdi1): failed to load free space cache for block group 3243245568, rebuilding it now
Jul 21 21:47:04 PCUS kernel: BTRFS info (device sdi1): checking UUID tree
Jul 21 21:47:04 PCUS kernel: BTRFS info (device sdi1): resizing devid 1
Jul 21 21:47:04 PCUS kernel: BTRFS info (device sdi1): new size for /dev/sdi1 is 1000204853248
Jul 21 21:47:04 PCUS kernel: BTRFS info (device sdi1): resizing devid 2
Jul 21 21:47:04 PCUS kernel: BTRFS info (device sdi1): new size for /dev/sdh1 is 1000204853248
Jul 21 21:47:05 PCUS s3_sleep: included disks=sdd sde sdf sdg sdh sdi sdj sdk
Jul 21 21:48:48 PCUS kernel: BTRFS critical (device sdi1): corrupt leaf: root=2 block=3032002707456 slot=167, unexpected item end, have 929783252 expect 7432
Jul 21 21:48:48 PCUS kernel: BTRFS: error (device sdi1) in __btrfs_free_extent:6805: errno=-5 IO failure
Jul 21 21:48:48 PCUS kernel: BTRFS critical (device sdi1): corrupt leaf: root=2 block=3032002707456 slot=167, unexpected item end, have 929783252 expect 7432
Jul 21 21:48:48 PCUS kernel: BTRFS info (device sdi1): forced readonly
Jul 21 21:48:48 PCUS kernel: BTRFS: error (device sdi1) in btrfs_run_delayed_refs:2935: errno=-5 IO failure
Jul 21 21:48:48 PCUS kernel: BTRFS: error (device sdi1) in __btrfs_free_extent:6805: errno=-5 IO failure
Jul 21 21:48:48 PCUS kernel: BTRFS: error (device sdi1) in btrfs_run_delayed_refs:2935: errno=-5 IO failure
Jul 21 21:48:48 PCUS kernel: BTRFS error (device sdi1): pending csums is 27144192

And the syslog shows this almost everywhere, highlighted in red:

Quote
Jul 21 22:05:54 PCUS kernel: print_req_error: I/O error, dev loop2, sector 1348832
Jul 21 22:05:54 PCUS kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 347, rd 0, flush 0, corrupt 0, gen 0
Jul 21 22:06:25 PCUS kernel: loop: Write error at byte offset 686866432, length 4096.
Jul 21 22:06:25 PCUS kernel: print_req_error: I/O error, dev loop2, sector 1341536
Jul 21 22:06:25 PCUS kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 348, rd 0, flush 0, corrupt 0, gen 0
Jul 21 22:06:25 PCUS kernel: loop: Write error at byte offset 690601984, length 4096.
Jul 21 22:06:25 PCUS kernel: print_req_error: I/O error, dev loop2, sector 1348832
Jul 21 22:06:25 PCUS kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 349, rd 0, flush 0, corrupt 0, gen 0
Jul 21 22:06:56 PCUS kernel: loop: Write error at byte offset 686866432, length 4096.
Jul 21 22:06:56 PCUS kernel: print_req_error: I/O error, dev loop2, sector 1341536
Jul 21 22:06:56 PCUS kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 350, rd 0, flush 0, corrupt 0, gen 0
Jul 21 22:06:56 PCUS kernel: loop: Write error at byte offset 690601984, length 4096.
Jul 21 22:06:56 PCUS kernel: print_req_error: I/O error, dev loop2, sector 1348832
Jul 21 22:06:56 PCUS kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 351, rd 0, flush 0, corrupt 0, gen 0
Jul 21 22:07:26 PCUS kernel: loop: Write error at byte offset 686866432, length 4096.
Jul 21 22:07:26 PCUS kernel: print_req_error: I/O error, dev loop2, sector 1341536
Jul 21 22:07:26 PCUS kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 352, rd 0, flush 0, corrupt 0, gen 0
Jul 21 22:07:26 PCUS kernel: loop: Write error at byte offset 690601984, length 4096.
Jul 21 22:07:26 PCUS kernel: print_req_error: I/O error, dev loop2, sector 1348832
Jul 21 22:07:26 PCUS kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 353, rd 0, flush 0, corrupt 0, gen 0
Jul 21 22:07:57 PCUS kernel: loop: Write error at byte offset 686866432, length 4096.
Jul 21 22:07:57 PCUS kernel: print_req_error: I/O error, dev loop2, sector 1341536
... etc.

I noticed Nextcloud would not respond and found the MariaDB Docker container missing from the Docker tab in the WebGUI. I did notice 2 CRC error counts, which may be related to an earlier bad cable connection, but the connectors seem fine now. I might replace the cable if it turns out to be the cause of this. Could the SSD in the cache pool be faulty? The errors seem to show up when I re-add the disappeared MariaDB Docker from the "Previously Installed" list in Community Applications.

I already had to replace a hard drive (a spinning one) that gave so many errors it was disabled and emulated. I should RMA it, as it is still within its warranty period. I also added two disks of the same capacity to replace smaller ones. Could some corruption have moved from the array to the cache disks, or are those isolated from each other?

I really need some system stability, as it currently seems I have to reboot the system every other day. Can someone help?

pcus-diagnostics-20200721-2201.zip
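For reference, a read-only way to gauge the damage before trying --repair again (just a sketch; it assumes the pool still mounts at /mnt/cache and uses the device name from the log above, so double-check before running anything):

# per-device btrfs error counters (write/read/flush/corruption/generation)
btrfs device stats /mnt/cache

# read-only consistency check; needs the pool unmounted, e.g. array in maintenance mode
# (unlike --repair, this does not write anything to the disk)
btrfs check --readonly /dev/sdi1

# if the pool still mounts, a scrub verifies data and metadata checksums
btrfs scrub start /mnt/cache
btrfs scrub status /mnt/cache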
Ymetro Posted July 21, 2020 (Author)

I am not sure the topic title matches the content; I am not very good at that. I am open to suggestions.
JorgeB Posted July 22, 2020

Best bet is to back up the cache data and re-format the pool.
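For anyone following along, the rough shape of that procedure on Unraid might look like this (a sketch only, not exact instructions; the device names are the ones from the log above, and the wipefs step assumes the data has already been copied off, see the rsync sketch a bit further down):

# 1. copy everything off the pool first
# 2. stop the array, then clear the old filesystem signatures on both pool members
wipefs -a /dev/sdi1
wipefs -a /dev/sdh1
# 3. start the array again; Unraid will show the pool as unmountable and offer to
#    format it from the GUI, recreating the btrfs raid1 pool; then restore the data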
Ymetro Posted July 25, 2020 (Author, edited)

Thanks for the reply. I stopped all the VMs and Docker containers and also ran the mover, although the latter did not seem to do much. I then copied the contents of /mnt/cache/ to my backup array share /mnt/user0/backup/cache backup/cache/ in Midnight Commander. There were some file errors. I will try again once I put the array in maintenance mode.

Before I get this finished and reformat the SSDs: is BTRFS still the most reliable option for an SSD cache pool nowadays?

Edited July 25, 2020 by Ymetro (added MC and mover info)
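In case it helps, rsync tends to cope better with a half-readable source than a plain copy, since it keeps going past files it cannot read and reports them at the end. A minimal sketch (the destination is only an example path on the backup share; adjust to the actual one):

# copy the cache contents to the array, logging any files that fail to read
rsync -avh --progress /mnt/cache/ /mnt/user0/backup/cache/ 2>/tmp/cache-copy-errors.log

# review what could not be copied from the failing pool
cat /tmp/cache-copy-errors.log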
Ymetro Posted July 25, 2020 (Author)

Maintenance mode didn't help for the copy, because the disks aren't mounted in that mode. I just restarted the server with Docker and Virtual Machines disabled and started another run of copying files with MC to the backup folder.
JorgeB Posted July 25, 2020

1 hour ago, Ymetro said:
Is BTRFS still the most reliable option for an SSD cache pool nowadays?

It's currently the only option for a multi-device pool; for a single cache device you can also use XFS. There are some btrfs recovery options here.
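The recovery options linked there generally come down to something along these lines (a sketch, not the exact contents of that link; the mount point and restore destination are made-up examples):

# try mounting read-only from an older tree root to copy data off a damaged pool
mkdir -p /x
mount -o ro,usebackuproot /dev/sdi1 /x

# if it will not mount at all, btrfs restore can pull files out without mounting
btrfs restore -v /dev/sdi1 /mnt/user0/backup/restore/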
Ymetro Posted August 24, 2020 (Author)

On 7/25/2020 at 12:53 PM, johnnie.black said:
It's currently the only option for a multi-device pool; for a single cache device you can also use XFS. There are some btrfs recovery options here.

Thank you for the info, I appreciate it. I even got the 10GbE MTU set to 9000 on both the server and the main PC NIC, and transfer speeds seem faster. The cache pool has been reformatted, the backup has been restored to it, and it seems to run fine.

Now I am a bit worried about my parity disk, as its Reported Uncorrect count went up to 17. It seems I either got a bad batch of two Seagate IronWolf 14TB (non-Pro) drives, and/or my PSU is not up to powering them; the latter is something I read elsewhere on this forum. I do wonder, though: two of the three non-Pro IronWolfs came from Amazon. I hope they are kind to HDDs... Maybe I should put this in a new topic?
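A quick way to keep an eye on that attribute (a sketch; /dev/sdX stands for whichever device the parity disk currently is):

# full SMART attribute table, including Reported_Uncorrect (attribute 187)
smartctl -A /dev/sdX

# overall SMART health verdict
smartctl -H /dev/sdX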