Hey all. First time poster, long time Unraid user. A few years ago I set up a streaming/automation server with 3 Dockers (Plex/Jellyfin/Telegraf) and 1 VM (Home Assistant). This has worked 'well' for a while, but for at least the past year I've been fighting a BTRFS cache pool corruption problem. The server works wonderfully... so long as I do not reboot. As soon as I reboot, the VM/Dockers will not start.
My cache pool device stats look like this:
[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].read_io_errs 0
[/dev/nvme1n1p1].flush_io_errs 0
[/dev/nvme1n1p1].corruption_errs 0
[/dev/nvme1n1p1].generation_errs 0
[/dev/nvme0n1p1].write_io_errs 1408272649
[/dev/nvme0n1p1].read_io_errs 820356362
[/dev/nvme0n1p1].flush_io_errs 40914172
[/dev/nvme0n1p1].corruption_errs 1630001
[/dev/nvme0n1p1].generation_errs 15037
So my 'fix' for this corruption (which I've done maybe 3 times in as many months) was to remove the corrupted nvme0n1p1 from the pool, start the array, add it back to the pool (overwriting the corrupted data), and then rebalance against the clean nvme1n1p1. The problem is that recently BOTH NVMe drives have started to corrupt. Performing a scrub with 'repair corruption' checked does not fix the issue (even if I do it immediately after errors are detected). I installed a user script to monitor btrfs dev stats /mnt/cache, so I'm aware of when the errors appear (usually many per day).
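For reference, the monitoring user script is essentially this (a minimal sketch for the User Scripts plugin on a cron schedule; the pool path is my cache mount, and the notify helper is Unraid's built-in one):

```shell
#!/bin/bash
# Watch btrfs device stats on the cache pool and raise an Unraid
# notification when any error counter is nonzero.
POOL=/mnt/cache

# btrfs dev stats prints lines like:
#   [/dev/nvme0n1p1].write_io_errs 1408272649
# Keep only the lines whose counter (field 2) is nonzero.
errors=$(btrfs dev stats "$POOL" | awk '$2 != 0')

if [ -n "$errors" ]; then
    # Unraid's built-in notification helper (Dynamix).
    /usr/local/emhttp/plugins/dynamix/scripts/notify \
        -s "btrfs errors on $POOL" -d "$errors" -i alert
fi
```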
What can I do? I've read on these forums that it could be a memory issue, so I ran a Memtest (free version, 4 passes; it came back 100% pass with zero errors). I've read it could be a cable issue, but these are M.2 NVMe drives: one connected directly to the motherboard and the other via a PCIe 3.0 x4 adapter card. I've also tried the NVMe latency tweak in the Linux boot config. Temps do regularly spike to around 50-60C, but I figured that's within normal range? One drive is a Pony 1TB and the other is a Team 1TB.
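In case it helps diagnosis, this is roughly how I've been checking the drives' SMART health and temps (a sketch assuming smartctl from smartmontools and the default NVMe device names; adjust the device list to your system):

```shell
#!/bin/bash
# Pull the most relevant NVMe SMART fields for both cache drives:
# critical warning flags, temperature, wear level, and media errors.
for dev in /dev/nvme0 /dev/nvme1; do
    echo "=== $dev ==="
    smartctl -a "$dev" | \
        grep -Ei 'temperature|media.*errors|percentage used|critical warning'
done
```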
Any advice would be greatly appreciated. I’m on Unraid 6.11.5.