970 Evo M2 SSD cache pool filled with errors [BTRFS]

ars92 · August 15, 2023

Hey all,

First time having a potential hardware issue, so I apologize if I missed certain prerequisite details.

My server has been running fine for a long time. I even replaced a disk a few days back by upgrading parity and reusing the old parity disk as a data disk, and everything was great after that. But last night, Docker started acting weird and I noticed errors in my cache pool when I downloaded the SMART report. I ran a memtest for 13 hours (4 passes) and it passed, so I guess the 4 RAM sticks are fine.

The vdisks can't be copied to my array past a certain point, so I guess they are corrupted. Good thing my appdata is backed up every month by CA Backup. I just want to understand whether my M.2 drives are going bad and I should buy new ones, or whether I should try my luck by reformatting and reusing them. Attached are the diagnostics taken after running a scrub (which couldn't correct any of the errors), plus the output of a command JorgeB recommends running (it shows a whole lot of errors!).

Also attached are the SMART reports from both cache disks. I can't do any cable checks since they are M.2 drives connected directly to the motherboard. I have a third drive used as an unassigned disk, which seems to be fine for now (an ADATA drive bought a few years after these two).

From the scrub page:

UUID: d7811189-42b8-4d37-a4d0-dae7ee9e73f6
Scrub started: Tue Aug 15 21:03:35 2023
Status: aborted
Duration: 0:09:38
Total to scrub: 478.01GiB
Rate: 833.38MiB/s
Error summary: read=135
Corrected: 0
Uncorrectable: 135
Unverified: 0

root@Tower:~# btrfs dev stats /mnt/cache
[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 130
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 2
[/dev/nvme0n1p1].generation_errs 0
[/dev/nvme2n1p1].write_io_errs 354329
[/dev/nvme2n1p1].read_io_errs 339856
[/dev/nvme2n1p1].flush_io_errs 1334
[/dev/nvme2n1p1].corruption_errs 2806
[/dev/nvme2n1p1].generation_errs 0
root@Tower:~#
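
For anyone reading these counters later, here is a minimal Python sketch (not part of the original post) that reduces the `btrfs dev stats` output above to the non-zero counters per device. The function name `nonzero_counters` and the regular expression are assumptions based only on the line format shown above.

```python
import re
from collections import defaultdict

# Matches lines such as "[/dev/nvme0n1p1].read_io_errs 130".
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def nonzero_counters(stats_text):
    """Return {device: {counter: value}} for every non-zero counter."""
    result = defaultdict(dict)
    for line in stats_text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match and int(match["value"]) > 0:
            result[match["dev"]][match["counter"]] = int(match["value"])
    return dict(result)


# Output copied from the post above.
SAMPLE = """\
[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 130
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 2
[/dev/nvme0n1p1].generation_errs 0
[/dev/nvme2n1p1].write_io_errs 354329
[/dev/nvme2n1p1].read_io_errs 339856
[/dev/nvme2n1p1].flush_io_errs 1334
[/dev/nvme2n1p1].corruption_errs 2806
[/dev/nvme2n1p1].generation_errs 0
"""

for dev, counters in nonzero_counters(SAMPLE).items():
    print(dev, counters)
```

Both devices show non-zero counters here, with nvme2n1p1 far worse than nvme0n1p1.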

tower-diagnostics-20230815-2115.zip tower-smart-20230815-2130.zip tower-smart-20230815-2131.zip

Edited August 15, 2023 by ars92
add memtest86 results

JorgeB · August 15, 2023

Aug 15 20:04:37 Tower kernel: critical medium error, dev nvme0n1
...
Aug 15 20:25:34 Tower kernel: critical medium error, dev nvme2n1

These are device errors on both pool devices, so yes, they should be replaced. Try to copy whatever you can and create a new pool.
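
As a rough illustration of how log lines like the two quoted above can be tallied per device, here is a small Python sketch. It is not something posted in the thread; the function name `count_medium_errors` and the regular expression are assumptions based only on the two sample lines, and the real syslog lines may carry extra fields after the device name.

```python
import re
from collections import Counter

# Captures the device name that follows "critical medium error, dev ".
MEDIUM_ERROR = re.compile(r"critical medium error, dev (\w+)")


def count_medium_errors(log_lines):
    """Count 'critical medium error' kernel messages per device."""
    counts = Counter()
    for line in log_lines:
        match = MEDIUM_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The two lines quoted in this reply.
SAMPLE_LOG = [
    "Aug 15 20:04:37 Tower kernel: critical medium error, dev nvme0n1",
    "Aug 15 20:25:34 Tower kernel: critical medium error, dev nvme2n1",
]

print(count_medium_errors(SAMPLE_LOG))
```

Seeing more than one device name in the result is what points to both pool members having media errors rather than just one.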

JorgeB · August 15, 2023

Forgot to mention: nvme2n1 dropped offline at some point in the past. That's likely why the mirrored pool cannot recover everything. It's unlikely that both devices have errors on the same sectors, but since that device was out of sync, it cannot be used to recover the errors on the other one. See here for better pool monitoring for the future.
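
To give an idea of what such monitoring can look like, here is a minimal Python sketch. It is not the script linked above, only an illustration under stated assumptions: it relies on the `--check` option of `btrfs device stats`, which makes the command exit non-zero when any error counter is non-zero. The function name `pool_has_errors` and the injectable `run` parameter are hypothetical and exist only to keep the example testable.

```python
import subprocess


def pool_has_errors(mount_point, run=subprocess.run):
    """Return True if `btrfs device stats --check` exits non-zero.

    A non-zero exit can also mean the command itself failed (for example,
    the pool is not mounted), so treat True as "needs a look".
    """
    completed = run(
        ["btrfs", "device", "stats", "--check", mount_point],
        capture_output=True,
        text=True,
    )
    return completed.returncode != 0
```

Calling `pool_has_errors("/mnt/cache")` from a scheduled job and sending a notification when it returns True would catch a device dropping out of the pool much earlier than waiting for Docker or VMs to misbehave.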

ars92 · August 17, 2023

Thanks JorgeB for the prompt reply. Sure enough, today the SSD has turned read-only; at least, I'm not able to start any of my VMs anymore. I managed to copy files out of the VMs two days ago (since some of the vdisks couldn't be copied out in their entirety), and the appdata backup was already there thanks to CA Backup (thanks Squid for this!!)

Planning to get a pair of Crucial P5 Plus, since the 5-year warranty on my two Evo Plus drives ended two months ago in June....lol

SN700 seems fun but way too expensive in my country for some reason.....

ars92 · August 20, 2023

So I've gotten the replacement SSD and replaced the disk without doing any reassigning etc., since the old drives have nothing useful on them anymore. Everything looks good and the Docker service is back up, but one thing is worrying me a bit. I have set up scrub and balance to run monthly now, just in case it helps in the future (I will set up the script suggested by JorgeB soon). But when I run "perform full balance", the page refreshes almost immediately (maybe due to nothing in the disks) and the recommendation doesn't go away.

I then tried running the CLI command below and got the output below, but the GUI still shows the same message. Should I just ignore this?

image.png.870aef466dadfeab3bd2450fea46ece2.png

JorgeB · August 20, 2023

4 hours ago, ars92 said:

maybe due to nothing in the disks

Correct, it's normal.
