
970 Evo M2 SSD cache pool filled with errors [BTRFS]


ars92
Solved by JorgeB

Hey all,

 

First time having a potential hardware issue, so I apologize if I missed certain prerequisite details.

 

My server has been running just fine for a long time; I even replaced a disk a few days back by upgrading parity and using the old parity disk as a data drive. Everything was great even after that. But last night Docker started acting weird, and I noticed errors in my cache pool when I downloaded the SMART report. I ran a memtest, which went on for 13 hours for 4 passes and passed, so I guess the 4 RAM sticks are fine.

 

The vdisks can't be copied into my array past a certain point, so I guess they're corrupted. Good thing my appdata backs up every month using CA Backup. I just want to understand whether my M.2 drives are going bad and I should purchase new ones, or whether I should try my luck reformatting and reusing them. I've attached diagnostics taken after running a scrub (which couldn't correct any of the errors), plus the output of a command JorgeB recommends running (it shows a whole lot of errors!).

 

I've also attached SMART reports from both cache disks. I can't do any cable checks since they are M.2 drives connected directly to the motherboard. I have a third drive used as an unassigned disk, which seems fine for now (an ADATA drive bought a few years after these two).

 

From the scrub page:

UUID:             d7811189-42b8-4d37-a4d0-dae7ee9e73f6
Scrub started:    Tue Aug 15 21:03:35 2023
Status:           aborted
Duration:         0:09:38
Total to scrub:   478.01GiB
Rate:             833.38MiB/s
Error summary:    read=135
  Corrected:      0
  Uncorrectable:  135
  Unverified:     0
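Since checking scrub results by hand gets old fast, here's a minimal sketch of how the summary above could be checked from a script. The function name is my own, and the parsing assumes the `btrfs scrub status` output layout shown above:

```shell
# Hypothetical helper: pull the Uncorrectable count out of `btrfs scrub status`
# output piped to stdin (field layout assumed from the report above).
scrub_uncorrectable() {
  awk '/Uncorrectable/ {print $2}'
}

# Example on a real system:
#   btrfs scrub status /mnt/cache | scrub_uncorrectable
```

Anything non-zero here means the scrub found blocks it could not repair from the mirror.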


 

root@Tower:~# btrfs dev stats /mnt/cache
[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     130
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  2
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme2n1p1].write_io_errs    354329
[/dev/nvme2n1p1].read_io_errs     339856
[/dev/nvme2n1p1].flush_io_errs    1334
[/dev/nvme2n1p1].corruption_errs  2806
[/dev/nvme2n1p1].generation_errs  0
root@Tower:~#

 

  

 

 

tower-diagnostics-20230815-2115.zip tower-smart-20230815-2130.zip tower-smart-20230815-2131.zip

P_20230815_194837.jpg

Edited by ars92: add memtest86 results
Solution (by JorgeB):

Forgot to mention: nvme2n1 dropped offline at some point in the past, and that's likely why the mirrored pool cannot recover everything. It's unlikely that both devices have errors on the same sectors, but since that device was out of sync it cannot be used to recover the errors on the other one. See here for better pool monitoring for the future.
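For anyone wanting the kind of monitoring mentioned above without the linked script, a minimal sketch of the core check (the function name is my own, and the example paths in the comments are assumptions about a stock Unraid install, not the actual script):

```shell
# Hypothetical check: succeed (exit 0) when any error counter in
# `btrfs dev stats` output piped to stdin is non-zero.
pool_has_errors() {
  awk '$2 != 0 {found=1} END {exit !found}'
}

# Example cron-able use (mount point and notify path assumed):
#   if btrfs dev stats /mnt/cache | pool_has_errors; then
#     /usr/local/emhttp/webGui/scripts/notify -i warning \
#       -s "BTRFS errors detected on cache pool"
#   fi
```

Catching the first non-zero counter early is what lets a mirror actually do its job, since both copies are still in sync at that point.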


Thanks JorgeB for the prompt reply. Sure enough, today the SSD has turned read-only; at the very least, I'm not able to start any of my VMs anymore. I managed to copy files out of the VMs two days back (since some of the vdisks couldn't be copied out in their entirety), and the appdata backup was already there thanks to CA Backup (thanks Squid for this!!)

 

Planning to get a pair of Crucial P5 Plus drives, since my two Evo Plus' 5-year warranty ended two months ago in June... lol

The SN700 seems fun, but it's way too expensive in my country for some reason...


=========================================================================================================================

 

So I've gotten the replacement SSD and replaced the disk without doing any reassigning etc., since the old drives had nothing useful on them anymore. Everything looks good and the Docker service is back up, but one thing is worrying me a bit. I have set up scrub and balance to run monthly now, just in case it helps in the future (I will set up the script suggested by JorgeB soon), but when I run "perform full balance" the page refreshes almost immediately (maybe because there's nothing on the disks yet) and the recommendation doesn't go away.

[screenshot: balance recommendation on the pool's balance page]

I then tried running the CLI command below and got the output shown, but the GUI still shows the same message. Should I just ignore this?

[screenshot: CLI balance command and its output]
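One way to double-check from the CLI whether a balance is actually running, independent of what the GUI banner says, is `btrfs balance status`. A small sketch (the function name is my own, and the matched output string is an assumption based on common btrfs-progs versions):

```shell
# Hypothetical helper: classify `btrfs balance status <mount>` output
# piped to stdin. btrfs-progs prints "No balance found on '<mount>'"
# when nothing is running (string assumed from common versions).
balance_state() {
  if grep -q 'No balance found'; then
    echo idle
  else
    echo running
  fi
}

# Example: btrfs balance status /mnt/cache | balance_state
```

If this reports `idle` right after starting a full balance on a nearly empty pool, the balance most likely just finished instantly, which would match the page refreshing immediately.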

 

 

 

