ars92 Posted August 15, 2023 (edited)

Hey all,

First time having a potential hardware issue, so I apologize if I missed certain prerequisite details.

My server has been running just fine for a long time. I even replaced a disk a few days back by upgrading parity and reusing the old parity disk as a data drive, and everything has been great even after that. But last night Docker started acting weird, and I noticed errors in my cache pool when I downloaded the SMART report.

I ran a memtest, which took 13 hours for 4 passes, and it passed, so I guess the 4 RAM sticks are fine. The vdisks fail to copy to my array past a certain point, so I guess they are corrupted. Good thing my appdata backs up every month using CA Backup.

I just want to understand whether my M.2 drives are going bad and I should purchase new ones, or whether I should try my luck by reformatting and using them again.

Attached are the diagnostics taken after running a scrub (which couldn't correct any of the errors), the output of a command JorgeB recommends running (it shows a whole lot of errors!), and the SMART reports from both cache disks. I can't do any cable checks since they are M.2 drives connected directly to the motherboard.
I have a third drive used as an unassigned disk, which seems to be fine for now (an ADATA drive bought a few years after these two).

From the scrub page:

UUID: d7811189-42b8-4d37-a4d0-dae7ee9e73f6
Scrub started: Tue Aug 15 21:03:35 2023
Status: aborted
Duration: 0:09:38
Total to scrub: 478.01GiB
Rate: 833.38MiB/s
Error summary: read=135
  Corrected: 0
  Uncorrectable: 135
  Unverified: 0a

root@Tower:~# btrfs dev stats /mnt/cache
[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 130
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 2
[/dev/nvme0n1p1].generation_errs 0
[/dev/nvme2n1p1].write_io_errs 354329
[/dev/nvme2n1p1].read_io_errs 339856
[/dev/nvme2n1p1].flush_io_errs 1334
[/dev/nvme2n1p1].corruption_errs 2806
[/dev/nvme2n1p1].generation_errs 0
root@Tower:~#

tower-diagnostics-20230815-2115.zip
tower-smart-20230815-2130.zip
tower-smart-20230815-2131.zip

Edited August 15, 2023 by ars92: add memtest86 results
JorgeB Posted August 15, 2023

Aug 15 20:04:37 Tower kernel: critical medium error, dev nvme0n1
...
Aug 15 20:25:34 Tower kernel: critical medium error, dev nvme2n1

These are device errors on both pool devices, so yes, they should be replaced. Try to copy whatever you can and create a new pool.
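Kernel lines like the two quoted above can be tallied per device to see which drives are reporting medium errors. This is a minimal Python sketch, assuming the syslog text has the same "critical medium error, dev <name>" shape as the quoted lines; the function name count_medium_errors is hypothetical and not part of Unraid or any existing tool, and the sample text is trimmed from the lines in this post.

```python
import re
from collections import Counter

# Captures the device name in lines shaped like the ones quoted in the post.
MEDIUM_ERROR = re.compile(r"critical medium error, dev (\w+)")


def count_medium_errors(log_text: str) -> Counter:
    """Count 'critical medium error' kernel lines per block device."""
    counts = Counter()
    for line in log_text.splitlines():
        match = MEDIUM_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample trimmed from the two log lines quoted in the thread.
sample = """\
Aug 15 20:04:37 Tower kernel: critical medium error, dev nvme0n1
Aug 15 20:25:34 Tower kernel: critical medium error, dev nvme2n1
"""

for device, hits in sorted(count_medium_errors(sample).items()):
    print(f"{device}: {hits}")
```

On a real system the input would be the full syslog from the diagnostics zip rather than this two-line sample.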
JorgeB Posted August 15, 2023 (Solution)

Forgot to mention: nvme2n1 dropped offline at some point in the past, which is likely why the mirrored pool cannot recover everything. It is unlikely that both devices have errors on the same sectors, but since that device was out of sync, it cannot be used to recover the errors on the other one. See here for better pool monitoring for the future.
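One way to approach the pool monitoring mentioned here is to check the btrfs dev stats counters on a schedule and raise an alert whenever any of them is non-zero, so a dropped device is noticed early. This is a minimal Python sketch, assuming the "[device].counter value" output format shown in the first post; parse_dev_stats and nonzero_counters are hypothetical helper names, and this is not the script JorgeB links to.

```python
import re

# One line of `btrfs dev stats` output, e.g. "[/dev/nvme0n1p1].read_io_errs 130"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)$")


def parse_dev_stats(text: str) -> dict:
    """Return {device: {counter_name: value}} parsed from dev stats output."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            device = stats.setdefault(match["dev"], {})
            device[match["name"]] = int(match["value"])
    return stats


def nonzero_counters(stats: dict) -> dict:
    """Keep only the devices and counters that report at least one error."""
    flagged = {}
    for dev, counters in stats.items():
        bad = {name: value for name, value in counters.items() if value > 0}
        if bad:
            flagged[dev] = bad
    return flagged


# Output copied from the first post in this thread.
sample = """\
[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 130
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 2
[/dev/nvme0n1p1].generation_errs 0
[/dev/nvme2n1p1].write_io_errs 354329
[/dev/nvme2n1p1].read_io_errs 339856
[/dev/nvme2n1p1].flush_io_errs 1334
[/dev/nvme2n1p1].corruption_errs 2806
[/dev/nvme2n1p1].generation_errs 0
"""

for dev, bad in nonzero_counters(parse_dev_stats(sample)).items():
    print(dev, bad)
```

On a live server the text would come from running btrfs dev stats /mnt/cache (for example through subprocess.run) instead of the pasted sample. As far as I know, btrfs-progs also has a --check option for device stats that returns a non-zero exit code when any counter is non-zero, which may be enough for a simple scheduled check.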
ars92 (Author) Posted August 17, 2023

Thanks JorgeB for the prompt reply. Sure enough, today the SSD has turned read-only; at least, I'm not able to start any of my VMs anymore. I managed to copy files out of the VMs two days back (since some of the vdisks couldn't be copied out in their entirety), and the appdata backup was already there thanks to CA Backup (thanks Squid for this!!)

Planning to get a pair of Crucial P5 Plus, since my two Evo Plus' 5 year warranty ended two months ago in June....lol

SN700 seems fun but way too expensive in my country for some reason.....
ars92 (Author) Posted August 20, 2023

So I've gotten the replacement SSD and got the disk replaced without doing any reassigning etc., since the old drives have nothing useful on them anymore. Everything looks good and the Docker service is back up, but this is worrying me a bit.

I have set up scrub and balance to run monthly now, just in case it helps in the future (I will set up the script suggested by JorgeB soon), but when I run "perform full balance" the page refreshes almost immediately (maybe due to nothing in the disks) and the recommendation doesn't go away. I then tried running the CLI command below and got the output below, but the GUI still shows the same message. Should I just ignore this?
JorgeB Posted August 20, 2023

4 hours ago, ars92 said: "maybe due to nothing in the disks"

Correct, it's normal.