geekypenguin Posted November 24, 2023 I replaced my cache drive about a week ago and added a second drive to make a raid1 pool. Since then, roughly every 2 days the Docker containers and VMs lock up, and any attempt to write to the cache drive returns a message that the file system is read-only. I've run a balance and a scrub; the scrub reports no errors, yet the problem keeps recurring. The only way to bring it back to life is to reboot, but it soon happens again. What have I missed? Or could the new SSDs just be faulty? Diagnostics attached. lisa-diagnostics-20231124-0959.zip
geekypenguin Posted November 24, 2023 Author A second diagnostics download, taken immediately after rebooting, in case that's of any use: lisa-diagnostics-20231124-1015.zip
JorgeB Posted November 24, 2023 "write time tree block corruption detected" This error in the first diags suggests a RAM issue. btrfs has also been detecting considerable data corruption:
Nov 24 10:04:31 LISA kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 973, gen 0
Start by running memtest.
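For anyone reading along, the counters in that log line can be pulled out programmatically. The following is only a minimal Python sketch: the function name `parse_btrfs_errs` is made up for illustration, and the pattern is written against the single log line quoted above, so other kernel versions may word it differently.

```python
import re

# Pattern written against the one log line quoted in this thread (an assumption:
# other kernels may format the per-device error summary differently).
BTRFS_ERRS = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line):
    """Return (device, {counter: value}) for a btrfs 'errs:' line, or None."""
    match = BTRFS_ERRS.search(line)
    if match is None:
        return None
    counters = {k: int(v) for k, v in match.groupdict().items() if k != "dev"}
    return match.group("dev"), counters


line = (
    "Nov 24 10:04:31 LISA kernel: BTRFS info (device nvme0n1p1): "
    "bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 973, gen 0"
)
device, counters = parse_btrfs_errs(line)

# Keep only the counters that are not zero.
nonzero = {name: value for name, value in counters.items() if value}
print(device, nonzero)
```

Here only `corrupt` is non-zero, and it is reported against /dev/nvme1n1p1, which matches the observation that the corruption is being logged for the second device in the pool.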
geekypenguin Posted November 24, 2023 Author Thanks, I'll run a memtest now. I saw those messages about nvme1, which is what made me suspect a bad SSD. I also see messages about multiple uncorrected fatal error received, frozen state error detected, and device recovery successful.
JorgeB Posted November 24, 2023 21 minutes ago, geekypenguin said: "I saw those messages about nvme1 which is what made me suspect a bad ssd?" For now it's only a filesystem issue, not a device problem.
geekypenguin Posted November 24, 2023 Author OK, the first two passes of memtest have returned zero errors. I'll keep it running a bit longer to be sure none materialise.
JorgeB Posted November 24, 2023 If no errors are found, I would reset the stats and try with just one stick of RAM; if more errors come up, try the other one. That will basically rule out a RAM issue. See here for how to reset the stats and monitor the pool.
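The linked instructions are not reproduced in this thread, so what follows is not that script. It is a rough Python sketch of the same idea, built on `btrfs device stats`, which prints one `[device].counter value` pair per line (adding `-z` to that command resets the counters). The mount point `/mnt/cache`, the function names, and the sample output are all assumptions for illustration.

```python
import subprocess


def parse_dev_stats(text):
    """Parse `btrfs device stats` output into {(device, counter): value}."""
    stats = {}
    for raw in text.splitlines():
        parts = raw.split()
        # Expected shape: "[/dev/xyz].counter_name   <number>"
        if len(parts) != 2 or "]." not in parts[0] or not parts[1].isdigit():
            continue
        device, counter = parts[0].split("].", 1)
        stats[(device.lstrip("["), counter)] = int(parts[1])
    return stats


def nonzero_stats(text):
    """Return only the counters that are above zero."""
    return {key: value for key, value in parse_dev_stats(text).items() if value}


def check_pool(mountpoint="/mnt/cache"):
    """Run `btrfs device stats` on an assumed mount point; needs root."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mountpoint],
        capture_output=True, text=True, check=True,
    )
    return nonzero_stats(result.stdout)


# Made-up sample output for illustration only, not taken from the diagnostics.
sample = """\
[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].corruption_errs  973
"""
print(nonzero_stats(sample))
```

Run on a schedule after resetting the stats, any non-empty result from `check_pool()` would mean new errors have appeared since the reset.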
geekypenguin Posted November 24, 2023 Author Thanks for your help. I've removed one RAM stick, reset the stats, and configured the user script as suggested. I'll let you know how it gets on.
geekypenguin Posted November 27, 2023 Author Sorry it's taken a few days to respond. There's been a lot to work through, and while I still don't know the cause, I reached a point where I had to stop and revert to a known-good setup. Firstly, I had the macvlan kernel issue that's known in 6.12, which complicated things. With each stick of RAM installed on its own, I was getting data corruption errors, always on the same disk. I was also getting corruption of my docker.img, which was causing Docker containers to crash without the cache going read-only. As the NVMe drives were new, I got a warranty replacement for the drive with all the errors and attempted to rebuild the cache pool onto the second drive, but was flooded with "nvme frozen state error detected, reset controller" etc. messages for the replacement drive. I read in a few bug reports to add ```nvme_core.default_ps_max_latency_us=0 pcie_aspm=off``` to the boot config, but this didn't help either. This is unfortunately where I had to stop. I've removed the second cache drive and reverted to single-drive mode for my cache, which has been working fine for a few days now with all the RAM reinstalled. I'm not sure where else to go from here, to be honest. I suppose I can stay like this with no redundancy on my cache, but I would like to get to the bottom of it.
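One thing worth ruling out with boot parameters like these is that they never reached the running kernel. A small Python sketch for checking this is below; it reads `/proc/cmdline`, which holds the command line the kernel was booted with. The function names and the example command line are hypothetical; only the two parameters come from the post above.

```python
from pathlib import Path

# The two parameters quoted in the post above.
WANTED = ("nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off")


def missing_params(cmdline, wanted=WANTED):
    """Return the wanted parameters that do not appear in a kernel command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


def check_running_kernel(path="/proc/cmdline"):
    """Check the command line of the currently running kernel (Linux only)."""
    return missing_params(Path(path).read_text())


# Hypothetical command line for illustration; not taken from the diagnostics.
example = "BOOT_IMAGE=/bzimage initrd=/bzroot nvme_core.default_ps_max_latency_us=0"
print(missing_params(example))
```

An empty list from `check_running_kernel()` would mean both parameters were applied, so their failure to help points away from a boot configuration mistake.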
JorgeB Posted November 27, 2023 24 minutes ago, geekypenguin said: "nvme frozen state error detected, reset controller" These are hardware/firmware-related errors. The easiest thing would be to try a different NVMe device or a different board.
geekypenguin Posted November 27, 2023 Author (edited November 28, 2023 by geekypenguin) Would trying a BIOS update be worthwhile before I go spending money?