December 30, 2025
Hi all,
I was wondering if somebody could give me a hand with solving the errors I have with my server. It looks like one of my NVMe cache drives 'suddenly' failed, and I am a bit at a loss how to proceed at this point. Any help would be much appreciated. I do have 2 cache drives (backup). I first noticed this because the docker image was failing to load (i.e. Sonarr was not working).
I am also wondering (if it is indeed the NVMe drive) how to avoid this in the future, or at least be notified before failure, if this is possible of course. I started wondering about this because the SSD endurance is at 0%, but this is not something that I look at frequently.
Regards,
Pieter
tower-diagnostics-20251230-1040.zip
Edited December 30, 2025 by poeterdebier: extra info
December 30, 2025 - Community Expert
Btrfs is detecting data corruption on both pool devices. This is most often a RAM issue; start by running Memtest.
December 30, 2025 - Author
Will do straight away, thanks for the fast feedback. Just an extra question: let's say that some memory is bad, would that mean that the pool cannot be saved/repaired? That I have data loss?
Update: Memtest is ongoing. It has been running for 20 minutes and has given me a "pass" (0 errors) so far.
Regards, Piet
Edited December 30, 2025 by poeterdebier
December 30, 2025 - Community Expert
58 minutes ago, poeterdebier said:
"would that mean that the pool cannot be saved/repaired? That I have data loss?"
It would depend on the damage, but yes, there could be some data loss.
Let memtest run a few passes, and note that it's only definitive if it finds errors.
December 30, 2025 - Author
OK, two passes done. No errors. I will let it run for a couple more hours as I am going out of the house anyway. Will keep you posted.
December 30, 2025 - Community Expert
If nothing is found, scrub the pool and post the results from the GUI.
December 30, 2025 - Author
4 passes, no errors on Memtest. Restarted the server. Docker image is up and running again.
Scrubbed nvme0n1, see attached screenshot. The common problems plugin also found issues. Do I need to scrub the other NVMe as well?
I have not yet started with the assistance mentioned in the 'suggested fix' from the common problems plugin. I would like to continue with community support first, which has been excellent.
tower-diagnostics-20251230-1644.zip
December 30, 2025 - Community Expert
Despite memtest not finding anything, I'm pretty sure there's a hardware problem there, RAM or CPU most likely. Try scrubbing the pool with just one of the DIMMs installed, then try the same with the other one and see if the results are the same.
December 30, 2025 - Community Expert
6 minutes ago, JorgeB said:
"I'm pretty sure there's a hardware problem there, RAM or CPU most likely."
Maybe, but those drives have a ton of writes and are both probably end of life.
December 30, 2025 - Community Expert
One of them is at 33% usage, the other is at 100%. If it was a device issue, they should return an error, not corrupt data, unless they have very bad firmware. It's not impossible that they are the problem, I just don't think it's very likely.
December 30, 2025 - Community Expert
12 minutes ago, JorgeB said:
"One of them is at 33% usage, the other is 100%, and if it was a device issue, they should return an error, not corrupt data, unless they have a very bad firmware, but it's not impossible they are the problem, just don't think it's very likely."
nvme0 is:
Data Units Read: 1,624,580,180 [831 TB]
Data Units Written: 284,023,005 [145 TB]
nvme1 is:
Data Units Read: 492,647,517 [252 TB]
Data Units Written: 234,470,619 [120 TB]
Rated writes for those are ~170 TB or so according to my google-fu, so they could be getting flaky.
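
The bracketed TB figures in the post above follow from how NVMe SMART counts "data units": one unit is 1000 blocks of 512 bytes. A minimal Python sketch of that arithmetic, using the written counts quoted above, is shown here. The function names are made up for illustration, and `RATED_TBW` is an assumption taken from the "~170" estimate in the thread, not a confirmed manufacturer figure.

```python
# One NVMe SMART "data unit" is 1000 blocks of 512 bytes = 512,000 bytes.
BYTES_PER_DATA_UNIT = 512 * 1000


def data_units_to_tb(units: int) -> float:
    """Convert NVMe SMART data units to decimal terabytes (10**12 bytes)."""
    return units * BYTES_PER_DATA_UNIT / 1e12


def fraction_of_rating(units_written: int, rated_tbw: float) -> float:
    """Share of the rated write endurance that has been consumed."""
    return data_units_to_tb(units_written) / rated_tbw


# "Data Units Written" values quoted in the post above.
WRITTEN = {"nvme0": 284_023_005, "nvme1": 234_470_619}

# Assumption: the "~170" TB estimate from the thread, not an official rating.
RATED_TBW = 170.0

if __name__ == "__main__":
    for name, units in WRITTEN.items():
        tb = data_units_to_tb(units)
        share = fraction_of_rating(units, RATED_TBW)
        print(f"{name}: {tb:.0f} TB written, {share:.0%} of the assumed rating")
```

Against the assumed 170 TB rating this puts the two drives at roughly 86% and 71% of rated writes, which is a different measure from the "Percentage Used" value the drive itself reports.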
December 30, 2025 - Community Expert
Their predicted life, according to the manufacturer, also shows on the SMART reports:
Percentage Used: 100%
Percentage Used: 33%
This, of course, is just a prediction, and typically they last longer than 100%. I had one reach close to 200% a couple of years back.
December 30, 2025 - Community Expert
11 minutes ago, JorgeB said:
"Their predicted life, according to the manufacturer, also shows on the SMART reports: Percentage Used: 100%, Percentage Used: 33%. This, of course, is just a prediction, and typically, they last longer than 100%, had one reach close to 200% a couple of years back."
I get ya, but with memtest passing and this being 'out of the blue', the drive wear looks like a definite point of interest. No shade at PNY, but I consider them to be a tier 2 OEM, and if the drive says it's reached 100% of its lifespan I'd say it's time to replace. The 33% drive might also have the OG TBW factor instead of their 'updated' one for that series, so it's likely really at 100% too.
December 30, 2025 - Author
I purchased the second SSD (1n1) a couple of years later than the 100% used one (0n1), so that likely explains the difference. Not sure exactly when I purchased it, but it will be around 2024/2025, so around 4+ years younger.
So, regardless of whether the error is coming from the memory or the NVMe, it looks like it is sensible to replace the NVMe with a new one (at least the 100% used one)? What would be the best way to keep an eye on this? A Docker app called Scrutiny, or just have a look at it periodically? I never received an error/warning/suggestion about the NVMe reaching its (calculated) lifespan.
JorgeB said:
"Try scrubbing the pool with just one of the DIMMs installed, then try the same with the other one and see if the results are the same."
Will try this at the next opportunity just to make sure. I hope it is not the memory though, talk about a bad time to have your memory go bad.
With the server now up and running (maybe temporarily), what would be your suggestions for backing up and data loss prevention? So far I have done the following:
- appdata backup
- flash backup
Update: As I was typing this I noticed that I got a new error. I did not see that error before (related to this issue, I mean). I had a drive that failed on me last year, but that has since been replaced by the Seagate.
tower-diagnostics-20251230-1931.zip
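
One way to "keep an eye on this" without opening the SMART page by hand is to read the `Percentage Used` line that the SMART report already contains (the same field quoted earlier in the thread) and flag it above a chosen level. The following is only a sketch in Python: it assumes the report text has already been captured (for example from smartmontools output), and the function names and the 80% threshold are hypothetical choices, not an Unraid feature.

```python
import re
from typing import Optional

# Hypothetical warning level in percent; pick whatever margin suits you.
WARN_AT = 80

# Matches a SMART report line such as "Percentage Used: 100%".
_USED_RE = re.compile(r"^Percentage Used:\s*(\d+)%", re.MULTILINE)


def percentage_used(smart_text: str) -> Optional[int]:
    """Return the 'Percentage Used' value from a SMART report, if present."""
    match = _USED_RE.search(smart_text)
    return int(match.group(1)) if match else None


def wear_warning(device: str, smart_text: str, warn_at: int = WARN_AT) -> Optional[str]:
    """Return a warning string when wear is at or above warn_at, else None."""
    used = percentage_used(smart_text)
    if used is None or used < warn_at:
        return None
    return f"{device}: {used}% of predicted life used (warning level {warn_at}%)"


if __name__ == "__main__":
    # The two values reported in this thread.
    reports = {
        "nvme0": "Percentage Used: 100%\n",
        "nvme1": "Percentage Used: 33%\n",
    }
    for dev, text in reports.items():
        print(wear_warning(dev, text) or f"{dev}: below warning level")
```

Run on a schedule and wired to a notification of your choice, a check like this would have flagged the 100% drive well before it reached that point, while leaving the 33% drive alone.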
December 30, 2025 - Community Expert
I would be wary of trying to do any file system repairs until the issue is identified, or it may make things worse.
December 30, 2025 - Author
What exactly do you mean by file system repairs? Replacing disk1 or the NVMe?
December 30, 2025 - Community Expert
1 hour ago, poeterdebier said:
"Not sure exactly when I purchased it but it will be around 2024/2025"
Anything after 2021 will have the 170TB endurance "limit".
December 30, 2025 - Author
I have ordered a new HDD to replace disk 1 (if necessary). The disks were all bought around the same time (they are 5+ years old), and disk 2 failed in July 2025. I mean, it started giving errors around that time. I am not sure exactly which errors anymore, as I just replaced the disk and rebuilt parity without issues. There was no reason (at that time) to start looking for other problems.
If there is something else to do before replacing disk 1, please let me know. It makes sense to me to replace this disk before continuing, but I am not a community expert 😄 so what do I know.
I am considering purchasing an extra NVMe just in case one of the current ones fails. Would you recommend adding it to the pool beforehand, or just waiting until an NVMe fails and then replacing it?
So, first steps, unless someone has a better approach:
- replace disk1 and rebuild
- run a parity check
- scrub with removed DIMMs as suggested by Jorge to see if there is a memory issue
December 30, 2025 - Community Expert
Just now, poeterdebier said:
"scrub with removed DIMMS as suggested by Jorge to see if there is a memory issue"
Rule out the bad RAM first or it will just keep sending bad data.
December 30, 2025 - Author
OK, that means scrubbing the cache before touching disk1. Would that possibly mean that disk1 will be mountable again? In other words, can a memory error cause this 'unmountable: wrong or no file system'?
December 30, 2025 - Author
Sorry, I don't want to spam, but I have an additional question regarding rebuilding disk1. Should I run xfs_repair on disk1 first?
Dec 30 20:33:41 Tower kernel: XFS (md1p1): Mounting V5 Filesystem 8c6a785e-215a-4193-b49b-741cfb95f8c2
Dec 30 20:33:42 Tower kernel: XFS (md1p1): Corruption warning: Metadata has LSN (40:2960824) ahead of current LSN (40:2958096). Please unmount and run xfs_repair (>= v4.3) to resolve.
Dec 30 20:33:42 Tower kernel: XFS (md1p1): log mount/recovery failed: error -22
Dec 30 20:33:42 Tower kernel: XFS (md1p1): log mount failed
Dec 30 20:33:42 Tower root: mount: /mnt/disk1: wrong fs type, bad option, bad superblock on /dev/md1p1, missing codepage or helper program, or other error.
December 31, 2025 - Community Expert
11 hours ago, poeterdebier said:
"But as an additional question regarding rebuilding disk1. Should I run xfs_repair on disk1 first?"
12 hours ago, JorgeB said:
"I would be wary of trying to do any file system repairs until the issue is identified, or it may make things worse."
December 31, 2025 - Author
I did a scrub of the cache pool as Jorge suggested, swapping the DIMMs. I removed one 8GB DIMM of RAM and restarted the server (no issues, except for disk1).
Scrub result (aft stick):
I stopped the server and removed the 2nd stick, reinserted the first stick, and ran a scrub of the cache pool again.
Scrub result (fwd stick):
Apart from the 2056 uncorrectable errors(?!), they look the same.
December 31, 2025 - Community Expert
They do, which in a way is good, but I'm still not sure what caused the problem; it may be intermittent.
You can check the syslog for the list of corrupt files, then delete them or restore them from a backup and run another scrub to confirm zero errors.
After that, reset the pool stats and keep monitoring for more errors: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/#findComment-700582
You can also check the filesystem on disk1, but I suspect the issue is still present, so more data corruption can occur.
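
Pulling the list of corrupt files out of the syslog can be scripted. The Python sketch below assumes that btrfs scrub warnings in the syslog end with a `(path: ...)` suffix naming the affected file; that message shape is an assumption and may vary between kernel versions, and the sample lines and file paths in the demo are invented for illustration, not taken from the diagnostics in this thread.

```python
import re
from typing import List

# Assumed shape: "... BTRFS warning (device X): checksum error ... (path: some/file)"
# The exact wording can differ between kernel versions.
_PATH_RE = re.compile(r"BTRFS (?:warning|error) .*\(path: (?P<path>.+)\)\s*$")


def corrupt_paths(syslog_text: str) -> List[str]:
    """Return the sorted, de-duplicated file paths named in btrfs warnings."""
    paths = set()
    for line in syslog_text.splitlines():
        match = _PATH_RE.search(line)
        if match:
            paths.add(match.group("path"))
    return sorted(paths)


if __name__ == "__main__":
    # Invented sample lines; real input would be the contents of the syslog.
    sample = "\n".join([
        "Dec 31 10:00:01 Tower kernel: BTRFS warning (device nvme0n1p1): "
        "checksum error at logical 1234 on dev /dev/nvme0n1p1, physical 1234, "
        "root 5, inode 257, offset 0, length 4096, links 1 "
        "(path: appdata/example/data.db)",
        "Dec 31 10:00:02 Tower kernel: XFS (md1p1): log mount failed",
    ])
    for path in corrupt_paths(sample):
        print(path)
```

The same file is usually reported many times during a scrub, so collecting the paths into a set keeps the resulting list short enough to act on: delete or restore each listed file, then scrub again.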
December 31, 2025 - Author
OK, thanks for the info. I will check the syslog, corrupt files, etc. as soon as I can. Probably not tonight 😄.
Regarding the filesystem check: is that something you recommend doing before replacing disk1 entirely? Or do you suspect that something else has caused disk1 to become corrupt (the intermittent problem you mentioned)?
The reason I ask is that you want to have disk1 up and running as soon as possible, right? If a second disk fails, then I will be losing data for sure.
I am still pretty inexperienced in all this, so sorry for asking so many questions.