nvme cache drive failure

December 30, 2025

Hi All,

I was wondering if somebody could give me a hand solving the errors I have with my server. It looks like one of my NVMe cache drives 'suddenly' failed. I am a bit at a loss as to how to proceed at this point. Any help would be much appreciated. I do have 2 cache drives (backup). I first noticed this because the Docker image failed to load (i.e. Sonarr was not working).

I am also wondering (if it is indeed the NVMe drive) how to avoid this in the future, or at least be notified before failure, if that is possible of course. I started wondering about this because the SSD endurance is at 0%, but this is not something that I look at frequently.

regards Pieter

tower-diagnostics-20251230-1040.zip

Edited December 30, 2025 by poeterdebier (extra info)

December 30, 2025

Community Expert

Btrfs is detecting data corruption on both pool devices. This is most often a RAM issue; start by running Memtest.

December 30, 2025

Author

Will do straight away, thanks for the fast feedback. Just an extra question: let's say that some memory is bad, would that mean that the pool cannot be saved/repaired? That I have data loss?

update:

Memtest is ongoing. It has been running for 20 minutes and gave me a "pass" (0 errors). The Memtest is still running right now.

gr Piet.

Edited December 30, 2025 by poeterdebier

December 30, 2025

Community Expert

58 minutes ago, poeterdebier said:
would that mean that the pool cannot be saved/repaired? That I have data loss?

It would depend on the damage, but yes, there could be some data loss.

Let memtest run a few passes, and note that it's only definitive if it finds errors.

December 30, 2025

Author

Ok, I have two passes done. No errors. Will let it run for a couple of hours more as I am going out of the house anyway. Will keep you posted.

December 30, 2025

Community Expert

If nothing is found, scrub the pool and post the results from the GUI.

December 30, 2025

Author

4 passes no errors on Memtest. Restarted the server. Docker image is up and running again.

Scrubbed nvme0n1, see attached screenshot.

Also, the Fix Common Problems plugin found:

Do I need to scrub the other NVMe as well?
I have not yet started on the steps mentioned in the 'suggested fix' from the Fix Common Problems plugin. I would like to continue with community support first, which has been excellent.

tower-diagnostics-20251230-1644.zip

December 30, 2025

Community Expert

Despite memtest not finding anything, I'm pretty sure there's a hardware problem there, RAM or CPU most likely. Try scrubbing the pool with just one of the DIMMs installed, then try the same with the other one and see if the results are the same.

December 30, 2025

Community Expert

6 minutes ago, JorgeB said:
I'm pretty sure there's a hardware problem there, RAM or CPU most likely.

Maybe, but those drives have a ton of writes and are both probably at end of life.

December 30, 2025

Community Expert

One of them is at 33% usage, the other at 100%. If it were a device issue, they should return an error rather than corrupt data, unless they have very bad firmware. It's not impossible that they are the problem; I just don't think it's very likely.

December 30, 2025

Community Expert

12 minutes ago, JorgeB said:
One of them is at 33% usage, the other at 100%. If it were a device issue, they should return an error rather than corrupt data, unless they have very bad firmware. It's not impossible that they are the problem; I just don't think it's very likely.

nvme0 is

Data Units Read: 1,624,580,180 [831 TB]

Data Units Written: 284,023,005 [145 TB]

nvme1 is

Data Units Read: 492,647,517 [252 TB]

Data Units Written: 234,470,619 [120 TB]

Rated writes for those are ~170 TB or so according to my google-fu, so they could be getting flaky.
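For anyone wanting to redo this arithmetic, here is a minimal Python sketch. It relies on the NVMe convention that one SMART "data unit" is 1,000 blocks of 512 bytes. The `RATED_TBW` value is only the ~170 TB figure mentioned in this thread, not a confirmed spec for these drives, and the function names are made up for illustration.

```python
# One NVMe SMART "data unit" is 1,000 x 512-byte blocks = 512,000 bytes.
BYTES_PER_DATA_UNIT = 512 * 1000

# Assumption: the ~170 TB rated endurance quoted in this thread.
RATED_TBW = 170.0


def data_units_to_tb(units: int) -> float:
    """Convert a 'Data Units Written/Read' counter to decimal terabytes."""
    return units * BYTES_PER_DATA_UNIT / 1e12


def wear_fraction(units_written: int, rated_tbw: float) -> float:
    """Fraction of the rated write endurance consumed so far."""
    return data_units_to_tb(units_written) / rated_tbw


# 'Data Units Written' values taken from the SMART output posted above.
drives = {"nvme0": 284_023_005, "nvme1": 234_470_619}

for name, units in drives.items():
    tb = data_units_to_tb(units)
    frac = wear_fraction(units, RATED_TBW)
    print(f"{name}: {tb:.1f} TB written, {frac:.0%} of an assumed {RATED_TBW:.0f} TB rating")
```

This is only a cross-check against an assumed rating; the drive's own wear estimate may be based on a different endurance figure.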

December 30, 2025

Community Expert

Their predicted life, according to the manufacturer, also shows on the SMART reports:

Percentage Used: 100%

Percentage Used: 33%

This, of course, is just a prediction, and typically, they last longer than 100%, had one reach close to 200% a couple of years back.

December 30, 2025

Community Expert

11 minutes ago, JorgeB said:
Their predicted life, according to the manufacturer, also shows on the SMART reports:
Percentage Used: 100%
Percentage Used: 33%
This, of course, is just a prediction, and typically, they last longer than 100%, had one reach close to 200% a couple of years back.

I get ya, but with memtest passing and the problem coming 'out of the blue', the drive wear looks like a definite point of interest. No shade at PNY, but I consider them to be a tier 2 OEM, and if the drive says it's reached 100% of its lifespan I'd say it's time to replace it. The 33% drive might be using the OG TBW factor instead of their 'updated' one for that series, so it's likely really at 100% too.

December 30, 2025

Author

I purchased the second SSD (1n1) a couple of years later than the 100% used one (0n1), so that likely explains the difference. Not sure exactly when I purchased it, but it will be around 2024/2025, so around 4+ years younger.

So, regardless of whether the error is coming from the memory or the NVMe, it looks like it is sensible to replace the NVMe with a new one (at least the 100% used one)? What would be the best way to keep an eye on this? A Docker app called Scrutiny or just have a look at it periodically? I never received an error/warning/suggestion about the NVMe reaching its (calculated) lifespan.
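One way to keep an eye on this without a separate app is a small periodic check of the "Percentage Used" line in the SMART report. The sketch below is only an illustration. It assumes the report text has already been captured (for example from `smartctl -a` on the NVMe device), and the function names and the 80% threshold are arbitrary choices, not anything Unraid provides.

```python
import re
from typing import Optional


def percentage_used(smart_text: str) -> Optional[int]:
    """Return the 'Percentage Used' value from an NVMe SMART report, if present."""
    match = re.search(r"^Percentage Used:\s*(\d+)%", smart_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def check_wear(smart_text: str, warn_at: int = 80) -> str:
    """Classify a SMART report; warn_at is an arbitrary example threshold."""
    used = percentage_used(smart_text)
    if used is None:
        return "unknown: no 'Percentage Used' line found"
    if used >= warn_at:
        return f"WARNING: {used}% of predicted life used"
    return f"ok: {used}% of predicted life used"


# The two values reported in this thread, as they appear in the SMART output.
for report in ("Percentage Used: 100%", "Percentage Used: 33%"):
    print(check_wear(report))
```

Run from a scheduled script, the returned message could be passed to whatever notification method is already in use.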

Try scrubbing the pool with just one of the DIMMs installed, then try the same with the other one and see if the results are the same.

Will try this at the next opportunity just to make sure. I hope it is not the memory though, talk about a bad time to have your memory go bad.

With the server now up and running (maybe temporarily), what would be your suggestions for backing up and data loss prevention? So far I have done the following:

appdata backup
flash backup

Update:

As I was typing this I noticed that I got a new error:

Screenshot 2025-12-30 at 19.30.10.png

I did not see that error before (related to this issue, I mean). I had a drive fail on me last year, but that has since been replaced by the Seagate.

tower-diagnostics-20251230-1931.zip

December 30, 2025

Community Expert

I would be wary of trying to do any file system repairs until the issue is identified, or it may make things worse.

December 30, 2025

Author

What exactly do you mean by file system repairs? Replacing disk1 or the NVMe?

December 30, 2025

Community Expert

1 hour ago, poeterdebier said:
Not sure exactly when I purchased it, but it will be around 2024/2025

Anything after 2021 will have the 170TB endurance "limit"

December 30, 2025

Author

I have ordered a new HDD to replace disk 1 (if necessary). The disks were all bought around the same time (they are 5+ years old), and disk 2 failed in July 2025. I mean, it started giving errors around that time. I am not sure exactly which errors anymore, as I just replaced the disk and rebuilt parity without issues. There was no reason (at that time) to start looking for other problems.

If there is something else to do before replacing disk 1, please let me know. It makes sense to me to replace this disk before continuing, but I am not a community expert 😄 so yeah, what do I know.

I am considering purchasing an extra NVMe in case one of the current ones fails. Would you recommend adding it to the pool beforehand, or waiting until an NVMe fails and then replacing it?

So, first steps (unless someone has a better approach!):

replace disk1 and rebuild
run a parity check
scrub with removed DIMMS as suggested by Jorge to see if there is a memory issue

December 30, 2025

Community Expert

Just now, poeterdebier said:
scrub with removed DIMMS as suggested by Jorge to see if there is a memory issue

Rule out the bad RAM first or it will just keep sending bad data

December 30, 2025

Author

Ok, that means scrubbing the cache before touching disk1.

Would that possibly mean that disk1 will be mountable again? That is, could a memory error, for example, cause this 'Unmountable: wrong or no file system'?

December 30, 2025

Author

Sorry, don't want to spam. But as an additional question regarding rebuilding disk1. Should I run xfs_repair on disk1 first?

Dec 30 20:33:41 Tower kernel: XFS (md1p1): Mounting V5 Filesystem 8c6a785e-215a-4193-b49b-741cfb95f8c2
Dec 30 20:33:42 Tower kernel: XFS (md1p1): Corruption warning: Metadata has LSN (40:2960824) ahead of current LSN (40:2958096). Please unmount and run xfs_repair (>= v4.3) to resolve.
Dec 30 20:33:42 Tower kernel: XFS (md1p1): log mount/recovery failed: error -22
Dec 30 20:33:42 Tower kernel: XFS (md1p1): log mount failed
Dec 30 20:33:42 Tower root: mount: /mnt/disk1: wrong fs type, bad option, bad superblock on /dev/md1p1, missing codepage or helper program, or other error.

December 31, 2025

Community Expert

11 hours ago, poeterdebier said:
But as an additional question regarding rebuilding disk1. Should I run xfs_repair on disk1 first?

12 hours ago, JorgeB said:
I would be wary of trying to do any file system repairs until the issue is identified, or it may make things worse.

December 31, 2025

Author

I did a scrub of the cache pool as Jorge suggested, swapping the DIMMs.

Removed one 8GB DIMM and restarted the server (no issues, except for disk1).

Scrub result (aft stick):

Stopped the server and removed the 2nd stick, reinserted the first stick, and ran a scrub of the cache pool again.

Scrub result (fwd stick):

Apart from the 2056 uncorrectable errors?! they look the same.

December 31, 2025

Community Expert

They do, which in a way is good, but still not sure what caused the problem; it may be intermittent.

You can check the syslog for the list of corrupt files, then delete them or restore them from a backup, and run another scrub to confirm zero errors.

After that, reset the pool stats and keep monitoring for more errors https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/#findComment-700582

You can also check the filesystem on disk1, but I suspect the issue is still present, so more data corruption can occur.
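As a rough illustration of the syslog check, here is a small Python sketch that collects file paths from btrfs scrub warnings. It assumes the kernel logs each affected file in a line ending with `(path: ...)`; the exact wording can vary between kernel versions, and the sample lines below are invented for the example, not taken from the posted diagnostics.

```python
import re

# Assumption: affected files appear in BTRFS warning/error lines ending in "(path: ...)".
PATH_RE = re.compile(r"BTRFS (?:warning|error).*\(path: (.+)\)\s*$")


def corrupt_paths(syslog_lines):
    """Return the unique file paths named in btrfs scrub warnings, in first-seen order."""
    seen = []
    for line in syslog_lines:
        match = PATH_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Hypothetical sample lines, for illustration only.
sample = [
    "Dec 31 10:00:01 Tower kernel: BTRFS warning (device nvme0n1p1): checksum error "
    "at logical 1234 on dev /dev/nvme0n1p1, physical 5678, root 5, inode 257, "
    "offset 0, length 4096, links 1 (path: appdata/example/db.sqlite)",
    "Dec 31 10:00:01 Tower kernel: BTRFS warning (device nvme0n1p1): checksum error "
    "at logical 1234 on dev /dev/nvme0n1p1, physical 5678, root 5, inode 257, "
    "offset 0, length 4096, links 1 (path: appdata/example/db.sqlite)",
    "Dec 31 10:00:02 Tower kernel: BTRFS info (device nvme0n1p1): scrub finished",
]

for path in corrupt_paths(sample):
    print(path)
```

In practice the lines would be read from the syslog file in the diagnostics zip. The paths are relative to the pool, so they still need to be mapped to the pool's mount point before deleting or restoring anything.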

December 31, 2025

Author

Ok, thanks for the info. Will check the syslog/corrupt files etc. soonest. Probably not tonight 😄.

Regarding the filesystem check: is that something you recommend doing before replacing disk1 entirely? Or do you suspect something else caused disk1 to become corrupt (the intermittent problem you mentioned)?

The reason I ask: you want to have disk1 up and running as soon as possible, right? If a second disk fails, I will be losing data for sure.

Still pretty inexperienced in all this so sorry for asking so many questions.
