BTRFS error on NVME Cache drive

Shu · August 31, 2022

Hi Everyone, I woke up this morning to these errors on my nvme cache drive (my cache pool is two drives, one m.2 nvme and the other a sata ssd):

Aug 31 03:51:39 520unraid kernel: BTRFS critical (device nvme0n1p1): corrupt leaf: root=10 block=820606861312 slot=118, unexpected item end, have 3543154555 expect 16251
Aug 31 03:51:39 520unraid kernel: BTRFS info (device nvme0n1p1): leaf 820606861312 gen 46836 total ptrs 258 free space 9769 owner 10
Aug 31 03:51:39 520unraid kernel: BTRFS error (device nvme0n1p1): block=820606861312 write time tree block corruption detected
Aug 31 03:51:39 520unraid kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2438: errno=-5 IO failure (Error while writing out transaction)
Aug 31 03:51:39 520unraid kernel: BTRFS info (device nvme0n1p1): forced readonly
Aug 31 03:51:39 520unraid kernel: BTRFS warning (device nvme0n1p1): Skipping commit of aborted transaction.
Aug 31 03:51:39 520unraid kernel: BTRFS: error (device nvme0n1p1) in cleanup_transaction:2011: errno=-5 IO failure
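For anyone who wants to pull lines like these out of a saved syslog, here is a minimal Python sketch. It assumes the two line shapes seen in the excerpt above ("BTRFS level (device X): ..." and "BTRFS: level (device X) in ..."); other kernel versions may format messages differently. The names `BTRFS_LINE`, `parse_btrfs_lines`, and `summarize` are made up for illustration and are not part of Unraid or btrfs-progs.

```python
import re
from collections import Counter

# Matches the two BTRFS message shapes seen in the excerpt (an assumption,
# not an exhaustive grammar of kernel BTRFS messages):
#   kernel: BTRFS critical (device nvme0n1p1): corrupt leaf: ...
#   kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2438: ...
BTRFS_LINE = re.compile(
    r"kernel: BTRFS:? (?P<level>\w+) \(device (?P<device>[^)]+)\):? (?P<message>.*)"
)


def parse_btrfs_lines(lines):
    """Return one dict (level, device, message) per BTRFS kernel line."""
    events = []
    for line in lines:
        match = BTRFS_LINE.search(line)
        if match:
            events.append(match.groupdict())
    return events


def summarize(events):
    """Count events per (device, level) pair."""
    return Counter((event["device"], event["level"]) for event in events)


# Three lines copied from the syslog excerpt above.
SAMPLE = [
    "Aug 31 03:51:39 520unraid kernel: BTRFS critical (device nvme0n1p1): "
    "corrupt leaf: root=10 block=820606861312 slot=118, unexpected item end, "
    "have 3543154555 expect 16251",
    "Aug 31 03:51:39 520unraid kernel: BTRFS: error (device nvme0n1p1) in "
    "btrfs_commit_transaction:2438: errno=-5 IO failure "
    "(Error while writing out transaction)",
    "Aug 31 03:51:39 520unraid kernel: BTRFS info (device nvme0n1p1): forced readonly",
]

if __name__ == "__main__":
    for (device, level), count in sorted(summarize(parse_btrfs_lines(SAMPLE)).items()):
        print(f"{device} {level}: {count}")
```

To run it against a real log, pass the lines of the syslog file to `parse_btrfs_lines` instead of `SAMPLE`.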

I also attached 3 screenshots (0 being the disk info, 1 & 2 being syslog relevant to this issue (a lot of item xxx key ... itemoff... errors)).

While researching this issue, I ran a Check-File System for this drive but it came back clean (picture 3) and a Scrub check which also came back clean (picture 4).

I do believe something is wrong, however, because when I ran the mover it came back with a read-only filesystem error on all files. They're mostly media files from qBit from last night, so I don't think much is unrecoverable, though I haven't yet checked the files that are kept on the cache pool, which are probably more vital.

What is the proper guidance here? (Or which guide/thread should I try next?)

Edit 9:56am local: I re-ran the mover and it does appear to be moving files, albeit at a slow pace (~25 MB/s per drive, so ~50 MB/s for the pool). I would expect much faster reads from these fast drives. They are writing to a 4 TB WD Blue which is less than half full, so I'd expect a starting speed of at least around 120 MB/s.

Edited August 31, 2022 by Shu
Additional information

JorgeB · August 31, 2022

6 minutes ago, Shu said:
write time tree block corruption detected

This usually means bad RAM or other kernel corruption, start by running memtest.

Shu · August 31, 2022

9 minutes ago, JorgeB said:

This usually means bad RAM or other kernel corruption, start by running memtest.

Will run a test. Will update when it finishes

Shu · August 31, 2022

2 hours ago, JorgeB said:

This usually means bad RAM or other kernel corruption, start by running memtest.

I haven't had a chance to run a test yet, unfortunately. But I did run a memtest on August 4th and it passed (though I'm not 100% confident I ran the right test). I attached a photo of those results.

Edited August 31, 2022 by Shu

JorgeB · August 31, 2022

You should run another one, also a good idea to post the full diags to see if there are known hardware issues.

Shu · August 31, 2022

1 hour ago, JorgeB said:

You should run another one, also a good idea to post the full diags to see if there are known hardware issues.

It's running right now. I expect it to take about 3 more hours (Pass 1 finished with no errors in just under an hour; it's on Pass 2 of 4 now). Since I had already shut down my server, will I be able to recover the diagnostics file on reboot? Or was it cleared? Sorry about that.

trurl · August 31, 2022

Diagnostics after reboot won't tell us anything about what happened in the past since syslog resets, but it will tell us how things are now including your hardware.

Shu · August 31, 2022

23 minutes ago, trurl said:

Diagnostics after reboot won't tell us anything about what happened in the past since syslog resets, but it will tell us how things are now including your hardware.

I will post my diagnostics once the test is done and I boot back into unraid. On pass 3 of 4 now (going faster than I expected initially)

Edited August 31, 2022 by Shu

Shu · August 31, 2022

3 hours ago, JorgeB said:

You should run another one, also a good idea to post the full diags to see if there are known hardware issues.

Okay, 3 hours in and it just finished - all passed & no errors. I also booted into unraid and downloaded the diags. Attached here.

520unraid-diagnostics-20220831-1647.zip

JorgeB · September 1, 2022

Still pretty convinced that was from a hardware issue, but keep using the server normally and see if there are more issues.

Shu · September 1, 2022

I'll revive the thread with a reply if it comes up again. I did notice my Plex docker isn't showing the web UI anymore. Could that be a coincidence, or is it related to appdata being on the cache drives? (I'm going to change my cache pool to a raid 1 with two nvme's once it comes in tomorrow - for more redundancy)

JorgeB · September 1, 2022

23 minutes ago, Shu said:

(I'm going to change my cache pool to a raid 1 with two nvme's once it comes in tomorrow - for more redundancy)

That's fine in case a device fails/drops but raid1 won't help with this type of issue.
