Cache read only - BTRFS corrupted?

eribob · November 14, 2020

My cache pool switched to a "read only file system" today, and the following message is repeated many times in the system log:

Nov 14 21:18:16 Monsterservern kernel: blk_update_request: I/O error, dev loop2, sector 6665664 op 0x1:(WRITE) flags 0x100000 phys_seg 4 prio class 0
Nov 14 21:18:16 Monsterservern kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 136, rd 0, flush 0, corrupt 0, gen 0

Is the BTRFS corrupted? Why did it happen? How can I fix it?

The exact same thing actually happened to my other BTRFS pool today as well. I restarted the server and it works normally again, but after running a BTRFS scrub I get "uncorrectable errors" in that pool:

UUID:             109edb7d-32a7-4c8c-9dfd-d8901216e5e1
Scrub started:    Sat Nov 14 09:47:37 2020
Status:           finished
Duration:         0:03:25
Total to scrub:   786.37GiB
Rate:             3.83GiB/s
Error summary:    csum=6
  Corrected:      0
  Uncorrectable:  6
  Unverified:     0

Attached diagnostics.

cache_filesystem_corrupted_201114.zip

eribob · November 15, 2020

I would love some guidance! Should I format both BTRFS pools and recreate the file systems on them? It is a lot of work so I would like to avoid it if possible but if it is the only way to fix it...

/Erik

JorgeB · November 16, 2020

The cache pool is corrupt and needs to be re-formatted. Before that it was already showing checksum errors (data corruption), which suggests a hardware problem, usually RAM related. Since you're running the RAM above the max supported speed, that's the most likely culprit.

eribob · November 16, 2020

Wow, that is great help! I was wondering why these issues were building up. So since I run 4 RAM sticks I should limit them to 2667? I guess both pools need to be reformatted then. Is it worth trying to do a "btrfs check --repair" first? It seems that it can corrupt the pool, but I have nothing to lose if I am about to wipe it anyway. In that case, can you give me an example of how to run such a command?

Also, what is the easiest way to format the cache pool?

Thanks!

Erik

JorgeB · November 16, 2020

15 minutes ago, eribob said:

So since I run 4 RAM sticks I should limit them to 2667?

Yes.

15 minutes ago, eribob said:

Is it worth trying to do a "btrfs check --repair" first?

Unlikely to help, and it can't fix the data corruption. Best to just re-format.

16 minutes ago, eribob said:

Also, what is the easiest way to format the cache pool?

With the array stopped, wipe the SSDs with:

blkdiscard /dev/sdX

Then start the array and format the pool.
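
Since blkdiscard is irreversible, a small guard around it can help avoid wiping the wrong device. The following is only a sketch: wipe_ssd is a hypothetical helper (not an Unraid or util-linux command), /dev/sdX remains a placeholder for the real device, and the mount check via /proc/mounts is an assumption about how you might want to verify that the array is really stopped.

```shell
#!/bin/bash
# Hypothetical helper: prints the wipe command by default and only
# executes it when called with --run as the second argument.
wipe_ssd() {
    local dev="$1" mode="${2:-}"

    # Refuse to continue if the device or one of its partitions
    # (e.g. /dev/sdX1) still appears as mounted.
    if grep -q "^${dev}" /proc/mounts 2>/dev/null; then
        echo "refusing: ${dev} is still mounted" >&2
        return 1
    fi

    if [ "$mode" = "--run" ]; then
        blkdiscard "$dev"    # discards every block on the device
    else
        echo "would run: blkdiscard ${dev}"
    fi
}

# Dry run with the placeholder device name from the post above.
wipe_ssd /dev/sdX
```

Once the dry run names the device you expect, calling wipe_ssd with the real device and --run performs the actual discard.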

eribob · November 18, 2020

Thank you, it worked nicely. Too bad I did not know about the risk of running the RAM at higher speeds. The RAM sticks themselves are rated at 3200MHz, so I simply assumed it would work.

tokra · November 18, 2020

It's interesting that my cache recently went read-only as well. I just found out today.

Do I now need to copy all appdata and reformat?

Nov 18 20:40:54 Tower kernel: loop: Write error at byte offset 2887852032, length 4096.
Nov 18 20:40:54 Tower kernel: print_req_error: I/O error, dev loop2, sector 5640336
Nov 18 20:40:54 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 195, rd 0, flush 0, corrupt 0, gen 0
Nov 18 20:40:59 Tower kernel: loop: Write error at byte offset 3727376384, length 4096.
Nov 18 20:40:59 Tower kernel: print_req_error: I/O error, dev loop2, sector 7280032
Nov 18 20:40:59 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 196, rd 0, flush 0, corrupt 0, gen 0
Nov 18 20:41:05 Tower kernel: loop: Write error at byte offset 2887852032, length 4096.
Nov 18 20:41:05 Tower kernel: print_req_error: I/O error, dev loop2, sector 5640336
Nov 18 20:41:05 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 197, rd 0, flush 0, corrupt 0, gen 0
Nov 18 20:41:05 Tower kernel: loop: Write error at byte offset 3727376384, length 4096.
Nov 18 20:41:05 Tower kernel: print_req_error: I/O error, dev loop2, sector 7280032
Nov 18 20:41:05 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 198, rd 0, flush 0, corrupt 0, gen 0

Edited November 18, 2020 by tokra
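
The "errs:" lines in logs like the ones above carry btrfs's per-device error counters (write, read, flush, corruption, generation), and here only the write counter is climbing. As a rough sketch, the counters can be pulled out of syslog lines with sed; btrfs_err_counts is a hypothetical function name, and the pattern assumes the exact message format shown in this thread.

```shell
#!/bin/bash
# Hypothetical filter: reads syslog lines on stdin and prints the device
# plus its error counters for every "bdev ... errs:" message.
btrfs_err_counts() {
    sed -n 's/.*bdev \([^ ]*\) errs: wr \([0-9]*\), rd \([0-9]*\), flush \([0-9]*\), corrupt \([0-9]*\), gen \([0-9]*\).*/\1 wr=\2 rd=\3 flush=\4 corrupt=\5 gen=\6/p'
}

# Example using one line copied from the log above.
echo 'Nov 18 20:41:05 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 198, rd 0, flush 0, corrupt 0, gen 0' \
    | btrfs_err_counts
```

For the sample line this prints "/dev/loop2 wr=198 rd=0 flush=0 corrupt=0 gen=0". Feeding it the whole log (for example btrfs_err_counts < /var/log/syslog, assuming that is where the system log lives) shows how the counters grow over time. On a mounted filesystem, "btrfs device stats" followed by the mount point reports the same counters directly.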

JorgeB · November 19, 2020

12 hours ago, tokra said:

It's interesting that my cache recently went read-only as well. I just found out today.

Please post the diagnostics: Tools -> Diagnostics

eribob · November 22, 2020

Hi,

The solution worked for a couple of days, but just now one of my BTRFS pools went into read-only mode again, even though I had set my RAM to 2133MHz (the "auto" setting in the BIOS).

The system log says the following:

Nov 22 19:44:56 Monsterservern kernel: BTRFS error (device nvme0n1p1): block=1141445836800 write time tree block corruption detected
Nov 22 19:44:56 Monsterservern kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
Nov 22 19:44:56 Monsterservern kernel: BTRFS info (device nvme0n1p1): forced readonly
Nov 22 19:44:56 Monsterservern kernel: BTRFS warning (device nvme0n1p1): Skipping commit of aborted transaction.
Nov 22 19:44:56 Monsterservern kernel: BTRFS: error (device nvme0n1p1) in cleanup_transaction:1894: errno=-5 IO failure

Diagnostics are attached.

What is the problem? It is really annoying now...

/Erik

monsterservern-diagnostics-20201122-1953.zip

eribob · November 22, 2020

Update!

I ran Memtest and after about 15 minutes I got a lot of errors. So I removed my two oldest RAM sticks and re-ran the test for about 25 minutes without errors. I know that is a bit short (not even one pass, hehe), but I figured that since I got the errors so soon the first time, I would get them again if the remaining RAM sticks were the faulty ones.

So it was probably a memory issue? I just hope that I will not get any more corruption in my BTRFS now... fingers crossed.

I also ran "btrfs check --readonly /dev/nvme0n1p1" and "btrfs check --readonly /dev/nvme1n1p1" (the two disks that are part of the BTRFS pool in question) and got no errors. Can I then assume that my BTRFS filesystem is intact for that pool?

BIG thanks!

/Erik
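
The read-only check described above can be scripted as a small loop. This is a sketch only: check_pool_devices is a hypothetical helper, and the device names are the ones quoted in the post. Note that btrfs check is meant for an unmounted filesystem and by default examines the metadata structures rather than the data checksums, so a clean result does not by itself rule out corrupted file contents. A scrub on the mounted pool is what verifies data checksums.

```shell
#!/bin/bash
# Hypothetical helper: runs a read-only btrfs check on each device given
# and returns non-zero if any of the checks reported a problem.
check_pool_devices() {
    local dev rc=0
    for dev in "$@"; do
        echo "== btrfs check --readonly ${dev}"
        btrfs check --readonly "$dev" || rc=1
    done
    return "$rc"
}

# Usage with the pool stopped (devices taken from the post above):
#   check_pool_devices /dev/nvme0n1p1 /dev/nvme1n1p1
#
# Afterwards, with the pool mounted, a scrub verifies the data checksums.
# The mount point /mnt/cache is an assumption; substitute your pool's path:
#   btrfs scrub start -B /mnt/cache
```

Because both devices belong to the same filesystem, checking either member should normally cover the whole pool, so running it on both is harmless but probably redundant.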

JorgeB · November 23, 2020

9 hours ago, eribob said:

So it was probably a memory issue?

Most likely. Btrfs will quickly corrupt with bad RAM.
