November 14, 2020

My cache array went to "read only file system" today and the following messages are repeated many times in the system log:

Nov 14 21:18:16 Monsterservern kernel: blk_update_request: I/O error, dev loop2, sector 6665664 op 0x1:(WRITE) flags 0x100000 phys_seg 4 prio class 0
Nov 14 21:18:16 Monsterservern kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 136, rd 0, flush 0, corrupt 0, gen 0

Is the BTRFS corrupted? Why did it happen? How can I fix it?

The exact same thing actually happened to my other BTRFS pool today as well. I restarted the server and it works normally again, but after running a BTRFS scrub I get "uncorrectable errors" in that pool:

UUID: 109edb7d-32a7-4c8c-9dfd-d8901216e5e1
Scrub started: Sat Nov 14 09:47:37 2020
Status: finished
Duration: 0:03:25
Total to scrub: 786.37GiB
Rate: 3.83GiB/s
Error summary: csum=6
Corrected: 0
Uncorrectable: 6
Unverified: 0

Attached diagnostics: cache_filesystem_corrupted_201114.zip
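For readers following along: the second log line carries the btrfs per-device error counters (wr, rd, flush, corrupt, gen), and a rising `wr` value means writes to that device keep failing. Below is a minimal sketch for pulling the most recent counters per device out of a log. The helper name `btrfs_err_counters` is hypothetical, and the log path in the usage note is an assumption, since the thread does not say where the log lives.

```shell
#!/bin/sh
# btrfs_err_counters is a hypothetical helper name, not part of btrfs-progs.
# Reads kernel log lines on stdin and prints the last counters seen per device.
btrfs_err_counters() {
    # Keep only the "BTRFS error ... errs:" lines and rewrite them as key=value pairs.
    sed -n 's/.*BTRFS error (device \([^)]*\)): bdev [^ ]* errs: wr \([0-9]*\), rd \([0-9]*\), flush \([0-9]*\), corrupt \([0-9]*\), gen \([0-9]*\).*/\1 wr=\2 rd=\3 flush=\4 corrupt=\5 gen=\6/p' |
    # Remember the latest line for each device and print it at the end.
    awk '{ last[$1] = $0 } END { for (d in last) print last[d] }'
}
```

Usage, with the log path as an assumption: `btrfs_err_counters < /var/log/syslog`. On a mounted pool, `btrfs device stats <mountpoint>` reports the same five counters directly.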
November 15, 2020 (Author)

I would love some guidance! Should I format both BTRFS pools and recreate the file systems on them? It is a lot of work, so I would like to avoid it if possible, but if it is the only way to fix it...

/Erik
November 16, 2020

The cache pool is corrupt and needs to be re-formatted. Before that it was showing checksum errors (data corruption), which suggests a hardware problem, usually RAM related. Since you're running the RAM above the max supported speed, that is the most likely culprit.
November 16, 2020 (Author)

Wow, that is great help! I was wondering why these issues were building up. So since I run 4 RAM sticks I should limit them to 2667?

I guess both pools need to be reformatted then. Is it worth trying to do a "btrfs check --repair" first? It seems that it can corrupt your pool, but I have nothing to lose if I am about to wipe it anyway? In that case, can you give me an example of how to run such a command?

Also, what is the easiest way to format the cache pool?

Thanks!
Erik
November 16, 2020

15 minutes ago, eribob said: "So since I run 4 RAM sticks I should limit them to 2667?"

Yes.

15 minutes ago, eribob said: "Is it worth trying to do a "btrfs check --repair" first?"

Unlikely to help and it can't fix the data corruption, best to just re-format.

16 minutes ago, eribob said: "Also, what is the easiest way to format the cache pool?"

With the array stopped, wipe the SSDs with:

blkdiscard /dev/sdX

Then start the array and format the pool.
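For anyone repeating these steps, here is a minimal sketch of the wipe with two guard checks added. The function name `wipe_pool_device` and both checks are assumptions for illustration, not part of the advice above, and `/dev/sdX` remains a placeholder. `blkdiscard` discards every block on the device, so all data on it is lost.

```shell
#!/bin/sh
# wipe_pool_device is a hypothetical wrapper around blkdiscard.
# It refuses to run unless the argument is a block device that is not mounted.
wipe_pool_device() {
    dev=$1
    if [ ! -b "$dev" ]; then
        echo "refusing: $dev is not a block device" >&2
        return 1
    fi
    # lsblk lists the device and its partitions; any non-empty mount point means it is in use.
    if lsblk -n -o MOUNTPOINT "$dev" | grep -q .; then
        echo "refusing: $dev (or a partition on it) is mounted; stop the array first" >&2
        return 1
    fi
    # Destructive step: discards all blocks on the device.
    blkdiscard "$dev"
}
```

With the array stopped, this would be called as `wipe_pool_device /dev/sdX` once per SSD in the pool, followed by starting the array and formatting the pool as described above.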
November 18, 2020 (Author)

Thank you, it worked nicely. Too bad I did not know about the risk of running the RAM at higher speeds. The RAM sticks themselves were rated at 3200MHz, so I simply thought that it would work.
November 18, 2020

It's interesting that my cache recently went read-only as well; I just found out today. Do I now need to copy all appdata and reformat?

Nov 18 20:40:54 Tower kernel: loop: Write error at byte offset 2887852032, length 4096.
Nov 18 20:40:54 Tower kernel: print_req_error: I/O error, dev loop2, sector 5640336
Nov 18 20:40:54 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 195, rd 0, flush 0, corrupt 0, gen 0
Nov 18 20:40:59 Tower kernel: loop: Write error at byte offset 3727376384, length 4096.
Nov 18 20:40:59 Tower kernel: print_req_error: I/O error, dev loop2, sector 7280032
Nov 18 20:40:59 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 196, rd 0, flush 0, corrupt 0, gen 0
Nov 18 20:41:05 Tower kernel: loop: Write error at byte offset 2887852032, length 4096.
Nov 18 20:41:05 Tower kernel: print_req_error: I/O error, dev loop2, sector 5640336
Nov 18 20:41:05 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 197, rd 0, flush 0, corrupt 0, gen 0
Nov 18 20:41:05 Tower kernel: loop: Write error at byte offset 3727376384, length 4096.
Nov 18 20:41:05 Tower kernel: print_req_error: I/O error, dev loop2, sector 7280032
Nov 18 20:41:05 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 198, rd 0, flush 0, corrupt 0, gen 0

Edited November 18, 2020 by tokra
November 19, 2020

12 hours ago, tokra said: "its interesting that my Cache recently got into read-only as well. just found today."

Please post the diagnostics: Tools -> Diagnostics
November 22, 2020 (Author)

Hi,

The solution worked for a couple of days, but just now one of my BTRFS pools again went into read-only mode. I changed my RAM to 2133MHz (the "auto" setting in BIOS). The system log says the following:

Nov 22 19:44:56 Monsterservern kernel: BTRFS error (device nvme0n1p1): block=1141445836800 write time tree block corruption detected
Nov 22 19:44:56 Monsterservern kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
Nov 22 19:44:56 Monsterservern kernel: BTRFS info (device nvme0n1p1): forced readonly
Nov 22 19:44:56 Monsterservern kernel: BTRFS warning (device nvme0n1p1): Skipping commit of aborted transaction.
Nov 22 19:44:56 Monsterservern kernel: BTRFS: error (device nvme0n1p1) in cleanup_transaction:1894: errno=-5 IO failure

Diagnostics are attached. What is the problem? It is really annoying now...

/Erik

monsterservern-diagnostics-20201122-1953.zip
November 22, 2020 (Author)

Update! I ran a Memtest and after about 15 minutes I got a lot of errors. So I removed my two oldest RAM sticks and re-ran the test for about 25 minutes without error. I know that is a bit short (not even one pass, hehe), but I figured that since I got the errors so soon the first time, I would get them again if the remaining RAM sticks were the faulty ones. So it was probably a memory issue? I just hope that I will not get any more corruption in my BTRFS now... fingers crossed.

I also ran "btrfs check --readonly /dev/nvme0n1p1" and "btrfs check --readonly /dev/nvme1n1p1" (the two disks that are part of the BTRFS pool in question) and got no errors. Can I then assume that my BTRFS filesystem is intact for that pool?

BIG thanks!
/Erik
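A note on that last question: by default `btrfs check --readonly` examines the file system's metadata structures, while data checksums are verified by a scrub, so a clean scrub after the RAM fix is a useful complement. Below is a minimal sketch for reading the "Uncorrectable" count from scrub status output like the summary in the first post. The helper name `scrub_uncorrectable` is hypothetical, and the mount point `/mnt/cache` in the comments is an assumption, since the thread does not give one.

```shell
#!/bin/sh
# scrub_uncorrectable is a hypothetical helper, not part of btrfs-progs.
# Reads scrub status text on stdin and prints the "Uncorrectable:" count
# (prints nothing if the field is absent, e.g. when no errors were found).
scrub_uncorrectable() {
    awk '$1 == "Uncorrectable:" { print $2 }'
}

# Example use (the mount point is an assumption):
#   btrfs scrub start -B /mnt/cache      # -B: stay in the foreground until done
#   n=$(btrfs scrub status /mnt/cache | scrub_uncorrectable)
#   [ "${n:-0}" -eq 0 ] && echo "no uncorrectable errors reported"
```

A nonzero count, like the 6 reported in the first post, means some data failed its checksum and could not be repaired from another copy.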
November 23, 2020

9 hours ago, eribob said: "So it was probably a memory issue?"

Most likely; btrfs will quickly become corrupted with bad RAM.