eribob Posted November 14, 2020

My cache array went to "read only file system" today and the following message is repeated many times in the system log:

Nov 14 21:18:16 Monsterservern kernel: blk_update_request: I/O error, dev loop2, sector 6665664 op 0x1:(WRITE) flags 0x100000 phys_seg 4 prio class 0
Nov 14 21:18:16 Monsterservern kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 136, rd 0, flush 0, corrupt 0, gen 0

Is the BTRFS corrupted? Why did it happen? How can I fix it?

The exact same thing actually happened to my other BTRFS pool today as well. I restarted the server and it works normally again, but after running a BTRFS scrub I get "uncorrectable errors" in that pool:

UUID: 109edb7d-32a7-4c8c-9dfd-d8901216e5e1
Scrub started: Sat Nov 14 09:47:37 2020
Status: finished
Duration: 0:03:25
Total to scrub: 786.37GiB
Rate: 3.83GiB/s
Error summary: csum=6
  Corrected: 0
  Uncorrectable: 6
  Unverified: 0

Attached diagnostics.
cache_filesystem_corrupted_201114.zip
eribob Posted November 15, 2020

I would love some guidance! Should I format both BTRFS pools and recreate the file systems on them? It is a lot of work so I would like to avoid it if possible, but if it is the only way to fix it...

/Erik
JorgeB Posted November 16, 2020

The cache pool is corrupt and needs to be re-formatted. Before that, it was showing checksum errors (data corruption), which suggests a hardware problem, usually RAM-related. Since you're running the RAM above the maximum supported speed, that is the most likely culprit.
eribob Posted November 16, 2020

Wow, that is great help! I was wondering why these issues were building up. So since I run 4 RAM sticks I should limit them to 2667?

I guess both pools need to be reformatted then. Is it worth trying to do a "btrfs check --repair" first? It seems that it can corrupt your pool, but I have nothing to lose if I am about to wipe it anyway. In that case, can you give me an example of how to run such a command?

Also, what is the easiest way to format the cache pool?

Thanks!
Erik
JorgeB Posted November 16, 2020

15 minutes ago, eribob said: So since I run 4 RAM sticks I should limit them to 2667?
Yes.

15 minutes ago, eribob said: Is it worth trying to do a "btrfs check --repair" first?
Unlikely to help, and it can't fix the data corruption; best to just re-format.

16 minutes ago, eribob said: Also, what is the easiest way to format the cache pool?
With the array stopped, wipe the SSDs with:

blkdiscard /dev/sdX

Then start the array and format the pool.
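Because blkdiscard discards every block on the target device, it can be worth wrapping the wipe step from the reply above in a small guard that only prints the command until it is explicitly confirmed. The following is a minimal sketch, not an Unraid or util-linux feature: the function name wipe_ssd and the CONFIRM variable are hypothetical names chosen for illustration, and /dev/sdX remains a placeholder for the real device.

```shell
# Hypothetical helper: dry-run by default, wipes only when CONFIRM=yes is set.
wipe_ssd() {
    dev="$1"
    case "$dev" in
        /dev/*) ;;                                   # accept only /dev/... paths
        *) echo "usage: wipe_ssd /dev/sdX" >&2; return 2 ;;
    esac
    if [ "${CONFIRM:-no}" != "yes" ]; then
        # Dry run: show what would be executed and stop.
        echo "would run: blkdiscard $dev"
        return 0
    fi
    # Destructive: discards all blocks on the device, so all data is lost.
    blkdiscard "$dev"
}
```

Running wipe_ssd /dev/sdX prints the command for review; CONFIRM=yes wipe_ssd /dev/sdX performs the wipe. As in the reply, the array should be stopped first.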
eribob Posted November 18, 2020

Thank you, it worked nicely. Too bad I did not know about the risk from using the RAM at higher speeds. The RAM sticks themselves were rated at 3200MHz so I simply thought that it would work.
tokra Posted November 18, 2020 (edited)

It's interesting that my cache recently went read-only as well; I just found out today. Do I now need to copy all appdata and reformat?

Nov 18 20:40:54 Tower kernel: loop: Write error at byte offset 2887852032, length 4096.
Nov 18 20:40:54 Tower kernel: print_req_error: I/O error, dev loop2, sector 5640336
Nov 18 20:40:54 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 195, rd 0, flush 0, corrupt 0, gen 0
Nov 18 20:40:59 Tower kernel: loop: Write error at byte offset 3727376384, length 4096.
Nov 18 20:40:59 Tower kernel: print_req_error: I/O error, dev loop2, sector 7280032
Nov 18 20:40:59 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 196, rd 0, flush 0, corrupt 0, gen 0
Nov 18 20:41:05 Tower kernel: loop: Write error at byte offset 2887852032, length 4096.
Nov 18 20:41:05 Tower kernel: print_req_error: I/O error, dev loop2, sector 5640336
Nov 18 20:41:05 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 197, rd 0, flush 0, corrupt 0, gen 0
Nov 18 20:41:05 Tower kernel: loop: Write error at byte offset 3727376384, length 4096.
Nov 18 20:41:05 Tower kernel: print_req_error: I/O error, dev loop2, sector 7280032
Nov 18 20:41:05 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 198, rd 0, flush 0, corrupt 0, gen 0
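In log lines like the ones above, the "errs:" fields are per-device error counters (wr = write, rd = read, flush, corrupt = corruption, gen = generation), and here only the write counter is climbing. When the log is long, it can help to reduce it to the latest counters per device. The following is a minimal sketch using awk; btrfs_last_errs is a hypothetical helper name, and the syslog path in the usage note is an assumption about the system.

```shell
# Hypothetical helper: read syslog text on stdin and print the most recent
# BTRFS error counters seen for each device.
btrfs_last_errs() {
    awk '
        /BTRFS error/ && /errs: wr/ {
            # Device path: the text between "bdev " and " errs:".
            dev = $0
            sub(/.*bdev /, "", dev)
            sub(/ errs:.*/, "", dev)
            # Counters: everything after "errs: ".
            counters = $0
            sub(/.*errs: /, "", counters)
            last[dev] = counters     # later lines overwrite earlier ones
        }
        END { for (d in last) print d ": " last[d] }
    '
}
```

Usage, assuming the log lives at /var/log/syslog: btrfs_last_errs < /var/log/syslog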
JorgeB Posted November 19, 2020

12 hours ago, tokra said: its interesting that my Cache recently got into read-only as well. just found today.
Please post the diagnostics: Tools -> Diagnostics
eribob Posted November 22, 2020

Hi,

The solution worked for a couple of days, but just now one of my BTRFS pools again went into read-only mode. I changed my RAM to 2133MHz (the "auto" setting in BIOS). The system log says the following:

Nov 22 19:44:56 Monsterservern kernel: BTRFS error (device nvme0n1p1): block=1141445836800 write time tree block corruption detected
Nov 22 19:44:56 Monsterservern kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
Nov 22 19:44:56 Monsterservern kernel: BTRFS info (device nvme0n1p1): forced readonly
Nov 22 19:44:56 Monsterservern kernel: BTRFS warning (device nvme0n1p1): Skipping commit of aborted transaction.
Nov 22 19:44:56 Monsterservern kernel: BTRFS: error (device nvme0n1p1) in cleanup_transaction:1894: errno=-5 IO failure

Diagnostics are attached. What is the problem? It is really annoying now...

/Erik
monsterservern-diagnostics-20201122-1953.zip
eribob Posted November 22, 2020

Update! I ran a Memtest and after about 15 minutes I got a lot of errors. So I removed my two oldest RAM sticks and re-ran the test for about 25 minutes without error. I know that is a bit short (not even one pass, hehe) but I figured that since I got the errors so soon the first time, I would get them again if the remaining RAM sticks were the faulty ones.

So it was probably a memory issue? I just hope that I will not get any more corruption in my BTRFS now... fingers crossed.

I also ran "btrfs check --readonly /dev/nvme0n1p1" and "btrfs check --readonly /dev/nvme1n1p1" (the two disks that are part of the BTRFS pool in question) and got no errors. Can I then assume that my BTRFS filesystem is intact for that pool?

BIG thanks!
/Erik
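As a complement to the read-only check above: btrfs check inspects the filesystem's structure (its metadata), whereas a scrub reads the data and metadata and verifies their checksums, and btrfs device stats reports the per-device error counters. The following is a minimal sketch of a follow-up check. It assumes the pool is mounted at /mnt/cache, which is an assumption about this system, and btrfs_stats_clean is a hypothetical helper name.

```shell
# Hypothetical helper: read `btrfs device stats` output on stdin and return
# non-zero if any counter is above zero, printing the offending lines.
btrfs_stats_clean() {
    awk '
        NF == 2 && $2 != 0 { print "nonzero: " $1 " = " $2; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# Example usage (commented out; /mnt/cache is an assumed mount point):
# btrfs scrub start -B /mnt/cache    # -B runs the scrub in the foreground
# btrfs device stats /mnt/cache | btrfs_stats_clean && echo "no device errors recorded"
```

A clean scrub together with all-zero counters is stronger evidence than btrfs check alone, though neither can rule out future corruption if the RAM is still faulty.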
JorgeB Posted November 23, 2020

9 hours ago, eribob said: So it was probably a memory issue?
Most likely. Btrfs will quickly become corrupted with bad RAM.