mbc0 Posted February 18, 2023
Hi, I woke this morning to find most of my dockers stopped. It looks like CA Backup stopped my dockers ready to back up, as the backup folder was created but is empty. I can see all these BTRFS errors on nvme2n1p1, but I cannot work out which physical drive this is. This is the 3rd NVMe-related problem I have had on this new motherboard in as many weeks, so I really need to get to the bottom of it. Any help will be greatly appreciated!
Attachment: unraid1-diagnostics-20230218-0934.zip
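Editor's sketch on the "which physical drive is nvme2n1p1?" question: one possible approach, assuming a Linux sysfs layout in which `/sys/block/<disk>/device/` exposes `model` and `serial` files (this is typical for NVMe devices but may not hold for every drive or kernel). The function names `parent_disk` and `describe_disk` are hypothetical helpers invented for this illustration and are not part of Unraid.

```python
import re
from pathlib import Path


def parent_disk(partition: str) -> str:
    """Return the whole-disk name for a partition such as 'nvme2n1p1' or 'sdb1'."""
    name = partition.removeprefix("/dev/")
    # NVMe partitions are named <disk>p<number>, e.g. nvme2n1p1 -> nvme2n1.
    m = re.fullmatch(r"(nvme\d+n\d+)p\d+", name)
    if m:
        return m.group(1)
    # SATA/SAS style partitions are named <disk><number>, e.g. sdb1 -> sdb.
    m = re.fullmatch(r"(sd[a-z]+)\d+", name)
    if m:
        return m.group(1)
    # Anything else is returned unchanged (it may already be a whole disk).
    return name


def describe_disk(partition: str, sysfs_root: str = "/sys/block") -> dict:
    """Look up model and serial for the disk behind a partition.

    Assumption: the attributes live under <sysfs_root>/<disk>/device/.
    Missing attributes are reported as None rather than raising.
    """
    disk = parent_disk(partition)
    info = {"disk": disk}
    for attr in ("model", "serial"):
        path = Path(sysfs_root) / disk / "device" / attr
        info[attr] = path.read_text().strip() if path.exists() else None
    return info


if __name__ == "__main__":
    print(describe_disk("nvme2n1p1"))
```

The serial number reported this way can then be compared with the label printed on the drive itself, which is what ties the kernel name to a physical slot.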
mbc0 (Author) Posted February 18, 2023
I also just saw this in the log when I started to shut down after I took the diags:
Feb 18 09:48:42 UNRAID1 kernel: BUG: Bad rss-counter state mm:00000000ee5ad6d6 type:MM_ANONPAGES val:1
Feb 18 09:48:45 UNRAID1 kernel: I/O error, dev loop3, sector 83360 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
Feb 18 09:48:45 UNRAID1 kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Feb 18 09:48:45 UNRAID1 kernel: I/O error, dev loop3, sector 188192 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
Feb 18 09:48:45 UNRAID1 kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Feb 18 09:48:45 UNRAID1 kernel: BTRFS: error (device loop3: state A) in __btrfs_update_delayed_inode:999: errno=-5 IO failure
Feb 18 09:48:45 UNRAID1 kernel: BTRFS info (device loop3: state EA): forced readonly
Feb 18 09:48:45 UNRAID1 kernel: BTRFS: error (device loop3: state EA) in __btrfs_run_delayed_items:1092: errno=-5 IO failure
Feb 18 09:48:45 UNRAID1 kernel: BTRFS warning (device loop3: state EA): Skipping commit of aborted transaction.
Feb 18 09:48:45 UNRAID1 kernel: BTRFS: error (device loop3: state EA) in cleanup_transaction:1982: errno=-5 IO failure
Feb 18 09:48:45 UNRAID1 kernel: docker0: port 4(vethb7c8447) entered disabled state
Feb 18 09:48:45 UNRAID1 kernel: vethd950096: renamed from eth0
Feb 18 09:48:45 UNRAID1 root: Error response from daemon: error while removing network: network br0 id b58d2467fa57ae7061498d882571927884c846ee782347c86f3071b85475f1f0 has active endpoints
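Editor's sketch: the `errs: wr N, rd N, flush N, corrupt N, gen N` lines in a log like the one above carry running per-device error counters, so the last such line for each device gives its current totals. A minimal way to pull them out of a saved syslog might look like this. The regular expression is written against the two lines quoted in this thread only, and `latest_error_counters` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   BTRFS error (device loop3): bdev /dev/loop3 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
ERRS = re.compile(
    r"BTRFS error \(device [^)]*\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def latest_error_counters(log_text: str) -> dict:
    """Return the most recent counter values seen for each block device."""
    latest = {}
    for line in log_text.splitlines():
        m = ERRS.search(line)
        if m:
            # Later lines overwrite earlier ones, so the final value wins.
            latest[m["bdev"]] = {name: int(m[name]) for name in COUNTERS}
    return latest


# The two counter lines quoted in the post above.
SAMPLE = """\
Feb 18 09:48:45 UNRAID1 kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Feb 18 09:48:45 UNRAID1 kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
"""

if __name__ == "__main__":
    print(latest_error_counters(SAMPLE))
```

Here the only nonzero counter is `wr`, which matches the two failed WRITE operations on loop3 logged just before.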
JorgeB Posted February 18, 2023
There was some previous data corruption on the pool detected by btrfs, and now there's this error:
Feb 18 03:43:10 UNRAID1 kernel: BTRFS error (device nvme2n1p1): block=161059291136 write time tree block corruption detected
This means new data corruption was detected before the data was written to the filesystem, which is usually a sign of bad RAM. Start by running memtest, then post new diags after array start.
Also:
Feb 16 08:44:56 UNRAID1 kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Feb 16 08:44:56 UNRAID1 kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]
Macvlan call traces are usually the result of having dockers with a custom IP address and will end up crashing the server. I recommend switching to ipvlan (Settings -> Docker Settings -> Docker custom network type -> ipvlan; advanced view must be enabled, top right).
mbc0 (Author) Posted February 18, 2023
Thank you @JorgeB
1. I have run the memtest, which passed, but I have another 4 sticks, which I have now swapped in for the existing RAM.
2. You say there was existing corruption? How do I address that, please?
3. Is this not an NVMe problem then? "BTRFS error (device nvme2n1p1): block=161059291136 write time tree block corruption detected"
4. I will look into ipvlan now.
Thanks again!
JorgeB Posted February 18, 2023
Post new diags after array start.
mbc0 (Author) Posted February 18, 2023
JorgeB said: Post new diags after array start.
Attached, many thanks.
Attachment: unraid1-diagnostics-20230218-1125.zip
JorgeB Posted February 18, 2023
Run a correction scrub on the pool to see if any errors are detected. If there are uncorrectable errors, post new diags after the scrub.
mbc0 (Author) Posted February 18, 2023
JorgeB said: Run a correction scrub on the pool to see if any errors are detected, if there are uncorrectable errors post new diags after the scrub.
Hi, which pool do you mean, please?
JorgeB Posted February 18, 2023
And cache pool also, since the write aborted because of corruption was on that one.
mbc0 (Author) Posted February 18, 2023
OK, I have scrubbed all 3 pools and no errors were found on any of them!
JorgeB Posted February 18, 2023 (Solution)
That's good. See here for how to reset the stats for cachetwo, and see there also for how to better monitor the pools. If more corruption errors appear, you likely still have a hardware issue.
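Editor's sketch: the per-device error statistics mentioned above can be read with `btrfs device stats <mount point>`, and the same command with `-z` resets them to zero after printing. A small monitoring check could parse that output and flag any nonzero counter. The sample output and its values below are illustrative only and not taken from this server; the `/mnt/cachetwo` mount point is an assumption based on the pool name in the thread, and the function names are hypothetical.

```python
import re
import subprocess

# One line of `btrfs device stats` output looks like:
#   [/dev/nvme2n1p1].corruption_errs  0
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(output: str) -> dict:
    """Turn `btrfs device stats` text into {device: {counter: value}}."""
    stats = {}
    for line in output.splitlines():
        m = STAT_LINE.match(line.strip())
        if m:
            stats.setdefault(m["dev"], {})[m["counter"]] = int(m["value"])
    return stats


def nonzero_counters(stats: dict) -> dict:
    """Keep only devices and counters whose value is above zero."""
    return {
        dev: {name: value for name, value in counters.items() if value}
        for dev, counters in stats.items()
        if any(counters.values())
    }


def read_pool_stats(mount_point: str) -> dict:
    """Run `btrfs device stats` for a mounted pool (needs root and btrfs-progs)."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mount_point],
        capture_output=True, text=True, check=True,
    )
    return parse_device_stats(result.stdout)


# Illustrative output with made-up values, not from the diagnostics in this thread.
SAMPLE_OUTPUT = """\
[/dev/nvme2n1p1].write_io_errs    0
[/dev/nvme2n1p1].read_io_errs     0
[/dev/nvme2n1p1].flush_io_errs    0
[/dev/nvme2n1p1].corruption_errs  3
[/dev/nvme2n1p1].generation_errs  0
"""

if __name__ == "__main__":
    print(nonzero_counters(parse_device_stats(SAMPLE_OUTPUT)))
```

On a live system, `nonzero_counters(read_pool_stats("/mnt/cachetwo"))` returning anything other than an empty dict after the stats were reset would mean new errors have appeared since.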
Civic1201 Posted February 18, 2023
Sorry for the stupid question, but how do I run such a scrub process?
JorgeB Posted February 18, 2023
Civic1201 said: Sorry for the stupid question but how do I run such a scrub process?
Click on the pool, then scroll down to the scrub section.
mbc0 (Author) Posted February 18, 2023
JorgeB said: That's good, see here for how to reset stats for cachetwo, then see also there how to better monitor the pools, if more corruption errors appear you likely still have some hardware issue.
Thank you, I will read through this to better monitor the pools. Again, huge thanks for your help!