wayner Posted September 2, 2023

Yesterday I upgraded my server from 6.11.5 to 6.12.4. Everything appeared to go well, but after a few hours I noticed that my Dockers weren't running. When I tried to restart one I got an Execution Error. I rebooted the system. Overnight the problem recurred; the system crashed some time after 1:17am.

Does anyone have any idea what the issue is and how to fix it? Are BTRFS errors related to my boot USB drive? Do I need to replace that drive? Could this be a bug or issue in 6.12.4? Is it just a coincidence that this occurred after upgrading? Should I restore back to 6.11.5?

Here are the log entries that were in red from the time I think the problem occurred. Diagnostics are attached: portrush-diagnostics-20230902-0915.zip

Quote:
Sep 2 01:24:41 Portrush kernel: BTRFS error (device nvme0n1p1): block=275404603392 write time tree block corruption detected
Sep 2 01:24:41 Portrush kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2466: errno=-5 IO failure (Error while writing out transaction)
Sep 2 01:24:41 Portrush kernel: BTRFS info (device nvme0n1p1: state E): forced readonly
Sep 2 01:24:41 Portrush kernel: BTRFS warning (device nvme0n1p1: state E): Skipping commit of aborted transaction.
Sep 2 01:24:41 Portrush kernel: BTRFS: error (device nvme0n1p1: state EA) in cleanup_transaction:1964: errno=-5 IO failure
Sep 2 01:25:05 Portrush kernel: loop: Write error at byte offset 1445859328, length 4096.
Sep 2 01:25:05 Portrush kernel: loop: Write error at byte offset 1695289344, length 4096.
Sep 2 01:25:05 Portrush kernel: I/O error, dev loop2, sector 2823944 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 2
Sep 2 01:25:05 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:05 Portrush kernel: I/O error, dev loop2, sector 3311112 op 0x1:(WRITE) flags 0x100000 phys_seg 4 prio class 2
Sep 2 01:25:05 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 351748096, length 4096.
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 83312640, length 4096.
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 351666176, length 4096.
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 83230720, length 4096.
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 351567872, length 4096.
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 83132416, length 4096.
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 349569024, length 4096.
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 81133568, length 4096.
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 687008 op 0x1:(WRITE) flags 0x1800 phys_seg 10 prio class 2
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 162720 op 0x1:(WRITE) flags 0x1800 phys_seg 10 prio class 2
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 686848 op 0x1:(WRITE) flags 0x1800 phys_seg 3 prio class 2
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 162560 op 0x1:(WRITE) flags 0x1800 phys_seg 3 prio class 2
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 686656 op 0x1:(WRITE) flags 0x1800 phys_seg 2 prio class 2
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 162368 op 0x1:(WRITE) flags 0x1800 phys_seg 2 prio class 2
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 682752 op 0x1:(WRITE) flags 0x1800 phys_seg 2 prio class 2
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 158464 op 0x1:(WRITE) flags 0x1800 phys_seg 2 prio class 2
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: BTRFS: error (device loop2) in btrfs_commit_transaction:2466: errno=-5 IO failure (Error while writing out transaction)
Sep 2 01:25:13 Portrush kernel: BTRFS info (device loop2: state E): forced readonly
Sep 2 01:25:13 Portrush kernel: BTRFS warning (device loop2: state E): Skipping commit of aborted transaction.
Sep 2 01:25:13 Portrush kernel: BTRFS: error (device loop2: state EA) in cleanup_transaction:1964: errno=-5 IO failure
Sep 2 01:25:36 Portrush kernel: loop: Write error at byte offset 1445859328, length 4096.
Sep 2 01:25:36 Portrush kernel: loop: Write error at byte offset 1695289344, length 4096.
Sep 2 01:25:36 Portrush kernel: I/O error, dev loop2, sector 2823944 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 2
Sep 2 01:25:36 Portrush kernel: BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:36 Portrush kernel: I/O error, dev loop2, sector 3311112 op 0x1:(WRITE) flags 0x100000 phys_seg 4 prio class 2
Sep 2 01:25:36 Portrush kernel: BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0
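[Editor's note: not part of the thread. For anyone triaging a similar syslog, the running per-device error counters that btrfs prints (wr/rd/flush/corrupt/gen) can be tallied with a short script. This is an illustrative sketch only; the regex and function names are my own, and the sample lines are copied from the log above.]

```python
import re

# Matches the per-device counter lines btrfs emits, e.g.
# "BTRFS error (device loop2): bdev /dev/loop2 errs: wr 1, rd 0, ..."
# The optional "state EA" suffix after the device name is tolerated.
ERRS = re.compile(
    r"BTRFS error \(device (?P<dev>[^):]+)[^)]*\): bdev \S+ errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

def max_counters(log_lines):
    """Return the highest error counters seen for each device."""
    worst = {}
    for line in log_lines:
        m = ERRS.search(line)
        if not m:
            continue
        counts = {k: int(m.group(k)) for k in ("wr", "rd", "flush", "corrupt", "gen")}
        prev = worst.setdefault(m.group("dev"), counts)
        for k, v in counts.items():
            prev[k] = max(prev[k], v)
    return worst

sample = [
    "Sep 2 01:25:05 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0",
    "Sep 2 01:25:36 Portrush kernel: BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0",
]
print(max_counters(sample))
```

In the log above the pattern is all writes (wr climbing to 12, everything else at 0), i.e. the kernel is failing to write, not reporting corrupt reads.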
wayner Posted September 2, 2023 (edited)

Looking at the error message a little more closely, it looks like the drive with errors is the NVMe drive, which is my cache drive. Why would it start throwing serious errors after an upgrade of the unRAID OS? Is this just a coincidence? When I do an extended SMART test it finishes immediately. Is that because it is an SSD? Here is the SMART test output:

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.49-Unraid] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: WDC WDS500G2B0C
Serial Number: 21305D807219
Firmware Version: 211210WD
PCI Vendor/Subsystem ID: 0x15b7
IEEE OUI Identifier: 0x001b44
Total NVM Capacity: 500,107,862,016 [500 GB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 001b44 8b41b6d08e
Local Time is: Sat Sep 2 11:17:19 2023 EDT
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 80 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Namespace 1 Features (0x02): NA_Fields

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
 0 + 3.50W 2.10W - 0 0 0 0 0 0
 1 + 2.40W 1.60W - 0 0 0 0 0 0
 2 + 1.90W 1.50W - 0 0 0 0 0 0
 3 - 0.0250W - - 3 3 3 3 3900 11000
 4 - 0.0050W - - 4 4 4 4 5000 39000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
 0 + 512 0 2
 1 - 4096 0 1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 33 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 3%
Data Units Read: 11,332,177 [5.80 TB]
Data Units Written: 44,540,222 [22.8 TB]
Host Read Commands: 77,074,211
Host Write Commands: 531,724,132
Controller Busy Time: 737
Power Cycles: 78
Power On Hours: 14,728
Unsafe Shutdowns: 64
Media and Data Integrity Errors: 0
Error Information Log Entries: 1
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Edited September 2, 2023 by wayner
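[Editor's note: not part of the thread. When skimming smartctl output like the above, only a handful of fields usually matter for an NVMe SSD's health. A minimal sketch of pulling them out, with the sample trimmed from the output posted above; the function name is my own.]

```python
# Parse "Key: Value" fields from smartctl's text output into a dict.
def parse_nvme_health(smart_text):
    fields = {}
    for line in smart_text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

# Trimmed sample from the smartctl output in the post above.
sample = """\
SMART overall-health self-assessment test result: PASSED
Percentage Used: 3%
Available Spare: 100%
Media and Data Integrity Errors: 0
Unsafe Shutdowns: 64
"""

health = parse_nvme_health(sample)
# Zero media/data integrity errors and a passing self-assessment mean
# the drive's controller sees nothing wrong with its own flash, which
# is consistent with the corruption being introduced before the data
# reaches the drive (e.g. bad RAM or a kernel issue).
print(health["Media and Data Integrity Errors"])
```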
JorgeB Posted September 2, 2023

There are filesystem issues with the cache pool. This can be a hardware problem, like bad RAM, but some users also reported these after updating to v6.12, so possibly some kernel corner issue.
wayner Posted September 2, 2023

By bad RAM, do you mean bad RAM in the NVMe SSD, or bad RAM in terms of the PC's DRAM memory?
JorgeB Posted September 3, 2023

8 hours ago, wayner said:
bad RAM in terms of the PC's DRAM memory?

This
wayner Posted September 4, 2023

So this keeps happening. I upgraded from 6.11.5 to 6.12.4 on Friday afternoon. About 2-3 hours later the issue occurred. I rebooted, and it ran for about 7 hours before the issue recurred. I rebooted the next morning and the system ran for about 30 hours before the issue occurred again. Then the system ran for about 22 hours before I got a hard crash. Previously my system was still alive - the web UI worked and the system responded to pings. This time it went down hard and I had to do a power cycle to bring it back. Should I try going back to 6.11.5 to see if this is just an issue with 6.12.4?
JorgeB Posted September 5, 2023

Try downgrading to v6.11.5 to see if the issues stop. If they do, it may be worth upgrading again but reformatting the pool first, in case it's some preexisting issue.
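[Editor's note: not part of the thread. The "reformat the pool" advice boils down to: copy everything off the cache, format it, copy everything back. A sketch of that cycle using throwaway directories as stand-ins; on a real Unraid box the source would be /mnt/cache, the destination an array disk, the copy would normally be done by the mover with Docker/VM services stopped, and the format itself is done from the GUI.]

```python
import shutil
import tempfile
from pathlib import Path

# Stand-in directories so the steps can be followed safely.
work = Path(tempfile.mkdtemp())
cache = work / "cache"            # stands in for /mnt/cache
backup = work / "array_backup"    # stands in for an array disk
(cache / "appdata").mkdir(parents=True)
(cache / "appdata" / "settings.json").write_text("container config")

# 1. Copy everything off the pool.
shutil.copytree(cache, backup)

# 2. "Reformat" the pool -- simulated here by wiping the directory.
shutil.rmtree(cache)
cache.mkdir()

# 3. Copy the data back onto the freshly formatted pool.
shutil.copytree(backup, cache, dirs_exist_ok=True)

print((cache / "appdata" / "settings.json").read_text())
```

The point of the round trip is that a fresh format writes brand-new filesystem metadata, ruling out any preexisting on-disk corruption as the cause of later errors.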
wayner Posted September 5, 2023

By reformatting the pool, do you mean my drive array? Or the cache drive?
JorgeB Posted September 5, 2023

19 minutes ago, wayner said:
Or the cache drive?

This
wayner Posted September 5, 2023

Should I stick with BTRFS or use XFS?
JorgeB Posted September 5, 2023

Whatever you prefer; if it's a single-device pool and there are no plans to make it a multi-device pool, XFS might be better.
wayner Posted September 5, 2023

What about if I got a second 500GB NVMe drive, used that as a second cache drive, and then reformatted the first drive? Is that an easy process to do? These drives are very cheap right now - $40 Canadian. Does that make sense?
JorgeB Posted September 5, 2023

If you mean as a second pool, that's fine.
wayner Posted September 5, 2023

Doesn't this sound like the same issue: Users in that thread report btrfs errors that go away when reverting back to 6.11.5. About 5-6 people report the same issue as I am describing in this thread.
JorgeB Posted September 5, 2023

4 minutes ago, wayner said:
Doesn't this sound like the same issue:

Yes, hence why I mentioned:

On 9/2/2023 at 7:43 PM, JorgeB said:
some users also reported these after updating to v6.12, so possibly some kernel corner issue.
wayner Posted September 12, 2023

As an update - I reverted back to 6.11.5 but the problems continued and my cache drive became unmountable. So I reformatted the cache drive to btrfs and got the system back up and running, and it ran for over 2.5 days. I upgraded again to 6.12.4 and the system went down within 8 hours, reporting the same types of problems - BTRFS errors with write time tree block corruption detected. Is this a bug in the unRAID 6.12.4 OS? Or do I have some hardware issue that only shows under 6.12.4 but not 6.11.5?
JorgeB Posted September 12, 2023

33 minutes ago, wayner said:
Is this a bug in the unRAID 6.12.4 OS?

Not that I know of; most users don't have issues, and at most it could be a kernel issue. Try zfs or stick with v6.11.5 for now.
wayner Posted September 12, 2023

There seem to be a bunch of people who have issues with their cache drives after upgrading to 6.12; I am not the only one - I linked to a thread a couple of posts up.
JorgeB Posted September 12, 2023

Yes, but a few is not most; there are hundreds of users without issues, including myself.