wayner Posted September 2, 2023

Yesterday I upgraded my server from 6.11.5 to 6.12.4. Everything appeared to go well, but after a few hours I noticed that my Dockers weren't running. When I tried to restart one I got an Execution Error. I rebooted the system. Overnight the problem recurred; the system crashed some time after 1:17am.

Does anyone have any idea what the issue is and how to fix it? Are BTRFS errors related to my boot USB drive? Do I need to replace that drive? Could this be a bug or issue in 6.12.4? Is it just a coincidence that this occurred after upgrading? Should I restore back to 6.11.5?

Here are the log entries that were in red from the time I think the problem occurred. Diagnostics are attached: portrush-diagnostics-20230902-0915.zip

Quote:
Sep 2 01:24:41 Portrush kernel: BTRFS error (device nvme0n1p1): block=275404603392 write time tree block corruption detected
Sep 2 01:24:41 Portrush kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2466: errno=-5 IO failure (Error while writing out transaction)
Sep 2 01:24:41 Portrush kernel: BTRFS info (device nvme0n1p1: state E): forced readonly
Sep 2 01:24:41 Portrush kernel: BTRFS warning (device nvme0n1p1: state E): Skipping commit of aborted transaction.
Sep 2 01:24:41 Portrush kernel: BTRFS: error (device nvme0n1p1: state EA) in cleanup_transaction:1964: errno=-5 IO failure
Sep 2 01:25:05 Portrush kernel: loop: Write error at byte offset 1445859328, length 4096.
Sep 2 01:25:05 Portrush kernel: loop: Write error at byte offset 1695289344, length 4096.
Sep 2 01:25:05 Portrush kernel: I/O error, dev loop2, sector 2823944 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 2
Sep 2 01:25:05 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:05 Portrush kernel: I/O error, dev loop2, sector 3311112 op 0x1:(WRITE) flags 0x100000 phys_seg 4 prio class 2
Sep 2 01:25:05 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 351748096, length 4096.
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 83312640, length 4096.
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 351666176, length 4096.
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 83230720, length 4096.
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 351567872, length 4096.
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 83132416, length 4096.
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 349569024, length 4096.
Sep 2 01:25:13 Portrush kernel: loop: Write error at byte offset 81133568, length 4096.
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 687008 op 0x1:(WRITE) flags 0x1800 phys_seg 10 prio class 2
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 162720 op 0x1:(WRITE) flags 0x1800 phys_seg 10 prio class 2
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 686848 op 0x1:(WRITE) flags 0x1800 phys_seg 3 prio class 2
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 162560 op 0x1:(WRITE) flags 0x1800 phys_seg 3 prio class 2
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 686656 op 0x1:(WRITE) flags 0x1800 phys_seg 2 prio class 2
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 162368 op 0x1:(WRITE) flags 0x1800 phys_seg 2 prio class 2
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 682752 op 0x1:(WRITE) flags 0x1800 phys_seg 2 prio class 2
Sep 2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 158464 op 0x1:(WRITE) flags 0x1800 phys_seg 2 prio class 2
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:13 Portrush kernel: BTRFS: error (device loop2) in btrfs_commit_transaction:2466: errno=-5 IO failure (Error while writing out transaction)
Sep 2 01:25:13 Portrush kernel: BTRFS info (device loop2: state E): forced readonly
Sep 2 01:25:13 Portrush kernel: BTRFS warning (device loop2: state E): Skipping commit of aborted transaction.
Sep 2 01:25:13 Portrush kernel: BTRFS: error (device loop2: state EA) in cleanup_transaction:1964: errno=-5 IO failure
Sep 2 01:25:36 Portrush kernel: loop: Write error at byte offset 1445859328, length 4096.
Sep 2 01:25:36 Portrush kernel: loop: Write error at byte offset 1695289344, length 4096.
Sep 2 01:25:36 Portrush kernel: I/O error, dev loop2, sector 2823944 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 2
Sep 2 01:25:36 Portrush kernel: BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
Sep 2 01:25:36 Portrush kernel: I/O error, dev loop2, sector 3311112 op 0x1:(WRITE) flags 0x100000 phys_seg 4 prio class 2
Sep 2 01:25:36 Portrush kernel: BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0
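[Editor's note: not part of the thread. For anyone triaging a similar syslog, the running per-device error counters that btrfs prints (wr/rd/flush/corrupt/gen) can be tallied with a short script. This is an illustrative sketch only; the regex and function names are my own, and the sample lines are copied from the log above.]

```python
import re

# Matches the per-device counter lines btrfs emits, e.g.
# "BTRFS error (device loop2): bdev /dev/loop2 errs: wr 1, rd 0, ..."
# The optional "state EA" suffix after the device name is tolerated.
ERRS = re.compile(
    r"BTRFS error \(device (?P<dev>[^):]+)[^)]*\): bdev \S+ errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

def max_counters(log_lines):
    """Return the highest error counters seen for each device."""
    worst = {}
    for line in log_lines:
        m = ERRS.search(line)
        if not m:
            continue
        counts = {k: int(m.group(k)) for k in ("wr", "rd", "flush", "corrupt", "gen")}
        prev = worst.setdefault(m.group("dev"), counts)
        for k, v in counts.items():
            prev[k] = max(prev[k], v)
    return worst

sample = [
    "Sep 2 01:25:05 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0",
    "Sep 2 01:25:36 Portrush kernel: BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0",
]
print(max_counters(sample))
```

In the log above the pattern is all writes (wr climbing to 12, everything else at 0), i.e. the kernel is failing to write, not reporting corrupt reads.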
wayner Posted September 2, 2023 (edited)

Looking at the error message a little more closely, it looks like the drive with errors is the NVMe drive, which is my cache drive. Why would it start throwing serious errors after an upgrade of the unRAID OS? Is this just a coincidence? When I do an extended SMART test it finishes immediately. Is that because it is an SSD? Here is the SMART test output:

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.49-Unraid] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: WDC WDS500G2B0C
Serial Number: 21305D807219
Firmware Version: 211210WD
PCI Vendor/Subsystem ID: 0x15b7
IEEE OUI Identifier: 0x001b44
Total NVM Capacity: 500,107,862,016 [500 GB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 001b44 8b41b6d08e
Local Time is: Sat Sep 2 11:17:19 2023 EDT
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 80 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Namespace 1 Features (0x02): NA_Fields

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
 0 + 3.50W 2.10W - 0 0 0 0 0 0
 1 + 2.40W 1.60W - 0 0 0 0 0 0
 2 + 1.90W 1.50W - 0 0 0 0 0 0
 3 - 0.0250W - - 3 3 3 3 3900 11000
 4 - 0.0050W - - 4 4 4 4 5000 39000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
 0 + 512 0 2
 1 - 4096 0 1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 33 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 3%
Data Units Read: 11,332,177 [5.80 TB]
Data Units Written: 44,540,222 [22.8 TB]
Host Read Commands: 77,074,211
Host Write Commands: 531,724,132
Controller Busy Time: 737
Power Cycles: 78
Power On Hours: 14,728
Unsafe Shutdowns: 64
Media and Data Integrity Errors: 0
Error Information Log Entries: 1
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Edited September 2, 2023 by wayner
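[Editor's note: not part of the thread. When skimming smartctl output like the above, only a handful of fields usually matter for an NVMe SSD's health. A minimal sketch of pulling them out, with the sample trimmed from the output posted above; the function name is my own.]

```python
# Parse "Key: Value" fields from smartctl's text output into a dict.
def parse_nvme_health(smart_text):
    fields = {}
    for line in smart_text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

# Trimmed sample from the smartctl output in the post above.
sample = """\
SMART overall-health self-assessment test result: PASSED
Percentage Used: 3%
Available Spare: 100%
Media and Data Integrity Errors: 0
Unsafe Shutdowns: 64
"""

health = parse_nvme_health(sample)
# Zero media/data integrity errors and a passing self-assessment mean
# the drive's controller sees nothing wrong with its own flash, which
# is consistent with the corruption being introduced before the data
# reaches the drive (e.g. bad RAM or a kernel issue).
print(health["Media and Data Integrity Errors"])
```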
JorgeB Posted September 2, 2023

There are filesystem issues with the cache pool. This can be a hardware problem, like bad RAM, but some users also reported these after updating to v6.12, so possibly some kernel corner issue.
wayner Posted September 2, 2023

By bad RAM, do you mean bad RAM in the NVMe SSD, or bad RAM in terms of the PC's DRAM memory?
JorgeB Posted September 3, 2023

8 hours ago, wayner said:
bad RAM in terms of the PC's DRAM memory?

This
wayner Posted September 4, 2023

So this keeps happening. I upgraded from 6.11.5 to 6.12.4 on Friday afternoon. About 2-3 hours later the issue occurred. I rebooted, and it ran for about 7 hours before the issue recurred. I rebooted the next morning and the system ran for about 30 hours before the issue occurred again. Then the system ran for about 22 hours before I got a hard crash. Previously my system was still alive - the web UI worked and the system responded to pings. This time it went down hard and I had to do a power cycle to bring it back. Should I try going back to 6.11.5 to see if this is just an issue with 6.12.4?
JorgeB Posted September 5, 2023

Try downgrading to v6.11.5 to see if the issues stop. If they do, it may be worth upgrading again but reformatting the pool first, in case it's some preexisting issue.
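[Editor's note: not part of the thread. The "reformat the pool" advice boils down to: copy everything off the cache, format it, copy everything back. A sketch of that cycle using throwaway directories as stand-ins; on a real Unraid box the source would be /mnt/cache, the destination an array disk, the copy would normally be done by the mover with Docker/VM services stopped, and the format itself is done from the GUI.]

```python
import shutil
import tempfile
from pathlib import Path

# Stand-in directories so the steps can be followed safely.
work = Path(tempfile.mkdtemp())
cache = work / "cache"            # stands in for /mnt/cache
backup = work / "array_backup"    # stands in for an array disk
(cache / "appdata").mkdir(parents=True)
(cache / "appdata" / "settings.json").write_text("container config")

# 1. Copy everything off the pool.
shutil.copytree(cache, backup)

# 2. "Reformat" the pool -- simulated here by wiping the directory.
shutil.rmtree(cache)
cache.mkdir()

# 3. Copy the data back onto the freshly formatted pool.
shutil.copytree(backup, cache, dirs_exist_ok=True)

print((cache / "appdata" / "settings.json").read_text())
```

The point of the round trip is that a fresh format writes brand-new filesystem metadata, ruling out any preexisting on-disk corruption as the cause of later errors.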
wayner Posted September 5, 2023

By reformatting the pool, do you mean my drive array? Or the cache drive?
JorgeB Posted September 5, 2023

19 minutes ago, wayner said:
Or the cache drive?

This
wayner Posted September 5, 2023

Should I stick with BTRFS or use XFS?
JorgeB Posted September 5, 2023

Whatever you prefer; if it's a single-device pool and there are no plans to make it a multi-device pool, XFS might be better.
wayner Posted September 5, 2023

What about if I got a second 500GB NVMe drive, used that as a second cache drive, and then reformatted the first drive? Is that an easy process to do? These drives are very cheap right now - $40 Canadian. Does that make sense?
JorgeB Posted September 5, 2023

If you mean as a second pool, that's fine.
wayner Posted September 5, 2023

Doesn't this sound like the same issue: Users in that thread report btrfs errors that go away when reverting back to 6.11.5. About 5-6 people report the same issue as I am describing in this thread.
JorgeB Posted September 5, 2023

4 minutes ago, wayner said:
Doesn't this sound like the same issue:

Yes, hence why I mentioned:

On 9/2/2023 at 7:43 PM, JorgeB said:
some users also reported these after updating to v6.12, so possibly some kernel corner issue.
wayner Posted September 12, 2023

As an update - I reverted back to 6.11.5 but the problems continued and my cache drive became unmountable. So I reformatted the cache drive to btrfs and got the system back up and running, and it ran for over 2.5 days. I upgraded again to 6.12.4 and the system went down within 8 hours, reporting the same types of problems - BTRFS errors with write time tree block corruption detected. Is this a bug in the unRAID 6.12.4 OS? Or do I have some hardware issue that only shows under 6.12.4 but not 6.11.5?
JorgeB Posted September 12, 2023

33 minutes ago, wayner said:
Is this a bug in the unRAID 6.12.4 OS?

Not that I know of; most users don't have issues, and at most it could be a kernel issue. Try zfs or stick with v6.11.5 for now.
wayner Posted September 12, 2023

There seem to be a bunch of people who have issues with their cache drives after upgrading to 6.12; I am not the only one - I linked to a thread a couple of posts up.
JorgeB Posted September 12, 2023

Yes, but a few is not most; there are hundreds of users without issues, including myself.