System having issues after upgrade


wayner


Yesterday I upgraded my server from 6.11.5 to 6.12.4.  Everything appeared to go well.  But after a few hours I noticed that my dockers weren't running.  When I tried to restart one I got an Execution Error.  I rebooted the system.  Overnight the problem recurred.  The system crashed some time after 1:17am.

 

Anyone have any idea on what the issue is and how to fix?  Are BTRFS errors related to my boot USB drive?  Do I need to replace that drive?

 

Could this be a bug or issue in 6.12.4?  Is it just a coincidence that this occurred after upgrading?  Should I restore back to 6.11.5?

 

Here are the logs that were in red from the time that I think the problem occurred. Diagnostics are attached: portrush-diagnostics-20230902-0915.zip

Quote

Sep  2 01:24:41 Portrush kernel: BTRFS error (device nvme0n1p1): block=275404603392 write time tree block corruption detected
Sep  2 01:24:41 Portrush kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2466: errno=-5 IO failure (Error while writing out transaction)
Sep  2 01:24:41 Portrush kernel: BTRFS info (device nvme0n1p1: state E): forced readonly
Sep  2 01:24:41 Portrush kernel: BTRFS warning (device nvme0n1p1: state E): Skipping commit of aborted transaction.
Sep  2 01:24:41 Portrush kernel: BTRFS: error (device nvme0n1p1: state EA) in cleanup_transaction:1964: errno=-5 IO failure
Sep  2 01:25:05 Portrush kernel: loop: Write error at byte offset 1445859328, length 4096.
Sep  2 01:25:05 Portrush kernel: loop: Write error at byte offset 1695289344, length 4096.
Sep  2 01:25:05 Portrush kernel: I/O error, dev loop2, sector 2823944 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 2
Sep  2 01:25:05 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Sep  2 01:25:05 Portrush kernel: I/O error, dev loop2, sector 3311112 op 0x1:(WRITE) flags 0x100000 phys_seg 4 prio class 2
Sep  2 01:25:05 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Sep  2 01:25:13 Portrush kernel: loop: Write error at byte offset 351748096, length 4096.
Sep  2 01:25:13 Portrush kernel: loop: Write error at byte offset 83312640, length 4096.
Sep  2 01:25:13 Portrush kernel: loop: Write error at byte offset 351666176, length 4096.
Sep  2 01:25:13 Portrush kernel: loop: Write error at byte offset 83230720, length 4096.
Sep  2 01:25:13 Portrush kernel: loop: Write error at byte offset 351567872, length 4096.
Sep  2 01:25:13 Portrush kernel: loop: Write error at byte offset 83132416, length 4096.
Sep  2 01:25:13 Portrush kernel: loop: Write error at byte offset 349569024, length 4096.
Sep  2 01:25:13 Portrush kernel: loop: Write error at byte offset 81133568, length 4096.
Sep  2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 687008 op 0x1:(WRITE) flags 0x1800 phys_seg 10 prio class 2
Sep  2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 162720 op 0x1:(WRITE) flags 0x1800 phys_seg 10 prio class 2
Sep  2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Sep  2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Sep  2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 686848 op 0x1:(WRITE) flags 0x1800 phys_seg 3 prio class 2
Sep  2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 162560 op 0x1:(WRITE) flags 0x1800 phys_seg 3 prio class 2
Sep  2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
Sep  2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
Sep  2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 686656 op 0x1:(WRITE) flags 0x1800 phys_seg 2 prio class 2
Sep  2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 162368 op 0x1:(WRITE) flags 0x1800 phys_seg 2 prio class 2
Sep  2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Sep  2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
Sep  2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 682752 op 0x1:(WRITE) flags 0x1800 phys_seg 2 prio class 2
Sep  2 01:25:13 Portrush kernel: I/O error, dev loop2, sector 158464 op 0x1:(WRITE) flags 0x1800 phys_seg 2 prio class 2
Sep  2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Sep  2 01:25:13 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
Sep  2 01:25:13 Portrush kernel: BTRFS: error (device loop2) in btrfs_commit_transaction:2466: errno=-5 IO failure (Error while writing out transaction)
Sep  2 01:25:13 Portrush kernel: BTRFS info (device loop2: state E): forced readonly
Sep  2 01:25:13 Portrush kernel: BTRFS warning (device loop2: state E): Skipping commit of aborted transaction.
Sep  2 01:25:13 Portrush kernel: BTRFS: error (device loop2: state EA) in cleanup_transaction:1964: errno=-5 IO failure
Sep  2 01:25:36 Portrush kernel: loop: Write error at byte offset 1445859328, length 4096.
Sep  2 01:25:36 Portrush kernel: loop: Write error at byte offset 1695289344, length 4096.
Sep  2 01:25:36 Portrush kernel: I/O error, dev loop2, sector 2823944 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 2
Sep  2 01:25:36 Portrush kernel: BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
Sep  2 01:25:36 Portrush kernel: I/O error, dev loop2, sector 3311112 op 0x1:(WRITE) flags 0x100000 phys_seg 4 prio class 2
Sep  2 01:25:36 Portrush kernel: BTRFS error (device loop2: state EA): bdev /dev/loop2 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0
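
To see which device failed first, a log excerpt like the one above can be tallied per device. This is only a rough sketch: the function names and the regular expression are my own, and the sample is a handful of lines copied from the log above. I am assuming loop2 is the loop device backing the Docker image; the log itself does not say so.

```python
import re
from collections import Counter

# Matches "(device nvme0n1p1)" as well as "(device loop2: state EA)".
# The pattern is an assumption based only on the lines quoted above.
DEVICE_RE = re.compile(r"BTRFS:? (error|warning|info) \(device ([\w/]+)")

SAMPLE = [
    "Sep  2 01:24:41 Portrush kernel: BTRFS error (device nvme0n1p1): block=275404603392 write time tree block corruption detected",
    "Sep  2 01:24:41 Portrush kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2466: errno=-5 IO failure (Error while writing out transaction)",
    "Sep  2 01:24:41 Portrush kernel: BTRFS info (device nvme0n1p1: state E): forced readonly",
    "Sep  2 01:25:05 Portrush kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0",
    "Sep  2 01:25:13 Portrush kernel: BTRFS: error (device loop2) in btrfs_commit_transaction:2466: errno=-5 IO failure (Error while writing out transaction)",
    "Sep  2 01:25:13 Portrush kernel: BTRFS info (device loop2: state E): forced readonly",
]


def count_btrfs_events(lines):
    """Count BTRFS messages keyed by (device, severity)."""
    counts = Counter()
    for line in lines:
        match = DEVICE_RE.search(line)
        if match:
            severity, device = match.group(1), match.group(2)
            counts[(device, severity)] += 1
    return counts


def first_error(lines):
    """Return (device, line) for the earliest BTRFS error, or None."""
    for line in lines:
        match = DEVICE_RE.search(line)
        if match and match.group(1) == "error":
            return match.group(2), line
    return None


if __name__ == "__main__":
    device, _ = first_error(SAMPLE)
    print("first BTRFS error on:", device)
    for (dev, severity), n in sorted(count_btrfs_events(SAMPLE).items()):
        print(dev, severity, n)
```

On these lines the earliest error is on nvme0n1p1, and the loop2 errors only start about 24 seconds later, after the NVMe filesystem was forced read-only.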

 


Looking at the error message a little more closely it looks like the drive with errors is the nvme drive which is my cache drive.  Why would it start throwing serious errors after an upgrade of the unRAID OS?  Is this just a coincidence?

When I do an extended SMART test it finishes immediately.  Is that because it is an SSD?

 

Here is the SMART test output:
 

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.49-Unraid] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WDC WDS500G2B0C
Serial Number:                      21305D807219
Firmware Version:                   211210WD
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 8b41b6d08e
Local Time is:                      Sat Sep  2 11:17:19 2023 EDT
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W    2.10W       -    0  0  0  0        0       0
 1 +     2.40W    1.60W       -    0  0  0  0        0       0
 2 +     1.90W    1.50W       -    0  0  0  0        0       0
 3 -   0.0250W       -        -    3  3  3  3     3900   11000
 4 -   0.0050W       -        -    4  4  4  4     5000   39000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        33 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    3%
Data Units Read:                    11,332,177 [5.80 TB]
Data Units Written:                 44,540,222 [22.8 TB]
Host Read Commands:                 77,074,211
Host Write Commands:                531,724,132
Controller Busy Time:               737
Power Cycles:                       78
Power On Hours:                     14,728
Unsafe Shutdowns:                   64
Media and Data Integrity Errors:    0
Error Information Log Entries:      1
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
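
To pull the key counters out of a report like this without reading it by eye, the health section can be parsed into a dictionary. This is a minimal sketch, assuming the "Key: value" layout shown above; `parse_health` is a name I made up, and the sample text is copied from the output above.

```python
import re

HEALTH_TEXT = """\
Critical Warning:                   0x00
Temperature:                        33 Celsius
Available Spare:                    100%
Percentage Used:                    3%
Power Cycles:                       78
Power On Hours:                     14,728
Unsafe Shutdowns:                   64
Media and Data Integrity Errors:    0
Error Information Log Entries:      1
"""

# Leading hex value (0x..) or a decimal number with optional thousands commas.
VALUE_RE = re.compile(r"\s*(0x[0-9a-fA-F]+|\d[\d,]*)")


def parse_health(text):
    """Map each 'Key: value' line to the integer at the start of the value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key/value line
        match = VALUE_RE.match(value)
        if not match:
            continue  # value is not numeric (e.g. PASSED)
        raw = match.group(1)
        if raw.lower().startswith("0x"):
            fields[key.strip()] = int(raw, 16)
        else:
            fields[key.strip()] = int(raw.replace(",", ""))
    return fields


if __name__ == "__main__":
    health = parse_health(HEALTH_TEXT)
    for name in ("Critical Warning", "Media and Data Integrity Errors",
                 "Unsafe Shutdowns", "Power Cycles"):
        print(f"{name}: {health[name]}")
```

For this drive the counters the controller tracks are clean (Critical Warning 0, Media and Data Integrity Errors 0), while 64 of the 78 power cycles were recorded as unsafe shutdowns.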

 

Edited by wayner

So, this keeps happening.  I upgraded from 6.11.5 to 6.12.4 on Friday afternoon.  About 2-3 hours later the issue occurred.  I rebooted, and it ran for about 7 hours before recurring.  I rebooted the next morning and the system ran for about 30 hours before the issue occurred again.

 

Then the system ran for about 22 hours before I got a hard crash.  Previously my system was still alive - the web UI worked and the system responded to pings.

 

This time it went down hard and I had to do a power cycle to bring it back.

 

Should I try going back to 6.11.5 to see if this is just an issue with 6.12.4? 


As an update - I reverted to 6.11.5 but the problems continued and my cache drive was unmountable.  So I reformatted the cache drive to btrfs and got the system back up and running, and it ran for over 2.5 days.  I upgraded again to 6.12.4 and the system went down within 8 hours reporting the same types of problems - BTRFS errors with "write time tree block corruption detected".

 

Is this a bug in the unRAID 6.12.4 OS?  Or do I have some hardware issues that only show under 6.12.4, but not 6.11.5?
