CAN'T UPDATE PAST 6.10.3 - Docker Issue?


sunbear

Hello,

 

I have been trying every Unraid OS update since 6.10.3, and every single one has lasted about 12 hours with Docker running before several of my cores lock up at 100% usage and I start having lockup issues with different Docker apps. When I visit the Unraid webGUI in a browser it is usually still responsive (if a little slow), until I open the Docker page; that page hangs halfway through loading and the rest of the GUI becomes unresponsive.

 

At this point I can usually still access the terminal, although no docker commands work and none of the apps respond (nor can I kill them through the Docker daemon). The only thing left is a hard reboot, and the entire process starts again 12-24 hours later.
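For reference, this is roughly what I try from the terminal while it's hung (a sketch only; the USR1 stack dump assumes a standard dockerd, and where the dump lands can vary by version):

# Is the daemon process itself still alive?
ps aux | grep -v grep | grep dockerd

# Ask dockerd to dump its goroutine stacks (they end up in the daemon log / a goroutine-stacks file)
kill -USR1 $(pidof dockerd)

# These are the commands that just hang for me at this point
docker ps
docker stats --no-stream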

 

Would love some help figuring out what the issue is. Like I said, I'm currently on 6.10.3 which works fine, I've tried every update since and I always end up having to revert back to 6.10.3.

 

I have attached my latest diagnostics after reverting, but I'm not sure how far back they go. My latest lock-up was when I recently tried updating to 6.11.5, and I believe it locked up around 2022-11-20 ~07:00.

 

 

Edit: I'm guessing my logs don't go back far enough to show what's happening.

 

Do I need to try the update again to capture logs of the system hanging? I really don't like having to keep hard resetting.
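If I do have to reproduce it, the plan is to keep a persistent copy of the syslog (mirror to flash or a remote syslog server) and pull the interesting lines out afterwards; a rough sketch, where the saved-log path is just a placeholder for wherever that copy ends up:

# Placeholder path for the mirrored/remote syslog copy
SAVED=/path/to/saved/syslog

# Pull out the lines that usually matter for hangs
grep -iE "oom-killer|out of memory|btrfs|call trace|blocked for more than" "$SAVED" | tail -n 200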

Edited by sunbear
Removed Logs

No, I can't remember how long I've been using Unraid. I think I started w/ version 5.

 

Yes, I have stopped everything, and I've even tried a safe mode boot.

 

And while I have tried removing and recreating my custom Docker networks, I don't think I've tried a fresh docker image, just because it's a pain with a lot of dockers that have a lot of custom settings.
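If I do end up going that route, my understanding of the fresh-image procedure is roughly the following (the image path is the default location as I understand it and may differ on other setups; containers get re-added afterwards from Apps > Previous Apps with their saved templates):

# Disable the Docker service first (Settings > Docker > Enable Docker: No), then delete the image
rm /mnt/user/system/docker/docker.img

# Re-enable Docker in the GUI so a new, empty image is created,
# then reinstall the containers from Previous Apps to restore their settings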

 

Thanks for your help. I will give another update a try.


Alright, I tried the update again several times. It seems to crash every night around 4 AM (though not exactly 4 AM) and doesn't respond in the morning until I do a hard reset.
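Since it's always around the same time of night, I've also been checking what is scheduled in that window (a rough sketch; I'm assuming the scheduled jobs end up under /etc/cron.d and the standard cron directories, which may not be exact):

# Anything scheduled around 03:00-04:00?
cat /etc/cron.d/root 2>/dev/null
ls -l /etc/cron.hourly/ /etc/cron.daily/ 2>/dev/null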

 

Below is some data from my syslog that I recorded with Loki. Each snippet is from slightly before the hang, as far as I can tell from the logs.

 

I have also attached my logs from the diagnostics tool.

 

Any help will be much appreciated!

 

## CRASH AROUND 2022-11-24 04:00

REMOVED

## CRASH AROUND 2022-11-26 04:00

 

REMOVED

 

Edited by sunbear
Removed Logs

I noticed I didn't get a crash on the night of the 25th, so I looked at the logs for that night, and there seems to be an endless string of OOM errors. I assume these probably have something to do with the crashes, but I have no idea how to interpret them.
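One hedged way to start reading them (assuming the container is still known to Docker after the reboot): the oom_memcg=/docker/<id> part names the cgroup of the container that hit its limit, and that long id can be matched against docker ps.

# Which process invoked the OOM killer, and which container cgroup it was charged to
grep -E "invoked oom-killer|Memory cgroup out of memory|oom_memcg=" /var/log/syslog

# Match the long id from oom_memcg=/docker/<id> to a container name
docker ps -a --no-trunc --format '{{.ID}}  {{.Names}}'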

 

REMOVED

Edited by sunbear
Removed Logs
  • 2 weeks later...

I take that back, I just remembered that I had mirror to flash turned on the last time I tried an update.

 

I have attached a syslog from a day that I believe had at least one crash (and also some reboots). I'm not sure whether there is personal info in it, so let me know.

 

Can you see anything in this? To me, it looks like there might be an issue with my Docker setup on the Linux side? Let me know if there are any questions.

 

Edited by sunbear
Removed Logs

Hi, thanks for that. Did you see BTRFS corruption in the syslog, or somewhere else?

 

I had some cabling issues a while back but I thought I had fixed that.

 

I'm not seeing anything recent in any of my SMART data, unless you're just referring to the few attributes with old errors.

 

PS: I also just noticed that the Backup Appdata plugin is deprecated because it was causing errors while running, which I definitely think was part of what was causing me issues as well.

 

Going to install the new version and try another update.

Edited by sunbear
13 hours ago, sunbear said:

Did you see BTRFS corruption in the syslog, or somewhere else?

During fs mount:

 

Nov 30 02:47:19 SERV-X370 kernel: BTRFS info (device sde1): bdev /dev/sde1 errs: wr 0, rd 0, flush 0, corrupt 40, gen 0
Nov 30 02:47:19 SERV-X370 kernel: BTRFS info (device sde1): bdev /dev/sdf1 errs: wr 0, rd 0, flush 0, corrupt 41, gen 0
Nov 30 02:47:19 SERV-X370 kernel: BTRFS info (device sde1): bdev /dev/sdg1 errs: wr 0, rd 0, flush 0, corrupt 90, gen 0

 

Corruption detected on all 3 devices.
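You can check the counters yourself from the console and run a scrub to see whether new errors appear, e.g. (assuming the pool is mounted at /mnt/cache; adjust the mountpoint to match your system):

# Show the per-device error counters
btrfs dev stats /mnt/cache

# Run a scrub in the foreground, then re-check the counters
btrfs scrub start -B /mnt/cache
btrfs dev stats /mnt/cache

# Once the underlying cause is fixed, the counters can be reset
btrfs dev stats -z /mnt/cache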


So it does seem to be an out-of-memory issue with my influxdb docker. Did anything change in the updates after 6.10.3 that could cause something like this?

 

Is there any reason OOM errors would consistently be causing my entire system to hang? I've got years of data in that database, so I'd rather not wipe it out and start a new one.

 

It definitely seems to be something with Docker, because sometimes the hang doesn't happen until I actually interact with the daemon in some way. I'm pretty confident it typically coincides with some sort of OOM error, though.

 

I've attached two recent syslogs, each from after a crash (there's also a weird error with an incorrect system date that gets updated by the time plugin).

 

REMOVED

 

Nov 30 12:04:02 - wrong date?

REMOVED

 

OOM at Nov 30 12:04:02 - wrong date?

& OOM at Dec 16 07:33:28

 

Unfortunately, I can't seem to capture full diagnostics at the time the crash happens, only those syslogs.

Edited by sunbear
Removed Logs

Anybody got any recommendations here?

 

Has anyone else had issues with the updates to the docker daemon after version 6.10.3?

 

I was having a weird issue with the incorrect time showing up in my logs but I'm not sure that's the same problem that's causing the docker daemon crashes.

 

I haven't been able to capture full diagnostics at the time of the hang, so I'm going to try to set up a syslog server on my Windows desktop, but I really hate letting my server crash like this and having to force a hard reset every night.
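Rough plan for that (the address and port are just placeholders for my desktop): point Settings > Syslog Server at it, then confirm entries actually arrive with a test message from the Unraid console.

# Send a test line to the remote syslog server over UDP (placeholder address/port)
logger -n 192.168.1.50 -P 514 -d "unraid remote syslog test"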


Ok, I fixed the time issue.

 

It's definitely my influxdb docker triggering the OOM killer and the Docker daemon subsequently crashing, making the whole system unresponsive.

 

I've tried everything to fix this and am out of things to try, other than no longer using InfluxDB (or switching to InfluxDB 2). I've removed and recreated the docker from scratch, started a fresh database and transferred my old data into it, tried the two different types of memory-limit parameters for the docker, and tried without memory limits.
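For reference, the kind of memory-limit parameters I mean (values are just examples; these are the standard docker run flags, which as far as I know is what the container's Extra Parameters pass through):

# Hard memory limit only
--memory=8g

# Hard limit plus a matching swap cap, so the container can't grow past the limit into swap
--memory=8g --memory-swap=8g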

 

Something in the Unraid updates after 6.10.3 changed something that causes my InfluxDB database to crash from OOM every night around the same time. If I revert my Unraid version, the problem is completely gone.

 

I've pasted my syslog at time of crash again below. Does anyone have any other suggestions?

 

<13>Dec 21 03:00:01 SERV-X370 root: Starting Mover
<13>Dec 21 03:00:01 SERV-X370 root: Forcing turbo write on
<4>Dec 21 03:00:01 SERV-X370 kernel: mdcmd (75): set md_write_method 1
<4>Dec 21 03:00:01 SERV-X370 kernel: 
<13>Dec 21 03:00:01 SERV-X370 root: ionice -c 2 -n 7 nice -n 5 /usr/local/emhttp/plugins/ca.mover.tuning/age_mover start 15 0 0 '' '' '' '' no 80 '' ''
<4>Dec 21 05:23:20 SERV-X370 kernel: influxd invoked oom-killer: gfp_mask=0x8c40(GFP_NOFS|__GFP_NOFAIL), order=0, oom_score_adj=0
<4>Dec 21 05:23:20 SERV-X370 kernel: CPU: 14 PID: 12746 Comm: influxd Not tainted 5.19.17-Unraid #2
<4>Dec 21 05:23:20 SERV-X370 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7A33/X370 SLI PLUS (MS-7A33), BIOS 3.JU 11/02/2021
<4>Dec 21 05:23:20 SERV-X370 kernel: Call Trace:
<4>Dec 21 05:23:20 SERV-X370 kernel: <TASK>
<4>Dec 21 05:23:20 SERV-X370 kernel: dump_stack_lvl+0x44/0x5c
<4>Dec 21 05:23:20 SERV-X370 kernel: dump_header+0x4a/0x1ff
<4>Dec 21 05:23:20 SERV-X370 kernel: oom_kill_process+0x80/0x111
<4>Dec 21 05:23:20 SERV-X370 kernel: out_of_memory+0x3e8/0x41a
<4>Dec 21 05:23:20 SERV-X370 kernel: mem_cgroup_out_of_memory+0x7c/0xb2
<4>Dec 21 05:23:20 SERV-X370 kernel: try_charge_memcg+0x44a/0x55e
<4>Dec 21 05:23:20 SERV-X370 kernel: ? get_page_from_freelist+0x6ff/0x82d
<4>Dec 21 05:23:20 SERV-X370 kernel: charge_memcg+0x29/0x71
<4>Dec 21 05:23:20 SERV-X370 kernel: __mem_cgroup_charge+0x29/0x41
<4>Dec 21 05:23:20 SERV-X370 kernel: __filemap_add_folio+0xb9/0x34b
<4>Dec 21 05:23:20 SERV-X370 kernel: ? lruvec_page_state+0x43/0x43
<4>Dec 21 05:23:20 SERV-X370 kernel: filemap_add_folio+0x37/0x91
<4>Dec 21 05:23:20 SERV-X370 kernel: __filemap_get_folio+0x1a4/0x1ff
<4>Dec 21 05:23:20 SERV-X370 kernel: pagecache_get_page+0x13/0x8c
<4>Dec 21 05:23:20 SERV-X370 kernel: alloc_extent_buffer+0x12d/0x38b
<4>Dec 21 05:23:20 SERV-X370 kernel: ? read_extent_buffer+0x22/0x9b
<4>Dec 21 05:23:20 SERV-X370 kernel: read_tree_block+0x21/0x7f
<4>Dec 21 05:23:20 SERV-X370 kernel: read_block_for_search+0x200/0x27d
<4>Dec 21 05:23:20 SERV-X370 kernel: btrfs_search_slot+0x6f7/0x7c5
<4>Dec 21 05:23:20 SERV-X370 kernel: btrfs_lookup_csum+0x5b/0xfd
<4>Dec 21 05:23:20 SERV-X370 kernel: btrfs_lookup_bio_sums+0x1f4/0x4a2
<4>Dec 21 05:23:20 SERV-X370 kernel: btrfs_submit_data_bio+0x102/0x18b
<4>Dec 21 05:23:20 SERV-X370 kernel: submit_extent_page+0x390/0x3d2
<4>Dec 21 05:23:20 SERV-X370 kernel: ? btrfs_repair_one_sector+0x30a/0x30a
<4>Dec 21 05:23:20 SERV-X370 kernel: ? set_extent_bit+0x169/0x493
<4>Dec 21 05:23:20 SERV-X370 kernel: ? _raw_spin_unlock+0x14/0x29
<4>Dec 21 05:23:20 SERV-X370 kernel: ? set_extent_bit+0x18b/0x493
<4>Dec 21 05:23:20 SERV-X370 kernel: btrfs_do_readpage+0x487/0x4ed
<4>Dec 21 05:23:20 SERV-X370 kernel: ? btrfs_repair_one_sector+0x30a/0x30a
<4>Dec 21 05:23:20 SERV-X370 kernel: extent_readahead+0x209/0x280
<4>Dec 21 05:23:20 SERV-X370 kernel: read_pages+0x4a/0xe9
<4>Dec 21 05:23:20 SERV-X370 kernel: page_cache_ra_unbounded+0x10c/0x13f
<4>Dec 21 05:23:20 SERV-X370 kernel: filemap_fault+0x2e7/0x524
<4>Dec 21 05:23:20 SERV-X370 kernel: __do_fault+0x30/0x6e
<4>Dec 21 05:23:20 SERV-X370 kernel: __handle_mm_fault+0x9a5/0xc7d
<4>Dec 21 05:23:20 SERV-X370 kernel: handle_mm_fault+0x113/0x1d7
<4>Dec 21 05:23:20 SERV-X370 kernel: do_user_addr_fault+0x36a/0x514
<4>Dec 21 05:23:20 SERV-X370 kernel: exc_page_fault+0xfc/0x11e
<4>Dec 21 05:23:20 SERV-X370 kernel: asm_exc_page_fault+0x22/0x30
<4>Dec 21 05:23:20 SERV-X370 kernel: RIP: 0033:0x1267dd0
<4>Dec 21 05:23:20 SERV-X370 kernel: Code: Unable to access opcode bytes at RIP 0x1267da6.
<4>Dec 21 05:23:20 SERV-X370 kernel: RSP: 002b:000000c212eed7c8 EFLAGS: 00010202
<4>Dec 21 05:23:20 SERV-X370 kernel: RAX: 00000000000a66b6 RBX: 000000c213914d20 RCX: 000000000000001a
<4>Dec 21 05:23:20 SERV-X370 kernel: RDX: 000000c21385774a RSI: 000000c213857765 RDI: 0000000000000000
<4>Dec 21 05:23:20 SERV-X370 kernel: RBP: 000000c212eed998 R08: 000000c212eed780 R09: 0000000000000063
<4>Dec 21 05:23:20 SERV-X370 kernel: R10: 0000000000000030 R11: 0000000000000100 R12: 0000000000000680
<4>Dec 21 05:23:20 SERV-X370 kernel: R13: 0000000000000180 R14: 0000000000000014 R15: 0000000000000200
<4>Dec 21 05:23:20 SERV-X370 kernel: </TASK>
<6>Dec 21 05:23:20 SERV-X370 kernel: memory: usage 8388608kB, limit 8388608kB, failcnt 224254
<6>Dec 21 05:23:20 SERV-X370 kernel: memory+swap: usage 8388608kB, limit 9007199254740988kB, failcnt 0
<6>Dec 21 05:23:20 SERV-X370 kernel: kmem: usage 20376kB, limit 9007199254740988kB, failcnt 0
<6>Dec 21 05:23:20 SERV-X370 kernel: Memory cgroup stats for /docker/290c6fea5fd3a05ed4cc4d6d45b203d7fcf5dd2c08965f4ab8434c3025941022:
<6>Dec 21 05:23:20 SERV-X370 kernel: anon 8560111616
<6>Dec 21 05:23:20 SERV-X370 kernel: file 8957952
<6>Dec 21 05:23:20 SERV-X370 kernel: kernel 20865024
<6>Dec 21 05:23:20 SERV-X370 kernel: kernel_stack 720896
<6>Dec 21 05:23:20 SERV-X370 kernel: pagetables 18878464
<6>Dec 21 05:23:20 SERV-X370 kernel: percpu 14960
<6>Dec 21 05:23:20 SERV-X370 kernel: sock 0
<6>Dec 21 05:23:20 SERV-X370 kernel: vmalloc 32768
<6>Dec 21 05:23:20 SERV-X370 kernel: shmem 0
<6>Dec 21 05:23:20 SERV-X370 kernel: file_mapped 8192
<6>Dec 21 05:23:20 SERV-X370 kernel: file_dirty 0
<6>Dec 21 05:23:20 SERV-X370 kernel: file_writeback 0
<6>Dec 21 05:23:20 SERV-X370 kernel: swapcached 0
<6>Dec 21 05:23:20 SERV-X370 kernel: anon_thp 0
<6>Dec 21 05:23:20 SERV-X370 kernel: file_thp 0
<6>Dec 21 05:23:20 SERV-X370 kernel: shmem_thp 0
<6>Dec 21 05:23:20 SERV-X370 kernel: inactive_anon 8553947136
<6>Dec 21 05:23:20 SERV-X370 kernel: active_anon 6164480
<6>Dec 21 05:23:20 SERV-X370 kernel: inactive_file 8392704
<6>Dec 21 05:23:20 SERV-X370 kernel: active_file 0
<6>Dec 21 05:23:20 SERV-X370 kernel: unevictable 0
<6>Dec 21 05:23:20 SERV-X370 kernel: slab_reclaimable 590168
<6>Dec 21 05:23:20 SERV-X370 kernel: slab_unreclaimable 535288
<6>Dec 21 05:23:20 SERV-X370 kernel: slab 1125456
<6>Dec 21 05:23:20 SERV-X370 kernel: workingset_refault_anon 0
<6>Dec 21 05:23:20 SERV-X370 kernel: workingset_refault_file 82685793
<6>Dec 21 05:23:20 SERV-X370 kernel: workingset_activate_anon 0
<6>Dec 21 05:23:20 SERV-X370 kernel: workingset_activate_file 13472836
<6>Dec 21 05:23:20 SERV-X370 kernel: workingset_restore_anon 0
<6>Dec 21 05:23:20 SERV-X370 kernel: workingset_restore_file 4873761
<6>Dec 21 05:23:20 SERV-X370 kernel: workingset_nodereclaim 1664
<6>Dec 21 05:23:20 SERV-X370 kernel: pgfault 3626958
<6>Dec 21 05:23:20 SERV-X370 kernel: pgmajfault 157695
<6>Dec 21 05:23:20 SERV-X370 kernel: pgrefill 20783523
<6>Dec 21 05:23:20 SERV-X370 kernel: pgscan 468181478
<6>Dec 21 05:23:20 SERV-X370 kernel: pgsteal 83488920
<6>Dec 21 05:23:20 SERV-X370 kernel: pgactivate 5328178
<6>Dec 21 05:23:20 SERV-X370 kernel: pgdeactivate 18803208
<6>Dec 21 05:23:20 SERV-X370 kernel: pglazyfree 394022
<6>Dec 21 05:23:20 SERV-X370 kernel: pglazyfreed 382682
<6>Dec 21 05:23:20 SERV-X370 kernel: thp_fault_alloc 1
<6>Dec 21 05:23:20 SERV-X370 kernel: thp_collapse_alloc 0
<6>Dec 21 05:23:20 SERV-X370 kernel: Tasks state (memory values in pages):
<6>Dec 21 05:23:20 SERV-X370 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
<6>Dec 21 05:23:20 SERV-X370 kernel: [  12607]     0 12607  2883688  2089516 18767872        0             0 influxd
<6>Dec 21 05:23:20 SERV-X370 kernel: [  12872]     0 12872     1071       23    57344        0             0 sh
<6>Dec 21 05:23:20 SERV-X370 kernel: [   4187]     0  4187     1071       16    53248        0             0 sh
<6>Dec 21 05:23:20 SERV-X370 kernel: [  22708]     0 22708     1071       17    53248        0             0 sh
<6>Dec 21 05:23:20 SERV-X370 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=290c6fea5fd3a05ed4cc4d6d45b203d7fcf5dd2c08965f4ab8434c3025941022,mems_allowed=0,oom_memcg=/docker/290c6fea5fd3a05ed4cc4d6d45b203d7fcf5dd2c08965f4ab8434c3025941022,task_memcg=/docker/290c6fea5fd3a05ed4cc4d6d45b203d7fcf5dd2c08965f4ab8434c3025941022,task=influxd,pid=12607,uid=0
<3>Dec 21 05:23:20 SERV-X370 kernel: Memory cgroup out of memory: Killed process 12607 (influxd) total-vm:11534752kB, anon-rss:8358064kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:18328kB oom_score_adj:0
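Reading that dump: the container's cgroup is capped at 8388608 kB (8 GiB), the usage line shows it sitting exactly at the limit, and nearly all of it is anonymous memory from influxd itself (anon-rss around 8 GB), so the kill is expected once it gets there. I've been trying to watch it creep up live with docker stats; a quick sketch (the container name here is a placeholder):

# One-shot and live memory usage for the container
docker stats --no-stream influxdb
docker stats influxdb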

 


Ok, apparently the Docker daemon hang doesn't require the influxdb docker to be running, like I thought it did.

 

I have even less of a clue what is going on now.

 

Something changed in the Unraid releases after 6.10.3 that consistently causes my Docker daemon to hang, forcing a hard reset of the entire system to get it back. If I can't keep updating my system, this software has become unusable for me and I'm going to be forced to switch to something like TrueNAS.

 

Can whoever does the updates for unraid please give me some guesses for what this might be? Or some suggestions for troubleshooting?


Ok thanks, I will look into that. But I actually think I may have isolated the issue.

 

The hang only seems to happen when I have my BackupPC docker running. I think there might be some kind of interference when a backup runs against my host appdata folder while other dockers are running. My guess is that it's my influxdb docker, with its database in the appdata folder, being read by BackupPC and causing some kind of memory issue that crashes the Docker daemon.

 

The obvious solution would be to shut down my other dockers while running a backup of the appdata folder. My only question is why this has suddenly become an issue? I have been backing up databases like this for several years. Was there something updated in the Docker daemon that would cause this?
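Concretely, something like this around the backup window is what I have in mind (the container name and the backup step are placeholders for my setup):

# Stop the containers that keep live databases in appdata
docker stop influxdb

# ... run the appdata backup here ...

# Bring them back up afterwards
docker start influxdb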

 

Is it standard practice to shut down dockers before backing up data that they are accessing?

