Random freezes/crashes - docker stops



Hi there,

 

I had an SSD (cache) problem a few days ago... and maybe still do. I swapped the SSD, but the system is still unstable.

 

Any ideas?

 

  • I swapped the cache SSD, moved the system data from the cache to an HDD, and back to the new SSD.
  • Started my containers one by one and waited for errors.
  • Removed most plugins.
  • Shut down the VMs.
  • Removed the CPU pinning.
  • Removed the wannabe cache/temp SSD from unassigned devices.

server-diagnostics-20240211-1612.zip

Feb 11 06:17:53 Server kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Feb 11 06:17:53 Server kernel: ? _raw_spin_unlock+0x14/0x29
Feb 11 06:17:53 Server kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]

Macvlan call traces will usually end up crashing the server; switching to ipvlan should fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan; advanced view must be enabled, top right), then reboot.

 

Your docker image is also getting full; that needs fixing as well. Possibly there's a container constantly restarting; look at their uptimes to see if you can find out which one.
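If it helps, you can check both from the console with the standard docker CLI (just a sketch; exact output columns vary between docker versions):

# Show how much space images, containers and volumes are using inside the docker image
docker system df

# List containers with their uptimes; one that keeps restarting will show a very short "Up ..." time
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.RunningFor}}"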

  • 2 weeks later...

@JorgeB so it seems it took a while, but it happened again.

 

Mar  5 20:31:27 Server kernel: loop: Write error at byte offset 30351020032, length 4096.
Mar  5 20:31:28 Server kernel: I/O error, dev loop2, sector 59279328 op 0x1:(WRITE) flags 0x1800 phys_seg 4 prio class 2
Mar  5 20:31:28 Server kernel: loop: Write error at byte offset 30082580480, length 4096.
Mar  5 20:31:28 Server kernel: I/O error, dev loop2, sector 58755040 op 0x1:(WRITE) flags 0x1800 phys_seg 4 prio class 2
Mar  5 20:31:28 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Mar  5 20:31:28 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Mar  5 20:31:28 Server kernel: loop: Write error at byte offset 30574559232, length 4096.
Mar  5 20:31:28 Server kernel: I/O error, dev loop2, sector 59715936 op 0x1:(WRITE) flags 0x1800 phys_seg 4 prio class 2
Mar  5 20:31:28 Server kernel: loop: Write error at byte offset 30306123776, length 4096.
Mar  5 20:31:28 Server kernel: I/O error, dev loop2, sector 59191648 op 0x1:(WRITE) flags 0x1800 phys_seg 4 prio class 2
Mar  5 20:31:28 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Mar  5 20:31:28 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Mar  5 20:31:28 Server kernel: loop: Write error at byte offset 30410817536, length 4096.
Mar  5 20:31:28 Server kernel: I/O error, dev loop2, sector 59396128 op 0x1:(WRITE) flags 0x1800 phys_seg 23 prio class 2
Mar  5 20:31:28 Server kernel: loop: Write error at byte offset 30142382080, length 4096.
Mar  5 20:31:28 Server kernel: I/O error, dev loop2, sector 58871840 op 0x1:(WRITE) flags 0x1800 phys_seg 23 prio class 2
Mar  5 20:31:28 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
Mar  5 20:31:28 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
Mar  5 20:31:28 Server kernel: loop: Write error at byte offset 30410653696, length 4096.
Mar  5 20:31:28 Server kernel: I/O error, dev loop2, sector 59395808 op 0x1:(WRITE) flags 0x1800 phys_seg 32 prio class 2
Mar  5 20:31:28 Server kernel: loop: Write error at byte offset 30142218240, length 4096.
Mar  5 20:31:28 Server kernel: I/O error, dev loop2, sector 58871520 op 0x1:(WRITE) flags 0x1800 phys_seg 32 prio class 2
Mar  5 20:31:28 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Mar  5 20:31:28 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
Mar  5 20:31:28 Server kernel: loop: Write error at byte offset 30377951232, length 4096.
Mar  5 20:31:28 Server kernel: I/O error, dev loop2, sector 59331936 op 0x1:(WRITE) flags 0x1800 phys_seg 80 prio class 2
Mar  5 20:31:28 Server kernel: loop: Write error at byte offset 30109515776, length 4096.
Mar  5 20:31:28 Server kernel: I/O error, dev loop2, sector 58807648 op 0x1:(WRITE) flags 0x1800 phys_seg 80 prio class 2
Mar  5 20:31:28 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Mar  5 20:31:28 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
Mar  5 20:31:28 Server kernel: BTRFS: error (device loop2) in btrfs_commit_transaction:2494: errno=-5 IO failure (Error while writing out transaction)
Mar  5 20:31:28 Server kernel: BTRFS info (device loop2: state E): forced readonly
Mar  5 20:31:28 Server kernel: BTRFS warning (device loop2: state E): Skipping commit of aborted transaction.
Mar  5 20:31:28 Server kernel: BTRFS: error (device loop2: state EA) in cleanup_transaction:1992: errno=-5 IO failure
Mar  5 20:31:31 Server kernel: docker0: port 3(vethc3d2d62) entered disabled state
Mar  5 20:31:31 Server kernel: docker0: port 2(veth3acfc83) entered disabled state
Mar  5 20:31:31 Server kernel: veth517a158: renamed from eth0
Mar  5 20:31:31 Server kernel: vethfaf214b: renamed from eth0
Mar  5 20:31:34 Server kernel: lo_write_bvec: 18 callbacks suppressed
Mar  5 20:31:34 Server kernel: loop: Write error at byte offset 6943166464, length 4096.

 

server-diagnostics-20240305-2234.zip


The cache pool is completely full, and it's also using dual data profiles; you need to fix that. First move some data off it, then decide if you want single or raid1 and convert. Note that raid1 won't be able to use the pool's full capacity.
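You can see the mixed profiles yourself from the console (a sketch assuming the pool is mounted at /mnt/cache, the usual Unraid location; adjust the path to your pool name):

# Shows one "Data" line per profile in use; a healthy pool has only one
btrfs filesystem df /mnt/cache

# Overall usage and per-device allocation
btrfs filesystem usage /mnt/cache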


In the meantime, the mover has moved the data over. This setup only exists because I thought I had an SSD problem, so I would remove one of the drives from the cache.

I'm not sure I understand what you mean by dual data profiles. My guess is you mean that new files are created on the SSD and moved over by the mover every X hours?

Isn't there an event like "drive full - start mover"? How should I deal with that?

The idea behind my setup is energy saving: I only need to spin up my drives every 6 h (mover setting). But I hadn't thought about filling the cache disk before the cron job starts...

40 minutes ago, patrickstigler said:

I'm not sure I understand what you mean by dual data profiles.

It means the pool is using both single and raid1 profiles for data, so part of the data is not redundant. If you want raid1 you can convert the single part to raid1 once there's enough space.

1 minute ago, patrickstigler said:

So I guess I would prefer raid1, since I have the disks anyway. Do I need to change the profile with "new config", or can I do it with btrfs?

You click on the pool on the Main tab and there is an option there to change the profile. After changing the profile, a balance is performed to convert all the data and metadata to the selected profile.
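For reference, the balance the GUI runs is roughly equivalent to the following console commands (a sketch assuming the pool is mounted at /mnt/cache; the conversion can take a while and the pool stays usable during it):

# Convert both data and metadata to raid1
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache

# Check progress from another shell
btrfs balance status /mnt/cache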


That worked, I guess.

[screenshot]

 

But the server just froze and I had to reset it, so I guess there's no log... I will enable USB log mirroring for now.

 

 

*update*

 

 

Now the server behaves like it has SSD (or similar) problems.

 

[screenshot]

 

It won't show my drives, and the web interface is slow / won't load all of its content.

The web-based log viewer is empty right now.

Mar 13 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 176, gen 0
Mar 13 18:00:01 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 1
Mar 13 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 165, gen 0
Mar 13 18:00:01 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 2
Mar 13 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 177, gen 0
Mar 13 18:00:01 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 1
Mar 13 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 166, gen 0
Mar 13 18:00:01 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 2
Mar 13 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 178, gen 0
Mar 13 18:00:01 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 1
Mar 13 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 167, gen 0

 

I haven't rebooted into safe mode yet; should I? It seems that my SSD / the lanes to my SSD have some issue.

 

(currently 207 GB free on the cache pool)


Something is wrong....

 

 

Mar 13 19:54:39 Server smbd[25669]:   INTERNAL ERROR: assert failed: (fh->fd == -1) || (fh->fd == AT_FDCWD) in pid 25669 (4.17.12)
Mar 13 19:54:39 Server smbd[25669]: [2024/03/13 19:54:39.999494,  0] ../../lib/util/fault.c:178(smb_panic_log)
Mar 13 19:54:39 Server smbd[25669]:   If you are running a recent Samba version, and if you think this problem is not yet fixed in the latest versions, please consider reporting this bug, see https://wiki.samba.org/index.php/Bug_Reporting
Mar 13 19:54:39 Server smbd[25669]: [2024/03/13 19:54:39.999507,  0] ../../lib/util/fault.c:183(smb_panic_log)
Mar 13 19:54:39 Server smbd[25669]:   ===============================================================
Mar 13 19:54:39 Server smbd[25669]: [2024/03/13 19:54:39.999525,  0] ../../lib/util/fault.c:184(smb_panic_log)
Mar 13 19:54:39 Server smbd[25669]:   PANIC (pid 25669): assert failed: (fh->fd == -1) || (fh->fd == AT_FDCWD) in 4.17.12
Mar 13 19:54:39 Server smbd[25669]: [2024/03/13 19:54:39.999873,  0] ../../lib/util/fault.c:292(log_stack_trace)
Mar 13 19:54:39 Server smbd[25669]:   BACKTRACE: 27 stack frames:
Mar 13 19:54:39 Server smbd[25669]:    #0 /usr/lib64/libgenrand-samba4.so(log_stack_trace+0x2e) [0x14b30c6e064e]
Mar 13 19:54:39 Server smbd[25669]:    #1 /usr/lib64/libgenrand-samba4.so(smb_panic+0x9) [0x14b30c6e08a9]
Mar 13 19:54:39 Server smbd[25669]:    #2 /usr/lib64/libsmbd-base-samba4.so(+0x4d0fb) [0x14b30cac20fb]
Mar 13 19:54:39 Server smbd[25669]:    #3 /usr/lib64/libtalloc.so.2(+0x44df) [0x14b30c68f4df]
Mar 13 19:54:39 Server smbd[25669]:    #4 /usr/lib64/libsmbd-base-samba4.so(file_free+0xd6) [0x14b30cacf266]
Mar 13 19:54:39 Server smbd[25669]:    #5 /usr/lib64/libsmbd-base-samba4.so(+0xc0781) [0x14b30cb35781]
Mar 13 19:54:39 Server smbd[25669]:    #6 /usr/lib64/libsmbd-base-samba4.so(smbd_smb2_request_process_close+0x211) [0x14b30cb35f01]
Mar 13 19:54:39 Server smbd[25669]:    #7 /usr/lib64/libsmbd-base-samba4.so(smbd_smb2_request_dispatch+0xdfa) [0x14b30cb29bfa]
Mar 13 19:54:39 Server smbd[25669]:    #8 /usr/lib64/libsmbd-base-samba4.so(+0xb5bc1) [0x14b30cb2abc1]
Mar 13 19:54:39 Server smbd[25669]:    #9 /usr/lib64/libtevent.so.0(tevent_common_invoke_fd_handler+0x91) [0x14b30c6a28c1]
Mar 13 19:54:39 Server smbd[25669]:    #10 /usr/lib64/libtevent.so.0(+0xee07) [0x14b30c6a8e07]
Mar 13 19:54:39 Server smbd[25669]:    #11 /usr/lib64/libtevent.so.0(+0xcef7) [0x14b30c6a6ef7]
Mar 13 19:54:39 Server smbd[25669]:    #12 /usr/lib64/libtevent.so.0(_tevent_loop_once+0x91) [0x14b30c6a1ba1]
Mar 13 19:54:39 Server smbd[25669]:    #13 /usr/lib64/libtevent.so.0(tevent_common_loop_wait+0x1b) [0x14b30c6a1e7b]
Mar 13 19:54:39 Server smbd[25669]:    #14 /usr/lib64/libtevent.so.0(+0xce97) [0x14b30c6a6e97]
Mar 13 19:54:39 Server smbd[25669]:    #15 /usr/lib64/libsmbd-base-samba4.so(smbd_process+0x817) [0x14b30cb18be7]
Mar 13 19:54:39 Server smbd[25669]:    #16 /usr/sbin/smbd(+0xb090) [0x55de02472090]
Mar 13 19:54:39 Server smbd[25669]:    #17 /usr/lib64/libtevent.so.0(tevent_common_invoke_fd_handler+0x91) [0x14b30c6a28c1]
Mar 13 19:54:39 Server smbd[25669]:    #18 /usr/lib64/libtevent.so.0(+0xee07) [0x14b30c6a8e07]
Mar 13 19:54:39 Server smbd[25669]:    #19 /usr/lib64/libtevent.so.0(+0xcef7) [0x14b30c6a6ef7]
Mar 13 19:54:39 Server smbd[25669]:    #20 /usr/lib64/libtevent.so.0(_tevent_loop_once+0x91) [0x14b30c6a1ba1]
Mar 13 19:54:39 Server smbd[25669]:    #21 /usr/lib64/libtevent.so.0(tevent_common_loop_wait+0x1b) [0x14b30c6a1e7b]
Mar 13 19:54:39 Server smbd[25669]:    #22 /usr/lib64/libtevent.so.0(+0xce97) [0x14b30c6a6e97]
Mar 13 19:54:40 Server smbd[25669]:    #23 /usr/sbin/smbd(main+0x1489) [0x55de0246f259]
Mar 13 19:54:40 Server smbd[25669]:    #24 /lib64/libc.so.6(+0x236b7) [0x14b30c4a96b7]
Mar 13 19:54:40 Server smbd[25669]:    #25 /lib64/libc.so.6(__libc_start_main+0x85) [0x14b30c4a9775]
Mar 13 19:54:40 Server smbd[25669]:    #26 /usr/sbin/smbd(_start+0x21) [0x55de0246fb31]
Mar 13 19:54:40 Server smbd[25669]: [2024/03/13 19:54:40.000043,  0] ../../source3/lib/dumpcore.c:315(dump_core)
Mar 13 19:54:40 Server smbd[25669]:   dumping core in /var/log/samba/cores/smbd
Mar 13 19:54:40 Server smbd[25669]: 
Mar 13 19:58:50 Server emhttpd: read SMART /dev/sdh
Mar 13 19:59:40 Server emhttpd: spinning down /dev/sdd

 


Could be the module, the socket, a bad connection, memory timing not compatible, unstable with all modules driven, etc.

 

Your best outcome is a clear fail with a single module tested in different slots. That would mean you should be good replacing just that module.

 

Memory errors can get messy, hopefully your error is easy to replicate and eliminate.


After a few hours of testing and swapping the modules around, it seems that one of the Crucial Ballistix BL2K8G32C16U4B 3200 MHz modules died.

It has been online 24/7 since 28.11.21; I guess it's not a server-grade RAM module. The test for the last module is running, and I will also check whether the failure only occurs in that slot.

Thank you for your help; I'll keep you updated.
 

After another 20-ish hours I must say the Crucial ones are working fine, but the Kingston ones are dead.

So I removed the Kingston RAM modules, and so far it's working.

 

Mar 16 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 201, gen 0
Mar 16 18:00:01 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 2
Mar 16 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 212, gen 0
Mar 16 18:00:01 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 1
Mar 16 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 202, gen 0
Mar 16 18:00:01 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 2
Mar 16 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 213, gen 0
Mar 16 18:00:01 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 1
Mar 16 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 203, gen 0
Mar 16 18:00:01 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 2
Mar 16 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 214, gen 0
Mar 16 18:00:01 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 1
Mar 16 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 204, gen 0
Mar 16 18:00:01 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 2
Mar 16 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 215, gen 0
Mar 16 18:00:01 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 1
Mar 16 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 205, gen 0
Mar 16 18:00:01 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 2
Mar 16 18:00:01 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 216, gen 0

 

I have a few errors on one of the cache SSDs, but I assume that one is about to die; it's at about 70% health.
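To confirm, the drive's own wear and error counters can be read from the console (a sketch assuming smartctl is available, as it is on stock Unraid, and that the drive is /dev/nvme0; adjust to your device):

# NVMe health summary: check "Percentage Used", "Media and Data Integrity Errors" and the error log entries
smartctl -a /dev/nvme0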


Stable for 3 days, but

 

Mar 19 11:13:39 Server kernel: PMS LoudnessCmd[16217]: segfault at 0 ip 000014ce38ac7090 sp 000014ce335d40b8 error 4 in libswresample.so.4[14ce38abf000+18000] likely on CPU 14 (core 6, socket 0)

 

and

 

Mar 19 12:00:02 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 318, gen 0
Mar 19 12:00:02 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 1
Mar 19 12:00:02 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 308, gen 0
Mar 19 12:00:02 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 2
Mar 19 12:00:02 Server kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 319, gen 0
Mar 19 12:00:02 Server kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 4711988 off 264597504 csum 0x3019a12b expected csum 0x3ed50501 mirror 1

 

Maybe the CPU has problems... well.

I will let it run for a while. We will see.

 


Did you scrub the pool and reset the btrfs stats after fixing the RAM? It may still be finding old corruptions.
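In case it helps, that would look roughly like this from the console (a sketch assuming the pool is mounted at /mnt/cache; the scrub re-reads everything and can take a while):

# Re-read all data/metadata and repair from the good copy where possible
btrfs scrub start /mnt/cache

# Check progress / results
btrfs scrub status /mnt/cache

# Reset the per-device error counters so new errors stand out
btrfs device stats -z /mnt/cache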

 

26 minutes ago, patrickstigler said:
PMS LoudnessCmd[16217]: segfault

I see these in a lot of diags, so they should be harmless.

