Web GUI and some docker services randomly becoming unresponsive

May 19, 2024

I have had the same issue twice in the last few weeks. The Unraid web GUI suddenly becomes unresponsive and some of my docker services are no longer reachable. If I open the web GUI in an incognito window I get to the login prompt, but when I enter "root" and my password the page goes blank and loads forever without giving me a 404, a 503, or any other error.

The server is still reachable through ssh, and some docker services can be accessed. My Windows 11 VM was still working normally. I tried to get the diagnostics through the terminal with the "diagnostics" command but it never finished. Instead I copied the syslog to the USB and am attaching it here.

I tried both restarting nginx with `/etc/rc.d/rc.nginx restart` and killing the nginx processes, but neither worked.

ps -aux | grep nginx

gave me a list of processes, and after trying to kill them with

kill -9 [PID]

a list of 4 processes with `D` status (uninterruptible sleep) remained.
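For anyone hitting the same thing: processes in `D` state are blocked inside the kernel, so `kill -9` has no effect until whatever they are waiting on completes. Below is a minimal sketch for listing them together with the kernel function they are sleeping in. The helper name `filter_d_state` is made up for illustration; the `ps` columns are standard procps fields.

```shell
# Keep the header line plus every process whose state starts with D
# (uninterruptible sleep). Reads `ps` output on stdin.
filter_d_state() {
  awk 'NR == 1 || $2 ~ /^D/'
}

# Usage on the server (the wchan column is the kernel function being waited in):
#   ps -eo pid,stat,wchan:32,comm | filter_d_state
```

If the wchan column points at storage or filesystem functions, the hang is probably below nginx rather than in nginx itself.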

I tried to restart the server with `reboot -n` but nothing happened.

Finally I had to pull the plug on the server. After rebooting, everything is back to normal (except for complaints about unclean restart). Attaching diagnostics that I took once the server was back up again.

Since this has happened twice now I would appreciate any help you guys can give me on what might be the issue.

Best regards

Erik

monsterservern-diagnostics-20240519-2203.zip syslog_240519.txt

May 20, 2024

Community Expert

May 19 17:58:34 MONSTERSERVERN kernel: BUG: unable to handle page fault for address: ffffc9005cbb6000
May 19 17:58:34 MONSTERSERVERN kernel: #PF: supervisor read access in kernel mode
May 19 17:58:34 MONSTERSERVERN kernel: #PF: error_code(0x0000) - not-present page
May 19 17:58:34 MONSTERSERVERN kernel: PGD 100000067 P4D 100000067 PUD 27b7e7067 PMD 9a62e7067 PTE 0
May 19 17:58:34 MONSTERSERVERN kernel: Oops: 0000 [#4] PREEMPT SMP NOPTI
May 19 17:58:34 MONSTERSERVERN kernel: CPU: 0 PID: 14419 Comm: z_wr_iss_h Tainted: P D W O 6.1.79-Unraid #1
May 19 17:58:34 MONSTERSERVERN kernel: Hardware name: To Be Filled By O.E.M. X570 Taichi/X570 Taichi, BIOS P5.60 01/18/2024
May 19 17:58:34 MONSTERSERVERN kernel: RIP: 0010:lzjb_compress+0xab/0x188 [zfs]
May 19 17:58:34 MONSTERSERVERN kernel: Code: 00 00 00 48 ff c3 49 39 c2 0f b6 10 73 0a 88 13 48 ff c0 48 ff c3 eb b9 44 0f b6 58 01 c1 e2 10 49 89 c6 41 c1 e3 08 41 01 d3 <0f> b6 50 02 44 01 da 41 c1 eb 09 41 01 d3 44 89 da c1 fa 05 44 01
May 19 17:58:34 MONSTERSERVERN kernel: RSP: 0018:ffffc90026c5fd28 EFLAGS: 00010206
May 19 17:58:34 MONSTERSERVERN kernel: RAX: ffffc9005cbb5ffe RBX: ffffc90053d215be RCX: 0000000000000080
May 19 17:58:34 MONSTERSERVERN kernel: RDX: 0000000000080000 RSI: ffffc90053d215b6 RDI: ffff88821a97f800
May 19 17:58:34 MONSTERSERVERN kernel: RBP: 0000000000020000 R08: ffffc9005cbb7000 R09: ffffc90053d23fef

May 19 17:58:34 MONSTERSERVERN kernel: R10: ffffc9005cbb6fbe R11: 0000000000086100 R12: ffffc90053d08000
May 19 17:58:34 MONSTERSERVERN kernel: R13: ffffc9005cb97000 R14: ffffc9005cbb5ffe R15: 00000000000000e9
May 19 17:58:34 MONSTERSERVERN kernel: FS:  0000000000000000(0000) GS:ffff889fbe800000(0000) knlGS:0000000000000000
May 19 17:58:34 MONSTERSERVERN kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 19 17:58:34 MONSTERSERVERN kernel: CR2: ffffc9005cbb6000 CR3: 000000016a362000 CR4: 0000000000350ef0
May 19 17:58:34 MONSTERSERVERN kernel: Call Trace:
May 19 17:58:34 MONSTERSERVERN kernel: <TASK>
May 19 17:58:34 MONSTERSERVERN kernel: ? __die_body+0x1a/0x5c
May 19 17:58:34 MONSTERSERVERN kernel: ? page_fault_oops+0x329/0x376
May 19 17:58:34 MONSTERSERVERN kernel: ? fixup_exception+0x22/0x24b
May 19 17:58:34 MONSTERSERVERN kernel: ? exc_page_fault+0xf4/0x11d
May 19 17:58:34 MONSTERSERVERN kernel: ? asm_exc_page_fault+0x22/0x30
May 19 17:58:34 MONSTERSERVERN kernel: ? lzjb_compress+0xab/0x188 [zfs]
May 19 17:58:34 MONSTERSERVERN kernel: ? lzjb_compress+0x36/0x188 [zfs]
May 19 17:58:34 MONSTERSERVERN kernel: zio_compress_data+0xbf/0xf2 [zfs]
May 19 17:58:34 MONSTERSERVERN kernel: zio_write_compress+0xfe/0x6af [zfs]
May 19 17:58:34 MONSTERSERVERN kernel: zio_execute+0xb4/0xdf [zfs]
May 19 17:58:34 MONSTERSERVERN kernel: taskq_thread+0x269/0x38a [spl]
May 19 17:58:34 MONSTERSERVERN kernel: ? wake_up_q+0x44/0x44
May 19 17:58:34 MONSTERSERVERN kernel: ? zio_subblock+0x22/0x22 [zfs]
May 19 17:58:34 MONSTERSERVERN kernel: ? taskq_dispatch_delay+0x106/0x106 [spl]
May 19 17:58:34 MONSTERSERVERN kernel: kthread+0xe7/0xef
May 19 17:58:34 MONSTERSERVERN kernel: ? kthread_complete_and_exit+0x1b/0x1b
May 19 17:58:34 MONSTERSERVERN kernel: ret_from_fork+0x22/0x30
May 19 17:58:34 MONSTERSERVERN kernel: </TASK>

There are a couple of ZFS-related crashes. These could indicate a filesystem or a hardware issue. You can first try recreating the filesystem; if it happens again soon after, there could be a hardware issue.
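To see how many such crashes a saved syslog contains and where each one faulted, something like the following can help. It is only a sketch: `summarize_oops` is a made-up helper name, and it assumes the log uses the same `kernel: Oops:` and `kernel: RIP:` line format as the excerpt above.

```shell
# Print the number of kernel oopses in a syslog, followed by a count of
# each distinct faulting location taken from the RIP lines.
summarize_oops() {
  local log="$1"
  printf 'oops count: %s\n' "$(grep -c 'kernel: Oops:' "$log")"
  # RIP lines look like: "... kernel: RIP: 0010:lzjb_compress+0xab/0x188 [zfs]"
  grep 'kernel: RIP:' "$log" | sed 's/.*kernel: RIP: [0-9a-f]*://' | sort | uniq -c
}
```

Called as `summarize_oops syslog_240519.txt`. The `[#4]` on the Oops line is the kernel's own running count of oopses since boot, so the excerpt above is already the fourth one.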

May 20, 2024

Author

1 hour ago, JorgeB said:

There are a couple of ZFS-related crashes. These could indicate a filesystem or a hardware issue. You can first try recreating the filesystem; if it happens again soon after, there could be a hardware issue.

Thank you for spotting this.

Does it give any information about where the problem might lie?

I have 2 drives with ZFS: one U.2 Intel P4510 that is brand new and has a very good endurance rating, and one HDD that is also quite new. No SMART errors reported so far.

Or could it be a RAM issue? I do not have ECC.

How do I recreate the zfs file system? Do I need to format? Does it say which zfs pool is causing the issue?

May 20, 2024

Community Expert

3 hours ago, eribob said:

How do I recreate the zfs file system? Do I need to format?

Yes.

3 hours ago, eribob said:

Does it say which zfs pool is causing the issue?

Nope, do you know if one of them was being more heavily utilized at that time? If yes, that one would be the suspect.

May 20, 2024

Author

1 hour ago, JorgeB said:

Nope, do you know if one of them was being more heavily utilized at that time? If yes, that one would be the suspect.

OK that's unfortunate. The U.2 NVME was probably more heavily utilised, but I find it hard to believe that a brand new enterprise nvme would fail. Is there a way to test a zfs pool for errors?

May 20, 2024

Community Expert

11 minutes ago, eribob said:

but I find it hard to believe that a brand new enterprise nvme would fail.

I don't think it's a device problem, just a filesystem problem, which may or may not be the result of, for example, bad RAM.

May 20, 2024

Author

5 hours ago, JorgeB said:

which may or may not be the result of, for example, bad RAM.

OK, will start by running memtest tonight.

Should I perform a ZFS scrub on the pools? Can that detect this kind of error?

I have also seen some weirdness since moving to ZFS. I have made each of my appdata folders a separate ZFS dataset (Spaceinvader One video...) and some of them are not accessible in Krusader. It says "cannot open the folder" when I try. Is this a permission issue, or is it normal behaviour when accessing ZFS datasets from Krusader? The folders are accessible through the Unraid GUI and I can cd into them in the terminal.

BTW, the reason I am hesitant to recreate the file system is that I do not have any redundancy in my pools. I have one on the NVMe and one on one of the array drives. I back up the NVMe to the array every night with ZFS snapshots and replication. I can wipe the NVMe and restore from backup, but it will take some time...

Edited May 20, 2024 by eribob

May 21, 2024

Community Expert

11 hours ago, eribob said:

Should I perform a ZFS scrub on the pools? Can that detect this kind of error?

You can, but it most likely won't fix the issue.
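For reference, a scrub from the terminal could look roughly like this. The pool name is passed in as an argument and is a placeholder, `scrub_and_report` is a hypothetical wrapper, and the `-w` (wait) flag assumes OpenZFS 2.0 or newer.

```shell
# Scrub one pool, wait for the scrub to finish, then print the pool
# status including any files affected by errors.
scrub_and_report() {
  local pool="$1"
  zpool scrub -w "$pool" || return 1
  zpool status -v "$pool"
}
```

Called as `scrub_and_report <poolname>`. A scrub verifies on-disk checksums, so it can reveal corrupted blocks, but on a pool without redundancy it can generally only report damaged data, not repair it.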

May 21, 2024

Author

An update: Ran memtest, no errors after 8 hours, see attached images.

I am using this PCI-e to SFF-8643 adapter to connect the U.2 drive: https://www.amazon.se/dp/B0B6CJ889T/ref=pe_24982401_506182521_TE_item

Do you think that may cause these issues?

/Erik

May 22, 2024

Community Expert

I would back up and recreate the pool. If it happens again in the near future, it could mean there is a hardware issue. Note that memtest is only definitive if it finds errors.

June 7, 2024

Author

On 5/22/2024 at 10:01 AM, JorgeB said:

I would back up and recreate the pool. If it happens again in the near future, it could mean there is a hardware issue. Note that memtest is only definitive if it finds errors.

Hi again!

Sorry for the long delay, and for not marking this as solved despite your excellent efforts JorgeB.

Maybe I got closer to the solution today. I installed a new NVMe drive in the server (in one of the M.2 slots) and suddenly my enterprise U.2 SSD, which uses ZFS, stopped working. I got the following in the system log (could not run diagnostics):

Jun  7 17:14:13 MONSTERSERVERN kernel: WARNING: Pool 'enterprise' has encountered an uncorrectable I/O failure and has been suspended.
Jun  7 17:14:13 MONSTERSERVERN kernel: 
Jun  7 17:14:13 MONSTERSERVERN kernel: WARNING: Pool 'enterprise' has encountered an uncorrectable I/O failure and has been suspended.
Jun  7 17:14:13 MONSTERSERVERN kernel: 
Jun  7 17:14:13 MONSTERSERVERN kernel: WARNING: Pool 'enterprise' has encountered an uncorrectable I/O failure and has been suspended.

Jun  7 17:30:43 MONSTERSERVERN kernel: zio pool=enterprise vdev=/dev/nvme1n1p1 error=5 type=1 offset=1768154103808 size=102400 flags=180880
Jun  7 17:30:43 MONSTERSERVERN kernel: WARNING: Pool 'enterprise' has encountered an uncorrectable I/O failure and has been suspended.
Jun  7 17:30:43 MONSTERSERVERN kernel: 
Jun  7 17:30:43 MONSTERSERVERN kernel: WARNING: Pool 'enterprise' has encountered an uncorrectable I/O failure and has been suspended.

This happened immediately on boot and the drive was unusable.

I tried simply removing the NVME drive and now it works again.

Another thing I recently did was enabling autostart on my workstation VM (Fedora 40) with GPU passthrough.

I now suspect that this is a power issue. Perhaps my power supply is old or too small? It is a 750W Corsair unit, pretty good quality, but 5 years old now. Perhaps it gets overloaded during boot if all dockers are starting in addition to the VM with the GPU, etc.?

Or do you have any other suggestions? Why would installing one NVME drive cause another (U.2 drive) to fail?

/Erik

syslog 2.txt syslog.txt

June 7, 2024

Community Expert

That looks more like the NVMe device dropped and reconnected.

June 7, 2024

Author

13 minutes ago, JorgeB said:

That looks more like the NVMe device dropped and reconnected.

OK, I am sorry but that comment is not really helpful to me.

What happened was that when I installed the new NVMe, the old one "enterprise" stopped working. In the "main" tab I could no longer see used/available space. Since all dockers and VMs are on that drive, none of them were working. This happened immediately when I started the array. I tried rebooting the server and the same thing happened again. So I shut down the server and took the new NVMe drive out and now the old one ("enterprise") is working as normal again.

The "enterprise" NVMe is attached to a PCI-e → SFF 8643 card with a SFF 8643 → SFF 8639 cable.

The new NVMe was attached to the first M.2 slot on the motherboard.

June 7, 2024

Community Expert

21 minutes ago, eribob said:

This happened immediately when I started the array.

If it happens at array start, the device could be bound to a VM/vfio-pci, but we'd need the complete diagnostics to see that; it won't be visible in the syslog.
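One quick way to check this from the terminal is to look at which kernel driver currently owns the device. A rough sketch, where `driver_in_use` is a made-up helper name and the PCI address is whatever `lspci` reports for the drive on your system:

```shell
# Extract the "Kernel driver in use" value from `lspci -k` output on stdin.
# A drive used by the host normally reports nvme; vfio-pci means the
# device has been handed over for passthrough to a VM.
driver_in_use() {
  awk -F': ' '/Kernel driver in use/ { print $2 }'
}

# Usage on the server:
#   lspci -k -s <bus:slot.func> | driver_in_use
```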

June 7, 2024

Author

2 minutes ago, JorgeB said:

If it happens at array start, the device could be bound to a VM/vfio-pci, but we'd need the complete diagnostics to see that; it won't be visible in the syslog.

OOOHHHH thanks! Now I need to test something: When I installed the new NVMe drive the vfio binding for my GPU disappeared and I had to rebind it at boot. Maybe the enterprise NVMe was somehow accidentally bound to the workstation VM instead of the GPU? And since the workstation VM is on autostart it would bind the enterprise NVMe... I will test and post back!

June 7, 2024

Author

Yep it looks like that was the problem, thanks!

My GPU used to be on address 0e:00.0 and bound to the workstation VM like this:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x0e' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x0e' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </hostdev>

When I installed the new NVMe the addresses changed so that the enterprise NVMe instead got that address:

    IOMMU group 28:			 	[8086:0a54] 0e:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
[N:1:0:1]    disk    INTEL SSDPE2KX080T8__1                     /dev/nvme1n1  8.00TB

Since I did not check that, the VM probably claimed the enterprise NVMe for passthrough when it started, and that is what disconnected it from the host?

I changed the address in the VM xml to 0f:00.0, and now it works as normal.

IOMMU group 29:			 	[10de:1b82] 0f:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] (rev a1)
 	[10de:10f0] 0f:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)
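In case it helps someone else, this kind of mismatch can be checked before starting a VM by listing the host PCI addresses the VM definition points at and looking up what currently sits there. A minimal sketch: both helper names are made up, the `sed` pattern assumes the attribute order and quoting that libvirt writes (as in the XML above) and ignores the PCI domain, and the VM name is a placeholder.

```shell
# Read libvirt domain XML on stdin and print each passed-through host PCI
# address as bus:slot.function, e.g. 0e:00.0. Only <address> lines inside
# <source>...</source> are host addresses; the others are guest-side.
hostdev_addresses() {
  sed -n "/<source>/,/<\/source>/ s/.*bus='0x\([0-9a-fA-F]*\)' slot='0x\([0-9a-fA-F]*\)' function='0x\([0-9a-fA-F]*\)'.*/\1:\2.\3/p"
}

# Show what the host currently has at each of those addresses.
check_vm_passthrough() {
  virsh dumpxml "$1" | hostdev_addresses | while read -r addr; do
    lspci -nn -s "$addr"
  done
}
```

Running `check_vm_passthrough "<VM name>"` after any hardware change should list only the devices you meant to pass through. If a storage controller shows up, the addresses have shifted.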

Again thanks for the hint! I am just sorry I did not find the source of the problem that this thread was originally about though...

June 8, 2024

Community Expert

11 hours ago, eribob said:

When I installed the new NVMe the addresses changed

This is typical when you add or remove a device, so I recommend always rechecking.

June 8, 2024

Author

24 minutes ago, JorgeB said:

This is typical when you add or remove a device, so I recommend always rechecking.

Yeah, I have experienced that before, don't know why I did not think of it. Thanks for pointing me in the right direction!
