
Unraid System: unreliable/unresponsive after just one day?



I have a few severe problems with my main Unraid system that make it unusable after 1-2 days. I have no clue what the reason is, and a short disclaimer up front: I am not able to get any diagnostics when the problem occurs, at least I don't know of a way that works for now.

 

My system:

- Unraid Version: 6.12.6

     - 2x 500 GB Samsung SSD (1x data / 1x parity) dedicated to Unraid + Docker (but not VMs),
       connected via LSI HBA (x4 -> 2 controllers; this one is not stubbed)

     - 128 GB DDR5

- Docker: activated

     - nextcloud

     - swag

     - MariaDB

     - DynDNS (x 2)

- VMs:

    - TrueNAS (main NAS system, always active)

           - 10x 4 TB IronWolf via LSI HBA (x4 PCIe -> 2 controllers, one stubbed), in IT mode and with dedicated firmware for TrueNAS

           - 6x 2 TB Samsung SSDs via the mainboard controller (stubbed)

           - 2x Intel Optane (Stubbed)

    - Windows 10 VM

    - Ubuntu VM

- Custom user scripts:

    - used to wait for the VM's shares to be mountable (until it has booted), and at shutdown to allow enough time to (lazy) unmount them (see the sketch below)
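
(For context, a minimal sketch of what these user scripts do; the IP address, share name and mount point below are placeholders, not my actual values:)

    #!/bin/bash
    # wait until the TrueNAS VM answers before mounting its SMB share on the Unraid host
    until ping -c1 -W2 192.168.1.50 >/dev/null 2>&1; do
        sleep 5
    done
    mount -t cifs //192.168.1.50/data /mnt/remotes/truenas -o credentials=/boot/config/smb-credentials

    # at array stop / shutdown: lazy unmount, so a hung share doesn't block powerdown
    # umount -l /mnt/remotes/truenas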

 

Problem:

I set up my Nextcloud Docker container and connected my SMB shares from TrueNAS. The TrueNAS VM is the main part for data storage. The DynDNS and SWAG Docker instances worked fine, because I could reach my cloud from outside. A few hours later (I checked everything approximately 12 h later) my cloud was unreachable. The Unraid GUI was partially reachable: the Settings, VMs and Docker tabs didn't work (the UI just kept loading forever). And more problems:

- the Main tab was displaying 100% load on 3 random cores, BUT htop didn't show it

- the TrueNAS VM could not be reached at its usual IP (the IP was reported neither by the FritzBox router nor by my MikroTik switch; I have two DHCP servers)

- TrueNAS itself couldn't be reached either

- Nextcloud couldn't be accessed from outside, so my guess is SWAG also failed in some way

- Nextcloud CAN STILL be reached from inside the network -> the Docker container is still functioning (the only thing I can reach at the moment without restarting)

- I cannot create a diagnostics file from the IPMI console (it hangs forever)!

[screenshot attached]

- also the system cannot be shut down (powerdown); I have had this problem for a longer time, it gets stuck at "The system is going to reboot".

 

After another 2 days (I let the system run to see if diagnostics would be created):

- UI not reachable at all in two different browsers

- IPMI console still reachable

- Unraid can still be pinged

 

Orderly shutdown via IPMI leads to:

[screenshot attached]

 

A second orderly shutdown attempt --> nothing happened. The system doesn't respond...

 

It frustrates me... I really want to depend on the system. I want to use it as my main storage and homelab for work, university and private data, but at the moment I can't rely on Unraid at all. I put hundreds of hours into setting up the system and fine-tuning everything. I had huge problems with the TrueNAS virtualization: it was throwing IO-capability errors on the HBA ("iscisTaskFull") for months, despite updating the HBA firmware and going through current tutorials step by step plus troubleshooting. After a TrueNAS update from version 12 to 13 the Unraid VNC didn't print any error messages anymore. Now Unraid is acting up and systems are failing all over the place after a few hours, and I can't find anything about these errors.

 

 

Can someone please explain HOW to get diagnostics? I can still access the USB share, but that's it. Of course I can restart the system via a forced IPMI shutdown, but it will be the same again after a few days or even hours. I really want to depend on the system, but right now it's pretty much a very expensive pile of electronic scrap... I really want to stick with Unraid because I love so many of its features. But if basic virtualization and management features lead to this much troubleshooting work, I am seriously considering switching to another hypervisor. All the knowledge and work would be for nothing, though.
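
For completeness, the CLI route I would try from an SSH or IPMI shell, assuming the stock "diagnostics" command still responds (it normally writes a zip to the flash drive):

    diagnostics
    ls -lh /boot/logs/    # the resulting zip should appear here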

 

Looking forward to any help!

Please let me know any information you need and how to get diagnostics; I will try to provide as much information as possible.

 

Thanks in advance!

 

 

 

 



The syslog in the standard diagnostics is the RAM version that starts afresh every time the system is booted.  You should enable the syslog server (probably with the option to Mirror to Flash set) to get a syslog that survives a reboot so we can see what leads up to a crash/freeze.  The mirror to flash option is the easiest to set up (and if used the file is then automatically included in any diagnostics), but if you are worried about excessive wear on the flash drive you can put your server's address into the remote server field.  
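
Once mirroring is enabled you can confirm it is actually writing by looking on the flash drive from a console; the exact file name may vary, but it should end up under /boot/logs/:

    ls -lh /boot/logs/
    tail -n 50 /boot/logs/syslog    # path/name is an assumption; check what actually appears there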

8 hours ago, itimpi said:

The syslog in the standard diagnostics is the RAM version that starts afresh every time the system is booted. […]

I have a second server running for future backups; can I also stream the diagnostic log to a share on that machine? It doesn't have much storage though (just a few GB left and only one HDD); could that become a problem, e.g. a bottleneck? I want to avoid using the same machine or a VM on the same machine, because if the VM fails first I obviously have no log of the failure point to begin with. Can you suggest a syslog server for TrueNAS? If not, I will just try to get it set up with the info I find and post the log here as soon as I have one.


I was trying to get a syslog (Graylog) server up and running on my second machine, but as always it's not that easy to get it to run properly.

So I randomly looked into my current log, and a lot of errors had been popping up over the day. The "limiting requests" errors actually point to the Mac I am accessing the server with (via VPN). Is this a big deal? Can I ignore these error messages?

 

Also, my syslog server reports 0 messages. Can I send a test message to the syslog server to validate it? Something like "diagnostics *target IP of log server*"?
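
(My assumption is that a one-off test message can be sent with the standard logger tool from the Unraid console; the IP and port below are placeholders for the target syslog server:)

    # send a single UDP test message to the remote syslog server
    logger -d -n 192.168.1.60 -P 514 "unraid syslog test $(date)"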

I will have a second look tomorrow to see if any messages have been reported, but for now it seems like the syslog server option in Unraid is not working for me (attached you can find the syslog configuration).

[screenshot: syslog server configuration]

icarus-syslog-20240223-0705.zip


OK, I found something new: for whatever reason the same hiccup happened again. The point where it happened was the start of an SMB transfer from a Windows 10 VM (its image is located on an Unraid drive, NOT a TrueNAS drive) to my TrueNAS SSD share. I can clearly identify the error:

qemu-system-x86_64: vfio_dma_map(0x14a74b57da00, 0x380000060000, 0x2000, 0x14af51e47000) = -22 (Invalid argument)
2024-02-22T06:10:33.177272Z qemu-system-x86_64: VFIO_MAP_DMA failed: Invalid argument

 

There is no GPU passed through to the TrueNAS VM, BUT a Broadcom HBA is (Broadcom 9300-16i -> capacity for 16 internal drives on two 8-port SAS controllers -> one controller bound to VFIO = 8 HDDs for TrueNAS, and 3 drives for Unraid on the other controller, which is not bound to VFIO).
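
(A quick way to double-check which of the two SAS functions is bound to vfio-pci and which one Unraid is using, assuming lspci is available on the host:)

    lspci -nnk | grep -iA3 'SAS'
    # the stubbed controller should show "Kernel driver in use: vfio-pci",
    # the one Unraid uses should show "Kernel driver in use: mpt3sas"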

 

The weird thing is that the SMB transfer was targeted at a RAID10 of 6 Samsung SSDs, all of which are connected to the motherboard's SATA connectors. I had problems with the HBA in the past, before upgrading to TrueNAS 13, so I am still afraid the HBA could be the failure point, but at that moment the system was accessing the other controller with the 2x 500 GB SSDs for Unraid directly (the virtual image is on the Unraid drives, not the TrueNAS drives!). And the TrueNAS VM should have written the data to the SSDs attached to the motherboard, not to the HBA.

 

How can I handle this error? Any ideas? Why is the Unraid UI affected (freezing tabs)? Unraid runs from RAM, so it shouldn't be affected by failing connections to any drives, should it? Only the VM, Docker and certain Settings tabs result in a frozen UI... it doesn't make sense and still drives me crazy :(

 

Attached are two different diagnostics that I was able to create while the UI was acting up, plus the extracted VM log where the VFIO error can be found.

 

Thanks for any answer ...

icarus-diagnostics-20240226-2326.zip icarus-diagnostics-20240228-2358.zip TrueNAS Icarus.txt


Also attached are the results of the "cat /proc/iomem" command, with the relevant block for the HBA:

  bc100000-bc5fffff : PCI Bus 0000:0b
    bc100000-bc4fffff : PCI Bus 0000:0c
      bc100000-bc2fffff : PCI Bus 0000:0f
        bc100000-bc1fffff : 0000:0f:00.0
        bc200000-bc23ffff : 0000:0f:00.0
          bc200000-bc23ffff : vfio-pci
        bc240000-bc24ffff : 0000:0f:00.0
          bc240000-bc24ffff : vfio-pci
      bc300000-bc4fffff : PCI Bus 0000:0d
        bc300000-bc3fffff : 0000:0d:00.0
        bc400000-bc43ffff : 0000:0d:00.0
          bc400000-bc43ffff : mpt3sas
        bc440000-bc44ffff : 0000:0d:00.0
          bc440000-bc44ffff : mpt3sas
    bc500000-bc53ffff : 0000:0b:00.0

 

Both controllers seem to be bound correctly: one to vfio-pci and one to the mpt3sas SAS driver.

 

iomem

On 3/8/2024 at 3:26 AM, JorgeB said:

See if this helps with the PCIe errors:

 

 

Unfortunately, that doesn't seem to solve it. I edited my syslinux config and restarted the machine:

[screenshot attached]

 

Also attached again: the newest diagnostics and the syslog of the TrueNAS VM. It always freezes on a normal VM shutdown: still the VFIO_DMA_MAP -22 error. A forced shutdown works, though.

An Unraid shutdown from IPMI looks like this (see the two "waiting 200 secs ..." messages). I tried it twice; the shutdown is not executed even if forced by Unraid, so I have to cut the power via an IPMI command:

 

 

[screenshot attached]

 

But I solved PART of the problem: the heavy SMB load to the TrueNAS VM now works perfectly (even with advanced features like dedup the speeds are pretty good from what I see -> up to 2 Gb/s without bigger hiccups). Thanks to the QEMU dev/professional user "jarthur" on the QEMU Matrix channel. He responded immediately and sent me this:

 

which solved the problem under heavy load. My guess is that this is a downgrade to a lower QEMU version that didn't have the particular problem that rendered my TrueNAS VM with its passed-through HBA useless. Thank you very much, again! This is the first success after weeks of troubleshooting.

The number of DMA map errors (-22) also seems to have gone down, but they are still there. The syslinux config change (above) didn't affect that.

 

I would be glad about any new input for the syslinux file! I turned ACS OVERRIDE on (MULTIFUNCTION), but left VFIO unsafe interrupts OFF (NO).

I will try enabling all the unsafe interrupts as well and test again; for now, no luck!
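
(To verify what those GUI toggles actually put on the kernel command line after a reboot, my assumption being that they translate to the usual parameter names:)

    cat /proc/cmdline
    # look for: pcie_acs_override=downstream,multifunction
    #      and: vfio_iommu_type1.allow_unsafe_interrupts=1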

 

I feel like this is definitely a problem caused by Unraid, because the UI freezes after trying to access the VM tab and not before that, even if I have already tried to shut down the NAS VM. So from the dashboard I can still control Unraid, but one click on VM, Docker or Settings -> VM/Docker and the machine is unreachable... That must be a BUG, right? How can a failed VM influence the stability of the whole system after a weird sequence of changing tabs back and forth???

 

Still fighting to find a fix :(

 

 

 

 

icarus-diagnostics-20240309-1550.zip icarus-syslog-20240309-2149.zip


ACS Override = Multifunction and VFIO Allow Unsafe Interrupts = Yes just get rid of the log entries. The error persists:

[screenshot attached]

 

This is the UI stuck on a loading screen after trying to shut down. 

 

Syslinux Conf used:

[screenshot: syslinux configuration]

On 2/29/2024 at 12:46 AM, blacklight said:

OK, I found something new: for whatever reason the same hiccup happened again. [...] I can clearly identify the error:

qemu-system-x86_64: vfio_dma_map(0x14a74b57da00, 0x380000060000, 0x2000, 0x14af51e47000) = -22 (Invalid argument)

[…]

I started looking at this thread. Not sure if you want to try this option?

 

https://github.com/tianocore/edk2/discussions/4662

 

Currently on 6.12.8 we are running libvirt 8.7.0.

 


You should be able to add maxphysaddr to the XML, and it looks like it remains persistent. The limit option is currently not supported, but it may be in a newer point release where libvirt gets bumped up.

  <cpu mode='host-passthrough' check='none' migratable='on'>
    <topology sockets='1' dies='1' cores='1' threads='2'/>
    <cache mode='passthrough'/>
    <maxphysaddr mode='passthrough'/>
  </cpu>
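
One possible way to apply and sanity-check this, assuming virsh is available on the Unraid host (the VM name here is just a placeholder):

    # add the <maxphysaddr .../> line inside the existing <cpu> block
    virsh edit "TrueNAS"
    # check how many physical address bits the host CPU actually exposes
    lscpu | grep 'Address sizes'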



So I gave it a try and implemented "<maxphysaddr mode='passthrough'/>", and it WORKED for a few days for the TrueNAS VM!!

The VM can be started/stopped/restarted from within the VM and also from Unraid, it performs under load, and it seems like it won't crash on its own after a long time. Thank you very much for the help there!

 

BUT (and that's a big but, unfortunately) the LOG is still spammed with VFIO DMA MAP -22 errors, AND the error seems to have progressed into another VM with the same symptom:

I had a Windows 10 VM running in parallel with a passed-through GPU, and the same lock-up happened:

- Windows 10 VM (with GPU) failed, RDP froze (see attached syslog__.txt)

- a second Windows 10 VM continued running without problems, but a file transfer inside it stopped, which made me curious

-> more fatally, the TrueNAS VM failed but continued running -> more precisely, the pool attached to the HBA failed while the other pool, which had the two VMs on it, was still fine.

    -> so my guess here is: something inside Unraid's virtualization mechanism failed and dropped all VFIO maps or attached devices

- the Unraid GUI continued to work, but VM, Docker & Settings -> VM/Docker froze again!

- I could download the syslog from the GUI

- but I couldn't create a diagnostics package 

- after clicking on VM the GUI froze again

 

I had already implemented "<maxphysaddr mode='passthrough'/>" for the Windows VM with the GPU because I wasn't able to restart it. I always had to force-shut it down, and sometimes it froze randomly and the VM paused (I attached a log of that event). When I tried to use the Unraid shutdown/restart, one core went to 100%, the other stayed at 0%, and the VM got stuck in this crashed state. The maxphysaddr didn't help here...

 

Is there any solution to avoid all VFIO devices failing at once? Do I have to use "VFIO Allow Unsafe Interrupts"? I wanted to avoid that, because then the log is completely empty and I can't trace the errors.
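
(One thing I plan to check the next time it happens, assuming the kernel logs something when the mappings drop; this is just a generic sketch, not a known fix:)

    # look for IOMMU / VFIO related faults around the time all passed-through devices failed
    dmesg -T | grep -iE 'iommu|amd-vi|dmar|vfio' | tail -n 50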

 

Thanks again @SimonF, that helped me a lot, because the main part, the NAS, is working. But the VFIO trouble remains.
I posted it here because I didn't want to start a new thread, since the symptoms of a locked-up UI and failed VFIO devices are the same.

I also added the two XMLs of the VMs; the SOUND CARD is passed through, so that's not the problem :P

I will research this more, because I found way more input for Windows/GPU/gaming VMs than for HBA or TrueNAS problems, but still, if some expert from Unraid has any clue which part of that VFIO construct is faulty, I would be glad about a (technical) answer and a solution.

 

Thanks.

syslog__.txt

paused vm log.txt xml windows gpu.txt xml truenas.txt


And the crash is reproducible: it happens every time I move files over SMB while 3 VMs are active. EDIT: It now happens for all transfers, no matter which Windows VM I use. I didn't even change the TrueNAS XML, and it had been working for days... man, Unraid makes me go insane :(

However, I managed to get a diagnostics dump; it randomly works... Find it attached.

 

The only thing I noticed is that the times are not synchronized between the logs from libvirt, the TrueNAS VM and the Windows VM.

 

Another idea: is there a chance to change the libvirt/QEMU version (again), maybe to an older or a newer version? I already did that with the edk2 as explained here:

Does that include libvirt/qemu ... I have no idea ...

 

Glad about any answer.

 

icarus-diagnostics-20240322-0553.zip

13 hours ago, blacklight said:

And the crash is reproducible: it happens every time I move files over SMB while 3 VMs are active. […] Another idea: is there a chance to change the libvirt/QEMU version (again), maybe to an older or a newer version? […]

6.12.8 is on QEMU 7.2, 8+ will be in 6.13


The bug/error persists after a BIOS update (which resets the BIOS settings) and an Unraid update (6.12.10), plus I swapped the HBA into one of the bifurcated x8 slots to have full bandwidth.

Unfortunately, that didn't solve the problem. I noticed two things: 1. The QEMU log stopped on one day but the VM kept running for another 3 days, so I also couldn't see any shutdown command in the QEMU log. Is that because the log overflows with the VFIO DMA MAP errors? Can I clear the log somehow? I also had another error inside the TrueNAS VM: it couldn't SMART-check two drives, and I think it stopped at some point. Could it be that this was the same point at which the DMA MAP errors stopped? I will actually give it a try and turn off the SMART checks inside TrueNAS; maybe that fixes it. Could failed drives lead to such an error, even if the HBA is stubbed?
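
In case it really is just the per-VM log filling up, this is where I would expect it to live and how it could be truncated; the path is the standard libvirt location and the file name is my assumption based on the VM name:

    ls -lh /var/log/libvirt/qemu/
    # truncate in place rather than deleting, since libvirt keeps the file open
    truncate -s 0 "/var/log/libvirt/qemu/TrueNAS Icarus.log"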

EDIT: the QEMU log seems to stop shortly after the VM starts. Even under load or with tasks running inside TrueNAS I cannot see any new DMA MAP errors.

 

2. The unresponsive Unraid GUI behavior actually only started after the VM had been running for days, not immediately after stopping it (which was the case before). I was able to restart it a few times without the GUI or the VM acting up.

 

Any idea where I can find out what the value -22 even means? I have been researching this particular error for months now and I still have no clue what the error code actually means, and yes, I looked up the QEMU documentation... no luck on my side there.
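
For what it's worth, my assumption is that -22 is simply the kernel errno passed back to QEMU: errno 22 is EINVAL, "Invalid argument", which matches the text QEMU prints. On any Linux box with python3 this can be checked with:

    python3 -c "import errno, os; print(errno.errorcode[22], os.strerror(22))"
    # -> EINVAL Invalid argument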

