unRAID Server Release 6.2.0-beta21 Available



Thought I would try my luck with 6.2. I'm trying to get my VMs back up and working, but it seems 6.2 doesn't like VMs hosted on the Unassigned Devices plugin. Is this correct?

 

I keep getting the following when I create a VM located on a mounted disk:

Warning: libvirt_domain_xml_xpath(): namespace warning : xmlns: URI unraid is not absolute in /usr/local/emhttp/plugins/dynamix.vm.manager/classes/libvirt.php on line 936 Warning: libvirt_domain_xml_xpath():

 

EDIT: Hmm, even if I try to create a new VM in the /mnt/user/system/ location I get the error. Diagnostics attached. Any ideas?
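For reference, that warning is libxml2 complaining that a namespace URI in the domain XML is literally the string "unraid" rather than an absolute URI. If you want to see the declaration it is objecting to, something along these lines should show it (a hedged check; "Windows10" stands in for your VM's actual name):

virsh dumpxml Windows10 | grep xmlns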

 

I had/have the same problem with my Win10 VMs.

 

I ended up ditching my XML, creating a new VM, and pointing it to the existing VM's Win10 disk image. I then went back and edited the XML as needed. I still get the warning, but the VMs will start up.

 

Warning: libvirt_domain_xml_xpath(): namespace warning : xmlns: URI unraid is not absolute in /usr/local/emhttp/plugins/dynamix.vm.manager/classes/libvirt.php on line 936 Warning: libvirt_domain_xml_xpath():

 

I am using the Unassigned Devices plugin and it doesn't seem to be a problem.

Link to comment

 


 

My VM was also a Windows 10 VM. I went back to 6.1.9 as I have limited time currently.  I will probably try again once the next beta is released.

Link to comment

All my filesystems are XFS. (cache & array)

 

I found a very easy and fast way to reproduce the issue.

I installed an Ubuntu VM (15.04, desktop, default settings) and created a script that runs some dd commands.

The script contains variants up to 8G and a count of 1,000,000 (for I/O stress).
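The script itself wasn't posted, but a rough reconstruction along those lines (block sizes and counts here are guesses) would look like:

#!/bin/bash
# sequential direct writes with increasing size
dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=direct
dd if=/dev/zero of=/root/testfile bs=1G count=8 oflag=direct
# many small direct writes for I/O stress
dd if=/dev/zero of=/root/testfile bs=4k count=1000000 oflag=direct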

 

When I place the vDisk on a disk in the array and run it, the first line

dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=direct

does not even succeed. Everything hangs after 5-15 seconds.

 

It works perfectly fine while being placed on the cache (850 MB/s and more on the NVMe cache).

 

I can now test the issue without putting my Windows VMs in danger, and I can rule out a Windows issue.

Maybe other people want to try that and see if their system keeps running or hangs like mine.

Ok, I had some kind of breakthrough...

Testing the other filesystems was on the list and I finally had some spare time to move the files around.

 

Switching from XFS to BTRFS didn't change anything, but after going to ReiserFS the aforementioned Ubuntu VM managed to complete not just the first part of the script, but the whole thing.

 

Even with very decent performance, considering the parity and QEMU overhead...

 

I need to test more, but until now, that script NEVER even once completed...

 

Btw, I forgot to mention that I changed the vDisk from "cache=writeback" to "cache=directsync" in all the tests, because it seemed to be a disk issue, not a cache issue.

So if anyone tried to reproduce this, the result may be different without "cache=directsync", sorry!

ubuntu-dd.PNG (screenshot attached)

Link to comment

I have been coming across a weird issue lately which seems to have crept in with this new version. Everything was working fine for ages, but now my VMs keep locking up with the following error message:

 

Apr 23 15:49:00 Archangel kernel: pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0018
Apr 23 15:49:00 Archangel kernel: pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
Apr 23 15:49:00 Archangel kernel: pcieport 0000:00:03.0:   device [8086:2f08] error status/mask=00004000/00000000

 

I'm unsure how to trace what the requester ID actually points back to, but when this error appears both my VMs become paused under the GUI and cannot be resumed. On an attempted resume I get the following message:

internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required

 

Not sure what to do with this, as it is rendering my VMs unusable; without a force stop they will not do anything. VM 2 (the one named "Cat - SeaBios") will start up after the failure but refuses to see any USB devices attached to it.

 

No idea what to do with this one at all. I was able to take a diagnostic after this error occurred.
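For reference, a PCIe requester ID encodes bus/device/function (bits 15:8 = bus, 7:3 = device, 2:0 = function), so id=0018 decodes to 00:03.0, the same root port that logged the error. A quick way to see what sits at that address (a hedged sketch, nothing unRAID-specific):

lspci -vv -s 00:03.0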

archangel-diagnostics-20160423-1555.zip

Link to comment


 

I've seen this error on my system after a nasty motherboard death issue. Anyhow, mine was related to memory timing, and I was able to change some settings and have never seen it again. Even if you haven't changed anything hardware-related, I'd still run Memtest to be certain.

Link to comment


 

Hi Bungee,

 

Thanks for the reply. I've recently run a 24 hour memtest with no issue but it is certainly possible my memory may have started to play up.

 

For now I have done the following to rule out a few things before taking the home server away for memtests:

  • Re-flashed the BIOS on my motherboard
  • Moved the PCIe devices to different sockets in case something weird was going on
  • Replaced the power cables and USB cables going to/from my USB 3 controllers
  • Moved my GTX 210 so it is on its own under a single PLX chip

 

Having googled the error, people are suggesting it is an Nvidia driver issue under Linux, with most reports pointing at the host GPU (weirdly all under the same PCIe port and requester ID). With an older GTX 210 a driver support issue would make sense given its age, but I don't think that is too likely given the way it has only just started happening.

 

I think my next steps are to run a memtest again and alter my system overclock slightly, just in case something is not playing nice. It seems too weird that VM 2 would lose all USB input after the lockup though. I did find a kink in the USB cable and part of it had been crushed, so right now I am ruling out the USB cable sending rubbish to the controller.

Link to comment


 

Did you do a search on this thread for "QEMU"? It has come up a number of times. I can also vaguely recall seeing some mention of settings having to be changed for use with some motherboards and VMs, so a search of the entire board might provide an answer...

Link to comment

Don't know if this has been reported or not.

 

It seems whenever I edit a VM in the WebGUI, the Primary vDisk location is reset to "None". When I choose "Manual", it automatically gets all the original values back. This happens every time I click "Edit".
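A quick way to confirm it is only the form being reset and not the underlying XML is to check whether the vDisk path survives the edit (a hedged check; "Windows10" stands in for the VM's actual name):

virsh dumpxml Windows10 | grep "source file"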

Link to comment


I've observed this as well.

 


Link to comment

I think my next steps are to run a memtest again and alter my system overclock slightly, just in case something is not playing nice. [...]

When doing the memtest, just make sure you have the SMP (multiprocessor/core) option enabled. I found that once, when I had issues with VMs and unRAID hanging, the RAM would pass a normal memtest, but when I tried the multiprocessor option it hung around test number 7 and was fixed by changing out a memory module.

 

The thing is, the default Memtest might not be stressing your system enough to reveal the issue.

 


Link to comment

[...] after going to ReiserFS the aforementioned Ubuntu VM managed to complete not just the first part of the script, but the whole thing.

The Windows VM crashed during the backup last night.  :'(

 

So I guess ReiserFS does not "fix" the issue, but delays or softens it (at least in my case).

Or it plays better with Linux VMs; investigation ongoing  -.-

Link to comment


 

I have just started another memtest as this crash has happened 6 times today. I opened memtest and pressed F2 to force the multi-core (SMP) mode and it is just sitting there doing nothing.

 

Memtest is also reporting 20GB of memory when I have 32GB installed, so something is clearly messed up.

 

Edit:

So I did a memtest which threw over 1000 errors in the first 10 minutes in single-threaded mode, as multi wouldn't launch. I did a BIOS update and memtest in single-core mode stopped throwing errors.

 

I used the PC with no issues for a few hours, then had the same lockup. I have the latest memtest running right now with all 16 cores. It's able to do 2 passes per hour, so I have set it to run through 50 tests and see what happens.

 

If this passes I'm not sure what to try next. unRAID continues to function, but the VMs lock up the second that error is shown; it usually shows twice and one VM will lock up before the other.

 

I may try a new GPU for my one VM, and failing that I think it may be time for a motherboard RMA.

Link to comment

Just had a nasty BTRFS loop error, with all VMs and Docker no longer working at all. Grabbed diagnostics right after, rebooted, same issue. Ran a scrub on the BTRFS cache drive with no errors. I'd really appreciate help in getting this fixed.

Diagnostic attached.

 

Edit: I think it all started here, however I have over 200GB free on my cache drive.

Apr 25 21:00:21 Server shfs/user: shfs_write: write: (28) No space left on device
Apr 25 21:00:25 Server kernel: loop: Write error at byte offset 2902630400, length 4096.
Apr 25 21:00:25 Server kernel: blk_update_request: I/O error, dev loop0, sector 5669200
Apr 25 21:00:25 Server kernel: loop: Write error at byte offset 2902760960, length 512.
Apr 25 21:00:25 Server kernel: blk_update_request: I/O error, dev loop0, sector 5669455
Apr 25 21:00:25 Server kernel: loop: Write error at byte offset 2902891520, length 1024.
Apr 25 21:00:25 Server kernel: blk_update_request: I/O error, dev loop0, sector 5669710

 

When this happened all VMs went into a paused state; Docker looked to be on, but I think it was dead also.

After the reboot Docker has disappeared and I cannot start VMs; they error out (I forget the message). If I edit a VM, a read-only error (or write failed, I can't recall) pops up.

server-diagnostics-20160425-2101.zip

Link to comment


 

So, this is certainly related to the write errors I was getting, but it was interesting to diagnose.

I decided to disable my VMs and delete the 1GB libvirt.img located at /mnt/user/system/libvirt/libvirt.img (from the console). The rm went fine, and the image was deleted.

From the WebUI I increased the size to 4GB (I have the space) and re-enabled it; however, a write error appeared in the log and the image was never created:

Apr 25 22:50:02 Server root: Creating new image file: /mnt/user/system/libvirt/libvirt.img size: 4G
Apr 25 22:50:02 Server shfs/user: cache disk full
Apr 25 22:50:02 Server shfs/user: shfs_create: assign_disk: system/libvirt/libvirt.img (28) No space left on device
Apr 25 22:50:02 Server root: touch: cannot touch '/mnt/user/system/libvirt/libvirt.img': No space left on device
Apr 25 22:50:02 Server shfs/user: cache disk full
Apr 25 22:50:02 Server shfs/user: shfs_create: assign_disk: system/libvirt/libvirt.img (28) No space left on device
Apr 25 22:50:02 Server shfs/user: cache disk full
Apr 25 22:50:02 Server shfs/user: shfs_create: assign_disk: system/libvirt/libvirt.img (28) No space left on device
Apr 25 22:50:02 Server root: failed to create image file

I decided to disable Docker under Settings/Docker by setting Enable to No.

After doing this I was able to create the new 4GB libvirt.img and edit my primary VM's XML (from the VM manager), and it booted and is working as it should.

 

So I assume my 20GB docker.img filled up (somehow) and caused a lot of havoc on the cache drive, leading to the VM (and libvirt.img) not properly functioning.

I now plan to delete my docker.img, increase it to 30GB (which seems extremely excessive for my Docker usage), and see how it goes.
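Before resizing, confirming how full the loop-mounted image actually is only takes a couple of commands (a rough sketch, assuming the stock unRAID layout where docker.img is a BTRFS image mounted at /var/lib/docker):

df -h /var/lib/docker                  # free space inside docker.img (loop0), not the cache pool
btrfs filesystem df /var/lib/docker    # data/metadata breakdown of the image
du -sh /var/lib/docker/containers/*    # per-container log and scratch usage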

A new diagnostics file from my adventure in figuring this out is attached.

 

Edit/Update: Fixed, all is well  8)

server-diagnostics-20160425-2306.zip

Link to comment


If it's 6.2-related, increasing the image does not seem like a reasonable solution.

Did you try to find out which container consumes the space, to see if it's a beta-related issue?

If not, I would suggest installing "cAdvisor" from Community Applications and then having a look at the resource monitor in Community Applications.
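If you would rather not go through Community Applications, the upstream container can also be run directly; this is roughly the standard invocation from cAdvisor's own docs, so the mounts may need tweaking for unRAID:

docker run -d --name=cadvisor -p 8080:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:rw \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  google/cadvisor:latest
# then browse to http://<server-ip>:8080 for per-container disk and memory usage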

Link to comment


 

I ended up sticking with 20GB, as I only use ~6GB from the initial Docker install with my standard Docker apps installed.

I haven't been monitoring the free space for this, so I'm unsure which Docker would have caused it, but I do understand this may not be directly 6.2-beta related (however, it never happened prior to now).

I had run the script from a Docker thread (don't have the link right now) to remove cached/left-behind items in the Docker image, and at that time (months ago) I did not have any offenders. However, since then I have installed Plex, and I bet it consumed all kinds of space for no good reason (some hate on Plex there, and others have had this issue too).

 

What surprises me is this:

My docker.img was (almost certainly) full. OK, fine, Docker should become "broken".

But why does a full disk image cause a read-only/write-failed condition for libvirt.img, or for the QEMU XML being written?

This may have also made the rest of the cache drive's writes fail, however I am uncertain.

 

Now, knowing this is a bad condition to get into (sorry, I haven't looked, as I never had this issue before): can we set a notification for Docker image utilization at a given percentage?

I know I get this for my array disks.
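Until something like that exists in the GUI, a cron-style check could approximate it (a rough sketch; the threshold, mount point and logger destination are all assumptions, so swap in whatever notification mechanism you prefer):

#!/bin/bash
# warn when the loop-mounted docker image passes a usage threshold
THRESHOLD=80
USED=$(df --output=pcent /var/lib/docker | tail -1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
    logger -t docker-img-check "docker.img is ${USED}% full"
fi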

 

Link to comment

Just started getting this error today whilst running the latest beta. I'm in the process of copying everything onto the unRAID server (from various Windows machines) and today that copy seemed really slow.

 

The copy stopped with the following error message:

 

Error 0x8007003B: An unexpected network error occurred.

 

I captured the diagnostics file after this error occurred.

It's worth noting that I was doing a copy of files from a Windows machine, but also trying to move several files from one share on unRAID to another, via a Windows PC.

 

I then couldn't access any other files on the unRAID machine. I tried to stop the array and reboot; that didn't work. I also tried to use the powerdown plugin, which also didn't work, so I did a hard reboot.

 

After the box came back up, I tried to copy one of the same files which was copying when the transfer died the first time (this was a copy from one unRAID share to another, via a Windows machine). This caused the same unexpected network error again. I've attached the diagnostics file for the second failure as well.

 

I've searched on the forum and this error appears to have occurred frequently (so probably isn't related to the latest beta).

 

The machine has only been in service for 6-8 weeks. I ran memtest on it before starting. All the hard drives have been through 3 rounds of preclear before being used.

 

If I can't copy files either internally or externally to my unRAID machine I have a bit of an issue.

 

Does anyone have any idea what's causing it and how to resolve it?

 

Thanks

unraid-diagnostics-20160426-2032.zip

unraid-diagnostics-20160426-2232.zip

Link to comment

Don't know if this has been reported or not.

 

It seems whenever I edit a VM in the WebGUI, the Primary vDisk location is reset to "None". When I choose "Manual", it automatically gets all the original values back. This happens every time I click "Edit".


 

Go to Settings/VM Manager, click Advanced Settings, and set the default VM storage path to your VM directory; then all will be fine.

Link to comment

Just started getting this error today whilst running the latest beta. I'm in the process of copying everything onto the unRAID server (from various Windows machines) and today that copy seemed really slow. [...] Error 0x8007003B: An unexpected network error occurred. [...] Does anyone have any idea what's causing it and how to resolve it?

 

You say you were saving files to the server and also moving files from one share to another. That's a lot of non-sequential parity activity. I notice you are using SMR disks for both data and parity so I wonder if you managed to fill up the persistent cache on one or other (or both) of your parity disks. That would cause performance to drop like a stone. When you powered back up the persistent caches would still be full and need time to flush. So for a while the write performance would remain very poor. Try leaving it to recover. You might need to abort the parity check that unRAID forces after an unclean power down, until the disk activity ceases, but be sure to do a manual parity check when it does. Of course, this is just a theory and I may well be wrong and the cause may lie elsewhere.

Link to comment

To daigo and anyone else that has had system stability issues relating to VMs, array vdisks, etc.

 

I just wanted to touch base on this issue because we have been trying to recreate this on our systems here and have been unsuccessful.  There are multiple people reporting issues like this, but it's definitely not affecting everyone (nor would I say even the majority of users).  I've tried copying large files to and from both array and cache-based vdisks.  I've tried bulk copies to and from SMB shares.  I've tried bulk copies from mounting ISOs inside Linux VMs and copying data from them to the vdisk of the Linux VM.  No matter what, the systems here remain solid and stable, no crashes or log events of any kind.

 

What this means is that we are still investigating and we are continuing to patch QEMU and the kernel so we can see if this issue is better addressed in a future beta release.

 

I wish I had more to say on this issue but for now, until we can recreate it, it's going to be an ongoing research problem.

 

Jon,

 

Just wanted to follow up since my last email.  Probably restating what you already know, so chances are it's of no value, but this is what I've noticed so far when the system hangs:

 

- Stop all VMs and Dockers and perform heavy I/O on the system (i.e. disk-to-disk copies) and all is fine.  It will happily copy for hours.

- Have a Docker do heavy I/O (i.e. Sonarr unpacking all the Mono dependencies on startup) without any disk activity on the host and all is fine.  No freeze. Although I did previously (a few weeks back) manage to get a SABnzbd Docker to kill the system without host disk activity when processing a large queue.

- Have a VM do heavy I/O (i.e. an Ubuntu server VM with SABnzbd processing a queue of hundreds of gigabytes) without disk activity on the host, and no freeze.

- Multiple VMs and/or Dockers taxing the system at the same time will bring on a hang as well.  I had a script processing some PDFs for language detection and renaming while SABnzbd worked on some downloads in the Ubuntu VM, and it caused a hang.

- Start a VM and/or Docker with heavy disk activity while performing heavy host I/O as above and the system will hang within 10-15 minutes.  By hang I mean the disk-to-disk copy (using mc) will freeze, the Docker will be unresponsive, and the web UI will cease to respond.  Only a power cycle will bring the system back.  You can telnet to the system, but trying to run a command such as iotop will never return and will hang that session.

 

During all the above, there are no dmesg entries, nothing in the syslog, no significant I/O (as per a previously started iotop), and the system load will just creep up until you power cycle.  I've left it in that state to see if it would recover and had a system load in the high 70s after a few hours with no end in sight.

 

It really seems as though the system is hitting some kind of a deadlock between Docker, KVM and the Linux host OS.  When the hang starts, nothing is happening in terms of CPU, disk or network IO.  All the I/O just suspends (doesn't even get to error out).  It just ceases to respond which is why I think the various pieces (processes, threads, whatever) are waiting on another piece.  That's why I describe it as a deadlock, at least in terms of its symptoms.
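One thing that might be worth trying the next time it hangs, before power cycling, is dumping the blocked tasks from the surviving telnet session; a sketch, assuming the kernel has SysRq support enabled:

echo 1 > /proc/sys/kernel/sysrq     # allow all SysRq functions
echo w > /proc/sysrq-trigger        # dump tasks stuck in uninterruptible (D) state
dmesg | tail -n 200                 # the blocked-task backtraces land in the kernel log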

 

I wish I could be more descriptive, but there's not much to go on.  I wish there was a rogue process or I/O stream that could be identified as a smoking gun.

 

In terms of how often it happens, I've gone 3-4 days without a hang and then had 4-5 hangs in a day.  It's fairly random and very frustrating.  Luckily the machine has IPMI so I can power cycle it remotely.

 

Again, nothing here is probably of value but I figured it couldn't hurt to send it along anyway.  Feel free to ignore it, I'm just filling in the time waiting for the next beta... :P:)

 

Link to comment

So as a further update to my issue, things are now getting worse on the PCIe front and I can now replicate this issue time after time on my system. Attached is my diagnostics file.

 

Apr 27 04:11:06 Archangel kernel: pcieport 0000:00:02.0: AER: Corrected error received: id=0010
Apr 27 04:11:06 Archangel kernel: pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0010(Receiver ID)
Apr 27 04:11:06 Archangel kernel: pcieport 0000:00:02.0:   device [8086:2f04] error status/mask=00000080/00002000
Apr 27 04:11:06 Archangel kernel: pcieport 0000:00:02.0:    [ 7] Bad DLLP              
Apr 27 21:06:31 Archangel kernel: pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0018
Apr 27 21:06:31 Archangel kernel: pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
Apr 27 21:06:31 Archangel kernel: pcieport 0000:00:03.0:   device [8086:2f08] error status/mask=00004000/00000000
Apr 27 21:06:31 Archangel kernel: pcieport 0000:00:03.0:    [14] Completion Timeout     (First)
Apr 27 21:06:31 Archangel kernel: pcieport 0000:00:03.0: broadcast error_detected message
Apr 27 21:06:31 Archangel kernel: pcieport 0000:00:03.0: broadcast mmio_enabled message
Apr 27 21:06:31 Archangel kernel: pcieport 0000:00:03.0: broadcast resume message
Apr 27 21:06:31 Archangel kernel: pcieport 0000:00:03.0: AER: Device recovery successful

 

I have never seen the error on requester ID 0010 before, or one through that root bus. In these events both my VMs lock up (paused under the GUI) and cannot be resumed. unRAID remains working perfectly, but the VMs stop functioning. Things I have tried:

  • Replaced the GPU I have issues with (750 Ti)
  • Reflashed the BIOS multiple times
  • Done a 24-hour memtest on the latest version across all 16 cores with no faults
  • Changed the order of PCIe slots on my motherboard
  • Removed the USB 3 controller from the VM that usually has trouble

 

To replicate this issue I launch a game on the VM with the 780, then load a movie on Amazon Prime in the VM with the 750 Ti. After around 10 minutes of video playback the VMs will lock up and fail every time; this is something I can repeat.

 

I have no idea if this is related to the beta or to hardware failure on the motherboard, but if I can get some help debugging what is wrong it would be appreciated, especially as it only seems to affect VMs (Dockers continue to run fine).

 

Edit

It seems there is an issue with the latest Nvidia driver and playing items such as Netflix and Amazon Prime on certain systems. This appears to be causing PCIe conflicts with other devices.

To test this I have replaced my 750 Ti with an older AMD 5750 and downgraded my 780 driver to version 362, which people are reporting as a fix. I never considered that a graphics issue on a guest would affect the host system so much (the 750 Ti under SeaBIOS is most likely the culprit). I am going to test this for a few days and see what happens; if all goes well with the AMD card I will put the 750 Ti back in with the older drivers and see if this fixes the issue.

archangel-diagnostics-20160427-2109.zip

Link to comment

[...] So I guess ReiserFS does not "fix" the issue, but delays or softens it (at least in my case).

 

Played around with some settings (thanks to Eric!) and found a working combination for my Windows VM.

 

As I mentioned, going "back" to ReiserFS worked for the Linux VM, but that alone was not a fix for my Windows VM.

I changed the cache mode of the vDisk that is placed on the array to 'directsync' (like the Ubuntu VM), but that alone did not help. After also changing the tunable md_num_stripes to 20480, my Windows VM has no issues at all. I tried those settings with XFS, but it did not help. ReiserFS + 'directsync' + the tunable is a working solution for me. Normal usage, nightly backups, everything works right now.
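For anyone wanting to double-check which cache mode a vDisk is actually using, it shows up as the cache= attribute on the disk's <driver> element in the domain XML; a quick hedged check ("Windows10" stands in for the VM's name):

virsh dumpxml Windows10 | grep "cache="    # empty output means the hypervisor default is in use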

 

But performance while copying vDisks from that disk to the cache disk (sequential read) went from ~450MB/s on XFS, to ~350MB/s on BTRFS, to now "only" ~280MB/s on ReiserFS. The performance hit seems rather big; I don't know if it's related. It may have other causes (trim, etc.).

While that's no fix, and even a debatable workaround, I hope it helps LimeTech/Tom find the underlying issue.

 

For those who don't want to go back to 6.1.9 but need a quick-and-dirty fix, you may try these settings and see if they help.

Link to comment
This topic is now closed to further replies.