Unraid 6.8.3 - Upgrade, multiple problems, server died


I am at my wits' end. This is the second time I have tried to upgrade and the whole thing went wrong. The first time, it killed a drive, so I rolled back and encountered zero issues after that. After a while a prompt appeared saying that I should upgrade to mitigate a login vulnerability, so I upgraded to the latest version, 6.8.3 (I think; I can't access my server now).

 

1. First, it killed a functional 4TB drive that was almost 6 months old. One of my VMs just hung, and when I rebooted, one of the drives showed an error and required me to rebuild the array. (Note: this also happened last time, when I tried to upgrade about 6 months ago.)

2. Went to the shop and spent $150 on another 4TB drive, moved it to a completely different SATA port, and used a brand new SATA cable just in case.

3. Spent 12 hours rebuilding the parity and then I started my Windows 10 VM.

4. The VM had a patchy start, requiring me to reattach the keyboard and mouse multiple times before it detected them. Then, occasionally, the screen would turn off by itself after a while, requiring me to shut down the VM and start it again. I did this twice, then the whole thing just crashed.

5. Went to the GUI and saw that one of the drives had failed again! This is a brand new 4TB drive that I just bought!

6. The whole of UNRAID was unresponsive, forcing me to restart the server... and now UNRAID refuses to boot! It says "...waiting for /dev/disk/by-label/UNRAID (will check for 30 sec) at boot up..." and "cannot find device Bond0", and the whole GUI refuses to start... (see attachment)

 

I just can't imagine UNRAID putting out such an unstable release, one that actually kills hardware with no way to fix it. This experience has really left a bad impression. Even if I rebuild the whole thing (which will take days), I don't know if I can trust the solution again...

 

Happy to listen to suggestions... (and thanks in advance)

 

My system:

Motherboard ASROCK Taichi

CPU AMD Ryzen 5900

RAM  32 GB

 

GetAttachmentThumbnail.jpg

Link to comment

First and foremost, whenever a drive fails (or reports issues), the first thing to do is get the diagnostics zip (Tools -> Diagnostics), which contains clues as to what happened.

People regularly post disk issues on here, and experts (e.g. johnnie.black) respond relatively quickly. Based on your forum activity, you didn't ask for help originally, so keep that in mind for next time. Remember, you're not alone.

 

What version were you on before upgrading?

Did you stress test (e.g. run a preclear cycle) your drives before adding to the array?

 

Your issue with booting looks to be a problem with the USB stick.

  • Are you able to plug it into another (Windows) computer and run a disk check?
  • Is your stick USB 3.0 or 2.0? Are you using USB 3.0 or 2.0 port on your motherboard for the stick?

Corrupt or failing USB sticks are known to cause inexplicable issues, such as hanging VMs.

Your corrupt drive last time and this time could just be due to a forced reboot while data was being written, leading to corrupt data.

 

 

Link to comment

Thanks for the reply...

 

Yes, I should have grabbed the diagnostics, but it is too late now.

What version were you on before upgrading? - It was 6.6.7

Did you stress test (e.g. run a preclear cycle) your drives before adding to the array? - Yep, all good, using the preclear plugin

 

Your issue with booting looks to be a problem with the USB stick.

Are you able to plug it into another (Windows) computer and run a disk check? - I can try. It boots, but then fails to load what it needs to bring up the UNRAID GUI...

Is your stick USB 3.0 or 2.0? Are you using USB 3.0 or 2.0 port on your motherboard for the stick? - USB 3.0. It had been working perfectly before the upgrade, no issues whatsoever... It worked for 1-2 days after the upgrade, so I am not sure what has changed since then, other than that it crapped itself... But there are too many coincidences: first a drive failure, then a USB drive failure 1-2 days after upgrading to 6.8.3... It all points to 6.8.3 somehow creating an unstable environment, and then all these issues...

NOTE: The login screen works. I can enter my root password and get in, but the Mozilla GUI says "Unable to connect" (the address here is localhost).

 

One thing that is really painful here is that not only do I have issues with the array (and VMs), but I also have issues with the USB drive itself after the upgrade. The devs should build an automatic rollback feature into the GUI or the boot prompt, similar to what Windows has (albeit that one only occasionally works)...

Edited by munchies2x
Link to comment
24 minutes ago, munchies2x said:

One thing that is really painful here is that not only do I have issues with the array (and VMs), but I also have issues with the USB drive itself after the upgrade. The devs should build an automatic rollback feature into the GUI or the boot prompt, similar to what Windows has (albeit that one only occasionally works)...

Already exists:
image.thumb.png.8ea69adf503d049e58eacd2a73ced447.png

 

EDIT:  You can do the same thing manually by copying the contents of the previous folder on the flash drive back to the root of that flash drive.
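
As a rough sketch of that manual copy (not an official tool), the Python below copies every file from the previous folder back to the root of the flash drive. The function name restore_previous and the dry_run flag are made up for illustration, and it assumes the previous folder holds plain files only. Back up the flash drive before trying anything like this.

```python
import shutil
from pathlib import Path


def restore_previous(flash_root, dry_run=True):
    """Copy the files saved in <flash_root>/previous back to <flash_root>.

    Returns the names of the files that were (or would be) restored.
    """
    flash_root = Path(flash_root)
    previous = flash_root / "previous"
    if not previous.is_dir():
        raise FileNotFoundError(f"no 'previous' folder under {flash_root}")

    restored = []
    for src in sorted(previous.iterdir()):
        if not src.is_file():
            continue  # assumption: only top-level files need restoring
        if not dry_run:
            # Overwrites the file of the same name in the flash root.
            shutil.copy2(src, flash_root / src.name)
        restored.append(src.name)
    return restored
```

With the default dry_run=True it only lists what would be copied, for example restore_previous("E:/") where E: is a hypothetical drive letter for the stick on a Windows machine. Pass dry_run=False to actually overwrite the files.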

Edited by Frank1940
Link to comment
1 minute ago, munchies2x said:

Thanks Frank, where is the location of the previous folder? Is it in /usr ?

If you did the upgrade from the GUI, it is in the previous folder/directory found on the flash drive. (As part of the upgrade process, the folder is created and the files needed for restoration are copied there before the new files are installed in their place.)

Link to comment
2 minutes ago, munchies2x said:

I have to plug it into my Windows machine then

When you do this, be sure to run chkdsk on the flash drive before copying any files! With problems like the ones you first described, this is always a good idea. (Plus, USB3 and Unraid have a checkered history. Some MBs and BIOSes seem to have problems while others work fine...)

Link to comment

I managed to run chkdsk and fix the error -> tried to reboot my server again, didn't work

I then copied the contents of the previous folder to the main/root folder -> still doesn't work, localhost still refuses to connect and display the GUI

 

But I managed to do a diagnostic (now attached), any help would be appreciated!

 

Note: I am unsure why the date shows Feb 15. All I can tell is that it's the latest diagnostics file I generated...

tower-diagnostics-20200215-2249.zip

Link to comment
9 hours ago, munchies2x said:

But I managed to do a diagnostic (now attached), any help would be appreciated!

 

First question: how did you create the Diagnostics file when you did the above? (GUI, SSH or console)

 

9 hours ago, munchies2x said:

Note: I am unsure why the date shows Feb 15. All I can tell is that it's the latest diagnostics file I generated...

Where did you find the Diagnostics file that you attached to this post? (This is a bit of a loaded question, as there are only two possible answers: either in the Downloads folder/directory of your PC's web browser, or in the logs folder/directory of the Unraid boot drive.)
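
Incidentally, the date is encoded in the file name itself, so it can be checked without opening the zip. A minimal Python sketch, assuming the name follows the <server>-diagnostics-YYYYMMDD-HHMM.zip pattern seen in the attachment (the helper name diagnostics_timestamp is made up):

```python
import re
from datetime import datetime

# Pattern inferred from the attachment name in this thread (an assumption).
NAME_RE = re.compile(r"^(?P<server>.+)-diagnostics-(?P<stamp>\d{8}-\d{4})\.zip$")


def diagnostics_timestamp(filename):
    """Return (server_name, datetime) parsed from a diagnostics zip name."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected diagnostics file name: {filename!r}")
    stamp = datetime.strptime(match["stamp"], "%Y%m%d-%H%M")
    return match["server"], stamp
```

For tower-diagnostics-20200215-2249.zip this gives the server name tower and a timestamp of 2020-02-15 22:49.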

Link to comment

6pm update:

1. Updated my motherboard BIOS to the latest version

2. Magically, this allowed me to boot into the GUI again, not because there was an issue with the firmware but because the BIOS update reset my system and disabled all the virtualization settings... Somehow this allowed me to get back into the GUI.

3. Cleared my config file, etc. to tidy things up, got the boot stable again, and turned all the virtualization settings back on (AMD-V)

4. But now I have encountered a new problem: I can start my Windows 10 VM using VNC, but I can't get it to work with GPU passthrough. It was working 100% before the system crashed on me a few days ago.

5. The VM starts, then pauses and generates an error message...

 

Error message as per below:

Quote

Mar 29 18:13:56 Tower kernel: pcieport 0000:00:03.2: broadcast error_detected message
Mar 29 18:13:56 Tower kernel: pcieport 0000:00:03.2: broadcast mmio_enabled message
Mar 29 18:13:56 Tower kernel: pcieport 0000:00:03.2: broadcast resume message
Mar 29 18:13:56 Tower kernel: pcieport 0000:00:03.2: AER: Device recovery successful
Mar 29 18:13:57 Tower kernel: iommu ivhd0: AMD-Vi: Event logged [
Mar 29 18:13:57 Tower kernel: iommu ivhd0: IOTLB_INV_TIMEOUT device=0e:00.0 address=0x000000081e44b450]
Mar 29 18:13:57 Tower kernel: pcieport 0000:00:03.2: AER: Uncorrected (Non-Fatal) error received: 0000:00:00.0
Mar 29 18:13:57 Tower kernel: pcieport 0000:00:03.2: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
Mar 29 18:13:57 Tower kernel: pcieport 0000:00:03.2: device [1022:1453] error status/mask=00200000/04400000
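
To pull the relevant bits out of a longer syslog, something like the following Python sketch could help. It counts kernel messages per reporting device and collects the device addresses named in IOTLB_INV_TIMEOUT events. The regular expressions are written against the lines quoted above only, and the function name summarize_pcie_errors is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   Mar 29 18:13:56 Tower kernel: pcieport 0000:00:03.2: AER: Device recovery successful
KERNEL_RE = re.compile(
    r"kernel:\s+(?P<source>pcieport|iommu)\s+(?P<device>\S+):\s+(?P<message>.*)$"
)
TIMEOUT_RE = re.compile(r"IOTLB_INV_TIMEOUT device=(?P<bdf>[0-9a-fA-F:.]+)")


def summarize_pcie_errors(log_text):
    """Return (message count per (source, device), sorted timeout devices)."""
    per_device = Counter()
    timeout_devices = set()
    for line in log_text.splitlines():
        match = KERNEL_RE.search(line)
        if match is None:
            continue  # not a pcieport/iommu kernel line
        per_device[(match["source"], match["device"])] += 1
        timeout = TIMEOUT_RE.search(match["message"])
        if timeout:
            timeout_devices.add(timeout["bdf"])
    return per_device, sorted(timeout_devices)
```

On the excerpt above it reports 0e:00.0 as the device that timed out, which is worth checking against the device passed through to the VM (the log alone does not say which device that is).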

 

I have tried several of the recommended solutions, such as the boot parameters amd_iommu=on iommu=pt pci=noaer vfio_iommu_type1.allow_unsafe_interrupts=1 acs pcie_acs_override=downstream, as well as enabling Hyper-V and Q35. The VM manages to boot but shows a blank screen (i.e. no video output)...
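
To keep track of which of those parameters are actually active on a given boot, they can be read back from /proc/cmdline (a standard Linux file). A minimal Python sketch: the function names are made up, bare flags without a value are stored as True, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {parameter: value} dict."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # A token without "=" is a bare flag; record it as True.
        params[key] = value if sep else True
    return params


def read_active_params(path="/proc/cmdline"):
    """Parse the command line the running kernel was booted with."""
    with open(path) as handle:
        return parse_cmdline(handle.read())
```

For example, parse_cmdline("amd_iommu=on iommu=pt") returns a dict with amd_iommu set to "on" and iommu set to "pt", which makes it easy to confirm a change survived the reboot before testing the VM again.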

 

Any suggestions/recommendations I can try? Note: the whole thing was working correctly on 6.6.7 just a week ago, before all the trouble began!

 

 
