munchies2x Posted March 27, 2020

I am at my wits' end. This is the second time I have tried to upgrade and the whole thing went wrong. The first time, it killed a drive; I rolled back and encountered zero issues after that. After a while, a prompt said I should upgrade to mitigate a login vulnerability, so I upgraded to the latest version, 6.8.3 (I think; I can't access my server now).

1. First, it killed a functional 4TB drive that was almost 6 months old. One of my VMs hung, and when I rebooted, one of the drives showed an error and required me to rebuild the array. (Note: this also happened the last time I tried to upgrade, about 6 months ago.)
2. Went to the shop and spent $150 on another 4TB drive, moved it to a completely different SATA port, and used a brand new SATA cable just in case.
3. Spent 12 hours rebuilding the parity, and then I started my Windows 10 VM.
4. It was a patchy start for the VM, requiring me to reattach the keyboard and mouse multiple times before the VM detected them. Occasionally the screen would then turn off by itself, requiring me to shut the VM down and restart it. I did this twice, then the whole thing crashed.
5. Went to the GUI and saw that one of the drives had failed again! This is a brand new 4TB drive that I just bought!
6. The whole of UNRAID became unresponsive, forcing me to restart the server... and now UNRAID refuses to boot! It says "...waiting for /dev/disk/by-label/UNRAID (will check for 30 sec) at boot up..." and "cannot find device Bond0", and the GUI refuses to start (see attachment).

I just can't imagine UNRAID shipping such an unstable release, one that actually kills hardware without any possibility of fixing it. This experience has really left a bad impression; even if I rebuild the whole thing (which will take days), I don't know if I can trust the solution again.

Happy to listen to suggestions (and thanks in advance).

My system:
Motherboard: ASROCK Taichi
CPU: AMD Ryzen 5900
RAM: 32 GB
testdasi Posted March 27, 2020

First and foremost, whenever a drive fails (or reports issues), the first thing to do is get the diagnostics zip (Tools -> Diagnostics), which contains clues as to what happened. People regularly post disk issues on here, and experts such as johnnie.black respond relatively quickly. Based on your forum activity, you didn't ask for help originally, so keep that in mind for next time. Remember, you're not alone.

What version were you on before upgrading?

Did you stress test (e.g. run a preclear cycle) your drives before adding them to the array?

Your issue with booting looks to be a problem with the USB stick. Are you able to plug it into another (Windows) computer and run a disk check?

Is your stick USB 3.0 or 2.0? Are you using a USB 3.0 or 2.0 port on your motherboard for the stick?

Corrupt or failing USB sticks are known to cause inexplicable issues, e.g. hanging VMs. Your corrupt drive last time and this time could just be due to a forced reboot while data was being written, leading to corrupt data.
munchies2x Posted March 27, 2020 Author

Thanks for the reply... Yes, I should have run diagnostics, but it is too late now.

What version were you on before upgrading?
It was 6.6.7.

Did you stress test (e.g. run a preclear cycle) your drives before adding to the array?
Yep, all good, using the preclear plugin.

Your issue with booting looks to be a problem with the USB stick. Are you able to plug it into another (Windows) computer and run a disk check?
I can try. It boots, but then fails to load whatever is needed to bring up the UNRAID GUI.

Is your stick USB 3.0 or 2.0? Are you using USB 3.0 or 2.0 port on your motherboard for the stick?
USB 3.0. It worked perfectly before the upgrade, no issues whatsoever, and it worked for 1-2 days after the upgrade, so I am not sure what has changed since then, other than that it crapped itself. But there are too many coincidences: first a drive failure, then a USB drive failure 1-2 days after upgrading to 6.8.3. It all points to 6.8.3 somehow creating an unstable environment that led to all these issues.

NOTE: The login screen works. I can enter my root password and get in, but Mozilla says "Unable to connect" for the GUI (the address here is localhost).

One thing that is really painful here is that I not only have issues with the array (and VMs), I also have issues with the USB drive itself after the upgrade. The devs should build an automatic rollback feature into the GUI or the boot prompt, similar to what Windows has (albeit that one only occasionally works).

Edited March 27, 2020 by munchies2x
Frank1940 Posted March 27, 2020

24 minutes ago, munchies2x said:
One thing that is really painful here is that I not only have issues with the array (and VMs), I also have issues with the USB drive itself after the upgrade. The devs should build an automatic rollback feature into the GUI or the boot prompt, similar to what Windows has (albeit that one only occasionally works).

Already exists:

EDIT: You can do the same thing manually by copying the contents of the previous folder on the flash drive back to the root of that flash drive.

Edited March 27, 2020 by Frank1940
munchies2x Posted March 27, 2020 Author

I wish I could do that through the Web GUI, but it doesn't work currently (I can't connect to the GUI). Is there a way to do it from the command prompt?
Frank1940 Posted March 27, 2020

3 minutes ago, munchies2x said:
I wish I could do that through the Web GUI, but it doesn't work currently (I can't connect to the GUI). Is there a way to do it from the command prompt?

Refresh this page and reread my earlier post. (Sorry, I thought I might be able to edit my post before you read it...)
munchies2x Posted March 27, 2020 Author

Thanks Frank, where is the previous folder located? Is it in /usr?
Frank1940 Posted March 27, 2020

1 minute ago, munchies2x said:
Thanks Frank, where is the previous folder located? Is it in /usr?

If you did the upgrade from the GUI, it is in the previous folder/directory found on the flash drive. (As part of the upgrade process, that folder is created and the files needed for restoration are copied there before the new files are installed in their place.)
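The manual rollback described above can be sketched in a few lines of Python, for example on the Windows PC where the flash drive is plugged in. This is only a minimal sketch under assumptions: the function name `restore_previous`, the `.bak` backup step, and the choice to copy only top-level files are illustrative and not part of Unraid, and Python itself has to be available on whichever machine runs it.

```python
import shutil
from pathlib import Path


def restore_previous(flash_root, backup_current=True):
    """Copy files from <flash_root>/previous back to the flash root.

    Hypothetical helper for illustration only. Returns the sorted list
    of file names that were restored.
    """
    root = Path(flash_root)
    previous = root / "previous"
    if not previous.is_dir():
        raise FileNotFoundError(f"no 'previous' folder under {root}")

    restored = []
    for src in sorted(previous.iterdir()):
        if not src.is_file():
            continue  # this sketch only handles top-level files
        dst = root / src.name
        if backup_current and dst.exists():
            # Assumption: keep the file being replaced as <name>.bak
            # so the step can be undone by hand.
            shutil.copy2(dst, dst.with_name(dst.name + ".bak"))
        shutil.copy2(src, dst)
        restored.append(src.name)
    return restored
```

It would be called with the mount point of the flash drive, for example `restore_previous("E:/")` if Windows happened to assign the stick the drive letter E: (a hypothetical letter). Running a disk check on the stick first is still advisable.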
munchies2x Posted March 27, 2020 Author

Thanks, I will have to plug it into my Windows machine then. I'll do it tomorrow morning; it's too late to do this at this hour. I will post back on how it goes.
Frank1940 Posted March 27, 2020

2 minutes ago, munchies2x said:
I will have to plug it into my Windows machine then

When you do this, be sure you run chkdsk on the flash drive before copying any files! With problems like the ones you first described, this is always a good idea. (Plus, USB3 and Unraid have a checkered history. Some motherboards and BIOSes seem to have problems while others work fine...)
munchies2x Posted March 28, 2020 Author

I managed to run chkdsk and fix the error -> tried to reboot my server again; it didn't work.

I then copied the previous folder to the main/root folder -> it still doesn't work; localhost still refuses to connect and display the GUI.

But I managed to run diagnostics (now attached). Any help would be appreciated!

Note: I am unsure why the date is showing Feb 15. All I can tell is that this is the latest diagnostics file I generated.

tower-diagnostics-20200215-2249.zip
Frank1940 Posted March 28, 2020

9 hours ago, munchies2x said:
But I managed to run diagnostics (now attached). Any help would be appreciated!

First question: how did you create the Diagnostics file when you did the above? (GUI, SSH, or console)

9 hours ago, munchies2x said:
Note: I am unsure why the date is showing Feb 15. All I can tell is that this is the latest diagnostics file I generated.

Where did you find the Diagnostics file that you attached to this post? (This is a bit of a loaded question, as there are only two possible answers: either in the Downloads folder/directory of your PC's web browser, or in the logs folder/directory of the Unraid boot drive.)
munchies2x Posted March 29, 2020 Author

Hi Frank,

1. Console. I typed diagnostics and it output the file.
2. It is in the logs folder. I found it when I used a Windows PC to browse the USB stick.

Cheers.
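The Feb 15 question comes down to the timestamp embedded in the archive name. The sketch below assumes the naming pattern `<host>-diagnostics-YYYYMMDD-HHMM.zip`, which is inferred only from the single file name posted in this thread; the function names are hypothetical. It reads the timestamp out of each name and picks the newest archive, which is one way to check whether the logs folder holds a more recent file than the one attached.

```python
import re
from datetime import datetime

# Assumed naming pattern, inferred from "tower-diagnostics-20200215-2249.zip".
NAME_RE = re.compile(r"^(?P<host>.+)-diagnostics-(?P<stamp>\d{8}-\d{4})\.zip$")


def diagnostics_timestamp(filename):
    """Return the datetime embedded in a diagnostics file name, or None."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return datetime.strptime(match.group("stamp"), "%Y%m%d-%H%M")


def newest_diagnostics(filenames):
    """Return the file name carrying the latest embedded timestamp, or None."""
    stamped = [(diagnostics_timestamp(name), name) for name in filenames]
    stamped = [(ts, name) for ts, name in stamped if ts is not None]
    if not stamped:
        return None
    return max(stamped)[1]
```

If the newest name in the logs folder still carries a February date even though the file was generated in late March, the server clock at the time is a possible suspect; the thread does not confirm either explanation.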
munchies2x Posted March 29, 2020 Author

6pm update:

1. Updated my motherboard BIOS to the latest version.
2. Magically, this allowed me to boot into the GUI again. Not because there was an issue with the firmware, but because the BIOS update reset my settings and disabled all the virtualization options; somehow that let me get back into the GUI.
3. Cleared my config file etc. to tidy things up, got the boot stable again, and turned all the virtualization options back on (AMD-V).
4. But now I have a new problem: I can start my Windows 10 VM using VNC, but I can't get it to work with GPU passthrough. It was working 100% before the system crashed on me a few days ago.
5. The VM starts, then pauses and generates an error message.

Error message as below:

Mar 29 18:13:56 Tower kernel: pcieport 0000:00:03.2: broadcast error_detected message
Mar 29 18:13:56 Tower kernel: pcieport 0000:00:03.2: broadcast mmio_enabled message
Mar 29 18:13:56 Tower kernel: pcieport 0000:00:03.2: broadcast resume message
Mar 29 18:13:56 Tower kernel: pcieport 0000:00:03.2: AER: Device recovery successful
Mar 29 18:13:57 Tower kernel: iommu ivhd0: AMD-Vi: Event logged [
Mar 29 18:13:57 Tower kernel: iommu ivhd0: IOTLB_INV_TIMEOUT device=0e:00.0 address=0x000000081e44b450]
Mar 29 18:13:57 Tower kernel: pcieport 0000:00:03.2: AER: Uncorrected (Non-Fatal) error received: 0000:00:00.0
Mar 29 18:13:57 Tower kernel: pcieport 0000:00:03.2: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
Mar 29 18:13:57 Tower kernel: pcieport 0000:00:03.2: device [1022:1453] error status/mask=00200000/04400000

I have tried several of the recommended solutions, like amd_iommu=on iommu=pt pci=noaer vfio_iommu_type1.allow_unsafe_interrupts=1 acs pcie_acs_override=downstream, as well as enabling Hyper-V and Q35. It manages to boot, but shows a blank screen (i.e. no video output).

Any suggestions or recommendations that I can try out?

Note: The whole thing was working correctly on 6.6.7 just a week ago, before all the trouble began!
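When a syslog is long, it can help to pull out just the PCIe port and device addresses named in messages like the ones quoted above. The following is a minimal sketch, assuming the lines keep exactly the shape shown in this post; the regular expressions and the `summarize` function are illustrative, not an Unraid or kernel tool.

```python
import re

# Patterns written against the log lines quoted in this thread only.
AER_RE = re.compile(
    r"pcieport (?P<port>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<type>[^,]+)"
)
IOTLB_RE = re.compile(
    r"IOTLB_INV_TIMEOUT device=(?P<device>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)


def summarize(log_text):
    """Collect AER bus errors and IOTLB invalidation timeouts from a syslog excerpt."""
    summary = {"aer_errors": [], "iotlb_timeouts": []}
    for line in log_text.splitlines():
        match = AER_RE.search(line)
        if match:
            summary["aer_errors"].append(
                (match["port"], match["severity"], match["type"])
            )
        match = IOTLB_RE.search(line)
        if match:
            summary["iotlb_timeouts"].append(match["device"])
    return summary
```

For the excerpt above, this would report port 0000:00:03.2 and device 0e:00.0. The thread does not say which device sits at 0e:00.0; `lspci -s 0e:00.0` on the server would show it.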
munchies2x Posted March 31, 2020 Author

Any comments from the Developers? Any troubleshooting methodology that the Developers can recommend?
trurl Posted March 31, 2020

On 3/29/2020 at 3:57 AM, munchies2x said:
AMD

Have you seen this? https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173
munchies2x Posted April 1, 2020 Author

@Constructor thanks, I had a look and reduced my memory bandwidth as per the article, with no impact. I will try updating my BIOS; it seems AMD has issued a new AGESA patch, and my issue seems to relate to the PCI driver (which should, in theory, relate to the BIOS)...