bobo89 Posted June 1, 2020
I updated an Ubuntu Linux VM, and upon rebooting the VM, all VMs disappeared. I rebooted the server, and one cache drive of the two-drive RAID1 pool shows the mentioned error. What should my next steps be?
tower-diagnostics-20200531-2328.zip
JorgeB Posted June 1, 2020
Cache filesystem is corrupt, see here for some recovery options. Also see here, since you're running the RAM above the officially supported speed, and that's known to cause stability issues with Ryzen, including data corruption.
bobo89 Posted June 1, 2020
4 hours ago, johnnie.black said: Cache filesystem is corrupt, see here for some recovery options. Also see here since you're running the RAM above the officially supported speed, and that's known to cause stability issues with Ryzen, including data corruption.
Thank you for the reply. I disabled the array, managed to mount my device with the mount command, and all my data appeared to be there. I then started the array, and it started normally with all my dockers and VMs intact and the cache pool intact. Regarding OC RAM, I was going to say it passed an aggressive memtest pass, but that may not be enough. I've been running the server straight for 7 months and this was the only incident of its kind, but I will try to downclock. Luckily I had backups anyway in case this didn't pan out!
JorgeB Posted June 1, 2020
23 minutes ago, bobo89 said: I was going to say it passed an aggressive memtest pass, but that may not be enough.
It isn't; there are various reports of sync errors during a parity check when running out-of-spec RAM with Ryzen, despite no memtest errors.
bobo89 Posted September 21, 2020
This is an ongoing issue that still occurs every 3-4 months, even with the RAM running at stock speeds. Coincidentally, it seems to happen when I perform an OS upgrade on an Ubuntu VM. I usually just reboot 3-4 times and the error seems to resolve itself, but this time that isn't working. Can someone please help diagnose why this occurs on a regular basis?
tower-diagnostics-20200921-0556.zip
JorgeB Posted September 21, 2020
Max officially supported RAM speed for your config is 2666 MHz; you're running it at 3000 MHz.
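For anyone wanting to check this on their own box, here is a rough Python sketch of the comparison being made above. It assumes you have saved the output of `dmidecode -t memory` to a string, and that the output contains a "Configured Memory Speed" field (newer dmidecode) or "Configured Clock Speed" field (older dmidecode). The `SAMPLE` text and the function names are made up for illustration and are not taken from the diagnostics in this thread; the 2666 default is the limit quoted above for this particular config, so adjust it for yours.

```python
import re

# Matches "Configured Memory Speed: 3000 MT/s" (newer dmidecode) or
# "Configured Clock Speed: 3000 MHz" (older dmidecode). Empty slots
# report "Unknown" and are skipped because they contain no digits.
CONFIGURED = re.compile(
    r"Configured (?:Memory|Clock) Speed:\s*(\d+)\s*(?:MT/s|MHz)"
)


def configured_speeds(dmidecode_text):
    """Return the configured speed of every populated DIMM, as integers."""
    return [int(value) for value in CONFIGURED.findall(dmidecode_text)]


def over_spec(dmidecode_text, supported=2666):
    """Return the configured speeds that exceed the supported maximum."""
    return [s for s in configured_speeds(dmidecode_text) if s > supported]


# Hypothetical excerpt, shaped like `dmidecode -t memory` output.
SAMPLE = """\
Memory Device
    Speed: 3000 MT/s
    Configured Memory Speed: 3000 MT/s
Memory Device
    Speed: Unknown
    Configured Memory Speed: Unknown
"""

if __name__ == "__main__":
    offenders = over_spec(SAMPLE)
    if offenders:
        print("DIMMs running above the supported speed:", offenders)
    else:
        print("All DIMMs are at or below the supported speed.")
```

This only reads the reported speed; it says nothing about whether the memory is actually stable at that speed.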
bobo89 Posted September 21, 2020
6 hours ago, JorgeB said: Max officially supported RAM speed for your config is 2666Mhz, you're running it at 3000Mhz.
You are correct. I've rebuilt the server (yay for backups) with hopefully the right speed and the additional Ryzen parameters (global C-state and the power setting). I have restored all the dockers using the backup plugin; however, I had to recreate my docker.img file. I am going through and manually re-adding all the containers from the templates in the GUI. Is there a way to re-add everything at once instead of doing it manually?
JorgeB Posted September 21, 2020
If you have CA installed you can restore them all at once: previous apps.
bobo89 Posted September 21, 2020 (edited)
6 hours ago, JorgeB said: If you have CA installed you can restore them all at once: previous apps.
Something very strange is happening. It seems to be the same Ubuntu Linux VM that causes the cache FS error whenever I interact with it. Just now, with the VM active, I got a notification that the cache pool had a missing BTRFS device. When I got to the Unraid GUI, it had come back... Diagnostics should have captured it.
Edit: It didn't. Uploaded the raw syslog. The BTRFS cache drive disappeared around 19:42.
The server is still running, but extremely laggy and slow. The GUI takes ages to load. Other VMs are barely responsive as well.
Global C-state Control: disabled
Power Supply Idle Control: set to "Typical Current Idle"
Is it possible for an NVMe drive to be dying without anything showing up in SMART?
tower-diagnostics-20200921-1947.zip syslog1.zip
Edited September 22, 2020 by bobo89: Syslog was corrupted in diagnostics.
JorgeB Posted September 22, 2020
There appears to be an issue with IOMMU that's causing issues with one of the NVMe devices:
Sep 21 19:38:00 Tower kernel: AMD-Vi: Completion-Wait loop timed out
Sep 21 19:38:00 Tower kernel: AMD-Vi: Completion-Wait loop timed out
Sep 21 19:38:00 Tower kernel: AMD-Vi: Completion-Wait loop timed out
Sep 21 19:38:00 Tower kernel: AMD-Vi: Completion-Wait loop timed out
Sep 21 19:38:00 Tower kernel: AMD-Vi: Completion-Wait loop timed out
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecd000 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecd200 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecd400 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecd600 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecd700 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecd900 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecdb00 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecdd00 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecde00 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecdf00 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eece000 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eece200 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eece400 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eece600 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eece800 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eecea00 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eecec00 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eecee00 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eecf000 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eecf200 flags=0x0070]
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610699776 csum 0x496c4ed7 expected csum 0x276067eb mirror 2
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610740736 csum 0x6a049d0f expected csum 0xf3ce5b4e mirror 2
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610744832 csum 0x553ea8da expected csum 0xf96f8211 mirror 2
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610703872 csum 0x5eb2f80a expected csum 0xd999aa21 mirror 2
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610748928 csum 0x37e64c12 expected csum 0x91eb05dc mirror 2
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610707968 csum 0x3d09f103 expected csum 0x1b86b19c mirror 2
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610753024 csum 0xde65ec48 expected csum 0xf82ad979 mirror 2
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610712064 csum 0xbd0c7e5e expected csum 0x8947912c mirror 2
I've seen similar issues before with Ryzen boards, but they usually affect the onboard SATA controller. In those cases updating to the latest Unraid beta usually helps, due to the newer kernel. Also look for a BIOS update.
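If you want to spot this pattern in your own syslog without scrolling through it, something along these lines could work. It is only a sketch: the patterns are modelled on the log lines quoted in this post, other kernel versions may word these messages differently, and the function and variable names are just illustrative.

```python
import re
from collections import Counter

# Modelled on lines such as:
#   BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off ...
CSUM = re.compile(
    r"BTRFS warning \(device (?P<dev>[^)]+)\): csum failed "
    r"root (?P<root>\d+) ino (?P<ino>\d+)"
)


def summarize(lines):
    """Count IOMMU timeouts, IO page faults and BTRFS checksum failures.

    Returns (counts, inodes): counts is keyed by event type, inodes is
    keyed by (device, inode number) for the checksum failures only.
    """
    counts = Counter()
    inodes = Counter()
    for line in lines:
        if "Completion-Wait loop timed out" in line:
            counts["iommu_timeout"] += 1
        elif "IO_PAGE_FAULT" in line:
            counts["io_page_fault"] += 1
        else:
            match = CSUM.search(line)
            if match:
                counts["btrfs_csum"] += 1
                inodes[(match["dev"], int(match["ino"]))] += 1
    return counts, inodes


# A few lines copied from the excerpt above.
SAMPLE = [
    "Sep 21 19:38:00 Tower kernel: AMD-Vi: Completion-Wait loop timed out",
    "Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged "
    "[IO_PAGE_FAULT domain=0x0000 address=0x000000000eecd000 flags=0x0070]",
    "Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged "
    "[IO_PAGE_FAULT device=01:00.0 domain=0x0000 "
    "address=0x000000000eece000 flags=0x0070]",
    "Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): "
    "csum failed root 5 ino 340162 off 15610699776 csum 0x496c4ed7 "
    "expected csum 0x276067eb mirror 2",
]

if __name__ == "__main__":
    counts, inodes = summarize(SAMPLE)
    print(dict(counts))
    print(dict(inodes))
```

To run it against a real log, pass an open file instead of `SAMPLE`, e.g. `summarize(open("syslog.txt"))` (the file name is a placeholder). Checksum failures piling up on a single inode right after a burst of page faults, as in the excerpt, point at one file being hit during the IOMMU event rather than at scattered corruption.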
bobo89 Posted September 22, 2020 (edited)
11 hours ago, JorgeB said: There appears to be an issue with IOMMU that's causing issues with one of the NVMe devices [...] I've seen similar issues before with Ryzen boards but it usually affects the onboard SATA controller, in those cases updating to the latest Unraid beta usually helps, due to newer kernel, also look for a BIOS update.
I will have to wait until 6.9 is out of beta to test, as I need stability. That being said, there is something about that main VM: whenever I open Remmina on it (remote desktop client), the GUI of the VM freezes (it is still accessible over SSH), and shortly thereafter the server slows to a crawl too. I just had to do a poweroff from the command line, and upon boot, the USB is corrupted. Is there anything else in the logs indicating a massive systemic problem?
Edited September 22, 2020 by bobo89
bobo89 Posted September 28, 2020
That already is a massive problem.
Looks like it was the GPU crashing due to some issue with GNOME.
Regarding the OC RAM on Ryzen, will newer kernels (on 6.9) handle that better, or is this another issue?
JonathanM Posted September 28, 2020
23 minutes ago, bobo89 said: Regarding the OC ram on ryzen, will newer kernels better handle that (on 6.9) or is this another issue?
It's a chipset / CPU / motherboard issue. Mainly CPU.
bobo89 Posted September 29, 2020
It's a chipset / CPU / motherboard issue. Mainly CPU.
So if I value stability first with Unraid, there's no value in going over 2666 MHz RAM in my config?
JorgeB Posted September 30, 2020
Yes, overclocking and servers are not a good combination.