Cache drive - unmountable no fs


4 hours ago, johnnie.black said:

Cache filesystem is corrupt, see here for some recovery options.

 

Also see here since you're running the RAM above the officially supported speed, and that's known to cause stability issues with Ryzen, including data corruption.

Thank you for the reply. I disabled the array, managed to mount my device with the mount command, and all my data appeared to be there.

 

I then started the array, and it started normally with all my dockers and VMs intact, and with the cache pool intact.

 

Regarding OC RAM, I was going to say it passed an aggressive memtest pass, but that may not be enough. I've been running the server straight for 7 months and this was the only incident of its kind, but I will try to downclock.

 

Luckily I had backups anyways if this didn't pan out!

  • 3 months later...

This is an ongoing issue that still occurs every 3-4 months, even with the RAM running at stock speeds. It seems to coincide with performing an OS upgrade on an Ubuntu VM.

 

I usually just reboot 3-4 times and the error seems to resolve itself; however, this time that isn't working.

 

Can someone please help diagnose why this occurs on a regular basis?

tower-diagnostics-20200921-0556.zip

6 hours ago, JorgeB said:

Max officially supported RAM speed for your config is 2666 MHz, you're running it at 3000 MHz.

You are correct. I've rebuilt the server (yay for backups) with hopefully the right speed, and the additional Ryzen parameters (global C-state control and the power setting).

 

I have restored all the dockers using the backup plugin, however I had to recreate my docker.img file. I am going through and manually re-adding all the containers from the templates in the GUI. Is there a way to re-add everything at once instead of the manual method?



6 hours ago, JorgeB said:

If you have CA installed you can restore them all at once: previous apps.

Something very strange is happening. It seems to be the same Ubuntu Linux VM that, when I interact with it, causes the cache FS error.

 

Just now, with the VM active, I had a notification show up saying the BTRFS cache pool was missing a device. By the time I got to the Unraid GUI, it had come back...

 

Diagnostics should have captured it.

Edit: It didn't. Uploaded the raw syslog. The BTRFS cache drive disappeared around 19:42.
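For anyone who wants to look at the same window in the raw syslog, here is a minimal Python sketch that keeps only the lines stamped within a few minutes of the dropout. It assumes the usual "Mon DD HH:MM:SS" prefix seen in the log lines in this thread; the year (2020, taken from the diagnostics file name), the function names, and the 5-minute window are my own choices for illustration, not anything Unraid provides.

```python
from datetime import datetime, timedelta

# Assumption: each syslog line starts with a 15-character "Mon DD HH:MM:SS"
# stamp and carries no year, so the caller supplies one (2020 here, taken
# from the diagnostics file name).
STAMP_LEN = 15


def parse_stamp(line, year=2020):
    """Return the line's timestamp as a datetime, or None if it has none."""
    try:
        return datetime.strptime(f"{year} {line[:STAMP_LEN]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def lines_near(lines, center, minutes=5, year=2020):
    """Keep only the lines stamped within +/- `minutes` of `center`."""
    window = timedelta(minutes=minutes)
    kept = []
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is not None and abs(stamp - center) <= window:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # One line copied from the syslog; a real run would read the whole file.
    sample = ["Sep 21 19:38:00 Tower kernel: AMD-Vi: Completion-Wait loop timed out"]
    around_dropout = datetime(2020, 9, 21, 19, 42)
    for line in lines_near(sample, around_dropout):
        print(line)
```

In practice the list of lines would come from reading the extracted syslog file, and the window can be widened if the first errors turn out to start earlier than the notification.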

 

The server is still running, but it is extremely laggy and slow. The GUI takes ages to load. Other VMs are barely responsive as well.

 

Global c-state control - disabled

Power supply idle control is set to "typical current idle"

 

Is it possible for an NVMe drive to be dying without anything showing up in SMART?

 

 

 

tower-diagnostics-20200921-1947.zip

 

syslog1.zip

Edited by bobo89
Syslog was corrupted in diagnostics.

There appears to be an issue with IOMMU that's causing problems with one of the NVMe devices:

 

Sep 21 19:38:00 Tower kernel: AMD-Vi: Completion-Wait loop timed out
Sep 21 19:38:00 Tower kernel: AMD-Vi: Completion-Wait loop timed out
Sep 21 19:38:00 Tower kernel: AMD-Vi: Completion-Wait loop timed out
Sep 21 19:38:00 Tower kernel: AMD-Vi: Completion-Wait loop timed out
Sep 21 19:38:00 Tower kernel: AMD-Vi: Completion-Wait loop timed out
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecd000 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecd200 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecd400 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecd600 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecd700 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecd900 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecdb00 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecdd00 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecde00 flags=0x0070]
Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000000eecdf00 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eece000 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eece200 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eece400 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eece600 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eece800 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eecea00 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eecec00 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eecee00 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eecf000 flags=0x0070]
Sep 21 19:38:00 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x000000000eecf200 flags=0x0070]
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610699776 csum 0x496c4ed7 expected csum 0x276067eb mirror 2
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610740736 csum 0x6a049d0f expected csum 0xf3ce5b4e mirror 2
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610744832 csum 0x553ea8da expected csum 0xf96f8211 mirror 2
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610703872 csum 0x5eb2f80a expected csum 0xd999aa21 mirror 2
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610748928 csum 0x37e64c12 expected csum 0x91eb05dc mirror 2
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610707968 csum 0x3d09f103 expected csum 0x1b86b19c mirror 2
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610753024 csum 0xde65ec48 expected csum 0xf82ad979 mirror 2
Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 340162 off 15610712064 csum 0xbd0c7e5e expected csum 0x8947912c mirror 2
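A rough way to see how the three kinds of messages above relate is to tally them. The Python sketch below counts the AMD-Vi timeouts, collects the IO_PAGE_FAULT addresses, and groups the BTRFS checksum failures by device and inode. The regular expressions are written against the exact line shapes pasted above and are an assumption for any other kernel version; the function and key names are made up for illustration.

```python
import re
from collections import Counter

# Patterns written against the log lines quoted in this thread (assumption:
# other kernels may word these messages differently).
PAGE_FAULT = re.compile(r"IO_PAGE_FAULT.*?address=(0x[0-9a-f]+)")
CSUM_FAIL = re.compile(
    r"BTRFS warning \(device ([^)]+)\): csum failed root \d+ ino (\d+) off (\d+)"
)


def summarize(lines):
    """Count each message type and collect fault addresses and csum offsets."""
    counts = Counter()
    fault_addresses = []
    csum_offsets = {}  # (device, inode) -> list of byte offsets in that file
    for line in lines:
        if "Completion-Wait loop timed out" in line:
            counts["iommu_timeout"] += 1
        fault = PAGE_FAULT.search(line)
        if fault:
            counts["io_page_fault"] += 1
            fault_addresses.append(int(fault.group(1), 16))
        csum = CSUM_FAIL.search(line)
        if csum:
            counts["btrfs_csum_failed"] += 1
            key = (csum.group(1), int(csum.group(2)))
            csum_offsets.setdefault(key, []).append(int(csum.group(3)))
    return counts, fault_addresses, csum_offsets


if __name__ == "__main__":
    # Three lines copied from the syslog excerpt, one of each kind.
    sample = [
        "Sep 21 19:38:00 Tower kernel: AMD-Vi: Completion-Wait loop timed out",
        "Sep 21 19:38:00 Tower kernel: nvme 0000:01:00.0: Event logged "
        "[IO_PAGE_FAULT domain=0x0000 address=0x000000000eecd000 flags=0x0070]",
        "Sep 21 19:38:00 Tower kernel: BTRFS warning (device nvme0n1p1): csum failed "
        "root 5 ino 340162 off 15610699776 csum 0x496c4ed7 expected csum 0x276067eb mirror 2",
    ]
    counts, addresses, offsets = summarize(sample)
    print(dict(counts))
    print([hex(a) for a in addresses])
    print(offsets)
```

In the excerpt above, every checksum failure points at the same inode (340162) on nvme0n1p1, which is the kind of pattern this grouping makes easy to spot.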

 

I've seen similar issues before with Ryzen boards, but they usually affect the onboard SATA controller. In those cases updating to the latest Unraid beta usually helps, due to the newer kernel. Also look for a BIOS update.

11 hours ago, JorgeB said:

There appears to be an issue with IOMMU that's causing problems with one of the NVMe devices:

 


I've seen similar issues before with Ryzen boards, but they usually affect the onboard SATA controller. In those cases updating to the latest Unraid beta usually helps, due to the newer kernel. Also look for a BIOS update.

I will have to wait till 6.9 is out of beta to test as I need stability.

 

That being said, there is something about that main VM: whenever I open Remmina on it (remote desktop client), the GUI of the VM freezes (it is still accessible over SSH), and shortly thereafter the server slows to a crawl too. I just had to do a poweroff from the command line, and upon boot, the USB is corrupted.

 

Is there anything else in the logs indicating a massive systemic problem?
