NVMe metadata I/O error after migrating to new HW



Hey, so I had some issues with my old Unraid HW.

TL;DR: I had issues with an HBA as well as the old HW in general, so I got new HW, and now I have problems with the cache drive.

 

I upgraded the hardware to the following:

  • CPU – Ryzen 5 3600
  • MB - MSI x470 Gaming Plus
  • RAM – Crucial Ballistix 32GB
  • HBA - LSI 9211-8i

+ my old NVMe from the previous system – Corsair Force MP510 1TB

This was also the cache-drive in my "old" Unraid server.

 

 

So long story short: I started the new server, it boots nicely and logs in (checked from the WebGUI).

However, there is a problem with the NVMe cache drive.

The motherboard BIOS sees the Force MP510 drive fine.

Unraid also sees it, but then it sort of "ejects" it? I dunno 🤷‍♂️..

Also, if I press the "Mount" button (see picture) nothing happens: it starts to mount (with the spinning circle) for about half a second before it turns back into the Mount button.
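
(Side note for anyone debugging the same thing: one way to see what actually happens when the Mount button is clicked, assuming console or SSH access, is to watch the syslog in a second session while pressing it:)

# follow the syslog live while clicking Mount in the WebGUI
tail -f /var/log/syslog

# check whether the NVMe device node is still present or has been renumbered
ls -l /dev/nvme*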

 

Attaching:

Pic from the server (taken just after the auto-download of the keyfile and log-in)

UnraidWebGui-ss

Logfile (without GO-file due to cleartext username and PW)

 

The cache drive (NVMe) worked fine before the HW upgrade.

I am using the M2_1 slot on the MB.

The board has another slot, M2_2. I have NOT yet tried switching the drive over; I'm asking here first.

Reason: using the M2_2 slot disables SATA1 and one of the PCIe slots, and I'm trying to avoid that.

 

 

Thanks for any help and input from you all.

 

 

//Magmanthe

NVME-error.png

nvme_unraid_gui.png

magnas-diagnostics-20210608-1613.zip


NVMe device dropped, this can sometimes help with that:

 

Some NVMe devices have issues with power states on Linux, so try this: on the main GUI page click on the flash device, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (on the top right) and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0


Reboot and see if it makes a difference; it's also a good idea to look for a board BIOS update.
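
(For reference, a rough sketch of what the default boot entry in the flash drive's syslinux.cfg should end up looking like after the edit — the label text may differ slightly on your install:)

label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot nvme_core.default_ps_max_latency_us=0

(After the reboot, assuming console or SSH access, the setting can be sanity-checked with:)

# the parameter should appear in the kernel command line
cat /proc/cmdline

# and the nvme_core module should report 0
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us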

3 hours ago, JorgeB said:

NVMe device dropped, this can sometimes help with that:

"Syslinux Configuration", make sure it's set to "menu view" (on the top right) and add this to your default boot option, after "append initrd=/bzroot"


Reboot and see if it makes a difference, also good idea to look for a board BIOS update.

 

Hi,

 

Thanks for the quick reply; however, I have to report no joy on this.

Added the line to the Syslinux config and rebooted, and the same thing happens when it boots back up.

New picture taken from the attached monitor.

Regarding the BIOS: I upgraded to the latest from MSI (released 31st of May, I believe) before booting into Unraid for the first time, so that is already covered.

 

However, something peculiar.

If I stop the array,

unmark/remove the cache from the pool dropdown list,

re-add the NVMe to the cache pool,

and start the array,

it does seem to work normally again.

All Dockers (that write to the cache) function (opening Krusader, I can browse the cache drive).

 

So I mean yeah, it works, if I take the time to stop/remove/add/start the array I guess... 🤔
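
(One way to double-check from the command line that the stop/remove/add/start dance really got the pool mounted again, assuming console or SSH access:)

# the cache pool should show up as mounted
df -h /mnt/cache

# and the NVMe should still be present under its expected device node
ls -l /dev/nvme*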

 

Any clues?


//Magmanthe

nvme_after_syslinux.png


It's rather strange; the cache initially mounts OK:

 

Jun  9 19:30:20 magnas kernel: XFS (nvme0n1p1): Ending clean mount

 

But just a few seconds later it fails:
 

Jun  9 19:30:26 magnas kernel: nvme nvme0: failed to set APST feature (-19)
Jun  9 19:30:26 magnas kernel: XFS (nvme0n1p1): metadata I/O error in "xfs_imap_to_bp+0x5c/0xa2 [xfs]" at daddr 0x5e340 len 32 error 5

 

Then it's detected again, but now as nvme1, so it's picked up by Unraid as a new device:

 

Jun  9 19:30:26 magnas kernel: nvme nvme1: pci function 0000:01:00.0
Jun  9 19:30:26 magnas kernel: nvme nvme1: missing or invalid SUBNQN field.
Jun  9 19:30:26 magnas kernel: nvme nvme1: Shutdown timeout set to 10 seconds
Jun  9 19:30:26 magnas kernel: nvme nvme1: 8/0/0 default/read/poll queues
Jun  9 19:30:26 magnas kernel: nvme1n2: p1

 

As far as Unraid is concerned, the cache is offline because it's looking for nvme0. I don't remember seeing an issue like this before; I would try a different model NVMe device if that's an option.
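
(Side note: the "failed to set APST feature (-19)" line refers to Autonomous Power State Transitions, which is what the nvme_core.default_ps_max_latency_us=0 parameter is meant to disable. Assuming the nvme-cli tool is available on the box — an assumption, it is not confirmed by the diagnostics — the controller and the APST feature can be inspected with something like:)

# list NVMe controllers/namespaces as the kernel currently sees them
nvme list

# dump the controller identify data, including its power state table
nvme id-ctrl /dev/nvme0 -H

# query the APST feature (feature id 0x0c)
nvme get-feature /dev/nvme0 -f 0x0c -H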

 

 

 

 


Happy Weekend :)

 

Tried with a new (well, used) NVMe.

Turned off the server, removed the 1TB Corsair Force MP510 (which did work on the "old" HW), installed a 250GB Samsung 960 EVO, and turned the server back on.

 

After boot I pulled a diagnostics zip.

The Samsung EVO was discovered as unassigned.

I stopped the array,

set the EVO as the cache drive and started the array.

I had to format the drive (as it was not XFS), and then it just kind of worked.

I then installed 3 random apps/Dockers (because that data lives on the cache drive) and then restarted the server.

 

Now Unraid boots totally normally and detects and treats the 250GB EVO as a normal cache drive. No ejection or anything.

After boot I pulled another diagnostics zip (for comparison, though that might not be needed).

Just based on my experience here, it does seem that some combination of the MB, Unraid and that specific Corsair series of NVMe drives just doesn't play nice together?

 

or?

 

//magmanthe

 

magnas-diagnostics-after_NVME_setCache.zip magnas-diagnostics-firstBoot after nvme-swap.zip


So just a quick update.

 

The plan was to get a new NVMe (probably a Samsung one).

So to prepare for this, I removed the small 250GB EVO from the M2_1 slot.

And then I installed the "old" Corsair Force MP510 in the M2_2 slot (so that the "main" slot would be free and clear for when the new NVMe arrived in the mail).

Put the server back in its room and started it up.

Yes, on the first boot it didn't understand much of what was happening (due to the removal of one drive and the addition of another), but after stopping the array, assigning the Corsair as the cache and then starting the array, I did a full server reboot.

 

Once the server came back up, I fully expected it to eject the drive (as shown in the diags earlier in the thread) and was prepared to stop the array, remove/re-add the drive and start it again. But to my surprise, it now just worked, totally fine and normal.

No ejection of the drive, no I/O error, no nothing.. I was very surprised.


So I tried some more reboots and it still holds strong. Very strange indeed.
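
(If anyone ever wants to dig into why the two slots behave differently, one thing that could be compared, assuming console or SSH access, is which PCIe port the controller ends up behind in each slot, e.g.:)

# show the PCIe topology as a tree, to see which bridge/port the NVMe controller hangs off
lspci -tv

# show the NVMe controller and which kernel driver is bound to it
# (01:00.0 was its address in the earlier syslog; it may enumerate differently in the M2_2 slot)
lspci -nnk -s 01:00.0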

 

So for now, I guess I don't have to get a new NVMe. (I just need to remember that the SATA1 port and PCI_E6 slot are now disabled because I'm using the M2_2 slot..)

 

Again thanks for the help, @JorgeB... Much appreciated :)

 

 

//Magmanthe

7 hours ago, JorgeB said:

That should remain enabled when an NVMe device is used in the M.2 slot, disabled if you use a SATA M.2 device.

 

Well, not according to the MSI manual..
Source: https://download.msi.com/archive/mnu_exe/mb/E7B79v3.0.pdf

 

On page 32/110, the M.2 page, it says:

  • SATA1 port will be unavailable when installing SATA M.2 SSD in M2_2 slot.
  • PCI_E6 slot will be unavailable when installing PCIe M.2 SSD in M2_2 slot.
  • M2_1 slot only supports PCIe mode.

//Magmanthe

msi_nvme_manual.png

5 minutes ago, JorgeB said:

? That's what the manual says: SATA1 will be disabled if a SATA M.2 device is used; you're using an NVMe (PCIe) device.

Ahh, yeah now I see what you mean..

I think I read that section kind of fast and wrongly. I thought BOTH got disabled when using the M2_2 slot...

Not (like you say) that either the SATA1 port is disabled IF it's a SATA M.2 drive, or PCI_E6 is disabled IF it's an NVMe M.2 drive... Got it.

Thanks for pointing that out..

 

 

Anywho, the whole thing is still kind of mysterious...

Like, WHY would Unraid eject the drive when it's in the M2_1 slot, yet it performs totally normally when the drive is in the M2_2 slot... Very strange indeed.

But it is working now, and as long as I just note it down in my Notes I will remember this the next time I poke around in there :)

 

Thanks again..

 

 

//Magmanthe

  • 2 months later...

Just helped someone diagnose something with very similar errors. This was caused by the NVMe controller being configured for passthrough in Tools | System Devices - even though this device had never been configured for passthrough.

 

This happened because a single unrelated PCIe device was removed; this changed the PCIe device assignments in a way that caused the wrong devices to be enabled for passthrough.
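
(A quick way to spot this situation from the command line, assuming console or SSH access: check whether the NVMe controller is bound to vfio-pci instead of the nvme driver. The vfio-pci.cfg path below is, I believe, where Unraid stores the "bind to VFIO at boot" selections, but treat that as an assumption.)

# the "Kernel driver in use:" line should say nvme, not vfio-pci
lspci -nnk | grep -i -A3 "non-volatile memory"

# list any devices selected for VFIO binding on the flash drive (path is an assumption)
cat /boot/config/vfio-pci.cfg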

 

 

