6.10.3 Chaos following new NVMe install



I'm about ready to pull my hair out with this one. 

 

A couple of days ago I shut down my server to install a new 2TB NVMe drive to replace the cheapie SSD I had been using as my cache drive up until now. I thought it would be a simple process, but when I booted the server back up, all hell broke loose. I got errors about btrfs filesystem issues on the cache drive, and 3 of my 5 data drives were showing up in the array and also under Unassigned Devices. 
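
In case it helps anyone in a similar spot, the btrfs error counters and a read-only consistency check can be pulled up from a terminal. This is just a sketch, assuming the cache pool is mounted at /mnt/cache and the old SSD's btrfs partition is /dev/sdb1 (substitute your own device):

# show btrfs read/write/corruption error counters for the cache pool (non-destructive)
btrfs device stats /mnt/cache

# read-only consistency check; only run this against an unmounted filesystem (array stopped)
btrfs check --readonly /dev/sdb1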

 

Some googling at this point suggested various things, including issues with the cables, so I did a couple of rounds of shutting down and powering back on, during which I reseated the SATA cables and also updated the motherboard's BIOS, since one thread I read suggested a potential problem with the controller. Eventually I wound up with no drives erroneously showing under Unassigned Devices, but Parity 2 and Disk 5 were both disabled in an error state. 

 

I did a rebuild on Disk 5, which seemed to complete without issue, and I've gone through the process of moving all of my cache files onto the array, reassigning the cache to the new NVMe, and moving the files back onto the cache. The old cache drive has been disconnected from SATA and power. 

 

Where I'm running into a further issue is with trying to reassign Parity 2. I've done the process of stop array, assign no device, start array, stop array, reassign the hard drive, and then start the array again, but both times I've done that the parity check pauses almost immediately, reporting read errors on Parity 2 (?), and I can neither cancel nor resume the parity check from the GUI. From that state, shutdown and reboot are my only working options. 

I am now booted into safe mode, rebuilding the data on Disk 5 for the second time. After that finishes sometime tomorrow, I'll of course want to get Parity 2 rebuilt as well. 

 

Other potentially pertinent details:

 

At one point the USB drive was not seen by the BIOS and the machine booted from the Windows Boot Manager on the other NVMe that is passed through to my Windows 10 VM. I plugged the USB into my desktop and checked it for errors, finding none, then plugged it back in and it has booted fine since. Is my flash drive dying?
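
The flash can also be given a read-only check from the server itself; this is only a sketch, assuming the Unraid flash shows up as /dev/sda with its FAT partition at /dev/sda1:

# dry-run check of the FAT filesystem on the boot flash; -n reports problems without writing anything
fsck.vfat -n /dev/sda1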

 

Checking the system log during the times when I had drives showing in both the array and Unassigned Devices showed errors to the tune of "disk with the ID <WDC SERIAL NUMBER> is not set to auto mount" and others saying that the disk labeled sde is now sdi (for example; I don't remember the actual device labels in question), with the device labels corresponding to the drives that were showing up in the array and UD simultaneously.
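
For anyone reading later, the serial-to-device-letter mapping can be confirmed from a terminal, and the relevant messages pulled out of the log; the WDC pattern and the sde/sdi labels below are just examples, match on your own serials:

# list persistent disk IDs and which sdX letter each one currently maps to
ls -l /dev/disk/by-id/ | grep -i wdc

# pull the renaming / auto-mount messages out of the Unraid syslog
grep -iE 'sde|sdi|auto mount' /var/log/syslog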

Additionally, I was unable to uninstall the Unassigned Devices plugin, getting either a blank screen or a 502 Bad Gateway from nginx where the uninstall progress should have been. 

The Parity 2 disk is my newest drive, having been installed only a few months ago. 

 

Other hardware installed includes an RTX 2060 and a USB hub card, both passed through to a VM, a Quadro P400 for transcoding, and a PCIe SATA card (which is just hosting an optical drive, none of my actual array or pool drives). 

 

Anyone have any advice or experience? 

 

Please let me know if there is any additional information that I should provide.

 

Edit - Should have mentioned, this system is running a Ryzen 9 3950X on an Asus TUF Gaming X570-Plus motherboard. Maybe a PCIe lane issue? 
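
If anyone wants to sanity-check how the slots and onboard controllers hang off the CPU and chipset on a similar board, the PCIe topology can be dumped with lspci; this is purely a read-only diagnostic:

# tree view of the PCIe topology, with device names, showing what sits behind each root port
lspci -tv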


I beg your pardon, this has me frazzled and I was moving too fast. I downloaded the diagnostics from the server and then selected the wrong file to upload. I've attached the diagnostics from earlier this morning to THIS post, and once the running data rebuild finishes in a couple of hours, I'll reboot out of safe mode and will gladly pull a fresh diagnostic and upload it if that will be helpful. 

ironhide-diagnostics-20220808-0908.zip


Alright, after the Disk 5 rebuild completed in safe mode, I shut down the server. I removed the PCIe x1 2-port SATA card that was installed to host an optical drive. I booted up, started the array, and ran a diagnostic. I then corrected the VFIO settings in System Devices that had been changed by the hardware change, rebooted, assigned the Parity 2 drive, and started the array (while holding my breath). Everything seems to be working as expected at this point, which makes me think I did in fact have a PCIe lane issue. For the sake of making sure we're good, and in case someone googling in the future runs across this thread, I've attached both diagnostics to this post for review. 
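
For completeness: since the VFIO bindings appear to be tied to PCI addresses (which can shift when a card is added or removed), a quick way to confirm which kernel driver each passed-through device actually ended up bound to is lspci; the addresses in the output will of course differ per system:

# list every PCI device with its numeric IDs and the kernel driver currently in use (look for vfio-pci)
lspci -nnk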

ironhide-diagnostics-20220808-1859.zip ironhide-diagnostics-20220808-1838.zip

