Parity disk won't enable after parity check


Solved by JorgeB


Unraid 6.11.5

 

I have something weird happening with my array. I will try to describe the series of events as best I can.

TL;DR: I tried to bring my disk1 back online. It and the parity disk took turns being offline. After 3 or 4 parity syncs and disk rebuilds, the parity disk will not come online after a parity sync. Unraid has multiple notices saying disk1 and disk2 can't be written to, disk1 has read errors, and the parity disk is disabled.

 

Before I began my upgrade I had 7 disks total, with 1 being parity, plus 1 NVMe cache drive.

 

I had planned to install a 2nd cache drive and assign it to the same pool.

I installed the 2nd cache drive. On bootup Unraid asked me to assign it to the pool as BTRFS. I clicked yes, and that brought my 1st cache drive offline because it was formatted XFS. At the same time my disk1 showed it was offline. I restarted Unraid multiple times and tried different cables and SATA ports. I thought the disk being offline meant that it was not recognized by the OS/BIOS; I learned that was not the case and that the disk had to be rebuilt from the parity data.

I stopped the array, removed the disk, started the array, stopped the array, and added the disk back. The rebuild began. After the rebuild Unraid said all disks plus parity were online. I removed the 2nd cache slot, set the cache pool back to XFS, and assigned my cache drive back to the pool. I moved the data off the cache drive by setting all of my "prefer" cache shares to "yes" and invoking mover. All the data moved successfully. As a precaution I also copied the appdata folder contents to my main PC via SMB.

 

I noticed that my VM list and Docker list were empty, so I restarted Unraid. I should mention that at this point, and during the earlier restarts, Unraid was not able to shut down properly. It would either hang on trying to stop the array and do absolutely nothing for 30 minutes or more, or it would spit out IO errors on the local console. I didn't think anything of it. Three times now on bootup my computer would not recognize any bootable devices, including the Unraid USB, except for the 2nd cache drive, which had a Windows install. After a restart Unraid would boot.

 

After actually booting into Unraid: if I had rebuilt disk1, the parity disk would be offline, and I would take the array down and back up to get it to parity sync. If I had done a parity sync before a restart, disk1 would be offline, and I would take the array offline and back online to rebuild it. Now, after each restart and parity sync, the parity disk remains offline, even before the restart and despite the "successful" parity sync. Also, as of now there is no longer a parity sync button; it has been replaced with a read-check button.

 

I'm sure some info has been left out and certain things aren't very clear. Please ask me questions and point me in the right direction to recover my array. My only hypothesis is that I somehow swapped my parity and disk1 positions.

 

I have attached some diagnostic dumps, and here are screenshots of my current webUI. I also have a flash backup of Unraid from 02-15-2023; I believe that backup is from version 6.9.5.

 

Thanks in advance for all that y'all do here.

[screenshots of the current webUI attached]

 

waffle-diagnostics-20230303-1230.zip
waffle-diagnostics-20230303-1932 after parity sync.zip
waffle-diagnostics-20230303-2004 after restart to normal OS mode.zip

Link to comment
  • Solution

This SATA controller is being passed through to the VM, so when the VM starts, Unraid will lose all disks connected to it. Just edit the VM and correct that.

 

08:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
    Subsystem: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901]
    Kernel driver in use: ahci
    Kernel modules: ahci
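For anyone hitting the same symptom, one way to confirm which VM has grabbed the controller is to dump the VM's XML and look for a <hostdev> entry whose PCI address matches the controller above (bus 0x08, slot 0x00). A rough sketch from the Unraid console; "Windows 10" is a placeholder for the actual VM name from virsh list --all:

    # List defined VMs, then search a VM's XML for passed-through PCI devices.
    # A match for the controller above would show bus='0x08' slot='0x00'.
    virsh list --all
    virsh dumpxml "Windows 10" | grep -A 4 '<hostdev'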

Link to comment
1 hour ago, JorgeB said:

This SATA controller is being passed through to the VM, so when the VM starts, Unraid will lose all disks connected to it. Just edit the VM and correct that.

 

08:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
    Subsystem: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901]
    Kernel driver in use: ahci
    Kernel modules: ahci

My VM list is currently empty; I should have a Windows 10 VM there. Any suggestions on how to accomplish what you recommended above?

I feel like I've done enough potential damage. I'd like to play it slow and safe and see what the community suggests first.

I assume I could fix this problem by disabling the VM manager?

Link to comment
10 minutes ago, JorgeB said:

That's strange, there is a Windows 10 VM in the diags. But yeah, disable the VM manager; you will then have the option to delete libvirt.img. Delete it and you can then re-enable the VM manager.
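For reference, with the VM service stopped the image can also be removed from the console. A sketch assuming the default location; check Settings > VM Manager for the actual path on your system:

    # Remove the libvirt image; Unraid recreates a fresh one when the
    # VM manager is re-enabled. Path shown is the usual default.
    rm /mnt/user/system/libvirt/libvirt.img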

I have tried to disable the VM manager. It still says "running" in the top right, but the VMs tab is gone and enable VMs is set to 'no'. Unraid hung on the loading icon for a few minutes; I refreshed the page to regain control of the webUI.

How should I go about removing the libvirt.img file? I assume I would see a button next to the path location on the settings page.

 

This is the most recent line in the log:
 


Mar  4 05:18:46 waffle  ool www[2694]: /usr/local/emhttp/plugins/dynamix/scripts/emcmd 'cmdStatus=Apply'

[screenshot attached]

Link to comment
6 hours ago, JorgeB said:

The service must be stopped for the delete option to appear. Edit /boot/config/domain.cfg and change SERVICE="enable" to SERVICE="disable", then reboot.
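The same edit can be made from the console if the webUI hangs; a sketch of the exact change:

    # Flip the VM service flag on the flash drive, then reboot.
    sed -i 's/SERVICE="enable"/SERVICE="disable"/' /boot/config/domain.cfg
    reboot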

Looks like the file was properly modified by the webUI, but Unraid failed to properly shut down the VM service. After a restart I can now modify the libvirt storage location path. Do I need the array running to see the option to delete libvirt.img?

 

If the SATA controller passthrough was the culprit, then with the VM manager disabled I should in theory be able to start the array and repair it without issue, correct? Even without removing libvirt.img?

 

There is also one more thing I want to point out. Currently, if I look at the contents of disk1, I do not see my usual SMB shares; instead I see a Linux file system. If I navigate to /mnt/ I can see disk2, disk3, disk4, etc., but no disk1. I just want to put this info out there before any more rebuilds or parity checks are performed.

[screenshot of disk1 contents attached]

Link to comment
16 hours ago, offroadguy56 said:

Do I need the array running to see the option to delete libvirt.img?

Yes.

 

16 hours ago, offroadguy56 said:

If the SATA controller passthrough was the culprit, then with the VM manager disabled I should in theory be able to start the array and repair it without issue, correct? Even without removing libvirt.img?

Yep, but since there's possibly a "phantom" VM there, it's probably best to still delete it, before or after.

 

16 hours ago, offroadguy56 said:

There is also one more thing I want to point out. Currently, if I look at the contents of disk1, I do not see my usual SMB shares; instead I see a Linux file system. If I navigate to /mnt/ I can see disk2, disk3, disk4, etc., but no disk1. I just want to put this info out there before any more rebuilds or parity checks are performed.

Post new diags after the VM issue is resolved.

Link to comment
5 hours ago, JorgeB said:

Yes.

 

Yep, but since there's possibly a "phantom" VM there, it's probably best to still delete it, before or after.

 

Post new diags after the VM issue is resolved.

Array started, libvirt deleted (I believe). Here are the most recent diagnostics.

 

Here is the most recent screenshot of the Main tab, for when I attempt to fix the array.

 

And my normal SMB shares have shown up on disk1 again. No more bare Linux file system.

 

waffle-diagnostics-20230305-1047 - After libvirt deletion.zip

[screenshots of the Main tab attached]

Link to comment
23 hours ago, JorgeB said:

Everything looks good so far, try re-enabling parity now.

I thought the read check it offered would fix the disabled disk. It did not.

So I performed the start-the-array-without-the-disk-then-add-it-back trick. The array is performing a parity sync now, and the parity drive is enabled. I'll see how it goes; last time it finished its parity check but then the drive was disabled again. That shouldn't happen now with the VM issue removed. I'll post back here with results sometime tomorrow after some sleep.

[screenshot of the parity sync in progress attached]

Link to comment
10 hours ago, JorgeB said:

Yeah, that won't do it. It should be OK for now, or the disk would have become disabled again almost immediately.

Looks like the array is back online. SMB shares are working again. Docker and VMs are currently disabled; I can work on getting those back on my own.

 

Can I remove these historical disks without further issue?

[screenshot of historical disks attached]

Link to comment

Disk1 is down again. My docker.img was corrupted at some point, so I went to fix that. I also had 2 images on the array, docker.img and docker-xfs.img. I deleted both, then started the Docker service with the settings in the attached screenshot. Do we know if this caused disk1 to go offline? This time it says the disk is enabled but "Unmountable: wrong or no file system".

EDIT: I found a previous post referencing xfs_repair. I was able to run the command and disk1 appears to be back and operational.
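For anyone finding this later: the usual procedure is to start the array in maintenance mode and run xfs_repair against the emulated disk device; on 6.11.x disk1 maps to /dev/md1. A sketch, dry run first:

    # Dry run to preview what would be fixed, then the actual repair.
    xfs_repair -n /dev/md1
    xfs_repair /dev/md1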

[screenshot of Docker settings attached]

waffle-diagnostics-20230307-1507 - removed corupted docker.img_then made new docker img as btrfs.zip

Link to comment
13 hours ago, offroadguy56 said:

Can I remove these historical disks without further issue?

Yes.

 

11 hours ago, offroadguy56 said:

EDIT: I found a previous post referencing xfs_repair. I was able to run the command and disk1 appears to be back and operational.

Yep, that's the correct procedure.

Link to comment

Thanks JorgeB. Looks like the SATA passthrough to the VM was the root of the problem. I'm not entirely sure how it got that way. All I remember from a week ago, at the start, was plugging my GPU accelerator back in after changing its cooler, while also plugging in a 2nd M.2 NVMe drive. The computer attempted to boot Windows off that 2nd NVMe, as I had not wiped it; it tried several times before I caught on. After getting into Unraid I noticed disk1 was disabled, so I restarted Unraid multiple times and tried changing cables/SATA ports.

When I stopped the array to fix disk1 (just a simple stop array -> start array), I also simultaneously added a 2nd slot to my cache pool, which changed it from XFS to BTRFS and disabled my working cache drive (the 1st NVMe). I don't believe losing the cache pool was the cause or a symptom, as disk1 was disabled before I touched the cache pool. But I could be remembering wrong, because libvirt.img was in a share stored solely on the cache drive. So the SATA passthrough issue could have happened when I added the 2nd slot and drive to the cache pool, which changed the file system and made the cache pool unreadable.

Thanks again, the community here is great.

Link to comment
