Starting array with failed disk after lost disk assignments


daemian


Good Morning,

 

I had a disk fail, which I replaced. However, sometime before the parity sync/rebuild finished there was a power outage. When I booted Unraid back up, the disk assignments were lost. Through my research it sounds like it's only critical that I get the parity disk assignment correct; the other disks can be put in any slot without negative consequences. So I determined which disk was the parity, put it in the parity slot, and put my other disks in the disk # slots.


What I am unsure of is whether I should now start the array as normal, start it with "Parity is already valid" selected, or do something else entirely. What's throwing me off is that all of the disks are recognized as a "New Device" right now (blue square). I want it to rebuild the data on the failed disk and trust the data on the others and on parity. How do I go about this without destroying everything?

 

Thanks!

Quote

Do you know what disk you were rebuilding, not the old disk#, the actual disk serial or current disk#? 

I am pretty certain it is WCC4N0334109. I say that because I put all of the drives in as data drives and started the array (with no parity). The other three looked fine, but that one showed "Unmountable: No file system". I presume that would be because the power failure occurred before the parity sync finished.
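For reference, one way to confirm which physical device that serial maps to is to list the devices by ID from the console; this is a minimal sketch using the standard Linux /dev/disk/by-id path rather than anything Unraid-specific:

# List block devices by ID and look for the serial noted above:
ls -l /dev/disk/by-id/ | grep WCC4N0334109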

Quote

Is parity the 6TB Hitachi or one of the currently assigned data disks? 

The 6TB drive is the parity.

1 minute ago, daemian said:

I am pretty certain it is WCC4N0334109. I say that because I put all of the drives in as data drives and started the array (with no parity). The other three looked fine, but that one showed "Unmountable: No file system". I presume that would be because the power failure occurred before the parity sync finished.

If parity is the 6TB drive then that's likely it. It would have been best if the data disks had been mounted read-only, but this should still work:

 

-Tools -> New Config -> Retain current configuration: All -> Apply
-Assign any missing disk(s), such as parity
-Important: after checking the assignments, leave the browser on that page, the "Main" page.

-Open an SSH session/use the console and type the following (I'll assume the disk to rebuild is still disk1; if not, adjust the command):

mdcmd set invalidslot 1 29

-Back on the GUI, and without refreshing the page, just start the array. Do not check the "parity is already valid" box; disk1 will start rebuilding. The disk should mount immediately, but if it's unmountable don't format it; wait for the rebuild to finish and then run a filesystem check.
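For reference, a minimal console sketch of those last two steps; the /proc/mdstat keys grepped for are an assumption about Unraid's md driver output and may differ by version:

# Mark slot 1 (disk1) as the slot to be reconstructed - run this before starting the array:
mdcmd set invalidslot 1 29

# After starting the array from the GUI, rebuild progress can be watched from the console
# (assumes Unraid's md driver reports mdResync* keys in /proc/mdstat):
grep "mdResync" /proc/mdstat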

 

 


So I just want to double-check: this is what the screen looks like now:

[screenshot: Main page showing the new disk assignments]

 

I have issued this command at the CLI:

[screenshot: console output of the mdcmd command]

 

I have not refreshed or left the page. Now I am going to start the array without "Parity is already valid" selected.

 

Is that all correct?


Thank you for your help!

59 minutes ago, daemian said:

OK - so the rebuild is completed. Now in the GUI disk 1 shows as "Unmountable: No file system"

A rebuild does not fix an "unmountable" problem, as it works at the physical sector level, not the file system level. You normally need to run the file system repair tools to fix the unmountable state.
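For example, a non-destructive first pass could be a check-only run from the console with the array started in maintenance mode; the -n flag makes no changes, and /dev/md1 is assumed here to be the rebuilt disk1 so that any later repair also updates parity:

# Check-only pass against disk1's filesystem (array in maintenance mode, nothing is written):
xfs_repair -n /dev/md1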

1 hour ago, daemian said:

OK - so the rebuild is completed. Now in the GUI disk 1 shows as "Unmountable: No file system"

Possibly the result of starting the disks read-write without parity earlier, or, worse, parity not being in sync. Either way, try a filesystem check:

https://wiki.unraid.net/Check_Disk_Filesystems#Drives_formatted_with_XFS

or

https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui


P.S. I didn't notice at first since I didn't check the complete syslog, but you also have problems with your cache pool; there are read and write errors on both devices, but mainly cache1:

Oct 23 08:04:35 dt-ur01 kernel: BTRFS info (device sdi1): bdev /dev/sdi1 errs: wr 166, rd 1, flush 0, corrupt 0, gen 0
Oct 23 08:04:35 dt-ur01 kernel: BTRFS info (device sdi1): bdev /dev/sdh1 errs: wr 863327568, rd 506341990, flush 65261822, corrupt 0, gen 0

These are hardware errors, and with SSDs they are usually the result of bad cables. After replacing them, run a scrub and check that all errors were corrected, though if you're using any NOCOW shares there might be some undetected corruption there.
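For example, after swapping the cables, the scrub and the per-device error counters could be checked like this; /mnt/cache is assumed to be the pool's mount point:

# Start a scrub on the cache pool, then poll its progress/result:
btrfs scrub start /mnt/cache
btrfs scrub status /mnt/cache

# Show the per-device error counters (add -z to reset them after checking):
btrfs dev stats /mnt/cache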


Thanks for pointing out the cache drive - I will check that out when I can.

For the original issue, when I try to run xfs_repair I get the following error:

 

root@dt-ur01:~# xfs_repair -v /dev/md1
Phase 1 - find and verify superblock...
        - block cache size set to 2290880 entries
Phase 2 - using internal log
        - zero log...
Log inconsistent (didn't find previous header)
failed to find log head
zero_log: cannot find log head/tail (xlog_find_tail=5)
ERROR: The log head and/or tail cannot be discovered. Attempt to mount the
filesystem to replay the log or use the -L option to destroy the log and
attempt a repair.

Do I try it with the -L option? It sounds like that may result in [more] data loss, but perhaps I don't really have any other option?


Thank you again for all of your time and assistance.


Well, -L didn't get me any further:

root@dt-ur01:~# xfs_repair -Lv /dev/md1
Phase 1 - find and verify superblock...
        - block cache size set to 2290880 entries
Phase 2 - using internal log
        - zero log...
Log inconsistent (didn't find previous header)
failed to find log head
zero_log: cannot find log head/tail (xlog_find_tail=5)

 


This means the rebuilt disk has more serious corruption. Either parity wasn't valid before, or it's possibly the result of mounting the disks read-write before rebuilding. As I mentioned, the disks should have been mounted read-only, since there will always be some filesystem housekeeping that won't be reflected in the existing parity while it isn't assigned. Btrfs will usually never survive this, ReiserFS usually survives without issues, and XFS should survive most of the time but other times might not.


Thanks Johnnie.

 

I upgraded to 6.5.3 and tried xfs_repair again. Still no luck. Putting this disk in another machine is not really an option for me in this case (I am remote to the site, and there is not much in the way of resources there).


I think I may need to bite the bullet and just format the drive, conceding that the data from that drive is lost. It's probably not really that big of a deal. Obviously not ideal, but I don't think I have much other choice. Would I just format that drive and then run a parity check to be sure everything is OK?
