Array in bad shape



Hello,

I don't know what happened, but my server crashed, and when I restarted it I noticed problems with my array:

- USB disk needed repair => done

- 1 of the 2 parity disks is disabled

- 1 of my data disks (md11) does not mount, and xfs_repair reports the error "cannot find log head/tail"

- 1 other data disk is disabled and its contents are being emulated. I successfully mounted it and could check the contents. I ran xfs_repair and it corrected some errors.

I attached my diag file to this post.

Is there any chance of not losing data? In any case, what steps should I follow to minimize damage and recover as much as possible?

 

Thank you for your help.

 

Sined

tower-diagnostics-20200106-1803.zip


You should have asked for advice before doing anything if you were unsure what to do, and it seems you were. But from your description it isn't obvious that you have done anything to make things worse, though that may be more by accident than design.

 

Since you have dual parity, you should be able to rebuild both the disabled parity(1) and the disabled disk23.

 

You mentioned running xfs_repair on a data disk that wasn't disabled. You should always capture exactly the command used and the results so you can post them.

 

You also mention running xfs_repair on the disabled data disk that you somehow mounted yourself. This is a bit more complicated, and perhaps I have misunderstood you.

 

You should never attempt to work with array disks outside of the array, or you will invalidate parity. In this case, since the disk was being emulated and will have to be rebuilt anyway, it doesn't invalidate parity. But it was also a waste of time, since it is the emulated disk that should have been repaired if a repair was needed. Rebuilding the disk will just put it back the way it was before the repair you did outside of the array.

 

Again, as mentioned above, you should always capture exactly the command used and the results so you can post them. That would have perhaps clarified exactly what you did in this case since I might have misunderstood.

 

SMART reports for both disabled disks look OK. You have too many disks for me to examine them all. Do any of your disks show SMART warnings on the Dashboard?

 

45 minutes ago, Unraid_Noob said:

1 of my data disk (md11) does not mount and xfs_repair drops an "error cannot find log head/tail"

Disk11 is failing, which is likely why xfs_repair failed, but since you already have another disabled disk it's not possible to emulate it.

 

If you think parity is OK and in sync we could try re-enabling it to rebuild disk11, but since the filesystem on the other emulated disk was repaired it won't be 100% in sync; alternatively, you could use ddrescue on disk11.
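 

For reference, in case it comes to that, a typical ddrescue run would look something like the lines below. ddrescue is not part of stock Unraid, so it would need to be installed first (e.g. via a plugin); sdX/sdY are only placeholders for the failing disk and an equal-size or larger destination, and the map file path is just an example:

# first pass: copy everything readable, keeping a map file so the copy can be resumed
ddrescue -f /dev/sdX /dev/sdY /boot/disk11-ddrescue.map
# optional second pass: retry the unreadable areas a few times each
ddrescue -f -r3 /dev/sdX /dev/sdY /boot/disk11-ddrescue.map

The map file lets ddrescue pick up where it left off, and the retry pass only revisits the sectors it couldn't read the first time.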

51 minutes ago, trurl said:

SMART reports for both disabled disks looks OK. You have too many disks for me to examine them all. Do any of your disks show SMART warnings on the Dashboard?

44 minutes ago, johnnie.black said:

Disk11 is failing

I should have thought to at least check that one also.

Just now, trurl said:

so we can't see why the disks were disabled.

Most likely a controller crash or power issue caused errors on multiple disks. When this happens Unraid disables as many disks as there are parity devices; which disks get disabled is the luck of the draw.

 

Parity1 is likely still valid; parity2 will be a little different because of the emulated disk repair. If it were me, I would re-enable parity and the other disk and rebuild disk11 to a new disk, but this assumes all other disks are OK and parity really is in sync.


Hi,

 

Thank you for your replies.

The actions I took were based on comments found on this forum about the exact same error message.

 

Here are the different steps I undertook:
 

mkdir /mnt/tmp
mount /dev/md11 /mnt/tmp

error: mount: /mnt/tmp: can't read superblock on /dev/md11.

xfs_repair -n /dev/md11

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
xfs_repair: read failed: Input/output error
empty log check failed
zero_log: cannot find log head/tail (xlog_find_tail=-5)
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 3
        - agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
Maximum metadata LSN (1:149238) is ahead of log (0:0).
Would format log to cycle 4.
No modify flag set, skipping filesystem flush and exiting.

mount /dev/md23 /mnt/tmp

No errors

xfs_repair -n /dev/md23

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 4
        - agno = 3
        - agno = 5
        - agno = 2
        - agno = 7
        - agno = 0
        - agno = 6
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

 

xfs_repair /dev/md23

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 2
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 5
        - agno = 4
        - agno = 6
        - agno = 7
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done

 

Could you please explain what steps I need to take next?

 

Thank you

 

Sined

 

 

39 minutes ago, Unraid_Noob said:

The actions I took were based on comments found on this forum with the exact same error message.

There are a lot of threads about repairing filesystems, and any particular error message isn't the full picture. I think you must have been looking at some old threads, because mounting to a temporary directory isn't the usual way now. You can do the repair from the webUI now, and you are less likely to get the commands wrong that way.

 

I notice that you were working with the md device, though, so it seems you weren't actually doing the repair outside the array as I originally thought. So, in the case of the disabled disk, you would have been repairing the emulated disk.
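 

Just to illustrate the difference (a sketch only, using device names from this thread; sdX1 is a placeholder for the physical partition):

xfs_repair /dev/md23     # md device: goes through Unraid's array driver, so for a disabled disk this repairs the emulated disk and parity stays in sync
xfs_repair /dev/sdX1     # raw partition: bypasses the array entirely, so parity knows nothing about any changes made here

That is why the repair you ran against /dev/md23 ended up on the emulated disk rather than on the physical one.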

 

Let's let @johnnie.black comment on this new information.

12 hours ago, Unraid_Noob said:

xfs_repair: read failed: Input/output error

This confirms the problem with disk11. Before proceeding with the invalidslot command I just want to make sure the actual disk23 is mounting correctly; it almost certainly is, but it doesn't hurt to confirm. So, with the array stopped, type:

mkdir /temp
mount -o ro /dev/sdq1 /temp    # read-only mount, so nothing on the disk is modified

If you rebooted since the diags, check that disk23 is still sdq (see the note at the end of this post). If it mounts correctly you can browse the contents, but that's not really needed; we just want to make sure it's mounting. Next, unmount:

 

umount /temp

Report back so we can proceed with the invalid slot procedure, and don't forget you need a new disk to replace disk11.
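 

By the way, if you want to double-check the device letter before mounting (it can change after a reboot), something like this works; match the serial number against what the Main page shows for disk23:

lsblk -o NAME,SIZE,MODEL,SERIAL     # list all block devices with their model and serial numbers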

 


OK, so to replace disk11 we need to re-enable parity1. Also, since disk23 looks healthy and is mounting correctly, we might as well re-enable it as well instead of keeping the repaired emulated disk, which might have some corruption. To do that:

 

-Tools -> New Config -> Retain current configuration: All -> Apply
-Assign any missing disk(s), including new disk11
-Important - After checking the assignments leave the browser on that page, the "Main" page.

-Open an SSH session/use the console and type (don't copy/paste directly from the forum, as sometimes it can insert extra characters):

mdcmd set invalidslot 11

-Back on the GUI, and without refreshing the page, just start the array. Do not check the "parity is already valid" box (the GUI will still show that data on the parity disk(s) will be overwritten; this is normal, as it doesn't account for the invalidslot command, but they won't be overwritten as long as the procedure was done correctly). Disk11 will start rebuilding. The disk should mount immediately, but if it's unmountable don't format it; wait for the rebuild to finish and then run a filesystem check.

 

Keep the old disk11 intact; most of the data on it should be recoverable with ddrescue if still needed.


Dear Jorge,

 

In order to not make any mistakes:

- I stop the array

- I create a new config

- On the Main tab I assign a new spare disk to slot 11

- I check all the slot assignments

- I run the provided command in an SSH session

- I start the array and let it rebuild

 

The only part I wasn't sure about is the assignment of a new spare disk to slot 11 in place of the defective one. Could you confirm this?

 

I will wait for your feedback before doing anything.

 

Thank you

 


Dear Jorge,

 

I had trouble completing the rebuild. The server froze and I had to do a hard reset. Starting the server in maintenance mode I got the following message:

"Unraid Parity sync / Data rebuild: 08-01-2020: Parity sync / Data rebuild finished (errors). Duration unavailable (no parity-check entries logged

 

Does this mean the rebuild completed successfully and I just need to continue the procedure to correct the errors, or do I need to start over from the beginning?

 

I attached the diags just in case.

 

Thanks again for your support.

tower-diagnostics-20200108-1924.zip
