Array in bad shape



Hello,

I don't know what happened, but my server crashed, and when I restarted it I noticed problems with my array:

- USB disk needed repair => done

- 1 of the 2 parity disks is disabled

- 1 of my data disks (md11) does not mount, and xfs_repair reports the error "cannot find log head/tail"

- 1 other data disk is disabled and its contents are being emulated. I successfully mounted it and could check the contents. I ran xfs_repair and it corrected some errors.

I attached my diag file to this post.

Is there any chance of not losing data? In any case, what steps should I follow to minimize damage and recover as much as possible?

 

Thank you for your help.

 

Sined

tower-diagnostics-20200106-1803.zip


You should have asked for advice before doing anything if you were unsure what to do, and it seems you were. But from your description it isn't obvious that you have done anything to make things worse, though that may be more by accident than design.

 

Since you have dual parity, you should be able to rebuild both the disabled parity(1) and the disabled disk23.

 

You mentioned running xfs_repair on a data disk that wasn't disabled. You should always capture exactly the command used and the results so you can post them.

 

You also mention running xfs_repair on the disabled data disk that you somehow mounted yourself. This is a bit more complicated, and perhaps I have misunderstood you.

 

You should never attempt to work with array disks outside of the array, or you will invalidate parity. In this case, since the disk was being emulated and will have to be rebuilt anyway, it doesn't invalidate parity. But it was also a waste of time, since it is the emulated disk that should have been repaired if a repair was needed. Rebuilding the disk will just put it back the way it was before the repair you did outside of the array.

 

Again, as mentioned above, you should always capture exactly the command used and the results so you can post them. That would have perhaps clarified exactly what you did in this case since I might have misunderstood.

 

SMART reports for both disabled disks look OK. You have too many disks for me to examine them all. Do any of your disks show SMART warnings on the Dashboard?

 

45 minutes ago, Unraid_Noob said:

1 of my data disk (md11) does not mount and xfs_repair drops an "error cannot find log head/tail"

Disk11 is failing, which is likely why xfs_repair failed, but since you already have another disabled disk it's not possible to emulate it.

 

If you think parity is OK and in sync we could try re-enabling it to rebuild disk11, but since the filesystem on the other emulated disk was repaired it won't be 100% in sync; alternatively, you could use ddrescue on disk11.
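 

For reference, in case it comes to that, a typical ddrescue run would look something like the lines below. ddrescue is not part of stock Unraid, so it would need to be installed first (e.g. via a plugin); sdX/sdY are only placeholders for the failing disk and an equal-size or larger destination, and the map file path is just an example:

# first pass: copy everything readable, keeping a map file so the copy can be resumed
ddrescue -f /dev/sdX /dev/sdY /boot/disk11-ddrescue.map
# optional second pass: retry the unreadable areas a few times each
ddrescue -f -r3 /dev/sdX /dev/sdY /boot/disk11-ddrescue.map

The map file lets ddrescue pick up where it left off, and the retry pass only revisits the sectors it couldn't read the first time.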

51 minutes ago, trurl said:

SMART reports for both disabled disks looks OK. You have too many disks for me to examine them all. Do any of your disks show SMART warnings on the Dashboard?

44 minutes ago, johnnie.black said:

Disk11 is failing

I should have thought to at least check that one also.

Just now, trurl said:

so we can't see why the disks were disabled.

Most likely a controller crash or power issue caused errors on multiple disks. When this happens Unraid disables as many disks as there are parity devices; which disks get disabled is the luck of the draw.

 

Parity1 is likely still valid; parity2 will be a little different because of the emulated disk repair. If it were me, I would re-enable parity and the other disk and rebuild disk11 to a new disk, but this assumes all other disks are OK and parity really is in sync.


Hi,

 

Thank you for your replies.

The actions I took were based on comments found on this forum about the exact same error message.

 

Here are the different steps I undertook:
 

mkdir /mnt/tmp
mount /dev/md11 /mnt/tmp

error: mount: /mnt/tmp: can't read superblock on /dev/md11.

xfs_repair -n /dev/md11

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
xfs_repair: read failed: Input/output error
empty log check failed
zero_log: cannot find log head/tail (xlog_find_tail=-5)
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 3
        - agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
Maximum metadata LSN (1:149238) is ahead of log (0:0).
Would format log to cycle 4.
No modify flag set, skipping filesystem flush and exiting.

mount /dev/md23 /mnt/tmp

No errors

xfs_repair -n /dev/md23

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 4
        - agno = 3
        - agno = 5
        - agno = 2
        - agno = 7
        - agno = 0
        - agno = 6
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

 

xfs_repair /dev/md23

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 2
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 5
        - agno = 4
        - agno = 6
        - agno = 7
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done

 

Could you please explain what steps I need to take next?

 

Thank you

 

Sined

 

 

39 minutes ago, Unraid_Noob said:

The actions I took were based on comments found on this forum with the exact same error message.

There are a lot of threads about repairing filesystems, and any particular error message isn't the full picture. I think you must have been looking at some old threads, because mounting to a temporary directory isn't the usual way now. You can do the repair from the webUI now, and you are less likely to get the commands wrong that way.

 

I notice that you were working with the md device, though, so it seems you weren't actually doing the repair outside the array as I originally thought. So, in the case of the disabled disk, you would have been repairing the emulated disk.
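 

Just to illustrate the difference (a sketch only, using device names from this thread; sdX1 is a placeholder for the physical partition):

xfs_repair /dev/md23     # md device: goes through Unraid's array driver, so for a disabled disk this repairs the emulated disk and parity stays in sync
xfs_repair /dev/sdX1     # raw partition: bypasses the array entirely, so parity knows nothing about any changes made here

That is why the repair you ran against /dev/md23 ended up on the emulated disk rather than on the physical one.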

 

Let's let @johnnie.black comment on this new information.

12 hours ago, Unraid_Noob said:

xfs_repair: read failed: Input/output error

This confirms the problem with disk11. Before proceeding with the invalidslot command I just want to make sure the actual disk23 is mounting correctly; it almost certainly is, but it doesn't hurt to confirm. So, with the array stopped, type:

mkdir /temp
mount -o ro /dev/sdq1 /temp    # read-only mount, so nothing on the disk is modified

If you rebooted since the diags, check that disk23 is still sdq (see the note at the end of this post). If it mounts correctly you can browse the contents, but that's not really needed; we just want to make sure it's mounting. Next, unmount:

 

umount /temp

Report back so we can proceed with the invalid slot procedure, and don't forget you need a new disk to replace disk11.
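 

By the way, if you want to double-check the device letter before mounting (it can change after a reboot), something like this works; match the serial number against what the Main page shows for disk23:

lsblk -o NAME,SIZE,MODEL,SERIAL     # list all block devices with their model and serial numbers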

 


OK, so to replace disk11 we need to re-enable parity1. Also, since disk23 looks healthy and is mounting correctly, we might as well re-enable it as well instead of keeping the repaired emulated disk, which might have some corruption. To do that:

 

-Tools -> New Config -> Retain current configuration: All -> Apply
-Assign any missing disk(s), including new disk11
-Important - After checking the assignments leave the browser on that page, the "Main" page.

-Open an SSH session/use the console and type (don't copy/paste directly from the forum, as sometimes it can insert extra characters):

mdcmd set invalidslot 11

-Back on the GUI, and without refreshing the page, just start the array. Do not check the "parity is already valid" box (the GUI will still show that data on the parity disk(s) will be overwritten; this is normal, as it doesn't account for the invalidslot command, but they won't be overwritten as long as the procedure was done correctly). Disk11 will start rebuilding. The disk should mount immediately, but if it's unmountable don't format it; wait for the rebuild to finish and then run a filesystem check.

 

Keep the old disk11 intact; most of the data on it should be recoverable with ddrescue if still needed.


Dear Jorge,

 

In order to not make any mistakes:

- I stop the array

- I create a new config

- On the Main tab I assign a new spare disk to slot 11

- I check all the slot assignments

- I run the provided command in an SSH session

- I start the array and let it rebuild

 

The only part I wasn't sure about is the assignment of a new spare disk to slot 11 in place of the defective one. Could you confirm this?

 

I will wait for your feedback before doing anything.

 

Thank you

 


Dear Jorge,

 

I had trouble completing the rebuild. The server froze and I had to do a hard reset. Starting the server in maintenance mode I got the following message:

"Unraid Parity sync / Data rebuild: 08-01-2020: Parity sync / Data rebuild finished (errors). Duration unavailable (no parity-check entries logged

 

Does this mean the rebuild completed successfully and I just need to continue the procedure to correct the errors, or do I need to start over from the beginning?

 

I attached the diags just in case.

 

Thanks again for your support.

tower-diagnostics-20200108-1924.zip
