Mechanical disk failure then separate disk file system corruption with one parity disk


cyberstyx
Solved by JorgeB

I had a mechanical disk failure (md2) on an old 4TB disk. I replaced it with a new 6TB disk.

 

When I added it, Unraid started preparing it (not rebuilding it yet). After 20-30% of that process, another disk (md1) got disabled.

 

I now have 2 disks flagged for formatting, and I can't proceed with rebuilding the new disk until I first sort things out with the disabled disk.

 

I ran a "Check -n" from the Disk 1 properties (Unraid GUI) and left it running last night. When I returned from work this afternoon, it was still "dotting" the progress window after 22 hours:

 

Phase 1 - find and verify superblock...

couldn't verify primary superblock - not enough secondary superblocks with matching geometry !!!

attempting to find secondary superblock...

....found candidate secondary superblock...

unable to verify superblock, continuing...

............................ 

 

etc. I don't expect anything helpful after so many hours.

 

I am not sure how to go about fixing md1 before rebuilding md2, and I have only one parity disk.

 

Some of the errors from the logs:

 

Sep 14 20:34:32 Tower kernel: XFS (md1): Mounting V5 Filesystem
Sep 14 20:34:32 Tower kernel: XFS (md1): Internal error !uuid_equal(&mp->m_sb.sb_uuid, &head->h_fs_uuid) at line 253 of file fs/xfs/xfs_log_recover.c.  Caller xlog_header_check_mount+0x60/0xb4 [xfs]

 

Sep 14 20:34:32 Tower kernel: XFS (md1): Corruption detected. Unmount and run xfs_repair
Sep 14 20:34:32 Tower kernel: XFS (md1): log has mismatched uuid - can't recover
Sep 14 20:34:32 Tower kernel: XFS (md1): failed to find log head
Sep 14 20:34:32 Tower kernel: XFS (md1): log mount/recovery failed: error -117
Sep 14 20:34:32 Tower kernel: XFS (md1): log mount failed
Sep 14 20:34:32 Tower root: mount: /mnt/disk1: mount(2) system call failed: Structure needs cleaning.
Sep 14 20:34:32 Tower root:        dmesg(1) may have more information after failed mount system call.
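
For anyone triaging a similar excerpt, here is a minimal Python sketch that groups the XFS kernel messages by device. The helper name `xfs_messages_by_device` and the regular expression are illustrative assumptions, not an Unraid tool; the sample lines are copied from the log above.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   "Sep 14 20:34:32 Tower kernel: XFS (md1): log mount failed"
XFS_LINE = re.compile(r"kernel: XFS \((?P<dev>[^)]+)\): (?P<msg>.*)$")


def xfs_messages_by_device(lines):
    """Group XFS kernel messages by device name (hypothetical helper)."""
    grouped = defaultdict(list)
    for line in lines:
        match = XFS_LINE.search(line)
        if match:
            grouped[match.group("dev")].append(match.group("msg").strip())
    return dict(grouped)


# Sample lines taken from the log excerpt in this post.
sample = [
    "Sep 14 20:34:32 Tower kernel: XFS (md1): Mounting V5 Filesystem",
    "Sep 14 20:34:32 Tower kernel: XFS (md1): Corruption detected. Unmount and run xfs_repair",
    "Sep 14 20:34:32 Tower kernel: XFS (md1): log has mismatched uuid - can't recover",
    "Sep 14 20:34:32 Tower root: mount: /mnt/disk1: mount(2) system call failed: Structure needs cleaning.",
]

result = xfs_messages_by_device(sample)
for dev, msgs in result.items():
    print(dev, len(msgs), "XFS messages")
```

The last sample line comes from `mount`, not the XFS driver, so it is deliberately left out of the grouping.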

 

I would rather ask the experts about this, as I'm not a Linux user and I'd rather not make a bigger mess. Any help will be greatly appreciated. Thank you in advance, Christos.

tower-diagnostics-20230915-1850.zip

Link to comment
2 minutes ago, cyberstyx said:

When I added it, Unraid started preparing it (not rebuilding it yet). After 20-30% of that process, another disk (md1) got disabled.

This is confusing.   If you replaced it then Unraid should immediately have started rebuilding it.   There is no concept of preparing it during a replacement unless you did something different such as trying to add it as a new drive rather than replacing the failed drive.

Link to comment
6 minutes ago, itimpi said:

This is confusing.   If you replaced it then Unraid should immediately have started rebuilding it.   There is no concept of preparing it during a replacement unless you did something different such as trying to add it as a new drive rather than replacing the failed drive.

I replaced the failed drive (md2) with the newly bought disk; I didn't add it as a new drive.

 

7 minutes ago, JorgeB said:

Diags are after a reboot so we can't see what happened, but disk2 is enabled, I assume you let the rebuild finish after the errors on disk1?

Yes, I let it finish whatever it was doing, but I think it was doing a parity check not a rebuild. On the Dashboard GUI I see on the top right (Parity area):

 

Last check completed on Thu 14 Sep 2023 08:00:17 PM EEST (yesterday)
Duration: 25 minutes, 54 seconds. Average speed: 3.9 GB/s
Finding 1465130633 errors

 

That is the only thing it did. It started automatically after I replaced the disk; I didn't initiate anything.

 

Now on the Main screen I see for both Disk 1 (the failed disk) and Disk 2 (the newly replaced):

Unmountable: Wrong or no file system

 

Also those two disks are available for formatting in the Main screen / Array Operations:

Unmountable disks present:
Disk 1 • WDC_WD60EFAX-68SHWN0_... (sdd)
Disk 2 • WDC_WD60EFAX-68JH4N1_... (sdi)

Link to comment
4 minutes ago, cyberstyx said:

but I think it was doing a parity check not a rebuild.

That's not possible: because the disk is now enabled, it must have been a rebuild. The problem is that this disk will be corrupt.

 

Disk1 looks healthy, with only a few UDMA CRC errors, so we can suspect it got disabled by a connection issue. You can try to force-enable disk1 and rebuild disk2 again, but this is not guaranteed to work, and it won't if parity is no longer valid. I can post detailed instructions if you want to try.

Link to comment
14 minutes ago, JorgeB said:

That's not possible: because the disk is now enabled, it must have been a rebuild. The problem is that this disk will be corrupt.

 

Disk1 looks healthy, with only a few UDMA CRC errors, so we can suspect it got disabled by a connection issue. You can try to force-enable disk1 and rebuild disk2 again, but this is not guaranteed to work, and it won't if parity is no longer valid. I can post detailed instructions if you want to try.

Yes, please let me know what I can do to fix this.

 

The array is in maintenance mode. I have attached the status of the Main page.

Main - Array Devices.jpg

Main - Array Operations.jpg

Link to comment
  • Solution

-Tools -> New Config -> Retain current configuration: All -> Apply
-Check all assignments and assign any missing disk(s) if needed
-IMPORTANT - Check both "parity is already valid" and "maintenance mode" and start the array (note that the GUI will still show that data on parity disk(s) will be overwritten, this is normal as it doesn't account for the checkbox, but it won't be as long as it's checked)
-Stop array
-Unassign disk2
-Start array (in normal mode now) and post new diags

Link to comment

Disk1 dropped offline:

 

Sep 15 20:38:57 Tower kernel: ata11.00: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Sep 15 20:38:57 Tower kernel: ata11.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 15 20:38:57 Tower kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x2)
Sep 15 20:38:57 Tower kernel: ata11.00: revalidation failed (errno=-5)
Sep 15 20:38:57 Tower kernel: ata11.00: disable device
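
As a side note on reading these lines, the SStatus value the kernel prints is a hexadecimal register whose low three nibbles encode device detection (DET), negotiated speed (SPD) and power state (IPM). The sketch below decodes the two values from the log above; the names `decode_sstatus`, `SPEED_NAMES` and `LINK_UP` are made up for this example. A link that came up at only 1.5 Gbps can be one hint of a marginal connection, though on its own it is not conclusive.

```python
import re

# SStatus field layout as defined for SATA/AHCI: DET in bits 0-3,
# SPD in bits 4-7, IPM in bits 8-11. The kernel prints the value in hex.
SPEED_NAMES = {0: "not negotiated", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}

# Illustrative pattern for the "SATA link up" lines shown above.
LINK_UP = re.compile(r"(ata[\d.]+): SATA link up .*\(SStatus ([0-9A-Fa-f]+) ")


def decode_sstatus(hex_text):
    """Split a hex SStatus string into its DET, SPD and IPM fields."""
    value = int(hex_text, 16)
    spd = (value >> 4) & 0xF
    return {
        "det": value & 0xF,          # 3 = device present, link established
        "spd": spd,                  # negotiated interface generation
        "ipm": (value >> 8) & 0xF,   # 1 = interface active
        "speed": SPEED_NAMES.get(spd, "unknown"),
    }


# Lines copied from the log excerpt in this post.
log_lines = [
    "Sep 15 20:38:57 Tower kernel: ata11.00: SATA link up 1.5 Gbps (SStatus 113 SControl 310)",
    "Sep 15 20:38:57 Tower kernel: ata11.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
]

decoded = {}
for line in log_lines:
    match = LINK_UP.search(line)
    if match:
        decoded[match.group(1)] = decode_sstatus(match.group(2))

for port, fields in decoded.items():
    print(port, fields)
```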

 

Power down, replace the SATA cable, also check/replace power cable, and post new diags after array start.

Link to comment
14 minutes ago, cyberstyx said:

from its UDMA CRC error count which is expected and not to worry I guess?

You can acknowledge the attribute, and if it doesn't increase, the problem is resolved.
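
One way to keep an eye on that attribute outside the GUI is to compare its raw value between two readings. The sketch below assumes the usual `smartctl -A` table layout, where attribute 199 (UDMA_CRC_Error_Count) has its raw value in the tenth column. The two sample lines and their counts are invented for illustration and do not come from the diagnostics in this thread, and `udma_crc_count` and `crc_increased` are hypothetical helper names.

```python
def udma_crc_count(smartctl_output):
    """Return the raw value of SMART attribute 199, or None if absent.

    Assumes the attribute table layout printed by `smartctl -A`:
    ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


def crc_increased(before_output, after_output):
    """True if the CRC error count grew between two readings."""
    before = udma_crc_count(before_output)
    after = udma_crc_count(after_output)
    if before is None or after is None:
        raise ValueError("attribute 199 not found in one of the readings")
    return after > before


# Invented sample readings, for illustration only.
reading_before = "199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       7"
reading_after = "199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       7"

still_growing = crc_increased(reading_before, reading_after)
print("CRC errors still increasing:", still_growing)
```

The raw count never resets, so an unchanged value after the cable swap is the sign to look for.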

 

Emulated disk2 is mounting. If the contents look correct, you can re-assign the disk to rebuild. If there are more errors on disk1 or another disk during the rebuild, cancel and post new diags.

 

 

Link to comment
24 minutes ago, trurl said:

Next time please ask for advice before doing anything since there seems to have been some confusion.

 

Do you have another copy of everything important and irreplaceable? Parity is not a backup.

Normally this would have been a basic operation, one I have done 20+ times in the past 7 years. The multiple errors put me off, which is why I asked for help rather than trying to troubleshoot further on my own. The help was invaluable.

 

Yes, parity is not a backup, actual work files are backed up on an external SSD and on cloud storage.

 

Thank you again.

Link to comment
13 hours ago, JorgeB said:

You can acknowledge the attribute, and if it doesn't increase, the problem is resolved.

 

Emulated disk2 is mounting. If the contents look correct, you can re-assign the disk to rebuild. If there are more errors on disk1 or another disk during the rebuild, cancel and post new diags.

 

 

Unfortunately the rebuild got paused because Disk 8 became disabled. Disk 1 was a SATA HDD on the motherboard; Disk 8 is an SSD (for VMs and containers) on a PCI expansion card.

 

The operation paused at 5.5%, probably after 2-3 hours, while I was sleeping.

 

In the morning I saw the paused status and resumed it; only later did I see in your post that you wanted me to cancel and post new diagnostics. After resuming, it behaved the same way as in my initial post here: it rebuilt at a speed of 3.9GB/s and finished in 26 minutes, instead of the normal 50-100MB/s and 18+ hours.
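
To put those numbers side by side, here is a quick back-of-the-envelope sketch, assuming decimal units (1 TB = 1,000,000 MB) and a constant speed; `rebuild_hours` is just an illustrative helper. The 3.9GB/s figure matches 6TB divided by roughly 26 minutes, and it is several times what a single 6 Gbps SATA link can carry (about 600MB/s), so it most likely does not reflect real reads and writes on the disks.

```python
def rebuild_hours(size_tb, speed_mb_s):
    """Hours needed to process size_tb terabytes at speed_mb_s MB/s.

    Assumes decimal units and a constant speed, which real rebuilds
    do not have (speed drops toward the inner tracks of the disks).
    """
    seconds = size_tb * 1_000_000 / speed_mb_s
    return seconds / 3600


# A 6TB disk at a typical 90 MB/s versus the reported 3.9 GB/s.
normal_hours = rebuild_hours(6, 90)
reported_minutes = rebuild_hours(6, 3900) * 60

print(f"at 90 MB/s:  {normal_hours:.1f} hours")
print(f"at 3.9 GB/s: {reported_minutes:.1f} minutes")
```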

 

All the contents of Disk 2 appear to be there, but the files are actually empty (filled with NULL characters). I opened a few movie files and they cannot be played; I opened a few text files and they contain only NULL characters. The rebuild did not complete successfully.

 

I followed the same procedure as with Disk 1: shut the PC down, checked the power cable, and replaced the SATA cable. When I turned it back on, the array was stopped and all disks were present. When I started the array, everything mounted, but Disk 8 is still disabled. During the BIOS POST all disks are reported present (6 disks on the motherboard, plus 2 PCI expansion cards with 2 disks each).

 

The SMART checks on Disk 8 (SSD) complete without errors, but I cannot see the SMART log.

 

On the monitor attached to the Unraid server I see messages every 5-10 seconds for md2 about a metadata I/O error and metadata corruption, telling me to unmount and run xfs_repair. Nothing in the web GUI suggests anything is wrong with Disk 2, only that Disk 8 is disabled. The Main page shows constant reads and writes even though no one is accessing the server, and it reports no errors.

 

The current situation: the GUI suggests everything is OK apart from Disk 8 being disabled (though it shows as active and healthy). But Disk 2 was not rebuilt successfully and has some sort of corruption that the GUI does not show (it shows the disk as mounted, active and healthy with 4TB of data).

tower-diagnostics-20230916-1050.zip

Link to comment
44 minutes ago, JorgeB said:

Ideally you should have canceled, but you can try repeating the procedure above to this time re-enable disk8, then if the emulated disk2 mounts try to rebuild again.

You mean by creating a new config, etc.?

 

I will start that procedure in 30 minutes, as soon as I am back from my chores.

Link to comment
13 minutes ago, JorgeB said:

You need to change the filesystem to xfs first

Check finished

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
agf_freeblks 8620987, counted 8620468 in ag 0
sb_fdblocks 498026383, counted 498027136
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 4
        - agno = 2
        - agno = 5
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (1:2050985) is ahead of log (1:1850812).
Format log to cycle 4.
done
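
For reference, the two counter lines near the top of that output can be pulled out and compared with a short sketch like the one below. The helper `counter_mismatches` and its regular expression are illustrative assumptions; the sample lines are copied from the output above.

```python
import re

# Matches lines such as "sb_fdblocks 498026383, counted 498027136",
# where the first number is the stored value and the second is what
# xfs_repair counted while scanning.
MISMATCH = re.compile(r"^(\w+) (\d+), counted (\d+)")


def counter_mismatches(lines):
    """Return (name, stored, counted, difference) for each mismatch line."""
    found = []
    for line in lines:
        match = MISMATCH.match(line.strip())
        if match:
            stored = int(match.group(2))
            counted = int(match.group(3))
            found.append((match.group(1), stored, counted, counted - stored))
    return found


# Lines copied from the xfs_repair output in this post.
repair_output = [
    "Phase 2 - using internal log",
    "        - zero log...",
    "agf_freeblks 8620987, counted 8620468 in ag 0",
    "sb_fdblocks 498026383, counted 498027136",
    "        - found root inode chunk",
]

mismatches = counter_mismatches(repair_output)
for name, stored, counted, diff in mismatches:
    print(f"{name}: stored {stored}, counted {counted}, difference {diff:+d}")
```

Both differences are a few hundred blocks, so the free-space accounting was off while the rest of the scan found nothing else to report.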

 

Drive mounted, all other drives appear OK so far, rebuild started.

 

Speed is slow, it will take ~20 hours to complete (speed will drop to 50MB/s later on)

 

Total size: 6 TB
Elapsed time: 5 minutes
Current position: 24.7 GB (0.4 %)
Estimated speed: 90.0 MB/sec
Estimated finish: 18 hours, 17 minutes

 

Will report back as soon as I have news, hopefully tomorrow with a successful completion.

 

Thank you JorgeB.

Link to comment

Unfortunately the files are inaccessible.

 

Rebuild finished without errors, file structure and files are there, but again the video files are unplayable and the text files are NULL characters.

 

Running ls -lsR on /mnt/disk2 sometimes gives results like:

 

? d????????? ? ? ? ?            ? HowTo/

/bin/ls: cannot open directory './User Christos/RaspberryPi3/HowTo': Structure needs cleaning

 

With or without a "Structure needs cleaning" message, none of the files I sampled were rebuilt correctly (around 15 files chosen at random from different directories).
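
Sampling by hand only goes so far, so here is a rough sketch of how the same check could be automated: walk a directory tree, flag files that contain nothing but NUL bytes, and record paths that cannot be read at all (which is where errors such as "Structure needs cleaning" would surface). The function names are hypothetical and the demo runs against a temporary directory, not against /mnt/disk2.

```python
import os
import tempfile


def is_all_nul(path, chunk_size=1 << 20):
    """Return True if the file is non-empty and every byte is 0x00."""
    seen_data = False
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            seen_data = True
            if chunk.count(0) != len(chunk):
                return False
    return seen_data


def scan_for_nul_files(root):
    """Walk root and return (all-NUL file paths, unreadable paths)."""
    nul_files, unreadable = [], []

    def remember_error(error):
        # os.walk reports directories it cannot list through this callback.
        unreadable.append(error.filename)

    for dirpath, _dirnames, filenames in os.walk(root, onerror=remember_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if is_all_nul(path):
                    nul_files.append(path)
            except OSError:
                unreadable.append(path)
    return nul_files, unreadable


# Demo on a throwaway directory with one zero-filled file and one real file.
with tempfile.TemporaryDirectory() as demo_root:
    with open(os.path.join(demo_root, "zeros.bin"), "wb") as handle:
        handle.write(b"\x00" * 4096)
    with open(os.path.join(demo_root, "notes.txt"), "wb") as handle:
        handle.write(b"hello")
    nul_paths, unreadable_paths = scan_for_nul_files(demo_root)
    demo_names = sorted(os.path.basename(p) for p in nul_paths)

print("all-NUL files:", demo_names)
print("unreadable paths:", unreadable_paths)
```

Reading every file end to end is slow on a multi-terabyte disk, so in practice it may make sense to point it at one share or directory at a time.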

 

After the rebuild finished, I waited a few minutes to check for any status changes in the GUI. When I then tried to open the first file on Disk 2 to check its contents, Disk 9 got disabled.

 

You know better, but to me it looks like Parity cannot be used anymore to rebuild Disk 2.

 

What concerns me more now is the stability of my system: different disks, even newly added ones, keep getting disabled, and I don't know whether content newly added to the array will still be there in the future. I don't mind re-downloading some of the content I want from Disk 2, but the system should operate as designed.

 

Even though the system is old (old CPU, old MB, old RAM, old PSU), it has been operating as it should, and the disks are new or newish. Because of its power consumption, the system is usually off, and I turn it on for home work or media access. It does not run 24/7, so the disks are not powered on all the time.

 

Let me know what you think: whether you want me to format Disk 2 and add some new content to it to see if everything is OK, whether I should use the previous procedure to bring Disk 9 back online, and whether there are any checks I could run on the hardware and OS configuration to get a reasonable picture of the stability of the whole system. At this point, though, I am not going to spend more money on hardware for testing, as I have other spending priorities.

tower-diagnostics-20230917-1130.zip

Link to comment
