2nd drive started throwing thousands of errors during disk rebuild



So yesterday, I was ready to swap in a spare drive to replace one that was starting to throw SMART errors.  I'd already moved all files off of it, so there was no real risk of data loss - or so I thought.  After I replaced it, during the 10-hour rebuild, another drive started throwing hundreds of thousands of errors.  This drive contained a few dozen movies in my media share.  Not really knowing what else to do, I let the rebuild complete, which it did overnight.  I then replaced the newly failed drive as well and let that rebuild complete.

Now, it would appear that my Movies folder under my media share has become corrupt somehow.  The files still exist, spread throughout my drives, including the newly replaced drive, though I'd be shocked if there wasn't significant corruption in those files, which are quite easily replaceable.  My logs are now full of "Unmount and run xfs_repair" and "Metadata corruption detected at xfs_buf....." messages, which eventually fill the log file completely.

Seems like at this point I don't have many options other than to just let it sit there and do its thing.  Anything else I can do?  Potentially create a new share and use Krusader or Unbalance to move the missing files to the new share?  I appreciate any insights!

diagnostics-20210704-2126.zip

Link to comment
4 hours ago, Tydell said:

Now, it would appear that my Movies folder under my media share has become corrupt somehow.

That is expected. The first rebuilt disk would be corrupt because of the read errors during its rebuild; it doesn't matter that it was empty. That corruption then carries over to the next rebuild.

 

You can run ddrescue on the failing disk; that way you can at least know which files are corrupt after the clone.
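
As a rough sketch of how the clone can point to the affected files: one commonly described approach is to clone the disk, use ddrescue's fill mode to stamp a marker string into every sector it could not read, and then search the mounted clone for that marker. Everything named here is an assumption for illustration: /dev/sdX (failing source), /dev/sdY (clone target), the map file path, the marker text, and the helper name find_marked_files. It also assumes GNU ddrescue is installed and that the marker does not occur naturally in your files.

```shell
#!/bin/sh
# Step 1 (destructive for the target): clone the failing disk.
#   /dev/sdX = failing source, /dev/sdY = clone target (both placeholders).
#     ddrescue -f /dev/sdX /dev/sdY /boot/ddrescue.map
#
# Step 2: overwrite every area the map file records as a bad sector with a marker.
#     printf 'BADSECTOR ' > /tmp/fill.txt
#     ddrescue -f --fill-mode=- /tmp/fill.txt /dev/sdY /boot/ddrescue.map
#
# Step 3: with the clone mounted, list the files that contain the marker.

# find_marked_files DIR MARKER
# Prints the path of every file under DIR that contains the literal MARKER.
find_marked_files() {
    grep -rlF -- "$2" "$1" | sort
}
```

With the clone mounted at a hypothetical /mnt/disks/clone, `find_marked_files /mnt/disks/clone 'BADSECTOR '` would list the files to restore from another source. Double-check the device names before running steps 1 and 2, since both write to the target disk.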

Link to comment

So there are movies in the media share on a bunch of other disks that are completely fine, but if I navigate to \\unraid\media\movies, that folder in the share appears empty, even though when I browse the individual disks, the files are there.  The other folders under the media share appear to be unaffected - just the Movies folder appears affected.  I'll certainly look at ddrescue though, much appreciated!

Link to comment
11 minutes ago, Tydell said:

but if I navigate to \\unraid\media\movies, that folder in the share appears empty,

That is because of the filesystem corruption on disk5; see below for how to fix it. Naturally, that won't fix the corrupt files.

 

https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui

 

Run it without -n, or nothing will be done; if it asks for -L, use it.

Link to comment
10 hours ago, Tydell said:

replaced the newly failed drive

Do you have any evidence that the drive itself had failed? More likely is that you disturbed its connections when replacing the other disk.

 

Do you still have that original disk? Can you mount it with Unassigned Devices?

Link to comment
Just now, trurl said:

Do you have any evidence that the drive itself had failed? More likely is that you disturbed its connections when replacing the other disk.

 

Do you still have that original disk? Can you mount it with Unassigned Devices?

 

I have no evidence yet - I'll definitely be investigating.  I do still have the old drive, although it's unseated slightly from its hot swap bay. 

It's weird - when the rebuild of that first disk was happening, the error count on that drive shot through the roof.  I noticed a bit later on that the drive was actually showing up in both the array AND unassigned devices at the same time. I didn't do anything at that time other than let the rebuild finish.  I then unseated it when I replaced it in the array.

 

Also, xfs_repair appears to have done the trick!  I'll be going through and seeing if I can find any corrupted files, but it had me run the repair twice and now the folder is back, so thank you for that!  Now to investigate the "failed" drive...

Link to comment
21 minutes ago, trurl said:

Do you have any evidence that the drive itself had failed? More likely is that you disturbed its connections when replacing the other disk.

 

Do you still have that original disk? Can you mount it with Unassigned Devices?

I might have evidence now - it won't mount in Unassigned Devices - says it has a dup UUID.  I haven't done what it suggests yet (running xfs_repair with the -L flag), and running xfs_repair -nv on it is not going well:

xfs_repair -nv /dev/sdc
Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
unable to verify superblock, continuing...
.found candidate secondary superblock...
unable to verify superblock, continuing...
.........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

 

The system log file also threw some errors during the check:

Jul 5 08:30:37 Unraid kernel: ata4.00: exception Emask 0x10 SAct 0x180000 SErr 0x280100 action 0x6 frozen
Jul 5 08:30:37 Unraid kernel: ata4.00: irq_stat 0x08000000, interface fatal error
Jul 5 08:30:37 Unraid kernel: ata4: SError: { UnrecovData 10B8B BadCRC }
Jul 5 08:30:37 Unraid kernel: ata4.00: failed command: READ FPDMA QUEUED
Jul 5 08:30:37 Unraid kernel: ata4.00: cmd 60/3f:98:00:30:03/05:00:00:00:00/40 tag 19 ncq dma 687616 in
Jul 5 08:30:37 Unraid kernel: res 50/00:c1:3f:35:03/00:02:00:00:00/40 Emask 0x10 (ATA bus error)
Jul 5 08:30:37 Unraid kernel: ata4.00: status: { DRDY }
Jul 5 08:30:37 Unraid kernel: ata4.00: failed command: READ FPDMA QUEUED
Jul 5 08:30:37 Unraid kernel: ata4.00: cmd 60/c1:a0:3f:35:03/02:00:00:00:00/40 tag 20 ncq dma 360960 in
Jul 5 08:30:37 Unraid kernel: res 50/00:c1:3f:35:03/00:02:00:00:00/40 Emask 0x10 (ATA bus error)
Jul 5 08:30:37 Unraid kernel: ata4.00: status: { DRDY }
Jul 5 08:30:37 Unraid kernel: ata4: hard resetting link
Jul 5 08:30:37 Unraid kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 5 08:30:37 Unraid kernel: ata4.00: configured for UDMA/133
Jul 5 08:30:37 Unraid kernel: ata4: EH complete
Jul 5 08:30:39 Unraid kernel: ata4.00: exception Emask 0x10 SAct 0x80800000 SErr 0x280100 action 0x6 frozen
Jul 5 08:30:39 Unraid kernel: ata4.00: irq_stat 0x08000000, interface fatal error
Jul 5 08:30:39 Unraid kernel: ata4: SError: { UnrecovData 10B8B BadCRC }
Jul 5 08:30:39 Unraid kernel: ata4.00: failed command: READ FPDMA QUEUED
Jul 5 08:30:39 Unraid kernel: ata4.00: cmd 60/3f:b8:00:00:0e/05:00:00:00:00/40 tag 23 ncq dma 687616 in
Jul 5 08:30:39 Unraid kernel: res 50/00:c1:3f:05:0e/00:02:00:00:00/40 Emask 0x10 (ATA bus error)
Jul 5 08:30:39 Unraid kernel: ata4.00: status: { DRDY }
Jul 5 08:30:39 Unraid kernel: ata4.00: failed command: READ FPDMA QUEUED
Jul 5 08:30:39 Unraid kernel: ata4.00: cmd 60/c1:f8:3f:05:0e/02:00:00:00:00/40 tag 31 ncq dma 360960 in
Jul 5 08:30:39 Unraid kernel: res 50/00:c1:3f:05:0e/00:02:00:00:00/40 Emask 0x10 (ATA bus error)
Jul 5 08:30:39 Unraid kernel: ata4.00: status: { DRDY }
Jul 5 08:30:39 Unraid kernel: ata4: hard resetting link
Jul 5 08:30:40 Unraid kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 5 08:30:40 Unraid kernel: ata4.00: configured for UDMA/133
Jul 5 08:30:40 Unraid kernel: ata4: EH complete
Jul 5 08:31:50 Unraid kernel: ata4.00: exception Emask 0x10 SAct 0xc00 SErr 0x280100 action 0x6 frozen
Jul 5 08:31:50 Unraid kernel: ata4.00: irq_stat 0x08000000, interface fatal error
Jul 5 08:31:50 Unraid kernel: ata4: SError: { UnrecovData 10B8B BadCRC }
Jul 5 08:31:50 Unraid kernel: ata4.00: failed command: READ FPDMA QUEUED
Jul 5 08:31:50 Unraid kernel: ata4.00: cmd 60/3f:50:00:58:4c/05:00:01:00:00/40 tag 10 ncq dma 687616 in
Jul 5 08:31:50 Unraid kernel: res 50/00:c1:3f:5d:4c/00:02:01:00:00/40 Emask 0x10 (ATA bus error)
Jul 5 08:31:50 Unraid kernel: ata4.00: status: { DRDY }
Jul 5 08:31:50 Unraid kernel: ata4.00: failed command: READ FPDMA QUEUED
Jul 5 08:31:50 Unraid kernel: ata4.00: cmd 60/c1:58:3f:5d:4c/02:00:01:00:00/40 tag 11 ncq dma 360960 in
Jul 5 08:31:50 Unraid kernel: res 50/00:c1:3f:5d:4c/00:02:01:00:00/40 Emask 0x10 (ATA bus error)
Jul 5 08:31:50 Unraid kernel: ata4.00: status: { DRDY }
Jul 5 08:31:50 Unraid kernel: ata4: hard resetting link
Jul 5 08:31:51 Unraid kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 5 08:31:51 Unraid kernel: ata4.00: configured for UDMA/133
Jul 5 08:31:51 Unraid kernel: ata4: EH complete

 

I think she's dead!

Link to comment

@JorgeB & @trurl - Thank you both for your help!  My folder is back and I'll be going through now to see if there are any missing files.  Lost+Found only has about 600KB of total stuff in it, and no important files that I recognize or can really see the contents of in notepad.

Link to comment
14 minutes ago, Tydell said:

xfs_repair -nv /dev/sdc

You need to specify the partition at the end:

 

xfs_repair -v /dev/sdc1

 

to change the UUID you can use:

xfs_admin -U generate /dev/sdc1

 

15 minutes ago, Tydell said:

ata4: SError: { UnrecovData 10B8B BadCRC }

These errors usually point to a bad SATA cable. Please post a SMART report for that disk.
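
To get a feel for how many of these interface errors a port has logged, something like the following could be run against the syslog. This is only a minimal sketch: the function name count_badcrc is made up for illustration, and it assumes the log lines look like the kernel messages quoted in this thread.

```shell
#!/bin/sh
# count_badcrc LOGFILE
# Prints "<port> <count>" for every ATA port that logged an SError line
# containing BadCRC, e.g. "kernel: ata4: SError: { UnrecovData 10B8B BadCRC }".
count_badcrc() {
    awk '/SError:.*BadCRC/ {
             # The port name is the field of the form "ataN:".
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^ata[0-9]+:$/) { sub(/:$/, "", $i); n[$i]++ }
         }
         END { for (p in n) print p, n[p] }' "$1" | sort
}
```

For example, `count_badcrc /var/log/syslog` (assuming that is where the syslog lives on your system). On the SMART side, `smartctl -A /dev/sdX` (sdX being a placeholder) lists the attributes, including UDMA_CRC_Error_Count. A CRC count that keeps climbing while the other attributes look healthy is more consistent with a cable or connection problem than with a failing disk.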

 

13 minutes ago, Tydell said:

Lost+Found only has about 600KB of total stuff in it

That's good, but don't forget that because of the read errors during the rebuild, there can be more corrupt files, unless those sectors were unused.

Link to comment
11 minutes ago, JorgeB said:

You need to specify the partition at the end:

 


xfs_repair -v /dev/sdc1

Oops, thanks for that! 

 

12 minutes ago, JorgeB said:

to change the UUID you can use:


xfs_admin -U generate /dev/sdc1

That worked!  Still doesn't mount though

Jul 5 08:59:29 Unraid kernel: XFS (sdc1): Corruption warning: Metadata has LSN (1:22491) ahead of current LSN (1:2). Please unmount and run xfs_repair (>= v4.3) to resolve.
Jul 5 08:59:29 Unraid kernel: XFS (sdc1): log mount/recovery failed: error -22
Jul 5 08:59:29 Unraid kernel: XFS (sdc1): log mount failed
Jul 5 08:59:29 Unraid unassigned.devices: Mount of '/dev/sdc1' failed: 'mount: /mnt/disks/Hitachi_HDS723030ALA640_MK0311YHG033ZA: wrong fs type, bad option, bad superblock on /dev/sdc1, missing codepage or helper program, or other error. '


 

 

13 minutes ago, JorgeB said:
30 minutes ago, Tydell said:

ata4: SError: { UnrecovData 10B8B BadCRC }

These errors are usually a bad SATA cable, please post a SMART report for that disk.

Attached.  Lots of UDMA_CRC_Error_Count.

Hitachi_HDS723030ALA640_MK0311YHG033ZA-20210705-0900.txt

Link to comment
7 minutes ago, Tydell said:

Still doesn't mount though

Run xfs_repair like I typed above, without -n, or nothing will be done.

 

8 minutes ago, Tydell said:

Attached.  Lots of UDMA_CRC_Error_Count.

Disk looks fine. Replace the SATA cable, then copy all the data from it to the new disk in the array, replacing existing files; this will fix any corrupt files.

 

Alternatively, you could run a binary file compare utility to detect the corrupt files, but it will take about the same time.
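
For the compare route, a minimal sketch using plain `cmp` might look like this. The function name compare_trees and the mount points in the usage line are hypothetical; it assumes the old disk is mounted read-only somewhere and that file names contain no newlines.

```shell
#!/bin/sh
# compare_trees SRC DST
# For every file under SRC, checks the file with the same relative path under
# DST and prints "MISSING <path>" or "DIFFERS <path>" when it is absent or not
# byte-for-byte identical. Identical files produce no output.
compare_trees() {
    src=$1
    dst=$2
    ( cd "$src" && find . -type f ) | sort | while IFS= read -r rel; do
        rel=${rel#./}
        if [ ! -f "$dst/$rel" ]; then
            printf 'MISSING %s\n' "$rel"
        elif ! cmp -s "$src/$rel" "$dst/$rel"; then
            printf 'DIFFERS %s\n' "$rel"
        fi
    done
}
```

For example, `compare_trees /mnt/disks/old_disk /mnt/disk5` with the old disk mounted through Unassigned Devices (both paths are placeholders). Every file gets read in full on both sides, which is why this takes about as long as simply copying everything over.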

Link to comment
27 minutes ago, JorgeB said:

Disk look fine, replace the SATA cable then copy all the data to the new disk in the array, replacing existing files, this will fix any corrupt files.

It would appear that you, sir, are correct!  Copying everything over now.  Thanks again for all your help!  Time for a 2nd parity drive, methinks.

Link to comment
