New Parity Disk rebuild encountered read errors


Solved by trurl.


Hoping for some direction on an error I've encountered. My array is running out of space, so I started the process of replacing the parity disk with a larger one. All was going OK until the rebuild hit an error: a read error on one of the disks (see below). The current rebuild has another 3 hours to go before it finishes.

Prior to the parity upgrade there were no errors, and the last parity rebuild was all good.

Questions: is this something to worry about, and will it lead to some data loss? When it finishes, if I reboot, will the error correct itself?

I have new data disks to install, and the old parity disk is still available.

Any help or suggestions appreciated.

Thanks in advance,
Justintas

Parity - ST12000VN0008-2PH103_ZS802V5R (sdb) - active 33 C [DISK INVALID] (new parity Disk getting rebuilt)
Disk 1 - WDC_WD40EFRX-68N32N0_WD-WCC7K2PF6VZX (sdc) - active 31 C (disk has read errors) [NOK]
Disk 2 - WDC_WD40EFRX-68N32N0_WD-WCC7K3EN931N (sdd) - active 31 C [OK]
Disk 3 - WDC_WD40EFRX-68N32N0_WD-WCC7K5FREU23 (sde) - active 31 C [OK]

Parity sync / Data rebuild in progress.
Total size: 12 TB
Elapsed time: 8 hours, 12 minutes
Current position: 4.01 TB (33.4 %)
Estimated speed: 216.3 MB/sec
Estimated finish: 10 hours, 16 minutes
Sync errors corrected: 2689
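The progress figures in this status readout are internally consistent. A quick Python sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant speed for the rest of the sync, reproduces the percentage and the estimated finish; the helper names are illustrative and not part of Unraid.

```python
# Sanity-check the parity sync progress figures shown above.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes) and a
# speed that stays constant until the end of the sync.

TB = 10**12
MB = 10**6


def sync_progress(total_tb: float, position_tb: float, speed_mb_s: float):
    """Return (percent complete, remaining seconds) for a sync in progress."""
    percent = 100.0 * position_tb / total_tb
    remaining_bytes = (total_tb - position_tb) * TB
    remaining_s = remaining_bytes / (speed_mb_s * MB)
    return percent, remaining_s


def fmt_hm(seconds: float) -> str:
    """Format seconds as 'H hours, M minutes', rounding to the nearest minute."""
    total_min = round(seconds / 60)
    return f"{total_min // 60} hours, {total_min % 60} minutes"


percent, remaining = sync_progress(12, 4.01, 216.3)
print(f"{percent:.1f} %")   # 33.4 %
print(fmt_hm(remaining))    # 10 hours, 16 minutes
```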


It does look like disk1 has problems:

Serial Number:    WD-WCC7K2PF6VZX
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   189   051    -    3
197 Current_Pending_Sector  -O--CK   200   200   000    -    5

Attribute 1 isn't monitored by default. Click on each of your WD disks to get to its page and add attributes 1 and 200.
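The same check can also be made outside the GUI by parsing the attribute table. This is a minimal sketch assuming the brief attribute layout shown above (as printed by `smartctl -x` or `smartctl -A -f brief`), with RAW_VALUE as the eighth column. The `WATCH` set is an illustrative choice for the WD drives in this thread, not Unraid's default monitoring list, and attribute 1 raw values may not be meaningful this way on drives from other vendors.

```python
# Flag watched SMART attributes with a nonzero raw value, from smartctl's
# brief attribute table (ID#, ATTRIBUTE_NAME, FLAGS, VALUE, WORST, THRESH,
# FAIL, RAW_VALUE). WATCH is an illustrative choice, not Unraid's list.

WATCH = {1, 5, 187, 197, 198, 199, 200}

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   189   051    -    3
197 Current_Pending_Sector  -O--CK   200   200   000    -    5
"""


def flagged_attributes(text: str, watch=WATCH):
    """Return [(id, name, raw)] for watched attributes with a nonzero raw value."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric attribute ID; skip the header and blanks.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        attr_id, name = int(fields[0]), fields[1]
        # RAW_VALUE may carry extra text (e.g. "0 (Average 1)"); use its first token.
        raw = fields[7]
        if attr_id in watch and raw.isdigit() and int(raw) > 0:
            hits.append((attr_id, name, int(raw)))
    return hits


for attr_id, name, raw in flagged_attributes(SAMPLE):
    print(f"attribute {attr_id} ({name}): raw value {raw}")
```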

 

The pending sectors are monitored by default and should have warned you, but they might have just been discovered, since the rebuild reads every sector. Did you get any notifications about disk1? No doubt it has a SMART warning on the Dashboard page now (unless you acknowledged it). Was that warning there when you decided to replace parity?

 

Have you written anything to your server since parity rebuild began? Do you still have the original parity disk? Do you have another copy of anything important and irreplaceable?


Thanks for your help. To answer your questions:

No, there was no notification about disk 1.

Yes, it has a notification about the error on the Dashboard now. No, that wasn't the reason I decided to replace parity: I need to expand the array because I'm running out of space, so I'm replacing parity first.

No, nothing has been written to the array since the rebuild started. Yes, I still have the original parity disk. Yes, I have copies of most of the important data; only my movie collection on there is not backed up.

 

Options?

 


You need to replace disk1 instead of parity. It should be possible to rebuild disk1 from the original parity, but it will require jumping through a few hoops now that parity has been replaced and is invalid. Of course, you need a replacement for disk1 that is at least as large as disk1 but no larger than the original parity.
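The size rule in that post reduces to a single range check: the replacement must hold everything disk1 held, and parity must be at least as large as every data disk. A sketch with hypothetical sizes, since the thread does not state the original parity disk's capacity:

```python
# Size rule for rebuilding disk1 onto a new drive from the original parity:
# old disk1 <= replacement <= parity. Sizes are in bytes; the example values
# are hypothetical.

TB = 10**12


def can_rebuild_onto(replacement: int, old_disk: int, parity: int) -> bool:
    """True if `replacement` can take old_disk's slot when rebuilt from `parity`."""
    return old_disk <= replacement <= parity


print(can_rebuild_onto(8 * TB, 4 * TB, 8 * TB))  # True
print(can_rebuild_onto(8 * TB, 4 * TB, 4 * TB))  # False: larger than parity
print(can_rebuild_onto(3 * TB, 4 * TB, 8 * TB))  # False: smaller than disk1
```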


I would go with the 8TB to get the extra capacity which is what you wanted anyway.

 

Looks like you already have autostart disabled.

 

Shutdown, replace new parity with original parity, leave disk1 installed for now, then reboot.

Tools - New Config - Retain All - Apply.

Assign original parity, check the box saying parity is already valid, then start the array.

Shutdown, replace disk1, reboot.

Assign new disk1 and start the array to begin rebuild of disk1.


Thanks trurl, I'll wait until the current parity operation finishes (about an hour), then follow the steps you outlined.

I assume that once data disk 1 is rebuilt and everything boots OK, I can go ahead again and replace the parity disk, then start a gradual upgrade of each data disk?

 

Really appreciate your help and advice.

  • 2 weeks later...

I've just struck a problem with the fix above. There was a delay due to a damaged drive cage, so I had to switch hardware.

I inserted the new disk 1 and did a rebuild.

 

When I reboot, two things happen:

1. It says "Unmountable disk present:" for disk 1. Not sure what I did wrong here?

2. None of my Dockers are visible. I assume this could be related to 1 above.

Updated diagnostics attached. I have tried 2 rebuilds but can't work out what I have done wrong.

 

hptower-diagnostics-20211215-1157.zip


OK, I ran a check and then a repair. The output doesn't look too good?

Here are the last lines of the process:

Metadata corruption detected at 0x44d778, xfs_bmbt block 0xec37d798/0x1000
libxfs_bwrite: write verifier failed on xfs_bmbt bno 0xec37d798/0x1000
Maximum metadata LSN (2146145896:-2144772351) is ahead of log (22:71999).
Format log to cycle 2146145899.
xfs_repair: Releasing dirty buffer to free list!
cache_purge: shake on cache 0x5021c0 left 3 nodes!?
xfs_repair: Refusing to write a corrupt buffer to the data device!
xfs_repair: Lost a write to the data device!
fatal error -- File system metadata writeout failed, err=117. Re-run xfs_repair.

 

Options? The above was run in maintenance mode, which is the only way to activate the check option.

I do have another disk available to try as a replacement for sdc. The original sdc is still available but jammed in the drive cage.

28 minutes ago, justintas said:

original disk is jammed in disk cage

By "jammed" do you mean it can't be removed for some reason?

 

28 minutes ago, justintas said:

new disk in as disk 1 (sdc)

Might as well forget about that sdc designation. If you want to identify a specific drive assignment, disk1 is the way to go. If you want to identify a specific drive, some unique portion of the serial number is most useful, often the last 4 characters will work for many models.

 

26 minutes ago, justintas said:

check and repair was from gui

So it would have used the correct designation, which in this case would be /dev/md1. Specifying the md device is necessary to get parity updated with repair so it remains valid.
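For anyone running the check from a console instead of the GUI, the point is to target the md device for the array slot, not the raw sdX device. A minimal sketch of building that command; `xfs_repair_cmd` is a hypothetical helper, `-n` is xfs_repair's no-modify flag, and the `/dev/mdN` naming follows this post and may differ on other Unraid releases, so check yours. As noted earlier in the thread, the array has to be started in maintenance mode.

```python
from __future__ import annotations

import shlex


def xfs_repair_cmd(disk_number: int, check_only: bool = True) -> list[str]:
    """Build an xfs_repair command for array disk N via its md device.

    Hypothetical helper. Targeting /dev/mdN rather than /dev/sdX means any
    writes made by a repair also update parity, so parity stays valid.
    """
    if disk_number < 1:
        raise ValueError("array data disks are numbered from 1")
    cmd = ["xfs_repair"]
    if check_only:
        cmd.append("-n")  # no-modify mode: report problems, change nothing
    cmd.append(f"/dev/md{disk_number}")
    return cmd


print(shlex.join(xfs_repair_cmd(1)))                    # xfs_repair -n /dev/md1
print(shlex.join(xfs_repair_cmd(1, check_only=False)))  # xfs_repair /dev/md1
```

On the server itself the list could be passed to `subprocess.run`; that step is left out here because it needs root and a real array.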

 

Might be useful to try to get the data from the original disk. Can you mount it as an Unassigned Device?

2 minutes ago, trurl said:

By "jammed" do you mean it can't be removed for some reason?

Yes, it's jammed in the cage and can't be removed; a screw must have moved.


Yes, disk 1 is green but showing as unmountable. Mounting the existing drive will be hard as the cage is damaged. I'll see if I can cut the drive out of the cage.

 

Any other options? Pictures attached in case they help.

disk picture2.JPG

disk picture.JPG


OK, I tried the check process again. Here are the results below, and guess what: it is fixed!!!

 

Thanks for your guidance, much appreciated.

 

Re-ran the check with -n:
Results:
xfs_repair status:
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
would have reset inode 4328816385 nlinks from 1 to 2
would have reset inode 4328816389 nlinks from 1 to 2
would have reset inode 4328816398 nlinks from 1 to 2
No modify flag set, skipping filesystem flush and exiting.

Then I re-ran the check with the options left blank to do a repair:
Results:
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
resetting inode 4328816385 nlinks from 1 to 2
resetting inode 4328816389 nlinks from 1 to 2
resetting inode 4328816398 nlinks from 1 to 2
done
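One way to confirm the repair did what the `-n` dry run predicted is to pull the link-count fixes out of both outputs and compare them. A minimal Python sketch; the regular expression only covers the two nlink message forms visible above, not every message xfs_repair can print, and `nlink_fixes` is an illustrative name.

```python
from __future__ import annotations

import re

# Matches both the dry-run wording ("would have reset inode ...") and the
# repair wording ("resetting inode ...") seen in the outputs above.
NLINK_RE = re.compile(
    r"^(?:would have reset|resetting) inode (\d+) nlinks from (\d+) to (\d+)$"
)


def nlink_fixes(output: str) -> dict[int, tuple[int, int]]:
    """Map inode -> (old nlinks, new nlinks) from xfs_repair output."""
    fixes = {}
    for line in output.splitlines():
        m = NLINK_RE.match(line.strip())
        if m:
            inode, old, new = map(int, m.groups())
            fixes[inode] = (old, new)
    return fixes


check_out = """\
Phase 7 - verify link counts...
would have reset inode 4328816385 nlinks from 1 to 2
would have reset inode 4328816389 nlinks from 1 to 2
would have reset inode 4328816398 nlinks from 1 to 2
No modify flag set, skipping filesystem flush and exiting.
"""

repair_out = """\
Phase 7 - verify and correct link counts...
resetting inode 4328816385 nlinks from 1 to 2
resetting inode 4328816389 nlinks from 1 to 2
resetting inode 4328816398 nlinks from 1 to 2
done
"""

planned, applied = nlink_fixes(check_out), nlink_fixes(repair_out)
print(planned == applied)  # True: the repair made exactly the changes -n reported
```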

  • Solution
13 hours ago, justintas said:

Should I do another parity rebuild before changing parity disk to larger disk ?

No point unless you just want to exercise your hardware. Parity will be built to the new larger disk whether your current parity is valid or not, or even if you had no parity disk before.
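This works because single parity is an XOR (even parity) across the data disks at each position, so building it reads only the data disks. A toy Python sketch with hypothetical 4-byte "disks" shows both directions: building parity from the data, and rebuilding a missing disk from parity plus the surviving disks, which is also what the disk1 rebuild earlier in the thread relied on.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-byte "disks"; real disks are just much longer byte ranges.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x01, 0x02, 0x03, 0x04])
disk3 = bytes([0xAA, 0xBB, 0xCC, 0xDD])

# Building parity reads only the data disks; any old parity is never consulted.
parity = xor_blocks([disk1, disk2, disk3])

# Rebuilding a missing disk XORs parity with the surviving data disks.
rebuilt_disk1 = xor_blocks([parity, disk2, disk3])
print(rebuilt_disk1 == disk1)  # True
```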
