New Parity Disk rebuild encountered read errors

justintas · November 29, 2021

Hoping for some direction on an error I encountered. The array was running out of space, so I started the process of replacing the parity disk with a larger one. All was going OK until the rebuild struck a read error on one of the disks (see below). The current rebuild has another 3 hours to go.

Prior to the parity upgrade there were no errors, and the last parity rebuild was all good.

Question: is this something to worry about, and will it lead to some data loss? If I reboot when it finishes, will the error correct itself?

I have new data disks to install and old parity disk still available.

Any help or suggestions appreciated

Thanks in advance Justintas

Parity - ST12000VN0008-2PH103_ZS802V5R (sdb) - active 33 C [DISK INVALID] (new parity Disk getting rebuilt)
Disk 1 - WDC_WD40EFRX-68N32N0_WD-WCC7K2PF6VZX (sdc) - active 31 C (disk has read errors) [NOK]
Disk 2 - WDC_WD40EFRX-68N32N0_WD-WCC7K3EN931N (sdd) - active 31 C [OK]
Disk 3 - WDC_WD40EFRX-68N32N0_WD-WCC7K5FREU23 (sde) - active 31 C [OK]

Parity sync / Data rebuild in progress.
Total size: 12 TB
Elapsed time: 8 hours, 12 minutes
Current position: 4.01 TB (33.4 %)
Estimated speed: 216.3 MB/sec
Estimated finish: 10 hours, 16 minutes
Sync errors corrected: 2689
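
As a rough cross-check of the status figures above, here is a minimal sketch of the arithmetic behind the percentage and the estimated finish. It assumes decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes) and a constant speed; the function names are made up for illustration and are not part of Unraid.

```python
# Figures copied from the status output above. Decimal units are an assumption.
TOTAL_BYTES = 12 * 10**12
POSITION_BYTES = 4.01 * 10**12
SPEED_BYTES_PER_SEC = 216.3 * 10**6


def percent_done(position: float, total: float) -> float:
    """Share of the sync that has completed, as a percentage."""
    return 100.0 * position / total


def remaining_time(position: float, total: float, speed: float) -> tuple[int, int]:
    """Remaining time as (hours, minutes), rounded to the nearest minute.

    Assumes the speed stays constant, which a real sync will not guarantee.
    """
    seconds = (total - position) / speed
    total_minutes = round(seconds / 60)
    return divmod(total_minutes, 60)


if __name__ == "__main__":
    hours, minutes = remaining_time(POSITION_BYTES, TOTAL_BYTES, SPEED_BYTES_PER_SEC)
    print(f"{percent_done(POSITION_BYTES, TOTAL_BYTES):.1f} % done")
    print(f"about {hours} h {minutes} min remaining at the current speed")
```

Under those assumptions the numbers agree with the reported 33.4 % and 10 hours, 16 minutes.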

trurl · November 29, 2021

attach diagnostics to your NEXT post in this thread

justintas · November 29, 2021

Diagnostics attached

hptower-diagnostics-20211130-1009.zip

trurl · November 29, 2021

Does look like disk1 has problems

Serial Number:    WD-WCC7K2PF6VZX
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   189   051    -    3
197 Current_Pending_Sector  -O--CK   200   200   000    -    5

Attribute 1 isn't monitored by default. Click on each of your WD disks to get to its page and add attributes 1 and 200.
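
For anyone who wants to check these attributes by script, here is a minimal sketch that scans rows in the layout quoted above and flags watched attributes with a non-zero raw value. The watch list (1 and 200 from this post, plus 197 for pending sectors) and the "non-zero raw value deserves a look" rule are assumptions of the sketch, not Unraid's own notification logic.

```python
# Rows copied from the SMART report quoted above.
SMART_ROWS = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   189   051    -    3
197 Current_Pending_Sector  -O--CK   200   200   000    -    5
"""

# Hypothetical watch list for this sketch.
WATCHED_IDS = frozenset({1, 197, 200})


def flag_attributes(report: str, watched=WATCHED_IDS):
    """Return (id, name, raw) for watched attributes whose raw value is above zero."""
    flagged = []
    for line in report.splitlines():
        fields = line.split()
        # Skip the header and anything that is not an 8-column attribute row.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[7]
        # Only plain integer raw values are handled in this sketch.
        if attr_id in watched and raw.isdigit() and int(raw) > 0:
            flagged.append((attr_id, name, int(raw)))
    return flagged


if __name__ == "__main__":
    for attr_id, name, raw in flag_attributes(SMART_ROWS):
        print(f"attribute {attr_id} ({name}) has raw value {raw}")
```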

The pending sectors are monitored by default, and should have warned you, but they might have just been discovered since rebuild is going to access all sectors. Did you get any notifications about disk1? No doubt it has a SMART warning on the Dashboard page now (unless you acknowledged it). Was that warning there when you decided to replace parity?

Have you written anything to your server since parity rebuild began? Do you still have the original parity disk? Do you have another copy of anything important and irreplaceable?

justintas · November 30, 2021

46 minutes ago, trurl said:

The pending sectors are monitored by default, and should have warned you, but they might have just been discovered since rebuild is going to access all sectors. Did you get any notifications about disk1? No doubt it has a SMART warning on the Dashboard page now (unless you acknowledged it). Was that warning there when you decided to replace parity?

Have you written anything to your server since parity rebuild began? Do you still have the original parity disk? Do you have another copy of anything important and irreplaceable?

Thanks for your help. To answer your questions:

No notification about disk 1.

Yes, it has a notification on the dashboard now about the error. No, that wasn't the reason I decided to replace parity; I need to expand the array as it is running out of space, so I am replacing parity first.

No, nothing has been written to the array since the rebuild started. Yes, I still have the original parity disk. Yes, I have copies of most of the important data; only my movie collection on there is not backed up.

Options ?

trurl · November 30, 2021

You need to replace disk1 instead of parity. Should be possible to rebuild disk1 from original parity, but will require jumping through a few hoops now that parity has been replaced and is invalid. Of course, you need a replacement for disk1 that is at least as large as disk1 but no larger than original parity.

justintas · November 30, 2021

OK, so I have a brand-new replacement ready, but it is the same size as the original parity (8TB). Is that OK? Or there is an older 4TB drive. Which one should I use?

So what steps do I follow? I assume I let the parity rebuild finish first.

trurl · November 30, 2021

I would go with the 8TB to get the extra capacity which is what you wanted anyway.
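
Both candidates fit the size rule given earlier (at least as large as disk1, no larger than original parity). A minimal sketch of that check follows, with disk1 taken as 4 TB (an assumption inferred from the WD40EFRX model name) and the original parity as 8 TB. The function name is hypothetical.

```python
def can_rebuild_onto(candidate_tb: float, failed_disk_tb: float, parity_tb: float) -> bool:
    """True if the candidate can hold the rebuilt data disk.

    The replacement must be at least as large as the disk it replaces
    and no larger than the parity disk it is rebuilt from.
    """
    return failed_disk_tb <= candidate_tb <= parity_tb


if __name__ == "__main__":
    DISK1_TB = 4            # assumption: inferred from the WD40EFRX model name
    ORIGINAL_PARITY_TB = 8  # stated in the thread

    for candidate in (4, 8, 12):
        ok = can_rebuild_onto(candidate, DISK1_TB, ORIGINAL_PARITY_TB)
        print(f"{candidate} TB replacement: {'fits' if ok else 'does not fit'}")
```

The 12 TB case is included only to show the upper bound: a disk larger than the original parity would not qualify while that parity is in use.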

Looks like you already have autostart disabled.

Shutdown, replace new parity with original parity, leave disk1 installed for now, then reboot.

Tools - New Config - Retain All - Apply.

Assign original parity, check the box saying parity is already valid, then start the array.

Shutdown, replace disk1, reboot.

Assign new disk1 and start the array to begin rebuild of disk1.

justintas · November 30, 2021

Thanks trurl, I will wait until the current parity operation finishes (about an hour), then follow the steps you have outlined.

I assume that once data disk 1 is rebuilt and everything boots OK, I can go ahead again and replace the parity disk, then start a gradual upgrade of each data disk?

Really appreciate your help and advice.

justintas · December 15, 2021

Just struck a problem with the fix above. There was a delay due to a damaged drive cage, so I had to switch hardware.

I have inserted the new disk 1 and did a rebuild.

When I reboot, 2 things are happening:

1. It is saying "Unmountable disk present" for disk 1. Not sure what I did wrong here?

2. None of my dockers are visible; I assume this could be related to 1 above.

Updated diagnostics attached. I have tried 2 rebuilds but can't work out what I have done wrong.

hptower-diagnostics-20211215-1157.zip


trurl · December 15, 2021

Probably slightly too out-of-sync for a clean rebuild. You will have to repair the filesystem on disk1.

https://wiki.unraid.net/Manual/Storage_Management#Drive_shows_as_unmountable
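
To illustrate why parity that is even slightly out of sync gives an unclean rebuild, here is a toy sketch. It assumes single parity computed as a byte-wise XOR across the data disks, and uses short byte strings as stand-ins for whole disks. The names and contents are invented for the example.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equally sized byte strings together, position by position."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy stand-ins for three data disks.
disk1 = b"movie-01"
disk2 = b"movie-02"
disk3 = b"movie-03"

# Parity as it was when everything was in sync.
parity = xor_blocks(disk1, disk2, disk3)

# Clean rebuild: parity plus the surviving disks gives back disk1 exactly.
rebuilt = xor_blocks(parity, disk2, disk3)

# Out-of-sync case: disk2 changed after parity was computed, so the
# rebuilt disk1 is wrong wherever disk2 differs from what parity saw.
disk2_changed = b"movie-XX"
rebuilt_stale = xor_blocks(parity, disk2_changed, disk3)

if __name__ == "__main__":
    print("clean rebuild matches disk1:", rebuilt == disk1)
    print("stale rebuild matches disk1:", rebuilt_stale == disk1)
```

In the stale case only the bytes that changed are wrong, which is consistent with a rebuilt disk that is mostly intact but has damaged filesystem metadata needing repair.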

justintas · December 15, 2021

OK, ran check then repair; the output doesn't look too good.

Here are the last lines of the process:

Metadata corruption detected at 0x44d778, xfs_bmbt block 0xec37d798/0x1000
libxfs_bwrite: write verifier failed on xfs_bmbt bno 0xec37d798/0x1000
Maximum metadata LSN (2146145896:-2144772351) is ahead of log (22:71999).
Format log to cycle 2146145899.
xfs_repair: Releasing dirty buffer to free list!
cache_purge: shake on cache 0x5021c0 left 3 nodes!?
xfs_repair: Refusing to write a corrupt buffer to the data device!
xfs_repair: Lost a write to the data device!
fatal error -- File system metadata writeout failed, err=117. Re-run xfs_repair.

Options? The above was run in maintenance mode, which is the only way to activate the check option.

I do have another disk available to try as a replacement for SDC. The original SDC is still available but jammed in the drive cage.

trurl · December 15, 2021

2 hours ago, justintas said:

Original SDC

Do you mean original disk1? The sdX designations aren't very useful since they can change with hardware changes or even just reboots.

trurl · December 15, 2021

2 hours ago, justintas said:

ran check then repair

Since you mentioned SDC, I have to wonder. How exactly did you do the check and repair? Best is to run it from the webUI so the correct designation gets used automatically. If you did it from the command line, what was the exact command you used?

justintas · December 15, 2021

Yes, the original disk is jammed in the disk cage, hence I swapped to a new cage and put the new disk in as disk 1 (sdc).

Are the errors recoverable?

justintas · December 15, 2021

Check and repair was from the GUI: check first with the default setting -n (had to put the array into maintenance mode to run it),

then ran check again with a blank option.

Were those the correct steps?

trurl · December 16, 2021

28 minutes ago, justintas said:

original disk is jammed in disk cage

By "jammed" do you mean it can't be removed for some reason?

28 minutes ago, justintas said:

new disk in as disk 1 (sdc)

Might as well forget about that sdc designation. If you want to identify a specific drive assignment, disk1 is the way to go. If you want to identify a specific drive, some unique portion of the serial number is most useful, often the last 4 characters will work for many models.

26 minutes ago, justintas said:

check and repair was from gui

So it would have used the correct designation, which in this case would be /dev/md1. Specifying the md device is necessary to get parity updated with repair so it remains valid.
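
For reference, here is a minimal sketch of how the two GUI runs described in this thread (check with -n, then repair with the option left blank) would look as command lines against the md device. It only builds the argument list and does not run anything. The `/dev/md<N>` naming follows this post and may differ on other Unraid versions, and the helper name is hypothetical.

```python
def xfs_repair_command(disk_number: int, check_only: bool = True) -> list[str]:
    """Build the xfs_repair argument list for an array data disk.

    check_only=True adds -n (no modify), matching the GUI default used for
    the first run. check_only=False leaves the option out, which lets
    xfs_repair write fixes. The array should be in maintenance mode.
    """
    if disk_number < 1:
        raise ValueError("disk numbers start at 1")
    # Assumption: the md device is named /dev/md<N>, as stated in the post.
    device = f"/dev/md{disk_number}"
    args = ["xfs_repair"]
    if check_only:
        args.append("-n")
    args.append(device)
    return args


if __name__ == "__main__":
    print(" ".join(xfs_repair_command(1, check_only=True)))
    print(" ".join(xfs_repair_command(1, check_only=False)))
```

Pointing the tool at the md device rather than the raw sdX device is what keeps parity updated during the repair, as noted above.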

Might be useful to try to get the data from the original disk. Can you mount it as an Unassigned Device?

justintas · December 16, 2021

2 minutes ago, trurl said:

By "jammed" do you mean it can't be removed for some reason?

Yes, it's jammed in the cage and can't be removed; a screw must have moved.

2 minutes ago, trurl said:

Might as well forget about that sdc designation. If you want to identify a specific drive assignment, disk1 is the way to go. If you want to identify a specific drive, some unique portion of the serial number is most useful, often the last 4 characters will work for many models.

So it would have used the correct designation, which in this case would be /dev/md1. Specifying the md device is necessary to get parity updated with repair so it remains valid.

Might be useful to try to get the data from the original disk. Can you mount it as an Unassigned Device?

Yes, disk 1 is green but showing as unmountable. Mounting the existing drive will be hard as the cage is damaged. Will see if I can cut the drive out of the cage.

Any other options? Pictures attached in case they help.

justintas · December 16, 2021

OK, tried the check process again. Here are the results below, and guess what, it is fixed!

Thanks for your guidance much appreciated

Re-ran check with -n
Results:
xfs_repair status:
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
would have reset inode 4328816385 nlinks from 1 to 2
would have reset inode 4328816389 nlinks from 1 to 2
would have reset inode 4328816398 nlinks from 1 to 2
No modify flag set, skipping filesystem flush and exiting.

Then re-ran check with the option left blank to do a repair
Results:
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
resetting inode 4328816385 nlinks from 1 to 2
resetting inode 4328816389 nlinks from 1 to 2
resetting inode 4328816398 nlinks from 1 to 2
done

justintas · December 16, 2021

Should I do another parity rebuild before changing parity disk to larger disk ?

trurl · December 16, 2021

13 hours ago, justintas said:

Should I do another parity rebuild before changing parity disk to larger disk ?

No point unless you just want to exercise your hardware. Parity will be built to the new larger disk whether your current parity is valid or not, or even if you had no parity disk before.
