Trying to understand what went wrong with data-rebuild

discreet-booby4798 · March 23

I was looking to encrypt my drives, so I believe I had migrated everything off of my Toshiba 1tb disk. That's to say this failing disk that I am trying to rebuild, shouldn't have anything on it. But I would like to know I know how to rebuild failed disks correctly before the next emergency...

For some reason my Toshiba 1tb disk started failing, it was mounted directly on a pcie card with two sata ports. A second disk (my seagate 2tb) on that card did not fail, so it couldn't be a cable issue, but could be a card issue, I digress...

My two parity drives emulated the failing Toshiba disk, so I ordered a new Samsung 4 tb disk, popped it in, read through https://docs.unraid.net/unraid-os/manual/storage-management/. I am 99% certain I went through the https://docs.unraid.net/unraid-os/manual/storage-management/#normal-replacement Normal replacement process.

As I recall there were 132 errors in the rebuild process, and I through that was weird/bad. (See screen shot)

I thought I would run a parity-check, now I am getting 3418046386 errors, and my 4tb Samsung is showing as unmountable or no file system.

What am I doing wrong?

At some point I rebooted, so I hope that didn't reduce the chance of resolving this.

servernas2-diagnostics-20240323-0740.zip servernas2-syslog-20240323-1136.zip

JorgeB · March 23

During disk3 rebuild parity2 was already wrong, meaning there was an issue before that, do you know what could have caused that?

Mar 17 16:28:00 ServerNas2 kernel: md: recovery thread: recon D3 ...
Mar 17 16:28:00 ServerNas2 kernel: md: recovery thread: Q corrected, sector=0
Mar 17 16:28:00 ServerNas2 kernel: md: recovery thread: Q corrected, sector=8
Mar 17 16:28:00 ServerNas2 kernel: md: recovery thread: Q corrected, sector=16
Mar 17 16:28:00 ServerNas2 kernel: md: recovery thread: Q corrected, sector=24
Mar 17 16:28:00 ServerNas2 kernel: md: recovery thread: Q corrected, sector=32
Mar 17 16:28:00 ServerNas2 kernel: md: recovery thread: Q corrected, sector=40
Mar 17 16:28:00 ServerNas2 kernel: md: recovery thread: Q corrected, sector=48
Mar 17 16:28:00 ServerNas2 kernel: md: recovery thread: Q corrected, sector=56
Mar 17 16:28:00 ServerNas2 kernel: md: recovery thread: Q corrected, sector=64

discreet-booby4798 · March 23

As far as what cause the original parity problem.

- possibly power loss, that went out shortly before these problems but the machine is ancient so who knows, my newer machines are on UPS.

If I suspected I had data on the disk how might we go about this?

- How can I trouble shoot the bad parity?

- How might I rebuild from parity again?

- What were the 132 errors?

My real plan is to disable the network and fire up my main app, resilio sync, and reformat disk 3 and rebuild the parity, then see if it want's to write data to my other offsite fail overs.

discreet-booby4798 · March 23

I guess I just want to "rebuild from parity", "forget disk", then "rebuild"; or something like that. "reformatting" will probably delete the disk in parity now that I think about it.

JorgeB · March 23

2 hours ago, discreet-booby4798 said:

possibly power loss

I don't think that explains so many errors on parity2, and from the beginning disk4 also failed to emulate, suggesting parity1 was also not valid, you can try checking filesystem on disk4, if that doesn't work your best bet may be to see if you can recover some data from the old disk.

discreet-booby4798 · March 24

I understand what you are saying about so many errors on the parity disks, but I don't understand why you are talking about disk 4, I only have 3 data disks. I am now more concerned about the large number of errors, because I was trying to use these machines as redundant backup. Do you have any resources on how can I track down the source of these errors?

trurl · March 24

17 minutes ago, discreet-booby4798 said:

talking about disk 4

Probably a typo. Disk3 rebuild was compromised by bad parity.

Check filesystem on disk3 from the webUI not the command line. Post the output.

JorgeB · March 24

5 hours ago, trurl said:

Probably a typo.

Yep, sorry.

discreet-booby4798 · March 25

Ok I ran the check filesystem. (I stripped out the dots)

Phase 1 - find and verify superblock
bad primary superblock - bad magic number !!!

attempting to find secondary superblock
Sorry, could not find valid secondary superblock
Exiting now

JorgeB · March 25

That means there's no valid filesystem on that disk, further suggesting parity wasn't valid during the rebuild, so the rebuilt disk could be corrupt.

discreet-booby4798 · March 25

So pop in the original disk 3 and try to repair that disk?

JorgeB · March 25

If it's still available it's probably your best bet, but for now don't add the disk to the array, see it mounts with the UD plugin.

discreet-booby4798 · March 25

This is what I found.

FS: xfs

Executing file system check: /sbin/xfs_repair -n '/dev/sdh1' 2>&1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used. Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
- scan filesystem freespace and inode maps...
sb_fdblocks 484728573, counted 488140140
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 2
- agno = 3
- agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

File system corruption detected!

trurl · March 25

Do it again without -n. If it asks for it use -L. Post the results.

discreet-booby4798 · March 25

Ok,

It did xfs_repair -e, then it asked me to try and mount it, and it mounted. I unmounted, and ran the check again, and got this...

FS: xfs

Executing file system check: /sbin/xfs_repair -n '/dev/sdh1' 2>&1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 1
- agno = 0
- agno = 2
- agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

No file system corruption detected!

Nothing is in there as expected...

I am recalling that it has been complaining about disk tempature, and then the new disk has been complaining about disk temp...

I think my last question is how do I put this disk back in as disk 3, and fix the parity?

So I am planning on picking up

1. an extra case fan

2. ups

3. ecc ram

Any plugins to "stress test" unraid to find the source of these parity issues?

JorgeB · March 25

39 minutes ago, discreet-booby4798 said:

Nothing is in there as expected...

What do you mean, isn't this old disk3?

discreet-booby4798 · March 25

On 3/23/2024 at 7:59 AM, discreet-booby4798 said:

I was looking to encrypt my drives, so I believe I had migrated everything off of my Toshiba 1tb disk. That's to say this failing disk that I am trying to rebuild, shouldn't have anything on it. But I would like to know I know how to rebuild failed disks correctly before the next emergency...

Yes, ti's the old disk... It was actually a Seagate 2tb disk... but either way I do feel more comfortable that I will be able to recover things next time.

My next concern is the bad parity. Other than the steps above, and setting parity-check to run frequenlty Is there anything else I can do?

47 minutes ago, discreet-booby4798 said:

So I am planning on picking up

1. an extra case fan

2. ups

3. ecc ram

trurl · March 25

14 minutes ago, discreet-booby4798 said:

Yes, ti's the old disk

Have you examined it's data?

15 minutes ago, discreet-booby4798 said:

setting parity-check to run frequenlty

No reason to do that. Most only do monthly or even less frequently.

Trying to understand what went wrong with data-rebuild

Recommended Posts

discreet-booby4798

Link to comment

JorgeB

Link to comment

discreet-booby4798

Link to comment

discreet-booby4798

Link to comment

JorgeB

Link to comment

discreet-booby4798

Link to comment

trurl

Link to comment

JorgeB

Link to comment

discreet-booby4798

Link to comment

JorgeB

Link to comment

discreet-booby4798

Link to comment

JorgeB

Link to comment

discreet-booby4798

Link to comment

trurl

Link to comment

discreet-booby4798

Link to comment

JorgeB

Link to comment

discreet-booby4798

Link to comment

trurl

Link to comment

Join the conversation