Tried to replace and rebuild a failed disk, errors in a second disk began... now fearful of losing everything


egeis


Diagnostics attached.

 

Disk2 had an X, so I bought the same size 2TB disk and followed all the directions on https://wiki.unraid.net/Replacing_a_Data_Drive

The process began last night and was only 2.6% done the next morning with read errors now occurring on Disk1.

I paused the rebuild, shut everything down, checked all cables, and restarted in maintenance mode.

It first said no disks have read errors.

I went to disk 1 and ran xfs_repair on disk1. During phase 2 it told me:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
xfs_repair: read failed: Input/output error
btree block 3/4093184 is suspect, error -117
bad magic # 0 in btcnt block 3/4093184
agf_freeblks 83999918, counted 1229868 in ag 3
agf_longest 68382104, counted 3995 in ag 3
agf_btreeblks 26, counted 25 in ag 3
sb_fdblocks 273855784, counted 191085733
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0

It's still running...

 

Status is now saying "array has errors - Array has 1 disk with read errors" even though the disk passed the SMART overall-health check. It seems ridiculous that a second HD would fail immediately upon trying to rebuild the first one to fail in 3 years.

 

I'm petrified of starting the array and losing my data. And I don't want to kick off the rebuild a second time and cause more problems.

 

Is there anything I can do to avoid losing data now that a second disk appears to be failing while parity is already covering the first?

 

 

unraidtower-diagnostics-20211113-1504.zip

Link to comment

No way to tell from those diagnostics if there was really anything wrong with disk2 that you replaced, since it is no longer attached, and you rebooted since then so nothing about that disk in your diagnostics. Bad connections are much more common than bad disks.

 

On the other hand, disk1 really is BAD. So BAD that it seems like it must have been telling you about that disk for a long time. Did you not notice SMART warnings on the Dashboard page for disk1? Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

 

Do you still have the original disk2? Have you written anything to your server since you replaced disk2?

Link to comment

Disk 1 had a green balloon and I was not getting any errors via notifications. Every error was referencing disk 2. Until 4 days ago, most everything I run was running just fine, plex, miniDLNA, deluge, couchpotato, nextcloud, etc. etc. When I started looking at data before I did the swap last night, the shares were running SLOW... 5-10 seconds just to display folders at the lowest level.

 

Nothing has been written to the array since I made the swap last night. I can put it back in if necessary. I'm ready to try anything.

 

I'm still in maintenance mode. Hoping to learn how to save whatever I can. 

 

This is the updated xfs_repair status

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
xfs_repair: read failed: Input/output error
btree block 3/4093184 is suspect, error -117
bad magic # 0 in btcnt block 3/4093184
agf_freeblks 83999918, counted 1229868 in ag 3
agf_longest 68382104, counted 3995 in ag 3
agf_btreeblks 26, counted 25 in ag 3
sb_fdblocks 273855784, counted 191085733
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
xfs_repair: read failed: Input/output error
cannot read inode 1098603328, disk block 1001618784, cnt 32

 

Edited by egeis
Link to comment
22 minutes ago, egeis said:

Disk 1 had a green balloon

That is expected if the disk isn't disabled, but disks can be disabled for different reasons, and one of the most common is bad connections.

1 hour ago, trurl said:

Did you not notice SMART warnings on the Dashboard page for disk1?

On the Dashboard page where the disks are shown, there is a column for SMART with a thumbs up or thumbs down indicator. I suspect disk1 had been showing thumbs down before you even had a problem with disk2.

 

XFS repair is failing on disk1 because the disk itself is failing. Was disk1 actually saying it was unmountable?

 

Do you have another copy of anything important and irreplaceable?

 

Stop the array, unassign disk2, start the array in Normal mode, then post new diagnostics.

 

 

Link to comment

Maybe it was thumbs down, but honestly didn't see it. I only saw it on disk 2. It's definitely yellow thumbs down on disk 1 right now.

 

Definitely got lots of irreplaceable stuff. I was able to access the folder I cared most about and it's empty. This is really really really bad.

 

I have gone over in my head how much of an a** I was to imagine that my backup solution was redundant and to not back things up again. This is literally, the worst possible case situation right now. I spent two months setting up this server to be my redundant backup solution... I saw methods for replacing motherboards, CPUs, etc. etc. 

 

Here's the new diagnostics file. 

 

unraidtower-diagnostics-20211113-1744.zip

Link to comment
1 hour ago, egeis said:

this server to be my redundant backup solution

Parity is not a substitute for backup, and Unraid can only be a backup solution if it is, in fact, a backup of files you have elsewhere.

1 hour ago, egeis said:

the new diagnostics file.

Disk1 is mounted so don't know why you were trying to repair its filesystem. It is a failing disk though.

 

Disk2 is not assigned but emulated disk2 is mounted. Probably you aren't going to be able to rebuild it with that failing disk1 in the array though.

 

Do you have another port you can plug the original disk2 into? If original disk2 is good, maybe we can use it to rebuild disk1 to that new disk instead.

 

Link to comment

If the original disk 2 is truly out of commission (as I assume given the X that was next to it when I replaced it), what other options exist here?

 

Just saw your reply. Reconnecting the original disk 2 now. The new diagnostics are uploaded here. It looks like Unraid is seeing the old disk 2 as a "new device", so I'm very fearful of spinning up the array and having Unraid automatically start formatting and syncing...

 

If this disk happens to still work and has retrievable data, how do I prevent Unraid from going forward with a process to rebuild data?? 

 

unraidtower-diagnostics-20211113-2326.zip

Edited by egeis
I have a new question in reference to the issue.
Link to comment
18 hours ago, egeis said:

original disk 2 is truly out of commission (as I assume given the X that was next to it when I replaced it)

  

On 11/13/2021 at 5:30 PM, trurl said:

disks can be disabled for different reasons, and one of the most common is bad connections.

 

SMART for original disk2 looks good. So, we are going to New Config original disk2 back into the array, and rebuild disk1 instead.

  1. Go to Settings -> Disk Settings and set Autostart to No
  2. Shut down, remove bad disk1, install the new replacement disk in its place, reboot
  3. Tools -> New Config -> Retain current configuration: All -> Apply
  4. Check all assignments, make sure original disk2 is assigned as disk2, and assign the new replacement disk as disk1
  5. IMPORTANT - Check both "parity is already valid" and "maintenance mode" and start the array
  6. Stop array
  7. Unassign disk1 (new replacement disk)
  8. Start array (in normal mode now); emulated disk1 and all other disks should be mountable and show their size and usage.
  9. If all disks mount, stop the array, re-assign disk1 (new replacement disk), and start the array to begin rebuilding disk1 to the new replacement disk.

If you have any questions about any of this, ask. If something doesn't go according to plan, let us know and post screenshot of Main and new diagnostics.

 

Link to comment

Huge Question: I'm on Step 3: "Tools -> New Config" and I'm reading the warning... it is a very scary warning... specifically:

"DO NOT USE THIS UTILITY THINKING IT WILL REBUILD A FAILED DRIVE - it will have the opposite effect of making it impossible to rebuild an existing failed drive - you have been warned!"

 

I'm open-minded, so I can absolutely get over the strong wording on that warning, but can you give me a little reassurance that there's a good reason that those words don't necessarily apply to this situation?

 

There is a part of me that is thinking I should take all 4 disks to a data recovery shop before I do anything that could destroy the data on the drives. But I'm willing to give this a shot, assuming that the Unraid OS will recognize that 3 of the 4 disks already come from a previous parity configuration.

 

By the way, just want to give huge thanks to all of you who support n00bs like us on these forums and work with this software, cause I can't imagine it's easy, and I can't imagine the level of complexity involved in understanding the nuances of running this kind of project. If any of you ever need some help with analytics projects, hit me up!

Link to comment

New Config takes all disks just as they are and optionally (by default) rebuilds parity based on all the data disks. The warning is because normally if you have a data disk that is disabled and needs rebuilding, New Config will make it so no disks are disabled, so Unraid won't know anything needs rebuilding. And if you rebuild parity then it will match all the data disks and so even if you made it rebuild a data disk after rebuilding parity, it wouldn't change the data disk because parity is already in sync with it and all other disks.

 

We want to use New Config to make it so disk2 isn't disabled (or anything else), and NOT rebuild parity. Since we don't change parity, it should still be as needed to rebuild any disk that we disable. Then in step 8, we disable disk1 by starting the array with the disk unassigned. After that Unraid wants to rebuild disk1 to that new disk when it is assigned.

 

Rebuild will read all disks to get the result of the parity calculation, and write that result to the rebuilding disk. We need parity and disk2 and all other disks that parity is based on in order to get the result of the parity calculation to rebuild disk1.
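To make the parity calculation concrete, here is a minimal sketch, assuming single parity behaves as a plain bytewise XOR across the data disks. The 4-byte "disks" and the `xor_blocks` helper are made up for illustration and are not Unraid code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-byte "disks"; a real disk is just a much longer byte string.
disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes([0xAB, 0xCD, 0xEF, 0x01])
disk3 = bytes([0x00, 0xFF, 0x10, 0x20])

# Parity is the XOR of every data disk.
parity = xor_blocks([disk1, disk2, disk3])

# Rebuilding disk1 reads parity plus all the OTHER data disks and XORs them.
# Nothing here writes to parity, disk2, or disk3; they are only read.
rebuilt_disk1 = xor_blocks([parity, disk2, disk3])

print(rebuilt_disk1 == disk1)
```

This is also why a read error on any one of the other disks during the rebuild matters: every position of the rebuilt disk depends on the same position of all the others.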

Link to comment

Another thing to note with this procedure. Since rebuilding only changes the rebuilding disk (that new disk), all other disks remain the same and are only read. If it turns out that everything doesn't go perfectly, you are no worse off than you were.

 

There is some possibility that emulated disk1 won't be mountable. Maybe even a chance that original disk2 isn't mountable even though the emulated disk2 mounted. If that is the case we can try to repair the filesystems on any disk that doesn't mount.

 

Be careful when working with the disks, cables, etc. It is very important that we have good connections for this to work well.

Link to comment

Alright... so after running step 3, and looking back at the MAIN menu, disk2 disappeared from the array devices list. It's no longer assigned, and it's no longer in any of the dropdown menus... It is now under Unassigned devices and labeled "sdd", next to the replacement disk for disk1. I have no idea why it says that it is 4.14 GB. It's a 2TB Seagate. I'm afraid it might be the "Hybrid" SSD part of the disk that is the only thing being recognized.

image.thumb.png.a1e4dde2b4189c4566bf0b4a118cd38b.png

 

Here's a snippet from the disk log:

image.png.55ed39144f71c4485fb9ca5be28cdd61.png

Edited by egeis
Link to comment

So I rebooted and all four disks are now able to be allocated... though I still think disk2 is suspect given my previous post. Looking forward to your suggestions of whether I should continue given the previously mentioned errors.

Edited by egeis
Link to comment

I'm on step 8. All disks mounted and there looks to be emulation. I'm copying over some super important stuff before I start the rebuild. So HUGE thanks for getting me this far.

 

Note... maybe I'm a very unaware person, but a few basic text files I was using for keeping records don't look like they're "up-to-date"... last modified date is 3 months ago and I'm pretty sure I made updates to that file in October. Is it possible that modifications would get "missed" from missing a disk? Seems super weird.

 

I'll rebuild the disk as soon as I get some stuff backed up.

Link to comment
4 hours ago, egeis said:

Note... maybe I'm a very unaware person, but a few basic text files I was using for keeping records don't look like they're "up-to-date"... last modified date is 3 months ago and I'm pretty sure I made updates to that file in October. Is it possible that modifications would get "missed" from missing a disk? Seems super weird.

This thread began with a disabled (emulated) disk2, then later we put the original disk2 back in with New Config so we could rebuild bad disk1 instead. If anything was written to the emulated disk2 then of course it wouldn't be on the original disk2 since original disk2 was not used again after becoming disabled. Hence the reason for my question:

  

On 11/13/2021 at 4:21 PM, trurl said:

Have you written anything to your server since you replaced disk2?

Perhaps I should have said "have you written anything to your server since disk2 became disabled". Maybe you didn't know when it became disabled and had been using it like that for a while. You didn't know disk1 had bad SMART. You really need to set up Notifications.

 

If you did indeed write to the emulated disk2, then parity will not be totally in sync with the original disk2 and so might cause problems for rebuilding disk1 accurately. We will just have to see how it goes and repair filesystem if needed.
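A small sketch of why that matters, again assuming single parity is a bytewise XOR. The disk contents and the `xor_blocks` helper are hypothetical and only illustrate the idea.

```python
def xor_blocks(*blocks):
    """XOR equal-length byte strings together, position by position."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Hypothetical contents at the moment disk2 became disabled.
disk1 = b"AAAA"
original_disk2 = b"BBBB"
parity = xor_blocks(disk1, original_disk2)

# While disk2 is disabled, a write lands on the EMULATED disk2.
# The physical original disk2 is untouched; only parity changes.
emulated_disk2 = b"BBxB"
parity = xor_blocks(disk1, emulated_disk2)

# Later the original disk2 is put back and parity is trusted as-is.
rebuilt_disk1 = xor_blocks(parity, original_disk2)

# Positions where the rebuilt disk1 differs from the real disk1.
mismatched = [i for i in range(len(disk1)) if rebuilt_disk1[i] != disk1[i]]
print(mismatched)
```

The rebuilt disk1 is wrong only at the positions that were written while disk2 was emulated, which is why a filesystem repair afterwards may be enough if few writes happened.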

Link to comment

I'll let you know. I just backed up a ton of data and I'm ready to start the rebuild. I'm also going to follow the rebuild by cloning the parity drive to a new and larger disk, then move the parity disk in to replace the suspicious disk2. 

 

I definitely haven't "knowingly" written new information to the files since the system acted up. If it was operating from an emulated disk for more than 2 months, then what you're saying makes sense. I'll be sure to integrate more monitoring protocols and email updates...  

Link to comment

OK... so everything is saying it's done... but I did not format disk 1... cause I thought that would automatically be part of the process... It's currently saying that the new disk is unmountable. 

image.thumb.png.f2542a8653ed337156ac6d2c1c786f92.png

 

I saw this warning... probably should've asked the question before the process began... Not a big deal to lose another 3 hours overnight... assuming the right thing to do is to format and restart.

image.thumb.png.6a85003c8beb5ed8129fcfb211475615.png

 

 

unraidtower-diagnostics-20211115-2056.zip

Here, I'm attaching a new diagnostics file. Should I restart the process? 

 

 

Link to comment
3 minutes ago, egeis said:

assuming the right thing to do is to format and restart.

No, absolutely not. Format is NEVER part of rebuild. Format means write an empty filesystem to this disk. That is what it has always meant in every operating system you have ever used. If you format a disk in the parity array, Unraid treats that write operation just like any other, by updating parity. So, after formatting a disk in the parity array, parity agrees the disk has an empty filesystem.
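As a rough illustration, assuming single parity is a bytewise XOR and that a write updates parity by XORing out the old data and XORing in the new data: the byte values, the all-zero "empty filesystem", and the helper names below are stand-ins, not real on-disk structures.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_after_write(old_data, new_data, old_parity):
    """Parity after overwriting old_data with new_data on one disk."""
    return xor_bytes(old_parity, xor_bytes(old_data, new_data))


other_disk = b"\x10\x20\x30\x40"
unmountable_disk = b"\xde\xad\xbe\xef"  # the contents we actually want back
parity = xor_bytes(other_disk, unmountable_disk)

# "Format" the disk while it is in the array: an ordinary write as far as
# parity is concerned, so parity is updated to match the new contents.
empty_filesystem = b"\x00\x00\x00\x00"  # stand-in for a freshly formatted disk
parity = parity_after_write(unmountable_disk, empty_filesystem, parity)

# Any later rebuild from this parity can only return the empty filesystem.
rebuilt = xor_bytes(parity, other_disk)
print(rebuilt == empty_filesystem)
```

Once parity has been updated this way, the old contents are no longer recoverable from parity.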

 

 

Link to comment
19 minutes ago, egeis said:

wasn't going to format it while it was assigned

Just FYI

 

Formatting it outside the array before using it to rebuild would have been pointless since that empty filesystem would be completely overwritten by rebuild.

 

And if you removed it from the array now to format it, that would be even more pointless, because you would have to rebuild it again.

 

Rebuild doesn't require a formatted disk or a cleared disk, because every bit of the disk is going to be completely overwritten with the results of the parity calculation during rebuild.

Link to comment
