While reconstructing an old drive, multiple others started showing errors.

migueldias · September 14, 2021

So I replaced a very old 2TB WD Green drive from my array. It had around 6 years of power-on hours. I replaced it with a 12TB WD white label.

Just started reconstructing (progress is at 1.4%, should take a day to complete). The problem is that shortly after starting the array, I started getting these notifications:

https://i.imgur.com/tK5qhFs.png

and then, a few minutes later:

Checking the Attributes for both:

Suffice it to say I am a bit nervous now. I don't really know how to proceed. I'm guessing I should just leave it and wait until the process ends?

I still have the old drive that I replaced. It had 100+ read errors which is why I replaced it with a 12TB I had.

The Array screen is now looking like this:

Any tip would be really appreciated.

tower-diagnostics-20210914-1114.zip

Edited September 14, 2021 by migueldias

JorgeB · September 14, 2021

Both disks appear to be failing; you can run an extended SMART test on both to confirm.
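For readers who want to check the attributes from a saved report rather than the GUI, here is a minimal sketch. It assumes the text layout printed by `smartctl -A` (ten whitespace-separated columns per attribute row). The function name `flag_smart_attributes`, the list of watched attributes, and the sample values are illustrative assumptions; none of them come from the diagnostics in this thread.

```python
# Attributes commonly watched when judging disk health. This selection is an
# editorial assumption, not something stated in the thread.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",  # often associated with cabling rather than the disk
}


def flag_smart_attributes(report: str) -> dict[str, int]:
    """Return the watched attributes whose raw value is non-zero."""
    flagged = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by this check.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


# Made-up sample in the `smartctl -A` layout; NOT the poster's data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(flag_smart_attributes(SAMPLE))
```

A non-zero raw value here is only a hint; the extended test result is what confirms or rules out a failing disk.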

migueldias · September 14, 2021

10 minutes ago, JorgeB said:

Both disks appear to be failing; you can run an extended SMART test on both to confirm.

Doing that now, but on just one of them as I really want to avoid stressing the drives even more now that there is a rebuild going on.

If they are failing, I'll have to replace the two drives one at a time, which just increases the risk of more data loss.

What an unfortunate event this is. I really can't afford to lose some of the data stored on this array.

I think I will just leave the array untouched until the rebuild finishes (with a bunch of errors, I suspect). Once that is done, I'll stop the array, re-check the SATA connections and run a Parity Check.

JorgeB · September 14, 2021

The rebuilt disk will be corrupt, but if the disks are really failing there will always be some data loss. You can also use ddrescue on all the failing disks; that way you can at least find out which files are corrupt.
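As a rough illustration of how a ddrescue run records what it could not read, the sketch below lists the regions marked as bad in a mapfile. It assumes the GNU ddrescue mapfile layout (comment lines starting with `#`, one status line, then `pos size status` rows where `-` marks a bad-sector block). The function name `bad_regions` and the sample mapfile are made up for illustration. Mapping those byte offsets back to file names is a separate step that is not shown here.

```python
def bad_regions(mapfile_text: str) -> list[tuple[int, int]]:
    """Return (offset, size) in bytes for every block marked '-' (bad sector)."""
    rows = [
        line.split()
        for line in mapfile_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    regions = []
    # rows[0] is the status line (current position and status), so skip it.
    for fields in rows[1:]:
        pos, size, status = fields[0], fields[1], fields[2]
        if status == "-":
            # Values are written as 0x-prefixed hex; base 0 lets int() detect that.
            regions.append((int(pos, 0), int(size, 0)))
    return regions


# Hypothetical mapfile contents; NOT taken from the disks in this thread.
SAMPLE_MAPFILE = """\
# Mapfile. Created by GNU ddrescue
# current_pos  current_status  current_pass
0x00203000     +               1
#      pos        size  status
0x00000000  0x00100000  +
0x00100000  0x00001000  -
0x00101000  0x00101000  +
0x00202000  0x00000200  -
0x00202200  0x00000E00  +
"""

if __name__ == "__main__":
    regions = bad_regions(SAMPLE_MAPFILE)
    for offset, size in regions:
        print(f"unreadable: offset {offset} bytes, length {size} bytes")
    print("total unreadable bytes:", sum(size for _, size in regions))
```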

migueldias · September 14, 2021

7 minutes ago, JorgeB said:

The rebuilt disk will be corrupt, but if the disks are really failing there will always be some data loss. You can also use ddrescue on all the failing disks; that way you can at least find out which files are corrupt.

Thanks.

I'll let the rebuild go on until completion for now, and try to deal with the damage later. I'm still hopeful that this is some false positive as it is very weird that two drives, connected to different controllers, started showing errors just as I started to rebuild another drive.

JorgeB · September 14, 2021

Even if the disks aren't failing, and it really looks like they are, the rebuilt disk will still be corrupt due to the read errors.

migueldias · September 14, 2021

40 minutes ago, JorgeB said:

Even if the disks aren't failing, and it really looks like they are, the rebuilt disk will still be corrupt due to the read errors.

Certainly, but with 36 errors so far (assuming the count stays around 100 by the end), the corruption should be minimal.
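To put a rough number on that, here is a back-of-the-envelope sketch. It assumes each reported read error corresponds to exactly one unreadable block, and the block sizes tried (512 bytes and 4 KiB) are assumptions as well; the thread does not say how errors are counted. The helper name `worst_case_bytes` is hypothetical.

```python
def worst_case_bytes(read_errors: int, block_size: int = 4096) -> int:
    """Upper bound on bytes that cannot be rebuilt, assuming one block per error."""
    return read_errors * block_size


if __name__ == "__main__":
    # 36 is the count reported so far, 100 the guessed final count.
    for errors in (36, 100):
        for block in (512, 4096):
            kib = worst_case_bytes(errors, block) / 1024
            print(f"{errors} errors x {block} B blocks = {kib:.1f} KiB")
```

The byte count is small under these assumptions, but the affected blocks can be spread across many files or land in filesystem metadata, so the number of damaged files matters more than the total size.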

Furthermore, if they are indeed corrupted and I am still rebuilding one disk, I won't ever be able to reconstruct the corrupt data.

I think the best plan now is to let the reconstruction finish, and once that is done I'll use unBALANCE to move all the contents of the two failing 4TB drives onto the new 12TB drive (it will have ~10TB free after the rebuild). After that I'll remove both drives from the array.

Edited September 14, 2021 by migueldias

trurl · September 14, 2021

4 hours ago, migueldias said:

replaced a very old 2TB WD Green drive

Did you check the health of the other disks before deciding to replace that one? Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

4 hours ago, migueldias said:

can't afford to lose some of the data

You must always have another copy of everything important and irreplaceable. Lots of ways to lose data that parity can't help with.

migueldias · September 14, 2021

So I had to stop the rebuild operation and shut off the array due to an unrelated reason.

When I turned it back on, the disk was seen as unmountable. I ran xfs_repair on it and now it mounts fine and it is rebuilding again.

My problem is that now some of my dockers are not working, more specifically binhex-plex.

When I try to start it I have no GUI access and the logs say this:

(...)
2021-09-14 22:13:41,817 INFO exited: plexmediaserver (exit status 255; not expected)
2021-09-14 22:13:41,818 DEBG received SIGCHLD indicating a child quit
2021-09-14 22:13:42,820 INFO spawned: 'plexmediaserver' with pid 67
2021-09-14 22:13:43,074 DEBG fd 8 closed, stopped monitoring <POutputDispatcher at 22622312476576 for <Subprocess at 22622312917696 with name plexmediaserver in state STARTING> (stdout)>
2021-09-14 22:13:43,074 DEBG fd 12 closed, stopped monitoring <POutputDispatcher at 22622312476624 for <Subprocess at 22622312917696 with name plexmediaserver in state STARTING> (stderr)>
2021-09-14 22:13:43,074 INFO exited: plexmediaserver (exit status 255; not expected)
2021-09-14 22:13:43,074 DEBG received SIGCHLD indicating a child quit
2021-09-14 22:13:45,076 INFO spawned: 'plexmediaserver' with pid 72
2021-09-14 22:13:45,215 DEBG fd 8 closed, stopped monitoring <POutputDispatcher at 22622312917216 for <Subprocess at 22622312917696 with name plexmediaserver in state STARTING> (stdout)>
2021-09-14 22:13:45,215 DEBG fd 12 closed, stopped monitoring <POutputDispatcher at 22622312476384 for <Subprocess at 22622312917696 with name plexmediaserver in state STARTING> (stderr)>
2021-09-14 22:13:45,216 INFO exited: plexmediaserver (exit status 255; not expected)
2021-09-14 22:13:45,216 DEBG received SIGCHLD indicating a child quit
2021-09-14 22:13:48,219 INFO spawned: 'plexmediaserver' with pid 77
2021-09-14 22:13:48,334 DEBG fd 8 closed, stopped monitoring <POutputDispatcher at 22622312476480 for <Subprocess at 22622312917696 with name plexmediaserver in state STARTING> (stdout)>
2021-09-14 22:13:48,334 DEBG fd 12 closed, stopped monitoring <POutputDispatcher at 22622312476432 for <Subprocess at 22622312917696 with name plexmediaserver in state STARTING> (stderr)>
2021-09-14 22:13:48,334 INFO exited: plexmediaserver (exit status 255; not expected)
2021-09-14 22:13:48,334 DEBG received SIGCHLD indicating a child quit
2021-09-14 22:13:49,334 INFO gave up: plexmediaserver entered FATAL state, too many start retries too quickly
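A log like this is easier to read once the restart loop is collapsed into counts. The following is a minimal sketch that tallies unexpected exits per process and exit status and notes which processes the supervisor gave up on. The regular expressions are written against the line shapes visible in the excerpt above; the function name `summarise_log` is hypothetical, and the sample reuses a few of the lines quoted above.

```python
import re
from collections import Counter

# Patterns match the "exited" and "gave up" lines as they appear in the excerpt.
EXIT_RE = re.compile(r"INFO exited: (\S+) \(exit status (\d+); not expected\)")
GAVE_UP_RE = re.compile(r"INFO gave up: (\S+) entered FATAL state")


def summarise_log(log_text: str) -> tuple[Counter, set[str]]:
    """Count unexpected exits per (process, status) and collect FATAL processes."""
    exits: Counter = Counter()
    gave_up: set[str] = set()
    for line in log_text.splitlines():
        match = EXIT_RE.search(line)
        if match:
            exits[(match.group(1), int(match.group(2)))] += 1
            continue
        match = GAVE_UP_RE.search(line)
        if match:
            gave_up.add(match.group(1))
    return exits, gave_up


# A few lines copied from the excerpt above.
SAMPLE_LOG = """\
2021-09-14 22:13:41,817 INFO exited: plexmediaserver (exit status 255; not expected)
2021-09-14 22:13:42,820 INFO spawned: 'plexmediaserver' with pid 67
2021-09-14 22:13:43,074 INFO exited: plexmediaserver (exit status 255; not expected)
2021-09-14 22:13:49,334 INFO gave up: plexmediaserver entered FATAL state, too many start retries too quickly
"""

if __name__ == "__main__":
    exits, gave_up = summarise_log(SAMPLE_LOG)
    for (name, status), count in exits.items():
        print(f"{name}: exited {count} time(s) with status {status}")
    print("gave up on:", ", ".join(sorted(gave_up)))
```

The summary only shows that the process dies immediately on every start; the reason would have to come from the application's own log, which is not part of this excerpt.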

When I tried to force update it, it couldn't remove the image, so I installed it again, but the same thing happens when I run it.

Not sure if the xfs_repair corrupted something.

trurl · September 14, 2021

post new diagnostics

migueldias · September 15, 2021

8 hours ago, trurl said:

post new diagnostics

Hi trurl,

Here they are.

tower-diagnostics-20210915-0920.zip

edit: I do have a CA Backup tar file of the appdata folder from 2 days ago.

Edited September 15, 2021 by migueldias

trurl · September 15, 2021

The repair put some files in the lost+found share because it couldn't figure out what they were.

You are having connection problems while trying to rebuild disk1. You should go to Settings and disable Docker until you get your array stable again.
