Several Simultaneous Failures


19 minutes ago, trurl said:

Wouldn't trust parity be good enough?

Thinking more about it, I think you're saying to trust parity with the old disk and then replace it. In that case, yes, it would be similar, and there would be no problem replacing it with a larger disk. But since the disk appears to be failing badly, just mounting it could result in read errors. You could start the array in maintenance mode, though; that should be OK.

Link to comment
48 minutes ago, johnnie.black said:

Do you have a spare? Same size or larger.

I will tomorrow evening.

 

The array does start in maintenance mode, and so far it also starts in regular mode, but I've been keeping it in maintenance mode to minimize any corruption.

 

I think you're saying that the best thing to do is just to wait until I receive the new disk. If so, when I receive the replacement drive, what is the process I should follow? Sorry, I got a bit confused about the recommended path with all the back and forth.

Link to comment
6 hours ago, srfnmnk said:

Anything I should do between now and then?

no

 

11 hours ago, srfnmnk said:

what is the process I should follow?

invalidslot might be the more direct way to do it all at once, instead of trusting parity first and then replacing. But invalidslot requires overriding the webUI from the command line at one point in the process.

 

Let us know when you get the replacement and we'll see if we can reach consensus.

Link to comment

Invalidslot is a cleaner way of doing it; just make sure you follow the instructions carefully and ask if there's any doubt.

 

-Tools -> New Config -> Retain current configuration: All -> Apply
-Check all assignments and assign any missing disk(s) if needed, including the new disk14; the replacement disk should be the same size or larger
-Important - After checking the assignments leave the browser on that page, the "Main" page.

-Open an SSH session/use the console and type (don't copy/paste directly from the forum, as sometimes it can insert extra characters):

mdcmd set invalidslot 14

-Back in the GUI, and without refreshing the page, just start the array. Do not check the "parity is already valid" box. The GUI will still warn that data on the parity disk(s) will be overwritten; this is normal, since it doesn't account for the invalidslot command, and parity won't actually be overwritten as long as the procedure was done correctly. Disk14 will start rebuilding. The disk should normally mount immediately (possibly not in this case, since parity won't be 100% valid), but if it's unmountable, don't format it; wait for the rebuild to finish and then run a filesystem check.

 


Link to comment
9 hours ago, srfnmnk said:

Why is that necessary/better than rebuilding the config?

Normally, when you do a New Config, parity is rebuilt by default. You don't want to rebuild parity; you want to use existing parity to rebuild disk14 instead. Invalidslot lets you specify a different disk to rebuild during New Config.

Link to comment

HOLY SMOKES -- I think it's working. I cannot thank you two enough for all your help. I never would have figured this out. Another reason to keep recommending Unraid.

 

I am attaching some screenshots through the process in case it may help anyone in the future.

 

The million dollar question is: how does one drive going bad cause two other drives to get disabled in the array? Do you have any insights on this? I'd also like your insights on the quote below. Is my turbo write / md_write_method set up properly and safely?

On 8/27/2020 at 9:14 AM, srfnmnk said:

The only potentially dangerous thing I do (I think) is that I had "md_write_method" set to reconstruct_write, but I have the reconstruct-write tool "Turbo Write" set to allow 10 disks (of 16) spun down... I thought this was safe...

 

Thank you for your input.

 

@trurl you mentioned that there were some opportunities regarding the way my dockers are configured. I agree, I would love your input.

On 8/26/2020 at 8:59 AM, trurl said:

There are some other things about how you have dockers configured that we can discuss later after your array is stable.

final_rebuilding.png

start.png

syslog_error_before_start_after_ssh.png

array_after_ssh_before_start.png

ssh_command.png

Link to comment
18 minutes ago, srfnmnk said:

the way my dockers are configured.

Reviewing your diagnostics, I think I must have been referring to the fact that you have allocated 80G to docker.img, and are using 26G of that.

 

My usual recommendation is only 20G allocated for docker.img. Anytime I see someone with more than that it makes me wonder if they don't have some application writing to a path that isn't mapped.

 

I have 20G allocated to docker.img. I am running 17 dockers, and they are using less than half of that 20G.

 

Have you had problems filling docker.img? Making it larger will not fix anything; it will only make it take longer to fill.

 

The usual reason for using more space than necessary in docker.img is for an application to write data into the docker.img. That will happen when it writes to a path that isn't mapped to host storage. Common mistakes are writing to a path that doesn't exactly match the mapped container path with regard to upper/lower case, or writing to a relative path (what is it relative to?)
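A toy Python model of the mapping mistake described above (the paths here are made up for illustration; real Docker resolves bind mounts in its storage driver, not like this): container paths are case-sensitive, so a write to a path that doesn't exactly match a mapping never reaches host storage and instead lands inside docker.img.

```python
# Hypothetical container-path -> host-path mapping, as set on a docker template.
mappings = {"/config": "/mnt/user/appdata/myapp"}

def resolve(container_path):
    """Return where a write to container_path actually ends up (toy model)."""
    for cpath, hpath in mappings.items():
        # Exact match or a sub-path of the mapped container path.
        if container_path == cpath or container_path.startswith(cpath + "/"):
            return hpath + container_path[len(cpath):]
    return "docker.img"  # unmapped path -> the write stays inside the image

print(resolve("/config/settings.json"))  # mapped -> host storage
print(resolve("/Config/settings.json"))  # case mismatch -> fills docker.img
```

The same thing happens with a relative path: it resolves against the container's working directory, which is almost never a mapped location.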

Link to comment
7 minutes ago, srfnmnk said:

can a drive failing cause bad writes

A failing drive can certainly cause bad writes to that drive. Writes to one drive are unrelated to writes to other drives, since each disk is an independent filesystem. The exception, of course, is that parity is always updated when a data drive is written, but even there, parity is only disabled when a write to the parity disk itself fails.
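A toy model of why writes to one disk don't involve the others (purely illustrative; Unraid's real parity lives in the md driver, and each "disk" here is just one byte): with single parity, parity is the XOR of all data disks, so a write to one data disk is a read-modify-write of that disk and parity only.

```python
from functools import reduce

def compute_parity(disks):
    """Single parity is the XOR of all data disks (toy model)."""
    return reduce(lambda a, b: a ^ b, disks)

data = [0x11, 0x22, 0x33]        # three data "disks"
parity = compute_parity(data)

# Write to disk 1: update that disk and parity; disks 0 and 2 are not touched.
old, new = data[1], 0x55
parity ^= old ^ new              # fold the change into parity
data[1] = new

assert parity == compute_parity(data)  # parity stays in sync
```

This is also why a disk gets disabled on a failed write: the parity update still happened, so the physical disk is now out of sync with parity.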

 

11 minutes ago, srfnmnk said:

happenstance that one drive started to go bad and two others received bad writes within just a few moments?

Do you really know it was "within just a few moments"? If you know exactly when these events occurred that would point to where in your syslog to look for them.

Link to comment

Here is how this whole disable and emulation thing works.

 

When a write to a disk fails, Unraid disables the disk.

 

If the disk is a data disk, the write is still used to update parity. So that failed write can be recovered when the disabled disk is rebuilt. The disk is disabled because it is no longer in sync with parity.

 

After a disk is disabled, the actual disk is not used again until it is rebuilt (or in your case, a New Config, see below). Instead, the disk is emulated by reading all other disks to get its data. The emulated disk can be read, and it can also be written by updating parity. So writes to the emulated disk continue even when the disk is disabled. Those writes can be recovered by rebuilding the disk from the parity calculation.
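The emulation described above can be sketched with the same toy XOR model (illustrative only, not Unraid's actual implementation): a read of the disabled disk is reconstructed from parity plus all the other disks, and a write to the emulated disk is folded into parity so a later rebuild recovers it.

```python
from functools import reduce

def xor_all(vals):
    return reduce(lambda a, b: a ^ b, vals)

data = [0x0A, 0x0B, 0x0C]                 # three data "disks"
parity = xor_all(data)

disabled = 1                               # disk 1 is disabled
others = [d for i, d in enumerate(data) if i != disabled]

# Read of the emulated disk: XOR of parity and every other disk.
emulated = xor_all(others + [parity])
assert emulated == data[disabled]

# Write to the emulated disk: only parity changes; the physical disk stays stale.
new_val = 0xFF
parity ^= emulated ^ new_val

# Rebuilding later recovers the written value from parity + the other disks.
rebuilt = xor_all(others + [parity])
assert rebuilt == new_val
```

Note that every read of the emulated disk requires reading all the other disks, which is why a second failing disk (like your disk14) corrupts the emulation.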

 

And, rebuilding the disk is the usual way to recover from this, because the disk is no longer in sync with parity, since parity contains writes that happened with the disk disabled.

 

It is also possible to enable all disks again by setting a New Config and rebuilding parity, thus getting parity back in sync with all the data disks in the array. But any writes to that disk that happened with the disk disabled are lost when you take that option.

 

In your case, the actually failing disk14 was contributing bad data to the emulation of those disabled disks. That resulted in those emulated disks being unmountable. But the actual disks were still mountable, as we discovered. Technically, parity is out-of-sync with those disks, but maybe not much. The rebuild of disk14 is relying on that "not much".

 

One final note. If a read from a disk fails, Unraid will try to get its data from the parity calculation by reading all the other disks, and then try to write that data back to the disk. If that write fails the disk is disabled. So, it is possible for a failed read to cause a failed write that disables the disk.
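That read-error path can be sketched like this (a toy sketch, not Unraid's actual md driver logic; the classes and names are invented for the example): a failed read is reconstructed from parity and the other disks, the result is written back to repair the sector, and the disk is only disabled if that write-back fails.

```python
class ToyDisk:
    """Hypothetical disk that can be configured to fail all writes."""
    def __init__(self, fail_writes=False):
        self.data = {}
        self.disabled = False
        self.fail_writes = fail_writes

    def write(self, sector, value):
        if self.fail_writes:
            raise IOError("write failed")
        self.data[sector] = value

def handle_read_error(disk, sector, reconstruct):
    value = reconstruct(sector)    # rebuild the data from parity + other disks
    try:
        disk.write(sector, value)  # try to repair the bad sector in place
    except IOError:
        disk.disabled = True       # write-back failed -> out of sync, disable
    return value                   # the read itself still succeeds

good = ToyDisk()
handle_read_error(good, 7, lambda s: 0xAB)   # repaired in place, stays enabled

bad = ToyDisk(fail_writes=True)
handle_read_error(bad, 7, lambda s: 0xAB)    # write-back fails -> disabled
```

So a disk can end up disabled without any write from you: a plain read error is enough, if the repair write also fails.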

Link to comment
7 hours ago, johnnie.black said:

Disk14 is unmountable

@johnnie.black You do mean the original Disk14, right? The new disk14 (the replacement) is being rebuilt and things seem to be going just fine

7 hours ago, johnnie.black said:

please post current diags

Will do after rebuild. 

 

@trurl I will get back to your comments in a bit. I haven't read them all the way through yet but they look super nice and juicy.

 

Sheesh -- I hate these disks. Got another one going out now. Became disabled during the rebuild. Can't wait for all of them to die to be honest.

ST3000DM001-1ER166_Z501E486 - 3 TB (sdj)

Link to comment
10 minutes ago, srfnmnk said:

You do mean the original Disk14, right? The new disk14 (the replacement) is being rebuilt and things seem to be going just fine

Whatever is being emulated is what is being rebuilt. On Main, does it say the emulated disk is unmountable?

 

12 minutes ago, srfnmnk said:

Sheesh -- I hate these disks. Got another one going out now. Became disabled during the rebuild. Can't wait for all of them to die to be honest.

Diagnostics now might be better than waiting until after the rebuild, especially since you have a new problem.

 

Link to comment

I'm a little confused about your screenshots. They both seem to indicate the rebuild of disk14 completed, but disk18 is now disabled.

 

The 1st screenshot shows 2 unmountable disks: the rebuilt disk14 and the disabled disk18. The 2nd screenshot shows only 1 unmountable disk, the rebuilt disk14, with the disabled disk18 mounted.

 

The diagnostics indicate both 14 and 18 are not mounted.

 

Do you have any new information to add or anything else that might clear up this discrepancy?

 

Also, what itimpi said above.

6 hours ago, srfnmnk said:

I'm guessing I need to format it.

BAD GUESS

 


Link to comment
