Several Simultaneous Failures



Hello,


I was just watching a movie when frames started skipping. I went to look and discovered that two drives were disabled and one was piling up read errors. I am using XFS - encrypted. I believe these drives are fairly new (< 3 yrs)... strange how two crashed at once with a third having read errors... What to do? I promptly ordered some new drives (I'm out of spare 8TB disks atm)... recommendations? I do have dual parity, but if there truly is a third disk failing all at the same time... wow. No earthquakes or anything; the server was just sitting in its rack as always.

 

Regardless of drive age and whatnot, having three blow up within 3 minutes doesn't seem reasonable... As for mounting them somewhere else, I can do that, but it's a PITA and I'm not even sure how to mount the XFS-encrypted ones...

 

Thanks.

 

Attached Diagnostics.

Disabled Disks are: 

ST8000VN0022-2EL112 ( still under warranty )

TOSHIBA MD04ACA500 ( probably old disk )

ST3000DM001-1CH166 ( read errors old disk )

pumbaa-diagnostics-20200825-1959.zip

Link to comment

You didn't give the unique portion of the disk serial numbers; often the last 4 characters are all that is needed. I can see from the diagnostics which are the disabled disks though. Unraid won't disable more disks than you have parity drives, so only 2 disks are actually disabled: disks 6 and 19.

 

Disk6: there are 2 disks that match the model number you gave. The last 4 characters of disk6's serial are SAT0.

Disk19: there is only one TOSHIBA, so that isn't a problem. The last 4 characters of disk19's serial are FS9A.
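
For future reference, something like this from the console lists the model and serial for every drive in one go (these are standard lsblk columns; -d skips partitions):

    # One line per physical drive: name, model, serial, size
    lsblk -d -o NAME,MODEL,SERIAL,SIZE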

 

SMART for both of those disabled disks looks OK to me.

 

As for the ST3000 disk, you have 3 disks that match the model number you gave. Of those, disk14 (last 4 characters of the serial: 0PV5) has 32 reallocated sectors, which is not necessarily a reason to replace it.

 

Which disk were you referring to for ST3000?

 

You might run an extended SMART test on each of those disks.
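
From the console, something like this should start one; smartctl ships with stock Unraid, and /dev/sdh here is just an example device to substitute for each disk in turn:

    # Start an extended (long) self-test; the drive runs it internally in the background
    smartctl -t long /dev/sdh
    # Check status/progress later
    smartctl -a /dev/sdh | grep -i 'self-test'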

 

Lots of stuff in syslog I don't know much about, including some things that appear related to encryption.

 

The array wasn't started in those diagnostics, so I can't tell whether any disks are unmountable, along with many other things that might be useful.

 

Can you start the array and post new diagnostics?

 

 

Link to comment

Awesome @trurl, thank you! Sorry about the confusion with model numbers instead of serials. See attached for new diagnostics; below are the serial #s of the disks in question and their #s in the array. I have kicked off the extended SMART tests and will add the reports when they're complete.

 

If we determine the disks are OK and nothing bad actually happened, is there any alternative to rebuilding them? Could I rebuild one and not the other? I have a new 8TB arriving Friday; I can rebuild to a new drive so that if another disk fails during the rebuild I'd still have the original as a backup. I'm just looking for the best approach to get back to having only a single disk at risk.

 

Appreciate the quick turnaround and review. I'm terrified that another disk may fail now and corrupt my array. So thank you very much.

 

Disk 6 (sdh) - ZA1ESAT0 - ST8000VN0022-2EL112 -  disabled ( still under warranty )

Disk 19 (sdq) - Y4NEK70WFS9A - TOSHIBA MD04ACA500 - disabled ( probably old disk )

Disk 14 (sde) - W1F50PV5 - ST3000DM001-1CH166 - enabled and mounted with 0 errors today after the array restart, but the reallocated sectors went from 32 to 40 (the old disk with read errors). This disk is probably going bad (I'm guessing) and I will ultimately replace it, but I'm hoping it can hang on until the rest of the array is healthy.

pumbaa-diagnostics-20200826-0804.zip

Link to comment

There are some other things about how you have dockers configured that we can discuss later after your array is stable.

 

Looks like emulated disks 6 and 19 are unmountable. Even though these disks are disabled, they are emulated by reading all other disks and getting their data from the parity calculation. Unfortunately, the result of this emulation is a corrupt filesystem on each. And the emulated data is what will be rebuilt when you do rebuild the disks.

 

So, the first thing will be to try to repair these filesystems.

 

Study this wiki and come back with any questions:

 

https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui

 

Be sure to capture the output so you can post it.
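
If you'd rather run the check from the console instead of the webGui, my understanding (worth verifying for your Unraid version) is that on an encrypted array it runs against the /dev/mapper device for each emulated disk, e.g.:

    # -n = check only (no modify), -v = verbose; the device names are an assumption
    xfs_repair -nv /dev/mapper/md6
    xfs_repair -nv /dev/mapper/md19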

Link to comment
32 minutes ago, srfnmnk said:

rebuild to a new drive so that if another disk fails during the rebuild I'd still have the original as a backup

This is always a good idea, and in the current situation, having the originals may be another solution to the filesystem corruption of the emulated disks if the repair doesn't work.

Link to comment

Stopped Array 

Started in Maintenance Mode

Used -nv on checks 

Below are outputs

NOTE: the SMART tests are still in progress, but I didn't figure that would have an effect... could be wrong.

 

Disk 6:

 

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.
Exiting now.

 

Disk 19

 

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.
Exiting now.

Link to comment
2 minutes ago, srfnmnk said:

the SMART tests are still in progress but I didn't figure that would have an effect

The SMART tests are on the physical disks, and the filesystem checks/repairs are on the emulated disks. But if you are also running a SMART test on disk14, then that disk is being read for the emulation. Let the SMART tests complete.
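
If you want to watch a test from the console, something like this should show the drive's self-test log and status (device letter taken from your list above; adjust if it has changed):

    # Show the SMART self-test log for disk14
    smartctl -l selftest /dev/sde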

Link to comment

DISK 19

The extended SMART test for Disk 19 has completed without error. The report is attached.

 

I re-ran the filesystem check (-nv) on Disk 19 and got the same output as above.

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.
Exiting now.

pumbaa-smart-20200826-1701.zip

Link to comment

It could be that failing disk14 is actually the cause of the unmountable filesystems on the emulated disks.

 

Maybe the best way forward would be to New Config / Trust Parity, then try to replace/rebuild disk14, and if there really is corruption on the other disks, try to repair it after that.

 

The main thing that troubles me about making that recommendation, though, is that I don't know why those other disks became disabled. The syslog in those first diagnostics you posted may have a clue, since it probably covers those events, but there was quite a lot of stuff in it that I haven't seen before.

 

I am going to ask @johnnie.black for a second opinion.

 

 

Link to comment

Ok, I will await a response from @johnnie.black -- thank you. It seems very risky to create a new config and go, right? What's the backout plan if that blows up for some reason? I think you're suggesting:

 

1. New config with exact same disk structure as I have today

2. Start the array

3. Check the failing disk (disk14) and do extended SMART test

4. If bad, replace it and rebuild from parity

 

Is this correct? If so, where in this process could I wind up losing all my data? That would be the end of my world, haha (I do have the most important stuff off-site, but that's only about 2TB of 50+).

 

I am VERY curious to know how Disk 6 & 19 just became unmountable with no visible errors/issues, ESPECIALLY while another disk was failing... that immediately invalidates the entire array, even with dual parity. The only potentially dangerous thing I do (I think) is that I had "md_write_method" set to reconstruct_write, but I use the reconstruct-write tool "Turbo Write" and it's set to allow 10 disks (of 16) spun down... I thought this was safe...
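
For reference, my understanding (an assumption on my part about how the tunable maps to the console command) is that the setting amounts to something like:

    # 1 = reconstruct write ("turbo write"), 0 = read/modify/write -- the mapping is my assumption
    mdcmd set md_write_method 1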

 

Thank you for your input.

Link to comment
8 minutes ago, srfnmnk said:

I am VERY curious to know how Disk 6 & 19 just became unmountable with no visible errors/issues,

18 minutes ago, trurl said:

It could be that failing disk14 is actually the cause of the unmountable filesystems on the emulated disks.

We don't know that the physical disks 6 and 19 have unmountable filesystems, because those physical disks have not been read since they became disabled. Instead, those disabled disks are emulated by reading all other disks and getting the data for the disabled disks from the parity calculation. So, the failing disk14 is providing some of the bits for the emulation of the disabled disks. It is the emulated disks that are unmountable.

 

Link to comment

Right but disks 6 and 19 are emulated.

1 minute ago, trurl said:

It is the emulated disks that are unmountable.

So those "have unmountable filesystems" no? There's no way to mount them with the current config, right? I get that disk 14 is providing bits for the emulation and that's why I am likely continuing to get increased error counts...I'm just not sure what to do about it at this point. 

Link to comment
1 minute ago, srfnmnk said:

Right but disks 6 and 19 are emulated.

So those "have unmountable filesystems" no? There's no way to mount them with the current config, right? I get that disk 14 is providing bits for the emulation and that's why I am likely continuing to get increased error counts...I'm just not sure what to do about it at this point. 

It is possible to try to mount them outside the array, but that would technically invalidate parity at least a little. I think another possibility would be to clone them outside the array and see if the cloned disks are mountable, but that would require spare disks and would be time-consuming.
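
If you did go the cloning route, GNU ddrescue is the usual tool for copying a possibly-flaky disk. It isn't part of stock Unraid as far as I know, so it would need installing first, and the device names below are placeholders -- triple-check them, since the destination gets overwritten:

    # Clone source disk to destination, keeping a map file so an interrupted copy can resume
    ddrescue -f /dev/sdSOURCE /dev/sdDEST /root/clone.map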

 

Perhaps the simplest, and possibly as good as any other way, would be trusting the array, rebuilding disk14 to a new disk, and then repairing the corruption, if any really exists, on the other disks.

Link to comment
2 minutes ago, trurl said:

It is possible to try to mount them outside the array, but that would technically invalidate parity at least a little.

The filesystem on the actual disks should be OK, and they can be mounted with the read-only option in UD, but since the rest of the array was started without them, parity will always be a little out of sync, so there could still be issues rebuilding disk14 using the invalid slot command.

 

I would check if both disks mount OK (with UD in read-only mode) and also run an extended test on disk14, then decide what to do based on the results.
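
If UD gives you trouble with the encrypted disks, a manual read-only mount from the console should amount to roughly this (the names and partition number are assumptions on my part; the array data partition is normally partition 1, and the device letter may change once the disk is unassigned):

    mkdir -p /mnt/temp6
    # Open the LUKS container read-only (prompts for the array passphrase)
    cryptsetup --readonly luksOpen /dev/sdh1 temp6
    # Mount the XFS filesystem read-only
    mount -o ro /dev/mapper/temp6 /mnt/temp6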

 

P.S. disk14 and a lot of your disks are the infamous ST3000DM001, which tends to fail a lot.

Link to comment
27 minutes ago, johnnie.black said:

ST3000DM001

Yes, for a while there I was replacing disks all the time, as my first batch of disks were these lovely guys... the ones left now have been running a LONG time, as I never bought any more of them.

 

What I want to do is get the two disabled disks (6 & 19) back online, do extended test on 14, and replace it (because it's probably bad).

 

29 minutes ago, johnnie.black said:

I would check if both disks mount OK (with UD in read-only mode)

Will this invalidate my parity (at least a little)? I've never tried to mount array disks with UD before. What's the process here? 

Stop Array --> temporarily remove from array config --> mount in read only --> umount --> return it to array?

 

I assume you're saying that if both disks are mountable I should just trust the parity and rebuild the config right? If they are both able to be mounted, I can:

  • build a new config with exactly the same composition of disks
  • check disks 6 & 19 for errors (assuming good or can be repaired) move on
  • extended SMART on 14
  • Run parity check (correct errors -- back to healthy)
  • Replace and Rebuild 14 (assuming it's dead) and throw that garbage in the trash :P

^^ does that sound right?

 

Thanks so much

Link to comment
5 minutes ago, srfnmnk said:

Will this invalidate my parity (at least a little)?

Not that part, if done in read-only mode, but parity is already a little out of sync with those disks.

 

9 minutes ago, srfnmnk said:

Stop Array --> temporarily remove from array config --> mount in read only --> umount --> return it to array?

Stop the array, unassign both disks, and start the array, or they won't be available to UD; in UD, enable the read-only button before mounting.

 

10 minutes ago, srfnmnk said:

I assume you're saying that if both disks are mountable I should just trust the parity and rebuild the config right?

Also depends on the result of the SMART test for disk14.

Link to comment
