
[SOLVED] Several Errors on Multiple Disks, One Disk Disabled



Context:
- I have a big ol' 24-bay Supermicro server (half populated, with dual parity and two cache SSDs).
- The drives are all plugged into a backplane and connected via a SAS cable to an LSI SAS 9211-4i host bus adapter in the mobo.

Early this morning several of my drives started spitting out errors. 7 of the 9 data drives had between 20 and 90 errors each. My two parity and two cache drives didn't report any (all of these are plugged into the same backplane). When I noticed what was going on, Unassigned Devices showed those same 7 drives in its unassigned list. Yikes. It also showed one of my data drives as "disabled, contents emulated".

 

My memory gets a little foggy as this happened at 430am, but I believe I turned off the machine at some point before exporting the initial diagnostics. Rookie mistake. But I made sure the offending drives were all in place, the PSU cables were plugged in, the HBA was seated, and the SAS cable was plugged in well. 

 

I turned the system back on in safe mode and everything appeared to be normal aside from the one disabled drive. I made sure to disable parity checking so it doesn't write any possible garbage data that may now exist.

The system was running all day with no issues. I then rebooted in non-safe mode, aka just normal mode. I disabled docker as I believed that could thrash the system around too much. I then started the array. Several hours later, being impatient, I decided to turn docker back on to see what would happen. Within about 10 minutes I saw a couple more errors pop up on 4 of the drives, so I exported diagnostics, stopped the array, and shut down.

I've attached the 430am Monday (possibly useless) diagnostics and the 1am Tuesday (probably useful) diagnostics.

- I have ordered a new drive (arriving later tonight) with the goal of replacing the disabled one.
- I have a new USB on hand if I need to replace the current one if suggested.
- I have a Fujitsu LSI SAS LSI2008-FU ZM I could use as a replacement for the HBA if recommended. (Could running dual HBAs be best going forward?)
- I also have a couple replacement SAS cables on the way which could replace the existing one if suggested (also arriving tonight).

Any suggestions for my steps going forward? Did I handle this right so far?

I'm not sure on what the best order of operations would be right now, especially because a bunch of the drives spit up errors. I'm also not sure what's the best practice for dealing with the possible garbage data that may exist.

halp!

 

1 hour ago, ddwag1 said:

What is best to do to ensure that the possible garbage data doesn't become a problem?

Until you find and fix the problem, don't run correcting parity checks and don't rebuild any disk on top of the old one. Either rebuild to a spare or re-enable the disk with a new config instead (but any data written to the emulated disk, if any, will be lost).
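
To illustrate why a correcting parity check is risky while reads are unreliable, here is a minimal Python sketch. It is a toy model, not Unraid's code: it assumes a single XOR parity disk (the dual-parity setup in this thread adds a second, differently computed parity that is not modeled), and the disk contents and the names `xor_parity` and `rebuild` are hypothetical.

```python
from functools import reduce

def xor_parity(blocks):
    """Single parity: byte-wise XOR across all blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(surviving_blocks, parity):
    """Reconstruct one missing block from the surviving blocks plus parity."""
    return xor_parity(surviving_blocks + [parity])

# Hypothetical toy array: three data "disks" of four bytes each.
disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(disks)

# With good parity, disk 2 can be rebuilt exactly from disks 0 and 1.
good_rebuild = rebuild(disks[:2], parity)

# Assumption: a flaky cable makes disk 0 return garbage on read,
# even though the data stored on the disk itself is still fine.
garbage_read = b"\xff\xff\xff\xff"

# A correcting check recomputes parity from whatever it reads,
# so parity is now "corrected" to match the garbage.
corrected_parity = xor_parity([garbage_read, disks[1], disks[2]])

# Once reads are healthy again, that parity no longer matches the real data,
# and a rebuild of disk 2 from it comes out wrong.
bad_rebuild = rebuild(disks[:2], corrected_parity)

print(good_rebuild == disks[2])
print(bad_rebuild == disks[2])
```

In this model, a non-correcting check would only report the mismatch and leave parity alone, which is why it is the safer choice until the hardware fault is found.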


Thanks, currently making sure all of the SMART checks are fine and the cables are in order, and doing more reading about my options.

1 hour ago, johnnie.black said:

don't rebuild any disk on top of the old one, either use a spare

Can you clarify what you mean here?

My understanding from the wiki and your reply is that I have 2 options similar to what you mentioned.

1) I can just choose to re-enable the drive (the 'new config' route) if I'm confident that all the data on the disabled drive is fine and that the drive itself is healthy. I'm not super confident in that, as docker was running when the drive got disabled and I'm sure a few things were written to the emulated drive. This route would delete that emulated disk data like you mentioned.

2) This one I find a little confusing, so please correct me if I'm wrong. Following the steps under "Re-enable the drive", I can either reconstruct the data to a new disk (which just arrived in the mail), or reconstruct it onto the previously disabled drive (if the drive is healthy). This would entail removing the disk from the array by selecting 'no device' under Array Devices, starting the array, stopping the array, assigning either the new disk or the old one, then restarting the array to allow reconstruction to begin.

 


 

 


A few points to help you decide:

  • A disk is disabled when a write to it fails, because at that point it is out of step with parity. Unraid then stops writing to the drive and starts emulating it using the combination of the other drives plus parity. Subsequent writes to the emulated drive are reflected in parity only, and the disabled drive is left untouched.
  • A drive being Unmountable indicates some sort of corruption at the file system level. The rebuild process will not correct this.
  • The rebuild process just puts the contents of the emulated drive back onto the physical drive, so if the emulated drive is unmountable then the rebuilt one will be as well.
  • You can attempt the file system repair process on the emulated drive prior to any rebuild attempt. Only if this is successful does the rebuild make sense.
  • It is always possible that the disabled disk itself does not have the file system corruption, but it only contains data written before it became disabled.
  • Rebuilding to a new disk leaves the original disabled disk untouched, giving you other recovery options if this does not resolve the issue.
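
The first and last points above can be sketched in Python. This is a hypothetical toy model under stated assumptions, not how Unraid is implemented: one XOR parity disk only (no second parity), whole-disk writes, and made-up names (`ToyArray`, `xor_blocks`) and contents.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

class ToyArray:
    """Hypothetical model: several data disks plus one XOR parity disk."""

    def __init__(self, data_disks):
        self.disks = [bytes(d) for d in data_disks]
        self.parity = xor_blocks(*self.disks)
        self.disabled = None  # index of the disabled disk, if any

    def disable(self, index):
        self.disabled = index

    def read(self, index):
        if index == self.disabled:
            # Emulation: XOR parity with every other data disk.
            others = [d for i, d in enumerate(self.disks) if i != index]
            return xor_blocks(self.parity, *others)
        return self.disks[index]

    def write(self, index, new_data):
        old = self.read(index)
        # Parity always follows the write.
        self.parity = xor_blocks(self.parity, old, new_data)
        if index != self.disabled:
            self.disks[index] = bytes(new_data)
        # A disabled physical disk is left untouched.

    def rebuild_onto_new_disk(self):
        if self.disabled is None:
            raise ValueError("no disabled disk to rebuild")
        # The replacement receives the emulated contents.
        return self.read(self.disabled)

array = ToyArray([b"AAAA", b"BBBB", b"CCCC"])
array.disable(1)
array.write(1, b"bbbb")          # lands in parity only

stale_physical = array.disks[1]  # what the old disk still holds
emulated = array.read(1)         # what the array presents
replacement = array.rebuild_onto_new_disk()
print(stale_physical, emulated, replacement)
```

In this model the old disk keeps its pre-failure contents while the replacement gets the emulated contents, which is why rebuilding to a new disk leaves a fallback copy and why a new config that re-enables the old disk drops anything written during emulation.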


Thanks for the summary. I'm pretty certain that my issue was the SAS cable. I replaced it and everything has been looking fine for two days. The SMART check passed on the disabled drive, but to be safe I'm going to replace it with my new drive, let that reconstruct, then clear my old drive and add it back to the array, assuming a preclear pass and another SMART check go well.

Thanks for your help!

  • JorgeB changed the title to [SOLVED] Several Errors on Multiple Disks, One Disk Disabled
