Context:
- I have a big ol' 24-bay Supermicro server (half populated, with dual parity and two cache SSDs).
- The drives are all plugged into a backplane and connected via a SAS cable to an LSI SAS 9211-4i host bus adapter (HBA) in the mobo.
Early this morning several of my drives started spitting out errors. 7 of the 9 data drives had between 20 and 90 errors each. My two parity and two cache drives didn't report any (all of these are plugged into the same backplane). When I noticed what was going on, Unassigned Devices showed those same 7 drives in its unassigned list. Yikes. It also showed one of my data drives as "disabled, contents emulated".
My memory gets a little foggy as this happened at 4:30am, but I believe I turned off the machine at some point before exporting the initial diagnostics. Rookie mistake. I did make sure the offending drives were all in place, the PSU cables were plugged in, the HBA was seated, and the SAS cable was plugged in well.
I turned the system back on in safe mode and everything appeared to be normal aside from the one disabled drive. I made sure to disable parity checking so it wouldn't write any garbage data that may now exist.
The system ran all day with no issues. I then rebooted in normal (non-safe) mode. I disabled Docker as I believed it could thrash the system around too much. I then started the array. Several hours later, being impatient, I decided to turn Docker back on to see what would happen. Within about 10 minutes I saw a couple more errors pop up on 4 of the drives, so I exported diagnostics, stopped the array, and shut down.
I've attached the 4:30am Monday (possibly useless) diagnostics and the 1am Tuesday (probably useful) diagnostics.
- I have ordered a new drive (arriving later tonight) with the goal of replacing the disabled one.
- I have a new USB drive on hand if replacing the current one is suggested.
- I have a Fujitsu LSI SAS LSI2008-FU ZM I could use as a replacement for the HBA if recommended. (Could running dual HBAs be best going forward?)
- I also have a couple replacement SAS cables on the way which could replace the existing one if suggested (also arriving tonight).
Any suggestions for my steps going forward? Did I handle this right so far?
I'm not sure what the best order of operations would be right now, especially because a bunch of the drives spat out errors. I'm also not sure what the best practice is for dealing with the possible garbage data that may exist.
halp!