4 drives randomly kicked up errors, including 2 parity drives - what to do next?


KRiSX
  • Solution

Hey all, bit freaked out right now. I've been fast approaching the end of my build and transferring everything across from my old setup, and I've just hit a whole boat load of errors on both parity drives and 2 of the drives that data was copying to. I've stopped all transfers and don't want to touch anything until I know what to do next. Logs attached.

 

Essentially I'm seeing a whole heap of "md: diskX write error" messages. I am now also seeing a lot of "Failing async write on buffer block" messages.
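For anyone wanting to pull the same lines out of their own logs, something like this from the console does the trick (that's the standard Unraid syslog path, adjust if yours differs):

grep -E "write error|Failing async write" /var/log/syslog | less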

 

For the record, all of the drives now showing errors have been in service for years without issue. I didn't pre-clear them, I just let Unraid do its own clear and then formatted them, as that seems to be acceptable with known good drives.

 

Hopefully this isn't the end of the world and I can simply resolve it with a parity rebuild or something along those lines.

 

UPDATE #1: The array doesn't appear to be allowing any writes at this point either. I'm going to stop all my docker containers, and I've kicked off extended SMART tests on the 2 x 6TB drives that threw errors. I've done a short test on the 2 x 8TB parity drives and they show as OK - maybe I need to run an extended test on those too?
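For anyone who prefers the console over the GUI for the SMART tests, this is roughly what they boil down to - sdX is just a placeholder for the real device, and drives sitting behind a RAID controller may need an extra -d option (check the smartctl docs for your controller):

smartctl -t short /dev/sdX    # short self-test, usually a couple of minutes
smartctl -t long /dev/sdX     # extended self-test, many hours on a 6TB drive
smartctl -a /dev/sdX          # view the results and SMART attributes once it's done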

UPDATE #2: I've stopped the array as it was non-stop generating the same "Failing async write on buffer block" lines in the logs for 10 different blocks. When stopping the array I also noticed "XFS (md13): I/O Error Detected. Shutting down filesystem" and "XFS (md13): Please unmount the filesystem and rectify the problem(s)" - so perhaps disk 13 isn't as healthy as I thought?

 

UPDATE #3: Restarted the array to see what would happen. The array started, appears to be writable now, and no errors are being produced in the logs - parity is offline. I'm going to keep everything else (docker) shut down until the SMART tests complete on the 2 x 6TB drives, unless someone advises me otherwise.

 

UPDATE #4: Looking at the logs a bit harder, it seems my controller (Adaptec 6805) had a bit of a meltdown, which I think is why the errors occurred. I've since restarted the server, which has cleared all the errors, but parity is still disabled. I'm going to continue running without parity until the extended SMART tests finish on the 2 x 6TB drives, and at that point I may just keep it disabled until I've finished moving data across anyway. I also ran xfs checks on each disk to be sure they were all healthy. Not sure there is much else to do apart from waiting for the scans to finish and then rebuilding parity. Would still appreciate any feedback anyone may have :)
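For reference, the xfs checks were just the standard read-only variety, run with the array started in maintenance mode - roughly this per disk (the exact device name, e.g. /dev/md13 vs /dev/md13p1, depends on your Unraid version):

xfs_repair -n /dev/md13    # -n = no modify, only report problems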

I also found this article... it seems old, but I confirmed the timeout is set to 60. Would changing it per drive as instructed cause any issue? https://ask.adaptec.com/app/answers/detail/a_id/15357/~/error%3A-aacraid%3A-host-adapter-abort-request
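If I'm reading the article right, changing it per drive comes down to something like this - sdX is a placeholder, 180 is just an example value rather than anything the article specifically mandates, and the change won't survive a reboot unless you script it (e.g. from the go file):

cat /sys/block/sdX/device/timeout           # currently reports 60 here
echo 180 > /sys/block/sdX/device/timeout    # raise the SCSI command timeout for this drive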

 

 

newbehemoth-diagnostics-20220218-1114.zip

12 minutes ago, JorgeB said:

Not sure, I don't have experience with those controllers; we usually recommend LSI HBAs.

 

Yeah, that seems to be the general recommendation from what I've seen on here. I've had this Adaptec for years and it's served me well... I guess if it keeps giving me trouble I can look into an LSI.

 

Thanks for the replies

