4 drives randomly kicked up errors, including 2 parity drives - what to do next?


KRiSX
  • Solution

Hey all, bit freaked out right now. I've been fast approaching the end of my build and transferring everything across from my old setup, and I've just hit a whole boat load of errors on both parity drives and 2 of the drives that data was copying to. I've stopped all transfers and don't want to touch anything until I know what to do next. Logs attached.

 

Essentially I'm seeing a whole heap of "md: diskX write error" messages. I am now also seeing a lot of "Failing async write on buffer block" messages.
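For anyone wanting to pull the same lines out of their own logs, something like this from the console does the trick (that's the standard Unraid syslog path, adjust if yours differs):

grep -E "write error|Failing async write" /var/log/syslog | less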

 

For the record, all of the drives now showing errors have been in service for years without issue. I didn't pre-clear them, I just let Unraid do its own clear and then formatted them, as that seems to be acceptable with known good drives.

 

Hopefully this isn't the end of the world and I can simply resolve it with a parity rebuild or something along those lines.

 

UPDATE #1: The array doesn't appear to be allowing any writes at this point either. I'm going to stop all my docker containers, and I've kicked off extended SMART tests on the 2 x 6TB drives that threw errors. I've done a short test on the 2 x 8TB parity drives and they show as OK - maybe I need to run an extended test on those too?
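For anyone who prefers the console over the GUI for the SMART tests, this is roughly what they boil down to - sdX is just a placeholder for the real device, and drives sitting behind a RAID controller may need an extra -d option (check the smartctl docs for your controller):

smartctl -t short /dev/sdX    # short self-test, usually a couple of minutes
smartctl -t long /dev/sdX     # extended self-test, many hours on a 6TB drive
smartctl -a /dev/sdX          # view the results and SMART attributes once it's done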

UPDATE #2: I've stopped the array as it was non-stop generating the same "Failing async write on buffer block" lines in the logs for 10 different blocks. When stopping the array I also noticed "XFS (md13): I/O Error Detected. Shutting down filesystem" and "XFS (md13): Please unmount the filesystem and rectify the problem(s)" - so perhaps disk 13 isn't as healthy as I thought?

 

UPDATE #3: Restarted the array to see what would happen. The array started, appears to be writable now, and no errors are being produced in the logs - parity is offline. I'm going to keep everything else (docker) shut down until the SMART tests complete on the 2 x 6TB drives, unless someone advises me otherwise.

 

UPDATE #4: Looking at the logs a bit harder, it seems my controller (Adaptec 6805) had a bit of a meltdown, which I think is why the errors occurred. I've since restarted the server, which has cleared all the errors, but parity is still disabled. I'm going to continue running without parity until the extended SMART tests finish on the 2 x 6TB drives, and at that point I may just keep it disabled until I've finished moving data across anyway. I also ran xfs checks on each disk to be sure they were all healthy. Not sure there is much else to do apart from waiting for the scans to finish and then rebuilding parity. Would still appreciate any feedback anyone may have :)
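For reference, the xfs checks were just the standard read-only variety, run with the array started in maintenance mode - roughly this per disk (the exact device name, e.g. /dev/md13 vs /dev/md13p1, depends on your Unraid version):

xfs_repair -n /dev/md13    # -n = no modify, only report problems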

I also found this article... it seems old, but I confirmed the timeout is set to 60. Would changing it per drive as instructed cause any issue? https://ask.adaptec.com/app/answers/detail/a_id/15357/~/error%3A-aacraid%3A-host-adapter-abort-request
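If I'm reading the article right, changing it per drive comes down to something like this - sdX is a placeholder, 180 is just an example value rather than anything the article specifically mandates, and the change won't survive a reboot unless you script it (e.g. from the go file):

cat /sys/block/sdX/device/timeout           # currently reports 60 here
echo 180 > /sys/block/sdX/device/timeout    # raise the SCSI command timeout for this drive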

 

 

newbehemoth-diagnostics-20220218-1114.zip

12 minutes ago, JorgeB said:

Not sure, I don't have experience with those controllers; we usually recommend LSI HBAs.

 

Yeah, that seems to be the general recommendation from what I've seen on here. I've had this Adaptec for years and it's served me well... I guess if it keeps giving me trouble I can look into an LSI.

 

Thanks for the replies

