How to respond to new S.M.A.R.T. Warnings or Errors


Falcosc

Why SMART?

Why do I care about SMART? I don't want to wait until data loss gets noticed, and I don't want to run a parity check every day. So I monitor SMART in the hope of detecting a problem before the next parity check runs.
And for me as a private person, restoring backups is an inconvenient and slow process with a lot of manual tasks. It doesn't make sense to me to spend effort on making the backup restore more convenient. Or maybe that is just less interesting than spending time with SMART :)
So I try to close the gap between parity checks with SMART monitoring, to notice issues before the bi-monthly parity check.

 

Feedback needed

Has anybody already documented or discussed this topic?
I didn't find a procedure, so I thought about it and, based on my limited knowledge, made some assumptions and created the following process.

Could you give me feedback on these steps and, in the best case, point out where I made wrong assumptions? That would be great.
Maybe we can use the result of the discussion to share documentation for people who want to add SMART data to their monitoring process.

 

My Process

  1. check the SMART notification (are count changes of the key fields part of the email notifications?)
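
If the notification does not say which counter changed, the raw values can be compared by hand. This is only a minimal sketch, assuming the attribute table printed by `smartctl -A` (smartmontools) has been saved as text. The sample values are made up, the function names are hypothetical, and attribute names differ between drive vendors, so the watched list may need adjusting.

```python
# Sample text in the column layout printed by `smartctl -A`; values are invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

# Assumption: these are the counters worth watching on this drive.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def parse_raw_values(smartctl_output):
    """Return {attribute_name: raw_value} for rows whose raw value is a plain integer."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def changed_counters(before, after, watched=WATCHED):
    """Return {name: (old, new)} for watched counters present in both snapshots that differ."""
    return {
        name: (before[name], after[name])
        for name in watched
        if name in before and name in after and before[name] != after[name]
    }


if __name__ == "__main__":
    earlier = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0}
    print(changed_counters(earlier, parse_raw_values(SAMPLE)))
```

Keeping the earlier snapshot as a small dictionary (or a saved text file) is enough to see which counter moved between two checks.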

 

SMART counter change is critical: new pending sectors or other critical indicators

  1. replace the disk immediately (the replacement disk was precleaned in the past; that is good enough in this case, because there is no time left for an additional preclean validation)
  2. meanwhile, preclean the broken disk to see whether the counter recovers or stays stable, so that the disk could be reused (I have never had a disk with pending sectors that passed the stress test)

 

SMART counter change is a warning: new reallocated sectors

  1. start a parity check with nocorrect as a stress test
  2. while the check is running, run the preclean stress test on the replacement disk again to check that it is still healthy since its last preclean
  3. if the reallocated sector count does not change, repeat the parity check, up to 3 runs in total, so that every sector gets read 3 times
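
The split between a critical change and a warning described above can be sketched as a small function. This is an illustration under assumptions, not a fixed rule: the attribute names are the ones `smartctl` commonly prints, the function name is hypothetical, and which other indicators count as critical is left for you to decide.

```python
# Assumption: pending sectors are the critical indicator named in the post;
# add other attributes you consider critical to this set.
CRITICAL = {"Current_Pending_Sector"}
# Assumption: the "relocated sectors" warning maps to this attribute name.
WARNING = {"Reallocated_Sector_Ct"}


def classify_change(before, after):
    """Compare two {attribute_name: raw_value} snapshots.

    Returns "critical", "warning" or "ok". A counter missing from a
    snapshot is treated as 0.
    """
    def increased(names):
        return any(after.get(n, 0) > before.get(n, 0) for n in names)

    if increased(CRITICAL):
        return "critical"   # replace the disk immediately
    if increased(WARNING):
        return "warning"    # start the parity check stress test
    return "ok"


if __name__ == "__main__":
    old = {"Current_Pending_Sector": 0, "Reallocated_Sector_Ct": 8}
    new = {"Current_Pending_Sector": 0, "Reallocated_Sector_Ct": 16}
    print(classify_change(old, new))
```

Critical is checked first, so a disk that gains both pending and reallocated sectors still ends up on the "replace immediately" path.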

 

Parity check result: reallocated sector count did not change during any of the 3 parity checks (read-only tests)

  1. don't replace the disk if this was the first time
  2. if the disk had the same issue in the past, replace it so that you can start a preclean write stress test on the problematic data disk

 

Parity check result: reallocated sector count increased during any of the parity check runs

  1. the disk has a persistent issue, replace it with a precleaned replacement disk
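
The two parity check outcomes above boil down to one decision. Here is a minimal sketch of it, assuming you note the reallocated sector count before the first run and after each of the (up to 3) runs. The function name and the returned labels are hypothetical.

```python
def decide_after_parity_checks(count_before, counts_after_runs, had_same_issue_before):
    """Decide what to do with a disk after the read-only parity check runs.

    count_before          -- reallocated sector count before the first run
    counts_after_runs     -- count noted after each run (up to 3 entries)
    had_same_issue_before -- True if this disk already raised this warning in the past
    """
    # The counter is cumulative, so any later value above the starting
    # value means new sectors were reallocated during the checks.
    if any(count > count_before for count in counts_after_runs):
        return "replace"  # persistent issue
    if had_same_issue_before:
        # Stable now, but a repeat offender: swap it out and run a
        # preclean write stress test on the old disk.
        return "replace_and_preclean_old_disk"
    return "keep"


if __name__ == "__main__":
    print(decide_after_parity_checks(8, [8, 8, 8], False))
    print(decide_after_parity_checks(8, [8, 9, 9], False))
    print(decide_after_parity_checks(8, [8, 8, 8], True))
```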

 

Preclean of the replacement disk created new reallocated sectors

  1. select a different replacement disk and do one preclean run to validate it (even if it was already precleaned in the past)
  2. meanwhile, stress test the suspicious replacement disk with 5 preclean runs to check whether it needs to be thrown away (this takes a long time, which is why a different replacement disk should be used to fix the array)