
Smart servo/seek failure during parity build. What does it mean?



Ok, long story short I upgraded my parity drive and was rebuilding the parity.

 

During the rebuild my daily short SMART test kicked in (some other interesting things happened, like faster speeds, but that will be for another thread). All the drives completed the test fine, but one of them had a "servo/seek failure".

 

You would think all SMART errors would be well documented, but I could not find any good information on Google about this. Even the wiki/documentation doesn't list this error. Could this simply be due to the parity sync running?

 

Does anyone know what this error is? The drive is 1.5 months old. It passed a few more tests right after that without an issue, and all the other SMART attributes are normal.
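
(For reference, a minimal sketch of checking the self-test log and attributes from the console; /dev/sdX is just a placeholder for the actual device:)

# Show the SMART self-test log, which is where the "servo/seek failure" status shows up
smartctl -l selftest /dev/sdX

# Show the SMART attributes and the drive's overall health assessment
smartctl -A /dev/sdX
smartctl -H /dev/sdX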

 

I don't really want to swap it unless it is needed, since the newer shucked drives are apparently the air-filled EDAZ vs. the EMFZ I have, but I'm thinking that might be necessary unless this is a mundane error caused by the parity sync.

Edited by TexasUnraid
Link to comment

Ok, good to know. I just keep assuming that Unraid would have basic features for a storage system OS; I need to stop that lol.

 

OT: when re-seating the cables for the array drives after the SMART error (to make sure that wasn't the issue), I think I bumped one of the cache cables and caused a drive to drop out. I'm about to fix all this drive-dropping, I hope; I got some hot-swap 4-in-3 bays for Black Friday.

 

Is it possible to add the cache drive back into the pool without stopping the array/parity sync, or is it best to just leave it alone until the parity sync finishes and then reboot? The missing drive is in UD now.

Edited by TexasUnraid
Link to comment
5 minutes ago, TexasUnraid said:

Ok, good to know. I just keep assuming that Unraid would have basic features for a storage system OS

Regular SMART tests were not really needed in the past with Unraid, since a parity check is for all purposes the same as an extended test, and cache devices are usually SSDs, so there was also not really a need for SMART tests and SMART test monitoring. Now, with multiple pools, it is sometimes good to run regular SMART tests on those pools (if using spinners); it's possibly something that could be improved in the future. For now I have a script running an extended test for those drives once a month, and it emails me the results.
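
Not the actual script, but a minimal sketch of the idea, assuming the User Scripts plugin on a monthly schedule; the drive list and the notify script path are assumptions to adapt:

#!/bin/bash
# Sketch: start an extended SMART test on each pool drive, wait, then push
# the latest self-test result through Unraid's notification system so it
# goes out by email if email notifications are configured.

DRIVES="/dev/sdb /dev/sdc"                        # assumption: adjust to your pool devices
NOTIFY=/usr/local/emhttp/webGui/scripts/notify    # assumption: stock Unraid notify script

for d in $DRIVES; do
    smartctl -t long "$d"    # kick off the extended (long) self-test
done

sleep 12h    # extended tests can take many hours; adjust for your drives

for d in $DRIVES; do
    RESULT=$(smartctl -l selftest "$d" | head -n 10)    # newest entries of the self-test log
    "$NOTIFY" -s "Extended SMART test: $d" -d "$RESULT" -i "normal"
done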

 

9 minutes ago, TexasUnraid said:

Is it possible to add the cache drive back into the pool without stopping the array/parity sync, or is it best to just leave it alone until the parity sync finishes and then reboot?

No, you'll need to stop the array. Also make sure the removed device is not still part of the pool; if it is and you add it back, it will be wiped (there's a warning when it's assigned).

Link to comment
5 minutes ago, JorgeB said:

No, you'll need to stop the array. Also make sure the removed device is not still part of the pool; if it is and you add it back, it will be wiped (there's a warning when it's assigned).

Ok, it is a raid5 pool, so I need to remove it from the pool > start array > stop array > add it back to the pool > Balance?

 

Quote

for now I have a script running an extended test for those drives once a month, and it emails me the results.

Got a link to this script? I have a short test run every day and it saves the result to a file, but it would be nice to also have it email me a short version of the results; not sure how to do that.
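
In case it helps anyone later, a minimal sketch of what that could look like, with the same caveat that the device and the notify script path are assumptions:

#!/bin/bash
# Sketch: grab just the newest self-test entry for a drive and send it
# through Unraid's notifications (which can go out by email).

DRIVE=/dev/sdb                                    # placeholder device
NOTIFY=/usr/local/emhttp/webGui/scripts/notify    # assumption: stock Unraid notify script

# The newest entry in the self-test log starts with "# 1",
# e.g. "# 1  Short offline  Completed without error ..."
LATEST=$(smartctl -l selftest "$DRIVE" | grep -m1 '^# 1')

"$NOTIFY" -s "Daily short SMART test: $DRIVE" -d "$LATEST" -i "normal"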

Edited by TexasUnraid
Link to comment

Depends on whether it was kicked out of the pool and the pool re-balanced without it or not. If yes, just add it back and it will be re-balanced again. If not, you can make Unraid forget the old pool by unassigning all devices and starting the array, then re-assign them all, run a scrub, and check that all errors were corrected. If in doubt I would need to see the diags, but I'm about to leave for the rest of the day.

Link to comment

I have not balanced the pool or done anything to it except move some important stuff off it.

 

So it sounds like I should run a balance now, then run another balance after rebooting and adding the drive back? This was my first instinct, but I lost everything doing it wrong once and can't remember exactly what the correct method was.

Edited by TexasUnraid
Link to comment
4 hours ago, JorgeB said:

Difficult for me to say without seeing the diags to see current state.

So there is not a universal method to ensure no data loss? That is kind of scary.

 

Currently the dropped drive is just sitting in UD. I have made changes to the pool contents since it dropped, moved a lot of stuff off it etc. Besides that, I have not done anything to the dropped drive or the pool.

 

The parity build is finished, so I can reboot now.

Edited by TexasUnraid
Link to comment

Before starting, it's a good idea to check that backups are up to date. Then, if enabled, disable array auto-start and reboot. Check if the dropped device is already assigned to the pool and all devices have a green ball; if yes, start the array and run a scrub. If not, do this:

 

If Docker/VM services are using the cache pool, disable them, unassign all cache devices, and start the array to make Unraid "forget" the current cache config. Then stop the array, reassign all cache devices (there can't be an "All existing data on this device will be OVERWRITTEN when array is Started" warning for any cache device), re-enable Docker/VMs if needed, start the array, and run a scrub.
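
(For reference, the console equivalent of the scrub step, assuming the pool is mounted at /mnt/cache:)

# Start a scrub on the pool (same as the GUI scrub button)
btrfs scrub start /mnt/cache

# Check progress and the corrected/uncorrectable error summary
btrfs scrub status /mnt/cache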

 

P.S. next time please attach the diags here instead.

Link to comment
23 minutes ago, JorgeB said:

Before starting, it's a good idea to check that backups are up to date. Then, if enabled, disable array auto-start and reboot. Check if the dropped device is already assigned to the pool and all devices have a green ball; if yes, start the array and run a scrub. If not, do this:

 

If Docker/VM services are using the cache pool, disable them, unassign all cache devices, and start the array to make Unraid "forget" the current cache config. Then stop the array, reassign all cache devices (there can't be an "All existing data on this device will be OVERWRITTEN when array is Started" warning for any cache device), re-enable Docker/VMs if needed, start the array, and run a scrub.

 

P.S. next time please attach the diags here instead.

Ok, I went ahead and pulled everything off the cache to another pool, so there is nothing to lose right now. Guessing this is the safest option whenever possible.

 

What causes the need for different fix methods?

 

After rebooting, in this case everything looked normal; I started the array and the cache came up. I then tried a scrub, but it said it ended with errors. So I ran a balance and then ran a scrub again, and this time it finished without errors.

 

So, in the future, if this happens again: the device drops but is perfectly fine and works just like it should after a reboot.

 

What should I do, both before I reboot and after, assuming I could not move everything off like I did this time?

 

If a drive actually died, I'm assuming I should remove it from the pool and then re-balance, although I'm not sure if I should do that before or after a reboot.

 

Thanks for the help, still getting a handle on the more advanced side of things, but learning with each issue. I'm making a bit of a "log book" for myself of the important lessons to reference later.

Link to comment
2 hours ago, TexasUnraid said:

What causes the need for different fix methods?

It would depend on whether the device was removed from the btrfs pool or not, i.e., if you started the array without it; only by seeing the diags could I be sure.

 

2 hours ago, TexasUnraid said:

I then tried a scrub, but it said it ended with errors. So I ran a balance and then ran a scrub again, and this time it finished without errors.

Errors after the first scrub are expected; it's bringing that device up to date with the rest of the pool. As long as all errors were corrected you're fine, and a balance isn't really needed for this.

 

2 hours ago, TexasUnraid said:

So, in the future, if this happens again: the device drops but is perfectly fine and works just like it should after a reboot.

Yep, as long as it was never removed from the pool, you just need to start the array with all devices present (all should be green) and run a scrub.

 

2 hours ago, TexasUnraid said:

What should I do, both before I reboot and after, assuming I could not move everything off like I did this time?

Nothing special; always make sure backups are up to date, just in case.

 

2 hours ago, TexasUnraid said:

If a drive actually died, I'm assuming I should remove it from the pool and then re-balance, although I'm not sure if I should do that before or after a reboot.

Always better to replace than remove, but remove should work in most cases (raid5/6 still has some corner-case issues).

 

You should also reset the btrfs stats errors; take a look here, it also explains how to better monitor a pool.
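
(A minimal console sketch of that, assuming the pool is mounted at /mnt/cache:)

# Show the per-device error counters for the pool
btrfs device stats /mnt/cache

# Show and reset (zero) the counters once the underlying issue is fixed,
# so any new errors stand out
btrfs device stats -z /mnt/cache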

 

 

Link to comment

Ok, thanks for that, I will add that to my notes.

 

I tried running the scrub 2x and it still did not say it completed successfully, so I ran the balance and then it said the scrub finished without errors. Not sure what that was about, but it worked after that. I went ahead and took the opportunity to wipe the cache completely anyway, to force a "trim" of the SSDs.

 

Yep, already got the BTRFS monitor script running for all my pools. That is how I found out a drive dropped this time around. I did forget to add the reset option to the notes though. Thanks again for that script!

Link to comment
