[6.1.9] Cache drive failure not detected/reported when drive physically removed



A drive can't fail much harder than this, and it wasn't detected or reported.

 

 

Last night, I was pulling empty hot-swap drive trays out of the server to prep for installing and burn-in testing a few new drives.

 

I accidentally pulled a tray that had a drive in it, and quickly fired up the management page to see which drive I'd pulled.  To my surprise, ALL drives were green.

 

I have a 2-drive cache pool.  Digging a little deeper (by serial number), I determined that I had pulled "cache drive 1".  No failure emails, no popups on the management page, and the drive still showed "green".  If I clicked on the device for more info, that page reported the drive was missing/had a problem -- but going back to the main status page still showed all-good.  Syslog attached, starting ~5 minutes before I began.

 

I was holding the drive in my hand and had just refreshed the page when the attached screenshot was taken, which shows the drive as 'good'.

 

cache_drive_2016-05-30_21_10_26.png

syslog-snippet_cache_drive_failure.zip

Link to comment

Whether it's always been like this or not, this is clearly incorrect/undesired/unacceptable behavior.

 

If it's not clearly reported, I won't know the drive needs to be replaced to avoid losing data (and avoiding data loss is the whole reason for a cache pool in the first place).

Link to comment

Agreed.  I can't remember if it was reported as a defect; go ahead and do it.

 

Edit: Didn't notice this was already posted on the defects forum.

Link to comment

Like many complex systems, there are layers!  And the upper layers don't necessarily know what's going on at the lower levels.  That's true in this case, and has been for a long time.  The management pages work at a high level, and only know the results of high-level I/O requests.  The kernel knows almost instantly about physical changes, since it's notified by exception handlers and other peer modules, and most of this activity gets logged, but the upper levels won't know a thing about it.  The lower layers are notified of issues and try to fix them, requesting retries and resets and whatever else is necessary to carry on.

I've often seen errors occur at the lower levels in users' syslogs -- sometimes the recovery succeeded, sometimes it didn't and drives were even disabled -- but the management pages didn't find out until they tried to access the drive, and it either worked or it didn't.  In some cases, the management layer knew there was a problem with a drive but NOT that the drive was considered missing!  So a write command kept trying and getting failure messages, but write commands have no way to learn that their target is gone.  After all, the lower layers are *supposed* to deal with problems and keep them hidden, not bother the higher layers.
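
Just to illustrate that the information really does exist down at the lower levels even when the GUI hasn't noticed: here's a small sketch (purely illustrative, not anything unRAID actually does) that reads what the kernel already exposes through sysfs.  The device names are placeholders, the 'state' attribute is only there for SCSI/SATA-class devices, and a pulled drive may simply vanish from /sys/block altogether.

```python
#!/usr/bin/env python3
"""Illustrative sketch: ask sysfs what the kernel thinks of a disk.

Assumptions: 'sdb' and 'sdc' are placeholder device names; the
device/state attribute usually reads 'running' or 'offline' for
SCSI/SATA devices.  None of this is unRAID code.
"""
import os

def device_status(name):
    base = f"/sys/block/{name}"
    if not os.path.isdir(base):
        return "missing"                      # kernel already dropped the device
    try:
        with open(os.path.join(base, "device", "state")) as f:
            return f.read().strip()           # e.g. 'running' or 'offline'
    except OSError:
        return "unknown"

for dev in ("sdb", "sdc"):                    # placeholder cache-pool members
    print(dev, device_status(dev))
```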

 

I suppose considerable reprogramming could be done, but LimeTech has almost always tried to stay away from modifying and customizing low-level Linux modules.  It's a nightmare to support and to rebase for each new release, plus fold in others' patches.  On top of that, you might have to program for each controller, since ATA controllers report information very differently from mvsas, mpt2sas, etc.  And I'm not sure how you'd get exceptions to filter up to the web pages.

 

On the other hand, I do believe there should be a better way to do it, but I'm not sufficiently versed in the kernel.  I've wondered if a simple syslog monitor could help, at least to pick up drives being marked as 'disabled'.
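
For illustration only, something along these lines -- a little Python that follows the syslog and flags lines that usually accompany a drive dropping out.  The log path and the patterns are my own guesses and would need tuning against real logs; it's a sketch of the idea, not a finished monitor.

```python
#!/usr/bin/env python3
"""Sketch of a syslog watcher that flags likely drive dropouts.

Assumptions: /var/log/syslog is the log location, and the patterns
below are illustrative; the exact wording depends on the kernel and
controller driver, so check them against a real syslog first.
"""
import re
import time

SYSLOG = "/var/log/syslog"                      # assumed location
PATTERNS = [
    re.compile(r"I/O error", re.I),             # generic block-layer error
    re.compile(r"rejecting I/O to offline device", re.I),
    re.compile(r"Some devices missing", re.I),  # btrfs's missing-device warning
]

def follow(path):
    """Yield lines appended to the file, like `tail -f`."""
    with open(path, errors="replace") as f:
        f.seek(0, 2)                            # jump to the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

for line in follow(SYSLOG):
    if any(p.search(line) for p in PATTERNS):
        # A real monitor would hook into unRAID's notification system here.
        print("possible drive problem:", line.strip())
```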

Link to comment

I think the problem here is that, unlike an array disk failure -- where unRAID gets a write error or timeout and can disable that disk -- a single disk in a cache pool is dropped by the btrfs filesystem, and unRAID never receives any error.  Until there's a better way, maybe unRAID could monitor the btrfs status and notify the user, since btrfs displays "*** Some devices missing" when a disk drops out.  Otherwise a user could go days or months without noticing that the cache pool is no longer protected.
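
For what it's worth, a minimal sketch of that idea (assuming /mnt/cache is the usual cache mount point, and leaving the notification step as a simple print rather than guessing at unRAID's internals):

```python
#!/usr/bin/env python3
"""Sketch: poll btrfs for the '*** Some devices missing' condition.

Assumptions: /mnt/cache is the cache pool's mount point and btrfs-progs
is available; a real implementation would feed this into unRAID's
notification system instead of printing.
"""
import subprocess

def cache_pool_degraded(mount="/mnt/cache"):
    result = subprocess.run(
        ["btrfs", "filesystem", "show", mount],
        capture_output=True, text=True,
    )
    # btrfs-progs prints this marker when a pool member has dropped out.
    return "Some devices missing" in (result.stdout + result.stderr)

if cache_pool_degraded():
    print("WARNING: cache pool has a missing device and is no longer protected")
```

Run from cron every few minutes, something like this would at least surface the condition instead of leaving the pool silently degraded.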

Link to comment
