[6.1.9] Cache drive failure not detected/reported when drive physically removed



A drive can't fail much harder than this, and it wasn't detected or reported.

 

 

Last night, I was pulling empty hot-swap drive trays out of the server to prep for installing and burn-in testing a few new drives.

 

I accidentally pulled a tray that had a drive in it, and quickly fired up the management page to see which drive I'd pulled.  To my surprise, ALL drives were green.

 

I have a 2-drive cache pool.  Digging a little deeper (by serial number), I determined that I had pulled "cache drive 1".  No failure emails, no popups on the management page, and the drive still showed "green".  If I clicked on the device for more info, that page reported the drive was missing/had a problem -- but going back to the main status page still showed all-good.  Syslog attached, starting ~5 minutes before I began.

 

I was holding the drive in my hand and had just refreshed the page when the attached screenshot was taken, which shows the drive as 'good'.

 

cache_drive_2016-05-30_21_10_26.png

syslog-snippet_cache_drive_failure.zip

Link to comment

Whether it's always been like this or not, this is clearly incorrect/undesired/unacceptable behavior.

 

If it's not clearly reported, I won't know the drive needs to be replaced to avoid losing data (and avoiding data loss is the whole reason for a cache pool in the first place).

Link to comment

Agreed.  I can't remember if it was reported as a defect; go ahead and do it.

 

Edit: Didn't notice this was already posted on the defects forum.

Link to comment

Like many complex systems, there are layers!  And the upper layers don't necessarily know what's going on at the lower levels.  That's true in this case, and has been for a long time.  The management pages work at a high level, and only know the results of high-level I/O requests.  The kernel knows almost instantly about physical changes, since it's notified by exception handlers and other peer modules, and most of this activity gets logged, but the upper levels won't know a thing about it.  The lower layers are notified of issues and try to fix them, requesting retries and resets and whatever else is necessary to carry on.

I've often seen errors occur at the lower levels in users' syslogs -- sometimes the recovery succeeded, sometimes it didn't and drives were even disabled -- but the management pages didn't find out until they tried to access the drive, and it either worked or it didn't.  In some cases, the management layer knew there was a problem with a drive but NOT that the drive was considered missing!  So a write command kept trying and getting failure messages, but write commands have no way to learn that their target is gone.  After all, the lower layers are *supposed* to deal with problems and keep them hidden, not bother the higher layers.
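
Just to illustrate that the information really does exist down at the lower levels even when the GUI hasn't noticed: here's a small sketch (purely illustrative, not anything unRAID actually does) that reads what the kernel already exposes through sysfs.  The device names are placeholders, the 'state' attribute is only there for SCSI/SATA-class devices, and a pulled drive may simply vanish from /sys/block altogether.

```python
#!/usr/bin/env python3
"""Illustrative sketch: ask sysfs what the kernel thinks of a disk.

Assumptions: 'sdb' and 'sdc' are placeholder device names; the
device/state attribute usually reads 'running' or 'offline' for
SCSI/SATA devices.  None of this is unRAID code.
"""
import os

def device_status(name):
    base = f"/sys/block/{name}"
    if not os.path.isdir(base):
        return "missing"                      # kernel already dropped the device
    try:
        with open(os.path.join(base, "device", "state")) as f:
            return f.read().strip()           # e.g. 'running' or 'offline'
    except OSError:
        return "unknown"

for dev in ("sdb", "sdc"):                    # placeholder cache-pool members
    print(dev, device_status(dev))
```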

 

I suppose considerable reprogramming could be done, but LimeTech has almost always tried to stay away from modifying and customizing low-level Linux modules.  It's a nightmare to support and to rebase for each new release, plus fold in others' patches.  On top of that, you might have to program for each controller, since ATA controllers report information very differently from mvsas, mpt2sas, etc.  And I'm not sure how you'd get exceptions to filter up to the web pages.

 

On the other hand, I do believe there should be a better way to do it, but I'm not sufficiently versed in the kernel.  I've wondered if a simple syslog monitor could help, at least to pick up drives being marked as 'disabled'.
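
For illustration only, something along these lines -- a little Python that follows the syslog and flags lines that usually accompany a drive dropping out.  The log path and the patterns are my own guesses and would need tuning against real logs; it's a sketch of the idea, not a finished monitor.

```python
#!/usr/bin/env python3
"""Sketch of a syslog watcher that flags likely drive dropouts.

Assumptions: /var/log/syslog is the log location, and the patterns
below are illustrative; the exact wording depends on the kernel and
controller driver, so check them against a real syslog first.
"""
import re
import time

SYSLOG = "/var/log/syslog"                      # assumed location
PATTERNS = [
    re.compile(r"I/O error", re.I),             # generic block-layer error
    re.compile(r"rejecting I/O to offline device", re.I),
    re.compile(r"Some devices missing", re.I),  # btrfs's missing-device warning
]

def follow(path):
    """Yield lines appended to the file, like `tail -f`."""
    with open(path, errors="replace") as f:
        f.seek(0, 2)                            # jump to the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

for line in follow(SYSLOG):
    if any(p.search(line) for p in PATTERNS):
        # A real monitor would hook into unRAID's notification system here.
        print("possible drive problem:", line.strip())
```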

Link to comment

I think the problem here is that, unlike an array disk failure -- where unRAID gets a write error or timeout and can disable that disk -- a single disk in a cache pool is dropped by the btrfs filesystem, and unRAID never receives any error.  Until there's a better way, maybe unRAID could monitor the btrfs status and notify the user, since btrfs displays "*** Some devices missing" when a disk drops out.  Otherwise a user could go days or months without noticing that the cache pool is no longer protected.
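
For what it's worth, a minimal sketch of that idea (assuming /mnt/cache is the usual cache mount point, and leaving the notification step as a simple print rather than guessing at unRAID's internals):

```python
#!/usr/bin/env python3
"""Sketch: poll btrfs for the '*** Some devices missing' condition.

Assumptions: /mnt/cache is the cache pool's mount point and btrfs-progs
is available; a real implementation would feed this into unRAID's
notification system instead of printing.
"""
import subprocess

def cache_pool_degraded(mount="/mnt/cache"):
    result = subprocess.run(
        ["btrfs", "filesystem", "show", mount],
        capture_output=True, text=True,
    )
    # btrfs-progs prints this marker when a pool member has dropped out.
    return "Some devices missing" in (result.stdout + result.stderr)

if cache_pool_degraded():
    print("WARNING: cache pool has a missing device and is no longer protected")
```

Run from cron every few minutes, something like this would at least surface the condition instead of leaving the pool silently degraded.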

Link to comment
