Cache drive disappeared after power drop to UPS

charibdis · January 16

My cache drive appears to be unmountable. I noticed it after a transient drop to UPS power and back (no more than a few seconds). A bunch of docker containers were down, so I decided to reboot, which may have aggravated the issue.

I tried looking up similar cases, but in most it seems that either disabling docker, deleting, and restarting fixes it, or the drive is just completely missing. In my case the drive still shows in the Main list as active. It actually spins down since it isn't mountable. It doesn't display any SMART errors, though the raw UDMA CRC count is 123. I mean, it's a Green spinner with 12+ years of power-on time. If it's on its last legs, I got my money's worth out of it. If only all drives were as reliable... And yes, I know a spinner as a cache drive, especially a Green one, is a recipe for slower performance, but for what I use my array for, it works out perfectly fine.

Then I did find this: btrfs check shows errors:

Quote

[1/7] checking root items
[2/7] checking extents
tree extent[948178698240, 16384] root 4611686018427387906 has no tree block found
tree extent[948178698240, 16384] root 2 has no backref item in extent tree
incorrect global backref count on 948178698240 found 2 wanted 1
backpointer mismatch on [948178698240 16384]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/sde1
UUID: 28978e0f-3b47-4932-9f1f-bf88b08877ed
found 271930515456 bytes used, error(s) found
total csum bytes: 219465280
total tree bytes: 321290240
total fs tree bytes: 49889280
total extent tree bytes: 21643264
btree space waste bytes: 38313960
file data blocks allocated: 631057014784
 referenced 265284116480

I suspect I could try a btrfs repair (btrfs check --repair), but I don't know enough about it to be sure, considering all the warnings that pop up about using it. My other alternative is likely to just reformat. If memory serves, I think I'll only lose whatever was on my cache: docker configs and a few other things. I do have a backup of my "Appdata", but for some reason it's over 6 months old.

Recommendations on best course of action?

scylla-unraid-diagnostics-20240115-1944.zip

JorgeB · January 16

There appear to be other issues with the filesystem, but it didn't mount initially because of a bad log tree. This may help get past that:

btrfs rescue zero-log /dev/sde1

Then restart the array. If the pool mounts, it's a good idea to make sure backups are up to date.
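For anyone scripting this, a minimal sketch of a guard around the command above, assuming the pool device is /dev/sde1 as in this thread and that the array is stopped first. The function names `is_mounted` and `zero_log_if_unmounted` are hypothetical helpers, not part of btrfs-progs or Unraid.

```shell
# Hypothetical helper: succeeds if the device is listed in a mounts table.
# The optional second argument lets the check be tried against a sample file
# instead of /proc/mounts.
is_mounted() {
    local dev="$1" table="${2:-/proc/mounts}"
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$table"
}

# Hypothetical wrapper: clear the btrfs log tree only when the device is not mounted.
zero_log_if_unmounted() {
    local dev="$1"
    if is_mounted "$dev"; then
        echo "refusing: $dev is mounted, stop the array first" >&2
        return 1
    fi
    btrfs rescue zero-log "$dev"
}

# Usage (device name taken from this thread):
# zero_log_if_unmounted /dev/sde1
```

Clearing the log tree discards the last few seconds of writes that had not been committed yet, so it is worth double-checking the device name before running it.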

charibdis · January 16

Thanks for the info.

Running the command and starting the array appears to have allowed the disk to be mounted, and I'm able to read the files, though I've noted that it appears to be mounted read-only. I'm manually backing up what I can so I can at least reference config files, as the automated backup I had set up isn't running to back up anything related to dockers.

I pulled a new Diagnostic now that it's mounted.

scylla-unraid-diagnostics-20240116-0800.zip

JorgeB · January 16

The filesystem mounted but went read only due to other issues.

1 hour ago, charibdis said:

I'm manually backing up what I can so I can at least reference config files

This would be my suggestion as well, then reformat the pool and restore the data.
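A minimal sketch of one way to do the manual copy from the console, assuming the pool is mounted at /mnt/cache and there is room on an array disk at /mnt/disk1 (both paths are assumptions, as is the helper name `backup_pool`). It copies each top-level folder separately so that one unreadable folder on the damaged filesystem does not abort the rest.

```shell
# Hypothetical helper: copy every top-level entry of src into dst,
# reporting failures but continuing with the remaining entries.
# Note: the glob skips hidden (dot) entries at the top level.
backup_pool() {
    local src="$1" dst="$2" failed=0 entry
    mkdir -p "$dst" || return 1
    for entry in "$src"/*; do
        [ -e "$entry" ] || continue
        if ! cp -a "$entry" "$dst"/; then
            echo "copy failed: $entry" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Usage (both paths are assumptions):
# backup_pool /mnt/cache /mnt/disk1/cache_backup
```

A non-zero return means at least one entry could not be copied in full, so check the messages before reformatting.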

Frank1940 · January 16

14 hours ago, charibdis said:

Noticed it after a transient drop to UPS power and back.

As a completely side issue, I would check that UPS battery. A drop in the voltage of the power feed to the UPS will usually cause the UPS to supply power to the load without a discernible drop in the voltage being supplied. (I normally use enough incandescent light bulbs to require the UPS to deliver 200-300 watts of power. I would expect it to continue to supply power for at least five minutes. If it doesn't, I would be ordering a replacement battery!)

charibdis · January 16

2 hours ago, JorgeB said:

The filesystem mounted but went read only due to other issues.

This would be my suggestion as well, then reformat the pool and restore the data.

Thanks,

I've completed the manual backup of the data, removed the pool, and am reformatting the drive. While I was doing this I finally did get a pop-up for CRC errors on the drive. I plan to replace it in the near future; I need to look at options. For now I'll see if I can get basic services back up and running on it. There are a few things I may temporarily shift to the Array if absolutely required, even though I know that's not optimal.

Maybe I can finally get a faster cache drive...

1 hour ago, Frank1940 said:

As a completely side issue, I would check that UPS battery. A drop in the voltage of the power feed to the UPS will usually cause the UPS to supply power to the load without a discernible drop in the voltage being supplied. (I normally use enough incandescent light bulbs to require the UPS to deliver 200-300 watts of power. I would expect it to continue to supply power for at least five minutes. If it doesn't, I would be ordering a replacement battery!)

The power issue was isolated to that particular circuit; the power was cut and immediately put back on. At a 300W load it's sitting at 7 minutes of runtime right now. Both Unraid boxes are set to shut down if switched to battery for more than 90 seconds. The intent there is to ensure they both shut down cleanly with some reserve power left over.

trurl · January 16

3 minutes ago, charibdis said:

CRC errors

These indicate connection problems. I usually just acknowledge the occasional CRC SMART warning on the Dashboard page. After acknowledging, it will warn again if the count increases. I might reseat the connector next time I open up the case. If it increases rapidly, you need to investigate.
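To watch the counter from the console instead, a minimal sketch assuming the drive reports it as SMART attribute ID 199 (usually named UDMA_CRC_Error_Count) and that the raw value is the tenth column of `smartctl -A` output. The helper name `crc_raw_count` and the device name in the usage line are assumptions.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# raw value of attribute 199 (the interface CRC error counter).
crc_raw_count() {
    awk '$1 == 199 { print $10 }'
}

# Usage (device name is an assumption):
# smartctl -A /dev/sde | crc_raw_count
```

Comparing the number before and after reseating or swapping the SATA cable shows whether it is still climbing; the raw count itself does not reset.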

charibdis · January 16

18 minutes ago, trurl said:

These indicate connection problems. I usually just acknowledge the occasional CRC SMART warning on the Dashboard page. After acknowledging, it will warn again if the count increases. I might reseat the connector next time I open up the case. If it increases rapidly, you need to investigate.

Thanks. I'll be keeping an eye on it. I expect some of those errors are from the age of the drive as well. Its SMART power-on hours info shows: 12y, 3m, 26d, 15h.
