brandon3055 Posted February 8, 2023 (edited)

Hi guys. Earlier tonight I noticed a bunch of my docker containers were down. Unsurprisingly, it was because my cache filled up again. So I did the usual: started the mover, then almost immediately got impatient and told the system to reboot so I could get my dockers up and running again. Only this time the system never came back up (at least not the webUI). So I checked dmesg via SSH and found it was continuously spamming this:

[dmesg output in spoiler]

From what I understand this is usually caused by bad SATA connections, but at this point I have tried re-seating the cables, replacing the cables, and switching to different SATA ports on the motherboard. It changed nothing. Usually after a while the errors will stop and the webGUI will load, but as soon as I try to access files on cache it starts up again and the files are inaccessible. My guess is that one of the drives is failing, but I have seen both drives mentioned in the errors, so I have no idea which one.

[error log in spoiler]

It's a mirrored cache pool, so if I can figure out which drive is failing it should be a simple matter of disconnecting the bad drive to get the system back up and running, right? Any advice would be most appreciated.

P.S. If you're wondering why my NAS is named what it is: it's because it's slow and it tends to get stuck and bog down the network. So I guess it's just living up to its name...

Edit: Looks like it's sdb. But not sure if I should just remove it or attempt a scrub...

evergreen-diagnostics-20230208-2054.zip

Edited February 8, 2023 by brandon3055
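When both pool members show up in the dmesg spam, btrfs's per-device error counters are one way to tell them apart: on the live system you would run `btrfs device stats /mnt/cache` (and `smartctl -a /dev/sdX` for each member). The stats below are an illustrative sample in that command's output format, not values from the attached diagnostics:

```shell
# Sample `btrfs device stats` output (made-up numbers for illustration):
stats='[/dev/sdb1].write_io_errs   132
[/dev/sdb1].read_io_errs    97
[/dev/sdb1].flush_io_errs   4
[/dev/sdc1].write_io_errs   0
[/dev/sdc1].read_io_errs    0
[/dev/sdc1].flush_io_errs   0'
# Print each device with any nonzero error counter:
echo "$stats" | awk '$2 > 0 { split($1, a, "]"); print substr(a[1], 2) }' | sort -u
```

A device whose counters keep climbing across reboots is the likely culprit; the counters persist until reset with `btrfs device stats -z`.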
JorgeB Posted February 8, 2023

Looks more like a power/connection problem; check/replace cables for cache2 and post new diags after array start.
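One way to distinguish a cable/connection problem from a failing drive is SMART attribute 199 (UDMA_CRC_Error_Count), which counts errors on the SATA link itself. On the real box you would run `smartctl -A /dev/sdb | grep -i crc`; the line below is a made-up sample in smartctl's attribute-table format:

```shell
# Illustrative smartctl attribute line (not from the diagnostics):
line='199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 37'
# The raw value (last field) is the cumulative CRC error count; if it
# keeps rising after a cable swap, the cable was not the problem:
echo "$line" | awk '{ print "CRC errors:", $NF }'
```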
brandon3055 Posted February 8, 2023 Author

I already checked and replaced the cables to both SSDs and it had no effect. Power connections also look good, but I don't have a free SATA power cable to rule it out completely. The first report attached to this post was generated while the server was attempting to start (via SSH). The second was generated when the GUI finally loaded.

evergreen-diagnostics-20230208-2225.zip
evergreen-diagnostics-20230208-2235.zip
JorgeB Posted February 8, 2023

Still showing the same issues. If cables didn't help, remove that SSD; the pool is mirrored, so it should keep working. Then add another device if you want to keep it raid1.
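For context, the btrfs steps underneath JorgeB's suggestion would look roughly like the sketch below. Unraid normally drives this through the GUI when you change pool assignments, so this is only to clarify what happens; device names are hypothetical, and the `run` wrapper just prints each command instead of executing it, so nothing here touches real disks:

```shell
run() { echo "+ $*"; }   # dry-run helper: print instead of execute; remove for real use

run mount -o degraded /dev/sdc1 /mnt/cache    # mount with one mirror member missing
run btrfs device add /dev/sdd1 /mnt/cache     # hypothetical replacement SSD
run btrfs device remove missing /mnt/cache    # drop the failed member from the pool
run btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache   # restore full mirroring
```

Adding the replacement before removing the missing member matters on a two-device raid1, since the pool cannot drop below the profile's minimum device count.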
brandon3055 Posted February 8, 2023 Author

1 hour ago, JorgeB said:
Still showing the same issues. If cables didn't help, remove that SSD; the pool is mirrored, so it should keep working. Then add another device if you want to keep it raid1.

Going to have to continue this in the morning, but I removed the bad drive and the cache is now readable. However, it looks like it has gone read-only as a result of having no space left, so the mover is unable to do its job. At the very least I can access the files now and can manually copy everything off if I have to.

evergreen-diagnostics-20230209-0027.zip
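A full btrfs pool flipping read-only is typically chunk exhaustion: the filesystem can no longer allocate a new chunk even though some free space remains inside existing ones. On the live pool, `btrfs filesystem usage /mnt/cache` shows this; the summary lines below are an illustrative sample, not figures from the diagnostics:

```shell
# Sample `btrfs filesystem usage` summary (made-up numbers):
usage='Device size:          465.76GiB
Device allocated:     465.76GiB
Device unallocated:     0.00GiB
Free (estimated):       1.21GiB'
# When "Device unallocated" reaches zero, btrfs cannot create new chunks
# and may go read-only even though "Free (estimated)" looks nonzero:
echo "$usage" | awk -F: '/unallocated/ { gsub(/ /, "", $2); print $2 }'
```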
Solution — JorgeB Posted February 8, 2023

If needed you could cancel the balance, delete some data, then re-balance, but it's probably easier to just back up and re-format.
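The cancel-and-re-balance route would look roughly like this. As above, the `run` wrapper only prints the commands so the sketch is safe to run anywhere; the mount point is the usual Unraid cache path:

```shell
run() { echo "+ $*"; }   # dry-run helper: print instead of execute; remove for real use

run btrfs balance cancel /mnt/cache            # stop the running balance
# ...delete or move some data off the pool to free space, then:
run btrfs balance start -dusage=50 /mnt/cache  # compact data chunks that are <50% full
```

The `-dusage=50` filter limits the balance to partially filled data chunks, which frees unallocated space quickly without rewriting the whole pool.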
brandon3055 Posted February 9, 2023 Author

14 hours ago, JorgeB said:
If needed you could cancel the balance, delete some data, then re-balance, but it's probably easier to just back up and re-format.

Yeah, in the end I just disabled cache on all shares, rsync'd everything to my backup share, removed the cache pool, and then restored everything to the appropriate shares.