
Disappearing cache drive, Unable to write to other cache drive, and now unable to start docker service.


Solved by CorserMoon


I recently dismantled a secondary, non-parity-protected pool of several HDDs. Two of these drives are to replace the existing single parity drive of the array, and the remaining drives will be added to array storage. I have run into a series of cascading issues that has resulted in the docker service not starting. Here is the general timeline:

 

  • Stopped the array in order to swap a single 12TB parity drive for 2x14TB parity drives. As soon as the array stopped, one of my 2 cache drives (2x1TB NVMe, mirrored) disappeared: it shows as missing and isn't in the disk dropdowns. My first thought was that it died. 
  • Immediately restarted the array (without swapping the parity drives) and performed a backup of the cache pool to the array via the Backup/Restore Appdata plugin. Completed successfully. Everything, including docker, working normally. 
  • Ordered new nvme drives to replace both. 
  • Stopped the array and successfully swapped the parity drives as outlined earlier. Parity rebuilt successfully. 
  • Stopped array to add remaining HDDs to array storage. Added, started array, and disk-clear started automatically as expected. 
  • Got notification "Unable to write to super_cache" (super_cache is the cache pool). Paused disk-clear and rebooted the server.
  • Same error upon reboot. In the interest of troubleshooting, I increased the docker image size to see if that was the issue, but the service still wouldn't start. I AM able to see/read files on the cache drive but can't write to it. A simple mkdir command in the appdata share errors out saying it's a read-only file system. 
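For anyone hitting the same symptom, the read-only state can be confirmed from a shell. This is a rough sketch; `/mnt/super_cache` is an assumed mount point based on the pool name in this thread, so substitute your own pool's path:

```shell
POOL=/mnt/super_cache   # assumed mount point for the cache pool; adjust as needed
if grep -qs " $POOL " /proc/mounts; then
  # show the mount entry; a pool that btrfs forced read-only lists "ro" in its options
  grep " $POOL " /proc/mounts
  # the kernel log usually records why btrfs flipped the filesystem to read-only
  dmesg | grep -i btrfs | tail -n 5
else
  echo "pool not mounted at $POOL"
fi
```

The `dmesg` output is usually more informative than the GUI error, since btrfs logs the specific failure (e.g. no space left for metadata) at the moment it remounts read-only.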

 

My best guess is that both NVMe drives failed? Or maybe the PCIe adapter they are in failed? Any thoughts or clues from the attached diagnostics as I wait for the replacement drives to arrive?

diagnostics-20231025-1118.zip

1 hour ago, JorgeB said:

Pool is missing a device and the other is fully allocated, so it's going read-only. Disable the docker/VM services and reboot; if the pool doesn't immediately go read-only, you need to free up some space and rebalance, see here:

 

https://forums.unraid.net/topic/62230-out-of-space-errors-on-cache-drive/?do=findComment&comment=610551

 

 

I don't think it actually is full though. The "Super_Cache" pool has two 1TB drives (super_cache and super_cache 2). One disappeared (i.e. missing), but everything was working fine after I acknowledged that it was missing, since the drives were mirrored (1TB actual space). I was having no issues with docker until this morning. I monitor that capacity closely and the drives were ~70% full before all this happened. The GUI currently shows the remaining drive (super_cache 2) with 279GB free.

[screenshot of the pool's free space in the GUI]

 

Strangely, du -sh super_cache/ shows a total size of 476GB. But regardless, it shouldn't be full. 
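A note for readers: plain `du` only adds up logical file sizes and knows nothing about btrfs internals (RAID1 duplication, shared extents, chunk allocation), so it routinely disagrees with the GUI. The btrfs-aware equivalents give the filesystem's own view. A sketch, again assuming the pool is mounted at `/mnt/super_cache`:

```shell
POOL=/mnt/super_cache   # assumed mount point; substitute your pool's path
if grep -qs " $POOL " /proc/mounts; then
  du -sh "$POOL"                   # logical file sizes only, unaware of btrfs internals
  btrfs filesystem du -s "$POOL"   # btrfs-aware totals (exclusive vs shared extents)
  btrfs filesystem usage "$POOL"   # device size vs allocated vs used, per profile
else
  echo "pool not mounted at $POOL"
fi
```

On a mirrored (raid1) pool, `btrfs filesystem usage` is the number to trust, since it distinguishes raw capacity from what the chosen profile can actually store.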

 

Side note: that link throws this error: You do not have permission to view this topic.

Edited by CorserMoon
16 minutes ago, CorserMoon said:

I don't think it actually is full though.

It's not full, 

 

1 hour ago, JorgeB said:

is fully allocated

But the result is the same: it won't be able to write any data until there's some free space to allocate new metadata chunks.
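The usual recovery sequence for a fully allocated pool is the one the linked thread describes: free some space, then run a filtered balance so emptied chunks are returned to the unallocated pool. A hedged sketch (mount point assumed; the pool must be writable again before a balance can run, which may require the reboot mentioned above). The `-dusage=N` filter only rewrites data chunks that are at most N percent full, so it is much cheaper than a full balance:

```shell
POOL=/mnt/super_cache   # assumed mount point; adjust to your pool
if grep -qs " $POOL " /proc/mounts; then
  # first reclaim chunks that are completely empty (nearly free to do)
  btrfs balance start -dusage=0 "$POOL"
  # then compact mostly-empty chunks to release their space back as unallocated
  btrfs balance start -dusage=50 "$POOL"
  btrfs balance status "$POOL"
else
  echo "pool not mounted at $POOL"
fi
```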

 

17 minutes ago, CorserMoon said:

side note, that link throws this error: You do not have permission to view this topic.

That's strange, it works for me.

 


 

You need to free up some space first or the balance will likely fail.

13 minutes ago, JorgeB said:

It's not full, 

 

But the result is the same, it won't be able to write any data until there's some free space to allocate new metadata chunks.

 

That's strange, works for me:

 


 

You need to free up some space first or the balance will likely fail.

 

So what is the difference between allocation and free space? What would cause allocation to fill up, and is there a way to monitor for that? It's just weird that all this started happening after one of the cache drives disappeared. Would full allocation cause this? 

 

I also just noticed that when the array is stopped and I am assigning/un-assigning disks, this error sporadically pops up briefly then disappears:

[screenshot of the error]

 

EDIT: I tried to start the Mover process to move any extraneous data off the cache drive, but the mover doesn't appear to be starting. 

Edited by CorserMoon
6 minutes ago, CorserMoon said:

So what is the difference between allocation and free space?

It's the way btrfs works; it would take more time than I have now to explain here. You can Google it, it's easy to find.
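For readers landing here, a rough sketch of the distinction: btrfs first carves raw disk space into chunks (allocation), and files then consume space inside those chunks (usage). So "free (estimated)" can be large while zero bytes remain unallocated, and once unallocated space hits zero, new metadata chunks can't be created and writes fail. The figures below are made up to mirror this thread's situation, not taken from the diagnostics:

```shell
# Hypothetical `btrfs filesystem usage` output illustrating a fully allocated pool:
usage='Device size:             1000.00GiB
Device allocated:        1000.00GiB
Device unallocated:         0.00GiB
Used:                     721.00GiB
Free (estimated):         279.00GiB'

# The GUI's "free" figure tracks the last line, but new chunks can only be
# carved from "Device unallocated" -- when that hits zero, writes start failing
# even though plenty of space is "free" inside already-allocated chunks.
unallocated=$(printf '%s\n' "$usage" | awk -F': *' '/unallocated/{gsub(/ /,"",$2); print $2}')
echo "unallocated: $unallocated"
```

Watching the "unallocated" line of `btrfs filesystem usage` is one practical way to monitor for this before it bites.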

 

7 minutes ago, CorserMoon said:

Would full allocation cause this?

No.

 

7 minutes ago, CorserMoon said:

I also just noticed that when the array is stopped and I am assigning/un-assigning disks, this error sporadically pops up briefly then disappears:

If it just flashes and disappears you can ignore.

11 minutes ago, JorgeB said:

It's the way btrfs works, it would take more time than I have now to explain here, you can Google it, it's easy to find.

 

No.

 

If it just flashes and disappears you can ignore.

 

Thanks so much for your help.

 

Last questions for now: Would it make sense that one of the cache drives dying would lead to this full-allocation issue? Could it be resolved by just replacing that one dead drive?

 

I'm just trying to figure out if I have 1 issue or multiple different issues. 

Edited by CorserMoon
6 minutes ago, CorserMoon said:

Last questions for now: Would it make sense that 1 of the cache drives dying would lead to this full allocation issue?

No, completely unrelated. This can happen when the filesystem gets close to full and data is then deleted but most chunks are left behind; some may not be fully empty, and others may not have been cleaned up yet. btrfs should remove completely empty chunks automatically after some time.


So I manually deleted many gigs of data off the drive, but free space according to the GUI didn't change, still 279GB free. I tried running Mover, but it didn't seem to start even though there is still data sitting on the cache drive that is configured to move to the array when Mover is invoked. I then rebooted the server; the free space still didn't change, and the files that I deleted are back. I am stuck and don't know what I am doing wrong. 

 

EDIT: At this point it seems to make sense to reformat the pool (since I have the backup from the Backup/Restore Appdata plugin). Is there a guide on how to do this? I also have the issue of the missing cache drive, so I'm not sure how to knock the cache pool back down to 1 drive (it won't let me change the number of devices from 2 back to 1). Or would it be a better idea to pop in a replacement SSD so I'm back up to 2 drives first and then reformat the pool?

 

Additional weird observations:

  • As stated in my OP, I was also trying to add new drives to the array. At that time I added them but paused the disk-clear when I noticed issues. I've since removed the new disks, returning those array slots to "unassigned", but now every time I reboot the server, all those drives are back and disk-clear starts! 
  • I tried using one of the aforementioned HDDs to replace the missing cache drive and provide additional space so that btrfs would hopefully be able to balance, but the cache pool still mounts as read-only and I received a new error: Unraid Status: Warning - pool BTRFS too many profiles (You can ignore this warning when a pool balance operation is in progress)
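For reference, the "too many profiles" warning generally means chunks of more than one RAID profile coexist on the pool, which can happen when a device is added before an earlier balance finished. The profiles in use can be inspected, and a convert balance unifies them. This is a generic btrfs sketch, not Unraid-specific advice, and a balance cannot run while the pool is stuck read-only; mount point and the raid1 target profile are assumptions here:

```shell
POOL=/mnt/super_cache   # assumed mount point; adjust to your pool
if grep -qs " $POOL " /proc/mounts; then
  # list chunk profiles in use; more than one profile per type triggers the warning
  btrfs filesystem df "$POOL"
  # converge data and metadata onto one profile (raid1 assumed for a mirrored pool)
  btrfs balance start -dconvert=raid1 -mconvert=raid1 "$POOL"
else
  echo "pool not mounted at $POOL"
fi
```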
Edited by CorserMoon
