Unable to write to cache and docker image

trurl · April 14, 2020

Note that just because you set minimum to larger than the largest file you expect to write doesn't mean that space will never be used.

For example, say you set minimum free to 10GB. A disk has 20GB free. You write a 15GB file. It could choose another disk depending on other factors such as allocation method and split level, but it is allowed to choose that disk because it has more than minimum. If it does choose the disk, it will write that 15GB file to it. After that, the disk will have 5GB free, which is now less than minimum. Unraid will not choose the disk again until it has more than minimum again.

trurl · April 14, 2020

Or even another example, one where you don't try to write a file larger than minimum.

Minimum is set to 10GB as before. The disk has 11GB free. You write a 9GB file. The disk can be chosen, and after it will only have 2GB free.

sonofdbn · April 15, 2020

OK, so the "buffer" is actually used. I swear that at one time I knew all of this 🙂

sonofdbn · April 15, 2020

On 4/14/2020 at 12:54 AM, sonofdbn said:
Unfortunately no joy. I went to the link you provided, and tried the first two approaches. My cache drives (btrfs pool) are sdd and sdb.

1) Mount filesystem read only (non-destructive)

I created mount point x and then tried
mount -o usebackuproot,ro /dev/sdd1 /x
This gave me an error
mount: /x: can't read superblock on /dev/sdd1.
(Same result if I tried sdb1.) Then I tried
mount -o ro,notreelog,nologreplay /dev/sdd1 /x
This produced the same error.

So I moved to

2) BTRFS restore (non-destructive)

I created the directory /mnt/disk4/restore. Then entered
btrfs restore -v /dev/sdd1 /mnt/disk4/restore
After a few seconds I got this error message:
/dev/sdd1 is currently mounted.  Aborting.
This looks odd (in that the disk is mounted and therefore presumably accessible), so I thought I should check whether I've missed anything so far.

In trying to do the btrfs restore, I realised that it might not be surprising that "1) Mount filesystem read only (non-destructive) " above didn't work, because the disk is already mounted. I haven't had a problem actually accessing the disk. And it's then not surprising to see the last error message above.

So my problem was how to unmount the cache drives to try 1) again. Not sure if this is the best way, but I simply stopped the array and then tried 1) again. Now I have access to the cache drive at my /x mountpoint, at least in the console. But I was a bit stuck trying to use it in any practical way. I thought about starting up the array again so that I could copy the cache files to an array drive, but wasn't sure if the cache drive could be mounted both "normally" for unRAID and at mountpoint /x.

In any case, I had earlier used mc to try to copy files from the cache drive to the array, and that hadn't worked. So I've now turned to WinSCP and am copying files from mountpoint /x to a local drive. The great thing is that it can happily ignore errors and continue and it writes to a log. (No doubt there's some Linux way of doing this, but I didn't spend time looking.) Now I swear that some /appdata folders that generated errors when I tried copying earlier are now copying just fine, with no problems. Or perhaps the problem files are just not there any more ☹️, WinSCP can be very slow, but I think it's a result of the online/offline problem that I had with some files, and at least it keeps chugging away without horrible flashing which Teracopy did.

But to my earlier point, can I start the array again, say in safe mode with GUI? I'd like to read some files off it. What would happen to the cache drive at mountpoint /x?

JorgeB · April 15, 2020

It should be fine to start the array, cache should appear as unmountable, but to be safer just unassign the cache devices before starting.

sonofdbn · April 16, 2020

On 4/14/2020 at 11:05 PM, johnnie.black said:

If you run a scrub it will identify all corrupt files, so any files not on that list will be OK.

I changed a SATA cable on one of the cache drives in case that was a source of the weird online/offline access. Then I started the array with cache drives unassigned, and mounted the cache drives again at /x. Ran

 btrfs dev stats /x

and got 0 errors (two drives in the pool):

[/dev/sdd1].write_io_errs    0
[/dev/sdd1].read_io_errs     0
[/dev/sdd1].flush_io_errs    0
[/dev/sdd1].corruption_errs  0
[/dev/sdd1].generation_errs  0
[/dev/sdb1].write_io_errs    0
[/dev/sdb1].read_io_errs     0
[/dev/sdb1].flush_io_errs    0
[/dev/sdb1].corruption_errs  0
[/dev/sdb1].generation_errs  0

So time to scrub. I started the scrub, waited patiently for over an hour, then checked status, and found that it had aborted.

Scrub started:    Thu Apr 16 19:15:12 2020
Status:           aborted
Duration:         0:00:00
Total to scrub:   1.79TiB
Rate:             0.00B/s
Error summary:    no errors found

Basically it had aborted immediately without any error message. (I know I shouldn't really complain, but Linux is sometimes not too worried about giving feedback.) Thought this might be because I had mounted the drives as read-only, so remounted as normal read-write and scrubbed again. Waited a while, and the status looked good:

Scrub started:    Thu Apr 16 21:01:53 2020
Status:           running
Duration:         0:18:06
Time left:        1:07:25
ETA:              Thu Apr 16 22:27:27 2020
Total to scrub:   1.79TiB
Bytes scrubbed:   388.31GiB
Rate:             366.15MiB/s
Error summary:    no errors found

Waited patiently until I couldn't resist checking and found this:

Scrub started:    Thu Apr 16 21:01:53 2020
Status:           aborted
Duration:         1:04:36
Total to scrub:   1.79TiB
Rate:             262.46MiB/s
Error summary:    no errors found

Again no message that the scrub had aborted, had to run scrub status to see that it had aborted. Ran btrfs dev stats and again got no errors. Maybe this is an edge case, given that the drive is apparently full?

So is there anything else worth trying? I'm not expecting to recover everything, but was hoping to avoid having to re-create some of the VMs. What if I deleted some files (if I can) to clear space and then tried scrub again?

JorgeB · April 16, 2020

If the scrub isn't working best bet it's to move the data, and all data successfully moved can be considered OK, any files that give you i/o error during the transfer are corrupt, you can get them with btrfs restore, but like mentioned before they will still be corrupt.

sonofdbn · April 16, 2020

Thanks for all the help so far. Given that my cache pool had two drives, sdd and sdb, is this the correct command to mount them?

mount -o usebackuproot /dev/sdd1 /x

sonofdbn · April 17, 2020

Command seems to be OK. Am now happily copying files off /x to the array. I swear that some files that couldn't be copied before are now copying across at a reasonable - even very good - speed. A few folders seem to be missing entirely but everything I've tried so far has copied across with no problem. I'm hopeful that most of the files will be recovered.

Thanks again for all the help.

JorgeB · April 17, 2020

On 4/16/2020 at 6:19 PM, sonofdbn said:

Thanks for all the help so far. Given that my cache pool had two drives, sdd and sdb, is this the correct command to mount them?

Sorry, missed this one, but yes, you can use either one, it's in the FAQ.

sonofdbn · April 24, 2020

Sorry to be back again, but more problems.

So I backed up what I could, then reformatted the cache drives and setup the same cache pool, then reinstalled most of the dockers and VMs. It was a pleasant surprise to find that just about everything that had been recovered was fine, including the VMs. As far as I could tell, nothing major was missing.

Anyway, the server trundled along fine for a few days, then today the torrenting seemed a bit slow, so I looked at the dashboard and found that the log was at 100%. So I stopped my sole running VM and then tried to stop some dockers but found I was unable to; they seemed to restart automatically. In the end I used the GUI to stop the Docker service, then tried to stop the array (not shutdown), but the disks couldn't be unmounted. I got the message: Array Stopping - Retry unmounting user share(s) and nothing else happened after that. I grabbed the diagnostics and in the end I used powerdown from the console and shut down the array.

From what I can see in the diagnostics, looks like there are a lot of BTRFS errors. So not sure what I should do at this point. Array is still powered down.tower-diagnostics-20200424-1108.zip

JorgeB · April 24, 2020

Docker image is corrupt, delete and re-create, see docker FAQ if needed, cache filesystem seems OK for what I can see.

sonofdbn · April 24, 2020

OK, I can do that, but I already re-created the docker image when I reinstalled the dockers on the re-formatted cache drive. Any way of reducing the chance of this corruption happening again? Or could it be that there's some problem with the appdata files that I backed up and used to reinstall the dockers that is causing this?

trurl · April 24, 2020

9 minutes ago, sonofdbn said:

OK, I can do that, but I already re-created the docker image when I reinstalled the dockers on the re-formatted cache drive. Any way of reducing the chance of this corruption happening again? Or could it be that there's some problem with the appdata files that I backed up and used to reinstall the dockers that is causing this?

Docker image doesn't look full in those diagnostics now, but are you sure you didn't fill it?

Since the appdata for a container contains the settings for that application, such as paths, it is certainly possible that the appdata contained incorrect settings, or maybe was missing some settings, that would cause you to fill docker image.

Did you check the settings for the applications before you used them again?

sonofdbn · April 24, 2020

Can't say I did check the settings. But I'm sure the docker image wasn't anywhere near full when I looked at the dashboard. I remember thinking that there was significantly more free space since I'd left out a number of dockers when I did the reinstall.

sonofdbn · April 27, 2020

I've re-created the docker image and all seemed fine. But this morning the log usage on the Dashboard jumped from 1% to 30% when I refreshed the browser. I did reinstall Plex yesterday, and prior to that the log was at 1% of memory (I have 31.4 GiB of usable RAM). Unfortunately it seems that the Dashboard doesn't necessarily update the log until you refresh the browser, so it's possible that the log size was higher than 1% earlier.

Is Log size on the Dashboard just the syslog or also docker log? Because docker log is at 36MB, syslog at only around 1MB.

Diagnostics are attached.

tower-diagnostics-20200427-0949.zip

sonofdbn · April 27, 2020

Took a quick look at the logs in the above diagnostics and they seem to have omitted docker.log.1. So I've attached it here after editing out many similar lines. (Think file size was too big to upload.)

Plex container ID is 8138acb243f3

Pihole container ID is 358107cb7f64

These two seem to come up in the logs.

docker.log.1.txt

sonofdbn · April 27, 2020

Now looking at Fix Problems, I see errors: Unable to write to cache (Drive mounted read-only or completely full.) and Unable to write to Docker Image (Docker Image either full or corrupted.).

According to Dashboard, doesn't look like cache is full (90% utilisation - about 100 GB fee). This is what I get (now) when I click Container Size on the Docker page in the GUI.

Name                              Container     Writable          Log
---------------------------------------------------------------------
calibre                             1.48 GB       366 MB      64.4 kB
binhex-rtorrentvpn                  1.09 GB         -1 B       151 kB
plex                                 723 MB       301 MB      6.26 kB
CrashPlanPRO                         454 MB         -1 B      44.9 kB
nextcloud                            354 MB         -1 B      4.89 kB
mariadb                              351 MB         -1 B      9.62 kB
pihole                               289 MB         -1 B      10.2 kB
letsencrypt                          281 MB         -1 B      10.1 kB
QDirStat                             210 MB         -1 B      19.4 kB
duckdns                             20.4 MB      9.09 kB      5.18 kB

JorgeB · April 27, 2020

Cache filesystem is corrupt again, if it's happening multiple times without an apparent reason you likely have some hardware issue.

trurl · April 27, 2020

8 hours ago, sonofdbn said:

doesn't look like cache is full (90% utilisation - about 100 GB fee).

No but considering the size of cache, that is a lot of data on cache. Any idea what is taking all that space? Except for the usual "system" shares looks like you have one starting with 'd' that is cache-only. I am guessing this is a downloads share. Does it really need to be cache-only? I can understand wanting to post-process on cache, but you might consider making this cache-yes so it can at least overflow and so anything that sits there too long will get moved to the array.

On the other hand, most of your array is mostly full so you might want to consider getting more capacity there also.

sonofdbn · April 27, 2020

Yes, it's a downloads share for torrents. I did try using cache-prefer, but then of course some files did, correctly, go to the array. But I didn't like keeping the array disk spinning for reads. What I'd like to do is download to my unassigned device (SSD) and then manually move things I want to seed longer back to the cache drive. But I can't find any way of doing this in the docker I use (rtorrentvpn).

trurl · April 27, 2020

Why not just download to the UD and leave them seeding from the UD?

sonofdbn · April 28, 2020

The UD is 500GB, and it's likely to be too small, especially taking into account other stuff I want to put on it.

sonofdbn · July 6, 2020

So in the end I changed the BTRFS cache pool to a single SSD (also BTRFS), re-created everything and was fine for a few months. Unfortunately today I got error messages from the Fix Common Problems plug-in: A) unable to write to cache and B) unable to write to Docker Image. I'm assuming that B is a consequence of A, but anyway I've attached diagnostics.

Looking at the GUI, Docker is 34% full and the 1 TB cache drive, a SanDisk SSD, has about 20% free space.

But looking at the log for the cache drive, I get a large repeating list of entries like this:

Jul 6 18:36:57 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968
Jul 6 18:36:57 Tower kernel: BTRFS info (device sdd1): no csum found for inode 39100 start 3954470912
Jul 6 18:36:57 Tower kernel: BTRFS warning (device sdd1): csum failed root 5 ino 39100 off 3954470912 csum 0x86885e78 expected csum 0x00000000 mirror 1
Jul 6 18:36:58 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968
Jul 6 18:36:58 Tower kernel: BTRFS info (device sdd1): no csum found for inode 39100 start 3954470912
Jul 6 18:36:58 Tower kernel: BTRFS warning (device sdd1): csum failed root 5 ino 39100 off 3954470912 csum 0x86885e78 expected csum 0x00000000 mirror 1
Jul 6 18:36:59 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968
Jul 6 18:36:59 Tower kernel: BTRFS info (device sdd1): no csum found for inode 39100 start 3954470912
Jul 6 18:36:59 Tower kernel: BTRFS warning (device sdd1): csum failed root 5 ino 39100 off 3954470912 csum 0x86885e78 expected csum 0x00000000 mirror 1
Jul 6 18:36:59 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968
Jul 6 18:36:59 Tower kernel: BTRFS info (device sdd1): no csum found for inode 39100 start 3954470912
Jul 6 18:36:59 Tower kernel: BTRFS warning (device sdd1): csum failed root 5 ino 39100 off 3954470912 csum 0x86885e78 expected csum 0x00000000 mirror 1
Jul 6 18:37:00 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968

Should I replace the SSD or is there something I can do with BTRFS to try to fix any errors?

tower-diagnostics-20200706-1928.zip

JorgeB · July 6, 2020

Checksum errors suggest a hardware problem, like bad RAM.

Unable to write to cache and docker image

Recommended Posts

Link to comment

Top Posters In This Topic

Popular Days

Top Posters In This Topic

Popular Days

Posted Images

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation