Unable to write to cache and docker image


Note that setting minimum free space to more than the largest file you expect to write doesn't mean that space below the minimum will never be used.

 

For example, say you set minimum free to 10GB. A disk has 20GB free. You write a 15GB file. It could choose another disk depending on other factors such as allocation method and split level, but it is allowed to choose that disk because it has more than minimum. If it does choose the disk, it will write that 15GB file to it. After that, the disk will have 5GB free, which is now less than minimum. Unraid will not choose the disk again until it has more than minimum again.
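To make the arithmetic concrete, here is the rule above as a tiny illustrative script (just the logic as described, not Unraid's actual code):

```shell
# Illustration only: the minimum-free rule as described above, not Unraid code.
min_free_gb=10   # minimum free space setting
free_gb=20       # current free space on the disk
file_gb=15       # size of the file being written

# The disk is eligible because free space exceeds the minimum...
if [ "$free_gb" -gt "$min_free_gb" ]; then
    echo "disk eligible"
    free_gb=$((free_gb - file_gb))   # ...even though the write takes it below the minimum
fi
echo "free after write: ${free_gb}GB"   # 5GB: below minimum, so the disk won't be chosen again
```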


Or even another example, one where you don't try to write a file larger than minimum.

 

Minimum is set to 10GB as before. The disk has 11GB free. You write a 9GB file. The disk can be chosen, and afterwards it will only have 2GB free.

On 4/14/2020 at 12:54 AM, sonofdbn said:

Unfortunately no joy. I went to the link you provided, and tried the first two approaches. My cache drives (btrfs pool) are sdd and sdb.

 

1) Mount filesystem read only (non-destructive)

I created mount point /x and then tried


mount -o usebackuproot,ro /dev/sdd1 /x

This gave me an error


mount: /x: can't read superblock on /dev/sdd1.

(Same result if I tried sdb1.) Then I tried


mount -o ro,notreelog,nologreplay /dev/sdd1 /x

This produced the same error.

 

So I moved to

2) BTRFS restore (non-destructive)

 

I created the directory /mnt/disk4/restore. Then entered


btrfs restore -v /dev/sdd1 /mnt/disk4/restore

After a few seconds I got this error message:


/dev/sdd1 is currently mounted.  Aborting.

This looks odd (in that the disk is mounted and therefore presumably accessible), so I thought I should check whether I've missed anything so far.

In trying to do the btrfs restore, I realised that it might not be surprising that "1) Mount filesystem read only (non-destructive)" above didn't work, because the disk is already mounted. I haven't had a problem actually accessing the disk. And it's then not surprising to see the last error message above.
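A quick way to confirm this before attempting either recovery step is to check whether the device is already in use (a generic sketch; /dev/sdd1 is the pool member from this thread, substitute your own device):

```shell
# Sketch: check whether a device is already mounted before trying recovery
# mounts on it; both "mount -o ro" and "btrfs restore" fail if it is in use.
dev=/dev/sdd1    # pool member from this thread; substitute your own device
if findmnt -n -S "$dev" >/dev/null 2>&1; then
    state="mounted"
else
    state="not mounted"
fi
echo "$dev is $state"
```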

 

So my problem was how to unmount the cache drives to try 1) again. Not sure if this is the best way, but I simply stopped the array and then tried 1) again. Now I have access to the cache drive at my /x mountpoint, at least in the console. But I was a bit stuck trying to use it in any practical way. I thought about starting up the array again so that I could copy the cache files to an array drive, but wasn't sure if the cache drive could be mounted both "normally" for unRAID and at mountpoint /x.

 

In any case, I had earlier used mc to try to copy files from the cache drive to the array, and that hadn't worked. So I've now turned to WinSCP and am copying files from mountpoint /x to a local drive. The great thing is that it can happily ignore errors, continue, and write to a log. (No doubt there's some Linux way of doing this, but I didn't spend time looking.) Now I swear that some /appdata folders that generated errors when I tried copying earlier are now copying just fine, with no problems. Or perhaps the problem files are just not there any more ☹️. WinSCP can be very slow, but I think that's a result of the online/offline problem I had with some files, and at least it keeps chugging away without the horrible flashing that Teracopy did.

 

But to my earlier point, can I start the array again, say in safe mode with GUI? I'd like to read some files off it. What would happen to the cache drive at mountpoint /x?

 

 

 

On 4/14/2020 at 11:05 PM, johnnie.black said:

If you run a scrub it will identify all corrupt files, so any files not on that list will be OK.

I changed a SATA cable on one of the cache drives in case that was a source of the weird online/offline access. Then I started the array with cache drives unassigned, and mounted the cache drives again at /x. Ran

btrfs dev stats /x

and got 0 errors (two drives in the pool):

[/dev/sdd1].write_io_errs    0
[/dev/sdd1].read_io_errs     0
[/dev/sdd1].flush_io_errs    0
[/dev/sdd1].corruption_errs  0
[/dev/sdd1].generation_errs  0
[/dev/sdb1].write_io_errs    0
[/dev/sdb1].read_io_errs     0
[/dev/sdb1].flush_io_errs    0
[/dev/sdb1].corruption_errs  0
[/dev/sdb1].generation_errs  0

So time to scrub. I started the scrub, waited patiently for over an hour, then checked status, and found that it had aborted.

Scrub started:    Thu Apr 16 19:15:12 2020
Status:           aborted
Duration:         0:00:00
Total to scrub:   1.79TiB
Rate:             0.00B/s
Error summary:    no errors found

Basically it had aborted immediately, without any error message. (I know I shouldn't really complain, but Linux is sometimes not too worried about giving feedback.) I thought this might be because I had mounted the drives read-only, so I remounted them read-write and scrubbed again. Waited a while, and the status looked good:

Scrub started:    Thu Apr 16 21:01:53 2020
Status:           running
Duration:         0:18:06
Time left:        1:07:25
ETA:              Thu Apr 16 22:27:27 2020
Total to scrub:   1.79TiB
Bytes scrubbed:   388.31GiB
Rate:             366.15MiB/s
Error summary:    no errors found

Waited patiently until I couldn't resist checking and found this:

Scrub started:    Thu Apr 16 21:01:53 2020
Status:           aborted
Duration:         1:04:36
Total to scrub:   1.79TiB
Rate:             262.46MiB/s
Error summary:    no errors found

Again there was no message that the scrub had aborted; I had to run scrub status to see it. I ran btrfs dev stats and again got no errors. Maybe this is an edge case, given that the drive is apparently full?

 

So is there anything else worth trying? I'm not expecting to recover everything, but was hoping to avoid having to re-create some of the VMs. What if I deleted some files (if I can) to clear space and then tried scrub again?


If the scrub isn't working, the best bet is to move the data. All data successfully moved can be considered OK; any files that give you an I/O error during the transfer are corrupt. You can still copy those with btrfs restore, but as mentioned before they will remain corrupt.


Command seems to be OK. Am now happily copying files off /x to the array. I swear that some files that couldn't be copied before are now copying across at a reasonable - even very good - speed. A few folders seem to be missing entirely but everything I've tried so far has copied across with no problem. I'm hopeful that most of the files will be recovered.

 

Thanks again for all the help.

On 4/16/2020 at 6:19 PM, sonofdbn said:

Thanks for all the help so far. Given that my cache pool had two drives, sdd and sdb, is this the correct command to mount them?

Sorry, missed this one, but yes, you can use either one; it's in the FAQ.


Sorry to be back again, but more problems.

 

So I backed up what I could, then reformatted the cache drives, set up the same cache pool, and reinstalled most of the dockers and VMs. It was a pleasant surprise to find that just about everything that had been recovered was fine, including the VMs. As far as I could tell, nothing major was missing.

 

Anyway, the server trundled along fine for a few days, but today torrenting seemed a bit slow, so I looked at the dashboard and found that the log was at 100%. I stopped my sole running VM and then tried to stop some dockers but couldn't; they seemed to restart automatically. In the end I used the GUI to stop the Docker service, then tried to stop the array (not shut down), but the disks couldn't be unmounted. I got the message "Array Stopping - Retry unmounting user share(s)" and nothing else happened after that. I grabbed the diagnostics, then used powerdown from the console and shut down the array.

 

From what I can see in the diagnostics, it looks like there are a lot of BTRFS errors. So I'm not sure what I should do at this point. The array is still powered down.

tower-diagnostics-20200424-1108.zip


OK, I can do that, but I already re-created the docker image when I reinstalled the dockers on the re-formatted cache drive. Any way of reducing the chance of this corruption happening again? Or could it be that there's some problem with the appdata files that I backed up and used to reinstall the dockers that is causing this?

9 minutes ago, sonofdbn said:

OK, I can do that, but I already re-created the docker image when I reinstalled the dockers on the re-formatted cache drive. Any way of reducing the chance of this corruption happening again? Or could it be that there's some problem with the appdata files that I backed up and used to reinstall the dockers that is causing this?

Docker image doesn't look full in those diagnostics now, but are you sure you didn't fill it?

 

Since the appdata for a container contains the settings for that application, such as paths, it is certainly possible that the appdata contained incorrect settings, or maybe was missing some settings, that would cause you to fill docker image.

 

Did you check the settings for the applications before you used them again?

 

 


Can't say I did check the settings. But I'm sure the docker image wasn't anywhere near full when I looked at the dashboard. I remember thinking that there was significantly more free space since I'd left out a number of dockers when I did the reinstall.


I've re-created the docker image and all seemed fine. But this morning the log usage on the Dashboard jumped from 1% to 30% when I refreshed the browser. I did reinstall Plex yesterday, and prior to that the log was at 1% of memory (I have 31.4 GiB of usable RAM). Unfortunately it seems that the Dashboard doesn't necessarily update the log until you refresh the browser, so it's possible that the log size was higher than 1% earlier.

 

Is the Log size on the Dashboard just the syslog, or does it include the docker log? The docker log is at 36MB, the syslog only around 1MB.
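In case it helps, this is what I ran to see what's actually in /var/log (standard Linux tools; I'm assuming the Dashboard gauge tracks the /var/log tmpfs):

```shell
# Sketch: see which files are filling /var/log and how full the filesystem
# holding it is (on Unraid, /var/log is a small tmpfs, so it can fill fast)
du -sh /var/log/* 2>/dev/null | sort -rh | head   # biggest logs first
df -h /var/log                                    # overall usage of that mount
```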

 

Diagnostics are attached.

tower-diagnostics-20200427-0949.zip


Took a quick look at the logs in the above diagnostics and they seem to have omitted docker.log.1. So I've attached it here after editing out many similar lines. (I think the file size was too big to upload.)

 

Plex container ID is 8138acb243f3

Pihole container ID is 358107cb7f64

 

These two seem to come up in the logs.

docker.log.1.txt


Now looking at Fix Common Problems, I see two errors: "Unable to write to cache (Drive mounted read-only or completely full.)" and "Unable to write to Docker Image (Docker Image either full or corrupted.)".

 

According to the Dashboard, it doesn't look like the cache is full (90% utilisation - about 100 GB free). This is what I get (now) when I click Container Size on the Docker page in the GUI.

 

Name                              Container     Writable          Log
---------------------------------------------------------------------
calibre                             1.48 GB       366 MB      64.4 kB
binhex-rtorrentvpn                  1.09 GB         -1 B       151 kB
plex                                 723 MB       301 MB      6.26 kB
CrashPlanPRO                         454 MB         -1 B      44.9 kB
nextcloud                            354 MB         -1 B      4.89 kB
mariadb                              351 MB         -1 B      9.62 kB
pihole                               289 MB         -1 B      10.2 kB
letsencrypt                          281 MB         -1 B      10.1 kB
QDirStat                             210 MB         -1 B      19.4 kB
duckdns                             20.4 MB      9.09 kB      5.18 kB

 

8 hours ago, sonofdbn said:

doesn't look like cache is full (90% utilisation - about 100 GB free).

No, but considering the size of the cache, that is a lot of data on it. Any idea what is taking all that space? Except for the usual "system" shares, it looks like you have one starting with 'd' that is cache-only. I'm guessing this is a downloads share. Does it really need to be cache-only? I can understand wanting to post-process on cache, but you might consider making it cache-yes so it can at least overflow, and so anything that sits there too long will get moved to the array.

 

On the other hand, your array is mostly full, so you might want to consider adding more capacity there as well.


Yes, it's a downloads share for torrents. I did try using cache-prefer, but then of course some files did, correctly, go to the array. But I didn't like keeping the array disk spinning for reads. What I'd like to do is download to my unassigned device (SSD) and then manually move things I want to seed longer back to the cache drive. But I can't find any way of doing this in the docker I use (rtorrentvpn).

  • 2 months later...

So in the end I changed the BTRFS cache pool to a single SSD (also BTRFS), re-created everything and was fine for a few months. Unfortunately today I got error messages from the Fix Common Problems plug-in: A) unable to write to cache and B) unable to write to Docker Image. I'm assuming that B is a consequence of A, but anyway I've attached diagnostics.

 

Looking at the GUI, Docker is 34% full and the 1 TB cache drive, a SanDisk SSD, has about 20% free space.

 

But looking at the log for the cache drive, I get a large repeating list of entries like this:

Jul 6 18:36:57 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968
Jul 6 18:36:57 Tower kernel: BTRFS info (device sdd1): no csum found for inode 39100 start 3954470912
Jul 6 18:36:57 Tower kernel: BTRFS warning (device sdd1): csum failed root 5 ino 39100 off 3954470912 csum 0x86885e78 expected csum 0x00000000 mirror 1
Jul 6 18:36:58 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968
Jul 6 18:36:58 Tower kernel: BTRFS info (device sdd1): no csum found for inode 39100 start 3954470912
Jul 6 18:36:58 Tower kernel: BTRFS warning (device sdd1): csum failed root 5 ino 39100 off 3954470912 csum 0x86885e78 expected csum 0x00000000 mirror 1
Jul 6 18:36:59 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968
Jul 6 18:36:59 Tower kernel: BTRFS info (device sdd1): no csum found for inode 39100 start 3954470912
Jul 6 18:36:59 Tower kernel: BTRFS warning (device sdd1): csum failed root 5 ino 39100 off 3954470912 csum 0x86885e78 expected csum 0x00000000 mirror 1
Jul 6 18:36:59 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968
Jul 6 18:36:59 Tower kernel: BTRFS info (device sdd1): no csum found for inode 39100 start 3954470912
Jul 6 18:36:59 Tower kernel: BTRFS warning (device sdd1): csum failed root 5 ino 39100 off 3954470912 csum 0x86885e78 expected csum 0x00000000 mirror 1
Jul 6 18:37:00 Tower kernel: BTRFS error (device sdd1): parent transid verify failed on 432263856128 wanted 2473752 found 2472968

Should I replace the SSD or is there something I can do with BTRFS to try to fix any errors?

tower-diagnostics-20200706-1928.zip

