All my dockers are missing!?! Please help!



Hi all.  I'm a bit stuck, so I'm looking for advice.  While my unRAID server has been rock solid for about a year, it is currently completely offline, as all my docker apps have gone missing.

 

Here is a summary of what I am seeing.

 

  1. Yesterday, I noticed my Windows 10 VM was suspending/crashing frequently, so I decided to reboot unRAID.  I also noticed some dockers were crashing and generally behaving erratically.
  2. On reboot and array start, the Docker tab is completely empty.  I'm missing the ~10 dockers I had installed and running.
  3. My configured Windows 10 VM is also missing from the VM tab.
  4. About a week ago, the server seemed to crash while I was remotely using Plex to transcode a show to my phone.  It was odd, as I have used Plex transcoding away from home before and it worked fine.  I rebooted in that case as well, and the dockers and VM started OK and seemed to run until the last few days, when they became more erratic.

 

Attached is a section of my System Log.

I did notice this line, which looks a bit unsettling:

Aug 9 21:17:01 unraid root: truncate: failed to truncate '/mnt/cache/system/docker/docker.img' at 21474836480 bytes: No space left on device
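
For reference, this is how I grabbed the log I'm attaching; I believe the diagnostics command would bundle everything into a zip on the flash drive as well, but please correct me if the paths are wrong:

cp /var/log/syslog /boot/SystemLog.txt   # copy the current system log to the flash drive
diagnostics                              # or grab the full diagnostics zip (lands under /boot, I believe)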
 

I'm scanning the forums to see if anyone has reported similar issues.  I'd really like to get my system back online, and I'm *really* hoping I have not actually lost my dockers and VM, as I have invested a huge amount of time in carefully setting up my ~10 dockers and the Windows 10 VM.

 

Thanks for any advice/info!  -Glenner.

 

This is my system info:

unraid 6.3.5

Model: Custom
M/B: ASUSTeK COMPUTER INC. - PRIME H270-PRO
CPU: Intel® Core™ i7-7700 CPU @ 3.60GHz
HVM: Enabled
IOMMU: Enabled
Cache: 256 kB, 1024 kB, 8192 kB
Memory: 32 GB (max. installable capacity 64 GB)
Network: bond0: fault-tolerance (active-backup), mtu 1500 
 eth0: 1000 Mb/s, full duplex, mtu 1500
Kernel: Linux 4.9.30-unRAID x86_64
OpenSSL: 1.0.2k

 

SystemLog.txt


Docker and possibly libvirt images are corrupt, but the underlying problem is your cache devices:

 


Aug  9 21:01:24 unraid kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 14402, rd 1, flush 0, corrupt 0, gen 0
Aug  9 21:01:24 unraid kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 200298, rd 0, flush 0, corrupt 0, gen 0

 

Those are write errors.  Both devices are likely dropping out at various times, causing the filesystem corruption.  This is a hardware problem; since there are no cables to check, make sure the devices are well seated.  A BIOS update can also help, if one is available.


Hi JB, thanks for looking at my logs.  Really appreciate it.

 

But here is an update...

  1. I have 2 Samsung EVO 960 250GB SSDs running in a fault-tolerant raid0 setup (I followed a SpaceInvader YouTube guide to set this up).  My understanding is that with this setup, if one SSD fails the system stays online.  In any event, I'm hoping it's not a hardware issue, as my hardware has been stable and untouched for a year.
  2. My dockers have actually returned this morning!  I started the array last night and confirmed the Docker tab was still empty, then started this post to see if anyone had any ideas.  Up until now I did not want my crippled array left started, so I had kept the array stopped.  But last night I left the array started overnight, and my dockers were up when I got up this morning.  I'm not sure what happened, other than maybe the mover ran and freed up some space?
  3. My Windows 10 VM is still missing from the VM tab, though I can see the VM image in this 42GB file: /mnt/cache/domains/Windows 10/vdisk1.img.  I'm hoping there is a way to restore the VM and get it working, but I'm not sure how to do that...
  4. I think I have some kind of device space issue.  This is the initial error I noticed in my log yesterday, after rebooting to an empty Docker tab: Aug 9 21:17:01 unraid root: truncate: failed to truncate '/mnt/cache/system/docker/docker.img' at 21474836480 bytes: No space left on device
  5. Now, while my dockers have returned and are running, I'm watching my system log and I see "no space left on device" errors like those shown below.
  6. In other threads, I've seen that missing dockers may be related to a filled docker.img file?  I'm trying to figure out how to check for that, as I'm not clear whether that is my situation (see the commands I've gathered after this list).
  7. I'm also not understanding why my cache disks report so much space used.  I have a 250GB cache (see main tab screenshot).  The Main tab UI shows ~180GB used and 70GB free.  Yet if I eyeball the files I actually have on the cache drive, I have appdata (~10GB I think), domains (the 42GB Win10 VM), docker.img (20GB), libvirt.img (1GB), plus some transient media data that the mover moves to the array every 6 hours.  So I think I should only have about 70-80GB used on the cache.  How can I check where my 180GB is going?  Is there a way to check the cache disks and make sure nothing is wrong?
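
These are the console commands I've pieced together from other threads to check both #6 and #7, posting them here in case I have something wrong (I believe docker.img is loop-mounted at /var/lib/docker on unRAID):

# Is docker.img itself full?  It should be loop-mounted at /var/lib/docker
df -h /var/lib/docker
docker system df

# Where is the space on the cache pool actually going?
du -h -d 1 /mnt/cache | sort -h

# btrfs's own view of allocated vs. used space on the pool
btrfs filesystem df /mnt/cache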

Thanks! -Glenner.

 

[screenshot: Main tab]

 



Aug 11 16:01:17 unraid root: >f+++++++++ sagemedia/tv/TheLateShowWithStephenColbert-S03E189-JimAcostaNinaDobrev-8917033-12.mpg.properties
Aug 11 16:01:17 unraid root: .d..t...... sagemedia/tv/
Aug 11 16:01:17 unraid root: .d..t...... sagemedia/
Aug 11 16:01:17 unraid root: mover finished
Aug 11 22:29:20 unraid shfs/user: err: shfs_write: write: (28) No space left on device
Aug 11 22:29:25 unraid shfs/user: err: shfs_write: write: (28) No space left on device
Aug 11 22:29:25 unraid shfs/user: err: shfs_write: write: (28) No space left on device
Aug 11 22:29:27 unraid shfs/user: err: shfs_write: write: (28) No space left on device
Aug 11 22:29:30 unraid shfs/user: err: shfs_write: write: (28) No space left on device
Aug 11 22:32:12 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.4.log /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.5.log (28) No space left on device
Aug 11 22:32:13 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Cache/CloudAccess.dat.tmp.XXfhyapA /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Cache/CloudAccess.dat (28) No space left on device
Aug 11 22:32:13 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.4.log /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.5.log (28) No space left on device
Aug 11 22:32:17 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/PMS Plugin Logs/com.plexapp.agents.lastfm.log.2 /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/PMS Plugin Logs/com.plexapp.agents.lastfm.log.3 (28) No space left on device
Aug 11 22:32:17 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.2.log /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.3.log (28) No space left on device
Aug 11 22:32:17 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Plug-in Support/Caches/com.plexapp.agents.lastfm/HTTP.system/63/._4190e53018e6273f7e438f695cb1505e21dde4_attributes /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Plug-in Support/Caches/com.plexapp.agents.lastfm/HTTP.system/63/4190e53018e6273f7e438f695cb1505e21dde4_attributes (28) No space left on device
Aug 11 22:32:17 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Plug-in Support/Caches/com.plexapp.agents.lastfm/HTTP.system/._CacheInfo /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Plug-in Support/Caches/com.plexapp.agents.lastfm/HTTP.system/CacheInfo (28) No space left on device
Aug 11 22:32:17 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Plug-in Support/Caches/com.plexapp.agents.lastfm/HTTP.system/._CacheInfo /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Plug-in Support/Caches/com.plexapp.agents.lastfm/HTTP.system/CacheInfo (28) No space left on device
Aug 11 22:32:17 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Metadata/Artists/c/5038b2eb1102b83f6ec48c3295dac0c1e7e057c.bundle/Contents/com.plexapp.agents.lastfm/._Info.xml /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Metadata/Artists/c/5038b2eb1102b83f6ec48c3295dac0c1e7e057c.bundle/Contents/com.plexapp.agents.lastfm/Info.xml (28) No space left on device
Aug 11 22:32:18 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/PMS Plugin Logs/com.plexapp.agents.localmedia.log.4 /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/PMS Plugin Logs/com.plexapp.agents.localmedia.log.5 (28) No space left on device
Aug 11 22:32:22 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Plug-in Support/Data/com.plexapp.system/._Dict /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Plug-in Support/Data/com.plexapp.system/Dict (28) No space left on device
Aug 11 22:32:27 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/PMS Plugin Logs/com.plexapp.agents.htbackdrops.log.4 /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/PMS Plugin Logs/com.plexapp.agents.htbackdrops.log.5 (28) No space left on device
Aug 11 22:32:27 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Plug-in Support/Caches/com.plexapp.agents.htbackdrops/HTTP.system/ad/._3bcd937b316c57255c218179b0269c9b06a550_attributes /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Plug-in Support/Caches/com.plexapp.agents.htbackdrops/HTTP.system/ad/3bcd937b316c57255c218179b0269c9b06a550_attributes (28) No space left on device
Aug 11 22:32:27 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Plug-in Support/Caches/com.plexapp.agents.htbackdrops/HTTP.system/._CacheInfo /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Plug-in Support/Caches/com.plexapp.agents.htbackdrops/HTTP.system/CacheInfo (28) No space left on device
Aug 11 22:32:30 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Metadata/Artists/c/5038b2eb1102b83f6ec48c3295dac0c1e7e057c.bundle/Contents/_combined/._Info.xml /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Metadata/Artists/c/5038b2eb1102b83f6ec48c3295dac0c1e7e057c.bundle/Contents/_combined/Info.xml (28) No space left on device
Aug 11 22:35:35 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/binhex-delugevpn/state/torrents.state /mnt/cache/appdata/binhex-delugevpn/state/torrents.state.bak (28) No space left on device
Aug 11 22:38:55 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/binhex-delugevpn/state/torrents.state /mnt/cache/appdata/binhex-delugevpn/state/torrents.state.bak (28) No space left on device
Aug 11 22:42:12 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.4.log /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.5.log (28) No space left on device
Aug 11 22:42:15 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/binhex-delugevpn/state/torrents.state /mnt/cache/appdata/binhex-delugevpn/state/torrents.state.bak (28) No space left on device
Aug 11 22:45:35 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/binhex-delugevpn/state/torrents.state.tmp /mnt/cache/appdata/binhex-delugevpn/state/torrents.state (28) No space left on device
Aug 11 22:50:26 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Cache/CloudAccess.dat.tmp.XXYg3otU /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Cache/CloudAccess.dat (28) No space left on device
Aug 11 22:52:12 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.4.log /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.5.log (28) No space left on device
Aug 11 22:52:12 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.1.log /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.2.log (28) No space left on device
Aug 11 22:53:48 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/sickrage/cache.db-journal /mnt/cache/appdata/sickrage/.fuse_hidden0012bf7600000056 (28) No space left on device
Aug 11 22:55:02 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/sagetv/server/Sage.properties.tmp /mnt/cache/appdata/sagetv/server/Sage.properties (28) No space left on device
Aug 11 23:02:12 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.4.log /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.5.log (28) No space left on device
Aug 11 23:02:12 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.log /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.1.log (28) No space left on device
Aug 11 23:02:13 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.4.log /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.5.log (28) No space left on device
Aug 11 23:02:13 unraid shfs/user: err: shfs_rename: rename: /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Cache/CloudAccess.dat.tmp.XX3nJR9Z /mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Cache/CloudAccess.dat (28) No space left on device

 

 

4 hours ago, glenner said:

I have 2 Samsung EVO 960 250G SSDs running in a fault tolerant raid0 setup (followed SpaceInvader youtube to setup this up).   My understanding is that with this setup if one SSD fails, the system stays online.  In any event, I'm hoping it's not a hardware issue as my hardware has been stable and untouched for a year.

You are using raid1; raid0 has no redundancy.  But if one SSD drops offline (and both of yours are dropping, though possibly not at the same time) you will still get corruption in any NODATACOW shares, like the ones unRAID uses for docker and VMs by default.  There is also definitely a hardware problem; it's the only way to get those write errors.

 

4 hours ago, glenner said:

I think I have some kind of device space issue.  The initial error in my log I noticed yesterday after rebooting and my docker tab was empty: Aug 9 21:17:01 unraid root: truncate: failed to truncate '/mnt/cache/system/docker/docker.img' at 21474836480 bytes: No space left on device

You do, but the write errors are a much more serious issue.  To fix the out-of-space error you would run this, but since your pool is corrupt I don't know if there is much point; best would be to redo the pool, or at least reset the stats to see if you're still getting more write errors.  You should also upgrade to the latest unRAID, as this out-of-space problem was fixed on newer kernels.

 

 


Man... this sounds potentially terrible.  But my server is currently up and all my dockers seem to be running.  I've changed the mover to run every 4 hours instead of 8, as that should keep more space available on the cache.  I've shut down Plex for now, just to see if that makes things more stable.  I'm only missing my VM, and I can live without that for now.

 

So I'm wondering how corrupted my system actually is.  I'm not that experienced with low-level maintenance, but I'd like to be, and I should be able to do anything advised to fix the issues.  This system has been really solid for a year, and the main load is the SageTV docker, which results in 100GB+ of over-the-air TV mpg files being written to and read from the cache daily.  The SageTV docker wrote 10-20GB to the cache overnight and it seems to be fine...  Not sure how I got here, as I thought hardware SSD issues were incredibly rare.  The issues seem to have started within the last few weeks and came to a head when I ran Plex transcoding for a bit last week.

  1. The Main tab shows "0 errors" for my cache drives.  How do I reset the stats and watch for write errors over the next few days to see if they are still happening?
  2. I would like to upgrade to the latest unRAID, but there seem to be some compatibility issues with the SageTV docker, so I was holding off until necessary.  It might be best to stabilize what I have before I take that on.
  3. Right... I have raid1.  Maybe I should just switch to raid0 then, if I'm having cache issues?  Not sure how to do that...  Maybe I should try an xfs cache while I'm at it?
  4. Is there a good guide to "redo" my cache pool?  Not sure how to do that or what's involved.  I really want to preserve/save my system...
  5. I have not backed up anything on the cache at this point.  I don't have a real backup strategy, unfortunately, so I'm wondering if there is anything I should do on that front immediately.  I likely need to figure out how to save the stuff on the cache in case things get worse.
  6. Maybe I should try the balance?  To do that, do I just run btrfs balance start -dusage=75 /mnt/cache now and report back?  (My planned sequence is in the sketch after this list.)
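
For #6, this is the sequence I'm planning to run once I have backups, so I can see whether the balance actually frees up allocated space; please correct me if this is off:

# See how much space btrfs has allocated vs. actually used, before balancing
btrfs filesystem df /mnt/cache
btrfs filesystem show /mnt/cache

# Rewrite data chunks that are 75% full or less, returning reclaimed chunks to unallocated space
btrfs balance start -dusage=75 /mnt/cache

# Check the allocation again afterwards
btrfs filesystem df /mnt/cache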

Thanks!  Really appreciate any feedback and guidance. 

3 hours ago, glenner said:

I've changed the mover to run every 4 hours instead of 8 as it should keep more space available on the cache.

You might consider not caching some of your User Shares. There is no requirement to have all writes go first to cache then moved. Some of us mostly use cache just for the performance it gives dockers / VMs and just do our normal User Share writes directly to the parity array. Typically you would have cache large enough so that any moves could take place when other things weren't going on, but you are considering running mover so frequently that it is likely to impact other things. If you just write directly to the array it doesn't need to be moved and it is immediately protected by parity.

3 hours ago, trurl said:

You might consider not caching some of your User Shares. There is no requirement to have all writes go first to cache then moved. Some of us mostly use cache just for the performance it gives dockers / VMs and just do our normal User Share writes directly to the parity array. Typically you would have cache large enough so that any moves could take place when other things weren't going on, but you are considering running mover so frequently that it is likely to impact other things. If you just write directly to the array it doesn't need to be moved and it is immediately protected by parity.

 

I don't have much on the cache, just appdata, docker.img, and the Windows 10 VM.  I also have the SageTV recordings share.  I have 4 ATSC tuners on my network that the SageTV docker can use to record up to 4 HD TV shows at once.  At maximum throughput, that works out to 6GB/hour per channel, or ~24GB/hr, though usually I'm only recording 1-2 shows at once.  Recorded shows are then moved to the array by the mover.  On the SageTV forums, I believe this is the recommended config for performance.

 

 


Ok... So while my SageTV, Deluge, and Sickrage dockers all seem to still be up and running properly, and some behaviour appears normal, my unRAID setup is clearly having issues.  I tried stopping and starting my simple DuckDNS docker and it fails to start with "Server Execution Failure" (see screencap).  I also cannot update any dockers, and I cannot turn off the autostart toggle for my dockers; the change is not preserved in the UI.  I think if I reboot, which is what I tried last time, I will get the blank Docker tab, and maybe a few hours later my dockers will all actually be started, and if I refresh the Docker tab I will see them again.

 

I am trying to figure out how to do 3 things (and am searching the forum threads) in order of priority as I am assuming this is what I need to do:

  1. Back up my cache appdata folder immediately to save all my valuable config.  I'm looking for a process or command line to do this.  Ideally I guess I would just back up my whole /mnt/cache to somewhere on the array.  Is there a recommended command or process for this?  (My best guess is the rsync sketch after this list.)
  2. Rebuild my cache pool from scratch and recreate my docker.img.  This seems daunting and a tad scary... but I want to get started.  I'm looking for the "official guide" or thread for this too.  Does anyone know?
  3. I'd like to test my cache SSDs to see if they are functioning correctly.  Is there some kind of test I can run?  I'm happy to buy new SSDs if necessary, or to try to replace my premium Samsung SSDs under warranty... but it would be nice to definitively confirm a hardware issue.  Right now, I'm happy to try to restore/recover my crippled system.
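
In case there isn't a built-in tool for #1, my working assumption is that a plain rsync to an array disk would do the job (disk1 and the backup path below are just examples):

# Dated copy of appdata onto an array disk; adjust the destination disk/path to taste
mkdir -p /mnt/disk1/backups/appdata-$(date +%Y%m%d)
rsync -avh --progress /mnt/cache/appdata/ /mnt/disk1/backups/appdata-$(date +%Y%m%d)/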

 

[screenshot: DuckDNS docker "Server Execution Failure" error]


You don't need a cache drive to get the required performance to write 24 GB/hour.  If the array manages 40 MB/s with turbo write off, that's still about 5 times the write speed your recordings need.  So even without a cache you can record 24 GB/hour and at the same time watch one or more streams from the same data disk.

1 minute ago, pwm said:

You don't need a cache drive to get the required performance to write 24 GB /hour. If the array manages 40 MB/s with turbo write off, that's still about 5 times the write speed your recordings needs. So even without a cache you can record 24 GB/hour and at the same time look at one or more streams from the same data disk.

 

I think part of the issue/recommendation is that SageTV users may also be recording 6-18 hours a day?  So maybe that means the array would mostly be spinning constantly?  Not sure.

 

But I'll try moving the SageTV recordings share to the array once I'm back to stable.  I'm fine if that works, and it should be easy to test.  I think I also saw a thread on moving the transcoding directory out of the Plex docker, which also sounds useful; I'll need to look at that too.  But right now, I'm trying to figure out how to stabilize and recover my crippled system.  Do you have any advice on my #1-3 above?

 

Thanks!

9 minutes ago, trurl said:

#1 - CA Backup plugin: https://lime-technology.com/forums/topic/61211-plugin-ca-appdata-backup-restore-v2/

#2 - Stop Docker service and it will allow you to delete and recreate docker img. Then you can reinstall your dockers with their previous settings from the Previous Apps feature of Apps (Community Applications)

 

Thanks trurl.  You are awesome.

 

#1. I actually had CA Backup/Restore installed initially when I set up the box, but never ran it or set it up.  My bad; I got a little lazy once my system was running.  It seems I have v1, so it looks like I should uninstall that and use v2 instead.  I hope that works OK... my system is clearly fragile right now, and I cannot actually update any dockers at the moment.  Maybe installing a new plugin will be fine, but I'm not even sure I should try updating Community Applications if that makes things more unstable?

 

#2. I'd like to recreate the dockers from templates, as I'm sure I have custom config.  Just to confirm: I'm looking at the Previous Apps feature right now and don't see my dockers (only the 4 dockers I don't use, in the screencap).  I guess I will only see my dockers in there once I recreate a blank docker image?  Just wanted to confirm.

 

[screenshot: Previous Apps view]

 

1 hour ago, glenner said:

So maybe that means the array will mostly be constantly spinning?   Not sure.

 

Most disks work well spinning 24/7 if you don't worry about the power consumption.  I have lots and lots of disks spinning 24/7.

 

But if you have a cache drive, then of course you can use it; I just don't think you need it for performance reasons.

13 hours ago, glenner said:

The main tab shows "0 errors" for my cached drives.  How do I reset the stats and look for write errors over the next days to see if they are still happening?

On the console type:

btrfs dev stats -z /mnt/cache

This will show the current stats and reset them to 0.  Then, once a day or so, run the same command without -z to check the current values; they should all stay at 0.
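
If you would rather not remember to check by hand, a one-line daily script (e.g. scheduled via the User Scripts plugin) can push the counters into the syslog, something like:

#!/bin/bash
# Append the current btrfs device error counters to the syslog once a day;
# any non-zero read/write value points at a hardware problem.
btrfs dev stats /mnt/cache | logger -t btrfs-devstats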

 

You should also run a scrub on the pool, but note that NODATACOW shares can't be checked or fixed; e.g., the system share where the docker/VM images live is NODATACOW by default.
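
To run the scrub from the console:

# Start a scrub of the cache pool (it runs in the background)
btrfs scrub start /mnt/cache

# Check progress and the error summary; checksummed (COW) data with a bad copy is repaired from the good mirror
btrfs scrub status /mnt/cache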

13 hours ago, glenner said:

#2. I'd like to recreate the dockers from templates as I have custom config I'm sure.  Just to confirm I'm looking at the previous apps feature right now, and don't see my dockers (only see 4 dockers I don't use in screencap).  I guess I will only see my dockers in there once I recreate a blank docker image?  Just wanted to confirm.

 

I think it only shows you dockers that aren't currently installed so a new image should work. It just uses the "my-templates" that were created when you filled out the Add Container form when you created your dockers. You can look at any of these templates just by bringing up the Add Container form again and selecting one of them.

14 hours ago, johnnie.black said:

On the console type:


btrfs dev stats -z /mnt/cache

This will show the current stats and reset them to 0, then once a day or so run the same command without -z to check current values, they should all be 0

 

You should also run a scrub on the pool, but note that NODATACOW shares can't be checked or fixed, e.g., the system share where the docker/VMs are is by default NODATACOW.

 

Thanks johnnie.black.  I ran it just now; here is the result.  It does seem like a lot of errors.  I'm not sure when these stats were last reset... On a working system, will you only ever see 0's here?  Or can some kind of docker software issue also cause errors? i.e. Plex transcoding issues, or the SageTV tuner hardware losing signal and producing a corrupted TV mpg recording (which happens sometimes).  I'm just trying to make sure... Are you suggesting I need to pull these SSDs and put in new ones?

 

I'll post updated status tomorrow... 

Right now, I still mostly have only SageTV up and running, hitting the cache at up to 10-18GB/hr at times... so lots of IO goes to this cache on a daily basis.

I think I need to shut down my dockers to make a cache and appdata backup... I'm trying to figure out how to recover my system and rebuild my cache.

 

root@unraid:/mnt/cache# btrfs dev stats -z /mnt/cache
[/dev/nvme0n1p1].write_io_errs   14402
[/dev/nvme0n1p1].read_io_errs    1
[/dev/nvme0n1p1].flush_io_errs   0
[/dev/nvme0n1p1].corruption_errs 0
[/dev/nvme0n1p1].generation_errs 0
[/dev/nvme1n1p1].write_io_errs   200298
[/dev/nvme1n1p1].read_io_errs    0
[/dev/nvme1n1p1].flush_io_errs   0
[/dev/nvme1n1p1].corruption_errs 0
[/dev/nvme1n1p1].generation_errs 0


 

On 8/12/2018 at 10:37 PM, trurl said:

#1 - CA Backup plugin: https://lime-technology.com/forums/topic/61211-plugin-ca-appdata-backup-restore-v2/

#2 - Stop Docker service and it will allow you to delete and recreate docker img. Then you can reinstall your dockers with their previous settings from the Previous Apps feature of Apps (Community Applications)

 

For #1, I'm getting set to do that and create an appdata backup.  I'm just avoiding shutting down my dockers while my wife and kids are watching TV... WAF is an issue for me; my server is mission critical. :-)

 

But I could also just use Midnight Commander (mc) to make a full copy of my /mnt/cache folder to a backup folder on the array?  Would that work too?  Do you have to shut down the dockers before doing a backup?

 

I did have crashplan installed at one point and it was backing up appdata... but I'd rather not use crashplan if I can avoid it.  Either #1 or mc sounds much easier to me.

9 hours ago, glenner said:

I'm not sure when these stats were last reset...

If you never reset them, they show the stats since the filesystem was first created.

 

9 hours ago, glenner said:

On a working system you will only ever see 0's here?  Or can some kind of docker software issue also errors?

Any non-zero values on read/write errors indicate a hardware problem; they can't be caused by software.

14 hours ago, glenner said:

I could also just use midnight commander (mc) to make a full copy of my /mnt/cache folder to a backup folder on the array?  That will work too?  Do you have to shutdown dockers before doing a backup?

I always use mc for any manual disk-to-disk copy.  Make sure you only copy disk to disk, and don't mix user shares and disks when moving/copying.

 

If you are going to do the whole cache you should run mover first to get any cached user share files moved. Then stop the Docker / VM services.
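
If you want to do it from the console instead of the GUI, the rough sequence would be something like this (script names from memory, so double-check them on your version before relying on it):

mover                      # flush any cached user-share files to the array first
/etc/rc.d/rc.docker stop   # stop the Docker service
/etc/rc.d/rc.libvirt stop  # stop the VM (libvirt) service
# copy the whole pool to an array disk; disk1 and the folder name are just examples
rsync -avh /mnt/cache/ /mnt/disk1/cache-backup/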

 

 

14 hours ago, johnnie.black said:

If you never did it they show the stats since the filesystem was first created.

 

Any non 0 values on read/write errors indicate a hardware problem, they can't be caused by software.

 

Thanks Johnnie.  This is what I've been seeing in the 24 hours after resetting the cache disk error stats:

 

1. I don't have any new errors since then.

root@unraid:/mnt# btrfs dev stats /mnt/cache
[/dev/nvme0n1p1].write_io_errs   0
[/dev/nvme0n1p1].read_io_errs    0
[/dev/nvme0n1p1].flush_io_errs   0
[/dev/nvme0n1p1].corruption_errs 0
[/dev/nvme0n1p1].generation_errs 0
[/dev/nvme1n1p1].write_io_errs   0
[/dev/nvme1n1p1].read_io_errs    0
[/dev/nvme1n1p1].flush_io_errs   0
[/dev/nvme1n1p1].corruption_errs 0
[/dev/nvme1n1p1].generation_errs 0

 

2. I have 9 dockers configured, and usually they would all be up.  Right now I'm only running 5: sagetv, logitech media server, deluge, sickrage, and openvpn.  Crashplan, duckdns, handbrake and plex are shut down.

3. My SageTV docker recorded a bunch of shows last night (cache writes)... and I was able to watch another show simultaneously (cache and array reads, depending on what I'm watching).  Last night's recordings resulted in 20GB+ being written to the cache.  No issues with these recordings...

4. My SageTV front end UI did slow down and become erratic while I was watching a show last night about Aug 13 23:49:03.  

5. See the syslog.  I start getting errors like those below.  Between 20:04 (the last mover run) and 23:49, SageTV is busy recording roughly 20GB+ of prime-time shows.

Aug 13 20:04:16 unraid root: mover finished
Aug 13 23:49:03 unraid shfs/user: err: shfs_write: write: (28) No space left on device
Aug 13 23:49:03 unraid shfs/user: err: shfs_write: write: (28) No space left on device
Aug 13 23:52:15 unraid kernel: loop: Write error at byte offset 3852955648, length 4096.
Aug 13 23:52:15 unraid kernel: blk_update_request: I/O error, dev loop1, sector 7525296
Aug 13 23:52:15 unraid kernel: BTRFS error (device loop1): bdev /dev/loop1 errs: wr 433, rd 0, flush 0, corrupt 0, gen 0
Aug 13 23:52:15 unraid shfs/user: err: shfs_write: write: (28) No space left on device
Aug 13 23:52:15 unraid shfs/user: err: shfs_write: write: (28) No space left on device
Aug 13 23:54:31 unraid shfs/user: err: shfs_write: write: (28) No space left on device
Aug 13 23:54:31 unraid shfs/user: err: shfs_write: write: (28) No space left on device
Aug 13 23:57:06 unraid shfs/user: err: shfs_write: write: (28) No space left on device


6. These errors stop once I run the mover which finishes at:

Aug 14 00:24:29 unraid root: mover finished

The mover has moved the recently recorded shows from the cache to the array thereby clearing ~20GB from the cache.

 

7. I have since changed the mover to run hourly in order to keep the cache as lean as possible... and have not had these out of space errors so far today.

8. So the BTRFS errors and the "no space left on device" errors depend on whether I run the mover or not.

9. My system should have lots of free space on the cache, so it's not clear to me why it runs out of space unless the mover clears 10-20GB off the cache every hour.  The Main tab reports 156/250GB used, with 94GB free.  I should not be out of space?

10. Even with all these "BTRFS error (device loop1)" entries in the log, the error stats are still 0.

 

So I'm wondering what you think... I realize I still need to rebuild my cache to fix my system... but when would we expect my SSD write errors to recur?  Wouldn't writing 20GB to the SSDs last night cause the error counts to increment?  Is it time to replace the SSDs, or should I wait a bit longer and try to restore my cache on the current hardware?

 

 

 

 

 

syslog.txt

4 minutes ago, glenner said:

1. I don't have any new errors since.

That's good news, keep monitoring for the next few weeks.

 

5 minutes ago, glenner said:

9. My system should have lots of free space on the cache and so it's not clear to me why it thinks it runs out of space unless the mover moves 10-20GB off the cache on an hourly basis.  Main tab reports 156/250 GB used, with 94GB free.   I should not be out of space?

Did you do this:

On 8/12/2018 at 9:01 AM, johnnie.black said:

to fix the out of space error you would run this,

 


Thanks Johnnie.   I have not run the rebalance just yet... I will run the balance after I make some backups of my cache.

I'll try: btrfs balance start -dusage=75 /mnt/cache

  1. Right now, I'm taking screencaps of as much of my docker and system config as possible, in case I need to rebuild my whole system more fully for some reason.
  2. I'm also trying to use mc, rsync, and CA Backup/Restore to create a backup of /mnt/cache.  You can never have too many backups at times like this.
  3. I also want to see what CrashPlan may have backed up for me, if I can get that docker up.  I do have a CrashPlan backup on my array, but I can't tell what's on it.  Note to self: a CrashPlan backup on the array is not very useful if the CrashPlan docker is offline and unusable.
  4. The first thing I noticed is that Logitech Media Server has a log file that is out of control.  A year's worth of logging has produced an 85GB file, more than half of my reported used cache.  I've trashed the log, and I'll need to figure out how to limit it going forward.  Damn!  Ideally this kind of runaway file should just not be allowed, or maybe an alert could be triggered?  Will need to look at that... (The commands I used to track it down are after the directory listing below.)
  5. But now I'm wondering if this runaway log could have triggered most of the issues I've been having, including the write errors?
root@unraid:/mnt/cache/appdata/LogitechMediaServer/logs# ls --block-size=M -al
total 84545M
drwxrwxrwx 1 nobody users     1M Jul 19  2017 ./
drwxrwxrwx 1 nobody users     1M Jul 17  2017 ../
-rw-rw-rw- 1 nobody users     0M Jul 17  2017 perfmon.log
-rw-rw-rw- 1 nobody users     1M Jul 19  2017 scanner.log
-rw-rw-rw- 1 nobody users 84545M Aug 13 01:05 server.log
-rw-rw-rw- 1 nobody users     1M Jul 18  2017 spotifyfamily1d.log
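
For anyone else chasing something similar, this is roughly what I used to track the file down, plus what I probably should have done instead of deleting it out from under the running docker (truncating in place is, I gather, safer while LMS still has the file open):

# Find anything over 1GB on the cache pool
find /mnt/cache -xdev -type f -size +1G -exec ls -lh {} \;

# Zero the runaway log in place instead of deleting it
truncate -s 0 "/mnt/cache/appdata/LogitechMediaServer/logs/server.log"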

 

But now I'm wondering if this runaway log could have triggered most of the issues I've been having, including the write errors?

As I have already posted more than once, the write errors can't be caused by software; this was a hardware problem.  Most likely the NVMe devices dropped offline one or more times, and given the high number of errors and the fact that it happened to both devices, it will most likely happen again.

 

 

