ScottAS2

Everything posted by ScottAS2

  1. This chimes with a discovery I've made in the meantime: bypassing the cache and writing directly to the array seems to solve the problem. Both cache drives (BTRFS/RAID1) are Crucial BX500 1TB SATA SSDs; part number CT1000BX500SSD1. I'm not sure how to find out if they have a DRAM cache, although the data sheet makes no mention of it, which probably means "no".
  2. I believe this is the mistake. You should select the "custom" network that can be created in Unraid's Docker settings. This allows the Time Machine container to have its own IP on your 192.168.178.0/24 network and to exchange multicast DNS/DNS Service Discovery messages with the devices on it.

The "bridge" network puts Docker containers in a 172.17.0.0/16 (by default) internal network that you can think of as NAT-ed behind your Unraid server's IP. In most cases this is fine and you just forward the appropriate ports through that NAT to a container*, but the mDNS and DNS-SD messages needed for clients to discover and communicate with the Time Machine service won't cross from one network to the other. Even if they did, a Time Machine container on the "bridge" network only knows its 172.17.0.0/16 address, which is no use to the machines on 192.168.178.0/24 that are looking to back up.

* In theory, you might be able to set Time Machine up like this - it's "just" an SMB server - but it would be more work to set up the clients, and Time Machine is notoriously picky; without the multicast stuff it might decline to accept the server.
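For the curious, my understanding is that Unraid's "custom" network is essentially a macvlan (or ipvlan) Docker network attached to the LAN bridge. Roughly what that amounts to on the command line - the network name, parent interface, addresses, and image name below are placeholders, not what Unraid actually creates:

    # Create a network whose containers get addresses directly on the LAN
    docker network create -d macvlan \
      --subnet=192.168.178.0/24 --gateway=192.168.178.1 \
      -o parent=br0 custom_br0

    # Run the Time Machine container with its own LAN address on that network
    docker run -d --name timemachine --network custom_br0 \
      --ip 192.168.178.50 some-time-machine-image

In the webGUI, all of this boils down to picking the custom network in the container's "Network Type" dropdown and, optionally, giving it a fixed IP.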
  3. Hi all, I've been facing a problem with my Unraid server for a while whereby a large file transfer to the server will somehow upset Docker and cause problems, including:

• The Docker portions of the webGUI become slow.
• Attempts to stop, remove, or kill Docker containers fail with error messages along the lines of "attempted to kill container but did not receive a stop event from the container". This is regardless of whether I use the webGUI, the command line, or Compose.
• Connections to Docker containers from elsewhere on the network fail.
• (Particularly annoying) my ADS-B receiver Docker container stops working and needs to be manually restarted before it'll work again, even if the offending transfer has already finished.

Examples of big inbound transfers that can cause this problem:

• A backup coming in to the server from offsite (through rsync directly on Unraid)
• Windows File History backing up to the server (through Unraid's built-in SMB server functionality)
• A large download through Lancache on the server (through the Lancache Docker container)

You'll notice that the first two of those should have nothing to do with Docker. Frustratingly, the following do not cause the problem:

• A backup going offsite from the server (rsync again)
• Mac OS backing up (through a Time Machine Docker container)
• Media being served out to other devices (built-in SMB)

Clearly, my server is running out of some resource that Docker needs, but I'm at a loss as to what it is. Despite the problem being associated with a lot of network traffic, I doubt it's bandwidth itself, since one of the triggers comes from offsite, and it's unlikely my 100Mb connection (less VPN overheads) is saturating the server's gigabit Ethernet. While the CPU does run up during problematic transfers, it doesn't seem to be totally overloaded, and there's bags and bags of free memory.

Let's take as a specimen what happened yesterday evening and see what Netdata recorded. The ADS-B receiver fell over sometime around 17:00; I believe this was triggered by Windows backing up to Unraid's SMB server. The CPU runs up a bit for about 20 minutes at 16:30, then a bit more an hour later, but it never goes above 60%. There's some disk IO at the same times. Nonetheless, we have all of the memory, and there's a network traffic spike, but nothing gigabit Ethernet shouldn't be able to handle.

Can anyone suggest other metrics I should examine? Or is there a way to decrease the general niceness of the Docker daemon so it will just take a bigger share of what resources there are? (A sketch of the kind of thing I mean is at the end of this post.)

Vital statistics:

• Unraid v6.12.6
• Dell PowerEdge R720XD
• Diagnostics are attached: unraid-diagnostics-20231228-1006.zip
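To flesh out what I mean by "other metrics" and "niceness": the things I'm thinking of checking next are per-disk I/O wait and utilisation (which don't show up in the plain CPU graphs) and the Docker daemon's scheduling priority. A rough sketch, assuming the relevant tools are present on Unraid, which I haven't confirmed:

    # Per-device utilisation and average wait, refreshed every 5 seconds
    iostat -dxm 5

    # The "wa" (I/O wait) figure in top's CPU line during a problem transfer
    top -bn1 | head -5

    # Nudging the Docker daemon's CPU and I/O priority (values are guesses)
    renice -n -5 -p "$(pidof dockerd)"
    ionice -c 2 -n 0 -p "$(pidof dockerd)"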
  4. Yes, that's a reasonable thing on the first run, but on subsequent runs it's just frustrating because you've already examined it and decided (against the clear advice of the plugin) to continue doing it anyway. A third option has occurred to me: perhaps have a button to suppress a particular warning, like Fix Common Problems does. Thanks. I'll take that up with Unraid if I can't figure it out.
  5. Hello all. First of all, thank you for this excellent plugin. @Squid's version was already pretty good, but you've taken it to the next level, @KluthR! I have a couple of tiny problems and one medium-sized one, though. Let's get the small ones out of the way:

Firstly, could I please prevail upon you to not raise notifications about configuration warnings at runtime? Every week I get notifications generated telling me "NOT stopping [container] because it should be backed up WITHOUT stopping!". I know! I chose that setting! In defiance of the very clear advice on the UI that this isn't recommended, too. By all means note in the log that the backup isn't stopping the container, or even raise a warning dialog when saving the settings, but please don't raise a notification. It doesn't tell me anything, and is just noise. I'd turn off warnings, but sometimes there are warnings about things I don't already know about, and want to do something about. Talking of which...

Last night, for several containers I got three warning notifications in addition to the configuration one mentioned above. They weren't obvious in the log because they had the [ℹ️] that I presume denotes information, not the [⚠️] for warnings, making them hard to spot. They were of the format "[[timestamp]][ℹ️][[container]] Stopping [container]... Error while stopping container! Code: - trying 'docker stop' method".

That brings me on to my medium-sized problem: I got a warning for some, but not all, of my containers saying "Stopping [container]... Error while stopping container! Code: - trying 'docker stop' method". The next message notes that "That _seemed_ to work.". Indeed, if I SSH into Unraid and do "docker stop [container]", it does work. By what method does the plugin try to stop containers on its first attempt? Do you have any tips for how I can debug my containers to see why they don't stop on that first attempt?
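For what it's worth, the checks I've been running myself while waiting for an answer - none of which tell me anything about the plugin's own first-attempt method, which is what I'm really after (the container name below is a placeholder):

    # What stop signal and timeout the container is configured with
    docker inspect --format '{{.Config.StopSignal}} {{.Config.StopTimeout}}' my-container

    # Manually stopping with a longer grace period before SIGKILL
    docker stop -t 60 my-container

    # Anything the container logs while it shuts down
    docker logs --tail 50 my-container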
  6. I nearly forgot to do this: now at over two weeks with no crashes. I'm going to declare this fixed by switching to ipvlan.
  7. 10 days of uptime with ipvlan. Then again, I've been here before...
  8. *sigh* It died again overnight. Let's give the ipvlan thing a go.
  9. So I've just passed seven days with almost everything up on 6.12.4. I wonder if it was related to the macvlan changes, since I am using that for Docker with a parent interface that is a bridge. Does anyone know where I would have been expected to see the "macvlan call traces and crashes" referred to in the release notes if that was the cause of my crashes?
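In case anyone else goes looking for the same thing, where I would have expected (but haven't confirmed) such traces to appear is the kernel log and the syslog, along the lines of:

    # Kernel messages since boot
    dmesg | grep -iE "macvlan|call trace"

    # The mirrored or remote syslog, if one is set up
    grep -iE "macvlan|call trace" /var/log/syslog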
  10. Usually the custom eth0 is the one to go for, because that allows the service to advertise itself on the network. What sort of not work are you getting?
  11. Deleting backups from Time Machine manually isn't all that hard. If you connect to your Time Machine share in the Finder, mount the disk image, and go into it, you'll see a bunch of folders named for the date and time they represent a backup of. You can then delete as you see fit. With regards to Time Machine not recognising the reduced quota, try removing the backup location and adding it again (which you can do without disrupting your history. Although it will say the angst-inducing "waiting to complete first backup" when you re-add the location, don't panic: it'll find the history once it does the first "new" backup).
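If you'd rather do it from Terminal on the Mac, tmutil can also list and delete backups, although the exact delete syntax varies between macOS versions, so treat this as a pointer rather than a recipe (the path is illustrative and the disk image must be mounted first):

    # List the backups Time Machine knows about
    tmutil listbackups

    # On older macOS versions, deleting a single backup by path
    sudo tmutil delete "/Volumes/Time Machine Backups/Backups.backupdb/MyMac/2023-01-01-000000"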
  12. I suspect this is a powers of 10 vs powers of 2 issue and the 3TB you're declaring in the Docker parameters is slightly larger than the 3TB your drive actually holds. Time Machine's "do I need to delete old backups?" check looks only at the declared size, and pays no regard to the underlying storage (it will also ignore any storage contents that aren't sparsebundles!). I suspect if you declare a slightly smaller size, say 2.5TB*, Time Machine will realise it's full and start deleting. * n TB is about 9.05% smaller than n TiB, but some wiggle room is probably a good idea.
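To put numbers on that footnote, here's the back-of-the-envelope arithmetic, assuming the declared size ends up being interpreted in binary units while the drive is a decimal 3TB:

    echo '3 * 2^40' | bc      # 3298534883328 bytes if "3TB" is read as 3 TiB
    echo '3 * 10^12' | bc     # 3000000000000 bytes in a decimal 3TB drive
    # ratio ≈ 1.0995, so the declared size overshoots the disk by roughly 10%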
  13. I find it can fiddle around for a few minutes after each backup has completed, but it eventually stops and the disk spins down (sometimes just in time for the next backup to spin it back up 😕). I’ve no idea what it’s doing, though. Updating cached indexes or some other metadata perhaps? How long after your backup does the activity persist?
  14. How did you get on with this, @halorrr? Was resetting disk.cfg the solution? Although I don't know for certain, I would guess that the entries you ask about are the equivalents of the per-disk preferences you get when clicking on the disk name in the "Main" page of the web interface. How did you reset it? Copy the file over from a fresh install?
  15. Yes, this looks just like what I've got. I managed to get uptime to over seven days, slowly bringing back Docker services as I grew more confident. A few hours after I brought up my ADS-B constellation, the server crashed. Thinking I'd found the culprit, I brought up everything except ADS-B, but the server crashed again a few days later. I'm therefore of the opinion that the probability of crashing increases the more heavily-loaded the server is, but I still don't know an exact cause, let alone a solution. If anyone knows the answers to any of the questions I posed above, please speak up!
  16. So I've been up for two days, ten hours in safe mode with one significant weirdness. The parity check I was running cancelled itself a couple of hours before it was due to finish. It then ran another parity check for all of thirty seconds. Very weird. My end-of-the-month parity check is coming up soon. Let's see how that goes.
  17. Yes, our four experiences do seem remarkably similar. Until I saw your experiences I expected the 6.12.2 upgrade was a coincidence due to the initial nine days of stability, but now I'm much more suspicious of it. I'll give safe mode a try to rule out one more thing.
  18. Hello all,

I'm facing some troubles with my Unraid server crashing every few days, and I'm not even very sure how to figure out what's going wrong, let alone how to fix it. Here's what I know so far:

The server is a Dell PowerEdge R720xd. I believe I upgraded to Unraid 6.12.2 on the 8th of July - I am certain about the version, though not the date, which I've pieced together by looking through the lifecycle log. On the 17th of July, the server became unresponsive and I was unable to connect to it by any means. The iDRAC log had an entry saying that "A runtime critical stop occurred." followed in the next few seconds by three separate "OEM software event"s. I was able to restart the server through the iDRAC by issuing a reset (warm boot) command. The only problem when it came back up was that the Docker image had become corrupt and I had to delete it and recreate all my containers.

Looking at the log entries in more detail through IPMI Tools, the runtime critical stop has the description "Run-time Critical Stop ; OEM Event Data2 code = 61h ; OEM Event Data3 code = 74h". The OEM software events have the descriptions "Fatal excep", "tion in int", and "errupt" in that order.

QUESTION: What do these codes mean? Does the "OEM" designation mean they're specific to Unraid? Google's top hit is this forum post, which the posters believed to be a memory error, but I don't know if that was gleaned from the event codes or from the memory errors that shortly preceded the runtime critical stop in the OP's log. I do not have any memory errors (correctable or otherwise) in my log.

Since then, every couple of days the same thing has happened: the server becomes unresponsive, a "runtime critical stop" appears in the log, and the system has to be reset through the iDRAC. At some point, I think on the 22nd, I upgraded to Unraid 6.12.3 (this date is mostly a guess; I haven't been able to distinguish between various commanded restarts). The only difference this made was that the Docker image didn't get corrupted every time the server crashed - which may just be blind luck anyway.

In terms of what I've tried, I've been able to preserve the syslogs for a couple of occurrences. These are in the attached file. The one named "syslog" was retained by mirroring the log to the flash drive. A runtime critical stop occurred at 02:37 on the 23rd in the period covered by that file, although you will see that there were no entries between just after midnight on the 23rd and 07:40 when I restarted.

QUESTION: Is there a way to increase syslog verbosity? (See also the note at the end of this post.)

In the file named "syslog-127.0.0.1.log", I got fed up of "Fix Common Problems" moaning at me for mirroring the syslog to flash, so I captured it by sending the logs to Unraid's own syslog server. In the period it covers, runtime critical stops occurred at 08:25 on the 25th and 05:59 on the 26th. Again, there are no log entries particularly close to the crash that suggest what might be happening.

The other thing that I did, given the top Google hit mentioned above, was to run Memtest on Sunday. I left all the settings at their defaults, and it found no problems by the time it had run to completion.

QUESTION: Am I correct in my understanding that Memtest is not an exhaustive test, and that a pass does not therefore absolutely guarantee that there is no failing DIMM in the system?

QUESTION: Are there different settings for Memtest that I should try that might make it more comprehensive?

Finally, what other information can I retrieve to help diagnose this problem? I have omitted the usual Unraid diagnostics package, as by the time the problem has occurred I'm unable to communicate with Unraid until it has restarted and obliviated any diagnostics.

QUESTION: Or could there be useful things in there? Would it be helpful to post a diagnostics package anyway?

QUESTION: Is there anything else I should try to investigate to help find out what's going wrong?

Thanks if you've made it all the way down to the bottom of this mammoth post, and thank you in advance for all your helpful suggestions and advice that will no doubt be forthcoming.
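On my own syslog verbosity question, the only thing I've come up with so far is raising the kernel console log level, and I haven't verified that it actually helps on Unraid, so please treat it as a guess rather than a suggestion:

    # Log everything up to debug level
    dmesg -n 7

    # The same thing via procfs
    echo "7 4 1 7" > /proc/sys/kernel/printk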
  19. On the Unraid side, all you need to do is edit the container and specify a fixed IP address of your choosing. Then on your Mac you'll probably need to set up Time Machine again to bind it to the IP address and not the .local hostname that isn't available through Wireguard. Don't worry: you can retain your backup history when doing so.

Firstly, go to your Time Machine preferences and delete your current .local backup destination. Then go to the Finder and do "Go" -> "Connect to Server...". In the dialogue box that comes up type "smb://[fixed IP address]/[share name]" and click "Connect". Sign in and the share should mount on your desktop.

Now go back to Time Machine in System Preferences and select a backup destination. Make sure you pick the fixed IP address, not the .local hostname. Possibly Time Machine will be clever and collapse both destinations together, so you may have to do this when you're connected through Wireguard so that the IP address is the only option. IIRC, at this point it will notice that there are existing backups and will ask if you want to re-use them or start from scratch; presumably you want the former.

The backup destination will now be sitting with a status of "Waiting to complete first backup" or similar. Don't panic! Your backup history is still there. Go back to the desktop and unmount the share. Then, the first time you run a backup, it'll rediscover the history and continue as if nothing had happened.
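If System Preferences insists on collapsing the two destinations, the same binding can, as far as I'm aware, also be made from Terminal with tmutil - the user, IP address, and share name below are placeholders:

    # Point Time Machine at the share by IP; -p prompts for the password
    sudo tmutil setdestination -p "smb://user@[fixed IP address]/[share name]"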
  20. So the rebuild has completed, all the data is there, and Time Machine has just finished its first post-rebuild backup. I've still no idea where I went wrong, but everything works again and I didn't lose any data. Ultimately I'm happy.
  21. Nope. All healthy. Nor did I get any notifications about warnings other than the expected ones when a drive was removed, when rebuilds finished, etc..
  22. Okay, I'm completely bamboozled now. As I mentioned above, I had drive 6A mounted with Unassigned Devices. I pondered copying the data off it onto other drives in the array, or connecting 6B to the server to do a local copy onto 6A (which I guess is where trurl was going). I decided against these because I'd have had to zero out a drive to remove it from the array, and/or copy Disk 6's contents more than once, and settled on formatting 6A and copying 6B onto it from the Raspberry Pi (it turns out USB3 + gigabit Ethernet is actually faster than just USB2).

So, I reallocated drive 6A to the array. Yes, all contents will be overwritten when the array is started. Start the array, please. Then I went to find the format button. It wasn't there any more. Disk 6 showed as emulated with no complaint about missing or no file system, and a rebuild was in progress. Finding this odd, I took a look at the contents of Disk 6, and... everything seems to be there, as it was just before drive 6B was removed.

Obviously, I'm going to wait for the rebuild to complete before declaring this fixed, but it's looking good, even if I don't understand why. My best guess at the moment is that Unraid found drive 6A somehow suspicious (perhaps due to an unclean dismount while it was being used outside of Unraid?) and thus declined to take it into the array, but mounting it with Unassigned Devices cleared this and allowed it to re-enter the array. I don't know why it wouldn't show the emulated contents as it would if Disk 6 was physically absent, though.
  23. Noted. It's tempting to look for ways to avoid having to transfer several terabytes back to Unraid but, unlike Disk 6, I don't have spare copies of everything else. 6B isn't connected to the server, but I've got it attached for now to a Raspberry Pi and it seems to be happy and have the data on it. 6A is in the server and does mount with Unassigned Devices and the (slightly outdated) data is still there. Again, it's tempting to look for a way to use it, but am I correct in thinking that any attempt to do so puts us into the scenario apandey warns against above? Okay. This looks like what I'm going to have to do.
  24. Hello all, I've just messed up a drive swap and I'm really, really close to losing data. First things first, the vital statistics:

• Unraid v6.11.5, running on...
• Dell PowerEdge R720xd
• 2x parity drives, 10x data drives, 2x cache drives - all formatted as BTRFS - controlled by...
• PERC H310 Mini, firmware version 20.13.3-0001, exposing the drives in JBOD mode, of course
• Diagnostics are attached

The drive I tried to swap, Disk 6, is dedicated to the share that holds my Time Machine backups, and similarly that share is pinned to Disk 6.

So what did I do to get into this mess? I needed to remove what I'm going to call drive 6A, so I stopped the array and removed it. I thought it would be out for a while, so I replaced it with an identically-sized drive I'm going to call 6B. I assigned drive 6B to the array, then started the array. The emulated contents of Disk 6 were correct, and they were eventually rebuilt onto drive 6B. All was well for a time.

After a while drive 6A became available again. Because it's a better drive than 6B, I decided to swap it back in. I stopped the array, removed 6B, and put 6A back in. Then things started to go a bit wrong. When I went back to restart the array, Disk 6 showed as having no filesystem (though I'd never wiped it - it should have had all the same data on it as when it was removed). Although the option to format it was available, I did not do so - the warning message when you tick the format checkbox was quite clear that it wasn't what should be done. Although I found it a little bit odd that this hadn't happened with the earlier disk swap, I assumed from what I know of Unraid that the rebuild would of course rebuild the file system, so I started the array and allowed the rebuild to go on. Here's where I made my second mistake: I didn't check the emulated contents of Disk 6.

While the rebuild was ongoing, Time Machine kept failing to do its backups. I thought it was just being fussy about its storage (as it can be sometimes) and that it would sort itself out once the rebuild was complete. In hindsight, this should also have tipped me off that all was not well on Disk 6 and I should have checked its emulated contents.

Then, this evening, the rebuild finished. I ran Time Machine and it still failed. I went and had a look in Unraid, and Disk 6 was still showing as unmountable with no or wrong file system. I looked in both /mnt/user/Time\ Machine and /mnt/disk6. In both places the data was completely gone; the only thing remaining was the share directory.

So, what do I do now? The server's sitting there with the array stopped. I presume there's no hope of recovering the data from within Unraid, but I do still have drive 6B. I haven't checked yet, but it should still have the data from Disk 6. Is the best course of action to just format 6A and then rsync the data from 6B back onto the array? (A sketch of what I mean is at the bottom of this post.)

Finally, it's much less pressing, but I'd also like to know what I did wrong, and how I should have done it, so I don't make this mistake again. Thank you in advance for all of your help.
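For concreteness, the sort of copy I have in mind once 6A is formatted and back in the array - the mount point on the machine holding 6B and the server's hostname are guesses, and -s is just there to cope with the space in the share name:

    # From the machine that has 6B attached, push Disk 6's contents back
    rsync -avh --progress -s "/mnt/6B/Time Machine/" "root@tower:/mnt/user/Time Machine/"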
  25. Hello. Thanks for providing this most useful plugin, itimpi! Unfortunately I ran into a small problem with it when I last restarted my Unraid server. When this plugin's installed (specifically, version 2022.12.05 on Unraid version 6.11.5), the root of the filesystem gets its permissions set to 777. This causes OpenSSH to refuse to allow login with public/private key pairs with the error message "Authentication refused: bad ownership or modes for directory /" because it's worried about some other user nefariously providing extra "authorised" keys. If I uninstall the plugin and restart, / is back to 755 which satisfies OpenSSH. The workaround, of course, is to manually set the permissions of / after start up, but it would be better if this plugin could avoid changing them in the first place.
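For anyone else hitting this before it's fixed in the plugin, my workaround amounts to the following, run manually after each boot (whether the go file or User Scripts is a better home for it, I leave to wiser heads):

    # Check what OpenSSH is objecting to
    stat -c '%a %U %G' /

    # Put the root directory back to what sshd's StrictModes expects
    chmod 755 /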