Howboys

Members · Posts: 105

Everything posted by Howboys

  1. On the Main page, I see that my parity drive has 3 errors. So I ran a parity check with write corrections; it completed and fixed the 3 errors. Then I ran the parity check again, and this time it shows 0 errors. But the drive still shows the 3 errors on the Main page. Are these actual errors or just leftover state? How do I get rid of it?
  2. Ah that fixed it! I actually have Fix Common Problems plugin installed already but I didn't see it there (or get notified of it).
  3. In `/var/log/notify_Discord`, I see failed attempts to send the notification, but not the actual errors or reasons why. Snippet:

     ```
     Mon Aug 22 06:52:01 PDT 2022 attempt 1 of 4 failed {"embeds": ["0"]}
     ```

     I also don't see further attempts in the logs - just "1 of 4" as above, for different notifications. I can do a quick test just fine:

     ```
     bash /boot/config/plugins/dynamix/notifications/agents/Discord.sh
     ```

     and I see the notification in Discord. But when I do a full test:

     ```
     /usr/local/emhttp/webGui/scripts/notify -e "My Event" -s "My Subject" -d "My Description" -m "My Message" -i "alert" -l "/Dashboard"
     ```

     I only see the notification in the UI and NOT in Discord. Here's the log for that:

     ```
     Mon Aug 22 07:18:13 PDT 2022 attempt 1 of 4 failed {"embeds": ["0"]}
     {
       "content": "<@1234>",
       "embeds": [
         {
           "title": "My Event",
           "description": "My Subject",
           "url": "http://Tower/Dashboard",
           "timestamp": "2022-08-22T14:18:12.000Z",
           "color": "14821416",
           "author": {
             "icon_url": "https://craftassets.unraid.net/uploads/logos/[email protected]",
             "name": "Tower"
           },
           "thumbnail": {
             "url": "https://craftassets.unraid.net/uploads/discord/notify-alert.png"
           },
           "fields": [
             {
               "name": "Description",
               "value": "My Description\n\nMy Message"
             },
             {
               "name": "Priority",
               "value": "alert",
               "inline": true
             }
           ]
         }
       ]
     }
     ```
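In case it helps anyone debugging the same thing, here's a sketch (my own, not part of the stock agent) that posts a minimal embed straight to the webhook and prints Discord's HTTP status and response body, which usually names the offending field. `WEBHOOK_URL` is a placeholder you'd set to your real webhook before running:

```shell
# Sketch: post a minimal embed directly to the webhook and surface Discord's
# actual error response. WEBHOOK_URL is a placeholder (assumption), not a
# value read from the stock Discord.sh agent.
WEBHOOK_URL="${WEBHOOK_URL:-}"
payload='{"embeds":[{"title":"test","description":"direct webhook test"}]}'
if [ -n "$WEBHOOK_URL" ]; then
  # -w appends the HTTP status so a 400 (bad payload) is distinguishable
  # from a 204 (success with empty body).
  curl -s -w '\nHTTP %{http_code}\n' -H 'Content-Type: application/json' \
       -d "$payload" "$WEBHOOK_URL"
else
  echo "WEBHOOK_URL not set; skipping live test"
fi
```

With a real webhook set, a rejected payload comes back with a 400 and a JSON body naming the field Discord objected to.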
  4. Ah I'm dumb. "provide your personal Discord ID (it is a series of numbers, not letters)." - I was adding username + numbers ("foobar#1234") but adding just "1234" works.
  5. I'm setting up a Discord server for my homelab and, with Unraid, I'm having some issues. Here's how I've set up the Discord agent: Now, I can receive the test notifications just fine: But I'm not receiving any other notifications - I only see them in the Unraid web UI. Even adding my tag id (the numbers) doesn't work. All I see in syslog is:

     ```
     Aug 22 06:52:01 Tower Discord.sh: Failed sending notification
     ```

     Is there any way to turn up the verbosity of Discord.sh or see why it failed?
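On the verbosity question: since the agent is (as far as I can tell) a plain bash script, tracing it with `bash -x` prints every command it runs, including the final curl call and its arguments. Below is a self-contained demo using a tiny stand-in script (hypothetical); on the server you'd trace the real one with `bash -x /boot/config/plugins/dynamix/notifications/agents/Discord.sh`:

```shell
# Demo of bash -x tracing using a stand-in script (hypothetical path);
# on Unraid you would trace the real agent script instead.
demo=/tmp/agent_demo.sh
printf '%s\n' '#!/bin/bash' 'echo "sending notification"' > "$demo"
# -x echoes each command (prefixed with +) to stderr before executing it.
bash -x "$demo"
```

The `+`-prefixed trace lines show exactly what the script executed, so a failing curl call and its arguments become visible.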
  6. Version 6.10.3 2022-06-14

     Repro:
     1. Go to notification settings
     2. Go to the Discord settings section
     3. Add webhook URL
     4. Apply
     5. Done
     6. Go back to notification settings -> Discord
     7. Add tag id -> Apply

     Observations: All settings except "Agent function" are reset when saving the tag id. It also happens if you set the tag id and webhook URL in the same step. For now, I'm just not setting the tag id.
  7. I think I may have this solved. I bought a new PSU and new RAM, and when I went to replace the PSU, I noticed that one of my current RAM sticks was not all the way seated. I also noticed that the ATX 24-pin cable was a teeny bit loose. Anyway, after plugging them in properly, my current uptime is almost 2 days with no hiccups. If the server stays up for a few more days, I'll upgrade to a bigger case so my SATA power cables aren't pushing against the ATX port (which is what seems to be happening). Consider this solved (I'll post if I notice issues again).
  8. Those are all fair and good points. I don't usually reboot often, but when I'm debugging issues (which I've had lately), rebooting a few times a day is common. Yeah, so best case I'd save 20-30 seconds even if a custom image were possible, so perhaps what we have is the best we'll get.
  9. It's not a huge deal, but without the plugin my boot takes less than a minute. So a 2-3x longer boot seems like... something not ideal. I know that with normal Linux I could create a preloaded image, but I don't know if Unraid has that option.
  10. At boot, it takes nearly a minute for the driver to install:

      ```
      Aug 18 16:08:45 Tower root: --------------------Nvidia driver v515.57 found locally---------------------
      Aug 18 16:08:45 Tower root:
      Aug 18 16:08:45 Tower root: -----------------Installing Nvidia Driver Package v515.57-------------------
      Aug 18 16:09:36 Tower kernel: nvidia: loading out-of-tree module taints kernel.
      Aug 18 16:09:36 Tower kernel: nvidia: module license 'NVIDIA' taints kernel.
      Aug 18 16:09:36 Tower kernel: Disabling lock debugging due to kernel taint
      Aug 18 16:09:36 Tower kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
      Aug 18 16:09:36 Tower kernel:
      Aug 18 16:09:36 Tower kernel: nvidia 0000:06:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
      Aug 18 16:09:36 Tower kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 515.57 Wed Jun 22 22:44:07 UTC 2022
      Aug 18 16:09:36 Tower root:
      ```

      Is there any way to speed this up?
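For reference, the stall is about 51 seconds between the "Installing Nvidia Driver Package" line and the first kernel module message. A quick way to compute such gaps from syslog timestamps (times taken from the snippet above; needs GNU date, which Unraid ships):

```shell
# Compute the delta between the two syslog timestamps above (GNU date).
t0=$(date -d '16:08:45' +%s)   # "Installing Nvidia Driver Package" line
t1=$(date -d '16:09:36' +%s)   # first "nvidia: loading out-of-tree module" line
echo "driver install took $((t1 - t0)) seconds"   # prints 51 seconds
```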
  11. So I'm now in basically a crash loop - every 20 minutes or so the server just crashes and restarts (though the original symptom of the nvme SSD missing is not present anymore). Do you have any idea what hardware might be the issue? Could it be the RAM or the CPU or something else given the symptoms?
  12. From the quoted reply:

      > "Share cache full" means the share to where you are transferring is hitting the minimum free space setting for that share, not cache, so the transfer would bypass your cache device and go directly to the array

      That does seem to be the case. One of my shares has "Minimum free space" set to 200GB. But I don't understand the logic here for cache and cache bypass, or why it would cause a server crash.
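If I'm reading the quoted explanation right, the rule is just a threshold check at write time. A toy sketch of my reading (the numbers are hypothetical, and this is not Unraid's actual code):

```shell
# Toy model of the "Minimum free space" rule as I understand it (assumption):
# if the cache pool's free space is below the share's minimum, new writes
# for that share skip the cache and land directly on the array.
min_free_gb=200      # the share's "Minimum free space" setting
cache_free_gb=150    # hypothetical free space left on the cache pool
if [ "$cache_free_gb" -lt "$min_free_gb" ]; then
  echo "write bypasses cache and goes to the array (shfs logs 'share cache full')"
else
  echo "write goes to cache"
fi
```

So the cache itself being far from full doesn't matter; only the free-space number relative to the share's threshold does.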
  13. It happened again and here are the logs:

      ```
      Aug 16 19:51:52 Tower shfs: share cache full
      Aug 16 19:51:53 Tower shfs: share cache full
      Aug 16 19:51:53 Tower shfs: share cache full
      Aug 16 19:51:53 Tower shfs: share cache full
      Aug 16 19:51:53 Tower shfs: share cache full
      Aug 16 19:51:53 Tower shfs: share cache full
      Aug 16 19:51:55 Tower shfs: share cache full
      Aug 16 19:51:55 Tower shfs: share cache full
      Aug 16 19:51:55 Tower shfs: share cache full
      Aug 16 19:51:55 Tower shfs: share cache full
      Aug 16 19:51:55 Tower shfs: share cache full
      Aug 16 19:53:03 Tower kernel: Linux version 5.15.46-Unraid (root@Develop) (gcc (GCC) 11.2.0, GNU ld version 2.37-slack15) #1 SMP Fri Jun 10 11:08:41 PDT 2022
      Aug 16 19:53:03 Tower kernel: Command line: BOOT_IMAGE=/bzimage rcu_nocbs=0-15 initrd=/bzroot
      Aug 16 19:53:03 Tower kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
      Aug 16 19:53:03 Tower kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
      Aug 16 19:53:03 Tower kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
      Aug 16 19:53:03 Tower kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
      ```

      The cache is definitely not full, even though these "share cache full" messages are the last thing logged before the crash. Attaching diagnostics.

      tower-diagnostics-20220816-2000.zip
  14. So I was just double-checking my Unraid config and noticed that I had not added `rcu_nocbs=0-15` to the append line, but had instead put it on a new line. I've now added it to the append line, so hopefully that fixes it. The other change I've made is to enable global C-states but disable C6 using zenstates in /config/go. If the server still crashes after all that, I'll disable global C-states and see if that does it. I'm pretty confident that correctly applying `rcu_nocbs=0-15` will resolve this, but only time will tell.
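For anyone else hitting this: the flag has to sit on the `append` line itself in `/boot/syslinux/syslinux.cfg`, roughly like the fragment below (label and kernel names as on my system; yours may differ). A flag on its own line is silently ignored:

```
label Unraid OS
  menu default
  kernel /bzimage
  append rcu_nocbs=0-15 initrd=/bzroot
```

You can confirm it took effect after a reboot by checking that the flag shows up in the `Command line:` line of the boot log (as in the syslog snippets above).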
  15. Hey. Did this permanently resolve your issue? Even with c-states disabled, mine crashes. So now I'm testing it with c-states enabled.
  16. I guess it could be my memory, though I'm not sure. According to the BIOS it's running at 2133 MHz (though it's rated for 3000 MHz), and I've got 2 sticks in the 2nd and 4th slots away from the CPU. According to the Internet, that should be the right configuration?
  17. Just to check in - has it been stable since?
  18. I should've mentioned I've already done all that when I built the server.
  19. Ok it happened just now. Here's the syslog around the time of the crash (11:36 local time). What's not in the attached file is what came right before:

      ```
      Aug 14 11:28:52 Tower kernel: docker0: port 1(veth63b1366) entered blocking state
      Aug 14 11:28:52 Tower kernel: docker0: port 1(veth63b1366) entered disabled state
      Aug 14 11:28:52 Tower kernel: device veth63b1366 entered promiscuous mode
      Aug 14 11:28:53 Tower kernel: eth0: renamed from veth2a1387c
      Aug 14 11:28:53 Tower kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth63b1366: link becomes ready
      Aug 14 11:28:53 Tower kernel: docker0: port 1(veth63b1366) entered blocking state
      Aug 14 11:28:53 Tower kernel: docker0: port 1(veth63b1366) entered forwarding state
      Aug 14 11:29:19 Tower sshd[27775]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 34804 on $UNRAID_IP port 22 rdomain ""
      Aug 14 11:29:19 Tower sshd[27775]: error: kex_exchange_identification: Connection closed by remote host
      Aug 14 11:29:19 Tower sshd[27775]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 34804
      Aug 14 11:30:19 Tower sshd[29895]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35046 on $UNRAID_IP port 22 rdomain ""
      Aug 14 11:30:19 Tower sshd[29895]: error: kex_exchange_identification: Connection closed by remote host
      Aug 14 11:30:19 Tower sshd[29895]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35046
      Aug 14 11:31:19 Tower sshd[31937]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35272 on $UNRAID_IP port 22 rdomain ""
      Aug 14 11:31:19 Tower sshd[31937]: error: kex_exchange_identification: Connection closed by remote host
      Aug 14 11:31:19 Tower sshd[31937]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35272
      Aug 14 11:32:19 Tower sshd[1650]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35494 on $UNRAID_IP port 22 rdomain ""
      Aug 14 11:32:19 Tower sshd[1650]: error: kex_exchange_identification: Connection closed by remote host
      Aug 14 11:32:19 Tower sshd[1650]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35494
      Aug 14 11:33:19 Tower sshd[3711]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35720 on $UNRAID_IP port 22 rdomain ""
      Aug 14 11:33:19 Tower sshd[3711]: error: kex_exchange_identification: Connection closed by remote host
      Aug 14 11:33:19 Tower sshd[3711]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35720
      Aug 14 11:36:19 Tower kernel: Linux version 5.15.46-Unraid (root@Develop) (gcc (GCC) 11.2.0, GNU ld version 2.37-slack15) #1 SMP Fri Jun 10 11:08:41 PDT 2022
      Aug 14 11:36:19 Tower kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot
      ```

      From what I can tell, there are basically no useful logs preceding the crash. The logs I've attached seem to correspond to the server reboot after the crash. So, what else can I provide or enable for more visibility?

      Just to provide more info:
      - My build is a Gigabyte B450 Aorus M (BIOS F63a) with a Ryzen 7 1700X
      - I've already run memtest and made sure the memory is okay (did the memtest when I got the RAM a month or so ago)
      - I have `rcu_nocbs=0-15` added to my flash config (https://forums.unraid.net/topic/61129-ryzen-freezes/)
      - I'm not running any VMs, just Docker containers and storage shares
      - There's no significant load on the server at the time of the crash. In fact, things are totally fine any time I do a CPU or GPU stress test.
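Side note for anyone digging through saved syslogs: a quick way to pull just the window around a crash is filtering on the time field of the standard syslog prefix. The sample file below is a hypothetical stand-in; on a real system you'd point this at the saved syslog instead:

```shell
# Filter a syslog to a time window. Field 3 of the "Aug 14 11:28:52" prefix
# is the HH:MM:SS time, so a lexicographic compare works within a single day.
syslog=/tmp/syslog.sample   # hypothetical stand-in for the saved syslog
cat > "$syslog" <<'EOF'
Aug 14 11:28:52 Tower kernel: docker0: port 1(veth63b1366) entered blocking state
Aug 14 11:33:19 Tower sshd[3711]: Connection closed by remote host
Aug 14 11:36:19 Tower kernel: Linux version 5.15.46-Unraid
EOF
# Keep only lines between 11:30:00 and 11:37:00.
awk '$3 >= "11:30:00" && $3 <= "11:37:00"' "$syslog"
```

This prints only the sshd and kernel lines inside the window, dropping the earlier docker0 line.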
  20. Ok it happened again last night but I didn't have syslog server enabled then (the server didn't restart itself though the PC's power light was on, and disks were spinning. No output/signal on the monitor). I've enabled syslog server now.
  21. So over the last 24 hours or so, it seems like my server is just crashing, and I have no idea why. Once it crashed and rebooted in the middle of the night, and once just a few mins ago. In both cases, when it started back up, the NVMe cache drive was missing (probably due to it being in a bad state after an unclean shutdown). Rebooting didn't help, but a clean shutdown + start up does. When it crashed a few mins ago and restarted, I grabbed the attached diagnostic report (but didn't do a shutdown, so my cache only has one SSD right now). I'm not really pushing the limits of my server or anything, so I don't know why it's been restarting. tower-diagnostics-20220808-1935.zip
  22. Wondering the same. I doubt a container allowing full unrestricted access to the host system is a good idea, though, because that could easily be abused and might make for a pretty bad CVE. In that case, maybe we should install Tailscale on the host in Unraid? Maybe with User Scripts or something?