
Batman

Members
  • Posts

    22
  • Joined

  • Last visited


Batman's Achievements

Noob

Noob (1/14)

5

Reputation

  1. I haven't had the chance to dig into hardware troubleshooting yet, but I realized that Home Assistant's data recording should give me a better indication of the actual failure time. I only had detailed data for the past 3 crashes, but they revealed the last entries in the Unraid log were separated from the actual crash time by 5 to 30 minutes. In my mind, this means there's probably no correlation between those entries and the root cause. I think the RAM is at the top of my suspect list right now, so I'll start with a memtest and go from there.
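The 5 to 30 minute gap mentioned above is simple timestamp arithmetic. A minimal sketch, assuming GNU `date` is available; `gap_minutes` is a hypothetical helper name and the timestamps in the usage note are made-up placeholders, not values from the logs:

```shell
#!/bin/bash
# Minutes between two "YYYY-MM-DD HH:MM:SS" timestamps.
# Assumes GNU date (-d); both values are read as UTC so DST cannot skew the gap.
gap_minutes() {
    local start end
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $(( (end - start) / 60 ))
}
```

For example, `gap_minutes "2024-03-16 00:15:16" "2024-03-16 00:40:16"` prints 25: the first argument stands in for the last syslog entry, and the second for the moment Home Assistant stopped receiving data.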
  2. Is there any good way to start narrowing down the source? On one hand, the old drives seem like the obvious place to start, but a different drive shows up in the logs from one crash to the next. Regarding the power supply, I find it hard to believe it's getting overloaded, as it's rated for over 2x the PCPartPicker estimate. I also checked the history of my UPS's measured power usage (recorded every 10 s in Home Assistant) and the peak was less than 200 W. I suppose the supply could be failing, but it's not that old. One last note: when the server crashes, the display stays on with a frozen image, but the server is unresponsive (to both the web GUI and a hardwired keyboard).
  3. Over the past ~6 months, I've had a handful of server crashes. Based on the logs, I suspect it could be related to the drives or the power supply, but I wanted to get some other thoughts.

     First off, some basic server information (built in 2021):
       • Unraid OS Plus 6.12.5 (though the crashes date back to 6.11.X)
       • Intel i7-4790K CPU (sourced from a working desktop)
       • Gigabyte GA-H97N motherboard (eBay)
       • G.Skill Ripjaws X 2x8 GB DDR3-1600 RAM
       • 4x WD Red Plus (CMR) 6 TB (from desktop, purchased 2018)
       • 2x WD Red (SMR) 6 TB (new 2021)
       • WD Blue SN550 1 TB NVMe SSD (cache, new 2021)
       • EVGA SuperNOVA G3 550W 80+ Gold power supply (new 2021)

     PCPartPicker estimates the power consumption at 246W, so there should be plenty of margin.

     Here are the diagnostics: diagnostics-20240319-1902.zip

     I have syslogs from several crashes that happened after I enabled the syslog server (there were a handful before this). I've trimmed each log to what I think is a reasonable amount of time before and after the crash. I can provide more of the logs if needed.

       • 26 Nov 2023: Server crashed around 03:26:06, potentially while spinning down /dev/sdb. syslog-crash-231126.txt
       • 28 Nov 2023: Server crashed around 04:23:10, potentially while reading SMART from /dev/sdc. syslog-crash-231128.txt
       • 11 Dec 2023: Server crashed around 19:58:16, and I don't really have a good indication of what might have been happening when it crashed. When the UPS I was using at this time was at 100% charge, its status message was "OL+DISCHRG" even though it was using mains power, not battery; the log is a bit cluttered with messages about this. Additionally, the USB connection of the UPS was unreliable, so I had a script check for the connection once a minute and reset USB if it was disconnected. There may be messages about this in the log too. syslog-crash-231211.txt
       • 14 Dec 2023: Server crashed around 07:08:16, and the log is pretty similar to the Dec 11 log, with many messages about the UPS. syslog-crash-231214.txt
       • 31 Dec 2023: Server crashed around 16:48:35, and like the 28 Nov log, the last message was reading SMART from /dev/sdc. syslog-crash-231231.txt
       • 16 Mar 2024: After a stable period of a few months, the server crashed around 00:15:16; this time the last message was reading SMART from /dev/sdb. Following start-up, during the resulting parity check, there were many disk0 read errors. This was the only time such errors occurred after a crash. syslog-crash-240316.txt
       • 19 Mar 2024: Back to more frequent crashes; this one happened around 14:31:04, while spinning down /dev/sdf. syslog-crash-240319.txt

     I will acknowledge some of the drives are old and likely need replacing, but I was hoping to narrow down the cause before I replace all of my drives.
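Because each trimmed log covers time both before and after the crash, the "last message" for a crash is the line just before the next boot begins. A rough sketch for pulling that line out of several files at once. It assumes a reboot can be recognised by the kernel's "Linux version" banner (an assumption about these logs, not something checked against them), and `last_entry_before_reboot` is a hypothetical helper name:

```shell
#!/bin/bash
# For each syslog file given, print the line just before the first kernel
# boot banner. The "kernel: Linux version" marker is an assumption; if a
# file has no banner, its final line is printed instead.
last_entry_before_reboot() {
    local f
    for f in "$@"; do
        awk -v name="$(basename "$f")" '
            /kernel: Linux version/ && prev != "" { print name ": " prev; found = 1; exit }
            { prev = $0 }
            END { if (!found && prev != "") print name ": " prev }
        ' "$f"
    done
}
```

Running `last_entry_before_reboot syslog-crash-*.txt` would give one line per crash, which makes it easier to see whether the same device or operation keeps appearing.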
  4. Even with this brief bit of logging, I'm seeing something interesting--but on the client side. Below is the history captured by Home Assistant, running a NUT client. It shows the UPS load flat at 5.8% from 20:56:27 until 21:00:27, when it decreases to 5.19%.

     Unraid's log shows a number of updates during this period that aren't reflected in Home Assistant:

     Jan 25 20:56:21 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:56:39 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:56:43 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:56:45 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:56:47 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "7.09"
     Jan 25 20:56:49 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:57:01 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:57:03 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "7.80"
     Jan 25 20:57:05 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:57:11 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:57:25 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:57:27 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:57:29 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:57:31 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "7.09"
     Jan 25 20:57:33 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:57:49 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:57:51 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "7.09"
     Jan 25 20:57:53 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:57:57 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:57:59 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:58:05 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:58:15 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:58:17 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:58:23 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:58:25 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "7.09"
     Jan 25 20:58:27 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:59:01 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:59:07 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:59:17 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:59:21 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:59:23 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:59:27 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:59:31 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:59:35 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:59:47 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:59:49 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 20:59:55 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 20:59:59 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.19"
     Jan 25 21:00:03 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "7.80"
     Jan 25 21:00:05 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "7.09"
     Jan 25 21:00:07 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 21:00:11 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "7.09"
     Jan 25 21:00:13 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "6.50"
     Jan 25 21:00:15 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.19"
     Jan 25 21:00:17 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "7.09"
     Jan 25 21:00:21 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"
     Jan 25 21:00:25 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.19"
     Jan 25 21:00:31 BatRAID apc_modbus[19406]: [D5:19406] send_to_all: SETINFO ups.load "5.80"

     It's notable that 1) Home Assistant appears to update NUT data only on the 27th second of each minute and 2) it discards a point if the data did not change. This is probably why I thought I was seeing periods without data being updated. (Edit: Home Assistant's documentation confirms the NUT integration uses local polling but doesn't specify the frequency, though it's apparently once a minute.)

     Anyway, I'll let this run for a bit, but I may have gotten to the root cause of (what I believed was) the NUT problem. The APCUPSD problem, on the other hand, is real. It would go days at a time without updating the values, and it didn't detect a simulated power outage during those periods. Is there similar debugging I can do with APCUPSD? I don't see logging/debug options in the GUI, so I presume I'll need to modify the configuration file.
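The polling explanation can be checked by replaying the driver log as a once-a-minute client would see it. This is only a sketch of that idea, not how Home Assistant actually works internally: the 27-second offset and the "drop unchanged points" rule come from the observations in the post, while `simulate_minute_poll` is a hypothetical helper and the tie-break (an update logged in the same second as the poll counts as already applied) is an assumption.

```shell
#!/bin/bash
# Replay an apc_modbus debug log (one entry per line on stdin) as a client
# that polls once a minute at second $1 (default 27) and records a point
# only when the value differs from the last recorded one.
# Limitation: the log must not cross midnight.
simulate_minute_poll() {
    awk -v off="${1:-27}" '
        function emit() {
            if (cur != shown) {
                printf "%02d:%02d:%02d %s\n", int(poll / 3600), int(poll % 3600 / 60), poll % 60, cur
                shown = cur
            }
        }
        /SETINFO ups\.load/ {
            split($3, hms, ":")
            t = hms[1] * 3600 + hms[2] * 60 + hms[3]
            if (!started) {
                # First poll instant at or after the first log entry
                poll = t - (t % 60) + off
                if (poll < t) poll += 60
                started = 1
            }
            # Emit every poll that fell strictly before this update
            while (poll < t) { emit(); poll += 60 }
            cur = $NF
            gsub(/"/, "", cur)
            last = t
        }
        END { if (started && poll <= last) emit() }
    '
}
```

Fed the log entries quoted in this post (for example `grep apc_modbus /var/log/syslog | simulate_minute_poll 27`, where the syslog path is an assumption), this sketch prints just two points, `20:56:27 5.80` and `21:00:27 5.19`. That matches the flat line Home Assistant recorded, despite the dozens of intermediate updates.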
  5. I'll try that and let it run for a bit. Do you know what debug level I should be using? 1 didn't seem to do much, but 2 is giving me entries like this one: Jan 25 20:35:02 BatRAID apc_modbus[21197]: [D2:21197] send_to_one: sending SETINFO battery.charge "100.00" Thing is, I got many of those all at once and none in the several minutes since then. Is that indicative of no new data from the UPS or is this expected for level 2 debug logging? Edit: I stepped up the levels until I got to 5, which seems to be providing a lot more information. Perhaps too much--this will be a big log file. But I'm getting very frequent entries with the updated data. As an alternative, I see that I can download data from the NUT Statistics Server, but I see two potential issues. The first is that there are no timestamps for each entry, just the date, though perhaps I can infer each entry is a minute based on my configuration. But more importantly, will the statistics server still include an entry if it thinks it's connected to the UPS but actually has stale data?
  6. In the terminal, use lsusb. It will list the USB devices connected to your server:

     root@BatRAID:~# lsusb
     Bus 002 Device 002: ID 8087:8001 Intel Corp. Integrated Hub
     Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
     Bus 001 Device 002: ID 8087:8009 Intel Corp.
     Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
     Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
     Bus 003 Device 011: ID 051d:0003 American Power Conversion UPS
     Bus 003 Device 003: ID 0930:6545 Toshiba Corp. Kingston DataTraveler 102/2.0 / HEMA Flash Drive 2 GB / PNY Attache 4GB Stick
     Bus 003 Device 009: ID 0c45:0133 Microdia USB Keyboard
     Bus 003 Device 010: ID 1a86:55d4 QinHeng Electronics 800 Z-Wave Stick
     Bus 003 Device 002: ID 046d:c52b Logitech, Inc. Unifying Receiver
     Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

     Find the line that corresponds to your UPS and use its description as the search string. You don't have to use the whole description, but it does need to be long enough that it matches only that one device.
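As a small illustration of the matching step, the bus/device pair can be extracted with awk alone instead of grep plus perl. `usb_path_for` is a hypothetical helper name; it reads `lsusb` output on stdin and matches the search string literally:

```shell
#!/bin/bash
# Print "BBB/DDD" (bus/device) for the first lsusb line containing $1.
# The match is a plain substring test, so no regex escaping is needed.
usb_path_for() {
    awk -v pat="$1" 'index($0, pat) { dev = $4; sub(/:$/, "", dev); print $2 "/" dev; exit }'
}
```

With the listing above, `lsusb | usb_path_for "American Power Conversion"` prints `003/011`. An empty result means nothing matched, so a calling script should check for that before passing the value to usbreset.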
  7. I recently purchased an APC Smart-UPS 1500 (model SMT1500) and have it connected to my Unraid server via USB (type A to B). When using the ModBus protocol, communication eventually drops out without any "communication lost" warning. ModBus is enabled on the UPS.

     For a period of time after starting the APCUPSD server, everything works as expected. Power usage and voltage in/out are updated in real time, and any power loss is detected immediately. After some time (say, 1-8 hours), all of the metrics stop updating and a simulated power loss isn't detected. Even though none of the data is being updated, APCUPSD still thinks it's connected. The connection rarely recovers from this state without a manual restart of APCUPSD (or an Unraid reboot). If the connection does recover, a loss of communication is reported.

     The settings I'm using are:
       • UPS cable: USB
       • Custom UPS cable: <not specified>
       • UPS type: ModBus
       • Device: <not specified>

     When not using the ModBus protocol, communication seems stable, with no disconnections (detected or undetected) for the last week. I'd really prefer ModBus though, because APCUPSD is missing load percentage and various other metrics when using the default USB protocol.

     I also tried NUT using the ModBus protocol, but the updates seem less frequent, arriving only once every several minutes. I didn't notice any hard dropouts, but there were periods of time with suspiciously few updates. No restarts were required to restore communication, unlike APCUPSD.

     With a prior UPS, I had to periodically reset the USB connection (it was not using ModBus) when running NUT (but not APCUPSD). I have no idea if these problems are related.

     Has anybody else had similar issues? Do I need to modify the ModBus settings of the UPS for more reliable performance?
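To confirm what the GUI settings above end up as on disk, the relevant directives can be read back from the apcupsd configuration file. This is a sketch under two assumptions: that the GUI fields map to apcupsd's `UPSCABLE`, `UPSTYPE` and `DEVICE` directives, and that the file lives at `/etc/apcupsd/apcupsd.conf`. `show_apcupsd_link_settings` is a hypothetical helper name.

```shell
#!/bin/bash
# Print the active (uncommented) connection directives from an apcupsd
# config file. The default path is an assumption; pass another as $1.
show_apcupsd_link_settings() {
    local conf="${1:-/etc/apcupsd/apcupsd.conf}"
    grep -E '^[[:space:]]*(UPSCABLE|UPSTYPE|DEVICE)([[:space:]]|$)' "$conf"
}
```

For the settings listed in the post, the expected result would be `UPSCABLE usb`, `UPSTYPE modbus` and a bare `DEVICE` line, since the device field was left unspecified.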
  8. @JonathanM I'm trying to do the same as Jokerigno--set my network cameras (restricted to LAN only) to use Unraid's NTP server. I tested the command you provided in Windows, but got this result:

     C:\Users\[username]>net time \\10.0.0.100 /set /yes
     Current time at \\10.0.0.100 is 7/25/2023 21:33:49
     System error 1314 has occurred.
     A required privilege is not held by the client.

     I don't see settings in either Unraid or the camera for authentication or privileges.

     ----
     Edit: This is because I wasn't using an elevated command prompt. When I run the command from an elevated command prompt, the time is set successfully.

     Any ideas why my camera can't set itself from my server? They're both on the same subnet and I'm not aware of firewall rules that would block the connection. The camera already provides video to the server.
  9. I currently have both DuckDNS (<mysubdomain>.duckdns.org) and Google Domains (ddns.<mydomain>.dev) DDNS services set up, and I've verified both point to my IP address. The Unraid VPN via the DuckDNS domain has been set up for a while and works fine, but I can't get the VPN to connect when using my Google Domains domain as the endpoint.

     The top-level domain for my Google DDNS is .dev, which enforces HSTS, so I suspect this may be the cause, though I really have no idea if WireGuard cares about this. Since HSTS requires SSL, I have tried a few things to associate a certificate with ddns.<mydomain>.dev:
       • Created a Let's Encrypt certificate with my Asus router, which is handling DDNS updates for Google DDNS. This doesn't have any effect though, because nothing redirects ddns.<mydomain>.dev to my router, nor would I want it to.
       • Created a Let's Encrypt certificate with Nginx Proxy Manager with a 404 redirect. With this configured, ddns.<mydomain>.dev indeed redirects to a 404 page with the appropriate certificate.
       • Used that same certificate with a proxy host for http://<myserver>:51821. This configuration yields a 502 bad gateway error for ddns.<mydomain>.dev.

     None of the above configurations allowed my clients to connect to Unraid's VPN using ddns.<mydomain>.dev. Does anyone know about WireGuard's compatibility with domains that enforce HSTS? Or more specifically, Unraid's implementation of WireGuard?
  10. For some reason, the name of my UPS (as reported by usbreset) changed to "Back-UPS BGM1500B FW31316S12-31320S10", so the script above stopped working. Since the name doesn't seem to be a stable way to reference the device, and I couldn't figure out how or if I could use wildcards with usbreset, I modified the script to use lsusb like Gooserhino's:

      #!/bin/bash
      # Test communication with UPS
      if upscmd -l ups > /dev/null; then
          echo "The UPS is connected."
      else
          echo "Error: Communication with the UPS has failed!"
          # Get USB path (bus/device) for the UPS
          APC=$( lsusb | grep "American Power Conversion" | perl -nE "/\D+(\d+)\D+(\d+).+/; print qq(\$1/\$2)")
          # Test for a USB connection
          if [ -z "$APC" ]; then
              echo "Error: No USB UPS detected!"
          else
              echo "Resetting the USB connection..."
              # Reset the USB connection
              if usbreset "$APC"; then
                  echo "USB reset completed successfully."
              else
                  echo "Error: USB reset failed!"
              fi
          fi
      fi
  11. It would be nice to actually solve the issue, but in the meantime I wrote a short script that resets the USB connection when communication is lost. It's scheduled to run on a regular basis with the User Scripts plugin.

      #!/bin/bash
      # Test communication with UPS
      if upscmd -l ups > /dev/null; then
          echo "The UPS is connected."
      else
          echo "Error: Communication with the UPS has failed!"
          if usbreset "Back-UPS BGM1500 FW:000000G0-313200S9"; then
              echo "USB reset completed successfully."
          else
              echo "Error: USB reset failed!"
          fi
      fi

      Edit: usbreset now shows a slightly different name for my UPS, so the script above stopped working. I couldn't figure out if or how usbreset would use wildcards, so I modified the script to look up the device with lsusb instead:

      #!/bin/bash
      # Test communication with UPS
      if upscmd -l ups > /dev/null; then
          echo "The UPS is connected."
      else
          echo "Error: Communication with the UPS has failed!"
          # Get USB path (bus/device) for the UPS
          APC=$( lsusb | grep "American Power Conversion" | perl -nE "/\D+(\d+)\D+(\d+).+/; print qq(\$1/\$2)")
          # Test for a USB connection
          if [ -z "$APC" ]; then
              echo "Error: No USB UPS detected!"
          else
              echo "Resetting the USB connection..."
              # Reset the USB connection
              if usbreset "$APC"; then
                  echo "USB reset completed successfully."
              else
                  echo "Error: USB reset failed!"
              fi
          fi
      fi
  12. The error occurs because the C code is being run as a script: the shell tries to interpret the source as shell commands, so it does nothing useful. I discovered that a version of "usbreset" is already present on Unraid, or at least it's built into version 6.10.2. Using that, I wrote a script I can schedule to check whether the UPS is communicating and reset the USB connection if it isn't:

      #!/bin/bash
      # Test communication with UPS
      if upscmd -l ups > /dev/null; then
          echo "The UPS is connected."
      else
          echo "Error: Communication with the UPS has failed!"
          if usbreset "Back-UPS BGM1500 FW:000000G0-313200S9"; then
              echo "USB reset completed successfully."
          else
              echo "Error: USB reset failed!"
          fi
      fi
  13. I'm getting the same error message every few days. It doesn't seem to correlate with a particular time or event. Unraid's built-in UPS tool never had these dropouts, but I switched to NUT because APCUPSD interpreted the fully charged state of my UPS as a power failure (my UPS was reporting "Online + On Battery"). I have an APC BGM1500B. When I unplug and replug the UPS into the server, it works again after restarting NUT. Restarting NUT, or disabling and re-enabling it, without unplugging the USB cable does not fix the problem.
  14. I'm also having issues connecting to my LAN using WireGuard on Windows. I've set WireGuard up successfully on several other devices, including Android and iOS. It appears the handshake is completing, but when I try to ping my Windows PC from Unraid's VPN UI, it fails. The ping succeeds for my non-Windows devices. I can't access SMB shares or Unraid's UI either. I've tried setting the VPN to a private network (using the elevated PowerShell command below) and disabling the firewall for private networks, without success.

      Set-NetConnectionProfile -InterfaceAlias 'peer-BatRAID-wg0-4' -NetworkCategory 'Private'

      Here's a sample from the WireGuard log in Windows:

      2022-08-07 HH:13:07.569350: [TUN] [peer-BatRAID-wg0-4] Starting WireGuard/0.5.3 (Windows 10.0.22000; amd64)
      2022-08-07 HH:13:07.569350: [TUN] [peer-BatRAID-wg0-4] Watching network interfaces
      2022-08-07 HH:13:07.570850: [TUN] [peer-BatRAID-wg0-4] Resolving DNS names
      2022-08-07 HH:13:07.656637: [TUN] [peer-BatRAID-wg0-4] Creating network adapter
      2022-08-07 HH:13:07.718632: [TUN] [peer-BatRAID-wg0-4] Using existing driver 0.10
      2022-08-07 HH:13:07.725633: [TUN] [peer-BatRAID-wg0-4] Creating adapter
      2022-08-07 HH:13:07.896134: [TUN] [peer-BatRAID-wg0-4] Using WireGuardNT/0.10
      2022-08-07 HH:13:07.896134: [TUN] [peer-BatRAID-wg0-4] Enabling firewall rules
      2022-08-07 HH:13:07.881134: [TUN] [peer-BatRAID-wg0-4] Interface created
      2022-08-07 HH:13:07.903233: [TUN] [peer-BatRAID-wg0-4] Dropping privileges
      2022-08-07 HH:13:07.903633: [TUN] [peer-BatRAID-wg0-4] Setting interface configuration
      2022-08-07 HH:13:07.903633: [TUN] [peer-BatRAID-wg0-4] Peer 1 created
      2022-08-07 HH:13:07.910464: [TUN] [peer-BatRAID-wg0-4] Interface up
      2022-08-07 HH:13:07.910464: [TUN] [peer-BatRAID-wg0-4] Monitoring MTU of default v4 routes
      2022-08-07 HH:13:07.918633: [TUN] [peer-BatRAID-wg0-4] Setting device v4 addresses
      2022-08-07 HH:13:07.923320: [TUN] [peer-BatRAID-wg0-4] Monitoring MTU of default v6 routes
      2022-08-07 HH:13:07.923763: [TUN] [peer-BatRAID-wg0-4] Setting device v6 addresses
      2022-08-07 HH:13:07.925134: [TUN] [peer-BatRAID-wg0-4] Startup complete
      2022-08-07 HH:13:08.108134: [TUN] [peer-BatRAID-wg0-4] Sending handshake initiation to peer 1 (<WAN IP Address>:51820)
      2022-08-07 HH:13:08.174246: [TUN] [peer-BatRAID-wg0-4] Receiving handshake response from peer 1 (<WAN IP Address>:51820)
      2022-08-07 HH:13:08.174246: [TUN] [peer-BatRAID-wg0-4] Keypair 1 created for peer 1
      2022-08-07 HH:13:51.108471: [TUN] [peer-BatRAID-wg0-4] Sending keepalive packet to peer 1 (<WAN IP Address>:51820)
      2022-08-07 HH:14:39.315425: [TUN] [peer-BatRAID-wg0-4] Sending keepalive packet to peer 1 (<WAN IP Address>:51820)
      2022-08-07 HH:15:03.840695: [TUN] [peer-BatRAID-wg0-4] Sending keepalive packet to peer 1 (<WAN IP Address>:51820)
      2022-08-07 HH:15:10.995889: [TUN] [peer-BatRAID-wg0-4] Sending handshake initiation to peer 1 (<WAN IP Address>:51820)
      2022-08-07 HH:15:11.061637: [TUN] [peer-BatRAID-wg0-4] Receiving handshake response from peer 1 (<WAN IP Address>:51820)
      2022-08-07 HH:15:11.061637: [TUN] [peer-BatRAID-wg0-4] Keypair 2 created for peer 1
      2022-08-07 HH:15:11.061637: [TUN] [peer-BatRAID-wg0-4] Sending keepalive packet to peer 1 (<WAN IP Address>:51820)
      2022-08-07 HH:15:21.672595: [TUN] [peer-BatRAID-wg0-4] Sending keepalive packet to peer 1 (<WAN IP Address>:51820)
      2022-08-07 HH:15:38.645740: [TUN] [peer-BatRAID-wg0-4] Shutting down
      2022-08-07 HH:15:38.647938: [MGR] [peer-BatRAID-wg0-4] Tunnel service tracker finished
  15. After changing the ups.conf as I noted above, NUT is behaving as expected both in Unraid and on my Windows PC.