Rubene


  1. @Emilio Unfortunately not. Nextcloud already didn't quite meet my needs and with these problems added, I switched to just shared network volumes. Problem did not occur again, so it was definitely related to the Nextcloud docker container. But absolutely no idea what the reason may have been.
  2. Same issue here, suddenly, since yesterday. I've been on 6.8.3 since it was released and had never seen this issue before. Curious what triggered it. /etc/rc.d/rc.nginx restart seems to fix it for now.
  3. Indeed, it looks like it has something to do with IO. But I don't suspect the disks. The appdata folder is located on the cache drive. The 2 hard drives were both asleep. Also, the cache drive had barely any ops (see also the screenshot in my first post). I suspect it has something to do with the network. That is where I see the most correlation, but I'm not entirely sure yet.
  4. Today I had the issue again around 11.51. Syslog shows nothing special:

       May 3 10:38:37 Tower emhttpd: shcmd (77895): /etc/rc.d/rc.samba restart
       May 3 10:38:40 Tower root: Starting Samba: /usr/sbin/smbd -D
       May 3 10:38:40 Tower root: /usr/sbin/nmbd -D
       May 3 10:38:40 Tower root: /usr/sbin/wsdd
       May 3 10:38:40 Tower root: /usr/sbin/winbindd -D
       May 3 10:38:40 Tower emhttpd: shcmd (77904): smbcontrol smbd close-share 'x'
       May 3 10:55:13 Tower kernel: mdcmd (294): spindown 0
       May 3 10:55:14 Tower kernel: mdcmd (295): spindown 1
       May 3 12:02:59 Tower kernel: veth0d8fa17: renamed from eth0
       May 3 12:02:59 Tower kernel: docker0: port 1(vethb96c1c0) entered disabled state --- Restarted a docker container
       May 3 12:02:59 Tower kernel: docker0: port 1(vethb96c1c0) entered disabled state
       May 3 12:02:59 Tower kernel: device vethb96c1c0 left promiscuous mode
       May 3 12:02:59 Tower kernel: docker0: port 1(vethb96c1c0) entered disabled state

     Again Nextcloud was not responding. I was not able to stop or kill that particular container (other containers can still be stopped and started). I managed to forcefully stop docker with /etc/rc.d/rc.docker force_stop, but the load was still there. Looking at top and ps, I noticed there were quite a few php-fpm processes in the D state ("uninterruptible sleep (usually IO)"). There was no way of stopping these. Nextcloud uses php-fpm, but I would expect these processes to be gone once the container is no longer running. Netdata was also running. The only correlation I see is an increased number of TCP sockets and a higher number of ipv4 UDP errors, ipv6 packets and errors. The issue is still very vague to me. Does anyone have any idea what this could be?
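To make the "php-fpm processes in the D state" check repeatable the next time the hang occurs, the following is a minimal sketch that lists every process currently in uninterruptible sleep by reading `/proc/<pid>/stat`. It assumes a Linux `/proc` layout; the function names `parse_stat` and `find_d_state` are made up for this example and are not part of Unraid or Docker.

```python
import os
from typing import List, Tuple


def parse_stat(stat_text: str) -> Tuple[int, str, str]:
    """Return (pid, command name, state) from the text of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the first "(" and the last ")".
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left].strip())
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root: str = "/proc") -> List[Tuple[int, str]]:
    """List (pid, command name) for processes in state D (uninterruptible sleep)."""
    stuck = []
    if not os.path.isdir(proc_root):
        return stuck
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited while scanning, or unreadable entry
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

Running this from cron or a loop while the problem is building up would at least record which processes got stuck first; it does not say what they are waiting on.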
  5. Thanks! Just did that. I was able to copy everything from Tools -> System log, but there is nothing around the time it started (10:27). At 11:31 I tried to stop the array.

       Apr 6 06:00:15 Tower root: /var/lib/docker: 18.9 GiB (20290076672 bytes) trimmed on /dev/loop2
       Apr 6 06:00:15 Tower root: /mnt/cache: 217.2 GiB (233194831872 bytes) trimmed on /dev/mapper/sdd1
       Apr 6 08:00:32 Tower kernel: mdcmd (62): spindown 0
       Apr 6 08:00:33 Tower kernel: mdcmd (63): spindown 1
       Apr 6 10:38:14 Tower webGUI: Successful login user root from 172.18.0.18
       Apr 6 10:38:36 Tower login[24621]: ROOT LOGIN on '/dev/pts/0'
       Apr 6 10:39:38 Tower kernel: mdcmd (64): spindown 0
       Apr 6 10:40:21 Tower nginx: 2020/04/06 10:40:21 [error] 10155#10155: *438771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 172.18.0.18, server: , request: "POST /webGui/include/Download.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "unraid.xxx.com", referrer: "https://unraid.xxx.com/Tools/Diagnostics"
       Apr 6 10:41:24 Tower kernel: mdcmd (65): spindown 1
       Apr 6 11:31:29 Tower webGUI: Successful login user root from 192.168.2.55
       Apr 6 11:31:39 Tower kernel: mdcmd (66): nocheck cancel
       Apr 6 11:31:40 Tower emhttpd: Spinning up all drives...
       Apr 6 11:31:40 Tower emhttpd: shcmd (8973): /usr/sbin/hdparm -S0 /dev/sdd
       Apr 6 11:31:40 Tower kernel: mdcmd (67): spinup 0
       Apr 6 11:31:40 Tower kernel: mdcmd (68): spinup 1
       Apr 6 11:31:40 Tower root:
       Apr 6 11:31:40 Tower root: /dev/sdd:
       Apr 6 11:31:40 Tower root: setting standby to 0 (off)
       Apr 6 11:31:45 Tower emhttpd: Stopping services...
       Apr 6 11:31:45 Tower root: Stopping docker_load
       Apr 6 11:31:46 Tower emhttpd: shcmd (8977): /etc/rc.d/rc.docker stop
       Apr 6 11:31:46 Tower kernel: br-b2a3f6552968: port 8(veth4088f17) entered disabled state
  6. Hi,

     This is the fourth time this happened on my server:

     This has multiple impacts:

     - I cannot access certain Docker containers anymore (Nextcloud, for example), but most of them are still running fine (all behind Traefik as reverse proxy).
     - Stopping (also forced), restarting, creating and deleting Docker containers is no longer possible. Not via the GUI, terminal or Portainer. Commands hang.
     - Creating a diagnostics file is no longer possible. Not via GUI or terminal.
     - Stopping the array is not possible anymore (it hangs, I think because docker is not responding).

     The only way to solve this is an (unclean) reboot. I think it is related to Nextcloud. All four times I was doing something with Nextcloud (although Nextcloud is used often, so why only these four times?). Also, Nextcloud itself is no longer accessible (gateway timeout). My guess is that it is network related, or could it be something else? And how can I verify that? Like I said, it's impossible to get a diagnostics report. I'm currently on 6.8.3, but it also happened on 6.8.1.

     Some more graphs:

     Thanks!
  7. Thanks for your reply! Glad to hear that it is nothing to worry about. Is this something unraid specific?
  8. The hardware errors actually happen every time during boot, except for one time. I did around 20 - 30 reboots, I guess. I checked the mcelog (/dev/mcelog) but it's empty. I also ran a second memtest: no errors. Apart from the notifications, the machine and array seem to be running fine. But I also had some issues with my new flash drive during some boots. It looked like it couldn't read some files. These got solved when I moved the flash drive to a USB 3.0 port instead of the 2.0 port it was in before. Could that make sense? Could anyone please have a look at these hardware errors? The hardware is brand new and I want to avoid any problems with it.
  9. Yesterday it happened again:

       Jan 10 21:08:29 Tower kernel: smpboot: CPU0: Intel(R) Core(TM) i3-8100 CPU @ 3.60GHz (family: 0x6, model: 0x9e, stepping: 0xb)
       Jan 10 21:08:29 Tower kernel: mce: [Hardware Error]: Machine check events logged
       Jan 10 21:08:29 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 8: ae00000000801136
       Jan 10 21:08:29 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 8b445140 MISC 47040000086
       Jan 10 21:08:29 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:906eb TIME 1578686893 SOCKET 0 APIC 0 microcode ca
       Jan 10 21:08:29 Tower kernel: mce: [Hardware Error]: Machine check events logged
       Jan 10 21:08:29 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 9: ae00000000801136
       Jan 10 21:08:29 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 8b445100 MISC 43040000086
       Jan 10 21:08:29 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:906eb TIME 1578686893 SOCKET 0 APIC 0 microcode ca

     Any idea what this could be? Since it happened twice now.
  10. I bought a new machine to run unRAID on. After booting I got this message: 'Your server has detected hardware errors'. The array seems to run fine (although it is still a small array without a parity disk. I explored unRAID on old hardware first. Since it's a good fit to replace my Synology, I bought new hardware and am going to purchase a license). This is in the syslog:

        Jan 9 21:21:03 Tower kernel: smpboot: CPU0: Intel(R) Core(TM) i3-8100 CPU @ 3.60GHz (family: 0x6, model: 0x9e, stepping: 0xb)
        Jan 9 21:21:03 Tower kernel: mce: [Hardware Error]: Machine check events logged
        Jan 9 21:21:03 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 8: ae00000000801136
        Jan 9 21:21:03 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 8b445140 MISC 43040000086
        Jan 9 21:21:03 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:906eb TIME 1578601246 SOCKET 0 APIC 0 microcode ca
        Jan 9 21:21:03 Tower kernel: mce: [Hardware Error]: Machine check events logged
        Jan 9 21:21:03 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 9: ae00000000801136
        Jan 9 21:21:03 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 8b445100 MISC 43040000086
        Jan 9 21:21:03 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:906eb TIME 1578601246 SOCKET 0 APIC 0 microcode ca

      To be sure it's not the RAM, I ran memtest. After 4 hours it completed and did not find any issues.

      +---------------------------------------------+-----------------+--------+
      | Test                                        | # Tests Passed  | Errors |
      +---------------------------------------------+-----------------+--------+
      | Test 0 [Address test, walking ones, 1 CPU]  | 4/4 (100%)      | 0      |
      | Test 1 [Address test, own address, 1 CPU]   | 4/4 (100%)      | 0      |
      | Test 2 [Address test, own address]          | 4/4 (100%)      | 0      |
      | Test 3 [Moving inversions, ones & zeroes]   | 4/4 (100%)      | 0      |
      | Test 4 [Moving inversions, 8-bit pattern]   | 4/4 (100%)      | 0      |
      | Test 5 [Moving inversions, random pattern]  | 4/4 (100%)      | 0      |
      | Test 6 [Block move, 64-byte blocks]         | 4/4 (100%)      | 0      |
      | Test 7 [Moving inversions, 32-bit pattern]  | 4/4 (100%)      | 0      |
      | Test 8 [Random number sequence]             | 4/4 (100%)      | 0      |
      | Test 9 [Modulo 20, ones & zeros]            | 4/4 (100%)      | 0      |
      | Test 10 [Bit fade test, 2 patterns, 1 CPU]  | 4/4 (100%)      | 0      |
      | Test 13 [Hammer test]                       | 4/4 (100%)      | 0      |
      +---------------------------------------------+-----------------+--------+

      If I understood correctly, this might happen very early in the boot process, during initialization. It doesn't seem to be something to worry about, but I would like to be sure. Thanks!

      tower-diagnostics-20200109-2056.zip
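
For anyone who wants to read the `Bank 8: ae00000000801136` value from the syslog rather than treat it as opaque, here is a minimal sketch that splits a machine-check status word into its flag bits. It assumes the architectural IA32_MCi_STATUS layout from the Intel Software Developer's Manual (flags in bits 63 to 57, MCA error code in bits 15:0, model-specific code in bits 31:16); the function name `decode_mci_status` is made up for this example, and the sketch is not a substitute for mcelog.

```python
# Flag bits of the machine-check status register, assuming the
# architectural IA32_MCi_STATUS layout described in the Intel SDM.
FLAG_BITS = {
    "VAL": 63,    # register contains valid error information
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting of this error was enabled
    "MISCV": 59,  # MISC register holds additional information
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine-check status word into named fields."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["mca_error_code"] = status & 0xFFFF
    fields["model_specific_code"] = (status >> 16) & 0xFFFF
    return fields


if __name__ == "__main__":
    # Status value copied from the syslog lines for Bank 8 and Bank 9.
    for name, value in decode_mci_status(0xAE00000000801136).items():
        if isinstance(value, bool):
            print(f"{name}: {value}")
        else:
            print(f"{name}: {value:#06x}")
```

For the value in the log, this reports VAL, UC, MISCV, ADDRV and PCC as set, OVER and EN as clear, and an MCA error code of 0x1136. EN being clear is at least consistent with the idea that the event was recorded before machine-check reporting was switched on early in boot, but that reading is an interpretation, not something the log states.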