ssb201

Members
  • Posts

    58


Community Answers

  1. As best I can tell, Plex runs into a problem during its nightly content scan and soaks up all the memory. I am not sure what the hell is going on, but with a hard cap on the memory assigned to the container, I just get the OOM killer reaping the container rather than my entire box locking up. There really is no good reason for a non-privileged docker container to be able to effectively lock up the host operating system.
  2. I was able to do a little more digging:
     1) The system seemed to run just fine without plugins or the array running.
     2) The system seemed to run just fine with plugins and the array running but no VMs or docker containers.
     3) The system seemed to run just fine with plugins, the array, and a single VM running but no docker containers.
     4) I selected three docker containers to run along with the single VM (OrganizrV2, Nginx, and Plex) and the system had the issue again sometime overnight.
     This time I had an SSH connection already established, so I did not run into the timeout trying to log in. I was able to run top (very slowly) and grabbed this:

     top - 10:10:31 up 4 days, 6:08, 1 user, load average: 139.34, 137.18, 136.81
     Tasks: 621 total, 6 running, 593 sleeping, 0 stopped, 22 zombie
     %Cpu(s): 0.0 us, 13.3 sy, 0.0 ni, 16.7 id, 69.8 wa, 0.0 hi, 0.2 si, 0.0 st
     MiB Mem : 96299.8 total, 576.1 free, 95157.7 used, 566.1 buff/cache
     MiB Swap: 0.0 total, 0.0 free, 0.0 used. 103.6 avail Mem

         PID USER     PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         241 root     20   0       0      0      0 R 100.0   0.0 363:18.89 kswapd0
       19866 root     20   0       0      0      0 R  37.7   0.0  11:33.42 kworker/u40:4+loop3
       21057 root     20   0       0      0      0 R  15.5   0.0   4:18.59 kworker/u40:7+loop3
       12497 root     20   0   47.9g  44.2g   6532 D  14.2  47.0   2323:15 qemu-system-x86
       20737 root     20   0       0      0      0 R  14.2   0.0   8:47.72 kworker/u40:16+loop3
        3859 root     20   0   10576   7004   5748 D   7.0   0.0   0:12.33 sshd
       13950 nobody   20   0  257492  36512      4 S   5.7   0.0  14:40.36 Plex Media Serv
        6287 root     20   0  755916  19604      0 S   4.4   0.0  21:09.10 containerd
       13673 root     20   0  722964  10460      0 S   3.5   0.0  14:00.71 containerd0:0+loop1
       21067 root     20   0       0      0      0 I   3.1   0.0   2:48.72 kworker/u40:8-btrfs-endio-meta
       21308 root     20   0       0      0      0 I   3.1   0.0   0:52.46 kworker/u40:10-btrfs-endio
       20239 root     20   0       0      0      0 I   2.8   0.0  10:10.98 kworker/u40:9-btrfs-endio
       20954 root     20   0       0      0      0 I   2.8   0.0   3:37.87 kworker/u40:17-btrfs-endio
       21023 root     20   0       0      0      0 I   2.8   0.0   2:07.98 kworker/u40:5-btrfs-endio-meta
       21145 root     20   0       0      0      0 I   2.8   0.0   2:17.69 kworker/u40:2-btrfs-endio
       17850 root     20   0  722708  10676      0 S   2.6   0.0  12:06.54 containerd-shim
       21112 root     20   0       0      0      0 I   2.6   0.0   2:02.28 kworker/u40:1-btrfs-endio-meta
       13673 root     20   0  722964  10460      0 S   2.0   0.0  14:00.60 containerd-shim
       16045 root     20   0  722964  11644      0 S   2.0   0.0  11:02.11 containerd-shim
         630 root     20   0       0      0      0 D   1.7   0.0   6:12.87 usb-storage
       17002 root     20   0  690924  41360    428 S   1.4   0.0  37:57.79 shfs
        1669 ntp      20   0   74592   3040   2304 D   1.1   0.0   5:18.07 ntpd
        6214 root     20   0 4354540  47096      0 S   1.1   0.0   7:20.32 dockerd

     The load average is extraordinarily high and I am not sure what kswapd is doing to consume all that CPU. The system has plenty of RAM so there should be no virtual memory swapping of any magnitude. This seems very similar to the issue here: but (referring to the final comment) I am not using ZFS. And essentially identical to the issue here:
  3. Yeah. That may be my best bet. I had a similar problem with just the logins timing out (WebUI worked but SSH would not) a few months ago and found that it was an Active Directory issue (as soon as I turned off Samba everything was fine). That was a problem with the designated FSMO on my Windows servers. This is definitely something stranger. I hate to lose all server functionality (I barely ever use it as an actual NAS), but I will try running without any dockers today and, if it locks up again, go to safe mode and kill both dockers and VMs.
  4. Recently my server has developed a very strange problem where it stops responding almost entirely. Please note that this is not a crash, and it is not fully locked up. It started out happening infrequently, but now happens every night.
     1) The server stops responding to SSH or WebUI connections.
     2) The server does respond to ICMP, and scans do show open TCP ports, though no service banners are returned.
     3) There is no seg fault and nothing in the syslog (attached).
     4) If I physically connect to the console (USB keyboard and VGA) I get a login prompt, but when I attempt to log in it times out (a message reports that the login attempt timed out). I can access alternate TTYs this way, but they all respond the same.
     5) (Anecdotal) It seems to happen around the same time every day (between 3 and 4 AM).
     6) If I initiate a reboot from the console (CTRL-ALT-DEL) it switches to INIT 6 and starts the shutdown process. It tries to shut down gracefully, then initiates a forced shutdown, and then hangs.
     This is a relatively new chassis (Mobo/CPU), with RAM and drives taken from a stable system that had been running for months, and it had been working smoothly for weeks before anything started happening. I upgraded from 6.12.6 to the latest, but that did not change anything. I am using IPVLAN and Intel NICs, so this is not the MACVLAN or Realtek chip issue. The lack of any error messages or significant log entries, together with the system continuing to process interrupts, has me leaning away from a hardware issue.
     Attaching syslogs. I attempted to SSH and the server would not accept the password and then stopped responding to anything from the network. On Feb 21, I know it stopped functioning around 3:20 AM because of that SSH attempt. On Feb 22, I do not know exactly when it failed, as it was not working when I woke up in the morning. Please note that none of the interactions I attempted from the console (either time) show up in the syslog, nor do any of the init statements from the attempted reboot.
syslog-Feb-21.txt syslog-Feb-22.txt
  5. MiniPCs like the UM790 do not have expansion slots. You cannot put an extra SATA/SAS controller into one. It does have a USB4 port or, as you posted originally, the new one has an Oculink port. You can take the QNAP you posted above or any other SATA/SAS JBOD enclosure, plug the PCIe SATA interface card it comes with into a PCIe "dock", and connect that to your machine using Oculink or USB4. I am not sure why you would want the SATA/SAS controller external to your drive enclosure, but it is possible to do. If you were thinking of just a cable conversion between SATA/SAS (SFF-8088/8643) and PCIe (Oculink/USB4/Thunderbolt), that does not exist. They are completely distinct protocols.
  6. You would need an external dock (USB4 to PCIe) to connect the card to a miniPC. Most are "intended" for graphics cards, but should work for any PCIe card. I have had no problems with USB storage. Essentially it is just a different communications protocol. I had an 8-bay Syba USB drive enclosure lying around. It is only USB 3.0, which does seem to limit my performance to around 110MB/s. If I were getting something new, I would probably go with USB 3.1 to double the performance. Since my NAS is primarily media and I have NVMe cache drives, disk performance is a relative non-issue, especially since most of my home network is still limited to gigabit.
  7. Oculink is a standard PCIe connection, so there is no reason you cannot run any PCIe card you want over it. That said, the solution seems a bit expensive and complicated. I run my server on a UM790 Pro connected to an 8 bay JBOD enclosure using good old USB. It works just fine, though maybe a bit slower than a true PCIe -> SATA/SAS connection. The enclosure was less than half what the QNAP would cost before dealing with any of the Oculink componentry.
  8. It would be nice if Unraid functioned in a similar fashion to Hyper-V during array stops or system restarts. Hyper-V will automatically snapshot ("save") VMs in their current running state, shut down or restart, and then rehydrate the VMs when the system starts back up. Currently, stopping the array requires a guest agent to be installed on the VM and the VM to shut itself down gracefully before the array can stop. I have a Windows VM that never seems to accept the signal and thus hangs, keeping the entire array from shutting down due to file locks. I do not want anything running in the VM to dictate whether or not I can cleanly shut down the array. It would be nice if Unraid did this by default or at least offered a configuration option for it.
  9. I did not think to upload the diagnostics because I assumed it would not be that interesting since the logs are completely full of the failed spindown. Here are the diagnostics: tower-diagnostics-20191226-2145.zip
  10. Upgraded my server (from 6.6.6) last week and ran into two issues:
     1) Lockups and reboots. After upgrading, the server would lock up and/or reboot randomly after a few hours to a day of running. This is the same behavior that happened when I tried to upgrade to 6.7.0 (I ended up going back down to 6.6.6 and was stable for 125 days straight). Nothing obvious appears in the logs. During one of the periods between reboots, I happened to have also installed the Disable Mitigation Settings plugin for some testing. As soon as I turned off all the mitigations the problem went away. The system now seems to be running completely stable. This is sitting on a private VLAN at home, so losing the security and gaining some performance is not that big a deal, but it seems very strange. I would love to know how the Intel microcode update is apparently breaking my hardware.
     2) One of my drives is no longer able to spin down. Drive 5 (HUH728080AL4200) is now always running and never spins down. I see the following in the logs:

     Dec 26 18:35:53 Tower kernel: mdcmd (200610): spindown 5
     Dec 26 18:35:53 Tower emhttpd: error: mdcmd, 2726: Input/output error (5): write
     Dec 26 18:35:53 Tower kernel: md: do_drive_cmd: disk5: ATA_OP e0 ioctl error: -5

     A little Internet sleuthing found that this is related to a command the SATA/SAS controller does not support for SAS drives. The online comments say that this can be safely ignored. What is strange is that this has never been an issue before on older versions of Unraid. On 6.6.6 the drive spun down fine without any error in the logs.
  11. I seem to have similar issues as others in this forum. I upgraded my server from 6.6.6 to 6.7.2. It was stable for a few days and then it would either lock up or reboot unexpectedly. Nothing showed in the logs. The last time, it was fine when I went to sleep but was not responding to ping in the morning. After a reboot it came up, ran for an hour, then rebooted on its own again. As soon as I returned home from work I downgraded back to 6.6.6 and it has been stable once again. Any idea how I can collect logs or details of the crash/hang/reboot? I am leery of updating again, but if I can capture useful information without risking my data, I will.
  12. I hooked it up to the on-board SATA controller and saw ATA errors.

     [ 369.829354] sd 4:0:0:0: [sdc] tag#26 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
     [ 369.829360] sd 4:0:0:0: [sdc] tag#26 CDB: opcode=0x88 88 00 00 00 00 00 00 00 64 00 00 00 06 00 00 00
     [ 369.829362] print_req_error: I/O error, dev sdc, sector 25600

     It still seems to work just fine on my Windows machine using a USB-SATA controller. I have given up trying to figure this puzzle out. I am ordering a new drive and will just use the problem child somewhere else. Thanks everyone for the ideas.
  13. Yeah, I understand that there will be additional work for the drives until I replace it. I took the drive out and used it with a USB-SATA controller on Windows and it worked just fine. That leads me to suspect a controller problem, despite the firmware update. I am just puzzled, because I have other 512e drives working just fine with the controller and the drive is explicitly listed in the controller support document: https://docs.broadcom.com/docs/IT-SAS-Gen2.5CompatibilityList
     The two Hitachi drives that are working with the controller are:
     HDN728080ALE604 - DeskStar - 512e SATA 6Gb/s - Secure Erase (overwrite only)
     HUH728080AL4200 - UltraStar - 4kn SAS 12Gb/s - Instant Secure Erase
     This one is not:
     HUH728080ALE600 - UltraStar - 512e SATA 6Gb/s - Instant Secure Erase
     The DeskStar has the exact same interface (512e SATA 6Gb/s), and the working UltraStar uses an even more advanced interface and supports the same Instant Secure Erase. The only other idea I could come up with is that it has to do with the backplane expander, since this is a 12 port system and this is the first drive pushing it past the halfway count (drive number 7). Any ideas what I could try next? I am racking my brain on this.
  14. Doh. That makes perfect sense. I am still not clear why I am getting read errors without any SMART issues at all. I will have to pull the drive and try it in some other systems.
  15. New update: The drive shows as Not Installed (missing) from the array. For some reason it is still receiving writes to shares as when I extracted files to a share they ended up being written to the array. I assume this is due to some weirdness with the union file system. What I do not get is why I was able to go directly to /mnt/disk5 and read and write files without any errors or problems.
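
The top header quoted in post 2 can be sanity-checked mechanically. The following is only a rough sketch in Python, not anything from the original thread: the function names and the two thresholds (`min_avail_fraction`, `max_iowait_pct`) are made up for illustration, and the regular expressions are written against the exact header text pasted above, so other top versions or locales may need adjusting. It pulls out the 1-minute load average, the I/O wait percentage, and the "avail Mem" figure, then flags the combination of almost no available memory and very high I/O wait, which is the state that post describes.

```python
import re

# Header lines copied from the top output quoted in post 2.
TOP_HEADER = """\
top - 10:10:31 up 4 days, 6:08, 1 user, load average: 139.34, 137.18, 136.81
%Cpu(s): 0.0 us, 13.3 sy, 0.0 ni, 16.7 id, 69.8 wa, 0.0 hi, 0.2 si, 0.0 st
MiB Mem : 96299.8 total, 576.1 free, 95157.7 used, 566.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 103.6 avail Mem
"""


def parse_top_header(text):
    """Extract a few summary figures from a top header (hypothetical helper)."""
    patterns = {
        "load_1min": r"load average:\s*([\d.]+)",
        "iowait_pct": r"([\d.]+)\s+wa\b",
        "mem_total_mib": r"MiB Mem\s*:\s*([\d.]+)\s+total",
        "mem_avail_mib": r"([\d.]+)\s+avail Mem",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"could not find {name} in top header")
        stats[name] = float(match.group(1))
    return stats


def looks_memory_starved(stats, min_avail_fraction=0.01, max_iowait_pct=50.0):
    """Return True when available memory is tiny AND I/O wait is very high.

    Both thresholds are arbitrary illustrative values, not tuned guidance.
    """
    avail_fraction = stats["mem_avail_mib"] / stats["mem_total_mib"]
    return (avail_fraction < min_avail_fraction
            and stats["iowait_pct"] > max_iowait_pct)


if __name__ == "__main__":
    stats = parse_top_header(TOP_HEADER)
    print(stats)
    print("memory starved:", looks_memory_starved(stats))
```

With the header from post 2, available memory is roughly a tenth of a percent of the total, which fits kswapd0 being pinned at 100% CPU even though no swap is configured.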
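
For the spindown errors in post 10, a small script can count how often each drive rejects a command, which makes it easier to compare behavior before and after an upgrade. This is a minimal sketch under stated assumptions: the pattern is derived only from the three syslog lines quoted in that post, `failed_drive_commands` is a hypothetical name, and other Unraid or kernel versions may format the message differently.

```python
import re
from collections import Counter

# Syslog lines copied from post 10.
SYSLOG = """\
Dec 26 18:35:53 Tower kernel: mdcmd (200610): spindown 5
Dec 26 18:35:53 Tower emhttpd: error: mdcmd, 2726: Input/output error (5): write
Dec 26 18:35:53 Tower kernel: md: do_drive_cmd: disk5: ATA_OP e0 ioctl error: -5
"""

# Assumed message shape, based solely on the quoted kernel line.
DRIVE_CMD_ERROR = re.compile(
    r"do_drive_cmd: disk(\d+): ATA_OP ([0-9a-f]+) ioctl error: (-?\d+)"
)


def failed_drive_commands(log_text):
    """Count (disk number, ATA opcode, errno) triples found in a syslog dump."""
    counts = Counter()
    for match in DRIVE_CMD_ERROR.finditer(log_text):
        disk = int(match.group(1))
        opcode = match.group(2)
        errno_value = int(match.group(3))
        counts[(disk, opcode, errno_value)] += 1
    return counts


if __name__ == "__main__":
    for (disk, opcode, err), n in failed_drive_commands(SYSLOG).items():
        print(f"disk{disk}: ATA_OP {opcode} failed with {err} ({n} time(s))")
```

Running the same function over a full syslog from 6.6.6 and from the upgraded release would show whether the `e0` failures on disk 5 are really new.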