ssb201

Everything posted by ssb201

  1. As best I can tell, Plex is having an issue during its nightly content-scanning task and soaks up all the memory. Not sure what the hell is going on, but by adding a hard cap on the memory assigned to the container, I just get the OOM killer reaping the container rather than my entire box locking up. There really is no good reason for a non-privileged docker container to be able to effectively lock up the host operating system. A sketch of the cap is below.
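     For anyone wanting to do the same, this is roughly the shape of it, assuming a plain docker run (on Unraid the flags would go in the container's Extra Parameters field; the 8g figure is just an example, not a recommendation):

        # Hard-cap the container at 8 GiB. Setting --memory-swap equal to
        # --memory forbids swap use, so when Plex blows past the cap the
        # kernel OOM killer reaps the container instead of the host thrashing.
        docker run -d --name plex \
          --memory=8g --memory-swap=8g \
          plexinc/pms-docker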
  2. I was able to do a little more digging:
     1) The system seemed to run just fine without plugins or the array running.
     2) The system seemed to run just fine with plugins and the array running, but no VMs or docker containers.
     3) The system seemed to run just fine with plugins, the array, and a single VM running, but no docker containers.
     4) I selected three docker containers to run along with the single VM (OrganizrV2, Nginx, and Plex) and the system (sometime overnight) had the issue again.
     This time I had an SSH connection already established, so I did not run into the timeout trying to log in. I was able to run top (very slowly) and grabbed this:

        top - 10:10:31 up 4 days, 6:08, 1 user, load average: 139.34, 137.18, 136.81
        Tasks: 621 total, 6 running, 593 sleeping, 0 stopped, 22 zombie
        %Cpu(s): 0.0 us, 13.3 sy, 0.0 ni, 16.7 id, 69.8 wa, 0.0 hi, 0.2 si, 0.0 st
        MiB Mem : 96299.8 total, 576.1 free, 95157.7 used, 566.1 buff/cache
        MiB Swap: 0.0 total, 0.0 free, 0.0 used. 103.6 avail Mem

          PID USER    PR NI    VIRT   RES   SHR S  %CPU %MEM     TIME+ COMMAND
          241 root    20  0       0     0     0 R 100.0  0.0 363:18.89 kswapd0
        19866 root    20  0       0     0     0 R  37.7  0.0  11:33.42 kworker/u40:4+loop3
        21057 root    20  0       0     0     0 R  15.5  0.0   4:18.59 kworker/u40:7+loop3
        12497 root    20  0   47.9g 44.2g  6532 D  14.2 47.0   2323:15 qemu-system-x86
        20737 root    20  0       0     0     0 R  14.2  0.0   8:47.72 kworker/u40:16+loop3
         3859 root    20  0   10576  7004  5748 D   7.0  0.0   0:12.33 sshd
        13950 nobody  20  0  257492 36512     4 S   5.7  0.0  14:40.36 Plex Media Serv
         6287 root    20  0  755916 19604     0 S   4.4  0.0  21:09.10 containerd
        13673 root    20  0  722964 10460     0 S   3.5  0.0  14:00.71 containerd-shim
        21067 root    20  0       0     0     0 I   3.1  0.0   2:48.72 kworker/u40:8-btrfs-endio-meta
        21308 root    20  0       0     0     0 I   3.1  0.0   0:52.46 kworker/u40:10-btrfs-endio
        20239 root    20  0       0     0     0 I   2.8  0.0  10:10.98 kworker/u40:9-btrfs-endio
        20954 root    20  0       0     0     0 I   2.8  0.0   3:37.87 kworker/u40:17-btrfs-endio
        21023 root    20  0       0     0     0 I   2.8  0.0   2:07.98 kworker/u40:5-btrfs-endio-meta
        21145 root    20  0       0     0     0 I   2.8  0.0   2:17.69 kworker/u40:2-btrfs-endio
        17850 root    20  0  722708 10676     0 S   2.6  0.0  12:06.54 containerd-shim
        21112 root    20  0       0     0     0 I   2.6  0.0   2:02.28 kworker/u40:1-btrfs-endio-meta
        13673 root    20  0  722964 10460     0 S   2.0  0.0  14:00.60 containerd-shim
        16045 root    20  0  722964 11644     0 S   2.0  0.0  11:02.11 containerd-shim
          630 root    20  0       0     0     0 D   1.7  0.0   6:12.87 usb-storage
        17002 root    20  0  690924 41360   428 S   1.4  0.0  37:57.79 shfs
         1669 ntp     20  0   74592  3040  2304 D   1.1  0.0   5:18.07 ntpd
         6214 root    20  0 4354540 47096     0 S   1.1  0.0   7:20.32 dockerd

     The load average is extraordinarily high and I am not sure what kswapd is doing to consume all that CPU. The system has plenty of RAM, so there should be no virtual memory swapping of any magnitude. This seems very similar to the issue here: but (referring to the final comment) I am not using ZFS. And essentially identical to the issue here:
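     If it happens again while I still have a live shell, a few quick read-only checks I plan to grab, assuming standard Linux tooling:

        # Swap-in/out, I/O wait, and run queue, sampled every second
        vmstat 1 10

        # Memory pressure stall info (kernels with PSI enabled)
        cat /proc/pressure/memory

        # Is it page cache, anonymous memory, or dirty pages piling up?
        grep -E 'MemFree|^Cached|AnonPages|Dirty|Writeback' /proc/meminfo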
  3. Yeah, that may be my best bet. I had a similar problem with just the logins timing out (the WebUI worked but SSH would not) a few months ago and found that it was an Active Directory issue (as soon as I turned off Samba everything was fine). That was a problem with the designated FSMO role holder on my Windows servers. This is definitely something stranger. I hate to lose all server functionality (I barely ever use it as an actual NAS), but I will try running without any dockers today, and if it locks up again, go to safe mode and kill both dockers and VMs.
  4. Recently my server has developed a very strange problem, where it stops responding almost entirely. Please note, this is not a crash and it is not fully locked up. This started happening infrequently, but now happens every night.
     1) The server stops responding to SSH or WebUI connections.
     2) The server does respond to ICMP, and scans do show open TCP ports, though no service banners are returned.
     3) There is no seg fault and nothing in the syslog (attached).
     4) If I physically connect to the console (USB keyboard and VGA) I get a login prompt; however, it times out (a message says the login attempt timed out) when I attempt to log in. I can access alternate TTYs this way, but they all respond the same.
     5) (Anecdotal) It seems to happen around the same time every day (between 3 and 4 AM).
     6) If I initiate a reboot from the console (CTRL-ALT-DEL) it switches to INIT 6 and starts the shutdown process. It tries to shut down gracefully, then initiates a forced shutdown, and then hangs.
     This is a relatively new chassis (Mobo/CPU), with RAM and drives taken from a stable system that had been running for months, and it had been working smoothly for weeks before anything started happening. I upgraded from 6.12.6 to the latest, but that did not change anything. I am using IPVLAN and Intel NICs, so this is not the MACVLAN or Realtek chip issue. The lack of any error messages/significant log entries, and the fact that the box keeps processing interrupts, has me leaning away from a hardware issue. Attaching syslogs. I attempted to SSH and the server would not accept the password and then stopped responding to anything from the network. Feb 21, I know it stopped functioning around 3:20 AM because of the SSH. Feb 22, I do not know exactly when it failed, as it was not working when I woke up in the morning. Please note that none of the interactions I attempted (either time) from the console show up in the syslog, nor do any of the init statements from the attempted reboot.
     syslog-Feb-21.txt
     syslog-Feb-22.txt
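     Since nothing from the failure window survives in the local syslog, I am going to ship logs off-box as they happen. A sketch, assuming Unraid's stock rsyslog and a second always-on machine at 192.168.1.10 (the address is made up; the GUI equivalent is Settings -> Syslog Server):

        # On the Unraid box: mirror all syslog traffic to the remote collector
        echo '*.* @192.168.1.10:514' >> /etc/rsyslog.conf
        /etc/rc.d/rc.rsyslogd restart

        # On the collector: enable UDP reception in /etc/rsyslog.conf
        #   module(load="imudp")
        #   input(type="imudp" port="514")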
  5. MiniPCs like the UM790 do not have expansion slots, so you cannot put an extra SATA/SAS controller into one. It does have a USB4 port or, as you posted originally, the new one has an OCuLink port. You can take the QNAP you posted above, or any other SATA/SAS JBOD enclosure, connect the PCIe SATA interface card it comes with into a PCIe "dock", and connect that to your machine using OCuLink or USB4. I am not sure why you would want the SATA/SAS controller external to your drive enclosure, but it is possible to do. If you were thinking of just a cable conversion between SATA/SAS (SFF-8088/8643) and PCIe (OCuLink/USB4/Thunderbolt), that does not exist. They are completely distinct protocols.
  6. You would need an external dock (USB4 to PCIe) to connect the card to a miniPC. Most are "intended" for graphics cards, but they should work for any PCIe card. I have had no problems with USB storage; essentially it is just a different communications protocol. I had an 8-bay Syba USB drive enclosure lying around. It is only USB 3.0, which does seem to limit my performance to around 110MB/s. If I were getting something new, I would probably go with USB 3.1 to double the throughput. Since my NAS is primarily media and I have NVMe cache drives, disk performance is a relative non-issue, especially since most of my home network is still limited to gigabit.
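     If you want to sanity-check what an enclosure actually delivers, something like this works (read-only tests; /dev/sdX is a placeholder for a disk in the enclosure):

        # Buffered sequential read speed straight off the device
        hdparm -t /dev/sdX

        # A longer sequential read bypassing the page cache
        dd if=/dev/sdX of=/dev/null bs=1M count=4096 iflag=direct status=progress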
  7. OCuLink is a standard PCIe connection, so there is no reason you cannot run any PCIe card you want over it. That said, the solution seems a bit expensive and complicated. I run my server on a UM970 Pro connected to an 8-bay JBOD enclosure using good old USB. It works just fine, though maybe a bit more slowly than a true PCIe -> SATA/SAS connection. The enclosure was less than half what the QNAP would cost, before dealing with any of the OCuLink componentry.
  8. It would be nice if Unraid functioned in a similar fashion to Hyper-V during array stops or system restarts. Hyper-V will automatically snapshot ("save") VMs in their current running state, shut down or restart, and then rehydrate the VMs when the system starts back up. Currently, stopping the array requires a guest agent to be installed in the VM and for the VM to shut itself down gracefully before the array will stop. I have a Windows VM that never seems to accept the signal and thus hangs, keeping the entire array from shutting down due to file locks. I do not want anything running inside the VM to dictate whether or not I can cleanly shut down the array. It would be nice if Unraid did this by default, or at least offered a configuration option for it. Something like the sketch below would cover my case.
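     Since Unraid's VMs are libvirt/QEMU domains underneath, the Hyper-V behavior can be approximated by hand today; a minimal sketch of a pre-stop script (assuming virsh is on the PATH, as it is on Unraid with VM support enabled):

        #!/bin/bash
        # Save every running VM's state to disk. libvirt restores a
        # saved domain automatically the next time it is started, which
        # is essentially Hyper-V's "save and rehydrate" behavior.
        for dom in $(virsh list --name); do
            virsh managedsave "$dom"
        done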
  9. I did not think to upload the diagnostics because I assumed they would not be that interesting, since the logs are completely full of the failed spindown. Here are the diagnostics: tower-diagnostics-20191226-2145.zip
  10. Upgraded my server (from 6.6.6) last week and ran into two issues:
      1) Lockups and reboots. After upgrading, the server would lock up and/or reboot randomly after a few hours to a day of running. This is the same behavior that happened when I tried to upgrade to 6.7.0 (I ended up going back down to 6.6.6 and was stable for 125 days straight). Nothing obvious appears in the logs. During one of the periods between reboots, I happened to have also installed the Disable Mitigation Settings plugin for some testing. As soon as I turned off all the mitigations, the problem went away. The system now seems to be running completely stable without issue. This is sitting on a private VLAN at home, so losing the security and gaining some performance is not that big a deal, but it seems very strange. I would love to know how the Intel microcode update is apparently breaking my hardware.
      2) One of my drives is no longer able to spin down. Drive 5 (HUH728080AL4200) is now always running and never spins down. I see the following in the logs:
         Dec 26 18:35:53 Tower kernel: mdcmd (200610): spindown 5
         Dec 26 18:35:53 Tower emhttpd: error: mdcmd, 2726: Input/output error (5): write
         Dec 26 18:35:53 Tower kernel: md: do_drive_cmd: disk5: ATA_OP e0 ioctl error: -5
      A little Internet sleuthing found that this is related to a spindown command the SATA/SAS controller does not support. The online comments say it can be safely ignored. What is strange is that this was never an issue on older versions of Unraid; on 6.6.6 the drive spun down fine without any error in the logs.
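      For anyone comparing notes on issue 1, the state the plugin toggles can be checked and set by hand; a sketch (the file path is Unraid's standard syslinux boot config, and mitigations=off is what "all off" amounts to):

         # Show which CPU vulnerability mitigations are currently active
         grep -H . /sys/devices/system/cpu/vulnerabilities/*

         # Disabling them all is a kernel boot parameter, added to the
         # append line in /boot/syslinux/syslinux.cfg on the flash drive:
         #   append mitigations=off initrd=/bzroot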
  11. I seem to have had similar issues as others in this thread. I upgraded my server from 6.6.6 to 6.7.2. My server was stable for a few days and then it would either lock up or reboot unexpectedly. Nothing showed in the logs. The last time, it was fine when I went to sleep but was not responding to ping in the morning. After a reboot it came up, ran for an hour, then rebooted on its own again. As soon as I returned home from work I downgraded back to 6.6.6 and it has been stable once again. Any idea how I can collect logs or details of the crash/hang/reboot? I am leery of updating again, but if I can capture useful information without risking my data, I will.
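      One approach I am considering for catching the crash itself: the kernel's netconsole module streams kernel messages to another machine over UDP, so a panic can be captured even when the local disk never sees it. A sketch, assuming eth0 and a listener at 192.168.1.10 (both placeholders):

         # On the failing box: mirror kernel messages to 192.168.1.10:6666
         modprobe netconsole netconsole=@/eth0,6666@192.168.1.10/

         # On the other machine: catch them with netcat
         nc -u -l 6666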
  12. I hooked it up to the on-board SATA controller and saw ATA errors:
      [ 369.829354] sd 4:0:0:0: [sdc] tag#26 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
      [ 369.829360] sd 4:0:0:0: [sdc] tag#26 CDB: opcode=0x88 88 00 00 00 00 00 00 00 64 00 00 00 06 00 00 00
      [ 369.829362] print_req_error: I/O error, dev sdc, sector 25600
      It still seems to work just fine on my Windows machine using a USB-SATA controller. I have given up trying to figure this puzzle out. I am ordering a new drive and will just use the problem child somewhere else. Thanks everyone for the ideas.
  13. Yeah, I understand that there will be additional work for the drives until I replace it. I took the drive out and used it with a USB-SATA controller on Windows and it worked just fine. That leads me to suspect a controller problem, despite the firmware update. I am just puzzled, because I have other 512e drives working just fine with the controller, and the drive is explicitly listed in the controller support document: https://docs.broadcom.com/docs/IT-SAS-Gen2.5CompatibilityList
      The two Hitachi drives that are working with the controller are:
      HDN728080ALE604 - DeskStar - 512e SATA 6Gb/s - Secure Erase (overwrite only)
      HUH728080AL4200 - UltraStar - 4Kn SAS 12Gb/s - Instant Secure Erase
      This one is not:
      HUH728080ALE600 - UltraStar - 512e SATA 6Gb/s - Instant Secure Erase
      The DeskStar has the exact same interface (512e SATA 6Gb/s) and the UltraStar uses an even more advanced interface and supports the same Instant Secure Erase. The only other idea I could come up with is that it has to do with the backplane expander, since this is a 12-port system and this is the first drive pushing it past the halfway count (drive number 7). Any ideas what I could try next? I am wracking my brains on this.
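      If anyone else hits this, one cheap check to rule out sector-size weirdness is to compare what the kernel and the drive itself report (a sketch; /dev/sdX stands in for the problem drive):

         # Logical and physical sector sizes as the kernel sees them
         blockdev --getss --getpbsz /dev/sdX

         # What the drive itself reports (model, firmware, sector sizes)
         smartctl -i /dev/sdX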
  14. Doh. That makes perfect sense. I am still not clear why I am getting read errors without any SMART issues at all. I will have to pull the drive and try it in some other systems.
  15. New update: The drive shows as Not Installed (missing) in the array. For some reason it is still receiving writes via shares: when I extracted files to a share, they ended up being written to the array. I assume this is due to some weirdness with the union file system. What I do not get is why I was able to go directly to /mnt/disk5 and read and write files without any errors or problems.
  16. After updating the firmware I no longer had any problems at startup, but my read speeds on the array plummeted to almost nothing. I tried removing the disk (since nothing of interest had been written yet) and rerunning preclear on it. As soon as the pre-read started, it was spitting out the same read errors. Still no issues showing in SMART, and I am baffled why it is only on reads, never writes.
  17. The disk is connected via the 12-port backplane that all the drives are connected to. The controller is a Dell PERC H200 flashed with IT firmware. The UPS is not plugged in at the moment; I had to disconnect it to reset the memory when replacing batteries and have not plugged it back in yet. I see a million read errors in dmesg, but not one in SMART (posted above - not sure why it shows on the direct page but not in the diagnostics). No write errors at all. The only difference I can think of between this and the other disks is that this is an AF 4Kn drive. I could try swapping this to a different bay in the server, though I doubt it will make a difference.
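      When dmesg and SMART disagree like this, the extended SMART views sometimes show what the attribute summary hides; a sketch of what I will try next (/dev/sdX is a placeholder):

         # Full SMART output, including SATA phy event counters that can
         # catch cabling/backplane problems the attribute table misses
         smartctl -x /dev/sdX

         # Kick off a long self-test, then check the result later
         smartctl -t long /dev/sdX
         smartctl -l selftest /dev/sdX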
  18. Diagnostics attached: tower-diagnostics-20190216-1915.zip
  19. So I just added a new drive to my array and I am getting weird errors. I precleared the drive (admittedly without pre- or post-read, since this was a drive I had used previously) and it ran overnight successfully. I added the drive to my array and it formatted and joined seemingly normally. But if I review the syslog I see:
      Feb 16 17:35:01 Tower kernel: sd 8:0:6:0: attempting task abort! scmd(00000000f8e68c0b)
      Feb 16 17:35:01 Tower kernel: sd 8:0:6:0: [sdj] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 61 c1 b0 00 00 00 08 00 00
      Feb 16 17:35:01 Tower kernel: scsi target8:0:6: handle(0x0010), sas_address(0x50015b20780b9435), phy(21)
      Feb 16 17:35:01 Tower kernel: scsi target8:0:6: enclosure logical id(0x50015b20780b943f), slot(3)
      Feb 16 17:35:02 Tower kernel: sd 8:0:6:0: task abort: SUCCESS scmd(00000000f8e68c0b)
      Feb 16 17:35:03 Tower kernel: sd 8:0:6:0: Power-on or device reset occurred
      If I review the individual disk log I see repeated errors. But if I review the SMART data, the drive comes back all green. The only thing that looks a bit off in the report is the number of read recovery attempts.
      HGST_HUH728080ALE600_2EGM5T4X-20190216-1739.txt
  20. The array was definitely in an unusual state. New Config worked well. The Plex docker was missing and I had to rejoin AD, but everything else looked fine. A few movies are corrupt and will need to be re-ripped, but nothing major. I have a new parity drive building now. So glad I pulled the trigger on an extra drive for Black Friday. Unfortunately, the drive that failed, while under warranty, is flagged as needing to be returned to the system vendor rather than Hitachi, and I have no idea who that is. Thanks everyone for the assistance.
  21. Correct. Apologies for any confusion; the tooltip for the yellow icon says emulated.
  22. Attached. (Please note I have removed the failed parity drive and the other drive that was reporting a SMART end-to-end error - the one I did a BTRFS restore from.) Steps to get here:
      1) Parity check slowed to a crawl at 97.3%; VM running fine, Dockers unresponsive.
      2) Web UI unresponsive; from SSH, an attempted reboot hangs.
      3) Power cycle into safe mode. System seems fine, web UI responsive, and no errors showing in the log. Array down.
      4) Decide to replace the 4TB drive showing the end-to-end SMART error. Pull drive, put new 8TB drive in. Start array and select rebuild from parity.
      5) Rebuild slows to a crawl. Read errors start to appear on the parity drive.
      6) Cancel rebuild, reboot, confirm parity drive is bad. (Freak out a bit.)
      7) Put 4TB drive back in; unable to mount due to FS corruption.
      8) Clean BTRFS format of the 8TB replacement disk (since the parity rebuild had started on it, a superblock and some filesystem were there, but no data).
      9) BTRFS restore almost all files from the 4TB to the 8TB.
      10) Remove 4TB drive.
      As it stands now, I have 3 original data drives and a new copy (not a clone) of the 4th data drive. No parity drive, but an array configuration expecting a parity rebuild of drive 3 (hence emulation) and a missing parity drive.
      tower-diagnostics-20171216-1221.zip
  23. I just discovered New Config. It looks like I should be able to just do New Config -> Retain Data and Cache slots, then apply. Will that work when the array thinks that disk 3 is emulated, even though disk 3 does have all the data it should? Or do I have to remove that drive and then add it back in separately? I assume once the array is good I would then be able to add a new parity drive and build parity.
  24. Yeah. The drive went south quickly, even without any SMART notification warning. So I managed to complete the BTRFS restore. Ten or twenty files had warnings about loops and may not have restored properly; the rest are hopefully good. I will need to double-check their contents once I have things up and running. Thankfully most are just Blu-ray rips that I can redo. Now that I have four drives with all my data, any pointers on recreating the array? Currently the new drive shows up as Disk 3 (drive contents emulated). I am guessing I will need to create a brand new array and slowly copy my data over drive by drive before importing the next drive.
  25. Attached new diagnostics, though I gave up on the prior status quo. The parity rebuild was going nowhere, just tons of errors on the parity drive. I ended up rebooting and leaving the array in a downed state. In the meantime, I reformatted the new drive (the one that had been precleared and added to the array to replace the SMART-error drive); since the parity rebuild had barely begun, its file system was entirely corrupt. I kicked off a btrfs restore from the drive I had replaced to the new drive, to preserve a copy of as much of my data as possible. Since the parity drive seemed to actually be the problem, I am hoping that despite some btrfs corruption (caused by the less-than-graceful initial shutdown?), the vast majority of my data will be recovered. Between the three good data drives and the new one I restored to, I should have all or almost all of my data, without any dependency on the f--g parity drive.
      tower-diagnostics-20171215-0012.zip
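      For anyone finding this later, the restore step is non-destructive to the source drive; the command was roughly this shape (a sketch; the device name and destination path are placeholders):

         # Copy whatever btrfs can still read from the corrupt source
         # partition to a healthy, already-formatted destination.
         # -v = verbose, -i = ignore errors and keep going
         btrfs restore -v -i /dev/sdX1 /mnt/disks/new8tb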