relink

Members
  • Posts: 235
Everything posted by relink

  1. @primeval_god thanks for the tip. I tried running the command and got the following output, and the swap is still not enabled: swapon: /mnt/cache-nvme/swap/swapfile: insecure permissions 0666, 0600 suggested. swapon: /mnt/cache-nvme/swap/swapfile: read swap header failed. After some googling I see people on other Linux distros running "mkswap", but they are all referring to a swap partition. I'm not sure if that's the correct fix for unraid. (A sketch of the commands is at the end of this list.)
  2. Hmmm, yeah, it looks like "/dev/sda" doesn't even exist in the container. I would need to use "shfs", which apparently doesn't work with this variable, as it expects a physical device...crap. Well, the other ones work, and if the writes become too much then I'll just set the containers to write to the cache and run the mover at a lower IO priority.
  3. This is what every example I could find said to use. It seemed a little off to me too, so I’m still looking into it, but haven’t found any different info yet.
  4. I was beginning to think the same thing. Specifically Sonarr, Radarr, Lidarr, and their accompanying apps, as they can all pull in large amounts of data, and there's nothing stopping the 3 of them from all trying to do so at the same time. So I applied the below solution to all of them and so far so good...I'm almost 24 hours stable. First off, I pinned them all to only 4 out of 12 threads. Next I added the below options to "Extra Parameters" on each container to throttle them back: --memory=2G --cpu-shares=100 --device-write-bps /dev/sda:10MB (a full docker run example is sketched at the end of this list). Only time will tell if the issue is solved.
  5. I'm having the same issue. Mine is set to "/mnt/cache-nvme/swap/" so I ran the below commands:
     - cd /mnt/cache-nvme/swap/
     - truncate -s 0 ./swapfile
     - chattr +C ./swapfile
     - btrfs property set ./swapfile compression none
     Everything looks fine, and I don't get any errors. I see "Swap file exists: ✔" but when I click "start" nothing happens and it still shows "Swap file in use: ✖"
  6. So, my server made it through the night and it's still running right now. I only have one error in my syslog this morning: blk_update_request: critical target error, dev sda, sector 76048 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0. I'm not sure how or what changed, but let me try to lay out the current status and my thoughts. Since yesterday morning the following dockers have been disabled:
     - Sabnzbd
     - Sonarr
     - Radarr
     - Lidarr
     - AMD
     Even with those disabled, that still left a lot running, such as Plex, Nextcloud, MariaDB, NPM, PiHole, etc. I am actually watching Plex while typing this and it's working fine. Yesterday I had a friend remotely play a 4K HDR movie and transcode it to 1080p with tone-mapping, all while 2 4K movies direct played in my house. While that was going on, I started a transfer of about 600GB to a non cache enabled share, and I also started a transfer from my unraid server of around another 600GB; this all ran fine. So I decided to double down and started a non-correcting parity check on top of all of that. The parity check started out kind of slow, and after a couple minutes made it up to about 120MB/s, while the writes to the non-cache share slowed proportionately. Once the parity check hit 120MB/s the writes essentially stopped (showed 0KB/s); when I stopped the parity check the transfers picked right back up. I did all that because the most prevalent error in my syslog pertained to "aacraid", which is my raid controller, so I stressed it to the best of my ability and I could not make it crash again. I did get one of those macvlan call traces yesterday morning, but it didn't seem to have caused any issues when it happened. I didn't really understand what the fix for that was either. I'm not sure what else to check now; I find it hard to believe that any of my containers were capable of causing hardware errors that I couldn't reproduce myself with the transfers I did. I've had the custom IPs for several years and had no problems, but I'm still willing to try the fix, I just don't know how.
  7. Well, to be fair, the writes didn't full-on crash, they are just terribly slow. The writes are running around 50KB/s, while the parity check is running around 120MB/s, so I'm assuming the parity check has a higher priority than the SMB transfer...this is all still running by the way.
  8. Ok, so I have several hundred GB going into a no-cache share, I have several hundred GB being read from the array, and I'm running a parity check. The writes have slowed to about 1MB/s or less, the reads are between 34-40MB/s, and the parity check is running at about 45MB/s. Also CPU usage is around 50% on average, and RAM is around 50%. Ok, things changed before I even posted this. The parity check is now running around 80MB/s and the writes have practically stalled.
  9. Just to spice things up, since just writing data doesn't seem to be enough, I decided to start a non-correcting parity check while it's still transferring. I'll probably start a large read transfer now too.
  10. No, parity checks run just fine, over 100MB/s the entire time. I'm also currently running a simple test: I'm transferring several hundred GB of data from one of my computers to a no-cache share on my unraid server...and so far I'm transferring at about 36MB/s with not a single error in my log and everything still working...I don't get it...
  11. What would I need to do to ensure that I will be able to restore my current config when I'm done? Also, I'm not sure anything would happen; the troubleshooting I have done so far seems to point to issues when there are a high number of writes going onto the array. I kept my server online all day yesterday until I ran the memtest, and it ran fine. What I did differently was disable things like Radarr and Sabnzbd, anything that would do heavy writes. In fact I'm about to run a test and see if I can make it crash by doing a large transfer from my desktop to unraid.
  12. Appdata, System, and Domains are all on an NVMe cache pool. Also, I have the Maxview utility installed as a container. I don't know if you're familiar with it, but it's a tool to monitor and manage Adaptec cards. The screenshot below is my HBA's status, which all looks good to me.
  13. @Vr2Io Alright, the memtest ran overnight and passed. MemTest86-Report-20210428-211336.html
  14. I read over that thread numerous times. It really only seems to pertain to 1st gen Ryzen, but I tried it anyway, and it's made no difference.
  15. So far so good. I'm going to let it run overnight and I'll check it again in the morning.
  16. Wow, I just read that entire thread. I really hope it's not a RAM or CPU issue; neither is exactly affordable right now. But I'll take your advice and run the test as soon as I get home from work in about an hour.
  17. Thanks, I was trying to figure out a quick way to do that but couldn't think of anything.
  18. Is the attached file any better? Unfortunately I can't, it just freezes when I try to get diagnostics. All_2021-4-28-12_39_59.html
  19. Sorry about that, just went with what my Syslog server gives me. New copy is attached. I'll also add it to the original post too. Logs.zip
  20. I have been having this particular issue for a couple days now. My server will become mostly unresponsive. Sometimes it recovers, and sometimes a hard reboot is the only option. I'll do my best to lay out everything I know, but I'm beginning to feel overwhelmed. What I mean by mostly unresponsive is that some of my docker containers will still work, just slowly (such as NGINX, Wallabag, PiHole, etc.), while others become unresponsive, most notably Plex. Sometimes the unraid UI becomes inaccessible. Sometimes it can be accessed but I can't get diagnostics, I can't start or stop any containers or the array, I can't reboot or shutdown. It's like the UI is visible, but I can't actually DO anything. When this happens I can usually still SSH in, but I can't control anything. I tried to reboot over SSH by running "shutdown -r now" and it came back with the usual "system is going down" message, but it never actually did anything. I do have a separate syslog server so I have been monitoring the errors, and the only thing that catches my attention are kernel errors pertaining to "aacraid", which is my Adaptec 71605 HBA. I did find a page discussing solutions for my card, but it's using Ubuntu, so I don't know how or if I even should apply that to unRAID. The page is here. But I'm not even sure if this is the issue, or just a coincidence.
     So far what I have done:
     - Ran Memtest (passed)
     - Checked SATA cables
     - Ran chkdsk on flash (passed)
     - Updated mobo to latest BIOS
     - Updated RAID card to latest BIOS
     - Tried RAID card in other PCIe slot
     - unraid is fully up to date
     - All plugins are fully up to date
     - Disabled write caching on HBA disks (this was mostly to protect me from all the hard shutdowns I'm doing)
     System Specs:
     - Ryzen 5 2600
     - ROG Strix B450-F Gaming
     - 32GB Corsair Dominator
     - Adaptec 71605 RAID Controller
     I attached:
     - Full syslog
     - Syslog showing only kernel errors
     - Diagnostics, although I'm unsure if it will really show much.
     EDIT: Attached a copy of the logs as .txt files in a .zip archive.
     All_2021-4-28-10_7_11.csv Kernel-Errors-Only_All_2021-4-28-10_8_4.csv serverus-diagnostics-20210428-1052.zip Logs.zip
  21. The BIOS on my controller was from 2015. I just finished updating to the latest version, which is from 2018. So fingers crossed that I don't have any more issues. If I do, I'll probably just open a new thread as this has gotten off topic from the original problem.
  22. After searching for some of the errors I came across this post here; the first part talks about changing the device timeout to 45 seconds. The problem is I don't know how to do that in unRAID. Running "ls /sys/block/" shows disks listed as both "mdX" and "sdX", plus I'm not even sure if a change like that would break unRAID, or persist over a reboot, etc. (a sketch of the sysfs change is at the end of this list). Now the second part of the post I can definitely check, and will when I get home. That is a BIOS update for the controller. Firmware Version 7.3.0 Build 30612 lists "Resolved an issue where I/O would slow and eventually result in a controller reset" as one of the fixes. My only concern is that this BIOS was released in 2013; I find it hard to believe my card would have a BIOS older than that, but I will check.
  23. I dug through and pulled all kernel errors from my syslog server. Seems related to my raid controller, but I have no idea why, still looking into it.
     Date Time Level Host Name Category Program Messages
     2021-04-27 07:27:13 Error SERVERUS kern kernel
     2021-04-27 04:01:56 Error SERVERUS kern kernel aacraid 0000:0a:00.0: Controller reset type is 3
     2021-04-27 04:01:56 Error SERVERUS kern kernel aacraid: Host bus reset request. SCSI hang ?
     2021-04-27 04:01:50 Error SERVERUS kern kernel aacraid: Outstanding commands on (2,1,6,0):
     2021-04-27 04:01:50 Error SERVERUS kern kernel aacraid: Host adapter abort request.
     2021-04-27 00:00:07 Error SERVERUS kern kernel blk_update_request: critical target error, dev sda, sector 76048 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
     2021-04-26 21:38:48 Error SERVERUS kern kernel Memory cgroup out of memory: Killed process 32031 (xteve) total-vm:2773040kB, anon-rss:2055388kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:4164kB oom_score_adj:0
     2021-04-26 10:15:53 Error SERVERUS kern kernel
     2021-04-26 06:41:16 Error SERVERUS kern kernel aacraid 0000:0a:00.0: Controller reset type is 3
     2021-04-26 06:41:16 Error SERVERUS kern kernel aacraid: Host bus reset request. SCSI hang ?
     2021-04-26 06:41:10 Error SERVERUS kern kernel aacraid: Outstanding commands on (11,1,6,0):
     2021-04-26 06:41:10 Error SERVERUS kern kernel aacraid: Host adapter abort request.
     2021-04-26 06:09:30 Error SERVERUS kern kernel sd 11:1:5:0: [sdd] tag#166 timing out command, waited 7s
     2021-04-26 06:08:33 Error SERVERUS kern kernel aacraid 0000:0a:00.0: Controller reset type is 3
     2021-04-26 06:08:33 Error SERVERUS kern kernel aacraid: Host bus reset request. SCSI hang ?
     2021-04-26 06:08:32 Error SERVERUS kern kernel aacraid: Outstanding commands on (11,1,5,0):
     2021-04-26 06:08:32 Error SERVERUS kern kernel aacraid: Host adapter abort request.
     2021-04-26 04:09:16 Error SERVERUS kern kernel sd 11:1:5:0: [sdd] tag#221 timing out command, waited 7s
     2021-04-26 04:08:19 Error SERVERUS kern kernel aacraid 0000:0a:00.0: Controller reset type is 3
     2021-04-26 04:08:19 Error SERVERUS kern kernel aacraid: Host bus reset request. SCSI hang ?
     2021-04-26 04:08:17 Error SERVERUS kern kernel aacraid: Outstanding commands on (11,1,5,0):
     2021-04-26 04:08:17 Error SERVERUS kern kernel aacraid: Host adapter abort request.
     2021-04-26 04:06:11 Error SERVERUS kern kernel aacraid 0000:0a:00.0: Controller reset type is 3
     2021-04-26 04:06:11 Error SERVERUS kern kernel aacraid: Host bus reset request. SCSI hang ?
     2021-04-26 04:06:06 Error SERVERUS kern kernel aacraid: Outstanding commands on (11,1,6,0):
     2021-04-26 04:06:06 Error SERVERUS kern kernel aacraid: Host adapter abort request.
     2021-04-26 00:00:03 Error SERVERUS kern kernel blk_update_request: critical target error, dev sda, sector 76048 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
     2021-04-25 23:01:41 Error SERVERUS kern kernel
     2021-04-25 22:51:30 Error SERVERUS kern kernel
     2021-04-25 22:39:03 Error SERVERUS kern kernel
     2021-04-25 22:28:09 Error SERVERUS kern kernel
     2021-04-25 20:43:54 Error SERVERUS kern kernel
     2021-04-25 20:18:51 Error SERVERUS kern kernel
     2021-04-25 19:40:40 Error SERVERUS kern kernel
  24. A little bit of an update. After digging in further I realized most of my containers other than Plex are actually not frozen. However, I still cannot download diagnostics, I cannot restart or stop any containers, I cannot stop the docker service, and I cannot stop the array...
  25. Ok, I think things are now getting worse. This is the 4th time where everything looks like it's still running, all my containers look fine, and the unraid UI is responsive, but in reality everything is kind of in the unresponsive state. I can't seem to pull diagnostics either; every time I try, it gets stuck at this exact spot:
     Downloading... /boot/logs/serverus-diagnostics-20210427-0632.zip
     smartctl -x '/dev/sdg' 2>/dev/null|todos >'/serverus-diagnostics-20210427-0632/smart/ST1000VN002-2EY102_Z9C5A3AJ-20210427-0632 disk2 (sdg).txt'
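
A note on the swapon errors in post 1 (and the swap-file setup in post 5): mkswap works on a file just as it does on a partition, so the remaining steps are to tighten the permissions, make sure the file actually has space allocated, and write the swap signature. A minimal sketch, assuming the plugin's file path and an example size of 8G (adjust to taste):

     chmod 600 /mnt/cache-nvme/swap/swapfile         # fixes the "insecure permissions 0666" warning
     fallocate -l 8G /mnt/cache-nvme/swap/swapfile   # 8G is only an example size
     mkswap /mnt/cache-nvme/swap/swapfile            # writes the signature that "read swap header failed" says is missing
     swapon /mnt/cache-nvme/swap/swapfile
     swapon --show                                   # confirm the file is now in use

The chattr +C and compression-off steps from post 5 still matter on btrfs; they just have to be done while the file is empty, which is what the truncate -s 0 accomplishes.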
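For reference on the throttling in post 4, here are the same limits expressed as a single docker run command; the image name and CPU range are placeholders, and on Unraid the pinning is normally done from the CPU Pinning settings while the remaining flags go in the container's "Extra Parameters":

     docker run -d --name=sonarr \
       --cpuset-cpus="0-3" \                 # pin to 4 of the 12 threads (placeholder range)
       --memory=2G \                         # hard memory cap for the container
       --cpu-shares=100 \                    # low CPU weight relative to the default of 1024
       --device-write-bps /dev/sda:10mb \    # cap writes to the array device at roughly 10MB/s
       linuxserver/sonarr                    # placeholder image name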
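On the device timeout question in post 22: the sdX entries under /sys/block are the raw disks and accept the change, while the mdX entries are Unraid's array devices and have no such knob. A sketch, assuming the 45-second value from the linked post; the setting is lost on reboot, so persisting it would mean adding the same loop to the flash drive's go file (/boot/config/go):

     for t in /sys/block/sd*/device/timeout; do
         echo 45 > "$t"     # raise the SCSI command timeout to 45 seconds
     done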