howiser1

howiser1 started following Error: Call Traces found on your server, Going crazy trying to figure out GPU Passthrough stability issues, Can't stop docker container Nextcloud or reboot the whole system and 3 others December 18, 2023

Going crazy trying to figure out GPU Passthrough stability issues

howiser1 replied to dcoulson's topic in VM Engine (KVM)

**Possible Solution**

I had this exact same issue and it was driving me nuts too. This thread is a little old and there isn't much on it, so I figured I'd share my experience and solution. Your mileage may vary.

I was building a gaming VM, passing through a GTX 1080 Ti GPU, and could play for around 10-20 minutes before the VM would just “crash” and Unraid would fill the logs with this error:

vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
...

Literally thousands of those until I force-stopped the VM. After that I could no longer restart the VM: I would get a 127 error, not able to access the GPU device (see below for my assumption as to why). My GPU was on 0000:01:00.0 as well. Others might be in the #2 slot and so forth, and your addressing will differ accordingly. I didn’t have the CoreFreq plugin as suggested above, so that’s ruled out.

So to cut right to it: after hours and yes, DAYS of ruling out configuration settings, trying vBIOS, motherboard BIOS settings, VM graphics drivers, etc., IT WAS THE DAMN PSU (power supply). 🤬 I only came across this once, on another thread, with a somewhat similar but not identical setup and experience.

In my particular case, I added the GPU above (used, from eBay), and since my case is more of a server-grade case, there are no 6- or 8-pin connectors for GPUs. Yes, I could “adapt” to those, but I only have two main power lines: one powering my disk array and the other powering the SATA drive enclosure. I didn’t want to steal power from “Unraid”, so I bought a secondary 400 W power supply (flat/modular) and wired it up to power on with the primary PSU. (You can google how to do all this; there are “multiple power supply adapters”. DON’T just jump it with a paper clip… 🤦‍♂️)

After swapping the GPU into my desktop computer, I could play for hours without any issues, no crashes, etc. (using the primary 500 W PSU, which has two main cables with 6/8-pin plugs). This ruled out any GPU card issues… it’s fine. 😁 I then pulled the secondary PSU from the Unraid server and used it to “replicate” my setup. BINGO: 10-20 minutes in, the game CRASHED! I looked over at the GPU, which now had blinking LEDs on the power ports… 🤔 Looking at the secondary PSU, the only tell was that the fan was not spinning. Yep, it “died”. 🤦‍♂️

So I’m guessing it has some thermal/overheat protection, or it can’t supply 250 W long term, which is what my GPU needs at 95-100% usage. So don’t just buy a crappy PSU for GPUs. You need a PSU that can drive 300 W+ for hours JUST for the GPU. That would mean a system PSU of 600 W+, just to be clear. Again, I’m only driving the GPU, so 400 W should have been enough, but that PSU just can’t handle the load… It’s "junk" and getting returned.

This is also why I had to power cycle / reboot Unraid every time this would crash: since there was no power to the GPU, Unraid could not reset the board (GPU). I would also sometimes, not always, get these errors (clues) in the logs:

vfio-pci 0000:01:00.1: Unable to change power state from D0 to D3hot, device inaccessible
vfio-pci 0000:01:00.1: Unable to change power state from D3cold to D0, device inaccessible
vfio-pci 0000:01:00.1: Unable to change power state from D3cold to D0, device inaccessible

Your experience could be slightly different. If you’re running everything off a primary PSU, the voltage might just dip momentarily, because the 250 W (or more; this is very GPU-specific) that the GPU needs continuously is too much of a long-term demand. The dip can be just enough that the GPU “crashes”, at which point the motherboard / OS tries to reset the card, producing the logging above. Once the card is reset, some people are then able to restart their VMs after the GPU crashed. Unless the primary PSU also died, you’d never know that the voltage dropped for a moment just enough for the GPU to crash, as Unraid and the rest of your system would be “fine” (but could crash too).

For me the secondary PSU was off/dead, so there was no way to reset and repower the GPU other than to reboot. Thinking about it now, I probably could have just pulled the power plug on the secondary PSU and plugged it back in. But whatever, a full reboot was necessary for me, which “reset” the secondary PSU so it powered back up. (And I only rebooted like 100+ times trying to figure all this out, for DAYS.) AHHHH 🤬

Overall, you get the idea… this post is now long enough. I would very, very strongly recommend you rule out the PSU if you get these or similar errors, especially if the VM/GPU is fine during normal operation and only crashes, with these errors showing up, when it’s under load. Good luck and hope this helps someone else.
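For anyone wanting to check their own syslog for the two vfio-pci signatures quoted in this post, here is a minimal Python sketch. It only counts the two message shapes shown above, per PCI address; the function names and the example path are my own assumptions and not part of Unraid, and a high count only suggests the same symptom, not the same cause.

```python
import re
from collections import Counter

# Patterns copied from the two log messages quoted in the post.
PATTERNS = {
    "bar_restore": re.compile(
        r"vfio-pci (\S+): vfio_bar_restore: reset recovery - restoring BARs"
    ),
    "power_state": re.compile(
        r"vfio-pci (\S+): Unable to change power state from \S+ to \S+, "
        r"device inaccessible"
    ),
}


def count_vfio_errors(lines):
    """Return a Counter keyed by (kind, pci_address) for matching lines."""
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, match.group(1))] += 1
    return counts


def summarize_file(path):
    """Print one line per (address, kind) pair found in the file at `path`.

    The path is supplied by the caller, e.g. "/var/log/syslog" (assumed
    location; adjust to wherever your syslog or its mirror is kept).
    """
    with open(path, errors="replace") as handle:
        for (kind, addr), total in sorted(count_vfio_errors(handle).items()):
            print(f"{addr} {kind}: {total}")
```

Calling `summarize_file("/var/log/syslog")` after a crash would show whether the messages pile up on the GPU function (`.0`), its audio function (`.1`), or both.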

Can't stop docker container Nextcloud or reboot the whole system

howiser1 replied to SuberSeb's topic in General Support

I too am now having the same issues as above... no point in duplicating. I had no issues on NC 17 or 18. It was only when I went to NC 20.0.4 that this started to occur... So is it NC? As mentioned above, I recently added (on NC 20) a binding/redirect from the docker /tmp path back to my cache disk. After it just locked up again, I'm going to remove that... and see what happens. What has been consistent in the "crash" is accessing NC via the iPhone iOS app. For whatever reason that seems to be the "thing" that locks up the docker. Not always, but at least once a day or so when I try to access the app on my iPhone, it will do a server request and then just die. Curious if others see that too. Could it be the iOS app? Other access with the NC client on Windows 10 hasn't locked anything up yet... Really hoping someone figures this out!!!

[Support] Linuxserver.io - Nextcloud

howiser1 replied to linuxserver.io's topic in Docker Containers

I'm having a very similar issue - how did you solve it? I never saw a follow-up or response to your post.

Docker container hangs

howiser1 replied to deanosim's topic in Docker Engine

I'm currently having this same issue with the Nextcloud docker after upgrading to NC 20.0.4. I wonder if it's Nextcloud... Didn't have any issues on v17. I've also tried starting with a fresh copy of the Nextcloud docker along with a new MariaDB - didn't seem to help. Curious if you ever solved this?

Out of Memory - First time.

howiser1 posted a topic in General Support

blackbox-diagnostics-20201211-2035.zip

Hi, Just started getting this error, which is rather odd, I have 20GB of RAM. However I do get random freezes every few weeks, maybe this is the cause? I mirror my syslog file, so it's quite large, to catch these kinds of issues. Server was locked up/frozen when I tried to access it today. Full Diag Attached. First Instance Scroll to:

Dec 10 22:42:43 Blackbox kernel: Out of memory: Kill process 10083 (smbd-notifyd) score 680 or sacrifice child

Thanks for any and all help!
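With a large mirrored syslog, it can help to pull out every OOM-killer line instead of scrolling for the first one. The sketch below is a hedged example that matches only the message format quoted in this post; other kernel versions may word the line differently, and the `OomKill` and `find_oom_kills` names are made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple


class OomKill(NamedTuple):
    pid: int
    process: str
    score: int
    line: str


# Matches the format quoted in the post:
# "Out of memory: Kill process 10083 (smbd-notifyd) score 680 or sacrifice child"
OOM_PATTERN = re.compile(r"Out of memory: Kill process (\d+) \((.+?)\) score (\d+)")


def find_oom_kills(lines: Iterable[str]) -> List[OomKill]:
    """Return one OomKill record per matching syslog line, in file order."""
    kills = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            pid, process, score = match.groups()
            kills.append(OomKill(int(pid), process, int(score), line.rstrip("\n")))
    return kills
```

Grouping the results by `process` would show whether the same process is killed each time or whether the victims vary.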

December 12, 2020

Unraid - locked up/hangs/frozen/unresponsive

howiser1 replied to howiser1's topic in General Support

Thanks @johnnie.black for getting back to me quickly. Well, that is unfortunate, but it probably makes the most sense. Based on my current setup I really can't run with the system down (no VMs/dockers) for weeks on end until this occurs again. Most times it takes weeks and/or months for it to crop up again. Point being, it's pretty infrequent, which makes this really hard to troubleshoot, and with no log errors or crash dumps it's probably impossible to pinpoint the piece of hardware. Actually, just about the time I forget the last time it locked up, it does it again. I think I might try just a weekly reboot... and see if that helps. Other than a memory or CPU stress test, if there are any other diag options let me know. Thanks again.

Unraid - locked up/hangs/frozen/unresponsive

howiser1 posted a topic in General Support

Hi all, I've been chasing this issue for some time now, even through several versions of the OS. This is not something new on 6.8.3 (current OS). Unraid will randomly just lock up, freeze, become unresponsive. I've tried using a COM/serial port so I can monitor / access the system, and well, that's locked up too. No console/keyboard access nor http either. Finally, with the mirror option I was able to get a full syslog when this occurred. Unfortunately, the log doesn't show any issue when the lockup occurred. Full Diags are attached. Lockup occurred some time after Aug 7, 1:10:56 am. Reboot was at 8:33:34 this morning. Nothing logged in between. Really hope someone has some insight.

Aug 6 10:11:52 Blackbox kernel: mdcmd (546): spindown 2
Aug 6 10:12:22 Blackbox kernel: mdcmd (547): spindown 0
Aug 6 10:12:22 Blackbox kernel: mdcmd (548): spindown 1
Aug 6 12:25:51 Blackbox kernel: mdcmd (549): spindown 3
Aug 6 12:28:29 Blackbox kernel: mdcmd (550): spindown 4
Aug 6 20:37:10 Blackbox kernel: mdcmd (551): spindown 0
Aug 7 00:00:01 Blackbox Plugin Auto Update: Checking for available plugin updates
Aug 7 00:00:06 Blackbox Plugin Auto Update: Community Applications Plugin Auto Update finished
Aug 7 01:05:56 Blackbox kernel: mdcmd (552): set md_write_method 1
Aug 7 01:05:56 Blackbox kernel:
Aug 7 01:10:56 Blackbox kernel: mdcmd (553): set md_write_method 0
Aug 7 01:10:56 Blackbox kernel:
Aug 7 08:33:34 Blackbox kernel: microcode: microcode updated early to revision 0x21, date = 2019-02-13
Aug 7 08:33:34 Blackbox kernel: Linux version 4.19.107-Unraid (root@Develop) (gcc version 9.2.0 (GCC)) #1 SMP Thu Mar 5 13:55:57 PST 2020
Aug 7 08:33:34 Blackbox kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot console=ttyS0,115200 pcie_acs_override=downstream,multifunction console=tty0
Aug 7 08:33:34 Blackbox kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Aug 7 08:33:34 Blackbox kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Aug 7 08:33:34 Blackbox kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Aug 7 08:33:34 Blackbox kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Aug 7 08:33:34 Blackbox kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Aug 7 08:33:34 Blackbox kernel: BIOS-provided physical RAM map:

blackbox-diagnostics-20200807-0853.zip
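Since the only evidence of the lockup is the silence before the reboot, one way to locate it in a long mirrored syslog is to find each boot and report the last timestamp logged before it. The sketch below is a rough heuristic under stated assumptions: it treats a line containing "kernel: Linux version" as the start of a boot (as in the excerpt above), and the year has to be supplied by the caller because these syslog lines do not carry one. The function names are illustrative only.

```python
import re
from datetime import datetime

# "Aug 7 01:10:56 ..." or the double-spaced "Aug  7 01:10:56 ..." form.
STAMP = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})")
BOOT_MARKER = "kernel: Linux version"  # assumed marker for the start of a boot


def parse_stamp(line, year):
    """Parse the leading syslog timestamp, or return None if there is none."""
    match = STAMP.match(line)
    if not match:
        return None
    month, day, clock = match.groups()
    try:
        return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def silent_windows_before_boot(lines, year):
    """Return (last_logged, boot_time) pairs, one per boot marker found.

    last_logged is the most recent timestamp that differs from the boot
    timestamp, so other lines logged in the same second as the boot are
    skipped over.
    """
    windows = []
    previous = current = None
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue
        if current is None:
            current = stamp
        elif stamp != current:
            previous, current = current, stamp
        if BOOT_MARKER in line and previous is not None:
            windows.append((previous, stamp))
    return windows
```

On the excerpt above with `year=2020`, this reports a window from 01:10:56 to 08:33:34 on Aug 7, which only bounds when the lockup happened and says nothing about its cause. Note that a plain "largest gap" search would be misleading here, since the quiet stretch between two spindown entries on Aug 6 is longer than the lockup window.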

Disk 1 Read Errors - Help

howiser1 replied to howiser1's topic in General Support

Thanks -- it's a breakout cable from an LSI HBA controller, so not as simple as a single SATA cable replacement. Not sure what you mean by both cables? I'll re-seat all cables and then rebuild. If the issue comes back, I'll replace that entire breakout cable. Thanks again for your prompt review!

Disk 1 Read Errors - Help

howiser1 replied to howiser1's topic in General Support

Full Diag attached. Also the extended SMART passed with no issues. diagnostics-20200616-0722.zip

Disk 1 Read Errors - Help

howiser1 posted a topic in General Support

Disk 1 logged disk read errors and is now disabled. Here's the snippet from the logs. If you need a full Diag I'm happy to post. I did a short SMART test and it came back "Completed without error". Going to kick off an extended SMART test overnight. No SMART errors or issues prior to this.

Jun 15 07:41:57 Tower kernel: mdcmd (99): spindown 2
Jun 15 07:41:59 Tower kernel: mdcmd (100): spindown 5
Jun 15 07:42:04 Tower kernel: mdcmd (101): spindown 0
Jun 15 07:42:05 Tower kernel: mdcmd (102): spindown 6
Jun 15 10:48:30 Tower kernel: mdcmd (103): spindown 2
Jun 15 10:50:04 Tower kernel: mdcmd (104): spindown 1
Jun 15 10:50:05 Tower kernel: mdcmd (105): spindown 4
Jun 15 10:50:17 Tower kernel: mdcmd (106): spindown 0
Jun 15 10:50:18 Tower kernel: mdcmd (107): spindown 3
Jun 15 15:34:25 Tower kernel: mdcmd (108): spindown 3
Jun 15 20:06:33 Tower kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Jun 15 20:06:33 Tower kernel: sd 1:0:0:0: [sde] tag#9878 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jun 15 20:06:33 Tower kernel: sd 1:0:0:0: [sde] tag#9878 Sense Key : 0x2 [current]
Jun 15 20:06:33 Tower kernel: sd 1:0:0:0: [sde] tag#9878 ASC=0x4 ASCQ=0x0
Jun 15 20:06:33 Tower kernel: sd 1:0:0:0: [sde] tag#9878 CDB: opcode=0x88 88 00 00 00 00 01 1e 13 cf 50 00 00 00 40 00 00
Jun 15 20:06:33 Tower kernel: print_req_error: I/O error, dev sde, sector 4799582032
Jun 15 20:06:33 Tower kernel: md: disk1 read error, sector=4799581968
Jun 15 20:06:33 Tower kernel: md: disk1 read error, sector=4799581976
Jun 15 20:06:33 Tower kernel: md: disk1 read error, sector=4799581984
Jun 15 20:06:33 Tower kernel: md: disk1 read error, sector=4799581992
Jun 15 20:06:33 Tower kernel: md: disk1 read error, sector=4799582000
Jun 15 20:06:33 Tower kernel: md: disk1 read error, sector=4799582008
Jun 15 20:06:33 Tower kernel: md: disk1 read error, sector=4799582016
Jun 15 20:06:33 Tower kernel: md: disk1 read error, sector=4799582024
Jun 15 20:06:33 Tower kernel: sd 1:0:0:0: [sde] tag#9120 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jun 15 20:06:33 Tower kernel: sd 1:0:0:0: [sde] tag#9120 Sense Key : 0x2 [current]
Jun 15 20:06:33 Tower kernel: sd 1:0:0:0: [sde] tag#9120 ASC=0x4 ASCQ=0x0
Jun 15 20:06:33 Tower kernel: sd 1:0:0:0: [sde] tag#9120 CDB: opcode=0x8a 8a 00 00 00 00 01 1e 13 cf 50 00 00 00 40 00 00
Jun 15 20:06:33 Tower kernel: print_req_error: I/O error, dev sde, sector 4799582032
Jun 15 20:06:33 Tower kernel: md: disk1 write error, sector=4799581968
Jun 15 20:06:33 Tower kernel: md: disk1 write error, sector=4799581976
Jun 15 20:06:33 Tower kernel: md: disk1 write error, sector=4799581984
Jun 15 20:06:33 Tower kernel: md: disk1 write error, sector=4799581992
Jun 15 20:06:33 Tower kernel: md: disk1 write error, sector=4799582000
Jun 15 20:06:33 Tower kernel: md: disk1 write error, sector=4799582008
Jun 15 20:06:33 Tower kernel: md: disk1 write error, sector=4799582016
Jun 15 20:06:33 Tower kernel: md: disk1 write error, sector=4799582024
Jun 15 20:07:01 Tower sSMTP[21923]: Creating SSL connection to host
Jun 15 20:07:01 Tower sSMTP[21923]: SSL connection using TLS_AES_256_GCM_SHA384
Jun 15 20:07:04 Tower sSMTP[21923]: Sent mail for XXXXXXXXXXXXXXXXX (221 2.0.0 closing connection p44sm14351530qta.12 - gsmtp) uid=0 username=root outbytes=812
Jun 15 20:09:27 Tower kernel: sd 1:0:0:0: Power-on or device reset occurred
Jun 15 21:09:27 Tower kernel: mdcmd (109): set md_write_method 1
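The two CDB lines in this snippet can be decoded by hand, which makes it easier to see that the failed read and the failed write hit the same 64-sector block. The sketch below assumes the standard 16-byte SCSI layout for READ(16) (opcode 0x88) and WRITE(16) (opcode 0x8a): byte 0 is the opcode, bytes 2-9 the logical block address, and bytes 10-13 the transfer length, all big-endian. The function name is illustrative.

```python
# Opcode names for the two commands that appear in the log snippet.
OPCODES = {0x88: "READ(16)", 0x8A: "WRITE(16)"}


def decode_cdb16(cdb_hex):
    """Decode a 16-byte READ(16)/WRITE(16) CDB given as spaced hex bytes.

    Returns (command_name, logical_block_address, transfer_length_in_blocks).
    """
    raw = bytes.fromhex(cdb_hex)  # fromhex skips the spaces between bytes
    if len(raw) != 16:
        raise ValueError(f"expected 16 CDB bytes, got {len(raw)}")
    name = OPCODES.get(raw[0], f"opcode 0x{raw[0]:02x}")
    lba = int.from_bytes(raw[2:10], "big")       # bytes 2-9: 64-bit LBA
    length = int.from_bytes(raw[10:14], "big")   # bytes 10-13: block count
    return name, lba, length


print(decode_cdb16("88 00 00 00 00 01 1e 13 cf 50 00 00 00 40 00 00"))
# prints ('READ(16)', 4799582032, 64)
```

The decoded LBA matches the `sector 4799582032` in the `print_req_error` line, and the 64-block length matches the eight `md: disk1` errors spaced 8 sectors apart. The md sector numbers start exactly 64 below the device sector, which would be consistent with the array partition starting 64 sectors into the disk, though that offset is my assumption and is not stated in the log. Sense key 0x2 with ASC 0x4 / ASCQ 0x0 is the generic "not ready" response, so the decode fits a drive that dropped off the bus (note the later "Power-on or device reset occurred") rather than a single bad sector, but the full diagnostics are needed to confirm that.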

Failed Drive - check Logs Please

howiser1 replied to howiser1's topic in General Support

Thanks @johnnie.black. Appreciate the explanation of Raw_Read_Error_Rate. It is frustrating that unRAID hasn't come up with a better way to preserve the syslogs... like also writing to a cache drive. Anything you've come across to do this other than a syslog server? (Which I'm thinking about.) As I mentioned, I copied the emulated data, just in case, to other drives on the array. I've added the failed disk back to the array and it is currently rebuilding. Thanks again for your prompt response.
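One low-tech alternative to a syslog server is to copy whatever has been appended to the in-memory syslog onto persistent storage at a regular interval. The sketch below is only an illustration of that idea, not an Unraid feature: the function name is made up, and the source and destination paths are left to the caller.

```python
import os


def append_new_bytes(src, dst, offset=0):
    """Append everything in `src` past `offset` to `dst`; return the new offset.

    If `src` is now shorter than `offset` (rotated or truncated), copying
    restarts from the beginning of the file.
    """
    if os.path.getsize(src) < offset:
        offset = 0
    with open(src, "rb") as reader, open(dst, "ab") as writer:
        reader.seek(offset)
        chunk = reader.read()
        writer.write(chunk)
        writer.flush()
        os.fsync(writer.fileno())  # force to disk so a hard lockup loses less
    return offset + len(chunk)
```

Called in a loop, for example once a minute with the previous return value passed back in as `offset`, this keeps a copy that survives a reboot. Anything logged between the last copy and a crash is still lost, so it narrows the gap but does not close it.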

Failed Drive - check Logs Please

howiser1 posted a topic in General Support

Hi All, Well, I joined the club of a failed drive this week. I've read numerous posts on how to proceed, so I'll post my server logs / diag and ask if someone with more experience could just review and provide any additional advice.

The failed drive is #3, ST4000DM004-2CV104_ZFN00H90 - 4 TB (sdf). Yes, it's a Seagate... I've read plenty of bad stuff about these already, BUT they have been working fine. Note the Raw_Read_Error_Rate is pretty high in the SMART logs... However, there doesn't seem to be a definitive answer on whether this value is "bad" or suggests pre-fail. SMART passes, and others say you can just ignore this??? It is interesting that all my other disks are ZERO for this value.

I've been using UNRAID for about a year now and it's been working great in this configuration. Here's what I think happened: we had a power outage. I do have the server on a UPS, but maybe it didn't shut down in time and there was a write error to Drive 3??? Thus, failed. This occurred right after the power outage. However, I don't have the old syslog to determine exactly what happened. I looked through this syslog and can't find any clues. (I've since adjusted the shutdown timers so there will be more time to shut down.) I don't think the drive was just going along and failed... nor is it a bad cable, controller, etc. Really thinking it was the power outage. Just looking to confirm the logs don't point to something else.

For now I've copied the data from the emulated drive. Then I removed the disk and put it back in service... it's rebuilding; interested to see if it fails during this process or if any of the SMART "bad" values go up... 12+ hours to rebuild.

blackbox-diagnostics-20180914-1733.zip

Error: Call Traces found on your server

howiser1 replied to Splatman2's topic in General Support

I just got my first Call Trace error. Diag attached. Looks NIC related based on the previous posts. Please confirm and let me know if there is anything I can do. diagnostics-20180523-2209.zip
