DrJake

Community Answers

  1. Hi all, I have a Gigabyte GeForce RTX 3060 passed through to a Win10 VM. It runs fine, with no known performance issues. However, since I installed the card, I've been unable to update the driver using the accompanying GeForce Experience software. I've attached a few screenshots below. It sees that a new update is available (GPU1.png) and downloads it (GPU2.png), but nothing happens after the download is complete. No error message, no install, nothing after restart, just nothing. So after the entire exercise it goes back to looking like GPU1.png. Are there any specific steps I need to follow to update the GPU driver?
  2. Hello, thought I'd check in again. My Unraid server is still rebooting every 3-4 weeks. The last reboot was on 8 September, and before that on 12 August and 29 July. I've been running a syslog server and recorded the last 3 resets (logs attached). The log entries below appear each time there was a reboot; I'm not sure if they happened before or after the reboot. The GPU for the VM seems quite stable these days, it even survived stress testing without crashing. Could someone take a look and point me in the right direction? I'm running out of things to try here.
Jul 29 21:35:20 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Jul 29 21:35:20 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Jul 29 21:35:24 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Jul 29 21:35:24 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Jul 29 21:37:55 Tower webGUI: Successful login user root from 192.168.11.7
Jul 29 21:38:01 Tower kernel: clocksource: timekeeping watchdog on CPU7: Marking clocksource 'tsc' as unstable because the skew is too large:
Jul 29 21:38:01 Tower kernel: clocksource: 'hpet' wd_now: 71b477c4 wd_last: 7100a93b mask: ffffffff
Jul 29 21:38:01 Tower kernel: clocksource: 'tsc' cs_now: ecc745390ca5c cs_last: ecc73ce4daf64 mask: ffffffffffffffff
Jul 29 21:38:01 Tower kernel: tsc: Marking TSC unstable due to clocksource watchdog
Jul 29 21:38:01 Tower kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Jul 29 21:38:01 Tower kernel: sched_clock: Marking unstable (1159212382806303, -10482851288)<-(1159202045772327, -145824138)
Jul 29 21:38:02 Tower kernel: clocksource: Switched to clocksource hpet
Aug 12 00:05:06 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Aug 12 00:05:06 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Aug 12 00:05:09 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Aug 12 00:05:10 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Aug 12 00:07:42 Tower kernel: clocksource: timekeeping watchdog on CPU15: Marking clocksource 'tsc' as unstable because the skew is too large:
Aug 12 00:07:42 Tower kernel: clocksource: 'hpet' wd_now: 31b6948a wd_last: 313252fd mask: ffffffff
Aug 12 00:07:42 Tower kernel: clocksource: 'tsc' cs_now: e710a7ac3d094 cs_last: e710a2445a3b4 mask: ffffffffffffffff
Aug 12 00:07:42 Tower kernel: tsc: Marking TSC unstable due to clocksource watchdog
Aug 12 00:07:42 Tower kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Aug 12 00:07:42 Tower kernel: sched_clock: Marking unstable (1131240282539958, -10214497034)<-(1131230211029576, -142991887)
Aug 12 00:07:43 Tower kernel: clocksource: Switched to clocksource hpet
Sep 8 17:35:55 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Sep 8 17:35:56 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Sep 8 17:35:59 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Sep 8 17:35:59 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Sep 8 17:38:34 Tower kernel: clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
Sep 8 17:38:34 Tower kernel: clocksource: 'hpet' wd_now: f5ae5efc wd_last: f4fdec69 mask: ffffffff
Sep 8 17:38:34 Tower kernel: clocksource: 'tsc' cs_now: 1e2215fac635f8 cs_last: 1e221579001198 mask: ffffffffffffffff
Sep 8 17:38:34 Tower kernel: tsc: Marking TSC unstable due to clocksource watchdog
Sep 8 17:38:34 Tower kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Sep 8 17:38:34 Tower kernel: sched_clock: Marking unstable (2360442394677355, -18500752721)<-(2360424039483928, -145566028)
Sep 8 17:38:34 Tower kernel: clocksource: Switched to clocksource hpet
20220908 unraid crash 1738.log 20220729 unraid crash 2130.log 20220812 double crash 0.00am and 9.40am.log
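For reference, the kernel's own hint in those logs is "Use 'tsc=unstable'". That's not a fix for the root cause, but on Unraid the kernel boot parameters live on the append line of /boot/syslinux/syslinux.cfg (editable under Main > Flash in the webGUI). A sketch of what the stock boot entry would look like with it added, assuming a default, unmodified config:

```
label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot tsc=unstable
```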
  3. Hi again, thought I'd give an update and close this thread. Thanks again ChatNoir, I think dialing back the RAM speed helped a lot and resolved that instability issue. Since then, the Unraid server crash problem feels more like 2 separate issues compounded together. It was very telling when the GPU crashed while the VM and the Unraid server remained operational. I found this thread, https://www.reddit.com/r/Amd/comments/pf4ebr/figured_out_a_fix_for_the_rx_580_intermittent/, and it sounds like the RX580 OC crashing is a common problem, coincidentally also related to the OC clock speed (similar to the RAM speed issue). So since my last post a couple of days ago, I went ahead and rebooted the server. I followed the instructions in that thread and installed Adrenalin (the install required many reboots: the RX580 kept crashing, and the install wouldn't progress unless it detected a valid GPU... so I had to reboot the server quite a few times). Anyhow, it's now installed, and I used it to manually tune the GPU clock down to the base speed of 1257MHz, ran a stress test, and nothing crashed, phew. So that's where I am now. Let's see how long the server can run this time without issues (I'm quite hopeful, actually).
  4. Hi again ChatNoir, since adjusting the RAM speed, my server uptime is ~29 days 14 hours now. ~2 days ago, the GPU passed through to a VM stopped working. I kinda suspected that my system crashes were related to the GPU (RX580) somehow, but I cannot confirm it. I've attached the syslog here; I don't think there's anything in it about the GPU. So the current situation is: the VM (with the GPU passed through) is still running, and I can RDC into it and everything works, it's just that the screen attached to the Unraid server is not showing anything. But Win10 is reporting issues (see screenshot). In the past, I have used SpaceInvaderOne's script to reset the GPU, and more recently I tend to just shut down the server and restart it. I have left the server running for 2 days now since the GPU "crash", and it appears to be stable. Regarding the GPU, I followed SpaceInvaderOne's video on dumping the vbios and passing it through. I don't think there's much of an issue there; plus, if I were to restart the server, the GPU would be working again. So my question is: what should I do to further diagnose the problem here? ~3-4 weeks seems to be the magical timeframe when the GPU experiences issues, and that's also the timeframe in which past Unraid reboots have happened. Should I keep the server running for another month and see if it crashes? unraid syslog 20220715.log
  5. Thank you ChatNoir, there's a lot of information to digest. I've already acted on the Fix Common Problems issues as suggested. Re: memory speed, I'm running an AMD Ryzen 7 3700X 8-Core @ 3600 MHz and 4 sticks of Corsair Vengeance LPX (4x16GB) DDR4 3200MHz C16 desktop memory. I'm pretty sure I set the memory speed at 3200MHz. According to your linked thread, for a 3rd gen processor running a dual 4/4 config the speed should be dialed back to 2667MHz? Is that correct? I will check on the C-states once I get back.
  6. Hi all, for reasons unknown, my Unraid server reboots itself every 3-4 weeks; outside of these reboots, there doesn't seem to be any issue with the server. It has been happening for 6+ months now, and has only recently caught my attention. I kinda dismissed it as something I did to crash it in the early days. Anyhow, I have followed the instructions and installed mcelog via the NerdPack plugin; the system diagnostics are attached. Hope there is something in the log before the latest self-reboot (~6 hours ago). tower-diagnostics-20220605-1022.zip
  7. Thank you again Jorge, you can mark this one as resolved. Luckily I got through this issue without actually losing the cache data.
  8. Thanks for getting back to me Jorge. I suspected it was an issue with the cache pool; something was still bugging out. So I ended up taking the long way around:
1. transferred all the cache data back to the array (for peace of mind as well)
2. deleted the cache pool
3. rebooted the server
4. recreated the cache pool (at some point I needed to use the "btrfs-select-super -s 1 /dev/nvme1n1p1" command again, because the system was preventing me from mounting the drive as an unassigned device)
5. now I think everything is in working order; the mover is still moving data from the array back onto the new cache pool.
So just to confirm, this means the cache pool is working in a RAID1 config, right? Does the ID number matter as to which drive data will be read/written from/to? Because one is SATA and one is NVMe.
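To partially answer my own RAID1 question: `btrfs filesystem df /mnt/cache` prints one profile per block group type (lines like "Data, RAID1: total=..., used=..."), and note that adding a device does not by itself convert existing block groups, which is why a balance with convert filters can still be needed. Here's a quick sketch (a hypothetical helper, not from the FAQ) that parses that output and flags any block group type whose profile is still not RAID1:

```python
import re

def non_raid1_profiles(df_output: str):
    """Return (block-group type, profile) pairs whose profile is not RAID1.

    df_output is the text printed by `btrfs filesystem df /mnt/cache`,
    with lines like "Data, RAID1: total=200.00GiB, used=150.00GiB".
    """
    bad = []
    for line in df_output.splitlines():
        m = re.match(r"(\w+), (\w+):", line.strip())
        if m and m.group(2).upper() != "RAID1":
            bad.append((m.group(1), m.group(2)))
    return bad

# Sample output, illustrating a pool where metadata was never converted:
sample = """\
Data, RAID1: total=200.00GiB, used=150.00GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, single: total=1.00GiB, used=500.00MiB
"""
print(non_raid1_profiles(sample))  # → [('Metadata', 'single')]
```

An empty list would mean everything is mirrored across the pool; anything it prints is a candidate for a `-dconvert=raid1`/`-mconvert=raid1` balance.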
  9. Hi Jorge, something is not right. I'm only at step 1, but I don't think the device was correctly added to the pool (for reasons unknown to me). It says the 1TB SATA ssd is "part of a pool". I tried to do "balance" in the GUI for Cache (nvme1n1), twice; each time it completed without complaint. FYI, it would not allow me to "convert to raid1 mode" here. Afterwards, using "btrfs fi usage -T /mnt/cache" I get this; it doesn't even look like the device is in the pool... I tried to run "btrfs balance start -mconvert=raid1 /mnt/cache" as posted on the other thread, and got an error; the syslog says:
Dec 30 11:48:38 Tower kernel: BTRFS error (device nvme1n1p1): balance: invalid convert metadata profile raid1
P.S. I tried stopping the array and adding the 2nd cache device twice, same issues. Syslog and new diagnostics attached. syslog.txt tower-diagnostics-20211230-1151.zip
  10. Oh wow, thank you so much Jorge. The server is back to life now, with VMs and Dockers all working fine. You've saved me a lot of time reconfiguring my VMs and security cameras... So, back to what I was trying to do in the beginning: I'm trying to swap out the 500GB NVMe that the cache data is on at the moment, and replace it with a 1TB NVMe (which currently has data, and is used as an unassigned device) pooled with a 1TB SATA ssd. Following what I originally was planning to do, I guess:
1. add the 1TB SATA ssd to the cache pool (resulting in the redundant cache pool I should have had from the beginning)
2. remove the 500GB NVMe from the cache pool (the 1TB SATA ssd should still function, right?)
3. do my data transfer, so the 1TB NVMe is free
4. add the 1TB NVMe to the cache pool
Is there a better way? I'm a bit worried about steps 2 and 4 because of what I just went through. Would the 1TB SATA ssd be a functional cache drive by itself after step 2?
  11. So I guess I'm just trying to recover the data on my cache drive at this point. The 2 pooled drives were set up as RAID1 in case one of them failed, but at the moment I can't seem to figure out or find information about how to recover the data. I just saw this thread, and think I might be one of these cases... I can't remember when I set up the redundancy, but I recall it was on the Unraid version where the pooled cache feature was first introduced. I believe I have not done anything irreversible (I have not formatted any of the drives, and have not rebooted the server), but I think I need some expert help... tower-diagnostics-20211229-1432.zip
  12. Hi all, I currently have 2 drives assigned to the cache pool: a 1TB SATA ssd and a 500GB NVMe ssd (which I intend to remove from the pool). I read the FAQ on removing a cache drive, but I'm encountering issues and I'm not sure how to proceed. I'm running Unraid 6.9.2, and followed the instructions to the letter:
1. stop the array
2. unassign the pool disk to remove
3. did not do any re-ordering
4. start the array (after checking the "I'm sure" box next to the start array button)
The system is asking me if I want to format the cache2 drive, which is the one I intend to keep in the pool. The drive that was removed from the pool could not be mounted as an unassigned device (I planned on doing this later after the cache pool was sorted)... Having encountered this problem, I didn't really want to deal with a server problem during the holidays... I tried to add the drive I removed back into the pool, but it seems like I cannot anymore... I read on the forum about physically removing the drive, but that was for an older version of Unraid, so I wanted to check with the pros before I shut down the server and all... Help... what are my options at this stage?
  13. Hi CS, no, I didn't. After diagnosing the Plex library indexing thing, I monitored the Unraid server occasionally, and the disks are sleeping most of the time, so I haven't bothered with the *.tmp files. If you figure it out, please do post.
  14. Hi all, thanks for all the input. I think this thread can be marked as resolved now. I still get the odd temp file created, and also some odd opening of selected files on the Plex server at ~2.30am... when I know for sure nobody is accessing the Plex server. Anyhow, these don't occur frequently and don't matter much, as my HDD sleep timer is set at 3h, and these don't cause additional spin-ups when the mover does its thing. Overall, I found the fileactivity plugin very helpful in isolating the issues.
  15. I ended up moving the ISOs folder to cache; it only took up like 15GB of space, which is quite tolerable for me. I've been monitoring fileactivity for a few days since last reporting here, and it has been working very well. I think the disks only need to spin up when I watch Plex, and at 3.40am for the mover. Regarding the syncthing problem, I still don't really know how it works, but in my current setup it appears to not be scanning much. In the fileactivity log, I see the following (I don't think it happens daily, as I haven't noticed it before). Does anyone know what is creating the .tmp files, whether they'll cause the drives to spin up, and how to avoid them?
Jan 10 04:30:13 CREATE => /mnt/disk1/2041296744.tmp
Jan 10 04:30:13 OPEN => /mnt/disk1/2041296744.tmp
Jan 10 04:30:13 OPEN => /mnt/disk1/2041296744.tmp
Jan 10 04:30:13 DELETE => /mnt/disk1/2041296744.tmp
Jan 10 04:30:13 CREATE => /mnt/disk2/1694051432.tmp
Jan 10 04:30:13 OPEN => /mnt/disk2/1694051432.tmp
Jan 10 04:30:13 OPEN => /mnt/disk2/1694051432.tmp
Jan 10 04:30:13 DELETE => /mnt/disk2/1694051432.tmp
Jan 10 04:30:13 CREATE => /mnt/disk3/2030880645.tmp
Jan 10 04:30:13 OPEN => /mnt/disk3/2030880645.tmp
Jan 10 04:30:13 DELETE => /mnt/disk3/2030880645.tmp
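In case it helps anyone else triaging this, here's a small sketch (assuming the fileactivity line format shown above; the helper name is my own) that groups events per path and lists the .tmp files that were both created and deleted, i.e. the short-lived ones that come and go between spin-up checks:

```python
from collections import defaultdict

def created_and_deleted_tmp(log_lines):
    """Return .tmp paths that have both a CREATE and a DELETE event.

    Each line looks like:
    "Jan 10 04:30:13 CREATE => /mnt/disk1/2041296744.tmp"
    """
    events = defaultdict(set)
    for line in log_lines:
        parts = line.split(" => ")
        if len(parts) != 2:
            continue  # not a fileactivity event line
        action = parts[0].split()[-1]  # CREATE / OPEN / DELETE
        path = parts[1].strip()
        if path.endswith(".tmp"):
            events[path].add(action)
    return sorted(p for p, acts in events.items()
                  if {"CREATE", "DELETE"} <= acts)

log = [
    "Jan 10 04:30:13 CREATE => /mnt/disk1/2041296744.tmp",
    "Jan 10 04:30:13 OPEN => /mnt/disk1/2041296744.tmp",
    "Jan 10 04:30:13 DELETE => /mnt/disk1/2041296744.tmp",
    "Jan 10 04:30:13 CREATE => /mnt/disk2/1694051432.tmp",
    "Jan 10 04:30:13 OPEN => /mnt/disk2/1694051432.tmp",
]
print(created_and_deleted_tmp(log))  # → ['/mnt/disk1/2041296744.tmp']
```

Feeding it the full fileactivity log would show whether every one of these .tmp files is transient, which would point at some periodic housekeeping job rather than real user activity.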