ldrax

Members · 94 posts
Everything posted by ldrax

  1. Just want to express my newfound happiness. My motherboard has only one USB controller (to my disappointment), but the USB-C port on the 2080 Ti comes to the rescue! On 6.11.5, I just need to VFIO-bind it (under System Devices), restart, and then pass through the NVIDIA USB controller. It works! I attached a Dell WD19TB dock that's been lying around forever; it's a Thunderbolt dock, but it works over plain USB-C. I also attached another USB hub to the WD19TB. This gives a dozen USB ports for plug-and-play use in the Windows 11 guest OS. Transfer rates are fast, though of course shared among all the plugged-in devices. I guess there will be an overall bus power limit as well, but I plugged in 3 external SSDs together with a bunch of thumb drives and an SD card, just to test. This made my day!
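     For anyone wanting to see what the GUI step actually records: the System Devices checkboxes in recent Unraid versions persist the selection in a small file on the flash drive. A sketch of what that file might look like (the PCI address and vendor:device ID below are made-up examples, and the exact format is my assumption, so prefer the GUI):

     ```
     # /boot/config/vfio-pci.cfg -- illustrative values only
     BIND=0000:0b:00.2|10de:1ad8
     ```

     On reboot, Unraid binds the listed device to vfio-pci so it can be handed whole to the VM.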
  2. I probably should do the preclear again because, for reasons that escape me, I opted to 'Skip pre-read' and 'Skip post-read' when I started the preclear that night.
  3. This is the 'Disk Log Information' from the Unassigned Devices menu. Note that this log only covers the period after the reboot described in point 7 above. The errors at Nov 21 23:26 - 23:27 occurred just after the preclear started (point 9 above). sdu-log.txt
  4. I need to find a way to get cleaner diagnostics. Currently a bunch of DIY scripts are writing syslog messages that are not exactly fit for public posting. I can post the 'Disk Log Information' if that helps.
  5. Hi all, I have the situation below. Basically, I want to know whether I can be confident using the disk in question going forward, in its current exact location and SATA slot. My plan was to shrink the array, unassign disk10, and use it as an unassigned device.
     1. During a non-correcting parity check, disk10 (/dev/sdu) reported rapidly climbing UDMA CRC errors (from 0 to a 3000+ count).
     2. Turned off the system and replaced the SATA cable (actually it's a 4-way breakout cable from a JBOD PCIe controller; the order is preserved, i.e. it's still the "3rd cable").
     3. Parity check finished without error, and without new UDMA CRC errors.
     4. Emptied disk10's contents without error and deleted the empty shared folder.
     5. To follow the 'Shrink array' procedure, https://wiki.unraid.net/Shrink_array#The_.22Clear_Drive_Then_Remove_Drive.22_Method I created the clear-me directory, then started the clear_array_drive script. (I've done this procedure multiple times in the past when shrinking my array.)
     6. Syslog rapidly reported a bunch of write errors on disk10, and Unraid subsequently disabled disk10.
     7. Stopped the clear_array_drive script, rebooted the system, did New Config, and unassigned disk10.
     8. Rebuilt parity with the remaining 9 disks, no errors.
     9. Precleared sdu (formerly disk10); it started with a few lines of write errors logged to syslog, but no further errors for the next 14+ hours until the preclear finished with a success message.
     10. Ran a short SMART test and then an extended SMART test on sdu; both passed.
     11. In Unassigned Devices, formatted sdu with XFS and copied a few GB as a test; no errors reported.
     I intend to use this disk under Unassigned Devices as a 'staging disk' for large content work. Should I be worried? Thank you beforehand.
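     For context on step 5, the wiki method's safety interlock is simply that the target disk must be empty except for a top-level clear-me folder. A simplified sketch of that check (illustrative logic only, with a temp directory standing in for /mnt/diskN; the real clear_array_drive script does more than this):

     ```shell
     #!/bin/sh
     # Sketch of the clear_array_drive marker check (not the real script).
     disk=$(mktemp -d)            # stand-in for /mnt/disk10
     mkdir "$disk/clear-me"       # the marker folder the user creates first
     # count anything on the disk other than the marker
     extra=$(ls -A "$disk" | grep -v '^clear-me$' | wc -l)
     if [ -d "$disk/clear-me" ] && [ "$extra" -eq 0 ]; then
       echo "disk marked and empty: safe to clear"
     else
       echo "refusing to clear"
     fi
     ```

     The two conditions together are what prevents the script from zeroing a disk that still holds data.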
  6. Done, reposted there. I will delete this post shortly. Thanks @johnnie.black!
  7. Recently I noticed many processes failing with 'unable to fork, resources not available' errors. Suspecting an out-of-memory issue (I have 32GB), I watched the top command for a while as these errors occurred, but it doesn't look that way. The 'ps' command surprisingly shows there are 30k+ nv_queue processes:
     -- truncated --
     32757 ? S 0:00 [nv_queue]
     32758 ? S 0:00 [nv_queue]
     32759 ? S 0:00 [nv_queue]
     32760 ? S 0:00 [nv_queue]
     32761 ? S 0:00 [nv_queue]
     32762 ? S 0:00 [nv_queue]
     32763 ? S 0:00 [nv_queue]
     32764 ? S 0:00 [nv_queue]
     32765 ? S 0:00 [nv_queue]
     32766 ? S 0:00 [nv_queue]
     32767 ? S 0:00 [nv_queue]
     # ps ax | grep nv_queue | wc -l
     31198
     This might come from the NVIDIA driver. I have been using the Nvidia build for a long time, but the only thing I've done differently recently is run nvidia-smi -pm 1 to enable persistence mode. Without this command, my graphics card (1080 Ti) always sits at 55W when idle; with it, it drops to 9-12W. Has anyone encountered this issue? I believe the very large number of nv_queue processes here has rendered the system unstable, depriving many other important processes of resources. I have since rebooted and have NOT executed the -pm 1 command above. Many hours later now, no such issue, so more or less it's that command that is responsible. But of course the 55W idle power usage is now back as an issue.
  8. Recently I noticed many occurrences of 'unable to fork' errors. Suspecting an out-of-memory issue, I watched the top command for a while, but it doesn't look that way. The 'ps' command surprisingly shows there are 30k+ nv_queue processes:
     -- truncated --
     32757 ? S 0:00 [nv_queue]
     32758 ? S 0:00 [nv_queue]
     32759 ? S 0:00 [nv_queue]
     32760 ? S 0:00 [nv_queue]
     32761 ? S 0:00 [nv_queue]
     32762 ? S 0:00 [nv_queue]
     32763 ? S 0:00 [nv_queue]
     32764 ? S 0:00 [nv_queue]
     32765 ? S 0:00 [nv_queue]
     32766 ? S 0:00 [nv_queue]
     32767 ? S 0:00 [nv_queue]
     # ps ax | grep nv_queue | wc -l
     31198
     This might come from the NVIDIA driver. I have been using the Nvidia build for a long time, but the only thing I've done differently recently is run nvidia-smi -pm 1 to enable persistence mode. Without this command, my graphics card (1080 Ti) always sits at 55W when idle; with it, it drops to 9-12W. Has anyone encountered this issue? I believe the very large number of nv_queue processes here has rendered the system unstable, causing many other important processes to fail for lack of resources.
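     An aside on the counting method: `ps ax | grep nv_queue | wc -l` also counts the grep process itself, so the true number is one lower; a bracketed pattern avoids the self-match. A small demonstration on canned ps-style output:

     ```shell
     # Two fake ps lines for kernel threads, plus grep's own process line
     sample='32766 ?     S  0:00 [nv_queue]
     32767 ?     S  0:00 [nv_queue]
      1234 pts/0 S+ 0:00 grep nv_queue'
     printf '%s\n' "$sample" | grep -c 'nv_queue'       # 3: includes the grep line
     # With a bracket in the pattern, grep's own cmdline reads "[n]v_queue",
     # which the regex [n]v_queue does not match:
     sample2='32766 ?     S  0:00 [nv_queue]
     32767 ?     S  0:00 [nv_queue]
      1234 pts/0 S+ 0:00 grep [n]v_queue'
     printf '%s\n' "$sample2" | grep -c '[n]v_queue'    # 2: kernel threads only
     ```

     On a live system, `pgrep -c nv_queue` sidesteps the problem entirely (assuming the procps pgrep is available, as it is on Unraid).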
  9. I see. So a possible cause is that it's underloaded. Thanks @Benson
  10. I noticed a strange behaviour of my PSU (SilverStone Strider 1200W Platinum ST1200). The PSU is installed with its fan facing downwards, drawing air from the bottom of the case (the Corsair 760T has an intake honeycomb with a filter at that bottom position). For a long time I had the HDD spin-down delay set to 'Never', i.e. disabled; recently I enabled it. When the disks stay spun down, the PSU periodically ramps up its fan for about 10-20 seconds before going quiet again. When I spin the HDDs back up, the PSU stays quiet the whole time. This is rather strange, as spun-down HDDs should lower the overall case temperature. The only explanation I can think of is that the PSU 'compares' its internal temperature with the case interior temperature, and if it thinks it's too hot relative to the case, it ramps up its fan speed. Does anyone notice similar behaviour? Any idea what I can do?
  11. Done, looks like all errors are corrected. Thanks!
     scrub status for 3c12a05c-3bba-493e-98e5-d2d3a2c7e107
     scrub started at Mon Mar 30 16:21:50 2020 and finished after 00:26:03
     total bytes scrubbed: 1.87TiB with 300965 errors
     error details: csum=300965
     corrected errors: 300965, uncorrectable errors: 0, unverified errors: 0
  12. A btrfs scrub (read-only, not repairing yet), however, shows a lot of errors found:
     scrub status for 3c12a05c-3bba-493e-98e5-d2d3a2c7e107
     scrub started at Mon Mar 30 15:52:07 2020, running for 00:01:21
     total bytes scrubbed: 121.53GiB with 14191 errors
     error details: csum=14191
     corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
     (in progress)
  13. So before I ran the --clear-space-cache command, I started the array in normal mode to back up some selected files from the cache pool. While doing this, there were a lot of error messages (including messages about correcting them) in syslog, as well as messages about rebuilding the space cache:
     Mar 30 15:18:56 gpt760t kernel: BTRFS warning (device sdh1): failed to load free space cache for block group 4907223482368, rebuilding it now
     Mar 30 15:18:56 gpt760t kernel: BTRFS warning (device sdh1): failed to load free space cache for block group 4158791876608, rebuilding it now
     Mar 30 15:18:56 gpt760t kernel: BTRFS warning (device sdh1): failed to load free space cache for block group 4512052936704, rebuilding it now
     Mar 30 15:18:56 gpt760t kernel: BTRFS error (device sdh1): csum mismatch on free space cache
     Mar 30 15:18:56 gpt760t kernel: BTRFS warning (device sdh1): failed to load free space cache for block group 4999565279232, rebuilding it now
     Mar 30 15:19:19 gpt760t kernel: io_ctl_check_generation: 21 callbacks suppressed
     Mar 30 15:19:19 gpt760t kernel: BTRFS error (device sdh1): space cache generation (126117) does not match inode (126155)
     Mar 30 15:19:19 gpt760t kernel: BTRFS warning (device sdh1): failed to load free space cache for block group 4959836831744, rebuilding it now
     Mar 30 15:19:19 gpt760t kernel: BTRFS error (device sdh1): space cache generation (126115) does not match inode (126182)
     Mar 30 15:19:19 gpt760t kernel: BTRFS warning (device sdh1): failed to load free space cache for block group 4998491537408, rebuilding it now
     --- truncated, hundreds of these same messages ----
     Once the backup was completed, I restarted the array in maintenance mode and did a check --readonly, just to check.
     All previous errors are now gone:
     Opening filesystem to check...
     Checking filesystem on /dev/sdh1
     UUID: 3c12a05c-3bba-493e-98e5-d2d3a2c7e107
     [1/7] checking root items
     [2/7] checking extents
     [3/7] checking free space cache
     [4/7] checking fs roots
     [5/7] checking only csums items (without verifying data)
     [6/7] checking root refs
     [7/7] checking quota groups skipped (not enabled on this FS)
     found 1030725935104 bytes used, no error found
     total csum bytes: 603511636
     total tree bytes: 1761673216
     total fs tree bytes: 533708800
     total extent tree bytes: 218890240
     btree space waste bytes: 481730316
     file data blocks allocated: 67454937907200
      referenced 1007469522944
     I guess I don't have to run btrfs check --clear-space-cache then? Thanks @johnnie.black!
  14. Thank you @johnnie.black I'll do that shortly and update again here. You've been very helpful each time.
  15. Thank you @johnnie.black as always! Do I run it with v1 or v2? The descriptions are there, but to be honest I don't really grasp the free space cache concept here:
     --clear-space-cache v1|v2
     completely wipe all free space cache of given type
     For free space cache v1, the clear_cache kernel mount option only rebuilds the free space cache for block groups that are modified while the filesystem is mounted with that option. Thus, using this option with v1 makes it possible to actually clear the entire free space cache.
     For free space cache v2, the clear_cache kernel mount option destroys the entire free space cache. This option, with v2, provides an alternative method of clearing the free space cache that doesn't require mounting the filesystem.
  16. I have 4 SSDs in the cache pool. I trusted my itchy hands with some cabling work, and it turned out the power cables to 2 of the drives were not secure, so those drives dropped out of and back into the pool over a short period (less than 5 minutes) before I quickly stopped the array and powered down the system. Now it's all back up; I started the array in maintenance mode and did a filesystem check --readonly on the cache:
     Opening filesystem to check...
     Checking filesystem on /dev/sdh1
     UUID: 3c12a05c-3bba-493e-98e5-d2d3a2c7e107
     [1/7] checking root items
     [2/7] checking extents
     [3/7] checking free space cache
     btrfs: csum mismatch on free space cache
     failed to load free space cache for block group 4040680275968
     btrfs: space cache generation (126118) does not match inode (126182)
     failed to load free space cache for block group 4044975243264
     btrfs: csum mismatch on free space cache
     failed to load free space cache for block group 4561445060608
     btrfs: space cache generation (126113) does not match inode (126155)
     <------------------ truncated, there are about 200 lines of this error --------->
     [4/7] checking fs roots
     [5/7] checking only csums items (without verifying data)
     [6/7] checking root refs
     [7/7] checking quota groups skipped (not enabled on this FS)
     found 1430313537536 bytes used, no error found
     total csum bytes: 993257632
     total tree bytes: 2197733376
     total fs tree bytes: 572276736
     total extent tree bytes: 238682112
     btree space waste bytes: 415841311
     file data blocks allocated: 67854097842176
      referenced 1405220114432
     I saw in another post that @johnnie.black mentioned the 'csum mismatch' is just a warning, nothing to worry about. Can you advise on what to do from here? While I'm glad to see the line 'found 1430313537536 bytes used, no error found', I hope nothing serious happened. Do I restart the array in normal mode and then do a [repairing] scrub? (I have disabled Docker and VMs for the time being.) Thanks!
  17. The motherboard is an Asus X99-E WS, so PCIe lanes are plentiful; it can accommodate 7 slots at x8 or 4 at x16.
  18. I have an LSI SAS/SATA card installed in PCIe slot 1. When I installed a GT730 in slot 2, the SATA card got disabled (drives marked as Missing on the Main page). The GPU itself was working; I got output on the monitor connected to it. When I installed a different brand of GT730 in the same slot 2, both it and the SATA card work normally. I wonder what circumstances would cause this? I got these 2 GT730s to explore GPU passthrough with VMs.
  19. I wonder if anyone has managed to get the nct6775 module working for the NCT6791D chip? (This chip is what the sensors-detect output shows.) I ran `modprobe nct6775` and `modprobe nct6775 0x290` at the command line to no effect. The other command, `modprobe coretemp`, did result in sensors reporting the CPU temperature (only).
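     In case it helps anyone who lands here: modprobe takes module parameters in name=value form, so a bare `0x290` argument isn't a valid parameter for nct6775. On many boards the driver loads but can't claim the chip because ACPI has reserved its I/O ports (dmesg will say so); a commonly suggested workaround, assuming that's the cause here, is the acpi_enforce_resources=lax kernel parameter added to the append line in syslinux.cfg on the flash drive. Illustrative fragment only, keep whatever other options your append line already has:

     ```
     label Unraid OS
       menu default
       kernel /bzimage
       append acpi_enforce_resources=lax initrd=/bzroot
     ```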
  20. Thanks! The check returned 'no error found'. A bit of peace of mind.
  21. @johnnie.black Related to my earlier post that you helped me solve yesterday (cache disk appears unassigned): I wanted to run a test/check on the cache pool, just to see whether it finds any errors (hopefully not!). I have done the Scrub process (0 errors) and am now about to click the FS CHECK button (with the default --readonly option). I hope there's no harm in doing this?
  22. Referring to https://wiki.unraid.net/Check_Disk_Filesystems, specifically the BTRFS section: when I click on the BTRFS cache disk, it shows rather the opposite, 'Filesystem check only available in Maintenance Mode'. Now that there are both a Scrub and a File System Check button on the aforementioned cache disk page, do I need to run both to test/check?
  23. I'm baffled by the system's failure to shut down properly when the UPS threshold is reached (minutes and/or % of battery before shutting down). If there is an open, forgotten console session somewhere (inside screen, for example) with a working directory inside the array, or perhaps a long-running rsync or rclone command copying the contents of a disk, then 'Stopping array' will not succeed, the overall shutdown sequence won't complete, and the machine ends up losing power since the UPS can only wait a short time. I wonder what the best practice is in this situation?
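     One way to audit this before it bites: walk /proc and list any process whose working directory sits under the array mount, then close those sessions by hand (or from a pre-shutdown script). A minimal sketch assuming a Linux /proc layout; /mnt/user below is just an example path, and this only catches processes whose cwd is inside the path, not ones merely holding files open there:

     ```shell
     #!/bin/sh
     # List processes whose current working directory is at or under $1.
     busy_under() {
       for p in /proc/[0-9]*; do
         cwd=$(readlink "$p/cwd" 2>/dev/null) || continue
         case "$cwd" in
           "$1"|"$1"/*) echo "${p#/proc/}: $cwd" ;;
         esac
       done
     }
     busy_under /mnt/user
     ```

     For the open-file case (rsync/rclone mid-copy), `lsof` or `fuser -m` against the mount point is the usual complement.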