Posts posted by Vcent

  1. Update, in case someone else hits this lovely snag in the future: the problem was indeed a RAM stick, which worked perfectly fine until stressed juuust right, at which point it started outputting gibberish garbage and crashed the system. Unfortunately, none of the memtests identified the RAM as faulty; it passed with no errors every time (unless I started faffing about in the options while a test ran, which would sometimes make a ton of errors appear - I'm guessing that's a bug in memtest, though, rather than indicative of this particular flaw).

     

    I ended up finding it by removing one stick, running the system for a day while stressing it, and - when it didn't crash - swapping the sticks to verify the problem. Sure enough, it promptly wound up crashing the system again once everything got loaded hard enough.

     

    Curiously, all memtests, regardless of runtime, still insist that everything is peachy and that there's nothing wrong with either the working or the faulty stick. The system has been both stable and dependable since removing the problematic stick.
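    For anyone wanting to reproduce this kind of load-dependent failure without waiting for a random crash, here's a rough sketch of loading up most of the available RAM from the running system. Note that stress-ng and its flags are my assumption - it isn't part of stock unRaid and would need installing first; the sizing arithmetic is the point.

```shell
#!/bin/sh
# Sketch: load most of available RAM so a marginal stick fails under
# stress even when memtest passes. Assumes stress-ng is installed
# (NOT part of stock unRaid); adjust worker count to taste.
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
workers=4
# Target ~90% of available memory, split across the workers
per_worker_kb=$(( avail_kb * 90 / 100 / workers ))
echo "stressing ${workers} workers x ${per_worker_kb} KiB each"
if command -v stress-ng >/dev/null 2>&1; then
    # --verify makes each worker check what it wrote back; bump -t up
    # to hours for a real soak test (30s is just a smoke run)
    stress-ng --vm "$workers" --vm-bytes "${per_worker_kb}K" \
              --vm-method all --verify -t 30s
else
    echo "stress-ng not installed; install it first"
fi
```

    Running it overnight with the timeout raised is closer to what actually flushed out my bad stick.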

  2. On 7/13/2022 at 11:22 AM, Picha said:

    Anyone has an issue with digikam right now?

    I'm only seeing a console with user abc@Hostname

    Probably too late to help you, but the solution to that is to type in 'digikam' without the quotes, and to be slightly patient while it launches.

     

    Unfortunately, launching it that way leads to digikam opening in a tiny window, since Guacamole for some godforsaken reason assumes the window size should remain static, and squashes digikam into the same window size as a terminal. I found some way to fix that, but unfortunately it's been long enough by now that I've forgotten how I did it.

  3. It's currently on the last 25% of the final (fourth) pass of MemTest86 Free Version 9.3, which, much like the memtest86+ included with unRaid, has found .. diddly squat, except that there are no issues with the RAM.

    I would be highly surprised if that changes during the last hour of the test, seeing as the included memtest was run several times previously and didn't find anything during any of its passes either - unless unRaid is somehow significantly harder on the RAM, since it manages to kill the machine in a much shorter period of time than any of the test runs.

     

    It's done with its run, having found nothing. The advice it gives to run again in multi-CPU mode is nice, but not possible on my motherboard/CPU combo, due to some UEFI limitation. I don't have any other AMD boards lying around to test with either. I'm letting it run once more through the night, but I'd be gobsmacked if that changed the result.

  4. On 9/1/2021 at 10:02 PM, xgroleau said:

      

    Hi,

    I am getting some issues with deluge lately; it seems there is some recursive routing, but I am only using the VPN for deluge and a torrent indexer. I am using CyberGhost VPN with OpenVPN. I can still download torrents, though it spams the log file, the VPN seems to drop and restart every couple of minutes, and torrents sometimes go into an error state. Has anyone encountered a similar issue?

    Here are the logs when it seems to restart:
     

    2021-09-01 15:41:20,270 DEBG 'watchdog-script' stdout output:
    [info] No torrents with state 'Error' found
    
    2021-09-01 15:41:20,741 DEBG 'start-script' stdout output:
    2021-09-01 15:41:20 us=739881 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    
    2021-09-01 15:41:20,093 DEBG 'start-script' stdout output:
    2021-09-01 15:41:20 us=93338 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    2021-09-01 15:41:20 us=93371 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    2021-09-01 15:41:20 us=93386 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    2021-09-01 15:41:20 us=93459 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    
    2021-09-01 15:41:20,094 DEBG 'start-script' stdout output:
    2021-09-01 15:41:20 us=94796 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    
    2021-09-01 15:41:20,095 DEBG 'start-script' stdout output:
    2021-09-01 15:41:20 us=94945 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    2021-09-01 15:41:20 us=94956 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    2021-09-01 15:41:20 us=94965 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    2021-09-01 15:41:20 us=94974 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    2021-09-01 15:41:20 us=95128 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    
    2021-09-01 15:41:20,270 DEBG 'watchdog-script' stdout output:
    [info] No torrents with state 'Error' found
    
    2021-09-01 15:41:20,741 DEBG 'start-script' stdout output:
    2021-09-01 15:41:20 us=739881 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    
    2021-09-01 15:41:24,403 DEBG 'start-script' stdout output:
    2021-09-01 15:41:24 us=403033 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    
    2021-09-01 15:41:25,404 DEBG 'start-script' stdout output:
    2021-09-01 15:41:25 us=403897 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    
    2021-09-01 15:41:25,419 DEBG 'start-script' stdout output:
    2021-09-01 15:41:25 us=418975 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443
    
    2021-09-01 15:41:25,404 DEBG 'start-script' stdout output:
    2021-09-01 15:41:25 us=403897 Recursive routing detected, drop tun packet to [AF_INET]193.176.85.96:443

     

    Did you ever figure this out? I'm having the same problem, and have had it for a while now. Tried changing my OVPN file to a different endpoint; that didn't change anything. I figure it's the torrent client trying to contact itself through the VPN, which OpenVPN doesn't like (for obvious reasons).
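    If anyone else wants to confirm that diagnosis, a quick sketch: ask the kernel which interface it would use to reach the VPN endpoint. If the answer is the tun device itself, that's the recursive-routing loop OpenVPN is complaining about. The IP below is just the one from the quoted log (substitute your provider's endpoint), and the tun interface name is an assumption.

```shell
#!/bin/sh
# Sketch: detect recursive routing by asking the kernel how it would
# reach the VPN endpoint. IP taken from the quoted log; substitute
# your own provider's endpoint address.
endpoint=193.176.85.96
route=$(ip route get "$endpoint" 2>/dev/null || true)
case "$route" in
    *"dev tun"*) echo "BAD: endpoint is routed back through the tunnel itself" ;;
    "")          echo "no route found (no network, or iproute2 missing)" ;;
    *)           echo "looks sane: $route" ;;
esac
```

    When the endpoint shows up behind `dev tun0`, the fix is usually a host route for the endpoint via the real gateway, so the tunnel's own traffic never re-enters the tunnel.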

  5. I've had the syslog server (or rather mirror) up for a while now; the problem is that it rarely captures anything particularly interesting. A docker will drop a net connection, then make up a new one, and so on, until eventually .. the server just stops responding to anything, usually starts producing heat, and spins up the fans. Oftentimes the last message will be either about a dropped (local) IPv6 address or about the disks spinning down, and then nothing else gets logged.

     

    Quote

    <---cut more of the same--->

    Jan  9 11:51:14 Tower kernel: docker0: port 9(vethb2a9002) entered disabled state
    Jan  9 11:51:24 Tower avahi-daemon[7024]: Interface vethb2a9002.IPv6 no longer relevant for mDNS.
    Jan  9 11:51:24 Tower kernel: docker0: port 9(vethb2a9002) entered disabled state
    Jan  9 11:51:24 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface vethb2a9002.IPv6 with address fe80::3003:f3ff:fe5d:320e.
    Jan  9 11:51:24 Tower kernel: device vethb2a9002 left promiscuous mode
    Jan  9 11:51:24 Tower kernel: docker0: port 9(vethb2a9002) entered disabled state
    Jan  9 11:51:24 Tower avahi-daemon[7024]: Withdrawing address record for fe80::3003:f3ff:fe5d:320e on vethb2a9002.
    Jan  9 11:51:36 Tower kernel: veth2b6032c: renamed from eth0
    Jan  9 11:51:36 Tower kernel: docker0: port 6(veth74d3a5b) entered disabled state
    Jan  9 11:51:39 Tower kernel: veth0a68062: renamed from eth0
    Jan  9 11:51:39 Tower kernel: docker0: port 5(vethdbaf6a3) entered disabled state
    Jan  9 11:51:44 Tower avahi-daemon[7024]: Interface vethdbaf6a3.IPv6 no longer relevant for mDNS.
    Jan  9 11:51:44 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface vethdbaf6a3.IPv6 with address fe80::d851:52ff:feff:b08.
    Jan  9 11:51:44 Tower kernel: docker0: port 5(vethdbaf6a3) entered disabled state
    Jan  9 11:51:44 Tower kernel: device vethdbaf6a3 left promiscuous mode
    Jan  9 11:51:44 Tower kernel: docker0: port 5(vethdbaf6a3) entered disabled state
    Jan  9 11:51:44 Tower avahi-daemon[7024]: Withdrawing address record for fe80::d851:52ff:feff:b08 on vethdbaf6a3.
    Jan  9 11:51:50 Tower avahi-daemon[7024]: Interface veth74d3a5b.IPv6 no longer relevant for mDNS.
    Jan  9 11:51:50 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface veth74d3a5b.IPv6 with address fe80::6c06:e8ff:fee2:1e24.
    Jan  9 11:51:50 Tower kernel: docker0: port 6(veth74d3a5b) entered disabled state
    Jan  9 11:51:50 Tower kernel: device veth74d3a5b left promiscuous mode
    Jan  9 11:51:50 Tower kernel: docker0: port 6(veth74d3a5b) entered disabled state
    Jan  9 11:51:50 Tower avahi-daemon[7024]: Withdrawing address record for fe80::6c06:e8ff:fee2:1e24 on veth74d3a5b.
    Jan  9 11:51:54 Tower kernel: veth032db53: renamed from eth0
    Jan  9 11:51:54 Tower kernel: docker0: port 7(veth59e8068) entered disabled state
    Jan  9 11:51:55 Tower kernel: veth38d78d4: renamed from eth0
    Jan  9 11:51:55 Tower kernel: docker0: port 4(vethc70f7d4) entered disabled state
    Jan  9 11:52:03 Tower kernel: veth115e8f9: renamed from eth0
    Jan  9 11:52:03 Tower kernel: docker0: port 3(veth55bc601) entered disabled state
    Jan  9 11:52:11 Tower avahi-daemon[7024]: Interface veth59e8068.IPv6 no longer relevant for mDNS.
    Jan  9 11:52:11 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface veth59e8068.IPv6 with address fe80::809e:85ff:fe4c:823e.
    Jan  9 11:52:11 Tower kernel: docker0: port 7(veth59e8068) entered disabled state
    Jan  9 11:52:11 Tower kernel: device veth59e8068 left promiscuous mode
    Jan  9 11:52:11 Tower kernel: docker0: port 7(veth59e8068) entered disabled state
    Jan  9 11:52:11 Tower avahi-daemon[7024]: Withdrawing address record for fe80::809e:85ff:fe4c:823e on veth59e8068.
    Jan  9 11:52:14 Tower avahi-daemon[7024]: Interface vethc70f7d4.IPv6 no longer relevant for mDNS.
    Jan  9 11:52:14 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface vethc70f7d4.IPv6 with address fe80::f8a1:beff:fe37:281b.
    Jan  9 11:52:14 Tower kernel: docker0: port 4(vethc70f7d4) entered disabled state
    Jan  9 11:52:14 Tower kernel: device vethc70f7d4 left promiscuous mode
    Jan  9 11:52:14 Tower kernel: docker0: port 4(vethc70f7d4) entered disabled state
    Jan  9 11:52:14 Tower avahi-daemon[7024]: Withdrawing address record for fe80::f8a1:beff:fe37:281b on vethc70f7d4.
    Jan  9 11:52:15 Tower avahi-daemon[7024]: Interface veth55bc601.IPv6 no longer relevant for mDNS.
    Jan  9 11:52:15 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface veth55bc601.IPv6 with address fe80::3cf9:2bff:fe92:60d.
    Jan  9 11:52:15 Tower kernel: docker0: port 3(veth55bc601) entered disabled state
    Jan  9 11:52:15 Tower kernel: device veth55bc601 left promiscuous mode
    Jan  9 11:52:15 Tower kernel: docker0: port 3(veth55bc601) entered disabled state
    Jan  9 11:52:15 Tower avahi-daemon[7024]: Withdrawing address record for fe80::3cf9:2bff:fe92:60d on veth55bc601.
    Jan  9 12:00:02 Tower dhcpcd[1565]: br0: failed to renew DHCP, rebinding
    Jan  9 12:44:12 Tower kernel: BTRFS warning (device dm-3): csum failed root 5 ino 4668494 off 1588256768 csum 0xc582cc78 expected csum 0xf93b897d mirror 1
    Jan  9 12:44:12 Tower kernel: BTRFS error (device dm-3): bdev /dev/mapper/md4 errs: wr 0, rd 0, flush 0, corrupt 157, gen 0
    Jan  9 13:16:37 Tower kernel: docker0: port 3(vethba51b15) entered blocking state
    Jan  9 13:16:37 Tower kernel: docker0: port 3(vethba51b15) entered disabled state
    Jan  9 13:16:37 Tower kernel: device vethba51b15 entered promiscuous mode
    Jan  9 13:16:37 Tower kernel: docker0: port 3(vethba51b15) entered blocking state
    Jan  9 13:16:37 Tower kernel: docker0: port 3(vethba51b15) entered forwarding state
    Jan  9 13:16:37 Tower kernel: docker0: port 3(vethba51b15) entered disabled state
    Jan  9 13:17:02 Tower kernel: eth0: renamed from veth153f703
    Jan  9 13:17:02 Tower kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethba51b15: link becomes ready
    Jan  9 13:17:02 Tower kernel: docker0: port 3(vethba51b15) entered blocking state
    Jan  9 13:17:02 Tower kernel: docker0: port 3(vethba51b15) entered forwarding state
    Jan  9 13:17:04 Tower avahi-daemon[7024]: Joining mDNS multicast group on interface vethba51b15.IPv6 with address fe80::3410:79ff:fe93:be73.
    Jan  9 13:17:04 Tower avahi-daemon[7024]: New relevant interface vethba51b15.IPv6 for mDNS.
    Jan  9 13:17:04 Tower avahi-daemon[7024]: Registering new address record for fe80::3410:79ff:fe93:be73 on vethba51b15.*.
    Jan  9 13:17:17 Tower kernel: veth389a0a7: renamed from eth0
    Jan  9 13:17:17 Tower kernel: docker0: port 8(veth91f37a1) entered disabled state
    Jan  9 13:17:18 Tower avahi-daemon[7024]: Interface veth91f37a1.IPv6 no longer relevant for mDNS.
    Jan  9 13:17:18 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface veth91f37a1.IPv6 with address fe80::58c6:28ff:fea3:2840.
    Jan  9 13:17:18 Tower kernel: docker0: port 8(veth91f37a1) entered disabled state
    Jan  9 13:17:18 Tower kernel: device veth91f37a1 left promiscuous mode
    Jan  9 13:17:18 Tower kernel: docker0: port 8(veth91f37a1) entered disabled state
    Jan  9 13:17:18 Tower avahi-daemon[7024]: Withdrawing address record for fe80::58c6:28ff:fea3:2840 on veth91f37a1.
    Jan  9 13:25:58 Tower kernel: veth153f703: renamed from eth0
    Jan  9 13:25:58 Tower kernel: docker0: port 3(vethba51b15) entered disabled state
    Jan  9 13:25:59 Tower avahi-daemon[7024]: Interface vethba51b15.IPv6 no longer relevant for mDNS.
    Jan  9 13:25:59 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface vethba51b15.IPv6 with address fe80::3410:79ff:fe93:be73.
    Jan  9 13:25:59 Tower kernel: docker0: port 3(vethba51b15) entered disabled state
    Jan  9 13:25:59 Tower kernel: device vethba51b15 left promiscuous mode
    Jan  9 13:25:59 Tower kernel: docker0: port 3(vethba51b15) entered disabled state
    Jan  9 13:25:59 Tower avahi-daemon[7024]: Withdrawing address record for fe80::3410:79ff:fe93:be73 on vethba51b15.
    Jan  9 13:30:14 Tower kernel: BUG: Bad page state in process kswapd0  pfn:36fecb
    Jan  9 13:30:14 Tower kernel: page:0000000091ed811b refcount:0 mapcount:-64 mapping:0000000000000000 index:0x1 pfn:0x36fecb
    Jan  9 13:30:14 Tower kernel: flags: 0x2ffff0000000000()
    Jan  9 13:30:14 Tower kernel: raw: 02ffff0000000000 dead000000000100 dead000000000122 0000000000000000
    Jan  9 13:30:14 Tower kernel: raw: 0000000000000001 0000000000000000 00000000ffffffbf 0000000000000000
    Jan  9 13:30:14 Tower kernel: page dumped because: nonzero mapcount
    Jan  9 13:30:14 Tower kernel: Modules linked in: tun veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter dm_crypt dm_mod dax md_mod amdgpu gpu_sched i2c_algo_bit drm_kms_helper ttm drm agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops it87 hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding edac_mce_amd kvm_amd ccp kvm mpt3sas crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd sr_mod raid_class cdrom scsi_transport_sas r8169 cryptd i2c_piix4 realtek i2c_core ahci video glue_helper k10temp backlight acpi_cpufreq libahci button
    Jan  9 13:30:14 Tower kernel: CPU: 1 PID: 652 Comm: kswapd0 Not tainted 5.10.28-Unraid #1
    Jan  9 13:30:14 Tower kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88X-D3H, BIOS F7 12/25/2015
    Jan  9 13:30:14 Tower kernel: Call Trace:
    Jan  9 13:30:14 Tower kernel: dump_stack+0x6b/0x83
    Jan  9 13:30:14 Tower kernel: bad_page+0xcb/0xe3
    Jan  9 13:30:14 Tower kernel: check_free_page+0x70/0x76
    Jan  9 13:30:14 Tower kernel: free_pcppages_bulk+0xd0/0x205
    Jan  9 13:30:14 Tower kernel: free_unref_page_list+0xbe/0xf4
    Jan  9 13:30:14 Tower kernel: shrink_page_list+0x8e3/0x924
    Jan  9 13:30:14 Tower kernel: shrink_inactive_list+0x1d6/0x2e2
    Jan  9 13:30:14 Tower kernel: shrink_lruvec+0x369/0x4e7
    Jan  9 13:30:14 Tower kernel: ? __default_send_IPI_shortcut+0x1b/0x26
    Jan  9 13:30:14 Tower kernel: ? setup_local_APIC+0x20f/0x248
    Jan  9 13:30:14 Tower kernel: ? update_load_avg+0x2aa/0x2c4
    Jan  9 13:30:14 Tower kernel: mem_cgroup_shrink_node+0xa1/0xc9
    Jan  9 13:30:14 Tower kernel: mem_cgroup_soft_limit_reclaim+0x13c/0x237
    Jan  9 13:30:14 Tower kernel: balance_pgdat+0x1fc/0x3dc
    Jan  9 13:30:14 Tower kernel: kswapd+0x240/0x28c
    Jan  9 13:30:14 Tower kernel: ? init_wait_entry+0x24/0x24
    Jan  9 13:30:14 Tower kernel: ? balance_pgdat+0x3dc/0x3dc
    Jan  9 13:30:14 Tower kernel: kthread+0xe5/0xea
    Jan  9 13:30:14 Tower kernel: ? __kthread_bind_mask+0x57/0x57
    Jan  9 13:30:14 Tower kernel: ret_from_fork+0x22/0x30
    Jan  9 13:30:14 Tower kernel: Disabling lock debugging due to kernel taint

    And that's it; nothing more was logged. For once it actually managed to log ..something, rather than stopping around 13:25:59 (withdrawing address, making a new one, yadda yadda, server unresponsive). DM-3 shows up in the log, which is as interesting as it is annoying, since I still don't have a cache device - although it might come from the system having a misconception somewhere about having a swap file (which it doesn't anymore, and swap is indicated at 0kb)?
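    For what it's worth, a dm-N node doesn't have to be a cache device - the BTRFS line above names /dev/mapper/md4 as the device behind dm-3, and with dm_crypt loaded each unlocked encrypted disk gets its own mapper entry. A sketch for mapping a dm-N name back to what it actually sits on (sysfs layout assumed, standard on Linux):

```shell
#!/bin/sh
# Sketch: figure out what each dm-N node actually is. With array
# encryption, every unlocked disk is a dm-crypt mapping, so dm-3
# can exist with no cache drive present at all.
for dm in /sys/block/dm-*; do
    # Glob stays literal when no device-mapper nodes exist
    [ -e "$dm" ] || { echo "no device-mapper nodes on this system"; break; }
    name=$(cat "$dm/dm/name")
    slaves=$(ls "$dm/slaves" 2>/dev/null | tr '\n' ' ')
    echo "$(basename "$dm") -> $name (backed by: $slaves)"
done
```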

    I actually thought for a while that I had figured out the problem, as a misbehaving docker got corrupted and started spewing out files into its config directory - and since I was running minimal plugins, nothing ever reported that it was doing so, or that anything was filling up. I'm fairly sure I've fixed said docker, or at least pointed it at a suitable target, but still: the server goes space heater less often now, yet it's still happening far too often to be useful.
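    Since nothing warned me about that directory filling up, something like the following in a cron job would have caught it early. The path and threshold here are made-up examples, not my actual setup:

```shell
#!/bin/sh
# Sketch: warn when a container's config dir grows past a cap.
# Path and limit are illustrative placeholders.
dir=${1:-/mnt/user/appdata/some-docker}
limit_mb=${2:-2048}
# du exits nonzero if the dir is missing; treat that as 0 MiB used
used_mb=$(du -sm "$dir" 2>/dev/null | cut -f1)
if [ "${used_mb:-0}" -gt "$limit_mb" ]; then
    echo "WARNING: $dir at ${used_mb} MiB (limit ${limit_mb} MiB)"
else
    echo "OK: $dir at ${used_mb:-0} MiB"
fi
```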

     

    I'm guessing that the almost constant "old network address died, making a new one" is due to a docker VPN, which itself keeps detecting a loopback and kills off the packet:

    Quote

    2022-01-09 19:35:07,885 DEBG 'start-script' stdout output:
    2022-01-09 19:35:07 us=885547 Recursive routing detected, drop tun packet to [AF_INET]<--VPN Provider IP-->

    ..this happens not infrequently, but interestingly enough everything works OK (or at least as expected), and I can't quite figure out how to stop it from happening. The issue is that a client ends up trying to send packets to itself, through the tunnel, which uhh.. OpenVPN obviously doesn't like, so it kills the packets.

     

    Currently I'm guessing the shutdown is due to a thermal issue, but I can't say so conclusively. There's fairly decent cooling overall, although the area around the USB slots/north bridge does seem to get fairly hot for some reason.

    I do, however, have some pictures of the on-screen/console output once the server goes down; sadly, they're pretty much all the tail end of a trace, with a bunch of register addresses which mean ..nothing to me.

    The most understandable part was:

    > Kernel panic - not syncing: fatal exception in interrupt

    > Kernel Offset: disabled

    > ---[end kernel panic - not syncing: Fatal exception in interrupt ]---

  6. And I guess more bumping. The system is by now killing itself multiple times a day, although I did manage to trace the php killing to a docker container, which has been removed. At this point even thinking about a parity check is a joke, as it just slows down the server until the next kill restarts the process.

     

    There's nothing particularly consistent about it. Sometimes I'm doing something that uses a good amount of resources and it works fine; other times the server dies. Often it's just left on its own and then dies. One time I even left it at the unRaid login prompt, with nothing mounted or done (not even logged in), and yet it managed to kill itself in the ~10 hours or so it was just ..standing at an idle login prompt, with no workload at all.

    The logs are of little help to me, and I can't claim to be able to decipher the kernel panic log that remains on screen whenever the server crashes. It's not even exactly the same every time, although it does consistently appear to be of the "not syncing: fatal exception in interrupt" type - whatever that means beyond a fatal error in an interrupt, I haven't the foggiest.

  7. Update, I guess. Not going to be helpful for anyone with a similar problem, I suspect.

     

    Got parity upgraded by rebuilding it, then upgraded drive by rebuilding onto the old parity. So parity swap, manual style. 

     

    Quite annoying, but worked first time I tried it.

     

    Currently the server is busy crashing/freezing itself about every 1-2 days, killing some flavour of php (php-7, I think?) for using too much memory, due to a pathetically low limit being set ..somewhere that I can't find. Apparently it can only use ~270-something MB of RAM; despite the system having 16GB available, that is above the limit, so it gets reaped by the OOM killer. I'm also getting errors for a DM-3 device, which is curious, as I don't have a cache drive installed (never have), and I can only find mentions of that designation in threads about cache SSDs.
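    For anyone hunting a similar invisible limit: PHP reads memory_limit from whichever ini file the binary loads, and you can ask it directly. This sketch assumes a php CLI is reachable; on a stock unRaid box the webGUI's PHP may be a different binary with its own ini.

```shell
#!/bin/sh
# Sketch: find which ini file PHP loads and what memory_limit it sets.
# Assumes a 'php' CLI exists; the webGUI's PHP may be a different
# binary reading a different ini file.
if command -v php >/dev/null 2>&1; then
    php -r 'echo "memory_limit = ", ini_get("memory_limit"), "\n";'
    php --ini | grep -i 'loaded configuration'
else
    echo "no php CLI found; check the webserver's php.ini instead"
fi
```

    A ~270 MB cap would show up here as something like `memory_limit = 256M` plus overhead; the OOM killer, by contrast, enforces cgroup/container limits, which are a separate knob.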

     

    Dockers have been nerfed: all are running with memory limits, and none of them come close to those limits. Only running a handful of dockers changes ..nothing, and at this point essentially all plugins have been uninstalled, to no avail. The system still kills itself randomly, with regularity, displaying the same symptoms each time it happens:

    Fans running at a decent clip, pumping hot air out of the case (despite unRaid being frozen and unresponsive to even a basic ping). Interestingly, the network interface lights still light up, but overall, apart from functioning as a space heater, the server is not functional.

  8. Right. I recently pretty much filled my initial unRaid array, and had a disk that was throwing some errors, although it had been stable for a fair while (I'm not entirely convinced the disk is the problem, rather than a temporary issue where I bumped the cable with the array up, reseated it, and it racked up a ton of errors in the meantime). But I digress - I figured I'd swap that disk for a bigger one, and since I got a deal on a 14TB drive, with my current 12TB parity, I had to upgrade parity first.

     

    I had, however, learnt about the parity swap procedure before that, so I figured it would make sense to do that instead, and use the old disk with the errors as a scratch drive for something. Anyhow, I ran a pre-clear on the new parity drive, which came up fine, and then followed the process for the parity swap, which goes fine ... until it doesn't.

     

    The progress will at some point just stop, stuck wherever it got to, never advancing. Specifically, once this

    Quote

    Dec 6 15:43:50 Tower kernel: general protection fault, probably for non-canonical address 0xf7ff8883d1ed11a0: 0000 [#1] SMP NOPTI

    Dec 6 15:43:50 Tower kernel: CPU: 3 PID: 5729 Comm: kworker/u8:4 Not tainted 5.10.28-Unraid #1

    appears in the syslog, there's pretty much a 100% chance of parts of the server being locked up: the parity swap stuck, the relevant disks spinning down at their designated spin-down time, and zero chance of a clean shutdown or anything like that. Most functionality is still retained - at least insofar as a server with no mounted array can be said to have functionality. Usually the webGUI works, although it can crash as well; logs are accessible, but shutdown commands just get logged, without shutting anything down.

     

    So far I've tried several times to get the process completed; most successfully, it got to 100%, then ..got stuck, of course. The syslog did actually manage to capture the successful termination of the old->new parity copy, but since whatever needed to run afterward wasn't run, this was overall not a success, and the array didn't recognize the new parity drive as correct.

    Things I've done to try to keep it from happening:

    • Ran memtest several times, at varying lengths (including just running it all night); all came up clear.

    • Made most attempts while running in safe mode, to ensure the problem didn't come from a plugin.

     

    So far I'm at my wit's end, as I can't find any clear indication of what is trying to access places it shouldn't, or why, or how to prevent it - the PID listed in any error message is long dead by the time I see it.

     

    For some interesting reason, the drives will all show as "Device encrypted and unlocked" once the server fails in this manner, regardless of whether or not I actually mounted/unlocked the array before initiating the parity swap.

    tower-diagnostics-20211206-1716.zip
