Everything posted by Helmonder

  1. Since I am redoing the complete cache drive anyhow, I decided to also recreate my Docker image file and redownload all my dockers. It is actually a very easy process (a minimal command-line sketch of the first steps follows below this post):
     1) Turn off Docker in Settings.
     2) Delete the Docker image file.
     3) Turn Docker back on in Settings (this recreates the image file).
     4) All of your dockers are now gone.
     5) Choose "Add Docker" in the Docker screen and look up each docker under "User templates" in the drop-down; that reinstalls the docker with all your previous settings and mappings.
     6) Set the dockers to auto-start if you had that before (this is not automatic).
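     A minimal command-line sketch of steps 1 to 3, assuming the image sits at the common default location; the actual path for your system is shown under Settings > Docker:

     # 1) Disable the Docker service first (Settings > Docker > Enable Docker: No)
     # 2) Remove the old image file (this path is the usual default and an assumption)
     rm /mnt/user/system/docker/docker.img
     # 3) Re-enable Docker in Settings; Unraid recreates a fresh docker.img automatically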
  2. It's still a hassle... but to be preferred over having a format option that people might make mistakes with... I don't mind jumping through a few hoops.
  3. 1) I copied the complete contents of the cache drive to a share in my array. Before I did that I turned off Docker and KVM in Settings (them being active might interfere with the copy).
     2) I ran the copy through PuTTY, but inside "screen" to make sure an interrupted SSH session would not kill the copy (I used MC; an equivalent approach is sketched below this post).
     3) The copy went fine (no errors), but just to make sure it really, really was OK I compared the file sizes of the copy and the original; they were the same.
     4) Then on to reformatting the cache drive. That is not a straightforward process, it appears... there is only a format button when a drive is not formatted. There used to be a way around this:
     - Stop the array.
     - Go to Main, select the cache drive, change the file system if you need to, then press format.
     - In case the file system was already what you wanted, you had to change to a file system you did not want, format, and then do it the other way around.
     Now, however, there is only the option for BTRFS, so this does not work any more. To get the disk to a point where I could reformat it, I did in the end:
     - Stop the array.
     - Remove the cache drive from the array (not physically, just change to "no drive" in the cache selection).
     - Start the array; the drive will now show up as an unassigned drive.
     - Run a limited preclear (erase only); that removes the file system.
     - Stop the array again.
     - Add the cache drive back in its original spot.
     - Start the array and format the drive (which is now an option).
     5) Now copying all the data back from the array to the cache drive.
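     A minimal sketch of the copy in step 1; the post used MC inside screen, but rsync inside screen works the same way and can simply be re-run if interrupted. The share name "cachebackup" is an assumption (use any array share, ideally one set not to use the cache):

     screen                                                      # survive a dropped SSH session
     rsync -avh /mnt/cache/ /mnt/user/cachebackup/               # copy everything, preserving attributes
     rsync -avhc --dry-run /mnt/cache/ /mnt/user/cachebackup/    # optional verify pass: lists any file whose checksum differs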
  4. Thanks for the heads-up! I run monthly appdata backups, but I will make an extra one right now and then do the reformat.
  5. After a reboot this morning my cache drive seems to be unmountable... No idea what is going on... The syslog is attached. Error messages in the log are as below:

     May 8 09:51:08 Tower kernel: ACPI: Early table checksum verification disabled
     May 8 09:51:08 Tower kernel: spurious 8259A interrupt: IRQ7.
     May 8 09:51:08 Tower kernel: floppy0: no floppy controllers found
     May 8 09:51:08 Tower kernel: random: 7 urandom warning(s) missed due to ratelimiting
     May 8 09:51:09 Tower rpc.statd[1802]: Failed to read /var/lib/nfs/state: Success
     May 8 09:51:09 Tower ntpd[1832]: bind(19) AF_INET6 fe80::1c3e:aeff:fe3a:defa%13#123 flags 0x11 failed: Cannot assign requested address
     May 8 09:51:09 Tower ntpd[1832]: failed to init interface for address fe80::1c3e:aeff:fe3a:defa%13
     May 8 09:51:28 Tower avahi-daemon[11706]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns!
     May 8 09:51:40 Tower kernel: WARNING: CPU: 2 PID: 12688 at fs/btrfs/extent-tree.c:6795 __btrfs_free_extent+0x1fd/0x8e4
     May 8 09:51:40 Tower kernel: CPU: 2 PID: 12688 Comm: mount Not tainted 4.19.37-Unraid #1
     May 8 09:51:40 Tower kernel: Call Trace:
     May 8 09:51:40 Tower kernel: BTRFS error (device nvme0n1p1): unable to find ref byte nr 1037649829888 parent 0 root 5 owner 77097 offset 230969344
     May 8 09:51:40 Tower kernel: BTRFS: Transaction aborted (error -2)
     May 8 09:51:40 Tower kernel: WARNING: CPU: 2 PID: 12688 at fs/btrfs/extent-tree.c:6801 __btrfs_free_extent+0x250/0x8e4
     May 8 09:51:40 Tower kernel: CPU: 2 PID: 12688 Comm: mount Tainted: G W 4.19.37-Unraid #1
     May 8 09:51:40 Tower kernel: Call Trace:
     May 8 09:51:40 Tower kernel: BTRFS: error (device nvme0n1p1) in __btrfs_free_extent:6801: errno=-2 No such entry
     May 8 09:51:40 Tower kernel: BTRFS: error (device nvme0n1p1) in btrfs_run_delayed_refs:2935: errno=-2 No such entry
     May 8 09:51:40 Tower kernel: BTRFS: error (device nvme0n1p1) in btrfs_replay_log:2277: errno=-2 No such entry (Failed to recover log tree)
     May 8 09:51:40 Tower kernel: BTRFS error (device nvme0n1p1): pending csums is 134717440
     May 8 09:51:40 Tower root: mount: /mnt/cache: mount(2) system call failed: No such file or directory.
     May 8 09:51:40 Tower emhttpd: /mnt/cache mount error: No file system
     May 8 09:51:40 Tower kernel: BTRFS error (device nvme0n1p1): open_ctree failed

     The cache drive is still listed as a cache drive, just with an unmountable file system. The SMART attributes do not show anything I recognise as an issue:
     Critical warning 0x00 - Temperature 36 Celsius - Available spare 100% - Available spare threshold 5% - Percentage used 4% - Data units read 155,230,378 [79.4 TB] - Data units written 90,224,490 [46.1 TB] - Host read commands 464,542,688 - Host write commands 539,484,666 - Controller busy time 2,395 - Power cycles 21 - Power on hours 2,684 - Unsafe shutdowns 13 - Media and data integrity errors 0 - Error information log entries 10,922 - Warning comp. temperature time 0 - Critical comp. temperature time 0

     Balance and scrub cannot be run "because array is not started" (the array is of course started and working). I have started the array in maintenance mode so I can run the btrfs filesystem check in read-only mode; the results are as follows:

     [1/7] checking root items
     [2/7] checking extents
     ref mismatch on [1037649817600 8192] extent item 255, found 1
     data backref 1037649829888 root 5 owner 77097 offset 230969344 num_refs 0 not found in extent tree
     incorrect local backref count on 1037649829888 root 5 owner 77097 offset 230969344 found 1 wanted 0 back 0xcd9f170
     incorrect local backref count on 1037649829888 root 5 owner 77097 offset 17208183807669456896 found 0 wanted 4287137790 back 0x17974a30
     backref disk bytenr does not match extent record, bytenr=1037649829888, ref bytenr=0
     backpointer mismatch on [1037649829888 4096]
     ERROR: errors found in extent allocation tree or chunk allocation
     [3/7] checking free space cache
     [4/7] checking fs roots
     [5/7] checking only csums items (without verifying data)
     [6/7] checking root refs
     [7/7] checking quota groups skipped (not enabled on this FS)
     Opening filesystem to check...
     Checking filesystem on /dev/nvme0n1p1
     UUID: 344c37ac-26f1-4307-8451-1116b06922be
     found 238952861696 bytes used, error(s) found
     total csum bytes: 172892316
     total tree bytes: 1707900928
     total fs tree bytes: 1359200256
     total extent tree bytes: 124682240
     btree space waste bytes: 369061441
     file data blocks allocated: 1187284238336
     referenced 233465798656

     Since errors were found I changed --readonly to --repair and started a new check, allowing BTRFS to fix itself. It looks, however, like a dialogue is now presented that is waiting for input I of course cannot give through the web page:

     enabling repair mode
     Opening filesystem to check...
     Checking filesystem on /dev/nvme0n1p1
     UUID: 344c37ac-26f1-4307-8451-1116b06922be
     repair mode will force to clear out log tree, are you sure? [y/N]:

     To make sure something else is not rotten I stopped the array, unassigned the cache drive, started the array without the cache drive, stopped the array and re-added the cache drive. The cache drive comes back, but again without a file system. Since the BTRFS repair option still might work but appears to be stuck on a prompt, I want to run it through the command line. Unfortunately the /dev/ name listed for the cache drive does not seem to work; if I give:
     btrfs check --repair /dev/nvme0n1
     it comes back with a remark that there is no btrfs filesystem there. I checked the log to see how the check is run through the GUI; this gives a different /dev/ name: /dev/nvme0n1p1. I am now running the following command:
     btrfs check --repair /dev/nvme0n1p1
     Unfortunately it comes back as aborted; the output is as follows:

     root@Tower:/dev# btrfs check --repair /dev/nvme0n1p1
     enabling repair mode
     Opening filesystem to check...
     Checking filesystem on /dev/nvme0n1p1
     UUID: 344c37ac-26f1-4307-8451-1116b06922be
     repair mode will force to clear out log tree, are you sure? [y/N]: Y
     [1/7] checking root items
     Fixed 0 roots.
     [2/7] checking extents
     ref mismatch on [1037649817600 8192] extent item 255, found 1
     repair deleting extent record: key [1037649817600,168,8192]
     adding new data backref on 1037649817600 root 5 owner 77097 offset 188153856 found 1
     Repaired extent references for 1037649817600
     data backref 1037649829888 root 5 owner 77097 offset 230969344 num_refs 0 not found in extent tree
     incorrect local backref count on 1037649829888 root 5 owner 77097 offset 230969344 found 1 wanted 0 back 0xce5cd30
     incorrect local backref count on 1037649829888 root 5 owner 77097 offset 17208183807669456896 found 0 wanted 4287137790 back 0x17a32240
     backref disk bytenr does not match extent record, bytenr=1037649829888, ref bytenr=0
     backpointer mismatch on [1037649829888 4096]
     repair deleting extent record: key [1037649829888,168,4096]
     adding new data backref on 1037649829888 root 5 owner 77097 offset 230969344 found 1
     Repaired extent references for 1037649829888
     Failed to find [253425188864, 168, 16384]
     btrfs unable to find ref byte nr 253425221632 parent 0 root 2 owner 0 offset 0
     transaction.c:195: btrfs_commit_transaction: BUG_ON `ret` triggered, value -5
     btrfs[0x43e9f2]
     btrfs(btrfs_commit_transaction+0x1ae)[0x43efce]
     btrfs[0x45d282]
     btrfs(cmd_check+0xc07)[0x45fff7]
     btrfs(main+0x8e)[0x40dcbe]
     /lib64/libc.so.6(__libc_start_main+0xeb)[0x14f732db9b5b]
     btrfs(_start+0x2a)[0x40deba]
     Aborted

     I have tried the same with the array not running... same result. I ran the fix a couple more times, because the output seemed slightly different every time; maybe it was working itself through something. I got through it without an abort after 4 tries. When I now boot the array in maintenance mode and do a read-only check I get the following output:

     [1/7] checking root items
     [2/7] checking extents
     [3/7] checking free space cache
     [4/7] checking fs roots
     [5/7] checking only csums items (without verifying data)
     [6/7] checking root refs
     [7/7] checking quota groups skipped (not enabled on this FS)
     Opening filesystem to check...
     Checking filesystem on /dev/nvme0n1p1
     UUID: 344c37ac-26f1-4307-8451-1116b06922be
     cache and super generation don't match, space cache will be invalidated
     found 238952861696 bytes used, no error found
     total csum bytes: 172892316
     total tree bytes: 1707900928
     total fs tree bytes: 1359200256
     total extent tree bytes: 124682240
     btree space waste bytes: 369061441
     file data blocks allocated: 1187284238336
     referenced 233465798656

     This basically looks error free, I think? The cache drive continues to appear without a file system though, even after stopping and restarting the array. Therefore I did it again: I stopped the array, unassigned the cache drive, started the array without the cache drive, stopped the array and re-added the cache drive. Then I started the array in maintenance mode. There is no message relating to an unmountable file system any more. I then stopped the array and restarted it normally (without maintenance mode). Now the array comes back up without a missing filesystem. The cache drive appears to be back in full operation and the dockers are running again... Issue solved... but any idea what went wrong here? (A condensed sketch of the repair sequence follows below this post.) tower-syslog-20190508-0756.zip
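     A condensed sketch of the sequence that worked in the end, run from an SSH session with the array started in maintenance mode so the cache pool is not mounted (device name as found in the log above; adjust for your own system):

     btrfs check --readonly /dev/nvme0n1p1    # report-only pass, lists the extent tree errors
     btrfs check --repair /dev/nvme0n1p1      # answer 'y' to the "clear out log tree" prompt; in this case it took several runs before it completed without aborting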
  6. I just installed this docker and it works great, although it is extremely slow... But then you only have to set it up once. I am wondering whether it is still maintained, though... The current version in the docker is a few releases behind. Does anyone know?
  7. But those call traces I get with macvlan are not what should happen, right?
  8. Well.. disabling bonding does not cause the promiscuous mode to go away.
  9. Mmm... Did some more searching: https://www.linuxquestions.org/questions/linux-security-4/kernel-device-eth0-entered-promiscuous-mode-756884/

     "As explained before promiscuous mode means a packet sniffer instructed your ethernet device to listen to all traffic. This can be a benign or a malicious act, but usually you will know if you run an application that provides you with traffic statistics (say ntop or vnstat) or an IDS (say Snort, Prelude, tcpdump or wireshark) or Something Else (say a DHCP client which isn't promiscuous mode but could be identified as one). Reviewing your installed packages might turn up valid applications that fit the above categories. Else, if an interface is (still) in promiscuous mode (old or new style) then running 'ip link show' will show the "PROMISC" tag and when a sniffer is not hidden then running Chkrootkit or Rootkit Hunter (or both) should show details about applications. If none of the above returns satisfying results then a more thorough inspection of the system is warranted (regardless the time between promisc mode switching as posted above being ridiculously short)."

     In my case I do not expect that there is something bad going on... The question, however, is: is this "promiscuous mode" triggered by an individual docker (and is that docker maybe "bad"), or is this mode triggered by Unraid in combination with the docker mechanism, and if so: why?

     So I checked... Starting any docker on my system will immediately trigger "promiscuous mode" to be on. It does not matter which docker it is, so that points to Unraid doing something there. I checked my log file:

     Apr 28 07:07:27 Tower rc.inet1: ip link set bond0 promisc on master br0 up
     Apr 28 07:07:27 Tower kernel: device bond0 entered promiscuous mode
     Apr 28 07:07:30 Tower kernel: device eth1 entered promiscuous mode
     Apr 28 07:08:10 Tower kernel: device br0 entered promiscuous mode
     Apr 28 07:08:12 Tower kernel: device virbr0-nic entered promiscuous mode

     The promiscuous mode is related to network bonding, which is what I am using. I found some info on it in combination with changing a VLAN's hardware address: https://wiki.linuxfoundation.org/networking/bonding

     "Note that changing a VLAN interface's HW address would set the underlying device – i.e. the bonding interface – to promiscuous mode, which might not be what you want."

     I am going to stop digging, as I am completely out of my comfort zone... and into some kind of rabbit hole; maybe this promiscuous mode has nothing to do with the issue. Anyone? (A quick check for the PROMISC flag is sketched below this post.)
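     A quick way to see which interfaces currently have the PROMISC flag set (the 'ip link show' check the quoted post mentions); read-only, run from the Unraid console or an SSH session:

     ip link show | grep -i promisc      # list interfaces whose flags include PROMISC
     ip -d link show br0                 # detailed view of the bridge interface seen in the log above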
  10. The last few lines in the log now show:

     Apr 28 11:27:35 Tower kernel: vetha56c284: renamed from eth0
     Apr 28 11:27:47 Tower kernel: device br0 left promiscuous mode
     Apr 28 11:27:47 Tower kernel: veth3808ad4: renamed from eth0

     The "renamed from eth0" lines point to the two dockers stopping. On the "device br0 left promiscuous mode" I googled a bit and read that promiscuous mode is most likely activated when some kind of traffic monitoring / sniffing is going on. Is there something resembling that active in combination with dockers in Unraid?
  11. I am seeing call traces in the current log (so at this time):

     Apr 28 11:08:45 Tower kernel: Call Trace:
     Apr 28 11:08:45 Tower kernel: <IRQ>
     Apr 28 11:08:45 Tower kernel: ipv4_confirm+0xaf/0xb7
     Apr 28 11:08:45 Tower kernel: nf_hook_slow+0x37/0x96
     Apr 28 11:08:45 Tower kernel: ip_local_deliver+0xa7/0xd5
     Apr 28 11:08:45 Tower kernel: ? ip_sublist_rcv_finish+0x53/0x53
     Apr 28 11:08:45 Tower kernel: ip_rcv+0x9e/0xbc
     Apr 28 11:08:45 Tower kernel: ? ip_rcv_finish_core.isra.0+0x2e5/0x2e5
     Apr 28 11:08:45 Tower kernel: __netif_receive_skb_one_core+0x4d/0x69
     Apr 28 11:08:45 Tower kernel: process_backlog+0x7e/0x116
     Apr 28 11:08:45 Tower kernel: net_rx_action+0x10b/0x274
     Apr 28 11:08:45 Tower kernel: __do_softirq+0xce/0x1e2
     Apr 28 11:08:45 Tower kernel: do_softirq_own_stack+0x2a/0x40
     Apr 28 11:08:45 Tower kernel: </IRQ>
     Apr 28 11:08:45 Tower kernel: do_softirq+0x4d/0x59
     Apr 28 11:08:45 Tower kernel: netif_rx_ni+0x1c/0x22
     Apr 28 11:08:45 Tower kernel: macvlan_broadcast+0x10f/0x153 [macvlan]
     Apr 28 11:08:45 Tower kernel: macvlan_process_broadcast+0xd5/0x131 [macvlan]
     Apr 28 11:08:45 Tower kernel: process_one_work+0x16e/0x24f
     Apr 28 11:08:45 Tower kernel: ? pwq_unbound_release_workfn+0xb7/0xb7
     Apr 28 11:08:45 Tower kernel: worker_thread+0x1dc/0x2ac
     Apr 28 11:08:45 Tower kernel: kthread+0x10b/0x113
     Apr 28 11:08:45 Tower kernel: ? kthread_park+0x71/0x71
     Apr 28 11:08:45 Tower kernel: ret_from_fork+0x35/0x40
     Apr 28 11:08:45 Tower kernel: ---[ end trace c12044621539eec0 ]---

     This seems to correspond with "macvlan" as discussed in the following post: https://forums.unraid.net/topic/75175-macvlan-call-traces/ Maybe tonight I experienced a kernel panic as a result of this macvlan issue? Weird though... the server has been stable for weeks. So maybe there is some combination going on with the amount of disk traffic the parity rebuild was causing? In the days before the parity rebuild I had two 10TB WD Reds doing a preclear, and that did not cause an issue. So the parity sync might be the thing that pushes something over the edge. I had actually already closed down all of my dockers except for Pi-hole and the HA-Bridge for Domoticz; I have now turned those off as well. Since there is no docker with its own IP address running any more, I would expect the errors to go away now. (A quick way to check which containers are on the custom br0 network is sketched below this post.)
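     For reference, a small sketch (assuming the custom Docker network is named br0, as in the log above) to see which containers are attached to it and therefore have their own IP via macvlan:

     docker network ls                                                               # list Docker networks; the custom macvlan network shows up as br0
     docker network inspect br0 --format '{{range .Containers}}{{.Name}} {{end}}'    # names of containers currently attached to br0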
  12. The system has come up again with invalid parity; it also restarted the parity sync / data rebuild. The notification in the GUI told me that the parity sync was completed successfully (of course this wasn't the case; separate issue in the notification system). The preclear was stopped but could be resumed (it restarted the post-read phase). I do not know if it is valuable, but I included the current syslog. I also checked the log rotation, but the previous log is five days old, so that will not help. Added: all dockers show an update ready, which does not seem correct; this was not the case yesterday. I have now stopped all dockers to give the array some rest during the parity rebuild. syslog
  13. I will now restart the array; it is not responding to anything.
  14. Yesterday evening I stopped my array and assigned a new parity drive to the primary parity slot (a 10TB WD Red instead of an 8TB). At the same time a preclear was also running on another 10TB WD Red. This morning I woke up to the system being unresponsive: no web GUI and no shares (I was plexing away fine yesterday evening). PuTTY also would not work. I could still get in via IPMI and tried to salvage the syslog; the system was responding to my input, but an ls on /mnt/user caused my console to become unresponsive (I left it like this for half an hour hoping it would get out of it, but it hasn't). I have been able to make a few screenshots, hoping this shows something. unraid_crash.docx
  15. if grep -q "Out of memory" /var/log/syslog; then
          /usr/local/emhttp/plugins/dynamix/scripts/notify -e "OOM Checker" -s "Checked for OOM in syslog" -d "OOM error found in syslog" -i "alert"
      fi
  16. OOM is not really an issue any more; are you sure you need this?
  17. And again... I have changed it to 16384M now.

     ERROR: CrashPlan for Small Business is running out of memory. The application crashed because of lack of memory. More memory needs to be allocated. This can be done via the CRASHPLAN_SRV_MAX_MEM environment variable.
  18. Now it does not continue starting, but the error is back. I will enlarge the memory to 8192M.

     [app] starting CrashPlan for Small Business...ERROR: CrashPlan for Small Business is running out of memory. The application crashed because of lack of memory. More memory needs to be allocated. This can be done via the CRASHPLAN_SRV_MAX_MEM environment variable.
  19. The GUI cannot connect to its background process (a CrashPlan notification, not the docker VNC).
  20. It is running, but I see an error message now... Maybe that was causing issues before as well: the application crashed because of lack of memory, increase CRASHPLAN_SRV_MAX_MEM. How do I go about that?
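     A hedged sketch of one way to deal with that variable, assuming the container exposes it exactly as the error message says (the container name "crashplan" below is hypothetical; use the name shown on the Unraid Docker tab):

     docker exec crashplan env | grep CRASHPLAN_SRV_MAX_MEM   # show the value currently in effect
     # To raise it, edit the container on the Docker tab and add/adjust the environment
     # variable CRASHPLAN_SRV_MAX_MEM (e.g. 8192M), which is the equivalent of passing
     #   -e CRASHPLAN_SRV_MAX_MEM=8192M
     # on a manual docker run.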
  21. I installed a docker update yesterday, and after that it changed to a complete black screen... I just saw there is another one now and am installing that. If there are still issues I will post a screenshot!