
arough

Posts posted by arough

  1. I think I got it fixed.

     

    Here's what I did in the end:

    • (disable every autostart function. Array, docker service and VMs)
    • copy all important files from `appdata` and everything in `system` to a backup folder on the array via rsync
      • Curiously I only got one error while syncing. If any copied files are still broken, that's the price I have to pay for not having a recent backup ¯\_(ツ)_/¯
      • I included the -a option so I wouldn't have problems with wrong permissions for the docker containers in the end. Not that I ever went through this before...
    • shrink the cache pool to one drive by unassigning the second drive (because I was at the drive limit)
    • remove the unused SSD
    • install the new SSD and create a new pool (`cachetwo`) with it
    • copy the data from step 1 to the new cache pool
    • swap every relevant share from `cache` to `cachetwo`
      • probably best to disable mover before this step, but I knew it wasn't time for it to run
      • Fix Common Problems will tell you that there is still data for the shares on the old pool. While true, it is irrelevant in this instance
    • unassign the remaining drive from the original cache (this way I could remount it if I did something wrong)
    • delete the docker.img and rebuild it via `docker -> add container` and selecting
    • (add a new second pool drive when it arrives today)
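
    The backup and restore copies in the steps above could be sketched roughly like this. It is only a sketch: the helper name `copy_tree` and the example paths are assumptions, not the exact commands used in this post. `rsync -a` (archive mode) keeps permissions, timestamps and symlinks, and ownership when run as root.

    ```shell
    #!/bin/bash
    # Sketch: copy a directory tree with rsync in archive mode.
    # -a preserves permissions, timestamps, symlinks and (as root) ownership.
    # copy_tree is a hypothetical helper name.
    copy_tree() {
        local src="$1" dst="$2"
        mkdir -p "$dst"
        # Trailing slash on the source copies its contents, not the directory itself.
        rsync -a "$src"/ "$dst"/
    }

    # Example calls (paths are assumptions; adjust to your own shares and pools):
    # copy_tree /mnt/cache/appdata /mnt/disk1/backup/appdata     # pool -> array
    # copy_tree /mnt/disk1/backup/appdata /mnt/cachetwo/appdata  # array -> new pool
    ```

    rsync carries on after a file it cannot read and reports a non-zero exit status at the end, so a single bad file does not abort the whole copy.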

    This way I also swapped out the pool drives for peace of mind.
    So the only hardware left from before are the array drives, PSU and the RAM (which is now at 3200MHz).

    For now everything seems to be stable and so far no CRC errors while downloading and unpacking .... stuff

     

    Did I do anything that could bite me in the ass later?

     

    Thank you very much for helping! I only would have stumbled across the Ryzen RAM issue after nights of furiously googling and ripping my hair out. If ever :D

  2. The memtest ran for 16 hours without issues at 3200 MHz.

     

    I noticed that the backup I tried to do from the readonly zfs pool to the second flash drive wasn't successful.
    Only a few folders are on the flash drive.
    Curiously, while copying, the flash drive seemed to fill up and showed ~80GB used (even though the cache pool only holds ~55GB of data).

    Looking at it now, only 3.9GB are used and only three folders from appdata were copied.

    That would explain the plethora of errors in the syslog.

     

    Is there any way to get all the data from the pool?

    As far as I can tell, when I configure appdata to move cache -> array, it stops at the same file in the ESPHOME folder as when I copy the files manually.
    I guess this would be one of the broken files mentioned below.
     

    The GUI shows there are 12 corrupted files on the pool, but running `zpool status -v` only shows this:

    image.png.95df57c390eb9a27a31396e20f8026df.png
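
    For reference, when ZFS can name the damaged files, `zpool status -v` prints them after the line "Permanent errors have been detected in the following files:". A minimal sketch for pulling just those entries out of the output, so they can be skipped or restored one by one; the helper name and the pool name in the example are assumptions:

    ```shell
    #!/bin/bash
    # Sketch: print only the entries that `zpool status -v` lists under
    # "Permanent errors have been detected in the following files:".
    # list_zfs_error_files is a hypothetical helper; it reads the status text on stdin.
    list_zfs_error_files() {
        awk '
            /^[ \t]*pool:/ { in_list = 0 }
            /Permanent errors have been detected/ { in_list = 1; next }
            in_list && NF { sub(/^[ \t]+/, ""); print }
        '
    }

    # Example (the pool name "cache" is an assumption):
    # zpool status -v cache | list_zfs_error_files
    ```

    Entries that ZFS cannot resolve to a path show up as hexadecimal object IDs instead of file names, so the list may be shorter or less readable than what the GUI counts.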

     

     

  3. As far as I can tell I have errors on my ZFS RAID1 2-disk cache pool, which lead to a corrupt docker.img, crashing docker containers, and an unresponsive system or outright crashes.
    When everything seems to run "fine" I get CRC errors on nearly everything I download and unrar, and eventually the server seems to self-destruct and I need to hard reboot.
    Also, my Home Assistant regularly gets a broken db and starts a new one. I guess that's because it too gets corrupted by whatever is going on.

     

    I attached a few screenshots with errors I'm getting.
    The first few show the "crashed" state, where I can still access and navigate the webGUI but nothing I do in there has any meaningful effect.

    The fourth is from trying to back up the docker.img.

    I had bought new hardware and saw this as a sign to finally install the new mainboard and CPU. I also installed new SATA cables out of caution.
    I also ran memtest for ~4h and nothing came up.
    Nothing changed, but I didn't get my hopes up anyway, as the cache and/or docker.img are in a broken/corrupt state.

    I found what to do individually for a corrupt docker.img or pool, but am unsure as to how to proceed.

    I mounted the pool in readonly mode and backed up all I could (docker.img wouldn't copy but will be recreated anyway).
    I've got one new SSD on hand and will be ordering a second one.

    So far I've manually copied /appdata and docker.img from the pool to a USB drive.
    While copying them (via cp) I got a bunch of errors saying that it can't copy symlinks (last screenshot). I could force it via -L, but then when moving stuff back I would have lost the links and just have full copies of the files. I guess I have to tackle that problem when recreating the docker.img and just not copy any linked files, right?
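
    One possible way around the symlink errors, assuming they come from the USB drive's filesystem not supporting symlinks (for example FAT32 or exFAT), is to pack the folder into a tar archive instead of copying loose files. tar stores symlinks as symlinks inside the archive, so they come back intact when the archive is extracted onto a filesystem that supports them. The helper names and paths below are assumptions:

    ```shell
    #!/bin/bash
    # Sketch: archive a folder so symlinks survive a trip over a drive that cannot store them.
    # pack_dir / unpack_dir are hypothetical helper names.
    pack_dir() {
        local src="$1" archive="$2"
        # -C changes into the parent directory so the archive holds relative paths.
        tar -cf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
    }

    unpack_dir() {
        local archive="$1" dst="$2"
        mkdir -p "$dst"
        # -p restores the stored permissions.
        tar -xpf "$archive" -C "$dst"
    }

    # Example (paths are assumptions):
    # pack_dir /mnt/cache/appdata /mnt/disks/usb/appdata.tar
    # unpack_dir /mnt/disks/usb/appdata.tar /mnt/cachetwo
    ```

    Keep in mind that FAT32 limits a single file to 4 GB, so a large archive may need a different filesystem on the USB drive.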

     

     

    What steps do I take to get to a running server with a fresh docker.img and a new pool with only the new SSD (the second one will be added once it arrives)?


    Do I just

    • remove both SSDs
    • install the single new SSD
    • create a new pool
    • manually move the files back
    • delete the docker.img and recreate it and the Apps
    • ???? 
    • profit

     

    Keep in mind I have the 6 disk license and am already at the cap.

     

    Thanks in advance

    SCR-20240501-ddp.png

    SCR-20240430-all.png

    IMG_2446.png

    SCR-20240501-n52.png

    SCR-20240501-o22.png

    tower-diagnostics-20240501-1653.zip

  4. Thanks @trurl.
    Will delete the shares.

     

     

    I forked CA and hacked together an old version that still runs on 6.8.3 to potentially fix the Minecraft docker.
    In the end CA wasn't even needed to fix it :D

    I got Minecraft running again (was an issue with an old version of `runc` that is already updated on UnRAID 6.9.2+)

    The Minecraft server is running great again. Peaks to 90% when someone is joining or exploring new areas, but when playing in already generated chunks UnRAID hovers around 20%.

     

    It looks to me as if there is a direct link between my problems and the upgrade.
    Is there anything I should test while on 6.8.3 to gather data and compare it to 6.10.3?
     

    Here is another diagnostics with docker running and everything as fast as it was before.

    tower-diagnostics-20220913-1630.zip

    Are there maybe some other settings that were introduced in 6.9 or 6.10 that could mess with my setup?
    Should I maybe upgrade to 6.9 first to see if the slowdowns happen there too?

    Is there any way to check if the Spectre and Meltdown mitigations were correctly disabled on 6.10.3 after I edited the `syslinux.cfg`?

    I would of course check this after upgrading to 6.10.3 again.
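
    For what it's worth, the Linux kernel reports the state of each mitigation under `/sys/devices/system/cpu/vulnerabilities/`, and `/proc/cmdline` shows the boot parameters that were actually applied. A small sketch for reading both; the helper name is made up, and the optional directory argument exists only to make the function testable:

    ```shell
    #!/bin/bash
    # Sketch: print the kernel's own view of each CPU vulnerability.
    # show_mitigations is a hypothetical helper; the optional argument exists only
    # so the function can be pointed at a different directory for testing.
    show_mitigations() {
        local dir="${1:-/sys/devices/system/cpu/vulnerabilities}"
        local f
        for f in "$dir"/*; do
            [ -f "$f" ] || continue
            printf '%s: %s\n' "$(basename "$f")" "$(cat "$f")"
        done
    }

    # Example:
    # show_mitigations
    # cat /proc/cmdline   # shows the boot parameters actually in effect
    ```

    Entries that read `Vulnerable` have no mitigation active, while entries starting with `Mitigation:` are still being mitigated. Whatever was added to the `append` line in `syslinux.cfg` should also appear in `/proc/cmdline`.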

     

     

    I wanted to upgrade to a Ryzen 3600 in the coming weeks and thought it would be good to upgrade UnRAID before that.
    Maybe that wasn't the best idea :D

  5. As far as I can tell this is only relevant for `prefer/only` right?

     

    Spoiler

    image.thumb.png.8802dce5b2551b4d87b7019b106b8e68.png

     

    The other ones already had folders on array disks and just used them.

    As there were no docker containers running and nothing was moved in the `running without cache` state, only empty folders were created and the complete `system` share was copied somehow.

    How should I proceed from here?
    Delete the empty folders manually via the console and delete the `system` folder on disk4?
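
    If the manual route is taken, it may help to list the empty folders first and only then remove them. A cautious sketch that only prints and deletes nothing; the helper name and the example folder are assumptions:

    ```shell
    #!/bin/bash
    # Sketch: list empty directories below a starting point without deleting anything.
    # list_empty_dirs is a hypothetical helper name.
    list_empty_dirs() {
        local root="$1"
        # -mindepth 1 leaves the starting directory itself out of the result.
        find "$root" -mindepth 1 -type d -empty
    }

    # Example (disk4 is taken from the post; review the list before removing anything):
    # list_empty_dirs /mnt/disk4
    # rmdir /mnt/disk4/some/empty/folder   # rmdir refuses to remove non-empty folders
    ```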

     

    Spoiler

    image.png.ffe1333f9d67ce8365231c254bb73e2a.png

     

  6. God damn it...
    forget every iperf log I posted before.
    For some reason it is way slower over my Mac than with my PC.
    The two top ones are from my windows PC.

     

    Spoiler

    image.thumb.png.f0e80c28933bface88eb969fbe8722f7.png

     

    But UnRAID is nevertheless faster with 6.8.3.

    The duplicati backups take 1-2 minutes again.
    The minecraft server for some reason will not start so I can't test it right now.
    And because CA is missing (the newest version isn't compatible with 6.8 anymore and I can't find a way to install an old one) I can't install another minecraft app/docker.

    Is there any way to do this?

     

    Plex seems to behave better.
    It is now transcoding for ~30 secs at 100% to fill the buffer and then only spiking every few seconds for like a second to fill the buffer again.

     

    While moving the same file (on my Mac again to keep it comparable) htop now looks like this.
     

    Spoiler

    image.thumb.png.9a71147d26cb61e166944663ae17610c.png

     

    Way lower load than with 6.10.3.

     

    Moving files via FileBot is now back to ~450 MB/s again, as opposed to ~140 MB/s with 6.10.3.
    I use FileBot on Windows, but it is transferring directly on the cache drives as source and target are both on it.

     

    Should I update to 6.10.3 again and see how iperf performs with my PC so the data I present to you is actually useful? 😅

     

    EDIT: Just assigning the cache drives again worked btw.

  7. Sadly that didn't help.

     

    Spoiler

    image.thumb.png.7be4ab7455e8e6e9b1303d511a21abd5.png

     

    iperf was even slower for a bit, but I don't think that's because of the disabled mitigations.
    At least I couldn't explain how it should have an effect on it.

     

    Spoiler

    image.thumb.png.70e51620d76b82ff9d60c871032e0dba.png

     

    While copying a 2.1GB file this is what htop looks like:

     

    Spoiler

    image.thumb.png.18589c5a11fe7c50fad859342abf0ac5.png


    And the UnRAID GUI: 

     

    Spoiler

    image.png.11876977e33dbca99535bc96d0203a21.png

     

    As you can see there is some activity after the file transfer is complete.
    The progress bar for the transfer is at 100% and my MacBook isn't sending data anymore.
    Only once it's at 100% is the cache written to at its normal speed of ~500MB/s.

    Same behaviour with my Windows PC.
    It feels like there is some kind of slower cache in between. It definitely isn't the RAM as it would be faster and the RAM usage isn't going up significantly while transferring.

     

    And of course I'm not using WiFi for all of this.

  8. Hey,

     

    I updated my server from 6.8 to 6.10 and now pretty much everything runs with high CPU load and/or slower than before the update.

     

    iperf is somewhat slow.
     

    Spoiler

     

    image.thumb.png.a9bc4e8c6c6c801f208bd29a62a5679f.png

     

     

    Before the update I was able to move files from a local machine to the server at the full 1Gbit/s and now I'm only getting 70-80MB/s.

    In the GUI it looks like there isn't even any activity on the cache drives until the transfer is nearly done, and then it goes up to the ~500MB/s of my MX500s for the last few seconds.

    When moving files on the server the speed normally went up to ~400MB/s and now hovers around 140MB/s.
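
    To put the network numbers above on one scale: links are rated in bits per second, while file copies are usually shown in bytes per second. A quick sketch of the conversion (the helper names are made up for illustration, and protocol overhead is ignored):

    ```shell
    #!/bin/bash
    # Sketch: convert between line rate (Mbit/s) and file-copy speed (MB/s).
    # 1 byte = 8 bits; protocol overhead is ignored, so real copies land a bit lower.
    mbit_to_mbyte() { echo $(( $1 / 8 )); }
    mbyte_to_mbit() { echo $(( $1 * 8 )); }

    mbit_to_mbyte 1000   # 1 Gbit/s link -> 125 MB/s upper bound
    mbyte_to_mbit 70     # 70 MB/s observed -> 560 Mbit/s
    mbyte_to_mbit 80     # 80 MB/s observed -> 640 Mbit/s
    ```

    So 70-80 MB/s corresponds to roughly 56-64% of what a gigabit link can carry.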

     

    One measurable slowdown is with my duplicati backup. It normally took 1-2 minutes when nothing had changed and only the checks were running. It now takes 5-7 minutes.

     

    Spoiler

    image.thumb.png.950017f29ba78e3cd1d66059a517fbc0.png

     

    A minecraft server I set up a few days ago ramps up my CPU to ~90% with one player connected.
    Before the update four players were able to play without that much load on the CPU.

     

    Plex also uses nearly 100% when transcoding for one user and ~50% when direct playing.

    Those are the things I noticed so far, I think.

     

    I don't understand what is happening here. The network isn't up to snuff and the CPU is overloading constantly.
    The drives seem to be fine, when checking the speed via the DiskSpeed docker.
    (One of my array drives is collecting `Reallocated sector count` errors, but I'm already in contact with Seagate for advanced RMA. And as this seems to be a problem when using the cache drives, I don't see how that could be related)

     

    UnRAID forgot the settings for the SMART notifications after the update so I was getting the "normal" `current pending sector count` errors every MX500 gets. I didn't find any other settings UnRAID "lost" after the update so far.

     

    I attached the diag archive.

     

    Can someone make any sense of all this?
     

    Thanks in advance.

    tower-diagnostics-20220912-2100.zip

  9. Hey guys,

     

    loving this nextcloud docker.

    Have it working with MariaDB and LetsEncrypt Reverse Proxy.

     

    But I have two problems with it.

    When I log out of e.g. the admin account to log in with my user account, it doesn't redirect me to the login page.

    I press logout and nothing happens. If I press F5 it redirects me, but not automatically.

    But this is only a minor problem and nothing too serious.

     

    Next, I am apparently in an update loop.

    I update, check for updates and there is another one ready.

    The docker says there is an update but only this happens:

    image.png.f5f9fb96ae6f9d122880c8d7b525dc4a.png

     

    Sometimes it does a real update and downloads some files, one of which is ~150 MB in size.

    But I am still on 17.0.2.

    Is there no update for v18 in this docker yet?

     

    Thanks in advance guys ;)

     
