Posts posted by -C-

  1. 2 hours ago, JorgeB said:

    There already is a warning not to format disks during a rebuild, and the text has been changed a couple of times to make it as clear as possible, but it seems it doesn't help in some cases.

     

    Format warning new v6.8.png

     

    I took that as a standard "formatting will remove all data on this disk" type of warning. I didn't think any of it was relevant to what I was doing: I had an empty disk with no data on it to worry about. To me, "...will normally lead to loss of all data on the disk being formatted" is about losing any data on the disk being added; it doesn't mention anything about a different format breaking the rebuild process.

  2. Thanks Jorge,

     

    I followed the official guide here: https://docs.unraid.net/unraid-os/manual/storage-management/#replacing-faileddisabled-disks

     

    I had missed the point in the notes there "The rebuild process can never be used to change the format of a disk - it can only rebuild to the existing format."

    I wasn't using the process to change the format; I was replacing a failed drive, so I skipped over this point. Then, when it came to adding the new disk, I figured that since I had to format the disk anyway, I might as well switch to ZFS so I could take advantage of some of its benefits over XFS, like replication from the cache.

     

    Now that I think about how parity works, I realise that there was no way it could have worked with a different format type. It was my mistake. This was the first time I rebuilt a disk from parity, so it was all new to me. Fortunately, none of the data on that disk was irreplaceable and I have a backup of most of it.

     

    I do think there should be a bigger warning about this in the manual, making it clearer that a change of format will stop the rebuild from working, especially now that I imagine there are others like me who'd like to take advantage of ZFS since 6.12 and may be tempted to do what I did.
    Even better would be if, when a rebuild is about to start, the system could check whether the target disk has a different format from the one being replaced and give a nice clear data-loss warning.
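
    Even from the CLI it's easy to see what filesystem an array device currently carries before deciding to format anything; a quick check along these lines should have told me the emulated disk was still XFS (the device path here is only an example, substitute the disk in question):

    # print the filesystem type of an array device before touching it
    # (/dev/md1p1 is just an illustration -- use the device for the disk being rebuilt)
    blkid -o value -s TYPE /dev/md1p1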

  3.  

     

    I had some trouble during the rebuild and had to restart the server, but the rebuild appeared to complete successfully:

    [screenshot attached]

     

    Any idea what could have caused this?
    This is the first time a disk has died without warning and I've needed to use parity.

     

    I'm assuming the data from the failed drive's gone. Fortunately, I have most of it backed up.

     

    I'm using the syslog server to save the logs, so I should have a record of everything if it's of use.
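
    To hunt for whatever happened to the disk, I can search the saved log for errors with something like this (path is the standard local syslog; adjust to wherever your syslog server writes):

    # pull error lines out of the saved log
    grep -i error /var/log/syslog | head -50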

  4. Just now, itimpi said:

    I wonder if you got an unclean shutdown (or the plugin erroneously thought one had happened) as that would stop anything being restarted.

    Certainly possible. The system wasn't happy when I rebooted it (which was the reason for the reboot) and it may have killed hung processes in order to reboot. It certainly took longer than usual.

     

    (I used powerdown -r to restart in case that makes any difference.)

  5. 2 hours ago, JorgeB said:

    Probably only a reboot will help.

    In which case I'm stuck in a loop for now: I rebooted the first time it happened and everything was stable for a day or so before it happened again, without any reason I can find.

     

    What's painful is that the rebuild is running slowly: when I could last access the GUI I was getting around 10-30 MB/s, so I'll likely be stuck without a GUI for at least another day. I've not had a disk fail without warning before, so I've never had to rebuild from parity like this, and I'm not sure whether that's normal. It's certainly running a lot slower than a correcting check.
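
    As a rough sanity check on why that feels so slow (the sizes here are purely illustrative, assuming a 20TB disk rebuilt at a sustained 20 MB/s):

    # back-of-envelope rebuild time: bytes / observed speed
    echo $(( 20 * 10**12 / (20 * 10**6) ))        # ~1,000,000 seconds at 20 MB/s
    echo $(( 20 * 10**12 / (20 * 10**6) / 3600 )) # ~277 hours, i.e. more than 11 days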

     

    38 minutes ago, itimpi said:


    Are you sure?   I am sure it used to.   I will have to check this out again.

    That's what happened when I rebooted part way through the rebuild yesterday. Not sure if that's normal though.

    I've disabled the mover, as the rebuild was stopping for the daily move and not restarting afterwards.

  6. Just tried logging into the Unraid GUI and am now getting this:

    [screenshot attached]

     

    Load is pegged again:

    top - 12:24:40 up 1 day, 12:32,  1 user,  load average: 52.86, 52.47, 52.31
    Tasks: 1152 total,   1 running, 1151 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  2.9 us,  5.2 sy,  0.0 ni, 82.7 id,  9.1 wa,  0.0 hi,  0.1 si,  0.0 st
    MiB Mem :  31872.3 total,   6157.1 free,  12252.2 used,  13463.0 buff/cache
    MiB Swap:      0.0 total,      0.0 free,      0.0 used.  18476.1 avail Mem
    
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    11805 root      20   0  975184 412108    516 S  93.7   1.3 158:06.42 /usr/local/bin/shfs /mnt/user -disks 31 -o default_permissions,allow_other,noatime -o remember=0
    29298 nobody    20   0  226844 109888  32124 S  22.5   0.3  12:33.84 /usr/lib/plexmediaserver/Plex Media Server
    12138 nobody    20   0  386324  70936  55404 S   6.0   0.2   0:00.28 php-fpm: pool www
     9015 root      20   0       0      0      0 S   4.3   0.0 137:11.80 [unraidd0]
    13271 nobody    20   0  386216  65896  50496 S   4.3   0.2   0:00.16 php-fpm: pool www
    22798 nobody    20   0  386744  79348  63384 S   4.3   0.2   0:17.22 php-fpm: pool www
     7495 root      20   0       0      0      0 D   1.0   0.0  26:23.74 [mdrecoveryd]

     

    Here's a list of installed plugins:

    root@Tower:~# ls /var/log/plugins/
    Python3.plg@                 dynamix.cache.dirs.plg@      dynamix.system.temp.plg@  open.files.plg@           unRAIDServer.plg@             zfs.master.plg@
    appdata.backup.plg@          dynamix.file.integrity.plg@  dynamix.unraid.net.plg@   parity.check.tuning.plg@  unassigned.devices-plus.plg@
    community.applications.plg@  dynamix.file.manager.plg@    file.activity.plg@        qnap-ec.plg@              unassigned.devices.plg@
    disklocation-master.plg@     dynamix.s3.sleep.plg@        fix.common.problems.plg@  tips.and.tweaks.plg@      unbalance.plg@
    dynamix.active.streams.plg@  dynamix.system.autofan.plg@  intel-gpu-top.plg@        unRAID6-Sanoid.plg@       user.scripts.plg@

     

    I can access files on the array OK over the network, and the rebuild is still running, albeit very slowly:

    root@Tower:~# parity.check status
    Status:  Parity Sync/Data Rebuild  (65.2% completed)

    Any advice on what I can try to get the load back down?
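
    One thing I can check from the CLI is which processes have files open under /mnt/user, since shfs is the process pegged in top above (just a starting point, not something I've confirmed helps):

    # list processes holding files open on the user share mount
    lsof /mnt/user 2>/dev/null | head -40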

     

  7. Load is still climbing:

    load average: 57.57, 57.49, 57.00

    Looks like it could be related to Docker:

     

    [screenshot attached]

     

    root@Tower:/mnt/user/system# umount /var/lib/docker
    umount: /var/lib/docker: target is busy.

     

    The parity rebuild seems to be going much slower than it should; I guess it's due to the high load.

    So I ran

    parity.check stop

    After doing so the GUI is now loading fine, but I'm getting a "Retry unmounting user share(s)" message in the GUI footer.

     

    I tried a reboot but it's hung.

     

    Via SSH I tried stopping the Docker service, but it doesn't seem to be that:

    root@Tower:/mnt/disks# umount /var/lib/docker
    umount: /var/lib/docker: not mounted.

    I left it (I wasn't sure what else to try), and eventually it restarted and things seem to be back to normal.

     

    I've now discovered that the Parity Check Tuning plugin doesn't (or can't) resume a Parity Sync/Data Rebuild in the same way it can a correcting parity check, so it's back to the beginning with that.

     

    I'm going to avoid touching anything until the rebuild's finished.
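
    For next time: when umount reports "target is busy" like it did above, something like this should show what's actually holding the mount (assuming fuser from psmisc is available, which I believe it is):

    # list processes keeping /var/lib/docker busy
    fuser -vm /var/lib/docker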

  8. I checked the logs and found the crash happened around here:
    
    Sep  3 17:00:54 Tower webGUI: Successful login user root from 192.168.34.42
    Sep  3 17:01:25 Tower php-fpm[7836]: [WARNING] [pool www] server reached max_children setting (50), consider raising it

     

    Doing some further digging on that error, I found another thread in which the poster traced the issue to the GPU Statistics plugin.

    I had just installed that a couple of days ago, so it would seem that this is likely the cause of my problem too.

     

    I successfully removed the plugin via CLI with

    plugin remove gpustat.plg

    ...but after a few minutes the system load remains high and there's still no GUI.
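
    One way to see whether the pool is still maxed out is simply to count the php-fpm processes against the limit of 50 from the warning:

    # count running php-fpm workers
    pgrep -fc 'php-fpm'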

     

    Looking like a reboot's my only option, but

    Status:  Parity Sync/Data Rebuild  (65.3% completed)
  9. I was updating some docker containers through the Docker GUI page when the page froze.

     

    Checked top via SSH and got this:
    
    top - 17:58:46 up 1 day, 21:02,  1 user,  load average: 53.92, 53.54, 53.07
    Tasks: 1107 total,   3 running, 1104 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  0.9 us,  2.6 sy,  0.0 ni, 54.9 id, 41.4 wa,  0.0 hi,  0.2 si,  0.0 st
    MiB Mem :  31872.3 total,   5800.0 free,  11092.2 used,  14980.1 buff/cache
    MiB Swap:      0.0 total,      0.0 free,      0.0 used.  19566.8 avail Mem
    
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    24398 root      20   0   34316  32632   1960 R  21.5   0.1   0:05.21 find
    12074 root      20   0  974656 409920    532 S  10.9   1.3 215:55.56 shfs
    18749 nobody    20   0  455592 102776  80756 S   5.3   0.3   0:02.04 php-fpm82
    21905 nobody    20   0  386252  77244  61756 R   4.6   0.2   0:00.31 php-fpm82
    18604 nobody    20   0  455788 105656  83604 S   4.0   0.3   0:02.86 php-fpm82
    11495 root       0 -20       0      0      0 S   3.3   0.0  41:37.86 z_rd_int_0
    11496 root       0 -20       0      0      0 S   3.3   0.0  41:39.55 z_rd_int_1
    11497 root       0 -20       0      0      0 S   3.3   0.0  41:38.24 z_rd_int_2
     7491 nobody    20   0 2711124 259320  24452 S   1.7   0.8   0:11.99 mariadbd

    Thing is, I'm partway through an array rebuild, having replaced a failed HDD. Usually I'd restart the server when the GUI gets into this state, but in this case is it safe to do so (I have the Parity Check Tuning plugin installed), or is there a CLI command I can try to bring things back?
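
    In the meantime I can at least see what that find process (PID 24398 in the top output above) is actually doing, e.g.:

    # show the full command line and working directory of the busy find process
    cat /proc/24398/cmdline | tr '\0' ' '; echo
    ls -l /proc/24398/cwd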

     

     

  10. 9 hours ago, JorgeB said:

    You can enable the syslog server and post that after a crash to see if there's something relevant logged; I don't see anything in the diags posted.

     

    Thanks for checking the diags. I have some more moving to do and will be sure to post the logs.

     

    7 hours ago, sphbecker said:

    I also use rsync and had an issue with my GUI crashing, but I never made the connection. For me, the issue seemed to stop when I upgraded to 6.12.3, but that could also be coincidence. I don't put a ton of data on my server to need to sync.

    Interesting. I've not been able to find another example of someone having the same issue, so it's somewhat reassuring to know I'm (possibly) not the only one.

    Then again, this was happening when I first started using Unraid last year (6.11.3). I didn't upgrade until the 6.12 releases came out and have kept up to date since then. Unfortunately the problem has persisted across all of the versions I've been on.

  11. I'm moving files around so I can start making use of ZFS. This issue has been going on for most of this year, but I've rarely needed to move large amounts of data around, so I haven't spent much time on troubleshooting.

     

    I started doing large moves using rsync via the CLI and had problems, so I moved to unBALANCE as it gives better visibility of what's going on. The crashing appears to be the same with either method.
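
    For reference, the CLI moves were plain rsync runs along these lines (the flags and paths here are only an illustration, not my exact command):

    # example disk-to-disk move; paths are placeholders
    rsync -avX --progress --remove-source-files /mnt/disk1/Media/ /mnt/disk3/Media/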

     

    I find that the move completes correctly, but the main Unraid GUI becomes unreachable after a while. So, for example, the whole move may take 5 hours, but the GUI becomes unreachable after 2 hours. If I try to connect with the "Array - Unraid Monitor" Android app while the GUI is down, it displays a "Server error" message. When the Unraid GUI is down, the unBALANCE GUI is not affected and still runs fine.

     

    I created the attached diagnostics while unBALANCE was running. It was a 172GB move that took nearly 50 minutes and completed successfully.

     

    This is what top looked like while the GUI was crashed after an earlier, larger move. This was many hours after the move had finished. No dockers or VMs running:
    [Screenshot 2023-08-20 12:35:16: GUI crashed, hours after running unBALANCE]

    tower-diagnostics-20230820-1400.zip

  12. Thanks to whoever's responsible for the Nicotine+ container. I tried a few times over the years to get it running on Windows and never had any joy. It fired straight up on my Unraid 6.12.2 and all looks good.

     

    A couple of things I noticed:

     

    The main thing is with the download folders specified in the template. I have them both set to reference subfolders within my /mnt/user/Audio/ share, but the files weren't being saved to those directories. I eventually found that they were being saved, but into the container's config/.local/share/nicotine/downloads directory. I fixed this by going into Nicotine+'s preferences, under Downloads, and changing the incomplete and complete folders to the container directories specified by default in the template.
    I also had to go to the Shares preferences and manually add the container's shares folder that the template specifies by default.
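
    For anyone hitting the same thing, you can confirm where the files are actually landing by peeking inside the container (the container name and exact path below are placeholders based on what I found; yours will differ):

    # check the directory the downloads were really going to
    docker exec nicotineplus ls -la /config/.local/share/nicotine/downloads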

     

    On the template: both descriptions for the complete and incomplete downloads are the same.

     

    In the User Interface preferences, the "Prefer dark mode" selection is being ignored.

  13. 4 hours ago, itimpi said:


    If you have mover or appdata backup running, then when either of those is detected you are likely to get the plugin executing the pause but then not doing the resume, so you need a manual resume to continue.

    I have both of those running daily, and although the PCT log entries stopped just after the mover started, the actual rebuild continued and seemingly completed successfully without any interaction on my part.

  14. On 10/8/2020 at 2:56 AM, trurl said:

    On the Dashboard page, click on the SMART warning for the disk and you can acknowledge it. It will warn again if the count increases.

    Thanks for this. There's no way I would've guessed to click on that warning. Good to see all green dots again.

  15. Thanks Dave, that makes things clearer. If only the standard messages were as descriptive as the Parity Check Tuning ones.

     

    I check in on my server most days and try to stay on top of app and plugin updates as soon as they become available. The Parity Check Tuning plugin is indeed on 2023.07.08, and I believe it was updated before I replaced the disk, but I'm not certain.

    Good luck with finding the cause of the monitoring task stopping. In my case all seemed good until the daily mover operation started.

  16. On 7/10/2023 at 8:20 AM, JorgeB said:

    most users won't know about that

     

    I'm still not 100% sure about what's going on with all this 😜


    Here's an update on what happened. I followed the guide to replace the failing disk. The rebuild onto the new disk appears to have gone well, with no errors reported:

    [screenshot attached]

     

    and

     

    [screenshot attached]

     

    What's strange is that there's nothing in the logs at the 10:00 timestamp that the parity result shows as the rebuild end time:

    Jul 10 06:45:11 Tower emhttpd: spinning down /dev/sde
    Jul 10 09:15:08 Tower autofan: Highest disk temp is 43C, adjusting fan speed from: 230 (90% @ 833rpm) to: 205 (80% @ 854rpm)
    Jul 10 09:20:14 Tower autofan: Highest disk temp is 44C, adjusting fan speed from: 205 (80% @ 868rpm) to: 230 (90% @ 834rpm)
    Jul 10 09:39:17 Tower emhttpd: read SMART /dev/sdh
    Jul 10 09:59:53 Tower webGUI: Successful login user root from 192.168.34.42
    Jul 10 10:00:43 Tower kernel: md: sync done. time=132325sec
    Jul 10 10:00:43 Tower kernel: md: recovery thread: exit status: 0
    Jul 10 10:05:23 Tower autofan: Highest disk temp is 43C, adjusting fan speed from: 230 (90% @ 869rpm) to: 205 (80% @ 907rpm)
    Jul 10 10:09:42 Tower emhttpd: spinning down /dev/sdh
    Jul 10 10:14:57 Tower webGUI: Successful login user root from 192.168.34.42
    Jul 10 10:15:29 Tower autofan: Highest disk temp is 42C, adjusting fan speed from: 205 (80% @ 869rpm) to: 180 (70% @ 854rpm)
    Jul 10 10:30:00 Tower webGUI: Successful login user root from 192.168.34.42
    Jul 10 10:30:34 Tower autofan: Highest disk temp is 41C, adjusting fan speed from: 180 (70% @ 850rpm) to: 155 (60% @ 853rpm)
    Jul 10 10:30:44 Tower emhttpd: spinning down /dev/sdg

     

    I can see this in the log when the rebuild starts:

    Jul  8 21:17:28 Tower Parity Check Tuning: DEBUG:   Parity Sync/Data Rebuild running
    Jul  8 21:17:28 Tower Parity Check Tuning: Parity Sync/Data Rebuild detected
    Jul  8 21:17:28 Tower Parity Check Tuning: DEBUG:   Created cron entry for 6 minute interval monitoring


    Then I get the update every 6 minutes as expected:

    Jul  9 02:24:34 Tower Parity Check Tuning: DEBUG:   Parity Sync/Data Rebuild running
    Jul  9 02:30:20 Tower Parity Check Tuning: DEBUG:   Parity Sync/Data Rebuild running
    Jul  9 02:36:33 Tower Parity Check Tuning: DEBUG:   Parity Sync/Data Rebuild running


    Until here:

    Jul  9 02:42:20 Tower Parity Check Tuning: DEBUG:   Parity Sync/Data Rebuild running
    Jul  9 02:42:20 Tower Parity Check Tuning: DEBUG:   detected that mdcmd had been called from sh with command mdcmd nocheck PAUSE

     

    Which happens a couple of minutes after this:

    Jul  9 02:40:01 Tower root: mover: started

     

    There are no further parity-related entries after that.
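
    For anyone following along, the parity and rebuild-related lines above can be pulled out of the log with something like:

    # filter the syslog down to the plugin and md driver entries
    grep -E 'Parity Check Tuning|md: (sync|recovery)' /var/log/syslog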

     

    I'm not sure whether I can consider things OK now, or whether I should be investigating further.

  17. My issue with the 2 errors being found during the parity check remains.

     

    I've now got a failing drive and have a new one to replace it with. I've successfully moved everything off the old drive.

     

    I had an unclean shutdown recently, and when Unraid came back up it ran an automatic correcting check, which finished today. This is the result from the log:

     

    Jul 8 03:18:43 Tower Parity Check Tuning: DEBUG: Automatic Correcting Parity-Check running 
    Jul 8 03:19:25 Tower kernel: md: recovery thread: P corrected, sector=39063584664
    Jul 8 03:19:25 Tower kernel: md: recovery thread: P corrected, sector=39063584696 
    Jul 8 03:19:25 Tower kernel: md: sync done. time=1844sec 
    Jul 8 03:19:25 Tower kernel: md: recovery thread: exit status: 0

     

    The problem is with the same 2 sectors on parity P that have been coming up since the middle of December, though not every time:

    [screenshot attached]

     

    Both parity drives completed their SMART short self-tests without error.

     

    I'm unsure how best to proceed. My largest data disk is 18TB, the parities are 20TB, and these 2 problem sectors are right at the end of the 20TB, so they're outside the area that holds data, and I've already moved all of the data off the disk I want to replace. Given that, do I just ignore the parity errors and follow this guide: https://docs.unraid.net/unraid-os/manual/storage-management#replacing-a-disk-to-increase-capacity or is there something else I can try?
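
    A quick bit of shell arithmetic backs up the idea that the error sectors sit beyond anything an 18TB data disk covers:

    # byte offset of the reported sector (512-byte sectors) -- roughly 20.0 TB
    echo $(( 39063584664 * 512 ))
    # number of 512-byte sectors in 18 TB -- about 35,156,250,000, well below the reported sector
    echo $(( 18 * 10**12 / 512 ))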

     

    tower-diagnostics-20230708-1356.zip
