Addy90

Members

  • Content Count: 37
  • Rank: Advanced Member
  • Community Reputation: 3 (Neutral)
  1. Yes, definitely. I believe Limetech has plenty of automated test patterns for all the edge cases they know about, to be sure that changes cannot cause data loss. Of course this is nothing one can implement quickly, but since it is already possible to pause and resume a parity check during operation, it should also be safe to resume after a reboot, especially because many conditions are already checked after a reboot: whether the array and all disks are available, and whether there were changes to the array. But it is not for me to assume too much here. This is just a Feature Request thread, and I wanted to ask for something I would love to see and that I think is not impossible to do (much easier than dual parity or multi-stream IO and other features we have seen recently). It is a fair feature request in my opinion, plus some suggestions about possible implementations, because you prompted me to write about them. The rest is hope and patience, of which I have both. Like most of us here, I prefer a solidly tested feature over any other feature; that is one reason we use Unraid: because it works. Don't you agree?
  2. Ah, yes, of course it is nearly impossible for plugin development! I know you have it on the wish list for your great Parity Check Tuning plugin! And it is exactly as you describe it there (quoting from your Parity Check Tuning page): the offset ability has to be implemented by Limetech. This is why I ask in the Unraid Feature Request thread and not in your plugin thread: I know it is not possible for you, and that you need this feature. But I think it is not very difficult for Limetech, for the following reason: there is a function md_do_sync() in the md driver with a variable mddev->curr_resync that holds the position of the current parity check run. This variable is incremented (mddev->curr_resync += sectors) inside a while loop until the maximum sector (mddev->recovery_running) is reached, which marks the end of the parity check operation. A parity check can already be resumed by passing options to the check_array() function, which revives the mddev->recovery_thread and continues from the current mddev->curr_resync position held in memory. I think it is easily possible to read the content of this variable (and others, if needed, perhaps protected by a checksum to be sure no modification happened). The status_md() function already prints this sync position, for example with seq_printf(seq, "mdResyncPos=%llu\n", mddev->curr_resync/2); so it may already be possible to get the needed values from there. Once you can read this variable, check_array() only needs an additional parameter to resume at a specific position: it sets mddev->curr_resync, wakes up the sync thread, and the operation continues where it left off after a reboot! This parameter could be read from a file, validated by the checksum, and the mdcmd tool could be extended to accept this value for a parity check on the command line.
Everything should be possible via mdcmd calls from the UI: the functions would permit setting the current position, and the UI could save the current value to a file (with a checksum, just to be sure), read it back after a reboot, delete the position file while the operation is running, and write it when the check is paused or the array is shut down manually. That way, after an unclean shutdown or a successfully finished check, the file is not there; on the next start, Unraid checks for the file and, if it is absent, starts from the beginning. I think it is not very difficult for Limetech to permit this operation. I would LOVE to see it. It would brighten my days and some other people's, including @itimpi's. PS: I don't mind pausing and resuming a parity sync over multiple reboots. It is as easy as shutting down the array (the file with the position is saved), and after the reboot the UI tells you that the last check was not finished and can be resumed (or offers a checkbox to start from the beginning, which calls mdcmd without the position argument). Not difficult for the user, but very handy for this use case! PPS: As far as I understand, the same mechanism is also used for rebuilding disks, so it should also be possible to resume a rebuild after a reboot. In that case I would certainly warn the user before shutting down, or gray out the shutdown option, but technically it should be possible. After an unclean shutdown we have to restart from the beginning either way.
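The position-file handling described above could be sketched like this. Note that this is purely an illustration of the save/verify/restore idea: `save_pos`, `load_pos`, and the file path are hypothetical names of my own, and mdcmd has no resume-position argument today.

```shell
#!/bin/sh
# Illustrative sketch only: store the current resync position together with
# a checksum, and refuse to trust it if the file was modified.
# The path and function names are hypothetical, not part of Unraid.
POSFILE=/tmp/parity-resume.pos

save_pos() {
    # $1 = current resync position in sectors (e.g. read from mdResyncPos)
    printf '%s' "$1" > "$POSFILE"
    sha256sum "$POSFILE" > "$POSFILE.sha256"
}

load_pos() {
    # Only return the stored position if the checksum still matches
    sha256sum -c "$POSFILE.sha256" >/dev/null 2>&1 || return 1
    cat "$POSFILE"
}

save_pos 7802976056
load_pos    # prints 7802976056; a resume call could hand this value to mdcmd
```

On an unclean shutdown the file would never have been written, so the next check starts from the beginning, exactly as described above.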
  3. What I would find really great is the ability to pause a parity sync across reboots! I have two reasons. First: I live in a flat, and at night I normally keep all doors open for better air circulation, because the bedroom has no radiator and gets its heat from the other rooms. Unfortunately, a parity sync takes longer than 24 hours, and this is the only time in the month I have to close the doors because of the noise, so it gets cold, too. Second: normally I only sync backups from my other systems several times a week, up to once a day, or I occasionally need older data that is stored on the array but not on my live systems. That means my array is shut down most of the day and only online for a few hours, maybe half a day, because of noise and energy costs. I could use that online time, when I need the array anyway, to continue the parity sync. This would save a lot of energy, because the only time in the month the server runs for more than 24 hours, even at night when I don't access it, is for the parity sync. So what I would appreciate is that the parity sync operation writes a small state file to the flash drive when it is paused explicitly and/or on a clean system shutdown, and offers the possibility to continue where it left off (or restart from the beginning, for example with an additional checkbox). That way, the parity sync could be paused and resumed whenever the system has to be shut down, for whatever reason. I don't think this is difficult, because the parity sync operation always knows where it is. I know the Parity Check Tuning plugin lets the parity sync run in intervals, but that does not survive a reboot, which is critical for my use case. I hope this can be made possible! Thank you! BTW, this was already requested, but maybe with less emphasis than in my text:
  4. Replace it with a disk at least as large as the current one. So you need a 6 TB disk with at least as many sectors as the current one, or you use a bigger disk, which also lets you upgrade or replace another data disk with a bigger one later.
  5. Errors there are read errors that were corrected on the fly, so at that moment no files are damaged. But your parity disk sdg definitely has read errors (critical medium error):
     Mar 1 09:15:52 Jewel kernel: sd 7:0:5:0: [sdg] tag#158 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
     Mar 1 09:15:52 Jewel kernel: sd 7:0:5:0: [sdg] tag#158 Sense Key : 0x3 [current] [descriptor]
     Mar 1 09:15:52 Jewel kernel: sd 7:0:5:0: [sdg] tag#158 ASC=0x11 ASCQ=0x0
     Mar 1 09:15:52 Jewel kernel: sd 7:0:5:0: [sdg] tag#158 CDB: opcode=0x88 88 00 00 00 00 01 d1 17 f7 38 00 00 01 d0 00 00
     Mar 1 09:15:52 Jewel kernel: blk_update_request: critical medium error, dev sdg, sector 7802976056 op 0x0:(READ) flags 0x0 phys_seg 58 prio class 0
     Mar 1 09:15:52 Jewel kernel: md: disk0 read error, sector=7802975992
     Mar 1 09:15:52 Jewel kernel: md: disk0 read error, sector=7802976000
     Mar 1 09:15:52 Jewel kernel: md: disk0 read error, sector=7802976008
     [identical "md: disk0 read error" lines continue for consecutive sectors, 8 apart, up to sector=7802976448]
If you want, run a second parity sync and check whether more errors come up after the run. If yes, it could be the cable or the disk; to me it looks like the disk, because of the pending sectors, which no cable would cause. Be extra sure that these are not parity sync errors but only read errors on the disk. Parity sync errors would mean data cannot be rebuilt correctly. Luckily your parity disk does not contain data, but if another disk fails, you could end up with lost data, because the parity disk cannot rebuild from the faulty sectors. In other words, the corrected read errors are rebuilt from the remaining disks on the fly; if another disk fails, these read errors will result in data loss, as you only have single parity. If you want to be extra sure, replace the cable and do another parity sync (or replace the cable before the second parity sync). If there are any more read errors and it is not the cable: replace the disk. Nothing is worse than data corruption because a failing disk was kept for too long. Really, replace it. You cannot rebuild your data from a faulty parity disk. I had this too; I replaced my disk because it always showed errors during each parity sync, and now the errors are gone. I have to admit I have never had a faulty cable...
  6. I would definitely recommend preclearing both disks first. When that is finished, add the disk you want to add, check that everything is up and running, and THEN replace your other disk and let it rebuild from parity. Adding the precleared disk should only take a minute or so; the replacement is what takes time, because of the parity rebuild. Doing it in this order keeps the downtime at a minimum: during the preclear everything stays up and running, and during the rebuild of the replacement disk its content is emulated and also up and running. Do not skip the preclear of either disk to save one run. Nothing is more painful than adding or rebuilding a disk that crashes within hours because it was not tested first.
  7. When I built my Unraid at home, I had a bunch of WD Green drives; they are currently all in my array, and so far I have had two defects but could always rebuild the data. They were never dropped; they just had read errors that were corrected on the fly. Of course, I got rid of those disks once the errors started showing. I also have a Purple disk that was previously used for video editing of multiple HD streams (it works great for that, but has been replaced with a big SSD now). It works fine, though random access is a bit slower than with Red drives; these disks are made for parallel streams, not for the single-access patterns of a RAID, so they are slower than Red in normal RAID use. The rest are all Red drives, bought as I built up the array, and they behave fine, too. Then I have an old Samsung and a Hitachi disk as well. I once had two Seagate disks, but both developed reallocated sectors, one of them so many that it started to corrupt files, so I got rid of them. As cache, I have two Samsung SSDs in a BTRFS RAID1. I can say: it works! Of course, old disks are slower than new disks, but it was cheaper not to replace all the old ones. So currently Green, Purple, and Red all work! But of course, as @Michael_P says, Red drives are made for RAID, with TLER (time-limited error recovery). Purple has TLER too, by the way. An older benchmark: https://www.hwcooling.net/en/test-of-six-hhds-from-wd-which-colour-is-for-you/ But the principle is comparable. Though I have a Gold drive in my desktop too, and its random access is not as fast as this benchmark suggests... who knows. Nevertheless, the Gold drive is a nice single disk.
  8. You can approximate this with the File Integrity plugin. It gives you txt files listing every file on each drive together with a hash sum. The bonus is that after a drive failure and rebuild, you can check the rebuilt files for validity against the previously exported hash sums. You can also automate calculating hash sums for new files and exporting them as txt files. It is not a tree, but it has every file in it. If not automatically, you can export manually (I do it manually, because the automatic mode drastically drops write speed). Not perfect, but maybe good enough for you?
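As a rough sketch of the same export-and-verify idea (this is my own illustration, not the File Integrity plugin's actual export format; the function names and the `hash  path` line layout are assumptions):

```python
# Sketch: export per-file SHA-256 sums to a text file, then verify the files
# against it after a rebuild. Not the File Integrity plugin's real format.
import hashlib
from pathlib import Path

def export_hashes(root: str, out_file: str) -> None:
    """Write one 'sha256  relative/path' line per file found under root."""
    with open(out_file, "w") as out:
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                out.write(f"{digest}  {path.relative_to(root)}\n")

def verify_hashes(root: str, hash_file: str) -> list[str]:
    """Return the relative paths whose current hash no longer matches."""
    mismatches = []
    for line in open(hash_file):
        digest, rel = line.rstrip("\n").split("  ", 1)
        data = (Path(root) / rel).read_bytes()
        if hashlib.sha256(data).hexdigest() != digest:
            mismatches.append(rel)
    return mismatches
```

After a rebuild, something like `verify_hashes("/mnt/disk1", "disk1-hashes.txt")` would list any files that no longer match their exported hash.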
  9. I did not read the whole article, but if you are looking for an enterprise solution, use something like Ceph: https://ceph.io/ I am using Unraid in my home setup because it is an incredibly flexible and cheap solution for backing up and archiving data; nothing is as flexible and cheap to set up, in my opinion. At work I set up a Ceph cluster; there you can have hundreds of drives on dozens of servers, with failover and parity across multiple servers. You can reboot parts of the cluster for updates with no downtime and have virtual machines use Ceph as underlying network storage, for example with Proxmox (built in). If you need additional backups, use a second technology like GlusterFS, which can also run over multiple disks on multiple servers; if there is a problem with Ceph, your second storage on GlusterFS is not affected. Never confuse backups with failover! Then you have your scalable, enterprise-ready system with more than 2 parity disks, more than 28 data disks, server and even location failover, and backups. None of this can be provided by Unraid, and Unraid should not be designed for it, as that would complicate the setup for every home and small business user - and it would compete with Ceph and GlusterFS for no reason, as these systems already solve the problems you seem to be describing. Ceph also supports multiple pools with different redundancy settings, different synchronization strategies, and distinct SSD and HDD pools on the same infrastructure for high-performance and high-capacity pools. You can choose which disks go in which pool, choose the parity calculation, and more... you can use Ceph as block storage and as file storage with CephFS. Ceph also has integrity verification built in (scrubbing). And so on... everything enterprise-ready storage needs! I am sure it is possible to run SMB sharing on top of CephFS pools, though I am not sure about SMB failover, so the gateway running SMB might have downtime on reboots.
Maybe SMB can be made highly available via keepalived (VRRP) and conntrackd; then the SMB shares would fail over, too, like the whole storage system. I don't mean to advertise these technologies. I am not involved in the development of any of them, but I use them all where applicable: Unraid at home; Ceph for high-performance, highly available, highly scalable enterprise VM/container storage at work; GlusterFS for backups of Ceph VMs/containers at work; keepalived for failover of some services (but not SMB); conntrackd on VyOS firewalls for connection failover across reboots (but not with SMB). I just want to say: there are solutions out there! @shEiD Use the right tool for the right task. My opinion! Learn Linux administration; all the tools I mentioned are free of charge. Unraid costs money, Ceph does not, but professional support can be bought. Proxmox has a GUI for Ceph management, and Ceph has its own dashboard if you want to use it standalone, but for advanced management you need to learn Linux administration. For performance, see this: https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark PS: Yes, the thread is old, but for someone returning and being unhappy with Unraid's limits (for example @Lev), I would like some solutions written down here, so that nobody is unhappy with Unraid without knowing there is another tool for the task. Unraid is great, but I cannot repeat it enough: use the right tool for the right task. If you have a hammer, everything looks like a nail. But screws need a screwdriver.
  10. I meant the client timeout (https://linux.die.net/man/5/dhclient.conf), not the router's lease time. I thought the client timeout was set to 10 seconds in Unraid during boot, but I cannot confirm that now, as I have switched to a static IP. Because yes, I have a bonded interface, and with DHCP enabled on the bonded interface, the 10-second client timeout was too short for obtaining an IP address. As you say, the interfaces are briefly taken down since 6.8.0; that makes total sense now as the reason I could not get an IP address. In my case, I did 2. and will wait for 3. 😄 For the original author of this topic: since the network connection is lost after some time, I suspected a DHCP client timeout like the one I had. Maybe it is something else, then, as you say this issue only occurs while initializing the network interfaces; that sounds like it should not happen on renewal, but I cannot confirm. Maybe the author knows more about this... Thanks @boniel for the clarification!
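For reference, the client-side timeout is set in dhclient.conf with the `timeout` statement. A sketch (the values and the file path are examples, not Unraid's shipped configuration, which may live elsewhere):

```
# /etc/dhclient.conf (example values only)
timeout 60;   # wait up to 60 seconds for a lease instead of a short timeout
retry 30;     # if no lease was obtained, try again after 30 seconds
```

With a bonded interface that is briefly taken down during boot, a longer timeout gives the client enough time to still obtain a lease.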
  11. Are you using DHCP, or do you have a static IP for your Unraid system? I encountered issues with DHCP: I don't know why, but sometimes the server did not get an IP address and had no network connectivity after booting. As DHCP leases are renewed after some time (normally one day), it could be that the renewal sometimes fails and the network stops, because the Unraid system no longer has a valid IP. I suspect the DHCP timeout in Unraid is too low, but I did not investigate; I set a static IP instead (I had a static DHCP lease anyway) and have not had any network connectivity drops since. By the way, this started with 6.8.0 and never happened with any older version; my DHCP server, of course, did not change.
  12. This would be a very good feature. I normally shut down my array when I don't need it, especially to save power. Being able to shut down and (ideally automatically) continue the parity check on the next boot until it is finished would be best. It would also be good to automatically start parity checks that were scheduled but never ran because the system was shut down. So on startup, Unraid should simply check whether a scheduled check is due or whether a previous run is still unfinished. When a run is finished, the next parity check should start once the schedule is reached, either directly if the system is online or on the next boot if it did not start before, and it should of course continue after every reboot until it is finished. Only exception: manual intervention. A manual stop should of course stop it until the next scheduled check. That would be the perfect solution. It would not bother the people running their servers 24/7, but it would be most comfortable for everyone else.
  13. @limetech Turning off hard link support indeed increases listing performance by a factor of more than ten, as far as I can tell. My backup listing now takes around two minutes instead of half an hour with hard link support enabled. This is much better! I will leave it disabled; I never used hard links on my Unraid storage anyway. Thank you for this suggestion!
  14. @limetech What happens when you compare a recursive listing of folders with hundreds of subfolders, each with hundreds of files, between 6.8.0 and 6.7.2? In my case, I am listing backup directories with thousands of directories and several hundred thousand files, and this is slower than on 6.7.2 by a large factor. Just now, while I was sending new backups to my Unraid and wanted to browse a folder, everything hung: the transfer stopped for minutes, Unraid had the mover running during the transfer, and browsing via Explorer timed out. Shouldn't the new multi-stream IO handle such cases? Might the new multi-stream IO be the culprit? Can we disable it and test the legacy behaviour on current Unraid to compare? It is difficult to tell you exactly what is going wrong here because I have no direct comparison, but something about the performance is definitely not right, like never before...
  15. I can confirm the problem. In my case, I have large backup directories that are synced with third-party software between my computer and my Unraid system. With 6.7.2, a recursive directory listing via SMB of thousands of subdirectories with tens of thousands of files took seconds to minutes; with 6.8.0 stable it now takes minutes to hours. It already looks very slow when you right-click -> Properties on a directory and Windows starts to sum up the size of all files inside it. This is definitely much slower than before; you can almost watch it sum up file by file. It also looks like more data is transferred than before: I see a constant 20-30 kB/s of transfer while listing directories. When I capture the SMB2 packets with Wireshark, there is quite some time between requests and responses. BTW: I have the Cache Dirs plugin active, but it does not seem to have any impact whether it is enabled or disabled.