JorgeB
Moderators · Posts: 67,504 · Days Won: 706

Everything posted by JorgeB

  1. That's not part of stock Unraid; are you using some plugin? If so, you need to post on the appropriate plugin support thread.
  2. Looks OK if it was a replacement without copying data from the old device. That refers to the used size, not the device capacity.
  3. Aug 2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
     Aug 2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
     Data corruption on btrfs suggests a hardware problem, like bad RAM. Start by running memtest, and also make sure the RAM is on the QVL for that board.
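     You can also check the per-device error counters from the console; a minimal example, assuming the pool is mounted at /mnt/cache (adjust the mount point for your system):
     btrfs device stats /mnt/cache   # lists write/read/flush/corruption/generation error counts for each pool member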
  4. Those errors are likely the result of data corruption; btrfs will give an i/o error if corruption is detected so that you're not fed bad data without knowing. Would need the diags to confirm, but if that's the case you can use btrfs restore to bypass it, though the data will still be corrupt.
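     A rough sketch of how btrfs restore is typically used, assuming the pool device is /dev/sdX1 (a placeholder) and restoring to a share on a different disk than the damaged pool:
     btrfs restore -v /dev/sdX1 /mnt/disk1/restored   # copies whatever files are still readable off the damaged filesystem without mounting it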
  5. It shouldn't be needed if everything was done correctly, but as long as everything's good now...
  6. You can first try using the command line, or your favorite copy tool like Midnight Commander or Krusader. Note that it's good practice to make regular backups of those files, both the appdata folder and libvirt.img (if you have VMs).
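     A minimal example from the console; the destination share and the libvirt.img path are just assumptions for a default setup, adjust them to match your system:
     rsync -a /mnt/cache/appdata/ /mnt/user/backup/appdata/   # copy the appdata folder to a backup share
     cp /mnt/user/system/libvirt/libvirt.img /mnt/user/backup/   # libvirt.img location may differ on your server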
  7. Something is wrong, but like mentioned I can't see the beginning of the problem; reboot and post new diags if there are still errors.
  8. Current cache is sdg, so you need to adjust the mount point. What was sdi being used for? It's not in the current diags, suggesting that it might have dropped or been removed since.
  9. It doesn't with single parity, and order also doesn't matter for any btrfs pool.
  10. OK, there's definitely a problem, I managed to reproduce it with an identical array config. My plan is to try to find out in which circumstances this happens, whether it's disk size or config related, and then create a bug report, but since I can't reproduce this with small disks this can be a long and tedious process. I expect that @limetech is very busy at the moment with the latest beta and this isn't an urgent issue, but if you have any idea what might be causing this with the info below please advise.

     Quick summary:
     - Over the last couple of years or so multiple users, more than could be explained by just user error or other issues, have been complaining of sync errors after a parity swap. Errors start on the extra parity section, i.e., if the old parity was 8TB and the new parity is 10TB, sync errors start when you pass the 8TB mark on the check.
     - I could never reproduce this before on my test server, but @Juniper was willing to repeat the procedure twice so he/she could post the diags, and both times it resulted in sync errors; looking at the diags the procedure was correctly done as far as I could see.
     - I was finally able to reproduce this using the exact same array config as the OP. Like all other cases, the sync errors start immediately past the old parity size:

     Aug 2 09:14:08 Tower2 kernel: mdcmd (42): check nocorrect
     Aug 2 09:14:08 Tower2 kernel: md: recovery thread: check P ...
     Aug 2 09:15:01 Tower2 sSMTP[4234]: Creating SSL connection to host
     Aug 2 09:15:01 Tower2 sSMTP[4234]: SSL connection using TLS_AES_256_GCM_SHA384
     Aug 2 09:15:03 Tower2 sSMTP[4234]: Sent mail for [email protected] (221 2.0.0 closing connection d11sm13298600wrw.77 - gsmtp) uid=0 username=xxx outbytes=708
     Aug 2 09:38:35 Tower2 kernel: mlx4_en: eth2: Link Down
     Aug 2 10:18:41 Tower2 kernel: mlx4_en: eth2: Link Up
     Aug 2 13:47:06 Tower2 kernel: mlx4_en: eth2: Link Down
     Aug 3 00:10:09 Tower2 kernel: md: recovery thread: P incorrect, sector=15628053064
     Aug 3 00:10:09 Tower2 kernel: md: recovery thread: P incorrect, sector=15628053072
     Aug 3 00:10:09 Tower2 kernel: md: recovery thread: P incorrect, sector=15628053080
     Aug 3 00:10:09 Tower2 kernel: md: recovery thread: P incorrect, sector=15628053088

     Looking at a hex dump of this first block (after adding the 64 sectors from before the partition starts) you can confirm there's data there. Curiously there are some good blocks, notice the 20 block jump here:

     Aug 3 00:10:09 Tower2 kernel: md: recovery thread: P incorrect, sector=15628053488
     Aug 3 00:10:09 Tower2 kernel: md: recovery thread: P incorrect, sector=15628053648

     And of course the disk really is zeroed in these:

     root@Tower2:~# dd if=/dev/sdb skip=15628053560 count=152 | hexdump -C
     00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
     *
     00013000
     152+0 records in
     152+0 records out
     77824 bytes (78 kB, 76 KiB) copied, 0.000285448 s, 273 MB/s

     Also curious to me is that for the OP parity is wrong every third block (at least in the logged errors):

     Jul 2 17:00:03 Schiethucken kernel: md: recovery thread: check P ...
     Jul 2 17:00:09 Schiethucken emhttpd: cmd: /usr/local/emhttp/plugins/dynamix/scripts/tail_log syslog
     ### [PREVIOUS LINE REPEATED 1 TIMES] ###
     Jul 3 03:40:01 Schiethucken root: mover: cache not present, or only cache present
     Jul 3 06:00:38 Schiethucken kernel: md: recovery thread: P corrected, sector=15628053640
     Jul 3 06:00:38 Schiethucken kernel: md: recovery thread: P corrected, sector=15628054664
     Jul 3 06:00:38 Schiethucken kernel: md: recovery thread: P corrected, sector=15628055688
     Jul 3 06:00:38 Schiethucken kernel: md: recovery thread: P corrected, sector=15628056712
     Jul 3 06:00:38 Schiethucken kernel: md: recovery thread: P corrected, sector=15628057736
     Jul 3 06:00:38 Schiethucken kernel: md: recovery thread: P corrected, sector=15628058760

     The OP started with a precleared parity disk the first time, and a parity corrected one the second time, which means that part of the disk started as all zeros both times. Looking at the logs/graphs, it also looked to me like the new parity disk was completely written till the end after the parity copy when I did it, so the most likely explanation is that the new parity disk isn't being correctly zeroed after the parity copy. But I would assume you're using dd to zero out the extra capacity, so I can't imagine how this could happen, any ideas?
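     For anyone who wants to check one of the reported sectors themselves, the offset is just the md sector plus the 64 sectors before the partition starts; a rough example using the first logged sector (the device name is a placeholder):
     dd if=/dev/sdX skip=$((15628053064 + 64)) count=8 2>/dev/null | hexdump -C   # non-zero bytes here would mean this part of the new parity disk wasn't zeroed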
  11. There's an increased risk of data corruption if there's a power failure, but it would only affect files being copied at the time. Also, you can (and should) get a UPS.
  12. Please don't crosspost; anyone responding should use the original thread below:
  13. Like mentioned, you need to use the latest beta; there's no driver for your NIC on v6.8.x.
  14. Make sure it's using the correct "power supply idle control setting", more info here.
  15. As you prefer, should be fine to wait if the previous errors were the result of an unclean shutdown.
  16. Syslog rotated so I can't see the start of the problem; see here to reset the errors and to monitor for more in the future.
  17. Cache filesystem is severely corrupt; best bet is to back up, re-format, and restore any important data there.
  18. First check was non-correcting, so the one after that would find the same errors; the last one was correcting, so the next one should find 0 errors.
  19. Write cache is disabled for all disks. HP Microservers have a write cache setting in the BIOS that is disabled by default; enable it and try again.
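     To confirm the setting took effect you can query the drive from the console, for example (the device name is just a placeholder):
     hdparm -W /dev/sdX   # reports "write-caching = 1 (on)" when the drive's write cache is enabled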
  20. I didn't read the complete thread, but I see disk1 is now formatted xfs, so you won't be able to use any btrfs recovery tools; you might still be able to get something with UFS Explorer.
  21. Looks more like a connection issue, replace/swap both cables (power + SATA) and try again.
  22. And you're sure it's xfs? Either way you can use iotop (install the NerdPack plugin) and then try to find out which docker(s) is writing so much.
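     For example, assuming iotop is installed via NerdPack:
     iotop -oa   # -o shows only processes currently doing I/O, -a shows totals accumulated since iotop started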
  23. The beta can still have a few bugs, but there shouldn't be anything major; it should be OK for most users, and I'm running it on two servers.