JorgeB (Moderators) · 67,600 posts · 707 days won

Everything posted by JorgeB

  1. As long as there's no "all data on this device will be deleted" warning on the right side of the SSD it's fine.
  2. If you assign it as parity1 and let it re-sync it will cause the same problem.
  3. Did you wipe it and reboot? If yes, post new diags.
  4. If the drives are empty the only option is a data recovery tool, for example UFS Explorer.
  5. No, the procedure needs a new config using the "parity is already valid" option before array start, but as mentioned, parity would only be 100% valid if the replacement was done in maintenance mode.
  6. loop2 is the docker image.
  7. Stop the array and unassign parity. Before re-assigning it as parity2 you can confirm that this is indeed the problem by wiping the partition with:
      wipefs -a /dev/sdX1
      Replace X with the correct letter (it was m as of the last diags), and note the 1 at the end. Then reboot and start the array with parity unassigned to see if the cache pool mounts; if yes, stop the array, assign the disk as parity2 and start the array to begin the parity sync.
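If you want to see exactly what wipefs does before pointing it at the real parity partition, you can rehearse on a scratch file; this is just an illustration (the /tmp path and the swap signature are stand-ins, the real target would be the partition, e.g. /dev/sdX1):

```shell
# Create a scratch file and stamp a signature on it so wipefs has
# something to find (mkswap is just a convenient signature writer).
truncate -s 1M /tmp/fakedev.img
mkswap /tmp/fakedev.img >/dev/null

wipefs /tmp/fakedev.img      # with no flags, wipefs only LISTS signatures
wipefs -a /tmp/fakedev.img   # -a erases all signatures it found
wipefs /tmp/fakedev.img      # prints nothing once the device is clean
```

Running plain `wipefs /dev/sdX1` first is a safe dry run: it shows which signatures are present without touching the disk.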
  8. There have been several cases in the last week or so of hacked servers with all the data deleted; forwarded ports (or DMZ) appear to be the common denominator. Doing that has always been a very bad idea, but apparently there are now hackers actively looking for exposed Unraid servers.
  9. Did you power cycle? Just rebooting might not be enough, if a power cycle didn't help it's possibly a failed NVMe device.
  10. No, definitely not that; you could add an array device if you have one available, or change parity to parity2 (which will need a re-sync).
  11. Cache device dropped offline:
      Mar 22 19:37:30 Tower kernel: nvme nvme0: I/O 393 QID 6 timeout, aborting
      Mar 22 19:37:30 Tower kernel: nvme nvme0: I/O 394 QID 6 timeout, aborting
      Mar 22 19:37:30 Tower kernel: nvme nvme0: I/O 395 QID 6 timeout, aborting
      Mar 22 19:37:30 Tower kernel: nvme nvme0: I/O 396 QID 6 timeout, aborting
      Mar 22 19:37:33 Tower kernel: nvme nvme0: I/O 397 QID 6 timeout, aborting
      Mar 22 19:37:33 Tower kernel: nvme nvme0: I/O 398 QID 6 timeout, aborting
      Mar 22 19:37:35 Tower kernel: nvme nvme0: I/O 1007 QID 2 timeout, aborting
      Mar 22 19:37:41 Tower kernel: nvme nvme0: I/O 675 QID 11 timeout, aborting
      Mar 22 19:38:00 Tower kernel: nvme nvme0: I/O 393 QID 6 timeout, reset controller
      Mar 22 19:38:31 Tower kernel: nvme nvme0: I/O 16 QID 0 timeout, reset controller
      Mar 22 19:39:32 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
      Mar 22 19:39:32 Tower kernel: nvme nvme0: Abort status: 0x371
      ### [PREVIOUS LINE REPEATED 7 TIMES] ###
      Mar 22 19:40:03 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
      Mar 22 19:40:03 Tower kernel: nvme nvme0: Removing after probe failure status: -19
      Mar 22 19:40:33 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
      A power cycle should bring it back, but look for a BIOS update for your board in case it keeps happening; sometimes disabling NVMe power states also helps.
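Disabling NVMe power states (APST) is usually done with a kernel boot parameter; a sketch of what the append line could look like in syslinux.cfg on the Unraid flash drive (the label and the bzimage/bzroot names are the stock defaults, so verify against your own file before editing):

```
# /boot/syslinux/syslinux.cfg -- add the nvme_core parameter to the append line
label Unraid OS
  menu default
  kernel /bzimage
  append nvme_core.default_ps_max_latency_us=0 initrd=/bzroot
```

Setting `nvme_core.default_ps_max_latency_us=0` prevents the drive from entering the deeper autonomous power states that some board/drive combinations handle badly.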
  12. Because without user shares there's no /mnt/user; docker would still work if you changed the paths to e.g. /mnt/disk1 or /mnt/cache.
  13. There's a valid btrfs filesystem on the SSDs; I think the problem now might be related to this error:
      Mar 22 08:24:28 Servo emhttpd: shcmd (91): /sbin/btrfs device scan
      Mar 22 08:24:29 Servo root: ERROR: system chunk array too small 34 < 97
      Mar 22 08:24:29 Servo root: ERROR: superblock checksum matches but it has invalid members
      Mar 22 08:24:29 Servo root: ERROR: cannot scan /dev/sdm1: Input/output error
      It wasn't interfering before but now apparently it is. These errors result from parity holding a semi-valid btrfs filesystem because there's an odd number of array devices. If I'm correct this won't happen if you add (or remove) an array device so you end up with an even number; alternatively, re-sync parity as parity2. Parity2 is calculated in a different way, so there shouldn't be an issue even with an odd number of array drives.
  14. The server is also running out of memory; reboot, then start the dockers/VMs one at a time and let each run for a while to see if you can find the culprit.
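While testing the containers/VMs one at a time, a couple of standard commands make it easy to watch memory from the console (the exact column choices here are just one reasonable way to slice it):

```shell
# Overall memory pressure, straight from the kernel
grep -E 'MemTotal|MemAvailable|SwapFree' /proc/meminfo

# Top 10 processes by resident memory (procps ps)
ps -eo pid,rss,comm --sort=-rss | head -n 10
```

Run these periodically after starting each service; whichever process keeps climbing in RSS between snapshots is the likely culprit.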
  15. If you can limit writes to the other disks and test for a few days it's worth trying.
  16. The log is spammed because the cache was previously full; reboot, start the mover (with mover logging enabled), wait a few minutes and post new diags.
  17. For now it's only that disk, but before there was corruption on two disks, which suggests another issue, unless it was also the RAM; still, it's quite unusual to have two different things corrupting data. Do you have another disk you could use in place of that one?
  18. You should use different IP subnets for the two NICs, then always use the 10GbE IP for server-to-server transfers.
  19. See if you can copy the config folder, that's the only thing you need; if you can, you can then try re-formatting, as it could be just a filesystem issue.
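Copying the config folder off the flash can be done from the console; a minimal sketch, assuming the stock Unraid layout (/boot is where the flash is mounted, and the /mnt/disk1 destination is just an example path, adjust to your setup):

```shell
# SRC/DEST are assumptions for illustration -- change DEST to any disk
# with free space. /boot/config holds the license key, array assignments
# and share settings, which is everything needed to rebuild the flash.
SRC=${SRC:-/boot/config}
DEST=${DEST:-/mnt/disk1/flash-backup}
if [ -d "$SRC" ]; then
  mkdir -p "$DEST"
  cp -r "$SRC" "$DEST/"
  echo "copied $SRC to $DEST"
else
  echo "no $SRC on this system (expected on a live Unraid box)"
fi
```

If the copy succeeds without I/O errors, the flash hardware is probably fine and a re-format plus restore of this folder should get the server back.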
  20. That's just a consequence of the issue; the HBA is timing out during boot:
      Mar 22 04:02:24 Asgard kernel: mpt2sas_cm0: port enable: FAILED with timeout (timeout=300s)
      And the timeout is 5 minutes. The main reason I asked to swap them is to see if it's the same controller that fails to initialize. If it's the other one it could be more of a general problem with the LSI driver, though many users are running multiple LSI controllers without issues; if it's still the same one, it's likely something specific to that HBA that the new driver doesn't like, and that could be more difficult to get fixed.
  21. It could be an LSI driver issue with the one included in the newer kernel; did you try swapping the controllers?
  22. You need an older board without UEFI support, or use the UEFI tools.