
JorgeB

Moderators
  • Posts

    67,406
  • Joined

  • Last visited

  • Days Won

    705

Everything posted by JorgeB

  1. You can replace the devices or ignore it for now, but it's a hardware problem, unrelated to your config or Unraid.
  2. Both enclosures are being detected correctly:
     Mar 8 04:54:20 Tower kernel: mpt2sas_cm0: LSISAS2008: FWVersion(05.00.13.00), ChipRevision(0x03), BiosVersion(07.05.04.00)
     Mar 8 04:54:20 Tower kernel: mpt2sas_cm0: Protocol=(
     Mar 8 04:54:20 Tower kernel: Initiator
     Mar 8 04:54:20 Tower kernel: ,Target
     Mar 8 04:54:20 Tower kernel: ),
     Mar 8 04:54:20 Tower kernel: Capabilities=(
     Mar 8 04:54:20 Tower kernel: TLR
     Mar 8 04:54:20 Tower kernel: ,EEDP
     Mar 8 04:54:20 Tower kernel: ,Snapshot Buffer
     Mar 8 04:54:20 Tower kernel: ,Diag Trace Buffer
     Mar 8 04:54:20 Tower kernel: ,Task Set Full
     Mar 8 04:54:20 Tower kernel: ,NCQ
     Mar 8 04:54:20 Tower kernel: )
     Mar 8 04:54:20 Tower kernel: scsi host1: Fusion MPT SAS Host
     Mar 8 04:54:20 Tower kernel: mpt2sas_cm0: sending port enable !!
     Mar 8 04:54:20 Tower kernel: mpt2sas_cm0: host_add: handle(0x0001), sas_addr(0x500605b0055d6fa0), phys(8)
     Mar 8 04:54:20 Tower kernel: mpt2sas_cm0: port enable: SUCCESS
     Mar 8 04:54:20 Tower kernel: scsi 1:0:0:0: Enclosure HP P2000 G3 SAS T200 PQ: 0 ANSI: 5
     Mar 8 04:54:20 Tower kernel: scsi 1:0:0:0: SSP: handle(0x0009), sas_addr(0x500c0ff1233b4000), phy(4), device_name(0x0000000000000000)
     Mar 8 04:54:20 Tower kernel: scsi 1:0:0:0: enclosure logical id (0x500605b0055d6fa0), slot(4)
     Mar 8 04:54:20 Tower kernel: scsi 1:0:0:0: Power-on or device reset occurred
     Mar 8 04:54:20 Tower kernel: scsi 1:0:0:0: Attached scsi generic sg2 type 13
     Mar 8 04:54:20 Tower kernel: mpt2sas_cm1: LSISAS2008: FWVersion(05.00.13.00), ChipRevision(0x03), BiosVersion(07.05.04.00)
     Mar 8 04:54:20 Tower kernel: mpt2sas_cm1: Protocol=(
     Mar 8 04:54:20 Tower kernel: Initiator
     Mar 8 04:54:20 Tower kernel: ,Target
     Mar 8 04:54:20 Tower kernel: ),
     Mar 8 04:54:20 Tower kernel: Capabilities=(
     Mar 8 04:54:20 Tower kernel: TLR
     Mar 8 04:54:20 Tower kernel: ,EEDP
     Mar 8 04:54:20 Tower kernel: ,Snapshot Buffer
     Mar 8 04:54:20 Tower kernel: ,Diag Trace Buffer
     Mar 8 04:54:20 Tower kernel: ,Task Set Full
     Mar 8 04:54:20 Tower kernel: ,NCQ
     Mar 8 04:54:20 Tower kernel: )
     Mar 8 04:54:20 Tower kernel: scsi host3: Fusion MPT SAS Host
     Mar 8 04:54:20 Tower kernel: mpt2sas_cm1: sending port enable !!
     Mar 8 04:54:20 Tower kernel: mpt2sas_cm1: host_add: handle(0x0001), sas_addr(0x500605b0029eafe0), phys(8)
     Mar 8 04:54:20 Tower kernel: mpt2sas_cm1: port enable: SUCCESS
     Mar 8 04:54:20 Tower kernel: scsi 3:0:0:0: Enclosure HP P2000 G3 SAS T200 PQ: 0 ANSI: 5
     Mar 8 04:54:20 Tower kernel: scsi 3:0:0:0: SSP: handle(0x0009), sas_addr(0x500c0ff1233b4400), phy(0), device_name(0x0000000000000000)
     Mar 8 04:54:20 Tower kernel: scsi 3:0:0:0: enclosure logical id (0x500605b0029eafe0), slot(0)
     Mar 8 04:54:20 Tower kernel: scsi 3:0:0:0: Power-on or device reset occurred
     Mar 8 04:54:20 Tower kernel: scsi 3:0:0:0: Attached scsi generic sg3 type 13
     This suggests the connections are good but the enclosure is not detecting the disk, though you should update the firmware on both LSIs since it is ancient; the current release is 20.00.07.00.
  3. We need the diags, most useful if grabbed after the server has been running for a few days, once it happens again and before rebooting.
  4. It doesn't look like an Unraid problem, though it's strange it was working before. Those disks are not reporting temp, e.g., disk2:
     === START OF READ SMART DATA SECTION ===
     SMART Health Status: OK
     Current Drive Temperature: 0 C
     Drive Trip Temperature: 0 C
     The other disks are, e.g., disk 10:
     === START OF READ SMART DATA SECTION ===
     SMART Health Status: OK
     Percentage used endurance indicator: 6%
     Current Drive Temperature: 30 C
     Drive Trip Temperature: 63 C
     That comes from the SMART data, and I can't see how the upgrade could affect that. Can you downgrade back to v6.8.2 to confirm? Please post new diags if it is working.
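To compare several disks quickly, the temperature line can be pulled out of saved SMART output. This is a minimal Python sketch, assuming the text has the same layout as the excerpts quoted above; `drive_temperature` is a hypothetical helper name, and treating a reading of 0 C as "not reported" is an assumption based on the disk2 output.

```python
import re

# Matches lines such as "Current Drive Temperature: 30 C"
TEMP = re.compile(r"Current Drive Temperature:\s+(\d+)\s*C")

def drive_temperature(smart_text):
    """Return the reported temperature in C, or None if it is missing or 0."""
    match = TEMP.search(smart_text)
    if match is None:
        return None
    temp = int(match.group(1))
    # Assumption: a value of 0 means the drive is not reporting a temperature.
    return temp or None

disk2 = "SMART Health Status: OK\nCurrent Drive Temperature: 0 C\nDrive Trip Temperature: 0 C"
disk10 = "SMART Health Status: OK\nCurrent Drive Temperature: 30 C\nDrive Trip Temperature: 63 C"
print(drive_temperature(disk2), drive_temperature(disk10))
```

Running it against the SMART reports from the diags before and after the downgrade would show whether the reading changed.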
  5. We can't see what happened, but it was likely a controller error. When errors on multiple drives happen at the same time Unraid will disable as many drives as there are parity devices, and which drives get disabled is luck of the draw. It's unlikely to be upgrade related, especially since, if I understood correctly, the problem was there before rebooting, and the new release would only be loaded after the reboot, when the disks were already disabled. Just re-sync parity and try to get the diags before rebuilding if it happens again.
  6. That's OK if copying from the cache device to a different share, or use /mnt/user0/USER. As for the cache, it is showing only checksum/corruption errors, so you should run a memtest.
  7. Which disk is that one? All array devices appear to be mounting correctly, though there are still ATA errors on at least two devices.
  8. You also have an overheating CPU, check cooling:
     Mar 5 08:11:15 Valyria kernel: CPU4: Core temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU12: Core temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU14: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU13: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU15: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU9: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU8: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU10: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU11: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU12: Package temperature above threshold, cpu clock throttled (total events = 1)
     Mar 5 08:11:15 Valyria kernel: CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
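To see how often this happens over a longer syslog, the throttle messages can be tallied per CPU. This is a rough Python sketch, assuming the lines keep the exact wording shown in the log above; `count_throttle_events` is a hypothetical helper name, not an Unraid tool.

```python
import re
from collections import Counter

# Matches lines such as "... kernel: CPU4: Core temperature above threshold, ..."
THROTTLE = re.compile(r"kernel: (CPU\d+): (Core|Package) temperature above threshold")

def count_throttle_events(log_text):
    """Count throttle messages per (cpu, scope) pair, scope being Core or Package."""
    # findall returns one (cpu, scope) tuple per matching line.
    return Counter(THROTTLE.findall(log_text))

sample = "\n".join([
    "Mar 5 08:11:15 Valyria kernel: CPU4: Core temperature above threshold, cpu clock throttled (total events = 1)",
    "Mar 5 08:11:15 Valyria kernel: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)",
])
for (cpu, scope), n in sorted(count_throttle_events(sample).items()):
    print(cpu, scope, n)
```

Core entries point at individual hot cores, while Package entries are logged for every CPU sharing the package, which is why one event produces so many lines.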
  9. I would guess disk3 is the most likely culprit, but it can get better at any point since these are usually bad areas, though it can also end up giving some read errors. Unrelated to this, you have some filesystems complaining of low space:
     Mar 6 20:13:58 UNRAID kernel: Filesystem "md11": reserve blocks depleted! Consider increasing reserve pool size.
     Mar 6 20:13:58 UNRAID kernel: XFS (md11): Per-AG reservation for AG 0 failed. Filesystem may run out of space.
     Mar 6 20:13:58 UNRAID kernel: XFS (md11): Per-AG reservation for AG 1 failed. Filesystem may run out of space.
     Mar 6 20:13:58 UNRAID kernel: XFS (md11): Per-AG reservation for AG 2 failed. Filesystem may run out of space.
     There are various disks with less than 50MB free. You might have issues if you need, for example, to run a filesystem check on any of those, so you should leave a few GB free, like 10/20GB.
     Filesystem  Size  Used  Avail  Use%  Mounted on
     /dev/md2    2.8T  2.8T  4.4M   100%  /mnt/disk2
     /dev/md6    2.8T  2.8T  45M    100%  /mnt/disk6
     /dev/md7    2.8T  2.8T  46M    100%  /mnt/disk7
     /dev/md9    2.8T  2.8T  48M    100%  /mnt/disk9
     /dev/md11   2.8T  2.8T  40M    100%  /mnt/disk11
     /dev/md12   3.7T  3.7T  38M    100%  /mnt/disk12
     /dev/md16   3.7T  3.7T  35M    100%  /mnt/disk16
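A check like the one below can flag disks that drop under the suggested margin. This is only a sketch in Python: `low_space` is a hypothetical helper, the 10 GiB default mirrors the lower end of the 10/20GB suggestion above, and the `usage` parameter exists only so the function can be exercised without real mounts.

```python
import shutil

def low_space(mount_points, min_free_bytes=10 * 1024**3, usage=shutil.disk_usage):
    """Return (mount point, free bytes) for every mount below the threshold."""
    flagged = []
    for mount in mount_points:
        free = usage(mount).free  # shutil.disk_usage returns (total, used, free)
        if free < min_free_bytes:
            flagged.append((mount, free))
    return flagged

if __name__ == "__main__":
    # "/" is used here only so the example runs anywhere.
    for mount, free in low_space(["/"]):
        print(f"{mount}: only {free / 1024**2:.0f} MiB free")
```

On the server it would be called with the array mount points from the listing above, e.g. `low_space(["/mnt/disk2", "/mnt/disk6"])`.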
  10. Those ATA errors on disk2 are a hardware problem, likely a connection issue, check/replace both cables.
  11. With just the NVMe device I would use it as cache for now, so you could also cache some writes, assuming the available space is enough, but if you plan to add a cache pool soon it's probably best to use it as an unassigned device from the start. Also, yes to the second question.
  12. Don't see any clues here, but the diags are from right after rebooting, and whatever happened likely happened before this. Assuming the pool was re-formatted before, I agree there's likely some hardware issue; look for a firmware update for the NVMe device.
  13. For some time now Unraid keeps the current profile when adding new devices, it won't revert back to raid1. The pool is correctly configured for raid10, but like mentioned the usable space reported on the GUI will be wrong; this is a known btrfs issue when using different size devices.
  14. None. Yes, mostly because of the read errors on disk4 during initial parity sync.
  15. mdcmd status | egrep "mdResync"
      It will display some values, including:
      mdResync=total parity size
      mdResyncPos=current position
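Those two values are enough to estimate progress as a percentage. This is a minimal Python sketch, assuming the command prints `key=value` lines and that both values use the same unit; `resync_progress` is a hypothetical helper name, and treating a total of 0 as "no sync running" is an assumption.

```python
def resync_progress(status_text):
    """Return sync progress in percent, or None if no total is reported."""
    values = {}
    for line in status_text.splitlines():
        key, sep, value = line.strip().partition("=")
        # Only the two keys named above are used; other mdResync* lines are ignored.
        if sep and key in ("mdResync", "mdResyncPos"):
            values[key] = int(value)
    total = values.get("mdResync", 0)
    if total == 0:
        # Assumption: a total of 0 means no sync is running.
        return None
    return 100.0 * values.get("mdResyncPos", 0) / total

sample = "mdResync=1000\nmdResyncPos=250\n"  # made-up numbers for illustration
print(resync_progress(sample))
```

The text passed in would be the saved output of the command above.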
  16. Possibly, try a different one if available.
  17. You should create a bug report, my guess is that it's not converting TB to GB, and anyone with >= 1TB of RAM will have this issue.
  18. Your cache drive is completely full, which is preventing writes to the libvirt loop image. Free up some space, re-start the VM service and it should return to normal, but it's always a good idea to back up libvirt.img, in case it gets corrupted.
  19. If it was an unclean shutdown you're fine, and that's the most likely cause. If it was a disk returning bad data that's not so good, since there would be some data corruption, but like mentioned that is pretty rare.
  20. It is very strange to now be getting an "invalid partition" error. This means something changed the MBR/partition and it no longer conforms to what Unraid expects, which should never happen (unlike filesystem corruption, which can happen sometimes). I suggest you reformat disk1 (any data there will be lost), copy some data, then stop and re-start the array; if it goes unmountable again post new diags before rebooting.
  21. There must be a reason for the errors, most obvious would be an unclean shutdown some time before this check, another possibility is one of the disks returning corrupted data, that's not supposed to happen but it is known to happen on some rare occasions.