JorgeB

Everything posted by JorgeB

  1. HPA is also enabled on disk1. P.S.: 2 SATA ports are still in IDE mode. These AMD controllers have separate settings, usually one for ports 1 to 4 and another for ports 5/6.
  2. There are several ways, depending on which diag file I open first. For example, you can see it in lsscsi.txt: note more than one device with the same ATA#, which only happens in IDE mode (and with some SATA port multipliers). Also, an Intel controller in AHCI mode is just one device, usually 1f:2, while in IDE mode there are two devices, 1f:2 and 1f:5. You can also see it in lspci.txt: here you just need to look at the driver in use, ata_piix for IDE, ahci for AHCI. There's also the "IDE mode" in the controller name, but this might not always be a reliable way to tell, especially on non-Intel controllers. Same in the syslog, where you see the IDE driver being loaded:

     May 21 14:54:28 Tower kernel: ata_piix 0000:00:1f.2: version 2.13
     May 21 14:54:28 Tower kernel: ata_piix 0000:00:1f.2: MAP [ P0 P2 P1 P3 ]
     May 21 14:54:28 Tower kernel: scsi host1: ata_piix
     May 21 14:54:28 Tower kernel: scsi host2: ata_piix

     and the dual ATA#, ATA1.00 + ATA1.01 and ATA2.00 + ATA2.01:

     May 21 14:54:28 Tower kernel: ata2.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
     May 21 14:54:28 Tower kernel: ata2.01: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
     May 21 14:54:28 Tower kernel: ata1.00: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
     May 21 14:54:28 Tower kernel: ata1.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
     May 21 14:54:28 Tower kernel: ata2.00: ATA-8: SanDisk SD6SB1M256G1022I, 134439400836, X230600, max UDMA/133
     May 21 14:54:28 Tower kernel: ata2.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 0/32)
     May 21 14:54:28 Tower kernel: ata1.00: ATA-9: ST4000NM0033-9ZM170, Z1Z3ESJS, GA0A, max UDMA/133
     May 21 14:54:28 Tower kernel: ata1.00: 7814037168 sectors, multi 16: LBA48 NCQ (depth 0/32)
     May 21 14:54:28 Tower kernel: ata2.01: ATA-10: ST4000DM005-2DP166, ZDH3HBM4, 0001, max UDMA/133
     May 21 14:54:28 Tower kernel: ata2.01: 7814037168 sectors, multi 16: LBA48 NCQ (depth 0/32)
     May 21 14:54:28 Tower kernel: ata1.01: ATA-8: WD4000FYYX, WCC13LZR5569, 00.0D1K4, max UDMA/133
     May 21 14:54:28 Tower kernel: ata1.01: 7814037168 sectors, multi 16: LBA48 NCQ (depth 0/32)
     May 21 14:54:28 Tower kernel: ata1.00: configured for UDMA/133
     May 21 14:54:28 Tower kernel: ata2.00: configured for UDMA/133
     May 21 14:54:28 Tower kernel: ata2.01: configured for UDMA/133
     May 21 14:54:28 Tower kernel: ata1.01: configured for UDMA/133

     It would be. The best bet would be to warn if the ata_piix driver is loaded, though it might give some false positives, since some boards/controllers might have an IDE mode for some add-on controller that is not in use; some older JMB controllers come to mind.
  3. https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=781601
  4. If Unraid is not shutting down cleanly it should create diags in the logs folder on the flash drive. Check there and upload those instead if they exist.
  5. It is because the board created the HPA, it's all explained in the link above, including how to remove it.
  6. Reboot in safe mode, disable the docker and VM services, if there's still activity install the Nerdpack plugin and enable iotop, it might show the active process.
  7. Looks like a hardware problem, run memtest and/or try a different board/controller if available.
  8. The diags you posted show where the problem began: disk1 was having ATA and read errors left and right, and it ended up corrupting the filesystem even without being disabled by Unraid:

     May 18 06:13:25 Tower kernel: sd 1:0:0:0: [sdb] tag#24 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
     May 18 06:13:25 Tower kernel: sd 1:0:0:0: [sdb] tag#24 Sense Key : 0x5 [current]
     May 18 06:13:25 Tower kernel: sd 1:0:0:0: [sdb] tag#24 ASC=0x21 ASCQ=0x0
     May 18 06:13:25 Tower kernel: sd 1:0:0:0: [sdb] tag#24 CDB: opcode=0x88 88 00 00 00 00 00 25 6a da 20 00 00 00 20 00 00
     May 18 06:13:25 Tower kernel: print_req_error: I/O error, dev sdb, sector 627759648
     May 18 06:13:25 Tower kernel: md: disk1 read error, sector=627759584
     May 18 06:13:25 Tower kernel: md: disk1 read error, sector=627759592
     May 18 06:13:25 Tower kernel: md: disk1 read error, sector=627759600
     May 18 06:13:25 Tower kernel: md: disk1 read error, sector=627759608
     May 18 06:13:25 Tower kernel: ata1: EH complete
     May 18 06:13:25 Tower kernel: BTRFS error (device md1): parent transid verify failed on 379424161792 wanted 869496 found 844525
     May 18 06:13:25 Tower kernel: loop: Write error at byte offset 67108864, length 4096.
     May 18 06:13:25 Tower kernel: print_req_error: I/O error, dev loop2, sector 131072
     May 18 06:13:25 Tower kernel: BTRFS warning (device loop2): lost page write due to IO error on /dev/loop2
     May 18 06:13:25 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
     May 18 06:13:25 Tower kernel: BTRFS error (device md1): parent transid verify failed on 379424129024 wanted 869496 found 864712
     May 18 06:14:16 Tower kernel: BTRFS error (device md1): parent transid verify failed on 379424718848 wanted 869496 found 864713
     May 18 06:14:45 Tower kernel: ata1.00: exception Emask 0x0 SAct 0xc00000 SErr 0x0 action 0x0
     May 18 06:14:45 Tower kernel: ata1.00: irq_stat 0x40000008
     May 18 06:14:45 Tower kernel: ata1.00: failed command: READ FPDMA QUEUED
     May 18 06:14:45 Tower kernel: ata1.00: cmd 60/08:b0:88:cd:97/00:00:05:00:00/40 tag 22 ncq dma 4096 in
     May 18 06:14:45 Tower kernel: res 41/10:00:88:cd:97/00:00:05:00:00/40 Emask 0x481 (invalid argument) <F>
     May 18 06:14:45 Tower kernel: ata1.00: status: { DRDY ERR }

     So in this case Unraid couldn't help you, since the fs corruption happened before the disk got disabled/replaced, and it also wasn't anything you did wrong. In theory the filesystem should survive these errors, but unfortunately that's not always the case; it can also be a problem with the disk's firmware. In any case, only backups could have saved you here. Also, for the typical user it's best to stick with XFS since it's usually more resilient, though not incorruptible.
  9. Onboard controller is set to IDE, change it to AHCI for best performance and reliability.
  10. Problems with the onboard SATA controller, quite common with some Ryzen boards. There are some reports that the newer kernel on v6.9-beta1 helps: upgrade, reboot, and post new diags after array start.
  11. Yeah, those call traces are not hardware related, and should never cause any docker image corruption, more info here:
  12. Sorry, I wasn't very clear. I meant to say use just the degraded option, with or without ro (read-only), but without usebackuproot: mount -o degraded,ro But it looks like it's missing more than one device, and that might not be the only problem. If degraded won't work I can't help more; you can try #btrfs on freenode IRC, they are very helpful there, but the pool might be beyond saving.
  13. It will mount if all pool members are connected to the server (unless there are other issues), and you just need to mount one of them; the others will be mounted together if found. If you don't have the missing member it might mount in degraded mode if the pool was redundant, see here for the command.
  14. There should be, with that array config you should easily get gigabit line speed with reads and writes, something else is going on there.
  15. It's running above the officially supported speed. I consider that overclocking, and it's a known issue, so regardless of what you call it, it's where I would start.
  16. Did you bother to read the link? Anything above 2666MHz with that CPU is an overclock.
  17. See here, you are overclocking the RAM, and that's most likely what's causing the problem. It's a known issue with Ryzen: even if there are no errors on memtest it can still corrupt data, which is usually also noticeable as sync errors found during a parity check.
  18. It shouldn't be used. The overall SMART test is mostly useless and will only fail if there's a FAILING NOW attribute. Of much more value is the extended SMART test, and it failed that one:

     Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
     # 1  Extended offline    Completed: read failure       90%            12773  440020808
  19. Not a bad idea to run a scrub to check for more corruption, but since it's a single device pool it can only detect corruption, not fix it.
  20. It means that block doesn't have the checksum it should have, i.e., the data is corrupt. You can fix it by deleting the file or overwriting it. The most likely cause would be the SSD; there's a bad block that was reallocated:

     183 Runtime_Bad_Block PO--C- 099 099 010 - 1

     The SSD firmware shouldn't reallocate a block containing data without being able to write it correctly to another place, but it's known to happen; it happened to me a few years ago with a Sandisk SSD. Another option would be a one-time RAM bit flip. If it was bad RAM in general it would likely cause more issues.
  21. See if turning off hard link support in Settings/Global Share Settings helps, but you'll lose hard link support.
  22. Yes, it's fixed for upcoming release: https://forums.unraid.net/bug-reports/prereleases/68-rc1-kernel-tun-unexpected-gso-r634/?do=findComment&comment=8945
  23. Posting the diagnostics might also give some clues.
  24. As mentioned, if the same filesystem keeps getting corrupted I would create a new one.
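
For the IDE-mode check described in post 2 above, here is a rough sketch in Python of what such a warning could look like. It is only an illustration, not something Unraid ships: the function name ide_mode_hints and both regexes are made up for this example. It looks for the two syslog signs mentioned there, the ata_piix driver being loaded and more than one device sharing the same ATA#. As noted in that post, the driver match can give false positives, and a shared ATA# can also show up with some SATA port multipliers, so treat the result as a hint only.

```python
import re

# Hypothetical patterns, written against the syslog lines quoted in post 2.
# Matches e.g. "kernel: ata_piix 0000:00:1f.2: version 2.13"
PIIX_RE = re.compile(r"kernel: ata_piix (\S+):")
# Matches e.g. "kernel: ata1.01: SATA link up" and captures port and device
ATA_DEV_RE = re.compile(r"kernel: ata(\d+)\.(\d+):")


def ide_mode_hints(syslog_text):
    """Return (pci_addresses, shared_ports) that hint at IDE mode.

    pci_addresses: PCI addresses bound to the ata_piix driver.
    shared_ports:  ATA port numbers with more than one device (ataN.00 + ataN.01).
    """
    pci_addresses = set()
    devices_per_port = {}
    for line in syslog_text.splitlines():
        match = PIIX_RE.search(line)
        if match:
            pci_addresses.add(match.group(1))
        match = ATA_DEV_RE.search(line)
        if match:
            port, device = int(match.group(1)), int(match.group(2))
            devices_per_port.setdefault(port, set()).add(device)
    shared_ports = sorted(p for p, devs in devices_per_port.items() if len(devs) > 1)
    return sorted(pci_addresses), shared_ports


sample = """\
May 21 14:54:28 Tower kernel: ata_piix 0000:00:1f.2: version 2.13
May 21 14:54:28 Tower kernel: scsi host1: ata_piix
May 21 14:54:28 Tower kernel: ata2.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 21 14:54:28 Tower kernel: ata2.01: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 21 14:54:28 Tower kernel: ata1.00: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
"""

controllers, ports = ide_mode_hints(sample)
print(controllers, ports)  # ['0000:00:1f.2'] [2]
```

If either list comes back non-empty, it is worth checking the board's SATA settings, keeping in mind that AMD boards may have separate settings for different groups of ports.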