February 1, 2024 Community Expert

Hi,

Randomly, last night, with no activity on the server, a 250 GB SSD appeared to die. The other two 1 TB SSDs, which run my cache pool, also seem to have been affected. All three SSDs are connected through a PCI RAID card to an Icy Dock 4-disk caddy, which hasn't been a problem since I set up this server about 6 months ago.

I was going to stop the array and check the disks, but the array will not stop. It's also not possible to collect diagnostics; it hangs at:

cp /etc/libvirt/qemu/*.xml '/kbnas-diagnostics-20240201-1657/xml' 2>/dev/null

Presumably it wants to access my unmounted cache disk for this. See the screenshots and attached syslog.

What's the best way to proceed, please?

Thanks,
Kev

kbnas-syslog-20240201-1556.zip
February 1, 2024 Author Community Expert

Should I just mount the two SSDs used for cache? And why are they showing up in UD (Unassigned Devices)?
February 1, 2024 Community Expert

Looks like the device dropped offline. Reboot the server by typing reboot on the CLI; if it hasn't rebooted after 5 minutes you will need to force it. Then post new diags after array start.
February 1, 2024 Author Community Expert

kbnas-diagnostics-20240201-1800.zip
kbnas-syslog-20240201-1803.zip

It seems to be OK now, with the array still stopped. Disk4 is still showing a red X, but now reports 'Overall Health - PASSED'.

Any further comments/advice, please? @JorgeB

Thanks,
Kev
February 1, 2024 Community Expert

Start the array. If the emulated disk mounts and the contents look correct, you can rebuild on top.
February 1, 2024 Author Community Expert

Sorry, I saw that you asked for logs AFTER starting the array. Here they are, and my cache looks unwell.

kbnas-diagnostics-20240201-1821.zip
kbnas-syslog-20240201-1822.zip

@JorgeB
February 1, 2024 Author Community Expert

Would you suggest hitting 'Clear' or running 'zpool clear'? SMART isn't showing any problems on the cache disks.

And what about Disk 4? It's also showing no errors, but...

Edited February 1, 2024 by Kev600
February 1, 2024 Community Expert

28 minutes ago, Kev600 said:
Would you suggest to hit 'Clear' - OR run 'zpool clear'.

It's the same; the error was during pool import.

Feb 1 18:20:46 KBNAS kernel: sd 1:0:3:0: [sdd] Unaligned partial completion (resid=178, sector_sz=512)
Feb 1 18:20:46 KBNAS kernel: sd 1:0:3:0: [sdd] tag#420 CDB: opcode=0x28 28 00 36 b3 e9 88 00 01 00 00
Feb 1 18:20:46 KBNAS kernel: sd 1:0:3:0: [sdd] tag#420 UNKNOWN(0x2003) Result: hostbyte=0x05 driverbyte=DRIVER_OK cmd_age=30s
Feb 1 18:20:46 KBNAS kernel: sd 1:0:3:0: [sdd] tag#420 CDB: opcode=0x28 28 00 36 b3 e9 88 00 01 00 00
Feb 1 18:20:46 KBNAS kernel: I/O error, dev sdd, sector 917760392 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 2
Feb 1 18:20:46 KBNAS kernel: zio pool=cache vdev=/dev/sdd1 error=5 type=1 offset=469892272128 size=131072 flags=180990

This looks more like a power/connection issue; check those, or try a different controller if possible.
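For anyone wanting to see at a glance which devices are throwing errors like the ones quoted above, here is a minimal sketch that counts the kernel's "I/O error, dev ..." lines per device. The function name `count_io_errors` is made up for illustration, and the syslog path in the comment is an assumption about where the live log sits on the server; the same function can be pointed at the syslog file from a diagnostics zip instead.

```shell
# Hypothetical helper: summarise kernel block-layer I/O errors per device.
# Reads syslog text on stdin and prints one "<count> <device>" line per device.
count_io_errors() {
    grep -o 'I/O error, dev [a-z0-9]*' \
        | awk '{ count[$NF]++ } END { for (dev in count) print count[dev], dev }' \
        | sort -k2
}

# Example usage (path is an assumption, adjust to your log location):
#   count_io_errors < /var/log/syslog
```

If several devices behind the same controller show errors at the same timestamps, that points more toward the shared cable, power, or controller than toward the individual disks.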
February 1, 2024 Author Community Expert

Thanks @JorgeB

But what an absolute pig of an issue to resolve. I guess having my cache mirrored saved my ass here?

I guess the same thing happened with Disk 4. Should I re-add it or remove it?
February 1, 2024 Community Expert

6 minutes ago, Kev600 said:
I guess the same thing happened with Disk 4

Possibly. You can rebuild it if you need it; if you don't really use it, it might be better to just remove it.
February 3, 2024 Author Community Expert Solution

@JorgeB Update:

I wanted to run 'zpool clear', but the GUI didn't show the 'Clear' button, only 'Scrub'. So I ran a scrub and it finished quite fast. I rebooted and saw a lot more disk errors in the logs.

I ran the scrub again; this time it took the pool into DEGRADED mode and showed errors for one of the disks. It hung at about 68% (3+ hrs).

I forced a hard shutdown/restart and ran a scrub again. This time it showed loads of errors for both disks and went through SUSPENDED > DISABLED > SUSPENDED > DEGRADED status, and it seemed to hang again. The Unraid GUI was still responsive, but I could not stop/start the array, and more device and I/O errors kept filling up the logs.

I rebooted again and the array hung at 'mounting drives', so I switched it off and went to bed.

This afternoon, I pulled the SAS/SATA controller card and moved it to a different PCI slot. I reseated the cable at the card and at each of the 4 bays of the SSD caddy. I also disconnected the power and data cables from the DVD drive, as it was unused.

I started it, and now, unbelievably, everything seems back to normal and all my containers are available, with the zfs cache pool showing:

  pool: cache
 state: ONLINE
  scan: resilvered 72K in 00:00:00 with 0 errors on Sat Feb 3 15:56:05 2024
config:

        NAME           STATE     READ WRITE CKSUM
        cache          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            /dev/sdb1  ONLINE       0     0     0
            /dev/sdc1  ONLINE       0     0     0

errors: No known data errors

How can I test the integrity of my cache pool & disks?

Note: The 250 GB disk (old Disk 4) has been added under UD for now, showing no SMART errors.

I could potentially add 2x 4 TB SSD drives to the array at a later time, but I need to know this 'half' of the storage is sound.

Thanks for your quick responses last night, and any other comments you might have. Fresh logs attached.

Thanks,
Kev

kbnas-diagnostics-20240203-1623.zip
kbnas-syslog-20240203-1622.zip

Edited February 3, 2024 by Kev600
February 4, 2024 Community Expert

18 hours ago, Kev600 said:
How can I test the integrity of my cache pool & disks?

Since cache is zfs, a scrub is the best way to test that. The array disks are xfs formatted, so there's no way to test file integrity unless you have pre-existing checksums, but I would expect the data to be fine. You can run a parity check to at least confirm there are no more errors under high load.
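As a rough sketch of the scrub check from the command line: `zpool scrub cache` starts the scrub and `zpool status cache` reports the result once it finishes. The helper below only inspects the text of `zpool status`; the name `pool_looks_healthy` is hypothetical, and it assumes the usual status layout with a `state:` line, a `scan:` line, and a final `errors:` line.

```shell
# Hypothetical helper: reads `zpool status <pool>` output on stdin.
# Succeeds only if no scrub is still running, the pool state is ONLINE,
# and ZFS reports no known data errors.
pool_looks_healthy() {
    _status=$(cat)
    # A scrub that is still running has not finished verifying the pool.
    if printf '%s\n' "$_status" | grep -q 'scrub in progress'; then
        return 1
    fi
    printf '%s\n' "$_status" | grep -Eq '^[[:space:]]*state:[[:space:]]*ONLINE[[:space:]]*$' &&
        printf '%s\n' "$_status" | grep -q 'errors: No known data errors'
}

# Example usage on the server (pool name "cache" as in this thread):
#   zpool scrub cache
#   ... wait for the scrub to finish, then:
#   zpool status cache | pool_looks_healthy && echo "cache pool looks clean"
```

This only reflects what ZFS itself reports; it says nothing about the xfs array disks, which is where the parity check comes in.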
February 4, 2024 Author Community Expert

@JorgeB Successful zfs scrub and array parity checks! All good, nothing lost. But a very odd issue, which miiiiiiight reoccur.

Thanks mate