TurkeyPerson Posted July 1

I was out of town and my system went offline unexpectedly. When I returned, one of my array drives was showing as unmountable in Unraid (the disk itself was still present). I reseated some SATA cables and ran a repair on the drive in maintenance mode. It took some back and forth, but it eventually started working. I also moved my SATA controller to a different PCI slot. Unfortunately, either I also had cache pool corruption or I created it, and a ton of issues now seem to have cascaded from it. For example, none of my services seem to connect to Unraid:

NETWORK: getaddrinfo ENOTFOUND mothership.unraid.net
CLOUD: Socket closed

I even tested downloading an update and that didn't work. I tried fixing the corruption. I took some diagnostics throughout, so I'll share them just in case; the most recent ends in 0209. I'd appreciate some help - I think I may have made things worse by rushing through this and trying to get back online ASAP.

tower-diagnostics-20240701-0209.zip
tower-diagnostics-20240701-0152.zip
tower-diagnostics-20240701-0148.zip
tower-diagnostics-20240701-0025.zip
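[For anyone hitting the same "getaddrinfo ENOTFOUND mothership.unraid.net" message: a quick way to check whether it is a DNS problem (as it turned out to be later in this thread) is to test name resolution from the server's console. This is a minimal sketch; the hostname is the one from the error above, and 1.1.1.1 is just an example public resolver.]

# Show which DNS servers the server is currently configured to use
cat /etc/resolv.conf

# Try resolving the hostname from the error with the configured resolver
nslookup mothership.unraid.net

# Try again against a known public resolver (example: 1.1.1.1) to see whether
# the problem is the local DNS server (e.g. a Pi-hole container that is down)
nslookup mothership.unraid.net 1.1.1.1

[If the second lookup works but the first does not, the configured DNS server is the problem rather than the Unraid services themselves.]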
JorgeB Posted July 1

Run a correcting scrub on the pool and post the results.
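[In the Unraid GUI this is done from the pool device's page (Scrub with the repair option enabled). A rough command-line equivalent for a btrfs cache pool, assuming the pool is mounted at /mnt/cache, would be:]

# Start a correcting scrub on the pool and wait for it to finish (-B = run in foreground)
btrfs scrub start -B /mnt/cache

# Show the scrub summary, including corrected and uncorrectable errors
btrfs scrub status /mnt/cache

# Per-device error counters (read/write/corruption); non-zero values point at
# a specific device or its cabling
btrfs device stats /mnt/cache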
TurkeyPerson Posted July 1

Thanks JorgeB - I ended up doing that last night and restarting this morning, then checked your post. It found a ton of errors and fixed them, but unfortunately I didn't save the results. I restarted the server to see what was up, but it looks like I'm still getting errors - and at the very least it looks like my docker image will need remaking (though I imagine that's the least of my concerns).

tower-diagnostics-20240701-0958.zip

EDIT:
- Fixed the network issue (it was an unrelated DNS issue - the Pi-hole backup wasn't configured, and since the dockers were down, there was no DNS).
- Deleted the docker image and reinstalled the dockers.
- Things look like they will be OK, but I still haven't figured out the root cause. I do have a syslog, but I imagine I shouldn't post it due to personal information?

EDIT 2: Still seeing this error - is this the root cause? I see that this is a known issue. Is this the replacement I want? https://www.amazon.ca/LSI-9211-8I-RAID-Controller-Card/dp/B0BVVN66XG/

Jul 1 11:00:22 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x20 SErr 0x0 action 0x6 frozen
Jul 1 11:00:22 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 1 11:00:22 Tower kernel: ata2.00: cmd 60/20:28:88:a1:1c/00:00:5e:01:00/40 tag 5 ncq dma 16384 in
Jul 1 11:00:22 Tower kernel: res 40/00:28:88:a1:1c/00:00:5e:01:00/40 Emask 0x4 (timeout)
Jul 1 11:00:22 Tower kernel: ata2.00: status: { DRDY }
Jul 1 11:00:22 Tower kernel: ata2: hard resetting link
Jul 1 11:00:22 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x200 SErr 0x0 action 0x6 frozen
Jul 1 11:00:22 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Jul 1 11:00:22 Tower kernel: ata8.00: cmd 60/20:48:68:54:51/00:00:5d:01:00/40 tag 9 ncq dma 16384 in
Jul 1 11:00:22 Tower kernel: res 40/00:48:68:54:51/00:00:5d:01:00/40 Emask 0x4 (timeout)
Jul 1 11:00:22 Tower kernel: ata8.00: status: { DRDY }
Jul 1 11:00:22 Tower kernel: ata8: hard resetting link
Jul 1 11:00:25 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 1 11:00:25 Tower kernel: ata8.00: configured for UDMA/133
Jul 1 11:00:25 Tower kernel: ata8: EH complete
Jul 1 11:00:28 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Jul 1 11:00:32 Tower kernel: ata2: COMRESET failed (errno=-16)
Jul 1 11:00:32 Tower kernel: ata2: hard resetting link

Once the dockers are done reinstalling, I'll post a new diagnostic.
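[A hedged aside: to work out which physical drives sit behind the "ata2" and "ata8" names in those resets, the boot messages can be searched, since the kernel logs each ATA port together with the drive it detected. Exact output varies by kernel and controller.]

# Find the drive model the kernel attached to each affected ATA port at boot
dmesg | grep -E 'ata(2|8)\.00: (ATA-|configured)'

# Count link resets per port in the syslog to see which ports are acting up
grep -oE 'ata[0-9]+: hard resetting link' /var/log/syslog | sort | uniq -c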
JorgeB Posted July 1

The ATA errors could just be a power/connection issue. Did you try replacing the cables for the affected devices? Both SATA and power.
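[One way to sanity-check the cable theory is the drives' SMART data: the UDMA CRC error counter typically climbs when a SATA data cable is bad. A minimal sketch, assuming the affected drive appears as /dev/sdX - substitute the real device name from the Main page.]

# /dev/sdX is a placeholder - use the actual device behind ata2/ata8
smartctl -A /dev/sdX | grep -i -E 'UDMA_CRC|CRC_Error'

# Full SMART health report for the same drive
smartctl -H -a /dev/sdX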
TurkeyPerson Posted July 1

50 minutes ago, JorgeB said:
The ATA errors could just be a power/connection issue. Did you try replacing the cables for the affected devices? Both SATA and power.

Thanks. I think you must be right. From my understanding, each of those drives is actually connected to a different controller. I will give that a shot once the dockers are done. Really appreciate everything you do around here - have a beer on me.

EDIT: Done - I'll report back, but I don't see any issues so far.
TurkeyPerson Posted July 4

Alas, I'm back with new diagnostics and corruption-related errors. The full error is longer, but this is the main part:

XFS (md6p1): Internal error ltbno + ltlen > bno at line 1955 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_ag_extent+0xe9/0x6af [xfs]
Jul 3 13:19:57 Tower kernel: CPU: 6 PID: 11591 Comm: shfs Tainted: P O 6.1.79-Unraid #1

One part states:

Jul 3 13:19:57 Tower kernel: XFS (md6p1): Corruption detected. Unmount and run xfs_repair
Jul 3 13:19:57 Tower kernel: XFS (md6p1): Corruption of in-memory data (0x8) detected at xfs_defer_finish_noroll+0x479/0x503 [xfs] (fs/xfs/libxfs/xfs_defer.c:573). Shutting down filesystem.
Jul 3 13:19:57 Tower kernel: XFS (md6p1): Please unmount the filesystem and rectify the problem(s)

But I'm not even sure which drive it's talking about. Since it's XFS, I assume it's one of the drives in the array, so I'm checking each. I came across this on Disk 6:

Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

tower-diagnostics-20240703-2044.zip

And... it's unmountable. So I guess I should destroy the log again... I'm getting the feeling I'm screwed. Attaching an updated diagnostic, and I guess I'll set this to repair. I did this last time and just ended up back here. Going to run it on each drive (except parity) and then run a scrub on the cache pool for good measure. I can't imagine this will be the solution, but I'm all out of ideas.

tower-diagnostics-20240703-2117.zip
JorgeB Posted July 4

7 hours ago, TurkeyPerson said:
So I guess I should destroy the log again

Yep, that's the only option to try and repair the filesystem.
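[For reference, the repair can be run from the GUI (array started in Maintenance mode, click the disk, run the filesystem check) or from the console. A sketch of the command-line route, assuming the array is in Maintenance mode and the affected disk is Disk 6, i.e. device /dev/md6p1 as in the log above:]

# Dry run first: report what would be fixed without changing anything
xfs_repair -n /dev/md6p1

# Actual repair; -L zeroes (destroys) the log, which is what the earlier error
# message asks for when the log cannot be replayed by mounting, -v is verbose
xfs_repair -L -v /dev/md6p1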
itimpi Posted July 4

For future reference: if the device name is something along the lines of "md6p1", then it refers (as you found) to Disk 6. The number after the 'md' part is the disk number in the main Unraid array. Whether it will be as simple as this when Unraid (eventually) supports multiple Unraid-type arrays, I have no idea.
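[As a small illustration of that mapping, kernel messages can be filtered by the md device name, so any XFS complaint can be traced back to a disk slot:]

# Every XFS message tagged with an array device; mdXp1 corresponds to Disk X
grep -E 'XFS \(md[0-9]+p1\)' /var/log/syslog

# Only the messages for Disk 6, the disk reported corrupted above
grep 'XFS (md6p1)' /var/log/syslog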
TurkeyPerson Posted July 4

Thanks, everyone. Quick follow-up: if this doesn't work, what's the nuclear option? Transfer the files off the drive and format it?
JorgeB Posted July 4

You cannot transfer the files before trying to fix the filesystem. Is the filesystem corruption on that specific disk a recurring issue?
TurkeyPerson Posted July 5

I thought I had fixed it, then fixed the cache issue, and then Disk 6 went offline again. I mentioned it as background in my first post. I ran a memtest for 24 hours and everything is good on that front. I just brought things back online, so we'll see in a day or two, I guess.