
Cascade of issues and now BTRFS errors on cache drive and unable to connect


Recommended Posts

I was out of town and my system randomly went offline. When I returned, one of my array drives was showing as "unmountable disk present" in Unraid. I reseated some SATA cables and ran a repair on the drive in maintenance mode. It took some back and forth, but it eventually started working. I also moved my SATA controller to a different PCI slot.
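For anyone following along, the repair in maintenance mode is run from the terminal against the md device for the affected slot (the array disks here are XFS, as confirmed further down) - disk 1 below is just a placeholder, substitute the real disk number:

        xfs_repair -n /dev/md1p1   # dry run: reports problems without changing anything
        xfs_repair -v /dev/md1p1   # the actual repair, once the dry run output looks sane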

 

Unfortunately, either I also had cache drive corruption or I created it, and it seems like a ton of issues have now cascaded from it (e.g., none of my services seem to connect to Unraid, for example:

NETWORK: getaddrinfo ENOTFOUND mothership.unraid.net
CLOUD: Socket closed

I even tested downloading an update and that didn't work). I tried fixing the corruption.
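The cache pool is BTRFS, so for the corruption side the usual first checks are the device error counters and a scrub (assuming the pool is mounted at /mnt/cache):

        btrfs dev stats /mnt/cache        # per-device read/write/corruption error counters
        btrfs scrub start -B /mnt/cache   # foreground scrub; reports checksum errors when it finishes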

 

I took some diagnostics throughout, so I'll share them just in case; the most recent ends in 0209. I'd appreciate some help - I think I may have made things worse by rushing through this and trying to get back online ASAP.

tower-diagnostics-20240701-0209.zip tower-diagnostics-20240701-0152.zip tower-diagnostics-20240701-0148.zip tower-diagnostics-20240701-0025.zip

Link to comment
Posted (edited)

Thanks JorgeB - I ended up doing that last night, restarted this morning, and then checked your post. It found a ton of errors and fixed them, but unfortunately I didn't save the results. I restarted the server to see what was up, but it looks like I'm still getting errors - and at the very least it looks like my docker image will need remaking (though I imagine that's the least of my concerns).

tower-diagnostics-20240701-0958.zip

EDIT:
- Fixed the network issue (it was an unrelated DNS issue - a Pi-hole backup wasn't configured, and since the dockers were down, there was no DNS; a quick sanity check for this is sketched just below this list.)

- Deleted docker image & reinstalled dockers

- Things appear as if they will be ok but I still have not figured out the root cause on this. I do have a syslog but I imagine I shouldn't post it due to personal information?
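In case it helps anyone hitting the same "getaddrinfo ENOTFOUND" message: a quick way to confirm it's host-level DNS (rather than anything Unraid Connect specific) is to test resolution straight from the Unraid terminal. The hostname is just the one from the error above:

        cat /etc/resolv.conf              # shows which DNS server the Unraid host is using
        ping -c 1 mothership.unraid.net   # fails with a resolution error if DNS is down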


EDIT2: Still seeing this error - is this the root cause? I see that this is a known issue. Is this the replacement I want? https://www.amazon.ca/LSI-9211-8I-RAID-Controller-Card/dp/B0BVVN66XG/

Jul  1 11:00:22 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x20 SErr 0x0 action 0x6 frozen
Jul  1 11:00:22 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul  1 11:00:22 Tower kernel: ata2.00: cmd 60/20:28:88:a1:1c/00:00:5e:01:00/40 tag 5 ncq dma 16384 in
Jul  1 11:00:22 Tower kernel:         res 40/00:28:88:a1:1c/00:00:5e:01:00/40 Emask 0x4 (timeout)
Jul  1 11:00:22 Tower kernel: ata2.00: status: { DRDY }
Jul  1 11:00:22 Tower kernel: ata2: hard resetting link
Jul  1 11:00:22 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x200 SErr 0x0 action 0x6 frozen
Jul  1 11:00:22 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Jul  1 11:00:22 Tower kernel: ata8.00: cmd 60/20:48:68:54:51/00:00:5d:01:00/40 tag 9 ncq dma 16384 in
Jul  1 11:00:22 Tower kernel:         res 40/00:48:68:54:51/00:00:5d:01:00/40 Emask 0x4 (timeout)
Jul  1 11:00:22 Tower kernel: ata8.00: status: { DRDY }
Jul  1 11:00:22 Tower kernel: ata8: hard resetting link
Jul  1 11:00:25 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul  1 11:00:25 Tower kernel: ata8.00: configured for UDMA/133
Jul  1 11:00:25 Tower kernel: ata8: EH complete
Jul  1 11:00:28 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Jul  1 11:00:32 Tower kernel: ata2: COMRESET failed (errno=-16)
Jul  1 11:00:32 Tower kernel: ata2: hard resetting link
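For anyone trying to work out which drives those messages refer to: the ataX numbers can usually be matched back to a drive model by searching the syslog for the identification line the kernel prints at boot, something along these lines:

        grep 'ata2.00: ATA-' /var/log/syslog
        grep 'ata8.00: ATA-' /var/log/syslog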

Once the dockers are done reinstalling, I'll post a new diagnostic. 

Edited by TurkeyPerson
Link to comment
Posted (edited)
50 minutes ago, JorgeB said:

The ATA errors could just be a power/connection issue, did you try replacing the cables for the affected devices? Both SATA and power.

Thanks. I think you must be right. From my understanding, each of those is actually connected to a different controller. I will give that a shot once the dockers are done. Really appreciate everything you do around here - have a beer on me.

Edit: Done - will report back but I don't see any issues so far. 

Edited by TurkeyPerson
  • Like 1
Link to comment
Posted (edited)

Alas, I'm back with new diagnostics and corruption-related errors. The full error is longer, but this is the main part:

XFS (md6p1): Internal error ltbno + ltlen > bno at line 1955 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0xe9/0x6af [xfs]
Jul  3 13:19:57 Tower kernel: CPU: 6 PID: 11591 Comm: shfs Tainted: P           O       6.1.79-Unraid #1


One part states: 
 

Jul  3 13:19:57 Tower kernel: XFS (md6p1): Corruption detected. Unmount and run xfs_repair
Jul  3 13:19:57 Tower kernel: XFS (md6p1): Corruption of in-memory data (0x8) detected at xfs_defer_finish_noroll+0x479/0x503 [xfs] (fs/xfs/libxfs/xfs_defer.c:573).  Shutting down filesystem.
Jul  3 13:19:57 Tower kernel: XFS (md6p1): Please unmount the filesystem and rectify the problem(s)

 But I'm not even sure what drive it's talking about. 

Since it's XFS, I assume it's one of the drives in the array, so I'm checking each. I came across this on Disk 6:


Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

tower-diagnostics-20240703-2044.zip

 

And... it's unmountable. So I guess I should destroy the log again :/ ... I'm getting the feeling I'm screwed - I'm attaching an updated log and I guess I'll set this to repair. But I did this last time and just ended up back here. I'm going to run it on each drive (except parity) and then run a scrub on the cache drive for good measure. I can't imagine this will be the solution, but I'm all out of ideas.
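For reference, the sequence the xfs_repair output above is asking for is roughly: start the array normally so disk 6 mounts and the log gets replayed, stop it, then re-run the repair from maintenance mode; -L really is the last resort for when the disk won't mount at all:

        xfs_repair -v /dev/md6p1    # after a successful mount/unmount has replayed the log
        xfs_repair -vL /dev/md6p1   # last resort: zeroes the log and can discard the most recent metadata changes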

tower-diagnostics-20240703-2117.zip

Edited by TurkeyPerson
Link to comment

For future reference, if the device name is something along the lines of "md6p1" then this refers (as you found) to disk 6. The number after the 'md' part is the disk number in the main Unraid array.
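With the array started, you can also list the md devices to see the whole mapping at a glance; on recent Unraid releases each data slot shows up with the same p1 suffix seen in the log:

        ls /dev/md*    # e.g. /dev/md1p1 /dev/md2p1 ... /dev/md6p1, one per array data slot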

 

Whether it will be as simple as this when Unraid (eventually) supports multiple Unraid-type arrays, I have no idea :)

Link to comment

I thought I fixed it, then fixed the cache issue, and then disk 6 went offline again. I mentioned it as background in my first post. I ran a memtest for 24h and all is good on that front. Just turned things back on, so we'll see in a day or two I guess.

Link to comment
