
Cascade of issues and now BTRFS errors on cache drive and unable to connect


Recommended Posts

I was out of town and my system randomly went offline. When I returned, one of my array drives was showing as "unmountable disk present" in Unraid. I reseated some SATA cables and ran a repair on the drive in maintenance mode. It took some back and forth, but it eventually started working. I also moved my SATA controller to a different PCI slot.
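For anyone following along, the repair in maintenance mode is run from the terminal against the md device for the affected slot (the array disks here are XFS, as confirmed further down) - disk 1 below is just a placeholder, substitute the real disk number:

        xfs_repair -n /dev/md1p1   # dry run: reports problems without changing anything
        xfs_repair -v /dev/md1p1   # the actual repair, once the dry run output looks sane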

 

Unfortunately, either I also had cache drive corruption or I created it, and it seems like a ton of issues have now cascaded from it (e.g., none of my services seem to connect to Unraid, for example:

NETWORK: getaddrinfo ENOTFOUND mothership.unraid.net
CLOUD: Socket closed

I even tested downloading an update and that didn't work). I tried fixing the corruption.
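The cache pool is BTRFS, so for the corruption side the usual first checks are the device error counters and a scrub (assuming the pool is mounted at /mnt/cache):

        btrfs dev stats /mnt/cache        # per-device read/write/corruption error counters
        btrfs scrub start -B /mnt/cache   # foreground scrub; reports checksum errors when it finishes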

 

I took some diagnostics throughout, so I'll share them just in case; the most recent ends in 0209. I'd appreciate some help - I think I may have made things worse by rushing through this and trying to get back online ASAP.

tower-diagnostics-20240701-0209.zip tower-diagnostics-20240701-0152.zip tower-diagnostics-20240701-0148.zip tower-diagnostics-20240701-0025.zip

Link to comment
Posted (edited)

Thanks JorgeB - I ended up doing that last night, restarted this morning, and then checked your post. It found a ton of errors and fixed them, but unfortunately I didn't save the results. I restarted the server to see what was up, but it looks like I'm still getting errors - and at the very least it looks like my docker image will need remaking (though I imagine that's the least of my concerns).

tower-diagnostics-20240701-0958.zip

EDIT:
- Fixed the network issue (it was an unrelated DNS issue - a Pi-hole backup wasn't configured, and since the dockers were down, there was no DNS; a quick sanity check for this is sketched just below this list.)

- Deleted docker image & reinstalled dockers

- Things appear as if they will be ok but I still have not figured out the root cause on this. I do have a syslog but I imagine I shouldn't post it due to personal information?
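In case it helps anyone hitting the same "getaddrinfo ENOTFOUND" message: a quick way to confirm it's host-level DNS (rather than anything Unraid Connect specific) is to test resolution straight from the Unraid terminal. The hostname is just the one from the error above:

        cat /etc/resolv.conf              # shows which DNS server the Unraid host is using
        ping -c 1 mothership.unraid.net   # fails with a resolution error if DNS is down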


EDIT2: Still seeing this error - is this the root cause? I see that this is a known issue. Is this the replacement I want? https://www.amazon.ca/LSI-9211-8I-RAID-Controller-Card/dp/B0BVVN66XG/

Jul  1 11:00:22 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x20 SErr 0x0 action 0x6 frozen
Jul  1 11:00:22 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul  1 11:00:22 Tower kernel: ata2.00: cmd 60/20:28:88:a1:1c/00:00:5e:01:00/40 tag 5 ncq dma 16384 in
Jul  1 11:00:22 Tower kernel:         res 40/00:28:88:a1:1c/00:00:5e:01:00/40 Emask 0x4 (timeout)
Jul  1 11:00:22 Tower kernel: ata2.00: status: { DRDY }
Jul  1 11:00:22 Tower kernel: ata2: hard resetting link
Jul  1 11:00:22 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x200 SErr 0x0 action 0x6 frozen
Jul  1 11:00:22 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Jul  1 11:00:22 Tower kernel: ata8.00: cmd 60/20:48:68:54:51/00:00:5d:01:00/40 tag 9 ncq dma 16384 in
Jul  1 11:00:22 Tower kernel:         res 40/00:48:68:54:51/00:00:5d:01:00/40 Emask 0x4 (timeout)
Jul  1 11:00:22 Tower kernel: ata8.00: status: { DRDY }
Jul  1 11:00:22 Tower kernel: ata8: hard resetting link
Jul  1 11:00:25 Tower kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul  1 11:00:25 Tower kernel: ata8.00: configured for UDMA/133
Jul  1 11:00:25 Tower kernel: ata8: EH complete
Jul  1 11:00:28 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Jul  1 11:00:32 Tower kernel: ata2: COMRESET failed (errno=-16)
Jul  1 11:00:32 Tower kernel: ata2: hard resetting link
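For anyone trying to work out which drives those messages refer to: the ataX numbers can usually be matched back to a drive model by searching the syslog for the identification line the kernel prints at boot, something along these lines:

        grep 'ata2.00: ATA-' /var/log/syslog
        grep 'ata8.00: ATA-' /var/log/syslog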

Once the dockers are done reinstalling, I'll post a new diagnostic. 

Edited by TurkeyPerson
Link to comment
Posted (edited)
50 minutes ago, JorgeB said:

The ATA errors could just be a power/connection issue, did you try replacing the cables for the affected devices? Both SATA and power.

Thanks. I think you must be right. From my understanding, each of those is actually connected to a different controller. I will give that a shot once the dockers are done. Really appreciate everything you do around here - have a beer on me.

Edit: Done - will report back but I don't see any issues so far. 

Edited by TurkeyPerson
  • Like 1
Link to comment
Posted (edited)

Alas, I'm back with new diagnostics and corruption-related errors. The full error is longer, but this is the main part:

XFS (md6p1): Internal error ltbno + ltlen > bno at line 1955 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0xe9/0x6af [xfs]
Jul  3 13:19:57 Tower kernel: CPU: 6 PID: 11591 Comm: shfs Tainted: P           O       6.1.79-Unraid #1


One part states: 
 

Jul  3 13:19:57 Tower kernel: XFS (md6p1): Corruption detected. Unmount and run xfs_repair
Jul  3 13:19:57 Tower kernel: XFS (md6p1): Corruption of in-memory data (0x8) detected at xfs_defer_finish_noroll+0x479/0x503 [xfs] (fs/xfs/libxfs/xfs_defer.c:573).  Shutting down filesystem.
Jul  3 13:19:57 Tower kernel: XFS (md6p1): Please unmount the filesystem and rectify the problem(s)

 But I'm not even sure what drive it's talking about. 

Since it's XFS, I assume it's one of the drives in the array, so I'm checking each. I came across this on Disk 6:


Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

tower-diagnostics-20240703-2044.zip

 

And... it's unmountable. So I guess I should destroy the log again :/ ... I'm getting the feeling I'm screwed - I'm attaching an updated log and I guess I'll set this to repair. But I did this last time and just ended up back here. I'm going to run it on each drive (except parity) and then run a scrub on the cache drive for good measure. I can't imagine this will be the solution, but I'm all out of ideas.
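For reference, the sequence the xfs_repair output above is asking for is roughly: start the array normally so disk 6 mounts and the log gets replayed, stop it, then re-run the repair from maintenance mode; -L really is the last resort for when the disk won't mount at all:

        xfs_repair -v /dev/md6p1    # after a successful mount/unmount has replayed the log
        xfs_repair -vL /dev/md6p1   # last resort: zeroes the log and can discard the most recent metadata changes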

tower-diagnostics-20240703-2117.zip

Edited by TurkeyPerson
Link to comment

For future reference, if the device name is something along the lines of "md6p1" then this refers (as you found) to disk 6. The number after the 'md' part is the disk number in the main Unraid array.
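With the array started, you can also list the md devices to see the whole mapping at a glance; on recent Unraid releases each data slot shows up with the same p1 suffix seen in the log:

        ls /dev/md*    # e.g. /dev/md1p1 /dev/md2p1 ... /dev/md6p1, one per array data slot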

 

Whether it will be as simple as this when Unraid (eventually) supports multiple Unraid-type arrays, I have no idea :)

Link to comment

I thought I fixed it, then fixed the cache issue, and then disk 6 went offline again. I mentioned it as background in my first post. I ran a memtest for 24h and all is good on that front. Just turned things back on, so we'll see in a day or two I guess.

Link to comment
