Shares disappeared, new cache I/O errors



Hello!

I tried to access some Docker services today and realized that the cache drive was "out". The device appears as "Active, normal operation", but there are several errors in its log and most shares are missing. After a clean reboot, a notification reported that an unclean shutdown was detected, and a parity check is currently running. The shares are still missing after the reboot.

The cache drive is new (~10 days) and was operating with no issues for the past week.

 

Here's the cache drive log:

Mar 24 15:56:05 Tower kernel: nvme0n1: p1
Mar 24 15:57:28 Tower emhttpd: Samsung_SSD_990_PRO_with_Heatsink_1TB_S73JNJ0W605701A (nvme0n1) 512 1953525168
Mar 24 15:57:28 Tower emhttpd: import 30 cache device: (nvme0n1) Samsung_SSD_990_PRO_with_Heatsink_1TB_S73JNJ0W605701A
Mar 24 15:57:32 Tower emhttpd: read SMART /dev/nvme0n1
Mar 24 15:57:47 Tower emhttpd: shcmd (57): mount -t xfs -o noatime,nouuid /dev/nvme0n1p1 /mnt/cache
Mar 24 15:57:47 Tower kernel: XFS (nvme0n1p1): Mounting V5 Filesystem
Mar 24 15:57:48 Tower kernel: XFS (nvme0n1p1): Starting recovery (logdev: internal)
Mar 24 15:57:48 Tower kernel: XFS (nvme0n1p1): Ending recovery (logdev: internal)
Mar 24 16:01:55 Tower kernel: nvme0n1: I/O Cmd(0x2) @ LBA 516769048, 8 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 24 16:01:55 Tower kernel: I/O error, dev nvme0n1, sector 516769048 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
Mar 24 16:01:55 Tower kernel: nvme0n1: detected capacity change from 1953525168 to 0
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 1086305362, offset 1630208, sector 989418728
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 1076556609, offset 0, sector 979578768
Mar 24 16:01:55 Tower kernel: XFS (nvme0n1p1): log I/O error -5
Mar 24 16:01:55 Tower kernel: XFS (nvme0n1p1): Filesystem has been shut down due to log error (0x2).
Mar 24 16:01:55 Tower kernel: XFS (nvme0n1p1): Please unmount the filesystem and rectify the problem(s).
Mar 24 16:01:55 Tower kernel: XFS (nvme0n1p1): metadata I/O error in "xfs_imap_to_bp+0x50/0x70 [xfs]" at daddr 0x587d2230 len 32 error 5
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 555942864, offset 0, sector 507454752
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 536871051, offset 73728, sector 498262064
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 556136556, offset 86016, sector 507648792
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 1076717193, offset 0, sector 979739088
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 536871051, offset 77824, sector 498262072
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 1626563962, offset 4128768, sector 1508843144
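
The key line in the log above is the "detected capacity change from 1953525168 to 0" message: the kernel lost the whole device, and the XFS shutdown and writeback errors follow from that. A rough Python sketch for spotting such dropouts in a saved syslog is below. The function name `find_dropouts` and the 512-byte sector size are assumptions for illustration, not part of any Unraid tool.

```python
import re

# Matches kernel lines such as:
#   "Mar 24 16:01:55 Tower kernel: nvme0n1: detected capacity change from 1953525168 to 0"
CAPACITY_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"(?P<dev>\w+): detected capacity change from (?P<old>\d+) to (?P<new>\d+)"
)

SECTOR_BYTES = 512  # assumption: capacity is reported in 512-byte sectors


def find_dropouts(lines):
    """Return (timestamp, device, lost_bytes) for each capacity change to 0."""
    events = []
    for line in lines:
        m = CAPACITY_RE.match(line)
        if m and int(m.group("old")) > 0 and int(m.group("new")) == 0:
            lost_bytes = int(m.group("old")) * SECTOR_BYTES
            events.append((m.group("ts"), m.group("dev"), lost_bytes))
    return events


# Three lines copied from the log in this post.
SAMPLE = [
    "Mar 24 16:01:55 Tower kernel: I/O error, dev nvme0n1, sector 516769048 "
    "op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2",
    "Mar 24 16:01:55 Tower kernel: nvme0n1: detected capacity change from 1953525168 to 0",
    "Mar 24 16:01:55 Tower kernel: XFS (nvme0n1p1): log I/O error -5",
]

if __name__ == "__main__":
    for ts, dev, lost_bytes in find_dropouts(SAMPLE):
        print(f"{ts}: {dev} dropped offline ({lost_bytes} bytes no longer visible)")
```

To scan a real file, pass the open file object instead of `SAMPLE`, since the function only needs an iterable of lines.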

 

Any ideas? 

Thanks!

tower-diagnostics-20240324-1802.zip

Edited by Tzundoku
20 hours ago, JorgeB said:

The NVMe device dropped offline. Power cycle the server (don't just reboot), then post new diags.

Thanks for the prompt reply JorgeB.

 

I power-cycled as directed, and after going through a couple of other posts I also added the following parameters to the syslinux config:

  nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
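
One way to confirm after the power cycle that both parameters actually reached the kernel is to compare them against `/proc/cmdline`. The Python sketch below is only an illustration: the helper names `missing_params` and `read_cmdline` are made up here, and the check is a plain whitespace split that does not handle quoted values.

```python
# The two parameters added to the syslinux config in this post.
WANTED = ["nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off"]


def missing_params(cmdline, wanted=WANTED):
    """Return the wanted parameters that do not appear in the kernel command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    missing = missing_params(read_cmdline())
    if missing:
        print("Not active on this boot:", ", ".join(missing))
    else:
        print("Both parameters are active.")
```

If a parameter is reported as missing, the usual suspect is that it was added to a boot entry other than the one that was actually selected.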

 

Initially everything worked fine, but while I was watching a movie through Jellyfin the shares went out again.

 

Diagnostics taken immediately after.

 

tower-diagnostics-20240325-1615.zip

3 hours ago, JorgeB said:

The device dropped offline again. Try a different M.2 slot if available; if the issues continue, I would recommend using a different brand/model device.

Any chance the issue could be related to something else, e.g. the motherboard?

 

I was using a Samsung 870 EVO with months of uptime before replacing it with the drive that is currently going offline. A month or so ago, the 870 EVO started showing similar I/O errors whenever I started a Windows VM (which uses a VFIO-bound SSD that had also been working fine until last month). I thought the 870 EVO cache drive was at fault, hence the replacement.

 

 

  • 4 weeks later...
On 3/25/2024 at 8:53 PM, JorgeB said:

It's possible, but since you have another NVMe device, swap slots between them and re-test to see where the issue follows.

 

I swapped the drives, and on the same day the latest stable release dropped, so I updated as well. No issues so far, so either could be the fix.

 

Thanks for your time!
