Tzundoku Posted March 24 (edited)
Hello! I tried to access some Docker services today and realized that the cache drive was "out". The device appears as "Active, normal operation", but there are several errors in its log and most shares are missing. After a clean reboot there was a notification that an unclean shutdown was detected, and a parity check is currently running. Shares are still missing after the reboot. The cache drive is new (~10 days) and had been operating with no issues for the past week. Here's the cache drive log:
Mar 24 15:56:05 Tower kernel: nvme0n1: p1
Mar 24 15:57:28 Tower emhttpd: Samsung_SSD_990_PRO_with_Heatsink_1TB_S73JNJ0W605701A (nvme0n1) 512 1953525168
Mar 24 15:57:28 Tower emhttpd: import 30 cache device: (nvme0n1) Samsung_SSD_990_PRO_with_Heatsink_1TB_S73JNJ0W605701A
Mar 24 15:57:32 Tower emhttpd: read SMART /dev/nvme0n1
Mar 24 15:57:47 Tower emhttpd: shcmd (57): mount -t xfs -o noatime,nouuid /dev/nvme0n1p1 /mnt/cache
Mar 24 15:57:47 Tower kernel: XFS (nvme0n1p1): Mounting V5 Filesystem
Mar 24 15:57:48 Tower kernel: XFS (nvme0n1p1): Starting recovery (logdev: internal)
Mar 24 15:57:48 Tower kernel: XFS (nvme0n1p1): Ending recovery (logdev: internal)
Mar 24 16:01:55 Tower kernel: nvme0n1: I/O Cmd(0x2) @ LBA 516769048, 8 blocks, I/O Error (sct 0x3 / sc 0x71)
Mar 24 16:01:55 Tower kernel: I/O error, dev nvme0n1, sector 516769048 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
Mar 24 16:01:55 Tower kernel: nvme0n1: detected capacity change from 1953525168 to 0
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 1086305362, offset 1630208, sector 989418728
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 1076556609, offset 0, sector 979578768
Mar 24 16:01:55 Tower kernel: XFS (nvme0n1p1): log I/O error -5
Mar 24 16:01:55 Tower kernel: XFS (nvme0n1p1): Filesystem has been shut down due to log error (0x2).
Mar 24 16:01:55 Tower kernel: XFS (nvme0n1p1): Please unmount the filesystem and rectify the problem(s).
Mar 24 16:01:55 Tower kernel: XFS (nvme0n1p1): metadata I/O error in "xfs_imap_to_bp+0x50/0x70 [xfs]" at daddr 0x587d2230 len 32 error 5
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 555942864, offset 0, sector 507454752
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 536871051, offset 73728, sector 498262064
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 556136556, offset 86016, sector 507648792
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 1076717193, offset 0, sector 979739088
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 536871051, offset 77824, sector 498262072
Mar 24 16:01:55 Tower kernel: nvme0n1p1: writeback error on inode 1626563962, offset 4128768, sector 1508843144
Any ideas? Thanks!
Attachment: tower-diagnostics-20240324-1802.zip
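For anyone scanning a long syslog for the same symptom, the two telltale messages in the log above are the kernel reporting the device capacity changing to 0 and XFS shutting the filesystem down. The following is only a rough sketch in Python, not an Unraid tool: the function name `find_drop_events` and the event labels are made up for illustration, and the patterns are taken solely from the lines quoted in this thread, so other kernel versions may word them differently.

```python
import re

# Patterns copied from the messages quoted in this thread (assumption:
# other kernels or devices may phrase these differently).
CAPACITY_DROP = re.compile(
    r"kernel: (?P<dev>nvme\d+n\d+): detected capacity change from \d+ to 0\b"
)
XFS_SHUTDOWN = re.compile(
    r"XFS \((?P<part>\S+)\): Filesystem has been shut down"
)


def find_drop_events(lines):
    """Return (timestamp, device, kind) tuples for drop-related messages.

    `find_drop_events` is a hypothetical helper name, not part of any tool.
    """
    events = []
    for line in lines:
        stamp = line[:15]  # syslog prefix, e.g. "Mar 24 16:01:55"
        match = CAPACITY_DROP.search(line)
        if match:
            events.append((stamp, match.group("dev"), "capacity_to_zero"))
            continue
        match = XFS_SHUTDOWN.search(line)
        if match:
            events.append((stamp, match.group("part"), "xfs_shutdown"))
    return events


# Three lines taken from the log posted above.
sample = [
    "Mar 24 16:01:55 Tower kernel: nvme0n1: detected capacity change from 1953525168 to 0",
    "Mar 24 16:01:55 Tower kernel: XFS (nvme0n1p1): log I/O error -5",
    "Mar 24 16:01:55 Tower kernel: XFS (nvme0n1p1): Filesystem has been shut down due to log error (0x2).",
]
events = find_drop_events(sample)
for event in events:
    print(event)
```

To run it against a real log, pass the lines of the syslog file from the diagnostics zip instead of `sample`.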
JorgeB Posted March 24
The NVMe device dropped offline. Power cycle the server (don't just reboot), then post new diags.
Tzundoku Posted March 25 (Author)
20 hours ago, JorgeB said: The NVMe device dropped offline, power cycle the server, don't just reboot, then post new diags.
Thanks for the prompt reply, JorgeB. I power cycled as directed, but I also added the following parameters to the syslinux config after going through a couple of other posts:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
Initially everything was working fine; then, while I was watching a movie through Jellyfin, the shares went out again. Diagnostics were taken immediately after.
Attachment: tower-diagnostics-20240325-1615.zip
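When adding boot parameters like these, it can be worth confirming that the running kernel actually received them, since a typo or an edit to the wrong boot entry would leave them inactive. On Linux the active command line is readable from `/proc/cmdline`. The snippet below is a minimal sketch in Python; the helper names `missing_params` and `read_cmdline` and the example command line string are illustrative assumptions, not output from the poster's server.

```python
# Parameters mentioned in the post above.
WANTED = ["nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off"]


def missing_params(cmdline, wanted):
    """Return the wanted parameters that do not appear in the command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the kernel command line of the running system (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()


# Made-up command lines for illustration only.
example_ok = (
    "BOOT_IMAGE=/bzimage initrd=/bzroot "
    "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
)
example_partial = "BOOT_IMAGE=/bzimage initrd=/bzroot pcie_aspm=off"

print(missing_params(example_ok, WANTED))
print(missing_params(example_partial, WANTED))
```

On the server itself, `missing_params(read_cmdline(), WANTED)` would report anything that did not make it onto the running kernel's command line; an empty list means both parameters are active.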
Tzundoku Posted March 25 (Author, edited)
Should I post diagnostics immediately after power cycling, or is the above adequate? Thanks.
Attachment: tower-diagnostics-20240325-1636.zip
Edited March 25 by Tzundoku: diagnostics right after a power cycle.
JorgeB Posted March 25
The device dropped offline again. Try a different M.2 slot if available; if the issues continue, I would recommend using a different brand/model of device.
Tzundoku Posted March 25 (Author)
3 hours ago, JorgeB said: Device dropped offline again, try a different m.2 slot if available, if issues continue would recommend using a different brand/model device.
Any chance the issue could be related to something else, i.e. the mobo? I was using a Samsung 870 EVO with months of uptime before replacing it with the drive that is currently going offline. A month or so ago that drive started displaying similar I/O errors whenever I started a Windows VM (which uses a VFIO-bound SSD that was also working fine until last month). I thought the 870 EVO cache drive had the issue, hence the replacement.
JorgeB Posted March 25 (Solution)
23 minutes ago, Tzundoku said: I.e. the mobo?
It's possible, but since you have another NVMe device, swap slots between them and re-test to see whether the issue follows the device or the slot.
Tzundoku Posted April 22 (Author)
On 3/25/2024 at 8:53 PM, JorgeB said: It's possible, but since you have another NVMe device, swap slots between them and re-test, see where the issue follows.
I swapped the drives, and on the same day the latest stable release dropped, so I went with the update as well. No issues so far, so either could be the fix. Thanks for your time!
Tzundoku Posted April 30 (Author)
On 3/25/2024 at 8:53 PM, JorgeB said: It's possible, but since you have another NVMe device, swap slots between them and re-test, see where the issue follows.
Update: unfortunately, the issue has been back since yesterday.
JorgeB Posted May 1
Is the problem with the same device or the same slot?