Cache drive failure


Solved by JorgeB


Hi there. My Unraid server has been a bit problematic for a while now with various errors, and this seems to be the most irritating one.
All my Docker containers went offline today, and I got Execution Error 403 when I tried to start them. After a reboot, which took a lot longer than usual, my cache drive is now gone. It is a single Kingston A1000 480GB, and now I wonder what the smartest next step is. My motherboard only has one M.2 slot, but some available SATA ports.

I will remove the drive and try it in a USB enclosure to see if I can connect to it. If all the data is gone, I don't actually know what is lost. I haven't written to the shares recently, so there should be no data loss there, but I suppose all the appdata from Plex and Jellyfin might be gone.

plexworm-diagnostics-20240114-2200.zip

On 1/15/2024 at 12:55 PM, JorgeB said:

Try power cycling the server, not just rebooting, and/or using a different SATA port/cables.

Thank you! That power cycle actually seems to have fixed it. It has now been up and running for quite a while. Could this be an indication of a failing M.2 drive, or something else?

Here is a SMART report for the M.2 drive. The drive is about 5 years old by now.


smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.64-Unraid] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SA1000M8480G
Serial Number:                      50026B72821050AE
Firmware Version:                   E8FK11.L
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Total NVM Capacity:                 480,103,981,056 [480 GB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          480,103,981,056 [480 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 2821050ae5
Local Time is:                      Tue Jan 16 08:33:23 2024 PST
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x04):         Ext_Get_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     90 Celsius
Critical Comp. Temp. Threshold:     94 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.90W  0.0790W       -    0  0  0  0      600     600
 1 +     7.90W  0.0790W       -    0  0  0  0      600     600
 2 +     7.90W  0.0790W       -    0  0  0  0      600     600
 3 -   0.1000W  0.0790W       -    3  3  3  3     1000    1000
 4 -   0.0050W  0.0790W       -    4  4  4  4   400000   90000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        35 Celsius
Available Spare:                    100%
Available Spare Threshold:          100%
Percentage Used:                    7%
Data Units Read:                    10,739,194 [5.49 TB]
Data Units Written:                 26,557,014 [13.5 TB]
Host Read Commands:                 74,886,228
Host Write Commands:                397,426,169
Controller Busy Time:               2,368
Power Cycles:                       3,297
Power On Hours:                     4,272
Unsafe Shutdowns:                   30
Media and Data Integrity Errors:    0
Error Information Log Entries:      156
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 2:               35 Celsius

Error Information (NVMe Log 0x01, 16 of 16 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0        156     0  0x001c  0x0005      -            6     0     -  Invalid Field in Command
  1        155     0  0x0018  0x0005      -           12     0     -  Invalid Field in Command
  2        154     0  0x4010  0x0005      -            6     0     -  Invalid Field in Command
  3        152     0  0x0010  0x0005      -            6     0     -  Invalid Field in Command
  4        151     0  0x0014  0x0005      -           12     0     -  Invalid Field in Command
  5        150     0  0x0000  0x0005      -            6     0     -  Invalid Field in Command
  6        149     0  0x001d  0x0005      -           12     0     -  Invalid Field in Command
  7        147     0  0x0004  0x0005      -           12     0     -  Invalid Field in Command
  8        144     0  0x1008  0x0005      -            6     0     -  Invalid Field in Command
  9        143     0  0x1005  0x0005      -           12     0     -  Invalid Field in Command
 10        142     0  0x1014  0x0005      -            6     0     -  Invalid Field in Command
 11        137     0  0x1011  0x0005      -           12     0     -  Invalid Field in Command
 12        130     0  0x1005  0x0005      -           12     0     -  Invalid Field in Command
 13        129     0  0x1011  0x0005      -            6     0     -  Invalid Field in Command
 14        116     0  0x6004  0x0005      -            6     0     -  Invalid Field in Command
 15        109     0  0x002c  0x0203      -            0     0     -  Invalid Queue Identifier

Self-tests not supported
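For anyone skimming a report like the one above, only a handful of fields usually matter for judging drive health. The sketch below pulls those fields out of a saved report; the here-doc is a trimmed copy of the values in this post. On a live server you would pipe `smartctl -a /dev/nvme0` instead (the `/dev/nvme0` path is an assumption and may differ on your system).

```shell
# Illustrative only: filter the key health indicators from a saved
# smartctl NVMe report. The sample text is copied from the report above.
report='Percentage Used:                    7%
Media and Data Integrity Errors:    0
Unsafe Shutdowns:                   30
Error Information Log Entries:      156'

# Zero media/integrity errors and single-digit wear are good signs;
# a high unsafe-shutdown count mostly reflects power problems, not wear.
printf '%s\n' "$report" | grep -E 'Percentage Used|Media and Data|Unsafe Shutdowns'
```

In this case the drive shows 7% wear and zero media errors, which fits JorgeB's read that outright failure is unlikely.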


30 minutes ago, Beijergard said:

Could this be an indication of a failing m2 drive or something else? 

Unlikely, but when NVMe devices drop, usually only a power cycle will bring them back. This may help if it keeps dropping:

On the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right) and add this to your default boot option, after "append initrd=/bzroot"

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off


Reboot and see if it makes a difference.
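After rebooting, a quick way to confirm both flags actually took effect is to check the running kernel's command line. The snippet below simulates that check against the example append line; on the live server you would replace the hard-coded string with the contents of `/proc/cmdline`:

```shell
# Sketch: verify both NVMe power-saving flags made it onto the kernel
# command line. On the server: cmdline=$(cat /proc/cmdline)
cmdline='append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off'

for flag in nvme_core.default_ps_max_latency_us=0 pcie_aspm=off; do
  # Pad with spaces so we only match whole, space-delimited parameters.
  case " $cmdline " in
    *" $flag "*) echo "present: $flag" ;;
    *)           echo "missing: $flag" ;;
  esac
done
```

Setting `nvme_core.default_ps_max_latency_us=0` disables NVMe autonomous power-state transitions and `pcie_aspm=off` disables PCIe link power management; both are common workarounds for NVMe drives that disappear until a power cycle.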

