I just installed a 2TB Crucial P5 NVMe SSD and I lose access to the disk whenever Unraid reads its SMART data (for example, when I click on the disk in the web UI). I know the problem is triggered by reading the SMART information, because I can reproduce the same error from the console with smartctl.
In syslog:
Oct 25 14:09:52 UNBuly kernel: DMAR: DRHD: handling fault status reg 2
Oct 25 14:09:52 UNBuly kernel: DMAR: [DMA Read] Request device [03:00.0] PASID ffffffff fault addr ffbf0000 [fault reason 06] PTE Read access is not set
Oct 25 14:10:31 UNBuly kernel: nvme nvme0: I/O 193 QID 23 timeout, aborting
Oct 25 14:10:52 UNBuly kernel: nvme nvme0: I/O 29 QID 0 timeout, reset controller
Oct 25 14:11:01 UNBuly kernel: nvme nvme0: I/O 193 QID 23 timeout, reset controller
The disk disappears from the system (I can't even see it as /dev/nvme0n1) and I don't get it back until I power the machine off and on again.
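In case it helps anyone compare logs, here is a minimal Python sketch that pulls the DMAR fault and the NVMe timeouts out of syslog lines like the ones quoted above. The function name `summarize` and both regular expressions are my own assumptions, written only against the five lines in this post; they are not part of Unraid or any kernel tooling and may need adjusting for other kernel versions.

```python
import re

# Both patterns are assumptions derived only from the syslog lines in this post.
DMAR_FAULT = re.compile(
    r"DMAR: \[DMA Read\] Request device \[(?P<bdf>[0-9a-f:.]+)\]"
    r".*\[fault reason (?P<reason>\d+)\]"
)
NVME_TIMEOUT = re.compile(
    r"nvme (?P<ctrl>nvme\d+): I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout, (?P<action>.+)$"
)


def summarize(lines):
    """Return a list of tuples describing DMAR faults and NVMe timeouts, in log order."""
    events = []
    for line in lines:
        m = DMAR_FAULT.search(line)
        if m:
            # PCI address of the faulting device and the raw fault reason code.
            events.append(("dmar_fault", m.group("bdf"), m.group("reason")))
            continue
        m = NVME_TIMEOUT.search(line)
        if m:
            # Controller name, queue ID, and what the driver did about the timeout.
            events.append(
                ("nvme_timeout", m.group("ctrl"), int(m.group("qid")), m.group("action").strip())
            )
    return events


# The syslog lines from this post, copied verbatim.
SAMPLE = [
    "Oct 25 14:09:52 UNBuly kernel: DMAR: DRHD: handling fault status reg 2",
    "Oct 25 14:09:52 UNBuly kernel: DMAR: [DMA Read] Request device [03:00.0] PASID ffffffff fault addr ffbf0000 [fault reason 06] PTE Read access is not set",
    "Oct 25 14:10:31 UNBuly kernel: nvme nvme0: I/O 193 QID 23 timeout, aborting",
    "Oct 25 14:10:52 UNBuly kernel: nvme nvme0: I/O 29 QID 0 timeout, reset controller",
    "Oct 25 14:11:01 UNBuly kernel: nvme nvme0: I/O 193 QID 23 timeout, reset controller",
]

if __name__ == "__main__":
    for event in summarize(SAMPLE):
        print(event)
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `summarize(open("/var/log/syslog"))`; that path is where the syslog usually lives, but treat it as an assumption for your own system.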
smartctl displays this information before freezing:
# smartctl -a /dev/nvme0n1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.10.28-Unraid] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: CT1000P5SSD8
Serial Number: 21xxxx
Firmware Version: P4CR311
PCI Vendor/Subsystem ID: 0x1344
IEEE OUI Identifier: 0x00a075
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 00a075 013084ec4c
Local Time is: Mon Oct 25 14:09:52 2021 CEST
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057): Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 78 Celsius
Critical Comp. Temp. Threshold: 81 Celsius
Namespace 1 Features (0x08): No_ID_Reuse
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 8.25W - - 0 0 0 0 0 0
1 + 3.00W - - 1 1 1 1 0 0
2 + 1.90W - - 2 2 2 2 0 0
3 - 0.0800W - - 3 3 3 3 10000 2500
4 - 0.0050W - - 4 4 4 4 12000 35000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 42 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 0%
Data Units Read: 2,535,487 [1.29 TB]
Data Units Written: 709,366 [363 GB]
Host Read Commands: 2,922,852
Host Write Commands: 2,890,773
Controller Busy Time: 76
Power Cycles: 12
Power On Hours: 5
Unsafe Shutdowns: 11
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 42 Celsius
Temperature Sensor 2: 47 Celsius
Thermal Temp. 1 Transition Count: 1
After displaying the last line, the command hangs and I lose access to the disk.
Thanks in advance.
un-diagnostics-20211025-1554.zip