Remy Lind Posted June 23, 2018

Hey guys. I've been struggling with this error since I built the system and I don't know what to do:

    kernel: nvme nvme0: I/O 0 QID 2 timeout, aborting
    kernel: nvme nvme0: Abort status: 0x0

Everything slows to a crawl and the NVMe drive just loops between the timeout and abort-status messages. The drive had been functioning well in my Intel system under Windows. I've tried two different drives of the same model in different M.2 slots, and I've tried zenstates, pcie_aspm=off and pci=nommconf, with no difference. What are my options here?

threadripper-diagnostics-20180624-0127.zip
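For anyone landing here and wondering where those parameters go: on unRAID, kernel boot parameters like pcie_aspm=off are added to the append line of /boot/syslinux/syslinux.cfg on the flash drive. A minimal sketch, assuming the stock boot menu layout (the label name may differ on your install):

```
label unRAID OS
  menu default
  kernel /bzimage
  append pcie_aspm=off pci=nommconf initrd=/bzroot
```

A reboot is needed for the change to take effect.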
Remy Lind Posted June 24, 2018 (Author)

Ok, so I did some more tests and concluded that if I change the filesystem from BTRFS to XFS, the NVMe drive operates normally without errors. Of course I now lose the option of a cache pool, but at least the server is functioning. Anyone know why this error occurs with BTRFS?
JorgeB Posted June 24, 2018

58 minutes ago, Remy Lind said:

    Anyone know why this error occurs with BTRFS?

Not really, I can only tell you that I've been using a btrfs-formatted NVMe cache device for years without issues, so possibly some hardware compatibility issue.
werkkrew Posted August 22, 2018

I am seeing a similar issue. I have an ASX7000NPC-512GT-C M.2 NVMe device connected via PCIe 3.0 x4; unRAID sees it just fine and it has been assigned as cache for about a week (I am still in my trial). Most of my shares are set up to use the cache, and during periods of heavy download/write activity I noticed that every now and then unRAID would appear to freeze for 30 seconds or so. Following advice in another thread, I set vm.dirty_background_ratio to 3 and vm.dirty_ratio to 2, which seemed to reduce the random hangs a bit, but not completely.

For what it's worth, my system is a dual-CPU E5-2630 v2 box (12 cores / 24 threads) with 128 GB of ECC memory, and all applications doing I/O run in Docker. I don't think it is a system resource issue, even though the entire system seems to hang for anywhere between 15 and 60 seconds during heavy writes.

I started looking at the logs and I see a lot of entries like this:

    Aug 22 11:54:11 arthas kernel: nvme nvme0: I/O 0 QID 8 timeout, reset controller
    Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 0 QID 8 timeout, aborting
    Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 1 QID 8 timeout, aborting
    Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 2 QID 8 timeout, aborting
    Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 3 QID 8 timeout, aborting
    Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 4 QID 8 timeout, aborting
    Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
    Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
    Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
    Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
    Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
    Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 92 QID 8 timeout, aborting
    Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 93 QID 8 timeout, aborting
    Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 94 QID 8 timeout, aborting
    Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 95 QID 8 timeout, aborting
    Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 96 QID 8 timeout, aborting
    Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0
    Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0
    Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0
    Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0
    Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0

This is concerning as it seems to coincide with the hangs, but the good news is that no data ever seems to get corrupted or lost.

The lspci output for the NVMe drive is:

    06:00.0 Non-Volatile memory controller: Silicon Motion, Inc. Device 2260 (rev 03) (prog-if 02 [NVM Express])
        Subsystem: Silicon Motion, Inc. Device 2260
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 25
        NUMA node: 0
        Region 0: Memory at ef500000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
            Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
            Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [70] Express (v2) Endpoint, MSI 00
            DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
            DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                MaxPayload 128 bytes, MaxReadReq 1024 bytes
            DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
            LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
                ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
            LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
            LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
            DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Via message
                AtomicOpsCap: 32bit- 64bit- 128bitCAS-
            DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                AtomicOpsCtl: ReqEn-
            LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                Compliance De-emphasis: -6dB
            LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [b0] MSI-X: Enable+ Count=16 Masked-
            Vector table: BAR=0 offset=00002000
            PBA: BAR=0 offset=00002100
        Capabilities: [100 v2] Advanced Error Reporting
            UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
            UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
            UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
            CESta:  RxErr+ BadTLP+ BadDLLP- Rollover- Timeout- NonFatalErr-
            CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
            AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
            HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [158 v1] #19
        Capabilities: [178 v1] Latency Tolerance Reporting
            Max snoop latency: 0ns
            Max no snoop latency: 0ns
        Capabilities: [180 v1] L1 PM Substates
            L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
            L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                T_CommonMode=0us LTR1.2_Threshold=0ns
            L1SubCtl2: T_PwrOn=10us
        Kernel driver in use: nvme
        Kernel modules: nvme

I have seen some posts around the web about a potential change to the NVMe timeout settings, but I don't know if I can easily do that in unRAID, or if it's even the right thing to do. Any help is appreciated.
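A side note on the vm.dirty tuning mentioned above: sysctl changes made at the command line are lost on reboot, so on unRAID they are typically appended to the /boot/config/go script. A sketch using the values described in this post (whether these particular ratios are sensible for your workload is a separate question):

```
# appended to /boot/config/go -- runs at every boot
# values mirror the ones described above; tune to taste
sysctl -w vm.dirty_background_ratio=3
sysctl -w vm.dirty_ratio=2
```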
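As for the NVMe timeout change: the knob usually meant here is the nvme_core.io_timeout module parameter. Since the nvme driver is built into the unRAID kernel, it can be passed as a boot parameter on the append line of /boot/syslinux/syslinux.cfg. A sketch, where 255 (seconds) is an arbitrary example value; note that raising the timeout may only mask the underlying stall rather than fix it:

```
label unRAID OS
  menu default
  kernel /bzimage
  append nvme_core.io_timeout=255 initrd=/bzroot
```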
John_M Posted August 22, 2018

Is there a firmware update for your NVMe device?
werkkrew Posted August 22, 2018

Not as far as I can tell. They seem to only release firmware via some Windows SSD toolkit; I haven't been able to find any standalone downloads.