Remy Lind

[6.5.3] nvme0: I/O 0 QID 2 timeout


Hey guys. I've been struggling with this error since I built the system and I don't know what to do.

kernel: nvme nvme0: I/O 0 QID 2 timeout, aborting

kernel: nvme nvme0: Abort status: 0x0


Everything slows to a crawl, and the NVMe drive just loops between timeout and abort status.

The drive had been functioning well in my Intel system under Windows.

 

I've tried two different drives of the same model and different M.2 slots.

 

I've tried ZenStates, pcie_aspm=off, and pci=nommconf, and there's no difference.
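For reference, I added those boot options to the append line in syslinux.cfg on the flash drive. A sketch of what mine looks like (your existing append line may already carry other options, so add to it rather than replacing it):

```shell
# /boot/syslinux/syslinux.cfg -- stock unRAID flash layout assumed.
label unRAID OS
  menu default
  kernel /bzimage
  append pcie_aspm=off pci=nommconf initrd=/bzroot
```

The change takes effect on the next reboot; `cat /proc/cmdline` afterwards confirms the options were actually picked up.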

 

What are my options here?

threadripper-diagnostics-20180624-0127.zip


OK, so I did some more tests and concluded that if I change the filesystem from BTRFS to XFS, the NVMe drive operates normally without errors.

 

And now of course I don't have the option of a cache pool, but at least the server is functioning.

 

Anyone know why this error occurs with BTRFS? 

58 minutes ago, Remy Lind said:

Anyone know why this error occurs with BTRFS? 

Not really; I can only tell you that I've been using a btrfs-formatted NVMe cache device for years without issues, so it's possibly a hardware compatibility issue.


I am seeing a similar issue.

 

I have an ASX7000NPC-512GT-C M.2 NVMe device connected via PCIe 3.0 x4. unRAID sees it just fine, and it has been assigned as cache for about a week (I'm still in my trial).

 

I have most of my shares set up to use the cache, and during periods of heavy download/write activity I noticed that every now and then unRAID would appear to freeze for 30 seconds or so.

 

I followed some advice in another thread and set vm.dirty_background_ratio to 3 and vm.dirty_ratio to 2. That seemed to help with the random hangs a bit, but not completely.
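For anyone following along, this is how I applied them. Note that sysctl changes like these don't survive a reboot (on unRAID they'd need to go in the go file), and vm.dirty_ratio is normally set higher than vm.dirty_background_ratio, so the values above may be worth revisiting:

```shell
# Apply at runtime (requires root); these revert on reboot.
sysctl -w vm.dirty_background_ratio=3
sysctl -w vm.dirty_ratio=2

# Verify the values currently in effect.
sysctl vm.dirty_background_ratio vm.dirty_ratio
```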

 

For what it's worth, my system is a dual-CPU E5-2630 v2 box (12 cores/24 threads) with 128 GB of ECC memory, and all applications doing I/O run in Docker. I don't think it's a system resource issue, even though the entire system seems to hang for anywhere between 15 and 60 seconds during heavy writes.

 

I started looking at the logs and I see a lot of entries like this:

 

Aug 22 11:54:11 arthas kernel: nvme nvme0: I/O 0 QID 8 timeout, reset controller
Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 0 QID 8 timeout, aborting
Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 1 QID 8 timeout, aborting
Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 2 QID 8 timeout, aborting
Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 3 QID 8 timeout, aborting
Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 4 QID 8 timeout, aborting
Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 92 QID 8 timeout, aborting
Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 93 QID 8 timeout, aborting
Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 94 QID 8 timeout, aborting
Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 95 QID 8 timeout, aborting
Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 96 QID 8 timeout, aborting
Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0

 

This is concerning, as it seems to coincide with the hangs, but the good news is that no data ever seems to get corrupted or lost.
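To double-check the no-corruption part, the btrfs tooling can report per-device error counters and run a full checksum pass. A sketch, assuming the cache is mounted at the usual unRAID path /mnt/cache:

```shell
# Per-device I/O and corruption error counters (all zeroes is good).
btrfs device stats /mnt/cache

# Full checksum verification; scrub runs in the background by default,
# so check progress/results with the status command.
btrfs scrub start /mnt/cache
btrfs scrub status /mnt/cache
```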

 

The lspci output for the NVMe drive is:

 

06:00.0 Non-Volatile memory controller: Silicon Motion, Inc. Device 2260 (rev 03) (prog-if 02 [NVM Express])
        Subsystem: Silicon Motion, Inc. Device 2260
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 25
        NUMA node: 0
        Region 0: Memory at ef500000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 128 bytes, MaxReadReq 1024 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Via message
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [b0] MSI-X: Enable+ Count=16 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=0 offset=00002100
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr+ BadTLP+ BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [158 v1] #19
        Capabilities: [178 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [180 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Kernel driver in use: nvme
        Kernel modules: nvme

 

I have seen some posts around the web about changing the NVMe timeout settings, but I don't know if I can easily do that in unRAID, or if it's even the right thing to do.
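In case it helps anyone, the setting usually being discussed is the nvme_core.io_timeout module parameter. Since the NVMe driver is built into the unRAID kernel, it would go on the boot line rather than in a modprobe config. A sketch only; I haven't confirmed that raising it actually fixes anything rather than just masking the hangs:

```shell
# /boot/syslinux/syslinux.cfg -- add to the existing append line.
# Raises the NVMe I/O timeout from the default 30 seconds to 255.
append nvme_core.io_timeout=255 initrd=/bzroot

# The value currently in effect can be checked via sysfs:
cat /sys/module/nvme_core/parameters/io_timeout
```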

 

Any help is appreciated.


Not as far as I can tell. They seem to only release firmware via some Windows SSD Toolkit thing, and I haven't been able to find any standalone downloads.

 

 

