Remy Lind

[6.5.3] nvme0: I/O 0 QID 2 timeout


Hey guys. I've been struggling with this error since I built the system and I don't know what to do.

kernel: nvme nvme0: I/O 0 QID 2 timeout, aborting

kernel: nvme nvme0: Abort status: 0x0


Everything slows to a crawl, and the NVMe drive just loops between timeout and abort status.

The drive had been functioning well in my Intel system under Windows.

 

I've tried two different drives of the same model and different M.2 slots.

 

I've tried ZenStates, pcie_aspm=off, and pci=nommconf, and there's no difference.
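For reference, I added those boot options to the append line in syslinux.cfg on the flash drive. A sketch of what mine looks like (your existing append line may already carry other options, so add to it rather than replacing it):

```shell
# /boot/syslinux/syslinux.cfg -- stock unRAID flash layout assumed.
label unRAID OS
  menu default
  kernel /bzimage
  append pcie_aspm=off pci=nommconf initrd=/bzroot
```

The change takes effect on the next reboot; `cat /proc/cmdline` afterwards confirms the options were actually picked up.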

 

What are my options here?

threadripper-diagnostics-20180624-0127.zip


OK, so I did some more tests and concluded that if I change the filesystem from BTRFS to XFS, the NVMe drive operates normally without errors.

 

And now of course I don't have the option of a cache pool, but at least the server is functioning.

 

Anyone know why this error occurs with BTRFS? 

58 minutes ago, Remy Lind said:

Anyone know why this error occurs with BTRFS? 

Not really; I can only tell you that I've been using a btrfs-formatted NVMe cache device for years without issues, so it's possibly a hardware compatibility issue.


I am seeing a similar issue.

 

I have an ASX7000NPC-512GT-C M.2 NVMe device connected via PCIe 3.0 x4. unRAID sees it just fine, and it has been assigned as cache for about a week (I'm still in my trial).

 

I have most of my shares set up to use the cache, and during periods of heavy download/write activity I noticed that every now and then unRAID would appear to freeze for 30 seconds or so.

 

I followed some advice in another thread and set vm.dirty_background_ratio to 3 and vm.dirty_ratio to 2. That seemed to help with the random hangs a bit, but not completely.
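For anyone following along, this is how I applied them. Note that sysctl changes like these don't survive a reboot (on unRAID they'd need to go in the go file), and vm.dirty_ratio is normally set higher than vm.dirty_background_ratio, so the values above may be worth revisiting:

```shell
# Apply at runtime (requires root); these revert on reboot.
sysctl -w vm.dirty_background_ratio=3
sysctl -w vm.dirty_ratio=2

# Verify the values currently in effect.
sysctl vm.dirty_background_ratio vm.dirty_ratio
```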

 

For what it's worth, my system is a dual-CPU E5-2630 v2 box (12 cores/24 threads) with 128 GB of ECC memory, and all applications doing I/O run in Docker. I don't think it's a system resource issue, even though the entire system seems to hang for anywhere between 15 and 60 seconds during heavy writes.

 

I started looking at the logs and I see a lot of entries like this:

 

Aug 22 11:54:11 arthas kernel: nvme nvme0: I/O 0 QID 8 timeout, reset controller
Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 0 QID 8 timeout, aborting
Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 1 QID 8 timeout, aborting
Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 2 QID 8 timeout, aborting
Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 3 QID 8 timeout, aborting
Aug 22 11:55:20 arthas kernel: nvme nvme0: I/O 4 QID 8 timeout, aborting
Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:20 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 92 QID 8 timeout, aborting
Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 93 QID 8 timeout, aborting
Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 94 QID 8 timeout, aborting
Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 95 QID 8 timeout, aborting
Aug 22 11:55:24 arthas kernel: nvme nvme0: I/O 96 QID 8 timeout, aborting
Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0
Aug 22 11:55:24 arthas kernel: nvme nvme0: Abort status: 0x0

 

This is concerning, as it seems to coincide with the hangs, but the good news is that no data ever seems to get corrupted or lost.
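To double-check the no-corruption part, the btrfs tooling can report per-device error counters and run a full checksum pass. A sketch, assuming the cache is mounted at the usual unRAID path /mnt/cache:

```shell
# Per-device I/O and corruption error counters (all zeroes is good).
btrfs device stats /mnt/cache

# Full checksum verification; scrub runs in the background by default,
# so check progress/results with the status command.
btrfs scrub start /mnt/cache
btrfs scrub status /mnt/cache
```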

 

The lspci output for the NVMe drive is:

 

06:00.0 Non-Volatile memory controller: Silicon Motion, Inc. Device 2260 (rev 03) (prog-if 02 [NVM Express])
        Subsystem: Silicon Motion, Inc. Device 2260
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 25
        NUMA node: 0
        Region 0: Memory at ef500000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 128 bytes, MaxReadReq 1024 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Via message
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [b0] MSI-X: Enable+ Count=16 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=0 offset=00002100
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr+ BadTLP+ BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [158 v1] #19
        Capabilities: [178 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [180 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Kernel driver in use: nvme
        Kernel modules: nvme

 

I have seen some posts around the web about changing the NVMe timeout settings, but I don't know if I can easily do that in unRAID, or if it's even the right thing to do.
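In case it helps anyone, the setting usually being discussed is the nvme_core.io_timeout module parameter. Since the NVMe driver is built into the unRAID kernel, it would go on the boot line rather than in a modprobe config. A sketch only; I haven't confirmed that raising it actually fixes anything rather than just masking the hangs:

```shell
# /boot/syslinux/syslinux.cfg -- add to the existing append line.
# Raises the NVMe I/O timeout from the default 30 seconds to 255.
append nvme_core.io_timeout=255 initrd=/bzroot

# The value currently in effect can be checked via sysfs:
cat /sys/module/nvme_core/parameters/io_timeout
```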

 

Any help is appreciated.


Not as far as I can tell. They seem to only release firmware via some Windows SSD Toolkit thing, and I haven't been able to find any standalone downloads.

 

 

