evakq8r Posted February 18, 2023

I've recently rebuilt my server on new hardware (a custom build replacing an old Dell PowerEdge) and had a disk go into a disabled state during a reboot. I've started the rebuild process and the speeds fluctuate wildly, from 1 MB/s to 125 MB/s. Server specs are in my sig.

The link width of my LSI HBA is showing x8, which as far as I know should be fine for a decent rebuild speed:

lspci -vv -s 31:00.0

31:00.0 RAID bus controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
        Subsystem: Fujitsu Technology Solutions HBA Ctrl SAS 6G 0/1 [D2607]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 72
        IOMMU group: 35
        Region 0: I/O ports at e000 [size=256]
        Region 1: Memory at c04c0000 (64-bit, non-prefetchable) [size=16K]
        Region 3: Memory at c0080000 (64-bit, non-prefetchable) [size=256K]
        Expansion ROM at c0000000 [disabled] [size=512K]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 512 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x8
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [d0] Vital Product Data
pcilib: sysfs_read_vpd: read failed: No such device
                Not readable
        Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [c0] MSI-X: Enable+ Count=15 Masked-
                Vector table: BAR=1 offset=00002000
                PBA: BAR=1 offset=00003800
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [138 v1] Power Budgeting <?>
        Capabilities: [150 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration- 10BitTagReq- Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy- 10BitTagReq-
                IOVSta: Migration-
                Initial VFs: 16, Total VFs: 16, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 1, stride: 1, Device ID: 0072
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 00000000c04c4000 (64-bit, non-prefetchable)
                Region 2: Memory at 00000000c00c0000 (64-bit, non-prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [190 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Kernel driver in use: mpt3sas
        Kernel modules: mpt3sas

Attached are diagnostics. I'm not exactly sure why the speeds fluctuate so heavily, so any help would be appreciated.

notyourunraid-diagnostics-20230218-1311.zip
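For reference, the LnkSta values above (Speed 5GT/s, Width x8) can be turned into a rough bandwidth figure. This is back-of-the-envelope arithmetic only, not something from the diagnostics:

```shell
# Rough usable PCIe bandwidth from the LnkSta line above.
# Gen2 signals at 5 GT/s per lane with 8b/10b encoding (8 data bits
# per 10 bits transferred), so: GT/s * lanes * 8/10 gigabits/s, /8 for bytes.
gts=5
lanes=8
echo "$(( gts * lanes * 8 * 1000 / 10 / 8 )) MB/s"   # prints "4000 MB/s" per direction
```

Even allowing for protocol overhead, roughly 4 GB/s shared across the SAS2008's ports leaves enormous headroom over any single spinning disk, so a properly trained x8 link is consistent with the bottleneck being elsewhere.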
Squid Posted February 18, 2023 (Solution)

You have what appears to be a Docker container that is continually restarting. The docker image is in the system share, which is currently split across the cache drive and disk 3. It's impossible to tell from the diagnostics which of those two drives the docker.img file is on, but if it's on disk 3, then what you're seeing would be expected.

Go to Shares and, next to the system share, hit Calculate. If you see 100G (which may be overkill) on disk 3, then that's your issue. Stop the entire Docker service in Settings → Docker and run Mover from the Main tab.
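The same check can be done from a terminal. A hedged sketch: it mocks up an Unraid-style directory tree under a temp directory so it runs anywhere; on a real server you would point the same `du`/`find` commands at `/mnt/disk*/system` and `/mnt/cache/system` (the stock share name is assumed):

```shell
# Mock an Unraid-style layout so the commands below are runnable as-is.
root=$(mktemp -d)
mkdir -p "$root/disk3/system/docker" "$root/cache/system/docker"
truncate -s 1M "$root/disk3/system/docker/docker.img"  # pretend the image landed on disk 3

# Per-disk footprint of the system share
# (real server: du -sh /mnt/disk*/system /mnt/cache/system 2>/dev/null)
du -sh "$root"/disk*/system "$root"/cache/system

# Exact location of docker.img
# (real server: find /mnt/disk*/system /mnt/cache/system -name docker.img)
find "$root" -name docker.img
```

If the `find` shows the image under a `diskN` path rather than cache, every Docker write during the rebuild lands on the array, which matches the fluctuating rebuild speeds.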
evakq8r Posted February 18, 2023 (Author)

1 hour ago, Squid said:

You have what appears to be a Docker container that is continually restarting. The docker image is in the system share, which is currently split across the cache drive and disk 3. It's impossible to tell from the diagnostics which of those two drives the docker.img file is on, but if it's on disk 3, then what you're seeing would be expected. Go to Shares and, next to the system share, hit Calculate. If you see 100G (which may be overkill) on disk 3, then that's your issue. Stop the entire Docker service in Settings → Docker and run Mover from the Main tab.

Looks like there was a Docker container restarting (which I thought I had resolved a month ago). I've removed the entire container and its image, so that should no longer be a problem.

As for the system share, you're correct. It seems that when I built this system last week, I neglected to check how much space the files already on the cache were using, so I can only assume the system share can't move to cache because there isn't enough space.

Mover is disabled during a data rebuild, so I guess I'll need to wait until it's finished? Or do you suggest cancelling the rebuild anyway, moving system to cache, and then doing the rebuild again? The rebuild is currently at 77.8% and going at a reasonable speed.
itimpi Posted February 18, 2023

3 hours ago, evakq8r said:

It seems that when I built this system last week, I neglected to check how much space the files already on the cache were using, so I can only assume the system share can't move to cache because there isn't enough space.

It would be normal for mover not to move files in the 'system' share if you have either the VM or Docker services enabled, as they keep files open and mover cannot move open files. Those services should be disabled while trying to move files in the system share.

As was mentioned, you also seem to have the docker image file configured to be 100GB, which is far more than you should need. The default of 20GB is fine for most people. If you find that you are filling it, that almost certainly means you have a docker container misconfigured so that it is writing files internally to the image when they should be mapped to an external location on the host.
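The misconfiguration itimpi describes usually means a container path was never mapped to host storage. A hypothetical example of the fix (the container name, image, and paths here are illustrative, not from the diagnostics): every directory the app writes to gets a `-v` mapping, so nothing accumulates inside docker.img.

```shell
# Hypothetical container: all writable paths are mapped out to the host,
# so the container's writable layer inside docker.img stays tiny.
docker run -d --name someapp \
  -v /mnt/user/appdata/someapp:/config \
  -v /mnt/user/downloads:/downloads \
  someapp/image:latest
```

In the Unraid GUI this corresponds to the path mappings in each container's template; any path the app logs, downloads, or transcodes to should appear there.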
evakq8r Posted February 18, 2023 (Author)

4 hours ago, itimpi said:

It would be normal for mover not to move files in the 'system' share if you have either the VM or Docker services enabled, as they keep files open and mover cannot move open files.

This makes sense, however mover was actually disabled because a parity operation was in progress (as advised on the GUI).

4 hours ago, itimpi said:

As was mentioned, you also seem to have the docker image file configured to be 100GB, which is far more than you should need. The default of 20GB is fine for most people. If you find that you are filling it, that almost certainly means you have a docker container misconfigured so that it is writing files internally to the image when they should be mapped to an external location on the host.

I've stopped Docker and the VM service, resized the image to 30GB, deleted the existing image, and am now going through the (very painful) process of setting everything up again. Less than 20 containers in, the 30GB image is already 70% full. The containers are all starting and not in a restart loop, so I'm not sure why the image is already blowing out. I did have about 75 containers running at once though, so I may fall into the 'not most people' category for docker image sizes.
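One way to see which containers are eating the image is `docker ps -s`, which adds a SIZE column showing each container's writable layer. A sketch that sorts such output largest-first; the names and figures below are mocked up so the snippet runs without a Docker daemon:

```shell
# Mocked "NAME SIZE" pairs in the shape docker would give
# (real server: docker ps -s --format '{{.Names}} {{.Size}}').
sample='nginx 2.1M
sonarr 1.4G
plex 12.3G
radarr 890M'

# Sort human-readable sizes, largest first; the top entries are the
# likely culprits writing inside the image.
echo "$sample" | sort -k2 -h -r
```

Containers with writable layers in the gigabytes are almost always writing to an unmapped internal path; the small-layer ones are configured correctly.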
evakq8r Posted February 18, 2023 (Author)

Everything is set up again and the system is much more responsive. The Docker image is now 75GB, 68% used. I'll leave it as is for now; all containers are working as expected. Thanks for your help @Squid.
Squid Posted February 18, 2023

1 hour ago, evakq8r said:

This makes sense, however mover was actually disabled because a parity operation was in progress (as advised on the GUI).

That's actually a bug in the system, and it's fixed in the next rev.
evakq8r Posted February 18, 2023 (Author)

7 minutes ago, Squid said:

That's actually a bug in the system, and it's fixed in the next rev.

Oh! Good to know. Thanks!