BF90X Posted May 23, 2022 (edited)

Hello, I was wondering if someone could help me identify why my parity rebuild is slow. Some of the most recent changes were:

- Installed an LSI SAS3224 about 5 months ago
- Replaced all my Seagate data drives

I am getting roughly 10-20 MB/s on the rebuild right now. I also have three separate ZFS pools and need to replace a drive in the HDD pool; the other two (SSD and NVMe) are working properly. Attached are the diagnostics: prdnas002-diagnostics-20220523-0047.zip
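For scale, a quick back-of-envelope calculation (not from the thread; the 8 TB disk size is an assumed example, since the post doesn't state the drive sizes) shows why 10-20 MB/s is alarming for a rebuild:

```shell
# rebuild_hours: rough hours to rebuild a disk of a given size (TB, decimal,
# as drive vendors count) at a given sustained speed (MB/s).
rebuild_hours() {
    awk -v tb="$1" -v mbps="$2" 'BEGIN { printf "%.0f\n", tb * 1e6 / mbps / 3600 }'
}

rebuild_hours 8 16     # an assumed 8 TB disk at 16 MB/s: roughly 139 hours
rebuild_hours 8 150    # the same disk at a healthy ~150 MB/s: roughly 15 hours
```

At the reported speeds a rebuild would take the better part of a week, so something is clearly throttling it.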
ChatNoir Posted May 23, 2022

58 minutes ago, BF90X said:
I was wondering if someone could help me identify why my parity rebuild is slow.

Users with more knowledge will probably chime in, but at first glance it looks like the controller is being reset because of overheating:

May 22 08:49:05 PRDNAS002 kernel: mpt3sas_cm0: Temperature Threshold flags 0 3 exceeded for Sensor: 0 !!!
May 22 08:49:05 PRDNAS002 kernel: mpt3sas_cm0: Current Temp In Celsius: 112
May 22 08:49:06 PRDNAS002 kernel: mpt3sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready
May 22 08:49:06 PRDNAS002 kernel: mpt3sas_cm0: fault_state(0x2810)!
May 22 08:49:06 PRDNAS002 kernel: mpt3sas_cm0: sending diag reset !!
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: diag reset: SUCCESS
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: _base_display_fwpkg_version: complete
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: LSISAS3224: FWVersion(09.00.100.00), ChipRevision(0x01), BiosVersion(08.27.00.00)
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: sending port enable !!
May 22 08:49:20 PRDNAS002 kernel: mpt3sas_cm0: port enable: SUCCESS
May 22 08:49:20 PRDNAS002 kernel: mpt3sas_cm0: search for end-devices: start

Do you have sufficient cooling? JorgeB could confirm, but that firmware seems quite old from what I understand.
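A quick way to check whether the controller is still overheating after any changes is to count those mpt3sas events in the syslog. A small sketch (a hypothetical helper, not something posted in the thread; on Unraid the live log is /var/log/syslog):

```shell
# count_hba_events: count mpt3sas temperature warnings and diag resets in a
# given log file. Non-zero and growing counts mean the HBA is still throttling.
count_hba_events() {
    log="$1"
    temps=$(grep -c 'Temperature Threshold' "$log" || true)
    resets=$(grep -c 'sending diag reset' "$log" || true)
    echo "temperature warnings: $temps, diag resets: $resets"
}

# e.g. count_hba_events /var/log/syslog
```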
JorgeB Posted May 23, 2022

Yes, start by updating the firmware and improving cooling on the HBA. These are designed for servers with very good cooling; when used in desktop cases they might need some active cooling, or at least a case with very good airflow.
BF90X Posted May 23, 2022 Author

Thanks for the input @ChatNoir & @JorgeB. I updated the firmware of the HBA and added a temporary fan for additional cooling, and ordered a few items to put together a permanent solution so the HBA stays cool. After making these changes, I am still seeing the same issue.

This might not be related, but at some point after upgrading to 6.10 I noticed I was getting a lot of hardware errors:

May 23 05:45:24 PRDNAS002 kernel: {20}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
May 23 05:45:24 PRDNAS002 kernel: {20}[Hardware Error]: It has been corrected by h/w and requires no further action

I looked around online and found that someone had to run the following commands to eliminate them (the first reads the register, the second writes it):

setpci -v -s 0000:40:01.2 CAP_EXP+0x8.w
setpci -v -s 0000:40:01.2 CAP_EXP+0x8.w=0x2936

Attached are the new diagnostics. Appreciate all the help. prdnas002-diagnostics-20220523-1207.zip
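One caveat worth noting: a setpci change does not survive a reboot. On Unraid the usual place to re-apply something like this at boot is /boot/config/go. A sketch only (the PCI address is the one from the command above and is specific to this machine; verify yours with lspci before writing anything):

```shell
# /boot/config/go — Unraid runs this script at every boot (sketch, untested).
# Re-apply the AER workaround; 0000:40:01.2 is the address from this thread,
# yours may differ.
/usr/bin/setpci -v -s 0000:40:01.2 CAP_EXP+0x8.w=0x2936
```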
JorgeB Posted May 23, 2022

28 minutes ago, BF90X said:
Might not be related but at some point after upgrading to 6.10 I noticed I was getting a lot of hardware errors.

These errors appear to come from one of the NVMe devices:

42:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a80a]
Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a801]
Kernel driver in use: nvme
Kernel modules: nvme

If there's a different slot you can use, try it; if not, you should be able to suppress the errors, at least to stop filling up the log.

As for the parity check: it's using a lot of CPU. The parity check is single-threaded, and it's showing 100% on one core, which would mean it's being limited by that, but that doesn't make much sense with the CPU you have... Did you notice slower performance with anything else?
BF90X Posted May 23, 2022 Author

8 minutes ago, JorgeB said:
If there's a different slot you can use, try it; if not, you should be able to suppress the errors... Did you notice slower performance with anything else?

Thanks for sharing that information, I will follow the steps on that link. Ohhh, I guess that's why core 6 is maxed out. Everything else seems to be working properly; I run Docker containers and a couple of VMs without much issue. One of the steps I tried to troubleshoot the parity rebuild was a new config, and I pre-cleared the drives as well. I also rebooted into safe mode and tried the rebuild there, but got the same speeds. I did not look at the CPU usage when I did that, though.
JorgeB Posted May 23, 2022

It's very strange; it doesn't make sense for the CPU to be at 100% while syncing at 16 MB/s. I don't really know what else to try. You can run the diskspeed docker just to make sure all disks are performing normally.
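If the diskspeed docker isn't handy, a minimal per-disk sequential read test can be sketched with dd (an assumption-laden helper, not from the thread: run it against a raw device like /dev/sdX while the array is idle, and one slow disk will drag the whole parity operation down to its speed):

```shell
# seq_read_mbps: read <mb> megabytes from <src> and report the throughput.
# Against a real disk, add iflag=direct to the dd call so the page cache
# doesn't inflate the number.
seq_read_mbps() {
    src="$1"; mb="$2"
    start=$(date +%s%N)                                   # nanoseconds (GNU date)
    dd if="$src" of=/dev/null bs=1M count="$mb" 2>/dev/null
    end=$(date +%s%N)
    awk -v mb="$mb" -v ns="$((end - start))" 'BEGIN { printf "%.0f MB/s\n", mb * 1e9 / ns }'
}

# seq_read_mbps /dev/sdb 1024   # a healthy modern HDD should report well above 100 MB/s
```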
JorgeB Posted May 23, 2022

Another thing you could try is doing a new config and assigning only about half of the current array drives, to see if there's a difference; if it's the CPU that is somehow limiting it, that should be noticeably faster. Or sync the entire array with single parity, also to compare.
BF90X Posted May 24, 2022 Author

On 5/23/2022 at 1:17 PM, JorgeB said:
Another thing you could try is doing a new config and assigning only about half of the current array drives, to see if there's a difference...

Long story short, the parity rebuild appears to have gone back to normal speed. I tried so many different things that I have no idea what fixed it:

- Updated the BIOS
- Updated to Unraid 6.10.1
- Tried one of the drives; it originally did not work. (Now that it's working, I did try the other drive instead, so that might be it?)
- Tried to boot UEFI instead of legacy (literally hours of trying, but no luck)
- At some point I wasn't even able to boot with legacy due to the AER issue

I added the following to syslinux:

nvme_core.default_ps_max_latency_us=5500 pci=nommconf

I already had:

pcie_aspm=off

I am still seeing the "Hardware error from APEI Generic Hardware Error" messages, but not as many. The only major issue I am seeing for now is that one of the 2.5 inch drives in one of the ZFS pools is failing; waiting for the replacement. prdnas002-diagnostics-20220524-1235.zip
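For anyone following along, the resulting boot entry in /boot/syslinux/syslinux.cfg would look roughly like this (a sketch; the exact label and layout vary between installs, so treat this as illustrative rather than a copy-paste target):

```
label Unraid OS
  menu default
  kernel /bzimage
  append nvme_core.default_ps_max_latency_us=5500 pci=nommconf pcie_aspm=off initrd=/bzroot
```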