Slow Parity Rebuild



Hello,

 

I was wondering if someone could help me identify why my parity rebuild is slow.

Some of the last changes were:

 

- Installed an LSI SAS3224 HBA about 5 months ago

- Replacing all my Seagate data drives

 

I am getting roughly 10-20 MB/s on the rebuild right now.

 

I also have three separate ZFS pools and need to replace a drive in the HDD pool.

The other two (SSD and NVMe) are working properly.

 

Attached are the diagnostics.

prdnas002-diagnostics-20220523-0047.zip

58 minutes ago, BF90X said:

I was wondering if someone could help me identify why my parity rebuild is slow.

 

Users with more knowledge will probably chime in, but at first glance it looks like the controller is resetting because it's overheating.

 

May 22 08:49:05 PRDNAS002 kernel: mpt3sas_cm0: Temperature Threshold flags 0   3  exceeded for Sensor: 0 !!!
May 22 08:49:05 PRDNAS002 kernel: mpt3sas_cm0: Current Temp In Celsius: 112
May 22 08:49:06 PRDNAS002 kernel: mpt3sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready
May 22 08:49:06 PRDNAS002 kernel: mpt3sas_cm0: fault_state(0x2810)!
May 22 08:49:06 PRDNAS002 kernel: mpt3sas_cm0: sending diag reset !!
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: diag reset: SUCCESS
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: _base_display_fwpkg_version: complete
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: LSISAS3224: FWVersion(09.00.100.00), ChipRevision(0x01), BiosVersion(08.27.00.00)
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: sending port enable !!
May 22 08:49:20 PRDNAS002 kernel: mpt3sas_cm0: port enable: SUCCESS
May 22 08:49:20 PRDNAS002 kernel: mpt3sas_cm0: search for end-devices: start

 

Do you have sufficient cooling?

JorgeB could confirm, but that FW seems quite old from what I understand.
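
If it helps, a couple of quick console checks (a sketch; note that sas3flash is a separate download from Broadcom's site, it's not included with Unraid):

grep -i 'Temperature Threshold\|Current Temp' /var/log/syslog   # watch for further overheat/reset events from the HBA driver
./sas3flash -list                                               # report the controller's current firmware and BIOS versions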


Thanks for the input @ChatNoir & @JorgeB

 

I updated the firmware of the HBA and added a temporary fan for additional cooling, and I ordered a few items to put together a permanent solution to make sure the HBA stays cooler.

After making these changes, I am still seeing the same issue.

 

It might not be related, but at some point after upgrading to 6.10 I noticed I was getting a lot of hardware errors.

 

May 23 05:45:24 PRDNAS002 kernel: {20}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
May 23 05:45:24 PRDNAS002 kernel: {20}[Hardware Error]: It has been corrected by h/w and requires no further action

 

I looked around online and found that someone had to run the following commands to eliminate them:

 

setpci -v -s 0000:40:01.2 CAP_EXP+0x8.w
setpci -v -s 0000:40:01.2 CAP_EXP+0x8.w=0x2936
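
(If the tweak needs to survive a reboot, I believe the same command can be added to the go file so it runs at every boot; a sketch, assuming the device stays at 0000:40:01.2 across boots:)

#!/bin/bash
# /boot/config/go - runs once at every boot on Unraid
# Re-apply the Device Control register tweak that silences the corrected-error spam
setpci -v -s 0000:40:01.2 CAP_EXP+0x8.w=0x2936
# Start the Management Utility (already present in the stock go file)
/usr/local/sbin/emhttp &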

 

Attached are the new diagnostics.

 

Appreciate all the help.

prdnas002-diagnostics-20220523-1207.zip

28 minutes ago, BF90X said:

It might not be related, but at some point after upgrading to 6.10 I noticed I was getting a lot of hardware errors.

 

These errors appear to come from one of the NVMe devices:

42:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a80a]
    Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a801]
    Kernel driver in use: nvme
    Kernel modules: nvme

 

If there's a different slot you can use, try it; if not, you should be able to suppress the errors, at least to stop filling up the log.

 

 

As for the parity check, it's using a lot of CPU. The parity check is single-threaded and it's showing 100% CPU, which would mean it's being limited by that, but that doesn't make much sense with the CPU you have... Did you notice slower performance with anything else?
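
An easy way to confirm that is to watch the per-core usage while the rebuild is running, for example:

top     # then press '1' to show each core separately
# one core pinned near 100% while the others sit mostly idle is consistent
# with the single-threaded parity calculation being the bottleneck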

8 minutes ago, JorgeB said:

 

These errors appear to come from one of the NVMe devices:

42:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a80a]
    Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a801]
    Kernel driver in use: nvme
    Kernel modules: nvme

 

If there's a different slot you can use, try it; if not, you should be able to suppress the errors, at least to stop filling up the log.

 

 

As for the parity check, it's using a lot of CPU. The parity check is single-threaded and it's showing 100% CPU, which would mean it's being limited by that, but that doesn't make much sense with the CPU you have... Did you notice slower performance with anything else?

 

Thanks for sharing that information; I will follow the steps in that link.

 

Ohhh, I guess that's why core 6 is maxed out.

Everything else seems to be working properly. I run Docker containers and a couple of VMs without many issues.

One of the steps I tried to troubleshoot the parity rebuild issue was to do a new config. I also pre-cleared the drives.

I rebooted into safe mode and tried the rebuild as well, but got the same speeds. I did not look at the CPU usage when I did that, though.

 

On 5/23/2022 at 1:17 PM, JorgeB said:

Another thing you could try is doing a new config and assigning only about half of the current array drives, to see if there's a difference; if it's the CPU that is somehow limiting it, it should be noticeably faster. You could also sync the entire array with single parity to compare.

 

Long story short, the parity rebuild appears to have gone back to normal speed.

I tried so many different things, I have no idea what fixed it.

 

- Updated the BIOS

- Updated to Unraid 6.10.1

- I tried one of the drives and it originally did not work. (Now that it's working, I did try the other drive instead.) So that might be it?

- Tried to boot UEFI instead of legacy (literally hours of trying, but no luck)

- At some point I wasn't even able to boot in legacy mode due to the AER issue. I added the following to the syslinux config:

 

nvme_core.default_ps_max_latency_us=5500

pci=nommconf

 

I already had:

 

pcie_aspm=off 
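
For reference, the append line in /boot/syslinux/syslinux.cfg ends up looking roughly like this (parameter order shouldn't matter):

append pcie_aspm=off pci=nommconf nvme_core.default_ps_max_latency_us=5500 initrd=/bzroot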

 

I am still seeing the "Hardware error from APEI Generic Hardware Error" messages, but not as many.

The only major issue I am seeing for now is that one of the 2.5-inch drives in one of the ZFS pools is failing; I'm waiting for the replacement.
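
Once the replacement arrives, the plan is roughly the usual ZFS swap (hypothetical pool and device names below):

zpool status hddpool                            # confirm which device is faulted
zpool replace hddpool <failing-dev> <new-dev>   # resilver onto the replacement drive
zpool status hddpool                            # watch the resilver progress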

 

[screenshot attached]

 

prdnas002-diagnostics-20220524-1235.zip

