Slow Parity Rebuild



Hello,

 

I was wondering if someone could help me identify why my parity rebuild is slow.

Some of the last changes were:

 

- Installed an LSI SAS3224 HBA about 5 months ago

- Replacing all my Seagate data drives

 

I am getting roughly 10-20 MB/s on the rebuild right now.

 

I also have three separate ZFS pools and need to replace a drive in the HDD pool.

The other two (SSD and NVMe) are working properly.

 

Attached are the diagnostics.

prdnas002-diagnostics-20220523-0047.zip

58 minutes ago, BF90X said:

I was wondering if someone could help me identify why my parity rebuild is slow.

 

Users with more knowledge will probably chime in, but at first glance it looks like the controller is resetting because it's overheating.

 

May 22 08:49:05 PRDNAS002 kernel: mpt3sas_cm0: Temperature Threshold flags 0   3  exceeded for Sensor: 0 !!!
May 22 08:49:05 PRDNAS002 kernel: mpt3sas_cm0: Current Temp In Celsius: 112
May 22 08:49:06 PRDNAS002 kernel: mpt3sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready
May 22 08:49:06 PRDNAS002 kernel: mpt3sas_cm0: fault_state(0x2810)!
May 22 08:49:06 PRDNAS002 kernel: mpt3sas_cm0: sending diag reset !!
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: diag reset: SUCCESS
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: _base_display_fwpkg_version: complete
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: LSISAS3224: FWVersion(09.00.100.00), ChipRevision(0x01), BiosVersion(08.27.00.00)
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
May 22 08:49:07 PRDNAS002 kernel: mpt3sas_cm0: sending port enable !!
May 22 08:49:20 PRDNAS002 kernel: mpt3sas_cm0: port enable: SUCCESS
May 22 08:49:20 PRDNAS002 kernel: mpt3sas_cm0: search for end-devices: start

 

Do you have sufficient cooling?

JorgeB could confirm, but that FW seems quite old from what I understand.
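
If it helps, a couple of quick console checks (a sketch; note that sas3flash is a separate download from Broadcom's site, it's not included with Unraid):

grep -i 'Temperature Threshold\|Current Temp' /var/log/syslog   # watch for further overheat/reset events from the HBA driver
./sas3flash -list                                               # report the controller's current firmware and BIOS versions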


Thanks for the input @ChatNoir & @JorgeB

 

I updated the firmware of the HBA and added a temporary fan for additional cooling, and I ordered a few items to put together a permanent solution to make sure the HBA stays cooler.

After making these changes, I am still seeing the same issue.

 

It might not be related, but at some point after upgrading to 6.10 I noticed I was getting a lot of hardware errors.

 

May 23 05:45:24 PRDNAS002 kernel: {20}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
May 23 05:45:24 PRDNAS002 kernel: {20}[Hardware Error]: It has been corrected by h/w and requires no further action

 

I looked around online and found that someone had to run the following commands to eliminate them:

 

setpci -v -s 0000:40:01.2 CAP_EXP+0x8.w
setpci -v -s 0000:40:01.2 CAP_EXP+0x8.w=0x2936
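
(If the tweak needs to survive a reboot, I believe the same command can be added to the go file so it runs at every boot; a sketch, assuming the device stays at 0000:40:01.2 across boots:)

#!/bin/bash
# /boot/config/go - runs once at every boot on Unraid
# Re-apply the Device Control register tweak that silences the corrected-error spam
setpci -v -s 0000:40:01.2 CAP_EXP+0x8.w=0x2936
# Start the Management Utility (already present in the stock go file)
/usr/local/sbin/emhttp &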

 

Attached are the new diagnostics.

 

Appreciate all the help.

prdnas002-diagnostics-20220523-1207.zip

28 minutes ago, BF90X said:

It might not be related, but at some point after upgrading to 6.10 I noticed I was getting a lot of hardware errors.

 

These errors appear to come from one of the NVMe devices:

42:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a80a]
    Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a801]
    Kernel driver in use: nvme
    Kernel modules: nvme

 

If there's a different slot you can use, try it; if not, you should be able to suppress the errors, at least to stop filling up the log.

 

 

As for the parity check, it's using a lot of CPU. The parity check is single-threaded and it's showing 100% CPU, which would mean it's being limited by that, but that doesn't make much sense with the CPU you have... Did you notice slower performance with anything else?
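
An easy way to confirm that is to watch the per-core usage while the rebuild is running, for example:

top     # then press '1' to show each core separately
# one core pinned near 100% while the others sit mostly idle is consistent
# with the single-threaded parity calculation being the bottleneck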

8 minutes ago, JorgeB said:

 

These errors appear to come from one of the NVMe devices:

42:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a80a]
    Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a801]
    Kernel driver in use: nvme
    Kernel modules: nvme

 

If there's a different slot you can use, try it; if not, you should be able to suppress the errors, at least to stop filling up the log.

 

 

As for the parity check, it's using a lot of CPU. The parity check is single-threaded and it's showing 100% CPU, which would mean it's being limited by that, but that doesn't make much sense with the CPU you have... Did you notice slower performance with anything else?

 

Thanks for sharing that information; I will follow the steps in that link.

 

Ohhh, I guess that's why core 6 is maxed out.

Everything else seems to be working properly. I run Docker containers and a couple of VMs without many issues.

One of the steps I tried to troubleshoot the parity rebuild issue was to do a new config. I also pre-cleared the drives.

I rebooted into safe mode and tried the rebuild as well, but got the same speeds. I did not look at the CPU usage when I did that, though.

 

On 5/23/2022 at 1:17 PM, JorgeB said:

Another thing you could try is doing a new config and assigning only about half of the current array drives, to see if there's a difference; if it's the CPU that is somehow limiting it, it should be noticeably faster. You could also sync the entire array with single parity to compare.

 

Long story short, the parity rebuild appears to have gone back to normal speed.

I tried so many different things, I have no idea what fixed it.

 

- Updated the BIOS

- Updated to Unraid 6.10.1

- I tried one of the drives and it originally did not work. (Now that it's working, I did try the other drive instead.) So that might be it?

- Tried to boot UEFI instead of legacy (literally hours of trying, but no luck)

- At some point I wasn't even able to boot in legacy mode due to the AER issue. I added the following to the syslinux config:

 

nvme_core.default_ps_max_latency_us=5500

pci=nommconf

 

I already had:

 

pcie_aspm=off 
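
For reference, the append line in /boot/syslinux/syslinux.cfg ends up looking roughly like this (parameter order shouldn't matter):

append pcie_aspm=off pci=nommconf nvme_core.default_ps_max_latency_us=5500 initrd=/bzroot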

 

I am still seeing the "Hardware error from APEI Generic Hardware Error" messages, but not as many.

The only major issue I am seeing for now is that one of the 2.5-inch drives in one of the ZFS pools is failing; I'm waiting for the replacement.
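
Once the replacement arrives, the plan is roughly the usual ZFS swap (hypothetical pool and device names below):

zpool status hddpool                            # confirm which device is faulted
zpool replace hddpool <failing-dev> <new-dev>   # resilver onto the replacement drive
zpool status hddpool                            # watch the resilver progress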

 

[screenshot attached]

 

prdnas002-diagnostics-20220524-1235.zip

