Jump to content

Random Disks go into error state halfway through Parity check


Go to solution Solved by JorgeB,

Recommended Posts

Unraid 6.11.5

I keep having disks go into an error state during a parity check and it’s ticking me off and trying to zero in on the culprit. This has happened 3 times during a Pairty Check, usually within 10-15hrs into it the Parity check will stop and the server reports a disl going into error state. Drive 1 and 4 (20TB) were the ones that were reported. I'm sure I may of not made my post clear and unconfusing, I aplogize for that and will try to clarify anything in followup questions.  I plan on checking the hardware connection’s tomorrow.

image0.jpeg

 

image1.jpeg

syslog.1.txt.zip tower-smart-20230523-0643.zip

tower-diagnostics-20230523-0648.zip

Edited by Haroldkidd
Added Complete diagnostics
Link to comment
  • Solution
May 22 21:35:25 Tower kernel: mpt2sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready
May 22 21:35:25 Tower kernel: mpt2sas_cm0: fault_state(0x7e21)!
May 22 21:35:25 Tower kernel: mpt2sas_cm0: sending diag reset !!
May 22 21:35:26 Tower kernel: mpt2sas_cm0: diag reset: SUCCESS

The HBA keeps faulting and resetting, make sure it's well seated and sufficiently cooled, you can also try a different PCIe slot if available.

  • Thanks 1
Link to comment

JorgeB, thanks I'll look into that. I just expanded my disk storage and rplaced the original Marvell controller with a LSI 9207-8i HBA and a SFF-8088 to SAF-8087 adapter. My case is a 3U rack so it may be getting hot. I will check the HBA seating and connections, whats a good way of cooling it. Maybe it's time to upgrade my Tower is back from 2014 that I purchased from Limetech when they were selling pre configured servers. Also if I do manage to be able to get through a complete Parity check, what do i do about the high number of Synch Errors? 

Edited by Haroldkidd
Link to comment
1 minute ago, Haroldkidd said:

whats a good way of cooling it.

These controllers are made for server cases, they require some airflow around them, if your case doesn't have good airflow in that zone you can add a fan close or on top of it.

 

2 minutes ago, Haroldkidd said:

Also if I do manage to be able to get through a complete Parity check, what do i do about the high number of Synch Errors? 

First step would be a correcting check.

Link to comment
4 hours ago, JorgeB said:
May 22 21:35:25 Tower kernel: mpt2sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready
May 22 21:35:25 Tower kernel: mpt2sas_cm0: fault_state(0x7e21)!
May 22 21:35:25 Tower kernel: mpt2sas_cm0: sending diag reset !!
May 22 21:35:26 Tower kernel: mpt2sas_cm0: diag reset: SUCCESS

The HBA keeps faulting and resetting, make sure it's well seated and sufficiently cooled, you can also try a different PCIe slot if available.

I checked all the cables, reset both the HBA and the expander and I left the panal off of the tower and put a fan blowing on it, so we will see. Just curious why this only happens when I do a parity check and not when transfering new data or downloading to the share folders. Maybe because since all the drives are spun up when doing a Parity check it overheats. Curious 

Link to comment
  • 1 month later...

So this happened again, I did a Parity check before going to bed because I havent; done one since May 23, 2023 and I have added a lot more data to it. Woke up this morning with the Parity check paused and drive one disabled. Ugh this is frustrating, the HBA and expander are all new and I have the case open and a 26" box fan constantly blowing on the the card. Do I need to replace the wires or the HBA again? Hate to have to replace the wires as I suck at cable management, and it will be ugly.  Thanks for any support. Last time I had to remove the hba and reset it and disconnect and reconnect the cables before I could actually do a clean parity check. Will try to do that again. Could it be the Intel RES2SV240 24-Port Expander? It didn't come with the Bracket to screw it to the case, and it wobbles back and forth if you physically move it. 

tower-diagnostics-20230624-0705.zip syslog.1.txt syslog.txt

Edited by Haroldkidd
Added info
Link to comment

JorgeB, this is the same expander and HBA from the last post, was just explaining that it’s newly purchased and installed back in May 2023. I was able to run a parity check and currently sitting at 22% with 1 synch error. I had to pull the Intel RES2SV240 24-Port Expander out and reinsert it and make sure all the data cables were secure. I think it’s not sitting right in the PCI slot because there’s no bracket to secure it down and any movement or vibration causes it to wobble which maybe the reason why it’s resetting itself a lot. After this parity check I’m going to find an old bracket and see if that fixes the problem. 

Link to comment

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...