Disk Read Errors on multiple disks. Need help diagnosing. LSI 9211 reporting: FAULT_STATE(0X2622)


v 6.8.3

After moving my unRAID server to a different motherboard, I've been getting read errors that might be the fault of my LSI expansion card. This is the second time it has resulted in a disk getting disabled. I originally thought it had something to do with my UEFI boot setting, since the first time happened right at startup. See topic below. This time, the server had been running for a couple of days before it randomly happened.

 

Does anyone have an idea of what the issue is specifically? Is it the expansion card? Or could the problem still be my HDD that got disabled? Before migrating over to the new mobo, I had zero issues for a year and a half, so I'd be surprised if the card simply started dying out of nowhere. One notable difference is that I now have 4 HDDs connected to the card instead of two.

 

What are some things I can try before buying a new card? Is it possible the card is overheating? (Unlikely, given when the errors popped up.)

Is there any chance a mobo BIOS update will do anything?

 

 

 

nas-ng-diagnostics-20210420-1855.zip 

Edited by Marc_G2

Before shutting the system down or anything, I started the array in maintenance mode and started a read check. So far it hasn't given any errors. So is it likely that the problem is that one HDD? Disk 1 was the drive that got disabled on both occasions. But if it's just that disk, does it make any sense for unRAID to report errors on the other disks? Also, the SMART stats for Disk 1 didn't indicate any issues either.

Edited by Marc_G2

The problem seems to be with the ST4000VN000-2AH166 (disk 3). It isn't responding to the HBA in time, so the HBA resets again and again, and that affects every disk connected to the HBA.

 

With the array stopped, disconnect the SATA link at the disk side, one disk at a time (HBA-attached disks only), and keep watching the web log until the HBA stops resetting. That should narrow down the cause a bit.

Edited by Vr2Io
2 minutes ago, Vr2Io said:

The problem seems to be with the ST4000VN000-2AH166 (disk 3). It isn't responding to the HBA in time, so the HBA resets again and again, and that affects every disk connected to the HBA.

Disconnect the SATA link at the disks one by one, then keep watching the log until the HBA stops resetting. That should narrow down the cause a bit.

That would show up as an error in the system log, right? The problem there is that my disks are getting disabled, which requires a full rebuild afterward. I swapped the SATA cables and I'm doing a rebuild right now. I haven't seen any errors yet.

Just now, Marc_G2 said:

That would show up as an error in the system log, right?

Yes

 

1 minute ago, Marc_G2 said:

The problem there is my disks are getting disabled which requires a full rebuild afterward.

The problem is that the HBA resets non-stop because a device is not responding.

 

I've amended my previous reply:

With the array stopped, disconnect the SATA link at the disk side, one disk at a time (HBA-attached disks only), and keep watching the web log until the HBA stops resetting. That should narrow down the cause a bit.

11 minutes ago, Vr2Io said:

I've amended my previous reply:

With the array stopped, disconnect the SATA link at the disk side, one disk at a time (HBA-attached disks only), and keep watching the web log until the HBA stops resetting. That should narrow down the cause a bit.

The problem is that there are no errors most of the time. So if the array isn't active, it seems especially unlikely for the error to occur.

 

Which line in the system log showed you that the issue started with disk 3?

8 minutes ago, Marc_G2 said:

The problem is that there are no errors most of the time.

As I said, the HBA is resetting non-stop. Did you see that in the web log viewer?

 

8 minutes ago, Marc_G2 said:

Which line in the system log showed you that the issue started with disk 3?

It always responds late or not at all. Below is an example where device 5:0:0:0 is missing:

 

Apr 20 18:33:16 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred
Apr 20 18:33:16 NAS-NG kernel: sd 5:0:2:0: Power-on or device reset occurred
Apr 20 18:33:16 NAS-NG kernel: sd 5:0:3:0: Power-on or device reset occurred

 

Apr 20 18:33:25 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred
Apr 20 18:33:25 NAS-NG kernel: sd 5:0:2:0: Power-on or device reset occurred
Apr 20 18:33:25 NAS-NG kernel: sd 5:0:3:0: Power-on or device reset occurred

 

Once the HBA stops resetting, track down the real cause by swapping cables, disks, and ports. The HBA, an HBA port, a cable, or a disk could each be the cause, so you need to troubleshoot it methodically.
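
For anyone who wants to check their own syslog the same way, here is a minimal sketch (not part of unRAID) that groups the "Power-on or device reset occurred" lines by timestamp and lists which disks stayed silent in each burst. The `EXPECTED` list is an assumption: it supposes the four HBA disks sit at 5:0:0:0 through 5:0:3:0, so adjust it to your own system. The function name `missing_per_burst` is made up for this example.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   Apr 20 18:33:16 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred
DEVICE_RESET_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"sd (?P<addr>\d+:\d+:\d+:\d+): Power-on or device reset occurred"
)

def missing_per_burst(lines, expected):
    """Group reset messages by timestamp and return, for each burst,
    the expected SCSI addresses that did NOT log a reset."""
    bursts = defaultdict(set)
    for line in lines:
        m = DEVICE_RESET_RE.match(line)
        if m:
            bursts[m.group("ts")].add(m.group("addr"))
    return {ts: sorted(set(expected) - seen) for ts, seen in bursts.items()}

# ASSUMPTION: four disks on the HBA, addressed 5:0:0:0 .. 5:0:3:0.
EXPECTED = ["5:0:0:0", "5:0:1:0", "5:0:2:0", "5:0:3:0"]

# The six lines quoted in this post.
sample = [
    "Apr 20 18:33:16 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred",
    "Apr 20 18:33:16 NAS-NG kernel: sd 5:0:2:0: Power-on or device reset occurred",
    "Apr 20 18:33:16 NAS-NG kernel: sd 5:0:3:0: Power-on or device reset occurred",
    "Apr 20 18:33:25 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred",
    "Apr 20 18:33:25 NAS-NG kernel: sd 5:0:2:0: Power-on or device reset occurred",
    "Apr 20 18:33:25 NAS-NG kernel: sd 5:0:3:0: Power-on or device reset occurred",
]

for ts, missing in missing_per_burst(sample, EXPECTED).items():
    print(ts, "missing:", missing)
```

On the six lines above it reports 5:0:0:0 as missing from both bursts. A disk that is missing from every burst is only a candidate; the cable or HBA port it hangs off could equally be at fault.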

Edited by Vr2Io

Yes, looks like an issue with the HBA:

 

Apr 20 18:34:05 NAS-NG kernel: mpt2sas_cm0: fault_state(0x2622)!
Apr 20 18:34:05 NAS-NG kernel: mpt2sas_cm0: sending diag reset !!
Apr 20 18:34:06 NAS-NG kernel: mpt2sas_cm0: diag reset: SUCCESS

 

Have you tried not sleeping the server? Not all hardware supports sleep/wake-up correctly.
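
To see how often the HBA itself faults (for example, before and after disabling sleep), a small sketch like the one below can pull the fault and diag-reset events out of a saved syslog. The regexes are an assumption based only on the three lines quoted above; other driver versions may word these messages differently. `summarize_hba_faults` is a made-up helper name.

```python
import re

# ASSUMPTION: message formats inferred from the quoted log lines only.
FAULT_RE = re.compile(r"(mpt\dsas_cm\d+): fault_state\((0x[0-9a-fA-F]+)\)")
DIAG_RESET_RE = re.compile(r"(mpt\dsas_cm\d+): diag reset: (\w+)")

def summarize_hba_faults(lines):
    """Return (faults, resets): lists of (controller, fault code) and
    (controller, reset result) in the order they appear in the log."""
    faults, resets = [], []
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            faults.append((m.group(1), m.group(2)))
            continue
        m = DIAG_RESET_RE.search(line)
        if m:
            resets.append((m.group(1), m.group(2)))
    return faults, resets

log = [
    "Apr 20 18:34:05 NAS-NG kernel: mpt2sas_cm0: fault_state(0x2622)!",
    "Apr 20 18:34:05 NAS-NG kernel: mpt2sas_cm0: sending diag reset !!",
    "Apr 20 18:34:06 NAS-NG kernel: mpt2sas_cm0: diag reset: SUCCESS",
]

faults, resets = summarize_hba_faults(log)
print("faults:", faults)
print("resets:", resets)
```

To run it on a real log, pass an open file instead of the `log` list, e.g. `summarize_hba_faults(open("syslog.txt"))`, where the file name is whatever you extracted from the diagnostics zip.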

Just now, JorgeB said:

Have you tried not sleeping the server? Not all hardware supports sleep/wake-up correctly.

That's something that crossed my mind. But I'm pretty sure that the first time this issue happened was shortly after a boot up, before the server had ever gone to sleep. Later today I'm going to see if I can trigger the error by putting it to sleep and waking it up again, and also by just starting and stopping the array.

 

Is there a way to configure unRAID to better handle this error? Could I make unRAID immediately stop the array once it starts seeing this particular fault? The way unRAID disables one of my disks after trying repeated resets is a major headache.

10 minutes ago, Marc_G2 said:

But I'm pretty sure that the first time this issue happened was shortly after a boot up, before the server had ever gone to sleep.

Actually, the April 17th log seems to show it did go to sleep for some reason (normally it'd never go to sleep on a Saturday). So I'll focus on that area.

Edited by Marc_G2

I put the system to sleep and woke it again with the array stopped, and then tried starting the array in maintenance mode. I'm not getting the fault so far. My guess is the BIOS update fixed the issue. I'd appreciate it if someone knowledgeable could look at this log from after the wake-up to see if there's anything concerning.

 

The one concerning thing I saw was a warning.  

ata4: COMRESET failed (errno=-16)

 

nas-ng-syslog-20210421-2123.zip
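
As a side note on reading that warning: kernel messages print error codes as negative errno values, so errno=-16 can be decoded with a couple of lines of Python. This is only a sketch for translating the number; `describe_errno` is a made-up helper, and the result says nothing on its own about whether the warning is harmful.

```python
import errno
import os

def describe_errno(value):
    """Translate a (possibly negative) kernel errno into its symbolic
    name and the platform's description string."""
    code = abs(value)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)

name, text = describe_errno(-16)
print(name, "-", text)
```

On Linux, 16 is EBUSY, i.e. the device or link reported busy at the moment of the reset.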
