Disk Read Errors on multiple disks. Need help diagnosing. LSI 9211 reporting: FAULT_STATE(0X2622)

Marc_G2 · April 20, 2021

v 6.8.3

After moving my unRAID server over to a different motherboard, I've encountered read errors which might be the fault of my LSI expansion card. This is the second time that it has resulted in a disk getting disabled. I originally thought it had something to do with my UEFI boot setting, since the first time happened right at startup. See topic below. This time, the server was running for a couple of days before it randomly happened.

Does anyone have an idea of what the issue is specifically? Is it the expansion card? Or could the problem still be my HDD that got disabled? Before migrating over to the new mobo, I had zero issues for a year and a half. So I'd be surprised if the card simply started dying out of nowhere. One notable difference is that I have 4 HDDs connected to the card instead of two.

What are some things I can try before buying a new card? Is it possible the card is overheating? (unlikely due to when the errors popped up)

Is there any chance a mobo BIOS update will do anything?

nas-ng-diagnostics-20210420-1855.zip

Edited April 21, 2021 by Marc_G2

Marc_G2 · April 20, 2021

Before shutting the system down or anything, I started the array in maintenance mode and started a read check. So far it hasn't given any errors. So is it likely that the problem is that one HDD? Disk 1 was the drive that got disabled on both occasions. But if it's just that disk, does it make any sense for unRAID to report errors on the other disks? Also, the SMART stats for Disk 1 didn't indicate any issues either.

Edited April 20, 2021 by Marc_G2

Vr2Io · April 21, 2021

The problem seems to be with the ST4000VN000-2AH166 (disk 3). It isn't responding to the HBA in time, so the HBA resets again and again, and this affects every disk connected to the HBA.

You should disconnect the SATA link at the disk side, one disk at a time (HBA disks only, with the array stopped), then keep watching the web log until the HBA stops resetting. This could narrow down the cause a bit.

Edited April 21, 2021 by Vr2Io

Marc_G2 · April 21, 2021

2 minutes ago, Vr2Io said:

The problem seems to be with the ST4000VN000-2AH166 (disk 3). It isn't responding to the HBA in time, so the HBA resets again and again, and this affects every disk connected to the HBA.

You should disconnect the SATA link at the disks one at a time, then keep watching the log until the HBA stops resetting. This could narrow down the cause a bit.

That would show up as an error in the system log, right? The problem there is that my disks are getting disabled, which requires a full rebuild afterward. I swapped the SATA cables and I'm doing a rebuild right now. I haven't seen any errors yet.

Vr2Io · April 21, 2021

Just now, Marc_G2 said:

That would show up as an error in the system log right?

Yes

1 minute ago, Marc_G2 said:

The problem there is that my disks are getting disabled, which requires a full rebuild afterward.

The problem is that the HBA resets non-stop because a device isn't responding.

Amending my previous reply:

You should disconnect the SATA link at the disk side, one disk at a time (HBA disks only, with the array stopped), then keep watching the web log until the HBA stops resetting. This could narrow down the cause a bit.

Marc_G2 · April 21, 2021

11 minutes ago, Vr2Io said:

Amending my previous reply:

You should disconnect the SATA link at the disk side, one disk at a time (HBA disks only, with the array stopped), then keep watching the web log until the HBA stops resetting. This could narrow down the cause a bit.

The problem is that there are no errors most of the time. So if the array isn't active, it seems especially unlikely for the error to occur.

Which line in the system log told you that the issue started with disk 3?

Vr2Io · April 21, 2021

8 minutes ago, Marc_G2 said:

The problem is that there are no errors most of the time.

As I said, the HBA resets non-stop ... did you see that in the web log viewer?

8 minutes ago, Marc_G2 said:

Which line in the system log told you that the issue started with disk 3?

It always responds late or not at all. Below is an example where device 5:0:0:0 is missing:

Apr 20 18:33:16 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred
Apr 20 18:33:16 NAS-NG kernel: sd 5:0:2:0: Power-on or device reset occurred
Apr 20 18:33:16 NAS-NG kernel: sd 5:0:3:0: Power-on or device reset occurred

Apr 20 18:33:25 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred
Apr 20 18:33:25 NAS-NG kernel: sd 5:0:2:0: Power-on or device reset occurred
Apr 20 18:33:25 NAS-NG kernel: sd 5:0:3:0: Power-on or device reset occurred

Once the HBA stops resetting, track down the real cause by swapping cables, disks, and ports. The HBA, its ports, the cables, or the disks could each be the cause, so you need to troubleshoot carefully.
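As a rough way to spot this pattern in a saved syslog, the reset lines can be grouped by timestamp and compared against the devices expected on the HBA. This is only a sketch: the function names are made up for illustration, and the expected address list (host 5, targets 0 to 3) is an assumption based on four disks being attached to the card.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   Apr 20 18:33:16 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred
RESET_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+sd\s+"
    r"(?P<addr>\d+:\d+:\d+:\d+):\s+Power-on or device reset occurred"
)

def resets_by_time(lines):
    """Group the SCSI addresses that logged a reset by their timestamp."""
    groups = defaultdict(set)
    for line in lines:
        match = RESET_RE.match(line)
        if match:
            groups[match.group("ts")].add(match.group("addr"))
    return dict(groups)

def missing_devices(groups, expected):
    """For each reset burst, list the expected addresses that did not report."""
    result = {}
    for ts, seen in groups.items():
        absent = sorted(set(expected) - seen)
        if absent:
            result[ts] = absent
    return result

# Log excerpt copied from this post.
SAMPLE = """\
Apr 20 18:33:16 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred
Apr 20 18:33:16 NAS-NG kernel: sd 5:0:2:0: Power-on or device reset occurred
Apr 20 18:33:16 NAS-NG kernel: sd 5:0:3:0: Power-on or device reset occurred
Apr 20 18:33:25 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred
Apr 20 18:33:25 NAS-NG kernel: sd 5:0:2:0: Power-on or device reset occurred
Apr 20 18:33:25 NAS-NG kernel: sd 5:0:3:0: Power-on or device reset occurred
"""

# Assumption: four disks on the HBA, addressed 5:0:0:0 through 5:0:3:0.
EXPECTED = ["5:0:0:0", "5:0:1:0", "5:0:2:0", "5:0:3:0"]

if __name__ == "__main__":
    bursts = resets_by_time(SAMPLE.splitlines())
    for ts, absent in missing_devices(bursts, EXPECTED).items():
        print(ts, "no reset line from:", ", ".join(absent))
```

A device that is repeatedly absent from these bursts is only a candidate. As noted above, the cable, the port, or the HBA itself could still be the real cause.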

Edited April 21, 2021 by Vr2Io

Marc_G2 · April 21, 2021

5 minutes ago, Vr2Io said:

As I said, the HBA resets non-stop ... did you see that in the web log viewer?

The system has been running for a couple of hours. These are the only errors the system log is showing right now.

Vr2Io · April 21, 2021

Then you can keep watching until the problem happens again.

Marc_G2 · April 21, 2021

After looking over the system logs, I'm now thinking the LSI card (or less likely, the motherboard) is the problem. I don't think it's any of the disks.

But if anyone else has additional theories or things to try, please share.

JorgeB · April 21, 2021

Yes, looks like an issue with the HBA:

Apr 20 18:34:05 NAS-NG kernel: mpt2sas_cm0: fault_state(0x2622)!
Apr 20 18:34:05 NAS-NG kernel: mpt2sas_cm0: sending diag reset !!
Apr 20 18:34:06 NAS-NG kernel: mpt2sas_cm0: diag reset: SUCCESS

Have you tried not sleeping the server? Not all hardware supports sleep/wake-up correctly.
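To check how often the HBA faults, and whether the faults line up with sleep/wake events, the `fault_state` lines can be extracted from a saved syslog. The sketch below assumes the line format shown in this post. The function name is hypothetical, and the timestamp is simply taken as the first 15 characters of each line.

```python
import re

# Matches the driver fault line quoted above, e.g.
#   Apr 20 18:34:05 NAS-NG kernel: mpt2sas_cm0: fault_state(0x2622)!
FAULT_RE = re.compile(
    r"kernel:\s+(?P<ctrl>mpt\dsas_cm\d+):\s+fault_state\((?P<code>0x[0-9a-fA-F]+)\)"
)

def find_faults(lines):
    """Return (timestamp, controller, fault code) for each HBA fault line."""
    faults = []
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            # Assumption: classic syslog timestamps occupy the first 15 characters.
            faults.append((line[:15], match.group("ctrl"), match.group("code")))
    return faults

# Log excerpt copied from this post.
SAMPLE = """\
Apr 20 18:34:05 NAS-NG kernel: mpt2sas_cm0: fault_state(0x2622)!
Apr 20 18:34:05 NAS-NG kernel: mpt2sas_cm0: sending diag reset !!
Apr 20 18:34:06 NAS-NG kernel: mpt2sas_cm0: diag reset: SUCCESS
"""

if __name__ == "__main__":
    for ts, ctrl, code in find_faults(SAMPLE.splitlines()):
        print(f"{ts}: {ctrl} reported fault {code}")
```

Comparing these timestamps with the times the server went to sleep or woke up would show whether the two are related.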

Marc_G2 · April 21, 2021

Just now, JorgeB said:

Have you tried not sleeping the server? Not all hardware supports sleep/wake-up correctly.

That's something that crossed my mind. But I'm pretty sure that the first time this issue happened was shortly after a bootup, before ever going to sleep. Later today I'm going to see if I can trigger the error by putting it to sleep and waking it up again, and by just starting and stopping the array.

Is there a way to configure unRAID to better handle this error? Could I make unRAID immediately stop the array once it starts seeing this particular fault? The way unRAID disables one of my disks after trying repeated resets is a major headache.

Marc_G2 · April 21, 2021

10 minutes ago, Marc_G2 said:

But I'm pretty sure that the first time this issue happened was shortly after a bootup, before ever going to sleep.

Actually, the April 17th log seems to show it did go to sleep for some reason (normally it'd never go to sleep on Saturday). So I'll focus on that area.

Edited April 21, 2021 by Marc_G2

Marc_G2 · April 21, 2021

I put the system to sleep and woke it again with the array stopped, and then tried starting it in maintenance mode. I'm not getting the fault so far. My guess is the BIOS update fixed the issue. I'd appreciate it if someone knowledgeable could look at this log after wake-up to see if there's anything concerning.

The one concerning thing I saw was a warning.

ata4: COMRESET failed (errno=-16)

nas-ng-syslog-20210421-2123.zip

JorgeB · April 22, 2021

9 hours ago, Marc_G2 said:

ata4: COMRESET failed (errno=-16)

I've seen those before after waking up, probably normal.
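For reference, the number in that message is a standard errno value with its sign flipped. On Linux, 16 is `EBUSY` (device or resource busy), which fits a link that isn't ready yet right after wake-up. A small sketch to decode such values follows. The helper name is made up for illustration, and it relies on the errno table of the machine it runs on.

```python
import errno
import os

def describe_errno(value):
    """Return (symbolic name, description) for a kernel-style negative errno."""
    code = abs(value)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

if __name__ == "__main__":
    name, text = describe_errno(-16)
    print(f"errno=-16 is {name}: {text}")
```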

Marc_G2 · April 22, 2021

So these errors are more concerning. Why am I getting drive errors right after they all spin down? The server is in maintenance mode at the moment, so does that have something to do with it?

image.png.97cd2b8b324e889c35648c2075fd2f71.png

JorgeB · April 22, 2021

6 minutes ago, Marc_G2 said:

The server is in maintenance mode at the moment, so does that have something to do with it?

No. Those errors after spin-down are not that uncommon and are not a real problem. Sometimes changing the disk's APM level, or disabling APM, helps.
