Marc_G2 Posted April 20, 2021
v 6.8.3
After moving my unRAID over to a different motherboard, I've encountered read errors which might be the fault of my LSI expansion card. This is the second time it has resulted in a disk getting disabled. I originally thought it had something to do with my UEFI boot setting, since the first time happened right at startup. See topic below. This time, the server was running for a couple of days before it randomly happened.
Does anyone have an idea of what the issue is specifically? Is it the expansion card? Or could the problem still be my HDD that got disabled?
Before migrating over to the new mobo, I had zero issues for a year and a half, so I'd be surprised if the card simply started dying out of nowhere. One notable difference is that I have 4 HDDs connected to the card instead of two.
What are some things I can try before buying a new card? Is it possible the card is overheating? (Unlikely, given when the errors popped up.) Is there any chance a mobo BIOS update will do anything?
Attachment: nas-ng-diagnostics-20210420-1855.zip
Marc_G2 Posted April 20, 2021
Before shutting the system down or anything, I started the array in maintenance mode and started a read check. So far it hasn't given any errors. So is it likely that the problem is that one HDD? Disk 1 was the drive that got disabled on both occasions. But if it's just that disk, does it make any sense for unRAID to report errors on the other disks? Also, the SMART stats for Disk 1 didn't indicate any issues either.
Vr2Io Posted April 21, 2021
The problem seems to be the ST4000VN000-2AH166 (disk 3): it doesn't respond to the HBA in time, so the HBA resets again and again, and this affects every disk connected to the HBA.
You should stop the array and disconnect the SATA link at the disk side, one disk at a time (HBA disks only), then keep watching the web log until the HBA stops resetting. This could narrow down the cause a bit.
Marc_G2 Posted April 21, 2021
Vr2Io said: "You should disconnect the SATA link at disk one by one, then keep track the log until HBA no more reset, this could narrow down the cause a bit."
That would show up as an error in the system log, right? The problem there is that my disks are getting disabled, which requires a full rebuild afterward. I swapped the SATA cables and I'm doing a rebuild right now. I haven't seen any errors yet.
Vr2Io Posted April 21, 2021
Marc_G2 said: "That would show up as an error in the system log right?"
Yes.
Marc_G2 said: "The problem there is my disks are getting disabled which requires a full rebuild afterward."
The problem is that the HBA resets non-stop because a device isn't responding. I've amended my previous reply: you should stop the array and disconnect the SATA link at the disk side, one disk at a time (HBA disks only), then keep watching the web log until the HBA stops resetting. This could narrow down the cause a bit.
Marc_G2 Posted April 21, 2021
Vr2Io said: "You should disconnect the SATA link at disk side one by one ( HBA disk only, stop array ), then keep track web log until HBA no more reset, this could narrow down the cause a bit."
The problem is that there are no errors most of the time. So if the array isn't active, it seems especially unlikely for the error to occur. Which line in the system log told you that the issue started with disk 3?
Vr2Io Posted April 21, 2021
Marc_G2 said: "The problem is there's no errors most of the time."
As I said, the HBA resets non-stop. Did you see that in the web log viewer?
Marc_G2 said: "What line in the system log did you find that the issue started with disk 3?"
It always responds late or not at all. Below is an example where device 5:0:0:0 is missing:
Apr 20 18:33:16 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred
Apr 20 18:33:16 NAS-NG kernel: sd 5:0:2:0: Power-on or device reset occurred
Apr 20 18:33:16 NAS-NG kernel: sd 5:0:3:0: Power-on or device reset occurred
Apr 20 18:33:25 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred
Apr 20 18:33:25 NAS-NG kernel: sd 5:0:2:0: Power-on or device reset occurred
Apr 20 18:33:25 NAS-NG kernel: sd 5:0:3:0: Power-on or device reset occurred
Once the HBA stops resetting, track down the real cause by swapping cables, disks, and ports. The HBA, the HBA ports, the cables, or the disks could each be the cause, so you need to troubleshoot it carefully.
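For anyone who wants to check a syslog for the pattern described above without reading it line by line, here is a minimal Python sketch. It groups the "Power-on or device reset occurred" messages by timestamp and reports which expected SCSI addresses stayed silent in each burst. The function names and the EXPECTED list are illustrative assumptions (four disks on HBA host 5, as in this thread); adjust them to your own setup.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
# "Apr 20 18:33:16 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred"
RESET_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"sd (?P<addr>\d+:\d+:\d+:\d+): Power-on or device reset occurred"
)


def reset_bursts(lines):
    """Group reset messages by timestamp: {timestamp: set of SCSI addresses}."""
    bursts = defaultdict(set)
    for line in lines:
        m = RESET_RE.match(line.strip())
        if m:
            bursts[m.group("ts")].add(m.group("addr"))
    return dict(bursts)


def silent_devices(bursts, expected):
    """For each burst, list the expected addresses that reported no reset."""
    return {ts: sorted(set(expected) - seen) for ts, seen in bursts.items()}


# Sample lines copied from the post above.
SAMPLE = """\
Apr 20 18:33:16 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred
Apr 20 18:33:16 NAS-NG kernel: sd 5:0:2:0: Power-on or device reset occurred
Apr 20 18:33:16 NAS-NG kernel: sd 5:0:3:0: Power-on or device reset occurred
Apr 20 18:33:25 NAS-NG kernel: sd 5:0:1:0: Power-on or device reset occurred
Apr 20 18:33:25 NAS-NG kernel: sd 5:0:2:0: Power-on or device reset occurred
Apr 20 18:33:25 NAS-NG kernel: sd 5:0:3:0: Power-on or device reset occurred
"""

# Assumption: the four disks on the HBA appear as targets 0-3 on host 5.
EXPECTED = ["5:0:0:0", "5:0:1:0", "5:0:2:0", "5:0:3:0"]

if __name__ == "__main__":
    bursts = reset_bursts(SAMPLE.splitlines())
    for ts, missing in silent_devices(bursts, EXPECTED).items():
        print(ts, "no reset message from:", ", ".join(missing) or "none")
```

A device that is repeatedly absent from the bursts is only a candidate. As noted above, the cable, the port, or the HBA itself could still be the real cause.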
Marc_G2 Posted April 21, 2021
Vr2Io said: "As said, HBA non-stop reset ... did you got that in Web log viewer ?"
The system has been running for a couple of hours. These are the only errors the system log is showing right now.
Vr2Io Posted April 21, 2021
Then keep watching the log until the problem happens again.
Marc_G2 Posted April 21, 2021
After looking over the system logs, I'm now thinking the LSI card (or, less likely, the motherboard) is the problem. I don't think it's any of the disks. But if anyone else has additional theories or things to try, please share.
JorgeB Posted April 21, 2021
Yes, looks like an issue with the HBA:
Apr 20 18:34:05 NAS-NG kernel: mpt2sas_cm0: fault_state(0x2622)!
Apr 20 18:34:05 NAS-NG kernel: mpt2sas_cm0: sending diag reset !!
Apr 20 18:34:06 NAS-NG kernel: mpt2sas_cm0: diag reset: SUCCESS
Have you tried not sleeping the server? Not all hardware supports sleep/wake-up correctly.
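To find every occurrence of this controller fault in a saved syslog, a short Python sketch like the one below may help. It only searches for lines in the same format as the ones quoted above; the function name is a hypothetical helper, and the pattern is an assumption based on those three lines rather than on the driver's full message set.

```python
import re

# Pattern modelled on the quoted line:
# "Apr 20 18:34:05 NAS-NG kernel: mpt2sas_cm0: fault_state(0x2622)!"
FAULT_RE = re.compile(
    r"kernel:\s+(?P<ctrl>mpt[23]sas_cm\d+): fault_state\((?P<code>0x[0-9a-fA-F]+)\)"
)


def find_hba_faults(lines):
    """Return (timestamp, controller, fault code) for each fault_state line."""
    faults = []
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            # Classic syslog timestamps ("Apr 20 18:34:05") are 15 characters wide.
            faults.append((line[:15], m.group("ctrl"), m.group("code")))
    return faults


# Sample lines copied from the post above.
SAMPLE = [
    "Apr 20 18:34:05 NAS-NG kernel: mpt2sas_cm0: fault_state(0x2622)!",
    "Apr 20 18:34:05 NAS-NG kernel: mpt2sas_cm0: sending diag reset !!",
    "Apr 20 18:34:06 NAS-NG kernel: mpt2sas_cm0: diag reset: SUCCESS",
]

if __name__ == "__main__":
    for ts, ctrl, code in find_hba_faults(SAMPLE):
        print(f"{ts}  {ctrl} reported fault_state {code}")
```

Comparing the fault timestamps with sleep and wake-up times in the same log is one way to test the sleep theory raised here.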
Marc_G2 Posted April 21, 2021
JorgeB said: "Have you tried not sleeping the server? Not every hardware supports sleep/wake up correctly."
That's something that crossed my mind. But I'm pretty sure that the first time this issue happened was shortly after a boot-up, before the server had ever gone to sleep. Later today I'm going to see if I can trigger the error by putting it to sleep and waking it up again, and by just starting and stopping the array.
Is there a way to configure unRAID to handle this error better? Could I make unRAID immediately stop the array once it starts seeing this particular fault? The way unRAID disables one of my disks after repeated reset attempts is a major headache.
Marc_G2 Posted April 21, 2021
Marc_G2 said: "But I'm pretty sure that first time this issue happened was shortly after a boot up before ever going to sleep."
Actually, the April 17th log seems to show that it did go to sleep for some reason (normally it would never go to sleep on a Saturday). So I'll focus on that area.
Marc_G2 Posted April 21, 2021
I put the system to sleep and woke it again with the array stopped, and then tried starting the array in maintenance mode. I'm not getting the fault so far. My guess is that the BIOS update fixed the issue.
I'd appreciate it if someone knowledgeable could look at this log from after the wake-up to see if there's anything concerning. The one concerning thing I saw was this warning:
ata4: COMRESET failed (errno=-16)
Attachment: nas-ng-syslog-20210421-2123.zip
JorgeB Posted April 22, 2021
Marc_G2 said: "ata4: COMRESET failed (errno=-16)"
I've seen those before after waking up; it's probably normal.
Marc_G2 Posted April 22, 2021
So these errors are more concerning. Why am I getting drive errors right after the drives all spin down? The server is in maintenance mode at the moment, so does that have something to do with it?
JorgeB Posted April 22, 2021
Marc_G2 said: "The server is in maintenance mode at the moment, so does that have something to do with it?"
No. Those errors after spin-down are not that uncommon and not a real problem. Sometimes changing the disk's APM level or disabling APM helps.
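The thread does not say how to change the APM level. One common way on Linux is the hdparm utility's -B option, where values 1 to 127 permit spin-down, 128 to 254 do not, and 255 asks the drive to disable APM altogether (not every drive supports that). The Python sketch below only builds the command; the helper name and the /dev/sdX path are placeholders, and whether a given setting helps on this hardware is an open question.

```python
import subprocess


def apm_command(device, level):
    """Build an hdparm command that sets the APM level of one drive.

    level: 1-127 permit spin-down, 128-254 do not, 255 disables APM.
    """
    if not 1 <= level <= 255:
        raise ValueError("APM level must be between 1 and 255")
    return ["hdparm", "-B", str(level), device]


def set_apm(device, level):
    """Run the command (needs root and an installed hdparm); raises on failure."""
    subprocess.run(apm_command(device, level), check=True)


if __name__ == "__main__":
    # /dev/sdX is a placeholder; substitute the real device before running set_apm().
    print(" ".join(apm_command("/dev/sdX", 255)))
```

Note that some drives forget the setting after a power cycle, so it may need to be reapplied at boot.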