6.9.x, LSI Controllers & Ironwolf Disks Disabling - Summary & Fix


Cessquill

Recommended Posts

On 7/19/2022 at 12:07 PM, deltaexray said:

Would like to know if this is still needed or has been, sort of, fixed in the latest 6.10.xx version? By now it's the only reason I'm not updating.

I am waiting as well, and from what I can tell this issue will not be fixed by LimeTech. I think if we want to move on from 6.8.x to anything newer, we'll have to run through the steps outlined above.

 

I am still holding out for something, but again, I'm not holding my breath. What frustrates me the most is that this was working in 6.8.x and not in 6.9 onwards, so it is something they could potentially address.

Link to comment

It only takes a couple of minutes, can be done on existing drives while running <6.9, and on new drives while they're formatting.  The instructions are only long because they walk through every step - it's really pretty straightforward.  Heck, if you're on 6.8.x, go through the steps anyhow - you won't break anything.  You're then good to upgrade when you want.

 

As I understand it, it's more of a Seagate/Linux issue.  Whilst I'd love to test 6.10, I don't have spare hardware.

Link to comment
  • 1 month later...

So, I'm back. And this time with good news: it worked, without a single issue. So thank you for that, and now on to the details haha.

I used the SeaChest software, booted from a USB stick on my PC, and did every drive one by one, one after the other. Luckily there were only four, so it didn't take that long. After that, I booted the server back up and upgraded to 6.10.3, which in between crashed my cache drives, but that has been fixed by now too.
Fun fact: while googling more about this topic, I came across the Synology forums and it turns out they have/had the same issues. 8 and 10 TB drives weren't being read properly, didn't even show up - the whole nine yards, basically. It all came down to the EPC feature set that Seagate carried over from its enterprise drives to the IronWolf drives. It is so new that most software on the consumer side can't really work with it and - that's the core issue - most software developers aren't fixing it. Synology addressed it from their side after a ton of reports on their forums by deactivating EPC as a whole, just as we did here manually. So it's basically a software-side issue that needs work from them.
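For anyone following along, the per-drive commands were along these lines - the binary names carry the version number of whichever SeaChest release you download, the /dev/sgN numbers differ per system, and the SeaChest_Configure name/flag below is from the builds I've seen rather than the listing in this thread, so treat this as a sketch and check sg_map and each tool's --help output first:

# list the sg handle for each drive (sg_map is part of sg3_utils)
sg_map

# show drive details and the feature list for the drive on /dev/sg1
./SeaChest_Info_150_11923_64 -d /dev/sg1 -i

# disable the EPC feature set on that drive
./SeaChest_PowerControl_1100_11923_64 -d /dev/sg1 --EPCfeature disable

# low-current spin-up lives in the SeaChest_Configure tool in the builds I've seen;
# the exact flag may differ between releases, so verify with --help
./SeaChest_Configure_<version> -d /dev/sg1 --lowCurrentSpinup disable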

Anyway, that's it. This guide was immensely helpful - a big, big thank you to you for your work and this whole thread, it really saved my bacon.

Thanks :)

  • Like 1
Link to comment
  • 1 month later...

Again had issues with my IronWolf drives. I have disabled low power spin-up with SeaChest (EPC is not found in my drives' feature list). Eventually I'm going to move to WD Red Plus drives, but in the meantime I was wondering whether this issue affects all LSI controllers. Would it help if I switched from the H310 (SAS2008-based) controller to a 9207-8i? I have the latest firmware on my current LSI.

Edited by ReLe
Link to comment
44 minutes ago, kolepard said:

I'm sorry to hear you're continuing to have problems.  I can tell you that since making these adjustments, I've had no trouble with the 9205-16i I have in my main server, which is on 24/7.  

 

Kevin

Thanks for the reply. Just to rule out possible issues with the controller, I've ordered a refurbished 9207 (server pull from a reputable seller) and will see if the problems get resolved.

Link to comment
  • 2 weeks later...

Hi all,

 

This still happens with 6.11.1, right? I turned on spin-down on my drives and my ST8000VN004-2M2101 dropped off.

 

I see others mentioning the same revision (2M2101).

 

My other ST8000VN0004 drives seemed to be fine... but perhaps it's just a matter of time?

Edited by coolspot
Link to comment
8 hours ago, coolspot said:

Hi all,

 

This still happens with 6.11.1, right? I turned on spin-down on my drives and my ST8000VN004-2M2101 dropped off.

 

I see others mentioning the same revision (2M2101).

 

My other ST8000VN0004 drives seemed to be fine... but perhaps it's just a matter of time?

My latest failures were on a crossflashed H310 (LSI SAS2008-based) card and Unraid version 6.11.1. On 3 Nov 2022 I switched to an LSI 9207, and time will tell if this helps.

Edited by ReLe
Link to comment
  • 1 month later...
On 11/6/2022 at 3:42 AM, ReLe said:

My latest failures were on a crossflashed H310 (LSI SAS2008-based) card and Unraid version 6.11.1. On 3 Nov 2022 I switched to an LSI 9207, and time will tell if this helps.

 

I got an RMA for my IronWolf and it was replaced with an ST8000VN004-2M2101, and it just dropped off :(

 

It seems the 2M2101 revision is the most prone to this error. My other ST8000VN004 drives are different revisions and don't seem to drop off.

 

 

Edited by coolspot
Link to comment
  • 3 weeks later...

Add me to the seemingly relatively small list of folks this solution (much appreciated fix/write-up) didn't work for.  Here's a summary of my experience:

 

Unraid 6.10.3

IBM M1015 9220-8i (IT mode with latest firmware)

Drive that keeps dropping off (as of beginning of December 2022) = ST10000DM0004 (2GR11L)

 

There are two other ST10000DM0004 drives in the system (one is the same revision; the other, the parity drive, is a different revision - 1ZC101).  There are also 2 x ST10000NE0008 drives in the array.  Only the specific drive listed above has started to drop off, and this is after at least two years of no problems.

 

First drop-off occurred 01 Dec 2022 (scheduled parity check).  I researched the issue, found this thread (and others), checked the firmware version for the drive that was getting dropped (despite not seeing why this should make any difference given trouble-free operation for the last few years) and found no updates.  Did the same for the HBA card - already on the latest firmware.  Ran through the changes outlined in the OP and executed all recommended changes on all of the above-mentioned drives (not just the one that dropped off).  Realizing there could be other factors (cabling, power, etc.), I also moved the drive to a different slot on the enclosure I'm using (RSV-SATA-Cage-34 in a Rosewill 4U whitebox build) and swapped places with another drive.  The theory was that if the issue happened again on the same slot, the RSV-SATA-Cage-34 or the associated breakout cable could be suspect.  Removed the drive from the array, ran extended SMART tests - everything normal.  Reintroduced the drive to the array and performed a rebuild - no errors.  Walked away and forgot about it.
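For reference, the extended SMART test can be kicked off from the console with smartctl (part of smartmontools); the device name here is just an example - substitute the suspect drive:

# start a long (extended) self-test on the drive
smartctl -t long /dev/sdX

# check progress and the self-test log once it finishes
smartctl -a /dev/sdX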

 

On 01 Jan 2023, the same drive dropped off again at the start of a parity check.  This eliminates the slot/cable as culprit, as the drive was in a new slot with a different breakout cable.  This time, I removed the IBM M1015 controller and reconnected all drives directly to the motherboard.  Rebuilt the drive over the top of itself again and this time also immediately followed up with a parity check.  No errors.

 

I don't really know what to make of all this.  Obviously, it's possible I would immediately have had the same drop-off issue if I had run a parity check after the first incident on 01 Dec 2022... or would I?  Also, when I ran the parity check this time around, I didn't spin the drive down before running it, so the problem might just show up again on 01 Feb 2023 (a quick way to force that spin-up test on demand is sketched further down this post).  Even if the issue doesn't occur again in Feb, I'm always going to wonder if it will suddenly show up again at a later date.  It still begs the question why this would suddenly start happening after years of problem-free operation, with no changes to hardware/software/OS around the time of the first occurrence.  This is an ugly problem and a huge time waster when it happens.  Rebuilding the array and re-checking parity is not fun.  If this happens again, I'll follow up in this thread.  If it doesn't, it seems either:

 

-  The SeaChest solution doesn't work for the ST10000DM0004 drive I started having problems with;  or

-  My IBM M1015 went bad (which makes no sense given that this is only happening on one drive... so far);  or

-  The IBM M1015 simply decided it doesn't like that drive anymore for some reason;  or

-  We're still missing part of what is causing this issue.

 

I'm going to go with that last guess...
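Rather than waiting a month for the next scheduled parity check, the spin-up condition can be forced on demand - assuming hdparm is available, sdX is the suspect drive, and it's done outside normal array activity:

# put the drive into standby, then force a read straight from the device so it
# has to spin back up; if the controller is going to drop it, this is the moment
hdparm -y /dev/sdX
dd if=/dev/sdX of=/dev/null bs=1M count=64 iflag=direct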

 

Possibly MTF.

 

Link to comment
  • 2 weeks later...

Hi guys, I have another data point for this problem.  I'm using LSI 2008 HBAs into HP SAS expanders, with an assortment of drives (including Seagate Archive and EXOS) with spindown set, and am seeing exactly this thread's problem on drive spin-up. Although the server is running a different Linux distro, I think the root cause might be the same one later versions of Unraid are seeing.

 

I have narrowed the problem down to Linux kernel 5.17 and onwards.  I've tested 5.17, 5.18, 5.19 and 6.1.6, all of which behave identically when trying to spin up drives (some Hitachi and Western Digital drives are also affected).  Linux kernel 5.16 and prior work fine.  I'm almost positive this is the same issue you're having with versions of Unraid past the 6.8.x series, perhaps due to a backported patch (6.10 is running a late version of 5.15).  I really hope a dev stumbles upon this and it helps narrow down the root cause.

For the sake of searchability, I'll include kernel logs from both working and non working situations. 

First up is 5.16 (and prior), which works perfectly. It produces the following errors on spin-up (but operates normally). I'm using bcache on these drives and the kernel readahead seemingly errors out with I/O errors due to the spin-up timeout - but once the drive is spun up, everything operates normally and user-space processes never receive any I/O errors. Processes trying to access the drive are hung until it's online, however, which is normal. The contained BTRFS filesystem never complains, confirming data integrity is good, as it's a checksummed filesystem. Note the cmd_age=10s on the first line, indicating spin-up is taking longer than 10s.

[Jan18 05:08] sd 0:0:13:0: [sdn] tag#705 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=10s
[  +0.000011] sd 0:0:13:0: [sdn] tag#705 Sense Key : 0x2 [current] 
[  +0.000006] sd 0:0:13:0: [sdn] tag#705 ASC=0x4 ASCQ=0x2 
[  +0.000006] sd 0:0:13:0: [sdn] tag#705 CDB: opcode=0x88 88 00 00 00 00 02 3c 46 f5 f0 00 00 03 e0 00 00
[  +0.000003] I/O error, dev sdn, sector 9601218032 op 0x0:(READ) flags 0x80700 phys_seg 124 prio class 0
[  +0.123623] bcache: bch_count_backing_io_errors() sdn1: Read-ahead I/O failed on backing device, ignore


Next up is 5.17 (and onwards), which produces the following on drive spin-up (and hangs the affected drive until it comes online):

[Jan16 08:49] sd 1:0:13:0: attempting task abort!scmd(0x0000000052c3e39b), outstanding for 10028 ms & timeout 60000 ms
[  +0.000015] sd 1:0:13:0: [sdak] tag#1552 CDB: opcode=0x1b 1b 00 00 00 01 00
[  +0.000004] scsi target1:0:13: handle(0x0017), sas_address(0x5001438018df2cd5), phy(21)
[  +0.000007] scsi target1:0:13: enclosure logical id(0x5001438018df2ce5), slot(54) 
[  +1.449183] sd 1:0:13:0: task abort: SUCCESS scmd(0x0000000052c3e39b)
[  +0.000010] sd 1:0:13:0: attempting device reset! scmd(0x0000000052c3e39b)
[  +0.000005] sd 1:0:13:0: [sdak] tag#1552 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
[  +0.000003] scsi target1:0:13: handle(0x0017), sas_address(0x5001438018df2cd5), phy(21)
[  +0.000004] scsi target1:0:13: enclosure logical id(0x5001438018df2ce5), slot(54) 
[Jan16 08:50] sd 1:0:13:0: device reset: FAILED scmd(0x0000000052c3e39b)
[  +0.000006] scsi target1:0:13: attempting target reset! scmd(0x0000000052c3e39b)
[  +0.000006] sd 1:0:13:0: [sdak] tag#1552 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
[  +0.000003] scsi target1:0:13: handle(0x0017), sas_address(0x5001438018df2cd5), phy(21)
[  +0.000003] scsi target1:0:13: enclosure logical id(0x5001438018df2ce5), slot(54) 
[  +3.000146] scsi target1:0:13: target reset: SUCCESS scmd(0x0000000052c3e39b)
[  +0.248868] sd 1:0:13:0: Power-on or device reset occurred


And then occasionally (only on 5.17 onwards) I'm seeing this:

[Jan16 10:15] mpt2sas_cm1: sending diag reset !!
[  +0.480071] EDAC PCI: Signaled System Error on 0000:05:00.0
[  +0.000011] EDAC PCI: Master Data Parity Error on 0000:05:00.0
[  +0.000003] EDAC PCI: Detected Parity Error on 0000:05:00.0
[  +0.466972] mpt2sas_cm1: diag reset: SUCCESS
[  +0.057979] mpt2sas_cm1: CurrentHostPageSize is 0: Setting default host page size to 4k
[  +0.044426] mpt2sas_cm1: LSISAS2008: FWVersion(18.00.00.00), ChipRevision(0x03), BiosVersion(07.39.02.00)
[  +0.000009] mpt2sas_cm1: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[  +0.000078] mpt2sas_cm1: sending port enable !!
[  +7.814105] mpt2sas_cm1: port enable: SUCCESS
[  +0.000314] mpt2sas_cm1: search for end-devices: start
[  +0.001358] scsi target1:0:0: handle(0x000a), sas_addr(0x5001438018df2cc0)
[  +0.000011] scsi target1:0:0: enclosure logical id(0x5001438018df2ce5), slot(35)
[  +0.000117] scsi target1:0:1: handle(0x000b), sas_addr(0x5001438018df2cc1)
[  +0.000006] scsi target1:0:1: enclosure logical id(0x5001438018df2ce5), slot(34)
[  +0.000086] scsi target1:0:2: handle(0x000c), sas_addr(0x5001438018df2cc3)

etc (all drives reset and reconnect)

This causes some problems with MD arrays etc., as all drives are hung whilst this is going on and heaps of timeouts occur.  Between these two scenarios, it might explain the dropouts you're seeing in Unraid.
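If anyone wants to poke at this, the 10 s in the logs above happens to match the kernel's default SCSI error-handling timeout, and both it and the command timer are exposed per device in sysfs. Whether raising them actually avoids the resets is untested on my side - this is just where to look (sdX is a placeholder):

# show the SCSI command timer (default 30 s) and error-handling timeout (default 10 s)
cat /sys/block/sdX/device/timeout
cat /sys/block/sdX/device/eh_timeout

# temporarily raise them for a slow-spinning drive (not persistent across reboots)
echo 60 > /sys/block/sdX/device/timeout
echo 30 > /sys/block/sdX/device/eh_timeout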

Hope this helps!  I'm looking into changing my HBA to one of the ones quoted in this thread to work around the problem - I don't want to be stuck on Linux kernel 5.16 forever. I will be back with updates!

Edited by unraidwok
mistake
  • Like 1
  • Thanks 1
Link to comment
  • 3 weeks later...

Been a month since I posted about my issues.  The parity check completed for the first time in two months without issues (as expected).  Once again, this problem started suddenly and mysteriously for me after running problem-free for years.  Also, there have been no changes to the kernel on my system (at least not in many months), so I can't see that being the cause either.  In the end, I simply removed my IBM M1015 9220-8i from the equation and I'm back to using SATA directly off the motherboard - so far, no issues.

Link to comment
  • 3 weeks later...

Adding to the growing number of users for whom this fix does not resolve the issue.  I have an array of 10 drives: four 8 TB and six 12 TB drives running from an LSI card / expander combo.  I have four ST8000VN004 drives and have disabled EPC and low power spin-up.  Two of the ST8000VN004 drives continue to throw errors: logs show they are not spinning up within 15 seconds, then the read/write errors occur.  All four have the same firmware (60).

 

My latest attempt to solve this is as follows: I have removed the expander card and run the 8 TB drives directly from the motherboard.  Only the 12 TB drives remain on the LSI card.  Fingers crossed!
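If it helps anyone double-check their drives, SeaChest_PowerControl can report whether EPC is actually off after running the fix - the binary name below is the one shown elsewhere in this thread and will differ with your download, as will the /dev/sgN number:

# report the current EPC settings for the drive on /dev/sg1
./SeaChest_PowerControl_1100_11923_64 -d /dev/sg1 --showEPCSettings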

Link to comment

I'm new to Linux and I'm attempting to view the status of EPC and low power spin-up on my new Seagate EXOS 12 TB drives (ST12000NM0008) using the SeaChest_Info tool.  The problem is that every time I attempt it, I get a "command not found" error.  I've attempted using the Non-RAID ubuntu 20.04_x86_64 and centos-7_x86_64 files, as well as an old version posted in this thread (150_11923_64), all with the same result.  Console printout below:

 

root@Tower:/tmp/seachest# sg_map
/dev/sg0  /dev/sda
/dev/sg1  /dev/sdb
/dev/sg2  /dev/sdc
root@Tower:/tmp/seachest# ls -a
./  ../  SeaChest_Info_150_11923_64*  SeaChest_PowerControl_1100_11923_64*
root@Tower:/tmp/seachest# chmod +x SeaChest_*
root@Tower:/tmp/seachest# SeaChest_Info_150_11923_64 -d /dev/sg1 -i
bash: SeaChest_Info_150_11923_64: command not found
root@Tower:/tmp/seachest# 

 

I'm assuming I missed something basic based on my level of familiarity, but any help would be appreciated.
 

Link to comment
On 2/23/2023 at 9:55 PM, pandbman_Tower said:

root@Tower:/tmp/seachest# SeaChest_Info_150_11923_64 -d /dev/sg1 -i
bash: SeaChest_Info_150_11923_64: command not found

 

Change into the /tmp/seachest directory and type

./SeaChest_Info_150_11923_64 -d /dev/sg1 -i

and it should execute.

./ tells bash "look in the directory I'm currently in for this".

 

If you don't explicitly state the path to the executable, bash assumes you mean a command it already knows about - basically scripts or binaries it finds via the PATH environment variable, or commands built into bash itself.
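For completeness, a couple of equivalent ways to run it:

# give the path explicitly (absolute or relative)...
/tmp/seachest/SeaChest_Info_150_11923_64 -d /dev/sg1 -i
./SeaChest_Info_150_11923_64 -d /dev/sg1 -i

# ...or add the directory to PATH for the current shell session
export PATH="$PATH:/tmp/seachest"
SeaChest_Info_150_11923_64 -d /dev/sg1 -i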

Link to comment
On 2/25/2023 at 3:24 AM, Legion47 said:

 

Change into the /tmp/seachest directory and type

./SeaChest_Info_150_11923_64 -d /dev/sg1 -i

and it should execute.

./ tells bash "look in the directory I'm currently in for this".

 

If you don't explicitly state the path to the executable, bash assumes you mean a command it already knows about - basically scripts or binaries it finds via the PATH environment variable, or commands built into bash itself.

That worked. I knew it'd be something basic I was screwing up - thanks for the help!  I disabled EPC just to prevent any future issues and ended up using the updated Ubuntu x86_64-linux version.  Thanks again.

  • Like 1
Link to comment
On 2/23/2023 at 2:50 PM, jamikest said:

Adding to the growing number of users for whom this fix does not resolve the issue.

I'm not sure whether this fix applies to your situation, but I hope your system is more stable now.

 

I've not had another issue since this thread was created, but there may be different issues with different LSI cards - not sure.

Link to comment
6 hours ago, JorgeB said:

Not AFAIK, only SAS2 models based on the SAS2008 and SAS2308 chips.

 

I am using a flashed Fujitsu D2607-A21 and I guess it's behaving like a SAS2008 (please correct me if I'm wrong). Anyway, I also had issues with Seagate drives and was made aware of this thread (my original submission here).

 

Long story short: I originally considered the issue to be insufficient cooling of the D2607 SAS card. After channelling airflow and reducing the load by plugging more disks directly into the mainboard rather than the SAS card, disk failures subjectively seem to have become fewer. In any case, I also followed the instructions to disable EPC. It was enabled on four drives, three of which had failed before while rebuilding the failed drives. So maybe that was the actual fix to my issue.

 

I also noted that for two of my EPC-disabled drives there is new firmware available from Seagate (link - use e.g. S.N. ZA217TFG). So it may also be worth trying a firmware update instead of, or in addition to, disabling EPC.
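If you want to see which firmware a drive is currently running before hunting for an update, smartctl reports it - the device name is just an example:

# the firmware revision appears in the identify/information section
smartctl -i /dev/sdX | grep -i firmware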

Link to comment
  • 3 weeks later...
On 1/3/2023 at 7:21 AM, ReLe said:

Just an update: after moving from the LSI 2008 to the LSI 9207 on 3 Nov 2022, there have been no errors from the switch to this date. But who knows what it really was in my case, and when the IronWolf disks will start to act up again.



I'm already running an LSI SAS2308 card and my IronWolf 8 TB drives were dropping, so I'm not sure going to a newer LSI chipset would make a difference. It's definitely related to wake-up, however.

Link to comment
