Drive issues with UnRAID

ajfanch · August 13, 2020

Every other month or so, 2 of my system drives get disabled, the rebuild does not go smoothly, and I'll likely have to format and re-add the drives. Each drive passes SMART tests, but at some point still ends up disabled or loaded with errors, almost uniformly across the array.

Originally with my setup, I was using an unsupported SATA controller, so that was the cause of some of my issues. I replaced it with an HPE H240 and most issues were gone, but a few months ago 2 drives did the same thing. I noticed those were on onboard SATA, not the H240, so I switched them, and after a reboot it was all fine. Now I have all my array drives on the HPE, and it's acting the same way.

It would stand to reason the H240 is the main issue, but I am not certain, and my guesses so far have not been successful.

(This is my first post, I assume after it is live I can attach my diagnostics)

Thank you all in advance

fancherdata-diagnostics-20200813-1212.zip

Edited August 13, 2020 by ajfanch
Attach Diagnostics

JorgeB · August 13, 2020

36 minutes ago, ajfanch said:

HPE H240

This is not really a recommended HBA. That doesn't mean it won't work correctly, just that we usually recommend LSI since they are used by many users and the driver is usually solid. I'm not sure the HBA is the problem, though:

Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!
Aug 13 10:44:36 FancherData kernel: sd 7:0:6:0: [sdg] tag#851 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00

First it was disk6, then all the other disks:

Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!
Aug 13 10:44:36 FancherData kernel: sd 7:0:1:0: [sdb] tag#856 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 13 10:44:36 FancherData kernel: sd 7:0:1:0: [sdb] tag#856 CDB: opcode=0x88 88 00 00 00 00 03 6f da b4 c8 00 00 00 08 00 00
Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sdb, sector 14761505992
Aug 13 10:44:36 FancherData kernel: md: disk2 read error, sector=14761505928
Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!
Aug 13 10:44:36 FancherData kernel: sd 7:0:3:0: [sdd] tag#861 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 13 10:44:36 FancherData kernel: sd 7:0:3:0: [sdd] tag#861 CDB: opcode=0x88 88 00 00 00 00 03 6f da b4 c8 00 00 00 08 00 00
Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sdd, sector 14761505992
Aug 13 10:44:36 FancherData kernel: md: disk1 read error, sector=14761505928
Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!
Aug 13 10:44:36 FancherData kernel: sd 7:0:4:0: [sde] tag#867 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 13 10:44:36 FancherData kernel: sd 7:0:4:0: [sde] tag#867 CDB: opcode=0x88 88 00 00 00 00 03 6f da b4 c8 00 00 00 08 00 00
Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sde, sector 14761505992
Aug 13 10:44:36 FancherData kernel: md: disk3 read error, sector=14761505928
Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!
Aug 13 10:44:36 FancherData kernel: sd 7:0:5:0: [sdf] tag#868 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 13 10:44:36 FancherData kernel: sd 7:0:5:0: [sdf] tag#868 CDB: opcode=0x88 88 00 00 00 00 03 6f da b4 c8 00 00 00 08 00 00
Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sdf, sector 14761505992
Aug 13 10:44:36 FancherData kernel: md: disk0 read error, sector=14761505928
Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!
Aug 13 10:44:36 FancherData kernel: sd 7:0:7:0: [sdi] tag#869 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 13 10:44:36 FancherData kernel: sd 7:0:7:0: [sdi] tag#869 CDB: opcode=0x88 88 00 00 00 00 03 6f da b4 c8 00 00 00 08 00 00
Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sdi, sector 14761505992
Aug 13 10:44:36 FancherData kernel: md: disk29 read error, sector=14761505928
Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!
Aug 13 10:44:36 FancherData kernel: sd 7:0:8:0: [sdj] tag#870 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 13 10:44:36 FancherData kernel: sd 7:0:8:0: [sdj] tag#870 CDB: opcode=0x88 88 00 00 00 00 03 6f da b4 c8 00 00 00 08 00 00
Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sdj, sector 14761505992
Aug 13 10:44:36 FancherData kernel: md: disk5 read error, sector=14761505928
Aug 13 10:44:36 FancherData kernel: md: recovery thread: multiple disk errors, sector=14761505928
Aug 13 10:44:36 FancherData kernel: sd 7:0:7:0: [sdi] tag#857 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
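
As a cross-check on the log above, the CDB bytes can be decoded by hand. This is a minimal Python sketch, assuming the standard SCSI READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2-9, and a 4-byte transfer length in bytes 10-13); the function name decode_read16 is made up for illustration:

```python
def decode_read16(cdb_hex):
    """Return (lba, sector_count) for a SCSI READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between the hex bytes is ignored
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")     # logical block address
    count = int.from_bytes(cdb[10:14], "big")  # number of sectors requested
    return lba, count


# CDB copied from the kernel log lines above.
lba, count = decode_read16("88 00 00 00 00 03 6f da b4 c8 00 00 00 08 00 00")
print(lba, count)  # 14761505992 8
```

The decoded LBA matches the "sector 14761505992" reported by print_req_error, and every disk was asked for the same 8 sectors. The md sector (14761505928) is 64 lower, which would be consistent with the partition starting at sector 64, though that offset is an inference from the numbers and not something the log states.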

Note that when this happens Unraid disables as many devices as there are parity disks; which devices get disabled is the luck of the draw.

Now all disks dropping at the same time could be a connection problem, power problem or the HBA, but basically you need to start testing one thing at a time.
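
To check whether disks really drop at the same moment, the syslog can be summarized by timestamp. The sketch below is only an illustration: the regular expression is written against the exact line format quoted in this thread, hostbyte=0x01 is taken to mean the kernel could not reach the device, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches failed-command lines such as:
# Aug 13 10:44:36 FancherData kernel: sd 7:0:1:0: [sdb] tag#856 ... Result: hostbyte=0x01 driverbyte=0x00
FAILED_CMD = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s.*kernel: sd [\d:]+ "
    r"\[(?P<dev>sd[a-z]+)\].*Result: hostbyte=0x01"
)


def devices_failing_per_second(lines):
    """Map each syslog timestamp to the set of sdX devices with a failed command."""
    hits = defaultdict(set)
    for line in lines:
        match = FAILED_CMD.search(line)
        if match:
            hits[match.group("stamp")].add(match.group("dev"))
    return dict(hits)


# A few lines copied from the log excerpt above.
sample = [
    "Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!",
    "Aug 13 10:44:36 FancherData kernel: sd 7:0:6:0: [sdg] tag#851 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00",
    "Aug 13 10:44:36 FancherData kernel: sd 7:0:1:0: [sdb] tag#856 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00",
    "Aug 13 10:44:36 FancherData kernel: sd 7:0:3:0: [sdd] tag#861 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00",
]

for stamp, devs in devices_failing_per_second(sample).items():
    print(stamp, sorted(devs))
```

Several devices failing within the same second points at something they share (HBA, cable, backplane, power) more than at the individual drives.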

ajfanch · August 13, 2020

What ends up happening is that the disks all have a similar error count. They are not properly "Disabled", but the array is no longer stable and won't provide access to any data. A reboot solves the error count issue and restores data access.

If you're saying the HPE H240 is not a recommended card, and an LSI card is the more common option, I'll order one of those after a bit of research.

I am still getting to grips with Linux, so I would rather stick with the recommended parts for now.

Thank you for your assistance.

JorgeB · August 13, 2020

17 minutes ago, ajfanch said:

What ends up happening is that the disks all have a similar error count. They are not properly "Disabled", but the array is no longer stable and won't provide access to any data. A reboot solves the error count issue and restores data access.

Yes, all disks dropped, and Unraid only disabled two of them (because you have dual parity), but you need to reboot to regain access to the others.

Daniel Ayers · January 22, 2022

I had exactly this problem on Linux with an H240. I purchased from eBay: https://www.ebay.com/itm/HP-H240-SAS-3-12Gbps-HBA-Host-Bus-Adapter-779134-001-761873-B21-726907-B21-ZFS-/163127028279

The cause of this problem is probably the same as the cause of mine - overheating. The H240 has a big heatsink (clue!) and an airflow guide to capture as much as possible of the high-volume airflow from high-RPM server fans (another clue!). If the card is put in another type of case, it overheats and shuts down to protect itself.

I solved my issue by putting a 50mm fan on top of the heatsink (carefully superglued to the plastic shroud, which clips on to the card). The eBay seller, theartofserver, was extremely helpful with after-sales support.

You can test whether this is happening to you by installing the HP smart array tools and running:

# ssacli controller slot=0 show

... and checking the controller temperature reading. When overheating mine was showing >80 C - with the additional fan it's 45 C.
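
If you want to log this over time, here is a rough Python sketch that pulls temperature readings out of the command's text output. The exact output format is an assumption (it looks for any line of the form "<label containing Temperature>: <number>"), the sample line is made up, and the 80 C cut-off is just the figure from my own card, not a vendor limit:

```python
import re
import subprocess


def find_temperatures(text):
    """Return (label, degrees) for each output line that mentions a temperature."""
    readings = []
    for line in text.splitlines():
        match = re.search(r"^\s*(.*temperature.*?):\s*(\d+)", line, re.IGNORECASE)
        if match:
            readings.append((match.group(1).strip(), int(match.group(2))))
    return readings


def read_controller_output(slot=0):
    """Run the command quoted above and return its text output (needs ssacli and root)."""
    result = subprocess.run(
        ["ssacli", "controller", f"slot={slot}", "show"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical sample output, for illustration only.
sample = "   Controller Temperature (C): 82\n   Cache Status: OK\n"

for label, degrees in find_temperatures(sample):
    note = "possibly overheating" if degrees >= 80 else "ok"
    print(f"{label}: {degrees} ({note})")
```

On a real system you would pass read_controller_output() to find_temperatures() instead of the sample string.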

Also, remember you need to have enough power to run the controller (heatsink implies power hungry) and your drives.

D.

Jackal · April 10, 2022

On 1/22/2022 at 2:01 AM, Daniel Ayers said:

You can test whether this is happening to you by installing the HP smart array tools and running:

Hi @Daniel Ayers,

Did you somehow install the "ssacli" tools in Unraid? Could you please share how you did that? Thanks in advance!
