ajfanch Posted August 13, 2020 (edited)

Every other month or so, two of my system drives get disabled, the rebuild does not go smoothly, and I usually end up having to format and re-add the drives. Each drive passes SMART tests, but at some point it still ends up disabled or loaded with errors, and the error counts are almost uniform across the array.

Originally I was using an unsupported SATA controller, which caused some of my issues. I replaced it with an HPE H240 and most of the issues went away. A few months ago two drives did the same thing, and I noticed I was using onboard SATA for them rather than the H240, so I switched them over, and after a reboot everything was fine. Now all my array drives are on the HPE card, and it is acting the same way. It would stand to reason that the H240 is the main issue, but I am not certain, and my guesses so far have not been successful.

(This is my first post; I assume I can attach my diagnostics once it is live.) Thank you all in advance.

fancherdata-diagnostics-20200813-1212.zip

Edited August 13, 2020 by ajfanch: Attach Diagnostics
JorgeB Posted August 13, 2020

36 minutes ago, ajfanch said: HPE H240

This is not really a recommended HBA. That doesn't mean it won't work correctly, just that we usually recommend LSI, since they are used by many users and the driver is usually solid. I'm not sure the HBA is the problem, though:

Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!
Aug 13 10:44:36 FancherData kernel: sd 7:0:6:0: [sdg] tag#851 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00

First it was disk6, then all the other disks:

Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!
Aug 13 10:44:36 FancherData kernel: sd 7:0:1:0: [sdb] tag#856 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 13 10:44:36 FancherData kernel: sd 7:0:1:0: [sdb] tag#856 CDB: opcode=0x88 88 00 00 00 00 03 6f da b4 c8 00 00 00 08 00 00
Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sdb, sector 14761505992
Aug 13 10:44:36 FancherData kernel: md: disk2 read error, sector=14761505928
Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!
Aug 13 10:44:36 FancherData kernel: sd 7:0:3:0: [sdd] tag#861 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 13 10:44:36 FancherData kernel: sd 7:0:3:0: [sdd] tag#861 CDB: opcode=0x88 88 00 00 00 00 03 6f da b4 c8 00 00 00 08 00 00
Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sdd, sector 14761505992
Aug 13 10:44:36 FancherData kernel: md: disk1 read error, sector=14761505928
Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!
Aug 13 10:44:36 FancherData kernel: sd 7:0:4:0: [sde] tag#867 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 13 10:44:36 FancherData kernel: sd 7:0:4:0: [sde] tag#867 CDB: opcode=0x88 88 00 00 00 00 03 6f da b4 c8 00 00 00 08 00 00
Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sde, sector 14761505992
Aug 13 10:44:36 FancherData kernel: md: disk3 read error, sector=14761505928
Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!
Aug 13 10:44:36 FancherData kernel: sd 7:0:5:0: [sdf] tag#868 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 13 10:44:36 FancherData kernel: sd 7:0:5:0: [sdf] tag#868 CDB: opcode=0x88 88 00 00 00 00 03 6f da b4 c8 00 00 00 08 00 00
Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sdf, sector 14761505992
Aug 13 10:44:36 FancherData kernel: md: disk0 read error, sector=14761505928
Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!
Aug 13 10:44:36 FancherData kernel: sd 7:0:7:0: [sdi] tag#869 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 13 10:44:36 FancherData kernel: sd 7:0:7:0: [sdi] tag#869 CDB: opcode=0x88 88 00 00 00 00 03 6f da b4 c8 00 00 00 08 00 00
Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sdi, sector 14761505992
Aug 13 10:44:36 FancherData kernel: md: disk29 read error, sector=14761505928
Aug 13 10:44:36 FancherData kernel: hpsa 0000:01:00.0: handle_ioaccel_mode2_error: device is gone!
Aug 13 10:44:36 FancherData kernel: sd 7:0:8:0: [sdj] tag#870 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Aug 13 10:44:36 FancherData kernel: sd 7:0:8:0: [sdj] tag#870 CDB: opcode=0x88 88 00 00 00 00 03 6f da b4 c8 00 00 00 08 00 00
Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sdj, sector 14761505992
Aug 13 10:44:36 FancherData kernel: md: disk5 read error, sector=14761505928
Aug 13 10:44:36 FancherData kernel: md: recovery thread: multiple disk errors, sector=14761505928
Aug 13 10:44:36 FancherData kernel: sd 7:0:7:0: [sdi] tag#857 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00

Note that when this happens Unraid disables as many devices as there are parity disks; which devices get disabled is luck of the draw. All disks dropping at the same time could be a connection problem, a power problem, or the HBA, so you basically need to start testing one thing at a time.
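For anyone wanting to confirm the "everything dropped in the same second" pattern in their own syslog, here is a minimal Python sketch. It assumes the `print_req_error: I/O error, dev ..., sector ...` line format shown in the excerpt above; other kernel versions may word that line differently, and the function name `io_errors_by_second` is just a hypothetical helper, not part of Unraid or any tool.

```python
import re
from collections import defaultdict

# Matches lines such as (format taken from the excerpt in this thread):
# Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sdb, sector 14761505992
IO_ERROR = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"print_req_error: I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)"
)


def io_errors_by_second(log_text):
    """Map each syslog timestamp to the sorted list of devices that logged an I/O error."""
    grouped = defaultdict(set)
    for line in log_text.splitlines():
        match = IO_ERROR.match(line.strip())
        if match:
            grouped[match["ts"]].add(match["dev"])
    return {ts: sorted(devs) for ts, devs in grouped.items()}


if __name__ == "__main__":
    # Three lines copied from the log above; only the first two are I/O errors.
    sample = (
        "Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sdb, sector 14761505992\n"
        "Aug 13 10:44:36 FancherData kernel: print_req_error: I/O error, dev sdd, sector 14761505992\n"
        "Aug 13 10:44:36 FancherData kernel: md: disk1 read error, sector=14761505928\n"
    )
    for ts, devs in io_errors_by_second(sample).items():
        print(ts, len(devs), "device(s):", ", ".join(devs))
```

Many devices sharing one timestamp points at something they have in common (cabling, power, or the HBA) rather than at the individual disks, which is the reasoning in the post above.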
ajfanch (Author) Posted August 13, 2020

What ends up happening is that the disks all have a similar error count. They are not properly "Disabled", but the array is no longer stable and won't provide access to any data; a reboot solves the error count issue and restores data access.

If you're saying the HPE H240 is not a recommended card and an LSI card is the more common option, I'll order one of those after a bit of research. I am still getting going with understanding Linux, so I would rather stick with the recommended parts for now.

Thank you for your assistance.
JorgeB Posted August 13, 2020

17 minutes ago, ajfanch said: What ends up happening is that the discs all have a similar error count, they are not properly "Disabled" but the array is no longer stable and won't provide access to any data, a reboot solves the error count issue and data access.

Yes, all disks dropped, and Unraid only disabled the two (with dual parity), but you need to reboot to regain access to the other ones.
Daniel Ayers Posted January 22, 2022

I had exactly this problem on Linux with an H240, which I purchased from eBay: https://www.ebay.com/itm/HP-H240-SAS-3-12Gbps-HBA-Host-Bus-Adapter-779134-001-761873-B21-726907-B21-ZFS-/163127028279

The cause of this problem is probably the same as the cause of mine: overheating. The H240 has a big heatsink (clue!) and an airflow guide to capture as much as possible of the high-volume airflow from high-RPM server fans (another clue!). If the card is put in another type of case, it overheats and shuts down to protect itself. I solved my issue by putting a 50mm fan on top of the heatsink (carefully superglued to the plastic shroud, which clips on to the card). The eBay seller, theartofserver, was extremely helpful with after-sales support.

You can test whether this is happening to you by installing the HP smart array tools and running:

# ssacli controller slot=0 show

and checking the controller temperature reading. When overheating, mine was showing >80 C; with the additional fan it's 45 C.

Also, remember you need to have enough power to run both the controller (the heatsink implies it is power hungry) and your drives.

D.
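If you would rather script the temperature check than read the output by eye, here is a rough Python sketch. The `ssacli controller slot=0 show` command is the one quoted in the post above; the `... Temperature (C): NN` line shape and the helper names (`parse_temperatures`, `hot_readings`, `read_controller_report`) are assumptions for illustration, so compare against what your controller actually prints before relying on it.

```python
import re
import shutil
import subprocess

# Assumed line shape, e.g. "Controller Temperature (C): 45"; verify against real output.
TEMP_LINE = re.compile(r"^\s*(?P<label>.*Temperature \(C\)):\s*(?P<value>\d+)\s*$")


def parse_temperatures(report):
    """Return {label: degrees C} for every temperature line found in the report text."""
    temps = {}
    for line in report.splitlines():
        match = TEMP_LINE.match(line)
        if match:
            temps[match["label"].strip()] = int(match["value"])
    return temps


def hot_readings(temps, limit_c=80):
    """Keep only readings above limit_c (the post above reports trouble at >80 C)."""
    return {label: value for label, value in temps.items() if value > limit_c}


def read_controller_report(slot=0):
    """Run the command quoted in the post and return its text output."""
    result = subprocess.run(
        ["ssacli", "controller", f"slot={slot}", "show"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    if shutil.which("ssacli") is None:
        print("ssacli not found; install the HP smart array tools first")
    else:
        temps = parse_temperatures(read_controller_report())
        print("Readings:", temps)
        print("Above limit:", hot_readings(temps))
```

The parsing is kept separate from the `subprocess` call so it can be tried on saved output from another machine; the 80 C default is only the figure from this thread, not a documented limit for the card.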
Jackal Posted April 10, 2022

On 1/22/2022 at 2:01 AM, Daniel Ayers said: You can test whether this is happening to you by installing the HP smart array tools and running:

Hi @Daniel Ayers,

Did you somehow install the "ssacli" tools in Unraid? Could you please share how you did that?

Thanks in advance!