Failing hard-drives, not sure why...



Good day Unraid Community!

 

I've been using Unraid for a long time, but I don't consider myself an expert by any means. Over 9 years I built myself a nice 20-HDD server (65TB + 2 Parity + 1 Cache) that has been rocking ever since. Over those years, I grew the server by adding 5x3 cages of various brands and two 8-port SuperMicro SATA/SAS cards, in addition to the 10 onboard SATA ports. I've been having on-and-off issues with my hardware and would usually replace the failing hard drives. Now I wonder whether my issue is truly related to the hard drives or to something deeper...

 

After doing a parity swap procedure (I had a failing hard drive and decided to upgrade one of my parity drives to an 8TB), I noticed that even the new 8TB drive failed twice in a month with errors for unknown reasons. It's important to know that Unraid also reports SMART errors for disks 9 & 14 (with real errors during long tests), and disks 15 & 17 are starting to give me trouble as well, although they pass the SMART tests. Every time the parity fails, I rebuild it and it's fine for a few days. Then 1 or 2 hard drives give me errors (often disks 9 & 14), and the parity soon follows and gets disabled. Since this is starting to cost time and considerable money, and I cannot afford to keep changing hard drives, I'm digging for the cause with the help of some experts around here. :) (*Btw, I will replace the failing hard drives as well, but I just don't have the money at the moment.)

 

I'm basically looking for a second opinion on why my brand-new parity drive would fail even though the preclear didn't come up with any errors. All of the hard drives are in cages connected with SAS cables (or by SATA directly to the board). I have yet to rebuild my parity, but I want to make sure I troubleshoot the problem properly so I don't need to do this again in a few days.

 

I've attached the diagnostics for more details. You won't be able to see the SMART report of the failing parity disk, as it somehow didn't work. I also noticed errors like this:

 

Mar 31 04:30:18 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Mar 31 04:30:18 Tower kernel: sas: trying to find task 0x00000000f2be872f
Mar 31 04:30:18 Tower kernel: sas: sas_scsi_find_task: aborting task 0x00000000f2be872f
Mar 31 04:30:18 Tower kernel: sas: sas_scsi_find_task: task 0x00000000f2be872f is aborted
Mar 31 04:30:18 Tower kernel: sas: sas_eh_handle_sas_errors: task 0x00000000f2be872f is aborted
Mar 31 04:30:18 Tower kernel: sas: ata18: end_device-2:3: cmd error handler
Mar 31 04:30:18 Tower kernel: sas: ata15: end_device-2:0: dev error handler
Mar 31 04:30:18 Tower kernel: sas: ata16: end_device-2:1: dev error handler
Mar 31 04:30:18 Tower kernel: sas: ata17: end_device-2:2: dev error handler
Mar 31 04:30:18 Tower kernel: sas: ata18: end_device-2:3: dev error handler
Mar 31 04:30:18 Tower kernel: sas: ata19: end_device-2:4: dev error handler
Mar 31 04:30:18 Tower kernel: sas: ata20: end_device-2:5: dev error handler
Mar 31 04:30:18 Tower kernel: sas: ata22: end_device-2:7: dev error handler
Mar 31 04:30:18 Tower kernel: sas: ata21: end_device-2:6: dev error handler
Mar 31 04:30:19 Tower kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Mar 31 04:30:20 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[3]:rc= 0
Mar 31 04:30:20 Tower kernel: ata18.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
Mar 31 04:30:20 Tower kernel: ata18.00: revalidation failed (errno=-5)
Mar 31 04:30:31 Tower kernel: ata18.00: qc timeout (cmd 0xec)
Mar 31 04:30:31 Tower kernel: ata18.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Mar 31 04:30:31 Tower kernel: ata18.00: revalidation failed (errno=-5)
Mar 31 04:30:31 Tower kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Mar 31 04:30:33 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[3]:rc= 0
Mar 31 04:30:33 Tower kernel: ata18.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
Mar 31 04:30:33 Tower kernel: ata18.00: revalidation failed (errno=-5)
Mar 31 04:30:33 Tower kernel: ata18.00: disabled
Mar 31 04:30:33 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

This seems to have triggered the parity failure. I was sleeping, so I cannot confirm exactly, but these are the first errors I noticed.
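
To see whether the errors cluster on particular ports, a rough sketch like the one below can tally kernel messages per `ataNN` port and note which ports the kernel disabled. The regular expression and the `summarize` helper are my own assumptions, written against the excerpt above rather than any official log format, and the sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Matches kernel lines that name an ATA port, e.g.
#   "... kernel: sas: ata18: end_device-2:3: cmd error handler"
#   "... kernel: ata18.00: disabled"
# The pattern is an assumption based on the log excerpt in this post.
ATA_LINE = re.compile(r"kernel: (?:sas: )?(ata\d+)(?:\.\d+)?: (.*)$")


def summarize(lines):
    """Return (message count per port, set of ports reported as disabled)."""
    events = Counter()
    disabled = set()
    for line in lines:
        match = ATA_LINE.search(line.rstrip("\n"))
        if match is None:
            continue  # line does not name an ATA port
        port, message = match.groups()
        events[port] += 1
        if message.strip() == "disabled":
            disabled.add(port)
    return events, disabled


# Sample lines taken from the excerpt above.
SAMPLE = [
    "Mar 31 04:30:18 Tower kernel: sas: ata18: end_device-2:3: cmd error handler",
    "Mar 31 04:30:18 Tower kernel: sas: ata15: end_device-2:0: dev error handler",
    "Mar 31 04:30:19 Tower kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!",
    "Mar 31 04:30:31 Tower kernel: ata18.00: qc timeout (cmd 0xec)",
    "Mar 31 04:30:33 Tower kernel: ata18.00: disabled",
]

events, disabled = summarize(SAMPLE)
for port, count in events.most_common():
    flag = " (disabled)" if port in disabled else ""
    print(f"{port}: {count} message(s){flag}")
```

Passing the lines of a full syslog (for instance the one inside a diagnostics archive) instead of `SAMPLE` would show whether the same ports keep coming up across several failures.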

 

Then, I would get this:

Mar 31 14:29:09 Tower kernel: sd 2:0:3:0: [sdl] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Mar 31 14:29:09 Tower kernel: sd 2:0:3:0: [sdl] tag#0 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Mar 31 14:29:09 Tower kernel: sd 2:0:3:0: [sdl] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Mar 31 14:29:09 Tower kernel: sd 2:0:3:0: [sdl] tag#0 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00
Mar 31 14:29:09 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

 

Those errors seem to have happened after my parity failed. Perhaps it's because the drive failed and got disconnected somehow?
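
For what it's worth, here is a minimal sketch that decodes those two lines, assuming the usual Linux SCSI host byte values and the standard ATA PASS-THROUGH (16) CDB layout (ATA command in byte 14). The lookup tables are deliberately partial and the `decode_ata16_cdb` helper is a name I made up for illustration. If those assumptions hold, both CDBs are CHECK POWER MODE queries (the kind of command smartctl can send before polling a disk) that came back with `DID_BAD_TARGET`, which would fit a device that was no longer reachable.

```python
# Partial table of Linux SCSI host byte codes (assumed from the kernel's DID_* values).
HOST_BYTES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x07: "DID_ERROR",
}

# Partial table of ATA command codes that show up in this thread's logs.
ATA_COMMANDS = {
    0xE5: "CHECK POWER MODE",
    0x98: "CHECK POWER MODE (legacy opcode)",
    0xEC: "IDENTIFY DEVICE",
    0xB0: "SMART",
}

# Partial table of ATA pass-through protocol values (bits 4..1 of CDB byte 1).
PROTOCOLS = {3: "Non-data", 4: "PIO Data-In", 5: "PIO Data-Out", 6: "DMA"}


def decode_ata16_cdb(hex_bytes):
    """Decode a 16-byte ATA PASS-THROUGH (16) CDB given as hex text."""
    cdb = bytes.fromhex(hex_bytes)  # spaces between bytes are accepted
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH (16) CDB")
    protocol = (cdb[1] >> 1) & 0x0F
    command = cdb[14]
    return {
        "protocol": PROTOCOLS.get(protocol, f"unknown ({protocol})"),
        "ata_command": ATA_COMMANDS.get(command, f"unknown (0x{command:02x})"),
    }


# The two CDBs from the log lines above, both returned with hostbyte=0x04.
for cdb_text in (
    "85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00",
    "85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00",
):
    print(HOST_BYTES[0x04], decode_ata16_cdb(cdb_text))
```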

 

Could anyone help me out?

 

Thanks and have a great day!

tower-diagnostics-20190331-1919.zip

Link to comment

A Google search for   supermicro aoc-sas2lp-mv8 unraid   brought up this thread as one of the results:

 

       https://forums.unraid.net/topic/70252-supermicro-aoc-sas2lp-mv8-8-port-sassata-compatible-with-unraid-6/

and this:

 

       https://forums.unraid.net/topic/53108-lsi-logic-sas9211-8i-vs-supermicro-aoc-sas2lp-mv8/

 

and this one from Reddit:

 

       https://www.reddit.com/r/unRAID/comments/7yerms/supermicro_aocsaslpmv8/

 

 

        

As you can surmise, these cards are no longer recommended.  They seem to work fine for some folks, but others are having (or have had) problems with them.

Link to comment

Thanks for your reply! I had actually done some research before buying those cards a few years ago, and I had heard good things about them back then. I remember buying them because they were the recommended SATA cards for Unraid at the time (6-7+ years ago) and a lot of people had them. I guess this changed over time... Can this behavior affect only a few ports?

 

Reading over the pages you posted, it seems to be hit or miss. It's sad because I considered those cards an investment ($350+), but I'll search for new ones. While writing this post, I ordered 2x Dell Perc H310 in the hope that this will fix my issue.

 

Until I receive those cards, is there anything I can do to minimize the risk of failure while I continue using the current ones? (*My wife would kill me if I shut down her entertainment center :P) Thanks!

Edited by alexricher
Link to comment

The Supermicro cards worked well with the 32-bit versions of Unraid, but they have been problematic for many users ever since Unraid went 64-bit.   The suspicion is that something is lurking in the 64-bit Linux drivers, but nobody has any idea what, so it has remained an ongoing issue.

Link to comment

Thanks everyone for the details! That might explain why Unraid started giving me headaches at some point. I like being on the bleeding edge, so I must have upgraded Unraid from 32 to 64 bits without much thought and started having issues then, but I never traced them back to my SATA cards. Oh well... Live and learn!

 

Anyway, the new cards are coming from China, so I'll have to live with the current cards for about a month. I've been able to handle the issues thus far, so I'll try to minimize writes. Because my parity #1 is currently disabled, should I wait for the new SATA cards before rebuilding it? My instinct would be to start the rebuild again.

 

Thanks everyone and have a great day! 

Link to comment
