alexricher Posted March 31, 2019
Good day Unraid Community! I've been using Unraid for a long time, but I don't consider myself an expert by any means. Over 9 years I built myself a nice 20-HDD server (65TB + 2 parity + 1 cache) that has been rocking ever since. Over those years, I grew the server with additional hard drives by adding 5x3 cages of various brands and using a 2x8 ports SuperMicro SATA card with SAS. I also have 10 onboard SATA ports.
I've been having on-and-off issues with my hardware and would usually replace the failing hard drives. Now I wonder whether my issue is truly related to the hard drives or to something deeper...
After doing a parity swap procedure (I had a failing hard drive and decided to upgrade one of my parity drives to 8TB), I noticed that even my new 8TB drive failed twice in a month with errors for unknown reasons. It's important to know that Unraid reports SMART errors on disks 9 & 14 (with real errors during long tests), and disks 15 & 17 are starting to give me trouble as well, although they pass the SMART tests. Every time the parity fails, I rebuild it and it's fine for a few days. Then 1 or 2 hard drives give me errors (often disks 9 & 14), and the parity soon follows and gets disabled.
This is starting to cost time and considerable money, and I cannot afford to keep changing hard drives, so I'm digging for the root cause with the help of the experts around here. (*Btw, I will replace the failing hard drives as well, but I just don't have the money at the moment.)
I'm basically looking for a second opinion on why my brand-new parity drive would fail even though the preclear didn't report any errors. All of the hard drives are in cages connected with SAS cables (or by SATA directly to the board). I have yet to rebuild my parity; I want to troubleshoot the problem properly first so I don't need to do this again in a few days. I've attached the diagnostics for more details.
You won't be able to see the SMART report of the failing parity disk, as it somehow didn't work. I also noticed errors like this:
Mar 31 04:30:18 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Mar 31 04:30:18 Tower kernel: sas: trying to find task 0x00000000f2be872f
Mar 31 04:30:18 Tower kernel: sas: sas_scsi_find_task: aborting task 0x00000000f2be872f
Mar 31 04:30:18 Tower kernel: sas: sas_scsi_find_task: task 0x00000000f2be872f is aborted
Mar 31 04:30:18 Tower kernel: sas: sas_eh_handle_sas_errors: task 0x00000000f2be872f is aborted
Mar 31 04:30:18 Tower kernel: sas: ata18: end_device-2:3: cmd error handler
Mar 31 04:30:18 Tower kernel: sas: ata15: end_device-2:0: dev error handler
Mar 31 04:30:18 Tower kernel: sas: ata16: end_device-2:1: dev error handler
Mar 31 04:30:18 Tower kernel: sas: ata17: end_device-2:2: dev error handler
Mar 31 04:30:18 Tower kernel: sas: ata18: end_device-2:3: dev error handler
Mar 31 04:30:18 Tower kernel: sas: ata19: end_device-2:4: dev error handler
Mar 31 04:30:18 Tower kernel: sas: ata20: end_device-2:5: dev error handler
Mar 31 04:30:18 Tower kernel: sas: ata22: end_device-2:7: dev error handler
Mar 31 04:30:18 Tower kernel: sas: ata21: end_device-2:6: dev error handler
Mar 31 04:30:19 Tower kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Mar 31 04:30:20 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[3]:rc= 0
Mar 31 04:30:20 Tower kernel: ata18.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
Mar 31 04:30:20 Tower kernel: ata18.00: revalidation failed (errno=-5)
Mar 31 04:30:31 Tower kernel: ata18.00: qc timeout (cmd 0xec)
Mar 31 04:30:31 Tower kernel: ata18.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Mar 31 04:30:31 Tower kernel: ata18.00: revalidation failed (errno=-5)
Mar 31 04:30:31 Tower kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Mar 31 04:30:33 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[3]:rc= 0
Mar 31 04:30:33 Tower kernel: ata18.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
Mar 31 04:30:33 Tower kernel: ata18.00: revalidation failed (errno=-5)
Mar 31 04:30:33 Tower kernel: ata18.00: disabled
Mar 31 04:30:33 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
This seems to have triggered the parity failure. I was sleeping so I cannot confirm exactly, but these are the first errors I noticed. Then I would get this:
Mar 31 14:29:09 Tower kernel: sd 2:0:3:0: [sdl] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Mar 31 14:29:09 Tower kernel: sd 2:0:3:0: [sdl] tag#0 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Mar 31 14:29:09 Tower kernel: sd 2:0:3:0: [sdl] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Mar 31 14:29:09 Tower kernel: sd 2:0:3:0: [sdl] tag#0 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00
Mar 31 14:29:09 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Those errors seem to have happened after my parity failed. Perhaps it's because it failed and got disconnected somehow? Could anyone help me out? Thanks and have a great day!
tower-diagnostics-20190331-1919.zip
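For anyone wanting to check whether errors like these cluster on the ports behind one controller, tallying the libata error lines per ataN port can help. Below is a minimal sketch in Python, assuming the syslog uses the same line format as the excerpt in this post. The function name `count_ata_errors` and the keyword list (failed, timeout, disabled) are illustrative assumptions, not anything provided by Unraid.

```python
import re
from collections import Counter

# Matches libata-style kernel messages such as
# "kernel: ata18.00: failed to IDENTIFY ...". The keyword list is an
# assumption based on the excerpt in this thread; extend it as needed.
ATA_ERROR = re.compile(
    r"kernel: (ata\d+)(?:\.\d+)?: .*(failed|timeout|disabled)"
)


def count_ata_errors(lines):
    """Return a Counter mapping ataN port names to matching error lines."""
    counts = Counter()
    for line in lines:
        match = ATA_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the log excerpt in this post.
sample = [
    "Mar 31 04:30:18 Tower kernel: sas: ata15: end_device-2:0: dev error handler",
    "Mar 31 04:30:20 Tower kernel: ata18.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)",
    "Mar 31 04:30:20 Tower kernel: ata18.00: revalidation failed (errno=-5)",
    "Mar 31 04:30:31 Tower kernel: ata18.00: qc timeout (cmd 0xec)",
    "Mar 31 04:30:33 Tower kernel: ata18.00: disabled",
]

for port, hits in count_ata_errors(sample).most_common():
    print(port, hits)
```

To run it over a full log, pass an open file instead of `sample`; the path depends on where the syslog from the diagnostics zip was extracted. The "dev error handler" lines are deliberately not counted, because in the excerpt they are logged for every port on the controller during recovery, not only for the one that failed.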
Frank1940 Posted March 31, 2019 2 hours ago, alexricher said: "2x8 ports SuperMicro SATA card with SAS" Give us a better description of this card (i.e., model number). As I recall, there have been some issues with SuperMicro cards and Unraid...
alexricher Posted March 31, 2019 Author (edited) Thanks for your help! These 2 cards are: SUPERMICRO AOC-SAS2LP-MV8 PCI-Express 2.0 x8 SATA / SAS 8-Port Controller Card. Edited March 31, 2019 by alexricher Typo
Frank1940 Posted April 1, 2019 A Google search for "supermicro aoc-sas2lp-mv8 unraid" brought up this thread as one of the results: https://forums.unraid.net/topic/70252-supermicro-aoc-sas2lp-mv8-8-port-sassata-compatible-with-unraid-6/ and this: https://forums.unraid.net/topic/53108-lsi-logic-sas9211-8i-vs-supermicro-aoc-sas2lp-mv8/ and this one from Reddit: https://www.reddit.com/r/unRAID/comments/7yerms/supermicro_aocsaslpmv8/ As you can surmise, these cards are no longer recommended. They seem to work fine for some folks, but others are having (or have had) problems with them.
alexricher Posted April 1, 2019 Author (edited) Thanks for your reply! I had actually done some research before buying those cards and heard good things about them back then. I remember buying them because they were the recommended SATA cards for Unraid (6-7+ years ago) and a lot of people had them. I guess this changed over time... Can this behavior affect only a few ports? Reading over the pages you posted, it seems to be hit and miss. It's sad because I considered those cards an investment ($350+), but I'll search for new ones. While writing this post, I've ordered 2x Dell Perc H310 in the hope that this will fix my issue. Until I receive those cards, is there anything I can do to minimize the risk of failure while I continue using the current ones? (*My wife would kill me if I shut down her entertainment center :P) Thanks! Edited April 1, 2019 by alexricher
Frank1940 Posted April 1, 2019 (edited) I don't recall any data loss per se that has been the result of these controllers, just a lot of headaches and hassle. But I (personally) would shut down any write operations until they are replaced. PS: Don't run a parity check... Edited April 1, 2019 by Frank1940
itimpi Posted April 1, 2019 The Supermicro cards worked well with the 32-bit versions of Unraid, but they have been problematic for many users ever since Unraid went 64-bit. The suspicion is that something is lurking in the 64-bit Linux drivers, but nobody has any idea what, so it has remained an ongoing issue.
alexricher Posted April 1, 2019 Author Thanks everyone for the details! That might explain why Unraid started giving me headaches at some point. I like being bleeding edge, so I must have upgraded Unraid from 32 to 64 bits without much thought and started having issues from then on, but never traced them to my SATA cards. Oh well... Live and learn! Anyway, the new cards are coming from China, so I'll have a month to go with the current cards. I've been able to handle the issues this far, so I'll try to minimize writes. Because my parity #1 is currently disabled, should I wait for the new SATA cards before rebuilding this parity? My instinct would be to start the rebuild again. Thanks everyone and have a great day!