July 26, 2017 I've gotten periodic, but regular, red X's on my old server every few months for years. Generally I just reseat the SATA cable, out of habit, and rebuild the drive. This has never failed to correct the problem. It's never the same drive, but it is always the same controller (SuperMicro 8-port). It may also be the same cable, or at least the same breakout (four-port) cable, but I've upgraded disks so much in the last year or so that I frankly don't know for sure; I've lost track. I had another red X tonight on the new server. The only things it has in common with the old server are the controller (SuperMicro) and the cables. I have a pair of new four-port breakout cables on order to replace the existing ones. I'm also considering buying another LSI controller to replace the SuperMicro (my existing LSI has never had a failed drive). Am I on the right track? I may just use the new cables for a few months and see what happens. If it happens with the new cables, I'll replace the controller. Is my plan sound, or are there other things I should be looking at? (No drive has ever failed a SMART test after red X'ing in unRAID.) Thanks.
July 26, 2017 The SuperMicro controllers use Marvell chips, which have known issues with virtualization features enabled. I would recommend caution in continuing to use them, as they can corrupt your data. I preemptively replaced mine. The LSI SAS9201-8i is a good and economical option (on eBay). You should look at replacing your breakout cable if it is consistently causing drives to drop. Look at drive cages like the SuperMicro CSE-M35T-1B. Use of these will virtually eliminate drives dropping offline.
July 26, 2017 Community Expert I know it's a little drastic, but if it were up to me I'd remove the mvsas driver from the next stable unRAID release. Both the SASLP and the SAS2LP are a source of frequent issues, including serious data corruption/loss.
July 26, 2017 2 hours ago, johnnie.black said: I know it's a little drastic, but if it were up to me I'd remove the mvsas driver from the next stable unRAID release. Both the SASLP and the SAS2LP are a source of frequent issues, including serious data corruption/loss. I'm glad it's not up to you. I'm one of those users with zero problems using those controllers.
July 26, 2017 Community Expert 19 minutes ago, Squid said: I'm glad it's not up to you. I'm one of those users with zero problems using those controllers. Yeah, that would be too drastic, and I'm sure there would be better options. I'm also still using two SAS2LPs on a backup server without issues, but any change of hardware or software can trigger a problem. I already retired all my SASLPs after one ejected a disk once, and I'm just waiting for the first sign of trouble with the SAS2LPs. IMO they are a ticking time bomb.
July 26, 2017 With a < $50 fix, it's hard to understand why people are still using the Marvell controllers. The problem is complicated because there is no test to see if a machine is immune, and it is very difficult to know whether a user's data is being subtly corrupted when these events occur. I'd like to see a GUI warning if the controller is present (at least if VT-d remains enabled).
July 27, 2017 Community Expert 6 minutes ago, bjp999 said: Sorry, what is FCP? Fix Common Problems plugin
July 27, 2017 On 7/26/2017 at 9:46 AM, Squid said: Is that a subtle hint for FCP? YES! I think it would be good to have 2 levels: 1 - if they have the controller at all (warning level), and 2 - if they have the controller and VT-d enabled. That is a definite no-no.
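The two levels proposed above could be sketched roughly as follows. This is only an illustration in Python and not the Fix Common Problems plugin's actual code: the function name `marvell_warning_level`, the level names, the list of storage device classes, and the idea of matching on the text of `lspci` output are all assumptions made for the example.

```python
# Hypothetical helper, not part of FCP or unRAID.
# Device-class strings we treat as "disk controller" (an assumed, non-exhaustive list).
STORAGE_CLASSES = (
    "sata controller",
    "raid bus controller",
    "serial attached scsi controller",
    "scsi storage controller",
)


def marvell_warning_level(lspci_output: str, iommu_enabled: bool) -> str:
    """Return "none", "warning", or "error" for the given lspci text.

    "warning": a Marvell-based storage controller is present.
    "error":   it is present and VT-d / IOMMU is enabled.
    """
    for line in lspci_output.splitlines():
        lower = line.lower()
        is_storage = any(cls in lower for cls in STORAGE_CLASSES)
        # Require a storage class so Marvell network chips are not flagged.
        if "marvell" in lower and is_storage:
            return "error" if iommu_enabled else "warning"
    return "none"


if __name__ == "__main__":
    # Illustrative line only; real lspci output on a given server may differ.
    sample = ("01:00.0 RAID bus controller: Marvell Technology Group Ltd. "
              "88SE9485 SAS/SATA 6Gb/s controller (rev 03)")
    print(marvell_warning_level(sample, iommu_enabled=True))
```

How the `iommu_enabled` flag gets determined is deliberately left out of the sketch; the caller would have to supply it from whatever check it trusts.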
July 29, 2017 On 7/27/2017 at 11:20 AM, bjp999 said: YES! I think it would be good to have 2 levels: 1 - if they have the controller at all (warning level), and 2 - if they have the controller and VT-d enabled. That is a definite no-no. OK. Prepare for the inrush of posts about it. Quote It appears that your server has a Marvell based hard drive controller installed within it. Some users with Marvell based controllers exhibit random drives dropping offline, recurring parity errors during checks, etc. This tends to be exacerbated if VT-d / IOMMU is enabled in the BIOS. Generally, LSI based controllers would be preferred over Marvell based controllers because of these issues. Note that these issues are out of Limetech's hands. Depending upon the exact combination of hardware present in your server, you may not have any problems whatsoever. If you have no problems, then this warning can be safely ignored, but future versions of unRAID (and later kernel versions) may (or may not) present you with the previously mentioned issues. Edited July 29, 2017 by Squid
July 29, 2017 @Squid - Looks good. A couple of minor edits for your consideration. Instead of saying "out of Limetech's hands", you might say it is a defect in the Marvell chip and not something that LimeTech can remedy. I would say "if you are not seeing these problems, and your parity checks are coming back clean month after month, you may be able to ignore this warning, but future ..". The word "safely" seems a bit too comforting, because if the user starts experimenting he may trigger the issue and corrupt his data. You might list the controllers (e.g., an LSI SAS2008 based controller, 9201-8i, 9211-8i, IBM M1015, Dell H310, etc.) because not all LSI controllers will work.
July 29, 2017 8 hours ago, bjp999 said: @Squid - Looks good. A couple of minor edits for your consideration. Instead of saying "out of Limetech's hands", you might say it is a defect in the Marvell chip and not something that LimeTech can remedy. I would say "if you are not seeing these problems, and your parity checks are coming back clean month after month, you may be able to ignore this warning, but future ..". The word "safely" seems a bit too comforting, because if the user starts experimenting he may trigger the issue and corrupt his data. You might list the controllers (e.g., an LSI SAS2008 based controller, 9201-8i, 9211-8i, IBM M1015, Dell H310, etc.) because not all LSI controllers will work. I like the "safely" because the majority of users do NOT have any issues whatsoever. I had a thought last night that this *might* be related to the old problem of slow access to/from a SAS2 (when certain drives were connected to the HBA), which LT fixed via the nr_requests tunable defaulting to 128. (IIRC, before the fix nr_requests was set to 32, and the 5 continual parity check errors, etc. pretty much started at the same time as the fix came into being.)
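For anyone wanting to see what nr_requests is currently set to on their own drives, here is a minimal sketch in Python. It assumes the usual Linux sysfs layout (`/sys/block/<device>/queue/nr_requests`) and only reads values; the function name `read_nr_requests` and the `sd*` device filter are choices made for this example, not anything taken from unRAID itself.

```python
from pathlib import Path


def read_nr_requests(sys_block: str = "/sys/block") -> dict:
    """Map each sd* block device to its current nr_requests value.

    Hypothetical helper: it only reads sysfs and changes nothing.
    """
    values = {}
    for dev in sorted(Path(sys_block).glob("sd*")):
        tunable = dev / "queue" / "nr_requests"
        if tunable.is_file():
            values[dev.name] = int(tunable.read_text().strip())
    return values


if __name__ == "__main__":
    for name, value in read_nr_requests().items():
        print(f"{name}: nr_requests={value}")
```

Comparing the values across drives on the Marvell and non-Marvell controllers would at least show whether the tunable differs between them; it says nothing by itself about whether the setting is the cause.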
July 29, 2017 Understood. You are more knowledgeable on this than I am, because when I heard SASLP and corruption used in the same sentence, I bolted before updating past 6.0.1. Not sure if there were no issues then, or if I just wasn't impacted. Is this better? Your server contains a disk controller (e.g., SASLP, SAS2LP) based on a Marvell chip. The Marvell chips contain a defect that can cause drives to drop offline, parity errors, and even data corruption. (unRAID can't fix a controller chip.) Consider a replacement controller like the LSI SAS9201-8i, LSI SAS9211-8i, IBM M1015, or Dell H310. Read this post https://forums.lime-technology.com/topic/39003-marvell-disk-controller-chipsets-and-virtualization for more information on the problem and potential workarounds. If you are not experiencing problems, you may be able to safely ignore this warning, but educate yourself to make that determination.
July 29, 2017 I'm making progress on a real diagnosis / fix. I've managed to replicate one of the symptoms on my server that works perfectly. I'm testing it again to confirm, then I'll undo the changes and retest.