derekos Posted August 4, 2017 (edited)

I've lost 6 drives so far in a losing battle against an unRAID server. I have two unRAID servers: one with 24 drives (original) and another with 14 drives (new). I've been battling problems since January on the newer 14-drive server. Every time a monthly parity check runs, one or more drives start throwing millions of errors. Almost always, a drive ends up disabled and listed as unmountable, and unRAID then emulates the drive contents. Again, this occurs exclusively during parity checks.

Almost all of these drives came over from the older server, where they ran flawlessly. As I upgrade to larger-capacity drives on the old server, I move the smaller ones to the new server, so it seems unlikely that all these drives are bad. More importantly, it's not as easy as throwing in a fresh replacement drive and rebuilding, because one or more drives start to throw errors during the rebuild. I am frequently faced with two drives having errors at the same time. I cannot even begin to catalog the ways I have managed to get around this, but frequently it involves removing drives, reseating them, and then rebooting the server by power switch (cold start), repeatedly. Eventually this lets me get past the errors during the parity rebuild and survive another month.

But it has gotten to the point where drives are dying one after the other (some going totally unmountable), and I have multiple drives with errors (during parity rebuild, not during normal activity) plus an unmountable/emulated drive at the same time. It seems like a hardware issue. I have reseated every single drive, every card in the server, and both ends of every cable. I have replaced the M1015 card. I have tracked which drives and which slots are affected, and there is no correlation of slots or rows to the errors. I've run xfs_repair on all of them.
Could it be a power issue? Could it be the RES2SV240 expander card? The motherboard? The backplanes on the Norco? The cables? I'm sure it could be any of these. But I think the key clue is that this only happens when parity checks run: these problems occur under high I/O. Hoping someone has an idea, or at least a lowest-to-highest-cost list of what I should start replacing. I'm worried that some of these drives have been physically harmed, which will make it difficult to detect whether a change has actually fixed the problem. Help! Thanks much.

Edited September 2, 2019 by derekos
updated topic: solved
itimpi Posted August 4, 2017

The fact that it occurs during parity checks suggests it is due to the system being under maximum load. I would suspect the power supply, although it could just be cables vibrating and intermittently losing connection. If you stop the array after one of these events, does unRAID say that the drive is missing (which means it dropped offline)?

One additional thing to check is that any HBA is properly seated in the motherboard. At one point I got symptoms rather like this when I had a SASLP controller that was not quite perfectly seated, although in that case there was a tendency for multiple drives to go offline at the same time.

Note that a drive being flagged as unmountable does not necessarily mean there is anything physically wrong with it. If a drive unexpectedly drops offline, this can result in corruption at the file system level. If on the next boot the drive shows up again and the SMART attributes look fine, then this is virtually always fixable with the appropriate file system recovery tool (accessible by clicking on the drive while the array is in Maintenance mode).
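Checking SMART attributes after one of these drop-offs can be scripted when you're doing it across a dozen drives. A minimal sketch, assuming smartctl's standard `-A` attribute-table format; the sample rows below are hypothetical, not taken from this server's actual reports. A non-zero CRC error count tends to implicate cables/backplane, while reallocated or pending sectors implicate the disk itself:

```python
# Attributes worth checking after a drive drops offline. Raw CRC errors
# implicate cables/backplane; reallocated/pending sectors implicate the disk.
WATCH = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "UDMA_CRC_Error_Count"}

def worrying_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with a
    non-zero raw value, given the text of `smartctl -A /dev/sdX`."""
    found = {}
    for line in smartctl_output.splitlines():
        parts = line.split()
        # Attribute rows look like:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(parts) >= 10 and parts[0].isdigit() and parts[1] in WATCH:
            raw = int(parts[9])
            if raw != 0:
                found[parts[1]] = raw
    return found

# Hypothetical sample rows for illustration only:
sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       42
"""
print(worrying_attributes(sample))  # {'UDMA_CRC_Error_Count': 42}
```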
SSD Posted August 4, 2017

I'd like to know what disk controllers are in the system. Marvell-based controllers have been causing mischief recently, and that'd be my first suspicion.
SnickySnacks Posted August 4, 2017

I'd be surprised if he wasn't running the whole server off the M1015 + expander, since that covers the whole set of 24 slots on the Norco. Aside from the obvious "post your diagnostics": is your old server also running an M1015? Is the firmware the same? Are they both flashed to IT mode?
derekos Posted August 5, 2017 (Author)

I am running an M1015 + RES2SV240 (flashed to IT) on both the old and new server. I've replaced the M1015 on the new server. I have not checked the exact firmware between the two machines; I will do that. I will also reseat things, cold reboot, and post the diags.

@itimpi I agree that it happens when the system goes under load. The drives that start throwing errors during a parity rebuild do not go offline, get disabled, or show as missing. To be clear, I am talking about the drives being read from during a parity rebuild throwing errors (the rebuild being done to fix a drive that was disabled or listed as unmountable during a regular parity check). When I have gotten these error-throwing drives back online and working properly, I have run SMART reports and they always look fine. As you said, I've been able to run xfs_repair on them to fix the filesystem and then bring them back online.
BobPhoenix Posted August 5, 2017

I would suggest that besides updating the firmware on the M1015, you also update the firmware on the RES2SV240. I had problems with one of my RES2SV240s until I updated its firmware. I don't remember the exact symptoms, so they could be different from yours, but I would definitely try that. I updated the firmware on the RES2SV240 directly from unRAID, whereas my M1015s I have always updated from a DOS boot floppy.
derekos Posted August 5, 2017 (Author)

So, I cold booted, reseated cards, and instead of running all the drives off one Molex ribbon cable, I am now running them off two different Molex cables. unRAID is back online and I am faced with a dilemma. Recall that in July the automatic parity check ran, causing Drive 7 to drop out (disabled, contents emulated). I now have the system back up and have run xfs_repair on Drive 7. I mounted it and checked the data, which looks good. So I now have two options: (1) run a parity rebuild and risk multiple other drives running into problems, with lots of reboots, reseating, etc., or (2) trust my parity and Drive 7, which means I live another month. If I go with #1, I will likely rinse and repeat, face the same set of choices, and eventually need to go with #2 anyway to carry on for the next 26 days.

I am trying to figure out how to determine what BIOS I am running on the M1015 and the RES2SV240 (thx @BobPhoenix).
Lev Posted August 5, 2017

5 hours ago, derekos said: how to determine what BIOS I am running on the M1015 and the RES2SV240

Right at boot-up you should see the card POST; just press CTRL-C. Take some pictures and attach them. It'll show the firmware version.
JorgeB Posted August 5, 2017

The already-requested diagnostics would also show both.
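For reference, the controller firmware shows up in the syslog inside an unRAID diagnostics zip: the mpt2sas driver logs an `LSISAS2008: FWVersion(...)` line at boot. A small sketch of pulling those versions out with a regex; the exact log-line format is an assumption based on what the driver typically prints:

```python
import re

# The mpt2sas driver logs a line like:
#   LSISAS2008: FWVersion(20.00.07.00), ChipRevision(0x03), BiosVersion(07.39.02.00)
# so firmware and BIOS versions can be read straight out of the syslog.
FW_RE = re.compile(r"(LSISAS\d+): FWVersion\(([\d.]+)\).*?BiosVersion\(([\d.]+)\)")

def controller_firmware(syslog_text):
    """Return a list of (chip, firmware_version, bios_version) tuples."""
    return FW_RE.findall(syslog_text)

# Hypothetical syslog fragment for illustration:
sample = "mpt2sas0: LSISAS2008: FWVersion(20.00.07.00), ChipRevision(0x03), BiosVersion(07.39.02.00)"
print(controller_firmware(sample))  # [('LSISAS2008', '20.00.07.00', '07.39.02.00')]
```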
FreeMan Posted August 5, 2017

If you post diagnostics for the failing machine, people will be able to stop asking questions and making guesses and actually provide useful input. It might be worthwhile to post diagnostics for the non-failing machine as well (just make sure you label them very clearly), for comparison purposes.
derekos Posted August 5, 2017 (Author) (edited)

Hi guys, I am attaching diagnostics for the failing machine. I attempted a parity rebuild, and the drive being written to was about 2.5 hours in with no problems. I woke up to find 18,446,744,073,709,551,616 writes (far too many), and the drive has an X and is marked "disabled, contents emulated". This exact thing has not happened before (the drive being rebuilt failing), but it continues the pattern of erratic behavior. I will also note that I didn't want to rebuild over the old Drive 7 (since I think it's good), but instead used a spare drive I had.

I managed to get the M1015 firmware numbers:

(good unRAID) M1015 = LSISAS2008: FWVersion(14.00.00.00), ChipRevision(0x03), BiosVersion(00.00.00.00)
(problematic unRAID) M1015 = LSISAS2008: FWVersion(20.00.07.00), ChipRevision(0x03), BiosVersion(07.39.02.00)

persipolis-diagnostics-20170805-1102.zip

Edited August 5, 2017 by derekos
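As an aside, that eye-watering write count is not real I/O: it is exactly 2^64, which is consistent with a wrapped or underflowed 64-bit statistics counter from the drive dropping mid-rebuild rather than 18 quintillion actual writes. A one-line check:

```python
# The reported write count is exactly 2**64 -- a 64-bit counter artifact
# (likely a wrapped statistic from the drive dropping offline), not real I/O.
reported = 18_446_744_073_709_551_616
print(reported == 2**64)  # True
```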
JorgeB Posted August 5, 2017

The M1015 is using the latest P20 firmware; there are no known issues with it (unlike the first P20 firmware, 20.00.00.00). But you're virtualizing unRAID, and that adds another source of possible compatibility issues. The Intel expander is also on the latest firmware. The rebuilding disk dropped offline, so there's no SMART report; assuming the disk is good, it could be an issue with the LSI firmware under virtualization, some other virtualization compatibility issue, or a hardware issue like the power supply, controller, etc.
derekos Posted August 5, 2017 (Author)

@johnnie.black Thanks for the info on the firmwares, that is helpful. I am also virtualizing my other unRAID server on nearly identical hardware. I don't discount that there could be a virtualization issue, but I suspect it's more likely a hardware issue. I just rebooted unRAID and am attaching a SMART report for the drive that got disabled.

persipolis-smart-20170805-1131.zip
derekos Posted August 5, 2017 (Author)

Per @FreeMan, I am adding diagnostics for the good machine. Sorry for the multiple posts.

good-diagnostics-20170805-1145.zip
JorgeB Posted August 5, 2017

You could try swapping the M1015s from one server to the other; this would let you eliminate both the controller and the firmware as potential issues. Other than that, I'd also try a different power supply if one is available.
derekos Posted August 6, 2017 (Author)

So, I successfully rebuilt Drive 7. But I fear I will be reposting to this thread on September 1 when the monthly parity check runs. I am going to replace the power cords from the PSU to the backplanes before then.
derekos Posted September 2, 2019 (Author)

I am resurrecting this old thread to provide closure in case someone else comes across it in the future. I replaced the controller cards with two new ones. Problem solved.