[solved] One drive after another - corruption - 6 drives lost so far



I've lost 6 drives so far in a losing battle with an unRAID server. I have two unRAID servers: one with 24 drives (original) and another with 14 drives (new). I've been battling a lot of problems since January on the newer server with 14 drives. Every time a monthly parity check runs, one or more drives start throwing millions of errors. Almost always, a drive ends up getting disabled and listed as unmountable, and unRAID then emulates the drive contents. Again, this exclusively occurs during parity checks.

 

Almost all of these drives came over from the older server, where they ran flawlessly. As I upgrade to larger-capacity drives on the old server, I move the smaller ones to the new server, so it seems unlikely that all these drives are bad. More importantly, it's not as simple as throwing in a fresh replacement drive and rebuilding, because one or more other drives start to throw errors during the rebuild.

 

I am frequently faced with two drives having errors at the same time. I cannot even begin to catalog the ways I have managed to get around this, but it frequently involves removing drives, re-seating them, and rebooting the server from the power switch (cold start), repeatedly. Eventually this lets me get past the errors during the parity rebuild and survive another month. But it has gotten to the point where drives are dying one after another (some going totally unmountable), and I have multiple drives with errors (during parity rebuilds, not during normal activity) plus an unmountable/emulated drive at the same time.

 

It seems like a hardware issue. 

 

I have reseated every single drive. I have reseated every card in the server. I have reseated every cable, at both ends. I have replaced the M1015 card. I have tracked which drives are in which slots, and there is no correlation between slots or rows and the errors. I've run xfs_repair on all of them.

 

Could it be a power issue? Could it be the RES2SV240 expander card? Could it be the motherboard? Could it be the backplanes on the Norco? Could it be the cables? I'm sure it could be any of these.

 

But I think the key clue is that this only happens when parity checks run: these problems occur under high I/O.

 

Hoping someone has an idea, or at least a lowest-to-highest-cost list of what I should start replacing.

 

I'm worried that some of these drives have been physically damaged, which will make it difficult to tell whether a change has actually fixed the problem. Help, help.

 

Thanks much

 

Edited by derekos
updated topic: solved
Link to comment

The fact that it occurs during parity checks suggests that it is due to the system being under maximum load. I would suspect the power supply, although it could just be cables vibrating and intermittently losing connection. If you stop the array after one of these events, does unRAID say that the drive is missing (which would mean it dropped offline)? One additional thing to check is that any HBA is properly seated in the motherboard. At one point I got symptoms rather like this when I had a SASLP controller that was not quite perfectly seated, although in that case there was a tendency for multiple drives to go offline at the same time.

 

Note that a drive being flagged as unmountable does not necessarily mean there is anything physically wrong with it. If a drive unexpectedly drops offline, this can result in corruption at the file system level. If on the next boot the drive shows up again and the SMART attributes look fine, then this is virtually always fixable using the appropriate file system recovery tool (accessible by clicking on the drive while the array is in Maintenance mode).
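
For reference, the command-line equivalent of that recovery tool looks roughly like this (just a sketch, assuming an XFS-formatted disk 7 with the array started in Maintenance mode; substitute your own disk number):

# Read-only check first; -n reports problems without changing anything (disk 7 assumed)
xfs_repair -n /dev/md7

# If that looks sane, run the actual repair by dropping -n
xfs_repair /dev/md7

# Only if it refuses to run because of a dirty log, -L zeroes the log (may lose the most recent changes)
xfs_repair -L /dev/md7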

Link to comment

I am running an M1015 + RES2SV240 (flashed to IT mode) on both the old and new server. I've replaced the M1015 on the new server. I have not compared the exact firmware versions between the two machines; I will do that. I will also reseat things, cold reboot, and post the diags.

 

@itimpi I agree, I think it's when the system goes under load. The drives that start throwing errors during a parity rebuild do not go offline, get disabled, or show as missing. Just to be clear, I am talking about the drives being read from during a parity rebuild throwing errors (the rebuild being run to fix a drive that was disabled or listed as unmountable during a regular parity check).

 

When I have gotten these error-throwing drives back online and working properly, I have run SMART reports on them and they always look fine. As you said, I've been able to run xfs_repair on them to fix the filesystem and then bring them back online.
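
For anyone following along, I've been pulling those SMART reports from the console with something like the following; the device name /dev/sdg is just an example, yours will differ.

# Full SMART report for one disk (example device; substitute your own)
smartctl -a /dev/sdg

# Quick look at the attributes that usually flag a genuinely failing drive
smartctl -A /dev/sdg | grep -Ei 'reallocated|pending|uncorrect|crc'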

Link to comment

I would suggest that, besides updating the firmware on the M1015, you also update the firmware on the RES2SV240. I had problems with one of my RES2SV240s until I updated its firmware. I don't remember the exact symptoms, so they could be different from yours, but I would definitely try that. I updated the firmware on the RES2SV240 directly from unRAID, whereas with my M1015s I always updated them from a DOS boot floppy.

Link to comment

So, I cold booted, reseated the cards, and instead of running all the drives off one Molex ribbon cable, I am now running them off two different Molex cables. I have unRAID back online and am faced with a dilemma.


Recall that in July the automatic parity check ran and caused Drive 7 to drop out (disabled, contents emulated). I now have the system back up and have run xfs_repair on Drive 7. I mounted it and checked the data, which looks good. So I now have two options:

 

(1) I can run a parity rebuild and risk multiple other drives running into problems, with lots of reboots, reseating, etc., or

(2) I can trust my parity and Drive 7, which means I live another month.

 

If I go with #1, I will likely rinse and repeat and then face the same set of choices and eventually need to go with #2 to carry on for the next 26 days. 

 

I am trying to figure out how to determine what BIOS I am running on the M1015 and the RES2SV240 (thx @BobPhoenix).

 

 

Link to comment
5 hours ago, derekos said:

how to determine what BIOS I am running on the M1015 and the RES2SV240

 

Right at boot-up you should see the card POST; press CTRL-C at that point and it'll show the firmware version. Take some pictures and attach them.
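
If you'd rather not reboot, the mpt2sas driver normally logs the same information when the system comes up, so something like this from the console should show it too (assuming the current syslog still contains the boot messages):

# Firmware/BIOS versions as reported by the LSI driver at boot
grep -i fwversion /var/log/syslog

# Or, if the syslog has rotated, try the kernel ring buffer
dmesg | grep -i fwversion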

Link to comment

I'm going to say that if you post diagnostics for the failing machine, people will be able to stop asking questions and taking guesses and actually be able to provide useful input.

 

It might be worthwhile to post diagnostics for the non-failing machine as well, for comparison purposes (just make sure you label them very clearly).

Link to comment

Hi guys, I am attaching diagnostics for the failing machine.  

 

I attempted a parity rebuild, and the drive being written to was about 2.5 hours in with no problems. When I woke up, I found 18,446,744,073,709,551,616 writes (far too many) and the drive has an X and is marked "disabled, contents emulated". This particular failure (the drive being rebuilt failing) has not happened before, but it continues the pattern of erratic behavior.
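
Side note: that write count is exactly 2^64, so it looks like a stat counter wrapping around after the drive dropped rather than real I/O. A quick sanity check from the console (assuming bc is installed):

# 2^64 matches the bogus write count shown for the rebuilt drive
echo '2^64' | bc
# prints 18446744073709551616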

 

I will also note that I didn't want to rebuild over the old Drive 7 (since I think it's good), but instead used a spare drive I had.

 

I managed to get the M1015 firmware numbers:

 

(good unraid) M1015  = LSISAS2008: FWVersion(14.00.00.00), ChipRevision(0x03), BiosVersion(00.00.00.00)

(problematic unraid) M1015 = LSISAS2008: FWVersion(20.00.07.00), ChipRevision(0x03), BiosVersion(07.39.02.00)

 

persipolis-diagnostics-20170805-1102.zip

Edited by derekos
Link to comment

The M1015 is using the latest P20 firmware, and there are no known issues with it (unlike the first P20 release, 20.00.00.00), but you're virtualizing unRAID, which adds another possible source of compatibility issues.

 

The Intel expander is also on the latest firmware.

 

The rebuilding disk dropped offline, so there's no SMART report. Assuming the disk itself is good, it could be some issue with the LSI firmware and virtualization, some other virtualization/compatibility issue, or a hardware issue like the power supply, controller, etc.

Link to comment