
Multiple errors on multiple drives


levster


Just a little while ago I noticed multiple disks with errors and, probably on instinct, I started a parity check. Perhaps that was not the smartest thing to do, as I should have posted here first, but I did. Now the parity check is going at breakneck pace, and there are a TON of errors. Any ideas?

tower2-diagnostics-20181117-0052.zip

 

Also, I seem to have lost ALL of my dockers! Also, looking at the tower itself, I only see one disk as being active. All other disks appear dormant despite the main page showing multiple writes. Should I cancel the parity check and reboot?

Link to comment

Well, I just rebooted the server manually, as it would not do so from the main screen, and all of the previous errors have miraculously disappeared. There is one unused disc with an error. A parity check started, I assume because of the unclean shutdown. The speed is now what I would normally expect.

 

Any clue as to what happened? All of the dockers are showing and VM is running. Prior to that, I could not access 90% of the shares.

tower2-diagnostics-20181117-0134.zip

Link to comment

I ran a parity check once I rebooted the server and all was fine. Now, for no reason at all, disc 4 went offline and is being emulated.

 

1. How do I get it back?

 

2. Should I downgrade to the older unRaid version? It seems that something is wrong with the latest build if several users with this controller are having issues.

 

3. As for upgrading the firmware: would that involve re-flashing the controller boards? Do you know of a link to the instructions? It has been a while since I flashed and installed them.

 

Please help.

 

tower2-diagnostics-20181118-0310.zip

Link to comment

So, I found the link to update the LSI controller. Installed the latest. Plugged in the system and... No drives appear!

tower2-diagnostics-20181118-0516.zip

 

Looks like all three PERC cards were not flashed correctly. I just tried to run 1.bat and am getting "firmware returned exception". Looks like there is no firmware on the card. Now it seems that all three cards are dead in the water. Any ideas how to get them back?

Link to comment
6 minutes ago, levster said:

Everything seemed to be doing well, and then last night both Parity discs went off line. Server reports both as disabled. I rebooted and the problem persists. 

tower2-diagnostics-20181119-0843.zip

The disabled state will not be removed until you rebuild the parity on the two disks.

 

The process is:

  • Stop the array;
  • Unassign the parity disks
  • Start the array without the parity disks assigned.  The array will now start unprotected.
  • Stop the array
  • Reassign the parity disks.
  • Restart the array and now Unraid will start building new parity.

Note that this process requires all your data drives to be free of errors. If you suspect that will not be the case then check back here for advice before proceeding.

 

This still leaves the question of why the parity disks were disabled in the first place.

Link to comment

1. You can run an extended SMART test on each data disk to check for problems with the disks themselves. You can check the syslog for signs of cable problems. You can check the integrity of the file system on each disk. You can check for file corruption if you happen to have been using the Dynamix File Integrity plugin, which maintains a checksum for each file - though installing it now, after the event, won't help.

 

2. The syslog would be the place to look. Also check the SMART reports for the parity disks, after rebooting.

 

So, grab diagnostics first - since the parity disks are offline their SMART reports will probably be missing. This is where you'll find the syslog with the useful information. Then reboot (or power cycle) and grab diagnostics again - to include the parity disks' SMART reports. The syslog in the second diagnostics will be less useful but should at least show a clean start up.

 

Are the two parity disks on the same HBA?
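
For item 1, the extended tests can also be started from the command line. Here is a minimal sketch, assuming the `smartctl` tool from smartmontools is available on the server; the device names `/dev/sdb` and `/dev/sdc` and the helper names are placeholders for illustration, not taken from your diagnostics. It only prints the commands unless you turn `dry_run` off.

```python
import subprocess


def start_test_commands(devices):
    """Build one 'start extended self-test' command per disk."""
    return [["smartctl", "-t", "long", dev] for dev in devices]


def result_commands(devices):
    """Build one 'show self-test log' command per disk (run after the tests finish)."""
    return [["smartctl", "-l", "selftest", dev] for dev in devices]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    # Hypothetical device names: substitute the ones shown for your data disks.
    disks = ["/dev/sdb", "/dev/sdc"]
    run_all(start_test_commands(disks))
    run_all(result_commands(disks))
```

An extended test can take many hours on a large disk, so the self-test log is only worth reading once the test has finished.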

Link to comment

Well, those two SMART reports look good so the problem probably still lies with the cabling/controller or with power.

 

I'm really just responding to today's questions, 1 and 2, so I'm not interested in the diagnostics for what has gone before since you've made changes since then. It would be helpful if you could point out which diagnostics cover the period during which your parity disks became disabled.

Link to comment

Neither is of any use, I'm afraid. What we need is a log that includes the time when the disks became disabled, such as you could have got after discovering the problem but before you rebooted. That would contain entries showing the controller crashing or resetting the SATA link, or whatever. Maybe there was a problem with the flashing of the HBA firmware. Are both parity disks controlled by the same card? When a disk drops offline sometimes a reboot is insufficient to restore it and a power cycle is needed. I had an H310 start acting strangely during the heat of this summer. Cutting a long story short, I ended up replacing the thermal paste, which had completely dried out - the heatsink is easy to remove if you unhook the springy clip with a pair of needle nose pliers.
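
If you do manage to capture a syslog that covers the failure, a quick way to narrow it down is to filter it for suspicious words. This is only a rough sketch: the keyword list is my guess at useful search terms, not a list of actual messages your controller will log, and the sample text is made up purely to show the function working.

```python
def find_suspect_lines(log_text, keywords=("reset", "offline", "i/o error", "disabled")):
    """Return (line_number, line) pairs whose text contains any keyword, ignoring case."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line))
    return hits


if __name__ == "__main__":
    # Made-up sample text, NOT real syslog output.
    sample = "\n".join([
        "example: array started",
        "example: link RESET on some port",
        "example: disk marked disabled",
    ])
    for number, line in find_suspect_lines(sample):
        print(number, line)
```

The line numbers make it easy to go back to the full log and read what happened just before each hit.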

Link to comment
1 hour ago, John_M said:

Neither is of any use, I'm afraid. What we need is a log that includes the time when the disks became disabled, such as you could have got after discovering the problem but before you rebooted... Cutting a long story short, I ended up replacing the thermal paste, which had completely dried out - the heatsink is easy to remove if you unhook the springy clip with a pair of needle nose pliers.

I will take a look at it later today and switch the controller. I have 3 H310s, so I am not sure if they are on the same one. I'll definitely post afterwards. If they are still not showing up, is it safe to rebuild both parity discs at the same time or should I do it one at a time?

Link to comment

First, you need to get the parity drives to be recognised. If a reboot doesn't restore them then a power cycle should. If they still don't show up, then you have a more serious problem. If they are on the same controller then it could point to the controller being faulty.

 

Second, you need them to stay connected. There's no point in trying to rebuild parity if they keep being dropped. You need to find a combination of controllers, slots, cables that works reliably.

 

Third, you don't have a choice of only rebuilding one parity disk at a time (well, you do, actually, by physically disconnecting the other one). But you do have a choice between rebuilding parity and doing a correcting parity check. If you're worried about leaving your array unprotected while parity rebuilds and you don't think that parity is hugely out of sync, the correcting check is the option to consider. The downside is that the correcting check will take longer than a rebuild if there are a lot of errors to correct, but the result will be the same, and unreliable parity isn't much better than no parity at all. The thread is quite complicated and I'm not sure when it can be said with any certainty that parity was indeed truly in sync. Maybe it was as recently as yesterday when you wrote:

 

On 11/18/2018 at 8:24 AM, levster said:

I ran a parity check once I rebooted the server and all was fine. Now, for no reason at all, disc 4 went offline and is being emulated.

 

1. How do I get it back?

 

You never said how or whether you resolved this. It gets lost in the discussion about flashing HBAs. Then suddenly both parity disks have been dropped. So perhaps your best solution is a New Config and rebuild parity, after all. But only when the disks stay attached to their controllers.

Link to comment
1 hour ago, John_M said:

First, you need to get the parity drives to be recognised. If a reboot doesn't restore them then a power cycle should. If they still don't show up, then you have a more serious problem. If they are on the same controller then it could point to the controller being faulty...

 

You never said how or whether you resolved this. It gets lost in the discussion about flashing HBAs. Then suddenly both parity disks have been dropped. So perhaps your best solution is a New Config and rebuild parity, after all. But only when the disks stay attached to their controllers.

Got it.

 

So,

 

1. Shut down the system.

 

2. Check to see if the two parity discs are on the same controller. If they are I will move them both to a different one.

 

3. Boot and see if both parity discs come back

 

4. If they do, then either do a parity check, or do a new Config and run a rebuild on both parity discs at the same time.

 

On a side note, once I got my three H310 flashed correctly, as per help of 

 

All of the discs, except one came back correctly. I ran a rebuild of disc 4 and all was well until this morning.

Link to comment
7 minutes ago, levster said:

I ran a rebuild of disc 4 and all was well until this morning.

...at which point you discovered that both parity disks had dropped offline?

 

In that case, if Disk 4 rebuilt properly, there are likely to be only a few parity errors so it's worth doing a New Config, keeping all assignments. Double check that all disks are correctly assigned. Select the "Trust Parity" option and start the array. Then, run a correcting parity check.

 

Alternatively, just reassign the two parity disks and start the array to rebuild them. Your choice - the other option keeps the array protected. (OK, so it starts off keeping it protected with parity that you can't trust 100%, but as time passes the situation improves!)

 

But do the hardware checks first!
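
To show why a correcting check and a rebuild end up with the same result, here is a simplified sketch of XOR parity in Python. It is an illustration only, not Unraid's actual code: it models a single parity disk with tiny byte strings standing in for whole disks, and the function names are made up. The second parity disk uses a different calculation that this sketch does not cover.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, one block per disk."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def correcting_check(data_blocks, stored_parity):
    """Recompute parity from the data and count positions where the stored parity differs."""
    expected = xor_parity(data_blocks)
    errors = sum(1 for e, s in zip(expected, stored_parity) if e != s)
    return expected, errors


if __name__ == "__main__":
    data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]  # three tiny "data disks"
    stale = b"\x15\x00"                             # parity with one wrong byte
    fixed, errors = correcting_check(data, stale)
    print(errors, fixed == xor_parity(data))
    # With good parity, a missing data disk is the XOR of everything else.
    print(xor_parity([data[0], data[2], fixed]) == data[1])
```

Either way the parity ends up being whatever the data disks say it should be, which is why the data disks need to be healthy and stay connected before you start.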

Link to comment
50 minutes ago, John_M said:

...at which point you discovered that both parity disks had dropped offline?

 

But do the hardware checks first!

Got it. I'll check the hardware. Restart and see if parity is there. Then recheck / rebuild. And, sorry for the scary avatar. It's actually a real, live person's image, just played with. No serious doctoring with it. Just a magic of Computed Tomography.

Link to comment

Warning: Off topic!

Ok. Good luck. The avatar's pretty cool, actually, and I did recognise it and make the connection so it served its purpose much better than the plain default letter "l" would have done. FWIW, mine is a detail from a larger photo of a Victorian bridge that carries a railway line over a canal in London. The two paths cross at an awkward angle - anything other than a right angle is awkward as it results in a skew bridge, and being made entirely of brick it has to be in the form of an arch. Now, a brick skew arch is a complex and beautiful engineering problem and to stop it from falling apart the bricks have to be laid in spiral (actually, "helical" is the more precise term) courses. In this case there are six layers or concentric rings to support the heavy freight trains and, while there's a large offset between adjacent rings at the point where the arch springs from its abutments, at the highest point, or crown, of the arch they all align perfectly.

Link to comment
4 minutes ago, John_M said:

Warning: Off topic!

Ok. Good luck. The avatar's pretty cool, actually, and I did recognise it and make the connection so it served its purpose much better than the plain default letter "l" would have done. FWIW, mine is a detail from a larger photo of a Victorian bridge that carries a railway line over a canal in London. The two paths cross at an awkward angle - anything other than a right angle is awkward as it results in a skew bridge, and being made entirely of brick it has to be in the form of an arch. Now, a brick skew arch is a complex and beautiful engineering problem and to stop it from falling apart the bricks have to be laid in spiral (actually, "helical" is the more precise term) courses. In this case there are six layers or concentric rings to support the heavy freight trains and, while there's a large offset between adjacent rings at the point where the arch springs from its abutments, at the highest point, or crown, of the arch they all align perfectly.

You must be an engineer. Someone like that will see the true beauty in the design.

 

So, I moved the parity discs to a different controller and the system still does not see them. So, I guess it's time to unassign and reassign.

tower2-diagnostics-20181119-2110.zip

Link to comment


