craigr

RED X unusual issues; help please


Two days ago my Disc1 was booted from the array with a red X.  The unRAID main page showed that Disc1 had over 15,000,000,000,000,000,000 writes and even more reads.  Fortunately, I had run a parity sync just the day before with zero errors.  I had also just run an extended SMART test with zero errors.  So I reseated the power and SATA cables on Disc1 and rebuilt the drive.  Everything went fine and the drive rebuilt and is back in action.

 

Today I started a parity sync to double test (without writing corrections to parity) and shortly thereafter my parity drive got booted with a similar error.  Attached are the log and a screen capture of the unRAID Main screen showing the huge number of reads and writes to parity.

 

Both drives are connected to the same SAS port, so I reseated the SAS cable, swapped the four drives on that SAS input with another SAS input, and then reseated all the SATA cables.  I am now rebuilding parity on the same parity drive.

 

Please provide any advice.  This system has been running stably for years.  About four months ago, I upgraded any remaining drives below 8TB to 8TB drives, but my parity and Disc1 drives have been in the system for over 2.5 years and are not new.  I find it unlikely that they would both go bad three days apart in the exact same way.

 

Kind regards,

craigr

writes.png

unraid-diagnostics-20190401-0959.zip


The disk dropped offline and then reconnected as sds. SMART looks fine, so this is likely a connection issue. If it happens again to one of those 4 disks, replace or swap the miniSAS cable. It could also be power related; PSUs do go bad.

2 minutes ago, johnnie.black said:

The disk dropped offline and then reconnected as sds. SMART looks fine, so this is likely a connection issue. If it happens again to one of those 4 disks, replace or swap the miniSAS cable. It could also be power related; PSUs do go bad.

Well, my thoughts exactly. 

 

I didn't notice that when the parity drive reconnected it was reassigned sds, so thanks for pointing that out.

 

I am actually shopping for a miniSAS cable now just to have on hand.  I suppose it could be the one LSI9211-IT controller too :(  The power supply is just over a year old and is well regarded and under worked.  I doubt it's the PS, but at least if it is, it's under warranty and I have a spare.  

 

...shoot, I forgot to reseat the SATA power extension cable that feeds those two drives.  I'll do that once the parity rebuild is finished in about 16 hours.

 

This is stressing me out.  Might be time for dual parity.

 

Thanks again,

craigr

8 minutes ago, craigr said:

I suppose it could be the one LSI9211-IT controller too :(

It could, but it's not very likely, since that would usually show up as the whole controller no longer responding, not just one disk.

Just now, johnnie.black said:

It could, but it's not very likely, since that would usually show up as the whole controller no longer responding, not just one disk.

Good!

 

Thanks again,

craigr


Everything seems fine again now.

 

I went ahead and ordered four new 8087 forward breakout cables that have clips on the SATA connectors.  My current cables do not, and they have occasionally been problematic for years.  However, in the past the only problems were drives not showing up after I had worked on the system, with a SATA connector or two having come loose.

 

We'll see I guess.

 

Thanks again,

craigr


So I just upgraded my SAS controllers and when I did, I found that one of the power plugs on my modular power supply was not inserted all the way.  It just fell out while I was working.  I'm pretty sure this was the entire issue and I can't believe that I didn't have more problems than I did.  One of my hard drives on that power rail now has two UDMA CRC errors as well.  I'm pretty sure that's a result of a momentary power glitch on that drive... 

 

I wish there were a way to clear UDMA CRC errors.  Now I can't monitor that drive anymore because unRAID will send me a warning every day for the rest of the life of that drive if I do.  Or I wish unRAID had a setting to not warn me unless the CRC UDMA error rate increases from where it is now (2).

 

craigr

Edited by craigr

49 minutes ago, craigr said:

I wish there were a way to clear UDMA CRC errors.  Now I can't monitor that drive anymore because unRAID will send me a warning every day for the rest of the life of that drive if I do.  Or I wish unRAID had a setting to not warn me unless the CRC UDMA error rate increases from where it is now (2).

If you go to the Dashboard and scroll down to the Array section showing the drives then the SMART icon for the drives in question will be orange.   If you click on that you get a small drop-down menu with ‘Acknowledge’ being one of the options.   Once you have done that the icon turns green and you only get new notifications if the value changes.
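
The behaviour described here (warn once, stay quiet after Acknowledge, warn again only when the value changes) can be sketched roughly as follows. This is only an illustration of the idea, not Unraid's actual code; the `SmartMonitor` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SmartMonitor:
    """Hypothetical tracker for acknowledged SMART raw values (not Unraid's code)."""

    # Maps SMART attribute ID -> raw value the user has acknowledged.
    acknowledged: Dict[int, int] = field(default_factory=dict)

    def acknowledge(self, attr_id: int, raw_value: int) -> None:
        """Record the current raw value as the accepted baseline."""
        self.acknowledged[attr_id] = raw_value

    def should_notify(self, attr_id: int, raw_value: int) -> bool:
        """Warn on a non-zero value unless it equals the acknowledged baseline."""
        if raw_value == 0:
            return False
        baseline = self.acknowledged.get(attr_id)
        return baseline is None or raw_value != baseline


monitor = SmartMonitor()
first = monitor.should_notify(199, 2)    # new non-zero CRC count: warn
monitor.acknowledge(199, 2)              # user clicks 'Acknowledge'
repeat = monitor.should_notify(199, 2)   # unchanged since acknowledge: quiet
changed = monitor.should_notify(199, 3)  # count went up: warn again
```

With a design like this, a drive that picked up two CRC errors from a one-off cable or power glitch can stay monitored without a daily warning.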

3 hours ago, itimpi said:

If you go to the Dashboard and scroll down to the Array section showing the drives then the SMART icon for the drives in question will be orange.   If you click on that you get a small drop-down menu with ‘Acknowledge’ being one of the options.   Once you have done that the icon turns green and you only get new notifications if the value changes.

THANK YOU!

 

The Dashboard does not have any orange SMART icons right now, but I had monitoring of the attribute disabled.  I suspect I will have to wait until tomorrow after unRAID checks the health of the array for the error to come back.  I just re-enabled "SMART attribute notifications:" for "Attribute = 199 UDMA CRC error rate" on that drive and right now I show no errors on the Dashboard page.
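
For anyone who wants to read the raw count directly rather than wait for the next health check, here is a minimal sketch that pulls one attribute's raw value out of `smartctl -A` output. It assumes the usual ten-column ATA attribute table with an integer raw value in the last column; the helper names, the sample text, and the device path are made up for illustration.

```python
import subprocess
from typing import Optional

# Made-up sample in the usual `smartctl -A` table layout (illustrative values only).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       2
"""


def raw_smart_value(report: str, attr_id: int) -> Optional[int]:
    """Return the raw value for attr_id, or None if the attribute is not listed."""
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) == attr_id:
            return int(fields[9])
    return None


def read_report(device: str) -> str:
    """Run `smartctl -A` on a device such as /dev/sdX (placeholder path)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return result.stdout


crc_errors = raw_smart_value(SAMPLE, 199)
```

On a real system the text would come from `read_report("/dev/sdX")` with the actual device name, and the drive may need to spin up before it answers.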

 

I'll follow up for sure.

 

Kind regards,

craigr

Edited by craigr


Actually, all I had to do was spin up the drive and the orange icon showed up.  I acknowledged so hopefully that's it.

 

However, I wish that the "199 UDMA CRC error count" line on the "Disk 5 Settings" page were not still highlighted... Oh well, can't have everything I guess.  I do think it would be a good feature to have the highlight removed after acknowledging an error though.

 

Thank you so much again,

craigr

2 hours ago, craigr said:

Actually, all I had to do was spin up the drive and the orange icon showed up.  I acknowledged so hopefully that's it.

 

However, I wish that the "199 UDMA CRC error count" line on the "Disk 5 Settings" page were not still highlighted... Oh well, can't have everything I guess.  I do think it would be a good feature to have the highlight removed after acknowledging an error though.

 

Thank you so much again,

craigr

I am personally torn both ways on that :).    Sometimes I wish it would go back to green, and other times I like having it orange to remind me it is an attribute I should be keeping my eye on.    

 

I do wish, though, that there was a more intuitive way of getting to the Acknowledge option, as I am sure that many people switch off monitoring of that attribute because they do not know how to suppress repeated notifications when the value is non-zero but has not changed.   I would also like to be able to do it from the page displayed when you click on the drive in the Main tab and scroll down to the SMART attributes section.

27 minutes ago, itimpi said:

I am personally torn both ways on that :).    Sometimes I wish it would go back to green, and other times I like having it orange to remind me it is an attribute I should be keeping my eye on.    

 

I do wish, though, that there was a more intuitive way of getting to the Acknowledge option, as I am sure that many people switch off monitoring of that attribute because they do not know how to suppress repeated notifications when the value is non-zero but has not changed.   I would also like to be able to do it from the page displayed when you click on the drive in the Main tab and scroll down to the SMART attributes section.

Yeah, I can see why you are torn.  I think on something like "196 Reallocated event count" or "197 Current pending sector" that the highlight should remain, but with UDMA CRC errors, once you correct the problem it's fixed until another event happens in the future... at which point you will be notified.  In fact, I would argue that having the highlight go away after acknowledgement, and then show up again if another error occurs would actually make it more noticeable.

 

Like you I think there should be a more intuitive way to acknowledge the error.  I didn't even know it could be done until I searched the forums with google and then I had to ask here how to actually do it because I couldn't figure out how.  I just replaced all my drives under 8TB several months back and many of them had UDMA CRC errors from a day a SAS cable was not fully inserted... my solution until today had been to turn off notifications on all of those drives, and there were like six of them.  I really wish I had known about the ability to acknowledge the warning before now.

 

As far as the highlight goes, it would be cool if the user could choose whether or not to display it really.  Maybe on a line by line basis, or maybe with certain themes?

 

Anyway, thanks again for your help.

 

craigr


The source code to the GUI is freely available on GitHub.    If someone were motivated enough to work out the changes required to make it act along the lines you suggest, I think there is an excellent chance the change might be accepted into a future Unraid release.  With the limited resources that Limetech has, I think they are probably focusing on more fundamental enhancements at the heart of Unraid.

9 hours ago, itimpi said:

The source code to the GUI is freely available on GitHub.    If someone were motivated enough to work out the changes required to make it act along the lines you suggest, I think there is an excellent chance the change might be accepted into a future Unraid release.  With the limited resources that Limetech has, I think they are probably focusing on more fundamental enhancements at the heart of Unraid.

Yeah, I understand that.

 

If I had the knowledge I'd do it... but I'm not a coder so I don't ;)

 

Kind regards,

craigr
