Unclear disk issue

GMAsterAU · January 2, 2021

Happy New Year everybody.

I have been having some strange issues with a disk since upgrading to 6.9.0-rc2. According to the UNRAID GUI it had 1024 errors and is now being emulated. This is the second time this has happened in the span of a week. The first time I checked all the connections and had the disk rebuilt, and it worked well for a couple of days before it happened again. In response to the error I ran an extended SMART check and it passed. I also cannot see any indication in the SMART report that the disk is failing. I have attached the diagnostics. The disk in question is DISK 1.

FYI the disk is connected via a Silverstone ECS04 Raid card.

tower-diagnostics-20210103-0736.zip

JorgeB · January 3, 2021

Swap both cables/slot with a different disk and see if the problem follows the disk.

GMAsterAU · January 3, 2021

I did that just now and I am greeted with the message: 'Array has turned good. Array has 0 disks with read errors'. The disk is still emulated. I imagine that I now have to go through the whole rebuild routine?

JorgeB · January 4, 2021

14 hours ago, GMAsterAU said:

I imagine that I now have to go through the whole rebuild routine?

Yes.

GMAsterAU · January 4, 2021

I went through the rebuild and everything went well. So now I am again at a loss here. The disk does not have any UDMA CRC errors and also no reallocated sectors. This seems odd to me, as reading other posts and from past experience a bad disk has always carried some form of increasing, permanent, errors and a bad connection made the CRC error count go up but then stop increasing once the connection issue was fixed. In this case though I got neither scenario really. JorgeB, do you happen to know what may have caused these strange errors that UNRAID displayed? Do you recommend I send the disk back for a warranty replacement after only 1300h of use?

trurl · January 4, 2021

57 minutes ago, GMAsterAU said:

a bad connection made the CRC error count go up

Not always. In fact, not usually. Bad connection will often result in the disk not even knowing there was a problem since it can't even be accessed. CRC and other SMART attributes are stored within the disk and if the disk didn't know there was a connection problem it can't store anything about that.

If you want further advice based on your current situation post new diagnostics.
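To make the point about SMART counters concrete, here is a minimal sketch of checking the two attributes mentioned above (reallocated sectors and UDMA CRC errors) between two snapshots. It assumes the attribute table layout printed by `smartctl -A`. The sample rows and their values are made up for illustration, and `raw_values` and `increased` are hypothetical helper names, not part of any tool.

```python
# Illustrative text in the layout of `smartctl -A` attribute rows.
# The values are invented for this example, not taken from the diagnostics.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def raw_values(report, wanted=(5, 199)):
    """Return {attribute name: raw value} for the wanted attribute IDs."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) in wanted:
            if fields[9].isdigit():
                values[fields[1]] = int(fields[9])
    return values


def increased(before, after):
    """Names of attributes whose raw value grew between two snapshots."""
    return sorted(name for name, value in after.items() if value > before.get(name, 0))


if __name__ == "__main__":
    print(raw_values(SAMPLE))
```

If both counters are unchanged between snapshots while Unraid still reports read errors, that fits the case described above: the disk never recorded the problem, so the cause may sit in the cable, slot, or controller.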

GMAsterAU · January 5, 2021

Thanks for that. If you don't mind having a look, I would greatly appreciate it.

tower-diagnostics-20210105-1700.zip

JorgeB · January 5, 2021

If you swapped cables as recommended, just wait to see if it happens again. If it happens to the same disk, it might be failing despite no SMART issues.

GMAsterAU · January 5, 2021

thank you for that. I will update once I know more

GMAsterAU · January 30, 2021

After almost a month without issues, the identical disk issue appeared again. Overnight at 2 am DISK 1 showed read errors again. Once again there are no SMART errors reported as far as I can see. Following our previous discussion, the bottom line then is that the disk is failing in spite of no SMART errors? Is there any way to know what kind of errors these are?

tower-diagnostics-20210131-0809.zip

JorgeB · January 31, 2021

If the same disk keeps failing and you've ruled out cables, it's likely a disk problem. You can also try using it with a different controller if you haven't done so yet.

GMAsterAU · January 31, 2021

thanks JorgeB I will give that a try and see what happens. It is still unclear to me how the SMART stats stay ok and the disk has been totally fine including a complete parity check.

GMAsterAU · February 25, 2021

Hi all, so I did a lot of poking and testing and the symptoms keep getting weirder. As a note any and all of these issues have started and persisted with Version: 6.9.0-rc2

1. currently I have 2 disks sent away for replacement and they are missing from the array as discussed above.

2. I have discovered that two out of 4 disks that are connected to the RAID card (Silverstone ECS04) show errors after about 1-2 days of server up-time. However, when I restart the server the errors are removed and everything is good again until the cycle restarts.

2.1 in response to the RAID card, I have increased cooling; however, when I checked on its temps they did not exceed the manufacturer's recommendations

3. as part of the whole 'weird errors are happening' situation, I have also discovered that user shares do not show up in the 'SHARES' menu, and when I connect to the server I can only see select trees and everything else is missing, requiring a restart to fix.

Before restart:

After restart:

I have a couple key questions to understand what is going on:

1. does anyone know what kind of errors Unraid is recording and counting in the Main menu when the disk error rate goes up?

2. why do these errors get reset?

3. what could lead to me having to restart the server to get it all sorted?

4. what governs the shares information and where is it stored? am I looking at a failed RAM module perhaps?

thanks for all your help with this

tower-diagnostics-20210226-0643_before restart.zip tower-diagnostics-20210226-0702_after_restart.zip

trurl · February 25, 2021

1. Any failed attempt to read or write a disk is counted. You can see these in the syslog in your diagnostics.

2. The error counts in the Errors column on Main always start at zero when the server boots. You can also reset them at Main - Array Operation - Clear Stats.

3. Maybe a controller problem: the controller is resetting, and it looks like disks 5 and 6 were having problems, so that might indicate something they have in common is the culprit.

4. The user shares are simply the aggregate of all top level folders on the pools and array, just another view of the disk files. If you create a user share, Unraid creates a top level folder named for the share on the pools or array as needed according to the settings for the share. Conversely, any top level folder on the pools or array is automatically a user share named for the folder. Problems reading the disks can sometimes interfere with aggregating the folders, and so the user shares are "broken".
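The description of user shares as an aggregate of top level folders can be sketched roughly as follows. This is only an illustration of the idea, not how Unraid implements it; `aggregate_user_shares` is a hypothetical name, and the mount paths in the usage part are an assumption about a typical layout.

```python
import os
from collections import defaultdict


def aggregate_user_shares(mount_points):
    """Map each top level folder name to the mounts that contain it."""
    shares = defaultdict(list)
    for mount in mount_points:
        try:
            entries = sorted(os.listdir(mount))
        except OSError:
            # A disk that cannot be read contributes nothing,
            # so its folders drop out of the combined view.
            continue
        for name in entries:
            if os.path.isdir(os.path.join(mount, name)):
                shares[name].append(mount)
    return dict(shares)


if __name__ == "__main__":
    # Assumed example paths; adjust to the mounts on your own system.
    for share, mounts in aggregate_user_shares(["/mnt/disk1", "/mnt/disk2"]).items():
        print(share, "->", mounts)
```

In this sketch a share that lives only on an unreadable disk disappears from the result entirely, which is similar in spirit to shares going missing until the disks can be read again.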

GMAsterAU · February 26, 2021

thank you @trurl. Do you have a recommendation on how to proceed? I was thinking to wait for the replacement drives to arrive, rebuild the array and I also have a tiny cooling fan coming for the RAID card as I have read that temperature issues can lead to corruption.

JorgeB · February 26, 2021

11 hours ago, GMAsterAU said:

I have discovered that two out of 4 disks that are connected to the RAID card (Silverstone ECS04) show errors after about 1-2 days of server up-time

Are both disks the same model?

GMAsterAU · February 26, 2021

56 minutes ago, JorgeB said:

Are both disks the same model?

Yes, they are; both are 8TB IronWolf.

JorgeB · February 26, 2021

I believe there have been other reports of issues with those disks when used on an LSI controller with v6.9, possibly a driver issue. You could try connecting them to the onboard SATA ports, swapping them with disks of a different model.

GMAsterAU · February 26, 2021

24 minutes ago, JorgeB said:

I believe there have been other reports of issues with those disks when used on an LSI controller with v6.9, possibly a driver issue. You could try connecting them to the onboard SATA ports, swapping them with disks of a different model.

sure! I will try that tomorrow morning and report back

GMAsterAU · March 2, 2021

On 2/26/2021 at 9:01 PM, GMAsterAU said:

sure! I will try that tomorrow morning and report back

@JorgeB I am amazed! So far the swap has worked fine. No errors reported with the same use after more than 2 days, when it previously used to show errors after 1 day. What I did was move the RAID card connections from the 8TB Seagate IronWolf drives to two 3TB WD drives, and there have been no issues so far.

Is there a way to raise this with UNRAID? I imagine I am not experiencing this as an isolated case

JorgeB · March 3, 2021

12 hours ago, GMAsterAU said:

I imagine I am not experiencing this as an isolated case

It's not. You can try upgrading to v6.9 final, which might include a newer LSI driver, but LT can't do anything about this; it would be an LSI + those Seagate disks issue.

GMAsterAU · March 3, 2021

10 hours ago, JorgeB said:

It's not. You can try upgrading to v6.9 final, which might include a newer LSI driver, but LT can't do anything about this; it would be an LSI + those Seagate disks issue.

How unfortunate. Well either way thank you very much for your help with this. It looks like after a lot of trial and error I have reached a stable server configuration again.
