GMAsterAU Posted January 2, 2021 Happy New Year, everybody. I have been having some strange issues with a disk since upgrading to 6.9.0-rc2. According to the Unraid GUI, it had 1024 errors and is now being emulated. This is the second time this has happened within a week. The first time, I checked all the connections and had the disk rebuilt, and it worked well for a couple of days before it happened again. In response to the errors I ran an extended SMART test, and it passed. I also cannot see any indication in the SMART report of the disk failing. I have attached the diagnostics. The disk in question is DISK 1. FYI, the disk is connected via a Silverstone ECS04 RAID card. tower-diagnostics-20210103-0736.zip
JorgeB Posted January 3, 2021 Swap both cables and the slot with a different disk and see if the problem follows the disk.
GMAsterAU Posted January 3, 2021 I just did that, and I am greeted with the message: 'Array has turned good. Array has 0 disks with read errors'. The disk is still emulated. I imagine I now have to go through the whole rebuild routine?
JorgeB Posted January 4, 2021 Quoting GMAsterAU: "I imagine that I now have to go through the whole rebuild routine?" Yes.
GMAsterAU Posted January 4, 2021 I went through the rebuild and everything went well, so now I am again at a loss. The disk does not have any UDMA CRC errors and no reallocated sectors either. This seems odd to me: from reading other posts and from past experience, a bad disk has always shown some form of increasing, permanent errors, while a bad connection made the CRC error count go up and then stop increasing once the connection issue was fixed. In this case, though, I got neither scenario. JorgeB, do you happen to know what may have caused the strange errors that Unraid displayed? Do you recommend I send the disk back for a warranty replacement after only 1300 hours of use?
trurl Posted January 4, 2021 Quoting GMAsterAU: "a bad connection made the CRC error count go up" Not always. In fact, not usually. A bad connection will often result in the disk not even knowing there was a problem, since it can't even be accessed. CRC and other SMART attributes are stored within the disk, and if the disk didn't know there was a connection problem, it can't store anything about it. If you want further advice based on your current situation, post new diagnostics.
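For reference, the two counters mentioned above (Reallocated_Sector_Ct, attribute 5, and UDMA_CRC_Error_Count, attribute 199) can be read from the attribute table that `smartctl -A /dev/sdX` prints. Below is a minimal Python sketch of a parser for that table, assuming the usual ten-column ATA attribute layout; `smart_raw_values` is a hypothetical helper and `/dev/sdX` is a placeholder. As trurl explains, a zero here does not rule out a cabling problem, because the disk can only record what it actually sees.

```python
def smart_raw_values(smartctl_output,
                     wanted=("Reallocated_Sector_Ct", "UDMA_CRC_Error_Count")):
    """Return {attribute_name: raw_value} for the wanted SMART attributes.

    Assumes the ATA attribute table printed by `smartctl -A`, whose rows are:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; headers and other lines are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in wanted:
            # RAW_VALUE is usually a plain integer for these two attributes;
            # any other (vendor-specific) format is kept as text.
            values[name] = int(raw) if raw.isdigit() else raw
    return values
```

On the server, the input could come from `subprocess.run(["smartctl", "-A", "/dev/sdX"], capture_output=True, text=True).stdout`. Watching whether these values change over time is more informative than a single reading.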
GMAsterAU Posted January 5, 2021 Thanks for that. If you don't mind having a look, I would greatly appreciate it. tower-diagnostics-20210105-1700.zip
JorgeB Posted January 5, 2021 If you swapped cables as recommended, just wait to see if it happens again. If it happens to the same disk again, it might be failing despite no SMART issues.
GMAsterAU Posted January 5, 2021 Thank you for that. I will update once I know more.
GMAsterAU Posted January 30, 2021 After almost a month without issues, the same disk issue appeared again. Overnight at 2 am, DISK 1 showed read errors again. Once again, as far as I can see, there are no SMART errors reported. Following our previous discussion, is the bottom line that the disk is failing in spite of no SMART errors? Is there any way to know what kind of errors these are? tower-diagnostics-20210131-0809.zip
JorgeB Posted January 31, 2021 If the same disk keeps failing and you have ruled out cables, it's likely a disk problem. You can also try using it with a different controller if you haven't done so yet.
GMAsterAU Posted January 31, 2021 Thanks JorgeB, I will give that a try and see what happens. It is still unclear to me how the SMART stats stay OK while the disk has otherwise been totally fine, including through a complete parity check.
GMAsterAU Posted February 25, 2021 Hi all, I did a lot of poking and testing, and the symptoms keep getting weirder. Note that all of these issues started, and have persisted, with version 6.9.0-rc2.
1. I currently have 2 disks sent away for replacement, so they are missing from the array as discussed above.
2. I have discovered that two of the 4 disks connected to the RAID card (Silverstone ECS04) show errors after about 1-2 days of server uptime. When I restart the server, the errors are cleared and everything is good again until the cycle repeats.
2.1 In response, I have increased cooling for the RAID card, although when I checked its temperature it did not exceed the manufacturer's recommendations.
3. As part of this whole "weird errors are happening" situation, I have also discovered that user shares do not show up in the SHARES menu, and when I connect to the server I can see only some of the folder trees while everything else is missing. A restart is required to fix it.
Screenshots attached: before restart and after restart.
I have a few key questions to understand what is going on:
1. Does anyone know what kind of errors Unraid records and counts in the Main menu when a disk's error count goes up?
2. Why do these errors get reset?
3. What could lead to me having to restart the server to get it all sorted?
4. What governs the share information, and where is it stored? Am I looking at a failed RAM module, perhaps?
Thanks for all your help with this.
tower-diagnostics-20210226-0643_before restart.zip tower-diagnostics-20210226-0702_after_restart.zip
trurl Posted February 25, 2021 Any failed attempt to read or write a disk is counted; you can see these in the syslog in your diagnostics.
The error counts in the Errors column on Main always start at zero when the server boots. You can also reset them at Main - Array Operation - Clear Stats.
Maybe a controller problem that the reboot resets. It also looks like disks 5 and 6 were having problems, so something they have in common might be the culprit.
The user shares are simply the aggregate of all top-level folders on the pools and array, just another view of the disk files. If you create a user share, Unraid creates a top-level folder named for the share on the pools or array as needed, according to the settings for the share. Conversely, any top-level folder on the pools or array is automatically a user share named for the folder. Problems reading the disks can sometimes interfere with aggregating the folders, and so the user shares are "broken".
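To make the aggregation trurl describes concrete, here is a minimal Python sketch that builds a share list the same way: every top-level folder on each disk or pool becomes a share, and a share spans every disk that has a folder with that name. It assumes Unraid's usual mount points (`/mnt/disk1`, `/mnt/disk2`, ... plus a pool such as `/mnt/cache`), and `aggregate_user_shares` is a hypothetical helper, not part of Unraid. It also illustrates why disk read problems can make shares look incomplete: a disk that cannot be listed simply contributes no folders.

```python
import glob
from pathlib import Path


def aggregate_user_shares(roots):
    """Map each top-level folder name to the disks/pools that contain it."""
    shares = {}
    for root in map(Path, roots):
        try:
            entries = sorted(root.iterdir())
        except OSError:
            # Missing or unreadable disk: it contributes no folders, so any
            # share stored only there disappears from the aggregated view.
            continue
        for entry in entries:
            # Only directories become shares; top-level files are ignored.
            if entry.is_dir():
                shares.setdefault(entry.name, []).append(str(root))
    return shares


if __name__ == "__main__":
    # Assumed Unraid layout: array disks at /mnt/diskN, a pool at /mnt/cache.
    roots = sorted(glob.glob("/mnt/disk[0-9]*")) + ["/mnt/cache"]
    for name, where in sorted(aggregate_user_shares(roots).items()):
        print(f"{name}: {', '.join(where)}")
```

This is only a model of the behavior described above; Unraid's own implementation (and how it handles share settings) is not shown here.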
GMAsterAU Posted February 26, 2021 Thank you @trurl. Do you have a recommendation on how to proceed? I was thinking of waiting for the replacement drives to arrive and then rebuilding the array. I also have a small cooling fan coming for the RAID card, as I have read that temperature issues can lead to corruption.
JorgeB Posted February 26, 2021 Quoting GMAsterAU: "I have discovered that two out of 4 disks that are connected to the RAID card (Silverstone ECS04), show errors after about 1 -2 days of Server up-time" Are both disks the same model?
GMAsterAU Posted February 26, 2021 Quoting JorgeB: "Are both disks from the same model?" Yes, they are; both are 8TB IronWolf.
JorgeB Posted February 26, 2021 I believe there have been other reports of issues with those disks when used on an LSI controller with v6.9, possibly a driver issue. You could try connecting them to the onboard SATA ports, swapping them with disks of a different model, of course.
GMAsterAU Posted February 26, 2021 Quoting JorgeB: "I believe there have been other reports of issues with those disks when used on a LSI with v6.9, possibly a driver issue, you could try connecting them to the onboard SATA, of course swap with disks from a different model." Sure! I will try that tomorrow morning and report back.
GMAsterAU Posted March 2, 2021 Quoting GMAsterAU (February 26, 2021): "sure! I will try that tomorrow morning and report back" @JorgeB I am amazed! So far the swap has worked fine. No errors have been reported with the same usage after more than 2 days, whereas previously errors showed up after 1 day. What I did was move the RAID card connections from the 8TB Seagate IronWolf drives to two 3TB WD drives, and there have been no issues so far. Is there a way to raise this with Unraid? I imagine mine is not an isolated case.
JorgeB Posted March 3, 2021 Quoting GMAsterAU: "I imagine I am not experiencing this as an isolated case" It's not. You can try upgrading to v6.9 final, which might include a newer LSI driver, but LT (Lime Technology) can't do anything about this; it's an issue with LSI controllers combined with those Seagate disks.
GMAsterAU Posted March 3, 2021 Quoting JorgeB: "It's not, you can try upgrading to v6.9 final, it might include a newer LSI driver, but LT can't do anything about this, this would be an LSI + those Seagate disks issue." How unfortunate. Either way, thank you very much for your help with this. After a lot of trial and error, it looks like I have reached a stable server configuration again.