steve1977 Posted August 26, 2019 Author Here is the diagnostic log after switching both the SATA and power cables. Let's see whether disk 13 is still causing trouble. If so, it must be the disk itself rather than the controller or cables. tower-diagnostics-20190826-1518.zip
steve1977 Posted August 27, 2019 Author What about now? tower-diagnostics-20190826-2345.zip
JorgeB Posted August 27, 2019 There are ATA errors on parity, which is now connected to the same SATA port disk13 was using, and errors on disk12 at practically the same time. Do they share something, like a power connector or splitter?
steve1977 Posted August 27, 2019 Author Thanks. So the issue is not disk 13. We are getting closer to narrowing it down. Let me switch only the power cable. This way, we will know whether the issue is the power or the SATA cable. It could be that 12 and 13 share the same power cable. I need to check this later. The PSU per se is highly unlikely to be the issue. I changed it twice before. The current one is quite new, over-powered and quite a good one.
trott Posted August 27, 2019 Maybe you should check the temperature of your HBA card. Those cards run very hot even under normal conditions; I'm not sure whether that would cause an issue.
steve1977 Posted August 27, 2019 Author I have now switched the power cable (but not the SATA one). Curious what is causing trouble now... Disk 13 or the parity? tower-diagnostics-20190827-1326.zip
JorgeB Posted August 27, 2019 Parity first, then disk 12 about 5 minutes later.
steve1977 Posted August 27, 2019 Author Thanks. This implies that the power connection is fine, but the SATA cable is the troublemaker. I had already changed the cable, though, so this should not be the issue. Possibly an overheating issue?
steve1977 Posted August 27, 2019 Author On second thought, the parity, which has errors first, is not on the RAID card but on the on-board controller. Besides the cable, what could cause the issue? It is suspicious that it is always the same HD failing (now parity). Any idea what could be the reason, besides heat?
JorgeB Posted August 27, 2019 It could also be that SATA port; it might be bad. As for disk12, try it on a different port/controller.
steve1977 Posted August 27, 2019 Author Disk 12 is the real issue. Disk 13 never disables / disappears / ejects. However, the errors from disk 13 seem to trigger disk 12 to disable. It is 100% consistent / replicable. Disk 12 always disables. Disk 13 only shows errors in the diagnostics (not in the GUI). After switching ports, parity has issues in the log (not in the GUI), but yet again it leads to disk 12 being disabled. I still believe it is related to the RAID card: either overheating or just running with more HD capacity than the card can handle in a stable manner. Let me explore further, though, to rule out all options. I am now trying to add a second RAID card and then move four disks from the on-board controller over to the second RAID card. This would address the concern of a potentially faulty port.
steve1977 Posted August 27, 2019 Author It seems this didn't lead to any improvement. Disk 12 yet again failed after moving some to a second raid card. From which disks do the errors originate this time? I can double-check whether they have been moved to the second RAID card. tower-diagnostics-20190827-1422.zip
JorgeB Posted August 27, 2019 Disk13 isn't showing any problems currently, only parity, which is using the same SATA port on the onboard controller that disk13 was. There are still the same problems with disk12, which is on a different controller than parity, so they are likely unrelated, though I don't remember if you already swapped disk12 around.
JorgeB Posted August 27, 2019 3 minutes ago, steve1977 said: Disk 12 yet again failed after moving some to a second raid card. Not yet, it's still rebuilding without errors on these latest diags.
steve1977 Posted August 27, 2019 Author Parity and disk 13 errors must somehow be related. It's always the same sequence: one disk triggers errors (the Unraid GUI is fine), followed by disk 12 with errors (and disk 12 being disabled). It seems, though, that I did something wrong, as I had wanted to move parity away from the on-board controller, where it may be on a faulty port. I must have done this wrong. Let me check and correct this.
steve1977 Posted August 27, 2019 Author 1 minute ago, johnnie.black said: Not yet, it's still rebuilding without errors on these latest diags. Oh... I am getting paranoid about seeing errors in the GUI... I checked it again and indeed it is not showing errors now. No clue what I missed. Let's wait and see whether things are OK now. I have now moved four disks to a second RAID card. I had planned for those four to come from the onboard controller, but I may have made a mistake and moved them from one RAID card to another instead.
steve1977 Posted August 30, 2019 Author Failed again. Any new insights from the log: which disk is this time failing together with disk 12? tower-diagnostics-20190830-0444.zip Edited August 30, 2019 by steve1977
JorgeB Posted August 30, 2019 1 hour ago, steve1977 said: which disk is this time failing together with disk 12? None, only disk12.
steve1977 Posted August 30, 2019 Author OK, I changed some cabling and which controller the drives are connected to. I am rebuilding the parity. I now see errors on disk 13, but no disk has been disabled (yet). The rebuild is in progress. Shall I stop the array or see whether it completes? tower-diagnostics-20190830-1111.zip
JorgeB Posted August 30, 2019 There is no point in continuing a rebuild with read errors on another disk, since it will be rebuilding garbage. You need to start swapping things around to rule them out: cables, controllers, PSU, etc. Also, I might not have the time to keep checking new diags multiple times a day, so check the syslog yourself. If there are read errors like these, there is still a problem, and the log tells you which disk it is:
Aug 30 17:31:49 Tower kernel: md: disk13 read error, sector=2999712512
Aug 30 17:31:49 Tower kernel: md: disk13 read error, sector=2999712520
Aug 30 17:31:49 Tower kernel: md: disk13 read error, sector=2999712528
Aug 30 17:31:49 Tower kernel: md: disk13 read error, sector=2999712536
Aug 30 17:31:49 Tower kernel: md: disk13 read error, sector=2999712544
Aug 30 17:31:49 Tower kernel: md: disk13 read error, sector=2999712552
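For anyone who wants to automate the syslog check described above, here is a minimal Python sketch. It assumes the read-error lines always look like the ones quoted in this post; the function name `count_read_errors` and the regular expression are illustrative only and not part of Unraid.

```python
import re
from collections import Counter

# Matches lines of the form quoted above, e.g.
# "Aug 30 17:31:49 Tower kernel: md: disk13 read error, sector=2999712512"
# Assumption: the md driver always logs read errors in exactly this shape.
READ_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")


def count_read_errors(lines):
    """Return a Counter mapping disk name to its number of read-error lines."""
    counts = Counter()
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The six lines quoted in the post, used here as sample input.
sample = [
    "Aug 30 17:31:49 Tower kernel: md: disk13 read error, sector=2999712512",
    "Aug 30 17:31:49 Tower kernel: md: disk13 read error, sector=2999712520",
    "Aug 30 17:31:49 Tower kernel: md: disk13 read error, sector=2999712528",
    "Aug 30 17:31:49 Tower kernel: md: disk13 read error, sector=2999712536",
    "Aug 30 17:31:49 Tower kernel: md: disk13 read error, sector=2999712544",
    "Aug 30 17:31:49 Tower kernel: md: disk13 read error, sector=2999712552",
]

for disk, n in count_read_errors(sample).items():
    print(f"{disk}: {n} read error(s)")
```

To run it against a real log, pass an open file instead of `sample`, for example `count_read_errors(open(path))`, where `path` points at the syslog file from the diagnostics zip. An empty result only means no lines of this exact form were found; it does not rule out other kinds of errors, such as the ATA errors mentioned earlier in the thread.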
steve1977 Posted September 2, 2019 Author Thanks for your help. Let me report back. I made some changes to the config, but nothing worked. I have now installed a dedicated fan for the RAID card, and things seem to work. So it may indeed have been a cooling issue. I'll keep monitoring, but this may have been solved.