April 18, 2018 Diagnostics attached. Something went wrong with disk 14. SMART is OK, no errors shown. tower-diagnostics-20180418-2146.zip
April 18, 2018 The diags were taken after a reboot, so there's not much to see. SMART does look fine, so a connection issue is more likely.
April 18, 2018 Author Thanks. I actually didn't reboot. My array doesn't auto-start, so I am sure it didn't reboot on its own.
April 18, 2018 Just now, steve1977 said: "so I am sure it didn't reboot on its own." It did: Apr 16 08:07:42 Tower emhttpd: unclean shutdown detected
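For anyone wanting to check their own syslog for the same message, here is a minimal sketch in Python. It assumes the syslog has already been read into a list of lines; the function name and the second sample line are made up for illustration, and only the first sample line comes from this thread.

```python
def find_unclean_shutdowns(lines):
    """Return the syslog lines that report an unclean shutdown."""
    marker = "unclean shutdown detected"
    return [line.rstrip("\n") for line in lines if marker in line]


sample = [
    # Line quoted in the thread above.
    "Apr 16 08:07:42 Tower emhttpd: unclean shutdown detected\n",
    # Hypothetical unrelated line, for illustration only.
    "Apr 16 08:07:43 Tower emhttpd: some other message\n",
]

for hit in find_unclean_shutdowns(sample):
    print(hit)
```

A hit means the server went down without a clean shutdown at some point before that boot, even if the array was not set to auto-start.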
April 20, 2018 Author Hmm, that's weird. I tried to rebuild onto the same disk and the disk got disabled again with errors. Please see the attached diagnostics. I can also run an extended SMART test if helpful? tower-diagnostics-20180420-1932.zip
April 20, 2018 Author The cables are quite new. I had issues earlier with disks getting disabled every few weeks or so (and rebuilds of larger disks not working). Changing the M1015 card seemed to fix it, and the conclusion appeared to be that the M1015 had overheated or started malfunctioning. It seems the issue is back now (or is it a genuinely faulty disk)?
April 20, 2018 Disk 14 has dropped offline and Disk 11's link is being reset. I don't think it matters how new the cables are. The SATA connector is a rather poor design.
April 20, 2018 Looks more like a connection issue, but it could be the disk. Swap cables/backplane with a different disk and try again; if it fails again, see whether the problem follows the disk or stays with the slot/cables.
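The swap test suggested above boils down to a small decision rule, sketched here in Python. This is only an illustration of the reasoning, not a diagnostic tool: the function name and the disk/slot labels are hypothetical, and each failure is represented as a (disk, slot) pair, or None if nothing failed.

```python
def interpret_swap_test(before, after):
    """Compare the failure seen before and after swapping positions.

    before / after: (disk, slot) tuple naming what failed, or None if
    nothing failed during that run.
    """
    if before is None or after is None:
        return "inconclusive"
    disk_before, slot_before = before
    disk_after, slot_after = after
    same_disk = disk_before == disk_after
    same_slot = slot_before == slot_after
    if same_disk and not same_slot:
        return "disk"          # the fault followed the disk
    if same_slot and not same_disk:
        return "slot/cables"   # the fault stayed with the slot
    return "inconclusive"


# Hypothetical labels for illustration.
print(interpret_swap_test(("disk14", "slotA"), ("disk14", "slotB")))
print(interpret_swap_test(("disk14", "slotA"), ("disk13", "slotA")))
```

An "inconclusive" result (for example, no failure after the swap) does not clear the hardware; it may simply mean reseating the connectors fixed a marginal contact for now, or that something else such as heat is involved.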
April 20, 2018 Author Pushed all the cables in again and rebooted. Rebuild in progress. Also running a short SMART test, but it is stalling at 90%. I didn't swap yet as I wasn't sure whether I can do this and still rebuild the disk. If I were to switch disks 13 and 14, could this prevent me from rebuilding?
April 20, 2018 52 minutes ago, steve1977 said: "Didn't swap yet as I wasn't sure whether I can do this and still rebuild the disk." You can.
April 20, 2018 Author Ok, I swapped two disks. Now the short SMART test doesn't complete ("interrupted host reset"). Full diagnostics attached. tower-diagnostics-20180420-2357.zip
April 20, 2018 19 minutes ago, steve1977 said: "Ok, I swapped two disks." The server was powered off for the swap?
April 20, 2018 Sorry, I misunderstood. "Interrupted by host" usually means another process is trying to access the disk; a spin down, for example, can cause that. Try rebuilding again.
April 21, 2018 Author I rebooted and it is now rebuilding. It is quite slow though (compared to prior rebuilds), but no errors (yet). The short SMART test completed. I am now running an extended SMART test. Let's see whether this completes and shows anything. I am quite sure that it is not the cable. Hopefully, it is the HD. If not the HD, I am quite sure that it is yet again the M1015. I have been having these same cabling issues for years and have changed pretty much everything (incl. cables). When I changed the M1015 two weeks ago, it appeared to have finally solved it once and for all. There was some advice in another thread that the M1015 may have overheated / died, so the new one fixed it. Maybe the new one is now also fried? No idea how I can better cool it or whether this is really the issue?
April 21, 2018 You should run the extended test when there's minimal disk activity, not during a rebuild. 6 hours ago, steve1977 said: "There was some advice in another thread that the M1015 may have overheated / died, so the new one fixed it. Maybe the new one is now also fried? No idea how I can better cool it or whether this is really the issue?" If the issue continues to happen, it's more likely that the problem wasn't the HBA.
April 21, 2018 Author One theory is that the card just "cooled down" when I changed to the new card and that's why it worked. Now the new M1015 has been running 24/7 for a few weeks and may also have a heat issue. Anyhow, I am going ahead and replacing the last component that I haven't changed yet to fix this. I will get a new PSU over the weekend and see where this leaves me. Maybe going from 850W to 1000W will do the trick...
April 21, 2018 Do you have any fans that move air over the HBA? Some HBAs can get quite hot if there isn't enough air circulation.
April 21, 2018 Author The new PSU didn't fix the issue, so I am back here. I can well imagine that heat is the issue. I am thinking of two things now: 1) installing a fan close to the heat sink (recommendations which ones?), 2) making another attempt to pass through my primary GPU to a VM and then install two M1015s (reduced load and heat).
April 21, 2018 Author Ok, getting closer to understanding the issue. It must be something around heat. I removed my secondary GPU, which was adjacent to the M1015. It now seems to work (better). So, most likely, the heat from the GPU has a negative impact on the HBA. Or its adjacency prevents the HBA from getting air. That brings me back to the two options above. I'll give it another shot to pass through the primary GPU to my VM (but assume I'll fail again). Then I will look for a fan for my M1015. Any advice appreciated!
April 21, 2018 2 hours ago, steve1977 said: "installing a fan close to the heat sink (recommendations which ones?)" It isn't critical which fans you use, because there isn't a huge amount of heat to move away. The important thing is to not have heat pockets: if you can get reasonably cold air to reach the boards, you will be fine even if the air speed is rather low. If you already have good air movement close by, a shroud or similar to aim some of that air in the right direction might also be enough.
April 22, 2018 Author Ok. It seems confirmed that overheating of the HBA causes my long-standing issue. I can now replicate the issue. When I install my GPU adjacent to the HBA and play a graphics-intensive game, a disk gets disabled. This is likely a result of the heat of the GPU impacting the HBA. It's now working OK after removing one of my two GPUs and keeping one slot empty between the HBA and the GPU. I anticipate the issue will get worse again once the warm summer months start. I'll look into buying an additional fan for the HBA. One way could be a dedicated one to put on top of the heat sink. No idea how to screw it to the heat sink though, or what model to pick. Another way could be to put a general fan into the PCI slot between the HBA and the GPU. It seems you're somewhat suggesting the latter? Any recommendation on a specific model?
This topic is now archived and is closed to further replies.