steve1977 Posted April 18, 2018 Diagnostic attached. Something went wrong with disk 14. SMART is OK, no errors shown. tower-diagnostics-20180418-2146.zip
JorgeB Posted April 18, 2018 The diags were taken after rebooting, so there's not much to see. SMART does look fine, so it's more likely a connection issue.
steve1977 Posted April 18, 2018 Thanks. I actually didn't reboot. My array doesn't auto-start, so I am sure it didn't reboot on its own.
JorgeB Posted April 18, 2018 steve1977 said: "so I am sure it didn't reboot on its own." It did: Apr 16 08:07:42 Tower emhttpd: unclean shutdown detected
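For anyone reading along, the check above can be sketched in a few lines of Python. This is only an illustration: it assumes the syslog text is already available as a list of lines, and the function name `find_unclean_shutdowns` and the second sample line are made up for the example. The marker string itself is copied from the log line quoted in the post.

```python
import re

# Marker text copied from the log line quoted in the thread.
UNCLEAN = re.compile(r"emhttpd: unclean shutdown detected")


def find_unclean_shutdowns(lines):
    """Return the syslog lines that report an unclean shutdown."""
    return [line.rstrip("\n") for line in lines if UNCLEAN.search(line)]


# Sample input: the first line is from the thread, the second is made-up filler.
sample = [
    "Apr 16 08:07:42 Tower emhttpd: unclean shutdown detected",
    "Apr 16 08:07:43 Tower kernel: example line without the marker",
]

hits = find_unclean_shutdowns(sample)
print(hits)
```

Any hit means the server went down without a clean shutdown at some point before that timestamp, even if nobody rebooted it on purpose.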
steve1977 Posted April 20, 2018 Hmm. That's weird. I tried to rebuild onto the same disk and the disk was disabled again with errors. Please see the diagnostic attached. I can also run an extended SMART test if helpful? tower-diagnostics-20180420-1932.zip
John_M Posted April 20, 2018 Did you check/replace the cables?
steve1977 Posted April 20, 2018 The cables are quite new. I had issues earlier with disks getting disabled every few weeks or so (and rebuilds of larger disks not working). Changing the M1015 card seemed to have fixed it, and the conclusion appeared to be that the M1015 overheated / started malfunctioning. It seems the issue is back now (or it is a genuinely faulty disk)?
John_M Posted April 20, 2018 Disk 14 has dropped offline and Disk 11's link is being reset. I don't think it matters how new the cables are. The SATA connector is a rather poor design.
JorgeB Posted April 20, 2018 Looks more like a connection issue, but it could be the disk. Swap cables/backplane with a different disk and try again. If it fails again, see if the problem follows the disk or remains with the slot/cables.
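The swap test above boils down to a small decision rule, sketched here in Python. Everything in it is illustrative: `locate_fault` and the disk/slot labels are hypothetical names, not anything Unraid provides, and a single pair of failures is weak evidence either way.

```python
def locate_fault(first_failure, second_failure):
    """Compare two failures, each given as a (disk_id, slot_id) pair.

    Returns "disk" if the same disk failed in a different slot,
    "slot/cables" if a different disk failed in the same slot,
    and "inconclusive" otherwise.
    """
    disk1, slot1 = first_failure
    disk2, slot2 = second_failure
    same_disk = disk1 == disk2
    same_slot = slot1 == slot2
    if same_disk and not same_slot:
        return "disk"
    if same_slot and not same_disk:
        return "slot/cables"
    return "inconclusive"


# Hypothetical labels: disk 14 failed in slot A, then again after moving to slot B.
print(locate_fault(("disk14", "slotA"), ("disk14", "slotB")))
```

If nothing was swapped between the two failures, the result is "inconclusive" by design, since the disk and the slot cannot be told apart.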
steve1977 Posted April 20, 2018 Pushed all the cables in again and rebooted. Rebuild in progress. Also running a short SMART test, but it is stalling at 90%. I didn't swap yet as I wasn't sure whether I can do this and still rebuild the disk. If I were to switch disks 13 and 14, I wonder whether this could prevent me from rebuilding?
JorgeB Posted April 20, 2018 steve1977 said: "Didn't swap yet as I wasn't sure whether I can do this and still rebuild the disk." You can.
steve1977 Posted April 20, 2018 OK, I swapped two disks. Now the short SMART test doesn't complete ("interrupted host reset"). Full diagnostic attached. tower-diagnostics-20180420-2357.zip
JonathanM Posted April 20, 2018 steve1977 said: "Ok, I swapped two disks." The server was powered off for the swap?
JorgeB Posted April 20, 2018 You're supposed to power down before swapping cables. Reboot.
steve1977 Posted April 20, 2018 That's what I have done. Power down. Swap cables. Start.
JorgeB Posted April 20, 2018 Sorry, I misunderstood. "Interrupted by host" usually means another process is trying to access the disk, e.g. a spin-down can cause that. Try rebuilding again.
steve1977 Posted April 21, 2018 I rebooted and it is now rebuilding. It is quite slow though (compared to prior rebuilds), but no error (yet). The short SMART test completed. I am now running an extended SMART test. Let's see whether this completes and shows anything. I am quite sure that it is not the cable. Hopefully, it is the HD. If not the HD, I am quite sure that it is yet again the M1015. I have been having these same cabling issues for years and have changed pretty much everything (incl. cables). When I changed the M1015 two weeks ago, it appeared to have finally solved it once and for all. There was some advice in another thread that the M1015 may have overheated / died, so the new one fixed it. Maybe the new one is now also fried? No idea how I can better cool it or whether this is really the issue?
JorgeB Posted April 21, 2018 You should run the extended test when there's minimal disk activity, not during a rebuild. steve1977 said: "There was some advice in another thread that the M1015 may have overheated / died, so the new one fixed it. Maybe the new one is now also fried? No idea how I can better cool it or whether this is really the issue?" If the issue continues to happen, it's more likely that the problem wasn't the HBA.
steve1977 Posted April 21, 2018 One theory is that the card just "cooled down" when I changed to the new card and that's why it worked. Now the new M1015 has been running 24/7 for a few weeks and may also have a heat issue. Anyhow, I am going ahead and switching the last component that I haven't changed yet to fix this. Will get a new PSU over the weekend and see where this leaves me. Maybe going from 850W to 1000W will do the trick...
pwm Posted April 21, 2018 Do you have any fans that move air over the HBA? Some HBAs can get quite hot if there isn't enough air circulation.
steve1977 Posted April 21, 2018 The new PSU didn't fix the issue, so I am back here. I can well imagine that heat is the issue. I am thinking of two things now: 1) installing a fan close to the heat sink (recommendations which ones?), 2) making another attempt to pass through my primary GPU to a VM and then installing two M1015s (reduced load and heat).
steve1977 Posted April 21, 2018 OK, getting closer to understanding the issue. It must be something around heat. I removed my secondary GPU, which was adjacent to the M1015. It now seems to work (better). So, most likely, the heat from the GPU has a negative impact on the HBA, or its adjacency prevents the HBA from getting air. That brings me back to the two options above. I'll try again to pass through the primary GPU to my VM (but assume I'll fail again). Then I will look for a fan for my M1015. Any advice appreciated!
pwm Posted April 21, 2018 steve1977 said: "installing a fan close to the heat sink (recommendations which ones?)" It isn't critical which fans you use, because there isn't a huge amount of heat to move away. The important thing is to not have heat pockets: if you can get reasonably cold air to reach the boards, then you will be fine even if the air speed is rather low. If you have good air movement close by, a shroud or similar to aim some of the air in the right direction might also be enough.
steve1977 Posted April 22, 2018 OK. It seems confirmed that overheating of the HBA causes my long-standing issue. I can now replicate the issue. When I install my GPU adjacent to the HBA and play a graphics-intensive game, a disk gets disabled. This is likely a result of the heat from the GPU affecting the HBA. It's now working OK after removing one of my two GPUs and keeping one slot empty between the HBA and the GPU. I anticipate the issue will get worse again once the warm summer months start. I'll look into buying an additional fan for the HBA. One way could be a dedicated one to put on top of the heat sink. No idea how to screw it to the heat sink though, or what model to pick. Another way could be to fit a general-purpose fan into the PCI slot between the HBA and the GPU. It seems you're somewhat suggesting the latter? Any recommendation on a specific model?
This topic is now archived and is closed to further replies.