rxnelson Posted September 3, 2019 About 2 weeks ago I had an instance where I thought Plex filled up my cache drive and freaked the system out, creating a bunch of errors on disk 1 and parity 1. I disabled both drives one at a time and let them rebuild. Unraid seemed happy and the drives passed a long SMART test. Fast forward to the weekend: I'm not sure what happened, as the system really wasn't in use, but I have a full log, tons of errors during the monthly parity check, and those two drives are disabled again. I tried to pull the log from the Tools menu in the GUI but it looks blank. Anything I can do to assist the diagnosis before I reboot? Thanks.
rxnelson Posted September 3, 2019 Author It looks like diagnostics has a good log file in it. towertest-diagnostics-20190903-2336.zip
trurl Posted September 4, 2019 Both parity disks and disks 1-5 have disconnected. Are they on the same controller?
rxnelson Posted September 4, 2019 Author Is there a way to tell that from a command, or do I need to physically trace back the connections? Also, could that be old log info? It looks like they are connected in the GUI right now. Maybe you are saying that is what happened in the past?
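On the question of telling from a command which controller each disk hangs off: on Linux, each entry under /sys/block is a symlink whose resolved path contains the PCI address of the controller the disk is attached to. The following is only a minimal sketch, assuming a standard sysfs layout and that Python 3 is installed on the server (the thread confirms neither). The names controller_of and disks_by_controller are hypothetical helpers made up for this illustration.

```python
import os
import re
from collections import defaultdict

# Matches a full PCI address such as 0000:01:00.0 inside a sysfs path.
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def controller_of(sysfs_path):
    """Return the last PCI address in a resolved sysfs path, or None.

    The last address is the device closest to the disk, i.e. the
    controller itself rather than an upstream PCIe bridge.
    """
    matches = PCI_ADDR.findall(sysfs_path)
    return matches[-1] if matches else None


def disks_by_controller(sys_block="/sys/block"):
    """Group sd* block devices by the PCI address of their controller."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(sys_block)):
        if not name.startswith("sd"):
            continue  # skip loop, md, nvme and other non-sd devices
        real = os.path.realpath(os.path.join(sys_block, name))
        groups[controller_of(real)].append(name)
    return dict(groups)


if __name__ == "__main__":
    for pci, disks in disks_by_controller().items():
        print(pci, " ".join(disks))
```

Disks listed under the same PCI address share a controller. Running `lspci -s` with that address should then name the card.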
trurl Posted September 4, 2019 Not just the log. Diagnostics could not get SMART for any of those disks either. If something has changed, post a new diagnostic.
rxnelson Posted September 4, 2019 Author (edited) I get what you are saying, but I can click on the disks and see the contents (except disk 1, which says 0 objects). It doesn't say anything is emulated. That diagnostics file was the latest I had. Edited September 4, 2019 by rxnelson Adding what I see in the gui currently
trurl Posted September 4, 2019 That screenshot doesn't show the Array Operations section. Are you currently rebuilding?
rxnelson Posted September 4, 2019 Author No. I’ve let it sit. I didn’t want to do anything before I was told.
rxnelson Posted September 4, 2019 Author Also, maybe a stupid question, but should I just disable the drives and re-add them like I did before?
trurl Posted September 4, 2019 You have completely filled up the log space due to all the errors on parity2. Reseat the controller card. Check all connections, SATA and power. Reboot and post a new diagnostic.
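For anyone wanting to confirm a full log before rebooting, the filesystem holding the logs can be checked from a script. This is a rough sketch only: it assumes the logs live under /var/log and that Python 3 is available, neither of which is stated in the thread, and percent_used and log_usage are hypothetical names chosen for the example.

```python
import shutil


def percent_used(total, used):
    """Return usage as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * used / total, 1)


def log_usage(path="/var/log"):
    """Report how full the filesystem containing `path` is.

    Assumption: /var/log is where the system log is kept.
    """
    usage = shutil.disk_usage(path)
    return percent_used(usage.total, usage.used)


if __name__ == "__main__":
    print(f"/var/log is {log_usage()}% full")
```

A value at or near 100 would be consistent with the blank log in the GUI: once the log space is full, nothing new can be written to it.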
rxnelson Posted September 4, 2019 Author New diagnostic. Should I remove the disks one at a time and try to rebuild them? towertest-diagnostics-20190904-0240.zip
JorgeB Posted September 4, 2019 (edited) HBA problem:
Sep 1 10:50:40 TowerTest kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Make sure it's well seated and sufficiently cooled. You can also try another PCIe slot if available, then rebuild disk1 and resync parity. Edited September 4, 2019 by johnnie.black
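The kernel message quoted above is what identifies the controller fault, so the same message can be searched for in later logs to see whether the card has dropped out again. A minimal sketch follows, assuming syslog lines follow the shape of the quoted line and that the log is at /var/log/syslog (the path is an assumption). find_hba_faults is a hypothetical helper, not part of Unraid.

```python
import re

# Assumption: lines look like the one quoted in the thread, e.g.
# "Sep 1 10:50:40 TowerTest kernel: mpt2sas_cm0: SAS host is non-operational !!!!"
HBA_FAULT = re.compile(
    r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"(mpt\dsas_cm\d+): (.*non-operational.*)$"
)


def find_hba_faults(lines):
    """Return (timestamp, device, message) for each HBA fault line."""
    hits = []
    for line in lines:
        m = HBA_FAULT.match(line.rstrip("\n"))
        if m:
            hits.append((m.group(1), m.group(2), m.group(3).strip()))
    return hits


if __name__ == "__main__":
    # Assumed log location; adjust if the syslog lives elsewhere.
    with open("/var/log/syslog", errors="replace") as fh:
        for stamp, device, message in find_hba_faults(fh):
            print(stamp, device, message)
```

Comparing the timestamps of any hits with the time the disks were disabled helps separate a controller dropout from a problem with an individual drive.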
rxnelson Posted September 6, 2019 Author Thanks for all the help. I got busy the last few days, but I did as you said and disk 1 is rebuilding.
rxnelson Posted September 8, 2019 Author It rebuilt disk 1 and survived for about a day, then it looks like the card stopped responding again and disabled the disk. Logs attached. Card going bad? The machine had been stable for a couple of years prior to this. Next steps? towertest-diagnostics-20190908-0412.zip
JorgeB Posted September 8, 2019 On 9/4/2019 at 7:43 AM, johnnie.black said: Make sure it's well seated and sufficiently cooled, you can also try another PCIe slot if available, then rebuild disk1 and resync parity. If you already tried this, it's time to try another HBA.
rxnelson Posted September 8, 2019 Author I didn’t try another PCIe slot but I did reseat the card. I can try that after the rebuild. If you don’t mind saying, what card are you using, and what firmware version? I believe I have the Dell cards that I flashed. Thanks
JorgeB Posted September 9, 2019 I'm using the same controller and firmware on some of my servers. You're using the latest firmware, which is the recommended one.
rxnelson Posted September 9, 2019 Author Ok. Thanks. I was just wondering if you’d come back and say the Dell cards are known to overheat or something like that. Anyway, I put the card in a different slot and the parity drive is resyncing now. I’ll let that finish and wait a few more days to see if the card acts up again, I guess. Thanks again
JorgeB Posted September 9, 2019 3 minutes ago, rxnelson said: I was just wondering if you’d come back and say the dell cards are known to overheat or something like that. They are known to run hot, since they are designed for servers, so a fan blowing air directly on the heatsink is highly recommended. It can be a side fan, a fan mounted below the card, one directly on top of the heatsink, etc.
JonathanM Posted September 9, 2019 2 minutes ago, rxnelson said: the dell cards are known to overheat or something like that. All server HBA cards will overheat if not actively cooled. They are designed for rack mount cases with plenty of forced-through ventilation. Putting them in desktop cases means making sure there is airflow directed over the heatsink, as most desktop cases have a pocket of stagnant air around the PCIe slots. It's assumed that high-draw devices like video cards in desktop rigs will have their own fan.
rxnelson Posted September 9, 2019 Author It’s in an Intel 4U tower-style case similar to this one: https://rover.ebay.com/rover/0/0/0?mpre=https%3A%2F%2Fwww.ebay.com%2Fulk%2Fitm%2F193042042853 I thought that bottom fan would provide airflow, and I haven’t had issues since I put it together in what looks like 2016 according to purchase history. I guess that is why I’m still not sure whether it is heat, a bad card, or heat that finally killed the card. I’ll see if I can rig something up closer to the card. For a dual PSU system it has very few Molex and SATA power connectors. I’m assuming the card doesn’t have any “live” reading of temps once in Unraid, or you guys would have already mentioned it.
JonathanM Posted September 9, 2019 19 minutes ago, rxnelson said: I thought that bottom fan would provide airflow It should if it's not obstructed. When was the last time you cleaned and checked the inside of the case? Dust is a very good insulator, and it builds up over time.
rxnelson Posted September 9, 2019 Author Funnily enough, just before the issues started I removed a secondary video card and did some housecleaning in the case. I thought maybe reseating everything would work because maybe I bumped something. Too many variables to really know. I guess I just have to eliminate them one by one.
rxnelson Posted September 23, 2019 Author Looks like it made it a week and a half or so this time. I had put the card in another slot and rigged a fan to blow on it. Time for a new card? towertest-diagnostics-20190923-0332 (1).zip
JorgeB Posted September 23, 2019 2 hours ago, rxnelson said: Time for a new card? It's what I would try.