DuzAwe Posted March 11, 2021 Syslog attached. Details of the build are in my sig. The motherboard BMC chip died and has been replaced. Other than that and the upgrade to 6.9, nothing out of the norm. The server is also on a UPS, if that makes any difference. thelibrary-syslog-20210311-0928.zip
JorgeB Posted March 11, 2021 Assuming loop2 is the docker image (if you're not sure, post the complete diagnostics), delete and recreate it.
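For anyone following along, here is a minimal sketch of one way to pull every line that mentions a given device out of an extracted syslog before posting. It is not an Unraid tool: the function names `lines_mentioning` and `scan_file` are hypothetical, and the file path is whatever you extracted from the attached zip. It only filters text and does not confirm what loop2 is actually backing.

```python
import re
from typing import Iterable, List


def lines_mentioning(device: str, lines: Iterable[str]) -> List[str]:
    """Return the lines that mention `device` as a whole token.

    The lookarounds keep "loop2" from also matching "loop20" or "xloop2".
    """
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(device) + r"(?![A-Za-z0-9])"
    )
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def scan_file(path: str, device: str = "loop2") -> List[str]:
    """Filter a syslog file on disk (path is supplied by the caller)."""
    # errors="replace" avoids crashing on stray non-UTF-8 bytes in a log.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return lines_mentioning(device, handle)
```

Calling `scan_file("syslog.txt")` (a placeholder path) returns the matching lines in file order, which makes it easier to see when the first error appeared.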
DuzAwe Posted March 11, 2021 Docker deleted (had to reboot, that kills all the logs and whatnot, right?) and rebuilt. I guess now we wait.
DuzAwe Posted March 13, 2021 OK, here we go, it happened again. I reformatted the cache disks after my post here. thelibrary-diagnostics-20210313-0116.zip
JorgeB Posted March 13, 2021 If it keeps getting corrupted, start by running memtest.
DuzAwe Posted March 13, 2021 To rule out RAM issues? How long should I let it run?
JorgeB Posted March 13, 2021 Ideally 24H, but if there's a considerable problem it should be found quickly.
DuzAwe Posted March 13, 2021 Anything wrong with taking two sticks at a time to another machine to test?
JorgeB Posted March 13, 2021 You can do that, but if you don't find the problem you should then still run it in the server as is.
itimpi Posted March 13, 2021 10 minutes ago, DuzAwe said: "Anything wrong with taking two sticks at a time to another machine to test?" That will check that the RAM sticks themselves are not faulty. However, sometimes RAM needs to be tested "in situ" in case there is any sort of bus loading issue affecting it. Also, different motherboard/CPU combinations may support different maximum RAM clock rates, regardless of what the RAM itself specifies as its maximum clock rate.
DuzAwe Posted March 13, 2021 Running in place now.
DuzAwe Posted March 13, 2021 Have been thinking. Since all this started I haven't been able to get the SMART tests to run on the cache drives. Is that more indicative of the drives failing than of the RAM being an issue? Memtest is two passes down with no errors.
DuzAwe Posted March 13, 2021 Nine-ish hours of testing, three complete passes, with no errors so far.
JorgeB Posted March 14, 2021 18 hours ago, DuzAwe said: "Since all this started I haven't been able to get the smart tests to run on the cache drives." You can't run SMART tests on NVMe devices. Let it run for 24H, but if it hasn't found any issues so far it likely won't. Also note that memtest sometimes can't find issues even when there are some, but the problem can come from somewhere else.
DuzAwe Posted March 14, 2021 3.5 hours left on the test and it's all clear. What should I gear up for next?
JorgeB Posted March 14, 2021 Remove two DIMMs and recreate the docker image. If it corrupts again, try with just the other two. If there are still issues, it's unlikely to be RAM related.
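The isolation steps in the post above can be sketched as a small decision function. This is only an illustration of the reasoning, assuming four DIMMs tested as two pairs; the function name and the wording of the verdicts are made up for this example, and a "stable" result only means no corruption was seen during the test window.

```python
from typing import Optional


def dimm_isolation_verdict(
    first_pair_corrupts: bool,
    second_pair_corrupts: Optional[bool] = None,
) -> str:
    """Map the outcome of each two-DIMM test to the next step or conclusion.

    `second_pair_corrupts` stays None until the second pair has been tested.
    """
    if not first_pair_corrupts:
        # Image stayed clean with the first pair installed on its own.
        return "stable on first pair: suspect the removed DIMMs"
    if second_pair_corrupts is None:
        # First pair corrupted the image; the other pair is still untested.
        return "swap to the other pair and retest"
    if second_pair_corrupts:
        # Both pairs corrupt independently, so look beyond the RAM.
        return "corrupts on both pairs: unlikely to be RAM related"
    return "stable on second pair: suspect the first pair"


if __name__ == "__main__":
    print(dimm_isolation_verdict(True))
    print(dimm_isolation_verdict(True, False))
```

Note that a clean result on either pair does not separate a bad stick from a problem that only shows up with all four slots populated.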
DuzAwe Posted March 14, 2021 24 hours, no fails. 6 full passes. Running DIMM tests now.
dimitriz Posted March 14, 2021 JFYI, I started dealing with this same issue on my own last night. Also after upgrading to 6.9.1 a day earlier.
DuzAwe Posted March 14, 2021 It's worth noting in that case that this issue started with 6.9-RC2. My setup has not changed since it was built in Jan 2020 (better times). It's run stable up to the Xmas period, with a few lockups that I'm starting to believe are related to logs filling up the RAM, but I'm not sure. Anyway, I have caught the device loop2 issue and hope to help get to the bottom of it.
DuzAwe Posted March 14, 2021 So a new vdisk was created and it's immediately corrupt. thelibrary-diagnostics-20210314-1413.zip
DuzAwe Posted March 14, 2021 Got a lovely 503 for the web GUI on my reboot. thelibrary-diagnostics-20210314-1427.zip
DuzAwe Posted March 14, 2021 Not to jinx it, but... the next set of RAM DIMMs seems to have everything stable. Logs attached just because. I have ordered ECC replacement sticks to hopefully rule this issue out permanently. Hopefully this is the end of this issue. Thanks for all the help. I'll report back should anything happen once I start using the system properly again. thelibrary-diagnostics-20210314-1533.zip