RIDGID Posted April 6, 2021
Recently I have had Unraid go fully unresponsive on me a couple of times: WebUI gone, no SSH, can't ping, not visible to my router, no video output, but still powered on and active. Trying to restart from IPMI gives the error below.
The first unclean shutdown was 31 Mar; the 12TB parity check finished 02 Apr at 5am. It crashed again 04 Apr, and a parity check started at 2pm. The server was unresponsive again this afternoon, 05 Apr at 6pm. When I rebooted I got a notification that the parity check finished with 0 errors (Average speed: nan B/s), so I assume it failed. I've now rebooted and the parity check is currently running.
I've attached my diagnostics; maybe someone smarter than me could lend some insight into what may be causing these crashes? Nothing in the logs jumped out at me, but I am not quite sure what to look for.
The only two notable changes I've made recently to the otherwise stable server are:
1. Changing the server name via Settings>Identification
2. Upgrading to 6.9.1
Attached: supermicro-diagnostics-20210405-1821.zip
trurl Posted April 6, 2021
We need the syslog from before the reboot to see what is happening. https://wiki.unraid.net/Troubleshooting#Persistent_Logs_.28Syslog_server.29
trurl Posted April 6, 2021
Also, Diagnostics have more complete information for us if the array is started when you take them.
RIDGID Posted April 6, 2021
Unfortunately I do not have the syslog from before the reboot, but I will keep one going forward. Attached are diagnostics taken with the array running (none of my dockers or VMs are running during the parity check, though). It looks like the syslog is where the useful info will be, so I will probably have to reproduce the issue.
Attached: supermicro-diagnostics-20210406-0921.zip
trurl Posted April 6, 2021
Unrelated, but your system share has files on the array (disk11). Those files are always open when Docker and VM Manager are enabled, so they will keep disks spun up, and docker / VM performance will be impacted by the slower array. Mover (or anything else) can't move open files, so you would have to go to Settings and disable Docker and VM Manager to get them moved.
Also, docker.img is 50G. Have you had problems filling it? 20G is usually more than enough, and making it larger won't fix filling it; it will only make it take longer to fill.
RIDGID Posted April 6, 2021
Thanks, that is great info I never would have picked up on myself! Looks like my libvirt.img file got moved to disk 11 somehow; I've moved it back onto the cache/system folder. No clue how it could have happened, but I'm glad to fix it and will keep an eye on it in the future.
As for the docker.img, a few years back I had an issue with it getting filled (I believe by radarr or rutorrent logs or something), and I probably increased the size hoping to fix the issue, as I had a 2TB cache at the time. It's been that way so long I forgot 50G wasn't the default. Any advantage to making it smaller other than saving 30G on the cache?
trurl Posted April 6, 2021
2 hours ago, RIDGID said: "never would have picked up on myself"
You can see how much of each disk is used by each user share by clicking Compute... for the share on the User Shares page, or the Compute All button.
2 hours ago, RIDGID said: "No clue how it could have happened"
If you enable Docker or VM Manager without a cache disk it gets created on the array.
2 hours ago, RIDGID said: "Any advantage to making it smaller other than saving 30g on the cache?"
That, plus next time I see your diagnostics for another issue I won't think I need to comment on it.
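For anyone who prefers the command line, the per-disk breakdown that Compute... shows can be roughly approximated with a script. This is only a sketch, not how the WebUI does it: it assumes the usual Unraid layout where array disks are mounted at /mnt/disk1, /mnt/disk2, ... and the cache pool at /mnt/cache, and the function name `share_usage_by_disk` and its `mnt_root` parameter are made up for illustration. It reports apparent file sizes, so a sparse docker.img may show as larger than the space it actually occupies.

```python
import os
import re
from pathlib import Path

# Assumption: array disks are named disk<N> and the pool is named "cache".
DISK_NAME = re.compile(r"disk\d+|cache")


def share_usage_by_disk(share, mnt_root="/mnt"):
    """Return {disk_name: bytes} for every disk that holds files of `share`."""
    usage = {}
    root = Path(mnt_root)
    if not root.is_dir():
        return usage
    for disk in sorted(root.iterdir()):
        if not DISK_NAME.fullmatch(disk.name):
            continue  # skip /mnt/user, /mnt/disks, etc.
        share_dir = disk / share
        if not share_dir.is_dir():
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(share_dir):
            for name in filenames:
                try:
                    # lstat: count the link itself, do not follow symlinks
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # file vanished or is unreadable; ignore it
        usage[disk.name] = total
    return usage


if __name__ == "__main__":
    for disk, size in share_usage_by_disk("system").items():
        print(f"{disk}: {size / 1e9:.2f} GB")
```

If the system share shows up under any diskN entry, those are the files that would need Docker and VM Manager disabled before Mover can relocate them.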
RIDGID Posted April 10, 2021
Here is my syslog from the most recent crash. Looking at this bit:
Apr 9 06:52:48 Supermicro kernel: mce: [Hardware Error]: Machine check events logged
Apr 9 06:52:48 Supermicro kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Apr 9 06:52:48 Supermicro kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010093
Apr 9 06:52:48 Supermicro kernel: EDAC sbridge MC0: TSC 6d1be24ad4e8c
Apr 9 06:52:48 Supermicro kernel: EDAC sbridge MC0: ADDR c4fce24c0
Apr 9 06:52:48 Supermicro kernel: EDAC sbridge MC0: MISC 40381286
Apr 9 06:52:48 Supermicro kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1617965568 SOCKET 0 APIC 0
Apr 9 06:52:48 Supermicro kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0xc4fce2 offset:0x4c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:8 rank:1)
I'm guessing a bad memory stick?
Attached: syslog
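When a syslog contains many of these entries, it can help to tally the EDAC summary lines per DIMM label rather than reading them one by one. The sketch below is a hedged example: the regular expression is written only against the single "EDAC MC0: 1 CE memory read error on ..." line quoted in this post, so other kernels or EDAC drivers may format the line differently, and `count_edac_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern derived from the one EDAC summary line quoted in this thread;
# treat it as an assumption, not a general EDAC log grammar.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+)"
)


def count_edac_errors(lines):
    """Sum corrected (CE) and uncorrected (UE) error counts per DIMM location."""
    totals = Counter()
    for line in lines:
        m = EDAC_RE.search(line)
        if m is None:
            continue  # e.g. the "EDAC sbridge MC0: ..." detail lines
        key = (
            m.group("kind"),
            m.group("label"),
            int(m.group("channel")),
            int(m.group("slot")),
        )
        totals[key] += int(m.group("count"))
    return totals


sample = [
    "Apr 9 06:52:48 Supermicro kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR",
    "Apr 9 06:52:48 Supermicro kernel: EDAC MC0: 1 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0xc4fce2 offset:0x4c0 "
    "grain:32 syndrome:0x0 - area:DRAM err_code:0001:0093 socket:0 ha:0 "
    "channel_mask:8 rank:1)",
]

if __name__ == "__main__":
    for (kind, label, channel, slot), n in count_edac_errors(sample).items():
        print(f"{n} {kind} on {label} (channel {channel}, slot {slot})")
```

To run it against a real file, pass an open file object instead of `sample`, for example `count_edac_errors(open("syslog"))`. The channel and slot numbers are the kernel's view; how they map to the silkscreen labels on the motherboard depends on the board, so the IPMI event log or the board manual is still needed to pick the physical stick.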
JorgeB Posted April 10, 2021
4 hours ago, RIDGID said: "I'm guessing bad memory stick?"
Looks like it; there might be more info in the system/IPMI event log.
RIDGID Posted April 10, 2021
Good call. This is repeated ad nauseam in the event log:
Correctable Memory ECC @ DIMMC1(CPU1) - Asserted
I believe I've isolated the bad stick and removed it, though it was in H1, not C1, so I will monitor for additional issues. Marking this solved, as I now know to look at the syslog and IPMI events. Thanks for the assistance.