[Solved] Unraid becomes unreachable

RIDGID · April 6, 2021

Recently Unraid has gone fully unresponsive on me a couple of times: web UI gone, no SSH, no ping response, not visible to my router, and no video output, though the machine is still powered on and active. Trying to restart from IPMI gives the error below.

Screenshot 2021-04-05 181406.png

The first unclean shutdown was 31 Mar; the 12TB parity check finished 02 Apr at 5am. It crashed again 04 Apr, and a parity check started at 2pm. The server went unresponsive again this afternoon, 05 Apr at 6pm. When I rebooted I got a notification that the parity check finished with 0 errors (Average speed: nan B/s), so I assume it actually failed. I've now rebooted and the parity check is currently running.

I've attached my diagnostics, maybe someone smarter than me could lend some insight as to what may be causing these crashes? Nothing in the logs jumped out at me, but I am not quite sure what to look for.

Only two notable changes I've made recently to the otherwise stable server are:

1. Changing the server name via Settings>Identification

2. Upgrading to 6.9.1

supermicro-diagnostics-20210405-1821.zip

trurl · April 6, 2021

We need syslog from before reboot to see what is happening.

https://wiki.unraid.net/Troubleshooting#Persistent_Logs_.28Syslog_server.29

trurl · April 6, 2021

Also, Diagnostics have more complete information for us if the array is started when you take them.

RIDGID · April 6, 2021

I do not have the syslog unfortunately, but I will going forward. Attached are diagnostics with the array running (none of my dockers or VMs are running during the parity check, though). It looks like the syslog is where the useful info will be, so I will probably have to reproduce the issue.

supermicro-diagnostics-20210406-0921.zip

trurl · April 6, 2021

Unrelated, but your system share has files on the array (disk11). Those files are always open when Docker and VM Manager are enabled, so they will keep disks spun up, and docker / VM performance will be impacted by the slower array. Mover (or anything else) can't move open files, so you would have to go to Settings and disable Docker and VM Manager to get them moved.

Also, docker.img is 50G. Have you had problems filling it? 20G is usually more than enough and making it larger won't fix filling it, it will only make it take longer to fill.

RIDGID · April 6, 2021

Thanks, that is great info I never would have picked up on myself!

Looks like my libvirt.img file got moved to disk 11 somehow; I've moved it back into the cache/system folder. No clue how it could have happened, but I'm glad to fix it and will keep an eye on it in the future.

As for the docker.img, a few years back I had an issue with it getting filled (I believe by radarr or rutorrent logs or something), and I probably increased the size hoping to fix the issue, as I had a 2TB cache at the time. It's been that way so long I forgot 50G wasn't the default. Any advantage to making it smaller other than saving 30G on the cache?

trurl · April 6, 2021

2 hours ago, RIDGID said:

never would have picked up on myself

You can see how much of each disk is used by each user share by clicking Compute... for the share on the User Shares page, or Compute All button.
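For anyone who prefers the command line, the same per-disk breakdown can be roughly approximated with a short script. This is only a sketch, not what the Compute button does internally: the function name `share_usage_by_disk` is made up for illustration, it assumes array disks are mounted as `/mnt/disk1`, `/mnt/disk2`, ... and the pool as `/mnt/cache`, and it sums apparent file sizes, so sparse images may report more than they actually occupy.

```python
import os
import re
from pathlib import Path

# Assumption: array disks appear as diskN and the pool is named "cache".
DISK_NAME = re.compile(r"^(disk\d+|cache)$")

def share_usage_by_disk(share, mnt_root="/mnt"):
    """Return {disk_name: total_bytes} for one share on each disk that holds it."""
    root = Path(mnt_root)
    usage = {}
    if not root.is_dir():
        return usage
    for disk in sorted(root.iterdir()):
        if not DISK_NAME.match(disk.name):
            continue  # skip /mnt/user, /mnt/disks, and anything else
        share_dir = disk / share
        if not share_dir.is_dir():
            continue  # this disk holds no part of the share
        total = 0
        for dirpath, _dirnames, filenames in os.walk(share_dir):
            for name in filenames:
                try:
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        usage[disk.name] = total
    return usage
```

Calling `share_usage_by_disk("system")` on the server would list every disk that holds part of the system share, which makes a stray file on an array disk easy to spot.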

2 hours ago, RIDGID said:

No clue how it could have happened

If you enable Docker or VM Manager without a cache disk it gets created on the array.

2 hours ago, RIDGID said:

Any advantage to making it smaller other than saving 30g on the cache?

That plus next time I see your diagnostics for another issue I won't think I need to comment on it.

RIDGID · April 10, 2021

Here is my syslog from the most recent crash. Looking at this bit

Apr  9 06:52:48 Supermicro kernel: mce: [Hardware Error]: Machine check events logged
Apr  9 06:52:48 Supermicro kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Apr  9 06:52:48 Supermicro kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010093
Apr  9 06:52:48 Supermicro kernel: EDAC sbridge MC0: TSC 6d1be24ad4e8c 
Apr  9 06:52:48 Supermicro kernel: EDAC sbridge MC0: ADDR c4fce24c0 
Apr  9 06:52:48 Supermicro kernel: EDAC sbridge MC0: MISC 40381286 
Apr  9 06:52:48 Supermicro kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1617965568 SOCKET 0 APIC 0
Apr  9 06:52:48 Supermicro kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0xc4fce2 offset:0x4c0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:8 rank:1)

I'm guessing bad memory stick?

syslog

JorgeB · April 10, 2021

4 hours ago, RIDGID said:

I'm guessing bad memory stick?

Looks like it, there might be more info in the system/ipmi event log.
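The status word in that log (`8c00004000010093`) can also be decoded by hand to check this reading. The sketch below assumes the standard Intel `IA32_MCi_STATUS` bit layout from the Intel Software Developer's Manual; the function name is hypothetical, and the corrected-error count field is only meaningful on CPUs that report it.

```python
# Memory-controller MCA codes have the form 000F 0000 1MMM CCCC.
MEMORY_OPS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}

def decode_mci_status(status):
    """Split a 64-bit machine check status word into its main fields."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Bit 7 set with bits 15:13 and 11:8 clear marks a memory controller error.
    if mca_code & 0xEF80 == 0x0080:
        fields["memory_op"] = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        fields["channel"] = None if channel == 0xF else channel
    return fields

if __name__ == "__main__":
    for key, value in decode_mci_status(0x8C00004000010093).items():
        print(f"{key}: {value}")
```

For the value in the syslog this gives a valid, corrected (not uncorrected) memory read error on channel 3 with a corrected count of 1, which agrees with the EDAC line `1 CE memory read error ... channel:3` and `err_code:0001:0093`.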

RIDGID · April 10, 2021

Good call.

Correctable Memory ECC @ DIMMC1(CPU1) - Asserted

Repeated ad nauseam in the event log. I believe I've isolated the bad stick and removed it, though it was in H1, not C1, so I will monitor for additional issues.

Marking this solved, as I now know to look at the syslog and IPMI events. Thanks for the assistance.
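To keep an eye on it, corrected errors in a saved syslog can be tallied per location with something like the sketch below. The regular expression is written against the single EDAC line format quoted earlier in this thread, so treat it as an assumption that other kernels or drivers print the same layout; `tally_corrected_errors` is an illustrative name. How EDAC channel and slot numbers map to the slot labels printed on the board is board-specific and is not something this script can tell you.

```python
import re
from collections import Counter

# Matches lines such as:
# "... kernel: EDAC MC0: 1 CE memory read error on <label> (channel:3 slot:0 ..."
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+)"
)

def tally_corrected_errors(lines):
    """Sum corrected-error counts per (controller, channel, slot, label)."""
    totals = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            key = (
                int(match["mc"]),
                int(match["channel"]),
                int(match["slot"]),
                match["label"],
            )
            totals[key] += int(match["count"])
    return totals

def tally_file(path):
    """Convenience wrapper: tally corrected errors from a syslog file."""
    with open(path, errors="replace") as handle:
        return tally_corrected_errors(handle)
```

If the count keeps climbing for the same channel and slot after a stick has been pulled, the wrong DIMM was probably removed.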
