[Solved] Unraid becomes unreachable



Recently Unraid has gone fully unresponsive on me a couple of times. The web UI is gone, there is no SSH, I can't ping it, it's not visible to my router, and there is no video output, but the server is still powered on and active. Trying to restart from IPMI gives the error below.

 

1664122049_Screenshot2021-04-05181406.png.1e900ad09b591d85ca185e001cf3b0f7.png

 

The first unclean shutdown was 31 Mar, and the 12TB parity check finished 02 Apr at 5am. It crashed again 04 Apr, and a parity check started at 2pm. The server went unresponsive again this afternoon, 05 Apr, around 6pm. When I rebooted I got a notification that the parity check finished with 0 errors (Average speed: nan B/s), so I assume it failed. I've now rebooted again and the parity check is currently running.

 

I've attached my diagnostics, maybe someone smarter than me could lend some insight as to what may be causing these crashes? Nothing in the logs jumped out at me, but I am not quite sure what to look for.

 

Only two notable changes I've made recently to the otherwise stable server are:

 

1. Changing the server name via Settings>Identification 

 

2. Upgrading to 6.9.1

supermicro-diagnostics-20210405-1821.zip


Unrelated, but your system share has files on the array (disk11). Those files are always open when Docker and VM Manager are enabled, so they will keep disks spun up, and Docker / VM performance will suffer from the slower array. Mover (or anything else) can't move open files, so you would have to go to Settings and disable Docker and VM Manager to get them moved.

 

Also, docker.img is 50G. Have you had problems filling it? 20G is usually more than enough and making it larger won't fix filling it, it will only make it take longer to fill.


Thanks, that is great info I never would have picked up on myself!

 

Looks like my libvirt.img file got moved to disk 11 somehow. I've moved it back into the system folder on the cache. No clue how it could have happened, but I'm glad to fix it and will keep an eye on it in the future.

 

As for the docker.img, a few years back I had an issue with it getting filled (I believe by radarr or rutorrent logs or something) and I probably increased the size hoping to fix the issue, as I had a 2TB cache at the time. It's been that way so long I forgot 50G wasn't the default. Any advantage to making it smaller other than saving 30G on the cache?

2 hours ago, RIDGID said:

never would have picked up on myself

You can see how much of each disk is used by each user share by clicking Compute... for the share on the User Shares page, or Compute All button.
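If you would rather check from a terminal, here is a rough Python sketch of the same idea. It assumes the usual Unraid layout where each array disk and pool is mounted under /mnt (disk1, disk11, cache, ...) and /mnt/user and /mnt/user0 are the merged share views; the function name `share_usage_by_disk` is just something made up for this example. It sums apparent file sizes, so the numbers may not match what Compute... reports, especially for sparse image files.

```python
import os
from pathlib import Path


def share_usage_by_disk(share, mnt_root="/mnt"):
    """Return {disk_name: total_bytes} for each disk/pool that holds `share`.

    Assumes disks and pools are mounted directly under mnt_root and that
    'user' and 'user0' are merged views that should be skipped.
    """
    usage = {}
    root = Path(mnt_root)
    if not root.is_dir():
        return usage
    for disk in sorted(root.iterdir()):
        if disk.name in ("user", "user0") or not disk.is_dir():
            continue
        share_dir = disk / share
        if not share_dir.is_dir():
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(share_dir):
            for name in filenames:
                try:
                    # lstat so symlinks are not followed; st_size is apparent size
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        usage[disk.name] = total
    return usage


if __name__ == "__main__":
    for disk, size in share_usage_by_disk("system").items():
        print(f"{disk}: {size / 1e9:.2f} GB")
```

If an array disk such as disk11 shows up in the output for the system share, that is where the stray files live.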

 

2 hours ago, RIDGID said:

No clue how it could have happened

If you enable Docker or VM Manager without a cache disk it gets created on the array.

 

2 hours ago, RIDGID said:

Any advantage to making it smaller other than saving 30g on  the cache?

That plus next time I see your diagnostics for another issue I won't think I need to comment on it.


Here is my syslog from the most recent crash. Looking at this bit

Apr  9 06:52:48 Supermicro kernel: mce: [Hardware Error]: Machine check events logged
Apr  9 06:52:48 Supermicro kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Apr  9 06:52:48 Supermicro kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010093
Apr  9 06:52:48 Supermicro kernel: EDAC sbridge MC0: TSC 6d1be24ad4e8c 
Apr  9 06:52:48 Supermicro kernel: EDAC sbridge MC0: ADDR c4fce24c0 
Apr  9 06:52:48 Supermicro kernel: EDAC sbridge MC0: MISC 40381286 
Apr  9 06:52:48 Supermicro kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1617965568 SOCKET 0 APIC 0
Apr  9 06:52:48 Supermicro kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0xc4fce2 offset:0x4c0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:8 rank:1)

I'm guessing bad memory stick?
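For anyone else trying to read lines like these, below is a small Python sketch that tallies corrected (CE) memory errors per DIMM label from a syslog and decodes the raw machine-check status word. The regex is written against the exact line format shown above and may need adjusting for other EDAC drivers. The bit positions follow my understanding of the Intel IA32_MCi_STATUS layout (valid bit 63, uncorrected bit 61, corrected-error count in bits 52:38, error codes in the low 32 bits), so treat that part as an assumption to verify against the Intel manual. Both function names are made up for this example.

```python
import re
from collections import Counter

# Matches e.g. "EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (...)"
CE_LINE = re.compile(r"EDAC MC\d+: (\d+) CE .*? on (\S+)")


def count_corrected_errors(lines):
    """Sum the reported CE counts per DIMM label across syslog lines."""
    counts = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            counts[match.group(2)] += int(match.group(1))
    return counts


def decode_mci_status(status):
    """Split a 64-bit machine-check status word into a few named fields.

    Bit layout is assumed from the Intel IA32_MCi_STATUS description.
    """
    return {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    with open("syslog", errors="replace") as fh:
        for dimm, total in count_corrected_errors(fh).most_common():
            print(f"{dimm}: {total} corrected error(s)")
    print(decode_mci_status(0x8C00004000010093))
```

Under that assumed layout, the status word 8c00004000010093 from the excerpt decodes as valid, not uncorrected, with a corrected count of 1 and codes 0x0001 / 0x0093, which lines up with the "1 CE" and "err_code:0001:0093" in the last log line. A steadily growing count on a single DIMM label would support the bad-stick theory.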

syslog


Good call. 

Correctable Memory ECC @ DIMMC1(CPU1) - Asserted

Repeated ad nauseam in the IPMI event log. I believe I've isolated the bad stick and removed it, though it was in slot H1, not C1, so I will monitor for additional issues.

 

Marking this solved, as I now know to look at the syslog and IPMI events. Thanks for the assistance.

  • ChatNoir changed the title to [Solved] Unraid becomes unreachable
