Server seems to be crashing during parity rebuild



I had a disk go into an error state a few days ago. I went through the spaceinvaderone video on XFS repair, ran SMART tests and all that jazz, and couldn't find anything wrong with the disk. No errors or anything, so I re-added it to the array and started a parity rebuild of the drive's data. That's standard procedure, per my understanding, and I got through all of it without any issues or questions.

 

However, I keep coming back to my server to check in on the status of that parity rebuild only to find the array is stopped and the parity rebuild never completed. It has happened 3 times now. No idea what's going on.

 

I do see a lot of weird errors in the syslog about a USB device not responding, and the GUI notifications warn that UPS communication keeps dropping in and out. My server is connected to its UPS via a USB cable, so I wonder if the two are related. However, other things plugged into that UPS appear fine, so I don't think the UPS is failing to supply power. And if power isn't the issue, I'm not sure what else to check on the UPS to troubleshoot it.
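For anyone wanting to line those two symptoms up, here's a rough sketch of pulling the relevant lines out of a syslog. It assumes the UPS is handled by apcupsd (the daemon behind Unraid's built-in UPS settings) and that the live log sits at /var/log/syslog. `ups_usb_events` is a made-up helper name, not an Unraid command.

```shell
# Hypothetical helper: print every syslog line that mentions USB or apcupsd.
# Assumption: the UPS daemon logs under the name "apcupsd".
# Usage: ups_usb_events [path-to-syslog]   (defaults to /var/log/syslog)
ups_usb_events() {
    # -i: case-insensitive, -E: extended regex so "|" means "or"
    grep -iE 'usb|apcupsd' "${1:-/var/log/syslog}"
}
```

Running `ups_usb_events` against the live log, or against a syslog file extracted from a diagnostics zip, should show whether the USB errors and the UPS dropouts cluster around the same timestamps.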

 

I am also on a new USB boot drive. The previous one was the original drive I built the first iteration of my Unraid server with in like 2017, and it finally gave up the ghost. The replacement is a USB drive that spaceinvaderone recommends for Unraid, so I don't think it's what's causing the USB errors.

 

I added a Coral M.2 TPU and two 80mm exhaust fans recently. The Coral seems to be working fine, and I mention the exhaust fans because I saw in another thread that Squid mentioned CPU overheating can cause this kind of random shutdown. My CPU hasn't overheated before, and it has more fans now, so I'm pretty sure that's not it. The new fans went in because the drives can run a little hot on hot days, but the ambient temperature has been low recently.

 

All told, I've made a fair number of changes recently, which makes this a little harder to troubleshoot, and I have yet to find the actual crash or shutdown or whatever is happening in the logs.

 

Still looking - I have a feeling the answer is in here somewhere - I'm just not super familiar with the log format and haven't tracked it down yet.

 

Diagnostics attached. Let me know if y'all have any ideas - thank you!

greenplanet-diagnostics-20240401-0755.zip


Gotcha! I'm hesitant to interrupt the parity rebuild for a manual reboot - it takes 24h+ to complete and I'm 1/3 of the way through here (and hoping it doesn't crash again before it finishes this rebuild).


I also want to clarify that the server started itself back up automatically after the crash. It just wouldn't start the array, because the error-state drive needed a parity rebuild (that counts as a "configuration change" and disables autostart). So, if I'm understanding correctly, the server has already been restarted since the crash, and I'm not sure that rebooting now would give us the time window in syslog-previous.txt that we're looking for. I'd like to confirm that before considering interrupting a parity rebuild.
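One way to check which time window a given log actually covers is to print its first and last timestamps. This is only a sketch: it assumes the classic syslog line layout where the first three fields are month, day, and time, and `log_window` is a hypothetical helper name.

```shell
# Hypothetical helper: show the first and last timestamp in a syslog file.
# Assumption: each line starts with "Mon DD HH:MM:SS" (three whitespace-
# separated fields). Padding in the day field is collapsed to one space.
# Usage: log_window path-to-syslog
log_window() {
    awk 'NR == 1 { first = $1 " " $2 " " $3 }
         { last = $1 " " $2 " " $3 }
         END {
             if (NR) {
                 print "first: " first
                 print "last:  " last
             }
         }' "$1"
}
```

If the window printed for syslog-previous.txt starts after the crash, that file can't contain the crash itself.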

 

I do see references to an unclean shutdown in the syslog contained in my diagnostics, which makes me think maybe there is a power issue forcing the server to shut down?


There isn't anything in /boot/logs other than the diagnostics I've generated trying to troubleshoot this:

 

# ls -lah /boot/logs
total 656K
drwx------  2 root root  16K Apr  1 17:50 ./
drwx------ 10 root root  16K Dec 31  1969 ../
-rw-------  1 root root 224K Apr  1 07:56 greenplanet-diagnostics-20240401-0755.zip
-rw-------  1 root root 176K Apr  1 16:55 greenplanet-diagnostics-20240401-1655.zip
-rw-------  1 root root 217K Apr  1 17:01 greenplanet-diagnostics-20240401-1701.zip

 

I've now triple-checked that the syslog server is indeed enabled, so I've also tried disabling syslog rotation in the hope that helps.

 


Sure! Here's what I have for syslog server settings.

 

Also, I've successfully completed the parity rebuild without further crashes! It's hard to say for sure that I 100% fixed it, given the crashes were random and I might just be on a lucky streak, but the change that seems to have helped was removing the UPS communication cable.

 

The cable has an RJ45-style connector on the UPS end (USB on the Unraid end), and I noticed the stay clip on the RJ45 was busted. That probably means the cable took a hit at some point, and I thought perhaps the damaged cable was disrupting Unraid's communication with its UPS. Because my Unraid server is configured to shut itself down when the UPS battery only has a few minutes of runtime left, it's possible that disruptions in UPS communication were causing the server to shut down.
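A quick way to sanity-check this theory, assuming the UPS is managed by apcupsd: `apcaccess status` prints the daemon's current view of the UPS, and the sketch below filters that down to the fields that matter for a low-battery shutdown. `ups_summary` is a hypothetical helper name. The field names are the ones apcaccess normally reports, but not every UPS model exposes all of them.

```shell
# Hypothetical helper: filter `apcaccess status` output (read from stdin)
# down to link state, charge, runtime, and the shutdown thresholds.
# Usage: apcaccess status | ups_summary
ups_summary() {
    # STATUS   - e.g. ONLINE, ONBATT, COMMLOST
    # BCHARGE  - current battery charge
    # TIMELEFT - estimated runtime remaining
    # MBATTCHG / MINTIMEL / MAXTIME - configured shutdown thresholds
    grep -E '^(STATUS|BCHARGE|TIMELEFT|MBATTCHG|MINTIMEL|MAXTIME)[[:space:]]*:'
}
```

A STATUS of COMMLOST would line up with the communication warnings. Whether lost communication alone can trigger a shutdown depends on how apcupsd is configured, so treat this as a diagnostic aid rather than proof.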

syslog.PNG


Weird! I only see references in the syslog server documentation to it being for a remote machine, nothing about setting it to the local server IP. 

Quote

Remote Syslog Server: This is used when you have another machine on your network that is acting as a syslog server. This can be another Unraid server. You can also use virtually any other computer. You find the necessary software by googling for the syslog server of that computer's operating system. After you have set up the computer/server, you fill in the computer/server name or the IP address. (I prefer to use the IP address as there is never any confusion about what it is.) Then click on the 'Apply' button and your syslog will be mirrored to the other computer.

The other computer has to be left on continuously until the problem occurs.

The events captured will only start with the point at which the syslog daemon is started during the boot process thus missing the very start of the boot process.

But sure, I can add the Unraid server's LAN IP there instead of leaving it blank. Thanks!


Thanks for the help! I haven't had another crash since I removed the UPS communication cable entirely, so as far as I can tell, that was the issue. My server is configured to shut itself down proactively if the UPS is running low on battery, so I can see how the comms cable going bad could've caused shutdown issues. This is probably the last update to the thread, just for posterity's sake in case somebody finds it while searching for a similar issue and wants to know how I fixed it. Thanks again to all who helped!
