[Solved] Unraid Crash then Unresponsive



Firstly a little about my setup:

- HP N40L Microserver running unRAID 6.7.2 with 6 Array Drives (1 x 8TB Parity, 1 x 8TB Data + 4 x 4TB Data) and 1 Cache Drive (500GB)

- System is used simply to download and store media files on my local network. It is attached to an APC UPS and hardwired directly to the Wi-Fi Modem/Router; all other access is over Wi-Fi from various iOS devices

- Dockers: just the basics - Plex Media Server (LimeTech), ruTorrent (Linuxserver.io) and Sickbeard (all up to date with latest versions)

- Uptime prior to Crash was 52 Days, and Parity was last checked only 2 Days ago, with 0 errors. System has been rock-solid and I have never encountered any major problems.

 

The Incident:

Last night I arrived home from work, opened up ruTorrent and added a couple of magnet links. After adding the links ruTorrent became unresponsive. Since various versions have had issues with 100% CPU usage, and I often see the Docker reboot itself, I was not too concerned. I waited a couple of minutes and tried again but could not reach ruTorrent, so I tried unRAID Home with the same result: no GUI access. I fired up Termius and logged in via Telnet successfully, then issued the 'poweroff' command to shut down the server (stupidly forgetting to save a Syslog or Diagnostics, as I did not initially think much of the freeze).

 

On reboot I was able to access the GUI, the Array was valid and all Array Drives were present, but I noticed that my Cache Drive was missing from the list of available Drives. I initially thought that maybe my old spinning rust disk had died and had been the cause of the crash. I tried opening the syslog from the GUI but it had become unresponsive again. I figured that the 'dead' disk was causing slow reads/errors, so I logged in again via Telnet successfully and shut down again.

 

I opened up my server and replaced the Cache Drive with a (Brand) New SSD I had in the cupboard and booted back up. When I first booted I opened the GUI successfully but noticed multiple Array Drives missing. I quickly shut the server down (via GUI), opened it back up and checked/re-seated all connections. On rebooting I was successfully able to access the GUI and confirm that all Disks were present, the Array was valid, and the new Cache Drive was detected. I assigned the new Cache Drive and clicked Start to start the Array but the GUI did not respond. At this point I was becoming seriously concerned. I thought maybe my motherboard had grenaded and was dropping disks, so I tried to connect via Telnet but this did not respond initially either. After trying again I was able to access the server via Telnet and confirm that all Disks were present and detected using the ls -l /dev/disk/by-id command. I then shut down again, and did some research online about what steps to take.
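For anyone following along, the disk check over Telnet can be sketched roughly as below. The only command taken from my post is ls -l /dev/disk/by-id; the function name list_whole_disks and its optional directory argument are made up for illustration, and the filter assumes partition entries end in -part<N>, which is the usual udev naming.

```shell
#!/bin/bash
# Hypothetical helper (not an unRAID command): list whole-disk entries from
# /dev/disk/by-id, skipping the "-partN" partition entries so it is easier
# to count the physical drives.
list_whole_disks() {
    local dir="${1:-/dev/disk/by-id}"   # directory argument is an assumption, handy for testing
    local entry
    for entry in "$dir"/*; do
        # Entries in by-id are symlinks to the kernel device nodes
        [ -L "$entry" ] || continue
        case "$entry" in
            *-part[0-9]*) continue ;;   # skip partitions
        esac
        printf '%s -> %s\n' "$(basename "$entry")" "$(readlink "$entry")"
    done
}

list_whole_disks
```

Each physical drive normally appears more than once (for example an ata- entry and a wwn- entry), so compare serial numbers rather than counting lines.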

 

I rebooted the server (again) and this time was unable to access the GUI at all (previous times I had been able to load the unRAID Home page, but it had not responded to commands), but could still Telnet in. I saved a copy of the Syslog and Diagnostics, which I have attached to this post (Note: on issuing the Diagnostics command I received a few lines of errors referencing missing info and giving line references to dynamix files etc, but the command completed successfully).
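Because unRAID runs from RAM, the live syslog is lost on shutdown, which is what caught me out the first time. Below is a rough sketch of saving a copy before powering off. The function name save_syslog is made up, and the default paths (/var/log/syslog for the log, /boot/logs on the flash drive for the copy) are assumptions based on a typical unRAID install, so adjust them if yours differ.

```shell
#!/bin/bash
# Hypothetical helper (not an unRAID command): copy the live syslog to a
# location that survives a reboot and print the path of the copy.
# Default paths are assumptions for a typical unRAID install.
save_syslog() {
    local src="${1:-/var/log/syslog}"
    local dest_dir="${2:-/boot/logs}"
    local stamp
    stamp="$(date +%Y%m%d-%H%M%S)"
    if [ ! -r "$src" ]; then
        echo "cannot read $src" >&2
        return 1
    fi
    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/syslog-$stamp.txt" || return 1
    echo "$dest_dir/syslog-$stamp.txt"
}

# Usage over Telnet/SSH, before issuing poweroff:
#   save_syslog
#   diagnostics
```

Running the diagnostics command afterwards is what produced the zip attached to this post.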

 

I am seeking the help of the unRAID community in diagnosing/fixing this issue. I have always found the community super-helpful and have been able to resolve any minor issues I have had in the past. At this point I am not sure what hardware is to blame (Mobo, RAM, USB/Boot Drive), or if it could be a Software issue with a corrupt file(s).

 

Could someone more knowledgeable than me take a look at the attached Diagnostics, help diagnose the cause(s) and/or let me know if there are any other steps I should take?

syslog.txt watchtower-diagnostics-20190820-2108.zip

Edited by blu3_v2

After blowing apart the server again and rechecking and re-seating all drives and connections, I sat down and read through every log from the Diagnostics.zip that I could decipher. I could not find any errors in the SMART reports etc., and the captured syslog showed no errors: the server had started up successfully and was ready to start with a Valid Array before I logged in via Telnet and issued the poweroff command.

 

I then set about troubleshooting any other potential issues and thought I would try rebooting my Modem/Router and all the wireless iOS devices that I use to access the server on my LAN. After booting up the server (again) I was able to access the GUI, Start the Array, and access all Dockers etc. (all using my original Cache Drive, so all Dockers and Data were preserved, including in-progress downloads). So it appears that it was simply a LAN issue (IP conflict, hardware issue with Router???).
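In hindsight, a quick network check from the Telnet session might have pointed at the LAN sooner. The sketch below is only that: the function names gateway_from_routes and check_gateway are made up, it assumes the standard ip and ping tools are available on the server, and it cannot prove or rule out an IP conflict. It only shows whether the server can reach the router.

```shell
#!/bin/bash
# Hypothetical helper: read `ip route` output on stdin and print the
# address of the first default gateway found.
gateway_from_routes() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Hypothetical helper: ping the default gateway three times and report
# whether it answered. Assumes iproute2 `ip` and `ping` are installed.
check_gateway() {
    local gw
    gw="$(ip -4 route show default 2>/dev/null | gateway_from_routes)"
    if [ -z "$gw" ]; then
        echo "no default gateway found"
        return 1
    fi
    if ping -c 3 -W 2 "$gw" >/dev/null 2>&1; then
        echo "gateway $gw reachable"
    else
        echo "gateway $gw NOT reachable"
        return 1
    fi
}

# Usage over Telnet/SSH:
#   check_gateway
```

If the gateway answers from the server but the GUI still hangs from the iOS devices, that at least suggests looking at the router or Wi-Fi side before swapping hardware in the server.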
