Server randomly freezing



Hi all,

 

I've recently started experiencing strange issues where my server randomly hangs and I have to shut it down (basically by pulling the power) and then start it back up. I've been trying to figure out why.

 

The last time this happened I had to format my cache pool and restore appdata :( thankfully @johnnie.black helped me with that.

 

I've just experienced another server hang and have had to forcefully shut the server down and restart it. I managed to capture what was on the screen at the time, which I've attached along with my diagnostics. Currently doing another parity check, but the cache pool seems fine this time.


What I've done so far:

  • Disabled Intel Turbo Boost (I installed the system temp plugin and I can see my CPU does go to about 70 degrees Celsius - fans are working overtime!)
  • Ensured that the USB stick is in the USB2.0 port

 

What could this be? This last happened 2 days ago

 

Running Unraid 6.8.3 with the Nvidia plugin (which I installed recently - hope this is not causing the issue)

 

20200416_171620.jpg

server-unraid-diagnostics-20200416-1728.zip

Link to comment

The logs are from after the reboot, so they don't show what happened.  You should turn on the syslog server.

 

Probably not related, but you should check the data/power connection of your cache2 drive.  It's throwing what look like connection errors shortly after booting.

 

Best course of action now is to run the official unraid (not the nvidia build) to rule things out.  If it still hangs then try "safe mode" which disables all the plug-ins.  That will further narrow it down.

Edited by civic95man
Link to comment

Thanks @civic95man - I've reverted to the official unraid build and replaced the cable on cache drive 2 (UDMA CRC Error Count is remaining at 1 and not increasing which is a good thing right? I can't seem to scan this drive for errors as it's in a cache pool and I can only scan the 1st drive)

 

I have also done the following:

  • Changed the direction of my CPU cooler - temps have dropped
  • Added in an additional fan on the other side of my drives for push/pull config which seems to have made a difference to temp

Attached new diagnostics - are you able to let me know if it's looking better? Had a loose cable on disk 3 which raised the CRC errors but it has stopped now after I secured it.

server-unraid-diagnostics-20200417-0842.zip

Link to comment
8 hours ago, ultimz said:

UDMA CRC Error Count is remaining at 1 and not increasing which is a good thing right?

Yes, they won't revert back to zero, but as long as they aren't increasing then it's a good thing.  I do still see issues with the link resetting on cache2, so you might want to check the power/data connector one more time.  I also saw the same thing with disk3.  You should be able to run a scrub on your cache pool as a whole, which will tell you if there is corruption.
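
If you want to check the "not increasing" part without eyeballing it, here is a rough Python sketch. It assumes you have saved the attribute table printed by `smartctl -A` for the drive at two different times. The sample row below is only an illustration of the usual column layout, not something copied from your diagnostics, and the function names are made up for this example.

```python
# Sketch only: compares the raw UDMA_CRC_Error_Count between two saved
# `smartctl -A` outputs. SAMPLE_ROW is an illustrative line, not real data
# from this server.
SAMPLE_ROW = (
    "199 UDMA_CRC_Error_Count    0x0032   200   200   000"
    "    Old_age   Always       -       1"
)


def crc_error_count(smartctl_output):
    """Return the raw CRC error count from `smartctl -A` text, or None if absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            return int(fields[9])
    return None


def crc_is_stable(before, after):
    """True when both readings exist and the count has not gone up."""
    old, new = crc_error_count(before), crc_error_count(after)
    return old is not None and new is not None and new <= old


if __name__ == "__main__":
    print(crc_error_count(SAMPLE_ROW))            # raw count from the sample row
    print(crc_is_stable(SAMPLE_ROW, SAMPLE_ROW))  # same reading twice
```

A count that stays put after reseating the cable points at the old cable or connection; a count that keeps climbing means the link is still bad.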

 

You may also want to check your syslog server settings.  It looks like it's trying to send your logs to a server that it can't connect to.

 

Finally, at the end of your log I see a call trace originating from macvlan.  This could be the cause of your system hanging/crashing.  There are issues with docker where if you set a static IP for a container on br0, you get these call traces. Most of the time the kernel can recover and things continue.  Other times the system will hang.  Try turning off the static IP for your container(s) and if you absolutely need them, then try setting up the IP allocation on your router.
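
Now that you are setting up the syslog server, something like the following Python sketch could flag these traces in the saved log. It simply looks for a "Call Trace" line followed shortly by a line mentioning macvlan. The 30-line window, the function name and the sample lines are all assumptions for illustration; the sample lines are invented and are not taken from your diagnostics.

```python
# Sketch only: find kernel call traces that mention macvlan in a saved syslog.
# The window size and the sample lines below are assumptions, not real log data.

def macvlan_traces(lines, window=30):
    """Return indexes of 'Call Trace' lines followed within `window` lines by 'macvlan'."""
    hits = []
    for i, line in enumerate(lines):
        if "Call Trace" in line:
            following = lines[i + 1 : i + 1 + window]
            if any("macvlan" in later for later in following):
                hits.append(i)
    return hits


if __name__ == "__main__":
    # Invented sample, only to show the shape of the input.
    sample = [
        "kernel: some unrelated message",
        "kernel: Call Trace:",
        "kernel:  some_function+0x10/0x20 [macvlan]",
        "kernel: another unrelated message",
    ]
    print(macvlan_traces(sample))
```

In practice you would read the lines from whatever file your syslog share saves and check whether the last trace lines up with the time of a hang.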

 

If this fixes your system hanging and you feel confident everything is stable then you can go back to using the nvidia build.

Edited by civic95man
Link to comment

Thanks I had 1 docker that was set to use a custom IP on br0 and I've changed it to Bridge mode now. I've also resolved the syslog issue and I can see files being saved in my Syslog share that I created.

 

Are you able to let me know which device is causing these errors (ata4?)... is this cache2?:

image.png.8fbc849fed34a6158abc3a2e440850e6.png

 

Diagnostics attached again. Thank you for assisting me with this

server-unraid-diagnostics-20200417-1924.zip

Link to comment

you can generally find it in the syslog early on in the boot process:

Apr 17 17:45:27 Server-Unraid kernel: ata4.00: ATA-11: WDC  WDS500G2B0A-00SM50, 193163803750, 401000WD, max UDMA/133

Here it shows WDS500G2B0A-00SM50 as being connected to ata4.00.

 

further down in the log you'll see

Apr 17 17:45:27 Server-Unraid kernel: scsi 4:0:0:0: Direct-Access     ATA      WDC  WDS500G2B0A 00WD PQ: 0 ANSI: 5
Apr 17 17:45:27 Server-Unraid kernel: sd 4:0:0:0: Attached scsi generic sg4 type 0
Apr 17 17:45:27 Server-Unraid kernel: sd 4:0:0:0: [sde] 976773168 512-byte logical blocks: (500 GB/466 GiB)
Apr 17 17:45:27 Server-Unraid kernel: sd 4:0:0:0: [sde] Write Protect is off
Apr 17 17:45:27 Server-Unraid kernel: sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
Apr 17 17:45:27 Server-Unraid kernel: sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

where the sd 4:0:0:0 refers to this ata4.00.

 

Finally, towards the end of the boot process you'll see mdcmd list all of your drives in the slot assignments and you'll see the cache there

Apr 17 17:46:07 Server-Unraid emhttpd: import 31 cache device: (sde) WDC_WDS500G2B0A-00SM50_193163803750

where slot 30 is cache1, slot 31 is cache2, and so on.

That's just an example of how I found out what's going on. There are obviously different ways as well.
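
The same lookup can be sketched in Python. This version takes the ata line and the emhttpd import line quoted above and joins them on the drive's serial number rather than on the 4:0:0:0 host number, which is my own choice for the example and not the only way to do it. The regexes are written against the lines shown in this post only, so treat them as assumptions about the format, and the function name is made up.

```python
import re

# Sketch only: patterns are based on the syslog lines quoted in this thread.
ATA_RE = re.compile(r"kernel: (ata\d+)\.\d+: ATA-\d+: (.+?), (\S+), ")
IMPORT_RE = re.compile(r"emhttpd: import (\d+) cache device: \((\w+)\) (\S+)")


def map_ata_to_cache(lines):
    """Map an ata port to (sdX device, cache name) by matching serial numbers."""
    serial_to_ata = {}
    result = {}
    for line in lines:
        m = ATA_RE.search(line)
        if m:
            # group 1 = port (ata4), group 3 = serial number
            serial_to_ata[m.group(3)] = m.group(1)
            continue
        m = IMPORT_RE.search(line)
        if m:
            slot, dev, ident = int(m.group(1)), m.group(2), m.group(3)
            serial = ident.rsplit("_", 1)[-1]  # identifier ends in _<serial>
            ata = serial_to_ata.get(serial)
            if ata:
                # slot 30 is cache1, slot 31 is cache2, and so on
                result[ata] = (dev, "cache%d" % (slot - 29))
    return result


if __name__ == "__main__":
    log = [
        "Apr 17 17:45:27 Server-Unraid kernel: ata4.00: ATA-11: WDC  WDS500G2B0A-00SM50, 193163803750, 401000WD, max UDMA/133",
        "Apr 17 17:46:07 Server-Unraid emhttpd: import 31 cache device: (sde) WDC_WDS500G2B0A-00SM50_193163803750",
    ]
    print(map_ata_to_cache(log))  # {'ata4': ('sde', 'cache2')}
```

Matching on the serial avoids depending on the ata and SCSI host numbers lining up, which they do in this log but are not guaranteed to on every system.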

 

I hope that the docker change fixes your hangs; then we just need to figure out what is going on with that cache2 drive.  If you have any available SATA ports, you can try them, or try swapping out the cable again.

Link to comment

Ah I see - learning tons here

 

I did a scrub on the cache pool and it gave me this which I forgot to mention in my previous message

 

Scrub_Status.PNG.1eb28236f9ad0d212b635a898d70ab56.PNG

 

It's doing a parity check now so I will swap out the cable after and hopefully that along with the other changes I've done (with your help) will solve the issue of hanging :) thanks

Edited by ultimz
Link to comment
2 minutes ago, ultimz said:

did a scrub on the cache pool and it gave me this which I forgot to mention in my previous message

That looks promising thus far. Hopefully the cache pool remains intact while you figure this out.

 

4 minutes ago, ultimz said:

learning tons here

Don't worry, I still am as well. We were all in your shoes at one point and the important part is that you learn from these experiences.

  -- I suffered the same issue with one of my two cache drives as you did.  It turned out my drive enclosure was poor quality and left the drive with a bad SATA connection.

Link to comment

Yeah I'm starting to realise how important good quality hardware is right now!

 

Unfortunately I managed to break the cache pool - after swapping power and data cables I thought the cache2 drive might be faulty so pulled it out and tried to boot with just cache1 and that didn't go too well (think my mistake was reducing the drive count in the cache pool to 1 and then starting the array). What I ended up doing last night is the following:

  • Formatted the cache1 disk and restored appdata from my backup
  • Plugged cache2 disk into my Windows 10 PC and ran chkdsk and SMART scans - drive seems to be healthy (I have ordered another drive just in case and it will arrive on Tuesday)
  • Running a parity check now. From the logs, it seems like my Parity disk is now complaining about connection issues? I suspect it might be my controller or that SATA port, but I will troubleshoot after the parity check

If the cache2 disk is ok can I just plug it back in and test it... then add it back into a cache pool by increasing the number of disks and assigning it into the pool? Will it automatically go back into a RAID1 config using btrfs?

 

New diagnostic logs attached

 

Thanks

 

Chkdsk info for cache2:

CheckCache2-Windows.PNG.e5999f14bb048707e73f676b34fe471d.PNG

server-unraid-diagnostics-20200418-1201.zip

Link to comment
29 minutes ago, ultimz said:

If the cache2 disk is ok can I just plug it back in and test it... then add it back into a cache pool by increasing the number of disks and assigning it into the pool? Will it automatically go back into a RAID1 config using btrfs?

Yes, as long as the current cache is already btrfs. More info here:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=480417

 

Link to comment
