
Unraid Crashes due to Kernel Panic Again



Background: 

I have been running Unraid for several years and it normally runs well.  I have had issues with kernel panics, ranging from XFS corruption to problems with custom IP addresses.  Tired of constantly rebooting due to these issues, I invested $2.5k in server-grade equipment hoping it would resolve many of them.  Unfortunately, I am still running into this issue.

 

EDIT: I wanted to also note that I have performed a clean fresh install on the new hardware.

 

Equipment:  

  1. SuperMicro 6047R-E1R24N 24x LFF SuperStorage Server W/ X9DRI-LN4F+
  2. 2 x Xeon CPU E5-2697 @ 2.70 GHz (48 cores total)
  3. 128 GiB DDR3 Multi-bit ECC memory
  4. Interface 1: 10 GbE
  5. Intel onboard 1 GbE interface
  6. 1 TB M.2 SSD (Cache)
  7. 1 TB M.2 SSD (VM/Downloads)
  8. 2 x 10TB parity Drives
  9. Assorted size drives (59 TB)

 

Attached Files:

  1. Uploaded Diagnostic file titanium-diagnostics-20220111-1342.zip

 

Screen Shots of several Kernel panic error messages:

  1. 704690238_kernalpanic12-6-21.PNG.c60b727ad9dc0fa49c4d131312fa6e65.PNG
  2. 1782549032_Kernalpanic12-21-21.PNG.761cc72e75f75864534ebafecb684851.PNG
  3. Here was a partial crash where the web UI was nonresponsive and nothing worked but the CLI; I had to power cycle to resolve it: 1951518154_anothercrash12-17-21.PNG.07268dbe5cf94e4deea2c141eb913dfa.PNG
  4. Another crash that was not a kernel panic; the system became unresponsive and none of the Dockers worked, similar to the above.  That said, most of the crashes are kernel panics.
  5. 1120970840_strangeerrorunraid12-30-21.PNG.d92c03608bea783bb665647de978086a.PNG
  6. 1-9-22.PNG.2c0a74e820cf4836ad9d4ce663f86861.PNG

 

Memory check

I performed a memory check using the Memtest option during Unraid boot.  I only did one pass, as it took over 12 hours to complete.

861499490_unraidmemcheck12-30-21.PNG.57b175255294561524767fc71aa1f1ca.PNG

 

My thoughts and what I have done:

  1. In the past I had kernel panic issues related to Nextcloud (deactivated that Docker).
  2. I stopped all Dockers except for Plex, Sonarr, Radarr, SABnzbd, qBittorrent, Privoxy, Overseerr, Prowlarr, speedtest-tracker, and Whoogle Search.
  3. I do have some older WD Green drives with errors that are failing; I'm not sure that would cause a kernel panic.
  4. I do get SSD temperature warnings from time to time, but they recover fairly quickly.
  5. I am running an Nvidia video card for Plex transcoding.  It gave me issues with the BIOS during the build; whenever I install or pull the card, I have to pull the motherboard battery and reset the BIOS to resolve it.

 

Hope to finally resolve this, as I spent a ton of money building this server.  The old server is now running TrueNAS and is rock solid.  I love Unraid and would hate to switch.

 

I think I got all of the information to help with troubleshooting this.

 

thanks

Edited by linenoise

I am having similar issues, and the last thing I am going to try is removing my SSD, which is a Seagate IronWolf NVMe in a PCIe adaptor.  I have replaced all the RAM, switched the 10GbE NIC back to 1GbE, and replaced the USB drive for the OS, among other tests.

 

Has it stabilised for you?  I am also running 6.10-rc2 and I am still getting kernel panics.


I'm back.  I upgraded to the RC version of Unraid, and it broke the Plex container that was using the Nvidia plugin and an Nvidia card to transcode.  By broke I mean the old container would crash when I tried to start it.  I had to download a different Docker container; none of the ones in the app store, at least for the RC, have Nvidia transcode.

 

This time around I was able to get 2 screen shots of errors. 

 

Also, during the last two days one of my hard drives died; it is currently running emulated while I wait for a replacement drive, which should hopefully arrive on Wednesday.

I do have some old hard drives with 10+ years of run time that are giving some reallocation errors; I'm going to consolidate the data and replace them with a new drive.  This is going to take a few days: I have to rebuild the drive that completely died, then consolidate the data off the other three drives before removing them and replacing them with the new drive.

 

In the meantime I am going to remove the Nvidia video card and see if that fixes anything.

 

 

server crash 1-18-22.PNG

kernal panic 1-18-22 b.PNG


Another note: I am running Unraid 6.10.0-RC2 and found that Heimdall was constantly generating log entries, to the point where the log was at 12 GB in just a few days.  From what I was reading on the forum it might have been caused by an enhanced app, but when I removed Heimdall, ran Cleanup Appdata to remove the old config, and did a base install of Heimdall, it was still generating tons of log entries.  Not sure this is a result of the RC, but I can see this quickly filling up the docker image file and crashing Docker.  I had to uninstall Heimdall for the time being.
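For anyone hitting the same thing, the per-container log sizes can be checked from the Unraid terminal, and the json-file log driver can be capped per container; a rough sketch (the path assumes the standard docker.img mount point, and the 50m/1 values are only example limits):

du -sh /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -h   # shows which container's log is ballooning
# To cap a single container's log, add these to its Extra Parameters (extra docker run flags):
#   --log-opt max-size=50m --log-opt max-file=1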

 

I attached a sample of the log file; it was already at 52 MB after about one minute of running.  Not sure if this is related to our crashes, but it's another data point.

 

@Squid  Can you expand on what the RC2 version of Unraid was supposed to fix?  I am finding some of the Dockers are having compatibility issues, or versions I used on the stable releases are not showing in the app store.  Is there a filter to only show apps compatible with the installed version in the app store?

 

thanks

laravel.log.txt

3 hours ago, linenoise said:

Is there a filter to only show apps compatible with the installed version in the app store?

 

The Apps tab will

  • Not display anything incompatible unless changed in its settings for new installations
  • Not display anything deprecated unless changed in its settings for new installations
  • For a reinstallation of an app which you've previously had installed and which is now either deprecated or listed as incompatible, it will allow a reinstall, but the app is clearly identified as being deprecated or blacklisted.
  • Will NEVER allow you to install anything that is outright blacklisted for one reason or another
  • Opt-in: Periodically scan your system and alert you to the presence of:
    1. Known malware and security vulnerabilities within applications
    2. Critical security vulnerabilities within Unraid OS
    3. Any installed applications which may seriously hinder the operation of your server

In theory, there really should be no cases of "incompatible" docker apps with anyone's system due to the nature of the system.  The whole system though is moderated, and when it appears that a particular app is causing issues for many users, it gets investigated and any applicable action deemed necessary is taken.  In terms of your problems with nVidia and Plex, any of the available apps should work just fine, assuming the plugin is up to date and the hardware doesn't itself have issues and is supported by the driver version (I personally use @binhex and bounce back and forth between nVidia and Intel transcoding, depending upon my priorities when reallocating hardware resources between various computers).  Plex Inc though is one of those software companies whose constant updates tend to break something else, and "good" releases are rather rare compared to the number of updates they issue.  That's part of my reasoning for using binhex: while it's updated regularly, it's not utilizing the bleeding edge of Plex (which they always label as "stable").  Many others here will disagree and say the opposite, that Hotio or LSIO is the way to go.  To me, it's the same as AMD vs Intel or GM vs Ford.

 

Additionally, Fix Common Problems will (amongst many other tests) alert you if you have anything deprecated, blacklisted, or incompatible (and in the case of plugins, unknown) installed.

 

RC2+ fixes an issue where some users experience hard crashes of the OS when containers / VMs run on their own IP addresses.  It offers the option of ipvlan as the driver instead of simply macvlan.  The actual differences between them are above my paygrade...
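Unraid handles that choice through its Docker settings page rather than the CLI, but for anyone curious, the two drivers look roughly like this when a network is created by hand with docker; a sketch only, and the subnet, gateway, parent interface, and network names are placeholders, not values from this thread:

docker network create -d macvlan --subnet=10.0.0.0/24 --gateway=10.0.0.1 -o parent=eth0 br0_macvlan
docker network create -d ipvlan  --subnet=10.0.0.0/24 --gateway=10.0.0.1 -o parent=eth0 -o ipvlan_mode=l2 br0_ipvlan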

 

On 1/18/2022 at 9:34 PM, Squid said:

 

The Apps tab will

  • Not display anything incompatible unless changed in its settings for new installations
  • Not display anything deprecated unless changed in its settings for new installations
  • For a reinstallation of an app which you've previously had installed and which is now either deprecated or listed as incompatible, it will allow a reinstall, but the app is clearly identified as being deprecated or blacklisted.
  • Will NEVER allow you to install anything that is outright blacklisted for one reason or another
  • Opt-in: Periodically scan your system and alert you to the presence of:
    1. Known malware and security vulnerabilities within applications
    2. Critical security vulnerabilities within Unraid OS
    3. Any installed applications which may seriously hinder the operation of your server

In theory, there really should be no cases of "incompatible" docker apps with anyone's system due to the nature of the system.  The whole system though is moderated, and when it appears that a particular app is causing issues for many users, it gets investigated and any applicable action deemed necessary is taken.  In terms of your problems with nVidia and Plex, any of the available apps should work just fine, assuming the plugin is up to date and the hardware doesn't itself have issues and is supported by the driver version (I personally use @binhex and bounce back and forth between nVidia and Intel transcoding, depending upon my priorities when reallocating hardware resources between various computers).  Plex Inc though is one of those software companies whose constant updates tend to break something else, and "good" releases are rather rare compared to the number of updates they issue.  That's part of my reasoning for using binhex: while it's updated regularly, it's not utilizing the bleeding edge of Plex (which they always label as "stable").  Many others here will disagree and say the opposite, that Hotio or LSIO is the way to go.  To me, it's the same as AMD vs Intel or GM vs Ford.

 

Additionally, Fix Common Problems will (amongst many other tests) alert you if you have anything deprecated, blacklisted, or incompatible (and in the case of plugins, unknown) installed.

 

RC2+ fixes an issue where some users experience hard crashes of the OS when containers / VMs run on their own IP addresses.  It offers the option of ipvlan as the driver instead of simply macvlan.  The actual differences between them are above my paygrade...

 

Thanks for the quick and detailed response.  I normally try to go with official containers when I can, but your point about Plex's update strategy makes perfect sense.  I was using LSIO because their docker had entries for the Nvidia card information.

 

My Unraid server crashed again today.  I do get temperature warnings on my M.2 1 TB SSD (ADATA_SX8200PNP_2L1829ASJJ2X) running as my cache, up to around 42 °C; this normally happens when I am transcoding H.264 to H.265 and it is doing a lot of reads/writes to the cache drive.  Could a high cache drive temperature cause Unraid to crash?
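For reference, the drive's reported temperature and its warning/critical thresholds can be read from the Unraid terminal with smartctl; a quick sketch (the /dev/nvme0 device name is an assumption and may differ on your system):

smartctl -a /dev/nvme0 | grep -i temp   # current composite temperature plus the drive's warning/critical thresholds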


NVMe drives all run hot; 42 °C is nothing to them.  Samsung drives, for instance, are rated to run up to 70 °C.

 

You're running ECC memory.  Does your system event log in the BIOS say anything?  You should actually be running a standalone boot stick with memtest from https://www.memtest86.com/, as the one included won't catch any ECC errors that happen unless they're completely uncorrectable (due to licensing issues).
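If the kernel's EDAC driver is loaded, the corrected/uncorrected ECC error counters can also be read from sysfs while the server is running; a sketch, assuming a standard Linux EDAC layout:

grep . /sys/devices/system/edac/mc/mc*/ce_count   # corrected ECC errors per memory controller
grep . /sys/devices/system/edac/mc/mc*/ue_count   # uncorrectable ECC errors per memory controller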

  • 2 weeks later...
On 1/20/2022 at 5:49 PM, Squid said:

NVMe drives all run hot; 42 °C is nothing to them.  Samsung drives, for instance, are rated to run up to 70 °C.

 

You're running ECC memory.  Does your system event log in the BIOS say anything?  You should actually be running a standalone boot stick with memtest from https://www.memtest86.com/, as the one included won't catch any ECC errors that happen unless they're completely uncorrectable (due to licensing issues).

I ran a memtest and didn't see any issues.

 

I think I might have a breakthrough.  I enabled the syslog server and saved the syslogs to my cache drive according to the post here.

 

This allowed me to catch some errors before they were lost to a reboot.  From the logs it looks like the Preclear plugin was blowing up the nginx web server, if I'm reading them correctly.  I didn't get a kernel panic, but I was getting no response from any of the Dockers or the Unraid interface.  Not sure whether this would eventually cause a kernel-panic-type error, but I removed the plugin and will see if Unraid stabilizes.  I attached the syslogs.

syslog-10.0.0.11 - Copy.log.zip
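For anyone digging through a saved syslog like this, grepping for the usual suspects pulls out the interesting lines quickly; a sketch (the path and filename are whatever your syslog server wrote, not an Unraid default):

grep -iE 'kernel panic|call trace|blocked for more than|xfs|nginx' /mnt/user/syslog/syslog-10.0.0.11.log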

Jan 30 01:55:40 Titanium kernel: macvlan_broadcast+0x116/0x144 [macvlan]
Jan 30 01:55:40 Titanium kernel: macvlan_process_broadcast+0xc7/0x10b [macvlan]

Macvlan call traces are usually the result of having Dockers with a custom IP address; upgrading to v6.10 and switching to ipvlan might fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right)), or see below for more info.

https://forums.unraid.net/topic/70529-650-call-traces-when-assigning-ip-address-to-docker-containers/

See also here:

https://forums.unraid.net/bug-reports/stable-releases/690691-kernel-panic-due-to-netfilter-nf_nat_setup_info-docker-static-ip-macvlan-r1356/
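If it helps with checking a setup, which containers are attached to a custom network (and with what IP) can be listed from the terminal; a rough sketch, with the output format only approximate:

docker network ls    # networks using the macvlan/ipvlan driver (e.g. br0) show up here
docker ps -q | xargs docker inspect --format '{{.Name}}: {{range $net,$cfg := .NetworkSettings.Networks}}{{$net}}={{$cfg.IPAddress}} {{end}}'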

On 1/31/2022 at 4:34 AM, JorgeB said:
Jan 30 01:55:40 Titanium kernel: macvlan_broadcast+0x116/0x144 [macvlan]
Jan 30 01:55:40 Titanium kernel: macvlan_process_broadcast+0xc7/0x10b [macvlan]

Macvlan call traces are usually the result of having Dockers with a custom IP address; upgrading to v6.10 and switching to ipvlan might fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right)), or see below for more info.

https://forums.unraid.net/topic/70529-650-call-traces-when-assigning-ip-address-to-docker-containers/

See also here:

https://forums.unraid.net/bug-reports/stable-releases/690691-kernel-panic-due-to-netfilter-nf_nat_setup_info-docker-static-ip-macvlan-r1356/

 

OK, I have to give credit to @Squid for this same solution.  I ignored him because I didn't think I had any custom IP addresses.  When you posted the log with the name of my server all up in my face, I checked my Dockers and sure enough my speedtest-tracker docker was using a custom IP.  So my sincere apologies to Mr. Squid, who nailed this early on, and thanks to @JorgeB for pointing it out again.

 

I made the changes to the Docker settings; hopefully this will work.  Not sure if I should start a new thread, but since this is likely due to all of the kernel crashes I thought I'd post it here.  I have some corrupted XFS files.  I saw SpaceInvader One's excellent video on how to fix XFS corruption and this thread.

But I have no idea what drive dm-14 refers to; I was expecting something like sda or sdb.  Using an educated guess, I ran xfs_repair from the Unraid GUI on disk 14, but that didn't seem to fix the issue.  Below are the error messages from the logs and a list of my Unraid drives and their Linux names.

 

 

Feb  1 17:58:02 Titanium kernel: XFS (dm-14): Metadata corruption detected at xfs_dinode_verify+0xa7/0x56c [xfs], inode 0x20314439 dinode
Feb  1 17:58:02 Titanium kernel: XFS (dm-14): Unmount and run xfs_repair
Feb  1 17:58:02 Titanium kernel: XFS (dm-14): First 128 bytes of corrupted metadata buffer:

 

image.thumb.png.f9bbb00b35bd6bb9c30c666e5ded5692.png

 

41 minutes ago, Squid said:

What was the output from running it?  The default is with -n, which is a no-modify flag.  It needs to be removed for any fixes to actually take place.

I ran it with the -lv flag.  This was the output.

Phase 1 - find and verify superblock...
        - block cache size set to 6161160 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1945344 tail block 1945344
        - scan filesystem freespace and inode maps...
clearing needsrepair flag and regenerating metadata
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 1
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (1:1945340) is ahead of log (1:2).
Format log to cycle 4.

        XFS_REPAIR Summary    Tue Feb  1 19:45:43 2022

Phase		Start		End		Duration
Phase 1:	02/01 19:42:47	02/01 19:42:47
Phase 2:	02/01 19:42:47	02/01 19:43:16	29 seconds
Phase 3:	02/01 19:43:16	02/01 19:43:27	11 seconds
Phase 4:	02/01 19:43:27	02/01 19:43:28	1 second
Phase 5:	02/01 19:43:28	02/01 19:43:29	1 second
Phase 6:	02/01 19:43:29	02/01 19:43:39	10 seconds
Phase 7:	02/01 19:43:39	02/01 19:43:39

Total run time: 52 seconds
done

 

 

43 minutes ago, Squid said:

You must be friends with my wife :) 

:) Yep, and you can sympathize with my poor wife that has to live with me.....  :)   

Edited by linenoise
15 hours ago, JorgeB said:

Looks like xfs_repair succeeded, post new diags after array start in normal mode.

I am still getting this error in my logs every 15 seconds or so.

 

Feb  2 18:20:09 Titanium kernel: XFS (dm-14): Metadata corruption detected at xfs_dinode_verify+0xa7/0x56c [xfs], inode 0x20314439 dinode
Feb  2 18:20:09 Titanium kernel: XFS (dm-14): Unmount and run xfs_repair
Feb  2 18:20:09 Titanium kernel: XFS (dm-14): First 128 bytes of corrupted metadata buffer:
Feb  2 18:20:09 Titanium kernel: 00000000: 49 4e 41 ed 03 01 00 00 00 00 00 63 00 00 00 64  INA........c...d
Feb  2 18:20:09 Titanium kernel: 00000010: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00  ................
Feb  2 18:20:09 Titanium kernel: 00000020: 98 8f 34 fb 84 88 ff ff 61 96 0a ae 34 dd db 6c  ..4.....a...4..l
Feb  2 18:20:09 Titanium kernel: 00000030: 61 96 0a ae 34 dd db 6c 00 00 00 00 00 00 00 1a  a...4..l........
Feb  2 18:20:09 Titanium kernel: 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Feb  2 18:20:09 Titanium kernel: 00000050: 00 00 00 02 00 00 00 00 00 00 00 00 7b 94 ec 72  ............{..r
Feb  2 18:20:09 Titanium kernel: 00000060: ff ff ff ff 9c 27 7a e1 00 00 00 00 00 00 00 06  .....'z.........
Feb  2 18:20:09 Titanium kernel: 00000070: 00 00 00 b1 00 09 b6 1c 00 00 00 00 00 00 00 00  ................

 

I ran this twice.  Not sure if this was answered, but is (dm-14) disk 14?  I was expecting an sd name, something like sda, sdb, etc.
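For what it's worth, the dm-N number is a kernel device-mapper index rather than the Unraid disk number, so it may not line up with disk 14; the mapping can be checked from the terminal with something like this (a sketch; on an encrypted array the mapper name is typically the corresponding mdX device, but verify on your own system):

cat /sys/block/dm-14/dm/name                  # device-mapper name sitting behind dm-14
ls -l /dev/mapper/                            # mapper names and the dm-N node each one points to
lsblk -o NAME,KNAME,FSTYPE,SIZE,MOUNTPOINT    # tree view tying sdX -> md/crypt devices -> mount points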


Ok I'm at a complete loss here on this XFS repair.

 

I moved all data off of disk 14 and took disk 14 offline; the XFS corruption error then went from dm-14 to dm-13.

I performed a web GUI repair on disk 13.  That did not stop the error.

Feb  3 06:42:17 Titanium kernel: XFS (dm-13): Unmount and run xfs_repair
Feb  3 06:42:17 Titanium kernel: XFS (dm-13): First 128 bytes of corrupted metadata buffer:
Feb  3 06:42:17 Titanium kernel: 00000000: 49 4e 41 ed 03 01 00 00 00 00 00 63 00 00 00 64  INA........c...d
Feb  3 06:42:17 Titanium kernel: 00000010: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00  ................
Feb  3 06:42:17 Titanium kernel: 00000020: 98 8f 34 fb 84 88 ff ff 61 96 0a ae 34 dd db 6c  ..4.....a...4..l
Feb  3 06:42:17 Titanium kernel: 00000030: 61 96 0a ae 34 dd db 6c 00 00 00 00 00 00 00 1a  a...4..l........
Feb  3 06:42:17 Titanium kernel: 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Feb  3 06:42:17 Titanium kernel: 00000050: 00 00 00 02 00 00 00 00 00 00 00 00 7b 94 ec 72  ............{..r
Feb  3 06:42:17 Titanium kernel: 00000060: ff ff ff ff 9c 27 7a e1 00 00 00 00 00 00 00 06  .....'z.........
Feb  3 06:42:17 Titanium kernel: 00000070: 00 00 00 b1 00 09 b6 1c 00 00 00 00 00 00 00 00  ................
Feb  3 06:42:17 Titanium kernel: XFS (dm-13): Metadata corruption detected at xfs_dinode_verify+0xa7/0x56c [xfs], inode 0x20314439 

I formatted disk 14 by converting from XFS encrypted to XFS and back to XFS encrypted (xfs encrypted -> format -> xfs -> format -> xfs encrypted).  Then the corruption error came back on dm-14.

Feb  3 07:22:35 Titanium kernel: XFS (dm-14): Unmount and run xfs_repair
Feb  3 07:22:35 Titanium kernel: XFS (dm-14): First 128 bytes of corrupted metadata buffer:
Feb  3 07:22:35 Titanium kernel: 00000000: 49 4e 41 ed 03 01 00 00 00 00 00 63 00 00 00 64  INA........c...d
Feb  3 07:22:35 Titanium kernel: 00000010: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00  ................
Feb  3 07:22:35 Titanium kernel: 00000020: 98 8f 34 fb 84 88 ff ff 61 96 0a ae 34 dd db 6c  ..4.....a...4..l
Feb  3 07:22:35 Titanium kernel: 00000030: 61 96 0a ae 34 dd db 6c 00 00 00 00 00 00 00 1a  a...4..l........
Feb  3 07:22:35 Titanium kernel: 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Feb  3 07:22:35 Titanium kernel: 00000050: 00 00 00 02 00 00 00 00 00 00 00 00 7b 94 ec 72  ............{..r
Feb  3 07:22:35 Titanium kernel: 00000060: ff ff ff ff 9c 27 7a e1 00 00 00 00 00 00 00 06  .....'z.........
Feb  3 07:22:35 Titanium kernel: 00000070: 00 00 00 b1 00 09 b6 1c 00 00 00 00 00 00 00 00  ................

 

That stopped the errors on dm-13, and they moved back to dm-14.

 

I then performed a GUI xfs_repair using -L to zero the log and then ran the GUI xfs_repair with -V, but I am still getting the error.
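In case it is useful, the same repair can also be run from the terminal with the array started in maintenance mode, pointed at the mapper device rather than the dm-N name; a sketch only, and the /dev/mapper/md14 name is an assumption, so check it against ls -l /dev/mapper first:

xfs_repair -n /dev/mapper/md14   # dry run: report problems without modifying anything
xfs_repair -v /dev/mapper/md14   # actual repair, verbose
xfs_repair -L /dev/mapper/md14   # only if it refuses to run because of a dirty log; zeroing the log can lose recent metadata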

 

I am beginning to think that dm-14 does not correlate to disk 14 and that the corruption is somewhere else.  If I run a parity check, will it fix the corruption, or will it just write the corruption to the parity drive?

 

 
