
Unraid Crashes due to Kernel Panic Again



Background: 

I have been running Unraid for several years and it normally runs well.  I have had issues with kernel panics, ranging from XFS corruption to problems with custom IP addresses.  Tired of constantly rebooting due to these issues, I invested $2.5k in server-grade equipment hoping it would resolve many of them.  Unfortunately, I am still running into this issue.

 

EDIT: I wanted to also note that I have performed a clean fresh install on the new hardware.

 

Equipment:  

  1. SuperMicro 6047R-E1R24N 24x LFF SuperStorage Server W/ X9DRI-LN4F+
  2. 2 x Xeon CPU E5-2697 @ 2.70 GHz (48 cores total)
  3. 128 GiB DDR3 Multi-bit ECC memory
  4. Interface 1: 10 GbE
  5. Intel onboard 1 GbE interface
  6. 1 TB M.2 SSD (Cache)
  7. 1 TB M.2 SSD (VM/Downloads)
  8. 2 x 10TB parity Drives
  9. Assorted size drives (59 TB)

 

Attached Files:

  1. Uploaded Diagnostic file titanium-diagnostics-20220111-1342.zip

 

Screen Shots of several Kernel panic error messages:

  1. 704690238_kernalpanic12-6-21.PNG.c60b727ad9dc0fa49c4d131312fa6e65.PNG
  2. 1782549032_Kernalpanic12-21-21.PNG.761cc72e75f75864534ebafecb684851.PNG
  3. Here was a partial crash where the web UI was nonresponsive and nothing worked but the CLI; I had to power cycle to resolve it: 1951518154_anothercrash12-17-21.PNG.07268dbe5cf94e4deea2c141eb913dfa.PNG
  4. Another crash that was not a kernel panic; the system became unresponsive and none of the Dockers worked, similar to the above.  That said, most of the crashes are kernel panics.
  5. 1120970840_strangeerrorunraid12-30-21.PNG.d92c03608bea783bb665647de978086a.PNG
  6. 1-9-22.PNG.2c0a74e820cf4836ad9d4ce663f86861.PNG

 

Memory check

I performed a memory check using the Memtest option during Unraid boot.  I only did one pass, as it took over 12 hours to complete.

861499490_unraidmemcheck12-30-21.PNG.57b175255294561524767fc71aa1f1ca.PNG

 

My thoughts and what I have done:

  1. In the past I had kernel panic issues related to Nextcloud (deactivated that Docker).
  2. I stopped all Dockers except for Plex, Sonarr, Radarr, SABnzbd, qBittorrent, Privoxy, Overseerr, Prowlarr, speedtest-tracker, and Whoogle Search.
  3. I do have some older WD Green drives with errors that are failing; I'm not sure that would cause a kernel panic.
  4. I do get SSD temperature warnings from time to time, but they recover fairly quickly.
  5. I am running an Nvidia video card for Plex transcoding.  It gave me issues with the BIOS during the build; whenever I install or pull the card, I have to pull the motherboard battery and reset the BIOS to resolve it.

 

Hope to finally resolve this, as I spent a ton of money building this server.  The old server is now running TrueNAS and is rock solid.  I love Unraid and would hate to switch.

 

I think I got all of the information to help with troubleshooting this.

 

thanks

Edited by linenoise

I am having similar issues, and the last thing I am going to try is removing my SSD, which is a Seagate IronWolf NVMe in a PCIe adaptor.  I have replaced all the RAM, switched the 10GbE NIC back to 1GbE, and replaced the USB drive for the OS, among other tests.

 

Has it stabilised for you?  I am also running 6.10-rc2 and I am still getting kernel panics.


I'm back.  I upgraded to the RC version of Unraid, and it broke the Plex container that was using the Nvidia plugin and an Nvidia card to transcode.  By broke I mean the old container would crash when I tried to start it.  I had to download a different Docker container; none of the ones in the app store, at least for the RC, have Nvidia transcode.

 

This time around I was able to get 2 screen shots of errors. 

 

Also, during the last two days one of my hard drives died; it is currently running emulated while I wait for a replacement drive, which should hopefully arrive on Wednesday.

I do have some old hard drives with 10+ years of run time that are giving some reallocation errors; I'm going to consolidate the data and replace them with a new drive.  This is going to take a few days: I have to rebuild the drive that completely died, then consolidate the data off the other three drives before removing them and replacing them with the new drive.

 

In the meantime I am going to remove the Nvidia video card and see if that fixes anything.

 

 

server crash 1-18-22.PNG

kernal panic 1-18-22 b.PNG


Another note: I am running Unraid 6.10.0-RC2 and found that Heimdall was constantly generating log entries, to the point where the log was at 12 GB in just a few days.  From what I was reading on the forum it might have been caused by an enhanced app, but when I removed Heimdall, ran Cleanup Appdata to remove the old config, and did a base install of Heimdall, it was still generating tons of log entries.  Not sure this is a result of the RC, but I can see this quickly filling up the docker image file and crashing Docker.  I had to uninstall Heimdall for the time being.
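For anyone hitting the same thing, the per-container log sizes can be checked from the Unraid terminal, and the json-file log driver can be capped per container; a rough sketch (the path assumes the standard docker.img mount point, and the 50m/1 values are only example limits):

du -sh /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -h   # shows which container's log is ballooning
# To cap a single container's log, add these to its Extra Parameters (extra docker run flags):
#   --log-opt max-size=50m --log-opt max-file=1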

 

I attached a sample of the log file; it was already at 52 MB after about one minute of running.  Not sure if this is related to our crashes, but it's another data point.

 

@Squid  Can you expand on what the RC2 version of Unraid was supposed to fix?  I am finding some of the Dockers are having compatibility issues, or versions I used on the stable releases are not showing in the app store.  Is there a filter to only show apps compatible with the installed version in the app store?

 

thanks

laravel.log.txt

3 hours ago, linenoise said:

Is there a filter to only show apps compatible with the installed version in the app store?

 

The Apps tab will

  • Not display anything incompatible unless changed in its settings for new installations
  • Not display anything deprecated unless changed in its settings for new installations
  • For a reinstallation of an app which you've previously had installed and which is now either deprecated or listed as incompatible, it will allow a reinstall, but the app is clearly identified as being deprecated or blacklisted.
  • Will NEVER allow you to install anything that is outright blacklisted for one reason or another
  • Opt-in: Periodically scan your system and alert you to the presence of:
    1. Known malware and security vulnerabilities within applications
    2. Critical security vulnerabilities within Unraid OS
    3. Any installed applications which may seriously hinder the operation of your server

In theory, there really should be no cases of "incompatible" docker apps with anyone's system due to the nature of the system.  The whole system though is moderated, and when it appears that a particular app is causing issues for many users, it gets investigated and any applicable action deemed necessary is taken.  In terms of your problems with nVidia and Plex, any of the available apps should work just fine, assuming the plugin is up to date and the hardware doesn't itself have issues and is supported by the driver version (I personally use @binhex and bounce back and forth between nVidia and Intel transcoding, depending upon my priorities when reallocating hardware resources between various computers).  Plex Inc though is one of those software companies whose constant updates tend to break something else, and "good" releases are rather rare compared to the number of updates they issue.  That's part of my reasoning for using binhex: while it's updated regularly, it's not utilizing the bleeding edge of Plex (which they always label as "stable").  Many others here will disagree and say the opposite, that Hotio or LSIO is the way to go.  To me, it's the same as AMD vs Intel or GM vs Ford.

 

Additionally, Fix Common Problems will (amongst many other tests) alert you if you have anything deprecated, blacklisted, or incompatible (and in the case of plugins, unknown) installed.

 

RC2+ fixes an issue where some users experience hard crashes of the OS when containers / VMs run on their own IP addresses.  It offers the option of ipvlan as the driver instead of simply macvlan.  The actual differences between them are above my paygrade...
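Unraid handles that choice through its Docker settings page rather than the CLI, but for anyone curious, the two drivers look roughly like this when a network is created by hand with docker; a sketch only, and the subnet, gateway, parent interface, and network names are placeholders, not values from this thread:

docker network create -d macvlan --subnet=10.0.0.0/24 --gateway=10.0.0.1 -o parent=eth0 br0_macvlan
docker network create -d ipvlan  --subnet=10.0.0.0/24 --gateway=10.0.0.1 -o parent=eth0 -o ipvlan_mode=l2 br0_ipvlan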

 

On 1/18/2022 at 9:34 PM, Squid said:

 

The Apps tab will

  • Not display anything incompatible unless changed in its settings for new installations
  • Not display anything deprecated unless changed in its settings for new installations
  • For a reinstallation of an app which you've previously had installed and which is now either deprecated or listed as incompatible, it will allow a reinstall, but the app is clearly identified as being deprecated or blacklisted.
  • Will NEVER allow you to install anything that is outright blacklisted for one reason or another
  • Opt-in: Periodically scan your system and alert you to the presence of:
    1. Known malware and security vulnerabilities within applications
    2. Critical security vulnerabilities within Unraid OS
    3. Any installed applications which may seriously hinder the operation of your server

In theory, there really should be no cases of "incompatible" docker apps with anyone's system due to the nature of the system.  The whole system though is moderated, and when it appears that a particular app is causing issues for many users, it gets investigated and any applicable action deemed necessary is taken.  In terms of your problems with nVidia and Plex, any of the available apps should work just fine, assuming the plugin is up to date and the hardware doesn't itself have issues and is supported by the driver version (I personally use @binhex and bounce back and forth between nVidia and Intel transcoding, depending upon my priorities when reallocating hardware resources between various computers).  Plex Inc though is one of those software companies whose constant updates tend to break something else, and "good" releases are rather rare compared to the number of updates they issue.  That's part of my reasoning for using binhex: while it's updated regularly, it's not utilizing the bleeding edge of Plex (which they always label as "stable").  Many others here will disagree and say the opposite, that Hotio or LSIO is the way to go.  To me, it's the same as AMD vs Intel or GM vs Ford.

 

Additionally, Fix Common Problems will (amongst many other tests) alert you if you have anything deprecated, blacklisted, or incompatible (and in the case of plugins, unknown) installed.

 

RC2+ fixes an issue where some users experience hard crashes of the OS when containers / VMs run on their own IP addresses.  It offers the option of ipvlan as the driver instead of simply macvlan.  The actual differences between them are above my paygrade...

 

Thanks for the quick and detailed response.  I normally try to go with official containers when I can, but your point about Plex's update strategy makes perfect sense.  I was using LSIO because their docker had entries for the Nvidia card information.

 

My Unraid server crashed again today.  I do get temperature warnings on my M.2 1 TB SSD (ADATA_SX8200PNP_2L1829ASJJ2X) running as my cache, up to around 42 °C; this normally happens when I am transcoding H.264 to H.265 and it is doing a lot of reads/writes to the cache drive.  Could a high cache drive temperature cause Unraid to crash?
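For reference, the drive's reported temperature and its warning/critical thresholds can be read from the Unraid terminal with smartctl; a quick sketch (the /dev/nvme0 device name is an assumption and may differ on your system):

smartctl -a /dev/nvme0 | grep -i temp   # current composite temperature plus the drive's warning/critical thresholds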


NVMe drives all run hot; 42 °C is nothing to them.  Samsung drives, for instance, are rated to run up to 70 °C.

 

You're running ECC memory.  Does your system event log in the BIOS say anything?  You should actually be running a standalone boot stick with memtest from https://www.memtest86.com/, as the one included won't catch any ECC errors that happen unless they're completely uncorrectable (due to licensing issues).
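If the kernel's EDAC driver is loaded, the corrected/uncorrected ECC error counters can also be read from sysfs while the server is running; a sketch, assuming a standard Linux EDAC layout:

grep . /sys/devices/system/edac/mc/mc*/ce_count   # corrected ECC errors per memory controller
grep . /sys/devices/system/edac/mc/mc*/ue_count   # uncorrectable ECC errors per memory controller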

  • 2 weeks later...
On 1/20/2022 at 5:49 PM, Squid said:

NVMe drives all run hot; 42 °C is nothing to them.  Samsung drives, for instance, are rated to run up to 70 °C.

 

You're running ECC memory.  Does your system event log in the BIOS say anything?  You should actually be running a standalone boot stick with memtest from https://www.memtest86.com/, as the one included won't catch any ECC errors that happen unless they're completely uncorrectable (due to licensing issues).

I ran a memtest and didn't see any issues.

 

I think I might have a breakthrough.  I enabled the syslog server and saved the syslogs to my cache drive according to the post here.

 

This allowed me to catch some errors before they were lost to a reboot.  From the logs it looks like the Preclear plugin was blowing up the nginx web server, if I'm reading them correctly.  I didn't get a kernel panic, but I was getting no response from any of the Dockers or the Unraid interface.  Not sure whether this would eventually cause a kernel-panic-type error, but I removed the plugin and will see if Unraid stabilizes.  I attached the syslogs.

syslog-10.0.0.11 - Copy.log.zip
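For anyone digging through a saved syslog like this, grepping for the usual suspects pulls out the interesting lines quickly; a sketch (the path and filename are whatever your syslog server wrote, not an Unraid default):

grep -iE 'kernel panic|call trace|blocked for more than|xfs|nginx' /mnt/user/syslog/syslog-10.0.0.11.log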

Jan 30 01:55:40 Titanium kernel: macvlan_broadcast+0x116/0x144 [macvlan]
Jan 30 01:55:40 Titanium kernel: macvlan_process_broadcast+0xc7/0x10b [macvlan]

Macvlan call traces are usually the result of having Dockers with a custom IP address; upgrading to v6.10 and switching to ipvlan might fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right)), or see below for more info.

https://forums.unraid.net/topic/70529-650-call-traces-when-assigning-ip-address-to-docker-containers/

See also here:

https://forums.unraid.net/bug-reports/stable-releases/690691-kernel-panic-due-to-netfilter-nf_nat_setup_info-docker-static-ip-macvlan-r1356/
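If it helps with checking a setup, which containers are attached to a custom network (and with what IP) can be listed from the terminal; a rough sketch, with the output format only approximate:

docker network ls    # networks using the macvlan/ipvlan driver (e.g. br0) show up here
docker ps -q | xargs docker inspect --format '{{.Name}}: {{range $net,$cfg := .NetworkSettings.Networks}}{{$net}}={{$cfg.IPAddress}} {{end}}'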

On 1/31/2022 at 4:34 AM, JorgeB said:
Jan 30 01:55:40 Titanium kernel: macvlan_broadcast+0x116/0x144 [macvlan]
Jan 30 01:55:40 Titanium kernel: macvlan_process_broadcast+0xc7/0x10b [macvlan]

Macvlan call traces are usually the result of having Dockers with a custom IP address; upgrading to v6.10 and switching to ipvlan might fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right)), or see below for more info.

https://forums.unraid.net/topic/70529-650-call-traces-when-assigning-ip-address-to-docker-containers/

See also here:

https://forums.unraid.net/bug-reports/stable-releases/690691-kernel-panic-due-to-netfilter-nf_nat_setup_info-docker-static-ip-macvlan-r1356/

 

OK, I have to give credit to @Squid for this same solution.  I ignored him because I didn't think I had any custom IP addresses.  When you posted the log with the name of my server all up in my face, I checked my Dockers and sure enough my speedtest-tracker docker was using a custom IP.  So my sincere apologies to Mr. Squid, who nailed this early on, and thanks to @JorgeB for pointing it out again.

 

I made the changes to the Docker settings; hopefully this will work.  Not sure if I should start a new thread, but since this is likely due to all of the kernel crashes I thought I'd post it here.  I have some corrupted XFS files.  I saw SpaceInvader One's excellent video on how to fix XFS corruption and this thread.

But I have no idea what drive dm-14 refers to; I was expecting something like sda or sdb.  Using an educated guess, I ran xfs_repair from the Unraid GUI on disk 14, but that didn't seem to fix the issue.  Below are the error messages from the logs and a list of my Unraid drives and their Linux names.

 

 

Feb  1 17:58:02 Titanium kernel: XFS (dm-14): Metadata corruption detected at xfs_dinode_verify+0xa7/0x56c [xfs], inode 0x20314439 dinode
Feb  1 17:58:02 Titanium kernel: XFS (dm-14): Unmount and run xfs_repair
Feb  1 17:58:02 Titanium kernel: XFS (dm-14): First 128 bytes of corrupted metadata buffer:

 

image.thumb.png.f9bbb00b35bd6bb9c30c666e5ded5692.png

 

41 minutes ago, Squid said:

What was the output from running it?  The default is with -n, which is a no-modify flag.  It needs to be removed for any fixes to actually take place.

I ran it with the -lv flag.  This was the output.

Phase 1 - find and verify superblock...
        - block cache size set to 6161160 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1945344 tail block 1945344
        - scan filesystem freespace and inode maps...
clearing needsrepair flag and regenerating metadata
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 1
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (1:1945340) is ahead of log (1:2).
Format log to cycle 4.

        XFS_REPAIR Summary    Tue Feb  1 19:45:43 2022

Phase		Start		End		Duration
Phase 1:	02/01 19:42:47	02/01 19:42:47
Phase 2:	02/01 19:42:47	02/01 19:43:16	29 seconds
Phase 3:	02/01 19:43:16	02/01 19:43:27	11 seconds
Phase 4:	02/01 19:43:27	02/01 19:43:28	1 second
Phase 5:	02/01 19:43:28	02/01 19:43:29	1 second
Phase 6:	02/01 19:43:29	02/01 19:43:39	10 seconds
Phase 7:	02/01 19:43:39	02/01 19:43:39

Total run time: 52 seconds
done

 

 

43 minutes ago, Squid said:

You must be friends with my wife :) 

:) Yep, and you can sympathize with my poor wife that has to live with me.....  :)   

Edited by linenoise
15 hours ago, JorgeB said:

Looks like xfs_repair succeeded, post new diags after array start in normal mode.

I am still getting this error in my logs every 15 seconds or so.

 

Feb  2 18:20:09 Titanium kernel: XFS (dm-14): Metadata corruption detected at xfs_dinode_verify+0xa7/0x56c [xfs], inode 0x20314439 dinode
Feb  2 18:20:09 Titanium kernel: XFS (dm-14): Unmount and run xfs_repair
Feb  2 18:20:09 Titanium kernel: XFS (dm-14): First 128 bytes of corrupted metadata buffer:
Feb  2 18:20:09 Titanium kernel: 00000000: 49 4e 41 ed 03 01 00 00 00 00 00 63 00 00 00 64  INA........c...d
Feb  2 18:20:09 Titanium kernel: 00000010: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00  ................
Feb  2 18:20:09 Titanium kernel: 00000020: 98 8f 34 fb 84 88 ff ff 61 96 0a ae 34 dd db 6c  ..4.....a...4..l
Feb  2 18:20:09 Titanium kernel: 00000030: 61 96 0a ae 34 dd db 6c 00 00 00 00 00 00 00 1a  a...4..l........
Feb  2 18:20:09 Titanium kernel: 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Feb  2 18:20:09 Titanium kernel: 00000050: 00 00 00 02 00 00 00 00 00 00 00 00 7b 94 ec 72  ............{..r
Feb  2 18:20:09 Titanium kernel: 00000060: ff ff ff ff 9c 27 7a e1 00 00 00 00 00 00 00 06  .....'z.........
Feb  2 18:20:09 Titanium kernel: 00000070: 00 00 00 b1 00 09 b6 1c 00 00 00 00 00 00 00 00  ................

 

I ran this twice.  Not sure if this was answered, but is (dm-14) disk 14?  I was expecting an sd name, something like sda, sdb, etc.
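For what it's worth, the dm-N number is a kernel device-mapper index rather than the Unraid disk number, so it may not line up with disk 14; the mapping can be checked from the terminal with something like this (a sketch; on an encrypted array the mapper name is typically the corresponding mdX device, but verify on your own system):

cat /sys/block/dm-14/dm/name                  # device-mapper name sitting behind dm-14
ls -l /dev/mapper/                            # mapper names and the dm-N node each one points to
lsblk -o NAME,KNAME,FSTYPE,SIZE,MOUNTPOINT    # tree view tying sdX -> md/crypt devices -> mount points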


Ok I'm at a complete loss here on this XFS repair.

 

I moved all data off of disk 14 and took disk 14 offline; the XFS corruption error then went from dm-14 to dm-13.

I performed a web GUI repair on disk 13.  That did not stop the error.

Feb  3 06:42:17 Titanium kernel: XFS (dm-13): Unmount and run xfs_repair
Feb  3 06:42:17 Titanium kernel: XFS (dm-13): First 128 bytes of corrupted metadata buffer:
Feb  3 06:42:17 Titanium kernel: 00000000: 49 4e 41 ed 03 01 00 00 00 00 00 63 00 00 00 64  INA........c...d
Feb  3 06:42:17 Titanium kernel: 00000010: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00  ................
Feb  3 06:42:17 Titanium kernel: 00000020: 98 8f 34 fb 84 88 ff ff 61 96 0a ae 34 dd db 6c  ..4.....a...4..l
Feb  3 06:42:17 Titanium kernel: 00000030: 61 96 0a ae 34 dd db 6c 00 00 00 00 00 00 00 1a  a...4..l........
Feb  3 06:42:17 Titanium kernel: 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Feb  3 06:42:17 Titanium kernel: 00000050: 00 00 00 02 00 00 00 00 00 00 00 00 7b 94 ec 72  ............{..r
Feb  3 06:42:17 Titanium kernel: 00000060: ff ff ff ff 9c 27 7a e1 00 00 00 00 00 00 00 06  .....'z.........
Feb  3 06:42:17 Titanium kernel: 00000070: 00 00 00 b1 00 09 b6 1c 00 00 00 00 00 00 00 00  ................
Feb  3 06:42:17 Titanium kernel: XFS (dm-13): Metadata corruption detected at xfs_dinode_verify+0xa7/0x56c [xfs], inode 0x20314439 

I formatted disk 14 by converting from XFS encrypted to XFS and back to XFS encrypted (xfs encrypted -> format -> xfs -> format -> xfs encrypted).  Then the corruption error came back on dm-14.

Feb  3 07:22:35 Titanium kernel: XFS (dm-14): Unmount and run xfs_repair
Feb  3 07:22:35 Titanium kernel: XFS (dm-14): First 128 bytes of corrupted metadata buffer:
Feb  3 07:22:35 Titanium kernel: 00000000: 49 4e 41 ed 03 01 00 00 00 00 00 63 00 00 00 64  INA........c...d
Feb  3 07:22:35 Titanium kernel: 00000010: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00  ................
Feb  3 07:22:35 Titanium kernel: 00000020: 98 8f 34 fb 84 88 ff ff 61 96 0a ae 34 dd db 6c  ..4.....a...4..l
Feb  3 07:22:35 Titanium kernel: 00000030: 61 96 0a ae 34 dd db 6c 00 00 00 00 00 00 00 1a  a...4..l........
Feb  3 07:22:35 Titanium kernel: 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Feb  3 07:22:35 Titanium kernel: 00000050: 00 00 00 02 00 00 00 00 00 00 00 00 7b 94 ec 72  ............{..r
Feb  3 07:22:35 Titanium kernel: 00000060: ff ff ff ff 9c 27 7a e1 00 00 00 00 00 00 00 06  .....'z.........
Feb  3 07:22:35 Titanium kernel: 00000070: 00 00 00 b1 00 09 b6 1c 00 00 00 00 00 00 00 00  ................

 

That stopped the errors on dm-13, and they moved back to dm-14.

 

I then performed a GUI xfs_repair using -L to zero the log and then ran the GUI xfs_repair with -V, but I am still getting the error.
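In case it is useful, the same repair can also be run from the terminal with the array started in maintenance mode, pointed at the mapper device rather than the dm-N name; a sketch only, and the /dev/mapper/md14 name is an assumption, so check it against ls -l /dev/mapper first:

xfs_repair -n /dev/mapper/md14   # dry run: report problems without modifying anything
xfs_repair -v /dev/mapper/md14   # actual repair, verbose
xfs_repair -L /dev/mapper/md14   # only if it refuses to run because of a dirty log; zeroing the log can lose recent metadata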

 

I am beginning to think that dm-14 does not correlate to disk 14 and that the corruption is somewhere else.  If I run a parity check, will it fix the corruption, or will it just write the corruption to the parity drive?

 

 
