How-To Solve from Problem to Solution - [6.9.2] KERNEL PANIC ON REBUILD


Hi All,

 

I wanted to share this journey as it unfolds and draw on the community's collective experience along the way. I hope to solve this and return to the Unraid community (my first build was back in Jan 2013).

 

I will explain my equipment, versions and use case, then describe the problem and the steps taken to solve it.

 

Equipment:

 

  1. Supermicro X9DA7/E with latest BIOS v3.3
  2. Intel® Xeon® CPU E5-2620 0 @ 2.00GHz x 2 (aka Dual)
  3. 64 GiB DDR3 Multi-bit ECC (all Memtested)
  4. LSI SAS9207-8i (HBA flashed in IT mode)
  5. USB - Unraid - Patriot AUTOBAHN 8 GB "Low Profile"

 

Equipment to be added for Docker and VM workloads (use cases):

 

  1. Nvidia Quadro 4000
  2. Nvidia K80 

 

Special note: The onboard Supermicro RAID is disabled and we are using the LSI SAS9207-8i card as the primary HBA.

 

Unraid: Version 6.9.2 2021-04-07

 

(No Docker containers, no add-ons; nothing but a vanilla install to date.)

 

Situation: 

  • All cards are installed (as seen in the image).
  • We log in to the Web UI and try to build the initial array with one parity drive (8 TB) and one data drive (6 TB); the server locks up with the error below.
  • It does not reboot.

 

Error: 

 

KERNEL PANIC - NOT SYNCING: TIMEOUT: NOT ALL CPUS ENTERED BROADCAST EXCEPTION HANDLER
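
This panic text comes from the Linux kernel's machine-check exception (MCE) handling, so any machine-check lines logged before the hang are worth collecting. Below is a minimal Python sketch for pulling such lines out of a saved copy of the syslog. The pattern list and the `find_panic_lines` name are illustrative assumptions, not an Unraid tool, and it only helps if a log copy survived the hang.

```python
import re

# Assumed patterns: adjust to whatever your kernel actually logs.
PANIC_PATTERNS = [
    re.compile(r"kernel panic", re.IGNORECASE),
    re.compile(r"machine check", re.IGNORECASE),
    re.compile(r"\bmce:", re.IGNORECASE),
]


def find_panic_lines(log_text):
    """Return (line_number, line) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in PANIC_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

To use it, read the saved log into a string (for example with `pathlib.Path(...).read_text()`, with a path of your choosing) and pass it to `find_panic_lines`.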

 

Troubleshooting plan:

 

Worked (one of three):
Removed all cards, to be re-added one by one. Re-added one card, the LSI SAS9207-8i, and reconnected two drives to it.

Started the server and created a NEW parity drive and one data drive; after 12 hours it is OK/healthy with NO errors.

Now we will add one card at a time and restart the troubleshooting.

 

Kernel panic error 01 - Smaller.jpg

MoBo and Cards - Label.jpg


Update #1 - Kernel Panic 

 

What's inside the server now - only these cards at the point of failure:

  1. LSI HBA (with new Supermicro 8087 to SATA cables - CBL-0097L-03), not the red ones as shown in the picture
  2. Radeon graphics card (small temporary discrete solution for testing)
  3. USB (plugged into the motherboard USB slot)

Unraid: 

  • Setup: added 4 drives, cleared all, assigned one as parity and three for storage.
  • Added shares, one Docker container (Krusader) and two utilities (for mounting drives).
  • File movement: moved 3 TB to the array, then moved 1 TB over the network at the same time.
  • Via the Unraid Web UI: one drive "dropped" or unmounted and the UI became unresponsive; on the Unraid server itself I then saw the Kernel Panic error again.
  • The server was online for about 24 hours without any activity, then crashed approx. 2 hrs into a drive clearing and file copy.

After Kernel Panic:

  1. Powered off the server, as it just hangs when the Kernel Panic error occurs
  2. Removed the USB from the motherboard and moved it to the back panel, thinking heat might be a factor
  3. Removed the GFX card and replaced it with the Quadro 4000
  4. Left the HBA in place
  5. Started the server and started Memtest86 v5.01 (still running test #8, pass 34% / test 58%); still no errors

 

Troubleshooting: 

 

#1 - Is LSI HBA activity causing the Kernel Panic? Will double-check the BIOS for both the onboard controller and the card (and recreate the same conditions):

- Copy over the network to a share and copy from a mounted disk to a share at the same time.


UPDATE #2 - Kernel Panic - Continues 

 

 

What's inside the server now at the 3rd Kernel Panic - Only these cards at the point of failure 

 

LSI, HBA (with new Supermicro 8087 to SATA cables - CBL-0097L-03) not the red ones as shown in the picture

Radeon Graphics Card (Small temp discrete solution for testing) 

USB (Plugged into the motherboard USB slot)

** Memtest86 v5.01 ran for 16 hrs without errors - I exited and moved ahead, considering the memory cleared.

 

After Kernel Panic #3:

 

Powered off server as it just hangs when Kernel Panic error occurs 

Removed USB from the motherboard and moved it to the back panel - Thinking maybe the heat 

Removed the GFX card and replaced with the Quadro 4000

Removed & Replaced the LSI HBA 

New HBA - Supermicro AOC-SASLP-MV8 Rev1.01

 

Also: the "bad" drive is not bad; why is Unraid not reflecting this?

Unraid Drive without Issues Snip - Screenshot 2022-04-14 170951.jpg

- I tested with 3 sets of SATA cables - the drive is 100% OK: no SMART errors, and it works in secondary tests

- I tested with 3 controllers (onboard, LSI #1 & Supermicro) and also tested the drive with the HBAs in different slots

 

Troubleshooting: 

 

#1 - Is LSI HBA activity causing the Kernel Panic? Removed it to test with an alternative HBA.

 

Test case - recreate the Kernel Panic:

Copy over the network to a share and copy from a mounted disk to a share at the same time.
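
To make this test case repeatable, the two simultaneous copies can be scripted. This is a rough Python sketch (needs Python 3.8+), not the exact procedure used here. It assumes both sources are visible as local paths on the machine running it, which is not quite the same as pushing data from a separate network client. The function names and any paths you pass in are placeholders.

```python
import shutil
import threading
from pathlib import Path


def copy_tree(src, dst, errors):
    """Copy src into dst, recording any exception instead of raising."""
    try:
        shutil.copytree(src, dst, dirs_exist_ok=True)
    except Exception as exc:  # record it so the other copy keeps running
        errors.append((str(src), exc))


def run_concurrent_copies(jobs):
    """Run each (src, dst) pair in its own thread; return collected errors."""
    errors = []
    threads = [
        threading.Thread(target=copy_tree, args=(Path(src), Path(dst), errors))
        for src, dst in jobs
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors
```

Call `run_concurrent_copies` with two pairs, one for the network-side source and one for the mounted-disk source, both pointing at share folders. An empty list back means both copies finished without a Python-level error. A hang or panic will of course never return at all.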

 

HBA Supermicro.jpg

18 minutes ago, neural said:

Also: the "bad" drive is not bad; why is Unraid not reflecting this?

Unraid disables a disk if there's a write error. That doesn't mean the problem was caused by the disk, but once it gets disabled you need to rebuild to re-enable it. Assuming the emulated disk is mounting and its contents look correct, you can rebuild on top:

 

https://wiki.unraid.net/Manual/Storage_Management#Rebuilding_a_drive_onto_itself

 

 

 

 


@JorgeB Thank you for the information. We are just re-running the test case to try to reproduce the Kernel Panic with the NEW HBA. If it passes, we will rebuild that drive as you have prescribed. Thanks!

 

The drive issue happened after the 1st Kernel Panic, so it is not our primary mission per se, but it is all good experience.


UPDATE #3 - Kernel Panic - Stable for >48 hrs

 

What's inside the server now

  1. Supermicro AOC-SASLP-MV8 Rev1.01 (with new Supermicro 8087 to SATA cables - CBL-0097L-03)
  2. Nvidia Quadro 4000
  3. USB (Plugged into the motherboard USB slot)

Array: All clear

  • Resolved the drive issue and added another 3 TB drive
  • Parity is clear; the array shows no errors

Next use case:

  • Will now begin to set up Unraid with VMs and add the K80 (when I get the riser cable and K80 cooler)

4 months later...

1 year later...

Any updates? Did you find the issue? I recently upgraded my server, changed the CPU and added an LSI 9201-8i SAS card. The server will be up for up to 24 hours and then randomly crash with a kernel panic; the logs show nothing out of the ordinary. I put the old CPU back and it still crashed. Right now I have removed the SAS card and am waiting to see if it will crash. I want to know if there is something you did to fix the issue.

