neural Posted April 9, 2022 (edited)

Hi All,

I wanted to share this journey as it unfolds and draw on the community's collective experience as it goes forward. I hope to solve this and return to the Unraid community (my first build was back in Jan 2013). I will describe my equipment, versions and use case, then explain the problem and the steps taken to solve it.

Equipment:
- Supermicro X9DA7/E with latest BIOS v3.3
- Intel® Xeon® CPU E5-2620 0 @ 2.00GHz x 2 (aka dual)
- 64 GiB DDR3 multi-bit ECC (all Memtested)
- LSI SAS9207-8i (HBA flashed in IT mode)
- USB - Unraid - Patriot AUTOBAHN 8GB "Low Profile"

Equipment to be added for Docker and VM workloads (use cases):
- Nvidia Quadro 4000
- Nvidia K80

Special note: The onboard Supermicro RAID is disabled and we are using the LSI SAS9207-8i card as the primary HBA.

Unraid: Version 6.9.2 2021-04-07 (no Dockers, no add-ons, nothing but a vanilla install to date)

Situation: With all cards installed (as seen in the image), we log in to the Web UI and try to build the initial array with one parity drive (8TB) and one drive (6TB). The server then stops/locks up with the error below. It does not reboot.

Error: KERNEL PANIC - NOT SYNCING: TIMEOUT: NOT ALL CPUS ENTERED BROADCAST EXCEPTION HANDLER

Troubleshooting plan:
Worked (one of three): Removed all cards, then add them back one by one.
- Re-added one card: LSI SAS9207-8i
- Re-connected two drives to that card
- Started the server and created a NEW parity drive and one disk drive; after 12 hours it is OK/healthy with NO errors.
Now we will add one card at a time and restart the troubleshooting.

Edited April 9, 2022 by neural: did not have Unraid Version
neural Posted April 13, 2022

Update #1 - Kernel Panic

What's inside the server now - only these cards at the point of failure:
- LSI HBA (with new Supermicro 8087 to SATA cables - CBL-0097L-03, not the red ones shown in the picture)
- Radeon graphics card (small temporary discrete solution for testing)
- USB (plugged into the motherboard USB slot)

Unraid:
- Setup: Added 4 drives, cleared all, added one as parity and three for storage. Added shares, one Docker (Krusader) and two utilities (for mounting drives).
- File movement: Moved 3TB to the array while moving 1TB over the network at the same time.
- Via Unraid Web UI: One drive "dropped" or unmounted and the UI became unresponsive. Then, on the Unraid Linux server itself, I saw the Kernel Panic error again.
- The server was online for about 24 hours without any activity, then crashed approx. 2 hrs into a drive clearing and file copy.

After Kernel Panic:
- Powered off the server, as it just hangs when the Kernel Panic error occurs
- Removed the USB from the motherboard and moved it to the back panel - thinking maybe the heat
- Removed the GFX card and replaced it with the Quadro 4000
- Left the HBA in place
- Started the server and started Memtest86 v5.01 (still running test #8, pass 34% / test 58%), still no errors

Troubleshooting:
#1 - Is LSI HBA activity causing the Kernel Panic? Will double-check ALL BIOS settings for both the onboard controller and the card, and recreate the same conditions: copy over the network to a share and copy from a mounted disk to a share at the same time.
neural Posted April 15, 2022 (edited)

UPDATE #2 - Kernel Panic - Continues

What's inside the server now, at the 3rd Kernel Panic - only these cards at the point of failure:
- LSI HBA (with new Supermicro 8087 to SATA cables - CBL-0097L-03, not the red ones shown in the picture)
- Radeon graphics card (small temporary discrete solution for testing)
- USB (plugged into the motherboard USB slot)

** Memtest86 v5.01 ran for 16 hrs without errors. I exited and moved ahead, considering memory cleared.

After Kernel Panic #3:
- Powered off the server, as it just hangs when the Kernel Panic error occurs
- Removed the USB from the motherboard and moved it to the back panel - thinking maybe the heat
- Removed the GFX card and replaced it with the Quadro 4000
- Removed and replaced the LSI HBA
- New HBA: Supermicro AOC-SASLP-MV8 Rev1.01

Also: Bad drive is not bad; why is Unraid not reflecting this?
- I tested with 3 sets of SATA cables. The drive is 100%, has no SMART errors and works in secondary tests.
- I tested with 3 controllers (onboard, LSI #1 and Supermicro) and also tested the drive with the HBAs in different slots.

Troubleshooting:
#1 - Is LSI HBA activity causing the Kernel Panic? Removed it to test with an alternative HBA.
Test case - recreate the Kernel Panic: copy over the network to a share and copy from a mounted disk to a share at the same time.

Edited April 15, 2022 by neural: image
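The test case above (a network copy to a share running at the same time as a copy from a mounted disk to the same share) could be scripted so it is repeatable after each hardware change. The following is only a minimal Python sketch, not what was actually run in this thread: the function names are made up for illustration, and on a real server the sources and destination would be something like a network mount, a mounted disk and a user share, all of which are assumptions here. The demo at the bottom uses temporary directories so it is safe to run anywhere.

```python
import shutil
import tempfile
import threading
from pathlib import Path


def copy_tree(src: Path, dst: Path, errors: list) -> None:
    """Copy src into dst, recording any exception instead of raising."""
    try:
        shutil.copytree(src, dst, dirs_exist_ok=True)  # needs Python 3.8+
    except Exception as exc:
        # Record the failure so the other copy still gets attempted.
        errors.append((src, exc))


def run_concurrent_copies(sources, share: Path) -> list:
    """Start one copy thread per source so all writes hit the share at once.

    Returns a list of (source, exception) pairs; empty means every copy finished.
    """
    errors = []
    threads = [
        threading.Thread(target=copy_tree, args=(src, share / src.name, errors))
        for src in sources
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


if __name__ == "__main__":
    # Demo with throwaway directories standing in for the real locations
    # (hypothetical stand-ins: a network source, a mounted disk, and a share).
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        net_src = root / "network_source"
        disk_src = root / "mounted_disk"
        share = root / "share"
        for d in (net_src, disk_src, share):
            d.mkdir()
        (net_src / "a.bin").write_bytes(b"\0" * 1024)
        (disk_src / "b.bin").write_bytes(b"\0" * 1024)
        failed = run_concurrent_copies([net_src, disk_src], share)
        print("failed copies:", failed)
```

Pointing `run_concurrent_copies` at real, large source folders would put sustained simultaneous write load on the array, which is the condition that preceded the panics described above. A script like this cannot capture the panic itself, since the server hangs, so the result still has to be read from the console.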
JorgeB Posted April 15, 2022

18 minutes ago, neural said:
Also: Bad Drive is not bad; Why is Unraid not reflecting this ?

Unraid disables a disk if there's a write error. That doesn't mean the problem was caused by the disk, but once a disk gets disabled you need to rebuild it to re-enable it. Assuming the emulated disk is mounting and the contents look correct, you can rebuild on top: https://wiki.unraid.net/Manual/Storage_Management#Rebuilding_a_drive_onto_itself
neural Posted April 15, 2022

@JorgeB Thank you for the information. We are just re-running the test case to try to recreate the Kernel Panic with the NEW HBA. If it passes, we will rebuild that drive as you have prescribed. Thanks! The drive issue happened after the 1st Kernel Panic, so it is not our primary mission per se, but it is all good experience.
neural Posted April 19, 2022

UPDATE #3 - Kernel Panic - Stable for >48 hrs

What's inside the server now:
- Supermicro AOC-SASLP-MV8 Rev1.01
- New Supermicro 8087 to SATA cables - CBL-0097L-03
- Nvidia Quadro 4000
- USB (plugged into the motherboard USB slot)

Array: All clear
- Resolved the drive issue and added another 3TB drive
- Parity is clear, array has no errors

Next use case: Now I will begin to set up Unraid with VMs and add the K80 (when I get the riser cable and K80 cooler).
softdrinker Posted August 20, 2022

Hello,

Really interesting topic. Is it still stable today? Was it due to a faulty LSI HBA?

Thank you
adrianniebla24 Posted December 6, 2023

Any updates? Did you find the issue? I recently upgraded my server: changed the CPU and added an LSI 9201-8i SAS card. The server will stay up for up to 24 hours and then just randomly crash with a kernel panic; the logs show nothing out of the ordinary. I put the old CPU back and it still crashed. Right now I have removed the SAS card and am waiting to see if it will crash. I want to know if there is something you did to fix the issue.