[SOLVED] [6.6.2] Unable to complete parity check due to kernel panic, not syncing, out of memory


mrbens


Hi, really in need of assistance with this please. I've attached diagnostics at different stages of the problem.

 

I've been having multiple issues recently and haven't been able to complete a parity check for a couple of months: after a few hours of running the check, the server freezes with this output:

 

990116085_2018-11-2721-03-01.thumb.jpg.1e8de981675b12f65e2a3dcdfb0ca374.jpg

 

It also sometimes kernel panics when no parity check is running, if left powered on for a couple of days. I have 7 other kernel panic photos I can attach if anyone wants to see them, as only the bottom 3 kernel lines are the same each time.

 

I've run 6 passes of memtest (about 6 hours) with no issues found, but I know it's recommended to run it for about 24 hours to be sure. Happy to do that if advised.

 

Then, to make it worse, a couple of weeks after that started a disk became error-disabled. I swapped it out for a new disk, which I'd just pre-cleared with the intention of expanding the array, but obviously I can't complete the parity check or the rebuild. I can't remember why now, but at the time it looked like the disk was fine and the SATA port had caused the issue, so I'll run a pre-clear on it to test it when the server is working again.

 

The new disk is on a different SATA port and cable from the errored one, but tonight when I powered the server up it disabled disk 14 again (screenshot below), with a load of errors across the disks on my 8-port SATA card (SuperMicro AOC-SASLP-MV8 PCI-E x4 8-Port SAS). Disks 10 to 16 and the original errored disk are all on that card, which leads me to think the card might be failing.

 

Could that cause an out of memory issue?

 

Seems to be picking them off one by one! I don't have any other spare SATA ports to move disks around to test unfortunately. BTW the GUI looks gorgeous using Dark Reader :)

 

503801978_2019-01-22Disk14errorstate.PNG.d936a83a37e801265a9f9535164501e3.PNG

 

After a reboot there have been no further errors, whereas before that the logs were full of red ata lines (see "tower-diagnostics-20190122-2010 emulated disk 14 errors.zip").

 

I've seen in the link below that the kernel panic out of memory issue can be caused by processes on the server consuming all the memory. How would I check for that please?
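
For the question above about spotting which processes are consuming memory, here is a minimal sketch rather than a definitive tool. It assumes a Linux `/proc` filesystem and a Python 3 interpreter, which may need to be installed separately on Unraid. The function names `parse_status` and `top_memory_processes` are made up for illustration. It reads the `VmRSS` field from each `/proc/<pid>/status` file and prints the largest consumers first.

```python
#!/usr/bin/env python3
"""List the processes using the most resident memory, read from /proc (Linux only)."""
import os


def parse_status(text):
    """Return (name, rss_kib) from the text of a /proc/<pid>/status file.

    Kernel threads have no VmRSS line, so rss_kib stays 0 for them.
    """
    name, rss_kib = "?", 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     12345 kB"
            rss_kib = int(line.split()[1])
    return name, rss_kib


def top_memory_processes(limit=10, proc_root="/proc"):
    """Return up to `limit` (rss_kib, pid, name) tuples, largest first."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux system, nothing to report
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                name, rss_kib = parse_status(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
        rows.append((rss_kib, int(entry), name))
    rows.sort(reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    for rss_kib, pid, name in top_memory_processes():
        print(f"{rss_kib / 1024:10.1f} MiB  pid {pid:<7} {name}")
```

Running something like this periodically and appending the output to a file should show whether one process keeps growing in the hours before a freeze. Pressing Shift+M inside `top` gives a similar memory-sorted view interactively.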

 

 

One time when it froze, it displayed the output below; I'd left top running to try to capture anything at the moment of the freeze. I'm guessing that was just stating which processes were being stopped to free up memory. When I logged in, it immediately kernel panicked.

 

132455496_2018-12-0419-18-52.thumb.jpg.bed8290f99e1842bb759527fd7dddb09.jpg

 

Not sure if it's related, but not long before the kernel panic issue started I replaced a generic 2-port SATA card because it was generating red ata syslog lines on its 2 disks. I haven't had that issue with the replacement, a 2-port Highpoint Rocket 620 SATA card. To be able to use that card I used the enable_ahci.sh script in the link below. Could that have affected the 8-port card?

 

Also, is there a way to save the server syslog to the USB drive or to an external server so I can see what is going on before the kernel panic?
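
On keeping a copy of the syslog that survives a freeze, the idea can be sketched as below. This is only an illustration under assumptions: it takes the live log to be `/var/log/syslog` and the USB flash to be mounted at `/boot`, the target file name is hypothetical, and Python 3 may need installing separately. The helper names `copy_new_data` and `mirror_forever` are made up for this example. It periodically appends whatever is new in the source file to the copy and forces it to disk with `os.fsync`.

```python
"""Mirror new syslog data to persistent storage so it survives a crash."""
import os
import time


def copy_new_data(src_path, dst_path, offset):
    """Append the bytes of src_path past `offset` to dst_path; return the new offset.

    If the source shrank (rotated or truncated), start again from byte 0.
    """
    size = os.path.getsize(src_path)
    if size < offset:
        offset = 0
    if size == offset:
        return offset  # nothing new since the last call
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)
        data = src.read(size - offset)
        dst.write(data)
        dst.flush()
        os.fsync(dst.fileno())  # push the data to the device right away
    return offset + len(data)


def mirror_forever(src_path, dst_path, interval=5.0):
    """Poll src_path every `interval` seconds and mirror new data to dst_path."""
    offset = 0
    while True:
        try:
            offset = copy_new_data(src_path, dst_path, offset)
        except OSError as exc:
            print(f"mirror error: {exc}")
        time.sleep(interval)


# Example call (both paths are assumptions, adjust to your system):
# mirror_forever("/var/log/syslog", "/boot/logs/syslog-mirror.txt")
```

Polling every few seconds means the last lines before a hard freeze can still be lost, and frequent writes add wear to a USB flash drive, so this is best left running only while chasing the problem.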

 

Thanks for any help on this! I can't wait to get my server working again :)

 

Ben

tower-diagnostics-20181203-2003.zip

tower-diagnostics-20181211-0154 before parity check.zip

tower-diagnostics-20181211-1148 after failed parity check, disk 14 emulated.zip

tower-diagnostics-20190120-1335.zip

tower-diagnostics-20190122-2010 emulated disk 14 errors.zip

Edited by mrbens
Re-structured the text to make it clearer and added more info.

There appears to be a problem with the SASLP controller: there are some syslog errors, and it's not passing SMART info or spinning down disks. Because of this it's spamming the log, which makes it very difficult to analyze. It could also be what's causing the problems, and since those controllers are not recommended anyway for other reasons, I would suggest you try a different one, preferably an LSI HBA.


Thanks for the info. I wasn't aware it wasn't recommended.

 

Had a look into it and looks like this is a good deal. LSI SAS 9207-8i HBA for £108:

https://www.amazon.co.uk/Broadcom-Port-6Gbps-9207-8i-Adaptor-x/dp/B0085FT2JC

 

Some of the Amazon reviews mention it is good with unRAID and there's a thread about it on here too:

I can only find Chinese imports of the cheaper model, the LSI SAS 9201-8i, so I would rather not risk it.

 

https://wiki.unraid.net/index.php/Hardware_Compatibility#PCI_SATA_Controllers


Many thanks again, the replacement LSI card has fixed it!

 

Full rebuild of disk14 complete. No errors and no issues that I can see in the syslog.

 

Preclearing the original disk14 now to see if I can safely re-add it to the array. Hopefully all back to normal :)

Edited by mrbens
