pengrus Posted March 17, 2022 (edited)

Hi! My backup server threw a 2TB disk (an Enterprise one, of course; I still have 10-year-old WD20EARS drives in this thing with no problems), so I replaced it with a 4TB Seagate. But every time I try to start the array and rebuild the data, it gets to some random point and dies, crashing the server and requiring a hard reboot. The new drive is fine as far as I can tell, with no SMART errors. Memtest ran for a day with no errors. I've attached diagnostics from before array start and after the crash, plus tails of the syslog and the kernel log, though nothing in them appears useful. I've also attached a screenshot of the error that finally crashes the server (loudly today, with the beeping). I have searched and found other threads mentioning something similar, but no real fixable causes have been discovered. I hope someone out there can help! Thank you. -P

Attachments: archive-diagnostics-20220316-1144.zip, archive-diagnostics-20220317-1748.zip, juststarted.txt, kerneljuststarted.txt

Edited March 22, 2022 by pengrus: Solved!
JorgeB Posted March 18, 2022

Post the output of:

cat /proc/interrupts

The SASLP tends to like IRQ16; if that IRQ is getting disabled under high load, that's a problem.
pengrus (Author) Posted March 18, 2022

6 hours ago, JorgeB said: Post the output of: cat /proc/interrupts The SASLP tends to like IRQ16; if that IRQ is getting disabled under high load, that's a problem.

Well, it's definitely mvsas... how do I fix that??

root@Archive:~# cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
   0:   11987036          0          0          0          0          0          0          0  IO-APIC 2-edge timer
   1:          0          4          0          0          0          0          0          0  IO-APIC 1-edge i8042
   8:          0          0          4          0          0          0          0          0  IO-APIC 8-edge rtc0
   9:          0          0          0          0          0          0          0          0  IO-APIC 9-fasteoi acpi
  12:          6          0          0          0          0          0          0          0  IO-APIC 12-edge i8042
  16:          0          0          0          0     142514          0          0          0  IO-APIC 16-fasteoi mvsas
  18:          0          0          0          4          0          0          0          0  IO-APIC 18-fasteoi i801_smbus
  19:          0          0          0          0          0      77385          0          0  IO-APIC 19-fasteoi ata_piix, ata_piix
  21:          0          0          0          0          0          0         72          0  IO-APIC 21-fasteoi ehci_hcd:usb1
  23:          0          0          0          0          0          0          0    2654154  IO-APIC 23-fasteoi ehci_hcd:usb2
  24:    7115066          0          0          0          0          0          0          0  HPET-MSI 3-edge hpet3
  25:          0    6391924          0          0          0          0          0          0  HPET-MSI 4-edge hpet4
  26:          0          0    6033243          0          0          0          0          0  HPET-MSI 5-edge hpet5
  27:          0          0          0    5389967          0          0          0          0  HPET-MSI 6-edge hpet6
  28:          0          0          0          0    4701097          0          0          0  HPET-MSI 7-edge hpet7
  29:          0          0          0          0          0          0          0          0  DMAR-MSI 0-edge dmar0
  30:          0          0          0          0          0          0          0          0  PCI-MSI 49152-edge PCIe PME, aerdrv
  31:          0          0          0          0          0          0          0          0  PCI-MSI 81920-edge PCIe PME, aerdrv
  32:          0          0          0          0          0          0          0          0  PCI-MSI 458752-edge PCIe PME
  33:          0          0          0          0          0          0          0          0  PCI-MSI 466944-edge PCIe PME
  34:          0          0          0          0          0          0          0          0  PCI-MSI 468992-edge PCIe PME
  35:          0          0          0          0          0     555414          0          0  PCI-MSI 2097152-edge eth0-rx-0
  36:          0          0          0          0          0          0     209403          0  PCI-MSI 2097153-edge eth0-tx-0
  37:          0          0          0          0          0          0          0          2  PCI-MSI 2097154-edge eth0
  38:          0          0          0          0          0          0          0      86345  PCI-MSI 524288-edge mpt2sas0-msix0
 NMI:          0          0          0          0          0          0          0          0  Non-maskable interrupts
 LOC:        181        190        187        184        181    5772395    5427425    8262557  Local timer interrupts
 SPU:          0          0          0          0          0          0          0          0  Spurious interrupts
 PMI:          0          0          0          0          0          0          0          0  Performance monitoring interrupts
 IWI:    1070909     857156     782246     731932     645449     944730     664880    1390869  IRQ work interrupts
 RTR:          0          0          0          0          0          0          0          0  APIC ICR read retries
 RES:      74066      29253      25872      22715      22290      30056      26554      29858  Rescheduling interrupts
 CAL:     266472      84006      63573      55369      70468      22849      16713      11208  Function call interrupts
 TLB:       2859       3516       3542       3300       3399       3428       3186       2658  TLB shootdowns
 TRM:          0          0          0          0          0          0          0          0  Thermal event interrupts
 THR:          0          0          0          0          0          0          0          0  Threshold APIC interrupts
 DFR:          0          0          0          0          0          0          0          0  Deferred Error APIC interrupts
 MCE:          0          0          0          0          0          0          0          0  Machine check exceptions
 MCP:        217        218        218        218        218        218        218        218  Machine check polls
 ERR:          0
 MIS:          0
 PIN:          0          0          0          0          0          0          0          0  Posted-interrupt notification event
 NPI:          0          0          0          0          0          0          0          0  Nested posted-interrupt event
 PIW:          0          0          0          0          0          0          0          0  Posted-interrupt wakeup event

Thanks! -P
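For anyone who wants to check the same thing on their own box without reading the table by eye, here is a minimal Python sketch that parses /proc/interrupts-style text and reports which IRQ a given driver sits on. The function names (parse_interrupts, irqs_for_driver) and the SAMPLE excerpt are made up for illustration; the sample rows are copied from the output above. It assumes the usual layout: a header row of CPU names, then one row per IRQ with one count per CPU followed by a description.

```python
# Hypothetical helper, not part of Unraid or any library: parse text in the
# /proc/interrupts layout and look up the IRQ line(s) used by a driver.

SAMPLE = """\
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  16:          0          0          0          0     142514          0          0          0  IO-APIC 16-fasteoi mvsas
  19:          0          0          0          0          0      77385          0          0  IO-APIC 19-fasteoi ata_piix, ata_piix
  38:          0          0          0          0          0          0          0      86345  PCI-MSI 524288-edge mpt2sas0-msix0
 ERR:          0
"""


def parse_interrupts(text):
    """Return {irq_label: (total_count, description)} for each row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Leading numeric fields are per-CPU counts (at most one per CPU).
        for tok in fields[:n_cpus]:
            if not tok.isdigit():
                break
            counts.append(int(tok))
        desc = " ".join(fields[len(counts):])
        table[label.strip()] = (sum(counts), desc)
    return table


def irqs_for_driver(table, driver):
    """Return the IRQ labels whose description mentions the driver name."""
    return sorted(label for label, (_, desc) in table.items() if driver in desc)


table = parse_interrupts(SAMPLE)
print(irqs_for_driver(table, "mvsas"))
```

On a live system you would pass the contents of /proc/interrupts instead of SAMPLE, for example by reading the file with open("/proc/interrupts").read().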
JorgeB Posted March 21, 2022

You can try a different slot, but IIRC they tend to always go for IRQ16. Your best bet would be to replace the controller with an LSI; the SASLP controllers have not been recommended for a long time because of various issues.
Solution: pengrus (Author) Posted March 22, 2022

Thanks to @JorgeB for pointing me to the controller. For anyone who might still have one chugging away: what appears to happen is that under load the Marvell-based controller (AOC-SASLP-MV-8 in this case) will freak out over IRQ16 (or 13, sometimes) and crash the server. The disk being rebuilt wasn't even on that controller, but a rebuild needs all the disks to participate, so...

Anyway, I went and found a different post (also featuring @JorgeB and @saarg in starring roles) that recommended disabling IOMMU translation for host devices (passthrough mode) by adding "iommu=pt" to the append line in syslinux.cfg. And now my drive is rebuilt. I have some more LSIs on order to replace the controller so I can have IOMMU back, but this works for now! Thanks again. -P
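To make the syslinux.cfg change concrete, here is a rough Python sketch that inserts a kernel parameter into the append line(s) of a syslinux-style config. The function name add_kernel_param and the sample config text are assumptions for illustration only (the sample is modelled on a typical Unraid boot entry, not taken from this thread), and the sketch edits every append line it finds, so every boot entry gets the parameter. On a real server you would normally just edit the file by hand or through the web UI.

```python
# Hypothetical helper: add a kernel parameter to each "append" line of a
# syslinux-style config. The sample text below is an assumed layout.

SAMPLE_CFG = """\
default menu.c32
label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot
"""


def add_kernel_param(cfg_text, param):
    """Return cfg_text with param added to every append line that lacks it."""
    out = []
    for line in cfg_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("append ") and param not in stripped.split():
            indent = line[: len(line) - len(stripped)]
            # Put the new parameter first and keep the existing ones after it.
            line = f"{indent}append {param} {stripped[len('append '):]}"
        out.append(line)
    return "\n".join(out) + "\n"


print(add_kernel_param(SAMPLE_CFG, "iommu=pt"))
```

Running the function a second time on its own output leaves the text unchanged, so it is safe to apply more than once. A reboot is needed before the kernel picks up the new parameter.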