
Random Crashes/freezes. No idea what's causing it.



Like the title says. I just went through a server hardware upgrade: new MB/CPU/RAM/PSU. I'll post the details of the system below along with the latest diagnostics zip and syslog, but first a brief description of what has been happening. I performed the upgrade a week ago Friday, and the server ran rock solid until 2 days ago, when I decided to add an HBA, 2 SAS drives, a different CPU, and a new Pioneer UHD drive. The reason for the new CPU was that the previous one didn't have integrated graphics, which meant cannibalizing the GPU from my wife's PC every time I needed to make a BIOS change. It was easier to swap the 14400F chip for the 14500.

 

The System is currently as follows:

UnRAID OS Pro ver. 6.12.10

CPU - Intel i5 14500

MB - MSI PRO Z690-A WIFI

RAM - 4x32GB Corsair Vengeance DDR5-5200

HBA - LSI SAS9300-16i (currently running the 16.00.10.00 IT firmware and associated BIOS/EFI from Broadcom's website. I also had the TrueNAS/Broadcom collaboration 16.00.12.00 IT firmware installed, but downgraded as a troubleshooting step... still got a crash)

Connections - 2x SFF-8643 mini-SAS HD to 4x SFF-8482 breakout cables, for a total of 8 drives connected.

PSU - Corsair HX1000i

Cache - Samsung 970 Evo Plus 1TB NVMe M.2 SSD

Drive configuration post-upgrade (ignore the missing disc numbers: I removed several unused drives and was going to rebuild the parity/drive config after preclearing and installing the new SAS drives. Additional note: I did rebuild the parity after removing those drives and before the Skyhawk started failing):

Parity - Seagate EXOS X16 ST16000NM001G 16TB

Disc 3 - Seagate EXOS X16 ST16000NM001G 16TB

Disc 6 - Seagate Skyhawk ST10000VX0004 10TB (after the second or third crash this drive began returning numerous reallocated sector errors, over 200 at one point. I'm currently trying to rebuild it onto a new EXOS X16 16TB SAS drive I purchased. I had been trying to run a preclear on the new drive prior to this, but the server kept freezing during the pre-read)

Disc 7 - Seagate Desktop ST4000DM000 4TB

Disc 8 - Seagate Desktop ST4000DM000 4TB

Disc 9 - Seagate Desktop ST4000DM000 4TB

Disc 10 - Seagate EXOS X10 ST10000NM0086 10TB

 

I was seeing I/O errors associated with the Pioneer drive in a log at one point so it has been temporarily removed. Crashes still remain.

 

I also swapped to a new (old) flash drive for the OS. I had been using a 2GB Sony drive for the last 15 years, but my laptop was having issues reading it, so I replaced it with a 32GB USB 2.0 SanDisk Cruzer Glide, just in case that was the issue. There were no problems with the backup/key transfer and the drive appears to be working perfectly. I've tried it in multiple USB ports on the server; it's currently in a rear-panel USB 2.0 port.

 

I've set up the syslog server to save locally to my cache drive and mirror to the flash. So far I haven't been able to capture much. As I type this, the server has been running a rebuild on the new drive for 20-some minutes, but nothing has been added to the syslog since I changed where the syslog server saves the file. I downloaded the latest diagnostics zip and it is attached below. I have no idea what I should be looking for there.

 

I had the Skyhawk removed from the array but still attached as an unassigned device. I've now removed that as well, so my current theory is a hardware issue with the LSI HBA card or the 14500 CPU. If there's any indication of either in the diagnostics, it'd be great to know which. At this point I'm mostly stumped: short of pulling the HBA card and SAS drives and going back to the onboard SATA, I'm not sure what else to try. Any help would be great! Thanks.

dvd-diagnostics-20240421-1049.zip syslog-previous

Link to comment

As a follow-up... The rebuild has been running for 4-5 hours now with no issue. I realized that after loading the downgraded firmware for the HBA I had not restarted; I just started the array and proceeded with the rebuild, which then crashed again. Where I'm at currently is the first reboot since the downgrade, so maybe the 16.00.12.00 firmware was causing the issue and it just needed a reboot to clear it after going back to 16.00.10.00. Is that plausible? I don't fully buy it, but so far so good.

 

I also wanted to mention that the freeze occurred in both normal and safe mode, and with Docker both enabled and disabled. I've also only seen the issue while the array is started and the disks are mounted.

 

I haven't mounted a fan on the LSI card yet, but it doesn't seem to be getting that hot. I've got great airflow through the case, and drive, NVMe, and MB temps rarely get over 30 C. The HBA is also connected to the PSU via the 6-pin PCIe power cable. Final item of note: I've tried the HBA in both the PCIe 5.0 x16 slot (CPU lanes) and a PCIe 3.0 x4 slot (chipset lanes), and it crashed in both.

 

Hopefully that covers everything. I hope someone can give me some solid advice based on that.

Link to comment

Ok, 7 hours in now and still going, though this did pop up in the syslog a while ago. Looks like some kind of fault related to the HBA. Does this tell us anything more specific?

 

Apr 21 12:18:14 DVD kernel: mpt3sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready
Apr 21 12:18:14 DVD kernel: mpt3sas_cm0: fault_state(0x5862)!
Apr 21 12:18:14 DVD kernel: mpt3sas_cm0: sending diag reset !!
Apr 21 12:18:15 DVD kernel: mpt3sas_cm0: diag reset: SUCCESS
Apr 21 12:18:15 DVD kernel: mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
Apr 21 12:18:15 DVD kernel: mpt3sas_cm0: _base_display_fwpkg_version: complete
Apr 21 12:18:15 DVD kernel: mpt3sas_cm0: LSISAS3008: FWVersion(16.00.10.00), ChipRevision(0x02), BiosVersion(18.00.00.00)
Apr 21 12:18:15 DVD kernel: mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
Apr 21 12:18:15 DVD kernel: mpt3sas_cm0: sending port enable !!
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: port enable: SUCCESS
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: search for end-devices: start
Apr 21 12:18:23 DVD kernel: scsi target9:0:0: handle(0x0009), sas_addr(0x4433221100000000)
Apr 21 12:18:23 DVD kernel: scsi target9:0:0: enclosure logical id(0x500062b202a05640), slot(3)
Apr 21 12:18:23 DVD kernel: scsi target9:0:1: handle(0x000a), sas_addr(0x4433221101000000)
Apr 21 12:18:23 DVD kernel: scsi target9:0:1: enclosure logical id(0x500062b202a05640), slot(2)
Apr 21 12:18:23 DVD kernel: scsi target9:0:5: handle(0x000b), sas_addr(0x5000c500c9d8f0d5)
Apr 21 12:18:23 DVD kernel: scsi target9:0:5: enclosure logical id(0x500062b202a05640), slot(6)
Apr 21 12:18:23 DVD kernel: #011handle changed from(0x000c)!!!
Apr 21 12:18:23 DVD kernel: scsi target9:0:2: handle(0x000c), sas_addr(0x4433221102000000)
Apr 21 12:18:23 DVD kernel: scsi target9:0:2: enclosure logical id(0x500062b202a05640), slot(0)
Apr 21 12:18:23 DVD kernel: #011handle changed from(0x000b)!!!
Apr 21 12:18:23 DVD kernel: scsi target9:0:3: handle(0x000d), sas_addr(0x4433221103000000)
Apr 21 12:18:23 DVD kernel: scsi target9:0:3: enclosure logical id(0x500062b202a05640), slot(1)
Apr 21 12:18:23 DVD kernel: scsi target9:0:6: handle(0x000e), sas_addr(0x4433221107000000)
Apr 21 12:18:23 DVD kernel: scsi target9:0:6: enclosure logical id(0x500062b202a05640), slot(5)
Apr 21 12:18:23 DVD kernel: #011handle changed from(0x000f)!!!
Apr 21 12:18:23 DVD kernel: scsi target9:0:4: handle(0x000f), sas_addr(0x4433221104000000)
Apr 21 12:18:23 DVD kernel: scsi target9:0:4: enclosure logical id(0x500062b202a05640), slot(7)
Apr 21 12:18:23 DVD kernel: #011handle changed from(0x000e)!!!
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: search for end-devices: complete
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: search for end-devices: start
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: search for PCIe end-devices: complete
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: search for expanders: start
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: search for expanders: complete
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: mpt3sas_base_hard_reset_handler: SUCCESS
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: _base_fault_reset_work: hard reset: success
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: removing unresponding devices: start
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: removing unresponding devices: end-devices
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: Removing unresponding devices: pcie end-devices
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: removing unresponding devices: expanders
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: removing unresponding devices: complete
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: scan devices: start
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: #011scan devices: expanders start
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: #011break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: #011scan devices: expanders complete
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: #011scan devices: end devices start
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: #011break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: #011scan devices: end devices complete
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: #011scan devices: pcie end devices start
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: #011break from pcie end device scan: ioc_status(0x0021), loginfo(0x3003011d)
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: #011pcie devices: pcie end devices complete
Apr 21 12:18:23 DVD kernel: mpt3sas_cm0: scan devices: complete
Apr 21 12:18:23 DVD kernel: sd 9:0:5:0: Mode parameters changed
Apr 21 12:18:23 DVD kernel: sd 9:0:0:0: Power-on or device reset occurred
Apr 21 12:18:23 DVD kernel: sd 9:0:4:0: Power-on or device reset occurred
Apr 21 12:18:23 DVD kernel: sd 9:0:2:0: Power-on or device reset occurred
Apr 21 12:18:23 DVD kernel: sd 9:0:6:0: Power-on or device reset occurred
Apr 21 12:18:23 DVD kernel: sd 9:0:1:0: Power-on or device reset occurred
Apr 21 12:18:23 DVD kernel: sd 9:0:3:0: Power-on or device reset occurred
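For anyone trying to read the `loginfo(...)` values in the excerpt above, here is a minimal Python sketch that splits a log_info word into its fields. The bit layout is an assumption based on my reading of the Linux mpt3sas driver (top nibble bus type, next nibble originator, then an 8-bit code and a 16-bit sub-code); it agrees with the kernel's own decoding of 0x3003011d in the log. The function name `decode_loginfo` is made up for this example.

```python
# Sketch: split an mpt3sas log_info word into its fields.
# Assumed layout (not taken from official documentation):
#   bits 31-28 bus type, bits 27-24 originator, bits 23-16 code, bits 15-0 sub-code.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_loginfo(value: int) -> dict:
    """Return the bus type, originator, code and sub-code packed into a log_info word."""
    originator = (value >> 24) & 0xF
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get(originator, f"unknown({originator})"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


# The two distinct loginfo values that appear in the excerpt above.
for word in (0x3003011D, 0x310F0400):
    fields = decode_loginfo(word)
    print(
        f"0x{word:08x}: originator({fields['originator']}), "
        f"code(0x{fields['code']:02x}), sub_code(0x{fields['sub_code']:04x})"
    )
```

This only separates the fields. What a given code or sub-code means is firmware-specific and is not something the sketch can tell you.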

 

Thoughts? Anybody?

Link to comment

OK, I addressed that in my second post above. I tried two different slots and, while I don't have a fan mounted directly to the heat sink of the HBA, I do have 4x 140mm fans in a push-pull configuration pulling air through the front, over the HDD stack, and directly onto the HBA card. It's warmer than the other components in my system, but I can easily put my hand on the heatsink without discomfort or burning. I'll grab a laser temp gauge later today to get an actual reading.

 

That said... I had another crash after going to bed last night! The big issue is that when it crashes there is no event added to the log as the system just randomly hangs out of the blue. After my last post yesterday I got a bit more in the log and then nothing all the way till the system froze 5-6 hours later.

 

Apr 21 17:39:51 DVD kernel: BUG: Bad page map in process disk_load  pte:c2bb4f2a9f02a87f pmd:1546ec067
Apr 21 17:39:51 DVD kernel: addr:0000000000400000 vm_flags:00000071 anon_vma:0000000000000000 mapping:ffff888108c1a898 index:0
Apr 21 17:39:51 DVD kernel: file:bash fault:shmem_fault mmap:shmem_mmap read_folio:0x0
Apr 21 17:39:51 DVD kernel: CPU: 10 PID: 7966 Comm: disk_load Tainted: P           O       6.1.79-Unraid #1
Apr 21 17:39:51 DVD kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D25/PRO Z690-A WIFI (MS-7D25), BIOS A.H0 03/29/2024
Apr 21 17:39:51 DVD kernel: Call Trace:
Apr 21 17:39:51 DVD kernel: <TASK>
Apr 21 17:39:51 DVD kernel: dump_stack_lvl+0x44/0x5c
Apr 21 17:39:51 DVD kernel: print_bad_pte+0x1bc/0x1d6
Apr 21 17:39:51 DVD kernel: vm_normal_page+0x81/0x9b
Apr 21 17:39:51 DVD kernel: unmap_page_range+0x384/0x67b
Apr 21 17:39:51 DVD kernel: ? prep_new_page+0x1c/0x4c
Apr 21 17:39:51 DVD kernel: unmap_vmas+0xb6/0x100
Apr 21 17:39:51 DVD kernel: exit_mmap+0xdb/0x22e
Apr 21 17:39:51 DVD kernel: ? finish_task_switch.isra.0+0x140/0x218
Apr 21 17:39:51 DVD kernel: __mmput+0x43/0xe3
Apr 21 17:39:51 DVD kernel: do_exit+0x31b/0x923
Apr 21 17:39:51 DVD kernel: ? _raw_spin_lock_irqsave+0x2c/0x37
Apr 21 17:39:51 DVD kernel: do_group_exit+0x7a/0x7a
Apr 21 17:39:51 DVD kernel: get_signal+0x622/0x65a
Apr 21 17:39:51 DVD kernel: arch_do_signal_or_restart+0x36/0x607
Apr 21 17:39:51 DVD kernel: ? __do_sys_wait4+0x37/0x8a
Apr 21 17:39:51 DVD kernel: ? do_sigaction+0x1c4/0x1ee
Apr 21 17:39:51 DVD kernel: exit_to_user_mode_prepare+0x58/0x112
Apr 21 17:39:51 DVD kernel: syscall_exit_to_user_mode+0x18/0x2c
Apr 21 17:39:51 DVD kernel: do_syscall_64+0x77/0x81
Apr 21 17:39:51 DVD kernel: entry_SYSCALL_64_after_hwframe+0x64/0xce
Apr 21 17:39:51 DVD kernel: RIP: 0033:0x1477feee5c63
Apr 21 17:39:51 DVD kernel: Code: Unable to access opcode bytes at 0x1477feee5c39.
Apr 21 17:39:51 DVD kernel: RSP: 002b:00007ffc7a031608 EFLAGS: 00000202 ORIG_RAX: 000000000000003d
Apr 21 17:39:51 DVD kernel: RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00001477feee5c63
Apr 21 17:39:51 DVD kernel: RDX: 0000000000000000 RSI: 00007ffc7a031638 RDI: 00000000ffffffff
Apr 21 17:39:51 DVD kernel: RBP: 0000000000538c48 R08: 0000000000000001 R09: 0000000000000008
Apr 21 17:39:51 DVD kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00000000005393a0
Apr 21 17:39:51 DVD kernel: R13: 0000000000528e0c R14: 0000000000538c48 R15: 00000000005393a0
Apr 21 17:39:51 DVD kernel: </TASK>
Apr 21 17:39:51 DVD kernel: BUG: Bad page map in process disk_load  pte:e66f50f55cccfd27 pmd:1546ec067
Apr 21 17:39:51 DVD kernel: addr:0000000000401000 vm_flags:00000071 anon_vma:0000000000000000 mapping:ffff888108c1a898 index:1
Apr 21 17:39:51 DVD kernel: file:bash fault:shmem_fault mmap:shmem_mmap read_folio:0x0
Apr 21 17:39:51 DVD kernel: CPU: 10 PID: 7966 Comm: disk_load Tainted: P    B      O       6.1.79-Unraid #1
Apr 21 17:39:51 DVD kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D25/PRO Z690-A WIFI (MS-7D25), BIOS A.H0 03/29/2024
Apr 21 17:39:51 DVD kernel: Call Trace:
Apr 21 17:39:51 DVD kernel: <TASK>
Apr 21 17:39:51 DVD kernel: dump_stack_lvl+0x44/0x5c
Apr 21 17:39:51 DVD kernel: print_bad_pte+0x1bc/0x1d6
Apr 21 17:39:51 DVD kernel: vm_normal_page+0x81/0x9b
Apr 21 17:39:51 DVD kernel: unmap_page_range+0x384/0x67b
Apr 21 17:39:51 DVD kernel: ? prep_new_page+0x1c/0x4c
Apr 21 17:39:51 DVD kernel: unmap_vmas+0xb6/0x100
Apr 21 17:39:51 DVD kernel: exit_mmap+0xdb/0x22e
Apr 21 17:39:51 DVD kernel: ? finish_task_switch.isra.0+0x140/0x218
Apr 21 17:39:51 DVD kernel: __mmput+0x43/0xe3
Apr 21 17:39:51 DVD kernel: do_exit+0x31b/0x923
Apr 21 17:39:51 DVD kernel: ? _raw_spin_lock_irqsave+0x2c/0x37
Apr 21 17:39:51 DVD kernel: do_group_exit+0x7a/0x7a
Apr 21 17:39:51 DVD kernel: get_signal+0x622/0x65a
Apr 21 17:39:51 DVD kernel: arch_do_signal_or_restart+0x36/0x607
Apr 21 17:39:51 DVD kernel: ? __do_sys_wait4+0x37/0x8a
Apr 21 17:39:51 DVD kernel: ? do_sigaction+0x1c4/0x1ee
Apr 21 17:39:51 DVD kernel: exit_to_user_mode_prepare+0x58/0x112
Apr 21 17:39:51 DVD kernel: syscall_exit_to_user_mode+0x18/0x2c
Apr 21 17:39:51 DVD kernel: do_syscall_64+0x77/0x81
Apr 21 17:39:51 DVD kernel: entry_SYSCALL_64_after_hwframe+0x64/0xce
Apr 21 17:39:51 DVD kernel: RIP: 0033:0x1477feee5c63
Apr 21 17:39:51 DVD kernel: Code: Unable to access opcode bytes at 0x1477feee5c39.
Apr 21 17:39:51 DVD kernel: RSP: 002b:00007ffc7a031608 EFLAGS: 00000202 ORIG_RAX: 000000000000003d
Apr 21 17:39:51 DVD kernel: RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00001477feee5c63
Apr 21 17:39:51 DVD kernel: RDX: 0000000000000000 RSI: 00007ffc7a031638 RDI: 00000000ffffffff
Apr 21 17:39:51 DVD kernel: RBP: 0000000000538c48 R08: 0000000000000001 R09: 0000000000000008
Apr 21 17:39:51 DVD kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00000000005393a0
Apr 21 17:39:51 DVD kernel: R13: 0000000000528e0c R14: 0000000000538c48 R15: 00000000005393a0
Apr 21 17:39:51 DVD kernel: </TASK>
Apr 21 17:39:51 DVD kernel: BUG: Bad page map in process disk_load  pte:d8d336e62a531f07 pmd:1546ec067
Apr 21 17:39:51 DVD kernel: addr:0000000000402000 vm_flags:00000071 anon_vma:0000000000000000 mapping:ffff888108c1a898 index:2
Apr 21 17:39:51 DVD kernel: file:bash fault:shmem_fault mmap:shmem_mmap read_folio:0x0
Apr 21 17:39:51 DVD kernel: CPU: 10 PID: 7966 Comm: disk_load Tainted: P    B      O       6.1.79-Unraid #1
Apr 21 17:39:51 DVD kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D25/PRO Z690-A WIFI (MS-7D25), BIOS A.H0 03/29/2024
Apr 21 17:39:51 DVD kernel: Call Trace:
Apr 21 17:39:51 DVD kernel: <TASK>
Apr 21 17:39:51 DVD kernel: dump_stack_lvl+0x44/0x5c
Apr 21 17:39:51 DVD kernel: print_bad_pte+0x1bc/0x1d6
Apr 21 17:39:51 DVD kernel: vm_normal_page+0x81/0x9b
Apr 21 17:39:51 DVD kernel: unmap_page_range+0x384/0x67b
Apr 21 17:39:51 DVD kernel: ? prep_new_page+0x1c/0x4c
Apr 21 17:39:51 DVD kernel: unmap_vmas+0xb6/0x100
Apr 21 17:39:51 DVD kernel: exit_mmap+0xdb/0x22e
Apr 21 17:39:51 DVD kernel: ? finish_task_switch.isra.0+0x140/0x218
Apr 21 17:39:51 DVD kernel: __mmput+0x43/0xe3
Apr 21 17:39:51 DVD kernel: do_exit+0x31b/0x923
Apr 21 17:39:51 DVD kernel: ? _raw_spin_lock_irqsave+0x2c/0x37
Apr 21 17:39:51 DVD kernel: do_group_exit+0x7a/0x7a
Apr 21 17:39:51 DVD kernel: get_signal+0x622/0x65a
Apr 21 17:39:51 DVD kernel: arch_do_signal_or_restart+0x36/0x607
Apr 21 17:39:51 DVD kernel: ? __do_sys_wait4+0x37/0x8a
Apr 21 17:39:51 DVD kernel: ? do_sigaction+0x1c4/0x1ee
Apr 21 17:39:51 DVD kernel: exit_to_user_mode_prepare+0x58/0x112
Apr 21 17:39:51 DVD kernel: syscall_exit_to_user_mode+0x18/0x2c
Apr 21 17:39:51 DVD kernel: do_syscall_64+0x77/0x81
Apr 21 17:39:51 DVD kernel: entry_SYSCALL_64_after_hwframe+0x64/0xce
Apr 21 17:39:51 DVD kernel: RIP: 0033:0x1477feee5c63
Apr 21 17:39:51 DVD kernel: Code: Unable to access opcode bytes at 0x1477feee5c39.
Apr 21 17:39:51 DVD kernel: RSP: 002b:00007ffc7a031608 EFLAGS: 00000202 ORIG_RAX: 000000000000003d
Apr 21 17:39:51 DVD kernel: RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00001477feee5c63
Apr 21 17:39:51 DVD kernel: RDX: 0000000000000000 RSI: 00007ffc7a031638 RDI: 00000000ffffffff
Apr 21 17:39:51 DVD kernel: RBP: 0000000000538c48 R08: 0000000000000001 R09: 0000000000000008
Apr 21 17:39:51 DVD kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00000000005393a0
Apr 21 17:39:51 DVD kernel: R13: 0000000000528e0c R14: 0000000000538c48 R15: 00000000005393a0
Apr 21 17:39:51 DVD kernel: </TASK>
Apr 21 17:39:51 DVD kernel: BUG: Bad page map in process disk_load  pte:88810cf8a39f9f00 pmd:1546ec067
Apr 21 17:39:51 DVD kernel: addr:0000000000403000 vm_flags:00000071 anon_vma:0000000000000000 mapping:ffff888108c1a898 index:3
Apr 21 17:39:51 DVD kernel: file:bash fault:shmem_fault mmap:shmem_mmap read_folio:0x0
Apr 21 17:39:51 DVD kernel: CPU: 10 PID: 7966 Comm: disk_load Tainted: P    B      O       6.1.79-Unraid #1
Apr 21 17:39:51 DVD kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D25/PRO Z690-A WIFI (MS-7D25), BIOS A.H0 03/29/2024
Apr 21 17:39:51 DVD kernel: Call Trace:
Apr 21 17:39:51 DVD kernel: <TASK>
Apr 21 17:39:51 DVD kernel: dump_stack_lvl+0x44/0x5c
Apr 21 17:39:51 DVD kernel: print_bad_pte+0x1bc/0x1d6
Apr 21 17:39:51 DVD kernel: vm_normal_page+0x81/0x9b
Apr 21 17:39:51 DVD kernel: unmap_page_range+0x384/0x67b
Apr 21 17:39:51 DVD kernel: ? prep_new_page+0x1c/0x4c
Apr 21 17:39:51 DVD kernel: unmap_vmas+0xb6/0x100
Apr 21 17:39:51 DVD kernel: exit_mmap+0xdb/0x22e
Apr 21 17:39:51 DVD kernel: ? finish_task_switch.isra.0+0x140/0x218
Apr 21 17:39:51 DVD kernel: __mmput+0x43/0xe3
Apr 21 17:39:51 DVD kernel: do_exit+0x31b/0x923
Apr 21 17:39:51 DVD kernel: ? _raw_spin_lock_irqsave+0x2c/0x37
Apr 21 17:39:51 DVD kernel: do_group_exit+0x7a/0x7a
Apr 21 17:39:51 DVD kernel: get_signal+0x622/0x65a
Apr 21 17:39:51 DVD kernel: arch_do_signal_or_restart+0x36/0x607
Apr 21 17:39:51 DVD kernel: ? __do_sys_wait4+0x37/0x8a
Apr 21 17:39:51 DVD kernel: ? do_sigaction+0x1c4/0x1ee
Apr 21 17:39:51 DVD kernel: exit_to_user_mode_prepare+0x58/0x112
Apr 21 17:39:51 DVD kernel: syscall_exit_to_user_mode+0x18/0x2c
Apr 21 17:39:51 DVD kernel: do_syscall_64+0x77/0x81
Apr 21 17:39:51 DVD kernel: entry_SYSCALL_64_after_hwframe+0x64/0xce
Apr 21 17:39:51 DVD kernel: RIP: 0033:0x1477feee5c63
Apr 21 17:39:51 DVD kernel: Code: Unable to access opcode bytes at 0x1477feee5c39.
Apr 21 17:39:51 DVD kernel: RSP: 002b:00007ffc7a031608 EFLAGS: 00000202 ORIG_RAX: 000000000000003d
Apr 21 17:39:51 DVD kernel: RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00001477feee5c63
Apr 21 17:39:51 DVD kernel: RDX: 0000000000000000 RSI: 00007ffc7a031638 RDI: 00000000ffffffff
Apr 21 17:39:51 DVD kernel: RBP: 0000000000538c48 R08: 0000000000000001 R09: 0000000000000008
Apr 21 17:39:51 DVD kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00000000005393a0
Apr 21 17:39:51 DVD kernel: R13: 0000000000528e0c R14: 0000000000538c48 R15: 00000000005393a0
Apr 21 17:39:51 DVD kernel: </TASK>
Apr 21 17:39:51 DVD kernel: BUG: Bad page map in process disk_load  pte:e7e6fde0edee0451 pmd:1546ec067
Apr 21 17:39:51 DVD kernel: addr:0000000000404000 vm_flags:00000071 anon_vma:0000000000000000 mapping:ffff888108c1a898 index:4
Apr 21 17:39:51 DVD kernel: file:bash fault:shmem_fault mmap:shmem_mmap read_folio:0x0
Apr 21 17:39:51 DVD kernel: CPU: 10 PID: 7966 Comm: disk_load Tainted: P    B      O       6.1.79-Unraid #1
Apr 21 17:39:51 DVD kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D25/PRO Z690-A WIFI (MS-7D25), BIOS A.H0 03/29/2024
Apr 21 17:39:51 DVD kernel: Call Trace:
Apr 21 17:39:51 DVD kernel: <TASK>
Apr 21 17:39:51 DVD kernel: dump_stack_lvl+0x44/0x5c
Apr 21 17:39:51 DVD kernel: print_bad_pte+0x1bc/0x1d6
Apr 21 17:39:51 DVD kernel: vm_normal_page+0x81/0x9b
Apr 21 17:39:51 DVD kernel: unmap_page_range+0x384/0x67b
Apr 21 17:39:51 DVD kernel: ? prep_new_page+0x1c/0x4c
Apr 21 17:39:51 DVD kernel: unmap_vmas+0xb6/0x100
Apr 21 17:39:51 DVD kernel: exit_mmap+0xdb/0x22e
Apr 21 17:39:51 DVD kernel: ? finish_task_switch.isra.0+0x140/0x218
Apr 21 17:39:51 DVD kernel: __mmput+0x43/0xe3
Apr 21 17:39:51 DVD kernel: do_exit+0x31b/0x923
Apr 21 17:39:51 DVD kernel: ? _raw_spin_lock_irqsave+0x2c/0x37
Apr 21 17:39:51 DVD kernel: do_group_exit+0x7a/0x7a
Apr 21 17:39:51 DVD kernel: get_signal+0x622/0x65a
Apr 21 17:39:51 DVD kernel: arch_do_signal_or_restart+0x36/0x607
Apr 21 17:39:51 DVD kernel: ? __do_sys_wait4+0x37/0x8a
Apr 21 17:39:51 DVD kernel: ? do_sigaction+0x1c4/0x1ee
Apr 21 17:39:51 DVD kernel: exit_to_user_mode_prepare+0x58/0x112
Apr 21 17:39:51 DVD kernel: syscall_exit_to_user_mode+0x18/0x2c
Apr 21 17:39:51 DVD kernel: do_syscall_64+0x77/0x81
Apr 21 17:39:51 DVD kernel: entry_SYSCALL_64_after_hwframe+0x64/0xce
Apr 21 17:39:51 DVD kernel: RIP: 0033:0x1477feee5c63
Apr 21 17:39:51 DVD kernel: Code: Unable to access opcode bytes at 0x1477feee5c39.
Apr 21 17:39:51 DVD kernel: RSP: 002b:00007ffc7a031608 EFLAGS: 00000202 ORIG_RAX: 000000000000003d
Apr 21 17:39:51 DVD kernel: RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00001477feee5c63
Apr 21 17:39:51 DVD kernel: RDX: 0000000000000000 RSI: 00007ffc7a031638 RDI: 00000000ffffffff
Apr 21 17:39:51 DVD kernel: RBP: 0000000000538c48 R08: 0000000000000001 R09: 0000000000000008
Apr 21 17:39:51 DVD kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00000000005393a0
Apr 21 17:39:51 DVD kernel: R13: 0000000000528e0c R14: 0000000000538c48 R15: 00000000005393a0
Apr 21 17:39:51 DVD kernel: </TASK>
Apr 21 17:39:51 DVD kernel: BUG: Bad page map in process disk_load  pte:b883ef5228c7e09d pmd:1546ec067
Apr 21 17:39:51 DVD kernel: addr:0000000000407000 vm_flags:00000071 anon_vma:0000000000000000 mapping:ffff888108c1a898 index:7
Apr 21 17:39:51 DVD kernel: file:bash fault:shmem_fault mmap:shmem_mmap read_folio:0x0
Apr 21 17:39:51 DVD kernel: CPU: 10 PID: 7966 Comm: disk_load Tainted: P    B      O       6.1.79-Unraid #1
Apr 21 17:39:51 DVD kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D25/PRO Z690-A WIFI (MS-7D25), BIOS A.H0 03/29/2024
Apr 21 17:39:51 DVD kernel: Call Trace:
Apr 21 17:39:51 DVD kernel: <TASK>
Apr 21 17:39:51 DVD kernel: dump_stack_lvl+0x44/0x5c
Apr 21 17:39:51 DVD kernel: print_bad_pte+0x1bc/0x1d6
Apr 21 17:39:51 DVD kernel: vm_normal_page+0x81/0x9b
Apr 21 17:39:51 DVD kernel: unmap_page_range+0x384/0x67b
Apr 21 17:39:51 DVD kernel: ? prep_new_page+0x1c/0x4c
Apr 21 17:39:51 DVD kernel: unmap_vmas+0xb6/0x100
Apr 21 17:39:51 DVD kernel: exit_mmap+0xdb/0x22e
Apr 21 17:39:51 DVD kernel: ? finish_task_switch.isra.0+0x140/0x218
Apr 21 17:39:51 DVD kernel: __mmput+0x43/0xe3
Apr 21 17:39:51 DVD kernel: do_exit+0x31b/0x923
Apr 21 17:39:51 DVD kernel: ? _raw_spin_lock_irqsave+0x2c/0x37
Apr 21 17:39:51 DVD kernel: do_group_exit+0x7a/0x7a
Apr 21 17:39:51 DVD kernel: get_signal+0x622/0x65a
Apr 21 17:39:51 DVD kernel: arch_do_signal_or_restart+0x36/0x607
Apr 21 17:39:51 DVD kernel: ? __do_sys_wait4+0x37/0x8a
Apr 21 17:39:51 DVD kernel: ? do_sigaction+0x1c4/0x1ee
Apr 21 17:39:51 DVD kernel: exit_to_user_mode_prepare+0x58/0x112
Apr 21 17:39:51 DVD kernel: syscall_exit_to_user_mode+0x18/0x2c
Apr 21 17:39:51 DVD kernel: do_syscall_64+0x77/0x81
Apr 21 17:39:51 DVD kernel: entry_SYSCALL_64_after_hwframe+0x64/0xce
Apr 21 17:39:51 DVD kernel: RIP: 0033:0x1477feee5c63
Apr 21 17:39:51 DVD kernel: Code: Unable to access opcode bytes at 0x1477feee5c39.
Apr 21 17:39:51 DVD kernel: RSP: 002b:00007ffc7a031608 EFLAGS: 00000202 ORIG_RAX: 000000000000003d
Apr 21 17:39:51 DVD kernel: RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00001477feee5c63
Apr 21 17:39:51 DVD kernel: RDX: 0000000000000000 RSI: 00007ffc7a031638 RDI: 00000000ffffffff
Apr 21 17:39:51 DVD kernel: RBP: 0000000000538c48 R08: 0000000000000001 R09: 0000000000000008
Apr 21 17:39:51 DVD kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00000000005393a0
Apr 21 17:39:51 DVD kernel: R13: 0000000000528e0c R14: 0000000000538c48 R15: 00000000005393a0
Apr 21 17:39:51 DVD kernel: </TASK>
Apr 21 17:39:51 DVD kernel: BUG: Bad rss-counter state mm:000000009a70e951 type:MM_SHMEMPAGES val:8
Apr 21 18:00:01 DVD crond[1461]: exit status 127 from user root /usr/sbin/speedtest-xml &> /dev/null
Apr 21 21:00:01 DVD crond[1461]: exit status 127 from user root /usr/sbin/speedtest-xml &> /dev/null
Apr 22 00:00:01 DVD crond[1461]: exit status 127 from user root /usr/sbin/speedtest-xml &> /dev/null
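With this many near-identical traces, it can help to boil them down to a count per process plus the list of reported pte values. The following is a rough Python sketch under the assumption that each report starts with a line shaped like the ones above; `summarize_bad_page_maps` is a hypothetical helper name, and the sample lines are copied from this excerpt.

```python
import re
from collections import Counter

# Matches the first line of each "Bad page map" report as it appears in the syslog above.
BAD_PTE = re.compile(
    r"BUG: Bad page map in process (?P<proc>\S+)\s+pte:(?P<pte>[0-9a-f]+) pmd:(?P<pmd>[0-9a-f]+)"
)


def summarize_bad_page_maps(lines):
    """Count 'Bad page map' reports per process and collect the reported pte values."""
    per_process = Counter()
    ptes = []
    for line in lines:
        match = BAD_PTE.search(line)
        if match:
            per_process[match["proc"]] += 1
            ptes.append(match["pte"])
    return per_process, ptes


sample = [
    "Apr 21 17:39:51 DVD kernel: BUG: Bad page map in process disk_load  pte:c2bb4f2a9f02a87f pmd:1546ec067",
    "Apr 21 17:39:51 DVD kernel: Call Trace:",
    "Apr 21 17:39:51 DVD kernel: BUG: Bad page map in process disk_load  pte:e66f50f55cccfd27 pmd:1546ec067",
]
counts, ptes = summarize_bad_page_maps(sample)
print(dict(counts), ptes)
```

To run it against a saved syslog, pass an open file object instead of `sample`. The summary says nothing about the cause by itself; it just makes the pattern easier to see.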

 

 

Got this around 5:40pm and then nothing till it froze sometime after midnight.

 

On a whim, I started a mem test this morning after seeing elsewhere that "Bad page map" errors might be indicative of bad memory. Got 5% of the way through the first pass and started popping errors and fails!

 

*Head -> Desk... lots of creative mumbling of inappropriate phrases*

 

I'm now doing a stick by stick test to see which one is borked! Honestly... it'd been running perfect for a week and only started giving issues after I brought on the HBA. Maybe that just put the right kind of strain on the memory to present the problems that were already there. Who knows. Either way, I'm fairly certain now that this is the majority of my problem atm. Once I isolate the bad stick I'll run things with half the memory while I get the one kit replaced and see if anything else weird pops up.

 

Link to comment

Quick Update...

 

Memtest86 reported errors and failed on tests 3-4 of a single pass with all 4 sticks installed. So far I've tested each individual stick in the A2 slot and all have completed a single pass with no errors. So... the memory itself might not be bad. I am now testing each individual slot on the motherboard. So far A2 is obviously good and A1 is currently testing, but looking good as well. That means either B1 or B2 is faulty, or one of the sticks wasn't 100% seated or something. All the sticks were fully locked into place, so all I can think is that maybe the fan from the CPU cooler caused an issue when I swapped CPUs (the fan sits directly over A1), or something got bumped or shifted slightly out of alignment. I've moved the fan to the opposite side of the cooling tower for now and we'll see where I'm at after testing. I'm now 99% certain my testing will reveal NO errors or definitive answers. It'll probably run 24/7 for the next 10 years and never give me another single issue... which will piss me off to no end. 😂

 

edit: OR the A1 slot is bad... errors on tests #6 and #8 so far. I'll see how B1 and B2 do, but I'm guessing it's a motherboard problem. Good to know, but only mildly less annoying. lol

Edited by Harblar
Link to comment
48 minutes ago, Harblar said:

Memtest86 reported errors and failed on tests 3-4 of a single pass with all 4 sticks installed. So far I've tested each individual stick in the A2 slot and all have completed a single pass with no errors. So... the memory itself might not be bad.

Some motherboards just won't run with all slots filled.

Link to comment

Getting errors on A1 where it is the ONLY slot filled. Tested the same stick in all other slots with no issue. Going to run a couple more tests to confirm things, but I've already started a return request with MSI.
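The elimination logic behind the stick-by-stick, slot-by-slot testing can be sketched in a few lines of Python. This is only an illustration: the stick labels are hypothetical (the thread never says which physical stick was used for the slot tests), and it assumes a single clean pass is enough to clear a stick or a slot, which an intermittent fault could violate.

```python
# Sketch: narrow single-DIMM memtest results down to a suspect stick or slot.
def find_suspects(results):
    """results maps (stick, slot) -> True for a clean pass, False if errors appeared."""
    sticks = {stick for stick, _ in results}
    slots = {slot for _, slot in results}
    # A stick or slot is treated as cleared once it has taken part in any clean pass.
    good_sticks = {stick for (stick, _), ok in results.items() if ok}
    good_slots = {slot for (_, slot), ok in results.items() if ok}
    return sorted(sticks - good_sticks), sorted(slots - good_slots)


# Hypothetical labels; the pass/fail pattern mirrors what is described in this thread:
# every stick passed in A2, and one stick failed only when it sat in A1.
results = {
    ("stick1", "A2"): True,
    ("stick2", "A2"): True,
    ("stick3", "A2"): True,
    ("stick4", "A2"): True,
    ("stick1", "A1"): False,
    ("stick1", "B1"): True,
    ("stick1", "B2"): True,
}
suspect_sticks, suspect_slots = find_suspects(results)
print("suspect sticks:", suspect_sticks, "suspect slots:", suspect_slots)
```

With those inputs no stick is left under suspicion and A1 is the only slot that never produced a clean pass, which matches the conclusion above.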

 

Any recommendations on boards that WILL run with all slots filled? I'm looking at getting an MSI Pro Z790-P WIFI, since it will allow expansion to 256GB RAM and provides PCIe 4.0 x4 through the chipset (for the HBA) in addition to the PCIe 5.0 x16 through the CPU lanes, whereas my current board only allows PCIe 3.0 x4 through the chipset.

 

I'd prefer to get one without "wifi", but that's actually not easy to find these days. lol

Link to comment
2 minutes ago, Harblar said:

Any recommendations on boards that WILL run with all slots filled?

Sorry, I didn't mean to imply that there are properly working boards that don't run with all slots full. If the manufacturer says their board will run with model XXXX RAM, it should run it fine, but that doesn't mean boards don't fail.

 

I just wanted to let you know that it could be a failure symptom: you can have a board where all the slots are fine and all the DIMMs are fine, but all 4 at once isn't.

 

I personally had a board that ran fine with all 4 DIMMs for years, until it didn't. The only failure mode was random errors when all 4 slots were full: it ran perfectly on any 2 of the DIMMs, but put all 4 in and memtest would fail every time.

Link to comment

Gotcha. Didn't mean to come across as sarcastic. I'd genuinely take any recommendations you might have for combos that have proven good.

 

I did just check my memory against the MSI QVL for this board, and this Corsair module is technically only qualified in 1- or 2-stick configurations (not 4). That said, I'm still nearly certain that A1 is faulty, since it's the only slot that gave any errors when I did the single-stick, slot-by-slot tests. I was going to run a dual-channel (A1/B1) test to further verify, but the A2/B2 test came back clean, so I just left it and decided to give the rebuild another shot and see what happens.

Link to comment

6 hours in and 21% through the rebuild, and EVERYTHING is running smoother and cooler. The CPU is sitting at 25C and 1-2% utilization (it was 35-40C and 7-10% yesterday). RAM usage is sitting tight at 3GB used/6GB cached. With twice the memory yesterday, it was sitting at nearly double that.

 

There also hasn't been a single hiccup in the syslog yet. 

 

One other thing I disabled in the BIOS was the CPU C-states. The setting was hidden in an obscure overclocking menu or I would have had it disabled from the first night. I'm not saying it was causing issues, but it's pretty much not needed for a server that runs 24/7, and I've seen where some people with AMD-based systems have had freezing issues with it enabled. Granted, I have an Intel setup, but better to get rid of all the useless crap that could potentially be getting in the way. 🤘

Link to comment
