January 16, 2019
Oddly this started after I started transferring a few files directly from a rclone mount. Not sure if it was just a coincidence.
Quote
Jan 15 22:23:21 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:23:21 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:23:21 Backup kernel: CMCI storm detected: switching to poll mode
Jan 15 22:24:25 Backup kernel: mce_notify_irq: 15 callbacks suppressed
Jan 15 22:24:25 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:24:35 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:26:48 Backup kernel: mce_notify_irq: 2 callbacks suppressed
Jan 15 22:26:48 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:27:00 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:29:00 Backup kernel: mce_notify_irq: 2 callbacks suppressed
Jan 15 22:29:00 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:34:00 Backup kernel: CMCI storm subsided: switching to interrupt mode
Jan 15 22:37:07 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:37:54 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:38:20 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:38:29 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:39:20 Backup kernel: CMCI storm detected: switching to poll mode
Jan 15 22:39:21 Backup kernel: mce_notify_irq: 18 callbacks suppressed
Jan 15 22:39:21 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:39:22 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:40:26 Backup kernel: mce_notify_irq: 1 callbacks suppressed
Jan 15 22:40:26 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:40:40 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:43:14 Backup kernel: mce_notify_irq: 1 callbacks suppressed
Jan 15 22:43:14 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:46:30 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:50:00 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:53:10 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:54:32 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:54:37 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:55:42 Backup kernel: mce_notify_irq: 5 callbacks suppressed
Jan 15 22:55:42 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:55:52 Backup login[32403]: ROOT LOGIN on '/dev/pts/0'
Jan 15 22:56:00 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:56:43 Backup kernel: mce_notify_irq: 1 callbacks suppressed
Jan 15 22:56:43 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:56:54 Backup kernel: mce: [Hardware Error]: Machine check events logged
backup-diagnostics-20190115-2259.zip
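A log like the one quoted above is easier to reason about once it is tallied. The following is a minimal sketch, assuming syslog lines in the same "Mon DD HH:MM:SS host source: message" shape as the quote; the function name `summarize_mce` and the counter keys are made up for illustration and are not part of any Unraid tool.

```python
import re
from collections import Counter

# Matches classic syslog lines such as:
#   "Jan 15 22:23:21 Backup kernel: mce: [Hardware Error]: Machine check events logged"
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) ([^:]+): (.*)$")
SUPPRESSED_RE = re.compile(r"mce_notify_irq: (\d+) callbacks suppressed")


def summarize_mce(lines):
    """Count machine-check related kernel messages in an iterable of syslog lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a syslog line in the expected shape
        _timestamp, _host, source, message = m.groups()
        if source != "kernel":
            continue  # e.g. the "login[32403]" line in the quote
        if "Machine check events logged" in message:
            counts["mce_logged"] += 1
        elif message.startswith("CMCI storm detected"):
            counts["storm_start"] += 1
        elif message.startswith("CMCI storm subsided"):
            counts["storm_end"] += 1
        else:
            s = SUPPRESSED_RE.match(message)
            if s:
                # Rate-limited messages that were never printed
                counts["suppressed"] += int(s.group(1))
    return counts
```

The "callbacks suppressed" lines indicate rate limiting, so the number of printed "Machine check events logged" lines is likely a lower bound on the number of events rather than an exact count.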
January 16, 2019 Community Expert
See if /var/log/mcelog has anything interesting. Also, if the board has a system event log, there might be some more info there.
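As a quick way to act on that suggestion, here is a minimal sketch that reports whether the mcelog file contains anything at all. The path /var/log/mcelog comes from the post; the helper name `read_mcelog` is hypothetical, and the sketch assumes the file is plain text.

```python
from pathlib import Path


def read_mcelog(path="/var/log/mcelog"):
    """Return the non-blank lines of the mcelog file, or [] if it is missing or empty."""
    p = Path(path)
    if not p.is_file():
        return []
    text = p.read_text(errors="replace")
    return [ln for ln in text.splitlines() if ln.strip()]


if __name__ == "__main__":
    entries = read_mcelog()
    if entries:
        print(f"{len(entries)} non-blank lines in mcelog; last one: {entries[-1]}")
    else:
        print("mcelog is missing or empty")
```

An empty result only says that nothing was written to that file; it does not by itself show that no machine checks occurred.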
January 16, 2019 Author
Thanks johnnie.black, always willing to help. Much appreciated. I have mcelog installed from the Nerd Pack, but it has never worked. I also know a ton of people who say the same thing. The system event log only shows sys_fan4, 3, 2, 1 and cpu2, 1 fan, all "lower critical going low", asserted or deasserted. Probably due to the fans I'm using. I did notice this:
Quote
70 01/10/2019 01:47:57 34 AC Lost Power Supply Power Supply Input Lost or Out of Range - Asserted
But I realized that was from when I shut the server down gracefully and had to pull it. I didn't open it, just did some rearranging in the rack. I'm running a memtest right now.
January 16, 2019 Author
One pass ran with no errors. Not sure how many passes are sufficient in this case?
January 16, 2019 Community Expert
If you're using ECC RAM, no errors will show in memtest, since they are corrected.
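Since corrected ECC errors will not surface in memtest, one place they may show up on a running Linux system is the kernel's EDAC counters in sysfs. The sketch below reads `ce_count` (corrected) and `ue_count` (uncorrected) per memory controller. It assumes an EDAC driver for the platform is loaded, which the thread does not confirm for this server; the function name `edac_error_counts` is illustrative only.

```python
from pathlib import Path


def edac_error_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from EDAC sysfs counters.

    Returns an empty dict if the EDAC directory does not exist
    (for example, when no EDAC driver is loaded).
    """
    result = {}
    root = Path(base)
    if not root.is_dir():
        return result
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        result[mc.name] = (ce, ue)
    return result


if __name__ == "__main__":
    counts = edac_error_counts()
    if not counts:
        print("No EDAC memory controllers found")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A non-zero corrected count that keeps growing would be consistent with ECC quietly fixing errors, but machine check events can also originate from sources other than memory.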
January 25, 2019 Author
Upgraded to the latest RC version available; the hardware errors are still continuing. Here are the latest syslog messages, which I haven't seen before.
Quote
Jan 25 06:00:14 Backup kernel: Uhhuh. NMI received for unknown reason 21 on CPU 0.
Jan 25 06:00:14 Backup kernel: Do you have a strange power saving mode enabled?
Jan 25 06:00:14 Backup kernel: Dazed and confused, but trying to continue
Jan 25 06:00:14 Backup kernel: DMAR: DRHD: handling fault status reg 2
Jan 25 06:00:14 Backup kernel: DMAR: [DMA Read] Request device [02:00.0] fault addr ff0cf000 [fault reason 06] PTE Read access is not set
Jan 25 06:00:14 Backup kernel: DMAR: [DMA Read] Request device [03:00.0] fault addr fed22000 [fault reason 06] PTE Read access is not set
Jan 25 06:00:14 Backup kernel: DMAR: DRHD: handling fault status reg 202
Jan 25 06:00:14 Backup kernel: DMAR: [DMA Read] Request device [03:00.0] fault addr fed13000 [fault reason 06] PTE Read access is not set
Jan 25 06:00:14 Backup kernel: DMAR: DRHD: handling fault status reg 302
Jan 25 06:00:14 Backup kernel: DMAR: [DMA Read] Request device [03:00.0] fault addr fed14000 [fault reason 06] PTE Read access is not set
Jan 25 06:00:15 Backup kernel: mce_notify_irq: 58 callbacks suppressed
Jan 25 06:00:15 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 25 06:00:16 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 25 06:00:22 Backup kernel: dmar_fault: 8424 callbacks suppressed
Jan 25 06:00:22 Backup kernel: DMAR: DRHD: handling fault status reg 402
Jan 25 06:00:22 Backup kernel: DMAR: [DMA Read] Request device [02:00.0] fault addr fea3a000 [fault reason 06] PTE Read access is not set
Jan 25 06:00:23 Backup kernel: DMAR: DRHD: handling fault status reg 502
Jan 25 06:00:23 Backup kernel: DMAR: [DMA Read] Request device [03:00.0] fault addr fec10000 [fault reason 06] PTE Read access is not set
Jan 25 06:00:23 Backup kernel: DMAR: DRHD: handling fault status reg 602
Jan 25 06:00:23 Backup kernel: DMAR: [DMA Read] Request device [03:00.0] fault addr fe7df000 [fault reason 06] PTE Read access is not set
Jan 25 06:00:23 Backup kernel: DMAR: DRHD: handling fault status reg 702
backup-diagnostics-20190125-0754.zip
Edited January 25, 2019 by slimshizn
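The DMAR lines in the quote name the PCI devices that issued the faulting DMA reads, so grouping them by device address can show which device to look at first. This is a minimal sketch, assuming the message format shown in the quote; the regular expression and the name `dmar_faults_by_device` are illustrative and not taken from any kernel or Unraid tooling.

```python
import re
from collections import Counter

# Matches e.g.:
#   "DMAR: [DMA Read] Request device [03:00.0] fault addr fed22000 [fault reason 06] ..."
DMAR_RE = re.compile(
    r"DMAR: \[DMA (Read|Write)\] Request device \[([0-9a-f:.]+)\] "
    r"fault addr ([0-9a-f]+) \[fault reason (\d+)\]"
)


def dmar_faults_by_device(lines):
    """Count printed DMAR fault messages per PCI device address (bus:device.function)."""
    counts = Counter()
    for line in lines:
        m = DMAR_RE.search(line)
        if m:
            counts[m.group(2)] += 1
    return counts
```

The resulting addresses (02:00.0 and 03:00.0 in the quote) can then be looked up with something like `lspci -s 03:00.0` to see which card they belong to. Because the log also reports "dmar_fault: 8424 callbacks suppressed", the printed lines are only a sample of the faults.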
January 30, 2019 Author
Guess I can just live with the MCE logs for now. Any limetech/unraid admins have any ideas here?
This topic is now archived and is closed to further replies.