slimshizn Posted January 16, 2019
Oddly, this started after I began transferring a few files directly from an rclone mount. Not sure if it was just a coincidence.
Quote
Jan 15 22:23:21 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:23:21 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:23:21 Backup kernel: CMCI storm detected: switching to poll mode
Jan 15 22:24:25 Backup kernel: mce_notify_irq: 15 callbacks suppressed
Jan 15 22:24:25 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:24:35 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:26:48 Backup kernel: mce_notify_irq: 2 callbacks suppressed
Jan 15 22:26:48 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:27:00 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:29:00 Backup kernel: mce_notify_irq: 2 callbacks suppressed
Jan 15 22:29:00 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:34:00 Backup kernel: CMCI storm subsided: switching to interrupt mode
Jan 15 22:37:07 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:37:54 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:38:20 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:38:29 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:39:20 Backup kernel: CMCI storm detected: switching to poll mode
Jan 15 22:39:21 Backup kernel: mce_notify_irq: 18 callbacks suppressed
Jan 15 22:39:21 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:39:22 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:40:26 Backup kernel: mce_notify_irq: 1 callbacks suppressed
Jan 15 22:40:26 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:40:40 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:43:14 Backup kernel: mce_notify_irq: 1 callbacks suppressed
Jan 15 22:43:14 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:46:30 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:50:00 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:53:10 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:54:32 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:54:37 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:55:42 Backup kernel: mce_notify_irq: 5 callbacks suppressed
Jan 15 22:55:42 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:55:52 Backup login[32403]: ROOT LOGIN on '/dev/pts/0'
Jan 15 22:56:00 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:56:43 Backup kernel: mce_notify_irq: 1 callbacks suppressed
Jan 15 22:56:43 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 15 22:56:54 Backup kernel: mce: [Hardware Error]: Machine check events logged
Attachment: backup-diagnostics-20190115-2259.zip
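For anyone trying to gauge how often these events fire, here is a rough Python sketch that summarizes syslog lines in the format quoted above. It counts "Machine check events logged" messages per hour, adds up the callbacks the kernel reports as suppressed, and lists the CMCI storm transitions. The function name `summarize_mce` and the regular expressions are illustrative assumptions written against the quoted lines only; other kernels or syslog formats may need adjustments.

```python
import re
from collections import Counter

# Syslog lines in the format quoted above, e.g.
# "Jan 15 22:23:21 Backup kernel: mce: [Hardware Error]: Machine check events logged"
KERNEL_LINE = re.compile(r"^(\w{3}\s+\d+) (\d{2}):(\d{2}):(\d{2}) \S+ kernel: (.*)$")
SUPPRESSED = re.compile(r"^mce_notify_irq: (\d+) callbacks suppressed$")


def summarize_mce(lines):
    """Summarize MCE-related kernel messages (hypothetical helper).

    Returns a tuple of:
      - Counter mapping (date, hour) to the number of MCE notifications,
      - total number of callbacks reported as suppressed,
      - list of (timestamp, message) pairs for CMCI storm transitions.
    """
    per_hour = Counter()
    suppressed = 0
    storms = []
    for line in lines:
        match = KERNEL_LINE.match(line.strip())
        if match is None:
            continue  # not a kernel line (for example, the login entry)
        date, hour, minute, second, message = match.groups()
        if message.endswith("Machine check events logged"):
            per_hour[(date, hour)] += 1
        elif message.startswith("CMCI storm"):
            storms.append((f"{date} {hour}:{minute}:{second}", message))
        else:
            hit = SUPPRESSED.match(message)
            if hit:
                suppressed += int(hit.group(1))
    return per_hour, suppressed, storms
```

One way to use it, assuming the live log sits at /var/log/syslog on the server, is `summarize_mce(open("/var/log/syslog"))`. The summary only shows how frequent the events are; it does not say which component is reporting them.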
JorgeB Posted January 16, 2019
See if /var/log/mcelog has anything interesting. Also, if the board has a system event log, there might be some more info there.
slimshizn Posted January 16, 2019
Thanks johnnie.black, always willing to help. Much appreciated. I have mcelog installed from the Nerd Pack, but it has never worked, and I know a ton of people who say the same thing. The system event log only shows sys_fan4, 3, 2, 1 and the cpu2 and cpu1 fans, all "lower critical going low", asserted or deasserted. Probably due to the fans I'm using. I did notice this:
Quote
70 01/10/2019 01:47:57 34 AC Lost Power Supply Power Supply Input Lost or Out of Range - Asserted
But I realized that was when I shut down the server gracefully and had to pull it. I didn't open it, just did some rearranging in the rack. I'm running a memtest right now.
slimshizn Posted January 16, 2019
1 pass ran with no errors. Not sure how many passes are sufficient in this case?
JorgeB Posted January 16, 2019
If you're using ECC RAM, no errors will show in memtest, since they are corrected.
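Since corrected ECC errors will not surface in memtest, another place to look is the kernel's EDAC counters, which Linux exposes under /sys/devices/system/edac/mc when an EDAC driver for the memory controller is loaded. The sketch below is a hedged example only: it assumes those sysfs files exist on this system (they may not if no EDAC module is loaded), and `read_edac_counts` is a hypothetical helper name.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Read corrected (ce_count) and uncorrected (ue_count) error totals.

    Returns a dict such as {"mc0": {"ce_count": 3, "ue_count": 0}}.
    An empty dict means the EDAC directory is missing, which usually
    indicates that no EDAC driver is loaded for this platform.
    """
    counts = {}
    root = Path(base)
    if not root.is_dir():
        return counts
    for controller in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((controller / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # file missing or unreadable
        counts[controller.name] = entry
    return counts
```

A non-zero `ce_count` that keeps growing would be consistent with corrected memory errors, but a zero count does not rule out other MCE sources such as the CPU cache or the PCIe bus.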
slimshizn Posted January 16, 2019
Just started happening, so I'm not sure where to look here.
slimshizn Posted January 25, 2019
Upgraded to the latest RC version available; the hardware errors are still occurring. Here are the latest syslog messages, which I haven't seen before.
Quote
Jan 25 06:00:14 Backup kernel: Uhhuh. NMI received for unknown reason 21 on CPU 0.
Jan 25 06:00:14 Backup kernel: Do you have a strange power saving mode enabled?
Jan 25 06:00:14 Backup kernel: Dazed and confused, but trying to continue
Jan 25 06:00:14 Backup kernel: DMAR: DRHD: handling fault status reg 2
Jan 25 06:00:14 Backup kernel: DMAR: [DMA Read] Request device [02:00.0] fault addr ff0cf000 [fault reason 06] PTE Read access is not set
Jan 25 06:00:14 Backup kernel: DMAR: [DMA Read] Request device [03:00.0] fault addr fed22000 [fault reason 06] PTE Read access is not set
Jan 25 06:00:14 Backup kernel: DMAR: DRHD: handling fault status reg 202
Jan 25 06:00:14 Backup kernel: DMAR: [DMA Read] Request device [03:00.0] fault addr fed13000 [fault reason 06] PTE Read access is not set
Jan 25 06:00:14 Backup kernel: DMAR: DRHD: handling fault status reg 302
Jan 25 06:00:14 Backup kernel: DMAR: [DMA Read] Request device [03:00.0] fault addr fed14000 [fault reason 06] PTE Read access is not set
Jan 25 06:00:15 Backup kernel: mce_notify_irq: 58 callbacks suppressed
Jan 25 06:00:15 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 25 06:00:16 Backup kernel: mce: [Hardware Error]: Machine check events logged
Jan 25 06:00:22 Backup kernel: dmar_fault: 8424 callbacks suppressed
Jan 25 06:00:22 Backup kernel: DMAR: DRHD: handling fault status reg 402
Jan 25 06:00:22 Backup kernel: DMAR: [DMA Read] Request device [02:00.0] fault addr fea3a000 [fault reason 06] PTE Read access is not set
Jan 25 06:00:23 Backup kernel: DMAR: DRHD: handling fault status reg 502
Jan 25 06:00:23 Backup kernel: DMAR: [DMA Read] Request device [03:00.0] fault addr fec10000 [fault reason 06] PTE Read access is not set
Jan 25 06:00:23 Backup kernel: DMAR: DRHD: handling fault status reg 602
Jan 25 06:00:23 Backup kernel: DMAR: [DMA Read] Request device [03:00.0] fault addr fe7df000 [fault reason 06] PTE Read access is not set
Jan 25 06:00:23 Backup kernel: DMAR: DRHD: handling fault status reg 702
Attachment: backup-diagnostics-20190125-0754.zip
Edited January 25, 2019 by slimshizn
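The DMAR lines above name two PCI addresses, 02:00.0 and 03:00.0, and the kernel reports thousands more as suppressed. To see which device accounts for most of the faults, a small Python tally like the following could help. It is only a sketch matched against the message format quoted above; `tally_dmar_faults` and the regular expression are assumptions, not part of any existing tool.

```python
import re
from collections import Counter

# Matches kernel messages such as:
# "DMAR: [DMA Read] Request device [02:00.0] fault addr ff0cf000 [fault reason 06] ..."
DMAR_FAULT = re.compile(
    r"DMAR: \[DMA (Read|Write)\] Request device \[([0-9a-fA-F:.]+)\] "
    r"fault addr ([0-9a-fA-F]+) \[fault reason (\d+)\]"
)


def tally_dmar_faults(lines):
    """Count DMAR faults per (device, access type, fault reason)."""
    by_device = Counter()
    for line in lines:
        match = DMAR_FAULT.search(line)
        if match:
            access, device, _address, reason = match.groups()
            by_device[(device, access, reason)] += 1
    return by_device
```

Run against the excerpt above, this would show whether 02:00.0 or 03:00.0 dominates. Something like `lspci -s 03:00.0` should then identify the hardware at that address. Because the kernel suppresses most repeats, the counts are a lower bound rather than a full total.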
slimshizn Posted January 30, 2019
Guess I can just live with the MCE logs for now. Any Limetech/Unraid admins have any ideas here?