mgranger Posted November 19, 2019

Fix Common Problems reported a "Machine Check Events detected on your server" error. I'm not sure what that means, so I'm posting my diagnostics in the hope that someone can help. finalizer-diagnostics-20191119-1241.zip
John_M Posted November 19, 2019

It's a hardware fault logged as the CPU is starting up:

Nov 19 06:38:07 Finalizer kernel: smpboot: CPU0: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz (family: 0x6, model: 0x3c, stepping: 0x3)
Nov 19 06:38:07 Finalizer kernel: mce: [Hardware Error]: Machine check events logged
Nov 19 06:38:07 Finalizer kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 3: fe00000000800400
Nov 19 06:38:07 Finalizer kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffffa00ac3d3 MISC ffffffffa00ac3d3
Nov 19 06:38:07 Finalizer kernel: mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1574163469 SOCKET 0 APIC 0 microcode 27

It might indicate a faulty CPU, but I'd run a MemTest first (select it from the boot menu if legacy booting) to rule out faulty RAM. If the RAM passes the test, you could use the Nerd Tools plugin to install mcelog, which might reveal more information.

It looks like you also have cache pool corruption:

Nov 19 06:39:49 Finalizer kernel: BTRFS warning (device sdn1): csum failed root 5 ino 144339143 off 1912832 csum 0x3d38702b expected csum 0xf33e576a mirror 1

with one of the devices showing read errors:

Nov 19 06:39:49 Finalizer kernel: BTRFS info (device sdn1): read error corrected: ino 144339143 off 1916928 (dev /dev/sdp1 sector 3106512)
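If the RAM checks out, mcelog can decode the raw MCE bank/status values from the kernel log into a human-readable description. A minimal sketch, assuming mcelog is installed (e.g. via the Nerd Tools plugin); it's guarded so it simply reports if the tool is missing:

```shell
#!/bin/sh
# Decode any machine check events the kernel has recorded.
# Assumes mcelog is available (e.g. installed via the Nerd Tools plugin).

decode_mce() {
    if ! command -v mcelog >/dev/null 2>&1; then
        echo "mcelog not installed - install it first (Nerd Tools plugin)"
        return 0
    fi
    # With no arguments, mcelog reads pending events from /dev/mcelog
    # and prints a decoded description of each one.
    mcelog 2>&1 || true
}

decode_mce
```

On a box with no mcelog installed this just prints a hint and exits cleanly, so it's safe to run before deciding anything.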
mgranger (Author) Posted November 19, 2019

14 minutes ago, John_M said: "It might indicate a faulty CPU but I'd do a MemTest first ..."

OK, I will try that when I get home. That doesn't sound very good, though. I have mcelog installed already, but I'm not sure how to use it. What do I need to do about the cache pool corruption? I woke up this morning with the server locked up and had to reboot it. I assume that is what caused all of this, but I'm not sure whether I need to run a parity check to fix it or something else.
John_M Posted November 19, 2019

First thing is to test the RAM for 24 hours or so. If the RAM is bad, there's no point doing anything else until it's replaced.
JorgeB Posted November 19, 2019

46 minutes ago, John_M said: "It looks like you also have cache pool corruption ..."

These are checksum errors and usually the result of bad RAM.
mgranger (Author) Posted November 19, 2019

13 minutes ago, johnnie.black said: "These are checksum errors and usually the result of bad RAM."

So am I at risk of bad things happening (corruption?) if I keep my server up and running for now? I would like to keep it running if possible, but I don't want to cause more damage if I can avoid it. When I do the MemTest, will it show which stick is bad, or do I have to test one stick at a time by process of elimination? Is it common for RAM to go bad? I have had this computer for a little over a year and a half, and it doesn't seem like it should wear out this quickly.
JorgeB Posted November 19, 2019

5 minutes ago, mgranger said: "(Corruption?)"

Yes. All the data on the array is on xfs disks, and xfs can't detect data corruption. Data read from the cache will be corrected by btrfs if corruption is detected, but newly written data can also be corrupted, since the checksum saved by btrfs can be computed from already-corrupt data in RAM. You'd need ECC RAM to protect against corruption on write.
mgranger (Author) Posted November 19, 2019

17 minutes ago, johnnie.black said: "... you'd need ECC RAM to protect from corruption on write."

So I should shut it down. Is there a way to check whether data that has already been written to the array is corrupted?
JorgeB Posted November 19, 2019

Not with xfs and without checksums. As mentioned, btrfs can only confirm that it's returning the same data that was written, so if there was a problem during the write, the data can still be corrupt.
itimpi Posted November 19, 2019

4 minutes ago, mgranger said: "Is there a way to check if data is corrupted that has been written to the array already?"

Not unless you have checksums for the files, or backups that you can compare the array files against.
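One way to have checksums going forward is to build a manifest per share and re-verify it later. A rough sketch using md5sum; the share and manifest paths in the usage comment are placeholders, not paths from this thread:

```shell
#!/bin/sh
# Build a checksum manifest for a directory tree, then verify it later
# to detect silent corruption. Paths in the usage example are placeholders.

make_manifest() {
    dir="$1"; manifest="$2"
    # Hash every file relative to the share root so the manifest
    # stays valid even if the share is mounted elsewhere later.
    ( cd "$dir" && find . -type f -exec md5sum {} + ) > "$manifest"
}

verify_manifest() {
    dir="$1"; manifest="$2"
    # --quiet prints only the files that FAIL verification.
    ( cd "$dir" && md5sum --quiet -c "$manifest" )
}

# Example usage (hypothetical paths):
# make_manifest /mnt/user/example /boot/example.md5
# verify_manifest /mnt/user/example /boot/example.md5
```

Re-running verify_manifest after a scare (or on a schedule) tells you exactly which files changed since the manifest was built; it can't help with files that were already corrupt when first hashed.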
mgranger (Author) Posted November 19, 2019

For what it's worth, I have been running Memtest86 for over an hour now, and here is what I have got so far.
mgranger (Author) Posted November 20, 2019

4 hours ago, mgranger said: "For what it's worth I have been running Memtest86 for over an hour now ..."

OK, so I did one run using Memtest86 version 5.01 and got no errors. Then I stopped it and tried Memtest86 version 8.2. I am currently done with two runs of this version, also with no errors. I will let it finish overnight and update in the morning. It has been taking about 1 hour 45 minutes per run, so I am a little over 5 hours of memtesting.
Squid Posted November 20, 2019

17 hours ago, mgranger said: "I got an error in Fix Common Problems that I had a 'Machine Check Events detected on your server' error ..."

FWIW, the MCE happens when the system is initializing the CPUs. For some reason, some hardware combinations issue an MCE at that point, and it's nothing to worry about.
JorgeB Posted November 20, 2019

You should run memtest for at least 24 hours, and only an error result is conclusive: no errors doesn't mean there isn't a problem.
JorgeB Posted November 20, 2019

But looking more carefully at your diags, the checksum errors are almost certainly related to this:

Nov 19 06:38:56 Finalizer kernel: BTRFS info (device sdn1): bdev /dev/sdp1 errs: wr 220877319, rd 749360, flush 0, corrupt 0, gen 0

One of your cache devices dropped offline, and errors were corrected after it came back online. You should run a scrub; see here for more info.
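For reference, the scrub can also be started from a shell rather than the GUI. A sketch, assuming the pool is mounted at /mnt/cache (Unraid's usual cache mount point); it's guarded so it does nothing on a machine without that mount:

```shell
#!/bin/sh
# Start a scrub on the btrfs cache pool and report its status.
# POOL is Unraid's usual cache mount point - adjust if yours differs.

POOL=/mnt/cache

scrub_cache() {
    # /proc/mounts lines look like: "device mountpoint fstype opts 0 0"
    if ! grep -qs " $POOL btrfs " /proc/mounts; then
        echo "no btrfs pool mounted at $POOL - nothing to scrub"
        return 0
    fi
    btrfs scrub start "$POOL"    # scrub runs in the background
    btrfs scrub status "$POOL"   # progress and corrected-error counts
    btrfs device stats "$POOL"   # per-device write/read/corruption counters
}

scrub_cache
```

The device stats counters (like the wr 220877319 line above) persist across reboots; after fixing the underlying cause they can be cleared with `btrfs device stats -z`, so a later recurrence is easy to spot.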
mgranger (Author) Posted November 21, 2019

I had a lot of errors show up on the cache drive, and then yesterday my disk 6 became disabled due to read errors, so I went in and replaced the SATA cables on both of those drives hoping that would help. I am currently rebuilding drive 6 because I had to stop the array, remove the disk, and then add it back in. I am hoping this fixes some of my issues, but I am a little nervous about it. I also balanced and scrubbed the cache pool. Here are the diagnostics from this morning, although maybe the parity needs to finish. finalizer-diagnostics-20191121-1123.zip
JorgeB Posted November 21, 2019

The docker image is corrupt and is spamming the log; you need to delete and re-create it.
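For anyone following along: this is normally done from the GUI (Settings > Docker: disable the service, delete the image, re-enable it), but the same idea can be sketched from the shell. The image path and rc script below are common Unraid defaults and are assumptions; check your Docker settings for the real location:

```shell
#!/bin/sh
# Remove a corrupt docker.img so it can be re-created.
# The path below is a common Unraid default - verify it in Settings > Docker.

recreate_docker_img() {
    img=/mnt/user/system/docker/docker.img   # assumed default location
    if [ ! -f "$img" ]; then
        echo "no docker image at $img - check Settings > Docker for the path"
        return 0
    fi
    # Stop the Docker service before touching the image (Unraid rc script,
    # if present on this system).
    [ -x /etc/rc.d/rc.docker ] && /etc/rc.d/rc.docker stop
    rm -f "$img"
    echo "docker.img removed - re-enable Docker in the GUI to re-create it,"
    echo "then restore containers from Apps > Previous Apps"
    return 0
}

recreate_docker_img
```

Containers themselves aren't lost by this: their templates remain, and they can be re-added afterwards from the Previous Apps section.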
mgranger (Author) Posted November 21, 2019

25 minutes ago, johnnie.black said: "Docker image is corrupt and spamming the log, you need to delete and re-create."

I noticed that and deleted and recreated it late last night. Hopefully it's not still corrupt; it seems to be working right now.
JorgeB Posted November 21, 2019

Then it should be OK; the syslog doesn't show the last few hours because of the previously mentioned spam.
mgranger (Author) Posted November 21, 2019

I will restart after the parity check is run so that it all gets cleared out.
mgranger (Author) Posted November 22, 2019 (edited)

I am having some read errors with disk 6. When I try to move files over to it using unBALANCE, the disk gets read errors and becomes disabled, so I am emulating it off the parity drive. Not sure what is going on here; this is a brand new 8 TB drive. I will post the diagnostics. Also, what do I have to do to get the drive back in the array? Or is that too risky, and should I just avoid this drive? finalizer-diagnostics-20191122-0449.zip

Edited November 22, 2019 by mgranger
JorgeB Posted November 22, 2019

The diags are from after rebooting, so we can't see the errors, but the disk looks fine. Replace/swap the cables and rebuild; if it happens again, grab diags before rebooting.
mgranger (Author) Posted November 26, 2019

On 11/22/2019 at 2:48 AM, johnnie.black said: "... replace/swap cables and rebuild, if it happens again grab diags before rebooting."

Thanks @johnnie.black. Everything seems to be back to normal now. I switched the bay the hard drive was in (different cable), and that seems to have fixed it for now.
mgranger (Author) Posted November 28, 2019

Well, maybe I spoke too soon. I was getting errors where my server could not be accessed when I woke up in the morning, so I have been doing some searching online and noticed this could be due to having two cache drives (both of mine are Samsung EVOs, which could also be an issue). So I tried removing one in the settings, but when I did, I got another machine check event error. I am attaching my diagnostics. finalizer-diagnostics-20191128-1438.zip
JorgeB Posted November 28, 2019

Those MCE errors early in the boot process seem to happen with some hardware and are usually nothing to worry about; a BIOS update might help.