smdl Posted October 14, 2019
Hi, folks. I've been evaluating unRAID for a few months (I like it!) on a very old machine, and am just in the process of setting up an all-new server. I completed the setup and pre-clear last night, and just brought the array online today. One of my first steps was to add the CA Fix Common Problems app, and that reported a single hardware error when I ran it. However, I saw some notes in the forum indicating that a simple re-scan to see whether the problem persists is a normal approach. When I did that, the problem was no longer indicated, so I carried on configuring the system. As things sit currently, unRAID seems happy, with everything in the green. However, while checking status, I happened to notice hardware errors showing up in the log. All say that the errors were corrected and that no action is required, but they persist. I have attached the Diagnostics file to this post, and would really appreciate any help with figuring out what isn't right. I'm still a novice with unRAID, so apologies if I am missing something obvious. Just in case the Diagnostics file doesn't include the hardware errors, I have included a sample below:

Thanks,
Shaun

Oct 14 14:39:33 GISL-CR1-uR1 kernel: [Hardware Error]: CPU:14 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]: 0xdc20000000020151
Oct 14 14:39:33 GISL-CR1-uR1 kernel: [Hardware Error]: Error Addr: 0x000000000107a9c0
Oct 14 14:39:33 GISL-CR1-uR1 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000002a010002
Oct 14 14:39:33 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 2
Oct 14 14:39:33 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Error: IC full tag parity.
Oct 14 14:39:33 GISL-CR1-uR1 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Oct 14 14:45:01 GISL-CR1-uR1 kernel: mce: [Hardware Error]: Machine check events logged
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Corrected error, no action required.
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: CPU:6 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]: 0xdc20000000020151
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Error Addr: 0x000000000106ce40
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000002a010002
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 2
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Error: IC full tag parity.
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Oct 14 14:45:01 GISL-CR1-uR1 kernel: mce: [Hardware Error]: Machine check events logged
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Corrected error, no action required.
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: CPU:14 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]: 0xdc20000000020151
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Error Addr: 0x000000000109e1c0
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000002a010002
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 2
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Error: IC full tag parity.
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Oct 14 14:50:28 GISL-CR1-uR1 kernel: mce: [Hardware Error]: Machine check events logged
Oct 14 14:50:28 GISL-CR1-uR1 kernel: [Hardware Error]: Corrected error, no action required.
Oct 14 14:50:28 GISL-CR1-uR1 kernel: [Hardware Error]: CPU:6 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]:

Attached: gisl-cr1-ur1-diagnostics-20191014-2201.zip
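For readers who want to see how the decoded lines in the sample relate to the raw status value, here is a minimal Python sketch that unpacks 0xdc20000000020151. The bit positions are an assumption taken from the Linux kernel's MCI_STATUS_* definitions and its AMD error-code tables, not from anything stated in this thread, and `decode_status` is a hypothetical helper name.

```python
# Assumed bit positions, following the Linux kernel's MCI_STATUS_* macros.
STATUS_FLAGS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59,
    "AddrV": 58, "PCC": 57, "SyndV": 53, "Deferred": 44, "Poison": 43,
}
# Assumed lookup tables for the "memory hierarchy" error-code format.
CACHE_LEVEL = ["RESV", "L1", "L2", "L3/GEN"]
TX_TYPE = ["INSN", "DATA", "GEN", "RESV"]
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_status(status: int) -> dict:
    """Split a 64-bit machine-check status value into readable fields."""
    flags = [name for name, bit in STATUS_FLAGS.items() if status >> bit & 1]
    error_code = status & 0xFFFF
    result = {
        "flags": flags,
        # UC clear means the hardware reports the error as corrected.
        "corrected": "UC" not in flags,
        # Extended error code: 6 bits starting at bit 16 (assumed SMCA mask).
        "extended_code": (status >> 16) & 0x3F,
        "cache": None,
    }
    # Memory-hierarchy errors have the form 0000 0001 RRRR TTLL.
    if error_code & 0xFF00 == 0x0100:
        rrrr = (error_code >> 4) & 0xF
        result["cache"] = {
            "level": CACHE_LEVEL[error_code & 0x3],
            "tx": TX_TYPE[(error_code >> 2) & 0x3],
            "mem_tx": MEM_TX[rrrr] if rrrr < len(MEM_TX) else "unknown",
        }
    return result


if __name__ == "__main__":
    print(decode_status(0xDC20000000020151))
```

Under these assumptions the result lines up with the kernel's own decoding in the log: a corrected error, extended error code 2, cache level L1, tx INSN, mem-tx IRD. The Over flag conventionally means a later error overwrote an earlier one in the same bank, so the log may be showing fewer events than actually occurred.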
smdl Posted October 15, 2019
Hmm, just checked, and this error still seems to be occurring about every 5-6 minutes. Hoping someone has some idea of how to look further into these. I'll keep searching, as well.

Thanks,
Shaun

0xdc20000000020151
Oct 15 14:19:30 GISL-CR1-uR1 kernel: [Hardware Error]: Error Addr: 0x00000000011541c0
Oct 15 14:19:30 GISL-CR1-uR1 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000002a010002
Oct 15 14:19:30 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 2
Oct 15 14:19:30 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Error: IC full tag parity.
Oct 15 14:19:30 GISL-CR1-uR1 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Oct 15 14:19:30 GISL-CR1-uR1 kernel: mce: [Hardware Error]: Machine check events logged
Oct 15 14:19:30 GISL-CR1-uR1 kernel: [Hardware Error]: Corrected error, no action required.
Oct 15 14:19:30 GISL-CR1-uR1 kernel: [Hardware Error]: CPU:6 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]:
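The "every 5-6 minutes" observation can be checked against a saved syslog with a short Python sketch like the one below. It collects the distinct timestamps of [Hardware Error] lines and reports the gaps between them. `burst_gaps` is a hypothetical helper; the year is an assumption (syslog timestamps carry none, so 2019 is taken from the post dates), and month names are assumed to be in English.

```python
import re
from datetime import datetime

# Matches the timestamp at the start of a kernel [Hardware Error] line,
# with or without the "mce: " prefix.
STAMP_RE = re.compile(
    r"([A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"(?:mce: )?\[Hardware Error\]"
)


def burst_gaps(log_text: str, year: int = 2019) -> list[float]:
    """Return the gaps in seconds between distinct hardware-error timestamps."""
    stamps = sorted(
        {
            # The year is prepended because syslog lines do not include one.
            datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            for stamp in STAMP_RE.findall(log_text)
        }
    )
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    sample = (
        "Oct 14 14:39:33 GISL-CR1-uR1 kernel: [Hardware Error]: Error Addr: 0x000000000107a9c0\n"
        "Oct 14 14:45:01 GISL-CR1-uR1 kernel: mce: [Hardware Error]: Machine check events logged\n"
        "Oct 14 14:50:28 GISL-CR1-uR1 kernel: mce: [Hardware Error]: Machine check events logged\n"
    )
    print(burst_gaps(sample))
```

Run against the three timestamps from the first post, the gaps come out at a little under five and a half minutes each, which is consistent with the interval described here. Steady gaps may reflect how often the kernel polls for corrected errors rather than how often the fault itself occurs.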
Squid Posted October 16, 2019
Normally I'd say bad CPU, but quick googling suggests that it may be a kernel problem. Update the BIOS if possible, then try 6.8.0-rc1 and see what happens.
Vr2Io Posted October 16, 2019
You may try updating the mainboard BIOS: https://community.amd.com/thread/230743
Edited October 16, 2019 by Benson
smdl Posted October 16, 2019
Thanks, folks -- I really appreciate the help. I actually did update the BIOS right after I posted my last message, as there was a newer version available. The update was successful, which is great, but now the system won't boot. It consistently gives the following message:

SYSLINUX 6.03 EDD Load error - Boot error

I was just starting to look at that when I had to take my son to an event, so I am just getting back to it now. It seems it might have something to do with the USB flash drive, but it is suspicious that the problem started immediately after I updated the BIOS. I'll see what I can dig up. Any suggestions or advice would be gratefully accepted.

Cheers,
Shaun
smdl Posted October 16, 2019
Okay, I was finally able to get the system back up and working again, but I had to recreate the USB boot drive to do it. More good news: the errors don't seem to be recurring... yet. I will need to keep watching this for a while to see if they come back in time. Thanks again for the help with this!

Cheers,
Shaun
smdl Posted October 16, 2019
Uh oh. Spoke too soon. Again, unRAID seems to be working great, but these errors continue to appear in the background. I saw a note in CA Fix Common Problems:

"You should install mcelog via the NerdPack plugin, post your diagnostics and ask for assistance on the unRaid forums. The output of mcelog (if installed) has been logged"

I'm not sure I should be playing in NerdPack too much yet, and am not sure I actually have it installed. I did find NerdPack GUI, and have that installed, but can't really figure out how to do much in there yet. I will keep searching, as it would be nice to have the errors decoded. Here is another sample of the errors.

Oct 15 20:20:01 Tower kernel: mce: [Hardware Error]: Machine check events logged
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Corrected error, no action required.
Oct 15 20:20:01 Tower kernel: [Hardware Error]: CPU:14 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]: 0xdc20000000020151
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Error Addr: 0x0000000001073ac0
Oct 15 20:20:01 Tower kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000002a010002
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 2
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Instruction Fetch Unit Error: IC full tag parity.
Oct 15 20:20:01 Tower kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Oct 15 20:20:01 Tower kernel: mce: [Hardware Error]: Machine check events logged
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Corrected error, no action required.
Oct 15 20:20:01 Tower kernel: [Hardware Error]: CPU:6 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]: 0xdc20000000020151
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Error Addr: 0x0000000001071bc0
Oct 15 20:20:01 Tower kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000002a010002
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 2
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Instruction Fetch Unit Error: IC full tag parity.
Oct 15 20:20:01 Tower kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD

Cheers,
Shaun
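Every sample posted so far names only CPU:6 and CPU:14, so it may help to tally which logical CPUs report errors across a whole syslog. The Python sketch below does that, and optionally asks Linux which hardware threads share a core with a given CPU. `errors_per_cpu` and `thread_siblings` are hypothetical helper names. The sysfs path is the standard Linux topology location, but whether CPUs 6 and 14 are siblings on this particular machine is not stated anywhere in the thread, so treat that as something to check rather than a conclusion.

```python
import re
from collections import Counter
from pathlib import Path

# Captures the logical CPU number from lines such as
# "[Hardware Error]: CPU:14 (17:8:2) MC1_STATUS[...]".
CPU_RE = re.compile(r"\[Hardware Error\]: CPU:(\d+) ")


def errors_per_cpu(log_text: str) -> Counter:
    """Count machine-check status lines per logical CPU."""
    return Counter(int(number) for number in CPU_RE.findall(log_text))


def thread_siblings(cpu: int, root: str = "/sys/devices/system/cpu"):
    """Return the sibling list Linux reports for a CPU, or None if unavailable."""
    path = Path(root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    try:
        return path.read_text().strip()
    except OSError:
        # Not on Linux, or the CPU number does not exist on this machine.
        return None


if __name__ == "__main__":
    sample = (
        "Oct 15 20:20:01 Tower kernel: [Hardware Error]: CPU:14 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]: 0xdc20000000020151\n"
        "Oct 15 20:20:01 Tower kernel: [Hardware Error]: CPU:6 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]: 0xdc20000000020151\n"
    )
    for cpu, count in sorted(errors_per_cpu(sample).items()):
        print(f"CPU {cpu}: {count} error(s), siblings: {thread_siblings(cpu)}")
```

If the same one or two logical CPUs account for every error, and they turn out to be threads of a single physical core, that would point toward a fault local to that core rather than a system-wide problem.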
smdl Posted October 26, 2019
Greetings, all. Well, the system continues to operate without apparent issue, although it's not really in heavy use yet. Until I have a sense of what might be causing this, I'm hesitant to depend upon it. I've looked up the errors online, but all I really see are posts relating to the Linux kernel and AMD error reporting conventions, which really go over my head. I'll keep trying to understand it, but these seem to be more about how errors are reported than about what they might mean. Hoping someone out there might have additional ideas as to where to look. Any help would be sincerely appreciated.

Cheers,
Shaun
smdl Posted October 26, 2019 (edited)
As mentioned, this system seems to operate fine from an unRAID perspective, but I will add that the array takes much longer to start than on my old machine. The old machine is an AMD Phenom 9500 Quad Core at 2200 MHz with 4GB DDR2 and 5 SATA drives (largest 2TB). The new one is a Ryzen 7 2700X at 3700 MHz with 32GB DDR4 and 4 SATA drives (largest 4TB). It seems odd to me that the new array would take about twice as long to start, and I wonder if this is an indication of a performance problem related to the errors. Are there other factors that might contribute to slower startup?

Thanks,
Shaun
Edited October 26, 2019 by smdl
smdl Posted November 22, 2019
Hi, folks. Just to close the loop on this: after extensive testing and parts-swapping, I was able to confirm that the problem was with the processor. I have since replaced the processor with a new one (same model), and the errors have stopped. Sincere thanks to those who offered input.

Cheers,
Shaun