December 14, 2023

Hey everyone! I have a drive that keeps getting read errors, but I have run the SMART tests (short and long) and it passed both. I've attached the SMART report. I also have a RAM stick (see the error from the log below) that I'm trying to pinpoint so I can replace it. Could that be causing my read errors if the drive is fine? I normally dive into things and figure them out on my own, but I'm a new dad and have a lot less time to troubleshoot, so any help would be much appreciated!

kernel: mce: [Hardware Error]: Machine check events logged
kernel: [Hardware Error]: Corrected error, no action required.
kernel: [Hardware Error]: CPU:8 (15:2:0) MC4_STATUS[-|CE|MiscV|AddrV|-|CECC|-]: 0x9c31c00001080a13
kernel: [Hardware Error]: Error Addr: 0x000000070ab8eeb0
kernel: [Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.
kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

Edit: I just noticed this might be posted in the wrong section, sorry if it is.

Attachment: ST8000VN004-2M2101_WSD9W4HN-2023-08-31 disk6 (sdo).txt

Edited December 14, 2023 by XDUDE3D
December 15, 2023 (Community Expert)
It's unlikely that the two issues are related. Please post the diagnostics.
December 15, 2023 (Community Expert)
ECC can only go so far in correcting memory errors. Memory needs to work perfectly. Everything goes through memory: the OS and other executable code, your data, everything. The CPU can't do anything with anything until it is loaded into RAM.
December 15, 2023 (Author)
Totally agree on the memory. I have a known-good stick that I am swapping in one slot at a time until I find the bad one. Since the log says "MC4 Error (node 1)", I thought it would be stick 4 on CPU 1, so I swapped that one, but the error remains. So I will just keep swapping until I find it. Will running the memory test from the boot menu tell me the more exact location of the bad stick? It sounded logical to me that a memory issue could cause issues elsewhere; I just was not sure whether it could cause hard drive errors. I already moved the drive to another bay, so it is not a cable or hot-swap issue.

Edited December 15, 2023 by XDUDE3D
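On narrowing down the stick: the log above attributes the corrected error to the NB on node 1, which does not by itself name a DIMM slot. One thing that may help is the per-DIMM corrected-error counters that the Linux EDAC subsystem exposes under sysfs. The following is a minimal sketch, assuming an EDAC driver for the platform is loaded and that it publishes the `dimm*` or `rank*` directories with `dimm_ce_count`, `dimm_ue_count` and `dimm_label` files; the exact layout varies by driver and kernel, and labels are often empty unless configured. The function name `read_dimm_error_counts` is made up for this example.

```python
from pathlib import Path

# Standard sysfs location for EDAC memory controllers on Linux.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def _read_text(path, default=""):
    """Read a small sysfs file, returning `default` if it is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return default


def read_dimm_error_counts(root=EDAC_ROOT):
    """Return a list of (controller, slot, label, corrected, uncorrected) tuples.

    Only slots that expose a dimm_ce_count file are reported.
    """
    rows = []
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        # Depending on the driver, slots appear as dimmN or rankN directories.
        slots = sorted(list(mc.glob("dimm[0-9]*")) + list(mc.glob("rank[0-9]*")))
        for slot in slots:
            if not (slot / "dimm_ce_count").is_file():
                continue
            corrected = int(_read_text(slot / "dimm_ce_count") or "0")
            uncorrected = int(_read_text(slot / "dimm_ue_count") or "0")
            label = _read_text(slot / "dimm_label")
            rows.append((mc.name, slot.name, label, corrected, uncorrected))
    return rows


if __name__ == "__main__":
    for mc, slot, label, ce, ue in read_dimm_error_counts():
        print(f"{mc}/{slot} label={label or '?'} corrected={ce} uncorrected={ue}")
```

If one slot's corrected count keeps climbing while the others stay at zero, that slot is the first candidate to swap. If nothing is printed, the EDAC driver is probably not loaded or uses a different layout, and swapping sticks one at a time remains the fallback.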
December 15, 2023 (Community Expert, Solution)
Disk read errors were on spin up, and likely this is the issue:
December 15, 2023 (Author)
While I would be happy to have an answer, it kinda sucks that Seagate/LSI are acting up. My server has been rock solid for years, and now 7 of my drives and 3 of my HBAs are on the list of potential issues. I set the one drive that has been throwing errors to never spin down to test this out. I hope an update comes to undo or fix this, as editing each drive's power settings (at least as I understood it from the linked topic) does not sound like something I want to do just for fun. I'm just not sure why only 1 drive out of 7 of the same model is acting up. Maybe that one has the wrong firmware. Maybe I'll just grab a 9300-16i card and make sure all the Seagates are on it, I don't know. OK, I'll do some tinkering and report back/close this topic.

Edited December 15, 2023 by XDUDE3D
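To check the "wrong firmware on that one" guess, it may help to compare the firmware each drive reports. The sketch below assumes smartmontools is installed and that `smartctl -i` prints the usual `Device Model:`, `Serial Number:` and `Firmware Version:` lines for SATA drives; behind some controllers an extra `-d` option may be needed, which is not handled here. The helper names are made up for this example, and the device paths you pass in are your own.

```python
import subprocess
from collections import defaultdict

# Labels as printed by `smartctl -i` for ATA/SATA drives.
FIELDS = {
    "model": "Device Model",
    "serial": "Serial Number",
    "firmware": "Firmware Version",
}


def parse_identity(text):
    """Extract model, serial and firmware from `smartctl -i` output text."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        for name, label in FIELDS.items():
            if key.strip() == label:
                info[name] = value.strip()
    return info


def group_by_firmware(identities):
    """Group drive serials by (model, firmware) so odd ones out are easy to spot."""
    groups = defaultdict(list)
    for ident in identities:
        groups[(ident.get("model"), ident.get("firmware"))].append(ident.get("serial"))
    return dict(groups)


def smartctl_identity(device):
    """Run `smartctl -i` on one device (usually needs root) and parse the result."""
    result = subprocess.run(
        ["smartctl", "-i", device], capture_output=True, text=True, check=False
    )
    return parse_identity(result.stdout)
```

Calling `group_by_firmware(smartctl_identity(d) for d in devices)` with a list of the seven drives' device paths would show whether the drive throwing errors sits in a firmware group of its own. A matching firmware would not rule the drive out, it would only make the firmware explanation less likely.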