Hard drive dying, or caused by RAM?

December 14, 2023

Hey everyone!

I have a drive that keeps getting read errors, but I ran the SMART tests (short and long) and it passed both. The SMART report is attached.

Also, I have a RAM stick throwing errors (see the log below) that I'm trying to pinpoint so I can replace it. Could that be causing my read errors if the drive is fine? I normally dive into things and figure them out on my own, but I'm a new dad and have a lot less time to troubleshoot and look things over, so any help would be much appreciated!

kernel: mce: [Hardware Error]: Machine check events logged

kernel: [Hardware Error]: Corrected error, no action required.

kernel: [Hardware Error]: Error Addr: 0x000000070ab8eeb0

kernel: [Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.

kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
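
For anyone reading the log above, here is a minimal sketch that pulls the key fields out of those lines. The function name `summarize_mce` is made up for illustration, and the regular expressions only assume the exact wording shown in this post. Note that the number in "MC4" is parsed here as the machine-check bank (the same line says the error was detected on the NB, i.e. the northbridge), which as far as I know is not a DIMM slot number.

```python
import re

# Log lines copied from the post above.
LOG = """\
kernel: mce: [Hardware Error]: Machine check events logged
kernel: [Hardware Error]: Corrected error, no action required.
kernel: [Hardware Error]: Error Addr: 0x000000070ab8eeb0
kernel: [Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.
kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
"""


def summarize_mce(log_text):
    """Extract the corrected flag, error address, bank and node from MCE log text."""
    summary = {"corrected": False, "address": None, "bank": None, "node": None}
    for line in log_text.splitlines():
        if "Corrected error" in line:
            summary["corrected"] = True
        addr = re.search(r"Error Addr: (0x[0-9a-fA-F]+)", line)
        if addr:
            summary["address"] = int(addr.group(1), 16)
        bank = re.search(r"MC(\d+) Error \(node (\d+)\)", line)
        if bank:
            summary["bank"] = int(bank.group(1))   # machine-check bank number
            summary["node"] = int(bank.group(2))   # node that reported the error
    return summary


if __name__ == "__main__":
    info = summarize_mce(LOG)
    print(f"corrected={info['corrected']} bank={info['bank']} "
          f"node={info['node']} address={info['address']:#x}")
```

The physical address on its own does not name a slot; mapping it to a DIMM depends on the board's memory layout.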

Edit: I just noticed this might be posted in the wrong section, sorry if it is.

ST8000VN004-2M2101_WSD9W4HN-2023-08-31 disk6 (sdo).txt

Edited December 14, 2023 by XDUDE3D

Solved by JorgeB

December 15, 2023

Community Expert

It's unlikely that the two issues are related. Please post the diagnostics.

December 15, 2023

Author

sovereign-diagnostics-20231214-1606.zip

December 15, 2023

Community Expert

ECC can only go so far in correcting memory errors. Memory needs to work perfectly. Everything goes through memory, the OS and other executable code, your data, everything. The CPU can't do anything with anything until it is loaded into RAM.

December 15, 2023

Author

Totally agree on the memory. I have a known-good stick and am swapping it in one slot at a time until I find the bad one. I assumed "MC4 Error (node 1)" meant stick 4 on CPU 1, so I swapped that one, but the error remains. So I'll just keep swapping until I find it. Will running the memory test from the boot menu tell me more precisely which stick is bad?

It sounded logical to me that a memory issue could cause problems elsewhere; I just wasn't sure whether it could cause hard drive errors. I already moved the drive to another bay, so it is not a cable or hot-swap issue.

Edited December 15, 2023 by XDUDE3D

December 15, 2023

Community Expert

Memory cannot cause SMART errors.

December 15, 2023

Community Expert
Solution

Disk read errors were on spin up, and likely this is the issue:

December 15, 2023

Author

While I'm happy to have an answer, it kinda sucks that Seagate/LSI are acting up... My server has been rock solid for years, and now 7 of my drives and 3 of my HBAs are on the list of potential issues. I set the one drive that has been throwing errors to never spin down to test this out. I hope an update comes to undo/fix this, as editing each drive's power settings (at least as I understood it from the linked topic) does not sound like something I want to do just for fun. I'm just not sure why only 1 drive out of 7 of the same model is acting up... Maybe that one has the wrong FW. Maybe I'll just grab a 9300-16i card and make sure all the Seagates are on it, idk. OK, I'll do some tinkering and report back/close this topic.

Edited December 15, 2023 by XDUDE3D
