loyalsnoopdoge, posted July 25 (edited)

Hello unRAID community, I could use your support.

My setup:
Intel Core i7 12700K
1x Seagate IronWolf Pro 16TB (Parity 1)
4x WD Red Pro 16TB (Parity 2, Disks 1, 2, & 3)
LSI 9400-16i in IT mode (all hard drives connected to this HBA)
2x Samsung 980 Pro 2TB in RAID1 (Cache)

My issue:
My server had been working great for the last year, then all of a sudden, about 2 weeks ago, I started getting UDMA CRC errors on all of my data drives during my quarterly parity check. The error is easily reproducible by running a parity check, or during extended sessions with the mover. The odd thing is that every time I reproduce the issue, the errors come from 1 or 2 of the drives at random, but not all 3. If I let the parity check run long enough, unRAID eventually detects errors on one of the 3 data drives. I find it hard to believe that 3 of my drives are failing: 2 of the 3 are about 1 year old, and the last one (which has the most UDMA CRC errors) is 3 months old.

Steps I have attempted to resolve it:
Updated the BIOS on the motherboard
Updated unRAID to 6.12.11
Verified the LSI 9400-16i is running the latest firmware
Dusted the SAS ports on the LSI, the PCIe slot, and the SATA ports on the HDDs
Swapped all mini-SAS to SATA breakout cables for new ones
Swapped the power cables on the HDDs for new ones (from the PSU box)
Moved the LSI card to a different PCIe slot
Purchased a new LSI 9400-16i HBA and installed it with another set of brand new mini-SAS to SATA cables
Ran extended SMART tests on all HDDs; all drives passed 2 extended SMART tests

I am running out of ideas. Is there anything else I can try, or should I accept that my data drives have gone bad?

diagnostics-20240725-1217.zip

Edited July 25 by loyalsnoopdoge (grammar corrections, attaching diagnostics)
JorgeB, posted July 25

That's usually not a disk problem, though you have already replaced everything else. Post the diags after a parity check in case there's something more there.
Veah, posted July 25

Is it possible to connect all the HDDs to the motherboard SATA ports to test? That would eliminate the LSI card, sort of thing.
loyalsnoopdoge (Author), posted July 25

Veah said: "Is it possible to get all the hdd on mb sata to test it? Eliminate the LSI card sort of thing."

I plan on trying this when I get home today. Will update here.
loyalsnoopdoge (Author), posted July 25

JorgeB said: "That's usually not a disk problem, though you have already replaced everything else, but post the diags after a parity check in case there's something more there."

Updated diagnostics, now with read errors. Please use this one instead: diagnostics-20240725-1217.zip
loyalsnoopdoge (Author), posted July 25

@JorgeB when you get a chance, would you mind looking at my diagnostics? Thank you in advance.
loyalsnoopdoge (Author), posted July 26

UPDATE: I just plugged the drives that were throwing errors into the SATA ports on my motherboard, and they are still throwing UDMA CRC errors. I assume the drives are toast?
Veah, posted July 26

It's really sus that multiple young drives would fail so soon... Hard for me to accept that. Let's see: you eliminated the SAS controller, and you swapped out both the data and power cables. The only long shot I can think of on the hardware side would be the PSU itself. Any chance you have an old one lying around? If not that, maybe Jorge (or another pro) can see something in the logs. Unfortunately I'm pretty bad at reading through those.
JorgeB, posted July 26

Veah said: "It's really sus that multiple young drives would fail so soon..."

Agreed. The logs show what look like power/connection issues, and if it's the same with the onboard SATA, I still think power or miniSAS cables are the problem.
loyalsnoopdoge (Author), posted July 26

JorgeB said: "Agreed, the logs show what look like power/connection issues, and if it's the same with the onboard SATA still think power or miniSAS cables are the problem."

I have tried replacing the power cables as well. Do you think it could be the SATA connectors on the drives themselves?
JorgeB, posted July 26

UDMA CRC errors usually come from the SATA connection, but power-related issues, be it the PSU or the cables/connectors, may also cause them.
loyalsnoopdoge (Author), posted July 30

Haven't had a chance to fully dissect the server yet. Will update when I do. Appreciate the support.
loyalsnoopdoge (Author), posted August 8 (edited)

Since my last post I've been hard at work on this thing. I just put new drives in with the new HBA and new SAS to SATA breakout cables, and I made sure the drive temperatures don't go above 40C. One of my brand new drives just threw a UDMA CRC error during a parity check.

I've also reseated the motherboard and all the connections on it. I also ran memtest for 16 hours to see whether my RAM might be an issue, even though it's a connection issue.

I'm at a loss for what to do. Do I replace the PSU and/or the motherboard?

Edited August 8 by loyalsnoopdoge
JorgeB, posted August 8

loyalsnoopdoge said: "just threw a UDMA CRC error during a parity check"

A single error once in a while is not a reason for concern; it is if they keep happening.
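Since the question is whether the errors keep happening, one way to check is to record the drive's UDMA CRC counter (SMART attribute 199, UDMA_CRC_Error_Count) before and after each parity check and compare. The sketch below assumes smartmontools 7.0 or later, whose `smartctl -A -j` prints attributes as JSON with each entry under `ata_smart_attributes.table` carrying an `id` and a `raw.value`. The function names are hypothetical helpers, not part of unRAID or smartmontools.

```python
import json
from typing import Optional

# SMART attribute 199 is reported as UDMA_CRC_Error_Count on most SATA drives.
CRC_ATTR_ID = 199


def crc_error_count(smartctl_json: str) -> Optional[int]:
    """Return the raw UDMA CRC error count from `smartctl -A -j` output.

    Assumes the smartmontools JSON layout, where attributes are listed under
    ata_smart_attributes.table and each entry has an "id" and raw.value.
    Returns None if attribute 199 is not reported.
    """
    data = json.loads(smartctl_json)
    table = data.get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        if attr.get("id") == CRC_ATTR_ID:
            return int(attr["raw"]["value"])
    return None


def new_crc_errors(before_json: str, after_json: str) -> int:
    """Return how many CRC errors were added between two snapshots."""
    before = crc_error_count(before_json)
    after = crc_error_count(after_json)
    if before is None or after is None:
        raise ValueError("attribute 199 missing from one of the snapshots")
    # The counter only increases, so the difference is the number of new errors.
    return after - before
```

For example, save `smartctl -A -j /dev/sdX` to a file before a parity check and again after it finishes (`/dev/sdX` and the file names are placeholders), then pass the two files' contents to `new_crc_errors`. A difference of 0 or 1 across several checks matches the "once in a while" case; a count that rises with every check matches the case JorgeB says is worth worrying about.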