Read errors on disk 2 in the middle of replacing a bad disk 1



My server's really gone to shit (partially my fault) if you've seen my thread 1 and thread 2. I keep thinking my issues are over but new things keep coming up.

 

The summary: disk 1 had a file system that kept corrupting whenever new data was added, and I heard a mechanical chirping coming from what I was pretty sure was that disk. I started a replacement with a new disk I bought, but the new disk was unmountable and I messed up by thinking I should format it. I put the old disk 1 back and tried to save my data, but ended up with more XFS file system corruption. That eventually got fixed, but my entire disk 1 ended up in lost+found. Part of that attempt was a New Config, so I had to let parity rebuild while xfs_repair was running (they ran at the same time, with xfs_repair finishing first; not sure if that's related).

 

Last night parity finished rebuilding. All my disks were mountable and parity was valid. It was late, so I figured I'd just replace disk 1 in the morning. This morning I woke up to find that another parity sync had been triggered at midnight (since it's the first of the month) and that disk 2 had 3305 read errors, plus more mechanical chirping sounds. The parity sync started slowing down, with a 5 day ETA, so I ended up just stopping it.

 

As it stands now I have:

  1. disk 1: WD120EDAZ. Purchased in the last month. Entire disk in lost+found, chirps, seems to run just fine but the file system keeps corrupting itself (although always fixable).
  2. disk 2: WD120EMFZ. Purchased in the last 5 months. Suddenly 3305 read errors.
  3. new disk outside my array: WD120EDAZ. Fingers crossed it's good.

 

What should I do? Which disk should I replace? I started an RMA for disk 1 and I don't think I can afford to drop money on another new disk. I think if anything I need to wait for my replacement disk from Western Digital before I can RMA a second disk. The only other disks in my array are two 4TB drives. Is there any chance this is just some weird bad data error from a file system repair on the >4TB section of parity? Comparing a bad disk to a good disk? Or the xfs_repair running during a parity rebuild?

 

What are the odds these two issues aren't related? That two disks break at nearly the same time? For reference, these are Western Digital EasyStore 12TB drives I purchased from Best Buy and shucked. Parity and the new disk are shucked EasyStores too. The other two disks are WD 4TB Reds.

 

Any chance this is a PSU or cabling issue? EVGA N1 400W PSU and I have a Molex to SATA adapter going to these drives specifically.

 

Attached diagnostics. Any help is massively appreciated.

apollo-diagnostics-20210301-0723.zip


I'll wait for someone's response, but as an aside I'm thinking I should replace disk 2. Disk 1 is already dead and the files are gone as far as I'm concerned. I know I can dig through the lost+found but we're talking like 10k+ files. My important stuff on my server is backed up. What I lost is media project assets and media files that I can easily redownload or had no emotional attachment to. It'd be a big inconvenience but I wouldn't lose sleep. Same if I lost disk 2, although disk 2 had double the data on it so it would be an even bigger inconvenience.

1 hour ago, JorgeB said:

Problem with disk2 looks more like a connection/power problem, but the fact that you mention mechanical noises makes me suspect a power issue, replace cables or PSU and try the parity sync again with the same disk.

 

I was thinking of replacing the power supply one day anyway with something better and more efficient than my EVGA N1 400W. I can do a whole power supply and cable overhaul if that really seems to be the root of all these issues. SATA cables are pretty cheap, although would you say it's more likely a power cable issue than a data cable issue?

 

My only concern is: what would be the best way to connect these drives and avoid the 3.3V pin issue? Right now I have my two SSDs connected directly to SATA power, then my parity disk and disks 1-3 on a Molex to 4x SATA power cable. Then my disk 4 (and eventually disks 5-7) is on the same SATA power cable as the SSDs, but through a SATA to 4x SATA power cable with the 5th wire cut (for the 3.3V issue).

 

I'm looking at the Fractal Design Ion+ 560P as a replacement since it has more wattage just in case, is more efficient, and is modular. It supports 6 SATA connections, but since it's modular I assume I can buy another 6-pin to 4x SATA cable? It comes with a 2x SATA and a 4x SATA cable (then of course Molex). If I could connect a second 6-pin to 4x SATA cable directly to the PSU, would that be ideal? Fewer points of failure? I'm just not seeing an OEM one sold, only cheap, sketchy-looking ones that usually reference Corsair or Seasonic. And for the 3.3V pin issue, would I just go the tape-over-the-pin route?

 

Edit: I got rid of the Molex to SATA cable in my case. I just found out about the lovely phrase "molex to sata, lose your data". Maybe the Molex to SATA cable on my drive bay was messing with the drives. All my drives are now on one SATA power cable, which for 2 SSDs and 5 HDDs I can't imagine is great long term. I do have a SATA to 4x SATA extension in there to do so, but at least there's no Molex. I'm running xfs_repair and a short SMART test on all my drives real quick, then I will do a parity check with "write corrections to parity" checked. I'll still more than likely replace the PSU with that Fractal one I mentioned, just in case. I emailed Fractal to ask if they sell an OEM 6-pin to 4x SATA cable so that I don't need any extensions for my drives. I'll also get Kapton tape for my 3.3V issue rather than doing cable mods.
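For anyone repeating the "xfs_repair and short SMART test on every drive" pass, here is a minimal sketch that only builds the command lines so they can be reviewed before anything is run. The device names are assumptions for illustration: `/dev/mdN` for array data slots with the array started in maintenance mode, and `/dev/sdX` for the physical disks. Check the actual names on your own system. `xfs_repair -n` is the no-modify mode, so it reports problems without writing anything.

```python
import shlex


def check_commands(data_slots, smart_devices):
    """Build (but do not run) read-only check commands for each disk.

    data_slots:    array slot numbers, e.g. [1, 2, 3, 4] (assumed to map to /dev/mdN)
    smart_devices: physical device paths, e.g. ["/dev/sdb"] (hypothetical names)
    """
    commands = []
    for slot in data_slots:
        # -n = no-modify mode: report file system problems, change nothing
        commands.append(["xfs_repair", "-n", f"/dev/md{slot}"])
    for device in smart_devices:
        # -t short = start a short SMART self-test on the drive
        commands.append(["smartctl", "-t", "short", device])
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for command in check_commands([1, 2, 3, 4], ["/dev/sdb", "/dev/sdc"]):
        print(shlex.join(command))
```

Printing rather than executing is deliberate: a file system check pointed at the wrong device is exactly the kind of mistake this thread started with.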

 

I figured to soften the blow of a new PSU I should just return my replacement HDD, since I don't need the space by any means, and just wait for the RMA on disk 1 to go through while my server stays off. Although it's sounding like I might possibly need to RMA disk 2 because it had 3305 read errors? Or is it possible disk 2 is okay? Assuming the parity check I'm about to do goes through fine.


I had some strange errors last time I replaced my power supply too - with the new PS no worries.

 

Seasonic are excellent - I just bought a 650 W for a second server for $65 or $70 delivered.  Check them out.  

 

Whatever you do you want a single 12V rail for your PS.  

 

Depending on your drives, that 400W PS is likely too small.

2 hours ago, kimifelipe said:

I had some strange errors last time I replaced my power supply too - with the new PS no worries.

 

Seasonic are excellent - I just bought a 650 W for a second server for $65 or $70 delivered.  Check them out.  

 

Whatever you do you want a single 12V rail for your PS.  

 

Depending on your drives, that 400W PS is likely too small.

 

Thanks for the reassurance that it could be PSU! I double checked with a PSU calculator and it recommends 379W. If I max out my drives it would be 401W. But for all I know this 400W PSU isn't delivering 400W. I'd still want to aim for 500W+ to be safe.
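As a rough sanity check on those calculator numbers, the arithmetic can be sketched like this. The 80% target load is an assumed rule of thumb, not something from the calculator or from the PSU's spec sheet.

```python
def load_fraction(draw_watts, psu_watts):
    """Fraction of the PSU's rated wattage that the estimated draw uses."""
    return draw_watts / psu_watts


def min_psu_watts(draw_watts, target_load=0.8):
    """Smallest rated wattage that keeps the draw at or below target_load.

    target_load=0.8 is an assumed headroom margin, not a manufacturer figure.
    """
    return draw_watts / target_load


if __name__ == "__main__":
    # 379 W = calculator estimate now, 401 W = estimate with all drive bays filled
    for draw in (379, 401):
        print(
            f"{draw} W draw: {load_fraction(draw, 400):.1%} of a 400 W PSU, "
            f"needs at least {min_psu_watts(draw):.0f} W at 80% load"
        )
```

With the drives maxed out, the estimated 401W is already over the 400W rating, and at an 80% target it works out to roughly 501W, which lines up with aiming for 500W+.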

 

Right now, with my re-cabling and the Molex to SATA cable removed, things seem pretty stable, but I'll wait until the parity check finishes in 18 hours to confirm there are no read errors on disk 2.

 

I ended up un-shucking and returning the replacement drive to Best Buy since, all said and done, I really wouldn't need the extra space post-RMA. If disk 2 is okay then I'll continue with the RMA of disk 1. I'll try to get the lost+found data off it, but if I can't, I'll just remove it and do a New Config. Everything important was backed up, so it's not worth digging through.

 

I'll also replace the PSU just in case, of course.

 

Edit: Replacement PSU ordered: Seasonic FOCUS GX-550. It has 10 SATA connections out of the box (2x, 4x, and 4x), so no need for adapters. I also bought Kapton tape for the 3.3V issue. It's 80+ Gold, has a fanless mode, and is modular, so I can keep unnecessary PCIe cables out of the case. Hopefully that helps long-term stability even if it's not directly the root of this issue. Just waiting on the parity check, which is 13.6% done with no errors.

On 3/1/2021 at 8:53 AM, JorgeB said:

Problem with disk2 looks more like a connection/power problem, but the fact that you mention mechanical noises makes me suspect a power issue, replace cables or PSU and try the parity sync again with the same disk.

 

So my hard drives (including disk2) were plugged into a Molex to SATA power cable. I took that cable out and now have them plugged into a SATA extension power cable. After doing so I ran another parity check, and it just finished with no sync errors and no read errors on any of my drives.

 

[Attached screenshot: Capture.PNG]

 

Sounds like you could be right and disk2 was a power issue. It could even be that disk1 from my past threads was also a power issue. Regardless, my higher quality replacement PSU should arrive tomorrow with its 10x SATA power connectors, and I'll be removing all extension cables from my rig; all my drives will be plugged directly into the PSU. Even if it's fixed now, I'm sure that PSU will be a worthwhile upgrade.

 

I'll still go through with an RMA with disk1 as well. For all I know even if this fixes the file corruption issues it could have been permanently damaged.

 

Thanks for all the help!


So in the past few days I successfully removed disk 1 from the array and rebuilt parity using New Config with no errors. I ran another parity check and still got no errors. All my disks passed short SMART tests and showed no file system errors; I checked a number of times. I also installed my new power supply and I'm not using any adapters, with all drives plugged straight into the PSU. I was covering the 3.3V pins with Kapton tape and the drives were being recognized. Everything looked fine and felt like it was running as well as ever while I spent all day re-downloading a lot of the data that I lost.

 

Suddenly tonight I got a message saying that disk 2 has read errors and is being emulated. It had 1132 read errors. Because I had already replaced the power supply and power cabling, I tried replacing the SATA data cable. In doing so my 3.3V Kapton tape mod came loose, so my drive wasn't detected at first. I fixed it, but Unraid then detected the drive as a new device (possibly because of the read errors) and started a Data Rebuild. An hour into the data rebuild it hit read errors again (1024 errors this time). I've attached another diagnostics zip from after that. I shut down my server and ended up deciding to just snip the 3.3V wire in case my tape mod was causing issues (it's a modular power supply anyway). I turned my server back on and now the only option was to run a Read Check, but my drive is still disabled with its contents emulated. I checked xfs_repair, no issues; short SMART test, no issues. So I'm running a Read Check.
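One way to sort "is it the disk or the connection?" is to look at the raw SMART counters from `smartctl -A` for the drive. The sketch below parses that attribute table and applies a rough rule of thumb. The grouping of attributes into "media" and "link" is my own assumption, not an Unraid or Western Digital rule, and a drive that drops offline from a power problem can show zeros everywhere.

```python
# Attribute names as printed in the `smartctl -A` table.
# Which ones point at the disk vs. the cabling is a rule of thumb (assumption).
MEDIA_ATTRS = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")
LINK_ATTRS = ("UDMA_CRC_Error_Count",)


def raw_values(smartctl_output):
    """Map attribute name -> raw value from the text of `smartctl -A`."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10+ columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def triage(values):
    """Return 'media', 'link', or 'clean' based on nonzero raw counters."""
    if any(values.get(name, 0) > 0 for name in MEDIA_ATTRS):
        return "media"  # the disk itself is reporting bad sectors
    if any(values.get(name, 0) > 0 for name in LINK_ATTRS):
        return "link"   # CRC errors usually point at the data cable or connector
    return "clean"      # SMART does not implicate the disk; power/connection still possible


if __name__ == "__main__":
    import sys

    # Usage: smartctl -A /dev/sdX | python3 this_script.py
    print(triage(raw_values(sys.stdin.read())))
```

A "clean" result here would be consistent with the xfs_repair and short SMART test coming back with no issues, which is why the connection and power are the next things to suspect.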

 

Any ideas what's going on? I'm really worried about losing my data.

apollo-diagnostics-20210307-0057.zip

7 hours ago, JorgeB said:

Disk dropped offline, if the emulated disk is mounting and contents look correct you can rebuild on top.

 

Ah okay. So basically the connection just got loose? I think it was the tape mod then. Snipping the 3.3v wire should be better. Thanks! I’ll let it rebuild.
