[Solved] Read errors all over after board swap



Help please :(

 

I recently transplanted my hardware over to a new case and at the same time changed the board, CPU and memory. Initially I had a problem where one of the controller cards was showing all disks on boot but not in Unraid (displaying as missing). I updated the BIOS on the board from F3 to F8 and re-seated the card and cables, which seemed to work. Everything booted up OK and all disks went green. I allocated a new/spare disk as a cache drive, which formatted OK, and I had green lights across the board, no errors.

 

A few hours later, after some light Plex use, Parity 2 dropped out first (red X). I figured maybe the cable was bad, and I was dealing with dinner/son/etc, so I left it for the moment. Shortly thereafter, when I got a chance to check it, the shares had dropped off and I had read errors across most of the disks.

 

I'm wondering if the BIOS is set up slightly differently on the new board (legacy, IDE mode, etc.), so tomorrow afternoon I'll compare against the old board, but I'm mostly at a loss as to what is going on.

 

Motherboard changed from a Gigabyte GA-H57M-USB3 (rev 2.0) to a Gigabyte GA-H67MA-USB3-B3 (rev 1.0) and I ran a few passes of memtest on the new board without any issues yesterday.  

 

Edit: I unassigned Parity 2 since it was 'dropped' and after a reboot and replacing a couple of SATA cables (both parity drives) it was looking OK but then after today, errors all over the place again. Syslog is from the first time it failed, diagnostic is from after the reboot. Now disk 8 has dropped completely, I have no idea what is going on.  Added an extra screenshot.

 

ohno.PNG

unraid-syslog-20190603-1428.zip

after-reboot.jpg

unraid-diagnostics-20190607-0945.zip

Edited by chickensoup
Solved.

Sorry for the late reply, I've been really busy with work. I've updated the OP with a diagnostic taken after booting the server back up last night. It looked OK initially, so I ran a non-correcting parity check overnight. In the morning all disks were OK at about 30% with 8 errors detected, so I stopped the check and switched to a correcting parity check, which I now regret; I'm hoping the data isn't corrupt. This afternoon it looks just as bad as the other day, only with different disks.

 

Please note that between the two screenshots/reboots I also tidied up the cabling so the specific disks aren't necessarily on the same ports as they were the first time. Apologies if this makes things a little messier to diagnose but the logs should clear up any confusion.


Will change the onboard SATA controller to AHCI tonight, anything else I should look at before I reboot it? It is still currently powered on.

 

Not to over-complicate things but in full disclosure, the system is actually running off two power supplies, for no reason other than that the case supports them and I was testing power usage balancing the load between the two. Based on what you have said I suspect my TT 750 might be playing up, which is strange since it is actually powering less than it has been for the last few months.

 

I tested both the other night after it first failed and they looked OK but I might try swapping them around to see if this fixes anything.

 

More info here >

 

Edited by chickensoup
More info

I actually tested both power supplies before rebuilding the server and they looked OK, even under load, but it's curious that, other than Parity 2 (which dropped due to a SMART 199 error, which could be the SATA cable), all the other disks with errors are on the same PSU. Disks 10, 11, Cache and Parity 1 are all on a different PSU and show no errors.

 

I have another power supply I can use but now I'm not really sure about how to best proceed with my disks having dropped out all over the place. 

i.e. 

- Parity 2 dropped out so I unassigned it for now

- After the reboot when all looked well, I ran a correcting parity check which fixed ~10 errors but when I checked the server after it had finished there were read errors showing on all the data disks. I'm not sure if I can trust my parity is valid at this stage and I'm not sure when the errors started happening, can anyone tell from the diagnostic? 

- Disk 8 dropped out (also not sure if this was after the parity check) but SMART looks OK, I'm running a full check on it now 

 

Not sure if I should dump the data off disk 8 and rebuild it from parity as I feel like I actually trust the data disk more than the current parity state. Is there an option to reintroduce the disk to the array and rebuild parity off the data, assuming the disk is OK? 
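For anyone following along, here's a rough sketch of why parity can be rebuilt from the data disks alone. It assumes the first parity is a plain byte-wise XOR across the data disks (the second parity uses different math that isn't shown here), and the function names `xor_parity` and `rebuild_missing` are made up for illustration, not anything from Unraid itself.

```python
from functools import reduce


def xor_parity(disks):
    """Return the byte-wise XOR of equal-length byte strings.

    Each entry in `disks` stands in for the contents of one data disk.
    Assumption: single parity is a plain XOR across all data disks.
    """
    if not disks:
        raise ValueError("need at least one disk")
    length = len(disks[0])
    if any(len(d) != length for d in disks):
        raise ValueError("all disks must be the same length")
    # zip(*disks) yields one tuple per byte position across all disks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


def rebuild_missing(surviving_disks, parity):
    """Recover one missing disk from the surviving disks plus parity."""
    return xor_parity(list(surviving_disks) + [parity])


# Toy "disks" of two bytes each (hypothetical data).
data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]

# Re-syncing parity only needs the data disks; old parity is discarded.
parity = xor_parity(data)

# If the data is trusted, a disk rebuilt from fresh parity matches it.
recovered = rebuild_missing([data[0], data[2]], parity)
```

The point is that a parity re-sync throws away whatever parity held before and recomputes it from the data, so it is only as good as the data disks it reads from.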

 

Sorry if any of the above is confusing, just never had so many errors all at once, it's been rock solid up until now (going on 10 years..) 

 

Edit: Photo of the setup attached, if anyone is curious. Disk 8 is missing as I'm running a WDDiag on it at the moment.

20190609_173745.jpg

Edited by chickensoup
Photo added

I suspect the problem comes from using dual PSUs that share a poor DC ground.

 

I suggest connecting the grounds of the two PSUs together, i.e. join a black wire on one PSU to a black wire on the other via a Molex plug.

 

That said, the PSU sync plug and the metal case should already do that.

 

There is also another possible reason: one PSU is loaded with only a few disks. If that PSU does not use a DC-to-DC design, then at such a low load its voltage regulation may drift out of range. What model is the PSU that powers only disks?

 

And did the new mainboard pass a memory test?

 

Edited by Benson

Sorry for the long reply but I think I've finally worked out what has happened. 

 

It always felt like it was power related, but the one thing I could never understand was why I was getting errors on some disks but not others, even when they were connected to the same chain off the same power supply cable. I thought at one point that maybe bending the cables into shape to fit the case might have had some impact, since the power supplies are reasonably old (though good quality).

 

It took a few days of thinking about the symptoms and scratching my head; the comment about poor ground also got me thinking and then while sifting through my power supplies and cabling I had a realization. 

 

I had 3 x Modular 6-Pin to SATA cables connected to my ToughPower 750W power supply. Turns out, the power supply likely only shipped with two of these and the one additional cable must be from a different PSU with a slightly different pin-out (I'm pretty sure there is no damage to the drives). My guess is the drives had 12V and Ground but that the 3.3V and 5V lines were swapped around. I feel like such an idiot. 
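To put rough numbers on that guess, here's a small sketch. It assumes the usual ±5% tolerance on the 3.3V, 5V and 12V rails and that the two lower rails were simply swapped at the SATA plug; I never measured the actual pin-out, so the `swapped` values are hypothetical and `out_of_tolerance` is just an illustrative helper.

```python
# Nominal voltages on the three SATA power rails.
NOMINAL = {"3.3V": 3.3, "5V": 5.0, "12V": 12.0}

# Assumed tolerance for these rails (+/-5%).
TOLERANCE = 0.05


def out_of_tolerance(delivered):
    """Return {pin: percent deviation} for pins outside the tolerance.

    `delivered` maps each pin name to the voltage actually present on it.
    """
    bad = {}
    for pin, nominal in NOMINAL.items():
        deviation = (delivered[pin] - nominal) / nominal
        if abs(deviation) > TOLERANCE:
            bad[pin] = round(deviation * 100, 1)
    return bad


# Hypothetical case: 3.3V and 5V lines swapped, 12V and ground correct.
swapped = {"3.3V": 5.0, "5V": 3.3, "12V": 12.0}
result = out_of_tolerance(swapped)
```

Under those assumptions the 5V pin would be sitting roughly a third low and the 3.3V pin roughly half again too high, which is well outside anything a drive is specified for.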

 

The picture below shows the two TT cables connected to the power supply (top row G, 12V, G on each) and two more modular SATA cables I had in my stash. If I had to put money on it, I would guess that the one on the right hand side in the picture was also being used which is why the symptoms were so strange. Close enough voltage to be OK for a little while, but ultimately not what the drive was looking for. 

 

Edit: All of the cables below are SATA to 6-Pin modular PSU cables. The one on the right, held in my hand at a distance, looks pretty much identical to the ones which shipped with the PSU.

 

TT_Modular_SATA.jpg

Edited by chickensoup
5 minutes ago, johnnie.black said:

Glad you found the problem, very lucky not to damage the disks.

 

Thanks!

 

On 6/9/2019 at 7:05 PM, johnnie.black said:

Yes, you can do a new config and re-sync parity.

 

I have a quick question about re-building the array when I boot it back up. Assuming I trust the data on Disk 8, am I able to run a new config and re-build both the parity drives (at once) based on the data that is on the disks currently? Disk 8 is showing a red X (as per the second screenshot on my OP) but I don't think there is an issue with the drive at all, I've tested it outside of the array- I certainly don't trust my Parity right now but I want to re-introduce all the disks. 
