chickensoup

[Solved] Read errors all over after board swap

16 posts in this topic Last Reply

Posted (edited)

Help please :(

 

I recently transplanted my hardware into a new case and at the same time changed the board, CPU and memory. Initially I had a problem where one of the controller cards was showing all disks on boot but not in Unraid (displaying as missing). I updated the BIOS on the board from F3 to F8 and re-seated the card and cables, which seemed to work. Everything booted up OK and all disks went green. I allocated a new/spare disk as a cache drive, which formatted OK, and I had green lights across the board with no errors.

 

A few hours later, after some light Plex use, Parity 2 dropped out first (red X). I figured maybe the cable was bad, and I was dealing with dinner/son/etc., so I left it for the moment. Shortly thereafter, when I got a chance to check it, the shares had dropped off and I had read errors across most of the disks.

 

I'm wondering if the BIOS is set up slightly differently on the new board (legacy, IDE mode, etc.), so tomorrow afternoon I'll compare against the old board, but I'm mostly at a loss as to what is going on.

 

Motherboard changed from a Gigabyte GA-H57M-USB3 (rev 2.0) to a Gigabyte GA-H67MA-USB3-B3 (rev 1.0) and I ran a few passes of memtest on the new board without any issues yesterday.  

 

Edit: I unassigned Parity 2 since it was 'dropped' and after a reboot and replacing a couple of SATA cables (both parity drives) it was looking OK but then after today, errors all over the place again. Syslog is from the first time it failed, diagnostic is from after the reboot. Now disk 8 has dropped completely, I have no idea what is going on.  Added an extra screenshot.

 

ohno.PNG

unraid-syslog-20190603-1428.zip

after-reboot.jpg

unraid-diagnostics-20190607-0945.zip

Edited by chickensoup
Solved.


And what card are you using for the additional SATA ports beyond the MB ones? 


I'm using two Adaptec 1430SAs. The thing is that Parity 1 and Parity 2 are on the same card, and one has errors and one doesn't. Drives on the second card have errors and so do ones on the motherboard, but some don't...


Sorry for the late reply, been really busy with work. I've updated the OP with a diagnostic taken after booting the server back up last night. It looked OK initially and I ran a non-correcting parity check overnight. In the morning all disks were OK at about 30% with 8 errors detected, so I stopped the check and changed to a correcting parity check, which I now regret; I'm hoping the data isn't corrupt. This afternoon it looks just as bad as the other day, only with different disks.

 

Please note that between the two screenshots/reboots I also tidied up the cabling so the specific disks aren't necessarily on the same ports as they were the first time. Apologies if this makes things a little messier to diagnose but the logs should clear up any confusion.


Multiple SATA links are going down, on multiple controllers, so my first guess would be a power problem.

 

Also change the onboard SATA controller to AHCI mode.
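
For anyone wanting to check the same thing in their own syslog, here is a rough Python sketch that counts link-down and link-reset messages per ATA port. It assumes the usual Linux libata wording ("SATA link down", "hard resetting link"); the function name and the sample lines are made up for illustration and are not taken from the attached diagnostics.

```python
import re
from collections import Counter

# Assumed libata message fragments; adjust to match your own syslog.
LINK_EVENTS = ("SATA link down", "hard resetting link")

# Matches a kernel port prefix such as "ata3:" or "ata3.00:".
PORT_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: (.*)")


def count_link_events(lines):
    """Return a Counter of link-down/reset messages keyed by ATA port."""
    counts = Counter()
    for line in lines:
        match = PORT_RE.search(line)
        if match and any(event in match.group(2) for event in LINK_EVENTS):
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, for illustration only.
SAMPLE_SYSLOG = [
    "Jun  3 14:01:02 Tower kernel: ata3: hard resetting link",
    "Jun  3 14:01:08 Tower kernel: ata3: SATA link down (SStatus 0 SControl 300)",
    "Jun  3 14:01:09 Tower kernel: ata7: hard resetting link",
    "Jun  3 14:01:10 Tower kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
]

if __name__ == "__main__":
    for port, hits in sorted(count_link_events(SAMPLE_SYSLOG).items()):
        print(port, hits)
```

If the resets are spread over ports that belong to different controllers, that fits a shared cause such as power better than a single bad cable or card.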

Posted (edited)

Will change the onboard SATA controller to AHCI tonight, anything else I should look at before I reboot it? It is still currently powered on.

 

Not to over-complicate things but in full disclosure, the system is actually running off two power supplies, for no reason other than that the case supports them and I was testing power usage balancing the load between the two. Based on what you have said I suspect my TT 750 might be playing up, which is strange since it is actually powering less than it has been for the last few months.

 

I tested both the other night after it first failed and they looked OK but I might try swapping them around to see if this fixes anything.

 

More info here >

 

Edited by chickensoup
More info

42 minutes ago, chickensoup said:

anything else I should look at before I reboot it?

Nothing else that comes to mind.

Posted (edited)

I actually tested both the power supplies before rebuilding the server and they looked OK, even under load, but it's curious that other than Parity 2 (which dropped due to a SMART 199, which could be the SATA cable) all the other disks with errors are on the same PSU. Disks 10, 11, Cache and Parity 1 are all on a different PSU and show no errors.
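
On the SMART 199 point: attribute 199 is normally the UDMA CRC error count, which tends to point at the link (cable or connector) rather than the disk surface. A rough Python sketch for pulling its raw value out of `smartctl -A` text is below. It assumes the standard smartmontools attribute table layout with the raw value in the tenth column; the helper name and the sample table are invented for illustration.

```python
def raw_value(smartctl_text, attr_id=199):
    """Return the raw value of a SMART attribute from `smartctl -A` text, or None."""
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID and have at least ten columns.
        if len(fields) >= 10 and fields[0] == str(attr_id):
            return int(fields[9])
    return None


# Hypothetical sample output, for illustration only.
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   045   045   000    Old_age   Always       -       40321
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       12
"""

if __name__ == "__main__":
    print(raw_value(SAMPLE_SMART))
```

A raw value that keeps climbing between checks would be the thing to watch, since the counter does not reset once the cable is replaced.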

 

I have another power supply I can use but now I'm not really sure about how to best proceed with my disks having dropped out all over the place. 

i.e. 

- Parity 2 dropped out so I unassigned it for now

- After the reboot when all looked well, I ran a correcting parity check which fixed ~10 errors but when I checked the server after it had finished there were read errors showing on all the data disks. I'm not sure if I can trust my parity is valid at this stage and I'm not sure when the errors started happening, can anyone tell from the diagnostic? 

- Disk 8 dropped out (also not sure if this was after the parity check) but SMART looks OK, I'm running a full check on it now 

 

Not sure if I should dump the data off disk 8 and rebuild it from parity as I feel like I actually trust the data disk more than the current parity state. Is there an option to reintroduce the disk to the array and rebuild parity off the data, assuming the disk is OK? 

 

Sorry if any of the above is confusing, just never had so many errors all at once, it's been rock solid up until now (going on 10 years..) 

 

Edit: Photo of the setup attached, if anyone is curious. Disk 8 is missing as I'm running a WDDiag on it at the moment.

20190609_173745.jpg

Edited by chickensoup
Photo added

1 hour ago, chickensoup said:

Is there an option to reintroduce the disk to the array and rebuild parity off the data, assuming the disk is OK? 

Yes, you can do a new config and re-sync parity.

Posted (edited)

I suspect the problem comes from using dual PSUs that share a poor DC ground.

I suggest connecting the grounds of the two PSUs together, i.e. join a black wire from each PSU to the other via a Molex plug.

Although the PSU sync plug and the metal case should already do that.

There is also another possible reason: if one PSU is only loaded with a few disks and it does not use a DC-to-DC design, then the low load may push its voltage regulation out of range. What model is the PSU that powers only disks?

And does the new mainboard pass a memory test?

 

Edited by Benson

Posted (edited)

Sorry for the long reply but I think I've finally worked out what has happened. 

 

It always felt like it was power related, but the one thing I could never understand was why I was getting errors on some disks but not others, even when they were connected to the same chain off the same power supply cable. I thought at one point that maybe bending the cables into shape to fit the case might have had some impact, since the power supplies are reasonably old (though good quality).

 

It took a few days of thinking about the symptoms and scratching my head; the comment about poor ground also got me thinking and then while sifting through my power supplies and cabling I had a realization. 

 

I had 3 x Modular 6-Pin to SATA cables connected to my ToughPower 750W power supply. Turns out, the power supply likely only shipped with two of these and the one additional cable must be from a different PSU with a slightly different pin-out (I'm pretty sure there is no damage to the drives). My guess is the drives had 12V and Ground but that the 3.3V and 5V lines were swapped around. I feel like such an idiot. 

 

The picture below shows the two TT cables connected to the power supply (top row G, 12V, G on each) and two more modular SATA cables I had in my stash. If I had to put money on it, I would guess that the one on the right hand side in the picture was also being used which is why the symptoms were so strange. Close enough voltage to be OK for a little while, but ultimately not what the drive was looking for. 

 

Edit: All of the cables below are SATA to 6-Pin modular PSU cables. The one on the right, in my hand, looks pretty much identical at a distance to the ones which shipped with the PSU.

 

TT_Modular_SATA.jpg

Edited by chickensoup

5 minutes ago, johnnie.black said:

Glad you found the problem, very lucky not to damage the disks.

 

Thanks!

 

On 6/9/2019 at 7:05 PM, johnnie.black said:

Yes, you can do a new config and re-sync parity.

 

I have a quick question about re-building the array when I boot it back up. Assuming I trust the data on Disk 8, am I able to run a new config and re-build both the parity drives (at once) based on the data that is on the disks currently? Disk 8 is showing a red X (as per the second screenshot in my OP) but I don't think there is an issue with the drive at all; I've tested it outside of the array. I certainly don't trust my parity right now, but I want to re-introduce all the disks.

11 minutes ago, chickensoup said:

Assuming I trust the data on Disk 8, am I able to run a new config and re-build both the parity drives (at once) based on the data that is on the disks currently?

Yes, new config will reset the disabled disk.


Was able to generate a new config and rebuild Parity to both P disks without any issue, all disks are up and there doesn't appear to be any issue with the data. 

 

Thanks for all your help guys :)

 

Marked as Solved. 
