PSU died, took out 1 x 4TB parity and 2 x 4TB data HDDs. What do I do next?


T800

PSU died and fried 3 x 4TB HDDs! What now?

I installed a new PSU.

The server won't autoboot from the USB anymore; I have to select it from the boot menu each time. I reset the CMOS and it still does it. Is the motherboard damaged?

I tried the HDDs on other power and SATA ports and got nothing. I also tried them in my other server to see if they would show up as unassigned devices. They spin up but don't appear as devices.

If the HDDs are completely dead, how do I get the server back up with the 2nd parity drive and without the other 2 data drives?

 


 

 

Link to comment
1 minute ago, T800 said:

I reset CMOS and still does it.

Did you restore to factory defaults, or set the boot order to boot off of the USB stick first?

 

2 minutes ago, T800 said:

2nd parity drive and without the other 2 data drives

You can't.  You have lost a total of 3 drives, but dual parity allows you to lose a maximum of 2 drives.

Link to comment

If the data on the dead drives is important, you might be able to swap circuit boards on the drives to get them working again. Google/Bing about that. From what I recall, if it's only the electronics on the drive that are damaged, and not the internal physical media or drive heads, then a circuit board replacement might work. I think you need identical model drives for the donor boards.

 

As for the drives that are left, I would at least run SMART tests on them to make sure they didn't suffer as well before going forward.

Link to comment
51 minutes ago, T800 said:

So what do I do, I'm guessing I can keep my data on the drives that are left? Do I preclear the 2nd parity drive to use as single parity and then what?

 

Don't preclear. Just New Config, assign your remaining data disks to whatever data slots you want them to use, assign old parity2 to parity slot. Don't check the box saying parity is already valid. Start array to begin parity sync.

 

TL;DR

 

No need to preclear unless you just want to test the disk. unRAID only requires a clear disk when you are adding a disk to a new slot in an array with valid parity. A clear disk is all zeros, which has no effect on the parity calculation, so when you add a clear disk to a new slot parity remains valid. And if you do add a disk to a new slot and this disk hasn't been cleared, unRAID will clear it before it adds it so parity remains valid.

 

When replacing a disk, whether it is parity or data, there is no need to clear the replacement, since it is going to be completely overwritten from the parity calculation anyway.
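
To illustrate the two points above, here is a minimal sketch that assumes single parity is a plain byte-wise XOR across the data disks. The tiny three-byte "disks" and the helper name `xor_parity` are made up for the example and are not unRAID's code.

```python
from functools import reduce


def xor_parity(disks):
    """Return the byte-wise XOR of equal-length disks (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


# Two toy data disks, three bytes each (illustrative values only).
disk1 = bytes([0x12, 0x34, 0x56])
disk2 = bytes([0xAB, 0xCD, 0xEF])
parity = xor_parity([disk1, disk2])

# Adding a cleared (all-zero) disk to a new slot leaves parity unchanged,
# because XOR with zero is a no-op.
cleared = bytes(3)
assert xor_parity([disk1, disk2, cleared]) == parity

# A replacement for disk2 is rebuilt from the other disk plus parity,
# so whatever was on the replacement beforehand is overwritten anyway.
rebuilt_disk2 = xor_parity([disk1, parity])
assert rebuilt_disk2 == disk2
```

The second parity disk uses a different calculation, so this sketch only covers the single-parity case described in this post.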

 

Link to comment

When I put decent PCBs on the 3 HDDs, they all appear as devices on the other server. They won't run a SMART test, though, and report:

=== START OF READ SMART DATA SECTION ===
Read SMART Log Directory failed: scsi error badly formed scsi parameters

Read SMART Self-test Log failed: scsi error badly formed scsi parameters

What should I do?

The drives will mount but have issues.

 

So far the existing drives in the server that failed are passing the SMART tests.

Link to comment

I'm gonna get a PCB and try to save the drives. While I wait for that, I have reset the 2nd parity drive as single parity and started a new config with the remaining drives.

I got quite a few UDMA CRC errors on drive 6.

 

The UDMA CRC error count is currently 5778.

Should I stop the parity sync and check connections, or just let it finish?

 

Parity write is slow at 9 MB/sec.

Link to comment

This is a tough one to answer at this point. I am assuming that the count is continuing to increase. A UDMA CRC error is an error in the transmission of data between the HDD and the SATA controller, whether that controller is on the motherboard or on a SATA card. Usually these errors are caused by a bad SATA cable or a bad connection. (Occasionally, crosstalk can be an issue...) But in your case, with the (most likely over-voltage) problem, there could be more damage to the electronics than was first realized.
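
One way to check whether the count is still climbing is to read SMART attribute 199 (UDMA_CRC_Error_Count) twice, a few minutes apart, and compare. The sketch below only parses text in the layout of a `smartctl -A` attribute table. The `SAMPLE` text is hand-written for illustration (only the raw value 5778 comes from this thread), and `crc_error_count` is a hypothetical helper name.

```python
from typing import Optional


def crc_error_count(smartctl_output: str) -> Optional[int]:
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; RAW_VALUE is the last one.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


# Hand-written sample in the attribute-table layout (illustrative, not real output).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       5778
"""

first_reading = crc_error_count(SAMPLE)
print(first_reading)
```

If a later reading is higher than the first, errors are still occurring on that link. The raw count does not go back down after a cable swap, so what matters is whether it keeps growing.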

 

I would suggest uploading the Diagnostics file to see if anyone can spot anything there.

 

At this point, you might want to take a deep breath and begin to analyze the entire situation.  Is this a 'new server' or have you been using it for a few years?  Have you been considering doing any upgrades to it?  What is your risk tolerance as this system has been severely stressed and, possibly, had its life (and reliability) reduced?  

Link to comment

Personally, I'd stop and investigate. At 9 MB/s you can't be very far into the rebuild and at that rate it will take days to complete if you just leave it.
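
As a rough back-of-the-envelope check on "days", here is a small sketch. It assumes the sync has to cover a full 4 TB parity disk, that the speed stays at a constant 9 MB/s, and that both figures are in decimal units. The function name `sync_hours` is made up for the example.

```python
def sync_hours(capacity_tb: float, speed_mb_s: float) -> float:
    """Estimate hours to write a whole disk at a constant speed (decimal units)."""
    capacity_bytes = capacity_tb * 10**12
    seconds = capacity_bytes / (speed_mb_s * 10**6)
    return seconds / 3600


hours = sync_hours(4, 9)
print(f"{hours:.0f} hours, about {hours / 24:.1f} days")
# prints "123 hours, about 5.1 days"
```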

 

What was the model of the power supply that caused so much damage? I would hope that a quality brand would be capable of failing gracefully.

Link to comment
6 hours ago, T800 said:

I got quite a few UDMA CRC errors on drive 6.

 

Yes. That's your problem. The SATA link keeps being reset every few seconds and can't achieve a speed faster than 1.5 Gb/s. First thing is to try a new cable.

 

Apart from that your motherboard appears to be working well enough.

 

Edited by John_M
Link to comment

Okay thanks, will do. I actually stopped the monthly check this morning because it completed yesterday, didn't realize it was worth running again.

 

I ran the "Fix Common Problems" plugin and it's reporting:

 

Your CPU is running constantly at 100% and will not throttle down when it's idle (to save heat / power). This is because there is currently no CPU Scaling Driver Installed. Seek assistance on the unRaid forums with this issue

 

 

I never had this before the PSU took a swan dive (or it could be from the 6.5.0 update a few days before the PSU went). So far I still haven't been able to save USB as the first-priority boot device either. I'm gonna try a firmware update on the motherboard (I noticed it's a couple of revisions old) and another USB port. If that doesn't sort it, it looks like I might have to get a new motherboard after all.

Link to comment

The parity check went fine.

The server has been working great since.

I installed Tips and Tweaks and set it to On Demand. It still comes up as an error in Fix Common Problems, but the CPU no longer shows 100% and the server sleeps now, which it hadn't for a few months.

Link to comment
  • 1 month later...

Just a quick update on this. I ordered 3 donor PCBs from China. When they arrived two weeks later, I got a half-decent heat gun with a fine tip and swapped the BIOS chips over. One of the parity drives and one of the data drives came back to life. I figured the 3rd PCB might be faulty, so I told the seller and they sent another. It arrived on Monday, I swapped the BIOS chip, and voilà, it's back up and running. I'm back to where I was data-wise, with 2 parity drives, and have just finished my 2nd parity check with 0 errors.

 

The only thing left is to get the server to boot from USB by itself. It's fine if I restart, but if I shut down and boot, I have to go to the boot menu and select USB. Even if I can't sort that, I can live with it for now.

 

Short version, I've managed to not lose any data.

Edited by T800
Link to comment
