Parity Resync/Data Rebuild and Crashing Server


Posted (edited)

I've got a really old unRAID server that's been running for years on an X58A-UD3R motherboard. It has 5 x 3TB WD Red data drives and a single 3TB WD Red parity drive, plus a 256GB Samsung 850 EVO cache. One of the data drives has been reporting read errors, so I thought I'd swap out the parity drive for a 10TB WD Gold Enterprise. I pulled the parity drive (which I've retained) and erased and cleared the Gold. Now the parity resync/data rebuild gets to various stages (4%, 7%, etc.), then the server crashes and restarts. This has happened at least 5 times so far and I can't get close to having it complete.

 

I've seen a few disk-related errors in the syslog and captured diagnostics while they were happening. I can't predict when the server will crash again, so I'm sending syslog to another server in the meantime. Any pointers on how to recover from this mess? I've got several more 10TB WD Gold Enterprise disks pre-clearing on another machine at the moment, but no spare 3TB WD Reds. I could salvage a 2TB drive from another server if it would help get past this. Thanks.

tower-diagnostics-20210805-1007.zip

Edited by outofmysystem
Posted (edited)

Thanks for the reply. I'll keep the memtest in mind, as the server doesn't currently have a monitor attached. The data rebuild is at 20% currently but the estimated completion time keeps growing (now at 21 hours to go). I'm wondering if the parity swap procedure should have been the way to go, as I still have the original parity disk available and could put it back in.

 

- Edit - Estimated finish went as far as 28 hours then the server hard crashed again. Nothing in the remote syslog. Sigh.

Edited by outofmysystem
Posted

Looks like connection problems with disk2.

Aug  5 08:53:45 Tower kernel: ata19.00: ATA-9: WDC WD30EFRX-68EUZN0,      WD-WCC4N5HFUNV9, 82.00A82, max UDMA/133
...
Aug  5 08:55:29 Tower kernel: md: import disk2: (sdh) WDC_WD30EFRX-68EUZN0_WD-WCC4N5HFUNV9 size: 2930266532 
...
Aug  5 09:37:51 Tower kernel: ata19.00: failed command: READ FPDMA QUEUED
Aug  5 09:37:51 Tower kernel: ata19.00: cmd 60/20:00:28:21:5b/01:00:20:00:00/40 tag 0 ncq dma 147456 in
Aug  5 09:37:51 Tower kernel:         res 40/00:74:20:35:5b/00:00:20:00:00/40 Emask 0x32 (host bus error)
Aug  5 09:37:51 Tower kernel: ata19.00: status: { DRDY }
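If you want to do the same matching yourself on a saved syslog, here is a rough Python sketch. It assumes the three line shapes seen in the excerpt above (the `ataN.00: ATA-9: model, serial, firmware` identify line, the `md: import diskN` line, and the `failed command` line); the function name `summarize` and the regexes are mine, not anything Unraid ships, and other kernels may format the identify line differently.

```python
import re
from collections import Counter

# Sample lines copied from the excerpt above; in practice, read the saved syslog file instead.
SYSLOG = """\
Aug  5 08:53:45 Tower kernel: ata19.00: ATA-9: WDC WD30EFRX-68EUZN0,      WD-WCC4N5HFUNV9, 82.00A82, max UDMA/133
Aug  5 08:55:29 Tower kernel: md: import disk2: (sdh) WDC_WD30EFRX-68EUZN0_WD-WCC4N5HFUNV9 size: 2930266532
Aug  5 09:37:51 Tower kernel: ata19.00: failed command: READ FPDMA QUEUED
Aug  5 09:37:51 Tower kernel:         res 40/00:74:20:35:5b/00:00:20:00:00/40 Emask 0x32 (host bus error)
"""

# Assumption: the second comma-separated field of the identify line is the serial number.
IDENT = re.compile(r"kernel: (ata\d+)\.\d+: ATA-\d+: [^,]+,\s*([^,\s]+),")
IMPORT = re.compile(r"md: import (disk\d+): \((\w+)\) (\S+)")
FAILED = re.compile(r"kernel: (ata\d+)\.\d+: failed command: (.+)")


def summarize(text):
    """Return (ata_port, serial, array_slot, device, failed_command_count) tuples."""
    serial_by_port = {}
    slot_by_serial = {}
    failures = Counter()
    for line in text.splitlines():
        if m := IDENT.search(line):
            serial_by_port[m.group(1)] = m.group(2)
        elif m := IMPORT.search(line):
            # The disk identifier ends with "_<serial>".
            serial = m.group(3).rsplit("_", 1)[-1]
            slot_by_serial[serial] = (m.group(1), m.group(2))
        elif m := FAILED.search(line):
            failures[m.group(1)] += 1
    report = []
    for port, count in failures.items():
        serial = serial_by_port.get(port, "unknown")
        slot, dev = slot_by_serial.get(serial, ("unknown", "unknown"))
        report.append((port, serial, slot, dev, count))
    return report


for row in summarize(SYSLOG):
    print(row)
```

On the four sample lines this ties ata19 to disk2 (sdh) with one failed command, which is the same conclusion reached by reading the log by eye.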

 

Is that the disk that was giving errors? SMART for that disk looks mostly OK except

  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    3

You should monitor attributes 1 and 200 on WD disks. You can set this up by clicking on the disk and adding those custom attributes.
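If you'd rather check those two attributes from a saved SMART report than from the GUI, something like this Python sketch would do it. It assumes the 8-column attribute layout of the line quoted above (ID, name, flags, value, worst, threshold, fail, raw value); `nonzero_watched` is just an illustrative name, and other smartctl output formats have more columns and would need a different index for the raw value.

```python
WATCHED = {1, 200}  # attribute IDs suggested above for WD disks


def nonzero_watched(smart_text, watched=WATCHED):
    """Return {attribute_id: (name, raw_value)} for watched attributes with a non-zero raw value."""
    hits = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 8 columns:
        # ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE (layout assumed from the quoted line).
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        raw = fields[7]
        if attr_id in watched and raw.isdigit() and int(raw) != 0:
            hits[attr_id] = (fields[1], int(raw))
    return hits


sample = """\
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    3
  9 Power_On_Hours          -O--CK   044   044   000    -    41199
"""
print(nonzero_watched(sample))
```

A non-zero raw value is only a prompt to look closer, not proof of a failing disk on its own.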

 

Also, which controller is that disk on?

01:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9128 PCIe SATA 6 Gb/s RAID controller [1b4b:9128] (rev 11)

Marvell controllers are not recommended.

 

Check connections, run an extended SMART test on that disk (will take several hours), and post new diagnostics.

 

If it turns out that disk needs replacing, then parity swap is what you should be doing.

 

Posted

Thanks for taking the time to look at the diags. Disk2 is the one that's been showing read errors and the one that will be swapped out eventually. Earlier, I added the original parity drive back in and the resync is at 25% with 4 hours to go. Fingers crossed this actually completes. I can't remember which drives are on which controllers but I'll avoid the Marvell ones once I see how these steps finish, then I'll try a parity swap. Thanks!

Posted
7 minutes ago, outofmysystem said:

I added the original parity drive back in and the resync is at 25% with 4 hours to go.

It wasn't necessary to attempt rebuild on original parity disk. New Config - Trust Parity would have let you put the disk back in as is. Of course that parity wouldn't actually be valid if you had written anything to the array while it was out.

Posted (edited)

So close. Made it to 92.4% before the server crashed again. Array is currently stopped as I disabled autostart. I don't believe anything has been written to the array since I started this process. I don't have VMs and I stopped and disabled docker right at the start before attempting to swap out the parity drive for the larger one. I've attached the latest diagnostics. Is the next thing to try New Config - Trust Parity and then the parity swap procedure? Thanks. 

Capture.PNG

tower-diagnostics-20210805-2214.zip

Edited by outofmysystem
Posted

Parity build shouldn't cause crash even if you have a bad disk or disk connection problems. Unfortunately, syslog resets on reboot, so those diagnostics don't show if there was anything causing problems during parity build.

 

Have you eliminated connection to Marvell?

 

Do you have backups of anything important and irreplaceable?

 

9 hours ago, trurl said:

Check connections, run an extended SMART test on that disk (will take several hours), and post new diagnostics.

 

Posted (edited)

Hi again. I had moved all the HDDs off the Marvell controller and left just the SSD cache drive attached to it. I've set up remote syslog and will try one more time. I have backups of most of the important stuff on a 2nd unRAID server, but the failing one is my media server so I'd really like to save the array. The 2nd server doesn't have enough SATA ports to take the failing server's disks for testing. So I'm now torn between buying a new SATA controller card for the failing server, in case it's the motherboard's SATA ports that are the issue, or biting the bullet and building a whole new rig and trying the disks in that.

 

edit - Reset connections and running extended SMART test on disk 2

Edited by outofmysystem
Posted
6 hours ago, outofmysystem said:

Last time I checked, the extended SMART test on Disk 2 was at 20% then it just seemed to stop.

Do you mean the test stopped running, or just that it hasn't progressed?

 

  9 Power_On_Hours          -O--CK   044   044   000    -    41199
  
  SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     40969         -
# 2  Short offline       Completed without error       00%     39783         -
# 3  Extended offline    Interrupted (host reset)      10%     39783         -
# 4  Short offline       Completed without error       00%     39724         -
# 5  Short offline       Completed without error       00%     23791         -

As you can see, there is nothing about the extended test you started in SMART for that disk yet. If it hasn't actually stopped then it probably just hasn't completed yet. Expect it to take several hours and you don't really get much in the way of progress indicator.
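To make that check less of an eyeball job, here is a small Python sketch that parses the self-test log rows and compares the newest extended test against the current Power_On_Hours. It assumes the row layout shown above; `parse_selftest_log` and `latest_extended` are names I made up for illustration.

```python
import re

# Matches rows such as "# 3  Extended offline    Interrupted (host reset)      10%     39783         -"
ROW = re.compile(
    r"^#\s*(\d+)\s+(Short offline|Extended offline)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Turn self-test log rows into dicts; lines that are not log rows are ignored."""
    entries = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            entries.append({
                "num": int(m.group(1)),
                "type": m.group(2),
                "status": m.group(3),
                "remaining_pct": int(m.group(4)),
                "lifetime_hours": int(m.group(5)),
            })
    return entries


def latest_extended(entries):
    """Return the extended test with the highest LifeTime(hours), or None if there is none."""
    extended = [e for e in entries if e["type"] == "Extended offline"]
    return max(extended, key=lambda e: e["lifetime_hours"], default=None)


LOG = """\
# 1  Short offline       Completed without error       00%     40969         -
# 2  Short offline       Completed without error       00%     39783         -
# 3  Extended offline    Interrupted (host reset)      10%     39783         -
# 4  Short offline       Completed without error       00%     39724         -
# 5  Short offline       Completed without error       00%     23791         -
"""
POWER_ON_HOURS = 41199  # raw value of attribute 9 quoted above

entries = parse_selftest_log(LOG)
last = latest_extended(entries)
print(last["status"], "logged", POWER_ON_HOURS - last["lifetime_hours"], "power-on hours ago")
```

Here the newest extended entry is the interrupted one from 1416 power-on hours ago, so the test started today has not been written to the log yet.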

Posted (edited)

Hi. It literally just stopped after running for a while - I saw up to 20%. I was watching it tick over then when I looked again, the screen looked like it hadn't been running any test. Is it worth just cutting my losses and proceeding with the parity swap? I can't tell if it's the board that's defective or just this drive that's causing all the problems. 

 

Where can I track the extended SMART test if not in the GUI? Is there a log file I can tail so I can keep an eye on it? Thanks

 

- Edit - found on Reddit. 

Quote

 

Do you have the disk set to spin down, because that will abort the test every time.

I just went through the same thing, I turned off the spin down delay temporarily and it let the extended test complete.

 

 

Edited by outofmysystem
Added more info
Posted

Were you trying to run extended SMART test on that disk, and rebuild parity at the same time?

 

Since you have tried and failed to rebuild parity on the original parity disk, we don't know if it currently has valid parity, which is required for parity swap to succeed. We can assume new parity disk is invalid since it never had parity built, at least not as far as the largest data disk.

 

Even if you are absolutely sure nothing has been written to your server since you started all this, I would still not be confident that original parity is now valid since you tried to rebuild it and had problems completing rebuild.

10 hours ago, outofmysystem said:

I've set up remote syslog

Have you had a crash since then? Beginning to wonder if you don't have some sort of hardware problem beyond just a problem disk.

 

Might be simpler to New Config without the problem disk and see if you can get parity built that way on the new parity disk, then see what you can copy from the problem disk by mounting it Unassigned.

 

If you can't successfully rebuild parity then you very likely wouldn't be able to do parity swap either even if you had valid parity, since parity swap ultimately does a rebuild (of a data disk) after copying parity. Unless simply removing that problem disk somehow made all your problems go away.
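To illustrate why valid parity matters here, a toy Python sketch, assuming single parity behaves as a plain byte-wise XOR across the data disks. The three tiny "disks" and the `xor_blocks` helper are made up for the example and have nothing to do with how the md driver is actually implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy "data disks" of three bytes each (made-up contents).
disk1 = bytes([0x12, 0x34, 0x56])
disk2 = bytes([0xAB, 0xCD, 0xEF])
disk3 = bytes([0x0F, 0xF0, 0x55])

# Building parity reads every data disk.
parity = xor_blocks([disk1, disk2, disk3])

# Rebuilding disk2 reads parity plus every other data disk.
rebuilt = xor_blocks([parity, disk1, disk3])
print(rebuilt == disk2)  # valid parity gives back the original contents

# If parity is stale or only partly written, the rebuilt disk is silently wrong.
stale_parity = bytes([parity[0] ^ 0xFF]) + parity[1:]
bad_rebuild = xor_blocks([stale_parity, disk1, disk3])
print(bad_rebuild == disk2)
```

Either direction has to read all the other disks end to end, which is why a server that can't finish a parity build is unlikely to finish a data rebuild.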

Posted

Another possibility would be to unassign parity, start the array, and then try to copy whatever you can from the problem disk. Then work on getting parity built without that disk.

Posted

Thanks very much for all your help with this. I definitely wasn't running an extended SMART test and rebuilding parity at the same time. I figured out that I had the disks set to spin down, which was interrupting the SMART test. I've rebooted and started the test again, so hopefully it will complete overnight and I'll upload the results. It's possible there is an underlying hardware problem as this server is quite old, but maybe a completed test will shed some light. I haven't seen anything in the remote syslog during the multiple crashes I've had when trying to rebuild parity.

 

Posted (edited)
3 minutes ago, trurl said:

Another possibility would be to unassign parity, start the array, and then try to copy whatever you can from the problem disk. Then work on getting parity built without that disk.

So for this, would I just copy all of the files to another server or should I copy them over an existing drive in the array? I'm a bit confused by that, sorry. 

Edited by outofmysystem
Posted

So, I've changed out all of the SATA cables and reseated everything. All 6 HDDs are online with the parity drive showing as emulated. A Parity-Sync/Data-Rebuild operation is running at the moment but it'll likely fail like the rest. I've dug out an old server that only has 6 SATA ports so I've got no way to add all 6, plus the SSD cache drive. I can order a PCI Express (PCIe) SATA III (6G) SSD Adapter but that will take a while to arrive. Annoyingly, I just found 2 x 3TB Reds in the old server I forgot I had so could have done a straight swap for the dodgy disk. 

 

Quote

Another possibility would be to unassign parity, start the array, and then try to copy whatever you can from the problem disk. Then work on getting parity built without that disk.

 

I think it's possible to get the data off the faulty disk 2. Would using unBALANCE to scatter the data from it be an option, or should I just bite the bullet and rsync all its contents to another server on my network?

 

Posted

unBALANCE would be one way to do that, but not sure how well it would do unless the issues you are having with building parity are resolved. I was thinking more of copying to an Unassigned Device, or possibly removing parity if it can't be rebuilt and trying to copy to another disk in the array.

 

Might be simpler and safer to put the disk in another computer that can read Linux filesystems to do the copy and so sidestep the issues you are currently having with your server.

Posted (edited)

Well, that last parity rebuild finished successfully. I guess I'll go and check a few things that live on Disk 2 and see if they look intact. This morning I took out all of the SATA cables and replaced them with spares I had, and actually plugged 4 of them into the Marvell SATA ports (I know!). So the culprit was either one of the old SATA cables or one of the SATA ports on the board. This morning I also ordered an LSI SAS 9207-8i and cables, which will arrive next week and should take both the onboard ports and the SATA cables out of the equation. Seeing as Disk 2 passed an extended SMART test, I'm wondering if the read errors were down to the ports/cables, but I don't know enough to make that call.

 

I'm hoping I'm out of the woods now but thank you SO much for your help and patience with this. Coming back to the thread over multiple days is very much appreciated. 

 

 

Capture.PNG

Edited by outofmysystem
