outofmysystem Posted August 5, 2021 (edited)
I've got a really old unRAID server that's been running for years on an X58A-UD3R motherboard. It has 5 x WD Red 3TB drives and a single WD Red 3TB parity drive, plus a 256GB Samsung 850 EVO cache. One of the data drives has been reporting read errors, so I thought I'd swap out the parity drive for a 10TB WD Gold Enterprise. I pulled the parity drive (which I've retained) and erased and cleared the Gold. Now the parity resync/data rebuild gets to various stages (4%, 7%, etc.), then the server crashes and restarts. This has happened at least 5 times so far and I can't even get close to having it complete. I've seen a few disk-related errors in the syslog and captured the diagnostics while they were happening. I can't predict when the server will crash again, so I'm sending syslog to another server in the meantime.
Any pointers on how to recover from this mess? I've got several more 10TB WD Gold Enterprise disks that are pre-clearing on another machine at the moment, but I don't have any spare 3TB WD Reds. I could salvage a 2TB drive from another server if it'll help to get past this. Thanks.
tower-diagnostics-20210805-1007.zip
Edited August 5, 2021 by outofmysystem
knightrideruk Posted August 5, 2021
Have you checked your memory? Run memtest from the boot screen. I think you may have to replace the faulty 3TB drive before replacing parity.
outofmysystem Posted August 5, 2021 (edited)
Thanks for the reply. I'll keep the memtest in mind, as the server doesn't currently have a monitor attached. The data rebuild is at 20% currently but the estimated completion keeps growing (now at 21 hours to go). I'm wondering if the parity swap procedure should have been the way to go, as I still have the original parity disk available and it could be put back in.
- Edit - Estimated finish went as far as 28 hours, then the server hard crashed again. Nothing in the remote syslog. Sigh.
Edited August 5, 2021 by outofmysystem
trurl Posted August 5, 2021
Looks like connection problems with disk2.
Aug 5 08:53:45 Tower kernel: ata19.00: ATA-9: WDC WD30EFRX-68EUZN0, WD-WCC4N5HFUNV9, 82.00A82, max UDMA/133
...
Aug 5 08:55:29 Tower kernel: md: import disk2: (sdh) WDC_WD30EFRX-68EUZN0_WD-WCC4N5HFUNV9 size: 2930266532
...
Aug 5 09:37:51 Tower kernel: ata19.00: failed command: READ FPDMA QUEUED
Aug 5 09:37:51 Tower kernel: ata19.00: cmd 60/20:00:28:21:5b/01:00:20:00:00/40 tag 0 ncq dma 147456 in
Aug 5 09:37:51 Tower kernel: res 40/00:74:20:35:5b/00:00:20:00:00/40 Emask 0x32 (host bus error)
Aug 5 09:37:51 Tower kernel: ata19.00: status: { DRDY }
Is that the disk that was giving errors? SMART for that disk looks mostly OK except
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 3
You should monitor attributes 1 and 200 on WD disks. You can set this up by clicking on the disk and adding those custom attributes.
Also, which controller is that disk on?
01:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9128 PCIe SATA 6 Gb/s RAID controller [1b4b:9128] (rev 11)
Marvell is not recommended.
Check connections, run an extended SMART test on that disk (will take several hours), and post new diagnostics. If it turns out that disk needs replacing, then parity swap is what you should be doing.
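For readers following along, the correlation done by eye in the post above (ATA port, drive serial, array slot) can be sketched in Python. This is only a minimal sketch: the regular expressions are assumptions fitted to the few syslog lines quoted in this thread, and the function name `summarize` is hypothetical, not part of any Unraid tool.

```python
import re
from collections import Counter

# Sample lines copied from the syslog excerpt quoted in the thread.
SYSLOG = """\
Aug 5 08:53:45 Tower kernel: ata19.00: ATA-9: WDC WD30EFRX-68EUZN0, WD-WCC4N5HFUNV9, 82.00A82, max UDMA/133
Aug 5 08:55:29 Tower kernel: md: import disk2: (sdh) WDC_WD30EFRX-68EUZN0_WD-WCC4N5HFUNV9 size: 2930266532
Aug 5 09:37:51 Tower kernel: ata19.00: failed command: READ FPDMA QUEUED
Aug 5 09:37:51 Tower kernel: res 40/00:74:20:35:5b/00:00:20:00:00/40 Emask 0x32 (host bus error)
"""

# Patterns are assumptions based only on the lines above.
IDENT = re.compile(r"kernel: (ata\d+)\.\d+: ATA-\d+: ([^,]+), ([^,]+),")
IMPORT = re.compile(r"md: import (disk\d+): \((\w+)\) (\S+)")
FAILED = re.compile(r"kernel: (ata\d+)\.\d+: failed command: (.+)$")


def summarize(text):
    """Return (ata_port, array_slot, serial, failed_command_count) tuples."""
    ports = {}            # ata port -> (model, serial)
    imports = []          # (array slot, device name, identifier string)
    failures = Counter()  # ata port -> number of failed commands
    for line in text.splitlines():
        if m := IDENT.search(line):
            ports[m.group(1)] = (m.group(2).strip(), m.group(3).strip())
        elif m := IMPORT.search(line):
            imports.append(m.groups())
        elif m := FAILED.search(line):
            failures[m.group(1)] += 1

    report = []
    for port, count in failures.items():
        _model, serial = ports.get(port, ("unknown", "unknown"))
        # The md identifier ends with the drive serial in the quoted line.
        slot = next(
            (s for s, _dev, ident in imports if ident.endswith(serial)),
            "unknown",
        )
        report.append((port, slot, serial, count))
    return report


if __name__ == "__main__":
    for port, slot, serial, count in summarize(SYSLOG):
        print(f"{port}: {count} failed command(s), serial {serial}, array slot {slot}")
```

Feeding it a full syslog instead of the four sample lines would show whether failed commands cluster on one port (pointing at a cable, port, or controller) or follow one drive serial across ports.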
outofmysystem Posted August 5, 2021
Thanks for taking the time to look at the diags. Disk2 is the one that's been showing read errors and the one that will be swapped out eventually. Earlier, I added the original parity drive back in and the resync is at 25% with 4 hours to go. Fingers crossed this actually completes. I can't remember which drives are on which controllers but I'll avoid the Marvell ones once I see how these steps finish, then I'll try a parity swap. Thanks!
trurl Posted August 5, 2021
7 minutes ago, outofmysystem said: "I added the original parity drive back in and the resync is at 25% with 4 hours to go."
It wasn't necessary to attempt a rebuild on the original parity disk. New Config - Trust Parity would have let you put the disk back in as is. Of course that parity wouldn't actually be valid if you had written anything to the array while it was out.
trurl Posted August 5, 2021
No difference at this point between rebuilding original parity or rebuilding new parity. Go ahead and see if it will complete, then post new diagnostics either way.
outofmysystem Posted August 5, 2021 (edited)
So close. Made it to 92.4% before the server crashed again. Array is currently stopped as I disabled autostart. I don't believe anything has been written to the array since I started this process. I don't have VMs and I stopped and disabled docker right at the start before attempting to swap out the parity drive for the larger one. I've attached the latest diagnostics. Is the next thing to try New Config - Trust Parity and then the parity swap procedure? Thanks.
tower-diagnostics-20210805-2214.zip
Edited August 5, 2021 by outofmysystem
trurl Posted August 6, 2021
A parity build shouldn't cause a crash even if you have a bad disk or disk connection problems. Unfortunately, syslog resets on reboot, so those diagnostics don't show whether anything was causing problems during the parity build.
Have you eliminated the connection to Marvell? Do you have backups of anything important and irreplaceable?
9 hours ago, trurl said: "Check connections, run an extended SMART test on that disk (will take several hours), and post new diagnostics."
outofmysystem Posted August 6, 2021 (edited)
Hi again. I had moved all the HDDs off the Marvell controller and left just the SSD cache drive attached to it. I've set up remote syslog and will try one more time. I have backups of most of the important stuff on a 2nd unRAID server, but the failing one is my media server so I'd really like to save the array. The 2nd server doesn't have enough SATA ports to take the failing server's disks for testing. So I'm now torn between buying a new SATA controller card for the failing server, in case it's the motherboard's SATA ports that are the issue, and biting the bullet and building a whole new rig to try the disks in.
edit - Reset connections and running extended SMART test on disk 2
Edited August 6, 2021 by outofmysystem
outofmysystem Posted August 6, 2021
Last time I checked, the extended SMART test on Disk 2 was at 20%, then it just seemed to stop. The unRAID GUI now shows "No self-tests logged on this disk". No errors or anything appeared on the remote syslog and the server remained up. Latest diags attached.
tower-diagnostics-20210806-1254.zip
trurl Posted August 6, 2021
6 hours ago, outofmysystem said: "Last time I checked, the extended SMART test on Disk 2 was at 20% then it just seemed to stop."
Do you mean the test stopped running, or just that it hasn't progressed?
9 Power_On_Hours -O--CK 044 044 000 - 41199
SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                      Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error     00%        40969            -
# 2  Short offline       Completed without error     00%        39783            -
# 3  Extended offline    Interrupted (host reset)    10%        39783            -
# 4  Short offline       Completed without error     00%        39724            -
# 5  Short offline       Completed without error     00%        23791            -
As you can see, there is nothing about the extended test you started in SMART for that disk yet. If it hasn't actually stopped then it probably just hasn't completed yet. Expect it to take several hours and you don't really get much in the way of progress indicator.
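The reading of the self-test log in the post above can be reproduced with a short Python sketch. It assumes the log rows look like the ones quoted in this thread and that the most recent test is listed first; the function names `parse_selftest_log` and `latest_extended` are hypothetical helpers, not part of smartmontools.

```python
import re

# Rows copied from the self-test log quoted in the thread.
LOG = """\
# 1 Short offline Completed without error 00% 40969 -
# 2 Short offline Completed without error 00% 39783 -
# 3 Extended offline Interrupted (host reset) 10% 39783 -
# 4 Short offline Completed without error 00% 39724 -
# 5 Short offline Completed without error 00% 23791 -
"""
POWER_ON_HOURS = 41199  # attribute 9 from the same post

# Pattern is an assumption fitted to the rows above.
ROW = re.compile(
    r"^#\s*(\d+)\s+(\w+) offline\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Turn each log row into a dict; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            num, kind, status, remaining, hours, lba = m.groups()
            rows.append({
                "num": int(num),
                "kind": kind,
                "status": status,
                "remaining_pct": int(remaining),
                "lifetime_hours": int(hours),
                "lba_of_first_error": lba,
            })
    return rows


def latest_extended(rows, power_on_hours):
    """Return the first (newest) extended test entry with its age in hours."""
    for row in rows:
        if row["kind"] == "Extended":
            return {**row, "hours_ago": power_on_hours - row["lifetime_hours"]}
    return None


if __name__ == "__main__":
    entry = latest_extended(parse_selftest_log(LOG), POWER_ON_HOURS)
    if entry is None:
        print("No extended self-test logged")
    else:
        print(f"Newest extended test: {entry['status']}, "
              f"{entry['hours_ago']} power-on hours ago")
```

With the figures from the post, the newest extended entry is the interrupted one from well over a thousand power-on hours earlier, which is why the test started that day is not visible in the log yet.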
outofmysystem Posted August 6, 2021 (edited)
Hi. It literally just stopped after running for a while - I saw up to 20%. I was watching it tick over, then when I looked again the screen looked like no test had been running. Is it worth just cutting my losses and proceeding with the parity swap? I can't tell if it's the board that's defective or just this drive that's causing all the problems. Where can I track the extended SMART test if not in the GUI? Is there a log file I can tail so I can keep an eye on it? Thanks
- Edit - found on Reddit: "Do you have the disk set to spin down, because that will abort the test every time. I just went through the same thing, I turned off the spin down delay temporarily and it let the extended test complete."
Edited August 6, 2021 by outofmysystem (Added more info)
trurl Posted August 6, 2021
Were you trying to run an extended SMART test on that disk and rebuild parity at the same time?
Since you have tried and failed to rebuild parity on the original parity disk, we don't know if it currently has valid parity, which is required for parity swap to succeed. We can assume the new parity disk is invalid since it never had parity built, at least not as far as the largest data disk. Even if you are absolutely sure nothing has been written to your server since you started all this, I would still not be confident that original parity is now valid, since you tried to rebuild it and had problems completing the rebuild.
10 hours ago, outofmysystem said: "I've set up remote syslog"
Have you had a crash since then? Beginning to wonder if you don't have some sort of hardware problem beyond just a problem disk.
Might be simpler to New Config without the problem disk and see if you can get parity built that way on the new parity disk, then see what you can copy from the problem disk by mounting it Unassigned.
If you can't successfully rebuild parity then you very likely wouldn't be able to do parity swap either, even if you had valid parity, since parity swap ultimately does a rebuild (of a data disk) after copying parity. Unless simply removing that problem disk somehow made all your problems go away.
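Why a rebuild depends on valid parity can be illustrated with a toy model in Python. This is a simplified sketch, assuming single parity computed as a byte-wise XOR across equally sized data disks; the tiny byte strings stand in for whole drives and the function names are hypothetical.

```python
from functools import reduce


def xor_parity(disks):
    """Byte-wise XOR across equally sized disks (toy model of single parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


def rebuild(missing_index, disks, parity):
    """Reconstruct one data disk from the remaining disks plus parity."""
    others = [d for i, d in enumerate(disks) if i != missing_index]
    return xor_parity(others + [parity])


if __name__ == "__main__":
    disks = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
    parity = xor_parity(disks)

    # With valid parity, a missing disk is reconstructed exactly.
    print("valid parity rebuild ok:", rebuild(1, disks, parity) == disks[1])

    # Simulate a write to another disk while parity was not being updated.
    original_disk1 = disks[1]
    disks[0] = b"\xff\x02"
    stale_rebuild = rebuild(1, disks, parity)

    # The stale parity now reconstructs different bytes for disk 1.
    print("stale parity rebuild ok:", stale_rebuild == original_disk1)
```

The same reasoning applies to a parity disk whose sync was interrupted partway: any region where parity no longer matches the data disks would reconstruct incorrectly, which is why an interrupted rebuild leaves the original parity untrustworthy.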
trurl Posted August 6, 2021
Another possibility would be to unassign parity, start the array, and then try to copy whatever you can from the problem disk. Then work on getting parity built without that disk.
outofmysystem Posted August 6, 2021
Thanks very much for all your help with this. I definitely wasn't running an extended SMART test and rebuilding parity at the same time. I figured out that I had the disks set to spin down, which was interrupting the SMART test, so I rebooted and I'm running it again. Hopefully it will complete overnight and I'll upload the results. It's possible there is an underlying hardware problem as this server is quite old, but maybe a completed test will shed some light. I haven't seen anything in the remote syslog during the multiple crashes I've had when trying to rebuild parity.
outofmysystem Posted August 6, 2021 (edited)
3 minutes ago, trurl said: "Another possibility would be to unassign parity, start the array, and then try to copy whatever you can from the problem disk. Then work on getting parity build without that disk."
So for this, would I just copy all of the files to another server, or should I copy them onto an existing drive in the array? I'm a bit confused by that, sorry.
Edited August 6, 2021 by outofmysystem
trurl Posted August 7, 2021
Do you know how to work with files directly on the server, including Unassigned Devices, instead of working with them over the network?
outofmysystem Posted August 7, 2021
1 hour ago, trurl said: "Do you know how to work with files directly on the server, including Unassigned Devices, instead of working with them over the network?"
Yeah, that's fine. The extended SMART test ended without errors.
tower-diagnostics-20210807-0321.zip
tower-smart-20210806-1955.zip
outofmysystem Posted August 7, 2021
So, I've changed out all of the SATA cables and reseated everything. All 6 HDDs are online with the parity drive showing as emulated. A Parity-Sync/Data-Rebuild operation is running at the moment but it'll likely fail like the rest. I've dug out an old server, but it only has 6 SATA ports, so I've got no way to add all 6 HDDs plus the SSD cache drive. I can order a PCI Express (PCIe) SATA III (6G) SSD Adapter but that will take a while to arrive. Annoyingly, I just found 2 x 3TB Reds in the old server that I forgot I had, so I could have done a straight swap for the dodgy disk.
trurl said: "Another possibility would be to unassign parity, start the array, and then try to copy whatever you can from the problem disk. Then work on getting parity build without that disk."
I think it's possible to get the data off the faulty disk 2. Would using unBalance to scatter the data from it be an option, or should I just bite the bullet and rsync all its contents to another server on my network?
trurl Posted August 7, 2021
unBALANCE would be one way to do that, but not sure how well it would do unless the issues you are having with building parity are resolved. I was thinking more of copying to an Unassigned Device, or possibly removing parity if it can't be rebuilt and trying to copy to another disk in the array.
Might be simpler and safer to put the disk in another computer that can read Linux filesystems to do the copy and so sidestep the issues you are currently having with your server.
outofmysystem Posted August 7, 2021 (edited)
Well, that last parity rebuild finished successfully. I guess I'll go and check a few things that live on Drive 2 and see if they look intact. This morning I took out all of the SATA cables and replaced them with ones I had, and actually slotted 4 of them into the Marvell SATA ports (I know!). So either one of the old SATA cables is dodgy or one of the SATA ports on the board is. This morning I also ordered an LSI SAS 9207-8i and cables, which will arrive next week and should take both the ports and the SATA cables out of the equation. Seeing as Disk 2 passed an extended SMART test, I'm wondering if the read errors were down to the ports/cables, but I don't know enough to make that call. I'm hoping I'm out of the woods now, but thank you SO much for your help and patience with this. Coming back to the thread over multiple days is very much appreciated.
Edited August 9, 2021 by outofmysystem