HBA Card died, replaced - All drives show but drive 6 disabled, SMART checks out, started reconstruction/data-rebuild process. Drive 1 started showing showing read errors - Paused rebuild, options?

Quilimox · November 17, 2020

Hello all I wonder if someone could advise me on the best way forward.

My Dell H200 (IT Mode) HBA card died a few weeks ago. Got a replacement and flashed it successfully. Plugged it in and booted into Unraid and all drives showed except drive 6 was disabled. Short and extended SMART came back clear. Checked the Unraid Wiki and opted to reconstruct the drive as it was the safest option.

Followed the instructions, unassigned drive, started array, stopped the array. The array wouldn't stop, log showed 'umount: /mnt/cache: target is busy' and multiple retries. Ran command lsof /mnt/cache and nothing showed. Ran command fuser -mv /mnt/cache and it showed 'USER: root - PID: kernel - ACCESS: mount - COMMAND: /mnt/cache'. Ran command losetup and found 'NAME: /dev/loop2' and 'BACK-FILE: /mnt/cache/system/docker/docker.img'.

Looked at the Syslog and I guessed Pulseway that got in the way (trying to access /mnt/cache prior to array running and trying to find /mnt/user whilst the array was trying to stop) or the docker service did because I didn't disable it before I started the array, which I started and stopped pretty quickly.
'pulseway: Could not find hdd with path: /mnt/cache'

'pulseway: Could not find hdd with path: /mnt/user

emhttpd: Unmounting disks...

emhttpd: shcmd (54122): umount /mnt/cache

umount: /mnt/cache: target is busy.

emhttpd: shcmd (54122): exit status: 32

emhttpd: Retry unmounting disk share(s)...'

So I ran command umount -1 /dev/loop2 and the array stopped.

With the array now stopped I continued following the instructions to reconstruct drive 6 re-assigned drive 6 and started the array and it started the reconstruct/data-rebuilt process.

Then all of a sudden drive 1 started spitting out read errors... It is an older drive but its never had issues before but who knows maybe the drive is bust. I paused the data-rebuild process and ran both the sort and extended SMART checks on the drive which both came back clean from what I can tell.

I suspect that the SATA cable for this drive (which goes into my new Dell H200 HBA) is causing issues, or the HBA card is the problem although if it was the HBA card I'd have thought I'd see more than one drive having problems.

I do have a spare SATA cable and port on the motherboard (its a 3gbps rated one hence its not in use) which I could plug this drive into to see if it is the SATA cable. From what I understand I should be able to power off the system (to plug in the spare SATA on the spare port) with the only caveat being I'd have to start the data-rebuild process again (which is fine I'm not gonna lose any time I paused it almost as soon as I saw the read errors).

Am I on the right track? Is swapping the SATA connection my best bet? Any advice would be much appreciated. Diagnostics attached.

dutchy-media-diagnostics-20201117-1718.zip

Edited November 17, 2020 by Quilimox

JorgeB · November 17, 2020

20 minutes ago, Quilimox said:

Is swapping the SATA connection my best bet?

It's the first thing to try.

Quilimox · November 17, 2020

Thanks for the reply. To confirm is it best practice to actually cancel the data-rebuild and then shutdown from the GUI? I suppose it doesn't really matter because its going to start the process from scratch anyhow. As I'm writing this it feels more and more like a pointless question but I want to ask anyway.

JorgeB · November 17, 2020

Just now, Quilimox said:

I suppose it doesn't really matter because its going to start the process from scratch anyhow.

This.

trurl · November 17, 2020

Starting from scratch seems like a better idea anyway since I wouldn't trust what it had already rebuilt with those errors.

Quilimox · November 17, 2020

19 minutes ago, trurl said:

Starting from scratch seems like a better idea anyway since I wouldn't trust what it had already rebuilt with those errors.

25 minutes ago, JorgeB said:

It's the first thing to try.

Thanks all. Stopped the rebuild, shut it down and swapped the SATA cable, plugged into 3gbps port. Powered it on, started array and rebuild process and so far so good no errors yet. Does appear speed is impacted which I suppose is to be expected due to that 3gbps port and/or is it compounded by the fact its a rebuild rather than just a check? My parity is 4TB and it usually takes 7-8h for a check to complete but its hovering between 12-16h now to complete.

JorgeB · November 17, 2020

15 minutes ago, Quilimox said:

oes appear speed is impacted which I suppose is to be expected due to that 3gbps port and/or is it compounded by the fact its a rebuild rather than just a check?

Unlikely to be that, SATA2 is enough for spinners, only SSDs (and very few of the current fastest disks) need more bandwidth than that.

Quilimox · November 17, 2020

1 hour ago, JorgeB said:

Unlikely to be that, SATA2 is enough for spinners, only SSDs (and very few of the current fastest disks) need more bandwidth than that.

I see. Looking at the syslog it does appear the Seagate 1TB (disk 1) was still having trouble:

Nov 17 18:08:16 Dutchy-Media kernel: ata4.00: status: { DRDY }
Nov 17 18:08:16 Dutchy-Media kernel: ata4: hard resetting link
Nov 17 18:08:21 Dutchy-Media kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 17 18:08:21 Dutchy-Media kernel: ata4.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Nov 17 18:08:21 Dutchy-Media kernel: ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Nov 17 18:08:21 Dutchy-Media kernel: ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Nov 17 18:08:21 Dutchy-Media kernel: ata4.00: ACPI cmd 00/00:00:00:00:00:a0 (NOP) rejected by device (Stat=0x51 Err=0x04)
Nov 17 18:08:21 Dutchy-Media kernel: ata4.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Nov 17 18:08:21 Dutchy-Media kernel: ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Nov 17 18:08:21 Dutchy-Media kernel: ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Nov 17 18:08:21 Dutchy-Media kernel: ata4.00: ACPI cmd 00/00:00:00:00:00:a0 (NOP) rejected by device (Stat=0x51 Err=0x04)
Nov 17 18:08:21 Dutchy-Media kernel: ata4.00: configured for UDMA/133
Nov 17 18:08:21 Dutchy-Media kernel: ata4: EH complete
Nov 17 18:08:23 Dutchy-Media kernel: ata4.00: exception Emask 0x10 SAct 0x4000000 SErr 0x4890000 action 0xe frozen
Nov 17 18:08:23 Dutchy-Media kernel: ata4.00: irq_stat 0x08400040, interface fatal error, connection status changed
Nov 17 18:08:23 Dutchy-Media kernel: ata4: SError: { PHYRdyChg 10B8B LinkSeq DevExch }
Nov 17 18:08:23 Dutchy-Media kernel: ata4.00: failed command: READ FPDMA QUEUED
Nov 17 18:08:23 Dutchy-Media kernel: ata4.00: cmd 60/c0:d0:b0:01:74/03:00:00:00:00/40 tag 26 ncq dma 491520 in
Nov 17 18:08:23 Dutchy-Media kernel:         res 40/00:d0:b0:01:74/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Nov 17 18:08:23 Dutchy-Media kernel: ata4.00: status: { DRDY }
Nov 17 18:08:23 Dutchy-Media kernel: ata4: hard resetting link
Nov 17 18:08:28 Dutchy-Media kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 17 18:08:28 Dutchy-Media kernel: ata4.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Nov 17 18:08:28 Dutchy-Media kernel: ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Nov 17 18:08:28 Dutchy-Media kernel: ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Nov 17 18:08:28 Dutchy-Media kernel: ata4.00: ACPI cmd 00/00:00:00:00:00:a0 (NOP) rejected by device (Stat=0x51 Err=0x04)
Nov 17 18:08:28 Dutchy-Media kernel: ata4.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Nov 17 18:08:28 Dutchy-Media kernel: ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Nov 17 18:08:28 Dutchy-Media kernel: ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Nov 17 18:08:28 Dutchy-Media kernel: ata4.00: ACPI cmd 00/00:00:00:00:00:a0 (NOP) rejected by device (Stat=0x51 Err=0x04)
Nov 17 18:08:28 Dutchy-Media kernel: ata4.00: configured for UDMA/133
Nov 17 18:08:28 Dutchy-Media kernel: ata4: EH complete
Nov 17 18:08:31 Dutchy-Media kernel: ata4.00: limiting speed to UDMA/100:PIO4
Nov 17 18:08:31 Dutchy-Media kernel: ata4.00: exception Emask 0x10 SAct 0x60 SErr 0x4890000 action 0xe frozen
Nov 17 18:08:31 Dutchy-Media kernel: ata4.00: irq_stat 0x08400040, interface fatal error, connection status changed
Nov 17 18:08:31 Dutchy-Media kernel: ata4: SError: { PHYRdyChg 10B8B LinkSeq DevExch }
Nov 17 18:08:31 Dutchy-Media kernel: ata4.00: failed command: READ FPDMA QUEUED
Nov 17 18:08:31 Dutchy-Media kernel: ata4.00: cmd 60/08:28:38:70:7f/04:00:00:00:00/40 tag 5 ncq dma 528384 in
Nov 17 18:08:31 Dutchy-Media kernel:         res 40/00:30:40:74:7f/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Nov 17 18:08:31 Dutchy-Media kernel: ata4.00: status: { DRDY }

Whatever the issue is has stopped now though. Still Googling but in case you don't mind having a look I've attached diagnostics.

dutchy-media-diagnostics-20201117-1949.zip

Quilimox · November 17, 2020

Right well from what I can tell online it appears to be an issue with the SATA cable again. Hell of a coincidence if its another dodgy cable..

Notably since the issue stopped occurring the estimated rebuild speed has increased and remains steady at around 120-130mbps which is nominal and the time to finish gone down to about 6h so I'm leaning towards a temp badly seated connection rather than a bad cable.

Either way I'll swap it out for another cable once the rebuild is finished then run another parity check. If that goes to plan and no issues pop up then fair enough but if the same issues occur is it worth just chucking the drive and putting in a new one do you think?

Edited November 17, 2020 by Quilimox

Frank1940 · November 17, 2020

Any time you are working around SATA cables, double check that every cable (both data and power) is firmly seated before you power things back up. (The SATA connector design is a poster child for how not to design a connector system!!! I personally avoid dressing SATA cables into 'neatest' bundles using cable ties because of potential connector issues when these bundles are subjected to even the slightest movement.)

itimpi · November 17, 2020

24 minutes ago, Quilimox said:

Right well from what I can tell online it appears to be an issue with the SATA cable again. Hell of a coincidence if its another dodgy cable..

Very often it is not the cable that is faulty, but just the plug at one of the ends not making good contact. The SATA cable design means it is very easy for cables to not be perfectly seated, and even if they are to work slightly lose under normal drive vibration.

Quilimox · November 27, 2020

On 11/17/2020 at 9:17 PM, Frank1940 said:

Any time you are working around SATA cables, double check that every cable (both data and power) is firmly seated before you power things back up. (The SATA connector design is a poster child for how not to design a connector system!!! I personally avoid dressing SATA cables into 'neatest' bundles using cable ties because of potential connector issues when these bundles are subjected to even the slightest movement.)

On 11/17/2020 at 9:22 PM, itimpi said:

Very often it is not the cable that is faulty, but just the plug at one of the ends not making good contact. The SATA cable design means it is very easy for cables to not be perfectly seated, and even if they are to work slightly lose under normal drive vibration.

Didn't realise SATA was so finicky, must have just been really lucky up to this point having never had issues with it.

Sorry for the delay in updating this. In the end the rebuild completed. I did a parity check which corrected 1 error. Since then I replaced the SATA cable and touch wood its been fine. No further errors and its all working as expected.

Thank you all for your input. Much appreciated.

Quilimox · December 23, 2020

Back again.. Unsure if I should be making another thread, but I think context in this case is important.

Server had been running fine up until a few weeks ago. It was doing a parity check and all of a sudden all drives plugged into my Dell H200 HBA disappeared. I couldn't believe I'd have a second HBA die on me in such short succession. I pulled it out and upon careful examination I found a sticky residue on the contact points connecting the 085B chip to the PCB.
It's vape liquid. I do vape and have done for years but its never caused issues with the machines I run in the room. However upon second inspection of my previous Dell H200 HBA card, this does appear to have caused that one to fail too. I never noticed it as the residue is clear and its only really apparent upon touch that its there.
It appears to have collected on the CPU heatsink and from there dripped onto the HBA cards. I could have kicked myself that I hadn't noticed any of this before.

Having now stopped vaping in the office, and am contemplating my life's choices... I cleaned everything with isopropanol and although it booted fine it didn't sit well with me and I didn't really trust any of the existing components anymore so I thought it best to replace the lot bar the HDD/SSDs. Got myself a new motherboard and CPU combo, new HBA this time a Dell H310. Cleaned out the entire case and everything left in it fans/filters/drives. Got it booted and disk 6 showed disabled but everything else was present and correct.

Started a data rebuild on disk 6 like before. Disk 1 reared its ugly head and showed errors. I plugged in a different SATA connector going to the HBA and I didn't use a new cable to a port on the motherboard like I had done previously . Disk showed up fine upon booting, attempted rebuild again and it again showed errors and this time Unraid disabled the disk.

I've now done what I'd done before which was replace the SATA connection from the HBA with a new SATA cable to a port on the motherboard. I am now booted into Unraid with disk 1 disabled and disk 6 needing a rebuild. I only have 1 parity, which will not be the case for long once I've got the array sorted out I assure you. I am fairly confident disk 1 is actually fine but just gets a bad SATA connection when connected using the cables from my HBA.

Would the best thing to do be to use the the 'Trust My Array' procedure, to re-enable disk 1. Then perform the rebuild of disk 6?

If so would the process be:
-Tools -> New Config -> Retain current configuration: All -> Apply
-if needed assign any missing disk(s), all disks should be assigned, including the disable disk 1 and new disk 6
-check both "parity is already valid" and "maintenance mode" before starting the array
-start the array
-stop array, unassign disk 6
-start array, check emulated disk 6 mounts and contents look correct
-stop array, reassign disk 6
-start array to begin rebuild

Based on what I found this thread which closely seems to resemble my situation:

Because within that thread JorgeB you mentioned:

On 10/31/2017 at 11:52 AM, JorgeB said:

Lots of Marvell controllers, there are ATA errors on disks 7 and 16, just a couple, but array is not even started, so not very encouraging...

As to re-enable the disable disk, and assuming array data is static since the beginning of the rebuild, you can try this:

-Tools -> New Config -> Retain current configuration: All -> Apply
-if needed assign any missing disk(s), all disks should be assigned, including the disable disk3 and new disk9
-check both "parity is already valid" and "maintenance mode" before starting the array
-start the array
-stop array, unassign disk9
-start array, check emulated disk9 mounts and contents look correct
-stop array, reassign disk9
-start array to begin rebuild

Hope there are no more errors during the rebuild and get rid of those Marvell controllers ASAP.

Thank you in advance for any assistance

Edited December 23, 2020 by Quilimox

JorgeB · December 23, 2020

Just now, Quilimox said:

If so would the process be:
-Tools -> New Config -> Retain current configuration: All -> Apply
-if needed assign any missing disk(s), all disks should be assigned, including the disable disk 1 and new disk 6
-check both "parity is already valid" and "maintenance mode" before starting the array
-start the array
-stop array, unassign disk 6
-start array, check emulated disk 6 mounts and contents look correct
-stop array, reassign disk 6
-start array to begin rebuild

That should do it.

Quilimox · December 23, 2020

2 hours ago, JorgeB said:

That should do it.

Thank you for your response.

So I got to this part:
-Tools -> New Config -> Retain current configuration: All -> Apply
-if needed assign any missing disk(s), all disks should be assigned, including the disable disk 1 and new disk 6
-check both "parity is already valid" and "maintenance mode" before starting the array
-start the array
-stop array, unassign disk 6
-start array, check emulated disk 6 mounts and contents look correct

Emulated disk 6 did not mount. Which appears to be the same issue Wavey had in his thread. I've now stopped the array again, I can assign disk 6 and it shows as a new drive and it states next to the array start button it'll start the data rebuild, which is what it was doing on disk 6, prior to disk 1 throwing a load of errors and disabling itself.

Unlike Wavey however I have not started the array for it to start the data rebuild as I noticed your recommendation would be to run the xfs_repair command first as there is no point starting a rebuild on disk 6 if the file system on it is corrupt. Which I guess means you suspect that disk 6 showed as unmountable because it had a corrupt file system.

However I am slightly confused why having unassigned disk 6, it would then try to mount it, for it to then know that it is unmountable?
As I am currently able to assign disk 6, and it stating its a new drive and a its asking for a rebuild upon array start. Does that mean Unraid no longer deems disk 6 unmountable?

Maybe I'm getting the wrong end of the stick here I'm just trying to figure out the logic behind Unraid deciding that disk 6 in unmountable when I had nothing assigned as disk 6 to mount upon array start.

Should I run the xfs_repair command any way for good measure? Wavey mentions it recommended he run the command with parameter -L which looks to enable 'Force Log Zeroing' which according to https://linux.die.net/man/8/xfs_repair "Forces xfs_repair to zero the log even if it is dirty (contains metadata changes). When using this option the filesystem will likely appear to be corrupt, and can cause the loss of user files and/or data." . I guess this is fine as the disk needs to be rebuilt any way right?

JorgeB · December 23, 2020

2 minutes ago, Quilimox said:

Emulated disk 6 did not mount.

That means parity wasn't 100% valid, please post diags.

Quilimox · December 23, 2020

4 minutes ago, JorgeB said:

That means parity wasn't 100% valid, please post diags.

It just clicked. It's trying to load the contents of disk 6 from parity to emulate it, and it couldn't.

Diags attached. Save you scrolling through the log see 16:11:59 onwards.

Noted that it does state the following in regards to running xfs_repair
Dec 23 16:16:22 Dutchy-Media kernel: XFS (md6): Mounting V5 Filesystem
Dec 23 16:16:22 Dutchy-Media kernel: XFS (md6): Corruption warning: Metadata has LSN (1:9207) ahead of current LSN (1:9180). Please unmount and run xfs_repair (>= v4.3) to resolve.
Dec 23 16:16:22 Dutchy-Media kernel: XFS (md6): log mount/recovery failed: error -22
Dec 23 16:16:22 Dutchy-Media kernel: XFS (md6): log mount failed
Dec 23 16:16:22 Dutchy-Media root: mount: /mnt/disk6: wrong fs type, bad option, bad superblock on /dev/md6, missing codepage or helper program, or other error.
Dec 23 16:16:22 Dutchy-Media emhttpd: shcmd (47222): exit status: 32

dutchy-media-diagnostics-20201223-1650.zip

JorgeB · December 23, 2020

Check filesystem on disk6:

https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui

Quilimox · December 23, 2020

8 minutes ago, JorgeB said:

Check filesystem on disk6:

https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui

Started array in maintenance mode and there is no 'Check Filesystem Status' section with the option on disk 6 when I go into it.

It has "File system type:" set to "auto" though. Should I stop the array, set that to xfs (to confirm all my drives are meant to be xfs) then start the array again in maintenance mode and see if the check file system option shows under disk 6 options?

Or I can SSH in and run the commands through that?

I'm guessing because Unraid can't determine the file system is why the option doesn't show.

JorgeB · December 23, 2020

Yes, you manually need to select the fs to use the GUI ( or run xfs_repair in the console).

Quilimox · December 23, 2020

Thanks.

Results:

Phase 1 - find and verify superblock...
        - block cache size set to 344000 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 9180 tail block 9180
        - scan filesystem freespace and inode maps...
sb_fdblocks 890108489, counted 885448772
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
Maximum metadata LSN (1:23603) is ahead of log (1:9180).
Would format log to cycle 4.
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Wed Dec 23 18:02:07 2020

Phase		Start		End		Duration
Phase 1:	12/23 18:02:02	12/23 18:02:02
Phase 2:	12/23 18:02:02	12/23 18:02:02
Phase 3:	12/23 18:02:02	12/23 18:02:07	5 seconds
Phase 4:	12/23 18:02:07	12/23 18:02:07
Phase 5:	Skipped
Phase 6:	12/23 18:02:07	12/23 18:02:07
Phase 7:	12/23 18:02:07	12/23 18:02:07

Total run time: 5 seconds

Doesn't mean much to me but I can see it would try a repair it it wasn't ran with -n parameter so shall I now run it with just -v so it does the repairs?

Edited December 23, 2020 by Quilimox

JorgeB · December 23, 2020

7 minutes ago, Quilimox said:

so shall I now run it with just -v so it does the repairs?

Yes, and if it asks for it use -L.

Quilimox · December 23, 2020

Done. Didn't ask for -L . Looks like it was a success?

Phase 1 - find and verify superblock...
        - block cache size set to 344000 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 9180 tail block 9180
        - scan filesystem freespace and inode maps...
sb_fdblocks 890108489, counted 885448772
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 1
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (1:23603) is ahead of log (1:9180).
Format log to cycle 4.

        XFS_REPAIR Summary    Wed Dec 23 18:12:46 2020

Phase		Start		End		Duration
Phase 1:	12/23 18:11:53	12/23 18:11:53
Phase 2:	12/23 18:11:53	12/23 18:11:53
Phase 3:	12/23 18:11:53	12/23 18:11:53
Phase 4:	12/23 18:11:53	12/23 18:11:53
Phase 5:	12/23 18:11:53	12/23 18:11:53
Phase 6:	12/23 18:11:53	12/23 18:11:53
Phase 7:	12/23 18:11:53	12/23 18:11:53

Total run time:
done

What now. Do I unassign disk 6 again, start the array and see if it gets emulated?

If it gets emulated I guess I then stop the array, assign disk 6 and start it and rebuild?

JorgeB · December 23, 2020

47 minutes ago, Quilimox said:

Looks like it was a success?

Yes, start the array and it should mount.

Quilimox · December 24, 2020

Ok so it mounted successfully and showing as emulated. Started the array and it began the rebuild process, but at awful speeds. I can hear disk 1 spinning up and the head moving across the platters, then spinning back down before repeating.

Checked the log and found this repeating over and over:

ata3.00: exception Emask 0x10 SAct 0x1c000 SErr 0x4090000 action 0xe frozen
ata3.00: irq_stat 0x00400040, connection status changed
ata3: SError: { PHYRdyChg 10B8B DevExch }
Dec 23 23:59:43 Dutchy-Media kernel: ata3.00: failed command: READ FPDMA QUEUED
Dec 23 23:59:43 Dutchy-Media kernel: ata3.00: cmd 60/00:70:00:a5:00/01:00:01:00:00/40 tag 14 ncq dma 131072 in
Dec 23 23:59:43 Dutchy-Media kernel:         res 40/00:80:00:a7:00/00:00:01:00:00/40 Emask 0x10 (ATA bus error)
Dec 23 23:59:43 Dutchy-Media kernel: ata3.00: status: { DRDY }
Dec 23 23:59:43 Dutchy-Media kernel: ata3.00: failed command: READ FPDMA QUEUED
Dec 23 23:59:43 Dutchy-Media kernel: ata3.00: cmd 60/00:78:00:a6:00/01:00:01:00:00/40 tag 15 ncq dma 131072 in
Dec 23 23:59:43 Dutchy-Media kernel:         res 40/00:80:00:a7:00/00:00:01:00:00/40 Emask 0x10 (ATA bus error)
Dec 23 23:59:43 Dutchy-Media kernel: ata3.00: status: { DRDY }
Dec 23 23:59:43 Dutchy-Media kernel: ata3.00: failed command: READ FPDMA QUEUED
Dec 23 23:59:43 Dutchy-Media kernel: ata3.00: cmd 60/00:80:00:a7:00/01:00:01:00:00/40 tag 16 ncq dma 131072 in
Dec 23 23:59:43 Dutchy-Media kernel:         res 40/00:80:00:a7:00/00:00:01:00:00/40 Emask 0x10 (ATA bus error)
Dec 23 23:59:43 Dutchy-Media kernel: ata3.00: status: { DRDY }
Dec 23 23:59:43 Dutchy-Media kernel: ata3: hard resetting link
Dec 23 23:59:48 Dutchy-Media kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec 23 23:59:48 Dutchy-Media kernel: ata3.00: configured for UDMA/33
Dec 23 23:59:48 Dutchy-Media kernel: ata3: EH complete

So I stopped and shutdown. Swapped the SATA cable in it to one that was plugged into another drive which appeared to be working fine. Booted back up and I'm getting the same messages but now for 'ata4.00' as that's where disk 1 is now plugged in:

Dec 24 00:44:04 Dutchy-Media kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 24 00:44:04 Dutchy-Media kernel: ata4.00: configured for UDMA/133
Dec 24 00:44:04 Dutchy-Media kernel: ata4: EH complete
Dec 24 00:44:04 Dutchy-Media kernel: ata4.00: exception Emask 0x50 SAct 0x7fe000 SErr 0x4090800 action 0xe frozen
Dec 24 00:44:04 Dutchy-Media kernel: ata4.00: irq_stat 0x00400040, connection status changed
Dec 24 00:44:04 Dutchy-Media kernel: ata4: SError: { HostInt PHYRdyChg 10B8B DevExch }
Dec 24 00:44:04 Dutchy-Media kernel: ata4.00: failed command: READ FPDMA QUEUED
Dec 24 00:44:04 Dutchy-Media kernel: ata4.00: cmd 60/40:68:c8:76:00/01:00:00:00:00/40 tag 13 ncq dma 163840 in
Dec 24 00:44:04 Dutchy-Media kernel:         res 40/00:b0:90:91:00/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Dec 24 00:44:04 Dutchy-Media kernel: ata4.00: status: { DRDY }
Dec 24 00:44:04 Dutchy-Media kernel: ata4.00: failed command: READ FPDMA QUEUED
Dec 24 00:44:04 Dutchy-Media kernel: ata4.00: cmd 60/00:70:08:78:00/04:00:00:00:00/40 tag 14 ncq dma 524288 in
Dec 24 00:44:04 Dutchy-Media kernel:         res 40/00:b0:90:91:00/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Dec 24 00:44:04 Dutchy-Media kernel: ata4.00: status: { DRDY }
Dec 24 00:44:04 Dutchy-Media kernel: ata4.00: failed command: READ FPDMA QUEUED
Dec 24 00:44:04 Dutchy-Media kernel: ata4.00: cmd 60/40:78:08:7c:00/01:00:00:00:00/40 tag 15 ncq dma 163840 in
Dec 24 00:44:04 Dutchy-Media kernel:         res 40/00:b0:90:91:00/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Dec 24 00:44:04 Dutchy-Media kernel: ata4.00: status: { DRDY }

So it seems to me its not a bad SATA connection, cable or controller. Disk 1 is toast. Problem of-course is I only have parity of 1 and disk 6 is currently being emulated.

I don't have a backup of the array data as it was only non-personal media being stored on there and I wasn't too bothered. However after the first HBA died and after I got things up and running again I did back up some personal files from an external USB HDD that was dying and needed to be RMA'd and I backed that up to the server as a quick and easy method, with the idea being I'd then restore that data to the returned RMA'd drive. However the 2nd HBA died before the RMA'd drive arrived so I would like to try and recover the data on disk 1 and 6 if that is possible.

Physically recovering the data from disk 1 ain't going to work I don't think. It would take a long, long time because when it spins up it goes between 8, sometimes 3mbps and constantly dropping to 0 when it spins down and even if I did wait however long it would take I'm not sure it constantly going through that motion would not make matters worse.

Is there a way to grab all the emulated disk 6 data. Then have parity emulate the data that was stored on disk 1 rather than dis so I can then back that up? Essentially having my 1 parity drive, emulate 1 disk's data a time so I can back it up? I feel like there should be a way but I'm not entirely sure how to go about it.

Edit: I just remembered parity doesn't actually hold any data and it uses the data drives to emulate 1 disk per parity if it's missing so I guess if I was to copy emulated data I could still be hampered by the low speeds I'm seeing on disk 1..

Starting to think this is a lost cause now

Edit2: Something I hadn't yet considered is power. My PSU has a single 12v rail rated at 38 amps. I have got a new CPU (used to be an i3-3240 and now its an i5-6500) which pulls a few more watts than the previous one did. I also swapped out two fans when I put the new CPU, Mobo and RAM in. Maybe I'm overloading my 12v rail, but tbh I doubt I'm even close to drawing 38 amps even with the new CPU/mobo/RAM my 8 HDDs, 3 SSDs, 2 140mm fans and 4 120mm fans and LCD fan controller.

I'm going to remove the two fans I replaced and remove an unassigned SSD drive that was being used by Lancache. See if that helps. If not, considering the vape liquid causing all of this to begin with. Maybe its worth getting a new power supply, because I didn't even think of replacing that to be fair.

Edited December 24, 2020 by Quilimox

HBA Card died, replaced - All drives show but drive 6 disabled, SMART checks out, started reconstruction/data-rebuild process. Drive 1 started showing showing read errors - Paused rebuild, options?

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation