
Disk & Parity Issues - Replace Disk?



Posted

Hey, if anyone is willing to chip in here, I'm looking for some opinions as to what to do next.

 

I'm pretty sure this started after a couple of unclean shutdowns (we have terrible power, and I've had both a UPS and an inverter die). Since then a single disk has been constantly throwing errors, and I haven't been able to complete a parity sync for ages: first because it's incredibly slow, and second because I was previously concerned about ruining the parity (that concern has gone out the window now).

 

So basically I was getting this in my logs when parity would run:

 

Apr 11 08:25:04 Tower kernel: mdcmd (37): nocheck cancel
Apr 11 08:25:05 Tower sSMTP[15964]: Creating SSL connection to host
Apr 11 08:25:06 Tower sSMTP[15964]: SSL connection using TLS_AES_256_GCM_SHA384
Apr 11 08:25:10 Tower sSMTP[15964]: Sent mail for [email protected] (221 2.0.0 closing connection d16-20020adff2d0000000b003418364032asm968558wrp.112 - gsmtp) uid=0 username=root outbytes=775
Apr 11 08:25:26 Tower kernel: ata9.00: failed to read SCR 1 (Emask=0x40)
Apr 11 08:25:26 Tower kernel: ata9.01: failed to read SCR 1 (Emask=0x40)
Apr 11 08:25:26 Tower kernel: ata9.02: failed to read SCR 1 (Emask=0x40)
Apr 11 08:25:26 Tower kernel: ata9.02: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
Apr 11 08:25:26 Tower kernel: ata9.02: failed command: READ DMA
Apr 11 08:25:26 Tower kernel: ata9.02: cmd c8/00:00:18:4a:00/00:00:00:00:00/e0 tag 17 dma 131072 in
Apr 11 08:25:26 Tower kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 11 08:25:26 Tower kernel: ata9.02: status: { DRDY }
Apr 11 08:25:27 Tower kernel: ata9.15: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 11 08:25:27 Tower kernel: ata9.01: limiting SATA link speed to 1.5 Gbps
Apr 11 08:25:28 Tower kernel: ata9.00: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 11 08:25:28 Tower kernel: ata9.01: SATA link down (SStatus 0 SControl 310)
Apr 11 08:25:28 Tower kernel: ata9.02: hard resetting link
Apr 11 08:25:28 Tower kernel: ata9.02: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 11 08:25:28 Tower kernel: ata9.00: configured for UDMA/133
Apr 11 08:25:29 Tower kernel: ata9.02: configured for UDMA/133
Apr 11 08:25:29 Tower kernel: ata9.02: device reported invalid CHS sector 0
Apr 11 08:25:29 Tower kernel: ata9: EH complete
Apr 11 08:25:29 Tower kernel: md: recovery thread: exit status: -4
Apr 11 08:25:37 Tower kernel: ata9.00: failed to read SCR 1 (Emask=0x40)
Apr 11 08:25:37 Tower kernel: ata9.01: failed to read SCR 1 (Emask=0x40)
Apr 11 08:25:37 Tower kernel: ata9.02: failed to read SCR 1 (Emask=0x40)
Apr 11 08:25:37 Tower kernel: ata9.02: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
Apr 11 08:25:37 Tower kernel: ata9.02: failed command: READ DMA
Apr 11 08:25:37 Tower kernel: ata9.02: cmd c8/00:20:40:5f:6c/00:00:00:00:00/e9 tag 20 dma 16384 in
Apr 11 08:25:37 Tower kernel:         res 50/00:00:37:5e:6c/00:00:00:00:00/ed Emask 0x4 (timeout)
Apr 11 08:25:37 Tower kernel: ata9.02: status: { DRDY }
Apr 11 08:25:38 Tower kernel: ata9.15: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 11 08:25:38 Tower kernel: ata9.01: limiting SATA link speed to 1.5 Gbps
Apr 11 08:25:38 Tower flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
Apr 11 08:25:38 Tower kernel: ata9.00: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 11 08:25:39 Tower kernel: ata9.01: SATA link down (SStatus 0 SControl 310)
Apr 11 08:25:39 Tower kernel: ata9.02: hard resetting link
Apr 11 08:25:39 Tower kernel: ata9.02: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 11 08:25:39 Tower kernel: ata9.00: configured for UDMA/133
Apr 11 08:25:39 Tower kernel: ata9.02: configured for UDMA/133
Apr 11 08:25:39 Tower kernel: ata9: EH complete
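
For anyone wanting to tally which ATA device the errors are coming from, here is a rough Python sketch that counts "failed command" and "hard resetting link" lines per device. The function name `summarize_ata_events` is just something made up for illustration, and the regex assumes the syslog line format shown above.

```python
import re
from collections import Counter

# Matches kernel lines such as "... kernel: ata9.02: failed command: READ DMA".
# Assumes the syslog format shown in the excerpt above.
ATA_LINE = re.compile(r"kernel: (ata\d+(?:\.\d+)?): (.*)$")


def summarize_ata_events(lines):
    """Count failed commands and hard link resets per ATA device.

    Hypothetical helper: returns two Counters keyed by device name
    (for example "ata9.02").
    """
    failed = Counter()
    resets = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue  # not an ATA kernel message
        device, message = match.groups()
        if message.startswith("failed command:"):
            failed[device] += 1
        elif message.startswith("hard resetting link"):
            resets[device] += 1
    return failed, resets
```

Feeding it the lines of the syslog (for example `summarize_ata_events(open(path))`, where `path` is wherever your syslog lives) gives a quick per-device count, which makes it easier to see whether one port or the whole controller is misbehaving.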

 

This led me down the rabbit hole. First a few quick SMART tests - all fine. Then extended SMART tests - also fine. At one point I had to run xfs_repair to fix an issue where the drive wasn't mounting.

 

I have swapped the SATA cable, first for a previously used one and then for a brand new one. I also moved the drive to a SATA port on a PCIe-SATA adapter to test whether the motherboard port was at fault (which I thought was unlikely, since the other 3 don't have issues).

 

So now I'm pretty convinced there is actually an issue with the drive. All the data (as far as I can tell) seems to be intact, and I have cloud backups of the entire NAS stored per disk (i.e. /mnt/disk1 rsyncs to cloud/disk1, and so on for each of the 3 data disks in my array). When writing to or reading from the drive I don't see any performance issues.
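
To make the per-disk layout concrete, this is roughly how the restore direction could be scripted. It is only a sketch: `restore_commands` and the `backup_root` argument are hypothetical names, and it assumes the cloud copy is reachable as an ordinary path that rsync can read. It only builds the command lines and defaults to rsync's `--dry-run`, so nothing gets written.

```python
def restore_commands(disk_numbers, backup_root, dry_run=True):
    """Build one rsync argv per data disk: backup_root/diskN/ -> /mnt/diskN/.

    Hypothetical helper; backup_root is a placeholder for wherever the
    cloud copy is reachable as a path.
    """
    commands = []
    for n in disk_numbers:
        cmd = ["rsync", "-a"]  # archive mode: recurse and preserve attributes
        if dry_run:
            cmd.append("--dry-run")  # show what would be copied, write nothing
        # Trailing slashes copy the directory contents, not the directory itself.
        cmd += [f"{backup_root}/disk{n}/", f"/mnt/disk{n}/"]
        commands.append(cmd)
    return commands
```

Each returned list can be printed for review or handed to `subprocess.run` once the dry run output looks sensible.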

EDIT: As I'm posting this it seems the drive is finally dead. It is unmountable and I can't get anything to work. I'm going to have to replace it. So any advice as to how to do that safely is appreciated.

 

So I have a few actual questions. Is there something I haven't considered or may have overlooked? I believe I need to replace the disk. If I do, should I let it sync with parity first and then restore the data from the cloud backups, or should I prevent parity from running, restore from backups, and then sync? Would it be silly to try to copy this data straight to the new disk (considering there may well be data corruption issues)? Any other ideas/input would be greatly appreciated!

 

I have attached diagnostics just in case, but I have rebooted a few times while changing cables etc., so I'm not sure what you might get from them.

Thanks a million, appreciate you for reading through this!

 

tower-diagnostics-20240411-0931.zip

Posted

The errors posted are from a controller with a SATA port multiplier. Those are not recommended, but the array disks are not using it. Disk2 was still mounting in the diags posted; if it no longer is, post new ones.
