Parity Errors, but no unclean shutdown


emmcee


I had a scheduled monthly parity check at the weekend and it reported 4536 errors. I'm running a UPS and haven't had any outages (uptime is 69 days). The main page is showing zero errors on all of my drives.

 

The syslog is showing these errors from when the check was running:

 

Nov  3 00:31:03 Tower kernel: ata1: SError: { PHYRdyChg 10B8B DevExch }
Nov  3 00:31:03 Tower kernel: ata1.00: failed command: WRITE DMA EXT
Nov  3 00:31:03 Tower kernel: ata1.00: cmd 35/00:40:18:7b:1e/00:05:b0:02:00/e0 tag 12 dma 688128 out
Nov  3 00:31:03 Tower kernel:         res 50/00:00:c7:88:1c/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Nov  3 00:31:03 Tower kernel: ata1.00: status: { DRDY }
Nov  3 00:31:03 Tower kernel: ata1: hard resetting link
Nov  3 00:31:04 Tower sSMTP[20039]: Sent mail for [email protected] (221 2.0.0 closing connection p18sm17796971wmi.42 - gsmtp) uid=0 username=xxx outbytes=684
Nov  3 00:31:09 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Nov  3 00:31:09 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
Nov  3 00:31:09 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Nov  3 00:31:13 Tower kernel: ata3: COMRESET failed (errno=-16)
Nov  3 00:31:13 Tower kernel: ata3: hard resetting link
Nov  3 00:31:13 Tower kernel: ata1: COMRESET failed (errno=-16)
Nov  3 00:31:13 Tower kernel: ata1: hard resetting link
Nov  3 00:31:13 Tower kernel: ata4: COMRESET failed (errno=-16)
Nov  3 00:31:13 Tower kernel: ata4: hard resetting link
Nov  3 00:31:13 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov  3 00:31:13 Tower kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov  3 00:31:13 Tower kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov  3 00:31:13 Tower kernel: ata3.00: configured for UDMA/133
Nov  3 00:31:13 Tower kernel: ata3: EH complete
Nov  3 00:31:14 Tower kernel: ata1.00: configured for UDMA/133
Nov  3 00:31:14 Tower kernel: ata1: EH complete
Nov  3 00:31:14 Tower kernel: ata4.00: configured for UDMA/133
Nov  3 00:31:14 Tower kernel: ata4: EH complete
Nov  3 00:33:13 Tower kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen
Nov  3 00:33:13 Tower kernel: ata1.00: irq_stat 0x00400040, connection status changed
Nov  3 00:33:13 Tower kernel: ata1: SError: { PHYRdyChg 10B8B DevExch }
Nov  3 00:33:13 Tower kernel: ata1.00: failed command: READ DMA
Nov  3 00:33:13 Tower kernel: ata1.00: cmd c8/00:f0:58:07:53/00:00:00:00:00/e0 tag 27 dma 122880 in
Nov  3 00:33:13 Tower kernel:         res 50/00:00:57:07:53/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Nov  3 00:33:13 Tower kernel: ata1.00: status: { DRDY }
Nov  3 00:33:13 Tower kernel: ata1: hard resetting link
Nov  3 00:33:13 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
Nov  3 00:33:13 Tower kernel: ata4.00: irq_stat 0x00400040, connection status changed
Nov  3 00:33:13 Tower kernel: ata4: SError: { HostInt PHYRdyChg 10B8B DevExch }
Nov  3 00:33:13 Tower kernel: ata4.00: failed command: READ DMA EXT
Nov  3 00:33:13 Tower kernel: ata4.00: cmd 25/00:40:18:c8:94/00:05:b0:02:00/e0 tag 13 dma 688128 in
Nov  3 00:33:13 Tower kernel:         res 50/00:00:5f:04:53/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Nov  3 00:33:13 Tower kernel: ata4.00: status: { DRDY }
Nov  3 00:33:13 Tower kernel: ata4: hard resetting link
Nov  3 00:33:19 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Nov  3 00:33:19 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
Nov  3 00:33:23 Tower kernel: ata4: COMRESET failed (errno=-16)
Nov  3 00:33:23 Tower kernel: ata4: hard resetting link
Nov  3 00:33:23 Tower kernel: ata1: COMRESET failed (errno=-16)
Nov  3 00:33:23 Tower kernel: ata1: hard resetting link
Nov  3 00:33:24 Tower kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov  3 00:33:24 Tower kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov  3 00:33:24 Tower kernel: ata1.00: configured for UDMA/133
Nov  3 00:33:24 Tower kernel: ata1: EH complete
Nov  3 00:33:24 Tower kernel: ata4.00: configured for UDMA/133
Nov  3 00:33:24 Tower kernel: ata4: EH complete
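
For anyone triaging a log like this, a quick tally of link resets and failed commands per ATA port can show whether the trouble is confined to one port or spread across several (which would point more toward a shared cable, power, or controller issue than a single disk). The following is a minimal Python sketch, not an Unraid tool; the function name `tally_ata_events` and the regular expression are assumptions for illustration, and the sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

# Matches kernel lines such as "... kernel: ata1: hard resetting link"
# or "... kernel: ata4.00: failed command: READ DMA EXT".
# The optional ".00" device suffix is dropped so counts are grouped per port.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def tally_ata_events(lines):
    """Count hard link resets and failed commands per ATA port."""
    resets = Counter()
    failures = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if match is None:
            continue  # not an ATA port message (e.g. the "res ..." continuation line)
        port, message = match.groups()
        if message.startswith("hard resetting link"):
            resets[port] += 1
        elif message.startswith("failed command:"):
            failures[port] += 1
    return resets, failures


# A few lines taken from the syslog excerpt above.
sample = [
    "Nov  3 00:31:03 Tower kernel: ata1.00: failed command: WRITE DMA EXT",
    "Nov  3 00:31:03 Tower kernel: ata1: hard resetting link",
    "Nov  3 00:31:13 Tower kernel: ata3: hard resetting link",
    "Nov  3 00:33:13 Tower kernel: ata4.00: failed command: READ DMA EXT",
    "Nov  3 00:33:13 Tower kernel: ata4: hard resetting link",
]

resets, failures = tally_ata_events(sample)
for port in sorted(set(resets) | set(failures)):
    print(port, "resets:", resets[port], "failed commands:", failures[port])
```

To run it against a full log, pass the open file instead of the sample list, e.g. `tally_ata_events(open("syslog.txt"))`, where `syslog.txt` is a hypothetical path to the saved log.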

 

The machine is still running. Any suggestions as to how I should proceed?

 

47 minutes ago, johnnie.black said:

You can. The errors logged were only on parity2, which is one of the disks with ATA errors, so there's a chance it's related to that. Even if it's not, those errors need to be dealt with.

Agreed. I'll shut down later and check my cables and card. I'm already looking at new hardware - I've been looking for an excuse for a while.


Cables and card look OK to me. I'm not sure how to proceed though.

 

Parity2 is connected to the motherboard, but there are 6 drives in total on the motherboard and 2 on a PCI card. Is it worth swapping parity2 over to a spare port on the card?

 

I have a spare 8TB drive I can swap in for parity 2, but there seems to be an error on ata4 also.

 

Looking at the disk attributes, there are UDMA CRC errors on disk1, disk2, disk3.
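
The CRC count is SMART attribute 199 (`UDMA_CRC_Error_Count`), and since it records transfer errors on the link rather than on the platters, it is worth noting the raw value per disk and checking later whether it keeps rising. Below is a rough Python sketch that pulls the raw value out of `smartctl -A` style text. The helper name `crc_error_count` is made up for illustration, and the sample table is an invented example of the usual column layout, not output from the disks in this thread.

```python
def crc_error_count(smart_attributes_text):
    """Return the raw value of SMART attribute 199, or None if it is absent.

    Expects the attribute table printed by `smartctl -A`, where each row is
    whitespace-separated and the raw value is the tenth column.
    """
    for line in smart_attributes_text.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; 199 is UDMA_CRC_Error_Count.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


# Illustrative sample only (values invented, layout as printed by smartctl -A).
sample_output = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   045   045   000    Old_age   Always       -       40123
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""

count = crc_error_count(sample_output)
print("UDMA CRC errors:", count)
```

The raw value never resets, so a single non-zero reading only says errors happened at some point; comparing readings before and after a cable or port change is what shows whether the problem is still active.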


OK, I had the brainwave of pulling all the drives/flash and putting them in my gaming PC. After some fiddling I got all drives to show.

 

The good news is that there are no ATA errors in the log. The bad news is that drive 3 is showing as unmountable. That drive is 4.5 years old and was due for replacement. I have a spare. Is it worth rebuilding that onto a new drive? If so, are there any issues replacing an unmountable 3TB drive with an 8TB one?

 

New diagnostics attached

 

 

 

 

tower-diagnostics-20191107-1858.zip


OK - I did that and it says:

 

Quote

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed.
Mount the filesystem to replay the log, and unmount it before re-running xfs_repair.
If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

 

 

Is it OK to use -L?


Update: the system log has this after trying to mount the FS again:

 

Nov  7 19:19:13 Tower kernel: XFS (md3): Corruption warning: Metadata has LSN (5:1567790) ahead of current LSN (5:1566498). Please unmount and run xfs_repair (>= v4.3) to resolve.
Nov  7 19:19:13 Tower kernel: XFS (md3): log mount/recovery failed: error -22
Nov  7 19:19:13 Tower kernel: XFS (md3): log mount failed
Nov  7 19:19:13 Tower root: mount: /mnt/disk3: wrong fs type, bad option, bad superblock on /dev/md3, missing codepage or helper program, or other error.
Nov  7 19:19:13 Tower emhttpd: shcmd (268): exit status: 32
Nov  7 19:19:13 Tower emhttpd: /mnt/disk3 mount error: No file system
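
For readers unfamiliar with the corruption warning: as far as I understand it, an XFS LSN is written as `cycle:block`, and the kernel is complaining that some metadata is stamped with a log position later than where the log itself currently ends, so the log cannot be replayed and the mount fails. A small Python sketch that extracts and compares the two values from the line above follows; the function name `parse_lsn_warning` is an assumption for illustration, and nothing here repairs anything.

```python
import re

# Pattern for the kernel's XFS "Metadata has LSN ... ahead of current LSN ..." warning.
LSN_WARNING = re.compile(
    r"Metadata has LSN \((\d+):(\d+)\) ahead of current LSN \((\d+):(\d+)\)"
)


def parse_lsn_warning(line):
    """Return ((cycle, block) of metadata, (cycle, block) of log), or None."""
    match = LSN_WARNING.search(line)
    if match is None:
        return None
    meta_cycle, meta_block, cur_cycle, cur_block = map(int, match.groups())
    return (meta_cycle, meta_block), (cur_cycle, cur_block)


# The warning line from the syslog above.
line = (
    "Nov  7 19:19:13 Tower kernel: XFS (md3): Corruption warning: "
    "Metadata has LSN (5:1567790) ahead of current LSN (5:1566498). "
    "Please unmount and run xfs_repair (>= v4.3) to resolve."
)

metadata_lsn, current_lsn = parse_lsn_warning(line)
# Tuples compare cycle first, then block, which matches LSN ordering.
print("metadata ahead of log:", metadata_lsn > current_lsn)
```

Here both values are in the same cycle and the metadata is only slightly ahead, which is consistent with writes that reached the disk while the matching log records did not.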

 

8 minutes ago, emmcee said:

Is my best option here just to replace Parity2?

Probably. I saw some older read errors on parity2 in the first diags, but since they were old (a year or so ago) I didn't mention them. Since there are more now, and they are a disk problem, you should replace it. It's even possible the disk was causing the sync errors; disks shouldn't return wrong data, but it's been known to happen.
