[Solved] slow parity check after --rebuild-tree


I couldn't find an answer in existing posts.

I'm running a second server with 5.0.rc8a containing 2 and 3 TB drives.

One disk was bad so I replaced it, did the rebuild and it all seemed fine.

However, I later realized the drive was being kept in read-only mode.

I ran reiserfsck --check, then --rebuild-tree, then --check again.

It looked like it was fixed, but the drive was still kept in read-only mode, even after an array stop and restart, including a reboot.  I did a second --rebuild-tree after --check indicated a problem.

After this the server wanted to do an array parity check, which is now painfully slow (it would take 30 days to finish).

The current syslog indicates another problem, but I don't believe it's the same drive.

 

Jan 15 23:34:38 Moat kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 15 23:34:38 Moat kernel: ata4.00: configured for UDMA/33
Jan 15 23:34:38 Moat kernel: ata4: EH complete
Jan 15 23:34:39 Moat kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x10200 action 0xe frozen
Jan 15 23:34:39 Moat kernel: ata4.00: irq_stat 0x00400000, PHY RDY changed
Jan 15 23:34:39 Moat kernel: ata4: SError: { Persist PHYRdyChg }
Jan 15 23:34:39 Moat kernel: ata4.00: failed command: READ DMA EXT
Jan 15 23:34:39 Moat kernel: ata4.00: cmd 25/00:00:f0:74:a9/00:04:02:00:00/e0 tag 0 dma 524288 in
Jan 15 23:34:39 Moat kernel:          res 50/00:00:ef:74:a9/00:00:02:00:00/e0 Emask 0x10 (ATA bus error)
Jan 15 23:34:39 Moat kernel: ata4.00: status: { DRDY }
Jan 15 23:34:39 Moat kernel: ata4: hard resetting link
Jan 15 23:34:46 Moat kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 15 23:34:46 Moat kernel: ata4.00: configured for UDMA/33
Jan 15 23:34:46 Moat kernel: ata4: EH complete
Jan 15 23:34:46 Moat kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x10200 action 0xe frozen
Jan 15 23:34:46 Moat kernel: ata4.00: irq_stat 0x00400000, PHY RDY changed
Jan 15 23:34:46 Moat kernel: ata4: SError: { Persist PHYRdyChg }
Jan 15 23:34:46 Moat kernel: ata4.00: failed command: READ DMA EXT
Jan 15 23:34:46 Moat kernel: ata4.00: cmd 25/00:00:50:b6:a9/00:04:02:00:00/e0 tag 0 dma 524288 in
Jan 15 23:34:46 Moat kernel:          res 50/00:00:4f:b6:a9/00:00:02:00:00/e0 Emask 0x10 (ATA bus error)
Jan 15 23:34:46 Moat kernel: ata4.00: status: { DRDY }
Jan 15 23:34:46 Moat kernel: ata4: hard resetting link
Jan 15 23:34:53 Moat kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 15 23:34:54 Moat kernel: ata4.00: configured for UDMA/33
Jan 15 23:34:54 Moat kernel: ata4: EH complete
Jan 15 23:34:54 Moat kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x10200 action 0xe frozen
Jan 15 23:34:54 Moat kernel: ata4.00: irq_stat 0x00400000, PHY RDY changed
Jan 15 23:34:54 Moat kernel: ata4: SError: { Persist PHYRdyChg }
Jan 15 23:34:54 Moat kernel: ata4.00: failed command: READ DMA EXT
Jan 15 23:34:54 Moat kernel: ata4.00: cmd 25/00:00:60:fa:a9/00:04:02:00:00/e0 tag 0 dma 524288 in
Jan 15 23:34:54 Moat kernel:          res 50/00:00:5f:fa:a9/00:00:02:00:00/e0 Emask 0x10 (ATA bus error)
Jan 15 23:34:54 Moat kernel: ata4.00: status: { DRDY }
Jan 15 23:34:54 Moat kernel: ata4: hard resetting link
Jan 15 23:35:01 Moat kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 15 23:35:01 Moat kernel: ata4.00: configured for UDMA/33
Jan 15 23:35:01 Moat kernel: ata4: EH complete
Jan 15 23:35:01 Moat kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x10200 action 0xe frozen
Jan 15 23:35:01 Moat kernel: ata4: irq_stat 0x00400000, PHY RDY changed
Jan 15 23:35:01 Moat kernel: ata4: SError: { Persist PHYRdyChg }
Jan 15 23:35:01 Moat kernel: ata4: hard resetting link
Jan 15 23:35:09 Moat kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 15 23:35:09 Moat kernel: ata4.00: configured for UDMA/33
Jan 15 23:35:09 Moat kernel: ata4: EH complete
Jan 15 23:35:09 Moat kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x10200 action 0xe frozen
Jan 15 23:35:09 Moat kernel: ata4: irq_stat 0x00400000, PHY RDY changed
Jan 15 23:35:09 Moat kernel: ata4: SError: { Persist PHYRdyChg }
Jan 15 23:35:09 Moat kernel: ata4: hard resetting link

 

Does this indicate a disk problem, or a bad SATA cable connection on SATA port 4 (or port 5, if the ports are numbered 0-4)?

 

I had to change the 200mm Big Boy fan on the Antec 900 case, so I'm thinking something came loose after flipping it on its side.

 

EDIT: Solved - loose SATA cables or loose power to the SATA enclosures. Possibly a loose SATA controller card.

Link to comment

I've attached the full syslog from bootup tonight.

Came back up ok and did not require a parity check.

I checked the last disk that was rebuilt, and the short test passed.

 

Looking at the end of the syslog from before shutdown last night, I see the lines below. I notice that /dev/sde reports something, which may coincide with the repeating ata4 errors found at the end of the attached syslog.

 

Jan 16 23:25:14 Moat status[22654]: SMART overall health assessment
Jan 16 23:25:15 Moat status[22654]: /dev/sda: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Jan 16 23:25:15 Moat status[22654]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Jan 16 23:25:15 Moat status[22654]: SMART overall-health self-assessment test result: PASSED
Jan 16 23:25:15 Moat status[22654]: /dev/sdb: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Jan 16 23:25:15 Moat status[22654]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Jan 16 23:25:15 Moat status[22654]: SMART overall-health self-assessment test result: PASSED
Jan 16 23:25:15 Moat status[22654]: /dev/sdc: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Jan 16 23:25:15 Moat status[22654]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Jan 16 23:25:15 Moat status[22654]: SMART overall-health self-assessment test result: PASSED
Jan 16 23:25:15 Moat status[22654]: /dev/sdd: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Jan 16 23:25:15 Moat status[22654]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Jan 16 23:25:15 Moat status[22654]: SMART overall-health self-assessment test result: PASSED
Jan 16 23:25:15 Moat status[22654]: /dev/sde: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Jan 16 23:25:15 Moat status[22654]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Jan 16 23:25:15 Moat status[22654]: /dev/sde: Unknown USB bridge [0x0781:0x5530 (0x103)]
Jan 16 23:25:15 Moat status[22654]: Smartctl: please specify device type with the -d option.
Jan 16 23:25:15 Moat status[22654]: Use smartctl -h to get a usage summary
Jan 16 23:25:15 Moat status[22654]: /dev/sdf: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Jan 16 23:25:15 Moat status[22654]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Jan 16 23:25:15 Moat status[22654]: SMART overall-health self-assessment test result: PASSED
Jan 16 23:25:15 Moat status[22654]: /dev/sdg: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Jan 16 23:25:15 Moat status[22654]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Jan 16 23:25:15 Moat status[22654]: SMART overall-health self-assessment test result: PASSED
Jan 16 23:25:15 Moat status[22654]: /dev/sdh: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Jan 16 23:25:15 Moat status[22654]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Jan 16 23:25:15 Moat status[22654]: SMART overall-health self-assessment test result: PASSED
Jan 16 23:25:16 Moat status[22654]: /dev/sdi: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Jan 16 23:25:16 Moat status[22654]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Jan 16 23:25:16 Moat status[22654]: SMART overall-health self-assessment test result: PASSED
Jan 16 23:25:16 Moat status[22654]: /dev/sdj: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Jan 16 23:25:16 Moat status[22654]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Jan 16 23:25:16 Moat status[22654]: SMART overall-health self-assessment test result: PASSED
Jan 16 23:25:16 Moat status[22654]: /dev/sdk: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Jan 16 23:25:16 Moat status[22654]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Jan 16 23:25:16 Moat status[22654]: SMART overall-health self-assessment test result: PASSED
Jan 16 23:25:16 Moat status[22654]: /dev/sdl: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Jan 16 23:25:16 Moat status[22654]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Jan 16 23:25:16 Moat status[22654]: SMART overall-health self-assessment test result: PASSED
Jan 16 23:25:16 Moat status[22654]: /dev/sdm: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Jan 16 23:25:16 Moat status[22654]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Jan 16 23:25:16 Moat status[22654]: SMART overall-health self-assessment test result: PASSED

 

The parity check estimate is now down to 11 hrs on a 3 TB drive, running at 79 MB/s.

This is good now except for the ata4 errors.
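
As a rough sanity check on those numbers, here is a minimal sketch of the arithmetic. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant transfer rate, which a real parity check does not have, so treat the results as ballpark figures only. The function names are made up for illustration.

```python
# Rough parity-check arithmetic, assuming decimal units and a constant rate.
TB = 1e12  # bytes in a decimal terabyte
MB = 1e6   # bytes in a decimal megabyte


def parity_check_hours(capacity_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to read capacity_bytes once at a steady rate."""
    return capacity_bytes / rate_bytes_per_s / 3600


def implied_rate_mb_s(capacity_bytes: float, days: float) -> float:
    """Average MB/s implied by finishing capacity_bytes in the given days."""
    return capacity_bytes / (days * 86400) / MB


hours = parity_check_hours(3 * TB, 79 * MB)
slow_rate = implied_rate_mb_s(3 * TB, 30)

print(f"3 TB at 79 MB/s: about {hours:.1f} hours")
print(f"3 TB in 30 days: about {slow_rate:.2f} MB/s on average")
```

So the 11 hr estimate is in line with 79 MB/s, while a 30 day estimate corresponds to an average of only a little over 1 MB/s.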

 

I suspect the SATA multiplier card, or the SATA connections, were loose.

I use 4 out of the 5 SATA ports on the motherboard, plus 8 on the multiplier.

syslog-20170117.txt

Link to comment

I wasn't sure that the ata4 in this syslog would be the same ata4 as in the first excerpt, as it's a channel, dynamically assigned at boot, and can be different for each boot.  But you have 3 drive controllers in use, and only one of them, the motherboard SATA controller using ahci, is using ATA support.  Since AHCI assigns one device per ATA channel, the 6 SATA ports are always assigned the first 6 ATA channels.  So your Disk 6 (sdc, WDC_WD20EARS-00MVWB0_WD-WCAZA1137613) is the device ata4.00, on the ata4 channel.
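
If you want to check the ata-channel-to-device mapping yourself, one possible approach is to look at where /sys/block/<disk> points. This is only a sketch: it assumes a kernel that exposes an ataN component in the sysfs device path (older kernels may not), and the example path below is hypothetical, not taken from this server.

```python
import os
import re
from typing import Optional

# Matches the "ataN" path component that libata-based kernels may expose.
_ATA_PORT = re.compile(r"/ata(\d+)/")


def ata_port_from_sysfs_path(path: str) -> Optional[int]:
    """Return N if the resolved sysfs path contains an ataN component."""
    match = _ATA_PORT.search(path)
    return int(match.group(1)) if match else None


def ata_port_for_disk(name: str) -> Optional[int]:
    """Resolve /sys/block/<name> and extract the ATA port, if any."""
    return ata_port_from_sysfs_path(os.path.realpath(f"/sys/block/{name}"))


# Hypothetical resolved paths, for illustration only.
sata_path = ("/sys/devices/pci0000:00/0000:00:1f.2/ata4/host3/"
             "target3:0:0/3:0:0:0/block/sdc")
usb_path = ("/sys/devices/pci0000:00/0000:00:1d.7/usb1/1-1/1-1:1.0/"
            "host8/target8:0:0/8:0:0:0/block/sde")

print(ata_port_from_sysfs_path(sata_path))  # disk behind an ATA channel
print(ata_port_from_sysfs_path(usb_path))   # USB device, no ATA channel
```

On the server itself you would call ata_port_for_disk("sdc") and compare the result with the ataN seen in the syslog.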

 

Your SAS card is not an ATA controller, and neither is the USB controller that's handling your flash drive, on sde.  Because it's a USB device, you have to give smartctl a USB device type in order to access the SMART info for it.  Try "-d usbcypress", which works for some USB drives.  If that doesn't work, you will have to look it up in the smartctl docs and try other USB options.  In the latest v6, I believe you can specify the device type in order to get SMART info.
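
To save some typing when trying the different -d options, something like the sketch below could loop over them. The list of types is an assumption about which bridge types are worth trying (smartctl names the Cypress one usbcypress); check smartctl -h on your build for what it actually supports. The helper names are made up, smartctl usually needs root, and many USB flash drives expose no SMART data at all, so getting None back is a normal outcome.

```python
import subprocess
from typing import List, Optional

# Candidate smartctl device types for USB bridges (assumed list).
USB_BRIDGE_TYPES = ["sat", "usbcypress", "usbjmicron", "usbsunplus"]


def smartctl_health_cmd(device: str, dev_type: Optional[str] = None) -> List[str]:
    """Build a 'smartctl -H' command line, optionally with '-d <type>'."""
    cmd = ["smartctl", "-H"]
    if dev_type:
        cmd += ["-d", dev_type]
    cmd.append(device)
    return cmd


def first_working_type(device: str) -> Optional[str]:
    """Return the first device type that yields a health result, else None."""
    for dev_type in USB_BRIDGE_TYPES:
        try:
            result = subprocess.run(
                smartctl_health_cmd(device, dev_type),
                capture_output=True, text=True, check=False,
            )
        except FileNotFoundError:
            return None  # smartctl is not installed or not on PATH
        # Same phrase that appears in the syslog for the healthy drives.
        if "self-assessment test result" in result.stdout:
            return dev_type
    return None


print(" ".join(smartctl_health_cmd("/dev/sde", "usbcypress")))
```

Calling first_working_type("/dev/sde") on the server would then try each type in turn.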

 

On boot, it started a correcting parity check, and immediately found parity errors. Within a second or two, it also began the sequence of exceptions on the ata4.00 device, which would have drastically slowed everything down.  The errors are consistent with a loose connection, and are continuous (power problems can look like a loose connection).  The kernel quickly begins slowing down the communication channels to the drive, in hope that it will help restore clean and stable comms, but unsuccessfully.  It has already slowed it down to the lowest it usually can go.  The timeouts and hard resets also involve lost time, significant delays, making everything else wait too.
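
To see at a glance which channel is misbehaving and how often, you could tally the kernel messages per ATA port. The following is a minimal sketch, not a full syslog parser: the regular expressions are my own guess at the line shapes, based only on the lines quoted in the first post, and the sample is a handful of those lines.

```python
import re
from collections import Counter
from typing import Dict

# A few lines copied from the syslog excerpt in the first post.
SAMPLE = """\
Jan 15 23:34:39 Moat kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x10200 action 0xe frozen
Jan 15 23:34:39 Moat kernel: ata4.00: failed command: READ DMA EXT
Jan 15 23:34:39 Moat kernel: ata4: hard resetting link
Jan 15 23:34:46 Moat kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 15 23:34:46 Moat kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x10200 action 0xe frozen
Jan 15 23:34:46 Moat kernel: ata4.00: failed command: READ DMA EXT
Jan 15 23:34:46 Moat kernel: ata4: hard resetting link
"""

# Port name (ata4 or ata4.00) followed by the message text.
LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")
LINK_UP = re.compile(r"SATA link up ([\d.]+ Gbps)")


def summarize(log_text: str) -> Dict[str, Counter]:
    """Count exceptions, failed commands, resets and link speeds per port."""
    stats: Dict[str, Counter] = {}
    for line in log_text.splitlines():
        match = LINE.search(line)
        if not match:
            continue
        port, msg = match.groups()
        counts = stats.setdefault(port, Counter())
        if msg.startswith("exception"):
            counts["exceptions"] += 1
        elif msg.startswith("failed command"):
            counts["failed_commands"] += 1
        elif msg.startswith("hard resetting link"):
            counts["hard_resets"] += 1
        else:
            speed = LINK_UP.search(msg)
            if speed:
                counts["link up " + speed.group(1)] += 1
    return stats


stats = summarize(SAMPLE)
for port, counts in stats.items():
    print(port, dict(counts))
```

Running it over the whole syslog (for example summarize(open("syslog-20170117.txt").read())) should show whether the exceptions and hard resets are confined to ata4.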

 

In my opinion, this is consistent with either a loose vibrating connection, or power issues, either a general low power problem, or more likely power issues to the drive itself.  I think the most likely is insufficient power to the drive.  It all starts when the power demand is highest, very shortly after the parity check begins.  You may want to redistribute the power connections, to make sure this drive is on a rail not shared with too many other devices.  Single rail power supplies are always recommended, to avoid situations where too many drives are drawing from the same power rail.

 

I do not believe there's a problem with the SATA cable, just the power cable, or PSU.

 

It's always best to provide the complete syslog or diagnostics, not just an excerpt.  If you still want to show something from the syslog, make sure it's the very first errors that occur, they are always the most important.  I really do recommend upgrading to v6, with much better diagnostics and notifications, but I recognize that with all your plugins, it will be quite a job.

Link to comment

Thanks for the help and detailed response. Appreciate that you were able to pinpoint ata4 as sdc.

 

While replacing the Big Boy fan on my Antec case, which has a hard-to-reach screw, I most likely loosened a power cable on the SATA enclosures or bumped the SATA cable connections. I also made sure to better secure the controller card. Once I had checked and done all this, I was able to reboot and restart the parity check. After the initial startup, I started seeing normal speeds.

 

After the array parity check was done, I ran a SMART short test on sdc and it came back fine (and no more syslog errors). I'll simply monitor it for now.

 

I was not able to get v6 booting up on my other server (recent thread), but at some point I'll try it on this server (different motherboard).
