
CPU Stalls, UDMA/133, heck where to start.. o.O


grandprix


Ok.

 

So, I'm beginning to think that -perhaps- I used a nuclear bomb to remove an ant hill.

 

My problems in this thread are what prompted me to do the "upgrade":  http://lime-technology.com/forum/index.php?topic=33669.msg310737#msg310737

 

I figured I had a failing controller on the mobo.  Now I think perhaps it was just the parity drive, because the trouble continued after the hardware "upgrade" (in quotes because, with the exception of the case, drives and PSU, everything has been replaced; I'll get to the PSU).

 

After the upgrade to the X10SLM-F-O, ECC RAM and a Xeon, this happened during a no-correct parity check run immediately afterwards:

 

Aug 30 18:51:52 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=1355 c=1354 q=108)
Aug 30 18:51:52 Tower kernel: Pid: 2921, comm: unraidd Not tainted 3.9.11p-unRAID #5
Aug 30 18:51:52 Tower kernel: Call Trace:
Aug 30 18:51:52 Tower kernel:  [<c1062c2a>] print_cpu_stall+0xbc/0x107
Aug 30 18:51:52 Tower kernel:  [<c1062eba>] __rcu_pending+0x4f/0x12a
Aug 30 18:51:52 Tower kernel:  [<c1063008>] rcu_check_callbacks+0x73/0x9b
Aug 30 18:51:52 Tower kernel:  [<c1032ed9>] update_process_times+0x2d/0x53
Aug 30 18:51:52 Tower kernel:  [<c105520b>] tick_sched_timer+0x77/0xa1
Aug 30 18:51:52 Tower kernel:  [<c1040e02>] ? __remove_hrtimer+0x25/0x7a
Aug 30 18:51:52 Tower kernel:  [<c1040f45>] __run_hrtimer+0x45/0xaf
Aug 30 18:51:52 Tower kernel:  [<c10412ad>] hrtimer_interrupt+0xf1/0x1e7
Aug 30 18:51:52 Tower kernel:  [<c101c43a>] smp_apic_timer_interrupt+0x6d/0x7f
Aug 30 18:51:52 Tower kernel:  [<c1401411>] apic_timer_interrupt+0x2d/0x34
Aug 30 18:51:52 Tower kernel:  [<c1400e0b>] ? _raw_spin_lock+0xd/0x1f
Aug 30 18:51:52 Tower kernel:  [<f944f21d>] handle_stripe+0x4b/0xceb [md_mod]
Aug 30 18:51:52 Tower kernel:  [<c1044f5f>] ? __wake_up+0x3b/0x42
Aug 30 18:51:52 Tower kernel:  [<f944e8d9>] ? _release_stripe+0xd0/0xfa [md_mod]
Aug 30 18:51:52 Tower kernel:  [<f944ff2e>] unraidd+0x71/0xb5 [md_mod]
Aug 30 18:51:52 Tower kernel:  [<f944ccb2>] md_thread+0xd3/0xea [md_mod]
Aug 30 18:51:52 Tower kernel:  [<c103f031>] ? wake_up_bit+0x5b/0x5b
Aug 30 18:51:52 Tower kernel:  [<c103ebf1>] kthread+0x90/0x95
Aug 30 18:51:52 Tower kernel:  [<f944cbdf>] ? import_device+0x166/0x166 [md_mod]
Aug 30 18:51:52 Tower kernel:  [<c1401837>] ret_from_kernel_thread+0x1b/0x28
Aug 30 18:51:52 Tower kernel:  [<c103eb61>] ? kthread_freezable_should_stop+0x4a/0x4a
Aug 30 19:02:53 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6001 jiffies g=1434 c=1433 q=108)
Aug 30 19:02:53 Tower kernel: Pid: 2849, comm: mdrecoveryd Not tainted 3.9.11p-unRAID #5
Aug 30 19:02:53 Tower kernel: Call Trace:
Aug 30 19:02:53 Tower kernel:  [<c1062c2a>] print_cpu_stall+0xbc/0x107
Aug 30 19:02:53 Tower kernel:  [<c1062eba>] __rcu_pending+0x4f/0x12a
Aug 30 19:02:53 Tower kernel:  [<c1063008>] rcu_check_callbacks+0x73/0x9b
Aug 30 19:02:53 Tower kernel:  [<c1032ed9>] update_process_times+0x2d/0x53
Aug 30 19:02:53 Tower kernel:  [<c105520b>] tick_sched_timer+0x77/0xa1
Aug 30 19:02:53 Tower kernel:  [<c1040e02>] ? __remove_hrtimer+0x25/0x7a
Aug 30 19:02:53 Tower kernel:  [<c1040f45>] __run_hrtimer+0x45/0xaf
Aug 30 19:02:53 Tower kernel:  [<c10412ad>] hrtimer_interrupt+0xf1/0x1e7
Aug 30 19:02:53 Tower kernel:  [<c101c43a>] smp_apic_timer_interrupt+0x6d/0x7f
Aug 30 19:02:53 Tower kernel:  [<c1401411>] apic_timer_interrupt+0x2d/0x34
Aug 30 19:02:53 Tower kernel:  [<c124007b>] ? des3_ede_decrypt+0x232/0x4d4
Aug 30 19:02:53 Tower kernel:  [<f944fd65>] ? handle_stripe+0xb93/0xceb [md_mod]
Aug 30 19:02:53 Tower kernel:  [<f944ffb5>] unraid_sync+0x43/0x52 [md_mod]
Aug 30 19:02:53 Tower kernel:  [<f944bdc9>] md_do_sync+0x13c/0x3b3 [md_mod]
Aug 30 19:02:53 Tower kernel:  [<c103f031>] ? wake_up_bit+0x5b/0x5b
Aug 30 19:02:53 Tower kernel:  [<f944c57b>] md_do_recovery+0x117/0x19c [md_mod]
Aug 30 19:02:53 Tower kernel:  [<f944ccb2>] md_thread+0xd3/0xea [md_mod]
Aug 30 19:02:53 Tower kernel:  [<c103f031>] ? wake_up_bit+0x5b/0x5b
Aug 30 19:02:53 Tower kernel:  [<c103ebf1>] kthread+0x90/0x95
Aug 30 19:02:53 Tower kernel:  [<f944cbdf>] ? import_device+0x166/0x166 [md_mod]
Aug 30 19:02:53 Tower kernel:  [<c1401837>] ret_from_kernel_thread+0x1b/0x28
Aug 30 19:02:53 Tower kernel:  [<c103eb61>] ? kthread_freezable_should_stop+0x4a/0x4a
Aug 30 22:03:53 Tower kernel: md: parity incorrect, sector=2549700264
Aug 30 22:03:53 Tower kernel: md: parity incorrect, sector=2549700272
Aug 30 22:35:38 Tower login[2439]: ROOT LOGIN  on '/dev/tty1'
Aug 30 22:50:27 Tower login[4006]: ROOT LOGIN  on '/dev/tty1'
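For anyone who wants to pull just the stall events out of a long syslog, here is a rough Python sketch. It assumes the exact line layout quoted above; the function name `find_rcu_stalls` and both regexes are made up for illustration and are not part of any unRAID tooling. It pairs each "self-detected stall" line with the task name on the `Pid:` line that follows it.

```python
import re

# Patterns assume the syslog layout quoted above (an assumption;
# other kernel versions word these lines differently).
STALL_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8}) \S+ kernel: INFO: rcu_sched "
    r"self-detected stall on CPU \{\s*(?P<cpu>\d+)\}\s+\(t=(?P<jiffies>\d+) jiffies"
)
PID_RE = re.compile(r"kernel: Pid: (?P<pid>\d+), comm: (?P<comm>\S+)")


def find_rcu_stalls(lines):
    """Return one dict per stall: timestamp, CPU, jiffies, and task name if seen."""
    stalls = []
    for line in lines:
        m = STALL_RE.search(line)
        if m:
            stalls.append({
                "ts": m.group("ts"),
                "cpu": int(m.group("cpu")),
                "jiffies": int(m.group("jiffies")),
                "comm": None,
            })
            continue
        p = PID_RE.search(line)
        # Attach the task name to the most recent stall that lacks one.
        if p and stalls and stalls[-1]["comm"] is None:
            stalls[-1]["comm"] = p.group("comm")
    return stalls


sample = [
    "Aug 30 18:51:52 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=1355 c=1354 q=108)",
    "Aug 30 18:51:52 Tower kernel: Pid: 2921, comm: unraidd Not tainted 3.9.11p-unRAID #5",
    "Aug 30 19:02:53 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6001 jiffies g=1434 c=1433 q=108)",
    "Aug 30 19:02:53 Tower kernel: Pid: 2849, comm: mdrecoveryd Not tainted 3.9.11p-unRAID #5",
]
for stall in find_rcu_stalls(sample):
    print(stall["ts"], "CPU", stall["cpu"], stall["comm"], stall["jiffies"])
```

To run it over a whole file, something like `find_rcu_stalls(open("syslog-2014-08-30.txt"))` should work, since a file object iterates line by line.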

 

I searched and found three answers (I suppose): 

 

1. "Nothing of concern, would be fixed in later 5.0rc" -- but I am running 5.0.5 stable

2. "LOWMEM" is low, but when I logged into the terminal and ran  free -l  it showed 635k free for Low Memory.

3. I forget exactly what three was; I -believe- it was a possible bad drive. That seems plausible, since I still got the same 2 sync errors at random sectors that I was getting with the old hardware (again, besides the case, drives and PSU, which I promise to get to, I'm now on all new hardware).
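On the low memory point in #2: a 32-bit kernel built with highmem support also exposes `LowTotal` and `LowFree` in /proc/meminfo, so the figure can be logged without running free by hand. Below is a small Python sketch of that. The helper name `low_memory_kb` is invented here, and the sample numbers are made up purely to show the format; they are NOT from this server.

```python
def low_memory_kb(meminfo_text):
    """Return (LowTotal, LowFree) in kB, or None if the kernel does not report them."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    if "LowTotal" in fields and "LowFree" in fields:
        return fields["LowTotal"], fields["LowFree"]
    return None


# Hypothetical sample in /proc/meminfo format (values are illustrative only).
SAMPLE = """\
MemTotal:        8000000 kB
HighTotal:       7200000 kB
HighFree:        6500000 kB
LowTotal:         800000 kB
LowFree:          650000 kB
"""

total, free = low_memory_kb(SAMPLE)
print(f"LowFree is {free} kB of {total} kB ({100 * free / total:.1f}%)")
```

On the real box the input would be `open("/proc/meminfo").read()`; on a 64-bit kernel the function simply returns None because those two fields are absent.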

 

So, figuring maybe it was #3, I stopped the no-correct, brought down unRAID, shut down, changed out the parity drive for one that was pre-cleared twice, and it is now rebuilding as I write this.

 

So far, so good, I suppose.

 

Aug 31 00:44:00 Tower kernel: mdcmd (52): start UPGRADE_DISK
Aug 31 00:44:00 Tower kernel: unraid: allocating 77688K for 1536 stripes (12 disks)
Aug 31 00:44:00 Tower kernel: md1: running, size: 1953514552 blocks
Aug 31 00:44:00 Tower kernel: md2: running, size: 1953514552 blocks
Aug 31 00:44:00 Tower kernel: md3: running, size: 2930266532 blocks
Aug 31 00:44:00 Tower kernel: md4: running, size: 1953514552 blocks
Aug 31 00:44:00 Tower kernel: md5: running, size: 1953514552 blocks
Aug 31 00:44:00 Tower kernel: md6: running, size: 1953514552 blocks
Aug 31 00:44:00 Tower kernel: md7: running, size: 2930266532 blocks
Aug 31 00:44:00 Tower kernel: md8: running, size: 1953514552 blocks
Aug 31 00:44:00 Tower kernel: md9: running, size: 1953514552 blocks
Aug 31 00:44:00 Tower kernel: md10: running, size: 2930266532 blocks
Aug 31 00:44:00 Tower kernel: md11: running, size: 2930266532 blocks
Aug 31 00:44:00 Tower emhttp: shcmd (38): udevadm settle
Aug 31 00:44:00 Tower emhttp: shcmd (39): /usr/local/sbin/emhttp_event array_started
Aug 31 00:44:00 Tower emhttp_event: array_started
Aug 31 00:44:00 Tower emhttp: Mounting disks...
Aug 31 00:44:00 Tower emhttp: shcmd (40): mkdir /mnt/disk1
Aug 31 00:44:00 Tower emhttp: shcmd (41): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md1 /mnt/disk1 |& logger
Aug 31 00:44:00 Tower kernel: REISERFS (device md1): found reiserfs format "3.6" with standard journal
Aug 31 00:44:00 Tower kernel: REISERFS (device md1): using ordered data mode
Aug 31 00:44:00 Tower kernel: reiserfs: using flush barriers
Aug 31 00:44:00 Tower kernel: REISERFS (device md1): journal params: device md1, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Aug 31 00:44:00 Tower kernel: REISERFS (device md1): checking transaction log (md1)
Aug 31 00:44:00 Tower kernel: REISERFS (device md1): Using r5 hash to sort names
Aug 31 00:44:00 Tower emhttp: shcmd (42): mkdir /mnt/disk2
Aug 31 00:44:00 Tower emhttp: shcmd (43): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md2 /mnt/disk2 |& logger
Aug 31 00:44:00 Tower kernel: REISERFS (device md2): found reiserfs format "3.6" with standard journal
Aug 31 00:44:00 Tower kernel: REISERFS (device md2): using ordered data mode
Aug 31 00:44:00 Tower kernel: reiserfs: using flush barriers
Aug 31 00:44:00 Tower kernel: REISERFS (device md2): journal params: device md2, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Aug 31 00:44:00 Tower kernel: REISERFS (device md2): checking transaction log (md2)
Aug 31 00:44:00 Tower kernel: REISERFS (device md2): Using r5 hash to sort names
Aug 31 00:44:00 Tower emhttp: shcmd (44): mkdir /mnt/disk3
Aug 31 00:44:00 Tower emhttp: shcmd (45): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md3 /mnt/disk3 |& logger
Aug 31 00:44:00 Tower kernel: REISERFS (device md3): found reiserfs format "3.6" with standard journal
Aug 31 00:44:00 Tower kernel: REISERFS (device md3): using ordered data mode
Aug 31 00:44:00 Tower kernel: reiserfs: using flush barriers
Aug 31 00:44:00 Tower kernel: REISERFS (device md3): journal params: device md3, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Aug 31 00:44:00 Tower kernel: REISERFS (device md3): checking transaction log (md3)
Aug 31 00:44:00 Tower kernel: REISERFS (device md3): Using r5 hash to sort names
Aug 31 00:44:01 Tower emhttp: shcmd (46): mkdir /mnt/disk4
Aug 31 00:44:01 Tower emhttp: shcmd (47): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md4 /mnt/disk4 |& logger
Aug 31 00:44:01 Tower kernel: REISERFS (device md4): found reiserfs format "3.6" with standard journal
Aug 31 00:44:01 Tower kernel: REISERFS (device md4): using ordered data mode
Aug 31 00:44:01 Tower kernel: reiserfs: using flush barriers
Aug 31 00:44:01 Tower kernel: REISERFS (device md4): journal params: device md4, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Aug 31 00:44:01 Tower kernel: REISERFS (device md4): checking transaction log (md4)
Aug 31 00:44:01 Tower kernel: REISERFS (device md4): Using r5 hash to sort names
Aug 31 00:44:01 Tower emhttp: shcmd (48): mkdir /mnt/disk5
Aug 31 00:44:01 Tower emhttp: shcmd (49): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md5 /mnt/disk5 |& logger
Aug 31 00:44:01 Tower kernel: REISERFS (device md5): found reiserfs format "3.6" with standard journal
Aug 31 00:44:01 Tower kernel: REISERFS (device md5): using ordered data mode
Aug 31 00:44:01 Tower kernel: reiserfs: using flush barriers
Aug 31 00:44:01 Tower kernel: REISERFS (device md5): journal params: device md5, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Aug 31 00:44:01 Tower kernel: REISERFS (device md5): checking transaction log (md5)
Aug 31 00:44:01 Tower kernel: REISERFS (device md5): Using r5 hash to sort names
Aug 31 00:44:01 Tower emhttp: shcmd (50): mkdir /mnt/disk6
Aug 31 00:44:01 Tower emhttp: shcmd (51): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md6 /mnt/disk6 |& logger
Aug 31 00:44:01 Tower kernel: REISERFS (device md6): found reiserfs format "3.6" with standard journal
Aug 31 00:44:01 Tower kernel: REISERFS (device md6): using ordered data mode
Aug 31 00:44:01 Tower kernel: reiserfs: using flush barriers
Aug 31 00:44:01 Tower kernel: REISERFS (device md6): journal params: device md6, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Aug 31 00:44:01 Tower kernel: REISERFS (device md6): checking transaction log (md6)
Aug 31 00:44:01 Tower kernel: REISERFS (device md6): Using r5 hash to sort names
Aug 31 00:44:01 Tower emhttp: shcmd (52): mkdir /mnt/disk7
Aug 31 00:44:01 Tower emhttp: shcmd (53): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md7 /mnt/disk7 |& logger
Aug 31 00:44:01 Tower kernel: REISERFS (device md7): found reiserfs format "3.6" with standard journal
Aug 31 00:44:01 Tower kernel: REISERFS (device md7): using ordered data mode
Aug 31 00:44:01 Tower kernel: reiserfs: using flush barriers
Aug 31 00:44:01 Tower kernel: REISERFS (device md7): journal params: device md7, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Aug 31 00:44:01 Tower kernel: REISERFS (device md7): checking transaction log (md7)
Aug 31 00:44:01 Tower kernel: REISERFS (device md7): Using r5 hash to sort names
Aug 31 00:44:01 Tower emhttp: shcmd (54): mkdir /mnt/disk8
Aug 31 00:44:01 Tower emhttp: shcmd (55): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md8 /mnt/disk8 |& logger
Aug 31 00:44:01 Tower kernel: REISERFS (device md8): found reiserfs format "3.6" with standard journal
Aug 31 00:44:01 Tower kernel: REISERFS (device md8): using ordered data mode
Aug 31 00:44:01 Tower kernel: reiserfs: using flush barriers
Aug 31 00:44:01 Tower kernel: REISERFS (device md8): journal params: device md8, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Aug 31 00:44:01 Tower kernel: REISERFS (device md8): checking transaction log (md8)
Aug 31 00:44:01 Tower kernel: REISERFS (device md8): Using r5 hash to sort names
Aug 31 00:44:01 Tower emhttp: shcmd (56): mkdir /mnt/disk9
Aug 31 00:44:01 Tower emhttp: shcmd (57): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md9 /mnt/disk9 |& logger
Aug 31 00:44:01 Tower kernel: REISERFS (device md9): found reiserfs format "3.6" with standard journal
Aug 31 00:44:01 Tower kernel: REISERFS (device md9): using ordered data mode
Aug 31 00:44:01 Tower kernel: reiserfs: using flush barriers
Aug 31 00:44:01 Tower kernel: REISERFS (device md9): journal params: device md9, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Aug 31 00:44:01 Tower kernel: REISERFS (device md9): checking transaction log (md9)
Aug 31 00:44:02 Tower kernel: REISERFS (device md9): Using r5 hash to sort names
Aug 31 00:44:02 Tower emhttp: shcmd (58): mkdir /mnt/disk10
Aug 31 00:44:02 Tower emhttp: shcmd (59): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md10 /mnt/disk10 |& logger
Aug 31 00:44:02 Tower kernel: REISERFS (device md10): found reiserfs format "3.6" with standard journal
Aug 31 00:44:02 Tower kernel: REISERFS (device md10): using ordered data mode
Aug 31 00:44:02 Tower kernel: reiserfs: using flush barriers
Aug 31 00:44:02 Tower kernel: REISERFS (device md10): journal params: device md10, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Aug 31 00:44:02 Tower kernel: REISERFS (device md10): checking transaction log (md10)
Aug 31 00:44:02 Tower kernel: REISERFS (device md10): Using r5 hash to sort names
Aug 31 00:44:02 Tower emhttp: shcmd (60): mkdir /mnt/disk11
Aug 31 00:44:02 Tower emhttp: shcmd (61): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md11 /mnt/disk11 |& logger
Aug 31 00:44:02 Tower kernel: REISERFS (device md11): found reiserfs format "3.6" with standard journal
Aug 31 00:44:02 Tower kernel: REISERFS (device md11): using ordered data mode
Aug 31 00:44:02 Tower kernel: reiserfs: using flush barriers
Aug 31 00:44:02 Tower kernel: REISERFS (device md11): journal params: device md11, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Aug 31 00:44:02 Tower kernel: REISERFS (device md11): checking transaction log (md11)
Aug 31 00:44:02 Tower kernel: REISERFS (device md11): Using r5 hash to sort names
Aug 31 00:44:02 Tower emhttp: shcmd (62): mkdir /mnt/user
Aug 31 00:44:02 Tower emhttp: shcmd (63): /usr/local/sbin/shfs /mnt/user -disks 16777214 -o noatime,big_writes,allow_other -o remember=0  |& logger
Aug 31 00:44:02 Tower emhttp: shcmd (64): crontab -c /etc/cron.d -d &> /dev/null
Aug 31 00:44:02 Tower emhttp: shcmd (65): /usr/local/sbin/emhttp_event disks_mounted
Aug 31 00:44:02 Tower emhttp_event: disks_mounted
Aug 31 00:44:02 Tower kernel: mdcmd (53): check CORRECT
Aug 31 00:44:02 Tower kernel: md: recovery thread woken up ...
Aug 31 00:44:02 Tower kernel: md: recovery thread syncing parity disk ...
Aug 31 00:44:02 Tower kernel: md: using 2048k window, over a total of 2930266532 blocks.
Aug 31 00:44:03 Tower emhttp: shcmd (66): :>/etc/samba/smb-shares.conf
Aug 31 00:44:03 Tower avahi-daemon[2494]: Files changed, reloading.
Aug 31 00:44:04 Tower emhttp: Restart SMB...
Aug 31 00:44:04 Tower emhttp: shcmd (67): killall -HUP smbd
Aug 31 00:44:04 Tower emhttp: shcmd (68): cp /etc/avahi/services/smb.service- /etc/avahi/services/smb.service
Aug 31 00:44:04 Tower avahi-daemon[2494]: Files changed, reloading.
Aug 31 00:44:04 Tower avahi-daemon[2494]: Service group file /services/smb.service changed, reloading.
Aug 31 00:44:04 Tower emhttp: shcmd (69): ps axc | grep -q rpc.mountd
Aug 31 00:44:04 Tower emhttp: _shcmd: shcmd (69): exit status: 1
Aug 31 00:44:04 Tower emhttp: shcmd (70): /usr/local/sbin/emhttp_event svcs_restarted
Aug 31 00:44:04 Tower emhttp_event: svcs_restarted
Aug 31 00:44:04 Tower emhttp: shcmd (71): /usr/local/sbin/emhttp_event started
Aug 31 00:44:04 Tower emhttp_event: started
Aug 31 00:44:05 Tower avahi-daemon[2494]: Service "Tower" (/services/smb.service) successfully established.

 

But I still have those "configured for UDMA/133" messages and other link oddities (odd to me, anyway).  The syslog I'm attaching will show those as well as the CPU stalls, etc.

 

Any and all help is greatly appreciated.  Please?

syslog-2014-08-30.txt


Ok, parity seems to have rebuilt without issue.  But now I'm getting read errors on the parity drive while doing a no-correct check (to see if the parity is good after the rebuild).  I can't seem to win with "ata3.00" no matter the controller, cabling or hard drive.

 

The syslog attached to this reply runs from unRAID startup, through the parity rebuild, to the errors I'm getting during the no-correct (which is not done yet; I only just kicked it off, since I went to bed while the rebuild was still running).

 

Aug 31 11:40:19 Tower kernel: md: nocheck_array: check not active
Aug 31 11:40:32 Tower kernel: md: unRAID driver removed
Aug 31 11:40:32 Tower kernel: md: unRAID driver 2.2.0 installed
Aug 31 11:40:40 Tower emhttp: Start array...
Aug 31 11:40:40 Tower kernel: mdcmd (52): start STOPPED
Aug 31 11:40:41 Tower avahi-daemon[3315]: Service "Tower" (/services/smb.service) successfully established.
Aug 31 11:40:58 Tower kernel: mdcmd (53): check NOCORRECT
Aug 31 11:40:58 Tower kernel: md: recovery thread woken up ...
Aug 31 11:40:58 Tower kernel: md: recovery thread checking parity...
Aug 31 11:40:58 Tower kernel: md: using 2048k window, over a total of 2930266532 blocks.
Aug 31 11:42:32 Tower kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Aug 31 11:42:32 Tower kernel: ata3.00: irq_stat 0x08000000, interface fatal error
Aug 31 11:42:32 Tower kernel: ata3: SError: { UnrecovData 10B8B BadCRC }
Aug 31 11:42:32 Tower kernel: ata3.00: failed command: READ DMA
Aug 31 11:42:32 Tower kernel: ata3.00: cmd c8/00:10:f0:f2:05/00:00:00:00:00/e1 tag 0 dma 8192 in
Aug 31 11:42:32 Tower kernel:          res 50/00:00:ef:f2:05/00:00:01:00:00/e1 Emask 0x10 (ATA bus error)
Aug 31 11:42:32 Tower kernel: ata3.00: status: { DRDY }
Aug 31 11:42:32 Tower kernel: ata3: hard resetting link
Aug 31 11:42:32 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 31 11:42:32 Tower kernel: ata3.00: configured for UDMA/133
Aug 31 11:42:32 Tower kernel: ata3: EH complete
Aug 31 11:48:38 Tower kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Aug 31 11:48:38 Tower kernel: ata3.00: irq_stat 0x08000000, interface fatal error
Aug 31 11:48:38 Tower kernel: ata3: SError: { UnrecovData 10B8B BadCRC }
Aug 31 11:48:38 Tower kernel: ata3.00: failed command: READ DMA
Aug 31 11:48:38 Tower kernel: ata3.00: cmd c8/00:10:70:39:f4/00:00:00:00:00/e4 tag 0 dma 8192 in
Aug 31 11:48:38 Tower kernel:          res 50/00:00:6f:39:f4/00:00:04:00:00/e4 Emask 0x10 (ATA bus error)
Aug 31 11:48:38 Tower kernel: ata3.00: status: { DRDY }
Aug 31 11:48:38 Tower kernel: ata3: hard resetting link
Aug 31 11:48:38 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 31 11:48:38 Tower kernel: ata3.00: configured for UDMA/133
Aug 31 11:48:38 Tower kernel: ata3: EH complete
Aug 31 11:51:23 Tower kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Aug 31 11:51:23 Tower kernel: ata3.00: irq_stat 0x08000000, interface fatal error
Aug 31 11:51:23 Tower kernel: ata3: SError: { UnrecovData 10B8B BadCRC }
Aug 31 11:51:23 Tower kernel: ata3.00: failed command: READ DMA
Aug 31 11:51:23 Tower kernel: ata3.00: cmd c8/00:18:b0:d4:aa/00:00:00:00:00/e6 tag 0 dma 12288 in
Aug 31 11:51:23 Tower kernel:          res 50/00:00:af:d4:aa/00:00:06:00:00/e6 Emask 0x10 (ATA bus error)
Aug 31 11:51:23 Tower kernel: ata3.00: status: { DRDY }
Aug 31 11:51:23 Tower kernel: ata3: hard resetting link
Aug 31 11:51:23 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 31 11:51:23 Tower kernel: ata3.00: configured for UDMA/133
Aug 31 11:51:23 Tower kernel: ata3: EH complete
Aug 31 11:52:14 Tower kernel: ata3: limiting SATA link speed to 3.0 Gbps
Aug 31 11:52:14 Tower kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Aug 31 11:52:14 Tower kernel: ata3.00: irq_stat 0x08000000, interface fatal error
Aug 31 11:52:14 Tower kernel: ata3: SError: { UnrecovData 10B8B BadCRC }
Aug 31 11:52:14 Tower kernel: ata3.00: failed command: READ DMA
Aug 31 11:52:14 Tower kernel: ata3.00: cmd c8/00:10:f0:f5:32/00:00:00:00:00/e7 tag 0 dma 8192 in
Aug 31 11:52:14 Tower kernel:          res 50/00:00:ef:f5:32/00:00:07:00:00/e7 Emask 0x10 (ATA bus error)
Aug 31 11:52:14 Tower kernel: ata3.00: status: { DRDY }
Aug 31 11:52:14 Tower kernel: ata3: hard resetting link
Aug 31 11:52:15 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Aug 31 11:52:15 Tower kernel: ata3.00: configured for UDMA/133
Aug 31 11:52:15 Tower kernel: ata3: EH complete
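To see which sectors those failed READ DMA commands were actually touching, the `cmd` line can be decoded by hand. Here is a rough Python sketch, assuming the usual libata layout of that line (command/feature:count:LBA low:LBA mid:LBA high, then the five "previous" bytes, then the device register) and handling only the 28-bit READ DMA opcode 0xc8 seen here. The function name `decode_lba28` is just for illustration.

```python
import re

# Layout assumed: cmd CMD/FEAT:NSECT:LBAL:LBAM:LBAH/<5 hob bytes>/DEVICE
CMD_RE = re.compile(
    r"cmd (?P<cmd>[0-9a-f]{2})/(?P<feat>[0-9a-f]{2}):(?P<nsect>[0-9a-f]{2}):"
    r"(?P<lbal>[0-9a-f]{2}):(?P<lbam>[0-9a-f]{2}):(?P<lbah>[0-9a-f]{2})/"
    r"(?:[0-9a-f]{2}:){4}[0-9a-f]{2}/(?P<dev>[0-9a-f]{2})"
)


def decode_lba28(line):
    """Return (start_lba, sector_count) for a READ DMA cmd line, else None."""
    m = CMD_RE.search(line)
    if m is None:
        return None
    f = {key: int(value, 16) for key, value in m.groupdict().items()}
    if f["cmd"] != 0xC8:  # only the 28-bit READ DMA opcode is handled here
        return None
    # LBA28: bits 27..24 live in the low nibble of the device register.
    lba = ((f["dev"] & 0x0F) << 24) | (f["lbah"] << 16) | (f["lbam"] << 8) | f["lbal"]
    count = f["nsect"] or 256  # a count of 0 means 256 sectors in LBA28
    return lba, count


line = "Aug 31 11:42:32 Tower kernel: ata3.00: cmd c8/00:10:f0:f2:05/00:00:00:00:00/e1 tag 0 dma 8192 in"
start, count = decode_lba28(line)
print(f"LBA {start}..{start + count - 1} ({count} sectors, {count * 512} bytes)")
```

For the first failure this gives 16 sectors starting at LBA 17167088, and 16 x 512 = 8192 bytes, which agrees with the "dma 8192" in the same log line.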

 

http://pastebin.com/SpdwtR6D

 

 


Here is the SMART report for that drive (the new parity drive):

 

smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HDN724030ALE640
Serial Number:    PK1234P8JE9ABX
LU WWN Device Id: 5 000cca 22ce23adf
Firmware Version: MJ8OA5E0
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Aug 31 12:11:01 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   24) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 455) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       82
  3 Spin_Up_Time            0x0007   124   124   024    Pre-fail  Always       -       502 (Average 502)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   121   121   020    Pre-fail  Offline      -       34
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3370
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       17
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       17
194 Temperature_Celsius     0x0002   187   187   000    Old_age   Always       -       32 (Min/Max 24/37)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       2

SMART Error Log Version: 1
ATA Error Count: 2
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 3370 hours (140 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 ff f5 32 07  Error: ICRC, ABRT 1 sectors at LBA = 0x0732f5ff = 120780287

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 ff f5 32 e7 ff      11:35:32.891  READ DMA
  c8 00 10 f0 f5 32 e7 00      11:35:32.874  READ DMA
  c8 00 10 e0 f5 32 e7 00      11:35:32.874  READ DMA
  c8 00 08 d8 f5 32 e7 00      11:35:32.874  READ DMA
  c8 00 10 c8 f5 32 e7 00      11:35:32.874  READ DMA

Error 1 occurred at disk power-on lifetime: 3370 hours (140 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 ff f2 05 01  Error: ICRC, ABRT 1 sectors at LBA = 0x0105f2ff = 17167103

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 f0 f2 05 e1 00      11:25:52.941  READ DMA
  c8 00 10 e0 f2 05 e1 00      11:25:52.941  READ DMA
  c8 00 18 c8 f2 05 e1 00      11:25:52.941  READ DMA
  c8 00 10 b8 f2 05 e1 00      11:25:52.941  READ DMA
  c8 00 08 b0 f2 05 e1 00      11:25:52.941  READ DMA

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
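The register dump in the SMART error log above can be checked by hand as well. This is a minimal Python sketch that decodes the ER byte and rebuilds the 28-bit LBA from SN/CL/CH/DH. The bit table is only a subset of the ATA error register (the bits I'm sure of for DMA commands), and `decode_error_registers` is a name made up for this example.

```python
# Subset of ATA error-register bits as they apply to DMA commands.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_error_registers(er, sn, cl, ch, dh):
    """Return (list of error flag names, 28-bit LBA) from the logged registers."""
    flags = [name for bit, name in sorted(ERROR_BITS.items(), reverse=True) if er & bit]
    # The low nibble of DH holds LBA bits 27..24; CH, CL, SN hold the rest.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return flags, lba


# Registers copied from "Error 2" and "Error 1" in the report above.
for regs in [(0x84, 0xFF, 0xF5, 0x32, 0x07), (0x84, 0xFF, 0xF2, 0x05, 0x01)]:
    flags, lba = decode_error_registers(*regs)
    print(", ".join(flags), "at LBA", lba, hex(lba))
```

Both entries come out as ICRC plus ABRT, i.e. an interface CRC error on the transfer rather than an unreadable sector (UNC), which fits with UDMA_CRC_Error_Count being the only attribute with a nonzero error count.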


And here we go with the CPU stalls again (again, during a no-correct):

 

Aug 31 12:30:04 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=8426 c=8425 q=163)
Aug 31 12:30:04 Tower kernel: Pid: 3399, comm: unraidd Not tainted 3.9.11p-unRAID #5
Aug 31 12:30:04 Tower kernel: Call Trace:
Aug 31 12:30:04 Tower kernel:  [] print_cpu_stall+0xbc/0x107
Aug 31 12:30:04 Tower kernel:  [] __rcu_pending+0x4f/0x12a
Aug 31 12:30:04 Tower kernel:  [] rcu_check_callbacks+0x73/0x9b
Aug 31 12:30:04 Tower kernel:  [] update_process_times+0x2d/0x53
Aug 31 12:30:04 Tower kernel:  [] tick_sched_timer+0x77/0xa1
Aug 31 12:30:04 Tower kernel:  [] ? __remove_hrtimer+0x25/0x7a
Aug 31 12:30:04 Tower kernel:  [] __run_hrtimer+0x45/0xaf
Aug 31 12:30:04 Tower kernel:  [] hrtimer_interrupt+0xf1/0x1e7
Aug 31 12:30:04 Tower kernel:  [] ? _scsih_build_scatter_gather+0x238/0x25b [mpt2sas]
Aug 31 12:30:04 Tower kernel:  [] smp_apic_timer_interrupt+0x6d/0x7f
Aug 31 12:30:04 Tower kernel:  [] apic_timer_interrupt+0x2d/0x34
Aug 31 12:30:04 Tower kernel:  [] ? __slab_free+0x101/0x2a6
Aug 31 12:30:04 Tower kernel:  [] kmem_cache_free+0xaf/0xb7
Aug 31 12:30:04 Tower kernel:  [] ? scsi_pool_free_command+0x25/0x32
Aug 31 12:30:04 Tower kernel:  [] ? scsi_pool_free_command+0x25/0x32
Aug 31 12:30:04 Tower kernel:  [] scsi_pool_free_command+0x25/0x32
Aug 31 12:30:04 Tower kernel:  [] __scsi_put_command+0x4c/0x5a
Aug 31 12:30:04 Tower kernel:  [] scsi_put_command+0x4b/0x50
Aug 31 12:30:04 Tower kernel:  [] scsi_next_command+0x21/0x34
Aug 31 12:30:04 Tower kernel:  [] scsi_end_request+0x66/0x70
Aug 31 12:30:04 Tower kernel:  [] scsi_io_completion+0x1b0/0x421
Aug 31 12:30:04 Tower kernel:  [] ? scsi_device_unbusy+0x7c/0x82
Aug 31 12:30:04 Tower kernel:  [] scsi_finish_command+0x91/0x97
Aug 31 12:30:04 Tower kernel:  [] scsi_softirq_done+0xc5/0xcd
Aug 31 12:30:04 Tower kernel:  [] blk_done_softirq+0x4a/0x57
Aug 31 12:30:04 Tower kernel:  [] __do_softirq+0x94/0x151
Aug 31 12:30:04 Tower kernel:  [] ? ttwu_do_wakeup+0xf/0xaa
Aug 31 12:30:04 Tower kernel:  [] irq_exit+0x33/0x6c
Aug 31 12:30:04 Tower kernel:  [] do_IRQ+0x87/0x9b
Aug 31 12:30:04 Tower kernel:  [] ? xor_blocks+0x5b/0x7c
Aug 31 12:30:04 Tower kernel:  [] common_interrupt+0x2c/0x31
Aug 31 12:30:04 Tower kernel:  [] ? handle_stripe+0xb37/0xceb [md_mod]
Aug 31 12:30:04 Tower kernel:  [] ? __wake_up+0x3b/0x42
Aug 31 12:30:04 Tower kernel:  [] unraidd+0x71/0xb5 [md_mod]
Aug 31 12:30:04 Tower kernel:  [] md_thread+0xd3/0xea [md_mod]
Aug 31 12:30:04 Tower kernel:  [] ? wake_up_bit+0x5b/0x5b
Aug 31 12:30:04 Tower kernel:  [] kthread+0x90/0x95
Aug 31 12:30:04 Tower kernel:  [] ? import_device+0x166/0x166 [md_mod]
Aug 31 12:30:04 Tower kernel:  [] ret_from_kernel_thread+0x1b/0x28
Aug 31 12:30:04 Tower kernel:  [] ? kthread_freezable_should_stop+0x4a/0x4a
Aug 31 12:36:23 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=8460 c=8459 q=119)
Aug 31 12:36:23 Tower kernel: Pid: 3399, comm: unraidd Not tainted 3.9.11p-unRAID #5
Aug 31 12:36:23 Tower kernel: Call Trace:
Aug 31 12:36:23 Tower kernel:  [] print_cpu_stall+0xbc/0x107
Aug 31 12:36:23 Tower kernel:  [] __rcu_pending+0x4f/0x12a
Aug 31 12:36:23 Tower kernel:  [] rcu_check_callbacks+0x73/0x9b
Aug 31 12:36:23 Tower kernel:  [] update_process_times+0x2d/0x53
Aug 31 12:36:23 Tower kernel:  [] tick_sched_timer+0x77/0xa1
Aug 31 12:36:23 Tower kernel:  [] ? __remove_hrtimer+0x25/0x7a
Aug 31 12:36:23 Tower kernel:  [] __run_hrtimer+0x45/0xaf
Aug 31 12:36:23 Tower kernel:  [] hrtimer_interrupt+0xf1/0x1e7
Aug 31 12:36:23 Tower kernel:  [] smp_apic_timer_interrupt+0x6d/0x7f
Aug 31 12:36:23 Tower kernel:  [] apic_timer_interrupt+0x2d/0x34
Aug 31 12:36:23 Tower kernel:  [] ? xor_avx_5+0x148/0x34c
Aug 31 12:36:23 Tower kernel:  [] xor_blocks+0x74/0x7c
Aug 31 12:36:23 Tower kernel:  [] check_parity+0x96/0xcc [md_mod]
Aug 31 12:36:23 Tower kernel:  [] handle_stripe+0xa29/0xceb [md_mod]
Aug 31 12:36:23 Tower kernel:  [] ? __wake_up+0x3b/0x42
Aug 31 12:36:23 Tower kernel:  [] unraidd+0x71/0xb5 [md_mod]
Aug 31 12:36:23 Tower kernel:  [] md_thread+0xd3/0xea [md_mod]
Aug 31 12:36:23 Tower kernel:  [] ? wake_up_bit+0x5b/0x5b
Aug 31 12:36:23 Tower kernel:  [] kthread+0x90/0x95
Aug 31 12:36:23 Tower kernel:  [] ? import_device+0x166/0x166 [md_mod]
Aug 31 12:36:23 Tower kernel:  [] ret_from_kernel_thread+0x1b/0x28
Aug 31 12:36:23 Tower kernel:  [] ? kthread_freezable_should_stop+0x4a/0x4a

 

 

Link to comment

Because two stalls just weren't enough:

 

Aug 31 12:47:45 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=8566 c=8565 q=160)
Aug 31 12:47:45 Tower kernel: Pid: 3399, comm: unraidd Not tainted 3.9.11p-unRAID #5
Aug 31 12:47:45 Tower kernel: Call Trace:
Aug 31 12:47:45 Tower kernel:  [] print_cpu_stall+0xbc/0x107
Aug 31 12:47:45 Tower kernel:  [] __rcu_pending+0x4f/0x12a
Aug 31 12:47:45 Tower kernel:  [] rcu_check_callbacks+0x73/0x9b
Aug 31 12:47:45 Tower kernel:  [] update_process_times+0x2d/0x53
Aug 31 12:47:45 Tower kernel:  [] tick_sched_timer+0x77/0xa1
Aug 31 12:47:45 Tower kernel:  [] ? __remove_hrtimer+0x25/0x7a
Aug 31 12:47:45 Tower kernel:  [] __run_hrtimer+0x45/0xaf
Aug 31 12:47:45 Tower kernel:  [] hrtimer_interrupt+0xf1/0x1e7
Aug 31 12:47:45 Tower kernel:  [] smp_apic_timer_interrupt+0x6d/0x7f
Aug 31 12:47:45 Tower kernel:  [] ? xor_blocks+0x5b/0x7c
Aug 31 12:47:45 Tower kernel:  [] apic_timer_interrupt+0x2d/0x34
Aug 31 12:47:45 Tower kernel:  [] ? memcmp+0x17/0x25
Aug 31 12:47:45 Tower kernel:  [] handle_stripe+0xa4d/0xceb [md_mod]
Aug 31 12:47:45 Tower kernel:  [] ? __wake_up+0x3b/0x42
Aug 31 12:47:45 Tower kernel:  [] unraidd+0x71/0xb5 [md_mod]
Aug 31 12:47:45 Tower kernel:  [] md_thread+0xd3/0xea [md_mod]
Aug 31 12:47:45 Tower kernel:  [] ? wake_up_bit+0x5b/0x5b
Aug 31 12:47:45 Tower kernel:  [] kthread+0x90/0x95
Aug 31 12:47:45 Tower kernel:  [] ? import_device+0x166/0x166 [md_mod]
Aug 31 12:47:45 Tower kernel:  [] ret_from_kernel_thread+0x1b/0x28
Aug 31 12:47:45 Tower kernel:  [] ? kthread_freezable_should_stop+0x4a/0x4a
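
For anyone who wants to tally these reports without scrolling through each trace, here is a minimal Python sketch. It is not something posted in the thread; it assumes the syslog layout shown above, and the names `STALL_RE`, `FRAME_RE` and `summarize_stalls` are illustrative. For each stall it records the timestamp, CPU, jiffies value, and the first md_mod frame that is not marked `?` (the kernel uses `?` for frames it could not verify).

```python
import re

# Header line of an RCU stall report, as it appears in the syslog above.
STALL_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8}) \S+ kernel: INFO: rcu_sched self-detected stall "
    r"on CPU \{\s*(?P<cpu>\d+)\}\s+\(t=(?P<jiffies>\d+) jiffies"
)
# A stack frame tagged [md_mod]; frames prefixed with "? " do not match.
FRAME_RE = re.compile(
    r"kernel:\s+\[[^\]]*\] (?P<sym>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[md_mod\]"
)


def summarize_stalls(lines):
    """Return one dict per stall report found in the given syslog lines."""
    stalls = []
    for line in lines:
        header = STALL_RE.match(line)
        if header:
            stalls.append({
                "stamp": header["stamp"],
                "cpu": int(header["cpu"]),
                "jiffies": int(header["jiffies"]),
                "md_frame": None,  # filled in by the first matching frame below
            })
            continue
        frame = FRAME_RE.search(line)
        if frame and stalls and stalls[-1]["md_frame"] is None:
            stalls[-1]["md_frame"] = frame["sym"]
    return stalls


# A few lines copied from the traces above.
sample = [
    "Aug 31 12:36:23 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=8460 c=8459 q=119)",
    "Aug 31 12:36:23 Tower kernel:  [] xor_blocks+0x74/0x7c",
    "Aug 31 12:36:23 Tower kernel:  [] check_parity+0x96/0xcc [md_mod]",
    "Aug 31 12:47:45 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=8566 c=8565 q=160)",
    "Aug 31 12:47:45 Tower kernel:  [] ? memcmp+0x17/0x25",
    "Aug 31 12:47:45 Tower kernel:  [] handle_stripe+0xa4d/0xceb [md_mod]",
]

for stall in summarize_stalls(sample):
    print(stall)
```

In practice the function would be fed the lines of the full syslog file rather than a hand-copied sample.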

 

Link to comment

With all the posts in this thread, I expected to see various others helping you, but it just looks like you've been talking to yourself!  ;D

 

I'm not an authority on that aspect, but I do believe the stalls are harmless.

 

There is nothing wrong with "configured for UDMA/133" and the other link oddities; they are quite normal.

 

The one disk problem I see is the exception handler stoppages with the "UnrecovData 10B8B BadCRC" SATA error flags.  The key one there is the BadCRC flag, which usually means a bad SATA cable (a quick and cheap fix!).  Replacing it with a good SATA cable almost always fixes the problem.  The SMART report has a corresponding UDMA_CRC_Error_Count to match.  There is no problem with the drive itself, and none of the drive-related issues have anything to do with the stalls.
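
To see which port the BadCRC flags are landing on, and how often, a syslog can be scanned for the SError lines. The following is a rough Python sketch under the assumption that the lines look like the `ataN: SError: { ... }` entries in this thread; `SERROR_RE` and `count_serror_flag` are made-up names for illustration.

```python
import re
from collections import Counter

# Matches e.g. "kernel: ata3: SError: { UnrecovData 10B8B BadCRC }"
SERROR_RE = re.compile(r"kernel: (?P<port>ata\d+): SError: \{(?P<flags>[^}]*)\}")


def count_serror_flag(lines, flag="BadCRC"):
    """Count, per ATA port, the SError lines that carry the given flag."""
    counts = Counter()
    for line in lines:
        m = SERROR_RE.search(line)
        if m and flag in m["flags"].split():
            counts[m["port"]] += 1
    return counts


# Two SError lines copied from the syslog in this thread.
sample = [
    "Aug 31 11:42:32 Tower kernel: ata3: SError: { UnrecovData 10B8B BadCRC }",
    "Aug 31 11:48:38 Tower kernel: ata3: SError: { UnrecovData 10B8B BadCRC }",
]

print(count_serror_flag(sample))
```

If every hit is on a single port, that points at that port's cable or connection first, which matches the one-cable suggestion above.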

Link to comment

Rob, thank you for the reply.  As for being the only one posting in this thread, it's OK: I just pretended I was talking to my wife, she rarely listens either. :P

 

On a serious note, and back on topic: the no-correct finished just a few minutes ago, with no sync errors.

 

I've got plenty of breakout cables to choose from, and the ones in the machine now are brand new (not to say they can't be bad).  I'll keep an eye on it.  I do have a theory, though, about why I may have gotten that error (perhaps not a theory, since that implies some knowledge of the subject -- a "guess", then?).

 

Aug 31 11:42:32 Tower kernel: ata3: SError: { UnrecovData 10B8B BadCRC }
Aug 31 11:42:32 Tower kernel: ata3.00: failed command: READ DMA
Aug 31 11:42:32 Tower kernel: ata3.00: cmd c8/00:10:f0:f2:05/00:00:00:00:00/e1 tag 0 dma 8192 in
Aug 31 11:42:32 Tower kernel:          res 50/00:00:ef:f2:05/00:00:01:00:00/e1 Emask 0x10 (ATA bus error)
Aug 31 11:42:32 Tower kernel: ata3.00: status: { DRDY }
Aug 31 11:42:32 Tower kernel: ata3: hard resetting link
Aug 31 11:42:32 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 31 11:42:32 Tower kernel: ata3.00: configured for UDMA/133
Aug 31 11:42:32 Tower kernel: ata3: EH complete
Aug 31 11:48:38 Tower kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Aug 31 11:48:38 Tower kernel: ata3.00: irq_stat 0x08000000, interface fatal error
Aug 31 11:48:38 Tower kernel: ata3: SError: { UnrecovData 10B8B BadCRC }
Aug 31 11:48:38 Tower kernel: ata3.00: failed command: READ DMA
Aug 31 11:48:38 Tower kernel: ata3.00: cmd c8/00:10:70:39:f4/00:00:00:00:00/e4 tag 0 dma 8192 in
Aug 31 11:48:38 Tower kernel:          res 50/00:00:6f:39:f4/00:00:04:00:00/e4 Emask 0x10 (ATA bus error)
Aug 31 11:48:38 Tower kernel: ata3.00: status: { DRDY }
Aug 31 11:48:38 Tower kernel: ata3: hard resetting link
Aug 31 11:48:38 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 31 11:48:38 Tower kernel: ata3.00: configured for UDMA/133
Aug 31 11:48:38 Tower kernel: ata3: EH complete
Aug 31 11:51:23 Tower kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Aug 31 11:51:23 Tower kernel: ata3.00: irq_stat 0x08000000, interface fatal error
Aug 31 11:51:23 Tower kernel: ata3: SError: { UnrecovData 10B8B BadCRC }
Aug 31 11:51:23 Tower kernel: ata3.00: failed command: READ DMA
Aug 31 11:51:23 Tower kernel: ata3.00: cmd c8/00:18:b0:d4:aa/00:00:00:00:00/e6 tag 0 dma 12288 in
Aug 31 11:51:23 Tower kernel:          res 50/00:00:af:d4:aa/00:00:06:00:00/e6 Emask 0x10 (ATA bus error)
Aug 31 11:51:23 Tower kernel: ata3.00: status: { DRDY }
Aug 31 11:51:23 Tower kernel: ata3: hard resetting link
Aug 31 11:51:23 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 31 11:51:23 Tower kernel: ata3.00: configured for UDMA/133
Aug 31 11:51:23 Tower kernel: ata3: EH complete
Aug 31 11:52:14 Tower kernel: ata3: limiting SATA link speed to 3.0 Gbps
Aug 31 11:52:14 Tower kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Aug 31 11:52:14 Tower kernel: ata3.00: irq_stat 0x08000000, interface fatal error
Aug 31 11:52:14 Tower kernel: ata3: SError: { UnrecovData 10B8B BadCRC }
Aug 31 11:52:14 Tower kernel: ata3.00: failed command: READ DMA
Aug 31 11:52:14 Tower kernel: ata3.00: cmd c8/00:10:f0:f5:32/00:00:00:00:00/e7 tag 0 dma 8192 in
Aug 31 11:52:14 Tower kernel:          res 50/00:00:ef:f5:32/00:00:07:00:00/e7 Emask 0x10 (ATA bus error)
Aug 31 11:52:14 Tower kernel: ata3.00: status: { DRDY }
Aug 31 11:52:14 Tower kernel: ata3: hard resetting link
Aug 31 11:52:15 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Aug 31 11:52:15 Tower kernel: ata3.00: configured for UDMA/133
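
As a reading aid for the `cmd` lines above, here is a small Python sketch that pulls the starting sector and transfer size out of a failed READ DMA entry. It rests on two assumptions that are not stated anywhere in this thread: that the fields are printed in the order command/feature:count:LBA-low:LBA-mid:LBA-high/(five extended bytes)/device, and that the command is a 28-bit LBA one such as READ DMA (0xc8). `CMD_RE` and `decode_lba28_cmd` are illustrative names.

```python
import re

# Matches the taskfile dump after "cmd", e.g. c8/00:10:f0:f2:05/00:00:00:00:00/e1
CMD_RE = re.compile(
    r"cmd (?P<tf>[0-9a-f]{2}/(?:[0-9a-f]{2}:){4}[0-9a-f]{2}"
    r"/(?:[0-9a-f]{2}:){4}[0-9a-f]{2}/[0-9a-f]{2})"
)


def decode_lba28_cmd(line, sector_size=512):
    """Decode command, start LBA and size from a 28-bit LBA 'cmd' line."""
    m = CMD_RE.search(line)
    if not m:
        return None
    cmd, low, _extended, dev = m["tf"].split("/")
    _feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    device = int(dev, 16)
    # In 28-bit addressing the top four LBA bits live in the device byte.
    lba = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    sectors = nsect or 256  # a count of 0 means 256 sectors for these commands
    return {
        "command": int(cmd, 16),
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * sector_size,
    }


# Two cmd lines copied from the log above.
first = "Aug 31 11:42:32 Tower kernel: ata3.00: cmd c8/00:10:f0:f2:05/00:00:00:00:00/e1 tag 0 dma 8192 in"
third = "Aug 31 11:51:23 Tower kernel: ata3.00: cmd c8/00:18:b0:d4:aa/00:00:00:00:00/e6 tag 0 dma 12288 in"

print(decode_lba28_cmd(first))
print(decode_lba28_cmd(third))
```

The decoded byte counts can be compared against the `dma 8192` and `dma 12288` values printed on the same lines as a sanity check on the field order.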

 

After the no-correct finished, I checked whether I had done what I suspected.  In short, I had plugged the breakout cable lead for the parity drive (P3) into a 3Gbps SATA port on the mobo, forgetting that Norco reverses the numbering of their ports (or what I would consider reversed: port 1 is on the right of the plane when looking at it from the front, and port 4 is on the left).

 

The SATA 3.0 drive is now on a SATA 3.0 port.  I'm not sure it will make a bit of difference, but do you think (since it came shortly after the drive error) that the CPU stall was caused by the drive error as well?  If not, what do you believe caused the CPU stall?  I Googled and was conquered.  I haven't a clue what I'm reading; most of the posts I found simply refer to the "module"(?) generating the stall report, which in and of itself is Greek to me.
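
One way to check what speed a port actually negotiated is to look at the SStatus and SControl values on the "SATA link up" lines in the log above. The Python sketch below assumes those values are printed in hexadecimal and that bits 4-7 of each register hold the speed field (negotiated speed in SStatus, allowed maximum in SControl); that layout is my reading of the SATA registers, not something confirmed in this thread. `LINK_RE` and `decode_link` are illustrative names.

```python
import re

LINK_RE = re.compile(
    r"SATA link up [\d.]+ Gbps "
    r"\(SStatus (?P<ss>[0-9A-Fa-f]+) SControl (?P<sc>[0-9A-Fa-f]+)\)"
)

# Assumed meaning of the 4-bit speed field in each register.
SPD_NEGOTIATED = {0: "none", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}
SPD_LIMIT = {0: "none", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_link(line):
    """Return the negotiated speed and the configured speed limit, if any."""
    m = LINK_RE.search(line)
    if not m:
        return None
    sstatus = int(m["ss"], 16)
    scontrol = int(m["sc"], 16)
    return {
        "negotiated": SPD_NEGOTIATED.get((sstatus >> 4) & 0xF, "unknown"),
        "limit": SPD_LIMIT.get((scontrol >> 4) & 0xF, "unknown"),
    }


# Two link-up lines copied from the log above.
before = "Aug 31 11:51:23 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)"
after = "Aug 31 11:52:15 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)"

print(decode_link(before))
print(decode_link(after))
```

Under these assumptions, the second line would reflect the kernel's "limiting SATA link speed to 3.0 Gbps" step rather than a property of the port itself.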

 

Link to comment
I've got plenty of breakout cables to choose from, the ones in the machine now are brand new (not to say they can't be bad).

I'd try replacing one cable at a time until the BadCRC error no longer occurs.  In this case, an older but tested cable is better.  A CRC error is relatively minor: the transfer is retried until the packet gets through, with only a small delay.  I don't believe speed or port choice matters at all unless a port is faulty, and I don't think that is the issue here.
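
A simple way to judge each cable swap is to note the drive's UDMA_CRC_Error_Count before the swap and again after some disk activity; if the raw value has stopped climbing, the swap helped. The Python sketch below parses the attribute table text that `smartctl -A` prints. The two sample rows are made up purely for illustration (they are not from the drive in this thread), and `crc_error_count` is a hypothetical helper name.

```python
def crc_error_count(smartctl_output):
    """Return the raw UDMA_CRC_Error_Count from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value is the last one.
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            return int(fields[9])
    return None


# Illustrative rows only, not real readings from this system.
before_swap = "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4"
after_swap = "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4"

delta = crc_error_count(after_swap) - crc_error_count(before_swap)
print("new CRC errors since the swap:", delta)
```

The raw count does not reset when the cable is replaced, so it is the change between readings that matters, not the absolute number.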

 

do you think (since it was shortly after the drive error) that the CPU Stall was caused by the drive error as well?  Or what do you believe caused the CPU Stall?

I really don't see a connection between the disk error and the CPU stall.  I don't know anything about CPU stalls.  I vaguely remember trying to research them and not finding anything useful from a troubleshooting standpoint.  Perhaps someone else can help, but until then, I would ignore them.

Link to comment

Archived

This topic is now archived and is closed to further replies.
