Skip to content
View in the app

A better way to browse. Learn more.

Unraid

A full-screen app on your home screen with push notifications, badges and more.

To install this app on iOS and iPadOS
  1. Tap the Share icon in Safari
  2. Scroll the menu and tap Add to Home Screen.
  3. Tap Add in the top-right corner.
To install this app on Android
  1. Tap the 3-dot menu (⋮) in the top-right corner of the browser.
  2. Tap Add to Home screen or Install app.
  3. Confirm by tapping Install.

Hard crash during rsync (with syslog)

Featured Replies

Several of us have reported 'random' or in my instance rsync copy associated hard crashes.

 

Today I learned how to capture an active syslog with tail -f syslog, so I did that this evening.

 

6b8 crashed during the day while idle.  I turned it on tonight, with parity check running, and set up an rsync from a reiserfs array volume to a xfs array volume.  Then I took the honey out to dinner.  Less than three hours later, unRaid 6b8 reliably crashed.

 

Here is the syslog captured by ssh:

root@Tower:~# tail -f /var/log/syslog 
Sep  4 18:15:01 Tower cache_dirs: max_seconds=10, min_seconds=1
Sep  4 18:15:01 Tower cache_dirs: max_depth=6
Sep  4 18:15:01 Tower cache_dirs: command=find -noleaf
Sep  4 18:15:01 Tower cache_dirs: version=1.6.9
Sep  4 18:15:01 Tower cache_dirs: ---------- caching directories ---------------
Sep  4 18:15:01 Tower cache_dirs: 
Sep  4 18:15:01 Tower cache_dirs: ----------------------------------------------
Sep  4 18:15:01 Tower cache_dirs: ERROR: included directory "/mnt" does not exist.
Sep  4 18:15:01 Tower cache_dirs: cache_dirs process ID 4297 started, To terminate it, type: cache_dirs -q
Sep  4 18:15:34 Tower sshd[4332]: Accepted publickey for root from 192.168.1.1 port 64641 ssh2: RSA 36:1e:36:55:3c:a6:e9:c4:03:0c:7a:78:7f:a9:3f:57
Sep  4 18:15:53 Tower kernel: REISERFS (device md2): replayed 25 transactions in 98 seconds
Sep  4 18:15:53 Tower kernel: REISERFS (device md2): Using r5 hash to sort names
Sep  4 18:15:53 Tower emhttp: shcmd (32): set -o pipefail ; mount -t reiserfs -o remount,user_xattr,acl,noatime,nodiratime /dev/md2 /mnt/disk2 |& logger
Sep  4 18:15:53 Tower emhttp: shcmd (33): mkdir -p /mnt/disk3
Sep  4 18:15:53 Tower emhttp: shcmd (34): set -o pipefail ; mount -t reiserfs -o noatime,nodiratime /dev/md3 /mnt/disk3 |& logger
Sep  4 18:15:53 Tower kernel: REISERFS (device md3): found reiserfs format "3.6" with standard journal
Sep  4 18:15:53 Tower kernel: REISERFS (device md3): using ordered data mode
Sep  4 18:15:53 Tower kernel: reiserfs: using flush barriers
Sep  4 18:15:53 Tower kernel: REISERFS (device md3): journal params: device md3, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Sep  4 18:15:53 Tower kernel: REISERFS (device md3): checking transaction log (md3)
Sep  4 18:15:53 Tower kernel: REISERFS (device md3): replayed 2 transactions in 0 seconds
Sep  4 18:15:53 Tower kernel: REISERFS (device md3): Using r5 hash to sort names
Sep  4 18:15:53 Tower emhttp: shcmd (35): set -o pipefail ; mount -t reiserfs -o remount,user_xattr,acl,noatime,nodiratime /dev/md3 /mnt/disk3 |& logger
Sep  4 18:15:53 Tower emhttp: shcmd (36): mkdir -p /mnt/disk4
Sep  4 18:15:53 Tower emhttp: shcmd (37): set -o pipefail ; mount -t xfs -o noatime,nodiratime /dev/md4 /mnt/disk4 |& logger
Sep  4 18:15:53 Tower kernel: XFS (md4): Mounting V4 Filesystem
Sep  4 18:15:54 Tower kernel: XFS (md4): Starting recovery (logdev: internal)
Sep  4 18:15:54 Tower kernel: XFS (md4): Ending recovery (logdev: internal)
Sep  4 18:15:54 Tower emhttp: shcmd (38): mkdir -p /mnt/disk5
Sep  4 18:15:54 Tower emhttp: shcmd (39): set -o pipefail ; mount -t reiserfs -o noatime,nodiratime /dev/md5 /mnt/disk5 |& logger
Sep  4 18:15:54 Tower kernel: REISERFS (device md5): found reiserfs format "3.6" with standard journal
Sep  4 18:15:54 Tower kernel: REISERFS (device md5): using ordered data mode
Sep  4 18:15:54 Tower kernel: reiserfs: using flush barriers
Sep  4 18:15:54 Tower kernel: REISERFS (device md5): journal params: device md5, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Sep  4 18:15:54 Tower kernel: REISERFS (device md5): checking transaction log (md5)
Sep  4 18:15:54 Tower kernel: REISERFS (device md5): replayed 2 transactions in 0 seconds
Sep  4 18:15:54 Tower kernel: REISERFS (device md5): Using r5 hash to sort names
Sep  4 18:15:54 Tower emhttp: shcmd (40): set -o pipefail ; mount -t reiserfs -o remount,user_xattr,acl,noatime,nodiratime /dev/md5 /mnt/disk5 |& logger
Sep  4 18:15:54 Tower emhttp: shcmd (41): mkdir -p /mnt/disk6
Sep  4 18:15:54 Tower emhttp: shcmd (42): set -o pipefail ; mount -t xfs -o noatime,nodiratime /dev/md6 /mnt/disk6 |& logger
Sep  4 18:15:54 Tower kernel: XFS (md6): Mounting V4 Filesystem
Sep  4 18:15:55 Tower kernel: XFS (md6): Starting recovery (logdev: internal)
Sep  4 18:15:55 Tower emhttp: shcmd (43): mkdir -p /mnt/cache
Sep  4 18:15:55 Tower kernel: XFS (md6): Ending recovery (logdev: internal)
Sep  4 18:15:57 Tower emhttp: shcmd (44): set -o pipefail ; mount -t btrfs -o noatime,nodiratime /dev/sdi1 /mnt/cache |& logger
Sep  4 18:15:57 Tower kernel: BTRFS info (device sdi1): disk space caching is enabled
Sep  4 18:15:57 Tower kernel: BTRFS: detected SSD devices, enabling SSD mode
Sep  4 18:15:58 Tower emhttp: shcmd (45): mkdir /mnt/user0
Sep  4 18:15:58 Tower emhttp: shcmd (46): /usr/local/sbin/shfs /mnt/user0 -disks 16777214 -o noatime,big_writes,allow_other  |& logger
Sep  4 18:15:58 Tower emhttp: shcmd (47): mkdir /mnt/user
Sep  4 18:15:58 Tower emhttp: shcmd (48): /usr/local/sbin/shfs /mnt/user -disks 16777215 1024 -o noatime,big_writes,allow_other -o remember=330  |& logger
Sep  4 18:15:58 Tower emhttp: shcmd (49): crontab -c /etc/cron.d - <<< "# Generated mover schedule: 30 2 * * 5 /usr/local/sbin/mover |& logger"
Sep  4 18:15:58 Tower emhttp: shcmd (50): /usr/local/sbin/emhttp_event disks_mounted
Sep  4 18:15:58 Tower emhttp_event: disks_mounted
Sep  4 18:15:58 Tower logger: Starting docker.io
Sep  4 18:15:58 Tower logger: Maybe expand image file
Sep  4 18:15:58 Tower kernel: BTRFS: device fsid 0ab81931-36e8-49c4-924b-6ca691a81a8c devid 1 transid 1661 /dev/loop8
Sep  4 18:15:58 Tower kernel: BTRFS info (device loop8): disk space caching is enabled
Sep  4 18:15:58 Tower logger: Resize '/var/lib/docker' of 'max'
Sep  4 18:15:58 Tower logger: starting docker ...
Sep  4 18:15:58 Tower kernel: BTRFS: new size for /dev/loop8 is 8589934592
Sep  4 18:15:58 Tower kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Sep  4 18:15:58 Tower avahi-daemon[3955]: Joining mDNS multicast group on interface docker0.IPv4 with address 172.17.42.1.
Sep  4 18:15:58 Tower avahi-daemon[3955]: New relevant interface docker0.IPv4 for mDNS.
Sep  4 18:15:58 Tower avahi-daemon[3955]: Registering new address record for 172.17.42.1 on docker0.IPv4.
Sep  4 18:16:00 Tower ntpd[1357]: Listen normally on 3 docker0 172.17.42.1 UDP 123
Sep  4 18:16:00 Tower ntpd[1357]: peers refreshed
Sep  4 18:16:00 Tower ntpd[1357]: new interface(s) found: waking up resolver
Sep  4 18:16:00 Tower kernel: device veth5b16 entered promiscuous mode
Sep  4 18:16:00 Tower avahi-daemon[3955]: Withdrawing workstation service for veth75fb.
Sep  4 18:16:00 Tower kernel: docker0: port 1(veth5b16) entered forwarding state
Sep  4 18:16:00 Tower kernel: docker0: port 1(veth5b16) entered forwarding state
Sep  4 18:16:00 Tower logger: CouchPotato: started succesfully!
Sep  4 18:16:00 Tower logger: Dropbox: started succesfully!
Sep  4 18:16:00 Tower kernel: device veth22d7 entered promiscuous mode
Sep  4 18:16:00 Tower kernel: docker0: port 2(veth22d7) entered forwarding state
Sep  4 18:16:00 Tower kernel: docker0: port 2(veth22d7) entered forwarding state
Sep  4 18:16:00 Tower avahi-daemon[3955]: Withdrawing workstation service for veth7a1e.
Sep  4 18:16:01 Tower logger: SABnzbd: started succesfully!
Sep  4 18:16:01 Tower kernel: device vethdefa entered promiscuous mode
Sep  4 18:16:01 Tower kernel: docker0: port 3(vethdefa) entered forwarding state
Sep  4 18:16:01 Tower kernel: docker0: port 3(vethdefa) entered forwarding state
Sep  4 18:16:01 Tower avahi-daemon[3955]: Withdrawing workstation service for veth1072.
Sep  4 18:16:01 Tower logger: SickBeard: started succesfully!
Sep  4 18:16:01 Tower emhttp: unclean shutdown detected
Sep  4 18:16:01 Tower kernel: mdcmd (43): check CORRECT
Sep  4 18:16:01 Tower kernel: md: recovery thread woken up ...
Sep  4 18:16:01 Tower kernel: md: recovery thread checking parity...
Sep  4 18:16:01 Tower kernel: md: using 1536k window, over a total of 3907018532 blocks.
Sep  4 18:16:03 Tower emhttp: shcmd (51): :>/etc/samba/smb-shares.conf
Sep  4 18:16:03 Tower emhttp: shcmd (52): cp /etc/exports- /etc/exports
Sep  4 18:16:03 Tower avahi-daemon[3955]: Files changed, reloading.
Sep  4 18:16:03 Tower emhttp: shcmd (53): cp /etc/netatalk/AppleVolumes.default- /etc/netatalk/AppleVolumes.default
Sep  4 18:16:03 Tower avahi-daemon[3955]: Files changed, reloading.
Sep  4 18:16:03 Tower last message repeated 2 times
Sep  4 18:16:03 Tower emhttp: shcmd (54): echo '"/mnt/user/Media" -async,no_subtree_check,fsid=101 *(sec=sys,ro,insecure,anongid=100,anonuid=99,all_squash)' >>/etc/exports
Sep  4 18:16:03 Tower avahi-daemon[3955]: Files changed, reloading.
Sep  4 18:16:03 Tower emhttp: Restart SMB...
Sep  4 18:16:03 Tower emhttp: shcmd (55): killall -HUP smbd
Sep  4 18:16:03 Tower emhttp: shcmd (56): cp /etc/avahi/services/smb.service- /etc/avahi/services/smb.service
Sep  4 18:16:03 Tower avahi-daemon[3955]: Files changed, reloading.
Sep  4 18:16:03 Tower avahi-daemon[3955]: Service group file /etc/avahi/services/smb.service changed, reloading.
Sep  4 18:16:03 Tower emhttp: shcmd (57): ps axc | grep -q rpc.mountd
Sep  4 18:16:03 Tower emhttp: _shcmd: shcmd (57): exit status: 1
Sep  4 18:16:03 Tower emhttp: Start NFS...
Sep  4 18:16:03 Tower emhttp: shcmd (58): /etc/rc.d/rc.nfsd start |& logger
Sep  4 18:16:04 Tower logger: Starting NFS server daemons:
Sep  4 18:16:04 Tower logger:   /usr/sbin/exportfs -r
Sep  4 18:16:04 Tower logger:   /usr/sbin/rpc.nfsd 8
Sep  4 18:16:04 Tower logger:   /usr/sbin/rpc.mountd
Sep  4 18:16:04 Tower rpc.mountd[4960]: Version 1.2.8 starting
Sep  4 18:16:04 Tower emhttp: Restart AFP...
Sep  4 18:16:04 Tower emhttp: shcmd (59): killall -HUP afpd
Sep  4 18:16:04 Tower emhttp: shcmd (60): cp /etc/avahi/services/afp.service- /etc/avahi/services/afp.service
Sep  4 18:16:04 Tower avahi-daemon[3955]: Files changed, reloading.
Sep  4 18:16:04 Tower avahi-daemon[3955]: Service group file /etc/avahi/services/afp.service changed, reloading.
Sep  4 18:16:04 Tower emhttp: shcmd (61): /usr/local/sbin/emhttp_event svcs_restarted
Sep  4 18:16:04 Tower emhttp_event: svcs_restarted
Sep  4 18:16:04 Tower emhttp: shcmd (62): /usr/local/sbin/emhttp_event started
Sep  4 18:16:04 Tower emhttp_event: started
Sep  4 18:16:04 Tower rc.unRAID[4970][4971]: Processing /etc/rc.d/rc.unRAID.d/ start scripts.
Sep  4 18:16:04 Tower avahi-daemon[3955]: Service "Tower" (/etc/avahi/services/smb.service) successfully established.
Sep  4 18:16:05 Tower avahi-daemon[3955]: Service "Tower-AFP" (/etc/avahi/services/afp.service) successfully established.
Sep  4 18:16:15 Tower kernel: docker0: port 1(veth5b16) entered forwarding state
Sep  4 18:16:15 Tower kernel: docker0: port 2(veth22d7) entered forwarding state
Sep  4 18:16:16 Tower kernel: docker0: port 3(vethdefa) entered forwarding state
Sep  4 18:17:35 Tower login[4205]: invalid password for 'root'  on '/dev/tty1'
Sep  4 18:17:45 Tower login[4205]: ROOT LOGIN  on '/dev/tty1'
Sep  4 18:18:12 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  4 18:18:12 Tower kernel: ata7.00: failed command: IDENTIFY DEVICE
Sep  4 18:18:12 Tower kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 10 pio 512 in
Sep  4 18:18:12 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 18:18:12 Tower kernel: ata7.00: status: { DRDY }
Sep  4 18:18:12 Tower kernel: ata7: hard resetting link
Sep  4 18:18:13 Tower kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep  4 18:18:13 Tower kernel: ata7.00: configured for UDMA/133
Sep  4 18:18:13 Tower kernel: ata7: EH complete
Sep  4 18:20:28 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  4 18:20:28 Tower kernel: ata7.00: failed command: SMART
Sep  4 18:20:28 Tower kernel: ata7.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 3 pio 512 in
Sep  4 18:20:28 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 18:20:28 Tower kernel: ata7.00: status: { DRDY }
Sep  4 18:20:28 Tower kernel: ata7: hard resetting link
Sep  4 18:20:29 Tower kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep  4 18:20:29 Tower kernel: ata7.00: configured for UDMA/133
Sep  4 18:20:29 Tower kernel: ata7: EH complete
Sep  4 18:20:51 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  4 18:20:51 Tower kernel: ata7.00: failed command: IDENTIFY DEVICE
Sep  4 18:20:51 Tower kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 10 pio 512 in
Sep  4 18:20:51 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 18:20:51 Tower kernel: ata7.00: status: { DRDY }
Sep  4 18:20:51 Tower kernel: ata7: hard resetting link
Sep  4 18:20:52 Tower kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep  4 18:20:52 Tower kernel: ata7.00: configured for UDMA/133
Sep  4 18:20:52 Tower kernel: ata7: EH complete
Read from remote host <ip address>: Operation timed out

 

Sure looking forward to a server that stays up more than three hours.

 

D

 

Your log looks almost identical to mine.  I have been doing some testing on mine with shutting down samba and it hasn't crashed yet.  I noticed while capturing top that smbd was the top process when mine hung.  I shutdown smbd via "/etc/rc.d/rc.samba stop" and ran a parity check over night, and have been running an rsync for about 1 hours, and it hasn't crashed yet.  I wonder if you would get the same result.

Sep  4 18:18:12 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  4 18:18:12 Tower kernel: ata7.00: failed command: IDENTIFY DEVICE
Sep  4 18:18:12 Tower kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 10 pio 512 in
Sep  4 18:18:12 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 18:18:12 Tower kernel: ata7.00: status: { DRDY }
Sep  4 18:18:12 Tower kernel: ata7: hard resetting link
Sep  4 18:18:13 Tower kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep  4 18:18:13 Tower kernel: ata7.00: configured for UDMA/133

 

These are problems with your hard drives, the cables to the hard drives, the controller and/or possibly a driver.

 

rsync will work your drive and eat up allot of memory if you have allot of files.

This is a known, you may need to drop the cache before doing a very large rsync.

But that's if you get OOM killer messages.

 

In addition, turn off the cache_dirs if you plan to do a large rsync.

 

Sep  4 18:15:01 Tower cache_dirs: max_seconds=10, min_seconds=1
Sep  4 18:15:01 Tower cache_dirs: max_depth=6
Sep  4 18:15:01 Tower cache_dirs: command=find -noleaf
Sep  4 18:15:01 Tower cache_dirs: version=1.6.9
Sep  4 18:15:01 Tower cache_dirs: ---------- caching directories ---------------
Sep  4 18:15:01 Tower cache_dirs: 
Sep  4 18:15:01 Tower cache_dirs: ----------------------------------------------

 

What's happening is you are making your drive scramble if the buffer cache is getting flushed from the large rsync.

 

On top of that, there is some other hardware (or driver) issue.

In reality, the software and hardware should deal with the scrambling and work load.

Albeit slow.  What's going here is something is just on the edge of acceptable behavior.

 

So you may need to do a SMART long test on the drive in question just to have peace of mind.

Check the logs look for pending and unrecoverable/reallocated sectors.

 

Double check your cables.

Make sure you are using quality cables and they are not near something which could cause interference.

 

It's not Samba that is making the hard drive disconnect and reset the SATA.

It's the workload/scrambling and something being marginal in performance.  (Could be a Sata driver timing issue too).

 

But without a SMART test and report it's hard to tell.

  • Author

Attached is a  syslog of a similar crash during rsync.   

 

Note  at around 00:57

Sep  5 00:57:04 Tower kernel: kernel BUG at drivers/md/unraid.c:461!
Sep  5 00:57:04 Tower kernel: invalid opcode: 0000 [#1] SMP 

 

There are similar ata issues to the previous crash that weebo pointed out, though different drives (ata10 is what i see, while it was ata7 before).

 

How do the drives identified in the syslog as  'ata7' correlate with the devices sda, sdb, etc?

 

The cables are from the server hardware from LimeTech, but i have a PCI 4 channel sata card that has one bad channel.  I checked them all last week, but..

 

I'll get smart reports soon, and try without cach_dirs on

 

Thanks

 

D

 

edit:  Since this crash I can't keep the server up for more than a couple of minutes with rsync + parity check running on either reiser >XFS or reiser>reiserfs copy, with or without cache_dirs.

 

gotta go to work...

Syslog_9_5_14.txt

Well I stand corrected here in that it could be an unRAID driver issue too.

however there are some hardware errors that keep cropping up.

ata10.00, ata7.00, ata9.00.

 

That's three drives, which could possibly be on the same controller.

Without seeing the whole syslog I can't determine what controller and/or hardware.

 

However the kernel bug and the hardware issue may or may not be unrelated to the drives.

The time is much later then the drive errors being reported.

 

Perhaps an smp issue with the unraid driver.

 

Which version of unRAID are you using? (I'm assuming unRAID 6 Beta-8 from prior post).

Also can you describe your hardware?

 

Sep  4 21:07:03 Tower sshd[6115]: Received disconnect from 24.125.219.59: 11: 
Sep  4 21:18:51 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  4 21:18:51 Tower kernel: ata10.00: failed command: IDENTIFY DEVICE
Sep  4 21:18:51 Tower kernel: ata10.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 29 pio 512 in
Sep  4 21:18:51 Tower kernel:          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 21:18:51 Tower kernel: ata10.00: status: { DRDY }
Sep  4 21:18:51 Tower kernel: ata10: hard resetting link
Sep  4 21:18:52 Tower kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep  4 21:18:52 Tower kernel: ata10.00: configured for UDMA/133
Sep  4 21:18:52 Tower kernel: ata10: EH complete
Sep  4 21:21:01 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  4 21:21:01 Tower kernel: ata7.00: failed command: IDENTIFY DEVICE
Sep  4 21:21:01 Tower kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 27 pio 512 in
Sep  4 21:21:01 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 21:21:01 Tower kernel: ata7.00: status: { DRDY }
Sep  4 21:21:01 Tower kernel: ata7: hard resetting link
Sep  4 21:21:01 Tower kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep  4 21:21:01 Tower kernel: ata7.00: configured for UDMA/133
Sep  4 21:21:01 Tower kernel: ata7: EH complete
Sep  4 21:26:02 Tower sshd[21258]: Failed password for root from 122.225.109.101 port 39804 ssh2
...
ep  4 21:31:01 Tower php: /usr/bin/docker start mbserver
Sep  4 21:31:02 Tower php: mbserver 
Sep  4 21:31:02 Tower php: 
Sep  4 21:31:22 Tower kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  4 21:31:22 Tower kernel: ata9.00: failed command: IDENTIFY DEVICE
Sep  4 21:31:22 Tower kernel: ata9.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 18 pio 512 in
Sep  4 21:31:22 Tower kernel:          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 21:31:22 Tower kernel: ata9.00: status: { DRDY }
Sep  4 21:31:22 Tower kernel: ata9: hard resetting link
Sep  4 21:31:23 Tower kernel: ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep  4 21:31:23 Tower kernel: ata9.00: configured for UDMA/133
Sep  4 21:31:23 Tower kernel: ata9: EH complete
Sep  4 21:32:07 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  4 21:32:07 Tower kernel: ata10.00: failed command: IDENTIFY DEVICE
Sep  4 21:32:07 Tower kernel: ata10.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 14 pio 512 in
Sep  4 21:32:07 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 21:32:07 Tower kernel: ata10.00: status: { DRDY }
Sep  4 21:32:07 Tower kernel: ata10: hard resetting link
Sep  4 21:32:07 Tower kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  4 21:32:07 Tower kernel: ata9.00: failed command: IDENTIFY DEVICE
Sep  4 21:32:07 Tower kernel: ata9.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
Sep  4 21:32:07 Tower kernel:          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 21:32:07 Tower kernel: ata9.00: status: { DRDY }
Sep  4 21:32:07 Tower kernel: ata9: hard resetting link
Sep  4 21:32:08 Tower kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep  4 21:32:08 Tower kernel: ata10.00: configured for UDMA/133
Sep  4 21:32:08 Tower kernel: ata10: EH complete
Sep  4 21:32:08 Tower kernel: ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep  4 21:32:08 Tower kernel: ata9.00: configured for UDMA/133
Sep  4 21:32:08 Tower kernel: ata9: EH complete
Sep  4 21:40:25 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  4 21:40:25 Tower kernel: ata10.00: failed command: IDENTIFY DEVICE
Sep  4 21:40:25 Tower kernel: ata10.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 17 pio 512 in
Sep  4 21:40:25 Tower kernel:          res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Sep  4 21:40:25 Tower kernel: ata10.00: status: { DRDY }
Sep  4 21:40:25 Tower kernel: ata10: hard resetting link
Sep  4 21:40:26 Tower kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep  4 21:40:26 Tower kernel: ata10.00: configured for UDMA/133
Sep  4 21:40:26 Tower kernel: ata10: EH complete
Sep  4 21:41:06 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  4 21:41:06 Tower kernel: ata10.00: failed command: SMART
Sep  4 21:41:06 Tower kernel: ata10.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 4 pio 512 in
Sep  4 21:41:06 Tower kernel:          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 21:41:06 Tower kernel: ata10.00: status: { DRDY }
Sep  4 21:41:06 Tower kernel: ata10: hard resetting link
Sep  4 21:41:07 Tower kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep  4 21:41:07 Tower kernel: ata10.00: configured for UDMA/133
Sep  4 21:41:07 Tower kernel: ata10: EH complete
Sep  4 21:45:35 Tower sshd[29303]: Accepted password for leech from 24.125.219.59 port 48086 ssh2
Sep  4 21:46:35 Tower sshd[29307]: Received disconnect from 24.125.219.59: 11: 
Sep  4 22:00:49 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  4 22:00:49 Tower kernel: ata10.00: failed command: IDENTIFY DEVICE
Sep  4 22:00:49 Tower kernel: ata10.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 1 pio 512 in
Sep  4 22:00:49 Tower kernel:          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 22:00:49 Tower kernel: ata10.00: status: { DRDY }
Sep  4 22:00:49 Tower kernel: ata10: hard resetting link
Sep  4 22:00:49 Tower kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  4 22:00:49 Tower kernel: ata9.00: failed command: IDENTIFY DEVICE
Sep  4 22:00:49 Tower kernel: ata9.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 1 pio 512 in
Sep  4 22:00:49 Tower kernel:          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 22:00:49 Tower kernel: ata9.00: status: { DRDY }
Sep  4 22:00:49 Tower kernel: ata9: hard resetting link
Sep  4 22:00:50 Tower kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep  4 22:00:50 Tower kernel: ata10.00: configured for UDMA/133
Sep  4 22:00:50 Tower kernel: ata10: EH complete
Sep  4 22:00:50 Tower kernel: ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep  4 22:00:50 Tower kernel: ata9.00: configured for UDMA/133
Sep  4 22:00:50 Tower kernel: ata9: EH complete
Sep  4 22:01:59 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  4 22:01:59 Tower kernel: ata10.00: failed command: IDENTIFY DEVICE
Sep  4 22:01:59 Tower kernel: ata10.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 6 pio 512 in
Sep  4 22:01:59 Tower kernel:          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  4 22:01:59 Tower kernel: ata10.00: status: { DRDY }
Sep  4 22:01:59 Tower kernel: ata10: hard resetting link
Sep  4 22:02:00 Tower kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep  4 22:02:00 Tower kernel: ata10.00: configured for UDMA/133
Sep  4 22:02:00 Tower kernel: ata10: EH complete
Sep  4 22:34:28 Tower afpd[28622]: acl = acl_get_file(path, ACL_TYPE_ACCESS) failed: No such file or directory
Sep  4 22:34:28 Tower afpd[28622]: posix_acls_to_uaperms(path, st, ma) failed: No such file or directory
...
Sep  4 22:36:42 Tower afpd[28622]: posix_acls_to_uaperms(path, st, ma) failed: No such file or directory
Sep  4 23:00:43 Tower login[4241]: ROOT LOGIN  on '/dev/tty1'

<snip login failures>

Sep  5 00:57:04 Tower kernel: ------------[ cut here ]------------
Sep  5 00:57:04 Tower kernel: kernel BUG at drivers/md/unraid.c:461!
Sep  5 00:57:04 Tower kernel: invalid opcode: 0000 [#1] SMP 
Sep  5 00:57:04 Tower kernel: Modules linked in: veth xt_nat ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 nf_nat iptable_filter ip_tables md_mod btrfs xor lzo_compress raid6_pq vhost_net vhost tun i2c_i801 e1000e ptp pps_core ahci libahci
Sep  5 00:57:04 Tower kernel: CPU: 1 PID: 32027 Comm: kworker/1:4 Not tainted 3.16.0-unRAID #6
Sep  5 00:57:04 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AD2550R/U3S3, BIOS P1.20 07/22/2013
Sep  5 00:57:04 Tower kernel: Workqueue: md submit_flush_data [md_mod]
Sep  5 00:57:04 Tower kernel: task: ffff88006670bf00 ti: ffff88010c950000 task.ti: ffff88010c950000
Sep  5 00:57:04 Tower kernel: RIP: 0010:[<ffffffffa015ed23>]  [<ffffffffa015ed23>] unraid_make_request+0x34a/0x51e [md_mod]

  • Author

Well I stand corrected here in that it could be an unRAID driver issue too.

however there are some hardware errors that keep cropping up.

ata10.00, ata7.00, ata9.00.

 

That's three drives, which could possibly be on the same controller.

Without seeing the whole syslog I can't determine what controller and/or hardware.

 

I finally put my system info in my .sig  (plus got an avatar)

 

I think you're right that ata7,9, and 10 are a suspect controller;  I have already disconnected one of 4 sockets on that card (i.e. #8), and I suspect that the 7 onboard sata ports are ata 0-6.

 

Tonight I'll try reorganizing all of my drives on the MB sockets, remove the controller and repeat.  My setup crashed today while sitting silent within a couple of hours.  If it's system SW I wouldn't expect it to be getting worse.

 

Dennis

  • Author

From the last crash it appeared that I was having failures on several drive ports from my controller card.  Last night i rebuilt my array (with one fewer drive) using only the ports on my MB.  Then I set up an rsync copy, during a parity check.  The copy did seem to go faster, but the server crashed in about 7 hours.  I captured the syslog via ssh but it doesn't have a lot of provocative info to my eye.  At least there are no drive errors reported.

 

I'll set up the test again now.

 

Media-Mac:~ media$ ssh [email protected]
Last login: Fri Sep  5 18:46:36 2014
Linux 3.16.0-unRAID.
root@Tower:~# tail -f /var/log/syslog
Sep  5 18:45:39 Tower sshd[7123]: error: Could not get shadow information for NOUSER
Sep  5 18:45:39 Tower sshd[7123]: Failed password for invalid user admin from 116.10.191.214 port 11842 ssh2
Sep  5 18:45:39 Tower sshd[7123]: Failed password for invalid user admin from 116.10.191.214 port 11842 ssh2
Sep  5 18:45:39 Tower sshd[7123]: Disconnecting: Too many authentication failures for admin [preauth]
Sep  5 18:45:40 Tower sshd[7162]: fatal: Read from socket failed: Connection reset by peer [preauth]
Sep  5 18:45:41 Tower sshd[7102]: fatal: Read from socket failed: Connection reset by peer [preauth]
Sep  5 18:45:41 Tower sshd[7171]: fatal: Read from socket failed: Connection reset by peer [preauth]
Sep  5 18:45:41 Tower sshd[7153]: fatal: Read from socket failed: Connection reset by peer [preauth]
Sep  5 18:46:36 Tower login[4199]: ROOT LOGIN  on '/dev/tty2'
Sep  5 18:47:13 Tower sshd[7642]: Accepted publickey for root from 192.168.1.1 port xxx ssh2: RSA 36:1e:36:55:3c:a6:e9:c4:03:0c:7a:78:7f:a9:3f:57
Sep  5 18:48:24 Tower sshd[8002]: Accepted publickey for root from 192.168.1.1 port xxx ssh2: RSA 36:1e:36:55:3c:a6:e9:c4:03:0c:7a:78:7f:a9:3f:57
Sep  5 18:48:57 Tower php: /usr/bin/docker stop Dropbox
Sep  5 18:48:59 Tower php: Dropbox 
Sep  5 18:48:59 Tower php: 
Sep  5 19:37:43 Tower sshd[8743]: Accepted password for leech from <ip addr> port xxx ssh2
Sep  5 19:38:00 Tower sshd[8747]: Received disconnect from <ip addr>: 11: 
Sep  5 21:48:37 Tower sshd[11314]: Failed password for root from 122.225.109.101 port 30543 ssh2
Sep  5 21:48:37 Tower sshd[11314]: Failed password for root from 122.225.109.101 port 30543 ssh2
Sep  5 21:48:37 Tower sshd[11314]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:48:39 Tower sshd[11319]: Failed password for root from 122.225.109.101 port 31183 ssh2
Sep  5 21:48:39 Tower sshd[11319]: Failed password for root from 122.225.109.101 port 31183 ssh2
Sep  5 21:48:39 Tower sshd[11319]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:48:41 Tower sshd[11326]: Failed password for root from 122.225.109.101 port 31669 ssh2
Sep  5 21:48:42 Tower sshd[11326]: Failed password for root from 122.225.109.101 port 31669 ssh2
Sep  5 21:48:42 Tower sshd[11326]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:48:44 Tower sshd[11333]: Failed password for root from 122.225.109.101 port 32136 ssh2
Sep  5 21:48:44 Tower sshd[11333]: Failed password for root from 122.225.109.101 port 32136 ssh2
Sep  5 21:48:44 Tower sshd[11333]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:48:46 Tower sshd[11337]: Failed password for root from 122.225.109.101 port 33091 ssh2
Sep  5 21:48:47 Tower sshd[11337]: Failed password for root from 122.225.109.101 port 33091 ssh2
Sep  5 21:48:47 Tower sshd[11337]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:48:49 Tower sshd[11341]: Failed password for root from 122.225.109.101 port 33789 ssh2
Sep  5 21:48:49 Tower sshd[11341]: Failed password for root from 122.225.109.101 port 33789 ssh2
Sep  5 21:48:49 Tower sshd[11341]: Disconnecting: Too many authentication failures for root [preauth]

<snip failed logins>

Sep  5 21:54:00 Tower sshd[13751]: input_userauth_request: invalid user admin [preauth]
Sep  5 21:54:01 Tower sshd[13743]: Failed password for root from 116.10.191.214 port 58210 ssh2
Sep  5 21:54:01 Tower sshd[13747]: Failed password for root from 116.10.191.214 port 58209 ssh2
Sep  5 21:54:01 Tower sshd[13751]: error: Could not get shadow information for NOUSER
Sep  5 21:54:01 Tower sshd[13751]: Failed password for invalid user admin from 116.10.191.214 port 58402 ssh2
Sep  5 21:54:01 Tower sshd[13743]: Failed password for root from 116.10.191.214 port 58210 ssh2
Sep  5 21:54:01 Tower sshd[13743]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:54:01 Tower sshd[13747]: Failed password for root from 116.10.191.214 port 58209 ssh2
Sep  5 21:54:01 Tower sshd[13747]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:54:01 Tower sshd[13751]: Failed password for invalid user admin from 116.10.191.214 port 58402 ssh2
Sep  5 21:54:01 Tower sshd[13751]: Disconnecting: Too many authentication failures for admin [preauth]
Sep  5 21:54:02 Tower sshd[13755]: Failed password for root from 116.10.191.214 port 1281 ssh2
Sep  5 21:54:03 Tower sshd[13755]: Failed password for root from 116.10.191.214 port 1281 ssh2
Sep  5 21:54:03 Tower sshd[13755]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:54:05 Tower sshd[13785]: Invalid user admin from 116.10.191.214
Sep  5 21:54:05 Tower sshd[13785]: input_userauth_request: invalid user admin [preauth]
Sep  5 21:54:06 Tower sshd[13785]: error: Could not get shadow information for NOUSER
Sep  5 21:54:06 Tower sshd[13785]: Failed password for invalid user admin from 116.10.191.214 port 3993 ssh2
Sep  5 21:54:06 Tower sshd[13785]: fatal: Read from socket failed: Connection reset by peer [preauth]
Sep  5 21:54:06 Tower sshd[13777]: fatal: Write failed: Connection reset by peer [preauth]
Sep  5 21:54:06 Tower sshd[13781]: fatal: Write failed: Connection reset by peer [preauth]
Sep  5 21:54:07 Tower sshd[13789]: fatal: Write failed: Connection reset by peer [preauth]
Read from remote host amor.us.to: Operation timed out
Connection to <ip>  closed.

 

It is curious that the failed logins don't come in steadily through the night,  the large majority came in in a six minute period preceding the crash.

 

Here's TOP during the crash:

Last login: Fri Sep  5 18:46:36 2014
Linux 3.16.0-unRAID.
root@Tower:~# tail -f /var/log/syslog
Sep  5 18:45:39 Tower sshd[7123]: error: Could not get shadow information for NOUSER
Sep  5 18:45:39 Tower sshd[7123]: Failed password for invalid user admin from 116.10.191.214 port 11842 ssh2
Sep  5 18:45:39 Tower sshd[7123]: Failed password for invalid user admin from 116.10.191.214 port 11842 ssh2
Sep  5 18:45:39 Tower sshd[7123]: Disconnecting: Too many authentication failures for admin [preauth]
Sep  5 18:45:40 Tower sshd[7162]: fatal: Read from socket failed: Connection reset by peer [preauth]
Sep  5 18:45:41 Tower sshd[7102]: fatal: Read from socket failed: Connection reset by peer [preauth]
Sep  5 18:45:41 Tower sshd[7171]: fatal: Read from socket failed: Connection reset by peer [preauth]
Sep  5 18:45:41 Tower sshd[7153]: fatal: Read from socket failed: Connection reset by peer [preauth]
Sep  5 18:46:36 Tower login[4199]: ROOT LOGIN  on '/dev/tty2'
Sep  5 18:47:13 Tower sshd[7642]: Accepted publickey for root from 192.168.1.1 port xxx ssh2: RSA 36:1e:36:55:3c:a6:e9:c4:03:0c:7a:78:7f:a9:3f:57
Sep  5 18:48:24 Tower sshd[8002]: Accepted publickey for root from 192.168.1.1 port xxx ssh2: RSA 36:1e:36:55:3c:a6:e9:c4:03:0c:7a:78:7f:a9:3f:57
Sep  5 18:48:57 Tower php: /usr/bin/docker stop Dropbox
Sep  5 18:48:59 Tower php: Dropbox 
Sep  5 18:48:59 Tower php: 
Sep  5 19:37:43 Tower sshd[8743]: Accepted password for leech from 24.125.219.59 port 39801 ssh2
Sep  5 19:38:00 Tower sshd[8747]: Received disconnect from 24.125.219.59: 11: 
Sep  5 21:48:37 Tower sshd[11314]: Failed password for root from 122.225.109.101 port 30543 ssh2
Sep  5 21:48:37 Tower sshd[11314]: Failed password for root from 122.225.109.101 port 30543 ssh2
Sep  5 21:48:37 Tower sshd[11314]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:48:39 Tower sshd[11319]: Failed password for root from 122.225.109.101 port 31183 ssh2
Sep  5 21:48:39 Tower sshd[11319]: Failed password for root from 122.225.109.101 port 31183 ssh2
Sep  5 21:48:39 Tower sshd[11319]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:48:41 Tower sshd[11326]: Failed password for root from 122.225.109.101 port 31669 ssh2
Sep  5 21:48:42 Tower sshd[11326]: Failed password for root from 122.225.109.101 port 31669 ssh2
Sep  5 21:48:42 Tower sshd[11326]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:48:44 Tower sshd[11333]: Failed password for root from 122.225.109.101 port 32136 ssh2
Sep  5 21:48:44 Tower sshd[11333]: Failed password for root from 122.225.109.101 port 32136 ssh2
Sep  5 21:48:44 Tower sshd[11333]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:48:46 Tower sshd[11337]: Failed password for root from 122.225.109.101 port 33091 ssh2
Sep  5 21:48:47 Tower sshd[11337]: Failed password for root from 122.225.109.101 port 33091 ssh2
Sep  5 21:48:47 Tower sshd[11337]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:48:49 Tower sshd[11341]: Failed password for root from 122.225.109.101 port 33789 ssh2
Sep  5 21:48:49 Tower sshd[11341]: Failed password for root from 122.225.109.101 port 33789 ssh2
Sep  5 21:48:49 Tower sshd[11341]: Disconnecting: Too many authentication failures for root [preauth]

<snip failed logins>

Sep  5 21:54:00 Tower sshd[13751]: input_userauth_request: invalid user admin [preauth]
Sep  5 21:54:01 Tower sshd[13743]: Failed password for root from 116.10.191.214 port 58210 ssh2
Sep  5 21:54:01 Tower sshd[13747]: Failed password for root from 116.10.191.214 port 58209 ssh2
Sep  5 21:54:01 Tower sshd[13751]: error: Could not get shadow information for NOUSER
Sep  5 21:54:01 Tower sshd[13751]: Failed password for invalid user admin from 116.10.191.214 port 58402 ssh2
Sep  5 21:54:01 Tower sshd[13743]: Failed password for root from 116.10.191.214 port 58210 ssh2
Sep  5 21:54:01 Tower sshd[13743]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:54:01 Tower sshd[13747]: Failed password for root from 116.10.191.214 port 58209 ssh2
Sep  5 21:54:01 Tower sshd[13747]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:54:01 Tower sshd[13751]: Failed password for invalid user admin from 116.10.191.214 port 58402 ssh2
Sep  5 21:54:01 Tower sshd[13751]: Disconnecting: Too many authentication failures for admin [preauth]
Sep  5 21:54:02 Tower sshd[13755]: Failed password for root from 116.10.191.214 port 1281 ssh2
Sep  5 21:54:03 Tower sshd[13755]: Failed password for root from 116.10.191.214 port 1281 ssh2
Sep  5 21:54:03 Tower sshd[13755]: Disconnecting: Too many authentication failures for root [preauth]
Sep  5 21:54:05 Tower sshd[13785]: Invalid user admin from 116.10.191.214
Sep  5 21:54:05 Tower sshd[13785]: input_userauth_request: invalid user admin [preauth]
Sep  5 21:54:06 Tower sshd[13785]: error: Could not get shadow information for NOUSER
Sep  5 21:54:06 Tower sshd[13785]: Failed password for invalid user admin from 116.10.191.214 port 3993 ssh2
Sep  5 21:54:06 Tower sshd[13785]: fatal: Read from socket failed: Connection reset by peer [preauth]
Sep  5 21:54:06 Tower sshd[13777]: fatal: Write failed: Connection reset by peer [preauth]
Sep  5 21:54:06 Tower sshd[13781]: fatal: Write failed: Connection reset by peer [preauth]
Sep  5 21:54:07 Tower sshd[13789]: fatal: Write failed: Connection reset by peer [preauth]
Read from remote host <1p>: Operation timed out
Connection to <ip>  closed.

  • Author

OK, this looks like an informative crash.

 

unRaid 6b8, hardware below except the controller unplugged.  plugins apcupsd, cache_dirs running.  Parity check running (server hasn't been able long enough to complete one!) SMB/NFS and AFP running but not in use.

 

Started rsync (-avh with copy of large files from a reiserfs array volume to an xfs volume)

started remote terminal with tail -f capture and another with top.

 

Hard crash happened in 90 minutes.

 

Here's the log:

 

root@Tower:~# tail -f /var/log/syslog
Sep  6 06:55:28 Tower emhttp: shcmd (58): cp /etc/avahi/services/afp.service- /etc/avahi/services/afp.service
Sep  6 06:55:28 Tower avahi-daemon[3880]: Files changed, reloading.
Sep  6 06:55:28 Tower avahi-daemon[3880]: Service group file /etc/avahi/services/afp.service changed, reloading.
Sep  6 06:55:28 Tower emhttp: shcmd (59): /usr/local/sbin/emhttp_event svcs_restarted
Sep  6 06:55:28 Tower emhttp_event: svcs_restarted
Sep  6 06:55:28 Tower emhttp: shcmd (60): /usr/local/sbin/emhttp_event started
Sep  6 06:55:28 Tower emhttp_event: started
Sep  6 06:55:28 Tower rc.unRAID[4832][4833]: Processing /etc/rc.d/rc.unRAID.d/ start scripts.
Sep  6 06:55:28 Tower avahi-daemon[3880]: Service "Tower" (/etc/avahi/services/smb.service) successfully established.
Sep  6 06:55:29 Tower avahi-daemon[3880]: Service "Tower-AFP" (/etc/avahi/services/afp.service) successfully established.
Sep  6 06:55:39 Tower kernel: docker0: port 1(veth9241) entered forwarding state
Sep  6 06:55:39 Tower kernel: docker0: port 2(veth59d2) entered forwarding state
Sep  6 06:55:40 Tower kernel: docker0: port 3(veth38c6) entered forwarding state
Sep  6 06:56:00 Tower cache_dirs: ==============================================
Sep  6 06:56:00 Tower cache_dirs: command-args=-w -m 1 -M 10 -d 6 -p 10 -U 50000 -i /mnt -B -a -noleaf
Sep  6 06:56:00 Tower cache_dirs: vfs_cache_pressure=10
Sep  6 06:56:00 Tower cache_dirs: max_seconds=10, min_seconds=1
Sep  6 06:56:00 Tower cache_dirs: max_depth=6
Sep  6 06:56:00 Tower cache_dirs: command=find -noleaf
Sep  6 06:56:00 Tower cache_dirs: version=1.6.9
Sep  6 06:56:00 Tower cache_dirs: ---------- caching directories ---------------
Sep  6 06:56:00 Tower cache_dirs: 
Sep  6 06:56:00 Tower cache_dirs: ----------------------------------------------
Sep  6 06:56:00 Tower cache_dirs: ERROR: included directory "/mnt" does not exist.
Sep  6 06:56:00 Tower cache_dirs: cache_dirs process ID 5267 started, To terminate it, type: cache_dirs -q
Sep  6 06:56:16 Tower sshd[5363]: Accepted publickey for root from 192.168.1.1 port xxxx ssh2: RSA 36:1e:36:55:3c:a6:e9:c4:03:0c:7a:78:7f:a9:3f:57
Sep  6 07:14:35 Tower login[4239]: ROOT LOGIN  on '/dev/tty1'
Sep  6 07:30:12 Tower sshd[17439]: Accepted publickey for root from 192.168.1.144 port xxxx ssh2: RSA 02:cb:cf:9d:44:aa:04:49:7d:2c:9f:a9:00:da:ef:92
Sep  6 08:11:31 Tower kernel: ------------[ cut here ]------------
Sep  6 08:11:31 Tower kernel: kernel BUG at drivers/md/unraid.c:462!
Sep  6 08:11:31 Tower kernel: invalid opcode: 0000 [#1] SMP 
Sep  6 08:11:31 Tower kernel: Modules linked in: veth xt_nat ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 nf_nat iptable_filter ip_tables md_mod btrfs xor lzo_compress raid6_pq vhost_net vhost tun i2c_i801 e1000e ptp pps_core ahci libahci
Sep  6 08:11:31 Tower kernel: CPU: 0 PID: 13420 Comm: kworker/0:6 Not tainted 3.16.0-unRAID #6
Sep  6 08:11:31 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AD2550R/U3S3, BIOS P1.20 07/22/2013
Sep  6 08:11:31 Tower kernel: Workqueue: md submit_flush_data [md_mod]
Sep  6 08:11:31 Tower kernel: task: ffff88006fd246e0 ti: ffff88010b19c000 task.ti: ffff88010b19c000
Sep  6 08:11:31 Tower kernel: RIP: 0010:[<ffffffffa018bd2c>]  [<ffffffffa018bd2c>] unraid_make_request+0x353/0x51e [md_mod]
Sep  6 08:11:31 Tower kernel: RSP: 0018:ffff88010b19fd90  EFLAGS: 00010286
Sep  6 08:11:31 Tower kernel: RAX: ffff880119240120 RBX: ffff8800b1c51d00 RCX: 000000000000508e
Sep  6 08:11:31 Tower kernel: RDX: 000000000000508e RSI: 000000000000508e RDI: ffff880119240044
Sep  6 08:11:31 Tower kernel: RBP: ffff88010b19fde0 R08: 0000000000000046 R09: ffff88006fd246e0
Sep  6 08:11:31 Tower kernel: R10: 000000058107dae0 R11: 000000000000b6f6 R12: 0000000000000001
Sep  6 08:11:31 Tower kernel: R13: ffff8800da846a80 R14: ffff880119240000 R15: ffff8800da846a80
Sep  6 08:11:31 Tower kernel: FS:  0000000000000000(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
Sep  6 08:11:31 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Sep  6 08:11:31 Tower kernel: CR2: 00007fabe9d511de CR3: 00000000515d5000 CR4: 00000000000007f0
Sep  6 08:11:31 Tower kernel: Stack:
Sep  6 08:11:31 Tower kernel:  0000000000000001 ffff880119240120 ffff880119240044 000000007486ab13
Sep  6 08:11:31 Tower kernel:  000000007486aad0 ffff8800b1c51a20 ffff8800b1c51d00 ffff8800da846a80
Sep  6 08:11:31 Tower kernel:  ffff8800b1c51a00 0000000000000000 ffff88010b19fe10 ffffffffa018bf74
Sep  6 08:11:31 Tower kernel: Call Trace:
Sep  6 08:11:31 Tower kernel:  [<ffffffffa018bf74>] submit_flush_data+0x7d/0x93 [md_mod]
Sep  6 08:11:31 Tower kernel:  [<ffffffff81052ea9>] process_one_work+0x20d/0x332
Sep  6 08:11:31 Tower kernel:  [<ffffffff81053b5d>] worker_thread+0x392/0x507
Sep  6 08:11:31 Tower kernel:  [<ffffffff810537cb>] ? cancel_delayed_work_sync+0x10/0x10
Sep  6 08:11:31 Tower kernel:  [<ffffffff810588de>] kthread+0xd6/0xde
Sep  6 08:11:31 Tower kernel:  [<ffffffff81058808>] ? kthread_create_on_node+0x168/0x168
Sep  6 08:11:31 Tower kernel:  [<ffffffff815582fc>] ret_from_fork+0x7c/0xb0
Sep  6 08:11:31 Tower kernel:  [<ffffffff81058808>] ? kthread_create_on_node+0x168/0x168
Sep  6 08:11:31 Tower kernel: Code: 46 44 48 89 c7 48 89 45 c0 e8 cb be 3c e1 48 8b 45 b8 48 83 78 18 00 74 02 0f 0b 48 83 78 20 00 74 02 0f 0b 48 83 78 28 00 74 02 <0f> 0b 48 8b 50 08 80 e6 df 48 89 50 08 f6 43 18 01 75 06 48 89 
Sep  6 08:11:31 Tower kernel: RIP  [<ffffffffa018bd2c>] unraid_make_request+0x353/0x51e [md_mod]
Sep  6 08:11:31 Tower kernel:  RSP <ffff88010b19fd90>
Sep  6 08:11:31 Tower kernel: ---[ end trace 0a6b223d28c9bb17 ]---
Sep  6 08:11:31 Tower kernel: BUG: unable to handle kernel paging request at ffffffffffffffd8
Sep  6 08:11:31 Tower kernel: IP: [<ffffffff81058e81>] kthread_data+0xc/0x11
Sep  6 08:11:31 Tower kernel: PGD 1742067 PUD 1744067 PMD 0 
Sep  6 08:11:31 Tower kernel: Oops: 0000 [#2] SMP 

 

the console pic is attached.

 

Top shows

top - 08:11:38 up  1:17,  4 users,  load average: 9.89, 9.42, 9.39
Tasks: 293 total,   2 running, 291 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us,  0.1%sy,  0.0%ni, 66.5%id, 33.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4038012k total,  3762660k used,   275352k free,   126176k buffers
Swap:        0k total,        0k used,        0k free,  2089068k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
5401 root      20   0 13412 2408 1972 R    1  0.1   0:40.27 top                
4752 nobody    20   0 1484m  36m    0 S    0  0.9   3:09.42 sabnzbdplus        
    1 root      20   0  4364 1608 1508 S    0  0.0   0:26.63 init               
    2 root      20   0     0    0    0 S    0  0.0   0:00.01 kthreadd           
    3 root      20   0     0    0    0 S    0  0.0   0:14.86 ksoftirqd/0        
    4 root      20   0     0    0    0 S    0  0.0   0:02.51 kworker/0:0        
    5 root       0 -20     0    0    0 S    0  0.0   0:00.00 kworker/0:0H       
    7 root      20   0     0    0    0 S    0  0.0   0:19.80 rcu_sched          
    8 root      20   0     0    0    0 S    0  0.0   0:00.00 rcu_bh             
    9 root      RT   0     0    0    0 S    0  0.0   0:00.33 migration/0        
   10 root      RT   0     0    0    0 S    0  0.0   0:00.75 migration/1        
   11 root      20   0     0    0    0 S    0  0.0   0:13.82 ksoftirqd/1        
   13 root       0 -20     0    0    0 S    0  0.0   0:00.00 kworker/1:0H       
   14 root      RT   0     0    0    0 S    0  0.0   0:00.15 migration/2        
   15 root      20   0     0    0    0 S    0  0.0   0:14.03 ksoftirqd/2        
   16 root      20   0     0    0    0 S    0  0.0   0:02.32 kworker/2:0        
   17 root       0 -20     0    0    0 S    0  0.0   0:00.00 kworker/2:0H   

 

-=-=-=-=

 

I note the bug notice:

Sep  6 08:11:31 Tower kernel: kernel BUG at drivers/md/unraid.c:462!
Sep  6 08:11:31 Tower kernel: invalid opcode: 0000 [#1] SMP 
Sep  6 08:11:31 Tower kernel: Modules linked in: veth xt_nat ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 nf_nat iptable_filter ip_tables md_mod btrfs xor lzo_compress raid6_pq vhost_net vhost tun i2c_i801 e1000e ptp pps_core ahci libahci

 

 

This is similar to the bug in a previous crash except it was 461 then

 

http://lime-technology.com/forum/index.php?topic=35043.msg326019#msg326019

 

I think this is pretty good evidence that there is an issue with the referenced unRAID driver.

 

I also note that the previous drive/controller issues are gone.  I've ordered a new controller...

 

Thanks

 

Dennis

 

crash_9_6.JPG.c27c469475a34c56c0e1abd3dd8648e5.JPG

  • Author

And another informative crash, with same unraid.c driver error.  This time the rsync was from a reiserfs to another resierfs formatted volume;  before I was copying to an xfs system.

 

The crash was 16 minutes into a copy.

 

Linux 3.16.0-unRAID.
root@Tower:~# tail -f /var/log/syslog 
Sep  6 09:41:49 Tower emhttp: shcmd (36): set -o pipefail ; mount -t reiserfs -o remount,user_xattr,acl,noatime,nodiratime /dev/md4 /mnt/disk4 |& logger
Sep  6 09:41:49 Tower emhttp: shcmd (37): mkdir -p /mnt/disk5
Sep  6 09:41:49 Tower emhttp: shcmd (38): set -o pipefail ; mount -t xfs -o noatime,nodiratime /dev/md5 /mnt/disk5 |& logger
Sep  6 09:41:49 Tower kernel: XFS (md5): Mounting V4 Filesystem
Sep  6 09:41:50 Tower kernel: XFS (md5): Starting recovery (logdev: internal)
Sep  6 09:41:50 Tower emhttp: shcmd (39): mkdir -p /mnt/cache
Sep  6 09:41:50 Tower kernel: XFS (md5): Ending recovery (logdev: internal)
Sep  6 09:41:50 Tower emhttp: shcmd (40): set -o pipefail ; mount -t btrfs -o noatime,nodiratime /dev/sdh1 /mnt/cache |& logger
Sep  6 09:41:50 Tower kernel: BTRFS info (device sdh1): disk space caching is enabled
Sep  6 09:41:50 Tower kernel: BTRFS: detected SSD devices, enabling SSD mode
Sep  6 09:41:51 Tower emhttp: shcmd (41): mkdir /mnt/user0
Sep  6 09:41:51 Tower emhttp: shcmd (42): /usr/local/sbin/shfs /mnt/user0 -disks 16777214 -o noatime,big_writes,allow_other  |& logger
Sep  6 09:41:51 Tower emhttp: shcmd (43): mkdir /mnt/user
Sep  6 09:41:51 Tower emhttp: shcmd (44): /usr/local/sbin/shfs /mnt/user -disks 16777215 1024 -o noatime,big_writes,allow_other -o remember=330  |& logger
Sep  6 09:41:51 Tower emhttp: shcmd (45): crontab -c /etc/cron.d - <<< "# Generated mover schedule: 30 2 * * 5 /usr/local/sbin/mover |& logger"
Sep  6 09:41:51 Tower emhttp: shcmd (46): /usr/local/sbin/emhttp_event disks_mounted
Sep  6 09:41:51 Tower emhttp_event: disks_mounted
Sep  6 09:41:51 Tower logger: Starting docker.io
Sep  6 09:41:51 Tower logger: Maybe expand image file
Sep  6 09:41:51 Tower kernel: BTRFS: device fsid 0ab81931-36e8-49c4-924b-6ca691a81a8c devid 1 transid 2827 /dev/loop8
Sep  6 09:41:51 Tower kernel: BTRFS info (device loop8): disk space caching is enabled
Sep  6 09:41:51 Tower logger: Resize '/var/lib/docker' of 'max'
Sep  6 09:41:51 Tower logger: starting docker ...
Sep  6 09:41:51 Tower kernel: BTRFS: new size for /dev/loop8 is 8589934592
Sep  6 09:41:51 Tower kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Sep  6 09:41:52 Tower avahi-daemon[3913]: Joining mDNS multicast group on interface docker0.IPv4 with address 172.17.42.1.
Sep  6 09:41:52 Tower avahi-daemon[3913]: New relevant interface docker0.IPv4 for mDNS.
Sep  6 09:41:52 Tower avahi-daemon[3913]: Registering new address record for 172.17.42.1 on docker0.IPv4.
Sep  6 09:41:53 Tower kernel: device veth0595 entered promiscuous mode
Sep  6 09:41:53 Tower avahi-daemon[3913]: Withdrawing workstation service for veth2b02.
Sep  6 09:41:53 Tower kernel: docker0: port 1(veth0595) entered forwarding state
Sep  6 09:41:53 Tower kernel: docker0: port 1(veth0595) entered forwarding state
Sep  6 09:41:53 Tower logger: CouchPotato: started succesfully!
Sep  6 09:41:54 Tower logger: Dropbox: started succesfully!
Sep  6 09:41:54 Tower kernel: device veth6eca entered promiscuous mode
Sep  6 09:41:54 Tower kernel: docker0: port 2(veth6eca) entered forwarding state
Sep  6 09:41:54 Tower kernel: docker0: port 2(veth6eca) entered forwarding state
Sep  6 09:41:54 Tower avahi-daemon[3913]: Withdrawing workstation service for veth9afa.
Sep  6 09:41:54 Tower kernel: docker0: port 2(veth6eca) entered disabled state
Sep  6 09:41:54 Tower kernel: docker0: port 2(veth6eca) entered forwarding state
Sep  6 09:41:54 Tower kernel: docker0: port 2(veth6eca) entered forwarding state
Sep  6 09:41:54 Tower logger: SABnzbd: started succesfully!
Sep  6 09:41:55 Tower kernel: device vethaa01 entered promiscuous mode
Sep  6 09:41:55 Tower kernel: docker0: port 3(vethaa01) entered forwarding state
Sep  6 09:41:55 Tower kernel: docker0: port 3(vethaa01) entered forwarding state
Sep  6 09:41:55 Tower avahi-daemon[3913]: Withdrawing workstation service for veth473e.
Sep  6 09:41:55 Tower logger: SickBeard: started succesfully!
Sep  6 09:41:55 Tower kernel: mdcmd (41): check CORRECT
Sep  6 09:41:55 Tower kernel: md: recovery thread woken up ...
Sep  6 09:41:55 Tower kernel: md: recovery thread syncing parity disk ...
Sep  6 09:41:55 Tower kernel: md: using 1536k window, over a total of 3907018532 blocks.
Sep  6 09:41:57 Tower ntpd[1379]: Listen normally on 3 docker0 172.17.42.1 UDP 123
Sep  6 09:41:57 Tower ntpd[1379]: peers refreshed
Sep  6 09:41:57 Tower ntpd[1379]: new interface(s) found: waking up resolver
Sep  6 09:41:57 Tower emhttp: shcmd (47): :>/etc/samba/smb-shares.conf
Sep  6 09:41:57 Tower emhttp: shcmd (48): cp /etc/exports- /etc/exports
Sep  6 09:41:57 Tower avahi-daemon[3913]: Files changed, reloading.
Sep  6 09:41:57 Tower emhttp: shcmd (49): cp /etc/netatalk/AppleVolumes.default- /etc/netatalk/AppleVolumes.default
Sep  6 09:41:57 Tower avahi-daemon[3913]: Files changed, reloading.
Sep  6 09:41:57 Tower avahi-daemon[3913]: Files changed, reloading.
Sep  6 09:41:57 Tower emhttp: shcmd (50): echo '"/mnt/user/Media" -async,no_subtree_check,fsid=101 *(sec=sys,ro,insecure,anongid=100,anonuid=99,all_squash)' >>/etc/exports
Sep  6 09:41:57 Tower emhttp: shcmd (51): chmod 777 '/mnt/user/junk'
Sep  6 09:41:57 Tower emhttp: shcmd (52): chown 'nobody':'users' '/mnt/user/junk'
Sep  6 09:41:57 Tower avahi-daemon[3913]: Files changed, reloading.
Sep  6 09:41:57 Tower emhttp: Restart SMB...
Sep  6 09:41:57 Tower emhttp: shcmd (53): killall -HUP smbd
Sep  6 09:41:57 Tower emhttp: shcmd (54): cp /etc/avahi/services/smb.service- /etc/avahi/services/smb.service
Sep  6 09:41:57 Tower avahi-daemon[3913]: Files changed, reloading.
Sep  6 09:41:57 Tower avahi-daemon[3913]: Service group file /etc/avahi/services/smb.service changed, reloading.
Sep  6 09:41:57 Tower emhttp: shcmd (55): ps axc | grep -q rpc.mountd
Sep  6 09:41:57 Tower emhttp: _shcmd: shcmd (55): exit status: 1
Sep  6 09:41:57 Tower emhttp: Start NFS...
Sep  6 09:41:57 Tower emhttp: shcmd (56): /etc/rc.d/rc.nfsd start |& logger
Sep  6 09:41:57 Tower logger: Starting NFS server daemons:
Sep  6 09:41:57 Tower logger:   /usr/sbin/exportfs -r
Sep  6 09:41:57 Tower logger:   /usr/sbin/rpc.nfsd 8
Sep  6 09:41:57 Tower logger:   /usr/sbin/rpc.mountd
Sep  6 09:41:57 Tower rpc.mountd[4826]: Version 1.2.8 starting
Sep  6 09:41:57 Tower emhttp: Restart AFP...
Sep  6 09:41:57 Tower emhttp: shcmd (57): killall -HUP afpd
Sep  6 09:41:58 Tower emhttp: shcmd (58): cp /etc/avahi/services/afp.service- /etc/avahi/services/afp.service
Sep  6 09:41:58 Tower avahi-daemon[3913]: Files changed, reloading.
Sep  6 09:41:58 Tower avahi-daemon[3913]: Service group file /etc/avahi/services/afp.service changed, reloading.
Sep  6 09:41:58 Tower emhttp: shcmd (59): /usr/local/sbin/emhttp_event svcs_restarted
Sep  6 09:41:58 Tower emhttp_event: svcs_restarted
Sep  6 09:41:58 Tower emhttp: shcmd (60): /usr/local/sbin/emhttp_event started
Sep  6 09:41:58 Tower emhttp_event: started
Sep  6 09:41:58 Tower rc.unRAID[4836][4837]: Processing /etc/rc.d/rc.unRAID.d/ start scripts.
Sep  6 09:41:58 Tower avahi-daemon[3913]: Service "Tower" (/etc/avahi/services/smb.service) successfully established.
Sep  6 09:41:58 Tower avahi-daemon[3913]: Service "Tower-AFP" (/etc/avahi/services/afp.service) successfully established.
Sep  6 09:42:00 Tower cache_dirs: ==============================================
Sep  6 09:42:00 Tower cache_dirs: command-args=-w -m 1 -M 10 -d 6 -p 10 -U 50000 -i /mnt/ -B -a -noleaf
Sep  6 09:42:00 Tower cache_dirs: vfs_cache_pressure=10
Sep  6 09:42:00 Tower cache_dirs: max_seconds=10, min_seconds=1
Sep  6 09:42:00 Tower cache_dirs: max_depth=6
Sep  6 09:42:00 Tower cache_dirs: command=find -noleaf
Sep  6 09:42:00 Tower cache_dirs: version=1.6.9
Sep  6 09:42:00 Tower cache_dirs: ---------- caching directories ---------------
Sep  6 09:42:00 Tower cache_dirs: 
Sep  6 09:42:00 Tower cache_dirs: ----------------------------------------------
Sep  6 09:42:00 Tower cache_dirs: ERROR: included directory "/mnt/" does not exist.
Sep  6 09:42:00 Tower cache_dirs: cache_dirs process ID 4907 started, To terminate it, type: cache_dirs -q
Sep  6 09:42:08 Tower kernel: docker0: port 1(veth0595) entered forwarding state
Sep  6 09:42:09 Tower kernel: docker0: port 2(veth6eca) entered forwarding state
Sep  6 09:42:10 Tower kernel: docker0: port 3(vethaa01) entered forwarding state
Sep  6 09:42:46 Tower sshd[5373]: Accepted publickey for root from 192.168.1.1 port 49860 ssh2: RSA 36:1e:36:55:3c:a6:e9:c4:03:0c:7a:78:7f:a9:3f:57
Sep  6 09:43:18 Tower login[4237]: ROOT LOGIN  on '/dev/tty1'
Sep  6 09:58:18 Tower kernel: ------------[ cut here ]------------
Sep  6 09:58:18 Tower kernel: kernel BUG at drivers/md/unraid.c:461!
Sep  6 09:58:18 Tower kernel: invalid opcode: 0000 [#1] SMP 
Sep  6 09:58:18 Tower kernel: Modules linked in: veth xt_nat ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 nf_nat iptable_filter ip_tables md_mod btrfs xor lzo_compress raid6_pq vhost_net vhost tun i2c_i801 e1000e ptp pps_core ahci libahci
Sep  6 09:58:18 Tower kernel: CPU: 0 PID: 6980 Comm: kworker/0:5 Not tainted 3.16.0-unRAID #6
Sep  6 09:58:18 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AD2550R/U3S3, BIOS P1.20 07/22/2013
Sep  6 09:58:18 Tower kernel: Workqueue: md submit_flush_data [md_mod]
Sep  6 09:58:18 Tower kernel: task: ffff8800d80e87e0 ti: ffff88010ad10000 task.ti: ffff88010ad10000
Sep  6 09:58:18 Tower kernel: RIP: 0010:[<ffffffffa018bd23>]  [<ffffffffa018bd23>] unraid_make_request+0x34a/0x51e [md_mod]
Sep  6 09:58:18 Tower kernel: RSP: 0018:ffff88010ad13d90  EFLAGS: 00010286
Sep  6 09:58:18 Tower kernel: RAX: ffff8800d8d4bc90 RBX: ffff88008b671500 RCX: 0000000000000004
Sep  6 09:58:18 Tower kernel: RDX: 000000000000bbe4 RSI: 0000000000000001 RDI: ffff8800d8d4bbb4
Sep  6 09:58:18 Tower kernel: RBP: ffff88010ad13de0 R08: 0000000000000004 R09: ffff880119ae1808
Sep  6 09:58:18 Tower kernel: R10: ffff8800d8d4bb80 R11: 0000000000000003 R12: 0000000000000001
Sep  6 09:58:18 Tower kernel: R13: ffff8800d7c4a680 R14: ffff8800d8d4bb70 R15: ffff8800d7c4a680
Sep  6 09:58:18 Tower kernel: FS:  0000000000000000(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
Sep  6 09:58:18 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Sep  6 09:58:18 Tower kernel: CR2: 00002acb30fb8000 CR3: 000000006e7dc000 CR4: 00000000000007f0
Sep  6 09:58:18 Tower kernel: Stack:
Sep  6 09:58:18 Tower kernel:  0000000000000001 ffff8800d8d4bc90 ffff8800d8d4bbb4 00000000748b4623
Sep  6 09:58:18 Tower kernel:  00000000748b45e0 ffff88008fe07420 ffff88008b671500 ffff8800d7c4a680
Sep  6 09:58:18 Tower kernel:  ffff88008fe07400 0000000000000000 ffff88010ad13e10 ffffffffa018bf74
Sep  6 09:58:18 Tower kernel: Call Trace:
Sep  6 09:58:18 Tower kernel:  [<ffffffffa018bf74>] submit_flush_data+0x7d/0x93 [md_mod]
Sep  6 09:58:18 Tower kernel:  [<ffffffff81052ea9>] process_one_work+0x20d/0x332
Sep  6 09:58:18 Tower kernel:  [<ffffffff81053b5d>] worker_thread+0x392/0x507
Sep  6 09:58:18 Tower kernel:  [<ffffffff810537cb>] ? cancel_delayed_work_sync+0x10/0x10
Sep  6 09:58:18 Tower kernel:  [<ffffffff810588de>] kthread+0xd6/0xde
Sep  6 09:58:18 Tower kernel:  [<ffffffff81058808>] ? kthread_create_on_node+0x168/0x168
Sep  6 09:58:18 Tower kernel:  [<ffffffff815582fc>] ret_from_fork+0x7c/0xb0
Sep  6 09:58:18 Tower kernel:  [<ffffffff81058808>] ? kthread_create_on_node+0x168/0x168
Sep  6 09:58:18 Tower kernel: Code: 02 0f 0b 48 89 45 b8 49 8d 46 44 48 89 c7 48 89 45 c0 e8 cb be 3c e1 48 8b 45 b8 48 83 78 18 00 74 02 0f 0b 48 83 78 20 00 74 02 <0f> 0b 48 83 78 28 00 74 02 0f 0b 48 8b 50 08 80 e6 df 48 89 50 
Sep  6 09:58:18 Tower kernel: RIP  [<ffffffffa018bd23>] unraid_make_request+0x34a/0x51e [md_mod]
Sep  6 09:58:18 Tower kernel:  RSP <ffff88010ad13d90>
Sep  6 09:58:18 Tower kernel: ---[ end trace 33fcc705af08aae6 ]---
Sep  6 09:58:18 Tower kernel: BUG: unable to handle kernel paging request at ffffffffffffffd8
Sep  6 09:58:48 Tower kernel: IP: [<ffffffff81058e81>] kthread_data+0xc/0x11
Sep  6 09:58:48 Tower kernel: PGD 1742067 PUD 1744067 PMD 0 
Sep  6 09:58:48 Tower kernel: Oops: 0000 [#2] SMP 
Sep  6 09:58:48 Tower kernel: Modules linked in: veth xt_nat ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 nf_nat iptable_filter ip_tables md_mod btrfs xor lzo_compress raid6_pq vhost_net vhost tun i2c_i801 e1000e ptp pps_core ahci libahci
Sep  6 09:58:48 Tower kernel: CPU: 0 PID: 6980 Comm: kworker/0:5 Tainted: G      D       3.16.0-unRAID #6
Sep  6 09:58:48 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AD2550R/U3S3, BIOS P1.20 07/22/2013
Sep  6 09:58:48 Tower kernel: task: ffff8800d80e87e0 ti: ffff88010ad10000 task.ti: ffff88010ad10000
Sep  6 09:58:48 Tower kernel: RIP: 0010:[<ffffffff81058e81>]  [<ffffffff81058e81>] kthread_data+0xc/0x11
Sep  6 09:58:48 Tower kernel: RSP: 0018:ffff88010ad139d8  EFLAGS: 00010002
Sep  6 09:58:48 Tower kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff818f1000
Sep  6 09:58:48 Tower kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800d80e87e0
Sep  6 09:58:48 Tower kernel: RBP: ffff88010ad139e8 R08: 00000000001016ff R09: 000000000000b88e
Sep  6 09:58:48 Tower kernel: R10: 0000000000002a17 R11: 000000000000ba23 R12: 0000000000000000
Sep  6 09:58:48 Tower kernel: R13: ffff8800d80e8c08 R14: ffff88010ad13b40 R15: ffff8800d80e87e0
Sep  6 09:58:48 Tower kernel: FS:  0000000000000000(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
Sep  6 09:58:48 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Sep  6 09:58:48 Tower kernel: CR2: 0000000000000028 CR3: 00000000d9e8c000 CR4: 00000000000007f0
Sep  6 09:58:48 Tower kernel: Stack:
Sep  6 09:58:48 Tower kernel:  ffffffff81053dcb ffff88011fc12c40 ffff88010ad13ae8 ffffffff81554a3c
Sep  6 09:58:48 Tower kernel:  ffff88010ad13fd8 ffff8800d80e87e0 0000000000012c40 ffff88011ac04240
Sep  6 09:58:48 Tower kernel:  ffff88010ad13a50 0000000000000046 0000000000000001 ffff8800893f9900

<truncated>

 

There is a bunch of noise after the start of the crash that I had to truncate, but the complete syslog is attached.

 

Note the driver fault cited for unraid.c:

Sep  6 09:58:18 Tower kernel: kernel BUG at drivers/md/unraid.c:461!
Sep  6 09:58:18 Tower kernel: invalid opcode: 0000 [#1] SMP 

 

this is on 6b8, but the crashes started for me in 6b7 while migrating drives to newly xfs formatted drives.  Never had a crash in 6b6 or earlier.

syslog_9_6.txt

Thank you for the very detailed report!

 

Yes I found a race condition in the unraid driver.  This only happens when xfs is the target (but bug is not with xfs, it's with unraid driver).  Fix coming...

  • 2 weeks later...
  • Author

I haven't heard from L-T that this bug is fixed, but after an upgrade to 6b9 I have been able to transfer 1.2 TB of data across various drives, during a parity check.

 

I'm not sure if this issue should be marked 'solved' but it is a good sign.

 

pducharme and jphipps both had similar issues that might be worth comparing.

 

And by the way, the SYBA 4x SATA III card that was acting up was replaced by an identical unit and seems to be working well;  I have no complaints about this inexpensive card.

 

Dennis

I haven't heard from L-T that this bug is fixed, but after an upgrade to 6b9 I have been able to transfer 1.2 TB of data across various drives, during a parity check.

It's this line in the -beta9 change log:

 

- unraid: fix overlapped i/o detection race condition

 

Marking this solved.

 

Archived

This topic is now archived and is closed to further replies.

Account

Navigation

Search

Search

Configure browser push notifications

Chrome (Android)
  1. Tap the lock icon next to the address bar.
  2. Tap Permissions → Notifications.
  3. Adjust your preference.
Chrome (Desktop)
  1. Click the padlock icon in the address bar.
  2. Select Site settings.
  3. Find Notifications and adjust your preference.