jus7incase

Everything posted by jus7incase

  1. Got it. It is running. Will come back with the results in the evening. Since the problem is currently 100% repeatable, I suppose a few hours of memory testing would be enough? 6h?
  2. This kind of debugging is beyond me. I have not run a memory test yet; let me know how. FYI, this is a Supermicro board with ECC RAM. Would faulty ECC memory cause this, or would it be detected at a different layer?
  3. Another instance of the problem after rewiring the drives. Mover: /usr/local/sbin/mover 2>&1 | tee /boot/logmover.txt mover started skipping app/ moving download/ ./download/sabnzbd/complete/couchpotato/.DS_Store >f.stpog... download/sabnzbd/complete/couchpotato/.DS_Store rsync: write failed on "/mnt/user0/download/sabnzbd/complete/couchpotato/.DS_Store": No space left on device (28) rsync error: error in file IO (code 11) at receiver.c(302) [receiver=3.0.7] rsync: connection unexpectedly closed (29 bytes received so far) [sender] rsync error: error in rsync protocol data stream (code 12) at io.c(601) [sender=3.0.7] moving private/ ./somepath/somebigfile *** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2468 *** ======= Backtrace: ========= /lib/libc.so.6(+0x705aa)[0x400a65aa] /lib/libc.so.6(+0x73503)[0x400a9503] /lib/libc.so.6(cfree+0x70)[0x400ac6b0] rsync[0x807cd74] rsync[0x807de60] rsync[0x804f3aa] rsync[0x8050b5f] rsync[0x8051e56] rsync[0x8065825] rsync[0x80666ac] /lib/libc.so.6(__libc_start_main+0xe6)[0x4004cb86] rsync[0x804aad1] ======= Memory map: ======== 08048000-0809d000 r-xp 00000000 00:01 1518 /usr/bin/rsync 0809d000-080a1000 rwxp 00054000 00:01 1518 /usr/bin/rsync 080a1000-080f2000 rwxp 00000000 00:00 0 [heap] 40000000-4001d000 r-xp 00000000 00:01 3280 /lib/ld-2.11.1.so 4001d000-4001e000 r-xp 0001d000 00:01 3280 /lib/ld-2.11.1.so 4001e000-4001f000 rwxp 0001e000 00:01 3280 /lib/ld-2.11.1.so 4001f000-40020000 r-xp 00000000 00:00 0 [vdso] 40020000-40021000 rwxp 00000000 00:00 0 40028000-4002e000 r-xp 00000000 00:01 3596 /lib/libacl.so.1.1.0 4002e000-4002f000 rwxp 00005000 00:01 3596 /lib/libacl.so.1.1.0 4002f000-40035000 r-xp 00000000 00:01 3244 /lib/libpopt.so.0.0.0 40035000-40036000 rwxp 00006000 00:01 3244 /lib/libpopt.so.0.0.0 40036000-40192000 r-xp 00000000 00:01 3019 /lib/libc-2.11.1.so 40192000-40193000 ---p 0015c000 00:01 3019 /lib/libc-2.11.1.so 40193000-40195000 r-xp 0015c000 00:01 3019 /lib/libc-2.11.1.so 
40195000-40196000 rwxp 0015e000 00:01 3019 /lib/libc-2.11.1.so 40196000-40199000 rwxp 00000000 00:00 0 40199000-4019d000 r-xp 00000000 00:01 3208 /lib/libattr.so.1.1.0 4019d000-4019e000 rwxp 00003000 00:01 3208 /lib/libattr.so.1.1.0 4019e000-401a0000 rwxp 00000000 00:00 0 401a0000-401f6000 r-xp 00000000 00:01 2221 /usr/lib/locale/locale-archive 401f6000-40235000 r-xp 00000000 00:01 2233 /usr/lib/locale/en_US.utf8/LC_CTYPE 40235000-40297000 rwxp 00000000 00:00 0 40297000-402b3000 r-xp 00000000 00:01 2158 /usr/lib/libgcc_s.so.1 402b3000-402b4000 rwxp 0001b000 00:01 2158 /usr/lib/libgcc_s.so.1 40300000-40321000 rwxp 00000000 00:00 0 40321000-40400000 ---p 00000000 00:00 0 bfeda000-bfefb000 rw-p 00000000 00:00 0 [stack] find: `rsync' terminated by signal 6 rsync: writefd_unbuffered failed to write 79 bytes to socket [Receiver]: Broken pipe (32) rsync error: error in rsync protocol data stream (code 12) at io.c(1530) [Receiver=3.0.7] ./somepath/someotherbigfile *** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2450 *** ======= Backtrace: ========= /lib/libc.so.6(+0x705aa)[0x400a65aa] /lib/libc.so.6(+0x73503)[0x400a9503] /lib/libc.so.6(cfree+0x70)[0x400ac6b0] rsync[0x807cd74] rsync[0x807de60] rsync[0x804f3aa] rsync[0x8050b5f] rsync[0x8051e56] rsync[0x8065825] rsync[0x80666ac] /lib/libc.so.6(__libc_start_main+0xe6)[0x4004cb86] rsync[0x804aad1] ======= Memory map: ======== 08048000-0809d000 r-xp 00000000 00:01 1518 /usr/bin/rsync 0809d000-080a1000 rwxp 00054000 00:01 1518 /usr/bin/rsync 080a1000-080f2000 rwxp 00000000 00:00 0 [heap] 40000000-4001d000 r-xp 00000000 00:01 3280 /lib/ld-2.11.1.so 4001d000-4001e000 r-xp 0001d000 00:01 3280 /lib/ld-2.11.1.so 4001e000-4001f000 rwxp 0001e000 00:01 3280 /lib/ld-2.11.1.so 4001f000-40020000 r-xp 00000000 00:00 0 [vdso] 40020000-40021000 rwxp 00000000 00:00 0 40028000-4002e000 r-xp 00000000 00:01 3596 /lib/libacl.so.1.1.0 4002e000-4002f000 rwxp 00005000 00:01 3596 /lib/libacl.so.1.1.0 4002f000-40035000 r-xp 
00000000 00:01 3244 /lib/libpopt.so.0.0.0 40035000-40036000 rwxp 00006000 00:01 3244 /lib/libpopt.so.0.0.0 40036000-40192000 r-xp 00000000 00:01 3019 /lib/libc-2.11.1.so 40192000-40193000 ---p 0015c000 00:01 3019 /lib/libc-2.11.1.so 40193000-40195000 r-xp 0015c000 00:01 3019 /lib/libc-2.11.1.so 40195000-40196000 rwxp 0015e000 00:01 3019 /lib/libc-2.11.1.so 40196000-40199000 rwxp 00000000 00:00 0 40199000-4019d000 r-xp 00000000 00:01 3208 /lib/libattr.so.1.1.0 4019d000-4019e000 rwxp 00003000 00:01 3208 /lib/libattr.so.1.1.0 4019e000-401a0000 rwxp 00000000 00:00 0 401a0000-401f6000 r-xp 00000000 00:01 2221 /usr/lib/locale/locale-archive 401f6000-40235000 r-xp 00000000 00:01 2233 /usr/lib/locale/en_US.utf8/LC_CTYPE 40235000-40297000 rwxp 00000000 00:00 0 40297000-402b3000 r-xp 00000000 00:01 2158 /usr/lib/libgcc_s.so.1 402b3000-402b4000 rwxp 0001b000 00:01 2158 /usr/lib/libgcc_s.so.1 40300000-40321000 rwxp 00000000 00:00 0 40321000-40400000 ---p 00000000 00:00 0 bf83e000-bf85f000 rw-p 00000000 00:00 0 [stack] find: `rsync' terminated by signal 6 rsync: writefd_unbuffered failed to write 79 bytes to socket [Receiver]: Broken pipe (32) rsync error: error in rsync protocol data stream (code 12) at io.c(1530) [Receiver=3.0.7] find: `fuser' terminated by signal 9 Message from syslogd@unRAID at Fri May 29 13:15:28 2015 ... unRAID kernel: Process fuser (pid: 2342, ti=eec72000 task=f725e880 task.ti=eec72000) Message from syslogd@unRAID at Fri May 29 13:15:28 2015 ... unRAID kernel: Stack: Message from syslogd@unRAID at Fri May 29 13:15:28 2015 ... unRAID kernel: Call Trace: Message from syslogd@unRAID at Fri May 29 13:15:28 2015 ... unRAID kernel: Code: 00 89 55 f0 e8 78 7c fd ff 8b 55 f0 85 c0 0f 84 98 00 00 00 8b 48 04 3b 11 0f 83 88 00 00 00 8b 49 04 8d 14 91 8b 12 85 d2 74 7c <8b> 7a 28 e8 76 7c fd ff 8d 83 d0 02 00 00 e8 84 c2 31 00 8b 83 Message from syslogd@unRAID at Fri May 29 13:15:28 2015 ... 
unRAID kernel: EIP: [<c10c9931>] tid_fd_revalidate+0x56/0x12a SS:ESP 0068:eec73eac Syslog: May 29 13:15:26 unRAID shfs/user0: shfs_write: write: (28) No space left on device May 29 13:15:28 unRAID kernel: BUG: unable to handle kernel paging request at 2e6b6c89 May 29 13:15:28 unRAID kernel: IP: [<c10c9931>] tid_fd_revalidate+0x56/0x12a May 29 13:15:28 unRAID kernel: *pdpt = 0000000030d7e001 *pde = 0000000000000000 May 29 13:15:28 unRAID kernel: Oops: 0000 [#1] SMP Message from syslogd@unRAID at Fri May 29 13:15:28 2015 ... unRAID kernel: Process fuser (pid: 2342, ti=eec72000 task=f725e880 task.ti=eec72000) Message from syslogd@unRAID at Fri May 29 13:15:28 2015 ... unRAID kernel: Stack: Message from syslogd@unRAID at Fri May 29 13:15:28 2015 ... unRAID kernel: Call Trace: May 29 13:15:28 unRAID kernel: Modules linked in: ntfs md_mod sit2fe(O) m88ds3103(O) cx25840(O) sg cx23885(O) rc_core(O) videobuf_dma_sg(O) snd_pcm snd_timer snd_page_alloc cx2341x(O) v4l2_common(O) i2c_i801 coretemp hwmon e1000e ptp pps_core videodev(O) tda18271(O) snd soundcore videobuf_dvb(O) ahci libahci dvb_core(O) videobuf_core(O) btcx_risc(O) tveeprom(O) i2c_core [last unloaded: md_mod] May 29 13:15:28 unRAID kernel: Pid: 2342, comm: fuser Tainted: G O 3.9.6p-unRAID #1 Supermicro X9SCL/X9SCM/X9SCL/X9SCM May 29 13:15:28 unRAID kernel: EIP: 0060:[<c10c9931>] EFLAGS: 00010202 CPU: 2 May 29 13:15:28 unRAID kernel: EIP is at tid_fd_revalidate+0x56/0x12a May 29 13:15:28 unRAID kernel: EAX: f3681200 EBX: f3d02be0 ECX: f0ebdc00 EDX: 2e6b6c61 May 29 13:15:28 unRAID kernel: ESI: f6750440 EDI: ee961000 EBP: eec73ebc ESP: eec73eac May 29 13:15:28 unRAID kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 May 29 13:15:28 unRAID kernel: CR0: 80050033 CR2: 2e6b6c89 CR3: 2edcd000 CR4: 000407f0 May 29 13:15:28 unRAID kernel: DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 May 29 13:15:28 unRAID kernel: DR6: ffff0ff0 DR7: 00000400 May 29 13:15:28 unRAID kernel: Process fuser (pid: 2342, ti=eec72000 
task=f725e880 task.ti=eec72000) May 29 13:15:28 unRAID kernel: Stack: May 29 13:15:28 unRAID kernel: 0000002a f6750440 ee961000 f0f93240 eec73ecc c10c9ad1 ee87d300 ee961000 May 29 13:15:28 unRAID kernel: eec73f0c c10c6de0 0000002a 0000000d c14bcfba eec73f3b eec73f48 eec73f3d May 29 13:15:28 unRAID kernel: c109ac64 eec73f90 00003234 00000002 eec73f3b f0f93240 0000002a f3681200 May 29 13:15:28 unRAID kernel: Call Trace: May 29 13:15:28 unRAID kernel: [<c10c9ad1>] proc_fd_instantiate+0x6a/0x74 May 29 13:15:28 unRAID kernel: [<c10c6de0>] proc_fill_cache+0x66/0xf9 May 29 13:15:28 unRAID kernel: [<c109ac64>] ? sys_ioctl+0x50/0x50 May 29 13:15:28 unRAID kernel: [<c10c9756>] proc_readfd_common+0x15d/0x1a4 May 29 13:15:28 unRAID kernel: [<c10c9a67>] ? proc_fdinfo_instantiate+0x62/0x62 May 29 13:15:28 unRAID kernel: [<c109ac64>] ? sys_ioctl+0x50/0x50 May 29 13:15:28 unRAID kernel: [<c10c97c3>] proc_readfd+0x12/0x14 May 29 13:15:28 unRAID kernel: [<c10c9a67>] ? proc_fdinfo_instantiate+0x62/0x62 May 29 13:15:28 unRAID kernel: [<c109af56>] vfs_readdir+0x52/0x7a May 29 13:15:28 unRAID kernel: [<c109ac64>] ? sys_ioctl+0x50/0x50 May 29 13:15:28 unRAID kernel: [<c109b0e8>] sys_getdents64+0x62/0xba May 29 13:15:28 unRAID kernel: [<c13e5f40>] syscall_call+0x7/0xb May 29 13:15:28 unRAID kernel: Code: 00 89 55 f0 e8 78 7c fd ff 8b 55 f0 85 c0 0f 84 98 00 00 00 8b 48 04 3b 11 0f 83 88 00 00 00 8b 49 04 8d 14 91 8b 12 85 d2 74 7c <8b> 7a 28 e8 76 7c fd ff 8d 83 d0 02 00 00 e8 84 c2 31 00 8b 83 May 29 13:15:28 unRAID kernel: EIP: [<c10c9931>] tid_fd_revalidate+0x56/0x12a SS:ESP 0068:eec73eac May 29 13:15:28 unRAID kernel: CR2: 000000002e6b6c89 May 29 13:15:28 unRAID kernel: ---[ end trace a6212521814ae9ae ]--- Message from syslogd@unRAID at Fri May 29 13:15:28 2015 ... 
unRAID kernel: Code: 00 89 55 f0 e8 78 7c fd ff 8b 55 f0 85 c0 0f 84 98 00 00 00 8b 48 04 3b 11 0f 83 88 00 00 00 8b 49 04 8d 14 91 8b 12 85 d2 74 7c <8b> 7a 28 e8 76 7c fd ff 8d 83 d0 02 00 00 e8 84 c2 31 00 8b 83 Message from syslogd@unRAID at Fri May 29 13:15:28 2015 ... unRAID kernel: EIP: [<c10c9931>] tid_fd_revalidate+0x56/0x12a SS:ESP 0068:eec73eac
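The `(28)` in the rsync and shfs messages above is the errno value, and on Linux errno 28 is ENOSPC ("No space left on device"). A minimal Python sketch of a pre-flight check that could be run before a move is shown below. The helper name `has_room` and the injectable `usage` parameter are hypothetical, chosen only for illustration. The check reports free space for whatever filesystem backs the given path; whether that matches the individual disk a user share ends up writing to is an assumption this sketch cannot verify.

```python
import errno
import shutil
from collections import namedtuple


def has_room(dest_dir, needed_bytes, usage=shutil.disk_usage):
    """Return True if the filesystem behind dest_dir reports enough free bytes.

    `usage` is injectable so the function can be exercised without a real disk.
    """
    return usage(dest_dir).free >= needed_bytes


# The "(28)" printed by rsync and shfs is this errno value on Linux.
print(errno.ENOSPC)

# Hypothetical stand-in for shutil.disk_usage, used only for the demo:
FakeUsage = namedtuple("FakeUsage", "total used free")
nearly_full = lambda path: FakeUsage(total=100, used=95, free=5)

print(has_room("/mnt/user0", 10, usage=nearly_full))  # not enough room
print(has_room("/mnt/user0", 5, usage=nearly_full))   # just enough room
```

On the real system the call would simply be `has_room("/mnt/user0", size_of_file)`, with the size taken from `os.path.getsize` on the source file.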
  4. Since the log above may not be very helpful, I tried to reproduce the problem and capture more of the log. I found the following in the mover log:

     rsync: write failed on "/mnt/user0/download/sabnzbd/complete/couchpotato/.DS_Store": No space left on device (28)
     rsync error: error in file IO (code 11) at receiver.c(302) [receiver=3.0.7]
     rsync: connection unexpectedly closed (29 bytes received so far) [sender]
     rsync error: error in rsync protocol data stream (code 12) at io.c(601) [sender=3.0.7]

     After that it carries on, though, and when copying a large file it breaks down like this (snippet from syslog):

     May 29 12:02:15 unRAID shfs/user0: shfs_write: write: (28) No space left on device
     May 29 12:02:18 unRAID rsync: *** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2450 ***
     May 29 12:02:19 unRAID rsync: *** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2450 ***
     May 29 12:02:20 unRAID rsync: *** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2458 ***
     May 29 12:02:21 unRAID rsync: *** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2450 ***
     May 29 12:02:22 unRAID rsync: *** glibc detected *** rsync: free(): invalid next size (normal): 0x080d24a8 ***
     May 29 12:02:23 unRAID kernel: BUG: unable to handle kernel paging request at 2e6b6c89
     May 29 12:02:23 unRAID kernel: IP: [<c10c9931>] tid_fd_revalidate+0x56/0x12a
     May 29 12:02:23 unRAID kernel: *pdpt = 0000000030ff3001 *pde = 0000000000000000
     May 29 12:02:23 unRAID kernel: Oops: 0000 [#1] SMP
     May 29 12:02:23 unRAID kernel: Modules linked in: md_mod sit2fe(O) m88ds3103(O) cx25840(O) sg cx23885(O) rc_core(O) videobuf_dma_sg(O) snd_pcm snd_timer snd_page_alloc cx2341x(O) v4l2_common(O) i2c_i801 ahci libahci coretemp hwmon videodev(O) tda18271(O) snd soundcore videobuf_dvb(O) e1000e dvb_core(O) videobuf_core(O) btcx_risc(O) ptp tveeprom(O) i2c_core pps_core [last unloaded: md_mod]
     May 29 12:02:23 unRAID kernel: Pid: 4008, comm: fuser
Tainted: G O 3.9.6p-unRAID #1 Supermicro X9SCL/X9SCM/X9SCL/X9SCM May 29 12:02:23 unRAID kernel: EIP: 0060:[<c10c9931>] EFLAGS: 00010202 CPU: 1 May 29 12:02:23 unRAID kernel: EIP is at tid_fd_revalidate+0x56/0x12a May 29 12:02:23 unRAID kernel: EAX: f740a300 EBX: f3498000 ECX: f0eda000 EDX: 2e6b6c61 May 29 12:02:23 unRAID kernel: ESI: eeea54c0 EDI: eeea0380 EBP: f71d1ebc ESP: f71d1eac May 29 12:02:23 unRAID kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 May 29 12:02:23 unRAID kernel: CR0: 80050033 CR2: 2e6b6c89 CR3: 3771a000 CR4: 000407f0 May 29 12:02:23 unRAID kernel: DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 May 29 12:02:23 unRAID kernel: DR6: ffff0ff0 DR7: 00000400 May 29 12:02:23 unRAID kernel: Process fuser (pid: 4008, ti=f71d0000 task=f34990e0 task.ti=f71d0000) May 29 12:02:23 unRAID kernel: Stack: May 29 12:02:23 unRAID kernel: 00000004 eeea54c0 eeea0380 f72979c0 f71d1ecc c10c9ad1 eed57680 eeea0380 May 29 12:02:23 unRAID kernel: f71d1f0c c10c6de0 00000004 0000000d c14bcfba f71d1f3b f71d1f48 f71d1f3c May 29 12:02:23 unRAID kernel: c109ac64 f71d1f90 00000034 00000001 f71d1f3b f72979c0 00000004 f740a300 May 29 12:02:23 unRAID kernel: Call Trace: May 29 12:02:23 unRAID kernel: [<c10c9ad1>] proc_fd_instantiate+0x6a/0x74 May 29 12:02:23 unRAID kernel: [<c10c6de0>] proc_fill_cache+0x66/0xf9 May 29 12:02:23 unRAID kernel: [<c109ac64>] ? sys_ioctl+0x50/0x50 May 29 12:02:23 unRAID kernel: [<c10c9756>] proc_readfd_common+0x15d/0x1a4 May 29 12:02:23 unRAID kernel: [<c10c9a67>] ? proc_fdinfo_instantiate+0x62/0x62 May 29 12:02:23 unRAID kernel: [<c109ac64>] ? sys_ioctl+0x50/0x50 May 29 12:02:23 unRAID kernel: [<c1096258>] ? final_putname+0x2d/0x30 May 29 12:02:23 unRAID kernel: [<c10c97c3>] proc_readfd+0x12/0x14 May 29 12:02:23 unRAID kernel: [<c10c9a67>] ? proc_fdinfo_instantiate+0x62/0x62 May 29 12:02:23 unRAID kernel: [<c109af56>] vfs_readdir+0x52/0x7a May 29 12:02:23 unRAID kernel: [<c109ac64>] ? 
sys_ioctl+0x50/0x50 May 29 12:02:23 unRAID kernel: [<c109b0e8>] sys_getdents64+0x62/0xba May 29 12:02:23 unRAID kernel: [<c13e5f40>] syscall_call+0x7/0xb May 29 12:02:23 unRAID kernel: Code: 00 89 55 f0 e8 78 7c fd ff 8b 55 f0 85 c0 0f 84 98 00 00 00 8b 48 04 3b 11 0f 83 88 00 00 00 8b 49 04 8d 14 91 8b 12 85 d2 74 7c <8b> 7a 28 e8 76 7c fd ff 8d 83 d0 02 00 00 e8 84 c2 31 00 8b 83 May 29 12:02:23 unRAID kernel: EIP: [<c10c9931>] tid_fd_revalidate+0x56/0x12a SS:ESP 0068:f71d1eac May 29 12:02:23 unRAID kernel: CR2: 000000002e6b6c89 May 29 12:02:23 unRAID kernel: ---[ end trace 7f01c1191ac43037 ]--- May 29 12:06:47 unRAID kernel: BUG: unable to handle kernel NULL pointer dereference at 00000050 May 29 12:06:47 unRAID kernel: IP: [<c109ccc5>] d_path+0x16/0x100 May 29 12:06:47 unRAID kernel: *pdpt = 000000002c868001 *pde = 0000000000000000 May 29 12:06:47 unRAID kernel: Oops: 0000 [#2] SMP May 29 12:06:47 unRAID kernel: Modules linked in: md_mod sit2fe(O) m88ds3103(O) cx25840(O) sg cx23885(O) rc_core(O) videobuf_dma_sg(O) snd_pcm snd_timer snd_page_alloc cx2341x(O) v4l2_common(O) i2c_i801 ahci libahci coretemp hwmon videodev(O) tda18271(O) snd soundcore videobuf_dvb(O) e1000e dvb_core(O) videobuf_core(O) btcx_risc(O) ptp tveeprom(O) i2c_core pps_core [last unloaded: md_mod] May 29 12:06:47 unRAID kernel: Pid: 4135, comm: ps Tainted: G D O 3.9.6p-unRAID #1 Supermicro X9SCL/X9SCM/X9SCL/X9SCM May 29 12:06:47 unRAID kernel: EIP: 0060:[<c109ccc5>] EFLAGS: 00010292 CPU: 3 May 29 12:06:47 unRAID kernel: EIP is at d_path+0x16/0x100 May 29 12:06:47 unRAID kernel: EAX: 00000000 EBX: 0000007f ECX: 00001000 EDX: eef40000 May 29 12:06:47 unRAID kernel: ESI: ec82ff54 EDI: eee54de0 EBP: ec82ff48 ESP: ec82ff2c May 29 12:06:47 unRAID kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 May 29 12:06:47 unRAID kernel: CR0: 80050033 CR2: 00000050 CR3: 3094c000 CR4: 000407f0 May 29 12:06:47 unRAID kernel: DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 May 29 12:06:47 unRAID 
kernel: DR6: ffff0ff0 DR7: 00000400 May 29 12:06:47 unRAID kernel: Process ps (pid: 4135, ti=ec82e000 task=f0a91e60 task.ti=ec82e000) May 29 12:06:47 unRAID kernel: Stack: May 29 12:06:47 unRAID kernel: 00000000 00001000 eef41000 eee54de0 ec82ff48 0000007f eef40000 ec82ff68 May 29 12:06:47 unRAID kernel: c10c549e 40033d40 00000000 00000000 00004000 ec82ff80 c13ee380 ec82ff94 May 29 12:06:47 unRAID kernel: c1092396 ec82ff80 ec82ff7c eee54de0 00000000 f34ff0d0 eed56f80 bfbf9e4c May 29 12:06:47 unRAID kernel: Call Trace: May 29 12:06:47 unRAID kernel: [<c10c549e>] proc_pid_readlink+0x4d/0xa6 May 29 12:06:47 unRAID kernel: [<c1092396>] sys_readlinkat+0x76/0xab May 29 12:06:47 unRAID kernel: [<c10923f2>] sys_readlink+0x27/0x29 May 29 12:06:47 unRAID kernel: [<c13e5f40>] syscall_call+0x7/0xb May 29 12:06:47 unRAID kernel: Code: ff 19 c0 83 c0 02 eb 05 b8 02 00 00 00 83 c4 24 5b 5e 5f 5d c3 55 89 e5 56 89 c6 53 8d 04 0a 83 ec 14 89 45 ec 8b 46 04 89 4d e8 <8b> 58 50 85 db 74 0e 8b 5b 20 85 db 74 07 ff d3 e9 ce 00 00 00 May 29 12:06:47 unRAID kernel: EIP: [<c109ccc5>] d_path+0x16/0x100 SS:ESP 0068:ec82ff2c May 29 12:06:47 unRAID kernel: CR2: 0000000000000050 May 29 12:06:47 unRAID kernel: ---[ end trace 7f01c1191ac43038 ]--- In the terminal where the mover is running it looks like this: *** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2498 *** ======= Backtrace: ========= /lib/libc.so.6(+0x705aa)[0x400a65aa] /lib/libc.so.6(+0x73503)[0x400a9503] /lib/libc.so.6(cfree+0x70)[0x400ac6b0] rsync[0x807cd74] rsync[0x807de60] rsync[0x804f3aa] rsync[0x8050b5f] rsync[0x8051e56] rsync[0x8065825] rsync[0x80666ac] /lib/libc.so.6(__libc_start_main+0xe6)[0x4004cb86] rsync[0x804aad1] ======= Memory map: ======== 08048000-0809d000 r-xp 00000000 00:01 2536 /usr/bin/rsync 0809d000-080a1000 rwxp 00054000 00:01 2536 /usr/bin/rsync 080a1000-080f2000 rwxp 00000000 00:00 0 [heap] 40000000-4001d000 r-xp 00000000 00:01 4298 /lib/ld-2.11.1.so 4001d000-4001e000 r-xp 0001d000 
00:01 4298 /lib/ld-2.11.1.so 4001e000-4001f000 rwxp 0001e000 00:01 4298 /lib/ld-2.11.1.so

     This terminal is frozen, but this time there was no kernel panic. Can you help?
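Since the dumps are long, it can help to reduce them to the faulting symbol and fault address of each oops and check whether they repeat. The following is a rough Python sketch that does this with two regular expressions. The function name `summarize_oopses` is made up for illustration, and the patterns are only known to match the message wording seen in these captures. The sample text is a few lines copied from the logs in this thread (the 12:02 and 12:06 oopses from this post and the 13:15 one from the later capture).

```python
import re
from collections import Counter

# "EIP is at tid_fd_revalidate+0x56/0x12a" -> symbol+offset/size
EIP_RE = re.compile(r"EIP is at (\S+)")
# "BUG: unable to handle kernel paging request at 2e6b6c89" -> fault address
BUG_RE = re.compile(
    r"BUG: unable to handle kernel "
    r"(?:paging request|NULL pointer dereference) at ([0-9a-f]+)"
)


def summarize_oopses(text):
    """Count faulting symbols and fault addresses in a syslog excerpt.

    Uses findall over the whole text, so it also works when several
    log entries have been pasted onto one physical line.
    """
    return Counter(EIP_RE.findall(text)), Counter(BUG_RE.findall(text))


SAMPLE = (
    "May 29 12:02:23 unRAID kernel: BUG: unable to handle kernel paging request at 2e6b6c89 "
    "May 29 12:02:23 unRAID kernel: EIP is at tid_fd_revalidate+0x56/0x12a "
    "May 29 12:06:47 unRAID kernel: BUG: unable to handle kernel NULL pointer dereference at 00000050 "
    "May 29 12:06:47 unRAID kernel: EIP is at d_path+0x16/0x100 "
    "May 29 13:15:28 unRAID kernel: BUG: unable to handle kernel paging request at 2e6b6c89 "
    "May 29 13:15:28 unRAID kernel: EIP is at tid_fd_revalidate+0x56/0x12a"
)

symbols, addresses = summarize_oopses(SAMPLE)
for name, count in symbols.most_common():
    print(count, name)
for addr, count in addresses.most_common():
    print(count, addr)
```

In these captures the first oops of each run lands in the same function at the same fault address, which fits the observation that the problem is repeatable.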
  5. Folks, I have now definitely narrowed the kernel panic down to the mover. I scheduled it and monitored the syslog with tail -f. Here is what I could capture via telnet and on the console. After that the machine was unresponsive. Alas, the error dump that follows at the bottom is opaque to me so far. If anyone can make sense of it, please go ahead.

     For the record: the faulty 3TB drive I reported above is not assigned to the array, but it is connected to the Supermicro board on a SATA3 connector. The 8TB parity drive is connected to the Supermicro board on SATA3, the 250GB Samsung cache drive to SATA2 on the Supermicro board, and the 8TB data drive to SATA2 on the Supermicro board. The rest of the drives are connected to Supermicro SATA2 ports and to the Digitus controller. I would assume the mover copies from the 250GB cache drive to the new 8TB data drive, thereby also accessing the 8TB parity drive. Boom.

     May 28 20:41:00 unRAID kernel: df7d1440 f3671f00 df7d1440 c82e3d50 c10a16aa f0ee8380 df7d1710 c82e3da4
     May 28 20:41:00 unRAID kernel: c102acca 00000000 00000009 00000246 c82e3ec0 c82e3d88 00000001 00000000
     May 28 20:41:00 unRAID kernel: Call Trace:
     May 28 20:41:00 unRAID kernel: [<c10a1604>] put_files_struct+0x55/0x8e
     May 28 20:41:00 unRAID kernel: [<c10a16aa>] exit_files+0x34/0x38
     May 28 20:41:00 unRAID kernel: [<c102acca>] do_exit+0x2c7/0x73e
     May 28 20:41:00 unRAID kernel: [<c1027090>] ? print_oops_end_marker+0x2a/0x2c
     May 28 20:41:00 unRAID kernel: [<c1004c3a>] oops_end+0x79/0x7e
     May 28 20:41:00 unRAID kernel: [<c13df945>] no_context+0x1ad/0x1b5
     May 28 20:41:00 unRAID kernel: [<c13dfbd9>] __bad_area_nosemaphore+0x125/0x12d
     May 28 20:41:00 unRAID kernel: [<c13dfc69>] bad_area+0x37/0x3d
     May 28 20:41:00 unRAID kernel: [<c10209d7>] __do_page_fault+0x1bf/0x391
     May 28 20:41:00 unRAID kernel: [<c104466f>] ? sched_clock_cpu+0x3f/0x15e
     May 28 20:41:00 unRAID kernel: [<c1020bae>] ?
vmalloc_sync_all+0x5/0x5 May 28 20:41:00 unRAID kernel: [<c1020bb6>] do_page_fault+0x8/0xa May 28 20:41:00 unRAID kernel: [<c13e63ea>] error_code+0x5a/0x60 May 28 20:41:00 unRAID kernel: [<c104007b>] ? clean_sort_range+0x11/0xc8 May 28 20:41:00 unRAID kernel: [<c10a007b>] ? iput+0x67/0xe5 May 28 20:41:00 unRAID kernel: [<c1020bae>] ? vmalloc_sync_all+0x5/0x5 May 28 20:41:00 unRAID kernel: [<c10a14fa>] ? dup_fd+0x13d/0x1ca May 28 20:41:00 unRAID kernel: [<c1025d8a>] copy_process.part.60+0x3db/0xd1d May 28 20:41:00 unRAID kernel: [<c10267a4>] do_fork+0xbb/0x20d May 28 20:41:00 unRAID kernel: [<c102698d>] sys_clone+0x20/0x22 May 28 20:41:00 unRAID kernel: [<c13e5f40>] syscall_call+0x7/0xb May 28 20:41:00 unRAID kernel: Code: 00 8d 90 01 02 00 00 83 fa 01 77 07 b8 fc ff ff ff eb 0d 89 c2 83 e2 fd 81 fa fc fd ff ff 74 ec 5d c3 55 89 e5 57 56 53 89 c3 51 <8b> 40 20 85 c0 75 10 c7 04 24 42 c6 49 c1 31 f6 e8 b6 2c 35 00 May 28 20:41:00 unRAID kernel: EIP: [<c108d19d>] filp_close+0x9/0x61 SS:ESP 0068:c82e3d14 May 28 20:41:00 unRAID kernel: CR2: 0000000073750062 May 28 20:41:00 unRAID kernel: ---[ end trace 1d6206f196c7d87a ]--- May 28 20:41:00 unRAID kernel: Fixing recursive fault but reboot is needed! Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ... unRAID kernel: Call Trace: Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ... unRAID kernel: Stack: Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ... unRAID kernel: Process find (pid: 12432, ti=c82e2000 task=df7d1440 task.ti=c82e2000) Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ... unRAID kernel: Code: 00 00 00 89 55 dc 99 f7 f9 8b 55 e0 8b 72 0c 89 c1 f3 a4 89 c1 8b 72 08 31 d2 8b 7b 08 f3 a4 eb 1d 8b 4d dc 8b 04 91 85 c0 74 06 <f0> ff 40 20 eb 06 8b 4b 0c 0f b3 11 8b 4d e4 89 04 91 42 3b 55 Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ... unRAID kernel: Call Trace: Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ... 
unRAID kernel: EIP: [<c108d19d>] filp_close+0x9/0x61 SS:ESP 0068:c82e3d14 Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ... unRAID kernel: EIP: [<c10a14fa>] dup_fd+0x13d/0x1ca SS:ESP 0068:c82e3efc Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ... unRAID kernel: Stack: Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ... unRAID kernel: Process find (pid: 12432, ti=c82e2000 task=df7d1440 task.ti=c82e2000) Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ... unRAID kernel: Code: 00 8d 90 01 02 00 00 83 fa 01 77 07 b8 fc ff ff ff eb 0d 89 c2 83 e2 fd 81 fa fc fd ff ff 74 ec 5d c3 55 89 e5 57 56 53 89 c3 51 <8b> 40 20 85 c0 75 10 c7 04 24 42 c6 49 c1 31 f6 e8 b6 2c 35 00
  6. This is what I get after the pre-clear, basically the same figures as captured during the pre-clear:

     ** Changed attributes in files: /tmp/smart_start_sdc /tmp/smart_finish_sdc
     ATTRIBUTE                 NEW_VAL  OLD_VAL  FAILURE_THRESHOLD  STATUS       RAW_VALUE
     Raw_Read_Error_Rate     = 109      114      6                  ok           21345144
     Seek_Error_Rate         = 51       51       30                 near_thresh  1383022160179
     Spin_Retry_Count        = 100      100      97                 near_thresh  0
     End-to-End_Error        = 93       93       99                 FAILING_NOW  7
     Airflow_Temperature_Cel = 63       67       45                 near_thresh  37
     Temperature_Celsius     = 37       33       0                  ok           37

     *** Failing SMART Attributes in /tmp/smart_finish_sdc ***
     ID# ATTRIBUTE_NAME   FLAG   VALUE WORST THRESH TYPE    UPDATED WHEN_FAILED RAW_VALUE
     184 End-to-End_Error 0x0032 093   093   099    Old_age Always  FAILING_NOW 7

     What does feature 184 tell me? The other errors are quite old.

     Update: The drive is still under warranty, trying to get it replaced.
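On how to read the FAILING_NOW flag: smartctl marks an attribute as failed when its normalized VALUE is less than or equal to its THRESH, and here attribute 184 has VALUE 093 against THRESH 099. The Python sketch below applies that one rule to attribute rows in the smartctl table layout. The function name `failing_attributes` is hypothetical and the parser assumes the whitespace-separated column order shown in the report (ID, name, flag, VALUE, WORST, THRESH, ...). It says nothing about what the vendor-specific raw value of 7 means.

```python
def failing_attributes(report_text):
    """Return (id, name, value, thresh) for rows whose VALUE <= THRESH."""
    failing = []
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue
        attr_id, name = int(fields[0]), fields[1]
        value, thresh = int(fields[3]), int(fields[5])
        if value <= thresh:
            failing.append((attr_id, name, value, thresh))
    return failing


# Three rows copied from the report for this drive.
SAMPLE = """\
  5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
  7 Seek_Error_Rate 0x000f 051 047 030 Pre-fail Always - 1383022134666
184 End-to-End_Error 0x0032 093 093 099 Old_age Always FAILING_NOW 7
"""

for attr_id, name, value, thresh in failing_attributes(SAMPLE):
    print(f"{attr_id} {name}: value {value} <= threshold {thresh}")
```

Seek_Error_Rate is not flagged by this rule because 051 is still above its threshold of 030, which matches the near_thresh rather than FAILING_NOW status in the pre-clear summary.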
  7. Yeah, right... IN FACT that disk makes his system to go kernel panic. Smart ID #7 is BAD even for the disk firmware: a worst value touched of 60 (on a 100 basis) with a SMART BAD low limit fixed @ 30 for a 22 days disk is... BAD. IMHO. Sentence. Or probably this is the reason because I don't go 4 Seagates... never.

     After having dumped SMART reports for all drives, I can tell that all Seagates report similar numbers. It seems to be normal for Seagate consumer-grade drives (not saying that this is good, though). Also note that the numbers for the WD and Samsung drives in my system are farther away from the failure threshold.
  8. Folks, I took a SMART report of all drives. Except for one drive, all other drives have logged no errors. The only drive that has logged errors is the former parity drive. The error seems to be old (at 44 days operation) and I never had parity problems with that drive. Also the feature 184 End-to-End Error reports "FAILING NOW"! I was intending to replace my old 250GB cache drive with this 3TB drive. Could someone please take a look if this drive is ok to use as cache drive? (I understand the cache content is not protected by parity, if not run in a pool, I don't have unRAID 6 yet) Thank you so much! PS: The report was taken while the drive is pre-clearing! Thus the high temp. It is usually cooler. smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build) Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Device Model: ST3000DM001-1CH166 Firmware Version: CC26 User Capacity: 3,000,592,982,016 bytes Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: ATA-8-ACS revision 4 Local Time is: Thu May 28 13:05:33 2015 CEST SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED See vendor-specific Attribute list for marginal Attributes. General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 584) seconds. Offline data collection capabilities: (0x73) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. No Offline surface scan supported. Self-test supported. Conveyance Self-test supported. 
Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 255) minutes. Conveyance self-test routine recommended polling time: ( 2) minutes. SCT capabilities: (0x3085) SCT Status supported. SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 113 099 006 Pre-fail Always - 55763472 3 Spin_Up_Time 0x0003 093 093 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 099 099 020 Old_age Always - 1969 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 051 047 030 Pre-fail Always - 1383022134666 9 Power_On_Hours 0x0032 084 084 000 Old_age Always - 14866 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 62 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 184 End-to-End_Error 0x0032 093 093 099 Old_age Always FAILING_NOW 7 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 189 High_Fly_Writes 0x003a 096 096 000 Old_age Always - 4 190 Airflow_Temperature_Cel 0x0022 062 055 045 Old_age Always - 38 (Min/Max 30/39) 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 11 193 Load_Cycle_Count 0x0032 081 081 000 Old_age Always - 38294 194 Temperature_Celsius 0x0022 038 045 000 Old_age Always - 38 (0 17 0 0) 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age 
Always - 0 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 182291296946429 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 72536194288 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 155726243753 SMART Error Log Version: 1 ATA Error Count: 5 CR = Command Register [HEX] FR = Features Register [HEX] SC = Sector Count Register [HEX] SN = Sector Number Register [HEX] CL = Cylinder Low Register [HEX] CH = Cylinder High Register [HEX] DH = Device/Head Register [HEX] DC = Device Command Register [HEX] ER = Error register [HEX] ST = Status register [HEX] Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days. Error 5 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- 25 00 08 ff ff ff ef 00 44d+20:48:56.856 READ DMA EXT ef 10 02 00 00 00 a0 00 44d+20:48:56.856 SET FEATURES [Reserved for Serial ATA] 27 00 00 00 00 00 e0 00 44d+20:48:56.856 READ NATIVE MAX ADDRESS EXT ec 00 00 00 00 00 a0 00 44d+20:48:56.855 IDENTIFY DEVICE ef 03 46 00 00 00 a0 00 44d+20:48:56.855 SET FEATURES [set transfer mode] Error 4 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours) When the command that caused the error occurred, the device was active or idle. 
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- 25 00 08 ff ff ff ef 00 44d+20:48:56.731 READ DMA EXT ef 10 02 00 00 00 a0 00 44d+20:48:56.731 SET FEATURES [Reserved for Serial ATA] 27 00 00 00 00 00 e0 00 44d+20:48:56.731 READ NATIVE MAX ADDRESS EXT ec 00 00 00 00 00 a0 00 44d+20:48:56.730 IDENTIFY DEVICE ef 03 46 00 00 00 a0 00 44d+20:48:56.730 SET FEATURES [set transfer mode] Error 3 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- 25 00 08 ff ff ff ef 00 44d+20:48:56.606 READ DMA EXT ef 10 02 00 00 00 a0 00 44d+20:48:56.606 SET FEATURES [Reserved for Serial ATA] 27 00 00 00 00 00 e0 00 44d+20:48:56.606 READ NATIVE MAX ADDRESS EXT ec 00 00 00 00 00 a0 00 44d+20:48:56.605 IDENTIFY DEVICE ef 03 46 00 00 00 a0 00 44d+20:48:56.605 SET FEATURES [set transfer mode] Error 2 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours) When the command that caused the error occurred, the device was active or idle. 
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- 25 00 08 ff ff ff ef 00 44d+20:48:56.451 READ DMA EXT 25 00 08 ff ff ff ef 00 44d+20:48:56.450 READ DMA EXT 25 00 08 ff ff ff ef 00 44d+20:48:56.443 READ DMA EXT 25 00 08 ff ff ff ef 00 44d+20:48:56.437 READ DMA EXT 25 00 08 ff ff ff ef 00 44d+20:48:56.430 READ DMA EXT Error 1 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 08 10 b2 06 e0 00 44d+20:48:55.925 READ DMA ca 00 08 98 8f 00 e0 00 44d+20:48:21.660 WRITE DMA c8 00 08 98 8f 00 e0 00 44d+20:48:21.660 READ DMA ca 00 c0 d8 8e 00 e0 00 44d+20:48:21.659 WRITE DMA ca 00 08 d0 8e 00 e0 00 44d+20:48:21.659 WRITE DMA SMART Self-test log structure revision number 1 No self-tests have been logged. [To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay.
  9. Folks, I took a SMART report of all drives. Except for one drive, none of them has logged any errors. The only drive that has logged errors is the former parity drive. The errors seem to be old (at 44 days of operation) and I never had parity problems with that drive. Also, attribute 184 (End-to-End_Error) reports "FAILING_NOW"! I was intending to replace my old 250GB cache drive with this 3TB drive. Could someone please take a look and tell me whether this drive is OK to use as a cache drive? (I understand the cache content is not protected by parity unless it runs in a pool; I don't have unRAID 6 yet.) Thank you so much! The error report from the pre-clear: ** Changed attributes in files: /tmp/smart_start_sdc /tmp/smart_finish_sdc ATTRIBUTE NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS RAW_VALUE Raw_Read_Error_Rate = 109 114 6 ok 21345144 Seek_Error_Rate = 51 51 30 near_thresh 1383022160179 Spin_Retry_Count = 100 100 97 near_thresh 0 End-to-End_Error = 93 93 99 FAILING_NOW 7 Airflow_Temperature_Cel = 63 67 45 near_thresh 37 Temperature_Celsius = 37 33 0 ok 37 *** Failing SMART Attributes in /tmp/smart_finish_sdc *** ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 184 End-to-End_Error 0x0032 093 093 099 Old_age Always FAILING_NOW 7 The following report was taken while the drive was pre-clearing, hence the high temperature. It is usually cooler. smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build) Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Device Model: ST3000DM001-1CH166 Firmware Version: CC26 User Capacity: 3,000,592,982,016 bytes Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: ATA-8-ACS revision 4 Local Time is: Thu May 28 13:05:33 2015 CEST SMART support is: Available - device has SMART capability. 
SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED See vendor-specific Attribute list for marginal Attributes. General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 584) seconds. Offline data collection capabilities: (0x73) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. No Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 255) minutes. Conveyance self-test routine recommended polling time: ( 2) minutes. SCT capabilities: (0x3085) SCT Status supported. 
SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 113 099 006 Pre-fail Always - 55763472 3 Spin_Up_Time 0x0003 093 093 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 099 099 020 Old_age Always - 1969 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 051 047 030 Pre-fail Always - 1383022134666 9 Power_On_Hours 0x0032 084 084 000 Old_age Always - 14866 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 62 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 184 End-to-End_Error 0x0032 093 093 099 Old_age Always FAILING_NOW 7 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 189 High_Fly_Writes 0x003a 096 096 000 Old_age Always - 4 190 Airflow_Temperature_Cel 0x0022 062 055 045 Old_age Always - 38 (Min/Max 30/39) 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 11 193 Load_Cycle_Count 0x0032 081 081 000 Old_age Always - 38294 194 Temperature_Celsius 0x0022 038 045 000 Old_age Always - 38 (0 17 0 0) 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 182291296946429 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 72536194288 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 155726243753 SMART Error Log Version: 1 ATA Error Count: 5 CR = Command Register [HEX] FR = Features Register [HEX] SC = Sector Count Register [HEX] SN = Sector Number Register [HEX] CL = Cylinder Low Register [HEX] CH = Cylinder High Register [HEX] DH = Device/Head Register 
[HEX] DC = Device Command Register [HEX] ER = Error register [HEX] ST = Status register [HEX] Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days. Error 5 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- 25 00 08 ff ff ff ef 00 44d+20:48:56.856 READ DMA EXT ef 10 02 00 00 00 a0 00 44d+20:48:56.856 SET FEATURES [Reserved for Serial ATA] 27 00 00 00 00 00 e0 00 44d+20:48:56.856 READ NATIVE MAX ADDRESS EXT ec 00 00 00 00 00 a0 00 44d+20:48:56.855 IDENTIFY DEVICE ef 03 46 00 00 00 a0 00 44d+20:48:56.855 SET FEATURES [set transfer mode] Error 4 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours) When the command that caused the error occurred, the device was active or idle. 
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- 25 00 08 ff ff ff ef 00 44d+20:48:56.731 READ DMA EXT ef 10 02 00 00 00 a0 00 44d+20:48:56.731 SET FEATURES [Reserved for Serial ATA] 27 00 00 00 00 00 e0 00 44d+20:48:56.731 READ NATIVE MAX ADDRESS EXT ec 00 00 00 00 00 a0 00 44d+20:48:56.730 IDENTIFY DEVICE ef 03 46 00 00 00 a0 00 44d+20:48:56.730 SET FEATURES [set transfer mode] Error 3 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- 25 00 08 ff ff ff ef 00 44d+20:48:56.606 READ DMA EXT ef 10 02 00 00 00 a0 00 44d+20:48:56.606 SET FEATURES [Reserved for Serial ATA] 27 00 00 00 00 00 e0 00 44d+20:48:56.606 READ NATIVE MAX ADDRESS EXT ec 00 00 00 00 00 a0 00 44d+20:48:56.605 IDENTIFY DEVICE ef 03 46 00 00 00 a0 00 44d+20:48:56.605 SET FEATURES [set transfer mode] Error 2 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours) When the command that caused the error occurred, the device was active or idle. 
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- 25 00 08 ff ff ff ef 00 44d+20:48:56.451 READ DMA EXT 25 00 08 ff ff ff ef 00 44d+20:48:56.450 READ DMA EXT 25 00 08 ff ff ff ef 00 44d+20:48:56.443 READ DMA EXT 25 00 08 ff ff ff ef 00 44d+20:48:56.437 READ DMA EXT 25 00 08 ff ff ff ef 00 44d+20:48:56.430 READ DMA EXT Error 1 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 08 10 b2 06 e0 00 44d+20:48:55.925 READ DMA ca 00 08 98 8f 00 e0 00 44d+20:48:21.660 WRITE DMA c8 00 08 98 8f 00 e0 00 44d+20:48:21.660 READ DMA ca 00 c0 d8 8e 00 e0 00 44d+20:48:21.659 WRITE DMA ca 00 08 d0 8e 00 e0 00 44d+20:48:21.659 WRITE DMA SMART Self-test log structure revision number 1 No self-tests have been logged. [To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay.
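The time figures in the error log above can be checked with a little arithmetic. The sketch below assumes the Powered_Up_Time timer is an unsigned 32-bit millisecond counter, which is what would produce the stated wrap at 49.710 days; the function names are made up for illustration.

```python
from typing import Tuple

MS_PER_DAY = 24 * 60 * 60 * 1000


def wrap_days(bits: int = 32) -> float:
    """Days until a millisecond counter of the given width overflows (assumed 32-bit)."""
    return (2 ** bits) / MS_PER_DAY


def split_hours(hours: int) -> Tuple[int, int]:
    """Split a power-on lifetime in hours into (days, remaining hours)."""
    return divmod(hours, 24)


def parse_powered_up_time(stamp: str) -> float:
    """Convert a 'DDd+hh:mm:SS.sss' stamp from the error log into seconds."""
    days, clock = stamp.split("d+")
    hh, mm, ss = clock.split(":")
    return int(days) * 86400 + int(hh) * 3600 + int(mm) * 60 + float(ss)


if __name__ == "__main__":
    print(round(wrap_days(), 3))   # wrap point in days
    print(split_hours(1076))       # lifetime of the logged errors
    # Time between the first and the fifth logged error, in seconds.
    delta = parse_powered_up_time("44d+20:48:56.856") - parse_powered_up_time("44d+20:48:55.925")
    print(round(delta, 3))
```

Read this way, all five logged errors fall within roughly one second of each other at 1076 power-on hours, which fits the reading that they belong to a single old incident.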
  10. I think I have narrowed down when this happens by disabling all services that do nightly scans on the drives, except for the mover. I then rescheduled the mover to a different time and, boom, kernel panic at that time. The mover script looks quite inconspicuous: it is mainly an rsync. I do not think that rsync itself should pose a problem, especially since the same mover script has worked fine for the last few years. The cache drive is currently hooked up to the Digitus controller. I will attach that drive to the SuperMicro board as well to see what happens. I will keep you posted. I had the same kernel panic last night: "5 buffers handled - should be 1", then it shuts down the CPUs, and the last message is "32 buffers handled - should be 1". Current setup: parity (8TB) connected to the SuperMicro board. Another 8TB media disk is connected to the Digitus controller, as are the 3TB drive (supposed to become cache) and the current 250GB cache. It worked fine all day. During the night, some (cron) job or service seems to trigger the kernel panic. After the reboot I looked into the syslog, but it is a fresh file created right after the reboot. How can you actually diagnose an unRAID system if the syslog gets removed on a reboot? ??puzzled?? I found a core dump: -rw------- 1 root root 397312 2015-05-25 03:43 core Does this help in any way to get further information? Any ideas how to find out more?
  11. Just for the record, the other 8TB drive (same batch) comes with nearly identical SMART values. Either this is normal or the whole batch has a problem. With a few exceptions, the "raw" attribute values are not meaningful. Each manufacturer is free to use the raw number however they want, and frequently they will use bit positions to indicate certain values. Interpreting a bunch of status bits as a number can produce alarmingly high decimal numbers that are, as I said, meaningless. Even for a single manufacturer, the values can have different meanings for different models and even firmware versions. Manufacturers do "normalize" the values into a scale from 1 to 255. Lower is worse. A nominal value is often 100. The "VALUE" column is the current normalized value, the "WORST" column is how low the value has gone in the past, and the "THRESH" column is the value at which the attribute will be considered failed. So for attribute #1, 1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 146594920 the current normalized value is 117, the worst it has gotten is 99, and the drive will consider 6 and below a failure. You are not even close to failure. The raw value means nothing to you. The few attributes whose raw values we look at carefully are reallocated sectors (#5), pending sectors (#197), CRC errors (#199) and temperature (#194). #5 and #197 are often indicators of drive failure long before the normalized values drop significantly. #199 is often a sign of a bad or loose cable. The temperature is a bit subjective, but I aim to keep the drives maxing out in the upper 30s or low 40s. Looking at these values, the only one that troubles me is the temperature, which is 43, a little higher than I would prefer to see. It is still far from a serious problem, and if that is the temp during a parity check, all is fine. But if that is the idle temp, the temp under load could be approaching 50, which is too hot IMO.
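The reading rules described above (compare VALUE and WORST against THRESH, and only trust the raw number for a handful of attributes) can be sketched in a few lines. This is an illustration under stated assumptions, not a replacement for smartctl: the row layout is taken from the `smartctl` attribute tables quoted in these posts, the names `SmartAttr`, `parse_row`, `status` and `raw_needs_attention` are hypothetical, and the watch list uses the IDs as they appear in those tables (5 Reallocated_Sector_Ct, 197 Current_Pending_Sector, 199 UDMA_CRC_Error_Count).

```python
from typing import NamedTuple, Optional


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # current normalized value
    worst: int   # lowest normalized value seen so far
    thresh: int  # failure threshold
    raw: str     # vendor-specific raw field, kept as text


def parse_row(line: str) -> Optional[SmartAttr]:
    """Parse one row of the attribute table; return None for headers or other text."""
    parts = line.split(None, 9)
    if len(parts) < 10 or not parts[0].isdigit():
        return None
    return SmartAttr(int(parts[0]), parts[1], int(parts[3]),
                     int(parts[4]), int(parts[5]), parts[9])


def status(attr: SmartAttr) -> str:
    """An attribute counts as failed when its normalized value is at or below the threshold."""
    if attr.value <= attr.thresh:
        return "FAILING_NOW"
    if attr.worst <= attr.thresh:
        return "In_the_past"
    return "ok"


# Attributes whose raw count is worth watching even while the normalized value looks fine.
RAW_WATCH = {5, 197, 199}


def raw_needs_attention(attr: SmartAttr) -> bool:
    """True if a watched attribute reports a non-zero raw count."""
    return attr.attr_id in RAW_WATCH and int(attr.raw.split()[0]) > 0


if __name__ == "__main__":
    rows = [
        "1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 146594920",
        "184 End-to-End_Error 0x0032 093 093 099 Old_age Always FAILING_NOW 7",
        "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0",
    ]
    for row in rows:
        attr = parse_row(row)
        if attr is not None:
            print(attr.name, status(attr), raw_needs_attention(attr))
```

Applied to the rows quoted in these posts, the huge raw number on Raw_Read_Error_Rate is ignored because 117 is far above the threshold of 6, while End-to-End_Error on the 3TB drive is flagged because 93 is below its threshold of 99.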
  12. Just to make sure this is applicable in my case: I have a kernel panic at night when the machine is unattended. I can see the panic on screen, but the keyboard does not react and the NIC is down. I cannot access the machine by keyboard or over the network. Does "powerdown" save the syslog even when I have a kernel panic and push the reset button?
  13. I had the same kernel panic last night: "5 buffers handled - should be 1", then it shuts down the CPUs, and the last message is "32 buffers handled - should be 1". Current setup: parity (8TB) connected to the SuperMicro board. Another 8TB media disk is connected to the Digitus controller, as are the 3TB drive (supposed to become cache) and the current 250GB cache. It worked fine all day. During the night, some (cron) job or service seems to trigger the kernel panic. After the reboot I looked into the syslog, but it is a fresh file created right after the reboot. How can you actually diagnose an unRAID system if the syslog gets removed on a reboot? ??puzzled?? I found a core dump: -rw------- 1 root root 397312 2015-05-25 03:43 core Does this help in any way to get further information? Any ideas how to find out more?
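One possible way to keep log lines across a reset is to copy them to persistent storage while the system is still running. The sketch below is only that, a sketch: the paths `/var/log/syslog` and `/boot/syslog-copy.txt` are assumptions (the flash drive being mounted at `/boot` is inferred from the mover log path earlier in this thread), the function names are hypothetical, and lines written in the last moments before a hard panic may still be lost.

```python
import os
import time


def copy_new_bytes(src_path: str, dst_path: str, offset: int) -> int:
    """Append everything in src beyond `offset` to dst and return the new offset."""
    size = os.path.getsize(src_path)
    if size < offset:
        # The source shrank (rotated or recreated), so start again from the top.
        offset = 0
    if size == offset:
        return offset
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)
        data = src.read()
        dst.write(data)
        dst.flush()
        os.fsync(dst.fileno())  # push the data to the device before a possible crash
    return offset + len(data)


def follow(src_path: str, dst_path: str, interval: float = 5.0) -> None:
    """Poll the source forever and mirror new log lines to the destination."""
    offset = 0
    while True:
        if os.path.exists(src_path):
            offset = copy_new_bytes(src_path, dst_path, offset)
        time.sleep(interval)
```

It would be started in the background with something like `follow("/var/log/syslog", "/boot/syslog-copy.txt")`. Polling every few seconds keeps writes to the flash drive modest, at the cost of possibly missing the final lines before a panic.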
  14. Guys, you are awesome. It is great to get that much response and to see how many people care and take the time to help! Intermediate status: I pre-cleared the drive in 1 run and put it back as the parity drive. No kernel panic so far. For the record: The drive is connected to one of the SATA3 connectors on a SuperMicro X9SCL-F. There is also a Digitus DS-30104-1 controller in the system, hosting the cache drive and 2 other media drives (1 of them also being a new 8TB Seagate Archive, currently pre-clearing). Both Seagate Archive 8TB drives were bought from the same batch. I REALLY would like to see it under a HDD Regenerator scan... Good to know it looks OK so far from the SMART report. I have no clue how to do the HDD Regenerator scan. Can it be done on the parity drive? After a reboot following a kernel panic, will the syslog be available? Doesn't Unraid delete this kind of stuff upon reboot? I will make the syslog available when applicable, but not universally. Raise a hand if you are willing to help and take a look, and I may contact you the next time the kernel panics. Regarding the following statement I have some disconcerting additional information: BUT if you read my steplist above, there is a 4th point to take all this in account (the system MUST work fine again with the new HDD removed... ) When I switched the system back to the old 3TB parity drive in order to pre-clear the 8TB drive, I experienced another kernel panic under the following circumstances: the 3TB drive parity had been rebuilt successfully, and the 8TB drive was physically/electrically connected but not assigned to the array. The 8TB drive was idle (not being pre-cleared). After rebooting and rebuilding parity in maintenance mode I started pre-clearing the 8TB drive. No errors on the console, no kernel panic. Then I unassigned the 3TB parity drive, assigned the 8TB drive as parity and let it build the parity in maintenance mode. The 3TB drive was electrically connected but not assigned to the array, idling. Since then no kernel panic. 
It is a puzzle to me. I began suspecting firmware problems with the new drives, but that is not supported by the fact that the last panic occurred while the new drives were idle. At some point I also suspected the Digitus controller, but the 8TB parity drive is not connected to it. The other 8TB drive, currently pre-clearing, is connected to the Digitus, and there is still no kernel panic. The kernel panic always occurs at night, probably when the mover does its work and some of the other non-OS services do their grunt work. Still, this does not help me understand what is happening.
  15. Gentlemen, don't worry about the reported temperature. The report was taken right after stopping the pre-clear. The drive is at around 34 °C when spinning idle. Totally normal. From what I've read, HGST and Seagate now define a max temp of 60 °C on these larger multi-platter drives, i.e. Operating (drive case max °C) 60 Nonoperating (ambient °C) –40 to 70 I remember reading a specific article from Seagate saying they raised the acceptable temperature. At 50 or 52-54 I might be exploring a better cooling solution. From what I've seen in my 6TB HGST 7200 RPM drives and 6TB Seagate drives, the highest values were in the 45 °C range after a grueling 1 week badblocks preclear burn-in. I'm not disputing the temperature goals per post; I'm responding with what Seagate and HGST believe the high acceptable range is. Frankly, 55 is too close to 60. So I might set alarms at 53-54 and stop work if it climbs.
  16. Now I did one round of preclear and you will find the new SMART report attached to the end of this post. 4x Seagate ST3000DM001-1CH166, one of them will be cache drive 2x WD WDC_WD30EZRX-00AZ6B0_WD Plus the 8TB drive (2x) What else would be of relevance? SuperMicro Mainboard X9SCL-F-0, 8TB attached to SATA3 Digitus DS-30104-1 additional SATA controller Here the SMART report after a preclear of the 8TB Seagate drive: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build) Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Device Model: ST8000AS0002-1NA17Z Serial Number: Z8403NMN Firmware Version: AR13 User Capacity: 8,001,563,222,016 bytes Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 9 ATA Standard is: Not recognized. Minor revision code: 0x001f Local Time is: Thu May 14 21:40:29 2015 CEST SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 0) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. 
Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 255) minutes. Conveyance self-test routine recommended polling time: ( 2) minutes. SCT capabilities: (0x30a5) SCT Status supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 146594920 3 Spin_Up_Time 0x0003 090 090 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 21 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 076 060 030 Pre-fail Always - 44486206 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 465 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 3 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0022 057 056 045 Old_age Always - 43 (Min/Max 26/44) 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 21 193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 63 194 Temperature_Celsius 0x0022 043 044 000 Old_age Always - 43 (0 25 0 0) 195 Hardware_ECC_Recovered 0x001a 117 099 000 Old_age Always - 146594920 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 38190849196214 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 60384571496 242 
Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 63277694076 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Extended offline Completed without error 00% 266 - # 2 Short offline Completed without error 00% 251 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay.
  17. Hello all, when I use a newly bought Seagate Archive 8TB drive (ST8000AS0002) on an Unraid 5 stable array as a parity drive (not tried as a data drive), I get a kernel panic after some time. Interestingly, this even blocks the traffic going over the attached switch; I have no idea how that happens. (When I replace the 8TB drive with a 3TB drive for parity, everything works smoothly again for days. The 8TB drive is then still connected to the machine but not assigned in Unraid.) My questions: Is that a known problem? How may this be fixed? Move to Unraid 6? Drive firmware update? Any ideas? Thanks for caring, JC smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build) Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Device Model: ST8000AS0002-1NA17Z Serial Number: Z8403NMN Firmware Version: AR13 User Capacity: 8,001,563,222,016 bytes Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 9 ATA Standard is: Not recognized. Minor revision code: 0x001f Local Time is: Wed May 6 00:25:19 2015 CEST SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 0) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. 
Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 255) minutes. Conveyance self-test routine recommended polling time: ( 2) minutes. SCT capabilities: (0x30a5) SCT Status supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 194380720 3 Spin_Up_Time 0x0003 090 090 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 13 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 13681010 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 251 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 3 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0022 065 057 045 Old_age Always - 35 (Min/Max 26/41) 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 10 193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 37 194 Temperature_Celsius 0x0022 035 043 000 Old_age Always - 35 (0 25 0 0) 195 Hardware_ECC_Recovered 0x001a 118 099 000 Old_age Always - 194380720 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 240 Head_Flying_Hours 
0x0000 100 253 000 Old_age Offline - 182325656682555 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 28300614680 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 16244567367 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed without error 00% 251 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay.