Seagate Archive 8TB ST8000AS0002 Unraid 5stable kernel panic

May 28, 2015

Author

It's totally normal for the 8TB Archive. Both mine have similarly high numbers after 22 days.

Perhaps you should only give advice when you know what you're talking about, and not lead people on a wild goose chase?

Yeah, right... IN FACT that disk makes his system kernel panic.

SMART ID #7 is BAD even by the disk firmware's own scale: a worst value of 60 (on a scale of 100), with the SMART failure threshold fixed at 30, on a 22-day-old disk is... BAD. That's my verdict, IMHO.

This is probably the reason I never go for Seagates.

Having dumped SMART reports for all drives, I can tell that all Seagates report similar numbers. It seems to be normal for Seagate consumer-grade drives (not saying that this is good, though).

Also note that the numbers for WD and Samsung drives in my system are farther away from the failure threshold.
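For anyone comparing these numbers: smartctl treats an attribute as failed once its normalized value is less than or equal to the vendor threshold, so the distance between the two is what the disagreement above is about. A minimal sketch using only the figures quoted in this thread (worst 60, threshold 30); the function name is made up for illustration and nothing here reads a real drive.

```python
def smart_headroom(normalized: int, threshold: int) -> int:
    """Return how far a normalized SMART value sits above its failure threshold.

    A result of 0 or less means the attribute counts as failed.
    """
    return normalized - threshold


# Figures quoted in this thread for attribute #7 on the 8TB Archive drive.
worst, thresh = 60, 30
margin = smart_headroom(worst, thresh)
print(f"headroom: {margin} points")
print("failed" if margin <= 0 else "not failed")
```

Whether a given headroom is "normal" for a vendor is a separate question; the sketch only makes the arithmetic explicit.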


May 28, 2015

From observing my 8TB Archive drives, and having one report a SMART failure which has now gone, I think these drives throw a failure when the following scenario happens:

1. a lot of small files are copied to the drive in one lump (my test is 60GB of MP3 files)

2. the drive starts shingling

3. the drive is interrupted during the shingling procedure on several occasions (by rebooting the machine)

4. drive throws a SMART failure

If I leave the drive idle for 5-10 minutes until the accessing noises stop (i.e. shingling has finished), I can reboot the machine as many times and as frequently as I want with no SMART failure.

If I don't let the drive idle, so that it has no chance to finish its housekeeping, and then do the same number of reboots, the drive will crap out.

If I leave the drive running with the PC's BIOS "SMART failure" error on the screen, about 20 minutes of accessing noises can be observed, and once that is done and a reset performed the drive works perfectly.

These drives are designed to be run 24/7, and they really should be run that way, because shingling (housekeeping) shouldn't be interrupted. I also think the SMART figures are odd because they're reporting something different compared to a traditional PMR drive.


May 28, 2015

Author

Folks,

I have now definitely narrowed the kernel panic down to the mover. I scheduled it and monitored the syslog with tail -f.

Here is what I could capture via telnet and on the console. After that the machine was unresponsive.

Alas, the error dump that follows at the bottom is so far opaque to me. If anyone can make sense of this, please go ahead.

For the record: the faulty 3TB drive I reported above is not assigned to the array, but is connected to the Supermicro board on a SATA3 connector. The 8TB parity drive is connected to the Supermicro board on SATA3, the 250GB Samsung cache drive to SATA2 on the Supermicro board, and the 8TB data drive to SATA2 on the Supermicro board. The rest of the drives are connected to Supermicro SATA2 and to the Digitus controller.

I would assume the mover copies from the 250GB cache drive to the new 8TB data drive, thereby also accessing the 8TB parity drive.

Boom.

May 28 20:41:00 unRAID kernel: df7d1440 f3671f00 df7d1440 c82e3d50 c10a16aa f0ee8380 df7d1710 c82e3da4

May 28 20:41:00 unRAID kernel: c102acca 00000000 00000009 00000246 c82e3ec0 c82e3d88 00000001 00000000

May 28 20:41:00 unRAID kernel: Call Trace:

May 28 20:41:00 unRAID kernel: [<c10a1604>] put_files_struct+0x55/0x8e

May 28 20:41:00 unRAID kernel: [<c10a16aa>] exit_files+0x34/0x38

May 28 20:41:00 unRAID kernel: [<c102acca>] do_exit+0x2c7/0x73e

May 28 20:41:00 unRAID kernel: [<c1027090>] ? print_oops_end_marker+0x2a/0x2c

May 28 20:41:00 unRAID kernel: [<c1004c3a>] oops_end+0x79/0x7e

May 28 20:41:00 unRAID kernel: [<c13df945>] no_context+0x1ad/0x1b5

May 28 20:41:00 unRAID kernel: [<c13dfbd9>] __bad_area_nosemaphore+0x125/0x12d

May 28 20:41:00 unRAID kernel: [<c13dfc69>] bad_area+0x37/0x3d

May 28 20:41:00 unRAID kernel: [<c10209d7>] __do_page_fault+0x1bf/0x391

May 28 20:41:00 unRAID kernel: [<c104466f>] ? sched_clock_cpu+0x3f/0x15e

May 28 20:41:00 unRAID kernel: [<c1020bae>] ? vmalloc_sync_all+0x5/0x5

May 28 20:41:00 unRAID kernel: [<c1020bb6>] do_page_fault+0x8/0xa

May 28 20:41:00 unRAID kernel: [<c13e63ea>] error_code+0x5a/0x60

May 28 20:41:00 unRAID kernel: [<c104007b>] ? clean_sort_range+0x11/0xc8

May 28 20:41:00 unRAID kernel: [<c10a007b>] ? iput+0x67/0xe5

May 28 20:41:00 unRAID kernel: [<c1020bae>] ? vmalloc_sync_all+0x5/0x5

May 28 20:41:00 unRAID kernel: [<c10a14fa>] ? dup_fd+0x13d/0x1ca

May 28 20:41:00 unRAID kernel: [<c1025d8a>] copy_process.part.60+0x3db/0xd1d

May 28 20:41:00 unRAID kernel: [<c10267a4>] do_fork+0xbb/0x20d

May 28 20:41:00 unRAID kernel: [<c102698d>] sys_clone+0x20/0x22

May 28 20:41:00 unRAID kernel: [<c13e5f40>] syscall_call+0x7/0xb

May 28 20:41:00 unRAID kernel: Code: 00 8d 90 01 02 00 00 83 fa 01 77 07 b8 fc ff ff ff eb 0d 89 c2 83 e2 fd 81 fa fc fd ff ff 74 ec 5d c3 55 89 e5 57 56 53 89 c3 51 <8b> 40 20 85 c0 75 10 c7 04 24 42 c6 49 c1 31 f6 e8 b6 2c 35 00

May 28 20:41:00 unRAID kernel: EIP: [<c108d19d>] filp_close+0x9/0x61 SS:ESP 0068:c82e3d14

May 28 20:41:00 unRAID kernel: CR2: 0000000073750062

May 28 20:41:00 unRAID kernel: ---[ end trace 1d6206f196c7d87a ]---

May 28 20:41:00 unRAID kernel: Fixing recursive fault but reboot is needed!

Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ...

unRAID kernel: Call Trace:

Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ...

unRAID kernel: Stack:

Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ...

unRAID kernel: Process find (pid: 12432, ti=c82e2000 task=df7d1440 task.ti=c82e2000)

Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ...

unRAID kernel: Code: 00 00 00 89 55 dc 99 f7 f9 8b 55 e0 8b 72 0c 89 c1 f3 a4 89 c1 8b 72 08 31 d2 8b 7b 08 f3 a4 eb 1d 8b 4d dc 8b 04 91 85 c0 74 06 <f0> ff 40 20 eb 06 8b 4b 0c 0f b3 11 8b 4d e4 89 04 91 42 3b 55

Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ...

unRAID kernel: Call Trace:

Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ...

unRAID kernel: EIP: [<c108d19d>] filp_close+0x9/0x61 SS:ESP 0068:c82e3d14

Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ...

unRAID kernel: EIP: [<c10a14fa>] dup_fd+0x13d/0x1ca SS:ESP 0068:c82e3efc

Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ...

unRAID kernel: Stack:

Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ...

unRAID kernel: Process find (pid: 12432, ti=c82e2000 task=df7d1440 task.ti=c82e2000)

Message from syslogd@unRAID at Thu May 28 20:41:00 2015 ...

unRAID kernel: Code: 00 8d 90 01 02 00 00 83 fa 01 77 07 b8 fc ff ff ff eb 0d 89 c2 83 e2 fd 81 fa fc fd ff ff 74 ec 5d c3 55 89 e5 57 56 53 89 c3 51 <8b> 40 20 85 c0 75 10 c7 04 24 42 c6 49 c1 31 f6 e8 b6 2c 35 00


May 29, 2015

Author

Since the log above may not be very helpful, I tried to reproduce the problem and capture more of the log. I found the following in the log of the mover:

rsync: write failed on "/mnt/user0/download/sabnzbd/complete/couchpotato/.DS_Store": No space left on device (28)

rsync error: error in file IO (code 11) at receiver.c(302) [receiver=3.0.7]

rsync: connection unexpectedly closed (29 bytes received so far) [sender]

rsync error: error in rsync protocol data stream (code 12) at io.c(601) [sender=3.0.7]
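The "(28)" in the first rsync line is the errno value, which on Linux is ENOSPC. A rough sketch of checking the target before moving a file, assuming Python is available on the box; `has_room`, `reserve_bytes` and the paths are placeholders for illustration, and the free space reported for a user share may not match the individual disk the file actually lands on.

```python
import errno
import os
import shutil


def has_room(src_file: str, dest_dir: str, reserve_bytes: int = 0) -> bool:
    """Return True if dest_dir reports enough free space for src_file.

    reserve_bytes is an optional safety margin (hypothetical parameter).
    """
    needed = os.path.getsize(src_file) + reserve_bytes
    return shutil.disk_usage(dest_dir).free >= needed


# Look up the symbolic name behind the "(28)" in the rsync message.
print(errno.errorcode[28])

if __name__ == "__main__":
    # Placeholder paths: substitute a real cache file and target mount.
    src, dest = "/mnt/cache/somepath/somebigfile", "/mnt/user0"
    if os.path.exists(src) and os.path.isdir(dest):
        print("enough space" if has_room(src, dest) else "would hit ENOSPC")
```

This only tells you what the filesystem reports at that moment; it does not explain why rsync then corrupts its heap instead of exiting cleanly.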

After that it carries on, though, and when copying a large file it breaks down like this (snippet from syslog):

May 29 12:02:15 unRAID shfs/user0: shfs_write: write: (28) No space left on device

May 29 12:02:18 unRAID rsync: *** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2450 ***

May 29 12:02:19 unRAID rsync: *** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2450 ***

May 29 12:02:20 unRAID rsync: *** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2458 ***

May 29 12:02:21 unRAID rsync: *** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2450 ***

May 29 12:02:22 unRAID rsync: *** glibc detected *** rsync: free(): invalid next size (normal): 0x080d24a8 ***

May 29 12:02:23 unRAID kernel: BUG: unable to handle kernel paging request at 2e6b6c89

May 29 12:02:23 unRAID kernel: IP: [<c10c9931>] tid_fd_revalidate+0x56/0x12a

May 29 12:02:23 unRAID kernel: *pdpt = 0000000030ff3001 *pde = 0000000000000000

May 29 12:02:23 unRAID kernel: Oops: 0000 [#1] SMP

May 29 12:02:23 unRAID kernel: Modules linked in: md_mod sit2fe(O) m88ds3103(O) cx25840(O) sg cx23885(O) rc_core(O) videobuf_dma_sg(O) snd_pcm snd_timer snd_page_alloc cx2341x(O) v4l2_common(O) i2c_i801 ahci libahci coretemp hwmon videodev(O) tda18271(O) snd soundcore videobuf_dvb(O) e1000e dvb_core(O) videobuf_core(O) btcx_risc(O) ptp tveeprom(O) i2c_core pps_core [last unloaded: md_mod]

May 29 12:02:23 unRAID kernel: Pid: 4008, comm: fuser Tainted: G O 3.9.6p-unRAID #1 Supermicro X9SCL/X9SCM/X9SCL/X9SCM

May 29 12:02:23 unRAID kernel: EIP: 0060:[<c10c9931>] EFLAGS: 00010202 CPU: 1

May 29 12:02:23 unRAID kernel: EIP is at tid_fd_revalidate+0x56/0x12a

May 29 12:02:23 unRAID kernel: EAX: f740a300 EBX: f3498000 ECX: f0eda000 EDX: 2e6b6c61

May 29 12:02:23 unRAID kernel: ESI: eeea54c0 EDI: eeea0380 EBP: f71d1ebc ESP: f71d1eac

May 29 12:02:23 unRAID kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

May 29 12:02:23 unRAID kernel: CR0: 80050033 CR2: 2e6b6c89 CR3: 3771a000 CR4: 000407f0

May 29 12:02:23 unRAID kernel: DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000

May 29 12:02:23 unRAID kernel: DR6: ffff0ff0 DR7: 00000400

May 29 12:02:23 unRAID kernel: Process fuser (pid: 4008, ti=f71d0000 task=f34990e0 task.ti=f71d0000)

May 29 12:02:23 unRAID kernel: Stack:

May 29 12:02:23 unRAID kernel: 00000004 eeea54c0 eeea0380 f72979c0 f71d1ecc c10c9ad1 eed57680 eeea0380

May 29 12:02:23 unRAID kernel: f71d1f0c c10c6de0 00000004 0000000d c14bcfba f71d1f3b f71d1f48 f71d1f3c

May 29 12:02:23 unRAID kernel: c109ac64 f71d1f90 00000034 00000001 f71d1f3b f72979c0 00000004 f740a300

May 29 12:02:23 unRAID kernel: Call Trace:

May 29 12:02:23 unRAID kernel: [<c10c9ad1>] proc_fd_instantiate+0x6a/0x74

May 29 12:02:23 unRAID kernel: [<c10c6de0>] proc_fill_cache+0x66/0xf9

May 29 12:02:23 unRAID kernel: [<c109ac64>] ? sys_ioctl+0x50/0x50

May 29 12:02:23 unRAID kernel: [<c10c9756>] proc_readfd_common+0x15d/0x1a4

May 29 12:02:23 unRAID kernel: [<c10c9a67>] ? proc_fdinfo_instantiate+0x62/0x62

May 29 12:02:23 unRAID kernel: [<c109ac64>] ? sys_ioctl+0x50/0x50

May 29 12:02:23 unRAID kernel: [<c1096258>] ? final_putname+0x2d/0x30

May 29 12:02:23 unRAID kernel: [<c10c97c3>] proc_readfd+0x12/0x14

May 29 12:02:23 unRAID kernel: [<c10c9a67>] ? proc_fdinfo_instantiate+0x62/0x62

May 29 12:02:23 unRAID kernel: [<c109af56>] vfs_readdir+0x52/0x7a

May 29 12:02:23 unRAID kernel: [<c109ac64>] ? sys_ioctl+0x50/0x50

May 29 12:02:23 unRAID kernel: [<c109b0e8>] sys_getdents64+0x62/0xba

May 29 12:02:23 unRAID kernel: [<c13e5f40>] syscall_call+0x7/0xb

May 29 12:02:23 unRAID kernel: Code: 00 89 55 f0 e8 78 7c fd ff 8b 55 f0 85 c0 0f 84 98 00 00 00 8b 48 04 3b 11 0f 83 88 00 00 00 8b 49 04 8d 14 91 8b 12 85 d2 74 7c <8b> 7a 28 e8 76 7c fd ff 8d 83 d0 02 00 00 e8 84 c2 31 00 8b 83

May 29 12:02:23 unRAID kernel: EIP: [<c10c9931>] tid_fd_revalidate+0x56/0x12a SS:ESP 0068:f71d1eac

May 29 12:02:23 unRAID kernel: CR2: 000000002e6b6c89

May 29 12:02:23 unRAID kernel: ---[ end trace 7f01c1191ac43037 ]---

May 29 12:06:47 unRAID kernel: BUG: unable to handle kernel NULL pointer dereference at 00000050

May 29 12:06:47 unRAID kernel: IP: [<c109ccc5>] d_path+0x16/0x100

May 29 12:06:47 unRAID kernel: *pdpt = 000000002c868001 *pde = 0000000000000000

May 29 12:06:47 unRAID kernel: Oops: 0000 [#2] SMP

May 29 12:06:47 unRAID kernel: Modules linked in: md_mod sit2fe(O) m88ds3103(O) cx25840(O) sg cx23885(O) rc_core(O) videobuf_dma_sg(O) snd_pcm snd_timer snd_page_alloc cx2341x(O) v4l2_common(O) i2c_i801 ahci libahci coretemp hwmon videodev(O) tda18271(O) snd soundcore videobuf_dvb(O) e1000e dvb_core(O) videobuf_core(O) btcx_risc(O) ptp tveeprom(O) i2c_core pps_core [last unloaded: md_mod]

May 29 12:06:47 unRAID kernel: Pid: 4135, comm: ps Tainted: G D O 3.9.6p-unRAID #1 Supermicro X9SCL/X9SCM/X9SCL/X9SCM

May 29 12:06:47 unRAID kernel: EIP: 0060:[<c109ccc5>] EFLAGS: 00010292 CPU: 3

May 29 12:06:47 unRAID kernel: EIP is at d_path+0x16/0x100

May 29 12:06:47 unRAID kernel: EAX: 00000000 EBX: 0000007f ECX: 00001000 EDX: eef40000

May 29 12:06:47 unRAID kernel: ESI: ec82ff54 EDI: eee54de0 EBP: ec82ff48 ESP: ec82ff2c

May 29 12:06:47 unRAID kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

May 29 12:06:47 unRAID kernel: CR0: 80050033 CR2: 00000050 CR3: 3094c000 CR4: 000407f0

May 29 12:06:47 unRAID kernel: DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000

May 29 12:06:47 unRAID kernel: DR6: ffff0ff0 DR7: 00000400

May 29 12:06:47 unRAID kernel: Process ps (pid: 4135, ti=ec82e000 task=f0a91e60 task.ti=ec82e000)

May 29 12:06:47 unRAID kernel: Stack:

May 29 12:06:47 unRAID kernel: 00000000 00001000 eef41000 eee54de0 ec82ff48 0000007f eef40000 ec82ff68

May 29 12:06:47 unRAID kernel: c10c549e 40033d40 00000000 00000000 00004000 ec82ff80 c13ee380 ec82ff94

May 29 12:06:47 unRAID kernel: c1092396 ec82ff80 ec82ff7c eee54de0 00000000 f34ff0d0 eed56f80 bfbf9e4c

May 29 12:06:47 unRAID kernel: Call Trace:

May 29 12:06:47 unRAID kernel: [<c10c549e>] proc_pid_readlink+0x4d/0xa6

May 29 12:06:47 unRAID kernel: [<c1092396>] sys_readlinkat+0x76/0xab

May 29 12:06:47 unRAID kernel: [<c10923f2>] sys_readlink+0x27/0x29

May 29 12:06:47 unRAID kernel: [<c13e5f40>] syscall_call+0x7/0xb

May 29 12:06:47 unRAID kernel: Code: ff 19 c0 83 c0 02 eb 05 b8 02 00 00 00 83 c4 24 5b 5e 5f 5d c3 55 89 e5 56 89 c6 53 8d 04 0a 83 ec 14 89 45 ec 8b 46 04 89 4d e8 <8b> 58 50 85 db 74 0e 8b 5b 20 85 db 74 07 ff d3 e9 ce 00 00 00

May 29 12:06:47 unRAID kernel: EIP: [<c109ccc5>] d_path+0x16/0x100 SS:ESP 0068:ec82ff2c

May 29 12:06:47 unRAID kernel: CR2: 0000000000000050

May 29 12:06:47 unRAID kernel: ---[ end trace 7f01c1191ac43038 ]---
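One way to read the first oops above: the bytes at the fault marker in its Code line, `<8b> 7a 28`, encode a load from EDX plus 0x28, and EDX plus 0x28 is exactly the CR2 address. The sketch below checks that arithmetic and prints EDX as the bytes it would occupy in memory. The helper name `as_ascii` is made up for illustration. Printable text where a pointer should be is often a hint that a kernel structure was overwritten with string data, but on its own it proves nothing about the cause (RAM, driver, or something else).

```python
import struct


def as_ascii(value: int) -> str:
    """Render a 32-bit value as its little-endian bytes, '?' for non-printables."""
    raw = struct.pack("<I", value)
    return "".join(chr(b) if 32 <= b < 127 else "?" for b in raw)


edx = 0x2E6B6C61  # EDX from the 12:02:23 oops
cr2 = 0x2E6B6C89  # faulting address (CR2) from the same oops

# "<8b> 7a 28" decodes to: mov edi, [edx+0x28]
print(hex(edx + 0x28), hex(cr2))

# Show the would-be pointer as text.
print(repr(as_ascii(edx)))
```

The same check can be run on the CR2 value from the earlier panic (0x73750062) to see whether it also looks like text rather than a kernel address.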

In the terminal where the mover is running it looks like this:

*** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2498 ***

======= Backtrace: =========

/lib/libc.so.6(+0x705aa)[0x400a65aa]

/lib/libc.so.6(+0x73503)[0x400a9503]

/lib/libc.so.6(cfree+0x70)[0x400ac6b0]

rsync[0x807cd74]

rsync[0x807de60]

rsync[0x804f3aa]

rsync[0x8050b5f]

rsync[0x8051e56]

rsync[0x8065825]

rsync[0x80666ac]

/lib/libc.so.6(__libc_start_main+0xe6)[0x4004cb86]

rsync[0x804aad1]

======= Memory map: ========

08048000-0809d000 r-xp 00000000 00:01 2536 /usr/bin/rsync

0809d000-080a1000 rwxp 00054000 00:01 2536 /usr/bin/rsync

080a1000-080f2000 rwxp 00000000 00:00 0 [heap]

40000000-4001d000 r-xp 00000000 00:01 4298 /lib/ld-2.11.1.so

4001d000-4001e000 r-xp 0001d000 00:01 4298 /lib/ld-2.11.1.so

4001e000-4001f000 rwxp 0001e000 00:01 4298 /lib/ld-2.11.1.so

This terminal is frozen, but this time no kernel panic.

Can you help?


May 29, 2015

Author

Another instance of the problem after rewiring the drives.

Mover:

/usr/local/sbin/mover 2>&1 | tee /boot/logmover.txt

mover started

skipping app/

moving download/

./download/sabnzbd/complete/couchpotato/.DS_Store

>f.stpog... download/sabnzbd/complete/couchpotato/.DS_Store

rsync: write failed on "/mnt/user0/download/sabnzbd/complete/couchpotato/.DS_Store": No space left on device (28)

rsync error: error in file IO (code 11) at receiver.c(302) [receiver=3.0.7]

rsync: connection unexpectedly closed (29 bytes received so far) [sender]

rsync error: error in rsync protocol data stream (code 12) at io.c(601) [sender=3.0.7]

moving private/

./somepath/somebigfile

*** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2468 ***

======= Backtrace: =========

/lib/libc.so.6(+0x705aa)[0x400a65aa]

/lib/libc.so.6(+0x73503)[0x400a9503]

/lib/libc.so.6(cfree+0x70)[0x400ac6b0]

rsync[0x807cd74]

rsync[0x807de60]

rsync[0x804f3aa]

rsync[0x8050b5f]

rsync[0x8051e56]

rsync[0x8065825]

rsync[0x80666ac]

/lib/libc.so.6(__libc_start_main+0xe6)[0x4004cb86]

rsync[0x804aad1]

======= Memory map: ========

08048000-0809d000 r-xp 00000000 00:01 1518 /usr/bin/rsync

0809d000-080a1000 rwxp 00054000 00:01 1518 /usr/bin/rsync

080a1000-080f2000 rwxp 00000000 00:00 0 [heap]

40000000-4001d000 r-xp 00000000 00:01 3280 /lib/ld-2.11.1.so

4001d000-4001e000 r-xp 0001d000 00:01 3280 /lib/ld-2.11.1.so

4001e000-4001f000 rwxp 0001e000 00:01 3280 /lib/ld-2.11.1.so

4001f000-40020000 r-xp 00000000 00:00 0 [vdso]

40020000-40021000 rwxp 00000000 00:00 0

40028000-4002e000 r-xp 00000000 00:01 3596 /lib/libacl.so.1.1.0

4002e000-4002f000 rwxp 00005000 00:01 3596 /lib/libacl.so.1.1.0

4002f000-40035000 r-xp 00000000 00:01 3244 /lib/libpopt.so.0.0.0

40035000-40036000 rwxp 00006000 00:01 3244 /lib/libpopt.so.0.0.0

40036000-40192000 r-xp 00000000 00:01 3019 /lib/libc-2.11.1.so

40192000-40193000 ---p 0015c000 00:01 3019 /lib/libc-2.11.1.so

40193000-40195000 r-xp 0015c000 00:01 3019 /lib/libc-2.11.1.so

40195000-40196000 rwxp 0015e000 00:01 3019 /lib/libc-2.11.1.so

40196000-40199000 rwxp 00000000 00:00 0

40199000-4019d000 r-xp 00000000 00:01 3208 /lib/libattr.so.1.1.0

4019d000-4019e000 rwxp 00003000 00:01 3208 /lib/libattr.so.1.1.0

4019e000-401a0000 rwxp 00000000 00:00 0

401a0000-401f6000 r-xp 00000000 00:01 2221 /usr/lib/locale/locale-archive

401f6000-40235000 r-xp 00000000 00:01 2233 /usr/lib/locale/en_US.utf8/LC_CTYPE

40235000-40297000 rwxp 00000000 00:00 0

40297000-402b3000 r-xp 00000000 00:01 2158 /usr/lib/libgcc_s.so.1

402b3000-402b4000 rwxp 0001b000 00:01 2158 /usr/lib/libgcc_s.so.1

40300000-40321000 rwxp 00000000 00:00 0

40321000-40400000 ---p 00000000 00:00 0

bfeda000-bfefb000 rw-p 00000000 00:00 0 [stack]

find: `rsync' terminated by signal 6

rsync: writefd_unbuffered failed to write 79 bytes to socket [Receiver]: Broken pipe (32)

rsync error: error in rsync protocol data stream (code 12) at io.c(1530) [Receiver=3.0.7]

./somepath/someotherbigfile

*** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2450 ***