AndroidCat

mpt2sas problem (Supermicro AOC-USAS2-L8e) throwing lots of errors into syslog

Hi,

This is my first post here, as I recently moved from the ZFS world to unRAID.

I've been successfully running unRAID on ESXi 5 and an X9SCM-F for the last 1.5 months. I am using an M1015 flashed to IT mode and 5 GB of RAM.

Problems appeared when I started using a Supermicro AOC-USAS2-L8i as a second controller. Mind you, that card had been in the system from the beginning, passed through and connected to HDDs which were sitting idle. In other words, those drives were never precleared and never made it into the array at that point.

 

Since I started preclearing the drives hooked up to this Supermicro card, I have observed lots of errors in the syslog - basically the same error repeating every minute or so. I believe the frequency of that error is determined by HDD usage. Preclear is very HDD intensive (although I don't know how much it stresses the controller itself...) and I think it is the "reading phase" of the preclear script causing most of them, not so much the "writing phase".

Anyway, usually there seems to be no or minimal impact on unRAID and Samba. But one day while I was preclearing, all of a sudden access to files became very slow and even a small AVI was stuttering. The issue was gone after the preclear finished.

 

So my theory is that disk activity on the AOC card is causing problems for the whole unRAID system. At this point I am a bit worried about writing any valuable data to those disks.

By the way, the preclear was successful for those HDDs and I was able to add them to my array. They have not dropped since.

The unRAID version is 5.0-rc4 and -rc5 (no difference between them).

 

 

Jul  6 06:39:23 unraid kernel: smbd: page allocation failure: order:3, mode:0x4020

Jul  6 06:39:23 unraid kernel: Pid: 4757, comm: smbd Not tainted 3.0.35-unRAID #2 (Errors)

Jul  6 06:39:23 unraid kernel: Call Trace: (Errors)

Jul  6 06:39:23 unraid kernel:  [<c105fee8>] warn_alloc_failed+0xb2/0xc4 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1060653>] __alloc_pages_nodemask+0x456/0x47f (Errors)

Jul  6 06:39:23 unraid kernel:  [<c10606d0>] __get_free_pages+0xf/0x21 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c107dfa0>] __kmalloc+0x28/0xff (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1025df6>] ? T.1527+0x31/0x35 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c12910a9>] pskb_expand_head+0xca/0x1eb (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1291549>] __pskb_pull_tail+0x41/0x21f (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1298321>] dev_hard_start_xmit+0x20a/0x322 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c12a6d6a>] sch_direct_xmit+0x50/0x137 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1298537>] dev_queue_xmit+0xfe/0x274 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c12b3d4f>] ip_finish_output+0x237/0x272 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c12b3e2a>] ip_output+0xa0/0xa8 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c12b326e>] ip_local_out+0x1b/0x1e (Errors)

Jul  6 06:39:23 unraid kernel:  [<c12b3772>] ip_queue_xmit+0x2a5/0x2f2 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c12c2ce8>] tcp_transmit_skb+0x4d7/0x50d (Errors)

Jul  6 06:39:23 unraid kernel:  [<c12c4ff2>] tcp_write_xmit+0x2f9/0x3d7 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c12c5114>] __tcp_push_pending_frames+0x18/0x6f (Errors)

Jul  6 06:39:23 unraid kernel:  [<c12bae84>] do_tcp_sendpages+0x466/0x493 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c12baf02>] tcp_sendpage+0x51/0x66 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c12baeb1>] ? do_tcp_sendpages+0x493/0x493 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c12d2725>] inet_sendpage+0x82/0x9c (Errors)

Jul  6 06:39:23 unraid kernel:  [<c12d26a3>] ? inet_dgram_connect+0x5e/0x5e (Errors)

Jul  6 06:39:23 unraid kernel:  [<c128a95e>] kernel_sendpage+0x1a/0x2d (Errors)

Jul  6 06:39:23 unraid kernel:  [<c128a998>] sock_sendpage+0x27/0x2c (Errors)

Jul  6 06:39:23 unraid kernel:  [<c10995ce>] pipe_to_sendpage+0x5a/0x6c (Errors)

Jul  6 06:39:23 unraid kernel:  [<c128a971>] ? kernel_sendpage+0x2d/0x2d (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1099634>] splice_from_pipe_feed+0x54/0xc4 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1099574>] ? splice_from_pipe_begin+0x10/0x10 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1099b8a>] __splice_from_pipe+0x36/0x55 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1099574>] ? splice_from_pipe_begin+0x10/0x10 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1099cf1>] splice_from_pipe+0x51/0x64 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1099d30>] ? default_file_splice_write+0x2c/0x2c (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1099d43>] generic_splice_sendpage+0x13/0x15 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1099574>] ? splice_from_pipe_begin+0x10/0x10 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1099e85>] do_splice_from+0x57/0x61 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1099ea6>] direct_splice_actor+0x17/0x1c (Errors)

Jul  6 06:39:23 unraid kernel:  [<c109a471>] splice_direct_to_actor+0xbe/0x16b (Errors)

Jul  6 06:39:23 unraid kernel:  [<c117a7da>] ? fuse_file_flock+0x33/0x33 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1099e8f>] ? do_splice_from+0x61/0x61 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c109a569>] do_splice_direct+0x4b/0x62 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1080516>] do_sendfile+0x157/0x19f (Errors)

Jul  6 06:39:23 unraid kernel:  [<c108059a>] sys_sendfile64+0x3c/0x7c (Errors)

Jul  6 06:39:23 unraid kernel:  [<c130f525>] syscall_call+0x7/0xb (Errors)

Jul  6 06:39:23 unraid kernel:  [<c130007b>] ? _cpu_down+0x13f/0x1bc (Errors)

Jul  6 06:39:23 unraid kernel: Mem-Info:

Jul  6 06:39:23 unraid kernel: DMA per-cpu:

Jul  6 06:39:23 unraid kernel: CPU    0: hi:    0, btch:  1 usd:  0

Jul  6 06:39:23 unraid kernel: Normal per-cpu:

Jul  6 06:39:23 unraid kernel: CPU    0: hi:  186, btch:  31 usd:  16

Jul  6 06:39:23 unraid kernel: HighMem per-cpu:

Jul  6 06:39:23 unraid kernel: CPU    0: hi:  186, btch:  31 usd:  26

Jul  6 06:39:23 unraid kernel: active_anon:5668 inactive_anon:39 isolated_anon:0

Jul  6 06:39:23 unraid kernel:  active_file:275127 inactive_file:908639 isolated_file:23

Jul  6 06:39:23 unraid kernel:  unevictable:46338 dirty:15122 writeback:14245 unstable:0

Jul  6 06:39:23 unraid kernel:  free:9186 slab_reclaimable:26625 slab_unreclaimable:4710

Jul  6 06:39:23 unraid kernel:  mapped:3416 shmem:53 pagetables:230 bounce:0

Jul  6 06:39:23 unraid kernel: DMA free:3536kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:2704kB inactive_file:7892kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15804kB mlocked:0kB dirty:244kB writeback:2552kB mapped:0kB shmem:0kB slab_reclaimable:1216kB slab_unreclaimable:496kB kernel_stack:80kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

Jul  6 06:39:23 unraid kernel: lowmem_reserve[]: 0 869 5056 5056

Jul  6 06:39:23 unraid kernel: Normal free:3752kB min:3736kB low:4668kB high:5604kB active_anon:16kB inactive_anon:0kB active_file:265612kB inactive_file:377724kB unevictable:0kB isolated(anon):0kB isolated(file):92kB present:890008kB mlocked:0kB dirty:60244kB writeback:54428kB mapped:4kB shmem:0kB slab_reclaimable:105284kB slab_unreclaimable:18344kB kernel_stack:768kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

Jul  6 06:39:23 unraid kernel: lowmem_reserve[]: 0 0 33495 33495

Jul  6 06:39:23 unraid kernel: HighMem free:29456kB min:512kB low:5012kB high:9512kB active_anon:22656kB inactive_anon:156kB active_file:832192kB inactive_file:3248940kB unevictable:185352kB isolated(anon):0kB isolated(file):0kB present:4287396kB mlocked:0kB dirty:0kB writeback:0kB mapped:13660kB shmem:212kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:920kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

Jul  6 06:39:23 unraid kernel: lowmem_reserve[]: 0 0 0 0

Jul  6 06:39:23 unraid kernel: DMA: 14*4kB 3*8kB 0*16kB 2*32kB 7*64kB 13*128kB 5*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3536kB

Jul  6 06:39:23 unraid kernel: Normal: 906*4kB 2*8kB 1*16kB 3*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3752kB

Jul  6 06:39:23 unraid kernel: HighMem: 34*4kB 11*8kB 399*16kB 348*32kB 83*64kB 40*128kB 5*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 29456kB

Jul  6 06:39:23 unraid kernel: 1230182 total pagecache pages

Jul  6 06:39:23 unraid kernel: 0 pages in swap cache

Jul  6 06:39:23 unraid kernel: Swap cache stats: add 0, delete 0, find 0/0

Jul  6 06:39:23 unraid kernel: Free swap  = 0kB

Jul  6 06:39:23 unraid kernel: Total swap = 0kB

Jul  6 06:39:23 unraid kernel: 1572848 pages RAM

Jul  6 06:39:23 unraid kernel: 1344514 pages HighMem

Jul  6 06:39:23 unraid kernel: 275946 pages reserved

Jul  6 06:39:23 unraid kernel: 161883 pages shared

Jul  6 06:39:23 unraid kernel: 1132557 pages non-shared
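For context, "order:3" in the trace above means the kernel needed 2^3 = 8 physically contiguous pages; with 4 kB pages that is a single 32 kB chunk, exactly the kind of request that fails first once low memory becomes fragmented. A quick sanity check of that arithmetic:

```shell
# Size of an order-3 allocation: 2^order contiguous pages of 4 kB each.
order=3
page_kb=4
echo "$(( (1 << order) * page_kb )) kB"   # prints: 32 kB
```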

AOCsyslog.zip

So my theory is that disk activity on the AOC card is causing problems for the whole unRAID system. At this point I am a bit worried about writing any valuable data to those disks.

 

Based on the inability to allocate memory (the page allocation failure started the whole series of errors you saw), my theory is that you were preclearing multiple disks, did not use any of the preclear script's options to limit the memory it uses, and simply ran your server out of "low" memory. When the Samba process (smbd) went to allocate more, there was none free.

Jul  6 06:39:23 unraid kernel: smbd: page allocation failure: order:3, mode:0x4020

Jul  6 06:39:23 unraid kernel: Pid: 4757, comm: smbd Not tainted 3.0.35-unRAID #2 (Errors)

Jul  6 06:39:23 unraid kernel: Call Trace: (Errors)

Jul  6 06:39:23 unraid kernel:  [<c105fee8>] warn_alloc_failed+0xb2/0xc4 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c1060653>] __alloc_pages_nodemask+0x456/0x47f (Errors)

Jul  6 06:39:23 unraid kernel:  [<c10606d0>] __get_free_pages+0xf/0x21 (Errors)

Jul  6 06:39:23 unraid kernel:  [<c107dfa0>] __kmalloc+0x28/0xff (Errors)

 

Good luck with your server. Next time you preclear multiple drives, consider using the -r and -w options.
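A hedged sketch of what that might look like - the option values and device name below are placeholders, not recommendations; check `preclear_disk.sh -h` on your own copy for the exact semantics of -r and -w:

```shell
# Placeholder invocation: -r/-w limit the read/write buffer sizes so that
# parallel preclears don't exhaust low memory. 65536 and /dev/sdX are
# illustrative only; verify against preclear_disk.sh -h before running.
DEV=/dev/sdX
echo "preclear_disk.sh -r 65536 -w 65536 $DEV"
```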

 

Joe L.

Thanks Joe.

I agree with you that the syslog entries look like a low-memory issue. However, checking the unRAID and ESXi memory allocation, I could not see it getting exhausted. Usage was climbing, but never went over 75% of available RAM.

 

I believe there is something more going on...

I precleared multiple drives (2 or 3 max at a time) before, using the M1015, and never noticed these errors (yes, I check the syslog periodically, LOL). Now with the Supermicro card it is enough to start a preclear on one drive to see the issue.

 

Maybe, just maybe, the firmware on the Supermicro card is too old. Currently it is 7.xx. Since this is an LSI SAS2008-based card, can I apply exactly the same procedure and firmware here as in the case of the M1015?

 

Appreciate your help.

I believe there is something more going on...

I suppose there is always the possibility that the driver is overwriting the memory allocated to it, and therefore corrupting the free memory map in the kernel. (We always assume that a process only uses the memory it was allocated, and does not accidentally use more.)

 

Joe L.

Maybe, just maybe, the firmware on the Supermicro card is too old. Currently it is 7.xx. Since this is an LSI SAS2008-based card, can I apply exactly the same procedure and firmware here as in the case of the M1015?

 

Yes, that firmware is old. Drives on my USAS2-L8i started throwing errors when I added my 5-in-3 drive cages. This was fixed by updating the card's firmware.

 

The firmware is currently at 13.5 - you can go to the LSI site and download it. I flash my card using the Linux version of the flash utility with unRAID running (but it's best to stop the array first - you can get some strange errors if the drives are accessed while you are flashing!).

Thanks PeterB

 

Will this LSI SAS 9200-8e firmware be OK?

http://www.lsi.com/support/Pages/Download-Results.aspx?productcode=P00055&assettype=0&component=Storage%20Component&productfamily=Host%20Bus%20Adapters&productname=LSI%20SAS%209200-8e

 

I believe I can use the Linux flash utility, but borrow the newer 13.5 firmware from the DOS/Windows package?

 

 

Will try tomorrow and report back...

 

I believe that will be the same firmware, although I downloaded the 9211 file, which includes both IT and IR firmwares.

 

I believe I can use the Linux flash utility, but borrow the newer 13.5 firmware from the DOS/Windows package?

 

 

Will try tomorrow and report back...

 

That is correct. However, depending on the current firmware version and card identity, you may need to revert to an older flash utility - at one stage I was using a release 5 (P5) flash utility, as newer versions attempt to prevent cross-flashing, etc. I still have a copy of the P5 installer if you need it and can't find it anywhere.

 

Edit:

Ah, I see the P5 Linux installer is still there for download.

Just to report on the progress.

 

1. Downloaded the LSI IT firmware and BIOS from the LSI website here: http://www.lsi.com/support/Pages/Download-Results.aspx?productcode=P00055&assettype=0&component=Storage%20Component&productfamily=Host%20Bus%20Adapters&productname=LSI%20SAS%209200-8e

2. Downloaded the current P13 Linux sas2flash utility from the same location.

3. Passed the Supermicro controller through to a separate OpenSUSE VM under ESXi (to be on the safe side; this could also be done from within unRAID itself).

4. Flashed without any problems, like so: sas2flash -o -f <firmware file> -b <bios rom file>

5. sas2flash -list reported firmware version 13.0.57 and BIOS 7.25.0.

6. Passed the card back to unRAID and voila - everything was detected and the disks report correctly.
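The flash sequence from the steps above can be sketched as a dry run (the firmware and BIOS filenames are placeholders - substitute the IT images actually downloaded from LSI, and stop the array first):

```shell
# Placeholder image names; use the files from your LSI download.
FW=2118it.bin
BIOS=mptsas2.rom

# Print the commands rather than run them (remove the echos to flash for real).
echo "sas2flash -o -f $FW -b $BIOS"   # -o: advanced mode; flash firmware + BIOS
echo "sas2flash -list"                # verify the reported versions afterwards
```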

 

I have just started some preclears to see if the initial issue is gone. To be continued...

 

PeterB, I used the latest sas2flash and it did not complain, possibly because the Supermicro card was already an '8e' type, which means IT mode - so no cross-flashing here...

I am having the same kind of issue as the OP.

 

Jul 24 21:13:38 Tower kernel: Pid: 1444, comm: smbd Not tainted 3.0.35-unRAID #2 (Errors)
Jul 24 21:13:38 Tower kernel: Call Trace: (Errors)
Jul 24 21:13:38 Tower kernel:  [<c105fee8>] warn_alloc_failed+0xb2/0xc4 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1060653>] __alloc_pages_nodemask+0x456/0x47f (Errors)
Jul 24 21:13:38 Tower kernel:  [<c10606d0>] __get_free_pages+0xf/0x21 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c107dfa0>] __kmalloc+0x28/0xff (Errors)
Jul 24 21:13:38 Tower kernel:  [<c103efff>] ? hrtimer_try_to_cancel+0x6e/0x77 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c12910a9>] pskb_expand_head+0xca/0x1eb (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1291549>] __pskb_pull_tail+0x41/0x21f (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1298321>] dev_hard_start_xmit+0x20a/0x322 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c12a6d6a>] sch_direct_xmit+0x50/0x137 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1298537>] dev_queue_xmit+0xfe/0x274 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c12b3d4f>] ip_finish_output+0x237/0x272 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c108ccf1>] ? do_sys_poll+0x129/0x188 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c12b3e2a>] ip_output+0xa0/0xa8 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c12b326e>] ip_local_out+0x1b/0x1e (Errors)
Jul 24 21:13:38 Tower kernel:  [<c12b3772>] ip_queue_xmit+0x2a5/0x2f2 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c12c2ce8>] tcp_transmit_skb+0x4d7/0x50d (Errors)
Jul 24 21:13:38 Tower kernel:  [<c12c4ff2>] tcp_write_xmit+0x2f9/0x3d7 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c12c50f8>] tcp_push_one+0x28/0x2c (Errors)
Jul 24 21:13:38 Tower kernel:  [<c12bad70>] do_tcp_sendpages+0x352/0x493 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c12baf02>] tcp_sendpage+0x51/0x66 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c12baeb1>] ? do_tcp_sendpages+0x493/0x493 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c12d2725>] inet_sendpage+0x82/0x9c (Errors)
Jul 24 21:13:38 Tower kernel:  [<c12d26a3>] ? inet_dgram_connect+0x5e/0x5e (Errors)
Jul 24 21:13:38 Tower kernel:  [<c128a95e>] kernel_sendpage+0x1a/0x2d (Errors)
Jul 24 21:13:38 Tower kernel:  [<c128a998>] sock_sendpage+0x27/0x2c (Errors)
Jul 24 21:13:38 Tower kernel:  [<c10995ce>] pipe_to_sendpage+0x5a/0x6c (Errors)
Jul 24 21:13:38 Tower kernel:  [<c128a971>] ? kernel_sendpage+0x2d/0x2d (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1099634>] splice_from_pipe_feed+0x54/0xc4 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1099574>] ? splice_from_pipe_begin+0x10/0x10 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1099b8a>] __splice_from_pipe+0x36/0x55 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1099574>] ? splice_from_pipe_begin+0x10/0x10 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1099cf1>] splice_from_pipe+0x51/0x64 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1099d30>] ? default_file_splice_write+0x2c/0x2c (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1099d43>] generic_splice_sendpage+0x13/0x15 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1099574>] ? splice_from_pipe_begin+0x10/0x10 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1099e85>] do_splice_from+0x57/0x61 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1099ea6>] direct_splice_actor+0x17/0x1c (Errors)
Jul 24 21:13:38 Tower kernel:  [<c109a471>] splice_direct_to_actor+0xbe/0x16b (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1099e8f>] ? do_splice_from+0x61/0x61 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c109a569>] do_splice_direct+0x4b/0x62 (Errors)
Jul 24 21:13:38 Tower kernel:  [<c1080516>] do_sendfile+0x157/0x19f (Errors)
Jul 24 21:13:38 Tower kernel:  [<c108059a>] sys_sendfile64+0x3c/0x7c (Errors)
Jul 24 21:13:38 Tower kernel:  [<c130f525>] syscall_call+0x7/0xb (Errors)

 

My rig is quite similar to the OP's.

unRAID 5.0-rc5 running under ESXi 5, with 4 GB out of 8 GB RAM assigned to unRAID; X8SIL board, 128 GB SSD for ESXi, 1x M1015 (flashed to LSI IT mode) + HP SAS expander passed through to unRAID. 1x 3 TB Seagate VX as parity and 5x 3 TB Seagates as data disks. All the 3 TB disks have the latest firmware, CC4H.

The errors occurred between 19:31 and 21:13, while I was running the preclear_disk.sh script to clear two 2 TB Seagate disks, in the post-read phase. I am sure the post-read phase started before 17:00 and had progressed to 81% by 1:45 the next morning.

 

I remember I was probably streaming a movie from the unRAID server during the above time window (i.e. 19:31 - 21:13), and I checked the progress of the preclear a couple of times from the vSphere client.

I also had unMENU and the Plex media server turned on, though I did not visit the Plex server during that window.

 

It is off topic, but another strange thing I noticed was that I could not transfer big files to disk4. I tried to transfer a 7.7 GB video file to disk4 from my computer over the network, and the computer complained it could not write to the disk; checking disk4, I found a file of exactly 4 GB in size had been written. I had this problem on disk4 before, and it went away right after I restarted the unRAID server yesterday (I haven't restarted it since), but the same problem appeared again later. Meanwhile I wrote two small files (244 MB and 311 MB) to disk4 successfully. Then I tried to transfer a folder containing one 12 GB file and several small files in the KB range; the computer returned the error "Cannot write to the disk", and nothing was written to disk4. So it appears that right after the restart I could write big files (probably >4 GB) to disk4, but after a certain point in time I could not.

syslog-2012-07-25.zip

(Quoting the previous post in full - the identical trace and description are trimmed here; see above.)

It looks to me as if you ran out of RAM.

 

This line says memory could not be allocated.

Jul 24 21:13:38 Tower kernel:  [<c105fee8>] warn_alloc_failed+0xb2/0xc4 (Errors)

 

Once you run out of RAM, all bets are off. (And it is typically "low" memory that runs out.)

Check it with

free -l

 

Shut down your add-ons, or limit the memory they are using. The two preclears alone would use the bulk of memory unless you used the -r, -w and -b options to limit their memory usage.
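Besides `free -l`, a rough check can be done straight from /proc/meminfo. Note that the LowFree/LowTotal fields only appear on kernels that split low and high memory (32-bit, as unRAID 5.x is); MemFree is always there:

```shell
# Print overall free memory, plus the low-memory figure when the kernel
# exposes it (LowFree is absent on most 64-bit kernels).
awk '/^MemFree:|^LowFree:/ {print $1, $2, $3}' /proc/meminfo
```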

 

I do not think this is directly related to the "rc" release.

 

Joe L.

Share this post


Link to post

Thanks Joe. I agree that these errors might not be related to the RC, and were instead caused by some other program, such as preclear_disk.sh. Anyway, it is gone now that the preclear has finished on my 2x 2 TB hard disks. And I know where it came from.

 

I just added the 2 disks to the array and rebooted unRAID cleanly.

 

Now I am able to copy the files I mentioned above to disk4, which I could not do yesterday. This still bothers me.

Well, I marked this topic as SOLVED as the issue has not been seen since.

As suggested by Joe and other posters, those errors only happen when running preclear together with heavy use of a Samba share; combined, the two seem to consume too much RAM.

 

Anyway, the problem was NOT caused by old LSI firmware, and it is probably not an RC issue either.
