[6.2.4] Segfault errors in rebuild/clearing disks



Hi,

1. I started a new Unraid array with 6.2.4: 19 disks + parity. Before this it ran 5.0 on the same disks.

I decided on a fresh start with 6.2.4: I formatted the USB drive and added only network.cfg, nothing else such as disk shares.

2. When I started the array with all disks on, the system froze after 2 hrs (nothing in the log about it).

3. Then I decided to use 1 disk + parity and start adding disks from there: my log (#1) is attached and shows some segfaults.

The first attempt went fine (still some errors, but parity was rebuilt with 1+1 disks); then I checked the memory with Memtest, which found no errors.

Now I'm trying to add the other disks (one at a time) and getting some segfault errors (see log #2).

 

 

Any ideas?

log#1.zip

log#2.zip

Link to comment

Next episode.

 

Ran MEMTEST all afternoon, no problems at all.

Restarted Unraid from scratch using 5.0.6, no problems at all (except for a bad disk detected and replaced), now running fine.

So what could be different between v5 and v6?

The bad disk was on the motherboard SATA.

Do I have to check the SASLP firmware or anything else?

 

My system:

 

Asus M3N-HT Deluxe

16 GB RAM

2x SASLP-MV8

 

 

Link to comment

You forgot to attach your v6 diagnostics!  ;)  Not much we can do without them.  Also while you are running v5, grab a syslog and attach it, so we can compare.  As it is, from your 6.2.4 syslog, there are only 11 drives found by the kernel.

 

One SAS card has 8 drives, the other has none.  I assume that may be the missing 8 drives?  Really really need to see the v6 diagnostics!
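For anyone who wants to repeat that drive count on their own syslog, here is a minimal sketch. It assumes the usual kernel `sd` driver message (`[sdX] Attached SCSI disk`); the function name `disks_by_host` and the sample lines are made up for illustration and are not taken from the attached logs. The USB flash drive is normally reported as a "removable disk" and so is not counted.

```python
import re

# Kernel message emitted by the sd driver when a disk is attached, e.g.
#   "kernel: sd 1:0:0:0: [sdb] Attached SCSI disk"
# The first number in "1:0:0:0" is the SCSI host (one per controller port or card).
ATTACHED = re.compile(r"sd (\d+):\d+:\d+:\d+: \[(sd[a-z]+)\] Attached SCSI disk")


def disks_by_host(syslog_text):
    """Map SCSI host number -> sorted list of disk names attached on it."""
    hosts = {}
    for line in syslog_text.splitlines():
        m = ATTACHED.search(line)
        if m:
            hosts.setdefault(int(m.group(1)), set()).add(m.group(2))
    return {host: sorted(disks) for host, disks in sorted(hosts.items())}


# Hypothetical sample lines, only to show the expected input shape.
sample = """\
Nov 28 20:01:02 Tower kernel: sd 1:0:0:0: [sdb] Attached SCSI disk
Nov 28 20:01:02 Tower kernel: sd 1:0:1:0: [sdc] Attached SCSI disk
Nov 28 20:01:03 Tower kernel: sd 7:0:0:0: [sdd] Attached SCSI disk
"""

result = disks_by_host(sample)
print(result)
print(sum(len(disks) for disks in result.values()), "disks found")
```

Running it on the v5 and v6 syslogs and comparing the two results should show quickly whether one controller is coming up empty under v6.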

 

Check the Marvell bug report for your cards ->  Marvell disk controller chipsets and virtualization

 

Not related, but check for a newer motherboard BIOS, yours is from 2011, and the newer technologies of current kernels do seem to work better with newer BIOSes.  Also check for newer firmware for the SAS cards.

 

The problem still looks like bad RAM or bad dependencies, conflicts in the libraries loaded.  Are you absolutely sure there is no v5 stuff left on the flash drive after you prepared 6.2.4 on it?  No old plugins?  Nothing in /boot/extra?  Nothing added in the go file?  (With the diagnostics, I could have checked that.)
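Without the diagnostics, the flash drive can also be checked by hand. The sketch below is one rough way to do it, under these assumptions: the flash is mounted at `/boot`, plugin files end in `.plg` and live in `plugins/` or `config/plugins/`, and a stock `config/go` file contains nothing but comments and the line that starts emhttp. The function name `flash_leftovers` is hypothetical.

```python
from pathlib import Path


def flash_leftovers(flash_root="/boot"):
    """Return notes on flash-drive content that could load extra software at boot."""
    root = Path(flash_root)
    notes = []

    # Packages placed in extra/ are installed at boot.
    extra = root / "extra"
    if extra.is_dir():
        notes += [f"extra package: {p.name}" for p in sorted(extra.iterdir())]

    # Plugin files; both locations are assumptions about the flash layout.
    for plugin_dir in (root / "plugins", root / "config" / "plugins"):
        if plugin_dir.is_dir():
            for plg in sorted(plugin_dir.glob("*.plg")):
                notes.append(f"plugin: {plg.relative_to(root)}")

    # Anything in the go file other than comments and the emhttp start line.
    go = root / "config" / "go"
    if go.is_file():
        for line in go.read_text(errors="replace").splitlines():
            stripped = line.strip()
            if stripped and not stripped.startswith("#") and "emhttp" not in stripped:
                notes.append(f"go file line: {stripped}")

    return notes


if __name__ == "__main__":
    for note in flash_leftovers():
        print(note)
```

An empty result only means none of these three places holds anything unexpected; it does not rule out other leftovers.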

Link to comment

Hi,

I used only 11 drives as a trial, because Unraid hung after a while.

As for dependencies, I don't think there's any issue, because I formatted my USB key from scratch and added only network.cfg.

 

This is what I'm going to try:

- check the diagnostics from the v5 version;

- search for and apply new firmware for the motherboard and the SASLP cards;

- start from v5 (fully functional today) and migrate to v6 (also from scratch);

- check and attach new v6 diagnostics (if I'm lucky enough to get the array started);

 

Thanks, will let you know.

Link to comment

Back again, so:

- no upgrades for MB BIOS (last available for motherboard already installed);

- v5 fully functional;

- installed v6: fully functional (see diagnostics) ... for about 30 minutes, then the GUI became unreachable again, although I can still get in via Telnet.

 

I did not check for a BIOS upgrade for the SASLP cards.

 

Attached you can find what I see on the console.

 

6.0_towerfu_backup-diagnostics-20161128-2040.zip

IMG_20161128_211749.jpg.5383c205bd8e17d4ce531370c180f8ee.jpg

20161128_syslog_5.0.zip

Link to comment
  • 3 weeks later...

Here I am again.

I replaced 2 disks that seemed to be faulty.

I disabled VT in the motherboard BIOS and ran a parity check twice; everything was all right.

I restarted the array (first with 5.0.6, then with 6.2.4) and it seemed to be OK.

After activating notifications in the Fix Common Problems plugin I got this (see the log for the full version).

 

 

Dec 15 20:57:03 TowerFU_backup kernel: plugin[4382]: segfault at ffffd78 ip 00002b6e9152e148 sp 00007ffccfeaba90 error 4 in ld-2.23.so[2b6e91518000+25000]

Dec 15 20:57:03 TowerFU_backup kernel: scan.php[4224]: segfault at 10000040 ip 000000000063cf45 sp 00007ffceb15ef28 error 4 in php[400000+701000]

Dec 15 20:58:42 TowerFU_backup php: /usr/local/emhttp/plugins/dynamix/scripts/notify 'smtp-init'

Dec 15 20:58:44 TowerFU_backup kernel: BUG: unable to handle kernel paging request at ffffea0012aa3bc0

Dec 15 20:58:44 TowerFU_backup kernel: IP: [<ffffffff810ea088>] free_pages_and_swap_cache+0x23/0x64

Dec 15 20:58:44 TowerFU_backup kernel: PGD 43f7f7067 PUD 43f7f6067 PMD 0

Dec 15 20:58:44 TowerFU_backup kernel: Oops: 0000 [#2] PREEMPT SMP

Dec 15 20:58:44 TowerFU_backup kernel: Modules linked in: xt_CHECKSUM iptable_mangle ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables vhost_net tun vhost macvtap macvlan ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat md_mod pata_marvell mxm_wmi kvm k10temp mvsas libsas scsi_transport_sas ahci libahci forcedeth pata_amd wmi asus_atk0110 acpi_cpufreq [last unloaded: md_mod]

Dec 15 20:58:44 TowerFU_backup kernel: CPU: 0 PID: 5117 Comm: plugin Tainted: G      D W      4.4.30-unRAID #2

Dec 15 20:58:44 TowerFU_backup kernel: Hardware name: System manufacturer System Product Name/M3N-HT DELUXE, BIOS ASUS M3N-HT Deluxe ACPI BIOS Revision 3401 01/28/2011

Dec 15 20:58:44 TowerFU_backup kernel: task: ffff88042b3ac080 ti: ffff88040ccc4000 task.ti: ffff88040ccc4000

Dec 15 20:58:44 TowerFU_backup kernel: RIP: 0010:[<ffffffff810ea088>]  [<ffffffff810ea088>] free_pages_and_swap_cache+0x23/0x64

Dec 15 20:58:44 TowerFU_backup kernel: RSP: 0018:ffff88040ccc7d60  EFLAGS: 00010202

Dec 15 20:58:44 TowerFU_backup kernel: RAX: 010000000004007c RBX: ffffea0012aa3bc0 RCX: ffff88043fc18c60

Dec 15 20:58:44 TowerFU_backup kernel: RDX: 000000000040e90f RSI: 00000000000001fe RDI: ffff88043fc0f4e0

Dec 15 20:58:44 TowerFU_backup kernel: RBP: ffff88040ccc7d80 R08: 0000000000000007 R09: 0000000000000d80

Dec 15 20:58:44 TowerFU_backup kernel: R10: 0000000000000d80 R11: 000000000040e90f R12: 0000000000000031

Dec 15 20:58:44 TowerFU_backup kernel: R13: ffff88040cdb3010 R14: 00000000000001fe R15: ffff88042b3ac650

Dec 15 20:58:44 TowerFU_backup kernel: FS:  00002b897effcc40(0000) GS:ffff88043fc00000(0000) knlGS:0000000000000000

Dec 15 20:58:44 TowerFU_backup kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

Dec 15 20:58:44 TowerFU_backup kernel: CR2: ffffea0012aa3bc0 CR3: 0000000001849000 CR4: 00000000000006f0

Dec 15 20:58:44 TowerFU_backup kernel: Stack:

Dec 15 20:58:44 TowerFU_backup kernel: ffff88040cdb3000 ffff88040ccc7df0 ffff88040ccc7e18 ffff880429db18e8

Dec 15 20:58:44 TowerFU_backup kernel: ffff88040ccc7da8 ffffffff810d8221 ffff88040ccc7df0 ffff88040cc63b80

Dec 15 20:58:44 TowerFU_backup kernel: ffff88040ccc7ed0 ffff88040ccc7dc0 ffffffff810d8f46 ffff88040ccc7df0

Dec 15 20:58:44 TowerFU_backup kernel: Call Trace:

Dec 15 20:58:44 TowerFU_backup kernel: [<ffffffff810d8221>] tlb_flush_mmu_free+0x28/0x40

Dec 15 20:58:44 TowerFU_backup kernel: [<ffffffff810d8f46>] tlb_flush_mmu+0x15/0x18

Dec 15 20:58:44 TowerFU_backup kernel: [<ffffffff810d8f58>] tlb_finish_mmu+0xf/0x34

Dec 15 20:58:44 TowerFU_backup kernel: [<ffffffff810e0bec>] exit_mmap+0x93/0x106

Dec 15 20:58:44 TowerFU_backup kernel: [<ffffffff81048260>] mmput+0x48/0xd9

Dec 15 20:58:44 TowerFU_backup kernel: [<ffffffff8104c8d8>] do_exit+0x344/0x87d

Dec 15 20:58:44 TowerFU_backup kernel: [<ffffffff81040855>] ? __do_page_fault+0x2c1/0x335

Dec 15 20:58:44 TowerFU_backup kernel: [<ffffffff8104ce79>] do_group_exit+0x3c/0x95

Dec 15 20:58:44 TowerFU_backup kernel: [<ffffffff8104cee1>] SyS_exit_group+0xf/0xf

Dec 15 20:58:44 TowerFU_backup kernel: [<ffffffff81629c2e>] entry_SYSCALL_64_fastpath+0x12/0x6d

Dec 15 20:58:44 TowerFU_backup kernel: Code: e8 7b a8 fd ff 5b 5d c3 55 48 89 e5 41 56 41 89 f6 41 55 49 89 fd 41 54 45 31 e4 53 e8 df b3 fd ff 45 39 e6 7e 30 4b 8b 5c e5 00 <48> 8b 03 f6 c4 80 74 1e 8b 43 18 85 c0 79 17 f0 0f ba 2b 00 72

Dec 15 20:58:44 TowerFU_backup kernel: RIP  [<ffffffff810ea088>] free_pages_and_swap_cache+0x23/0x64

Dec 15 20:58:44 TowerFU_backup kernel: RSP <ffff88040ccc7d60>

Dec 15 20:58:44 TowerFU_backup kernel: CR2: ffffea0012aa3bc0

Dec 15 20:58:44 TowerFU_backup kernel: ---[ end trace c4b71505ad8d1d2f ]---

Dec 15 20:58:44 TowerFU_backup kernel: Fixing recursive fault but reboot is needed!

 

 

 

Any help is appreciated.

towerfu_backup-diagnostics-20161215-2107.zip

syslog.txt

Link to comment

Definitely run memtest for at least 24 hours.  No way that FCP (Fix Common Problems) will cause a segfault unless there's a hardware failure:

 

Dec 15 20:57:03 TowerFU_backup kernel: scan.php[4224]: segfault at 10000040 ip 000000000063cf45 sp 00007ffceb15ef28 error 4 in php[400000+701000]

 

And if you've already run a memtest, then start pulling out channels of memory (e.g. if you've got 4 sticks, pull out 2).  Most motherboards that don't use buffered memory are not always reliable with 4 sticks.
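If it helps to read those segfault lines, here is a small sketch that pulls them apart. It assumes the standard x86 kernel format, where the `error` field is a hexadecimal page-fault error code (bit 0: protection violation rather than page not present, bit 1: write rather than read, bit 2: user mode rather than kernel mode). The function name `decode_segfault` is my own, not part of any tool.

```python
import re

SEGFAULT = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Decode one kernel 'segfault at ...' syslog line into a dict, or None."""
    m = SEGFAULT.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # the kernel prints the error code in hex
    info = {
        "process": m.group("proc"),
        "pid": int(m.group("pid")),
        "fault_address": int(m.group("addr"), 16),
        "object": m.group("obj"),
        "cause": "protection violation" if err & 1 else "page not present",
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }
    if m.group("base"):
        # Offset of the faulting instruction inside the mapped binary or library.
        info["offset_in_object"] = hex(int(m.group("ip"), 16) - int(m.group("base"), 16))
    return info


line = ("Dec 15 20:57:03 TowerFU_backup kernel: scan.php[4224]: segfault at 10000040 "
        "ip 000000000063cf45 sp 00007ffceb15ef28 error 4 in php[400000+701000]")
print(decode_segfault(line))
```

For the line above, `error 4` decodes to a user-mode read of a page that is not mapped, at offset 0x23cf45 inside `php`. That alone cannot tell bad RAM from a software bug, which is why the memory tests still matter.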

Link to comment
  • 2 weeks later...

Hi, and happy Xmas to everybody.

 

I got tired of my memory/motherboard problems and replaced everything with a new A88XM-Plus / A8-7600K / Ballistix memory.

After the normal troubleshooting I got stuck with this...

Any suggestions?

 

BTW, after this I upgraded the BIOS, something went wrong and now there is no POST, no video, nothing at all... but that is another story.

 

Thanks to all

IMG_20161224_171210.jpg.a3b602fde1bafb488cb5b066cfd19c88.jpg

Link to comment

Well, a few days ago, I spent quite a bit of time studying your diagnostics and syslogs, trying to find something definite, and failed.  I just tried again, and I still can't explain the issues.  It still really looks like a memory problem or a dependency problem, but I could not find any dependency issues, and you've tested your memory (wouldn't mind seeing what the PassMark Memtest could find though).

 

Your last post, on different hardware, also shows a failure, this time before unRAID could even boot: a 'kernel panic' with 'unable to mount root fs'.  It's a different problem at least, possibly the boot drive not being prepared correctly, which doesn't make sense as it's probably the same one you have been booting with.  Not a likely solution, but do try a different USB port, preferably a USB 2.0 port.

Link to comment

So, after getting this message, as I stated before, I decided to upgrade the motherboard BIOS. During the upgrade something went wrong (I don't know what, because at the end the BIOS message told me that everything was OK) and the mobo did not start at all.

Then I replaced the mobo (thanks, Santa Amazon): the new mobo came with the upgraded BIOS already installed.

I re-checked connectors, memory and cables, and now everything seems to be OK (no errors and no strange behaviour at present).

Mystery not solved, but Unraid is functional (at present, and crossing fingers...).

 

A spare mobo for sale... anyone ?  :D

 

Thanks, and have a happy new year's eve.

Link to comment
