[RESOLVED] Kernel NULL pointer dereference while attempting to preclear



I'm using the latest stable release of unRAID (4.7) and the latest preclear script (1.7), and I have been unable to successfully complete a preclear. I tested the server for over a month with a 100GB and a 120GB drive I had lying around and had no issues. I didn't know about the preclear script when I first set the server up, so those drives were just formatted through unRAID itself in the normal fashion.

 

I decided to rebuild the unRAID array from scratch to include both 2TB drives and the 100GB drive. I started the preclear on one of the 2TB drives. At about the 30% mark (about an hour) into the initial step (pre-read), the preclear stopped and logged something close to what I have pasted from the syslog below.

 

I rebooted and ran 3 full passes of memtest with no errors. I then attempted to preclear the other 2TB drive and hit the same problem about 50% through the first step. I decided to try preclearing the 100GB drive to see if it would happen again. The 100GB drive got to about 50% through the post-read step and then errored out again. I copied and pasted that exact error below.

 

One thing to note: the unRAID server itself is still responsive. I'm able to telnet into it, log in from the console and get to the main //tower URL. I have UnMenu installed along with a number of addons, and after this problem happens, UnMenu seems to have crashed as well. Because of this, I backed up my current unRAID flash disk, wiped it, copied a fresh clean install of unRAID to it, and am currently attempting to preclear the 100GB drive again without any 3rd party addons installed.

 

I'm not sure whether that will work, so I wanted to get this posted in the hope that somebody can help me figure out what in the heck is going on with my box. :)
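For anyone else chasing a failure like this, one quick check is to scan the syslog for the markers that appear at the start of the traces pasted below. This is just a sketch — the `oops_lines` helper name is mine, not part of unRAID or the preclear script:

```shell
# Sketch: count syslog lines that look like the start of a kernel oops.
# Feed it a syslog on stdin, e.g.:  oops_lines < /var/log/syslog
oops_lines() {
    grep -cE 'BUG: unable to handle|Oops: [0-9]+'
}
```

If it prints anything other than 0, the preclear most likely died on a kernel oops rather than a drive error.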

 

--edit--

After letting the 100GB drive run, it completed successfully on the clean install of unRAID with no 3rd party addons. I am now running the preclear script on both of my 2TB drives at the same time. If that works too (I will know in about 10 hours), then I'm going to assume that either UnMenu or some addon I was using was somehow causing severe kernel issues. I'm still open to other suggestions or comments based on what is posted here.

 

Motherboard: Asus A8N SLI Deluxe

Memory: Corsair 2GB (2x1GB) Value Select DDR400

CPU: AMD Athlon 64 4000

PSU: Antec TRU-430 Watt

Hard disk drives:

2x Western Digital 2TB WD20EARS

1x Maxtor DiamondMax 10 family 100GB (ATA/133 and SATA/150)

PCI S3 generic VGA video card

 

Cut and paste from syslog, full syslog attached:

 

Mar  5 04:29:58 Tower kernel: BUG: unable to handle kernel NULL pointer dereference at 00000014

Mar  5 04:29:58 Tower kernel: IP: [<c1087e54>] block_invalidatepage+0x24/0x97

Mar  5 04:29:58 Tower kernel: *pdpt = 0000000011fbf001 *pde = 0000000000000000

Mar  5 04:29:58 Tower kernel: Oops: 0000 [#1] SMP

Mar  5 04:29:58 Tower kernel: last sysfs file: /sys/devices/virtual/block/loop0/uevent

Mar  5 04:29:58 Tower kernel: Modules linked in: md_mod xor forcedeth sata_sil sata_nv amd74xx [last unloaded: xor]

Mar  5 04:29:58 Tower kernel:

Mar  5 04:29:58 Tower kernel: Pid: 30275, comm: dd Not tainted (2.6.32.9-unRAID #8) System name

Mar  5 04:29:58 Tower kernel: EIP: 0060:[<c1087e54>] EFLAGS: 00010203 CPU: 0

Mar  5 04:29:58 Tower kernel: EIP is at block_invalidatepage+0x24/0x97

Mar  5 04:29:58 Tower kernel: EAX: 00000000 EBX: 00000000 ECX: 00000002 EDX: cb9d9c80

Mar  5 04:29:58 Tower kernel: ESI: c4f54000 EDI: cb9d9c80 EBP: d1fbde74 ESP: d1fbde5c

Mar  5 04:29:58 Tower kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

Mar  5 04:29:58 Tower kernel: Process dd (pid: 30275, ti=d1fbc000 task=c46fa940 task.ti=d1fbc000)

Mar  5 04:29:58 Tower kernel: Stack:

Mar  5 04:29:58 Tower kernel:  00000000 00000000 00000000 c1087e30 f7022b28 00000006 d1fbde80 c104f651

Mar  5 04:29:58 Tower kernel: <0> c4f54000 d1fbde90 c104fa21 c4f54000 00c1f149 d1fbdf08 c104faf1 01757bd9

Mar  5 04:29:58 Tower kernel: <0> 00000000 00000000 00000000 f7022b28 00000000 ffffffff d1fbdec8 00c1f149

Mar  5 04:29:58 Tower kernel: Call Trace:

Mar  5 04:29:58 Tower kernel:  [<c1087e30>] ? block_invalidatepage+0x0/0x97

Mar  5 04:29:58 Tower kernel:  [<c104f651>] ? do_invalidatepage+0x19/0x1c

Mar  5 04:29:58 Tower kernel:  [<c104fa21>] ? truncate_inode_page+0x4a/0x84

Mar  5 04:29:58 Tower kernel:  [<c104faf1>] ? truncate_inode_pages_range+0x96/0x238

Mar  5 04:29:58 Tower kernel:  [<c104fc9f>] ? truncate_inode_pages+0xc/0x10

Mar  5 04:29:58 Tower kernel:  [<c108a52d>] ? kill_bdev+0x2c/0x2f

Mar  5 04:29:58 Tower kernel:  [<c108adea>] ? __blkdev_put+0x43/0xf7

Mar  5 04:29:58 Tower kernel:  [<c108aea8>] ? blkdev_put+0xa/0xc

Mar  5 04:29:58 Tower kernel:  [<c108b698>] ? blkdev_close+0x25/0x29

Mar  5 04:29:58 Tower kernel:  [<c106d39c>] ? __fput+0xd9/0x17d

Mar  5 04:29:58 Tower kernel:  [<c106d659>] ? fput+0x17/0x19

Mar  5 04:29:58 Tower kernel:  [<c106b068>] ? filp_close+0x51/0x5b

Mar  5 04:29:58 Tower kernel:  [<c106bfe5>] ? sys_close+0x5c/0x8e

Mar  5 04:29:58 Tower kernel:  [<c1002935>] ? syscall_call+0x7/0xb

Mar  5 04:29:58 Tower kernel: Code: 46 44 5b 5e 5f 5d c3 55 89 e5 57 56 89 c6 53 83 ec 0c 89 55 e8 8b 00 a8 01 75 04 0f 0b eb fe f6 c4 08 74 72 8b 7e 0c 31 c0 89 fb <8b> 53 14 01 c2 89 55 f0 8b 53 04 39 45 e8 89 55 ec 77 3f e8 96

Mar  5 04:29:58 Tower kernel: EIP: [<c1087e54>] block_invalidatepage+0x24/0x97 SS:ESP 0068:d1fbde5c

Mar  5 04:29:58 Tower kernel: CR2: 0000000000000014

Mar  5 04:29:58 Tower kernel: ---[ end trace cbb405a0a413a409 ]---

syslog.txt


Posting a follow-up reply just to state that one of the 2TB drive preclears I was running has already failed with the same error:

 

It looks like I'm still having the same issues with the clean install of unRaid, so I'm back to square one and out of ideas.

 

Mar  5 10:18:37 Tower kernel: BUG: unable to handle kernel NULL pointer dereference at 00000014

Mar  5 10:18:37 Tower kernel: IP: [<c1087e54>] block_invalidatepage+0x24/0x97

Mar  5 10:18:37 Tower kernel: *pdpt = 000000003130e001 *pde = 0000000000000000

Mar  5 10:18:37 Tower kernel: Oops: 0000 [#1] SMP

Mar  5 10:18:37 Tower kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/host1/target1:0:0/1:0:0:0/block/sda/stat

Mar  5 10:18:37 Tower kernel: Modules linked in: md_mod xor forcedeth sata_sil sata_nv amd74xx [last unloaded: xor]

Mar  5 10:18:37 Tower kernel:

Mar  5 10:18:37 Tower kernel: Pid: 21618, comm: dd Not tainted (2.6.32.9-unRAID #8) System name

Mar  5 10:18:37 Tower kernel: EIP: 0060:[<c1087e54>] EFLAGS: 00010207 CPU: 0

Mar  5 10:18:37 Tower kernel: EIP is at block_invalidatepage+0x24/0x97

Mar  5 10:18:37 Tower kernel: EAX: 00000000 EBX: 00000000 ECX: 00000002 EDX: e2e27840

Mar  5 10:18:37 Tower kernel: ESI: c4f56000 EDI: e2e27840 EBP: ee283e74 ESP: ee283e5c

Mar  5 10:18:37 Tower kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

Mar  5 10:18:37 Tower kernel: Process dd (pid: 21618, ti=ee282000 task=f74c2940 task.ti=ee282000)

Mar  5 10:18:37 Tower kernel: Stack:

Mar  5 10:18:37 Tower kernel:  00000000 00000000 00000000 c1087e30 f7022b28 00000000 ee283e80 c104f651

Mar  5 10:18:37 Tower kernel: <0> c4f56000 ee283e90 c104fa21 c4f56000 0112b98a ee283f08 c104faf1 1d1c1115

Mar  5 10:18:37 Tower kernel: <0> 00000000 00000000 00000000 f7022b28 00000000 ffffffff ee283ec8 0112b98a

Mar  5 10:18:37 Tower kernel: Call Trace:

Mar  5 10:18:37 Tower kernel:  [<c1087e30>] ? block_invalidatepage+0x0/0x97

Mar  5 10:18:37 Tower kernel:  [<c104f651>] ? do_invalidatepage+0x19/0x1c

Mar  5 10:18:37 Tower kernel:  [<c104fa21>] ? truncate_inode_page+0x4a/0x84

Mar  5 10:18:37 Tower kernel:  [<c104faf1>] ? truncate_inode_pages_range+0x96/0x238

Mar  5 10:18:37 Tower kernel:  [<c104fc9f>] ? truncate_inode_pages+0xc/0x10

Mar  5 10:18:37 Tower kernel:  [<c108a52d>] ? kill_bdev+0x2c/0x2f

Mar  5 10:18:37 Tower kernel:  [<c108adea>] ? __blkdev_put+0x43/0xf7

Mar  5 10:18:37 Tower kernel:  [<c108aea8>] ? blkdev_put+0xa/0xc

Mar  5 10:18:37 Tower kernel:  [<c108b698>] ? blkdev_close+0x25/0x29

Mar  5 10:18:37 Tower kernel:  [<c106d39c>] ? __fput+0xd9/0x17d

Mar  5 10:18:37 Tower kernel:  [<c106d659>] ? fput+0x17/0x19

Mar  5 10:18:37 Tower kernel:  [<c106b068>] ? filp_close+0x51/0x5b

Mar  5 10:18:37 Tower kernel:  [<c106bfe5>] ? sys_close+0x5c/0x8e

Mar  5 10:18:37 Tower kernel:  [<c1002935>] ? syscall_call+0x7/0xb

Mar  5 10:18:37 Tower kernel: Code: 46 44 5b 5e 5f 5d c3 55 89 e5 57 56 89 c6 53 83 ec 0c 89 55 e8 8b 00 a8 01 75 04 0f 0b eb fe f6 c4 08 74 72 8b 7e 0c 31 c0 89 fb <8b> 53 14 01 c2 89 55 f0 8b 53 04 39 45 e8 89 55 ec 77 3f e8 96

Mar  5 10:18:37 Tower kernel: EIP: [<c1087e54>] block_invalidatepage+0x24/0x97 SS:ESP 0068:ee283e5c

Mar  5 10:18:37 Tower kernel: CR2: 0000000000000014

Mar  5 10:18:37 Tower kernel: ---[ end trace a8272eb19fd0120b ]---

 

  • 2 weeks later...

After pulling my hair out trying to figure this issue out, I gave up and ordered a replacement motherboard, CPU and RAM. I've been unable to determine a root cause, and nothing I have tried, including returning to the exact previous configuration that had been working fine, has resolved it.

 

Given that other people have been using this motherboard without issues, I can only assume that something in my hardware failed while I was transferring everything to the new case.


I've checked and reseated all cables multiple times. I've tried two different power supplies, two different CPUs, reseated the memory, tried a different flash device, tried the latest beta of unRaid 5, none of it helped.

 

I even went back and tried the two HDs I had been using for a month with no issues, and they were also now unable to complete a preclear reliably.

 

The new motherboard, CPU and RAM will be going in tonight, and I'm pretty confident that will resolve the issue. I think the old hardware I had lying around was just barely working, and I was lucky not to experience any issues during the first month of testing.


Want to hear a funny story?

 

Over the weekend I experienced a situation where unRAID became non-responsive. It required me to hard power-cycle the box, forcing a parity check. Twice the parity check failed on me, hanging partway through. I could telnet in and grab the syslog, and had a similar error:

 

 

Mar 13 03:19:01 BIGHOSS kernel: BUG: unable to handle kernel NULL pointer dereference at 00000006

Mar 13 03:19:01 BIGHOSS kernel: IP: [] copy_data+0x22/0x142 [md_mod]

Mar 13 03:19:01 BIGHOSS kernel: *pdpt = 00000000375b5001 *pde = 0000000000000000

Mar 13 03:19:01 BIGHOSS kernel: Oops: 0000 [#1] SMP

Mar 13 03:19:01 BIGHOSS kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:08.0/host4/target4:0:0/4:0:0:0/block/sdd/stat

Mar 13 03:19:01 BIGHOSS kernel: Modules linked in: md_mod xor forcedeth sata_sil sata_nv amd74xx

Mar 13 03:19:01 BIGHOSS kernel:

Mar 13 03:19:01 BIGHOSS kernel: Pid: 1441, comm: unraidd Not tainted (2.6.32.9-unRAID #5) System name

Mar 13 03:19:01 BIGHOSS kernel: EIP: 0060:[] EFLAGS: 00010287 CPU: 0

Mar 13 03:19:01 BIGHOSS kernel: EIP is at copy_data+0x22/0x142 [md_mod]

Mar 13 03:19:01 BIGHOSS kernel: EAX: c374f000 EBX: 6c745090 ECX: 00000002 EDX: 00000002

Mar 13 03:19:01 BIGHOSS kernel: ESI: 00000000 EDI: c3752f50 EBP: c2891ed4 ESP: c2891e9c

Mar 13 03:19:01 BIGHOSS kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068

Mar 13 03:19:01 BIGHOSS kernel: Process unraidd (pid: 1441, ti=c2890000 task=f756ba70 task.ti=c2890000)

Mar 13 03:19:01 BIGHOSS kernel: Stack:

Mar 13 03:19:01 BIGHOSS kernel:  00000000 00000000 00000002 00000001 c374f000 00000000 f8286c6c c2891f00

Mar 13 03:19:01 BIGHOSS kernel: <0> c2891ed8 f8284a0c 00000000 c3752ee0 c3752ee0 c3752f50 c2891f1c f829c8ff

Mar 13 03:19:01 BIGHOSS kernel: <0> 6c745090 00000000 00000001 00000004 00000004 c374d000 00000002 00000000

Mar 13 03:19:01 BIGHOSS kernel: Call Trace:

Mar 13 03:19:01 BIGHOSS kernel:  [] ? xor_blocks+0x26/0x7e [xor]

Mar 13 03:19:01 BIGHOSS kernel:  [] ? compute_parity+0x177/0x281 [md_mod]

Mar 13 03:19:01 BIGHOSS kernel:  [] ? handle_stripe+0x861/0xbc9 [md_mod]

Mar 13 03:19:01 BIGHOSS kernel:  [] ? unraidd+0x8f/0xb0 [md_mod]

Mar 13 03:19:01 BIGHOSS kernel:  [] ? unraidd+0x0/0xb0 [md_mod]

Mar 13 03:19:01 BIGHOSS kernel:  [] ? kthread+0x61/0x68

Mar 13 03:19:01 BIGHOSS kernel:  [] ? kthread+0x0/0x68

Mar 13 03:19:01 BIGHOSS kernel:  [] ? kernel_thread_helper+0x7/0x1a

Mar 13 03:19:01 BIGHOSS kernel: Code: ff fe 43 40 fb 5b 5e 5d c3 55 89 e5 57 56 53 83 ec 2c 8b 75 0c 89 45 d4 89 c8 8b 5d 08 89 55 d0 e8 eb a6 db c8 8b 4d d0 89 45 d8 <8b> 51 04 8b 01 39 f2 72 0d 77 04 39 d8 72 07 29 d8 c1 e0 09 eb

Mar 13 03:19:01 BIGHOSS kernel: EIP: [] copy_data+0x22/0x142 [md_mod] SS:ESP 0068:c2891e9c

Mar 13 03:19:01 BIGHOSS kernel: CR2: 0000000000000006

Mar 13 03:19:01 BIGHOSS kernel: ---[ end trace bbc193420da81a25 ]---

 

It was determined to be a bug in the 5.0 beta 2 I was running, so I fell back to 4.7 and am doing a parity check right now.

 

So I'm very interested to see if it was a hardware issue for you...


I'm pretty confident that it will end up being hardware related, given that the system worked fine for about a month and only started failing after I moved everything to the new case. I even attempted to skip the failing preclear steps and just format and start the array through unRAID. While that did finish, parity kept having issues, and unRAID itself generated much the same syslog error through normal usage as the preclear script did, including crashing the server repeatedly.

 

Given that I had zero issues with this hardware until I moved it to the new case, I'm going to assume I zapped it somehow or that it was already on the verge of failing. It is rather old hardware and was heavily used as a gaming system back in the day.

 

The replacement hardware has arrived and I'll be putting in the replacement parts late tonight after work. Based on my previous experience, the preclear would take anywhere from an hour to 6+ hours before it would fail. I could usually get it to fail faster by attempting to preclear all three drives at once, so I'm going to give that a go and I'll post an update to the thread as soon as I know anything.

 


New hardware is installed and performing wonderfully. It's amazing how much faster a modern CPU is compared to the old Athlon64 that I was using. :)

 

I have started a preclear using the latest script (1.8 as of this writing) on all three of my hard disks at the same time, using three separate SSH sessions. The 100GB drive will probably finish within a couple of hours at this rate. The two 2TB drives look like they will take much longer. Overall, the 2TB drives are reading at over 100MB/sec and the 100GB drive at around 60-70MB/sec.
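For reference, the concurrent preclears don't have to tie up interactive sessions. One approach is a detached `screen` session per drive, assuming `screen` is installed (it isn't part of a stock unRAID flash, so treat this as illustrative). The `echo` makes it a dry run, and sdb/sdc/sdd are example device names — check yours first:

```shell
# Dry run: print the command that would start one detached preclear per
# drive. Remove the `echo` on a box that actually has `screen` installed.
# sdb/sdc/sdd are example device names -- verify against your own array.
for d in sdb sdc sdd; do
    echo screen -dmS "preclear_$d" preclear_disk.sh "/dev/$d"
done
```

Run this way, the preclears survive a dropped telnet/SSH connection, and `screen -r preclear_sdb` reattaches to check progress.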

 

So far, no errors.

 

As an aside, this new motherboard and CPU makes my system much more quiet. That old Athlon64 heatsink+fan was loud!

 


Was it really just as easy as moving the drives over to the new hardware and rebooting?

 

I am running off of an Asus A8N-SLI Deluxe which is a great mobo (has 8 Sata II ports on the board) but it is like 5-6 years old.

 

What new equipment did you get, I'm curious.


I was also running on an Asus A8N-SLI Deluxe for a month with a 100GB SATA and a 120GB IDE HD without any issues. That test worked great, so I pulled the trigger on an Antec 900 case and two 2TB WD-EARS hard drives. After installing everything into the new case, when I tried to preclear the new drives, or even either of the old test drives, I got the previously discussed kernel errors.

 

After trying pretty much everything, I decided to buy a new motherboard, CPU and RAM as I figured it was probably the motherboard or RAM I was currently using that was causing the issues.

 

I picked up the following:

JetWay JHZ03-GT-LF AM3 AMD 880G HDMI Micro ATX AMD Motherboard

GeIL Value 2GB 240-Pin DDR3 SDRAM DDR3 1333 (PC3 10666) Desktop Memory Model GV32GB1333C9SC

AMD Sempron 140 Sargas 2.7GHz Socket AM3 45W Single-Core Processor SDX140HBGQBOX

 

Total cost with shipping was $133.83.

 

I picked that board based on the recommendation of another poster here who has been running a full 22-drive array off of it without issues for months. I'm not too worried about it not having fully solid caps (the listing says it does, but it really doesn't), as the Antec 900 case I have is keeping things very cool, and heat is the main concern with semi-solid caps. It is also noticeably faster and quieter than the A8N-SLI Deluxe I was using. I believe that's down to the new CPU cooling fan that came with the Sempron 140, along with having a passive chipset cooler instead of an active fan-based one.

 

Last night, I started preclearing all three drives (100GB, and both 2TB drives) all at the same time. The 100GB drive finished overnight in about 4 hours. Both 2TB drives were about 25% through writing zeroes to them when I left for work after about 8 hours.  All three drives never spiked above 27C and were usually hovering around 25C. The 2TB drives are reading and writing around 100-110MB/sec using 4K sector alignment and no jumpers. The 100GB drive was doing about 55-65MB/sec.

 

To answer your question about moving things to a new board, what you want to do is follow the directions posted elsewhere on the site. I believe they say to take a screenshot or write down which drives (by drive serial number) you have assigned to which slots in your array, power down everything, install the new hardware and then power everything back up, then check the devices screen to make sure each drive is where it is supposed to be.  I would double-check on those directions though. I didn't have anything on my array so I didn't have to worry about losing anything when swapping to the new hardware.
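One way to capture that drive-to-serial mapping before the swap is to pull the serial number out of `hdparm -I` output. A minimal sketch, assuming `hdparm` is available on the box (the `serial_of` helper name is mine):

```shell
# Sketch: print just the serial number from `hdparm -I` output on stdin.
# On the live box, something like:
#   for d in /dev/sd?; do echo "$d: $(hdparm -I "$d" | serial_of)"; done
serial_of() {
    grep -i 'Serial Number' | awk -F: '{gsub(/^[ \t]+/, "", $2); print $2}'
}
```

Saving that list to the flash drive before powering down gives you something to check the devices screen against after the new board comes up.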

 

I also noticed your problem seemed to go away once you downgraded to unRAID 4.7. That's good to hear!

 


$100 for the case, $160 for the HD's and $133 for the motherboard, cpu and ram.

 

Total cost: $393.

 

The thing about this build is it starts inexpensive, but with what I have right now (+plus an unraid license), I can expand to 6 drives and not buy anything else except the hard disks.  If I want to go further, the case I'm using will support up to 15 drives in 5in3's. I like the future options. :)

 


Preclear on all three drives at the same time completed successfully.

 

As of 5:47PM Pacific, it is 71% through the initial parity calculation.

 

I'm now having some super strange issues with CrashPlan not being able to run, but literally everything else on the server is functioning wonderfully. I'm going to call this issue resolved due to faulty hardware.
