pantheis Posted March 5, 2011

I'm using the latest stable release of unRAID (4.7) and the latest preclear script (1.7), and I have been unable to successfully complete a preclear.

I tested the server for over a month with a 100GB and a 120GB drive I had lying around and had no issues. I didn't know about the preclear script when I first set the server up, so these drives were just formatted through unRAID itself in the normal fashion.

I decided to rebuild the unRAID array from scratch to include both 2TB drives and the 100GB drive. I started the preclear on one of the 2TB drives. About 30% (about an hour) into the initial step (pre-read), the preclear stopped and spat out something close to what I have listed from the syslog below. I rebooted and ran 3 full passes of memtest with no errors. I then attempted to preclear the other 2TB drive and had the same problem about 50% through the first step. I decided to try preclearing the 100GB drive to see if it would happen again. The 100GB drive got about 50% through the post-read step and then errored out again. That exact error is the one I copied and pasted below.

One thing to note: the unRAID server itself is still responsive. I'm able to telnet into it, log in from the console and get to the main //tower URL. I have UnMenu installed along with a number of addons, and after this problem happens, UnMenu seems to have crashed as well. Because of this, I backed up my current unRAID flash disk, wiped it, copied a fresh clean install of unRAID to it, and am currently attempting to preclear the 100GB drive again without any 3rd party addons installed. I'm not sure whether that will work, so I wanted to get this posted in the hope that somebody can help me figure out what in the heck is going on with my box.

--edit-- After letting the 100GB drive run, it completed successfully with a clean install of unRAID with no 3rd party addons.
I am now running the preclear script on both of my 2TB drives at the same time. If that works too (I will know in about 10 hours), then I'm going to assume that either UnMenu or some addon I was using was somehow causing severe kernel issues. I'm still open to other suggestions or comments based on what is posted here.

Motherboard: Asus A8N SLI Deluxe
Memory: Corsair 2GB (2x1GB) Value Select DDR400
CPU: AMD Athlon 64 4000
PSU: Antec TRU-430 Watt
Hard disk drives: 2x Western Digital 2TB WD20EARS, 1x Maxtor DiamondMax 10 family 100GB (ATA/133 and SATA/150)
Video: PCI S3 generic VGA video card

Cut and paste from syslog, full syslog attached:

Mar 5 04:29:58 Tower kernel: BUG: unable to handle kernel NULL pointer dereference at 00000014
Mar 5 04:29:58 Tower kernel: IP: [<c1087e54>] block_invalidatepage+0x24/0x97
Mar 5 04:29:58 Tower kernel: *pdpt = 0000000011fbf001 *pde = 0000000000000000
Mar 5 04:29:58 Tower kernel: Oops: 0000 [#1] SMP
Mar 5 04:29:58 Tower kernel: last sysfs file: /sys/devices/virtual/block/loop0/uevent
Mar 5 04:29:58 Tower kernel: Modules linked in: md_mod xor forcedeth sata_sil sata_nv amd74xx [last unloaded: xor]
Mar 5 04:29:58 Tower kernel:
Mar 5 04:29:58 Tower kernel: Pid: 30275, comm: dd Not tainted (2.6.32.9-unRAID # System name
Mar 5 04:29:58 Tower kernel: EIP: 0060:[<c1087e54>] EFLAGS: 00010203 CPU: 0
Mar 5 04:29:58 Tower kernel: EIP is at block_invalidatepage+0x24/0x97
Mar 5 04:29:58 Tower kernel: EAX: 00000000 EBX: 00000000 ECX: 00000002 EDX: cb9d9c80
Mar 5 04:29:58 Tower kernel: ESI: c4f54000 EDI: cb9d9c80 EBP: d1fbde74 ESP: d1fbde5c
Mar 5 04:29:58 Tower kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Mar 5 04:29:58 Tower kernel: Process dd (pid: 30275, ti=d1fbc000 task=c46fa940 task.ti=d1fbc000)
Mar 5 04:29:58 Tower kernel: Stack:
Mar 5 04:29:58 Tower kernel: 00000000 00000000 00000000 c1087e30 f7022b28 00000006 d1fbde80 c104f651
Mar 5 04:29:58 Tower kernel: <0> c4f54000 d1fbde90 c104fa21 c4f54000 00c1f149 d1fbdf08 c104faf1 01757bd9
Mar 5 04:29:58 Tower kernel: <0> 00000000 00000000 00000000 f7022b28 00000000 ffffffff d1fbdec8 00c1f149
Mar 5 04:29:58 Tower kernel: Call Trace:
Mar 5 04:29:58 Tower kernel: [<c1087e30>] ? block_invalidatepage+0x0/0x97
Mar 5 04:29:58 Tower kernel: [<c104f651>] ? do_invalidatepage+0x19/0x1c
Mar 5 04:29:58 Tower kernel: [<c104fa21>] ? truncate_inode_page+0x4a/0x84
Mar 5 04:29:58 Tower kernel: [<c104faf1>] ? truncate_inode_pages_range+0x96/0x238
Mar 5 04:29:58 Tower kernel: [<c104fc9f>] ? truncate_inode_pages+0xc/0x10
Mar 5 04:29:58 Tower kernel: [<c108a52d>] ? kill_bdev+0x2c/0x2f
Mar 5 04:29:58 Tower kernel: [<c108adea>] ? __blkdev_put+0x43/0xf7
Mar 5 04:29:58 Tower kernel: [<c108aea8>] ? blkdev_put+0xa/0xc
Mar 5 04:29:58 Tower kernel: [<c108b698>] ? blkdev_close+0x25/0x29
Mar 5 04:29:58 Tower kernel: [<c106d39c>] ? __fput+0xd9/0x17d
Mar 5 04:29:58 Tower kernel: [<c106d659>] ? fput+0x17/0x19
Mar 5 04:29:58 Tower kernel: [<c106b068>] ? filp_close+0x51/0x5b
Mar 5 04:29:58 Tower kernel: [<c106bfe5>] ? sys_close+0x5c/0x8e
Mar 5 04:29:58 Tower kernel: [<c1002935>] ? syscall_call+0x7/0xb
Mar 5 04:29:58 Tower kernel: Code: 46 44 5b 5e 5f 5d c3 55 89 e5 57 56 89 c6 53 83 ec 0c 89 55 e8 8b 00 a8 01 75 04 0f 0b eb fe f6 c4 08 74 72 8b 7e 0c 31 c0 89 fb <8b> 53 14 01 c2 89 55 f0 8b 53 04 39 45 e8 89 55 ec 77 3f e8 96
Mar 5 04:29:58 Tower kernel: EIP: [<c1087e54>] block_invalidatepage+0x24/0x97 SS:ESP 0068:d1fbde5c
Mar 5 04:29:58 Tower kernel: CR2: 0000000000000014
Mar 5 04:29:58 Tower kernel: ---[ end trace cbb405a0a413a409 ]---

syslog.txt
pantheis Posted March 5, 2011

A follow-up: one of the two 2TB preclears I was running has already failed with the same error. I'm still having the same issue on the clean install of unRAID, so I'm back to square one and out of ideas.

Mar 5 10:18:37 Tower kernel: BUG: unable to handle kernel NULL pointer dereference at 00000014
Mar 5 10:18:37 Tower kernel: IP: [<c1087e54>] block_invalidatepage+0x24/0x97
Mar 5 10:18:37 Tower kernel: *pdpt = 000000003130e001 *pde = 0000000000000000
Mar 5 10:18:37 Tower kernel: Oops: 0000 [#1] SMP
Mar 5 10:18:37 Tower kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/host1/target1:0:0/1:0:0:0/block/sda/stat
Mar 5 10:18:37 Tower kernel: Modules linked in: md_mod xor forcedeth sata_sil sata_nv amd74xx [last unloaded: xor]
Mar 5 10:18:37 Tower kernel:
Mar 5 10:18:37 Tower kernel: Pid: 21618, comm: dd Not tainted (2.6.32.9-unRAID # System name
Mar 5 10:18:37 Tower kernel: EIP: 0060:[<c1087e54>] EFLAGS: 00010207 CPU: 0
Mar 5 10:18:37 Tower kernel: EIP is at block_invalidatepage+0x24/0x97
Mar 5 10:18:37 Tower kernel: EAX: 00000000 EBX: 00000000 ECX: 00000002 EDX: e2e27840
Mar 5 10:18:37 Tower kernel: ESI: c4f56000 EDI: e2e27840 EBP: ee283e74 ESP: ee283e5c
Mar 5 10:18:37 Tower kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Mar 5 10:18:37 Tower kernel: Process dd (pid: 21618, ti=ee282000 task=f74c2940 task.ti=ee282000)
Mar 5 10:18:37 Tower kernel: Stack:
Mar 5 10:18:37 Tower kernel: 00000000 00000000 00000000 c1087e30 f7022b28 00000000 ee283e80 c104f651
Mar 5 10:18:37 Tower kernel: <0> c4f56000 ee283e90 c104fa21 c4f56000 0112b98a ee283f08 c104faf1 1d1c1115
Mar 5 10:18:37 Tower kernel: <0> 00000000 00000000 00000000 f7022b28 00000000 ffffffff ee283ec8 0112b98a
Mar 5 10:18:37 Tower kernel: Call Trace:
Mar 5 10:18:37 Tower kernel: [<c1087e30>] ? block_invalidatepage+0x0/0x97
Mar 5 10:18:37 Tower kernel: [<c104f651>] ? do_invalidatepage+0x19/0x1c
Mar 5 10:18:37 Tower kernel: [<c104fa21>] ? truncate_inode_page+0x4a/0x84
Mar 5 10:18:37 Tower kernel: [<c104faf1>] ? truncate_inode_pages_range+0x96/0x238
Mar 5 10:18:37 Tower kernel: [<c104fc9f>] ? truncate_inode_pages+0xc/0x10
Mar 5 10:18:37 Tower kernel: [<c108a52d>] ? kill_bdev+0x2c/0x2f
Mar 5 10:18:37 Tower kernel: [<c108adea>] ? __blkdev_put+0x43/0xf7
Mar 5 10:18:37 Tower kernel: [<c108aea8>] ? blkdev_put+0xa/0xc
Mar 5 10:18:37 Tower kernel: [<c108b698>] ? blkdev_close+0x25/0x29
Mar 5 10:18:37 Tower kernel: [<c106d39c>] ? __fput+0xd9/0x17d
Mar 5 10:18:37 Tower kernel: [<c106d659>] ? fput+0x17/0x19
Mar 5 10:18:37 Tower kernel: [<c106b068>] ? filp_close+0x51/0x5b
Mar 5 10:18:37 Tower kernel: [<c106bfe5>] ? sys_close+0x5c/0x8e
Mar 5 10:18:37 Tower kernel: [<c1002935>] ? syscall_call+0x7/0xb
Mar 5 10:18:37 Tower kernel: Code: 46 44 5b 5e 5f 5d c3 55 89 e5 57 56 89 c6 53 83 ec 0c 89 55 e8 8b 00 a8 01 75 04 0f 0b eb fe f6 c4 08 74 72 8b 7e 0c 31 c0 89 fb <8b> 53 14 01 c2 89 55 f0 8b 53 04 39 45 e8 89 55 ec 77 3f e8 96
Mar 5 10:18:37 Tower kernel: EIP: [<c1087e54>] block_invalidatepage+0x24/0x97 SS:ESP 0068:ee283e5c
Mar 5 10:18:37 Tower kernel: CR2: 0000000000000014
Mar 5 10:18:37 Tower kernel: ---[ end trace a8272eb19fd0120b ]---
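For anyone comparing oops reports like the two posted so far, here is a minimal Python sketch that reduces each report to a small "signature" (process name, faulting symbol, offset, and fault address) so two crashes can be checked for a match. It assumes 32-bit x86 syslog lines in the format shown in this thread; the function name `oops_signature` and the regular expressions are illustrative and not part of unRAID or the preclear script. A matching signature only shows that both crashes hit the same spot in the kernel; it does not by itself say whether hardware or software is at fault.

```python
import re

# "EIP is at <symbol>+<offset>/<size>" names the faulting instruction.
EIP_RE = re.compile(r"EIP is at (\w+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)")
# The "BUG:" line reports the address that could not be dereferenced.
BUG_RE = re.compile(r"NULL pointer dereference at ([0-9a-f]+)")
# "comm: <name>" is the process that was running when the oops occurred.
COMM_RE = re.compile(r"comm: (\S+)")


def oops_signature(log_text):
    """Return (process, symbol, offset, fault_address) for the first oops, or None."""
    bug = BUG_RE.search(log_text)
    eip = EIP_RE.search(log_text)
    comm = COMM_RE.search(log_text)
    if not (bug and eip and comm):
        return None
    return (comm.group(1), eip.group(1), eip.group(2), bug.group(1))


# Three lines excerpted from each of the two traces posted in this thread.
first_trace = """\
Mar 5 04:29:58 Tower kernel: BUG: unable to handle kernel NULL pointer dereference at 00000014
Mar 5 04:29:58 Tower kernel: Pid: 30275, comm: dd Not tainted (2.6.32.9-unRAID # System name
Mar 5 04:29:58 Tower kernel: EIP is at block_invalidatepage+0x24/0x97
"""
second_trace = """\
Mar 5 10:18:37 Tower kernel: BUG: unable to handle kernel NULL pointer dereference at 00000014
Mar 5 10:18:37 Tower kernel: Pid: 21618, comm: dd Not tainted (2.6.32.9-unRAID # System name
Mar 5 10:18:37 Tower kernel: EIP is at block_invalidatepage+0x24/0x97
"""

first_sig = oops_signature(first_trace)
second_sig = oops_signature(second_trace)
print(first_sig)
print("same crash site:", first_sig == second_sig)
```

Run against the two excerpts, both signatures come out as `dd` faulting in `block_invalidatepage` at offset `0x24` on address `00000014`, which is what the poster means by "the same error".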
pantheis Posted March 14, 2011

After pulling my hair out trying to figure this issue out, I gave up and ordered a replacement motherboard, CPU and RAM. I've been unable to determine a root cause, and nothing I have tried has resolved it, including returning to the exact conditions that had previously been working fine. Given that other people have been using this motherboard without issues, I can only assume that something in my hardware failed while I was transferring it to the new case.
dgaschk Posted March 14, 2011

Post your entire syslog. Zip it if needed.
pantheis Posted March 14, 2011

I already did. The first post I made has the entire syslog as a text attachment.
dgaschk Posted March 14, 2011

Have you checked and reseated all cable connections?
pantheis Posted March 15, 2011

I've checked and reseated all cables multiple times. I've tried two different power supplies, two different CPUs, reseated the memory, tried a different flash device, and tried the latest beta of unRAID 5. None of it helped. I even went back to the two HDs I had been using for a month with no issues, and they too were now unable to complete a preclear reliably.

The new motherboard, CPU and RAM will be going in tonight and I'm pretty confident that they will resolve the issue. I think the old hardware I had lying around was just barely working and I was lucky not to experience any issues during the first month of testing.
mikechy Posted March 15, 2011

Want to hear a funny story? Over the weekend I experienced a situation where unRAID became non-responsive. It required me to hard power cycle the box, forcing a parity check. Twice the parity check failed on me, hanging. I could telnet in and grab the syslog and had a similar error:

Mar 13 03:19:01 BIGHOSS kernel: BUG: unable to handle kernel NULL pointer dereference at 00000006
Mar 13 03:19:01 BIGHOSS kernel: IP: [] copy_data+0x22/0x142 [md_mod]
Mar 13 03:19:01 BIGHOSS kernel: *pdpt = 00000000375b5001 *pde = 0000000000000000
Mar 13 03:19:01 BIGHOSS kernel: Oops: 0000 [#1] SMP
Mar 13 03:19:01 BIGHOSS kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:08.0/host4/target4:0:0/4:0:0:0/block/sdd/stat
Mar 13 03:19:01 BIGHOSS kernel: Modules linked in: md_mod xor forcedeth sata_sil sata_nv amd74xx
Mar 13 03:19:01 BIGHOSS kernel:
Mar 13 03:19:01 BIGHOSS kernel: Pid: 1441, comm: unraidd Not tainted (2.6.32.9-unRAID #5) System name
Mar 13 03:19:01 BIGHOSS kernel: EIP: 0060:[] EFLAGS: 00010287 CPU: 0
Mar 13 03:19:01 BIGHOSS kernel: EIP is at copy_data+0x22/0x142 [md_mod]
Mar 13 03:19:01 BIGHOSS kernel: EAX: c374f000 EBX: 6c745090 ECX: 00000002 EDX: 00000002
Mar 13 03:19:01 BIGHOSS kernel: ESI: 00000000 EDI: c3752f50 EBP: c2891ed4 ESP: c2891e9c
Mar 13 03:19:01 BIGHOSS kernel: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Mar 13 03:19:01 BIGHOSS kernel: Process unraidd (pid: 1441, ti=c2890000 task=f756ba70 task.ti=c2890000)
Mar 13 03:19:01 BIGHOSS kernel: Stack:
Mar 13 03:19:01 BIGHOSS kernel: 00000000 00000000 00000002 00000001 c374f000 00000000 f8286c6c c2891f00
Mar 13 03:19:01 BIGHOSS kernel: <0> c2891ed8 f8284a0c 00000000 c3752ee0 c3752ee0 c3752f50 c2891f1c f829c8ff
Mar 13 03:19:01 BIGHOSS kernel: <0> 6c745090 00000000 00000001 00000004 00000004 c374d000 00000002 00000000
Mar 13 03:19:01 BIGHOSS kernel: Call Trace:
Mar 13 03:19:01 BIGHOSS kernel: [] ? xor_blocks+0x26/0x7e [xor]
Mar 13 03:19:01 BIGHOSS kernel: [] ? compute_parity+0x177/0x281 [md_mod]
Mar 13 03:19:01 BIGHOSS kernel: [] ? handle_stripe+0x861/0xbc9 [md_mod]
Mar 13 03:19:01 BIGHOSS kernel: [] ? unraidd+0x8f/0xb0 [md_mod]
Mar 13 03:19:01 BIGHOSS kernel: [] ? unraidd+0x0/0xb0 [md_mod]
Mar 13 03:19:01 BIGHOSS kernel: [] ? kthread+0x61/0x68
Mar 13 03:19:01 BIGHOSS kernel: [] ? kthread+0x0/0x68
Mar 13 03:19:01 BIGHOSS kernel: [] ? kernel_thread_helper+0x7/0x1a
Mar 13 03:19:01 BIGHOSS kernel: Code: ff fe 43 40 fb 5b 5e 5d c3 55 89 e5 57 56 53 83 ec 2c 8b 75 0c 89 45 d4 89 c8 8b 5d 08 89 55 d0 e8 eb a6 db c8 8b 4d d0 89 45 d8 <8b> 51 04 8b 01 39 f2 72 0d 77 04 39 d8 72 07 29 d8 c1 e0 09 eb
Mar 13 03:19:01 BIGHOSS kernel: EIP: [] copy_data+0x22/0x142 [md_mod] SS:ESP 0068:c2891e9c
Mar 13 03:19:01 BIGHOSS kernel: CR2: 0000000000000006
Mar 13 03:19:01 BIGHOSS kernel: ---[ end trace bbc193420da81a25 ]---

It was determined that it was a bug in the 5.0 b2 release I was running, so I fell back to 4.7 and am doing a parity check right now. So I'm very interested to see if it was a hardware issue for you...
pantheis Posted March 15, 2011

I'm pretty confident that it will end up being hardware related, since the system worked fine for about a month and started failing after I moved everything to the new case. I even attempted to skip the failing preclear steps and just format and start the array through unRAID. While that did finish, parity kept having issues, and unRAID itself logged much the same syslog error during normal use as the preclear script had, crashing the server repeatedly.

Given that I had zero issues with this hardware until I moved it to the new case, I'm going to assume I zapped it somehow or that it was already on the verge of failing. It is rather old hardware and was heavily used as a gaming system back in the day.

The replacement hardware has arrived and I'll be putting in the replacement parts late tonight after work. Based on my previous experience, the preclear would take anywhere from an hour to 6+ hours before it failed. I could usually get it to fail faster by attempting to preclear all three drives at once, so I'm going to give that a go and I'll post an update to the thread as soon as I know anything.
pantheis Posted March 15, 2011

New hardware is installed and performing wonderfully. It's amazing how much faster a modern CPU is compared to the old Athlon64 that I was using.

I have started a preclear using the latest script (1.8 as of this writing) on all three of my hard disks at the same time, using three separate SSH sessions. At this rate the 100GB drive will probably finish within a couple of hours. The two 2TB drives look like they will take much longer. Read rates during preclear are over 100MB/sec on the 2TB drives and around 60-70MB/sec on the 100GB drive. So far, no errors.

As an aside, this new motherboard and CPU make my system much quieter. That old Athlon64 heatsink+fan was loud!
mikechy Posted March 15, 2011

Was it really just as easy as moving the drives over to the new hardware and rebooting? I am running off of an Asus A8N-SLI Deluxe, which is a great mobo (it has 8 SATA II ports on the board) but is like 5-6 years old. I'm curious: what new equipment did you get?
pantheis Posted March 15, 2011

I was also running on an Asus A8N-SLI Deluxe for a month with a 100GB SATA and a 120GB IDE hard drive without any issues. That test worked great, so I pulled the trigger on an Antec 900 case and two 2TB WD-EARS hard drives. After installing everything into the new case, when I tried to preclear the new drives, or even either of the old test drives, I got the previously discussed kernel errors. After trying pretty much everything, I decided to buy a new motherboard, CPU and RAM, as I figured it was probably the motherboard or RAM I was using that was causing the issues.

I picked up the following:

JetWay JHZ03-GT-LF AM3 AMD 880G HDMI Micro ATX AMD Motherboard
GeIL Value 2GB 240-Pin DDR3 SDRAM DDR3 1333 (PC3 10666) Desktop Memory Model GV32GB1333C9SC
AMD Sempron 140 Sargas 2.7GHz Socket AM3 45W Single-Core Processor SDX140HBGQBOX

Total cost with shipping was $133.83.

I picked that board based on the recommendation of another poster here who has been running a full 22 drive array off of it without issues for months. I'm not too worried about it not having fully solid caps (it says it does, but it really doesn't), as the Antec 900 case I have is keeping things very cool and heat is the main concern with semi-solid caps. It is also noticeably faster and quieter than the A8N-SLI Deluxe I was using. I believe that's down to the new CPU cooling fan that came with the Sempron 140, along with the passive chipset cooling in place of an active fan.

Last night, I started preclearing all three drives (the 100GB and both 2TB drives) at the same time. The 100GB drive finished overnight in about 4 hours. Both 2TB drives were about 25% through writing zeroes when I left for work after about 8 hours. None of the three drives spiked above 27C, and they were usually hovering around 25C. The 2TB drives are reading and writing around 100-110MB/sec using 4K sector alignment and no jumpers. The 100GB drive was doing about 55-65MB/sec.

To answer your question about moving things to a new board: what you want to do is follow the directions posted elsewhere on the site. I believe they say to take a screenshot or write down which drives (by drive serial number) you have assigned to which slots in your array, power down everything, install the new hardware, power everything back up, and then check the devices screen to make sure each drive is where it is supposed to be. I would double-check those directions though. I didn't have anything on my array, so I didn't have to worry about losing anything when swapping to the new hardware.

I also noticed your problem seemed to go away once you downgraded to unRAID 4.7. That's good to hear!
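The "write down which serial number is in which slot, then check again after the swap" step can be sketched in a few lines of Python. This is only an illustration of the comparison, not an unRAID tool: the slot names, the serial numbers, and the helper `assignment_changes` are all made up for the example, and the two dictionaries would have to be filled in by hand from the devices screen before and after the hardware change.

```python
def assignment_changes(before, after):
    """Return {slot: (old_serial, new_serial)} for every slot whose drive differs."""
    changes = {}
    for slot in sorted(set(before) | set(after)):
        old = before.get(slot)  # None if the slot was empty before the swap
        new = after.get(slot)   # None if the slot is empty after the swap
        if old != new:
            changes[slot] = (old, new)
    return changes


# Hypothetical assignments noted down before powering off.
before = {
    "parity": "EXAMPLE-SERIAL-0001",
    "disk1": "EXAMPLE-SERIAL-0002",
    "disk2": "EXAMPLE-SERIAL-0003",
}
# Hypothetical assignments read back after the new board is installed;
# here disk1 and disk2 have swapped places.
after = {
    "parity": "EXAMPLE-SERIAL-0001",
    "disk1": "EXAMPLE-SERIAL-0003",
    "disk2": "EXAMPLE-SERIAL-0002",
}

changes = assignment_changes(before, after)
if changes:
    for slot, (old, new) in changes.items():
        print(f"{slot}: was {old}, now {new}")
else:
    print("all drives are in their original slots")
```

An empty result means every slot holds the same serial as before; anything else should be corrected before starting the array, since the parity slot in particular must not end up pointing at a data drive.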
mikechy Posted March 15, 2011

That's a nice inexpensive setup. I know it's just a matter of time until I have to replace the A8N. I already have a 40mm fan glued on top of the burned-out one attached to the mobo.
pantheis Posted March 15, 2011

$100 for the case, $160 for the HDs and $133 for the motherboard, CPU and RAM. Total cost: $393.

The thing about this build is that it starts inexpensive, but with what I have right now (plus an unRAID license), I can expand to 6 drives without buying anything except the hard disks. If I want to go further, the case I'm using will support up to 15 drives in 5-in-3s. I like the future options.
pantheis Posted March 16, 2011

As of 11:00AM Pacific, both 2TB drives are still running the preclear script and are about 75% through the final post-read, about 30 hours in so far. Read rates are down to about 75MB/sec, but that's not unexpected.
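As a rough sanity check on those numbers, the time for a single full pass over a drive is just capacity divided by transfer rate. The sketch below assumes a 2TB drive holds 2×10^12 bytes, that "MB" means 10^6 bytes, and that the rate stays constant for the whole pass; none of that comes from the preclear script itself, and the helper name `full_pass_hours` is made up for the example.

```python
def full_pass_hours(capacity_bytes, rate_bytes_per_sec):
    """Hours needed to read or write the whole drive once at a constant rate."""
    return capacity_bytes / rate_bytes_per_sec / 3600.0


TWO_TB = 2 * 10**12  # assumed decimal terabytes, as drive makers count them

for rate_mb in (100, 75):
    hours = full_pass_hours(TWO_TB, rate_mb * 10**6)
    print(f"{rate_mb} MB/sec -> {hours:.1f} hours per full pass")
# prints:
# 100 MB/sec -> 5.6 hours per full pass
# 75 MB/sec -> 7.4 hours per full pass
```

The run reported in this thread took noticeably longer than three such passes would suggest (about 30 hours with the last phase still at 75%), so the figure is best read as a floor rather than a prediction.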
pantheis Posted March 18, 2011

Preclear on all three drives at the same time completed successfully. As of 5:47PM Pacific, the array is 71% through the initial parity calculation.

I'm now having some super strange issues with CrashPlan not being able to run, but literally everything else on the server is functioning wonderfully. I'm going to call this issue resolved: faulty hardware.