[SOLVED] Kernel errors and random reboots



Hi there,

 

I recently built my unRaid server and decided to go with unRaid 5 beta12a. But I've been getting lots of kernel panics, and the server becomes unresponsive.

 

Where should I start looking? Is this related to the beta version, or is it a hardware issue?

 

I'm using an Asus P5B motherboard with an Adaptec 1430SA controller card, 2x 2TB hard drives, and 3x 1TB drives (plus another 4x 1TB drives I still need to pre-clear and add).

 

Can I downgrade to 4.7 without losing my existing data?

 

Below are the messages from syslog:

 

Oct 10 00:13:53 pooh kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000f

Oct 10 00:13:53 pooh kernel: IP: [<c10ce949>] prepare_error_buf+0x57/0x3e5

Oct 10 00:13:53 pooh kernel: *pdpt = 0000000006e51001 *pde = 0000000000000000

Oct 10 00:13:53 pooh kernel: Oops: 0000 [#1] SMP

Oct 10 00:13:53 pooh kernel: Modules linked in: md_mod ntfs xor ide_gd_mod pata_jmicron asus_atk0110 r8168 hwmon sata_mv i2c_i801 i2c_core jmicron ata_piix ahci libahci [last unloaded: md_mod]

Oct 10 00:13:53 pooh kernel:

Oct 10 00:13:53 pooh kernel: Pid: 14317, comm: shfs Not tainted 3.0.3-unRAID #7 System manufacturer System Product Name/P5B

Oct 10 00:13:53 pooh kernel: EIP: 0060:[<c10ce949>] EFLAGS: 00010286 CPU: 1

Oct 10 00:13:53 pooh kernel: EIP is at prepare_error_buf+0x57/0x3e5

Oct 10 00:13:53 pooh kernel: EAX: c14ca08a EBX: c139f34b ECX: 00000000 EDX: c14ca08a

Oct 10 00:13:53 pooh kernel: ESI: ffffffff EDI: c1767e14 EBP: c1767de4 ESP: c1767d7c

Oct 10 00:13:53 pooh kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

Oct 10 00:13:53 pooh kernel: Process shfs (pid: 14317, ti=c1766000 task=f76a3960 task.ti=c1766000)

Oct 10 00:13:53 pooh kernel: Stack:

Oct 10 00:13:53 pooh kernel:  c1a5b000 c1767db8 c1767e80 c1767e3c c1767e80 c1767e00 c10d4004 00000001

Oct 10 00:13:53 pooh kernel:  00000000 00001000 f76b0000 c10ce16d c14ca48a f0ad2f00 c14ca08a 00000000

Oct 10 00:13:53 pooh kernel:  00000026 c1767e28 c1767e28 c10d89bf 00000000 00000000 00000000 f76b0000

Oct 10 00:13:53 pooh kernel: Call Trace:

Oct 10 00:13:53 pooh kernel:  [<c10d4004>] ? search_for_position_by_key+0x32/0x24b

Oct 10 00:13:53 pooh kernel:  [<c10ce16d>] ? add_save_link+0x15f/0x1a6

Oct 10 00:13:53 pooh kernel:  [<c10d89bf>] ? do_journal_end+0x908/0x92a

Oct 10 00:13:53 pooh kernel:  [<c10cf4c9>] __reiserfs_error+0x1a/0xa9

Oct 10 00:13:53 pooh kernel:  [<c10d51b1>] reiserfs_do_truncate+0x15c/0x479

Oct 10 00:13:53 pooh kernel:  [<c10dc75f>] ? reiserfs_for_each_xattr+0x6e/0x1fe

Oct 10 00:13:53 pooh kernel:  [<c10d54fc>] reiserfs_delete_object+0x2e/0x62

Oct 10 00:13:53 pooh kernel:  [<c10c561a>] reiserfs_evict_inode+0x7c/0xd5

Oct 10 00:13:53 pooh kernel:  [<c109022e>] evict+0x59/0xec

Oct 10 00:13:53 pooh kernel:  [<c1090548>] iput_final+0xea/0xef

Oct 10 00:13:53 pooh kernel:  [<c1090577>] iput+0x2a/0x2d

Oct 10 00:13:53 pooh kernel:  [<c10898b4>] do_unlinkat+0xbe/0x108

Oct 10 00:13:53 pooh kernel:  [<c10881e4>] ? path_lookupat+0x16f/0x4ba

Oct 10 00:13:53 pooh kernel:  [<c108990e>] sys_unlink+0x10/0x12

Oct 10 00:13:53 pooh kernel:  [<c130ab4d>] syscall_call+0x7/0xb

Oct 10 00:13:53 pooh kernel:  [<c1300000>] ? quirk_usb_disable_ehci+0x84/0x129

Oct 10 00:13:53 pooh kernel: Code: 45 d0 6c a0 4c c1 89 45 9c e9 8c 02 00 00 8b 16 8d 5e 04 8b 45 d0 89 de e8 c6 f8 ff ff e9 64 02 00 00 8d 7e 04 8b 36 85 f6 74 66 <8a> 46 10 bb 72 c1 3c c1 84 c0 74 20 3c 03 0f 84 31 03 00 00 3c

Oct 10 00:13:53 pooh kernel: EIP: [<c10ce949>] prepare_error_buf+0x57/0x3e5 SS:ESP 0068:c1767d7c

Oct 10 00:13:53 pooh kernel: CR2: 000000000000000f

Oct 10 00:13:53 pooh kernel: ---[ end trace ffe0c3ead5183a95 ]---

Oct 10 00:13:53 pooh kernel: ------------[ cut here ]------------

Oct 10 00:13:53 pooh kernel: WARNING: at kernel/exit.c:909 do_exit+0x2c/0x274()

Oct 10 00:13:53 pooh kernel: Hardware name: System Product Name

Oct 10 00:13:53 pooh kernel: Modules linked in: md_mod ntfs xor ide_gd_mod pata_jmicron asus_atk0110 r8168 hwmon sata_mv i2c_i801 i2c_core jmicron ata_piix ahci libahci [last unloaded: md_mod]

Oct 10 00:13:53 pooh kernel: Pid: 14317, comm: shfs Tainted: G      D     3.0.3-unRAID #7

Oct 10 00:13:53 pooh kernel: Call Trace:

Oct 10 00:13:53 pooh kernel:  [<c10288ac>] warn_slowpath_common+0x65/0x7a

Oct 10 00:13:53 pooh kernel:  [<c102b724>] ? do_exit+0x2c/0x274

Oct 10 00:13:53 pooh kernel:  [<c10288d0>] warn_slowpath_null+0xf/0x13

Oct 10 00:13:53 pooh kernel:  [<c102b724>] do_exit+0x2c/0x274

Oct 10 00:13:53 pooh kernel:  [<c10048b5>] oops_end+0x75/0x7c

Oct 10 00:13:53 pooh kernel:  [<c101b0c1>] no_context+0xac/0xb6

Oct 10 00:13:53 pooh kernel:  [<c101b1b3>] __bad_area_nosemaphore+0xe8/0xf0

Oct 10 00:13:53 pooh kernel:  [<c101b36a>] ? mm_fault_error+0x129/0x129

Oct 10 00:13:53 pooh kernel:  [<c101b200>] bad_area+0x35/0x3b

Oct 10 00:13:53 pooh kernel:  [<c101b516>] do_page_fault+0x1ac/0x332

Oct 10 00:13:53 pooh kernel:  [<c101b36a>] ? mm_fault_error+0x129/0x129

Oct 10 00:13:53 pooh kernel:  [<c130b14a>] error_code+0x5a/0x60

Oct 10 00:13:53 pooh kernel:  [<c101b36a>] ? mm_fault_error+0x129/0x129

Oct 10 00:13:53 pooh kernel:  [<c10ce949>] ? prepare_error_buf+0x57/0x3e5

Oct 10 00:13:53 pooh kernel:  [<c10d4004>] ? search_for_position_by_key+0x32/0x24b

Oct 10 00:13:53 pooh kernel:  [<c10ce16d>] ? add_save_link+0x15f/0x1a6

Oct 10 00:13:53 pooh kernel:  [<c10d89bf>] ? do_journal_end+0x908/0x92a

Oct 10 00:13:53 pooh kernel:  [<c10cf4c9>] __reiserfs_error+0x1a/0xa9

Oct 10 00:13:53 pooh kernel:  [<c10d51b1>] reiserfs_do_truncate+0x15c/0x479

Oct 10 00:13:53 pooh kernel:  [<c10dc75f>] ? reiserfs_for_each_xattr+0x6e/0x1fe

Oct 10 00:13:53 pooh kernel:  [<c10d54fc>] reiserfs_delete_object+0x2e/0x62

Oct 10 00:13:53 pooh kernel:  [<c10c561a>] reiserfs_evict_inode+0x7c/0xd5

Oct 10 00:13:53 pooh kernel:  [<c109022e>] evict+0x59/0xec

Oct 10 00:13:53 pooh kernel:  [<c1090548>] iput_final+0xea/0xef

Oct 10 00:13:53 pooh kernel:  [<c1090577>] iput+0x2a/0x2d

Oct 10 00:13:53 pooh kernel:  [<c10898b4>] do_unlinkat+0xbe/0x108

Oct 10 00:13:53 pooh kernel:  [<c10881e4>] ? path_lookupat+0x16f/0x4ba

Oct 10 00:13:53 pooh kernel:  [<c108990e>] sys_unlink+0x10/0x12

Oct 10 00:13:53 pooh kernel:  [<c130ab4d>] syscall_call+0x7/0xb

Oct 10 00:13:53 pooh kernel:  [<c1300000>] ? quirk_usb_disable_ehci+0x84/0x129

Oct 10 00:13:53 pooh kernel: ---[ end trace ffe0c3ead5183a96 ]---
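
For anyone reading a trace like the one above: frames marked with `?` are addresses the kernel found on the stack but could not confirm as part of the call chain, so the unmarked frames are the ones to read first. Below is a minimal Python sketch that pulls the faulting function and the unmarked frames out of syslog text. The function name `summarize_oops` and the `SAMPLE` excerpt (a few lines copied from the log above) are illustrative only, and the regular expression assumes the 32-bit x86 oops format shown in this thread.

```python
import re

# Matches call-trace entries such as:
#   "... kernel:  [<c10cf4c9>] __reiserfs_error+0x1a/0xa9"
#   "... kernel:  [<c10d89bf>] ? do_journal_end+0x908/0x92a"
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_oops(lines):
    """Return (faulting_function, unmarked_frames) for the first oops found."""
    faulting = None
    frames = []
    in_trace = False
    for line in lines:
        if not line.strip():
            continue  # the pasted log has blank lines between entries
        if "end trace" in line:
            break  # stop at the end of the first trace
        if "EIP is at" in line:
            # e.g. "EIP is at prepare_error_buf+0x57/0x3e5"
            faulting = line.split("EIP is at", 1)[1].strip().split("+", 1)[0]
        elif "Call Trace:" in line:
            in_trace = True
        elif in_trace:
            match = FRAME_RE.search(line)
            if match is None:
                in_trace = False  # first non-frame line ends the trace
            elif match.group(2) is None:
                frames.append(match.group(3))  # keep only frames without "?"
    return faulting, frames


# Excerpt of the syslog lines posted above (not the full trace).
SAMPLE = """\
Oct 10 00:13:53 pooh kernel: EIP is at prepare_error_buf+0x57/0x3e5
Oct 10 00:13:53 pooh kernel: Call Trace:
Oct 10 00:13:53 pooh kernel:  [<c10d89bf>] ? do_journal_end+0x908/0x92a
Oct 10 00:13:53 pooh kernel:  [<c10cf4c9>] __reiserfs_error+0x1a/0xa9
Oct 10 00:13:53 pooh kernel:  [<c10d51b1>] reiserfs_do_truncate+0x15c/0x479
Oct 10 00:13:53 pooh kernel:  [<c10d54fc>] reiserfs_delete_object+0x2e/0x62
Oct 10 00:13:53 pooh kernel:  [<c108990e>] sys_unlink+0x10/0x12
Oct 10 00:13:53 pooh kernel: ---[ end trace ffe0c3ead5183a95 ]---
""".splitlines()

faulting, frames = summarize_oops(SAMPLE)
print("faulting function:", faulting)
print("call chain:", " <- ".join(frames))
```

Read this way, the trace shows a file delete (`sys_unlink`) going into reiserfs, which was in the middle of reporting an error of its own (`__reiserfs_error`) when the NULL pointer dereference happened. That points at the filesystem on one of the data disks rather than at the beta release alone.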

 


Cool, thank you for that, I'll give it a try this evening.

 

I posted the initial post in a bit of a hurry so it was a bit light on the detail.

 

I had an external USB drive plugged into the unRaid server, created a temp directory called /backup and then mounted the ext3 filesystem that was on the USB drive there.

 

I then used the Telnet shell to copy files from /backup to /mnt/user/Backup (a user share I had created).

 

It ran for a while (maybe an hour), and then the unRaid server just rebooted by itself and got stuck in the BIOS. I had to power cycle it.

 

I then ran the memory check on the unRaid USB flash drive, and it found RAM errors. The system had 2x 2GB and 2x 1GB RAM modules. I removed the 2x 1GB (leaving the system with 4GB) and re-ran the memory test. I let it complete to 100% and it found no errors.

 

I then unplugged the external USB drive, booted up the unRaid server, and let it finish its parity check. It found and corrected 15 errors according to the web interface.

 

Then I tried to delete those files in the Backup user share from a Windows 7 system, and halfway through deleting them I got that kernel oops. I rebooted the system and tried to delete them again, and the same thing happened.

 

So it's likely that there is some filesystem corruption from the first time the machine crashed, possibly due to faulty RAM.


This was the end of the output of reiserfsck:

 

 

bad_indirect_item: block 284819480: The item (1871 1937 0x1 IND (1), len 1452, location 1128 entry count 0, fsck need 0, format new) has the bad pointer (362) to the block (370795981), which is in tree already
bad_stat_data: The objectid (1938) is shared by at least two files. Can be fixed with --rebuild-tree only.
bad_path: The left delimiting key [8208 8359 0x35001 IND (1)] of the node (284819480) must be equal to the first element's key [1871 1936 0x1 IND (1)] within the node.
finished                  
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Fatal corruptions were found, Semantic pass skipped
1 found corruptions can be fixed only when running with --rebuild-tree

 

Parity check is still running from this morning, so I'll let that finish before trying to fix up the reiserfs problems.
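
The key lines in a report like this are the ones that say a problem can only be fixed with `--rebuild-tree`. As a rough illustration, here is a small Python sketch that picks those lines out of saved reiserfsck output. The function name `rebuild_tree_reasons` is made up for this example, and the matched phrases are taken only from the output quoted above, so other reiserfsck versions may word things differently.

```python
def rebuild_tree_reasons(report):
    """Return the report lines that say --rebuild-tree is required."""
    reasons = []
    for line in report.splitlines():
        lowered = line.lower()
        # Phrases below are copied from the reiserfsck output in this thread.
        fatal = "fatal corruptions were found" in lowered
        rebuild_only = "--rebuild-tree" in lowered and "only" in lowered
        if fatal or rebuild_only:
            reasons.append(line.strip())
    return reasons


# Three lines copied from the reiserfsck output posted above.
REPORT = """\
bad_stat_data: The objectid (1938) is shared by at least two files. Can be fixed with --rebuild-tree only.
Fatal corruptions were found, Semantic pass skipped
1 found corruptions can be fixed only when running with --rebuild-tree
"""

for reason in rebuild_tree_reasons(REPORT):
    print("needs --rebuild-tree:", reason)
```

An empty list would mean none of these phrases appeared. Since `--rebuild-tree` rewrites the filesystem tree, it is worth having a backup of anything important on that disk before running it.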


I know I already marked this as solved, but here's a follow-up in case someone else has a similar issue.

 

After repairing the filesystem I no longer got the 'kernel oops' problems, but my unRaid system kept rebooting after a few hours of use. I think one of these random reboots is what caused the filesystem corruption in the first place.

 

Having experienced similar issues before with a randomly rebooting system, I suspected the power supply. It was a RaidMax 630W modular power supply. It had served me well, but it was over 7 years old.

 

I replaced it with a Corsair CX600 and the system has been rock solid since. The system currently has 7 drives in it, but is designed to take up to 12 LP/green drives.
