Experimental fix/test for mpt2sas controllers (LSI)



After reading this thread I thought I'd give this release a try, but things did not go well.

 

2 x M1015's flashed to IT mode

 

Jul 31 21:41:01 Tower kernel: Pid: 912, comm: scsi_eh_10 Tainted: G          O 3.4.4-unRAID #2 Gigabyte Technology Co., Ltd. GA-890GPA-UD3H/GA-890GPA-UD3H

Jul 31 21:41:01 Tower kernel: EIP: 0060:[<c103ad1e>] EFLAGS: 00010093 CPU: 0

Jul 31 21:41:01 Tower kernel: EIP is at __wake_up_common+0x17/0x5c

Jul 31 21:41:01 Tower kernel: EAX: f2fe0564 EBX: fffffff4 ECX: 00000001 EDX: 00000003

Jul 31 21:41:01 Tower kernel: ESI: 00000246 EDI: 00000003 EBP: f2f75eb4 ESP: f2f75e9c

Jul 31 21:41:01 Tower kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068

Jul 31 21:41:01 Tower kernel: CR0: 8005003b CR2: 00000000 CR3: 361a5000 CR4: 000007f0

Jul 31 21:41:01 Tower kernel: DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000

Jul 31 21:41:01 Tower kernel: DR6: ffff0ff0 DR7: 00000400

Jul 31 21:41:01 Tower kernel: Process scsi_eh_10 (pid: 912, ti=f2f74000 task=f3541e60 task.ti=f2f74000)

Jul 31 21:41:01 Tower kernel: Stack:

Jul 31 21:41:01 Tower kernel:  00000246 00000282 00000001 f2fe0560 00000246 f2fe055c f2f75ed0 c103b2ca

Jul 31 21:41:01 Tower kernel:  00000000 00000000 f2fe03b4 00000002 00000001 f2f75ee0 f853c379 f2fe03b4

Jul 31 21:41:01 Tower kernel:  00000000 f2f75efc f853de89 f2fe04c0 f2fe04d4 f2fe03b4 00002002 f2fe03b4

Jul 31 21:41:01 Tower kernel: Call Trace:

Jul 31 21:41:01 Tower kernel:  [<c103b2ca>] complete+0x2b/0x3e

Jul 31 21:41:01 Tower kernel:  [<f853c379>] _base_reset_handler+0x11a/0x178 [mpt2sas]

Jul 31 21:41:01 Tower kernel:  [<f853de89>] mpt2sas_base_hard_reset_handler+0x12d/0x1e9 [mpt2sas]

Jul 31 21:41:01 Tower kernel:  [<f8546f71>] mpt2sas_scsih_issue_tm+0xd8/0x445 [mpt2sas]

Jul 31 21:41:01 Tower kernel:  [<f85428a3>] ? _scsih_tm_display_info+0x13f/0x147 [mpt2sas]

Jul 31 21:41:01 Tower kernel:  [<f85476b0>] _scsih_abort+0x120/0x15f [mpt2sas]

Jul 31 21:41:01 Tower kernel:  [<c122f125>] scsi_unjam_host+0xab/0x14e

Jul 31 21:41:01 Tower kernel:  [<c122f252>] scsi_error_handler+0x8a/0xdc

Jul 31 21:41:01 Tower kernel:  [<c122f1c8>] ? scsi_unjam_host+0x14e/0x14e

Jul 31 21:41:01 Tower kernel:  [<c1035ba4>] kthread+0x67/0x6c

Jul 31 21:41:01 Tower kernel:  [<c1035b3d>] ? kthread_freezable_should_stop+0x49/0x49

Jul 31 21:41:01 Tower kernel:  [<c131fd76>] kernel_thread_helper+0x6/0xd

Jul 31 21:41:01 Tower kernel: Code: db 01 42 30 11 5a 34 eb 08 31 db 01 42 28 11 5a 2c 5b 5d c3 55 89 e5 57 89 d7 56 53 83 ec 0c 89 4d f0 8b 58 04 83 c0 04 83 eb 0c <8b> 73 0c 89 45 e8 83 ee 0c eb 2a 8b 03 89 fa ff 75 0c 8b 4d 08

Jul 31 21:41:01 Tower kernel: EIP: [<c103ad1e>] __wake_up_common+0x17/0x5c SS:ESP 0068:f2f75e9c

Jul 31 21:41:01 Tower kernel: CR2: 0000000000000000

Jul 31 21:41:01 Tower kernel: ---[ end trace 17137aaa2e86c242 ]---

 

syslog-2012-07-31.txt

Link to comment

After reading this thread I thought I'd give this release a try, but things did not go well.

2 x M1015's flashed to IT mode

 

I suspect you have very old firmware (P11):

 

Jul 31 21:37:21 Tower kernel: mpt2sas0: LSISAS2008: FWVersion(11.00.00.00), ChipRevision(0x03), BiosVersion(07.21.00.00)

Jul 31 21:37:21 Tower kernel: mpt2sas1: LSISAS2008: FWVersion(11.00.00.00), ChipRevision(0x03), BiosVersion(07.21.00.00)

 

Try the latest P14 firmware and report back.

There are a lot of SATA HD-specific fixes in P14, so it is probably a good idea for everyone to upgrade to it.

 

It is also probably a good idea for everyone posting here to report their card's firmware and BIOS revisions, so that Tom can use this info or forward it to the Linux guys.
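For anyone who wants to pull those revisions out of a syslog without scrolling through it, here is a minimal Python sketch. It assumes the mpt2sas banner line has the same layout as the two lines quoted above; the names `BANNER` and `card_versions` are made up for this example and are not part of any unRAID tool.

```python
import re

# Matches the mpt2sas banner line as it appears in the syslog excerpts in this thread.
# Assumption: other driver versions print the same field layout.
BANNER = re.compile(
    r"(?P<dev>mpt2sas\d+): (?P<chip>\w+): "
    r"FWVersion\((?P<fw>[\d.]+)\), "
    r"ChipRevision\((?P<rev>0x[0-9a-fA-F]+)\), "
    r"BiosVersion\((?P<bios>[\d.]+)\)"
)


def card_versions(syslog_text):
    """Return one dict per mpt2sas banner line found in the text."""
    return [m.groupdict() for m in BANNER.finditer(syslog_text)]


sample = (
    "Jul 31 21:37:21 Tower kernel: mpt2sas0: LSISAS2008: FWVersion(11.00.00.00), "
    "ChipRevision(0x03), BiosVersion(07.21.00.00)\n"
    "Jul 31 21:37:21 Tower kernel: mpt2sas1: LSISAS2008: FWVersion(11.00.00.00), "
    "ChipRevision(0x03), BiosVersion(07.21.00.00)\n"
)

for card in card_versions(sample):
    print(card["dev"], "firmware", card["fw"], "BIOS", card["bios"])
```

To use it on a real log, read the syslog file into a string and pass that to `card_versions` instead of `sample`.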

Link to comment

Just a quick update. Upgraded both cards to P14 and copied a couple of large files to the array. All seems good at this point, and a bit faster than the previous RCs.

 

Aug  1 19:37:32 Tower kernel: mpt2sas0: LSISAS2008: FWVersion(14.00.00.00), ChipRevision(0x03), BiosVersion(07.27.00.00) (Drive related)

 

I can post a syslog if anyone wants it.

Link to comment

This is my syslog entry for the M1015 card

mpt2sas0: LSISAS2008: FWVersion(14.00.00.00), ChipRevision(0x03), BiosVersion(00.00.00.00) (Drive related)

Should I be concerned?

 

Maybe.

 

I haven't upgraded to RC 6 yet, will that change the bios version?

 

No.

 

A little while ago, someone was suggesting not flashing the LSI card bios (for faster boot?).  The only reason I can think of for the Bios Version not being set is that you've omitted the bios from the flash.

Link to comment

version: 5.0-rc6-r8168-test2

Supermicro X7SPA-H-D525

Supermicro AOC-SASLP-MV8 firmware version .21

 

I'm having problems adding disks attached to the SAS controller card.

I have 6 disks currently attached to the motherboard, everything is working fine on RC5.

 

Tonight I added the AOC-SASLP-MV8 and 1 WD 3.0TB drive plugged into port 0 on the breakout cable.

I boot into unraid just fine with RC5 and RC6.

I stop the array, select disk 6 drop down and find the disk. It shows up with a blue dot and shows temperature etc.

I then check "Yes I want to do this" and hit the start button.

It says "Spinning up all drives...clearing disk 6..."

 

Now the webgui is locked up and I cannot refresh it.

It doesn't show me any progress of the clearing of the disk.

The webgui seems locked up. How long should I wait for the webgui to show up again?

 

Link to comment
The webgui seems locked up. How long should I wait for the webgui to show up again?

 

I've not experienced this, but I understand that it can stay 'hung' for quite some time.

 

It is much better to use Joe's preclear_disk script on a disk before you introduce it to the array.

Link to comment

version: 5.0-rc6-r8168-test2

Supermicro X7SPA-H-D525

The webgui seems locked up. How long should I wait for the webgui to show up again?

I've tested this version briefly with the same motherboard but with a SAS2LP controller, and I also could not get the webgui to show. Also, all user shares were gone; only disk shares remained.

Link to comment

This is my syslog entry for the M1015 card

mpt2sas0: LSISAS2008: FWVersion(14.00.00.00), ChipRevision(0x03), BiosVersion(00.00.00.00) (Drive related)

Should I be concerned? I haven't upgraded to RC 6 yet, will that change the bios version?

 

You have the latest firmware, but the person you bought the card from did not flash any BIOS. This is acceptable for IT mode and it will make your computer boot 20-30s faster. Other than that, there is probably no harmful effect.

Link to comment

version: 5.0-rc6-r8168-test2

Supermicro X7SPA-H-D525

Supermicro AOC-SASLP-MV8 firmware version .21

 

I'm having problems adding disks attached to the SAS controller card.

I have 6 disks currently attached to the motherboard, everything is working fine on RC5.

 

Tonight I added the AOC-SASLP-MV8 and 1 WD 3.0TB drive plugged into port 0 on the breakout cable.

I boot into unraid just fine with RC5 and RC6.

I stop the array, select disk 6 drop down and find the disk. It shows up with a blue dot and shows temperature etc.

I then check "Yes I want to do this" and hit the start button.

It says "Spinning up all drives...clearing disk 6..."

 

Now the webgui is locked up and I cannot refresh it.

It doesn't show me any progress of the clearing of the disk.

The webgui seems locked up. How long should I wait for the webgui to show up again?

 

AOC-SASLP-MV8 is not an LSI card. However, this may become the official release, and it probably includes all the fixes from RC6 (but there is no official change history to be sure).

 

You also have a relatively low-power CPU, and with the 3TB disk attached to the SM controller, clearing may take 15-18 hours, assuming you do not hit any bugs along the way. Using Joe L.'s preclear script will test the disk better and avoid the long wait you are observing right now.

 

Link to comment

The webgui seems locked up. How long should I wait for the webgui to show up again?

Disks can be written to at a rate of roughly 100MB/s.  It will slow down as it gets to the inner cylinders of the disk, to a rate as low as 50 or 60MB/s.  I assume no slowdown for the following estimate.

 

At 100MB/s it will take 10 seconds to zero 1GB.  You will be clearing 6GB every minute. (roughly)

 

You have a 3TB drive, that is 3000GB.  3000 / 6 = 500 minutes.

 

500 minutes = 8.33 hours.

 

As already mentioned, this is about the best case.  If you actually write at an average of 75MB/s, it will take over 10 hours.
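The same arithmetic as a small Python sketch, in case you want to plug in your own disk size and write speed. It assumes decimal units (1GB = 1000MB) and a constant write rate, which is the best case as noted above; the function name `clear_hours` is just made up for this example.

```python
def clear_hours(size_gb, rate_mb_s):
    """Hours needed to zero a disk of size_gb at a constant rate_mb_s.

    Assumes decimal units (1 GB = 1000 MB) and no slowdown on inner cylinders.
    """
    seconds = size_gb * 1000 / rate_mb_s
    return seconds / 3600


for rate in (100, 75):
    print(f"{rate} MB/s: {clear_hours(3000, rate):.2f} hours")
# 100 MB/s: 8.33 hours
# 75 MB/s: 11.11 hours
```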

 

This is exactly why the pre-clear script was originally developed...  (and when I originally wrote it, large disks were only 500GB, so down-time was only three or four hours.)

 

Joe L.

Link to comment

AOC-SASLP-MV8 is not an LSI card. However, this may become the official release, and it probably includes all the fixes from RC6 (but there is no official change history to be sure).

 

You also have a relatively low-power CPU, and with the 3TB disk attached to the SM controller, clearing may take 15-18 hours, assuming you do not hit any bugs along the way. Using Joe L.'s preclear script will test the disk better and avoid the long wait you are observing right now.

 

As an X7SPA and SASLP-MV8 user myself, I was under the impression there were no specific problems with these cards. I've also read that Tom uses these cards for his own tests.

 

As I only have a SASLP-MV8 on my production server (with 24TB), I unfortunately cannot test the config myself.

Link to comment

Update:

 

I left the box running overnight. It finished clearing and I was able to format it.

 

So I guess the webgui freezes until the clearing of the disk is completed.

I can live with that.

 

Starting the second disk now...

 

The WebGui should show progress during a Clear. What browser and version are you using?

Link to comment

This is my syslog entry for the M1015 card

mpt2sas0: LSISAS2008: FWVersion(14.00.00.00), ChipRevision(0x03), BiosVersion(00.00.00.00) (Drive related)

Should I be concerned? I haven't upgraded to RC 6 yet, will that change the bios version?

 

You have the latest firmware, but the person you bought the card from did not flash any BIOS. This is acceptable for IT mode and it will make your computer boot 20-30s faster. Other than that, there is probably no harmful effect.

 

I will leave it as is...and I will upgrade to this release as well. Thanks for the pointer.

Link to comment

Update:

 

I left the box running overnight. It finished clearing and I was able to format it.

 

So I guess the webgui freezes until the clearing of the disk is completed.

I can live with that.

 

Starting the second disk now...

 

The WebGui should show progress during a Clear. What browser and version are you using?

 

I am using the latest Chrome browser.

Link to comment

 

Seem to have a problem...

 

The system was running great until this showed up...

 

 

 

Quote

 

Aug  6 23:46:37 THOR kernel: Call Trace:

Aug  6 23:46:37 THOR kernel:  [<c1055180>] print_cpu_stall+0x59/0xd1

Aug  6 23:46:37 THOR kernel:  [<c1055233>] __rcu_pending+0x3b/0x125

Aug  6 23:46:37 THOR kernel:  [<c1055393>] rcu_check_callbacks+0x76/0xa1

Aug  6 23:46:37 THOR kernel:  [<c102b11b>] update_process_times+0x2d/0x58

Aug  6 23:46:37 THOR kernel:  [<c1048dfe>] tick_periodic+0x63/0x65

Aug  6 23:46:37 THOR kernel:  [<c1048e19>] tick_handle_periodic+0x19/0x6c

Aug  6 23:46:37 THOR kernel:  [<c10173f2>] smp_apic_timer_interrupt+0x67/0x7a

Aug  6 23:46:37 THOR kernel:  [<c131f3fa>] apic_timer_interrupt+0x2a/0x30

Aug  6 23:46:37 THOR kernel:  [<f853a6e0>] ? xor_sse_4+0x262/0x338 [xor]

Aug  6 23:46:37 THOR kernel:  [<f853b8cd>] xor_blocks+0x55/0x71 [xor]

Aug  6 23:46:37 THOR kernel:  [<f853b8cd>] ? xor_blocks+0x55/0x71 [xor]

Aug  6 23:46:37 THOR kernel:  [<f8618951>] check_parity+0xb3/0xc3 [md_mod]

Aug  6 23:46:37 THOR kernel:  [<f86194b5>] handle_stripe+0xa7d/0xcfb [md_mod]

Aug  6 23:46:37 THOR kernel:  [<f86197aa>] unraidd+0x77/0xb8 [md_mod]

Aug  6 23:46:37 THOR kernel:  [<f8616925>] md_thread+0xcc/0xe3 [md_mod]

Aug  6 23:46:37 THOR kernel:  [<c1035e9c>] ? wake_up_bit+0x5b/0x5b

Aug  6 23:46:37 THOR kernel:  [<f8616859>] ? import_device+0x147/0x147 [md_mod]

Aug  6 23:46:37 THOR kernel:  [<c1035ba4>] kthread+0x67/0x6c

Aug  6 23:46:37 THOR kernel:  [<c1035b3d>] ? kthread_freezable_should_stop+0x49/0x49

Aug  6 23:46:37 THOR kernel:  [<c131fd76>] kernel_thread_helper+0x6/0xd

Aug  6 23:49:37 THOR kernel: INFO: rcu_sched self-detected stall on CPU { 2}  (t=96015 jiffies)

Aug  6 23:49:37 THOR kernel: Pid: 2389, comm: unraidd Not tainted 3.4.4-unRAID #2

Aug  6 23:49:37 THOR kernel: Call Trace:

Aug  6 23:49:37 THOR kernel:  [<c1055180>] print_cpu_stall+0x59/0xd1

Aug  6 23:49:37 THOR kernel:  [<c1055233>] __rcu_pending+0x3b/0x125

Aug  6 23:49:37 THOR kernel:  [<c1055393>] rcu_check_callbacks+0x76/0xa1

Aug  6 23:49:37 THOR kernel:  [<c102b11b>] update_process_times+0x2d/0x58

Aug  6 23:49:37 THOR kernel:  [<c1048dfe>] tick_periodic+0x63/0x65

Aug  6 23:49:37 THOR kernel:  [<c1048e19>] tick_handle_periodic+0x19/0x6c

Aug  6 23:49:37 THOR kernel:  [<c10173f2>] smp_apic_timer_interrupt+0x67/0x7a

Aug  6 23:49:37 THOR kernel:  [<c131f3fa>] apic_timer_interrupt+0x2a/0x30

Aug  6 23:49:37 THOR kernel:  [<f853a903>] ? xor_sse_5+0x14d/0x3d8 [xor]

Aug  6 23:49:37 THOR kernel:  [<c10161a3>] ? native_smp_send_reschedule+0x3f/0x41

Aug  6 23:49:37 THOR kernel:  [<f853b8de>] xor_blocks+0x66/0x71 [xor]

Aug  6 23:49:37 THOR kernel:  [<f853b8de>] ? xor_blocks+0x66/0x71 [xor]

Aug  6 23:49:37 THOR kernel:  [<f861892b>] check_parity+0x8d/0xc3 [md_mod]

Aug  6 23:49:37 THOR kernel:  [<f86194b5>] handle_stripe+0xa7d/0xcfb [md_mod]

Aug  6 23:49:37 THOR kernel:  [<f86197aa>] unraidd+0x77/0xb8 [md_mod]

Aug  6 23:49:37 THOR kernel:  [<f8616925>] md_thread+0xcc/0xe3 [md_mod]

Aug  6 23:49:37 THOR kernel:  [<c1035e9c>] ? wake_up_bit+0x5b/0x5b

Aug  6 23:49:37 THOR kernel:  [<f8616859>] ? import_device+0x147/0x147 [md_mod]

Aug  6 23:49:37 THOR kernel:  [<c1035ba4>] kthread+0x67/0x6c

Aug  6 23:49:37 THOR kernel:  [<c1035b3d>] ? kthread_freezable_should_stop+0x49/0x49

Aug  6 23:49:37 THOR kernel:  [<c131fd76>] kernel_thread_helper+0x6/0xd

Aug  6 23:49:46 THOR in.telnetd[31505]: connect from 192.168.1.2 (192.168.1.2)

Aug  6 23:49:52 THOR login[31506]: ROOT LOGIN  on '/dev/pts/1' from '192.168.1.2'

Aug  6 23:50:36 THOR in.telnetd[31521]: connect from 192.168.1.2 (192.168.1.2)

Aug  6 23:50:42 THOR login[31522]: ROOT LOGIN  on '/dev/pts/2' from '192.168.1.2'

Aug  6 23:52:37 THOR kernel: INFO: rcu_sched self-detected stall on CPU { 2}  (t=114018 jiffies)

Aug  6 23:52:37 THOR kernel: Pid: 2389, comm: unraidd Not tainted 3.4.4-unRAID #2

Aug  6 23:52:37 THOR kernel: Call Trace:

Aug  6 23:52:37 THOR kernel:  [<c1055180>] print_cpu_stall+0x59/0xd1

Aug  6 23:52:37 THOR kernel:  [<c1055233>] __rcu_pending+0x3b/0x125

Aug  6 23:52:37 THOR kernel:  [<c1055393>] rcu_check_callbacks+0x76/0xa1

Aug  6 23:52:37 THOR kernel:  [<c102b11b>] update_process_times+0x2d/0x58

Aug  6 23:52:37 THOR kernel:  [<c1048dfe>] tick_periodic+0x63/0x65

Aug  6 23:52:37 THOR kernel:  [<c1048e19>] tick_handle_periodic+0x19/0x6c

Aug  6 23:52:37 THOR kernel:  [<c10173f2>] smp_apic_timer_interrupt+0x67/0x7a

Aug  6 23:52:37 THOR kernel:  [<c131f3fa>] apic_timer_interrupt+0x2a/0x30

Aug  6 23:52:37 THOR kernel:  [<c119103f>] ? blk_update_request+0x112/0x2e4

Aug  6 23:52:37 THOR kernel:  [<c119121f>] blk_update_bidi_request+0xe/0x4f

Aug  6 23:52:37 THOR kernel:  [<c1191ad5>] blk_end_bidi_request+0x18/0x50

Aug  6 23:52:37 THOR kernel:  [<c1191b41>] blk_end_request+0xa/0xc

Aug  6 23:52:37 THOR kernel:  [<c1230818>] scsi_end_request+0x1f/0x70

Aug  6 23:52:37 THOR kernel:  [<c1230b7e>] scsi_io_completion+0x1a9/0x3e0

Aug  6 23:52:37 THOR kernel:  [<c123091c>] ? scsi_device_unbusy+0x7c/0x82

Aug  6 23:52:37 THOR kernel:  [<c122bca2>] scsi_finish_command+0x97/0x9d

Aug  6 23:52:37 THOR kernel:  [<c1230e7e>] scsi_softirq_done+0xba/0xc2

Aug  6 23:52:37 THOR kernel:  [<c1195e26>] blk_done_softirq+0x4a/0x57

Aug  6 23:52:37 THOR kernel:  [<c1027046>] __do_softirq+0x6b/0xe5

Aug  6 23:52:37 THOR kernel:  [<c1026fdb>] ? irq_enter+0x41/0x41

Aug  6 23:52:37 THOR kernel:  <IRQ>  [<c1026e8f>] ? irq_exit+0x32/0x58

Aug  6 23:52:37 THOR kernel:  [<c1003506>] ? do_IRQ+0x7c/0x90

Aug  6 23:52:37 THOR kernel:  [<c131fd69>] ? common_interrupt+0x29/0x30

Aug  6 23:52:37 THOR kernel:  [<c11a65bc>] ? memcmp+0x15/0x25

Aug  6 23:52:37 THOR kernel:  [<f86194d9>] ? handle_stripe+0xaa1/0xcfb [md_mod]

Aug  6 23:52:37 THOR kernel:  [<f86197aa>] ? unraidd+0x77/0xb8 [md_mod]

Aug  6 23:52:37 THOR kernel:  [<f8616925>] ? md_thread+0xcc/0xe3 [md_mod]

Aug  6 23:52:37 THOR kernel:  [<c1035e9c>] ? wake_up_bit+0x5b/0x5b

Aug  6 23:52:37 THOR kernel:  [<f8616859>] ? import_device+0x147/0x147 [md_mod]

Aug  6 23:52:37 THOR kernel:  [<c1035ba4>] ? kthread+0x67/0x6c

Aug  6 23:52:37 THOR kernel:  [<c1035b3d>] ? kthread_freezable_should_stop+0x49/0x49

Aug  6 23:52:37 THOR kernel:  [<c131fd76>] ? kernel_thread_helper+0x6/0xd

 

 

 

 

I started a parity check, & a few minutes later went to refresh the web GUI...

 

The GUI would no longer come up...  I had access via putty & the terminal though.

 

Tried everything I could find to get the system to shut down... no go

 

I captured the syslog, changed the 2 files back to rc5 & hard reset it...

 

Running parity now...showed 128 errors right off... 95% done now with no further errors.

syslog.zip
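The stall reports in the trace quoted above repeat every three minutes. For anyone digging through a long syslog for them, here is a rough Python sketch that lists each one with its timestamp, CPU and jiffies count. It assumes the message format is exactly as in the excerpt above; `STALL` and `find_stalls` are names invented for this example.

```python
import re

# Matches the "rcu_sched self-detected stall" line as it appears in the excerpt above.
# Assumption: other kernels may word this message differently.
STALL = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ [\d:]{8}) \S+ kernel: INFO: rcu_sched self-detected stall "
    r"on CPU \{\s*(?P<cpu>\d+)\}\s+\(t=(?P<jiffies>\d+) jiffies\)",
    re.MULTILINE,
)


def find_stalls(syslog_text):
    """Return (timestamp, cpu, jiffies) for every stall report in the text."""
    return [
        (m.group("stamp"), int(m.group("cpu")), int(m.group("jiffies")))
        for m in STALL.finditer(syslog_text)
    ]


sample = (
    "Aug  6 23:49:37 THOR kernel: INFO: rcu_sched self-detected stall on CPU { 2}  (t=96015 jiffies)\n"
    "Aug  6 23:49:37 THOR kernel: Pid: 2389, comm: unraidd Not tainted 3.4.4-unRAID #2\n"
    "Aug  6 23:52:37 THOR kernel: INFO: rcu_sched self-detected stall on CPU { 2}  (t=114018 jiffies)\n"
)

for stamp, cpu, jiffies in find_stalls(sample):
    print(stamp, "CPU", cpu, "t =", jiffies)
```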

Link to comment
