
5.0RC3 - BLK_EH_NOT_HANDLED causes freeze



RC5 - just got another one, and it eventually red-balled one of my drives again.....

 

Jun 25 10:10:23 Tower kernel: sas: command 0xf74d4d80, task 0xf75a72c0, timed out: BLK_EH_NOT_HANDLED
Jun 25 10:10:23 Tower kernel: sas: Enter sas_scsi_recover_host
Jun 25 10:10:23 Tower kernel: sas: trying to find task 0xf75a72c0
Jun 25 10:10:23 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf75a72c0
Jun 25 10:10:23 Tower kernel: sas: sas_scsi_find_task: querying task 0xf75a72c0
Jun 25 10:10:23 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1747:mvs_query_task:rc= 5
Jun 25 10:10:23 Tower kernel: sas: sas_scsi_find_task: task 0xf75a72c0 failed to abort
Jun 25 10:10:23 Tower kernel: sas: task 0xf75a72c0 is not at LU: I_T recover
Jun 25 10:10:23 Tower kernel: sas: I_T nexus reset for dev 0700000000000000
Jun 25 10:10:25 Tower kernel: mvsas 0000:03:00.0: Phy7 : No sig fis
Jun 25 10:10:25 Tower kernel: sas: sas_form_port: phy7 belongs to port7 already(1)!
Jun 25 10:10:25 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[7]:rc= 0
Jun 25 10:10:25 Tower kernel: sas: I_T 0700000000000000 recovered
Jun 25 10:10:25 Tower kernel: sas: sas_ata_task_done: SAS error 8d
Jun 25 10:10:25 Tower kernel: ata7: sas eh calling libata port error handler
Jun 25 10:10:25 Tower kernel: ata8: sas eh calling libata port error handler
Jun 25 10:10:25 Tower kernel: ata9: sas eh calling libata port error handler
Jun 25 10:10:25 Tower kernel: ata10: sas eh calling libata port error handler
Jun 25 10:10:25 Tower kernel: ata11: sas eh calling libata port error handler
Jun 25 10:10:25 Tower kernel: ata12: sas eh calling libata port error handler
Jun 25 10:10:25 Tower kernel: ata13: sas eh calling libata port error handler
Jun 25 10:10:25 Tower kernel: ata14: sas eh calling libata port error handler
Jun 25 10:10:25 Tower kernel: ata14.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 t0
Jun 25 10:10:25 Tower kernel: ata14.00: failed command: CHECK POWER MODE
Jun 25 10:10:25 Tower kernel: ata14.00: cmd e5/00:00:00:00:00/00:00:00:00:00/00 tag 0
Jun 25 10:10:25 Tower kernel:          res 01/04:00:00:00:00/00:00:00:00:00/00 Emask 0x3 (HSM violation)
Jun 25 10:10:25 Tower kernel: ata14.00: status: { ERR }
Jun 25 10:10:25 Tower kernel: ata14.00: error: { ABRT }
Jun 25 10:10:25 Tower kernel: ata14: hard resetting link
Jun 25 10:10:27 Tower kernel: mvsas 0000:03:00.0: Phy7 : No sig fis
Jun 25 10:10:27 Tower kernel: sas: sas_form_port: phy7 belongs to port7 already(1)!
Jun 25 10:10:27 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[7]:rc= 0
Jun 25 10:10:27 Tower kernel: sas: sas_ata_hard_reset: Found ATA device.
Jun 25 10:10:32 Tower kernel: ata14.00: qc timeout (cmd 0x27)
Jun 25 10:10:32 Tower kernel: ata14.00: failed to read native max address (err_mask=0x4)
Jun 25 10:10:32 Tower kernel: ata14.00: HPA support seems broken, skipping HPA handling
Jun 25 10:10:32 Tower kernel: ata14.00: revalidation failed (errno=-5)
Jun 25 10:10:32 Tower kernel: ata14: hard resetting link
Jun 25 10:10:34 Tower kernel: mvsas 0000:03:00.0: Phy7 : No sig fis
Jun 25 10:10:34 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[7]:rc= 0
Jun 25 10:10:34 Tower kernel: sas: sas_ata_hard_reset: Found ATA device.
Jun 25 10:10:34 Tower kernel: sas: sas_ata_task_done: SAS error 2
Jun 25 10:10:36 Tower kernel: sas: sas_ata_task_done: SAS error 2
Jun 25 10:10:36 Tower kernel: sas: sas_form_port: phy7 belongs to port7 already(1)!
Jun 25 10:10:36 Tower kernel: ata14.00: both IDENTIFYs aborted, assuming NODEV
Jun 25 10:10:36 Tower kernel: ata14.00: revalidation failed (errno=-2)
Jun 25 10:10:39 Tower kernel: ata14: hard resetting link
Jun 25 10:10:41 Tower kernel: mvsas 0000:03:00.0: Phy7 : No sig fis
Jun 25 10:10:41 Tower kernel: sas: sas_form_port: phy7 belongs to port7 already(1)!
Jun 25 10:10:42 Tower emhttp: mdcmd: write: Input/output error
Jun 25 10:10:42 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[7]:rc= 0
Jun 25 10:10:42 Tower kernel: sas: sas_ata_hard_reset: Found ATA device.
Jun 25 10:10:42 Tower kernel: sas: sas_ata_task_done: SAS error 2
Jun 25 10:10:42 Tower kernel: ata14.00: both IDENTIFYs aborted, assuming NODEV
Jun 25 10:10:42 Tower kernel: ata14.00: revalidation failed (errno=-2)
Jun 25 10:10:42 Tower kernel: ata14.00: disabled
Jun 25 10:10:42 Tower kernel: ata14: EH complete
Jun 25 10:10:42 Tower kernel: sas: --- Exit sas_scsi_recover_host
Jun 25 10:10:42 Tower kernel: mdcmd (154): spindown 4
Jun 25 10:10:42 Tower kernel: md: disk4: ATA_OP e0 ioctl error: -5
Jun 25 10:10:42 Tower emhttp: mdcmd: write: Input/output error
Jun 25 10:10:42 Tower kernel: mdcmd (155): spindown 4
Jun 25 10:10:42 Tower kernel: md: disk4: ATA_OP e0 ioctl error: -5
Jun 25 10:10:42 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jun 25 10:10:42 Tower last message repeated 3 times
Jun 25 10:10:42 Tower emhttp: mdcmd: write: Input/output error
Jun 25 10:10:42 Tower kernel: mdcmd (156): spindown 4
Jun 25 10:10:42 Tower kernel: md: disk4: ATA_OP e0 ioctl error: -5
Jun 25 10:10:42 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jun 25 10:10:42 Tower last message repeated 3 times
Jun 25 10:10:42 Tower emhttp: mdcmd: write: Input/output error
Jun 25 10:10:42 Tower kernel: mdcmd (157): spindown 4
Jun 25 10:10:42 Tower kernel: md: disk4: ATA_OP e0 ioctl error: -5
Jun 25 10:10:42 Tower emhttp: mdcmd: write: Input/output error
Jun 25 10:10:42 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jun 25 10:10:42 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jun 25 10:10:42 Tower kernel: mdcmd (158): spindown 4
Jun 25 10:10:42 Tower kernel: md: disk4: ATA_OP e0 ioctl error: -5
Jun 25 10:10:42 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jun 25 10:10:42 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jun 25 10:10:42 Tower emhttp: mdcmd: write: Input/output error
Jun 25 10:10:42 Tower kernel: mdcmd (159): spindown 4
Jun 25 10:10:42 Tower kernel: md: disk4: ATA_OP e0 ioctl error: -5
Jun 25 10:10:43 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jun 25 10:10:43 Tower last message repeated 3 times
Jun 25 10:10:43 Tower emhttp: mdcmd: write: Input/output error
Jun 25 10:10:43 Tower kernel: mdcmd (160): spindown 4
Jun 25 10:10:43 Tower kernel: md: disk4: ATA_OP e0 ioctl error: -5
Jun 25 10:10:43 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jun 25 10:10:43 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jun 25 10:10:43 Tower emhttp: mdcmd: write: Input/output error
Jun 25 10:10:43 Tower kernel: mdcmd (161): spindown 4
Jun 25 10:10:43 Tower kernel: md: disk4: ATA_OP e0 ioctl error: -5
Jun 25 10:10:43 Tower emhttp: mdcmd: write: Input/output error
Jun 25 10:10:43 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jun 25 10:10:43 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jun 25 10:10:43 Tower kernel: mdcmd (162): spindown 4

 

This repeats over and over until:

 

Jun 25 19:28:23 Tower kernel: end_request: I/O error, dev sdj, sector 1448887568
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448886480/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887504/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887512/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887520/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887528/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887536/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887544/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887552/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887560/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887568/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887576/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887584/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887592/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887600/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887608/4, count: 1
Jun 25 19:28:23 Tower kernel: md: disk4 read error
Jun 25 19:28:23 Tower kernel: handle_stripe read error: 1448887616/4, count: 1
Jun 25 19:28:27 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jun 25 19:28:27 Tower last message repeated 3 times
Jun 25 19:28:34 Tower kernel: sd 0:0:7:0: [sdj] Unhandled error code
Jun 25 19:28:34 Tower kernel: sd 0:0:7:0: [sdj]  Result: hostbyte=0x04 driverbyte=0x00
Jun 25 19:28:34 Tower kernel: sd 0:0:7:0: [sdj] CDB: cdb[0]=0x2a: 2a 00 56 5c 41 10 00 00 08 00
Jun 25 19:28:34 Tower kernel: end_request: I/O error, dev sdj, sector 1448886544
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448886480/4, count: 1
Jun 25 19:28:34 Tower kernel: md: recovery thread woken up ...
Jun 25 19:28:34 Tower kernel: md: recovery thread has nothing to resync
Jun 25 19:28:34 Tower kernel: sd 0:0:7:0: [sdj] Unhandled error code
Jun 25 19:28:34 Tower kernel: sd 0:0:7:0: [sdj]  Result: hostbyte=0x04 driverbyte=0x00
Jun 25 19:28:34 Tower kernel: sd 0:0:7:0: [sdj] CDB: cdb[0]=0x2a: 2a 00 56 5c 45 10 00 00 78 00
Jun 25 19:28:34 Tower kernel: end_request: I/O error, dev sdj, sector 1448887568
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887504/4, count: 1
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887512/4, count: 1
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887520/4, count: 1
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887528/4, count: 1
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887536/4, count: 1
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887544/4, count: 1
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887552/4, count: 1
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887560/4, count: 1
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887568/4, count: 1
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887576/4, count: 1
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887584/4, count: 1
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887592/4, count: 1
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887600/4, count: 1
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887608/4, count: 1
Jun 25 19:28:34 Tower kernel: md: disk4 write error
Jun 25 19:28:34 Tower kernel: handle_stripe write error: 1448887616/4, count: 1
Jun 25 19:43:42 Tower kernel: mdcmd (8123): spindown 8
Jun 25 19:43:42 Tower kernel: mdcmd (8124): spindown 9
Jun 25 19:43:42 Tower kernel: mdcmd (8125): spindown 10
Jun 25 19:43:51 Tower kernel: mdcmd (8126): spindown 5
Jun 25 19:43:52 Tower kernel: mdcmd (8127): spindown 11
Jun 25 19:43:53 Tower kernel: mdcmd (8128): spindown 12
Jun 25 19:45:36 Tower kernel: mdcmd (8129): spindown 6
Jun 25 19:45:37 Tower kernel: mdcmd (8130): spindown 7
Jun 25 19:50:35 Tower kernel: mdcmd (8131): spindown 3
Jun 25 19:51:18 Tower kernel: mdcmd (8132): spindown 1
Jun 25 19:54:13 Tower kernel: mdcmd (8133): spindown 0
Jun 25 19:54:14 Tower kernel: mdcmd (8134): spindown 2

 

The SMART report for the drive says it's 100% healthy, with no reallocated sectors.
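For reference, something like this run from the console should show the raw figures behind that (just a sketch - sdj is the device letter from the syslog above, and drives sitting behind the MV8 may need a "-d sat" style option if smartctl does not detect them automatically):

smartctl -H /dev/sdj
smartctl -A /dev/sdj | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'

The first command gives the drive's overall self-assessment; the second pulls the attributes that normally move first when a disk is genuinely failing.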

 

Myk

 

Link to comment

I will be posting a special release called 5.0-rc4-scst-1, which uses a different mvsas driver.  Watch for that announcement and then let's see how it behaves in your system.

 

Any update on this?

 

Spindown has been disabled since the 24th, and there has been no repeat of the lockup issue.
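For anyone who wants to confirm spindown is the trigger rather than wait for it to happen, manually spinning down one disk should provoke the same sequence shown in the syslog above. A rough sketch, assuming the /proc/mdcmd interface that the "mdcmd (nnn): spindown 4" kernel lines point at (the disk number matches the md disk numbers in the log):

echo "spindown 4" > /proc/mdcmd
tail -n 50 /var/log/syslog

If the BLK_EH_NOT_HANDLED / mvsas reset lines show up right after, that pretty much nails spindown as the trigger.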

Link to comment

If RC5 has the new mvsas driver in it, it made things WORSE!

 

I had zero issues on RC3 for weeks. Upgraded to RC5 and had two of these hard locks/freezes in three days.

 

Another lockup/freeze/unresponsive GUI and shares directly after a BLK_EH_NOT_HANDLED again today. I had turned off spindown a few days ago, and it was better, but it locked up again today. I think I'm going back to RC3 to see if the problem persists.

 

Jun 28 11:25:47 media kernel: sas: command 0xd7104d80, task 0xf2f37180, timed out: BLK_EH_NOT_HANDLED
Jun 28 11:25:47 media kernel: sas: Enter sas_scsi_recover_host
Jun 28 11:25:47 media kernel: sas: trying to find task 0xf2f37180
Jun 28 11:25:47 media kernel: sas: sas_scsi_find_task: aborting task 0xf2f37180
Jun 28 11:25:47 media kernel: sas: sas_scsi_find_task: querying task 0xf2f37180
Jun 28 11:25:47 media kernel: drivers/scsi/mvsas/mv_sas.c 1747:mvs_query_task:rc= 5
Jun 28 11:25:47 media kernel: sas: sas_scsi_find_task: task 0xf2f37180 failed to abort
Jun 28 11:25:47 media kernel: sas: task 0xf2f37180 is not at LU: I_T recover
Jun 28 11:25:47 media kernel: sas: I_T nexus reset for dev 0200000000000000
Jun 28 11:25:47 media kernel: sas: sas_form_port: phy2 belongs to port2 already(1)!
Jun 28 11:25:50 media kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[2]:rc= 0
Jun 28 11:25:50 media kernel: sas: I_T 0200000000000000 recovered
Jun 28 11:25:50 media kernel: sas: sas_ata_task_done: SAS error 8d
Jun 28 11:25:50 media kernel: ata13: sas eh calling libata port error handler
Jun 28 11:25:50 media kernel: ata14: sas eh calling libata port error handler
Jun 28 11:25:50 media kernel: ata15: sas eh calling libata port error handler
Jun 28 11:25:50 media kernel: ata15.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 t0
Jun 28 11:25:50 media kernel: ata15.00: failed command: CHECK POWER MODE
Jun 28 11:25:50 media kernel: ata15.00: cmd e5/00:00:00:00:00/00:00:00:00:00/40 tag 0
Jun 28 11:25:50 media kernel:          res 01/04:ff:00:00:00/00:00:00:00:00/40 Emask 0x3 (HSM violation)
Jun 28 11:25:50 media kernel: ata15.00: status: { ERR }
Jun 28 11:25:50 media kernel: ata15.00: error: { ABRT }
Jun 28 11:25:50 media kernel: ata15: hard resetting link
Jun 28 11:25:50 media kernel: sas: sas_form_port: phy2 belongs to port2 already(1)!
Jun 28 11:25:52 media kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[2]:rc= 0
Jun 28 11:25:52 media kernel: sas: sas_ata_hard_reset: Found ATA device.
Jun 28 11:25:52 media kernel: ata15.00: configured for UDMA/133
Jun 28 11:25:52 media kernel: ata15: EH complete
Jun 28 11:25:52 media kernel: ata16: sas eh calling libata port error handler
Jun 28 11:25:52 media kernel: ata17: sas eh calling libata port error handler
Jun 28 11:25:52 media kernel: ata18: sas eh calling libata port error handler
Jun 28 11:25:52 media kernel: ata19: sas eh calling libata port error handler
Jun 28 11:25:52 media kernel: ata20: sas eh calling libata port error handler
Jun 28 11:25:52 media kernel: sas: --- Exit sas_scsi_recover_host
Jun 28 11:26:52 media kernel: sas: sas_ata_task_done: SAS error 2

 

Link to comment

I'm also getting this problem.

I upgraded from 4.7 to 5.0-RC3, and everything freezes after about 3-6 days of uptime, except that I can still telnet in, and I have the same entries in my syslog as SuperW2.
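Since telnet still works when it hangs, it's worth grabbing the log before rebooting - unRAID keeps the syslog in RAM, so it's gone after a restart. Something along these lines (the filename is just an example; /boot is the flash drive):

cp /var/log/syslog /boot/syslog-$(date +%Y%m%d-%H%M).txt
dmesg | tail -n 100

That way the full BLK_EH_NOT_HANDLED sequence can still be posted here even if the GUI never comes back.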

 

Does anyone know what the last beta version was that didn't have this issue? Is it Beta 11 or 12?

 

Is the downgrading procedure the same as upgrading?

Can I just copy bzimage and bzroot over what I already have?

 

Cheers

Link to comment

*&#(&@(&#!!  Pardon my French.

 

Finally another fatal BLK_EH_NOT_HANDLED. This time the server ran from 6/6 to 7/2. The error actually hit last night:

 

Jul  1 21:22:21 Tower1 kernel: sas: command 0xf76a9000, task 0xf02f7040, timed out: BLK_EH_NOT_HANDLED

 

We watched a movie tonight without any problem. Things only went south - and now the server is blocked - after I refreshed the web interface a few minutes ago.

Link to comment

*&#(&@(&#!!  Pardon my French.

 

Finally another fatal BLK_EH_NOT_HANDLED. This time the server ran from 6/6 to 7/2. The error actually hit last night:

 

Jul  1 21:22:21 Tower1 kernel: sas: command 0xf76a9000, task 0xf02f7040, timed out: BLK_EH_NOT_HANDLED

 

We watched a movie tonight without any problem. Things only went south - and now the server is blocked - after I refreshed the web interface a few minutes ago.

 

Looks like you will need to downgrade to get a reliable server. I'm on B14 and it hasn't happened; I believe that's the latest beta on which it doesn't happen.

 

RC5 release notes say there are issues, and that he can reproduce them, so one can only hope these issues are finally fixed in RC6. There's no way 5.0 will go final with this bug; it's too major, and too many people have SAS MV8 cards.

Link to comment

B14 is the one I'm running... No errors, and faster parity-checks. To anyone getting this error, you can just overwrite the same 2 files you do when you upgrade to go back to B14.
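In other words, roughly this (just a sketch - it assumes the B14 zip has been unpacked to /tmp/unraid-b14, which is a made-up path, and that the flash is mounted at /boot as usual; keep copies of the current files so you can switch back):

cp /boot/bzimage /boot/bzimage.rc5
cp /boot/bzroot /boot/bzroot.rc5
cp /tmp/unraid-b14/bzimage /tmp/unraid-b14/bzroot /boot/
reboot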

 

I agree, this has got to be one of the show stoppers for 5.0 final.

 

whiteatom

Link to comment

B14 is the one I'm running... No errors, and faster parity-checks. To anyone getting this error, you can just overwrite the same 2 files you do when you upgrade to go back to B14.

 

I agree, this has got to be one of the show stoppers for 5.0 final.

 

whiteatom

 

Would you by any chance happen to experience this on B14 with your cards? It makes me raise an eyebrow, because it does not happen on RC5.

http://lime-technology.com/forum/index.php?topic=21189.0

Link to comment

I've downgraded to Beta 14, so the clock is reset and fingers crossed.

If this doesn't work, then I think it'll have to be Beta 11, as originally suggested on the first page.

 

It would be surprising if v5 went final with this issue still outstanding, given that it was introduced only in the Release Candidates.

This card (AOC-SASLP-MV8) must be the most popular card in use with unRAID.

 

Limetech - any news on the promised separate RC4 release with the alternate mvsas driver (5.0-rc4-scst-1)?

 

Cheers

Link to comment

I've downgraded to Beta 14, so the clock is reset and fingers crossed.

 

Any difference in parity check speed?

 

I have a Seagate ST3000DM001 3TB as my Parity drive and another couple as Data, along with a mixture of 2TB and 1TB WD and Hitachi drives - 13 in total including the Parity drive.

I generally see a speed of about 40-50MB/s at the start of a parity check, which then increases to about 105MB/s once it gets past 2TB.

I presume this is because at that point it only needs to access the other two Seagate 3TB drives, which are probably quicker than the SATA 2 WDEARS20 and 5K300 drives?

 

I could be completely wrong of course but it sounds like a plausible explanation to me.  ;D

 

Cheers

Vin

Link to comment

The SATA speed is not the bottleneck - even SATA I is sufficient for all but the new Seagate drives. Older drives are simply slower. What speed do you see after 5 minutes?

 

It seems pretty constant at around 40-50MB/s most of the time, until later on in the check when it jumps up to over 100MB/s.

Is that not normal?

I think I might run another parity check and sample the speed throughout. I've only really casually checked it in the past.
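If you do sample it, something crude like this run from the console would log the reported progress every five minutes to the flash so it survives the check (a sketch - unRAID's md driver should report the resync position/speed in /proc/mdstat, though the exact fields may differ on this release; Ctrl-C it when the check finishes):

while true; do date; cat /proc/mdstat; echo; sleep 300; done >> /boot/parity-speed.log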

Link to comment

It seems pretty constant at around 40-50MB/s most of the time, until later on in the check when it jumps up to over 100MB/s.

Is that not normal?

I think I might run another parity check and sample the speed throughout. I've only really casually checked it in the past.

 

If some of your disks are smaller they'll finish first, and the remaining disks can proceed more quickly - once the check passes the capacity of the smaller drives, only the larger (and in your case faster) 3TB drives are still being read, so the overall speed climbs.

Link to comment
Have you checked the health of each of your drives? It could also be an older one on its way out.

 

Indeed. I have an EADS drive which appears to be perfectly healthy in all respects ... except that the read speed slows to about 5MB/s between around 75% and 85% during the pre-/post-read in preclear.

 

Its SMART status is healthy and it passes the WDC test program.

Link to comment

That's quite a big speed difference. It could be your PCIe bus bottlenecking with that many drives, but I could be wrong. Have you checked the health of each of your drives? It could also be an older one on its way out.

 

I have three older 1TB drives that came out of a QNAP that I used to run several years ago.

I removed one from the array recently when it got red-balled due to a write error.

But I think there might also be another one of them in there that has several previous write errors from its QNAP days (myMain showed them in red under the SMART view).

 

I'm running an Atom board (Supermicro X7SPA-HF-D525) with an additional AOC-SASLP-MV8 card - 14 ports in total.

 

I know I should look to replace them really. Hmmm I'm getting worried now. :-)

Wouldn't hurt to replace them with a shiny new 2 or 3TBer :-)

 

Link to comment
