Adding 2nd Parity Fails

DingHo · January 15, 2019

Hello,

I just bought 2 new 4TB drives with the intention of adding one for data and the other as a 2nd parity drive. Both successfully pre-cleared. I then ran a parity check successfully. I stopped the array, added one as data, restarted the array successfully. Stopped the array again, added the next drive as 2nd parity, started the array and began the rebuild. It ran fine for about 6 hours then slowed to a crawl. I stopped the rebuild. I found several errors in the log:

Jan 15 14:36:44 Scour kernel: sas: sas_eh_handle_sas_errors: task 0x00000000675fa35a is aborted
Jan 15 14:36:44 Scour kernel: sas: ata11: end_device-9:2: cmd error handler
Jan 15 14:36:44 Scour kernel: sas: ata9: end_device-9:0: dev error handler
Jan 15 14:36:44 Scour kernel: sas: ata10: end_device-9:1: dev error handler
Jan 15 14:36:44 Scour kernel: sas: ata11: end_device-9:2: dev error handler
Jan 15 14:36:44 Scour kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 15 14:36:44 Scour kernel: ata11.00: failed command: READ DMA EXT
Jan 15 14:36:44 Scour kernel: ata11.00: cmd 25/00:00:98:12:6e/00:04:a7:00:00/e0 tag 17 dma 524288 in
Jan 15 14:36:44 Scour kernel: res 40/00:80:28:0d:71/00:00:29:00:00/40 Emask 0x4 (timeout)
Jan 15 14:36:44 Scour kernel: ata11.00: status: { DRDY }
Jan 15 14:36:44 Scour kernel: ata11: hard resetting link
Jan 15 14:36:44 Scour kernel: sas: ata12: end_device-9:3: dev error handler
Jan 15 14:36:44 Scour kernel: sas: sas_form_port: phy2 belongs to port2 already(1)!
Jan 15 14:36:47 Scour kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[2]:rc= 0
Jan 15 14:36:47 Scour kernel: ata11.00: configured for UDMA/100
Jan 15 14:36:47 Scour kernel: ata11: EH complete
Jan 15 14:36:47 Scour kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Jan 15 15:10:38 Scour kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 15 15:10:38 Scour kernel: sas: trying to find task 0x00000000691cfafa
Jan 15 15:10:38 Scour kernel: sas: sas_scsi_find_task: aborting task 0x00000000691cfafa
Jan 15 15:10:38 Scour kernel: sas: sas_scsi_find_task: task 0x00000000691cfafa is aborted
Jan 15 15:10:38 Scour kernel: sas: sas_eh_handle_sas_errors: task 0x00000000691cfafa is aborted
Jan 15 15:10:38 Scour kernel: sas: ata11: end_device-9:2: cmd error handler
Jan 15 15:10:38 Scour kernel: sas: ata9: end_device-9:0: dev error handler
Jan 15 15:10:38 Scour kernel: sas: ata10: end_device-9:1: dev error handler
Jan 15 15:10:38 Scour kernel: sas: ata11: end_device-9:2: dev error handler
Jan 15 15:10:38 Scour kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 15 15:10:38 Scour kernel: ata11.00: failed command: READ DMA EXT
Jan 15 15:10:38 Scour kernel: ata11.00: cmd 25/00:00:60:95:db/00:04:b7:00:00/e0 tag 19 dma 524288 in
Jan 15 15:10:38 Scour kernel: res 40/00:00:67:17:b6/00:00:7b:00:00/e0 Emask 0x4 (timeout)
Jan 15 15:10:38 Scour kernel: ata11.00: status: { DRDY }
Jan 15 15:10:38 Scour kernel: sas: ata12: end_device-9:3: dev error handler
Jan 15 15:10:38 Scour kernel: ata11: hard resetting link
Jan 15 15:10:39 Scour kernel: sas: sas_form_port: phy2 belongs to port2 already(1)!
Jan 15 15:10:41 Scour kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[2]:rc= 0
Jan 15 15:10:41 Scour kernel: ata11.00: configured for UDMA/100
Jan 15 15:10:41 Scour kernel: ata11: EH complete
Jan 15 15:10:41 Scour kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Jan 15 16:46:31 Scour emhttpd: req (13): startState=STARTED&file=&csrf_token=****************&cmdNoCheck=Cancel
Jan 15 16:46:31 Scour kernel: mdcmd (48): nocheck
Jan 15 16:46:41 Scour kernel: md: md_do_sync: got signal, exit...
Jan 15 16:46:49 Scour kernel: md: recovery thread: completion status: -4
Jan 15 16:47:01 Scour sSMTP[18094]: Creating SSL connection to host
Jan 15 16:47:01 Scour sSMTP[18094]: SSL connection using ECDHE-RSA-CHACHA20-POLY1305

Is there a way to determine which drives are ATA 9-12?
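
One possible way to approach this question, offered as a rough sketch rather than an Unraid-endorsed method: the `sas:` lines in the log already pair each `ataN` name with an `end_device-H:P` name, and on Linux the resolved sysfs path of a disk behind a SAS controller is assumed here to contain that same `end_device-H:P` component. The function names and the `/sys/block` default below are illustrative choices, not taken from the thread.

```python
import os
import re

# Matches libsas log lines such as:
#   "sas: ata11: end_device-9:2: dev error handler"
SAS_LINE = re.compile(r"sas: (ata\d+): (end_device-\d+:\d+):")


def ata_to_end_device(log_lines):
    """Map each ataN name to the end_device named on the same log line."""
    mapping = {}
    for line in log_lines:
        match = SAS_LINE.search(line)
        if match:
            mapping[match.group(1)] = match.group(2)
    return mapping


def block_devices_for(end_device, sys_block="/sys/block"):
    """Return block device names whose resolved sysfs path has end_device
    as one of its path components (assumption: SAS-attached disks do)."""
    found = []
    for name in sorted(os.listdir(sys_block)):
        real = os.path.realpath(os.path.join(sys_block, name))
        if end_device in real.split(os.sep):
            found.append(name)
    return found


# Lines copied from the log excerpt above.
sample = [
    "Jan 15 14:36:44 Scour kernel: sas: ata11: end_device-9:2: cmd error handler",
    "Jan 15 14:36:44 Scour kernel: sas: ata9: end_device-9:0: dev error handler",
    "Jan 15 14:36:44 Scour kernel: sas: ata10: end_device-9:1: dev error handler",
    "Jan 15 14:36:44 Scour kernel: sas: ata12: end_device-9:3: dev error handler",
]

mapping = ata_to_end_device(sample)
for port in sorted(mapping, key=lambda p: int(p[3:])):
    print(port, "->", mapping[port])
```

For the excerpt above this pairs ata9 through ata12 with end_device-9:0 through end_device-9:3. Run on the server itself, `block_devices_for(mapping["ata11"])` should then return the matching sdX name, if the sysfs assumption holds, which can be compared with the device names shown next to each disk in the Unraid GUI.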

I have a SAS2LP-MV8 installed, in addition to the Intel and ASMedia SATA ports. I've tried to d/l the diagnostics, but it seems to have hung, and I can't access any shares anymore, so I'll just power it down for now.

Any help is appreciated.

JorgeB · January 15, 2019

Please post the diagnostics: Tools -> Diagnostics, even if you can only get them after rebooting.

Edited January 15, 2019 by johnnie.black

trurl · January 15, 2019

1 hour ago, DingHo said:

I've tried to d/l the diagnostics, but it seems to have hung

For future reference, see here about some ways to get diagnostics from the command line:

DingHo · January 15, 2019

Thank you. Had to do a hard reset, but everything comes back up fine.

scour-diagnostics-20190116-0729.zip

JorgeB · January 15, 2019

Parity2 looks healthy. You're using a SAS2LP, and parity2 is connected to it. Without diags from when the problem happened it's difficult to say for sure, but those controllers are known to drop disks for no apparent reason and have not been recommended for use with Unraid for some time.

DingHo · January 16, 2019

Thanks @johnnie.black.

I tried swapping parity2 and a data drive that was on the Intel SATA controller. My [faulty] thinking was the 2nd parity might work if not on the troublesome SAS2LP. Parity Sync started after reboot, ran for a bit, then started throwing up a ton of this error:

Jan 16 11:06:03 Scour kernel: REISERFS error (device md3): vs-7000 search_by_entry_key: search_by_key returned item position == 0

I was able to get the diag file this time. The parity sync is actually still running now after all those errors.

With the new logs, can you tell with more certainty whether this is likely a SAS2LP issue? Should I just pony up the dough for an LSI controller at this point?

Thanks again.

scour-diagnostics-20190116-1102.zip

DingHo · January 16, 2019

I spoke too soon about the parity sync still running. The data drive I moved to the SAS2LP is now showing as offline. (red X).
I stopped the parity sync. Now I'm a bit nervous about next steps. Any help appreciated.

scour-diagnostics-20190116-1144.zip

JorgeB · January 16, 2019

Yes, it definitely looks like the typical SAS2LP problem, simultaneous timeout errors on multiple disks:

Jan 16 10:56:48 Scour kernel: sas: ata10: end_device-9:1: cmd error handler
Jan 16 10:56:48 Scour kernel: sas: ata11: end_device-9:2: cmd error handler
Jan 16 10:56:48 Scour kernel: sas: ata12: end_device-9:3: cmd error handler
Jan 16 10:56:48 Scour kernel: sas: ata9: end_device-9:0: dev error handler
Jan 16 10:56:48 Scour kernel: sas: ata10: end_device-9:1: dev error handler
Jan 16 10:56:48 Scour kernel: ata10.00: exception Emask 0x0 SAct 0x200 SErr 0x0 action 0x6 frozen
Jan 16 10:56:48 Scour kernel: ata10.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata10.00: cmd 60/00:00:c8:82:04/04:00:03:00:00/40 tag 9 ncq dma 524288 in
Jan 16 10:56:48 Scour kernel:         res 40/00:40:08:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata10.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata10: hard resetting link
Jan 16 10:56:48 Scour kernel: sas: ata11: end_device-9:2: dev error handler
Jan 16 10:56:48 Scour kernel: ata11.00: exception Emask 0x0 SAct 0x7d000000 SErr 0x0 action 0x6 frozen
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:e8:4e:0d/00:00:50:00:00/40 tag 24 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:ff:ff:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/00:00:c8:96:04/04:00:03:00:00/40 tag 26 ncq dma 524288 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:88:19:89/00:00:74:00:00/40 tag 27 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:08:1a:7c/00:00:74:00:00/40 tag 28 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:d0:78:42:d3/00:00:36:00:00/40 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:e8:a7:ce/00:00:ae:00:00/40 tag 29 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:c8:dd:8a/00:00:74:00:00/40 tag 30 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11: hard resetting link
Jan 16 10:56:48 Scour kernel: sas: ata12: end_device-9:3: dev error handler
Jan 16 10:56:48 Scour kernel: ata12.00: exception Emask 0x0 SAct 0x3d00000 SErr 0x0 action 0x6 frozen
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/00:00:c8:92:04/04:00:03:00:00/40 tag 20 ncq dma 524288 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/08:00:88:19:89/00:00:74:00:00/40 tag 22 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/08:00:08:1a:7c/00:00:74:00:00/40 tag 23 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:a8:a8:f3:28/00:00:6c:00:00/40 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/08:00:e8:a7:ce/00:00:ae:00:00/40 tag 24 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/08:00:c8:dd:8a/00:00:74:00:00/40 tag 25 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12: hard resetting link
Jan 16 10:56:48 Scour kernel: sas: sas_form_port: phy2 belongs to port2 already(1)!
Jan 16 10:56:48 Scour kernel: sas: sas_form_port: phy1 belongs to port1 already(1)!
Jan 16 10:56:48 Scour kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Jan 16 10:56:50 Scour kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[3]:rc= 0
Jan 16 10:56:50 Scour kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[1]:rc= 0

I would recommend you don't try any more rebuilds/syncs before replacing the controller with an LSI.
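
The pattern pointed out above, several `ataN` ports hitting an exception in the same second, can be checked for in a saved syslog with a short script. This is only a sketch: the function name and the two-port threshold are arbitrary illustrative choices, and a match is a hint of the pattern, not a diagnosis.

```python
import re
from collections import defaultdict

# Matches libata lines such as:
#   "Jan 16 10:56:48 Scour kernel: ata10.00: exception Emask 0x0 ..."
EXCEPTION = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"(?P<port>ata\d+)\.\d+: exception Emask"
)


def simultaneous_exceptions(log_lines, min_ports=2):
    """Return {timestamp: [ports]} for every second in which at least
    min_ports different ataN ports logged an exception."""
    ports_by_stamp = defaultdict(set)
    for line in log_lines:
        match = EXCEPTION.match(line)
        if match:
            ports_by_stamp[match.group("stamp")].add(match.group("port"))
    return {
        stamp: sorted(ports, key=lambda p: int(p[3:]))
        for stamp, ports in ports_by_stamp.items()
        if len(ports) >= min_ports
    }


# Lines copied from the two log excerpts in this thread.
sample = [
    "Jan 15 14:36:44 Scour kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "Jan 16 10:56:48 Scour kernel: ata10.00: exception Emask 0x0 SAct 0x200 SErr 0x0 action 0x6 frozen",
    "Jan 16 10:56:48 Scour kernel: ata11.00: exception Emask 0x0 SAct 0x7d000000 SErr 0x0 action 0x6 frozen",
    "Jan 16 10:56:48 Scour kernel: ata12.00: exception Emask 0x0 SAct 0x3d00000 SErr 0x0 action 0x6 frozen",
]

print(simultaneous_exceptions(sample))
```

With these sample lines, only the Jan 16 10:56:48 entry is reported, listing ata10, ata11 and ata12. The single-port exception from Jan 15 is left out because it involves one disk only.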

DingHo · January 16, 2019

Thanks.

I found an LSI 9211-8i for sale locally. I can probably get it in the next day or so.

Would you be so kind as to outline the steps I'd need to take to get back up and running given my current state of affairs?

JorgeB · January 16, 2019

First, since disk7 dropped offline, you'll need to reboot and check SMART, but I expect it to be fine.

Then, assuming nothing was written to disk7 since it got disabled, you have two options: rebuild disk7 (on top of the old disk or to a new one), or do a new config and re-sync parity. If rebuilding on top of the old disk, first make sure the emulated disk is mounting correctly and the data appears fine; a filesystem check beforehand is also a good idea.

Safest option would be rebuilding to a new disk while keeping the old one intact in case something goes wrong during the rebuild.

DingHo · January 16, 2019

@johnnie.black Thanks, I really appreciate your help.

I'll rebuild with a new disk. Do these steps look correct?

1. Stop array
2. Remove Parity2 from array. (will use as new rebuild disk)
3. Power down
4. Replace SAS2LP with 9211-8i
5. Power on
6. Check SMART on disk7, confirm OK
7. Stop array, set disk7 to 'No Device'
8. Add old Parity2 as replacement disk7 for rebuilding
9. Profit.

NewDisplayName · January 16, 2019

Seems like the errors I had before I got my LSI.

JorgeB · January 16, 2019

6 minutes ago, DingHo said:

I'll rebuild with a new disk. Do these steps look correct?

You'll need to make Unraid "forget" parity2 before using it, you can do that by unassigning it and starting the array one time before powering down.

DingHo · January 16, 2019

16 minutes ago, nuhll said:

Seems like the errors I had before I got my LSI.

Thanks for confirming. Thought the SAS2LP would be more reliable. 😥

9 minutes ago, johnnie.black said:

You'll need to make Unraid "forget" parity2 before using it, you can do that by unassigning it and starting the array one time before powering down.

Will do. I'll update when I get the LSI card.

itimpi · January 16, 2019

2 hours ago, DingHo said:

Thought the SAS2LP would be more reliable.

It was very reliable on Unraid v5, which used 32-bit Linux underneath. It is since the move to 64-bit Linux (which underlies Unraid v6) that the unreliability started showing up. It is therefore probably something obscure in the 64-bit driver for that card, but since it has been an issue for some time I expect it is unlikely to be resolved any time soon.

JonathanM · January 16, 2019

After you get your system stable again, I recommend investigating what you need to do to get your drives migrated to anything other than ReiserFS. Typically it involves getting to a point where you have 1 totally empty data drive, formatting that drive as XFS or BTRFS, and copying the data off of one of your ReiserFS drives, formatting that drive, lather rinse repeat.

DingHo · January 17, 2019

4 hours ago, jonathanm said:

After you get your system stable again, I recommend investigating what you need to do to get your drives migrated to anything other than ReiserFS.

Thanks for the advice. I've been thinking of doing that.

DingHo · January 18, 2019

I picked up an IBM M1015 and flashed it to rev. P20. The rebuild has been proceeding successfully for ~9 hours. The disk 7 SMART test was fine.

Assuming the rebuild successfully completes by the time I wake up tomorrow, am I good to go? i.e. Can I re-add my previous disk 7 after a preclear, or should I run a parity check or anything else first?

Really appreciate the help I've received here.

JorgeB · January 18, 2019

A non-correcting parity check is always a good idea when there's new hardware, to make sure all is well.
