Adding 2nd Parity Fails


DingHo


Hello,

I just bought 2 new 4TB drives with the intention of adding one for data and the other as a 2nd parity drive.  Both successfully pre-cleared.  I then ran a parity check successfully. I stopped the array, added one as data, restarted the array successfully.  Stopped the array again, added the next drive as 2nd parity, started the array and began the rebuild.  It ran fine for about 6 hours then slowed to a crawl.  I stopped the rebuild.  I found several errors in the log:

 

Jan 15 14:36:44 Scour kernel: sas: sas_eh_handle_sas_errors: task 0x00000000675fa35a is aborted

Jan 15 14:36:44 Scour kernel: sas: ata11: end_device-9:2: cmd error handler

Jan 15 14:36:44 Scour kernel: sas: ata9: end_device-9:0: dev error handler

Jan 15 14:36:44 Scour kernel: sas: ata10: end_device-9:1: dev error handler

Jan 15 14:36:44 Scour kernel: sas: ata11: end_device-9:2: dev error handler

Jan 15 14:36:44 Scour kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Jan 15 14:36:44 Scour kernel: ata11.00: failed command: READ DMA EXT

Jan 15 14:36:44 Scour kernel: ata11.00: cmd 25/00:00:98:12:6e/00:04:a7:00:00/e0 tag 17 dma 524288 in

Jan 15 14:36:44 Scour kernel: res 40/00:80:28:0d:71/00:00:29:00:00/40 Emask 0x4 (timeout)

Jan 15 14:36:44 Scour kernel: ata11.00: status: { DRDY }

Jan 15 14:36:44 Scour kernel: ata11: hard resetting link

Jan 15 14:36:44 Scour kernel: sas: ata12: end_device-9:3: dev error handler

Jan 15 14:36:44 Scour kernel: sas: sas_form_port: phy2 belongs to port2 already(1)!

Jan 15 14:36:47 Scour kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[2]:rc= 0

Jan 15 14:36:47 Scour kernel: ata11.00: configured for UDMA/100

Jan 15 14:36:47 Scour kernel: ata11: EH complete

Jan 15 14:36:47 Scour kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

Jan 15 15:10:38 Scour kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1

Jan 15 15:10:38 Scour kernel: sas: trying to find task 0x00000000691cfafa

Jan 15 15:10:38 Scour kernel: sas: sas_scsi_find_task: aborting task 0x00000000691cfafa

Jan 15 15:10:38 Scour kernel: sas: sas_scsi_find_task: task 0x00000000691cfafa is aborted

Jan 15 15:10:38 Scour kernel: sas: sas_eh_handle_sas_errors: task 0x00000000691cfafa is aborted

Jan 15 15:10:38 Scour kernel: sas: ata11: end_device-9:2: cmd error handler

Jan 15 15:10:38 Scour kernel: sas: ata9: end_device-9:0: dev error handler

Jan 15 15:10:38 Scour kernel: sas: ata10: end_device-9:1: dev error handler

Jan 15 15:10:38 Scour kernel: sas: ata11: end_device-9:2: dev error handler

Jan 15 15:10:38 Scour kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Jan 15 15:10:38 Scour kernel: ata11.00: failed command: READ DMA EXT

Jan 15 15:10:38 Scour kernel: ata11.00: cmd 25/00:00:60:95:db/00:04:b7:00:00/e0 tag 19 dma 524288 in

Jan 15 15:10:38 Scour kernel: res 40/00:00:67:17:b6/00:00:7b:00:00/e0 Emask 0x4 (timeout)

Jan 15 15:10:38 Scour kernel: ata11.00: status: { DRDY }

Jan 15 15:10:38 Scour kernel: sas: ata12: end_device-9:3: dev error handler

Jan 15 15:10:38 Scour kernel: ata11: hard resetting link

Jan 15 15:10:39 Scour kernel: sas: sas_form_port: phy2 belongs to port2 already(1)!

Jan 15 15:10:41 Scour kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[2]:rc= 0

Jan 15 15:10:41 Scour kernel: ata11.00: configured for UDMA/100

Jan 15 15:10:41 Scour kernel: ata11: EH complete

Jan 15 15:10:41 Scour kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

Jan 15 16:46:31 Scour emhttpd: req (13): startState=STARTED&file=&csrf_token=****************&cmdNoCheck=Cancel

Jan 15 16:46:31 Scour kernel: mdcmd (48): nocheck

Jan 15 16:46:41 Scour kernel: md: md_do_sync: got signal, exit...

Jan 15 16:46:49 Scour kernel: md: recovery thread: completion status: -4

Jan 15 16:47:01 Scour sSMTP[18094]: Creating SSL connection to host

Jan 15 16:47:01 Scour sSMTP[18094]: SSL connection using ECDHE-RSA-CHACHA20-POLY1305
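For anyone curious what the `cmd` field in the `failed command` entries encodes, here is a minimal Python sketch that decodes it. It assumes the field order of the Linux libata error report (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and 512-byte logical sectors. `Taskfile` and `decode_taskfile` are names made up for this illustration, and only the 48-bit read/write commands that show up in logs like this one are handled.

```python
from dataclasses import dataclass

SECTOR_BYTES = 512  # assumption: 512-byte logical sectors

# 48-bit commands this sketch understands (everything else is rejected).
EXT_COMMANDS = {0x25: "READ DMA EXT", 0x35: "WRITE DMA EXT"}
NCQ_COMMANDS = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}


@dataclass
class Taskfile:
    command: int   # ATA command opcode
    lba: int       # starting logical block address
    sectors: int   # number of sectors requested


def decode_taskfile(field: str) -> Taskfile:
    """Decode a libata 'cmd' field such as '25/00:00:98:12:6e/00:04:a7:00:00/e0'."""
    cmd_hex, low, high, _device = field.strip().split("/")
    command = int(cmd_hex, 16)
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )

    # 48-bit LBA: the "hob" (high order byte) registers hold bits 24..47.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )

    if command in NCQ_COMMANDS:
        # NCQ commands carry the sector count in the feature registers.
        count = (hob_feature << 8) | feature
    elif command in EXT_COMMANDS:
        count = (hob_nsect << 8) | nsect
    else:
        raise ValueError(f"unsupported command 0x{command:02x}")

    # For 48-bit commands a count of 0 means 65536 sectors.
    return Taskfile(command=command, lba=lba, sectors=count or 65536)
```

Applied to the first failed command above (`25/00:00:98:12:6e/00:04:a7:00:00/e0`), this gives 1024 sectors, which at 512 bytes each matches the `dma 524288` reported on the same log line.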

 

Is there a way to determine which drives are ATA 9-12?
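One possible approach, sketched in Python: the `sas:` lines in the log pair each `ataN` with an `end_device-H:P` name, and on Linux the resolved `/sys/block/sdX` path of a disk behind a SAS controller normally contains that same `end_device-H:P` component. The sketch below relies on that assumption; the helper names are hypothetical and nothing here is Unraid-specific.

```python
import os
import re
from typing import Dict, Iterable, Optional

# Matches e.g. "... kernel: sas: ata11: end_device-9:2: cmd error handler"
LINK_RE = re.compile(r"sas: (ata\d+): (end_device-\d+:\d+):")


def ata_to_end_device(log_lines: Iterable[str]) -> Dict[str, str]:
    """Collect the ataN -> end_device-H:P pairs mentioned in the log."""
    mapping: Dict[str, str] = {}
    for line in log_lines:
        match = LINK_RE.search(line)
        if match:
            mapping[match.group(1)] = match.group(2)
    return mapping


def read_sys_block(root: str = "/sys/block") -> Dict[str, str]:
    """Resolve every /sys/block entry to its full device path."""
    return {
        name: os.path.realpath(os.path.join(root, name))
        for name in sorted(os.listdir(root))
    }


def find_block_device(end_device: str, sys_paths: Dict[str, str]) -> Optional[str]:
    """Return the block device whose sysfs path passes through end_device."""
    needle = f"/{end_device}/"  # slashes avoid matching 9:2 inside 9:20
    for dev, path in sys_paths.items():
        if needle in path:
            return dev
    return None
```

With the server up, feeding the syslog lines to `ata_to_end_device` and then calling `find_block_device(mapping["ata11"], read_sys_block())` should yield an `sdX` name, which can be matched to a drive serial number through the symlinks in `/dev/disk/by-id`.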

I have a SAS2LP-MV8 installed, in addition to the Intel and ASMedia SATA ports.  I tried to download the diagnostics, but it seems to have hung, and I can't access any shares anymore, so I'll just power it down for now.

 

Any help is appreciated.

 

 


Thanks @johnnie.black.

I tried swapping parity2 and a data drive that was on the Intel SATA controller.  My [faulty] thinking was the 2nd parity might work if not on the troublesome SAS2LP.  Parity Sync started after reboot, ran for a bit, then started throwing up a ton of this error:

Jan 16 11:06:03 Scour kernel: REISERFS error (device md3): vs-7000 search_by_entry_key: search_by_key returned item position == 0

 

I was able to get the diag file this time.  The parity sync is actually still running now after all those errors. 

With the new logs, can you tell with more certainty whether this is likely a SAS2LP issue?  Should I just pony up the dough for an LSI controller at this point?


Thanks again.

 

scour-diagnostics-20190116-1102.zip


Yes, it definitely looks like the typical SAS2LP problem, simultaneous timeout errors on multiple disks:

 

Jan 16 10:56:48 Scour kernel: sas: ata10: end_device-9:1: cmd error handler
Jan 16 10:56:48 Scour kernel: sas: ata11: end_device-9:2: cmd error handler
Jan 16 10:56:48 Scour kernel: sas: ata12: end_device-9:3: cmd error handler
Jan 16 10:56:48 Scour kernel: sas: ata9: end_device-9:0: dev error handler
Jan 16 10:56:48 Scour kernel: sas: ata10: end_device-9:1: dev error handler
Jan 16 10:56:48 Scour kernel: ata10.00: exception Emask 0x0 SAct 0x200 SErr 0x0 action 0x6 frozen
Jan 16 10:56:48 Scour kernel: ata10.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata10.00: cmd 60/00:00:c8:82:04/04:00:03:00:00/40 tag 9 ncq dma 524288 in
Jan 16 10:56:48 Scour kernel:         res 40/00:40:08:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata10.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata10: hard resetting link
Jan 16 10:56:48 Scour kernel: sas: ata11: end_device-9:2: dev error handler
Jan 16 10:56:48 Scour kernel: ata11.00: exception Emask 0x0 SAct 0x7d000000 SErr 0x0 action 0x6 frozen
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:e8:4e:0d/00:00:50:00:00/40 tag 24 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:ff:ff:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/00:00:c8:96:04/04:00:03:00:00/40 tag 26 ncq dma 524288 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:88:19:89/00:00:74:00:00/40 tag 27 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:08:1a:7c/00:00:74:00:00/40 tag 28 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:d0:78:42:d3/00:00:36:00:00/40 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:e8:a7:ce/00:00:ae:00:00/40 tag 29 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:c8:dd:8a/00:00:74:00:00/40 tag 30 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11: hard resetting link
Jan 16 10:56:48 Scour kernel: sas: ata12: end_device-9:3: dev error handler
Jan 16 10:56:48 Scour kernel: ata12.00: exception Emask 0x0 SAct 0x3d00000 SErr 0x0 action 0x6 frozen
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/00:00:c8:92:04/04:00:03:00:00/40 tag 20 ncq dma 524288 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/08:00:88:19:89/00:00:74:00:00/40 tag 22 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/08:00:08:1a:7c/00:00:74:00:00/40 tag 23 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:a8:a8:f3:28/00:00:6c:00:00/40 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/08:00:e8:a7:ce/00:00:ae:00:00/40 tag 24 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/08:00:c8:dd:8a/00:00:74:00:00/40 tag 25 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel:         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12: hard resetting link
Jan 16 10:56:48 Scour kernel: sas: sas_form_port: phy2 belongs to port2 already(1)!
Jan 16 10:56:48 Scour kernel: sas: sas_form_port: phy1 belongs to port1 already(1)!
Jan 16 10:56:48 Scour kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Jan 16 10:56:50 Scour kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[3]:rc= 0
Jan 16 10:56:50 Scour kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[1]:rc= 0

I would recommend you don't try any more rebuilds/syncs before replacing the controller with an LSI.
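The pattern described above, several ports logging failed commands in the same second, can also be searched for in a saved syslog. The following is only a rough Python sketch: the function name and the two-port threshold are arbitrary choices for illustration, and it keys on the `failed command` lines rather than on the `(timeout)` result lines that follow them.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Matches e.g. "Jan 16 10:56:48 Scour kernel: ata10.00: failed command: READ FPDMA QUEUED"
FAILED_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+(ata\d+)\.\d+: failed command:"
)


def ports_failing_together(
    lines: Iterable[str], min_ports: int = 2
) -> Dict[str, Set[str]]:
    """Return timestamps at which at least min_ports distinct ports failed."""
    by_time: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        match = FAILED_RE.match(line)
        if match:
            timestamp, port = match.groups()
            by_time[timestamp].add(port)
    return {ts: ports for ts, ports in by_time.items() if len(ports) >= min_ports}
```

A single disk acting up would usually show one port per timestamp, while a controller-side problem like the one in this log shows `ata10`, `ata11` and `ata12` all under `Jan 16 10:56:48`.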


First, since disk7 dropped offline, you'll need to reboot and check SMART, but I expect it to be fine.

Then, assuming nothing was written to disk7 since it got disabled, you have two options: rebuild disk7 (on top of the old disk or to a new one), or do a new config and re-sync parity. If you rebuild on top of the old disk, first make sure the emulated disk mounts correctly and the data appears fine; running a filesystem check beforehand is also a good idea.

 

Safest option would be rebuilding to a new disk while keeping the old one intact in case something goes wrong during the rebuild.

 

 


@johnnie.black Thanks, I really appreciate your help.

 

I'll rebuild with a new disk.  Do these steps look correct?

1. Stop array
2. Remove Parity2 from array. (will use as new rebuild disk)
3. Power down
4. Replace SAS2LP with LSI 9211-8i
5. Power on
6. Check SMART on disk7, confirm OK
7. Stop array, set disk7 to 'No Device'
8. Add old Parity2 as replacement disk7 for rebuilding
9. Profit.
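For step 6, the SMART check can be partly scripted. The sketch below is a minimal Python example that only parses the attribute table printed by `smartctl -A` (from smartmontools) and flags a few attributes that are commonly watched. The watch list is an assumption, attribute names vary between drive vendors, and the function names are hypothetical; it does not replace reading the full SMART report.

```python
import re
from typing import Dict

# Assumption: these are the attributes worth a second look when nonzero.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_smart_attributes(text: str) -> Dict[str, int]:
    """Map attribute name -> raw value from `smartctl -A` table output."""
    attrs: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = re.match(r"\d+", fields[9])  # RAW_VALUE may carry extra text
        if raw:
            attrs[fields[1]] = int(raw.group())
    return attrs


def smart_warnings(text: str) -> Dict[str, int]:
    """Return the watched attributes whose raw value is above zero."""
    attrs = parse_smart_attributes(text)
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}
```

An empty result from `smart_warnings` is a reasonable first sign that the disk itself is healthy, which would be consistent with the controller being the culprit.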

16 minutes ago, nuhll said:

Seems like the errors I had before I got my LSI.

Thanks for confirming.  Thought the SAS2LP would be more reliable.  😥

9 minutes ago, johnnie.black said:

You'll need to make Unraid "forget" parity2 before using it, you can do that by unassigning it and starting the array one time before powering down.

Will do.  I'll update when I get the LSI card.

2 hours ago, DingHo said:

Thought the SAS2LP would be more reliable.

It was very reliable on Unraid v5, which used 32-bit Linux underneath. The unreliability only started showing up with the move to 64-bit Linux (which underlies Unraid v6). It is therefore probably something obscure in the 64-bit driver for that card, but since it has been an issue for some time, I expect it is unlikely to be resolved any time soon.


After you get your system stable again, I recommend investigating what you need to do to migrate your drives to anything other than ReiserFS. Typically it involves getting to a point where you have one totally empty data drive, formatting that drive as XFS or BTRFS, copying the data off one of your ReiserFS drives onto it, reformatting the drive you just emptied, and then lather, rinse, repeat.


I picked up an IBM M1015 and flashed it to rev. P20.  The rebuild has been proceeding successfully for ~9 hours, and the disk 7 SMART test was fine.

 

Assuming the rebuild successfully completes by the time I wake up tomorrow, am I good to go?  i.e. Can I re-add my previous disk 7 after a preclear, or should I run a parity check or anything else first?

 

Really appreciate the help I've received here.
