DingHo Posted January 15, 2019
Hello, I just bought 2 new 4TB drives with the intention of adding one for data and the other as a 2nd parity drive. Both successfully pre-cleared. I then ran a parity check successfully. I stopped the array, added one as data, restarted the array successfully. Stopped the array again, added the next drive as 2nd parity, started the array and began the rebuild. It ran fine for about 6 hours then slowed to a crawl. I stopped the rebuild. I found several errors in the log:
Jan 15 14:36:44 Scour kernel: sas: sas_eh_handle_sas_errors: task 0x00000000675fa35a is aborted
Jan 15 14:36:44 Scour kernel: sas: ata11: end_device-9:2: cmd error handler
Jan 15 14:36:44 Scour kernel: sas: ata9: end_device-9:0: dev error handler
Jan 15 14:36:44 Scour kernel: sas: ata10: end_device-9:1: dev error handler
Jan 15 14:36:44 Scour kernel: sas: ata11: end_device-9:2: dev error handler
Jan 15 14:36:44 Scour kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 15 14:36:44 Scour kernel: ata11.00: failed command: READ DMA EXT
Jan 15 14:36:44 Scour kernel: ata11.00: cmd 25/00:00:98:12:6e/00:04:a7:00:00/e0 tag 17 dma 524288 in
Jan 15 14:36:44 Scour kernel: res 40/00:80:28:0d:71/00:00:29:00:00/40 Emask 0x4 (timeout)
Jan 15 14:36:44 Scour kernel: ata11.00: status: { DRDY }
Jan 15 14:36:44 Scour kernel: ata11: hard resetting link
Jan 15 14:36:44 Scour kernel: sas: ata12: end_device-9:3: dev error handler
Jan 15 14:36:44 Scour kernel: sas: sas_form_port: phy2 belongs to port2 already(1)!
Jan 15 14:36:47 Scour kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[2]:rc= 0
Jan 15 14:36:47 Scour kernel: ata11.00: configured for UDMA/100
Jan 15 14:36:47 Scour kernel: ata11: EH complete
Jan 15 14:36:47 Scour kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Jan 15 15:10:38 Scour kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 15 15:10:38 Scour kernel: sas: trying to find task 0x00000000691cfafa
Jan 15 15:10:38 Scour kernel: sas: sas_scsi_find_task: aborting task 0x00000000691cfafa
Jan 15 15:10:38 Scour kernel: sas: sas_scsi_find_task: task 0x00000000691cfafa is aborted
Jan 15 15:10:38 Scour kernel: sas: sas_eh_handle_sas_errors: task 0x00000000691cfafa is aborted
Jan 15 15:10:38 Scour kernel: sas: ata11: end_device-9:2: cmd error handler
Jan 15 15:10:38 Scour kernel: sas: ata9: end_device-9:0: dev error handler
Jan 15 15:10:38 Scour kernel: sas: ata10: end_device-9:1: dev error handler
Jan 15 15:10:38 Scour kernel: sas: ata11: end_device-9:2: dev error handler
Jan 15 15:10:38 Scour kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 15 15:10:38 Scour kernel: ata11.00: failed command: READ DMA EXT
Jan 15 15:10:38 Scour kernel: ata11.00: cmd 25/00:00:60:95:db/00:04:b7:00:00/e0 tag 19 dma 524288 in
Jan 15 15:10:38 Scour kernel: res 40/00:00:67:17:b6/00:00:7b:00:00/e0 Emask 0x4 (timeout)
Jan 15 15:10:38 Scour kernel: ata11.00: status: { DRDY }
Jan 15 15:10:38 Scour kernel: sas: ata12: end_device-9:3: dev error handler
Jan 15 15:10:38 Scour kernel: ata11: hard resetting link
Jan 15 15:10:39 Scour kernel: sas: sas_form_port: phy2 belongs to port2 already(1)!
Jan 15 15:10:41 Scour kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[2]:rc= 0
Jan 15 15:10:41 Scour kernel: ata11.00: configured for UDMA/100
Jan 15 15:10:41 Scour kernel: ata11: EH complete
Jan 15 15:10:41 Scour kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Jan 15 16:46:31 Scour emhttpd: req (13): startState=STARTED&file=&csrf_token=****************&cmdNoCheck=Cancel
Jan 15 16:46:31 Scour kernel: mdcmd (48): nocheck
Jan 15 16:46:41 Scour kernel: md: md_do_sync: got signal, exit...
Jan 15 16:46:49 Scour kernel: md: recovery thread: completion status: -4
Jan 15 16:47:01 Scour sSMTP[18094]: Creating SSL connection to host
Jan 15 16:47:01 Scour sSMTP[18094]: SSL connection using ECDHE-RSA-CHACHA20-POLY1305
Is there a way to determine which drives are ATA 9-12? I have a SAS2LP-MV8 installed, in addition to the Intel and ASMedia SATA ports. I've tried to d/l the diagnostics, but it seems to have hung, and I can't access any shares anymore so I'll just power it down for now. Any help is appreciated.
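On the question of which drives are ATA 9-12: the libsas lines in the log already pair each ATA port with a SAS end device (for example, ata11 with end_device-9:2). The following is a minimal Python sketch of how that pairing could be followed through to a disk name. It assumes the usual Linux sysfs layout, where the resolved path of /sys/block/sdX contains the end_device name for disks behind a SAS controller. The function names are hypothetical, and the sysfs layout is an assumption that was not confirmed anywhere in this thread.

```python
import os
import re
from typing import Dict, Iterable, List

# Matches libsas lines such as:
#   "... kernel: sas: ata11: end_device-9:2: cmd error handler"
PORT_RE = re.compile(r"sas: (ata\d+): (end_device-\d+:\d+):")


def ata_to_end_device(log_lines: Iterable[str]) -> Dict[str, str]:
    """Collect the ataN -> end_device-H:P pairs that appear in syslog lines."""
    mapping: Dict[str, str] = {}
    for line in log_lines:
        match = PORT_RE.search(line)
        if match:
            mapping[match.group(1)] = match.group(2)
    return mapping


def end_device_to_disk(end_device: str, sysfs_paths: Dict[str, str]) -> List[str]:
    """Return block device names whose resolved sysfs path passes through end_device.

    sysfs_paths maps a block device name (e.g. "sdf") to its resolved sysfs path.
    """
    # Surrounding slashes keep end_device-9:2 from matching end_device-9:20.
    needle = "/" + end_device + "/"
    return sorted(name for name, path in sysfs_paths.items() if needle in path)


def read_sysfs_paths(root: str = "/sys/block") -> Dict[str, str]:
    """Resolve every /sys/block entry to its full device path (Linux only)."""
    return {name: os.path.realpath(os.path.join(root, name))
            for name in os.listdir(root)}
```

On the server itself, this could be used by feeding the syslog lines to `ata_to_end_device` and passing each resulting end device, together with `read_sysfs_paths()`, to `end_device_to_disk`. The disk serial numbers shown in the Unraid GUI next to each sdX name then identify the physical drive.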
JorgeB Posted January 15, 2019 (edited January 15, 2019 by johnnie.black)
Please post the diagnostics: Tools -> Diagnostics, even if you can only get them after rebooting.
trurl Posted January 15, 2019
1 hour ago, DingHo said: I've tried to d/l the diagnostics, but it seems to have hung
For future reference, see here about some ways to get diagnostics from the command line:
DingHo Posted January 15, 2019
Thank you. Had to do a hard reset, but everything came back up fine.
scour-diagnostics-20190116-0729.zip
JorgeB Posted January 15, 2019
Parity2 looks healthy. You're using a SAS2LP, and parity2 is connected to it. Without diags from when it happened it's difficult to say for sure, but those controllers are known to drop disks for no apparent reason and have not been recommended for Unraid for some time.
DingHo Posted January 16, 2019
Thanks @johnnie.black. I tried swapping parity2 and a data drive that was on the Intel SATA controller. My [faulty] thinking was that the 2nd parity might work if it was not on the troublesome SAS2LP. Parity sync started after reboot, ran for a bit, then started throwing a ton of this error:
Jan 16 11:06:03 Scour kernel: REISERFS error (device md3): vs-7000 search_by_entry_key: search_by_key returned item position == 0
I was able to get the diag file this time. The parity sync is actually still running now after all those errors. With the new logs, can you tell with more certainty whether this is likely a SAS2LP issue? Should I just pony up the dough for an LSI controller at this point? Thanks again.
scour-diagnostics-20190116-1102.zip
DingHo Posted January 16, 2019
I spoke too soon about the parity sync still running. The data drive I moved to the SAS2LP is now showing as offline (red X). I stopped the parity sync. Now I'm a bit nervous about next steps. Any help appreciated.
scour-diagnostics-20190116-1144.zip
JorgeB Posted January 16, 2019
Yes, it definitely looks like the typical SAS2LP problem, simultaneous timeout errors on multiple disks:
Jan 16 10:56:48 Scour kernel: sas: ata10: end_device-9:1: cmd error handler
Jan 16 10:56:48 Scour kernel: sas: ata11: end_device-9:2: cmd error handler
Jan 16 10:56:48 Scour kernel: sas: ata12: end_device-9:3: cmd error handler
Jan 16 10:56:48 Scour kernel: sas: ata9: end_device-9:0: dev error handler
Jan 16 10:56:48 Scour kernel: sas: ata10: end_device-9:1: dev error handler
Jan 16 10:56:48 Scour kernel: ata10.00: exception Emask 0x0 SAct 0x200 SErr 0x0 action 0x6 frozen
Jan 16 10:56:48 Scour kernel: ata10.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata10.00: cmd 60/00:00:c8:82:04/04:00:03:00:00/40 tag 9 ncq dma 524288 in
Jan 16 10:56:48 Scour kernel: res 40/00:40:08:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata10.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata10: hard resetting link
Jan 16 10:56:48 Scour kernel: sas: ata11: end_device-9:2: dev error handler
Jan 16 10:56:48 Scour kernel: ata11.00: exception Emask 0x0 SAct 0x7d000000 SErr 0x0 action 0x6 frozen
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:e8:4e:0d/00:00:50:00:00/40 tag 24 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel: res 40/00:ff:ff:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/00:00:c8:96:04/04:00:03:00:00/40 tag 26 ncq dma 524288 in
Jan 16 10:56:48 Scour kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:88:19:89/00:00:74:00:00/40 tag 27 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:08:1a:7c/00:00:74:00:00/40 tag 28 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel: res 40/00:d0:78:42:d3/00:00:36:00:00/40 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:e8:a7:ce/00:00:ae:00:00/40 tag 29 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata11.00: cmd 60/08:00:c8:dd:8a/00:00:74:00:00/40 tag 30 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata11.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata11: hard resetting link
Jan 16 10:56:48 Scour kernel: sas: ata12: end_device-9:3: dev error handler
Jan 16 10:56:48 Scour kernel: ata12.00: exception Emask 0x0 SAct 0x3d00000 SErr 0x0 action 0x6 frozen
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/00:00:c8:92:04/04:00:03:00:00/40 tag 20 ncq dma 524288 in
Jan 16 10:56:48 Scour kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/08:00:88:19:89/00:00:74:00:00/40 tag 22 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/08:00:08:1a:7c/00:00:74:00:00/40 tag 23 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel: res 40/00:a8:a8:f3:28/00:00:6c:00:00/40 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/08:00:e8:a7:ce/00:00:ae:00:00/40 tag 24 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 16 10:56:48 Scour kernel: ata12.00: cmd 60/08:00:c8:dd:8a/00:00:74:00:00/40 tag 25 ncq dma 4096 in
Jan 16 10:56:48 Scour kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 16 10:56:48 Scour kernel: ata12.00: status: { DRDY }
Jan 16 10:56:48 Scour kernel: ata12: hard resetting link
Jan 16 10:56:48 Scour kernel: sas: sas_form_port: phy2 belongs to port2 already(1)!
Jan 16 10:56:48 Scour kernel: sas: sas_form_port: phy1 belongs to port1 already(1)!
Jan 16 10:56:48 Scour kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Jan 16 10:56:50 Scour kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[3]:rc= 0
Jan 16 10:56:50 Scour kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[1]:rc= 0
I would recommend you don't try any more rebuilds/syncs before replacing the controller with an LSI.
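The "simultaneous timeout errors on multiple disks" pattern can be checked mechanically. The sketch below is a rough Python illustration, not a diagnostic tool: it groups the libata "exception Emask" lines by their syslog timestamp and reports any second in which more than one ATA port failed. Treating "same second" as "simultaneous" is a simplifying assumption, and the function name is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Matches lines such as:
#   "Jan 16 10:56:48 Scour kernel: ata10.00: exception Emask 0x0 SAct 0x200 ..."
# and captures the timestamp and the port name (ata10).
EXCEPTION_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) .*kernel: (ata\d+)\.\d+: exception Emask"
)


def ports_failing_together(log_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Return {timestamp: [ports]} for every second in which 2+ ports raised an exception."""
    by_time = defaultdict(set)
    for line in log_lines:
        match = EXCEPTION_RE.match(line)
        if match:
            by_time[match.group(1)].add(match.group(2))
    # Keep only the timestamps where more than one distinct port is involved.
    return {ts: sorted(ports) for ts, ports in by_time.items() if len(ports) > 1}
```

Applied to the log excerpts in this thread, a single-port event such as the ata11 timeout on Jan 15 would be filtered out, while the Jan 16 10:56:48 event involving ata10, ata11 and ata12 would be reported. Several disks timing out in the same instant points at something they share (controller, cable, or power) rather than at one failing drive.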
DingHo Posted January 16, 2019
Thanks. I found an LSI 8211-8i for sale locally. I can probably get it in the next day or so. Would you be so kind as to outline the steps I'd need to take to get back up and running given my current state of affairs?
JorgeB Posted January 16, 2019
First, since disk7 dropped offline, you'll need to reboot and check SMART, but I expect it to be fine. Then, assuming nothing was written to disk7 since it got disabled, you have two options: rebuild disk7 (on top of the old disk or to a new one), or do a new config and re-sync parity. If rebuilding on top of the old disk, first make sure the emulated disk is mounting correctly and the data appears fine; a filesystem check beforehand is also a good idea. The safest option would be rebuilding to a new disk while keeping the old one intact in case something goes wrong during the rebuild.
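For the "check SMART" step, the overall health verdict can also be read from the command line with smartmontools. A minimal Python sketch, assuming `smartctl` is installed and that the disk is an ATA/SATA drive, for which `smartctl -H` prints a line beginning "SMART overall-health self-assessment test result:". The function names and the device path are placeholders, not values from this thread.

```python
import re
import subprocess
from typing import Optional

# ATA/SATA drives report e.g.:
#   "SMART overall-health self-assessment test result: PASSED"
HEALTH_RE = re.compile(r"SMART overall-health self-assessment test result:\s*(\w+)")


def parse_health(smartctl_output: str) -> Optional[bool]:
    """Return True for PASSED, False for any other verdict, None if no verdict line is found."""
    match = HEALTH_RE.search(smartctl_output)
    if match is None:
        return None
    return match.group(1) == "PASSED"


def check_disk(device: str) -> Optional[bool]:
    """Run `smartctl -H` on a device (hypothetical path such as /dev/sdX) and parse the verdict."""
    # smartctl encodes several conditions in its exit status as a bitmask,
    # so a non-zero return code is deliberately not treated as an error here.
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return parse_health(result.stdout)
```

A PASSED verdict is only a coarse indicator. The full attribute table, in particular reallocated and pending sector counts, is still worth reviewing before trusting the disk for a rebuild.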
DingHo Posted January 16, 2019
@johnnie.black Thanks, I really appreciate your help. I'll rebuild with a new disk. Do these steps look correct?
1. Stop array
2. Remove Parity2 from array (will use as new rebuild disk)
3. Power down
4. Replace SAS2LP with 8211-8i
5. Power on
6. Check SMART on disk7, confirm OK
7. Stop array, set disk7 to 'No Device'
8. Add old Parity2 as replacement disk7 for rebuilding
9. Profit.
NewDisplayName Posted January 16, 2019
Seems like the errors I had before I got my LSI.
JorgeB Posted January 16, 2019
6 minutes ago, DingHo said: I'll rebuild with a new disk. Do these steps look correct?
You'll need to make Unraid "forget" parity2 before using it. You can do that by unassigning it and starting the array one time before powering down.
DingHo Posted January 16, 2019
16 minutes ago, nuhll said: Seems like errors i had before i got my LSI.
Thanks for confirming. Thought the SAS2LP would be more reliable. 😥
9 minutes ago, johnnie.black said: You'll need to make Unraid "forget" parity2 before using it, you can do that by unassigning it and starting the array one time before powering down.
Will do. I'll update when I get the LSI card.
itimpi Posted January 16, 2019
2 hours ago, DingHo said: Thought the SAS2LP would be more reliable.
It was very reliable on Unraid v5, which used 32-bit Linux underneath. It is only since the move to 64-bit Linux (which underlies Unraid v6) that the unreliability started showing up. It is therefore probably something obscure in the 64-bit driver for that card, but since it has been an issue for some time I expect it is unlikely to be resolved any time soon.
JonathanM Posted January 16, 2019
After you get your system stable again, I recommend investigating what you need to do to get your drives migrated to anything other than ReiserFS. Typically it involves getting to a point where you have one totally empty data drive, formatting that drive as XFS or BTRFS, copying the data off one of your ReiserFS drives onto it, then formatting that ReiserFS drive. Lather, rinse, repeat.
DingHo Posted January 17, 2019
4 hours ago, jonathanm said: After you get your system stable again, I recommend investigating what you need to do to get your drives migrated to anything other than ReiserFS.
Thanks for the advice. I've been thinking of doing that.
DingHo Posted January 18, 2019
I picked up an IBM M1015 and flashed it to rev. P20. The rebuild has been proceeding successfully for ~9 hours. Disk 7 SMART test was fine. Assuming the rebuild completes successfully by the time I wake up tomorrow, am I good to go? I.e., can I re-add my previous disk 7 after a preclear, or should I run a parity check or anything else first? Really appreciate the help I've received here.
JorgeB Posted January 18, 2019
A non-correcting parity check is always a good idea when there's new hardware, to make sure all is well.