Disk offline - removed - Parity rebuild slow


aurevo


Hello,

 

After I replaced the processor and mainboard and put the server back into operation, one hard disk was not available. As far as I could see, the SMART values were OK, but I wanted to remove the disk anyway, because a first rebuild attempt with the same disk was much too slow.

 

I created a new configuration and now the rebuild is running, but the speed is absolutely not ok:

Total size: 8 TB    
Elapsed time: 1 minute    
Current position: 40.8 MB (0.0 %)    
Estimated speed: 429.1 KB/sec    
Estimated finish: 215 days, 6 hours, 57 minutes

 

Parity sync / data rebuild was much faster in the past, and these values look like an error or point to some other cause.
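As a rough sanity check, the estimated finish time follows directly from the total size and the reported speed. A minimal sketch, assuming decimal units (1 KB = 1000 bytes, 8 TB = 8 * 10^12 bytes); `eta_days` is a hypothetical helper name, not an Unraid command:

```shell
# eta_days SIZE_BYTES SPEED_BYTES_PER_SEC
# Prints the whole number of days needed to process SIZE_BYTES at that speed.
# Assumption: decimal units (1 KB = 1000 bytes, 1 TB = 10^12 bytes).
eta_days() {
    echo $(( $1 / $2 / 86400 ))   # bytes / (bytes per second) / (seconds per day)
}

eta_days 8000000000000 429100     # 8 TB at 429.1 KB/sec -> 215
```

So the 215 days shown by the GUI are consistent with the reported speed; the estimate itself is not a display error, the throughput really is that low.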

 

The first log is from before removing the disk, the second from after removing it and starting the rebuild.

 

 

Can anyone see anything in the logs?

 

tower-diagnostics-20200218-2321.zip tower-diagnostics-20200218-2022.zip


You're using a controller with a SATA port multiplier (or it's connected to one), and these are not recommended for various reasons; in this case it keeps resetting the disks:

 

Feb 18 23:22:33 Tower kernel: ata10.00: failed to read SCR 1 (Emask=0x40)
Feb 18 23:22:33 Tower kernel: ata10.00: failed to read SCR 0 (Emask=0x40)
Feb 18 23:22:33 Tower kernel: ata10.01: failed to read SCR 1 (Emask=0x40)
Feb 18 23:22:33 Tower kernel: ata10.02: failed to read SCR 1 (Emask=0x40)
Feb 18 23:22:33 Tower kernel: ata10.00: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 18 23:22:33 Tower kernel: ata10.00: failed command: READ DMA EXT
Feb 18 23:22:33 Tower kernel: ata10.00: cmd 25/00:40:80:76:04/00:05:00:00:00/e0 tag 10 dma 688128 in
Feb 18 23:22:33 Tower kernel:         res 50/00:00:3f:86:04/00:00:00:00:00/e0 Emask 0x4 (timeout)
Feb 18 23:22:33 Tower kernel: ata10.00: status: { DRDY }
Feb 18 23:22:34 Tower kernel: ata10.15: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb 18 23:22:34 Tower kernel: ata10.00: hard resetting link
Feb 18 23:22:35 Tower kernel: ahci 0000:01:00.0: FBS is disabled
Feb 18 23:22:35 Tower kernel: ahci 0000:01:00.0: FBS is enabled
Feb 18 23:22:35 Tower kernel: ata10.00: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb 18 23:22:35 Tower kernel: ahci 0000:01:00.0: FBS is disabled
Feb 18 23:22:35 Tower kernel: ahci 0000:01:00.0: FBS is enabled
Feb 18 23:22:35 Tower kernel: ata10.01: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb 18 23:22:36 Tower kernel: ahci 0000:01:00.0: FBS is disabled
Feb 18 23:22:36 Tower kernel: ahci 0000:01:00.0: FBS is enabled
Feb 18 23:22:36 Tower kernel: ata10.02: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb 18 23:22:36 Tower kernel: ata10.00: configured for UDMA/33
Feb 18 23:22:36 Tower kernel: ata10.01: configured for UDMA/133
Feb 18 23:22:36 Tower kernel: ata10.02: configured for UDMA/133
Feb 18 23:22:36 Tower kernel: ata10: EH complete
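To see how often this happens and on which links, the resets can be counted per ATA port. A minimal sketch, assuming the syslog line format shown above (port name in the sixth whitespace-separated field); `count_resets` is a hypothetical helper name, and the path in the usage comment is an assumption:

```shell
# count_resets FILE
# Prints "<count> <port>" for every ATA link that logged "hard resetting link".
# Assumption: lines look like "Feb 18 23:22:34 Tower kernel: ata10.00: hard resetting link".
count_resets() {
    awk '/hard resetting link/ {
             sub(/:$/, "", $6)      # strip the trailing colon from e.g. "ata10.00:"
             n[$6]++
         }
         END { for (p in n) print n[p], p }' "$1" | sort -k2
}

# Usage (path is an assumption):
#   count_resets /var/log/syslog
```

If the count keeps climbing during a rebuild, the link behind that port is being reset again and again, which fits the extremely low speed.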

 

On 2/19/2020 at 8:33 AM, johnnie.black said:

You're using a controller with a SATA port multiplier (or it's connected to one), and these are not recommended for various reasons; in this case it keeps resetting the disks.
 

 

This controller worked for weeks and months without such problems, so why do the problems start now?

 

So at the moment it doesn't look like any of the disks is the cause?

4 hours ago, johnnie.black said:

Don't know, just know that those problems are typical when using port multipliers.

 

Do you have any other solutions?

 

It really worked for months without such problems. The controller can't just start failing now; the only thing I changed is the mainboard, so I suspect the problem is there.

On 2/20/2020 at 6:50 PM, johnnie.black said:

My solution would be to replace the known problematic controller; other than that, I can't help.

 

Even though I absolutely can't understand why the controller worked for months and now suddenly doesn't, I swapped it for a Fujitsu D2607-8i and crossflashed it to an LSI 9211-8i.

 

Now I have the next problem. The hard disks were all recognized again and I wanted to start the array to do a rebuild / parity sync.

Now it shows me an unmountable filesystem for disk 3.


Is there anything I can do here, or have I lost data again thanks to Unraid?

tower-diagnostics-20200226-1508.zip


There's a problem with disk3:

Feb 26 15:06:30 Tower kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Feb 26 15:06:30 Tower kernel: sd 1:0:6:0: [sdl] tag#1698 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Feb 26 15:06:30 Tower kernel: sd 1:0:6:0: [sdl] tag#1698 Sense Key : 0x2 [current]
Feb 26 15:06:30 Tower kernel: sd 1:0:6:0: [sdl] tag#1698 ASC=0x4 ASCQ=0x0
Feb 26 15:06:30 Tower kernel: sd 1:0:6:0: [sdl] tag#1698 CDB: opcode=0x88 88 00 00 00 00 00 ae a8 a8 58 00 00 01 08 00 00
Feb 26 15:06:30 Tower kernel: print_req_error: I/O error, dev sdl, sector 2930288728
Feb 26 15:06:30 Tower kernel: md: disk3 read error, sector=2930288664
Feb 26 15:06:30 Tower kernel: md: disk3 read error, sector=2930288672
Feb 26 15:06:30 Tower kernel: md: disk3 read error, sector=2930288680
Feb 26 15:06:30 Tower kernel: md: disk3 read error, sector=2930288688
Feb 26 15:06:30 Tower kernel: md: disk3 read error, sector=2930288696
Feb 26 15:06:30 Tower kernel: md: disk3 read error, sector=2930288704

 

Start by replacing/swapping cables.

58 minutes ago, johnnie.black said:

Looking again at the original diags, it's possible, even probable, that disk3 really is having issues; before, because of the port multiplier, the problem appeared to affect multiple disks, namely all the ones connected to the same port.

 

Changed the cable and moved the disk to an internal port on the mainboard.

 

Total size: 8 TB
Elapsed time: less than a minute
Current position: 22.0 MB (0.0 %)
Estimated speed: 759.3 KB/sec
Estimated finish: 121 days, 9 hours, 38 minutes

 

Now it is incredibly slow again, but the filesystem can be read again.

1 hour ago, johnnie.black said:

Again issues with disk3. Since you've already used a different controller and SATA cable, also try a different power cable. If the issues persist, it's likely a disk problem, but because parity is invalid you can't rebuild it. See if you can copy any important data to another disk; failing that, try ddrescue.

 

Also changed the power cable.

 

The copy via Krusader started at a good speed, then got slower and finally aborted.

 

Interesting that the disk was not recognized at all on the Fujitsu controller and is now recognized, but partially defective.

 

Does ddrescue also work on encrypted volumes?

 

And either way I need another disk with the same capacity or more, right?

11 hours ago, aurevo said:

Interesting that the disk was not recognized at all on the Fujitsu controller and is now recognized,

It was recognized, but it gave errors immediately and didn't even mount; different drivers can behave differently when there are issues.

 

11 hours ago, aurevo said:

Does ddrescue also work on encrypted volumes?

And either way I need another disk with the same capacity or more, right?

Yes to both.

 

 

 

On 2/27/2020 at 8:27 AM, johnnie.black said:

It was recognized, but it gave errors immediately and didn't even mount; different drivers can behave differently when there are issues.

 

Yes to both.

 

 

 

Let me ask one more time before I make a mistake and the data is lost for good:

 

My defective disk is part of the array, my new disk isn't. The disks are not mounted and I use ddrescue as follows: ddrescue -f /dev/sdX /dev/sdY /boot/ddrescue.log

 

How do I decrypt the data, or is the hard disk copied encrypted to the new hard disk?

 

And which hard disk is my source and which destination?

On 2/29/2020 at 3:42 PM, aurevo said:

How do I decrypt the data, or is the hard disk copied encrypted to the new hard disk?

dd/ddrescue makes a bit-for-bit copy, including the encryption.

 

On 2/29/2020 at 3:42 PM, aurevo said:

And which hard disk is my source and which destination?

Source first, then destination.
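Because swapping the two arguments would overwrite the failing disk, it can help to print the command for review before running it. A minimal sketch using the device names and mapfile path from the post above; `ddrescue_cmd` is a hypothetical helper that only prints the command line and never touches a disk:

```shell
# ddrescue_cmd SOURCE DESTINATION MAPFILE
# Prints the ddrescue command line for review; it does NOT run it.
# Argument order: the failing disk (source) first, the new disk (destination) second.
ddrescue_cmd() {
    src=$1 dst=$2 map=$3
    if [ "$src" = "$dst" ]; then
        echo "source and destination must differ" >&2
        return 1
    fi
    # -f is what allows ddrescue to overwrite a block device as output.
    echo "ddrescue -f $src $dst $map"
}

# Placeholders from the post: sdX = old failing disk, sdY = new disk.
# One way to tell the disks apart first: lsblk -o NAME,SIZE,MODEL,SERIAL
ddrescue_cmd /dev/sdX /dev/sdY /boot/ddrescue.log
```

Only run the printed command once the model and serial number of both devices have been double-checked.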

 

 

24 minutes ago, johnnie.black said:

On the GUI's main page.

So it has now been running for at least six hours.

 

Once it finishes, how do I get UnRAID to accept the new cloned disk as the old one?

Do I assign the new hard disk to the old slot, or can I leave it in the new slot and UnRAID notices that the old data is there?

 

If the SSH connection terminates unexpectedly, is ddrescue still running in the background or do I need to restart the process? And if it is aborted, does it start over or continue from where it stopped?

25 minutes ago, aurevo said:

Once it finishes, how do I get UnRAID to accept the new cloned disk as the old one?

Tools -> New Config -> keep all assignments

Then unassign the old disk and assign the new one in its place (the new disk needs to be the same capacity as the old one to be used in the array)

Check "parity is already valid" before starting the array (but if there are errors from ddrescue, a parity check will be needed)

 

25 minutes ago, aurevo said:

If the SSH connection terminates unexpectedly, is ddrescue still running in the background or do I need to restart the process?

You'll need to restart it; using "screen" gets around that.
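A minimal sketch of running a long command inside a detached screen session, so that a dropped SSH connection does not kill it. It assumes screen is installed; `run_detached` and the session name `rescue` are hypothetical names chosen for this example:

```shell
# run_detached SESSION COMMAND [ARGS...]
# Starts COMMAND inside a new, detached screen session named SESSION.
# Assumption: screen is installed and on the PATH.
run_detached() {
    session=$1; shift
    if ! command -v screen >/dev/null 2>&1; then
        echo "screen not found" >&2
        return 127
    fi
    screen -dmS "$session" "$@"   # -dm: start detached, -S: name the session
}

# Usage (not run here):
#   run_detached rescue ddrescue -f /dev/sdX /dev/sdY /boot/ddrescue.log
#   screen -ls          # list sessions
#   screen -r rescue    # reattach; detach again with Ctrl-A, then D
```

If ddrescue does get interrupted, rerunning the same command with the same mapfile (`/boot/ddrescue.log` here) lets it continue from the recorded position instead of starting over.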

 

 

 

2 hours ago, johnnie.black said:

Tools -> New Config -> keep all assignments

Then unassign the old disk and assign the new one in its place (the new disk needs to be the same capacity as the old one to be used in the array)

Check "parity is already valid" before starting the array (but if there are errors from ddrescue, a parity check will be needed)

You'll need to restart it; using "screen" gets around that.

 

 

 

 

The old one was 4 TB, the new one is 6 TB. So can I only mount it and copy the data over, or may the new disk be the same size or bigger?

A parity check / rebuild is needed anyway because parity was not valid; that was the reason for using ddrescue.

 

Screen is not installed. Can I use the one from the Nerd Pack?


You can still mount it with UD and then copy the data to the array. Unraid won't accept it as an array disk, since the partition won't be using the full disk. Alternatively, you could sync parity with the disk unmountable and then rebuild; the partition would be correctly recreated during that, but it will take longer.

 

4 minutes ago, aurevo said:

Screen is not installed. How to install it?

Install the Nerd Pack plugin; you can then install screen from there.
