Disk offline - removed - Parity rebuild slow


aurevo


18 minutes ago, johnnie.black said:

Looks more like a connection/power issue, but it can be the controller.

 

I used the same power cables that were working for several months without problems.

 

The SAS-to-SATA cables are brand new, and I have already replaced one of them.

 

Do you have a recommendation for me?

 

Otherwise I would install the old port multiplier again and test if it works with it.

The current controller is a crossflashed D2607-A21.
But I don't know how I can test or see if it's the controller.

On 3/4/2020 at 6:41 PM, johnnie.black said:

Easiest way would be to test with another controller, but you could test that controller on another computer if possible.

 

I think in the next few days I will first install the old controller again and test it.

 

To swap the power supply I would have to take everything apart again.

 

I flashed the controller according to the following instructions: https://forums.unraid.net/topic/12114-lsi-controller-fw-updates-irit-modes/?do=findComment&comment=522038

Are those instructions still current, or are there newer options by now?

16 hours ago, johnnie.black said:

Those are current.

 

I just switched to the old SATA controller.

 

Diagnostics attached.

 

Now all disks are detected, but I can't mount the ddrescue destination disk. It was mountable before I changed the adapter; now the only option offered is to format it.

 

The four Toshiba HDDs are still showing an error state.

 

2020-03-08 00_38_33-Tower_Main.png

tower-diagnostics-20200308-0036.zip

2020-03-08 00_40_23-Tower_Dashboard.png

Edited by aurevo

That's one of the disks having ATA errors (hardware problem):

 

Mar  8 00:35:30 Tower kernel: ata5.00: status: { DRDY SENSE ERR }
Mar  8 00:35:30 Tower kernel: ata5.00: error: { ABRT }
Mar  8 00:35:30 Tower kernel: ata5.00: configured for UDMA/133 (device error ignored)
Mar  8 00:35:30 Tower kernel: ata5: EH complete
Mar  8 00:35:30 Tower kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar  8 00:35:30 Tower kernel: ata5.00: irq_stat 0x40000001
Mar  8 00:35:30 Tower kernel: ata5.00: failed command: READ DMA EXT
Mar  8 00:35:30 Tower kernel: ata5.00: cmd 25/00:08:00:f4:a0/00:00:ba:02:00/e0 tag 3 dma 4096 in
Mar  8 00:35:30 Tower kernel:         res 53/04:08:00:f4:a0/00:00:ba:02:00/e0 Emask 0x1 (device error)
Mar  8 00:35:30 Tower kernel: ata5.00: status: { DRDY SENSE ERR }
Mar  8 00:35:30 Tower kernel: ata5.00: error: { ABRT }
Mar  8 00:35:30 Tower kernel: ata5.00: configured for UDMA/133 (device error ignored)
Mar  8 00:35:30 Tower kernel: sd 5:0:0:0: [sdf] tag#3 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Mar  8 00:35:30 Tower kernel: sd 5:0:0:0: [sdf] tag#3 Sense Key : 0x5 [current]
Mar  8 00:35:30 Tower kernel: sd 5:0:0:0: [sdf] tag#3 ASC=0x21 ASCQ=0x4
Mar  8 00:35:30 Tower kernel: sd 5:0:0:0: [sdf] tag#3 CDB: opcode=0x88 88 00 00 00 00 02 ba a0 f4 00 00 00 00 08 00 00
Mar  8 00:35:30 Tower kernel: print_req_error: I/O error, dev sdf, sector 11721044992
Mar  8 00:35:30 Tower kernel: Buffer I/O error on dev sdf, logical block 1465130624, async page read

You still have some hardware issue causing problems with the devices across different controllers. I don't remember: did you try another power supply?
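As a cross-check on the log above, the three addresses it reports can be compared with a little shell arithmetic. This is only a sketch: it assumes 512-byte sectors and a 4096-byte block size for the "logical block" message (the "dma 4096" read of 8 sectors in the cmd line is consistent with that). The variable names are made up for illustration.

```shell
# Values copied from the kernel log above.
cdb_lba=$((0x2baa0f400))   # LBA bytes "00 00 00 02 ba a0 f4 00" from the CDB line
sector=11721044992         # from "I/O error, dev sdf, sector ..."
logical_block=1465130624   # from "Buffer I/O error ... logical block ..."

# The address in the CDB and the sector reported by the block layer
# should be the same number.
echo "CDB LBA in decimal: $cdb_lba"

# Assumption: 8 sectors of 512 bytes per 4096-byte logical block.
echo "sector / 8:         $((sector / 8))"

# Approximate byte offset of the failing read on the disk.
echo "byte offset:        $((sector * 512))"
```

If the first two printed values match `sector` and `logical_block`, all of these messages describe one and the same failed read rather than several different bad spots.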

3 hours ago, johnnie.black said:

That's one of the disks having ATA errors (hardware problem):

 



You still have some hardware issue causing problems with the devices across different controllers. I don't remember: did you try another power supply?

 

Were these errors present with both controllers?

This log is for sdf. Which other disks have these errors too?
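One way to see which devices are affected is to count the block-layer I/O errors per device in the syslog. The sketch below matches the "I/O error, dev sdf, sector ..." line format from the excerpt in this thread; the function name is hypothetical, and the log path in the usage comment is an assumption about where the live syslog is kept.

```shell
# Count kernel I/O errors per device from syslog text read on stdin.
# Only lines of the form "... I/O error, dev sdX, sector N" are counted;
# the "Buffer I/O error on dev sdX" follow-up lines are ignored.
io_errors_by_device() {
    grep -o 'I/O error, dev [a-z]*' | awk '{ print $NF }' | sort | uniq -c
}

# Usage (path is an assumption, adjust as needed):
# io_errors_by_device < /var/log/syslog
```

Each output line is a count followed by a device name, so a single noisy disk stands out from a problem that hits every disk on the same controller or power rail.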

 

No, I have not changed the power supply. I don't have a spare one at hand right now.

Edited by aurevo
  • 2 weeks later...
On 3/9/2020 at 6:34 PM, johnnie.black said:

IIRC there were errors with all controllers: on the original add-on controller with a port multiplier, on the LSI, and this time with the onboard SATA ports, hence why I suspect there could be a power issue.

 

The current status is as follows:

 

I have switched the CPU and the mainboard back to the old components. At the same time I replaced the power supply with an identical one.

After that I rebuilt parity, and it currently looks good, although the previously mentioned hard disks still show errors (not always; the errors disappeared temporarily).

 

Would you be so kind as to have another look at the logs?

 

Currently the hard drives are connected to the onboard ports and to the LSI controller.

 

In normal use, the server is running as desired again and no longer outputs errors.

tower-diagnostics-20200319-0125.zip

  • 2 weeks later...
On 3/19/2020 at 8:30 AM, johnnie.black said:

Everything looks normal on the diags posted, there aren't any disk errors.

Hello,

 

Until last night the server had been running for ten days without problems and with good performance, as it should.

 

The problems started when the JDownloader Docker was no longer available.

This morning I restarted the server and the 2TB disk was unreachable/disabled and emulated.


Since there is no important data on this disk, I created a new configuration and started the parity sync.

 

The log is now filling up with read errors on Disk3, so I have paused the sync for now.

tower-diagnostics-20200329-1333.zip

On 3/30/2020 at 11:56 AM, johnnie.black said:

Disk looks mostly OK but there are recent UNC @ LBA errors, you should run an extended SMART test.

 

I have removed the (possibly) defective 2TB and 4TB hard disks from the array, as they have caused problems in the past.

Now I would like to try to rescue the data with ddrescue according to your instructions.

The 4TB hard disk is still physically installed, and I have connected a 4TB external hard disk as the target.

 

Under Dashboard the old disk is shown as "sdd" (but not under UD or Main) and the new one as "sdm".

 

Unfortunately the copy does not start at the moment:

 

root@Tower:/# ddrescue -f /dev/sdd /dev/sdm /boot/ddrescue.log
GNU ddrescue 1.23
Press Ctrl-C to interrupt
Initial status (read from mapfile)
rescued: 4000 GB, tried: 0 B, bad-sector: 0 B, bad areas: 0

Current status
     ipos: 0 B, non-trimmed: 0 B, current rate: 0 B/s
     opos: 0 B, non-scraped: 0 B, average rate: 0 B/s
non-tried: 0 B, bad-sector: 0 B, error rate: 0 B/s
  rescued: 4000 GB, bad areas: 0, run time: 0s
pct rescued: 100.00%, read errors: 0, remaining time: n/a
                              time since last successful read: n/a
 

tower-diagnostics-20200402-0106.zip

On 4/2/2020 at 1:05 PM, johnnie.black said:

Did you ever use ddrescue before? If yes delete the log file (/boot/ddrescue.log) or use a different name.

 

Yes, I used it before. After deleting the log file, it worked.
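The earlier run reported "Initial status (read from mapfile)" with everything already rescued, so ddrescue was resuming from the old file. Instead of deleting it, the old mapfile can be renamed out of the way before a new rescue. A minimal sketch, with a hypothetical helper name and placeholder device names:

```shell
# Move an existing ddrescue mapfile aside so a new rescue does not resume from it.
# Prints the backup path if a file was moved; does nothing if there was none.
set_aside_mapfile() {
    mapfile_path=$1
    if [ -e "$mapfile_path" ]; then
        backup="$mapfile_path.$(date +%Y%m%d-%H%M%S).old"
        mv -- "$mapfile_path" "$backup" && echo "$backup"
    fi
}

# Usage sketch (sdX/sdY are placeholders, double-check them before running):
# set_aside_mapfile /boot/ddrescue.log
# ddrescue -f /dev/sdX /dev/sdY /boot/ddrescue.log
```

Keeping the old file is a judgment call: it costs almost nothing and still allows resuming the earlier rescue if that turns out to be needed.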

 

Then I got the error below. I think the disk I copied to was smaller than the disk I was copying from.

 

The copied disk is not mountable with UD; it only offers "Format".

 

root@Tower:~# ddrescue -f /dev/sde /dev/sda /boot/ddrescue.log
GNU ddrescue 1.23
Press Ctrl-C to interrupt
     ipos:  786902 MB, non-trimmed:        0 B,  current rate:    184 MB/s
     ipos:  814066 MB, non-trimmed:        0 B,  current rate:    186 MB/s
     ipos:    1861 GB, non-trimmed:        0 B,  current rate:    158 MB/s
     ipos:    4000 GB, non-trimmed:        0 B,  current rate:   4325 kB/s
     opos:    4000 GB, non-scraped:        0 B,  average rate:  50349 kB/s
non-tried:   34430 kB,  bad-sector:        0 B,    error rate:       0 B/s
  rescued:    4000 GB,   bad areas:        0,        run time: 22h  4m 18s
pct rescued:   99.99%, read errors:        0,  remaining time:          1s
                              time since last successful read:         n/a
Copying non-tried blocks... Pass 1 (forwards)
ddrescue: Write error: No space left on device
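A "No space left on device" error at the very end of the copy fits a target that is slightly smaller than the source, even when both are sold as 4TB. The sizes can be compared in bytes before starting a long clone. The sketch below uses a hypothetical helper; `blockdev --getsize64` prints the size of a block device in bytes, and the device names are placeholders.

```shell
# Succeeds only if the destination has at least as many bytes as the source.
target_is_big_enough() {
    src_bytes=$1
    dst_bytes=$2
    [ "$dst_bytes" -ge "$src_bytes" ]
}

# Usage sketch (sdX is the source, sdY the target):
# src=$(blockdev --getsize64 /dev/sdX)
# dst=$(blockdev --getsize64 /dev/sdY)
# if target_is_big_enough "$src" "$dst"; then
#     echo "target is large enough"
# else
#     echo "target is $((src - dst)) bytes too small"
# fi
```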
 

tower-diagnostics-20200403-1802.zip

Edited by aurevo
On 4/3/2020 at 6:39 PM, johnnie.black said:

Yes, not enough space on target disk, you can use a disk larger than 4TB (to mount with UD, not in the array).

 

Buying a larger hard disk than the one I own is pointless and too expensive for these purposes.

 

I had removed the 2TB and the 4TB hard disk, because I thought that they only cause errors.

However, for the second time I was able to clone the disk until just before the end (it stopped only because the target disk was too small), and then copy all the data that was on the 4TB disk to the array.


So I thought I might be able to add that disk back into the array, but again I get tons of errors.


I ran a SMART check and an extended SMART test, and the disk passed both.
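For reference, an extended test like this can also be started from the console with smartctl. The helper below is only a sketch with a made-up name; it prints the commands for a given device instead of running them, so the device name can be checked first (sdX is a placeholder).

```shell
# Print the smartctl commands for an extended self-test on one device.
# Nothing is executed here; run the printed lines once the device is confirmed.
smart_extended_cmds() {
    dev=$1
    echo "smartctl -t long $dev       # start the extended (long) self-test"
    echo "smartctl -l selftest $dev   # read the self-test log after it finishes"
    echo "smartctl -A $dev            # show the vendor attribute table"
}

smart_extended_cmds /dev/sdX
```

A passed extended test means the drive could read its own surface at that moment. It does not rule out cabling, power, or controller problems, which only show up while data is actually being transferred.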

 

Is the disk broken? If not, why do I only get these errors in the array, while everything else works fine?
