Disk failure during a large transfer - need advice - General Support (V5 and Older)

October 11, 2009

I have several hot running Hitachi and Seagate 750 GB drives which I want to replace in my array with 1 TB and 1.5 TB models. The way I planned to do this was to add a 1.5 TB drive to the array, and MOVE the data from 2x 750 GB drives onto it. Then I was going to remove the drives I didn't need and replace with new drives.

The move from the first 750 GB drive went well. The second move resulted in 256 write errors on the 1.5 TB drive, causing it to red ball. However, the rest of the data move continued (to parity/phantom drive, I assume). Now the move is done, but I have a bad drive it seems. I wish I'd copied the data from the 750 GB drives and then deleted it, rather than moved it (in MC). Now I'm wondering whether I should copy it all back to those drives, so that it's on physical media rather than recreated from parity. I haven't restarted the array or anything. Wondering what to do until I can get my hands on another 1.5 TB drive.

Any suggestions?

October 11, 2009

First, read errors on a drive will never cause it to be taken off-line and give a "red" indicator. Any single "write" error will.

So, you had a write error.

The subsequent "writes" were made to the parity drive as if the actual drive were present. So, you need not "copy anything" anywhere if you do not want to, you just need to replace the failed drive.
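To illustrate why nothing needs to be copied, here is a minimal sketch of how writes to a disabled disk can live entirely in parity. It assumes plain XOR parity across the data disks; the function names and the tiny 4-byte "disks" are hypothetical, and this is not the actual unRAID driver code.

```python
from functools import reduce
from operator import xor

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(xor, column) for column in zip(*blocks))

def parity_after_emulated_write(surviving, new_data):
    """Parity that results from 'writing' new_data to a disk that is offline.

    Nothing reaches the failed disk; the parity block is simply set so that
    XOR of all surviving blocks plus parity equals the data being written.
    """
    return xor_blocks(list(surviving) + [new_data])

def reconstruct_missing(parity, surviving):
    """Rebuild the offline disk's block from parity and the surviving disks."""
    return xor_blocks(list(surviving) + [parity])

# Two healthy data disks and one block that was "moved" onto the failed disk.
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x01\x02\x03\x04"
moved = b"DATA"

parity = parity_after_emulated_write([disk1, disk2], moved)
rebuilt = reconstruct_missing(parity, [disk1, disk2])
```

Here `rebuilt` equals `moved`, which is the same calculation a rebuild onto a replacement drive performs for every block.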

Since it is a brand new drive, a loose connection to the drive is just as likely to have caused the failure as the drive itself. The only way to know is for you to provide a syslog before you reboot. Instructions are in the wiki under troubleshooting.

If it is just a loose connection, you'll need to un-assign the drive, reboot, then re-assign the drive, and it will be reconstructed when you press "Start". Whatever you do, do not press the button labeled "restore" unless you want it to forget anything you ever had written to the failed drive.

October 11, 2009

Author

Hi Joe L.

Thanks. I believe I said "Write" errors, not "read"; I've learned enough from your previous helpful posts to others to know that. I'll go and post a syslog very soon - baby crying. The reason I wanted to copy the data back was as a fail-safe in case something else goes wrong. Also, other than the parity drive, this is the only one I have that's 1.5 TB, so it may be a little while before I get a replacement. I do have some 1 TBs spare.

So, in summary, I should just go ahead and do a regular drive replacement/rebuild.

Joe M.

October 11, 2009

The array has already expanded the file system on the new 1.5TB drive you were using to replace the older drive, so it is unlikely to let you swap in a smaller drive and start a new rebuild unless you somehow convince it to.

If possible, make a copy of the contents of the failed drive elsewhere. Then, let's see a SMART report on the failed drive to be certain it really is failed. It could be as simple as a loose cable.

Do you have a spare slot/port on your array where you can put back your smaller 750 gig drive in addition to those now in the array? If so, you can use it for temp storage...

Or, if the only thing on the new 1.5TB drive is also on the old 750 gig drive, you can put the old 750 gig drive back in the array, remove the failed 1.5TB drive and use the "trust-my-parity" procedure as described in the wiki. I'd let the resulting parity check run through to completion. It will find (and correct) parity errors if you had done any other writes to the drive you were replacing.

You are correct, I mis-read your first post... you did say "write" errors.

If you have unMENU installed, you can use it to mount the old 750Gig drive outside of the protected array and copy your data there. (actually, you can use any drive you have, even the 1TB) If they have a file-system on them, you can mount it using unMENU and transfer your files using windows explorer. You can do all that via unix commands too, but unMENU makes it far easier for most Linux novices.
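For anyone scripting the copy rather than using Windows Explorer, a copy-then-verify pass might look roughly like the sketch below. It is only an illustration: the function names are made up, and any paths you pass in (for example a failed disk's share as the source and a drive mounted outside the array as the destination) depend on your own setup.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large media files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_tree_verified(src_root, dst_root):
    """Copy every file under src_root into dst_root and compare checksums.

    Returns the list of source files whose copy did not match. Nothing is
    deleted from the source, so this is a copy, not a move.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    mismatches = []
    for src in sorted(p for p in src_root.rglob("*") if p.is_file()):
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copies contents plus timestamps
        if sha256_of(src) != sha256_of(dst):
            mismatches.append(src)
    return mismatches
```

Deleting the originals only after this returns an empty list gives the copy-then-delete behaviour instead of a move.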

Your best bet right now is to:

1. Un-assign the failing 1.5TB drive.
2. Power down and correct any cabling that might be loose.
3. Power up and hope the disk is now seen properly.
4. Run a SMART report on the disk.
5. Re-assign the 1.5TB disk to replace the failed data disk.
6. Press "Start" to reconstruct onto the 1.5TB disk.
7. Hope it does not fail again.
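For the SMART report step, the report is normally produced with smartctl from smartmontools (for example `smartctl -a /dev/sdn`; substitute the device name your own system reports for the disk). The rough Python sketch below picks a few attributes out of the attribute table of such a report. The sample text is made up for illustration and is not output from this drive, and the choice of attributes to watch is an assumption.

```python
# Attributes chosen for this sketch (an assumption, not an official list).
# A rising UDMA CRC error count is commonly associated with cabling problems.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "UDMA_CRC_Error_Count")

def watched_raw_values(report_text, names=WATCHED):
    """Return {attribute_name: raw_value} for watched rows of the attribute table."""
    found = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in names:
            found[fields[1]] = int(fields[9])
    return found

# Hypothetical sample in the usual smartctl table layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""

values = watched_raw_values(SAMPLE)
```

In this made-up sample the sector counts are zero while the CRC count is not, which is the sort of pattern that would point at the connection rather than the platters.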

If you find the 1.5TB drive is really defective, you must either get a replacement 1.5TB drive for it, or use a non-standard way to convince unRAID to use a smaller drive.

That would be a modified version of the trust-my-parity procedure, where instead of telling the array that slot 99 is invalid, you give it the actual slot number of the drive you are attempting to replace/upgrade. The command to do that MUST be made between pressing "restore" and then starting the array, otherwise it will rebuild parity without your data from the disk being replaced... that would make you cry along with your baby.

October 11, 2009

Author

Joe

Thanks for taking the time to write such a comprehensive answer. Both 750 GB drives are still in my array, and there are no free slots left. I will copy the failed drive's data back to those disks and see what I can do about unassigning and reassigning the failed drive. Got to get that syslog and will even try a SMART report too. Cheers,

Joe

October 12, 2009

Author

syslog part 1

October 12, 2009

Author

part 2

October 12, 2009

Author

part 3

October 12, 2009

Author

part 4 - with lots of errors I think

October 12, 2009

The errors start in part 3 of your syslog.

Basically, communications are lost with /dev/sdn (disk15). Linux tries again and again to restore communications with the disk, resetting the SATA controller and trying different speeds before finally giving up. The "read stripe" errors are then all printed, as it was unable to communicate with the disk.

Part 4 of your syslog is filled with errors from the unRAID management process logging that it cannot get the disk status or temperature. I don't think it contains any additional helpful information.

Oct 10 19:37:47 MediaServer kernel: ata11.00: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0xe frozen
Oct 10 19:37:47 MediaServer kernel: ata11.00: irq_stat 0x00400000, PHY RDY changed
Oct 10 19:37:47 MediaServer kernel: ata11: SError: { PHYRdyChg LinkSeq TrStaTrns }
Oct 10 19:37:47 MediaServer kernel: ata11.00: cmd 25/00:00:4f:98:a2/00:04:63:00:00/e0 tag 0 dma 524288 in
Oct 10 19:37:47 MediaServer kernel:          res 50/00:00:4e:98:a2/00:00:63:00:00/e0 Emask 0x10 (ATA bus error)
Oct 10 19:37:47 MediaServer kernel: ata11.00: status: { DRDY }
Oct 10 19:37:47 MediaServer kernel: ata11: hard resetting link
Oct 10 19:37:57 MediaServer kernel: ata11: softreset failed (device not ready)
Oct 10 19:37:57 MediaServer kernel: ata11: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 10 19:38:01 MediaServer kernel: ata11.00: configured for UDMA/133
Oct 10 19:38:01 MediaServer kernel: ata11: EH complete
Oct 10 19:38:01 MediaServer kernel: sd 11:0:0:0: [sdn] 2930277168 512-byte hardware sectors: (1.50 TB/1.36 TiB)
Oct 10 19:38:01 MediaServer kernel: sd 11:0:0:0: [sdn] Write Protect is off
Oct 10 19:38:01 MediaServer kernel: sd 11:0:0:0: [sdn] Mode Sense: 00 3a 00 00
Oct 10 19:38:01 MediaServer kernel: sd 11:0:0:0: [sdn] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Oct 10 19:38:01 MediaServer kernel: ata11.00: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0xe frozen
Oct 10 19:38:01 MediaServer kernel: ata11.00: irq_stat 0x00400000, PHY RDY changed
Oct 10 19:38:01 MediaServer kernel: ata11: SError: { PHYRdyChg LinkSeq TrStaTrns }
Oct 10 19:38:01 MediaServer kernel: ata11.00: cmd 25/00:00:57:c0:a2/00:04:63:00:00/e0 tag 0 dma 524288 in
Oct 10 19:38:01 MediaServer kernel:          res 50/00:00:56:c0:a2/00:00:63:00:00/e0 Emask 0x10 (ATA bus error)
Oct 10 19:38:01 MediaServer kernel: ata11.00: status: { DRDY }
Oct 10 19:38:01 MediaServer kernel: ata11: hard resetting link
Oct 10 19:38:11 MediaServer kernel: ata11: softreset failed (device not ready)
Oct 10 19:38:11 MediaServer kernel: ata11: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 10 19:38:11 MediaServer kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x40)
Oct 10 19:38:11 MediaServer kernel: ata11.00: revalidation failed (errno=-5)
Oct 10 19:38:16 MediaServer kernel: ata11: hard resetting link
Oct 10 19:38:26 MediaServer kernel: ata11: softreset failed (device not ready)
Oct 10 19:38:26 MediaServer kernel: ata11: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 10 19:38:26 MediaServer kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x40)
Oct 10 19:38:26 MediaServer kernel: ata11.00: revalidation failed (errno=-5)
Oct 10 19:38:26 MediaServer kernel: ata11: limiting SATA link speed to 1.5 Gbps
Oct 10 19:38:31 MediaServer kernel: ata11: hard resetting link
Oct 10 19:38:41 MediaServer kernel: ata11: softreset failed (device not ready)
Oct 10 19:38:41 MediaServer kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 10 19:38:41 MediaServer kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x40)
Oct 10 19:38:41 MediaServer kernel: ata11.00: revalidation failed (errno=-5)
Oct 10 19:38:41 MediaServer kernel: ata11.00: disabled
Oct 10 19:38:46 MediaServer kernel: ata11: hard resetting link
Oct 10 19:38:56 MediaServer kernel: ata11: softreset failed (device not ready)
Oct 10 19:38:56 MediaServer kernel: ata11: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 10 19:38:56 MediaServer kernel: ata11: link online but device misclassified, retrying
Oct 10 19:38:56 MediaServer kernel: ata11: hard resetting link
Oct 10 19:39:06 MediaServer kernel: ata11: softreset failed (device not ready)
Oct 10 19:39:06 MediaServer kernel: ata11: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 10 19:39:06 MediaServer kernel: ata11: link online but device misclassified, retrying
Oct 10 19:39:06 MediaServer kernel: ata11: hard resetting link
Oct 10 19:39:17 MediaServer kernel: ata11: link is slow to respond, please be patient (ready=0)
Oct 10 19:39:41 MediaServer kernel: ata11: softreset failed (device not ready)
Oct 10 19:39:41 MediaServer kernel: ata11: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 10 19:39:41 MediaServer kernel: ata11: link online but device misclassified, retrying
Oct 10 19:39:41 MediaServer kernel: ata11: limiting SATA link speed to 1.5 Gbps
Oct 10 19:39:41 MediaServer kernel: ata11: hard resetting link
Oct 10 19:39:46 MediaServer kernel: ata11: softreset failed (device not ready)
Oct 10 19:39:46 MediaServer kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 10 19:39:46 MediaServer kernel: ata11: link online but device misclassified, device detection might fail
Oct 10 19:39:46 MediaServer kernel: ata11: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen t4
Oct 10 19:39:46 MediaServer kernel: ata11: irq_stat 0x00400040, connection status changed
Oct 10 19:39:46 MediaServer kernel: ata11: hard resetting link
Oct 10 19:39:56 MediaServer kernel: ata11: softreset failed (device not ready)
Oct 10 19:39:56 MediaServer kernel: ata11: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 10 19:39:56 MediaServer kernel: ata11: link online but device misclassified, retrying
Oct 10 19:39:56 MediaServer kernel: ata11: hard resetting link
Oct 10 19:39:59 MediaServer kernel: ata11: COMRESET failed (errno=-32)
Oct 10 19:39:59 MediaServer kernel: ata11: reset failed (errno=-32), retrying in 8 secs
Oct 10 19:40:06 MediaServer kernel: ata11: limiting SATA link speed to 1.5 Gbps
Oct 10 19:40:06 MediaServer kernel: ata11: hard resetting link
Oct 10 19:40:09 MediaServer kernel: ata11: COMRESET failed (errno=-32)
Oct 10 19:40:09 MediaServer kernel: ata11: reset failed (errno=-32), retrying in 33 secs
Oct 10 19:40:41 MediaServer kernel: ata11: hard resetting link
Oct 10 19:40:44 MediaServer kernel: ata11: COMRESET failed (errno=-32)
Oct 10 19:40:44 MediaServer kernel: ata11: reset failed, giving up
Oct 10 19:40:44 MediaServer kernel: ata11: exception Emask 0x10 SAct 0x0 SErr 0x5950000 action 0xe frozen t3
Oct 10 19:40:44 MediaServer kernel: ata11: irq_stat 0x00400040, connection status changed
Oct 10 19:40:44 MediaServer kernel: ata11: SError: { PHYRdyChg CommWake Dispar LinkSeq TrStaTrns DevExch }
Oct 10 19:40:44 MediaServer kernel: ata11: hard resetting link
Oct 10 19:40:46 MediaServer kernel: ata11: COMRESET failed (errno=-32)
Oct 10 19:40:46 MediaServer kernel: ata11: reset failed (errno=-32), retrying in 8 secs
Oct 10 19:40:54 MediaServer kernel: ata11: limiting SATA link speed to 1.5 Gbps
Oct 10 19:40:54 MediaServer kernel: ata11: hard resetting link
Oct 10 19:40:56 MediaServer kernel: ata11: COMRESET failed (errno=-32)
Oct 10 19:40:56 MediaServer kernel: ata11: reset failed (errno=-32), retrying in 8 secs
Oct 10 19:41:04 MediaServer kernel: ata11: hard resetting link
Oct 10 19:41:06 MediaServer kernel: ata11: COMRESET failed (errno=-32)
Oct 10 19:41:06 MediaServer kernel: ata11: reset failed (errno=-32), retrying in 33 secs
Oct 10 19:41:39 MediaServer kernel: ata11: hard resetting link
Oct 10 19:41:41 MediaServer emhttp: disk_spinning: open: No such device or address
Oct 10 19:41:41 MediaServer emhttp: disk_spinning: open: No such device or address
Oct 10 19:41:41 MediaServer kernel: ata11: COMRESET failed (errno=-32)
Oct 10 19:41:41 MediaServer kernel: ata11: reset failed, giving up
Oct 10 19:41:41 MediaServer kernel: sd 11:0:0:0: [sdn] Result: hostbyte=0x00 driverbyte=0x08
Oct 10 19:41:41 MediaServer kernel: sd 11:0:0:0: [sdn] Sense Key : 0xb [current] [descriptor]
Oct 10 19:41:41 MediaServer kernel: Descriptor sense data with sense descriptors (in hex):
Oct 10 19:41:41 MediaServer kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Oct 10 19:41:41 MediaServer kernel:         63 a2 c0 56 
Oct 10 19:41:41 MediaServer kernel: sd 11:0:0:0: [sdn] ASC=0x0 ASCQ=0x0
Oct 10 19:41:41 MediaServer kernel: end_request: I/O error, dev sdn, sector 1671610455
Oct 10 19:41:41 MediaServer kernel: ata11: EH complete
Oct 10 19:41:41 MediaServer kernel: md: disk15 read error
Oct 10 19:41:41 MediaServer kernel: handle_stripe read error: 1671610392/9, count: 1
Oct 10 19:41:41 MediaServer kernel: md: disk15 read error
Oct 10 19:41:41 MediaServer kernel: handle_stripe read error: 1671610400/9, count: 1
Oct 10 19:41:41 MediaServer kernel: md: disk15 read error
Oct 10 19:41:41 MediaServer kernel: handle_stripe read error: 1671610408/9, count: 1
Oct 10 19:41:41 MediaServer kernel: md: disk15 read error
Oct 10 19:41:41 MediaServer kernel: handle_stripe read error: 1671610416/9, count: 1
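The pattern in the excerpt above (repeated link resets, speed downgrades, then giving up) can be tallied with a short script. This is only a sketch: it assumes syslog lines shaped like the ones above, the helper names are made up, and the four embedded lines are copied from the excerpt.

```python
import re
from collections import Counter

# Message fragments taken from the excerpt above.
PATTERNS = {
    "hard_reset": re.compile(r"ata\d+: hard resetting link"),
    "softreset_failed": re.compile(r"ata\d+: softreset failed"),
    "comreset_failed": re.compile(r"ata\d+: COMRESET failed"),
    "speed_limited": re.compile(r"ata\d+: limiting SATA link speed"),
    "gave_up": re.compile(r"ata\d+: reset failed, giving up"),
    "md_read_error": re.compile(r"md: disk\d+ read error"),
}
SECTORS = re.compile(r"\[(sd[a-z]+)\] (\d+) (\d+)-byte hardware sectors")

def summarize(lines):
    """Count how often each kind of link or array event appears."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

def capacity_bytes(line):
    """Sector count times sector size from a 'hardware sectors' line, else None."""
    match = SECTORS.search(line)
    return int(match.group(2)) * int(match.group(3)) if match else None

EXCERPT = [
    "Oct 10 19:37:47 MediaServer kernel: ata11: hard resetting link",
    "Oct 10 19:38:01 MediaServer kernel: sd 11:0:0:0: [sdn] 2930277168 512-byte hardware sectors: (1.50 TB/1.36 TiB)",
    "Oct 10 19:40:44 MediaServer kernel: ata11: reset failed, giving up",
    "Oct 10 19:41:41 MediaServer kernel: md: disk15 read error",
]

counts = summarize(EXCERPT)
size = capacity_bytes(EXCERPT[1])  # 2930277168 * 512 bytes
```

The size works out to 1,500,301,910,016 bytes, which is the 1.50 TB (1.36 TiB) the kernel prints. To run the tally over a whole log, pass an open file object to `summarize` in place of `EXCERPT`.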

October 12, 2009

Author

I'm copying the data back to the drives that I was going to replace. Once that's done (about another 2 1/2 days at this rate), I'll reassign the disk and check the SATA connection to see if that's where the problem is. I did recently buy all locking SATA cables and that certainly increased my confidence in the connections.

October 12, 2009

Ah... but did you install them, and did they lock, and did they lock after making a good connection??

;D

October 12, 2009

Author

They lock on the drive side - on the mobo controller, they're a very tight fit, but no click. On the PCI controllers, no lock and can be disrupted with less force than I'd like. I think I'll soon move to the 8 port PCIe card if the new beta supports it.

October 13, 2009

Author

Update:

I copied all the data that was on the drive, back onto the 750 GB ones that are scheduled for replacement. I then unassigned and reassigned the 1.5 TB drive. It's rebuilding the data now - 15.4 % done. If the writes fail again, it won't be too big a deal as all the data is on the 750GB drives too. If the rebuild completes properly, I'll replace the 750 GB drives as planned. And now that 19 data drives are supported, I may just add a few more drives once the hot 750s are gone.

Thanks for all your help - again.
