Rebuilt drive and still getting errors

PsyVision · February 13, 2016

Hi All,

A few weeks ago one of my drives started failing on me. I switched it out and replaced it with a new drive. On rebuilding the drive it showed there were 288 errors. I thought there could be an issue with the SATA cable, so I replaced that too, then unassigned the drive and re-assigned it to rebuild it from parity. The drive rebuilt but showed 288 errors again. I've repeated this process once more, changing the SATA cable (as some of mine are old and possibly broken), and again got 288 errors.

Below is a sample of my syslog and I have attached the diagnostics download.

Feb 11 21:11:39 nas kernel: ata7.00: exception Emask 0x0 SAct 0x2000 SErr 0x0 action 0x6 frozen
Feb 11 21:11:39 nas kernel: ata7.00: failed command: WRITE FPDMA QUEUED
Feb 11 21:11:39 nas kernel: ata7: hard resetting link
Feb 11 21:11:49 nas kernel: ata7: COMRESET failed (errno=-16)
Feb 11 21:11:49 nas kernel: ata7: hard resetting link
Feb 11 21:11:59 nas kernel: ata7: COMRESET failed (errno=-16)
Feb 11 21:11:59 nas kernel: ata7: hard resetting link
Feb 11 21:12:10 nas kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 11 21:12:10 nas kernel: ata8.00: failed command: WRITE DMA EXT
Feb 11 21:12:10 nas kernel: ata8: hard resetting link
Feb 11 21:12:16 nas kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Feb 11 21:12:16 nas kernel: ata8.00: revalidation failed (errno=-5)
Feb 11 21:12:16 nas kernel: ata8: hard resetting link
Feb 11 21:12:27 nas kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Feb 11 21:12:27 nas kernel: ata8.00: revalidation failed (errno=-5)
Feb 11 21:12:27 nas kernel: ata8: hard resetting link
Feb 11 21:12:34 nas kernel: ata7: COMRESET failed (errno=-16)
Feb 11 21:12:34 nas kernel: ata7: hard resetting link
Feb 11 21:12:39 nas kernel: ata7: COMRESET failed (errno=-16)
Feb 11 21:12:39 nas kernel: ata7: reset failed, giving up
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 1465144063
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 977254148
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 977254151
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 0
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 977254181
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 1465328951
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 1465329367
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 0
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 37607
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 978674575
Feb 11 21:12:39 nas kernel: XFS (sdh1): metadata I/O error: block 0x575452c0 ("xfs_buf_iodone_callbacks") error 5 numblks 32
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 183166111, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 183166112, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 183166163, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 183166164, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 122334314, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 122334315, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 122334344, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 122334345, lost async page write
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 124078201, lost async page write
Feb 11 21:12:39 nas kernel: XFS (sdh1): metadata I/O error: block 0x3a3fb6c5 ("xlog_iodone") error 5 numblks 64
Feb 11 21:12:39 nas kernel: Buffer I/O error on dev sdh1, logical block 122334346, lost async page write
Feb 11 21:12:39 nas kernel: XFS (sdh1): Log I/O Error Detected.  Shutting down filesystem
Feb 11 21:12:39 nas kernel: XFS (sdh1): metadata I/O error: block 0x3a3fb6c8 ("xlog_iodone") error 5 numblks 64
Feb 11 21:12:39 nas kernel: XFS (sdh1): metadata I/O error: block 0x3a3fb6e6 ("xlog_iodone") error 5 numblks 64
Feb 11 21:12:39 nas kernel: XFS (sdh1): xfs_log_force: error -5 returned.
Feb 11 21:12:59 nas kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Feb 11 21:12:59 nas kernel: ata8.00: revalidation failed (errno=-5)
Feb 11 21:12:59 nas kernel: ata8: hard resetting link
Feb 11 21:13:00 nas kernel: blk_update_request: I/O error, dev sdi, sector 251287064
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287000
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287008
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287016
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287024
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287032
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287040
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287048
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287056
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287064
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287072
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287080
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287088

I'm not sure how to resolve this. Is it possible that I've corrupted my parity such that when rebuilding the data drive it rebuilds incorrectly?

nas-diagnostics-20160213-1021.zip

JorgeB · February 13, 2016

Both disk5 and your cache disk dropped offline, maybe a power/cable problem?

Post new diagnostics after checking cables and power cycling the server, because the SMART reports for both are missing.

itimpi · February 13, 2016

Which was the drive that you replaced?

The SMART reports show that the disk with serial WD-WCAU45077221 has 6 Pending sectors. These are sectors that are not being read successfully, so they can affect the rebuild of any other drive, as it can mean that parity for those particular sectors is incorrect. There are also no SMART reports for WD-WCC4N0PKSD7P-20160213 and WD-WMAV50355779, which suggests these two drives have dropped offline.

The syslog shows that you started getting write errors to the 'sdh' disk. It looks like the drive went offline and things went downhill from there.

The most likely causes are a cabling issue or a power related issue. Another possibility is a disk controller that is not properly seated in its motherboard slot.
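
To see at a glance which devices the errors in a syslog cluster on, a small tally script can help. This is only a rough sketch: the function name `tally_errors` and the three regular expressions are assumptions made for illustration, matched against the message formats quoted in the first post, and the sample lines are copied from that syslog.

```python
import re
from collections import Counter

# Hypothetical patterns, each capturing the device named in one kind of
# kernel message seen in the syslog excerpt from the first post.
PATTERNS = [
    re.compile(r"blk_update_request: I/O error, dev (\w+), sector \d+"),
    re.compile(r"md: (disk\d+) write error"),
    re.compile(r"(ata\d+): COMRESET failed"),
]

def tally_errors(lines):
    """Count matching error lines per device (sdX, diskN or ataN)."""
    counts = Counter()
    for line in lines:
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
                break  # one line counts once
    return counts

# A few lines taken from the syslog sample in the first post.
SAMPLE = """\
Feb 11 21:12:39 nas kernel: ata7: COMRESET failed (errno=-16)
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 1465144063
Feb 11 21:12:39 nas kernel: blk_update_request: I/O error, dev sdh, sector 977254148
Feb 11 21:13:00 nas kernel: blk_update_request: I/O error, dev sdi, sector 251287064
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287000
Feb 11 21:13:00 nas kernel: md: disk5 write error, sector=251287008
"""

if __name__ == "__main__":
    for device, count in tally_errors(SAMPLE.splitlines()).most_common():
        print(device, count)
```

If errors pile up on two ports at the same moment, as they do here for ata7 and ata8, that points more towards something shared (power, controller) than towards one bad disk.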

JorgeB · February 13, 2016

Another possibility is a disk controller that is not properly seated in its motherboard slot.

Strong possibility, just noticed both dropped disks appear to be using the same Marvell controller.

PsyVision · February 13, 2016

Yes, disk 5 was the one replaced. That disk and one other are on the same onboard motherboard controller, with only those two disks on it.

I will check cables and cycle later and then post back results with more information on things you've asked.

Thank you!

PsyVision · February 13, 2016

Okay I powered off the server and have re-connected all of the power connectors. I may have had an extra drive on one of the power leads that should have been on the other (3 and 5 rather than 4 and 4). I have powered on and then done nothing, logs attached.

WD-WMAV50355779 (sdh) is my cache drive

WD-WCAU45077221 (disk 2, sdc) is an old 1TB drive that potentially could be failing? I would be happy to replace this if need be (of course).

nas-diagnostics-20160213-1620.zip

JorgeB · February 13, 2016

Disk 2 should be replaced asap, SMART for all others looks good.

If disk2 caused the disk5 rebuild errors you're probably going to have some corrupt files. If you have backups or checksums you should check them.

PsyVision · February 14, 2016

Thank you Johnnie.

What is the best way to do this? If I put a new disk2 in then it tells me it's the wrong one. I see I should mark it as empty and then shut down, put the new one in and then assign it to the slot. However, I am told that I have either too many or the wrong disks assigned. Unfortunately disk5 is still showing red-balled/cross so I'm not sure it will let me build it.

JorgeB · February 14, 2016

So the disk5 rebuild didn’t complete?

I understood that it completed with some errors. If it’s showing a red x it’s still being emulated and you can’t change another disk before that one is dealt with.

Trouble is that because disk2 has pending sectors a completely successful rebuild may be impossible. Did the old disk5 fail completely or is it still readable?

JorgeB · February 14, 2016

Looking at your diagnostics, disk5 is still disabled. The rebuild failed when disk5 (and cache) dropped offline. Since you checked all cables I would try to rebuild it again.

PsyVision · February 14, 2016

I tried a rebuild before posting and it failed after a couple of hours. It should take ~9 hours to do a full rebuild, apparently.

Cache and disk5 appear to be online, just that disk5 gets marked red.

I have enough disk space to move data off both drives, would it be okay to move the data to other drives and then replace 2 and 5 as though they were new drives?

JorgeB · February 14, 2016

If you have the space it might be a good solution, because the pending sectors from disk2 can prevent a successful disk5 rebuild.

trurl · February 14, 2016

... disk5 appear to be online, just that disk5 gets marked as a red...

The disk appears to be online because it is being emulated, not because unRAID is actually using it. unRAID will not use a disabled drive. Instead it calculates the drive's data by reading all the other drives plus parity.
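
As a rough illustration of that calculation, here is a minimal sketch assuming plain single-parity XOR across equal-sized blocks. The helper name `xor_blocks` and the toy byte values are made up for the example and are not unRAID's actual code. It also shows why unreadable sectors on another disk (like the pending sectors on disk2) can spoil a rebuild of disk5.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three toy "data disks", two bytes each (illustrative values only).
disks = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]

# Parity is the XOR of all data disks.
parity = xor_blocks(disks)

# Emulating a missing disk (index 1): XOR every other disk plus parity.
emulated = xor_blocks([disks[0], disks[2], parity])

# If a surviving disk returns wrong data for a sector, the emulated or
# rebuilt contents of the missing disk come out wrong at that position too.
misread_disk0 = b"\x00\x02"  # first byte read back incorrectly
bad_rebuild = xor_blocks([misread_disk0, disks[2], parity])

if __name__ == "__main__":
    print(emulated == disks[1])     # emulation matches the missing disk
    print(bad_rebuild == disks[1])  # a bad read elsewhere breaks the rebuild
```

In this model every remaining disk has to be read correctly for every sector, which is why a drive with read problems elsewhere in the array matters during a rebuild.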
