Multiple disk issues. Unsure of data integrity state. Best path forward.

air_marshall · November 19, 2020

The Issue - 1 Disk Disabled and 1 Disk with SMART Errors....

Last night when for all intents and purposes the array wasn't doing anything I received the following push notifications:

Tower: Warning [TOWER] - array has errors
Array has 1 disk with read errors

Tower: Alert [TOWER] - Disk 5 in error state (disk dsbl)
WDC_WD40EZRX-00SPEB0_WD-WCC4ECKCLDV9 (sdb)

Here's my first mistake, I didn't save the syslog before a reboot 😞 [idiot]

I went to the webgui with the intention of checking disk 5 SMART errors or report. However, when clicking on the disk the system was unable to retrieve any attributes. The read error count was up at 80. I checked the "Disk Log Information" and it did list some errors, IIRC related to "ata hard resets and restarting the link". There was more than one. At this point I felt a reboot would be useful to see if the connection to the disk could be re-established.

Upon reboot the disk was still disabled (red x) but connected, and settings and reports were available. I ran a short SMART test; an extended test had been run only a week or so prior (see background info below if you feel it's relevant!). It was, and I believe remains, clear. However, having read the wiki, I believe I can only re-enable that disk by rebuilding to it. Is that correct?

I saw the "Read Check" option was now available and believed the system was prompting me to do one, so I ran one. I now realise it's the only option available when a drive is disabled. Nevertheless, off it went. A few hours into the "Read Check" disaster struck with the following notification:

Tower: Warning [TOWER] - array has errors
Array has 1 disk with read errors

Tower: Warning [TOWER] - reported uncorrect is 47
SAMSUNG_HD154UI_S1Y6JDWZ504027 (sdg)

Tower: Warning [TOWER] - current pending sector is 18
SAMSUNG_HD154UI_S1Y6JDWZ504027 (sdg)

WTF! But it hasn't disabled the disk (maybe because I already have a disk disabled - I don't know)

I have since run a short and extended SMART test on that drive; the reports should be in the diagnostics file attached. It definitely has issues.

I have been through what's on those drives; thankfully MOST of it, but not all, is non-critical data. I am in the process of copying as much critical data as I can off the array. However, I am unsure as to the data integrity state of the stuff on Disk 1, and Disk 5 is emulated.

I know the array is currently unprotected.
What are my options here?
In what order do I need to go about things to minimise further risk to data?

The array is big enough to cope with dropping disk 1 altogether and I'm happy to shrink the array when all this is done. I'm just out of my depth here.

Background Info

I have recently completed a re-casing and upgrading project on my long time Unraid server. Earlier in the year I moved it from a HP N54L with a Mediasonic eSata Probox to a Lenovo P300 with said same Mediasonic eSata Probox. There were no issues with this phase, it was stable for a good 6 months without issue.

In the last couple of weeks I re-cased and consolidated the whole thing into a Fractal Design Node 804 case, doing away with the Mediasonic eSata box. Again no real issues...APART FROM DROPPING A STACK OF 4 DRIVES ON THE FLOOR!!! ARHGHGHGHG. When I got the system back-up one of my 2TB drives had a hard mechanical failure from the drop. It was just loudly clicking under power and the system couldn't initialize the drive. It would eventually boot, obviously the drive was disabled but the array came up anyway. I had a new 4TB drive that I was adding to the array anyway so I pre-cleared it and attempted to rebuild the disabled drive to it, however it was slightly bigger than my 4TB parity (as that was a shucked drive) so I had to do a parity swap, followed by a data rebuild to the ex-parity drive. That drive is Disk 5, the first one in question here.

After I had finished the data rebuild and run an extended SMART test on the drive, I left the array stable for a couple of days, in which time I believe it did a Parity Check finding 0 errors.

I then set about converting 4 drives from rfs to xfs as I had a mixed array and wanted to consolidate. A lengthy process using UnBalance to clear them down and then reformatting and moving data about to clear the next one. Some of those drives were my oldest, so during that process they probably had the most use they'd had for ages. Either way I got through it without too much bother.

I then added an additional 4TB drive to the array to expand it further just the other day, which did involve moving some cables about, but the system had been up for a good few days before the issues described at the top of this post began.

tower-diagnostics-20201119-2244.zip

trurl · November 20, 2020

On mobile now so can't look at Diagnostics yet.

Safest would be to rebuild disk 5 to a new disk and keep the original as is in case of problems with the rebuild.

Are all disks mounting?

JorgeB · November 20, 2020

You can also use ddrescue to clone the disk(s) and identify any corrupt data.

air_marshall · November 20, 2020

6 hours ago, trurl said:

Are all disks mounting?

I'm not sure what you mean. Disk 5 still shows as disabled but all data remains available due to emulation. /mnt/disk5 is also listed and available.

There is only 300gb on disk 5 so I could transfer it off to a usb drive before attempting a rebuild to the same 4tb drive.

Regarding issues with the rebuild, could these result from the SMART errors on disk1 that have also appeared?

Thank you for trying to help me understand this situation.

air_marshall · November 20, 2020

37 minutes ago, JorgeB said:

You can also use ddrescue to clone the disk(s) and identify any corrupt data.

Thank you for trying to help me.

I think I'm struggling for drives in order to do this and having just spent on 2x4tb units I'm not inclined to get another.

I have a spare 1tb 2.5 spinner that I can use and hasn't been used much, at the moment I've been using it externally to copy the critical data off disk1 and disk5.

I also have a spare, slow, old 500gb 2.5 spinner that used to be my cache before I upgraded to an SSD for this purpose.

I can probably also free up disk9, a new 4tb unit.

Should I attempt to use ddrescue before attempting a rebuild to disk5?

I will set about copying everything off disk5 now to the external 1tb I have in the meantime. At least then I have it so if there are any issues with a rebuild I might be in a better place.

JorgeB · November 20, 2020

1 hour ago, air_marshall said:

Should I attempt to use ddrescue before attempting a rebuild to disk5?

You can rebuild first if you do it to a new disk and keep the old one intact, but if there are errors during the rebuild there's no way of knowing which files are corrupt, unless you have pre-existing checksums or are using btrfs.
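To illustrate what "pre-existing checksums" buys you, here is a minimal Python sketch, not an Unraid feature and not necessarily how any particular plugin does it. It records a SHA-256 digest per file before a risky operation and compares afterwards. The helper names (`sha256_of`, `build_manifest`, `changed_files`) and the demo files are hypothetical, and the temporary directory only stands in for something like a disk share.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash one file in chunks so large media files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def changed_files(before, after):
    """Return paths whose digest differs or that are missing afterwards."""
    return sorted(p for p, d in before.items() if after.get(p) != d)


# Demo on a throwaway directory standing in for a disk share.
with tempfile.TemporaryDirectory() as demo_root:
    for name, data in (("a.bin", b"alpha"), ("b.bin", b"bravo")):
        with open(os.path.join(demo_root, name), "wb") as handle:
            handle.write(data)
    manifest_before = build_manifest(demo_root)

    # Simulate silent corruption of one file after a rebuild.
    with open(os.path.join(demo_root, "b.bin"), "wb") as handle:
        handle.write(b"brav0")
    manifest_after = build_manifest(demo_root)

corrupted = changed_files(manifest_before, manifest_after)
print(corrupted)
```

In practice the "before" manifest would have to be saved somewhere off the array while the data was still known to be good. Without it, a rebuild that hits read errors gives no per-file answer.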

air_marshall · November 20, 2020

Ok thanks.

I'm still trying to comprehend the situation here.

So in all cases I should address disk5 first, regardless of the SMART errors on disk1?

Errors during the rebuild of disk5 may result due to the SMART issues with disk1. Is that correct? Therefore if I retain disk5 in its current state I may be able to rescue data from it if required?

If disk5 is currently mounted what is wrong with just copying all the data off it? Or is the data actually replicated from parity despite the mount?

IF I get a successful rebuild of disk5 what do I need to do to address disk1 if I am happy to make it redundant.

JorgeB · November 20, 2020

3 minutes ago, air_marshall said:

If disk5 is currently mounted what is wrong with just copying all the data off it?

Similar to rebuilding: since the disk is being emulated, if there are problems on another disk there will also be problems during the copy, but at least you'll know which files are affected, since you'll get an i/o error on those.
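As a rough sketch of that idea (assuming you copy with a script rather than a file manager), the Python below copies a tree file by file and records every file whose copy raises an OSError instead of stopping at the first one. The function name `copy_tree_logging_errors` and the `copy_func` parameter are made up for this example. The `flaky_copy` stand-in only simulates an I/O error so the demo can run anywhere.

```python
import errno
import os
import shutil
import tempfile


def copy_tree_logging_errors(src_root, dst_root, copy_func=shutil.copy2):
    """Copy every file under src_root; return (relative path, reason) for failures."""
    failed = []
    for folder, _dirs, files in os.walk(src_root):
        rel_folder = os.path.relpath(folder, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel_folder))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            src = os.path.join(folder, name)
            try:
                copy_func(src, os.path.join(target_dir, name))
            except OSError as exc:
                # Keep going: the point is to learn which files are affected.
                failed.append((os.path.relpath(src, src_root), str(exc)))
    return failed


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name in ("good.bin", "bad.bin"):
        with open(os.path.join(src, name), "wb") as handle:
            handle.write(b"data")

    def flaky_copy(source, target):
        # Stand-in for a real read failure: raise EIO for one file only.
        if os.path.basename(source) == "bad.bin":
            raise OSError(errno.EIO, "Input/output error", source)
        return shutil.copy2(source, target)

    failed = copy_tree_logging_errors(src, dst, copy_func=flaky_copy)
    copied = sorted(os.listdir(dst))

failed_paths = [path for path, _reason in failed]
print("failed:", failed_paths)
print("copied:", copied)
```

On a real array you would call it with the default `copy_func` and paths such as `/mnt/disk5` and the mount point of the external drive. The list of failures is then the list of files to treat as suspect.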

air_marshall · November 21, 2020

So I copied the emulated contents of disk5 off of the array onto a USB drive, no faults or errors occurred during this process.

Having completed another extended SMART test on disk5 without issue, combined with the fact there was no critical data on disk5 and my not having a spare disk or being willing to spend on one, I elected to rebuild to the same drive. This completed, but with errors coinciding with read errors from disk1.

Why has this happened during the rebuild but not the copy of the same emulated data beforehand?

Realistically how much data are we talking about as being corrupt? The latest diagnostics is attached detailing the sector errors, which I should add are now increasing with drive usage, so disk1 is definitely on the way out.

Regarding data on disk1. If parity data is not part of the corrupted rebuild data on disk5 (i doubt there is any way of confirming this) I should be able to a) replace disk1 and rebuild without data corruption OR (if I am happy to shrink the array - which I am) unassign the disk1 and then copy the emulated data to another drive without data corruption; then proceed to shrink the array using the methods in the documentation. Should I run a non-correcting parity check first?

There is some critical data on disk1, I have already copied that off to an external source with the drive still assigned and mounted, there were no errors during this copy process.

tower-diagnostics-20201121-1408.zip

Edited November 21, 2020 by air_marshall

JorgeB · November 22, 2020

18 hours ago, air_marshall said:

Why has this happened during the rebuild but not the copy of the same emulated data beforehand?

Most likely the sectors where the data was on disk5 correspond to good sectors on disk1, so the copy never had to read the bad ones, while a rebuild reads every sector. Sometimes these errors can also be intermittent, but that's less likely in this case.
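A toy Python model of single (XOR) parity may make this easier to picture. It is only a sketch under simplifying assumptions: the disks are four one-byte "sectors" with made-up values, and the bad sector on disk1 is modelled as returning garbage rather than a read error.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy "disks", four one-byte sectors each (hypothetical data).
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
disk5 = bytes([0x05, 0x06, 0x07, 0x08])
parity = xor_blocks([disk1, disk2, disk5])

# Pretend sector 3 of disk1 is bad and comes back wrong.
disk1_read = bytearray(disk1)
disk1_read[3] ^= 0xFF

# Emulating or rebuilding disk5 means XOR-ing parity with every other disk.
rebuilt_disk5 = xor_blocks([parity, bytes(disk1_read), disk2])

bad_sectors = [i for i, (a, b) in enumerate(zip(disk5, rebuilt_disk5)) if a != b]
print(bad_sectors)
```

Only the sector at the same offset as the bad one on disk1 comes out wrong. If the files on disk5 happen to live in sectors 0 to 2, copying them from the emulated disk never touches sector 3, but a full rebuild reconstructs all four sectors and so hits the problem.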

18 hours ago, air_marshall said:

Realistically how much data are we talking about as being corrupt?

Impossible to tell for sure without checksums but based on the number of errors probably not much.

18 hours ago, air_marshall said:

Regarding data on disk1. If parity data is not part of the corrupted rebuild data on disk5 (i doubt there is any way of confirming this) I should be able to a) replace disk1 and rebuild without data corruption

Since disk5 now has some corruption, rebuilding disk1 will also result in the same.

air_marshall · December 11, 2020

Having rebuilt disk5 to the same disk and removed disk1 from the array, all has been stable and fine for a couple of weeks. A parity check has also been completed since, after the parity rebuild.

Today, when again the server was seemingly doing very little I received an alert of read errors on the same disk as before, disk5. It's now back to the disabled status.

Here is a diagnostics report. I have NOT rebooted the server or done anything else this time so hopefully I can get some more pointed assistance.

I note this from the Disk Log Information from the disk:

Dec 7 21:50:59 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 7 21:50:59 Tower kernel: ata1.00: failed command: READ DMA EXT
Dec 7 21:50:59 Tower kernel: ata1.00: cmd 25/00:00:78:f9:37/00:01:ff:00:00/e0 tag 14 dma 131072 in
Dec 7 21:50:59 Tower kernel: ata1.00: status: { DRDY }
Dec 7 21:50:59 Tower kernel: ata1: hard resetting link
Dec 7 21:51:03 Tower kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 7 21:51:03 Tower kernel: ata1.00: configured for UDMA/133
Dec 7 21:51:03 Tower kernel: ata1: EH complete
Dec 11 04:46:44 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
Dec 11 04:46:49 Tower kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 11 04:46:49 Tower kernel: ata1.00: configured for UDMA/133
Dec 11 08:39:15 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 11 08:39:15 Tower kernel: ata1.00: failed command: READ DMA EXT
Dec 11 08:39:15 Tower kernel: ata1.00: cmd 25/00:08:d0:cf:11/00:00:fb:00:00/e0 tag 19 dma 4096 in
Dec 11 08:39:15 Tower kernel: ata1.00: status: { DRDY }
Dec 11 08:39:15 Tower kernel: ata1: hard resetting link
Dec 11 08:39:21 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
Dec 11 08:39:25 Tower kernel: ata1: COMRESET failed (errno=-16)
Dec 11 08:39:25 Tower kernel: ata1: hard resetting link
Dec 11 08:39:31 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
Dec 11 08:39:35 Tower kernel: ata1: COMRESET failed (errno=-16)
Dec 11 08:39:35 Tower kernel: ata1: hard resetting link
Dec 11 08:39:41 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
Dec 11 08:40:10 Tower kernel: ata1: COMRESET failed (errno=-16)
Dec 11 08:40:10 Tower kernel: ata1: limiting SATA link speed to 3.0 Gbps
Dec 11 08:40:10 Tower kernel: ata1: hard resetting link
Dec 11 08:40:15 Tower kernel: ata1: COMRESET failed (errno=-16)
Dec 11 08:40:15 Tower kernel: ata1: reset failed, giving up
Dec 11 08:40:15 Tower kernel: ata1.00: disabled
Dec 11 08:40:15 Tower kernel: ata1: EH complete
Dec 11 08:40:15 Tower kernel: sd 1:0:0:0: [sdb] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Dec 11 08:40:15 Tower kernel: sd 1:0:0:0: [sdb] tag#20 CDB: opcode=0x88 88 00 00 00 00 00 fb 11 cf d0 00 00 00 08 00 00
Dec 11 08:40:15 Tower kernel: print_req_error: I/O error, dev sdb, sector 4212248528
Dec 11 08:40:23 Tower kernel: sd 1:0:0:0: [sdb] tag#23 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Dec 11 08:40:23 Tower kernel: sd 1:0:0:0: [sdb] tag#23 CDB: opcode=0x8a 8a 00 00 00 00 00 fb 11 cf d0 00 00 00 08 00 00
Dec 11 08:40:23 Tower kernel: print_req_error: I/O error, dev sdb, sector 4212248528
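For anyone trying to read those lines, the short Python sketch below decodes the two places the failing address appears. It assumes the standard 16-byte SCSI READ(16)/WRITE(16) layout (opcode, flags, 8-byte LBA, 4-byte length) and my reading of the libata `cmd` field order (command/feature:count:LBA low:mid:high, then the same five "high order" bytes, then device). The function names are invented for the example.

```python
def decode_rw16_cdb(cdb_hex):
    """Decode a 16-byte SCSI READ(16)/WRITE(16) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] not in (0x88, 0x8A):
        raise ValueError("expected a 16-byte READ(16) or WRITE(16) CDB")
    return {
        "op": "READ(16)" if cdb[0] == 0x88 else "WRITE(16)",
        "lba": int.from_bytes(cdb[2:10], "big"),
        "blocks": int.from_bytes(cdb[10:14], "big"),
    }


def decode_ata_cmd(cmd_field):
    """Decode the LBA and sector count from a libata 'cmd' field (assumed layout)."""
    _command, low, high, _device = cmd_field.split("/")
    _feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    _hfeature, hnsect, hlbal, hlbam, hlbah = (int(x, 16) for x in high.split(":"))
    lba = int.from_bytes(bytes([hlbah, hlbam, hlbal, lbah, lbam, lbal]), "big")
    return {"lba": lba, "sectors": (hnsect << 8) | nsect}


read_cdb = decode_rw16_cdb("88 00 00 00 00 00 fb 11 cf d0 00 00 00 08 00 00")
write_cdb = decode_rw16_cdb("8a 00 00 00 00 00 fb 11 cf d0 00 00 00 08 00 00")
ata_cmd = decode_ata_cmd("25/00:08:d0:cf:11/00:00:fb:00:00/e0")

print(read_cdb)
print(write_cdb)
print(ata_cmd)
```

Both decode to sector 4212248528 with a length of 8 sectors, matching the `print_req_error` lines and the "dma 4096" in the log. So the failed read at 08:39:15 and the later failed write hit the same address.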

When I was removing disk1 I took the opportunity to check and replace and re-route the sata cable for disk5.

When visiting the Disk 5 Settings page the system is unable to read SMART attributes or Disk Capabilities, it also cannot be spun up. So it seems it's also dropped offline on the controller.

tower-diagnostics-20201211-1238.zip

JorgeB · December 11, 2020

Looks more like a power/connection problem. If you already replaced the SATA cable, do the same for power; if it keeps happening after that, it could be the disk.

air_marshall · December 11, 2020

Thanks @JorgeB. I'll still have to rebuild the disk right? No way to avoid doing that?

I'll replace the power cable. I could also put it on a different controller / port and see if it re-occurs at a later date.

No critical data on it.

trurl · December 11, 2020

3 hours ago, air_marshall said:

I'll still have to rebuild the disk right? No way to avoid doing that?

Yes. It is disabled because it is out-of-sync now.
