Got 'read errors' on a disk, replaced cable, now another disk is 'wrong'

dvaldez · July 21, 2019

Here is the sequence of events:

I see an alert saying the array health check has failed; it turns out a disk has read errors.
I read that this 'read error' issue is often a cable problem, so I get a new cable and use a different SATA port on the motherboard.
After booting up again, I see that the disk is now too small to be added back into the array (it's the same disk). I have a Gigabyte motherboard and I read about the HPA (Host Protected Area).
I go into the BIOS and disable the HPA. I also notice the system clock is off, so I correct that as well.
After booting up again, the previously 'read error' disk is now showing as a 'New Device', BUT an additional disk is now showing as 'Wrong'.

What can I do here? I've read of people suggesting doing a 'New Config' in some situations, but I'm reading that this will prevent you from being able to rebuild any existing failed disks, which is what the 'read errors' disk is considered to be, right?

I've attached the before diagnostics (...-0805.zip) and the after diagnostics (...-1103.zip)

This is very stressful and makes me want to add a second parity disk for dual parity once I hopefully get this resolved.

mainstore-diagnostics-20190721-0805.zip mainstore-diagnostics-20190721-1103.zip

Edited July 21, 2019 by dvaldez
added screenshot

dvaldez · July 22, 2019

Any ideas? I don't know what to do at this point

Vr2Io · July 22, 2019

Do you mean HPA was originally enabled? Either way, keep the original BIOS settings for now. If you believe disk3 is in a good state, do a New Config and preserve all assignments. Then start the array and check for anything abnormal, such as an unmountable file system, and run a non-correcting parity check.

Remember: mark parity as valid, and don't format any disk.

If anything is wrong, you can invalidate disk3 again and rebuild it (before this step, all other disks must be mountable and their contents readable), and don't write anything to the array.

Edited July 22, 2019 by Benson

JorgeB · July 22, 2019

Make sure the BIOS backup function is disabled and remove the HPA from disk4. Other disks also have HPA enabled, but only remove it from disk4 for now.

dvaldez · July 22, 2019

Disk 3 should be fine, but it was out of the array for at least 18 hours after the 'read errors' started, and I wrote backups to the array during that time. Is it still OK to do a New Config?

Also yes, originally HPA was enabled, I didn't know.

On my motherboard I have 2 options in the BIOS, HPA or BIOS Backup, and it seems I cannot disable the feature entirely. I changed it from HPA to BIOS Backup.

Edited July 22, 2019 by dvaldez
adding detail

Vr2Io · July 22, 2019

19 minutes ago, dvaldez said:

Disk 3 should be fine, but it was out of the array for at least 18 hours after the 'read errors' started, and I wrote backups to the array during that time. Is it still OK to do a New Config?

If disk 3 is fine, then at least it still holds its data and you won't lose everything on it. If it isn't, its data has to be rebuilt from the other disks. If you have a backup, things are easier. The important thing is that the other disks end up readable and mountable.

I'm not sure whether the HPA affects whether a disk is mountable, so it's best to follow the others' suggestions. In general, you should keep everything in its original state.

Edited July 22, 2019 by Benson

dvaldez · July 24, 2019

I put the BIOS back to the original config, and I did New Config

Now I have all of the original disks assigned in all of the original slots, but I cannot start the array as "The parity drive is not the biggest"

I tried to remove the HPA using hdparm but I am getting some error:

root@mainstore:~# hdparm -N p7814037168 /dev/sdb

/dev/sdb:

setting max visible sectors to 7814037168 (permanent)

SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 10 51 40 01 21 00 00 00 a0 af 00 00 00 00 00 00 00 00 00 00 00 00 00 00

max sectors = 7814035055/7814037168, HPA is enabled

I guess the next step is to try and use the 2nd HPA removal method.

If I can say that all of the data disks are good and have the data that I want, is it possible to just buy and add a larger (6TB or something) disk and put that in the parity slot instead? Would I retain all of the data?

How could I check if the original "read errors" disk is good? I definitely was not doing any rebuilding when it failed but there may have been a parity check running. I have not written any new data to the array to my knowledge, but I do have a few things that automatically mount the array via NFS so I'm not completely sure.

dvaldez · July 24, 2019

OK

I tried to use the HDAT2 method,

when I used the set max command - it gave some error, I think my motherboard may not support these commands...

I then tried the "Auto set max" command, and it said it was successful...

I then booted back into unraid, and set all the disks in the proper slots, and it still says the parity disk is not the biggest disk...

I know I did the correct one in HDAT2 because I disconnected all of the other disks to be sure.

At this point, if possible, I'll just buy a larger disk 6/8TB and put it in the parity slot, will this be ok? If the data on the "read errors" disk was never corrupted or damaged, and it was in fact just a cable issue, will it retain all of the data that was on there?

What should I do?

JorgeB · July 24, 2019

1 hour ago, dvaldez said:

SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 10 51 40 01 21 00 00 00 a0 af 00 00 00 00 00 00 00 00 00 00 00 00 00 00

max sectors = 7814035055/7814037168, HPA is enabled

This is usually controller related, try using a different controller or removing the HPA on another computer.
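
For reference, the `max sectors` line quoted above can be read as visible/native sector counts, and the difference is what the HPA hides. Below is a minimal Python sketch of that arithmetic, runnable on any machine. The function names are made up for illustration, and the 512-byte sector size is taken from the `hdparm -I` output posted later in this thread.

```python
import re

# Logical sector size as reported by hdparm -I for these disks (512 bytes).
SECTOR_BYTES = 512


def parse_max_sectors(line):
    """Return (visible, native) sector counts from an `hdparm -N` status line."""
    match = re.search(r"max sectors\s*=\s*(\d+)/(\d+)", line)
    if match is None:
        raise ValueError("not an hdparm -N 'max sectors' line: %r" % line)
    return int(match.group(1)), int(match.group(2))


def hpa_summary(line):
    """Summarise how much of the disk is hidden by the HPA."""
    visible, native = parse_max_sectors(line)
    hidden = native - visible
    return {
        "visible_sectors": visible,
        "native_sectors": native,
        "hidden_sectors": hidden,
        "hidden_bytes": hidden * SECTOR_BYTES,
    }


if __name__ == "__main__":
    # Line copied from the hdparm output posted in this thread.
    info = hpa_summary("max sectors = 7814035055/7814037168, HPA is enabled")
    print(info["hidden_sectors"], "sectors hidden,", info["hidden_bytes"], "bytes")
```

For the line above this works out to 2113 hidden sectors, roughly 1 MiB at the end of the disk. The native count (the second number) is the value that was passed to `hdparm -N p...` in the failed attempt.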

Vr2Io · July 24, 2019

5 hours ago, dvaldez said:

If I can say that all of the data disks are good and have the data that I want, is it possible to just buy and add a larger (6TB or something) disk and put that in the parity slot instead? Would I retain all of the data?

Yes. You can even start the array now without parity and add parity later.

The reason for keeping all disks in their original state was to preserve a last chance to recover data from parity and the other disks.

But I still can't confirm whether a data disk modified by HPA is still mountable.

Edited July 24, 2019 by Benson

dvaldez · July 24, 2019

well luckily I was never able to properly remove HPA on the 'read errors' data disk, so it should be in the same state it was in when the read errors occurred, unless HPA was changed automatically afterwards...

dvaldez · July 26, 2019

I have a 6TB drive coming in the mail to use as a replacement parity disk, but I tried to start the array without the parity, just the remaining 4 disks.

At this time the last 2 disks are unmountable:

I didn't make any changes to disk 3, but I previously tried to remove the HPA on disk 4, which failed. How can I fix this?

I have attached an updated diagnostic if it matters

mainstore-diagnostics-20190726-0225.zip

Edited July 26, 2019 by dvaldez
correcting detail

dvaldez · July 26, 2019

should I try doing the xfs_repair method on the 2 disks?

dvaldez · July 26, 2019

So I went for an xfs_repair on disk 3, the disk I think (if I remember correctly) has the least amount of data

root@mainstore:~# xfs_repair -v /dev/md3
Phase 1 - find and verify superblock...
error reading superblock 4 -- seek to offset 4000785907712 failed
couldn't verify primary superblock - attempted to perform I/O beyond EOF !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
error reading superblock 4 -- seek to offset 4000785907712 failed
unable to verify superblock, continuing...
.found candidate secondary superblock...
error reading superblock 4 -- seek to offset 4000785907712 failed
unable to verify superblock, continuing...
.found candidate secondary superblock...
error reading superblock 4 -- seek to offset 4000785907712 failed
unable to verify superblock, continuing...
..found candidate secondary superblock...
error reading superblock 4 -- seek to offset 4000785907712 failed
unable to verify superblock, continuing...
.found candidate secondary superblock...
error reading superblock 4 -- seek to offset 4000785907712 failed
unable to verify superblock, continuing...
.found candidate secondary superblock...
error reading superblock 4 -- seek to offset 4000785907712 failed
unable to verify superblock, continuing...
......................................................................................................................................

The dots have kept coming for at least 40 minutes and are still going. Should I cancel at this point? Nothing else has been printed.

Vr2Io · July 26, 2019

33 minutes ago, dvaldez said:

So I went for an xfs_repair on disk 3, the disk I think (if I remember correctly) has the least amount of data


The dots have kept coming for at least 40 minutes and are still going. Should I cancel at this point? Nothing else has been printed.

Stop it.

Vr2Io · July 26, 2019

On 7/24/2019 at 1:25 PM, dvaldez said:

SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 10 51 40 01 21 00 00 00 a0 af 00 00 00 00 00 00 00 00 00 00 00 00 00 00

max sectors = 7814035055/7814037168, HPA is enabled

I guess the next step is to try and use the 2nd HPA removal method.

How about trying the HPA removal on disk3 and disk4 with a different disk controller?

Disks 1, 3 and 4 have HPA, but disk 1 is mountable.

What is the output of the following? (I want to check whether the failure is because the disks are in a frozen state.)

hdparm -I /dev/sde (disk3)

hdparm -I /dev/sdd (disk4)

Edited July 26, 2019 by Benson

JorgeB · July 26, 2019

Unsupported partition layout means the current partition isn't using the full disk, because of the HPA. If you had valid parity you could rebuild one disk at a time; without it, I would use UD to mount the disks and copy the data to other disk(s) before re-formatting.
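
One way to picture the size mismatch is the rough Python sketch below. It is not Unraid's actual check, only an illustration assuming the partition is expected to run from its start sector to the very end of the disk. The two sector counts come from the `hdparm -N` output earlier in the thread. The start sector of 64 and the function name are hypothetical, and the thread does not say which of the two mismatch directions applied to disks 3 and 4.

```python
def layout_status(start_sector, partition_sectors, disk_sectors):
    """Compare where a partition ends with where the disk ends."""
    end = start_sector + partition_sectors
    if end == disk_sectors:
        return "fills disk"
    if end < disk_sectors:
        return "ends before end of disk"
    return "extends past end of disk"


HPA_VISIBLE = 7814035055  # sectors visible while the HPA is enabled (from hdparm -N)
NATIVE = 7814037168       # native sector count (from hdparm -N)
START = 64                # hypothetical partition start, for illustration only

if __name__ == "__main__":
    # Partition sized for the HPA-limited disk, seen on the full-size disk.
    print(layout_status(START, HPA_VISIBLE - START, NATIVE))
    # Partition sized for the full disk, seen while the HPA hides the tail.
    print(layout_status(START, NATIVE - START, HPA_VISIBLE))
    # Sizes agree, so the layout matches.
    print(layout_status(START, NATIVE - START, NATIVE))
```

Either way, a change in the visible sector count after the partition was created leaves the partition and the disk disagreeing about where the disk ends.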

dvaldez · July 26, 2019

what is UD?

dvaldez · July 26, 2019

Here are the outputs of hdparm -I for both the unmountable disks:

root@mainstore:~# hdparm -I /dev/sde

/dev/sde:

ATA device, with non-removable media
	Model Number:       WDC WD40EFRX-68N32N0                    
	Serial Number:      WD-WCC7K5JCVK5E
	Firmware Revision:  82.00A82
	Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
	Used: unknown (minor revision code 0x006d) 
	Supported: 10 9 8 7 6 5 
	Likely used: 10
Configuration:
	Logical		max	current
	cylinders	16383	0
	heads		16	0
	sectors/track	63	0
	--
	LBA    user addressable sectors:   268435455
	LBA48  user addressable sectors:  7814035055
	Logical  Sector size:                   512 bytes
	Physical Sector size:                  4096 bytes
	Logical Sector-0 offset:                  0 bytes
	device size with M = 1024*1024:     3815446 MBytes
	device size with M = 1000*1000:     4000785 MBytes (4000 GB)
	cache/buffer size  = unknown
	Form Factor: 3.5 inch
	Nominal Media Rotation Rate: 5400
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, with device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	    	Power-Up In Standby feature set
	   *	SET_FEATURES required to spinup after power up
	    	SET_MAX security extension
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	64-bit World wide name
	   *	IDLE_IMMEDIATE with UNLOAD
	   *	WRITE_UNCORRECTABLE_EXT command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Gen3 signaling speed (6.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Host-initiated interface power management
	   *	Phy event counters
	   *	Idle-Unload when NCQ is active
	   *	NCQ priority information
	   *	READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
	    	DMA Setup Auto-Activate optimization
	    	Device-initiated interface power management
	   *	Software settings preservation
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Write Same (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	    	unknown 206[12] (vendor specific)
	    	unknown 206[13] (vendor specific)
	   *	DOWNLOAD MICROCODE DMA command
	   *	WRITE BUFFER DMA command
	   *	READ BUFFER DMA command
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
	not	frozen
	not	expired: security count
		supported: enhanced erase
	480min for SECURITY ERASE UNIT. 480min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee21119cfe8
	NAA		: 5
	IEEE OUI	: 0014ee
	Unique ID	: 21119cfe8
Checksum: correct
root@mainstore:~# hdparm -I /dev/sdd

/dev/sdd:

ATA device, with non-removable media
	Model Number:       WDC WD40EFRX-68N32N0                    
	Serial Number:      WD-WCC7K5XR9FEE
	Firmware Revision:  82.00A82
	Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
	Used: unknown (minor revision code 0x006d) 
	Supported: 10 9 8 7 6 5 
	Likely used: 10
Configuration:
	Logical		max	current
	cylinders	16383	0
	heads		16	0
	sectors/track	63	0
	--
	LBA    user addressable sectors:   268435455
	LBA48  user addressable sectors:  7814035055
	Logical  Sector size:                   512 bytes
	Physical Sector size:                  4096 bytes
	Logical Sector-0 offset:                  0 bytes
	device size with M = 1024*1024:     3815446 MBytes
	device size with M = 1000*1000:     4000785 MBytes (4000 GB)
	cache/buffer size  = unknown
	Form Factor: 3.5 inch
	Nominal Media Rotation Rate: 5400
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, with device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	    	Power-Up In Standby feature set
	   *	SET_FEATURES required to spinup after power up
	    	SET_MAX security extension
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	64-bit World wide name
	   *	IDLE_IMMEDIATE with UNLOAD
	   *	WRITE_UNCORRECTABLE_EXT command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Gen3 signaling speed (6.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Host-initiated interface power management
	   *	Phy event counters
	   *	Idle-Unload when NCQ is active
	   *	NCQ priority information
	   *	READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
	    	DMA Setup Auto-Activate optimization
	    	Device-initiated interface power management
	   *	Software settings preservation
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Write Same (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	    	unknown 206[12] (vendor specific)
	    	unknown 206[13] (vendor specific)
	   *	DOWNLOAD MICROCODE DMA command
	   *	WRITE BUFFER DMA command
	   *	READ BUFFER DMA command
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
	not	frozen
	not	expired: security count
		supported: enhanced erase
	490min for SECURITY ERASE UNIT. 490min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee2bb3783a5
	NAA		: 5
	IEEE OUI	: 0014ee
	Unique ID	: 2bb3783a5
Checksum: correct
root@mainstore:~#

BRiT · July 26, 2019

UD is the Unassigned Devices plugin.

Vr2Io · July 26, 2019

3 hours ago, dvaldez said:

Here are the outputs of hdparm -I for both the unmountable disks:

Neither disk is frozen.
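
For anyone who wants to check this on a long `hdparm -I` dump without scrolling, here is a small Python sketch that looks for the frozen flag in the Security section. The function name is made up. The "not frozen" line format is taken from the output posted above, while the format of a frozen disk (the same line without "not") is an assumption.

```python
def security_frozen(identify_text):
    """Return True/False for the 'frozen' flag in `hdparm -I` output, or None if not found."""
    in_security = False
    for raw in identify_text.splitlines():
        if raw.startswith("Security:"):
            in_security = True
            continue
        if in_security:
            # A non-indented line marks the start of the next top-level section.
            if raw and not raw[0].isspace():
                break
            words = raw.split()
            if words and words[-1] == "frozen":
                # "not frozen" -> False; a bare "frozen" line (assumed format) -> True.
                return words[0] != "not"
    return None


# Security section as posted in this thread for /dev/sde (tabs preserved).
SAMPLE = (
    "Security: \n"
    "\tMaster password revision code = 65534\n"
    "\t\tsupported\n"
    "\tnot\tenabled\n"
    "\tnot\tlocked\n"
    "\tnot\tfrozen\n"
    "\tnot\texpired: security count\n"
    "\t\tsupported: enhanced erase\n"
    "Logical Unit WWN Device Identifier: 50014ee21119cfe8\n"
)

if __name__ == "__main__":
    print(security_frozen(SAMPLE))
```

With the text saved from `hdparm -I /dev/sde`, passing the file contents to `security_frozen` gives the same answer as reading the Security block by eye.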

dvaldez · July 26, 2019

How do I use the UD plugin to copy the data out of the 'unmountable' disks? I have it installed, but if I unassign the 2 "unmountable" disks, I cannot start the array, even in maintenance mode

Vr2Io · July 26, 2019

If you can't mount disks 3 and 4 in Unraid, whether in the array or through UD, I would suggest using the bootable HDAT2 to remove the HPA from disks 3 and 4 only, then trying to mount them in Unraid or running xfs_repair again.

Edited July 26, 2019 by Benson

Vr2Io · July 26, 2019

If you want to try removing the HPA from disks 3 and 4 in Unraid again, please:

. Power off the whole system

. Disconnect the SATA cables of disks 3 and 4

. Power on and boot into Unraid

. Reconnect the SATA cables

. Try removing the HPA on disks 3 and 4 again

Edited July 26, 2019 by Benson

dvaldez · July 26, 2019

Finally a positive update:

I ran the hdparm -N command on the 2 unmountable disks, as well as on the parity disk that was "not the largest disk in the array", while they were plugged into a DIFFERENT storage controller, and this time it succeeded.

I did a New Config with the 4 data disks in the proper positions (as 2 of the disks were showing 'wrong') and was able to see all of the data.

I then assigned the parity disk and it seems to have been recognized, as it DIDN'T show the message "all data on this disk will be erased when array started".

Parity was showing as invalid (as expected, given disk 3's 18 hours of downtime) and it is now running a parity sync/data rebuild.
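
As a footnote on why a parity sync is needed here: single parity behaves like a bitwise XOR across the data disks, so any disk can be rebuilt from parity plus the others, but only while parity matches what is physically on every disk. The Python sketch below is a toy illustration with made-up two-byte "disks" and hypothetical function names, not Unraid's implementation.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(parity, surviving_blocks):
    """Recover the one missing block from parity and all the other blocks."""
    return xor_parity([parity] + list(surviving_blocks))


# Made-up contents for three tiny data "disks".
disk1, disk2, disk3 = b"\x10\x20", b"\x0f\xf0", b"\xaa\x55"
parity = xor_parity([disk1, disk2, disk3])

# With matching parity, disk3 can be rebuilt from parity and the other disks.
rebuilt = rebuild_missing(parity, [disk1, disk2])

# Suppose a write lands on the emulated disk3 while the physical disk3 is out:
# only parity changes, and the physical disk3 keeps its old contents.
emulated_disk3 = b"\xab\x55"
parity_after_write = xor_parity([disk1, disk2, emulated_disk3])

# Parity recomputed from the physical disks no longer equals the stored parity.
parity_matches_physical = xor_parity([disk1, disk2, disk3]) == parity_after_write

if __name__ == "__main__":
    print(rebuilt == disk3, parity_matches_physical)
```

Once the physical disks are trusted again through New Config, recomputing parity from them brings it back in line with what is actually on the disks.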
