Got 'read errors' on a disk, replaced cable, now another disk is 'wrong'



Here is the sequence of events:

 

  1. I see an alert saying health of array has failed, turns out a disk has read errors
  2. I read that this 'read error' issue is often a cable issue, so I get a new cable and use a different SATA port on the motherboard.
  3. After booting up again, I see that the disk is now too small to be added back into the array (it's the same disk).  I have a Gigabyte motherboard and I read about the HPA partition
  4. I go into BIOS and disable the HPA, I also notice the system clock is off, so I update that as well
  5. After booting up again, the previously 'read error' disk is now showing as a 'New Device', BUT an additional disk is now showing as 'Wrong'

 

What can I do here?  I've read of people suggesting doing a 'New Config' in some situations, but I'm reading that this will prevent you from being able to rebuild any existing failed disks, which is what the 'read errors' disk is considered to be, right?

 

I've attached the before diagnostics (...-0805.zip) and the after diagnostics (...-1103.zip)

 

This is very stressful :( makes me want to add a parity disk for dual parity once I hopefully get this resolved

[Attached screenshot: ScreenShot2019-07-21at4_09_12AM.png]

 

mainstore-diagnostics-20190721-0805.zip mainstore-diagnostics-20190721-1103.zip

Edited by dvaldez
added screenshot

You mean HPA was originally enabled? Either way, keep the original BIOS setting for now. If you assume disk3 is in a good state, do a New Config and preserve all assignments. Then start the array and check for anything abnormal, such as an unmountable file system, and run a non-correcting parity check.

 

Remember: mark parity as valid, and don't format any disk.

 

If anything is wrong, you can invalidate disk3 again and rebuild it (before this step, all the other disks must be mountable and their contents readable). Also, don't write anything to the array.

Edited by Benson

Disk 3 should be fine, but it was out of the array for at least 18 hours after the 'read errors' started, and I wrote backups to the array during that time. Is it still OK to do a New Config?

 

Also yes, originally HPA was enabled, I didn't know.

 

On my motherboard I have 2 options in BIOS: HPA or BIOS Backup. It seems I cannot disable the feature entirely, so I changed from HPA to BIOS Backup.

Edited by dvaldez
adding detail
19 minutes ago, dvaldez said:

Disk 3 should be fine, but it was out of the array for at least 18 hours after the 'read errors' started, and I wrote backups to the array during that time. Is it still OK to do a New Config?

If disk 3 is fine, then at least it still has its data and you won't lose everything on it. If its data is lost, it will need to be rebuilt from the other disks. If you have a backup, things are easier. The important thing is that the other disks end up readable and mountable.

 

I'm not sure whether the HPA will leave a disk unmountable, so you may do best to follow others' suggestions. In general, keep everything in its original state.

Edited by Benson

I put the BIOS back to the original config, and I did New Config

 

Now I have all of the original disks assigned in all of the original slots, but I cannot start the array as "The parity drive is not the biggest"

 

I tried to remove the HPA using hdparm but I am getting some error:

 

root@mainstore:~# hdparm -N p7814037168 /dev/sdb

 

/dev/sdb:

 setting max visible sectors to 7814037168 (permanent)

SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 10 51 40 01 21 00 00 00 a0 af 00 00 00 00 00 00 00 00 00 00 00 00 00 00

 max sectors   = 7814035055/7814037168, HPA is enabled
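
For a sense of how much space the HPA is hiding, the two numbers on the `max sectors` line (current/native) can be subtracted. A minimal Python sketch, assuming 512-byte logical sectors; the helper name `hpa_hidden` is made up for illustration and is not part of hdparm:

```python
import re

SECTOR_BYTES = 512  # assumption: 512-byte logical sectors

def hpa_hidden(line):
    """Return (hidden_sectors, hidden_bytes) from an `hdparm -N` 'max sectors' line."""
    m = re.search(r"max sectors\s*=\s*(\d+)/(\d+)", line)
    if m is None:
        raise ValueError("no 'max sectors = current/native' pair found")
    current, native = int(m.group(1)), int(m.group(2))
    hidden = native - current
    return hidden, hidden * SECTOR_BYTES

# The line reported for /dev/sdb above.
line = " max sectors   = 7814035055/7814037168, HPA is enabled"
sectors, size = hpa_hidden(line)
print(f"{sectors} sectors hidden ({size} bytes, about {size / 2**20:.2f} MiB)")
# prints: 2113 sectors hidden (1081856 bytes, about 1.03 MiB)
```

So the HPA is only clipping about 1 MiB off the end of the disk, which would still be consistent with the "too small" and "not the biggest" messages, since any shortfall is enough to fail a size comparison.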

 

I guess the next step is to try and use the 2nd HPA removal method.

 

If I can say that all of the data disks are good and have the data that I want, is it possible to just buy and add a larger (6TB or something) disk and put that in the parity slot instead?  Would I retain all of the data?

 

How could I check if the original "read errors" disk is good?  I definitely was not doing any rebuilding when it failed but there may have been a parity check running.  I have not written any new data to the array to my knowledge, but I do have a few things that automatically mount the array via NFS so I'm not completely sure.


OK

 

I tried the HDAT2 method.

When I used the set max command, it gave an error. I think my motherboard may not support these commands...

I then tried the "Auto set max" command, and it said it was successful...

I then booted back into Unraid and set all the disks in the proper slots, and it still says the parity disk is not the biggest disk...

I know I did the correct one in HDAT2 because I disconnected all of the other disks to be sure.

 

At this point, if possible, I'll just buy a larger disk 6/8TB and put it in the parity slot, will this be ok?  If the data on the "read errors" disk was never corrupted or damaged, and it was in fact just a cable issue, will it retain all of the data that was on there?

 

What should I do?

1 hour ago, dvaldez said:

SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 10 51 40 01 21 00 00 00 a0 af 00 00 00 00 00 00 00 00 00 00 00 00 00 00

 max sectors   = 7814035055/7814037168, HPA is enabled

This is usually controller related. Try a different controller, or remove the HPA on another computer.

5 hours ago, dvaldez said:

If I can say that all of the data disks are good and have the data that I want, is it possible to just buy and add a larger (6TB or something) disk and put that in the parity slot instead?  Would I retain all of the data? 

Yes. You can even start the array now without parity and add parity later.

 

The reason for keeping all disks in their original state is to preserve a last chance to recover data from parity and the other disks.

 

But I still can't confirm whether a data disk modified by HPA is still mountable.

 

Edited by Benson

I have a 6TB drive coming in the mail to use as a replacement parity disk, but I tried to start the array without the parity, just the remaining 4 disks.

 

At this time the last 2 disks are unmountable:

 

PR6JvVr.png

 

I didn't make any changes to disk 3, but I previously tried to remove the HPA on disk 4, and that failed. How can I fix this?

 

I have attached an updated diagnostic if it matters

mainstore-diagnostics-20190726-0225.zip

Edited by dvaldez
correcting detail

So I went for an xfs_repair on disk 3, the disk I think (if I remember correctly) has the least amount of data

 

root@mainstore:~# xfs_repair -v /dev/md3
Phase 1 - find and verify superblock...
error reading superblock 4 -- seek to offset 4000785907712 failed
couldn't verify primary superblock - attempted to perform I/O beyond EOF !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
error reading superblock 4 -- seek to offset 4000785907712 failed
unable to verify superblock, continuing...
.found candidate secondary superblock...
error reading superblock 4 -- seek to offset 4000785907712 failed
unable to verify superblock, continuing...
.found candidate secondary superblock...
error reading superblock 4 -- seek to offset 4000785907712 failed
unable to verify superblock, continuing...
..found candidate secondary superblock...
error reading superblock 4 -- seek to offset 4000785907712 failed
unable to verify superblock, continuing...
.found candidate secondary superblock...
error reading superblock 4 -- seek to offset 4000785907712 failed
unable to verify superblock, continuing...
.found candidate secondary superblock...
error reading superblock 4 -- seek to offset 4000785907712 failed
unable to verify superblock, continuing...
......................................................................................................................................

 

 

The dots have continued for at least 40 minutes and are still going. Should I cancel at this point? There hasn't been any other output.

33 minutes ago, dvaldez said:

So I went for an xfs_repair on disk 3, the disk I think (if I remember correctly) has the least amount of data

 



 

 

The dots have continued for at least 40 minutes and are still going. Should I cancel at this point? There hasn't been any other output.

Stop it.

On 7/24/2019 at 1:25 PM, dvaldez said:

SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 10 51 40 01 21 00 00 00 a0 af 00 00 00 00 00 00 00 00 00 00 00 00 00 00

 max sectors   = 7814035055/7814037168, HPA is enabled

 

I guess the next step is to try and use the 2nd HPA removal method.

How about trying the HPA removal for disk3 and disk4 on a different disk controller?

 

Disks 1, 3 and 4 have an HPA, but disk 1 is mountable.

 

What is the output of the following? (I want to check whether the failure is because the disk is frozen.)

hdparm -I /dev/sde (disk3)

hdparm -I /dev/sdd (disk4)

 

 

Edited by Benson

Here are the outputs of hdparm -I for both the unmountable disks:

 

root@mainstore:~# hdparm -I /dev/sde

/dev/sde:

ATA device, with non-removable media
	Model Number:       WDC WD40EFRX-68N32N0                    
	Serial Number:      WD-WCC7K5JCVK5E
	Firmware Revision:  82.00A82
	Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
	Used: unknown (minor revision code 0x006d) 
	Supported: 10 9 8 7 6 5 
	Likely used: 10
Configuration:
	Logical		max	current
	cylinders	16383	0
	heads		16	0
	sectors/track	63	0
	--
	LBA    user addressable sectors:   268435455
	LBA48  user addressable sectors:  7814035055
	Logical  Sector size:                   512 bytes
	Physical Sector size:                  4096 bytes
	Logical Sector-0 offset:                  0 bytes
	device size with M = 1024*1024:     3815446 MBytes
	device size with M = 1000*1000:     4000785 MBytes (4000 GB)
	cache/buffer size  = unknown
	Form Factor: 3.5 inch
	Nominal Media Rotation Rate: 5400
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, with device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	    	Power-Up In Standby feature set
	   *	SET_FEATURES required to spinup after power up
	    	SET_MAX security extension
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	64-bit World wide name
	   *	IDLE_IMMEDIATE with UNLOAD
	   *	WRITE_UNCORRECTABLE_EXT command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Gen3 signaling speed (6.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Host-initiated interface power management
	   *	Phy event counters
	   *	Idle-Unload when NCQ is active
	   *	NCQ priority information
	   *	READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
	    	DMA Setup Auto-Activate optimization
	    	Device-initiated interface power management
	   *	Software settings preservation
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Write Same (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	    	unknown 206[12] (vendor specific)
	    	unknown 206[13] (vendor specific)
	   *	DOWNLOAD MICROCODE DMA command
	   *	WRITE BUFFER DMA command
	   *	READ BUFFER DMA command
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
	not	frozen
	not	expired: security count
		supported: enhanced erase
	480min for SECURITY ERASE UNIT. 480min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee21119cfe8
	NAA		: 5
	IEEE OUI	: 0014ee
	Unique ID	: 21119cfe8
Checksum: correct
root@mainstore:~# hdparm -I /dev/sdd

/dev/sdd:

ATA device, with non-removable media
	Model Number:       WDC WD40EFRX-68N32N0                    
	Serial Number:      WD-WCC7K5XR9FEE
	Firmware Revision:  82.00A82
	Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
	Used: unknown (minor revision code 0x006d) 
	Supported: 10 9 8 7 6 5 
	Likely used: 10
Configuration:
	Logical		max	current
	cylinders	16383	0
	heads		16	0
	sectors/track	63	0
	--
	LBA    user addressable sectors:   268435455
	LBA48  user addressable sectors:  7814035055
	Logical  Sector size:                   512 bytes
	Physical Sector size:                  4096 bytes
	Logical Sector-0 offset:                  0 bytes
	device size with M = 1024*1024:     3815446 MBytes
	device size with M = 1000*1000:     4000785 MBytes (4000 GB)
	cache/buffer size  = unknown
	Form Factor: 3.5 inch
	Nominal Media Rotation Rate: 5400
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, with device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	    	Power-Up In Standby feature set
	   *	SET_FEATURES required to spinup after power up
	    	SET_MAX security extension
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	64-bit World wide name
	   *	IDLE_IMMEDIATE with UNLOAD
	   *	WRITE_UNCORRECTABLE_EXT command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Gen3 signaling speed (6.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Host-initiated interface power management
	   *	Phy event counters
	   *	Idle-Unload when NCQ is active
	   *	NCQ priority information
	   *	READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
	    	DMA Setup Auto-Activate optimization
	    	Device-initiated interface power management
	   *	Software settings preservation
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Write Same (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	    	unknown 206[12] (vendor specific)
	    	unknown 206[13] (vendor specific)
	   *	DOWNLOAD MICROCODE DMA command
	   *	WRITE BUFFER DMA command
	   *	READ BUFFER DMA command
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
	not	frozen
	not	expired: security count
		supported: enhanced erase
	490min for SECURITY ERASE UNIT. 490min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee2bb3783a5
	NAA		: 5
	IEEE OUI	: 0014ee
	Unique ID	: 2bb3783a5
Checksum: correct
root@mainstore:~# 
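
Two things in this output are relevant to the HPA question: the `LBA48 user addressable sectors` count, which is still the clipped 7814035055 seen in the earlier `hdparm -N` output, and the `frozen` line under `Security:`. A rough Python sketch that pulls both out of saved `hdparm -I` text; the function name and dictionary keys are hypothetical, and the native sector count is passed in by the caller on the assumption that these drives share the 7814037168 figure reported for /dev/sdb earlier:

```python
import re

def summarize_hdparm_identify(text, native_sectors):
    """Summarize visible size and security-freeze state from `hdparm -I` text.

    native_sectors comes from the caller (e.g. from `hdparm -N`); it is not
    read from the identify output.
    """
    sectors_m = re.search(r"LBA48\s+user addressable sectors:\s*(\d+)", text)
    size_m = re.search(r"Logical\s+Sector size:\s*(\d+)\s*bytes", text)
    if sectors_m is None or size_m is None:
        raise ValueError("expected LBA48 sector count and logical sector size")
    sectors = int(sectors_m.group(1))
    sector_bytes = int(size_m.group(1))

    # The output above shows "not<TAB>frozen"; assumption: a frozen drive
    # shows the same line without the leading "not".
    frozen_m = re.search(r"^[ \t]*(not[ \t]+)?frozen[ \t]*$", text, re.MULTILINE)
    frozen = None if frozen_m is None else frozen_m.group(1) is None

    return {
        "visible_sectors": sectors,
        "visible_bytes": sectors * sector_bytes,
        "hidden_sectors": native_sectors - sectors,
        "frozen": frozen,
    }

# Excerpt of the /dev/sde output above, with tabs written as \t.
SAMPLE = (
    "\tLBA48  user addressable sectors:  7814035055\n"
    "\tLogical  Sector size:                   512 bytes\n"
    "\tnot\tlocked\n"
    "\tnot\tfrozen\n"
)

info = summarize_hdparm_identify(SAMPLE, native_sectors=7814037168)
print(info)
```

Under that assumption, both disks come out with 2113 hidden sectors and `frozen` as False, which would suggest a frozen security state is not what blocked the earlier `hdparm -N` attempt.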

 


Finally a positive update:

 

I ran the hdparm -N command on the 2 unmountable disks, as well as on the parity disk that was "not the largest disk in the array", while they were plugged into a DIFFERENT storage controller. It succeeded this time.

 

I did a new config with the 4 data disks in the proper positions (as 2 of the disks were showing 'wrong') and was able to see all of the data

 

I then assigned the parity disk and it seems to have been recognized, as it DIDN'T show the message "all data on this disk will be erased when array started"

 

Parity was showing invalid (as expected, given the 18 hours disk 3 was out of the array) and it is now running a parity sync/data rebuild.

 

 
