redballed disk - can't read block - General Support (V5 and Older)

December 12, 2012

Hi,

After using my unRAID server faultlessly for two years with great satisfaction, I've now run into some issues.

Two 2TB disks, my cache drive and one array disk, are giving problems. The array disk was redballed, whilst the cache drive was so badly corrupted that it prevented the whole array from coming up, and it didn't show up in any Windows reiserfs reader when attached via USB.

Both disks were mounted in the same Sharkoon (Kingwin in the US) 3-in-2 hotswap bay. When I installed this bay, one of the power connectors broke. I managed to reattach it and had no issues for two years.

With the current disk issues, my trust in the hotswap bay was gone, and I assumed the broken power connector had moved due to vibration etc. and was causing the problems. As I also needed room for expansion, I replaced the bays with the Norco 5-in-3 bay (the outgoing model, on sale at a supplier here in the Netherlands). The disk used for cache was sent back to the supplier and replaced under warranty. This new 2TB disk will now be added to the array (I'm going to use a smaller 500GB disk for cache). The SMART report for the redballed array disk looked OK, so I unassigned and re-assigned it and let the data rebuild, with the disk now placed in the Norco cage. All seemed solved.

However, this morning the disk in the array was redballed again and my syslog showed the following entry every 10 sec:

Dec 12 19:14:36 Tower kernel: mdcmd (7318): spindown 4 (Routine)
Dec 12 19:14:46 Tower emhttp: disk_spinning: open: No such file or directory (Other emhttp)

Dec 12 19:14:46 Tower emhttp: mdcmd: write: No such device or address (Other emhttp)

I rebooted the server; the syslog from bootup until now is attached. The affected disk is disk 4, serial WCAVY6895159. Lots of errors, and the affected disk seems to show up as both sdf and sdi?

The SMART report of the affected disk looks OK and is attached. As a next step I ran reiserfsck, which aborted with the following:

Checking internal tree.. \/ 3 (of 19|/ 95 (of 168\/ 47 (of 164-
The problem has occurred looks like a hardware problem. If you have
bad blocks, we advise you to get a new hard drive, because once you
get one bad block that the disk drive internals cannot hide from
your sight,the chances of getting more are generally said to become
much higher (precise statistics are unknown to us), and this disk
drive is probably not expensive enough for you to you to risk your
time and data on it. If you don't want to follow that follow that
advice then if you have just a few bad blocks, try writing to the
bad blocks and see if the drive remaps the bad blocks (that means
it takes a block it has in reserve and allocates it for use for
of that block number). If it cannot remap the block, use badblock
option (-B) with reiserfs utils to handle this block correctly.

bread: Cannot read the block (187662361): (Input/output error).
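For a sense of where that unreadable block sits on the disk, the block number can be turned into a byte offset. A minimal Python sketch, assuming the filesystem uses the reiserfs default block size of 4096 bytes (the post does not state the block size, so treat that constant as an assumption; the function name is only illustrative):

```python
import re

# Assumption: 4096-byte blocks (the reiserfs default); not stated in the post.
BLOCK_SIZE = 4096


def bad_block_offset(line, block_size=BLOCK_SIZE):
    """Return (block_number, byte_offset) from a reiserfsck 'bread' error line."""
    m = re.search(r"Cannot read the block \((\d+)\)", line)
    if m is None:
        raise ValueError("no block number found in line")
    block = int(m.group(1))
    return block, block * block_size


line = "bread: Cannot read the block (187662361): (Input/output error)."
block, offset = bad_block_offset(line)
print(block, offset, round(offset / 1e9, 1))
```

Under that assumption the block lies roughly 769 GB into the partition, so well inside the data area of a 2TB disk rather than at its very start or end.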

This seems very much like the problem user 'lostincable' describes in his very recent thread 'failing disk'?

I see the following steps:

1) Recheck the hardware: as the disk has now been disabled in both the Sharkoon and the Norco cage, and the other 4 disks in the Norco cage seem fine, the cage seems unlikely to be the problem. Parity, disk1 and disk2 were in the OK Sharkoon cage; disk3, disk4 and the cache were in the dodgy cage. That leaves the cable and the motherboard SATA port to check.

2) Follow the steps suggested by 'Joe L' in the 'failing disk' thread:

   1) Power down server
   2) Disconnect drive power
   3) Re power server with drive disconnected
   4) Check if drive is being simulated by parity
   5) If drive is being simulated by parity run the reiserfsck with rebuild tree and hopefully it completes
   6) Shut down server, reconnect drive and power back up

Am I correct in these steps? Any other suggestions also welcome.

SMART_12122012_WCAVY6895159.pdf

December 12, 2012

Author

Trying to add the syslog but too big... finding a way

December 12, 2012

Author

See if this works

https://dl.dropbox.com/u/9280177/syslog-2012-12-12-1.txt

December 12, 2012

Author

Going through the syslog I also see a lot of other issues...

*Errors on ATA6. This is the recently added 2TB disk, pre-cleared but not added to the array yet. This is the only disk in the second Norco cage.

*"handle_stripe read error" on disk3. Can't remember seeing these earlier...

Looks like my unraid box is dying on me! :'(

Edit: I'm probably wrong in assuming that the ATA number corresponds to the physical SATA port, so the first bullet/conclusion is probably not valid.

December 12, 2012

Author

I'm going to reboot without the affected disk and the new disk, see what happens...

BTW the affected disk is still under warranty I believe so probably can be exchanged if needed. I probably need to demonstrate that it has failed though.

December 12, 2012

Author

Enclosed overview of disks.

According to the second column (and not diskx, disky):

*number 5 is giving me headaches

*number 6 is the new disk replacing the broken (?) cache disk

*number 1-3 were previously in the 'ok' Sharkoon cage

*number 4-6 were previously in the suspect Sharkoon cage

*number 1-5 are now together in a Norco cage

*number 6 is alone in a Norco cage

*number 15 is alone in a Norco cage.

configuration_disks_server3.pdf

December 12, 2012

Author

And syslog with

*number 5, 'trouble' disk.

*number 6, new unassigned disk

*number 15 new unassigned disk (meant for future cache)

all removed from the server and rebooted.

Still a lot of errors?

syslog-2012-12-12-2.txt

December 12, 2012

Author

Ok, I'm running reiserfsck on number 5 / disk4 (the 'trouble' disk) with the actual physical disk removed from the server, as per Joe L's instructions.

In the meantime I've been looking at the syslog, and I see a lot of errors on ata5. Following some logic in the way drives are assigned, I'm led to assume this is number 4 / disk3.

Looking in MyMain for disk3, the SMART report, I see the following:

» udma_crc_error_count=2

This is explained as:

UltraDMA CRC Error Count : The count of errors in data transfer via the interface cable as determined by ICRC (Interface Cyclic Redundancy Check).

Better check that physical connection again..
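The CRC count is a cumulative lifetime counter, so a single reading of 2 says little on its own; what matters is whether it keeps climbing after the cabling is changed. A rough Python sketch of that comparison, assuming SMART values are saved as `name=value` lines like the one MyMain shows above. The second reading is a made-up example, not a value from this server, and the helper names are hypothetical:

```python
def parse_attrs(text):
    """Parse 'name=value' lines (optionally prefixed with '»') into a dict of ints."""
    attrs = {}
    for raw in text.splitlines():
        line = raw.strip().lstrip("»").strip()
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        try:
            attrs[name.strip()] = int(value.strip())
        except ValueError:
            continue  # ignore non-numeric values
    return attrs


def crc_delta(before, after, key="udma_crc_error_count"):
    """Return how much the CRC counter grew between two saved readings."""
    return parse_attrs(after).get(key, 0) - parse_attrs(before).get(key, 0)


before = "» udma_crc_error_count=2"  # reading quoted in the post
after = "» udma_crc_error_count=5"   # hypothetical later reading, for illustration only
print(crc_delta(before, after))
```

A delta of zero after swapping a cable would suggest the link is now clean; a growing count would point at a problem that is still present.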

December 12, 2012

Author

Additional info:

System: unraid 4.7

SMART reports for following disk also added: number 4 /disk 3

SMART_12122012_WCAZA4944944.pdf

December 13, 2012

Author

Ok, going to bed now, already way too late here...

I see my strategy as follows:

1) finish the reiserfsck on the 'emulated' disk4 and see what that results in.

2) change the SATA cables on SATA ports 4 and 5 (these are disk3 and disk4)

3) Do reiserfsck tree rebuilds and fixes (if that follows from step 1) using 'emulated' disk4

4) Check if BadCRC's and link resets stop on disk3

5) If reiserfsck actions are successful and completed, insert disk 4 in server.

6) remove and re-assign disk4 to array

7) rebuild disk4

8) Check if BadCRC's and link resets stop on disk4

9) Let server run for some time and see if it is all stable now

I know I'm probably looking at two issues at the same time: a corrupted disk4 and probably bad connections on disk3 and disk4. It is never good to troubleshoot two issues at once, but with these steps I hopefully can solve both logically. The biggest risk is probably bad SATA ports on the mainboard, which will not easily be found with this approach, but in that case it seems strange that 2 ports are giving issues whilst the rest are fine?

If all is well, subsequently:

10) Add new disk as disk5 to server on SATA port 6.

11) Add disk5 to array

12) Check if BadCRC's and link resets are present on disk6. I suspect as this port and cable was also connected to the dodgy Sharkoon cage, it might be at risk?

And finally:

13) Add 500GB disk to server on PCI-E 1x SATAII card

14) Pre-clear 500GB disk

15) Add 500GB disk as cache

Lovin' being IT geek in your own home...

December 13, 2012

Author

result of the reiserfsck on the 'emulated' disk4

root@Tower:~# reiserfsck --check /dev/md4
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************

** If you are using the latest reiserfsprogs and it fails **

** please email bug reports to [email protected], **

** providing as much information as possible -- your **

** hardware, kernel, patches, settings, all reiserfsck **

** messages (including version), the reiserfsck logfile, **

** check the syslog file for any related information. **

** If you would like advice on using this program, support **

** is available for $25 at www.namesys.com/support.html. **

*************************************************************

Will read-only check consistency of the filesystem on /dev/md4

Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

###########

reiserfsck --check started at Wed Dec 12 23:09:32 2012

###########

Replaying journal: Done.

Reiserfs journal '/dev/md4' in blocks [18..8211]: 0 transactions replayed

Checking internal tree.. finished

Comparing bitmaps..finished

Checking Semantic tree:

finished

No corruptions found

There are on the filesystem:

Leaves 473384

Internal nodes 2855

Directories 647

Other files 7302

Data block pointers 477965634 (0 of them are zero)

Safe links 0

###########

reiserfsck finished at Thu Dec 13 03:31:05 2012

December 13, 2012

Author

BTW , PSU is a Seasonic S12ii 520w Bronze so I believe that shouldn't be an issue. Certainly not with only 4 drives installed at the moment...

December 13, 2012

Author

So, if I add the conclusions up for the 'trouble' disk:

*The SMART report looks ok, no pending or reallocated sectors.

*reiserfsck on the actual physical disk shows an unreadable block

*reiserfsck on the emulated disk shows no problems.

That should mean I can just reinsert the disk to the server and rebuild the data? The SMART software on the disk will re-allocate any bad sectors?

Or do I need to reiserfsck rebuild the tree or fix on the emulated disk to be sure?

December 13, 2012

So, if I add the conclusions up for the 'trouble' disk:

*The SMART report looks ok, no pending or reallocated sectors.

*reiserfsck on the actual physical disk shows an unreadable block

*reiserfsck on the emulated disk shows no problems.

That should mean I can just reinsert the disk to the server and rebuild the data? The SMART software on the disk will re-allocate any bad sectors?

Correct.

Or do I need to reiserfsck rebuild the tree or fix on the emulated disk to be sure?

I think you meant the physical disk... but no, it will not be necessary.

December 13, 2012

One more point.

If you are getting CRC errors, they are an indication of NOISE on the data connection to the drive. It is as often as not caused by pickup of other signals from nearby cables. If you have neatly tie-wrapped all the SATA cables together, especially if power and data cables are near each other, cut the tie-wraps.

The other possibility is a NOISY power supply; it has the same effect. That can happen if there are too many splitters, long lengths of cable, or a power supply with aging (or defective) capacitors.

Lastly, if the SATA cables are not properly made, or not shielded, or use poor quality wire (not the proper twists per inch of conductors), they are also more likely to pick up noise from other nearby conductors.

Lots to look at. Just remember, if a drive turns "red" a "write" to it failed. It is guaranteed to have the wrong content. It must be re-constructed from parity and the other data drives.

December 13, 2012

Author

Thanks for the help Joe. I think many on the forum owe you a few beers for all your efforts!

What I've done now:

*I removed all the tie wrap holding my SATA cables together (just read that on another post on the forum as well). And it made my case look so clean and tidy...

*I replaced the SATA cables for the SATA ports 4 and 5

*Put the 'trouble' disk back in, it showed as a new disk (blue ball) and the message 'recon_disk, array stopped'. That means I have to rebuild the disk, if I'm correct?

*As I have a new and pre-cleared 2TB (same size) disk ready (the one meant to be added as disk5 later on), I decided to use this one as the replacement disk, instead of the one originally used for disk4.

*I'm now rebuilding disk4 using the new disk.

Looking at the syslog (see attached), I've had a few hiccups on one SATA link:

Dec 13 13:09:33 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x1 SErr 0x400000 action 0x6 frozen (Errors)
Dec 13 13:09:33 Tower kernel: ata6: SError: { Handshk } (Errors)

Dec 13 13:09:33 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED (Minor Issues)

Dec 13 13:09:33 Tower kernel: ata6.00: cmd 60/08:00:a0:88:e0/00:00:e8:00:00/40 tag 0 ncq 4096 in (Drive related)

Dec 13 13:09:33 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) (Errors)

Dec 13 13:09:33 Tower kernel: ata6.00: status: { DRDY } (Drive related)

Dec 13 13:09:33 Tower kernel: ata6: hard resetting link (Minor Issues)

Dec 13 13:09:33 Tower kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300) (Drive related)

Dec 13 13:09:33 Tower kernel: ata6.00: configured for UDMA/133 (Drive related)

Dec 13 13:09:33 Tower kernel: ata6.00: device reported invalid CHS sector 0 (Drive related)

Dec 13 13:09:33 Tower kernel: ata6: EH complete (Drive related)

and

Dec 13 13:15:21 Tower emhttp: shcmd (12): /usr/local/sbin/set_ncq sda 1 >/dev/null (Drive related)
Dec 13 13:15:21 Tower emhttp: shcmd (13): /usr/local/sbin/set_ncq sdb 1 >/dev/null (Drive related)

Dec 13 13:15:21 Tower emhttp: shcmd (14): /usr/local/sbin/set_ncq sdc 1 >/dev/null (Drive related)

Dec 13 13:15:21 Tower emhttp: shcmd (15): /usr/local/sbin/set_ncq sdd 1 >/dev/null (Drive related)

Dec 13 13:15:21 Tower emhttp: shcmd (16): /usr/local/sbin/set_ncq sde 1 >/dev/null (Drive related)

Dec 13 13:15:21 Tower emhttp: writing mbr on disk 4 (/dev/sde) with partition 1 offset 64 (Drive related)

Dec 13 13:15:21 Tower kernel: ata6.00: exception Emask 0x50 SAct 0x0 SErr 0x400801 action 0x6 frozen (Errors)

Dec 13 13:15:21 Tower kernel: ata6.00: irq_stat 0x08000000, interface fatal error (Errors)

Dec 13 13:15:21 Tower kernel: ata6: SError: { RecovData HostInt Handshk } (Errors)

Dec 13 13:15:21 Tower kernel: ata6.00: failed command: WRITE DMA (Minor Issues)

Dec 13 13:15:21 Tower kernel: ata6.00: cmd ca/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 out (Drive related)

Dec 13 13:15:21 Tower kernel: res 50/00:00:07:00:00/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) (Errors)

Dec 13 13:15:21 Tower kernel: ata6.00: status: { DRDY } (Drive related)

Dec 13 13:15:21 Tower kernel: ata6: hard resetting link (Minor Issues)

Dec 13 13:15:21 Tower kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300) (Drive related)

Dec 13 13:15:21 Tower kernel: ata6.00: configured for UDMA/133 (Drive related)

Dec 13 13:15:21 Tower kernel: ata6: EH complete (Drive related)

Dec 13 13:15:21 Tower kernel: ata6.00: exception Emask 0x50 SAct 0x0 SErr 0x400800 action 0x6 frozen (Errors)

Dec 13 13:15:21 Tower kernel: ata6.00: irq_stat 0x08000000, interface fatal error (Errors)

Dec 13 13:15:21 Tower kernel: ata6: SError: { HostInt Handshk } (Errors)

Dec 13 13:15:21 Tower kernel: ata6.00: failed command: WRITE DMA (Minor Issues)

Dec 13 13:15:21 Tower kernel: ata6.00: cmd ca/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 out (Drive related)

Dec 13 13:15:21 Tower kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error) (Errors)

Dec 13 13:15:21 Tower kernel: ata6.00: status: { DRDY } (Drive related)

Dec 13 13:15:21 Tower kernel: ata6: hard resetting link (Minor Issues)

Dec 13 13:15:22 Tower kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300) (Drive related)

Dec 13 13:15:22 Tower kernel: ata6.00: configured for UDMA/133 (Drive related)

Dec 13 13:15:22 Tower kernel: ata6: EH complete (Drive related)

Dec 13 13:15:22 Tower kernel: ata6: limiting SATA link speed to 1.5 Gbps (Drive related)

Dec 13 13:15:22 Tower kernel: ata6.00: exception Emask 0x50 SAct 0x0 SErr 0x400800 action 0x6 frozen (Errors)

Dec 13 13:15:22 Tower kernel: ata6.00: irq_stat 0x08000000, interface fatal error (Errors)

Dec 13 13:15:22 Tower kernel: ata6: SError: { HostInt Handshk } (Errors)

Dec 13 13:15:22 Tower kernel: ata6.00: failed command: WRITE DMA (Minor Issues)

Dec 13 13:15:22 Tower kernel: ata6.00: cmd ca/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 out (Drive related)

Dec 13 13:15:22 Tower kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error) (Errors)

Dec 13 13:15:22 Tower kernel: ata6.00: status: { DRDY } (Drive related)

Dec 13 13:15:22 Tower kernel: ata6: hard resetting link (Minor Issues)

Dec 13 13:15:22 Tower kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310) (Drive related)

Dec 13 13:15:22 Tower emhttp: re-reading /dev/sde partition table (Drive related)

Dec 13 13:15:22 Tower kernel: ata6.00: configured for UDMA/133 (Drive related)

Dec 13 13:15:22 Tower kernel: ata6: EH complete (Drive related)

Dec 13 13:15:22 Tower kernel: sde: sde1 (Drive related)
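To get an overview from excerpts like these without scrolling, the kernel lines can be tallied per ATA port. A small Python sketch, assuming syslog lines in the format shown above; the function name and the sample list are illustrative and only cover exceptions and hard link resets:

```python
import re
from collections import Counter

# Matches kernel lines such as
#   "... kernel: ata6.00: exception Emask 0x0 ..."
#   "... kernel: ata6: hard resetting link"
ATA_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")


def tally_ata_events(lines):
    """Count exceptions and hard link resets per ATA port."""
    exceptions, resets = Counter(), Counter()
    for line in lines:
        m = ATA_RE.search(line)
        if not m:
            continue
        port, msg = m.groups()
        if msg.startswith("exception"):
            exceptions[port] += 1
        elif msg.startswith("hard resetting link"):
            resets[port] += 1
    return exceptions, resets


sample = [
    "Dec 13 13:09:33 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x1 SErr 0x400000 action 0x6 frozen",
    "Dec 13 13:09:33 Tower kernel: ata6: hard resetting link",
    "Dec 13 13:15:21 Tower kernel: ata6.00: exception Emask 0x50 SAct 0x0 SErr 0x400801 action 0x6 frozen",
    "Dec 13 13:15:21 Tower kernel: ata6: hard resetting link",
    "Dec 13 13:15:21 Tower emhttp: writing mbr on disk 4 (/dev/sde) with partition 1 offset 64",
]
exceptions, resets = tally_ata_events(sample)
print(dict(exceptions), dict(resets))
```

The same function accepts an open file object, so a full saved syslog can be passed in line by line to see at a glance which ports are misbehaving.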

But we're now 25 min further, rebuild is going, and I see no related SATA entries in the syslog any more so all seems well...

Once my array is back, protected and seemingly stable, my next step is to run a pre-clear on the old disk previously used as disk4. If it comes out OK I can use it; if there are issues I'll try to get a new one under warranty.

Then it is onwards with step 10 and further of my strategy, with the change that I will use the old, re-pre-cleared disk (or a new one if swapped under warranty) as disk5.

syslog-2012-12-13-1.txt

December 13, 2012

Author

Spoke too soon: it's now repeating the following error..

Dec 13 16:51:03 Tower kernel: ata5: hard resetting link (Minor Issues)
Dec 13 16:51:03 Tower kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310) (Drive related)

Dec 13 16:51:03 Tower kernel: ata5.00: configured for UDMA/33 (Drive related)

Dec 13 16:51:03 Tower kernel: ata5: EH complete (Drive related)

Dec 13 16:51:03 Tower kernel: ata5.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen (Errors)

Dec 13 16:51:03 Tower kernel: ata5.00: irq_stat 0x08000000, interface fatal error (Errors)

Dec 13 16:51:03 Tower kernel: ata5: SError: { UnrecovData HostInt 10B8B BadCRC } (Errors)

Dec 13 16:51:03 Tower kernel: ata5.00: failed command: READ DMA EXT (Minor Issues)

Dec 13 16:51:03 Tower kernel: ata5.00: cmd 25/00:00:90:23:80/00:04:41:00:00/e0 tag 0 dma 524288 in (Drive related)

Dec 13 16:51:03 Tower kernel: res 50/00:00:8f:23:80/00:00:41:00:00/e0 Emask 0x50 (ATA bus error) (Errors)

Dec 13 16:51:03 Tower kernel: ata5.00: status: { DRDY } (Drive related)
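The names inside `SError: { ... }` are the kernel's decoding of the `SErr` hex value on the matching `exception` line. A Python sketch of that decoding, using only the bit positions that can be cross-checked against the SErr/SError pairs in this thread's log excerpts; other bits exist, but they are reported here as `bitN` rather than guessed:

```python
# Bit positions inferred from the SErr/SError pairs in the logs in this thread,
# e.g. SErr 0x400000 -> { Handshk } and
#      SErr 0x280900 -> { UnrecovData HostInt 10B8B BadCRC }.
SERR_BITS = {
    0: "RecovData",
    8: "UnrecovData",
    11: "HostInt",
    19: "10B8B",
    21: "BadCRC",
    22: "Handshk",
}


def decode_serr(value):
    """Return the flag names set in an SErr value, lowest bit first."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(SERR_BITS.get(bit, f"bit{bit}"))
    return names


print(decode_serr(0x280900))
print(decode_serr(0x400801))
```

BadCRC and 10B8B are both errors in the data transfer between controller and drive, which fits the cabling and noise explanation given earlier in the thread better than a fault on the platters.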

December 13, 2012

Author

I was away for 2 hours. When I left, it was fine and had approx. 500 min left for the rebuild; now there are errors and nearly 9000 min remaining. Better stop it and troubleshoot the connection first. I replaced the cable, and it is a new disk, so that leaves the motherboard as the only suspect?

December 13, 2012

Author

I stopped the data rebuild. The unmenu main page shows:

STOPPED, unRAID ARRAY is STOPPED 5 disks in array. Parity is Valid:. Last parity check < 1 day ago . Parity updated 2410 times to address sync errors

Does the parity update mean that both my data on disk4 and the parity are now not good?

December 13, 2012

I stopped the data rebuild. The unmenu main page shows:

STOPPED, unRAID ARRAY is STOPPED 5 disks in array. Parity is Valid:. Last parity check < 1 day ago . Parity updated 2410 times to address sync errors

Does the parity update mean that both my data on disk4 and the parity are now not good?

It has no meaning when doing a rebuild of a disk. Ignore it. It only shows how many times parity was updated to get in sync the last time (or not changed, if in NOCORRECT mode). Unfortunately, it has no way of knowing... It is only reporting what is in the output of

mdcmd status|strings

December 13, 2012

Author

After looking at the syslog again, it is mostly the errors on ata5, but also once in a while:

Dec 13 17:02:07 Tower kernel: ata6.00: exception Emask 0x50 SAct 0x0 SErr 0x480800 action 0x6 frozen (Errors)
Dec 13 17:02:07 Tower kernel: ata6.00: irq_stat 0x08000000, interface fatal error (Errors)

Dec 13 17:02:07 Tower kernel: ata6: SError: { HostInt 10B8B Handshk } (Errors)

Dec 13 17:02:07 Tower kernel: ata6.00: failed command: WRITE DMA EXT (Minor Issues)

Dec 13 17:02:07 Tower kernel: ata6.00: cmd 35/00:00:30:7b:b2/00:04:41:00:00/e0 tag 0 dma 524288 out (Drive related)

Dec 13 17:02:07 Tower kernel: res 50/00:00:2f:7b:b2/00:00:41:00:00/e0 Emask 0x50 (ATA bus error) (Errors)

Dec 13 17:02:07 Tower kernel: ata6.00: status: { DRDY } (Drive related)

Dec 13 17:02:07 Tower kernel: ata6: hard resetting link (Minor Issues)

Dec 13 17:02:08 Tower emhttp: shcmd (49): rmdir /mnt/disk3 >/dev/null 2>&1 (Other emhttp)

Dec 13 17:02:08 Tower emhttp: shcmd (50): umount /mnt/disk4 >/dev/null 2>&1 (Other emhttp)

Dec 13 17:02:08 Tower kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310) (Drive related)

Dec 13 17:02:08 Tower kernel: ata6.00: configured for UDMA/33 (Drive related)

Dec 13 17:02:08 Tower kernel: ata6: EH complete (Drive related)

Dec 13 17:02:08 Tower kernel: ata6.00: exception Emask 0x50 SAct 0x0 SErr 0x480800 action 0x6 frozen (Errors)

Dec 13 17:02:08 Tower kernel: ata6.00: irq_stat 0x08000000, interface fatal error (Errors)

Dec 13 17:02:08 Tower kernel: ata6: SError: { HostInt 10B8B Handshk } (Errors)

Something is not well somewhere in my box...

December 13, 2012

Author

Full syslog (6,6 Mb)

https://dl.dropbox.com/u/9280177/syslog-2012-12-13-2.txt

December 13, 2012

Author

Actually I believe ata5, the one reporting the most errors is connected to disk3. So that would be one of the drives also used in the suspect Sharkoon cage.

December 18, 2012

Author

After a long weekend away, back home working on getting my unraid setup working again...

On rebooting the server this morning, I again got a lot of errors, mostly like the ones mentioned earlier. I also got a COMRESET error, meaning that disk4 (my new EARX disk) wasn't being seen, so I couldn't rebuild the array. Unfortunately I didn't save the syslog from this event.

So I started troubleshooting, or so I hoped.

My initial config was (SATA port mainboard - cage - disk - drive):

4 - 1 - disk3 - EARS

5 - 1 - disk4 - EARX (new)

6 - 1 - empty - empty

This was the same configuration I have tried so far, with the only remark (as mentioned in an earlier post) that after the reiserfsck came out OK I started rebuilding the array with the new EARX disk instead of the old EARS that was previously giving me trouble. I did this because I trusted the new disk more, and to keep a 'spare' copy of the data on the old EARS. As mentioned earlier, I got a lot of errors and stopped the rebuild. Swapping cables didn't clear the errors.

To remove the cage from the equation I did the following configuration

4 - 2 - disk3 - EARS

5 - 2 - disk4 - EARX (new)

6 - 2 - empty - empty

This gave again some errors, with disk4 not recognized (not visible in devices, probably due to COMRESET errors)

Then I tried:

4 - 2 - disk3 - EARS

5 - 2 - disk4 - EARS (old)

6 - 2 - empty - empty

This led to errors on both disk3 and disk4 (both disabled. COMRESET errors).

I then swapped cables and tried:

4 - 2 - disk3 - EARS

5 - 2 - disk4 - EARX

6 - 2 - unassigned - EARS (old)

Disk4 was now recognized, albeit with CRC errors and link resets. Unraid started rebuilding the array, with multiple errors and resets, but I got write errors on disk4 after approx 7% of the rebuild. The syslog is attached. Disk4 is now not recognized, see the SMART report:

smartctl -a -d ata /dev/sde (disk4)
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

As far as I see:

*When disks are recognized they seem ok, see earlier SMART reports.

*Changing cables does not solve it

*Changing cages does not solve it

*Other disks (parity, disk1 and disk2) in same cages seem ok, so it does not seem a power issue.

Seeing that disk3 and disk4 were originally used in the old Sharkoon cage with the suspect connector, I assumed the disks might have been 'fried' by this. However, using the brand-new EARX disk I also get errors.

Based on the errors and the variables tried, I assume the mainboard SATA ports originally connected to the suspect Sharkoon cage were somehow damaged? The only consistent factor in my experiments is that I have used these same three ports?

Any other theories out there? I could use some help.

syslog-2012-12-18.txt

December 18, 2012

Author

Little update:

I've now tried the following configuration:

SATA port mainboard - cage - disk - drive

4 - 1 - disk3 - EARS

5 - 1 - empty - empty

6 - 2 - disk4 - EARX

And this has been stable now for almost 4 hrs and is 40% in the rebuild of disk4.
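With this many combinations tried, it can help to write the trials down and let set arithmetic show which factors appear in every failing run but not in the stable one. A Python sketch of that bookkeeping, with the sets transcribed from the configurations listed in the posts above; the labels are my own shorthand, and cable swaps are not tracked because the posts do not say which cable went where:

```python
# Each trial lists the factors that were in use, as described in the posts above.
# Assumption: a port listed as "empty" was not in use and is left out of the set.
failing = [
    {"port4", "port5", "cage1", "disk3-EARS", "EARX-new"},
    {"port4", "port5", "cage2", "disk3-EARS", "EARX-new"},
    {"port4", "port5", "cage2", "disk3-EARS", "EARS-old"},
    {"port4", "port5", "port6", "cage2", "disk3-EARS", "EARX-new", "EARS-old"},
]
stable = {"port4", "port6", "cage1", "cage2", "disk3-EARS", "EARX-new"}


def suspects(failing_trials, stable_trial):
    """Factors present in every failing trial but absent from the stable one."""
    common = set.intersection(*failing_trials)
    return common - stable_trial


print(sorted(set.intersection(*failing)))
print(sorted(suspects(failing, stable)))
```

With these inputs the only factor common to all failing runs and absent from the stable run is SATA port 5. That is bookkeeping rather than proof: four hours of stability is a short sample, and untracked variables such as the cables could change the picture.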

There were some hiccups on bootup. Funnily enough, these relate to ata4, which seems to be port 3 and disk2, which have never given issues so far... I can't seem to find any pattern or consistency in the errors I'm having....

Dec 18 17:21:50 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen (Errors)
Dec 18 17:21:50 Tower kernel: ata4.00: irq_stat 0x08000000, interface fatal error (Errors)

Dec 18 17:21:50 Tower kernel: ata4: SError: { UnrecovData HostInt 10B8B BadCRC } (Errors)

Dec 18 17:21:50 Tower kernel: ata4.00: failed command: READ DMA EXT (Minor Issues)

Dec 18 17:21:50 Tower kernel: ata4.00: cmd 25/00:00:df:b9:00/00:04:00:00:00/e0 tag 0 dma 524288 in (Drive related)

Dec 18 17:21:50 Tower kernel: res 50/00:00:de:b9:00/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) (Errors)

Dec 18 17:21:50 Tower kernel: ata4.00: status: { DRDY } (Drive related)

Dec 18 17:21:50 Tower kernel: ata4: hard resetting link (Minor Issues)

Dec 18 17:21:51 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300) (Drive related)

Dec 18 17:21:51 Tower kernel: ata4.00: configured for UDMA/133 (Drive related)

Dec 18 17:21:51 Tower kernel: ata4: EH complete (Drive related)

Dec 18 17:21:51 Tower emhttp: shcmd (75): echo \"/mnt/user/New_content\" '-async,no_subtree_check,anongid=0,anonuid=0,all_squash,fsid=105' '*(rw,insecure)' >>/etc/exports (Other emhttp)

Dec 18 17:21:51 Tower emhttp: shcmd (76): echo \"/mnt/user/Video\" '-async,no_subtree_check,anongid=0,anonuid=0,all_squash,fsid=102' '*(rw,insecure)' >>/etc/exports (Other emhttp)

Dec 18 17:21:51 Tower emhttp: shcmd (77): echo \"/mnt/user/_tempcache\" '-async,no_subtree_check,anongid=0,anonuid=0,all_squash,fsid=104' '*(rw,insecure)' >>/etc/exports (Other emhttp)

Dec 18 17:21:51 Tower emhttp: get_config_idx: fopen /boot/config/shares/mysql.cfg: No such file or directory - assigning defaults (Other emhttp)

Dec 18 17:21:51 Tower emhttp: shcmd (78): killall -HUP smbd (Minor Issues)

Dec 18 17:21:51 Tower emhttp: shcmd (79): /etc/rc.d/rc.nfsd restart | logger (Other emhttp)

Dec 18 17:21:53 Tower logger: Starting NFS server daemons:

Dec 18 17:21:53 Tower logger: /usr/sbin/exportfs -r

Dec 18 17:21:53 Tower logger: /usr/sbin/rpc.nfsd 8

Dec 18 17:21:53 Tower logger: /usr/sbin/rpc.mountd

Dec 18 17:21:54 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen (Errors)

Dec 18 17:21:54 Tower kernel: ata4.00: irq_stat 0x08000000, interface fatal error (Errors)

Dec 18 17:21:54 Tower kernel: ata4: SError: { UnrecovData HostInt 10B8B BadCRC } (Errors)

Dec 18 17:21:54 Tower kernel: ata4.00: failed command: READ DMA EXT (Minor Issues)

Dec 18 17:21:54 Tower kernel: ata4.00: cmd 25/00:00:df:e4:05/00:04:00:00:00/e0 tag 0 dma 524288 in (Drive related)

Dec 18 17:21:54 Tower kernel: res 50/00:00:de:e4:05/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) (Errors)

Dec 18 17:21:54 Tower kernel: ata4.00: status: { DRDY } (Drive related)

Dec 18 17:21:54 Tower kernel: ata4: hard resetting link (Minor Issues)

Dec 18 17:21:54 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300) (Drive related)

Dec 18 17:21:54 Tower kernel: ata4.00: configured for UDMA/133 (Drive related)

Dec 18 17:21:54 Tower kernel: ata4: EH complete (Drive related)

Dec 18 17:22:01 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen (Errors)

Dec 18 17:22:01 Tower kernel: ata4.00: irq_stat 0x08000000, interface fatal error (Errors)

Dec 18 17:22:01 Tower kernel: ata4: SError: { UnrecovData HostInt 10B8B BadCRC } (Errors)

Dec 18 17:22:01 Tower kernel: ata4.00: failed command: READ DMA (Minor Issues)

Dec 18 17:22:01 Tower kernel: ata4.00: cmd c8/00:00:b7:0f:10/00:00:00:00:00/e0 tag 0 dma 131072 in (Drive related)

Dec 18 17:22:01 Tower kernel: res 50/00:00:b6:0f:10/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) (Errors)

Dec 18 17:22:01 Tower kernel: ata4.00: status: { DRDY } (Drive related)

Dec 18 17:22:01 Tower kernel: ata4: hard resetting link (Minor Issues)

Dec 18 17:22:02 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300) (Drive related)

Dec 18 17:22:02 Tower kernel: ata4.00: configured for UDMA/133 (Drive related)

Dec 18 17:22:02 Tower kernel: ata4: EH complete (Drive related)

Dec 18 17:22:03 Tower kernel: ata4: limiting SATA link speed to 1.5 Gbps (Drive related)

Dec 18 17:22:03 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen (Errors)

Dec 18 17:22:03 Tower kernel: ata4.00: irq_stat 0x08000000, interface fatal error (Errors)

Dec 18 17:22:03 Tower kernel: ata4: SError: { UnrecovData HostInt 10B8B BadCRC } (Errors)

Dec 18 17:22:03 Tower kernel: ata4.00: failed command: READ DMA EXT (Minor Issues)

Dec 18 17:22:03 Tower kernel: ata4.00: cmd 25/00:00:b7:e7:11/00:04:00:00:00/e0 tag 0 dma 524288 in (Drive related)

Dec 18 17:22:03 Tower kernel: res 50/00:00:b6:e7:11/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) (Errors)

Dec 18 17:22:03 Tower kernel: ata4.00: status: { DRDY } (Drive related)

Dec 18 17:22:03 Tower kernel: ata4: hard resetting link (Minor Issues)

Dec 18 17:22:03 Tower kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310) (Drive related)

Dec 18 17:22:03 Tower kernel: ata4.00: configured for UDMA/133 (Drive related)

Dec 18 17:22:03 Tower kernel: ata4: EH complete (Drive related)
