my first red ball, after driving my server almost 3000 miles

JustinChase · August 26, 2014

So, I finally settled into a new place and hooked up my server. I noticed some sketchy performance when watching video, and some investigation led me to discover that I had a red-balled drive. The array had started fine, but one drive seems to be bad.

I took off the cover and reseated all the cables, and when I restarted, the array wouldn't even start; it couldn't find the drive at all. I then reseated all the cables again and refreshed the array display, and the array started, but the drive is still red-balled.

I have a 3TB drive on order, and the red-balled drive is a 2TB drive. So, I don't want to upgrade to beta7 just yet, and I'm not sure if there are any other troubleshooting steps I should take before just replacing the drive.

The drive on order is actually meant to replace a 1TB drive that is showing some errors, not the one that has red-balled, so I suspect I need to order yet another drive, but I don't want to do too much changing until I get some guidance/suggestions on how I should proceed to at least get back to having a protected drive.

Thoughts, ideas, suggestions?

thanks!

jphipps · August 26, 2014

I would check the SMART report on the drive, and run a short and a long test. I had a drive get red-balled in the past, and it was from some flaky boot issue where the system didn't see the drive; something must have written to the array, and the drive would not come back. I had to re-install the same drive in the same slot and let it rebuild the drive to get it back up and working...

If you wanted to be safe, rebuild that with the new disk, and then run a preclear on the red-balled disk to make sure there are no errors on the drive.

JustinChase · August 26, 2014

Hmmm...

I can't find any way to run a smart test on the drive at this time. I may just be missing it, but it might be unavailable due to the red ball, I'm not sure.

jphipps · August 26, 2014

You would have to run it against the linux drive from a telnet or ssh shell, such as:

to review the smart report:

smartctl -a /dev/sda

to run a long test:

smartctl --test=long /dev/sda

JustinChase · August 26, 2014

Ah, okay. Are you sure doing so won't 'harm' the drive further? I'm guessing not, but I want to be sure I don't make matters worse by trying to make them better.

sgibbers17 · August 26, 2014

There is always a chance that your drive may completely fail during testing (I had a drive do that once), but running a SMART test will not change the data on the drive.

jphipps · August 26, 2014

The report and the test are totally non-disruptive. You can monitor the test from the SMART report. It will run about 3 hours, so make sure you aren't going to shut down the server in that time.

JustinChase · August 26, 2014

...It will run about 3 hours...

or less...

root@media:~# smartctl -a /dev/sdm

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.15.0-unRAID] (local build)

Smartctl open device: /dev/sdm failed: No such device

root@media:~# smartctl --test=long /dev/sdm

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.15.0-unRAID] (local build)

Smartctl open device: /dev/sdm failed: No such device

itimpi · August 26, 2014

If smartctl cannot see the drive, then that means as far as the system is concerned the drive is offline.

I would check cables again, both SATA and power as bad connections are the commonest cause of this. However if the system is really not seeing the drive after powering the system off and on then maybe it has really failed.

trurl · August 26, 2014

If unRAID can't see the drive, how can it even be "sdm"?

What about?

ls  -l  /dev/disk/by-id

JustinChase · August 26, 2014

If unRAID can't see the drive, how can it even be "sdm"?

What about?
ls  -l  /dev/disk/by-id

root@media:~# ls  -l  /dev/disk/by-id
total 0
lrwxrwxrwx 1 root root  9 Aug 26 11:15 ata-Hitachi_HDS5C3030ALA630_MJ1323YNG1SUPC -> ../../sdg
lrwxrwxrwx 1 root root 10 Aug 26 11:15 ata-Hitachi_HDS5C3030ALA630_MJ1323YNG1SUPC-part1 -> ../../sdg1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 ata-Hitachi_HDS5C3030ALA630_MJ1323YNG1TLWC -> ../../sdf
lrwxrwxrwx 1 root root 10 Aug 26 11:15 ata-Hitachi_HDS5C3030ALA630_MJ1323YNG1TLWC-part1 -> ../../sdf1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 ata-Hitachi_HDS5C3030ALA630_MJ1323YNG1U3PC -> ../../sdd
lrwxrwxrwx 1 root root 10 Aug 26 11:15 ata-Hitachi_HDS5C3030ALA630_MJ1323YNG1U3PC-part1 -> ../../sdd1
lrwxrwxrwx 1 root root  9 Aug 26 11:17 ata-Hitachi_HDS5C3030ALA630_MJ1323YNG1UB8C -> ../../sdj
lrwxrwxrwx 1 root root 10 Aug 26 11:17 ata-Hitachi_HDS5C3030ALA630_MJ1323YNG1UB8C-part1 -> ../../sdj1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 ata-SAMSUNG_HD102UJ_S1ZUJDWS308586 -> ../../sdb
lrwxrwxrwx 1 root root 10 Aug 26 11:15 ata-SAMSUNG_HD102UJ_S1ZUJDWS308586-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 ata-SAMSUNG_HD102UJ_S1ZUJDWS308596 -> ../../sdc
lrwxrwxrwx 1 root root 10 Aug 26 11:15 ata-SAMSUNG_HD102UJ_S1ZUJDWS308596-part1 -> ../../sdc1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 ata-ST3500830AS_6QG123RD -> ../../sdh
lrwxrwxrwx 1 root root 10 Aug 26 11:15 ata-ST3500830AS_6QG123RD-part1 -> ../../sdh1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 ata-ST4000DM000-1F2168_S30099K3 -> ../../sde
lrwxrwxrwx 1 root root 10 Aug 26 11:15 ata-ST4000DM000-1F2168_S30099K3-part1 -> ../../sde1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 ata-ST4000DM000-1F2168_W300LGJ3 -> ../../sdi
lrwxrwxrwx 1 root root 10 Aug 26 11:15 ata-ST4000DM000-1F2168_W300LGJ3-part1 -> ../../sdi1
lrwxrwxrwx 1 root root  9 Aug 26 11:26 ata-WDC_WD20EACS-11BHUB0_WD-WCAZA3758422 -> ../../sdk
lrwxrwxrwx 1 root root 10 Aug 26 11:26 ata-WDC_WD20EACS-11BHUB0_WD-WCAZA3758422-part1 -> ../../sdk1
lrwxrwxrwx 1 root root  9 Aug 26 14:35 ata-WDC_WD20EARS-00MVWB0_WD-WCAZA2130725 -> ../../sdn
lrwxrwxrwx 1 root root 10 Aug 26 14:35 ata-WDC_WD20EARS-00MVWB0_WD-WCAZA2130725-part1 -> ../../sdn1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 ata-WL3000GSA6472C_WOL240243956 -> ../../sdl
lrwxrwxrwx 1 root root 10 Aug 26 11:15 ata-WL3000GSA6472C_WOL240243956-part1 -> ../../sdl1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 usb-_Patriot_Memory_07B80F151F7902B3-0:0 -> ../../sda
lrwxrwxrwx 1 root root 10 Aug 26 11:15 usb-_Patriot_Memory_07B80F151F7902B3-0:0-part1 -> ../../sda1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 wwn-0x5000c50069e7f8b4 -> ../../sdi
lrwxrwxrwx 1 root root 10 Aug 26 11:15 wwn-0x5000c50069e7f8b4-part1 -> ../../sdi1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 wwn-0x5000c5006d3c4a5f -> ../../sde
lrwxrwxrwx 1 root root 10 Aug 26 11:15 wwn-0x5000c5006d3c4a5f-part1 -> ../../sde1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 wwn-0x5000cca228c0cdd2 -> ../../sdg
lrwxrwxrwx 1 root root 10 Aug 26 11:15 wwn-0x5000cca228c0cdd2-part1 -> ../../sdg1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 wwn-0x5000cca228c0d0c0 -> ../../sdf
lrwxrwxrwx 1 root root 10 Aug 26 11:15 wwn-0x5000cca228c0d0c0-part1 -> ../../sdf1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 wwn-0x5000cca228c0d2aa -> ../../sdd
lrwxrwxrwx 1 root root 10 Aug 26 11:15 wwn-0x5000cca228c0d2aa-part1 -> ../../sdd1
lrwxrwxrwx 1 root root  9 Aug 26 11:17 wwn-0x5000cca228c0d395 -> ../../sdj
lrwxrwxrwx 1 root root 10 Aug 26 11:17 wwn-0x5000cca228c0d395-part1 -> ../../sdj1
lrwxrwxrwx 1 root root  9 Aug 26 11:26 wwn-0x50014ee2057a96e9 -> ../../sdk
lrwxrwxrwx 1 root root 10 Aug 26 11:26 wwn-0x50014ee2057a96e9-part1 -> ../../sdk1
lrwxrwxrwx 1 root root  9 Aug 26 14:35 wwn-0x50014ee2afd6fe55 -> ../../sdn
lrwxrwxrwx 1 root root 10 Aug 26 14:35 wwn-0x50014ee2afd6fe55-part1 -> ../../sdn1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 wwn-0x50014eef0255eba8 -> ../../sdl
lrwxrwxrwx 1 root root 10 Aug 26 11:15 wwn-0x50014eef0255eba8-part1 -> ../../sdl1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 wwn-0x50024e900126cff6 -> ../../sdb
lrwxrwxrwx 1 root root 10 Aug 26 11:15 wwn-0x50024e900126cff6-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  9 Aug 26 11:15 wwn-0x50024e900126d0af -> ../../sdc
lrwxrwxrwx 1 root root 10 Aug 26 11:15 wwn-0x50024e900126d0af-part1 -> ../../sdc1

Device 	Identification 	Temp. 	Size 	Used 	Free 	Reads 	Writes 	Errors 	View
[spin Up] Parity	ST4000DM000-1F2168_W300LGJ3 (sdi) 3907018532	*	4 TB	-	-	6253	6361	0	
[spin Up] Disk 1	Hitachi_HDS5C3030ALA630_MJ1323YNG1UB8C (sdj) 2930266532	*	3 TB	2.77 TB	232 GB	966	33	0	[browse /mnt/disk1]
[spin Up] Disk 2	Hitachi_HDS5C3030ALA630_MJ1323YNG1U3PC (sdd) 2930266532	*	3 TB	2.76 TB	236 GB	573	20	0	[browse /mnt/disk2]
[spin Up] Disk 3	Hitachi_HDS5C3030ALA630_MJ1323YNG1TLWC (sdf) 2930266532	*	3 TB	2.77 TB	227 GB	1401	12	0	[browse /mnt/disk3]
[spin Down] Disk 4	WDC_WD20EACS-11BHUB0_WD-WCAZA3758422 (sdk) 1953514552	30 °C	2 TB	1.50 TB	500 GB	777	105	0	[browse /mnt/disk4]
[spin Down] Disk 5	Hitachi_HDS5C3030ALA630_MJ1323YNG1SUPC (sdg) 2930266532	33 °C	3 TB	2.32 TB	682 GB	6689	6054	0	[browse /mnt/disk5]
[spin Down] Disk 6	SAMSUNG_HD102UJ_S1ZUJDWS308596 (sdc) 976762552	27 °C	1 TB	750 GB	250 GB	904	16	0	[browse /mnt/disk6]
[spin Up] Disk 7	SAMSUNG_HD102UJ_S1ZUJDWS308586 (sdb) 976762552	*	1 TB	668 GB	333 GB	443	16	0	[browse /mnt/disk7]
[spin Up] Disk 8	WL3000GSA6472C_WOL240243956 (sdl) 2930266532	*	3 TB	2.64 TB	363 GB	798	16	0	[browse /mnt/disk8]
[spin Down] Disk 9	WDC_WD20EARS-00MVWB0_WD-WCAZA2130725 (sdm) 1953514552	*	2 TB	1.82 TB	178 GB	1226	16	0	[browse /mnt/disk9]
[spin Down] Disk 10	ST4000DM000-1F2168_S30099K3 (sde) 3907018532	27 °C	4 TB	3 TB	999 GB	1827	97	0	[browse /mnt/disk10]
[spin Down] Cache	ST3500830AS_6QG123RD (sdh) 488386552	34 °C	500 GB	272 GB	228 GB	5047	24,476	0	[browse /mnt/cache]

jphipps · August 26, 2014

Are there any messages in your syslog relating to sdm?

garycase · August 26, 2014

"... after driving my server almost 3000 miles " ==> Clearly the server's been "shaken and stirred" a bit recently. In addition to reseating all of the cables (and the drive itself if it's in a caddy), did you try replacing the SATA cable for that drive?

If so, then it's likely really bad. I'd replace it with your new 3TB unit; and let the rebuild complete while you have the chance. Meanwhile, order another drive to replace your 1TB unit as well, so you don't have any "known flaky" drives in the system.

I assume your parity drive is at least 3TB, so you don't also have the issue of needing to upgrade parity at the same time -- correct?

JonathanM · August 26, 2014

If unRAID can't see the drive, how can it even be "sdm"?

What about?
ls  -l  /dev/disk/by-id

root@media:~# ls  -l  /dev/disk/by-id
total 0
lrwxrwxrwx 1 root root  9 Aug 26 14:35 ata-WDC_WD20EARS-00MVWB0_WD-WCAZA2130725 -> ../../sdn
lrwxrwxrwx 1 root root 10 Aug 26 14:35 ata-WDC_WD20EARS-00MVWB0_WD-WCAZA2130725-part1 -> ../../sdn1

Device 	Identification 	Temp. 	Size 	Used 	Free 	Reads 	Writes 	Errors 	View
[spin Down] Disk 9	WDC_WD20EARS-00MVWB0_WD-WCAZA2130725 (sdm) 1953514552	*	2 TB	1.82 TB	178 GB	1226	16	0	[browse /mnt/disk9]

???

Were these two snippets captured during the same boot session?
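
A quick way to see which device name a given serial number currently maps to is to filter the by-id listing. The following is only a minimal sketch: the function name `dev_for_serial` is made up for illustration and is not an unRAID tool, and the sample input is simply the two WD20EARS lines quoted above.

```shell
# Print the kernel device name that a drive serial currently points at,
# reading `ls -l /dev/disk/by-id` style output on stdin.
# dev_for_serial is a hypothetical helper name, not part of unRAID.
dev_for_serial() {
    serial="$1"
    awk -v s="$serial" '
        # Only whole-disk ata-* links: skip wwn-* aliases and -partN entries.
        NF >= 11 && $(NF-2) ~ /^ata-/ && $(NF-2) !~ /-part[0-9]+$/ && index($(NF-2), s) > 0 {
            n = split($NF, parts, "/")   # "../../sdn" -> "..", "..", "sdn"
            print parts[n]
        }'
}

# Sample input: the two by-id lines for the WD20EARS quoted above.
printf '%s\n' \
  'lrwxrwxrwx 1 root root  9 Aug 26 14:35 ata-WDC_WD20EARS-00MVWB0_WD-WCAZA2130725 -> ../../sdn' \
  'lrwxrwxrwx 1 root root 10 Aug 26 14:35 ata-WDC_WD20EARS-00MVWB0_WD-WCAZA2130725-part1 -> ../../sdn1' \
  | dev_for_serial WD-WCAZA2130725
# prints: sdn
```

On a live system the input would come from `ls -l /dev/disk/by-id | dev_for_serial WD-WCAZA2130725`. If the name printed differs from the one on the array page, the page is probably showing a stale assignment.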

JustinChase · August 26, 2014

"... after driving my server almost 3000 miles " ==> Clearly the server's been "shaken and stirred" a bit recently.

You can say that again! Almost half of that was in Mexico, with several hours on terrible roads that felt like gravel even though they were actually highways. That's why I mentioned the miles at all.

I assume your parity drive is at least 3TB, so you don't also have the issue of needing to upgrade parity at the same time -- correct?

Yes, parity is 4TB actually. I only bought a 3TB drive now because Fry's had one for 89 delivered to my door, which I couldn't pass up. Unfortunately, it was only one per household, or I'd have already gotten another for the known flaky drive.

I'll shut down again right now, remove and reseat all drives, and replace the cable for that bad drive as well. The drives are in 'caddies', by which I mean they all have a plastic 'band' around them with posts that fit into the screw holes on the drives; the 'bands' then slide into grooves in the server and lock into place. However, they are only connected with SATA cables and the power cables. I guess I'm just not sure what qualifies as a caddy :-\

Were these two snippets captured during the same boot session?

yes, without any reboot. The drive originally showed as missing, I reseated all cables while running, then refreshed the view and the array started, but the drive was still red-balled, but did show the sdm designation. Then, a bit later I ran the commands and posted the results. I have not rebooted yet.

Are there any messages in your syslog relating to sdm?

Yeah, several messages. I've attached the entire syslog, in case there's more going on here.

syslog.zip

garycase · August 26, 2014

When you use a hot-swap caddy, the drive is actually connecting to the caddy -- NOT to the SATA cable you're using. It sounds like that's the case; so be sure you remove the drive from the caddy (pull it out a bit); and then re-seat it. That connection can get loose just like a SATA connector can.

The more connections you have, the more potential failure points -- and to check them you have to reseat them all.

jphipps · August 26, 2014

Looks like that disk is not a happy camper from your syslog...

Aug 26 11:34:19 media kernel: sd 14:0:0:0: Attached scsi generic sg12 type 0

Aug 26 11:34:19 media kernel: sd 14:0:0:0: [sdm] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)

Aug 26 11:34:19 media kernel: sd 14:0:0:0: [sdm] Write Protect is off

Aug 26 11:34:19 media kernel: sd 14:0:0:0: [sdm] Mode Sense: 00 3a 00 00

Aug 26 11:34:19 media kernel: sd 14:0:0:0: [sdm] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Aug 26 11:34:19 media kernel: sdm: sdm1

Aug 26 11:34:19 media kernel: sd 14:0:0:0: [sdm] Attached SCSI disk

Aug 26 11:34:25 media kernel: ata14: exception Emask 0x10 SAct 0x0 SErr 0x90002 action 0xe frozen

Aug 26 11:34:25 media kernel: ata14: irq_stat 0x00400000, PHY RDY changed

Aug 26 11:34:25 media kernel: ata14: SError: { RecovComm PHYRdyChg 10B8B }

Aug 26 11:34:25 media kernel: ata14: hard resetting link

Aug 26 11:34:28 media kernel: ata12: exception Emask 0x10 SAct 0x0 SErr 0x10002 action 0xe frozen

Aug 26 11:34:28 media kernel: ata12: irq_stat 0x00400000, PHY RDY changed

Aug 26 11:34:28 media kernel: ata12: SError: { RecovComm PHYRdyChg }

Aug 26 11:34:28 media kernel: ata12: hard resetting link

Aug 26 11:34:35 media kernel: ata14: softreset failed (device not ready)

Aug 26 11:34:35 media kernel: ata14: hard resetting link

Aug 26 11:34:36 media kernel: ata12: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Aug 26 11:34:36 media kernel: ata12.00: configured for UDMA/33

Aug 26 11:34:36 media kernel: ata12: EH complete

Aug 26 11:34:45 media kernel: ata14: softreset failed (device not ready)

Aug 26 11:34:45 media kernel: ata14: hard resetting link

Aug 26 11:34:46 media kernel: ata14: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Aug 26 11:34:46 media kernel: ata14.00: configured for UDMA/133

Aug 26 11:34:46 media kernel: ata14: EH complete

Aug 26 11:34:52 media kernel: ata14: exception Emask 0x10 SAct 0x0 SErr 0x90002 action 0xe frozen

Aug 26 11:34:52 media kernel: ata14: irq_stat 0x00400000, PHY RDY changed

Aug 26 11:34:52 media kernel: ata14: SError: { RecovComm PHYRdyChg 10B8B }

Aug 26 11:34:52 media kernel: ata14: hard resetting link

Aug 26 11:34:53 media kernel: ata12: exception Emask 0x10 SAct 0x0 SErr 0x10002 action 0xe frozen

Aug 26 11:34:53 media kernel: ata12: irq_stat 0x00400000, PHY RDY changed

Aug 26 11:34:53 media kernel: ata12: SError: { RecovComm PHYRdyChg }

Aug 26 11:34:53 media kernel: ata12: hard resetting link

If all connections look good and you have a USB-to-SATA dock, you could try connecting the drive to it to make sure it is working. Or, if you get a new drive, connect it on that same connection and see if the new drive is recognized. If so, it is probably a failed drive.

JustinChase · August 26, 2014

So, I've removed all drives (and re-labeled them all, so I can see what is what more easily), then moved a few around to organize them a bit better, then reinstalled all the drives. The result is that the red-balled drive is in a new location and is connected to a different SATA port.

When I booted unRAID, it booted really quickly, and the array started. I figured that was a good thing, but then I took a look, and it shows Disk 9 as not installed, but now it also shows the WD drive as just a normal, available drive in the list.

I want to think this is a good thing, and that I can just stop the array, assign it to disk9, then restart, but before I do, I figure I should check and see if that's actually a dumb idea.

Screenshot attached...

thoughts?

trurl · August 27, 2014

This sounds similar to the scenario where people intentionally make unRAID forget about the drive so they can rebuild it onto itself. I think if you assign it unRAID may want to rebuild it. Whether that is what you should do is unclear. Have you got another drive you could rebuild onto?

jphipps · August 27, 2014

Can you run the smartctl command against /dev/sdb (the disk outside the array)?

My guess is that it came up once without the disk installed, so it is set to missing, and now showing up as a disk separate from the array.

You could try mounting the /dev/sdb1 partition and verifying that it holds the data that should be on that disk, provided the SMART test passes and the drive still looks to be in good health. The safest option would be to rebuild slot 9 on another drive; then you still have the original data on that disk if for some reason the rebuild fails.

The last option, if you are sure all the data is correct on the drives, is to start with a new config and rebuild parity off all the data drives. Sounds hairy, but I have done that a few times and got my array back. You have to remember that with unRAID all array disks are just individual standalone data drives that are part of the array, and the parity disk is only really used for emulating a failed disk and rebuilding.

JustinChase · August 27, 2014

This sounds similar to the scenario where people intentionally make unRAID forget about the drive so they can rebuild it onto itself. I think if you assign it unRAID may want to rebuild it. Whether that is what you should do is unclear. Have you got another drive you could rebuild onto?

Not until sometime tomorrow. I don't NEED to get it all fixed tonight for any particular reason, so I'll wait until I have a new drive to rebuild onto, then see if I can preclear the red-balled drive and see if it will work okay moving forward.

Obviously, if someone else has any ideas, thoughts, or suggestions, I'm all ears.

Thanks everyone for all your help so far!

JustinChase · August 27, 2014

Can you run the smartctl command against /dev/sdb (the disk outside the array)?

root@media:~# smartctl -a /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.15.0-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA2130725
LU WWN Device Id: 5 0014ee 2afd6fe55
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue Aug 26 19:30:14 2014 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (40260) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 388) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   163   021    Pre-fail  Always       -       1108
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       4205
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12054
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1471
192 Power-Off_Retract_Count 0x0032   199   199   000    Old_age   Always       -       1381
193 Load_Cycle_Count        0x0032   182   182   000    Old_age   Always       -       54340
194 Temperature_Celsius     0x0022   116   098   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       7

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I'm not sure whether running the long test on that drive will disrupt my ability to stream the video I'm currently trying to watch, so I may wait until it's done to start that test.
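
For planning when to start it: the report above gives the drive's own estimate for the extended (long) test, 388 minutes, which is closer to six and a half hours than three. Here is a minimal sketch of pulling that figure out of `smartctl -a` output and converting it. The function name `extended_test_minutes` is made up for illustration, and the sample input is the two lines from the report above.

```shell
# Extract the "Extended self-test routine recommended polling time" value
# (in minutes) from `smartctl -a` output on stdin.
# extended_test_minutes is a hypothetical helper name.
extended_test_minutes() {
    awk '
        /^Extended self-test routine/ { want = 1; next }   # value is on the next line
        want { gsub(/[^0-9]/, ""); print; exit }           # keep only the digits
    '
}

# Sample input: the two lines from the report above.
minutes=$(printf '%s\n' \
  'Extended self-test routine' \
  'recommended polling time:        ( 388) minutes.' \
  | extended_test_minutes)

echo "$((minutes / 60))h $((minutes % 60))m"
# prints: 6h 28m
```

On a live system the input would be `smartctl -a /dev/sdb | extended_test_minutes`. The figure is only the drive's estimate, so the actual run time may differ.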

jphipps · August 27, 2014

It shouldn't impact using the disk. The SMART report looks pretty clean...

RobJ · August 27, 2014

Aug 26 11:34:25 media kernel: ata14: exception Emask 0x10 SAct 0x0 SErr 0x90002 action 0xe frozen

Aug 26 11:34:25 media kernel: ata14: irq_stat 0x00400000, PHY RDY changed

Aug 26 11:34:25 media kernel: ata14: SError: { RecovComm PHYRdyChg 10B8B }

Aug 26 11:34:25 media kernel: ata14: hard resetting link

Aug 26 11:34:28 media kernel: ata12: exception Emask 0x10 SAct 0x0 SErr 0x10002 action 0xe frozen

Aug 26 11:34:28 media kernel: ata12: irq_stat 0x00400000, PHY RDY changed

Aug 26 11:34:28 media kernel: ata12: SError: { RecovComm PHYRdyChg }

Aug 26 11:34:28 media kernel: ata12: hard resetting link

Aug 26 11:34:35 media kernel: ata14: softreset failed (device not ready)

Aug 26 11:34:35 media kernel: ata14: hard resetting link

Aug 26 11:34:36 media kernel: ata12: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Aug 26 11:34:36 media kernel: ata12.00: configured for UDMA/33

Aug 26 11:34:36 media kernel: ata12: EH complete

Aug 26 11:34:45 media kernel: ata14: softreset failed (device not ready)

Aug 26 11:34:45 media kernel: ata14: hard resetting link

Aug 26 11:34:46 media kernel: ata14: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Aug 26 11:34:46 media kernel: ata14.00: configured for UDMA/133

Aug 26 11:34:46 media kernel: ata14: EH complete

Aug 26 11:34:52 media kernel: ata14: exception Emask 0x10 SAct 0x0 SErr 0x90002 action 0xe frozen

Aug 26 11:34:52 media kernel: ata14: irq_stat 0x00400000, PHY RDY changed

Aug 26 11:34:52 media kernel: ata14: SError: { RecovComm PHYRdyChg 10B8B }

Aug 26 11:34:52 media kernel: ata14: hard resetting link

Aug 26 11:34:53 media kernel: ata12: exception Emask 0x10 SAct 0x0 SErr 0x10002 action 0xe frozen

Aug 26 11:34:53 media kernel: ata12: irq_stat 0x00400000, PHY RDY changed

Aug 26 11:34:53 media kernel: ata12: SError: { RecovComm PHYRdyChg }

Aug 26 11:34:53 media kernel: ata12: hard resetting link

The SATA error flags you are seeing (RecovComm and PHYRdyChg) are typical of a bad connection, perhaps loose and vibrating, or poor backplane connection, or perhaps poor or noisy power. You have 2 separate drives with the same issues, connected to channels ata12 and ata14. I understand you've reconnected drives to different ports, so that in itself may have improved all connections. In cases like these, the drives themselves are usually completely fine.
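
To see at a glance which channels are affected, the exception events can be tallied per port. This is a minimal sketch, assuming kernel log lines in the same format as the excerpt above; the function name `count_ata_exceptions` is made up for illustration, and /var/log/syslog in the usage comment is the assumed log location.

```shell
# Count libata "exception" events per ATA port in kernel log text on stdin.
# count_ata_exceptions is a hypothetical helper name.
count_ata_exceptions() {
    awk '
        {
            for (i = 1; i < NF; i++) {
                # Match a field such as "ata14:" directly followed by "exception".
                if ($i ~ /^ata[0-9]+(\.[0-9]+)?:$/ && $(i+1) == "exception") {
                    port = $i
                    sub(/[.:].*$/, "", port)   # "ata14:" -> "ata14"
                    count[port]++
                }
            }
        }
        END { for (p in count) print p, count[p] }
    ' | sort
}

# Usage on a live system (log path is an assumption):
#   count_ata_exceptions < /var/log/syslog
```

Run against the excerpt quoted above, this would report two exceptions each for ata12 and ata14. A count that keeps growing after reseating the cables would suggest the connection problem is still there.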

garycase · August 27, 2014

The drive looks fine on the SMART report. I suspect you could simply assign it to Disk9 and it would just rebuild onto itself just fine ... HOWEVER -- it's safer to do that with a different disk. That way, if anything should go awry with the rebuild, you'll still have the original disk to copy your data off of.
