Preclear.sh results - Questions about your results? Post them here.

joeyke87 · March 23, 2012

Here are my results from a new hard disk. After preclearing, I replaced a bad disk with this one. The rebuild seemed to work fine (everything is green now). Is there a way to check that the rebuild was successful, or can I just trust the green lights?

syslog-2012-03-22-disk3new.txt

tr0910 · March 26, 2012

A WDC WD30EZRS drive fails preclear 1.13, invoked with -n, on both v5b12a and v5b14, on two different unRAID servers. It completes all 10 steps fine, but at the very end it says the drive (/dev/sdf) fails preclear, and the drive drops from the list of drives that can be precleared. No preclear report is saved, and the syslog explodes to 200 MB with all kinds of errors for sdf (the drive being precleared). (A truncated version with the first 10000 lines is attached below.)

I have precleared dozens of 2TB and 3TB drives on 4.7 and v5 without incident.

Any idea what is going wrong? Restarting the server will bring the drive back online and allow preclear to start again.

root@Tower1:/boot# preclear_disk.sh -l
====================================1.13
Disks not assigned to the unRAID array
  (potential candidates for clearing)
========================================
No un-assigned disks detected

Restarting the server and testing for preclear status shows:

Serial Number:    WD-WCAWZ2017532
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes

Disk /dev/sdg: 3000.6 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders, total 5860533168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/sdg doesn't contain a valid partition table
########################################################################
failed test 1
failed test 2 00000 00000 00000 00000
failed test 3 00000 00000 00000 00000
failed test 5
failed test 6
========================================================================1.13
==
== Disk /dev/sdg is NOT precleared
== 0 0 4294967295
============================================================================

tr0910 · March 26, 2012

Sorry, here is the truncated syslog.

syslog-2012-03-26_truncated.zip

Joe L. · March 26, 2012

The syslog shows lots of errors in communicating with the disk. Many are CRC errors (bad checksums in communications with the disk).

Mar 25 22:49:45 Tower1 emhttp: get_config_idx: fopen /boot/config/shares/Pix2012.cfg: No such file or directory - assigning defaults
Mar 25 22:49:45 Tower1 emhttp: Restart SMB...
Mar 25 22:49:45 Tower1 emhttp: shcmd (46): killall -HUP smbd
Mar 25 22:49:45 Tower1 emhttp: shcmd (47): ps axc | grep -q rpc.mountd
Mar 25 22:49:45 Tower1 emhttp: _shcmd: shcmd (47): exit status: 1
Mar 25 22:49:45 Tower1 emhttp: Start NFS...
Mar 25 22:49:45 Tower1 emhttp: shcmd (48): /etc/rc.d/rc.nfsd start |& logger
Mar 25 22:49:45 Tower1 logger: Starting NFS server daemons:
Mar 25 22:49:45 Tower1 logger:   /usr/sbin/exportfs -r
Mar 25 22:49:45 Tower1 logger:   /usr/sbin/rpc.nfsd 8
Mar 25 22:49:45 Tower1 logger:   /usr/sbin/rpc.mountd
Mar 25 22:49:45 Tower1 mountd[2091]: Kernel does not have pseudo root support.
Mar 25 22:49:45 Tower1 mountd[2091]: NFS v4 mounts will be disabled unless fsid=0
Mar 25 22:49:45 Tower1 mountd[2091]: is specfied in /etc/exports file.
Mar 25 22:49:45 Tower1 emhttp: shcmd (49): /usr/local/sbin/emhttp_event svcs_restarted
Mar 25 22:49:45 Tower1 emhttp_event: svcs_restarted
Mar 25 22:49:47 Tower1 kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x380100 action 0x6
Mar 25 22:49:47 Tower1 kernel: ata7.00: irq_stat 0x08000000
Mar 25 22:49:47 Tower1 kernel: ata7: SError: { UnrecovData 10B8B Dispar BadCRC }
Mar 25 22:49:47 Tower1 kernel: ata7.00: failed command: READ DMA
Mar 25 22:49:47 Tower1 kernel: ata7.00: cmd c8/00:08:47:ae:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Mar 25 22:49:47 Tower1 kernel:          res 50/00:00:46:ae:00/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Mar 25 22:49:47 Tower1 kernel: ata7.00: status: { DRDY }
Mar 25 22:49:47 Tower1 kernel: ata7: hard resetting link
Mar 25 22:49:48 Tower1 kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Mar 25 22:49:53 Tower1 kernel: ata7.00: qc timeout (cmd 0xec)
Mar 25 22:49:53 Tower1 kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Mar 25 22:49:53 Tower1 kernel: ata7.00: revalidation failed (errno=-5)
Mar 25 22:49:53 Tower1 kernel: ata7: hard resetting link
Mar 25 22:49:53 Tower1 kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Mar 25 22:49:53 Tower1 kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x100)
Mar 25 22:49:53 Tower1 kernel: ata7.00: revalidation failed (errno=-5)
Mar 25 22:49:58 Tower1 kernel: ata7: hard resetting link
Mar 25 22:49:59 Tower1 kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Mar 25 22:49:59 Tower1 kernel: ata7.00: configured for UDMA/33
Mar 25 22:49:59 Tower1 kernel: ata7: EH complete
Mar 25 22:49:59 Tower1 kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x380100 action 0x6
Mar 25 22:49:59 Tower1 kernel: ata7.00: irq_stat 0x08000000
Mar 25 22:49:59 Tower1 kernel: ata7: SError: { UnrecovData 10B8B Dispar BadCRC }
Mar 25 22:49:59 Tower1 kernel: ata7.00: failed command: READ DMA
Mar 25 22:49:59 Tower1 kernel: ata7.00: cmd c8/00:08:47:ae:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Mar 25 22:49:59 Tower1 kernel:          res 50/00:42:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Mar 25 22:49:59 Tower1 kernel: ata7.00: status: { DRDY }
Mar 25 22:49:59 Tower1 kernel: ata7: hard resetting link
Mar 25 22:49:59 Tower1 kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Mar 25 22:49:59 Tower1 kernel: ata7.00: configured for UDMA/33
Mar 25 22:49:59 Tower1 kernel: ata7: EH complete
Mar 25 22:50:40 Tower1 kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x380100 action 0x6
Mar 25 22:50:40 Tower1 kernel: ata7.00: irq_stat 0x08000000
Mar 25 22:50:40 Tower1 kernel: ata7: SError: { UnrecovData 10B8B Dispar BadCRC }
Mar 25 22:50:40 Tower1 kernel: ata7.00: failed command: IDENTIFY DEVICE
Mar 25 22:50:40 Tower1 kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
Mar 25 22:50:40 Tower1 kernel:          res 50/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)

Basically, the disk is ceasing to communicate with the disk controller somewhere in the process of reading and writing to the disk.

A power cycle seems to get it to re-initialize and communicate once more.

The 10 steps in the preclear are being performed, but when the verification is performed, the expected values are not there. (because the disk stopped responding to commands somewhere in the middle)

It could be a disk controller port issue, or a cable issue (noise on the cable causes CRC errors) or just a bad disk drive. It could even be a marginal power supply for that drive. The process of elimination is tedious, but I'd try a different PC first. If it fails there, it was the disk.
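When the syslog runs to hundreds of megabytes, a quick tally per ATA port can show whether the CRC errors and link resets all come from one port. The following is only a rough sketch: the function name `summarize_ata_errors` is made up for illustration, and the patterns assume kernel lines shaped like the ones quoted above.

```python
import re
from collections import Counter

# Assumed line shapes, taken from the syslog excerpt in this thread:
#   "... kernel: ata7: SError: { UnrecovData 10B8B Dispar BadCRC }"
#   "... kernel: ata7: hard resetting link"
SERROR_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: SError: \{([^}]*)\}")
RESET_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: hard resetting link")


def summarize_ata_errors(lines):
    """Count BadCRC SError reports and hard link resets per ATA port."""
    crc = Counter()
    resets = Counter()
    for line in lines:
        match = SERROR_RE.search(line)
        if match:
            # The braces hold a space-separated list of SError flags.
            if "BadCRC" in match.group(2).split():
                crc[match.group(1)] += 1
            continue
        match = RESET_RE.search(line)
        if match:
            resets[match.group(1)] += 1
    return crc, resets


# A few lines copied from the excerpt above.
sample = [
    "Mar 25 22:49:47 Tower1 kernel: ata7: SError: { UnrecovData 10B8B Dispar BadCRC }",
    "Mar 25 22:49:47 Tower1 kernel: ata7.00: failed command: READ DMA",
    "Mar 25 22:49:47 Tower1 kernel: ata7: hard resetting link",
    "Mar 25 22:49:53 Tower1 kernel: ata7: hard resetting link",
]
crc, resets = summarize_ata_errors(sample)
print(dict(crc), dict(resets))
```

To run it against a real file, pass an open file object instead of the list, for example `summarize_ata_errors(open("syslog.txt"))`, where the file name is a placeholder. A tally confined to one port narrows the search to that port, its cable, and the drive on it, but it cannot tell those three apart.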

Joe L.

tr0910 · March 26, 2012

It failed on two different unraid servers and failed at least twice on each. (different cables were tested)

DOA, I guess.....

Joe L. · March 26, 2012

> It failed on two different unraid servers and failed at least twice on each. (different cables were tested)
>
> DOA, I guess.....

I guess that isolates the issue to the only common hardware involved. (the disk itself)

Not DOA, but Zombie. (keeps coming back from the dead)

I would not trust my data on it. Not unless you want to keep power cycling the server to get to a file.

grither · March 27, 2012

Hello all! I have just copied data off three of my drives and hope to migrate them into my server (then copy more data onto them, move more drives, etc.)

Here are the 3 preclear reports. Note that all the drives "passed" as per the email reports.

Do you see any serious problems which would prevent migrating the drives in?

Each drive seems to have about 11 or 12 values of old age and about 3 or 4 pre-fail. Hoping these aren't serious problems!!!

preclear_finish_+WD-WCAU41253328_2012-03-25.txt

preclear_finish_+6XW04QZD_2012-03-27.txt

preclear_finish_+WD-WCAVY2471414_2012-03-27.txt

tr0910 · March 27, 2012

> DOA, I guess.....

> Not DOA, but Zombie. (keeps coming back from the dead)

Here is the detail of the last full test I ran (not -n). Three strikes and it's out. (I hate RMAing stuff)

= Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE
= Step 3 of 10 - Disk is now cleared from MBR onward.           DONE
= Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4       DONE
= Step 5 of 10 - Clearing MBR code area                         DONE
= Step 6 of 10 - Setting MBR signature bytes                    DONE
= Step 7 of 10 - Setting partition 1 to precleared state        DONE
= Step 8 of 10 - Notifying kernel we changed the partitioning   DONE
= Step 9 of 10 - Creating the /dev/disk/by* entries             DONE
= Step 10 of 10 - Verifying if the MBR is cleared.              DONE
=
Elapsed Time:  16:36:09
========================================================================1.13
==
== SORRY: Disk /dev/sdg MBR could NOT be precleared
==
== out4= 00000
== out5= 00000
============================================================================
0+0 records in
0+0 records out
0000000
0 bytes (0 B) copied, 2.2213e-05 s, 0.0 kB/s
root@Tower1:/boot#
root@Tower1:/boot#

Joe L. · March 28, 2012

> Each drive seems to have about 11 or 12 values of old age and about 3 or 4 pre-fail. Hoping these aren't serious problems!!!

Those are the categories those attributes belong to. They do not indicate a failure unless they also say FAILING_NOW on the same line.

As an example, run-time-hours would be an old_age indicator of a disk. Un-correctable-disk-read-errors will be in a category of pre-failure. High run-time hours does not indicate the drive will fail, just that it is getting older. Un-correctable errors can occur at any age. A large number, or increasing numbers of them might indicate a pending failure (once the disk runs out of spare sectors to re-allocate in place of the un-readable ones)

You just need to compare the normalized value with the failure threshold for any given attribute. That will tell you about the drive's health.
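That comparison can be sketched in a few lines. This assumes the attribute table has the usual `smartctl -A` column layout (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The function names and the sample rows are invented for illustration and do not come from any report in this thread.

```python
def parse_smart_attributes(text):
    """Parse attribute rows from a smartctl -A style table into dicts."""
    attrs = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),    # normalized value
            "worst": int(fields[4]),
            "thresh": int(fields[5]),   # failure threshold
            "type": fields[6],          # Pre-fail or Old_age: a category only
            "when_failed": fields[8],
            "raw": " ".join(fields[9:]),
        })
    return attrs


def failing(attrs):
    """Attributes flagged FAILING_NOW or whose value has reached the threshold."""
    return [
        a for a in attrs
        if a["when_failed"] == "FAILING_NOW"
        or (a["thresh"] > 0 and a["value"] <= a["thresh"])
    ]


# Invented sample rows, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   005   005   010    Pre-fail  Always   FAILING_NOW 2040
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1787
"""

attrs = parse_smart_attributes(SAMPLE)
for attr in failing(attrs):
    print(attr["name"], attr["value"], "<=", attr["thresh"])
```

The Pre-fail and Old_age labels are never tested here, since they only name the category. A threshold of 0 is skipped because it means the attribute has no failure level.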

Joe L. · March 28, 2012

> Hello all! I have just copied data off three of my drives and hope to migrate them into my server (then copy more data onto them, move more drives, etc.)
>
> Here are the 3 preclear reports. Note that all the drives "passed" as per the email reports.
>
> Do you see any serious problems which would prevent migrating the drives in?
>
> Each drive seems to have about 11 or 12 values of old age and about 3 or 4 pre-fail. Hoping these aren't serious problems!!!

The third disk shows 38 sectors pending re-allocation. There would normally be none as the writing of zeros should have re-allocated all the sectors.

There were no sectors re-allocated, so I'd suspect the 38 un-readable sectors were discovered in the post-read phase. (that is not good)

I'd run another pre-clear on that disk. If it continues to show sectors pending re-allocation, I'd not trust it.
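The check after a second run comes down to reading the same raw counter from both sets of SMART output. A minimal sketch, assuming `smartctl -A` style rows saved after each run: the helper names are hypothetical, the 38 in the first sample row is the count mentioned above, and every other number (including the second-run value) is invented.

```python
import re


def raw_value(smart_text, attribute_name):
    """Return the integer RAW_VALUE for one attribute row, or None if absent."""
    for line in smart_text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == attribute_name:
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


def pending_after_rerun(first_text, second_text):
    """Compare sectors pending re-allocation across two preclear runs."""
    first = raw_value(first_text, "Current_Pending_Sector")
    second = raw_value(second_text, "Current_Pending_Sector")
    return {
        "first": first,
        "second": second,
        # True only if the second run still reports pending sectors.
        "still_pending": second is not None and second > 0,
    }


# Sample rows: only the 38 is from the thread, the rest is invented.
RUN_1 = "197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 38"
RUN_2 = "197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 12"

result = pending_after_rerun(RUN_1, RUN_2)
print(result)
```

If `still_pending` comes back true after the second preclear, that matches the "would not trust it" case described above.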

slarco · March 29, 2012

What if my syslog weighs more than 192k? Use an external host?

Just paste as quote?

Joe L. · March 29, 2012

> What if my syslog weighs more than 192k? Use an external host?
>
> Just paste as quote?

Zip it (they zip really well), or use an external host as you described. For pre-clear results I really do not need to see the entire syslog. You can attach only the pre-clear reports, as found in /boot/preclear_reports

Joe L.

Darkoverlord · March 29, 2012

Hi guys,

What might be wrong here?

Mar 29 22:23:14 Tower kernel: end_request: I/O error, dev sde, sector 1301043712 (Errors)

Mar 29 22:23:14 Tower kernel: ata7: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00 (Drive related)

Mar 29 22:23:14 Tower kernel: ata7.00: device reported invalid CHS sector 0 (Drive related)

Mar 29 22:23:14 Tower kernel: ata7: status=0x41 { DriveReady Error } (Errors)