Drive or cables or...? [SOLVED]

anthropoidape · February 12, 2012

Hi, I started having weird stability problems a little while ago, but haven't really had time to do much about it. Basically, failed writes of large files, sometimes but not always.

I didn't have any signs of a problem in my syslog so I assumed it was a fault at my desktop's end of things.

Messing around a bit, I noticed that it happened when accessing the unRAID server via user shares, but not via disk shares. In other words, if I pasted a file to //tower/disk5/videos/ ... no problem. But if I pasted to a user share, craaash.

As of yesterday some syslog data has started appearing... lots of it. Basically the following, repeated over and over and over:

Feb 12 13:37:39 Lemur kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 12 13:37:39 Lemur kernel: ata7.00: BMDMA2 stat 0x80d1009
Feb 12 13:37:39 Lemur kernel: ata7.00: failed command: READ DMA EXT
Feb 12 13:37:39 Lemur kernel: ata7.00: cmd 25/00:08:58:12:c2/00:00:c8:00:00/e0 tag 0 dma 4096 in
Feb 12 13:37:39 Lemur kernel: res 51/40:08:58:12:c2/00:00:c8:00:00/f0 Emask 0x9 (media error)
Feb 12 13:37:39 Lemur kernel: ata7.00: status: { DRDY ERR }
Feb 12 13:37:39 Lemur kernel: ata7.00: error: { UNC }
Feb 12 13:37:39 Lemur kernel: ata7.00: configured for UDMA/100
Feb 12 13:37:39 Lemur kernel: ata7: EH complete

SMART test results:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      4680         25953144
# 2  Short offline       Aborted by host               90%      4680         -
# 3  Extended offline    Completed: read failure       90%      4678         54840184
# 4  Short offline       Completed without error       00%      4430         -

Running unRAID Pro v 4.7

Clearly (I think!) this is a problem with disk7 on my server. Can I make any reliable assumptions about what the problem is? Specifically, should I assume it's the HDD itself (it's actually the newest disk in the system, from about August last year) and that it needs replacing, or should I troubleshoot more, such as by replacing cables etc.? Basically I don't want to fork out for a new drive if I don't need to, but equally I don't want to mess around looking for phantom problems if it is definitely or almost definitely the drive itself.

Any suggestions?

Thanks,

Jason

Joe L. · February 12, 2012

ata7 is not necessarily disk7. Since you did not post a complete syslog, we really cannot tell.

Media errors are un-readable sectors on a physical disk. They have nothing to do with user-share or direct disk access.
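For anyone who wants to see which sectors the kernel was complaining about, here is a minimal Python sketch that decodes the "cmd" and "res" lines of such an error block. It assumes the field order used by the Linux libata error handler (command or status, then feature or error, sector count, three low LBA bytes, then the high-order bytes and the device register); the function name `decode_taskfile` and the small error-bit table are illustrative, not part of any tool.

```python
import re

# Assumed layout of a libata "cmd"/"res" dump:
#   first/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
# For "cmd" the first byte is the command; for "res" it is the status
# register and the "feature" position holds the error register.
TASKFILE = re.compile(
    r"\b(cmd|res) ([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})"
)

# A few ATA error-register bits (not an exhaustive list).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_taskfile(line):
    """Return the 48-bit LBA, sector count and error flags, or None."""
    m = TASKFILE.search(line)
    if m is None:
        return None
    kind, first, low, high, _device = m.groups()
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in high.split(":")
    )
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    errors = []
    if kind == "res":
        # Only the result line carries the error register.
        errors = [name for bit, name in ERROR_BITS.items() if feature & bit]
    return {
        "kind": kind,
        "first_byte": int(first, 16),
        "lba": lba,
        "sectors": hob_nsect << 8 | nsect,
        "errors": errors,
    }


if __name__ == "__main__":
    sample = ("Feb 12 13:37:39 Lemur kernel: res "
              "51/40:08:58:12:c2/00:00:c8:00:00/f0 Emask 0x9 (media error)")
    print(decode_taskfile(sample))
```

On the "res" line from the first post this gives LBA 3368161880 with the UNC bit set, and a count of 8 sectors, which at 512 bytes each agrees with the "dma 4096 in" on the "cmd" line.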

A "smartctl -a /dev/sdX" on each of your disks will probably show one (or more) with sectors pending re-allocation, or already re-allocated.

You can then evaluate the true health of your drives.
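If you would rather not read eight reports by eye, something like the following rough Python sketch could pull out the two attributes mentioned above. It assumes the usual ten-column smartctl attribute table with the raw value in the last column; the helper names `parse_attributes` and `check_disk` and the device list are made up for illustration, and `check_disk` needs smartctl installed and sufficient privileges.

```python
import subprocess

# Attribute IDs of interest: 5 = reallocated sectors, 197 = pending sectors.
WATCHED = {5: "Reallocated_Sector_Ct", 197: "Current_Pending_Sector"}


def parse_attributes(report):
    """Map watched attribute IDs to their raw values from smartctl text."""
    found = {}
    for line in report.splitlines():
        fields = line.split()
        # An attribute row starts with a numeric ID and has at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id in WATCHED and fields[9].isdigit():
            found[attr_id] = int(fields[9])
    return found


def check_disk(device):
    """Run 'smartctl -a' on one device and parse its attribute table."""
    result = subprocess.run(
        ["smartctl", "-a", device], capture_output=True, text=True
    )
    return parse_attributes(result.stdout)


if __name__ == "__main__":
    # Hypothetical device list; adjust to the drives actually present.
    for dev in ["/dev/sda", "/dev/sdb"]:
        counts = check_disk(dev)
        flagged = {WATCHED[k]: v for k, v in counts.items() if v > 0}
        print(dev, flagged if flagged else "no pending or reallocated sectors")
```

Any non-zero raw value for either attribute is worth a closer look, and a count that keeps growing between checks is the stronger warning sign.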

Joe L.

anthropoidape · February 12, 2012

Thanks Joe.

Sorry it took me a while to find the info on copying a syslog. I think I've attached one now.

With regard to pending sectors, the unRAID SMART report facility says:

Disk 0/SDA:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1

Disk 1/SDB:
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

Disk 2/SDD:
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

Disk 3/SDC:
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

Disk 4/SDE:
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

Disk 5/SDF:
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

Disk 6/SDH:
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

Disk 7/SDG:
197 Current_Pending_Sector  0x0032   198   198   000    Old_age   Always       -       680

I now see that disk 7 is a bit of an outlier in this respect. Am I right to think that the 680 is a bad number?

BTW the reason I thought the user share vs disk share thing was relevant is just that it is what prompted me to think that I had an issue with a specific disk (the one automatically chosen by unraid), rather than some other kind of problem causing my file transfers to fail. Then I found with some experimenting that disk 7 seemed to be the culprit.

The drive in question is "eligible for replacement" according to WD's warranty website.

Advice welcome, this is all outside my expertise. If I have provided the wrong info or provided it the wrong way it's unintentional.

Thanks,

Jason

syslog.txt

Joe L. · February 12, 2012

As you suspect, that disk is starting to fail with many un-readable sectors.

The syslog confirms that ata7 is /dev/sdg

Feb 12 18:21:23 Lemur kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb 12 18:21:23 Lemur kernel: ata7.00: ATA-8: WDC WD20EARS-00MVWB0, 51.0AB51, max UDMA/133
Feb 12 18:21:23 Lemur kernel: ata7.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 0/32)
Feb 12 18:21:23 Lemur kernel: ata7.00: configured for UDMA/100
Feb 12 18:21:23 Lemur kernel: scsi 6:0:0:0: Direct-Access ATA WDC WD20EARS-00M 51.0 PQ: 0 ANSI: 5
Feb 12 18:21:23 Lemur kernel: sd 6:0:0:0: [sdg] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Feb 12 18:21:23 Lemur kernel: sd 6:0:0:0: [sdg] Write Protect is off
Feb 12 18:21:23 Lemur kernel: sd 6:0:0:0: [sdg] Mode Sense: 00 3a 00 00
Feb 12 18:21:23 Lemur kernel: sd 6:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

and that /dev/sdg is assigned as disk7

Feb 12 18:21:23 Lemur kernel: md: import disk7: [8,96] (sdg) WDC WD20EARS-00M WD-WCAZA4789784 size: 1953514552
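The same chain (ata port, then device node, then unRAID slot) can be traced through a syslog with a short script. This is only a sketch: it assumes the "sd ... [sdX]" lines for a drive follow that drive's "ataN.00: ATA-" identification line, as they do in the excerpt above, which may not hold if the kernel interleaves probe messages from several ports. The function name `map_ports` is illustrative.

```python
import re

ATA_ID = re.compile(r"kernel: (ata\d+)\.00: ATA-\d+: ")
SD_DEV = re.compile(r"kernel: sd \d+:\d+:\d+:\d+: \[(sd[a-z]+)\]")
MD_IMPORT = re.compile(r"kernel: md: import (disk\d+): \[\d+,\d+\] \((sd[a-z]+)\)")


def map_ports(lines):
    """Return {ata port: (device node, unRAID slot or None)}."""
    ata_to_dev = {}
    dev_to_slot = {}
    last_port = None
    for line in lines:
        m = ATA_ID.search(line)
        if m:
            # Remember which port was identified most recently.
            last_port = m.group(1)
            continue
        m = SD_DEV.search(line)
        if m:
            # ASSUMPTION: the first sd line after an identification
            # belongs to that port (ordering-based, not guaranteed).
            if last_port is not None:
                ata_to_dev.setdefault(last_port, m.group(1))
            continue
        m = MD_IMPORT.search(line)
        if m:
            dev_to_slot[m.group(2)] = m.group(1)
    return {port: (dev, dev_to_slot.get(dev)) for port, dev in ata_to_dev.items()}


if __name__ == "__main__":
    # "syslog.txt" is the attachment name used earlier in this thread.
    with open("syslog.txt") as fh:
        for port, (dev, slot) in sorted(map_ports(fh).items()):
            print(port, "->", dev, "->", slot)
```

Run against the lines quoted above, it links ata7 to sdg and sdg to disk7, matching the manual reading.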

At this point you should replace that disk as soon as possible. You should NOT perform a parity sync, as it would over-write the existing parity with the zeros sent from the disk when un-readable sectors exist.

It is almost safer to un-assign that disk (disk7) and let unRAID simulate it from parity and the other disks until you can install a replacement. That would guarantee that a parity sync will not occur.

Joe L.

anthropoidape · February 12, 2012

Thanks Joe, I have just disabled the drive in question.

It seemed like unRAID struggled to even unmount it; it took a long time. Now the whole system is running more smoothly with it "not installed".

I should be able to get it replaced by WD as I only bought it last August, but I will get another drive as well while I wait for the replacement.

Thank you for the help.

Jason
