Drive or cables or...? [SOLVED]



Hi, I started having weird stability problems a little while ago, but haven't really had time to do much about it. Basically, failed writes of large files, sometimes but not always.

 

I didn't have any signs of a problem in my syslog so I assumed it was a fault at my desktop's end of things.

 

Messing around a bit, I noticed that it happened when accessing the unRAID server via user shares, but not via disk shares. In other words, if I pasted a file to //tower/disk5/videos/ ... no problem. But if I pasted to a user share, craaash.

 

As of yesterday some syslog data has started appearing... lots of it. Basically the following, repeated over and over and over:

 

Feb 12 13:37:39 Lemur kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 12 13:37:39 Lemur kernel: ata7.00: BMDMA2 stat 0x80d1009
Feb 12 13:37:39 Lemur kernel: ata7.00: failed command: READ DMA EXT
Feb 12 13:37:39 Lemur kernel: ata7.00: cmd 25/00:08:58:12:c2/00:00:c8:00:00/e0 tag 0 dma 4096 in
Feb 12 13:37:39 Lemur kernel: res 51/40:08:58:12:c2/00:00:c8:00:00/f0 Emask 0x9 (media error)
Feb 12 13:37:39 Lemur kernel: ata7.00: status: { DRDY ERR }
Feb 12 13:37:39 Lemur kernel: ata7.00: error: { UNC }
Feb 12 13:37:39 Lemur kernel: ata7.00: configured for UDMA/100
Feb 12 13:37:39 Lemur kernel: ata7: EH complete

 

SMART test results:

 

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      4680         25953144
# 2  Short offline       Aborted by host               90%      4680         -
# 3  Extended offline    Completed: read failure       90%      4678         54840184
# 4  Short offline       Completed without error       00%      4430         -

 

Running unRAID Pro v 4.7

 

Clearly (I think!) this is a problem with disk7 on my server. Can I make any reliable assumptions about what the problem is? Specifically, should I assume it's the HDD itself (it's actually the newest disk in the system, from about August last year) and that it needs replacing, or should I troubleshoot further, for example by replacing cables? Basically, I don't want to fork out for a new drive if I don't need to, but equally I don't want to mess around looking for phantom problems if it is definitely or almost definitely the drive itself.

 

Any suggestions?

 

Thanks,

 

Jason


ata7 is not necessarily disk7.  Since you did not post a complete syslog, we really cannot tell.

 

Media errors are un-readable sectors on a physical disk.  They have nothing to do with user-share or direct disk access.

 

A "smartctl -a /dev/sdX" on each of your disks will probably show one (or more) with sectors pending re-allocation, or already re-allocated.

 

You can then evaluate the true health of your drives.
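
For anyone who would rather script that check than read each report by eye, here is a minimal sketch in Python. It assumes smartctl is installed and run as root, and that the attribute table uses the usual ten-column layout. The names ATTR_ROW, WATCHED, read_smart and sector_counts are made up for illustration; they are not part of unRAID or smartmontools.

```python
import re
import subprocess

# One row of the SMART attribute table printed by `smartctl -a`:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ATTR_ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)

# Attribute IDs worth watching for failing sectors.
WATCHED = {5, 197}  # 5 = reallocated sectors, 197 = pending sectors


def read_smart(device):
    """Return the text of `smartctl -a <device>` (hypothetical helper).

    smartctl uses a non-zero exit status to flag disk problems, so the
    exit status is deliberately not treated as an error here.
    """
    result = subprocess.run(
        ["smartctl", "-a", device], capture_output=True, text=True, check=False
    )
    return result.stdout


def sector_counts(smart_text):
    """Return {attribute_name: raw_value} for the watched attributes."""
    counts = {}
    for line in smart_text.splitlines():
        m = ATTR_ROW.match(line)
        if m and int(m.group("id")) in WATCHED:
            counts[m.group("name")] = int(m.group("raw"))
    return counts
```

Something like sector_counts(read_smart("/dev/sdg")) would then give the raw counts for one disk. Any non-zero raw value on either attribute is worth a closer look.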

 

Joe L.



Thanks Joe.

 

Sorry it took me a while to find the info on copying a syslog.  I think I've attached one now.

 

With regards to pending sectors, the unraid smart report facility says:

 

Disk 0/SDA:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1

Disk 1/SDB:
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

Disk 2/SDD:
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

Disk 3/SDC:
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

Disk 4/SDE:
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

Disk 5/SDF:
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

Disk 6/SDH:
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

Disk 7/SDG:
197 Current_Pending_Sector  0x0032   198   198   000    Old_age   Always       -       680

 

I now see that disk 7 is a bit of an outlier in this respect. Am I right to think that the 680 is a bad number?

 

BTW the reason I thought the user share vs disk share thing was relevant is just that it is what prompted me to think that I had an issue with a specific disk (the one automatically chosen by unraid), rather than some other kind of problem causing my file transfers to fail. Then I found with some experimenting that disk 7 seemed to be the culprit.

 

The drive in question is "eligible for replacement" according to WD's warranty website.

 

Advice welcome, this is all outside my expertise. If I have provided the wrong info or provided it the wrong way it's unintentional.

 

Thanks,

 

Jason

syslog.txt


As you suspect, that disk is starting to fail with many un-readable sectors.

 

The syslog confirms that ata7 is /dev/sdg

Feb 12 18:21:23 Lemur kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb 12 18:21:23 Lemur kernel: ata7.00: ATA-8: WDC WD20EARS-00MVWB0, 51.0AB51, max UDMA/133
Feb 12 18:21:23 Lemur kernel: ata7.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 0/32)
Feb 12 18:21:23 Lemur kernel: ata7.00: configured for UDMA/100
Feb 12 18:21:23 Lemur kernel: scsi 6:0:0:0: Direct-Access    ATA      WDC WD20EARS-00M 51.0 PQ: 0 ANSI: 5
Feb 12 18:21:23 Lemur kernel: sd 6:0:0:0: [sdg] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Feb 12 18:21:23 Lemur kernel: sd 6:0:0:0: [sdg] Write Protect is off
Feb 12 18:21:23 Lemur kernel: sd 6:0:0:0: [sdg] Mode Sense: 00 3a 00 00
Feb 12 18:21:23 Lemur kernel: sd 6:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

and that /dev/sdg is assigned as disk7

Feb 12 18:21:23 Lemur kernel: md: import disk7: [8,96] (sdg) WDC WD20EARS-00M WD-WCAZA4789784 size: 1953514552
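
The same lookup can be roughly automated. The Python sketch below reads a syslog, pairs each SCSI host with its sdX name from the "sd H:C:I:L: [sdX]" lines, and pairs each sdX with its unRAID slot from the "md: import diskN" lines. It then ASSUMES that ataN corresponds to SCSI host N-1, which matches this syslog (ata7 and host 6) but is only a heuristic and can be wrong on other systems. The function name map_ata_ports and the regex names are hypothetical.

```python
import re

# "sd 6:0:0:0: [sdg] ..." gives SCSI host number and device name.
SD_LINE = re.compile(r"sd (\d+):\d+:\d+:\d+: \[(sd[a-z]+)\]")
# "md: import disk7: [8,96] (sdg) ..." gives unRAID slot and device name.
MD_LINE = re.compile(r"md: import disk(\d+): \[\d+,\d+\] \((sd[a-z]+)\)")


def map_ata_ports(syslog_text):
    """Return {"ataN": (device, slot_or_None)} from syslog text.

    ASSUMPTION: ataN is SCSI host N-1. This holds in the syslog posted
    in this thread but is not guaranteed in general.
    """
    host_to_dev = {}
    dev_to_slot = {}
    for line in syslog_text.splitlines():
        m = SD_LINE.search(line)
        if m:
            host_to_dev[int(m.group(1))] = m.group(2)
        m = MD_LINE.search(line)
        if m:
            dev_to_slot[m.group(2)] = int(m.group(1))
    return {
        f"ata{host + 1}": (dev, dev_to_slot.get(dev))
        for host, dev in host_to_dev.items()
    }
```

Run against the lines quoted above, the result should contain an entry for ata7 pointing at sdg and slot 7. It is still worth checking the answer against the drive model and serial number in the syslog before pulling a disk.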

 

At this point you should replace that disk as soon as possible.  You should NOT perform a parity sync, as it would over-write the existing parity with the zeros sent from the disk when un-readable sectors exist.

 

It is almost safer to un-assign that disk (disk7) and let unRAID simulate it from parity and the other disks until you can install a replacement.  That would guarantee that a parity sync will not occur.

 

Joe L.


Thanks Joe, I have just disabled the drive in question.

 

It seemed like unRAID struggled to even unmount it; it took a long time. Now the whole system is running more smoothly with it "not installed".

 

I should be able to get it replaced by WD as I only bought it last August, but I will get another drive as well while I wait for the replacement.

 

Thank you for the help.

 

Jason
