Disk is bad -- Or is it? Contradictory indications



Earlier this year I converted an older Ubuntu file server into an Unraid server. It had been fine until I started getting notices about one disk. First these:

Event: Unraid Disk 2 SMART health [197]
Subject: Warning [FLINT-UN] - current pending sector is 2
Description: WDC_WD10EZRX-00L4HB0_WD-WMC4J0164198 (sdc)
Importance: warning
Event: Unraid Disk 2 SMART health [198]
Subject: Warning [FLINT-UN] - offline uncorrectable is 2
Description: WDC_WD10EZRX-00L4HB0_WD-WMC4J0164198 (sdc)
Importance: warning

 

I followed another thread here and ran an extended SMART self-test, which finished without showing anything out of the ordinary other than the 197 and 198 attributes at 2. I ran a parity check without corrections and it passed fine. The logs showed no errors on the drive. So I monitored it. Shortly the 198 returned to 0:

Event: Unraid Disk 2 SMART message [198]
Subject: Notice [FLINT-UN] - offline uncorrectable returned to normal value
Description: WDC_WD10EZRX-00L4HB0_WD-WMC4J0164198 (sdc)
Importance: normal
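For anyone who wants to track these attributes outside the notification system, here is a minimal sketch that pulls the raw values of attributes 5, 197 and 198 out of text in the column layout printed by `smartctl -A`. The sample table is illustrative only (it is not this drive's actual report), and the function names and the choice of watched attributes are assumptions made for the example.

```python
import re

# Illustrative excerpt in the column layout printed by `smartctl -A`.
# The numbers are made up for the example; they are not from the attached reports.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       2
"""

# Attributes worth watching for sector trouble (an assumption for this sketch).
WATCHED = {5, 197, 198}


def parse_raw_values(text):
    """Return {attribute_id: (name, raw_value)} for every attribute row."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            match = re.match(r"\d+", fields[9])  # leading integer of RAW_VALUE
            if match:
                raw[int(fields[0])] = (fields[1], int(match.group()))
    return raw


def nonzero_watched(text):
    """Return only the watched attributes whose raw value is above zero."""
    return {
        attr_id: info
        for attr_id, info in parse_raw_values(text).items()
        if attr_id in WATCHED and info[1] > 0
    }


print(nonzero_watched(SAMPLE))
```

Running this before and after a full disk write would show whether the pending count really dropped back to 0.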

 

I continued using the disk, and then this past weekend it started reporting errors:

Event: Unraid array errors
Subject: Warning [FLINT-UN] - array has errors
Description: Array has 1 disk with read errors
Importance: warning

Disk 2 - WDC_WD10EZRX-00L4HB0_WD-WMC4J0164198 (sdc) (errors 856)

The error count went up to about 2700 and then stayed there. This all happened during a scheduled parity check. There haven't been any further errors logged since that parity check.

 

After seeing those read errors, I got myself a new drive and installed it and precleared it. Once precleared I put the new drive in for the old drive. This went smoothly and the rebuild onto the new drive completed without error.

 

The old drive is now in as an unassigned device. I wanted to see what I could learn about its condition, and now it looks like there's nothing amiss with it (but I don't fully understand all the SMART stats, so I may be missing something). I ran an extended self-test, which came back with the current pending sector count at 2 as before. So I ran a preclear, anticipating read errors, but there were none. The preclear completed without error, and the current pending sector count has now gone back to 0:

Event: Unraid device dev1 SMART message [197]
Subject: Notice [FLINT-UN] - current pending sector returned to normal value
Description: WDC_WD10EZRX-00L4HB0_WD-WMC4J0164198 (dev1)
Importance: normal

 

So now I don't know how to treat the disk. The read errors on the weekend suggested it was bad, but nothing since seems to support that.

 

I'm attaching the latest SMART self-tests (the extended one before the preclear and a regular one after), plus the preclear logs and summary. I'm also including an extended self-test from Jul 14, when the current pending sector warning first appeared.

 

What do you think? How would you treat this disk now?

 

Thanks for any feedback.

 

Bob 

 

 

disk-info.zip

Link to comment

It's not that uncommon for WD drives to show "false positives". If the extended SMART test passes, the disk is OK for now, and a full disk write should get rid of the pending sectors. As for the array errors, we'd need to see the syslog to tell whether they were reported as a disk issue or a connection problem.

Link to comment

Thanks for the info. I don't have syslogs for the read errors. I didn't save a copy before shutting down to swap disks; I forgot that logs aren't persistent across reboots. What are the specific log messages that distinguish a disk issue from a connection issue? Maybe I can recognize them from memory. The messages I saw showed a read error followed by a block number, if that helps.

Link to comment
9 minutes ago, MrChip said:

What are the specific log messages that distinguish disk issue from connection issues?

It depends on the controller. If it's onboard SATA/AHCI, this usually indicates a disk problem:

 


 

Quote

 

Nov 12 11:37:00 nas0 kernel: ata1.00: cmd 25/00:00:40:f2:08/00:04:00:00:00/e0 tag 0 dma 524288 in
Nov 12 11:37:00 nas0 kernel:          res 51/40:7f:b8:f5:08/40:00:00:00:00/e0 Emask 0x9 (media error)
Nov 12 11:37:00 nas0 kernel: ata1.00: status: { DRDY ERR }
Nov 12 11:37:00 nas0 kernel: ata1.00: error: { UNC }
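As a rough illustration of how lines like these could be sorted automatically, here is a minimal sketch that tags kernel ATA messages as pointing at the media or at the link. The media markers (`media error`, `UNC`) come from the quoted example. The link markers (`ICRC`, `ATA bus error`, `interface fatal error`) are an assumption about what connection problems typically look like on onboard SATA, and the function names are made up for the example. Treat it as a heuristic, not a diagnosis.

```python
import re
from collections import Counter

# Markers taken from the quoted example: these point at the disk surface.
MEDIA = re.compile(r"\(media error\)|error: \{[^}]*\bUNC\b")

# Assumed markers for cable/port trouble; not taken from this thread.
LINK = re.compile(
    r"\(ATA bus error\)|interface fatal error|error: \{[^}]*\bICRC\b"
)

# The four lines quoted in the post above.
QUOTED = [
    "Nov 12 11:37:00 nas0 kernel: ata1.00: cmd 25/00:00:40:f2:08/00:04:00:00:00/e0 tag 0 dma 524288 in",
    "Nov 12 11:37:00 nas0 kernel:          res 51/40:7f:b8:f5:08/40:00:00:00:00/e0 Emask 0x9 (media error)",
    "Nov 12 11:37:00 nas0 kernel: ata1.00: status: { DRDY ERR }",
    "Nov 12 11:37:00 nas0 kernel: ata1.00: error: { UNC }",
]


def classify(line):
    """Tag one syslog line as 'media', 'link', or 'other'."""
    if MEDIA.search(line):
        return "media"
    if LINK.search(line):
        return "link"
    return "other"


def summarize(lines):
    """Count how many lines fall into each category."""
    return Counter(classify(line) for line in lines)


print(summarize(QUOTED))
```

A log dominated by "media" lines would fit a failing disk, while mostly "link" lines would suggest checking cables and ports first.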

 

 

Link to comment

Those messages don't look familiar. What I saw was more like:

      <date> <time> <hostname> <???> Read error <disk> block=123456789

where I don't recall what was at <???>, and <disk> was either Disk2 or sdc. Either way, it unambiguously identified the disk listed on the Main tab.

 

There were thousands of these error logs, all together as a block in the log file with only a handful of other log messages interspersed through the block. The block numbers weren't consecutive, but they appeared to fall inside a range. There was a parity check running at the time, so at first I thought there was a bad spot on the disk that the parity check was reading through. But if that were so, I would expect errors to appear during the pre-read portion of the follow-up preclear, and no errors were reported during any part of the preclear.
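Had the log survived, the "inside a range" impression could have been checked with something like the sketch below. It assumes lines carry a `block=<number>` field, which follows the format recalled above from memory and is not verified against real Unraid output. The sample lines and the function name are placeholders for the example.

```python
import re

# Matches the block number in the line format recalled above (an assumption).
BLOCK = re.compile(r"\bblock=(\d+)")

# Placeholder lines in the recalled format; the block numbers are made up.
SAMPLE_LINES = [
    "<date> <time> <hostname> <???> Read error Disk2 block=1000",
    "<date> <time> <hostname> <???> Read error Disk2 block=1016",
    "<date> <time> <hostname> some unrelated message",
    "<date> <time> <hostname> <???> Read error Disk2 block=1008",
    "<date> <time> <hostname> <???> Read error Disk2 block=5000",
]


def block_summary(lines):
    """Summarize the distinct block numbers found in the given log lines."""
    found = set()
    for line in lines:
        match = BLOCK.search(line)
        if match:
            found.add(int(match.group(1)))
    if not found:
        return None
    blocks = sorted(found)
    # Distance between neighbouring failing blocks; a large gap hints at
    # more than one affected region.
    gaps = [b - a for a, b in zip(blocks, blocks[1:])]
    return {
        "count": len(blocks),
        "first": blocks[0],
        "last": blocks[-1],
        "largest_gap": max(gaps, default=0),
    }


print(block_summary(SAMPLE_LINES))
```

A small span with small gaps would fit a single bad region on the platter, while errors scattered across the whole disk would look more like a connection or power problem.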

 

The disks in this server are on the main board SATA ports. 

Link to comment
11 hours ago, JorgeB said:

the important part is before that

And I don't have that record, so I think I'm out of luck for diagnosing this any further.

 

I've configured the syslog server feature to help retain logs going forward.
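Separately from the syslog server, a manual snapshot before any planned shutdown would also have preserved the evidence. Below is a minimal sketch of that idea. The default paths (`/var/log/syslog` as the live log and `/boot/logs` as a folder that survives reboots) are assumptions about this setup, and `snapshot_log` is a made-up helper name.

```python
import shutil
import time
from pathlib import Path


def snapshot_log(src="/var/log/syslog", dest_dir="/boot/logs"):
    """Copy a log file into dest_dir under a timestamped name.

    Both default paths are assumptions for this sketch; adjust them to
    wherever the live log and the persistent storage actually are.
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.name}-{stamp}.txt"
    # copyfile copies contents only, so it does not depend on the target
    # filesystem supporting Unix permissions.
    shutil.copyfile(src, dest)
    return dest


# Example call before a shutdown (commented out so nothing runs on import):
# print(snapshot_log())
```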

 

I've put the disk on the shelf, marked as questionable. 

 

Thank you for your help and insights.

Edited by MrChip
Add syslog note.