Another topic about interpreting SMART diagnostics...

drwatson · January 6, 2022

I have a failed data drive that is in emulation mode. I went out and purchased a larger (and equal in size as parity) drive to replace it. I was doing a preclear on the new drive and I started noticing errors on my parity drive. (I only run a single parity drive.) Here are the diagnostics on it:

The log for the disk repeats these lines:
Jan 5 22:02:16 Ceti kernel: blk_update_request: critical medium error, dev sdc, sector 11083279656 op 0x0:(READ) flags 0x0 phys_seg 66 prio class 0
Jan 5 22:02:40 Ceti kernel: sd 10:0:0:0: [sdc] tag#7962 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=6s
Jan 5 22:02:40 Ceti kernel: sd 10:0:0:0: [sdc] tag#7962 Sense Key : 0x3 [current] [descriptor]
Jan 5 22:02:40 Ceti kernel: sd 10:0:0:0: [sdc] tag#7962 ASC=0x11 ASCQ=0x0
Jan 5 22:02:40 Ceti kernel: sd 10:0:0:0: [sdc] tag#7962 CDB: opcode=0x88 88 00 00 00 00 02 94 9d c7 e0 00 00 04 00 00 00
Jan 5 22:02:40 Ceti kernel: blk_update_request: critical medium error, dev sdc, sector 11083303424 op 0x0:(READ) flags 0x0 phys_seg 60 prio class 0
Jan 5 22:35:38 Ceti kernel: sd 10:0:0:0: [sdc] tag#8485 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=6s
Jan 5 22:35:38 Ceti kernel: sd 10:0:0:0: [sdc] tag#8485 Sense Key : 0x3 [current] [descriptor]
Jan 5 22:35:38 Ceti kernel: sd 10:0:0:0: [sdc] tag#8485 ASC=0x11 ASCQ=0x0
Jan 5 22:35:38 Ceti kernel: sd 10:0:0:0: [sdc] tag#8485 CDB: opcode=0x88 88 00 00 00 00 02 8f da ee 90 00 00 04 00 00 00
Jan 5 22:35:38 Ceti kernel: blk_update_request: critical medium error, dev sdc, sector 11003425248 op 0x0:(READ) flags 0x0 phys_seg 22 prio class 0

I've done a bunch of looking around and have gathered information but I just want confirmation that I have the right answers. My questions are these:

1) Does this parity drive look like it needs replacing? (I think yes)

2) When replacing a parity drive, is a pre-clear necessary? (I think no. I think preclear is recommended to replace data drives.) Is it bad to do it? (I don't think so.)

3) Is it ok to do a parity-swap procedure with a drive that is reporting read errors like this? (I think yes.) If not, what's the alternative? (I don't have any idea.)

4) Maybe a dumb question, but the errors started on the parity drive when I removed the bad data disk from the array. I don't suppose having a bad data disk could make the parity read go bad, could it? I don't think so, but I thought I'd ask anyway.

trurl · January 6, 2022

7 minutes ago, drwatson said:

Here are the diagnostics on it

That is only the SMART attributes.

Attach the complete Diagnostics zip to your NEXT post in this thread.

drwatson · January 6, 2022

Sorry about that. I thought I attached them. Guess not.

ceti-diagnostics-20220106-0816.zip

trurl · January 6, 2022

SMART attributes for parity looks OK. You could run a SMART extended self-test.

drwatson · January 6, 2022

That's great news!

Can you explain why there are 5916 errors listed for that drive on the main dashboard?

What do these error messages mean then if there isn't a problem?
Jan 6 00:51:27 Ceti kernel: blk_update_request: critical medium error, dev sdc, sector 11202867192 op 0x0:(READ) flags 0x0 phys_seg 6 prio class 0

I am running a SMART extended self-test, but seems to be sitting at 10% for a long time (like 15 minutes). Is that normal?

trurl · January 6, 2022

1 minute ago, drwatson said:

That's great news!

Can you explain why there are 5916 errors listed for that drive on the main dashboard?

What do these error messages mean then if there isn't a problem?
Jan 6 00:51:27 Ceti kernel: blk_update_request: critical medium error, dev sdc, sector 11202867192 op 0x0:(READ) flags 0x0 phys_seg 6 prio class 0

I am running a SMART extended self-test, but seems to be sitting at 10% for a long time (like 15 minutes). Is that normal?

That does look like a disk problem, and nothing in syslog suggests it isn't a disk problem.

You might have to disable spindown on the disk to get the extended test to complete.

Sometimes SMART attributes don't tell the whole story.

drwatson · January 6, 2022

I have stopped the SMART extended self-test, set spin down to Never and then restarted the extended self-test. It jump straight to 10% and then stops. Been there for about 40 minutes so I stopped the test. Then I ran the short test with the spin down still set to never and that completed with "Completed without error".

I have another identical drive to this one in the server. I started a SMART extended self-test on that one and it too sat at 10% complete for about 45 minutes. I then turned the spin down on that drive to Never and started the SMART extended self-test again. It too sits at 10% and doesn't move on after 40 mins of waiting.

Not sure what to do here. I'm leaning toward replacing the drive and then doing a pre-clear on it and see what happens. Thoughts?

ChatNoir · January 6, 2022

The SMART test only update in increments of 10%, you should wait, it can take quite a while to perform an extended test.

Jumping to 10 is the expected behavior.

trurl · January 6, 2022

8TB will take many hours to complete extended test.

drwatson · January 6, 2022

Ok, I'm letting it run now and it's up to 50%. I'll see what the results are for that once it's done and post here. I'd say another 5 hours until it's done.

drwatson · January 7, 2022

SMART extended self-test finished and completed without errors. Is the SMART test result something that can be relied on? Does this mean the disk is fine?

trurl · January 7, 2022

Post a new SMART report.

drwatson · January 7, 2022

ceti-smart-20220107-1050.zip

Thanks for all your help. Here's the report.

drwatson · January 7, 2022

Looking through the log, I see the most recent 4 errors. They all look like this:

Error 137 [0] occurred at disk power-on lifetime: 10328 hours (430 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 41 00 00 00 00 00 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 04 00 00 a8 00 02 9b be 2c 28 40 00  1d+07:49:43.717  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 00 00  1d+07:49:43.717  READ LOG EXT
  61 04 00 00 a0 00 02 9b be 30 28 40 00  1d+07:49:30.714  WRITE FPDMA QUEUED
  60 00 30 00 d0 00 02 9b bc 3a e0 40 00  1d+07:49:24.272  READ FPDMA QUEUED
  60 00 08 00 b8 00 02 9b bc 3a d8 40 00  1d+07:49:24.271  READ FPDMA QUEUED

These are the two lines that seem concerning. Everything else looks ok.

0x03  0x028  4          100435  ---  Read Recovery Attempts
0x04  0x008  4             137  ---  Number of Reported Uncorrectable Errors

So obviously there have been some errors. I just don't know where the tipping point is for replacing the drive. What would you do?

Would these errors be enough to warranty-replace the drive?

trurl · January 7, 2022

None of the important monitored attributes has changed, but it is still a little concerning this was logged as a medium error.

Parity has none of your data, so it is slightly less important than a data disk, but of course, it must be reliable to allow a data disk to be rebuilt.

Another test could be to resync parity and see if it has any I/O errors but of course you can't do this with a disabled disk and single parity.

On 1/5/2022 at 10:59 PM, drwatson said:

I have a failed data drive that is in emulation mode.

Is this still the state of the array? How do you know that data drive is bad? Do you have any diagnostics or syslog from when that drive became disabled? If the data disk is truly dead you really have no choice but to rely on the existing parity to recover its data, but if the data disk isn't dead there may be other options.

drwatson · January 7, 2022

Yes, that disk is still being emulated. Unraid detected a full drive failure so it automatically flipped into emulation mode. I went to look at the drive and I couldn't get any SMART information at all. That drive is dead. It doesn't look like there are any files on it though. When I browse to it in the Unraid UI (far right button) it says 0 objects: 0 directories, 0 files (0 B total). So I think technically, I could remove that disk all together and I wouldn't mind. It was only 2TB.

trurl · January 7, 2022

3 minutes ago, drwatson said:

Unraid detected a full drive failure so it automatically flipped into emulation mode.

Unraid disables/emulates a disk when any write to it fails for any reason. It has to do this because the failed write makes it out-of-sync with parity. That initial failed write, and any subsequent writes to the emulated disk, can be recovered by rebuilding.

So, there isn't really any "detected a full drive failure", no quick or reliable way for Unraid to know that anyway. It just couldn't write to the disk for some reason. I expect all RAID systems work the same way, they just know that disk access isn't working for some reason and kick it out.

Sometimes a read failure can cause Unraid to try to get the emulated data from the parity calculation by reading all other disks, and then try to write that data back to the disk it couldn't read. If that write fails the disk would also be disabled, so a failed read can cause a failed write that disables the disk.

Connection problems are by far more common than bad disks, and are a likely cause of a disk getting disabled/emulated. So, maybe the disk isn't really dead. What makes you think so?

The reason I asked about that disk is because if there is nothing actually wrong with it, you could just New Config it back into the array and rebuild parity, maybe to another disk instead of that questionable parity disk. And if you are sure there was nothing on it, you could even New Config without it and rebuild parity.

Another topic about interpreting SMART diagnostics...

Recommended Posts

drwatson

Link to comment

trurl

Link to comment

drwatson

Link to comment

trurl

Link to comment

drwatson

Link to comment

trurl

Link to comment

drwatson

Link to comment

ChatNoir

Link to comment

trurl

Link to comment

drwatson

Link to comment

drwatson

Link to comment

trurl

Link to comment

drwatson

Link to comment

drwatson

Link to comment

trurl

Link to comment

drwatson

Link to comment

trurl

Link to comment

Join the conversation