Read Errors -- Causes

January 9, 2023

Hello,

I have two drives that are reporting read errors (twice in the last month). I pulled the drives and ran an extended self test AND a full sector test (surface test) with hard disk check software (Hard Disk Sentinel). All checks passed without any issue. I removed the drive from the array, restarted the array (to clear it out), removed the disk from historical devices, and put the disk back into the array to have its contents fully rebuilt. It worked without issue for about 3 weeks and then boom, read error again. I even switched the drive bay in the physical server, expecting that if it was a hardware issue the disk# with the error would change. But the same disk had read errors again (even in the other slot).

My question is: what could keep causing read errors on the same disk in different drive slots (different backplanes, SAS cables, power) when all tests keep passing on the disk outside of the array? I've uploaded the SMART report and other test results for reference.

Thanks in advance for your input.

image.png.7a9b0e45e5b161a28f2a445df6b9b0f6.png

image.png.033fb8bc8c0c87c9099d58ef572cf1e3.png

January 9, 2023

Community Expert

Do you have the Unraid diags after the errors? If yes please post those.

January 9, 2023

Author

Whoops sorry, forgot to upload those. Just generated them now, hopefully the diags you need are still there.

unraid-diagnostics-20230109-1222.zip

January 9, 2023

Community Expert

No read errors there; you probably rebooted since. If it happens again, grab and post new diags.

January 9, 2023

Author

Yes, I did reboot. I had 2 months of uptime but had to reboot for ... reasons. Will post if I get another, thanks.

January 10, 2023

Author

Hi @JorgeB -- Ok, so I've got another disk with issues. I haven't been able to verify if it's in the same slot (or backplane) as the others that were having issues but hopefully you can look at the diagnostics (attached) and make some guess as to what could be causing these read errors.

Thank you.

unraid-diagnostics-20230110-1037.zip

January 10, 2023

Community Expert

It's not logged as a disk problem, basically the disk dropped offline and reconnected with a different ID:

Jan 10 10:08:54 pumbaa kernel: scsi 17:0:13:0: rejecting I/O to dead device

This could be power/connection related, also suggest updating the LSI firmware to latest.
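For anyone wanting to check their own syslog for the same symptom, here is a minimal Python sketch. It assumes the message always has the shape of the one line quoted above; other kernel versions may word it differently, and the second sample line is made-up filler, not taken from the diagnostics.

```python
import re

# Pattern modelled on the single syslog line quoted in this thread.
# Assumption: timestamp, hostname, then "kernel: scsi H:C:T:L: rejecting I/O...".
DEAD_DEVICE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"scsi\s+(?P<addr>\d+:\d+:\d+:\d+):\s+rejecting I/O to dead device"
)


def find_dead_device_events(lines):
    """Return (timestamp, scsi_address) for each 'dead device' line found."""
    events = []
    for line in lines:
        match = DEAD_DEVICE.match(line.strip())
        if match:
            events.append((match.group("stamp"), match.group("addr")))
    return events


sample = [
    "Jan 10 10:08:54 pumbaa kernel: scsi 17:0:13:0: rejecting I/O to dead device",
    "Jan 10 10:08:55 pumbaa kernel: unrelated filler line (hypothetical)",
]
events = find_dead_device_events(sample)
print(events)
```

Grouping the returned SCSI addresses over time may help show whether the drops follow one disk or one controller port, though that interpretation is a guess rather than something the log states.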

January 10, 2023

Author

Thank you for reviewing this for me. Strange how it's been working for years and all of a sudden starts having power issues... but I guess it happens.

Since it just bumped offline, is there any way to tell Unraid that this disk's contents are fine and to add it back without having to rebuild it?

I will look into updating the LSI firmware. Are there firmware improvements that you're aware of that can resolve intermittent power issues?

January 10, 2023

Author

One more question: do you recommend upgrading to the latest firmware version? (the post you linked looks like it's upgrading to 16.00.10.00). I'm still on Unraid 6.9.2 -- I have been waiting until these issues were confirmed resolved before upgrading. Should I (do I need to) upgrade Unraid first to ensure it's compatible with the latest LSI firmware version?

January 10, 2023

Community Expert

23 hours ago, srfnmnk said:

I pulled the drives and ran an extended self test

Just FYI

You can do extended self-test on the disk in Unraid even with it still in the array.

January 10, 2023

Community Expert

7 minutes ago, srfnmnk said:

Since it just bumped offline, is there any way to tell Unraid that this disk's contents are fine and to add it back without having to rebuild it?

Unraid disables a disk when a write to it fails. That failed write leaves the disk out of sync with parity, because parity is still updated so that the failed write can be recovered by a rebuild. Also, while a disk is disabled, writes can still go to the emulated disk by updating parity, and those can be recovered by a rebuild too.

You can New Config and rebuild parity instead of rebuilding the data disk, but any missed writes would be lost.

If you think there haven't been many writes, you could New Config and trust parity, but you should run a correcting parity check afterwards just to get things back in sync.

Rebuilding the disk to recover the missed writes is the normal thing to do.
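To make the difference between the two options concrete, here is a toy Python sketch. It assumes simple single XOR parity over three tiny made-up "disks"; it illustrates the idea only and is not how Unraid is actually implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy data "disks" of four bytes each (hypothetical contents).
disk1, disk2, disk3 = b"AAAA", b"BBBB", b"CCCC"

# A write of b"NEW!" to disk2 fails: the disk keeps its old bytes,
# but parity is updated as if the write had succeeded.
intended = b"NEW!"
parity = xor_blocks([disk1, intended, disk3])
on_disk2 = disk2  # stale contents, now out of sync with parity

# Option 1: rebuild the data disk from parity plus the other disks.
rebuilt = xor_blocks([parity, disk1, disk3])

# Option 2: keep the disk as it is and recompute parity from the disks,
# which is roughly what New Config with a parity rebuild amounts to here.
new_parity = xor_blocks([disk1, on_disk2, disk3])
after_new_config = xor_blocks([new_parity, disk1, disk3])

print(rebuilt)           # the missed write is recovered
print(after_new_config)  # only the stale contents remain
```

In this model the missed write exists only in the old parity, so recomputing parity from the stale disk discards it for good.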

January 10, 2023

Community Expert

15 minutes ago, srfnmnk said:

the post you linked looks like it's upgrading to 16.00.10.00

That's for the 9300-8i, yours is a different model, latest firmware for the one you have is 20.00.07.00

January 10, 2023

Author

Thank you again @trurl and @JorgeB

18 minutes ago, trurl said:

You can do extended self-test on the disk in Unraid even with it still in the array.

Yes, I know but the disk was offline and I wanted to ensure it wasn't power issues when I ran the extended self-test, this was the main reason I tested it outside of the array.

3 minutes ago, JorgeB said:

That's for the 9300-8i, yours is a different model, latest firmware for the one you have is 20.00.07.00

Gotcha -- are you aware of firmware updates that can resolve power bumps? Trying to guesstimate if firmware is likely to solve the issue or if I should assume the power path / backplane is the cause. I know you can't pin down the root cause beyond power, but I'm curious whether firmware updates have resolved similar issues in the past?

17 minutes ago, trurl said:

You can New Config and rebuild parity instead of rebuilding the data disk, but any missed writes would be lost.

Wouldn't this invalidate existing parity, resulting in a risk of data loss during the parity rebuild?

January 10, 2023

Community Expert

Just now, srfnmnk said:

Wouldn't this invalidate existing parity, resulting in a risk of data loss during the parity rebuild?

Not entirely sure what you mean. New Config is going to rebuild parity (by default) based on the current contents of all the data disks in the array. So it is rebuilding parity. Maybe you mean something else when you say "parity rebuild" here. Do you actually mean rebuilding a data disk from the parity calculation?

If you New Config/Rebuild parity instead of just rebuilding the data disk, then that initial failed write that disabled the disk, and any subsequent emulated writes to that disk, can't be recovered since parity would be in sync with the current contents of the disk and no longer have the data for the missed writes.

And it is possible that the missed writes contain filesystem metadata, so there is some slight chance that the current contents of the disk are missing those and the filesystem might have corruption unless you rebuild those missed writes.

27 minutes ago, trurl said:

Rebuilding the disk to recover the missed writes is the normal thing to do.

January 10, 2023

Author

Is there a way to determine the exact model number of my sas controllers? I have an idea but don't want to be guessing.

I see these two lines for cm0 and cm1 -- so it appears I have 2 sas controllers but not sure how to ensure I select the right firmware based on LSISAS2008 and LSISAS2116

mpt2sas_cm0: LSISAS2008: FWVersion(19.00.00.00), ChipRevision(0x03), BiosVersion(07.24.01.00)

mpt2sas_cm1: LSISAS2116: FWVersion(17.00.01.00), ChipRevision(0x02), BiosVersion(07.24.01.00)

Edited January 10, 2023 by srfnmnk

January 10, 2023

Author

2 minutes ago, trurl said:

Not entirely sure what you mean. New Config is going to rebuild parity (by default) based on the current contents of all the data disks in the array. So it is rebuilding parity. Maybe you mean something else when you say "parity rebuild" here. Do you actually mean rebuilding a data disk from the parity calculation?

I understand you now. I meant if one of my disks were to fail during the parity rebuild the contents of the failed disk would be lost. Based on what you said, it sounds like that's accurate.

January 10, 2023

Community Expert

5 minutes ago, srfnmnk said:

I understand you now. I meant if one of my disks were to fail during the parity rebuild the contents of the failed disk would be lost. Based on what you said, it sounds like that's accurate.

Technically, if you only have single parity, rebuilding a data disk or rebuilding parity is really the same thing as far as the other disks are concerned. While one disk is disabled/invalid/rebuilding, the others are unprotected.

There are ways to try to rebuild a truly dead disk based on the contents of a disabled disk that can still be read though. How successful that will be depends on how out-of-sync the disabled disk is.
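As a rough illustration of why the other disks are unprotected in that window, here is a small Python sketch. It assumes plain single XOR parity over one made-up byte per disk and does not model a second parity disk at all.

```python
from functools import reduce
from itertools import product


def xor_all(values):
    """XOR a sequence of integers together."""
    return reduce(lambda a, b: a ^ b, values, 0)


# One byte per toy data disk (hypothetical values), plus single parity.
disks = [0x11, 0x22, 0x33, 0x44]
parity = xor_all(disks)


def candidates(known, missing_count):
    """Every combination of missing bytes that is consistent with parity."""
    return [
        combo
        for combo in product(range(256), repeat=missing_count)
        if xor_all(list(known) + list(combo)) == parity
    ]


one_missing = candidates(disks[1:], 1)  # first disk unreadable
two_missing = candidates(disks[2:], 2)  # first two disks unreadable

print(len(one_missing), len(two_missing))
```

With one disk missing there is exactly one consistent answer, so it can be rebuilt. With two missing, every value of one disk pairs with some value of the other, so single parity alone cannot tell which pair is the real one.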

January 10, 2023

Author

3 minutes ago, trurl said:

if you only have single parity

I have dual parity.

January 10, 2023

Author

29 minutes ago, srfnmnk said:

Is there a way to determine the exact model number of my sas controllers? I have an idea but don't want to be guessing.

I see these two lines for cm0 and cm1 -- so it appears I have 2 sas controllers but not sure how to ensure I select the right firmware based on LSISAS2008 and LSISAS2116

mpt2sas_cm0: LSISAS2008: FWVersion(19.00.00.00), ChipRevision(0x03), BiosVersion(07.24.01.00)

mpt2sas_cm1: LSISAS2116: FWVersion(17.00.01.00), ChipRevision(0x02), BiosVersion(07.24.01.00)

This is the last question I have for now -- If I can confirm my model numbers / firmware version (without ripping into the case) that would be fantastic. Any ideas on how to do that?

I was able to use sas2IRCU to list the two devices

image.png.5706ee62f512b1491257ba6f111ab3d3.png

Edited January 10, 2023 by srfnmnk

January 10, 2023

Community Expert

39 minutes ago, srfnmnk said:

Gotcha -- are you aware of firmware updates that can resolve power bumps?

It can resolve a disconnect, which is what happened. It could have been power/connection related, but there could also have been other reasons.

37 minutes ago, srfnmnk said:

LSISAS2008

Look for 9211-8i

37 minutes ago, srfnmnk said:

LSISAS2116

Look for 9201-16i
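The lookup above can be sketched in Python against the two `mpt2sas` lines posted earlier. The chip-to-card mapping is the one given in this reply; treating the 20.00.07.00 version mentioned earlier in the thread as the target for both controllers is an assumption, so check the vendor's download page for each card before flashing.

```python
import re

# Firmware family to look for, per the reply above.
CHIP_TO_CARD = {"LSISAS2008": "9211-8i", "LSISAS2116": "9201-16i"}

# Assumption: the version quoted earlier in the thread applies to both cards.
LATEST_FW = "20.00.07.00"

LINE = re.compile(
    r"(?P<dev>mpt2sas_cm\d+): (?P<chip>LSISAS\d+): FWVersion\((?P<fw>[\d.]+)\)"
)


def version_tuple(text):
    """Turn '19.00.00.00' into (19, 0, 0, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def summarize(lines):
    """Extract controller, suggested card model and firmware state per line."""
    rows = []
    for line in lines:
        match = LINE.search(line)
        if not match:
            continue
        firmware = match.group("fw")
        rows.append({
            "device": match.group("dev"),
            "card": CHIP_TO_CARD.get(match.group("chip"), "unknown"),
            "firmware": firmware,
            "outdated": version_tuple(firmware) < version_tuple(LATEST_FW),
        })
    return rows


log = [
    "mpt2sas_cm0: LSISAS2008: FWVersion(19.00.00.00), ChipRevision(0x03), BiosVersion(07.24.01.00)",
    "mpt2sas_cm1: LSISAS2116: FWVersion(17.00.01.00), ChipRevision(0x02), BiosVersion(07.24.01.00)",
]
rows = summarize(log)
for row in rows:
    print(row)
```

Comparing versions as integer tuples rather than strings avoids mis-ordering values such as 9.x against 20.x.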

October 28, 2024

Hi

help on the below appreciated:

1. Array reported Health FAIL. I identified it was due to 10 read errors on one drive (rebooted since, so not sending diagnostics).

2. Did an extended check on the drive (attached)

I know the check is not comprehensive and is no guarantee, but I didn't spot anything that seems to be a concern.

So, could 10 read errors be just "bad luck" or should I consider getting a new drive my utmost urgent task?

Is it something to "just watch" and act on if read errors happen again?

Regards Nico

ps, Thank you!

WDC_WD40EFRX-68N32N0_WD-WCC7K6RRJVCE-20241028-1654.txt

October 28, 2024

Community Expert

It's logged as a disk error on SMART. It would be good to see the syslog to check whether it was logged the same way there, but since it passed the extended test, give it another chance. If there are any more errors, I would replace it.

October 28, 2024

Thx, I will watch it and yes, my data is too valuable not to buy a new drive if needed

Funny enough, I had received this information literally the minute after I posted:

Tower: Notice [TOWER] - array turned good

1730131201

Array has 0 disks with read errors

Is that "a good sign" ?

October 28, 2024

Community Expert

That's not really relevant, it just means the errors were cleared, likely due to the array being restarted.
