January 9, 2023
Hello, I have two drives that keep reporting read errors (twice in the last month). I pulled the drives and ran an extended self test AND a full sector test (surface test) with hard disk check software (Hard Disk Sentinel). All checks passed without any issue. I removed the drive from the array, restarted the array (to clear it out), removed the disk from historical devices, and put the disk back into the array to have its contents fully rebuilt. It worked without issue for about 3 weeks and then boom, read error again. I even switched the drive bay in the physical server, expecting that if it was a hardware issue the disk# with the error would change. But the same disk had read errors again (even in the other slot). My question is, what could keep causing read errors on the same disk in different drive slots (different backplanes, SAS cables, power) when all tests keep passing on the disk outside of the array? Uploaded SMART report and other test results for reference. Thanks in advance for your input.
January 9, 2023 Community Expert
Do you have the Unraid diags after the errors? If yes, please post those.
January 9, 2023 Author
Whoops, sorry, forgot to upload those. Just generated them now; hopefully the diags you need are still there.
unraid-diagnostics-20230109-1222.zip
January 9, 2023 Community Expert
No read errors there; you probably rebooted since. If it happens again, grab and post new diags.
January 9, 2023 Author
Yes, I did reboot. I had 2 months of uptime but had to reboot for ... reasons. Will post if I get another, thanks.
January 10, 2023 Author
Hi @JorgeB -- OK, so I've got another disk with issues. I haven't been able to verify if it's in the same slot (or backplane) as the others that were having issues, but hopefully you can look at the diagnostics (attached) and make some guess as to what could be causing these read errors. Thank you.
unraid-diagnostics-20230110-1037.zip
January 10, 2023 Community Expert
It's not logged as a disk problem; basically the disk dropped offline and reconnected with a different ID:
Jan 10 10:08:54 pumbaa kernel: scsi 17:0:13:0: rejecting I/O to dead device
This could be power/connection related. I also suggest updating the LSI firmware to the latest.
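A drop like the one in the quoted log line can be found by scanning the syslog from the diagnostics zip for that message. The following is a minimal sketch, assuming the message format matches the single line quoted in this thread; the function name `find_dropped_devices` is hypothetical, and other kernel versions may word the message differently.

```python
import re

# Pattern modeled on the one kernel line quoted in this thread:
#   Jan 10 10:08:54 pumbaa kernel: scsi 17:0:13:0: rejecting I/O to dead device
# The four numbers are the SCSI host:channel:target:lun address.
DEAD_DEVICE = re.compile(
    r"kernel: scsi (\d+:\d+:\d+:\d+): rejecting I/O to dead device"
)


def find_dropped_devices(lines):
    """Return SCSI addresses that logged a 'dead device' message, in first-seen order."""
    seen = []
    for line in lines:
        match = DEAD_DEVICE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

Calling it on the lines of a syslog file (for example `find_dropped_devices(open(path))`, where `path` points at the extracted syslog) would list each affected address once, which may help when checking whether the drops always come from the same controller port.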
January 10, 2023 Author
Thank you for reviewing for me. Strange how it's been working for years and all of a sudden starts having power issues... but I guess it happens. Since it just bumped offline, is there any way to tell Unraid that this disk's contents are fine and to add it back without having to rebuild it? I will look into updating the LSI firmware. Are there firmware improvements that you're aware of that can resolve intermittent power issues?
January 10, 2023 Author
One more question: do you recommend upgrading to the latest firmware version? (the post you linked looks like it's upgrading to 16.00.10.00). I'm still on Unraid 6.9.2 -- have been waiting until these issues were confirmed resolved before upgrading. Should I (do I need to) upgrade Unraid first to ensure it's compatible with the latest LSI firmware version?
January 10, 2023 Community Expert
23 hours ago, srfnmnk said: I pulled the drives and ran an extended self test
Just FYI: you can do an extended self-test on the disk in Unraid even with it still in the array.
January 10, 2023 Community Expert
7 minutes ago, srfnmnk said: Since it just bumped offline, is there any way to tell Unraid that this disk's contents are fine and to add it back without having to rebuild it?
Unraid disables a disk because a write failed, and that failed write makes it out-of-sync with parity, since parity is updated so that the failed write can be recovered by rebuild. Also, while a disk is disabled, writes can still go to the emulated disk by updating parity, and those can be recovered by rebuild. You can New Config and rebuild parity instead of rebuilding the data disk, but any missed writes would be lost. If you think there haven't been many writes, you could New Config and trust parity, but you should run a correcting parity check afterwards just to get things back in sync. Rebuilding the disk to recover the missed writes is the normal thing to do.
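The trade-off described in this reply can be illustrated with a toy model. This is only a sketch, assuming single parity computed as a byte-wise XOR across the data disks; the second parity of a dual-parity array uses a different calculation and is not modeled here. The names `xor_blocks` and `simulate_missed_write` and the two-byte "disks" are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def simulate_missed_write():
    """Compare rebuilding the disk with New Config + rebuild parity after a failed write."""
    disk1, disk2_on_platter, disk3 = b"\x10\x20", b"\x0a\x0b", b"\x01\x02"
    intended_write = b"\xff\xee"

    # The write to disk 2 fails, so the platter keeps the old bytes and the
    # disk is disabled, but parity is updated as if the write had succeeded.
    parity = xor_blocks([disk1, intended_write, disk3])

    # Option A: rebuild disk 2 from parity plus the other data disks.
    rebuilt_disk2 = xor_blocks([parity, disk1, disk3])

    # Option B: New Config and rebuild parity from what is on the platters.
    new_parity = xor_blocks([disk1, disk2_on_platter, disk3])
    disk2_after_new_config = xor_blocks([new_parity, disk1, disk3])

    return intended_write, rebuilt_disk2, disk2_after_new_config
```

In this model the rebuilt disk contains the write that never reached the platter, while after New Config the new parity only describes the stale contents, so the missed write is gone.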
January 10, 2023 Community Expert
15 minutes ago, srfnmnk said: the post you linked looks like it's upgrading to 16.00.10.00
That's for the 9300-8i; yours is a different model. The latest firmware for the one you have is 20.00.07.00.
January 10, 2023 Author
Thank you again @trurl and @JorgeB
18 minutes ago, trurl said: You can do extended self-test on the disk in Unraid even with it still in the array.
Yes, I know, but the disk was offline and I wanted to ensure it wasn't power issues when I ran the extended self-test; this was the main reason I tested it outside of the array.
3 minutes ago, JorgeB said: That's for the 9300-8i, yours is a different model, latest firmware for the one you have is 20.00.07.00
Gotcha -- are you aware of firmware updates that can resolve power bumps? Trying to guesstimate if firmware is likely to solve the issue or if I should assume the power path / backplane is to blame. I know you can't answer the root cause other than power, but I'm curious whether firmware updates have resolved similar issues in the past.
17 minutes ago, trurl said: You can New Config and rebuild parity instead of rebuilding the data disk, but any missed writes would be lost.
Wouldn't this invalidate existing parity and result in a risk of data loss during the parity rebuild?
January 10, 2023 Community Expert
Just now, srfnmnk said: Wouldn't this invalidate existing parity and result in a risk of data loss during the parity rebuild?
Not entirely sure what you mean. New Config is going to rebuild parity (by default) based on the current contents of all the data disks in the array. So it is rebuilding parity. Maybe you mean something else when you say "parity rebuild" here. Do you actually mean rebuilding a data disk from the parity calculation?
If you New Config/Rebuild parity instead of just rebuilding the data disk, then that initial failed write that disabled the disk, and any subsequent emulated writes to that disk, can't be recovered, since parity would be in sync with the current contents of the disk and no longer have the data for the missed writes. And it is possible that missed writes would contain filesystem metadata, so there is some slight chance that the current contents of the disk are missing those and the filesystem might have corruption unless you rebuild those missed writes.
27 minutes ago, trurl said: Rebuilding the disk to recover the missed writes is the normal thing to do.
January 10, 2023 Author
Is there a way to determine the exact model number of my SAS controllers? I have an idea but don't want to be guessing. I see these two lines for cm0 and cm1 -- so it appears I have 2 SAS controllers but not sure how to ensure I select the right firmware based on LSISAS2008 and LSISAS2116
mpt2sas_cm0: LSISAS2008: FWVersion(19.00.00.00), ChipRevision(0x03), BiosVersion(07.24.01.00)
mpt2sas_cm1: LSISAS2116: FWVersion(17.00.01.00), ChipRevision(0x02), BiosVersion(07.24.01.00)
Edited January 10, 2023 by srfnmnk
January 10, 2023 Author
2 minutes ago, trurl said: Not entirely sure what you mean. New Config is going to rebuild parity (by default) based on the current contents of all the data disks in the array. So it is rebuilding parity. Maybe you mean something else when you say "parity rebuild" here. Do you actually mean rebuilding a data disk from the parity calculation?
I understand you now. I meant if one of my disks were to fail during the parity rebuild the contents of the failed disk would be lost. Based on what you said, it sounds like that's accurate.
January 10, 2023 Community Expert
5 minutes ago, srfnmnk said: I understand you now. I meant if one of my disks were to fail during the parity rebuild the contents of the failed disk would be lost. Based on what you said, it sounds like that's accurate.
Technically, if you only have single parity, rebuilding a data disk or rebuilding parity is really the same thing as far as the other disks are concerned. While one disk is disabled/invalid/rebuilding, the others are unprotected.
There are ways to try to rebuild a truly dead disk based on the contents of a disabled disk that can still be read, though. How successful that will be depends on how out-of-sync the disabled disk is.
January 10, 2023 Author
3 minutes ago, trurl said: if you only have single parity
I have double parity.
January 10, 2023 Author
29 minutes ago, srfnmnk said: Is there a way to determine the exact model number of my SAS controllers? I have an idea but don't want to be guessing. I see these two lines for cm0 and cm1 -- so it appears I have 2 SAS controllers but not sure how to ensure I select the right firmware based on LSISAS2008 and LSISAS2116
mpt2sas_cm0: LSISAS2008: FWVersion(19.00.00.00), ChipRevision(0x03), BiosVersion(07.24.01.00)
mpt2sas_cm1: LSISAS2116: FWVersion(17.00.01.00), ChipRevision(0x02), BiosVersion(07.24.01.00)
This is the last question I have for now -- if I can confirm my model numbers / firmware version (without ripping into the case) that would be fantastic. Any ideas on how to do that? I was able to use sas2ircu to list the two devices.
Edited January 10, 2023 by srfnmnk
January 10, 2023 Community Expert
39 minutes ago, srfnmnk said: Gotcha -- are you aware of firmware updates that can resolve power bumps?
It can resolve a disconnect, which is what happened; it could have been power/connection related, but it could also have had other causes.
37 minutes ago, srfnmnk said: LSISAS2008
Look for 9211-8i
37 minutes ago, srfnmnk said: LSISAS2116
Look for 9201-16i
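The driver lines quoted earlier in the thread already carry the chip name and firmware version, so they can be pulled out of the syslog without opening the case. Below is a minimal sketch, assuming the line format matches the two `mpt2sas_cm` lines posted above; `CHIP_TO_CARD` holds only the two search suggestions from this reply (other cards are built on the same chips, so it is a hint rather than an identification), and `describe_controllers` is a hypothetical name.

```python
import re

# Pattern modeled on the two driver lines posted in this thread, e.g.
#   mpt2sas_cm0: LSISAS2008: FWVersion(19.00.00.00), ChipRevision(0x03), ...
FW_LINE = re.compile(
    r"(mpt\dsas_cm\d+): (LSISAS\d+): FWVersion\(([\d.]+)\)"
)

# Search hints taken from the reply above; not an exhaustive or authoritative map.
CHIP_TO_CARD = {
    "LSISAS2008": "9211-8i",
    "LSISAS2116": "9201-16i",
}


def describe_controllers(text):
    """Return (driver instance, chip, firmware version, suggested card) tuples."""
    return [
        (instance, chip, firmware, CHIP_TO_CARD.get(chip, "unknown"))
        for instance, chip, firmware in FW_LINE.findall(text)
    ]
```

Feeding it the syslog text would give one tuple per controller, which can then be compared by hand against the latest firmware listed for the matching card.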
October 28, 2024
Hi, help on the below is appreciated:
1. Array reported Health FAIL. I identified it was due to 10 read errors on one drive (rebooted since, so not sending diagnostics).
2. Did an extended check on the drive (attached). I know the check is not comprehensive and is no guarantee, but I didn't spot anything that seems to be a concern.
So, could 10 read errors be just "bad luck", or should I consider getting a new drive my most urgent task? Is it something to "just watch" and act on if read errors happen again?
Regards, Nico
PS: Thank you!
WDC_WD40EFRX-68N32N0_WD-WCC7K6RRJVCE-20241028-1654.txt
October 28, 2024 Community Expert
It's logged as a disk error in SMART. It would be good to see the syslog to check whether it was the same there, but since it passed the extended test, give it another chance; if there are any more errors, I would replace it.
October 28, 2024
Thx, I will watch it and yes, my data is too valuable not to buy a new drive if needed. Funny enough, I received this notification literally the minute after I posted:
Tower: Notice [TOWER] - array turned good 1730131201 Array has 0 disks with read errors
Is that "a good sign"?
October 28, 2024 Community Expert
That's not really relevant; it just means the errors were cleared, likely due to the array being restarted.