[6.7.0-rc5] Read Disk Errors on Parity Check start

interwebtech · March 1, 2019

Not sure if this is a beta issue or something more general, so I will err on the side of assuming it is not related to the beta.

Last night around 12:30, the monthly parity check launched. I immediately got an email warning that the array had errors:

Event: Unraid array errors
Subject: Warning [TOWER] - array has errors
Description: Array has 3 disks with read errors
Importance: warning

Disk 8 - ST8000AS0002-1NA17Z_Z8406M0L (sdo) (errors 128)
Disk 9 - ST4000DM000-1F2168_Z3024WY8 (sdp) (errors 128)
Disk 10 - ST4000DM000-1F2168_Z3024WMZ (sdq) (errors 128)

The parity check continued and logged via the GUI that it made corrections to fix 128 errors on each of the 3 disks (all next to each other). Diagnostics attached.

tower-diagnostics-20190301-1944.zip

JorgeB · March 1, 2019

Mar  1 00:33:02 Tower kernel: sd 7:0:13:0: timing out command, waited 180s
...
Mar  1 00:33:02 Tower kernel: sd 7:0:14:0: timing out command, waited 180s
...
Mar  1 00:33:02 Tower kernel: sd 7:0:15:0: timing out command, waited 180s

All 3 disks timed out at the same time. Do they have anything in common besides the controller? If so, such as a power cable or miniSAS cable, you'll want to look at that.
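
One way to check whether timeouts like these coincide is to group the kernel log lines by timestamp and compare the SCSI host number (the first field of the `host:channel:target:lun` address). The following is only a minimal sketch: the function name `group_timeouts` and the regular expression are assumptions for illustration, modeled on the three log lines quoted above and not on any Unraid tool.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Mar  1 00:33:02 Tower kernel: sd 7:0:13:0: timing out command, waited 180s
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"sd\s+(?P<hctl>\d+:\d+:\d+:\d+):\s+timing out command"
)

def group_timeouts(lines):
    """Return {timestamp: [host:channel:target:lun, ...]} for timeout lines."""
    groups = defaultdict(list)
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            groups[match.group("ts")].append(match.group("hctl"))
    return dict(groups)

sample = [
    "Mar  1 00:33:02 Tower kernel: sd 7:0:13:0: timing out command, waited 180s",
    "...",
    "Mar  1 00:33:02 Tower kernel: sd 7:0:14:0: timing out command, waited 180s",
    "...",
    "Mar  1 00:33:02 Tower kernel: sd 7:0:15:0: timing out command, waited 180s",
]

groups = group_timeouts(sample)
for ts, devices in groups.items():
    # A single shared host number means every device sits behind one controller.
    hosts = sorted({d.split(":")[0] for d in devices})
    print(ts, devices, "hosts:", hosts)
```

Several devices sharing one timestamp and one host number point toward something they have in common, rather than three independent disk failures.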

3 hours ago, interwebtech said:

Parity check continued and logged via GUI that it made corrections to fix 128 errors to each

Nothing was corrected, luckily for you, since the errors on the disks all occurred at the same time. A scheduled parity check should always be non-correcting, because in some cases parity can be wrongly updated if there are errors on a disk.

trurl · March 2, 2019

8 hours ago, interwebtech said:

made corrections to fix 128 errors to each of 3 disks

Parity check never corrects the data disks, just the parity disk.
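
To illustrate why a correcting check only ever touches parity, and why that can be risky when a data disk misreads, here is a simplified model assuming a single XOR parity block. It is a sketch only: the function names are hypothetical, it ignores the second parity disk in this array, and it is not how Unraid is actually implemented.

```python
from functools import reduce
from operator import xor

def compute_parity(data_blocks):
    """XOR all data blocks together to get the parity block."""
    return reduce(xor, data_blocks, 0)

def parity_check(data_blocks, parity, correct):
    """Compare stored parity with parity recomputed from the data blocks.

    Returns (parity_after_check, mismatch_count). The data blocks are never
    modified; a correcting check only rewrites the parity value.
    """
    expected = compute_parity(data_blocks)
    if expected == parity:
        return parity, 0
    if correct:
        return expected, 1  # parity is overwritten to match what was read
    return parity, 1        # non-correcting: report only

data = [0b1010, 0b0110, 0b1111]
parity = compute_parity(data)

# Simulate a transient bad read from the second disk during the check.
bad_read = [0b1010, 0b0000, 0b1111]

corrected_parity, errors = parity_check(bad_read, parity, correct=True)
print(corrected_parity == compute_parity(data))  # parity no longer matches the real data

kept_parity, errors_nc = parity_check(bad_read, parity, correct=False)
print(kept_parity == compute_parity(data))       # parity left intact
```

In this model, the correcting run "fixes" parity to agree with the bad read, so parity becomes wrong for the data actually on disk, while the non-correcting run only reports the mismatch.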

interwebtech · March 5, 2019

Followup several days later... how do I clear the "FAIL" moniker on emails? I ran a correcting Parity check after the one referenced above to verify there were no remaining errors but the FAIL still appears on emails.

Last check completed on Mon 04 Mar 2019 08:54:16 AM PST (yesterday), finding 0 errors.
Notice [TOWER] - array health report [FAIL] (email)

JorgeB · March 5, 2019

51 minutes ago, interwebtech said:

Notice [TOWER] - array health report [FAIL] (email)

There should be more details on why it fails; alternatively, post diags.

interwebtech · March 5, 2019

1 minute ago, johnnie.black said:

There should be more details on why it fails; alternatively, post diags.

It's complaining about those 3 disks that threw errors on spinup for the monthly parity check (OP above). Here is the full email:

Event: Unraid Status
Subject: Notice [TOWER] - array health report [FAIL]
Description: Array has 18 disks (including parity & cache)
Importance: warning

Parity - ST8000VN0002-1Z8112_ZA124ASG (sdc) - standby [OK]
Parity2 - ST8000VN0002-1Z8112_ZA12BHMW (sdd) - standby [OK]
Disk 1 - ST8000VN0022-2EL112_ZA17V13V (sdb) - standby [OK]
Disk 2 - ST6000DX000-1H217Z_Z4D04L2A (sde) - standby [OK]
Disk 3 - ST8000VN0022-2EL112_ZA17SPGS (sdf) - standby [OK]
Disk 4 - ST6000DM001-1XY17Z_Z4D23K9N (sdg) - standby [OK]
Disk 5 - ST8000AS0002-1NA17Z_Z840J4R8 (sdh) - standby [OK]
Disk 6 - HGST_HDN724040ALE640_PK1334PCKDKRPX (sdi) - standby [OK]
Disk 7 - HGST_HDN724040ALE640_PK1334PCKAX1MX (sdn) - standby [OK]
Disk 8 - ST8000AS0002-1NA17Z_Z8406M0L (sdo) - standby (disk has read errors) [NOK]
Disk 9 - ST4000DM000-1F2168_Z3024WY8 (sdp) - standby (disk has read errors) [NOK]
Disk 10 - ST4000DM000-1F2168_Z3024WMZ (sdq) - standby (disk has read errors) [NOK]
Disk 11 - ST8000VN0022-2EL112_ZA179JR6 (sdj) - standby [OK]
Disk 12 - WDC_WD80EMAZ-00WJTA0_7SJNBMRU (sdk) - standby [OK]
Disk 13 - WDC_WD80EMAZ-00WJTA0_7SJNBNVU (sdl) - standby [OK]
Disk 14 - WDC_WD80EMAZ-00WJTA0_7HJZ25AF (sdm) - standby [OK]
Cache - Samsung_SSD_970_EVO_1TB_S467NF0K603458F (nvme0n1) - active 22 C [OK]
Cache 2 - Samsung_SSD_970_EVO_1TB_S467NF0K602897J (nvme1n1) - active 23 C [OK]

Parity is valid
Last checked on Mon 04 Mar 2019 08:54:16 AM PST (yesterday), finding 0 errors.
Duration: 1 day, 30 minutes, 42 seconds. Average speed: 90.7 MB/s

I ran a 2nd parity check with corrections turned on that completed without error (see the last line of email). I thought that would clear the errors being reported. Diags and Main screen cap attached.
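
For anyone who wants to pull the failing disks out of a report like the one above without reading every line, a small parser can do it. This is a rough sketch under the assumption that each disk line follows the layout seen in this email (`slot - model (device) - state [OK|NOK]`); the function name `failing_disks` is made up for illustration.

```python
import re

# Assumed line layout, taken from the email above:
#   Disk 8 - ST8000AS0002-1NA17Z_Z8406M0L (sdo) - standby (disk has read errors) [NOK]
DISK_RE = re.compile(
    r"^(?P<slot>.+?) - (?P<model>\S+) \((?P<dev>\w+)\) - "
    r"(?P<state>.*) \[(?P<status>OK|NOK)\]$"
)

def failing_disks(report_lines):
    """Return (slot, device, state) for every line flagged [NOK]."""
    failures = []
    for line in report_lines:
        match = DISK_RE.match(line.strip())
        if match and match.group("status") == "NOK":
            failures.append(
                (match.group("slot"), match.group("dev"), match.group("state"))
            )
    return failures

report = [
    "Disk 7 - HGST_HDN724040ALE640_PK1334PCKAX1MX (sdn) - standby [OK]",
    "Disk 8 - ST8000AS0002-1NA17Z_Z8406M0L (sdo) - standby (disk has read errors) [NOK]",
    "Disk 9 - ST4000DM000-1F2168_Z3024WY8 (sdp) - standby (disk has read errors) [NOK]",
    "Cache 2 - Samsung_SSD_970_EVO_1TB_S467NF0K602897J (nvme1n1) - active 23 C [OK]",
]

for slot, dev, state in failing_disks(report):
    print(slot, dev, state)
```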

tower-diagnostics-20190305-0921.zip

JorgeB · March 5, 2019

Stop/Start the array to reset the errors.

interwebtech · March 5, 2019

I stopped/restarted the array but the errors are still listed on Main. Also, Fix Common Problems alerted me to the error state on the 3 disks.

Event: Fix Common Problems - Tower
Subject: Errors have been found with your server (Tower).
Description: Investigate at Settings / User Utilities / Fix Common Problems
Importance: alert

**** disk8 (ST8000AS0002-1NA17Z_Z8406M0L) has read errors ****   **** disk9 (ST4000DM000-1F2168_Z3024WY8) has read errors ****   **** disk10 (ST4000DM000-1F2168_Z3024WMZ) has read errors ****

Fresh set of diags and a screenshot attached.

tower-diagnostics-20190305-1617.zip

Edited March 5, 2019 by interwebtech

trurl · March 5, 2019

I/O error counts only reset on reboot.

interwebtech · March 5, 2019

6 minutes ago, trurl said:

I/O error counts only reset on reboot.

That did it. Thanks.

JorgeB · March 5, 2019

Yes, sorry, was convinced start/stopping was enough.
