interwebtech

[6.7.0-rc5] Read Disk Errors on Parity Check start


Not sure if this is a beta issue or something more general, so I will err on the side of it not being related to the beta.

 

Last night at 12:30ish, the monthly parity check launched. I immediately got an email warning that the array had errors:

Event: Unraid array errors
Subject: Warning [TOWER] - array has errors
Description: Array has 3 disks with read errors
Importance: warning

Disk 8 - ST8000AS0002-1NA17Z_Z8406M0L (sdo) (errors 128)
Disk 9 - ST4000DM000-1F2168_Z3024WY8 (sdp) (errors 128)
Disk 10 - ST4000DM000-1F2168_Z3024WMZ (sdq) (errors 128)

The parity check continued and logged via the GUI that it made corrections to fix 128 errors on each of the 3 disks (all next to each other). Diagnostics attached.

tower-diagnostics-20190301-1944.zip

Mar  1 00:33:02 Tower kernel: sd 7:0:13:0: timing out command, waited 180s
...
Mar  1 00:33:02 Tower kernel: sd 7:0:14:0: timing out command, waited 180s
...
Mar  1 00:33:02 Tower kernel: sd 7:0:15:0: timing out command, waited 180s

 

All 3 disks timed out at the same time. Do they have anything in common besides the controller? If they do, such as a power cable or miniSAS cable, you'll want to look at that.
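To make the "same time, same controller" pattern easier to spot in a longer syslog, here is a minimal Python sketch that groups the timeout lines by timestamp and SCSI host. It assumes the kernel message format shown in the excerpt above (`sd host:channel:target:lun: timing out command`); the function name `group_timeouts` and the regular expression are illustrative and not part of any Unraid tool.

```python
import re
from collections import defaultdict

# Sample lines copied from the syslog excerpt quoted in this post.
SYSLOG = """\
Mar  1 00:33:02 Tower kernel: sd 7:0:13:0: timing out command, waited 180s
Mar  1 00:33:02 Tower kernel: sd 7:0:14:0: timing out command, waited 180s
Mar  1 00:33:02 Tower kernel: sd 7:0:15:0: timing out command, waited 180s
"""

# Assumed line layout: "<Mon> <day> <HH:MM:SS> <hostname> kernel: sd H:C:T:L: timing out command"
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s\S+\skernel:\s"
    r"sd\s(?P<host>\d+):(?P<channel>\d+):(?P<target>\d+):(?P<lun>\d+):\s"
    r"timing out command"
)


def group_timeouts(text):
    """Map (timestamp, SCSI host number) -> sorted list of target ids that timed out."""
    groups = defaultdict(list)
    for line in text.splitlines():
        m = TIMEOUT_RE.match(line)
        if m:
            groups[(m["ts"], int(m["host"]))].append(int(m["target"]))
    return {key: sorted(targets) for key, targets in groups.items()}


if __name__ == "__main__":
    for (ts, host), targets in group_timeouts(SYSLOG).items():
        print(f"{ts}  host {host}: targets {targets}")
```

Several targets under one host at one timestamp points at something shared (controller, cable, backplane, power) rather than three independent disk failures.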

 

3 hours ago, interwebtech said:

Parity check continued and logged via GUI that it made corrections to fix 128 errors to each

Nothing was corrected, luckily for you, since the errors on the disks all occurred at the same time. A scheduled parity check should always be non-correcting, because in some cases parity can be wrongly updated if there are errors on a disk.

 

8 hours ago, interwebtech said:

made corrections to fix 128 errors to each of 3 disks

Parity check never corrects the data disks, just the parity disk.
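As a rough illustration of why a correcting check only ever rewrites parity, and why that can be risky when a data disk misbehaves, here is a minimal sketch using a simplified single-parity XOR model. It is not Unraid's implementation: the function names, the tiny integer "blocks", and the idea that a bad read shows up as wrong data are all assumptions made for the example.

```python
from functools import reduce
from operator import xor


def compute_parity(data_blocks):
    """XOR the same-offset block from every data disk (single-parity model)."""
    return reduce(xor, data_blocks, 0)


def parity_check(data_blocks, stored_parity, correct=False):
    """Return (mismatch_found, parity_after_check).

    The data blocks are only read, never modified. With correct=True,
    a mismatch is resolved by overwriting the stored parity.
    """
    expected = compute_parity(data_blocks)
    mismatch = expected != stored_parity
    if mismatch and correct:
        return True, expected
    return mismatch, stored_parity


if __name__ == "__main__":
    good = [0b1010, 0b0110, 0b1111]      # hypothetical blocks on three data disks
    parity = compute_parity(good)

    # Hypothetical glitch: the second disk returns wrong data for this block.
    bad_read = [0b1010, 0b0000, 0b1111]

    # Non-correcting check: the mismatch is reported, parity is left alone.
    print(parity_check(bad_read, parity, correct=False))

    # Correcting check: parity is rewritten to match the bad read, so it no
    # longer protects the real data once the disk behaves again.
    print(parity_check(bad_read, parity, correct=True))
```

In this model the data disks are never written, which matches the reply above, and a non-correcting scheduled check leaves parity intact until the cause of the read errors is understood.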


Followup several days later... how do I clear the "FAIL" moniker on emails? I ran a correcting Parity check after the one referenced above to verify there were no remaining errors but the FAIL still appears on emails. 

Last check completed on Mon 04 Mar 2019 08:54:16 AM PST (yesterday), finding 0 errors. 
Notice [TOWER] - array health report [FAIL] (email)

51 minutes ago, interwebtech said:

Notice [TOWER] - array health report [FAIL] (email)

There should be more details on why it fails; alternatively, post diags.

1 minute ago, johnnie.black said:

There should be more details on why it fails; alternatively, post diags.

It's complaining about those 3 disks that threw errors on spinup for the monthly parity check (OP above). Here is the full email:
 

Event: Unraid Status
Subject: Notice [TOWER] - array health report [FAIL]
Description: Array has 18 disks (including parity & cache)
Importance: warning

Parity - ST8000VN0002-1Z8112_ZA124ASG (sdc) - standby [OK]
Parity2 - ST8000VN0002-1Z8112_ZA12BHMW (sdd) - standby [OK]
Disk 1 - ST8000VN0022-2EL112_ZA17V13V (sdb) - standby [OK]
Disk 2 - ST6000DX000-1H217Z_Z4D04L2A (sde) - standby [OK]
Disk 3 - ST8000VN0022-2EL112_ZA17SPGS (sdf) - standby [OK]
Disk 4 - ST6000DM001-1XY17Z_Z4D23K9N (sdg) - standby [OK]
Disk 5 - ST8000AS0002-1NA17Z_Z840J4R8 (sdh) - standby [OK]
Disk 6 - HGST_HDN724040ALE640_PK1334PCKDKRPX (sdi) - standby [OK]
Disk 7 - HGST_HDN724040ALE640_PK1334PCKAX1MX (sdn) - standby [OK]
Disk 8 - ST8000AS0002-1NA17Z_Z8406M0L (sdo) - standby (disk has read errors) [NOK]
Disk 9 - ST4000DM000-1F2168_Z3024WY8 (sdp) - standby (disk has read errors) [NOK]
Disk 10 - ST4000DM000-1F2168_Z3024WMZ (sdq) - standby (disk has read errors) [NOK]
Disk 11 - ST8000VN0022-2EL112_ZA179JR6 (sdj) - standby [OK]
Disk 12 - WDC_WD80EMAZ-00WJTA0_7SJNBMRU (sdk) - standby [OK]
Disk 13 - WDC_WD80EMAZ-00WJTA0_7SJNBNVU (sdl) - standby [OK]
Disk 14 - WDC_WD80EMAZ-00WJTA0_7HJZ25AF (sdm) - standby [OK]
Cache - Samsung_SSD_970_EVO_1TB_S467NF0K603458F (nvme0n1) - active 22 C [OK]
Cache 2 - Samsung_SSD_970_EVO_1TB_S467NF0K602897J (nvme1n1) - active 23 C [OK]

Parity is valid
Last checked on Mon 04 Mar 2019 08:54:16 AM PST (yesterday), finding 0 errors.
Duration: 1 day, 30 minutes, 42 seconds. Average speed: 90.7 MB/s

I ran a 2nd parity check with corrections turned on that completed without error (see the last line of email). I thought that would clear the errors being reported. Diags and Main screen cap attached.
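For anyone who wants to pull the failing entries out of a health report like the one above without reading every line, here is a small Python sketch. It assumes the line layout seen in this email (`<slot> - <model_serial> (<device>) - <state> [OK|NOK]`); that layout may differ in other Unraid versions, and `failing_disks` is a made-up helper name.

```python
import re

# A few lines copied from the health report email above.
REPORT = """\
Disk 7 - HGST_HDN724040ALE640_PK1334PCKAX1MX (sdn) - standby [OK]
Disk 8 - ST8000AS0002-1NA17Z_Z8406M0L (sdo) - standby (disk has read errors) [NOK]
Disk 9 - ST4000DM000-1F2168_Z3024WY8 (sdp) - standby (disk has read errors) [NOK]
Disk 10 - ST4000DM000-1F2168_Z3024WMZ (sdq) - standby (disk has read errors) [NOK]
Cache - Samsung_SSD_970_EVO_1TB_S467NF0K603458F (nvme0n1) - active 22 C [OK]
"""

# Assumed layout: "<slot> - <model_serial> (<device>) - <state> [OK|NOK]"
LINE_RE = re.compile(
    r"^(?P<slot>[\w ]+?) - (?P<model>\S+) \((?P<dev>\w+)\) - "
    r"(?P<state>.+) \[(?P<status>OK|NOK)\]$"
)


def failing_disks(report):
    """Return (slot, device, state) for every line flagged [NOK]."""
    failing = []
    for line in report.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m["status"] == "NOK":
            failing.append((m["slot"], m["dev"], m["state"]))
    return failing


if __name__ == "__main__":
    for slot, dev, state in failing_disks(REPORT):
        print(f"{slot} ({dev}): {state}")
```

On the sample lines this picks out Disk 8, Disk 9 and Disk 10, the same three disks named in the original warning email.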

Screenshot_1.jpg

tower-diagnostics-20190305-0921.zip

Posted (edited)

I stopped/restarted the array but the errors are still listed on Main. Also, Fix Common Problems alerted me to the error state on the 3 disks.
 

Event: Fix Common Problems - Tower
Subject: Errors have been found with your server (Tower).
Description: Investigate at Settings / User Utilities / Fix Common Problems
Importance: alert

**** disk8 (ST8000AS0002-1NA17Z_Z8406M0L) has read errors ****
**** disk9 (ST4000DM000-1F2168_Z3024WY8) has read errors ****
**** disk10 (ST4000DM000-1F2168_Z3024WMZ) has read errors ****


Fresh set of diags and a screenshot attached.

Screenshot_2.jpg

tower-diagnostics-20190305-1617.zip

Edited by interwebtech


I/O error counts only reset on reboot.
