interwebtech Posted March 1, 2019
Not sure if this is a beta issue or more general, so I will err on the side of it not being related to the beta. Last night around 12:30, the monthly parity check launched. I immediately got an email warning that the array had errors:
Event: Unraid array errors
Subject: Warning [TOWER] - array has errors
Description: Array has 3 disks with read errors
Importance: warning
Disk 8 - ST8000AS0002-1NA17Z_Z8406M0L (sdo) (errors 128)
Disk 9 - ST4000DM000-1F2168_Z3024WY8 (sdp) (errors 128)
Disk 10 - ST4000DM000-1F2168_Z3024WMZ (sdq) (errors 128)
The parity check continued and logged via the GUI that it made corrections to fix 128 errors on each of the 3 disks (all next to each other). Diagnostics attached. tower-diagnostics-20190301-1944.zip
JorgeB Posted March 1, 2019
Mar 1 00:33:02 Tower kernel: sd 7:0:13:0: timing out command, waited 180s
...
Mar 1 00:33:02 Tower kernel: sd 7:0:14:0: timing out command, waited 180s
...
Mar 1 00:33:02 Tower kernel: sd 7:0:15:0: timing out command, waited 180s
All 3 disks timed out at the same time. Do they have anything in common besides the controller? If so, such as a power or miniSAS cable, you'll want to look at that.
3 hours ago, interwebtech said: Parity check continued and logged via GUI that it made corrections to fix 128 errors to each
Nothing was corrected, luckily for you, since the errors on the disks all occurred at the same time. A scheduled parity check should always be non-correcting, because in some cases parity can be wrongly updated if there are errors on a disk.
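The "all timed out at the same time" observation above can be checked mechanically. The following is only a minimal sketch, assuming syslog lines shaped like the three quoted in this post; the function name `group_timeouts` and the regular expression are illustrative and not part of Unraid or its diagnostics tooling.

```python
import re
from collections import defaultdict

# Matches kernel lines shaped like the ones quoted above, e.g.
# "Mar 1 00:33:02 Tower kernel: sd 7:0:13:0: timing out command, waited 180s"
# (the exact syslog layout is an assumption based on those three lines).
TIMEOUT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"sd\s+(?P<addr>\d+:\d+:\d+:\d+):\s+timing out command"
)


def group_timeouts(lines):
    """Map each timestamp to the set of SCSI addresses that timed out then."""
    groups = defaultdict(set)
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if match:
            groups[match.group("stamp")].add(match.group("addr"))
    return dict(groups)


sample = [
    "Mar 1 00:33:02 Tower kernel: sd 7:0:13:0: timing out command, waited 180s",
    "Mar 1 00:33:02 Tower kernel: sd 7:0:14:0: timing out command, waited 180s",
    "Mar 1 00:33:02 Tower kernel: sd 7:0:15:0: timing out command, waited 180s",
]

for stamp, addrs in group_timeouts(sample).items():
    # The first field of a SCSI address is the host (controller) number.
    hosts = sorted({addr.split(":")[0] for addr in addrs})
    print(stamp, sorted(addrs), "hosts:", hosts)
```

Several devices on the same host sharing one timestamp would point toward a shared component (controller, cable, power) rather than three independent disk failures.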
trurl Posted March 2, 2019
8 hours ago, interwebtech said: made corrections to fix 128 errors to each of 3 disks
Parity check never corrects the data disks, just the parity disk.
interwebtech (Author) Posted March 5, 2019
Follow-up several days later... how do I clear the "FAIL" moniker on emails? I ran a correcting parity check after the one referenced above to verify there were no remaining errors, but the FAIL still appears on emails.
Last check completed on Mon 04 Mar 2019 08:54:16 AM PST (yesterday), finding 0 errors.
Notice [TOWER] - array health report [FAIL] (email)
JorgeB Posted March 5, 2019
51 minutes ago, interwebtech said: Notice [TOWER] - array health report [FAIL] (email)
There should be more details on why it fails; alternatively, post diags.
interwebtech (Author) Posted March 5, 2019
1 minute ago, johnnie.black said: There should be more details on why it fails, alternatively post diags.
It's complaining about those 3 disks that threw errors on spinup for the monthly parity check (OP above). Here is the full email:
Event: Unraid Status
Subject: Notice [TOWER] - array health report [FAIL]
Description: Array has 18 disks (including parity & cache)
Importance: warning
Parity - ST8000VN0002-1Z8112_ZA124ASG (sdc) - standby [OK]
Parity2 - ST8000VN0002-1Z8112_ZA12BHMW (sdd) - standby [OK]
Disk 1 - ST8000VN0022-2EL112_ZA17V13V (sdb) - standby [OK]
Disk 2 - ST6000DX000-1H217Z_Z4D04L2A (sde) - standby [OK]
Disk 3 - ST8000VN0022-2EL112_ZA17SPGS (sdf) - standby [OK]
Disk 4 - ST6000DM001-1XY17Z_Z4D23K9N (sdg) - standby [OK]
Disk 5 - ST8000AS0002-1NA17Z_Z840J4R8 (sdh) - standby [OK]
Disk 6 - HGST_HDN724040ALE640_PK1334PCKDKRPX (sdi) - standby [OK]
Disk 7 - HGST_HDN724040ALE640_PK1334PCKAX1MX (sdn) - standby [OK]
Disk 8 - ST8000AS0002-1NA17Z_Z8406M0L (sdo) - standby (disk has read errors) [NOK]
Disk 9 - ST4000DM000-1F2168_Z3024WY8 (sdp) - standby (disk has read errors) [NOK]
Disk 10 - ST4000DM000-1F2168_Z3024WMZ (sdq) - standby (disk has read errors) [NOK]
Disk 11 - ST8000VN0022-2EL112_ZA179JR6 (sdj) - standby [OK]
Disk 12 - WDC_WD80EMAZ-00WJTA0_7SJNBMRU (sdk) - standby [OK]
Disk 13 - WDC_WD80EMAZ-00WJTA0_7SJNBNVU (sdl) - standby [OK]
Disk 14 - WDC_WD80EMAZ-00WJTA0_7HJZ25AF (sdm) - standby [OK]
Cache - Samsung_SSD_970_EVO_1TB_S467NF0K603458F (nvme0n1) - active 22 C [OK]
Cache 2 - Samsung_SSD_970_EVO_1TB_S467NF0K602897J (nvme1n1) - active 23 C [OK]
Parity is valid
Last checked on Mon 04 Mar 2019 08:54:16 AM PST (yesterday), finding 0 errors.
Duration: 1 day, 30 minutes, 42 seconds. Average speed: 90.7 MB/s
I ran a 2nd parity check with corrections turned on that completed without error (see the "Last checked" line of the email). I thought that would clear the errors being reported.
Diags and Main screen cap attached. tower-diagnostics-20190305-0921.zip
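For anyone who wants to pull the failing entries out of a health report like the one in this post, here is a rough sketch. It assumes the email body has one disk entry per line in the `slot - model (device) - state [OK|NOK]` shape seen above; `failing_disks` and the pattern are hypothetical helpers, not an Unraid API.

```python
import re

# One report entry per line, shaped like the email quoted in this post, e.g.
# "Disk 8 - ST8000AS0002-1NA17Z_Z8406M0L (sdo) - standby (disk has read errors) [NOK]"
# (the layout is an assumption taken from that email).
ENTRY_RE = re.compile(
    r"^(?P<slot>.+?) - (?P<model>\S+) \((?P<dev>\w+)\) - (?P<state>.+) "
    r"\[(?P<status>OK|NOK)\]$"
)


def failing_disks(report_lines):
    """Return (slot, device, state) tuples for every entry marked [NOK]."""
    failing = []
    for line in report_lines:
        match = ENTRY_RE.match(line.strip())
        if match and match.group("status") == "NOK":
            failing.append(
                (match.group("slot"), match.group("dev"), match.group("state"))
            )
    return failing


report = [
    "Disk 7 - HGST_HDN724040ALE640_PK1334PCKAX1MX (sdn) - standby [OK]",
    "Disk 8 - ST8000AS0002-1NA17Z_Z8406M0L (sdo) - standby (disk has read errors) [NOK]",
    "Disk 9 - ST4000DM000-1F2168_Z3024WY8 (sdp) - standby (disk has read errors) [NOK]",
    "Cache - Samsung_SSD_970_EVO_1TB_S467NF0K603458F (nvme0n1) - active 22 C [OK]",
]

for slot, dev, state in failing_disks(report):
    print(f"{slot} ({dev}): {state}")
```

Lines that do not match the entry shape (the header fields, "Parity is valid", and so on) are simply skipped.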
JorgeB Posted March 5, 2019
Stop/Start the array to reset the errors.
interwebtech (Author) Posted March 5, 2019
I stopped/restarted the array but the errors are still listed on Main. Also, Fix Common Problems alerted me to the error state on the 3 disks.
Event: Fix Common Problems - Tower
Subject: Errors have been found with your server (Tower).
Description: Investigate at Settings / User Utilities / Fix Common Problems
Importance: alert
**** disk8 (ST8000AS0002-1NA17Z_Z8406M0L) has read errors ****
**** disk9 (ST4000DM000-1F2168_Z3024WY8) has read errors ****
**** disk10 (ST4000DM000-1F2168_Z3024WMZ) has read errors ****
Fresh set of diags and screenie attached. tower-diagnostics-20190305-1617.zip
Edited March 5, 2019 by interwebtech
trurl Posted March 5, 2019
I/O error counts only reset on reboot.
interwebtech (Author) Posted March 5, 2019
6 minutes ago, trurl said: I/O error counts only reset on reboot.
That did it. Thanks.
JorgeB Posted March 5, 2019
Yes, sorry, I was convinced stopping/starting the array was enough.