March 1, 2019
Not sure if this is a beta issue or something more general, so I will err on the side of it not being related to the beta. Last night around 12:30, the monthly parity check launched. I immediately got an email warning that the array had errors:

Event: Unraid array errors
Subject: Warning [TOWER] - array has errors
Description: Array has 3 disks with read errors
Importance: warning
Disk 8 - ST8000AS0002-1NA17Z_Z8406M0L (sdo) (errors 128)
Disk 9 - ST4000DM000-1F2168_Z3024WY8 (sdp) (errors 128)
Disk 10 - ST4000DM000-1F2168_Z3024WMZ (sdq) (errors 128)

The parity check continued, and the GUI logged that it made corrections to fix 128 errors on each of the 3 disks (all next to each other). Diagnostics attached.
tower-diagnostics-20190301-1944.zip
March 1, 2019
Mar 1 00:33:02 Tower kernel: sd 7:0:13:0: timing out command, waited 180s
...
Mar 1 00:33:02 Tower kernel: sd 7:0:14:0: timing out command, waited 180s
...
Mar 1 00:33:02 Tower kernel: sd 7:0:15:0: timing out command, waited 180s

All 3 disks timed out at the same time. Do they have anything in common besides the controller? If so, such as a power connection or miniSAS cable, you'll want to look at that.

3 hours ago, interwebtech said:
Parity check continued and logged via GUI that it made corrections to fix 128 errors to each

Nothing was corrected, luckily for you, since the errors on the disks all occurred at the same time. A scheduled parity check should always be non-correcting, because in some cases parity can be wrongly updated if there are errors on a disk.
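The "all at the same time" observation can be checked mechanically on a longer syslog. The following is a minimal sketch, not an Unraid tool: the regular expression and the function name `group_timeouts` are assumptions made for illustration, and the sample text is just the three kernel lines quoted in this post. It groups timeout messages by timestamp and SCSI host (the first number in the `host:channel:target:lun` address), so devices that stalled together on one controller show up in a single group.

```python
import re
from collections import defaultdict

# Sample lines copied from the post (the "..." elisions are left out).
LOG = """\
Mar 1 00:33:02 Tower kernel: sd 7:0:13:0: timing out command, waited 180s
Mar 1 00:33:02 Tower kernel: sd 7:0:14:0: timing out command, waited 180s
Mar 1 00:33:02 Tower kernel: sd 7:0:15:0: timing out command, waited 180s
"""

# Hypothetical pattern: syslog timestamp, hostname, then the SCSI address
# written as host:channel:target:lun.
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+"
    r"sd (?P<host>\d+):(?P<channel>\d+):(?P<target>\d+):(?P<lun>\d+): "
    r"timing out command"
)


def group_timeouts(text):
    """Map (timestamp, SCSI host) to the list of targets that timed out."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:
            key = (match["stamp"], match["host"])
            groups[key].append(int(match["target"]))
    return dict(groups)


timeouts = group_timeouts(LOG)
for (stamp, host), targets in timeouts.items():
    print(f"{stamp} host {host}: targets {targets}")
```

Several targets under one timestamp and one host number point toward something the disks share (controller, cable, or power) rather than three independent disk failures.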
March 2, 2019
8 hours ago, interwebtech said:
made corrections to fix 128 errors to each of 3 disks

Parity check never corrects the data disks, just the parity disk.
March 5, 2019
Author
Follow-up several days later: how do I clear the "FAIL" label on emails? I ran a correcting parity check after the one referenced above to verify there were no remaining errors, but FAIL still appears on emails.

Last check completed on Mon 04 Mar 2019 08:54:16 AM PST (yesterday), finding 0 errors.
Notice [TOWER] - array health report [FAIL] (email)
March 5, 2019
51 minutes ago, interwebtech said:
Notice [TOWER] - array health report [FAIL] (email)

There should be more details on why it fails; alternatively, post diags.
March 5, 2019
Author
1 minute ago, johnnie.black said:
There should be more details on why it fails; alternatively, post diags.

It's complaining about those 3 disks that threw errors on spin-up for the monthly parity check (see the original post above). Here is the full email:

Event: Unraid Status
Subject: Notice [TOWER] - array health report [FAIL]
Description: Array has 18 disks (including parity & cache)
Importance: warning
Parity - ST8000VN0002-1Z8112_ZA124ASG (sdc) - standby [OK]
Parity2 - ST8000VN0002-1Z8112_ZA12BHMW (sdd) - standby [OK]
Disk 1 - ST8000VN0022-2EL112_ZA17V13V (sdb) - standby [OK]
Disk 2 - ST6000DX000-1H217Z_Z4D04L2A (sde) - standby [OK]
Disk 3 - ST8000VN0022-2EL112_ZA17SPGS (sdf) - standby [OK]
Disk 4 - ST6000DM001-1XY17Z_Z4D23K9N (sdg) - standby [OK]
Disk 5 - ST8000AS0002-1NA17Z_Z840J4R8 (sdh) - standby [OK]
Disk 6 - HGST_HDN724040ALE640_PK1334PCKDKRPX (sdi) - standby [OK]
Disk 7 - HGST_HDN724040ALE640_PK1334PCKAX1MX (sdn) - standby [OK]
Disk 8 - ST8000AS0002-1NA17Z_Z8406M0L (sdo) - standby (disk has read errors) [NOK]
Disk 9 - ST4000DM000-1F2168_Z3024WY8 (sdp) - standby (disk has read errors) [NOK]
Disk 10 - ST4000DM000-1F2168_Z3024WMZ (sdq) - standby (disk has read errors) [NOK]
Disk 11 - ST8000VN0022-2EL112_ZA179JR6 (sdj) - standby [OK]
Disk 12 - WDC_WD80EMAZ-00WJTA0_7SJNBMRU (sdk) - standby [OK]
Disk 13 - WDC_WD80EMAZ-00WJTA0_7SJNBNVU (sdl) - standby [OK]
Disk 14 - WDC_WD80EMAZ-00WJTA0_7HJZ25AF (sdm) - standby [OK]
Cache - Samsung_SSD_970_EVO_1TB_S467NF0K603458F (nvme0n1) - active 22 C [OK]
Cache 2 - Samsung_SSD_970_EVO_1TB_S467NF0K602897J (nvme1n1) - active 23 C [OK]
Parity is valid
Last checked on Mon 04 Mar 2019 08:54:16 AM PST (yesterday), finding 0 errors.
Duration: 1 day, 30 minutes, 42 seconds. Average speed: 90.7 MB/s

I ran a second parity check with corrections turned on, and it completed without error (see the last lines of the email). I thought that would clear the errors being reported. Diags and a screen capture of Main are attached.
tower-diagnostics-20190305-0921.zip
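For a report this long, the entries behind the [FAIL] subject are easier to spot if the [OK]/[NOK] flags are pulled out programmatically. This is only a sketch under assumptions: the line layout is inferred from the email quoted in this post, the names `LINE` and `failing_disks` are made up for the example, and only a few of the report lines are reproduced as sample input.

```python
import re

# A few lines copied from the health report email in this post.
REPORT = """\
Disk 7 - HGST_HDN724040ALE640_PK1334PCKAX1MX (sdn) - standby [OK]
Disk 8 - ST8000AS0002-1NA17Z_Z8406M0L (sdo) - standby (disk has read errors) [NOK]
Disk 9 - ST4000DM000-1F2168_Z3024WY8 (sdp) - standby (disk has read errors) [NOK]
Disk 10 - ST4000DM000-1F2168_Z3024WMZ (sdq) - standby (disk has read errors) [NOK]
Disk 11 - ST8000VN0022-2EL112_ZA179JR6 (sdj) - standby [OK]
"""

# Assumed layout: "<slot> - <model_serial> (<device>) - <state> [OK|NOK]"
LINE = re.compile(
    r"^(?P<slot>.+?) - (?P<model>\S+) \((?P<dev>\w+)\) - "
    r"(?P<state>.*) \[(?P<status>OK|NOK)\]$"
)


def failing_disks(report):
    """Return (slot, device, state) for every line flagged [NOK]."""
    failing = []
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if match and match["status"] == "NOK":
            failing.append((match["slot"], match["dev"], match["state"]))
    return failing


nok = failing_disks(REPORT)
for slot, dev, state in nok:
    print(f"{slot} ({dev}): {state}")
```

Here the flagged entries are the same three disks from the original warning, which is consistent with the report repeating the earlier read-error counts rather than reporting new problems.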
March 5, 2019
Author
I stopped and restarted the array, but the errors are still listed on Main. Also, Fix Common Problems alerted me to the error state on the 3 disks.

Event: Fix Common Problems - Tower
Subject: Errors have been found with your server (Tower).
Description: Investigate at Settings / User Utilities / Fix Common Problems
Importance: alert
**** disk8 (ST8000AS0002-1NA17Z_Z8406M0L) has read errors ****
**** disk9 (ST4000DM000-1F2168_Z3024WY8) has read errors ****
**** disk10 (ST4000DM000-1F2168_Z3024WMZ) has read errors ****

Fresh set of diags and a screenshot attached.
tower-diagnostics-20190305-1617.zip
Edited March 5, 2019 by interwebtech
March 5, 2019
Author
6 minutes ago, trurl said:
I/O error counts only reset on reboot.

That did it. Thanks.