about parity errors

fr05ty · August 2, 2019

Last week both of my parity drives went red. The first one died, so I replaced it and did a rebuild; then the second died, and I did another rebuild. Three days later the scheduled parity check ran and reported over 100K errors fixed. I stopped the server and ran a quick memtest (~8 hr, 1 pass), which found no issues. I am currently running a third parity sync; it is at 50% with 68K errors corrected. I have ordered new SFF-8087 to 4x SATA cables to rule out the cables, but they won't arrive until Wednesday.

What should I be looking for in the logs?

After the 1st re-sync and before the 2nd HDD crash (iceberg-diagnostics-20190727-0936.zip); this is where the drive died:

Jul 27 17:13:28 iceberg kernel: md: sync done. time=87060sec
Jul 27 17:13:28 iceberg kernel: md: recovery thread: exit status: 0
Jul 27 19:29:47 iceberg kernel: sd 7:0:0:0: attempting task abort! scmd(000000001dc9c3cc)
Jul 27 19:29:47 iceberg kernel: sd 7:0:0:0: [sdd] tag#0 CDB: opcode=0x85 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
Jul 27 19:29:47 iceberg kernel: scsi target7:0:0: handle(0x000a), sas_address(0x5001e67467de7fec), phy(12)
Jul 27 19:29:47 iceberg kernel: scsi target7:0:0: enclosure logical id(0x5001e67467de7fff), slot(12) 
Jul 27 19:29:47 iceberg kernel: sd 7:0:0:0: device_block, handle(0x000a)
Jul 27 19:29:49 iceberg kernel: sd 7:0:0:0: device_unblock and setting to running, handle(0x000a)
Jul 27 19:29:49 iceberg kernel: sd 7:0:0:0: [sdd] Synchronizing SCSI cache
Jul 27 19:29:49 iceberg rc.diskinfo[7620]: SIGHUP received, forcing refresh of disks info.
Jul 27 19:29:49 iceberg rc.diskinfo[7620]: SIGHUP ignored - already refreshing disk info.
Jul 27 19:29:51 iceberg kernel: sd 7:0:0:0: task abort: SUCCESS scmd(000000001dc9c3cc)
Jul 27 19:29:51 iceberg kernel: md: disk29 read error, sector=4916493968
Jul 27 19:29:51 iceberg kernel: md: disk29 read error, sector=4916493976

after scheduled parity check iceberg-diagnostics-20190801-2003.zip
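On the question of what to look for in the logs: a rough sketch of how the syslog from a diagnostics zip could be scanned for the two kinds of lines shown in the excerpt above (`md: diskN read error` and `attempting task abort`). The regular expressions are derived only from those pasted lines, and the function name `summarize_syslog` is made up for illustration; other controllers or kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns copied from the syslog excerpt in this thread (assumption: the
# wording is the same elsewhere in the log; adjust if your lines differ).
READ_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")
TASK_ABORT = re.compile(r"sd (\d+:\d+:\d+:\d+): attempting task abort")


def summarize_syslog(lines):
    """Count md read errors per disk and task aborts per SCSI address."""
    read_errors = Counter()
    task_aborts = Counter()
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            read_errors[match.group(1)] += 1
            continue
        match = TASK_ABORT.search(line)
        if match:
            task_aborts[match.group(1)] += 1
    return read_errors, task_aborts
```

To use it, open the syslog text file extracted from the diagnostics zip and pass the file object to `summarize_syslog`. A disk or SCSI address that shows up repeatedly is the place to start looking at the drive, cable, or controller port.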

JorgeB · August 3, 2019

The last check was non-correcting. You need to run a correcting check first, then a non-correcting one to confirm there are no more errors.

fr05ty · August 5, 2019

Aug 5 11:31:54 iceberg kernel: md: sync done. time=109225sec

Aug 5 11:31:54 iceberg kernel: md: recovery thread: exit status: 0

14 TB drives suck: 30 hrs for the scrub, but they were on sale for a good price at the time.
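As a quick sanity check on the 30-hour figure, the `time=109225sec` value from the `md: sync done` line can be converted to hours and to an average throughput. This is only a sketch: the helper names are made up, and it assumes the drive is a decimal 14 TB (14 × 10^12 bytes) and that the check read the whole drive.

```python
def sync_duration_hours(seconds):
    """Convert the 'time=...sec' value from an 'md: sync done' line to hours."""
    return seconds / 3600


def average_speed_mb_s(capacity_tb, seconds):
    """Average throughput in MB/s, assuming decimal TB and a full-drive pass."""
    return capacity_tb * 1e12 / seconds / 1e6


# Values from the Aug 5 log line; the 14 TB capacity is from the post.
hours = sync_duration_hours(109225)
speed = average_speed_mb_s(14, 109225)
print(f"{hours:.1f} h, {speed:.0f} MB/s average")
```

That works out to a little over 30 hours at roughly 128 MB/s on average, so the duration looks consistent with the drive size rather than with a stalled check.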

So I finished the correcting check yesterday and the non-correcting check today. The first pass fixed over 101K errors and the N.C. check found 0. So why would I have had so many errors to fix? Could it be because parity disk 2 was dying after I rebuilt disk 1?

Also, should a monthly scrub be correcting or non-correcting?

JorgeB · August 5, 2019

2 hours ago, fr05ty said:

So why would I have had so many errors to fix? Could it be because parity disk 2 was dying after I rebuilt disk 1?

Possibly; it's impossible to know without diagnostics showing those issues.

2 hours ago, fr05ty said:

Also, should a monthly scrub be correcting or non-correcting?

A scheduled parity check should be non-correcting unless errors are expected, for example after an unclean shutdown.
