
[SOLVED] Parity Check - Large number of errors



Posted

Hey All,

 

I've been running Unraid for about a year now, but recent parity checks have returned some worrying results. I run a scheduled monthly check; for the first 8 months or so it reported 0 errors, but the last few months have been getting increasingly bad:

 

01/04/2020 = 0 Errors

01/05/2020 = 120 Errors

01/06/2020 = 1732 Errors

 

After this most recent monthly check I ran a manual check last night, which returned 3926 errors along with a SMART warning for one of the parity drives ("reported uncorrect is 1"). I would estimate that another ~40GB of data was added to the data drives between the monthly check and last night.

 

Quick background on the installed drives;

 

Data Array

  • 2 x 3TB Seagate Ironwolf (Model: ST3000VN007)
  • 3 x 3TB WD Red (older non SMR drives, model WDC_WD30EFRX)

Parity Array

  • 2 x 4TB Seagate Barracuda (Model: ST4000DM004), pulled from external USB enclosures

Controller

  • LSI 9200-8i (I suspect this to be a Chinese copy)

 

I suspect that the 4TB Barracuda drives are not up to the stress of parity duty and 24/7 operation. I was going to order 6TB Ironwolf drives but wanted to get your opinion before spending any cash.

 

Thanks

 

 

Posted

Do you have your monthly parity check set to be ‘correcting’ or not? Note that if you provide the system diagnostics zip file, we could look in there to find out. Non-correcting is the recommended setting so that you do not corrupt parity if a drive starts acting up, but if set that way no errors will get fixed until you run a correcting check, so the number would not go back to zero.
 

Have you had any ‘unclean’ shutdowns for any reason? Any time you get one of those, it’s likely that a few parity errors will show up after rebooting.

 

You DO want to get back to having zero errors, as that is the only way you can expect a rebuilt disk (if you have a drive fail) to have no file system corruption.

 

Assuming your monthly checks were non-correcting and you think all your drives are healthy then I would suggest your best course of action is:

  • run a correcting parity check - this will report the number of errors it has corrected
  • then run a non-correcting parity check to make sure the number of errors is now zero
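
To illustrate why the error count only returns to zero after a correcting pass, here is a rough toy model in Python. It assumes simple single XOR parity over small integer "blocks"; it is not Unraid's actual implementation (and does not model the second parity drive), and the names `parity_check`, `data_disks`, and `parity` are made up for this sketch.

```python
from functools import reduce
from operator import xor


def parity_check(data_disks, parity, correct):
    """Compare stored parity with the XOR of the data disks, block by block.

    Returns the number of mismatching blocks. If `correct` is True,
    each mismatching parity block is rewritten in place.
    """
    errors = 0
    for i, blocks in enumerate(zip(*data_disks)):
        expected = reduce(xor, blocks)
        if parity[i] != expected:
            errors += 1
            if correct:
                parity[i] = expected
    return errors


# Hypothetical array: three data disks with four blocks each.
data_disks = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
parity = [a ^ b ^ c for a, b, c in zip(*data_disks)]

# Simulate two parity blocks going out of sync.
parity[1] ^= 0xFF
parity[3] ^= 0x0F

first_non_correcting = parity_check(data_disks, parity, correct=False)
second_non_correcting = parity_check(data_disks, parity, correct=False)
correcting = parity_check(data_disks, parity, correct=True)
final = parity_check(data_disks, parity, correct=False)

print(first_non_correcting, second_non_correcting, correcting, final)
```

In this model, repeated non-correcting checks keep reporting the same mismatches, the correcting check reports them once more while fixing them, and only the check after that comes back clean. If errors reappear after a correcting pass, as happened in this thread, something is still writing or reading bad data.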

It might make sense, before doing this, to post your system’s diagnostics zip file (obtained via Tools->Diagnostics) so we can check whether anything appears to be amiss that you have not mentioned.

Posted

Sorry, I should have mentioned that the monthly checks were originally set to correcting, but I have now set them to non-correcting. So the monthly check found and fixed 1732 errors on the first, then 4 days later 3926 errors were found.

No unclean shutdowns that I am aware of, and the system is on a UPS with Unraid configured to shut down if runtime falls below a set time or the battery drops below a set percentage.

 

Please find diag zip attached

theblackbox-diagnostics-20200605-1653.zip

Posted

There are several timeout errors with various devices like this one:

 

Jun  1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: attempting task abort! scmd(0000000082b077c3)
Jun  1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: tag#363 CDB: opcode=0x0 00 00 00 00 00 00
Jun  1 08:16:30 TheBlackBox kernel: scsi target7:0:1: handle(0x0009), sas_address(0x4433221101000000), phy(1)
Jun  1 08:16:30 TheBlackBox kernel: scsi target7:0:1: enclosure logical id(0x500605b00544d8c0), slot(2)
Jun  1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: task abort: SUCCESS scmd(0000000082b077c3)
Jun  1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: Power-on or device reset occurred
### [PREVIOUS LINE REPEATED 1 TIMES] ###

 

Check all cables and power.
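
If you want to see which devices are affected and how often, a small script can tally these events per SCSI address. This is only a sketch: it assumes the syslog lines look like the ones quoted above, and the names `count_events`, `EVENT_PATTERNS`, and `DEVICE_RE` are invented for the example.

```python
import re
from collections import Counter

# Message fragments taken from the log excerpt quoted above.
EVENT_PATTERNS = {
    "task_abort": re.compile(r"attempting task abort"),
    "device_reset": re.compile(r"Power-on or device reset occurred"),
}
# Matches the SCSI address in lines such as "kernel: sd 7:0:1:0: ...".
DEVICE_RE = re.compile(r"kernel: sd (\d+:\d+:\d+:\d+):")


def count_events(lines):
    """Count task aborts and device resets per SCSI address."""
    counts = Counter()
    for line in lines:
        device = DEVICE_RE.search(line)
        if device is None:
            continue
        for name, pattern in EVENT_PATTERNS.items():
            if pattern.search(line):
                counts[(device.group(1), name)] += 1
    return counts


sample = [
    "Jun  1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: attempting task abort! scmd(0000000082b077c3)",
    "Jun  1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: tag#363 CDB: opcode=0x0 00 00 00 00 00 00",
    "Jun  1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: task abort: SUCCESS scmd(0000000082b077c3)",
    "Jun  1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: Power-on or device reset occurred",
]

counts = count_events(sample)
for (device, event), n in sorted(counts.items()):
    print(device, event, n)
```

To run it over a real log, pass the lines of the syslog file from the diagnostics zip instead of `sample`. Note that "PREVIOUS LINE REPEATED" markers are not expanded here, so the counts are a lower bound. Events spread across several addresses on the same controller point more towards cabling, power, or the HBA than towards a single drive.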

Posted

Also a good idea to update the LSI's firmware since it's very old:

 

May 29 22:40:21 TheBlackBox kernel: mpt2sas_cm0: LSISAS2008: FWVersion(10.00.08.00), ChipRevision(0x03), BiosVersion(07.19.00.00)

 

Current one is 20.00.07.00
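
For anyone checking their own log, the comparison can be sketched in a few lines of Python. This assumes the driver prints the version in the `FWVersion(a.b.c.d)` form shown above; `parse_fw_version` is a hypothetical helper, and the "latest" value is simply the one quoted in this post.

```python
import re

# Matches the dotted version inside "FWVersion(...)".
FW_RE = re.compile(r"FWVersion\((\d+(?:\.\d+)*)\)")


def parse_fw_version(line):
    """Return the firmware version as a tuple of ints, or None if absent."""
    match = FW_RE.search(line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


line = (
    "May 29 22:40:21 TheBlackBox kernel: mpt2sas_cm0: LSISAS2008: "
    "FWVersion(10.00.08.00), ChipRevision(0x03), BiosVersion(07.19.00.00)"
)

installed = parse_fw_version(line)
latest = (20, 0, 7, 0)  # 20.00.07.00, as quoted in this thread
outdated = installed is not None and installed < latest

print(installed, outdated)
```

Comparing tuples of integers avoids the pitfalls of comparing version strings directly.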

Posted (edited)

Thanks for the help.

I've updated the HBA to the latest firmware and also checked all connections to the drives and backplane. I ran another manual check last night and parity drive 2 died in spectacular fashion with over 7000 reported bad sector errors, so I've shut down the array until I can source a temporary replacement drive locally.

Edited by Trigonal
  • JorgeB changed the title to [SOLVED] Parity Check - Large number of errors
