June 4, 2020

Hey all,

I've been running Unraid for about a year now, but recent parity checks have returned some worrying results. I run a scheduled monthly check, and for the first 8 months or so everything was fine with 0 errors, but the last few months have been getting increasingly bad:

01/04/2020 = 0 errors
01/05/2020 = 120 errors
01/06/2020 = 1732 errors

After this most recent monthly check I ran a manual check last night, which returned 3926 errors and also a SMART warning for one of the parity drives ("reported uncorrect is 1"). Between the monthly check and last night, I estimate another ~40GB of data was added to the data drives.

Quick background on the installed drives:

Data Array
2 x 3TB Seagate Ironwolf (Model: ST3000VN007)
3 x 3TB WD Red (older non-SMR drives, model WDC_WD30EFRX)

Parity Array
2 x 4TB Seagate Barracuda (Model: ST4000DM004), pulled from external USB enclosures

Controller
LSI 9200-8i (I suspect this to be a Chinese copy)

I suspect that the 4TB Barracuda drives are not up to the stress of parity and 24/7 operation, and was going to order 6TB Ironwolf drives, but wanted to get your opinion before spending any cash.

Thanks
June 5, 2020 (Community Expert)

Do you have your monthly parity check set to be 'correcting' or not? Note that if you provide the system diagnostics zip file we could look in there to find out. Non-correcting is the recommended setting so that you do not corrupt parity if a drive starts acting up, but if it is set that way no errors will get fixed until you run a correcting check, so the number would not go back to zero.

Have you had any 'unclean' shutdowns for any reason? Any time you get one of those, it's likely that a few parity errors will show up after rebooting.

You DO want to get back to having zero errors, as that is the only way you can expect a rebuilt disk (if you have a drive fail) to have no file system corruption.

Assuming your monthly checks were non-correcting and you think all your drives are healthy, I would suggest your best course of action is:

- run a correcting parity check (this will report the number of errors it has corrected)
- then run a non-correcting parity check to make sure the number of errors is now zero

It might make sense before doing this to post your system's diagnostics zip file (obtained via Tools->Diagnostics) so we can check if anything appears to be amiss that you have not mentioned.
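To illustrate why zero parity errors matters for a rebuild, here is a minimal sketch using a simplified single-parity XOR model. The helper names (`xor_blocks`, `rebuild_missing`) and the 4-byte "disks" are made up for illustration, and this is not Unraid's actual implementation (it ignores the second parity drive entirely). It only shows that a rebuilt disk is exactly as correct as the parity it is rebuilt from.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Reconstruct the one missing block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three hypothetical data "disks", four bytes each.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
good_parity = xor_blocks(data)

# Disk 1 fails; with in-sync parity the rebuild is exact.
rebuilt = rebuild_missing([data[0], data[2]], good_parity)

# Simulate a single parity error in byte 0 (assumed corruption for the demo).
bad_parity = bytes([good_parity[0] ^ 0xFF]) + good_parity[1:]
rebuilt_bad = rebuild_missing([data[0], data[2]], bad_parity)

print("rebuild with good parity matches:", rebuilt == data[1])
print("rebuild with bad parity matches: ", rebuilt_bad == data[1])
```

In this toy model, every byte of out-of-sync parity turns into a wrong byte on the rebuilt disk, which is why a non-correcting check should come back at zero before you rely on parity.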
June 5, 2020 (Author)

Sorry, I should have mentioned that the monthly checks were originally set to correcting, but I have now set them to non-correcting. So the monthly scan found and fixed 1732 errors on the 1st, then 4 days later 3926 errors were found.

No unclean shutdowns that I am aware of, and the system is on a UPS with Unraid configured to shut down if runtime falls below a set time or the battery drops below a set percentage.

Please find the diagnostics zip attached:

theblackbox-diagnostics-20200605-1653.zip
June 5, 2020 (Community Expert)

There are several timeout errors with various devices, like this one:

Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: attempting task abort! scmd(0000000082b077c3)
Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: tag#363 CDB: opcode=0x0 00 00 00 00 00 00
Jun 1 08:16:30 TheBlackBox kernel: scsi target7:0:1: handle(0x0009), sas_address(0x4433221101000000), phy(1)
Jun 1 08:16:30 TheBlackBox kernel: scsi target7:0:1: enclosure logical id(0x500605b00544d8c0), slot(2)
Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: task abort: SUCCESS scmd(0000000082b077c3)
Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: Power-on or device reset occurred
### [PREVIOUS LINE REPEATED 1 TIMES] ###

Check all cables and power.
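If you want to see which devices are affected before reseating cables, a rough sketch like the following could tally the "attempting task abort" lines per SCSI address. The function name `count_task_aborts` and the regular expression are assumptions written against the excerpt above, not a tool that ships with Unraid; the sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: sd 7:0:1:0: attempting task abort! scmd(...)"
# The pattern is an assumption based on the log excerpt in this thread.
ABORT_RE = re.compile(r"kernel: sd (\d+:\d+:\d+:\d+): attempting task abort")


def count_task_aborts(lines):
    """Return a Counter mapping SCSI address (host:channel:id:lun) to abort count."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


SAMPLE = [
    "Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: attempting task abort! scmd(0000000082b077c3)",
    "Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: tag#363 CDB: opcode=0x0 00 00 00 00 00 00",
    "Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: task abort: SUCCESS scmd(0000000082b077c3)",
    "Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: Power-on or device reset occurred",
]

for address, count in count_task_aborts(SAMPLE).items():
    print(f"sd {address}: {count} task abort(s)")
```

To run it over a real log, pass an open file object instead of `SAMPLE`. Aborts spread across many addresses tend to point at something shared (controller, backplane, power), while aborts on a single address point more at that drive or its cable.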
June 5, 2020 (Community Expert)

It's also a good idea to update the LSI's firmware, since it's very old:

May 29 22:40:21 TheBlackBox kernel: mpt2sas_cm0: LSISAS2008: FWVersion(10.00.08.00), ChipRevision(0x03), BiosVersion(07.19.00.00)

The current one is 20.00.07.00.
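As a small sketch of the comparison being made here, the snippet below pulls the `FWVersion(...)` field out of that boot line and compares it with the 20.00.07.00 release mentioned in this post. The helper names are hypothetical, and it assumes the version is always dot-separated decimal fields, as in the line quoted above.

```python
import re

# Assumed format: FWVersion(NN.NN.NN.NN), as seen in the boot line in this thread.
FW_RE = re.compile(r"FWVersion\((\d+(?:\.\d+)*)\)")


def version_tuple(text):
    """Turn '20.00.07.00' into (20, 0, 7, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def parse_fw_version(line):
    """Extract the firmware version from an mpt2sas boot line, or None if absent."""
    match = FW_RE.search(line)
    return version_tuple(match.group(1)) if match else None


BOOT_LINE = (
    "May 29 22:40:21 TheBlackBox kernel: mpt2sas_cm0: LSISAS2008: "
    "FWVersion(10.00.08.00), ChipRevision(0x03), BiosVersion(07.19.00.00)"
)
LATEST = version_tuple("20.00.07.00")  # version quoted in this thread

installed = parse_fw_version(BOOT_LINE)
if installed is not None and installed < LATEST:
    print("HBA firmware is older than the version mentioned in the thread")
```

Comparing tuples of integers avoids the pitfalls of comparing the version strings directly.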
June 6, 2020 (Author)

Thanks for the help.

I've updated the HBA to the latest firmware and also checked all connections to the drives and backplane. I ran another manual check last night and parity drive 2 died in spectacular fashion, with over 7000 reported bad sector errors, so I've shut down the array until I can source a temporary replacement drive locally.

Edited June 6, 2020 by Trigonal
June 13, 2020 (Author)

Quick update: I replaced the dodgy drive and all is good now. Thanks again for the help.
June 13, 2020 (Community Expert)

Thanks for reporting back. If you don't mind, I'm going to tag this as solved.