Trigonal Posted June 4, 2020

Hey All,

I've been running Unraid for about a year now, but recent parity checks have returned some worrying results. I run a scheduled monthly check, and for the first 8 months or so everything was fine and it reported 0 errors, but the last few months have been getting increasingly bad:

01/04/2020 = 0 Errors
01/05/2020 = 120 Errors
01/06/2020 = 1732 Errors

After this most recent monthly check I ran a manual check last night, and it returned 3926 errors and also a SMART warning for one of the parity drives ("reported uncorrect is 1"). Between the monthly check and last night, I would estimate another ~40GB of data was added to the data drives.

Quick background on the installed drives:

Data Array
- 2 x 3TB Seagate Ironwolf (Model: ST3000VN007)
- 3 x 3TB WD Red (older non-SMR drives, model WDC_WD30EFRX)

Parity Array
- 2 x 4TB Seagate Barracuda (Model: ST4000DM004), pulled from external USB enclosures

Controller
- LSI 9200-8i (I suspect this to be a Chinese copy)

I suspect that the 4TB Barracuda drives are not up to the stress of parity and 24/7 operation, and I was going to order 6TB Ironwolf drives but wanted to get your opinion before spending any cash.

Thanks
itimpi Posted June 5, 2020

Do you have your monthly parity check set to be ‘correcting’ or not? If you provide the system diagnostics zip file we could look in there to find out. Non-correcting is the recommended setting so that you do not corrupt parity if a drive starts acting up, but if set that way no errors will get fixed until you run a correcting check, so the number would not go back to zero.

Have you had any ‘unclean’ shutdowns for any reason? Any time you get one of those it’s likely that a few parity errors would show up after rebooting.

You DO want to get back to having zero errors, as that is the only way you can expect a rebuilt disk (if you have a drive fail) to have no file system corruption. Assuming your monthly checks were non-correcting and you think all your drives are healthy, then I would suggest your best course of action is:

1. Run a correcting parity check - this will report the number of errors it has corrected.
2. Then run a non-correcting parity check to make sure the number of errors is now zero.

It might make sense before doing this to post your system’s diagnostics zip file (obtained via Tools->Diagnostics) so we can check if anything appears to be amiss that you have not mentioned.
Trigonal Posted June 5, 2020

Sorry, I should have mentioned that the monthly checks were originally set to correcting, but I have now set them to non-correcting. So the monthly scan found and fixed 1732 errors on the first, then 4 days later 3926 errors were found.

No unclean shutdowns that I am aware of, and the system is on a UPS with Unraid configured to shut down if runtime falls below a set time or the battery drops below a set percentage.

Please find the diagnostics zip attached: theblackbox-diagnostics-20200605-1653.zip
JorgeB Posted June 5, 2020

There are several timeout errors with various devices, like this one:

Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: attempting task abort! scmd(0000000082b077c3)
Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: tag#363 CDB: opcode=0x0 00 00 00 00 00 00
Jun 1 08:16:30 TheBlackBox kernel: scsi target7:0:1: handle(0x0009), sas_address(0x4433221101000000), phy(1)
Jun 1 08:16:30 TheBlackBox kernel: scsi target7:0:1: enclosure logical id(0x500605b00544d8c0), slot(2)
Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: task abort: SUCCESS scmd(0000000082b077c3)
Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: Power-on or device reset occurred
### [PREVIOUS LINE REPEATED 1 TIMES] ###

Check all cables and power.
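
To see whether the aborts cluster on one device or are spread across several, the syslog can be tallied by SCSI address. The following is only a minimal sketch: the function name `count_task_aborts` and the regular expression are illustrative assumptions based on the log excerpt above, not part of Unraid or its diagnostics tooling.

```python
import re
from collections import Counter

# Matches kernel lines of the form seen in the excerpt above, e.g.
#   "... kernel: sd 7:0:1:0: attempting task abort! scmd(...)"
# The pattern is an assumption derived from that excerpt only.
ABORT_RE = re.compile(r"kernel: sd (\d+:\d+:\d+:\d+): attempting task abort!")


def count_task_aborts(lines):
    """Count 'attempting task abort!' events per SCSI address (host:channel:target:lun)."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input copied from the log excerpt in the post above.
sample = [
    "Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: attempting task abort! scmd(0000000082b077c3)",
    "Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: tag#363 CDB: opcode=0x0 00 00 00 00 00 00",
    "Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: task abort: SUCCESS scmd(0000000082b077c3)",
    "Jun 1 08:16:30 TheBlackBox kernel: sd 7:0:1:0: Power-on or device reset occurred",
]

abort_counts = count_task_aborts(sample)
for address, count in abort_counts.most_common():
    print(f"{address}: {count} task abort(s)")
```

To run it against a real log, pass an open file object for the syslog from the diagnostics zip instead of `sample`. Aborts spread over many addresses would point more toward shared cabling, power, or the controller than toward a single failing disk.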
JorgeB Posted June 5, 2020

It's also a good idea to update the LSI's firmware, since it's very old:

May 29 22:40:21 TheBlackBox kernel: mpt2sas_cm0: LSISAS2008: FWVersion(10.00.08.00), ChipRevision(0x03), BiosVersion(07.19.00.00)

The current one is 20.00.07.00.
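
The same check can be scripted when looking through a syslog. This is a rough sketch, assuming the `FWVersion(...)` field always appears in the form shown in the line above; `parse_fw_version` is a hypothetical helper name, and the "latest" value is simply the 20.00.07.00 quoted in this post.

```python
import re

# Extracts the dotted version inside "FWVersion(...)" from an mpt2sas kernel line.
# The pattern is an assumption based on the single log line quoted above.
FW_RE = re.compile(r"FWVersion\((\d+(?:\.\d+)*)\)")


def parse_fw_version(line):
    """Return the firmware version as a tuple of ints, or None if not found."""
    match = FW_RE.search(line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# Log line copied from the post above.
line = (
    "May 29 22:40:21 TheBlackBox kernel: mpt2sas_cm0: LSISAS2008: "
    "FWVersion(10.00.08.00), ChipRevision(0x03), BiosVersion(07.19.00.00)"
)

installed = parse_fw_version(line)
latest = (20, 0, 7, 0)  # 20.00.07.00, as quoted in the post

# Tuples compare element by element, so this is a plain version comparison.
outdated = installed is not None and installed < latest
print("installed:", installed, "outdated:", outdated)
```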
Trigonal Posted June 6, 2020

Thanks for the help. I've updated the HBA to the latest firmware and also checked all connections to the drives and backplane. I ran another manual check last night and parity drive 2 died in spectacular fashion, with over 7000 reported bad sector errors, so I've shut down the array until I can source a temporary replacement drive locally.
Trigonal Posted June 13, 2020

Quick update: I replaced the dodgy drive and all is good now. Thanks again for the help.
JorgeB Posted June 13, 2020

Thanks for reporting back, and if you don't mind, I'm going to tag this as solved.