jeffreywhunter Posted July 29, 2017 (edited)

I'm seeing entries like these in my log when a parity check runs (regardless of whether it's automatic or I start it manually). There are no other entries (i.e. naming a disk or anything); it just starts pumping this out. It's about 44% of the way through the parity check.

Jul 28 21:00:01 HunterNAS root: mover started
Jul 28 21:00:01 HunterNAS root: mover finished
Jul 28 22:51:01 HunterNAS sSMTP[6377]: Creating SSL connection to host
Jul 28 22:51:01 HunterNAS sSMTP[6377]: SSL connection using ECDHE-RSA-AES128-GCM-SHA256
Jul 28 22:51:03 HunterNAS sSMTP[6377]: Sent mail for [email protected] (221 2.0.0 closing connection j15sm7871893ioo.27 - gsmtp) uid=0 username=root outbytes=762
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293616
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293624
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293632
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293640
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293648
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293656
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293664
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293672
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293680
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293688
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293696
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293704
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293712
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293720
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293728
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293736
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293744
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293752
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293760
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293768

Does this indicate a problem with my parity drive? How can I identify which drive is having the problem (if it is a drive problem)? The SMART report for the parity drive does show a large Raw_Read_Error_Rate raw value. SMART log attached. Thanks!

ID# ATTRIBUTE_NAME         FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f 112   100   006    Pre-fail Always  -           43793432
  3 Spin_Up_Time           0x0003 092   091   000    Pre-fail Always  -           0
  4 Start_Stop_Count       0x0032 100   100   020    Old_age  Always  -           551
  5 Reallocated_Sector_Ct  0x0033 100   100   010    Pre-fail Always  -           0
  7 Seek_Error_Rate        0x000f 084   060   030    Pre-fail Always  -           13726094039

hunternas-smart-20170728-2338.zip

Edited July 29, 2017 by jeffreywhunter
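For anyone wanting to pull the same table themselves, this is standard smartctl output and can be regenerated directly; a minimal sketch, with /dev/sdb as a placeholder for whatever device node the parity drive actually is. Worth knowing when reading it: on Seagate drives the Raw_Read_Error_Rate and Seek_Error_Rate raw values are packed counters that almost always look enormous, so the normalized VALUE/WORST/THRESH columns (112/100/6 above) are the ones to compare.

```bash
#!/bin/bash
# /dev/sdb is a placeholder - substitute the parity drive's device node.

# Print the vendor-specific SMART attribute table (the table quoted above).
smartctl -A /dev/sdb

# Overall health self-assessment; "PASSED" plus zero Reallocated_Sector_Ct
# and zero Current_Pending_Sector are usually more meaningful than a big
# raw error-rate number.
smartctl -H /dev/sdb
```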
itimpi Posted July 29, 2017

Those messages indicate that errors are being found that have resulted in parity1 being corrected. Unfortunately it is not possible to easily determine the cause. If you include the full diagnostics (Tools -> Diagnostics) zip file, you may get more informed feedback about the state of your system/disks.

I notice that you are using a Supermicro SAS2LP-MV8 HBA. On some systems these seem to cause problems under unRAID v6, so these messages could be a side effect of that. The question is whether you always get such error messages when doing a parity check, and whether they are at the same sectors each time.
jeffreywhunter Posted July 29, 2017

Thanks for the reply, itimpi. Log attached. I've had unRAID on this hardware since 5.x and made the transition to 6. I never noticed these kinds of errors before, although I'm not a Unix guru by any stretch and could easily overlook important factors. I see a number of other oddities in the log:

- Some ACPI issues: Namespace lookup failure, AE_NOT_FOUND, and sas_scsi_recover_host busy errors
- Failed to read /var/lib/nfs/state
- Server listening on 0.0.0.0 port 22 (which I think is an FTP server error - but it works fine - I also get a proftpd warning about 32-bit capabilities)
- WARNING: No NSS support for mDNS detected, consider installing nss-mdns!
- An inotify watches error on disk 5, even though I've increased them
- This recovery thread error (which does look to happen every time parity runs - not the same sectors, but near each other)

I started having odd problems around the 6.2 to 6.3 conversion. I try to keep the system up to date. Odd things like the server locking up for no apparent reason (it's been worse and much more frequent since 6.3 - so much so that I completely rebuilt my flash a few months ago). If the SAS2LP-MV8 HBA is no longer a good choice, perhaps I should replace it? Here's the system info:

Model: Custom
M/B: ASUSTeK COMPUTER INC. - P8Z77-V LK
CPU: Intel® Core™ i5-2500K CPU @ 3.30GHz
HVM: Disabled
IOMMU: Disabled
Cache: 256 kB, 1024 kB, 6144 kB
Memory: 24 GB (max. installable capacity 32 GB)
Network: eth0: 1000 Mb/s, full duplex, mtu 1500
Kernel: Linux 4.9.30-unRAID x86_64
OpenSSL: 1.0.2k

Perhaps my hardware choices have fallen out of favor with 6.3 onwards?

hunternas-diagnostics-20170729-0756.zip
JorgeB Posted July 29, 2017

The first check in the log was non-correcting, so no errors were corrected; the 2nd check was correcting, so a 3rd check should return 0 errors. These may be related to the SAS2LP, but probably aren't. Any unclean shutdowns recently?

Having said that, Marvell controllers are not currently recommended; LSI is a much better option.
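A quick way to tell which kind of check produced a given batch of syslog entries is to count the two message variants the md driver writes: correcting checks log "P corrected" (as in the first post above), while non-correcting checks log "P incorrect". A minimal sketch, assuming the default syslog location:

```bash
#!/bin/bash
# Correcting parity checks log "P corrected"; non-correcting checks log
# "P incorrect". The counts show which kind of check ran and how many
# parity blocks it flagged.
echo "corrected: $(grep -c 'recovery thread: P corrected' /var/log/syslog)"
echo "incorrect: $(grep -c 'recovery thread: P incorrect' /var/log/syslog)"
```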
jeffreywhunter Posted August 1, 2017

Yes, I've been having problems with the server locking up hard (even the console!). I've reinstalled everything and am still having problems. So yes, probably a dozen lockups/power cycles in the past 6 months.

When I installed the SAS2LP, it was the recommended one. What has changed to make LSI the better one?
JorgeB Posted August 1, 2017

1 minute ago, jeffreywhunter said:
What has changed to make LSI the better one?

Both the SASLP and the SAS2LP have been the source of many issues for some users with unRAID v6, mainly dropped disks. Other users have no issues, but AFAIK nobody has ever had any problems with the recommended LSI models, hence why they are now the recommended ones.
jeffreywhunter Posted August 7, 2017 (edited)

On 8/1/2017 at 4:32 PM, johnnie.black said:
Both the SASLP and the SAS2LP have been the source of many issues for some users with unRAID v6, mainly dropped disks. Other users have no issues, but AFAIK nobody has ever had any problems with the recommended LSI models, hence why they are now the recommended ones.

Is it possible that this SAS2LP controller could be causing hard server lockups (i.e. an unresponsive console)? For about the past 6 months I've been having periodic server lockups where the only option is a hard reboot. I'd love to capture the syslog after a failure, but I don't know of a way to do that with unRAID. Any ideas for getting a persistent log? It would be nice to have a feature like this one, an OS feature that writes the syslog to flash: https://www.cisco.com/c/en/us/td/docs/ios/12_0s/feature/guide/cs_sysls.html

I'll post something on this topic separately...

Edited August 7, 2017 by jeffreywhunter
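For later readers with the same question: /var/log lives in RAM on unRAID, so one manual stopgap is to periodically copy the syslog to the flash drive, where it will survive a hard lockup. A minimal sketch, assuming the stock /boot flash mount and that you launch it from the go file or a shell; note the flash-wear trade-off, and note that newer unRAID releases added a built-in syslog server under Settings, which is the cleaner option where available:

```bash
#!/bin/bash
# Periodically snapshot the in-RAM syslog to the flash drive so the tail
# of the log survives a hard lockup. /boot is the USB flash on a stock
# unRAID install; adjust if yours differs. A 5-minute interval keeps
# flash wear modest while losing at most the last few minutes of log.
mkdir -p /boot/logs
while true; do
    cp /var/log/syslog /boot/logs/syslog-persistent.txt
    sleep 300
done &
```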
JorgeB Posted August 7, 2017

Possible, yes, but not likely; the usual issues are parity sync errors and/or dropped disks.
leeknight1981 Posted August 24, 2020

Hi guys,

I see most of this is from 2017, but I'm having an issue and not sure where to start! The parity scan has always had 0 sync errors, but after today's I have "Last check completed on Monday, 24.08.2020, 22:06 (today), finding 386 errors".

Previous scans:

Date                   Duration                      Speed         Status     Errors
2020-08-24, 22:06:05   1 day, 5 hr, 14 min, 17 sec   114.0 MB/s    OK         386
2020-08-23, 12:04:18   14 hr, 45 min, 4 sec          Unavailable   Canceled   492
2020-07-22, 02:50:27   1 day, 3 hr, 15 min, 50 sec   122.3 MB/s    OK         1435
2020-05-10, 16:42:41   1 day, 3 hr, 16 min, 50 sec   122.2 MB/s    OK         0
2020-05-09, 12:35:40   1 day, 2 hr, 33 min, 14 sec   125.5 MB/s    OK         0
2020-04-29, 20:27:56   20 hr, 8 sec                  111.1 MB/s    OK         0
2020-04-17, 14:16:40   1 day, 2 hr, 22 min, 54 sec   84.3 MB/s     OK         3

I have checked SMART on all the drives but can't see anything. Is there anything I should be worried about, or anything I can do?

Cheers,
L33

r720xd-diagnostics-20200824-2209.zip
trurl Posted August 24, 2020

9 minutes ago, leeknight1981 said:
Parity Scan has always had 0 sync errors

Really? What about these?

10 minutes ago, leeknight1981 said:
2020-08-23, 12:04:18   14 hr, 45 min, 4 sec          Unavailable   Canceled   492
2020-07-22, 02:50:27   1 day, 3 hr, 15 min, 50 sec   122.3 MB/s    OK         1435
...
2020-04-17, 14:16:40   1 day, 2 hr, 22 min, 54 sec   84.3 MB/s     OK         3

Exactly zero is the only acceptable result. After correcting parity errors you should run another non-correcting check to verify that you have exactly zero sync errors. If not, then you have some problem that needs to be addressed. Unless parity has zero sync errors you can't expect to accurately rebuild a failed or missing disk.
leeknight1981 Posted August 24, 2020

6 minutes ago, trurl said:
Exactly zero is the only acceptable result. After correcting parity errors you should run another non-correcting check to verify that you have exactly zero sync errors.

Well, they had always been 0 until these last 3 or so scans. I've never had any parity issues in about 5 years, so I was wondering if it was a disk? I have also run memtest; that had 0 errors after 6 passes.

OK, so run another parity check and untick "Write corrections to parity", yeah?
JorgeB Posted August 25, 2020

Run another check without rebooting and post new diagnostics if more sync errors are found.
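If it's more convenient from a shell session, a non-correcting check can also be started with mdcmd rather than the webGui checkbox. A minimal sketch, with the caveat that this syntax is from memory and should be verified against your unRAID version:

```bash
#!/bin/bash
# Start a non-correcting parity check; equivalent to unticking
# "Write corrections to parity" before starting a check in the webGui.
/root/mdcmd check NOCORRECT

# Poll progress: mdResyncPos advances while a check runs, and sbSyncErrs
# holds the running sync-error count.
/root/mdcmd status | grep -E 'mdResync|sbSync'
```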
leeknight1981 Posted August 25, 2020

5 hours ago, johnnie.black said:
Run another check without rebooting and post new diagnostics if more sync errors are found.

OK mate, I'm doing that now; it has 14:45 left and I'll post the diagnostics. 32 found so far, and no reboot - I just started a new parity scan. I can't see anything in the SMART reports; is there anything I should be looking for? The drives are in a Dell R720XD.

Total size: 12 TB
Elapsed time: 14 hours, 45 minutes
Current position: 6.56 TB (54.7 %)
Estimated speed: 107.2 MB/sec
Estimated finish: 14 hours, 6 minutes
Sync errors detected: 32
trurl Posted August 25, 2020

Since you already have sync errors, go ahead and post diagnostics. Comparing parity checks may tell us something.
leeknight1981 Posted August 25, 2020

2 minutes ago, trurl said:
Since you already have sync errors, go ahead and post diagnostics. Comparing parity checks may tell us something.

I have the diagnostics I got yesterday and will attach them below; I'll add the one I download after this scan finishes.

r720xd-diagnostics-20200824-2209.zip
trurl Posted August 25, 2020

We already had that one. The point was to get a new one showing the new parity check, since it already has some sync errors that we might compare to the earlier parity check.
leeknight1981 Posted August 26, 2020

18 hours ago, trurl said:
We already had that one. The point was to get a new one showing the new parity check, since it already has some sync errors that we might compare to the earlier parity check.

Hiya, I have attached the new log. The parity scan stopped at about 3am and I've just got home, so nothing has been done with the server other than a download of diagnostics.

r720xd-diagnostics-20200826-0927.zip
JorgeB Posted August 26, 2020

Previous check:

Aug 24 05:45:33 R720XD kernel: md: recovery thread: P corrected, sector=10563312232
Aug 24 05:45:33 R720XD kernel: md: recovery thread: P corrected, sector=10563312240
Aug 24 05:45:33 R720XD kernel: md: recovery thread: P corrected, sector=10563312504
Aug 24 05:45:33 R720XD kernel: md: recovery thread: P corrected, sector=10563312512

Notice the jump from the block ending at sector 240 to the one starting at 504. This last check found all the blocks in between those incorrect:

Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312248
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312256
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312264
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312272
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312280
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312288
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312296
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312304
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312312
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312320
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312328
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312336
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312344
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312352
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312360
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312368
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312376
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312384
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312392
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312400
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312408
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312416
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312424
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312432
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312440
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312448
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312456
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312464
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312472
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312480
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312488
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312496

To me this rules out any RAM issue, which would not really be a suspect anyway since you are using ECC RAM. The next suspect for me would be the controller; any chance you can test with a different one? Ideally one in IT mode, not RAID mode like the current one.
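For anyone wanting to reproduce this kind of comparison, the flagged sectors can be extracted from the two syslogs and diffed. A minimal sketch, assuming the syslog from each diagnostics zip has been extracted to syslog.old and syslog.new (hypothetical filenames):

```bash
#!/bin/bash
# Pull the flagged parity sectors out of each syslog. syslog.old and
# syslog.new are placeholders for the logs from the two diagnostics zips.
grep -o 'P \(corrected\|incorrect\), sector=[0-9]*' syslog.old \
    | grep -o '[0-9]*$' | sort > sectors_old.txt
grep -o 'P \(corrected\|incorrect\), sector=[0-9]*' syslog.new \
    | grep -o '[0-9]*$' | sort > sectors_new.txt

# comm wants sorted input: column 1 = only in the old check,
# column 2 = only in the new check, column 3 = flagged by both.
comm sectors_old.txt sectors_new.txt
```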
leeknight1981 Posted August 28, 2020

On 8/26/2020 at 10:02 AM, johnnie.black said:
The next suspect for me would be the controller; any chance you can test with a different one? Ideally one in IT mode, not RAID mode like the current one.

OK, so I ran one more and still got 32; the diagnostics are attached. I ran it right after the last one: no mover, no restart, etc. So is it worth my buying a new PERC H310 Mini (Embedded)? If it wasn't the card, am I just going to be replacing bits bit by bit, lol? Many thanks for trying to help.

r720xd-diagnostics-20200828-1007.zip
JorgeB Posted August 28, 2020

44 minutes ago, leeknight1981 said:
OK, so I ran one more and still got 32; the diagnostics are attached.

Those are expected, since the previous check was non-correcting.

44 minutes ago, leeknight1981 said:
So is it worth my buying a new PERC H310 Mini?

Before buying one you could try flashing the one you have to IT mode; it will use a different (and better for Unraid) driver.
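After flashing, one quick way to confirm the card is actually running in IT mode is to check which kernel driver bound to it: the RAID personality uses megaraid_sas, while IT mode loads the mpt2sas/mpt3sas driver. A minimal check (output format varies slightly between lspci versions):

```bash
#!/bin/bash
# List storage controllers along with the kernel driver bound to each.
# An H310 flashed to IT mode should show "Kernel driver in use: mpt3sas"
# (mpt2sas on older kernels) instead of megaraid_sas.
lspci -k | grep -A 3 -i 'sas\|raid'
```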