Recovery Thread: P corrected during parity check.



I'm seeing entries like these in my log when a parity check runs (regardless of whether it's automatic or started manually).  There are no other entries (i.e. nothing naming a disk or anything); it just starts pumping these out.  It's about 44% through the parity check.

 

 

Jul 28 21:00:01 HunterNAS root: mover started
Jul 28 21:00:01 HunterNAS root: mover finished
Jul 28 22:51:01 HunterNAS sSMTP[6377]: Creating SSL connection to host
Jul 28 22:51:01 HunterNAS sSMTP[6377]: SSL connection using ECDHE-RSA-AES128-GCM-SHA256
Jul 28 22:51:03 HunterNAS sSMTP[6377]: Sent mail for [email protected] (221 2.0.0 closing connection j15sm7871893ioo.27 - gsmtp) uid=0 username=root outbytes=762
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293616
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293624
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293632
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293640
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293648
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293656
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293664
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293672
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293680
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293688
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293696
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293704
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293712
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293720
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293728
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293736
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293744
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293752
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293760
Jul 28 23:32:48 HunterNAS kernel: md: recovery thread: P corrected, sector=4157293768

 

Does this indicate a problem with my parity drive?  How can I identify which drive is having the problem (if it's a drive problem)?

 

The SMART report for the parity drive does show a high Raw Read Error Rate.  SMART log attached.

Thanks!

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   112   100   006    Pre-fail  Always       -       43793432
  3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       551
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       13726094039

hunternas-smart-20170728-2338.zip


Those messages indicate that errors are being found that have resulted in PARITY1 being corrected.   Unfortunately it is not possible to easily determine the cause.

 

If you include the full diagnostics (Tools->Diagnostics) zip file then you may get more informed feedback about the state of your system/disks.

 

I notice that you are using a Supermicro SAS2LP-MV8 HBA.  On some systems these seem to cause problems under unRAID v6, so these messages could be a side effect of that.  The question is whether you always get such error messages during a parity check, and whether they are the same sectors each time.
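
One way to check whether the same sectors are being flagged each time is to pull the sector numbers out of the syslog after each check and compare the two lists. A rough sketch (the output paths are just examples, use whatever location suits you):

grep -o 'recovery thread: P corrected, sector=[0-9]*' /var/log/syslog | awk -F= '{print $2}' > /boot/check1-sectors.txt
# ...after the next parity check, repeat into check2-sectors.txt, then:
diff /boot/check1-sectors.txt /boot/check2-sectors.txt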


Thanks for the reply, itimpi.  Log attached.

 

I've had unRAID on this hardware since 5.x and made the transition to 6.  I never noticed these kinds of errors before, although I am not a Unix guru by any stretch and could easily overlook important factors.

I see a number of other oddities in the log:

  1. Some ACPI issues: Namespace lookup failure, AE_NOT_FOUND, and sas_scsi_recover_host busy errors
  2. Failed to read /var/lib/nfs/state
  3. Server listening on 0.0.0.0 port 22 (which I think is an FTP server message, but it works fine; I also get a proftpd warning about 32-bit capabilities)
  4. WARNING: No NSS support for mDNS detected, consider installing nss-mdns!
  5. I get an inotify watches error on disk 5, even though I've increased them (see the sketch just after this list)
  6. This recovery thread error (which does seem to happen every time the parity check runs; not the same sectors, but near each other)
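
For reference, this is roughly how I've been checking and bumping the inotify limit (generic Linux commands; where unRAID persists the setting across reboots may differ):

cat /proc/sys/fs/inotify/max_user_watches
sysctl fs.inotify.max_user_watches=524288    # takes effect immediately, but only until the next reboot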

I started having odd problems around the 6.2 to 6.3 conversion.  I try to keep the system up to date.  Odd things like the server locking up for no apparent reason (it's been worse and much more frequent since 6.3, so much so that I completely rebuilt my flash a few months ago).

 

If the SAS2LP-MV8 HBA is no longer a good choice, perhaps I should replace it?

 

Here's the system info:

Model: Custom
M/B: ASUSTeK COMPUTER INC. - P8Z77-V LK
CPU: Intel® Core™ i5-2500K CPU @ 3.30GHz
HVM: Disabled
IOMMU: Disabled
Cache: 256 kB, 1024 kB, 6144 kB
Memory: 24 GB (max. installable capacity 32 GB)
Network: eth0: 1000 Mb/s, full duplex, mtu 1500
Kernel: Linux 4.9.30-unRAID x86_64
OpenSSL: 1.0.2k

Perhaps my hardware choices have fallen out of favor with 6.3 onwards?

hunternas-diagnostics-20170729-0756.zip


First check in the log was non-correcting, so no errors were corrected; the 2nd check was correcting, so a 3rd check should return 0 errors.

 

These may be related to the SAS2LP, but probably aren't.  Any unclean shutdowns recently?

 

Having said that, Marvell controllers are not currently recommended; LSI is a much better option.

1 minute ago, jeffreywhunter said:

What has changed to make LSI the better one?

 

Both the SASLP and the SAS2LP have been the source of many issues for some users with unRAID v6, mainly dropped disks.  Other users have no issues, but AFAIK nobody has ever had any problems with the recommended LSI models, hence why they are now the recommended ones.

On 8/1/2017 at 4:32 PM, johnnie.black said:

 

Both the SASLP and the SAS2LP have been the source of many issues for some users with unRAID v6, mainly dropped disks.  Other users have no issues, but AFAIK nobody has ever had any problems with the recommended LSI models, hence why they are now the recommended ones.

 

Is it possible that this SAS2LP controller could be causing hard server lockups (i.e. an unresponsive console)?  For about the past 6 months, I've been having periodic server lockups where the only option is a hard reboot.  I'd love to capture the syslog after the failure, but I don't know of a way to do that with unRAID.  Any ideas for getting a persistent log?

 

It would be nice to have a feature like this... an OS feature to write the syslog to flash:

https://www.cisco.com/c/en/us/td/docs/ios/12_0s/feature/guide/cs_sysls.html

 

I'll post something on this topic separately...
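
In the meantime, a crude workaround might be to just copy the syslog to the flash drive every few minutes so that something survives a hard lockup. A minimal sketch (the paths and the scheduling mechanism are my own guesses, not a built-in feature):

mkdir -p /boot/logs
# run from cron (or a loop started in the 'go' file) every few minutes:
cp /var/log/syslog /boot/logs/syslog-latest.txt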

  • 3 years later...

Hi Guys 

 

I see most of this is from 2017, but I am having an issue and not sure where to start!  The parity check has always had 0 sync errors, but after today's I have "Last check completed on Monday, 24.08.2020, 22:06 (today), finding 386 errors".

Previous Scans 

 

Date                  Duration                     Speed        Status    Errors
2020-08-24, 22:06:05  1 day, 5 hr, 14 min, 17 sec  114.0 MB/s   OK        386
2020-08-23, 12:04:18  14 hr, 45 min, 4 sec         Unavailable  Canceled  492
2020-07-22, 02:50:27  1 day, 3 hr, 15 min, 50 sec  122.3 MB/s   OK        1435
2020-05-10, 16:42:41  1 day, 3 hr, 16 min, 50 sec  122.2 MB/s   OK        0
2020-05-09, 12:35:40  1 day, 2 hr, 33 min, 14 sec  125.5 MB/s   OK        0
2020-04-29, 20:27:56  20 hr, 8 sec                 111.1 MB/s   OK        0
2020-04-17, 14:16:40  1 day, 2 hr, 22 min, 54 sec  84.3 MB/s    OK        3

I have checked SMART on all the drives but can't see anything.

Is there anything I should be worried about, or anything I can do?

Cheers 

 


r720xd-diagnostics-20200824-2209.zip

9 minutes ago, leeknight1981 said:

The parity check has always had 0 sync errors

Really? What about these?

10 minutes ago, leeknight1981 said:

2020-08-23, 12:04:18  14 hr, 45 min, 4 sec         Unavailable  Canceled  492

2020-07-22, 02:50:27  1 day, 3 hr, 15 min, 50 sec  122.3 MB/s   OK        1435

...

2020-04-17, 14:16:40  1 day, 2 hr, 22 min, 54 sec  84.3 MB/s    OK        3

Exactly zero is the only acceptable result. After correcting parity errors you should run another non-correcting check to verify that you have exactly zero sync errors. If not then you have some problem that needs to be addressed.

 

Unless parity has zero sync errors you can't expect to accurately rebuild a failed or missing disk.
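
To illustrate why (a toy sketch, not unRAID's actual code): single parity is the XOR of all the data disks, and a failed disk is rebuilt as parity XOR the surviving disks, so any position where parity is wrong rebuilds as corrupt data.

d1=0xA5; d2=0x3C; d3=0x0F                      # one byte from each data disk
p=$(( d1 ^ d2 ^ d3 ))                          # the corresponding parity byte
printf 'rebuilt d2 = 0x%02X (expected 0x3C)\n' $(( p ^ d1 ^ d3 ))
printf 'rebuilt d2 with one bad parity bit = 0x%02X\n' $(( (p ^ 0x01) ^ d1 ^ d3 ))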

6 minutes ago, trurl said:

Really? What about these?

Exactly zero is the only acceptable result. After correcting parity errors you should run another non-correcting check to verify that you have exactly zero sync errors. If not then you have some problem that needs to be addressed.

 

Unless parity has zero sync errors you can't expect to accurately rebuild a failed or missing disk.

Well, they have always been 0 until these last 3 or so scans; I've never had any parity issues in about 5 years, so I was wondering if it was a disk?  I have also run memtest, which had 0 errors after 6 passes.  OK, so run another parity check and untick "write corrections to parity", yeah?

5 hours ago, johnnie.black said:

Run another check without rebooting and post new diags if more sync errors are found.

OK mate, I am doing that now; it has 14:45 left and I'll post the diag.  32 found so far, no reboot, I just started a new parity check.  I can't see anything in the SMART reports; is there anything I should be looking for?  The drives are in a Dell R720XD.

 

Total size: 12 TB

Elapsed time: 14 hours, 45 minutes

Current position: 6.56 TB (54.7 %)

Estimated speed: 107.2 MB/sec

Estimated finish: 14 hours, 6 minutes

Sync errors detected: 32

 

18 hours ago, trurl said:

We already had that one. The point was to get a new one that showed the new parity check, since it already has some sync errors that we might compare to the earlier parity check.

Hiya, I have attached the new log.  The parity check stopped about 3am and I've just got home, so nothing has been done with the server other than downloading the diagnostics.

r720xd-diagnostics-20200826-0927.zip


Previous check:

 

Aug 24 05:45:33 R720XD kernel: md: recovery thread: P corrected, sector=10563312232
Aug 24 05:45:33 R720XD kernel: md: recovery thread: P corrected, sector=10563312240
Aug 24 05:45:33 R720XD kernel: md: recovery thread: P corrected, sector=10563312504
Aug 24 05:45:33 R720XD kernel: md: recovery thread: P corrected, sector=10563312512

 

Notice the jump from the block at sector ...240 to the one at ...504; since consecutive entries normally step by 8 sectors (each logged entry appears to cover an 8-sector, 4 KiB block), that jump skips all the blocks in between.

 

This last check found all blocks in between those incorrect:

 

Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312248
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312256
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312264
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312272
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312280
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312288
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312296
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312304
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312312
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312320
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312328
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312336
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312344
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312352
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312360
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312368
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312376
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312384
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312392
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312400
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312408
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312416
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312424
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312432
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312440
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312448
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312456
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312464
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312472
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312480
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312488
Aug 25 10:47:52 R720XD kernel: md: recovery thread: P incorrect, sector=10563312496

This to me rules out any RAM issue, which was already not really a suspect since you are using ECC RAM.  The next suspect for me would be the controller; any chance you can test with a different one?  Ideally one in IT mode, not RAID mode like the current one.
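
If it helps, you can confirm which mode the controller is currently in from the kernel driver it binds to: a PERC in RAID mode typically loads megaraid_sas, while an IT-mode LSI HBA shows up under mpt2sas/mpt3sas. For example (output details vary by system):

lspci -k | grep -E -iA3 'raid|lsi'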

On 8/26/2020 at 10:02 AM, johnnie.black said:

... the next suspect for me would be the controller; any chance you can test with a different one?  Ideally one in IT mode, not RAID mode like the current one.

OK, so I ran one more and still got 32; the diag is attached.  I ran it right after the last one, with no mover, no restart, etc.  So is it worth buying a new PERC H310 Mini (embedded)?  If it wasn't the card, am I just going to be replacing bits bit by bit, lol.

Many thanks for trying to help!

Screenshot 2020-08-28 at 10.05.48.png

Screenshot 2020-08-28 at 10.09.15.png

r720xd-diagnostics-20200828-1007.zip

44 minutes ago, leeknight1981 said:

OK, so I ran one more and still got 32; the diag is attached.

Those are expected since the previous check was non-correcting.

 

44 minutes ago, leeknight1981 said:

So is it worth buying a new PERC H310 Mini

Before buying one you could try flashing the existing card to IT mode; it will use a different (and better for Unraid) driver.
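
For reference, the crossflash is roughly as follows (tool and firmware names are from memory; check a current guide for the H310 Mini specifically, as the embedded cards can be fussier):

  1. Note down the card's SAS address first (e.g. sas2flash -listall).
  2. Wipe the Dell RAID firmware (commonly done with megarec from a DOS/EFI shell).
  3. Flash the LSI 9211-8i IT firmware, e.g. sas2flash -o -f 2118it.bin
  4. Restore the SAS address and reboot; the card should then load mpt2sas instead of megaraid_sas.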

