
Parity Check Errors, please help!



Hey everyone, I am running unRAID 4.7. My recent monthly parity check occurred on the 1st at midnight and reported 121 errors when run with NOCORRECT. I ran it once more with NOCORRECT and this time it reported 159 errors.

 

I performed short SMART tests on all hard drives and no drives report errors. However, I notice that my parity drive has 15 under "Current Pending Sector" and nothing under Offline Uncorrectable, most likely because I don't turn off my server (it only updates this parameter if the computer is off, correct? And not when the hard drive is only spun down?). This led me to think that it may be the parity drive that is at fault.

 

I was forced to do a hard shutdown about 1-2 weeks ago, which may also have contributed to some of this problem. Since my last successful parity check (a month ago), I have also replaced one of the drives (by adding a new precleared drive and rebuilding parity) and added in a new drive, but I don't think either should be a source of this problem. However, the fact that I received different errors between the two times performing the parity check is a bit of a concern to me. What exactly does this mean?

 

What are my next steps? Should I run reiserfsck on all hard drives (except parity, of course) and check to see if those are intact? I read that bad memory could also lead to this problem. How likely is that in my case? I do not have physical access to the server for a couple weeks (debugging via Hamachi right now) so I can't even run memtest at the moment.

 

I've attached the syslog since the first parity check, which shows the errors reported during the two parity checks.

 

Thanks, and happy new year :D

syslog_parity.txt

Link to comment

I have similar issues and I'm at a loss. A quick search shows that parity error questions on the forum are sometimes left untouched. I can only guess that this is because they are difficult to diagnose and troubleshoot. Perhaps we can help each other.

 

This is the closest advice I can find for a procedure:

 

http://lime-technology.com/forum/index.php?action=printpage;topic=15041.0

 

The mention of a bug in #2, "Upsizing / replacing a disk under 4.7 can cause this to occur", is most concerning since I'm running 4.7.

 

I'm currently running long SMART tests on all the drives. I have far more errors than you, and they are growing by tens of thousands each day. Not a fun start to the year.

 

 

Link to comment

Thanks for your response, ixnu! Yeah, other problems I've posted tend to get responses within a few hours (thanks to some of the awesome members here :D) but I guess this may also be because people are away for the holidays right now =\

 

I didn't increase the size of any of my HDDs, so that's unlikely to be the cause of my problem specifically. I think my problem is more likely due to an improper shutdown, though because I've also replaced a HDD and added a new drive in the same month, I'm not sure. I will run another parity check soon (probably tonight) to see whether the number of errors is increasing, and report back.

Link to comment

Performed another parity check last night and got 162 errors, up 3 from 159 the night before :(

 

Here's the parity drive's SMART report:

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    ,,,
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jan  3 16:11:53 2012 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (36180) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   166   164   021    Pre-fail  Always       -       6683
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1003
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4868
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       238
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   191   191   000    Old_age   Always       -       28542
194 Temperature_Celsius     0x0022   128   095   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       15
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      4831         -
# 2  Short offline       Completed without error       00%      4830         -
# 3  Short offline       Aborted by host               10%      4825         -
# 4  Short offline       Aborted by host               80%      4824         -
# 5  Short offline       Aborted by host               10%      4824         -
# 6  Short offline       Completed without error       00%      3237         -
# 7  Extended offline    Completed without error       00%        34         -
# 8  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Link to comment

 

The parity drive has unreadable sectors. You can try a correcting check and then a no correct check again to see if writing to the sectors fixes the problem. Otherwise replace the drive.

 

 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       15
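For anyone who wants to check several drives quickly, here is a minimal sketch that scans the attribute table of a saved SMART report and flags sector-related attributes with a non-zero raw value. The choice of attribute IDs (5, 197, 198) and the function name `flag_sector_attributes` are assumptions made for this example, not anything unRAID itself does; the sample rows are copied from the report above.

```python
# Attribute IDs watched here are an assumption of this sketch:
# 5 = Reallocated_Sector_Ct, 197 = Current_Pending_Sector,
# 198 = Offline_Uncorrectable.
WATCHED_IDS = {5, 197, 198}


def flag_sector_attributes(report):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in report.splitlines():
        fields = line.split()
        # An attribute row has 10 columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(fields) < 10 or not fields[0].isdigit() or not fields[9].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], int(fields[9])
        if attr_id in WATCHED_IDS and raw > 0:
            flagged[name] = raw
    return flagged


sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       15
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

print(flag_sector_attributes(sample))
```

Running it against a report from each drive makes it easy to see whether the parity drive is the only one with pending sectors.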

Link to comment

If the parity drive is the only one showing errors then it is the likely culprit. I'd replace the parity drive and rebuild parity. Then run pre-clear on the drive with sector issues and then check the Current_Pending_Sector and Reallocated_Sector_Ct values. If you want to try to fix this without a replacement, run initconfig to invalidate parity and then rebuild parity. Then check the Current_Pending_Sector and Reallocated_Sector_Ct values.

Link to comment

A parity check "nocorrect" that shows errors does not give you any clue to which disk is in error.
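To illustrate why a mismatch cannot point at a particular disk, here is a small sketch that assumes simple XOR parity across the drives. The block contents and the helper names (`xor_blocks`, `flip_bit`) are made up for the example. It flips one bit on each data disk in turn, then on the parity disk, and shows that the check sees the same mismatch every time.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def flip_bit(block, index=0):
    """Return a copy of block with the lowest bit of one byte flipped."""
    damaged = bytearray(block)
    damaged[index] ^= 0x01
    return bytes(damaged)


# Hypothetical contents of the same block on three data disks.
data = [b"\x10\x20", b"\x0f\xf0", b"\xaa\x55"]
parity = xor_blocks(data)

# With consistent parity, XOR over all disks plus parity is all zeros.
assert xor_blocks(data + [parity]) == b"\x00\x00"

# Damage each data disk in turn, then the parity disk, and record
# what the check sees (the XOR over everything).
syndromes = []
for i in range(len(data)):
    damaged = list(data)
    damaged[i] = flip_bit(data[i])
    syndromes.append(xor_blocks(damaged + [parity]))
syndromes.append(xor_blocks(data + [flip_bit(parity)]))

# Every case produces the same mismatch, so the check alone
# cannot say which disk holds the bad bit.
print(syndromes)
```

This is why the SMART reports, not the parity check, are the place to look for the disk at fault.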

 

A clue could be any of your disks with un-readable sectors. 

 

If you then correct parity, you will lose the contents of those sectors on the data disk.  That is NOT a good way to handle the situation.

 

As mentioned, get SMART reports from all the drives.  If only one has un-readable sectors, you can probably:

  stop the array

  un-assign that one disk

  start the array with it un-assigned

  stop the array once more

  re-assign that one disk

  start the array and let it re-construct that one disk back onto itself.  That should re-allocate the un-readable sectors and re-write their contents.

 

As long as all the other disks are readable, you should have your data intact and the un-readable sectors pending re-allocation should now be re-allocated.

 

Joe L.

Link to comment

Thanks Joe, so if the parity drive is the only one with CURRENT_PENDING_SECTORS it can be safe to assume that the parity drive's the culprit and it's safe to trust my array, and instead rebuild parity according to the current array configuration?

Actually, if it is the parity disk, I'd un-assign the parity disk, start/stop the array, re-assign it, and re-construct parity, as that will re-allocate the sectors marked for re-allocation.

Link to comment

I was just wondering, when I do the above and during the parity-sync procedure, does it matter if certain operations are writing to the array? For example, I have mover set so that it runs twice a day. Would it interfere with the parity-sync procedure or is unRAID smart enough to figure that out?

Link to comment

I removed my parity drive from the array and added it back. After parity-sync, I conducted another parity check, and it found errors within the first 5 minutes. However, the CURRENT_PENDING_SECTOR count on the parity drive is now 0.

 

I decided to reseat all the SATA cables and RAM modules. Should I do a parity-correct after this, and then another parity check afterwards?

 

One observation is that the sector that unRAID reports to have incorrect parity changes each time I run a parity check. They don't appear to be the same sectors each time. Does this imply that it is likely a RAM issue? Unfortunately I don't have time to do MemTest on the modules right now, however if this is the case I can swap a pair of older modules and try again. Since I recently replaced one of the drives, I'm also wondering if that might be an issue. I have seen other threads where changing the drive in unRAID 4.7 may cause this issue as well.

 

Thanks.

Link to comment

Random parity errors at differing blocks can be caused by almost anything (RAM, motherboard, disk controller, disks, power supply).

 

The version of unRAID does not matter... they all handle parity the same way...

 

If you have random parity errors, you have bad hardware somewhere.  It could be one disk, or memory, or bad settings for memory clock speed, memory voltage, or memory timing in the BIOS, or something else entirely.  Most of the time, no other symptoms show in the error logs.  You can only find it by process of elimination.

 

Good luck.

 

If the errors just started with the addition of new hardware, start there.

 

Joe L.

 

Link to comment

I replaced one of my data drives because it was showing some sector problems in SMART report. How can I go about debugging if the new drive has problems or not? SMART does not show problems and the old drive has been sent back to RMA already... :(

 

I saw in the post below regarding the possible problem with version 4.7:

The syslog is showing that you have several parity sync errors.  If your server had a successful parity check, and then, before any reboot, you ran another parity check and it had sync errors, this is a serious problem.  In order to provide its protection, unRaid must be able to maintain parity accurately.  If it can't, you will not be able to recover from a drive failure.

 

There are a few reasons that come to mind for parity getting out of whack:

 

1 - As already mentioned, a hard power down / restart will cause parity sync errors.  This is by far the most common cause.

 

2 - Upsizing / replacing a disk under 4.7 can cause this to occur.  A bug fix was made in a recent 5.0 beta for this.  You said you are running a recent beta, so this should not be the cause.

 

3 - Bad or misconfigured RAM.  Parity calculations can be corrupted if RAM is bad.  This can happen on the parity update (i.e., during the write) or can happen on the parity read (i.e., during the parity check).

 

4 - Bad or marginal data or power connections to the disks.  Resecuring data and power cable connections is quick and easy to do, and has been known to solve all sorts of problems.

 

5 - Underpowered or bad PSU.  Good steady power is needed to read and write data to your disks.  (Very hard to isolate a power issue without swapping out the PSU.)

 

6 - You mention that you have had some power fluctuations, but are using a UPS which has kept the server running.  This is as it should be.  But I can't rule out some power-related issue occurring during one of the power losses.

 

First thing I would do is power down, check / resecure your drive cabling, and reseat your RAM modules.  On reboot, go into the BIOS and reconfirm / correct the memory parameters.  Then boot unRAID and run a couple of very short non-correcting parity checks.  All but one of your sync errors occurred very early on the array, so within a minute (or even a few seconds) you should see them repeat (you can then stop the parity check).  Run it 10 times.  Check the syslog and compare the block #s.  They should be identical on each run.

 

Run the overnight memory test.

 

If your parity sync errors are consistent, and your memory test shows no errors after an overnight run, run the correcting parity check. 

 

Then run parity checks every night for a few days.  Use the array during the day as per normal.

 

If you don't get more parity sync errors, I'd still run the parity checks weekly for 4-6 weeks to gain confidence that the array is maintaining parity.  If parity is still being maintained after all this testing, then it begins to look like either your resecuring of the cabling fixed things or the problem had something to do with your UPS.

 

Post back on your progress.

 

Good luck!

 

Since I replaced one of my data drives and I am using v4.7, how do I know whether or not I am affected by this problem?

 

How do I go about troubleshooting the problem? Do I have to parity-sync and subsequently parity check each time? Or is there a faster method? This might take a while...

 

I will try to replace it with a spare set of RAM that I have available. Hope that changes things.

 

Also, does it matter if there are operations to the array during the parity-sync or parity-check?

 

Thanks again.

Link to comment

Hey guys,

 

I did a completely new parity sync after reseating my hardware and doing two parity checks in a row afterwards. I noticed that some of the sectors reporting parity problems are the same, but not all of them. Why is that? Is this because unRAID does not output all parity sector locations that have parity problems? How do I get a full list of the locations?

 

First parity check:

Jan 16 00:25:26 LAI_SERVER kernel: md: parity incorrect: 34151840 (Errors)
Jan 16 00:28:10 LAI_SERVER kernel: md: parity incorrect: 59197904 (Errors)
Jan 16 00:30:31 LAI_SERVER kernel: md: parity incorrect: 80629400 (Errors)
Jan 16 00:30:51 LAI_SERVER kernel: md: parity incorrect: 83648168 (Errors)
Jan 16 00:31:46 LAI_SERVER kernel: md: parity incorrect: 92013400 (Errors)
Jan 16 00:32:25 LAI_SERVER kernel: md: parity incorrect: 97962976 (Errors)
Jan 16 00:32:45 LAI_SERVER kernel: md: parity incorrect: 101035416 (Errors)
Jan 16 00:33:08 LAI_SERVER kernel: md: parity incorrect: 104461352 (Errors)
Jan 16 00:33:11 LAI_SERVER kernel: md: parity incorrect: 104967648 (Errors)
Jan 16 00:33:26 LAI_SERVER kernel: md: parity incorrect: 107211600 (Errors)
Jan 16 00:33:29 LAI_SERVER kernel: md: parity incorrect: 107769056 (Errors)
Jan 16 00:33:39 LAI_SERVER kernel: md: parity incorrect: 109175016 (Errors)
Jan 16 00:34:29 LAI_SERVER kernel: md: parity incorrect: 116807000 (Errors)
Jan 16 00:34:33 LAI_SERVER kernel: md: parity incorrect: 117482464 (Errors)
Jan 16 00:35:01 LAI_SERVER kernel: md: parity incorrect: 121635824 (Errors)
Jan 16 00:35:37 LAI_SERVER kernel: md: parity incorrect: 127168056 (Errors)
Jan 16 00:36:53 LAI_SERVER kernel: md: parity incorrect: 138736992 (Errors)
Jan 16 00:37:07 LAI_SERVER kernel: md: parity incorrect: 140833408 (Errors)
Jan 16 00:37:18 LAI_SERVER kernel: md: parity incorrect: 142469320 (Errors)
Jan 16 00:37:50 LAI_SERVER kernel: md: parity incorrect: 147381888 (Errors)

 

Second parity check:

Jan 16 16:14:47 LAI_SERVER kernel: md: parity incorrect: 53036016 (Errors)
Jan 16 16:15:06 LAI_SERVER kernel: md: parity incorrect: 55982672 (Errors)
Jan 16 16:15:28 LAI_SERVER kernel: md: parity incorrect: 59197904 (Errors)
Jan 16 16:17:18 LAI_SERVER kernel: md: parity incorrect: 75990408 (Errors)
Jan 16 16:17:19 LAI_SERVER kernel: md: parity incorrect: 76226392 (Errors)
Jan 16 16:17:48 LAI_SERVER kernel: md: parity incorrect: 80629400 (Errors)
Jan 16 16:18:08 LAI_SERVER kernel: md: parity incorrect: 83648168 (Errors)
Jan 16 16:18:28 LAI_SERVER kernel: md: parity incorrect: 86714608 (Errors)
Jan 16 16:19:04 LAI_SERVER kernel: md: parity incorrect: 92013400 (Errors)
Jan 16 16:19:21 LAI_SERVER kernel: md: parity incorrect: 94561408 (Errors)
Jan 16 16:19:43 LAI_SERVER kernel: md: parity incorrect: 97962976 (Errors)
Jan 16 16:20:01 LAI_SERVER kernel: md: parity incorrect: 100704048 (Errors)
Jan 16 16:20:03 LAI_SERVER kernel: md: parity incorrect: 101035416 (Errors)
Jan 16 16:20:26 LAI_SERVER kernel: md: parity incorrect: 104461352 (Errors)
Jan 16 16:20:29 LAI_SERVER kernel: md: parity incorrect: 104967648 (Errors)
Jan 16 16:20:51 LAI_SERVER kernel: md: parity incorrect: 108314768 (Errors)
Jan 16 16:21:12 LAI_SERVER kernel: md: parity incorrect: 111476696 (Errors)
Jan 16 16:21:47 LAI_SERVER kernel: md: parity incorrect: 116807000 (Errors)
Jan 16 16:21:51 LAI_SERVER kernel: md: parity incorrect: 117472984 (Errors)
Jan 16 16:21:51 LAI_SERVER kernel: md: parity incorrect: 117482464 (Errors)
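In case it helps anyone else compare runs, here is a rough sketch that pulls the sector numbers out of `md: parity incorrect:` syslog lines and reports which ones repeat between two checks. The function name `incorrect_sectors` and the shortened log excerpts are only for illustration; the lines are taken from the two listings above.

```python
import re

# Matches the kernel message format shown in the syslog excerpts above.
PARITY_LINE = re.compile(r"md: parity incorrect: (\d+)")


def incorrect_sectors(log_text):
    """Return the set of sector numbers reported as parity-incorrect."""
    return {int(match.group(1)) for match in PARITY_LINE.finditer(log_text)}


# Shortened excerpts from the first and second parity checks.
first_run = """\
Jan 16 00:25:26 LAI_SERVER kernel: md: parity incorrect: 34151840 (Errors)
Jan 16 00:28:10 LAI_SERVER kernel: md: parity incorrect: 59197904 (Errors)
Jan 16 00:30:31 LAI_SERVER kernel: md: parity incorrect: 80629400 (Errors)
"""
second_run = """\
Jan 16 16:14:47 LAI_SERVER kernel: md: parity incorrect: 53036016 (Errors)
Jan 16 16:15:28 LAI_SERVER kernel: md: parity incorrect: 59197904 (Errors)
Jan 16 16:17:48 LAI_SERVER kernel: md: parity incorrect: 80629400 (Errors)
"""

first = incorrect_sectors(first_run)
second = incorrect_sectors(second_run)

print("in both runs:  ", sorted(first & second))
print("only in first: ", sorted(first - second))
print("only in second:", sorted(second - first))
```

Pasting the full syslog sections into `first_run` and `second_run` gives the complete comparison. Both listings above contain exactly 20 lines, so they may not be the full set of mismatches for either run.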

 

Thanks again.

Link to comment
  • 3 weeks later...

I spent 30 hours testing my RAM using the memtest included on the unRAID server, and it seems like the RAM's fine.

 

What's the next logical thing to test for? SATA cables? I also upgraded the PSU to a higher wattage one recently, what's the possibility that it's the PSU? Thanks.

Link to comment

Archived

This topic is now archived and is closed to further replies.
