Parity errors (7) from multiple NOCORRECTs - advice requested [4.7] - General Support (V5 and Older)

January 13, 2012

Getting 7 parity errors on the NOCORRECT, same blk numbers.

I have done the following:

memtest over night (10 hrs) - no errors

smartctl -d ata -tshort /dev/sd* on all the drives

Reallocated_Sector_Ct = 0 on all drives

Current_Pending_Sector = 0 on all drives

Not sure what else to look at, although some seem to have a high Seek_Error_Rate (did have loose cable earlier)

BTW - how can I pull the SMART log that I have on the USB drive while Samba and the array are down for the reiserfsck?

Example

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       63126504
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       310
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       4310015949
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       12222
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       186
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   071   063   045    Old_age   Always       -       29 (Lifetime Min/Max 28/33)
194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a   049   023   000    Old_age   Always       -       63126504
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       78069620551897
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2438286620
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3968780009

SMART Error Log Version: 1
No Errors Logged
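
For checking the same few attributes across several saved reports, something like the following might help. It is only a sketch: the function name smart_raw and the example file path are made up for illustration, and it assumes the report has the ten-column layout shown above, possibly with carriage returns at the line ends.

```shell
#!/bin/bash
# smart_raw NAME: read a saved smartctl report on stdin and print the
# RAW_VALUE column (10th field) of the attribute called NAME.
# Hypothetical helper for illustration, not part of smartmontools.
smart_raw() {
  # Strip carriage returns first so they do not end up in the value.
  tr -d '\r' | awk -v name="$1" '$2 == name { print $10 }'
}

# Example (the path is an assumption):
#   smart_raw Current_Pending_Sector < /boot/smart_sdb.txt
```

For attributes whose raw value has trailing text, such as Airflow_Temperature_Cel, this prints only the first number.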

Ran the following script on all drives at the same time. The md5 sums all matched.

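#!/bin/bash
# Usage: <script> <device, e.g. sdb> <parity error blk number>
LOG_DIR=/boot/LOGS/hashes
DeviceAddr=$1
BadParity=$2
# Start 2000000 blocks before the reported parity error
StartHere=$(( BadParity - 2000000 ))
RunTime=$(date +%F_%H%M)

for i in {1..5}
  do
    echo "Begin ${DeviceAddr} for the $i time. Bad spot = ${BadParity}  Start at ${StartHere} $(date) "
    # No bs= is given, so skip and count are in dd's default 512-byte blocks
    dd if="/dev/${DeviceAddr}" skip="${StartHere}" count=10000000 | md5sum -b >> "${LOG_DIR}/${DeviceAddr}_${RunTime}.log"
  done
exit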
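
Rather than comparing the five sums by eye, each log could be checked for a single distinct line. A minimal sketch, assuming each log holds one md5sum line per pass as written by the script above; the function name hashes_agree is hypothetical.

```shell
#!/bin/bash
# hashes_agree LOGFILE: succeed only if every line in LOGFILE is identical,
# i.e. all passes over the same region produced the same md5 sum.
# An empty or missing log counts as a failure.
hashes_agree() {
  local distinct
  distinct=$(sort -u "$1" | wc -l)
  [ "$distinct" -eq 1 ]
}

# Example:
#   for log in /boot/LOGS/hashes/sd*_*.log; do
#     hashes_agree "$log" || echo "MISMATCH in $log"
#   done
```

This only compares the passes within one drive's log. Sums from different drives are expected to differ, since the drives hold different data.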

Currently have array offline and running reiserfsck --check /dev/sd*1 on all drives EXCEPT the parity drive.

So is there anything to be concerned about in the SMART reports?

If the reiserfsck --check does not report any problems, where do I go from here?

Is there a method of determining the file(s) on the data drives for specific blk numbers? So that I might know what file(s) could potentially be impacted?

What next?

January 13, 2012

Did you change anything in the system before these errors started to occur???

Here is a testing method that will validate the hard drives and controllers:

http://lime-technology.com/wiki/index.php?title=FAQ#How_To_Troubleshoot_Recurring_Parity_Errors

Note that much of this calls for using the unRAID filesystem, which is stored in RAM. Instead of /var you may want to create directories on the flash drive. I want to edit the info one day to use the flash drive so the data isn't lost during a reboot. If you use the info, feedback or editing would be helpful.

Peter

January 13, 2012

Author

Parity check in Dec was without errors.

Earlier (week or so ago) I replaced a bad drive but data rebuild worked without issue reported.

Doing the parity checks since I plan on moving unRaid to my 4224 from my CM-590 this weekend. So want to resolve this prior to doing so.

Yes, that is what I used for my md5 checks...5 passes on each drive at same time, they all matched.

I made a driver script that executes the worker script for each drive.

Driver script md5_all <parity error blk number>

#!/bin/bash
# md5_all <parity error blk number>: start md5_sd on every drive in parallel
LOG_DIR=/boot/LOGS/hashes
BadParity=$1
RunTime=$(date +%F_%H%M)

echo "${RunTime}  BadParity=${BadParity}" >> "${LOG_DIR}/BadParity.log"

md5_sd sdb $1 & md5_sd sdc $1 & md5_sd sdd $1 & md5_sd sde $1 & md5_sd sdf $1 & md5_sd sdg $1 & md5_sd sdh $1 & md5_sd sdi $1 & md5_sd sdj $1 & md5_sd sdk $1 & md5_sd sdl $1 & md5_sd sdm $1 &

exit

"worker script" md5_sd (parms passed from md5_all script) puts the log files on the USB drive

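#!/bin/bash
# md5_sd <device, e.g. sdb> <parity error blk number>
LOG_DIR=/boot/LOGS/hashes
DeviceAddr=$1
BadParity=$2
# Start 2000000 blocks before the reported parity error
StartHere=$(( BadParity - 2000000 ))
RunTime=$(date +%F_%H%M)

for i in {1..5}
  do
    echo "Begin ${DeviceAddr} for the $i time. Bad spot = ${BadParity}  Start at ${StartHere} $(date) "
    # No bs= is given, so skip and count are in dd's default 512-byte blocks
    dd if="/dev/${DeviceAddr}" skip="${StartHere}" count=10000000 | md5sum -b >> "${LOG_DIR}/${DeviceAddr}_${RunTime}.log"
  done
exit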

If you see other ways to improve it I'd be interested.

When I get a little more time, I'm thinking of passing md5_all not only the blk #, but also the beginning/ending sd* letters, with a loop that kicks off md5_sd for each sd* in that range.

I just need to find an example online of doing a for loop over the alphabet, and of how to handle looping for a given # of iterations.

ie.

January 13, 2012

Just need to find an example online of doing a for loop using the alphabet and how to handle the looping for # iterations

ie.

for iteration in {1..10}
do
  for alpha in {a..z}
  do
    echo $iteration $alpha
  done
done

It might be easier to simply use

for disk in /dev/[hs]d[a-z]
do
  echo $disk
done
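
If the first and last drive letters are to be passed in as parameters, note that bash performs brace expansion before variable expansion, so {$first..$last} does not expand to a range. One possible workaround is sketched below; the helper name letters_between is made up for illustration.

```shell
#!/bin/bash
# letters_between FIRST LAST: print each lowercase letter from FIRST to LAST,
# one per line. Hypothetical helper for illustration.
letters_between() {
  local first=$1 last=$2 letter
  for letter in {a..z}; do
    # Skip letters outside the requested range (string comparison).
    [[ $letter < $first || $letter > $last ]] && continue
    echo "$letter"
  done
}

# Example: run something once per drive from sdb through sdm
#   for l in $(letters_between b m); do echo "would test sd$l"; done
```

Looping over all 26 letters and filtering is a little wasteful, but it avoids eval and keeps the range check easy to read.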

January 13, 2012

Author

reiserfsck --check results

No corruption on any drive.

2 drives have Data Block pointers that are Zero (0)....Is this a concern??

So. Any further actions on my part?

Is it ok/recommended at this point to do a Parity CORRECT?

###########
reiserfsck --check started at Fri Jan 13 09:23:43 2012
###########
Replaying journal: Done.
Reiserfs journal '/dev/sdc1' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 426773
        Internal nodes 2739
        Directories 1072
        Other files 14412
        Data block pointers 428380324 (699301 of them are zero)
        Safe links 0
###########
reiserfsck finished at Fri Jan 13 11:09:56 2012
###########

###########
reiserfsck --check started at Fri Jan 13 09:28:00 2012
###########
Replaying journal: Done.
Reiserfs journal '/dev/sdl1' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 479894
        Internal nodes 3052
        Directories 438
        Other files 5581
        Data block pointers 484698420 (15097 of them are zero)
        Safe links 0
###########
reiserfsck finished at Fri Jan 13 11:03:31 2012
###########

January 13, 2012

Author

smart reports mentioned in initial post

smart_all_2012-01-13_0848.txt

January 13, 2012

I guess I should read the post more closely next time. I've got to learn scripting better. Does the BadParity value come from a user prompt? You're asking about a loop to replace the long line in the md5_all script, right? I think I understand what Joe is recommending, and that could be used to create 2 scripts that just prompt for the bad parity location and then automatically run the test on all drives. That's good stuff, we've got to get that in the Wiki for future use.

unRAID 4.7 has a bug that can corrupt data during a rebuild. Usually it appears as 2 or 3 parity errors right near the start of the disk. Often, the errors will appear within a few seconds of starting the parity check. If they exist, the problem files will be on the rebuilt disk. At this point, everything is looking OK so I would run a correcting check and ensure the parity errors go away. You can even just run the correct long enough to fix the problems and then cancel it and run a nocorrect again.

Peter

January 13, 2012

Author

Yep, this is 4.7...should have put that in the Subj line

These are the blocks in question, so not near the beginning

Jan 13 00:37:31 Tower kernel: md: parity incorrect: 1196411456 (Errors)
Jan 13 00:41:35 Tower kernel: md: parity incorrect: 1234449664 (Errors)
Jan 13 00:41:35 Tower kernel: md: parity incorrect: 1234449672 (Errors)
Jan 13 00:41:35 Tower kernel: md: parity incorrect: 1234449680 (Errors)
Jan 13 08:05:23 Tower kernel: md: parity incorrect: 3902847360 (Errors)
Jan 13 08:05:23 Tower kernel: md: parity incorrect: 3902847368 (Errors)
Jan 13 08:05:23 Tower kernel: md: parity incorrect: 3902847376 (Errors)
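
To collect the block numbers across several NOCORRECT runs, the syslog can be filtered for these lines. A small sketch, assuming the message format shown above; the function name parity_blocks and the syslog path in the example are assumptions.

```shell
#!/bin/bash
# parity_blocks: read syslog text on stdin and print the block number from
# each "md: parity incorrect: N" line, one per line.
parity_blocks() {
  sed -n 's/.*md: parity incorrect: \([0-9][0-9]*\).*/\1/p'
}

# Example: count how often each block was reported
#   parity_blocks < /var/log/syslog | sort -n | uniq -c
```

A count equal to the number of checks run for every block would match what was seen here: the same blk numbers on each NOCORRECT.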

The BadParity is not a prompt, but rather a parm when you execute the script.

For example:

md5_all 3902847360

That would do md5 sums on blks 3900847360 ( 3902847360 - 2000000 ) thru 3910847360 ( 3902847360 - 2000000 + 10000000 )
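
The same arithmetic as a quick sketch, using the offsets from md5_sd. The byte count assumes dd's default block size of 512 bytes, since the script passes no bs=. Whether the blk number in the syslog is counted in that same unit is not confirmed anywhere in this thread, so treat that part as an assumption. The function name read_window is made up.

```shell
#!/bin/bash
# read_window BAD: print the first block, the end of the range, and the bytes
# read for the offsets used in md5_sd (2000000 before, 10000000 blocks long).
# Assumes dd's default 512-byte block size, since the script sets no bs=.
read_window() {
  local bad=$1 before=2000000 count=10000000 bs=512
  local start=$(( bad - before ))
  echo "start=${start} end=$(( start + count )) bytes=$(( count * bs ))"
}

read_window 3902847360
# prints: start=3900847360 end=3910847360 bytes=5120000000
```

So each pass reads about 5 GB per drive, starting about 1 GB (2000000 x 512 bytes) before the reported block.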

January 13, 2012

You'd have to do some reading about the bug, if you haven't already. I believe it can corrupt if there is a write to the data disk while it is being rebuilt. So, those errors could be caused by another write during the rebuild. I'm not too clear on the issue though - I did read about it and recall it being an issue with writing during the rebuild but it's been a while now.

I think I get it now, you start at 1gig before and read for 5gig. If I really get it, then this script combined with md5_sd should allow a single run from the command line and test all the drives on a server. No need to do any editing for the specific server.

md5_all

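#!/bin/bash
LOG_DIR=/boot/LOGS/hashes
BadParity=$1
RunTime=$(date +%F_%H%M)

echo "${RunTime}  BadParity=${BadParity}" >> "${LOG_DIR}/BadParity.log"

# The glob yields full paths such as /dev/sdb, but md5_sd prepends /dev/
# itself, so pass only the device name (sdb)
for disk in /dev/[hs]d[a-z]
  do
     md5_sd "${disk##*/}" "${BadParity}" &
  done
exit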

Another question, do the directories have to exist before this is run???

Peter

January 13, 2012

Author

That is the goal.

I asked some questions regarding dd in another thread, http://lime-technology.com/forum/index.php?topic=17842.0, and the answers gave me cause for concern, so I want to modify/enhance the script some more before adding it to the wiki.

As written there, yes, the directories have to exist. But adding mkdir -p ${LOG_DIR} after the LOG_DIR= line would take care of that issue.
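
A sketch of that directory check, wrapped in a small function so it can be tried outside the scripts. The name ensure_log_dir is hypothetical; mkdir -p itself creates any missing parent directories and does not complain if the directory is already there.

```shell
#!/bin/bash
# ensure_log_dir DIR: create DIR (and any missing parents) if needed.
# mkdir -p succeeds silently when the directory already exists.
ensure_log_dir() {
  mkdir -p "$1"
}

# In md5_all / md5_sd this would sit right after the LOG_DIR= line:
#   LOG_DIR=/boot/LOGS/hashes
#   ensure_log_dir "${LOG_DIR}" || exit 1
```

Exiting when the directory cannot be created avoids running five long dd passes whose results have nowhere to go.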
