January 13, 2012

Getting 7 parity errors on the NOCORRECT check, same blk numbers.

I have done the following:

  memtest overnight (10 hrs): no errors
  smartctl -d ata -tshort /dev/sd* on all the drives
  Reallocated_Sector_Ct = 0 on all drives
  Current_Pending_Sector = 0 on all drives

Not sure what else to look at, although some drives seem to have a high Seek_Error_Rate (I did have a loose cable earlier).

BTW, how can I pull the smartlog that I have on the USB while SAMBA/ARRAY are down and I am running the reiserfsck?

Example:

  SMART Attributes Data Structure revision number: 10
  Vendor Specific SMART Attributes with Thresholds:
  ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate     0x000f 114   099   006    Pre-fail Always  -           63126504
    3 Spin_Up_Time            0x0003 100   100   000    Pre-fail Always  -           0
    4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           310
    5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
    7 Seek_Error_Rate         0x000f 072   060   030    Pre-fail Always  -           4310015949
    9 Power_On_Hours          0x0032 087   087   000    Old_age  Always  -           12222
   10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           0
   12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           186
  183 Runtime_Bad_Block       0x0032 100   100   000    Old_age  Always  -           0
  184 End-to-End_Error        0x0032 100   100   099    Old_age  Always  -           0
  187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
  188 Command_Timeout         0x0032 100   100   000    Old_age  Always  -           0
  189 High_Fly_Writes         0x003a 099   099   000    Old_age  Always  -           1
  190 Airflow_Temperature_Cel 0x0022 071   063   045    Old_age  Always  -           29 (Lifetime Min/Max 28/33)
  194 Temperature_Celsius     0x0022 029   040   000    Old_age  Always  -           29 (0 18 0 0)
  195 Hardware_ECC_Recovered  0x001a 049   023   000    Old_age  Always  -           63126504
  197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
  198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
  199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
  240 Head_Flying_Hours       0x0000 100   253   000    Old_age  Offline -           78069620551897
  241 Total_LBAs_Written      0x0000 100   253   000    Old_age  Offline -           2438286620
  242 Total_LBAs_Read         0x0000 100   253   000    Old_age  Offline -           3968780009

  SMART Error Log Version: 1
  No Errors Logged

I ran the following script on all drives at the same time. The md5 sums all matched.

  #!/bin/bash
  LOG_DIR=/boot/LOGS/hashes
  DeviceAddr=$1
  BadParity=$2
  StartHere=`expr ${BadParity} - 2000000`
  RunTime=`date +%F_%H%M`
  for i in {1..5}
  do
    echo "Begin ${DeviceAddr} for the $i time. Bad spot = ${BadParity} Start at ${StartHere} `date` "
    dd if=/dev/${DeviceAddr} skip=${StartHere} count=10000000 | md5sum -b >> ${LOG_DIR}/${DeviceAddr}_${RunTime}.log
  done
  exit

Currently I have the array offline and am running reiserfsck --check /dev/sd*1 on all drives EXCEPT the parity drive.

So is there anything to be concerned about in the SMART reports?

If the reiserfsck --check runs do not report any problems, where do I go from here?

Is there a method of determining the file(s) on the data drives for specific blk numbers, so that I might know which file(s) could potentially be impacted?

What next?
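The attributes called out in this post (Reallocated_Sector_Ct, Current_Pending_Sector, Seek_Error_Rate) could be pulled out of a saved report rather than read off each table by eye. A minimal sketch, assuming the report is attribute-table text like the example above (the layout printed by `smartctl -A`); the helper name `smart_raw` is made up for illustration and is not part of smartctl or unRAID.

```bash
#!/bin/bash
# smart_raw: hypothetical helper (an assumption, not an existing tool).
# Reads a SMART attribute table on stdin and prints "NAME RAW_VALUE"
# for each attribute ID given as an argument.
smart_raw() {
    local ids=" $* "    # e.g. " 5 7 197 ", padded so IDs match whole
    # tr strips the carriage returns (shown as ^M in the saved report).
    tr -d '\r' | awk -v ids="$ids" '
        # Attribute rows start with a numeric ID; RAW_VALUE is field 10.
        $1 ~ /^[0-9]+$/ && index(ids, " " $1 " ") { print $2, $10 }'
}
```

Something like `smart_raw 5 7 197 < saved_report.txt` (file name hypothetical) would then list just those three raw values for one drive.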
January 13, 2012

Did you change anything in the system before these errors started to occur?

Here is a testing method that will validate the hard drives and controllers:
http://lime-technology.com/wiki/index.php?title=FAQ#How_To_Troubleshoot_Recurring_Parity_Errors

Note that much of this calls for using the unRAID filesystem, which is stored in RAM. Instead of /var you may want to create directories on the flash drive. I want to edit the info one day to use the flash drive so the data isn't lost during a reboot. If you use the info, feedback or editing would be helpful.

Peter
January 13, 2012, Author

Parity check in Dec was without errors. Earlier (a week or so ago) I replaced a bad drive, but the data rebuild worked without any issue reported. I'm doing the parity checks because I plan on moving unRAID to my 4224 from my CM-590 this weekend, so I want to resolve this before doing so.

Yes, that is what I used for my md5 checks: 5 passes on each drive at the same time, and they all matched.

I made a driver script that executes the "worker script" for each drive.

Driver script: md5_all <parity error blk number>

  #!/bin/bash
  LOG_DIR=/boot/LOGS/hashes
  BadParity=$1
  RunTime=`date +%F_%H%M`
  echo "${RunTime} BadParity=${BadParity}" >> ${LOG_DIR}/BadParity.log
  md5_sd sdb $1 &
  md5_sd sdc $1 &
  md5_sd sdd $1 &
  md5_sd sde $1 &
  md5_sd sdf $1 &
  md5_sd sdg $1 &
  md5_sd sdh $1 &
  md5_sd sdi $1 &
  md5_sd sdj $1 &
  md5_sd sdk $1 &
  md5_sd sdl $1 &
  md5_sd sdm $1 &
  exit

"Worker script" md5_sd (parms passed from the md5_all script); it puts the log files on the USB drive.

  #!/bin/bash
  LOG_DIR=/boot/LOGS/hashes
  DeviceAddr=$1
  BadParity=$2
  StartHere=`expr ${BadParity} - 2000000`
  RunTime=`date +%F_%H%M`
  for i in {1..5}
  do
    echo "Begin ${DeviceAddr} for the $i time. Bad spot = ${BadParity} Start at ${StartHere} `date` "
    dd if=/dev/${DeviceAddr} skip=${StartHere} count=10000000 | md5sum -b >> ${LOG_DIR}/${DeviceAddr}_${RunTime}.log
  done
  exit

If you see other ways to improve it I'd be interested.
When I get a little more time, I'm thinking of having md5_all be passed not only the blk #, but also the beginning/ending sd* letters, with a loop that kicks off md5_sd for the proper sd*. I just need to find an example online of doing a for loop using the alphabet, and of how to handle looping for a given # of iterations.
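For the "they all matched" check, the per-drive logs written by md5_sd can be compared mechanically. A minimal sketch, assuming each log holds one md5sum line per pass, as the worker script writes them; the function name `passes_match` is hypothetical.

```bash
#!/bin/bash
# passes_match: hypothetical helper (not part of the posted scripts).
# Succeeds when every non-empty line of the given log is identical,
# i.e. all md5sum passes over that drive produced the same hash.
passes_match() {
    local log=$1 distinct
    # Count the distinct non-blank lines; exactly one means all passes agree.
    distinct=$(grep -v '^[[:space:]]*$' "$log" | sort -u | wc -l)
    [ "$distinct" -eq 1 ]
}
```

It would be called once per log, for example `passes_match /boot/LOGS/hashes/sdb_2012-01-13_0930.log || echo MISMATCH` (that file name is only an illustration of the naming pattern). The comparison is per drive across passes; hashes from different drives are expected to differ.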
January 13, 2012

  for iteration in {1..10}
  do
    for alpha in {a..z}
    do
      echo $iteration $alpha
    done
  done

It might be easier to simply use

  for disk in /dev/[hs]d[a-z]
  do
    echo $disk
  done
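One wrinkle when passing the first and last drive letters as parameters: bash performs brace expansion before variable expansion, so `{$first..$last}` is not expanded the way the literal `{a..z}` is. A minimal sketch that walks the whole alphabet and keeps only the letters inside the requested range; the function name `disk_range` is hypothetical.

```bash
#!/bin/bash
# disk_range: hypothetical helper. Prints sdX for every letter from
# $1 through $2 inclusive, e.g. "disk_range b m" prints sdb ... sdm.
disk_range() {
    local first=$1 last=$2 letter in_range=0
    for letter in {a..z}; do
        # Start printing once the first requested letter is reached.
        if [ "$letter" = "$first" ]; then in_range=1; fi
        if [ "$in_range" -eq 1 ]; then echo "sd${letter}"; fi
        # Stop after the last requested letter.
        if [ "$letter" = "$last" ]; then break; fi
    done
    return 0
}
```

A driver could then loop with `for disk in $(disk_range b m)` and start md5_sd for each name. If the last letter comes before the first in the alphabet, nothing is printed.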
January 13, 2012, Author

reiserfsck --check results

No corruption on any drive. 2 drives have Data Block pointers that are Zero (0)....Is this a concern??

So. Any further actions on my part? Is it ok/recommended at this point to do a Parity CORRECT?

  ###########
  reiserfsck --check started at Fri Jan 13 09:23:43 2012
  ###########
  Replaying journal: Done.
  Reiserfs journal '/dev/sdc1' in blocks [18..8211]: 0 transactions replayed
  Checking internal tree.. finished
  Comparing bitmaps..finished
  Checking Semantic tree: finished
  No corruptions found
  There are on the filesystem:
    Leaves 426773
    Internal nodes 2739
    Directories 1072
    Other files 14412
    Data block pointers 428380324 (699301 of them are zero)
    Safe links 0
  ###########
  reiserfsck finished at Fri Jan 13 11:09:56 2012
  ###########

  ###########
  reiserfsck --check started at Fri Jan 13 09:28:00 2012
  ###########
  Replaying journal: Done.
  Reiserfs journal '/dev/sdl1' in blocks [18..8211]: 0 transactions replayed
  Checking internal tree.. finished
  Comparing bitmaps..finished
  Checking Semantic tree: finished
  No corruptions found
  There are on the filesystem:
    Leaves 479894
    Internal nodes 3052
    Directories 438
    Other files 5581
    Data block pointers 484698420 (15097 of them are zero)
    Safe links 0
  ###########
  reiserfsck finished at Fri Jan 13 11:03:31 2012
  ###########
January 13, 2012

I guess I should read the post more closely next time.

I've got to learn scripting better. Does the badparity come from a user prompt? You're asking about a loop to replace the long list of md5_sd lines in the md5_all script, right? I think I understand what Joe is recommending, and that could be used to create 2 scripts that just prompt for the badparity location and then automatically run the test on all drives. That's good stuff; we've got to get that in the Wiki for future use.

unRAID 4.7 has a bug that can corrupt data during a rebuild. Usually it appears as 2 or 3 parity errors right near the start of the disk. Often, the errors will appear within a few seconds of starting the parity check. If they exist, the problem files will be on the rebuilt disk.

At this point, everything is looking OK, so I would run a correcting check and ensure the parity errors go away. You can even just run the correcting check long enough to fix the problems, then cancel it and run a nocorrect check again.

Peter
January 13, 2012, Author

Yep, this is 4.7...should have put that in the Subj line.

These are the blocks in question, so not near the beginning:

  Jan 13 00:37:31 Tower kernel: md: parity incorrect: 1196411456 (Errors)
  Jan 13 00:41:35 Tower kernel: md: parity incorrect: 1234449664 (Errors)
  Jan 13 00:41:35 Tower kernel: md: parity incorrect: 1234449672 (Errors)
  Jan 13 00:41:35 Tower kernel: md: parity incorrect: 1234449680 (Errors)
  Jan 13 08:05:23 Tower kernel: md: parity incorrect: 3902847360 (Errors)
  Jan 13 08:05:23 Tower kernel: md: parity incorrect: 3902847368 (Errors)
  Jan 13 08:05:23 Tower kernel: md: parity incorrect: 3902847376 (Errors)

The BadParity is not a prompt, but rather a parm when you execute the script. For example:

  md5_all 3902847360

That would do md5 sums on blks 3900847360 (3902847360 - 2000000) thru 3910847360 (3902847360 - 2000000 + 10000000).
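The window arithmetic can be spelled out in a few lines. A minimal sketch, assuming dd's default block size of 512 bytes (the posted dd command sets no bs=) and that the number in the kernel message is in the same units as dd's skip, which is how the scripts in this thread treat it; the function name `hash_window` is hypothetical.

```bash
#!/bin/bash
# hash_window: hypothetical helper that reproduces the md5_sd arithmetic.
# $1 is the reported bad-parity block number.
hash_window() {
    local bad=$1 before=2000000 count=10000000 sector=512
    local start=$(( bad - before ))     # same as: expr ${BadParity} - 2000000
    local end=$(( start + count ))      # first block past the hashed window
    echo "start=${start} end=${end}"
    # Sizes in bytes, assuming 512-byte blocks (dd's default).
    echo "bytes_before=$(( before * sector )) bytes_read=$(( count * sector ))"
}
```

For `hash_window 3902847360` this prints `start=3900847360 end=3910847360` and `bytes_before=1024000000 bytes_read=5120000000`, so each pass starts about 1 GB before the bad spot and hashes about 5 GB. A bad spot closer than 2000000 blocks to the start of the disk would give a negative start, which the posted script does not guard against.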
January 13, 2012

You'd have to do some reading about the bug, if you haven't already. I believe it can corrupt data if there is a write to the data disk while it is being rebuilt. So, those errors could be caused by another write during the rebuild. I'm not too clear on the issue though. I did read about it and recall it being an issue with writing during the rebuild, but it's been a while now.

I think I get it now: you start at 1gig before and read for 5gig.

If I really get it, then this script combined with md5_sd should allow a single run from the command line to test all the drives on a server. No need to do any editing for the specific server.

md5_all

  #!/bin/bash
  LOG_DIR=/boot/LOGS/hashes
  BadParity=$1
  RunTime=`date +%F_%H%M`
  echo "${RunTime} BadParity=${BadParity}" >> ${LOG_DIR}/BadParity.log
  for disk in /dev/[hs]d[a-z]
  do
    # md5_sd prepends /dev/ itself, so pass the bare name (sdb, not /dev/sdb)
    md5_sd ${disk#/dev/} $1 &
  done
  exit

Another question: do the directories have to exist before this is run?

Peter
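Two details matter when the fixed list of drives is replaced by the glob loop. md5_sd builds its dd path as /dev/${DeviceAddr} and uses the same name in the log file name, so it needs a bare name such as sdb rather than /dev/sdb. The glob may also match devices that are not array members (the flash drive, if it shows up as an sd device), which the original hand-written list left out. A minimal sketch of the loop as a function; the name `run_on_all_disks` and the optional device-directory argument are assumptions added so the loop can be tried against a scratch directory, and `wait` is added so the function returns only after all background workers finish.

```bash
#!/bin/bash
# run_on_all_disks: hypothetical wrapper around the md5_all loop.
# $1 = bad-parity block number, $2 = device directory (default /dev).
# Expects a command named md5_sd (the worker script from this thread).
run_on_all_disks() {
    local bad_parity=$1 dev_dir=${2:-/dev} disk
    for disk in "$dev_dir"/[hs]d[a-z]; do
        [ -e "$disk" ] || continue          # glob matched nothing
        # Strip the directory so md5_sd receives "sdb", not "/dev/sdb".
        md5_sd "${disk##*/}" "$bad_parity" &
    done
    wait                                    # block until every worker is done
}
```

Pointing the second argument at a directory of empty files named sdb, sdc and so on, with md5_sd temporarily replaced by an echo, gives a dry run that shows which names would be processed.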
January 13, 2012, Author

That is the goal. I asked some questions regarding dd in another thread, http://lime-technology.com/forum/index.php?topic=17842.0, and the answers gave me some cause for concern, so I would want to modify/enhance it some more before adding it to the wiki.

As for the directories: as written there, yes, they have to exist first. But adding

  mkdir -p ${LOG_DIR}

after the LOG_DIR= line would take care of that issue.
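The mkdir -p suggestion can be tried safely outside the scripts, since -p creates any missing parent directories and does not complain when the directory already exists. A minimal sketch; the function name `prepare_log_dir` is hypothetical, and the default path is the LOG_DIR used by the scripts in this thread.

```bash
#!/bin/bash
# prepare_log_dir: hypothetical helper. Creates the log directory
# (and any missing parents) and prints its path; safe to call repeatedly.
prepare_log_dir() {
    local log_dir=${1:-/boot/LOGS/hashes}
    mkdir -p "$log_dir" || return 1
    echo "$log_dir"
}
```

In md5_all or md5_sd the equivalent is the single mkdir -p line placed right after the LOG_DIR= assignment, as described above.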