February 8, 2010

I have a question about the parity algorithm. Yesterday I added a new parity drive and did the parity build. It built without any errors.

Today, after an accidental power down without an orderly shutdown (loss of power, oops!), it is doing a parity check. It's 22% finished, has found 113136 sync errors, and is writing to the parity drive. No errors in the drive error column as of yet.

What I'm curious about is why there are so many parity changes. The array was sleeping when the power down happened. Parity shouldn't be different from before the crash. Is there a time element to the parity algorithm?

If two consecutive parity checks are run, shouldn't the second one never find differences?

Dave
February 8, 2010

> I have a question about the parity algorithm. Yesterday I added a new parity drive and did the parity build. It built without any errors.

Yesterday you wrote parity to the parity disk... Did you perform a parity check yesterday? If not, you had no idea if the data written could be read back.

> Today, after an accidental power down without an orderly shutdown (loss of power, oops!), it is doing a parity check. It's 22% finished and 113136 sync errors and is writing to the parity drive. No errors in the drive error column as of yet.

Errors in the "drive" column represent errors in reading from the disk. It is good there are no read errors.

Yes, that sounds like a lot of errors. It is not unusual to see some after an unexpected power down. It is because reiserfs is a journaling file-system. It can replay transactions that were written to it, but not committed to it. The parity drive has no such journal. It is relying on being written to properly.

> What I'm curious about is why are there so many parity changes? The array was sleeping when the power down happened. Parity shouldn't be different from before the crash. Is there a time element to the parity algorithm?

The data should be flushed from the buffer cache to the disks, but it is hard to know exactly when the memory is flushed. If you have a lot of RAM it could be that it was buffering some...

> If two consecutive parity checks are run, shouldn't the second one never find differences?

True. If you are seeing parity errors on a second parity check you have hardware errors of some kind to resolve. Have you done a memory test? If the memory is not set correctly (voltage, timing, and speed), all bets are off.

Joe L.
February 8, 2010 Author

How do I do a memory test?

Also, regarding the journaling filesystem... when does Unraid commit its transactions? Does it commit before spinning down a drive? Why is it hard to know when the data is flushed from memory? Does it sit in memory even after Unraid commits it? What process has control of it in the buffer cache?

Dave
February 8, 2010

> How do I do a memory test?

http://lime-technology.com/forum/index.php?topic=5229.msg48907#msg48907
February 8, 2010 Author

Ok, I'm running the memory check. It's an older machine with locked voltages and speed on the memory so I don't think that could have any effect. I'll report results later.

Dave
February 8, 2010 Author

The memory test shows zero errors after 4.5 hours. Since the parity check takes about that long and had 100K sync errors I'd say the RAM is not part of the problem.

Is there a diagnostic for testing the parity drive?

When data is written to one of the data drives, does Unraid catch an event while the data is still in memory, or does it read it back off the drive once it's written before it calculates and writes to the parity drive?

Since I've seen parity check errors when no data was written in between the checks, I have to assume one of the following:

a.) The data being read off the data drive is inconsistent (maybe the cable on the data drive), or
b.) The old parity information being read off the parity drive is inconsistent (maybe the cable on the parity drive), or
c.) The weird, remote possibility that the parity calculation process is getting all the data correctly but for some reason is coming up with different numbers due to a problematic math co-processor bug/feature.

It appears that Unraid must write to the parity disk in raw mode or something. Is there an Unraid diagnostic test for the parity drive?

Dave
February 8, 2010

> Is there a diagnostic for testing the parity drive?
> ...
> Is there an Unraid diagnostic test for the parity drive?

A quick question... Did you preclear that disk before trusting that it's a good disk to add to your array? You may want to have a look at this thread: http://lime-technology.com/forum/index.php?topic=2817.msg23246#msg23246
February 8, 2010 Author

No I didn't pre-clear it. I guess I'll do that now.

edit: What's the deal with the preclear_disk.sh file not downloading using IE or Firefox? Man!

Dave
February 8, 2010

> The memory test shows zero errors after 4.5 hours. Since the parity check takes about that long and had 100K sync errors I'd say the RAM is not part of the problem.

From your results, I'd agree.

> Is there a diagnostic for testing the parity drive?

There is no specific diagnostic, but there are several tests.

1. You can run a "smartctl" test on the drive to get its basic status. You can ask the SMART firmware on the disk to perform a "short" test (it takes a few minutes), or you can ask it to perform a "long" test (it typically takes about 4 or 5 hours, as it reads the entire surface of the drive).
2. If the drive is not assigned to the array, you can run a "pre-clear" script on it. Although the clearing itself has no purpose on the parity drive, the script will read every block on the drive, write to every block, and then re-read every block. It will find bad sectors on the disk (if there are any).

> When data is written to one of the data drives does Unraid catch an event while the data is still in memory or does it read it back off the drive once it's written before it calculates and writes to the parity drive?

The data drive AND parity drive are both issued "read" requests for the current contents of their respective sectors. Then an "xor" operation is performed in memory, then parallel "writes" are issued to both. It is up to the underlying Linux file-system/disk buffering at that point, and the buffering on the disks themselves. There is no way to know when the actual data is written to the physical disk platter, on ANY drive.

Basically, the disk is written to immediately but the "journal" that commits writes is written to less frequently. On my 4.5 server, the "max commit time" seems to be set to 30 seconds. It is not something I set, but a timing internal to the reiser file-system.

> Since I've seen parity check errors when no data was written in between the checks then I have to assume that:
> a.) The data being read off the data drive is inconsistant (maybe cable on data drive) or
> b.) The old parity information being read off the parity drive is inconsistant (maybe cable on parity drive). or
> c.) Weird remote possiblity that the parity calculation process is getting all the data correctly but for some reason is coming up with different numbers due to a problematic math co-processor bug/feature.

You are correct. It could be any of the hardware.

> It appears that Unraid must write to the parity disk in raw mode or something. Is there an Unraid diagnostic test for the parity drive?

Correct. It is just bits... no file-system exists on the parity drive. Tests were described above.

Be certain to disable any disk spin-down timer if running a smartctl "long" test. If you do not, the test will abort when your server requests the disk spin down.

To make it much easier to run many of these tests you can install unMENU, an alternate web-interface, as described here: http://lime-technology.com/wiki/index.php?title=UnRAID_Add_Ons#UnMENU

Its disk-management page allows you to quickly submit tests on the disks.
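For anyone following along, the read / xor / write sequence described in this reply can be sketched in a few lines of Python. This is only an illustration under the assumption of plain single-parity XOR across the data disks; it is not unRAID's actual code, and the names `compute_parity`, `read_modify_write` and `count_sync_errors` are made up for the example. It also shows why a parity write that never reaches the platter shows up later as sync errors.

```python
from functools import reduce


def compute_parity(data_blocks):
    """XOR the same byte position of every data block (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*data_blocks))


def read_modify_write(old_data, old_parity, new_data):
    """Update parity from the old data, old parity and new data only.

    new_parity = old_parity XOR old_data XOR new_data
    """
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))


def count_sync_errors(data_blocks, parity):
    """Count byte positions where stored parity disagrees with the data."""
    expected = compute_parity(data_blocks)
    return sum(1 for e, p in zip(expected, parity) if e != p)


# Three tiny "data disks" of two bytes each, plus their parity.
disks = [bytes([0x0F, 0xA0]), bytes([0xF0, 0x0A]), bytes([0x33, 0x55])]
parity = compute_parity(disks)
stale_parity = parity  # what the parity disk holds if the next write is lost

# Normal write to disk 1: read old data and old parity, xor, write both.
new_block = bytes([0xFF, 0x00])
parity = read_modify_write(disks[1], parity, new_block)
disks[1] = new_block

print(count_sync_errors(disks, parity))        # parity write landed
print(count_sync_errors(disks, stale_parity))  # parity write was lost
```

The point of the sketch is that a check only compares what is on the disks: if the data write landed and the parity write did not (or the other way round), the check has no journal to consult and simply reports a mismatch.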
February 8, 2010

> No I didn't pre-clear it. I guess I'll do that now.
> edit: What's the deal with the preclear_disk.sh file not downloading using IE or Firefox? Man!
> Dave

It should download with either... The file is preclear_disk.zip and is attached at the bottom of the first post in the preclear_disk.sh thread. A link to the attachment is here: http://lime-technology.com/forum/index.php?action=dlattach;topic=2817.0;attach=1937
February 8, 2010 Author

Thanks. I do have Unmenu installed. I have to start it manually every boot for some reason but I will do some smart tests on the parity drive. The long one is running now and should finish in a few hours.
February 8, 2010

> Thanks. I do have Unmenu installed. I have to start it manually every boot for some reason but I will do some smart tests on the parity drive. The long one is running now and should finish in a few hours.

Make sure you turn off the spin-down timer... I'll check back in a few hours myself. (I have a dance lesson to attend with my wife.)
February 9, 2010 Author

I have no idea if the long smart test finished. No end time in the report. The drives didn't spin down if that helps.

SMART status Info for /dev/sdc

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Warning! Drive Identity Structure error: invalid SMART checksum.
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11
Device Model:     ST3500320AS
Serial Number:    5QM01S58
Firmware Version: SD04
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Feb 8 17:47:11 2010 GMT+8
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 249) Self-test routine in progress... 90% of test remaining.
Total time to complete Offline data collection: ( 642) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 106) minutes.
Conveyance self-test routine recommended polling time: ( 2) minutes.
SCT capabilities: (0x003b) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 114   100   006    Pre-fail Always  -           69269567
  3 Spin_Up_Time            0x0003 095   094   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           97
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f 067   060   030    Pre-fail Always  -           21503281392
  9 Power_On_Hours          0x0032 087   087   000    Old_age  Always  -           11569
 10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           1
 12 Power_Cycle_Count       0x0032 100   037   020    Old_age  Always  -           88
184 Unknown_Attribute       0x0032 100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
188 Unknown_Attribute       0x0032 100   100   000    Old_age  Always  -           21475164165
189 High_Fly_Writes         0x003a 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0022 071   049   045    Old_age  Always  -           29 (Lifetime Min/Max 29/31)
194 Temperature_Celsius     0x0022 029   051   000    Old_age  Always  -           29 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a 048   032   000    Old_age  Always  -           69269567
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description  Status                        Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline  Self-test routine in progress 90%       11569           -
# 2 Extended offline  Aborted by host               10%       11569           -
# 3 Extended offline  Aborted by host               90%       11568           -
# 4 Extended offline  Aborted by host               90%       11568           -
# 5 Extended offline  Interrupted (host reset)      00%       11568           -
# 6 Extended offline  Aborted by host               90%       11568           -
# 7 Extended offline  Aborted by host               90%       11568           -
# 8 Extended offline  Aborted by host               90%       11567           -
# 9 Extended offline  Aborted by host               90%       11567           -
#10 Short offline     Completed without error       00%       11567           -
#11 Short offline     Aborted by host               80%       11567           -
#12 Short offline     Aborted by host               90%       11567           -
#13 Short offline     Aborted by host               90%       11567           -
#14 Short offline     Aborted by host               90%       11567           -
#15 Short offline     Aborted by host               90%       11567           -
#16 Short offline     Aborted by host               90%       11567           -
#17 Short offline     Aborted by host               60%       11567           -
#18 Short offline     Aborted by host               50%       11567           -
#19 Short offline     Completed without error       00%       11567           -
#20 Short offline     Completed without error       00%       11567           -
#21 Short offline     Aborted by host               40%       11567           -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
   1       0       0 Not_testing
   2       0       0 Not_testing
   3       0       0 Not_testing
   4       0       0 Not_testing
   5       0       0 Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
February 9, 2010 Author

I'm running the pre-clear process. It will probably run into the AM so I'll check its results tomorrow morning.
February 9, 2010

> I have no idea if the long smart test finished. No end time in the report. The drives didn't spin down if that helps.

As seen below... it is still running as of this report.

> Self-test execution status: ( 249) Self-test routine in progress... 90% of test remaining.

(The long test always seems to say 90% when running... I don't think I've ever seen anything other than 90%, or completed with 9 remaining.)

> Extended self-test routine recommended polling time: ( 106) minutes.

It seems to indicate it expects to run a bit under 2 hours for the long test.

>   5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
> 197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0

There are no reallocated sectors, or sectors pending re-allocation. This is good. There is no indication of anything internal to the hard disk that would have resulted in the errors you experienced earlier.

> Num Test_Description  Status                        Remaining LifeTime(hours) LBA_of_first_error
> # 1 Extended offline  Self-test routine in progress 90%       11569           -
> # 2 Extended offline  Aborted by host               10%       11569           -
> # 3 Extended offline  Aborted by host               90%       11568           -
> # 4 Extended offline  Aborted by host               90%       11568           -
> # 5 Extended offline  Interrupted (host reset)      00%       11568           -
> # 6 Extended offline  Aborted by host               90%       11568           -
> # 7 Extended offline  Aborted by host               90%       11568           -
> # 8 Extended offline  Aborted by host               90%       11567           -
> # 9 Extended offline  Aborted by host               90%       11567           -
> #10 Short offline     Completed without error       00%       11567           -
> #11 Short offline     Aborted by host               80%       11567           -
> #12 Short offline     Aborted by host               90%       11567           -
> #13 Short offline     Aborted by host               90%       11567           -
> #14 Short offline     Aborted by host               90%       11567           -
> #15 Short offline     Aborted by host               90%       11567           -
> #16 Short offline     Aborted by host               90%       11567           -
> #17 Short offline     Aborted by host               60%       11567           -
> #18 Short offline     Aborted by host               50%       11567           -
> #19 Short offline     Completed without error       00%       11567           -
> #20 Short offline     Completed without error       00%       11567           -
> #21 Short offline     Aborted by host               40%       11567           -

You've run a number of short and long tests recently. Several of the "short" tests completed successfully. A long test is in progress. It appears that most of the tests were aborted. (You MUST disable the spin-down timer. Otherwise, as soon as the test is started and unRAID sees the disk spinning, it will spin it down and abort the test.)

Joe L.
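For readers who want to run the same check on their own report, here is a rough Python sketch of the reading done above. It assumes the ten-column attribute layout shown in this thread's smartctl 5.38 output (other versions may print different columns), and the names `SmartAttribute`, `parse_attribute_line`, `failed_attributes` and `nonzero_sector_counts` are invented for the example. It flags attributes whose normalized VALUE is at or below THRESH, and reports non-zero raw counts for the reallocated / pending / uncorrectable sector attributes.

```python
from typing import List, NamedTuple, Optional


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized VALUE column
    worst: int
    thresh: int
    raw: str     # RAW_VALUE column, kept as text (it may carry extra notes)


def parse_attribute_line(line: str) -> Optional[SmartAttribute]:
    """Parse one row of the attribute table; return None for other lines."""
    parts = line.split(None, 9)  # 10 columns, the last one keeps its spaces
    if len(parts) < 10 or not parts[0].isdigit():
        return None
    return SmartAttribute(int(parts[0]), parts[1], int(parts[3]),
                          int(parts[4]), int(parts[5]), parts[9])


def failed_attributes(attrs: List[SmartAttribute]) -> List[SmartAttribute]:
    """Attributes whose normalized value is at or below the threshold."""
    return [a for a in attrs if a.thresh > 0 and a.value <= a.thresh]


def nonzero_sector_counts(attrs: List[SmartAttribute], ids=(5, 197, 198)):
    """Raw counts for reallocated, pending and offline-uncorrectable sectors."""
    return {a.name: int(a.raw.split()[0])
            for a in attrs if a.attr_id in ids and int(a.raw.split()[0]) != 0}


# A few rows copied from the report posted in this thread.
report = """\
  1 Raw_Read_Error_Rate     0x000f 114 100 006 Pre-fail Always - 69269567
  5 Reallocated_Sector_Ct   0x0033 100 100 036 Pre-fail Always - 0
190 Airflow_Temperature_Cel 0x0022 071 049 045 Old_age  Always - 29 (Lifetime Min/Max 29/31)
197 Current_Pending_Sector  0x0012 100 100 000 Old_age  Always - 0
"""
attrs = [a for a in map(parse_attribute_line, report.splitlines()) if a]
print(failed_attributes(attrs))
print(nonzero_sector_counts(attrs))
```

Both results come back empty for the rows above, which matches the reading that nothing inside this drive points at the parity errors.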
February 9, 2010 Author

I'll try the long smart test again but first... I ran the pre-clear process overnight. It just finished. But it scrolled the monitor. How can I scroll up from the command line to see what is above?
February 9, 2010

I'd highly recommend telnetting into the system (I use PuttyTel); then you can scroll through the output much more easily.
February 9, 2010 Author

Here is the information from the pre-clear. I have started the long smart test again. I'll post those results later.

---------------------------------------------------------------------------------------------------
Warning! Drive Identity Structure error: invalid smart checksum
1 Raw_Read_Error_Rate 0x000f 114 100 006 Pre-fail Always - 69285200
1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 140439618
1 Seek_Error_Rate     0x000f 067 060 030 Pre-fail Always - 21503294050
1 Seek_Error_Rate     0x000f 067 060 030 Pre-fail Always - 21503337794
February 9, 2010

> I'll try the long smart test again but first...I ran the pre-clear process overnight. It just finished. But it scrolled the monitor. How can I scroll up from the command line to see what is above?

I think it is "Shift-Page-Up".
February 9, 2010

> Here is the information from the pre-clear. I have started the long smart test again. I'll post those results later.
> 1 Raw_Read_Error_Rate 0x000f 114 100 006 Pre-fail Always - 69285200
> 1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 140439618
> 1 Seek_Error_Rate     0x000f 067 060 030 Pre-fail Always - 21503294050
> 1 Seek_Error_Rate     0x000f 067 060 030 Pre-fail Always - 21503337794

See this post on how to interpret your results: http://lime-technology.com/forum/index.php?topic=4068.msg48756#msg48756
February 9, 2010 Author

> > Here is the information from the pre-clear. I have started the long smart test again. I'll post those results later.
> > 1 Raw_Read_Error_Rate 0x000f 114 100 006 Pre-fail Always - 69285200
> > 1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 140439618
> > 1 Seek_Error_Rate     0x000f 067 060 030 Pre-fail Always - 21503294050
> > 1 Seek_Error_Rate     0x000f 067 060 030 Pre-fail Always - 21503337794
>
> See this post on how to interpret your results. http://lime-technology.com/forum/index.php?topic=4068.msg48756#msg48756

My interpretation using the link you provided is that my drive is very consistent and reliable. Anybody read it differently?
February 9, 2010

> > > Here is the information from the pre-clear. I have started the long smart test again. I'll post those results later.
> > > 1 Raw_Read_Error_Rate 0x000f 114 100 006 Pre-fail Always - 69285200
> > > 1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 140439618
> > > 1 Seek_Error_Rate     0x000f 067 060 030 Pre-fail Always - 21503294050
> > > 1 Seek_Error_Rate     0x000f 067 060 030 Pre-fail Always - 21503337794
> >
> > See this post on how to interpret your results. http://lime-technology.com/forum/index.php?topic=4068.msg48756#msg48756
>
> My interpretation using the link you provided is that my drive is very consistant and reliable. Anybody read it differently?

You are very likely correct, but you might still want to look at the FULL smart report, not just the differences. The output of the pre-clear shows differences. You could have a failing parameter in a different category that goes unchanged from start to finish (unlikely, but theoretically possible), and it would not show in the end "difference" report.

The full before and after smart reports are in your syslog. They are also in the /tmp folder as individual files.

Joe L.
February 9, 2010 Author

After the long smart test is finished I'll look at the syslog as you indicate.

In the meantime, is there any diagnostic for testing the drives with data written to them? I'm thinking of a process that reads a data drive from beginning to end, makes a hash value for every 10k bytes, and then combines every 1000 hashes into a new hash value. Running multiple passes would then show whether the data is being returned exactly the same every time. That would verify a drive/cable pair.

Any thoughts?
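A rough Python sketch of the two-level hashing idea above, in case it helps the discussion. Everything here is an assumption made for illustration: SHA-1 is just a convenient choice of hash, `hash_pass` and `first_mismatch` are made-up names, and the demo runs against a small temporary file rather than a real device such as /dev/sdX.

```python
import hashlib
import os
import tempfile


def hash_pass(path, chunk_size=10_000, group_size=1000):
    """Read path start to end; hash every chunk, then hash each group of chunk hashes."""
    group_digests = []
    group = hashlib.sha1()
    chunks_in_group = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            group.update(hashlib.sha1(chunk).digest())
            chunks_in_group += 1
            if chunks_in_group == group_size:
                group_digests.append(group.hexdigest())
                group = hashlib.sha1()
                chunks_in_group = 0
    if chunks_in_group:  # final, partly filled group
        group_digests.append(group.hexdigest())
    return group_digests


def first_mismatch(pass_a, pass_b):
    """Index of the first group that differs between two passes, or None."""
    for index, (a, b) in enumerate(zip(pass_a, pass_b)):
        if a != b:
            return index
    if len(pass_a) != len(pass_b):
        return min(len(pass_a), len(pass_b))
    return None


# Demo on a small temporary file with scaled-down sizes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(25_000))
    demo_path = tmp.name

pass_one = hash_pass(demo_path, chunk_size=1000, group_size=10)
pass_two = hash_pass(demo_path, chunk_size=1000, group_size=10)
print(first_mismatch(pass_one, pass_two))  # None when both passes agree
os.remove(demo_path)
```

One caveat: on Linux a second pass can be served from the buffer cache instead of the disk, so a clean comparison on a small region says little about the drive or cable. Reading much more data than the machine has RAM, or dropping the cache between passes, would be needed before trusting the result.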
February 9, 2010 Author

Here are the long smart test results. None of the drives spun down. I have attached the syslog.

SMART status Info for /dev/sdc

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Warning! Drive Identity Structure error: invalid SMART checksum.
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11
Device Model:     ST3500320AS
Serial Number:    5QM01S58
Firmware Version: SD04
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Feb 9 13:26:42 2010 GMT+8
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 642) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 106) minutes.
Conveyance self-test routine recommended polling time: ( 2) minutes.
SCT capabilities: (0x003b) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 117   099   006    Pre-fail Always  -           140444031
  3 Spin_Up_Time            0x0003 095   094   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           97
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f 067   060   030    Pre-fail Always  -           21503350704
  9 Power_On_Hours          0x0032 087   087   000    Old_age  Always  -           11589
 10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           1
 12 Power_Cycle_Count       0x0032 100   037   020    Old_age  Always  -           88
184 Unknown_Attribute       0x0032 100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
188 Unknown_Attribute       0x0032 100   100   000    Old_age  Always  -           21475164165
189 High_Fly_Writes         0x003a 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0022 072   049   045    Old_age  Always  -           28 (Lifetime Min/Max 28/32)
194 Temperature_Celsius     0x0022 028   051   000    Old_age  Always  -           28 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a 051   032   000    Old_age  Always  -           140444031
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description  Status                        Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline  Completed without error       00%       11588           -
# 2 Extended offline  Completed without error       00%       11571           -
# 3 Extended offline  Aborted by host               10%       11569           -
# 4 Extended offline  Aborted by host               90%       11568           -
# 5 Extended offline  Aborted by host               90%       11568           -
# 6 Extended offline  Interrupted (host reset)      00%       11568           -
# 7 Extended offline  Aborted by host               90%       11568           -
# 8 Extended offline  Aborted by host               90%       11568           -
# 9 Extended offline  Aborted by host               90%       11567           -
#10 Extended offline  Aborted by host               90%       11567           -
#11 Short offline     Completed without error       00%       11567           -
#12 Short offline     Aborted by host               80%       11567           -
#13 Short offline     Aborted by host               90%       11567           -
#14 Short offline     Aborted by host               90%       11567           -
#15 Short offline     Aborted by host               90%       11567           -
#16 Short offline     Aborted by host               90%       11567           -
#17 Short offline     Aborted by host               90%       11567           -
#18 Short offline     Aborted by host               60%       11567           -
#19 Short offline     Aborted by host               50%       11567           -
#20 Short offline     Completed without error       00%       11567           -
#21 Short offline     Completed without error       00%       11567           -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
   1       0       0 Not_testing
   2       0       0 Not_testing
   3       0       0 Not_testing
   4       0       0 Not_testing
   5       0       0 Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Attachment: syslog-2010-02-09.txt