April 7, 2014 Hey guys, I need some help. I set up unRAID almost a year ago; everything has been running smoothly and I haven't had to touch it at all. Today I logged in for my monthly parity check and noticed that the main page shows 1,666 errors for my parity drive! I believe that number was closer to 900 a few days ago, when I ran a parity check. The parity check shows "0" errors and the drive still has a green ball next to it, but the row for the parity drive shows 1,666 errors. Can someone help me out with this? I guess this is the downside of unRAID (the lack of being intuitive and user friendly). So now I'm concerned. I would appreciate your help. Version 5.0.3
April 7, 2014 Author
Smart report

Model Family:     Seagate Desktop HDD.15
Device Model:     ST4000DM000-1F2168
Serial Number:    Z300ASTX
LU WWN Device Id: 5 000c50 050381f70
Firmware Version: CC51
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Apr 7 17:04:28 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00)
    Offline data collection activity was never started.
    Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0)
    The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 623) seconds.
Offline data collection capabilities: (0x73)
    SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    No Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 553) minutes.
Conveyance self-test routine recommended polling time: ( 2) minutes.
SCT capabilities: (0x1085)
    SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       218486096
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       563
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       8043421
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7361
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       2
183 Runtime_Bad_Block       0x0032   099   099   000    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   074   074   000    Old_age   Always       -       26
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   093   093   000    Old_age   Always       -       7
190 Airflow_Temperature_Cel 0x0022   074   060   045    Old_age   Always       -       26 (Min/Max 7/40)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   099   099   000    Old_age   Always       -       2474
194 Temperature_Celsius     0x0022   026   040   000    Old_age   Always       -       26 (0 7 0 0 0)
197 Current_Pending_Sector  0x0012   100   098   000    Old_age   Always       -       72
198 Offline_Uncorrectable   0x0010   100   098   000    Old_age   Offline      -       72
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       464h+08m+22.617s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       29559569616
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       117867593215

SMART Error Log Version: 1
ATA Error Count: 26 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 26 occurred at disk power-on lifetime: 7282 hours (303 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  43d+04:21:26.618  READ DMA EXT
  25 00 00 ff ff ff ef 00  43d+04:21:23.234  READ DMA EXT
  25 00 00 ff ff ff ef 00  43d+04:21:23.169  READ DMA EXT
  25 00 00 ff ff ff ef 00  43d+04:21:21.691  READ DMA EXT
  25 00 00 ff ff ff ef 00  43d+04:21:21.627  READ DMA EXT

Error 25 occurred at disk power-on lifetime: 7282 hours (303 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  43d+04:21:05.694  READ DMA EXT
  25 00 00 ff ff ff ef 00  43d+04:21:05.656  READ DMA EXT
  35 00 a8 ff ff ff ef 00  43d+04:21:05.014  WRITE DMA EXT
  25 00 58 ff ff ff ef 00  43d+04:21:04.933  READ DMA EXT
  25 00 00 ff ff ff ef 00  43d+04:21:04.817  READ DMA EXT

Error 24 occurred at disk power-on lifetime: 7282 hours (303 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  43d+04:21:00.683  READ DMA EXT
  25 00 00 ff ff ff ef 00  43d+04:21:00.613  READ DMA EXT
  35 00 38 ff ff ff ef 00  43d+04:21:00.292  WRITE DMA EXT
  25 00 c8 ff ff ff ef 00  43d+04:21:00.181  READ DMA EXT
  25 00 00 ff ff ff ef 00  43d+04:20:56.331  READ DMA EXT

Error 23 occurred at disk power-on lifetime: 7282 hours (303 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  43d+04:20:52.589  READ DMA EXT
  25 00 00 ff ff ff ef 00  43d+04:20:52.531  READ DMA EXT
  35 00 68 ff ff ff ef 00  43d+04:20:51.589  WRITE DMA EXT
  25 00 98 ff ff ff ef 00  43d+04:20:51.468  READ DMA EXT
  25 00 00 ff ff ff ef 00  43d+04:20:48.375  READ DMA EXT

Error 22 occurred at disk power-on lifetime: 7282 hours (303 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  43d+04:20:42.833  READ DMA EXT
  25 00 00 ff ff ff ef 00  43d+04:20:42.786  READ DMA EXT
  35 00 78 ff ff ff ef 00  43d+04:20:41.132  WRITE DMA EXT
  25 00 e8 ff ff ff ef 00  43d+04:20:38.853  READ DMA EXT
  25 00 00 ff ff ff ef 00  43d+04:20:38.786  READ DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute del
April 7, 2014 See here: http://lime-technology.com/wiki/index.php/Troubleshooting#Resolving_a_Pending_Sector
April 7, 2014 Author Sorry. Don't have the time to read right now at work. That's why I made this post. So is my drive failing? I don't know how to interpret these things... The server hasn't moved at all, so I don't think it's a loose cable... But also, I thought if there's an actual error there would be a red circle next to the drive and not a green one, which indicates everything is fine?
April 7, 2014
"Sorry. Don't have the time to read right now at work. That's why I made this post."

Your drive has pending sectors.

197 Current_Pending_Sector 0x0012 100 098 000 Old_age Always - 72

So you need to follow the directions at the link he posted. Do you have a specific question about what you need to do?
April 7, 2014 Author I couldn't read everything from my phone. Can you explain what pending sectors are and why they didn't red ball the drive? So this doesn't mean the hard drive is bad? Thanks
April 9, 2014 Author Thanks so much guys! You are the best. I stopped the array, unassigned the disk, started, stopped, then re-assigned the disk, and it is rebuilding parity now. Am I correct that the instructions say the drive needs to be replaced if there are more errors after this? Also, could someone please explain why this still showed a green ball signaling a functional drive next to it? I thought that unRAID would place a red ball next to a drive whenever it detects it is failing? I almost didn't notice the "errors" because I usually just look at the green balls next to the drives and go right into a parity check once a month...
April 9, 2014
"Also, could someone please explain why this still showed a green ball signaling a functional drive next to it? I thought that unRAID would place a red ball next to a drive whenever it detects it is failing?"

Until a write operation fails, the drive will be green. Read errors result in the rest of the array being spun up, and the data that should be at that location is calculated from the rest of the drives and written back to the drive. If the write succeeds, then the drive stays green, and the error counter is incremented. Stock unRAID does not monitor SMART statistics or try to determine whether or not a drive is healthy. It simply keeps track of how many times a read or write operation failed, and disables the drive and red balls it if a write fails.
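To make the "calculated from the rest of the drives" step concrete, here is a minimal sketch. It assumes single parity computed as a byte-wise XOR across the data disks; the 4-byte "sectors" and the `xor_blocks` helper are made up for illustration and are not unRAID's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-byte "sectors" at the same offset on three data disks.
data = [b"\x10\x20\x30\x40", b"\x0f\x0e\x0d\x0c", b"\xaa\xbb\xcc\xdd"]

# The parity disk holds the XOR of all data disks at that offset.
parity = xor_blocks(data)

# Suppose the middle disk reports a read error for this sector.
# XOR-ing the other data disks with parity yields the missing contents,
# which would then be written back to the disk that failed the read.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

If the write-back succeeds, nothing is lost and only the error counter changes, which matches the behavior described above.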
April 9, 2014 Author Got it. Thanks. I used PuTTY before with screen so I could close the computer. I have since got a MacBook. Is there anything like screen for it? I only have a laptop, so it will eventually be disconnected. Also, I see there is a new webgui, dominix? Does it have preclear as an add-on from the webgui by any chance? Thanks again
April 9, 2014
"Got it. Thanks. I used PuTTY before with screen so I could close the computer. I have since got a MacBook. Is there anything like screen for it? I only have a laptop, so it will eventually be disconnected."

Screen is running at the unRAID end, so the fact that a Mac is being used is irrelevant.
April 9, 2014
"Got it. Thanks. I used PuTTY before with screen so I could close the computer. I have since got a MacBook. Is there anything like screen for it? I only have a laptop, so it will eventually be disconnected. Also, I see there is a new webgui, dominix? Does it have preclear as an add-on from the webgui by any chance? Thanks again"

DYNAMIX ... It does NOT include the preclear script, as best I can tell. It is, however, quite nice for overall management.
April 9, 2014 Author I'll have to research it. I have Simple Features installed now, but I don't remember which files I actually need to erase to remove it. But I guess that's another post. So I can log in to unRAID using "Terminal" on the MacBook, right?
April 10, 2014
Simple Features is incompatible with v.5.

Simple Features COMPLETE Uninstall
http://lime-technology.com/forum/index.php?topic=28927.msg279478#msg279478

AFTER uninstalling SF, and things are back to normal and working, you might consider adding 'DYNAMIX', which is the successor to SF.
April 10, 2014
"Sorry. Don't have the time to read right now at work. That's why I made this post. So is my drive failing? I don't know how to interpret these things... The server hasn't moved at all, so I don't think it's a loose cable... But also, I thought if there's an actual error there would be a red circle next to the drive and not a green one, which indicates everything is fine?"

Your questions about this drive are good ones. The short answer is that the drive has not failed. But it has detected 72 internal read errors, and it looks like 26 of them have been reported back to the OS.

A pending sector means that an attempt to read a specific sector has failed, or has triggered error recovery sufficient for the drive to put that sector in a "pending" state. On the next write, that sector will be reevaluated and likely be "reallocated", meaning a spare sector will replace the bad sector. Drives have a limited number of spare sectors for this type of remapping. The reallocated sector SMART attribute, currently zero on this drive, tells you how many times this has happened.

I have occasionally seen pending sectors simply go away. I can't explain it, but suddenly all the pending sectors are gone. Maybe a bug or something. When this has happened there seem to be no ill effects. But once sectors actually start to remap, my experience is that they continue to remap and the drive's days are numbered.

In your situation I like to run parity checks and monitor the results. If there is a read error, unRAID will handle it, use parity to figure out the data, and issue the magic write. That write will trigger the reallocation. You'll start to see the SMART attributes reflect this type of activity. If you can run 3 straight parity checks and the pending / reallocated sectors hold steady, I would tend to trust the disk and continue to monitor.
But if they get worse on every parity check or two, even if only by a few each time, and don't stabilize, I would look to RMA the drive. I think of it like a pothole. Every time a car goes by, a little more pavement is affected, and it is just a matter of time before the road is unusable. But everyone has their own tolerance, and some are willing to keep driving, monitor the pothole, and only replace the drive when spare sectors are in short supply. The closer the normalized value of the SMART attribute gets to zero, the worse it is, and when it passes the threshold the attribute is considered failed.

unRAID is not monitoring these values to red ball the disk, though. It is looking for errors writing to the disk. And write errors are frequently caused by loose cables rather than bad disks, so a red ball is actually a poor indicator of drive health.

So run a few parity checks and monitor the results. If things get worse every time, RMA the drive. If it stabilizes, keep it but continue to monitor it over time. Hope this helps.
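The "run a parity check, then compare" routine can be sketched as a small script. This is only an illustration: it assumes you have saved the attribute table (for example from `smartctl -A` on the drive) before and after a parity check. The `before` rows are copied from the report in this thread, while the `after` rows and the helper names `parse_watched` and `worsened` are hypothetical.

```python
# SMART attribute IDs worth tracking between parity checks.
WATCHED = {5, 197, 198}


def parse_watched(report):
    """Return {attribute_name: raw_count} for the watched attribute rows."""
    counts = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have 10+ columns: ID, NAME, FLAG, VALUE, WORST,
        # THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE.
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) in WATCHED:
            counts[fields[1]] = int(fields[9])
    return counts


def worsened(before, after):
    """Return {name: (old, new)} for every count that went up."""
    return {
        name: (before.get(name, 0), count)
        for name, count in after.items()
        if count > before.get(name, 0)
    }


# Rows taken from the SMART report posted in this thread.
before = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   098   000    Old_age   Always       -       72
198 Offline_Uncorrectable   0x0010   100   098   000    Old_age   Offline      -       72
"""

# HYPOTHETICAL later report, invented only to exercise the comparison.
after = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   100   098   000    Old_age   Always       -       64
198 Offline_Uncorrectable   0x0010   100   098   000    Old_age   Offline      -       64
"""

changes = worsened(parse_watched(before), parse_watched(after))
print(changes)
```

In this made-up case the pending count drops while the reallocated count rises, which is the pattern described above when rewritten sectors get remapped. Counts that keep climbing over several checks would be the signal to RMA.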
April 10, 2014 Any pending sector on any array disk will interfere with rebuilding a different disk which has failed. An array with any pending sectors on any array disk is NOT protected. If another array disk fails, the sectors on it that correspond to the pending sectors cannot be determined. The entire disk has not failed, but the parts of it that cannot be read are a partial failure. The source of any read errors needs to be corrected. See here: http://lime-technology.com/wiki/index.php/Troubleshooting#Resolving_a_Pending_Sector
April 10, 2014
"Any pending sector on any array disk will interfere with rebuilding a different disk which has failed. An array with any pending sectors on any array disk is NOT protected. If another array disk fails, the sectors on it that correspond to the pending sectors cannot be determined. The entire disk has not failed, but the parts of it that cannot be read are a partial failure. The source of any read errors needs to be corrected. See here: http://lime-technology.com/wiki/index.php/Troubleshooting#Resolving_a_Pending_Sector"

I disagree with the concept that ANY pending reallocation WILL interfere with a rebuild. I would agree that they raise the risk of a problem, but I have actually never seen a rebuild negatively impacted by a pending sector. I used to see many more read errors coming back in the IDE drive days, but I believe the drives have sophisticated error-correcting features built into the hardware, and even if the drive has read issues, it is able to return correct data.
April 10, 2014 Tom did not respond when I asked him how unRAID handles an unreadable sector during a rebuild. It appears not to halt the rebuild, but the corresponding data cannot be computed if the sector cannot be read. A sector is marked as unreadable (pending) after the error correction on the disk has failed. A pending sector cannot return correct data. The disk is reporting that a read failure has occurred and the integrated error correction has NOT been able to recover. The corresponding sectors on the other disks may not be important, i.e. the space may be unallocated, or the error may appear as static in video or audio. However, they may be very important. Do you want the numbers in a spreadsheet to be modified during a rebuild? How about your tax information? Do you feel lucky? Give any pending sector the same attention that is given to a failed disk. The situations differ only in degree; the pending sector represents a partial disk failure.
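The argument in this post can be illustrated with a toy rebuild. This is a sketch under stated assumptions, not a description of what unRAID actually does: it assumes single XOR parity and that a pending sector returns a read error (modeled as `None`) instead of data. The one-byte "sectors" and the `rebuild_disk` helper are invented for the example.

```python
def rebuild_disk(surviving_disks):
    """Rebuild a failed disk from the survivors (data disks plus parity).

    Each disk is a list of int "sectors"; None marks a sector that could
    not be read. The result holds None wherever the original value
    cannot be computed.
    """
    rebuilt = []
    for stripe in zip(*surviving_disks):
        if any(sector is None for sector in stripe):
            # One input is missing, so XOR cannot recover this sector.
            rebuilt.append(None)
        else:
            value = 0
            for sector in stripe:
                value ^= sector
            rebuilt.append(value)
    return rebuilt


# Hypothetical array: three data disks with four one-byte sectors each.
disk1 = [0x11, 0x22, 0x33, 0x44]
disk2 = [0xA0, 0xB0, 0xC0, 0xD0]  # this disk fails outright
disk3 = [0x05, 0x06, 0x07, 0x08]
parity = [a ^ b ^ c for a, b, c in zip(disk1, disk2, disk3)]

# disk3 survives, but its second sector is pending and unreadable.
disk3_with_pending = [0x05, None, 0x07, 0x08]

result = rebuild_disk([disk1, disk3_with_pending, parity])
lost = [index for index, sector in enumerate(result) if sector is None]
print(result, lost)
```

Under these assumptions only the stripe containing the unreadable sector is lost and the rest of the failed disk rebuilds normally. Whether that one sector matters depends on what was stored there, which is the point being argued above.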
April 10, 2014 Author Wow. Thanks guys. You are the best! I didn't even know it was not compatible. It seems to be working fine for me. Guys, this whole thing made me realize that I need to somehow have a backup of my cache drive! I configured everything a year ago, but man, if I had to do it all over again I would probably put money together for a huge Synology. I guess it's for another thread, but I would like my cache drive to be mirrored and be truly plug and play. Drive fails, all I have to do is replace...
April 10, 2014 Author Just re-read everything. Wow, this seems so complex. I need to run that SMART test in terminal every time? Is there really no way to have this be more user friendly? How is Synology able to do it so well and keep things simple? Drive fails: take out, replace, rebuild, all done. Are there any plugins that show the SMART errors in the GUI? Or is it enough for me to just watch for errors again on the main page?
April 10, 2014 I rarely run SMART... only if a disk is failing. I *do* have 'DYNAMIX' installed. It has a page that queries summary SMART data from all drives automatically when opened.
April 10, 2014 Author Oh okay. Because you guys are talking about these specific errors, and I know unRAID only has one error column.
April 10, 2014 Author So what's the basic rule of thumb? Is it enough to just look at the error column on the main page? The reason I'm being so picky now is that I want to set up a storage server at my parents' place too. So I need them to be able to do simple stuff like run parity checks, detect and correct errors, and replace drives as needed (it will have hot-swappable cages). But the truth is, if I can't even remember all these things, I can't expect them to be able to maintain it. How is this made so simple in Synology systems? Or is it enough to just keep an eye on the error column?