October 21, 2011

Hi,

I recently discovered unRAID and started to set up my own server using an HP ProLiant MicroServer and 3 Samsung HD204UI drives. I followed the instructions in the wiki to the letter, then precleared 2 disks and set them up as data disks. I copied my data over before building parity (I have read recommendations for both orders in the forums, before and after parity), and all went well so far.

Yesterday I started building the parity disk, and during the build one of my disks started showing read errors (384 of them):

Oct 20 20:32:31 Tower kernel: md: disk1 read error (Errors)
Oct 20 20:32:31 Tower kernel: handle_stripe read error: 2364507704/1, count: 1 (Errors)

The unMENU page shows "Parity updated 384 times to address sync errors.", which matches the number of read errors, and it reports parity as valid.

A short SMART test returns no errors. The SMART info for this drive is:

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 100   100   051    Pre-fail  Always   -           0
  2 Throughput_Performance  0x0026 252   252   000    Old_age   Always   -           0
  3 Spin_Up_Time            0x0023 067   067   025    Pre-fail  Always   -           10193
  4 Start_Stop_Count        0x0032 100   100   000    Old_age   Always   -           17
  5 Reallocated_Sector_Ct   0x0033 252   252   010    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x002e 252   252   051    Old_age   Always   -           0
  8 Seek_Time_Performance   0x0024 252   252   015    Old_age   Offline  -           0
  9 Power_On_Hours          0x0032 100   100   000    Old_age   Always   -           94
 10 Spin_Retry_Count        0x0032 252   252   051    Old_age   Always   -           0
 11 Calibration_Retry_Count 0x0032 100   100   000    Old_age   Always   -           543
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age   Always   -           7
181 Program_Fail_Cnt_Total  0x0022 252   252   000    Old_age   Always   -           0
191 G-Sense_Error_Rate      0x0022 252   252   000    Old_age   Always   -           0
192 Power-Off_Retract_Count 0x0022 252   252   000    Old_age   Always   -           0
194 Temperature_Celsius     0x0002 064   064   000    Old_age   Always   -           23 (Min/Max 19/33)
195 Hardware_ECC_Recovered  0x003a 100   100   000    Old_age   Always   -           0
196 Reallocated_Event_Count 0x0032 252   252   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0032 100   100   000    Old_age   Always   -           5
198 Offline_Uncorrectable   0x0030 252   252   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0036 200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x002a 100   100   000    Old_age   Always   -           0
223 Load_Retry_Count        0x0032 100   100   000    Old_age   Always   -           543
225 Load_Cycle_Count        0x0032 100   100   000    Old_age   Always   -           560

I know these drives had firmware issues in earlier versions. My disks are manufactured 2011.9, so I didn't do a firmware upgrade, since Samsung says they only produce drives with the fixed firmware now. Can these read errors be connected with this?

Please advise what to do: run a manual parity check, send in the drive, recopy the data, checksum all data on the disk...

If I can provide any further information, please let me know.

Many thanks,
taalas
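For anyone who wants to check a report like the one above programmatically, here is a minimal Python sketch. It assumes the report uses the ten-column layout shown in the post (the layout `smartctl -A` prints). The function names and the `WATCHED_IDS` list are made up for this illustration, and including attribute 198 is an assumption of the sketch, not something stated in the thread.

```python
# Attribute IDs to watch. 5 and 197 are the two named in this thread;
# 198 is added here as an assumption, not something the thread mentions.
WATCHED_IDS = (5, 197, 198)


def parse_smart_attributes(report):
    """Parse rows of a `smartctl -A` style table into {id: (name, raw_value)}."""
    attributes = {}
    for line in report.splitlines():
        fields = line.split()
        # Data rows have ten or more columns and start with a numeric ID;
        # this skips the header row and any blank lines.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # The raw value is the tenth column; extra text such as
        # "(Min/Max 19/33)" after it is ignored.
        if fields[9].isdigit():
            attributes[int(fields[0])] = (fields[1], int(fields[9]))
    return attributes


def nonzero_watched(attributes, watched=WATCHED_IDS):
    """Return the watched attributes whose raw value is above zero."""
    return {
        attr_id: attributes[attr_id]
        for attr_id in watched
        if attr_id in attributes and attributes[attr_id][1] > 0
    }


# A few rows copied from the report in the post above.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
194 Temperature_Celsius 0x0002 064 064 000 Old_age Always - 23 (Min/Max 19/33)
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 5
198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0
"""

if __name__ == "__main__":
    flagged = nonzero_watched(parse_smart_attributes(SAMPLE))
    for attr_id, (name, raw) in flagged.items():
        print(f"{attr_id} {name}: raw value {raw}")
```

With the rows from this drive, the only watched attribute with a non-zero raw value is 197 (Current_Pending_Sector).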
October 21, 2011 (Author)

To follow up on this: the error column on the unMENU Main page shows no errors at all; actually, it doesn't even show a number. I also forgot to mention that I am using the latest beta version.

While browsing the syslog I also found two other kinds of entries, which started appearing some days ago. This:

Oct 14 13:12:00 Tower kernel: Buffer I/O error on device sdd, logical block 295563123 (Errors)

and this:

Oct 14 13:12:03 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 (Errors)
Oct 14 13:12:03 Tower kernel: res 41/40:00:50:7b:ef/00:00:8c:00:00/40 Emask 0x409 (media error) <F> (Errors)
Oct 14 13:12:03 Tower kernel: ata3.00: error: { UNC } (Errors)

These also come from the affected disk.
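Counting these entries by hand gets tedious, so here is a rough Python sketch that tallies the three kinds of kernel messages quoted in this thread, grouped by device. The regular expressions are written against the exact lines posted above and are an assumption about how such lines look in general; the function name `count_disk_errors` and the labels are invented for this example.

```python
import re
from collections import Counter

# Patterns modelled on the syslog lines quoted in this thread.
PATTERNS = (
    ("md_read_error", re.compile(r"kernel: md: (disk\d+) read error")),
    ("buffer_io_error", re.compile(r"kernel: Buffer I/O error on device (\w+),")),
    ("ata_unc", re.compile(r"kernel: (ata\d+\.\d+): error: \{ UNC \}")),
)


def count_disk_errors(lines):
    """Count matching syslog lines, keyed by (kind, device)."""
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                counts[(kind, match.group(1))] += 1
                break  # a line matches at most one of these patterns
    return counts


# Lines copied from the posts above.
SAMPLE_LINES = [
    "Oct 14 13:12:00 Tower kernel: Buffer I/O error on device sdd, logical block 295563123 (Errors)",
    "Oct 14 13:12:03 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 (Errors)",
    "Oct 14 13:12:03 Tower kernel: ata3.00: error: { UNC } (Errors)",
    "Oct 20 20:32:31 Tower kernel: md: disk1 read error (Errors)",
]

if __name__ == "__main__":
    for (kind, device), count in sorted(count_disk_errors(SAMPLE_LINES).items()):
        print(f"{device}: {kind} x{count}")
```

To run it against a real log, pass an open file instead of the sample list, for example `count_disk_errors(open(path))`, where `path` is wherever your syslog lives.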
October 21, 2011

I'd say the disk has started to fail. It is showing 5 sectors pending reallocation. That shouldn't be happening.

If you still have the old data, then try new cables and run a preclear on that disk. You want the pending sectors to either go away or become reallocated sectors, and then you want to see the drive stabilize without either number changing (the raw values of attributes 5 and 197). I would run preclears either until it stays stable through at least 2 preclears or until it fails completely.

Another option is to just try to RMA it as it sits right now. In any case, you don't want to mess with a failing disk. It either needs to behave or go back.

One other thing: the unRAID "parity valid" and error indicators can be a little misleading. You need to investigate the disks if you see any errors, exactly like you are doing right now. Don't assume that everything is OK when there are errors just because it says "Parity Valid". All that really means is that a parity build was done, not that it was successful.

Peter
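The "preclear until it stabilizes" advice can be expressed as a small check. This is only a sketch of one possible reading: it treats the drive as stable when the raw values of attributes 5 and 197 are identical in the last `runs` readings, with one reading taken after each preclear. The function names and the `runs` parameter are hypothetical, and the example values in `history` are made up for illustration.

```python
def watched_values(snapshot, watched=(5, 197)):
    """Pick the watched raw values out of one {attribute_id: raw_value} reading."""
    return tuple(snapshot.get(attr_id, 0) for attr_id in watched)


def is_stable(snapshots, runs=2, watched=(5, 197)):
    """True if the last `runs` readings agree on every watched attribute.

    `snapshots` is a list of {attribute_id: raw_value} dicts, oldest first,
    one per completed preclear.
    """
    if len(snapshots) < runs:
        return False  # not enough preclears yet to judge
    recent = [watched_values(s, watched) for s in snapshots[-runs:]]
    return all(values == recent[0] for values in recent)


if __name__ == "__main__":
    # Made-up readings: pending sectors turn into reallocated sectors
    # after the first preclear and then stop changing.
    history = [
        {5: 0, 197: 5},
        {5: 3, 197: 0},
        {5: 3, 197: 0},
    ]
    print(is_stable(history[:2]))  # numbers still changing
    print(is_stable(history))      # last two readings agree
```

Stable numbers only mean the counts stopped moving; whether that is good enough to trust the drive is still the judgment call described above.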
October 21, 2011

It is exactly what Peter described. The unreadable sectors caused the read errors you saw.

Do you still have the original data on other disks? (I hope so, since otherwise the only copy is unreadable. Because you elected to install the parity drive after loading the data, parity was unable to assist in the reconstruction when the sectors turned out to be unreadable.)

If you had installed the parity drive and calculated parity before loading the data, unRAID could have dynamically fixed those unreadable sectors by rewriting them, thereby forcing a reallocation if needed.

You must now do file compares between the original disks and the unRAID server, since there is no easy way to determine which file(s) correspond to the unreadable sectors.

Joe L.
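To illustrate why an existing parity disk could have helped, here is a toy Python sketch of single-disk XOR parity. It assumes the parity is a plain bytewise XOR across the data disks and uses short byte strings as stand-ins for disk sectors; `xor_blocks` and the sample data are invented for this example and are not unRAID code.

```python
def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    result = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


if __name__ == "__main__":
    # Stand-ins for the same sector on two data disks.
    disk1_sector = b"unRAID data"
    disk2_sector = b"other disk."

    # Parity is computed while both sectors are still readable.
    parity_sector = xor_blocks([disk1_sector, disk2_sector])

    # Later disk1's sector becomes unreadable: it can be rebuilt from
    # parity plus the matching sector on every other data disk.
    rebuilt = xor_blocks([parity_sector, disk2_sector])
    print(rebuilt == disk1_sector)
```

The catch described above follows directly: if parity is first computed while a sector is already unreadable, there is no good copy of that sector to fold into the parity, so nothing can be rebuilt from it.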
October 22, 2011 (Author)

Thanks very much for your replies and explanations, these help a lot!

Since I knew that the data wouldn't be protected, I kept almost all of the original data. I say almost because I had to delete some (very few) files from their original location because space was running low (and the system needed to keep running). I have now copied these files (the ones I don't have duplicates of) off the unRAID server to another disk, and no errors occurred. Does the fact that they were readable mean that they weren't affected by the unreadable sectors? Or is this a long shot, and I should do a checksum file compare?

Since I bought the drive merely two weeks ago, I tend towards sending it back and hopefully getting a (working) new one. I hope the fact that the errors show in the SMART report is enough to convince them to RMA it? Or would you advise trying to preclear the drive to see what happens?

Also, I am now a little worried that I should have upgraded the firmware just to be sure, but these errors are a different kind, right?

Sorry for my bad English, by the way.
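A checksum file compare between the originals and the copy on the server can be sketched in a few lines of Python. This is a minimal example, assuming both copies are reachable as directory trees with the same relative layout; the function names are made up, and SHA-256 is simply the hash chosen for the sketch (the thread only says "checksum").

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(original_root, copy_root):
    """Return (missing, different) lists of paths relative to original_root."""
    original_root = Path(original_root)
    copy_root = Path(copy_root)
    missing, different = [], []
    for source in sorted(original_root.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_root)
        target = copy_root / relative
        if not target.is_file():
            missing.append(relative)
        elif file_digest(source) != file_digest(target):
            different.append(relative)
    return missing, different
```

Calling `compare_trees(originals, server_copy)` with your own two paths returns the files that are missing from the copy and the files whose contents differ. Files that exist only on the server are not reported, since there is no original to compare them with.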
October 22, 2011 (Author)

Possibly unrelated, but I want to be sure: I have 2 errors in the syslog that correspond to another disk (I think these happened around the time of installation):

Oct 13 15:43:33 Tower kernel: ata1.00: failed to enable AA(error_mask=0x1) (Errors)
Oct 13 15:43:33 Tower kernel: ata1.00: failed to enable AA(error_mask=0x1) (Errors)
October 23, 2011

Yes, they are different errors. Sectors pending reallocation should mean the disk is having problems reading the actual surface of the platter. That said, I have seen a few people find that these pending sectors went away with a new power supply or new cables connected to the drive.

Peter
November 2, 2011 (Author)

Hi again,

I managed to arrange for the drive to be sent back for RMA (I hope they acknowledge the defect if I attach the SMART reports and such). That said, I am unsure how to proceed on two things and hope you can give me some advice:

1) the defective drive

It currently contains my data. Can I erase it by writing zeros from an unRAID terminal? I faintly remember that this can be accomplished by 'dd'ing from /dev/zero to the device... could you give me advice on how to do this? Will this change the drive's SMART status (as I would really like to send it in with the sectors pending reallocation still showing)?

2) the remaining raid

As the two remaining drives have shown no problems so far, I would really like to use them while I wait for the replacement drive. What should I do to make sure that all is stable? Since the errors occurred while building parity, I guess it would be advisable to build parity again? What would be the best way to go: remove defective drive from array -> shutdown -> remove defective drive from system -> start -> build parity (or rebuild)? Short step-by-step instructions would be greatly appreciated.

Also: I do have data on the remaining data drive. If I CRC check this data against the originals after this whole process, would it be safe to assume all is well? Or would you advise starting all over again and recopying?

Thanks very much for your help!

edit: I totally forgot to ask: I am using the beta version (5.0-beta12a)... was this a mistake to begin with? Should I be using the 4.7 stable?
November 2, 2011

"edit: I totally forgot to ask: I am using the beta version (5.0-beta12a)... was this a mistake to begin with? Should I be using the 4.7 stable?"

My personal feeling on this one point is that, since you are not running 3TB drives, I'd stick with 4.7 until you are comfortable that your hardware is stable. The latest builds add some neat stuff, but they don't have as long a stable track record as 4.7.