read errors while building parity, need help


taalas


Hi,

 

I recently discovered unRAID and started to set up my own server using an HP ProLiant MicroServer and 3 Samsung HD204UI drives.

 

I followed the instructions in the wiki to the letter, then precleared and set up 2 disks as data disks. I copied my data before building parity (I have read recommendations for both orders in the forums, before and after parity), and all went well so far.

 

I started building the parity disk yesterday, and during the build one of my disks started showing read errors (384 of them):

 

Oct 20 20:32:31 Tower kernel: md: disk1 read error (Errors)
Oct 20 20:32:31 Tower kernel: handle_stripe read error: 2364507704/1, count: 1 (Errors)

 

The unMENU page shows

 

Parity updated  384  times to address sync errors.

 

which matches the number of read errors, and parity is reported as valid.
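
To check that the read-error count really matches the 384 parity updates, the syslog can be tallied per disk. This is a minimal sketch, assuming the log lines look exactly like the `md: disk1 read error` excerpt above; the function name `count_md_read_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   Oct 20 20:32:31 Tower kernel: md: disk1 read error (Errors)
# The pattern is an assumption based on the excerpt quoted in this post.
READ_ERROR = re.compile(r"kernel: md: (disk\d+) read error")


def count_md_read_errors(lines):
    """Return a Counter mapping disk name -> number of md read-error lines."""
    counts = Counter()
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It accepts any iterable of lines, so an open syslog file can be passed in directly, for example `count_md_read_errors(open(path, errors="replace"))` with `path` pointing at a saved copy of the log.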

 

A short SMART test returns no errors. The SMART attributes for this drive are:

 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   067   067   025    Pre-fail  Always       -       10193
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       94
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       543
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
181 Program_Fail_Cnt_Total  0x0022   252   252   000    Old_age   Always       -       0
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   064   000    Old_age   Always       -       23 (Min/Max 19/33)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       0
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       543
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       560

 

I know these drives had firmware issues in earlier versions. My disks were manufactured in 2011.9, so I didn't do a firmware upgrade, since Samsung says they now only produce drives with the fixed firmware. Could these read errors be connected with this?

 

Please advise what to do: run a manual parity check, send in the drive, recopy the data, checksum all data on the disk...?

 

If I can provide any further information please let me know.

 

many thanks

 

taalas

 

 

Link to comment

To follow up on this:

 

The error column in unMENU (Main page) shows no errors at all - actually it doesn't even show a number ;)

 

And I forgot: I am using the latest beta version.

 

While browsing the syslog I also found 2 other kinds of entries, which started appearing some days ago:

 

this

 

Oct 14 13:12:00 Tower kernel: Buffer I/O error on device sdd, logical block 295563123 (Errors)

 

and this

 

Oct 14 13:12:03 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 (Errors)
Oct 14 13:12:03 Tower kernel:          res 41/40:00:50:7b:ef/00:00:8c:00:00/40 Emask 0x409 (media error) <F> (Errors)
Oct 14 13:12:03 Tower kernel: ata3.00: error: { UNC } (Errors)

 

which are also from the affected disk.
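
The `res` line above appears to carry the address of the failing sector. This is a sketch only: it assumes the twelve hex bytes follow the order the Linux libata error handler normally prints (status, error, sector count, LBA low/mid/high, then the high-order "hob" bytes, then device). The helper name `lba_from_res` is hypothetical.

```python
import re

# Twelve two-digit hex bytes separated by "/" or ":", as in
#   res 41/40:00:50:7b:ef/00:00:8c:00:00/40
RES_LINE = re.compile(r"res ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def lba_from_res(line):
    """Return the 48-bit LBA encoded in a libata 'res' line, or None."""
    match = RES_LINE.search(line)
    if match is None:
        return None
    fields = re.split(r"[/:]", match.group(1))
    # Assumed field order: status, error, nsect, lbal, lbam, lbah,
    # hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, device.
    lbal, lbam, lbah = fields[3:6]
    hob_lbal, hob_lbam, hob_lbah = fields[8:11]
    # Most significant byte first.
    return int(hob_lbah + hob_lbam + hob_lbal + lbah + lbam + lbal, 16)
```

Under that assumption the line quoted above decodes to sector 2364504912, which lies in the same region of the disk as the sector 2364507704 reported by the md read error in the first post.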

Link to comment

I'd say the disk has started to fail. It is showing 5 pending sectors for reallocation. That shouldn't be happening.

 

If you still have the old data, try new cables and run a preclear on that disk. You want the pending sectors to either go away or become reallocated sectors, and then you want to see the drive stabilize without either number changing (SMART attributes 5 and 197 in the report above). I would run preclears either until it stays stable through at least 2 preclears or until it fails completely.
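
Comparing attributes 5 and 197 between two runs can be scripted. This is a minimal sketch, assuming each SMART report is saved as text in the same column layout as the table in the first post (raw value in the tenth column); `raw_values` and `is_stable` are illustrative names, not part of any tool.

```python
# Attribute IDs to watch, as suggested above.
WATCHED = {5: "Reallocated_Sector_Ct", 197: "Current_Pending_Sector"}


def raw_values(report, ids=WATCHED):
    """Map watched attribute ID -> raw value from a SMART attribute table."""
    values = {}
    for line in report.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row ("ID# ATTRIBUTE_NAME ...") is skipped by isdigit().
        if len(parts) >= 10 and parts[0].isdigit() and int(parts[0]) in ids:
            values[int(parts[0])] = int(parts[9])
    return values


def is_stable(before, after):
    """True when the watched raw values are identical in both reports."""
    return raw_values(before) == raw_values(after)
```

For the report in the first post, `raw_values` gives `{5: 0, 197: 5}`; the drive would count as stable only once two consecutive preclears produce the same pair of numbers.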

 

Another option is to just try and RMA it as it sits right now.

 

In any case, you don't want to mess with a failing disk. It either needs to behave or go back.

 

One other thing, the unRAID parity check valid and errors indicators can be a little misleading. You need to investigate the disks if you see any errors, exactly like you are doing right now. Don't assume that everything is OK when there are errors because it says "Parity Valid". All that really means is that a parity build was done, not that it was successful.

 

Peter

Link to comment

It is exactly what was described.  The un-readable sectors caused the read-errors you saw.

 

Do you still have the original data on other disks? (I hope so, since otherwise the only copy is unreadable. Because you elected to install the parity drive after loading the data, parity was unable to assist in reconstruction when the sectors turned out to be unreadable.) If you had installed and calculated parity before loading the data, unRAID could have fixed those sectors dynamically by rewriting them, thereby forcing a reallocation if needed.

 

You must now do file compares between the original disks and the unRAID server, since there is no easy way to determine which file(s) correspond to the unreadable sectors.
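
One way to do such a file compare is to hash every file under the original location and under the copy on the server. This is a minimal sketch, assuming both trees are mounted locally and use the same relative paths; SHA-256 and the names `file_digest` and `compare_trees` are choices made here for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(original_root, copy_root):
    """Return (missing, different, unreadable) lists of relative paths."""
    original_root, copy_root = Path(original_root), Path(copy_root)
    missing, different, unreadable = [], [], []
    for src in sorted(p for p in original_root.rglob("*") if p.is_file()):
        rel = src.relative_to(original_root)
        dst = copy_root / rel
        if not dst.is_file():
            missing.append(rel)
            continue
        try:
            if file_digest(src) != file_digest(dst):
                different.append(rel)
        except OSError:
            # A read error on either side, e.g. from an unreadable sector.
            unreadable.append(rel)
    return missing, different, unreadable
```

Files that land in `different` or `unreadable` are the ones to recopy from the originals. Files that exist only on the server are not checked, because there is nothing to compare them against.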

 

Joe L.

Link to comment

Thanks very much for your replies and explanations, these help a lot!

 

Since I knew the data wouldn't be protected, I kept almost all of the original data. I say "almost" because I had to delete a few files from their original location when space ran low (the system needed to keep running). I have now copied those files (the ones I have no duplicates of) from the unRAID server to another disk, and no errors occurred. Does the fact that they were readable mean they weren't affected by the unreadable sectors? Or is that a long shot, and should I do a checksum file compare?

 

Since I bought the drive only two weeks ago, I am inclined to send it back and hopefully get a working replacement. I hope the errors shown in the SMART report are enough to convince them to accept the RMA. Or would you advise preclearing the drive to see what happens?

 

Also I am now a little worried that I should have upgraded the firmware just to be sure, but these errors are a different kind, right?

 

Sorry for my bad English, by the way ;)

Link to comment

possibly unrelated but wanting to be sure:

 

I have 2 errors in the syslog that correspond to another disk (I think these happened around installation time):

 

Oct 13 15:43:33 Tower kernel: ata1.00: failed to enable AA(error_mask=0x1) (Errors)
Oct 13 15:43:33 Tower kernel: ata1.00: failed to enable AA(error_mask=0x1) (Errors)

Link to comment
  • 2 weeks later...

Hi again,

 

I managed to arrange for the drive to be sent back for RMA (I hope they acknowledge the defect if I attach the SMART reports and such).

 

That said I am unsure on how to proceed on two things and hope you can give me some advice:

 

1) the defective drive

It currently contains my data. Can I erase it by writing zeros from an unRAID terminal? I faintly remember that this can be done by 'dd'ing from /dev/zero to the device... could you advise me on how to do this? Will this change the drive's SMART status? (I would really like to send it in with the pending sectors still showing.)

 

2) the remaining raid

As the two remaining drives have shown no problems so far, I would really like to use them while I wait for the replacement drive. What should I do to make sure that everything is stable? Since the errors occurred while building parity, I guess it would be advisable to build parity again? What would be the best way to go: remove defective drive from array -> shut down -> remove defective drive from system -> start -> build parity (or rebuild)?

Short step-by-step instructions would be greatly appreciated.

Also: I do have data on the remaining data drive. If I CRC-check this data against the originals after this whole process, would it be safe to assume all is well? Or would you advise starting all over again and recopying?
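
On the zero-fill idea in point 1: the same effect as copying /dev/zero onto the device can be sketched in Python. This is an illustration only; it assumes it is run as root against the correct device node (for example a hypothetical `/dev/sdX`), and it irreversibly destroys everything on the target. Going by the earlier reply about preclears, writing to the drive may make the pending sectors clear or turn into reallocated sectors, so the SMART counts could change afterwards.

```python
import os


def zero_fill(path, chunk_size=1 << 20):
    """Overwrite every byte of an existing file or block device with zeros.

    Returns the number of bytes that were overwritten.
    """
    zeros = bytes(chunk_size)
    # Unbuffered binary mode so write() reports how many bytes it accepted.
    with open(path, "r+b", buffering=0) as target:
        size = target.seek(0, os.SEEK_END)  # total size of the target
        target.seek(0)
        remaining = size
        while remaining > 0:
            written = target.write(zeros[:min(chunk_size, remaining)])
            remaining -= written
        os.fsync(target.fileno())  # flush to the device before returning
    return size
```

Trying it on a scratch file first is a cheap way to confirm the behavior before pointing it at a real disk.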

 

thanks very much for your help!

 

Edit: I totally forgot to ask: I am using the beta version (5.0-beta12a)... was this a mistake to begin with? Should I be using the 4.7 stable release?

Link to comment

 

My personal feeling on this one point is since you are not running 3TB drives, I'd stick with 4.7 until you are comfortable that your hardware is stable. The latest builds add some neat stuff, but they don't have as long a stable track record as 4.7.

Link to comment

Archived

This topic is now archived and is closed to further replies.
