Parity drive read error (md0: read error)



I recently upgraded the parity drive in my unRAID system from a 500GB IDE drive to a 750GB Barracuda 7200.10 SATA drive (it was pretty easy).

 

About five days ago, I noticed the error count on the parity drive wasn't the zero I was expecting. It was a 1. Two nights ago, I added an additional 750GB Barracuda to my server and after it finally became available to the array, I started moving content between the drives in a small attempt at better organization.

 

Today I woke up and the error count on the parity drive was "1" so I looked in the syslog and this was in it:

 

Feb 16 08:01:26 Tower kernel: ata2: status=0x50 { DriveReady SeekComplete }
Feb 16 08:01:26 Tower kernel: SCSI disk error : host 2 channel 0 id 0 lun 0 return code = 8000002
Feb 16 08:01:26 Tower kernel: Current sd08:11: sns = 70  0
Feb 16 08:01:26 Tower kernel: Raw sense data:0x70 0x00 0x00 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Feb 16 08:01:26 Tower kernel:  I/O error: dev 08:11, sector 28764376
Feb 16 08:01:26 Tower kernel: md0: read error!
Feb 16 08:01:26 Tower kernel: end_read_request 28764376/0, count: 1, uptodate 0.
Feb 16 08:07:38 Tower smbd[2336]: [2008/02/16 08:07:38, 0] rpc_server/srv_pipe.c:api_pipe_bind_req(993)
Feb 16 08:07:38 Tower smbd[2336]:   api_pipe_bind_req: unknown auth type 1 requested.

 

Question: Is the occasional error on the parity drive itself cause for alarm? Or do I just monitor it more closely for a while and see what happens? I don't know if the earlier error was the same sector or not.

 

I'm running unRAID 3.1 beta 2

My motherboard is older and doesn't have any SATA ports so I'm using a Promise SATA300 PCI adapter.

 

Thanks.

 

Aaron

Link to comment

A parity error is not the same as a read error, although a read error could cause a parity error.  A parity error is not specific to any drive, it is an indication that the parity calculation across all of the drives for a given bit position was wrong.  In the syslog, a parity error always says md0.

 

Having said that, your syslog excerpt does appear to show a read error on md0.  You might want to confirm the actual drive by checking the earliest mention of ata2, and the drive associated with it.

 

I would consider it cause for mild alarm, definitely something to watch.  Read errors and parity errors are never acceptable.  As others here have said, probably the highest failure rate for any drive is within the first month.  Capture and keep the syslog after any error, so you can compare the error messages and sector numbers.
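For comparing sector numbers across saved syslogs, a small script can do the bookkeeping. This is only a sketch: it assumes the kernel keeps printing I/O errors in the same "I/O error: dev ..., sector ..." form as the excerpt above, and the function name failing_sectors is made up for illustration.

```python
import re
from collections import Counter

# Assumed line format, taken from the syslog excerpt in this thread:
#   "... kernel:  I/O error: dev 08:11, sector 28764376"
IO_ERROR = re.compile(
    r"I/O error: dev (?P<dev>[0-9a-fA-F]+:[0-9a-fA-F]+), sector (?P<sector>\d+)"
)

def failing_sectors(lines):
    """Count how often each (device, sector) pair shows up in I/O error lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[(match.group("dev"), int(match.group("sector")))] += 1
    return counts

# Two lines copied from the excerpt; only the first is an I/O error line.
sample = [
    "Feb 16 08:01:26 Tower kernel:  I/O error: dev 08:11, sector 28764376",
    "Feb 16 08:01:26 Tower kernel: md0: read error!",
]

for (dev, sector), count in failing_sectors(sample).items():
    print(f"dev {dev} sector {sector}: {count} error(s)")
```

Feeding it the lines from every saved syslog shows at a glance whether the same sector keeps failing or new ones are appearing.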

 

Link to comment

A parity error is not the same as a read error, although a read error could cause a parity error.  A parity error is not specific to any drive, it is an indication that the parity calculation across all of the drives for a given bit position was wrong.  In the syslog, a parity error always says md0.

 

Not quite... what you are describing is a parity sync error & these can show up only when running a Check operation.  The error in aaron330i's syslog is a media read error on the parity hard drive (md0 refers to the parity disk).

 

Having said that, your syslog excerpt does appear to show a read error on md0.  You might want to confirm the actual drive by checking the earliest mention of ata2, and the drive associated with it.

 

I would consider it cause for mild alarm, definitely something to watch.  Read errors and parity errors are never acceptable.  As others here have said, probably the highest failure rate for any drive is within the first month.  Capture and keep the syslog after any error, so you can compare the error messages and sector numbers.

 

 

Good advice!  Also, run a parity check from time to time.

Link to comment

It helps to know a bit about how SMART drives work, and what unRAID does, with physical read errors.

 

When a drive gets an error reading a sector, it will return the error to the OS, but it will also "remember" that the sector is bad.  The next time someone tries to WRITE to that sector, the drive will take the bad sector out of service and assign another (spare) sector in its place.  This is completely invisible to the OS.  It happens with every modern hard drive on the planet.

 

So, when Tom gets a read error on the parity (or any other) drive doing a parity check (or any other read operation), he will correctly compute the value that SHOULD have been read at that sector (using redundancy).  He will then WRITE that sector back to the offending drive.  The drive, noting that it had a read error on that sector before, will promptly take the sector out of service, assign a new one, and Tom's write will put the exact correct data on this fresh remapped spot on the disk.  (If he DIDN'T do this, the read error would stay there until that sector needed to be rewritten for some other reason (possibly a very long time) - and if you ever had to rebuild parity, you'd have a good chance of data corruption.)
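To make the "compute what SHOULD have been read" step concrete, here is a minimal sketch. It assumes single XOR parity (each parity byte is the XOR of the bytes at the same position on every data disk) and uses tiny made-up 3-byte "sectors"; it is not unRAID's actual code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical contents of the same sector on three data disks.
data_sectors = [b"\x0f\xf0\xaa", b"\x33\x55\x01", b"\xff\x00\x10"]

# What the parity disk should hold for that sector.
parity_sector = xor_blocks(data_sectors)

# Case 1: the parity sector is unreadable.
# Recompute it from the data disks and write it back to the parity disk.
rebuilt_parity = xor_blocks(data_sectors)

# Case 2: the sector on data disk 1 is unreadable.
# XOR the parity with the remaining data disks to recover it.
rebuilt_data = xor_blocks([parity_sector, data_sectors[0], data_sectors[2]])

print(rebuilt_parity == parity_sector)  # parity recovered
print(rebuilt_data == data_sectors[1])  # data recovered
```

Either way, the rebuilt bytes are what gets written back, and that write is what gives the drive the chance to remap the bad sector.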

 

This is why, if you get a read error on a parity check, you will NOT get another one when doing the parity check a second time.  You can also have confidence that your investment in unRAID was a good one - because it just did its job and saved your precious data!

 

Others may disagree with me, but I have found that a remapped sector once in a blue moon on a drive is not a big deal.  It happens.  If you start to see new bad sectors develop on the drive (even one a month is too much), I'd start to get worried and RMA the drive.

 

-Brian

Link to comment

What I had read was that when you get a large number of these, or they start growing, then your drive is dying for sure.

197 Current_Pending_Sector  0x0022  100  100  000    Old_age  Always      -      0

These are sectors that are waiting to be remapped, but cannot be, because the data can not be read first.
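If you want to watch that number over time, a line like the one above can be split into fields. This is a rough sketch that assumes the ten-column layout of a smartctl attribute table (ID, name, flag, value, worst, threshold, type, updated, when failed, raw value); the function name parse_smart_attribute is made up for illustration.

```python
def parse_smart_attribute(line):
    """Split one SMART attribute row into a dict (assumed ten-column layout)."""
    # Split at most 9 times so a raw value containing spaces stays in one piece.
    fields = line.split(None, 9)
    if len(fields) < 10:
        raise ValueError(f"not a SMART attribute row: {line!r}")
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),   # normalized value
        "worst": int(fields[4]),   # worst normalized value seen
        "thresh": int(fields[5]),  # failure threshold
        "raw": fields[9].strip(),  # raw count, kept as text
    }

row = "197 Current_Pending_Sector  0x0022  100  100  000    Old_age  Always      -      0"
attr = parse_smart_attribute(row)
print(attr["name"], "raw =", attr["raw"])
```

A raw value that stays at 0 is what you want to see; logging it after each parity check makes a growing count easy to spot.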

 

Link to comment

It helps to know a bit about how SMART drives work, and what unRAID does, with physical read errors.

 

...

(snip)

...

-Brian

 

Thanks for the lengthy explanation Brian, it explains a lot!

 

So far, so good with the drive. I think heat buildup may also have been part of the problem. After I removed the front of the case, I found the vent opening in front of that drive completely blocked with dust. After a thorough cleaning of the entire case, the average temps of all drives went down by 5-10 degrees, and the parity drive was one of the drives that dropped 10 degrees.

Link to comment
