Looking for advice from the experts

December 17, 2014

Let me start off by saying that I'm not an expert when it comes to servers (but I've learned a lot from building and using unRAID), so I have some questions that I'd like the community's input on. I am asking because I got involved in a project at work where we're trying to model the reliability/availability of a server that a vendor is building. The server uses a bunch of RAID-5 arrays for storage, and I have no firsthand experience with RAID-5. Most of the storage arrays are built using 4 x 4TB HDDs, and my biggest concern is encountering a URE during a rebuild after a failed disk. The data contained on the arrays is mission essential, and the vendor claims that they have an on-site backup that is no more than 24 hours old. I'm pressing the vendor for more info on that topic, but at this point I just need to better understand how a RAID-5 array behaves so I can properly model its availability.

1) When one of the drives in the array experiences a failure the array is still accessible and can provide data (although at a slower throughput), correct?

2) When an array is in the degraded state (one drive failed), what happens if a URE is encountered? I'm under the impression that the specific data/file being accessed could be unavailable. Is that correct? If this happens, does the array go offline, or does it continue to operate and you just have problems accessing the data on the disk where the URE is occurring?

3) After a disk fails it must be replaced with a new disk, and the data will be rebuilt onto it using the remaining data & parity info. I assume that the array is still functional at this point but with lower throughput, correct?

4) If a URE is encountered during a disk rebuild what happens? I think that the rebuild operation simply fails and the array remains functional but in the degraded state (#1 above). Is that correct?

5) Is there anything other than the URE rate (e.g., 1E15, 1E16, etc.) and drive size that influences the likelihood of encountering a URE during a rebuild operation? I had a long discussion about UREs with the vendor yesterday and they were claiming that the cache on the RAID controller can reduce the likelihood of encountering a URE. I think that is complete BS but I suppose I could be wrong.

Please excuse my ignorance and feel free to educate me as necessary.


December 17, 2014

at this point I just need to better understand how a RAID-5 array behaves so I can properly model its availability.

This is all dependent on the hardware involved; however, most scenarios are similar.

1) When one of the drives in the array experiences a failure the array is still accessible and can provide data (although at a slower throughput), correct?

Yes, the missing data is calculated from the remaining drives.
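To make "calculated from the remaining drives" concrete, here is a minimal sketch of single-parity reconstruction. It assumes plain XOR parity over one stripe, which is how RAID-5 parity is normally described; the block values and the `xor_blocks` helper are made up for illustration and do not reflect any particular controller's implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks of one stripe (illustrative values) plus their parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Simulate losing the drive that held data[1]:
# XOR the surviving data blocks with the parity block to get it back.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

Note that every surviving block in the stripe has to be readable for this to work, which is why a URE on a surviving drive while the array is degraded is a problem.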

2) When an array is in the degraded state (one drive failed), what happens if a URE is encountered? I'm under the impression that the specific data/file being accessed could be unavailable. Is that correct? If this happens, does the array go offline, or does it continue to operate and you just have problems accessing the data on the disk where the URE is occurring?

From what I've heard and experienced, the whole array fails and goes offline if the data cannot be recovered from the remaining drives.

This is why many people do not build large RAID-5 arrays, but instead string together multiple smaller RAID-5 arrays.

3) After a disk fails it must be replaced with a new disk, and the data will be rebuilt onto it using the remaining data & parity info. I assume that the array is still functional at this point but with lower throughput, correct?

It is usually functional, but some performance will suffer: either the rebuild runs slower or normal read/write performance is slower.

4) If a URE is encountered during a disk rebuild what happens? I think that the rebuild operation simply fails and the array remains functional but in the degraded state (#1 above). Is that correct?

If one of the drives being read to reconstruct the missing data hits the URE, the array usually goes offline.

If the destination drive gets a URE, what happens to the array depends on the hardware/software implementation.

5) Is there anything other than the URE rate (e.g., 1E15, 1E16, etc.) and drive size that influences the likelihood of encountering a URE during a rebuild operation? I had a long discussion about UREs with the vendor yesterday and they were claiming that the cache on the RAID controller can reduce the likelihood of encountering a URE. I think that is complete BS but I suppose I could be wrong.

I can't answer this authoritatively, but the vendor's comment about the cache is not true.

Also, with the newer Advanced Format drives, even sectors that are weak or have issues have a high probability of being read successfully.

Desktop drives, at least, retry many times. RAID-specific drives give up sooner to let the controller deal with the issue. In an enterprise solution, it's not so cut and dried.

In addition, there are reasons that ZFS and other environments do a scrub to check for bad sectors.

Depending on the width of the array, a periodic scrub of the underlying data is important.

There's a reason we schedule parity checks in unRAID.


December 17, 2014

It is important to note that unRAID is not RAID-5. The best description is that it has single parity, which is like RAID-5 (and RAID-3 and RAID-4), but unlike RAID-5 the parity is not distributed. There is a huge benefit to unRAID not distributing the parity: the data drives are complete, usable filesystems. This greatly reduces the amount of data loss in the event of failures and/or errors.

Without the specifics of your RAID-5 build, I would be very cautious about stating the performance impact of a failed drive or of the rebuild process, or how UREs are handled.

In commercial/enterprise arrays (and most everything engineered), there are three factors: cost, reliability, and performance.

RAID-5 is often described as the best low cost solution, with good performance and reliability. Which means there are better performing solutions, at higher costs and/or lower reliability. If reliability is your concern, reliability can be increased with higher costs and/or lower performance. If you can increase cost, you can get better performance and/or reliability.

1) There are many arrays which do not slow down when a drive has failed. The performance bottleneck on these systems is not the drives.

2) URE behavior varies depending on the system. Some will report the UNC (uncorrectable error) to the block device handler; others will take the entire device offline.

3) Most arrays allow an online rebuild, but there are a very few for which it is an offline operation. And there are some, like unRAID, where the rebuild itself is online but an offline period is required to reconfigure the array, so there is still an outage.

4) See #2

5) There are a huge number of things that impact rebuild operations. And controller cache does not reduce UREs. That whole discussion should be very suspect because of that claim.
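For the modeling side, a rough back-of-the-envelope sketch of the URE exposure during a rebuild of a 4 x 4TB RAID-5 array follows. The assumptions are mine and deliberately simple: every bit of the three surviving drives is read once, 4 TB means 4e12 bytes, bit errors are independent, and the drive's spec-sheet rate (for example, 1 error per 1E15 bits read) is taken at face value even though it is normally quoted as a worst-case bound. The function name `p_ure_during_rebuild` is hypothetical.

```python
import math

def p_ure_during_rebuild(surviving_drives, drive_tb, bits_per_error):
    """Probability of at least one URE while reading every surviving drive once.

    Assumes independent bit errors at a rate of 1 per `bits_per_error` bits.
    """
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))

# 4 x 4TB RAID-5 with one failed drive: 3 drives must be read in full.
for rate in (1e14, 1e15, 1e16):
    print(f"1 per {rate:.0e} bits: {p_ure_during_rebuild(3, 4, rate):.1%}")
```

Under these assumptions the result is roughly 62% at 1E14, 9% at 1E15, and 1% at 1E16. The only inputs are the amount of data read and the error rate; the controller cache does not appear anywhere in the calculation.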

It is best to know what performance you really need. You need to know the number of IOPS required and at what I/O size, which together yield throughput, but be careful about throughput numbers. They are often reported as a maximum, which is not a typical workload. Know your actual requirements in terms of IOPS and workload.

Bottom line: if you want better reliability, it's available, at a price.

You might read up on RPO (recovery point objective) and RTO (recovery time objective). Knowing them will help you explain the requirements to your storage vendor.
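As a quick illustration of how IOPS and I/O size combine into throughput, here is a small sketch. The workload figures are invented for the example, and the helper name `throughput_mb_s` is hypothetical; it uses decimal units (1 MB = 1000 KB).

```python
def throughput_mb_s(iops, io_size_kb):
    """Throughput in MB/s for a given IOPS figure and I/O size in KB."""
    return iops * io_size_kb / 1000.0

# The same 5000 IOPS gives very different throughput depending on I/O size.
small_io = throughput_mb_s(5000, 4)    # small random I/O
large_io = throughput_mb_s(5000, 64)   # larger sequential-style I/O
print(small_io, large_io)
```

This is why a single headline MB/s number says little on its own: it only means something alongside the I/O size and access pattern it was measured with.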
