Errors on Disk - Time to replace?

madgino · April 28, 2021

Hello,

I've recently installed Unraid and have had some errors on one of my disks; see the attached logs.

Can you please tell me what steps I should take next? I have a spare disk of the same model that can be used as a replacement.

Is it time to replace the disk or can it be saved?

tower-diagnostics-20210428-2324.zip

JorgeB · April 28, 2021

It's logged as a disk problem, run an extended SMART test.

madgino · April 29, 2021

I ran an extended test as you suggested. From what I can understand, it doesn't look good.

What do you think?

tower-smart-20210429-1113.zip

6of6 · April 29, 2021

I'd replace the drive. I've got a drive that's had "one" SMART error for years. You've got three (best I can tell) and in a small amount of time.

I would bet money that, if you replace it now, you won't lose a single byte of data. After it's replaced, you can test it further and decide if you want it back in your array.

6.

JorgeB · April 29, 2021

3 hours ago, madgino said:

What do you think?

SMART test failed, disk should be replaced.

codefaux · April 29, 2021

Data pro weighing in.

Your drive is reporting uncorrectable sectors. Twice at one address, once at another address. It hasn't marked the second address as a permanent-fail, thus only tagging 1 in the attributes column.

This is a big pre-warning state. Some drives may limp like this for years, but if this data or reliable performance are at all important, back it up and swap the drive.

Are you using Parity? If you're not, I'd like to note that this case was an early-warning case, and a Parity drive would've saved you even if you hadn't noticed.

Also an important question because it somewhat determines how you proceed, as does your experience with things like ssh and Linux in general.

madgino · April 29, 2021

1 hour ago, codefaux said:

Your drive is reporting uncorrectable sectors. Twice at one address, once at another address. It hasn't marked the second address as a permanent-fail, thus only tagging 1 in the attributes column.

This is a big pre-warning state. Some drives may limp like this for years, but if this data or reliable performance are at all important, back it up and swap the drive.

Are you using Parity? If you're not, I'd like to note that this case was an early-warning case, and a Parity drive would've saved you even if you hadn't noticed.

Also an important question because it somewhat determines how you proceed, as does your experience with things like ssh and Linux in general.

Thanks for the reply,

I don't know if I've read this correctly or not, but it appears that the 1st and 2nd SMART errors happened at the same time, "29801 hours", on the same block, "3894526032".

The third error happened some time later, at "48162 hours", on block "3166961520".

So from what I understand there have been "18361 hours", or 765 days (just over 2 years of power-on time), between the second and third errors.
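As a quick check of that arithmetic, here is a small Python sketch using the two hour values quoted from the error log. The variable names are only illustrative, and note that SMART counts power-on hours, so this is drive-on time rather than calendar time.

```python
# Power-on hours at which the SMART errors were logged (values quoted above).
first_two_errors_hours = 29801
third_error_hours = 48162

# Gap between the second and third errors, in power-on time.
gap_hours = third_error_hours - first_two_errors_hours
gap_days = gap_hours / 24
gap_years = gap_days / 365

print(gap_hours, "hours, about", round(gap_days), "days, about", round(gap_years, 1), "years")
```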

I'm assuming I can move any critical data onto the other drives in my array and leave this drive just for media content that can be easily retrieved again if the drive ever decides to totally fail.

I currently have 4 drives, with one used for parity.

With regard to Linux and ssh experience, I'm confident I'll manage.

Can you see any problem with this thought pattern?

JorgeB · April 29, 2021

3 minutes ago, madgino said:

I'm assuming I can move any critical data onto the other drives in my array and leave this drive just for media content that can be easily retrieved again if the drive ever decides to totally fail.

You can, but remember that Unraid requires all other drives to be read correctly for a rebuild, i.e., if another disk fails and you need to rebuild it, the rebuilt disk may end up with some data corruption.
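To illustrate why every surviving drive has to read back correctly, here is a minimal Python sketch. It assumes single parity computed as a plain byte-wise XOR across the data disks, which is a simplification for illustration and not Unraid's actual code; the disk contents and the helper name `xor_blocks` are made up.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, one 4-byte block each, plus their XOR parity.
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
disk3 = bytes([0x05, 0x06, 0x07, 0x08])
parity = xor_blocks([disk1, disk2, disk3])

# disk2 dies: rebuild it from parity plus every surviving data disk.
rebuilt_ok = xor_blocks([parity, disk1, disk3])

# Same rebuild, but disk3 returns one wrong byte (e.g. an unreadable sector).
disk3_bad = bytes([0x05, 0xFF, 0x07, 0x08])
rebuilt_bad = xor_blocks([parity, disk1, disk3_bad])

print(rebuilt_ok == disk2)   # rebuild is exact when all reads are good
print(rebuilt_bad == disk2)  # one bad read corrupts the rebuilt block
```

In this model the damage lands in the rebuilt disk at the same offset as the bad read, which is the corruption risk being described.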

codefaux · April 29, 2021

4 minutes ago, madgino said:

So from what I understand there have been "18361 hours", or 765 days (just over 2 years of power-on time), between the second and third errors.

I hadn't thought to inspect the timestamp on the errors, but that does line up.

4 minutes ago, madgino said:

I'm assuming I can move any critical data onto the other drives in my array and leave this drive just for media content that can be easily retrieved again if the drive ever decides to totally fail.

It seems you understand the data safety implications, but do consider that a failing disk will eventually begin to stall I/O heavily while it strains to pull bytes. It's safe and justifiable to leave it in, on the condition that you keep an eye on the numbers, and if you start having I/O issues, look there first.

6 minutes ago, madgino said:

I currently have 4 drives, with one used for parity.

If you're using read-modify-write for Parity, a failing disk won't stall I/O unless it is specifically involved in the I/O. Turbo Mode, aka reconstruct-write, reads the matching block from every other data disk for each block written, and can cause I/O stalls and/or Parity integrity issues if one of those reads fails. However, read-modify-write has its own drawbacks: each operation requires a full platter revolution, so your Parity will slow down overall I/O.
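The difference between the two write modes can be sketched in a few lines of Python. This assumes single parity as a byte-wise XOR of the data disks (a simplified model, not Unraid's implementation); the disk contents and the helper `xor` are hypothetical.

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

# Hypothetical array: three data disks and one parity disk, one block each.
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
disk3 = bytes([0x05, 0x06, 0x07, 0x08])
parity = xor(xor(disk1, disk2), disk3)

# We want to overwrite the block on disk2 with this data.
new_data = bytes([0xDE, 0xAD, 0xBE, 0xEF])

# read-modify-write: read only the old block on disk2 and the old parity,
# then fold the change into parity. disk1 and disk3 are never touched.
rmw_parity = xor(xor(parity, disk2), new_data)

# reconstruct-write: read the matching block on every OTHER data disk
# (disk1 and disk3 here) and recompute parity from scratch.
reconstruct_parity = xor(xor(disk1, disk3), new_data)

print(rmw_parity == reconstruct_parity)
```

Both paths arrive at the same parity. What differs is which disks get read, which is why a marginal disk only matters to read-modify-write when it is the one being written.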

2 minutes ago, JorgeB said:

You can, but remember that Unraid requires all other drives to be read correctly for a rebuild, i.e., if another disk fails and you need to rebuild it, the rebuilt disk may end up with some data corruption.

This is a very important fact to consider.

8 minutes ago, madgino said:

With regard to Linux and ssh experience, I'm confident I'll manage.

Okay - the short of it is, first you need to learn how unRAID maps your disk names. My less than ideal configuration results in some mangling, so each disk shows up as something like:

1AMCC_ZA1GXCX5000000000000

They all end in a block of twelve zeroes. They all start with 1AMCC_ -- which leaves the unique part in this case to be ZA1GXCX5. Using the SMART Identity information and a keen eye, that's the first part of my Parity drive's serial number. So, however I'm keeping track, I note the model, serial, what it shows up as, and Parity.
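If you have several disks to untangle, the same stripping can be done in a few lines of Python. This is only a sketch for my particular mangling: the `1AMCC_` prefix and the twelve-zero padding are what my controller produces, yours will likely differ, and the function name `unique_part` is made up.

```python
def unique_part(identifier, prefix="1AMCC_", padding="0" * 12):
    """Strip a known prefix and trailing padding from a mangled disk identifier.

    Returns the identifier unchanged if it does not follow the expected layout.
    """
    if not identifier.startswith(prefix) or not identifier.endswith(padding):
        return identifier
    return identifier[len(prefix):len(identifier) - len(padding)]

# The identifier from my Parity drive: prefix + unique part + twelve zeroes.
example_id = "1AMCC_" + "ZA1GXCX5" + "0" * 12
print(unique_part(example_id))
```

Whatever comes out is what you compare against the start of the serial number in the SMART identity information.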

When you make The Swap, plug everything in and boot it. If you're lucky and both controllers choose to identify drives the same, you're done. It'll probably just start. If not, it should come up complaining about missing disks, with dropdowns. You'll have to again use naming convention reverse-engineering to figure out which is which.

This is where ssh and smartctl come in handy if naming conventions have changed -- if you're working with direct disks, you can query the disk directly, such as smartctl -i /dev/sdb or similar. -i is identity. -a is all and also dumps selftest logs, error logs, attributes, and capabilities.
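If you'd rather script the lookup than eyeball it, something like the Python sketch below should work. It assumes smartctl is installed and on the PATH, that you run it with enough privileges to query the disk, and that the identity output contains a "Serial Number:" line (true for the direct-attached ATA drives I've seen; other device types may label it differently). The function names are my own.

```python
import subprocess

def parse_serial(info_text):
    """Pull the serial number out of `smartctl -i` style text, or return None."""
    for line in info_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "serial number":
            return value.strip()
    return None

def serial_of(device):
    """Run `smartctl -i <device>` and return the reported serial, or None."""
    result = subprocess.run(
        ["smartctl", "-i", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes for warnings too
    )
    return parse_serial(result.stdout)

# Example (needs a real disk, so it is left commented out):
# print(serial_of("/dev/sdb"))
```

Run it once per device and note which serial belongs to which /dev/sdX before you make The Swap.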

If your controller is ~~dumb like mine~~ a RAID controller, it's gonna be a headache. First resort, see if you can cross-flash it to IT mode. That's fancy speak for "don't be a RAID controller, just expose disks directly". If not, you'll need extra parameters for smartctl to access the disks ~~the dumb way~~ like I am, and I'm not gonna go there unless we need to.

Once you know how to identify your disks, just match them in the drop-downs. I'm reasonably sure that the order doesn't matter, except that Parity disks must stay in Parity-land, Cache disks must stay in Cache-land, and Disks are Disks in any order. I could be wrong though, and it doesn't hurt to be exact.

Questions?
