Read errors on one disk, currently disabled and emulated. Next steps...



Hello,

 

I have a disabled drive due to read errors.  I'm at work and can't physically access the server until later, so diagnostic files and such will come later.  I'd like to ask a few questions first in case I want to order a drive or something now so it's here tonight (or tomorrow).  I'm worried and want to make sure I don't do something wrong....  First disk issue in 7 years on Unraid.

 

Facts

  • Unraid v6.8.0
  • 1x 6TB WD Red parity.
  • 1x 6TB WD Red Enterprise
  • 3x 3TB Seagate drives shucked from external enclosures 6 years ago (notorious for early failures).  One of these is the disabled drive.  I'm aware that I'm living on borrowed time here.
  • Smart values were all good, no errors or pending sectors, etc. when checked a few weeks ago.
  • Power flipped off then on this morning and kicked off a parity check. 
  • Halfway through the check I get a notification that there are read errors and Disk 2 is disabled. 
  • Verified by logging in remotely on my phone.  Showing 2048 errors Disk2 and doing a read check.
  • Judging by the giant generator on the street outside my house and the electrical company truck sitting there, I'm assuming my house is being powered by this generator while they are working on something.  Yay for clean power.

 

Questions

  1. Should I order a replacement drive ASAP?   Is there a possibility that it's something else (bad connection or similar)?
  2. I'll probably replace all the old Seagate disks with 10 or 12TB drives.  If I replace the current failed drive with a >6TB drive I'll have to format for 6TB to match parity, then later I'll need to reformat it once I get a bigger parity drive, correct?   I'd rather avoid getting a 6TB drive now (it's a SFF case so larger capacity the better).
  3. What should be my first steps tonight for troubleshooting?  My thoughts
    • Backup USB
    • Get the diagnostics
    • Run a SMART on Disk 2 if possible
    • Reboot and see if drive is responsive?
  4. What are my options for data recovery?
    • Copy data off, assuming drive is responsive (it's not currently).
    • Rebuild with new drive?
  5. The 3x Seagate 3TB drives were at 90% for several years and didn't see much or any write activity.  I recently deleted some old media I didn't need, so it's likely new files have been written in the last month.  Might this have accelerated the death clock a bit?

 

And of course this all happens as I'm in the middle of finally putting a proper offsite backup system in place.  The most critical files (several TB of family photos) are duplicated somewhere else thankfully.

 

Thanks!

edit0:  numbers added on questions

 

Edited by jimbobulator
56 minutes ago, jimbobulator said:

I have a disabled drive due to read errors.

Unraid only disables a disk for write errors. It is possible for a read error to cause Unraid to get the data for the failed read from the parity calculation and try to write it back to the disk. If that write-back fails then the disk would be disabled.
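To make that sequence concrete, here is a toy Python model of it: a read fails, the block is recomputed from parity and the other disks, and the disk is only marked disabled if writing the recomputed block back also fails. This is a rough sketch under simplifying assumptions (single XOR parity, integer "blocks"), and the `Disk` class and `read_block` function are made-up names for illustration, not Unraid internals.

```python
from functools import reduce
from operator import xor


class Disk:
    """Toy disk: integer blocks plus the block indexes that fail on read/write."""

    def __init__(self, blocks, bad_reads=(), bad_writes=()):
        self.blocks = list(blocks)
        self.bad_reads = set(bad_reads)
        self.bad_writes = set(bad_writes)
        self.disabled = False

    def read(self, i):
        if i in self.bad_reads:
            raise IOError(f"read error at block {i}")
        return self.blocks[i]

    def write(self, i, value):
        if i in self.bad_writes:
            raise IOError(f"write error at block {i}")
        self.blocks[i] = value
        self.bad_reads.discard(i)  # in this toy model a good rewrite heals the block


def read_block(target, others, parity, i):
    """Read block i of `target`, falling back to the parity calculation."""
    def from_parity():
        # XOR of parity with every other data disk gives the missing block.
        return reduce(xor, (d.read(i) for d in others), parity.read(i))

    if target.disabled:
        return from_parity()  # disk is emulated, never touch it
    try:
        return target.read(i)
    except IOError:
        value = from_parity()
        try:
            target.write(i, value)  # try to put the good data back
        except IOError:
            target.disabled = True  # only the failed write disables the disk
        return value
```

In this model a block that fails to read but accepts the write-back leaves the disk enabled; only a block that fails both ways flips `disabled`, and the caller still gets the correct data either way.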

 

1 hour ago, jimbobulator said:

Should I order a replacement drive ASAP?   Is there a possibility that it's something else (bad connection or similar). 

It is possible something else is the cause. In any case it is always safer to rebuild to another disk if you have one or can afford it, and save the original in case there are issues during rebuild.

 

1 hour ago, jimbobulator said:

I'll probably replace all the old Seagate disks with 10 or 12TB drives.  If I replace the current failed drive with a >6TB drive I'll have to format for 6TB to match parity, then later I'll need to reformat it once I get a bigger parity drive, correct?   I'd rather avoid getting a 6TB drive now (it's a SFF case so larger capacity the better).

This is confusing, I suspect because you are confused about what "format" does. You must let Unraid format any disk it will use in the array, and you must NEVER format a disk you are rebuilding. And you must NEVER format any disk that has data on it you want to keep. I just don't see how the idea of "format" figures into this at all.

 

See if you can clarify what you are thinking in that part.

 

1 hour ago, jimbobulator said:

What should be my first steps tonight for troubleshooting?

Don't reboot. Just get us Diagnostics, and we can consider how to proceed from there. Maybe best if you don't do anything else without further advice since some of the other ideas you have may not be on the right track.

 

1 hour ago, jimbobulator said:

The most critical files (several TB of family photos) are duplicated somewhere else thankfully.

👍


I thought I should go ahead and put this out since you are considering buying a new disk.

 

1 hour ago, jimbobulator said:

I'll probably replace all the old Seagate disks with 10 or 12TB drives.  If I replace the current failed drive with a >6TB drive I'll have to format for 6TB to match parity, then later I'll need to reformat it once I get a bigger parity drive, correct?   I'd rather avoid getting a 6TB drive now (it's a SFF case so larger capacity the better).

I think perhaps you are talking about "short stroking" the drive. I remember some discussion about this a very long time ago for some reason, but I don't recommend it and I'm not entirely sure it applies here.

 

There is already a method for replacing a disk with a disk larger than parity. Actually, what it does is copy parity to the larger disk then use the original parity for rebuilding the failed data disk.

 

This will give you something to read while we wait on diagnostics, the Parity Swap procedure:

 

https://wiki.unraid.net/The_parity_swap_procedure

 


Thanks for the answers!

 

2 hours ago, trurl said:
3 hours ago, jimbobulator said:

I have a disabled drive due to read errors.

Unraid only disables a disk for write errors. It is possible for a read error to cause Unraid to get the data for the failed read from the parity calculation and try to write it back to the disk. If that write-back fails then the disk would be disabled.

Ok, the notification just said read errors, but I'll check tonight.

 

1 hour ago, trurl said:
3 hours ago, jimbobulator said:

I'll probably replace all the old Seagate disks with 10 or 12TB drives.  If I replace the current failed drive with a >6TB drive I'll have to format for 6TB to match parity, then later I'll need to reformat it once I get a bigger parity drive, correct?   I'd rather avoid getting a 6TB drive now (it's a SFF case so larger capacity the better).

This is confusing, I suspect because you are confused about what "format" does. You must let Unraid format any disk it will use in the array, and you must NEVER format a disk you are rebuilding. And you must NEVER format any disk that has data on it you want to keep. I just don't see how the idea of "format" figures into this at all.

Right that was not clear at all.   First and foremost, my priority is to recover the data from the failed disk and get the array healthy again.  I could buy a 6TB drive and rebuild disk2 onto that one.  But I'd rather not buy another 6TB drive, as I have limited physical space in my server.  So if the replacement drive could be bigger than my current parity disk, I might consider that.

 

As it's not a normal procedure, let's scrap this idea.   This is why I asked...   What I was thinking, which is likely the 'short-stroking' you mention:  Could I assign a new disk2 that is larger than parity and let it rebuild?  Would Unraid limit the size of the filesystem to parity size?  Then later on, I'd replace parity with a drive matching or exceeding disk2's true capacity, copy the data from disk2 somewhere else, and then format disk2 to its true capacity.

 

1 hour ago, trurl said:

There is already a method for replacing a disk with a disk larger than parity. Actually, what it does is copy parity to the larger disk then use the original parity for rebuilding the failed data disk.

Ok, this might work...  I'm already at the device limit for my Unraid Basic license and the limit of SATA ports on my mobo, so I'm not sure I can do all the steps yet.  Will read more closely and see if this is a possible solution.


Well, unfortunately my server rebooted today for unknown reasons.  I got a notification that "array turned good", which was sent after the reboot.  So sadly I don't have the juicy syslogs, and stupidly I turned off the syslog to flash setting a while back.  Doh.

 

Here is a recap of the notifications I got today.

07:36 - Notice: Parity Check Started
10:54 - Alert: Disk 2 in error state (disk dsbl)
10:54 - Warning: array has errors.  Array has 1 disk with read errors
15:11 - Notice: array turned good.  Array has 0 disks with read errors
15:11 - Notice: Parity check has finished (0 errors).  Duration 17 hours, 53 minutes, 19 seconds.  Average Speed nan B/s

Notes

  • After the notification at 10:54 I logged in remotely and saw the disk2 was disabled, and the "Main" page summary that disk2 had 2048 errors.
  • I couldn't see the SMART data for disk. 
  • The parity check was now a 'read check'
  • I wasn't able to log back in throughout the day.
  • The parity check notification is bugged - clearly wasn't 17 hours, and probably should've said "Read check has finished", I suppose?
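As a sanity check on that last note, the time between the first and last notifications can be worked out directly. A minimal Python sketch, assuming all the notifications in the recap fall on the same day (the `elapsed` helper is only for illustration):

```python
from datetime import datetime


def elapsed(start, end, fmt="%H:%M"):
    """Return the time between two same-day HH:MM timestamps as a timedelta."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Timestamps taken from the notification recap above.
print(elapsed("07:36", "10:54"))  # check start -> disk 2 disabled
print(elapsed("07:36", "15:11"))  # check start -> "finished" notice
```

The second value comes to 7 hours 35 minutes, well short of the 17 hours, 53 minutes reported in the notification.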

Anyway, disk2 now shows 88 reallocated sectors.  Most of the contents of this disk are non-important media, but there are some things with clear priority that I'd like to try and save.  What's the best step forward?

 

Diagnostics removed.

Edited by jimbobulator

Not sure why it would say array turned good when you still had a disabled disk.

 

You cannot rebuild the disk to a disk larger than parity, but you can do the parity swap procedure.
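The size rule can be written down as a small decision function. This is only a sketch of the rule as described in this thread, assuming a single parity disk; `replacement_plan` is a hypothetical helper, and the sizes are nominal TB figures (in practice the exact capacities are what get compared).

```python
def replacement_plan(new_tb, failed_tb, parity_tb):
    """Pick a procedure for replacing a failed data disk (single parity assumed).

    new_tb    -- size of the replacement disk
    failed_tb -- size of the disk being replaced
    parity_tb -- size of the current parity disk
    """
    if new_tb < failed_tb:
        # A replacement smaller than the original cannot hold the rebuild.
        return "too small"
    if new_tb <= parity_tb:
        # No data disk may be larger than parity, so this is a plain rebuild.
        return "rebuild"
    # Larger than parity: the new disk becomes parity and the old parity
    # disk takes the place of the failed data disk.
    return "parity swap"


print(replacement_plan(6, 3, 6))   # rebuild
print(replacement_plan(10, 3, 6))  # parity swap
```

With the sizes in this thread (3TB failed disk, 6TB parity), a 6TB replacement is a normal rebuild, while a 10TB or 12TB replacement calls for the parity swap.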

 

15 minutes ago, jimbobulator said:

Most of the contents of this disk are non-important media, but there are some things with clear priority that I'd like to try and save. 

Everything should be recovered when you rebuild.

 

If you want to backup anything before you start you can copy files from the emulated disk to another computer. I never recommend trying to shuffle files to other disks in the array when you are already emulating a disk, since all disks must be read to get the contents of the emulated disk, and moving or copying to the array is just more activity when you are already without protection.

1 minute ago, trurl said:

Not sure why it would say array turned good when you still had a disabled disk.

Agreed, this is strange.  It's how it went down, though.

The drive is still disabled - I assume Unraid keeps the drive disabled on successive boots after a write error, to ensure no further damage?

 

1 minute ago, trurl said:

If you want to backup anything before you start you can copy files from the emulated disk to another computer. I never recommend trying to shuffle files to other disks in the array when you are already emulating a disk, since all disks must be read to get the contents of the emulated disk, and moving or copying to the array is just more activity when you are already without protection.

These were my thoughts as well.  I've got a plan now: copy off critical files to an external disk (paranoia), then attempt a parity swap with a bigger drive.  Will report back in a few days or a week once it's *hopefully* cleared up.

4 hours ago, jimbobulator said:

Ok, this might work...  I'm already at the device limit for my Unraid Basic license and the limit of SATA ports on my mobo, so I'm not sure I can do all the steps yet.  Will read more closely and see if this is a possible solution.

Parity swap does not require any more devices than you already have connected. The original "failed" disk would be removed and kept as a backup in case anything goes wrong. Probably most of its files can still be read if necessary.

 

3 minutes ago, jimbobulator said:

The drive is still disabled - I assume Unraid keeps the drive disabled on successive boots after a write error, to ensure no further damage?

A disabled disk is no longer in sync with parity. The only options to get it enabled again are to rebuild it, or New Config and rebuild parity instead. Rebuilding the disabled disk is the recommended procedure in most cases (and in your case).

 

When a write to a disk fails, Unraid disables it and won't use it again until rebuilt. But, the data for that failed write, and for any subsequent writes to the disabled (and emulated) disk is used to update parity. So, that failed write and any writes after that can still be recovered from the parity calculation. The original, physical disk is out of sync and no longer has valid contents, since it is missing any writes that happened after the failure. The valid contents are in the array and will be rebuilt from the parity calculation.
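A small worked example may help show why the emulated contents stay valid while the physical disk goes stale. The sketch below assumes single XOR parity and uses short lists of integers as stand-ins for disks; the function names are illustrative and not anything from Unraid itself.

```python
from functools import reduce
from operator import xor


def emulated_read(parity, healthy_disks, i):
    """Block i of the missing disk = parity XOR every healthy data disk."""
    return reduce(xor, (d[i] for d in healthy_disks), parity[i])


def emulated_write(parity, healthy_disks, i, value):
    """'Write' to the missing disk by updating parity so the XOR yields value."""
    parity[i] = reduce(xor, (d[i] for d in healthy_disks), value)


def rebuild(parity, healthy_disks):
    """Reconstruct the whole missing disk onto a replacement."""
    return [emulated_read(parity, healthy_disks, i) for i in range(len(parity))]


disk1 = [1, 2, 3]
disk2 = [4, 5, 6]  # the physical disk that gets disabled
disk3 = [7, 8, 9]
parity = [a ^ b ^ c for a, b, c in zip(disk1, disk2, disk3)]

# disk2 is now disabled; a later write of 42 to its block 1 only reaches parity.
emulated_write(parity, [disk1, disk3], 1, 42)

print(rebuild(parity, [disk1, disk3]))  # [4, 42, 6]  valid, emulated contents
print(disk2)                            # [4, 5, 6]   stale physical disk
```

The rebuilt contents include the write made after the failure, while the physical `disk2` never saw it, which is why rebuilding from parity is preferred over re-enabling the old disk as-is.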

2 minutes ago, trurl said:

A disabled disk is no longer in sync with parity. The only options to get it enabled again are to rebuild it, or New Config and rebuild parity instead. Rebuilding the disabled disk is the recommended procedure in most cases (and in your case).

Right, of course.  Thank you very much for the prompt and efficient help!

  • 2 weeks later...
