
Two disks with SMART errors, one marked disabled, advice appreciated


thither
Solved by JorgeB


Hello! I seem to be in a bit of a pickle. Two of the data disks in my 4-disk array are reporting SMART errors. One passes xfs_repair with no problems, while on the other xfs_repair fails with an I/O error.

 

[screenshot of the array's disk status]

 

xfs_repair on disk #2 gives me this:

 

    Phase 1 - find and verify superblock...
            - block cache size set to 1478872 entries
    Phase 2 - using internal log
            - zero log...
    zero_log: head block 1285582 tail block 1285582
            - scan filesystem freespace and inode maps...
            - found root inode chunk
    Phase 3 - for each AG...
            - scan and clear agi unlinked lists...
            - process known inodes and perform inode discovery...
            - agno = 0
    xfs_repair: read failed: Input/output error
    can't read block 0 for directory inode 2096964
    no . entry for directory 2096964
    no .. entry for directory 2096964
    problem with directory contents in inode 2096964
    cleared inode 2096964
    xfs_repair: read failed: Input/output error
    cannot read inode 3144960, disk block 3144992, cnt 32

 

(That's from xfs_repair -v; I get errors just running -n, though.)
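
For reference, these are the two invocations I mean, run with the array started in maintenance mode (the /dev/md2 path is my guess for the disk2 slot):

    # check-only pass: reports problems but writes nothing
    xfs_repair -n /dev/md2
    # verbose repair attempt
    xfs_repair -v /dev/md2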

 

Meanwhile, a third disk is marked "disabled" although it reports no SMART errors, and it won't come back into the array even after stopping and restarting the array. I've run xfs_repair on it and it doesn't report any errors.

My shares are acting a bit strange, with two of them refusing to respond to an `ls` command:
 

root@Eurydice:/mnt/user/Video/Television# ls
/bin/ls: reading directory '.': Input/output error
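
Since /mnt/user just merges the individual disks, the same folder can be listed on each disk directly to narrow down which one is actually throwing the error (paths are guesses; the folder only exists on disks that actually hold part of this share):

    ls /mnt/disk1/Video/Television
    ls /mnt/disk2/Video/Television   # presumably where the Input/output error comes from
    ls /mnt/disk3/Video/Television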

 

From looking at the xfs_repair output and smart tests, it seems clear that disk2 is a goner and will need to be replaced. I've actually got two new disks that I can swap in right now, but I'd like some advice before I do it.

 

 

Right now my shares are all marked as "unprotected" which makes me worry that I'll lose data if I just clear a disk.

 

So my questions are: 

- What should I do about disk #2, which has the i/o failures shown above?

- Disk #1 shows a little SMART thumbs-down icon in the dashboard, but the last time I ran self-tests it was fine. Does hitting "acknowledge" clear the thumb-down icon? Should I consider this disk compromised as well?

- What can I do to get disk #3, which shows status "disabled", back into the array?

- What can I do to try to minimize data loss before I pull disk #2 (and maybe the others) out of the array?

 

I'll include a diagnostics zip file (if I can manage to upload it; I was having trouble in Firefox).

 

Thanks!

 

eurydice-diagnostics-20220223-1737.zip


Oh, one other thing: this was probably a mistake, but I ran a parity check after I noticed things were failing. I got a lot of errors:

 

[screenshot of the parity check error count]

 

Does this mean I'm screwed for data recovery? Like, if the parity disk couldn't read a bit from the failing disk, it wouldn't be able to compute the parity for that bit across all four disks, would it?
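
My (possibly wrong) mental model of single parity: the parity disk stores the XOR of the corresponding bits on every data disk, so reconstructing any one bit needs every other disk plus parity to read correctly. A toy example:

    # one bit per disk: d1=1 d2=0 d3=1 d4=1
    echo $(( 1 ^ 0 ^ 1 ^ 1 ))   # parity = d1^d2^d3^d4 = 1
    echo $(( 1 ^ 1 ^ 1 ^ 1 ))   # recover d2 = parity^d1^d3^d4 = 0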

21 hours ago, JorgeB said:

There appear to be 3 failing disks with single parity, so some data loss is expected. disk2 is failing for sure and disk1 might also be; run an extended SMART test on disks 1 and 3 and post new diags once they are done.

 

Thanks for taking a look. I've got SMART tests running on disk2 and disk3 and will post them once they're done. The disk1 test finished and reported a passing test, as far as I can make out from the logs ("SMART overall-health self-assessment test result: PASSED"), but I'm not super familiar with what I should be looking for in there.
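
For the record, the command-line equivalent of what I kicked off would be something like this (device name is a placeholder):

    smartctl -t long /dev/sdb   # start an extended self-test; the drive runs it in the background
    smartctl -H /dev/sdb        # prints the overall-health line quoted above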

 

 

disk1-eurydice-smart-20220224-2352.zip

2 hours ago, thither said:

"SMART overall-health self-assessment test result: PASSED"

This doesn't matter; what does is the test result:

 

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     47825         -

 

In this case it passed.
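
If you want just that table from the command line, something like this prints only the self-test log (device name is a placeholder):

    smartctl -l selftest /dev/sdb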


Ok, so the disk2 and disk3 reports completed.

 

disk3, the disabled one, shows the test completing without error in that same section of the report:

 

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     21383         -
# 2  Short offline       Completed without error       00%     21365         -
# 3  Short offline       Completed without error       00%     21328         -

 

disk2, which is doomed, shows 4218 errors (and Unraid shows the test as "completed: read failure").

 

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     46611         43138432
# 2  Extended offline    Completed: read failure       90%     46566         3145072
# 3  Short offline       Completed without error       00%     46557         -

 

So it looks like only disk2 fails the SMART tests, and if I'm lucky I'll be able to swap it out and rebuild from parity.

 

One thing I still don't understand is how I can get disk3 back into the array. Just starting the array doesn't seem to do it. Do I need to erase the disk or something? Remove it from the array and re-add it? Will a cold reboot do it? (I've rebooted, but haven't turned the power all the way off.)

 

Relatedly, I would think it would be best to get disk3 back online before I swap out disk2 for a fresh drive, but is that the wrong order to do it in? I would think that as long as disk2 is unreliable, the system as a whole wouldn't be able to reliably compute the parity.

 

disk3-eurydice-smart-20220225-0900.zip disk2-eurydice-smart-20220225-0038.zip

9 minutes ago, thither said:

So it looks like only disk2 fails the SMART tests,

Correct.

 

9 minutes ago, thither said:

and if I'm lucky I'll be able to swap it out and rebuild from parity.

Not possible, because you only have single parity and already have a disabled disk. You can use ddrescue on disk2 to recover as much data as possible, then do a new config with the clone and the remaining disks to re-sync parity. Note that since it's not clear how long disk3 has been disabled, any data written to the emulated disk will be lost after the new config.


Thanks for the link to ddrescue. So if I'm understanding correctly, my next steps would be:

  • install a new disk as disk5
  • try to ddrescue as much stuff as I can from disk2 to disk5 (sketch below)
  • remove disk2 from the array, add disk5
  • rebuild parity based on the recovered data, probably with some data loss
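
A rough sketch of what I'm picturing for the ddrescue step (device names and map-file path are placeholders; getting them backwards would clone over the wrong disk):

    # first pass: grab the easy data quickly, skip scraping bad areas
    ddrescue -f -n /dev/sdX /dev/sdY /boot/disk2.map
    # second pass: go back and retry the bad areas a few times
    ddrescue -f -d -r3 /dev/sdX /dev/sdY /boot/disk2.map

(The map file lets an interrupted run pick up where it left off.)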

Is there any chance of un-disabling disk3? Should I just uninstall the disk and trash it?

15 minutes ago, thither said:

Is there any chance of un-disabling disk3?

We can force Unraid to enable disk3 to rebuild disk2, but that will only work if parity is still valid, i.e., nothing was written to disk3 once it got disabled. If you believe it's still valid you can try to rebuild using a new disk; even if it's not successful it won't make things worse, unless another disk fails during the rebuild.


As a brief update, I removed disk3 (the disabled one) from the array and tried to add it back in to rebuild, but it started throwing SMART errors, and then finally wouldn't mount at all. xfs_repair told me to run it again with -L, which I may try to do, but in theory everything in there should be rebuildable from parity, so I'll likely just get rid of the disk. I'll be trying a ddrescue from disk2 to the new disk as soon as the new disk's extended SMART test is complete.
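
For reference, the -L run it's asking for would be something like this, with the device path being my guess for disk3's slot (zeroing the log discards any metadata updates still sitting in it, which is why it's a last resort):

    xfs_repair -L /dev/md3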

6 hours ago, JorgeB said:

Not a good idea since you have a known bad disk in the array, disk2.

 

Ah, right. Well I just wanted to see whether I could get any data off of it at all, but it seemed to be totally unresponsive.

 

I've got ddrescue running right now on disk2 (current remaining time: 222d 15h, though I'm hopeful that will improve).

 

Am I correct in saying that at this point my existing parity drive isn't useful any more, since it's been trying to check two faulty drives at once and whatever parity information is on it is unreliable now? So my best course of action is just to recover as much stuff as I can from disk2 and then recompute parity from scratch with whatever recovered data I can get off of it?


Parity status is unclear; you didn't even mention how long disk3 has been disabled, or whether there were any writes to it after. As mentioned above, you could try to force enable disk3 to rebuild disk2 to a new disk. I can still post instructions for that if you want, but it will only work if parity is still valid.

18 hours ago, JorgeB said:

Parity status is unclear; you didn't even mention how long disk3 has been disabled, or whether there were any writes to it after. As mentioned above, you could try to force enable disk3 to rebuild disk2 to a new disk. I can still post instructions for that if you want, but it will only work if parity is still valid.

 

Sorry about that. I first got a notification that disk3 was out on 2022-01-15, and I definitely didn't intentionally write anything to the disk after that. Most of the data going into the array since that time would be automatic downloads (from Sonarr etc) which are not super important and could be redownloaded if needed.

 

After running for quite a while, ddrescue from disk2 to my new replacement for it succeeded, rescuing 99.99% of the data, and after running an xfs_repair on the replacement it mounts fine, with just a few random files (6GB or so) winding up in lost+found.

 

Do you think it would be worthwhile to add this replacement disk back into the array as disk2 and then, once it's in there, try to rebuild my replacement for disk3 from parity? Or would I just be risking 6GB or more of corrupt data on the replacement (since that data on disk2 would have changed after parity was computed)? Or should I just start over with a new config and live without whatever was on disk3?

  • Solution
9 hours ago, thither said:

Most of the data going into the array since that time would be automatic downloads (from Sonarr etc) which are not super important and could be redownloaded if needed.

That's not the point; the point is that if any data was written to the emulated disk, parity will no longer be valid if we use the actual disk instead.

 

9 hours ago, thither said:

After running for quite a while, ddrescue from disk2 to my new replacement for it succeeded, rescuing 99.99% of the data

In that case, IMHO the best way forward would be to do a new config with the new disk2 and the remaining disks and re-sync parity.

 


Ok, well after a good deal of messing around I created a new config and am rebuilding parity now. I definitely lost a bunch of data, and without disk3 being readable it's a little hard to say what exactly went away, but the system is stable again and I definitely learned something through this whole process.

 

Thanks very much for helping me out with this @JorgeB!
