
(Solved) Help. disk showing red X while shrinking array


CaptainTivo


I have been in the process of shrinking my array by removing two older, low density drives.

I have copied all the data from 2 drives to a third one that will stay in the array and I was ready to remove the drives and let the parity rebuild.

The rsync finished but this morning a third drive (not one of the ones I copied) is showing a red X.

 

So, to be clear and concrete here is what I have done:

- copied all the data from disk3 and disk6 to disk10 using rsync via /mnt/diskx  (I had intended to remove disk3 and disk6)

- after the copying finished last night, disk2 started showing read errors

- array is now showing a red X on disk2 (a disk not involved in the shrink process)
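Before pulling disk3 and disk6, it may be worth confirming that everything really landed on disk10. Below is a minimal sketch, assuming rsync kept the same relative paths on the destination. The function name `files_not_copied` is hypothetical, and it only compares file sizes, not checksums.

```python
import os


def files_not_copied(src_root, dst_root):
    """Return relative paths of files under src_root that are missing
    from dst_root or have a different size there.

    Assumption: the copy preserved the relative directory layout.
    This is a size-only comparison, not a checksum verification.
    """
    problems = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.isfile(dst):
                problems.append(rel)  # never arrived on the destination
            elif os.path.getsize(dst) != os.path.getsize(src):
                problems.append(rel)  # arrived, but truncated or changed
    return sorted(problems)
```

Something like `files_not_copied("/mnt/disk3", "/mnt/disk10")` would then list anything that still lives only on the old disk. An empty list suggests, but does not prove, that the copy is complete.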

"Fix Common Problems" plugin displays:

    "unRAID Disk 2 error: 13-08-2018 02:01

    Alert (TOWER) - Disk 2 in error state (disk dsbl)

    WDC_WD30EZRX-ooDC0B0_WD-WCC1T1078908 (sdk)"

 

disk2 is still reiserfs because I just upgraded to version 6.  I had intended to do a data shuffle and reformat later.

 

Some screenshots and the diagnostics log are attached.  The syslog shows disk read errors, but earlier it mentions an "interface fatal error".  Maybe the disk is OK but the interface failed?  I tried to get a SMART report but it failed.

 

So what do I do now?  In the past, when I got redballs, I just replaced the disk and let it rebuild the data.

If I replace disk2 with a new disk now and let it reconstruct the data on disk2, won't I end up with all the data on disks 3 and 6 duplicated in the array?

Obviously, this is a worst case scenario (or I guess worse would be 2 drives failing).  After a fair amount of searching, I do not have a clear path of action.

Can someone give me some help?

Thanks.

 

 

 

 

main_error.JPG

fix_common_problems.jpg

tower-diagnostics-20180813-1020.zip

Link to comment
1 hour ago, johnnie.black said:

Disk2 dropped offline, possibly a cable issue, power off, check connections and power back on, then post new diags since there's no SMART for it.

Thanks for the advice.  Unfortunately, I now have another issue: the server no longer recognizes the flash drive as a boot volume.  I will try to restore a backup and then repost. Hopefully it's just a restore problem although the flash is now 7 years old.

Link to comment
37 minutes ago, johnnie.black said:

Disk2 dropped offline, possibly a cable issue, power off, check connections and power back on, then post new diags since there's no SMART for it.

OK. Rebooted (no idea why the flash was not seen by the BIOS - something to debug later).

Here is the diagnostics file.  The only thing that looks fishy in SMART is

199 UDMA_CRC_Error_Count    -O--CK   200   199   000    -    6

I never had CRC errors before.  Could that be caused by a SATA cable issue?
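For reference, UDMA_CRC_Error_Count counts transfer errors on the link between the drive and the controller, which is why it usually points at cabling or connections rather than at the platters. The number to watch is the last column (the raw value). Below is a minimal sketch for pulling the fields out of a line like the one above, assuming the column order is ID, name, flags, value, worst, threshold, fail, raw. The `SmartAttribute` type and `parse_smart_line` helper are hypothetical names.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    flags: str
    value: int   # normalized current value
    worst: int   # worst normalized value seen
    thresh: int  # failure threshold
    fail: str    # "-" when the attribute has not failed
    raw: str     # raw counter, kept as text since some drives add extra fields


def parse_smart_line(line):
    """Split one attribute row into named fields.

    Assumption: eight whitespace-separated columns in the order
    ID, NAME, FLAGS, VALUE, WORST, THRESH, FAIL, RAW_VALUE.
    """
    parts = line.split(None, 7)
    if len(parts) != 8:
        raise ValueError("unexpected SMART attribute line: %r" % line)
    attr_id, name, flags, value, worst, thresh, fail, raw = parts
    return SmartAttribute(int(attr_id), name, flags, int(value),
                          int(worst), int(thresh), fail, raw.strip())
```

Comparing the raw value between two diagnostics runs is the useful part: the counter does not reset, so a value that stays at 6 after reseating the cable suggests the errors have stopped.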

tower-diagnostics-20180813-1344.zip

Link to comment
7 minutes ago, johnnie.black said:

BTW, if the disk looks good, and assuming you're sure nothing was written to the emulated disk after disk2 became disabled, you can do a new config now without the disks you want to remove and rebuild parity, instead of rebuilding the disk and then doing a new config.

 

I am pretty sure there were no writes to *that* disk. I have a cache disk with some dockers that would have had writes, but I have not transferred any data to the server via SMB since last week.  The last write to any array disk would have been from disk to disk during the rsync, and none to disk2 (the red X).

 

So now, I could:

1. shut down server

2. physically remove old disks to shrink array

3. reboot, and "new config" without the shrinked (shrunken?) disks, and click "start array"

Basically, what I was going to do before this error occurred.

Correct?

 

BTW, I'm very patient when it comes to these errors. I know that hasty actions may end badly.  If you (or anyone else) needs time to respond, no problem.  Really appreciate your help.

Link to comment

Actually, thinking more about it, wouldn't it be best to do a parity check before removing any disks?

 

Here is my logic:

1. assume that the redballed disk (disk2) is actually fine, all data is there and has not been written since before the redball (read error) occurred.

2. If this is true, then the parity data is actually correct and I just need to force unRAID to recognize that

3. if I can force it to think that the disk is ok (remove red X), then do a parity check

4. If the parity check passes, that guarantees that there is no error on disk2

5. then proceed with the disk shrinking process

 

Question is: how to do step 2?  Is there some way to simply restart the array and tell unRAID that the disk is fine and parity is correct?

Link to comment
7 hours ago, CaptainTivo said:

Actually, thinking more about it, wouldn't it be best to do a parity check before removing any disks?

 

Here is my logic:

1. assume that the redballed disk (disk2) is actually fine, all data is there and has not been written since before the redball (read error) occurred.

2. If this is true, then the parity data is actually correct and I just need to force unRAID to recognize that

3. if I can force it to think that the disk is ok (remove red X), then do a parity check

4. If the parity check passes, that guarantees that there is no error on disk2

5. then proceed with the disk shrinking process

 

Question is: how to do step 2?  Is there some way to simply restart the array and tell unRAID that the disk is fine and parity is correct?

 

If a disk drops out (redballed), the disk itself is already out of sync with parity. Don't think only about whether disk2 was written to; what about writes to the other disks?

In fact, starting or stopping the array already involves writes, even if you haven't written anything yourself.

 

You just need to decide whether to rebuild onto a new disk or onto the original disk.

 

 

11 hours ago, CaptainTivo said:

So what do I do now?  In the past, when I got redballs, I just replaced the disk and let it rebuild the data.

If I replace disk2 with a new disk now and let it reconstruct the data on disk2, won't I end up with all the data on disks 3 and 6 duplicated in the array?

 

No, nothing gets duplicated. I suggest reading up on how parity works.
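To make that concrete: with single parity, the parity disk holds the bytewise XOR of all data disks, so a rebuild can only ever reproduce what was on the missing disk itself. Below is a toy sketch with made-up six-byte "disks" (the disk contents and function names are purely illustrative, and dual parity works differently).

```python
from functools import reduce


def xor_parity(blocks):
    """Bytewise XOR across equal-length blocks (single-parity model)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Reconstruct the one missing block from the survivors plus parity."""
    return xor_parity(list(surviving_blocks) + [parity])


# Toy contents standing in for three data disks.
disk2 = b"movies"
disk3 = b"photos"
disk10 = b"backup"

parity = xor_parity([disk2, disk3, disk10])

# disk2 drops out; rebuild it from the other disks and parity.
rebuilt_disk2 = rebuild_missing([disk3, disk10], parity)
```

Here `rebuilt_disk2` comes back as exactly the old disk2 contents. Nothing from disk3 or disk10 ends up on it, so a rebuild adds no extra copies of the data that was moved with rsync. The same arithmetic also shows why parity computed before a write to any other disk can no longer be trusted afterwards: change one byte on disk10 without updating parity and the rebuilt disk2 comes out wrong.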

 

 

7 hours ago, CaptainTivo said:

BTW, I'm very patient when it comes to these errors.

 

The AOC-SASLP-MV8 may be the source of the problem too, and your disks are also getting old.

I'm the opposite with these errors: I have no patience for them.

Link to comment
11 hours ago, CaptainTivo said:

Actually, thinking more about it, wouldn't it be best to do a parity check before removing any disks?

IMO there's no point in doing this, and it will always find a few errors due to filesystem housekeeping. The screenshot appears to confirm there were no data writes to disk2, so basically you have two options:

 

1) do a new config after removing the disks you don't want and sync parity

 

2) rebuild disk2 to a new disk (don't use the old one, because if anything else goes wrong you can be in a worse situation than you are now), then do a new config without the smaller disks and re-sync parity.

 

I would go with option 1, but I have full backups of all my servers, if you want to play it safer go with option 2, just in case there really is a problem with disk2 despite the healthy looking SMART, it's rare but it can happen.

Link to comment
7 hours ago, johnnie.black said:

IMO there's no point in doing this, and it will always find a few errors due to filesystem housekeeping. The screenshot appears to confirm there were no data writes to disk2, so basically you have two options:

 

1) do a new config after removing the disks you don't want and sync parity

 

2) rebuild disk2 to a new disk (don't use the old one, because if anything else goes wrong you can be in a worse situation than you are now), then do a new config without the smaller disks and re-sync parity.

 

I would go with option 1, but I have full backups of all my servers, if you want to play it safer go with option 2, just in case there really is a problem with disk2 despite the healthy looking SMART, it's rare but it can happen.

Thanks.  I see your point.  I don't have full backups but what little data on the server I care about, I have copies.

May I ask what you use for backups?  26 TB is a lot of data to back up.

Link to comment
May I ask what you use for backups? 

All my unRAID servers are backed up to another unRAID server. It's not cheap since I have a lot more than 26TB, and although most of the data isn't personal and irreplaceable like photos, etc., it would be a great pain to recover and a large part would be impossible to recover, so for me it's worth the investment.

 

 

Link to comment
12 hours ago, johnnie.black said:

IMO there's no point in doing this, and it will always find a few errors due to filesystem housekeeping. The screenshot appears to confirm there were no data writes to disk2, so basically you have two options:

 

1) do a new config after removing the disks you don't want and sync parity

 

2) rebuild disk2 to a new disk (don't use the old one, because if anything else goes wrong you can be in a worse situation than you are now), then do a new config without the smaller disks and re-sync parity.

 

I would go with option 1, but I have full backups of all my servers, if you want to play it safer go with option 2, just in case there really is a problem with disk2 despite the healthy looking SMART, it's rare but it can happen.

Well, I decided to go with option 1.  However, I ran into a question right away.

In my original procedure, the first step is to copy the data from the drives to be removed to a drive with space.  I did that.

Next step is to alter the share definitions to only include drives that will be left in the array after the "shrink".

Of course, I can't do that because the array is not "started", so I can't alter the share definitions.

If I start the array in its current state (disk2 is still redballed), what will it do?  Will it start building parity?  Should I start it in "maintenance mode"?

Or am I stuck?

Link to comment
22 hours ago, johnnie.black said:

Don't worry about the include/exclude settings for now, just concentrate on getting the array back to a protected state, though if you want you can change those right after starting the parity sync.

 

Right, thanks.  I removed the "shrinked" disks, did a New Config, reordered the disks, and started the parity sync.

Thanks for help, everyone.

Link to comment
  • CaptainTivo changed the title to (Solved) Help. disk showing red X while shrinking array

Archived

This topic is now archived and is closed to further replies.
