Replace drive question (array order/infos get lost on array stop)


Hi,

 

for some time, whenever I stop the array or reboot the server, the info on the array order gets lost (all slots empty, I have to manually select each drive in the correct position; just parity/data drives, my two cache drives are not affected). Usually I just manually put every drive in the correct position again, check "parity is valid" and start the array without issue. Since this only happens when I reboot after an upgrade or a power outage, it was fine in the past and I only had to do that a few times.

 

Today I went to replace a failed drive (red x mark), so I took out the old drive, put in the new one (the system supports hot-swap) and stopped the array. Sure enough, all the slots were empty again, so I manually reassigned them. I am not sure how to proceed at this point. I have replaced drives over the years, but the last time was a few years ago.

 

I did format the new drive as XFS but there was no preclear/zero drive option available. From what I remember, I did this to drives in the past.

 

The manual for replacing disks says to just stop the array, assign the new drive to the slot of the old one and then click a checkbox that says "Yes I want to do this" (which I cannot see anywhere). I could check "parity is valid", but will this automatically rebuild the contents of the drive on the new disk? I am pretty sure I do not want parity to rebuild at this point, since the old disk has failed and a parity rebuild would remove any info on the failed drive that had been emulated, correct?

 

I am running Unraid 6.9.2, with two parity drives, ten data drives (XFS) and two cache drives (BTRFS).

 

Hope someone can help me out here!

Edited by Darkguy
Added system info
Link to comment

Dual parity actually.

Checking the logs, I saw that the super.dat file could not be read:

 

Jun 17 15:27:43 tower kernel: read_file: error 2 opening /boot/config/super.dat

Jun 17 15:27:43 tower kernel: md: could not read superblock from /boot/config/super.dat

Jun 17 15:27:43 tower kernel: md: initializing superblock
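
A minimal sketch of how log text could be scanned for these messages, so the problem is noticed before the next array stop. The patterns are taken from the three lines quoted above; other Unraid releases may word them differently, and the function name is only illustrative.

```python
import re

# Patterns copied from the kernel messages quoted in this thread.
# Assumption: other releases may phrase these messages differently.
SUPER_DAT_PATTERN = re.compile(
    r"error \d+ opening /boot/config/super\.dat"
    r"|could not read superblock from /boot/config/super\.dat"
)


def find_super_dat_errors(log_text: str) -> list[str]:
    """Return every log line that reports a problem reading super.dat."""
    return [
        line for line in log_text.splitlines() if SUPER_DAT_PATTERN.search(line)
    ]


SAMPLE_LOG = """\
Jun 17 15:27:43 tower kernel: read_file: error 2 opening /boot/config/super.dat
Jun 17 15:27:43 tower kernel: md: could not read superblock from /boot/config/super.dat
Jun 17 15:27:43 tower kernel: md: initializing superblock
"""

if __name__ == "__main__":
    for hit in find_super_dat_errors(SAMPLE_LOG):
        print(hit)
```

On a live server the same function could be fed the contents of the system log instead of `SAMPLE_LOG`.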

 

I re-inserted the old disk, checked "parity is valid" and started the array. Could I re-create super.dat after checking the flash drive for errors?

 

The disk in question was previously disabled, but got re-enabled when I restarted the array. Could this lead to problems with parity?

Link to comment

You need to fix the flash drive so super.dat can be reliably saved. Any time you start or stop the array, super.dat is saved, but that only works if the flash drive can be reliably written. And as you have seen, if super.dat can't be reliably read, Unraid doesn't know your disk assignments.
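
As a rough illustration of the "can it be reliably read" part, here is a small sketch that tries to read a file and reports why it failed. The function name is hypothetical, and a successful read only shows the file is present and non-empty. It does not prove the contents are valid or that the flash drive is healthy.

```python
from pathlib import Path
from typing import Optional


def flash_file_problem(path: Path) -> Optional[str]:
    """Return a description of the problem, or None if the file reads cleanly."""
    try:
        data = path.read_bytes()
    except OSError as exc:
        # Covers a missing file as well as I/O errors from a failing flash drive.
        return f"cannot read {path}: {exc.strerror}"
    if not data:
        return f"{path} is empty"
    return None


if __name__ == "__main__":
    # Path taken from the log lines earlier in this thread.
    problem = flash_file_problem(Path("/boot/config/super.dat"))
    print(problem or "super.dat is readable")
```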

 

1 hour ago, Darkguy said:

I did format the new drive as XFS

No point in formatting a disk that will be used as a replacement, since every bit of the entire disk will be overwritten during rebuild, and those bits represent whatever filesystem was on the original disk, XFS or otherwise, and the files and folders of the filesystem.

 

Hopefully you don't mean you formatted the emulated disk, or formatted the physical disk after assigning it to the array, since that would result in rebuilding an empty filesystem (formatted disk).

 

1 hour ago, Darkguy said:

there was no preclear/zero drive option

Unraid only requires a clear disk when assigning a disk to a new slot in an array that already has valid parity. This is so parity will remain valid, since a clear disk is all zeros and has no effect on parity. In this scenario, if the disk hasn't been precleared, Unraid will clear it.

 

(Note that a formatted disk isn't a clear disk since a formatted disk has an empty filesystem written to it.)
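
A toy sketch of why that is, assuming a simplified single-parity model where parity is the byte-wise XOR of all data disks. The 4-byte "disks" and the `formatted` stand-in are made up for illustration, and a second parity disk uses a different calculation that is not modeled here.

```python
from functools import reduce


def xor_parity(disks: list[bytes]) -> bytes:
    """Byte-wise XOR across equally sized disks (single-parity model)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


# Two tiny made-up data disks and their parity.
disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes([0xAB, 0xCD, 0xEF, 0x01])
parity_before = xor_parity([disk1, disk2])

cleared = bytes(4)                           # a clear disk: all zeros
formatted = bytes([0x58, 0x46, 0x53, 0x42])  # stand-in for filesystem metadata

# XOR with zero changes nothing, so adding a clear disk keeps parity valid.
parity_with_cleared = xor_parity([disk1, disk2, cleared])

# A formatted disk contains non-zero bytes, so existing parity would be wrong.
parity_with_formatted = xor_parity([disk1, disk2, formatted])

if __name__ == "__main__":
    print(parity_with_cleared == parity_before)    # cleared disk: parity unchanged
    print(parity_with_formatted == parity_before)  # formatted disk: parity differs
```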

 

1 hour ago, Darkguy said:

(the system supports hot-swap)

No point in hotswapping anything, since Unraid will not do anything with the new disk until you assign it, and you can't assign it with the array started. I only recommend hotswapping Unassigned Devices if your hardware supports hotswap.

Link to comment
3 minutes ago, trurl said:

Hopefully you don't mean you formatted the emulated disk, or formatted the physical disk after assigning it to the array, since that would result in rebuilding an empty filesystem (formatted disk).

 

No, I formatted it as an unassigned device.

 

So what would my next steps be?

 

  1. stop array
  2. shutdown server
  3. CHKDSK flash drive on another machine - fix or potentially replace Flash drive and renew license?
  4. proceed with replacing the faulty drive once super.dat issue is fixed?
Link to comment

Unraid disables a disk when a write to it fails. After a disk is disabled, it isn't used again until rebuilt (or New Config such as you seem to have done).

 

Instead of the physical disk, Unraid emulates the missing or disabled disk from the parity calculation by reading all other disks. The emulated data can be read this way, and writes to the emulated disk also happen by updating parity as if the disk had been written. That initial failed write, and any subsequent writes to the emulated disk, can be recovered by rebuilding.

 

When you New Config your way out of a disabled disk, all those emulated writes are lost. And parity is probably NOT valid for the array with that physical disk, which was never rebuilt, since parity still has those emulated writes that the physical disk is missing.
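
The whole sequence can be sketched with a toy single-parity (XOR) model. The byte values are made up, the disk names only mirror this thread, and the second parity disk of a dual-parity array uses a different calculation that is not shown.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x0a\x0b\x0c\x0d"
disk6 = b"\xde\xad\xbe\xef"  # physical contents at the moment it was disabled
parity = xor_blocks(disk1, disk2, disk6)

# Disabled disk: reads are emulated from parity plus all other disks.
emulated = xor_blocks(parity, disk1, disk2)

# A write to the emulated disk only updates parity; the physical disk is untouched.
new_contents = b"\xde\xad\x00\x00"
parity = xor_blocks(disk1, disk2, new_contents)

# Rebuilding onto a replacement recovers the emulated write.
rebuilt = xor_blocks(parity, disk1, disk2)

# New Config with the stale physical disk: the emulated write is gone from the
# data disks, and parity no longer matches what is actually on them.
parity_is_valid = xor_blocks(disk1, disk2, disk6) == parity

if __name__ == "__main__":
    print(emulated == disk6, rebuilt == new_contents, parity_is_valid)
```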

Link to comment
15 minutes ago, JorgeB said:

Re-reading my reply, I wasn't as clear as I should have been: by "no" I meant "no, don't do that, wait". But it's good news if the old disk has the data. You might post current diags so we can check SMART.

 

 

It also has tons of read errors sadly, but so far everything seems to work. 

 

I've enclosed the current diagnostics. The disk in question is disk 6

 

I'd like to fix and replace things ASAP. All in all, I have three disks to replace: disk 6 got disabled in May and I want to replace it right away, and disk 5 also has read errors and I've got a second disk here to replace it. I would also replace another 500 GB disk with a larger one I have lying around.

 

I'll patiently await further instructions.

darkrack-diagnostics-20220617-1743.zip

Edited by Darkguy
Link to comment
2 hours ago, Darkguy said:

for some time

Since you have been neglecting your server...

 

Do you have Notifications setup to alert you by email or other agent as soon as a problem is detected?

 

Do you have backups of anything important and irreplaceable? Parity is not a substitute.

Link to comment
7 minutes ago, JorgeB said:

Errors for disk 6 look a little strange. Before doing anything else, let's see if it can still be emulated:

 

stop the array

unassign disk6

start array and post new diags.

 

I'm currently trying to move some data (pictures) off it that I have not yet backed up, using unBALANCER. Read speeds are very slow, so this will probably take around 8-9 hours.

 

From what was written above, changes to the emulated disk6 will have been lost (can confirm, some data that would have been written to that disk after it got disabled is missing now).  So I presume parity is invalid now as well, especially as two tiny writes (config files for a syncing tool I am using and which had a share on that disk) to disk6 have happened in the past hour.

 

Once I unassign it, do I check "parity is valid" or not?

Link to comment
14 minutes ago, trurl said:

Since you have been neglecting your server...

 

Do you have Notifications setup to alert you by email or other agent as soon as a problem is detected?

 

Do you have backups of anything important and irreplaceable? Parity is not a substitute.

 

I've been aware of the SMART problems of course and have e-mail notifications turned on, but just got around to getting new disks today. I do know parity is not a substitute for backups, but since I opted for dual parity, I was under the assumption that I could wait a few weeks to replace the faulty drive.

 

Nothing on that drive is irreplaceable, but it's still a hassle to replace some of it.

 

I was not aware of the error with super.dat until today. It seems to be the major issue here, since, as I understand it, it basically forced a new config and damaged parity.

 

I believe that there should be a notification if any of the files on /boot cannot be read/accessed. Neither the e-mail reports, the GUI nor Fix Common Problems have alerted me to this, or I would have tried fixing that problem back when all of my disks were OK. I even have regular backups of those files, but they only go back about a month.

 

Sure, I should have tried to dig deeper when the problem of the array config first came up, but since it never led to any problems, I didn't look into it.

Link to comment
15 minutes ago, Darkguy said:

I'm currently trying to move some data (pictures) off it

When you have disabled disks in the array, I always recommend copying (not moving) any data you need to preserve to somewhere off the array. That way you get as little array activity as possible.

 

Copying only needs to read all disks to emulate the disabled disk. Moving needs to also delete the data from the disabled disk, and deletes are just another type of write operation that must be emulated by updating parity.

 

And copying or moving to other array disks requires writing the array disks and updating parity. Copying off the array doesn't require updating any array disks or parity.
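
A minimal sketch of a copy-only rescue along those lines: each source file is read exactly once, nothing is deleted from the source, and verification re-reads only the destination. The function names are hypothetical, and the destination is assumed to be somewhere off the array.

```python
import hashlib
from pathlib import Path

CHUNK = 1 << 20  # read in 1 MiB pieces


def copy_file_once(src: Path, dst: Path) -> bool:
    """Copy src to dst reading src a single time; verify by re-reading dst only."""
    src_hash = hashlib.sha256()
    dst.parent.mkdir(parents=True, exist_ok=True)
    with src.open("rb") as reader, dst.open("wb") as writer:
        for chunk in iter(lambda: reader.read(CHUNK), b""):
            src_hash.update(chunk)
            writer.write(chunk)
    dst_hash = hashlib.sha256()
    with dst.open("rb") as reader:
        for chunk in iter(lambda: reader.read(CHUNK), b""):
            dst_hash.update(chunk)
    return src_hash.digest() == dst_hash.digest()


def copy_tree(source: Path, destination: Path) -> list[Path]:
    """Copy every file under source; return the source files that failed to verify."""
    failed = []
    for src_file in sorted(p for p in source.rglob("*") if p.is_file()):
        dst_file = destination / src_file.relative_to(source)
        if not copy_file_once(src_file, dst_file):
            failed.append(src_file)
    return failed
```

It would be called as `copy_tree(Path(source_dir), Path(destination_dir))`. Any file in the returned list should be copied again before the source disk is touched.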

Link to comment
15 minutes ago, JorgeB said:

While there could be some issues most emulated data should probably be still OK, but any array changes like moving data will make that much less likely, or impossible.

 

Makes sense, but I guess it's too late since the move is already underway.

 

8 minutes ago, trurl said:

When you have disabled disks in the array, I always recommend copying (not moving) any data you need to preserve to somewhere off the array. That way you get as little array activity as possible.

 

Copying only needs to read all disks to emulate the disabled disk. Moving needs to also delete the data from the disabled disk, and deletes are just another type of write operation that must be emulated by updating parity.

 

And copying or moving to other array disks requires writing the array disks and updating parity. Copying off the array doesn't require updating any array disks or parity.

 

Sounds sensible. But probably too late.

 

So probably no use in stopping the array and looking to emulate the faulty disk now?

Link to comment

Presuming parity is invalid now anyway, could I just

  • remove disk 6 (and try to get data off it on another system)
  • shut the system down, fix the issue with the flash drive
  • put in the new disk
  • build a new config with the new disk in the slot of disk6
  • rebuild parity
  • then go on to replace the other two disks and rebuild them from parity?
Link to comment

Update: I'm starting to think there is some underlying issue with the server, either the drive bay, a cable or the LSI controller:

 

First of all, the flash drive was damaged to some extent, which explained the super.dat issue. I could not reboot off of it, but was able to copy the config-folder onto another flash drive on my Windows workstation, boot off it, replace the license and set up the array. Parity is rebuilding now, BUT I get a ton of UDMA CRC errors (about 15,000 over the past 12 hours) for the new replacement drive in the old slot of disk6.

 

I also popped the old disk6 (the one with all the read errors) into an external bay on my Windows workstation and cloned it in an Ubuntu VM (in Hyper-V) using ddrescue. The drive was cloned in around 9 hours without a single error. I mounted the image (in the VM) and all data seems to be there.

 

Any suggestions on the order in which I should check components? Cable, bay, controller?
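
Drives report UDMA CRC errors in SMART attribute 199, so one way to test components one at a time is to note the raw value, change a single thing (cable, bay or controller port), and see whether the count keeps climbing. A small sketch that pulls the raw value out of `smartctl -A` style text; both sample lines are illustrative, not taken from the attached diagnostics.

```python
from typing import Optional


def udma_crc_count(smartctl_output: str) -> Optional[int]:
    """Return the raw value of SMART attribute 199, or None if it is not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows: ID, name, flag, value, worst, thresh, type,
        # updated, when_failed, raw_value.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


# Illustrative readings taken before and after swapping one component.
BEFORE = "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       15000"
AFTER = "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       15000"


def still_climbing(before: str, after: str) -> bool:
    """True if the CRC error count grew between the two readings."""
    first, second = udma_crc_count(before), udma_crc_count(after)
    if first is None or second is None:
        raise ValueError("attribute 199 not found in one of the readings")
    return second > first


if __name__ == "__main__":
    print(still_climbing(BEFORE, AFTER))
```

The raw count never resets, so only the change between two readings matters. If it stops growing after a swap, the component that was just replaced is the likely culprit.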

Link to comment
