
Data disk disabled, rebuild failed, disk now shows only fraction of space used



Posted

Hello everyone who's reading me and thank you very much for stopping by!

I'm in a situation that I find scary and I'd like some directions on what best to do.


I recently had a data disk disabled by the system (with the UDMA CRC error count as the cause). This has happened to me before: a cable wasn't making a proper connection and it ended up causing read/write errors.
So, like the previous time, I stopped the array, shut down the server, replugged the hard drive and booted everything back up.

However, when the array finished starting, the drive appeared as a new device that I could rebuild the data disk onto.

So I chose to start the data rebuild, since parity was probably a better source of what's supposed to be on that disk than the drive that had the errors.

However, the data rebuild failed because the disk dropped out and reappeared as unassigned (as if it had been yanked out while running and the server didn't recognize it when it came back, putting it under Unassigned Devices).


So, where I'm at now is that when the array is running, disk 3 shows only a fraction of its space as used (about 80 GB out of 3 TB, instead of over 2.8 TB out of 3 TB).
The array is currently stopped. I do have two drives coming but their delivery dates are still not set in stone. At least one should arrive on Monday, but even if I had them now, I wouldn't start the rebuild yet because of the storage use currently reported for the system's third data disk.

 

I really hope that the failed rebuild did not erase the data disk's contents from the parity. :/

Posted
4 hours ago, johnnie.black said:

Did you format the disk at any time before or after rebuilding?

I didn't. The disks have always been formatted as XFS and that never changed.

Posted

If it was formatted it would remain XFS. Formatting a disk is the only explanation I can think of that would make a full disk now show as basically empty, and unfortunately it's a common mistake.

Posted
1 minute ago, johnnie.black said:

If it was formatted it would remain XFS. Formatting a disk is the only explanation I can think of that would make a full disk now show as basically empty, and unfortunately it's a common mistake.

I know but it never prompted me to do that. That's where I have a hard time understanding.

Posted

I just checked again. It still prompts me to do a Parity-Sync/Data Rebuild. No text/prompts about formatting.

 

I'm genuinely lost as to how it could have formatted the data disk instead of doing what it says (Data-Rebuild).

 

And honestly, I have big doubts it actually formatted. There's no way 80 GB of data could appear in an instant. My computer can barely push 100 MB/s during backups.

Posted

Even when I look through my notifications since the day the data disk was reported as disabled by F.C.P. (on 2020-05-31), there is no mention of formatting. Only rebuilds that were attempted and failed while I was asleep.

 

There must be something I never saw happen and I don't know where to look for it.

Notifications.log

Posted
1 hour ago, johnnie.black said:

I'm sorry but without all the logs from when the problem started (they start over after any reboot) I don't have other ideas on what could have happened.

Yeah. I don't blame ya. Like I said in the beginning, I shut down the server to check the cables and everything, which apparently resets the diagnostics (would be really cool if it didn't though!).

 

I was hoping there might be a way to check what the parity is holding. If that were possible, I could know once and for all whether rebuilding onto the new drive will mean 11 days of rebuilding 3 TB of nothing, or whether it will actually restore my data, so I wouldn't need to keep this post open for weeks.

Posted
Just now, johnnie.black said:

You could always unassign the disk and check the emulated disk data before rebuilding, but I see no reason for it to be any different than the current one.

Well, that's where I'm at actually. Once the rebuild failed, I left the disk unassigned (because the drive has probably actually gone bad) and that's when the indicators started to show only 80 GB used instead of 2.8 TB like before the rebuild. Once I saw that, I stopped the array so that no more corruption would be done.

Posted (edited)

Just got the new hard drive in and the data rebuild is in progress.
One encouraging sign is that the amount of storage used on the third data disk is going up even though nothing other than the rebuild is writing to the drive (it's gone up from 80 to 117 GB in the first minutes already). So, the data might still be there.

 

However, the rebuild is even slower than before (it was taking about 10 days at ~10-20 MB/s, now it's 50-100 days at ~1-2 MB/s), which I find very weird considering nothing other than the rebuild is creating drive activity. Could it be caused by putting in a hard drive of a different size (going from 3 to 8 TB)?
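As a rough sanity check on the size question, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate, assuming the rebuild writes the whole disk at a steady rate and using decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the function name `rebuild_days` is made up for this illustration.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12  # decimal terabyte, as drive vendors count it (assumption)
MB = 10**6   # decimal megabyte (assumption)


def rebuild_days(disk_bytes, bytes_per_second):
    """Days needed to write disk_bytes at a steady bytes_per_second."""
    return disk_bytes / bytes_per_second / SECONDS_PER_DAY


# Compare the old 3 TB disk with the new 8 TB disk at the same speeds.
for size_tb in (3, 8):
    for speed_mb in (1, 2):
        days = rebuild_days(size_tb * TB, speed_mb * MB)
        print(f"{size_tb} TB at {speed_mb} MB/s: {days:.1f} days")
```

Under these assumptions, 8 TB at 1-2 MB/s comes out to roughly 46 to 93 days, which is in line with the 50-100 day estimate. A larger disk stretches the total time in proportion to its size, but it would not by itself explain the drop in MB/s.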

 

 

Edited by campfred
Posted
12 hours ago, campfred said:

(it's gone up from 80 to 117 GB in the first minutes already).

That's not data. The larger the disk, the more space will be used by XFS metadata. Data never changes during the rebuild: the data you see with the disk emulated before starting the rebuild is the same data you'll see after the rebuild completes.
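To illustrate why the rebuilt disk matches the emulated one, here is a minimal toy sketch. It assumes single parity computed as a bytewise XOR across the data disks; the 4-byte "disks" and the helper `xor_blocks` are made up for the example and are not taken from the actual system.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy 4-byte "disks"; the contents are invented for illustration.
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x0f\x0e\x0d\x0c"
disk3 = b"\xaa\xbb\xcc\xdd"

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# Disk 3 is missing: its emulated content is computed from parity
# and the remaining data disks.
emulated_disk3 = xor_blocks([parity, disk1, disk2])

# A rebuild writes exactly those emulated bytes to the replacement drive.
rebuilt_disk3 = emulated_disk3

print(emulated_disk3 == disk3)
```

In this model the rebuild cannot recover anything that the emulated disk does not already show, since both are derived from the same parity and the same surviving disks.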

 

There are constant ATA errors with the parity disk, check cables.

Posted
9 hours ago, johnnie.black said:

That's not data. The larger the disk, the more space will be used by XFS metadata. Data never changes during the rebuild: the data you see with the disk emulated before starting the rebuild is the same data you'll see after the rebuild completes.

 

There are constant ATA errors with the parity disk, check cables.

Yeah. It went back down to 117 GB. Bummer. Thankfully, the only data on there was non-sensitive data that I can't easily pull back from online.

 

About the errors, which attribute should I look for after verifying/replacing the cable?

Posted
22 minutes ago, campfred said:

About the errors, which attribute should I look for after verifying/replacing the cable?

Check the syslog and make sure there are no more of these:

 

Jun  7 16:05:24 Alfred kernel: ata5.00: configured for UDMA/33
Jun  7 16:05:24 Alfred kernel: ata5: EH complete
Jun  7 16:05:25 Alfred kernel: ata5.00: exception Emask 0x10 SAct 0x8303e00 SErr 0x4090000 action 0xe frozen
Jun  7 16:05:25 Alfred kernel: ata5.00: irq_stat 0x00400040, connection status changed
Jun  7 16:05:25 Alfred kernel: ata5: SError: { PHYRdyChg 10B8B DevExch }
Jun  7 16:05:25 Alfred kernel: ata5.00: failed command: READ FPDMA QUEUED
Jun  7 16:05:25 Alfred kernel: ata5.00: cmd 60/08:48:18:ec:00/00:00:00:00:00/40 tag 9 ncq dma 4096 in
Jun  7 16:05:25 Alfred kernel:         res 40/00:68:b0:f6:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Jun  7 16:05:25 Alfred kernel: ata5.00: status: { DRDY }
Jun  7 16:05:25 Alfred kernel: ata5.00: failed command: READ FPDMA QUEUED
Jun  7 16:05:25 Alfred kernel: ata5.00: cmd 60/38:50:20:ec:00/05:00:00:00:00/40 tag 10 ncq dma 684032 in
Jun  7 16:05:25 Alfred kernel:         res 40/00:68:b0:f6:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Jun  7 16:05:25 Alfred kernel: ata5.00: status: { DRDY }
Jun  7 16:05:25 Alfred kernel: ata5.00: failed command: READ FPDMA QUEUED
Jun  7 16:05:25 Alfred kernel: ata5.00: cmd 60/18:58:58:f1:00/00:00:00:00:00/40 tag 11 ncq dma 12288 in
Jun  7 16:05:25 Alfred kernel:         res 40/00:68:b0:f6:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Jun  7 16:05:25 Alfred kernel: ata5.00: status: { DRDY }
Jun  7 16:05:25 Alfred kernel: ata5.00: failed command: READ FPDMA QUEUED
Jun  7 16:05:25 Alfred kernel: ata5.00: cmd 60/40:60:70:f1:00/05:00:00:00:00/40 tag 12 ncq dma 688128 in
Jun  7 16:05:25 Alfred kernel:         res 40/00:68:b0:f6:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Jun  7 16:05:25 Alfred kernel: ata5.00: status: { DRDY }
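For anyone who prefers not to eyeball the log, a small Python sketch like the one below can count these entries per ATA port. The regular expression is built only from the message shapes in the excerpt above (`exception Emask`, `SError`, `failed command`), so other kernels or controllers may word things differently; the function name `count_ata_errors` is made up for this example.

```python
import re
from collections import Counter

# Matches the error lines seen in the excerpt above. This pattern is an
# assumption based on that excerpt, not a complete list of libata messages.
ATA_ERROR_RE = re.compile(
    r"kernel: (ata\d+)(?:\.\d+)?: .*(?:failed command|exception Emask|SError)"
)


def count_ata_errors(lines):
    """Return a Counter mapping ATA port (e.g. 'ata5') to error-line count."""
    counts = Counter()
    for line in lines:
        match = ATA_ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Example usage, assuming the syslog is a plain text file at this path:
#   with open("/var/log/syslog", errors="replace") as log:
#       print(count_ata_errors(log))
```

An empty result after swapping the cable would suggest these particular errors have stopped, though it is still worth reading the log itself for anything the pattern does not cover.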

 

Posted

Okay. I replaced the cable just to be sure and no failed commands are appearing in the system logs anymore.


However, disk 3 had an unmountable file system. I attempted a repair both from the webGUI and from an SSH session, using the instructions on the wiki for checking a disk's filesystem, and it seems to have made the disk usable again (still not my 2.8 TB, but I've given up on that at this point).

 

I'll keep an eye on the ATA errors for a few days and see how it goes. If nothing else appears in the coming days, I'll edit the title appropriately to close the post. Should I put Solved even though I wasn't able to get my data back?
