campfred Posted June 4, 2020 Hello everyone who's reading this, and thank you very much for stopping by! I'm in a situation that I find scary and I'd like some direction on what best to do. I recently had a data disk disabled by the system (with the UDMA CRC error count as the cause). This happened to me before, when a cable wasn't making proper contact and it ended up causing read/write errors. So, like the previous time, I stopped the array, shut down the server, replugged the hard drive and booted everything back up. However, when the array finished starting, the drive appeared as a new device that I could rebuild the data disk onto. So I chose to start the data rebuild, since the parity was probably a better source of what's supposed to be on that disk than the drive that had the errors. The data rebuild then failed because the disk dropped out and reappeared unassigned (as if it had been yanked out while running and the server didn't recognize it when it came back, putting it under Unassigned Devices). Where I'm at now is that when the array is running, disk 3 shows only a fraction of the space used (about 80 GB out of the 3 TB instead of over 2.8 TB out of the 3). The array is currently stopped. I do have two drives coming but their delivery dates are still not set in stone. I should have at least one arriving on Monday, but even if I had them, I wouldn't start the rebuild yet because of the current storage-use figure on the system's third data disk. I really hope that the failed rebuild did not erase the data disk's contents from the parity. :/
trurl Posted June 4, 2020 Go to Tools-diagnostics and attach the complete Diagnostics ZIP file to your NEXT post.
campfred Posted June 4, 2020 Author (edited) 2 hours ago, trurl said: Go to Tools-diagnostics and attach the complete Diagnostics ZIP file to your NEXT post. Sure! Here is the diagnostics data. alfred-diagnostics-20200604-1450.zip Edited June 4, 2020 by campfred Word was missing.
JorgeB Posted June 5, 2020 Did you format the disk at any time before or after rebuilding?
campfred Posted June 5, 2020 Author 4 hours ago, johnnie.black said: Did you format the disk at any time before or after rebuilding? I didn't. The disks have always been formatted as XFS and that never changed.
JorgeB Posted June 5, 2020 If it was formatted it would remain XFS. Formatting the disk is the only explanation I can think of for a full disk now showing as basically empty, and unfortunately it's a common mistake.
campfred Posted June 5, 2020 Author 1 minute ago, johnnie.black said: If it was formatted it would remain XFS. Formatting the disk is the only explanation I can think of for a full disk now showing as basically empty, and unfortunately it's a common mistake. I know, but it never prompted me to do that. That's what I have a hard time understanding.
campfred Posted June 5, 2020 Author I just checked again. It still prompts me to do a Parity-Sync/Data Rebuild. No text or prompts about formatting. I'm genuinely lost as to how it could be formatting the data disk instead of doing what it says (Data-Rebuild). And I have big doubts it actually formatted. There's no way 80 GB of data could pop in in an instant. My computer can barely push 100 MB/s during backups.
campfred Posted June 5, 2020 Author Even when I look at my notifications since the day the data disk was reported as Disabled by F.C.P. (on 2020-05-31), there is no mention of formatting. Only rebuilds that were attempted and failed while I was asleep. There must be something I never saw happen and I don't know where to look for it. Notifications.log
JorgeB Posted June 5, 2020 I'm sorry but without all the logs from when the problem started (they start over after any reboot) I don't have other ideas on what could have happened.
campfred Posted June 5, 2020 Author 1 hour ago, johnnie.black said: I'm sorry but without all the logs from when the problem started (they start over after any reboot) I don't have other ideas on what could have happened. Yeah. I don't blame ya. Like I said in the beginning, I shut down the server to check on the cables and everything, which apparently resets the diagnostics (would be really cool if it didn't though!). I was hoping that maybe there was a way to check what the parity is holding. If that was possible, I would know once and for all whether rebuilding onto the new drive will spend 11 days rebuilding 3 TB of nothing or will actually restore my data, and I wouldn't need to keep this post open for weeks.
JorgeB Posted June 5, 2020 You could always unassign the disk and check the emulated disk data before rebuilding, but I see no reason for it to be any different from the current one.
campfred Posted June 5, 2020 Author Just now, johnnie.black said: You could always unassign the disk and check the emulated disk data before rebuilding, but I see no reason for it to be any different from the current one. Well, that's where I'm at actually. Once the rebuild failed, I left it unassigned (because the drive has probably actually gone bad) and that's when the indicators started to show only 80 GB used instead of 2.8 TB like it was before the rebuild. Since I saw that, I stopped the array so that no more corruption is done.
campfred Posted June 7, 2020 Author (edited) Just got the new hard drive in and the data rebuild is in progress. One encouraging sign is that the amount of storage used on the third data disk is going up even though I have nothing other than the rebuild writing to the drive (it's gone up from 80 to 117 GB in the first minutes already). So, the data might still be there. However, the rebuild is even slower than before (it was taking about 10 days before at ~10-20 MB/s, now it's 50-100 days at ~1-2 MB/s), which I find very weird considering I have nothing creating drive activity other than the rebuild. Could it be caused by putting in a different size of hard drive (going from 3 to 8 TB)? Edited June 7, 2020 by campfred
trurl Posted June 7, 2020 Rebuilding to 8TB should only take about a day. Post diagnostics
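For reference, the speeds and durations quoted in the two posts above can be checked with simple arithmetic. This is only a rough sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant sustained speed, and the helper name `rebuild_days` is made up for illustration.

```python
def rebuild_days(size_tb: float, speed_mb_s: float) -> float:
    """Days needed to process size_tb terabytes at a constant speed_mb_s.

    Assumes decimal units: 1 TB = 1e12 bytes, 1 MB = 1e6 bytes.
    """
    size_bytes = size_tb * 1e12
    seconds = size_bytes / (speed_mb_s * 1e6)
    return seconds / 86400  # 86400 seconds per day


# Speeds mentioned in the thread for the 8 TB disk, plus 100 MB/s for comparison.
for speed in (1, 2, 100):
    print(f"8 TB at {speed} MB/s: {rebuild_days(8, speed):.1f} days")
```

At 1-2 MB/s an 8 TB rebuild lands in the 50-100 day range reported above, while finishing in about a day needs an average somewhere around 90-100 MB/s, so the slowdown points at a problem rather than at the larger disk.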
campfred Posted June 7, 2020 Author 1 minute ago, trurl said: Rebuilding to 8TB should only take about a day. Post diagnostics Hmm...Okay! Here are the diagnostics! alfred-diagnostics-20200607-1654.zip
JorgeB Posted June 8, 2020 12 hours ago, campfred said: (it's gone up from 80 to 117 GB in the first minutes already). That's not data. The larger the disk, the more space is used by XFS metadata, and data never changes during the rebuild: the data you see with the disk emulated before starting the rebuild is the same you'll see after the rebuild completes. There are constant ATA errors with the parity disk, check cables.
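To illustrate why the emulated disk and the rebuilt disk show the same contents, here is a toy sketch that assumes single parity computed as a byte-wise XOR across the data disks. The four-byte "disks" and the helper `xor_blocks` are invented for the example; it is not how the array code is actually written.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy stand-ins for three data disks (hypothetical contents).
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x01\x02\x03\x04"
disk3 = b"\xaa\xbb\xcc\xdd"  # the disk that later goes missing

# Parity is kept in sync with whatever is on the data disks.
parity = xor_blocks([disk1, disk2, disk3])

# With disk 3 missing, its contents are emulated from the others plus parity.
emulated_disk3 = xor_blocks([disk1, disk2, parity])
assert emulated_disk3 == disk3

# If disk 3 is overwritten while the array is running (for example by a
# format), parity is updated too, so emulation returns the new contents.
formatted_disk3 = b"\x00\x00\x00\x00"
parity_after = xor_blocks([disk1, disk2, formatted_disk3])
assert xor_blocks([disk1, disk2, parity_after]) == formatted_disk3
```

Under that assumption a rebuild just writes the emulated contents to the replacement disk, so it cannot bring back anything that parity no longer describes.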
campfred Posted June 8, 2020 Author 9 hours ago, johnnie.black said: That's not data. The larger the disk, the more space is used by XFS metadata, and data never changes during the rebuild: the data you see with the disk emulated before starting the rebuild is the same you'll see after the rebuild completes. There are constant ATA errors with the parity disk, check cables. Yeah. It went back down to 117 GB. Bummer. Thankfully, it was only non-sensitive data that was on there that I can't easily pull back from online. About the errors, which attribute should I look for after verifying/replacing the cable?
JorgeB Posted June 8, 2020 22 minutes ago, campfred said: About the errors, which attribute should I look for after verifying/replacing the cable? Check the syslog and make sure there are no more of these:
Jun 7 16:05:24 Alfred kernel: ata5.00: configured for UDMA/33
Jun 7 16:05:24 Alfred kernel: ata5: EH complete
Jun 7 16:05:25 Alfred kernel: ata5.00: exception Emask 0x10 SAct 0x8303e00 SErr 0x4090000 action 0xe frozen
Jun 7 16:05:25 Alfred kernel: ata5.00: irq_stat 0x00400040, connection status changed
Jun 7 16:05:25 Alfred kernel: ata5: SError: { PHYRdyChg 10B8B DevExch }
Jun 7 16:05:25 Alfred kernel: ata5.00: failed command: READ FPDMA QUEUED
Jun 7 16:05:25 Alfred kernel: ata5.00: cmd 60/08:48:18:ec:00/00:00:00:00:00/40 tag 9 ncq dma 4096 in
Jun 7 16:05:25 Alfred kernel: res 40/00:68:b0:f6:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Jun 7 16:05:25 Alfred kernel: ata5.00: status: { DRDY }
Jun 7 16:05:25 Alfred kernel: ata5.00: failed command: READ FPDMA QUEUED
Jun 7 16:05:25 Alfred kernel: ata5.00: cmd 60/38:50:20:ec:00/05:00:00:00:00/40 tag 10 ncq dma 684032 in
Jun 7 16:05:25 Alfred kernel: res 40/00:68:b0:f6:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Jun 7 16:05:25 Alfred kernel: ata5.00: status: { DRDY }
Jun 7 16:05:25 Alfred kernel: ata5.00: failed command: READ FPDMA QUEUED
Jun 7 16:05:25 Alfred kernel: ata5.00: cmd 60/18:58:58:f1:00/00:00:00:00:00/40 tag 11 ncq dma 12288 in
Jun 7 16:05:25 Alfred kernel: res 40/00:68:b0:f6:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Jun 7 16:05:25 Alfred kernel: ata5.00: status: { DRDY }
Jun 7 16:05:25 Alfred kernel: ata5.00: failed command: READ FPDMA QUEUED
Jun 7 16:05:25 Alfred kernel: ata5.00: cmd 60/40:60:70:f1:00/05:00:00:00:00/40 tag 12 ncq dma 688128 in
Jun 7 16:05:25 Alfred kernel: res 40/00:68:b0:f6:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Jun 7 16:05:25 Alfred kernel: ata5.00: status: { DRDY }
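A quick way to confirm that such entries have stopped after a cable swap is to count them per ATA port. The sketch below is one possible approach, not an official tool: the regular expression only covers the three message types visible in the excerpt above (exception, SError, failed command), and the function name `count_ata_errors` is made up. The sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Matches kernel messages like "ata5.00: failed command: ..." or
# "ata5: SError: ..." and captures the port name (e.g. "ata5").
ATA_ERROR = re.compile(
    r"kernel: (ata\d+)(?:\.\d+)?: .*(failed command|exception Emask|SError)"
)


def count_ata_errors(lines):
    """Return a Counter of error-looking kernel messages per ATA port."""
    counts = Counter()
    for line in lines:
        match = ATA_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines taken from the syslog excerpt in this thread.
sample = [
    "Jun 7 16:05:24 Alfred kernel: ata5.00: configured for UDMA/33",
    "Jun 7 16:05:24 Alfred kernel: ata5: EH complete",
    "Jun 7 16:05:25 Alfred kernel: ata5.00: exception Emask 0x10 SAct 0x8303e00 SErr 0x4090000 action 0xe frozen",
    "Jun 7 16:05:25 Alfred kernel: ata5: SError: { PHYRdyChg 10B8B DevExch }",
    "Jun 7 16:05:25 Alfred kernel: ata5.00: failed command: READ FPDMA QUEUED",
]

print(count_ata_errors(sample))
```

On a live server the same function could be fed the lines of the current syslog file (its path is an assumption that depends on the system); an empty result after some disk activity would suggest the link is stable again.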
campfred Posted June 8, 2020 Author Okay. I replaced the cable just to be sure and no failed commands are appearing in the system logs anymore. However, disk 3 had an unmountable file system. I attempted a repair both from the webGUI and from an SSH session using the instructions on the wiki for checking the disk's filesystem, and it seems to have made it usable again (still not my 2.8 TB, but I gave up on them at this point). I'll keep an eye out for ATA errors for a few days and see how it goes. If nothing else appears in the coming days, I'll edit the title appropriately to close the post. Should I put Solved even though I wasn't able to get my data back?
JorgeB Posted June 8, 2020 13 minutes ago, campfred said: Should I put Solved even though I wasn't able to get my data back? It's up to you.