XFS File System Corruption - Safest way to Fix?


jsx12


Hello!

 

I have a custom-built 70+ TB unRAID Pro (6.4?) server that I use to back up a variety of systems where I work.

 

It has these specs:

Xeon E5-2630 V4
32GB DDR4 2133 MHz ECC
Supermicro X10SRL-F
Supermicro 846 24 Bay Chassis
2 Intel DC S3710 400GB, Dual Cache
2 WD Gold 8TB 7200 RPM, Dual Parity
10 WD Red 8TB 5400 RPM, Data

-Parity was last run on the 15th of January 2018, three days ago.
-The system sits on a Windows Domain network, and is joined to the AD.
-The systems that primarily access the unRAID server are Windows Server 2016 Standard and Windows Server 2012 R2 Standard.

 

I was recently made aware of a scheduled backup that did not complete, only to find out that one of the disks in the array had a corrupted XFS file-system.
The log is full of I/O errors trying to write to that disk. I unfortunately am not at liberty to provide logs, as they contain customer data that would need to be sanitized.

I stopped the array, re-started it in maintenance mode, and ran a "Check File-system Status" on that disk with the unRAID-recommended -nv flags.
The log produced by this is something like 2200 lines long, specifying many, many files. Again, I am not at liberty to provide the entire log, but I can provide the end results:

XFS_REPAIR Summary    Thu Jan 18 08:24:33 2018

Phase        Start        End        Duration
Phase 1:    01/18 08:17:12    01/18 08:17:13    1 second
Phase 2:    01/18 08:17:13    01/18 08:17:17    4 seconds
Phase 3:    01/18 08:17:17    01/18 08:21:57    4 minutes, 40 seconds
Phase 4:    01/18 08:21:57    01/18 08:21:57
Phase 5:    Skipped
Phase 6:    01/18 08:21:57    01/18 08:24:33    2 minutes, 36 seconds
Phase 7:    01/18 08:24:33    01/18 08:24:33

Total run time: 7 minutes, 21 seconds

 

It appears as though the file-system on this disk was severely corrupted, as upon restarting the array, the drive shows "Unmountable - No File System." Disk contents from that drive are also no longer available through drive emulation, after removing the drive from the array. Upon attempting to re-add the drive to the array, I am greeted with "Drive contents will be OVERWRITTEN." I did not re-add the drive to the array in fear of losing what was on that disk. I then simply shut the system down and pulled the suspect drive.

 

I have ordered another WD Red 8TB to DD clone this drive with Ubuntu 16.04 and will attempt an XFS_Repair on the cloned drive, as this seems to be the safest way to go about doing this.

 

Let me know if what I have planned out in the following lines looks okay:

1 - Plug in both the new and old drives in a basic sata controller on a separate system running Ubuntu 16.04

2 - Clone the old drive (let's say /dev/sdc) to the new drive (let's say /dev/sdd):
	
	sudo dd if=/dev/sdc of=/dev/sdd

3 - Shutdown after the clone completes and remove the old drive to keep it safe.

4 - Reboot, force an XFS_Repair of the new drive (/dev/sdd) with:
    
	xfs_repair -L /dev/sdd1

5 - If the XFS_Repair is successful, I am guessing I am ready to pop the (new) drive back in the server?
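
A minimal sketch of the clone step with a verification pass added, assuming GNU dd, cmp and blockdev on the Ubuntu machine. The function name clone_and_verify is hypothetical, and the device names in the usage comment are the "let's say" names from the plan above, not confirmed ones.

```shell
# Hypothetical helper: copy SRC to DST with dd, then compare the first
# SIZE bytes of both so a silent copy error does not go unnoticed.
clone_and_verify() {
    src=$1
    dst=$2

    # Size of the source in bytes (block device or regular image file).
    if [ -b "$src" ]; then
        size=$(blockdev --getsize64 "$src") || return 1
    else
        size=$(( $(wc -c < "$src") ))
    fi

    # A 1 MiB block size is usually much faster than dd's 512-byte default.
    dd if="$src" of="$dst" bs=1M || return 1
    sync

    # Only the first SIZE bytes are compared, since the target may be larger.
    cmp -n "$size" "$src" "$dst"
}

# Usage (as root, double-check the device names first):
#   clone_and_verify /dev/sdc /dev/sdd && echo "clone matches source"
```

The compare pass reads both drives again, so it roughly doubles the total time; on a source that is throwing read errors a tool such as GNU ddrescue is commonly used instead of plain dd.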

 

I might also figure a way to recover whatever was on that disk to another network share in the event I am unable to add it back to the array.

 

After starting the server with the new drive is where I begin to get a little fuzzy, as I am not sure how unRAID acts with a now-foreign drive with existing data on it. I am pretty sure there is no way to recover from this without fully rebuilding parity, as the existing parity yields a corrupted drive anyway? Would it be best to create a completely new config from here, and rebuild parity from the new config?


Thank you for your help!

Edited by jsx12
Link to comment
30 minutes ago, jsx12 said:

It appears as though the file-system on this disk was severely corrupted, as upon restarting the array, the drive shows "Unmountable - No File System." Disk contents from that drive are also no longer available through drive emulation, after removing the drive from the array. Upon attempting to re-add the drive to the array, I am greeted with "Drive contents will be OVERWRITTEN." I did not re-add the drive to the array in fear of losing what was on that disk. I then simply shut the system down and pulled the suspect drive.

 

Rebuilding can't fix filesystem corruption.

 

You should store the new cloned disk and repair the old one using unRAID. Do it on the mdX device to keep parity in sync. If you try to add the cloned disk, unRAID will complain it's the wrong disk, and if it was checked on Ubuntu, parity will become invalid.

 

 

Link to comment

Thank you for your response johnnie.black!

 

So the procedure to do that (after a DD clone for safe keeping) would be something like this?:

Let's say the suspect drive is (/dev/md5).

1 - Re-install the suspect drive in the bay it was in.
2 - Start the server, log in to the GUI, stop the array, then re-start it in maintenance mode.
3 - SSH into the server.

4 - Run:
    xfs_repair -v -L /dev/md5

5 - After the repair, start the array normally so the disk mounts, then run:
    ls /mnt/disk5/lost+found

6 - And check for orphaned or corrupted files in that lost+found folder.
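
A cautious ordering of the repair commands can be sketched as below. This is only an illustration: print_repair_plan is a hypothetical helper that prints the commands instead of running them, and the /dev/mdN naming is taken from this thread (unRAID 6.4). The flags used (-n no-modify, -v verbose, -L zero the log) are standard xfs_repair options.

```shell
# Hypothetical helper: print the xfs_repair commands for unRAID data
# disk N, from least to most destructive. Nothing is executed.
print_repair_plan() {
    disk=$1
    case "$disk" in
        ''|*[!0-9]*) echo "usage: print_repair_plan DISK_NUMBER" >&2; return 1 ;;
    esac
    dev="/dev/md${disk}"

    echo "xfs_repair -n -v ${dev}"   # 1. read-only check, changes nothing
    echo "xfs_repair -v ${dev}"      # 2. normal repair, replays nothing, keeps the log
    echo "xfs_repair -v -L ${dev}"   # 3. last resort: zero the log, then repair
}

# Usage (array started in maintenance mode):
#   print_repair_plan 5
```

Running against the md device rather than the raw sdX device is what keeps parity updated, as noted in the reply above.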

Beware, I have a lot of questions and general information below:

 

 

I am still not entirely sure how this XFS corruption happened in the first place. The server is on a 3000VA Eaton UPS with an EBM shared with only one other Windows Server. Run-time is well over an hour, and there has not been a single outage in 3-4 months according to Eaton's IPM. All of the drives are on a SAS2 expander back-plane, exposed through an LSI 2008 HBA. We use a Xeon E5 with ECC DDR4 for multi-bit ECC protection.


Is it common for XFS file-systems to just randomly "bite the dust," corrupt, and lose data like this? Or should I toss out that suspect WD Red? The system was built in Jan 2017, the drives date to that time. Nothing has been physically touched on this server, to my knowledge, since it was built. It has been reliable since then. The system was recently updated to the newest version with no apparent issues. Luckily we use this for backup only, but having downtime like this is not right.


Would moving to BTRFS help this at all, or was this just a fluke? Some of the files I see mentioned in that log are quite old, likely from around the time the server was built. Is it possible the XFS file-system on that disk has been silently corrupt since day one? How could I prevent this from happening in the future? Do file system checks periodically on all drives?

One other thing I notice in the log is that the I/O errors on that disk happened right after the cache drive(s) were filled up, and the backup proceeded to write directly to the array. Would this somehow exacerbate file system corruption? I can upgrade to 800GB or even 1.2TB SSDs if this would help at all. The nightly payload of new data is something like 600GB, and maybe 66% of that is over-writing old data.
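
As a rough check of the numbers above, the amount of a nightly payload that spills past the cache can be worked out as follows. This is a sketch under two assumptions that the post does not state: the two 400GB SSDs are mirrored (so about 400GB is usable), and the cache is empty when the backup starts. The function name is hypothetical.

```shell
# Hypothetical helper: GB of a nightly payload that would be written
# directly to the array once the cache is full.
cache_overflow_gb() {
    payload_gb=$1
    cache_gb=$2
    if [ "$payload_gb" -gt "$cache_gb" ]; then
        echo $(( payload_gb - cache_gb ))
    else
        echo 0
    fi
}

# Usage:
#   cache_overflow_gb 600 400   # current SSDs
#   cache_overflow_gb 600 800   # after an upgrade to 800GB SSDs
```

Under those assumptions, roughly 200GB of a 600GB night goes straight to the array today, and an 800GB pool would absorb the whole payload.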

 

Thank you again for your help!

Edited by jsx12
Link to comment

Yes, also forgot to say since it's kind of obvious that instead of removing the disk and cloning using another system you can just rebuild it, and in that case, save the old disk and run xfs_repair on the rebuilt one, and like itimpi mentioned you can use the GUI.

 

1 hour ago, jsx12 said:

Is it common for XFS file-systems to just randomly "bite the dust," corrupt, and lose data like this?

Not really, unclean shutdowns are the number one reason, I'm assuming you checked SMART for that disk and all looks good, maybe also a good idea to run an extended SMART test even if all look good.
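
A small sketch for pulling individual SMART attributes out of smartctl output, assuming smartmontools is installed and the usual ten-column attribute table printed by "smartctl -A". The function name and the /dev/sdX placeholder are hypothetical.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# raw value (10th column) of the named attribute.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Usage (replace /dev/sdX with the suspect disk):
#   smartctl -t long /dev/sdX          # start an extended self-test
#   smartctl -l selftest /dev/sdX      # read the result once it finishes
#   smartctl -A /dev/sdX | smart_raw_value Reallocated_Sector_Ct
#   smartctl -A /dev/sdX | smart_raw_value Current_Pending_Sector
```

Non-zero reallocated or pending sector counts are generally treated as a warning sign even when the overall health check passes.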

 

1 hour ago, jsx12 said:

Would moving to BTRFS help this at all, or was this just a fluke?

I use btrfs on all my servers, mostly because of the checksums, and I believe it's stable especially for single disk usage, like unRAID data disks, still won't say that btrfs is more stable than xfs, the opposite is the consensus, so maybe a fluke but it's difficult to guess, especially without seeing long time diagnostics.

 

1 hour ago, jsx12 said:

One other thing I notice in the log is that the I/O errors on that disk happened right after the cache drive(s) were filled up, and the backup proceeded to write directly to the array. Would this somehow exacerbate file system corruption?

Are these ATA errors or cache full errors? maybe post an excerpt from the syslog, if they are ATA errors most likely the reason for the problem.
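
One way to answer that question without posting the whole syslog is to count the lines that look like device-level errors. This is a sketch only: the function name is hypothetical, and the two patterns (kernel "ataN:" messages and "I/O error") are assumptions about what such lines typically contain, so they may miss errors reported by a SAS HBA driver.

```shell
# Hypothetical helper: read a syslog on stdin and print how many lines
# look like ATA or block-layer I/O errors.
count_device_errors() {
    # `|| true` keeps the exit status at 0 when grep finds no matches.
    grep -Ec 'ata[0-9]+(\.[0-9]+)?: |I/O error' || true
}

# Usage:
#   count_device_errors < /var/log/syslog
```

A count of zero alongside "cache full" style messages would point away from the disk itself.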

Link to comment
On 1/19/2018 at 2:36 AM, johnnie.black said:

Yes, also forgot to say since it's kind of obvious that instead of removing the disk and cloning using another system you can just rebuild it, and in that case, save the old disk and run xfs_repair on the rebuilt one, and like itimpi mentioned you can use the GUI.

 

Not really, unclean shutdowns are the number one reason, I'm assuming you checked SMART for that disk and all looks good, maybe also a good idea to run an extended SMART test even if all look good.

 

I use btrfs on all my servers, mostly because of the checksums, and I believe it's stable especially for single disk usage, like unRAID data disks, still won't say that btrfs is more stable than xfs, the opposite is the consensus, so maybe a fluke but it's difficult to guess, especially without seeing long time diagnostics.

 

Are these ATA errors or cache full errors? maybe post an excerpt from the syslog, if they are ATA errors most likely the reason for the problem.

 

Hello.

 

I came in to check up on the DD clone and see what I could repair on the server.

 

The suspect drive passed both the quick and extended SMART tests. The server was last rebooted for a v6.4 update maybe a week ago or more. Last power down before that was in August of 2017, apparently to move to a different UPS zone. No other power disruptions since. Not a brownout, not a reason to trip over to battery.

 

I believe they were legitimate ATA errors, as the Windows server reports a device "I/O" error copying to unRAID. They suspiciously popped up right after the cache was filled up.

 

The dd clone took something like 36 hours at 61 MB/s onto a spare HGST 7200 RPM Enterprise drive. At least that was successful.
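
The 36-hour figure is consistent with the drive size and transfer rate. A quick sketch of the arithmetic, assuming decimal units (8 TB = 8,000,000,000,000 bytes, 61 MB/s = 61,000,000 bytes per second); the function name is hypothetical.

```shell
# Hypothetical helper: whole hours needed to copy BYTES at RATE bytes/s
# (integer arithmetic, rounded down).
estimate_copy_hours() {
    bytes=$1
    rate=$2
    echo $(( bytes / rate / 3600 ))
}

# Usage:
#   estimate_copy_hours 8000000000000 61000000
```

For an 8TB drive at 61 MB/s this gives 36 hours, matching the observed duration.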

 

Unfortunately, running xfs_repair -L -v I am greeted with (last few lines in phase 6):
 

entry ".." in directory inode 1093044826 points to non-existent inode 6448754485, marking entry to be junked
bad hash table for directory inode 1093044826 (no data entry): rebuilding
rebuilding directory inode 1093044826
entry ".." in directory inode 1093051943 points to non-existent inode 6448754488, marking entry to be junked
bad hash table for directory inode 1093051943 (no data entry): rebuilding
rebuilding directory inode 1093051943
Invalid inode number 0x0
xfs_dir_ino_validate: XFS_ERROR_REPORT

fatal error -- couldn't map inode 1124413091, err = 117

It appears as though there is SOME file or directory that simply cannot be re-mapped.

 

What would be the next step in my attempt to repair this?

 

Would deleting -that- inode help xfs_repair complete, or am I asking for trouble doing this?

 

I do have a complete clone of this drive if things go south with unRAID, but I am guessing that clone would have issues rebuilding XFS with any operating system or PC.

 

EDIT: Could this be caused by something as simple as a file copied to unRAID with "illegal" characters in its name? I know there were some old 1990s-era DOS files that were exported from an old machine someone had gone through that had some crazy names.

 

Any help you can provide is much appreciated.

 

Thank you!

 

 

Edited by jsx12
Link to comment
3 minutes ago, johnnie.black said:

It's difficult to help without the diagnostics and the output of xfs_repair, but if -L doesn't work your best bet is to ask an xfs maintainer for help on the xfs mailing list.

 

Here is what pops up on boot for that specific drive (sanitized corrupted metadata buffer that had odd characters and a name in it):

Jan 21 07:38:13 SRV58302 kernel: XFS (md5): Mounting V5 Filesystem
Jan 21 07:38:13 SRV58302 kernel: XFS (md5): Starting recovery (logdev: internal)
Jan 21 07:38:14 SRV58302 kernel: XFS (md5): Metadata corruption detected at _xfs_buf_ioapply+0x95/0x38a [xfs], xfs_allocbt block 0x15d514890
Jan 21 07:38:14 SRV58302 kernel: XFS (md5): Unmount and run xfs_repair
Jan 21 07:38:14 SRV58302 kernel: XFS (md5): xfs_do_force_shutdown(0x8) called from line 1367 of file fs/xfs/xfs_buf.c.  Return address = 0xffffffffa03d1082
Jan 21 07:38:14 SRV58302 kernel: XFS (md5): Corruption of in-memory data detected.  Shutting down filesystem
Jan 21 07:38:14 SRV58302 kernel: XFS (md5): Please umount the filesystem and rectify the problem(s)
Jan 21 07:38:14 SRV58302 kernel: XFS (md5): log mount/recovery failed: error -117
Jan 21 07:38:14 SRV58302 kernel: XFS (md5): log mount failed
Jan 21 07:38:14 SRV58302 root: mount: /mnt/disk5: mount(2) system call failed: Structure needs cleaning.
Jan 21 07:38:14 SRV58302 emhttpd: shcmd (73): exit status: 32
Jan 21 07:38:14 SRV58302 emhttpd: /mnt/disk5 mount error: No file system
Jan 21 07:38:14 SRV58302 emhttpd: shcmd (74): umount /mnt/disk5
Jan 21 07:38:14 SRV58302 root: umount: /mnt/disk5: not mounted.
Jan 21 07:38:14 SRV58302 emhttpd: shcmd (74): exit status: 32
Jan 21 07:38:14 SRV58302 emhttpd: shcmd (75): rmdir /mnt/disk5
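
In the excerpt above, error -117 is the Linux errno EUCLEAN, whose message is the "Structure needs cleaning" text returned by mount. A minimal sketch for spotting which devices report XFS metadata corruption in a syslog follows; the function name is hypothetical and the pattern is based only on the message format shown in this excerpt.

```shell
# Hypothetical helper: read a syslog on stdin and print each device for
# which the kernel logged "XFS (<dev>): Metadata corruption detected".
xfs_corrupt_devices() {
    sed -n 's/.*XFS (\([^)]*\)): Metadata corruption detected.*/\1/p' | sort -u
}

# Usage:
#   xfs_corrupt_devices < /var/log/syslog
```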

I have also (hopefully) sanitized and attached the xfs_repair log. Whatever happened here seems pretty severe. Could I have run into some bug with XFS file-systems?

 

This system, more specifically, this specific disk and its file-system worked perfectly up until the point where that single I/O error popped up, and now there are all of these issues? Does XFS "put up" with a certain degree of corruption until it "pulls the plug" and no longer allows the user to mount the partition?

 

I'll see if I can find someone who knows the inner workings of XFS.

 

Are you aware of anyone else on unRAID forums that had a similar issue where xfs_repair could not proceed? I really don't want to pin this on unRAID, but I have to give a reason for the unscheduled downtime. We might have to build another (expensive) Windows server to back up the others if my answer is XFS corruption, especially if I cannot fix this.

 

Thank you again!

xfs_repair.txt

Link to comment
Are you aware of anyone else on unRAID forums that had a similar issue where xfs_repair could not proceed?

Though rare, I've seen at least 3 or 4 cases where xfs_repair couldn't fix the filesystem. Your filesystem looks very damaged: there's metadata corruption and the primary superblock is also damaged. xfs_repair has very few options and is usually very simple to use, but when it fails there's really nothing a normal user can do. If you can't restore from backups, IMO your best option would be the xfs mailing list, and even if they can help you'll possibly end up with a lot of corrupt/lost files.

 

 

Link to comment

Hi Johnnie.

 

Fortunately, the bulk of what was on this server acted as a backup to three other Windows Servers, but I am aware of a few users who had realized the size of this array and dumped a whole lot of files on it. I am guessing those files are likely gone for good if they were on that one corrupted disk unless I can find a fix.

 

I contacted linux-xfs@vger.kernel.org with more or less a plea for help. I am not well versed with XFS at all, but I have reason to believe this happened due to a bug somewhere. I had been mem-testing that server shortly after I pulled the drive the first time, and everything checked out. There has not been a single power outage in months, and my UPS can prove that. About the only thing I cannot directly test is the SAS2 back-plane, but that has been working flawlessly for the past year. Server/IPMI event logs are completely free of any memory or ECC-related errors. I'm at a bit of a loss, and just want to blame the system as a whole. If I can't trust it, I can't use it.

 

Could this issue somehow be related to recently upgrading the server to unRAID 6.4 from 6.3.5? I don't see where unRAID made changes to anything significant when it comes to XFS, but it seems suspicious that this would happen relatively shortly after upgrading. The cache drives were hammered almost every day for the past few months, before the upgrade.

 

Finally, could you recommend a decent recovery software that could potentially recover a few files or at least see what was on that disk? Windows/Linux does not matter, just looking for something that can tell me WHAT was lost at this point, so I can let users know what they (likely) lost. I've used UFSExplorer before on client PCs, but I don't know if that is recommended at all anymore.

 

Thank you again for all your help!

Edited by jsx12
Link to comment
1 hour ago, jsx12 said:

Could this issue somehow be related to recently upgrading the server to unRAID 6.4 from 6.3.5? I don't see where unRAID made changes to anything significant when it comes to XFS, but it seems suspicious that this would happen relatively shortly after upgrading. The cache drives were hammered almost every day for the past few months, before the upgrade.

v6.4 includes a newer kernel, so it also likely includes some xfs changes; whether that's related is difficult to say.

 

1 hour ago, jsx12 said:

I've used UFSExplorer before on client PCs, but I don't know if that is recommended at all anymore.

That's what we usually recommend, and has been used successfully before, mostly to recover data from accidentally formatted xfs disks.

Link to comment

Okay. I am back with some good news.

 

I was contacted by Brian Foster (Red Hat), who advised that "an inode read verifier is running in a context that gets in the way of repair doing its job." This is apparently not an issue with xfs_repair/xfsprogs version 4.10, which is the latest version that does not include changes to the way this verifier works (or even exists, not entirely sure). unRAID 6.4 (Slackware 14.2) reports 4.13.1 for "xfs_repair -V", which of course is more recent and includes these changes.
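
A sketch for checking whether an installed xfsprogs is newer than the 4.10 series described above. It assumes GNU sort with -V (version sort), and it treats 4.11 as the first affected release purely because 4.10 was named as the last unaffected one; that threshold and the function name are assumptions, not something confirmed by the XFS developers in this thread.

```shell
# Hypothetical helper: succeed if INSTALLED is at least MINIMUM,
# comparing dotted version strings with GNU sort -V.
version_at_least() {
    installed=$1
    minimum=$2
    lowest=$(printf '%s\n%s\n' "$installed" "$minimum" | sort -V | head -n 1)
    [ "$lowest" = "$minimum" ]
}

# Usage (xfs_repair -V is expected to print "xfs_repair version X.Y.Z"):
#   ver=$(xfs_repair -V | awk '{ print $3 }')
#   if version_at_least "$ver" 4.11; then
#       echo "xfsprogs $ver is newer than the 4.10 series"
#   fi
```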

 

This was the key to repairing this drive, at least outside unRAID. Using Fedora 26 (which has xfsprogs v4.10 native), I was able to run xfs_repair -L and subsequent standard xfs_repair calls to clean the partition up completely. I then was able to offload everything from it, and determine what is/was a backup and what is user data (using a -nv log). The surprising thing is that there were over 18000 files in various "inode folders" in lost+found, all of which, as far as I have found so far, are not corrupt. Also, before I successfully ran xfs_repair v4.10, I tried UFSExplorer, and that worked exactly as I remember: most everything looked to be present, and there were quite a few user files on there that probably should not have been there without a backup.
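
Taking stock of a large lost+found after a repair can be sketched with find. The function names are hypothetical and the mount point in the usage comment is only an example; zero-byte files are listed separately because they are an easy first hint of content that did not survive.

```shell
# Hypothetical helper: number of regular files under a lost+found tree.
count_recovered_files() {
    echo $(( $(find "$1" -type f | wc -l) ))
}

# Hypothetical helper: recovered files that are empty (zero bytes).
list_empty_recovered_files() {
    find "$1" -type f -size 0
}

# Usage (example mount point):
#   count_recovered_files /mnt/recovered/lost+found
#   list_empty_recovered_files /mnt/recovered/lost+found
```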

 

I am guessing the easiest and safest thing to do with this unRAID server now that everything is off that drive is to just format that virtual disk, attach it to a new physical disk, check parity, and then move the files back. Unless it would be easier to downgrade unRAID and/or install a different version of xfsprogs? I am not entirely sure on this.

 

Anyway, I am swapping a new drive into that specific bay, and putting the corrupt drive (freshly zeroed/formatted) into another bay to see if either becomes corrupt as time goes on. From what I have read, 99% of the time this type of corruption happens due to sudden power loss, so there may be something sinister with either that drive or that bay/backplane. If either goes corrupt again, I'll know what to blame.

 

Hopefully this helps someone someday. It seems like such a rare and/or new bug I ran into.

 

Thank you for the help!

Edited by jsx12
Link to comment
8 minutes ago, jsx12 said:

This was the key to repairing this drive, at least outside unRAID. Using Fedora 26 (which has xfsprogs v4.10 native), I was able to run xfs_repair -L and subsequent standard xfs_repair calls to clean the partition up completely. I then was able to offload everything from it, and determine what is/was a backup and what is user data (using a -nv log). The surprising thing is that there were over 18000 files in various "inode folders" in lost+found, all of which, as far as I have found so far, are not corrupt. Also, before I successfully ran xfs_repair v4.10, I tried UFSExplorer, and that worked exactly as I remember: most everything looked to be present, and there were quite a few user files on there that probably should not have been there without a backup.

That's good news and thanks for posting, it might help someone in the future with a similar issue.

 

10 minutes ago, jsx12 said:

I am guessing the easiest and safest thing to do with this unRAID server now that everything is off that drive is to just format that virtual disk, attach it to a new physical disk, check parity, and then move the files back. Unless it would be easier to downgrade unRAID and/or install a different version of xfsprogs? I am not entirely sure on this.

Sounds good to me, definitely agree on the format. You should be fine on the latest release, and if there's a problem with xfs_repair it should be fixed soon in an upcoming release.

Link to comment

I’m a little concerned you seemingly found out about the problem not from UnRAID, but from the fact that a backup didn’t complete (along with your comment that this could have been happening for some time). I don’t currently run UnRAID but am thinking about switching. Can a user have some sort of automated notification set up to shoot out an email in the case of some sort of disk write (or SMART) error?

Link to comment
1 hour ago, jjjman321 said:

I’m a little concerned you seemingly found out about the problem not from UnRAID, but from the fact that a backup didn’t complete (along with your comment that this could have been happening for some time). I don’t currently run UnRAID but am thinking about switching. Can a user have some sort of automated notification set up to shoot out an email in the case of some sort of disk write (or SMART) error?

Good point. In fact, unRAID does have notifications for these and other events.

 

@jsx12

Do you have Notifications set up?

Link to comment

It is just the first thing I got to coming in at 5 a.m., checking nightly backup logs and swapping over LTOs. I am set to not be disturbed by emails as I get literally 100s of them daily, 24/7 (85% work related too, it IS a problem being on multiple forwarding lists with facilities worldwide). I have it on a to-do list to set this (and a few other servers) up with our dedicated internal alert email that would not see as much crap as I get with the other one, and get a separate device or app to alert me. Typically only a call or a text will get me up before 3 a.m. to fix something.

 

unRAID did its job and did send an email, but it was quickly buried.

Edited by jsx12
Link to comment
2 hours ago, jsx12 said:

Typically only a call or a text will get me up before 3 a.m. to fix something.

 

unRAID did its job and did send an email, but it was quickly buried.

Perhaps you should let unraid send you texts? It's not difficult with most providers, typically an email to <phone number>@providertextingaddress.com will do the trick.

You can find your specific provider here. http://www.emailtextmessages.com/
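
The email-to-text address described above can be assembled like this. It is only a sketch: the function name is hypothetical, and providertextingaddress.com is the placeholder domain from the post, to be replaced with the real gateway domain for your carrier.

```shell
# Hypothetical helper: build an email-to-SMS gateway address from a
# phone number (any formatting) and the carrier's gateway domain.
sms_gateway_address() {
    digits=$(printf '%s' "$1" | tr -cd '0-9')   # keep digits only
    printf '%s@%s\n' "$digits" "$2"
}

# Usage (placeholder number and domain):
#   sms_gateway_address "555-123-4567" providertextingaddress.com
```

The resulting address can then be entered as an additional recipient in the unRAID notification settings.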

Link to comment
  • 3 months later...
On 1/25/2018 at 10:28 AM, jsx12 said:

It is just the first thing I got to coming in at 5 a.m., checking nightly backup logs and swapping over LTOs. I am set to not be disturbed by emails as I get literally 100s of them daily, 24/7 (85% work related too, it IS a problem being on multiple forwarding lists with facilities worldwide). I have it on a to-do list to set this (and a few other servers) up with our dedicated internal alert email that would not see as much crap as I get with the other one, and get a separate device or app to alert me. Typically only a call or a text will get me up before 3 a.m. to fix something.

 

unRAID did its job and did send an email, but it was quickly buried.

JSX12 Sent you a PM

Link to comment