jsx12

Members
  • Posts: 7
Everything posted by jsx12

  1. It is just the first thing I get to when coming in at 5 a.m., checking nightly backup logs and swapping over LTOs. I am set to not be disturbed by emails, as I get literally hundreds of them daily, 24/7 (85% work related, too; it IS a problem being on multiple forwarding lists with facilities worldwide). I have it on a to-do list to set this server (and a few others) up with our dedicated internal alert email, which would not see as much crap as the one I use now, and to get a separate device or app to alert me. Typically only a call or a text will get me up before 3 a.m. to fix something. unRAID did its job and did send an email, but it was quickly buried.
  2. Okay, I am back with some good news. I was contacted by Brian Foster (Red Hat), who advised that "an inode read verifier is running in a context that gets in the way of repair doing its job." This is apparently not an issue with xfs_repair/xfsprogs version 4.10, which is the latest version that does not include the changes to the way this verifier works (or to whether it exists at all; I am not entirely sure). unRAID 6.4 (Slackware 14.2) reports 4.13.1 from "xfs_repair -V", which is more recent and includes those changes. This was the key to repairing this drive, at least outside unRAID.

     Using Fedora 26 (which ships xfsprogs 4.10 natively), I was able to run xfs_repair -L and then subsequent standard xfs_repair passes to clean the partition up completely. I was then able to offload everything from it and determine what was a backup and what was user data (using a -nv log). The surprising thing is that there were over 18,000 files in various "inode folders" in lost+found, and none of the ones I have checked so far are corrupt. Also, before I successfully ran xfs_repair 4.10, I tried UFSExplorer, and that worked exactly as I remember: almost everything looked to be present, and there were quite a few user files on there that probably should not have been there without a backup.

     I am guessing the easiest and safest thing to do with this unRAID server, now that everything is off that drive, is to just format that virtual disk, attach it to a new physical disk, check parity, and then move the files back. Unless it would be easier to downgrade unRAID and/or install a different version of xfsprogs? I am not entirely sure on this. Anyway, I am swapping a new drive into that specific bay and putting the corrupt drive (freshly zeroed/formatted) into another bay to see if either becomes corrupt as time goes on. As I have read, 99% of the time this type of corruption happens due to sudden power loss, so there may be something sinister with either that drive or that bay/backplane. If either goes corrupt again, I'll know what to blame. Hopefully this helps someone someday; it seems like such a rare and/or new bug I ran into. Thank you for the help!
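     For reference, a minimal sketch of the repair sequence described above, run from Fedora 26 with its native xfsprogs 4.10. The device name /dev/sdX1 is a placeholder for the data partition of the (cloned) drive; do not point this at a disk that is still assigned to the array.

        # Confirm the xfsprogs version actually in use (should report 4.10.x here).
        xfs_repair -V

        # Zero the damaged log, then run a normal repair pass until it comes back clean.
        xfs_repair -L /dev/sdX1
        xfs_repair /dev/sdX1

        # Mount the repaired partition and look through lost+found for orphaned files.
        mkdir -p /mnt/recovered
        mount /dev/sdX1 /mnt/recovered
        ls /mnt/recovered/lost+found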
  3. Hi Johnnie. Fortunately, the bulk of what was on this server acted as a backup to three other Windows servers, but I am aware of a few users who had realized the size of this array and dumped a whole lot of files onto it. I am guessing those files are likely gone for good if they were on that one corrupted disk, unless I can find a fix. I contacted [email protected] with more or less a plea for help. I am not well versed with XFS at all, but I have reason to believe this happened due to a bug somewhere.

     I had been mem-testing that server shortly after I pulled the drive the first time, and everything checked out. There has not been a single power outage in months, and my UPS can prove that. About the only thing I cannot directly test is the SAS2 backplane, but that has been working flawlessly for the past year. Server/IPMI event logs are completely free of any memory or ECC-related errors. I'm at a bit of a loss and just want to blame the system as a whole; if I can't trust it, I can't use it.

     Could this issue somehow be related to recently upgrading the server to unRAID 6.4 from 6.3.5? I don't see where unRAID made changes to anything significant when it comes to XFS, but it seems suspicious that this would happen relatively shortly after upgrading. The cache drives were hammered almost every day for the past few months before the upgrade.

     Finally, could you recommend a decent recovery tool that could potentially recover a few files, or at least see what was on that disk? Windows or Linux does not matter; I am just looking for something that can tell me WHAT was lost at this point, so I can let users know what they (likely) lost. I've used UFSExplorer before on client PCs, but I don't know if that is still recommended. Thank you again for all your help!
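     As a low-risk way to see what is still visible on a damaged XFS disk before any repair is attempted, the clone (never the original) can often be mounted read-only with log recovery skipped. This is only a sketch; /dev/sdX1 and the mount point are placeholders, and norecovery means anything still sitting in the unreplayed log will not appear.

        # Read-only mount of the cloned partition, skipping journal replay.
        mkdir -p /mnt/inspect
        mount -o ro,norecovery /dev/sdX1 /mnt/inspect

        # Build a listing of everything visible, to compare against backups later.
        find /mnt/inspect -type f > /tmp/disk5_file_list.txt
        wc -l /tmp/disk5_file_list.txt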
  4. Here is what pops up on boot for that specific drive (I sanitized one corrupted metadata buffer dump that had odd characters and a name in it):

     Jan 21 07:38:13 SRV58302 kernel: XFS (md5): Mounting V5 Filesystem
     Jan 21 07:38:13 SRV58302 kernel: XFS (md5): Starting recovery (logdev: internal)
     Jan 21 07:38:14 SRV58302 kernel: XFS (md5): Metadata corruption detected at _xfs_buf_ioapply+0x95/0x38a [xfs], xfs_allocbt block 0x15d514890
     Jan 21 07:38:14 SRV58302 kernel: XFS (md5): Unmount and run xfs_repair
     Jan 21 07:38:14 SRV58302 kernel: XFS (md5): xfs_do_force_shutdown(0x8) called from line 1367 of file fs/xfs/xfs_buf.c. Return address = 0xffffffffa03d1082
     Jan 21 07:38:14 SRV58302 kernel: XFS (md5): Corruption of in-memory data detected. Shutting down filesystem
     Jan 21 07:38:14 SRV58302 kernel: XFS (md5): Please umount the filesystem and rectify the problem(s)
     Jan 21 07:38:14 SRV58302 kernel: XFS (md5): log mount/recovery failed: error -117
     Jan 21 07:38:14 SRV58302 kernel: XFS (md5): log mount failed
     Jan 21 07:38:14 SRV58302 root: mount: /mnt/disk5: mount(2) system call failed: Structure needs cleaning.
     Jan 21 07:38:14 SRV58302 emhttpd: shcmd (73): exit status: 32
     Jan 21 07:38:14 SRV58302 emhttpd: /mnt/disk5 mount error: No file system
     Jan 21 07:38:14 SRV58302 emhttpd: shcmd (74): umount /mnt/disk5
     Jan 21 07:38:14 SRV58302 root: umount: /mnt/disk5: not mounted.
     Jan 21 07:38:14 SRV58302 emhttpd: shcmd (74): exit status: 32
     Jan 21 07:38:14 SRV58302 emhttpd: shcmd (75): rmdir /mnt/disk5

     I have also (hopefully) sanitized and attached the xfs_repair log. Whatever happened here seems pretty severe. Could I have run into some bug in XFS? This system, and more specifically this disk and its filesystem, worked perfectly up until the point where that single I/O error popped up, and now there are all of these issues. Does XFS "put up" with a certain degree of corruption until it "pulls the plug" and no longer allows the user to mount the partition? I'll see if I can find someone who knows the inner workings of XFS. Are you aware of anyone else on the unRAID forums who had a similar issue where xfs_repair could not proceed? I really don't want to pin this on unRAID, but I have to give a reason for the unscheduled downtime. We might have to build another (expensive) Windows server to back up the others if my answer is XFS corruption, especially if I cannot fix this. Thank you again!

     xfs_repair.txt
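     For anyone reading along, a rough sketch of how a read-only repair report like the attached one can be captured, with the unRAID array started in maintenance mode so that disk5 is exposed as /dev/md5. The output path is just an example, and -n makes no changes to the disk.

        # Dry-run check of disk5's filesystem, saving the full report for later review.
        xfs_repair -nv /dev/md5 2>&1 | tee /boot/xfs_repair_disk5.txt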
  5. Hello. I came in to check up on the dd clone and see what I could repair on the server. The suspect drive passed both the quick and extended SMART tests. The server was last rebooted for the v6.4 update maybe a week ago or more. The last power-down before that was in August of 2017, apparently to move to a different UPS zone. There have been no other power disruptions since: not a brownout, not a reason to trip over to battery. I believe they were legitimate ATA errors, as the Windows server reported a device I/O error while copying to unRAID, and they suspiciously popped up right after the cache filled up.

     The dd clone took something like 36 hours at 61 MB/s onto a spare HGST 7200 RPM enterprise drive. At least that was successful. Unfortunately, running xfs_repair -L -v, I am greeted with the following (the last few lines of phase 6):

     entry ".." in directory inode 1093044826 points to non-existent inode 6448754485, marking entry to be junked
     bad hash table for directory inode 1093044826 (no data entry): rebuilding
     rebuilding directory inode 1093044826
     entry ".." in directory inode 1093051943 points to non-existent inode 6448754488, marking entry to be junked
     bad hash table for directory inode 1093051943 (no data entry): rebuilding
     rebuilding directory inode 1093051943
     Invalid inode number 0x0
     xfs_dir_ino_validate: XFS_ERROR_REPORT
     fatal error -- couldn't map inode 1124413091, err = 117

     It appears as though there is SOME file or directory that simply cannot be re-mapped. What would be the next step in my attempt to repair this? Would deleting -that- inode help xfs_repair complete, or am I asking for trouble doing this? I do have a complete clone of this drive if things go south with unRAID, but I am guessing that clone would have the same issues rebuilding XFS on any other operating system or PC.

     EDIT: Could this be caused by something as simple as a file copied to unRAID with "illegal" characters in its name? I know there were some old 1990s-era DOS files, exported from an old machine someone had gone through, that had some crazy names. Any help you can provide is much appreciated. Thank you!
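     Not an answer to the question above, but a read-only way to look at the inode that repair is choking on before deciding whether to zero anything. xfs_db with -r only inspects and does not modify; the device name is a placeholder, and this should be pointed at the clone rather than the original drive.

        # Dump the on-disk contents of the inode that could not be mapped.
        xfs_db -r -c "inode 1124413091" -c "print" /dev/sdX1

        # The directory inode that references it can be examined the same way.
        xfs_db -r -c "inode 1093051943" -c "print" /dev/sdX1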
  6. Thank you for your response, johnnie.black! So the procedure to do that (after a dd clone for safekeeping) would be something like this? Let's say the suspect drive is /dev/md5:

     1 - Re-install the suspect drive in the bay it was in.
     2 - Start the server, log in to the GUI, stop the array, then re-start it in maintenance mode.
     3 - SSH into the server.
     4 - Run: xfs_repair -v -L /dev/md5
     5 - After the repair, look in /mnt/disk5/lost+found.
     6 - Check for corrupted files in that lost+found folder.

     Beware: I have a lot of questions and general information below. Thank you again for your help!
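     A condensed sketch of those steps as shell commands, assuming disk5 really does map to /dev/md5 as above. Repairing through the md device (rather than the raw sdX device) is what keeps parity in sync on unRAID; the -L flag is only needed if xfs_repair refuses to start because of a dirty log, and it discards any log entries that could not be replayed.

        # With the array started in maintenance mode, over SSH:
        xfs_repair -v /dev/md5        # normal verbose repair of disk5
        # xfs_repair -L /dev/md5      # only if the dirty log prevents the repair from starting

        # Then stop maintenance mode, start the array normally, and review orphaned files:
        ls -la /mnt/disk5/lost+found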
  7. Hello! I have a custom-built 70+ TB unRAID Pro (6.4?) server that I use to back up a variety of systems where I work. It has these specs:

     Xeon E5-2630 V4
     32GB DDR4 2133 MHz ECC
     Supermicro X10SRL-F
     Supermicro 846 24-bay chassis
     2x Intel DC S3710 400GB, dual cache
     2x WD Gold 8TB 7200 RPM, dual parity
     10x WD Red 8TB 5400 RPM, data

     - Parity was last run on the 15th of January 2018, three days ago.
     - The system sits on a Windows domain network and is joined to the AD.
     - The systems that primarily access the unRAID server are Windows Server 2016 Standard and Windows Server 2012 R2 Standard.

     I have recently been made aware of a scheduled backup that did not complete, only to find out one of the disks in the array had corrupted its XFS filesystem. The log is full of I/O errors trying to write to that disk. Unfortunately, I am not at liberty to provide logs, as they contain customer data that would need to be sanitized. I stopped the array, re-started it in maintenance mode, and ran a "Check File-system Status" on that disk with the unRAID-recommended -nv flags. The log produced by this is something like 2,200 lines long, specifying many, many files. Again, I am not at liberty to provide the entire log, but I can provide the end results:

     XFS_REPAIR Summary    Thu Jan 18 08:24:33 2018
     Phase      Start            End              Duration
     Phase 1:   01/18 08:17:12   01/18 08:17:13   1 second
     Phase 2:   01/18 08:17:13   01/18 08:17:17   4 seconds
     Phase 3:   01/18 08:17:17   01/18 08:21:57   4 minutes, 40 seconds
     Phase 4:   01/18 08:21:57   01/18 08:21:57
     Phase 5:   Skipped
     Phase 6:   01/18 08:21:57   01/18 08:24:33   2 minutes, 36 seconds
     Phase 7:   01/18 08:24:33   01/18 08:24:33
     Total run time: 7 minutes, 21 seconds

     It appears as though the filesystem on this disk was severely corrupted, as upon restarting the array, the drive shows "Unmountable - No File System." The contents of that drive are also no longer available through drive emulation after removing the drive from the array. Upon attempting to re-add the drive to the array, I am greeted with "Drive contents will be OVERWRITTEN," so I did not re-add it for fear of losing what was on that disk. I simply shut the system down and pulled the suspect drive. I have ordered another WD Red 8TB to dd clone this drive under Ubuntu 16.04 and will attempt an xfs_repair on the cloned drive, as this seems to be the safest way to go about it. Let me know if what I have planned out in the following lines looks okay:

     1 - Plug both the new and old drives into a basic SATA controller on a separate system running Ubuntu 16.04.
     2 - Clone the old drive (let's say /dev/sdc) to the new drive (let's say /dev/sdd): sudo dd if=/dev/sdc of=/dev/sdd
     3 - Shut down after the clone completes and remove the old drive to keep it safe.
     4 - Reboot and force an xfs_repair of the new drive (/dev/sdd) with: xfs_repair -L /dev/sdd1
     5 - If the xfs_repair is successful, I am guessing I am ready to pop the (new) drive back into the server?

     I might also figure out a way to recover whatever was on that disk to another network share in the event I am unable to add it back to the array. After starting the server with the new drive is where I begin to get a little fuzzy, as I am not sure how unRAID acts with a now-foreign drive that has existing data on it. I am pretty sure there is no way to recover from this without fully rebuilding parity, as the existing parity would yield the corrupted drive anyway? Would it be best to create a completely new config from here and rebuild parity from that new config? Thank you for your help!
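     A slightly expanded sketch of the clone-and-repair plan above. The extra dd options are my own additions, not anything unRAID-specific: bs=1M speeds up the copy, conv=noerror,sync keeps going past read errors (padding unreadable blocks with zeros), and status=progress prints a running total. Device and partition names are the same placeholders used above; double-check them with lsblk before running anything.

        # Identify the source (old) and destination (new) drives first.
        lsblk -o NAME,SIZE,MODEL,SERIAL

        # Clone the whole old drive onto the new one.
        sudo dd if=/dev/sdc of=/dev/sdd bs=1M conv=noerror,sync status=progress

        # Dry-run check of the clone's data partition, then the forced repair only if needed.
        sudo xfs_repair -n /dev/sdd1
        sudo xfs_repair -L /dev/sdd1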