DaveW42 Posted February 12, 2023

Help would be greatly appreciated. Over the past two months I've been running into hard drive problems with my Unraid server. The server has 2 parity drives, 16 drives in the array, two or three drives in Unassigned Devices, an NVMe cache drive, an SSD cache drive, and a couple of separate drives outside the array.

The problem started when I saw 2 drives in the array marked as disabled. I bought two new drives ASAP, followed Unraid's standard procedure for replacing disabled drives (https://wiki.unraid.net/Manual/Storage_Management#Replacing_failed.2Fdisabled_disk.28s.29), and things looked OK for a short period of time. Then another drive registered as disabled.

I looked for some commonality in the failures, and thought at first that the common thread was that the disabled drives were those attached directly to the SATA ports on my motherboard (Asus ROG Strix X570-E Gaming, running an AMD 5950X). I had had other issues with that X570-E board in the past (it NEVER would accept more than 64GB of memory, even after two replacement boards from Asus), so I took the hint and moved to an entirely different Asus model within the X570 family (ASUS ROG Crosshair VIII Formula AMD X570). Unfortunately that motherboard has the same memory issue as my X570-E Gaming, and shortly thereafter ANOTHER hard drive went disabled.

I then logged the serial numbers of all my drives and their physical locations in the server case, and saw that the commonality was actually attachment to one of two LSI SAS9211-8i 8-port internal 6Gb SATA+SAS PCIe 2.0 cards. I had a spare LSI card, so I removed the bad one and replaced it with the spare. Everything seemed to boot up fine, and I could access files normally (although I still had 2 disabled drives). At some point shortly thereafter (30 minutes or an hour?) I noticed that the spare LSI card had RAID firmware installed, instead of the IT firmware you are supposed to use when running Just a Bunch of Disks (JBOD). I then upgraded the LSI firmware on the spare card, as well as on the original card (i.e., the one that always worked and continues to work). Both cards are now on the recommended v20 version of the IT firmware. I thought this would address the core hardware issue, and maybe it did, but unfortunately I'm not quite sure yet!

My next step was to rebuild the two disabled drives. I took one of the 10TB drives that was in the server but not yet in use (not part of the array, no data stored on it), and rebuilt the smaller of the two disabled drives (10TB) using the aforementioned standard procedure for replacing disabled drives. Yesterday the rebuild process completed, and things started to look OK. But today that newly synced 10TB drive is appearing as "Unmountable: Wrong or no file system," as is the other drive that had been disabled (a 14TB drive). The newly synced drive still appears as "green" (i.e., no warning triangle), and the short SMART test I ran on it seemed to indicate it was doing fine.

Up until this point I hadn't seen any loss of data across all of the issues indicated above. However, about an hour ago I went to look for a set of files and they were gone. This last part is more concerning to me than anything prior.

I have now (via the GUI) removed the unmountable 10TB drive from the array, and that disk (Disk 12) is now listed as not installed. The remaining drive that was disabled (14TB) also now appears as "Unmountable: wrong or no file system." Both drives have a red "X" next to them. Currently the array is running, with those two drives set as indicated above. I have attached the diagnostic file for my system.

Any help would be greatly, greatly appreciated in getting this system going again. Also, based on what you can see, do you think I lost any files?

A few quick last notes as well:
- With regard to the memory issue, I have sent the four 32GB sticks back for replacement, so right now I am just using an older 8GB stick for the server.
- All Dockers and VMs are shut down now.
- In terms of changes/new things, the NVMe is also new, and I recently attached the following unit to the system, to be able to run preclear etc. separately on new drives: https://www.amazon.com/gp/product/B0759567JT/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&psc=1

Thanks again so much for the help!

Dave

nas24-diagnostics-20230212-1544.zip
JorgeB Posted February 13, 2023 (Solution)

Check filesystem on disks 1 and 12.
DaveW42 Posted February 13, 2023

Thank you, Jorge! I will follow those procedures tonight, after work. The drives are formatted XFS.

Dave
DaveW42 Posted February 14, 2023

Hi, JorgeB. I followed the instructions provided and ran xfs_repair on both disks using the GUI, with options specified as follows: -nv

Both disks are showing the same message (see below). What should I do next? Should I run xfs_repair with no options? Thanks! Dave

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.  Exiting now.
JorgeB Posted February 14, 2023

You need to run it again without -n, or nothing will be done. If it asks for -L, use it.
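For reference, the sequence being described here looks roughly like the following from the command line (a sketch only; in the Unraid GUI you would instead clear the -n flag from the options box, and the device name /dev/md12 is an assumption based on the disk number, valid with the array started in maintenance mode):

```shell
# Dry run first: -n (no-modify) reports problems but writes nothing
xfs_repair -nv /dev/md12

# Real repair: run again WITHOUT -n so fixes are actually written
xfs_repair -v /dev/md12

# Only if xfs_repair refuses and explicitly asks for it, zero the log
# and retry. This discards unreplayed journal entries, so attempt a
# mount first if at all possible.
xfs_repair -Lv /dev/md12
```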
DaveW42 Posted February 14, 2023

Thank you, JorgeB. I note that the wiki you originally linked to (https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui) says that unfortunately xfs_repair often doesn't work. Before I try this, would you recommend that I first try looking at the drive with File Scavenger or a live CD of TestDisk? Those two options are noted in the wiki section on redoing a drive formatted with XFS (i.e., after xfs_repair initially fails). I am definitely new to this, and am just wondering if it might be better to try those first if, for example, there is a non-destructive mode in either of those pieces of software. I will definitely defer to what you recommend; I just wanted to throw the thought out there. Thanks again! Dave
DaveW42 Posted February 15, 2023

Hi, JorgeB. I'm pretty nervous about possibly losing data (family photos, etc.), especially since I can't even say precisely what data was on that drive. Correct me if I am wrong, but I believe Unraid can choose to write to different hard disks as it writes files to a single share, so if I lose files on this drive, there are likely to be shares where one or more files have gone missing while other files (housed on different disks) are still present. It is bad enough to possibly lose files, but finding holes in my data would take things a step further and drive me crazy!

Would it be a good idea to create an image of the drive before doing anything that might change data on it? Then I could restore the image if the repair encounters problems, and restart recovery efforts using the tools I mentioned above, or perhaps even other tools. I just ordered a couple more hard drives in case this might be helpful. I really want to avoid loss of data. Believe it or not, before all this happened I'd been trying to figure out the best way to back up critical files to an old Synology server that I have. I thought that with the parity protection I would be safe for a few months while I figured that out. Live and learn!

Also, below is what Unraid said when I ran xfs_repair (I added "-v" so we get more output). Thanks, Dave

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 299848 entries
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 128
resetting superblock root inode pointer to 128
sb realtime bitmap inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
resetting superblock realtime bitmap inode pointer to 129
sb realtime summary inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
resetting superblock realtime summary inode pointer to 130
Phase 2 - using internal log
        - zero log...
zero_log: head block 1075093 tail block 1075089
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed.  Mount the filesystem to replay the log, and unmount it before re-running xfs_repair.  If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair.  Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
JorgeB Posted February 15, 2023

13 hours ago, DaveW42 said:
"that unfortunately xfs_repair often doesn't work"

Please post that output.
DaveW42 Posted February 15, 2023

Hi, JorgeB. Below is the text on the wiki that I was referring to (I've changed the text color to green for the parts I was referencing). Thanks, Dave

Additional comments

The xfs_repair tool may take a long time on a full file system (several minutes to a half hour, or more).

If the xfs_repair command fails, and we're hearing numerous reports of this(!), then you will have no recourse but to redo the drive. Use the instructions in the "Redoing a drive formatted with XFS" section below. We're sorry, we hope there will be better XFS repair tools some day!

If there was significant corruption, then it is possible that some files were not completely recovered. Check for a lost+found folder on this drive, which may contain fragments of the unrecoverable files. It is up to you to examine these and determine what files they are from, and act accordingly. Hopefully, you have another copy of each file. When you are finished examining them and saving what you can, delete the fragments and remove the lost+found folder. Dealing with this folder does not have to be done immediately. This is similar to running chkdsk or scandisk within Windows, finding lost clusters, and dealing with files named File0000.chk or similar. You may find one user's story very helpful, plus his later tale of the problems of sifting through the recovered files.

If you get an error indicating something like trouble opening the file system, it may indicate that you attempted to run the file system check on the wrong device name. For almost all repairs, you would use /dev/md1, /dev/md2, /dev/md3, /dev/md4, etc. If operating on the cache drive (which is not protected by parity), you would use /dev/sdX1 (note the trailing "1" indicating the first partition on the cache drive). If you want to test and repair a non-array drive, you would use the drive's partition symbol (e.g. sdc1, sdj1, sdx1, etc.), not the array device symbol (e.g. md1, md13, etc.). So the device name would be something like /dev/sdj1, /dev/sdx1, etc.

Redoing a drive formatted with XFS

If you are here because the XFS repair tool has failed you(!), then the best we can recommend is to save the data, reformat, and restore the data (NOT a desirable course of action)! Do your best to copy off everything you can, to a safe place. If something important is absolutely needed and still inaccessible, try File Scavenger or a live CD of TestDisk.

1. Change the file system format for the drive to ReiserFS (just to reset the formatting; it's temporary and fairly quick)
2. Start the array and format the drive
3. Stop the array
4. Change the file system format for the drive to XFS again
5. Start the array and format the drive again
6. Copy back everything you saved

It's certainly not a welcome method, but it does produce a fresh and clean XFS format. (The write-up above has not been tested by this author. If corrections are needed, please do them, or PM RobJ with the corrections or suggestions.)
JorgeB Posted February 15, 2023

Please post the output of xfs_repair, the one that you say didn't work.
DaveW42 Posted February 15, 2023

Thanks, JorgeB. So far I've just run xfs_repair twice, both times using the GUI. The second time the output indicated I would need to mount the drive so it could replay (?) the log. That was the next step, but before I did that I thought about imaging the drive (and wrote the corresponding post above). Per your request, below is all of the output I have seen from xfs_repair. Dave

The first time I ran it with the -nv option and output was as follows:

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.  Exiting now.

The second time I ran it with the -v option and output was as follows:

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 299848 entries
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 128
resetting superblock root inode pointer to 128
sb realtime bitmap inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
resetting superblock realtime bitmap inode pointer to 129
sb realtime summary inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
resetting superblock realtime summary inode pointer to 130
Phase 2 - using internal log
        - zero log...
zero_log: head block 1075093 tail block 1075089
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed.  Mount the filesystem to replay the log, and unmount it before re-running xfs_repair.  If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair.  Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
DaveW42 Posted February 15, 2023

I was particularly attending to that last line in the second output when I put forward the imaging idea: "Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this." We're not quite there yet (I haven't tried mounting the drive), but I am trying to avoid the possibility of corruption and thought it might be prudent to consider imaging. Dave
DaveW42 Posted February 15, 2023

Also, I suspect we are in very different time zones (almost 4am here). Unfortunately I have to work in the morning and need to catch a bit of sleep. I'll take another look at your guidance in the morning. Thanks again so much for all your help!!! Dave
JorgeB Posted February 15, 2023

39 minutes ago, DaveW42 said:
"then use the -L option"

On 2/14/2023 at 9:21 AM, JorgeB said:
"if it asks for -L use it."
trurl Posted February 15, 2023

5 hours ago, DaveW42 said:
"please attempt a mount of the filesystem before doing this"

The xfs_repair tool doesn't know that Unraid has already tried to mount the drive and that it can't be mounted. So you have to proceed with -L.
DaveW42 Posted February 16, 2023

Thanks for your thoughts on this, trurl! Would there be any advantage to imaging the drive before I run xfs_repair with -L? Or would it just be a waste of time to do so, in hopes of additional attempts to salvage files using File Scavenger and TestDisk?

Thanks also, JorgeB, for your additional comment.

Dave
JorgeB Posted February 16, 2023

4 hours ago, DaveW42 said:
"Would there be any advantage to imaging the drive before I run xfs_repair with -L?"

Using -L is quite common and usually there's no big risk, but having a clone would leave you with more options, if time is not a factor.
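If you do decide to clone first, one common approach is GNU ddrescue, which copies at the block level and keeps a map file so a pass can be resumed or retried. This is only a sketch: /dev/sdX (the source disk) and /dev/sdY (an equal-or-larger target) are placeholders that you would identify very carefully by serial number before running anything, since the arguments are destructive if reversed.

```shell
# First pass: copy everything readable; -f is required when the
# output is a block device, and rescue.map records progress
ddrescue -f /dev/sdX /dev/sdY rescue.map

# Second pass: retry only the bad sectors noted in the map file,
# up to 3 times each
ddrescue -f -r3 /dev/sdX /dev/sdY rescue.map
```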
DaveW42 Posted February 17, 2023

Hi, JorgeB. Thank you! That information is very helpful. Time is definitely a factor, but I would also definitely be willing to go the more conservative route and image the drives (both the original drive that Unraid disabled and the rebuilt, synced drive) if it improved my chances of recovering the data even slightly. With this in mind, I have four 14TB WD drives coming via next-day service. These will also come in handy when I get past this issue and one day figure out a local backup solution for my array.

It may take me a bit before I am able to update this thread with any new information. I will need to preclear those drives, running two at a time using the device I mentioned in my first post. I just did this a week ago, and I think it took 3 or 4 days per pair of drives, given all the preclear steps. Then I will need to 1) figure out what software to use to image those two drives and 2) run that imaging process. I really hope "-L" works; it would be nice to have a quick solve once I am through all of this setup!

I will keep monitoring this thread in case anyone has any new thoughts on all of this, but I think I won't really have an update until I am through all these steps (maybe 7 to 10 days?). Thanks again! Dave
JorgeB Posted February 17, 2023

Note that you can clone disk12 now, but disk1 is disabled, so there you'd be repairing the emulated disk. You can first see if the actual disk1 mounts with Unassigned Devices (UD), but note that any data written to it since it got disabled would be missing.
DaveW42 Posted February 20, 2023

Thanks, JorgeB. I'm not quite sure I followed that. Are you saying that whatever was written to disk 1 and disk 12 immediately before they got disabled would be missing? Or that Unraid might have continued to write to those drives even after they were disabled (i.e., writing to and updating an emulated disk)? Or is it something else?

I'm also not quite sure I understand how the emulation works (I'll have to look that one up). I believe Unraid was indicating that both drives were emulated after they were disabled. Having said this, what didn't quite make sense to me was that if both drives were truly emulated, then why were those files missing? I would have thought I would still see those files, since the drives were being emulated.

Also, preclearing is still going as we speak (pre-read, zero, post-read) for all 4 new drives. I will shortly be investigating what programs to use to do the imaging. I'm not sure whether I should stay in the Linux family for this, or bring things over to Windows, which I am more familiar with. I wish I had learned more about Linux a long time ago, as I do like it much better than Windows. Thanks! Dave
JorgeB Posted February 20, 2023

1 hour ago, DaveW42 said:
"Are you saying that whatever was written to disk 1 and disk 12 immediately before they got disabled would be missing, or are you saying that UNRAID might have tried to continue to write to those drives even after they were disabled? Or is it something else?"

Disk1 is disabled; after it got disabled, any new data written to it was written to the emulated disk only, so if you clone the actual disk it would be missing that data. Disk12 is not disabled, just unmountable.
trurl Posted February 20, 2023

6 hours ago, DaveW42 said:
"how the emulation works"

Unraid disables a disk when a write to it fails for any reason. Often this isn't a problem with the disk, but a problem communicating with the disk. That failed write updates parity, so it can be recovered by rebuilding, but the physical disk isn't used again until rebuilt, since it is now out-of-sync with the array.

After disabling, the disk is emulated from parity. When reading an emulated disk, the data comes from the parity calculation by reading all other disks. When writing an emulated disk, parity is updated as if the disk had been written. The initial failed write, and any subsequent writes to the disabled/emulated disk, can be recovered by rebuilding.

https://wiki.unraid.net/Manual/Overview#Parity-Protected_Array
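The emulation described above can be illustrated with a toy model (my own sketch, not Unraid's actual code): with single parity, the parity disk holds the bytewise XOR of all data disks, so a missing disk's contents can be reconstructed by XORing parity with every surviving disk, and a "write" to the missing disk only needs to update parity.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Three toy "data disks" and the parity computed over them
disks = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
parity = xor_blocks(disks)

# Disk 1 "fails": emulate its contents from parity plus survivors
emulated = xor_blocks([parity, disks[0], disks[2]])
assert emulated == disks[1]

# A write to the emulated disk only touches parity: XOR out the old
# contents, XOR in the new. Rebuilding later recovers the new data.
new_data = b"\x55\x66\x77"
parity = xor_blocks([parity, disks[1], new_data])
rebuilt = xor_blocks([parity, disks[0], disks[2]])
assert rebuilt == new_data
```

This is also why an emulated disk can show files the physical disk never received: the array's parity, not the disabled drive, is the source of truth once the disk is disabled.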
DaveW42 Posted February 21, 2023

Thanks so much for those explanations, JorgeB and trurl!!! Very interesting stuff. I think the concept behind parity drives is brilliant, and I have mentioned the whole idea to non-techy friends as an example of something that appears impossible (being able to protect numerous large drives using just one large drive) but is possible through logic and basic math. Very cool.

To make a long story short, I decided that taking an image of drive 12 might actually cause me more problems in the long run (e.g., if I inserted that drive into another computer, and somehow that other computer decided to write something to it). After the drives precleared, I therefore decided to follow the initial direction both of you provided, which was to move forward with xfs_repair -L.

Before I did that, just for the heck of it, I ran xfs_repair with the -nv options, just to see what would happen. Lo and behold, I got a totally different set of output, which I have shared below. I think ultimately this still means I should just run the "-L" option, but I thought I should check first. Please let me know what you think. Say the word, and "-L" will be my next step. Thanks! Dave

Phase 1 - find and verify superblock...
        - block cache size set to 3005320 entries
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 128
would reset superblock root inode pointer to 128
sb realtime bitmap inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
would reset superblock realtime bitmap inode pointer to 129
sb realtime summary inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
would reset superblock realtime summary inode pointer to 130
Phase 2 - using internal log
        - zero log...
zero_log: head block 1075093 tail block 1075089
ALERT: The filesystem has valuable metadata changes in a log which is being ignored because the -n option was used.  Expect spurious inconsistencies which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_icount 0, counted 262400
sb_ifree 0, counted 1948
sb_fdblocks 2441087425, counted 140977217
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0 through agno = 9
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0 through agno = 9
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0 through agno = 9
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

XFS_REPAIR Summary    Tue Feb 21 01:55:39 2023

Phase       Start           End             Duration
Phase 1:    02/21 01:54:34  02/21 01:54:34
Phase 2:    02/21 01:54:34  02/21 01:54:34
Phase 3:    02/21 01:54:34  02/21 01:55:08  34 seconds
Phase 4:    02/21 01:55:08  02/21 01:55:08
Phase 5:    Skipped
Phase 6:    02/21 01:55:08  02/21 01:55:39  31 seconds
Phase 7:    02/21 01:55:39  02/21 01:55:39

Total run time: 1 minute, 5 seconds
DaveW42 Posted February 21, 2023

Also, please let me know if any of the above output suggests I should first start the array in normal (non-maintenance) mode, to try to mount drive 12 and let it replay the log, and then stop the array, enable maintenance mode, and start it again before running -L on disk 12. Thanks! Dave
JorgeB Posted February 21, 2023

2 hours ago, DaveW42 said:
"I think ultimately this still means I should just run the '-L' option"

Probably, but you first need to run it without -n, and if it asks for -L, use it.

2 hours ago, DaveW42 said:
"(i.e., to try to mount drive 12 and play the log back again)"

You can try, but they didn't mount before, so I expect they still won't mount.