Problem: UNRAID 6.11.5 - Unmountable: Wrong or no file system


DaveW42

Help would be greatly appreciated. Over the past two months I've been running into hard drive problems with my UNRAID server. The server has 2 parity drives, 16 data drives in the array, roughly two or three drives in unassigned devices, an NVMe cache drive, an SSD cache drive, and a couple of separate drives outside of the array. The problem started when I saw two drives in the array with their status marked as disabled. I bought two new drives ASAP, followed UNRAID's standard procedure for replacing disabled drives (https://wiki.unraid.net/Manual/Storage_Management#Replacing_failed.2Fdisabled_disk.28s.29), and things looked OK for a short period of time. Then another drive registered as disabled.

 

I looked to see if there was some commonality in the failures, and thought at first that the common thread was that the disabled drives were those attached directly to the SATA ports on my motherboard (Asus ROG Strix X570-E Gaming running an AMD 5950X). I had had other issues with that X570-E motherboard in the past (i.e., it never would accept more than 64GB of memory, even after two replacement boards from Asus), so I thought I would take the hint and move to an entirely different ASUS model within the X570 family (ASUS ROG Crosshair VIII Formula, AMD X570). Unfortunately that motherboard has the same memory issue as my X570-E Gaming, and shortly thereafter ANOTHER hard drive went disabled.

I then logged the serial numbers of all my drives and their physical locations in the server case, and saw that the commonality was actually attachment to one of two LSI SAS9211-8i 8-port internal 6Gb SATA+SAS PCIe 2.0 cards. I had a spare LSI card, so I removed the bad one and replaced it with the spare. Everything seemed to boot up fine, and it seemed I could access files normally (although I still had two disabled drives). At some point shortly thereafter (30 minutes or an hour?) I noticed that the replacement LSI card had RAID firmware installed, instead of the IT firmware you are supposed to use when running Just a Bunch of Disks (JBOD). I then upgraded the LSI firmware on the replacement card, as well as on the original card (i.e., the one that has always worked and continues to work). Both cards are now on the recommended (v20) version of the IT firmware.

 

I thought this would address the core hardware issue, and maybe it did, but unfortunately I'm not quite sure yet (!) My next step was to rebuild the two disabled drives. I took one of the 10TB drives that was in the server but not yet in use (not part of the array, no data stored on it) and went to rebuild the smaller of the two disabled drives (10TB) using the aforementioned standard procedure for replacing disabled drives. Yesterday the rebuild process completed, and things started to look OK. But then today I saw that the newly rebuilt 10TB drive is showing as "Unmountable: Wrong or no file system," as is the other drive that had been disabled (a 14TB drive). The newly rebuilt drive still appears as "green" (i.e., no warning triangle), and the short SMART test I ran on it seemed to indicate it was doing fine. Up until this point I hadn't seen any loss of data across all of the issues described above. However, about an hour ago I went to look for a set of files and they were gone. This last part is more concerning to me than anything prior.

 

I have now (via the GUI) removed the 10TB drive that was registered as unmountable from the array, and that disk (Disk 12) is now listed as not installed.  The remaining drive that was disabled (14TB) also now appears as "Unmountable: wrong or no file system."  Both drives have a red "X" next to them. 

 

Currently I have the array running, with those two drives set as indicated in the last paragraph.  I have attached the diagnostic file for my system.  Any help would be greatly, greatly appreciated in getting this system going again.  Also, based on what you can see, do you think I lost any files?  

 

A few quick last notes as well. 

- With regard to the memory issue, I have sent the four 32GB sticks back for replacement, so right now I am just using an older 8GB stick for the server.

- All dockers and VMs are shut down now.

- In terms of changes/new things, the NVME is also new, and I recently attached the following unit to the system, to be able to run preclear etc. separately on new drives (https://www.amazon.com/gp/product/B0759567JT/).

 

Thanks again so much for the help!

 

Dave

nas24-diagnostics-20230212-1544.zip


Hi, JorgeB.

 

I followed the instructions provided and ran xfs_repair on both disks using the GUI, with options specified as follows:   -nv
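
For reference, the webGUI check appears to be the same as running xfs_repair read-only from a terminal with the array started in maintenance mode; a minimal sketch, assuming the affected disk is disk 12 and therefore maps to /dev/md12 on this 6.11.x release:

xfs_repair -nv /dev/md12

(-n makes no changes to the disk, -v just adds verbose output.)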

 

Both disks are showing the same message (see below).  What should I do next?  Should I run xfs_repair with no options?

 

Thanks!

 

Dave

 

 

SYSTEM MESSAGE

 

 

Phase 1 - find and verify superblock...

bad primary superblock - bad CRC in superblock !!!

 

attempting to find secondary superblock... 

found candidate secondary superblock...

verified secondary superblock...

would write modified primary superblock

Primary superblock would have been modified.

Cannot proceed further in no_modify mode.

Exiting now.


Thank you, JorgeB.  

 

I note that the wiki you originally linked to (https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui) says that unfortunately xfs_repair often doesn't work. Before I try this, would you recommend that I look at the drive with File Scavenger or a live CD of TestDisk? Those two options are noted in the wiki's section on redoing a drive formatted with XFS (i.e., after xfs_repair has failed). I am definitely new to this, and am just wondering if it might be better to try those first if, for example, there is a non-destructive mode in either of those pieces of software. I will definitely defer to what you recommend; I just wanted to throw this thought out there.

 

Thanks again!

 

Dave

 


Hi, JorgeB. 

 

I'm pretty nervous about possibly losing data (family photos, etc.), especially since I can't even say precisely what data was on that drive. Correct me if I am wrong, but I believe UNRAID can choose to write to different hard disks as it writes files to a single share, and thus, if I lose files on this drive, there are likely to be shares where one or more files have gone missing while other files (housed on different disks) are still present. It is bad enough to possibly lose files, but finding holes in my data would take things a step further and drive me crazy!

 

Would it be a good idea to create an image of  the drive before doing anything that might change data on the drive?  Then I could restore the image if the repair encounters problems, and restart recovery efforts using the tools I mentioned above, or perhaps even other tools.  I just ordered a couple more hard drives in case this might be helpful.  I really want to avoid loss of data.
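
If imaging does turn out to be the way to go, one common approach is a sector-level clone with GNU ddrescue; a minimal sketch, assuming the source and destination show up as /dev/sdX and /dev/sdY (hypothetical letters that would need to be double-checked, and ddrescue is not part of stock UNRAID, so it would have to be installed separately or the clone done on another Linux machine):

ddrescue -f -n /dev/sdX /dev/sdY /boot/ddrescue-disk12.map

(-f allows writing to a block device, -n skips the slow scraping pass, and the map file lets an interrupted clone be resumed.)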

 

Believe it or not, before all this happened I'd been trying to figure out the best way to back up critical files to an old Synology server that I have. I thought that with parity protection I would be safe for a few months while I figured this out. Live and learn!

 

Also, below is what UNRAID said when I ran xfs_repair (I added "-v", so we get more output).

 

Thanks,

Dave

 

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 299848 entries
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 128
resetting superblock root inode pointer to 128
sb realtime bitmap inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
resetting superblock realtime bitmap inode pointer to 129
sb realtime summary inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
resetting superblock realtime summary inode pointer to 130
Phase 2 - using internal log
        - zero log...
zero_log: head block 1075093 tail block 1075089
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

 

 


Hi, JorgeB.

 

Below is the text on the wiki that I was referring to (I've changed the text color to green for the parts I was referencing).

 

Thanks,

Dave

 

Additional comments

The xfs_repair tool may take a long time on a full file system (several minutes to a half hour, or more).

 

If the xfs_repair command fails, and we're hearing numerous reports of this(!), then you will have no recourse but to redo the drive. Use the instructions in the Redoing a drive formatted with XFS section below. We're sorry, we hope there will be better XFS repair tools some day!

If there was significant corruption, then it is possible that some files were not completely recovered. Check for a lost+found folder on this drive, which may contain fragments of the unrecoverable files. It is up to you to examine these and determine what files they are from, and act accordingly. Hopefully, you have another copy of each file. When you are finished examining them and saving what you can, then delete the fragments and remove the lost+found folder. Dealing with this folder does not have to be done immediately. This is similar to running chkdsk or scandisk within Windows, and finding lost clusters, and dealing with files named File0000.chk or similar. You may find one user's story very helpful, plus his later tale of the problems of sifting through the recovered files.

If you get an error indicating something like trouble opening the file system, it may indicate that you attempted to run the file system check on the wrong device name. For almost all repairs, you would use /dev/md1, /dev/md2, /dev/md3, /dev/md4, etc. If operating on the cache drive (which is not protected by parity), you would use /dev/sdX1 (note the trailing "1" indicating the first partition on the cache drive).

If you want to test and repair a non-array drive, you would use the drive's partition symbol (e.g. sdc1, sdj1, sdx1, etc), not the array device symbol (e.g. md1, md13, etc). So the device name would be something like /dev/sdj1, /dev/sdx1, etc.

 

 

Redoing a drive formatted with XFS

If you are here because the XFS repair tool has failed you(!), then the best we can recommend is to save the data, reformat, and restore the data (NOT a desirable course of action)! 

1. Do your best to copy off everything you can, to a safe place. If something important is absolutely needed and still inaccessible, try File Scavenger or a live CD of TestDisk.

2. Change the file system format for the drive to ReiserFS (just to reset the formatting; it's temporary and fairly quick).

3. Start the array and format the drive.

4. Stop the array.

5. Change the file system format for the drive back to XFS.

6. Start the array and format the drive again.

7. Copy back everything you saved.

It's certainly not a welcome method, but it does produce a fresh and clean XFS format. (The write-up above has not been tested by this author. If corrections are needed, please do them, or PM RobJ with the corrections or suggestions.)
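
For the "copy off everything you can" step above, rsync is a common choice; a minimal sketch, assuming the troubled disk is still readable (or emulated) at /mnt/disk12 and a spare disk is mounted through Unassigned Devices at /mnt/disks/rescue (both paths are examples only):

rsync -avh --progress /mnt/disk12/ /mnt/disks/rescue/disk12/

The trailing slash on the source copies the contents of the disk rather than the top-level directory itself.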


Thanks, JorgeB.  So far I've just run xfs_repair twice, both times using the GUI.   The second time the output indicated I would need to mount the drive so it could replay (?) the log.   That was the next step, but before I did that I thought about imaging the drive (and wrote the corresponding post above). 

 

Per your request, below is all of the output I have seen from xfs_repair.

 

Dave 

 

The first time I ran it with the -nv option and output was as follows:

 

Phase 1 - find and verify superblock...

bad primary superblock - bad CRC in superblock !!!

 

attempting to find secondary superblock... 

found candidate secondary superblock...

verified secondary superblock...

would write modified primary superblock

Primary superblock would have been modified.

Cannot proceed further in no_modify mode.

Exiting now.

 

The second time I ran it with the -v option and output was as follows:

 

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 299848 entries
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 128
resetting superblock root inode pointer to 128
sb realtime bitmap inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
resetting superblock realtime bitmap inode pointer to 129
sb realtime summary inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
resetting superblock realtime summary inode pointer to 130
Phase 2 - using internal log
        - zero log...
zero_log: head block 1075093 tail block 1075089
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.


I was particularly attending to that last line in the second output, when I put forward the imaging idea.

 

"Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this."  

 

We're not quite there yet (haven't tried mounting the drive), but I am trying to avoid the possibility of corruption and thought it might be prudent to consider imaging.
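
For what it's worth, the "attempt a mount" the message refers to can be done either by starting the array in normal (non-maintenance) mode so UNRAID tries to mount disk 12 itself, or manually from a terminal; a rough sketch, with the device name and mount point as assumptions:

mkdir -p /mnt/test
mount /dev/md12 /mnt/test    # if this succeeds, XFS replays its log during the mount
umount /mnt/test             # then re-run xfs_repair without -L

Only if the mount fails would -L remain as the fallback.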

 

Dave


Thanks for your thoughts on this, trurl!  Would there be any advantage to imaging the drive before I run xfs_repair with -L?  Or would imaging just be a waste of time, even if the goal is to preserve the option of additional salvage attempts with File Scavenger and TestDisk?

 

Thanks also, JorgeB, for your additional comment.

 

Dave

4 hours ago, DaveW42 said:

Would there be any advantage to imaging the drive before I run xfs_repair with -L?  Or would imaging just be a waste of time, even if the goal is to preserve the option of additional salvage attempts with File Scavenger and TestDisk?

Using -L is quite common and usually there's no big risk, but having a clone would leave you with more options, if time is not a factor.


Hi, JorgeB.

 

Thank you!  That information is very helpful.   Time is definitely a factor, but I would also definitely be willing to go the more conservative route and image the drives (both the original drive that UNRAID disabled and the rebuilt, synced drive) if it improved my chances of recovering the data even slightly.  With this in mind, I have four 14TB WD drives coming via next day service.  These will also come in handy when I get past this issue and one day figure out a local backup solution for my array. 

 

It may take me a bit before I am able to update this thread  with any new information ... I will need to preclear those drives, running two at a time using the device I mentioned in my first post.  I just did this a week ago, and I think it took 3 or 4 days per pair of drives, given all the preclear steps.  And then I will need to 1) figure out what software to use to image those two drives and 2) run that imaging process.  I really hope "-L" works, it would be nice for a quick solve once I am through all of this setup!

 

I will keep monitoring this thread in case anyone has any new thoughts on all of this, but I think I wouldn't really have an update until I am through all these steps (maybe 7 to 10 days??)

 

Thanks again!

Dave

 

  


Thanks, JorgeB.  I'm not quite sure I followed that.  Are you saying that whatever was written to disk 1 and disk 12 immediately before they got disabled would be missing, or are you saying that UNRAID might have tried to continue to write to those drives even after they were disabled (i.e., UNRAID would be writing to and updating an emulated disk)?  Or is it something else?

 

I'm also not quite sure I understand how the emulation works (I'll have to look that one up). I believe UNRAID was indicating that both drives were emulated after they were disabled. Having said this, what didn't quite make sense to me is that, if both drives were truly emulated, why were those files missing? I would have thought I would still see those files, since the disks were being emulated.

 

Also, preclearing is still going as we speak (pre-read, zero, post-read) for all 4 new drives.  I will shortly be investigating what programs to use to do the imaging.  Not sure whether I should stay in the Linux family for this, or bring things over to Windows, which I am more familiar with.  I wish I had learned more about Linux a  long time ago, as I do like this framework much better than Windows.

 

Thanks!

 

Dave

1 hour ago, DaveW42 said:

Are you saying that whatever was written to disk 1 and disk 12 immediately before they got disabled would be missing, or are you saying that UNRAID might have tried to continue to write to those drives even after they were disabled (i.e., UNRAID would be writing to and updating an emulated disk)?  Or is it something else?

Disk1 is disabled; after it got disabled, any new data written to it went to the emulated disk only, so if you clone the actual disk it will be missing that data. Disk12 is not disabled, just unmountable.

6 hours ago, DaveW42 said:

how the emulation works

Unraid disables a disk when a write to it fails for any reason. Often this isn't a problem with the disk, but a problem communicating with the disk. That failed write updates parity, so it can be recovered by rebuilding, but the physical disk isn't used again until rebuilt since it is now out-of-sync with the array. After disabling, the disk is emulated from parity.

 

When reading an emulated disk, the data comes from the parity calculation by reading all other disks. When writing an emulated disk, parity is updated as if the disk had been written. The initial failed write, and any subsequent writes to the disabled/emulated disk, can be recovered by rebuilding.

 

https://wiki.unraid.net/Manual/Overview#Parity-Protected_Array
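
To put it in simplified terms (single parity only; the second parity disk uses a different calculation), the idea is bitwise XOR across the same sector of every disk:

P = D1 xor D2 xor ... xor Dn

Dk = P xor (XOR of all data disks except Dk)

The first line is how parity is computed and kept up to date; the second is how a missing or disabled disk k is emulated, and eventually rebuilt, from all the surviving disks.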

 


Thanks so much for those explanations, JorgeB and trurl !!!   Very interesting stuff.  I think the concept associated with parity drives is brilliant, and I have mentioned the whole idea to non-techy friends as an example of something that appears impossible (being able to protect numerous large drives using just one large drive) but is possible through logic and basic math.  Very cool.   

 

So, to make a long story short, I decided that trying to take an image of drive 12 might actually cause me more problems in the long run (e.g., if I inserted that drive into another computer, and somehow that other computer decided to write something to it). After the drives precleared, I therefore decided to follow the initial direction both of you had provided, which was to move forward with xfs_repair -L. Before I did that, just for the heck of it, I ran xfs_repair with the -nv options to see what would happen. Lo and behold, I got a totally different set of output, which I have shared below.

 

I think ultimately this still means I should just run the "-L" option, but I thought I should check first.   Please let me know what you think.  Say the word, and "-L" will be my next step.

 

Thanks!

 

Dave

 

Phase 1 - find and verify superblock...
        - block cache size set to 3005320 entries
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 128
would reset superblock root inode pointer to 128
sb realtime bitmap inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
would reset superblock realtime bitmap inode pointer to 129
sb realtime summary inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
would reset superblock realtime summary inode pointer to 130
Phase 2 - using internal log
        - zero log...
zero_log: head block 1075093 tail block 1075089
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_icount 0, counted 262400
sb_ifree 0, counted 1948
sb_fdblocks 2441087425, counted 140977217
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 5
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 8
        - agno = 6
        - agno = 7
        - agno = 9
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Tue Feb 21 01:55:39 2023

Phase        Start        End        Duration
Phase 1:    02/21 01:54:34    02/21 01:54:34
Phase 2:    02/21 01:54:34    02/21 01:54:34
Phase 3:    02/21 01:54:34    02/21 01:55:08    34 seconds
Phase 4:    02/21 01:55:08    02/21 01:55:08
Phase 5:    Skipped
Phase 6:    02/21 01:55:08    02/21 01:55:39    31 seconds
Phase 7:    02/21 01:55:39    02/21 01:55:39

Total run time: 1 minute, 5 seconds

 


Also, please let me know if any of the above output suggests I should start the array in normal (non-maintenance) mode before doing this (i.e., to try to mount drive 12 and replay the log), then stop the array, enable maintenance mode, and start the array again before running -L on disk 12.

 

Thanks!

 

Dave

2 hours ago, DaveW42 said:

I think ultimately this still means I should just run the "-L" option

Probably, but you first need to run without -n, and if it asks for it use -L.
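
In terms of actual commands, that would look roughly like this (device name is an assumption, and -L only if the repair run stops and explicitly asks for it):

xfs_repair -v /dev/md12        # no -n, so repairs are actually written
xfs_repair -v -L /dev/md12     # last resort: zero the log, as xfs_repair itself suggested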

 

2 hours ago, DaveW42 said:

(i.e., to try to mount drive 12 and play the log back again)

You can try, but they didn't mount before so I expect they still won't mount.

 

 

 
