
Unmountable drive: no file system.



I'm having the same issue as this thread: https://forums.unraid.net/topic/69765-solved-unmountable-no-file-system/

 

We had a power outage last Friday, but everything looked fine last night after I upgraded to v6.8.2.  Tonight it shows one of the drives is unmountable.

 

I've posted the diagnostics below.  I tried to follow along in that other thread, but I don't understand how to start the array with the disk emulated.  Pop the drive out before starting the array?  But then how do you run that repair command?  I'm a bit lost.

 

The parity says valid. It's scheduled to run on the 1st day of each month.  But it also says "Last check incomplete on Wed 29 Jan 2020 08:39:10 PM CST (yesterday), finding 1247 errors.  Error code: aborted"

 

I have 2 parity drives.

 

Can I just unassign that drive, as if it died?  Then format it, pre-clear it, re-assign it, and let it rebuild?  This system has been running for over a year and I've never seen a bad parity check, so I really don't have any reason to think the parity is bad.  I guess I could get a new drive tomorrow and do the rebuild to it instead.

nabit-diagnostics-20200130-2053.zip


I'm pretty sure a format would create a new filesystem.

 

Anyway, I did the test per that link (output below). It noted a couple of issues.  I tried to do the repair with just the -v option, but it said it had logs pending and I needed to mount the drive so the logs could be replayed.  But without a filesystem, I can't mount it.  Catch-22.  It said I could use the -L option to wipe the logs, but I'm not sure what that will do to the integrity of the drive.  And if there is still a problem with it, I don't want it mounted and have the parity be updated incorrectly due to a corrupted drive.
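In case it helps anyone following along, the sequence I'm describing looks roughly like this.  I'm running it from Maintenance mode, and I'm assuming the md device number matches the disk slot (disk 11 would be /dev/md11; check your own slot number):

```shell
# Check-only pass; -n guarantees nothing is modified (the output below came from this)
xfs_repair -nv /dev/md11

# Actual repair attempt; this is the one that bailed out because
# the journal still had pending entries that need to be replayed
xfs_repair -v /dev/md11

# Last resort: -L zeroes the journal before repairing, which can lose
# the most recent in-flight metadata changes
xfs_repair -L /dev/md11
```

From what I've read, using /dev/mdX rather than the raw /dev/sdX device matters on Unraid, because writes through the md device keep parity in sync.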

 

So, I got a new drive today and it's in the process of pre-clearing.  Once it's ready, I'll pop out the old drive, re-assign that slot to the new drive, and let it rebuild.  Once it's back to normal, I'll repair this old drive, pre-clear it, and check it for any errors.  If all is good with it, I'll use it to replace a smaller drive.

 

Phase 1 - find and verify superblock...
        - block cache size set to 707216 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 3793692 tail block 3793686
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
ir_freecount/free mismatch, inode chunk 5/83977408, freecount 19 nfree 18
finobt ir_freecount/free mismatch, inode chunk 5/83977408, freecount 19 nfree 18
agi unlinked bucket 45 is 83977453 in ag 5 (inode=10821395693)
sb_ifree 888, counted 875
sb_fdblocks 632626081, counted 635247463
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
imap claims in-use inode 2201742810 is free, correcting imap
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
data fork in regular inode 15032386860 claims used block 1919352826
correcting nextents for inode 15032386860
bad data fork in inode 15032386860
would have cleared inode 15032386860
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 4
        - agno = 7
        - agno = 5
        - agno = 6
        - agno = 2
entry "Forged.in.Fire.S07E15.1080p.WEB.h264-TBS[rarbg].mkv" at block 0 offset 1392 in directory inode 15032386822 references free inode 15032386860
	would clear inode number in entry at offset 1392...
data fork in regular inode 15032386860 claims used block 1919352826
correcting nextents for inode 15032386860
would have cleared inode 15032386860
entry "RARBG.txt" in shortform directory 12907968861 references free inode 4373342706
would have junked entry "RARBG.txt" in directory inode 12907968861
entry "RARBG_DO_NOT_MIRROR.exe" in shortform directory 12907968861 references free inode 4373342707
would have junked entry "RARBG_DO_NOT_MIRROR.exe" in directory inode 12907968861
entry "vikings.s06e05.1080p.web.h264-ghosts.nfo" in shortform directory 12907968861 references free inode 4373342705
would have junked entry "vikings.s06e05.1080p.web.h264-ghosts.nfo" in directory inode 12907968861
would have corrected i8 count in directory 12907968861 from 6 to 3
entry "Source" in shortform directory 6619914978 references free inode 8589934758
would have junked entry "Source" in directory inode 6619914978
would have corrected i8 count in directory 6619914978 from 2 to 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
entry "Source" in shortform directory inode 6619914978 points to free inode 8589934758
would junk entry
would fix i8count in inode 6619914978
        - agno = 4
        - agno = 5
        - agno = 6
entry "RARBG.txt" in shortform directory inode 12907968861 points to free inode 4373342706
would junk entry
entry "RARBG_DO_NOT_MIRROR.exe" in shortform directory inode 12907968861 points to free inode 4373342707
would junk entry
entry "vikings.s06e05.1080p.web.h264-ghosts.nfo" in shortform directory inode 12907968861 points to free inode 4373342705
would junk entry
would fix i8count in inode 12907968861
        - agno = 7
entry "Forged.in.Fire.S07E15.1080p.WEB.h264-TBS[rarbg].mkv" in directory inode 15032386822 points to free inode 15032386860, would junk entry
bad hash table for directory inode 15032386822 (no data entry): would rebuild
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 2370139038, would move to lost+found
disconnected inode 2370139483, would move to lost+found
disconnected dir inode 6619914978, would move to lost+found
disconnected inode 10821395693, would move to lost+found
disconnected dir inode 12907968861, would move to lost+found
disconnected dir inode 12907968867, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 4373342699 nlinks from 1 to 3
would have reset inode 6619914978 nlinks from 3 to 2
would have reset inode 10821395693 nlinks from 0 to 1
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Fri Jan 31 15:55:07 2020

Phase		Start		End		Duration
Phase 1:	01/31 15:55:05	01/31 15:55:06	1 second
Phase 2:	01/31 15:55:06	01/31 15:55:06
Phase 3:	01/31 15:55:06	01/31 15:55:06
Phase 4:	01/31 15:55:06	01/31 15:55:06
Phase 5:	Skipped
Phase 6:	01/31 15:55:06	01/31 15:55:07	1 second
Phase 7:	01/31 15:55:07	01/31 15:55:07

Total run time: 2 seconds

 

Edited by heisenfig
37 minutes ago, heisenfig said:

I'm pretty sure a format would create a new filesystem.

Yep. A blank filesystem. Without any of your data. Is that what you want? If you format that slot, that format will be written to parity, just like any other write.

 

Parity rebuild encompasses the entire drive, filesystem included. It doesn't rebuild individual files, only the drive as a whole, corruption included.

 

You will need to fix that filesystem to recover your files. Normally the -L option doesn't corrupt anything.


Well, I wouldn't format it while it's part of the array, so it wouldn't be written to parity. 

 

I'm just thinking that if there is some corruption on it, I'm not sure I would want to trust it even after the rebuild.  How do I know some of the files aren't corrupted too?  It seems like the safer choice to just go ahead and replace the drive and let the parity rebuild onto it.  Are you saying once the new drive is rebuilt, it will have the same problem?  Like the corrupted filesystem is already part of the parity?  How is that possible if the drive doesn't mount to have any information read from it?  Would disconnecting the drive and letting it emulate that slot from parity still show it as unmountable?

 

I'm not doubting you if you say that's true.  I just thought that since parity is valid, I could have it rebuild the data to a new drive from parity.


I went ahead and ran the repair with the -L option.  Afterwards I restarted the array in normal mode and the drive mounted normally.  Is it safe to assume this drive is perfectly healthy again?

Phase 1 - find and verify superblock...
        - block cache size set to 707216 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 3793692 tail block 3793686
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
ir_freecount/free mismatch, inode chunk 5/83977408, freecount 19 nfree 18
finobt ir_freecount/free mismatch, inode chunk 5/83977408, freecount 19 nfree 18
agi unlinked bucket 45 is 83977453 in ag 5 (inode=10821395693)
sb_ifree 888, counted 875
sb_fdblocks 632626081, counted 635247463
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
imap claims in-use inode 2201742810 is free, correcting imap
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
data fork in regular inode 15032386860 claims used block 1919352826
correcting nextents for inode 15032386860
bad data fork in inode 15032386860
cleared inode 15032386860
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 1
        - agno = 7
        - agno = 3
        - agno = 5
        - agno = 6
        - agno = 4
entry "Forged.in.Fire.S07E15.1080p.WEB.h264-TBS[rarbg].mkv" at block 0 offset 1392 in directory inode 15032386822 references free inode 15032386860
	clearing inode number in entry at offset 1392...
entry "RARBG.txt" in shortform directory 12907968861 references free inode 4373342706
junking entry "RARBG.txt" in directory inode 12907968861
entry "RARBG_DO_NOT_MIRROR.exe" in shortform directory 12907968861 references free inode 4373342707
junking entry "RARBG_DO_NOT_MIRROR.exe" in directory inode 12907968861
entry "vikings.s06e05.1080p.web.h264-ghosts.nfo" in shortform directory 12907968861 references free inode 4373342705
junking entry "vikings.s06e05.1080p.web.h264-ghosts.nfo" in directory inode 12907968861
corrected i8 count in directory 12907968861, was 6, now 3
entry "Source" in shortform directory 6619914978 references free inode 8589934758
junking entry "Source" in directory inode 6619914978
corrected i8 count in directory 6619914978, was 2, now 1
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
bad hash table for directory inode 15032386822 (no data entry): rebuilding
rebuilding directory inode 15032386822
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 2370139038, moving to lost+found
disconnected inode 2370139483, moving to lost+found
disconnected dir inode 6619914978, moving to lost+found
disconnected inode 10821395693, moving to lost+found
disconnected dir inode 12907968861, moving to lost+found
disconnected dir inode 12907968867, moving to lost+found
Phase 7 - verify and correct link counts...
resetting inode 22525826 nlinks from 2 to 5
resetting inode 4373342699 nlinks from 1 to 3
resetting inode 6619914978 nlinks from 3 to 2
Maximum metadata LSN (1:3793688) is ahead of log (1:2).
Format log to cycle 4.

        XFS_REPAIR Summary    Fri Jan 31 22:32:07 2020

Phase		Start		End		Duration
Phase 1:	01/31 22:25:27	01/31 22:25:27
Phase 2:	01/31 22:25:27	01/31 22:27:18	1 minute, 51 seconds
Phase 3:	01/31 22:27:18	01/31 22:27:19	1 second
Phase 4:	01/31 22:27:19	01/31 22:27:19
Phase 5:	01/31 22:27:19	01/31 22:27:19
Phase 6:	01/31 22:27:19	01/31 22:27:19
Phase 7:	01/31 22:27:19	01/31 22:27:19

Total run time: 1 minute, 52 seconds
done

 

3 hours ago, heisenfig said:

Well, I wouldn't format it while it's part of the array, so it wouldn't be written to parity. 

Formatting outside the array is pointless, since it would then be rebuilt the same as it was.  As mentioned, rebuilding disks can't fix filesystem corruption.

 

2 hours ago, heisenfig said:

I went ahead and ran the repair with the -L option.  Afterwards I restarted the array in normal mode and the drive mounted normally.  Is it safe to assume this drive is perfectly healthy again?

The filesystem is healthy again; the drive itself always was.  There may or may not be some lost files, so check for a lost+found folder.
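A quick way to check, assuming the repaired disk is disk11 (adjust the mount path for your slot):

```shell
# If xfs_repair disconnected any inodes they land in lost+found;
# the fallback message covers the case where nothing was orphaned
ls -la /mnt/disk11/lost+found 2>/dev/null || echo "no lost+found, nothing was orphaned"
```

Entries in lost+found are named by inode number, so identifying them usually comes down to size and file type.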

 


Apparently, that didn't work after all.  The drive shows as mounted on the dashboard.

2020-02-02_1712

 

But nothing can be read from or written to that drive.  I am able to read/write all the other drives.

2020-02-02_1721

 

I found a folder in /mnt/user that I believe must have been on that drive, and nothing can be read or written in that folder.  The same error occurs.  The settings.json file is missing (it was probably on that drive?) and I can't create any new files in this folder or read any of the files in it.

2020-02-02_1726

 

 

Any ideas where to go from here?

 

EDIT:  Perhaps it WAS fixed, but became corrupted again.  After restarting the array, the drive is again unmountable and the test shows it has errors again.  Going to do the repair again.

 

EDIT2:  The repair worked again and some files that seemed missing are back, and I can read/write the drive.  So that's now two times the filesystem has become corrupted on that drive.

Edited by heisenfig

I'm assuming UD means Unassigned Devices.  But the drive has never shown under that.  The drive is assigned to slot #13 and listed as Disk 11 in the array.  It's also never shown up as an emulated disk.  If I remove the drive and try to start the array, it says it will disable the drive and show it as emulated, but I haven't done that yet because I wasn't sure if it's something I could come back from.  Is this what I need to do?  Disable it so it shows as emulated, then do the xfs_repair on the emulated drive?

 

If I go the other way and create a new config and assign all the drives as they are now, does it do the parity re-sync automatically?  Or is that something I have to do manually?


This 8TB drive replaced a 4TB drive 3 months ago.  It had been running fine until the power outage about a week ago.  The drive has never been "Unassigned" since it went into service 3 months ago.

 

After the last repair of the filesystem, everything seemed normal again.  I started a parity check with "write corrections to parity" un-checked.  During the first 8 hours, the files on that drive disappeared again.  But it still shows as mounted, as far as the web GUI is concerned.  Attached are the drive's SMART report and the diagnostics report, both taken before I stopped the array.  The parity check is still running.  It shows 12 sync errors and is estimated to finish in about 60 to 90 days (it normally takes less than a day).

2020-02-04_0830

 

2020-02-04_0831

WDC_WD80EMAZ-00WJTA0_JEKVEADZ-20200204-0832.txt

nabit-diagnostics-20200204-0835_beforeStoppingArray.zip

 

After stopping the parity check and stopping the array, these are the screenshots showing the drive still marked as assigned to slot 13.  Attached is the output of "xfs_repair -nv".

 

Stopped array.


 

Running array in maintenance mode.


xfs_repair-nv.txt  (It didn't show any errors on the drive this time.)

 

Restarted the array in normal mode.  This drive is still assigned, and all the files are back on disk 11.

2020-02-04_0849

 

2020-02-04_0852

 

 

This is not the same outcome I had yesterday.  When I restarted in normal mode yesterday, the drive was still assigned but had a note on it that it was not mountable due to no filesystem.  The only difference is that the "correct parity" checkbox was checked when the parity check ran.

 

Here's a new diagnostics after restarting the array.

nabit-diagnostics-20200204-0900_afterRestartingArray.zip

 

Hopefully, this is enough information to determine what is going on with this drive.

Edited by heisenfig

There's a connection problem with disk11; these errors repeat constantly in the logs:

Feb  3 15:09:06 nabit kernel: ata20.00: status: { DRDY }
Feb  3 15:09:06 nabit kernel: ata20: hard resetting link
Feb  3 15:09:16 nabit kernel: ata20: softreset failed (device not ready)
Feb  3 15:09:16 nabit kernel: ata20: hard resetting link
Feb  3 15:09:17 nabit kernel: ata20: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb  3 15:09:18 nabit kernel: ata20.00: configured for UDMA/133
Feb  3 15:09:18 nabit kernel: ata20: EH complete
Feb  3 15:09:31 nabit kernel: ata20.00: exception Emask 0x10 SAct 0xffffffff SErr 0x190002 action 0xe frozen
Feb  3 15:09:31 nabit kernel: ata20.00: irq_stat 0x80400000, PHY RDY changed
Feb  3 15:09:31 nabit kernel: ata20: SError: { RecovComm PHYRdyChg 10B8B Dispar }
Feb  3 15:09:31 nabit kernel: ata20.00: failed command: READ FPDMA QUEUED
Feb  3 15:09:31 nabit kernel: ata20.00: cmd 60/08:00:00:6a:e4/00:00:67:01:00/40 tag 0 ncq dma 4096 in
Feb  3 15:09:31 nabit kernel:         res 40/00:00:68:0a:d6/00:00:29:01:00/40 Emask 0x10 (ATA bus error)
Feb  3 15:09:31 nabit kernel: ata20.00: status: { DRDY }
Feb  3 15:09:31 nabit kernel: ata20.00: failed command: READ FPDMA QUEUED
Feb  3 15:09:31 nabit kernel: ata20.00: cmd 60/40:08:88:8e:e9/00:00:2c:02:00/40 tag 1 ncq dma 32768 in
Feb  3 15:09:31 nabit kernel:         res 40/00:00:68:0a:d6/00:00:29:01:00/40 Emask 0x10 (ATA bus error)
Feb  3 15:09:31 nabit kernel: ata20.00: status: { DRDY }
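If you want a rough idea of how often that link is dropping, you can count the resets in the syslog.  This assumes the standard /var/log/syslog location; the guards just keep it from erroring out if the file is missing:

```shell
# Count the hard link resets logged so far; a missing or empty log counts as 0
resets=$(grep -c 'hard resetting link' /var/log/syslog 2>/dev/null || true)
echo "link resets logged: ${resets:-0}"
```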

 

11 minutes ago, heisenfig said:

The only difference is that the "correct parity" checkbox was checked when the parity check ran.

That won't make any difference in this case, since it doesn't affect data on disk11.  The connection problems likely explain the issues you've been having; replace both cables.

 

The file system is mounting correctly now, so nothing to fix there for now.

 

 

 

P.S. Unrelated to this, but there are also a few out-of-memory errors in the logs.


Hmm.. That drive is in an ICYDOCK, so all 5 drives in the dock share 3 power cables.  Since it's isolated to just one disk, I'm assuming the power cables are probably okay.  When I get home, I'll replace the data cable for that one.  If that doesn't work, I'll replace the 3 power cables too.

 

Yeah, I think the memory thing was unrelated.  I checked and all the CPUs were pegged.  Rebooting to see if that clears it up.  Edit: it did.

 

Thanks!

 

Edited by heisenfig

Haven't had a chance to change the cable yet, but the drive has stayed up for over 24 hours now.  I have a theory, though.  This is a shucked WD drive that required tape over pin 3 of the power connector.  I didn't have Kapton tape, so I used a piece of tape from my label maker.  I think it's possible that as the temperature of the drive increases, the tape allows just enough current to pass to that pin to shut the drive off.

 

Secondary to that, about the memory issues: there ended up being 6 instances of rclone running in the background trying to back up the array to Google Drive, which kept the drives working overtime, causing them to run warmer than normal.
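In case anyone else hits this, spotting the stray instances is a one-liner.  I'm assuming pgrep is available (it ships with procps on Unraid):

```shell
# List every running rclone process with its PID and command line;
# the fallback prints a note when none are running
pgrep -a rclone || echo "no rclone processes running"
```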

