[SOLVED] My unRAID is a MESS... Please HELP!


Joseph


27 minutes ago, johnnie.black said:

You should try one thing at a time or you won't know what the problem was. If you had another complete server, you could move all the disks there and troubleshoot this one later with test data instead of your real data.

 

The irony is, I was contemplating the purchase of a beefier box and using the current box as a backup unRAID. This exercise has accelerated that consideration.

On 12/18/2017 at 1:54 AM, johnnie.black said:

 

If the array data was unchanged since the beginning of the rebuild (this includes no dockers/VMs running on the array), there's still a chance to rebuild disk6 again, assuming disk5 is OK, but you need to fix whatever is wrong with the server first.

RECAP:

* Disk3 was knocked offline and accidentally reformatted. The drive has been pulled for a data recovery attempt. Its contents are being emulated, and the emulated disk shows as empty.

* Disk5 was knocked offline and rebuilt, but with an enormous number of writes and errors.

* Disk6 was knocked offline... at this point I stopped the array.

* Memtest was run and completed without errors.

* Currently, unRAID is shut down until I get a replacement power supply.

 

BRIEF UPDATE:

* The data recovery on Disk3 is still a work in progress.

* I examined the contents of Disk5 and Disk6 via Ubuntu and things appear to be intact. However, unless I stumble on a damaged file manually, there's no way of really knowing whether there's actual data corruption. The error count on the main page suggests there will be, but it's a needle in a haystack.

 

After I replace the power supply and get a replacement for Disk3, what are your thoughts on rebuilding the array from parity (i.e. Disk3 and Disk6) vs. rebuilding the parity disks from all the data drives in the drive pool?

 

Thanks for your help.

2 minutes ago, Joseph said:

After I replace the power supply and get another disk for Disk3, what are your thoughts on rebuilding the array from parity (i.e. Disk3 and Disk6) vs. rebuilding the parity disks from all the data drives in the drive pool?

 

This can only work if the array data was 100% unchanged during the failed rebuild. If it was, you have nothing to lose by trying to rebuild disk6 onto a spare disk. If you don't have a spare, it would be best to back up the current disk6 before using it; it's corrupt for sure, but maybe some of the data is good.

14 minutes ago, johnnie.black said:

 

This can only work if the array data was 100% unchanged during the failed rebuild. If it was, you have nothing to lose by trying to rebuild disk6 onto a spare disk. If you don't have a spare, it would be best to back up the current disk6 before using it; it's corrupt for sure, but maybe some of the data is good.

So, I'm a little confused. If I'm reading correctly, you're saying it's better to have the array attempt to rebuild to a replacement Disk6 even though Disk5 had a ton of errors and I can see valid files on Disk6 (as well as Disk5, for that matter) via Ubuntu?

  • 2 weeks later...
On 12/20/2017 at 1:09 PM, johnnie.black said:

It should be OK, but a read-only mount would be better in these situations.

UPDATE:

I found the original Disk5 HDD that I shelved about a month ago as a backup, and it might be intact. So hopefully it won't be a total loss if rebuilding from parity doesn't work. I still don't know what PSU to buy, so I posted in the PSU thread.

 

UPDATE2:

OK, so my Seasonic PRIME 750W 80 Plus Gold PSU arrived today and I have everything back up and running. I had enough power cables, so I removed all splitters just to be safe.

When I went to start the array, it did not 'see' Disk3 or Disk5 and wanted to format them to bring them online--WHICH I DID NOT DO THIS TIME. I stopped the array and ran diagnostics (see attached). Any thoughts on how to proceed? I don't want to blow it this time.


Disk3 is empty, correct? Still, its not being correctly emulated is not a good sign, and the same goes for disk5. Before rebuilding, run xfs_repair on both emulated disks: unassign the disk5 you had assigned just before grabbing the diags, start the array in maintenance mode and run:

 

xfs_repair -v /dev/md3

 

When done, run the same on /dev/md5, then start the array with both disks still unassigned and see if the emulated disks mount.
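For reference, a minimal sketch of the full sequence (assuming the array is already in maintenance mode; the -n pass is only an optional read-only check that reports problems without writing anything):

xfs_repair -n /dev/md3   # optional dry run: report problems, change nothing
xfs_repair -v /dev/md3   # repair the emulated disk3
xfs_repair -v /dev/md5   # then repair the emulated disk5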

4 hours ago, johnnie.black said:

Disk3 is empty, correct? Still, its not being correctly emulated is not a good sign, and the same goes for disk5. Before rebuilding, run xfs_repair on both emulated disks: unassign the disk5 you had assigned just before grabbing the diags, start the array in maintenance mode and run:

UPDATE:

md3

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!

attempting to find secondary superblock...
[...]
found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 1471432 entries
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
resetting superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
resetting superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
zero_log: head block 13799 tail block 13795
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

 

This message is confusing to me. I'm leaving it in maintenance mode for now. Thoughts on what to do next (before running xfs_repair on md5)?

9 hours ago, johnnie.black said:

xfs_repair -vL /dev/md3

Use -L

 

UPDATE on md3 repair (See Below):

Is it OK to run the repair on md5, or do I need to do something else first? Thanks.

 


Phase 1 - find and verify superblock...
        - block cache size set to 1471432 entries
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
resetting superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
resetting superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
zero_log: head block 13799 tail block 13795
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
sb_icount 0, counted 1472
sb_ifree 0, counted 534
sb_fdblocks 976277683, counted 976273761
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
bad CRC for inode 96
bad CRC for inode 99
bad CRC for inode 111
bad CRC for inode 116
bad CRC for inode 118
bad CRC for inode 119
bad CRC for inode 127
bad CRC for inode 128
bad CRC for inode 156
bad CRC for inode 96, will rewrite
cleared root inode 96
bad CRC for inode 99, will rewrite
bad CRC for inode 111, will rewrite
cleared inode 111
bad CRC for inode 116, will rewrite
cleared inode 116
bad CRC for inode 118, will rewrite
cleared inode 118
bad CRC for inode 119, will rewrite
cleared inode 119
bad CRC for inode 127, will rewrite
cleared inode 127
bad CRC for inode 128, will rewrite
cleared inode 128
bad CRC for inode 156, will rewrite
cleared inode 156
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 1
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - reset superblock...
Phase 6 - check inode connectivity...
reinitializing root directory
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected dir inode 4298782817, moving to lost+found
Phase 7 - verify and correct link counts...
resetting inode 99 nlinks from 2 to 3
Maximum metadata LSN (1:13791) is ahead of log (1:2).
Format log to cycle 4.

        XFS_REPAIR Summary    Sat Dec 30 01:46:31 2017

Phase		Start		End		Duration
Phase 1:	12/30 01:40:25	12/30 01:40:25	
Phase 2:	12/30 01:40:25	12/30 01:42:58	2 minutes, 33 seconds
Phase 3:	12/30 01:42:58	12/30 01:42:59	1 second
Phase 4:	12/30 01:42:59	12/30 01:42:59	
Phase 5:	12/30 01:42:59	12/30 01:42:59	
Phase 6:	12/30 01:42:59	12/30 01:42:59	
Phase 7:	12/30 01:42:59	12/30 01:42:59	

Total run time: 2 minutes, 34 seconds
done
root@Tower:/#                                                                                      

 

15 minutes ago, johnnie.black said:

You can run xfs_repair on disk5. After both are done, start the array to check whether xfs_repair was successful, though if disk3 is still empty the one that really matters is disk5.

It's taking a while to find the secondary superblock on md5. My guess is that once it's finished, it will instruct me to run the -L option on it too... if that's the case, I will do that and then start the array afterwards.

2 hours ago, johnnie.black said:

Not a good sign. Let it finish, but you'll likely have to use the current disk5 or the previous disk5 to recover as much as possible.

DIFFERENT RESULTS!!

This could be promising, no? So I guess the next step is to start the array (with disk5 not installed) and see what happens?

 

[...]

...found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 1471424 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 886339 tail block 886339
        - scan filesystem freespace and inode maps...
sb_icount 19968, counted 20352
sb_ifree 7680, counted 6889
sb_fdblocks 31812789, counted 192468030
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
bad CRC for inode 96
bad CRC for inode 99
bad CRC for inode 111
bad CRC for inode 116
bad CRC for inode 118
bad CRC for inode 119
bad CRC for inode 127
bad CRC for inode 128
bad CRC for inode 156
bad CRC for inode 96, will rewrite
cleared root inode 96
bad CRC for inode 99, will rewrite
cleared inode 99
bad CRC for inode 111, will rewrite
cleared inode 111
bad CRC for inode 116, will rewrite
cleared inode 116
bad CRC for inode 118, will rewrite
cleared inode 118
bad CRC for inode 119, will rewrite
cleared inode 119
bad CRC for inode 127, will rewrite
cleared inode 127
bad CRC for inode 128, will rewrite
cleared inode 128
bad CRC for inode 156, will rewrite
cleared inode 156
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - reset superblock...
Phase 6 - check inode connectivity...
reinitializing root directory
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected dir inode 99, moving to lost+found
disconnected dir inode 2159698, moving to lost+found
disconnected dir inode 37065147, moving to lost+found
Phase 7 - verify and correct link counts...
resetting inode 96 nlinks from 2 to 3
resetting inode 52475801 nlinks from 2 to 5
Note - stripe unit (0) and width (0) were copied from a backup superblock.
Please reset with mount -o sunit=<value>,swidth=<value> if necessary

        XFS_REPAIR Summary    Sat Dec 30 13:57:24 2017

Phase		Start		End		Duration
Phase 1:	12/30 12:40:26	12/30 13:56:59	1 hour, 16 minutes, 33 seconds
Phase 2:	12/30 13:56:59	12/30 13:57:00	1 second
Phase 3:	12/30 13:57:00	12/30 13:57:14	14 seconds
Phase 4:	12/30 13:57:14	12/30 13:57:14	
Phase 5:	12/30 13:57:14	12/30 13:57:14	
Phase 6:	12/30 13:57:14	12/30 13:57:15	1 second
Phase 7:	12/30 13:57:15	12/30 13:57:15	

Total run time: 1 hour, 16 minutes, 49 seconds
done

 

On 12/30/2017 at 4:06 PM, johnnie.black said:

Difficult to guess. Start the array with no disk assigned for disk5.

There seem to be contents on the emulated disk5... but is there any way to verify the validity of the contents?

 

All of it is in lost+found. I'm guessing the next step is to rebuild Disk3 & Disk5 and then, to ensure the validity of the contents of Disk5, copy everything from the disk5 backup that I still have lying around... thoughts?
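If I do end up restoring from the shelved drive, a minimal sketch of the copy would be something like this (assuming the old disk is mounted read-only at a hypothetical /mnt/disks/disk5-old, e.g. via Unassigned Devices):

rsync -av /mnt/disks/disk5-old/ /mnt/disk5/   # -a preserves permissions/timestamps, -v lists files; the trailing slash copies the directory contents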

 

 

7 minutes ago, Joseph said:

I'm guessing the next step is to rebuild Disk3 & Disk5 and then to ensure the validity of the contents of Disk5

You can see the contents of disk5 by browsing the emulated disk; whatever is there is what is going to be on the rebuilt disk. If you decide to rebuild, it would be best to use a new spare disk, so you can still access the old one if needed.

1 hour ago, johnnie.black said:

You can see the contents of disk5 by browsing the emulated disk

 

Maybe I don't understand what lost and found is... it seems to me that at this point there could be certain files (such as audio or video) which might seem OK, but there's no way to know if some parts of the timeline are corrupted unless they are played all the way through... which is why I'm considering just restoring the files from the shelved disk5.
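For audio/video specifically, one way to check for mid-stream corruption without playing a file all the way through is a full decode with ffmpeg (just a sketch; the filename is hypothetical, and ffmpeg would need to be installed or run from another machine):

ffmpeg -v error -i /mnt/disk5/lost+found/12345.mkv -f null - 2>decode-errors.txt   # decode the whole stream, keep only errors

An empty decode-errors.txt means the stream decoded cleanly end to end.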

 

3 minutes ago, Joseph said:

 

Maybe I don't understand what lost and found is... it seems to me that at this point there could be certain files (such as audio or video) which might seem OK, but there's no way to know if some parts of the timeline are corrupted unless they are played all the way through... which is why I'm considering just restoring the files from the shelved disk5.

 

Lost and found is basically file chains found during the repair that the repair could not figure out how to catalog. Modern file systems separate the file names as seen in the directories from the metadata about the file, and normally also from the actual file data.

 

The printout above mentions inodes - each inode represents one file, but without any file name or owning directory. The entries in the directory just point to the inode, and the inode contains information about where the file data is stored. This separation is what allows hard links - multiple directory entries, with potentially completely different file names, pointing at the same inode and hence accessing the same file data.
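You can see this on any Linux box (the paths here are just an illustration):

touch /tmp/example.dat
ln /tmp/example.dat /tmp/example-link.dat   # second directory entry, same inode
ls -li /tmp/example*.dat                    # both names show the same inode number and a link count of 2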

 

Lost and found just means that directory entries - or whole directories - have been lost, so the unreachable inodes were added to lost and found.

 

This also means that it is never good to see "bad CRC for inode xxx".


No, you can have more lost files.

 

And you don't know if other files are intact or not - XFS can checksum the meta-data but not the file data.

 

That is a reason why it's good to keep hashes for all static files, allowing you to regularly validate the file content.

And - of course - why it's also good to have a working backup scheme that takes into account file changes and not just overwrites a good backup file with a corrupted copy.
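A minimal sketch of such a hashing scheme using standard tools (the paths and manifest name are only examples):

find /mnt/disk5 -type f -exec sha256sum {} + > /boot/disk5.sha256   # build a checksum manifest for everything on disk5
sha256sum -c --quiet /boot/disk5.sha256                             # later: verify against the manifest, report only mismatches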
