(6.8.3) Corruption: a hard reboot of the system has removed 23 of the 25 file shares, only 2 left...



Hello, I need some help. I was forced to hard-reboot my server (I also run a Windows gaming VM on top of it). When the system came back up, only 2 of the 25 or so file shares were present...

 

I have booted the array into maintenance mode and ran an xfs_repair on each drive in the array, but this didn't seem to help...  I know a bit about Linux, but I also know that I should ask for help before proceeding...  The drives still appear to have the correct amount of space consumed, so I think the data is still there; I just can't find a way to safely access it...
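For reference, what I ran was along these lines (with the array started in maintenance mode, Unraid exposes each data disk as /dev/mdX; I repeated this for each of the four data disks):

xfs_repair -n /dev/md1   # dry run first: report problems without writing anything
xfs_repair /dev/md1      # actual repair; same again for /dev/md2, /dev/md3, /dev/md4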

 

Your assistance is greatly appreciated here...

qw-diagnostics-20200712-1912.zip

Link to comment
1 hour ago, trurl said:

Looks like your cache pool has been formatted.

That occurred just before the issue and was intentional, but I don't think it is related to the actual problem...  I had corrupted the cache pool about a week ago and repaired it, but I ended up deciding the repair was less than ideal, so I decided to format the pool and restore the data from backup...

 

The format worked as intended. As I was getting ready to perform the restore, I issued a reboot command, and the system locked up for almost 15 minutes (my longest previous reboot was around 5 minutes). I was unable to shut down by any method, so I was forced to hard-shutdown the system; the pool appeared to be corrupted after this reboot...  (The empty cache pool appears to be working correctly.)

Link to comment
3 hours ago, johnnie.black said:

I don't see anything out of the ordinary in the logs. Are the shares unavailable both locally and on the LAN, or just on the LAN?

I have checked the individual /mnt/disk?/ locations; they show the same 2 shares that appear under /mnt/user/.

I am trying to find a way to restore the original files if possible... or failing that, find where the files went, re-set up the pool, and copy things back manually...

Link to comment
23 minutes ago, Warrentheo said:

I have checked the individual /mnt/disk?/ locations; they show the same 2 shares

If you only see two shares using the disk mappings, the others are either gone or were moved inside one of the existing ones; shares are just top-level folders.
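For example, comparing the top-level folders on each disk against the user share view is just:

ls /mnt/disk1 /mnt/disk2 /mnt/disk3 /mnt/disk4   # every top-level folder here is a share
ls /mnt/user                                     # the merged view of all of the above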

Link to comment
9 hours ago, Warrentheo said:

I had corrupted the cache pool about a week ago and repaired it, but I ended up deciding the repair was less than ideal, so I decided to format the pool and restore the data from backup...

 

The format worked as intended and I was getting ready to perform the restore

Since you haven't yet restored the files that were on cache from the backups, possibly all the missing shares are in those backups. The user shares are simply the top-level folders on cache and the array. If the folders for the shares don't exist, then the shares don't exist. Maybe the folders for the shares are in your backups.

 

Link to comment

The cache was used primarily as a cache and for VM virtual disks, with some other minor knick-knack files thrown in...  98% of the data on the server was on the main pool on magnetic media, and only 1 share was set to "Cache Only"...  I disabled the VMs, disabled Docker, deleted their associated image files, and moved all other files off of the cache...  It was reporting empty when I stopped the array, removed the cache drives with the "New Config" option (only the cache drives were dropped), and performed a blkdiscard on the cache drives. I then attempted the reboot, which locked up the computer...
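The wipe itself was nothing exotic, roughly the following for each cache SSD (the device names here are from memory, so treat them as assumptions):

blkdiscard /dev/nvme0n1   # discard every block on the first cache SSD
blkdiscard /dev/nvme1n1   # same for the second; both read back empty afterwards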

 

Those shares were there before this reboot, and after the hard reboot it appears every drive was somehow affected and had its files removed...  The drives still report the correct amount of data stored on them, so I don't think it is gone, just no longer accessible through the normal shares...

 

Is there a place where the files are stored that I can use to try and rebuild the pool, even if I have to do it manually...?

Link to comment
49 minutes ago, Warrentheo said:

Is there a place where the files are stored

All the files are under /mnt/disk1, /mnt/disk2, ... , /mnt/cache, and also under /mnt/user because

2 hours ago, trurl said:

The user shares are simply the top level folders on cache and array

If the files still exist, then they are in the user shares somewhere, maybe you moved them

3 hours ago, johnnie.black said:

inside one of the existing ones
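A quick way to check for that is a depth-limited search for directories, something like:

find /mnt/disk[1-4] -maxdepth 3 -type d | less   # look for the old share folder names nested deeper down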

 

53 minutes ago, Warrentheo said:

it appears every drive was somehow affected and had its files removed... 

This seems extremely unlikely

 

56 minutes ago, Warrentheo said:

only 1 share was set to "Cache Only"

According to your diagnostics, there were other shares using cache, including cache-prefer shares. Cache-prefer shares would normally have all files on cache.

 

You also have an unassigned disk that is not mounted and apparently set to passthru. Did you only have disk1,2,3,4 and cache,cache2 before the New Config? If you only dropped cache, why were you rebuilding parity?

Link to comment
4 hours ago, itimpi said:

Have you tried browsing the disks from the Unraid GUI (via the folder icon on the right of every drive) to see if the files/folders you expect are actually there?

 

The GUI, SSH, and even direct local access all show the same info; the data isn't in the root of the drives anymore.

 

4 hours ago, trurl said:

All the files are under /mnt/disk1, /mnt/disk2, ... , /mnt/cache, and also under /mnt/user because

If the files still exist, then they are in the user shares somewhere, maybe you moved them

No files were moved in the main pool (I just emptied the last few files off the cache).

Files were there before the hard reboot and were all gone afterward.

 

Thank you for the replies. I am just getting out of school; when I get home I will try some of these suggestions and see where we are at...

 

Link to comment
  • 2 weeks later...

Update, 10 days later... It appears that the data on every drive in the pool somehow got badly corrupted... I am very miffed about this (around 11 TB possibly lost), but I want to be clear up front that I am NOT expressing anger at Unraid or Lime Technology, since I am sure about 95% of this is my fault somehow... My main goals here are to figure out how this happened in the first place and, if possible, to find a way to keep it from happening to someone else... I realize there are several steps where I got punch-happy and did things I knew even at the time I should not be doing... That said, total array corruption is a pretty major failure even if it is probably mostly my fault, and I want to help prevent it happening to anyone else...

 

I was trying to reset the cache pool back to empty when I got stuck trying to reboot... The data corruption occurred after I forced a hard reboot, after the soft reboot seemed to have failed for multiple minutes... Is some part of the array/pool file table stored on the cache drive? Did wiping the cache pool somehow erase the file table for the pool? I upgraded the SSDs in my cache pool around August of 2019, and that wipe and replacement of the pool went without a hitch...

 

After the corruption, my array went from around 28 shares and around 11 TB of data down to only 2 shares and about 40 GB of data... This weird partial corruption is very difficult for me to figure out: every drive appears to have the same 2 shares left, the surviving data is not spread evenly across the 4 drives, and yet the corruption was not total, since somehow 5% of the data remained...?

 

The XFS recovery software I am using says that nearly the entire drive ended up in the lost+found folder, with some of it apparently recoverable...  I think I will be able to recover the large media files that way, mostly the BluRay backup dumps...  Figuring out the small files is going to be very tedious, since a decent part of the array was a Steam games folder, so a huge portion of the small files are going to be meaningless game files...
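My rough triage plan for lost+found looks something like this (the recovery destination is just a placeholder path I made up):

cd /mnt/disk1/lost+found
find . -maxdepth 1 -type f -exec file {} + | sort -t: -k2 | less   # group the recovered inodes by content type
mkdir -p /mnt/disks/recovery/video
find . -type f -size +1G -exec mv -t /mnt/disks/recovery/video {} +   # anything over 1G is almost certainly a BluRay dump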

 

Is this somehow caused by a hardware failure of the CPU/RAM/motherboard? I suppose I need to shut down and do a full long-form hardware scan just to make sure...
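For the scan I am thinking of an overnight MemTest86 pass from the boot menu plus SMART extended self-tests on every drive, something like this (device letters assumed):

for d in /dev/sd[b-e]; do smartctl -t long "$d"; done   # kick off the long self-test on each disk
smartctl -a /dev/sdb                                    # check the results once the tests finish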

 

I realize now that forcing the hard shutdown when the kernel clearly didn't want to stop was a mistake, but I would expect minor bit rot from that sort of thing, not near-total corruption...

 

Currently I am still trying to do a post-mortem of the issue. I have been running data recovery tools, but they take around 2 days per 6 TB drive, so progress has been slow... So far the best solution I have found is manual data recovery, where I "undelete" the entire drive, then manually rename a few hundred thousand files from memory and pray that the files are whole...

 

Any feedback or assistance on this process, or even a pat on the back and commiseration would be appreciated...

Link to comment

According to your diagnostics in the first post, the drives were mounted and had about 8.4 TB of data on them. So from the diagnostics it isn't obvious that anything was corrupt or even lost. But diagnostics can't tell us much about user actions.

 

47 minutes ago, Warrentheo said:

Is some part of the array/pool file table stored on the cache drive? Did wiping the cache pool somehow erase file table for the pool?

This is not at all how anything works in Unraid. Each disk is a separate filesystem. Each disk can be read by itself without any other disk. Each file exists completely on a single disk.
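As an illustration, any single array data disk can be pulled and mounted read-only on any Linux machine, entirely on its own (the device name is just an example):

mkdir /tmp/d1
mount -o ro /dev/sdb1 /tmp/d1   # an XFS array disk is an ordinary filesystem
ls /tmp/d1                      # the share folders and files are right there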

 

And a hard reset is extremely unlikely to cause anything other than incomplete parity updates (and parity doesn't contain any of your data) or possibly an incomplete file write, if anything was being written at the time.

 

I don't know what you may have done since you posted those diagnostics, but I think 8.4 TB of files were still on your disks at that time.

 

According to those diagnostics, those 8.4 TB of files would have been in these (anonymized) user shares:

 

i----s  on disk3

w-----a on disk1,2,3,4

 

And this is how much of each disk was used:

Filesystem      Size  Used Avail Use% Mounted on
/dev/md1        5.5T  4.5T 1014G  82% /mnt/disk1
/dev/md2        2.8T  1.4T  1.4T  50% /mnt/disk2
/dev/md3        2.8T  1.6T  1.2T  57% /mnt/disk3
/dev/md4        1.4T  1.1T  336G  77% /mnt/disk4
/dev/nvme0n1p1  1.9T   17M  1.9T   1% /mnt/cache

Link to comment
