(SOLVED) Full disk now showing as -15 TB / nan


project6


I've been running Unraid for years without a single problem, currently on version 6.8.0. A few days ago I started receiving email alerts that disk3 was full (100%), but I wasn't able to look into it since I didn't have access to the server for a few days. I figured it would just assign new data to the other disks (there is plenty of space left on at least one drive).

 

When I managed to get access to the UI I logged in and found this:

[screenshot: Unraid web UI showing disk3 usage as -15 TB / nan]

 

So I logged in to the server, and `df -h` showed this:

Filesystem      Size  Used Avail Use% Mounted on
rootfs          3.8G  985M  2.9G  26% /
tmpfs            32M  480K   32M   2% /run
devtmpfs        3.8G     0  3.8G   0% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
cgroup_root     8.0M     0  8.0M   0% /sys/fs/cgroup
tmpfs           128M  224K  128M   1% /var/log
/dev/sda        7.3G  611M  6.7G   9% /boot
/dev/loop0      9.0M  9.0M     0 100% /lib/modules
/dev/loop1      5.9M  5.9M     0 100% /lib/firmware
/dev/md1        1.9T  1.7T  138G  93% /mnt/disk1
/dev/md2        1.9T  1.7T  165G  92% /mnt/disk2
/dev/md3        1.9T  -15T   16T    - /mnt/disk3
/dev/md4        1.9T  214G  1.7T  12% /mnt/disk4
/dev/sdb1        56G   50G  6.1G  90% /mnt/cache
shfs            7.3T  -11T   18T    - /mnt/user0
shfs            7.4T  -11T   18T    - /mnt/user
/dev/loop2       60G   26G   31G  46% /var/lib/docker
shm              64M     0   64M   0% /var/lib/docker/containers/.../mounts/shm
shm              64M     0   64M   0% /var/lib/docker/containers/.../mounts/shm

 

As you can see, disk3 is reporting -15 TB used and 16 TB available. The mover doesn't like this: it had apparently been invoked while the disk was reporting these numbers and got stuck, and I couldn't do anything but force a shutdown/reboot.

 

Is there anything I can do to recover disk3 and get it back to reporting its original used/free space?

 

Solution

I got some excellent help from @JorgeB and managed to solve this without data loss, so I figured I'd describe the process here instead of making you read through all the posts. The short answer is to get a disk that's larger than the one that filled up (in my case a 2 TB drive filled up, so I used a 4 TB USB external drive), clone the full disk to it, repair the filesystem on the larger drive, and then copy the data back.

 

Note: Many of these commands destroy data on the target device, so always double-check that you are using the correct devices and partitions throughout.
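A quick way to do that double-checking is lsblk, which should be available on stock Unraid; the size and model columns make it easy to confirm which device node belongs to which physical drive:

lsblk -o NAME,SIZE,MODEL,MOUNTPOINT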

 

So the first step is to check the filesystem on the filled-up array device; in my case this was /dev/md3. (Run the check against the /dev/mdX device with the array started in maintenance mode, so that parity stays in sync with any corrections.)

reiserfsck --check /dev/md3

It reported errors and needed to be run with --rebuild-tree, so I tried that:

reiserfsck --rebuild-tree /dev/md3

This took a while and corrected some blocks and sizes, but during the allocation phase the process aborted since it did not have enough free space on the drive:

Replaying journal: Done.
Reiserfs journal '/dev/md3' in blocks [18..8211]: 0 transactions replayed
###########
reiserfsck --rebuild-tree started at Thu Nov 19 10:43:56 2020
###########

Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 488378638 blocks marked used
Skipping 23115 blocks (super block, journal, bitmaps) 488355523 blocks will be read
0%block 11428318: The number of items (3) is incorrect, should be (1) - corrected
block 11428318: The free space (480) is incorrect, should be (1040) - corrected
pass0: vpf-10210: block 11428318, item 0: The item with wrong offset or length found [2145255427 4292362239 0x66460007 DRCT (2)], len 3008 - deleted
block 12095023: The number of items (1027) is incorrect, should be (1) - corrected
block 12095023: The free space (0) is incorrect, should be (4048) - corrected
pass0: vpf-10150: block 12095023: item 0: Wrong key [0 0 0x4 SD (0)], deleted
....20%....40%....60%....80%....100%                       left 0, 40403 /sec
2558925 directory entries were hashed with "r5" hash.
        "r5" hash is selected
Flushing..finished
        Read blocks (but not data blocks) 488355523
                Leaves among those 601777
                        - leaves all contents of which could not be saved and deleted 3
                Objectids found 2559678

Pass 1 (will try to insert 601774 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....Not enough allocable blocks, checking bitmap...there are 1 allocable blocks, btw

out of disk space
Aborted

So I plugged in the external drive and used dmesg to make sure I knew which device node it had been assigned (/dev/sdh in my case). My external drive came with some pre-initialized Windows partitions, so I erased them using fdisk and cleared the drive to zeroes using dd:

dd if=/dev/zero of=/dev/sdh status=progress
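Zeroing the entire 4 TB drive over USB takes many hours. If the goal is only to get rid of the old partition table and filesystem signatures, wipefs (also part of util-linux) should accomplish that in seconds; I didn't use it here, so treat it as a suggestion:

wipefs -a /dev/sdh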

Then I cloned the full disk to the external drive. Note: make sure to read from the actual disk device (/dev/sdX) and not the Unraid array device (/dev/mdX). In my case /dev/md3 corresponded to /dev/sdg:

dd if=/dev/sdg of=/dev/sdh status=progress
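dd's default block size is only 512 bytes, which makes a 2 TB clone over USB very slow; a larger block size usually speeds it up considerably:

dd if=/dev/sdg of=/dev/sdh bs=1M status=progress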

Once finished, you have a byte-for-byte copy of the disk on the external drive, which now carries a 2 TB partition. We need to extend that partition so there is enough space to rebuild the filesystem. I removed the existing partition (again with fdisk) and created a new one with the same start sector but utilizing the entire disk instead of 2 TB. The start sector needs to be 64, which I could not set with fdisk, but sgdisk handles it. This command creates a new partition with the correct start sector, using all available space on the disk:

sgdisk -o -a 8 -n 1:32K:0 /dev/sdh
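You can verify the result before continuing; the print option should show a single partition starting at sector 64 (the 32K offset divided by the 512-byte sector size) and spanning the whole disk:

sgdisk -p /dev/sdh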

I then re-created the superblock as explained here, and then ran reiserfsck to rebuild the tree again, this time on the new partition:

reiserfsck --rebuild-tree /dev/sdh1
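If you can't locate the linked superblock instructions, reiserfsck has a dedicated mode for that step; run it on the new partition before --rebuild-tree and answer its prompts about the filesystem format and block size:

reiserfsck --rebuild-sb /dev/sdh1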

Since the new partition now covers the entire 4 TB drive, the scanning and allocation phases are going to take a long time, especially if you use a USB drive like I did.

 

When finished you should have a working reiserfs filesystem on the USB drive! I mounted the drive using the Unassigned Devices plugin and looked through the files to make sure everything was OK.
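If you don't have the Unassigned Devices plugin, a plain read-only mount from the console works just as well (the mount point name below is just an example):

mkdir -p /mnt/recovery
mount -t reiserfs -o ro /dev/sdh1 /mnt/recovery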

 

As a final step I formatted the old 2 TB drive using the Unraid web UI and then manually copied the data from the mounted USB disk back to the freshly formatted 2 TB drive. This is also a good opportunity to switch to another filesystem, e.g. xfs instead of reiserfs.
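One way to do that copy is rsync: archive mode preserves permissions and timestamps, and the transfer can be resumed if it is interrupted. The source path below assumes the example mount point from the earlier note; with Unassigned Devices it would be somewhere under /mnt/disks/ instead:

rsync -av --progress /mnt/recovery/ /mnt/disk3/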

 

Once everything is back on the original disk and verified, you can release the USB external drive and use it for something else.


Filesystem check output:

reiserfsck 3.6.27

Will read-only check consistency of the filesystem on /dev/md3
Will put log info to 'stdout'
###########
reiserfsck --check started at Thu Nov 19 08:56:55 2020
###########
Replaying journal: 
Replaying journal: Done.
Reiserfs journal '/dev/md3' in blocks [18..8211]: 0 transactions replayed
Checking internal tree..  finished
Comparing bitmaps..Bad nodes were found, Semantic pass skipped
2 found corruptions can be fixed only when running with --rebuild-tree
###########
reiserfsck finished at Thu Nov 19 09:38:07 2020
###########
bad_path: The right delimiting key [4546696 4546697 0x19544001 IND (1)] of the node (423911883) must be greater than the last (2) element's key [4547201 4547202 0x1 IND (1)] within the node.
block 423912896: The level of the node (48314) is not correct, (1) expected
 the problem in the internal node occured (423912896), whole subtree is skipped
bad_stat_data: The objectid (4547202) is shared by at least two files. Can be fixed with --rebuild-tree only.
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (0) to the block (38724350), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (1) to the block (38724351), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (2) to the block (38724352), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (3) to the block (38724353), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (4) to the block (38724354), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (5) to the block (38724355), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (6) to the block (38724356), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (7) to the block (38724357), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (8) to the block (38724358), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (9) to the block (38724359), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (10) to the block (38724360), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (11) to the block (38724361), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (12) to the block (38724362), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (13) to the block (38724363), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (14) to the block (38724364), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (15) to the block (38724365), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (16) to the block (38724366), which is in tree already
vpf-10640: The on-disk and the correct bitmaps differs.

 


Thanks. It's running and correcting stuff now; it will take a couple of hours.

 

So I guess this is a symptom of the disk going bad? That has never happened to me before. My cabinet is unfortunately full, so I can't easily just extend with another drive; I'd prefer to size up my parity drive first and then replace the existing drives with larger ones.

3 minutes ago, project6 said:

So I guess this is a symptom of the disk going bad? 

Not necessarily. It could just be that you had some sort of temporary glitch that caused file system corruption.

 

I would suggest the following:

  • When the --rebuild-tree run has finished, run an extended SMART test on the drive (by clicking on the drive on the Main tab) to see if that shows up any issues.
  • Post your system's diagnostics zip file (obtained via Tools -> Diagnostics) so we can see the state of the system. If that is done after the SMART test has run, then we will also be able to see the results of that test.
4 minutes ago, itimpi said:

Not necessarily. It could just be that you had some sort of temporary glitch that caused file system corruption.

 

I would suggest the following:

  • When the --rebuild-tree run has finished, run an extended SMART test on the drive (by clicking on the drive on the Main tab) to see if that shows up any issues.
  • Post your system's diagnostics zip file (obtained via Tools -> Diagnostics) so we can see the state of the system. If that is done after the SMART test has run, then we will also be able to see the results of that test.

Ok, will do.

 

Should I keep the array in maintenance mode for the SMART run or can I enable the array after the rebuild-tree? 

1 minute ago, project6 said:

Ok, will do.

 

Should I keep the array in maintenance mode for the SMART run or can I enable the array after the rebuild-tree? 

Optional, but I would start the array in normal mode, as that will at least tell you whether the disk now mounts OK and give you a chance to see how successful the reiserfsck was in restoring all the data. You should not try to access the disk while the SMART test is running, as that can abort the test.


Here is the output from --rebuild-tree; I'm not sure how I should interpret it... it looks like it didn't finish?

Replaying journal: Done.
Reiserfs journal '/dev/md3' in blocks [18..8211]: 0 transactions replayed
###########
reiserfsck --rebuild-tree started at Thu Nov 19 10:43:56 2020
###########

Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 488378638 blocks marked used
Skipping 23115 blocks (super block, journal, bitmaps) 488355523 blocks will be read
0%block 11428318: The number of items (3) is incorrect, should be (1) - corrected
block 11428318: The free space (480) is incorrect, should be (1040) - corrected
pass0: vpf-10210: block 11428318, item 0: The item with wrong offset or length found [2145255427 4292362239 0x66460007 DRCT (2)], len 3008 - deleted
block 12095023: The number of items (1027) is incorrect, should be (1) - corrected
block 12095023: The free space (0) is incorrect, should be (4048) - corrected
pass0: vpf-10150: block 12095023: item 0: Wrong key [0 0 0x4 SD (0)], deleted
....20%....40%....60%....80%....100%                       left 0, 40403 /sec
2558925 directory entries were hashed with "r5" hash.
        "r5" hash is selected
Flushing..finished
        Read blocks (but not data blocks) 488355523
                Leaves among those 601777
                        - leaves all contents of which could not be saved and deleted 3
                Objectids found 2559678

Pass 1 (will try to insert 601774 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....Not enough allocable blocks, checking bitmap...there are 1 allocable blocks, btw

out of disk space
Aborted

This is what it looks like after trying to start the array now:

[screenshot of the Main tab after trying to start the array]


And even more on Minimum Free: you must set Minimum Free for each user share to a value larger than the largest file you expect to write to that share.

 

Unraid has no way of knowing how large a file will become when it chooses a disk for it. If a disk has more than Minimum Free available, it can be chosen; if a disk has less, another will be chosen unless the Split level setting prevents it. For example, if the largest file you ever write is 2 GB, a Minimum Free of 3 GB or more makes Unraid move on to another disk before a write can fill the current one.


Thanks for the info.

I have not made any changes to my shares regarding this; it's pretty much all been the same since the day I built the machine, many, many years ago. Allocation is set to High-water, and Minimum Free, which I can't remember ever touching, seems to be 0KB on all my shares.

 

Starting with a single drive probably ten years ago, I'm now up to four drives and have never had this issue before; allocation has always been nicely spread out. Maybe I've just been lucky until now.

 

So just to clarify that I understand this correctly: Unraid chose to keep writing to disk3, which was close to max capacity, even though disk4 was available with only about 200 GB of 2 TB used. And to stop that I would have to set a Minimum Free value for all shares?

 

Is there any way the data on the disk can be saved at all? Can I replace the slot with a new disk now, or is it lost?

 

The largest files on the drives are around 1-2 GB.

10 minutes ago, project6 said:

Is there any way the data on the disk can be saved at all? Can I replace the slot with a new disk now, or is it lost?

Though I've never tried it, I would guess the best bet would be to clone that disk to a larger one with dd, then manually resize the partition, then run --rebuild-tree again; that should work.


Thanks for the info. I'm gonna try tomorrow, starting with the dd command since that seems fairly straightforward (making sure the correct target device is selected :)). Do I need to format the external HD before running dd, and if so, should I format it as reiserfs? I don't want to add it to the Unraid array, just use it as a temporary clone target.

 

I'm probably gonna go with a 4 TB drive.

