(SOLVED) Full disk now showing as -15 TB / nan


project6


I've been running Unraid for years without a single problem, currently on version 6.8.0. A few days ago I started receiving email alerts that disk3 was full (100%), but I wasn't able to look into it since I didn't have access to the server for a few days. I figured it would just assign new data to the other disks (there is plenty of space left on at least one drive).

 

When I managed to get access to the UI I logged in and found this:

[screenshot: Unraid web UI showing disk3 usage as -15 TB / nan]

 

So I logged in to the server, and `df -h` showed this:

Filesystem      Size  Used Avail Use% Mounted on
rootfs          3.8G  985M  2.9G  26% /
tmpfs            32M  480K   32M   2% /run
devtmpfs        3.8G     0  3.8G   0% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
cgroup_root     8.0M     0  8.0M   0% /sys/fs/cgroup
tmpfs           128M  224K  128M   1% /var/log
/dev/sda        7.3G  611M  6.7G   9% /boot
/dev/loop0      9.0M  9.0M     0 100% /lib/modules
/dev/loop1      5.9M  5.9M     0 100% /lib/firmware
/dev/md1        1.9T  1.7T  138G  93% /mnt/disk1
/dev/md2        1.9T  1.7T  165G  92% /mnt/disk2
/dev/md3        1.9T  -15T   16T    - /mnt/disk3
/dev/md4        1.9T  214G  1.7T  12% /mnt/disk4
/dev/sdb1        56G   50G  6.1G  90% /mnt/cache
shfs            7.3T  -11T   18T    - /mnt/user0
shfs            7.4T  -11T   18T    - /mnt/user
/dev/loop2       60G   26G   31G  46% /var/lib/docker
shm              64M     0   64M   0% /var/lib/docker/containers/.../mounts/shm
shm              64M     0   64M   0% /var/lib/docker/containers/.../mounts/shm

 

As you can see, disk3 is reporting -15 TB used and 16 TB available. The mover doesn't like this: it had apparently been invoked while the disk was reporting these numbers and got stuck, and I couldn't do anything but force a shutdown/reboot.

 

Is there anything I can do to recover disk3 and get it back to reporting its original used/free space?

 

Solution

I got some excellent help from @JorgeB and managed to solve this without data loss, so I figured I'd describe the process here instead of making you read through all the posts. The short answer is to get a disk that's larger than the one that filled up (in my case a 2 TB drive filled up, so I used a 4 TB USB external drive), clone the full disk to it, repair the filesystem on the larger drive, and then copy the data back.

 

Note: Many of these commands destroy data on the target device, so always double-check that you are using the correct devices and partitions throughout.
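A quick way to do that double-checking is lsblk, which should be available on stock Unraid; the size and model columns make it easy to confirm which device node belongs to which physical drive:

lsblk -o NAME,SIZE,MODEL,MOUNTPOINT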

 

So the first step is to check the filesystem on the filled-up array device; in my case this was /dev/md3. (Run the check against the /dev/mdX device with the array started in maintenance mode, so that parity stays in sync with any corrections.)

reiserfsck --check /dev/md3

It reported errors and needed to be run with --rebuild-tree, so I tried that:

reiserfsck --rebuild-tree /dev/md3

This took a while and corrected some blocks and sizes, but during the allocation phase the process aborted since it did not have enough free space on the drive:

Replaying journal: Done.
Reiserfs journal '/dev/md3' in blocks [18..8211]: 0 transactions replayed
###########
reiserfsck --rebuild-tree started at Thu Nov 19 10:43:56 2020
###########

Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 488378638 blocks marked used
Skipping 23115 blocks (super block, journal, bitmaps) 488355523 blocks will be read
0%block 11428318: The number of items (3) is incorrect, should be (1) - corrected
block 11428318: The free space (480) is incorrect, should be (1040) - corrected
pass0: vpf-10210: block 11428318, item 0: The item with wrong offset or length found [2145255427 4292362239 0x66460007 DRCT (2)], len 3008 - deleted
block 12095023: The number of items (1027) is incorrect, should be (1) - corrected
block 12095023: The free space (0) is incorrect, should be (4048) - corrected
pass0: vpf-10150: block 12095023: item 0: Wrong key [0 0 0x4 SD (0)], deleted
....20%....40%....60%....80%....100%                       left 0, 40403 /sec
2558925 directory entries were hashed with "r5" hash.
        "r5" hash is selected
Flushing..finished
        Read blocks (but not data blocks) 488355523
                Leaves among those 601777
                        - leaves all contents of which could not be saved and deleted 3
                Objectids found 2559678

Pass 1 (will try to insert 601774 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....Not enough allocable blocks, checking bitmap...there are 1 allocable blocks, btw

out of disk space
Aborted

So I plugged in the external drive and used dmesg to make sure I knew which device node it had been assigned (/dev/sdh in my case). My external drive came with some pre-initialized Windows partitions, so I erased them using fdisk and cleared the drive to zeroes using dd:

dd if=/dev/zero of=/dev/sdh status=progress
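Zeroing the entire 4 TB drive over USB takes many hours. If the goal is only to get rid of the old partition table and filesystem signatures, wipefs (also part of util-linux) should accomplish that in seconds; I didn't use it here, so treat it as a suggestion:

wipefs -a /dev/sdh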

Then I cloned the full disk to the external drive. Note: make sure to read from the actual disk device (/dev/sdX) and not the Unraid array device (/dev/mdX). In my case /dev/md3 corresponded to /dev/sdg:

dd if=/dev/sdg of=/dev/sdh status=progress
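dd's default block size is only 512 bytes, which makes a 2 TB clone over USB very slow; a larger block size usually speeds it up considerably:

dd if=/dev/sdg of=/dev/sdh bs=1M status=progress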

Once finished, you have a byte-for-byte copy of the disk on the external drive, which now carries a 2 TB partition. We need to extend that partition so there is enough space to rebuild the filesystem. I removed the existing partition (again with fdisk) and created a new one with the same start sector but utilizing the entire disk instead of 2 TB. The start sector needs to be 64, which I could not set with fdisk, but sgdisk handles it. This command creates a new partition with the correct start sector, using all available space on the disk:

sgdisk -o -a 8 -n 1:32K:0 /dev/sdh
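You can verify the result before continuing; the print option should show a single partition starting at sector 64 (the 32K offset divided by the 512-byte sector size) and spanning the whole disk:

sgdisk -p /dev/sdh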

I then re-created the superblock as explained here, and then ran reiserfsck to rebuild the tree again, this time on the new partition:

reiserfsck --rebuild-tree /dev/sdh1
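If you can't locate the linked superblock instructions, reiserfsck has a dedicated mode for that step; run it on the new partition before --rebuild-tree and answer its prompts about the filesystem format and block size:

reiserfsck --rebuild-sb /dev/sdh1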

Since the new partition now covers the entire 4 TB drive, the scanning and allocation phases are going to take a long time, especially if you use a USB drive like I did.

 

When finished you should have a working reiserfs filesystem on the USB drive! I mounted the drive using the Unassigned Devices plugin and looked through the files to make sure everything was OK.
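If you don't have the Unassigned Devices plugin, a plain read-only mount from the console works just as well (the mount point name below is just an example):

mkdir -p /mnt/recovery
mount -t reiserfs -o ro /dev/sdh1 /mnt/recovery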

 

As a final step I formatted the old 2 TB drive using the Unraid web UI and then manually copied the data from the mounted USB disk back to the freshly formatted 2 TB drive. This is also a good opportunity to switch to another filesystem, e.g. xfs instead of reiserfs.
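One way to do that copy is rsync: archive mode preserves permissions and timestamps, and the transfer can be resumed if it is interrupted. The source path below assumes the example mount point from the earlier note; with Unassigned Devices it would be somewhere under /mnt/disks/ instead:

rsync -av --progress /mnt/recovery/ /mnt/disk3/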

 

Once everything is back on the original disk and verified, you can release the USB external drive and use it for something else.


Filesystem check output:

reiserfsck 3.6.27

Will read-only check consistency of the filesystem on /dev/md3
Will put log info to 'stdout'
###########
reiserfsck --check started at Thu Nov 19 08:56:55 2020
###########
Replaying journal: 
Replaying journal: Done.
Reiserfs journal '/dev/md3' in blocks [18..8211]: 0 transactions replayed
Checking internal tree..  finished
Comparing bitmaps..Bad nodes were found, Semantic pass skipped
2 found corruptions can be fixed only when running with --rebuild-tree
###########
reiserfsck finished at Thu Nov 19 09:38:07 2020
###########
bad_path: The right delimiting key [4546696 4546697 0x19544001 IND (1)] of the node (423911883) must be greater than the last (2) element's key [4547201 4547202 0x1 IND (1)] within the node.
block 423912896: The level of the node (48314) is not correct, (1) expected
 the problem in the internal node occured (423912896), whole subtree is skipped
bad_stat_data: The objectid (4547202) is shared by at least two files. Can be fixed with --rebuild-tree only.
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (0) to the block (38724350), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (1) to the block (38724351), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (2) to the block (38724352), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (3) to the block (38724353), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (4) to the block (38724354), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (5) to the block (38724355), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (6) to the block (38724356), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (7) to the block (38724357), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (8) to the block (38724358), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (9) to the block (38724359), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (10) to the block (38724360), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (11) to the block (38724361), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (12) to the block (38724362), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (13) to the block (38724363), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (14) to the block (38724364), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (15) to the block (38724365), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (16) to the block (38724366), which is in tree already
vpf-10640: The on-disk and the correct bitmaps differs.

 


Thanks. It's running and correcting stuff now; it will take a couple of hours.

 

So I guess this is a symptom of the disk going bad? That has never happened to me before. My cabinet is unfortunately full, so I can't easily just extend with another drive; I'd prefer to size up my parity drive first and then replace the existing drives with larger ones.

3 minutes ago, project6 said:

So I guess this is a symptom of the disk going bad? 

Not necessarily. It could just be that you had some sort of temporary glitch that caused file system corruption.

 

I would suggest the following:

  • When the --rebuild-tree run has finished, run an extended SMART test on the drive (by clicking on the drive on the Main tab) to see if that shows up any issues.
  • Post your system's diagnostics zip file (obtained via Tools -> Diagnostics) so we can see the state of the system. If that is done after the SMART test has run, then we will also be able to see the results of that test.
4 minutes ago, itimpi said:

Not necessarily. It could just be that you had some sort of temporary glitch that caused file system corruption.

 

I would suggest the following:

  • When the --rebuild-tree run has finished, run an extended SMART test on the drive (by clicking on the drive on the Main tab) to see if that shows up any issues.
  • Post your system's diagnostics zip file (obtained via Tools -> Diagnostics) so we can see the state of the system. If that is done after the SMART test has run, then we will also be able to see the results of that test.

Ok, will do.

 

Should I keep the array in maintenance mode for the SMART run or can I enable the array after the rebuild-tree? 

1 minute ago, project6 said:

Ok, will do.

 

Should I keep the array in maintenance mode for the SMART run or can I enable the array after the rebuild-tree? 

Optional, but I would start the array in normal mode, as that will at least tell you whether the disk now mounts OK and give you a chance to see how successful the reiserfsck was in restoring all the data. You should not try to access the disk while the SMART test is running, as that can abort the test.


Here is the output from --rebuild-tree; I'm not sure how I should interpret it... it looks like it didn't finish?

Replaying journal: Done.
Reiserfs journal '/dev/md3' in blocks [18..8211]: 0 transactions replayed
###########
reiserfsck --rebuild-tree started at Thu Nov 19 10:43:56 2020
###########

Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 488378638 blocks marked used
Skipping 23115 blocks (super block, journal, bitmaps) 488355523 blocks will be read
0%block 11428318: The number of items (3) is incorrect, should be (1) - corrected
block 11428318: The free space (480) is incorrect, should be (1040) - corrected
pass0: vpf-10210: block 11428318, item 0: The item with wrong offset or length found [2145255427 4292362239 0x66460007 DRCT (2)], len 3008 - deleted
block 12095023: The number of items (1027) is incorrect, should be (1) - corrected
block 12095023: The free space (0) is incorrect, should be (4048) - corrected
pass0: vpf-10150: block 12095023: item 0: Wrong key [0 0 0x4 SD (0)], deleted
....20%....40%....60%....80%....100%                       left 0, 40403 /sec
2558925 directory entries were hashed with "r5" hash.
        "r5" hash is selected
Flushing..finished
        Read blocks (but not data blocks) 488355523
                Leaves among those 601777
                        - leaves all contents of which could not be saved and deleted 3
                Objectids found 2559678

Pass 1 (will try to insert 601774 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....Not enough allocable blocks, checking bitmap...there are 1 allocable blocks, btw

out of disk space
Aborted

This is what it looks like after trying to start the array now:

[screenshot of the Main tab after trying to start the array]


And even more on Minimum Free: you must set Minimum Free for each user share to a value larger than the largest file you expect to write to that share.

 

Unraid has no way of knowing how large a file will become when it chooses a disk for it. If a disk has more than Minimum Free available, it can be chosen; if a disk has less, another will be chosen unless the Split level setting prevents it. For example, if the largest file you ever write is 2 GB, a Minimum Free of 3 GB or more makes Unraid move on to another disk before a write can fill the current one.


Thanks for the info.

I have not made any changes to my shares regarding this; it's pretty much all been the same since the day I built the machine, many, many years ago. Allocation is set to High-water, and Minimum Free, which I can't remember ever touching, seems to be 0KB on all my shares.

 

Starting with a single drive probably ten years ago, I'm now up to four drives and have never had this issue before; allocation has always been nicely spread out. Maybe I've just been lucky until now.

 

So just to clarify that I understand this correctly: Unraid chose to keep writing to disk3, which was close to max capacity, even though disk4 was available with only about 200 GB of 2 TB used. And to stop that I would have to set a Minimum Free value for all shares?

 

Is there any way the data on the disk can be saved at all? Can I replace the slot with a new disk now, or is it lost?

 

The largest files on the drives are around 1-2 GB.

10 minutes ago, project6 said:

Is there any way the data on the disk can be saved at all? Can I replace the slot with a new disk now, or is it lost?

Though I've never tried it, I would guess the best bet would be to clone that disk to a larger one with dd, then manually resize the partition, then run --rebuild-tree again; that should work.


Thanks for the info. I'm gonna try tomorrow, starting with the dd command since that seems fairly straightforward (making sure the correct target device is selected :)). Do I need to format the external HD before running dd, and if so, should I format it as reiserfs? I don't want to add it to the Unraid array, just use it as a temporary clone target.

 

I'm probably gonna go with a 4 TB drive.

