project6 Posted November 19, 2020

I've been running Unraid for years without a single problem, currently on version 6.8.0. A few days ago I started receiving email alerts that disk3 was full (100%), but I wasn't able to look into it since I didn't have access to the server for a few days. I figured Unraid would just assign new data to the other disks (there is plenty of space left on at least one drive). When I finally got access to the UI, I found disk3 reporting `nan B` used. So I logged into the server, and `df -h` showed this:

```
Filesystem      Size  Used Avail Use% Mounted on
rootfs          3.8G  985M  2.9G  26% /
tmpfs            32M  480K   32M   2% /run
devtmpfs        3.8G     0  3.8G   0% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
cgroup_root     8.0M     0  8.0M   0% /sys/fs/cgroup
tmpfs           128M  224K  128M   1% /var/log
/dev/sda        7.3G  611M  6.7G   9% /boot
/dev/loop0      9.0M  9.0M     0 100% /lib/modules
/dev/loop1      5.9M  5.9M     0 100% /lib/firmware
/dev/md1        1.9T  1.7T  138G  93% /mnt/disk1
/dev/md2        1.9T  1.7T  165G  92% /mnt/disk2
/dev/md3        1.9T  -15T   16T    - /mnt/disk3
/dev/md4        1.9T  214G  1.7T  12% /mnt/disk4
/dev/sdb1        56G   50G  6.1G  90% /mnt/cache
shfs            7.3T  -11T   18T    - /mnt/user0
shfs            7.4T  -11T   18T    - /mnt/user
/dev/loop2       60G   26G   31G  46% /var/lib/docker
shm              64M     0   64M   0% /var/lib/docker/containers/.../mounts/shm
shm              64M     0   64M   0% /var/lib/docker/containers/.../mounts/shm
```

As you can see, disk3 is reporting -15 TB used and 16 TB available. The mover doesn't like this; it had apparently been invoked with this bogus information and was stuck, and I couldn't do anything but force a shutdown/reboot. Is there anything I can do to recover disk3 and get it back to reporting its original used/free space?

Solution

I got some excellent help from @JorgeB and managed to solve this without data loss, so I figured I'd describe the process here instead of making you read through all the posts.
The short answer: get a disk that's larger than the one that filled up (in my case a 2 TB drive filled up, so I used a 4 TB USB external drive), clone the disk, fix the filesystem on the larger drive, then copy the data back.

Note: Many of these commands destroy data on the target device, so always double check that you are using the correct devices and partitions throughout.

The first step is to check the filesystem on the filled-up array device. In my case this was /dev/md3:

```
reiserfsck --check /dev/md3
```

It reported errors and needed to be run with --rebuild-tree, so I tried that:

```
reiserfsck --rebuild-tree /dev/md3
```

This took a while and corrected some blocks and sizes, but during the allocation phase the process aborted because it did not have enough free space on the drive:

```
Replaying journal: Done.
Reiserfs journal '/dev/md3' in blocks [18..8211]: 0 transactions replayed
###########
reiserfsck --rebuild-tree started at Thu Nov 19 10:43:56 2020
###########
Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 488378638 blocks marked used
Skipping 23115 blocks (super block, journal, bitmaps) 488355523 blocks will be read
0%block 11428318: The number of items (3) is incorrect, should be (1) - corrected
block 11428318: The free space (480) is incorrect, should be (1040) - corrected
pass0: vpf-10210: block 11428318, item 0: The item with wrong offset or length found [2145255427 4292362239 0x66460007 DRCT (2)], len 3008 - deleted
block 12095023: The number of items (1027) is incorrect, should be (1) - corrected
block 12095023: The free space (0) is incorrect, should be (4048) - corrected
pass0: vpf-10150: block 12095023: item 0: Wrong key [0 0 0x4 SD (0)], deleted
....20%....40%....60%....80%....100%                 left 0, 40403 /sec
2558925 directory entries were hashed with "r5" hash.
"r5" hash is selected
Flushing..finished
        Read blocks (but not data blocks) 488355523
                Leaves among those 601777
                        - leaves all contents of which could not be saved and deleted 3
                Objectids found 2559678

Pass 1 (will try to insert 601774 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....Not enough allocable blocks, checking bitmap...there are 1 allocable blocks, btw
out of disk space
Aborted
```

So I plugged in the external drive and used dmesg to confirm which device node it was assigned (/dev/sdh in my case). My external drive came with some preinitialized Windows partitions, so I deleted them with fdisk and zeroed the drive with dd:

```
dd if=/dev/zero of=/dev/sdh status=progress
```

Then I cloned the full disk to the external drive. Note: make sure to use the actual disk (/dev/sdX) and not the raid device (/dev/mdX). In my case /dev/md3 was /dev/sdg:

```
dd if=/dev/sdg of=/dev/sdh status=progress
```

Once finished, you have a copy of the disk on the external drive. The external drive now holds a 2 TB partition, which we need to extend so there is enough space to rebuild the filesystem. I removed the existing partition (again with fdisk) and created a new one with the same start sector but spanning the entire disk instead of 2 TB. The start sector should be 64, which cannot be set with fdisk, but it can be done with sgdisk. This command creates a new partition with the correct start sector, using all available space on the disk:

```
sgdisk -o -a 8 -n 1:32K:0 /dev/sdh
```

I then re-created the superblock as explained here, and ran reiserfsck again to rebuild the tree on the new partition:

```
reiserfsck --rebuild-tree /dev/sdh1
```

Since the new partition now covers the entire 4 TB drive, the scanning and allocation phases are going to take a long time, especially on a USB drive like I used. When finished, you should have a working reiserfs filesystem on the USB drive!
I mounted the drive using the Unassigned Devices plugin and looked through the files to make sure everything was OK. As a final step I formatted the old 2 TB drive using the Unraid web UI and then manually copied the data from the mounted USB disk back to the freshly formatted 2 TB drive. This is also a good opportunity to switch to e.g. xfs instead of reiserfs. Once everything is back on the original disk, and verified, you can release the USB external drive and use it for something else.

Edited December 1, 2020 by project6
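The final copy-back is the step where you really want to verify the data before wiping the USB clone. Below is a minimal sketch of the copy-plus-verify pattern using only coreutils, demonstrated on throwaway temp directories so it is safe to run anywhere; the real-world paths in the comment (an Unassigned Devices mount point and /mnt/disk3) are examples from my setup, not something Unraid mandates:

```shell
# Copy a tree preserving attributes, then verify it recursively.
# Real-world equivalent (paths are examples, adjust to your mounts):
#   cp -a /mnt/disks/usbclone/. /mnt/disk3/
#   diff -r /mnt/disks/usbclone /mnt/disk3
srcdir=$(mktemp -d) && dstdir=$(mktemp -d)
mkdir -p "$srcdir/media" && echo "hello" > "$srcdir/media/file.txt"
cp -a "$srcdir/." "$dstdir/"    # -a preserves perms, times, and symlinks
if diff -r "$srcdir" "$dstdir" > /dev/null; then
    verdict2="copy verified"
else
    verdict2="copy MISMATCH"
fi
echo "$verdict2"
rm -rf "$srcdir" "$dstdir"
```

`diff -r` re-reads every file on both sides, so on terabytes of data it takes a while, but it is the cheapest insurance before releasing the clone drive.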
JorgeB Posted November 19, 2020

Run a filesystem check on that disk.
project6 Posted November 19, 2020

Yep, filesystem check is in progress.
project6 Posted November 19, 2020

Filesystem check output:

```
reiserfsck 3.6.27

Will read-only check consistency of the filesystem on /dev/md3
Will put log info to 'stdout'
###########
reiserfsck --check started at Thu Nov 19 08:56:55 2020
###########
Replaying journal: Replaying journal: Done.
Reiserfs journal '/dev/md3' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..Bad nodes were found, Semantic pass skipped
2 found corruptions can be fixed only when running with --rebuild-tree
###########
reiserfsck finished at Thu Nov 19 09:38:07 2020
###########
bad_path: The right delimiting key [4546696 4546697 0x19544001 IND (1)] of the node (423911883) must be greater than the last (2) element's key [4547201 4547202 0x1 IND (1)] within the node.
block 423912896: The level of the node (48314) is not correct, (1) expected
the problem in the internal node occured (423912896), whole subtree is skipped
bad_stat_data: The objectid (4547202) is shared by at least two files. Can be fixed with --rebuild-tree only.
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (0) to the block (38724350), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (1) to the block (38724351), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (2) to the block (38724352), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (3) to the block (38724353), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (4) to the block (38724354), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (5) to the block (38724355), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (6) to the block (38724356), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (7) to the block (38724357), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (8) to the block (38724358), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (9) to the block (38724359), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (10) to the block (38724360), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (11) to the block (38724361), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (12) to the block (38724362), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (13) to the block (38724363), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (14) to the block (38724364), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (15) to the block (38724365), which is in tree already
bad_indirect_item: block 380601268: The item (4547201 4547202 0x1 IND (1), len 68, location 820 entry count 0, fsck need 0, format new) has the bad pointer (16) to the block (38724366), which is in tree already
vpf-10640: The on-disk and the correct bitmaps differs.
```
project6 Posted November 19, 2020

Should I try running

```
reiserfsck --rebuild-tree /dev/md3
```

as suggested? I get a massive ASCII warning prompt when attempting it, and I'm not sure if I should continue.

Edited November 19, 2020 by project6
JorgeB Posted November 19, 2020

You need to run it again with --rebuild-tree
project6 Posted November 19, 2020

Thanks. It's running and correcting stuff now; it will take a couple of hours. So I guess this is a symptom of the disk going bad? That has never happened to me so far. My cabinet is unfortunately full, so I can't easily just extend with another drive; I'd prefer to size up my parity drive first and then replace existing drives with larger ones.
itimpi Posted November 19, 2020

3 minutes ago, project6 said:
So I guess this is a symptom of the disk going bad?

Not necessarily. It could just be that some sort of temporary glitch caused file system corruption. I would suggest that:

- When the --rebuild-tree run is finished, run an extended SMART test on the drive (by clicking on the drive on the Main tab) to see if that shows up any issues.
- Post your system's diagnostics zip file (obtained via Tools -> Diagnostics) so we can see the state of the system. If that is done after the SMART test has run, then we will be able to see the results of that test.
project6 Posted November 19, 2020

4 minutes ago, itimpi said:
Not necessarily. It could just be that some sort of temporary glitch caused file system corruption. [...]

Ok, will do. Should I keep the array in maintenance mode for the SMART run, or can I enable the array after the rebuild-tree?
itimpi Posted November 19, 2020

1 minute ago, project6 said:
Should I keep the array in maintenance mode for the SMART run or can I enable the array after the rebuild-tree?

Optional, but I would start the array in normal mode, as that would at least tell you whether the disk now mounts OK, and it gives you a chance to see how successful the reiserfsck was in restoring all data. You should not try to access the disk while the SMART test is running, as that can abort the test.
project6 Posted November 19, 2020

Here is the output from --rebuild-tree. Not sure how I should interpret it... looks like it didn't finish?

```
Replaying journal: Done.
Reiserfs journal '/dev/md3' in blocks [18..8211]: 0 transactions replayed
###########
reiserfsck --rebuild-tree started at Thu Nov 19 10:43:56 2020
###########
Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 488378638 blocks marked used
Skipping 23115 blocks (super block, journal, bitmaps) 488355523 blocks will be read
0%block 11428318: The number of items (3) is incorrect, should be (1) - corrected
block 11428318: The free space (480) is incorrect, should be (1040) - corrected
pass0: vpf-10210: block 11428318, item 0: The item with wrong offset or length found [2145255427 4292362239 0x66460007 DRCT (2)], len 3008 - deleted
block 12095023: The number of items (1027) is incorrect, should be (1) - corrected
block 12095023: The free space (0) is incorrect, should be (4048) - corrected
pass0: vpf-10150: block 12095023: item 0: Wrong key [0 0 0x4 SD (0)], deleted
....20%....40%....60%....80%....100%                 left 0, 40403 /sec
2558925 directory entries were hashed with "r5" hash.
"r5" hash is selected
Flushing..finished
        Read blocks (but not data blocks) 488355523
                Leaves among those 601777
                        - leaves all contents of which could not be saved and deleted 3
                Objectids found 2559678

Pass 1 (will try to insert 601774 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....Not enough allocable blocks, checking bitmap...there are 1 allocable blocks, btw
out of disk space
Aborted
```

This is what it looks like after trying to start the array now: [screenshot of the Main tab]

Edited November 19, 2020 by project6
JorgeB Posted November 19, 2020

3 minutes ago, project6 said:
out of disk space

It didn't finish. Was the disk full or near capacity?
project6 Posted November 19, 2020

Yes, I got alerts a couple of days ago that disk3 was low on space (100% used), so it is probably full. When I got home a few days later I looked at the UI and found the `nan B` used and the mover being stuck/hung.
JorgeB Posted November 19, 2020

You should always leave a few GB free; otherwise, if there's a filesystem issue, it might not be fixable, as appears to be the case here.
project6 Posted November 19, 2020

I have other disks with plenty of space; previously, when a disk started filling up, Unraid always put new data on the other disks. So I'm not sure why it kept putting data on disk3. I assume it's because it found 17.6 TB free.
JorgeB Posted November 19, 2020

Depends on allocation mode and split level settings.
trurl Posted November 19, 2020

And even more on Minimum Free. You must set Minimum Free for each user share to larger than the largest file you expect to write to that share. Unraid has no way to know how large a file will become when it chooses a disk for it. If a disk has more than Minimum free, it can be chosen. If a disk has less than Minimum, another will be chosen, unless Split level prevents it.
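trurl's rule can be illustrated with a few lines of shell: compare a disk's available space (what `df` reports) against a Minimum Free threshold, and only pick the disk if it clears the bar. This is just an illustration of the logic, not how Unraid's shfs actually implements allocation; the tiny 1 MiB threshold and the /tmp path are deliberately safe example values (a real share would want a threshold of several GiB, above your largest expected file):

```shell
# Illustration of the Minimum Free rule: a disk is only eligible for new
# files if its available space exceeds the threshold.
min_free_kib=1024   # 1 MiB in KiB, a toy example; real shares want GiBs
avail_kib=$(df --output=avail /tmp | tail -n 1 | tr -d ' ')
if [ "$avail_kib" -gt "$min_free_kib" ]; then
    decision="eligible"
else
    decision="skip"
fi
echo "disk is $decision"
```

With Minimum Free left at 0 (as in this thread), the threshold check never trips, so a nearly full disk remains "eligible" right up until writes start failing.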
project6 Posted November 19, 2020

Thanks for the info. I haven't made any changes to my shares regarding this; it's all been pretty much the same since the day I built the machine, and that is many, many years ago. Allocation is set to High water, and Minimum Free, which I can't remember ever touching, seems to be 0KB on all my shares. Starting with a single drive probably ten years ago, I'm now up to four and have not had this issue before; allocation has always been nicely spread out. Maybe I've just been lucky until now. So just to clarify that I understand this correctly: Unraid chose to keep writing to disk3, which was close to max capacity, even though disk4 was available with only around 200 GB used of 2 TB? And to stop that I would have to set a Minimum Free for all shares? Is there any way the data on the disk can be saved at all? Can I replace the slot with a new disk now, or is it lost? The largest files on the drives are around 1-2 GB.
JorgeB Posted November 19, 2020

10 minutes ago, project6 said:
Is there any way the data on the disk can be saved at all, can I replace the slot with a new disk now or is it lost?

Though I never tried it, I would guess the best bet would be to clone that disk to a larger one with dd, then manually resize the partition, then run --rebuild-tree again; that should work.
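One sanity check worth adding before the clone: confirm the target really is at least as large as the source, or dd will leave you with a truncated, useless copy. The sketch below uses small temp files with `stat` so it is safe to run anywhere; on real block devices you would query sizes with `blockdev --getsize64 /dev/sdX` instead (my suggestion, not a command from this thread):

```shell
# Compare source and target sizes before cloning. Demonstrated on sparse
# files; for real block devices substitute:
#   blockdev --getsize64 /dev/sdg    # source
#   blockdev --getsize64 /dev/sdh    # target
small=$(mktemp) && big=$(mktemp)
truncate -s 2M "$small"   # stand-in for the 2 TB source disk
truncate -s 4M "$big"     # stand-in for the 4 TB target disk
src_bytes=$(stat -c %s "$small")
dst_bytes=$(stat -c %s "$big")
if [ "$dst_bytes" -ge "$src_bytes" ]; then
    echo "target is large enough"
else
    echo "ABORT: target smaller than source"
fi
rm -f "$small" "$big"
```

The same comparison also tells you how much slack the rebuilt filesystem will have, which matters here since --rebuild-tree aborted precisely because it ran out of allocable blocks.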
project6 Posted November 19, 2020

Sounds like a plan. Can I use an external USB HD as a clone target? I don't think my cabinet has any more free SATA connectors, and 2 TB is the maximum array drive size given my parity drive.
JorgeB Posted November 19, 2020

Thinking more on this, I'm not sure resizing the partition will be enough; you might also need to resize the filesystem. You should be able to do that by recreating the superblock after the partition is resized, but try just the partition first and post back.
project6 Posted November 19, 2020

Thanks for the info. I'm going to try tomorrow. I'll do the dd command first, since that seems fairly straightforward (making sure the correct target device is selected). Do I need to format the external HD before running dd, and if so, should I format it as reiserfs? I don't think I want to add it to the Unraid array; I just want it as a temporary clone target. I'm probably going to go with a 4 TB drive.
JorgeB Posted November 19, 2020

2 minutes ago, project6 said:
Do I need to format the external HD before running dd

No, just need to run dd.
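Since dd will happily run for hours and then be your only copy of the data, it's worth verifying the clone before doing anything destructive to the source. The sketch below demonstrates the clone-and-verify pattern on throwaway temp files rather than real disks; the device names in the comments (/dev/sdg and /dev/sdh) are the ones from this thread, and `cmp -n` with `blockdev --getsize64` is my suggested way to limit the comparison to the source's size, since the target is larger:

```shell
# Demonstration of dd clone-and-verify on throwaway files, not real disks.
# On the actual hardware the equivalent would be:
#   dd if=/dev/sdg of=/dev/sdh status=progress
#   cmp -n "$(blockdev --getsize64 /dev/sdg)" /dev/sdg /dev/sdh
src=$(mktemp) && dst=$(mktemp)
dd if=/dev/urandom of="$src" bs=1M count=4 status=none   # fake "source disk"
dd if="$src" of="$dst" bs=1M status=none                 # the clone
if cmp -s "$src" "$dst"; then
    verdict="clone verified"
else
    verdict="clone MISMATCH"
fi
echo "$verdict"
rm -f "$src" "$dst"
```

The verify pass re-reads both drives end to end, so it takes roughly as long as the clone itself; whether that is worth it depends on how much you trust the USB bridge.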
trurl Posted November 19, 2020

1 hour ago, project6 said:
Unraid chose to keep writing to disk3, which was close to max capacity, even though disk4 was available

Include/exclude settings and split level would also have to allow it to choose disk4.