lankanmon Posted January 25, 2019 Share Posted January 25, 2019 Hi all, I have recently had a power outage due to winter weather and my server shutdown after running on my UPS for a while. I think it got the shutdown command and shut itself down -- the UPS was not depleted and continued to power wifi long after server shutdown. Ne next day, when the power was back and stabilized, I turned it back on and it seemed to work fine, but I noticed that my docker images were not behaving normally. Plex videos would stop working after playing for a while. I checked to see if a parity check was running, but it wasn't. Then I decided to restart the server properly. I turned off my docker service. Took the array offline and then restarted the server with the web button. After restarting this time the web GUI was non-responsive and would not even start docker. It said the docker service has failed. I then tried to restart docker and this time the web GUI became non-responsive. I waited for a few hours and then force restarted the server - the unclean way :( After starting again, it detected the unclean shutdown and started a parity check after starting the array. I then let it run for 24hrs (usually it is much less) and when it checked, the web GUI was unresponsive again and the parity check was at 1% (45GB). I restarted again because, at this time, I need to get it working and was really tired of waiting 3 days. Now on boot... it is stuck on Array Starting - Starting services... I have now been 45min... I don't know what to do and feel helpless... Please advice on what I can do to recover the server. I am running on a headless server, so I can't use the non-web interface. Thank you! Quote Link to comment
Frank1940 Posted January 25, 2019 Share Posted January 25, 2019 Pull the flash drive and run chkdsk on it. If that doesn't allow the server to at least boot up, you will have to connect a monitor and keyboard to it so you can see what is on the monitor screen. Take a good, focused, non-blurry photo of the monitor screen (with a flash spot) and post up the picture. If you get to a login prompt on the monitor, login and type diagnostics and that will read the Diagnostics file to the logs directory/folder on the flash drive. Upload that entire file. (I see you are using a Supermicro Server board. Does that management interface allow you to see the boot process????) Quote Link to comment
lankanmon Posted January 26, 2019 Author Share Posted January 26, 2019 11 hours ago, Frank1940 said: Pull the flash drive and run chkdsk on it. If that doesn't allow the server to at least boot up, you will have to connect a monitor and keyboard to it so you can see what is on the monitor screen. Take a good, focused, non-blurry photo of the monitor screen (with a flash spot) and post up the picture. If you get to a login prompt on the monitor, login and type diagnostics and that will read the Diagnostics file to the logs directory/folder on the flash drive. Upload that entire file. (I see you are using a Supermicro Server board. Does that management interface allow you to see the boot process????) Thank you! I have attached the diagnostics file. I must clarify that the server does turn on and I am able to access the web GUI. It's just that when I click Start array, it just shows Array Starting - Starting services... (but does not actually start). Please take a look at the diagnostics file and see if it can tell you something. Thanks! lknserver-diagnostics-20190125-2045.zip 1 Quote Link to comment
Frank1940 Posted January 26, 2019 Share Posted January 26, 2019 I am not a Guru when it comes to reading log files BUT first you have these entries: Jan 24 20:29:00 LKNServer root: Fix Common Problems Version 2019.01.06 Jan 24 20:29:05 LKNServer root: Fix Common Problems: Error: Unable to write to cache Jan 24 20:29:06 LKNServer root: Fix Common Problems: Error: Unable to write to disk2 Jan 24 20:29:06 LKNServer root: Fix Common Problems: Warning: Cache Disk free space is less than the cache floor setting and also these: Jan 24 20:19:17 LKNServer emhttpd: Starting services... Jan 24 20:19:17 LKNServer emhttpd: shcmd (571): /usr/local/sbin/mount_image '/mnt/user/system/docker/docker.img' /var/lib/docker 30 Jan 24 20:19:18 LKNServer kernel: BTRFS: device fsid 0ea30b6e-5c52-4c5f-87e4-43a058abc927 devid 1 transid 232035 /dev/loop2 Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): disk space caching is enabled Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): has skinny extents Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): bdev /dev/loop2 errs: wr 68, rd 0, flush 0, corrupt 0, gen 0 Jan 24 20:19:18 LKNServer kernel: BTRFS warning (device loop2): loop2 checksum verify failed on 749125632 wanted A76D3245 found 86E0B539 level 2 Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): read error corrected: ino 0 off 749125632 (dev /dev/loop2 sector 1479520) Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 757518336, length 4096. Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 1479528 Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 69, rd 0, flush 0, corrupt 0, gen 0 Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): the free space cache file (2177892352) is invalid, skip it Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): the free space cache file (3251634176) is invalid, skip it Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 5627559936, length 4096. Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 10991320 Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 70, rd 0, flush 0, corrupt 0, gen 0 Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 5627559936, length 4096. Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 10991320 Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 71, rd 0, flush 0, corrupt 0, gen 0 Jan 24 20:19:18 LKNServer kernel: BTRFS warning (device loop2): loop2 checksum verify failed on 796704768 wanted 31E8383 found 20E53511 level 0 Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): read error corrected: ino 0 off 796704768 (dev /dev/loop2 sector 1572448) Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 805097472, length 4096. Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 1572456 Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 72, rd 0, flush 0, corrupt 0, gen 0 Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 5741162496, length 4096. Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 11213200 Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 73, rd 0, flush 0, corrupt 0, gen 0 Jan 24 20:19:18 LKNServer kernel: BTRFS warning (device loop2): loop2 checksum verify failed on 761593856 wanted 74E7800B found F63C6A54 level 0 Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): read error corrected: ino 0 off 761593856 (dev /dev/loop2 sector 1503872) Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 769986560, length 4096. Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 1503880 Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 74, rd 0, flush 0, corrupt 0, gen 0 Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 5627559936, length 4096. Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 10991320 Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 75, rd 0, flush 0, corrupt 0, gen 0 Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 5741162496, length 4096. Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 11213200 Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 76, rd 0, flush 0, corrupt 0, gen 0 Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 5627559936, length 4096. Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 10991320 Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 77, rd 0, flush 0, corrupt 0, gen 0 Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 695439360, length 4096. Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 1358272 Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 78, rd 0, flush 0, corrupt 0, gen 0 Jan 24 20:19:18 LKNServer kernel: BTRFS: error (device loop2) in btrfs_commit_transaction:2261: errno=-5 IO failure (Error while writing out transaction) Jan 24 20:19:18 LKNServer kernel: BTRFS warning (device loop2): Skipping commit of aborted transaction. Jan 24 20:19:18 LKNServer kernel: BTRFS: error (device loop2) in cleanup_transaction:1881: errno=-5 IO failure Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): delayed_refs has NO entry Jan 24 20:19:18 LKNServer kernel: BTRFS: error (device loop2) in btrfs_replay_log:2476: errno=-5 IO failure (Failed to recover log tree) Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): cleaner transaction attach returned -30 Jan 24 20:19:18 LKNServer kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000024 Jan 24 20:19:18 LKNServer kernel: IP: btrfs_search_slot+0xf9/0x852 I am hoping that someone else will jump in to provide some insight. You might want check to see what disks you have formatted as BTRFS. You can also turn off the autostart of any Dockers or VM's to see if you can get things running. IF that does not work, try the "Safe" mode in the boot menu. I would think that you should be seeing that one or two of your disks are disabled. Quote Link to comment
Squid Posted January 26, 2019 Share Posted January 26, 2019 8 minutes ago, Frank1940 said: Jan 24 20:29:06 LKNServer root: Fix Common Problems: Error: Unable to write to disk2 False positive and should be ignored (Assuming that the drive should be 100% full) 10 minutes ago, Frank1940 said: Jan 24 20:29:05 LKNServer root: Fix Common Problems: Error: Unable to write to cache Not a BTRFS guy, but it appears the drive is fully allocated. Pretty sure there's something in the unRaid FAQ about that Device size: 119.24GiB Device allocated: 119.24GiB Device unallocated: 1.05MiB Quote Link to comment
lankanmon Posted January 26, 2019 Author Share Posted January 26, 2019 1 hour ago, Squid said: False positive and should be ignored (Assuming that the drive should be 100% full) Not a BTRFS guy, but it appears the drive is fully allocated. Pretty sure there's something in the unRaid FAQ about that Device size: 119.24GiB Device allocated: 119.24GiB Device unallocated: 1.05MiB Disk 2 is full and I have been looking for ways to distribute the content to other disks, but have not found a way besides doing it manually... I what does it mean that the drive if fully allocated and what can I do to fix this prior to starting array? I am unable to start the array as it stalls. Thanks for the feedback! Quote Link to comment
Squid Posted January 26, 2019 Share Posted January 26, 2019 https://forums.unraid.net/topic/62230-out-of-space-errors-on-cache-drive/?tab=comments#comment-610551 11 minutes ago, lankanmon said: I am unable to start the array as it stalls. Disable the VM's and Docker in Settings Quote Link to comment
lankanmon Posted January 26, 2019 Author Share Posted January 26, 2019 (edited) 8 hours ago, Squid said: https://forums.unraid.net/topic/62230-out-of-space-errors-on-cache-drive/?tab=comments#comment-610551 Disable the VM's and Docker in Settings Okay, so I disabled the docker (the vms were already disabled) and started the server. It finally mounted again. I tried to runt he balancing command, but did did not work... root@LKNServer:~# btrfs balance start -dusage=75 /mnt/cache ERROR: cannot access '/mnt/cache': No such file or directory and the web interface shows: and the Balance button does not seem to work either... I am also going to do what @limetech said lower ( ) by moving everything out of the drive, formatting and putting things back on. I have started the mover to get things off the drive naturally to see if that can do all the work for me. I have temporarily marked the usually cached shares as Yes instead of Prefer so I can move them out of the cache. Edited January 27, 2019 by lankanmon Quote Link to comment
lankanmon Posted January 27, 2019 Author Share Posted January 27, 2019 (edited) Also, when I start the docker service, I get: Jan 26 19:31:40 LKNServer ool www[3156]: /usr/local/emhttp/plugins/dynamix/scripts/emhttpd_update Jan 26 19:31:40 LKNServer emhttpd: req (15): cmdStatus=apply&csrf_token=**************** Jan 26 19:31:40 LKNServer emhttpd: Starting services... Jan 26 19:31:40 LKNServer emhttpd: shcmd (5104): /usr/local/sbin/mount_image '/mnt/user/system/docker/docker.img' /var/lib/docker 30 Jan 26 19:31:40 LKNServer root: truncate: cannot open '/mnt/cache/system/docker/docker.img' for writing: Read-only file system Jan 26 19:31:40 LKNServer kernel: BTRFS info (device loop3): disk space caching is enabled Jan 26 19:31:40 LKNServer kernel: BTRFS info (device loop3): has skinny extents Jan 26 19:31:40 LKNServer kernel: BTRFS info (device loop3): bdev /dev/loop3 errs: wr 68, rd 0, flush 0, corrupt 0, gen 0 Jan 26 19:31:40 LKNServer kernel: BTRFS warning (device loop3): log replay required on RO media Jan 26 19:31:40 LKNServer root: mount: /var/lib/docker: can't read superblock on /dev/loop3. Jan 26 19:31:40 LKNServer kernel: BTRFS error (device loop3): open_ctree failed Jan 26 19:31:40 LKNServer root: mount error Jan 26 19:31:40 LKNServer emhttpd: shcmd (5104): exit status: 1 What can I do to solve the BTRFS errors? The cache drive is not full and I reduced some files too, but still getting the read-only error. Edited January 27, 2019 by lankanmon Quote Link to comment
JorgeB Posted January 27, 2019 Share Posted January 27, 2019 8 hours ago, lankanmon said: What can I do to solve the BTRFS errors? The cache drive is not full and I reduced some files too, but still getting the read-only error. Please post the diagnostics: Tools -> Diagnostics Quote Link to comment
lankanmon Posted January 27, 2019 Author Share Posted January 27, 2019 8 hours ago, johnnie.black said: Please post the diagnostics: Tools -> Diagnostics lknserver-diagnostics-20190127-1224.zip Quote Link to comment
JorgeB Posted January 27, 2019 Share Posted January 27, 2019 8 minutes ago, lankanmon said: lknserver-diagnostics-20190127-1224.zip Cache filesystem is fully allocated, see here on how to fix: https://lime-technology.com/forums/topic/62230-out-of-space-errors-on-cache-drive/?do=findComment&comment=610551 Quote Link to comment
lankanmon Posted January 27, 2019 Author Share Posted January 27, 2019 (edited) 2 hours ago, johnnie.black said: Cache filesystem is fully allocated, see here on how to fix: https://lime-technology.com/forums/topic/62230-out-of-space-errors-on-cache-drive/?do=findComment&comment=610551 I have the command running right now. It appears to be working and am at 50% btrfs balance start -dusage=75 /mnt/cache I just wanted to know what I need to know after this is done... will the drive automatically no longer be read-only? Do I need to restart? Also, I ran it with -dusage=75 Do I need to run it again with a bigger number and is the max 100? I just want to know what I can do to prevent this from happening again. UPDATE: It just finished and I got this error: Jan 27 14:35:27 LKNServer kernel: BTRFS info (device sdf1): relocating block group 370487918592 flags data Jan 27 14:35:33 LKNServer kernel: BTRFS info (device sdf1): found 7445 extents Jan 27 14:36:58 LKNServer kernel: BTRFS info (device sdf1): found 7445 extents Jan 27 14:36:59 LKNServer kernel: BTRFS info (device sdf1): relocating block group 369414176768 flags data Jan 27 14:37:06 LKNServer kernel: BTRFS info (device sdf1): found 6155 extents Jan 27 14:38:54 LKNServer kernel: BTRFS info (device sdf1): found 6155 extents So, what should I do now? Edited January 27, 2019 by lankanmon added update Quote Link to comment
lankanmon Posted January 27, 2019 Author Share Posted January 27, 2019 Okay so I went back and looked at the other thread and saw that, if it failed, I had to run the command with a smaller param. So I went back and ran it again (each time) with -dusage=1 -dusage=10 -dusage=25 -dusage=50 -dusage=60 -dusage=70 -dusage=75 Each time, it did succeed. Then after that, I restarted the server and it seems to work now. I would still like to know if I need to use a larger number (greater than 75) and what the limit is. Also how often I must run this command and if there is a way to automate this process. BONUS: What is btrfs scrub status? and do I need to run that as well? I want to thank all of you for your help. I really appreciate it! Quote Link to comment
JorgeB Posted January 27, 2019 Share Posted January 27, 2019 (edited) 18 minutes ago, lankanmon said: I would still like to know if I need to use a larger number (greater than 75) and what the limit is. 75 should be enough, but it can be up to 100 18 minutes ago, lankanmon said: Also how often I must run this command and if there is a way to automate this process. Update to latest Unraid and you won't need to run it anymore. 18 minutes ago, lankanmon said: BONUS: What is btrfs scrub status? and do I need to run that as well? Scrub checks all checksums are correct, it's usually not needed unless you suspect a problem, but running it say once a month is a good practice. Edited January 27, 2019 by johnnie.black Quote Link to comment
lankanmon Posted January 28, 2019 Author Share Posted January 28, 2019 Okay, I will see if I can update. I also managed to get the docker image to run. I do see a lot of BTRFS related errors created by docker. I have attached the most recent diagnostics, can you please see if these errors are of concern or if there is anything I should do to reduce them? I am worried that there are lasting effects from the cache issues. Thanks! lknserver-diagnostics-20190127-2017.zip Quote Link to comment
JorgeB Posted January 28, 2019 Share Posted January 28, 2019 Docker image is corrupt, a result of the previous issues, delete and recreate. Quote Link to comment
lankanmon Posted January 30, 2019 Author Share Posted January 30, 2019 On 1/28/2019 at 3:41 AM, johnnie.black said: Docker image is corrupt, a result of the previous issues, delete and recreate. Okay, I deleted the docker image and recreated it. It now seems to be working fine! Thanks a lot for all of your help! Quote Link to comment
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.