Server stuck on starting services


Recommended Posts

Hi all, 

I have recently had a power outage due to winter weather and my server shutdown after running on my UPS for a while. I think it got the shutdown command and shut itself down -- the UPS was not depleted and continued to power wifi long after server shutdown.

 

Ne next day, when the power was back and stabilized, I turned it back on and it seemed to work fine, but I noticed that my docker images were not behaving normally. Plex videos would stop working after playing for a while. I checked to see if a parity check was running, but it wasn't. Then I decided to restart the server properly. I turned off my docker service. Took the array offline and then restarted the server with the web button. After restarting this time the web GUI was non-responsive and would not even start docker. It said the docker service has failed. I then tried to restart docker and this time the web GUI became non-responsive. I waited for a few hours and then force restarted the server - the unclean way :( 

 

After starting again, it detected the unclean shutdown and started a parity check after starting the array. I then let it run for 24hrs (usually it is much less) and when it checked, the web GUI was unresponsive again and the parity check was at 1% (45GB). 

 

I restarted again because, at this time, I need to get it working and was really tired of waiting 3 days. Now on boot... it is stuck on  Array Starting - Starting services...

 

I have now been 45min...

 

I don't know what to do and feel helpless...

 

Please advice on what I can do to recover the server. 

 

I am running on a headless server, so I can't use the non-web interface.

 

Thank you!

Link to comment

Pull the flash drive and run chkdsk on it.  

 

If that doesn't allow the server to at least boot up, you will have to connect a monitor and keyboard to it so you can see what is on the monitor screen.  Take a good, focused, non-blurry photo of the monitor screen (with a flash spot) and post up the picture.  If you get to a login prompt on the monitor, login and type   diagnostics  and that will read the Diagnostics file to the logs directory/folder on the flash drive.  Upload that entire file. 

 

(I see you are using a Supermicro Server board.  Does that management interface allow you to see the boot process????)

Link to comment
11 hours ago, Frank1940 said:

Pull the flash drive and run chkdsk on it.  

 

If that doesn't allow the server to at least boot up, you will have to connect a monitor and keyboard to it so you can see what is on the monitor screen.  Take a good, focused, non-blurry photo of the monitor screen (with a flash spot) and post up the picture.  If you get to a login prompt on the monitor, login and type   diagnostics  and that will read the Diagnostics file to the logs directory/folder on the flash drive.  Upload that entire file. 

 

(I see you are using a Supermicro Server board.  Does that management interface allow you to see the boot process????)

Thank you!

 

I have attached the diagnostics file.

 

I must clarify that the server does turn on and I am able to access the web GUI.

 

It's just that when I click Start array, it just shows  Array Starting - Starting services... (but does not actually start).

 

Please take a look at the diagnostics file and see if it can tell you something.

 

Thanks!

 

lknserver-diagnostics-20190125-2045.zip

  • Like 1
Link to comment

I am not a Guru when it comes to reading log files BUT first you have these entries:

 

Jan 24 20:29:00 LKNServer root: Fix Common Problems Version 2019.01.06
Jan 24 20:29:05 LKNServer root: Fix Common Problems: Error: Unable to write to cache
Jan 24 20:29:06 LKNServer root: Fix Common Problems: Error: Unable to write to disk2
Jan 24 20:29:06 LKNServer root: Fix Common Problems: Warning: Cache Disk free space is less than the cache floor setting

 and also these:

Jan 24 20:19:17 LKNServer emhttpd: Starting services...
Jan 24 20:19:17 LKNServer emhttpd: shcmd (571): /usr/local/sbin/mount_image '/mnt/user/system/docker/docker.img' /var/lib/docker 30
Jan 24 20:19:18 LKNServer kernel: BTRFS: device fsid 0ea30b6e-5c52-4c5f-87e4-43a058abc927 devid 1 transid 232035 /dev/loop2
Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): disk space caching is enabled
Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): has skinny extents
Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): bdev /dev/loop2 errs: wr 68, rd 0, flush 0, corrupt 0, gen 0
Jan 24 20:19:18 LKNServer kernel: BTRFS warning (device loop2): loop2 checksum verify failed on 749125632 wanted A76D3245 found 86E0B539 level 2
Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): read error corrected: ino 0 off 749125632 (dev /dev/loop2 sector 1479520)
Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 757518336, length 4096.
Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 1479528
Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 69, rd 0, flush 0, corrupt 0, gen 0
Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): the free space cache file (2177892352) is invalid, skip it
Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): the free space cache file (3251634176) is invalid, skip it
Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 5627559936, length 4096.
Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 10991320
Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 70, rd 0, flush 0, corrupt 0, gen 0
Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 5627559936, length 4096.
Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 10991320
Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 71, rd 0, flush 0, corrupt 0, gen 0
Jan 24 20:19:18 LKNServer kernel: BTRFS warning (device loop2): loop2 checksum verify failed on 796704768 wanted 31E8383 found 20E53511 level 0
Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): read error corrected: ino 0 off 796704768 (dev /dev/loop2 sector 1572448)
Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 805097472, length 4096.
Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 1572456
Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 72, rd 0, flush 0, corrupt 0, gen 0
Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 5741162496, length 4096.
Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 11213200
Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 73, rd 0, flush 0, corrupt 0, gen 0
Jan 24 20:19:18 LKNServer kernel: BTRFS warning (device loop2): loop2 checksum verify failed on 761593856 wanted 74E7800B found F63C6A54 level 0
Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): read error corrected: ino 0 off 761593856 (dev /dev/loop2 sector 1503872)
Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 769986560, length 4096.
Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 1503880
Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 74, rd 0, flush 0, corrupt 0, gen 0
Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 5627559936, length 4096.
Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 10991320
Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 75, rd 0, flush 0, corrupt 0, gen 0
Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 5741162496, length 4096.
Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 11213200
Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 76, rd 0, flush 0, corrupt 0, gen 0
Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 5627559936, length 4096.
Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 10991320
Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 77, rd 0, flush 0, corrupt 0, gen 0
Jan 24 20:19:18 LKNServer kernel: loop: Write error at byte offset 695439360, length 4096.
Jan 24 20:19:18 LKNServer kernel: print_req_error: I/O error, dev loop2, sector 1358272
Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 78, rd 0, flush 0, corrupt 0, gen 0
Jan 24 20:19:18 LKNServer kernel: BTRFS: error (device loop2) in btrfs_commit_transaction:2261: errno=-5 IO failure (Error while writing out transaction)
Jan 24 20:19:18 LKNServer kernel: BTRFS warning (device loop2): Skipping commit of aborted transaction.
Jan 24 20:19:18 LKNServer kernel: BTRFS: error (device loop2) in cleanup_transaction:1881: errno=-5 IO failure
Jan 24 20:19:18 LKNServer kernel: BTRFS info (device loop2): delayed_refs has NO entry
Jan 24 20:19:18 LKNServer kernel: BTRFS: error (device loop2) in btrfs_replay_log:2476: errno=-5 IO failure (Failed to recover log tree)
Jan 24 20:19:18 LKNServer kernel: BTRFS error (device loop2): cleaner transaction attach returned -30
Jan 24 20:19:18 LKNServer kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000024
Jan 24 20:19:18 LKNServer kernel: IP: btrfs_search_slot+0xf9/0x852

I am hoping that someone else will jump in to provide some insight.  You might want check to see what disks you have formatted as BTRFS.

 

You can also turn off the autostart of any Dockers or VM's to see if you can get things running.  IF that does not work, try the "Safe" mode in the boot menu.  I would think that you should be seeing that one or two of your disks are disabled.

Link to comment
8 minutes ago, Frank1940 said:

Jan 24 20:29:06 LKNServer root: Fix Common Problems: Error: Unable to write to disk2

False positive and should be ignored (Assuming that the drive should be 100% full)

 

10 minutes ago, Frank1940 said:

Jan 24 20:29:05 LKNServer root: Fix Common Problems: Error: Unable to write to cache

Not a BTRFS guy, but it appears the drive is fully allocated.  Pretty sure there's something in the unRaid FAQ about that

    Device size:		 119.24GiB
    Device allocated:		 119.24GiB
    Device unallocated:		   1.05MiB

 

Link to comment
1 hour ago, Squid said:

False positive and should be ignored (Assuming that the drive should be 100% full)

 

Not a BTRFS guy, but it appears the drive is fully allocated.  Pretty sure there's something in the unRaid FAQ about that


    Device size:		 119.24GiB
    Device allocated:		 119.24GiB
    Device unallocated:		   1.05MiB

 

Disk 2 is full and I have been looking for ways to distribute the content to other disks, but have not found a way besides doing it manually... 

 

I what does it mean that the drive if fully allocated and what can I do to fix this prior to starting array?

 

I am unable to start the array as it stalls.

 

Thanks for the feedback!

Link to comment
8 hours ago, Squid said:

Okay, so I disabled the docker (the vms were already disabled) and started the server. It finally mounted again.

I tried to runt he balancing command, but did did not work... 

root@LKNServer:~# btrfs balance start -dusage=75 /mnt/cache
ERROR: cannot access '/mnt/cache': No such file or directory

and the web interface shows:

chrome_2019-01-26_19-20-14.png.c81d8f154d38317147e611197df06a49.png

and the Balance button does not seem to work either...

 

I am also going to do what @limetech said lower (

) by moving everything out of the drive, formatting and putting things back on.

 

I have started the mover to get things off the drive naturally to see if that can do all the work for me.

 

I have temporarily marked the usually cached shares as Yes instead of Prefer so I can move them out of the cache.

Edited by lankanmon
Link to comment

Also, when I start the docker service, I get:

Jan 26 19:31:40 LKNServer ool www[3156]: /usr/local/emhttp/plugins/dynamix/scripts/emhttpd_update
Jan 26 19:31:40 LKNServer emhttpd: req (15): cmdStatus=apply&csrf_token=****************
Jan 26 19:31:40 LKNServer emhttpd: Starting services...
Jan 26 19:31:40 LKNServer emhttpd: shcmd (5104): /usr/local/sbin/mount_image '/mnt/user/system/docker/docker.img' /var/lib/docker 30
Jan 26 19:31:40 LKNServer root: truncate: cannot open '/mnt/cache/system/docker/docker.img' for writing: Read-only file system
Jan 26 19:31:40 LKNServer kernel: BTRFS info (device loop3): disk space caching is enabled
Jan 26 19:31:40 LKNServer kernel: BTRFS info (device loop3): has skinny extents
Jan 26 19:31:40 LKNServer kernel: BTRFS info (device loop3): bdev /dev/loop3 errs: wr 68, rd 0, flush 0, corrupt 0, gen 0
Jan 26 19:31:40 LKNServer kernel: BTRFS warning (device loop3): log replay required on RO media
Jan 26 19:31:40 LKNServer root: mount: /var/lib/docker: can't read superblock on /dev/loop3.
Jan 26 19:31:40 LKNServer kernel: BTRFS error (device loop3): open_ctree failed
Jan 26 19:31:40 LKNServer root: mount error
Jan 26 19:31:40 LKNServer emhttpd: shcmd (5104): exit status: 1

 

What can I do to solve the BTRFS errors?

 

The cache drive is not full and I reduced some files too, but still getting the read-only error.

Edited by lankanmon
Link to comment
2 hours ago, johnnie.black said:
 

I have the command running right now. It appears to be working and am at 50%

btrfs balance start -dusage=75 /mnt/cache

chrome_2019-01-27_13-38-23.thumb.png.7c7ef6af32568e83783d151174ac1a98.png

 

I just wanted to know what I need to know after this is done... will the drive automatically no longer be read-only?

 

Do I need to restart?

 

Also, I ran it with -dusage=75 Do I need to run it again with a bigger number and is the max 100?

 

I just want to know what I can do to prevent this from happening again.

 

 

UPDATE:

It just finished and I got this error: 

Jan 27 14:35:27 LKNServer kernel: BTRFS info (device sdf1): relocating block group 370487918592 flags data
Jan 27 14:35:33 LKNServer kernel: BTRFS info (device sdf1): found 7445 extents
Jan 27 14:36:58 LKNServer kernel: BTRFS info (device sdf1): found 7445 extents
Jan 27 14:36:59 LKNServer kernel: BTRFS info (device sdf1): relocating block group 369414176768 flags data
Jan 27 14:37:06 LKNServer kernel: BTRFS info (device sdf1): found 6155 extents
Jan 27 14:38:54 LKNServer kernel: BTRFS info (device sdf1): found 6155 extents

So, what should I do now?

Edited by lankanmon
added update
Link to comment

Okay so I went back and looked at the other thread and saw that, if it failed, I had to run the command with a smaller param.

 

So I went back and ran it again (each time) with 

-dusage=1 

-dusage=10

-dusage=25

-dusage=50

-dusage=60

-dusage=70

-dusage=75

 

Each time, it did succeed. Then after that, I restarted the server and it seems to work now.

 

I would still like to know if I need to use a larger number (greater than 75) and what the limit is.

 

Also how often I must run this command and if there is a way to automate this process.

 

BONUS: What is btrfs scrub status? and do I need to run that as well?

 

I want to thank all of you for your help. I really appreciate it!

Link to comment
18 minutes ago, lankanmon said:

 

I would still like to know if I need to use a larger number (greater than 75) and what the limit is.

75 should be enough, but it can be up to 100

 

18 minutes ago, lankanmon said:

Also how often I must run this command and if there is a way to automate this process.

Update to latest Unraid and you won't need to run it anymore.

 

18 minutes ago, lankanmon said:

BONUS: What is btrfs scrub status? and do I need to run that as well?

Scrub checks all checksums  are correct, it's usually not needed unless you suspect a problem, but running it say once a month is a good practice.

Edited by johnnie.black
Link to comment

Okay, I will see if I can update.

 

I also managed to get the docker image to run.

 

I do see a lot of BTRFS related errors created by docker.

 

I have attached the most recent diagnostics, can you please see if these errors are of concern or if there is anything I should do to reduce them?

 

I am worried that there are lasting effects from the cache issues. 

 

Thanks!

 

lknserver-diagnostics-20190127-2017.zip

Link to comment

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.