system forced shutdown after 90 second wait

wgstarks · January 5, 2022

For the past few months I've noticed that when I reboot my server, the graceful shutdown fails and the system has to force a shutdown, which results in a parity check on boot. I've attached the diagnostics that were collected as part of the "forced" shutdown. I'm hoping someone can give me a clue as to which process is causing the shutdown to hang and how to correct it.

brunnhilde-diagnostics-20220105-1809.zip

itimpi · January 5, 2022

Your syslog shows what appear to be btrfs-level problems with device sdf, which I think is a cache drive, so it cannot be successfully unmounted.

wgstarks · January 6, 2022

sdf is my cache, but it’s formatted xfs. The only thing I have formatted btrfs is a two-disk cache pool (sdg & sdh). Of course, the designations probably changed on the reboot, right?

trurl · January 6, 2022

sdf in those diagnostics was the first disk in a pool named torrent, formatted btrfs. sdg was the other disk in the pool, but it was listed in the smart folder as sdp, so it must have disconnected.

And the errors in syslog are, in fact, for sdg:

Jan  5 18:09:36 Brunnhilde kernel: BTRFS error (device sdf1): bdev /dev/sdg1 errs: wr 872, rd 13057, flush 0, corrupt 0, gen 0
Jan  5 18:09:36 Brunnhilde kernel: BTRFS error (device sdf1): bdev /dev/sdg1 errs: wr 873, rd 13057, flush 0, corrupt 0, gen 0
Jan  5 18:09:36 Brunnhilde kernel: BTRFS error (device sdf1): bdev /dev/sdg1 errs: wr 873, rd 13058, flush 0, corrupt 0, gen 0
Jan  5 18:09:36 Brunnhilde kernel: BTRFS error (device sdf1): bdev /dev/sdg1 errs: wr 874, rd 13058, flush 0, corrupt 0, gen 0
Jan  5 18:09:36 Brunnhilde kernel: BTRFS error (device sdf1): bdev /dev/sdg1 errs: wr 874, rd 13059, flush 0, corrupt 0, gen 0
Jan  5 18:09:36 Brunnhilde kernel: BTRFS error (device sdf1): bdev /dev/sdg1 errs: wr 875, rd 13059, flush 0, corrupt 0, gen 0
Jan  5 18:09:36 Brunnhilde kernel: BTRFS error (device sdf1): bdev /dev/sdg1 errs: wr 875, rd 13060, flush 0, corrupt 0, gen 0
Jan  5 18:09:36 Brunnhilde kernel: BTRFS error (device sdf1): bdev /dev/sdg1 errs: wr 876, rd 13060, flush 0, corrupt 0, gen 0
Jan  5 18:09:36 Brunnhilde kernel: BTRFS error (device sdf1): bdev /dev/sdg1 errs: wr 876, rd 13061, flush 0, corrupt 0, gen 0
Jan  5 18:09:36 Brunnhilde kernel: BTRFS error (device sdf1): bdev /dev/sdg1 errs: wr 877, rd 13061, flush 0, corrupt 0, gen 0
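For anyone who wants to pull these counters out of a syslog themselves, here is a minimal Python sketch. It assumes the lines look exactly like the ones quoted above; the function names `parse_btrfs_error` and `latest_counters` are made up for illustration and are not part of any Unraid or btrfs tool.

```python
import re
from typing import Dict, Iterable, Optional

# Matches kernel lines of the form quoted above, e.g.
# "BTRFS error (device sdf1): bdev /dev/sdg1 errs: wr 872, rd 13057, flush 0, corrupt 0, gen 0"
BTRFS_ERR = re.compile(
    r"BTRFS error \(device (?P<fs>\S+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def parse_btrfs_error(line: str) -> Optional[dict]:
    """Return the filesystem device, failing bdev and counters, or None if the line does not match."""
    match = BTRFS_ERR.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    record = {"fs": fields["fs"], "bdev": fields["bdev"]}
    for name in COUNTERS:
        record[name] = int(fields[name])
    return record


def latest_counters(lines: Iterable[str]) -> Dict[str, dict]:
    """Keep the last record seen for each bdev, since the counters only grow within one log."""
    latest: Dict[str, dict] = {}
    for line in lines:
        record = parse_btrfs_error(line)
        if record is not None:
            latest[record["bdev"]] = record
    return latest
```

Note that the name in "(device sdf1)" identifies the filesystem, while the name after "bdev" is the pool member that actually logged the errors, which is why these lines point at sdg rather than sdf.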

wgstarks · January 6, 2022

Yeah. It’s an eSATA enclosure that I accidentally knocked the power cord loose from. When I re-powered it, it showed up both in the cache pool and in UD. That was why I was rebooting.

Squid · January 6, 2022

There is a known issue (probably existing for quite a while) where the OS is too "extreme" in calling things unclean shutdowns. Currently, if any process has to be killed in order to shut down, an unclean shutdown is recorded.

How it's supposed to work (and hopefully will be fixed in the next rev) is that a shutdown should only count as "unclean" if the drives can't be unmounted cleanly, even after killing a process if necessary.

At the end of the day, this means that most so-called unclean shutdowns (where a power failure isn't involved) aren't actually unclean. (90% of the time when this happens to me, I cancel the parity check after a couple of minutes, as I know the monthly correcting check will catch any issues.)
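The difference between the two rules can be written down as a small Python sketch. This is only a model of the description above, not Unraid's actual shutdown code; the function names are hypothetical, and treating a failed unmount as unclean under the current rule is an assumption.

```python
def unclean_current(process_killed: bool, unmounted_cleanly: bool) -> bool:
    """Current behaviour as described: killing any process marks the shutdown unclean.

    Assumption: a failed unmount also counts as unclean under the current rule.
    """
    return process_killed or not unmounted_cleanly


def unclean_intended(process_killed: bool, unmounted_cleanly: bool) -> bool:
    """Intended behaviour as described: only a failed unmount makes the shutdown unclean."""
    return not unmounted_cleanly


# The case that triggers the unnecessary parity check: a process had to be
# killed, but the drives still unmounted cleanly afterwards.
print(unclean_current(True, True))   # flagged as unclean today
print(unclean_intended(True, True))  # would not be flagged under the intended rule
```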

wgstarks · January 6, 2022

Yeah, I thought about cancelling the check but it’s really not hurting anything so I went ahead and let it run.

Maybe my question should really be: rather than rebooting, is there a better way to fix the cache pool when I’ve accidentally disconnected one of the disks (they’re all eSATA in separate enclosures)? Maybe just an array stop/start or something.

Squid · January 6, 2022

That's better suited to asking the resident BTRFS god @JorgeB directly.

JorgeB · January 7, 2022

You can stop the array, and if/when all pool devices are back online, appear on the main page and are already assigned (after a page refresh if needed; you can't just re-assign them manually), you can just re-start it.

wgstarks · January 7, 2022

7 hours ago, JorgeB said:

You can stop the array, and if/when all pool devices are back online, appear on the main page and are already assigned (after a page refresh if needed; you can't just re-assign them manually), you can just re-start it.

Thanks

JorgeB · January 7, 2022

1 hour ago, wgstarks said:

Thanks

Forgot to mention: there shouldn't be, but if there's an "all data on this device will be deleted" warning next to any of the pool devices, don't start the array. In that case reboot first, and the warning should then be gone.
