
docker and libvirt failed to start



Added an external LSI card to attach a DAS. Changed the PCIe slots the two internal LSI cards are in and removed a NIC (the VM that was using it can use one port of the 4-port card instead). The DAS is powered off to reduce the number of devices that could be causing issues and to cut the boot time spent enumerating devices. Sometimes when I reboot, the VM tab shows "Libvirt Service failed to start," but other times the VMs are there and work fine. I have not been able to get the Docker page to stop saying "Docker Service failed to start"; it used to work. Researching "bad tree block start" (which was in the log), I found

and made a backup of docker.img, ready to delete/recreate it, but looking at the log chunk below my concern is that one of my cache drives is failing. I would suspect card/cable, but all of those messages say sdc, not sdc and sdb (or sdn).
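
(For reference, the backup was just a copy of the image made with the Docker service stopped; the path below assumes the default Unraid location for docker.img, so treat it as an assumption rather than exactly what I have.)

# assumed default location of the image; stop the Docker service in Settings first
cp /mnt/user/system/docker/docker.img /mnt/user/system/docker/docker.img.bak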

 

Jun 21 20:09:46 Raza kernel: BTRFS info (device sdc1): read error corrected: ino 0 off 5201959788544 (dev /dev/sdc1 sector 18198928)
Jun 21 20:09:46 Raza kernel: BTRFS info (device sdc1): read error corrected: ino 0 off 5201959792640 (dev /dev/sdc1 sector 18198936)
Jun 21 20:09:46 Raza kernel: BTRFS error (device sdc1): parent transid verify failed on 5202249596928 wanted 10386593 found 10375290
Jun 21 20:09:46 Raza kernel: BTRFS info (device sdc1): read error corrected: ino 0 off 5202249596928 (dev /dev/sdc1 sector 18764960)
Jun 21 20:09:46 Raza kernel: BTRFS info (device sdc1): read error corrected: ino 0 off 5202249601024 (dev /dev/sdc1 sector 18764968)
Jun 21 20:09:46 Raza kernel: BTRFS info (device sdc1): read error corrected: ino 0 off 5202249605120 (dev /dev/sdc1 sector 18764976)
Jun 21 20:09:46 Raza kernel: BTRFS info (device sdc1): read error corrected: ino 0 off 5202249609216 (dev /dev/sdc1 sector 18764984)
Jun 21 20:09:46 Raza kernel: BTRFS error (device sdc1): parent transid verify failed on 5202217517056 wanted 10386551 found 10375235
Jun 21 20:09:46 Raza kernel: BTRFS info (device sdc1): read error corrected: ino 0 off 5202217517056 (dev /dev/sdc1 sector 18702304)
Jun 21 20:09:46 Raza kernel: BTRFS info (device sdc1): read error corrected: ino 0 off 5202217521152 (dev /dev/sdc1 sector 18702312)
Jun 21 20:09:46 Raza kernel: BTRFS error (device sdc1): parent transid verify failed on 5201256808448 wanted 10386601 found 10385206
Jun 21 20:09:46 Raza kernel: BTRFS error (device sdc1): parent transid verify failed on 5201545854976 wanted 10384892 found 10379573
Jun 21 20:09:46 Raza kernel: BTRFS error (device sdc1): parent transid verify failed on 5201542184960 wanted 10384888 found 10379564
Jun 21 20:09:46 Raza kernel: BTRFS error (device sdc1): parent transid verify failed on 5201257611264 wanted 10386601 found 10385206
Jun 21 20:09:46 Raza ker

 

raza-diagnostics-20200621-2030.zip


Update

If I pull all the drive trays halfway out, boot the server, and then push them all back in, the VM tab is available after starting the array. Thinking back to when this issue started, I believe the only other boot where VMs were available was when I did it this same way. Either way, no Docker at all. SMART reports pass on all drives (parity/array/cache; I didn't check unassigned). Going to power off and reseat the LSI cards and cables, and then do a normal boot with all drives connected.


There's corruption on the cache filesystem, almost certainly caused by 2 of the 3 devices dropping offline in the past:

Jun 21 19:45:09 Raza kernel: BTRFS info (device sdc1): bdev /dev/sdc1 errs: wr 574718, rd 407964, flush 6103, corrupt 0, gen 0
Jun 21 19:45:09 Raza kernel: BTRFS info (device sdc1): bdev /dev/sdb1 errs: wr 724802, rd 490716, flush 6103, corrupt 0, gen 0

 

See here for more info. If a scrub can't correct everything, and it probably won't, you'll need to re-format the pool.
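
Roughly, the check goes like this (a sketch assuming the pool is mounted at /mnt/cache, which is the Unraid default):

# start a scrub on the pool (it runs in the background by default)
btrfs scrub start /mnt/cache

# watch progress / see per-device results once it finishes
btrfs scrub status -d /mnt/cache

# per-device error counters accumulated by the filesystem
btrfs device stats /mnt/cache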

1 minute ago, johnnie.black said:

There's corruption on the cache filesystem, almost certainly caused by 2 of the 3 devices dropping offline in the past:


Jun 21 19:45:09 Raza kernel: BTRFS info (device sdc1): bdev /dev/sdc1 errs: wr 574718, rd 407964, flush 6103, corrupt 0, gen 0
Jun 21 19:45:09 Raza kernel: BTRFS info (device sdc1): bdev /dev/sdb1 errs: wr 724802, rd 490716, flush 6103, corrupt 0, gen 0

 

See here for more info. If a scrub can't correct everything, and it probably won't, you'll need to re-format the pool.

Having found your reply here (and elsewhere), I ran "btrfs scrub start /mnt/cache"; the output, scrub status, and dev stats are below. Your reply to this thread was ironic timing, coming not long after the scrub finished while I was working on the Docker side. I have not run a second scrub yet, if I even should. Neither the docker.img file nor the backup I made (renamed correctly) worked. Since that further suggested image corruption, I followed the "***OFFICIAL GUIDE*** Restoring your Docker Applications in a New Image File" instructions and the Docker containers are added back (Plex quickly tested and seems fine; I assume the rest are too). I'm not seeing errors in the log file anymore. Should I reset the stats (per the comment I linked) and then add monitoring (which you linked)? Or clear stats, run another scrub, clear stats again, and then add monitoring?
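
If I've read the linked posts right, the reset plus a basic check would look something like the sketch below (the logger call and running it from cron/User Scripts are my own assumptions, not from those posts):

# print and reset the accumulated counters once the underlying cause is fixed
btrfs device stats -z /mnt/cache

# periodic check, e.g. from a cron job or the User Scripts plugin;
# --check makes the command exit non-zero if any counter is above zero
if ! btrfs device stats --check /mnt/cache > /dev/null; then
    logger -t btrfs-monitor "non-zero btrfs device stats on /mnt/cache"
fi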

 

Due to a wrong network config and having no idea what I had set the Unraid password to, I had to clear the password hash in the shadow file on the USB drive. Unraid rotated MAC addresses between NICs with the hardware change... so it turned out, after I had gotten back in, that I could have used port 3/4 after all. Mini-rant over. I think the iDRAC power-off command caused an ungraceful shutdown, and the parity check it triggered was later interrupted when I pressed (not held) the front power button to power off. Everything seems fine now, but most VMs are still off and I am not doing heavy data transfers (the VMs that are off include the one that acts as the "gateway" for my seedbox). The next scheduled parity check is ~9 days away, but since one was interrupted (and given the shutdown that triggered it), I assume I should run one now?
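
(In case anyone else ends up locked out the same way: as I understand it, Unraid keeps its persistent copy of the shadow file on the flash drive, so clearing the hash there is what gets you back in. The path and line format below are my understanding, not something copied from my system.)

# assumed persistent copy on the flash drive: /boot/config/shadow
# blank the hash (second field) on the root line, then reboot:
#   before:  root:$6$...long.hash...:18700:0:99999:7:::
#   after:   root::18700:0:99999:7:::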

 

Speaking of cache drives, I know having a 2TB and two 1TB cache drives is not standard (plus the mix of QLC and TLC, which is maybe the worst part). Is there any reason to switch to all 1TB drives instead of mixed sizes? I'm thinking about getting another 1TB drive to bring the cache pool up to 2.5TB, but I can go down on cache size if that is better (down to 1TB or 1.5TB with a new 1TB drive).
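
For anyone else with a mixed-size pool, the usable space under the current profile is easy enough to check (a sketch assuming the pool is mounted at /mnt/cache):

btrfs filesystem usage /mnt/cache
# the "Device unallocated" and "Free (estimated)" lines show how much of the
# mixed-size pool is actually usable under the current data/metadata profiles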

 


WARNING: errors detected during scrubbing, corrected
root@Raza:~# btrfs scrub status -d /mnt/cache
UUID:             b376c540-8bb3-4580-86bf-fd19d5c8ed08
scrub device /dev/sdc1 (id 1) history
Scrub started:    Mon Jun 22 01:39:43 2020
Status:           finished
Duration:         0:14:13
Total to scrub:   474.03GiB
Rate:             507.75MiB/s
Error summary:    verify=478 csum=7468
  Corrected:      7946
  Uncorrectable:  0
  Unverified:     0
scrub device /dev/sdb1 (id 2) history
Scrub started:    Mon Jun 22 01:39:43 2020
Status:           finished
Duration:         0:14:19
Total to scrub:   475.00GiB
Rate:             505.87MiB/s
Error summary:    csum=10124
  Corrected:      10124
  Uncorrectable:  0
  Unverified:     0
scrub device /dev/sdn1 (id 3) history
Scrub started:    Mon Jun 22 01:39:43 2020
Status:           finished
Duration:         0:33:47
Total to scrub:   949.03GiB
Rate:             430.95MiB/s
Error summary:    no errors found
root@Raza:~# btrfs dev stats /mnt/cache
[/dev/sdc1].write_io_errs    574718
[/dev/sdc1].read_io_errs     407964
[/dev/sdc1].flush_io_errs    6103
[/dev/sdc1].corruption_errs  7468
[/dev/sdc1].generation_errs  478
[/dev/sdb1].write_io_errs    724802
[/dev/sdb1].read_io_errs     490716
[/dev/sdb1].flush_io_errs    6103
[/dev/sdb1].corruption_errs  10124
[/dev/sdb1].generation_errs  0
[/dev/sdn1].write_io_errs    0
[/dev/sdn1].read_io_errs     0
[/dev/sdn1].flush_io_errs    0
[/dev/sdn1].corruption_errs  0
[/dev/sdn1].generation_errs  0
root@Raza:~# btrfs scrub status /mnt/cache
UUID:             b376c540-8bb3-4580-86bf-fd19d5c8ed08
Scrub started:    Mon Jun 22 01:39:43 2020
Status:           finished
Duration:         0:33:47
Total to scrub:   1.69TiB
Rate:             859.00MiB/s
Error summary:    verify=478 csum=17592
  Corrected:      18070
  Uncorrectable:  0
  Unverified:     0
root@Raza:~#

 

