/var/log 100% full - cache pool broken

CorneliousJD · April 9, 2018

Diagnostics attached, had less than 30 days of uptime. Rebooting for now to clear for good measure.

server-diagnostics-20180409-0853.zip

CorneliousJD · April 9, 2018

Logged into the server this morning and this is very new... Different device IDs (sdf vs sdk) but shows same serial number.

Not sure what to make of this?

Rebooted server and it disappeared from unassigned devices but now I can't get Docker to start at all and my log shows this.

Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1940312064 (dev /dev/sdf1 sector 3789672)
Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1940316160 (dev /dev/sdf1 sector 3789680)
Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1940320256 (dev /dev/sdf1 sector 3789688)
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1968832512 wanted 235074 found 229551
Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1968832512 (dev /dev/sdf1 sector 3845376)
Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1968836608 (dev /dev/sdf1 sector 3845384)
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2032041984 wanted 235083 found 229561
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2037317632 wanted 235084 found 229562
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2036809728 wanted 235084 found 229562
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1826832384 wanted 234220 found 229529
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2036252672 wanted 235084 found 229562
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1600208896 wanted 234504 found 229492
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1990836224 wanted 234240 found 229555
Apr 9 09:07:32 Server kernel: verify_parent_transid: 37 callbacks suppressed
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2035712000 wanted 235084 found 229562
Apr 9 09:07:32 Server kernel: repair_io_failure: 178 callbacks suppressed
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2035712000 (dev /dev/sdf1 sector 3976000)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2035716096 (dev /dev/sdf1 sector 3976008)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2035720192 (dev /dev/sdf1 sector 3976016)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2035724288 (dev /dev/sdf1 sector 3976024)
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2031976448 wanted 235083 found 229561
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2031976448 (dev /dev/sdf1 sector 3968704)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2031980544 (dev /dev/sdf1 sector 3968712)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2031984640 (dev /dev/sdf1 sector 3968720)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2031988736 (dev /dev/sdf1 sector 3968728)
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2036989952 wanted 235084 found 229562
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2036989952 (dev /dev/sdf1 sector 3978496)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2036994048 (dev /dev/sdf1 sector 3978504)
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1826865152 wanted 234220 found 229528
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1836728320 wanted 234220 found 229530
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1836744704 wanted 234220 found 229530
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1836761088 wanted 234220 found 229530
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1723269120 wanted 235030 found 229509
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1836793856 wanted 234220 found 229530
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2069004288 wanted 234247 found 229568
Apr 9 09:10:00 Server root: Fix Common Problems Version 2018.03.25a
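For anyone reading a log like the one above, a small tally can make the pattern easier to see. The sketch below is only an illustration: it assumes the kernel message wording shown in this thread, and the function name `summarize` and the `SAMPLE` list are made up for the example. It counts the "parent transid verify failed" errors, records the largest gap between the wanted and found generation, and counts corrected reads per underlying device. A large gap with every correction landing on one device would be consistent with that device having missed writes while it was disconnected.

```python
import re
from collections import Counter

# Patterns assume the message wording seen in this thread's syslog excerpt.
TRANSID = re.compile(
    r"BTRFS error \(device (\S+)\): parent transid verify failed on (\d+) "
    r"wanted (\d+) found (\d+)"
)
CORRECTED = re.compile(
    r"BTRFS info \(device (\S+)\): read error corrected: "
    r".*\(dev (\S+) sector (\d+)\)"
)


def summarize(lines):
    """Tally transid failures and corrected reads from syslog lines."""
    transid_failures = 0
    max_generation_lag = 0
    corrected_by_dev = Counter()
    for line in lines:
        match = TRANSID.search(line)
        if match:
            transid_failures += 1
            wanted, found = int(match.group(3)), int(match.group(4))
            max_generation_lag = max(max_generation_lag, wanted - found)
            continue
        match = CORRECTED.search(line)
        if match:
            # group(2) is the member device the corrected copy was written to
            corrected_by_dev[match.group(2)] += 1
    return {
        "transid_failures": transid_failures,
        "max_generation_lag": max_generation_lag,
        "corrected_by_dev": dict(corrected_by_dev),
    }


# Three lines copied from the log above, used as sample input.
SAMPLE = [
    "Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1940312064 (dev /dev/sdf1 sector 3789672)",
    "Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1968832512 wanted 235074 found 229551",
    "Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1968832512 (dev /dev/sdf1 sector 3845376)",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `summarize(open("/var/log/syslog"))`; the path is an assumption and may differ on your system.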

John_M · April 9, 2018

Before you do anything else please post your diagnostics.

CorneliousJD · April 9, 2018

Done - here ya go - this is before I rebooted.

server-diagnostics-20180409-0853.zip

CorneliousJD · April 9, 2018

Here's after a reboot, where I noticed Docker won't start.

server-diagnostics-20180409-0923.zip

John_M · April 9, 2018

Your syslog is full of error messages associated with the problem you posted about in your other thread.

John_M · April 9, 2018

I saw the diagnostics in your other thread. Maybe they ought to be merged. It seems as though one of your cache SSDs (sdf) dropped offline and later reconnected, but was given the device name sdk. This suggests an intermittent connection, so I'd power down and check/replace both the SATA cable and the power cable to it. Because the disk disappeared, your cache pool is broken and needs to be rebuilt, but first you need to find out why it dropped and reconnected.
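One way to spot this kind of drop-and-reconnect is to group device names by serial number. The sketch below is a rough illustration, not part of any Unraid tooling: it assumes text in the two-column form printed by `lsblk -d -n -o NAME,SERIAL`, assumes serials contain no spaces, and uses made-up serial strings. The helper names `names_by_serial` and `duplicated_serials` are hypothetical. A stale node such as sdf will only show up if the kernel still lists it.

```python
from collections import defaultdict


def names_by_serial(lsblk_text):
    """Group device names by serial from 'NAME SERIAL' lines."""
    groups = defaultdict(list)
    for line in lsblk_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and devices that report no serial
        name, serial = parts
        groups[serial].append(name)
    return dict(groups)


def duplicated_serials(lsblk_text):
    """Return only the serials that appear under more than one device name."""
    return {
        serial: names
        for serial, names in names_by_serial(lsblk_text).items()
        if len(names) > 1
    }


# Made-up sample; the serial strings are placeholders, not real values.
SAMPLE_LSBLK = """\
sdf EXAMPLE-SERIAL-A
sdh EXAMPLE-SERIAL-B
sdk EXAMPLE-SERIAL-A
"""

if __name__ == "__main__":
    print(duplicated_serials(SAMPLE_LSBLK))
```

A serial listed under two names, as with sdf and sdk here, points at the same physical disk having been re-detected rather than at two different disks.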

CorneliousJD · April 9, 2018

Yeah I had made the other thread before I realized what was actually happening. Fixing this should fix my log file being full too for sure.

The server uses a backplane though, so there are no real cables for me to check/reconnect. If the backplane cable went down, wouldn't everything have disconnected?

What are my next steps? How do I go about rebuilding the cache pool? My other data seems to be intact and I can actually still boot VMs just fine, although the syslog of the server itself goes nuts with errors when I do.

CorneliousJD · April 9, 2018

Could the drive itself be to blame in this instance? I'm not sure what to make of the logs, if they actually indicate a problem with the drive itself or not?

John_M · April 9, 2018

In that case it might be worth moving the SSD to a different slot or at least re-seat it.

The SMART report for sdk looks good, though SSDs don't reveal as much as HDDs.

CorneliousJD · April 9, 2018

Thanks John.

I'll power down and reseat everything for good measure - once I boot back up what's the process for rebuilding the cache pool properly without losing my data?

Also it looks like the docker.img might be corrupted since everything else is working properly still except the docker image?

John_M · April 9, 2018

Once you're happy with the SSD and it's being correctly identified, you can follow this procedure to remove it from the cache pool and then re-add it:

John_M · April 9, 2018

You can very easily recreate the docker.img file so just do this:

CorneliousJD · April 9, 2018

I think I'm still having some trouble with figuring out how to rebuild the cache pool w/ the same disk. If I remove it from the pool and re-add it nothing changes, it just starts the array and the same errors continue in the log. It doesn't detect it as a NEW member of the pool and replicate data to it.

John_M · April 9, 2018

OK. After you unassign it, shut down and physically remove it. Then power up and start the array with the device missing then shut down and reinsert it. Power up, reassign it and then start the array to rebuild.

CorneliousJD · April 9, 2018

Ok I did that, it said the drive needed to be formatted before it could be added - I formatted it and now it's added to the pool and the cache is now completely empty...

Looks like it deleted all my appdata entirely? I have a backup from Sunday @ 4AM but my VMs are all gone now too. Oof.

Is this because it was the main Cache 1 disk? I'm confused why it didn't just replicate from Cache 2 though?

John_M · April 9, 2018

You shouldn't have formatted it. That formatted the cache pool. The way BTRFS refers to its components is confusing because it uses the same device name for the overall pool as for its first component but rebuilding a RAID array never ever involves formatting.

CorneliousJD · April 9, 2018

Hmm, it wouldn't do anything and it said it was unusable until it was formatted after I had added it to the pool. It wasn't doing anything until I went ahead and actually made that change to format it.

Really not a big deal since I had to recreate my docker.img file anyhow and I wasn't actively using any of the VMs anyhow, and my backup for appdata was only a day old.

Was there another way I should have gone about that though, so I know in case this happens again?

John_M · April 9, 2018

I think the confusion was caused by the SSD being reintroduced to the pool while already containing a BTRFS file system. A brand new SSD would have just been accepted without a problem. So the thing to do in future is to wipe it first. There's a simple command to do that: wipefs.

Type

wipefs --help

at the command prompt for more information.

JorgeB · April 9, 2018

43 minutes ago, John_M said:

I think the confusion was caused by the SSD being reintroduced to the pool while already containing a BTRFS file system.

Very likely. The best way in these cases is to wipe the SSD with blkdiscard so it's completely blank. wipefs usually works, but you need to wipe the partition and then the device, and since most cache devices are SSDs it's best to use blkdiscard, e.g.:

blkdiscard /dev/sdX
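Since both commands destroy data, it can help to build the command list first and look at it before running anything. The sketch below only constructs the commands and never executes them. Everything in it is an assumption for illustration: the helper names, the `/dev/sdX` naming check, and the guess that the device has a single partition numbered 1. The mount check is also a minimal guard only: a mounted multi-device BTRFS pool is listed under just one of its member devices, so a pool member can pass this check while still in use. Unassign the device first, as described earlier in the thread.

```python
import re

# Only accept whole-disk names such as /dev/sdk (an assumption of this sketch).
DEVICE_RE = re.compile(r"^/dev/sd[a-z]+$")


def mounted_sources(proc_mounts_text):
    """Return the source column of /proc/mounts-style text as a set."""
    rows = (line.split() for line in proc_mounts_text.splitlines())
    return {parts[0] for parts in rows if parts}


def wipe_commands(device, proc_mounts_text, use_discard=True):
    """Build (but do not run) the commands to blank a former pool member."""
    if not DEVICE_RE.match(device):
        raise ValueError(f"not a whole-disk device name: {device!r}")
    partition = re.compile(re.escape(device) + r"\d+")
    busy = sorted(
        src
        for src in mounted_sources(proc_mounts_text)
        if src == device or partition.fullmatch(src)
    )
    if busy:
        raise RuntimeError(f"refusing to wipe, mounted: {', '.join(busy)}")
    if use_discard:
        # Discards every block on the SSD, leaving it completely blank.
        return [["blkdiscard", device]]
    # wipefs route: the partition first, then the device itself.
    # Assumes a single partition numbered 1.
    return [["wipefs", "-a", device + "1"], ["wipefs", "-a", device]]


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        for command in wipe_commands("/dev/sdX".replace("X", "k"), handle.read()):
            print(" ".join(command))
```

The printed lines can then be checked by hand against the serial number of the disk you mean to wipe before typing them at the prompt.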

CorneliousJD · April 9, 2018

Thanks Johnnie.black

After the cache rebuilt I don't appear to be seeing any more errors, fingers crossed.

Trying to restore my massive plex appdata folder from CrashPlan now as I had that excluded from local appdata backups.

CorneliousJD · April 9, 2018

Update - I'm back up and running in full right now with the exception of 2 utility VMs that I had - Ubuntu and Win 10. I've downloaded their vdisks from CrashPlan and will put them into the right place and try to fire them up.

EDIT. Win 10 VM back up normally, but Ubuntu wouldn't boot after, I decided to just scrap it, since that was more of just something I was testing with and doing a fresh install of it.

What a weird situation - thanks everyone for the help! Hopefully this doesn't happen again haha.
