
/var/log 100% full - cache pool broken


CorneliousJD


Logged into the server this morning and this is very new... Different device IDs (sdf vs sdk) but shows same serial number. 

 

Not sure what to make of this?

photo_2018-04-09_08-56-34.jpg

 

Rebooted server and it disappeared from unassigned devices but now I can't get Docker to start at all and my log shows this.

 

 

Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1940312064 (dev /dev/sdf1 sector 3789672)
Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1940316160 (dev /dev/sdf1 sector 3789680)
Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1940320256 (dev /dev/sdf1 sector 3789688)
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1968832512 wanted 235074 found 229551
Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1968832512 (dev /dev/sdf1 sector 3845376)
Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1968836608 (dev /dev/sdf1 sector 3845384)
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2032041984 wanted 235083 found 229561
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2037317632 wanted 235084 found 229562
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2036809728 wanted 235084 found 229562
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1826832384 wanted 234220 found 229529
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2036252672 wanted 235084 found 229562
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1600208896 wanted 234504 found 229492
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1990836224 wanted 234240 found 229555
Apr 9 09:07:32 Server kernel: verify_parent_transid: 37 callbacks suppressed
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2035712000 wanted 235084 found 229562
Apr 9 09:07:32 Server kernel: repair_io_failure: 178 callbacks suppressed
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2035712000 (dev /dev/sdf1 sector 3976000)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2035716096 (dev /dev/sdf1 sector 3976008)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2035720192 (dev /dev/sdf1 sector 3976016)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2035724288 (dev /dev/sdf1 sector 3976024)
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2031976448 wanted 235083 found 229561
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2031976448 (dev /dev/sdf1 sector 3968704)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2031980544 (dev /dev/sdf1 sector 3968712)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2031984640 (dev /dev/sdf1 sector 3968720)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2031988736 (dev /dev/sdf1 sector 3968728)
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2036989952 wanted 235084 found 229562
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2036989952 (dev /dev/sdf1 sector 3978496)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2036994048 (dev /dev/sdf1 sector 3978504)
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1826865152 wanted 234220 found 229528
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1836728320 wanted 234220 found 229530
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1836744704 wanted 234220 found 229530
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1836761088 wanted 234220 found 229530
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1723269120 wanted 235030 found 229509
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1836793856 wanted 234220 found 229530
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2069004288 wanted 234247 found 229568
Apr 9 09:10:00 Server root: Fix Common Problems Version 2018.03.25a
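With this many repeated lines it can help to count them by type rather than read them one by one. The following is only a rough sketch: `summarize_btrfs_log` is a made-up helper name, and the two patterns are taken from the messages shown in the log above, so other kernel versions may word them differently.

```shell
# Count the two kinds of BTRFS kernel messages seen in the log above.
# Usage: summarize_btrfs_log /var/log/syslog
summarize_btrfs_log() {
    awk '
        /BTRFS error .*parent transid verify failed/ { transid++ }
        /BTRFS info .*read error corrected/          { corrected++ }
        END {
            # Unset counters print as 0
            printf "transid_failures=%d\n", transid
            printf "read_errors_corrected=%d\n", corrected
        }
    ' "$1"
}
```

A high count of transid failures alongside corrected reads would be consistent with one pool member holding stale data while the other is still good, though the counts alone do not prove that.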

Link to comment

I saw the diagnostics in your other thread. Maybe they ought to be merged. It seems as though one of your cache SSDs (sdh and sdf) dropped offline and later reconnected, but was given the device name sdk. This suggests an intermittent connection, so I'd power down and check or replace both the SATA cable and the power cable to it. Because the disk disappeared, your cache pool is broken and needs to be rebuilt, but first you need to find out why it dropped and reconnected.
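One way to confirm that two device names point at the same physical disk is to compare serial numbers. This is a minimal sketch, not an Unraid feature: `find_duplicate_serials` is a hypothetical helper that reads "NAME SERIAL" pairs and prints any serial that shows up under more than one name.

```shell
# Reads "NAME SERIAL" pairs on stdin and prints every serial
# that appears under more than one device name.
find_duplicate_serials() {
    awk 'NF >= 2 {
             names[$2] = (names[$2] == "" ? $1 : names[$2] " " $1)
             count[$2]++
         }
         END {
             for (s in count)
                 if (count[s] > 1) print s ": " names[s]
         }'
}

# Typical use on a live system (whole disks only, no header row):
#   lsblk -dn -o NAME,SERIAL | find_duplicate_serials
```

If the same serial is listed for both sdf and sdk, that would support the idea that the disk dropped and came back under a new name.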

 

Link to comment

Yeah I had made the other thread before I realized what was actually happening. Fixing this should fix my log file being full too for sure.

 

The server uses a backplane though, so there are no real cables for me to check or reconnect. If the backplane cable had gone bad, wouldn't everything have disconnected?

 

What are my next steps? How do I go about rebuilding the cache pool? My other data seems to be intact and I can actually still boot VMs just fine, although the server's syslog goes nuts with errors when I do.

Link to comment

Thanks John.

 

I'll power down and reseat everything for good measure - once I boot back up what's the process for rebuilding the cache pool properly without losing my data? 

Also, it looks like docker.img might be corrupted, since everything except the Docker image is still working properly?

Link to comment

I think I'm still having some trouble with figuring out how to rebuild the cache pool w/ the same disk. If I remove it from the pool and re-add it nothing changes, it just starts the array and the same errors continue in the log. It doesn't detect it as a NEW member of the pool and replicate data to it. 

Link to comment

OK, I did that. It said the drive needed to be formatted before it could be added, so I formatted it, and now it's added to the pool and the cache is completely empty... :/

 

Looks like it deleted all my appdata entirely? I have a backup from Sunday @ 4AM but my VMs are all gone now too. Oof. 

Is this because it was the main Cache 1 disk? I'm confused why it didn't just replicate from Cache 2 though?

Link to comment

You shouldn't have formatted it. That formatted the whole cache pool. The way BTRFS refers to its components is confusing because it uses the same device name for the overall pool as for its first component, but rebuilding a RAID array never involves formatting.
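Before changing anything, it may be worth looking at what BTRFS itself reports about the pool. The sketch below wraps two read-only `btrfs` commands; `show_pool_state` is a made-up name, and `/mnt/cache` is only an assumed mount point for the cache pool, so pass your own path if it differs.

```shell
# Print pool membership and per-device error counters.
# Read-only: neither command writes to the pool.
# /mnt/cache is an assumed default; pass another mount point as $1.
show_pool_state() {
    mount_point="${1:-/mnt/cache}"
    # Lists the devices in the filesystem and flags missing members
    btrfs filesystem show "$mount_point"
    # Shows read, write, flush, corruption and generation error counts per device
    btrfs device stats "$mount_point"
}
```

Running `show_pool_state` with no arguments checks the assumed default path. Non-zero counters on only one device would point at that member as the one that dropped.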

Link to comment

Hmm, after I added it to the pool it said the device was unusable until it was formatted, and nothing happened until I went ahead and formatted it.

 

Really not a big deal, since I had to recreate my docker.img file anyway, I wasn't actively using any of the VMs, and my appdata backup was only a day old.

 

Was there another way I should have gone about that, though, so I know in case this happens again?

Link to comment

I think the confusion was caused by the SSD being reintroduced to the pool while already containing a BTRFS file system. A brand new SSD would have just been accepted without a problem. So the thing to do in future is to wipe it first. There's a simple command to do that: wipefs.

Type

wipefs --help

at the command prompt for more information.

Link to comment
43 minutes ago, John_M said:

I think the confusion was caused by the SSD being reintroduced to the pool while already containing a BTRFS file system.

Very likely. The best way in these cases is to wipe the SSD with blkdiscard so it's completely blank. wipefs usually works, but you need to wipe the partition and then the device, and since most cache devices are SSDs it's best to use blkdiscard, e.g.:

 

blkdiscard /dev/sdX
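Since both approaches destroy everything on the target, a dry run that only prints the commands can be a useful safety step. This is a sketch under assumptions: `wipe_plan` is a hypothetical helper, `/dev/sdX` is a placeholder as above, and the partition is assumed to be named by appending `1` to the device name, which does not hold for NVMe devices.

```shell
# Print (do NOT run) the commands for blanking a former pool member.
# Usage: wipe_plan /dev/sdX [discard|wipefs]
wipe_plan() {
    dev="$1"
    method="${2:-discard}"
    case "$method" in
        discard)
            # Discards every block on the SSD
            echo "blkdiscard $dev"
            ;;
        wipefs)
            # Partition first, then the whole device, as described above.
            # Assumption: the first partition is the device name plus "1".
            echo "wipefs -a ${dev}1"
            echo "wipefs -a $dev"
            ;;
        *)
            echo "unknown method: $method" >&2
            return 1
            ;;
    esac
}
```

For example, `wipe_plan /dev/sdX wipefs` prints the two wipefs commands in order so the device name can be double-checked against the serial number before anything is actually run.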

 

Link to comment

Update - I'm back up and running in full right now, with the exception of 2 utility VMs that I had - Ubuntu and Win 10. I've downloaded their vdisks from CrashPlan and will put them into the right place and try to fire them up.

 

EDIT: The Win 10 VM came back up normally, but Ubuntu wouldn't boot afterwards. I decided to just scrap it and do a fresh install, since it was more of just something I was testing with.

 

What a weird situation - thanks everyone for the help! Hopefully this doesn't happen again haha. 

Link to comment
