CorneliousJD Posted April 9, 2018
Diagnostics attached; the server had less than 30 days of uptime. Rebooting for now to clear things, for good measure.
server-diagnostics-20180409-0853.zip
CorneliousJD (Author) Posted April 9, 2018
Logged into the server this morning and this is very new... Different device IDs (sdf vs sdk) but it shows the same serial number. Not sure what to make of this? Rebooted the server and it disappeared from unassigned devices, but now I can't get Docker to start at all and my log shows this:

Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1940312064 (dev /dev/sdf1 sector 3789672)
Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1940316160 (dev /dev/sdf1 sector 3789680)
Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1940320256 (dev /dev/sdf1 sector 3789688)
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1968832512 wanted 235074 found 229551
Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1968832512 (dev /dev/sdf1 sector 3845376)
Apr 9 09:07:23 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 1968836608 (dev /dev/sdf1 sector 3845384)
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2032041984 wanted 235083 found 229561
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2037317632 wanted 235084 found 229562
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2036809728 wanted 235084 found 229562
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1826832384 wanted 234220 found 229529
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2036252672 wanted 235084 found 229562
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1600208896 wanted 234504 found 229492
Apr 9 09:07:23 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1990836224 wanted 234240 found 229555
Apr 9 09:07:32 Server kernel: verify_parent_transid: 37 callbacks suppressed
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2035712000 wanted 235084 found 229562
Apr 9 09:07:32 Server kernel: repair_io_failure: 178 callbacks suppressed
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2035712000 (dev /dev/sdf1 sector 3976000)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2035716096 (dev /dev/sdf1 sector 3976008)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2035720192 (dev /dev/sdf1 sector 3976016)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2035724288 (dev /dev/sdf1 sector 3976024)
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2031976448 wanted 235083 found 229561
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2031976448 (dev /dev/sdf1 sector 3968704)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2031980544 (dev /dev/sdf1 sector 3968712)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2031984640 (dev /dev/sdf1 sector 3968720)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2031988736 (dev /dev/sdf1 sector 3968728)
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2036989952 wanted 235084 found 229562
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2036989952 (dev /dev/sdf1 sector 3978496)
Apr 9 09:07:32 Server kernel: BTRFS info (device sdh1): read error corrected: ino 0 off 2036994048 (dev /dev/sdf1 sector 3978504)
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1826865152 wanted 234220 found 229528
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1836728320 wanted 234220 found 229530
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1836744704 wanted 234220 found 229530
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1836761088 wanted 234220 found 229530
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1723269120 wanted 235030 found 229509
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 1836793856 wanted 234220 found 229530
Apr 9 09:07:32 Server kernel: BTRFS error (device sdh1): parent transid verify failed on 2069004288 wanted 234247 found 229568
Apr 9 09:10:00 Server root: Fix Common Problems Version 2018.03.25a
John_M Posted April 9, 2018
Before you do anything else please post your diagnostics.
CorneliousJD (Author) Posted April 9, 2018
Done - here ya go - this is before I rebooted.
server-diagnostics-20180409-0853.zip
CorneliousJD (Author) Posted April 9, 2018
Here's after a reboot, where I noticed Docker won't start.
server-diagnostics-20180409-0923.zip
John_M Posted April 9, 2018
Your syslog is full of error messages associated with the problem you posted about in your other thread.
John_M Posted April 9, 2018
I saw the diagnostics in your other thread. Maybe they ought to be merged. It seems as though one of your cache SSDs (sdh, sdf) dropped offline and later reconnected, but was given the device name sdk. This suggests an intermittent connection, so I'd power down and check/replace both the SATA cable and the power cable to it. Because the disk disappeared, your cache pool is broken and needs to be rebuilt, but first you need to find out why it dropped and reconnected.
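For anyone following along, a quick read-only way to confirm a pool member dropped is to ask btrfs directly from the console (the mount point below assumes the standard Unraid cache location at /mnt/cache; adjust for your system):

```shell
# List the devices btrfs believes belong to each filesystem.
# A member that dropped and came back under a new name
# (e.g. sdf reappearing as sdk) shows up as missing/mismatched here.
btrfs filesystem show

# Per-device error counters for the mounted cache pool:
# read/write/flush errors plus corruption and generation mismatches
# (the "parent transid verify failed" messages increment these).
btrfs device stats /mnt/cache
```

Both commands are non-destructive, so they're safe to run before deciding how to rebuild.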
CorneliousJD (Author) Posted April 9, 2018
Yeah, I made the other thread before I realized what was actually happening. Fixing this should fix my log filling up too, for sure. The server uses a backplane though, so there are no individual cables for me to check/reconnect; if the backplane cable went down, everything would have disconnected, right? What are my next steps? How do I go about rebuilding the cache pool? My other data seems to be intact and I can still boot VMs just fine, although the server's syslog goes nuts with errors when I do.
CorneliousJD (Author) Posted April 9, 2018
Could the drive itself be to blame in this instance? I'm not sure what to make of the logs, or whether they actually indicate a problem with the drive itself.
John_M Posted April 9, 2018
In that case it might be worth moving the SSD to a different slot, or at least re-seating it. The SMART report for sdk looks good, though SSDs don't reveal as much as HDDs.
CorneliousJD (Author) Posted April 9, 2018
Thanks John. I'll power down and reseat everything for good measure. Once I boot back up, what's the process for rebuilding the cache pool properly without losing my data? Also, it looks like the docker.img might be corrupted, since everything else is still working properly except the docker image?
John_M Posted April 9, 2018
Once you're happy with the SSD and it's being correctly identified, you can follow this procedure to remove it from the cache pool and then re-add it:
John_M Posted April 9, 2018
You can very easily recreate the docker.img file, so just do this:
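For reference, recreating the image from the console amounts to something like the following sketch. The image path is an assumption (a common Unraid location; yours may differ), and normally you'd do all of this through Settings > Docker in the GUI instead:

```shell
# Stop the Docker service first
# (GUI equivalent: Settings > Docker > Enable Docker: No)
/etc/rc.d/rc.docker stop

# Delete the corrupted image. Unraid creates a fresh, empty one
# automatically when the service is re-enabled with a size set.
# Path is an assumption -- check Settings > Docker for yours.
rm /mnt/user/system/docker/docker.img

# Start the service again so the new image is created
/etc/rc.d/rc.docker start
```

Your containers themselves aren't lost: they can be re-added with their previous settings from Apps > Previous Apps, since the templates live on the flash drive, not in docker.img.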
CorneliousJD (Author) Posted April 9, 2018
I think I'm still having some trouble figuring out how to rebuild the cache pool with the same disk. If I remove it from the pool and re-add it, nothing changes: the array just starts and the same errors continue in the log. It isn't detected as a NEW member of the pool, so no data gets replicated to it.
John_M Posted April 9, 2018
OK. After you unassign it, shut down and physically remove it. Then power up and start the array with the device missing. Shut down again and reinsert it. Finally, power up, reassign it, and start the array to rebuild.
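Under the hood, what Unraid does when you reassign the slot is roughly equivalent to the following btrfs sequence. This is only a sketch for understanding, not something to run alongside the GUI procedure, and the device name is hypothetical:

```shell
# Add the blank device back into the mounted raid1 cache pool
btrfs device add /dev/sdX /mnt/cache

# Rebalance so data and metadata are mirrored across
# both pool members again
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache
```

The key point is that the rebuild is a device add plus a balance; at no step is the re-added disk formatted.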
CorneliousJD (Author) Posted April 9, 2018
OK, I did that. It said the drive needed to be formatted before it could be added, so I formatted it, and now it's added to the pool but the cache is completely empty... Looks like it deleted all my appdata entirely? I have a backup from Sunday at 4 AM, but my VMs are all gone now too. Oof. Is this because it was the main Cache 1 disk? I'm confused why it didn't just replicate from Cache 2, though?
John_M Posted April 9, 2018
You shouldn't have formatted it; that formatted the cache pool. The way BTRFS refers to its components is confusing, because it uses the same device name for the overall pool as for its first component, but rebuilding a RAID array never, ever involves formatting.
CorneliousJD (Author) Posted April 9, 2018
Hmm, it wouldn't do anything and said it was unusable until it was formatted, after I had added it to the pool. Nothing happened until I went ahead and actually made that change to format it. It's really not a big deal, since I had to recreate my docker.img file anyhow, I wasn't actively using any of the VMs, and my appdata backup was only a day old. Was there another way I should have gone about it, though, so I know in case this happens again?
John_M Posted April 9, 2018
I think the confusion was caused by the SSD being reintroduced to the pool while already containing a BTRFS file system. A brand new SSD would have just been accepted without a problem. So the thing to do in future is to wipe it first. There's a simple command to do that: wipefs. Type wipefs --help at the command prompt for more information.
JorgeB Posted April 9, 2018
43 minutes ago, John_M said: I think the confusion was caused by the SSD being reintroduced to the pool while already containing a BTRFS file system.
Very likely. The best way in these cases is to wipe the SSD with blkdiscard so it's completely blank. wipefs usually works too, but you need to wipe the partition and then the device, so since most cache devices are SSDs it's best to use blkdiscard, e.g.: blkdiscard /dev/sdX
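Putting both suggestions together, a sketch of wiping a pool member before re-adding it might look like this. sdX is a placeholder; triple-check the device name first, because both commands destroy the filesystem (signatures and, with blkdiscard, all data) on whatever device you point them at:

```shell
# Identify the device first -- wiping the wrong disk is unrecoverable
lsblk -o NAME,SIZE,SERIAL,MOUNTPOINT

# Option 1 (preferred for SSDs): discard every block
# so the device comes back completely blank
blkdiscard /dev/sdX

# Option 2: erase filesystem signatures with wipefs,
# partition first, then the whole device
wipefs -a /dev/sdX1
wipefs -a /dev/sdX
```

Either way, the disk then looks brand new to Unraid and gets accepted into the pool and rebuilt without any prompt to format.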
CorneliousJD (Author) Posted April 9, 2018
Thanks, Johnnie.black. After the cache rebuilt I don't appear to be seeing any more errors, fingers crossed. Trying to restore my massive Plex appdata folder from CrashPlan now, as I had that excluded from local appdata backups.
CorneliousJD (Author) Posted April 9, 2018
Update - I'm back up and running in full right now, with the exception of 2 utility VMs that I had - Ubuntu and Win 10. I've downloaded their vdisks from CrashPlan and will put them in the right place and try to fire them up.
EDIT: The Win 10 VM is back up normally, but Ubuntu wouldn't boot afterwards. Since that was just something I was testing with, I decided to scrap it and do a fresh install. What a weird situation - thanks everyone for the help! Hopefully this doesn't happen again haha.
This topic is now archived and is closed to further replies.