BTRFS error on cache drive

ipreferpie · January 16, 2018

Crossposted this from the storage subforum --

So I discovered today that my docker containers won't load and thought that I needed to delete/recreate the image and all would be ok. But deleting the image from settings and then on bash didn't work. I then went to take a look at my cache pool and that's when things looked iffy. Here's the cache disk log:

Quote

Jan 16 04:42:17 HK-HomeLab kernel: sd 14:0:2:0: [sdi] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jan 16 04:42:17 HK-HomeLab kernel: sd 14:0:2:0: [sdi] tag#0 Sense Key : 0x2 [current]
Jan 16 04:42:17 HK-HomeLab kernel: sd 14:0:2:0: [sdi] tag#0 ASC=0x4 ASCQ=0x0
Jan 16 04:42:17 HK-HomeLab kernel: sd 14:0:2:0: [sdi] tag#0 CDB: opcode=0x28 28 00 07 bb 73 80 00 00 08 00
Jan 16 04:42:21 HK-HomeLab kernel: sd 14:0:2:0: [sdi] Synchronizing SCSI cache
Jan 16 04:42:21 HK-HomeLab kernel: sd 14:0:2:0: [sdi] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jan 16 04:42:22 HK-HomeLab kernel: BTRFS error (device dm-4): bdev /dev/mapper/sdi1 errs: wr 446, rd 456160, flush 1, corrupt 0, gen 0
Jan 16 04:42:22 HK-HomeLab kernel: BTRFS error (device dm-4): bdev /dev/mapper/sdi1 errs: wr 446, rd 456161, flush 1, corrupt 0, gen 0
Jan 16 04:42:22 HK-HomeLab kernel: BTRFS error (device dm-4): bdev /dev/mapper/sdi1 errs: wr 446, rd 456162, flush 1, corrupt 0, gen 0
Jan 16 04:42:22 HK-HomeLab kernel: BTRFS error (device dm-4): bdev /dev/mapper/sdi1 errs: wr 446, rd 456163, flush 1, corrupt 0, gen 0
Jan 16 04:42:22 HK-HomeLab kernel: BTRFS error (device dm-4): bdev /dev/mapper/sdi1 errs: wr 446, rd 456164, flush 1, corrupt 0, gen 0
Jan 16 04:42:27 HK-HomeLab kernel: BTRFS error (device dm-4): bdev /dev/mapper/sdi1 errs: wr 446, rd 601003, flush 1, corrupt 0, gen 0
Jan 16 04:42:27 HK-HomeLab kernel: BTRFS error (device dm-4): bdev /dev/mapper/sdi1 errs: wr 446, rd 601004, flush 1, corrupt 0, gen 0
Jan 16 04:42:27 HK-HomeLab kernel: BTRFS error (device dm-4): bdev /dev/mapper/sdi1 errs: wr 446, rd 601005, flush 1, corrupt 0, gen 0
Jan 16 04:42:27 HK-HomeLab kernel: BTRFS error (device dm-4): bdev /dev/mapper/sdi1 errs: wr 446, rd 601006, flush 1, corrupt 0, gen 0
Jan 16 04:42:27 HK-HomeLab kernel: BTRFS error (device dm-4): bdev /dev/mapper/sdi1 errs: wr 446, rd 601007, flush 1, corrupt 0, gen 0
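
As a side note, the counters in those log lines can be pulled out with a few lines of Python to see how quickly the read errors climb. This is only a sketch: the regular expression is written against the lines quoted above, and the function name `parse_btrfs_errs` is made up for illustration.

```python
import re

# Matches the per-device error counters as they appear in the quoted log lines.
ERRS_RE = re.compile(
    r"BTRFS error \(device (?P<fs>[^)]+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line):
    """Return the counters from one kernel log line, or None if it does not match."""
    m = ERRS_RE.search(line)
    if m is None:
        return None
    counters = {"bdev": m.group("bdev")}
    for key in ("wr", "rd", "flush", "corrupt", "gen"):
        counters[key] = int(m.group(key))
    return counters


# Last line logged at 04:42:22 and first line logged at 04:42:27, copied from the log.
first = parse_btrfs_errs(
    "Jan 16 04:42:22 HK-HomeLab kernel: BTRFS error (device dm-4): "
    "bdev /dev/mapper/sdi1 errs: wr 446, rd 456164, flush 1, corrupt 0, gen 0"
)
last = parse_btrfs_errs(
    "Jan 16 04:42:27 HK-HomeLab kernel: BTRFS error (device dm-4): "
    "bdev /dev/mapper/sdi1 errs: wr 446, rd 601003, flush 1, corrupt 0, gen 0"
)

# Read errors added between the two timestamps (about five seconds apart).
print(last["rd"] - first["rd"])  # 144839
```

Only the read counter moves in this excerpt; the write, flush, corruption, and generation counters stay constant.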

The strange thing is that I have 18,446,744,073,703,520,256 reads and 18,446,744,073,702,512,640 writes. The SSD temperature is no longer displayed and has been replaced with a "*". But unRAID tells me that the device is in normal operation and active. Something looks seriously wrong, and I seem to have only read-only access to the files located in the cache. Any suggestions on my next step? Many thanks!
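
A note on those two huge numbers: both sit just below 2^64, which is what a small negative value looks like when it is displayed as an unsigned 64-bit integer. Whether that is how unRAID actually arrives at them is an assumption (nothing in this thread confirms it), but the arithmetic itself is easy to check. The helper name `as_signed64` is made up for this sketch.

```python
U64 = 2 ** 64


def as_signed64(value):
    """Reinterpret an unsigned 64-bit value as a two's-complement signed integer."""
    return value - U64 if value >= 2 ** 63 else value


reads = 18_446_744_073_703_520_256
writes = 18_446_744_073_702_512_640

print(as_signed64(reads))   # -6031360
print(as_signed64(writes))  # -7038976
```

Under that assumption the displayed totals would be counters that dropped below zero, which would fit a device that disappeared while statistics were still being collected.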

JorgeB · January 16, 2018

Please post your diagnostics: Tools -> Diagnostics

ipreferpie · January 16, 2018

diag.zip

Is posting the zip file OK?

As a quick note, I tried btrfs scrub, but it was aborted automatically right away. Then I tried btrfs restore on /mnt/cache and got back that it was not a regular file or block device. I tried copying the directory over to my array, but got multiple files listed with I/O errors. Scared to do anything else for now. Mostly worried that I lost my docker container appdata, which will take forever to redo. Some files that were copied via cp -r were OK, while others came out as 0 KB files.

ipreferpie · January 16, 2018

Quick update: this is what I got after typing in btrfs dev stats /mnt/cache. sdi is the problem SSD

Quote

[/dev/mapper/sdg1].write_io_errs 0
[/dev/mapper/sdg1].read_io_errs 0
[/dev/mapper/sdg1].flush_io_errs 0
[/dev/mapper/sdg1].corruption_errs 0
[/dev/mapper/sdg1].generation_errs 0
[/dev/mapper/sdi1].write_io_errs 446
[/dev/mapper/sdi1].read_io_errs 224667550
[/dev/mapper/sdi1].flush_io_errs 1
[/dev/mapper/sdi1].corruption_errs 0
[/dev/mapper/sdi1].generation_errs 0
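
For anyone who wants to check this output automatically, here is a small Python sketch that parses text in the format quoted above and lists the devices with any non-zero counter. The function names are made up for illustration, and the parser assumes exactly the `[device].counter value` layout shown in this post.

```python
import re
from collections import defaultdict

# One line of `btrfs dev stats` output as quoted above: "[device].counter value".
STAT_RE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)$")


def parse_dev_stats(text):
    """Group the counters by device: {device: {counter_name: value}}."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        m = STAT_RE.match(line.strip())
        if m:
            stats[m.group("dev")][m.group("name")] = int(m.group("value"))
    return dict(stats)


def devices_with_errors(stats):
    """Return the devices that have at least one non-zero counter."""
    return sorted(dev for dev, counters in stats.items() if any(counters.values()))


SAMPLE = """\
[/dev/mapper/sdg1].write_io_errs 0
[/dev/mapper/sdg1].read_io_errs 0
[/dev/mapper/sdg1].flush_io_errs 0
[/dev/mapper/sdg1].corruption_errs 0
[/dev/mapper/sdg1].generation_errs 0
[/dev/mapper/sdi1].write_io_errs 446
[/dev/mapper/sdi1].read_io_errs 224667550
[/dev/mapper/sdi1].flush_io_errs 1
[/dev/mapper/sdi1].corruption_errs 0
[/dev/mapper/sdi1].generation_errs 0
"""

stats = parse_dev_stats(SAMPLE)
print(devices_with_errors(stats))  # ['/dev/mapper/sdi1']
```

Here the errors are all I/O errors (write, read, flush), while the corruption and generation counters are zero on both devices.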

JorgeB · January 16, 2018

That's a hardware error, most likely a cable/connection issue, try replacing both cables.

ipreferpie · January 16, 2018

thanks! so basically, power down the server and then check the connection? no need to take any action to copy the remaining cache data to preserve anything, right? and I can likely rule out that the Samsung 850 SSD itself is dead (since the # of errors/reads/writes look so far out of the normal range)? Much appreciated

JorgeB · January 16, 2018

I suggest replacing cables instead of checking connection, unless you don't have spares available, then run a correcting scrub and check that there weren't uncorrected errors.

ipreferpie · January 16, 2018

Done swapping cables and booted up fine! Now I’m running the scrub process and parity check to see how things are. But one problem right now is that the Docker services are still starting. It has already taken 20 min, though I usually have 15 containers on auto start. Should I wait or delete Docker.img and start again?

JorgeB · January 16, 2018

The Docker image gets corrupted easily, and since it's easy to recreate, it's probably best to do that.

ipreferpie · January 16, 2018

Thanks so much johnnie.black! Rebuilt Docker.img, running scrub and parity check now. Looks like things are getting back on track and really appreciate your timely help!

ipreferpie · January 19, 2018

uhoh, looks like the problem is back and worse than before. So as a background, I swapped the SAS cable attached to my IBM M1015 controller (in JBOD non-IT mode) and it seemed to be ok. But after around 24hrs, the same thing happened. What I also noticed was that the Samsung 850 EVO was running on high temp (46-50 degrees) in and out.

My mistake was thinking that rebooting the server would be fine. Upon rebooting, I didn't check to see if the Samsung 850 was detected (it wasn't and Cache 2 was 'unassigned'). I started the array without knowing and only found out an hour later, by which time the cache pool had been rebalancing. I did a safe shutdown and switched my Samsung 850 to be attached directly via SATA to my MB controller instead. After a 2nd boot, the drive showed up and I tried adding it back to the cache. But there was a warning saying that the drive will be wiped.

So what I did was I started the array again, and mounted the Samsung 850 via Unassigned Devices plugin. I then checked to see what files were in the Cache 2 drive -- appdata, system, downloads, etc. When I tried to copy over the files from the Cache2 to the array, I noticed all my shares have disappeared. Also, Cache 1 has a file system error on the 'Main' tab. I then checked to see the individual array disks via Midnight Commander, and the folders/files fortunately are there. So now, I'm facing several problems to summarize:

1) Constantly failing Cache 2 drive (Samsung 850 EVO 1TB)

-did BTRFS scrub and the disk was fine after the reboots

-connected originally via IBM M1015 JBOD SAS then moved to SATA MB controller

-in drive pool with slow Crucial M4 SSD (256GB) - maybe the speed difference in the pooled disk caused issues?

-perhaps I fixed the drop outs after swapping the SSD controller?

-or were the drops caused by high SSD temps?

2) Accidentally started array with unassigned Cache 2 and had partial rebalance

-in the unRAID main tab, it still shows that the cache pool is 1.3TB in size, i.e. the total of the 2 devices, although the Samsung was dropped

-how do I safely add it back into the pool without wiping the data?

3) Shares all disappeared

-/mnt/user is gone, but the individual disks are still there

I guess some questions are whether these issues are all linked up? And is there a way to quickly add back in my dropped Cache 2 SSD back in the pool safely? Many thanks again for the help!

JorgeB · January 19, 2018

51 minutes ago, ipreferpie said:

What I also noticed was that the Samsung 850 EVO was running on high temp (46-50 degrees) in and out.

Perfectly normal for an SSD when there's heavy write activity, and well within thresholds.

For the rest please post your diagnostics.

ipreferpie · January 19, 2018

Here you go! Thanks for helping me out on this again

hk-homelab-diagnostics-20180119-1804.zip

JorgeB · January 19, 2018

Pool is kind of a mess, since the unassigned 850 is still part of the pool, but there are multiple profiles. Leave them as they are for now and try rebalancing the pool with:

btrfs balance start -dconvert=single -mconvert=raid1 /mnt/cache

When done, post the output of:

btrfs fi usage /mnt/cache

ipreferpie · January 19, 2018

Would you suggest backing up my Cache 2 files before placing the 850 back in the cache pool lest everything gets erased? Basically I’m thinking of taking these steps in order:

1) back up Cache 2 files into array

2) assign Samsung 850 back into Cache pool

3) start array

4) rebalance with your instructions

5) copy backed up files from array back into the combined pool for the missing files

Does that sound about right?

JorgeB · January 19, 2018

The files you're accessing on cache2 are the files on the pool, at the moment the pool is still using both devices despite the 850 EVO being unassigned.

Still a good idea to backup first but to run the balance don't assign it to the pool, run as is.

ipreferpie · January 19, 2018

Ah, glad I asked to clarify. Sorry, I don’t mean to be repetitive but just want to play it safe... To reiterate: 1) leave my current array running with the Cache 2 slot unassigned and the 850 unmounted, 2) then start the rebalancing process?

JorgeB · January 19, 2018

Yes, backup first and post the btrfs usage output when done.

ipreferpie · January 19, 2018

Got it! Thanks

ipreferpie · January 19, 2018

hmmm...something strange happened. I was 90% finished backing up, and went to check up the drive info on the Main tab. For some reason this happened:

1) The total cache size went from 1.3TB down to 33.7GB

2) I then rebooted and restarted the array

3) Now, it says that Cache 1 (Crucial M4 256GB) filesystem is unmountable - no filesystem (was BTRFS encrypted) and the unmountable disk present/format option is available on the bottom

4) but all the /mnt/user shares are back up again

I've attached the new diagnostic info. Getting a bit worried...

hk-homelab-diagnostics-20180119-2127.zip

JorgeB · January 19, 2018

25 minutes ago, ipreferpie said:

1) The total cache size went from 1.3TB down to 33.7GB

That's not really surprising given the situation you were in, and you can ignore it. That's why I said to leave it as it was; you should've finished the backup.

Try assigning just the 850 EVO as cache; if it's still unmountable, your best bet is following this guide:

https://lime-technology.com/forums/topic/46802-faq-for-unraid-v6/?do=findComment&comment=543490

ipreferpie · January 19, 2018

Unfortunately, I can't mount the 850 EVO since it's seen as a new disk and will be formatted upon the start of the array. I tried doing "mount -o recovery,ro /dev/sdX1 /x", but unknown filesystem type "crypto_LUKS" since it's encrypted BTRFS. I already have the key entered into the unRAID page and tried cryptsetup luksOpen beforehand, but that doesn't seem to work unfortunately.

JorgeB · January 19, 2018

1 minute ago, ipreferpie said:

Unfortunately, I can't mount the 850 EVO since it's seen as a new disk and will be formatted upon the start of the array.

Try anyway as it won't be formatted unless you say so, at most it will be also unmountable, you can also try using both disks again.

2 minutes ago, ipreferpie said:

but unknown filesystem type "crypto_LUKS" since it's encrypted BTRFS.

This might be a problem, I don't use encryption and those instructions were made before it was even available, I may look into what's needed to work with encrypted drives when I have the time.
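
For reference, the "unknown filesystem type" message is what mount reports when it is pointed at the raw LUKS partition instead of the unlocked mapping. The usual order is to open the container first and then mount the resulting /dev/mapper node. The Python sketch below only builds and prints the two commands (nothing is executed). The mapper name `cache2` and the function name are hypothetical, `/dev/sdX1` and `/x` are the placeholders from the post above, and nothing here has been confirmed to work on this particular damaged two-device pool, where the second member would presumably need unlocking as well.

```python
def recovery_mount_commands(partition, mapper_name, mount_point):
    """Return the commands as argv lists; this function does not run anything."""
    mapped = f"/dev/mapper/{mapper_name}"
    return [
        # Unlock the LUKS container; cryptsetup asks for the passphrase itself.
        ["cryptsetup", "luksOpen", partition, mapper_name],
        # Mount the unlocked mapping read-only, with the options used in the post above.
        ["mount", "-o", "recovery,ro", mapped, mount_point],
    ]


# Placeholders only: substitute the real partition, a mapping name, and a mount point.
for argv in recovery_mount_commands("/dev/sdX1", "cache2", "/x"):
    print(" ".join(argv))
```

The point of the sketch is only the ordering: the mount target is the mapping created by the first command, not the partition itself.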

ipreferpie · January 19, 2018

Whoa, interesting: after adding the 850 EVO back into Cache 2, keeping the Crucial M4 in Cache 1, and then starting the array, I get the 1.3TB back again. I also see the following:

1) Cache 1 (Crucial M4): normal operation, device unlocked

2) Cache 2 (850 EVO): new device, device locked due to unknown error

3) the "cache" share is available again when browsing from Midnight Commander

Should I do a BTRFS scrub and/or run BTRFS balance?

JorgeB · January 19, 2018

Post the output of:

btrfs fi usage /mnt/cache
